Update submodule Tokenizer, adapt README.md
README.md:
- add "Prerequisites" section
- remove Icelandic text to minimize redundancies
- update links
- use correct form of cmdline text formatting
Submodule Tokenizer:
- use correct version of the Tokenizer submodule
Signed-off-by: Daniel Schnell <dschnell@grammatek.com>
- .gitattributes +1 -0
- README.md +13 -26
- Tokenizer +1 -1
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+diaparser-is-combined-v211/diaparser.model filter=lfs diff=lfs merge=lfs -text
@@ -1,27 +1,10 @@
-
-
-Mælt er með því að þáttarinn sé keyrður með Python3.8 í sýndarumhverfi.
-Hægt er að nota conda fyrir sýndarumhverfi: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
-Einnig er hægt að nota venv fyrir sýndarumhverfi: https://docs.python.org/3/library/venv.html
-
-Til þess að keyra þáttarann þarf að setja upp nauðsynlega pakka, eftir að sýndarumhverfi hefur verið virkjað: python3 -m pip install -r requirements.txt
-Tokenizer-mappan er klónuð gagnahirsla [tókarans frá Miðeind](https://github.com/mideind/Tokenizer).
-
-Hægt er að keyra þáttarann svona: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
-~~~transformer_models/~~~ inniheldur forþjálfað transformer-líkan, electra-base-igc-is, sem tókarinn sækir samhengisháðar orðgreypingar og athygli í. Það var þjálfað af Jóni Friðriki Daðasyni.
-
-
-
-Metric | Precision | Recall | F1 Score | AligndAcc
-
-Tokens | 99.70 | 99.77 | 99.73 |
-Sentences | 100.00 | 100.00 | 100.00 |
-Words | 99.62 | 99.61 | 99.61 |
-UAS | 89.58 | 89.57 | 89.58 | 89.92
-LAS | 86.46 | 86.45 | 86.46 | 86.79
-CLAS | 82.30 | 81.81 | 82.05 | 82.24
-
-
+## Prerequisites
+
+After cloning the repository, first fetch submodule dependencies and run:
+
+
+git submodule update --init --recursive
+
 
 ## A Universal Dependency parser built on top of a Transformer language model
 
@@ -32,11 +15,15 @@ You can also use venv for a virtual environment: https://docs.python.org/3/libra
 
 To run this package, after having activated your virtual environment, you need to install the requirements: python3 -m pip install -r requirements.txt.
 
-The Tokenizer
-
-
-
-
+The Tokenizer submodule is using [Miðeind's tokenizer](https://github.com/icelandic-lt/Tokenizer). It is included because one of Diaparser's modules is named tokenizer.
+
+The parser can be run as follows:
+
+
+python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt
+
+
+The directory `transformer_models/` contains a pretrained model, [electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is), which supplies the parser with contextual embeddings and attention, trained by Jón Friðrik Daðason.
 
 The parser scores as follows:
 
@@ -49,5 +36,5 @@ UAS | 89.58 | 89.57 | 89.58 | 89.92
 LAS | 86.46 | 86.45 | 86.46 | 86.79
 CLAS | 82.30 | 81.81 | 82.05 | 82.24
 
-
+## License
 
@@ -1 +1 @@
-Subproject commit
+Subproject commit 5ae4551ad3a3a99ad657bd0528dd4147f4f5f95f
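
The setup and run steps that the updated README describes (fetch the submodule, install the requirements, run the parser) can be sketched as a small Python helper. This is only an illustration: `build_commands`, `run_all`, and the `dry_run` flag are hypothetical names that are not part of the repository, and the sketch assumes it is run from the repository root with the virtual environment already activated. The command lines themselves are copied from the README text in the diff above.

```python
import subprocess
from pathlib import Path


def build_commands(model="diaparser-is-combined-v211/diaparser.model",
                   infile="test_file.txt",
                   python="python3"):
    """Return the README's steps as argument lists, in the order they are run.

    Hypothetical helper: the defaults mirror the paths quoted in the README.
    """
    return [
        # Fetch the Tokenizer submodule (and any nested submodules).
        ["git", "submodule", "update", "--init", "--recursive"],
        # Install the requirements into the active virtual environment.
        [python, "-m", "pip", "install", "-r", "requirements.txt"],
        # Parse an input file with the bundled model.
        [python, "parse_file.py", "--parser", model, "--infile", infile],
    ]


def run_all(repo_dir=".", dry_run=True):
    """Print each step; execute it inside repo_dir only when dry_run is False."""
    for cmd in build_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, cwd=Path(repo_dir), check=True)


if __name__ == "__main__":
    run_all(".", dry_run=True)
```

With `dry_run=True` the helper only prints the three commands, which makes it safe to try outside a checkout; passing `dry_run=False` executes them in order and aborts on the first error.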