Commit e81ff46 by danielschnell
1 Parent(s): 695687f
Update submodule Tokenizer, adapt README.md
README.md:
- add "Prerequisites" section
- remove Icelandic text to minimize redundancies
- update links
- use correct form of cmdline text formatting
Submodule Tokenizer:
- use correct version of the Tokenizer submodule
Signed-off-by: Daniel Schnell <dschnell@grammatek.com>
- .gitattributes +1 -0
- README.md +13 -26
- Tokenizer +1 -1
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+diaparser-is-combined-v211/diaparser.model filter=lfs diff=lfs merge=lfs -text
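As a rough illustration of what the added rule does, the sketch below approximates how the patterns in this hunk select files for Git LFS. It is a simplification under stated assumptions: real gitattributes matching follows gitignore-style rules that `fnmatch` does not fully reproduce, and the helper name `is_lfs_tracked` is hypothetical. Running `git check-attr filter -- <path>` in the repository remains the authoritative check.

```python
from fnmatch import fnmatchcase
from posixpath import basename

# Patterns visible in this hunk (including the hunk-header context line).
LFS_PATTERNS = [
    "saved_model/**/*",
    "*.zip",
    "*.zst",
    "*tfevents*",
    "diaparser-is-combined-v211/diaparser.model",
]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate gitattributes matching (simplified, not git's exact rules)."""
    for pattern in patterns:
        # A pattern without a slash is compared against the file name only;
        # a pattern with a slash is compared against the whole relative path.
        target = path if "/" in pattern else basename(path)
        if fnmatchcase(target, pattern):
            return True
    return False


if __name__ == "__main__":
    for candidate in ("diaparser-is-combined-v211/diaparser.model", "README.md"):
        print(candidate, is_lfs_tracked(candidate))
```

Under this approximation, the model file is only picked up because of the newly added line; none of the earlier patterns in the hunk match it.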
README.md
CHANGED
@@ -1,27 +1,10 @@
-##
-
-Mælt er með því að þáttarinn sé keyrður með Python3.8 í sýndarumhverfi.
-Hægt er að nota conda fyrir sýndarumhverfi: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
-Einnig er hægt að nota venv fyrir sýndarumhverfi: https://docs.python.org/3/library/venv.html
-
-Til þess að keyra þáttarann þarf að setja upp nauðsynlega pakka, eftir að sýndarumhverfi hefur verið virkjað: python3 -m pip install -r requirements.txt
-Tokenizer-mappan er klónuð gagnahirsla [tókarans frá Miðeind](https://github.com/mideind/Tokenizer).
-
-Hægt er að keyra þáttarann svona: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
-~~~transformer_models/~~~ inniheldur forþjálfað transformer-líkan, electra-base-igc-is, sem tókarinn sækir samhengisháðar orðgreypingar og athygli í. Það var þjálfað af Jóni Friðriki Daðasyni.
-
-Skor:
-
-Metric | Precision | Recall | F1 Score | AligndAcc
------------+-----------+-----------+-----------+-----------
-Tokens | 99.70 | 99.77 | 99.73 |
-Sentences | 100.00 | 100.00 | 100.00 |
-Words | 99.62 | 99.61 | 99.61 |
-UAS | 89.58 | 89.57 | 89.58 | 89.92
-LAS | 86.46 | 86.45 | 86.46 | 86.79
-CLAS | 82.30 | 81.81 | 82.05 | 82.24
+## Prerequisites
 
+After cloning the repository, first fetch submodule dependencies and run:
 
+```bash
+git submodule update --init --recursive
+```
 
 ## A Universal Dependency parser built on top of a Transformer language model
 
@@ -32,11 +15,15 @@ You can also use venv for a virtual environment: https://docs.python.org/3/libra
 
 To run this package, after having activated your virtual environment, you need to install the requirements: python3 -m pip install -r requirements.txt.
 
-The Tokenizer
+The Tokenizer submodule is using [Miðeind's tokenizer](https://github.com/icelandic-lt/Tokenizer). It is included because one of Diaparser's modules is named tokenizer.
+
+The parser can be run as follows:
 
-
+```python
+python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt
+```
 
-
+The directory `transformer_models/` contains a pretrained model, [electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is), which supplies the parser with contextual embeddings and attention, trained by Jón Friðrik Daðason.
 
 The parser scores as follows:
 
@@ -49,5 +36,5 @@ UAS | 89.58 | 89.57 | 89.58 | 89.92
 LAS | 86.46 | 86.45 | 86.46 | 86.79
 CLAS | 82.30 | 81.81 | 82.05 | 82.24
 
-
+## License
 https://opensource.org/licenses/Apache-2.0
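The F1 column of the score table in the README can be sanity-checked from its precision and recall columns, since F1 is their harmonic mean. The sketch below recomputes it for the rows that appear in this diff. The helper name `f1_score` is an illustrative assumption, and because the published precision and recall are already rounded to two decimals, the recomputed value can differ from the reported one by about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the README score table.
ROWS = {
    "Tokens": (99.70, 99.77, 99.73),
    "Words": (99.62, 99.61, 99.61),
    "UAS": (89.58, 89.57, 89.58),
    "LAS": (86.46, 86.45, 86.46),
    "CLAS": (82.30, 81.81, 82.05),
}

if __name__ == "__main__":
    for name, (p, r, reported) in ROWS.items():
        recomputed = f1_score(p, r)
        print(f"{name:6s} reported={reported:.2f} recomputed={recomputed:.3f}")
```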
Tokenizer
CHANGED
@@ -1 +1 @@
-Subproject commit
+Subproject commit 5ae4551ad3a3a99ad657bd0528dd4147f4f5f95f
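The pinned submodule commit only reaches a working copy once the submodule has been fetched. As a minimal sketch, the steps from the updated README (fetch submodules, install requirements, run the parser) could be scripted as below. The function names and the `dry_run` flag are illustrative assumptions; the commands themselves are the ones quoted in the README, and the sketch assumes the virtual environment is already activated.

```python
import subprocess
from typing import List

DEFAULT_MODEL = "diaparser-is-combined-v211/diaparser.model"


def setup_commands() -> List[List[str]]:
    """Commands from the README's Prerequisites and install steps."""
    return [
        ["git", "submodule", "update", "--init", "--recursive"],
        ["python3", "-m", "pip", "install", "-r", "requirements.txt"],
    ]


def parse_command(infile: str, model: str = DEFAULT_MODEL) -> List[str]:
    """Command line for parse_file.py as shown in the README."""
    return ["python3", "parse_file.py", "--parser", model, "--infile", infile]


def run_all(infile: str, dry_run: bool = True) -> List[List[str]]:
    """Run setup, then the parser; with dry_run=True only return the commands."""
    commands = setup_commands() + [parse_command(infile)]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_all("test_file.txt")` returns the three command lines without executing anything; passing `dry_run=False` from the repository root would execute them in order.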