danielschnell committed
Commit e81ff46 · 1 Parent(s): 695687f

Update submodule Tokenizer, adapt README.md

README.md:

- add "Prerequisites" section
- remove Icelandic text to minimize redundancies
- update links
- use correct form of cmdline text formatting


Submodule Tokenizer:

- use correct version of the Tokenizer submodule

Signed-off-by: Daniel Schnell <dschnell@grammatek.com>

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +13 -26
  3. Tokenizer +1 -1
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ diaparser-is-combined-v211/diaparser.model filter=lfs diff=lfs merge=lfs -text
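
The added line puts `diaparser-is-combined-v211/diaparser.model` under Git LFS with the same attributes as the existing entries. As a rough illustration of how such entries can be read, here is a minimal sketch that lists the patterns carrying `filter=lfs`. The function name `lfs_patterns` is hypothetical, and the parsing is simplified: it splits on whitespace and does not handle quoted patterns that contain spaces.

```python
def lfs_patterns(gitattributes_text):
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for raw_line in gitattributes_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # First field is the pattern, the rest are attributes.
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


sample = """\
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
diaparser-is-combined-v211/diaparser.model filter=lfs diff=lfs merge=lfs -text
"""

print(lfs_patterns(sample))
```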
README.md CHANGED
@@ -1,27 +1,10 @@
- ## UD-þáttari sem nýtir sér upplýsingar úr Transformer-mállíkani
-
- Mælt er með því að þáttarinn sé keyrður með Python3.8 í sýndarumhverfi.
- Hægt er að nota conda fyrir sýndarumhverfi: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
- Einnig er hægt að nota venv fyrir sýndarumhverfi: https://docs.python.org/3/library/venv.html
-
- Til þess að keyra þáttarann þarf að setja upp nauðsynlega pakka, eftir að sýndarumhverfi hefur verið virkjað: python3 -m pip install -r requirements.txt
- Tokenizer-mappan er klónuð gagnahirsla [tókarans frá Miðeind](https://github.com/mideind/Tokenizer).
-
- Hægt er að keyra þáttarann svona: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
- ~~~transformer_models/~~~ inniheldur forþjálfað transformer-líkan, electra-base-igc-is, sem tókarinn sækir samhengisháðar orðgreypingar og athygli í. Það var þjálfað af Jóni Friðriki Daðasyni.
-
- Skor:
-
- Metric | Precision | Recall | F1 Score | AligndAcc
- -----------+-----------+-----------+-----------+-----------
- Tokens | 99.70 | 99.77 | 99.73 |
- Sentences | 100.00 | 100.00 | 100.00 |
- Words | 99.62 | 99.61 | 99.61 |
- UAS | 89.58 | 89.57 | 89.58 | 89.92
- LAS | 86.46 | 86.45 | 86.46 | 86.79
- CLAS | 82.30 | 81.81 | 82.05 | 82.24
+ ## Prerequisites

+ After cloning the repository, first fetch submodule dependencies and run:

+ ```bash
+ git submodule update --init --recursive
+ ```

  ## A Universal Dependency parser built on top of a Transformer language model

@@ -32,11 +15,15 @@ You can also use venv for a virtual environment: https://docs.python.org/3/libra

  To run this package, after having activated your virtual environment, you need to install the requirements: python3 -m pip install -r requirements.txt.

- The Tokenizer directory is a clone of [Miðeind's tokenizer](https://github.com/mideind/Tokenizer). It is included because one of Diaparser's modules is named tokenizer.
+ The Tokenizer submodule is using [Miðeind's tokenizer](https://github.com/icelandic-lt/Tokenizer). It is included because one of Diaparser's modules is named tokenizer.
+
+ The parser can be run as follows:

- The parser can be run as follows: ~~~python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt~~~
+ ```python
+ python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt
+ ```

- ~~~transformer_models/~~~ contains a pretrained model, electra-base-igc-is, which supplies the parser with contextual embeddings and attention, trained by Jón Friðrik Daðason.
+ The directory `transformer_models/` contains a pretrained model, [electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is), which supplies the parser with contextual embeddings and attention, trained by Jón Friðrik Daðason.

  The parser scores as follows:

@@ -49,5 +36,5 @@ UAS | 89.58 | 89.57 | 89.58 | 89.92
  LAS | 86.46 | 86.45 | 86.46 | 86.79
  CLAS | 82.30 | 81.81 | 82.05 | 82.24

- ### License
+ ## License
  https://opensource.org/licenses/Apache-2.0
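
Taken together, the updated README describes three steps in order: fetch the submodule, install the requirements, and run the parser. A minimal sketch of that sequence as argument lists follows. The helper name `build_commands` is hypothetical, and `sys.executable` stands in for the README's `python3` on the assumption that the script runs inside the activated virtual environment.

```python
import sys


def build_commands(parser_model, infile):
    """Return the README's setup and run steps as argument lists, in order."""
    return [
        # Fetch the Tokenizer submodule after cloning.
        ["git", "submodule", "update", "--init", "--recursive"],
        # Install the package requirements into the active environment.
        [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
        # Run the parser on an input file.
        [sys.executable, "parse_file.py", "--parser", parser_model, "--infile", infile],
    ]


if __name__ == "__main__":
    commands = build_commands(
        "diaparser-is-combined-v211/diaparser.model", "test_file.txt"
    )
    for command in commands:
        print(" ".join(command))
```

To execute the steps instead of printing them, each list could be passed to `subprocess.run(command, check=True)` from the repository root.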
Tokenizer CHANGED
@@ -1 +1 @@
- Subproject commit be8ee4de465ecf0dbf008d986b99df43210f27bf
+ Subproject commit 5ae4551ad3a3a99ad657bd0528dd4147f4f5f95f
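
The Tokenizer hunk changes only the commit that the superproject records for the submodule, not any file contents. A minimal sketch of pulling the old and new commit IDs out of such a hunk, assuming the diff text is available as a string; the function name `submodule_pointer_change` is hypothetical.

```python
import re

# Matches "-Subproject commit <sha>" or "+ Subproject commit <sha>" lines.
SUBPROJECT_RE = re.compile(r"^([-+])\s*Subproject commit ([0-9a-f]{40})\s*$")


def submodule_pointer_change(diff_text):
    """Return (old_sha, new_sha) from a submodule pointer diff hunk."""
    old_sha = new_sha = None
    for line in diff_text.splitlines():
        match = SUBPROJECT_RE.match(line)
        if match is None:
            continue
        if match.group(1) == "-":
            old_sha = match.group(2)
        else:
            new_sha = match.group(2)
    return old_sha, new_sha


hunk = """\
@@ -1 +1 @@
- Subproject commit be8ee4de465ecf0dbf008d986b99df43210f27bf
+ Subproject commit 5ae4551ad3a3a99ad657bd0528dd4147f4f5f95f
"""

old_sha, new_sha = submodule_pointer_change(hunk)
print(old_sha, "->", new_sha)
```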