## Prerequisites
After cloning the repository, fetch the submodule dependencies by running:
```bash
git submodule update --init --recursive
```
## A Universal Dependency parser built on top of a Transformer language model
Python 3.8 is recommended, along with a virtual environment.
You can use conda for a virtual environment: https://conda.io/projects/conda/en/latest/user-guide/getting-started.html
You can also use venv for a virtual environment: https://docs.python.org/3/library/venv.html
To run this package, activate your virtual environment and install the requirements with `python3 -m pip install -r requirements.txt`.
The Tokenizer submodule uses [Miðeind's tokenizer](https://github.com/icelandic-lt/Tokenizer). It is included as a submodule because one of DiaParser's modules is also named `tokenizer`.
The parser can be run as follows:
```bash
python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt
```
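To parse several files in one go, the command above can be wrapped in a small script. The following is a minimal sketch, not part of this repository: the helper names `build_parse_command` and `parse_all` are hypothetical, and it assumes the script is started from the repository root with the virtual environment active. It uses only the `--parser` and `--infile` options shown above.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# Model path copied from the command above.
PARSER_MODEL = "diaparser-is-combined-v211/diaparser.model"


def build_parse_command(infile: Path, parser: str = PARSER_MODEL) -> List[str]:
    """Build the argument list for one parse_file.py call (hypothetical helper)."""
    # sys.executable is the interpreter of the currently active environment.
    return [sys.executable, "parse_file.py", "--parser", parser, "--infile", str(infile)]


def parse_all(input_dir: Path) -> None:
    """Run the parser once per .txt file in input_dir, stopping at the first failure."""
    for infile in sorted(input_dir.glob("*.txt")):
        # check=True raises CalledProcessError if parse_file.py exits with an error.
        subprocess.run(build_parse_command(infile), check=True)
```

Calling `parse_all(Path("texts"))` would then run the parser on every `.txt` file in a (hypothetical) `texts/` directory, one process per file. Each call reloads the model, so this favours simplicity over speed.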
The directory `transformer_models/` contains a pretrained model, [electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is), trained by Jón Friðrik Daðason. It supplies the parser with contextual embeddings and attention.
The parser scores as follows:
```
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.70 |     99.77 |     99.73 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.62 |     99.61 |     99.61 |
UAS        |     89.58 |     89.57 |     89.58 |     89.92
LAS        |     86.46 |     86.45 |     86.46 |     86.79
CLAS       |     82.30 |     81.81 |     82.05 |     82.24
```
```
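For readers unfamiliar with the attachment metrics: UAS counts a word as correct when its predicted head matches the gold head, and LAS additionally requires the dependency relation to match. CLAS applies the labeled criterion to content words only. The sketch below illustrates UAS and LAS on a made-up sentence. It is a simplification that assumes the predicted and gold words line up one to one, which is presumably not always the case in the table above, where precision and recall differ slightly. The function name `attachment_scores` and the example arcs are hypothetical and not taken from this repository or its test data.

```python
from typing import List, Tuple

# Each word is a (head_index, dependency_relation) pair; head 0 marks the root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two equally long arc lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: only the head index has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head index and relation label both have to match.
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labeled_hits / total


# Hypothetical four-word sentence: word 3 has a wrong head, word 4 a wrong label.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
predicted = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "obl")]
uas, las = attachment_scores(gold, predicted)
print(f"UAS {uas:.2f}  LAS {las:.2f}")  # prints "UAS 75.00  LAS 50.00"
```

Three of the four heads are correct, giving a UAS of 75, but only two words have both the correct head and the correct label, giving a LAS of 50.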
## License
This project is licensed under the Apache License 2.0: https://opensource.org/licenses/Apache-2.0