## Prerequisites

After cloning the repository, fetch the submodule dependencies:

```bash
git submodule update --init --recursive
```

## A Universal Dependency parser built on top of a Transformer language model

Python 3.8 is recommended, along with a virtual environment.

You can create the virtual environment with [conda](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html) or with [venv](https://docs.python.org/3/library/venv.html).

To run this package, activate your virtual environment and install the requirements with `python3 -m pip install -r requirements.txt`.
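As a quick sanity check before installing, the short sketch below reports whether the running interpreter meets the recommended version and whether a virtual environment appears to be active. The helper names are hypothetical and not part of this repository, and the `CONDA_PREFIX` check assumes conda's usual behaviour of setting that variable on activation.

```python
import os
import sys


def python_version_ok(version_info=sys.version_info, minimum=(3, 8)):
    """Return True if the interpreter is at least the recommended version."""
    return tuple(version_info[:2]) >= minimum


def in_virtual_env(prefix=sys.prefix, base_prefix=sys.base_prefix, environ=os.environ):
    """Heuristic: venv changes sys.prefix; conda sets CONDA_PREFIX (assumed)."""
    return prefix != base_prefix or "CONDA_PREFIX" in environ


if __name__ == "__main__":
    print("Python >= 3.8:", python_version_ok())
    print("Virtual environment active:", in_virtual_env())
```

Both checks are advisory only: Python 3.8 is a recommendation, and the environment test is a heuristic rather than a guarantee.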

The Tokenizer submodule is [Miðeind's tokenizer](https://github.com/icelandic-lt/Tokenizer). It is included as a submodule because one of DiaParser's modules is also named `tokenizer`.

The parser can be run as follows:

```bash
python3 parse_file.py --parser diaparser-is-combined-v211/diaparser.model --infile test_file.txt
```
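To parse several files in one go, the same command can be assembled and launched from Python. The sketch below is a minimal example: `build_parse_command` and `parse_files` are hypothetical helpers, not part of this repository, and only the `--parser` and `--infile` flags from the command above are used. Where `parse_file.py` writes its output is not covered here.

```python
import subprocess

# Model path taken from the command shown above.
DEFAULT_MODEL = "diaparser-is-combined-v211/diaparser.model"


def build_parse_command(infile, model=DEFAULT_MODEL):
    """Assemble the parse_file.py invocation for one input file."""
    return ["python3", "parse_file.py", "--parser", model, "--infile", str(infile)]


def parse_files(infiles, model=DEFAULT_MODEL):
    """Run the parser on each file in turn; raises if a run fails."""
    for infile in infiles:
        subprocess.run(build_parse_command(infile, model), check=True)


if __name__ == "__main__":
    # Show the command that would be run for the example input file.
    print(" ".join(build_parse_command("test_file.txt")))
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.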

The directory `transformer_models/` contains a pretrained model, [electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is), trained by Jón Friðrik Daðason. It supplies the parser with contextual embeddings and attention.

The parser scores as follows:

```
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.70 |     99.77 |     99.73 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.62 |     99.61 |     99.61 |
UAS        |     89.58 |     89.57 |     89.58 |     89.92
LAS        |     86.46 |     86.45 |     86.46 |     86.79
CLAS       |     82.30 |     81.81 |     82.05 |     82.24
```
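The F1 column is consistent with the usual harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`. The sketch below recomputes it for the CLAS row. It assumes the standard definition, and because the table values are already rounded, a recomputed figure for other rows can differ from the table in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CLAS row from the table above: precision 82.30, recall 81.81.
clas_f1 = f1_score(82.30, 81.81)
print(f"{clas_f1:.2f}")  # prints 82.05
```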

## License
This project is licensed under the [Apache License 2.0](https://opensource.org/licenses/Apache-2.0).