Readability ES Sentences for three classes
Model based on the RoBERTa architecture, fine-tuned from BERTIN for readability assessment of Spanish texts.
Description and performance
This version of the model was trained on a mix of datasets, using sentence-level granularity when possible. The model performs classification among three complexity levels:
- Basic.
- Intermediate.
- Advanced.
The relationship of these categories with the Common European Framework of Reference for Languages is described in our report.
This model achieves a macro-averaged F1 score of 0.6951, measured on the validation set.
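For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so each of the three levels counts equally regardless of how many examples it has. The sketch below computes it in plain Python on made-up labels; the lowercase label strings and the toy data are assumptions for illustration and are not taken from the model's validation set.

```python
# Label names are illustrative; the model's actual label strings may differ.
LABELS = ["basic", "intermediate", "advanced"]


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (not real model output).
y_true = ["basic", "basic", "intermediate", "intermediate", "advanced", "advanced"]
y_pred = ["basic", "intermediate", "intermediate", "intermediate", "advanced", "basic"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.6556
```

Because every class contributes equally, a weak score on one level pulls the macro average down even if that level is rare in the data.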
Model variants
- readability-es-sentences. Two classes, sentence-based dataset.
- readability-es-paragraphs. Two classes, paragraph-based dataset.
- readability-es-3class-sentences (this model). Three classes, sentence-based dataset.
- readability-es-3class-paragraphs. Three classes, paragraph-based dataset.
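A variant can be loaded with the `text-classification` pipeline from the `transformers` library. The following is a minimal sketch, not an official usage recipe: the `hackathon-pln-es` Hub namespace in `MODEL_ID` is an assumption (this card does not state it), and `top_label` and `classify` are hypothetical helper names.

```python
# Assumed Hub id: the "hackathon-pln-es" namespace is not stated in this card.
MODEL_ID = "hackathon-pln-es/readability-es-3class-sentences"


def top_label(scores):
    """Pick the highest-scoring entry from a list of {"label", "score"} dicts."""
    return max(scores, key=lambda s: s["score"])["label"]


def classify(sentences, model_id=MODEL_ID):
    """Return one predicted label per sentence (downloads the model on first use)."""
    # Imported lazily so top_label() works without transformers installed.
    from transformers import pipeline

    # top_k=None asks the pipeline for the scores of every class.
    clf = pipeline("text-classification", model=model_id, top_k=None)
    return [top_label(scores) for scores in clf(list(sentences))]
```

Calling `classify(["El gato duerme en el sofá."])` should return a one-element list. The exact label strings depend on the model's configuration, which this card does not list, so it may be worth inspecting the raw pipeline output before mapping labels to the three levels.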
Datasets
- readability-es-hackathon-pln-public, composed of:
  - coh-metrix-esp corpus.
  - Various text resources scraped from websites.
- Other non-public datasets: newsela-es, simplext.
Training details
Please refer to this training run for full details on the hyperparameters and training regime.
Biases and Limitations
- Due to the scarcity of data and the lack of a reliable gold test set, performance metrics are reported on the validation set.
- One of the datasets involved is the Spanish version of newsela, which is frequently used as a reference. However, it was created by translating previous datasets, and therefore it may contain somewhat unnatural phrases.
- Some of the datasets used cannot be publicly disseminated, making it more difficult to assess the existence of biases or mistakes.
- The language might be biased towards the Spanish dialect spoken in Spain. Other regional variants might be underrepresented.
- No effort has been made to alleviate the shortcomings and biases described in the original implementation of BERTIN.
Authors