|
--- |
|
language: en |
|
license: mit |
|
datasets: |
|
- arxmliv |
|
- math-stackexchange |
|
--- |
|
|
|
# MathBERTa model |
|
|
|
Pretrained model on English language and LaTeX using a masked language modeling
(MLM) objective. It was introduced in [this paper][1] and first released in
[this repository][2]. This model is case-sensitive: it makes a difference
between english and English.
|
|
|
[1]: http://ceur-ws.org/Vol-3180/paper-06.pdf |
|
[2]: https://github.com/witiko/scm-at-arqmath3 |
|
|
|
## Model description |
|
|
|
MathBERTa is [the RoBERTa base transformer model][3] whose [tokenizer has been
extended with LaTeX math symbols][7] and which has been [fine-tuned on a large
corpus of English mathematical texts][8].
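
To see the effect of the extended vocabulary, the tokenizer can be inspected
like any other 🤗 Transformers tokenizer. The snippet below is a minimal
sketch; the exact tokens it prints depend on the released vocabulary and are
not reproduced here.

```python
from transformers import AutoTokenizer

# Load the extended MathBERTa tokenizer from the 🤗 Model Hub.
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')

# LaTeX math is delimited with [MATH] ... [/MATH] tags, as in the usage examples below.
text = r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is zero."

# Inspect how the words and LaTeX math symbols are split into tokens.
print(tokenizer.tokenize(text))
```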
|
|
|
Like RoBERTa, MathBERTa has been fine-tuned with the masked language modeling
(MLM) objective: taking a sentence, the model randomly masks 15% of the words
and math symbols in the input, then runs the entire masked sentence through the
model and has to predict the masked words and symbols. This way, the model
learns an inner representation of the English language and LaTeX that can then
be used to extract features useful for downstream tasks.
|
|
|
[3]: https://huggingface.co/roberta-base |
|
[7]: https://github.com/Witiko/scm-at-arqmath3/blob/main/02-train-tokenizers.ipynb |
|
[8]: https://github.com/witiko/scm-at-arqmath3/blob/main/03-finetune-roberta.ipynb |
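
The 15% masking described above corresponds to the standard masked language
modeling data collator in 🤗 Transformers. The sketch below is illustrative
only; the actual fine-tuning code is in the linked notebook and may differ in
its details.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')

# Randomly mask 15% of the input tokens, as described above.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15,
)
```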
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for masked language modeling, but it's mostly
intended to be fine-tuned on a downstream task. See the [model
hub][4] to look for fine-tuned versions on a task that interests you.
|
|
|
Note that this model is primarily aimed at being fine-tuned on tasks that use
the whole sentence (potentially masked) to make decisions, such as sequence
classification, token classification, or question answering. For tasks such as
text generation, you should look at a model like GPT-2.
|
|
|
[4]: https://huggingface.co/models?filter=roberta |
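
As an illustration of such a downstream setup, the sketch below loads
MathBERTa with a freshly initialized sequence-classification head. The task
and the number of labels are placeholders, not part of this model card; the
new head must still be fine-tuned on your own data, e.g. with the 🤗 Trainer
API.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')

# Hypothetical two-class task; num_labels is an assumption for illustration.
model = AutoModelForSequenceClassification.from_pretrained(
    'witiko/mathberta', num_labels=2,
)
```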
|
|
|
### How to use |
|
|
|
|
|
*Due to the large number of added LaTeX tokens, MathBERTa is affected by [a
software bug in the 🤗 Transformers library][9] that causes it to load for tens
of minutes. The bug was [fixed in version 4.20.0][10].*
|
|
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='witiko/mathberta')
>>> unmasker(r"If [MATH] \theta = \pi [/MATH] , then [MATH] \sin(\theta) [/MATH] is <mask>.")

[{'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is zero.',
  'score': 0.23291291296482086,
  'token': 4276,
  'token_str': ' zero'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is 0.',
  'score': 0.11734672635793686,
  'token': 321,
  'token_str': ' 0'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is real.',
  'score': 0.0793389230966568,
  'token': 588,
  'token_str': ' real'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is 1.',
  'score': 0.0753420740365982,
  'token': 112,
  'token_str': ' 1'},
 {'sequence': ' If \\theta = \\pi, then\\sin(\\theta ) is even.',
  'score': 0.06487451493740082,
  'token': 190,
  'token_str': ' even'}]
```
|
|
|
Here is how to use this model to get the features of a given text in PyTorch: |
|
|
|
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the base model from the 🤗 Model Hub.
tokenizer = AutoTokenizer.from_pretrained('witiko/mathberta')
model = AutoModel.from_pretrained('witiko/mathberta')

# LaTeX math is wrapped in [MATH] ... [/MATH] tags, as in the examples above.
text = r"Replace me by any text and [MATH] \text{math} [/MATH] you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

# output.last_hidden_state holds the contextual token embeddings.
output = model(**encoded_input)
```
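
If you need a single vector per text, one common (but not prescribed by this
model card) choice is to mean-pool the last hidden state. Continuing from the
snippet above:

```python
import torch

# output.last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# averaging over the token dimension is one simple pooling strategy.
with torch.no_grad():
    text_vector = output.last_hidden_state.mean(dim=1)  # (batch_size, hidden_size)
```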
|
|
|
## Training data |
|
|
|
Our model was fine-tuned on two datasets: |
|
|
|
- [ArXMLiv 2020][5], a dataset consisting of 1,581,037 ArXiv documents. |
|
- [Math StackExchange][6], a dataset of 2,466,080 questions and answers. |
|
|
|
Together, these datasets amount to 52 GB of text and LaTeX.
|
|
|
## Intrinsic evaluation results |
|
|
|
Our model achieves the following intrinsic evaluation results: |
|
|
|
![Intrinsic evaluation results of MathBERTa][11] |
|
|
|
[5]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/ |
|
[6]: https://www.cs.rit.edu/~dprl/ARQMath/arqmath-resources.html |
|
[9]: https://github.com/huggingface/transformers/issues/16936 |
|
[10]: https://github.com/huggingface/transformers/pull/17119 |
|
[11]: https://huggingface.co/witiko/mathberta/resolve/main/learning-curves.png |
|
|
|
## Citing |
|
|
|
### Text |
|
|
|
Vít Novotný and Michal Štefánik. “Combining Sparse and Dense Information
Retrieval: Soft Vector Space Model and MathBERTa at ARQMath-3 Task 1 (Answer
Retrieval)”. In: *Proceedings of the Working Notes of CLEF 2022*. CEUR-WS,
2022, pp. 104–118.
|
|
|
### Bib(La)TeX |
|
|
|
``` bib
@inproceedings{novotny2022combining,
  booktitle = {Proceedings of the Working Notes of {CLEF} 2022},
  editor = {Faggioli, Guglielmo and Ferro, Nicola and Hanbury, Allan and Potthast, Martin},
  issn = {1613-0073},
  title = {Combining Sparse and Dense Information Retrieval},
  subtitle = {Soft Vector Space Model and MathBERTa at ARQMath-3 Task 1 (Answer Retrieval)},
  author = {Novotný, Vít and Štefánik, Michal},
  publisher = {{CEUR-WS}},
  year = {2022},
  pages = {104--118},
  numpages = {15},
  url = {http://ceur-ws.org/Vol-3180/paper-06.pdf},
  urldate = {2022-08-12},
}
```