Update README.md
README.md
CHANGED
The previous version of the file contained only the YAML front matter declaring `license: mit`; it is replaced by the full model card below.
---
license: mit
language:
- fr
library_name: transformers
tags:
- linformer
- legal
- medical
- RoBERTa
- pytorch
---

# Jargon-multidomain-base

[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder LM for French, combining the LinFormer attention mechanism with the RoBERTa model architecture.

Jargon is available in several versions with different context sizes and types of pre-training corpora.

| **Model** | **Initialised from...** | **Training Data** |
|-----------|:-----------------------:|:-----------------:|
| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch | 8.5GB Web Corpus |
| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base | 5.4GB Medical Corpus |
| jargon-general-legal | jargon-general-base | 18GB Legal Corpus |
| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base | Medical+Legal Corpora |
| jargon-legal | scratch | 18GB Legal Corpus |
| jargon-legal-4096 | scratch | 18GB Legal Corpus |
| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch | 5.4GB Medical Corpus |
| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch | 5.4GB Medical Corpus |
| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch | [NACHOS](https://drbert.univ-avignon.fr/) |
| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch | [NACHOS](https://drbert.univ-avignon.fr/) |

## Evaluation

The Jargon models were evaluated on a range of specialized downstream tasks.

For more information, please check out the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).

## Using Jargon models with HuggingFace transformers

You can get started with `jargon-multidomain-base` using the code snippet below:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-multidomain-base", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-multidomain-base", trust_remote_code=True)

jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
```
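
The pipeline returns a ranked list of candidate fillers for the masked token; a minimal sketch of inspecting them, assuming the standard `fill-mask` output fields of `transformers`:

```python
# Each prediction is a dict with the standard fill-mask pipeline fields,
# including "token_str" (the predicted word) and "score" (its probability).
for prediction in output:
    print(prediction["token_str"], round(prediction["score"], 3))
```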

You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.
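
For example, a minimal sketch of loading the checkpoint for a downstream classification task; the `num_labels` value here is an illustrative assumption, and the classification head is newly initialised until you fine-tune it:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-multidomain-base", trust_remote_code=True)
# num_labels is task-specific (placeholder value) and not part of the pretrained checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    "PantagrueLLM/jargon-multidomain-base",
    num_labels=2,
    trust_remote_code=True,
)
```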

- **Language(s):** French
- **License:** MIT
- **Developed by:** Vincent Segonne
- **Funded by**
  - GENCI-IDRIS (Grant 2022 A0131013801)
  - French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
  - MIAI@Grenoble Alpes ANR-19-P3IA-0003
  - PROPICTO ANR-20-CE93-0005
  - Lawbot ANR-20-CE38-0013
  - Swiss National Science Foundation (grant PROPICTO N°197864)
- **Authors**
  - Vincent Segonne
  - Aidan Mannion
  - Laura Cristina Alonzo Canul
  - Alexandre Audibert
  - Xingyu Liu
  - Cécile Macaire
  - Adrien Pupier
  - Yongxin Zhou
  - Mathilde Aguiar
  - Felix Herron
  - Magali Norré
  - Massih-Reza Amini
  - Pierrette Bouillon
  - Iris Eshkol-Taravella
  - Emmanuelle Esperança-Rodier
  - Thomas François
  - Lorraine Goeuriot
  - Jérôme Goulian
  - Mathieu Lafourcade
  - Benjamin Lecouteux
  - François Portet
  - Fabien Ringeval
  - Vincent Vandeghinste
  - Maximin Coavoux
  - Marco Dinarelli
  - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{segonne:hal-04535557,
  TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
  AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
  URL = {https://hal.science/hal-04535557},
  BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
  ADDRESS = {Turin, Italy},
  YEAR = {2024},
  MONTH = May,
  KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
  PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
  HAL_ID = {hal-04535557},
  HAL_VERSION = {v1},
}
```