---
tags:
- generated_from_trainer
model-index:
- name: EUBERT
  results: []
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- ga
- hr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
---

## Model Card: EUBERT

### Overview

- **Model Name**: EUBERT
- **Model Version**: 1.0
- **Date of Release**: 02 October 2023
- **Model Architecture**: BERT (Bidirectional Encoder Representations from Transformers)
- **Training Data**: Documents registered by the European Publications Office
- **Model Use Case**: Text Classification, Question Answering, Language Understanding

![EUBERT](https://huggingface.co/EuropeanParliament/EUBERT/resolve/main/EUBERT_small.png)

### Model Description

EUBERT is a pretrained, uncased BERT model trained on a large corpus of documents registered by the [European Publications Office](https://op.europa.eu/). These documents span the last 30 years and cover a wide range of topics and domains. EUBERT is designed as a versatile foundation that can be fine-tuned for a variety of natural language processing tasks.

### Intended Use

EUBERT serves as a starting point for building more specific natural language understanding models. Its versatility makes it suitable for a wide range of tasks, including but not limited to:

1. **Text Classification**: EUBERT can be fine-tuned to classify text documents into different categories, for applications such as sentiment analysis, topic categorization, and spam detection (see the fine-tuning sketch below).
2. **Question Answering**: Fine-tuned on question-answering datasets, EUBERT can extract answers from text documents, supporting tasks such as information retrieval and document summarization.
3. **Language Understanding**: EUBERT can be employed for general language understanding tasks, including named entity recognition and part-of-speech tagging.

### Performance

Performance on a given downstream task depends on the quality and quantity of the data used for fine-tuning. Users are encouraged to fine-tune the model on their specific task and evaluate it accordingly.

### Considerations

- **Data Privacy and Compliance**: Users should ensure that their use of EUBERT complies with all relevant data privacy and compliance regulations, especially when working with sensitive or personally identifiable information.
- **Fine-Tuning**: The effectiveness of EUBERT on a given task depends on the quality and quantity of the training data, as well as on the fine-tuning process. Careful experimentation and evaluation are essential to achieve optimal results.
- **Bias and Fairness**: Users should be aware of potential biases in the training data and take appropriate measures to mitigate them when fine-tuning EUBERT for specific tasks.

### Conclusion

EUBERT is a pretrained BERT model that leverages a substantial corpus of documents from the European Publications Office. It offers a versatile foundation for natural language processing solutions, enabling researchers and developers to build custom models for text classification, question answering, and language understanding. Users should exercise diligence in fine-tuning and evaluating the model for their specific use cases while adhering to data privacy and fairness considerations.
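### Example: Fine-Tuning for Text Classification

A minimal sketch of fine-tuning EUBERT for binary text classification with the Hugging Face `transformers` Trainer. The toy dataset, label set, and hyperparameters below are illustrative assumptions, not part of the original training setup; replace them with your own labelled corpus.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "EuropeanParliament/EUBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# A fresh classification head is added on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy data for illustration only; substitute a real labelled dataset.
train_dataset = Dataset.from_dict({
    "text": ["The regulation was adopted by the Council.", "Win a free prize now!!!"],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_dataset = train_dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="eubert-text-classification",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

The same pattern applies to the other intended uses, e.g. `AutoModelForQuestionAnswering` for extractive question answering or `AutoModelForTokenClassification` for named entity recognition.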
---

## Training procedure

A dedicated byte-level BPE tokenizer was trained with a vocabulary size of 2**16 (65,536) and a minimum token frequency of 2.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1

### Training results

Coming soon.

### Framework versions

- Transformers 4.33.3
- Pytorch 2.0.1+cu117
- Datasets 2.14.5
- Tokenizers 0.13.3

### Infrastructure

- **Hardware Type:** 4 x GPUs (24 GB each)
- **GPU Days:** 16
- **Cloud Provider:** EuroHPC
- **Compute Region:** Meluxina

# Model Card Authors

Sebastien Campion

# Model Card Contact

sebastien.campion@europarl.europa.eu
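## Appendix: Reproducing the Tokenizer

For reference, a byte-level BPE tokenizer with the settings described above can be trained with the Hugging Face `tokenizers` library. This is a sketch, not the exact script used for EUBERT; the corpus file path and the special-token set are assumptions.

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],  # placeholder path for the Publications Office corpus
    vocab_size=2**16,      # 65,536 entries, as used for EUBERT
    min_frequency=2,       # ignore tokens seen fewer than twice
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],  # assumed BERT-style specials
)
tokenizer.save_model("eubert-tokenizer")  # writes vocab.json and merges.txt
```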