|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- bigscience-data/roots_vi_binhvq_news_corpus |
|
- wikipedia |
|
language: |
|
- vi |
|
- en |
|
- zh |
|
library_name: transformers |
|
tags: |
|
- t5 |
|
- flant5 |
|
- summarization |
|
- translation |
|
- question-answering |
|
pipeline_tag: text2text-generation
|
--- |
|
## Extend Vocabulary and Pretrain
|
We used [SentencePiece](https://github.com/google/sentencepiece) to train a new tokenizer covering Vietnamese, English, and Chinese. The new tokenizer's vocabulary was then merged with Flan-T5's original vocabulary, with duplicate tokens removed, yielding a combined vocabulary of 106,611 tokens.
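
Below is a minimal sketch of one way to perform such a merge at the SentencePiece level; the file names `vi_en_zh_spm.model` and `merged_spm.model` are hypothetical, and the actual merging script may differ.

```python
# Sketch only: merge a newly trained SentencePiece vocabulary into Flan-T5's,
# skipping pieces that already exist. "vi_en_zh_spm.model" is a hypothetical path.
from transformers import T5Tokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Load the slow tokenizer so we can access the underlying SentencePiece model.
base = T5Tokenizer.from_pretrained("google/flan-t5-large")
flan_spm = sp_pb2.ModelProto()
flan_spm.ParseFromString(base.sp_model.serialized_model_proto())
existing = {p.piece for p in flan_spm.pieces}

# Append every piece from the Vietnamese/English/Chinese tokenizer that
# Flan-T5 does not already contain.
new_spm = sp_pb2.ModelProto()
with open("vi_en_zh_spm.model", "rb") as f:
    new_spm.ParseFromString(f.read())
for p in new_spm.pieces:
    if p.piece not in existing:
        flan_spm.pieces.append(sp_pb2.ModelProto.SentencePiece(piece=p.piece, score=0.0))

# Save the merged SentencePiece model for the extended tokenizer.
with open("merged_spm.model", "wb") as f:
    f.write(flan_spm.SerializeToString())
print("merged vocabulary size:", len(flan_spm.pieces))
```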
|
|
|
We then ran a single epoch of continual pretraining (also referred to as incremental pretraining) starting from Flan-T5-Large. The pretraining corpus is a diverse mix of more than 100 GB of text drawn from the following sources (a short model-preparation sketch follows the list):
|
- [NewsCorpus](https://github.com/binhvq/news-corpus) |
|
- Vietnamese Wikipedia |
|
- Vietnamese books |
|
- Vietnamese legal documents |
|
- Vietnamese legal text |
|
- English Wikipedia |
|
- Chinese text
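
Before the continual pretraining run, the model's token embeddings must be resized to the merged vocabulary. A minimal sketch, assuming the merged tokenizer has been saved locally to a hypothetical `merged_tokenizer/` directory:

```python
# Sketch: grow Flan-T5-Large's embeddings to the merged 106,611-token vocabulary
# before continual pretraining. "merged_tokenizer/" is a hypothetical local path.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("merged_tokenizer/")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")

# Rows for the newly added tokens are freshly initialized and are learned
# during the single epoch of continual pretraining on the corpus listed above.
model.resize_token_embeddings(len(tokenizer))
```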
|
|
|
## How to use |
|
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer (with the extended 106,611-token vocabulary) and the model.
tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")

# Move the model to GPU; omit this line to run on CPU.
model.cuda()
```
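
Continuing from the snippet above, a quick inference example; the prompt and generation settings are illustrative, not values recommended by the authors:

```python
import torch

# Illustrative Vietnamese prompt: "Summarize: Hanoi is the capital of Vietnam
# and a major cultural and political center."
prompt = "Tóm tắt: Hà Nội là thủ đô của Việt Nam và là một trung tâm văn hóa, chính trị lớn."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```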
|
|
|
## Finetune and Benchmark

The pretrained model is subsequently finetuned and benchmarked on Vietnamese downstream datasets, including (a fine-tuning sketch follows the list):
|
|
|
- Wikilingua |
|
- Vietnews |
|
- Pho_NER |
|
- ..... |
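
For reference, a minimal sketch of how a summarization-style fine-tuning run could be set up with `Seq2SeqTrainer`; the toy in-memory dataset, column names, and hyperparameters below are placeholders, not the configuration used for the benchmarks above.

```python
# Sketch only: fine-tune HattoFlanT5-Large for summarization with Seq2SeqTrainer.
# The toy dataset, column names, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")

# Replace this toy Vietnamese document/summary pair with Wikilingua, Vietnews, etc.
raw = Dataset.from_dict({
    "document": ["Hà Nội là thủ đô của Việt Nam. Thành phố nằm bên bờ sông Hồng."],
    "summary": ["Hà Nội là thủ đô của Việt Nam."],
})

def preprocess(batch):
    # Tokenize inputs and targets; truncation limits are illustrative.
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="hattoflant5-summarization",
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```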
|
|
|
## Citation |
|
- Hatto |
|
- Ipcoms |