rabindralamsal
committed on
Commit 01ec027 (1 parent: 7ee3d7f)
Update README.md
README.md
CHANGED
@@ -1,5 +1,5 @@
 # CrisisTransformers
-CrisisTransformers is a family of pre-trained language models and sentence encoders introduced in the papers "[CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts](https://
+CrisisTransformers is a family of pre-trained language models and sentence encoders introduced in the papers "[CrisisTransformers: Pre-trained language models and sentence encoders for crisis-related social media texts](https://www.sciencedirect.com/science/article/pii/S0950705124005501)" and "[Semantically Enriched Cross-Lingual Sentence Embeddings for Crisis-related Social Media Texts](https://arxiv.org/abs/2403.16614)". The models were trained based on the RoBERTa pre-training procedure on a massive corpus of over 15 billion word tokens sourced from tweets associated with 30+ crisis events such as disease outbreaks, natural disasters, conflicts, etc. Please refer to the [associated paper](https://www.sciencedirect.com/science/article/pii/S0950705124005501) for more details.

 CrisisTransformers were evaluated on 18 public crisis-specific datasets against strong baselines. Our pre-trained models outperform the baselines across all 18 datasets in classification tasks, and our best-performing sentence-encoder (mono-lingual) outperforms the state-of-the-art by more than 17\% in sentence encoding tasks. The multi-lingual sentence encoders (support 50+ languages; see [associated paper](https://arxiv.org/abs/2403.16614)) are designed to approximate the embedding space of the best-performing mono-lingual sentence encoder.

@@ -29,16 +29,6 @@ CrisisTransformers has 8 pre-trained models, 1 mono-lingual and 2 multi-lingual

 Languages supported by the multi-lingual sentence encoders: Albanian, Arabic, Armenian, Bulgarian, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, French (Canada), Galician, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Kurdish (Sorani), Latvian, Lithuanian, Macedonian, Malay, Marathi, Mongolian, Myanmar (Burmese), Norwegian, Persian, Polish, Portuguese, Portuguese (Brazil), Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.

-## Results
-Here are the main results from the CrisisTransformers' paper.
-
-<p float="left">
-<a href="https://raw.githubusercontent.com/rabindralamsal/images/main/cls.png"><img width="100%" alt="classification" src="https://raw.githubusercontent.com/rabindralamsal/images/main/cls.png"></a>
-<a href="https://raw.githubusercontent.com/rabindralamsal/images/main/se.png"><img width="50%" alt="sentence encoding" src="https://raw.githubusercontent.com/rabindralamsal/images/main/se.png"></a>
-</p>
-
-For results from the multi-lingual sentence encoders, please refer to the [associated paper](https://arxiv.org/abs/2403.16614).
-
 ## Citation
 If you use CrisisTransformers and the mono-lingual sentence encoder, please cite the following paper:
 ```
@@ -47,10 +37,10 @@ If you use CrisisTransformers and the mono-lingual sentence encoder, please cite
 author={Rabindra Lamsal and
 Maria Rodriguez Read and
 Shanika Karunasekera},
-
-
-
-
+journal={Knowledge-Based Systems},
+pages={111916},
+year={2024},
+publisher={Elsevier}
 }
 ```
