---
library_name: transformers
tags: [disaster management, twitter]
---

# Disaster-Twitter-XLM-RoBERTa-AL

This is a multilingual [Twitter-XLM-RoBERTa-base model](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base) fine-tuned to identify disaster-related tweets. It was trained in two steps. First, we fine-tuned the model on 179,391 labelled tweets from [CrisisLex](https://crisislex.org/) in English, Spanish, German, French and Italian. Subsequently, the model was fine-tuned further on data from the 2021 Ahr Valley flood in Germany and the 2023 Chile forest fires, with training examples selected via a greedy coreset active learning approach.

- Paper: [Active Learning for Identifying Disaster-Related Tweets: A Comparison with Keyword Filtering and Generic Fine-Tuning](https://link.springer.com/chapter/10.1007/978-3-031-66428-1_8)
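The coreset step can be pictured as k-Center Greedy selection over tweet embeddings: repeatedly pick the unlabelled tweet farthest from everything already labelled. The sketch below is illustrative only, assumes embeddings have already been computed (e.g. from the encoder's pooled outputs), and is not the authors' exact implementation:

```python
import numpy as np

def greedy_coreset(unlabeled: np.ndarray, labeled: np.ndarray, budget: int) -> list[int]:
    """Illustrative k-Center Greedy: pick `budget` unlabeled points that cover the embedding space."""
    # Distance from each unlabeled embedding to its nearest labeled embedding
    dists = np.min(
        np.linalg.norm(unlabeled[:, None, :] - labeled[None, :, :], axis=-1), axis=1
    )
    selected: list[int] = []
    for _ in range(budget):
        idx = int(np.argmax(dists))  # farthest point from all current centers
        selected.append(idx)
        # Update nearest-center distances with the newly selected point
        new_d = np.linalg.norm(unlabeled - unlabeled[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return selected
```

The selected indices would then be sent for manual labelling and added to the training set in the next active learning round.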

## Labels
The model classifies short texts using one of the following two labels:
- `LABEL_0`: NOT disaster-related
- `LABEL_1`: Disaster-related
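For readability, raw predictions can be mapped to these names with a small lookup (a hypothetical helper, not part of the model itself):

```python
# Hypothetical mapping from raw pipeline labels to readable names
LABEL_NAMES = {'LABEL_0': 'not disaster-related', 'LABEL_1': 'disaster-related'}
print(LABEL_NAMES['LABEL_1'])  # disaster-related
```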

## Example Pipeline
```python
from transformers import pipeline
MODEL_NAME = 'hannybal/disaster-twitter-xlm-roberta-al'
classifier = pipeline('text-classification', model=MODEL_NAME, tokenizer='cardiffnlp/twitter-xlm-roberta-base')
classifier('I can see fire and smoke from the nearby fire!')
```

Output:
```
[{'label': 'LABEL_0', 'score': 0.9967854022979736}]
```
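The pipeline also accepts a list of texts, which is convenient for batch classification of multilingual tweets. The tweets below are made-up examples (the German one reads "The flood destroyed our street"):

```python
# Batch classification of a hypothetical multilingual batch
tweets = [
    'Die Flut hat unsere Straße zerstört. #ahr',
    'Looking forward to the weekend!',
]
for tweet, result in zip(tweets, classifier(tweets)):
    print(f"{result['label']} ({result['score']:.4f}): {tweet}")
```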


## Full Classification Example

```python
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

def preprocess(text: str) -> str:
    """Pre-process texts by replacing usernames and links with placeholders."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

MODEL_NAME = 'hannybal/disaster-twitter-xlm-roberta-al'

tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-xlm-roberta-base')
config = AutoConfig.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# example classification ("This is all that is left of my basement... #flood #ahr")
text = "Das ist alles, was von meinem Keller noch übrig ist... #flood #ahr @ Bad Neuenahr-Ahrweiler https://t.co/C68fBaKZWR"
text = preprocess(text)
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# print labels and their respective scores
ranking = np.argsort(scores)
ranking = ranking[::-1]
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{i+1}) {l} {np.round(float(s), 4)}")
```

Output:
```
1) LABEL_1 0.9999
2) LABEL_0 0.0001
```
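For downstream filtering, a single decision can be derived from the softmax scores. The snippet below continues the example above; the threshold of 0.5 is a hypothetical cut-off that should be tuned on validation data:

```python
# Flag the tweet as disaster-related when the probability of LABEL_1
# (looked up via config.label2id) exceeds a chosen threshold.
THRESHOLD = 0.5  # hypothetical cut-off; tune on validation data
is_disaster = float(scores[config.label2id['LABEL_1']]) >= THRESHOLD
print('disaster-related' if is_disaster else 'not disaster-related')
```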

## Reference
```
@inproceedings{Hanny.2024a,
  title = {Active {{Learning}} for~{{Identifying Disaster-Related Tweets}}: {{A Comparison}} with~{{Keyword Filtering}} and~{{Generic Fine-Tuning}}},
  shorttitle = {Active {{Learning}} for~{{Identifying Disaster-Related Tweets}}},
  booktitle = {Intelligent {{Systems}} and {{Applications}}},
  author = {Hanny, David and Schmidt, Sebastian and Resch, Bernd},
  editor = {Arai, Kohei},
  year = {2024},
  pages = {126--142},
  publisher = {Springer Nature Switzerland},
  address = {Cham},
  doi = {10.1007/978-3-031-66428-1_8},
  isbn = {978-3-031-66428-1},
  langid = {english}
}
```

## Acknowledgements
This work has received funding from the European Commission - European Union under HORIZON EUROPE (HORIZON Research and Innovation Actions) as part of the [TEMA project](https://tema-project.eu/) (grant agreement 101093003; HORIZON-CL4-2022-DATA-01-01). This work has also received funding from the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK) project GeoSHARING (Grant Number 878652).