# GLiREL : Generalist and Lightweight model for Zero-Shot Relation Extraction

GLiREL is a Relation Extraction model capable of classifying unseen relations given the entities within a text. It builds upon the excellent work done by Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois on the [GLiNER](https://github.com/urchade/GLiNER) library, which enables efficient zero-shot Named Entity Recognition.

* GLiNER paper: [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)

* Train a Zero-shot model: <a href="https://colab.research.google.com/github/jackboyla/GLiREL/blob/main/train.ipynb" target="_blank">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<!-- <img src="demo.jpg" alt="Demo Image" width="50%"/> -->

---
# Installation

```bash
pip install glirel
```

## Usage
Once you have installed the GLiREL library, you can import the `GLiREL` class. Load a pretrained model with `GLiREL.from_pretrained` and predict relations with `predict_relations`.

```python
from glirel import GLiREL
import spacy

model = GLiREL.from_pretrained("jackboyla/glirel_beta")

nlp = spacy.load('en_core_web_sm')

text = 'Derren Nesbitt had a history of being cast in "Doctor Who", having played villainous warlord Tegana in the 1964 First Doctor serial "Marco Polo".'
doc = nlp(text)
tokens = [token.text for token in doc]

labels = ['country of origin', 'licensed to broadcast to', 'father', 'followed by', 'characters']

ner = [[26, 27, 'PERSON', 'Marco Polo'], [22, 23, 'Q2989412', 'First Doctor']]  # 'type' is not used -- it can be any string!

relations = model.predict_relations(tokens, labels, threshold=0.0, ner=ner, top_k=1)

print('Number of relations:', len(relations))

sorted_data_desc = sorted(relations, key=lambda x: x['score'], reverse=True)
print("\nDescending Order by Score:")
for item in sorted_data_desc:
    print(f"{item['head_text']} --> {item['label']} --> {item['tail_text']} | score: {item['score']}")
```

### Expected Output

```
Number of relations: 2

Descending Order by Score:
{'head_pos': [26, 28], 'tail_pos': [22, 24], 'head_text': ['Marco', 'Polo'], 'tail_text': ['First', 'Doctor'], 'label': 'characters', 'score': 0.9923334121704102}
{'head_pos': [22, 24], 'tail_pos': [26, 28], 'head_text': ['First', 'Doctor'], 'tail_text': ['Marco', 'Polo'], 'label': 'characters', 'score': 0.9915636777877808}
```
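
In the snippet above the `ner` spans are written by hand. If you are already running spaCy's NER, you can derive them from `doc.ents` instead. The examples in this README use inclusive end indices, while spaCy's `ent.end` is exclusive, so a minimal conversion sketch looks like this (using the spaCy entity label as the type string, which is only illustrative since the type is not used):

```python
# Derive GLiREL-style spans from spaCy entities.
# spaCy's `ent.end` is exclusive, while the examples here use inclusive
# end indices, hence the `- 1`. The type string is unused, so passing
# the spaCy label is only illustrative.
ner = [[ent.start, ent.end - 1, ent.label_, ent.text] for ent in doc.ents]
```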

## Constrain labels
In practice, we usually want to constrain which entity types can appear as the head and/or tail of a relationship. This is already implemented in GLiREL: define a labels dictionary like the one below (it is passed at inference time, as shown in the spaCy example in the next section).

```python
labels = {"glirel_labels": {
    'co-founder': {"allowed_head": ["PERSON"], "allowed_tail": ["ORG"]},
    'no relation': {},  # head and tail can be any entity type
    'country of origin': {"allowed_head": ["PERSON", "ORG"], "allowed_tail": ["LOC", "GPE"]},
    'parent': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'located in or next to body of water': {"allowed_head": ["LOC", "GPE", "FAC"], "allowed_tail": ["LOC", "GPE"]},
    'spouse': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'child': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'founder': {"allowed_head": ["PERSON"], "allowed_tail": ["ORG"]},
    'founded on date': {"allowed_head": ["ORG"], "allowed_tail": ["DATE"]},
    'headquartered in': {"allowed_head": ["ORG"], "allowed_tail": ["LOC", "GPE", "FAC"]},
    'acquired by': {"allowed_head": ["ORG"], "allowed_tail": ["ORG", "PERSON"]},
    'subsidiary of': {"allowed_head": ["ORG"], "allowed_tail": ["ORG", "PERSON"]},
    }
}
```

## Usage with spaCy

You can also load GLiREL into a regular spaCy NLP pipeline. Here's an example using an English pipeline.

```python
import spacy
import glirel

# Load a blank spaCy model or an existing one
nlp = spacy.load('en_core_web_sm')

# Add the GLiREL component to the pipeline
nlp.add_pipe("glirel", after="ner")

# Now you can use the pipeline with the GLiREL component
text = "Apple Inc. was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976. The company is headquartered in Cupertino, California."

labels = {"glirel_labels": {
    'co-founder': {"allowed_head": ["PERSON"], "allowed_tail": ["ORG"]},
    'country of origin': {"allowed_head": ["PERSON", "ORG"], "allowed_tail": ["LOC", "GPE"]},
    'licensed to broadcast to': {"allowed_head": ["ORG"]},
    'no relation': {},
    'parent': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'followed by': {"allowed_head": ["PERSON", "ORG"], "allowed_tail": ["PERSON", "ORG"]},
    'located in or next to body of water': {"allowed_head": ["LOC", "GPE", "FAC"], "allowed_tail": ["LOC", "GPE"]},
    'spouse': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'child': {"allowed_head": ["PERSON"], "allowed_tail": ["PERSON"]},
    'founder': {"allowed_head": ["PERSON"], "allowed_tail": ["ORG"]},
    'headquartered in': {"allowed_head": ["ORG"], "allowed_tail": ["LOC", "GPE", "FAC"]},
    'acquired by': {"allowed_head": ["ORG"], "allowed_tail": ["ORG", "PERSON"]},
    'subsidiary of': {"allowed_head": ["ORG"], "allowed_tail": ["ORG", "PERSON"]},
    }
}

# Add the labels to the pipeline at inference time
docs = list(nlp.pipe([(text, labels)], as_tuples=True))
relations = docs[0][0]._.relations

print('Number of relations:', len(relations))

sorted_data_desc = sorted(relations, key=lambda x: x['score'], reverse=True)
print("\nDescending Order by Score:")
for item in sorted_data_desc:
    print(f"{item['head_text']} --> {item['label']} --> {item['tail_text']} | score: {item['score']}")
```

### Expected Output

```
Number of relations: 5

Descending Order by Score:
['Apple', 'Inc.'] --> headquartered in --> ['California'] | score: 0.9854260683059692
['Apple', 'Inc.'] --> headquartered in --> ['Cupertino'] | score: 0.9569844603538513
['Steve', 'Wozniak'] --> co-founder --> ['Apple', 'Inc.'] | score: 0.09025496244430542
['Steve', 'Jobs'] --> co-founder --> ['Apple', 'Inc.'] | score: 0.08805803954601288
['Ronald', 'Wayne'] --> co-founder --> ['Apple', 'Inc.'] | score: 0.07996643334627151
```
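
Since each relation carries a `score`, low-confidence predictions such as the `co-founder` ones above can be filtered out in plain Python. This is a minimal sketch rather than part of the GLiREL API; the `0.5` cutoff is an arbitrary example value:

```python
# Keep only relations whose score clears an arbitrary example cutoff of 0.5.
MIN_SCORE = 0.5
confident = [r for r in relations if r['score'] >= MIN_SCORE]

for item in confident:
    print(f"{item['head_text']} --> {item['label']} --> {item['tail_text']} | score: {item['score']:.3f}")
```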


## To run experiments

* FewRel: ~56k examples
* WikiZSL: ~85k examples

```bash
# few_rel
cd data
python process_few_rel.py
cd ..
# adjust config
python train.py --config config_few_rel.yaml
```

```bash
# wiki_zsl
cd data
python process_wiki_zsl.py
cd ..
# adjust config
python train.py --config config_wiki_zsl.yaml
```

## Example training data

NOTE: the entity indices are inclusive, i.e. `"Binsey"` is `[7, 7]`. This differs from spaCy, where the end index is exclusive (spaCy would set the indices to `[7, 8]`).

JSONL file:
```json
{
  "ner": [
    [7, 7, "Q4914513", "Binsey"],
    [11, 12, "Q19686", "River Thames"]
  ],
  "relations": [
    {
      "head": {"mention": "Binsey", "position": [7, 7], "type": "LOC"},  # 'type' is not used -- it can be any string!
      "tail": {"mention": "River Thames", "position": [11, 12], "type": "Q19686"},
      "relation_text": "located in or next to body of water"
    }
  ],
  "tokenized_text": ["The", "race", "took", "place", "between", "Godstow", "and", "Binsey", "along", "the", "Upper", "River", "Thames", "."]
},
{
  "ner": [
    [9, 10, "Q4386693", "Legislative Assembly"],
    [1, 3, "Q1848835", "Parliament of Victoria"]
  ],
  "relations": [
    {
      "head": {"mention": "Legislative Assembly", "position": [9, 10], "type": "Q4386693"},
      "tail": {"mention": "Parliament of Victoria", "position": [1, 3], "type": "Q1848835"},
      "relation_text": "part of"
    }
  ],
  "tokenized_text": ["The", "Parliament", "of", "Victoria", "consists", "of", "the", "lower", "house", "Legislative", "Assembly", ",", "the", "upper", "house", "Legislative", "Council", "and", "the", "Queen", "of", "Australia", "."]
}
```
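
If you build your own training data, a record in this format can be assembled in Python and written out as one JSON object per line, as the JSONL name suggests. The sketch below is only an illustration: the file name is an example and this helper is not part of the GLiREL training scripts.

```python
import json

# One training record in the format shown above (inclusive token indices).
record = {
    "tokenized_text": ["The", "race", "took", "place", "between", "Godstow", "and",
                       "Binsey", "along", "the", "Upper", "River", "Thames", "."],
    "ner": [
        [7, 7, "Q4914513", "Binsey"],
        [11, 12, "Q19686", "River Thames"],
    ],
    "relations": [
        {
            "head": {"mention": "Binsey", "position": [7, 7], "type": "LOC"},
            "tail": {"mention": "River Thames", "position": [11, 12], "type": "Q19686"},
            "relation_text": "located in or next to body of water",
        }
    ],
}

# JSONL: append one JSON object per line. "my_train_data.jsonl" is just an example name.
with open("my_train_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```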

## License

[GLiREL](https://github.com/jackboyla/GLiREL) by [Jack Boylan](https://github.com/jackboyla) is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1).

<a href="https://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer">
  <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" alt="CC Logo" style="height: 20px; margin-right: 5px; vertical-align: text-bottom;">
  <img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" alt="BY Logo" style="height: 20px; margin-right: 5px; vertical-align: text-bottom;">
  <img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1" alt="NC Logo" style="height: 20px; margin-right: 5px; vertical-align: text-bottom;">
  <img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg?ref=chooser-v1" alt="SA Logo" style="height: 20px; margin-right: 5px; vertical-align: text-bottom;">
</a>