Token Classification
Scikit-learn
English
ner
legal
crf
shashankmc committed on
Commit
6830d6b
·
verified ·
1 Parent(s): 89a3dd1

Readme.md file updated

Files changed (1)
  1. README.md +390 -3
README.md CHANGED
@@ -1,3 +1,390 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - darrow-ai/LegalLensNER
5
+ language:
6
+ - en
7
+ metrics:
8
+ - f1
9
+ pipeline_tag: token-classification
10
+ library_name: sklearn
11
+ tags:
12
+ - ner
13
+ - legal
14
+ - crf
15
+ ---
16
+ # Model Card for Model ID
17
+
18
+ <!-- Provide a quick summary of what the model is/does. -->
19
+ Conditional Random Field (CRF) model for named entity recognition with handcrafted features. Named entities recognized: Violation-on, Violation-by, and Law.
20
+ The dataset uses the BIO tagging format. The model achieves a macro-F1 score of 0.32.
21
+
22
+ ## Model Details
23
+
24
+ ### Model Description
25
+
26
+ <!-- Provide a longer summary of what this model is. -->
27
+ The model was developed for the LegalLens 2024 competition, held as part of Natural Legal Language Processing 2024. It uses handcrafted features to identify named
28
+ entities in the BIO format.
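To make the BIO format concrete, here is a minimal sketch of how a BIO tag sequence can be grouped back into entity spans. The helper name `bio_to_spans` and the tag spelling `B-LAW` / `I-LAW` are assumptions for illustration; check the dataset card for the exact label strings.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (label, start, end_exclusive, text) spans."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray "I-" is treated as the start of a span (lenient decoding).
            start, label = i, tag[2:]
    if start is not None:
        # Close a span that runs to the end of the sentence.
        spans.append((label, start, len(tags), " ".join(tokens[start:])))
    return spans


# Hypothetical sentence and tags, not taken from the dataset.
tokens = ["The", "company", "violated", "the", "Clean", "Air", "Act", "."]
tags = ["O", "O", "O", "O", "B-LAW", "I-LAW", "I-LAW", "O"]
print(bio_to_spans(tokens, tags))
```

The lenient handling of a stray `I-` tag is a design choice; a stricter decoder could drop such tokens instead.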
29
+
30
+
31
+ - **Developed by:** Shashank M Chakravarthy
32
+ - **Funded by [optional]:** NA
33
+ - **Shared by [optional]:** NA
34
+ - **Model type:** Statistical Model
35
+ - **Language(s) (NLP):** English
36
+ - **License:** Apache 2.0 License
37
+ - **Finetuned from model [optional]:** NA
38
+
39
+ ### Model Sources [optional]
40
+
41
+ <!-- Provide the basic links for the model. -->
42
+
43
+ - **Repository:** NA
44
+ - **Paper [optional]:** [https://aclanthology.org/2024.nllp-1.33.pdf]
45
+ - **Demo [optional]:** NA
46
+
47
+ ## Uses
48
+
49
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
50
+ The model is used to detect named entities in unstructured text. The model can be extended to other entities with further modification to the handcrafted features.
51
+
52
+ ### Direct Use
53
+
54
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
55
+
56
+ The model can be used directly on any unstructured text after light preprocessing (tokenization and POS tagging). The repository files include the evaluation script.
57
+
58
+ ### Downstream Use [optional]
59
+
60
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
61
+
62
+
63
+ ### Out-of-Scope Use
64
+
65
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
66
+ This model is handcrafted for detecting violations and laws in text. It can also be used on other legal text that may contain similar entities.
67
+
68
+ ## Bias, Risks, and Limitations
69
+
70
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
71
+
72
+ The main limitation comes from the handcrafted features.
73
+
74
+ ### Recommendations
75
+
76
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
77
+
78
+ If the text used for prediction is not tokenized and POS-tagged beforehand, the model will not perform as it is designed to.
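One way to guard against this is to check the input shape before feature extraction. The sketch below assumes each sentence is a list of `(token, pos_tag)` string pairs, which is the shape the feature code in this card indexes into; the helper name `check_sentence` is hypothetical and not part of the released files.

```python
def check_sentence(sent):
    """Return sent unchanged if it is a non-empty list of (token, pos_tag) string pairs."""
    if not sent:
        raise ValueError("sentence is empty")
    for item in sent:
        is_pair = isinstance(item, (tuple, list)) and len(item) == 2
        if not (is_pair and all(isinstance(x, str) for x in item)):
            raise ValueError(f"expected a (token, pos_tag) pair, got {item!r}")
    return sent


# A properly tagged sentence passes through unchanged.
check_sentence([("Acme", "NNP"), ("violated", "VBD")])

# Bare tokens without POS tags are rejected early.
try:
    check_sentence(["Acme", "violated"])
except ValueError as err:
    print("rejected:", err)
```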
79
+
80
+ ## How to Get Started with the Model
81
+
82
+ Use the code below to get started with the model.
83
+ ### Load libraries
84
+ ```
85
+ import ast
86
+ import pandas as pd
87
+ import joblib
88
+ import nltk
89
+ from nltk import pos_tag
90
+ import string
91
+ from nltk.stem import WordNetLemmatizer
92
+ from nltk.stem import PorterStemmer
93
+ ```
94
+
95
+ ### Check if nltk modules are downloaded, if not download them
96
+ ```
97
+ nltk.download('wordnet')
98
+ nltk.download('omw-1.4')
99
+ nltk.download("averaged_perceptron_tagger")
100
+ ```
101
+ ### Class for grouping tokens as sentences (redundant if text processed directly)
102
+ ```
103
+ class getsentence(object):
104
+     '''
105
+     This class is used to get the sentences from the dataset.
106
+     Converts from BIO format to sentences using their sentence numbers
107
+     '''
108
+     def __init__(self, data):
109
+         self.n_sent = 1.0
110
+         self.data = data
111
+         self.empty = False
112
+         self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
113
+         self.sentences = [s for s in self.grouped]
114
+
115
+     def _agg_func(self, s):
116
+         return [(w, p) for w, p in zip(s["token"].values.tolist(),
117
+                                        s["pos_tag"].values.tolist())]
118
+
119
+ ```
120
+ ### Creates features for words in a sentence (code can be reduced using iteration)
121
+ ```
122
+ def word2features(sent, i):
123
+     '''
124
+     This method is used to extract features from the words in the sentence.
125
+     The main features extracted are:
126
+     - word.lower(): The word in lowercase
127
+     - word.isdigit(): If the word is a digit
128
+     - word.punct(): If the word is a punctuation
129
+     - postag: The pos tag of the word
130
+     - word.lemma(): The lemma of the word
131
+     - word.stem(): The stem of the word
132
+     The features (not all) are also extracted for the 4 previous and 4 next words.
133
+     '''
134
+     global token_count
135
+     wordnet_lemmatizer = WordNetLemmatizer()
136
+     porter_stemmer = PorterStemmer()
137
+     word = sent[i][0]
138
+     postag = sent[i][1]
139
+
140
+     features = {
141
+         'bias': 1.0,
142
+         'word.lower()': word.lower(),
143
+         'word.isdigit()': word.isdigit(),
144
+         # Check if its punctuations
145
+         'word.punct()': word in string.punctuation,
146
+         'postag': postag,
147
+         # Lemma of the word
148
+         'word.lemma()': wordnet_lemmatizer.lemmatize(word),
149
+         # Stem of the word
150
+         'word.stem()': porter_stemmer.stem(word)
151
+     }
152
+     if i > 0:
153
+         word1 = sent[i-1][0]
154
+         postag1 = sent[i-1][1]
155
+         features.update({
156
+             '-1:word.lower()': word1.lower(),
157
+             '-1:word.isdigit()': word1.isdigit(),
158
+             '-1:word.punct()': word1 in string.punctuation,
159
+             '-1:postag': postag1
160
+         })
161
+         if i - 2 >= 0:
162
+             features.update({
163
+                 '-2:word.lower()': sent[i-2][0].lower(),
164
+                 '-2:word.isdigit()': sent[i-2][0].isdigit(),
165
+                 '-2:word.punct()': sent[i-2][0] in string.punctuation,
166
+                 '-2:postag': sent[i-2][1]
167
+             })
168
+         if i - 3 >= 0:
169
+             features.update({
170
+                 '-3:word.lower()': sent[i-3][0].lower(),
171
+                 '-3:word.isdigit()': sent[i-3][0].isdigit(),
172
+                 '-3:word.punct()': sent[i-3][0] in string.punctuation,
173
+                 '-3:postag': sent[i-3][1]
174
+             })
175
+         if i - 4 >= 0:
176
+             features.update({
177
+                 '-4:word.lower()': sent[i-4][0].lower(),
178
+                 '-4:word.isdigit()': sent[i-4][0].isdigit(),
179
+                 '-4:word.punct()': sent[i-4][0] in string.punctuation,
180
+                 '-4:postag': sent[i-4][1]
181
+             })
182
+     else:
183
+         features['BOS'] = True
184
+
185
+     if i < len(sent)-1:
186
+         word1 = sent[i+1][0]
187
+         postag1 = sent[i+1][1]
188
+         features.update({
189
+             '+1:word.lower()': word1.lower(),
190
+             '+1:word.isdigit()': word1.isdigit(),
191
+             '+1:word.punct()': word1 in string.punctuation,
192
+             '+1:postag': postag1
193
+         })
194
+         if i + 2 < len(sent):
195
+             features.update({
196
+                 '+2:word.lower()': sent[i+2][0].lower(),
197
+                 '+2:word.isdigit()': sent[i+2][0].isdigit(),
198
+                 '+2:word.punct()': sent[i+2][0] in string.punctuation,
199
+                 '+2:postag': sent[i+2][1]
200
+             })
201
+         if i + 3 < len(sent):
202
+             features.update({
203
+                 '+3:word.lower()': sent[i+3][0].lower(),
204
+                 '+3:word.isdigit()': sent[i+3][0].isdigit(),
205
+                 '+3:word.punct()': sent[i+3][0] in string.punctuation,
206
+                 '+3:postag': sent[i+3][1]
207
+             })
208
+         if i + 4 < len(sent):
209
+             features.update({
210
+                 '+4:word.lower()': sent[i+4][0].lower(),
211
+                 '+4:word.isdigit()': sent[i+4][0].isdigit(),
212
+                 '+4:word.punct()': sent[i+4][0] in string.punctuation,
213
+                 '+4:postag': sent[i+4][1]
214
+             })
215
+     else:
216
+         features['EOS'] = True
217
+
218
+     return features
219
+ ```
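As the heading notes, the repeated neighbour blocks in `word2features` can be collapsed into a loop. The following is a sketch of that idea only, covering the context features plus `BOS`/`EOS`; the helper name `context_features` and the `window` parameter are assumptions, and the feature keys are built to match the hand-written ones (`-1:postag`, `+2:word.lower()`, and so on).

```python
import string


def context_features(sent, i, window=4):
    """Build neighbour features for sent[i] by looping over offsets -window..+window."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        # Skip the current word itself and positions outside the sentence.
        if offset == 0 or not 0 <= j < len(sent):
            continue
        word, postag = sent[j]
        prefix = f"{offset:+d}"  # formats as "-1", "+1", "+2", ...
        features.update({
            f"{prefix}:word.lower()": word.lower(),
            f"{prefix}:word.isdigit()": word.isdigit(),
            f"{prefix}:word.punct()": word in string.punctuation,
            f"{prefix}:postag": postag,
        })
    if i == 0:
        features["BOS"] = True  # beginning of sentence
    if i == len(sent) - 1:
        features["EOS"] = True  # end of sentence
    return features


# Hypothetical sentence of (token, pos_tag) pairs.
sent = [("Acme", "NNP"), ("violated", "VBD"), ("the", "DT"), ("Act", "NNP"), (".", ".")]
print(sorted(context_features(sent, 2)))
```

The per-word features (`bias`, lemma, stem, and the rest) would still be computed as in `word2features` and merged with this dictionary. Any model trained on the original keys depends on the key strings staying identical.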
220
+ ### Obtain features for a given sentence
221
+ ```
222
+ def sent2features(sent):
223
+     '''
224
+     This method is used to extract features from the sentence.
225
+     '''
226
+     return [word2features(sent, i) for i in range(len(sent))]
227
+ ```
228
+ ### Load file from your directory
229
+ ```
230
+ df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")
231
+ ```
232
+ ### Evaluate data type and create pos_tags for each token
233
+ ```
234
+ df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
235
+ df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1]
236
+                                                for tag in pos_tag(x)])
237
+ ```
238
+ ### Aggregate tokens to sentences
239
+ ```
240
+ data_eval = []
241
+ for i in range(len(df_eval)):
242
+     for j in range(len(df_eval["tokens"][i])):
243
+         data_eval.append(
244
+             {
245
+                 "sentence_num": i+1,
246
+                 "id": df_eval["id"][i],
247
+                 "token": df_eval["tokens"][i][j],
248
+                 "pos_tag": df_eval["pos_tags"][i][j],
249
+             }
250
+         )
251
+ data_eval = pd.DataFrame(data_eval)
252
+ getter = getsentence(data_eval)
253
+ sentences_eval = getter.sentences
254
+ X_eval = [sent2features(s) for s in sentences_eval]
255
+ ```
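Since the grouping class is described above as redundant when text is processed directly, the same list of `(token, pos_tag)` sentences can be built without the intermediate long-format DataFrame. This is a sketch under that assumption; `build_sentences` is a hypothetical helper, not part of the released script.

```python
def build_sentences(token_lists, pos_tag_lists):
    """Pair each sentence's tokens with its POS tags, one list of tuples per sentence."""
    sentences = []
    for tokens, tags in zip(token_lists, pos_tag_lists):
        if len(tokens) != len(tags):
            raise ValueError("tokens and POS tags must have the same length")
        sentences.append(list(zip(tokens, tags)))
    return sentences


# Hypothetical input: two tokenized sentences with their POS tags.
token_lists = [["Acme", "sued"], ["Stop"]]
pos_tag_lists = [["NNP", "VBD"], ["VB"]]
print(build_sentences(token_lists, pos_tag_lists))
```

With the DataFrame from the steps above, the call would presumably be `build_sentences(df_eval["tokens"], df_eval["pos_tags"])`. One difference to be aware of: a row with an empty token list yields an empty sentence here, whereas it produces no rows, and therefore no sentence, in the grouped version.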
256
+ ### Load model from your directory
257
+ ```
258
+ crf = joblib.load("../models/crf.pkl")
259
+ y_pred_eval = crf.predict(X_eval)
260
+ print("NER tags predicted.")
261
+ df_eval["ner_tags"] = y_pred_eval
262
+ df_eval.drop(columns=["pos_tags"], inplace=True)
263
+ print("Saving the predictions...")
264
+ df_eval.to_csv("predictions_NERLens.csv", index=False)
265
+ print("Predictions saved.")
266
+ ```
267
+
268
+ ## Training Details
269
+
270
+ ### Training Data
271
+
272
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
273
+
274
+ [https://huggingface.co/datasets/darrow-ai/LegalLensNER]
275
+
276
+ ### Training Procedure
277
+
278
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
279
+ The token column was first parsed into Python lists, and POS tags were then created for each token in the text. With handcrafted features,
280
+ the model was trained on a CPU. Training time is around 20-30 minutes for this dataset.
281
+ #### Preprocessing [optional]
282
+ POS tags were assigned to every token using the NLTK library.
283
+
284
+
285
+ #### Training Hyperparameters
286
+
287
+ - **Training regime:** NA <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
288
+
289
+ #### Speeds, Sizes, Times [optional]
290
+
291
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
292
+ NA
293
+
294
+ ## Evaluation
295
+
296
+ <!-- This section describes the evaluation protocols and provides the results. -->
297
+ The model was evaluated using the macro-F1 score. A score of 0.32 was obtained on unseen test data.
298
+
299
+ ### Testing Data, Factors & Metrics
300
+
301
+ #### Testing Data
302
+
303
+ <!-- This should link to a Dataset Card if possible. -->
304
+
305
+ [https://huggingface.co/datasets/darrow-ai/LegalLensNER]
306
+
307
+ #### Factors
308
+
309
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
310
+
311
+ [More Information Needed]
312
+
313
+ #### Metrics
314
+
315
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
316
+
317
+ Macro-F1 score, because it weights every entity class equally and so mitigates the performance boost that highly skewed entities in the dataset would otherwise create.
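The effect of class skew can be illustrated with a small token-level computation. This is a simplified sketch, not the competition's scoring script: the label names and toy data are made up, and scoring is per token rather than per entity span.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one label, computed from token-level counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


# Toy data: 8 of 10 tokens are "O", and the model predicts "O" everywhere.
y_true = ["O"] * 8 + ["LAW"] * 2
y_pred = ["O"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("macro-F1:", round(macro_f1(y_true, y_pred, ["O", "LAW"]), 3))
```

A model that never predicts the rare class still reaches high accuracy on such data, while its macro-F1 is pulled down by the zero score on that class.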
318
+
319
+ ### Results
320
+
321
+ 0.32 macro-F1 score on unseen data.
322
+
323
+ #### Summary
324
+
325
+ The model was designed and developed to tackle the NER task in unstructured text.
326
+
327
+ ## Model Examination [optional]
328
+
329
+ <!-- Relevant interpretability work for the model goes here -->
330
+ NA
331
+
332
+ ## Environmental Impact
333
+
334
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
335
+
336
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
337
+
338
+ - **Hardware Type:** 13th Gen Intel(R) Core(TM) i7-1365U
339
+ - **Hours used:** 0.5 hours
340
+ - **Cloud Provider:** NA
341
+ - **Compute Region:** NA
342
+ - **Carbon Emitted:** Unknown
343
+
344
+ ## Technical Specifications [optional]
345
+
346
+ ### Model Architecture and Objective
347
+
348
+ [More Information Needed]
349
+
350
+ ### Compute Infrastructure
351
+
352
+ [More Information Needed]
353
+
354
+ #### Hardware
355
+
356
+ [More Information Needed]
357
+
358
+ #### Software
359
+
360
+ [More Information Needed]
361
+
362
+ ## Citation [optional]
363
+
364
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
365
+
366
+ **BibTeX:**
367
+
368
+ [More Information Needed]
369
+
370
+ **APA:**
371
+
372
+ [More Information Needed]
373
+
374
+ ## Glossary [optional]
375
+
376
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
377
+
378
+ [More Information Needed]
379
+
380
+ ## More Information [optional]
381
+
382
+ [More Information Needed]
383
+
384
+ ## Model Card Authors [optional]
385
+
386
+ [More Information Needed]
387
+
388
+ ## Model Card Contact
389
+
390
+ [More Information Needed]