shashankmc
/

crf_ner_violations_legallens

+---
+license: apache-2.0
+datasets:
+- darrow-ai/LegalLensNER
+language:
+- en
+metrics:
+- f1
+pipeline_tag: token-classification
+library_name: sklearn
+tags:
+- ner
+- legal
+- crf
+---
+# Model Card for crf_ner_violations_legallens
+<!-- Provide a quick summary of what the model is/does. -->
+Conditional Random Field (CRF) model for named entity recognition with handcrafted features. It recognizes three entity types: Violation-on, Violation-by, and Law.
+The dataset is in BIO format. The model achieves a macro-F1 score of 0.32 on unseen test data.
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+The model was developed for the LegalLens 2024 competition, held as part of the Natural Legal Language Processing (NLLP) 2024 workshop. It uses handcrafted features to identify named
+entities tagged in the BIO format.
+- **Developed by:** Shashank M Chakravarthy
+- **Funded by [optional]:** NA
+- **Shared by [optional]:** NA
+- **Model type:** Statistical Model
+- **Language(s) (NLP):** English
+- **License:** Apache 2.0 License
+- **Finetuned from model [optional]:** NA
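As an illustration of the BIO format mentioned above, the following minimal sketch groups BIO-tagged tokens into entity spans. The tag strings (`B-LAW`, `I-VIOLATED BY`, and so on) and the example sentence are assumptions made for illustration; check the dataset card for the exact label names.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) spans.

    An I- tag that does not continue an open span of the same label
    is ignored, which is one of several possible conventions.
    """
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag without a matching open span.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence and label names.
tokens = ["Acme", "Corp", "violated", "the", "Clean", "Air", "Act"]
tags = ["B-VIOLATED BY", "I-VIOLATED BY", "O", "O", "B-LAW", "I-LAW", "I-LAW"]
print(bio_to_spans(tokens, tags))
# [('VIOLATED BY', 'Acme Corp'), ('LAW', 'Clean Air Act')]
```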
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** NA
+- **Paper [optional]:** [https://aclanthology.org/2024.nllp-1.33.pdf]
+- **Demo [optional]:** NA
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+The model detects named entities in unstructured text. It can be extended to other entity types by modifying the handcrafted features.
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+The model can be applied directly to any unstructured text after light preprocessing (tokenization and POS tagging). The repository files include the evaluation script.
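Preprocessing here means turning raw text into the `(token, POS tag)` pairs that the feature functions in the getting-started code consume. The sketch below is one hedged way to do that. The helper name `text_to_sentence` and the regex tokenizer are assumptions, not part of the released code, and the training data was already tokenized, so a different tokenizer may shift results.

```python
import re


def text_to_sentence(text, tagger):
    """Split raw text into tokens and attach a POS tag to each one.

    `tagger` is any callable that maps a list of tokens to a list of
    (token, tag) pairs, for example `nltk.pos_tag`.
    """
    # Words (letters, digits, underscore) or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return tagger(tokens)


# Stand-in tagger so the sketch runs without NLTK data; it labels
# every token "X" and is NOT suitable for real predictions.
def dummy_tagger(tokens):
    return [(token, "X") for token in tokens]


sentence = text_to_sentence("Acme Corp. broke the law!", dummy_tagger)
print(sentence)
```

With NLTK installed and its tagger data downloaded, passing `nltk.pos_tag` as `tagger` yields a sentence that can be handed to `sent2features` from the getting-started section.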
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+The model's features are handcrafted for detecting violations and laws in text. It may also be used on other legal text that contains similar entities.
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+The main limitation is the reliance on handcrafted features.
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+If the input text is not preprocessed correctly, in particular if POS tags are missing, the model will not perform as designed.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+### Load libraries
+```
+import ast
+import pandas as pd
+import joblib
+import nltk
+from nltk import pos_tag
+import string
+from nltk.stem import WordNetLemmatizer
+from nltk.stem import PorterStemmer
+```
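+### Download the required NLTK resources (already-installed ones are skipped)
+```
+nltk.download('wordnet')
+nltk.download('omw-1.4')
+nltk.download("averaged_perceptron_tagger")
+# Assumption: newer NLTK releases (3.9 and later) look up the tagger under this name.
+nltk.download("averaged_perceptron_tagger_eng")
+```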
+### Class for grouping tokens as sentences (redundant if text processed directly)
+```
+class getsentence(object):
+    '''
+    This class is used to get the sentences from the dataset.
+    Converts from BIO format to sentences using their sentence numbers
+    '''
+    def __init__(self, data):
+        self.n_sent = 1.0
+        self.data = data
+        self.empty = False
+        self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
+        self.sentences = [s for s in self.grouped]
+    def _agg_func(self, s):
+        return [(w, p) for w, p in zip(s["token"].values.tolist(),
+                                       s["pos_tag"].values.tolist())]
+```
+### Create features for each word in a sentence (the code could be shortened with a loop)
+```
+def word2features(sent, i):
+    '''
+    This method is used to extract features from the words in the sentence.
+    The main features extracted are:
+    - word.lower(): The word in lowercase
+    - word.isdigit(): If the word is a digit
+    - word.punct(): If the word is a punctuation
+    - postag: The pos tag of the word
+    - word.lemma(): The lemma of the word
+    - word.stem(): The stem of the word
+    A subset of these features (lowercase form, digit flag, punctuation flag, POS tag)
+    is also extracted for up to 4 previous and 4 next words.
+    '''
+    wordnet_lemmatizer = WordNetLemmatizer()
+    porter_stemmer = PorterStemmer()
+    word = sent[i][0]
+    postag = sent[i][1]
+    features = {
+        'bias': 1.0,
+        'word.lower()': word.lower(),
+        'word.isdigit()': word.isdigit(),
+        # Check whether the token is punctuation
+        'word.punct()': word in string.punctuation,
+        'postag': postag,
+        # Lemma of the word
+        'word.lemma()': wordnet_lemmatizer.lemmatize(word),
+        # Stem of the word
+        'word.stem()': porter_stemmer.stem(word)
+    }
+    if i > 0:
+        word1 = sent[i-1][0]
+        postag1 = sent[i-1][1]
+        features.update({
+            '-1:word.lower()': word1.lower(),
+            '-1:word.isdigit()': word1.isdigit(),
+            '-1:word.punct()': word1 in string.punctuation,
+            '-1:postag': postag1
+        })
+        if i - 2 >= 0:
+            features.update({
+                '-2:word.lower()': sent[i-2][0].lower(),
+                '-2:word.isdigit()': sent[i-2][0].isdigit(),
+                '-2:word.punct()': sent[i-2][0] in string.punctuation,
+                '-2:postag': sent[i-2][1]
+            })
+        if i - 3 >= 0:
+            features.update({
+                '-3:word.lower()': sent[i-3][0].lower(),
+                '-3:word.isdigit()': sent[i-3][0].isdigit(),
+                '-3:word.punct()': sent[i-3][0] in string.punctuation,
+                '-3:postag': sent[i-3][1]
+            })
+        if i - 4 >= 0:
+            features.update({
+                '-4:word.lower()': sent[i-4][0].lower(),
+                '-4:word.isdigit()': sent[i-4][0].isdigit(),
+                '-4:word.punct()': sent[i-4][0] in string.punctuation,
+                '-4:postag': sent[i-4][1]
+            })
+    else:
+        features['BOS'] = True
+    if i < len(sent)-1:
+        word1 = sent[i+1][0]
+        postag1 = sent[i+1][1]
+        features.update({
+            '+1:word.lower()': word1.lower(),
+            '+1:word.isdigit()': word1.isdigit(),
+            '+1:word.punct()': word1 in string.punctuation,
+            '+1:postag': postag1
+        })
+        if i + 2 < len(sent):
+            features.update({
+                '+2:word.lower()': sent[i+2][0].lower(),
+                '+2:word.isdigit()': sent[i+2][0].isdigit(),
+                '+2:word.punct()': sent[i+2][0] in string.punctuation,
+                '+2:postag': sent[i+2][1]
+            })
+        if i + 3 < len(sent):
+            features.update({
+                '+3:word.lower()': sent[i+3][0].lower(),
+                '+3:word.isdigit()': sent[i+3][0].isdigit(),
+                '+3:word.punct()': sent[i+3][0] in string.punctuation,
+                '+3:postag': sent[i+3][1]
+            })
+        if i + 4 < len(sent):
+            features.update({
+                '+4:word.lower()': sent[i+4][0].lower(),
+                '+4:word.isdigit()': sent[i+4][0].isdigit(),
+                '+4:word.punct()': sent[i+4][0] in string.punctuation,
+                '+4:postag': sent[i+4][1]
+            })
+    else:
+        features['EOS'] = True
+    return features
+```
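The heading above notes that the feature code could be shortened with iteration. A minimal sketch of that idea follows: it builds the same context keys (`-1:word.lower()`, `+4:postag`, `BOS`, `EOS`, and so on) with a loop over offsets. The helper names `token_shape_features` and `context_features` are hypothetical, and this is not the code the released model was trained with.

```python
import string


def token_shape_features(word, postag, prefix=""):
    """Features shared by the neighbouring tokens, keyed with an offset prefix."""
    return {
        f"{prefix}word.lower()": word.lower(),
        f"{prefix}word.isdigit()": word.isdigit(),
        f"{prefix}word.punct()": word in string.punctuation,
        f"{prefix}postag": postag,
    }


def context_features(sent, i, window=4):
    """Collect features for up to `window` tokens on each side of position i.

    `sent` is a list of (token, pos_tag) pairs.
    """
    features = {}
    for offset in range(1, window + 1):
        if i - offset >= 0:
            word, postag = sent[i - offset]
            features.update(token_shape_features(word, postag, f"-{offset}:"))
        if i + offset < len(sent):
            word, postag = sent[i + offset]
            features.update(token_shape_features(word, postag, f"+{offset}:"))
    if i == 0:
        features["BOS"] = True  # beginning of sentence
    if i == len(sent) - 1:
        features["EOS"] = True  # end of sentence
    return features


sent = [("The", "DT"), ("Act", "NNP"), (".", ".")]
print(context_features(sent, 0))
```

To rebuild the full feature dictionary, the current-token features (`bias`, lemma, stem, and the rest) would be created as in `word2features` and then updated with the result of `context_features(sent, i)`.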
+### Obtain features for a given sentence
+```
+def sent2features(sent):
+    '''
+    This method is used to extract features from the sentence.
+    '''
+    return [word2features(sent, i) for i in range(len(sent))]
+```
+### Load file from your directory
+```
+# Reading .xlsx files with pandas requires the openpyxl package.
+df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")
+```
+### Parse the token lists and create POS tags for each token
+```
+df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
+df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1]
+                                                         for tag in pos_tag(x)])
+```
+### Aggregate tokens to sentences
+```
+data_eval = []
+for i in range(len(df_eval)):
+    for j in range(len(df_eval["tokens"][i])):
+        data_eval.append(
+            {
+                "sentence_num": i+1,
+                "id": df_eval["id"][i],
+                "token": df_eval["tokens"][i][j],
+                "pos_tag": df_eval["pos_tags"][i][j],
+            }
+        )
+data_eval = pd.DataFrame(data_eval)
+getter = getsentence(data_eval)
+sentences_eval = getter.sentences
+X_eval = [sent2features(s) for s in sentences_eval]
+```
+### Load the model from your directory, predict, and save the results
+```
+crf = joblib.load("../models/crf.pkl")
+y_pred_eval = crf.predict(X_eval)
+print("NER tags predicted.")
+df_eval["ner_tags"] = y_pred_eval
+df_eval.drop(columns=["pos_tags"], inplace=True)
+print("Saving the predictions...")
+df_eval.to_csv("predictions_NERLens.csv", index=False)
+print("Predictions saved.")
+```
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[https://huggingface.co/datasets/darrow-ai/LegalLensNER]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+The token column was first parsed into Python lists, and POS tags were then created for each token in the text. Using the handcrafted features,
+the model was trained on a CPU. Training takes around 20-30 minutes on this dataset.
+#### Preprocessing [optional]
+For every token, POS tags were assigned using the NLTK library.
+#### Training Hyperparameters
+- **Training regime:** NA <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+NA
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+The model was evaluated using the macro-F1 score. A score of 0.32 was obtained on unseen test data.
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[https://huggingface.co/datasets/darrow-ai/LegalLensNER]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+Macro-F1 score, because it weights every class equally and therefore mitigates the performance boost that the highly skewed entity distribution in the dataset would otherwise create.
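To make the effect of macro averaging concrete, here is a small token-level sketch in plain Python. It is an illustration only: the card does not state whether the reported score was computed per token or per entity, and the label names and toy predictions below are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores over flat label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy case: a predictor that always outputs "O" on a skewed sequence.
y_true = ["O", "O", "O", "O", "B-LAW", "I-LAW"]
y_pred = ["O", "O", "O", "O", "O", "O"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))                  # 0.667
print(round(macro_f1(y_true, y_pred), 3))  # 0.267
```

The always-"O" predictor scores well on accuracy because "O" dominates, while its macro-F1 stays low because the two entity labels each contribute an F1 of 0.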
+### Results
+0.32 macro-F1 score on unseen data.
+#### Summary
+The model was designed and developed to tackle the NER task in unstructured text.
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+NA
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** 13th Gen Intel(R) Core(TM) i7-1365U
+- **Hours used:** 0.5 hours
+- **Cloud Provider:** NA
+- **Compute Region:** NA
+- **Carbon Emitted:** Unknown
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]