How to handle the 384-word limit?

#2
by rjalexa - opened

How best to chunk longer texts? Thanks

One way: chunk the text into pieces shorter than 384 characters, extract entities from each chunk individually, and append the results.
See my code below:

from gliner import GLiNER

# -------------------------------------
# INPUT

# CHANGE: Example long text from which to extract entities
text = "e.g. this is a very long text of over 384 characters from which I want to extract the found entities: John is in his house at 22 Street name."
text = text * 100

# CHANGE: Labels for the named entity recognition (NER) model
labels = ["person", "location", "address"]

# CHANGE: Threshold for NER model confidence
ner_threshold = 0.5

# CHANGE: Load the pre-trained GLiNER model
model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")

# -------------------------------------
# Initialize a list to store all extracted entities
all_entities = []

# -------------------------------------
# Function to chunk text into smaller segments
def chunk_text(text, max_length=384):
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

# -------------------------------------
# Check if the text needs to be chunked
if len(text) > 384:
    chunks = chunk_text(text)
    print("Number of chunks:", len(chunks))
else:
    chunks = [text]

# Predict entities for each chunk of text
for chunk in chunks:
    entities = model.predict_entities(chunk, labels, threshold=ner_threshold)
    all_entities.extend(entities)

# -------------------------------------
# Output all found entities
print(all_entities)
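Two caveats with the character-based chunking above: a boundary can fall in the middle of a word and cut an entity in half, and the start/end offsets returned for each chunk are relative to that chunk, not to the full text. Below is a minimal sketch (my own variant, not something built into GLiNER) that breaks on whitespace instead and shifts the offsets back onto the original string. It assumes predict_entities returns dicts with "start" and "end" keys, as current GLiNER releases do; adjust the key names if your version differs.

# Sketch: chunk on word boundaries and map entity offsets back to the full text.
# Reuses `text`, `model`, `labels` and `ner_threshold` from the script above.

def chunk_text_by_words(text, max_length=384):
    # Greedily cut chunks of at most max_length characters, backing up to the
    # last space so a word (and a possible entity) is not split in half.
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append((text[start:end], start))  # keep the chunk's absolute start offset
        start = end
    return chunks

all_entities = []
for chunk, offset in chunk_text_by_words(text):
    for entity in model.predict_entities(chunk, labels, threshold=ner_threshold):
        # Shift chunk-relative positions so they index into the original text
        entity["start"] += offset
        entity["end"] += offset
        all_entities.append(entity)

print(all_entities)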

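If a multi-word entity (e.g. the street address) can still straddle a chunk boundary, a common workaround is to let consecutive chunks overlap and drop duplicate predictions afterwards. The sketch below is again my own suggestion rather than anything GLiNER provides, the 50-character overlap is an arbitrary value to tune, and it makes the same assumption about "start"/"end"/"label" keys as above.

# Sketch: overlapping character chunks plus simple de-duplication.
# Reuses `text`, `model`, `labels` and `ner_threshold` from the script above.

def chunk_with_overlap(text, max_length=384, overlap=50):
    # Consecutive chunks share `overlap` characters so an entity near a boundary
    # appears in full in at least one chunk.
    step = max_length - overlap
    return [(text[start:start + max_length], start) for start in range(0, len(text), step)]

seen = set()
all_entities = []
for chunk, offset in chunk_with_overlap(text):
    for entity in model.predict_entities(chunk, labels, threshold=ner_threshold):
        key = (entity["start"] + offset, entity["end"] + offset, entity["label"])
        if key in seen:
            continue  # same span already reported from the overlapping region
        seen.add(key)
        entity["start"], entity["end"] = key[0], key[1]  # store absolute offsets
        all_entities.append(entity)

print(all_entities)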