Is there a way for the model to show more than 4 results?
I've either been having trouble getting the model to recognise a large number of tokens, or I'm not running the model correctly on a long string. I have never seen the model categorise more than four tokens in any of my runs, even when the text contains more than four. On one large string, the model returned "Denmark" three times and one name, whilst missing another name it had classified previously in a different string.
Any help with this would be appreciated!
Hi @fmpapso! How many tokens are we talking about here? The model can only handle 512 tokens at a time.
If you do have less than 512 tokens, could you send a few failure mode examples?
Hi, thanks for taking the time to help me out!
I'm not sure whether the data I'm testing on is sensitive, but the issue may stem from the length of the data I am using. There may be thousands, tens of thousands, or even hundreds of thousands of characters, because I am concatenating together the messages that I want to get the NER output from.
I am a fairly junior developer, and I realise now what was most likely happening: with, say, 5 messages, the model may have only been reading part of the first one and disregarding everything after it. It was coincidental that different message streams ended up with 4 results when the text was cut off; better testing on my part has shown the model itself is fine!
I haven't seen any results more than 1516 characters into the text, so maybe the limit is 2048 characters?
I imagine I could segment the messages/text into snippets of at most 2048 characters, run the model on each segment, and then finally put it all back together.
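A minimal sketch of that chunking idea, in plain Python. The `max_len=2048` default is just the guess from this thread (checking against the tokenizer's 512-token limit is safer), and `chunk_text` is a hypothetical helper, not part of any library:

```python
def chunk_text(text, max_len=2048):
    """Cut a long string into chunks of at most max_len characters,
    preferring to break at whitespace so words stay intact."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_len, len(text))
        if end < len(text):
            # Prefer to break at the last space inside the window.
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end])
        # Skip the space we broke on, if any.
        start = end + 1 if end < len(text) and text[end] == " " else end
    return chunks

print(chunk_text("hello world foo", max_len=7))  # ['hello', 'world', 'foo']
```

Each chunk could then be fed to the model separately and the results concatenated afterwards.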
Yeah, I would suggest you split up the data beforehand.
You can tokenize the data to double-check that each piece is within 512 tokens (load the tokenizer with `AutoTokenizer.from_pretrained('saattrupdan/nbailab-base-ner-scandi')`).
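For instance, a small check along these lines, using the tokenizer from this thread (the `messages` list here is just placeholder Danish text):

```python
from transformers import AutoTokenizer

MAX_TOKENS = 512

def over_limit(tokenizer, texts, max_tokens=MAX_TOKENS):
    """Return the texts whose tokenized length exceeds max_tokens."""
    return [t for t in texts if len(tokenizer(t)["input_ids"]) > max_tokens]

tokenizer = AutoTokenizer.from_pretrained("saattrupdan/nbailab-base-ner-scandi")
messages = ["Dette er en besked.", "En anden besked her."]
print(over_limit(tokenizer, messages))  # short texts should pass: []
```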
To split the text naturally, rather than simply chopping it into smaller bits, I would recommend the NLTK punkt sentence splitter (the `sent_tokenize` function; remember to specify the language here).
Closing here, but feel free to open another issue if it turns out that splitting up the data didn't help :)