IMPORTANT! READ THIS!

Model description

This model recognizes scientific terms in a given text. The best way to use it is as follows:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from nltk.tokenize import word_tokenize
import torch
import spacy

# Optional: use spaCy to find named entities and remove them from the text
# before classification (the model usually predicts them as scientific).
nlp = spacy.load("en_core_web_sm")
# doc = nlp(text)
# names = [ent.text for ent in doc.ents]

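# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("JonyC/scibert-science-word-classifier")
model = AutoModelForTokenClassification.from_pretrained("JonyC/scibert-science-word-classifier").to(device)
model.eval()  # inference mode: disables dropout

# Set max_len as needed.
def classify_term(term, max_len=12):
    term = term.lower()
    tokens = tokenizer(term, return_tensors="pt", truncation=True, padding=True, max_length=max_len).to(device)
    with torch.no_grad():  # gradients are not needed for inference
        output = model(**tokens).logits
    # Index of the largest value in the flattened logits tensor.
    pred = torch.argmax(output).item()

    return "Scientific" if pred == 1 else "Non-Scientific"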

# For single term:
print(classify_term("quantum mechanics")) 
print(classify_term("table"))
print(classify_term("photosynthesis"))

# For sentences:
words = word_tokenize("some sentence") # you can also use sentence.split()
results = []
for w in words:
    res = classify_term(w)
    results.append(res)

for w, p in zip(words, results):
    print(f"Word: {w}, Predicted Label: {p}")
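
The commented-out spaCy lines above suggest removing named entities before classification. Here is a minimal sketch of that filtering step, assuming the entity strings have already been collected (for example from `doc.ents`). The names `scientific_words` and `stub_classify` are hypothetical. The stub stands in for `classify_term` so the snippet runs without downloading the model, and it deliberately labels the entities as scientific to mimic the behavior described above.

```python
def scientific_words(words, entity_texts, classify):
    """Return the words labeled "Scientific", skipping named-entity tokens.

    `classify` is any callable mapping a word to "Scientific" or
    "Non-Scientific" (in real use, pass classify_term).
    """
    # Split multi-word entities into lowercase tokens for per-word matching.
    entity_tokens = {tok.lower() for ent in entity_texts for tok in ent.split()}
    kept = []
    for w in words:
        if w.lower() in entity_tokens:
            continue  # named entity: skip it regardless of the prediction
        if classify(w) == "Scientific":
            kept.append(w)
    return kept


# Hypothetical stand-in for classify_term (assumption, not the real model).
def stub_classify(term):
    flagged = {"alice", "mit", "qubits", "entanglement"}
    return "Scientific" if term.lower() in flagged else "Non-Scientific"


words = ["Alice", "studies", "qubits", "and", "entanglement", "at", "MIT"]
entities = ["Alice", "MIT"]  # in practice: [ent.text for ent in doc.ents]
print(scientific_words(words, entities, stub_classify))
# ['qubits', 'entanglement']
```

In real use, pass `classify_term` as the `classify` argument and build `entities` from the spaCy document.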

Example usage

Given the following text: "Quantum computing is a new field that changes how we think about solving complex problems. Unlike regular computers that use bits (which are either 0 or 1), quantum computers use qubits, which can be both 0 and 1 at the same time, thanks to a property called superposition. One important feature of quantum computers is quantum entanglement, where two qubits can be linked in such a way that changing one will instantly affect the other, no matter how far apart they are. This allows quantum computers to perform certain calculations much faster than traditional computers. For example, quantum computers could one day factor large numbers much faster, which is currently a task that takes regular computers a very long time. However, there are still challenges to overcome, like maintaining the qubits' state long enough to do calculations without errors. Scientists are working on ways to fix these errors, which is necessary for quantum computers to work on a large scale and solve real-world problems more efficiently than today's computers."

the words the model classified as scientific are:

['Quantum', 'computing', 'field', 'complex', 'quantum', 'qubits', 'property', 'superposition', 'entanglement', 'matter', 'factor', 'state', 'scale'] 
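
The list above contains both 'Quantum' and 'quantum', because every word is classified separately. If a list of distinct terms is wanted, a small post-processing step could look like the sketch below. The helper name `unique_terms` is hypothetical, and lowercasing is an assumption chosen to match the `term.lower()` call in `classify_term`.

```python
def unique_terms(words):
    """Lowercase the words and drop repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for w in words:
        key = w.lower()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


predicted = ['Quantum', 'computing', 'field', 'complex', 'quantum', 'qubits',
             'property', 'superposition', 'entanglement', 'matter', 'factor',
             'state', 'scale']
print(unique_terms(predicted))
```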

Results for scibert-science-word-classifier

This model is a fine-tuned version of allenai/scibert_scivocab_cased on the JonyC/ScienceGlossary dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1763
  • Precision: 0.9487
  • Recall: 0.9068
  • F1: 0.9273
  • Accuracy: 0.9695
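
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two figures. The short sketch below uses only the numbers stated above.

```python
precision = 0.9487
recall = 0.9068

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9273, matching the reported F1
```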

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 7e-05
  • train_batch_size: 128
  • eval_batch_size: 128
  • seed: 42
  • optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9, 0.999) and epsilon=1e-08, no additional optimizer arguments
  • lr_scheduler_type: linear
  • num_epochs: 35
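
To illustrate what the linear scheduler does with the learning rate listed above, here is a rough sketch of a linear decay from 7e-05 to zero. It mirrors the general shape of the linear schedule in transformers but is not the library's implementation. The card does not state warmup or the total number of optimizer steps, so `warmup_steps=0` and the value of `total_steps` below are assumptions chosen purely for illustration.

```python
def linear_lr(step, total_steps, base_lr=7e-5, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length: 35 epochs at an assumed 100 steps per epoch.
total_steps = 35 * 100
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

With no warmup, the rate starts at the full 7e-05, reaches half of that at the midpoint, and is zero at the final step.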