Usage in Python
Dear Allen AI,
I am trying to use SciBERT (scivocab) as a pre-trained model for some topic modelling on scientific papers.
Unfortunately, I cannot download scivocab using SentenceTransformers, and transformers.pipelines won't work either, since there is no specified pipeline type.
How do you suggest using it in Python?
Start here: https://github.com/allenai/scibert; there is a .sh script there to get you started. Something like...
from transformers import AutoTokenizer, AutoModel, pipeline

# A default pipeline (not specific to SciBERT)
classifier = pipeline("sentiment-analysis")

# Load the SciBERT tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
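Once the model is loaded, you can get document embeddings out of it directly. A minimal sketch, mean-pooling the last hidden state (the pooling strategy is my own choice here, not something the SciBERT repo prescribes):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

sentences = ["The transformer architecture relies on self-attention."]

# Tokenize with padding/truncation so a batch of sentences lines up
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, masking out padding tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(embeddings.shape)  # (1, 768) for SciBERT-base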
Thanks. It also works if I skip the pipeline, i.e.:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
embed_model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
However, I am unsure if BERTopic is actually using it, or just falling back to its default model. When I run

topic_model = BERTopic(embedding_model=embed_model, language="english", nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(docs)
the verbose output is:
loading configuration file .cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\config.json
Model config BertConfig {
"_name_or_path": ".cache\\torch\\sentence_transformers\\sentence-transformers_all-MiniLM-L6-v2\\", "architectures": [
"BertModel"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 384,
"initializer_range": 0.02,
"intermediate_size": 1536,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.22.2",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
loading weights file .cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\pytorch_model.bin
All model checkpoint weights were used when initializing BertModel.
All the weights of BertModel were initialized from the model checkpoint at C:\Users\oskar/.cache\torch\sentence_transformers\sentence-transformers_all-MiniLM-L6-v2\.
This might of course be an issue in the BERTopic package.
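For what it's worth, the log above shows BERTopic loading sentence-transformers_all-MiniLM-L6-v2, which is its default embedding model, so it does look like the embed_model argument was ignored. One possible workaround (my assumption, not something confirmed in this thread) is to wrap SciBERT as a SentenceTransformer, which BERTopic accepts as an embedding_model:

from sentence_transformers import SentenceTransformer, models
from bertopic import BERTopic

# Build a SentenceTransformer from the Hugging Face checkpoint:
# a transformer module followed by a mean-pooling layer
word_embedding_model = models.Transformer('allenai/scibert_scivocab_uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
scibert = SentenceTransformer(modules=[word_embedding_model, pooling_model])

topic_model = BERTopic(embedding_model=scibert, language="english",
                       nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(docs)

Alternatively, you can compute the embeddings yourself and pass them in via topic_model.fit_transform(docs, embeddings), which sidesteps BERTopic's backend detection entirely.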
I have downloaded the recommended TensorFlow model from the GitHub README, but I am not sure how to use that model in Python. I am new to using BERT and its derivatives, so any help would be appreciated.
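In case it helps, you may not need the downloaded TensorFlow checkpoint at all: the same weights are hosted on the Hugging Face Hub, and transformers can hand them to TensorFlow directly. A minimal sketch, assuming you want to stay in TensorFlow (from_pt=True converts the hosted PyTorch weights rather than using the original TF checkpoint from the README):

from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
# from_pt=True converts the PyTorch weights on the Hub to TensorFlow
model = TFAutoModel.from_pretrained('allenai/scibert_scivocab_uncased', from_pt=True)

inputs = tokenizer("A test sentence about proteins.", return_tensors="tf")
outputs = model(inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)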