nomic-ai embedding fine-tuning with SentenceTransformersFinetuneEngine
Hi,
I'm trying to fine-tune the nomic-ai embedding model using SentenceTransformersFinetuneEngine and am running into an issue:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,  # Dataset to be trained on
    model_id="nomic-ai/nomic-embed-text-v1.5",  # Hugging Face reference to the base embeddings model
    model_output_path="llama_model_v1",  # Output directory for the fine-tuned embeddings model
    val_dataset=test_dataset,  # Dataset to validate on
    epochs=2,  # Number of epochs to train for
)
I would reach out to the SentenceTransformers package, as I don't have as deep a knowledge of what's going on there.
Hello!
I'm afraid that this is not currently conveniently possible, because this SentenceTransformer instance must be initialized here with trust_remote_code=True, as the model must pull code from Hugging Face. I would recommend opening an issue in LlamaIndex for it.
That said, I think you should be able to solve your problem. You can first download the model to a local directory. Then, you can download these two files and also place them in that directory (a rough sketch of the download step follows below):
- https://huggingface.co/nomic-ai/nomic-embed-text-v1/blob/main/modeling_hf_nomic_bert.py
- https://huggingface.co/nomic-ai/nomic-embed-text-v1/blob/main/configuration_hf_nomic_bert.py
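For illustration, here is one way that download step could look using huggingface_hub; this is only a sketch, and the local directory name nomic-embed-local is just an example I'm using here:

from huggingface_hub import snapshot_download, hf_hub_download

# Download the full nomic-embed-text-v1.5 repository to a local directory
local_dir = snapshot_download(
    repo_id="nomic-ai/nomic-embed-text-v1.5",
    local_dir="nomic-embed-local",
)

# Fetch the two custom-code files from the v1 repo and place them next to the weights
for filename in ("modeling_hf_nomic_bert.py", "configuration_hf_nomic_bert.py"):
    hf_hub_download(
        repo_id="nomic-ai/nomic-embed-text-v1",
        filename=filename,
        local_dir=local_dir,
    )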
Then, you must update your local config.json to no longer say:
"auto_map": {
"AutoConfig": "nomic-ai/nomic-embed-text-v1--configuration_hf_nomic_bert.NomicBertConfig",
"AutoModel": "nomic-ai/nomic-embed-text-v1--modeling_hf_nomic_bert.NomicBertModel",
"AutoModelForMaskedLM": "nomic-ai/nomic-bert-2048--modeling_hf_nomic_bert.NomicBertForPreTraining"
},
but instead to say:
"auto_map": {
"AutoConfig": "configuration_hf_nomic_bert.NomicBertConfig",
"AutoModel": "modeling_hf_nomic_bert.NomicBertModel",
},
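If you'd rather make that edit programmatically, a small sketch (assuming the nomic-embed-local directory from the earlier example) could look like:

import json
from pathlib import Path

config_path = Path("nomic-embed-local") / "config.json"
config = json.loads(config_path.read_text())

# Point the auto_map entries at the local files instead of the Hub repositories
config["auto_map"] = {
    "AutoConfig": "configuration_hf_nomic_bert.NomicBertConfig",
    "AutoModel": "modeling_hf_nomic_bert.NomicBertModel",
}
config_path.write_text(json.dumps(config, indent=2))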
Now these files are local, and we don't need to download them from Hugging Face. As a result, you should now be able to initialize the SentenceTransformersFinetuneEngine with the path to your local directory. It should then no longer complain about the lack of trust_remote_code=True.
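One way to sanity-check this before going through LlamaIndex is to load the local copy directly with SentenceTransformers (again assuming the example directory name from above):

from sentence_transformers import SentenceTransformer

# If the auto_map edit worked, this should load without trust_remote_code=True
model = SentenceTransformer("nomic-embed-local")
print(model.encode(["hello world"]).shape)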
- Tom Aarsen
thank you tom!
do i need just the model tensors and config.json or would i need to clone the entire repo?
You should probably just clone the entire repo
thank you!
Also, how do I use the model with SentenceTransformersFinetuneEngine? Because there is only a model_id parameter in SentenceTransformersFinetuneEngine, there is no way to pass the actual model.
Would you recommend cloning the repo, making the changes, and uploading the model to Hugging Face? If so, would I need to make any other changes to the files?
how do I use the model with SentenceTransformersFinetuneEngine?
model_id can also be a path to a local model; you should use that instead.
And no, I wouldn't upload it to Hugging Face for this, because then it still has to pull code from Hugging Face and it'll still need trust_remote_code=True.
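For reference, a minimal sketch of what that call could look like with a local path (the directory name nomic-embed-local is just the example used above):

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="nomic-embed-local",  # local directory instead of a Hub model id
    model_output_path="llama_model_v1",
    val_dataset=test_dataset,
    epochs=2,
)
finetune_engine.finetune()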
Hi @tomaarsen, is there anything else I can do to solve my issue?