Geraldine/msmarco-distilbert-base-v4-ead
Model Details
- Model Name: Geraldine/msmarco-distilbert-base-v4-ead
- Base Model: sentence-transformers/msmarco-distilbert-base-v4
- Intended Use: This model is optimized for creating text embeddings with specific handling of XML/EAD elements.
- Architecture: DistilBERT-based sentence-transformers model, fine-tuned on MSMARCO, with a tokenizer extended to recognize XML/EAD elements.
Model Description
This model is built on top of sentence-transformers/msmarco-distilbert-base-v4 and enhanced with two key modifications:
Special Tokens for XML/EAD Elements: The tokenizer includes additional tokens to handle EAD (Encoded Archival Description) and XML elements and attributes. This allows the model to generate embeddings that capture structural metadata commonly used in archival contexts.
Dimensionality Reduction with PCA: A PCA model is applied to reduce the dimensionality of embeddings from 768 to 128. This makes the embeddings more compact while preserving essential semantic information, which is useful for downstream tasks requiring lower-dimensional representations.
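The PCA step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the procedure used to produce pca_model.joblib: the random matrix `X` stands in for real 768-dimensional embeddings, and the names `mean`, `components`, and `reduce_dim` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumption): 500 random 768-dimensional vectors
# in place of real model embeddings.
X = rng.normal(size=(500, 768))

# Fit: center the data, then keep the top 128 right-singular vectors
# of the centered matrix as the principal axes.
mean = X.mean(axis=0)
_, _, vt = np.linalg.svd(X - mean, full_matrices=False)
components = vt[:128]  # shape (128, 768)


def reduce_dim(embeddings):
    """Project 768-dimensional embeddings onto the 128 fitted axes."""
    return (np.asarray(embeddings) - mean) @ components.T


reduced = reduce_dim(X[:4])
print(reduced.shape)
```

Transforming a new embedding only needs the stored mean and the 128 axes, which is why a fitted PCA object can be saved once and reused at query time.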
Model Usage
Installation and Setup
from transformers import AutoModel, AutoTokenizer
import joblib
from huggingface_hub import hf_hub_download
# Load the embeddings model
model = AutoModel.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
# Load the PCA model
pca_path = hf_hub_download("Geraldine/msmarco-distilbert-base-v4-ead", "pca_model.joblib")
pca = joblib.load(pca_path)
Encoding Text and Reducing Dimensionality
To use the model for generating 128-dimensional embeddings, follow these steps:
# Encode text using the model and tokenizer
text = "Your EAD/XML text goes here"
inputs = tokenizer(text, return_tensors="pt")
# Keep only the first-token ([CLS]) vector so the result is 2-D: (batch, 768)
embeddings = model(**inputs).last_hidden_state[:, 0, :]
# Apply PCA to reduce dimensionality
reduced_embeddings = pca.transform(embeddings.detach().numpy())
Full example for use with LangChain or LlamaIndex
from transformers import AutoModel, AutoTokenizer, pipeline
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
# Load the embeddings model
model = AutoModel.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
# Load the PCA model
pca_path = hf_hub_download("Geraldine/msmarco-distilbert-base-v4-ead", "pca_model.joblib")
feature_extraction_pipeline = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
class HuggingFaceEmbeddingFunction:
    def __init__(self, pipeline, pca_model_path):
        self.pipeline = pipeline
        self.pca = joblib.load(pca_model_path)

    # Function for embedding documents (lists of text)
    def embed_documents(self, texts):
        # Get embeddings as numpy arrays
        embeddings = self.pipeline(texts)
        # Keep the first-token vector of each text
        embeddings = [embedding[0][0] for embedding in embeddings]
        embeddings = np.array(embeddings)
        # Transform embeddings using PCA
        reduced_embeddings = self.pca.transform(embeddings)
        return reduced_embeddings.tolist()

    # Function for embedding individual queries
    def embed_query(self, text):
        embedding = self.pipeline(text)
        embedding = np.array(embedding[0][0]).reshape(1, -1)
        # Transform embedding using PCA
        reduced_embedding = self.pca.transform(embedding)
        return reduced_embedding.flatten().tolist()
# Pass the local path returned by hf_hub_download, not a bare filename
embeddings = HuggingFaceEmbeddingFunction(feature_extraction_pipeline, pca_model_path=pca_path)
Intended Use Cases
This model is well-suited for:
- Archival Data Embeddings: Generate embeddings for texts containing EAD/XML elements, making it ideal for digital archives and library sciences.
- Semantic Search: Improve search results for content with complex metadata or hierarchical data, like archival records or digital collections.
- Information Retrieval: Use embeddings to power retrieval tasks where reducing storage and maintaining relevance in the embeddings are essential.
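For the search and retrieval use cases above, a common pattern is to rank stored document vectors by cosine similarity to a query vector. The sketch below assumes the 128-dimensional vectors have already been produced; random vectors stand in for them, and the helper name `top_k` is hypothetical.

```python
import numpy as np


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, cosine similarity) for the k documents closest to the query."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Normalize to unit length so that a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ query
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]


rng = np.random.default_rng(0)
# Stand-in data (assumption): 10 random 128-dimensional "document" vectors.
doc_vecs = rng.normal(size=(10, 128))
# A query that is a slightly perturbed copy of document 2.
query_vec = doc_vecs[2] + 0.01 * rng.normal(size=128)

results = top_k(query_vec, doc_vecs)
print(results[0][0])
```

With real data, `doc_vecs` would hold the output of `embed_documents` and `query_vec` the output of `embed_query`.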
Training Data
The base model was fine-tuned on MSMARCO data by sentence-transformers. Additional training or fine-tuning with EAD/XML-specific tokens was not required; instead, the tokenizer was adapted to recognize XML/EAD elements and attributes as distinct tokens.
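The effect of treating XML/EAD elements as distinct tokens can be illustrated with a toy tokenizer. This is not the model's WordPiece tokenizer: `toy_tokenize` and the `EAD_TOKENS` set are hypothetical, and the actual list of tokens added to the model is not given here.

```python
import re

# Hypothetical subset of EAD tags, for illustration only.
EAD_TOKENS = {"<unittitle>", "</unittitle>", "<unitdate>", "</unitdate>"}


def toy_tokenize(text, special_tokens=frozenset()):
    """Split text into words and punctuation, keeping special tokens whole."""
    base = r"\w+|[^\w\s]"
    if special_tokens:
        # Try longer tags first so a tag is never split by a shorter match.
        ordered = sorted(special_tokens, key=len, reverse=True)
        alternatives = "|".join(re.escape(tok) for tok in ordered)
        pattern = rf"{alternatives}|{base}"
    else:
        pattern = base
    return re.findall(pattern, text)


sample = "<unittitle>Letters</unittitle>"
without_special = toy_tokenize(sample)
with_special = toy_tokenize(sample, EAD_TOKENS)
print(without_special)
print(with_special)
```

Without the registered tags, each element is broken into angle brackets and word fragments; with them, each tag is a single unit. In the transformers library, new tokens are usually registered with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether this model was built exactly that way is an assumption.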
Limitations and Considerations
- Domain-Specific Tokenization: The model's tokenizer recognizes EAD/XML tokens, making it particularly useful in contexts where such elements are frequently used. However, this specialization may not be necessary for general NLP tasks.
- Dimensionality Reduction Trade-Off: PCA reduces the embedding dimensions from 768 to 128, which discards some of the information encoded in the full embeddings. The reduction aims to keep the essential semantic information, but the size of the loss is not quantified here.
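One way to estimate how much information a 128-dimensional projection keeps is the fraction of total variance captured by the top principal components. The sketch below uses synthetic data, not this model's embeddings, and the function name `retained_variance` is hypothetical.

```python
import numpy as np


def retained_variance(X, n_components):
    """Fraction of total variance captured by the top n_components principal axes."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return float(variances[:n_components].sum() / variances.sum())


rng = np.random.default_rng(0)
# Stand-in data (assumption): 768-dimensional vectors that lie mostly in a
# 64-dimensional subspace, plus a small amount of noise.
latent = rng.normal(size=(400, 64))
mixing = rng.normal(size=(64, 768))
X = latent @ mixing + 0.1 * rng.normal(size=(400, 768))

ratio = retained_variance(X, 128)
print(round(ratio, 4))
```

If the object loaded from pca_model.joblib is a scikit-learn `PCA` instance (an assumption), the same quantity should be available as the sum of its `explained_variance_ratio_` attribute.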
Evaluation
The base model has been evaluated on MSMARCO; the added tokens adapt it for use in XML/EAD contexts. No EAD-specific evaluation results are reported here, so further evaluation on EAD-specific datasets or tasks is recommended to confirm effectiveness in domain-specific applications.
Citation
If you use this model, please cite it as follows:
@misc{geraldine2024eadxml,
author = {Géraldine Geoffroy},
title = {Geraldine/msmarco-distilbert-base-v4-ead: A DistilBERT Embedding Model for EAD/XML Text},
year = {2024},
howpublished = {\url{https://huggingface.co/Geraldine/msmarco-distilbert-base-v4-ead}},
}
Model Card Authors
Géraldine Geoffroy
Model Card Contact