Geraldine/msmarco-distilbert-base-v4-ead

Model Details

Model Name: Geraldine/msmarco-distilbert-base-v4-ead
Base Model: sentence-transformers/msmarco-distilbert-base-v4
Intended Use: This model is optimized for creating text embeddings with specific handling of XML/EAD elements.
Architecture: DistilBERT-based sentence-transformer model, fine-tuned on MSMARCO and adapted to recognize XML/EAD elements.

Model Description

This model is built on top of sentence-transformers/msmarco-distilbert-base-v4 and enhanced with two key modifications:

Special Tokens for XML/EAD Elements: The tokenizer includes additional tokens to handle EAD (Encoded Archival Description) and XML elements and attributes. This allows the model to generate embeddings that capture structural metadata commonly used in archival contexts.
Dimensionality Reduction with PCA: A PCA model is applied to reduce the dimensionality of embeddings from 768 to 128. This makes the embeddings more compact while preserving essential semantic information, which is useful for downstream tasks requiring lower-dimensional representations.
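
As a rough illustration of the second modification, the sketch below fits a PCA projection from 768 to 128 dimensions with plain NumPy. The random `corpus` array and the helper `reduce_dim` are assumptions made for this example; this is not the procedure or data used to produce `pca_model.joblib`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a corpus of 768-dimensional embeddings (random values, illustration only).
corpus = rng.normal(size=(500, 768))

# Fit: center the data, then take the top 128 principal directions from the SVD.
mean = corpus.mean(axis=0)
_, _, vt = np.linalg.svd(corpus - mean, full_matrices=False)
components = vt[:128]  # shape (128, 768), rows are orthonormal

def reduce_dim(vectors):
    """Project 768-d vectors onto the 128 fitted principal components."""
    return (np.asarray(vectors) - mean) @ components.T

reduced = reduce_dim(corpus[:4])
print(reduced.shape)  # (4, 128)
```

The same centering and projection must be applied to documents and queries alike, which is why the fitted PCA object is distributed together with the model.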

Model Usage

Installation and Setup

from transformers import AutoModel, AutoTokenizer
import joblib
from huggingface_hub import hf_hub_download

# Load the embeddings model
model = AutoModel.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")

# Load the PCA model
pca_path = hf_hub_download("Geraldine/msmarco-distilbert-base-v4-ead", "pca_model.joblib")
pca = joblib.load(pca_path)

Encoding Text and Reducing Dimensionality

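To generate 128-dimensional embeddings, encode the text, keep one vector per input, and apply the PCA model:

import torch

# Encode text using the model and tokenizer
text = "Your EAD/XML text goes here"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)

# Keep the first-token ([CLS]) vector so PCA receives a 2D array of shape (1, 768)
embeddings = hidden_states[:, 0, :]

# Apply PCA to reduce dimensionality from 768 to 128
reduced_embeddings = pca.transform(embeddings.numpy())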

Full example to use with LangChain or LlamaIndex

from transformers import AutoModel, AutoTokenizer, pipeline
import joblib
import numpy as np
from huggingface_hub import hf_hub_download

# Load the embeddings model
model = AutoModel.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")

# Download the PCA model file (it is loaded inside the embedding class below)
pca_path = hf_hub_download("Geraldine/msmarco-distilbert-base-v4-ead", "pca_model.joblib")

feature_extraction_pipeline = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

class HuggingFaceEmbeddingFunction:
    def __init__(self, pipeline, pca_model_path):
        self.pipeline = pipeline
        self.pca = joblib.load(pca_model_path)
            
    # Function for embedding documents (lists of text)
    def embed_documents(self, texts):
        # Get embeddings as numpy arrays
        embeddings = self.pipeline(texts)
        embeddings = [embedding[0][0] for embedding in embeddings]
        embeddings = np.array(embeddings)

        # Transform embeddings using PCA
        reduced_embeddings = self.pca.transform(embeddings)
        return reduced_embeddings.tolist()

    # Function for embedding individual queries
    def embed_query(self, text):
        embedding = self.pipeline(text)
        embedding = np.array(embedding[0][0]).reshape(1, -1)

        # Transform embedding using PCA
        reduced_embedding = self.pca.transform(embedding)
        return reduced_embedding.flatten().tolist()

# Pass the downloaded file path so the PCA model is found regardless of the working directory
embeddings = HuggingFaceEmbeddingFunction(feature_extraction_pipeline, pca_model_path=pca_path)

Intended Use Cases

This model is well-suited for:

Archival Data Embeddings: Generate embeddings for texts containing EAD/XML elements, making it ideal for digital archives and library sciences.
Semantic Search: Improve search results for content with complex metadata or hierarchical data, like archival records or digital collections.
Information Retrieval: Use embeddings to power retrieval tasks where reducing storage and maintaining relevance in the embeddings are essential.
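
A minimal sketch of how the reduced embeddings could back a semantic search step, assuming cosine similarity as the ranking metric. The helper `cosine_rank` is hypothetical, and random 128-dimensional vectors stand in for real model output so the example runs without downloading anything.

```python
import numpy as np

def cosine_rank(query_vec, doc_vecs):
    """Return document indices sorted by cosine similarity to the query, plus the scores."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    order = np.argsort(-scores)  # highest similarity first
    return order.tolist(), scores

rng = np.random.default_rng(0)
# Stand-ins for reduced document embeddings (5 documents, 128 dimensions each).
doc_vecs = rng.normal(size=(5, 128))
# A query that is a slightly perturbed copy of document 3.
query_vec = doc_vecs[3] + 0.05 * rng.normal(size=128)

order, scores = cosine_rank(query_vec, doc_vecs)
print(order[0])  # index of the best-matching document
```

In practice `doc_vecs` would come from `embed_documents` and `query_vec` from `embed_query`. At float32 precision a 128-dimensional vector takes 512 bytes, versus 3,072 bytes for a 768-dimensional one.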

Training Data

The base model was fine-tuned on MSMARCO data by sentence-transformers. No additional training or fine-tuning was required for the EAD/XML-specific tokens; instead, the tokenizer was extended so that XML/EAD elements and attributes are recognized as distinct tokens.
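
To show what recognizing elements as distinct tokens changes, here is a toy splitter in pure Python. It is not the Hugging Face tokenizer, and the tags in `EAD_TOKENS` are hypothetical examples rather than the model's actual added-token list. With the transformers library, this kind of extension is commonly done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether this model was built exactly that way is an assumption.

```python
import re

# Hypothetical EAD tags chosen for illustration only.
EAD_TOKENS = ["<unittitle>", "</unittitle>", "<unitdate>", "</unitdate>"]

def toy_tokenize(text, added_tokens=()):
    """Split lowercase text into words and punctuation, keeping added tokens whole."""
    pattern = r"\w+|[^\w\s]"
    if added_tokens:
        # Try the registered tags first, longest first, before the generic rules.
        ordered = sorted(added_tokens, key=len, reverse=True)
        protected = "|".join(re.escape(tok) for tok in ordered)
        pattern = protected + "|" + pattern
    return re.findall(pattern, text.lower())

sample = "<unittitle>Letters</unittitle>"
plain = toy_tokenize(sample)
adapted = toy_tokenize(sample, EAD_TOKENS)
print(plain)    # ['<', 'unittitle', '>', 'letters', '<', '/', 'unittitle', '>']
print(adapted)  # ['<unittitle>', 'letters', '</unittitle>']
```

Without the registered tags, each element is broken into punctuation and word fragments; with them, the opening and closing tags each map to a single token.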

Limitations and Considerations

Domain-Specific Tokenization: The model's tokenizer recognizes EAD/XML tokens, making it particularly useful in contexts where such elements are frequently used. However, this specialization may not be necessary for general NLP tasks.
Dimensionality Reduction Trade-Off: PCA reduces the embedding dimensions from 768 to 128, which discards some of the information encoded in the original embeddings. The reduction is intended to keep the embeddings compact while retaining the essential semantic information.
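
One way to put a number on this trade-off is the share of variance kept by the first 128 principal components. The sketch below computes it with NumPy on synthetic data whose per-dimension variance decays; the data and the resulting figure are illustrative assumptions, not measurements of this model. If the loaded `pca` object is a scikit-learn `PCA` (an assumption, since the card does not say), the comparable value would be `pca.explained_variance_ratio_.sum()`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 768-d data whose variance decays across dimensions (illustration only).
scales = 1.0 / np.sqrt(np.arange(1, 769))
data = rng.normal(size=(400, 768)) * scales

# Squared singular values of the centered data are proportional to the
# variance captured by each principal component, in descending order.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
variance = singular_values ** 2

retained = variance[:128].sum() / variance.sum()
print(f"Variance retained by 128 components: {retained:.1%}")
```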

Evaluation

The base model has been evaluated on MSMARCO; the added tokens adapt it for use in XML/EAD contexts. Further evaluation on EAD-specific datasets or tasks is recommended to confirm the model's effectiveness in domain-specific applications.

Citation

If you use this model, please cite it as follows:

@misc{geraldine2024eadxml,
  author = {Géraldine Geoffroy},
  title = {Geraldine/msmarco-distilbert-base-v4-ead: A DistilBERT Embedding Model for EAD/XML Text},
  year = {2024},
  howpublished = {\url{https://huggingface.co/Geraldine/msmarco-distilbert-base-v4-ead}},
}

Model Card Authors

Géraldine Geoffroy

Model Card Contact

grldn.geoffroy@gmail.com
