
ModularStarEncoder-550M Fine-Tuned model

ModularStarEncoder-finetuned-18 is an encoder built on top of the pre-trained ModularStarEncoder-1B and fine-tuned on SynthCode2Code2NL. ModularStarEncoder-finetuned is an encoder for code-to-code and text-to-code retrieval tasks that lets the end user select the model size matching their memory and computational constraints. We built ModularStarEncoder on top of StarCoder-2, reducing its size from 15B to 1B parameters in bfloat16.

This version contains only the first 18 layers of ModularStarEncoder-finetuned, together with the corresponding projection head. We released this version to make the model easier to use, allowing users to download only the size they need.
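
As a quick sanity check after downloading, you can verify the reduced depth and parameter count. This is only a sketch: it assumes the custom config exposes the usual num_hidden_layers attribute, as standard Hugging Face configs do.

from transformers import AutoModel

model = AutoModel.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned-18", trust_remote_code=True)

# Assumption: the custom config exposes the usual num_hidden_layers attribute
print(getattr(model.config, "num_hidden_layers", "n/a"))  # expected: 18
print(sum(p.numel() for p in model.parameters()))         # roughly 553M parameters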

The model is fine-tuned with a CLIP-style contrastive objective.
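
For intuition only, here is a minimal sketch of a CLIP-style contrastive loss (symmetric cross-entropy over cosine similarities between paired code and text embeddings). It is an illustration, not the training code used for this model, and the temperature value is an assumption.

import torch
import torch.nn.functional as F

def clip_style_loss(code_emb, text_emb, temperature=0.07):
    # Normalize so that dot products are cosine similarities
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.T / temperature                   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)   # matching pairs lie on the diagonal
    # Symmetric cross-entropy over the code-to-text and text-to-code directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2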

ModularStarEncoder-finetuned works with instruction prompts; to get the most out of the model, embed the task instruction in the input. The How to use section below provides more details.

How to use

from transformers import AutoModel, AutoTokenizer

# Load the model; trust_remote_code is required because of the custom architecture
model = AutoModel.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned-18", trust_remote_code=True)

# Load the tokenizer (note: the tokenizer applies LEFT padding!)
tokenizer = AutoTokenizer.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned-18")

 
language = "yourlanguagelowercased"  # e.g. "python"

# Instruction for embedding a code snippet written in the given language
instruction_code = f"Represent this {language} code snippet for retrieval:"

# Instruction for embedding a natural-language (English) code description
instruction_natural_language = "Represent this code description for retrieving supporting snippets of code:"

code_snippet = "your code to embed here"

# Follow this pattern to embed a code snippet or a natural-language query
sentence = f"{tokenizer.sep_token}{instruction_code}{tokenizer.sep_token}{code_snippet}{tokenizer.cls_token}"

# Tokenize the sentence
tokenized_sentence = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=2048)

# Embed the tokenized sentence
embedded_sentence = model(**tokenized_sentence)

You will get three elements as output (a usage sketch follows the list):

  • projected_pooled_normalized: a list of the projected, pooled, and normalized embeddings from the five exit points;
  • raw_hidden_states: the raw hidden states of the model, without pooling, normalization, or projection;
  • attentions: the attention scores from the encoder.
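
As a rough usage sketch building on the snippet above (not an official example): the helper below assumes the output exposes projected_pooled_normalized under that name and that its last entry corresponds to the deepest exit point in this checkpoint. Since the embeddings are already normalized, cosine similarity reduces to a dot product.

import torch

def embed(text, instruction):
    # Reuse the prompt pattern from the How to use section
    s = f"{tokenizer.sep_token}{instruction}{tokenizer.sep_token}{text}{tokenizer.cls_token}"
    tokens = tokenizer(s, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        out = model(**tokens)
    # Assumption: take the last (deepest) exit point and drop the batch dimension
    return out.projected_pooled_normalized[-1].squeeze(0)

query_emb = embed("function that sorts a list of integers", instruction_natural_language)
code_emb = embed("def sort_numbers(xs): return sorted(xs)", instruction_code)

similarity = float(query_emb @ code_emb)  # dot product of normalized embeddings
print(similarity)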

Training

We fine-tuned ModularStarEncoder with a batch size of 2048 contrastive samples for 20,000 training steps. Pre-training and fine-tuning were conducted on 512 NVIDIA Ampere (64 GB) GPUs on the Leonardo supercomputer, requiring 450,000 GPU hours in total.

Hyperparameter Value
Hidden size 1024
Max. position embeddings 2048
Num. of attention heads 12
Num. of key-value heads 4
Num. of hidden layers 36
Attention GQA
Num. of parameters ≈1B
Loss function CLIP loss
Multi-layer loss yes

Evaluation

Here we briefly report our CodeSearchNet (CodeXGLUE) results across different exit layers (an illustrative MRR computation follows the table); for full text-to-code and code-to-code results, refer to the article:

Layer Avg. MRR
Layer 4 73.2
Layer 9 77.3
Layer 18* 81.0
Layer 27 80.3
Layer 36 79.6
  • (*) the exit layer and corresponding projection head included in this model
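
For reference, Avg. MRR is the mean reciprocal rank of the correct result averaged over queries; the minimal illustration below is not the official CodeXGLUE evaluation script.

def mean_reciprocal_rank(ranked_ids_per_query, relevant_id_per_query):
    # ranked_ids_per_query: for each query, candidate ids sorted from most to least similar
    # relevant_id_per_query: the single correct id for each query
    reciprocal_ranks = []
    for ranked_ids, relevant_id in zip(ranked_ids_per_query, relevant_id_per_query):
        if relevant_id in ranked_ids:
            reciprocal_ranks.append(1.0 / (ranked_ids.index(relevant_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Toy example: the correct id is ranked 1st, 2nd, and 4th for three queries
print(mean_reciprocal_rank([[3, 1], [5, 7], [2, 9, 4, 8]], [3, 7, 8]))  # (1 + 0.5 + 0.25) / 3 ≈ 0.583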

License

The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement here.

Citation

@article{gurioli2025modeltrainallhierarchical,
      title={One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings}, 
      author={Andrea Gurioli and Federico Pennino and João Monteiro and Maurizio Gabbrielli},
      year={2025},
      eprint={2503.03008},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.03008}, 
}