Doge 320M


Doge uses Dynamic Mask Attention for sequence transformation and can use either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention lets the Transformer use self-attention during training and a state-space formulation during inference, and the Cross Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training. This model is trained by the SmallDoge community. A paper with the detailed algorithm and model architecture is coming soon; all training details and code are available in the small-doge repository.
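
The snippet below is a minimal, illustrative sketch of the dynamic-mask idea only, not the SmallDoge implementation (which lives in the model's remote code and the small-doge repository). It assumes a hypothetical gate_proj linear layer that maps the value states to one gate per head and key position, and applies that gate as an additive log-space mask on the attention scores; at inference, weakly gated positions could be dropped entirely to bound the attended state.

import torch
import torch.nn as nn
import torch.nn.functional as F

def toy_dynamic_mask_attention(q, k, v, gate_proj):
    # q, k, v: (batch, heads, seq_len, head_dim)
    b, h, s, d = q.shape
    scores = (q @ k.transpose(-1, -2)) / d ** 0.5              # (b, h, s, s)
    causal = torch.triu(torch.ones(s, s, dtype=torch.bool), 1)
    scores = scores.masked_fill(causal, float("-inf"))
    # Hypothetical value-dependent gate: one scalar per head and key position.
    gate = gate_proj(v.transpose(1, 2).reshape(b, s, h * d))   # (b, s, h)
    bias = F.logsigmoid(gate).transpose(1, 2).unsqueeze(2)     # (b, h, 1, s)
    # Soft log-space mask during training; at inference, positions with very
    # small gates could be skipped to keep only a bounded working set.
    return F.softmax(scores + bias, dim=-1) @ v                # (b, h, s, d)

# Tiny smoke test with random tensors
b, h, s, d = 1, 4, 16, 32
q = k = v = torch.randn(b, h, s, d)
out = toy_dynamic_mask_attention(q, k, v, nn.Linear(h * d, h))
print(out.shape)  # torch.Size([1, 4, 16, 32])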

Uses

>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-320M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-320M", trust_remote_code=True)
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
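
For interactive use, the output can also be streamed token by token with the Transformers TextStreamer utility, reusing the model, tokenizer, and inputs from above:

>>> from transformers import TextStreamer

>>> streamer = TextStreamer(tokenizer, skip_prompt=True)
>>> _ = model.generate(**inputs, max_new_tokens=100, streamer=streamer)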

Model Details

We build the Doge models by pre-training on Smollm-Corpus. If you want to continue pre-training this model, you can find the unconverged checkpoint here. These models have not been fine-tuned for instruction following; the instruction-tuned model is here.
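
As a rough sketch only (not the project's actual recipe, which is in the small-doge repository), continued pre-training could look like the loop below: it streams one subset of the corpus and optimizes the standard causal-LM loss. The subset name "cosmopedia-v2", the single-sample batching, and the step cutoff are assumptions for illustration; a real run would start from the unconverged checkpoint linked above and use the hyperparameters in the table below.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-320M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-320M", trust_remote_code=True)
model.train()

# Stream one corpus subset; "cosmopedia-v2" is an assumed subset name.
stream = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2",
                      split="train", streaming=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)  # LR taken from the table below

for step, sample in enumerate(stream):
    batch = tokenizer(sample["text"], return_tensors="pt",
                      truncation=True, max_length=2048)
    # Assumes the remote-code model accepts `labels` like standard HF causal LMs.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= 100:  # toy cutoff; real runs use the step counts in the table
        break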

Pre-Training:

| Model | Training Data | Steps | Content Length | Tokens | LR | Batch Size | Precision | RTX 4090 GPU hours |
|---|---|---|---|---|---|---|---|---|
| Doge-20M | HuggingFaceTB/smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 | 14 |
| Doge-60M | HuggingFaceTB/smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 | 128 |
| Doge-160M | HuggingFaceTB/smollm-corpus | 24k | 2048 | 32B | 4e-3 | 1.5M | bfloat16 | 522 |
| Doge-320M | HuggingFaceTB/smollm-corpus | 32k | 2048 | 64B | 2e-3 | 2M | bfloat16 | 1856 |

Evaluation:

| Model | MMLU | TriviaQA | ARC | PIQA | HellaSwag | OBQA | Winogrande | tokens/s on i7-11 CPU |
|---|---|---|---|---|---|---|---|---|
| Doge-20M | 25.4 | 0.03 | 29.8 | 58.4 | 27.3 | 25.6 | 50.2 | 142 |
| Doge-60M | 26.4 | 0.2 | 37.9 | 61.4 | 31.5 | 28.0 | 50.8 | 62 |
| Doge-160M | 29.2 | 4.8 | 44.4 | 70.1 | 43.4 | 34.4 | 52.2 | 28 |
| Doge-320M | 33.8 | 9.4 | 52.1 | 73.9 | 52.7 | 37.9 | 55.0 | 16 |

All evaluations are done in a five-shot setting, without additional training on the benchmarks.
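
The model card does not specify the evaluation harness. As one way to obtain comparable five-shot numbers, the sketch below uses the Python API of EleutherAI's lm-evaluation-harness; the harness choice and the task names are assumptions, not a statement of how the table above was produced.

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=SmallDoge/Doge-320M,trust_remote_code=True",
    tasks=["mmlu", "triviaqa", "arc_easy", "piqa",
           "hellaswag", "openbookqa", "winogrande"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])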

Procedure:

Training runs can be visualized in Weights & Biases.

Environment:

  • Image: nvcr.io/nvidia/pytorch:24.12-py3
  • Hardware: 1x NVIDIA RTX 4090
  • Software: Transformers

Citation

@misc{smalldoges,
  title={SmallDoges},
  author={SmallDoge Team and Jingze, Shi and Yifan, Wu and Bingheng, Wu},
  year={2025},
  month={March}, 
}