metadata

license: apache-2.0
tags:
  - ipt
  - alibi
inference: false
datasets:
  - oscar-corpus/OSCAR-2301
language:
  - it

ipt-350m

ipt-350m is a decoder-style transformer pretrained from scratch on ~13B tokens of Italian text (work in progress: trained on unfiltered OSCAR).

It uses a modified transformer architecture optimized for efficient training and inference. Positional embeddings are replaced with Attention with Linear Biases (ALiBi).

ipt-350m is:

Licensed for commercial use (Apache 2.0)
Prepared to handle extremely long inputs thanks to ALiBi
Capable of fast training and inference (via FlashAttention and FasterTransformer)
Equipped with highly efficient open-source training code via the llm-foundry repository

How to Use

import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
  'efederici/ipt-350m',
  trust_remote_code=True
)

Note: This model requires that trust_remote_code=True be passed to the from_pretrained method.
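
Once the model is loaded, text can be generated with the standard generate method. The helper below is a minimal sketch, not the author's reference usage: the function name generate_italian is hypothetical, and it assumes the repository ships tokenizer files that AutoTokenizer can load. The libraries are imported inside the function so that defining it does not trigger any download.

```python
def generate_italian(prompt, max_new_tokens=50, name="efederici/ipt-350m"):
    """Return `prompt` plus a generated continuation (downloads the model on first call)."""
    # Imported lazily so that defining this helper has no side effects.
    import torch
    import transformers

    # Assumption: the model repository provides a tokenizer loadable by AutoTokenizer.
    tokenizer = transformers.AutoTokenizer.from_pretrained(name)
    model = transformers.AutoModelForCausalLM.from_pretrained(
        name,
        trust_remote_code=True,  # required by this model, as noted above
    )
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
        )
    # Decode the first (and only) sequence in the batch.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate_italian("La capitale d'Italia è") would return the prompt followed by up to 50 generated tokens. Because this is a base model trained on unfiltered text, the continuation may be of uneven quality.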

To use the optimized Triton implementation of FlashAttention, load the model on GPU (cuda:0) with attn_impl='triton' and bfloat16 precision:

import torch
import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'cuda:0'

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True
)

Although the model was trained with a sequence length of 2048, ALiBi allows the maximum sequence length to be increased during finetuning and/or inference:

import transformers

name = 'efederici/ipt-350m'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  trust_remote_code=True
)
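
The reason the sequence length can be raised is that ALiBi adds a penalty to each attention score that depends only on the distance between query and key, so there is no learned position table tied to the training length. The sketch below follows the formulation in the ALiBi paper for a power-of-two number of heads; it is an illustration in plain Python, and the model's own implementation may differ in details. The function names are hypothetical.

```python
def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8*k/n_heads) for k = 1..n_heads (power-of-two head counts only)."""
    assert n_heads > 0 and n_heads & (n_heads - 1) == 0, "sketch covers powers of two only"
    return [2.0 ** (-8.0 * k / n_heads) for k in range(1, n_heads + 1)]


def alibi_bias(slope, seq_len):
    """Causal bias matrix: entry [i][j] is -slope * (i - j) for j <= i, else -inf."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(16)         # ipt-350m has 16 attention heads
bias = alibi_bias(slopes[0], 4)   # bias for the first head over 4 positions

# The bias is added to the attention scores before the softmax.
# The same formula applies unchanged at seq_len = 2048 or 4096.
for row in bias:
    print(row)
```

Each head uses a different slope, so some heads attend mostly to nearby tokens while others retain a wider view.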

Model Description

The architecture is a standard decoder-only transformer modified in the following ways:

It uses FlashAttention
It uses ALiBi (Attention with Linear Biases) and does not use positional embeddings
It does not use biases

Hyperparameter	Value
n_parameters	350M
n_layers	24
n_heads	16
d_model	1024
vocab size	50432
sequence length	2048
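
As a sanity check, the parameter count can be estimated from the hyperparameters above. This is a back-of-the-envelope sketch under assumptions not stated in this card: an MLP expansion ratio of 4, tied input and output embeddings, and normalization layers with a weight vector only (consistent with the model not using biases). The function name is hypothetical.

```python
def estimate_params(n_layers, d_model, vocab_size, expansion_ratio=4):
    """Rough parameter count for a bias-free decoder-only transformer."""
    embedding = vocab_size * d_model                 # token embeddings (assumed tied with the output layer)
    attention = 4 * d_model * d_model                # Q, K, V and output projections
    mlp = 2 * expansion_ratio * d_model * d_model    # up- and down-projection (assumed ratio of 4)
    norms = 2 * d_model                              # two weight-only norm layers per block
    final_norm = d_model                             # norm after the last block
    return embedding + n_layers * (attention + mlp + norms) + final_norm


total = estimate_params(n_layers=24, d_model=1024, vocab_size=50432)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate comes to roughly 354M parameters, in line with the 350M figure in the table. ALiBi adds no learned parameters, so no positional embedding term appears.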

Dataset

The model was trained on ~13B tokens of OSCAR-2301 (with batch size 64 and sequence length 2048). Each training example was constructed by concatenating as many sequences from the dataset as were necessary to fill the 2048-token sequence length.
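
The packing described above can be sketched as follows. This is an illustration rather than the actual training code: the function name pack_sequences is hypothetical, and both the end-of-sequence separator between documents and the dropping of the final partial example are assumptions. The toy example uses a sequence length of 4 instead of 2048.

```python
def pack_sequences(tokenized_docs, seq_len, eos_id):
    """Concatenate tokenized documents and yield fixed-length training examples."""
    buffer = []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # assumption: documents are separated by an EOS token
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any leftover tokens shorter than seq_len are dropped in this sketch.


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token ids
examples = list(pack_sequences(docs, seq_len=4, eos_id=0))
print(examples)

# Tokens consumed per optimizer step with the stated batch size and sequence length.
tokens_per_batch = 64 * 2048
print(tokens_per_batch)
```

With 131,072 tokens per batch, ~13B tokens corresponds to on the order of 100,000 steps.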

The vocabulary size is 50432, a multiple of 128 as suggested in Megatron-LM; this increased model FLOP utilization (MFU) by up to four percentage points.
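
Rounding a vocabulary size up to a multiple of 128 is a one-line computation. The sketch below is illustrative only: the function name is hypothetical, and the unpadded size of 50400 is a made-up value, not the tokenizer's actual vocabulary size.

```python
def pad_vocab_size(vocab_size, multiple=128):
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


hypothetical_raw_size = 50400              # made-up value for illustration
padded = pad_vocab_size(hypothetical_raw_size)
print(padded, padded % 128)
```

The extra embedding rows are never produced by the tokenizer; they exist only so that the embedding and output matrices have dimensions that map well onto GPU kernels.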