Vocabulary contains hole for index 51959
When trying to further pre-train the model on a specific domain, I encountered errors.
When tokenizing with the robeczech-base tokenizer, the following warning occurs: The OrderedVocab you are attempting to save contains a hole for index 51959, your vocabulary could be corrupted !
When I start training the model, PyTorch throws the following error:
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [94,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(This assertion is repeated for various block and thread values.)
Traceback (most recent call last):
File "/home/jovyan/tomas/medical-lm/PyTorch/TrainLM-masked-pytorch.py", line 98, in <module>
result = trainer.train()
File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
return inner_training_loop(
File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 1938, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2759, in training_step
loss = self.compute_loss(model, inputs)
File "/opt/conda/lib/python3.9/site-packages/transformers/trainer.py", line 2784, in compute_loss
outputs = model(**inputs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/operations.py", line 553, in forward
return model_forward(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/operations.py", line 541, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/opt/conda/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 1100, in forward
outputs = self.roberta(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 845, in forward
embedding_output = self.embeddings(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/transformers/models/roberta/modeling_roberta.py", line 123, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Code I use to pre-train the model:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling
from datasets import load_dataset
import math
import evaluate
from pynvml import *
model_name = "ufal/robeczech-base"
def print_gpu_utilization():
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")
def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset('text', data_dir="data/")
datasets = dataset["train"].train_test_split(test_size = 0.1)
print(tokenizer.model_max_length)
print_gpu_utilization()
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=16, remove_columns=["text"])
block_size=512
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=16,
)
model = AutoModelForMaskedLM.from_pretrained(model_name)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
training_args = TrainingArguments(
    f"{model_name}-pre-trained-med",
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=2000,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=True,
    logging_dir=f"{model_name}-pre-trained-med",
    logging_strategy="steps",
    num_train_epochs=10,
    logging_steps=100,
    save_strategy="epoch",
    save_total_limit=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["test"],
    data_collator=data_collator,
)
result = trainer.train()
print_summary(result)
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
trainer.save_model(f"py{model_name}-med-pretrained")
I tried downloading the model and tokenizer locally, "filling" the vocabulary hole, and reloading the tokenizer, but I was not able to load the fixed tokenizer back.
Is there any way to fix this?
Hi,
yes, the tokenizer of the RobeCzech model is unfortunately a bit non-standard. Notably, there are multiple subwords with the same ID 3 (originally the ID of the <unk> token).
The problem was caused by the following. We first created a ByteBPE tokenizer, remapped the inputs, and then trained the model using FairSeq. However, FairSeq renumbered the subwords again, and embeddings were created only for the subwords that actually appeared in the training data. After training, we "composed" the two mappings, arriving at the final tokenizer. The ByteBPE tokenizer requires the 256 special tokens representing the byte values 0-255, and some of them were not present in the training data, so they did not get an embedding; yet without all of these byte-value subwords, the ByteBPE tokenizer does not even load.
Unfortunately, we "solved" the issue by mapping the missing subwords to index 3 (
<unk>
).We provide both the "fast" and "slow" tokenizers in this repo, mapping multiple tokens to ID 3. However, they cannot be saved (as the tokenizers are expected to be injective), so you must refrain from saving them. Furthermore, the number of embeddings is not the same as the number of subwords in the tokenizer.
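You can see both effects directly from the published tokenizer and model, roughly along these lines (an inspection sketch; nothing is saved):

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")
model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")

vocab = tokenizer.get_vocab()  # subword -> ID
ids = set(vocab.values())

# several distinct subwords share ID 3 (originally <unk>)
print([subword for subword, i in vocab.items() if i == 3])

# the ID range has holes (such as 51959), which is what the OrderedVocab warning reports
print([i for i in range(max(ids) + 1) if i not in ids])

# the number of embeddings differs from the number of subwords
print(len(vocab), model.get_input_embeddings().num_embeddings)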
Other than that, the model works fine, and we have finetuned it successfully both in PyTorch and in TensorFlow.
Retrospectively, a much better fix would be to actually append the embeddings for the missing tokens (and initialize them to the value of <unk>) -- the tokenizer would then be injective and standard. However, changing ufal/robeczech-base this way would not be backward compatible: if people fine-tuned the original version and then tried to load their checkpoint into the updated model, it would fail (because the number of embeddings would be different) -- which is why we have not done it. We could release a model under a name like ufal/robeczech-standardtokenizer-base with a "normal" tokenizer, but we do not currently think it is worth it.
Sorry for the trouble and cheers!