
dant5-large


language:
- da
language_bcp47:
- da
- da-bornholm
- da-synnejyl
tags:
- t5
license: cc-by-4.0
datasets:
- dagw
widget:
- text: "Aarhus er Danmarks ."
co2_eq_emissions:
  training_type: "pretraining"
  geographical_location: "Copenhagen, Denmark"
  hardware_used: "4 A100 GPUs, 508 training hours"
  emissions: 132080

dant5-large is a 770M-parameter model with an architecture identical to t5-large. Training details are given in the paper Training a T5 Using Lab-sized Resources. It was trained for 10 epochs on the Danish Gigaword Corpus (official website, paper).

To use the model:

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "strombergnlp/dant5-large"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

original_text = "Aarhus er Danmarks <extra_id_0> landets ældste. Under navnet Aros, som betyder å-munding, optræder den i skriftlige kilder i 900-tallet, men <extra_id_1> historie tilbage til 700-tallet.<extra_id_2>"
original_label = "<extra_id_0> næststørste by og en af <extra_id_1> arkæologiske fund fører dens <extra_id_2>"
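# Tokenize the masked input and the target spans.
input_ids = tokenizer(original_text, return_tensors="pt").input_ids
labels = tokenizer(original_label, return_tensors="pt").input_ids

# Passing labels makes the model return the cross-entropy loss for the target.
loss = model(input_ids=input_ids, labels=labels).loss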
print(f"Original text: {original_text}")
print(f"Original label: {original_label}")
print(f"Loss for the original label is {loss.item()}")

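# Generate span predictions with the default generation settings.
sequence_ids = model.generate(input_ids)
# Special tokens are kept, so the sentinel tokens stay visible in the output.
sequences = tokenizer.batch_decode(sequence_ids)
print("A sample generated continuation: ")
print(sequences[0])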

You should see output similar to:

Original text: Aarhus er Danmarks <extra_id_0> landets ældste. Under navnet Aros, som betyder å-munding, optræder den i skriftlige kilder i 900-tallet, men <extra_id_1> historie tilbage til 700-tallet.<extra_id_2>
Original label: <extra_id_0> næststørste by og en af <extra_id_1> arkæologiske fund fører dens <extra_id_2>
Loss for the original label is 4.174272537231445
A sample generated continuation: 
<pad><extra_id_0> ældste by og<extra_id_1> har sin<extra_id_2> Se også<extra_id_3></s>
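
The generated string follows the T5 sentinel format: each `<extra_id_N>` is followed by the text predicted for the span masked by the same sentinel in the input. The sketch below shows one way to merge the predictions back into the masked text. The helper names `parse_spans` and `fill_sentinels` are hypothetical and are not part of transformers. It assumes the output was decoded with special tokens kept, as in the example above.

```python
import re
from typing import Dict

# Matches T5 sentinel tokens such as <extra_id_0> and captures the index.
SENTINEL = re.compile(r"<extra_id_(\d+)>")


def parse_spans(generated: str) -> Dict[int, str]:
    """Map each sentinel index to the text that follows it in the output."""
    # Remove the padding and end-of-sequence markers left in by batch_decode.
    cleaned = generated.replace("<pad>", "").replace("</s>", "")
    # Splitting on a pattern with one capture group yields
    # [prefix, index_0, text_0, index_1, text_1, ...].
    parts = SENTINEL.split(cleaned)
    return {int(i): text.strip() for i, text in zip(parts[1::2], parts[2::2])}


def fill_sentinels(masked: str, generated: str) -> str:
    """Replace each sentinel in the masked text with its predicted span."""
    spans = parse_spans(generated)
    # Sentinels without a prediction are replaced by an empty string.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), ""), masked)


masked = (
    "Aarhus er Danmarks <extra_id_0> landets ældste. Under navnet Aros, "
    "som betyder å-munding, optræder den i skriftlige kilder i 900-tallet, "
    "men <extra_id_1> historie tilbage til 700-tallet.<extra_id_2>"
)
generated = (
    "<pad><extra_id_0> ældste by og<extra_id_1> har sin"
    "<extra_id_2> Se også<extra_id_3></s>"
)
print(fill_sentinels(masked, generated))
```

Predicted spans are stripped of surrounding whitespace, so the spacing of the result comes from the masked text itself. Adjust that if your inputs place sentinels differently.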