Pythia-14M Fine-Tuned for High-Quality English Sentence Generation

This model is a fine-tuned version of the Pythia-14M language model, optimized for generating high-quality English sentences. It builds on the base model agentlans/pythia-14m-finewebedu-sentences and was further trained on agentlans/high-quality-english-sentences, a curated dataset of well-formed English sentences.

Model Description

The model is based on the compact Pythia-14M architecture. It has been fine-tuned specifically to generate (mostly) grammatically correct and coherent English sentences across a variety of topics and styles.

Intended Uses & Limitations

This model is designed for:

Generating high-quality English sentences
Completing partial sentences
Assisting with writing tasks that require well-formed English

Limitations:

Not suitable for tasks requiring deep domain knowledge
May struggle with very long-form text generation
Fails on non-English text
It's a tiny model, so don't expect too much

Training Data

The model was fine-tuned on a combination of datasets:

Web-scraped educational content (finewebedu)
High-quality web text (fineweb)
Filtered Common Crawl data (C4)

For the composition and preprocessing of the training data, see agentlans/high-quality-english-sentences.

How To Use

To generate 10 random sentences starting from an empty string on a CUDA device:

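from transformers import pipeline, set_seed

# Load the model into a text-generation pipeline on the GPU
generator = pipeline('text-generation', model='agentlans/pythia-14m-sentences', device='cuda')

# Fix the random seed so the sampled sentences are reproducible
set_seed(1234)
# An empty prompt lets the model start each sentence from scratch
results = generator("", max_length=100, num_return_sequences=10, do_sample=True)

for x in results:
    print(x['generated_text'])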

Output:

The most common cause of the number of diseases is the common cause of death.
And there are many people in the war.
The average household income is 35.5 percent.
He was the most influential theologians of the country in this world.
On the other hand, the students will be able to learn the value of the current and the time.
However, the effect of the study would be greater than that of a drug-related drug drug.
To understand today, our nation's largest international commitment to the use of new technology and technology across the country.
On Sunday, the UK was first held in the state of the Australian, where a foreign trade union was used since the first year.
I've said that the program is most effective in education in the middle of the world.
So a year, it is important to identify a community where a student has a disability.

To have the model continue a partial sentence:

results = generator("The meaning of life is", max_length=100, num_return_sequences=10, do_sample=True)
for x in results:
    print(x['generated_text'])

Output:

The meaning of life is one of the most extraordinary stories of the great world, and some of the most brilliant examples of the world of science.
The meaning of life is to develop.
The meaning of life is to the person, or to make it a personal impression of what is the case for the reader.
The meaning of life is no longer the most important concept of the human language.
The meaning of life is the form of a personal or personal character.
The meaning of life is the world's real and our future.
The meaning of life is the true one of the nation's largest historical experiences.
The meaning of life is the basis of the Church's first, the church of the Holy Spirit, and a living faith.
The meaning of life is that the law requires that the truth be lost.
The meaning of life is the best reason for the poor and poor economy.
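
The examples above assume a CUDA device is available. On a machine without a GPU, one option is to choose the device at runtime. The helper below is a minimal sketch; `pick_device` is a hypothetical name introduced here for illustration and is not part of the model or of Transformers.

```python
def pick_device() -> str:
    """Return 'cuda' if PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    # The result can be passed as the `device` argument of `pipeline`, e.g.
    # pipeline('text-generation', model='agentlans/pythia-14m-sentences',
    #          device=pick_device())
    print(pick_device())
```

With a model this small, generation on CPU should still be practical, though no CPU timings are reported here.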

Training Procedure

The model was trained using the following hyperparameters:

Learning rate: 5e-05
Train batch size: 8
Eval batch size: 8
Optimizer: Adam (betas=(0.9,0.999), epsilon=1e-08)
LR scheduler: Linear
Number of epochs: 3.0
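
As a rough illustration of what a linear scheduler does with the learning rate listed above, the sketch below decays it from 5e-05 to zero over the course of training. It assumes no warmup phase, which this card does not state, and the total step count is a hypothetical value. `linear_lr` is an illustrative helper, not the code used to train the model.

```python
BASE_LR = 5e-05  # learning rate from the hyperparameter list


def linear_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Learning rate at `step` under linear decay to zero (assumes no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    remaining = max(0, total_steps - step)  # clamp so the rate never goes negative
    return base_lr * remaining / total_steps


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for step in (0, 250, 500, 750, 1000):
        print(step, linear_lr(step, total))
```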

Evaluation Results

On the evaluation set, the model achieved:

Loss: 6.2540
Accuracy: 0.1776
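
If the reported loss is the mean per-token cross-entropy in nats (an assumption; the card does not say how it was computed), it can be converted to perplexity by exponentiation. A minimal sketch:

```python
import math

eval_loss = 6.2540  # evaluation loss reported above

# Perplexity is exp(loss) when the loss is mean cross-entropy in nats.
perplexity = math.exp(eval_loss)

print(f"perplexity = {perplexity:.1f}")
```

Under that assumption the perplexity comes out at roughly 520.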

Ethical Considerations

As with any text generation model, users should be aware of potential biases in the training data that may be reflected in the model's outputs. The model should not be used to generate or propagate harmful content.

Technical Specifications

Library: Transformers 4.45.1
Framework: PyTorch 2.4.1+cu121
Datasets: 3.0.1
Tokenizers: 0.20.0
