Pythia-14M Fine-Tuned for High-Quality English Sentence Generation
This model is a fine-tuned version of the Pythia-14M language model, optimized for generating high-quality English sentences. It builds on the base model agentlans/pythia-14m-finewebedu-sentences and has been further trained on agentlans/high-quality-english-sentences, a curated dataset of well-formed English sentences.
Model Description
The model is based on the Pythia-14M architecture, a compact language model with roughly 14 million parameters. It has been fine-tuned specifically to generate (mostly) grammatically correct and coherent English sentences across a variety of topics and styles.
Intended Uses & Limitations
This model is designed for:
- Generating high-quality English sentences
- Completing partial sentences
- Assisting with writing tasks that require well-formed English
Limitations:
- Not suitable for tasks requiring deep domain knowledge
- May struggle with very long-form text generation
- Fails on non-English text
- It's tiny, so don't expect too much: sentences are usually well-formed but often factually wrong or nonsensical
Training Data
The model was fine-tuned on a combination of datasets:
- Web-scraped educational content (finewebedu)
- High-quality web text (fineweb)
- Filtered Common Crawl data (C4)
For the composition and preprocessing of the training data, see agentlans/high-quality-english-sentences.
How To Use
To generate 10 random sentences starting from an empty string on a CUDA device:
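from transformers import pipeline, set_seed

# Load the fine-tuned model onto the GPU
generator = pipeline('text-generation', model='agentlans/pythia-14m-sentences', device='cuda')
set_seed(1234)  # fix the seed so the sampled sentences are reproducible
# An empty prompt lets the model start each sentence from scratch
results = generator("", max_length=100, num_return_sequences=10, do_sample=True)
for x in results:
    print(x['generated_text'])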
Output:
The most common cause of the number of diseases is the common cause of death.
And there are many people in the war.
The average household income is 35.5 percent.
He was the most influential theologians of the country in this world.
On the other hand, the students will be able to learn the value of the current and the time.
However, the effect of the study would be greater than that of a drug-related drug drug.
To understand today, our nation's largest international commitment to the use of new technology and technology across the country.
On Sunday, the UK was first held in the state of the Australian, where a foreign trade union was used since the first year.
I've said that the program is most effective in education in the middle of the world.
So a year, it is important to identify a community where a student has a disability.
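The example above assumes a CUDA device is available. As a minimal sketch (not part of the original instructions), the hypothetical helper `pick_device` below falls back to the CPU when PyTorch is missing or reports no GPU; the resulting string can be passed as the `device` argument of `pipeline`.

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda' when PyTorch reports a usable GPU, otherwise 'cpu'.

    Hypothetical helper for illustration; not part of the model's API.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so no GPU can be used
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
# The pipeline call from the example would then become:
# generator = pipeline('text-generation', model='agentlans/pythia-14m-sentences', device=device)
```

A model of this size should also run comfortably on a CPU, so the fallback mainly affects speed.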
To let the model continue the sentence:
# Reuses the generator created above; the prompt is the start of the sentence
results = generator("The meaning of life is", max_length=100, num_return_sequences=10, do_sample=True)
for x in results:
    print(x['generated_text'])
Output:
The meaning of life is one of the most extraordinary stories of the great world, and some of the most brilliant examples of the world of science.
The meaning of life is to develop.
The meaning of life is to the person, or to make it a personal impression of what is the case for the reader.
The meaning of life is no longer the most important concept of the human language.
The meaning of life is the form of a personal or personal character.
The meaning of life is the world's real and our future.
The meaning of life is the true one of the nation's largest historical experiences.
The meaning of life is the basis of the Church's first, the church of the Holy Spirit, and a living faith.
The meaning of life is that the law requires that the truth be lost.
The meaning of life is the best reason for the poor and poor economy.
Training Procedure
The model was trained using the following hyperparameters:
- Learning rate: 5e-05
- Train batch size: 8
- Eval batch size: 8
- Optimizer: Adam (betas=(0.9,0.999), epsilon=1e-08)
- LR scheduler: Linear
- Number of epochs: 3.0
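To illustrate what the linear scheduler does with the learning rate listed above, here is a rough sketch in plain Python. It assumes no warmup and a decay to zero by the final step, and `total_steps` is a made-up value because the card does not report the number of training steps. The function `linear_lr` is hypothetical and only mirrors the shape of the schedule, not the training code itself.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Learning rate decayed linearly from base_lr to 0 (assumes no warmup)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


total_steps = 1000  # hypothetical; the real step count is not given in this card
for step in (0, 250, 500, 1000):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the rate starts at 5e-05, is halved at the midpoint, and reaches zero at the last step.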
Evaluation Results
On the evaluation set, the model achieved:
- Loss: 6.2540
- Accuracy: 0.1776
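The reported loss can be converted to perplexity, which is sometimes easier to interpret. The sketch below assumes the loss is the mean per-token cross-entropy in nats, as is usual for causal language model training with Transformers; the card does not state this explicitly.

```python
import math

eval_loss = 6.2540  # evaluation loss reported above

# Assumption: the loss is mean per-token cross-entropy measured in nats,
# so perplexity is simply its exponential.
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.1f}")
```

Under that assumption the perplexity comes out at roughly 520.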
Ethical Considerations
As with any text generation model, users should be aware of potential biases in the training data that may be reflected in the model's outputs. The model should not be used to generate or propagate harmful content.
Technical Specifications
- Library: Transformers 4.45.1
- Framework: PyTorch 2.4.1+cu121
- Datasets: 3.0.1
- Tokenizers: 0.20.0