|
--- |
|
license: mit |
|
datasets: |
|
- wikimedia/wikipedia |
|
language: |
|
- es |
|
base_model: |
|
- openai-community/gpt2 |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
|
|
|
|
# ST3: Simple Transformer 3 |
|
|
|
## Model description |
|
ST3 (Simple Transformer 3) is a lightweight transformer-based model derived from OpenAI's GPT-2 architecture. It was specifically designed to enable quick fine-tuning and experimentation, making it a great choice for researchers and developers seeking an efficient model for downstream tasks. |
|
|
|
### Key features: |
|
- **Architecture:** GPT-2-based model with 3 attention heads and 3 layers (see the configuration sketch after this list).
|
- **Embedding size:** 288-dimensional token embeddings.
|
- **Context size:** 2048 tokens, allowing for extended input/output sequences. |
|
- **Pretrained on:** Wikimedia/Wikipedia subset "20231101.es" (Spanish text corpus). |
|
- **Parameters:** 4 million (stored in FP32).
|
- **Batch size:** 32. |
|
- **Training environment:** 1 epoch on a Kaggle P100 GPU. |
|
- **Tokenizer:** Custom WordPiece tokenizer ("ST3") that marks subword units with a "##" prefix.
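
For reference, here is a minimal sketch of how the features above map onto a Hugging Face `GPT2Config`. The field names are the standard GPT-2 ones; the vocabulary size is not stated on this card, so it is read from the ST3 tokenizer:

```python
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

# The vocabulary size is not listed above, so take it from the ST3 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("BueormLLC/ST3")

# Hyperparameters from the feature list, expressed as a GPT2Config.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=2048,  # context size
    n_embd=288,        # embedding dimension
    n_layer=3,         # transformer blocks
    n_head=3,          # attention heads per block
)

model = GPT2LMHeadModel(config)
print(f"~{model.num_parameters() / 1e6:.1f}M parameters")
```

Instantiating the config this way only illustrates the model's size; to use the released weights, load the checkpoint as shown in the Usage section below.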
|
|
|
## Intended use |
|
ST3 is not as capable as larger transformer models, but it can be used for:
|
- Quick fine-tuning on small datasets (a minimal fine-tuning sketch follows below).
|
- Research purposes to test new ideas. |
|
- Educational and experimentation purposes. |
|
|
|
This model has not been fine-tuned on any downstream task or evaluated with performance metrics, as it is not designed for state-of-the-art results.
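
Since quick fine-tuning is the primary intended use, here is a minimal, hedged sketch of how one might continue training ST3 on a small text corpus with the Hugging Face `Trainer`. The corpus file name, sequence length, and training hyperparameters below are placeholders, not values from this card:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("BueormLLC/ST3")
model = AutoModelForCausalLM.from_pretrained("BueormLLC/ST3")

# Precaution: make sure a padding token exists before batching.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus: any plain-text file with one example per line.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Causal language modeling: the collator derives labels from the input ids.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="st3-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```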
|
|
|
### Usage |
|
To use the ST3 model, you can follow this example: |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
# Load the ST3 tokenizer and model from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("BueormLLC/ST3")
model = AutoModelForCausalLM.from_pretrained("BueormLLC/ST3")
|
|
|
def clean_wordpiece_tokens(text):
    # Strip the "##" continuation markers added by the WordPiece tokenizer.
    return text.replace(" ##", "").replace("##", "")
|
|
|
input_text = "Esto es un ejemplo" |
|
inputs = tokenizer(input_text, return_tensors="pt") |
|
|
|
# Generation is capped at the model's 2048-token context window.
outputs = model.generate(inputs.input_ids, max_length=2048, num_return_sequences=1)
|
|
|
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
cleaned_text = clean_wordpiece_tokens(generated_text) |
|
|
|
print(cleaned_text) |
|
``` |
|
|
|
### Explanation |
|
The ST3 tokenizer uses the WordPiece algorithm, which prefixes subword continuations with "##". The `clean_wordpiece_tokens` helper above strips these markers to produce cleaner output text, as the short example below shows.
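
A short, self-contained illustration (the decoded string below is made up for demonstration):

```python
# Hypothetical raw output containing WordPiece "##" continuation markers.
raw = "Esto es un ej ##em ##plo de sal ##ida"

print(clean_wordpiece_tokens(raw))
# -> Esto es un ejemplo de salida
```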
|
|
|
## Limitations |
|
- **Performance:** ST3 lacks the power of larger models and may not perform well on complex language tasks. |
|
- **No evaluation:** The model has not been benchmarked with any evaluation metrics.
|
- **Not suitable for production use** without further fine-tuning. |
|
|
|
## Training details |
|
- **Dataset:** Wikimedia/Wikipedia subset "20231101.es" (see the loading sketch after this list).
|
- **Number of layers:** 3. |
|
- **Number of attention heads:** 3. |
|
- **Embedding size:** 288. |
|
- **Parameters:** 4 million. |
|
- **Training:** The model was trained for one epoch with a batch size of 32 on a P100 GPU provided by Kaggle. |
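
The pretraining corpus named above can be loaded with the `datasets` library; a minimal sketch (the `text` field is the standard article-text column of this dataset):

```python
from datasets import load_dataset

# Spanish Wikipedia dump used for pretraining, per the details above.
wiki_es = load_dataset("wikimedia/wikipedia", "20231101.es", split="train")

print(wiki_es[0]["text"][:200])  # first 200 characters of the first article
```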
|
|
|
## Developer and publisher |
|
- **Developed by:** BueormAI. |
|
- **Published by:** BueormLLC. |
|
|
|
## Acknowledgments |
|
Thank you for using ST3! Your feedback and support are appreciated as we continue to develop and improve our models. |
|
|
|
If you find this model useful and would like to support further development, please consider making a donation to: |
|
|
|
- [Patreon](https://patreon.com/bueom) |
|
- [PayPal](https://paypal.me/bueorm) |
|
|
|
--- |
|
|
|
*Contributions to this project are always welcome!* |
|
|