	---
	license: gpl-3.0
	language:
	- en
	datasets:
	- HuggingFaceTB/cosmopedia-100k
	- pleisto/wikipedia-cn-20230720-filtered
	pipeline_tag: text-generation
	tags:
	- text-generation-inference
	---
	# NanoLM-365M-Base

	English | [简体中文](README_zh-CN.md)

	## Introduction

	NanoLM-365M-Base is derived from [Qwen2-0.5B](https://huggingface.co/Qwen/Qwen2-0.5B). Its tokenizer has been replaced with [BilingualTokenizer-8K](https://huggingface.co/Mxode/Bilingual-Tokenizer), whose smaller vocabulary shrinks the embedding matrix and reduces the total parameter count from 0.5B to 365M.
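The parameter reduction can be checked with back-of-the-envelope arithmetic. The sketch below relies on figures that do not appear in this README and should be read as assumptions: Qwen2-0.5B's published configuration (hidden size 896, 151,936 embedding rows, input and output embeddings tied), an original total of roughly 494M parameters, and "8K" read as exactly 8,192 vocabulary entries. The result is an estimate, not an exact count.

```python
# Rough estimate of how a smaller vocabulary changes the parameter count.
# Every constant here is an assumption (see the lead-in), not a value from this README.
HIDDEN_SIZE = 896                    # assumed Qwen2-0.5B hidden size
ORIGINAL_VOCAB_SIZE = 151_936        # assumed Qwen2-0.5B embedding rows
ORIGINAL_TOTAL_PARAMS = 494_000_000  # assumed approximate total for Qwen2-0.5B
NEW_VOCAB_SIZE = 8_192               # assumed meaning of "8K"


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def estimate_total_params() -> int:
    """Swap the old embedding matrix for the new one and return the new total."""
    old_embedding = embedding_params(ORIGINAL_VOCAB_SIZE, HIDDEN_SIZE)
    new_embedding = embedding_params(NEW_VOCAB_SIZE, HIDDEN_SIZE)
    # With tied embeddings the matrix is counted once, so one swap suffices.
    return ORIGINAL_TOTAL_PARAMS - old_embedding + new_embedding


if __name__ == "__main__":
    new_embedding = embedding_params(NEW_VOCAB_SIZE, HIDDEN_SIZE)
    print(f"new embedding: {new_embedding / 1e6:.1f}M parameters")
    print(f"estimated total: {estimate_total_params() / 1e6:.0f}M parameters")
```

Under these assumptions the new embedding matrix holds about 7.3M weights and the estimated total lands near 365M, which is consistent with the figure stated above.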

	## Details

	To recover some of the performance lost by replacing the tokenizer, and to keep the model easy to fine-tune for downstream tasks, I froze the backbone parameters and trained only the embedding layer. Training ran for 40,000 steps on [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered) and [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k).
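The freezing strategy can be illustrated with a minimal PyTorch sketch. This is not the author's training script: `TinyLM` and `freeze_all_but_embeddings` are hypothetical names, the toy model is only a stand-in for the real architecture, and tying the output projection to the input embedding is an assumption. The learning rate and weight decay follow the values listed in this README.

```python
import torch
from torch import nn


class TinyLM(nn.Module):
    """Hypothetical toy language model; a stand-in, not the NanoLM architecture."""

    def __init__(self, vocab_size: int = 8192, hidden_size: int = 32):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.backbone = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        # Assumption: the output projection shares weights with the embedding.
        self.lm_head.weight = self.embed_tokens.weight

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.backbone(self.embed_tokens(input_ids)))


def freeze_all_but_embeddings(model: nn.Module) -> list[str]:
    """Keep only parameters whose name contains 'embed_tokens' trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "embed_tokens" in name
    return [n for n, p in model.named_parameters() if p.requires_grad]


model = TinyLM()
trainable_names = freeze_all_but_embeddings(model)

# Only the trainable parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-4,
    weight_decay=0.1,
)

# One illustrative step on random token ids (not real data).
input_ids = torch.randint(0, 8192, (2, 16))
logits = model(input_ids)
loss = nn.functional.cross_entropy(
    logits.view(-1, logits.size(-1)), input_ids.view(-1)
)
loss.backward()
optimizer.step()
```

After the backward pass only the embedding weight has a gradient, so the optimizer step leaves the frozen backbone untouched.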

	| Setting | Value |
	| :-------------------------: | :----------------------------------------------------------: |
	| Total Params | 365M |
	| Trainable Params | < 10M |
	| Trainable Parts | `model.embed_tokens` |
	| Training Steps | 40,000 |
	| Training Dataset | [wikipedia-zh](https://huggingface.co/datasets/pleisto/wikipedia-cn-20230720-filtered), [cosmopedia-100k](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k) |
	| Optimizer | adamw_torch |
	| Learning Rate | 2e-4 |
	| LR Scheduler | cosine |
	| Weight Decay | 0.1 |
	| Warm-up Ratio | 0.03 |
	| Batch Size | 16 |
	| Gradient Accumulation Steps | 1 |
	| Seq Len | 4096 |
	| Dtype | bf16 |
	| Peak GPU Memory | < 48 GB |
	| Device | NVIDIA A100-SXM4-80GB |
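The learning-rate settings in the table can be turned into a per-step schedule. The sketch below assumes a linear warm-up from zero over the first 3% of steps followed by a single half-cosine decay to zero, which is a common reading of a `cosine` scheduler with a warm-up ratio. `lr_at_step` is a hypothetical helper, not the code used for training.

```python
import math


def lr_at_step(
    step: int,
    peak_lr: float = 2e-4,
    total_steps: int = 40_000,
    warmup_ratio: float = 0.03,
) -> float:
    """Learning rate at a given step: linear warm-up, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 600, 1_200, 20_600, 40_000):
        print(f"step {step:>6}: lr = {lr_at_step(step):.2e}")
```

With these values the warm-up lasts 1,200 steps, the peak of 2e-4 is reached at step 1,200, and the rate falls to half the peak midway through the decay phase.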


	The training records are shown below:
	![result](static/result.png)