	---
	license: apache-2.0
	language:
	- en
	base_model:
	- pszemraj/flan-t5-large-grammar-synthesis
	pipeline_tag: text2text-generation
	tags:
	- grammar
	- spelling
	---

	# flan-t5-large-grammar-synthesis - GGUF


	GGUF files for [flan-t5-large-grammar-synthesis](https://huggingface.co/pszemraj/flan-t5-large-grammar-synthesis), for use with Ollama, llama.cpp, or any other framework that supports T5 models in GGUF format.

	This repo mostly contains higher-precision (larger) quants. The model exists to correct grammar and spelling, and at low precision it tends to produce incorrect fixes, which makes it of little use.

	Refer to the original repo for more details.

	## Usage

	You can use the GGUFs with [llamafile](https://github.com/Mozilla-Ocho/llamafile) (or llama-cli) like this:

	```
	llamafile.exe -m grammar-synthesis-Q6_K.gguf --temp 0 -p "There car broke down so their hitching a ride to they're class."
	```

	and it will output the corrected text:

	```
	system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
	sampling:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
	sampling order:
	CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
	generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0


	The car broke down so they had to take a ride to school. [end of text]


	llama_print_timings: load time = 782.21 ms
	llama_print_timings: sample time = 0.23 ms / 16 runs ( 0.01 ms per token, 68376.07 tokens per second)
	llama_print_timings: prompt eval time = 85.08 ms / 19 tokens ( 4.48 ms per token, 223.33 tokens per second)
	llama_print_timings: eval time = 341.74 ms / 15 runs ( 22.78 ms per token, 43.89 tokens per second)
	llama_print_timings: total time = 456.56 ms / 34 tokens
	Log end
	```
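
In the sample output, the corrected sentence ends with an `[end of text]` marker and sits between log lines. If you want only the corrected text (for example, to use it in a script), a minimal sketch like the following may help. `strip_eot` is a hypothetical helper name, not part of llamafile, and the commented-out pipeline assumes the log lines go to stderr, which you should verify for your build:

```shell
# Hypothetical helper: remove a trailing " [end of text]" marker from each line.
strip_eot() {
  sed 's/ *\[end of text]$//'
}

# Demonstration on the generated line from the sample output.
printf '%s\n' "The car broke down so they had to take a ride to school. [end of text]" | strip_eot

# Assumed usage with the real binary (logs assumed to be on stderr):
# llamafile.exe -m grammar-synthesis-Q6_K.gguf --temp 0 -p "Your text here." 2>/dev/null | strip_eot
```

Lines without the marker pass through unchanged, so the filter is harmless if your build does not print it to stdout.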

	If you have a GPU, add `-ngl 9999` to your command. This requests that up to 9999 layers be offloaded to the GPU, which in practice means every layer of the model, for faster inference.
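
As a sketch of how the GPU flag combines with the earlier example, the snippet below assembles the full command and prints it so you can review it before running. `build_cmd` and the variable names are illustrative only; the model file name and flags are the ones used in this README:

```shell
# Values taken from the usage example; adjust to your setup.
MODEL="grammar-synthesis-Q6_K.gguf"
NGL=9999   # number of layers to request on the GPU

# Hypothetical helper: print the full command line for a given prompt.
build_cmd() {
  printf 'llamafile.exe -m %s --temp 0 -ngl %s -p "%s"\n' "$MODEL" "$NGL" "$1"
}

build_cmd "There car broke down so their hitching a ride to they're class."
```

Copy the printed line into your terminal to run it. The helper does not escape double quotes inside the prompt, so it is only suitable for simple inputs.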