---
license: apache-2.0
language:
  - en
base_model:
  - pszemraj/flan-t5-large-grammar-synthesis
pipeline_tag: text2text-generation
tags:
  - grammar
  - spelling
---

# flan-t5-large-grammar-synthesis - GGUF

GGUF files for flan-t5-large-grammar-synthesis, for use with Ollama, llama.cpp, or any other framework that supports T5 models in GGUF format.

This repo mostly contains higher-precision (larger) quants: since the point of this model is grammar and spelling correction, low-precision quants tend to introduce incorrect fixes and are of little practical use.

Refer to the original repo for more details.
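To run the model through Ollama, one common approach is to import a GGUF via a minimal Modelfile. This is a sketch, not taken from the original repo; the quant filename and model name below are assumptions, so substitute whichever file you downloaded:

```
# Modelfile (hypothetical filename; point FROM at the quant you downloaded)
FROM ./grammar-synthesis-Q6_K.gguf
PARAMETER temperature 0
```

Then build and run it with `ollama create grammar-synthesis -f Modelfile` followed by `ollama run grammar-synthesis "your text here"`.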

## Usage

You can use the GGUFs with llamafile (or llama-cli) like this:

```sh
llamafile.exe -m grammar-synthesis-Q6_K.gguf --temp 0 -p "There car broke down so their hitching a ride to they're class."
```

and it will output the corrected text:

```
system_info: n_threads = 4 / 8 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0

 The car broke down so they had to take a ride to school. [end of text]

llama_print_timings:        load time =     782.21 ms
llama_print_timings:      sample time =       0.23 ms /    16 runs   (    0.01 ms per token, 68376.07 tokens per second)
llama_print_timings: prompt eval time =      85.08 ms /    19 tokens (    4.48 ms per token,   223.33 tokens per second)
llama_print_timings:        eval time =     341.74 ms /    15 runs   (   22.78 ms per token,    43.89 tokens per second)
llama_print_timings:       total time =     456.56 ms /    34 tokens
Log end
```

If you have a GPU, be sure to add `-ngl 9999` to your command to automatically offload as many layers as the GPU can handle for faster inference.
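For correcting more than one sentence, a simple option is to loop over a text file from the shell. This is a sketch, not from the original repo: the input filename, binary path, and the `--no-display-prompt` flag (present in recent llama.cpp builds) are assumptions:

```sh
# Hypothetical batch correction: one sentence per line in input.txt.
# --temp 0 keeps output deterministic; --no-display-prompt suppresses the echoed prompt.
while IFS= read -r line; do
  ./llama-cli -m grammar-synthesis-Q6_K.gguf --temp 0 --no-display-prompt -p "$line"
done < input.txt
```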