Empowering Character-level Text Infilling by Eliminating Sub-Tokens

📄 Paper • 🏠 Repo • 🤖 Models

Introduction

FIM-SE stands for Fill-In-the-Middle with both Starting and Ending character constraints. The method addresses character-level infilling by using a line-level format, so that the model never has to predict a sub-token during inference.


Models

Model | Checkpoint | Size | License
FIM-SE-CL-7B | 🤗 HF Link | 7B | Llama2
FIM-SE-CL-34B | 🤗 HF Link | 13B | Llama2
FIM-SE-SC-1B | 🤗 HF Link | 1B | StarCoder
FIM-SE-SC-15B | 🤗 HF Link | 15B | StarCoder

How to Use

Prompt Format

As shown in the figure, the prompt is organized as follows:

<PRE>R-Prefix<SUF>R-Suffix<START>L-Prefix<END>F-Suffix<MID>
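
The template above can be assembled with a few lines of Python. The following is a minimal sketch, not the authors' implementation. It assumes that L-Prefix is the last (possibly incomplete) line of the prefix, F-Suffix is the first (possibly incomplete) line of the suffix, and R-Prefix and R-Suffix are the remaining complete lines on each side. The helper names `split_context` and `build_prompt` are hypothetical, and the exact whitespace and newline handling around the special tokens is also an assumption; the GitHub Repo is the authoritative source.

```python
def split_context(prefix: str, suffix: str):
    """Split a character-level prefix/suffix at line boundaries.

    Assumed scheme (not confirmed by the model card):
      R-Prefix = complete lines of the prefix, L-Prefix = its last partial line,
      F-Suffix = first partial line of the suffix, R-Suffix = the rest of it.
    """
    # Everything up to and including the last newline is complete lines.
    cut = prefix.rfind("\n") + 1  # rfind returns -1 if no newline, so cut becomes 0
    r_prefix, l_prefix = prefix[:cut], prefix[cut:]

    # The first line of the suffix ends just before its first newline.
    nl = suffix.find("\n")
    if nl == -1:
        f_suffix, r_suffix = suffix, ""
    else:
        f_suffix, r_suffix = suffix[:nl], suffix[nl:]
    return r_prefix, l_prefix, f_suffix, r_suffix


def build_prompt(prefix: str, suffix: str) -> str:
    """Fill the <PRE>...<SUF>...<START>...<END>...<MID> template."""
    r_prefix, l_prefix, f_suffix, r_suffix = split_context(prefix, suffix)
    return (
        f"<PRE>{r_prefix}<SUF>{r_suffix}"
        f"<START>{l_prefix}<END>{f_suffix}<MID>"
    )


if __name__ == "__main__":
    # The cursor sits in the middle of the word "return".
    print(build_prompt("def add(a, b):\n    ret", "rn a + b\nprint(add(1, 2))\n"))
```

Under this assumed split, the character-level boundaries ("    ret" and "rn a + b") end up after `<START>` and `<END>` as constraints, while the `<PRE>` and `<SUF>` segments contain only whole lines.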

Inference Code

Please refer to our GitHub Repo for more technical details.
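
For orientation, a generic way to run one of the checkpoints with the Hugging Face `transformers` library is sketched below. This is not the inference script from the repo. It assumes the checkpoint loads with `AutoModelForCausalLM`, that the special tokens of the prompt format are already in the tokenizer vocabulary, that bfloat16 weights and greedy decoding are suitable, and that `accelerate` is installed for `device_map="auto"`. The function name `generate_infill` is hypothetical, and the prompt is expected to be already formatted as described under Prompt Format.

```python
def generate_infill(
    prompt: str,
    model_id: str = "SenseLLM/FIM-SE-SC-15B",
    max_new_tokens: int = 64,
) -> str:
    """Generate a completion for an already formatted FIM-SE prompt (sketch)."""
    # Imported lazily so that defining the function does not require the heavy deps.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # assumption: bf16 weights
        device_map="auto",           # requires the `accelerate` package
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding, chosen here for reproducibility
        )

    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Any post-processing of the generated text, such as stop conditions or handling of the start and end constraints, is left out here and should follow the repo.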

Citation

If you find this repo useful for your research, please kindly cite our paper:

@misc{ren2024empowering,
    title={Empowering Character-level Text Infilling by Eliminating Sub-Tokens}, 
    author={Houxing Ren and Mingjie Zhan and Zhongyuan Wu and Hongsheng Li},
    year={2024},
    eprint={2405.17103},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgments

We thank the following amazing projects that truly inspired us:
