Is my understanding correct that the monkey patch only needs to be added for inference?
i.e. when I convert this model to GGML/GPTQ, I will need to make sure the inference engine is using this patch logic, right?
Correct
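For anyone wiring this up themselves, here is a rough sketch of the kind of RoPE position-interpolation monkey patch this refers to. It is written against the transformers LLaMA code of this era; the scale factor and class internals are illustrative assumptions (they change between transformers versions), and the patch has to run before the model is instantiated.

```python
# Illustrative sketch of a linear RoPE position-interpolation patch.
# SCALE and the class internals are assumptions; adapt to whatever
# LlamaRotaryEmbedding looks like in your installed transformers version.
import torch
from transformers.models.llama import modeling_llama

SCALE = 0.25  # e.g. squeeze 8192 positions into the original 2048-position range

class ScaledRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
        self.register_buffer("inv_freq", inv_freq)
        # Pre-compute cos/sin for the extended context, with the position index
        # scaled down so it interpolates within the range the base model saw.
        self.max_seq_len_cached = int(max_position_embeddings / SCALE)
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype) * SCALE
        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :], persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :], persistent=False)

    def forward(self, x, seq_len=None):
        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )

# Apply before AutoModelForCausalLM.from_pretrained(...) so the LLaMA attention
# layers pick up the patched class when they are constructed.
modeling_llama.LlamaRotaryEmbedding = ScaledRotaryEmbedding
```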
Have you looked into applying this as a config.json patch activated with trust_remote_code=True, like how Landmark Attention is applied, e.g. at https://huggingface.co/eugenepentland/Minotaur-13b-Landmark ?
Then Transformers could auto-load it rather than needing manual editing of inference code. That could make it a lot more accessible, if it's possible?
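Roughly what I have in mind, with file and class names purely illustrative: the repo ships its own modeling file, and config.json points Transformers at it via "auto_map", so trust_remote_code=True pulls the patched classes in for the user automatically.

```python
# Sketch of adding the custom-code hook to an existing config.json.
# "modeling_llama_scaled.ScaledLlamaForCausalLM" is a hypothetical
# module/class name, not the contents of any actual repo.
import json

config_patch = {
    "auto_map": {
        # "<modeling file in the repo>.<class name>"
        "AutoModelForCausalLM": "modeling_llama_scaled.ScaledLlamaForCausalLM"
    }
}

# Merge the hook into the model's config.json (path is a placeholder).
with open("config.json") as f:
    config = json.load(f)
config.update(config_patch)
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```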
That'd be wonderful! I think that will really help to get people using your model. I will provide a quantised GPTQ once that is done, and publicise your work.
Thank you
@TheBloke Since @emozilla has already added the code for trust_remote_code, can you take it from there? https://huggingface.co/emozilla/open_llama_7b-scaled
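Assuming that repo is set up the usual way (I haven't verified its contents), loading it should just be:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "emozilla/open_llama_7b-scaled"
tokenizer = AutoTokenizer.from_pretrained(repo)
# trust_remote_code=True opts in to running the modeling code shipped in the repo
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
```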
Since this is a LoRA, I don't think there's much benefit to putting the code here, no? Only in the final merged model repository.
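For reference, the merge step would look roughly like the sketch below; the base model id, LoRA path, and output directory are placeholders, and the merged folder is where the custom modeling file and auto_map entry would then be added before uploading.

```python
# Sketch of merging the LoRA into its base model so the merged repo can carry
# the trust_remote_code files. All ids and paths here are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "openlm-research/open_llama_7b"   # assumed base model
lora_id = "path/or/repo-of-this-lora"       # placeholder for this LoRA

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, lora_id).merge_and_unload()

out_dir = "open_llama_7b-merged"
merged.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(out_dir)
# Add the custom modeling file and the "auto_map" config entry to out_dir,
# so users only need trust_remote_code=True at load time.
```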