200K Version
This LoRA seems to work on the 200K version of Yi, but if you ever revisit it, would you consider using that as the base model instead?
The long context is hugely useful (and seems to work well).
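For anyone wanting to try the same thing, here's a minimal sketch of loading a LoRA trained on the base Yi-34B on top of the 200K checkpoint with transformers + PEFT. The adapter repo ID is a placeholder, not the actual adapter from this thread, and this assumes the two checkpoints share the same architecture (which they appear to).

```python
# Sketch: apply an adapter trained on base Yi-34B to the long-context base instead.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "01-ai/Yi-34B-200K"            # long-context base model
adapter_id = "your-name/your-yi-lora"    # hypothetical adapter repo ID

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    device_map="auto",
    torch_dtype="auto",
)
# Works as long as the adapter's target modules match the base architecture.
model = PeftModel.from_pretrained(model, adapter_id)
```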
It would be possible, but I chose to train on the base version to maintain (hopefully) better compatibility for merging to other models, especially since the base 34B is already usable up to 32K context at inference. I would also like to try training the dataset at full length rather than 4K, but that was a bit compute-prohibitive.
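As a rough illustration of the merging point: an adapter trained against the base checkpoint can be folded into those weights and then combined with other Yi-34B fine-tunes. This is only a sketch with placeholder paths, not the exact workflow used here.

```python
# Sketch: merge the LoRA into the base checkpoint so the result can be blended
# with other Yi-34B fine-tunes using standard model-merging tools.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B",          # base (non-200K) checkpoint
    device_map="auto",
    torch_dtype="auto",
)
merged = PeftModel.from_pretrained(base, "your-name/your-yi-lora").merge_and_unload()
merged.save_pretrained("yi-34b-merged")  # plain checkpoint, ready for merging
```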
maintain (hopefully) better compatibility for merging to other models
You mean most other trainers will use the base 34B model as well?
IDK, long context with no RoPE stretching is super appealing to me. I figured everyone would default to the 200K model.
Also, I believe at least one other trainer is doing the 200K model: https://old.reddit.com/r/LocalLLaMA/comments/17rzed4/yi34b_vs_yi34b200k_on_sequences_32k_and_4k/