Over the weekend, after a failed initial run, I got excited by Pete's successful Jamba tuning and decided to throw a little compute at a similar-sized dataset (the main shisa-v1 bilingual tuning set).

As with my initial runs, the training graphs looked fine, but the results were less than spectacular.

Here are the JA MT-Bench evals for the 2416 checkpoint (at the eval/loss plateau) and the 4228 (3-epoch) tune:

| Checkpoint | JA MT-Bench |
|---|---|
| shisa-jamba-v1-checkpoint-2416 | 2.491525 |
| shisa-jamba-v1-checkpoint-4228 | 2.508475 |

You can view the answers in the repo (lots of repetitions and nonsense) and compare them to proper JA MT-Bench scores from my testing; since MT-Bench is scored on a 1-10 scale, scores around 2.5 mean the model is mostly producing broken output.

While an "unsuccessful" experiment, it was still worth the practice, although I obviously got a little overexcited and should have started with my more typical lighter testing.

This kicks off the official shisa-v2 base model evaluation. I was a bit hesitant about throwing this model out there (since it's useless as an artifact), but since I've actually made the in-process code available while working on it, I'll share this as well just in case (and to do this writeup).

Here are the current full code/steps for the Axolotl training and eval (with the modified llm-judge inferencing code):
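
For orientation, here's a hypothetical sketch of what an Axolotl config for this kind of Jamba fine-tune looks like. The dataset path and every hyperparameter below are placeholders for illustration, not the values actually used:

```yaml
# Hypothetical Axolotl config sketch for a Jamba fine-tune.
# Dataset path and all hyperparameters are placeholders, not the real values.
base_model: ai21labs/Jamba-v0.1
trust_remote_code: true

datasets:
  - path: your-org/your-bilingual-sft-set   # placeholder for the shisa-v1 tuning set
    type: sharegpt
val_set_size: 0.05
output_dir: ./shisa-jamba-v1

sequence_len: 4096
sample_packing: true

num_epochs: 3
micro_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 2e-5
lr_scheduler: cosine
optimizer: adamw_torch
warmup_steps: 100

bf16: true
gradient_checkpointing: true
flash_attention: true
```

A config like this gets launched with `accelerate launch -m axolotl.cli.train config.yml`.

For the eval side, the actual run used the modified llm-judge inferencing code; as a stand-in, here's a minimal answer-generation sketch using plain transformers (it assumes the checkpoint's tokenizer ships a chat template, and the prompt is just an example):

```python
# Minimal sketch of MT-Bench-style answer generation with plain transformers;
# the real eval used modified llm-judge inferencing code instead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "shisa-ai/shisa-jamba-v1-checkpoint-4228"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "日本の四季について説明してください。"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```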

Thanks to Pete for the useful initial report, and to the Axolotl team for their fast integration of Jamba (way better than my raw tuning code).
