Model evaluation

#1 by timesler - opened

Hi there, thanks for all the great models! Is there any plan for your team to upload evaluation results for any of the GM models?

We are currently working on publishing some additional benchmarks.

For now, we ran MT-Bench (https://github.com/lm-sys/FastChat).

Here are some results for comparison, with gpt-3.5-turbo as a reference:

| Model | MT-Bench score |
| --- | --- |
| gpt-3.5-turbo | 8.04375 |
| h2ogpt-gm-falcon-40b-v1 | 6.53125 |
| h2ogpt-gm-open-llama-13b | 5.60625 |
| h2ogpt-gm-oasst1-en-xgen-7b-8k | 5.28125 |
| h2ogpt-gm-open-llama-7b | 5.10625 |
| h2ogpt-gm-falcon-7b | 4.92500 |
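
In case it helps anyone reproduce such numbers, here is a minimal sketch of an MT-Bench run using FastChat's llm_judge scripts. It assumes the FastChat repo is cloned with its dependencies installed, that the script runs from the repo's fastchat/llm_judge directory, and that an OpenAI key is set for the GPT-4 judge; the model path and ID below are placeholders, not the exact values used for the table above.

```python
# Sketch: driving FastChat's MT-Bench pipeline from Python.
# Assumes cwd is FastChat's fastchat/llm_judge directory.
import subprocess

model_path = "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2"  # placeholder
model_id = "h2ogpt-gm-falcon-40b-v2"  # free-form label used in result files

# 1) Generate the model's answers to the 80 MT-Bench questions.
subprocess.run(
    ["python", "gen_model_answer.py",
     "--model-path", model_path,
     "--model-id", model_id],
    check=True,
)

# 2) Score the answers with the GPT-4 judge (needs OPENAI_API_KEY set).
subprocess.run(
    ["python", "gen_judgment.py", "--model-list", model_id],
    check=True,
)

# 3) Print the aggregated scores (the MT-Bench score column above).
subprocess.run(["python", "show_result.py"], check=True)
```

By default, gen_judgment.py uses single-answer grading on a 1-10 scale, which matches the score range in the table.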

Hello, what's the main difference between this model and h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v1? I only see v1 and v2 trained on the same dataset, so I was wondering if there's something specific or a quality improvement.

Just a re-run with some personalization and different hyperparameters.

Both should be pretty much on par.

Thanks a lot for the feedback. We are about to use a similar method to build our own open-source model, so I wanted to make sure I'm not missing an important point.

Amazing work!!
