Discussion: naming pattern to converge on to better identify fine-tunes

#761
by ThiloteE - opened

I want to find finetunes of mistral-7b-v0.3, which has a new tokenizer and is said to be better with its 32k context, but the leaderboard is so full of mistral-7b-v0.1 finetunes that it is impossible to find the newer models. The issue is caused by most model authors not following a standardized naming scheme, which renders the search bar in the leaderboard useless in this case. Since both models I am looking for have the same parameter count, filtering for this property doesn't work either (even if it worked flawlessly, which it doesn't!). Additionally, sometimes only the architecture is mentioned in the model's config.json file, but not the real name of the base model. There is not even a possibility to filter by "last added to leaderboard", which would at least be a workaround, albeit a very unsatisfactory one.

I am a little at a loss. The few possibilities for improvement I can think of are:

  • standardization of base model names within the model name. Example: https://github.com/ggerganov/ggml/issues/820.
    <BaseModel>-<Version>-<Model>-<Version>-<ExpertsCount>x<Parameters>-<Quantization>.gguf or <Model>-<Version>-<BaseModel>-<Version>-<ExpertsCount>x<Parameters>-<Quantization>.gguf
  • standardization of base model names in the accompanying configuration files, while allowing arbitrary model names.
  • Having a leaderboard "cleanup crew" or a script that manually adds tags, labels and notes to models whose author is unresponsive, and hides models with unsatisfactory model cards and model names from the default view of the leaderboard. Forcefully rename a model only in documented and exceptional circumstances.

TL;DR: There is no standardized naming scheme, the search feature is insufficient and model authors fail to provide relevant information. How to find finetunes of specific base models?

Open LLM Leaderboard org

Hi!

I agree that more standard naming conventions would be great, and I like the pattern you are suggesting in your first bullet!
At the moment, we already apply the third option, within the time we have available for this - we don't allow models with no model cards, and manually add tags to the leaderboard's view of the model depending on user reports. However, we won't manually manage the naming convention issues of all available models.

For your initial question about how to find fine-tunes of specific base models, I don't have a better solution for you right now.

I'm going to leave the discussion open to gather feedback from other users on which conventions would be interesting to follow and see what we converge on.

I am not particularly good at coding, but at the very least I could create a regex that checks whether a model name deviates from a particular standardized naming scheme.
Instead of checking whether the name is fully correct, it could simply check whether the name fails to adhere to basic syntax, as in the sketch below.
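
For illustration, here is a minimal Python sketch of that "basic syntax" idea: rather than validating the full convention, it only flags names that are clearly missing its building blocks (the check descriptions and patterns below are made up for the example):

import re

# Hypothetical "basic syntax" checks: flag names that lack the convention's building blocks.
BASIC_CHECKS = {
    "missing a version token such as -v0.3": re.compile(r"-v\d+(\.\d+)*", re.IGNORECASE),
    "missing a parameter-count token such as 7b or 8x7b": re.compile(r"(\d+x)?\d+b\b", re.IGNORECASE),
}

def naming_problems(model_name: str) -> list[str]:
    """Return a list of basic-syntax problems; an empty list means nothing obvious is wrong."""
    return [problem for problem, pattern in BASIC_CHECKS.items() if not pattern.search(model_name)]

print(naming_problems("mistral-7b-instruct"))               # ['missing a version token such as -v0.3']
print(naming_problems("Mistral-v0.3-Hermes-v1.0-7b-DPO"))   # []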

I do feel the naming scheme would best follow a progression pattern that tells the story of the model.
<BaseModel>-<Version>-<Model>-<Version>-<ExpertsCount>x<Parameters>-<Methods>-<Quantization>.gguf

Without needing to know the history of LLMs...
I would know what it is based on: <BaseModel>-<Version>
I would know which variant it is: <Model>-<Version>-<ExpertsCount>x<Parameters>
I would know how it was modified: <Methods>-<Quantization>

clefourrier changed discussion title from How to find finetunes with specific basemodel? to Discussion: naming pattern to converge on to better identify fine-tunes

I thought about it some more. Quantization is nice to have, but not a strict requirement on this leaderboard; models that are not quantized are evaluated as well.

<BaseModel>-<Version>-<Model>-<Version>-<ExpertsCount>x<Parameters>-<MethodsorVariant>

I have experimented with a regex that would detect said pattern.

If it helps, I've also done a similar experiment. Anyhow, this is the current shape my PR https://github.com/ggerganov/llama.cpp/pull/7499 is taking (FYI, this is mostly inspired by TheBloke's naming scheme, e.g. https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/tree/main).

js regex pattern detection example
#!/usr/bin/env node

const ggufRegex = /^(?<model_name>[A-Za-z0-9\s-]+)-(?:(?<experts_count>\d+)x)?(?<model_weights>\d+[A-Za-z]+)(?:-(?<fine_tune>[A-Za-z0-9\s-]+))?(?:-(?<version_string>v\d+(?:\.\d+)*))?-(?<encoding_scheme>[\w_]+)(?:-(?<shard>\d{5})-of-(?<shardTotal>\d{5}))?\.gguf$/;

function parseGGUFFilename(filename) {
  const match = ggufRegex.exec(filename);
  if (!match) 
    return null;
  const {
    model_name, version_string = null, experts_count = null, model_weights,
    fine_tune = null, encoding_scheme, shard = null, shardTotal = null,
  } = match.groups;
  return {
    modelName: model_name.trim().replace(/-/g, ' '),
    expertsCount: experts_count ? +experts_count : null,
    model_weights,
    fine_tune,
    version: version_string,
    encodingScheme: encoding_scheme,
    shard: shard ? +shard : null,
    shardTotal: shardTotal ? +shardTotal : null,
  };
}

const testCases = [
  {filename: 'Llama-7B-Q4_0.gguf', expected: { modelName: 'Llama', expertsCount: null, model_weights: '7B', fine_tune: null, version: null, encodingScheme: 'Q4_0', shard: null, shardTotal: null }},
  {filename: 'Llama-7B-v1.0-Q4_0.gguf', expected: { modelName: 'Llama', expertsCount: null, model_weights: '7B', fine_tune: null, version: 'v1.0', encodingScheme: 'Q4_0', shard: null, shardTotal: null }},
  {filename: 'GPT-3-175B-v3.0.1-F16.gguf', expected: { modelName: 'GPT 3', expertsCount: null, model_weights: '175B', fine_tune: null, version: 'v3.0.1', encodingScheme: 'F16', shard: null, shardTotal: null }},
  {filename: 'GPT-NeoX-20B-v0.9-Q4_K-00001-of-00010.gguf', expected: { modelName: 'GPT NeoX', expertsCount: null, model_weights: '20B', fine_tune: null, version: 'v0.9', encodingScheme: 'Q4_K', shard: 1, shardTotal: 10 }},
  {filename: 'EleutherAI-13B-v2.1.4-IQ4_XS-00002-of-00005.gguf', expected: { modelName: 'EleutherAI', expertsCount: null, model_weights: '13B', fine_tune: null, version: 'v2.1.4', encodingScheme: 'IQ4_XS', shard: 2, shardTotal: 5 }},
  {filename: 'Llama-7B-Research-v1.0-Q4_0.gguf', expected: { modelName: 'Llama', expertsCount: null, model_weights: '7B', fine_tune: 'Research', version: 'v1.0', encodingScheme: 'Q4_0', shard: null, shardTotal: null }},
  {filename: 'GPT-3-175B-Instruct-v3.0.1-F16.gguf', expected: { modelName: 'GPT 3', expertsCount: null, model_weights: '175B', fine_tune: 'Instruct', version: 'v3.0.1', encodingScheme: 'F16', shard: null, shardTotal: null }},
  {filename: 'not-a-known-arrangement.gguf', expected: null},
];

testCases.forEach(({ filename, expected }) => {
  const result = parseGGUFFilename(filename);
  const passed = JSON.stringify(result) === JSON.stringify(expected);
  console.log(`${filename}: ${passed ? "PASS" : "FAIL"}`);
});

Regarding the question about differentiating between the base and fine-tune model versions, there are two approaches I can think of so far (a small sketch of the second follows the list):

  • Two separate version numbers: Mixtral-8x7B-v2.3-Instruct-v1.0-Q2_K.gguf
    • Pros: allows a semver-style approach on both the base and finetune sections
    • Cons: longer filename, and two versions in one string is less obvious
  • One version number of the form v<base major>.<base minor>.<finetune>:
    • Mixtral-8x7B-Instruct-v2.3.1-Q2_K.gguf means base model version v2.3 and fine-tune edition 1
    • Pros: only one string to visually track
    • Cons: the version string is less flexible and leaves fewer digits for the finetune
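
As a rough illustration of the second approach (the helper and example filename structure are just assumptions for this sketch), the combined string could be split back into its parts like this:

import re

# Hypothetical parser for the combined v<base major>.<base minor>.<finetune> scheme.
COMBINED_VERSION = re.compile(r"-v(?P<base_major>\d+)\.(?P<base_minor>\d+)\.(?P<finetune>\d+)-")

def split_version(filename: str):
    m = COMBINED_VERSION.search(filename)
    if not m:
        return None
    return {
        "base_version": f"v{m['base_major']}.{m['base_minor']}",
        "finetune_edition": int(m["finetune"]),
    }

print(split_version("Mixtral-8x7B-Instruct-v2.3.1-Q2_K.gguf"))
# {'base_version': 'v2.3', 'finetune_edition': 1}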

Have a chat about it, see what you like, and I'll give it consideration. The second approach of using just one version string would fortunately mean I won't have to do any extra coding.


FYI, I previously added the initial naming standard in https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#gguf-naming-convention, but I was not happy with the pattern as it stands, as I felt versions wouldn't be grouped correctly if there is a different parameter mix or model name (or, as I learned, finetuning). Hence this new PR to experiment with a better file naming convention, which I arrived at by studying TheBloke's naming approach.

Your javascript regex is great!
I don't think users will like having two version numbers merged into one, such as v<base major>.<base minor>.<finetune>. It is confusing.
Either have two version numbers or leave out the version number of the base model completely.

Can we not store and then fetch the base model information from within the model file, the readme, the config.json or another configuration file? I think sometimes it is mentioned in the config.json, but not every time. That would require improving the leaderboard's search feature though; basically, this is a feature request to add full-text search to the leaderboard.
I am not sure about old models (that would require pull requests and some nagging), but at least new models could only be allowed to be uploaded under the new standard.
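
To illustrate what fetching that information could look like, here is a rough Python sketch. It assumes the huggingface_hub package is available and that the repo declares base_model in its card metadata; the config.json fallback keys are best-effort guesses, not a guaranteed source:

import json
from huggingface_hub import ModelCard, hf_hub_download

def guess_base_model(repo_id: str):
    """Best-effort lookup of the base model for a Hub repo."""
    # 1. Model card metadata: many repos declare `base_model:` in the YAML front matter.
    try:
        card = ModelCard.load(repo_id)
        base = card.data.to_dict().get("base_model")
        if base:
            return base
    except Exception:
        pass
    # 2. Fall back to config.json, which sometimes records the source checkpoint or architecture.
    try:
        config_path = hf_hub_download(repo_id, "config.json")
        with open(config_path) as f:
            config = json.load(f)
        return config.get("_name_or_path") or config.get("architectures")
    except Exception:
        return None

print(guess_base_model("mistralai/Mistral-7B-Instruct-v0.3"))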

By the way, here is my python regex for <BaseModel>-<Version>-<Model>-<Version>-<ExpertsCount>x<Parameters>-<MethodsorVariant>:

  • ^(?!-|_|\d)\w+(?<!\db)-v\d+\.\d+-\w+-v\d+\.\d+(-\d+b-\w+|-\d+x\d+b-\w+-*|-\d+b-\w+-.*|-\w+-\d+b|-\w+-\w+-.*)

It is not yet perfect but it is not half bad either, as you can see in the test cases.

https://regex101.com/r/yFZD9f/6
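
As a quick sanity check, the same pattern can also be run locally. A minimal Python sketch (the example names are made up, and re.IGNORECASE is my assumption so that both "7b" and "7B" are accepted):

import re

# The pattern from above, compiled; IGNORECASE is an assumption for mixed-case sizes like "7B".
PATTERN = re.compile(
    r"^(?!-|_|\d)\w+(?<!\db)-v\d+\.\d+-\w+-v\d+\.\d+"
    r"(-\d+b-\w+|-\d+x\d+b-\w+-*|-\d+b-\w+-.*|-\w+-\d+b|-\w+-\w+-.*)",
    re.IGNORECASE,
)

def follows_convention(name: str) -> bool:
    return PATTERN.match(name) is not None

for name in [
    "Mistral-v0.3-Hermes-v1.0-7b-DPO",      # conforming
    "Mixtral-v0.1-Instruct-v0.2-8x7b-DPO",  # conforming, MoE
    "mistral-7b-instruct",                  # non-conforming: no version tokens
]:
    print(f"{name}: {'matches' if follows_convention(name) else 'does not match'}")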

I don't think we need to worry much about name length if it is merely due to a section for version.

I can see how several version numbers would enhance the naming convention while assisting with organization (similar to yyyy/mm/dd).
As long as the naming convention remains progressive, the extra versions only add value.
Llama-ver-Hermes-ver-Instruct-DPO-ver

Llama-v1.0-Hermes-v1.0-Instruct-DPO-v0.1
Llama-v1.0-Hermes-v1.0-Instruct-DPO-v0.2
Llama-v2.0-Hermes-v1.5-Instruct-DPO-v0.1
Llama-v2.0-Hermes-v2.0-Instruct-DPO-v0.1
Llama-v2.0-Hermes-v2.0-Instruct-DPO-v0.2
Llama-v2.0-Hermes-v2.0-Instruct-SLERP-v0.1
Llama-v3.0-HermesPro-v2.5-Instruct-DPO-v0.1

Okay, I've updated the PR https://github.com/ggerganov/llama.cpp/pull/7499 further to take the model card as an extra background metadata source (main discussion to go to https://github.com/ggerganov/ggml/issues/820).

As for the base model version, I'm now convinced that we should not include it in the name. Instead I'm now focusing on putting as much information as is useful for the leaderboard into the GGUF KV store. The Huggingface team previously mentioned that they can easily parse the GGUF KV store, so it won't be an issue.

In my opinion, the information that goes into the filename MUST relate only to the finetuned model itself.

So these are the KVs that I think may be of interest to the leaderboard (a mix of existing KV names plus some new ones, marked with +); a sketch of what they could look like for a concrete model follows the list:

general.name
general.basename +
general.finetune +
general.author
general.version
general.base_version +
general.url
general.description
general.license
general.license.name +
general.license.link +
general.source.url
general.source.huggingface.repository
general.file_type
general.parameter_size_class +
general.tags +
general.languages +
general.datasets +
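
For illustration only (the model, values and exact semantics of the new keys are hypothetical), a populated subset of these KVs could look like:

# Hypothetical example of the proposed KV metadata for a fine-tune of Mistral.
proposed_kv = {
    "general.name": "Hermes-2-Pro-Mistral-7B",
    "general.basename": "Mistral",                   # + new: base model name
    "general.finetune": "Hermes-2-Pro",              # + new: fine-tune name
    "general.version": "v1.0",
    "general.base_version": "v0.1",                  # + new: base model version
    "general.parameter_size_class": "7B",            # + new
    "general.license": "apache-2.0",
    "general.tags": ["instruct", "dpo"],             # + new
    "general.languages": ["en"],                     # + new
    "general.datasets": ["teknium/OpenHermes-2.5"],  # + new
}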

By the way, we were also wondering whether it would make sense to include a model hash ID (UUID?).

Also, if so, should the model hash be dependent on or independent of the quantization that was applied to it?

  1. We could just hash the GGUF tensor payload (excluding metadata) straight up (easy to do, see the sketch after this list)... but any change to quantization will change the hash. This is good if you consider different quantizations to be different models.

  2. Investigate some form of hashing that survives quantization. This would mean that multiple files converted from the same model share the same hash. This is proving technically difficult to do, and I'm not sure it matters much to the community if the lineage can be traced anyway via general.source in the KV store.
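
A conceptual Python sketch of option 1 (the helper and its interface are made up; it assumes the tensor data is already available as arrays in memory rather than read from a GGUF file):

import hashlib
import numpy as np

def model_tensor_hash(tensors: dict[str, np.ndarray]) -> str:
    """Hash only the tensor payload, not the metadata (option 1).

    Any re-quantization changes the raw bytes and therefore the hash.
    """
    h = hashlib.sha256()
    for name in sorted(tensors):          # fixed ordering keeps the hash deterministic
        h.update(name.encode("utf-8"))
        h.update(tensors[name].tobytes())
    return h.hexdigest()

# Example with a dummy tensor:
print(model_tensor_hash({"blk.0.attn_q.weight": np.zeros((4, 4), dtype=np.float16)}))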

The benefit I can see of having some form of UUID would be disambiguating specific models by hash on Hugging Face, especially if there are multiple models sharing the same name.
