Training data

#2
by Vezora - opened

Awesome model and paper, congratulations! I was really wondering/hoping whether there are any plans to release the training data used for this model?
Thank you!

Hey, thanks for asking. We are currently evaluating whether to open-source the training data and the code used to create it. The sponsor provided a substantial amount of money (for us) and the corresponding GPU resources for early-stage trial and error, as well as for the subsequent data generation and model training. We are still discussing the related open-sourcing matters and will keep you informed if there are any updates. Thanks!

@Bin12345 ❤️❤️ awesome, thank you!

My unsolicited opinion: "I would love to see the data open-sourced!"

Vezora changed discussion status to closed
Vezora changed discussion status to open

@Bin12345 Speaking of the training data, are there any plans for a CodeQwen1.5-7B AutoCoder, and/or a Codestral-22B one?

If you guys have the resources, that would be super cool; otherwise, open-source the data and the community can fine-tune these models on it. Either way, progress should keep moving.

Hey Vezora, thanks for asking. I'm trying to get more resources for this project; I will do that if I can get more GPU node hours. Thanks!

@Bin12345 That would be awesome; I hope you get the GPU hours!! Thank you for the reply!

Looking forward to seeing the CodeQwen 1.5 fine-tune; it would be awesome, as it looks like a very potent model (the chat version is in the top four on the EvalPlus leaderboard). The only current fine-tune is this NXcode CQ Orpo.

@quanthunter Thanks, I will do that!

@Bin12345 Llama-3-8B is a much stronger model than CodeQwen; it was trained on a lot more data (15 trillion tokens). If properly fine-tuned on a coding instruct dataset, it would make a much better coding model. I would suggest using that instead.

See also https://github.com/bin123apple/AutoCoder/issues/6

@Bin12345 said:

Hey, thanks for using AutoCoder! I am also interested in sharing the related dataset and the code for dataset generation. And I hope to contribute to the open-source community.

However, at this time, I am not allowed to do this. I am currently discussing related matters with the sponsors. I will update you promptly if there is any news. Thank you.

@rombodawg @quanthunter I will try to fine-tune CodeQwen 1.5 and Llama-3-8B between June 17th and June 30th, thanks!

I uploaded AutoCoder_QW_7B; its base model is CodeQwen1.5-7B.

Hey, first of all, thanks a lot for listening to our request. I tried to convert the model to GGUF using gguf-my-repo but got this error:

Error: Error converting to fp16:

INFO:hf-to-gguf:Loading model: AutoCoder_QW_7B
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 65536
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 13440
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 4
INFO:hf-to-gguf:gguf: rope theta = 1000000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:hf-to-gguf:
WARNING:hf-to-gguf:**********************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:** There are 2 possible reasons for this:
WARNING:hf-to-gguf:** - the model has not been added to convert-hf-to-gguf-update.py yet
WARNING:hf-to-gguf:** - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:** Check your model files and convert-hf-to-gguf-update.py and update them accordingly.
WARNING:hf-to-gguf:** ref: https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh: cd88bc280b3debbdddec3304015afd7e215c61d674846a2dac7271275384810c
WARNING:hf-to-gguf:**********************************************************************************
WARNING:hf-to-gguf:

Traceback (most recent call last):
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 1622, in set_vocab
    self._set_vocab_sentencepiece()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 582, in _set_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: AutoCoder_QW_7B/tokenizer.model

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 2887, in <module>
    main()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 2872, in main
    model_instance.set_vocab()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 1624, in set_vocab
    self._set_vocab_gpt2()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 509, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 382, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 500, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Any chance you could upload a GGUF version of this? Once again, thanks a lot! Can't wait for the test results (HumanEval & MBPP).
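In case it helps with debugging, here is a minimal diagnostic sketch in Python using huggingface_hub. The repo IDs Bin12345/AutoCoder_QW_7B and Qwen/CodeQwen1.5-7B are assumptions on my part, and so is the idea that the tokenizer.model missing in the first exception could simply be taken from the base model; this is an untested workaround sketch, not a confirmed fix.

```python
# Minimal check of which tokenizer files each repo ships (repo IDs are assumptions).
from huggingface_hub import hf_hub_download, list_repo_files

FINETUNE = "Bin12345/AutoCoder_QW_7B"  # assumed Hub id of the fine-tune
BASE = "Qwen/CodeQwen1.5-7B"           # assumed Hub id of the base model

# The first exception above is a missing sentencepiece file
# (AutoCoder_QW_7B/tokenizer.model), so list the tokenizer-related files per repo.
for repo in (FINETUNE, BASE):
    tok_files = [f for f in list_repo_files(repo) if "token" in f.lower()]
    print(repo, "->", tok_files)

# Untested idea: if the base repo ships tokenizer.model but the fine-tune does not,
# download it and place it next to the fine-tuned weights before running
# llama.cpp's convert-hf-to-gguf.py, so the sentencepiece vocab path can be used.
# Raises EntryNotFoundError if the base repo does not ship the file either.
path = hf_hub_download(BASE, "tokenizer.model")
print("downloaded to:", path)
```

If neither repo ships tokenizer.model, the remaining route is the one the warning itself points to: registering the model's chkhsh via convert-hf-to-gguf-update.py, per https://github.com/ggerganov/llama.cpp/pull/6920.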
