Training data

#2
by Vezora - opened

Awesome model and paper, congratulations! I was really wondering/hoping whether there are any plans to release the training data used for this model?
Thank you!

Hey, thanks for asking. We are currently evaluating whether to open-source the training data and the code used to create it. The sponsor provided a substantial amount of money (for us) and the corresponding GPU resources for early-stage trial and error, as well as for the subsequent data generation and model training. We are still discussing the related open-sourcing matters and will keep you informed if there are any updates. Thanks!

@Bin12345 ❤️❤️ awesome, thank you!

My unsolicited opinion: "I would love to see the data open-sourced!"

Vezora changed discussion status to closed
Vezora changed discussion status to open

@Bin12345 Speaking of the training data, are there any plans for a CodeQwen1.5-7B AutoCoder, and/or a Codestral-22B one?

If you guys have the resources, that would be super cool; otherwise, open-source the data and the community can fine-tune these models on it. Either way, progress should keep moving.

Hey Vezora, thanks for asking. I'm trying to get more resources for this project; I will do that if I can get more GPU node hours. Thanks!

@Bin12345 That would be awesome; I hope you get the GPU hours!! Thank you for the reply!

Looking forward to seeing the CodeQwen 1.5 fine-tune; it would be awesome, as it looks like a very potent model (the chat version is in the top four on the EvalPlus leaderboard). The only current fine-tune is this NXcode CQ Orpo.

@quanthunter Thanks, I will do that!

@Bin12345 Llama-3-8B is a much stronger model than CodeQwen; it was trained on a lot more data (15 trillion tokens). If properly fine-tuned on a coding instruct dataset, it would make a much better coding model. I would suggest using that instead.

See also https://github.com/bin123apple/AutoCoder/issues/6

@Bin12345 said:

Hey, thanks for using AutoCoder! I am also interested in sharing the related dataset and the code for dataset generation. And I hope to contribute to the open-source community.

However, at this time, I am not allowed to do this. I am currently discussing related matters with the sponsors. I will update you promptly if there is any news. Thank you.

@rombodawg @quanthunter I will try to fine-tune CodeQwen 1.5 and Llama-3-8B between June 17th and June 30th, thanks!

I uploaded AutoCoder_QW_7B; its base model is CodeQwen1.5-7B.

Hey, first of all, thanks a lot for listening to our request. I tried to convert the model to GGUF using gguf-my-repo but got this error:

Error: Error converting to fp16:

INFO:hf-to-gguf:Loading model: AutoCoder_QW_7B
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 65536
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 13440
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 4
INFO:hf-to-gguf:gguf: rope theta = 1000000
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING:hf-to-gguf:
WARNING:hf-to-gguf:**********************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:** There are 2 possible reasons for this:
WARNING:hf-to-gguf:** - the model has not been added to convert-hf-to-gguf-update.py yet
WARNING:hf-to-gguf:** - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:** Check your model files and convert-hf-to-gguf-update.py and update them accordingly.
WARNING:hf-to-gguf:** ref: https://github.com/ggerganov/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh: cd88bc280b3debbdddec3304015afd7e215c61d674846a2dac7271275384810c
WARNING:hf-to-gguf:**********************************************************************************
WARNING:hf-to-gguf:

Traceback (most recent call last):
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 1622, in set_vocab
    self._set_vocab_sentencepiece()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 582, in _set_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: AutoCoder_QW_7B/tokenizer.model

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 2887, in <module>
    main()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 2872, in main
    model_instance.set_vocab()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 1624, in set_vocab
    self._set_vocab_gpt2()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 509, in _set_vocab_gpt2
    tokens, toktypes, tokpre = self.get_vocab_base()
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 382, in get_vocab_base
    tokpre = self.get_vocab_base_pre(tokenizer)
  File "/home/user/app/llama.cpp/convert-hf-to-gguf.py", line 500, in get_vocab_base_pre
    raise NotImplementedError("BPE pre-tokenizer was not recognized - update get_vocab_base_pre()")
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

Any chance you could upload a GGUF version of this? Once again, thanks a lot! Can't wait for the test results (HumanEval & MBPP).
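In case it helps with debugging, here is a minimal diagnostic sketch in Python using huggingface_hub. The repo IDs Bin12345/AutoCoder_QW_7B and Qwen/CodeQwen1.5-7B are assumptions on my part, and so is the idea that the tokenizer.model missing in the first exception could simply be taken from the base model; this is an untested workaround sketch, not a confirmed fix.

```python
# Minimal check of which tokenizer files each repo ships (repo IDs are assumptions).
from huggingface_hub import hf_hub_download, list_repo_files

FINETUNE = "Bin12345/AutoCoder_QW_7B"  # assumed Hub id of the fine-tune
BASE = "Qwen/CodeQwen1.5-7B"           # assumed Hub id of the base model

# The first exception above is a missing sentencepiece file
# (AutoCoder_QW_7B/tokenizer.model), so list the tokenizer-related files per repo.
for repo in (FINETUNE, BASE):
    tok_files = [f for f in list_repo_files(repo) if "token" in f.lower()]
    print(repo, "->", tok_files)

# Untested idea: if the base repo ships tokenizer.model but the fine-tune does not,
# download it and place it next to the fine-tuned weights before running
# llama.cpp's convert-hf-to-gguf.py, so the sentencepiece vocab path can be used.
# Raises EntryNotFoundError if the base repo does not ship the file either.
path = hf_hub_download(BASE, "tokenizer.model")
print("downloaded to:", path)
```

If neither repo ships tokenizer.model, the remaining route is the one the warning itself points to: registering the model's chkhsh via convert-hf-to-gguf-update.py, per https://github.com/ggerganov/llama.cpp/pull/6920.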
