which dataset?

#1
by cruiser - opened

gpt4all or something else? If gpt4all, hopefully it was trained on the unfiltered dataset with all the "as a large language model" responses removed.

I hope it's a GPT-4 dataset without any "I'm sorry, as a large language model" bullshit inside.

It is GPT-4 self-instruct.

I have GPT4All installed. How do I install this?

This has nothing to do with GPT4All; you pip install transformers and run inference on this model.
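For example, a minimal inference sketch (the model path and the Alpaca-style prompt format are assumptions; substitute the actual repo id or local path):

```python
# Minimal inference sketch; "path/to/this-model" is a placeholder and the
# Alpaca-style "### Instruction / ### Response" prompt format is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/this-model")
model = AutoModelForCausalLM.from_pretrained("path/to/this-model")

prompt = "### Instruction:\nExplain what a self-instruct dataset is.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```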

will you be releasing the dataset for further research?

Here is the dataset: https://github.com/teknium1/GPTeacher
It's the general-instruct set.

ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported.

You are not in the right thread! Try 'pip install git+https://github.com/huggingface/transformers' in your environment.
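A quick sanity check after reinstalling (just a sketch; no model download needed):

```python
# After installing transformers from source, the Llama classes should import cleanly.
import transformers
from transformers import LlamaTokenizer  # raises ImportError on older releases

print(transformers.__version__)
```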

Yes, there is an issue with the class names in the model config: open it up and change LLaMA to Llama.

There is also an issue in the tokenizer config: change 512 to 2048 in the max token field.
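For reference, a sketch of those two edits; the field names assume the standard config.json / tokenizer_config.json layout and can differ slightly between transformers versions:

```python
# Sketch of the two config edits described above; run from the model directory.
# Field names assume the usual config.json / tokenizer_config.json layout.
import json

with open("config.json") as f:
    cfg = json.load(f)
cfg["architectures"] = ["LlamaForCausalLM"]   # was ["LLaMAForCausalLM"]
cfg["tokenizer_class"] = "LlamaTokenizer"     # was "LLaMATokenizer"
with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)

with open("tokenizer_config.json") as f:
    tok_cfg = json.load(f)
tok_cfg["model_max_length"] = 2048            # was 512
with open("tokenizer_config.json", "w") as f:
    json.dump(tok_cfg, f, indent=2)
```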

Hi @teknium, where in the fine-tuning code is the max sequence length changed to 2048?

LLaMA is already set for 2048 tokens; it's just set wrong in the config here.

Thank you, I got it now.

It seems weird that LLaMA recommends changing the config to 512 to make it fit better on GPUs; I always thought that the input size to an LLM is fixed and shorter inputs are always padded to the maximum length anyway. A question that doesn't relate to this repo, but:

How does reducing the sequence length to 512 during inference (like in LLaMA) help? Wouldn't the model just pad to the maximum size it was trained on?
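To make the question concrete, a tiny sketch (the model path is a placeholder):

```python
# Tiny sketch to make the question concrete; "path/to/model" is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")
ids = tokenizer("A short prompt.", return_tensors="pt").input_ids
print(ids.shape[-1])               # only a handful of tokens
print(tokenizer.model_max_length)  # 2048 after the config fix above
```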
