Request for re-opening the test space
Hi, I am currently working on a Dutch LLM project, and I was interested in testing your chat, to find out whether I should follow the same approach you took for creating a chatbot, or whether I should avoid doing it this way.
Hi!
Thank you for your interest in GEITje! It's good to see more people working on Dutch LLMs.
Unfortunately I cannot reopen this space at this time. I don't know if you have seen it yet, but you can read more about the approach I took on the GEITje GitHub page: https://github.com/Rijgersberg/GEITje/blob/main/README-en.md
Thank you so much for the answer. I had already read the GitHub page; thank you very much for providing the information openly. If you have the time, I wanted to ask for your opinion on something.
I am still doing research on LLMs because this will be my first LLM project. I am not finding a lot of information online about tokenizers, but I did find some free, open-source Dutch tokenizers that I can use. I tested them with the GEITje 7B model, and while one of them gives generations quite similar to the Mistral tokenizer, it is better at tokenizing sentences into smaller chunks. Do you think using this new Dutch tokenizer would help the model learn Dutch linguistics better, or would it just complicate the learning process? Should I just stick with the Mistral tokenizer?
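(A minimal sketch of this kind of tokenizer comparison with the Hugging Face `transformers` library; the Dutch tokenizer id below is just a placeholder, not one of the actual tokenizers tested:)

```python
# Minimal sketch: compare how two tokenizers split the same Dutch sentence.
# "path/to/dutch-tokenizer" is a placeholder for whichever open-source
# Dutch tokenizer is being evaluated.
from transformers import AutoTokenizer

mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
dutch_tok = AutoTokenizer.from_pretrained("path/to/dutch-tokenizer")

sentence = "De kat zat rustig op de vensterbank in het zonnetje."

for name, tok in [("Mistral", mistral_tok), ("Dutch", dutch_tok)]:
    tokens = tok.tokenize(sentence)
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```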
As a heads-up: I first wanted to do full-parameter fine-tuning with chatbot-type Dutch datasets (transfer learning first and then reward modelling) so the model can learn to chat. Afterwards, I would do task-specific fine-tuning with a smaller dataset, so the model can learn to explain how to use the features of my webpage to people who need help. The dataset for this is probably going to be much, much smaller.
Thank you very much in advance.
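(For context, chat-style fine-tuning data is usually stored as lists of role/content messages, one conversation per record; the snippet below is a purely made-up example of what a single Dutch record could look like:)

```python
# Purely illustrative example of one chat-style SFT record (Dutch),
# in the common "messages" format used by chat fine-tuning pipelines.
example_record = {
    "messages": [
        {"role": "user",
         "content": "Hoe reset ik mijn wachtwoord op de website?"},
        {"role": "assistant",
         "content": "Klik rechtsboven op 'Inloggen', kies 'Wachtwoord vergeten' "
                    "en volg de instructies in de e-mail."},
    ]
}
```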
Sounds like a cool project!
In general, tokenizers are tied to a model during the pre-training phase. The first layer of the transformer is a look-up table that maps token ids to (learned) embeddings. Switching to a different tokenizer after pretraining will generally give disastrous results because the sequences of ids don't match.
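A tiny illustration of that mismatch (the Dutch tokenizer id is a placeholder, and the embedding table is a toy one with Mistral-like dimensions):

```python
# Toy illustration: the same sentence gets different ids under different
# tokenizers, but the model's first layer is a fixed look-up table that was
# trained for one specific id assignment.
import torch.nn as nn
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # tokenizer the model was pre-trained with
new_tok = AutoTokenizer.from_pretrained("path/to/dutch-tokenizer")    # placeholder for a different tokenizer

text = "Tokenizer en model horen bij elkaar."
print(old_tok(text, add_special_tokens=False)["input_ids"])
print(new_tok(text, add_special_tokens=False)["input_ids"])  # different ids for the same text

# The first layer is essentially this: a vocab_size x hidden_size table of
# learned embeddings, indexed by token id. Feed it ids from another tokenizer
# and it returns the embeddings of unrelated tokens.
embedding_table = nn.Embedding(num_embeddings=32000, embedding_dim=4096)
```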
However, it is not completely hopeless. There are some tricks to adapt a model from one tokenizer to another through smart initialisation of the embeddings of new tokens. See this work by @FremyCompany, @pdelobelle et al. for an example: https://arxiv.org/abs/2310.03477.
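A very rough sketch of the general idea behind such tricks (not the exact method from the linked paper): initialise each new token's embedding from the old-tokenizer embeddings of the pieces its text decomposes into.

```python
# Rough sketch of the general idea, not the exact method from the linked paper:
# initialise each new-vocabulary token as the mean of the old embeddings of the
# old-tokenizer pieces that make up that token's text.
import torch

def init_new_embedding_matrix(old_tokenizer, new_tokenizer, old_embedding_weight):
    hidden_size = old_embedding_weight.shape[1]
    new_matrix = torch.empty(len(new_tokenizer), hidden_size)
    for new_id in range(len(new_tokenizer)):
        piece_text = new_tokenizer.decode([new_id])
        old_ids = old_tokenizer(piece_text, add_special_tokens=False)["input_ids"]
        if old_ids:
            new_matrix[new_id] = old_embedding_weight[old_ids].mean(dim=0)
        else:
            # fall back to the average embedding for tokens that map to nothing
            new_matrix[new_id] = old_embedding_weight.mean(dim=0)
    return new_matrix
```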
Thank you very much again. After some more research, I decided to stick with the Mistral tokenizer.
I noticed your chat versions of GEITje-7B have the Apache 2.0 license. Would you recommend using those chat models (if yes, which one?) and just doing fine-tuning with my small dataset for the specific task I need? I am pretty low on resources, and I'm not sure I can afford to do all the fine-tuning methods I mentioned in the last comment. I do want my project to work very accurately, though.
I am sorry to bother you with all the questions, but I'm still just a bachelor's student with very little ML & AI knowledge, and this is my first time creating an LLM all by myself. I don't really know any people who are knowledgeable in this field :(
No problem! If you want, come and hang out in the Dutch NLP Discord where more questions like yours are being discussed.
As for chat datasets: after my chat models, @BramVanroy made a better chat model with better chat datasets. You can find more info here: https://huggingface.co/posts/BramVanroy/679226771675158
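If you end up fine-tuning one of those chat models on your small task dataset with limited hardware, a parameter-efficient approach like LoRA is a common way to keep costs down. A minimal sketch with `transformers` + `peft`; the model id, data file, and hyperparameters are placeholders to adapt, and the license of whichever model you pick should be checked first:

```python
# Minimal LoRA fine-tuning sketch with transformers + peft.
# Model id, data file and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "Rijgersberg/GEITje-7B-chat"  # placeholder: pick the chat model you settle on
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# LoRA freezes the base weights and trains small adapter matrices,
# which makes fine-tuning feasible on modest hardware.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         lora_dropout=0.05, task_type="CAUSAL_LM"))

# Placeholder dataset: a JSONL file with one "text" column of chat-formatted examples
dataset = load_dataset("json", data_files="my_task_data.jsonl", split="train")
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
                      batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="geitje-task-lora",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8,
                           num_train_epochs=3,
                           learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```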