# LanguageWire Technical challenge
Author: Pau Rodriguez Inserte (@pauri32)

## Setup and running instructions
TODO

## Reasoning of the LLM design choices
### Model selected: BloomZ
The model selected for this project is BloomZ with 1.1B parameters. The size was dictated by my hardware limitations: this is the largest model I could fit without a GPU, since 4-bit quantization was not an option on CPU. The model family was chosen for the following reasons:
* BloomZ is an instruction-fine-tuned model with Bloom as its base model. Bloom was trained on 46 different languages, which is highly relevant for the language detection task. A model trained on a smaller subset of languages might also have been a good option, but BloomZ makes it easier to scale to highly multilingual classification.
* English is the main language in the model's training data, so it should still be strong at the entity recognition task. This is also why the instructions are written in English.
* The model comes in a wide range of sizes, up to 176B parameters. If the project continues on different hardware, the model size can be increased without further modifications.
* A model fine-tuned to follow human instructions should perform better on unseen tasks than one trained only on raw documents. Since the model has not been fine-tuned for the specific tasks targeted here, zero-shot and few-shot performance are crucial.

### Task 1 design: Language detection
* The model is instructed to identify whether the language of the string is English, Spanish or French.
* The template follows the format `<input_sentence>(language_id)`, selected after a quick prompt engineering process.
* The languages are identified with 'english' for English, 'español' for Spanish and 'française' for French. The hypothesis behind this decision is that a model trained on these languages is more likely to keep the language of the sentence when generating the next token, since documents rarely switch languages within a sentence. Writing each language name in that same language should therefore help the classification.
* Three shots are added, one sentence per language. The sentences are random here and should be curated in a real scenario.
* If none of the language identifiers is detected, the language is classified as unknown.
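
The prompt layout and the fallback to unknown could look roughly like the sketch below. This is only an illustration of the design described above, not the code in this repository: the instruction wording, the shot sentences, the function names and the mapping to two-letter codes are all assumptions.

```python
# Identifiers as named in this README; the two-letter codes are an assumption.
LANGUAGE_IDS = {"english": "en", "español": "es", "française": "fr"}

# Placeholder shots, one per language (real shots should be curated).
SHOTS = [
    ("The weather is nice today.", "english"),
    ("Me gusta leer por la noche.", "español"),
    ("Je vais au marché demain.", "française"),
]

# Hypothetical instruction wording.
INSTRUCTION = "Identify if the language of the sentence is English, Spanish or French."


def build_prompt(sentence):
    """Assemble the instruction, the shots and the open query line."""
    lines = [INSTRUCTION]
    lines += [f"{text}({label})" for text, label in SHOTS]
    # The query line is left open so the model continues with an identifier.
    lines.append(f"{sentence}(")
    return "\n".join(lines)


def parse_language(generated):
    """Return the code of the first identifier found, or 'unknown'."""
    text = generated.lower()
    hits = [(text.find(name), code) for name, code in LANGUAGE_IDS.items() if name in text]
    return min(hits)[1] if hits else "unknown"
```

`build_prompt` produces the text sent to the model, and `parse_language` is applied to the generated continuation only, so identifiers that appear in the shots are not mistaken for the answer.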

### Task 2 design: Named entity recognition
* The model is asked to identify entities related to locations, people and organizations.
* The LLM generates the entities in the format `<entity>(entity_type)`, selected after a quick prompt engineering process.
* Asking the model to identify the type of entity, even though it is not needed, is a way to decompose the problem, similar to chain-of-thought (CoT) prompting. The model might not be familiar with the concept of a named entity, but it is familiar with locations, people and organizations. After evaluating some prompts, this step turned out to help the model.
* Finally, the entities are counted by scanning the generated string with a regex. There is no need to ask the model to count them and add complexity!
* Three shots were added to the prompt to improve performance, showing the model the output style and all the entity types considered.
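
The regex-based counting can be sketched as follows. The type labels (`location`, `person`, `organization`) and the function name are assumptions made for illustration; the actual labels used in the prompt may differ.

```python
import re
from collections import Counter

# Assumed type labels; adjust to whatever the prompt actually uses.
ENTITY_TYPES = ("location", "person", "organization")

# Matches "<entity>(entity_type)" for the known types only.
ENTITY_PATTERN = re.compile(
    r"([^()]+?)\s*\((" + "|".join(ENTITY_TYPES) + r")\)",
    re.IGNORECASE,
)


def count_entities(generated):
    """Count entities per type in the model output."""
    counts = Counter()
    for _entity, entity_type in ENTITY_PATTERN.findall(generated):
        counts[entity_type.lower()] += 1
    return counts
```

The total number of entities is then `sum(count_entities(output).values())`. Anything tagged with a type outside the known set is ignored rather than counted.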


## Further improvements
The most important improvement would be to fine-tune the model on the specific tasks. To do this, I would take the following steps:
* For the language detection task, any collection of documents in the targeted languages would be useful, as long as the language of every document is known. These documents would be split at sentence level to create an instruction dataset with the same format as the current shots.
* For the second task, the best option would be to find an existing dataset for this task, such as XXX. With this dataset and the prompt template designed, we could generate an instruction dataset.
* If there were no dataset for the second task (for example, if we wanted to target a specific domain or a lower-resource language), we could generate one by 'distilling' information from another LLM. For example, if this task required entity recognition in Catalan and there was no dataset, we could obtain a few examples from a bigger model like GPT-4. However, this solution has three main drawbacks: (1) we may have to pay for the model; (2) its license does not always allow this; (3) even the best models make mistakes, so we would be accepting some error in our dataset.
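
As a rough sketch of the first step, labelled documents could be turned into instruction records like this. The sentence splitter is deliberately naive, and the record fields (`prompt`, `completion`) and function names are hypothetical; a real pipeline would use a proper sentence segmenter and whatever schema the fine-tuning framework expects.

```python
import json
import re


def split_sentences(document):
    """Naive split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def build_records(documents):
    """documents: iterable of (text, language_id) pairs with known language."""
    records = []
    for text, label in documents:
        for sentence in split_sentences(text):
            # Same "<input_sentence>(language_id)" layout as the shots.
            records.append({"prompt": f"{sentence}(", "completion": f"{label})"})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping non-ASCII characters readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```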

Fine-tuning the model on the specific tasks should improve its performance on them. Another possible variation is to remove the few-shot examples used in the current code. Although curated shots usually help the model, they also increase inference time. Whether the improvement from the shots justifies the higher latency should be studied, and the answer would depend on the specifications of the project.

A third improvement would be to use 'forced decoding' for the classification task. This would ensure that one of the 3 language identifiers is always generated, so the answer would never be 'unknown'. During fine-tuning, the model could be instructed to generate the codes 'en', 'es' and 'fr' directly.
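
One simple way to get this effect, sketched below, is to score each allowed identifier as a continuation of the prompt and keep the best one. Note that this is candidate scoring rather than token-level constrained generation, and `score_fn` is a hypothetical stand-in for whatever returns the model's log-probability of a continuation given the prompt.

```python
def forced_choice(score_fn, prompt, candidates):
    """Return the candidate with the highest score under score_fn.

    score_fn(prompt, candidate) is assumed to return a log-probability
    (higher is better). Because only the listed candidates are scored,
    the result can never fall outside the allowed identifiers.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores = {candidate: score_fn(prompt, candidate) for candidate in candidates}
    return max(scores, key=scores.get)
```

With the codes suggested above, a call would look like `forced_choice(score_fn, prompt, ["en", "es", "fr"])`.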

## Evaluation proposal