How was the dataset from TowerBlocks read into TowerInstruct?

#3
by alvations - opened

Since the data in TowerBlocks varies depending on the task, is there a particular way the data is read for supervised fine-tuning?

E.g. from the NER task, we have:

{'conversations': [{'from': 'human',
   'value': 'Study this taxonomy for classifying named entities:\n- Product (Consumer products such as food, drinks, clothing, and vehicles)\n- Location (Location or physical facilities)\n- Group (Groups of people, organizations, corporations or other entities)\n- Medical (Entities from the medical domain, including diseases, symptoms, and medications)\n- Person (Names of people)\n- CreativeWorks (Titles of creative works like movie, song, and book titles). Identify all named entities in the following tokens:\n["el", "republicano", "emilio", "castelar", "manifestaría", "al", "respecto", ":"]\nAdditionally, you should add B- to the first token of a given entity and I- to subsequent ones if they exist. For tokens that are not named entities, mark them as O.\nAnswer: '},
  {'from': 'gpt',
   'value': '[("el", "O"), ("republicano", "O"), ("emilio", "B-Person"), ("castelar", "I-Person"), ("manifestaría", "O"), ("al", "O"), ("respecto", "O"), (":", "O")]'}],
 'lang': 'es',
 'split': 'dev',
 'dataset': 'multiconer2023',
 'task': 'named_entity_recognition'}

Would the input to the tokenizer be something like the following? (Note that 'conversations' is a list of turn dicts, so the turns are indexed by position rather than by the 'from' key.)

source = tokenizer( row['conversations'][0]['value'] )  # the 'human' turn
target = tokenizer( row['conversations'][1]['value'] )  # the 'gpt' turn
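As a minimal sketch (not the official TowerInstruct pipeline), one way to pull the prompt/response pair out of a TowerBlocks-style record is to key the turns by their 'from' field; the `split_turns` helper below is hypothetical:

```python
# Sketch: extract (source, target) text from a TowerBlocks-style record.
# 'conversations' is a list of turn dicts, not a dict keyed by speaker,
# so we build a speaker -> text mapping first.

def split_turns(row):
    """Return (source_text, target_text) from a single-turn conversation."""
    turns = {turn["from"]: turn["value"] for turn in row["conversations"]}
    return turns["human"], turns["gpt"]

# Abbreviated example record in the TowerBlocks format:
row = {
    "conversations": [
        {"from": "human", "value": "Identify all named entities in the following tokens: ..."},
        {"from": "gpt", "value": '[("el", "O"), ("emilio", "B-Person")]'},
    ],
    "lang": "es",
    "dataset": "multiconer2023",
    "task": "named_entity_recognition",
}

source_text, target_text = split_turns(row)
# source_text and target_text would then be passed to the tokenizer.
```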

In the above example, does gpt mean the model's output? It is not referring to any of OpenAI's models, right?

Thank you in advance for the clarification!

Regards,
Liling

P.S. Thank you for compiling and sharing the data collection for TowerLLM!
