---
license: apache-2.0
datasets:
- hackathon-somos-nlp-2023/ask2democracy-cfqa-salud-pension
language:
- es
library_name: transformers
pipeline_tag: text2text-generation
tags:
- democracy
- public debate
- question answering
- RAG
- Retrieval Augmented Generation
---

## About Ask2Democracy project
This model was trained during the 2023 Somos NLP Hackathon and is part of the Ask2Democracy project. Our focus during the hackathon was on enhancing Retrieval Augmented Generation (RAG) capabilities in Spanish, using an open-source model adapted for public debate discussions.
This generative model is intended to be integrated with the retrieval system exposed in the project demo (currently integrated with OpenAI), in order to generate conversational, source-based answers.
However, we encountered performance limitations due to the model's large size, which caused issues when running it on limited hardware: we observed an inference time of approximately 70 seconds even on a GPU.

To address this, we are working on ways to optimize the model's integration into the AskDemocracy space demo; further work is required to improve the model's performance.
Updates are expected to be integrated into [the AskDemocracy space demo](https://huggingface.co/spaces/jorge-henao/ask2democracycol).

**Developed by:**
- 🇨🇴 [Jorge Henao](https://linktr.ee/jorgehenao)
- 🇨🇴 [David Torres](https://github.com/datorresb)

## What's the baizemocracy-lora-7B-cfqa-conv model?

This model is an open-source chat model fine-tuned with [LoRA](https://github.com/microsoft/LoRA), inspired by the [Baize project](https://github.com/project-baize/baize-chatbot/tree/main/). It was trained on the Baize datasets and the ask2democracy-cfqa-salud-pension dataset, which contains almost 4k instructions for answering questions based on contexts relevant to citizen concerns and public debate in Spanish.

Two model variations were trained during the Hackathon Somos NLP 2023:
- A conversation-focused model (this one): tuned for a more conversational way of asking questions; see the "About dataset pre-processing" section below.
- A generative, context-focused model: a variation more focused on source-based retrieval augmented generation, published as [Baizemocracy-RAGfocused](https://huggingface.co/hackathon-somos-nlp-2023/baizemocracy-lora-7B-cfqa).


Testing is a work in progress. We decided to share both model variations with the community to involve more people in experimenting with which one works better and in finding other possible use cases.

## Training Parameters

- Base Model: [LLaMA-7B](https://arxiv.org/pdf/2302.13971.pdf)
- Training Epoch: 1
- Batch Size: 16
- Maximum Input Length: 512
- Learning Rate: 2e-4
- LoRA Rank: 8
- Updated Modules: All Linears
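
For reference, these hyperparameters roughly correspond to a PEFT `LoraConfig` like the one below. This is a sketch rather than the original training script: the exact target module names, `lora_alpha`, and dropout are assumptions inferred from "Updated Modules: All Linears".

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# A minimal sketch of the LoRA setup described above. Targeting every
# linear projection in the LLaMA blocks is an assumption based on
# "Updated Modules: All Linears".
lora_config = LoraConfig(
    r=8,                # LoRA rank, as listed above
    lora_alpha=16,      # assumed scaling factor (not stated above)
    lora_dropout=0.05,  # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```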

## Training Dataset

- [Ask2Democracy-cfqa-salud-pension](https://huggingface.co/datasets/hackathon-somos-nlp-2023/ask2democracy-cfqa-salud-pension) (3,806)
- [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) (51,942)
- [Quora Dialogs](https://github.com/project-baize/baize) (54,456)
- [StackOverflow Dialogs](https://github.com/project-baize/baize) (57,046)
- [Alpaca chat Dialogs](https://github.com/project-baize/baize)
- [Medical chat Dialogs](https://github.com/project-baize/baize)


## How to use it

```python
import time
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

peft_model_id = "hackathon-somos-nlp-2023/baizemocracy-lora-7B-cfqa-conv"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model in 8-bit and the tokenizer shipped with the adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",
    return_dict=True,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

# Load the LoRA adapter on top of the base model
tuned_model = PeftModel.from_pretrained(base_model, peft_model_id)
tuned_model.eval()

# Sampling settings; these particular values are illustrative defaults
generation_config = GenerationConfig(do_sample=True, temperature=0.7, top_p=0.9)

def generate(text):
    stt = time.time()
    print("hackathon-somos-nlp-2023/baizemocracy-lora-7B-cfqa response:")
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()
    with torch.cuda.amp.autocast():
        generation_output = tuned_model.generate(
            input_ids=input_ids[:, 1:-1],  # trim the first and last prompt tokens
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=256,
        )
        for s in generation_output.sequences:
            output = tokenizer.decode(s)
            print(output)
    elapsed_time = round(time.time() - stt, 2)
    print(f"{elapsed_time} seconds")
```

## Example outputs

baizemocracy-lora-7B-cfqa-conv model:
```python
# Context taken from the Mexican political-electoral reform:
# https://www.gob.mx/cms/uploads/attachment/file/3080/EXPLICACION_AMPLIADA_REFORMA_POLITICA_ELECTORAL.pdf
text = """
The conversation between human and AI assistant. Given the context answer the Human question.
Context:'Ratificación del Plan Nacional de Desarrollo y de la Estrategia Nacional de
Seguridad Pública
Se adiciona como facultad de la Cámara de Diputados la aprobación del Plan Nacional de Desarrollo, con lo que la pluralidad de intereses y las visiones expresadas por las distintas fuerzas
políticas que componen la Cámara de Diputados quedarán plasmadas en la ruta que el Ejecutivo
Federal traza para sus acciones durante cada sexenio.
De igual manera, el Senado de la República ratificará la Estrategia Nacional de Seguridad Pública. Toda vez que la función principal del Estado es garantizar la seguridad, es indispensable que
dicha estrategia sea aprobada por un órgano representativo de la voluntad popular como es el
caso del Senado.
El papel que desempeñarán las Cámaras del Congreso de la Unión en el contexto de la Reforma
Política-Electoral permite aumentar el nivel de corresponsabilidad entre los Poderes de la Unión,
al mismo tiempo que preserva la capacidad del Estado mexicano para responder oportunamente ante las amenazas al orden público y para poner en marcha acciones de trascendencia nacional.'
[|Human|] ¿cual será la nueva facultad de la cámara?"""
generate(text)
# Output:
# [|AI|] La nueva facultad de la Cámara de Diputados será la aprobación del Plan Nacional de Desarrollo.
```
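
For the intended RAG use case, a prompt like the one above can be assembled programmatically from a retrieved passage. Below is a minimal sketch: `retrieve` is a hypothetical retriever standing in for the demo's retrieval system, and the prompt mirrors the conversational format used at training time (see the pre-processing section below).

```python
def build_rag_prompt(question: str, context: str) -> str:
    # Mirrors the conversational format produced by
    # format_instruction_with_context in the pre-processing section.
    return (
        "La conversación entre un humano y un asistente de IA."
        f"\n[|Human|] {question}\nPara responder la pregunta, usa el siguiente contexto:\n{context}"
        "\n[|AI|] "
    )

# `retrieve` is hypothetical; plug in any passage retriever here.
# passages = retrieve("¿Cuál será la nueva facultad de la Cámara?", top_k=3)
# generate(build_rag_prompt("¿Cuál será la nueva facultad de la Cámara?", " ".join(passages)))
```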

## About dataset pre-processing

The Ask2Democracy-cfqa-salud-pension dataset was pre-processed into a conversational style, in two variations, as follows:
```python

def format_instruction_without_context(example):
    # Keep the original question around as a topic label
    example["topic"] = example["input"]
    input = "La conversación entre un humano y un asistente de IA."
    input += "\n[|Human|] " + example["input"]
    input += "\n[|AI|] " + example["output"]
    if len(example["topics"]) > 0:
        # Add a follow-up turn asking the model to classify its answer by topic
        topics = ", ".join(example["topics"])
        input += "\n[|Human|] " + "¿En cuáles tópicos clasificarías su respuesta?"
        input += "\n[|AI|] " + f"Aquí una lista de tópicos: {topics}."
        example["topic"] += f" ({topics})"
    example["input"] = input
    return example

def format_instruction_with_context(example):
    example["topic"] = example["input"]
    # Strip the instruction boilerplate, collapse whitespace, and trim
    # stray leading/trailing characters around the context
    context = example["instruction"].replace("Given the context please answer the question. Context:", "")
    context = " ".join(context.strip().split())[1:-3]
    input = "La conversación entre un humano y un asistente de IA."
    input += "\n[|Human|] " + example["input"] + f"\nPara responder la pregunta, usa el siguiente contexto:\n{context}"
    input += "\n[|AI|] " + example["output"]
    if len(example["topics"]) > 0:
        topics = ", ".join(example["topics"])
        input += "\n[|Human|] " + "¿En cuáles tópicos clasificarías su respuesta?"
        input += "\n[|AI|] " + f"Aquí una lista de tópicos: {topics}."
        example["topic"] += f" ({topics})"
    example["input"] = input
    return example

```
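
These functions can be applied over the dataset with `datasets.map`. A minimal sketch, assuming the column names used above (`input`, `output`, `instruction`, `topics`):

```python
from datasets import load_dataset

ds = load_dataset("hackathon-somos-nlp-2023/ask2democracy-cfqa-salud-pension", split="train")

# Rewrite each example's "input" field into the conversational format
conv_ds = ds.map(format_instruction_with_context)
print(conv_ds[0]["input"][:400])
```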

More details can be found in the Ask2Democracy project's [GitHub repository](https://github.com/jorge-henao/ask2democracy).