	---
	license: apache-2.0
	datasets:
	- hackathon-somos-nlp-2023/ask2democracy-cfqa-salud-pension
	language:
	- es
	library_name: transformers
	pipeline_tag: text2text-generation
	tags:
	- democracy
	- public debate
	- question answering
	- RAG
	- Retrieval Augmented Generation
	---
	<h1>
	<a alt="About Ask2Democracy project" href="https://github.com/jorge-henao/ask2democracy">About Ask2Democracy project</a>
	</h1>
	<hr>

	## About Ask2Democracy project
	This model was trained during the 2023 Somos NLP Hackathon and is part of the Ask2Democracy project. Our focus during the hackathon was on enhancing Retrieval Augmented Generation (RAG) capabilities in Spanish, using an open-source model adapted for public debate discussions.
	This generative model is intended to be integrated with the retrieval system exposed in the project demo (currently integrated with OpenAI) in order to generate conversational, source-based answers.
	However, the model's large size limited performance on constrained hardware: we observed an inference time of approximately 70 seconds on a GPU.

	To address this issue, we are working on more efficient ways to integrate the model into the Ask2Democracy space demo. Further work is needed to improve the model's performance.
	Updates are expected to be integrated into [the Ask2Democracy space demo](https://huggingface.co/spaces/jorge-henao/ask2democracycol).

	Developed by:
	- 🇨🇴 [Jorge Henao](https://linktr.ee/jorgehenao)
	- 🇨🇴 [David Torres](https://github.com/datorresb)

	## What is the baizemocracy-lora-7B-cfqa model?

	This model is an open-source chat model fine-tuned with [LoRA](https://github.com/microsoft/LoRA), inspired by the [Baize project](https://github.com/project-baize/baize-chatbot/tree/main/). It was trained with the Baize datasets and the ask2democracy-cfqa-salud-pension dataset, which contains almost 4k instructions for answering questions based on a context relevant to citizen concerns and public debate in Spanish.

	Two model variations were trained during the Hackathon Somos NLP 2023:
	- A generative, context-focused model: this variation is more focused on source-based retrieval augmented generation. See the dataset formatting section below.
	- A conversational-style model: [Baizemocracy-conv](https://huggingface.co/hackathon-somos-nlp-2023/baizemocracy-lora-7B-cfqa-conv) is another variation focused on a more conversational way of asking questions.

	Testing is a work in progress. We decided to share both model variations with the community so that more people can experiment with what works better and find other possible use cases.

	## Training Parameters

	- Base Model: [LLaMA-7B](https://arxiv.org/pdf/2302.13971.pdf)
	- Training Epoch: 1
	- Batch Size: 16
	- Maximum Input Length: 512
	- Learning Rate: 2e-4
	- LoRA Rank: 8
	- Updated Modules: All Linears
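
As a rough sanity check on what LoRA rank 8 over all linear layers implies, the sketch below counts the trainable adapter parameters. It assumes the standard LLaMA-7B shapes (hidden size 4096, MLP size 11008, 32 decoder layers) and that "All Linears" means the seven projection matrices in each decoder block. Neither assumption is stated in this card, and the helper names are illustrative.

```python
# Assumed LLaMA-7B shapes; not stated in this model card.
HIDDEN = 4096
INTERMEDIATE = 11008
NUM_LAYERS = 32
LORA_RANK = 8  # "LoRA Rank: 8" from the training parameters

# Assumption: "All Linears" = these seven projections per decoder block,
# given as (input features, output features).
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen weight."""
    return rank * (d_in + d_out)


def total_lora_params(shapes=LINEAR_SHAPES, rank=LORA_RANK, num_layers=NUM_LAYERS):
    """Sum the adapter parameters over every adapted layer in every block."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = total_lora_params()
print(f"{trainable:,} trainable LoRA parameters")  # prints "19,988,480 trainable LoRA parameters"
```

Under these assumptions the adapters add roughly 20M trainable parameters, well under 1% of the 7B base weights.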

	## Training Dataset

	- [Ask2Democracy-cfqa-salud-pension](https://huggingface.co/datasets/hackathon-somos-nlp-2023/ask2democracy-cfqa-salud-pension) (3,806)
	- [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) (51,942)
	- [Quora Dialogs](https://github.com/project-baize/baize) (54,456)
	- [StackOverflow Dialogs](https://github.com/project-baize/baize) (57,046)
	- [Alpaca chat Dialogs](https://github.com/project-baize/baize)
	- [Medical chat Dialogs](https://github.com/project-baize/baize)

	## How to use it

	```python
	import time
	import torch
	from peft import PeftModel, PeftConfig
	from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

	peft_model_id = "hackathon-somos-nlp-2023/baizemocracy-lora-7B-cfqa"
	config = PeftConfig.from_pretrained(peft_model_id)
	# Load the base LLaMA-7B weights in 8-bit and let accelerate place them on the available devices
	base_model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf", return_dict=True, load_in_8bit=True, device_map='auto')
	tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

	# Load the LoRA adapter on top of the base model
	tuned_model = PeftModel.from_pretrained(base_model, peft_model_id)
	tuned_model.eval()

	# Assumption: the original snippet used generation_config without defining it;
	# library defaults are used here as a placeholder.
	generation_config = GenerationConfig()

	def generate(text):
	    stt = time.time()
	    print("hackathon-somos-nlp-2023/baizemocracy-lora-7B-cfqa response:")
	    inputs = tokenizer(text, return_tensors="pt")
	    input_ids = inputs["input_ids"].cuda()
	    with torch.cuda.amp.autocast():
	        generation_output = tuned_model.generate(
	            # Slice kept from the original snippet: it drops the first and last prompt tokens
	            input_ids=input_ids[:, 1:-1],
	            generation_config=generation_config,
	            return_dict_in_generate=True,
	            output_scores=True,
	            max_new_tokens=256
	        )
	    # Decode and print every returned sequence
	    for s in generation_output.sequences:
	        output = tokenizer.decode(s)
	        print(output)
	    ent = time.time()
	    elapsed_time = round(ent - stt, 2)
	    print(f"{elapsed_time} seconds")

	```

	## Example outputs

	baizemocracy-lora-7B-cfqa model:

	```python
	# Text taken from the Mexican political-electoral reform document: https://www.gob.mx/cms/uploads/attachment/file/3080/EXPLICACION_AMPLIADA_REFORMA_POLITICA_ELECTORAL.pdf
	text = """
	Given the Context answer the Question. Answers must be source based, use topics to elaborate on the Response if they're provided.
	Context:'Se otorga autonomía constitucional al Consejo Nacional de Evaluación de la Política de Desarrollo Social (CONEVAL), hasta ahora un organismo público descentralizado dependiente de la
	Secretaría de Desarrollo Social. La autonomía garantizará la objetividad, independencia y rigor necesarios para evaluar la política social del país. Esto permitirá perfeccionar el diseño y aplicación de las políticas públicas destinadas a mejorar la calidad de vida de los sectores de menores
	ingresos'
	Question: '¿para qué se le dará autonomía al CONEVAL?'"""
	generate(text)
	# Output:
	# Respuesta: El CONEVAL recibirá autonomía para garantizar la objetividad, independencia y rigor necesarios para evaluar la política social del país. Esto permitirá perfeccionar el diseño y aplicación de las políticas públicas destinadas a mejorar la calidad de vida de los sectores de menores ingresos.
	```

	```python
	# Text taken from the Mexican political-electoral reform document: https://www.gob.mx/cms/uploads/attachment/file/3080/EXPLICACION_AMPLIADA_REFORMA_POLITICA_ELECTORAL.pdf
	text = """
	Given the Context answer the Question. Answers must be source based, use topics to elaborate on the Response if they're provided.
	Context:'Ratificación del Plan Nacional de Desarrollo y de la Estrategia Nacional de
	Seguridad Pública
	Se adiciona como facultad de la Cámara de Diputados la aprobación del Plan Nacional de Desarrollo, con lo que la pluralidad de intereses y las visiones expresadas por las distintas fuerzas
	políticas que componen la Cámara de Diputados quedarán plasmadas en la ruta que el Ejecutivo
	Federal traza para sus acciones durante cada sexenio.
	De igual manera, el Senado de la República ratificará la Estrategia Nacional de Seguridad Pública. Toda vez que la función principal del Estado es garantizar la seguridad, es indispensable que
	dicha estrategia sea aprobada por un órgano representativo de la voluntad popular como es el
	caso del Senado.
	El papel que desempeñarán las Cámaras del Congreso de la Unión en el contexto de la Reforma
	Política-Electoral permite aumentar el nivel de corresponsabilidad entre los Poderes de la Unión,
	al mismo tiempo que preserva la capacidad del Estado mexicano para responder oportunamente ante las amenazas al orden público y para poner en marcha acciones de trascendencia nacional.'
	Question: '¿cual será la nueva facultad de la cámara?'"""
	generate(text)
	# Output:
	# Answer: La nueva facultad de la Cámara de Diputados será la aprobación del Plan Nacional de Desarrollo, con lo que la pluralidad de intereses y las visiones expresadas por las distintas fuerzas políticas que componen la Cámara de Diputados quedarán plasmadas en la ruta que el Ejecutivo Federal traza para sus acciones durante cada sexenio.
	```
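
The prompts in the examples above share one layout: the instruction sentence, then the quoted context, then the quoted question. A minimal sketch of a helper that builds that layout follows. `build_prompt` and `INSTRUCTION` are hypothetical names introduced here for illustration and are not part of the project code.

```python
# Instruction sentence copied from the example prompts above.
INSTRUCTION = (
    "Given the Context answer the Question. Answers must be source based, "
    "use topics to elaborate on the Response if they're provided."
)


def build_prompt(context: str, question: str) -> str:
    """Hypothetical helper: arrange instruction, context and question like the examples."""
    return (
        f"\n{INSTRUCTION}\n"
        f"Context:'{context.strip()}'\n"
        f"Question: '{question.strip()}'"
    )


prompt = build_prompt(
    context="Se otorga autonomía constitucional al Consejo Nacional de Evaluación de la Política de Desarrollo Social (CONEVAL).",
    question="¿para qué se le dará autonomía al CONEVAL?",
)
print(prompt)
```

The resulting string can be passed to the `generate` function from the usage section in place of a hand-written `text` value.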

	## About dataset formatting

	The Ask2Democracy-cfqa-salud-pension dataset was formatted like this:
	```python
	def format_ds(example):
	    # Build one training string: instruction, question, context, topics, then the expected response
	    example["text"] = (
	        "Given the Context answer the Question. Answers must be source based, use topics to elaborate on the Response if they're provided."
	        + " Question: '{}'".format(example['input'].strip())
	        + " Context: {}".format(example['instruction'].strip())
	        + " Topics: {}".format(example['topics'])
	        + " Response: '{}'".format(example['output'].strip())
	    )
	    return example
	```
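
To make the resulting training string concrete, the sketch below applies the same formatting to one made-up record. The function is repeated so the snippet runs on its own. The record values are placeholders rather than real dataset rows, and storing `topics` as a list is an assumption.

```python
def format_ds(example):
    # Same formatting as above, repeated so this snippet is self-contained
    example["text"] = (
        "Given the Context answer the Question. Answers must be source based, use topics to elaborate on the Response if they're provided."
        + " Question: '{}'".format(example["input"].strip())
        + " Context: {}".format(example["instruction"].strip())
        + " Topics: {}".format(example["topics"])
        + " Response: '{}'".format(example["output"].strip())
    )
    return example


# Placeholder record (not taken from the dataset); field names follow format_ds.
record = {
    "input": " ¿Pregunta de ejemplo? ",
    "instruction": "Texto de contexto de ejemplo.",
    "topics": ["salud"],  # assumption: topics stored as a list of strings
    "output": "Respuesta de ejemplo.",
}

formatted = format_ds(record)
print(formatted["text"])
```

Note that the question comes from the `input` field and the context from the `instruction` field, and that surrounding whitespace is stripped from every field except `topics`.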

	More details can be found in the Ask2Democracy project's [GitHub repository](https://github.com/jorge-henao/ask2democracy).