root committed on
Commit · 5c83af4
1 Parent(s): e12fd29
update README and add code
- README.md +135 -0
- all_commands.sh +49 -0
- code/arguments.py +95 -0
- code/dataset_conv.py +273 -0
- code/evaluate_cqa_vllm_chatqa2.py +97 -0
README.md
CHANGED
@@ -1,3 +1,138 @@
---
license: llama3
language:
- en
pipeline_tag: text-generation
tags:
- nvidia
- chatqa-2
- chatqa
- llama-3
- pytorch
---

## Model Details
We introduce Llama3-ChatQA-2, which bridges the gap between open-source LLMs and leading proprietary models (e.g., GPT-4-Turbo) in long-context understanding and retrieval-augmented generation (RAG) capabilities. Llama3-ChatQA-2 is developed using an improved training recipe from the [ChatQA-1.5 paper](https://arxiv.org/pdf/2401.10225), and it is built on top of the [Llama-3 base model](https://huggingface.co/meta-llama/Meta-Llama-3-70B). Specifically, we continued training the Llama-3 base models to extend the context window from 8K to 128K tokens, followed by a three-stage instruction tuning process to enhance the model’s instruction-following, RAG performance, and long-context understanding capabilities. Llama3-ChatQA-2 has two variants: Llama3-ChatQA-2-8B and Llama3-ChatQA-2-70B. Both models were originally trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM); we converted the checkpoints to Hugging Face format. **For more information about ChatQA, check the [website](https://chatqa-project.github.io/)!**

## Other Resources
[Llama3-ChatQA-2-70B](https://huggingface.co/nvidia/Llama3-ChatQA-2-70B)   [Evaluation Data](https://huggingface.co/datasets/nvidia/ChatRAG-Bench)   [Training Data](https://huggingface.co/datasets/nvidia/ChatQA2-Training-Data)   [Retriever](https://huggingface.co/intfloat/e5-mistral-7b-instruct)   [Website](https://chatqa2-project.github.io/)   [Paper](https://arxiv.org/abs/2407.14482)

## Overview of Benchmark Results
Results in [ChatRAG Bench](https://huggingface.co/datasets/nvidia/ChatRAG-Bench) are as follows:


![Example Image](example.png)
| | ChatQA-2-70B | GPT-4-Turbo-2024-04-09 | Qwen2-72B-Instruct | Llama3.1-70B-Instruct |
| -- |:--:|:--:|:--:|:--:|
| Ultra-long (>100k) | 41.04 | 33.16 | 39.77 | 39.81 |
| Long (32k) | 48.15 | 51.93 | 49.94 | 49.92 |
| Short (4k) | 56.30 | 54.72 | 54.06 | 52.12 |

Note that ChatQA-2 is built on the Llama-3 base model.


## Prompt Format
**We highly recommend that you use the prompt format we provide, as follows:**
### when context is available
<pre>
System: {System}

{Context}

User: {Question}

A: {Response}

User: {Question}

A:
</pre>

### when context is not available
<pre>
System: {System}

User: {Question}

A: {Response}

User: {Question}

A:
</pre>
**The content of the system's turn (i.e., {System}) for both scenarios is as follows:**
<pre>
This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.
</pre>
**Note that our ChatQA-2 models are optimized for the capability with context, e.g., over documents or retrieved context.**
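
The two templates above can also be assembled programmatically. The following is a minimal sketch rather than the authors' implementation: `build_prompt` and `SYSTEM_PROMPT` are hypothetical names introduced here for illustration, and the turns are joined with a blank line as the templates show.

```python
# Hypothetical helper: assembles the prompt templates shown above.
SYSTEM_PROMPT = (
    "This is a chat between a user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's "
    "questions based on the context. The assistant should also indicate when "
    "the answer cannot be found in the context."
)


def build_prompt(turns, context=None):
    """Return the prompt string for a list of (role, text) turns.

    `turns` alternates "user" and "assistant" entries and ends with a user
    question; `context` is optional document or retrieved text.
    """
    parts = ["System: " + SYSTEM_PROMPT]
    if context:
        # The context block sits between the system turn and the dialogue.
        parts.append(context)
    for role, text in turns:
        label = "User" if role == "user" else "Assistant"
        parts.append(label + ": " + text)
    # Leave the final assistant turn open for the model to complete.
    parts.append("Assistant:")
    return "\n\n".join(parts)


prompt = build_prompt(
    [("user", "Who wrote the report?")],
    context="The report was written by Ada.",
)
print(prompt)
```

Calling `build_prompt(turns)` without `context` yields the second template, where the dialogue follows the system turn directly.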

## How to use

### take the whole document as context
This can be applied to the scenario where the whole document fits into the model's context window, so there is no need to run retrieval over the document.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/Llama3-ChatQA-2-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

messages = [
    {"role": "user", "content": "what is the percentage change of the net income from Q4 FY23 to Q4 FY24?"}
]

document = """NVIDIA (NASDAQ: NVDA) today reported revenue for the fourth quarter ended January 28, 2024, of $22.1 billion, up 22% from the previous quarter and up 265% from a year ago.\nFor the quarter, GAAP earnings per diluted share was $4.93, up 33% from the previous quarter and up 765% from a year ago. Non-GAAP earnings per diluted share was $5.16, up 28% from the previous quarter and up 486% from a year ago.\nQ4 Fiscal 2024 Summary\nGAAP\n| $ in millions, except earnings per share | Q4 FY24 | Q3 FY24 | Q4 FY23 | Q/Q | Y/Y |\n| Revenue | $22,103 | $18,120 | $6,051 | Up 22% | Up 265% |\n| Gross margin | 76.0% | 74.0% | 63.3% | Up 2.0 pts | Up 12.7 pts |\n| Operating expenses | $3,176 | $2,983 | $2,576 | Up 6% | Up 23% |\n| Operating income | $13,615 | $10,417 | $1,257 | Up 31% | Up 983% |\n| Net income | $12,285 | $9,243 | $1,414 | Up 33% | Up 769% |\n| Diluted earnings per share | $4.93 | $3.71 | $0.57 | Up 33% | Up 765% |"""

def get_formatted_input(messages, context):
    system = "System: This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context."
    instruction = "Please give a full and complete answer for the question."

    for item in messages:
        if item['role'] == "user":
            ## only apply this instruction for the first user turn
            item['content'] = instruction + " " + item['content']
            break

    conversation = '\n\n'.join(["User: " + item["content"] if item["role"] == "user" else "Assistant: " + item["content"] for item in messages]) + "\n\nAssistant:"
    formatted_input = system + "\n\n" + context + "\n\n" + conversation

    return formatted_input

formatted_input = get_formatted_input(messages, document)
tokenized_prompt = tokenizer(tokenizer.bos_token + formatted_input, return_tensors="pt").to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(input_ids=tokenized_prompt.input_ids, attention_mask=tokenized_prompt.attention_mask, max_new_tokens=128, eos_token_id=terminators)

response = outputs[0][tokenized_prompt.input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```
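
As a sanity check on the example question, the expected answer can be computed directly from the net income figures in the sample document ($12,285 million in Q4 FY24 and $1,414 million in Q4 FY23). The sketch below is only a reference calculation to compare against the model's response; `percentage_change` is a hypothetical helper name.

```python
def percentage_change(old, new):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


# Net income in $ millions, copied from the sample document above.
q4_fy23 = 1414
q4_fy24 = 12285

change = percentage_change(q4_fy23, q4_fy24)
print(f"{change:.0f}%")  # prints "769%", matching the "Up 769%" entry in the table
```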

## Command to run generation
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset ${dataset_name} --start-idx 0 --end-idx ${num_samples} --max-tokens ${max_tokens} --sample-input-file ${dataset_path}

See all_commands.sh for the detailed configurations.
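
For scripted runs, the same command line can be assembled in Python. This is a sketch under assumptions: `build_eval_command` is a hypothetical helper, the paths in the usage lines are placeholders, and only flags that appear in this commit's `arguments.py` and `all_commands.sh` are used.

```python
import shlex


def build_eval_command(model_path, dataset_name, dataset_path, num_samples,
                       max_tokens=None, use_retrieved_neighbours=False):
    """Build the argument list for evaluate_cqa_vllm_chatqa2.py (hypothetical helper)."""
    cmd = [
        "python", "evaluate_cqa_vllm_chatqa2.py",
        "--model-folder", model_path,
        "--eval-dataset", dataset_name,
        "--start-idx", "0",
        "--end-idx", str(num_samples),
        "--sample-input-file", dataset_path,
    ]
    if max_tokens is not None:
        # When omitted, the script falls back to its own --max-tokens default.
        cmd += ["--max-tokens", str(max_tokens)]
    if use_retrieved_neighbours:
        # Boolean flag: use the retrieved chunks rather than the full document.
        cmd.append("--use-retrieved-neighbours")
    return cmd


# Placeholder paths; substitute your own checkpoint and data locations.
cmd = build_eval_command(
    "/path/to/model",
    "qmsum.e5_mistral_retriever_chunkbysents1200",
    "/path/to/data/test.json",
    200,
    use_retrieved_neighbours=True,
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch an evaluation run.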

## Correspondence to
Peng Xu (pengx@nvidia.com), Wei Ping (wping@nvidia.com)

## Citation
<pre>
@article{xu2024chatqa,
title={ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities},
author={Xu, Peng and Ping, Wei and Wu, Xianchao and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan},
journal={arXiv preprint arXiv:2407.14482},
year={2024}
}
</pre>


## License
The use of this model is governed by the [META LLAMA 3 COMMUNITY LICENSE AGREEMENT](https://llama.meta.com/llama3/license/).

all_commands.sh
ADDED
@@ -0,0 +1,49 @@
model_path=""
data_home=""

python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset longbook_choice_eng_gpt4_same --start-idx 0 --end-idx 1000 --sample-input-file ${data_home}/longbook_choice_eng_gpt4_same/test.json
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset longbook_qa_eng_gpt4_same --start-idx 0 --end-idx 1000 --sample-input-file ${data_home}/longbook_qa_eng_gpt4_same/test.json
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset longdialogue_qa_eng_gpt4_same --start-idx 0 --end-idx 2000 --max-tokens 1024 --sample-input-file ${data_home}/longdialogue_qa_eng_gpt4_same/test.json
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset longbook_sum_eng_gpt4_same --start-idx 0 --end-idx 120 --max-tokens 1024 --sample-input-file ${data_home}/longbook_sum_eng_gpt4_same/test.json
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset longbook_choice_eng.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 1000 --use-retrieved-neighbours --sample-input-file ${data_home}/longbook_choice_eng.e5_mistral_retriever_chunkbysents1200/test.json
# longbook_qa_eng.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset longbook_qa_eng.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 1000 --use-retrieved-neighbours --sample-input-file ${data_home}/longbook_qa_eng.e5_mistral_retriever_chunkbysents1200/test.json
# qasper.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset qasper.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 2000 --use-retrieved-neighbours --sample-input-file ${data_home}/qasper.e5_mistral_retriever_chunkbysents1200/test.json
# qmsum.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset qmsum.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 200 --use-retrieved-neighbours --sample-input-file ${data_home}/qmsum.e5_mistral_retriever_chunkbysents1200/test.json
# quality.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset quality.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 2000 --use-retrieved-neighbours --sample-input-file ${data_home}/quality.e5_mistral_retriever_chunkbysents1200/test.json
# musique.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset musique.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 200 --use-retrieved-neighbours --sample-input-file ${data_home}/musique.e5_mistral_retriever_chunkbysents1200/test.json
# hotpotqa.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset hotpotqa.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 200 --use-retrieved-neighbours --sample-input-file ${data_home}/hotpotqa.e5_mistral_retriever_chunkbysents1200/test.json
# multifieldqa_en.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset multifieldqa_en.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 200 --use-retrieved-neighbours --sample-input-file ${data_home}/multifieldqa_en.e5_mistral_retriever_chunkbysents1200/test.json
# longbook_choice_eng.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset qasper.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 2000 --sample-input-file ${data_home}/qasper.e5_mistral_retriever_chunkbysents1200/test.json
# qmsum.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset qmsum.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 200 --sample-input-file ${data_home}/qmsum.e5_mistral_retriever_chunkbysents1200/test.json
# quality.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset quality.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 2000 --sample-input-file ${data_home}/quality.e5_mistral_retriever_chunkbysents1200/test.json
# musique.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset musique.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 200 --sample-input-file ${data_home}/musique.e5_mistral_retriever_chunkbysents1200/test.json
# hotpotqa.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset hotpotqa.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 200 --sample-input-file ${data_home}/hotpotqa.e5_mistral_retriever_chunkbysents1200/test.json
# multifieldqa_en.e5_mistral_retriever_chunkbysents1200
python evaluate_cqa_vllm_chatqa2.py --model-folder ${model_path} --eval-dataset multifieldqa_en.e5_mistral_retriever_chunkbysents1200 --start-idx 0 --end-idx 200 --sample-input-file ${data_home}/multifieldqa_en.e5_mistral_retriever_chunkbysents1200/test.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset doc2dial --start-idx 0 --end-idx 4000 --use-retrieved-neighbours --model-folder ${model_path} --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks/multi-turn-qa/doc2dial/doc2dial_ftdragon_chatgptgen7k_chunk150_QA_test.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset convfinqa_general_long_answer --start-idx 0 --end-idx 1500 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks//multi-turn-qa/convfinqa_general/convfinqa_general_QA_dev.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset sqa_general_long_answer --start-idx 0 --end-idx 3100 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks//multi-turn-qa/sqa_general/sqa_general_QA_test.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset coqa --start-idx 0 --end-idx 8000 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks/multi-turn-qa/coqa/coqa_QA_dev.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset doqa_cooking --start-idx 0 --end-idx 2000 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks//multi-turn-qa/doqa/doqa_cooking_QA_test.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset doqa_travel --start-idx 0 --end-idx 2000 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks//multi-turn-qa/doqa/doqa_travel_QA_test.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset doqa_movies --start-idx 0 --end-idx 2000 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks//multi-turn-qa/doqa/doqa_movies_QA_test.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset topiocqa --start-idx 0 --end-idx 2600 --num-ctx 20 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks//multi-turn-qa/topiocqa/topiocqa_dev_retrieval_dragon_ft_chatgptgen7k.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset inscit --start-idx 0 --end-idx 600 --num-ctx 20 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks/multi-turn-qa/inscit/inscit_dev_retrieval_dragon_ft_chatgptgen7k_with_topic.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset qrecc --start-idx 0 --end-idx 4000 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks/multi-turn-qa/qrecc/qrecc_ftdragon_chatgptgen7k_chunk150_QA_test.json
python evaluate_cqa_vllm_chatqa2.py --eval-dataset quac --start-idx 0 --end-idx 8000 --use-retrieved-neighbours --model-folder /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_llama3_8b_step_1000_8b_64_3e-7_step_3300_hf --sample-input-file /lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/data/test_benchmarks/multi-turn-qa/quac/quac_ftdragon_chatgptgen7k_chunk150_QA_test.json
code/arguments.py
ADDED
@@ -0,0 +1,95 @@
import argparse
import os

def get_args():
    parser = argparse.ArgumentParser(description="ChatQA-HF")

    ## model
    # parser.add_argument('--model-folder', type=str, default='/lustre/fsw/portfolios/llmservice/users/pengx/projects/vllm_run/')
    # parser.add_argument('--model-name', type=str, default='Llama-3-70B-Instruct-Gradient-262k')

    parser.add_argument('--model-folder', type=str, default='/lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_pp1_hf')
    parser.add_argument('--model-name', type=str, default='ChatQA2')

    ## tokenizer
    # parser.add_argument('--tokenizer-path', type=str, default='/lustre/fsw/portfolios/adlr/users/zihanl/inform/ckpts/llama2-tokenizer')
    parser.add_argument('--tokenizer-path', type=str, default='/lustre/fsw/portfolios/llmservice/users/pengx/projects/vllm_run/Llama-3-70B-Instruct-Gradient-262k/')
    # parser.add_argument('--tokenizer-path', type=str, default='/lustre/fsw/portfolios/llmservice/users/pengx/projects/swa_long_pretrain_llama2/checkpoints/applications/long_131072_25_multiturn_qa_blend_commercial_v28_9_multiturn_pp1_hf')

    ## dataset path
    # parser.add_argument('--data-folder', type=str, default='/lustre/fsw/portfolios/adlr/users/zihanl/datasets/foundational_qa/test_benchmarks/multi-turn-qa')
    parser.add_argument('--data-folder', type=str, default='/lustre/fs1/portfolios/llmservice/users/pengx/projects/vllm_run/oss_test/')
    parser.add_argument('--data-folder-singleturn', type=str, default='/lustre/fsw/portfolios/adlr/users/zihanl/datasets/foundational_qa/test_benchmarks/single-turn-qa')
    parser.add_argument('--data-folder-scrolleval', type=str, default='/lustre/fsw/portfolios/adlr/users/zihanl/datasets/foundational_qa/scroll_eval_data')

    parser.add_argument('--eval-dataset', type=str, default='')
    # parser.add_argument('--doc2dial-path', type=str, default='doc2dial/doc2dial_ftdragon_chatgptgen7k_chunk150_QA_test.json')
    # parser.add_argument('--convfinqa-path', type=str, default='convfinqav3/convfinqav3_QA_dev.json')
    # parser.add_argument('--convfinqa-path', type=str, default='convfinqa_general/convfinqa_general_QA_dev.json')
    # parser.add_argument('--quac-path', type=str, default='quac/quac_ftdragon_chatgptgen7k_chunk150_QA_test.json')
    # parser.add_argument('--qrecc-path', type=str, default='qrecc/qrecc_ftdragon_chatgptgen7k_chunk150_QA_test.json')
    # parser.add_argument('--doqa-cooking-path', type=str, default='doqa/doqa_cooking_QA_test.json')
    # parser.add_argument('--doqa-travel-path', type=str, default='doqa/doqa_travel_QA_test.json')
    # parser.add_argument('--doqa-movies-path', type=str, default='doqa/doqa_movies_QA_test.json')
    # parser.add_argument('--coqa-path', type=str, default='coqa/coqa_QA_dev.json')
    # # parser.add_argument('--hybridial-path', type=str, default='HybridDial/HybridDial_fqa_test.json')
    # parser.add_argument('--hybridial-path', type=str, default='HybridDial_general/HybridDial_general_QA_test.json')
    # # parser.add_argument('--sqa-path', type=str, default='sqa/sqa_QA_test.json')
    # parser.add_argument('--sqa-path', type=str, default='sqa_general/sqa_general_QA_test.json')
    # parser.add_argument('--topiocqa-path', type=str, default='topiocqa/topiocqa_dev_retrieval_dragon_ft_chatgptgen7k.json')
    # parser.add_argument('--inscit-path', type=str, default='inscit/inscit_dev_retrieval_dragon_ft_chatgptgen7k_with_topic.json')

    parser.add_argument('--doc2dial-path', type=str, default='doc2dial/test.json')
    parser.add_argument('--convfinqa-path', type=str, default='convfinqa/dev.json')
    parser.add_argument('--quac-path', type=str, default='quac/test.json')
    parser.add_argument('--qrecc-path', type=str, default='qrecc/test.json')
    parser.add_argument('--doqa-cooking-path', type=str, default='doqa/test_cooking.json')
    parser.add_argument('--doqa-travel-path', type=str, default='doqa/test_travel.json')
    parser.add_argument('--doqa-movies-path', type=str, default='doqa/test_movies.json')
    parser.add_argument('--coqa-path', type=str, default='coqa/dev.json')
    parser.add_argument('--hybridial-path', type=str, default='hybridial/test.json')
    parser.add_argument('--sqa-path', type=str, default='sqa/test.json')
    parser.add_argument('--topiocqa-path', type=str, default='topiocqa/dev.json')
    parser.add_argument('--inscit-path', type=str, default='inscit/dev.json')

    parser.add_argument('--kilt-nq-path', type=str, default='kilt/nq/test.json')
    parser.add_argument('--kilt-tqa-path', type=str, default='kilt/tqa/test.json')
    parser.add_argument('--kilt-hotpotqa-path', type=str, default='kilt/hotpotqa/test.json')
    # parser.add_argument('--kilt-hotpotqa-path', type=str, default='kilt/hotpotqa_rerank/test.json')

    parser.add_argument('--nq-path', type=str, default='nq_dragon_retrieved/test.json')
    parser.add_argument('--tqa-path', type=str, default='triviaqa_dragon_retrieved/test.json')
    parser.add_argument('--hotpotqa-path', type=str, default='hotpotqa_dragon_retrieved/test.json')

    ## scroll eval
    parser.add_argument('--scroll-hotpotqa-chunk1200-path', type=str, default='hotpotqa.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-musique-chunk1200-path', type=str, default='musique.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-qasper-chunk1200-path', type=str, default='qasper.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-narrative_qa-chunk1200-path', type=str, default='narrative_qa.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-quality-chunk1200-path', type=str, default='quality.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-multifieldqa_en-chunk1200-path', type=str, default='multifieldqa_en.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-qmsum-chunk1200-path', type=str, default='qmsum.e5_mistral_retriever_chunkbysents1200/test.json')

    # NOTE: the chunk300 options below currently default to the chunkbysents1200 files.
    parser.add_argument('--scroll-hotpotqa-chunk300-path', type=str, default='hotpotqa.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-musique-chunk300-path', type=str, default='musique.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-qasper-chunk300-path', type=str, default='qasper.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-narrative_qa-chunk300-path', type=str, default='narrative_qa.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-quality-chunk300-path', type=str, default='quality.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-multifieldqa_en-chunk300-path', type=str, default='multifieldqa_en.e5_mistral_retriever_chunkbysents1200/test.json')
    parser.add_argument('--scroll-qmsum-chunk300-path', type=str, default='qmsum.e5_mistral_retriever_chunkbysents1200/test.json')

    parser.add_argument('--sample-input-file', type=str, default='')
    parser.add_argument("--use-retrieved-neighbours", action='store_true', default=False,
                        help='Use retrieved neighbours')

    ## others
    parser.add_argument('--max-seq-length', type=int, default=128000)
    parser.add_argument('--num-ctx', type=int, default=5)
    parser.add_argument('--start-idx', type=int, default=-1)
    parser.add_argument('--end-idx', type=int, default=-1)
    parser.add_argument('--max-tokens', type=int, default=64)

    args = parser.parse_args()

    return args
code/dataset_conv.py
ADDED
@@ -0,0 +1,273 @@
# coding=utf-8
# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import collections
from multiprocessing.sharedctypes import Value
import os
import torch
import numpy as np
import glob

def format_multichoice(multichoice_options):

    # Label the options (A), (B), (C), ... and join them on one line.
    options_text = ["({}) {}".format(chr(ord('A')+i), option) for i, option in zip(range(len(multichoice_options)), multichoice_options)]
    return "Choose one based on the following options: {}".format(" ".join(options_text))

def format_multichoice_question(question, multichoice_options):

    return "{}\n{}".format(question, format_multichoice(multichoice_options))

def format_answer(answer):
    return " {}".format(answer)

"""GPT ft dataset."""
def preprocess(data_file, inference_only=False, retrieved_neighbours=False, fix_newsqa=False):

    # data_file may be a glob pattern; load every matching JSON file.
    nq_examples = []
    for my_data_file in sorted(glob.glob(data_file)):
        with open(my_data_file, "r", encoding='utf-8') as f:
            nq_examples.extend(json.load(f))

    data = []
    for instance in nq_examples:
        question = instance["question"]
        if 'qa_type' in instance and instance['qa_type'] == "multi_choice_qa":
            question = format_multichoice_question(question, instance["multichoice_options"])
        if True:
            if retrieved_neighbours:
                # Use the retrieved chunks as context.
                contexts = instance["ctxs"]
                neighbours = ["title: " + ctx["title"] + ", source: " + ctx["text"] for ctx in contexts]
            else:
                # Fall back to the full document (or sub-paragraph) as a single context.
                if "document" in instance:
                    doc = instance["document"]
                    if type(doc) == list:
                        neighbours = [" ".join(doc)]
                    else:
                        neighbours = [doc]
                elif "sub-paragraphs" in instance:
                    neighbours = ["title: , source: " + instance["sub-paragraphs"]]
                elif fix_newsqa and "sub_paragraph" in instance:
                    neighbours = ["title: , source: " + instance["sub_paragraph"]]
                else:
                    neighbours = ["title: , source: "]

            if inference_only:
                data.append((question, None, neighbours))
            else:
                if True:
                    if "answers" in instance:
                        answers = instance["answers"]
                    elif "answer" in instance:
                        if type(instance["answer"]) is str:
                            answers = [instance["answer"]]
                        elif type(instance["answer"]) is list:
                            answers = instance["answer"]
                        else:
                            answers = [str(instance["answer"])]
                    else:
                        raise ValueError("need to have answer or answers")
                    if len(answers) < 1:
                        continue
                        # answers = ["This question cannot be answered based on the given information."]
                    else:
                        ## only take answer 0
                        if type(answers[0]) is dict:
                            answers = [answers[0]["text"].strip()]
                        elif type(answers[0]) is str:
                            answers = [answers[0]]
                        else:
                            raise ValueError("unsupported type for answer(s)")

                for answer in answers:
                    answer = format_answer(answer)
                    data.append((question, answer, neighbours))

    return data


def reformat_prompt_v2(query, neighbours, dataset_name, ft_neighbours,
                       max_output_len, tokenizer, max_seq_length):

    system = "System: This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.\n\n"

    if dataset_name in ["oasst", "tuluv2", "tuluv2official", "quiet_cockatoo", "quiet-cockatoo_commercial", "primitive-stingray16k"]:
        if dataset_name == "tuluv2official":
            all_input = query
        else:
            all_input = system + query

        input_tokens = tokenizer.encode(all_input)
        return input_tokens

    short_span_with_context = ["drop", "NarrativeQA", "NarrativeQAretrieval", "QASC", "Quoref", "ROPES", "squad1.1", "squad2.0", "newsqa", "nq", "BioASQ", "DuoRC_ParaphraseRC", "TextbookQA", "WikiTableQuestions", "HybridQA", "hotpotqa", "wikimqa", "kilt_nq_short", "kilt_tqa_short", "kilt_hotpotqa_short", "nqtables", "qasper", "narrative_qa", "quality", "musique", "hotpotqa", "multifieldqa_en", "longbook_qa_eng", "kv_retrieval", "math_find", "passkey", "number_string", "code_debug", "code_run", "math_calc", "longdialogue_qa_eng", "longbook_qa_eng_gpt4_same", "longdialogue_qa_eng_gpt4_same"]
    yes_no_without_context = ["boolq", "multirc"]
    multichoices = ["race", "longbook_choice_eng", "longbook_choice_eng_gpt4_same"]
    # multi-turn qa datasets
    formatted_dataset_name = ["convqa", "convqav2", "chatgptgen", "chatgptgennoanswer", "chatgptgennoanswerv2", "doc2dial", "doc2dialv2", "doc2dial_dragon", "quac", "quacv2", "quac_dragon", "qrecc", "qrecc_dragon", "sharc", "nvolvemultiturn1300", "nvolvemultiturn1700", "nvolvemultiturnfiltered5k", "nvolvemultiturnfiltered5knoanswer", "nvolvemultiturnfiltered7k", "nvolvemultiturnfiltered7knoanswer", "nvolvemultiturnfiltered7knoanswer1k", "nvolvemultiturnfiltered7knoanswer2k", "nvolvemultiturnfiltered7knoanswer3k",
        "nvolvemultiturnfiltered7knoanswerlonghistory",
        "nvolvemultiturnfiltered7knoanswerlonghistorydiscont", "nvolvemultiturnfiltered7knoanswerlonghistorydiscontfixv1", "nvolvemultiturnfiltered7knoanswerlonghistorydiscontfixv2", "scalecqav1", "scalecqanoanswer", "extracqanoanswerv1", "instructv1", "instructv2", "instructv3", "instructtablegeneral", "instructtablegeneralv2", "instructtableunansreasongeneral", "doqa_cooking", "doqa_movies", "doqa_travel", "hybriddial", "hybriddial_general", "hybriddialunanswerable", "hybriddialunanswerablemixed", "hybriddialunanswerablegeneral", "hybriddialunanswerablegeneralv2", "inscit", "inscit_dragon", "convfinqalong", "convfinqalonggeneral", "convfinqalongunanswerable", "convfinqalongunanswerablegeneral", "convfinqalongunanswerablegeneralv2", "convfinqalongunanswerablewithreasongeneral", "cornercases", "cornercasesv2", "convfinqa_general_long_answer"]

    formatted_dataset_name_short = ["coqa"]
    formatted_dataset_name_short_and_long = ["sqa", "sqa_general", "topiocqa", "topiocqa_dragon"]
    formatted_dataset_name_entity = ["sqa_general_long_answer"]
    singleturn_dataset_name_short_and_long = ["tatqamultispan", "llmware", "tatqamultispangeneral"]
    singleturn_dataset_name_long = ["kilt_nq", "kilt_tqa", "kilt_hotpotqa", "kilt_hotpotqa_rerank"]
    singleturn_dataset_entity = ["tatqamultispanv2general"]

    math_program_with_context = ["finqa", "finqav2"]
    math_program_with_context_v2 = ['tatqav2']
    math_program_with_context_v3 = ['tatqav3', 'tatqageneral']
    math_program_multiturn = ["convfinqa", "convfinqav2"]
    math_program_multiturn_v2 = ["convfinqav3", "convfinqa_general"]

    if dataset_name in formatted_dataset_name:
        # dialogue_turn = query

        ## adding this instruction to multi-turn
        tmp_list = query.split("User:", 1)  # split will stop at the first "User:"
        dialogue_turn = "User: Please give a full and complete answer for the question." + tmp_list[1]

    elif dataset_name in formatted_dataset_name_short_and_long:

        ## keep earlier turns as-is and put the instruction on the last user turn
        tmp_list = query.split("User:")
        tmp_list = tmp_list[1:]

        dialogue_turn = ""
        if len(tmp_list) > 1:
            for item in tmp_list[:-1]:
                dialogue_turn += "User:" + item
        dialogue_turn += "User: Answer the following question with a short span, or a full and complete answer." + tmp_list[-1]

    elif dataset_name in formatted_dataset_name_entity:

        tmp_list = query.split("User:")
        tmp_list = tmp_list[1:]

        dialogue_turn = ""
        if len(tmp_list) > 1:
            for item in tmp_list[:-1]:
                dialogue_turn += "User:" + item
        dialogue_turn += "User: Answer the following question with one or a list of items." + tmp_list[-1]

    elif dataset_name in formatted_dataset_name_short:
        tmp_list = query.split("User:")
        tmp_list = tmp_list[1:]

        dialogue_turn = ""
        if len(tmp_list) > 1:
            for item in tmp_list[:-1]:
                dialogue_turn += "User:" + item
        dialogue_turn += "User: Answer the following question with a short span. The answer needs to be just in a few words." + tmp_list[-1]

    elif dataset_name in math_program_multiturn:

        ## for training
        tmp_list = query.split("User:", 1)  # split will stop at the first "User:"
        dialogue_turn = "User: Answer the following question with a number from context or the math arithmetic (add, subtract, multiply, and divide)." + tmp_list[1]

    elif dataset_name in math_program_multiturn_v2:
        ## for evaluation
        tmp_list = query.split("User:")
        tmp_list = tmp_list[1:]
        dialogue_turn = ""
        if len(tmp_list) > 1:
            for item in tmp_list[:-1]:
                dialogue_turn += "User:" + item
        dialogue_turn += "User: Answer the following question with a number from context or the math arithmetic using +, -, *, or /." + tmp_list[-1]

    else:
        ## single-turn datasets: prepend a dataset-specific instruction to the query
        if dataset_name in short_span_with_context:
            user = "Answer the following question with a short span. The answer needs to be just in a few words. {}".format(query)
        elif dataset_name in yes_no_without_context:
            user = "Answer the following question with True or False. {}".format(query)
        elif dataset_name in multichoices:
            user = "Answer the following question by selecting one of the provided options. {}".format(query)
        elif dataset_name in math_program_with_context:

            ## for evaluation
            user = "Answer the following question with the math arithmetic using +, -, *, or /. {}".format(query)
        elif dataset_name in math_program_with_context_v2:
            ## for evaluation
            user = "Answer the following question with a short span or a number from context or the math arithmetic (add, subtract, multiply, and divide). {}".format(query)
        elif dataset_name in math_program_with_context_v3:
            ## for training
            user = "Answer the following question with a number from context or the math arithmetic using +, -, *, or /. {}".format(query)

        elif dataset_name in singleturn_dataset_name_short_and_long:
            user = "Answer the following question with a short span, or a full and complete answer. {}".format(query)

        elif dataset_name in singleturn_dataset_name_long:
            user = "Please give a full and complete answer for the question. {}".format(query)

        elif dataset_name in singleturn_dataset_entity:
            user = "Answer the following question with one or a list of items. {}".format(query)

        elif dataset_name == "qmsum":
            user = "Please summarize a full and complete answer for the following question. {}".format(query)
        elif dataset_name in ("longbook_sum_eng", "longbook_sum_eng_gpt4_same"):
            user = "Summarize the book above with a long paragraph."

        else:
            # fetaqa/llmware_unanswerable goes to here by default
            user = "Please give a full and complete answer for the question. {}".format(query)

        if dataset_name in ["kilt_nq_short", "kilt_tqa_short", "kilt_hotpotqa_short"]:
            dialogue_format = "User: {}\n\nAssistant: The answer is "
        else:
            dialogue_format = "User: {}\n\nAssistant:"
        dialogue_turn = dialogue_format.format(user)

    if ft_neighbours > 0:

        ## normal ordering
        context = "\n\n".join(neighbours[0:ft_neighbours]) + "\n\n"

        context_tokens = tokenizer.encode(context)
        dialogue_tokens = tokenizer.encode(dialogue_turn)
        system_tokens = tokenizer.encode(system)

        ## truncate the context so that system + context + dialogue + generated output fit in max_seq_length
        ## (note: encode() may prepend a BOS token to each piece, so these counts can be slightly conservative)
        if len(system_tokens) + len(dialogue_tokens) + len(context_tokens) + max_output_len > max_seq_length:
            context_tokens = context_tokens[:max_seq_length - max_output_len - len(dialogue_tokens) - len(system_tokens)]
            context = tokenizer.decode(context_tokens, clean_up_tokenization_spaces=False) + "\n"

        all_input = system + context + dialogue_turn

        input_tokens = tokenizer.encode(all_input)
    else:
        all_input = system + dialogue_turn

        input_tokens = tokenizer.encode(all_input)

    return input_tokens


def get_chatqa2_input(data_list, eval_dataset, tokenizer, num_ctx, max_output_len, max_seq_length):

    ft_neighbours = num_ctx
    dataset_name = eval_dataset
    bos_token = "<|begin_of_text|>"
    prompt_list = []

    for sample in data_list:
        query, _, neighbours = sample
        input_tokens = reformat_prompt_v2(query, neighbours, dataset_name.split(".")[0], ft_neighbours,
                                          max_output_len, tokenizer, max_seq_length)
        raw_text = tokenizer.decode(input_tokens, clean_up_tokenization_spaces=False)
        ## strip the leading BOS token; the evaluation script adds it back before generation
        assert raw_text.startswith(bos_token)
        raw_text = raw_text[len(bos_token):]
        prompt_list.append(raw_text)
    return prompt_list
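The context truncation in `reformat_prompt_v2` comes down to a token budget: whatever is left of `max_seq_length` after reserving room for the system prompt, the dialogue turn, and the generated output goes to the retrieved context. A minimal sketch of that arithmetic follows, using plain token lists instead of a real tokenizer. The helper name `truncate_context` and the clamp to zero are assumptions added for illustration and are not part of `dataset_conv.py`.

```python
def truncate_context(context_tokens, n_system, n_dialogue, max_output_len, max_seq_length):
    """Trim context tokens so system + context + dialogue + output fit the window.

    Hypothetical helper mirroring the budget used in reformat_prompt_v2.
    """
    budget = max_seq_length - max_output_len - n_dialogue - n_system
    # Assumption (not in the original code): clamp at zero, because a negative
    # slice bound would keep tokens from the front instead of dropping them all.
    budget = max(budget, 0)
    return context_tokens[:budget]


if __name__ == "__main__":
    tokens = list(range(10))
    # Window of 12 tokens, 1 reserved for output, 3 for dialogue, 2 for system.
    print(truncate_context(tokens, n_system=2, n_dialogue=3,
                           max_output_len=1, max_seq_length=12))
```

When the context already fits, the slice returns it unchanged, so the helper can be applied unconditionally.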
code/evaluate_cqa_vllm_chatqa2.py
ADDED
@@ -0,0 +1,97 @@

## The following code is adapted from
## https://docs.mystic.ai/docs/llama-2-with-vllm-7b-13b-multi-gpu-70b


from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from arguments import get_args
from dataset_conv import get_chatqa2_input, preprocess
from tqdm import tqdm
import torch
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ['VLLM_NCCL_SO_PATH'] = '/usr/local/lib/python3.8/dist-packages/nvidia/nccl/lib/libnccl.so.2'

def get_prompt_list(args):

    ## get tokenizer
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)


    data_list = preprocess(args.sample_input_file, inference_only=True, retrieved_neighbours=args.use_retrieved_neighbours)
    print("number of total data_list:", len(data_list))
    if args.start_idx != -1 and args.end_idx != -1:
        print("getting data from %d to %d" % (args.start_idx, args.end_idx))
        data_list = data_list[args.start_idx:args.end_idx]

    print("number of test samples in the dataset:", len(data_list))
    prompt_list = get_chatqa2_input(data_list, args.eval_dataset, tokenizer, num_ctx=args.num_ctx, max_output_len=args.max_tokens, max_seq_length=args.max_seq_length)

    return prompt_list


def main():
    args = get_args()

    ## bos token for llama-3
    bos_token = "<|begin_of_text|>"
    ## get model_path
    model_path = args.model_folder

    ## get prompt_list
    prompt_list = get_prompt_list(args)

    ## outputs are written to an "outputs" folder next to the model files
    output_path = os.path.join(model_path, "outputs")
    os.makedirs(output_path, exist_ok=True)

    ## get output_datapath
    if args.start_idx != -1 and args.end_idx != -1:
        if args.use_retrieved_neighbours:
            output_datapath = os.path.join(output_path, "%s_output_%dto%d_ctx%d.txt" % (args.eval_dataset, args.start_idx, args.end_idx, args.num_ctx))
        else:
            output_datapath = os.path.join(output_path, "%s_output_%dto%d.txt" % (args.eval_dataset, args.start_idx, args.end_idx))
    else:
        if args.use_retrieved_neighbours:
            output_datapath = os.path.join(output_path, "%s_output_ctx%d.txt" % (args.eval_dataset, args.num_ctx))
        else:
            output_datapath = os.path.join(output_path, "%s_output.txt" % (args.eval_dataset))

    ## run inference (greedy decoding)
    sampling_params = SamplingParams(temperature=0, top_k=1, max_tokens=args.max_tokens)

    ## shard the model across 8 GPUs with tensor parallelism
    model_vllm = LLM(model_path, tensor_parallel_size=8, dtype=torch.bfloat16)
    print(model_vllm)

    output_list = []
    for prompt in tqdm(prompt_list):
        prompt = bos_token + prompt
        output = model_vllm.generate([prompt], sampling_params)[0]
        generated_text = output.outputs[0].text
        generated_text = generated_text.strip().replace("\n", " ")

        ## for llama3: cut the text at the first end-of-turn / end-of-text marker
        if "<|eot_id|>" in generated_text:
            idx = generated_text.index("<|eot_id|>")
            generated_text = generated_text[:idx]
        if "<|end_of_text|>" in generated_text:
            idx = generated_text.index("<|end_of_text|>")
            generated_text = generated_text[:idx]

        print("="*80)
        print("prompt:", prompt)
        print("-"*80)
        print("generated_text:", generated_text)
        print("="*80)
        output_list.append(generated_text)

    print("writing to %s" % output_datapath)
    with open(output_datapath, "w", encoding="utf-8") as f:
        for output in output_list:
            f.write(output + "\n")


if __name__ == "__main__":
    main()
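The post-processing in `main()` cuts each generation at the first Llama-3 end marker. The same step can be sketched as a small standalone function, which is easier to test without loading a model. The name `truncate_at_stop_tokens` is hypothetical and does not exist in the script above; the two default markers are the ones the script checks for.

```python
def truncate_at_stop_tokens(text, stop_tokens=("<|eot_id|>", "<|end_of_text|>")):
    """Return `text` up to, but not including, the earliest stop token.

    Hypothetical helper; equivalent in effect to the two sequential
    checks in main() for the default pair of markers.
    """
    cut = len(text)
    for token in stop_tokens:
        idx = text.find(token)  # -1 when the token is absent
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


if __name__ == "__main__":
    print(truncate_at_stop_tokens("Paris<|eot_id|>ignored tail"))
```

Taking the minimum index over all markers keeps the result independent of the order in which the markers are listed.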