parambharat committed
Commit 85bfd70
1 Parent(s): 049ff35

chore: fix citations and response format

Files changed (1): rag/rag.py (+14 -11)
rag/rag.py CHANGED
@@ -29,11 +29,7 @@ Here are the relevant snippets from the Llama 3 405B model research paper:
 {context_str}
 </snippets>
 
-<question>
-{query_str}
-</question>
-
-To answer this question:
+To answer the question:
 
 1. Carefully read and analyze the provided snippets.
 2. Identify information that is directly relevant to the user's question.
@@ -50,11 +46,14 @@ Guidelines for your answer:
 6. Cite the relevant sentences from the snippets and their page numbers to support your answer.
 7. Answer in MFAQ format (Minimal Facts Answerable Question), providing the most concise and accurate response possible.
 8. Use Markdown to format your response and include citations to indicate the snippets and the page number used to derive your answer.
+9. Your answer must only have two headings: 'Answer' and 'Citations'.
 
 Here's an example of a question and an answer. You must use this as a template to format your response:
 
 <example>
-Question: What was the main mix of the training data ? How much data was used to train the model ?
+<question>
+What was the main mix of the training data ? How much data was used to train the model ?
+</question>
 
 ### Answer
 The main mix of the training data for the Llama 3 405 billion parameter model is as follows:
@@ -66,16 +65,20 @@ The main mix of the training data for the Llama 3 405 billion parameter model is
 
 Regarding the amount of data used to train the model, the snippets do not provide a specific total volume of data in terms of tokens or bytes. However, they do mention that the model was pre-trained on a large dataset containing knowledge until the end of 2023[^2^]. Additionally, the training process involved pre-training on 2.87 trillion tokens before further adjustments[^3^].
 
-### References
+### Citations
 
-[^1^]: "Scaling Laws for Data Mix," page 6.
-[^2^]: "Pre-Training Data," page 4.
-[^3^]: "Initial Pre-Training," page 14.
+- [^1^]: "Scaling Laws for Data Mix," page 6.
+- [^2^]: "Pre-Training Data," page 4.
+- [^3^]: "Initial Pre-Training," page 14.
 
 </example>
 
 Remember, your role is to accurately convey the information from the research paper snippets, not to speculate or provide information from other sources.
 
+<question>
+{query_str}
+</question>
+
 Answer:
 """
 
@@ -113,7 +116,7 @@ class SimpleRAGPipeline(weave.Model):
             nodes,
             embed_model=self._get_embedding_model(),
             show_progress=True,
-            insert_batch_size=128,
+            insert_batch_size=512,
         )
 
         return index
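The main prompt change moves the `<question>` block from before the instructions to the very end of the template, after the few-shot example, so the model reads the actual query last. As a rough sketch of how a template with `{context_str}` and `{query_str}` slots like this one gets rendered (the `build_prompt` helper and its arguments are illustrative assumptions, not code from this repo):

```python
# Illustrative sketch only: PROMPT_TEMPLATE abbreviates the template string
# edited above; build_prompt and its arguments are assumed names.
PROMPT_TEMPLATE = (
    "<snippets>\n{context_str}\n</snippets>\n\n"
    "To answer the question:\n"
    "...\n\n"  # instructions, guidelines, and the <example> block go here
    "<question>\n{query_str}\n</question>\n\n"
    "Answer:\n"
)

def build_prompt(snippets: list[str], query: str) -> str:
    # Retrieved snippets fill {context_str}; the user's query lands in the
    # trailing <question> block that this commit introduces.
    return PROMPT_TEMPLATE.format(
        context_str="\n\n".join(snippets),
        query_str=query,
    )

print(build_prompt(["snippet one", "snippet two"], "How many tokens were used?"))
```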
 
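The only code change raises `insert_batch_size` from 128 to 512 when building the index: LlamaIndex embeds and inserts nodes into the vector store in batches of this size, so the bump means fewer, larger embedding calls while indexing. A minimal sketch of the effect, assuming LlamaIndex's core API; the sample document and mock embedding model are placeholders, not the repository's actual inputs:

```python
# Hedged sketch, assuming llama_index >= 0.10; the document text and
# MockEmbedding stand in for the pipeline's real corpus and embed model.
from llama_index.core import Document, MockEmbedding, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

docs = [Document(text="Llama 3 405B was pre-trained on a large corpus.")]
nodes = SentenceSplitter(chunk_size=256).get_nodes_from_documents(docs)

index = VectorStoreIndex(
    nodes,
    embed_model=MockEmbedding(embed_dim=8),  # placeholder embedding model
    show_progress=True,
    insert_batch_size=512,  # was 128: embed/insert up to 512 nodes per batch
)
```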