JosephusCheung committed
Commit: d322a11
Parent(s): 3b3e966
Update README.md

README.md CHANGED
@@ -34,7 +34,7 @@ This release, CausalLM/35b-beta-long, represents the culmination of our experien
 We chose Cohere's multilingual, 35B-parameter with long context [CohereForAI/c4ai-command-r-v01] MHA model as our base. In our evaluation, it proved to be the most responsive to the quality of training data throughout the Supervised Fine-Tuning process, outperforming other open-source LLMs. Although its initial SFT/RL focuses on specific tasks and comes with a non-commercial license, we believe it's currently the best foundation for personal and internal use cases.

-Utilizing extensive factual content from web crawls, we synthesized over 30 million multi-turn dialogue data
+Utilizing extensive factual content from web crawls, we synthesized over 30 million multi-turn dialogue data entries, grounded in multiple web-pages or documents. This process involved substantial human oversight and a data pipeline designed to ensure high quality. The model was then trained on this data in full 128K context using BF16 precision. We also incorporated widely-used open-source dialogue datasets to enhance general conversational fluency.

 Our data synthesis approach addressed crucial limitations in typical LLM training corpora. LLMs often struggle to extract thematic summaries, key information, or perform comparisons at the paragraph or document level. Therefore, we focused on generating fact-based data using multiple documents within a long context setting. This involved leveraging existing SOTA LLMs with human guidance to synthesize information through thematic summarization, information extraction, and comparison of source materials.
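The synthesis approach described in the changed paragraph can be sketched as a small pipeline: pack several source documents into one long context, pose one of the three task types (thematic summarization, information extraction, comparison), and record the exchange as a fact-grounded multi-turn dialogue sample. This is a minimal illustrative sketch, not the authors' actual pipeline; `call_llm`, the task prompts, and the sample schema are all assumptions, and the LLM call is stubbed so the skeleton runs offline.

```python
# Hypothetical sketch of multi-document, fact-grounded dialogue synthesis.
# All names here (call_llm, TASKS, synthesize_dialogue) are illustrative
# assumptions, not part of the CausalLM/35b-beta-long codebase.

TASKS = {
    "summarize": "Write a thematic summary covering all documents below.",
    "extract": "Extract the key facts shared across the documents below.",
    "compare": "Compare and contrast the claims made by the documents below.",
}

def call_llm(prompt: str) -> str:
    # Stub: a real pipeline would query a SOTA LLM here, with human
    # oversight on the outputs (assumption; stubbed to stay runnable).
    return f"[model output for: {prompt[:40]}...]"

def synthesize_dialogue(documents: list[str], task: str) -> list[dict]:
    """Build one multi-turn dialogue sample grounded in several documents."""
    context = "\n\n---\n\n".join(documents)  # long-context packing
    question = TASKS[task]
    answer = call_llm(f"{question}\n\n{context}")
    # Store as a chat-style sample: user turn carries the grounded request,
    # assistant turn carries the synthesized answer.
    return [
        {"role": "user", "content": f"{question}\n\n{context}"},
        {"role": "assistant", "content": answer},
    ]

if __name__ == "__main__":
    sample = synthesize_dialogue(["Doc A ...", "Doc B ..."], task="compare")
    print(sample[0]["role"], "->", sample[1]["role"])
```

In a real run, the `compare` and `extract` task types are what target the paragraph- and document-level weaknesses the README calls out, since each answer must be synthesized across document boundaries rather than copied from a single source.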