---
license: apache-2.0
language:
  - en
  - zh
tags:
  - finance
  - invest
---

Deepmoney

Introducing Greed in the Seven Deadly Sins series of models.

Full-parameter pre-training on Yi-34b
High-quality research reports
High-end cleaning process

1. What do I want to do?

Most current so-called financial models are trained on public knowledge, but in real-world finance this public knowledge is often seriously inadequate for explaining the current market. If you are interested, you can read up on the various propositions of Keynes, Friedman, and even today's behavioral finance. From what I have observed, most financial models cannot make investment judgments, because they are trained on ordinary textbooks, entry-level analyst exams, and even companies' public reports. I think this is of very little value for investing.

You can think I'm joking, but the fact is that the logic of many subjective analysts may not be as rigorous as that of large models of 34b and above (the excellent analysts excluded, of course). The market changes every moment, with a flood of news and massive amounts of data arriving in real time. For most retail investors, instead of waiting for a crappy analyst to write a report, why not build a pipeline out of large models?

In my plan, this model is the base model of that pipeline, and models such as the information collector, target judge, qualitative analyst, quantitative analyst, and data extractor are all parts of it. But it is undoubtedly important for the base model itself to master a large number of qualitative and quantitative methods. That's why this model was born.
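To make the planned process more concrete, here is a minimal sketch of how such roles could be chained. Everything in it is an assumption for illustration: the `Pipeline` class, the stage order, the stub functions, and the made-up "ACME" data are hypothetical, and in a real setup each stage would call a model fine-tuned from this base.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A stage reads the shared state and returns the keys it adds or updates.
Stage = Callable[[Dict[str, object]], Dict[str, object]]


@dataclass
class Pipeline:
    """Runs named stages in order, passing a shared state dict between them."""
    stages: List[Tuple[str, Stage]] = field(default_factory=list)

    def add(self, name: str, stage: Stage) -> "Pipeline":
        self.stages.append((name, stage))
        return self

    def run(self, state: Dict[str, object]) -> Dict[str, object]:
        state = dict(state)  # work on a copy of the caller's dict
        for name, stage in self.stages:
            state.update(stage(state))
            state["trace"] = list(state.get("trace", [])) + [name]
        return state


# Hypothetical stubs: each one stands in for a model call.
def information_collector(state):
    return {"documents": ["ACME Q3 revenue rose 12% year over year."]}


def target_judge(state):
    return {"target": "ACME"}


def data_extractor(state):
    return {"figures": {"revenue_growth_pct": 12.0}}


def qualitative_analyst(state):
    return {"qualitative": f"Demand for {state['target']} looks healthy."}


def quantitative_analyst(state):
    growth = state["figures"]["revenue_growth_pct"]
    return {"quantitative": f"Revenue growth of {growth:.1f}% supports the thesis."}


def build_pipeline() -> Pipeline:
    # The order below is an assumption; the plan does not fix one.
    return (
        Pipeline()
        .add("information_collector", information_collector)
        .add("target_judge", target_judge)
        .add("data_extractor", data_extractor)
        .add("qualitative_analyst", qualitative_analyst)
        .add("quantitative_analyst", quantitative_analyst)
    )


result = build_pipeline().run({})
print(result["trace"])
```

Keeping every role behind the same `Stage` signature means a stub can later be swapped for a real model call without touching the rest of the chain.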

2. About the data

As I just said, the validity of a lot of public knowledge is somewhat questionable, but that doesn't mean it's wrong. The theory behind many of the research methods used in research reports also relies on this knowledge. So for training I picked some college textbooks and some professional books: not many, but of good quality. In addition, I selected a large amount of research report data from 2019 through December 2023. These reports come from a variety of publishers, including traditional brokers and professional research institutions. Most of them are paid and only available to institutions, but I obtained them anyway through various means.

If you have read research reports, especially high-quality ones, you will find that they all consist of subjective judgment + quantitative analysis, and that the data supporting the quantitative analysis is crucial to the entire logical chain. To extract this data (most of it appears as graphs or tables), I tried a lot of multi-modal models, and the process was very painful. The conclusion is that cog-agent and emu2 are very effective for this kind of task. To extract information better, I built a process that summarizes a report's context and includes that summary as part of the prompt.
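The idea of feeding a context summary into the extraction prompt can be sketched as follows. This is only an illustration under assumptions: `summarize_context` is a naive placeholder (first sentence of each paragraph, truncated), the prompt wording is invented, and no real cog-agent or emu2 API is called. In practice the summary would presumably come from a language model and the prompt would be sent together with the figure image.

```python
from typing import List


def summarize_context(paragraphs: List[str], max_chars: int = 300) -> str:
    """Placeholder summarizer: first sentence of each paragraph, then truncate."""
    firsts = []
    for p in paragraphs:
        p = p.strip()
        if not p:
            continue
        first = p.split(". ")[0].rstrip(".")
        firsts.append(first + ".")
    return " ".join(firsts)[:max_chars]


def build_extraction_prompt(
    report_title: str, paragraphs: List[str], figure_caption: str
) -> str:
    """Assemble a text prompt that gives the multi-modal model report context."""
    summary = summarize_context(paragraphs)
    return (
        f"Report: {report_title}\n"
        f"Context summary: {summary}\n"
        f"Figure caption: {figure_caption}\n"
        "Task: extract every data point in the figure as rows of "
        "series, x, y, unit. Do not guess values you cannot read."
    )


prompt = build_extraction_prompt(
    "ACME 2023 outlook",  # made-up report for illustration
    ["Revenue grew strongly. Margins fell.", "Capex is rising. Debt is flat."],
    "Figure 3: Quarterly revenue",
)
print(prompt)
```

Giving the model the surrounding context is meant to help it resolve unlabeled axes, units, and series names that the figure alone leaves ambiguous.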

Finally, I blended these data sources. General-purpose data is not included, because this model exists purely for Greed. Moreover, the knowledge in industry research reports is comprehensive enough.

3. About training

Raw text, full-parameter training. The base is the long-context yi-34b-200k, which is necessary for reading and fully understanding an in-depth report.
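For readers unfamiliar with raw-text pre-training, one common way to prepare such data is to concatenate tokenized documents and cut them into fixed-length blocks. Whether Deepmoney's data was prepared this way is not stated here, so treat the sketch below as a generic illustration. The function name `pack_token_blocks`, the toy token ids, and the tiny block size are all assumptions; a long-context run would use a real tokenizer and a far larger block size.

```python
from typing import Iterable, Iterator, List


def pack_token_blocks(
    docs: Iterable[List[int]], block_size: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by eos_id, and yield full blocks."""
    buffer: List[int] = []
    for tokens in docs:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # A trailing partial block is dropped here; padding it is another common choice.


# Toy token ids standing in for three tokenized reports.
toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_token_blocks(toy_docs, block_size=4, eos_id=0))
print(blocks)
```

With a long context window, a whole in-depth report can sit inside a single block instead of being split across several.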

Of course, I also did one round of SFT. This is the analyst in my process. I haven't split it into qualitative and quantitative analysis yet, but I'm already blown away by how well it works.

More technical details and evals coming soon...

1. 我想干什么？

当下大多数所谓的金融模型大多在公开知识上进行训练，但在实际的金融领域，这些公开知识对当前的市场可解释性往往严重不足。如果您感兴趣，可以了解一下凯恩斯、弗里德曼乃至当下行为金融学的各类主张。而据我观察，大多数金融模型无法对投资进行判断。因为它们都是在普通的教材、入门的分析师考试，乃至公司的公开报告上训练。我认为这对于投资的价值非常小。

你可以当我开玩笑，但事实是很多主观分析师的逻辑性可能还不如34b及以上的大模型来的严谨（当然不包括那些优秀的）。而每时每刻，市场都在变化，大量的新闻，海量的数据都是实时的，对于大多数散户们，与其等待蹩脚的分析师写出报告，为什么不用大模型制作一套pipeline呢？

在我的计划中，该模型是这套流程的基座模型，信息搜集者、标的判断者、定性分析者、定量分析者、数据提取者等模型都是该流程的一部分。但模型本身掌握大量的定性和定量方法毫无疑问是重要的。这就是这个模型诞生的理由。

2. 关于数据：

正如我刚才所说，很多公开知识的有效性都有些问题——但这并不意味着它们是错误的。在研报中很多研究方法背后的理论支撑也依赖这些知识。所以在我的训练中，我挑选了一些大学教材和一些专业书籍。数量不是很多但质量还不错。另外，我挑选了在2019-2023年12月的大量研究报告数据——这些报告的发布者多种多样，有传统的broker，也有专业研究机构。他们中的大多数是付费的，而且只对机构提供。但无论如何我通过各种各样的手段获取了它们。

如果你看过研报，尤其是高质量的那些，你会发现研报都是主观判断+定量分析，而定量分析中的数据支撑对于整个逻辑链条至关重要。为了提取这些数据（他们中的大多数以图形或者表格的形式出现），我尝试了很多多模态模型，过程非常痛苦，结论是cog-agent和emu2对于这类任务很有效。为了更好的提取信息，我制作了一套从研报上下文总结作为prompt一部分的流程。

最后，我把这些数据做了一个混合。并没有放入通识数据, 因为它就是为了greed而生的。而且行业研报中的知识足够全。

3. 关于训练：

raw text，全参数训练。基座采用了长上下文的yi-34b-200k。这对于完成理解一篇深度报告是必须的。

当然，我也做了一次sft。这是我的流程中的分析者——目前还没有细分定性和定量分析，但它的效果已经让我大吃一惊了。
