---
strip-comments: true
bibliography: ["ref.bib"]
format:
  revealjs:
    logo: ./figures/logo/sustech.png
    # footer: |
    slide-number: true
    multiplex: false
    show-notes: false
    theme: sustech.scss
    show-slide-number: all
    controls: false
    preview-links: true
    transition: "slide"
    preload-iframes: true
    view-distance: 10
    width: 1280
    height: 720
    mermaid:
      theme: dark
    code-overflow: wrap
    callout-icon: false
execute:
  echo: false
revealjs-plugins:
  - verticator
  - codewindow
  - qrcode
---

## {.theme-title .center}

::: {.titlebox style="text-align:center; font-size: 2em;"}
[Modeling on Internet-scale Data]{.adlery style="color:#320005;"}

[Bingyi Jing@ML-Summit]{style="font-size:0.5em;"}

[Apr 25th, 2024]{style="font-size:0.5em;"}
:::

## {.theme-content}

:::: columns
::: {.column width="30%"}
:::
::: {.column width="70%"}
::: {.titlebox style="font-size: 1.5em;"}
- LLM/LVM is Data-hungry
:::

::: {.titlebox style="font-size: 1.5em;"}
- Streaming Data Flow
:::

::: {.titlebox style="font-size: 1.5em;"}
- Scaling Exact Attention
:::

:::
::::

::: {.notes}
- New challenges facing multimodal large models
- How to process and train on internet-scale data
- How to model ultra-long sequences
:::

# {.theme-section}

::: {.title}
LLM/LVM is Data-hungry
:::

## Revisiting the Pre-GPT Era

For the same text input, different tasks used to require different annotated datasets and different models.

::: columns
::: {.column width="50%"}
- Sentiment analysis ([IMDB](https://huggingface.co/datasets/imdb): 100k rows, 84.1MB)

```{mermaid}
flowchart LR
  markdown["我今天去国家大剧院看了一场精彩的演出"]
  newLines["Positive"]
  markdown --> newLines
```

- Named entity recognition ([CoNLL-2003](https://huggingface.co/datasets/jnlpba): 20k rows, 15MB)

```{mermaid}
flowchart TD
  A["我"]
  B["今天去"]
  C["国家大剧院"]
  D["看了一场"]
  E["精彩的演出"]
  A-->AN[Person]
  B-->BN[Time]
  C-->CN["Location"]
  E-->EN["Event"]
```
:::
::: {.column width="50%"}
- Machine translation ([wmt19](https://huggingface.co/datasets/wmt19/viewer/cs-en/train): 15M rows, 1.25GB)

```{mermaid}
flowchart TD
  markdown["我今天去国家大剧院看了一场精彩的演出"]
  newLines["I went to the National Grand Theater today and watched a wonderful performance."]
  markdown <--> newLines
```

- ...
:::
:::

## Unified Task Paradigm {auto-animate=true}
{{< fa solid user >}}  **You**

::: {style="margin-left: 54px;"}
What sentiment does "我今天去国家大剧院看了一场精彩的演出" express?
:::

{{< include chatgpt.qmd >}}

::: {style="margin-left: 54px;"}
Positive
:::

## Unified Task Paradigm {auto-animate=true}
{{< fa solid user >}}  **You**

::: {style="margin-left: 54px;"}
Please extract the time, location, and activity from the following text: "我今天去国家大剧院看了一场精彩的演出".
:::

{{< include chatgpt.qmd >}}

::: {style="margin-left: 54px;font-size: 0.8em;"}
In the text you provided:

- Time: today
- Location: the National Grand Theater (国家大剧院)
- Activity: watched a wonderful performance
- Time of occurrence: not stated explicitly, but "today" implies it happened today.
:::

## Unified Task Paradigm {auto-animate=true}
{{< fa solid user >}}  **You**

::: {style="margin-left: 54px;"}
Translate "我今天去国家大剧院看了一场精彩的演出" into English.
:::

{{< include chatgpt.qmd >}}

::: {style="margin-left: 54px;"}
"I went to the National Grand Theater today and watched a wonderful performance."
:::

## Unified Task Paradigm {auto-animate=true}

::: columns
::: {.column width="50%"}
![](./figures/causal_modeling.svg)
:::
::: {.column width="50%"}
![](./figures/masked_modeling.svg)
:::
:::

::: columns
::: {.column width="50%"}
::: {.fragment .strike}
Datasets are hard to obtain and limited in size
:::
:::
::: {.column width="50%"}
::: {.fragment}
Any document can be used as training data
:::
:::
:::

::: columns
::: {.column width="50%"}
::: {.fragment .strike}
Models cannot share knowledge with each other
:::
:::
::: {.column width="50%"}
::: {.fragment}
Only a single model is needed
:::
:::
:::

::: columns
::: {.column width="50%"}
::: {.fragment .strike}
Unlabeled data is abundant but hard to exploit
:::
:::
::: {.column width="50%"}
::: {.fragment}
No annotation is needed; documents can be trained on directly
:::
:::
:::

## Pretrained models are data-hungry {auto-animate=true}

```{=html}
{{< include components/nlp.qmd >}}
```

::: {style="text-align:center; font-size: 0.4em;"}
The official datasets hosted on Hugging Face as of April 2024, categorized into a tree diagram by task type,
compared with the data used to pre-train GPT-3.
:::

Modern large language models require pre-training data far beyond the scale of traditional NLP datasets.

## Pretrained models are data-hungry {auto-animate=true}

Pre-training GPT-3 consumed roughly 0.75 TB of text:

- CommonCrawl 570GB
- WebText 50GB
- Wikipedia 11GB
- Books 21GB
- Academic Journals 101GB

## Pretrained models are data-hungry {auto-animate=true}

Pre-training GPT-3 consumed roughly 0.75 TB of text.

By today's standards, that is not a large amount.

```{=html}
{{< include components/gpt.qmd >}}
```

## Pretrained models are data-hungry {auto-animate=true}

Pre-training GPT-3 consumed roughly 0.75 TB of text.

By today's standards, that is not a large amount.

::: {style="text-align:center;"}
![](./figures/2024-Alan-D-Thompson-AI-Bubbles-Planets-Rev-6.png){width=80%}
:::

## {auto-animate=true background-video="./figures/tokyo-walk.mp4"}

## Dawning of the World Model Era {auto-animate=true .smaller background-video="./figures/tokyo-walk.mp4" background-opacity=0.25}

How much data did Sora use?

:::: columns
::: {.column width="50%"}
> We take inspiration from large language models which acquire generalist capabilities by training on **internet-scale** data [^1]
:::
::: {.column width="50%"}
> For comparison: about 500 hours of video are uploaded to YouTube every minute, so the last five years of uploads amount to roughly 1.3 billion hours, i.e. 78.8 billion minutes. Since training a text-to-video diffusion model requires high-quality captioned video, we can estimate Sora's training set at around 100 million minutes of video.
>
> A fairly reliable current estimate is that one minute of video corresponds to about 1M tokens. [^2]
:::
::::

::: {.notes}
- 1.3 words ≈ 1 token
- Following Diffusion Transformer, a 256x256 image is split into 32x32 patches. Suppose a 1920x1080 HD frame is downsampled to 512x256 and each patch covers an 8x8 pixel block; that yields a 64x32 patch grid, i.e. about 64x32 = 2048 patches per frame. HD video runs at 30+ fps, but training and inference compress it; assume about 9 fps after compression, i.e. 540 frames per minute. One minute of video is therefore 64x32x540 ≈ 1.1M patches. (昆仑万维)
:::

## Dawning of the World Model Era {auto-animate=true background-video="./figures/tokyo-walk.mp4" background-opacity=0.25}

:::: columns
::: {.column width="50%"}
> For comparison: about 500 hours of video are uploaded to YouTube every minute, so the last five years of uploads amount to roughly 1.3 billion hours, i.e. 78.8 billion minutes. Since training a text-to-video diffusion model requires high-quality captioned video, we can estimate Sora's training set at around 100 million minutes of video.
:::
::: {.column width="50%"}
::: {.r-fit-text}
~[500TB]{style="background-color:#e31c54; color:white;"} training data  
~[500PB]{style="background-color:#e31c54; color:white;"} raw data
:::
:::
::::

::: {.notes}
Assuming one minute of HD video is about 5 MB, 100 million minutes of video amounts to 500 TB, and sifting out those 100 million high-quality minutes may require about 500 PB of raw data.
:::

## Dawning of the World Model Era {auto-animate=true background-video="./figures/tokyo-walk.mp4" background-opacity=0.25}

```{=html}
{{< include components/token-bar.qmd >}}
```

## Challenge {auto-animate=true background-video="./figures/tokyo-walk.mp4" background-opacity=0.25}

:::: columns
::: {.column width="50%"}
::: {.r-fit-text}
Training on
[internet-]{.flow}
scale data
:::
:::
::: {.column width="45%"}
::: {.r-fit-text}
Modeling
[ultra-long]{.flow}
sequence
:::
:::
::::

[^1]: [Video generation models as world simulators (Sora tech report)](https://openai.com/research/video-generation-models-as-world-simulators)
[^2]: [浅谈 Sora 未来的千倍推理算力需求 (On Sora's future thousandfold inference compute demand)](https://zhuanlan.zhihu.com/p/683636677)

# {.theme-section}

::: {.title}
Streaming Data Flow
:::

## Legacy training paradigm {auto-animate=true .smaller}

The traditional approach downloads the entire dataset to local storage up front, then processes it.

```{.python code-line-numbers="5-8|11-14"}
{{< include ./scripts/hf.py >}}
```

1. Download the dataset
2. Preprocess it into model inputs and save them locally
3. Start training

::: notes
- Download all data into shared storage
- Convert the data into the model's input format in a single pass
:::

## Legacy training paradigm {auto-animate=true}

:::: columns
::: {.column width="40%"}
![](./figures/etl-explain-large2.webp){width=90%}
:::
::: {.column width="60%" .fragment}
In this paradigm, ETL and model training run strictly one after the other; it is simple and easy to reason about.

```{=html}
{{< include ./components/profile-old.qmd >}}
```
:::
::::

## What's the Problem? {auto-animate=true}

:::: columns
::: {.column width="60%"}

![](./figures/etl-ai.jpg)

:::
::: {.column width="40%"}
The ETL pipeline of multimodal large models is becoming ever more complex:

- E: many modalities, complex sources, long fetch times
- T: complex processing pipelines
- L: heavy storage footprint
:::
::::

## What's the Problem? {auto-animate=true}

For copyright and storage reasons, multimodal data is mostly distributed as download links, which limits how fast it can be fetched.

```{=html}
{{< include components/webvid.qmd >}}
```

::: {style="text-align:center; font-size: 0.4em;"}
WebVid is distributed as URLs and contains 10,730,233 records
:::

::: {.notes}
- This means fetching the data from within China requires expensive international bandwidth; for a small data center, downloading a Sora-scale dataset could take years.
- Even for a mid-sized dataset like WebVid, download and processing time can already be the training bottleneck.
:::

## What's the Problem? {auto-animate=true}

Preprocessing is complex and time-consuming, sometimes exceeding the cost of training itself.

:::: columns
::: {.column width="60%"}
::: {style="margin-top: 50px;"}
![](./figures/caption.jpg)
:::
:::
::: {.column width="40%"}
**GPT-4V (20s/it)**

::: {style="font-size: 0.4em;"}
An aerial video sweeps over a majestic ocean cliff with striated rock formations in rich hues of red, green, and orange. The sun's rays enhance the colorful palette of the landscape, while the sea's azure waters gently crash against the cliff's base. Visible are the textured details of the cliff face and the contrast of the green algae and seagrass coating parts of the rock. Seabirds can be seen flying close to the water around the rocky outcrop. The video conveys a serene yet dynamic coastal scenery, highlighting the natural beauty and geological diversity of a rugged coastline.
:::

🌋 **LLaVA-1.6-Yi-34B (3s/it)**

::: {style="font-size: 0.4em;"}
A breathtaking aerial view of a rocky cliff jutting out into the ocean. The cliff, adorned with vibrant green moss and patches of yellow and red lichen, is bathed in the warm glow of the sun. The ocean, a deep blue, is dotted with small white waves that crash against the cliff's base. The perspective of the video is from above, providing a bird's eye view of the cliff and the surrounding ocean. The sun's rays illuminate the scene, casting a beautiful light on the cliff and the ocean. The video is a stunning representation of the raw beauty of nature.
:::
:::
::::

## What's the Problem? {auto-animate=true}

> [Storage]{.red} plays an important role in AI training, and yet is one of the least talked-about aspects. As the GenAI training jobs become more multimodal over time, consuming large amounts of [ image, video, and text data ]{.red}, the need for data storage grows rapidly. [^llama3]

- Filtering 100 million minutes of video out of raw footage can mean tens of petabytes of raw data or more
- A typical small data center has no way to build storage infrastructure suited to video pre-training

[^llama3]: [Building Meta's GenAI Infrastructure](https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/)

## What's the Problem? {auto-animate=true}

:::: columns
::: {.column width="50%"}

![](./figures/etl-problem.webp)

:::
::: {.column width="50%"}
::: {.incremental}
- Complex data sources
- Sources that cannot be pulled immediately
- Complex processing pipelines
- Data processing coupled to model training
- Data too large to process in one pass
- ...
:::
:::
::::

## What's the Problem? {auto-animate=true}

:::: columns
::: {.column width="50%"}
- The data flow is drifting ever further from model training
- If data is still handled the traditional way, the data flow becomes the bottleneck that blocks training

::: {style="margin-left: 54px;"}
![](./figures/etl-explain-small.webp){width=80%}
:::
:::
::: {.column width="50%"}
```{=html}
{{< include ./components/profile-naive.qmd >}}
```
:::
::::

## {auto-animate=true}

::: {.r-fit-text}
How to train
on [internet-scale]{.flow}
data?
:::

## {auto-animate=true}

:::: columns
::: {.column width="50%"}
::: {.r-fit-text}
How to train
on [internet-scale]{.flow}
data?
:::
:::
::: {.column width="45%"}
::: {.r-fit-text .fragment}
Just training on
[the internet]{.flow}!
:::
:::
::::

## Streaming to the rescue {auto-animate=true}

:::: columns
::: {.column width="50%"}
![](./figures/streaming.gif)

::: {.incremental}
- Streaming the data solves these problems
- But streaming transfer is only the beginning: we need a training framework built entirely on streaming data
:::
:::
::: {.column width="50%"}
::: {.fragment}
![](./figures/decoup-data.png)
:::
:::
::::

## Streaming to the rescue {auto-animate=true}

![](./figures/rain1.svg){width=80%}
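
## Streaming to the rescue {auto-animate=true .smaller}

The streaming idea can be sketched in a few lines of Python (the names `stream_samples`, `fetch`, and `transform` are illustrative, not the framework's real API): samples are pulled lazily from remote URLs and transformed on the fly, so training can start immediately and nothing is staged on local disk.

```python
# A minimal sketch of streaming ETL: instead of "download everything,
# transform, then train", each sample is fetched and transformed just-in-time.
# `fetch` stands in for an HTTP/S3 read; here it is simulated with a dict.
from typing import Callable, Iterable, Iterator

def stream_samples(urls: Iterable[str],
                   fetch: Callable[[str], bytes],
                   transform: Callable[[bytes], dict]) -> Iterator[dict]:
    """Yield ready-to-train samples one by one; nothing touches local disk."""
    for url in urls:
        raw = fetch(url)          # E: pull a single shard/record
        yield transform(raw)      # T: decode/tokenize on the fly

# Simulated usage: the training loop consumes the iterator directly (L).
fake_store = {f"s3://bucket/{i}": f"text-{i}".encode() for i in range(3)}
batch = list(stream_samples(fake_store, fake_store.__getitem__,
                            lambda b: {"text": b.decode()}))
```

Because the iterator is lazy, startup cost is zero and the downstream consumer (a dataloader, a shared-memory queue, or the GPU itself) sets the pace.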

## Streaming to the rescue {auto-animate=true .smaller}

:::: columns
::: {.column width="60%"}

![](./figures/rain1.svg){width=80%}

:::
::: {.column width="40%"}
::: {.incremental}
- [x] Zero startup overhead
- [x] Data-processing processes fully decoupled from training processes
- [x] `SharedMemory` for intra-node communication, an in-memory database for inter-node communication
- [x] Processing-cluster topology independent of the GPU topology and adjustable on the fly
- [x] Periodic database sinks allow the data stream to be replayed
- [x] Deterministic splitting and shuffling guarantee consistent replay
:::
:::
::::

## {auto-animate=true background="./figures/mosaicml-streaming-dataset-img-1.gif"}

::: {.notes}
Samples within each cloud shard are split and shuffled deterministically, ensuring consistent replay regardless of the training topology.
:::

## Training on the internet {auto-animate=true .smaller background="./figures/mosaicml-streaming-dataset-img-1.gif" background-opacity=0.25}

S3 serves as the storage backend for both data and weights, enabling seamless migration across clouds of any scale.

```{=html}
{{< include components/cloud-switch.qmd >}}
```

## Training on the internet {auto-animate=true .smaller}

A DPU cluster lets data be streamed directly to the GPUs, eliminating the overhead of the in-memory database.

![](./figures/dpu.png){width=100%}

Powered by UCloud

::: {.notes}
In cooperation with the neutral cloud provider UCloud.
:::

## Training on the internet {auto-animate=true .smaller}

![](./figures/rain2.svg)
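
## Training on the internet {auto-animate=true .smaller}

The deterministic split-and-shuffle property can be illustrated with a small sketch (the helper names are hypothetical; the real implementation lives inside the framework): the sample order is a pure function of `(seed, epoch)`, and each rank takes a disjoint strided slice, so any segment of the stream can be replayed identically on any GPU topology.

```python
# Sketch of deterministic splitting and shuffling for replayable streams.
import hashlib

def _key(seed: int, epoch: int, item: int) -> int:
    # Stable hash: identical across processes and Python runs
    # (unlike built-in hash(), which is salted per process).
    h = hashlib.sha256(f"{seed}:{epoch}:{item}".encode()).hexdigest()
    return int(h, 16)

def deterministic_shuffle(num_samples: int, seed: int, epoch: int) -> list:
    # Sorting by a stable per-item hash gives the same permutation everywhere.
    return sorted(range(num_samples), key=lambda i: _key(seed, epoch, i))

def rank_slice(order: list, rank: int, world_size: int) -> list:
    # Strided split: every rank sees a disjoint, deterministic subset.
    return order[rank::world_size]
```

Since the order depends only on `(seed, epoch)` and the slice only on `(rank, world_size)`, resuming from a sink simply means recomputing the permutation and skipping consumed indices.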

## Training on the internet {auto-animate=true .smaller}

:::: columns
::: {.column width="50%"}
![](./figures/rain2.svg)
:::
::: {.column width="50%"}
- Further separates data processing from model training
- Makes ETL fully parallel with model training
:::
::::

::: {.fragment}
```{=html}
{{< include components/profile-stream.qmd >}}
```
:::

# {.theme-section}

::: {.title}
Scaling Exact Attention
:::

## Efficient distributed training infra {auto-animate="true"}

|             | Flash-Attn-2 | FP8 (H100) | 3D Parallel + Zero | Padding Free | Fused Kernel | Static Graph | TGS[^l] |
|------------:|:------------:|:----------:|:------------------:|:------------:|:------------:|:------------:|:---:|
| Platformers | ✔️ | ✔️ | ✔️ | ✔️ | [100%]{style="color:red;"} | ✔️ | [3743]{style="color:red;"} |
| Megatron-LM | ✖️ | ✔️ | ✔️ | ✖️ | 80% | ✖️ | 3581 |
| Deepspeed   | ✔️ | ✖️ | ✔️ | ✖️ | 60% | ✖️ | ✖️ |
| Colossal-ai | ✖️ | ✖️ | ✔️ | ✖️ | 40% | ✖️ | 2610 |

[^l]: Training LLaMA2 7B on a DGX (8×A100 40GB) with 4096 sequence length

## Scaling exact attention to ultra long sequence {auto-animate="true"}

![](./figures/context_parallel.svg)

## Scaling exact attention to ultra long sequence {auto-animate="true"}

![](./figures/computation_reduce.svg){width=80%}
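
## Scaling exact attention to ultra long sequence {auto-animate="true" .smaller}

The reason exact attention can be sharded along the sequence dimension is the online-softmax recurrence: each key/value chunk only updates a running max, a running normalizer, and an unnormalized output, so chunks can live on different devices and be combined exactly. A minimal single-query, pure-Python sketch of that idea (illustrative only, not the actual kernels):

```python
# Exact attention over K/V chunks via online softmax (the trick behind
# FlashAttention-style and ring/context-parallel attention).
import math

def attention_full(q, K, V):
    """Reference: softmax(q . K^T) @ V for a single query vector."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    d = len(V[0])
    return [sum(w[i] * V[i][j] for i in range(len(V))) / z for j in range(d)]

def attention_chunked(q, K, V, chunk=2):
    """Identical result, but K/V are consumed chunk by chunk, keeping only
    a running max `m`, normalizer `z`, and unnormalized output `acc`."""
    d = len(V[0])
    m, z, acc = float("-inf"), 0.0, [0.0] * d
    for start in range(0, len(K), chunk):
        Kc, Vc = K[start:start + chunk], V[start:start + chunk]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in Kc]
        m_new = max(m, max(scores))
        # Rescale previous partial results to the new max, then accumulate.
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, Vc):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

Because the per-chunk state is tiny and the combination rule is associative, the chunks can be distributed across GPUs and merged with one rescale per step, giving exact (not approximate) attention at ultra-long sequence lengths.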

## Scaling exact attention to ultra long sequence {auto-animate="true"}

:::: columns
::: {.column width="50%"}
```{=html}
{{< include ./components/seq-time.qmd >}}
```
:::
::: {.column width="50%"}
```{=html}
{{< include ./components/seq-tflops.qmd >}}
```
:::
::::

## Scaling exact attention to ultra long sequence {auto-animate="true"}

```{=html}
{{< include mocha.qmd >}}
```

# {.theme-end}

::: columns
::: {.column width="50%"}
::: {.r-fit-text}
Thanks
:::
:::
::: {.column width="25%"}
::: {style="text-align:center;"}
![wechat](./figures/qr/code.png)
:::
:::
::: {.column width="25%"}
::: {style="text-align:center;"}
![e-mail](./figures/qr/mail-data.png)
:::
:::
:::