<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<title>Training Transformers Together</title>
<meta name="description" content="A NeurIPS'21 demonstration that explains how to train large models together with multiple collaborators.">
<link rel="mask-icon" href="https://learning-at-home.github.io/logo_small.png">
<link rel="alternate icon" class="js-site-favicon" type="image/png" href="https://learning-at-home.github.io/logo.png">
<link rel="icon" class="js-site-favicon" type="image/png" href="https://learning-at-home.github.io/logo.png">
<meta property="og:url" content="https://training-transformers-together.github.io">
<meta property="og:site_name" content="Training Transformers Together">
<meta property="og:title" content="Train vast neural networks together">
<meta property="og:description" content="A NeurIPS'21 demonstration that explains how to train large models together with multiple collaborators.">
<meta property="og:image" content="https://learning-at-home.github.io/logo_small.png">
<meta property="og:image:type" content="image/png">
<meta property="og:image:width" content="96">
<meta property="og:image:height" content="96">
<meta property="twitter:site" content="https://training-transformers-together.github.io">
<meta property="twitter:creator" content="Yandex, Hugging Face, University of Washington, Hivemind team & contributors">
<meta property="twitter:card" content="summary_large_image">
<meta property="twitter:title" content="Training Transformers Together">
<meta property="twitter:description" content="A NeurIPS'21 demonstration that explains how to train large models together with multiple collaborators.">
<meta property="twitter:image:src" content="https://learning-at-home.github.io/logo_horizontal.png">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<!-- Bootstrap core CSS -->
<link href="https://bootswatch.com/5/flatly/bootstrap.css" rel="stylesheet">
<!-- Custom styles for this template -->
<link href="./style.css" rel="stylesheet">
</head>
<body>
<div id="header_main" style="display: block;" class="mb-0 pb-0">
<canvas></canvas>
<div id="overlay">
<div id="header_window">
<div id="header">
<img src="https://learning-at-home.github.io/logo.png" id="bug-logo"
style="width: 40%; max-height: 320px; max-width: 320px; z-index:1000; position: relative;">
<br>
<h1 class="faded title title_elem mb-1 pb-1" style="margin-top:-25px; margin-bottom:-10px">
<p style="margin-top: 0px; font-weight:bolder; margin-bottom:0px;">
<span id="title_text">Training Transformers Together</span>
</p>
<p style="font-size: 18px; margin-top:0px; margin-bottom:5px;">
large-scale deep learning for everyone, by everyone</p>
<p style="font-size: 18px; font-weight:lighter; margin-top:0px; margin-bottom:0px;">
A NeurIPS 2021 Demonstration</p>
</h1>
</div>
</div>
</div>
</div>
<script src="./header-animate.js"></script>
<div class="container d-flex justify-content-center mb-2 pb-2" style="max-width: 500px">
<div class="row text-center align-items-center justify-content-center">
<div class="col-3">
<a href="https://research.yandex.com/">
<img src="logos/yandex.png" class="img-fluid center-block" style="max-width: 66%" alt="Yandex Research">
</a>
</div>
<div class="col-3 px-2">
<a href="https://huggingface.co/">
<img src="logos/huggingface.png" class="img-fluid center-block" style="max-width: 66%" alt="Hugging Face">
</a>
</div>
<div class="col-3 px-3">
<a href="https://www.hse.ru/en/">
<img src="logos/hse.png" class="img-fluid center-block" style="max-width: 66%" alt="HSE University">
</a>
</div>
<div class="col-3 px-2">
<a href="http://www.washington.edu/">
<img src="logos/uwash.png" class="img-fluid center-block" alt="University of Washington">
</a>
</div>
</div>
</div>
<div class="container" style="display: block;">
<p>
There was a time when you could comfortably train state-of-the-art vision and language models at home on your workstation.
The first convolutional neural network to win the ImageNet challenge
(<a target="_blank" href="https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf">AlexNet</a>)
was trained for 5-6 days on two gamer-grade GPUs. In contrast, today's top-1 ImageNet model
(<a target="_blank" href="https://arxiv.org/abs/2106.04803">CoAtNet</a>)
takes 20,000 TPU-v3 days to train. And things are even worse in the NLP world: training
<a target="_blank" href="https://arxiv.org/abs/2005.14165">GPT‑3</a>
on a top-tier server with 8x A100 GPUs would take decades.
</p>
<p>
So, can individual researchers and small labs still train state-of-the-art models? Yes, we can!
All it takes is for a bunch of us to come together. In fact, we're doing it right now and <b>you are invited to join!</b>
</p>
<iframe id="iframe_main" src="https://hf.space/streamlitiframe/training-transformers-together/dashboard-embedded/+"
data-src="https://hf.space/streamlitiframe/training-transformers-together/dashboard-embedded/+"
data-sdk="streamlit"
title="Streamlit app" class="container p-0 flex-grow space-iframe"
allow="accelerometer; ambient-light-sensor; autoplay; battery; camera; document-domain; encrypted-media; fullscreen; geolocation; gyroscope; layout-animations; legacy-image-formats; magnetometer; microphone; midi; oversized-images; payment; picture-in-picture; publickey-credentials-get; sync-xhr; usb; vr ; wake-lock; xr-spatial-tracking"
sandbox="allow-forms allow-modals allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-downloads"
style="top:-200px; left:0; bottom:0; right:0; width:100%; height:200px; border:none; margin:0; padding:0; z-index:999999;" scrolling=no>
<p>This was meant to be an IFrame, but your browser did not display it.</p>
<p>Please go to <a href="https://huggingface.co/spaces/training-transformers-together/demo">https://huggingface.co/spaces/training-transformers-together/demo</a>.</p>
</iframe>
<p>
In this demo, we train a model similar to <a target="_blank" href="https://openai.com/blog/dall-e/">OpenAI DALL-E</a>,
a Transformer model that generates images from text descriptions.
It is trained on <a target="_blank" href="https://laion.ai/laion-400-open-dataset/">LAION-400M</a>,
the world's largest openly available image-text-pair dataset with 400 million samples. Our model is based on
the <a target="_blank" href="https://github.com/lucidrains/DALLE-pytorch">dalle‑pytorch</a> implementation
by <a target="_blank" href="https://github.com/lucidrains">Phil Wang</a> with a few tweaks to make it communication-efficient.
</p>
<div class="accordion" id="accordionExample">
<div class="accordion-item">
<h2 class="accordion-header" id="headingOne">
<button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#collapseOne" aria-expanded="false" aria-controls="collapseOne">
How to train efficiently over the Internet?
</button>
</h2>
<div id="collapseOne" class="accordion-collapse collapse" aria-labelledby="headingOne" data-bs-parent="#accordionExample">
<div class="accordion-body">
<p>
Modern distributed training algorithms are designed for HPC clusters with a bandwidth of 10-100 gigabits per second.
By contrast, a typical Internet connection runs at 10-100 megabits per second: that’s three orders of magnitude slower.
To make distributed training efficient, you need to win back these three orders of magnitude.
This may seem daunting at first, but in reality, DL researchers have already developed all the pieces needed to solve this puzzle:
</p>
<table class="table table-hover">
<thead>
<tr>
<th scope="col">Speed&#8209;up</th>
<th scope="col">How to achieve</th>
</tr>
</thead>
<tbody>
<tr><td class="centered"><strong>4-16x</strong></td><td>
<strong>Large-batch training:</strong> <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1904.00962">You et al. (2019)</a> proposed a way to train neural networks efficiently with larger batches and, hence, fewer communication rounds.
</td></tr>
<tr><td class="centered"><strong>4-32x</strong></td><td>
<strong>Gradient compression:</strong> from simple <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1511.04561">8-bit quantization</a>
to advanced techniques such as <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1712.01887">Deep Gradient Compression</a>,
<a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1905.13727">PowerSGD</a>, <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2102.02888">1-bit Adam</a>,
and many others. As a rule of thumb, these techniques can safely reduce communication by 16-32x. More extreme compression is often
possible, but it may affect stability or final quality.
</td></tr>
<tr><td class="centered"><strong>4-24x</strong></td><td>
<strong>Parameter sharing:</strong> reusing parameters between model layers results in a model with fewer parameters,
and hence, fewer gradients to communicate. <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/1909.11942">Lan et al. (2019)</a> and
<a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/pdf/2107.11817.pdf">Xue et al. (2021)</a> propose efficient parameter sharing architectures
for NLP and computer vision.
</td></tr>
<tr><td class="centered"><strong>1.5-2x</strong></td><td>
<strong>Overlapping computation with communication:</strong> running network communication in the background while
computing the next portion of gradients. This is a <a target="_blank" rel="noopener noreferrer" href="https://ur.booksc.eu/book/1624068/2d0506">long-standing trick from HPC</a>
that was recently adapted for DL training. <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.06840">Ren et al. (2021)</a> show that
updating parameters in the background while computing the next batch of gradients does not harm convergence.
</td></tr>
</tbody>
</table>
<p>
These techniques are already more than enough to compensate for the 1000x slower communication,
which means that in practice you can pick and choose which of them you want in your training run.
For this demo, we use 8x larger batches, 4x compression, 12x parameter sharing, and partial overlapping.
If you don’t want parameter sharing, you can trade it for more advanced gradient compression or larger batches.
</p>
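<p>
For intuition, here is a minimal sketch of the simplest technique from the table, naive per-tensor 8-bit gradient quantization, written in plain PyTorch.
This is an illustrative example with made-up helper names; the actual demo relies on the compression built into hivemind rather than on this code.
</p>
<pre><code class="language-python">import torch

def quantize_8bit(grad: torch.Tensor):
    # Map the fp32 gradient onto 255 uniform levels: send int8 values plus one fp32 scale.
    scale = grad.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(grad / scale), -127, 127).to(torch.int8)
    return q, scale  # roughly 4x less traffic than sending fp32 gradients

def dequantize_8bit(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

grad = torch.randn(1024, 1024)
q, scale = quantize_8bit(grad)
restored = dequantize_8bit(q, scale)
print((grad - restored).abs().max())  # the error is bounded by about half of `scale`
</code></pre>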
</div>
</div>
</div>
</div>
<div class="accordion" id="accordionAnother" style="margin-top: 10px;">
<div class="accordion-item">
<h2 class="accordion-header" id="headingTwo">
<button class="accordion-button collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#collapseTwo" aria-expanded="false" aria-controls="collapseOne">
How to train with different device types?
</button>
</h2>
<div id="collapseTwo" class="accordion-collapse collapse" aria-labelledby="headingTwo" data-bs-parent="#accordionAnother">
<div class="accordion-body">
<p>
Most distributed DL frameworks assume that the computation is performed by a fleet of identical devices,
typically GPU servers or TPU cores. Under this assumption, each device can be assigned an equal part of
computation, such as processing a fixed batch size of training samples.
However, this quickly breaks down if workers use different device types. If one participant uses a GPU (e.g. a P100)
and another runs on a TPU-v2-8, it is difficult to find a regime where both devices are fully utilized.
</p>
<p>
To make the best use of all available devices, we let each device accumulate gradients at its own pace,
with an individually tuned batch size and other settings (e.g. gradient checkpointing or XLA).
Once the workers have collectively accumulated a predefined global batch size, they average their gradients
with weights proportional to each worker's contribution (i.e. the number of samples it processed).
</p>
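<p>
As a toy illustration of this weighting rule (this is not the actual hivemind code), suppose three hypothetical peers contribute 256, 64, and 16 samples towards one global batch:
</p>
<pre><code class="language-python">import torch

# Each peer reports (accumulated gradient, number of samples it processed).
contributions = [
    (torch.full((4,), 1.0), 256),   # a fast GPU peer
    (torch.full((4,), 2.0), 64),    # a slower GPU peer
    (torch.full((4,), 4.0), 16),    # a modest laptop
]

total_samples = sum(n for _, n in contributions)  # 336 samples in this global batch
averaged = sum(grad * (n / total_samples) for grad, n in contributions)
# Equals the mean over all 336 samples, assuming each peer reports its per-sample mean gradient.
print(averaged)
</code></pre>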
<a class="block overflow-hidden">
<div class="w-full h-40 mb-2 bg-gray-900 group-hover:bg-gray-850 rounded-lg flex items-start justify-start overflow-hidden">
<iframe src="https://www.youtube.com/embed/zdVsg5zsGdc" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" frameborder="0"
style="width: 100%; height: 240px"></iframe>
</div>
</a>
<p>
This technique allows the "swarm" to automatically adjust its behavior as peers join, leave, or fail.
For instance, if several high-performance peers join the experiment, other peers will need to process a smaller
number of samples per optimizer step, and hence, the collaboration will train faster with the same hyperparameters.
Conversely, if one of the workers fails and loses its progress (e.g. due to an fp16 overflow), the others will make
up for that by processing slightly more. For more details on how this works, please refer to the
<a target="_blank" rel="noopener noreferrer" href="https://papers.nips.cc/paper/2021/hash/41a60377ba920919939d83326ebee5a1-Abstract.html">
"Distributed Deep Learning in Open Collaborations"</a> paper or the corresponding <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/blog/collaborative-training">blog post</a>.
</p>
</div>
</div>
</div>
</div>
<h3 class="my-4">How do I join?</h3>
<p>This section will be updated <strong>on December 7</strong>.</p>
<h3 class="my-4">Practical aspects</h3>
<div class="border-bottom pb-3">
<!-- Nav tabs -->
<ul class="nav nav-tabs m-3">
<li class="nav-item">
<a class="nav-link active" data-bs-toggle="tab" href="#memory-efficiency">Memory-Efficient Training</a>
</li>
<li class="nav-item">
<a class="nav-link" data-bs-toggle="tab" href="#security">Security</a>
</li>
<li class="nav-item">
<a class="nav-link" data-bs-toggle="tab" href="#make-your-own">Make Your Own</a>
</li>
</ul>
<!-- Tab panes -->
<div class="tab-content">
<div class="tab-pane fade active show" id="memory-efficiency">
<p>
Our aim is to train a large model in a decentralized fashion on consumer hardware or low-end cloud instances.
This means we need to make the model, dataset, and other memory buffers fit onto a few GB of disk, 12-16 GB of CPU RAM,
and 8-12 GB of GPU memory. Unfortunately, this rules out many popular techniques such as
<a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.06840">ZeRO-Offload</a>:
there is simply not enough RAM for that. Instead, we must make better use of what limited memory we have.
To do this, we use two techniques: 8-bit Optimizers for GPU memory and dataset streaming for RAM & HDD.
</p>
<p>
<b>8-bit optimizers:</b>
Using optimizers such as LAMB or Adam requires four times as much GPU memory as simply storing model parameters (8 bytes vs 2 bytes)
because of additional gradient statistics.
As such, for training large models with many parameters, the optimizer state takes the largest amount of memory.
With 8-bit optimizers, this amount is reduced by 75% (2 bytes), making it much easier to fit large models onto consumer GPUs.
</p><p>
Naturally, we can combine this technique with offloading and store 8-bit optimizer states in the CPU memory rather
than in the GPU memory (0 bytes GPU, 2 bytes CPU). To perform an optimizer update, we transfer the GPU gradients
to the CPU, update the model parameters, and then copy the new weights to the GPU.
We can do this for each weight one-by-one so that the additional CPU memory required for the
optimizer update is minimal.
This combination of offloading and 8-bit optimizers means that we conserve GPU memory (0 bytes per parameter)
and also use only a limited amount of CPU memory (2 bytes per parameter).
</p>
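<p>
Below is a minimal sketch of switching a regular training setup to an 8-bit optimizer with the
<a target="_blank" rel="noopener noreferrer" href="https://github.com/facebookresearch/bitsandbytes">bitsandbytes</a> library.
The model and hyperparameters here are placeholders; the demo's actual configuration differs.
</p>
<pre><code class="language-python">import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# Drop-in replacement for torch.optim.Adam that stores its statistics in 8 bits.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, betas=(0.9, 0.95))

loss = model(torch.randn(16, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
</code></pre>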
<p>
<b>Dataset streaming:</b>
Usually, data is stored on disk and needs to be fully or partially loaded into RAM for training.
Large datasets used for pretraining measure in <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.00027">hundreds of gigabytes</a>
or even <a target="_blank" rel="noopener noreferrer" href="https://laion.ai/laion-400-open-dataset/">terabytes</a>.
This can pose a significant problem, as most desktops and cheap cloud instances simply do not have that much free space.
Furthermore, downloading the data over the Internet would take hours before one could even begin training.
</p>
<center>
<img src="./logos/stream.gif" id="stream"
style="width: 80%; max-height: 200px; max-width: 640px; z-index:1000; top:-10px; position: relative;">
</center>
<p>
To circumvent these problems, it is possible to stream the data in the same way as you stream online videos.
Participants download a small random portion of the training dataset and immediately begin training on it,
while additional data is loaded in the background. As such, we can train a model with virtually no storage
overhead from the dataset, and switching to a new dataset is as simple as changing an argument of the dataset class.
</p>
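<p>
With the <a target="_blank" rel="noopener noreferrer" href="https://github.com/huggingface/datasets">🤗 Datasets</a> library, enabling streaming is a one-argument change.
Here is a minimal sketch (the preprocessing you apply to each example will of course depend on your model):
</p>
<pre><code class="language-python">from datasets import load_dataset

# streaming=True downloads examples lazily as you iterate instead of fetching the whole dataset.
dataset = load_dataset("laion/laion_100m_vqgan_f8", split="train", streaming=True)

for example in dataset:
    # examples arrive over the network while training is already running
    print(example.keys())
    break
</code></pre>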
<h5><b>Here's our tutorial covering these methods:</b>
<a target="_blank" rel="noopener noreferrer" href="https://colab.research.google.com/gist/justheuristic/75f6a2a731f05a213a55cd2c8a458aaf/fine-tune-a-language-model-with-dataset-streaming-and-8-bit-optimizers.ipynb">
<span>
<img src="https://colab.research.google.com/assets/colab-badge.svg" width="150px">
</span>
</a></h5>
</div>
<div class="tab-pane fade" id="security">
<p>In this section, we discuss common concerns related to security of collaborative training:</p>
<p>
<b>Q: If I join a collaborative experiment, do I allow other people to execute code on my computer?</b>
</p>
<p>
<b>A:</b> During the training, participants only exchange data (gradients, statistics, model weights) and never send code to each other.
No other peer can execute arbitrary code on your computer.
</p>
<p>
To join the experiment, you typically need to run the code (implementing the model, data streaming, training loop, etc.)
from a repository or a Colab notebook provided by the authors of the experiment.
This is no different from running any other open source project/Colab notebook.
</p>
<p>
<b>Q: Can a malicious participant influence the training outcome?</b>
</p>
<p>
<b>A:</b> It is indeed possible unless we use some defense mechanisms.
For instance, a malicious participant can damage model weights by sending large numbers instead of correct gradients.
The same can happen due to broken hardware or misconfiguration.
</p>
<ul>
<li>
<p>
One possible defense is using <b>authentication</b> combined with <b>model checkpointing</b>.
In this case, participants should log in (e.g. with their Hugging Face account) to interact with the rest of the collaboration.
In turn, moderators can screen potential participants and add them to an allowlist.
If something goes wrong (e.g. a participant sends invalid gradients and the model diverges),
the moderators remove them from the list and revert the model to the latest checkpoint unaffected by the attack.
</p>
<!-- <p><b>Spoiler (TODO): How to implement authentication in a decentralized system efficiently?</b></p>-->
<p>
Nice bonus: using this data, the moderators can acknowledge the personal contribution of each participant.
</p>
</li>
<li>
<p>
Another defense is replacing the naive averaging of the peers' gradients with an <b>aggregation technique that is robust to outliers</b>.
<a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2012.10333">Karimireddy et al. (2020)</a>
suggested such a technique (named CenteredClip) and proved that it does not significantly affect the model's convergence
(a simplified sketch of the CenteredClip update appears after this list).
<!-- <p><b>Spoiler (TODO): How does CenteredClip protect from outliers? (Interactive Demo)</b></p>-->
<p>
In our case, CenteredClip is useful but not enough to protect from malicious participants,
since it implies that the CenteredClip procedure itself is performed by a trusted server.
By contrast, in our decentralized system, all participants can aggregate a part of the gradients,
and we cannot assume any of them to be trusted.
</p>
<p>
Recently, <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2106.11257">Gorbunov et al. (2021)</a>
proposed a robust aggregation protocol for decentralized systems that does not require this assumption.
This protocol uses CenteredClip as a subroutine but is able to detect and ban participants who performed it incorrectly.
</p>
</li>
</ul>
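<p>
For intuition, here is a simplified single-machine sketch of the CenteredClip update mentioned above:
the aggregate is repeatedly nudged towards each peer's gradient, but every contribution is clipped to a radius &tau;,
so a single extreme vector cannot drag the result arbitrarily far. This is only an illustration of the idea,
not the robust decentralized protocol used in practice.
</p>
<pre><code class="language-python">import torch

def centered_clip(gradients, tau=1.0, n_iters=10):
    """Aggregate per-peer gradient vectors while limiting each peer's influence to radius tau."""
    v = torch.zeros_like(gradients[0])  # in the real protocol, start from the previous aggregate
    for _ in range(n_iters):
        steps = []
        for g in gradients:
            diff = g - v
            norm = diff.norm().clamp(min=1e-8)
            steps.append(diff * torch.clamp(tau / norm, max=1.0))  # clip each contribution
        v = v + torch.stack(steps).mean(dim=0)
    return v

honest = [torch.randn(8) for _ in range(9)]
byzantine = [torch.full((8,), 1e6)]  # a malicious peer sends huge values
# The attacker can shift the aggregate by at most tau/n per iteration instead of by ~1e5:
print(centered_clip(honest + byzantine, tau=1.0))
</code></pre>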
</div>
<div class="tab-pane fade" id="make-your-own">
<p>In this section, we provide a recipe for you to run a collaborative training experiment yourself.</p>
<p>
<b>Got confused?</b> Feel free to ask any questions in our <a target="_blank" rel="noopener noreferrer" href="https://discord.gg/uGugx9zYvN">Discord</a>!
</p>
<ol>
<li class="mb-2">
Set up dataset streaming:
<ul>
<li>
<a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/docs/datasets/share_dataset.html">Upload</a> your dataset to the Hugging Face Hub
in a streaming-friendly format (<a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/datasets/laion/laion_100m_vqgan_f8">example</a>).
</li>
<li>Set up dataset streaming (see the "Memory-Efficient Training" section).</li>
</ul>
</li>
<li class="mb-2">
Write the code of training peers (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_trainer.py">example</a>):
<ul>
<li>Implement your model, set up dataset streaming, and write the training loop.</li>
<li>
Get familiar with the <a href="https://github.com/learning-at-home/hivemind">hivemind</a> library
(<a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html">quickstart</a>).
</li>
<li>
In the training loop, wrap your PyTorch optimizer with
<a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/modules/optim.html#hivemind.optim.experimental.optimizer.Optimizer">hivemind.Optimizer</a>
(<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/task.py#L121">example</a>; see also the sketch after this list).
</li>
</ul>
</li>
<li class="mb-2">
<b>(optional)</b> Write the code of auxiliary peers (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_aux_peer.py">example</a>):
<ul>
<li>
Auxiliary peers are a special kind of peer responsible for
logging experiment progress (e.g., to <a target="_blank" rel="noopener noreferrer" href="https://wandb.ai/">Weights & Biases</a>)
and uploading model checkpoints (e.g., to <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/docs/transformers/model_sharing">Hugging Face Hub</a>).
</li>
<li>
Such peers don't need to calculate gradients and may be launched on cheap machines without GPUs.
</li>
<li>
They can serve as a convenient entry point to
<a href="https://learning-at-home.readthedocs.io/en/latest/modules/dht.html">hivemind.DHT</a>
(i.e., their address can be specified as <code>initial_peers</code>).
</li>
<li>
It is useful to fix their address by providing <code>host_maddrs</code> and <code>identity_path</code>
arguments to <code>hivemind.DHT</code>
(these are forwarded to the underlying <a target="_blank" rel="noopener noreferrer" href="https://libp2p.io/">libp2p</a> daemon).
</li>
</ul>
</li>
<li class="mb-2">
<b>(optional)</b> Make it easier for other people to join:
<!-- To be discussed: What about "Make it easier and safer for other people to join" as there is the authentication bullet point?-->
<ul>
<li>
Create notebooks for free GPU providers (Google Colab, Kaggle, AWS SageMaker, etc.).
People may run them online and/or download and run them on their own hardware.
</li>
<li>
<a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/organizations/new">Create</a> a Hugging Face organization
with all resources related to the training
(dataset, model, inference demo, how-to-join walkthrough, links to a dashboard with loss and other metrics, etc.).
Look at <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/training-transformers-together">ours</a> for an example.
</li>
<li>
Set up an authentication system (see the "Security" section).
For example, you can ask people to join your organization with their Hugging Face accounts
(the website allows either sharing a link for joining or manually approving new participants).
This allows you to screen the peers,
acknowledge their contributions (e.g., make a leaderboard), and
ban accounts that behave maliciously. You can use our <a href="https://collaborative-training-auth.huggingface.co/docs">authentication system</a> or deploy your own
(our <a href="https://github.com/huggingface/collaborative-training-auth/tree/demo-neurips">server implementation</a> might be a good start).
</li>
<li>
Set up an inference demo for your model (e.g., using <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/spaces">Spaces</a>) or
a script that periodically uploads the inference results to show the training progress.
</li>
</ul>
</li>
</ol>
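<p>
To make steps 2-3 more concrete, here is a minimal sketch of wrapping a PyTorch optimizer with
<code>hivemind.Optimizer</code>, loosely following the hivemind quickstart.
The address, run name, model, and hyperparameters below are placeholders, not the values used in this demo.
</p>
<pre><code class="language-python">import torch
import hivemind

# Connect to the swarm. In practice, list the multiaddrs of your auxiliary peers here
# (they keep a fixed address via the host_maddrs / identity_path arguments mentioned above).
dht = hivemind.DHT(initial_peers=["/ip4/203.0.113.5/tcp/31337/p2p/PEER_ID_PLACEHOLDER"], start=True)

model = torch.nn.Linear(512, 512)
base_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer = hivemind.Optimizer(
    dht=dht,
    run_id="my_collaborative_run",  # all peers in one experiment must use the same run_id
    optimizer=base_optimizer,       # any regular PyTorch optimizer can be wrapped
    batch_size_per_step=32,         # samples processed per local optimizer.step() call
    target_batch_size=10000,        # global batch size that triggers a collaborative update
    use_local_updates=False,        # average gradients with other peers before each update
    matchmaking_time=3.0,
    averaging_timeout=10.0,
    verbose=True,
)

# The training loop itself stays the same as in regular PyTorch:
# loss.backward(); optimizer.step(); optimizer.zero_grad()
</code></pre>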
</div>
</div>
</div>
<h3 class="my-3">Organizers</h3>
This demonstration was created by
<a href="https://twitter.com/sasha_borzunov">Alexander Borzunov*</a>,
<a href="https://twitter.com/m_ryabinin">Max Ryabinin*</a>,
<a href="https://twitter.com/Tim_Dettmers">Tim Dettmers*</a>,
<a href="https://twitter.com/qlhoest">Quentin Lhoest*</a>,
<a href="https://twitter.com/LucileSaulnier">Lucile Saulnier*</a>,
<a href="https://twitter.com/michael_diskin">Michael Diskin</a>,
<a href="https://twitter.com/YJernite">Yacine Jernite</a>, and
<a href="https://twitter.com/Thom_Wolf">Thomas Wolf</a>.
<h3 class="my-3">Learn more</h3>
<ul class="mb-5">
<li>A NeurIPS 2021 <a href="https://arxiv.org/abs/2106.10207">paper</a> on collaborative deep learning.</li>
<li><a href="https://github.com/learning-at-home/hivemind">hivemind</a> is a PyTorch library for decentralized deep learning.</li>
<li><a href="https://github.com/huggingface/datasets">🤗 Datasets</a> allows uploading and streaming training data from the Hub.</li>
<li><a href="https://github.com/facebookresearch/bitsandbytes">bitsandbytes</a> contains implementations of 8-bit optimizers.</li>
<li>A <a href="https://arxiv.org/abs/2110.02861">paper</a> on blockwise quantization for communication-efficient training.</li>
<!-- To be discussed: How about mentioning the blog post containing videos that explain SahajBERT's collaborative training? -->
<!-- <li>A <a href="https://hf.co/blog/collaborative-training">blog post</a> (with videos) on a collaborative training of a LM with 40 volunteers .</li> -->
</ul>
<!-- Bootstrap core JavaScript
================================================== -->
<!-- Placed at the end of the document so the pages load faster -->
<script src="https://getbootstrap.com/docs/5.0/dist/js/bootstrap.min.js"></script>
</body>
</html>