|
--- |
|
|
|
title: README |
|
|
|
emoji: π |
|
|
|
colorFrom: orange |
|
|
|
colorTo: indigo |
|
|
|
sdk: static |
|
|
|
pinned: false |
|
|
|
--- |
|
|
|
|
|
<div> |
|
<img src="https://raw.githubusercontent.com/NCAI-Research/CALM/main/assets/logo.png" width="380" alt="CALM Logo" /> |
|
<p class="mb-2" style="font-size:30px;font-weight:bold"> |
|
CALM: Collaborative Arabic Language Model |
|
</p> |
|
<p class="mb-2"> |
|
The CALM project is a joint effort led by <u><a target="_blank" href="https://sdaia.gov.sa/ncai/?Lang=en">NCAI</a></u> in collaboration with

<u><a target="_blank" href="https://yandex.com/">Yandex</a></u>, <u><a href="https://huggingface.co/">Hugging Face</a></u>, and <u><a href="http://www.washington.edu/">UW</a></u> to train an Arabic language model with
|
volunteers from around the globe. The project is an adaptation of the framework proposed at the NeurIPS 2021 demonstration: |
|
<u><a target="_blank" href="https://huggingface.co/training-transformers-together">Training Transformers Together</a></u>. |
|
</p> |
|
<p class="mb-2"> |
|
One of the main obstacles facing many researchers in the Arabic NLP community is the lack of the computing resources needed to train large models. Models with

leading performance on Arabic NLP tasks, such as <u><a target="_blank" href="https://github.com/aub-mind/arabert">AraBERT</a></u>,
|
<u><a href="https://github.com/CAMeL-Lab/CAMeLBERT" target="_blank" >CamelBERT</a></u>, |
|
<u><a href="https://huggingface.co/aubmindlab/araelectra-base-generator" target="_blank" >AraELECTRA</a></u>, and |
|
<u><a href="https://huggingface.co/qarib">QARiB</a></u>, |
|
took days to train on TPUs. In the spirit of AI democratization and community enablement, core values at NCAI, CALM aims to demonstrate the effectiveness

of collaborative training and to form a community of volunteers around Arabic NLP (ANLP) researchers who have entry-level cloud GPUs and wish to train their own models collaboratively.
|
</p> |
|
<p class="mb-2"> |
|
CALM trains a single BERT model on a dataset that combines MSA, from the OSCAR corpus and Arabic Wikipedia, with dialectal data for the Gulf region drawn from existing open-source datasets.

Each volunteer GPU trains the model locally at its own pace on a portion of the dataset while another portion is streamed in the background to reduce local

memory consumption. Gradients are computed and aggregated in a distributed manner, according to the compute capabilities of each participating

volunteer. Details of the distributed training process are described in the paper
|
<u><a target="_blank" href="https://papers.nips.cc/paper/2021/hash/41a60377ba920919939d83326ebee5a1-Abstract.html">Deep Learning in Open Collaborations</a></u>. |
|
</p> |
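<p class="mb-2">
For illustration, the sketch below shows roughly what a volunteer peer does under the hood, assuming the <code>hivemind</code> library used by the Training Transformers Together framework; the peer address, run ID, model checkpoint, and batch sizes are hypothetical placeholders, not the project's actual configuration.
</p>
<pre><code># A minimal, hypothetical sketch of a volunteer peer; not CALM's exact setup.
import hivemind
import torch
from datasets import load_dataset
from transformers import AutoModelForMaskedLM

# Stream the Arabic portion of OSCAR instead of downloading it all,
# which keeps local memory and disk consumption low.
dataset = load_dataset("oscar", "unshuffled_deduplicated_ar",
                       split="train", streaming=True)

model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Join the swarm through a distributed hash table (placeholder peer address).
dht = hivemind.DHT(initial_peers=["/ip4/203.0.113.1/tcp/31337/p2p/Qm..."],
                   start=True)

# hivemind.Optimizer wraps a regular PyTorch optimizer: each peer accumulates
# gradients at its own pace, and the swarm averages them once it has jointly
# processed target_batch_size samples.
opt = hivemind.Optimizer(
    dht=dht,
    run_id="calm-demo",            # hypothetical experiment name
    batch_size_per_step=4,         # this peer's local batch size
    target_batch_size=4096,        # global batch per collaborative step
    optimizer=torch.optim.AdamW(model.parameters(), lr=1e-4),
    use_local_updates=False,
    verbose=True,
)

# Inside the training loop, each peer then behaves like ordinary PyTorch code:
# compute the masked-LM loss on a local batch, then
#     loss.backward(); opt.step(); opt.zero_grad()
# hivemind handles peer matchmaking and gradient averaging in the background.
</code></pre>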
|
|
|
<p class="mb-2" style="font-size:20px;font-weight:bold"> |
|
How to participate in training? |
|
</p> |
|
<p class="mb-2"> |
|
To join the collaborative training, all you have to do is keep a notebook running for <b>at least 15 minutes</b>; you're free to close it after that and rejoin

at another time. There are a few steps to complete before running the notebook:
|
</p> |
|
|
|
<ul class="mb-2"> |
|
<li>Create an account on <u><a target="_blank" href="https://huggingface.co">Hugging Face</a></u>.</li>
|
<li>Join the <u><a target="_blank" href="https://huggingface.co/CALM">NCAI-CALM Organization</a></u> on Hugging Face through the invitation link shared with you by email.</li>
|
<li>Get your Access Token; it's required later in the notebook.
|
</li> |
|
</ul> |
|
|
|
<p class="h2 mb-2" style="font-size:18px;font-weight:bold">How to get my Huggingface Access Token</p> |
|
<ul class="mb-2"> |
|
<li>Go to your <u><a target="_blank" href="https://huggingface.co">HF account</a></u>.</li>
|
<li>Go to Settings → Access Tokens.</li>
|
<li>Generate a new Access Token, entering any name in the "What's this token for?" field.</li>
|
<li>Select the <code>read</code> role.</li>
|
<li>Copy your Access Token.</li>
|
<li>When cell 4 of the notebook asks for your Access Token, paste it there.</li>
|
</ul> |
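<p class="mb-2">
If you'd like to sanity-check the token before launching the notebook, the snippet below is one optional way to verify it with the <code>huggingface_hub</code> client; the token string is a placeholder, and the notebook itself may perform this check differently.
</p>
<pre><code># Optional check that your Access Token is valid (placeholder token below).
from huggingface_hub import HfApi

token = "hf_..."  # paste your real token here; never share or commit it

# whoami() returns your account details, or raises an error if the token
# is invalid or lacks read access.
print(HfApi().whoami(token=token))
</code></pre>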
|
|
|
<p class="mb-2" style="font-size:20px;font-weight:bold"> |
|
Start training |
|
</p> |
|
<p class="mb-2">Pick one of the following methods to run the training code. |
|
<br /><em>NOTE: Kaggle gives you around 40 hours of GPU time per week, so it's preferred over Colab unless you have Colab Pro or Colab Pro+.</em></p>
|
<ul class="mb-2"> |
|
<li>π <span><a href="https://www.kaggle.com/prmais/volunteer-gpu-notebook"> |
|
<img style="display:inline;margin:0px" src="https://img.shields.io/badge/kaggle-Open%20in%20Kaggle-blue.svg"/> |
|
</a></span> <b> (recommended)</b> <br /> |
|
</li> |
|
<li>π <span><a href="https://colab.research.google.com/github/NCAI-Research/CALM/blob/main/notebooks/volunteer-gpu-notebook.ipynb"> |
|
<img style="display:inline;margin:0px" src="https://colab.research.google.com/assets/colab-badge.svg"/> |
|
</a></span> |
|
</li> |
|
<li>Running locally: if you have your own local GPUs, please visit our Discord channel for instructions on setting them up.
|
</li> |
|
</ul> |
|
|
|
<p class="mb-2" style="font-size:20px;font-weight:bold"> |
|
Issues or questions? |
|
</p> |
|
|
|
<p class="mb-2"> |
|
Feel free to reach out to us on <u><a target="_blank" href="https://discord.gg/peU5Nx77">Discord</a></u> if you have any questions!
|
</p> |
|
</div> |
|
|