Commit ed79b0f (parent: 9f4c911) by rahular: Update README.md

Files changed (1): README.md (+4 -11)
 
{}
---

# Varta-BERT

<!-- Provide a quick summary of what the model is/does. -->

The dataset and the model are introduced in [this paper](https://arxiv.org/abs/2305.05858).
## Uses

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation, you should look at our [Varta-T5](https://huggingface.co/rahular/varta-t5) model.
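
As a minimal sketch of that fine-tuning route (the sequence-classification head, `num_labels`, and training setup below are illustrative assumptions, not part of this card):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reuse the pretrained encoder and attach a randomly initialized
# classification head; num_labels=2 is a placeholder for your task.
tokenizer = AutoTokenizer.from_pretrained("rahular/varta-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "rahular/varta-bert", num_labels=2
)

# From here, fine-tune on your labeled data, e.g. with transformers.Trainer.
```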
 
## Bias, Risks, and Limitations

This work is mainly dedicated to the curation of a new multilingual dataset for Indic languages, many of which are low-resource. During data collection, we faced several limitations that can potentially result in ethical concerns. Some of the important ones are mentioned below:

- Our dataset contains only those articles written by DailyHunt's partner publishers. This can introduce a bias towards a particular narrative or ideology, which can affect the representativeness and diversity of the dataset.
 
## How to Get Started with the Model

You can use this model directly for masked language modeling.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the pretrained masked-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained("rahular/varta-bert")
model = AutoModelForMaskedLM.from_pretrained("rahular/varta-bert")
```
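
For quick experimentation, the checkpoint can also be wrapped in a fill-mask pipeline (a minimal sketch; the Hindi example sentence is ours, not from the card):

```python
from transformers import pipeline

# Build a fill-mask pipeline around the same checkpoint
fill_mask = pipeline("fill-mask", model="rahular/varta-bert")

# Mask one token in a Hindi sentence ("The capital of India is [MASK].")
# and print the model's top guesses with their scores
for prediction in fill_mask("भारत की राजधानी [MASK] है।"):
    print(prediction["token_str"], round(prediction["score"], 3))
```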
 
The model is pretrained on the Varta dataset. With 34.5 million non-English article-headline pairs, it is the largest document-level dataset of its kind.

- We train the model for a total of 1M steps, which takes 10 days to finish.
- We use an effective batch size of 4096 and train the model on TPU v3-128 chips.

Since data sizes across languages in Varta vary from 1.5K articles (Bhojpuri) to 14.4M articles (Hindi), we use standard temperature-based sampling to upsample data when necessary.
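
Standard temperature-based sampling flattens the language distribution roughly as in the sketch below (the temperature value is an assumption for illustration, not the one used for Varta-BERT):

```python
# Sketch of temperature-based language sampling for multilingual pretraining.
article_counts = {"hi": 14_400_000, "bho": 1_500}  # two of Varta's languages

def sampling_probs(counts: dict[str, int], temperature: float = 3.0) -> dict[str, float]:
    total = sum(counts.values())
    # Raw proportions q_i are exponentiated by 1/T, which shrinks the gap
    # between high- and low-resource languages before re-normalizing.
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: p / norm for lang, p in scaled.items()}

# Bhojpuri's share rises from ~0.01% of articles to ~4.5% of sampled batches
print(sampling_probs(article_counts))
```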
### Evaluation Results

Please see [the paper](https://arxiv.org/pdf/2305.05858.pdf).

## Citation

```
@misc{aralikatte2023varta,
    title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages},
    author={Rahul Aralikatte and Ziling Cheng and Sumanth Doddapaneni and Jackie Chi Kit Cheung},
    year={2023},
    eprint={2305.05858},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```