Commit ed79b0f (parent: 9f4c911) by rahular: Update README.md

Files changed (1): README.md (+4 -11)
 
{}
---

# Varta-BERT

<!-- Provide a quick summary of what the model is/does. -->

The dataset and the model are introduced in [this paper](https://arxiv.org/abs/2305.05858).
## Uses

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation, you should look at our [Varta-T5](https://huggingface.co/rahular/varta-t5) model.
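
As a minimal sketch of that fine-tuning route (the sequence-classification head, `num_labels`, and training setup below are illustrative assumptions, not part of this card):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Reuse the pretrained encoder and attach a randomly initialized
# classification head; num_labels=2 is a placeholder for your task.
tokenizer = AutoTokenizer.from_pretrained("rahular/varta-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "rahular/varta-bert", num_labels=2
)

# From here, fine-tune on your labeled data, e.g. with transformers.Trainer.
```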
 
## Bias, Risks, and Limitations

This work is mainly dedicated to the curation of a new multilingual dataset for Indic languages, many of which are low-resource. During data collection, we faced several limitations that can potentially result in ethical concerns. Some of the important ones are mentioned below:

- Our dataset contains only those articles written by DailyHunt's partner publishers. This can introduce a bias towards a particular narrative or ideology, which can affect the representativeness and diversity of the dataset.
 
## How to Get Started with the Model

You can use this model directly for masked language modeling.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the pretrained masked-LM checkpoint
tokenizer = AutoTokenizer.from_pretrained("rahular/varta-bert")
model = AutoModelForMaskedLM.from_pretrained("rahular/varta-bert")
```
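
For quick experimentation, the checkpoint can also be wrapped in a fill-mask pipeline (a minimal sketch; the Hindi example sentence is ours, not from the card):

```python
from transformers import pipeline

# Build a fill-mask pipeline around the same checkpoint
fill_mask = pipeline("fill-mask", model="rahular/varta-bert")

# Mask one token in a Hindi sentence ("The capital of India is [MASK].")
# and print the model's top guesses with their scores
for prediction in fill_mask("भारत की राजधानी [MASK] है।"):
    print(prediction["token_str"], round(prediction["score"], 3))
```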
 
The model is pretrained on the Varta dataset. With 34.5 million non-English article-headline pairs, it is the largest document-level dataset of its kind.

- We train the model for a total of 1M steps, which takes 10 days to finish.
- We use an effective batch size of 4096 and train the model on TPU v3-128 chips.

Since data sizes across languages in Varta vary from 1.5K articles (Bhojpuri) to 14.4M articles (Hindi), we use standard temperature-based sampling to upsample data when necessary.
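
Standard temperature-based sampling flattens the language distribution roughly as in the sketch below (the temperature value is an assumption for illustration, not the one used for Varta-BERT):

```python
# Sketch of temperature-based language sampling for multilingual pretraining.
article_counts = {"hi": 14_400_000, "bho": 1_500}  # two of Varta's languages

def sampling_probs(counts: dict[str, int], temperature: float = 3.0) -> dict[str, float]:
    total = sum(counts.values())
    # Raw proportions q_i are exponentiated by 1/T, which shrinks the gap
    # between high- and low-resource languages before re-normalizing.
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: p / norm for lang, p in scaled.items()}

# Bhojpuri's share rises from ~0.01% of articles to ~4.5% of sampled batches
print(sampling_probs(article_counts))
```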
### Evaluation Results

Please see [the paper](https://arxiv.org/pdf/2305.05858.pdf).

## Citation

```
@misc{aralikatte2023varta,
    title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages},
    author={Rahul Aralikatte and Ziling Cheng and Sumanth Doddapaneni and Jackie Chi Kit Cheung},
    year={2023},
    eprint={2305.05858},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```