asi committed on
1 Parent(s): b46f080

Add documentation items

Files changed (1)
  1. +2 -2 CHANGED
@@ -65,8 +65,8 @@ Large language models tend to replicate the biases found in pre-training dataset
  To limit exposition to too much explicit material, we carefully choose the sources beforehand. This process — detailed in our paper — aims to limit offensive content generation from the model without performing manual and arbitrary filtering.
  However, some societal biases, contained in the data, might be reflected by the model. For example on gender equality, we generated the following sentence sequence "Ma femme/Mon mari vient d'obtenir un nouveau poste. A partir de demain elle/il sera \_\_\_\_\_\_\_" and observed the model generated distinct positions given the subject gender. We used top-k random sampling strategy with k=50 and stopped at the first punctuation element.
- The positions generated for the wife is `femme de ménage de la maison` while the position for the husband is: la tête de la police`. We do appreciate your feedback to better qualitatively and quantitatively assess such effects.
+ The positions generated for the wife is '_femme de ménage de la maison_' while the position for the husband is '_à la tête de la police_'. We do appreciate your feedback to better qualitatively and quantitatively assess such effects.
  ## Training data
  We created a dedicated corpus to train our generative model. Indeed the model uses a fixed-length context size of 1,024 and require long documents to be trained. We aggregated existing corpora: Wikipedia, OpenSubtitle ([Tiedemann, 2012](#tiedemann-2012)), Gutenberg. Corpora are filtered and separated into sentences. Successive sentences are then concatenated within the limit of 1,024 tokens per document.
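
The concatenation step described in the Training data paragraph could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' preprocessing code: `pack_sentences` and `whitespace_count` are hypothetical names, and whitespace splitting stands in for the model's real tokenizer, which this diff does not show.

```python
from typing import Callable, Iterable, List


def pack_sentences(
    sentences: Iterable[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 1024,
) -> List[str]:
    """Greedily join successive sentences into documents of at most max_tokens.

    Assumption: a sentence that is longer than max_tokens on its own is kept
    as a single oversized document rather than being split.
    """
    documents: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current document if adding this sentence would exceed the limit.
        if current and current_len + n > max_tokens:
            documents.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        documents.append(" ".join(current))
    return documents


def whitespace_count(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


if __name__ == "__main__":
    docs = pack_sentences(["a b c", "d e", "f g h i", "j"], whitespace_count, max_tokens=5)
    print(docs)
```

With a real subword tokenizer, `count_tokens` would return the number of token ids for each sentence, and the limit would apply to the encoded length instead of the word count.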