1-800-BAD-CODE/punctuation_fullstop_truecase_romance

Text2Text Generation


1-800-BAD-CODE committed on Mar 24, 2023

Commit 1f0729d (1 parent: 044610d)

Update README.md

Files changed (1)

README.md +7 -7


@@ -29,7 +29,7 @@ for text in the 6 most popular Romance languages:
 Together, these languages cover approximately 97% of native speakers of the Romance language family.
-This model predicts the following punctuation tokens:
+This model predicts the following punctuation per input subtoken:
 * .
 * ,
@@ -44,6 +44,12 @@ The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.
 The easy way to run inference is to use the `punctuators` package:
+# Training Data
+For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
+Catalan is not included in StatMT's News Crawl.
+For completeness of the Romance language family, ~500k lines of `OpenSubtitles` was used for Catalan.
 # Training Parameters
 This model was trained by concatenating between 1 and 14 random sentences.
 The concatenation points became sentence boundary targets,
@@ -60,10 +66,4 @@ This is accomplished behind the scenes by splitting the input into overlapping s
 If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
-# Training Data
-For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
-Catalan is not included in StatMT's News Crawl.
-For completeness of the Romance language family, ~500k lines of `OpenSubtitles` was used for Catalan.
 # Metrics
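
Reviewer note on the raw-ONNX caveat in the README text above: since only 256 positional embeddings were trained, a caller feeding the graph directly would probably want to keep each input at or below 256 tokens. The last hunk header mentions an overlapping split (its text is truncated in this view); the sketch below shows one way such a split could look. It is not the `punctuators` implementation: the function name `split_overlapping` and the overlap of 32 tokens are assumptions chosen for illustration, and only the 256-token limit comes from the README.

```python
from typing import List


def split_overlapping(
    token_ids: List[int],
    max_len: int = 256,   # from the README: only 256 positional embeddings are trained
    overlap: int = 32,    # assumption: illustrative value, not taken from the model card
) -> List[List[int]]:
    """Split a token sequence into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens, so every token has some
    context on at least one side. Hypothetical helper, not a library API.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return []

    stride = max_len - overlap
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


# Usage: a 600-token input becomes three windows, none longer than 256 tokens.
example_windows = split_overlapping(list(range(600)))
```

How the per-window predictions are merged back together is left open here; the README text visible in this diff does not describe it.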