1-800-BAD-CODE committed on
Commit 1f0729d
1 Parent(s): 044610d

Update README.md

Files changed (1)
  1. README.md +7 -7
README.md CHANGED
@@ -29,7 +29,7 @@ for text in the 6 most popular Romance languages:

  Together, these languages cover approximately 97% of native speakers of the Romance language family.

- This model predicts the following punctuation tokens:
+ This model predicts the following punctuation per input subtoken:

  * .
  * ,
@@ -44,6 +44,12 @@ The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.

  The easy way to run inference is to use the `punctuators` package:

+ # Training Data
+ For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
+
+ Catalan is not included in StatMT's News Crawl.
+ For completeness of the Romance language family, ~500k lines of `OpenSubtitles` was used for Catalan.
+
  # Training Parameters
  This model was trained by concatenating between 1 and 14 random sentences.
  The concatenation points became sentence boundary targets,
@@ -60,10 +66,4 @@ This is accomplished behind the scenes by splitting the input into overlapping s

  If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.

- # Training Data
- For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
-
- Catalan is not included in StatMT's News Crawl.
- For completeness of the Romance language family, ~500k lines of `OpenSubtitles` was used for Catalan.
-
  # Metrics