1-800-BAD-CODE
committed on
Commit: 1f0729d
Parent(s): 044610d
Update README.md
README.md CHANGED
@@ -29,7 +29,7 @@ for text in the 6 most popular Romance languages:
 
 Together, these languages cover approximately 97% of native speakers of the Romance language family.
 
-This model predicts the following punctuation
+This model predicts the following punctuation per input subtoken:
 
 * .
 * ,
@@ -44,6 +44,12 @@ The model is released as a `SentencePiece` tokenizer and an `ONNX` graph.
 
 The easy way to run inference is to use the `punctuators` package:
 
+# Training Data
+For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
+
+Catalan is not included in StatMT's News Crawl.
+For completeness of the Romance language family, ~500k lines of `OpenSubtitles` was used for Catalan.
+
 # Training Parameters
 This model was trained by concatenating between 1 and 14 random sentences.
 The concatenation points became sentence boundary targets,
@@ -60,10 +66,4 @@ This is accomplished behind the scenes by splitting the input into overlapping s
 
 If you use the raw ONNX graph, note that while the model will accept sequences up to 512 tokens, only 256 positional embeddings have been trained.
 
-# Training Data
-For all languages except Catalan, this model was trained with ~10M lines of text per language from StatMT's [News Crawl](https://data.statmt.org/news-crawl/).
-
-Catalan is not included in StatMT's News Crawl.
-For completeness of the Romance language family, ~500k lines of `OpenSubtitles` was used for Catalan.
-
 # Metrics
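
The README's raw-ONNX note (the graph accepts up to 512 tokens, but only 256 positional embeddings were trained) suggests keeping each model input at or below 256 tokens. The following is a minimal sketch of one way to do that, by cutting a token-ID sequence into windows that share a few tokens with their neighbours. It is not the `punctuators` implementation: the function name `split_into_windows` and the default `overlap` of 16 are hypothetical choices for illustration, and only the 256 limit comes from the README.

```python
from typing import List

# From the README: only 256 positional embeddings have been trained.
MAX_TRAINED_POSITIONS = 256


def split_into_windows(
    token_ids: List[int],
    max_len: int = MAX_TRAINED_POSITIONS,
    overlap: int = 16,  # hypothetical default, not taken from the README
) -> List[List[int]]:
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, so every token away from
    the ends of the input is seen with context on both sides in at least
    one window.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    # Short inputs fit in a single window.
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    stride = max_len - overlap
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        # Stop once the current window reaches the end of the input.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows
```

Each window could then be passed to the ONNX graph separately. How the per-window predictions in the shared regions are merged is left open here, since the README excerpt in this diff does not describe it.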