1-800-BAD-CODE
committed on
Commit
·
6d889b7
1
Parent(s):
0575d5b
Update README.md
README.md
CHANGED
@@ -125,7 +125,20 @@ This model predicts the following set of "post" punctuation tokens:
 | ፧ | Ethiopic question mark | Amharic |


+## Pre-Punctuation Tokens
+This model predicts the following set of "pre" punctuation tokens:
+
+| Token | Description | Relevant Languages |
+| ---: | :---------- | :----------- |
+| ¿ | Inverted question mark | Spanish |
+
+
 # Usage
+This model is released in two parts:
+
+1. The ONNX graph
+2. The SentencePiece tokenizer
+


 # Training Details
@@ -143,4 +156,12 @@ This model was trained on news data, and may not perform well on conversational

 This is also a base-sized model with many languages and many tasks, so capacity may be limited.

+This model also predicts punctuation only once per subword.
+This implies that some acronyms, e.g., 'U.S.', cannot be properly punctuated.
+This concession was accepted on two grounds:
+1. Such acronyms are rare, especially in the context of multilingual models.
+2. Punctuated acronyms are typically pronounced as individual characters, e.g., 'U.S.' vs. 'NATO'.
+Since the expected use case of this model is the output of an ASR system, it is presumed that such
+pronunciations would be transcribed as separate tokens, e.g., 'u s' vs. 'us' (though this depends on the model's pre-processing).
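As a rough illustration of the limitation described above, the following Python sketch attaches at most one predicted punctuation token after each subword. The function name `apply_post_punctuation` and the example predictions are hypothetical and are not part of the model's released code; the real tokenizer and model outputs will differ.

```python
from typing import List, Optional


def apply_post_punctuation(
    subwords: List[str], predictions: List[Optional[str]]
) -> List[str]:
    """Append at most one predicted punctuation token after each subword.

    Hypothetical helper for illustration only; it is not the model's API.
    """
    if len(subwords) != len(predictions):
        raise ValueError("expected exactly one prediction per subword")
    return [subword + (punct or "") for subword, punct in zip(subwords, predictions)]


# 'us' as a single subword has only one slot, so 'u.s.' is unreachable.
single = apply_post_punctuation(["us"], ["."])

# 'u s' as two tokens has one slot per letter, so each can take a period.
separate = apply_post_punctuation(["u", "s"], [".", "."])

print(single)    # ['us.']
print(separate)  # ['u.', 's.']
```

With one slot per subword, the first call can never place a period between 'u' and 's'. The second call can, which is why the transcription style of the upstream ASR system matters.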
+
 # Evaluation