1-800-BAD-CODE committed on
Commit 6d889b7 · 1 Parent(s): 0575d5b

Update README.md

  1. README.md +21 -0
README.md CHANGED
@@ -125,7 +125,20 @@ This model predicts the following set of "post" punctuation tokens:
  | ፧ | Ethiopic question mark | Amharic |


  # Usage


  # Training Details
@@ -143,4 +156,12 @@ This model was trained on news data, and may not perform well on conversational

  This is also a base-sized model with many languages and many tasks, so capacity may be limited.

  # Evaluation
 
  | ፧ | Ethiopic question mark | Amharic |


+ ## Pre-Punctuation Tokens
+ This model predicts the following set of "pre" punctuation tokens:
+
+ | Token | Description | Relevant Languages |
+ | ---: | :---------- | :----------- |
+ | ¿ | Inverted question mark | Spanish |
+
+
  # Usage
+ This model is released in two parts:
+
+ 1. The ONNX graph
+ 2. The SentencePiece tokenizer
+
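As a rough illustration of how the two released parts might be loaded together, here is a minimal sketch. It assumes the `onnxruntime` and `sentencepiece` Python packages; the file paths, the helper names `load_model_parts` and `describe_inputs`, and the choice of the CPU execution provider are all hypothetical and not taken from this commit. The graph's input and output names are not stated here, so the sketch only inspects them rather than running inference.

```python
from pathlib import Path


def load_model_parts(onnx_path, tokenizer_path):
    """Load the ONNX graph and the SentencePiece tokenizer.

    Both paths are hypothetical placeholders; use the files shipped with the model.
    """
    # Validate the paths first so a missing file gives a clear error.
    for path in (onnx_path, tokenizer_path):
        if not Path(path).is_file():
            raise FileNotFoundError(path)

    # Imported lazily so this module can be imported without the packages installed.
    import onnxruntime as ort
    import sentencepiece as spm

    # Assumption: CPU inference is sufficient for this sketch.
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    tokenizer = spm.SentencePieceProcessor(model_file=tokenizer_path)
    return session, tokenizer


def describe_inputs(session):
    """List the graph's input names and shapes instead of guessing them."""
    return [f"{inp.name}: {inp.shape}" for inp in session.get_inputs()]
```

Calling `describe_inputs(session)` after loading is one way to find out what the graph expects before wiring the tokenizer output into it.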


  # Training Details
 

  This is also a base-sized model with many languages and many tasks, so capacity may be limited.

+ This model also predicts punctuation only once per subword.
+ This implies that some acronyms, e.g., 'U.S.', cannot be properly punctuated.
+ This concession was accepted on two grounds:
+ 1. Such acronyms are rare, especially in the context of multilingual models.
+ 2. Punctuated acronyms are typically pronounced as individual characters, e.g., 'U.S.' vs. 'NATO'.
+ Since the expected use case of this model is the output of an ASR system, it is presumed that such
+ pronunciations would be transcribed as separate tokens, e.g., 'u s' vs. 'us' (though this depends on the model's pre-processing).
+
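The one-prediction-per-subword limitation can be sketched in a few lines. This is an illustration only, not the model's decoding code: the function name `apply_post_punctuation`, the example subwords, and the predictions are hypothetical, and the `▁` word-boundary marker is the usual SentencePiece convention.

```python
def apply_post_punctuation(subwords, predictions):
    """Append at most one predicted punctuation token after each subword.

    `predictions` holds one entry per subword: a punctuation string or None.
    """
    if len(subwords) != len(predictions):
        raise ValueError("need exactly one prediction per subword")
    pieces = [subword + (punct or "") for subword, punct in zip(subwords, predictions)]
    # "\u2581" (▁) marks the start of a word in SentencePiece-style output.
    return "".join(pieces).replace("\u2581", " ").strip()


# 'us' as a single subword has only one slot, so 'u.s.' is unreachable.
print(apply_post_punctuation(["\u2581the", "\u2581us", "\u2581economy"], [None, ".", None]))
# → the us. economy

# Transcribed as separate tokens, each letter gets its own slot.
print(apply_post_punctuation(["\u2581u", "\u2581s"], [".", "."]))
# → u. s.
```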
  # Evaluation