1-800-BAD-CODE
committed on
Commit
651e333
Parent(s):
6d889b7
Update README.md
README.md
CHANGED
@@ -57,6 +57,9 @@ language:
|
|
57 |
# Model Overview
|
58 |
This model accepts as input lower-cased, unpunctuated, unsegmented text in 47 languages and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).
|
59 |
|
|
|
|
|
|
|
60 |
# Model Details
|
61 |
|
62 |
This model generally follows the graph shown below, with brief descriptions for each step following.
|
@@ -87,14 +90,14 @@ In practice, this means the inverted question mark for Spanish and Asturian, `¿
|
|
87 |
Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
|
88 |
|
89 |
5. **Sentence boundary detection**
|
90 |
-
Parallel to the "pre" punctuation, another classification predicts from the re-encoded text
|
91 |
In all languages, sentence boundaries can occur only if a potential full stop is predicted, hence the conditioning.
|
92 |
|
93 |
6. **Shift and concat sentence boundaries**
|
94 |
In many languages, the first character of each sentence should be upper-cased.
|
95 |
-
Thus, we
|
96 |
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
|
97 |
-
Therefore, we shift right by one
|
98 |
Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
|
99 |
|
100 |
7. **True-case prediction**
|
@@ -151,12 +154,12 @@ This model was trained with News Crawl data from WMT.
|
|
151 |
|
152 |
Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
|
153 |
|
154 |
-
#
|
155 |
This model was trained on news data, and may not perform well on conversational or informal data.
|
156 |
|
157 |
This is also a base-sized model with many languages and many tasks, so capacity may be limited.
|
158 |
|
159 |
-
This model
|
160 |
This implies that some acronyms, e.g., 'U.S.', cannot be properly punctuated.
|
161 |
This concession was accepted on two grounds:
|
162 |
1. Such acronyms are rare, especially in the context of multi-lingual models
|
|
|
57 |
# Model Overview
|
58 |
This model accepts as input lower-cased, unpunctuated, unsegmented text in 47 languages and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).
|
59 |
|
60 |
+
All languages are processed with the same algorithm with no need for language tags or language-specific branches in the graph.
|
61 |
+
This covers both continuous-script and non-continuous-script languages, as well as language-specific punctuation.
|
62 |
+
|
63 |
# Model Details
|
64 |
|
65 |
This model generally follows the graph shown below, with brief descriptions for each step following.
|
|
|
90 |
Note that a `¿` can only appear if a `?` is predicted, hence the conditioning.
|
91 |
|
92 |
5. **Sentence boundary detection**
|
93 |
+
Parallel to the "pre" punctuation, another classification network predicts sentence boundaries from the re-encoded text.
|
94 |
In all languages, sentence boundaries can occur only if a potential full stop is predicted, hence the conditioning.
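
The constraint stated above can be illustrated as a decode-time check. This is only a sketch of the rule, not the model's graph: the model conditions the SBD head on the punctuation prediction, whereas the hypothetical helper `constrain_boundaries` below applies a hard mask, and the `FULL_STOPS` set is an assumed example rather than the model's actual label set.

```python
# Assumed example set of full-stop tokens; the model's real label set may differ.
FULL_STOPS = {".", "?", "!", "。"}


def constrain_boundaries(punct_preds, boundary_preds):
    """Keep a predicted sentence boundary only where a full stop was also predicted."""
    return [
        bool(boundary) and punct in FULL_STOPS
        for punct, boundary in zip(punct_preds, boundary_preds)
    ]


# One entry per token: the punctuation predicted after it ("" means none)
# and the raw, unconstrained boundary decision.
punct = ["", ",", ".", "", "?"]
raw_boundaries = [False, True, True, False, True]
print(constrain_boundaries(punct, raw_boundaries))
```

The boundary proposed after the comma is dropped, while those after `.` and `?` survive.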
|
95 |
|
96 |
6. **Shift and concat sentence boundaries**
|
97 |
In many languages, the first character of each sentence should be upper-cased.
|
98 |
+
Thus, we should feed the sentence boundary information to the true-case classification network.
|
99 |
Since the true-case classification network is feed-forward and has no context, each time step must embed whether it is the first word of a sentence.
|
100 |
+
Therefore, we shift the binary sentence boundary decisions to the right by one: if token `N-1` is a sentence boundary, token `N` is the first word of a sentence.
|
101 |
Concatenating this with the re-encoded text, each time step contains whether it is the first word of a sentence as predicted by the SBD head.
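
The shift-and-concatenate step can be sketched in plain Python as follows. The helper names `shift_right` and `concat_flags`, the toy two-dimensional encodings, and the choice to mark token 0 as a sentence start are all assumptions made for illustration; the model performs the equivalent operation on tensors inside its graph.

```python
def shift_right(boundaries, first=1):
    """Shift binary boundary decisions right by one position.

    If token N-1 is a sentence boundary, token N is the first word of a sentence.
    `first` is an assumed flag for token 0, treated here as a sentence start.
    """
    if not boundaries:
        return []
    return [first] + list(boundaries[:-1])


def concat_flags(encoded, flags):
    """Append each token's sentence-start flag to its encoding vector."""
    return [list(vec) + [float(flag)] for vec, flag in zip(encoded, flags)]


# 1 marks a token that ends a sentence, as predicted by the SBD head.
boundaries = [0, 0, 1, 0, 1]
starts = shift_right(boundaries)

# Toy re-encoded text: one small vector per token.
encoded = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
features = concat_flags(encoded, starts)

print(starts)
print(features[3])
```

Token 3 follows the boundary at token 2, so its feature vector carries a sentence-start flag that a context-free, feed-forward true-case head can read directly.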
|
102 |
|
103 |
7. **True-case prediction**
|
|
|
154 |
|
155 |
Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data as judged by the author.
|
156 |
|
157 |
+
# Limitations
|
158 |
This model was trained on news data, and may not perform well on conversational or informal data.
|
159 |
|
160 |
This is also a base-sized model with many languages and many tasks, so capacity may be limited.
|
161 |
|
162 |
+
This model predicts punctuation only once per subword.
|
163 |
This implies that some acronyms, e.g., 'U.S.', cannot be properly punctuated.
|
164 |
This concession was accepted on two grounds:
|
165 |
1. Such acronyms are rare, especially in the context of multi-lingual models
|