cointegrated committed
Commit 0e63a90 · 1 Parent(s): fe0779d
fix the hard and soft signs
Files changed:
- examples/zontik_lat.txt +85 -0
- myv_translit.py +40 -8
- test_translit.py +43 -2
examples/zontik_lat.txt
ADDED
@@ -0,0 +1,85 @@
+Zontik dy Panamka
+
+Ěräś-ašteś Zontik. Lijatnede a javovicä, viškinestě purnavicä. Tüsozo sěneĺ. Ěrämosonzo vesemeś uĺneś vadrä, vejkede baška. Son lisneś uĺcäv ansäk pizeme škane, sekskak arseś, buto perť-peĺksěś sval načko, menelenť tüsozojak valdoks a ěrsi.
+
+«Vaj, kodamo nusmanä ěrämom? – kortaś ěś pačkanzo, kirmicävoź a pokške süvorksoks. – Vaj, kodaška tošnačim!»
+
+– Mezť seke tev stakasto uksnät? – vesť kevkstize Zontikenť syre Šläpaś, konaś ašteś teke žo škapsonť, ansäk sede vere lavcänť langso.
+
+– Mezes, toń kojsě, kecnems? – tago stakasto ukstaś Zontikeś. – Perťpeĺga ansäk pizemeť valyť… Načko…
+
+– Vaj, pajstomne! – sedejmaräś tenzě Šläpaś. – Alkukskak, ton vadrä pogodajak ěziť nekšne! Ušov ějsěť livtniť ansäk pizeme alga. Pajstomo, koda melävtan kisěť!
+
+– Tošna teń! – tago ukstaś Zontikeś.
+
+Šläpaś zäryja ška arseś, mejle meri:
+
+– Varčasa lezdams teť. Kunsolok, meze merän! Nesak se sumkanť? Sy nedläčistě ton, a neemga, keverť ějzěnzě. Čaŕkodik?
+
+– Čaŕkodija, – merś Zontikeś. – Mejle meze?
+
+– Mejle? – mizoldoź merś Šläpaś. – Mejle tonś nesak…
+
+Syre Šläpanť prevsputomazo Zontikenteń ovse ěź čaŕkodeve, ansäk nedläčistě tejś istä, koda tenzě merś Šläpaś. Apak marä, apak nee, son keverś valdo sěń pokš sumkanteń, konań bokaso uĺneś risovaź troks putoź matrosoń kartuzso kenärksov cörabaeń čačo.
+
+Sumkaś vasnä čavoĺ. Mejle ějzěnzě neŕgsť ěŕva mezť, dy kov-buti saiź. Zontikeś pongś sehte alov. Sonzě langs praś kodamo-buti ěčke kniga. Knigaś apak lotkse motordś, bažaś velävtoms dy muems ěstenzě jon tarka, ansäk te teveś tenzě kodajak ěź teeve. Zontikenť vaksso madeź ašteś keĺme limonadto peštäź sulika. Sulikaś kežejstě buĺkaeś dy melezěĺ valoms vese vedenzě lijatneń langs.
+
+Pupicä sursemeś peznavtynze peenzě Zontikenť prä kočoms dy istä meĺsparoso čikordś ějsěst, natoj kavto pejtneń sindinze.
+
+Okojniki kuvaka ki langoś prädovś. Zontikeś čaŕkodize, sumkanť kozoń-buti putyź. Mejle panžiź, meze-buti ějstěnzě sajsť dy pekstyź ods.
+
+Perť-peĺga kaštmolezevś. Zontikenť turtov te uĺneś sedejak beräń. Sumkaś mezde-buti karmaś ěžeme. Zontikenteń a mejsě karmaś leksems. Apak učne sumkaś tago panžovś dy te škane Zontikeś praś ušov. Son ovse ěź čaŕkode, meks langozonzo vese teťkiź seĺmest, prok tamaša langs, teke neiť vasencede? Pejdsť-tejsť, dy panžiź…
+
+Zontikeś divamodonť vajkstaś dy ěĺ ěź pra mastorov. Son istämo valdo či onstonzojak ěź nekšne. Perťpeĺga vesemeś uĺneś pek mazyj!
+
+Valdo-sěń meneĺ, psi čipaj, sěń morä, ožo čuvar dy sädot istät žo zontikt, kodamo son, ansäk sörmavt dy ěŕva kodamo tüsoń.
+
+Tede vesemedenť Zontikenť karmaś čaramo präzo. Čaraś, čaraś, čaraś…
+
+Vana alamodo saś jožos, varštaś tej-tov dy avoĺ pek vasolo, merema, ovse vakssonzo, neś valdo sörmav šläpine. Kodamo-buti lija jonov a molicä, peŕkanzo sülmseź lamo surepulot. Istämo šläpa Zontikeś tede ikele ěź nekšne.
+
+– Ton kijat? – kevkstize Zontikeś šläpinenť.
+
+– Mon? – divazevś šläpineś. – Panamka.
+
+– Pa-nam-ka… – slogoń slog jovtyze Zontikeś, dy ěź soda, mezde kortams sede tov.
+
+– A mon toń sodatan. Ton – Zontik! – merś Panamkaś. – Ansäk a čaŕkodän, nej žo pizeme araś? A toń kondämo zontiktne pläžov a jakiť.
+
+– Mon… Mon… – Zontikeś ěź soda meze merems, sekskak kelezě pongoneś. – Mon syń tonť langs varštamo! – okojniki muś valt Zontikeś.
+
+– Moń langs? – divazevś Panamkaś. – Mezeks te teť ěrävi?
+
+– Veĺť ton mazyjat! – merś Zontikeś.
+
+– Gm, – mizoldozevś Panamkaś dy merś vizdeź.
+
+– Tongak moń meĺs tuiť.
+
+– Žaĺ, – stakasto ukstaś Zontikeś.
+
+– Meześ žaĺ? – kevkstize Panamkaś.
+
+– Kurok čokšne dy savi minenek javovoms.
+
+– Alkukskak, – istä žo ukstaś Panamkaś. – Ton ěšte sat tej?
+
+– A sodan, – merś Zontikeś. – Mon žo avoĺ pläžga jakicän. Pizeme araś.
+
+– Mejle meze?
+
+Te škastonť Zontikenť purnyź dy son tago pongś sodaviks sumkanteń. Ansäk nej sehte vereĺ. Sudozojak tarnoś ušoso, seĺmenzějak nesť meze teevi peŕkanzo.
+
+Jalateke jožozo beräneĺ. Son arseś mazyj šläpinedenť — Panamkadonť. Čumondś ěś pränzo, meks ěź kenere tenzě merems «Vastomazonok!».
+
+Te škastonť Zontikeś varštaś viť jonov dy… Istäška kenärks! Ovse malaso kodaź kepternesě ašteś son – Panamkaś!
+
+– Škaś a menstäma! – arsezevś Zontikeś. Dy ězize menstä. Velävtś, venstize Panamkanteń ěsenzě keme kedenzě dy merś:
+
+– Kundak sede kuroksto!
+
+Panamkaś vasnä tandadś, mejle saś ežos, peleź kundaś Zontikenť keďs dy sornoź, jutaś Zontikenť jonov…
+
+…Se škastonť saeź škapsonť syń kavonest. Nej Zontikeś ovse a tošnakali pevteme pizemetneń ějstě. Son sody, kudoso uči ějsěnzě Panamkaś, konaś pek moli čipaenť jonov. Panamkanť neeź, Zontikenť seske lembelgady jožozo dy kenärdozevi sedeezě.
+
+Vere lavcänť langso syre Šläpaś vany langozost dy kaštmoli, ansäk koj-zärdo kuvakasto ukstni, paräk, ledstni seń, meześ uš umok jutaś.
myv_translit.py
CHANGED
@@ -1,7 +1,27 @@
 import re
 from collections import Counter
 
+
+CYR_CHARS = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'
+BASIC_LAT_CHARS = 'abcdefghijklmnopqrtuvwxyz'
+ACCENT_LAT_CHARS = 'ěäüöśźćńŕťďĺ'
+LAT_CHARS = BASIC_LAT_CHARS + ACCENT_LAT_CHARS
+
+CYR_CONSONANTS_LOWER_HARD = 'бвгдзклмнпрстфх'
+CYR_CONSONANTS_UPPER_HARD = 'БВГДЗКЛМНПРСТФХ'
+CYR_CONSONANTS_HARD = CYR_CONSONANTS_LOWER_HARD + CYR_CONSONANTS_UPPER_HARD
+
+
 _cyr2lat = [
+    {'find_what': 'ъё', 'replacer': 'jo', 're': False},
+    {'find_what': 'ъю', 'replacer': 'ju', 're': False},
+    {'find_what': 'ъя', 'replacer': 'ja', 're': False},
+    {'find_what': 'ъе', 'replacer': 'je', 're': False},
+    {'find_what': 'ЪЁ', 'replacer': 'JO', 're': False},
+    {'find_what': 'ЪЮ', 'replacer': 'JU', 're': False},
+    {'find_what': 'ЪЯ', 'replacer': 'JA', 're': False},
+    {'find_what': 'ЪЕ', 'replacer': 'JE', 're': False},
+
     {'find_what': 'А', 'replacer': 'A', 're': False},
     {'find_what': 'а', 'replacer': 'a', 're': False},
     {'find_what': 'О', 'replacer': 'O', 're': False},
@@ -231,6 +251,18 @@ _lat2cyr = [
     {'find_what': 'O', 'replacer': 'О', 're': False},
     {'find_what': 'a', 'replacer': 'а', 're': False},
     {'find_what': 'A', 'replacer': 'А', 're': False},
+    # make the soft sign upper in an uppercase word
+    {'find_what': '([А-ЯЁ]{2,})ь', 'replacer': '\\1Ь', 're': True},
+    {'find_what': '([А-ЯЁ])ь([А-ЯЁ])', 'replacer': '\\1Ь\\2', 're': True},
+    # introduce Ъ when appropriate
+    {'find_what': f'([{CYR_CONSONANTS_HARD}])Й[Аа]', 'replacer': '\\1ЪЯ', 're': True},
+    {'find_what': f'([{CYR_CONSONANTS_HARD}])й[Аа]', 'replacer': '\\1ъя', 're': True},
+    {'find_what': f'([{CYR_CONSONANTS_HARD}])Й[Оо]', 'replacer': '\\1ЪЁ', 're': True},
+    {'find_what': f'([{CYR_CONSONANTS_HARD}])й[Оо]', 'replacer': '\\1ъё', 're': True},
+    {'find_what': f'([{CYR_CONSONANTS_HARD}])Й[Уу]', 'replacer': '\\1ЪЮ', 're': True},
+    {'find_what': f'([{CYR_CONSONANTS_HARD}])й[Уу]', 'replacer': '\\1ъю', 're': True},
+    {'find_what': f'([{CYR_CONSONANTS_HARD}])Й[Ee]', 'replacer': '\\1ЪЕ', 're': True},
+    {'find_what': f'([{CYR_CONSONANTS_HARD}])й[Ee]', 'replacer': '\\1ъе', 're': True},
     # ya, yo, yu
     {'find_what': 'Й[Аа]', 'replacer': 'Я', 're': True},
     {'find_what': 'й[Аа]', 'replacer': 'я', 're': True},
@@ -238,7 +270,11 @@ _lat2cyr = [
     {'find_what': 'й[Оо]', 'replacer': 'ё', 're': True},
     {'find_what': 'Й[Уу]', 'replacer': 'Ю', 're': True},
     {'find_what': 'й[Уу]', 'replacer': 'ю', 're': True},
-
+]
+
+_lat2cyr_special_cases = [
+    {'find_what': 'раён', 'replacer': 'район'},
+    {'find_what': 'РАЁН', 'replacer': 'РАЙОН'},
 ]
 
 
@@ -264,13 +300,9 @@ def cyr2lat(text, joint_acute=True, first_e_with_hacek=True, soft_l_after_vowels
 
 def lat2cyr(text, joint_acute=True, first_e_with_hacek=True, soft_l_after_vowels=True):
     # todo: support all the optional settings
-
-
-
-CYR_CHARS = 'абвгдеёжзиклмнопрстуфхцчшщъыьэюя'
-BASIC_LAT_CHARS = 'abcdefghijklmnopqrtuvwxyz'
-ACCENT_LAT_CHARS = 'ěäüöśźćńŕťďĺ'
-LAT_CHARS = BASIC_LAT_CHARS + ACCENT_LAT_CHARS
+    text = transliterate_with_rules(text, _lat2cyr)
+    text = transliterate_with_rules(text, _lat2cyr_special_cases)
+    return text
 
 
 def detect_script(text: str, min_prevalence: float = 2.0, min_detectable: float = 0.1) -> str:
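The rule lists in this file only work if they are applied in order, so a minimal sketch of that mechanism may help when reading the hunks. The diff does not show `transliterate_with_rules`, so `apply_rules` below is a hypothetical stand-in that assumes each rule is tried in sequence, with `re.sub` when `'re'` is true and plain `str.replace` otherwise (a missing `'re'` key is treated as false, which is also an assumption). `HARD`, `demo_rules` and `special` are illustrative subsets, and the inputs are intermediate strings in which single letters have already been mapped to Cyrillic.

```python
import re

def apply_rules(text, rules):
    """Apply each rule in list order (hypothetical stand-in, not the repo's function)."""
    for rule in rules:
        if rule.get('re', False):
            text = re.sub(rule['find_what'], rule['replacer'], text)
        else:
            text = text.replace(rule['find_what'], rule['replacer'])
    return text

# Same character set as CYR_CONSONANTS_LOWER_HARD in the diff.
HARD = 'бвгдзклмнпрстфх'

# Illustrative subset of the lat2cyr rules, in the same dict format.
demo_rules = [
    # uppercase the soft sign inside an uppercase word
    {'find_what': '([А-ЯЁ]{2,})ь', 'replacer': '\\1Ь', 're': True},
    {'find_what': '([А-ЯЁ])ь([А-ЯЁ])', 'replacer': '\\1Ь\\2', 're': True},
    # hard consonant + й + о becomes consonant + ъ + ё;
    # this must run before the general rule below
    {'find_what': f'([{HARD}])й[Оо]', 'replacer': '\\1ъё', 're': True},
    # general case: й + о becomes ё
    {'find_what': 'й[Оо]', 'replacer': 'ё', 're': True},
]

# Second pass, undoing an over-eager merge in one known word.
special = [{'find_what': 'раён', 'replacer': 'район'}]

print(apply_rules('сйомка', demo_rules))   # съёмка
print(apply_rules('йон', demo_rules))      # ён
print(apply_rules('ПАНСь', demo_rules))    # ПАНСЬ
print(apply_rules(apply_rules('райононть', demo_rules), special))  # райононть
```

If the general `й[Оо]` rule ran first, `сйомка` would come out as `сёмка` and the hard sign would be lost, which is why the more specific rules sit above it in the list.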
test_translit.py
CHANGED
@@ -27,8 +27,20 @@ def test_detection():
     assert detect_script('ěrzä эрзянь') == 'mix'
 
 
-
-
+DEFAULT_TEST_SET = [
+    ("съёмка", "sjomka"),  # ъё
+    ('бажась велявтомс ды муемс эстензэ ён тарка', 'bažaś velävtoms dy muems ěstenzě jon tarka'),
+    ('УЖОСТО УЖОС ИДЕМЕВСТЬ ПАНСЯН!', 'UŽOSTO UŽOS IDEMEVSŤ PANSÄN!'),  # upper Ь
+    ('ПЬЯНСТВО', 'ṔJANSTVO'),  # also upper Ь
+    ('райононть', 'rajononť'),  # special case
+    # TODO: FIXME ('XVIII пингень', 'XVIII pingeń'),  # consistency
+]
+
+
+def test_edge_cases():
+    for cyr, lat in DEFAULT_TEST_SET:
+        assert cyr2lat(cyr) == lat
+        assert lat2cyr(lat) == cyr
 
 
 def test_consistency():
@@ -41,3 +53,32 @@ def test_consistency():
         line_cyr2 = lat2cyr(line_lat)
         assert line_cyr == line_cyr2
 
+
+def test_zontik():
+    with open('examples/zontik_cyr.txt', 'r') as f:
+        lines_cyr = [line.strip() for line in f.readlines()]
+        lines_cyr = [line for line in lines_cyr if line]
+    with open('examples/zontik_lat.txt', 'r') as f:
+        lines_lat = [line.strip() for line in f.readlines()]
+        lines_lat = [line for line in lines_lat if line]
+    assert len(lines_cyr) == len(lines_lat)
+    for line_cyr, line_lat in zip(lines_cyr, lines_lat):
+        assert line_lat == cyr2lat(line_cyr)
+        assert line_cyr == lat2cyr(line_lat)
+
+
+def get_inconsistent_pairs():
+    try:
+        from datasets import load_dataset
+    except ImportError:
+        return
+    dev = load_dataset('slone/myv_ru_2022', split='validation')
+    with open('examples/mismatches.txt', 'w') as f:
+        for line_cyr in dev['myv']:
+            line_lat = cyr2lat(line_cyr)
+            line_cyr2 = lat2cyr(line_lat)
+            if line_cyr != line_cyr2:
+                print(line_cyr, file=f)
+                print(line_cyr2, file=f)
+                print(line_lat, file=f)
+                print(file=f)
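The round-trip check that `test_consistency` and `get_inconsistent_pairs` rely on can be sketched without the dataset dependency. In the sketch below, `find_round_trip_mismatches`, `toy_cyr2lat` and `toy_lat2cyr` are hypothetical names introduced for illustration. The toy converters cover only three letters and are not the repository's `cyr2lat` and `lat2cyr`.

```python
def find_round_trip_mismatches(lines, forward, backward):
    """Return (original, restored, converted) for every line that fails to round-trip."""
    mismatches = []
    for line in lines:
        converted = forward(line)
        restored = backward(converted)
        if restored != line:
            mismatches.append((line, restored, converted))
    return mismatches

# Toy converters over a three-letter alphabet, for illustration only.
_TO_LAT = str.maketrans('аон', 'aon')
_TO_CYR = str.maketrans('aon', 'аон')

def toy_cyr2lat(text):
    return text.translate(_TO_LAT)

def toy_lat2cyr(text):
    return text.translate(_TO_CYR)

# The second line mixes Latin "on" with a Cyrillic "а".
lines = ['она', 'on а']
for original, restored, converted in find_round_trip_mismatches(lines, toy_cyr2lat, toy_lat2cyr):
    print(original, restored, converted, sep=' | ')
```

Only the mixed-script line is reported: its Latin letters survive the forward pass unchanged and are then converted to Cyrillic on the way back. The `XVIII` case marked TODO in `DEFAULT_TEST_SET` presumably fails for a similar reason, though the diff does not confirm this.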