Commit 607e564 (1 parent: cd7a065), committed by test-rtechs
Upload 68 files

This view is limited to 50 files because it contains too many changes.
- uroman/.gitignore +35 -0
- uroman/LICENSE.txt +11 -0
- uroman/README.md +165 -0
- uroman/README.txt +141 -0
- uroman/bin/de-accent.pl +201 -0
- uroman/bin/string-distance.pl +99 -0
- uroman/bin/uroman-quick.pl +58 -0
- uroman/bin/uroman-tsv.sh +28 -0
- uroman/bin/uroman.pl +138 -0
- uroman/bin/uroman.py +0 -0
- uroman/data/Chinese_to_Pinyin.txt +0 -0
- uroman/data/NumProps.jsonl +0 -0
- uroman/data/Scripts.txt +174 -0
- uroman/data/UnicodeData.txt +0 -0
- uroman/data/UnicodeDataOverwrite.txt +443 -0
- uroman/data/UnicodeDataProps.txt +164 -0
- uroman/data/UnicodeDataPropsCJK.txt +0 -0
- uroman/data/UnicodeDataPropsHangul.txt +1 -0
- uroman/data/romanization-auto-table.txt +0 -0
- uroman/data/romanization-table-arabic-block.txt +179 -0
- uroman/data/romanization-table.txt +2193 -0
- uroman/data/romanization-table.v1.2.1.txt +814 -0
- uroman/data/string-distance-cost-rules.txt +896 -0
- uroman/lib/JSON.pm +2317 -0
- uroman/lib/JSON/backportPP.pm +2806 -0
- uroman/lib/JSON/backportPP/Boolean.pm +27 -0
- uroman/lib/JSON/backportPP/Compat5005.pm +131 -0
- uroman/lib/JSON/backportPP/Compat5006.pm +173 -0
- uroman/lib/NLP/Chinese.pm +239 -0
- uroman/lib/NLP/English.pm +0 -0
- uroman/lib/NLP/Romanizer.pm +2020 -0
- uroman/lib/NLP/UTF8.pm +1404 -0
- uroman/lib/NLP/stringDistance.pm +724 -0
- uroman/lib/NLP/utilities.pm +0 -0
- uroman/tarballs/uroman-v1.0.tar.gz +0 -0
- uroman/tarballs/uroman-v1.1.tar.gz +0 -0
- uroman/tarballs/uroman-v1.2.4.tar.gz +0 -0
- uroman/tarballs/uroman-v1.2.5.tar.gz +0 -0
- uroman/tarballs/uroman-v1.2.6.tar.gz +0 -0
- uroman/tarballs/uroman-v1.2.7.tar.gz +0 -0
- uroman/tarballs/uroman-v1.2.tar.gz +0 -0
- uroman/test/multi-script.txt +32 -0
- uroman/test/multi-script.uroman-ref.txt +32 -0
- uroman/test/string-similarity-test-input.txt +7 -0
- uroman/test/string-similarity-test-output-ref.txt +8 -0
- uroman/text/amh.txt +7 -0
- uroman/text/ara.txt +3 -0
- uroman/text/ben.txt +8 -0
- uroman/text/bod.txt +3 -0
- uroman/text/egy.txt +5 -0
uroman/.gitignore
ADDED
@@ -0,0 +1,35 @@
+!Build/
+.last_cover_stats
+/META.yml
+/META.json
+/MYMETA.*
+*.o
+*.pm.tdy
+*.bs
+
+# Devel::Cover
+cover_db/
+
+# Devel::NYTProf
+nytprof.out
+
+# Dist::Zilla
+/.build/
+
+# Module::Build
+_build/
+Build
+Build.bat
+
+# Module::Install
+inc/
+
+# ExtUtils::MakeMaker
+/blib/
+/_eumm/
+/*.gz
+/Makefile
+/Makefile.old
+/MANIFEST.bak
+/pm_to_blib
+/*.zip
uroman/LICENSE.txt
ADDED
@@ -0,0 +1,11 @@
+Copyright (C) 2015-2020 Ulf Hermjakob, USC Information Sciences Institute
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+Any publication of projects using uroman shall acknowledge its use: "This project uses the universal romanizer software 'uroman' written by Ulf Hermjakob, USC Information Sciences Institute (2015-2020)".
+Bibliography: Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of Association for Computational Linguistics, Demo Track.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
uroman/README.md
ADDED
@@ -0,0 +1,165 @@
+# uroman
+
+*uroman* is a *universal romanizer*. It converts text in any script to the Latin alphabet.
+
+Version: 1.2.8
+Release date: April 23, 2021
+Author: Ulf Hermjakob, USC Information Sciences Institute
+
+
+### Usage
+```bash
+$ uroman.pl [-l <lang-code>] [--chart] [--no-cache] < STDIN
+where the optional <lang-code> is a 3-letter language code, e.g. ara, bel, bul, deu, ell, eng, fas,
+grc, heb, kaz, kir, lav, lit, mkd, mkd2, oss, pnt, pus, rus, srp, srp2, tur, uig, ukr, yid.
+--chart specifies chart output (in JSON format) to represent alternative romanizations.
+--no-cache disables caching.
+```
+### Examples
+```bash
+$ bin/uroman.pl < text/zho.txt
+$ bin/uroman.pl -l tur < text/tur.txt
+$ bin/uroman.pl -l heb --chart < text/heb.txt
+$ bin/uroman.pl < test/multi-script.txt > test/multi-script.uroman.txt
+```
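Note on the usage above: the README documents `uroman.pl` as a filter that reads STDIN and writes STDOUT. A minimal sketch of driving it from Python might look like the following. It assumes `perl` is on the PATH and that the script lives at `bin/uroman.pl`; the helper names `build_uroman_command` and `romanize` are hypothetical and not part of uroman. Only the flags listed in the README (`-l`, `--chart`, `--no-cache`) are used.

```python
import subprocess
from typing import List, Optional


def build_uroman_command(script: str = "bin/uroman.pl",
                         lang_code: Optional[str] = None,
                         chart: bool = False,
                         no_cache: bool = False) -> List[str]:
    """Assemble the argument list from the flags documented in the README."""
    cmd = ["perl", script]
    if lang_code:
        cmd += ["-l", lang_code]
    if chart:
        cmd.append("--chart")
    if no_cache:
        cmd.append("--no-cache")
    return cmd


def romanize(text: str, **options) -> str:
    """Pipe text through uroman.pl on STDIN and return what it prints on STDOUT."""
    result = subprocess.run(build_uroman_command(**options),
                            input=text, capture_output=True,
                            text=True, encoding="utf-8", check=True)
    return result.stdout
```

Keeping the command construction separate from the subprocess call makes the flag handling easy to check without having Perl installed.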
+
+Identifying the input as Arabic, Belarusian, Bulgarian, English, Farsi, German,
+Ancient Greek, Modern Greek, Pontic Greek, Hebrew, Kazakh, Kyrgyz, Latvian,
+Lithuanian, North Macedonian, Russian, Serbian, Turkish, Ukrainian, Uyghur or
+Yiddish will improve romanization for those languages as some letters in those
+languages have different sound values from other languages using the same script
+(French, Russian, Hebrew respectively).
+No effect for other languages in this version.
+
+### Bibliography
+Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of Association for Computational Linguistics, Demo Track. ACL-2018 Best Demo Paper Award. [Paper in ACL Anthology](https://www.aclweb.org/anthology/P18-4003) | [Poster](https://www.isi.edu/~ulf/papers/poster-uroman-acl2018.pdf) | [BibTex](https://www.aclweb.org/anthology/P18-4003.bib)
+
+### Change History
+Changes in version 1.2.8
+* Updated to Unicode 13.0 (2021), which supports several new scripts (10% larger UnicodeData.txt).
+* Improved support for Georgian.
+* Preserve various symbols (as opposed to mapping to the symbols' names).
+* Various small improvements.
+
+Changes in version 1.2.7
+* Improved support for Pashto.
+
+Changes in version 1.2.6
+* Improved support for Ukrainian, Russian and Ogham (ancient Irish script).
+* Added support for English Braille.
+* Added alternative Romanization for North Macedonian and Serbian (mkd2/srp2)
+  reflecting a casual style that many native speakers of those languages use
+  when writing text in Latin script, e.g. non-accented single letters (e.g. "s")
+  rather than phonetically motivated combinations of letters (e.g. "sh").
+* When a line starts with "::lcode xyz ", the new uroman version will switch to
+  that language for that line. This is used for the new reference test file.
+* Various small improvements.
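Note on the `::lcode xyz ` convention described in the 1.2.6 changes: the README only says that a line starting with this prefix switches the language for that line. The sketch below shows one way such a prefix could be split off. The regular expression, the function name `split_lcode`, and the choice to strip the prefix from the returned text are all assumptions for illustration; uroman's own parsing in `Romanizer.pm` may differ.

```python
import re
from typing import Optional, Tuple

# Assumed shape of the directive: "::lcode", one space, a code, one space, the text.
LCODE_PREFIX = re.compile(r"^::lcode (\S+) (.*)$", re.DOTALL)


def split_lcode(line: str, default: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Return (language code, remaining text); fall back to the default code."""
    match = LCODE_PREFIX.match(line)
    if match:
        return match.group(1), match.group(2)
    return default, line
```

A caller would pass the `-l` value as `default`, so that lines without a directive keep the language chosen on the command line.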
+
+Changes in version 1.2.5
+* Improved support for Armenian and eight languages using Cyrillic scripts.
+  -- For Serbian and Macedonian, which are often written in both Cyrillic
+     and Latin scripts, uroman will map both official versions to the same
+     romanized text, e.g. both "Ниш" and "Niš" will be mapped to "Nish" (which
+     properly reflects the pronunciation of the city's name).
+     For both Serbian and Macedonian, casual writers often use a simplified
+     Latin form without diacritics, e.g. "s" to represent not only Cyrillic "с"
+     and Latin "s", but also "ш" or "š", even if this conflates "s" and "sh" and
+     other such pairs. The casual romanization can be simulated by using
+     alternative uroman language codes "srp2" and "mkd2", which romanize
+     both "Ниш" and "Niš" to "Nis" to reflect the casual Latin spelling.
+* Various small improvements.
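Note on the "Ниш" / "Niš" example in the 1.2.5 changes: the behaviour can be pictured as two lookup tables that differ only in how "ш" and "š" are rendered. The toy sketch below covers just the four letters needed for that one word. The tables and the function `romanize_toy` are hypothetical and are not taken from uroman's romanization tables.

```python
# Toy tables covering only the letters of the README's example word.
OFFICIAL = {"Н": "N", "и": "i", "ш": "sh", "\u0161": "sh"}   # \u0161 is Latin "š"
CASUAL = {**OFFICIAL, "ш": "s", "\u0161": "s"}               # srp2/mkd2 style


def romanize_toy(text: str, lang_code: str = "srp") -> str:
    """Map each character through the table for the given code; pass others through."""
    table = CASUAL if lang_code in ("srp2", "mkd2") else OFFICIAL
    return "".join(table.get(ch, ch) for ch in text)
```

With these tables, both spellings collapse to "Nish" under `srp` and to "Nis" under `srp2`, which is the conflation the README describes.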
+
+Changes in version 1.2.4
+* Fixed a bug that generated two empty lines for each empty line in cache mode.
+
+Changes in version 1.2
+* Run-time improvement based on (1) token-based caching and (2) shortcut
+  romanization (identity) of ASCII strings for default 1-best (non-chart)
+  output. Speed-up by a factor of 10 for Bengali and Uyghur on medium and
+  large size texts.
+* Incremental improvements for Farsi, Amharic, Russian, Hebrew and related
+  languages.
+* Richer lattice structure (more alternatives) for "Romanization" of English
+  to support better matching to romanizations of other languages.
+  Changes output only when --chart option is specified. No change in output for
+  default 1-best output, which for ASCII characters is always the input string.
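Note on the 1.2 run-time improvement: the two ideas named there, token-based caching and an identity shortcut for ASCII strings, can be sketched as follows. This is an illustration under assumptions, not uroman's implementation: tokens are split on single spaces, the cache is a plain dictionary, and `romanize_token` stands in for whatever expensive per-token romanizer is available.

```python
from typing import Callable, Dict


def romanize_line(line: str,
                  romanize_token: Callable[[str], str],
                  cache: Dict[str, str]) -> str:
    """Romanize a line token by token, skipping ASCII and reusing cached results."""
    out = []
    for token in line.split(" "):
        if token.isascii():
            # Shortcut: ASCII text is already its own 1-best romanization.
            out.append(token)
        elif token in cache:
            out.append(cache[token])
        else:
            cache[token] = romanize_token(token)
            out.append(cache[token])
    return " ".join(out)
```

Because frequent tokens repeat often in medium and large texts, the expensive call runs once per distinct non-ASCII token rather than once per occurrence.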
+
+Changes in version 1.1 (major upgrade)
+* Offers chart output (in JSON format) to represent alternative romanizations.
+  -- Location of first character is defined to be "line: 1, start:0, end:0".
+* Incremental improvements of Hebrew and Greek romanization; Chinese numbers.
+* Improved web-interface at http://www.isi.edu/~ulf/uroman.html
+  -- Shows corresponding original and romanization text in red
+     when hovering over a text segment.
+  -- Shows alternative romanizations when hovering over romanized text
+     marked by dotted underline.
+  -- Added right-to-left script detection and improved display for right-to-left
+     script text (as determined line by line).
+  -- On-page support for some scripts that are often not pre-installed on users'
+     computers (Burmese, Egyptian, Klingon).
+
+Changes in version 1.0 (major upgrade)
+* Upgraded principal internal data structure from string to lattice.
+* Improvements mostly in vowelization of South and Southeast Asian languages.
+* Vocalic 'r' more consistently treated as vowel (no additional vowel added).
+* Repetition signs (Japanese/Chinese/Thai/Khmer/Lao) are mapped to superscript 2.
+* Japanese Katakana middle dots now mapped to ASCII space.
+* Tibetan intersyllabic mark now mapped to middle dot (U+00B7).
+* Some corrections regarding analysis of Chinese numbers.
+* Many more foreign diacritics and punctuation marks dropped or mapped to ASCII.
+* Zero-width characters dropped, except line/sentence-initial byte order marks.
+* Spaces normalized to ASCII space.
+* Fixed bug that in some cases mapped signs (such as dagger or bullet) to their verbal descriptions.
+* Tested against previous version of uroman with a new uroman visual diff tool.
+* Almost an order of magnitude faster.
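Note on two of the 1.0 changes, dropping zero-width characters and normalizing spaces: a rough sketch of that kind of clean-up is shown below. The exact set of characters uroman treats as zero-width is not stated in the README, so the `ZERO_WIDTH` set is an assumption, and only a line-initial byte order mark is preserved here (the sentence-initial case is not handled).

```python
import unicodedata

# Assumed set: zero-width space, non-joiner, joiner, word joiner, byte order mark.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize_line(line: str) -> str:
    """Drop zero-width characters and map Unicode space separators to ASCII space."""
    out = []
    for index, ch in enumerate(line):
        if ch in ZERO_WIDTH:
            if ch == "\ufeff" and index == 0:
                out.append(ch)  # keep a line-initial byte order mark
            continue
        if unicodedata.category(ch) == "Zs":
            out.append(" ")     # e.g. no-break space, ideographic space
        else:
            out.append(ch)
    return "".join(out)
```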
+
+Changes in version 0.7 (minor upgrade)
+* Added script uroman-quick.pl for Arabic script languages, incl. Uyghur.
+  Much faster, pre-caching mapping of Arabic to Latin characters, simple greedy processing.
+  Will not convert material from non-Arabic blocks such as any (somewhat unusual) Cyrillic
+  or Chinese characters in Uyghur texts.
+
+Changes in version 0.6 (minor upgrade)
+* Added support for two letter characters used in Uzbek:
+  (1) character "ʻ" ("modifier letter turned comma", which modifies preceding "g" and "u" letters)
+  (2) character "ʼ" ("modifier letter apostrophe", which Uzbek uses to mark a glottal stop).
+  Both are now mapped to "'" (plain ASCII apostrophe).
+* Added support for Uyghur vowel characters such as "ې" (Arabic e) and "ۆ" (Arabic oe)
+  even when they are not preceded by "ئ" (yeh with hamza above).
+* Added support for Arabic semicolon "؛", Arabic ligature forms for phrases such as "ﷺ"
+  ("sallallahou alayhe wasallam" = "prayer of God be upon him and his family and peace")
+* Added robustness for Arabic letter presentation forms (initial/medial/final/isolated).
+  However, it is strongly recommended to normalize any presentation form Arabic letters
+  to their non-presentation form before calling uroman.
+* Added force flush directive ($|=1;).
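Note on the 0.6 recommendation to normalize Arabic presentation forms before calling uroman: the README does not say how to do this. One possible approach, sketched here as an assumption, is to apply Unicode NFKC normalization only to characters in the two Arabic Presentation Forms blocks (U+FB50 to U+FDFF and U+FE70 to U+FEFF), leaving all other text untouched. The function names are hypothetical.

```python
import unicodedata


def is_arabic_presentation_form(ch: str) -> bool:
    """True for code points in Arabic Presentation Forms-A or -B."""
    code_point = ord(ch)
    return 0xFB50 <= code_point <= 0xFDFF or 0xFE70 <= code_point <= 0xFEFF


def normalize_presentation_forms(text: str) -> str:
    """Replace presentation forms with their compatibility (base-letter) equivalents."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_arabic_presentation_form(ch) else ch
        for ch in text
    )
```

Restricting NFKC to these blocks avoids side effects elsewhere, since whole-string NFKC would also fold unrelated compatibility characters such as Latin ligatures and full-width forms.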
+
+Changes in version 0.5 (minor upgrade)
+* Improvements for Uyghur (make sure to use language option: -l uig)
+
+Changes in version 0.4 (minor upgrade)
+* Improvements for Thai (special cases for vowel/consonant reordering, e.g. for "sara o"; dropped some aspiration 'h's)
+* Minor change for Arabic (added "alef+fathatan" = "an")
+
+New features in version 0.3
+* Covers Mandarin (Chinese)
+* Improved romanization for numerous languages
+* Preserves capitalization (e.g. from Latin, Cyrillic, Greek scripts)
+* Maps from native digits to Western numbers
+* Faster for South Asian languages
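Note on "Maps from native digits to Western numbers" in the 0.3 features: for scripts with positional decimal digits, the mapping can be sketched with Python's `unicodedata.decimal`, which returns the decimal value of a digit character. This is only an illustration of the per-digit case. It does not cover number systems such as Chinese numerals, which the change history mentions separately, and the function name is hypothetical.

```python
import unicodedata


def to_western_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII counterpart."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        out.append(str(value) if value is not None else ch)
    return "".join(out)
```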
+
+### Other features
+* Web interface: http://www.isi.edu/~ulf/uroman.html
+* Vowelization is provided when locally computable, e.g. for many South Asian languages and Tibetan.
+
+### Limitations
+* The current version of uroman has a few limitations, some of which we plan to address in future versions.
+  For Japanese, *uroman* currently romanizes hiragana and katakana as expected, but kanji are interpreted as Chinese characters and romanized as such.
+  For Egyptian hieroglyphs, only single-sound phonetic characters and numbers are currently romanized.
+  For Linear B, only phonetic syllabic characters are romanized.
+  For some other extinct scripts such as cuneiform, no romanization is provided.
+* A romanizer is not a full transliterator. For example, this version of
+  uroman does not vowelize text that lacks explicit vowelization such as
+  normal text in Arabic and Hebrew (without diacritics/points).
+
+### Acknowledgments
+This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116, and by research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, Air Force Laboratory, DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
uroman/README.txt
ADDED
@@ -0,0 +1,141 @@
+uroman version 1.2.8
+Release date: April 23, 2021
+Author: Ulf Hermjakob, USC Information Sciences Institute
+
+uroman is a universal romanizer. It converts text in any script to the Latin alphabet.
+
+Usage: uroman.pl [-l <lang-code>] [--chart] [--no-cache] < STDIN
+       where the optional <lang-code> is a 3-letter language code, e.g. ara, bel, bul, deu, ell, eng, fas,
+       grc, heb, kaz, kir, lav, lit, mkd, mkd2, oss, pnt, pus, rus, srp, srp2, tur, uig, ukr, yid.
+       --chart specifies chart output (in JSON format) to represent alternative romanizations.
+       --no-cache disables caching.
+Examples: bin/uroman.pl < text/zho.txt
+          bin/uroman.pl -l tur < text/tur.txt
+          bin/uroman.pl -l heb --chart < text/heb.txt
+          bin/uroman.pl < test/multi-script.txt > test/multi-script.uroman.txt
+
+Identifying the input as Arabic, Belarusian, Bulgarian, English, Farsi, German,
+Ancient Greek, Modern Greek, Pontic Greek, Hebrew, Kazakh, Kyrgyz, Latvian,
+Lithuanian, North Macedonian, Russian, Serbian, Turkish, Ukrainian, Uyghur or Yiddish
+will improve romanization for those languages as some letters in those languages
+have different sound values from other languages using the same script.
+No effect for other languages in this version.
+
+Bibliography: Ulf Hermjakob, Jonathan May, and Kevin Knight. 2018. Out-of-the-box universal romanization tool uroman. In Proceedings of the 56th Annual Meeting of Association for Computational Linguistics, Demo Track. [Best Demo Paper Award]
+
+Changes in version 1.2.8
+ * Improved support for Georgian.
+ * Updated UnicodeData.txt to version 13 (2021) with several new scripts (10% larger).
+ * Preserve various symbols (as opposed to mapping to the symbols' names).
+ * Various small improvements.
+Changes in version 1.2.7
+ * Improved support for Pashto.
+Changes in version 1.2.6
+ * Improved support for Ukrainian, Russian and Ogham (ancient Irish script).
+ * Added support for English Braille.
+ * Added alternative Romanization for North Macedonian and Serbian (mkd2/srp2)
+   reflecting a casual style that many native speakers of those languages use
+   when writing text in Latin script, e.g. non-accented single letters (e.g. "s")
+   rather than phonetically motivated combinations of letters (e.g. "sh").
+ * When a line starts with "::lcode xyz ", the new uroman version will switch to
+   that language for that line. This is used for the new reference test file.
+ * Various small improvements.
+Changes in version 1.2.5
+ * Improved support for Armenian and eight languages using Cyrillic scripts.
+   -- For Serbian and Macedonian, which are often written in both Cyrillic
+      and Latin scripts, uroman will map both official versions to the same
+      romanized text, e.g. both "Ниш" and "Niš" will be mapped to "Nish" (which
+      properly reflects the pronunciation of the city's name).
+      For both Serbian and Macedonian, casual writers often use a simplified
+      Latin form without diacritics, e.g. "s" to represent not only Cyrillic "с"
+      and Latin "s", but also "ш" or "š", even if this conflates "s" and "sh" and
+      other such pairs. The casual romanization can be simulated by using
+      alternative uroman language codes "srp2" and "mkd2", which romanize
+      both "Ниш" and "Niš" to "Nis" to reflect the casual Latin spelling.
+ * Various small improvements.
+Changes in version 1.2.4
+ * Added support for Tifinagh (a script used for Berber languages).
+ * Fixed a bug that generated two empty lines for each empty line in cache mode.
+Changes in version 1.2.3
+ * Exclude emojis, dingbats, many other pictographs from being romanized (e.g. to "face")
+Changes in version 1.2
+ * Run-time improvement based on (1) token-based caching and (2) shortcut
+   romanization (identity) of ASCII strings for default 1-best (non-chart)
+   output. Speed-up by a factor of 10 for Bengali and Uyghur on medium and
+   large size texts.
+ * Incremental improvements for Farsi, Amharic, Russian, Hebrew and related
+   languages.
+ * Richer lattice structure (more alternatives) for "Romanization" of English
+   to support better matching to romanizations of other languages.
+   Changes output only when --chart option is specified. No change in output for
+   default 1-best output, which for ASCII characters is always the input string.
+Changes in version 1.1 (major upgrade)
+ * Offers chart output (in JSON format) to represent alternative romanizations.
+   -- Location of first character is defined to be "line: 1, start:0, end:0".
+ * Incremental improvements of Hebrew and Greek romanization; Chinese numbers.
+ * Improved web-interface at http://www.isi.edu/~ulf/uroman.html
+   -- Shows corresponding original and romanization text in red
+      when hovering over a text segment.
+   -- Shows alternative romanizations when hovering over romanized text
+      marked by dotted underline.
+   -- Added right-to-left script detection and improved display for right-to-left
+      script text (as determined line by line).
+   -- On-page support for some scripts that are often not pre-installed on users'
+      computers (Burmese, Egyptian, Klingon).
+Changes in version 1.0 (major upgrade)
+ * Upgraded principal internal data structure from string to lattice.
+ * Improvements mostly in vowelization of South and Southeast Asian languages.
+ * Vocalic 'r' more consistently treated as vowel (no additional vowel added).
+ * Repetition signs (Japanese/Chinese/Thai/Khmer/Lao) are mapped to superscript 2.
+ * Japanese Katakana middle dots now mapped to ASCII space.
+ * Tibetan intersyllabic mark now mapped to middle dot (U+00B7).
+ * Some corrections regarding analysis of Chinese numbers.
+ * Many more foreign diacritics and punctuation marks dropped or mapped to ASCII.
+ * Zero-width characters dropped, except line/sentence-initial byte order marks.
+ * Spaces normalized to ASCII space.
+ * Fixed bug that in some cases mapped signs (such as dagger or bullet) to their verbal descriptions.
+ * Tested against previous version of uroman with a new uroman visual diff tool.
+ * Almost an order of magnitude faster.
+Changes in version 0.7 (minor upgrade)
+ * Added script uroman-quick.pl for Arabic script languages, incl. Uyghur.
+   Much faster, pre-caching mapping of Arabic to Latin characters, simple greedy processing.
+   Will not convert material from non-Arabic blocks such as any (somewhat unusual) Cyrillic
+   or Chinese characters in Uyghur texts.
+Changes in version 0.6 (minor upgrade)
+ * Added support for two letter characters used in Uzbek:
+   (1) character "ʻ" ("modifier letter turned comma", which modifies preceding "g" and "u" letters)
+   (2) character "ʼ" ("modifier letter apostrophe", which Uzbek uses to mark a glottal stop).
+   Both are now mapped to "'" (plain ASCII apostrophe).
+ * Added support for Uyghur vowel characters such as "ې" (Arabic e) and "ۆ" (Arabic oe)
+   even when they are not preceded by "ئ" (yeh with hamza above).
+ * Added support for Arabic semicolon "؛", Arabic ligature forms for phrases such as "ﷺ"
+   ("sallallahou alayhe wasallam" = "prayer of God be upon him and his family and peace")
+ * Added robustness for Arabic letter presentation forms (initial/medial/final/isolated).
+   However, it is strongly recommended to normalize any presentation form Arabic letters
+   to their non-presentation form before calling uroman.
+ * Added force flush directive ($|=1;).
+Changes in version 0.5 (minor upgrade)
+ * Improvements for Uyghur (make sure to use language option: -l uig)
+Changes in version 0.4 (minor upgrade)
+ * Improvements for Thai (special cases for vowel/consonant reordering, e.g. for "sara o"; dropped some aspiration 'h's)
+ * Minor change for Arabic (added "alef+fathatan" = "an")
+New features in version 0.3
+ * Covers Mandarin (Chinese)
+ * Improved romanization for numerous languages
+ * Preserves capitalization (e.g. from Latin, Cyrillic, Greek scripts)
+ * Maps from native digits to Western numbers
+ * Faster for South Asian languages
+
+Other features
+ * Web interface: http://www.isi.edu/~ulf/uroman.html
+ * Vowelization is provided when locally computable, e.g. for many South Asian
+   languages and Tibetan.
+
+Limitations
+ * This version of uroman assumes all CJK ideographs to be Mandarin (Chinese).
+   This means that Japanese kanji are incorrectly romanized; however, Japanese
+   hiragana and katakana are properly romanized.
+ * A romanizer is not a full transliterator. For example, this version of
+   uroman does not vowelize text that lacks explicit vowelization such as
+   normal text in Arabic and Hebrew (without diacritics/points).
+
uroman/bin/de-accent.pl
ADDED
@@ -0,0 +1,201 @@
+#!/usr/bin/perl -w
+
+sub print_version {
+    print STDERR "$0 version 1.1\n";
+    print STDERR " Author: Ulf Hermjakob\n";
+    print STDERR " Last changed: March 14, 2011\n";
+}
+
+sub print_usage {
+    print STDERR "$0 [options] < with_accents.txt > without_accents.txt\n";
+    print STDERR " -h or -help\n";
+    print STDERR " -v or -version\n";
+}
+
+sub de_accent_string {
+    my ($s) = @_;
+
+    # $s =~ tr/A-Z/a-z/;
+    unless (0) {
+        # Latin-1
+        # Byte-level guard: run the substitutions only if the string contains
+        # the UTF-8 lead byte \xC3 followed by a continuation byte.
+        if ($s =~ /\xC3[\x80-\xBF]/) {
+            $s =~ s/(À|Á|Â|Ã|Ä|Å)/A/g;
+            $s =~ s/Æ/Ae/g;
+            $s =~ s/Ç/C/g;
+            $s =~ s/Ð/D/g;
+            $s =~ s/(È|É|Ê|Ë)/E/g;
+            $s =~ s/(Ì|Í|Î|Ï)/I/g;
+            $s =~ s/Ñ/N/g;
+            $s =~ s/(Ò|Ó|Ô|Õ|Ö|Ø)/O/g;
+            $s =~ s/(Ù|Ú|Û|Ü)/U/g;
+            $s =~ s/Þ/Th/g;
+            $s =~ s/Ý/Y/g;
+            $s =~ s/(à|á|â|ã|ä|å)/a/g;
+            $s =~ s/æ/ae/g;
+            $s =~ s/ç/c/g;
+            $s =~ s/(è|é|ê|ë)/e/g;
+            $s =~ s/(ì|í|î|ï)/i/g;
+            $s =~ s/ð/d/g;
+            $s =~ s/ñ/n/g;
+            $s =~ s/(ò|ó|ô|õ|ö)/o/g;
+            $s =~ s/ß/ss/g;
+            $s =~ s/þ/th/g;
+            $s =~ s/(ù|ú|û|ü)/u/g;
+            $s =~ s/(ý|ÿ)/y/g;
+        }
+        # Latin Extended-A
+        if ($s =~ /[\xC4-\xC5][\x80-\xBF]/) {
+            $s =~ s/(Ā|Ă|Ą)/A/g;
+            $s =~ s/(ā|ă|ą)/a/g;
+            $s =~ s/(Ć|Ĉ|Ċ|Č)/C/g;
+            $s =~ s/(ć|ĉ|ċ|č)/c/g;
+            $s =~ s/(Ď|Đ)/D/g;
+            $s =~ s/(ď|đ)/d/g;
+            $s =~ s/(Ē|Ĕ|Ė|Ę|Ě)/E/g;
+            $s =~ s/(ē|ĕ|ė|ę|ě)/e/g;
+            $s =~ s/(Ĝ|Ğ|Ġ|Ģ)/G/g;
+            $s =~ s/(ĝ|ğ|ġ|ģ)/g/g;
+            $s =~ s/(Ĥ|Ħ)/H/g;
+            $s =~ s/(ĥ|ħ)/h/g;
+            $s =~ s/(Ĩ|Ī|Ĭ|Į|İ)/I/g;
+            $s =~ s/(ĩ|ī|ĭ|į|ı)/i/g;
+            $s =~ s/Ĳ/Ij/g;
+            $s =~ s/ĳ/ij/g;
+            $s =~ s/Ĵ/J/g;
+            $s =~ s/ĵ/j/g;
+            $s =~ s/Ķ/K/g;
+            $s =~ s/(ķ|ĸ)/k/g;
+            $s =~ s/(Ĺ|Ļ|Ľ|Ŀ|Ł)/L/g;
+            $s =~ s/(ļ|ľ|ŀ|ł)/l/g;
+            $s =~ s/(Ń|Ņ|Ň|Ŋ)/N/g;
+            $s =~ s/(ń|ņ|ň|ŉ|ŋ)/n/g;
+            $s =~ s/(Ō|Ŏ|Ő)/O/g;
+            $s =~ s/(ō|ŏ|ő)/o/g;
+            $s =~ s/Œ/Oe/g;
+            $s =~ s/œ/oe/g;
+            $s =~ s/(Ŕ|Ŗ|Ř)/R/g;
+            $s =~ s/(ŕ|ŗ|ř)/r/g;
+            $s =~ s/(Ś|Ŝ|Ş|Š)/S/g;
+            $s =~ s/(ś|ŝ|ş|š|ſ)/s/g;
+            $s =~ s/(Ţ|Ť|Ŧ)/T/g;
+            $s =~ s/(ţ|ť|ŧ)/t/g;
+            $s =~ s/(Ũ|Ū|Ŭ|Ů|Ű|Ų)/U/g;
+            $s =~ s/(ũ|ū|ŭ|ů|ű|ų)/u/g;
+            $s =~ s/Ŵ/W/g;
+            $s =~ s/ŵ/w/g;
+            $s =~ s/(Ŷ|Ÿ)/Y/g;
+            $s =~ s/ŷ/y/g;
+            $s =~ s/(Ź|Ż|Ž)/Z/g;
+            $s =~ s/(ź|ż|ž)/z/g;
+        }
+        # Latin Extended Additional
+        if ($s =~ /\xE1[\xB8-\xBF][\x80-\xBF]/) {
+            $s =~ s/(ḁ|ạ|ả|ấ|ầ|ẩ|ẫ|ậ|ắ|ằ|ẳ|ẵ|ặ|ẚ)/a/g;
+            $s =~ s/(ḃ|ḅ|ḇ)/b/g;
+            $s =~ s/(ḉ)/c/g;
+            $s =~ s/(ḋ|ḍ|ḏ|ḑ|ḓ)/d/g;
+            $s =~ s/(ḕ|ḗ|ḙ|ḛ|ḝ|ẹ|ẻ|ẽ|ế|ề|ể|ễ|ệ)/e/g;
+            $s =~ s/(ḟ)/f/g;
+            $s =~ s/(ḡ)/g/g;
+            $s =~ s/(ḣ|ḥ|ḧ|ḩ|ḫ)/h/g;
+            $s =~ s/(ḭ|ḯ|ỉ|ị)/i/g;
+            $s =~ s/(ḱ|ḳ|ḵ)/k/g;
+            $s =~ s/(ḷ|ḹ|ḻ|ḽ)/l/g;
+            $s =~ s/(ḿ|ṁ|ṃ)/m/g;
+            # Dotted/underlined n maps to "n" (matches the uppercase rule below).
+            $s =~ s/(ṅ|ṇ|ṉ|ṋ)/n/g;
+            $s =~ s/(ọ|ỏ|ố|ồ|ổ|ỗ|ộ|ớ|ờ|ở|ỡ|ợ|ṍ|ṏ|ṑ|ṓ)/o/g;
+            $s =~ s/(ṕ|ṗ)/p/g;
+            $s =~ s/(ṙ|ṛ|ṝ|ṟ)/r/g;
+            $s =~ s/(ṡ|ṣ|ṥ|ṧ|ṩ|ẛ)/s/g;
+            $s =~ s/(ṫ|ṭ|ṯ|ṱ)/t/g;
+            $s =~ s/(ṳ|ṵ|ṷ|ṹ|ṻ|ụ|ủ|ứ|ừ|ử|ữ|ự)/u/g;
+            $s =~ s/(ṽ|ṿ)/v/g;
+            $s =~ s/(ẁ|ẃ|ẅ|ẇ|ẉ|ẘ)/w/g;
+            $s =~ s/(ẋ|ẍ)/x/g;
+            $s =~ s/(ẏ|ỳ|ỵ|ỷ|ỹ|ẙ)/y/g;
+            $s =~ s/(ẑ|ẓ|ẕ)/z/g;
+            $s =~ s/(Ḁ|Ạ|Ả|Ấ|Ầ|Ẩ|Ẫ|Ậ|Ắ|Ằ|Ẳ|Ẵ|Ặ)/A/g;
+            $s =~ s/(Ḃ|Ḅ|Ḇ)/B/g;
+            $s =~ s/(Ḉ)/C/g;
+            $s =~ s/(Ḋ|Ḍ|Ḏ|Ḑ|Ḓ)/D/g;
+            $s =~ s/(Ḕ|Ḗ|Ḙ|Ḛ|Ḝ|Ẹ|Ẻ|Ẽ|Ế|Ề|Ể|Ễ|Ệ)/E/g;
+            $s =~ s/(Ḟ)/F/g;
+            $s =~ s/(Ḡ)/G/g;
+            $s =~ s/(Ḣ|Ḥ|Ḧ|Ḩ|Ḫ)/H/g;
+            $s =~ s/(Ḭ|Ḯ|Ỉ|Ị)/I/g;
+            $s =~ s/(Ḱ|Ḳ|Ḵ)/K/g;
+            $s =~ s/(Ḷ|Ḹ|Ḻ|Ḽ)/L/g;
+            $s =~ s/(Ḿ|Ṁ|Ṃ)/M/g;
+            $s =~ s/(Ṅ|Ṇ|Ṉ|Ṋ)/N/g;
+            $s =~ s/(Ṍ|Ṏ|Ṑ|Ṓ|Ọ|Ỏ|Ố|Ồ|Ổ|Ỗ|Ộ|Ớ|Ờ|Ở|Ỡ|Ợ)/O/g;
+            $s =~ s/(Ṕ|Ṗ)/P/g;
+            $s =~ s/(Ṙ|Ṛ|Ṝ|Ṟ)/R/g;
+            $s =~ s/(Ṡ|Ṣ|Ṥ|Ṧ|Ṩ)/S/g;
+            $s =~ s/(Ṫ|Ṭ|Ṯ|Ṱ)/T/g;
+            $s =~ s/(Ṳ|Ṵ|Ṷ|Ṹ|Ṻ|Ụ|Ủ|Ứ|Ừ|Ử|Ữ|Ự)/U/g;
+            $s =~ s/(Ṽ|Ṿ)/V/g;
+            $s =~ s/(Ẁ|Ẃ|Ẅ|Ẇ|Ẉ)/W/g;
+            $s =~ s/(Ẍ)/X/g;
+            $s =~ s/(Ẏ|Ỳ|Ỵ|Ỷ|Ỹ)/Y/g;
+            $s =~ s/(Ẑ|Ẓ|Ẕ)/Z/g;
+        }
142 |
+
# Greek letters
|
143 |
+
if ($s =~ /\xCE[\x86-\xAB]/) {
|
144 |
+
$s =~ s/ά/α/g;
|
145 |
+
$s =~ s/έ/ε/g;
|
146 |
+
$s =~ s/ί/ι/g;
|
147 |
+
$s =~ s/ϊ/ι/g;
|
148 |
+
$s =~ s/ΐ/ι/g;
|
149 |
+
$s =~ s/ό/ο/g;
|
150 |
+
$s =~ s/ύ/υ/g;
|
151 |
+
$s =~ s/ϋ/υ/g;
|
152 |
+
$s =~ s/ΰ/υ/g;
|
153 |
+
$s =~ s/ώ/ω/g;
|
154 |
+
$s =~ s/Ά/Α/g;
|
155 |
+
$s =~ s/Έ/Ε/g;
|
156 |
+
$s =~ s/Ή/Η/g;
|
157 |
+
$s =~ s/Ί/Ι/g;
|
158 |
+
$s =~ s/Ϊ/Ι/g;
|
159 |
+
$s =~ s/Ύ/Υ/g;
|
160 |
+
$s =~ s/Ϋ/Υ/g;
|
161 |
+
$s =~ s/Ώ/Ω/g;
|
162 |
+
}
|
163 |
+
# Cyrillic letters
|
164 |
+
if ($s =~ /\xD0[\x80-\xAF\xB9]|\xD1[\x90-\x9D]/) {
|
165 |
+
$s =~ s/Ѐ/Е/g;
|
166 |
+
$s =~ s/Ё/Е/g;
|
167 |
+
$s =~ s/Ѓ/Г/g;
|
168 |
+
$s =~ s/Ќ/К/g;
|
169 |
+
$s =~ s/Ѝ/И/g;
|
170 |
+
$s =~ s/Й/И/g;
|
171 |
+
$s =~ s/ѐ/е/g;
|
172 |
+
$s =~ s/ё/е/g;
|
173 |
+
$s =~ s/ѓ/г/g;
|
174 |
+
$s =~ s/ќ/к/g;
|
175 |
+
$s =~ s/ѝ/и/g;
|
176 |
+
$s =~ s/й/и/g;
|
177 |
+
}
|
178 |
+
}
|
179 |
+
return $s;
|
180 |
+
}
|
181 |
+
|
182 |
+
while (@ARGV) {
|
183 |
+
$arg = shift @ARGV;
|
184 |
+
if ($arg =~ /^-*(h|help)$/i) {
|
185 |
+
&print_usage;
|
186 |
+
exit 1;
|
187 |
+
} elsif ($arg =~ /^-*(v|version)$/i) {
|
188 |
+
&print_version;
|
189 |
+
exit 1;
|
190 |
+
} else {
|
191 |
+
print STDERR "Ignoring unrecognized argument $arg\n";
|
192 |
+
}
|
193 |
+
}
|
194 |
+
|
195 |
+
$line_number = 0;
|
196 |
+
while (<>) {
|
197 |
+
$line_number++;
|
198 |
+
print &de_accent_string($_);
|
199 |
+
}
|
200 |
+
exit 0;
|
201 |
+
|
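The de-accenting above relies on hand-written substitution tables. As a point of comparison only, here is a minimal Python sketch of a different technique: Unicode canonical decomposition (NFD) followed by dropping combining marks. It is not what de-accent.pl does, and the function name `de_accent` is hypothetical. Letters without a canonical decomposition (for example ł or ẚ) pass through unchanged, so the result will not match the Perl tables in every case.

```python
import unicodedata


def de_accent(text: str) -> str:
    """Strip accents by decomposing characters and dropping combining marks."""
    # NFD splits a precomposed letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero canonical combining class for such marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # NFC recomposes anything that NFD split but that carried no accent
    # (e.g. Hangul syllables).
    return unicodedata.normalize("NFC", stripped)
```

Like the Perl tables, this sketch maps й to и and ё to е, because both decompose into a base letter plus a combining mark.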
uroman/bin/string-distance.pl
ADDED
@@ -0,0 +1,99 @@
1 |
+
#!/usr/bin/perl -w
|
2 |
+
|
3 |
+
# Author: Ulf Hermjakob
|
4 |
+
# Release date: October 13, 2019
|
5 |
+
|
6 |
+
# Usage: string-distance.pl {-lc1 <language-code>} {-lc2 <language-code>} < STDIN > STDOUT
|
7 |
+
# Example: string-distance.pl -lc1 rus -lc2 ukr < STDIN > STDOUT
|
8 |
+
# Example: string-distance.pl < ../test/string-similarity-test-input.txt
|
9 |
+
# Input format: two strings per line (tab-separated, in Latin script)
|
10 |
+
# Strings in non-Latin scripts should first be romanized. (Recommended script: uroman.pl)
|
11 |
+
# Output format: repetition of the two input strings, plus the string distance between them (tab-separated).
|
12 |
+
# Additional output meta info lines at the top are marked with an initial #.
|
13 |
+
#
|
14 |
+
# The script uses data from a string-distance-cost-rules file that lists costs,
|
15 |
+
# where the default cost is "1" with lower costs for differences in vowels,
|
16 |
+
# duplicate consonants, "f" vs. "ph" etc.
|
17 |
+
# Cost rules can be language-specific and context-sensitive.
|
18 |
+
|
19 |
+
$|=1;
|
20 |
+
|
21 |
+
use FindBin;
|
22 |
+
use Cwd "abs_path";
|
23 |
+
use File::Basename qw(dirname);
|
24 |
+
use File::Spec;
|
25 |
+
|
26 |
+
my $bin_dir = abs_path(dirname($0));
|
27 |
+
my $root_dir = File::Spec->catfile($bin_dir, File::Spec->updir());
|
28 |
+
my $data_dir = File::Spec->catfile($root_dir, "data");
|
29 |
+
my $lib_dir = File::Spec->catfile($root_dir, "lib");
|
30 |
+
|
31 |
+
use lib "$FindBin::Bin/../lib";
|
32 |
+
use List::Util qw(min max);
|
33 |
+
use NLP::utilities;
|
34 |
+
use NLP::stringDistance;
|
35 |
+
$util = NLP::utilities;
|
36 |
+
$sd = NLP::stringDistance;
|
37 |
+
$verbose = 0;
|
38 |
+
$separator = "\t";
|
39 |
+
|
40 |
+
$cost_rule_filename = File::Spec->catfile($data_dir, "string-distance-cost-rules.txt");
|
41 |
+
|
42 |
+
$lang_code1 = "eng";
|
43 |
+
$lang_code2 = "eng";
|
44 |
+
%ht = ();
|
45 |
+
|
46 |
+
while (@ARGV) {
|
47 |
+
$arg = shift @ARGV;
|
48 |
+
if ($arg =~ /^-+lc1$/) {
|
49 |
+
$lang_code_candidate = shift @ARGV;
|
50 |
+
$lang_code1 = $lang_code_candidate if $lang_code_candidate =~ /^[a-z]{3,3}$/;
|
51 |
+
} elsif ($arg =~ /^-+lc2$/) {
|
52 |
+
$lang_code_candidate = shift @ARGV;
|
53 |
+
$lang_code2 = $lang_code_candidate if $lang_code_candidate =~ /^[a-z]{3,3}$/;
|
54 |
+
} elsif ($arg =~ /^-+(v|verbose)$/) {
|
55 |
+
$verbose = shift @ARGV;
|
56 |
+
} else {
|
57 |
+
print STDERR "Ignoring unrecognized arg $arg\n";
|
58 |
+
}
|
59 |
+
}
|
60 |
+
|
61 |
+
$sd->load_string_distance_data($cost_rule_filename, *ht, $verbose);
|
62 |
+
print STDERR "Loaded resources.\n" if $verbose;
|
63 |
+
|
64 |
+
my $chart_id = 0;
|
65 |
+
my $line_number = 0;
|
66 |
+
print "# Lang-code-1: $lang_code1 Lang-code-2: $lang_code2\n";
|
67 |
+
while (<>) {
|
68 |
+
$line_number++;
|
69 |
+
if ($verbose) {
|
70 |
+
if ($line_number =~ /000$/) {
|
71 |
+
if ($line_number =~ /0000$/) {
|
72 |
+
print STDERR $line_number;
|
73 |
+
} else {
|
74 |
+
print STDERR ".";
|
75 |
+
}
|
76 |
+
}
|
77 |
+
}
|
78 |
+
my $line = $_;
|
79 |
+
$line =~ s/^\xEF\xBB\xBF//;
|
80 |
+
next if $line =~ /^\s*(\#.*)?$/;
|
81 |
+
my $s1;
|
82 |
+
my $s2;
|
83 |
+
if (($s1, $s2) = ($line =~ /^("(?:\\"|[^"])*"|\S+)$separator("(?:\\"|[^"])*"|\S+)\s*$/)) {
|
84 |
+
$s1 = $util->dequote_string($s1);
|
85 |
+
$s2 = $util->dequote_string($s2);
|
86 |
+
} elsif ($line =~ /^\s*(#.*)$/) {
|
87 |
+
} else {
|
88 |
+
print STDERR "Could not process line $line_number: $line" if $verbose;
|
89 |
+
print "\n";
|
90 |
+
next;
|
91 |
+
}
|
92 |
+
|
93 |
+
$cost = $sd->quick_romanized_string_distance_by_chart($s1, $s2, *ht, "", $lang_code1, $lang_code2);
|
94 |
+
print "$s1\t$s2\t$cost\n";
|
95 |
+
}
|
96 |
+
print STDERR "\n" if $verbose;
|
97 |
+
|
98 |
+
exit 0;
|
99 |
+
|
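The header comment of string-distance.pl describes a default cost of 1 with lower costs for vowel differences. The following Python sketch illustrates that idea with a weighted edit distance. It is an assumption-laden illustration, not the algorithm in NLP::stringDistance: the 0.5 vowel cost, the vowel set, and the function names are all hypothetical, and the real rules come from string-distance-cost-rules.txt.

```python
VOWELS = set("aeiou")  # hypothetical vowel set for this illustration


def substitution_cost(a: str, b: str) -> float:
    """Cost of replacing character a with b (values are illustrative)."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5  # assumed reduced cost for vowel-vowel differences
    return 1.0      # default cost


def weighted_distance(s1: str, s2: str) -> float:
    """Edit distance with unit insert/delete cost and weighted substitution."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, 1):
        cur = [float(i)]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1.0,                       # delete a
                           cur[j - 1] + 1.0,                    # insert b
                           prev[j - 1] + substitution_cost(a, b)))
        prev = cur
    return prev[-1]
```

With these assumed costs, two strings that differ only in one vowel end up closer than two that differ in one consonant.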
uroman/bin/uroman-quick.pl
ADDED
@@ -0,0 +1,58 @@
1 |
+
#!/usr/bin/perl -w
|
2 |
+
|
3 |
+
# uroman Nov. 12, 2015 - July 25, 2016
|
4 |
+
# version v0.7
|
5 |
+
# Author: Ulf Hermjakob
|
6 |
+
|
7 |
+
# Usage: uroman-quick.pl {-l [tur|uig|ukr|yid]} < STDIN
|
8 |
+
# currently only for Arabic script languages, incl. Uyghur
|
9 |
+
|
10 |
+
$|=1;
|
11 |
+
|
12 |
+
use FindBin;
|
13 |
+
use Cwd "abs_path";
|
14 |
+
use File::Basename qw(dirname);
|
15 |
+
use File::Spec;
|
16 |
+
|
17 |
+
my $bin_dir = abs_path(dirname($0));
|
18 |
+
my $root_dir = File::Spec->catfile($bin_dir, File::Spec->updir());
|
19 |
+
my $data_dir = File::Spec->catfile($root_dir, "data");
|
20 |
+
my $lib_dir = File::Spec->catfile($root_dir, "lib");
|
21 |
+
|
22 |
+
use lib "$FindBin::Bin/../lib";
|
23 |
+
use NLP::Romanizer;
|
24 |
+
use NLP::UTF8;
|
25 |
+
$romanizer = NLP::Romanizer;
|
26 |
+
%ht = ();
|
27 |
+
$lang_code = "";
|
28 |
+
|
29 |
+
while (@ARGV) {
|
30 |
+
$arg = shift @ARGV;
|
31 |
+
if ($arg =~ /^-+(l|lc|lang-code)$/) {
|
32 |
+
$lang_code = lc (shift @ARGV || "")
|
33 |
+
} else {
|
34 |
+
print STDERR "Ignoring unrecognized arg $arg\n";
|
35 |
+
}
|
36 |
+
}
|
37 |
+
|
38 |
+
$romanization_table_arabic_block_filename = File::Spec->catfile($data_dir, "romanization-table-arabic-block.txt");
|
39 |
+
$romanization_table_filename = File::Spec->catfile($data_dir, "romanization-table.txt");
|
40 |
+
|
41 |
+
$romanizer->load_romanization_table(*ht, $romanization_table_arabic_block_filename);
|
42 |
+
$romanizer->load_romanization_table(*ht, $romanization_table_filename);
|
43 |
+
|
44 |
+
$line_number = 0;
|
45 |
+
while (<>) {
|
46 |
+
$line_number++;
|
47 |
+
my $line = $_;
|
48 |
+
print $romanizer->quick_romanize($line, $lang_code, *ht) . "\n";
|
49 |
+
if ($line_number =~ /0000$/) {
|
50 |
+
print STDERR $line_number;
|
51 |
+
} elsif ($line_number =~ /000$/) {
|
52 |
+
print STDERR ".";
|
53 |
+
}
|
54 |
+
}
|
55 |
+
print STDERR "\n";
|
56 |
+
|
57 |
+
exit 0;
|
58 |
+
|
uroman/bin/uroman-tsv.sh
ADDED
@@ -0,0 +1,28 @@
1 |
+
#!/usr/bin/env bash
|
2 |
+
# Created by Thamme Gowda on June 17, 2019
|
3 |
+
|
4 |
+
DIR=$(dirname "${BASH_SOURCE[0]}") # get the directory name
|
5 |
+
# DIR=$(realpath "${DIR}") # resolve its full path if need be
|
6 |
+
|
7 |
+
if [[ $# -lt 1 || $# -gt 2 ]]; then
|
8 |
+
>&2 echo "ERROR: invalid args"
|
9 |
+
>&2 echo "Usage: <input.tsv> [<output.tsv>]"
|
10 |
+
exit 2
|
11 |
+
fi
|
12 |
+
|
13 |
+
INP=$1
|
14 |
+
OUT=$2
|
15 |
+
|
16 |
+
CMD=$DIR/uroman.pl
|
17 |
+
|
18 |
+
function romanize(){
|
19 |
+
paste <(cut -f1 "$INP") <(cut -f2 "$INP" | "$CMD")
|
20 |
+
}
|
21 |
+
|
22 |
+
if [[ -n $OUT ]]; then
|
23 |
+
romanize > "$OUT"
|
24 |
+
else
|
25 |
+
romanize
|
26 |
+
fi
|
27 |
+
|
28 |
+
|
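uroman-tsv.sh keeps the first TSV column, romanizes the second, and drops any further columns. A rough Python sketch of that column handling follows. The function name `romanize_second_column` and the `romanize` callable are hypothetical stand-ins; the sketch does not call uroman.pl itself.

```python
from typing import Callable, Iterable, Iterator


def romanize_second_column(rows: Iterable[str],
                           romanize: Callable[[str], str]) -> Iterator[str]:
    """Yield 'col1<TAB>romanized col2' for each tab-separated input row."""
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        first = fields[0]
        # Like `cut` without -s, a row with no tab is used whole for both columns.
        second = fields[1] if len(fields) > 1 else fields[0]
        yield f"{first}\t{romanize(second)}"
```

Any string-to-string function can be passed as `romanize` for testing, for example `str.upper`.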
uroman/bin/uroman.pl
ADDED
@@ -0,0 +1,138 @@
1 |
+
#!/usr/bin/perl -w
|
2 |
+
|
3 |
+
# uroman Nov. 12, 2015 - Apr. 23, 2021
|
4 |
+
$version = "v1.2.8";
|
5 |
+
# Author: Ulf Hermjakob
|
6 |
+
|
7 |
+
# Usage: uroman.pl {-l [ara|bel|bul|deu|ell|eng|fas|grc|heb|kaz|kir|lav|lit|mkd|mkd2|oss|pnt|rus|srp|srp2|tur|uig|ukr|yid]} {--chart|--offset-mapping} {--no-cache} {--workset} < STDIN
|
8 |
+
# Example: cat workset.txt | uroman.pl --offset-mapping --workset
|
9 |
+
|
10 |
+
$|=1;
|
11 |
+
|
12 |
+
use FindBin;
|
13 |
+
use Cwd "abs_path";
|
14 |
+
use File::Basename qw(dirname);
|
15 |
+
use File::Spec;
|
16 |
+
|
17 |
+
my $bin_dir = abs_path(dirname($0));
|
18 |
+
my $root_dir = File::Spec->catfile($bin_dir, File::Spec->updir());
|
19 |
+
my $data_dir = File::Spec->catfile($root_dir, "data");
|
20 |
+
my $lib_dir = File::Spec->catfile($root_dir, "lib");
|
21 |
+
|
22 |
+
use lib "$FindBin::Bin/../lib";
|
23 |
+
use NLP::Chinese;
|
24 |
+
use NLP::Romanizer;
|
25 |
+
use NLP::UTF8;
|
26 |
+
use NLP::utilities;
|
27 |
+
use JSON;
|
28 |
+
$chinesePM = NLP::Chinese;
|
29 |
+
$romanizer = NLP::Romanizer;
|
30 |
+
$util = NLP::utilities;
|
31 |
+
%ht = ();
|
32 |
+
%pinyin_ht = ();
|
33 |
+
$lang_code = "";
|
34 |
+
$return_chart_p = 0;
|
35 |
+
$return_offset_mappings_p = 0;
|
36 |
+
$workset_p = 0;
|
37 |
+
$cache_rom_tokens_p = 1;
|
38 |
+
|
39 |
+
$script_data_filename = File::Spec->catfile($data_dir, "Scripts.txt");
|
40 |
+
$unicode_data_overwrite_filename = File::Spec->catfile($data_dir, "UnicodeDataOverwrite.txt");
|
41 |
+
$unicode_data_filename = File::Spec->catfile($data_dir, "UnicodeData.txt");
|
42 |
+
$romanization_table_filename = File::Spec->catfile($data_dir, "romanization-table.txt");
|
43 |
+
$chinese_tonal_pinyin_filename = File::Spec->catfile($data_dir, "Chinese_to_Pinyin.txt");
|
44 |
+
|
45 |
+
while (@ARGV) {
|
46 |
+
$arg = shift @ARGV;
|
47 |
+
if ($arg =~ /^-+(l|lc|lang-code)$/) {
|
48 |
+
$lang_code = lc (shift @ARGV || "")
|
49 |
+
} elsif ($arg =~ /^-+chart$/i) {
|
50 |
+
$return_chart_p = 1;
|
51 |
+
} elsif ($arg =~ /^-+workset$/i) {
|
52 |
+
$workset_p = 1;
|
53 |
+
} elsif ($arg =~ /^-+offset[-_]*map/i) {
|
54 |
+
$return_offset_mappings_p = 1;
|
55 |
+
} elsif ($arg =~ /^-+unicode[-_]?data/i) {
|
56 |
+
$filename = shift @ARGV;
|
57 |
+
if (-r $filename) {
|
58 |
+
$unicode_data_filename = $filename;
|
59 |
+
} else {
|
60 |
+
print STDERR "Ignoring invalid UnicodeData filename $filename\n";
|
61 |
+
}
|
62 |
+
} elsif ($arg =~ /^-+(no-tok-cach|no-cach)/i) {
|
63 |
+
$cache_rom_tokens_p = 0;
|
64 |
+
} else {
|
65 |
+
print STDERR "Ignoring unrecognized arg $arg\n";
|
66 |
+
}
|
67 |
+
}
|
68 |
+
|
69 |
+
$romanizer->load_script_data(*ht, $script_data_filename);
|
70 |
+
$romanizer->load_unicode_data(*ht, $unicode_data_filename);
|
71 |
+
$romanizer->load_unicode_overwrite_romanization(*ht, $unicode_data_overwrite_filename);
|
72 |
+
$romanizer->load_romanization_table(*ht, $romanization_table_filename);
|
73 |
+
$chinese_to_pinyin_not_yet_loaded_p = 1;
|
74 |
+
$current_date = $util->datetime("dateTtime");
|
75 |
+
$lang_code_clause = ($lang_code) ? " \"lang-code\":\"$lang_code\",\n" : "";
|
76 |
+
|
77 |
+
print "{\n \"romanizer\":\"uroman $version (Ulf Hermjakob, USC/ISI)\",\n \"date\":\"$current_date\",\n$lang_code_clause \"romanization\": [\n" if $return_chart_p;
|
78 |
+
my $line_number = 0;
|
79 |
+
my $chart_result = "";
|
80 |
+
while (<>) {
|
81 |
+
$line_number++;
|
82 |
+
my $line = $_;
|
83 |
+
my $snt_id = "";
|
84 |
+
if ($workset_p) {
|
85 |
+
next if $line =~ /^#/;
|
86 |
+
if (($i_value, $s_value) = ($line =~ /^(\S+\.\d+)\s(.*)$/)) {
|
87 |
+
$snt_id = $i_value;
|
88 |
+
$line = "$s_value\n";
|
89 |
+
} else {
|
90 |
+
next;
|
91 |
+
}
|
92 |
+
}
|
93 |
+
if ($chinese_to_pinyin_not_yet_loaded_p && $chinesePM->string_contains_utf8_cjk_unified_ideograph_p($line)) {
|
94 |
+
$chinesePM->read_chinese_tonal_pinyin_files(*pinyin_ht, $chinese_tonal_pinyin_filename);
|
95 |
+
$chinese_to_pinyin_not_yet_loaded_p = 0;
|
96 |
+
}
|
97 |
+
if ($return_chart_p) {
|
98 |
+
print $chart_result;
|
99 |
+
*chart_ht = $romanizer->romanize($line, $lang_code, "", *ht, *pinyin_ht, 0, "return chart", $line_number);
|
100 |
+
$chart_result = $romanizer->chart_to_json_romanization_elements(0, $chart_ht{N_CHARS}, *chart_ht, $line_number);
|
101 |
+
} elsif ($return_offset_mappings_p) {
|
102 |
+
($best_romanization, $offset_mappings) = $romanizer->romanize($line, $lang_code, "", *ht, *pinyin_ht, 0, "return offset mappings", $line_number, 0);
|
103 |
+
print "::snt-id $snt_id\n" if $workset_p;
|
104 |
+
print "::orig $line";
|
105 |
+
print "::rom $best_romanization\n";
|
106 |
+
print "::align $offset_mappings\n\n";
|
107 |
+
} elsif ($cache_rom_tokens_p) {
|
108 |
+
print $romanizer->romanize_by_token_with_caching($line, $lang_code, "", *ht, *pinyin_ht, 0, "", $line_number) . "\n";
|
109 |
+
} else {
|
110 |
+
print $romanizer->romanize($line, $lang_code, "", *ht, *pinyin_ht, 0, "", $line_number) . "\n";
|
111 |
+
}
|
112 |
+
}
|
113 |
+
$chart_result =~ s/,(\s*)$/$1/;
|
114 |
+
print $chart_result;
|
115 |
+
print " ]\n}\n" if $return_chart_p;
|
116 |
+
|
117 |
+
$dev_test_p = 0;
|
118 |
+
if ($dev_test_p) {
|
119 |
+
$n_suspicious_code_points = 0;
|
120 |
+
$n_instances = 0;
|
121 |
+
foreach $char_name (sort { hex($ht{UTF_NAME_TO_UNICODE}->{$a}) <=> hex($ht{UTF_NAME_TO_UNICODE}->{$b}) }
|
122 |
+
keys %{$ht{SUSPICIOUS_ROMANIZATION}}) {
|
123 |
+
$unicode_value = $ht{UTF_NAME_TO_UNICODE}->{$char_name};
|
124 |
+
$utf8_string = $ht{UTF_NAME_TO_CODE}->{$char_name};
|
125 |
+
foreach $romanization (sort keys %{$ht{SUSPICIOUS_ROMANIZATION}->{$char_name}}) {
|
126 |
+
$count = $ht{SUSPICIOUS_ROMANIZATION}->{$char_name}->{$romanization};
|
127 |
+
$s = ($count == 1) ? "" : "s";
|
128 |
+
print STDERR "*** Suspiciously lengthy romanization:\n" unless $n_suspicious_code_points;
|
129 |
+
print STDERR "::s $utf8_string ::t $romanization ::comment $char_name (U+$unicode_value)\n";
|
130 |
+
$n_suspicious_code_points++;
|
131 |
+
$n_instances += $count;
|
132 |
+
}
|
133 |
+
}
|
134 |
+
print STDERR " *** Total of $n_suspicious_code_points suspicious code points ($n_instances instance$s)\n" if $n_suspicious_code_points;
|
135 |
+
}
|
136 |
+
|
137 |
+
exit 0;
|
138 |
+
|
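With --workset, uroman.pl skips lines starting with # and expects every other line to begin with a sentence ID of the form `<non-space>.<digits>`, followed by one whitespace character and the text. Lines that do not match are skipped. A small Python sketch of that line handling, reusing the regular expression from the script (the function name `parse_workset_line` is hypothetical):

```python
import re
from typing import Optional, Tuple

# Same pattern as in uroman.pl: ID ending in ".<digits>", one whitespace, text.
WORKSET_LINE = re.compile(r"^(\S+\.\d+)\s(.*)$")


def parse_workset_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (sentence_id, text), or None for comments and malformed lines."""
    if line.startswith("#"):
        return None
    match = WORKSET_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return match.group(1), match.group(2)
```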
uroman/bin/uroman.py
ADDED
The diff for this file is too large to render.
See raw diff
|
|
uroman/data/Chinese_to_Pinyin.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
uroman/data/NumProps.jsonl
ADDED
The diff for this file is too large to render.
See raw diff
|
|
uroman/data/Scripts.txt
ADDED
@@ -0,0 +1,174 @@
1 |
+
::script-name Adlam
|
2 |
+
::script-name Aegean
|
3 |
+
::script-name Ahom
|
4 |
+
::script-name Anatolian Hieroglyph
|
5 |
+
::script-name Arabic ::direction right-to-left
|
6 |
+
::script-name Arabic-Indic
|
7 |
+
::script-name Armenian
|
8 |
+
::script-name Avestan
|
9 |
+
::script-name Balinese
|
10 |
+
::script-name Bamum
|
11 |
+
::script-name Bassa Vah
|
12 |
+
::script-name Batak
|
13 |
+
::script-name Bengali ::abugida-default-vowel a
|
14 |
+
::script-name Bhaiksuki
|
15 |
+
::script-name Bopomofo ::language Chinese
|
16 |
+
::script-name Brahmi ::abugida-default-vowel a
|
17 |
+
::script-name Braille
|
18 |
+
::script-name Buginese
|
19 |
+
::script-name Buhid
|
20 |
+
::script-name Canadian Syllabics
|
21 |
+
::script-name Carian
|
22 |
+
::script-name Caucasian Albanian
|
23 |
+
::script-name Chakma
|
24 |
+
::script-name Cham
|
25 |
+
::script-name Cherokee
|
26 |
+
::script-name Chorasmian
|
27 |
+
::script-name Coptic
|
28 |
+
::script-name Cuneiform
|
29 |
+
::script-name Cypro-Minoan
|
30 |
+
::script-name Cypriot
|
31 |
+
::script-name Cyrillic
|
32 |
+
::script-name CJK ::alt-script-name Chinese, Kanji ::language Chinese, Japanese, Korean, Mandarin
|
33 |
+
::script-name Deseret
|
34 |
+
::script-name Devanagari ::abugida-default-vowel a
|
35 |
+
::script-name Dives Akuru
|
36 |
+
::script-name Dogra
|
37 |
+
::script-name Duployan
|
38 |
+
::script-name Egyptian Hieroglyph ::alt-script-name Egyptian
|
39 |
+
::script-name Elbasan
|
40 |
+
::script-name Elymaic
|
41 |
+
::script-name Ethiopic
|
42 |
+
::script-name Extended Arabic-Indic
|
43 |
+
::script-name Georgian
|
44 |
+
::script-name Glagolitic
|
45 |
+
::script-name Gothic
|
46 |
+
::script-name Grantha
|
47 |
+
::script-name Greek
|
48 |
+
::script-name Greek Acrophonic
|
49 |
+
::script-name Gujarati ::abugida-default-vowel a
|
50 |
+
::script-name Gunjala Gondi
|
51 |
+
::script-name Gurmukhi ::abugida-default-vowel a
|
52 |
+
::script-name Hangul ::language Korean
|
53 |
+
::script-name Hangzhou
|
54 |
+
::script-name Hanifi Rohingya
|
55 |
+
::script-name Hanunoo
|
56 |
+
::script-name Hatran
|
57 |
+
::script-name Hebrew ::direction right-to-left
|
58 |
+
::script-name Hiragana ::language Japanese
|
59 |
+
::script-name Indic Siyaq
|
60 |
+
::script-name Imperial Aramaic
|
61 |
+
::script-name Inscriptional Pahlavi
|
62 |
+
::script-name Inscriptional Parthian
|
63 |
+
::script-name Javanese
|
64 |
+
::script-name Kaithi
|
65 |
+
::script-name Kannada ::abugida-default-vowel a
|
66 |
+
::script-name Katakana ::language Japanese
|
67 |
+
::script-name Kawi
|
68 |
+
::script-name Kayah Li
|
69 |
+
::script-name Kharoshthi
|
70 |
+
::script-name Khitan Small Script
|
71 |
+
::script-name Khmer ::abugida-default-vowel a, o
|
72 |
+
::script-name Khojki
|
73 |
+
::script-name Khudawadi
|
74 |
+
::script-name Klingon
|
75 |
+
::script-name Lao
|
76 |
+
::script-name Lepcha
|
77 |
+
::script-name Latin
|
78 |
+
::script-name Limbu
|
79 |
+
::script-name Linear A
|
80 |
+
::script-name Linear B
|
81 |
+
::script-name Lisu
|
82 |
+
::script-name Lycian
|
83 |
+
::script-name Lydian
|
84 |
+
::script-name Mahajani
|
85 |
+
::script-name Makasar
|
86 |
+
::script-name Malayalam ::abugida-default-vowel a
|
87 |
+
::script-name Mandaic
|
88 |
+
::script-name Manichaean
|
89 |
+
::script-name Marchen
|
90 |
+
::script-name Masaram Gondi
|
91 |
+
::script-name Mayan
|
92 |
+
::script-name Medefaidrin
|
93 |
+
::script-name Meetei Mayek
|
94 |
+
::script-name Mende Kikakui
|
95 |
+
::script-name Meroitic Cursive
|
96 |
+
::script-name Meroitic Hieroglyphic
|
97 |
+
::script-name Miao
|
98 |
+
::script-name Modi ::abugida-default-vowel a
|
99 |
+
::script-name Mongolian
|
100 |
+
::script-name Mro
|
101 |
+
::script-name Multani
|
102 |
+
::script-name Myanmar ::alt-script-name Burmese ::abugida-default-vowel a
|
103 |
+
::script-name Nabataean
|
104 |
+
::script-name Nag Mundari
|
105 |
+
::script-name Nandinagari
|
106 |
+
::script-name New Tai Lue
|
107 |
+
::script-name Newa
|
108 |
+
::script-name Nko ::direction right-to-left
|
109 |
+
::script-name North Indic
|
110 |
+
::script-name Nushu
|
111 |
+
::script-name Nyiakeng Puachue Hmong
|
112 |
+
::script-name Ogham
|
113 |
+
::script-name Ol Chiki
|
114 |
+
::script-name Old Hungarian
|
115 |
+
::script-name Old Italic
|
116 |
+
::script-name Old Permic
|
117 |
+
::script-name Old Persian
|
118 |
+
::script-name Old North Arabian
|
119 |
+
::script-name Old Sogdian
|
120 |
+
::script-name Old South Arabian
|
121 |
+
::script-name Old Turkic
|
122 |
+
::script-name Old Uyghur
|
123 |
+
::script-name Oriya ::alt-script-name Odia ::abugida-default-vowel a
|
124 |
+
::script-name Osage
|
125 |
+
::script-name Osmanya
|
126 |
+
::script-name Ottoman Siyaq
|
127 |
+
::script-name Pahawh Hmong
|
128 |
+
::script-name Palmyrene
|
129 |
+
::script-name Pau Cin Hau
|
130 |
+
::script-name Phags-Pa
|
131 |
+
::script-name Phaistos Disc
|
132 |
+
::script-name Phoenician
|
133 |
+
::script-name Psalter Pahlavi
|
134 |
+
::script-name Rejang
|
135 |
+
::script-name Rumi
|
136 |
+
::script-name Runic
|
137 |
+
::script-name Samaritan
|
138 |
+
::script-name Saurashtra
|
139 |
+
::script-name Sharada
|
140 |
+
::script-name Shavian
|
141 |
+
::script-name Siddham
|
142 |
+
::script-name SignWriting
|
143 |
+
::script-name Sinhala ::abugida-default-vowel a
|
144 |
+
::script-name Sogdian
|
145 |
+
::script-name Sora Sompeng
|
146 |
+
::script-name Soyombo
|
147 |
+
::script-name Sundanese ::abugida-default-vowel a
|
148 |
+
::script-name Syloti Nagri
|
149 |
+
::script-name Syriac
|
150 |
+
::script-name Tagalog
|
151 |
+
::script-name Tagbanwa
|
152 |
+
::script-name Tai Le
|
153 |
+
::script-name Tai Tham
|
154 |
+
::script-name Tai Viet
|
155 |
+
::script-name Takri
|
156 |
+
::script-name Tamil ::abugida-default-vowel a
|
157 |
+
::script-name Tangsa
|
158 |
+
::script-name Tangut
|
159 |
+
::script-name Telugu ::abugida-default-vowel a
|
160 |
+
::script-name Thaana ::direction right-to-left
|
161 |
+
::script-name Thai
|
162 |
+
::script-name Tibetan ::abugida-default-vowel a
|
163 |
+
::script-name Tifinagh
|
164 |
+
::script-name Tirhuta
|
165 |
+
::script-name Toto
|
166 |
+
::script-name Ugaritic
|
167 |
+
::script-name Vai
|
168 |
+
::script-name Vedic
|
169 |
+
::script-name Vithkuqi
|
170 |
+
::script-name Wancho
|
171 |
+
::script-name Warang Citi
|
172 |
+
::script-name Yezidi
|
173 |
+
::script-name Yi
|
174 |
+
::script-name Zanabazar Square
|
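Each line of Scripts.txt is a sequence of `::slot value` pairs. The Python sketch below shows one way to read such a line into a dictionary. It is an illustration, not the loader in NLP::Romanizer; the function name `parse_slots` is hypothetical, and it assumes that values never contain `::`, which holds for the lines shown here.

```python
from typing import Dict


def parse_slots(line: str) -> Dict[str, str]:
    """Split a '::slot value ::slot value' line into a slot -> value dict."""
    slots: Dict[str, str] = {}
    # The text before the first '::' is empty, so skip it.
    for chunk in line.strip().split("::")[1:]:
        key, _, value = chunk.strip().partition(" ")
        slots[key] = value.strip()
    return slots
```

For example, the Khmer entry yields the slots `script-name` and `abugida-default-vowel`, with the value `a, o` kept as a single string.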
uroman/data/UnicodeData.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
uroman/data/UnicodeDataOverwrite.txt
ADDED
@@ -0,0 +1,443 @@
1 |
+
## UnicodeDataOverwrite.txt
|
2 |
+
::u 00A0 ::r " " ::comment no-break space
|
3 |
+
::u 01BF ::r w ::comment ƿ Latin Character Wynn (Old English)
|
4 |
+
::u 0294 ::r ' ::comment glottal stop
|
5 |
+
::u 0295 ::r ' ::comment ʕ voiced pharyngeal fricative
|
6 |
+
::u 0305 ::r "" ::comment ̅ Combining overline
|
7 |
+
::u 0306 ::r "" ::comment ̆ Combining breve
|
8 |
+
::u 0307 ::r "" ::comment ̇ Combining dot above
|
9 |
+
::u 030A ::r "" ::comment ̊ Combining ring above
|
10 |
+
::u 030C ::r "" ::comment ̌ Combining caron
|
11 |
+
::u 0311 ::r "" ::comment ̑ Combining inverted breve
|
12 |
+
::u 031D ::r "" ::comment ̝ Combining up tack below
|
13 |
+
::u 031E ::r "" ::comment ̞ Combining down tack below
|
14 |
+
::u 031F ::r "" ::comment ̟ Combining plus sign below
|
15 |
+
::u 0323 ::r "" ::comment ̣ Combining dot below
|
16 |
+
::u 0325 ::r "" ::comment ̥ Combining ring below
|
17 |
+
::u 0329 ::r "" ::comment ̩ Combining vertical line below
|
18 |
+
::u 032A ::r "" ::comment ̪ Combining bridge below
|
19 |
+
::u 032F ::r "" ::comment ̯ Combining inverted breve below
|
20 |
+
::u 0342 ::r "" ::comment ͂ Combining Greek perispomeni (circumflex accent)
|
21 |
+
::u 0343 ::r "" ::comment ̓ Combining Greek koronis
|
22 |
+
::u 0361 ::r "" ::comment Combining double inverted breve
|
23 |
+
::u 0384 ::r "" ::comment ΄ Greek tonos
|
24 |
+
::u 0482 ::r 1000· ::comment ҂ Cyrillic thousands sign
|
25 |
+
::u 0483 ::r "" ::comment ҃ Combining Cyrillic Titlo ::annotation titlo
|
26 |
+
::u 0484 ::r "" ::comment ҄ Combining Cyrillic Palatalization ::annotation palatalization
|
27 |
+
::u 055B ::r "" ::comment ՛ Armenian emphasis mark
|
28 |
+
::u 055F ::r "" ::comment ՟ Armenian abbreviation mark ::annotation abbreviation
|
29 |
+
|
30 |
+
::u 0901 ::r +m ::comment Devanagari sign candrabindu
|
31 |
+
::u 0902 ::r +m ::comment Devanagari sign anusvara
|
32 |
+
::u 0903 ::r +h ::comment Devanagari sign visarga
|
33 |
+
::u 093D ::r ' ::comment Devanagari sign avagraha
|
34 |
+
::u 0950 ::r om ::comment ॐ Devanagari om symbol
|
35 |
+
::u 0951 ::r "" ::comment ॑ Devanagari stress sign "udatta"
|
36 |
+
::u 0952 ::r "" ::comment ॒ Devanagari stress sign "anudatta"
|
37 |
+
::u 0981 ::r +n ::comment Bengali sign candrabindu ("chôndrôbindu")
|
38 |
+
::u 0982 ::r +ng ::comment Bengali sign anusvara ("ônushar")
|
39 |
+
::u 0983 ::r +h ::comment Bengali sign visarga ("bishôrgô")
|
40 |
+
::u 099A ::r ch ::comment instead of Bengali C(A)
|
41 |
+
::u 099B ::r chh ::comment instead of Bengali CC(A)
|
42 |
+
::u 0A02 ::r +m ::comment Gurmukhi sign bindi
|
43 |
+
::u 0A70 ::r +m ::comment Gurmukhi tippi
|
44 |
+
# ::u 0A71 ::r "" ::comment Gurmukhi addak
|
45 |
+
::u 0A72 ::r "" ::comment Gurmukhi iri
|
46 |
+
::u 0A73 ::r "" ::comment Gurmukhi ura
|
47 |
+
::u 0B01 ::r +m ::comment Oriya sign candrabindu
|
48 |
+
::u 0B03 ::r +h ::comment Oriya sign visarga
|
49 |
+
::u 0B5F ::r ya ::comment ୟ Oriya letter yya
|
50 |
+
::u 0B82 ::r +m ::comment Tamil sign anusvara (not to be used?)
|
51 |
+
::u 0B83 ::r +h ::comment Tamil sign visarga ("āytam")
|
52 |
+
::u 0B9F ::r t ::comment instead of Tamil TT(A)
|
53 |
+
::u 0BA3 ::r n ::comment instead of Tamil NN(A)
|
54 |
+
::u 0BA9 ::r n ::comment instead of Tamil NNN(A)
|
55 |
+
::u 0BB1 ::r r ::comment instead of Tamil RR(A)
|
56 |
+
::u 0BB3 ::r l ::comment instead of Tamil LL(A)
|
57 |
+
::u 0BB4 ::r l ::comment instead of Tamil LLL(A)
|
58 |
+
::u 0C03 ::r +h ::comment ః Telugu sign visarga
|
59 |
+
::u 0C83 ::r +h ::comment Kannada sign visarga
|
60 |
+
::u 0D02 ::r +m ::comment Malayalam sign anusvara
|
61 |
+
::u 0D03 ::r +h ::comment Malayalam sign visarga
|
62 |
+
::u 0D82 ::r +n ::comment Sinhala sign anusvaraya
|
63 |
+
::u 0DA4 ::r ny ::comment Sinhala ඤ
|
64 |
+
::u 0DA5 ::r gn ::comment Sinhala ඥ
|
65 |
+
::u 0DCA ::r "" ::comment Sinhala sign al-lakuna (virama = no vowel)
|
66 |
+
::u 0DCF ::r aa ::comment Sinhala ා
|
67 |
+
::u 0DD0 ::r ae ::comment Sinhala ැ
|
68 |
+
::u 0DD1 ::r ae ::comment Sinhala ෑ
|
69 |
+
::u 0DD2 ::r i ::comment Sinhala ි
|
70 |
+
::u 0DD3 ::r ii ::comment Sinhala ී
|
71 |
+
::u 0DD4 ::r u ::comment Sinhala ු
|
72 |
+
::u 0DD6 ::r uu ::comment Sinhala ූ
|
73 |
+
::u 0DD8 ::r r ::comment Sinhala ෘ
|
74 |
+
::u 0DD9 ::r e ::comment Sinhala ෙ
|
75 |
+
::u 0DDA ::r ee ::comment Sinhala ේ
|
76 |
+
::u 0DDB ::r ai ::comment Sinhala ෛ
|
77 |
+
::u 0DDC ::r o ::comment Sinhala ො
|
78 |
+
::u 0DDD ::r oo ::comment Sinhala ෝ
|
79 |
+
::u 0DDE ::r au ::comment Sinhala ෞ
|
80 |
+
::u 0DDF ::r aa ::comment Sinhala ෟ
|
81 |
+
::u 0DF2 ::r rr ::comment Sinhala ෲ
|
82 |
+
|
83 |
+
::u 0E02 ::r k ::comment Thai character KHO KHAI
|
84 |
+
::u 0E03 ::r k ::comment Thai character KHO KHUAT
|
85 |
+
::u 0E04 ::r k ::comment Thai character KHO KHWAI
|
86 |
+
::u 0E05 ::r k ::comment Thai character KHO KHON
|
87 |
+
::u 0E06 ::r k ::comment Thai character KHO RAKHANG
|
88 |
+
::u 0E10 ::r t ::comment Thai character THO THAN
|
89 |
+
::u 0E11 ::r t ::comment Thai character THO NANGMONTHO
|
90 |
+
::u 0E12 ::r t ::comment Thai character THO PHUTHAO
|
91 |
+
::u 0E16 ::r t ::comment Thai character THO THUNG
|
92 |
+
::u 0E17 ::r t ::comment Thai character THO THAHAN
|
93 |
+
::u 0E18 ::r t ::comment Thai character THO THONG
|
94 |
+
::u 0E1C ::r p ::comment Thai character PHO PHUNG
|
95 |
+
::u 0E1E ::r p ::comment Thai character PHO PHAN
|
96 |
+
::u 0E20 ::r p ::comment Thai character PHO SAMPHAO
|
97 |
+
::u 0E2D ::r o ::comment Thai character O ANG
|
98 |
+
::u 0E2F ::r ... ::comment ฯ Thai character PAIYANNOI (ellipsis, abbreviation)
|
99 |
+
::u 0E31 ::r a ::comment Thai character MAI HAN-AKAT
|
100 |
+
::u 0E3A ::r "" ::comment Thai character PHINTHU (Pali virama)
|
101 |
+
::u 0E40 ::r e ::syllable-info written-pre-consonant-spoken-post-consonant ::comment Thai character SARA E
|
102 |
+
::u 0E41 ::r ae ::syllable-info written-pre-consonant-spoken-post-consonant ::comment Thai character SARA AE
|
103 |
+
::u 0E42 ::r o ::syllable-info written-pre-consonant-spoken-post-consonant ::comment Thai character SARA O
|
104 |
+
::u 0E43 ::r ai ::syllable-info written-pre-consonant-spoken-post-consonant ::comment Thai character SARA AI MAIMUAN
|
105 |
+
::u 0E44 ::r ai ::syllable-info written-pre-consonant-spoken-post-consonant ::comment Thai character SARA AI MAIMALAI
|
106 |
+
::u 0E45 ::r "" ::comment Thai character LAKKHANGYAO vowel lengthener
|
107 |
+
::u 0E47 ::r o ::comment Thai character MAITAIKHU vowel shortener
|
108 |
+
::u 0E48 ::r "" ::tone-mark non-standard ::comment Thai tone mark MAI EK
|
109 |
+
::u 0E49 ::r "" ::tone-mark standard ::comment Thai tone mark MAI THO
|
110 |
+
::u 0E4A ::r "" ::tone-mark high ::comment Thai tone mark MAI TRI
|
111 |
+
::u 0E4B ::r "" ::tone-mark rising ::comment Thai tone mark MAI CHATTAWA
|
112 |
+
::u 0E4C ::r "" ::comment Thai character THANTHAKHAT cancellation mark (cf. virama)
|
113 |
+
::u 0E4D ::r +m ::comment ํ Thai character NIKHAHIT final nasal (cf. anusvara)
|
114 |
+
::u 0ECC ::r "" ::comment ໌ Lao cancellation mark ::annotation cancellation
|
115 |
+
::u 0F0B ::r · ::comment ་ Tibetan mark intersyllabic tsheg
|
116 |
+
::u 0F0C ::r "" ::comment ༌ Tibetan mark delimiter tsheg bstar
|
117 |
+
::u 0F84 ::r "" ::comment ྄ Tibetan halanta
|
118 |
+
::u 1036 ::r +n ::comment Myanmar sign anusvara ("auk myit")
|
119 |
+
::u 1037 ::r "" ::tone-mark creaky ::comment Myanmar sign dot below
|
120 |
+
::u 1038 ::r "" ::tone-mark high ::comment Myanmar sign visarga
|
121 |
+
|
122 |
+
::u 16A0 ::r f ::comment ᚠ RUNIC LETTER FEHU FEOH FE F
|
123 |
+
::u 16A1 ::r v ::comment ᚡ RUNIC LETTER V
|
124 |
+
::u 16A2 ::r u ::comment ᚢ RUNIC LETTER URUZ UR U
|
125 |
+
::u 16A3 ::r y ::comment ᚣ RUNIC LETTER YR
|
126 |
+
::u 16A4 ::r y ::comment ᚤ RUNIC LETTER Y
|
127 |
+
::u 16A5 ::r w ::comment ᚥ RUNIC LETTER W
|
128 |
+
::u 16A6 ::r th ::comment ᚦ RUNIC LETTER THURISAZ THURS THORN
|
129 |
+
::u 16A7 ::r th ::comment ᚧ RUNIC LETTER ETH
|
130 |
+
::u 16A8 ::r a ::comment ᚨ RUNIC LETTER ANSUZ A
|
131 |
+
::u 16A9 ::r o ::comment ᚩ RUNIC LETTER OS O
|
132 |
+
::u 16AA ::r a ::comment ᚪ RUNIC LETTER AC A
|
133 |
+
::u 16AB ::r ae ::comment ᚫ RUNIC LETTER AESC
|
134 |
+
::u 16AC ::r o ::comment ᚬ RUNIC LETTER LONG-BRANCH-OSS O
|
135 |
+
::u 16AD ::r o ::comment ᚭ RUNIC LETTER SHORT-TWIG-OSS O
|
136 |
+
::u 16AE ::r o ::comment ᚮ RUNIC LETTER O
|
137 |
+
::u 16AF ::r oe ::comment ᚯ RUNIC LETTER OE
|
138 |
+
::u 16B0 ::r on ::comment ᚰ RUNIC LETTER ON
|
139 |
+
::u 16B1 ::r r ::comment ᚱ RUNIC LETTER RAIDO RAD REID R
|
140 |
+
::u 16B2 ::r k ::comment ᚲ RUNIC LETTER KAUNA
|
141 |
+
::u 16B3 ::r c ::comment ᚳ RUNIC LETTER CEN
|
142 |
+
::u 16B4 ::r k ::comment ᚴ RUNIC LETTER KAUN K
|
143 |
+
::u 16B5 ::r g ::comment ᚵ RUNIC LETTER G
|
144 |
+
::u 16B6 ::r ng ::comment ᚶ RUNIC LETTER ENG
|
145 |
+
::u 16B7 ::r g ::comment ᚷ RUNIC LETTER GEBO GYFU G
|
146 |
+
::u 16B8 ::r g ::comment ᚸ RUNIC LETTER GAR
|
147 |
+
::u 16B9 ::r w ::comment ᚹ RUNIC LETTER WUNJO WYNN W
|
148 |
+
::u 16BA ::r h ::comment ᚺ RUNIC LETTER HAGLAZ H
|
149 |
+
::u 16BB ::r h ::comment ᚻ RUNIC LETTER HAEGL H
|
150 |
+
::u 16BC ::r h ::comment ᚼ RUNIC LETTER LONG-BRANCH-HAGALL H
|
151 |
+
::u 16BD ::r h ::comment ᚽ RUNIC LETTER SHORT-TWIG-HAGALL H
|
152 |
+
::u 16BE ::r n ::comment ᚾ RUNIC LETTER NAUDIZ NYD NAUD N
|
153 |
+
::u 16BF ::r n ::comment ᚿ RUNIC LETTER SHORT-TWIG-NAUD N
|
154 |
+
::u 16C0 ::r n ::comment ᛀ RUNIC LETTER DOTTED-N
|
155 |
+
::u 16C1 ::r i ::comment ᛁ RUNIC LETTER ISAZ IS ISS I
|
156 |
+
::u 16C2 ::r e ::comment ᛂ RUNIC LETTER E
|
157 |
+
::u 16C3 ::r j ::comment ᛃ RUNIC LETTER JERAN J
|
158 |
+
::u 16C4 ::r j ::comment ᛄ RUNIC LETTER GER
|
159 |
+
::u 16C5 ::r ae ::comment ᛅ RUNIC LETTER LONG-BRANCH-AR AE
|
160 |
+
::u 16C6 ::r a ::comment ᛆ RUNIC LETTER SHORT-TWIG-AR A
|
161 |
+
::u 16C7 ::r i ::comment ᛇ RUNIC LETTER IWAZ EOH
|
162 |
+
::u 16C8 ::r p ::comment ᛈ RUNIC LETTER PERTHO PEORTH P
|
163 |
+
::u 16C9 ::r z ::comment ᛉ RUNIC LETTER ALGIZ EOLHX
|
164 |
+
::u 16CA ::r s ::comment ᛊ RUNIC LETTER SOWILO S
|
165 |
+
::u 16CB ::r s ::comment ᛋ RUNIC LETTER SIGEL LONG-BRANCH-SOL S
|
166 |
+
::u 16CC ::r s ::comment ᛌ RUNIC LETTER SHORT-TWIG-SOL S
|
167 |
+
::u 16CD ::r c ::comment ᛍ RUNIC LETTER C
|
168 |
+
::u 16CE ::r z ::comment ᛎ RUNIC LETTER Z
|
169 |
+
::u 16CF ::r t ::comment ᛏ RUNIC LETTER TIWAZ TIR TYR T
|
170 |
+
::u 16D0 ::r t ::comment ᛐ RUNIC LETTER SHORT-TWIG-TYR T
|
171 |
+
::u 16D1 ::r d ::comment ᛑ RUNIC LETTER D
|
172 |
+
::u 16D2 ::r b ::comment ᛒ RUNIC LETTER BERKANAN BEORC BJARKAN B
|
173 |
+
::u 16D3 ::r b ::comment ᛓ RUNIC LETTER SHORT-TWIG-BJARKAN B
|
174 |
+
::u 16D4 ::r p ::comment ᛔ RUNIC LETTER DOTTED-P
|
175 |
+
::u 16D5 ::r p ::comment ᛕ RUNIC LETTER OPEN-P
|
176 |
+
::u 16D6 ::r e ::comment ᛖ RUNIC LETTER EHWAZ EH E
|
177 |
+
::u 16D7 ::r m ::comment ᛗ RUNIC LETTER MANNAZ MAN M
|
178 |
+
::u 16D8 ::r m ::comment ᛘ RUNIC LETTER LONG-BRANCH-MADR M
|
179 |
+
::u 16D9 ::r m ::comment ᛙ RUNIC LETTER SHORT-TWIG-MADR M
|
180 |
+
::u 16DA ::r l ::comment ᛚ RUNIC LETTER LAUKAZ LAGU LOGR L
|
181 |
+
::u 16DB ::r l ::comment ᛛ RUNIC LETTER DOTTED-L
|
182 |
+
::u 16DC ::r ng ::comment ᛜ RUNIC LETTER INGWAZ
|
183 |
+
::u 16DD ::r ng ::comment ᛝ RUNIC LETTER ING
|
184 |
+
::u 16DE ::r d ::comment ᛞ RUNIC LETTER DAGAZ DAEG D
|
185 |
+
::u 16DF ::r o ::comment ᛟ RUNIC LETTER OTHALAN ETHEL O
|
186 |
+
::u 16E0 ::r ea ::comment ᛠ RUNIC LETTER EAR
|
187 |
+
::u 16E1 ::r io ::comment ᛡ RUNIC LETTER IOR
|
188 |
+
::u 16E2 ::r q ::comment ᛢ RUNIC LETTER CWEORTH
|
189 |
+
::u 16E3 ::r k ::comment ᛣ RUNIC LETTER CALC
|
190 |
+
::u 16E4 ::r k ::comment ᛤ RUNIC LETTER CEALC
|
191 |
+
::u 16E5 ::r st ::comment ᛥ RUNIC LETTER STAN
|
192 |
+
::u 16E6 ::r r ::comment ᛦ RUNIC LETTER LONG-BRANCH-YR
|
193 |
+
::u 16E7 ::r r ::comment ᛧ RUNIC LETTER SHORT-TWIG-YR
|
194 |
+
::u 16E8 ::r r ::comment ᛨ RUNIC LETTER ICELANDIC-YR
|
195 |
+
::u 16E9 ::r q ::comment ᛩ RUNIC LETTER Q
|
196 |
+
::u 16EA ::r x ::comment ᛪ RUNIC LETTER X
|
197 |
+
|
198 |
+
::u 17B9 ::r oe ::comment Khmer vowel sign y (short)
|
199 |
+
::u 17BA ::r oe ::comment Khmer vowel sign yy (long)
|
200 |
+
::u 17C6 ::r +m ::comment Khmer sign nikahit (cf. anusvara)
|
201 |
+
::u 17C7 ::r +h ::comment Khmer sign reahmuk (cf. visarga)
|
202 |
+
::u 17C8 ::r ' ::comment Khmer sign yuukaleapintu (short vowel and glottal stop)
|
203 |
+
::u 17C9 ::r "" ::comment Khmer sign muusikatoan: changes the second register to the first
|
204 |
+
::u 17CA ::r "" ::comment Khmer sign triisap: changes the first register to the second
|
205 |
+
::u 17CB ::r "" ::comment Khmer sign bantoc (vowel shortener)
|
206 |
+
::u 17D2 ::r "" ::comment Khmer sign coeng (foot/subscript, cf. virama = no vowel)
|
207 |
+
::u 17D5 ::r . ::comment Khmer sign bariyoosan; period ending entire text or chapter
|
208 |
+
|
209 |
+
::u 180E ::r ' ::comment Mongolian vowel separator
|
210 |
+
|
211 |
+
::u 1B80 ::r +ng ::comment ᮀ Sundanese sign panyecek
|
212 |
+
::u 1B81 ::r +r ::comment ᮁ Sundanese sign panglayar
|
213 |
+
::u 1B82 ::r +h ::comment ᮂ Sundanese sign pangwisad
|
214 |
+
::u 1BA1 ::r ya ::comment ᮡ Sundanese consonant sign pamingkal
|
215 |
+
::u 1BA2 ::r ra ::comment ᮢ Sundanese consonant sign panyakr
|
216 |
+
::u 1BA3 ::r la ::comment ᮣ Sundanese consonant sign panyiku
|
217 |
+
::u 1BA4 ::r i ::comment ᮤ Sundanese consonant sign panghulu
|
218 |
+
::u 1BA5 ::r u ::comment ᮥ Sundanese consonant sign panyuku
|
219 |
+
::u 1BA6 ::r e ::comment ᮦ Sundanese vowel sign panaelaeng
|
220 |
+
::u 1BA7 ::r o ::comment ᮧ Sundanese vowel sign panolong
|
221 |
+
::u 1BA8 ::r e ::comment ᮨ Sundanese vowel sign pamepet
|
222 |
+
::u 1BA9 ::r eu ::comment ᮩ Sundanese vowel sign paneuleung
|
223 |
+
::u 1BAA ::r "" ::comment ᮪ Sundanese sign pamaaeh or patén (no vowel/virama)
|
224 |
+
|
225 |
+
::u 1FBD ::r "" ::comment ᾽ Greek koronis
|
226 |
+
::u 1FFE ::r "" ::comment Greek dasia (rough breathing)
|
227 |
+
|
228 |
+
::u 2002 ::r " " ::comment en space
|
229 |
+
::u 2003 ::r " " ::comment em space
|
230 |
+
::u 2004 ::r " " ::comment three-per-em space
|
231 |
+
::u 2005 ::r " " ::comment four-per-em space
|
232 |
+
::u 2006 ::r " " ::comment six-per-em space
|
233 |
+
::u 2007 ::r " " ::comment figure space
|
234 |
+
::u 2008 ::r " " ::comment punctuation space
|
235 |
+
::u 2009 ::r " " ::comment thin space
|
236 |
+
::u 200A ::r " " ::comment hair space
|
237 |
+
::u 202F ::r " " ::comment narrow no-break space
|
238 |
+
|
239 |
+
::u 2D30 ::r a ::comment TIFINAGH LETTER YA ⴰ
|
240 |
+
::u 2D31 ::r b ::comment TIFINAGH LETTER YAB ⴱ
|
241 |
+
::u 2D32 ::r bh ::comment TIFINAGH LETTER YABH ⴲ
|
242 |
+
::u 2D33 ::r g ::comment TIFINAGH LETTER YAG ⴳ
|
243 |
+
::u 2D34 ::r ghh ::comment TIFINAGH LETTER YAGHH ⴴ
|
244 |
+
::u 2D35 ::r j ::comment TIFINAGH LETTER BERBER ACADEMY YAJ ⴵ
|
245 |
+
::u 2D36 ::r j ::comment TIFINAGH LETTER YAJ ⴶ
|
246 |
+
::u 2D37 ::r d ::comment TIFINAGH LETTER YAD ⴷ
|
247 |
+
::u 2D38 ::r dh ::comment TIFINAGH LETTER YADH ⴸ
|
248 |
+
::u 2D39 ::r dd ::comment TIFINAGH LETTER YADD ⴹ
|
249 |
+
::u 2D3A ::r ddh ::comment TIFINAGH LETTER YADDH ⴺ
|
250 |
+
::u 2D3B ::r e ::comment TIFINAGH LETTER YEY ⴻ
|
251 |
+
::u 2D3C ::r f ::comment TIFINAGH LETTER YAF ⴼ
|
252 |
+
::u 2D3D ::r k ::comment TIFINAGH LETTER YAK ⴽ
|
253 |
+
::u 2D3E ::r k ::comment TIFINAGH LETTER TUAREG YAK ⴾ
|
254 |
+
::u 2D3F ::r khh ::comment TIFINAGH LETTER YAKHH ⴿ
|
255 |
+
::u 2D40 ::r h ::comment TIFINAGH LETTER YAH ⵀ
|
256 |
+
::u 2D41 ::r h ::comment TIFINAGH LETTER BERBER ACADEMY YAH ⵁ
|
257 |
+
::u 2D42 ::r h ::comment TIFINAGH LETTER TUAREG YAH ⵂ
|
258 |
+
::u 2D43 ::r hh ::comment TIFINAGH LETTER YAHH ⵃ
|
259 |
+
::u 2D44 ::r ' ::comment TIFINAGH LETTER YAA ⵄ
|
260 |
+
::u 2D45 ::r kh ::comment TIFINAGH LETTER YAKH ⵅ
|
261 |
+
::u 2D46 ::r kh ::comment TIFINAGH LETTER TUAREG YAKH ⵆ
|
262 |
+
::u 2D47 ::r q ::comment TIFINAGH LETTER YAQ ⵇ
|
263 |
+
::u 2D48 ::r q ::comment TIFINAGH LETTER TUAREG YAQ ⵈ
|
264 |
+
::u 2D49 ::r i ::comment TIFINAGH LETTER YI ⵉ
|
265 |
+
::u 2D4A ::r zh ::comment TIFINAGH LETTER YAZH ⵊ
|
266 |
+
::u 2D4B ::r zh ::comment TIFINAGH LETTER AHAGGAR YAZH ⵋ
|
267 |
+
::u 2D4C ::r zh ::comment TIFINAGH LETTER TUAREG YAZH ⵌ
|
268 |
+
::u 2D4D ::r l ::comment TIFINAGH LETTER YAL ⵍ
|
269 |
+
::u 2D4E ::r m ::comment TIFINAGH LETTER YAM ⵎ
|
270 |
+
::u 2D4F ::r n ::comment TIFINAGH LETTER YAN ⵏ
|
271 |
+
::u 2D50 ::r gn ::comment TIFINAGH LETTER TUAREG YAGN ⵐ
|
272 |
+
::u 2D51 ::r ng ::comment TIFINAGH LETTER TUAREG YANG ⵑ
|
273 |
+
::u 2D52 ::r p ::comment TIFINAGH LETTER YAP ⵒ
|
274 |
+
::u 2D53 ::r u ::comment TIFINAGH LETTER YU ⵓ
|
275 |
+
::u 2D54 ::r r ::comment TIFINAGH LETTER YAR ⵔ
|
276 |
+
::u 2D55 ::r rr ::comment TIFINAGH LETTER YARR ⵕ
|
277 |
+
::u 2D56 ::r gh ::comment TIFINAGH LETTER YAGH ⵖ
|
278 |
+
::u 2D57 ::r gh ::comment TIFINAGH LETTER TUAREG YAGH ⵗ
|
279 |
+
::u 2D58 ::r gh ::comment TIFINAGH LETTER AYER YAGH ⵘ
|
280 |
+
::u 2D59 ::r s ::comment TIFINAGH LETTER YAS ⵙ
|
281 |
+
::u 2D5A ::r ss ::comment TIFINAGH LETTER YASS ⵚ
|
282 |
+
::u 2D5B ::r sh ::comment TIFINAGH LETTER YASH ⵛ
|
283 |
+
::u 2D5C ::r t ::comment TIFINAGH LETTER YAT ⵜ
|
284 |
+
::u 2D5D ::r th ::comment TIFINAGH LETTER YATH ⵝ
|
285 |
+
::u 2D5E ::r ch ::comment TIFINAGH LETTER YACH ⵞ
|
286 |
+
::u 2D5F ::r tt ::comment TIFINAGH LETTER YATT ⵟ
|
287 |
+
::u 2D60 ::r v ::comment TIFINAGH LETTER YAV ⵠ
|
288 |
+
::u 2D61 ::r w ::comment TIFINAGH LETTER YAW ⵡ
|
289 |
+
::u 2D62 ::r y ::comment TIFINAGH LETTER YAY ⵢ
|
290 |
+
::u 2D63 ::r z ::comment TIFINAGH LETTER YAZ ⵣ
|
291 |
+
::u 2D64 ::r z ::comment TIFINAGH LETTER TAWELLEMET YAZ ⵤ
|
292 |
+
::u 2D65 ::r zz ::comment TIFINAGH LETTER YAZZ ⵥ
|
293 |
+
::u 2D66 ::r ye ::comment TIFINAGH LETTER YE ⵦ
|
294 |
+
::u 2D67 ::r yo ::comment TIFINAGH LETTER YO ⵧ
|
295 |
+
::u 2D6F ::r "" ::comment TIFINAGH MODIFIER LETTER LABIALIZATION MARK ⵯ
|
296 |
+
::u 2D70 ::r "" ::comment TIFINAGH SEPARATOR MARK ⵰
|
297 |
+
::u 2D7F ::r "" ::comment TIFINAGH CONSONANT JOINER ⵿
|
298 |
+
|
299 |
+
::u 3063 ::r tsu ::comment Hiragana letter small tsu
|
300 |
+
::u 30C3 ::r tsu ::comment Katakana letter small tsu
|
301 |
+
|
302 |
+
::u ABE3 ::r o ::comment ꯣ Meetei Mayek vowel sign onap
|
303 |
+
::u ABE7 ::r ou ::comment ꯧ Meetei Mayek vowel sign sounap
|
304 |
+
|
305 |
+
::u F008 ::r "" ::comment Yoruba diacritic in private use area
|
306 |
+
::u F00F ::r "" ::comment Yoruba diacritic in private use area
|
307 |
+
::u F023 ::r "" ::comment Yoruba diacritic in private use area
|
308 |
+
::u F025 ::r "" ::comment Yoruba diacritic in private use area
|
309 |
+
|
310 |
+
::u F8D0 ::r a ::name KLINGON LETTER A
|
311 |
+
::u F8D1 ::r b ::name KLINGON LETTER B
|
312 |
+
::u F8D2 ::r ch ::name KLINGON LETTER CH
|
313 |
+
::u F8D3 ::r D ::name KLINGON LETTER D
|
314 |
+
::u F8D4 ::r e ::name KLINGON LETTER E
|
315 |
+
::u F8D5 ::r gh ::name KLINGON LETTER GH
|
316 |
+
::u F8D6 ::r H ::name KLINGON LETTER H
|
317 |
+
::u F8D7 ::r I ::name KLINGON LETTER I
|
318 |
+
::u F8D8 ::r j ::name KLINGON LETTER J
|
319 |
+
::u F8D9 ::r l ::name KLINGON LETTER L
|
320 |
+
::u F8DA ::r m ::name KLINGON LETTER M
|
321 |
+
::u F8DB ::r n ::name KLINGON LETTER N
|
322 |
+
::u F8DC ::r ng ::name KLINGON LETTER NG
|
323 |
+
::u F8DD ::r o ::name KLINGON LETTER O
|
324 |
+
::u F8DE ::r p ::name KLINGON LETTER P
|
325 |
+
::u F8DF ::r q ::name KLINGON LETTER Q
|
326 |
+
::u F8E0 ::r Q ::name KLINGON LETTER Q
|
327 |
+
::u F8E1 ::r r ::name KLINGON LETTER R
|
328 |
+
::u F8E2 ::r S ::name KLINGON LETTER S
|
329 |
+
::u F8E3 ::r t ::name KLINGON LETTER T
|
330 |
+
::u F8E4 ::r tlh ::name KLINGON LETTER TLH
|
331 |
+
::u F8E5 ::r u ::name KLINGON LETTER U
|
332 |
+
::u F8E6 ::r v ::name KLINGON LETTER V
|
333 |
+
::u F8E7 ::r w ::name KLINGON LETTER W
|
334 |
+
::u F8E8 ::r y ::name KLINGON LETTER Y
|
335 |
+
::u F8E9 ::r ' ::name KLINGON LETTER GLOTTAL STOP
|
336 |
+
::u F8F0 ::num 0 ::name KLINGON DIGIT ZERO
|
337 |
+
::u F8F1 ::num 1 ::name KLINGON DIGIT ONE
|
338 |
+
::u F8F2 ::num 2 ::name KLINGON DIGIT TWO
|
339 |
+
::u F8F3 ::num 3 ::name KLINGON DIGIT THREE
|
340 |
+
::u F8F4 ::num 4 ::name KLINGON DIGIT FOUR
|
341 |
+
::u F8F5 ::num 5 ::name KLINGON DIGIT FIVE
|
342 |
+
::u F8F6 ::num 6 ::name KLINGON DIGIT SIX
|
343 |
+
::u F8F7 ::num 7 ::name KLINGON DIGIT SEVEN
|
344 |
+
::u F8F8 ::num 8 ::name KLINGON DIGIT EIGHT
|
345 |
+
::u F8F9 ::num 9 ::name KLINGON DIGIT NINE
|
346 |
+
::u F8FD ::r , ::name KLINGON COMMA
|
347 |
+
::u F8FE ::r . ::name KLINGON FULL STOP
|
348 |
+
::u F8FF ::name KLINGON MUMMIFICATION GLYPH
|
349 |
+
::u FEFF ::r "" ::comment Byte Order Mark (BOM); ZERO WIDTH NO-BREAK SPACE (deprecated)
|
350 |
+
|
351 |
+
::u 1163D ::r +m ::comment Modi sign anusvara
|
352 |
+
::u 1163E ::r +h ::comment Modi sign visarga
|
353 |
+
|
354 |
+
::u 13068 ::num 1000000 ::comment Egyptian Hieroglyph
|
355 |
+
::u 1308B ::r r ::comment Egyptian Hieroglyph ::pic mouth
|
356 |
+
::u 1309D ::r ' ::comment Egyptian Hieroglyph (ayn) ::pic forearm
|
357 |
+
::u 130A7 ::r d ::comment Egyptian Hieroglyph ::pic hand
|
358 |
+
::u 130AD ::num 10000 ::comment Egyptian Hieroglyph
|
359 |
+
::u 130AE ::num 20000 ::comment Egyptian Hieroglyph
|
360 |
+
::u 130AF ::num 30000 ::comment Egyptian Hieroglyph
|
361 |
+
::u 130B0 ::num 40000 ::comment Egyptian Hieroglyph
|
362 |
+
::u 130B1 ::num 50000 ::comment Egyptian Hieroglyph
|
363 |
+
::u 130B2 ::num 60000 ::comment Egyptian Hieroglyph
|
364 |
+
::u 130B3 ::num 70000 ::comment Egyptian Hieroglyph
|
365 |
+
::u 130B4 ::num 80000 ::comment Egyptian Hieroglyph
|
366 |
+
::u 130B5 ::num 90000 ::comment Egyptian Hieroglyph
|
367 |
+
::u 130B6 ::num 50000 ::comment Egyptian Hieroglyph
|
368 |
+
::u 130C0 ::r b ::comment Egyptian Hieroglyph ::pic foot
|
369 |
+
::u 130ED ::r l ::comment Egyptian Hieroglyph [also rw] ::pic lion recumbent
|
370 |
+
::u 13121 ::r h ::comment Egyptian Hieroglyph (f-underscore) ::pic animal's belly and udder
|
371 |
+
::u 1313F ::r a ::comment Egyptian Hieroglyph (alef) ::pic vulture
|
372 |
+
::u 13153 ::r m ::comment Egyptian Hieroglyph ::pic owl
|
373 |
+
::u 13171 ::r w ::comment Egyptian Hieroglyph ::pic quail chick
|
374 |
+
::u 13187 ::r ::comment Egyptian Hieroglyph (determinative/son) H8 ::pic egg
|
375 |
+
::u 13190 ::num 100000 ::comment Egyptian Hieroglyph
|
376 |
+
::u 13191 ::r f ::comment Egyptian Hieroglyph ::pic horned viper
|
377 |
+
::u 13193 ::r d ::comment Egyptian Hieroglyph (J) ::pic cobra
|
378 |
+
::u 131BC ::num 1000 ::comment Egyptian Hieroglyph
|
379 |
+
::u 131BD ::num 2000 ::comment Egyptian Hieroglyph
|
380 |
+
::u 131BE ::num 3000 ::comment Egyptian Hieroglyph
|
381 |
+
::u 131BF ::num 4000 ::comment Egyptian Hieroglyph
|
382 |
+
::u 131C0 ::num 5000 ::comment Egyptian Hieroglyph
|
383 |
+
::u 131C1 ::num 6000 ::comment Egyptian Hieroglyph
|
384 |
+
::u 131C2 ::num 7000 ::comment Egyptian Hieroglyph
|
385 |
+
::u 131C3 ::num 8000 ::comment Egyptian Hieroglyph
|
386 |
+
::u 131C4 ::num 9000 ::comment Egyptian Hieroglyph
|
387 |
+
::u 131CB ::r i ::comment Egyptian Hieroglyph (yod) ::pic single reed
|
388 |
+
::u 131CC ::r y ::comment Egyptian Hieroglyph ::pic double reed
|
389 |
+
::u 1320E ::r q ::comment Egyptian Hieroglyph (qaf) ::pic sandy slope
|
390 |
+
::u 13209 ::comment Egyptian Hieroglyph ::pic desert hills
|
391 |
+
::u 13216 ::r n ::comment Egyptian Hieroglyph ::pic ripple of water
|
392 |
+
::u 13219 ::r sh ::comment Egyptian Hieroglyph (š) ::pic basin
|
393 |
+
::u 13254 ::r h ::comment Egyptian Hieroglyph ::pic reed shelter
|
394 |
+
::u 13283 ::r z ::comment Egyptian Hieroglyph [also S?] ::pic door bolt
|
395 |
+
::u 132AA ::r p ::comment Egyptian Hieroglyph ::pic stool
|
396 |
+
::u 132D4 ::r n ::comment Egyptian Hieroglyph ::pic red crown
|
397 |
+
::u 132F4 ::r s ::comment Egyptian Hieroglyph [also Z?] ::pic folded cloth
|
398 |
+
::u 13319 ::comment Egyptian Hieroglyph ::pic throw stick
|
399 |
+
::u 13362 ::num 100 ::comment Egyptian Hieroglyph
|
400 |
+
::u 13363 ::num 200 ::comment Egyptian Hieroglyph
|
401 |
+
::u 13364 ::num 300 ::comment Egyptian Hieroglyph
|
402 |
+
::u 13365 ::num 400 ::comment Egyptian Hieroglyph
|
403 |
+
::u 13366 ::num 500 ::comment Egyptian Hieroglyph
|
404 |
+
::u 13367 ::num 600 ::comment Egyptian Hieroglyph
|
405 |
+
::u 13368 ::num 700 ::comment Egyptian Hieroglyph
|
406 |
+
::u 13369 ::num 800 ::comment Egyptian Hieroglyph
|
407 |
+
::u 1336A ::num 900 ::comment Egyptian Hieroglyph
|
408 |
+
::u 1336B ::num 500 ::comment Egyptian Hieroglyph
|
409 |
+
::u 1336F ::r o ::comment Egyptian Hieroglyph ::pic lasso
|
410 |
+
::u 1337F ::r t ::comment Egyptian Hieroglyph (ṯ) ::pic hobble
|
411 |
+
::u 13386 ::num 10 ::comment Egyptian Hieroglyph
|
412 |
+
::u 13387 ::num 20 ::comment Egyptian Hieroglyph
|
413 |
+
::u 13388 ::num 30 ::comment Egyptian Hieroglyph
|
414 |
+
::u 13389 ::num 40 ::comment Egyptian Hieroglyph
|
415 |
+
::u 1338A ::num 50 ::comment Egyptian Hieroglyph
|
416 |
+
::u 1338B ::num 60 ::comment Egyptian Hieroglyph
|
417 |
+
::u 1338C ::num 70 ::comment Egyptian Hieroglyph
|
418 |
+
::u 1338D ::num 80 ::comment Egyptian Hieroglyph
|
419 |
+
::u 1338E ::num 90 ::comment Egyptian Hieroglyph
|
420 |
+
::u 1338F ::num 20 ::comment Egyptian Hieroglyph
|
421 |
+
::u 13390 ::num 30 ::comment Egyptian Hieroglyph
|
422 |
+
::u 13391 ::num 40 ::comment Egyptian Hieroglyph
|
423 |
+
::u 13392 ::num 50 ::comment Egyptian Hieroglyph
|
424 |
+
::u 1339B ::r h ::comment Egyptian Hieroglyph ::pic twisted flax
|
425 |
+
::u 133A1 ::r k ::comment Egyptian Hieroglyph ::pic basket with handle
|
426 |
+
::u 133A2 ::r k ::comment Egyptian Hieroglyph ::pic basket with handle, variant
|
427 |
+
::u 133A4 ::r g ::comment Egyptian Hieroglyph ::pic bag
|
428 |
+
::u 133BC ::r g ::comment Egyptian Hieroglyph ::pic stand
|
429 |
+
::u 133CF ::r t ::comment Egyptian Hieroglyph ::pic loaf
|
430 |
+
::u 133ED ::r y ::comment Egyptian Hieroglyph ::pic two strokes
|
431 |
+
::u 133F2 ::r w ::comment Egyptian Hieroglyph ::pic quail chick, hieratic variant
|
432 |
+
::u 133FA ::num 1 ::comment Egyptian Hieroglyph
|
433 |
+
::u 133FB ::num 2 ::comment Egyptian Hieroglyph
|
434 |
+
::u 133FC ::num 3 ::comment Egyptian Hieroglyph
|
435 |
+
::u 133FD ::num 4 ::comment Egyptian Hieroglyph
|
436 |
+
::u 133FE ::num 5 ::comment Egyptian Hieroglyph
|
437 |
+
::u 133FF ::num 6 ::comment Egyptian Hieroglyph
|
438 |
+
::u 13400 ::num 7 ::comment Egyptian Hieroglyph
|
439 |
+
::u 13401 ::num 8 ::comment Egyptian Hieroglyph
|
440 |
+
::u 13402 ::num 9 ::comment Egyptian Hieroglyph
|
441 |
+
::u 13403 ::num 5 ::comment Egyptian Hieroglyph
|
442 |
+
::u 1340D ::r kh ::comment Egyptian Hieroglyph (ḫ, khah) ::pic placenta?
|
443 |
+
::u 1341D ::r m ::comment Egyptian Hieroglyph (also jm)
|
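The entries above follow a flat `::key value` layout (`::u` for the hexadecimal code point, `::r` for the romanization, `::num` for a numeric value, plus free-text fields such as `::comment`, `::name` and `::pic`). A minimal sketch of reading such lines is shown below. It is not uroman's own loader: the helper names `parse_entry` and `load_table` are hypothetical, and treating a double-quoted value (`""`, `" "`) as a literal string is an assumption drawn from the entries themselves.

```python
import re

# One field = "::key", then its value, up to the next "::key" or end of line.
FIELD_RE = re.compile(r"::(\S+)\s*(.*?)\s*(?=::\S|$)")


def parse_entry(line):
    """Split a '::key value ::key value' line into a dict (hypothetical helper)."""
    fields = {}
    for key, value in FIELD_RE.findall(line.strip()):
        # Assumption: a double-quoted value is a literal string, so ""
        # stands for an empty romanization and " " for a plain space.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return fields


def load_table(lines):
    """Map each character to its ::r value; entries without ::u or ::r are skipped."""
    table = {}
    for line in lines:
        entry = parse_entry(line)
        if "u" in entry and "r" in entry:
            table[chr(int(entry["u"], 16))] = entry["r"]
    return table


if __name__ == "__main__":
    sample = [
        "::u 0E2D ::r o ::comment Thai character O ANG",
        '::u 2002 ::r " " ::comment en space',
        "::u F8F0 ::num 0 ::name KLINGON DIGIT ZERO",
    ]
    print(load_table(sample))
```

Entries that carry only `::num` (the Klingon digits and the Egyptian numerals) are left out of the character-to-string map here; a fuller reader would presumably keep them in a separate numeric table.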
uroman/data/UnicodeDataProps.txt
ADDED
@@ -0,0 +1,164 @@
|
1 |
+
::script-name Adlam ::n-char 71 ::char 𞤀𞤁𞤂𞤃𞤄𞤅𞤆𞤇𞤈𞤉𞤊𞤋𞤌𞤍𞤎𞤏𞤐𞤑𞤒𞤓𞤔𞤕𞤖𞤗𞤘𞤙𞤚𞤛𞤜𞤝𞤞𞤟𞤠𞤡𞤢𞤣𞤤𞤥𞤦𞤧𞤨𞤩𞤪𞤫𞤬𞤭𞤮𞤯𞤰𞤱𞤲𞤳𞤴𞤵𞤶𞤷𞤸𞤹𞤺𞤻𞤼𞤽𞤾𞤿𞥀𞥁𞥂𞥃𞥅𞥈𞥉 ::numeral 𞥐𞥑𞥒𞥓𞥔𞥕𞥖𞥗𞥘𞥙
|
2 |
+
::script-name Aegean ::numeral 𐄇𐄈𐄉𐄊𐄋𐄌𐄍𐄎𐄏𐄐𐄑𐄒𐄓𐄔𐄕𐄖𐄗𐄘𐄙𐄚𐄛𐄜𐄝𐄞𐄟𐄠𐄡𐄢𐄣𐄤𐄥𐄦𐄧𐄨𐄩𐄪𐄫𐄬𐄭𐄮𐄯𐄰𐄱𐄲𐄳
|
3 |
+
::script-name Ahom ::n-char 52 ::char 𑜀𑜁𑜂𑜃𑜄𑜅𑜆𑜇𑜈𑜉𑜊𑜋𑜌𑜍𑜎𑜏𑜐𑜑𑜒𑜓𑜔𑜕𑜖𑜗𑜘𑜙𑜚𑜝𑜞𑜟𑜠𑜡𑜢𑜣𑜤𑜥𑜦𑜧𑜨𑜩𑜪𑜫𑜼𑜽𑜾𑝀𑝁𑝂𑝃𑝄𑝅𑝆 ::medial-consonant-sign 𑜝𑜞𑜟 ::numeral 𑜰𑜱𑜲𑜳𑜴𑜵𑜶𑜷𑜸𑜹𑜺𑜻 ::vowel-sign 𑜠𑜡𑜢𑜣𑜤𑜥𑜦𑜧𑜨𑜩𑜪
|
4 |
+
::script-name Arabic ::n-char 1023 ::char ؎؏ؐؑؒؓؔؖؠءآأؤإئابةتثجحخدذرزسشصضطظعغػؼؽؾؿفقكلمنهوىيٜٚٛ٪ٮٯٰٱٲٳٴٵٶٷٸٹٺٻټٽپٿڀځڂڃڄڅچڇڈډڊڋڌڍڎڏڐڑڒړڔڕږڗژڙښڛڜڝڞڟڠڡڢڣڤڥڦڧڨکڪګڬڭڮگڰڱڲڳڴڵڶڷڸڹںڻڼڽھڿۀہۂۃۄۅۆۇۈۉۊۋیۍێۏېۑےۓەۖۗۮۯۺۻۼ۽۾ۿݐݑݒݓݔݕݖݗݘݙݚݛݜݝݞݟݠݡݢݣݤݥݦݧݨݩݪݫݬݭݮݯݰݱݲݳݴݵݶݷݸݹݺݻݼݽݾݿࡰࡱࡲࡳࡴࡵࡶࡷࡸࡹࡺࡻࡼࡽࡾࡿࢀࢁࢂࢆࢉࢊࢋࢌࢍࢠࢡࢢࢣࢤࢥࢦࢧࢨࢩࢪࢫࢬࢭࢮࢯࢰࢱࢲࢳࢴࢵࢶࢷࢸࢹࢺࢻࢼࢽࢾࢿࣀࣁࣂࣃࣄࣅࣆࣇࣈ࣡ﭐﭑﭒﭓﭔﭕﭖﭗﭘﭙﭚﭛﭜﭝﭞﭟﭠﭡﭢﭣﭤﭥﭦﭧﭨﭩﭪﭫﭬﭭﭮﭯﭰﭱﭲﭳﭴﭵﭶﭷﭸﭹﭺﭻﭼﭽﭾﭿﮀﮁﮂﮃﮄﮅﮆﮇﮈﮉﮊﮋﮌﮍﮎﮏﮐﮑﮒﮓﮔﮕﮖﮗﮘﮙﮚﮛﮜﮝﮞﮟﮠﮡﮢﮣﮤﮥﮦﮧﮨﮩﮪﮫﮬﮭﮮﮯﮰﮱﯓﯔﯕﯖﯗﯘﯙﯚﯛﯜﯝﯞﯟﯠﯡﯢﯣﯤﯥﯦﯧﯨﯩﯪﯫﯬﯭﯮﯯﯰﯱﯲﯳﯴﯵﯶﯷﯸﯹﯺﯻﯼﯽﯾﯿﰀﰁﰂﰃﰄﰅﰆﰇﰈﰉﰊﰋﰌﰍﰎﰏﰐﰑﰒﰓﰔﰕﰖﰗﰘﰙﰚﰛﰜﰝﰞﰟﰠﰡﰢﰣﰤﰥﰦﰧﰨﰩﰪﰫﰬﰭﰮﰯﰰﰱﰲﰳﰴﰵﰶﰷﰸﰹﰺﰻﰼﰽﰾﰿﱀﱁﱂﱃﱄﱅﱆﱇﱈﱉﱊﱋﱌﱍﱎﱏﱐﱑﱒﱓﱔﱕﱖﱗﱘﱙﱚﱛﱜﱝﱞﱟﱠﱡﱢﱣﱤﱥﱦﱧﱨﱩﱪﱫﱬﱭﱮﱯﱰﱱﱲﱳﱴﱵﱶﱷﱸﱹﱺﱻﱼﱽﱾﱿﲀﲁﲂﲃﲄﲅﲆﲇﲈﲉﲊﲋﲌﲍﲎﲏﲐﲑﲒﲓﲔﲕﲖﲗﲘﲙﲚﲛﲜﲝﲞﲟﲠﲡﲢﲣﲤﲥﲦﲧﲨﲩﲪﲫﲬﲭﲮﲯﲰﲱﲲﲳﲴﲵﲶﲷﲸﲹﲺﲻﲼﲽﲾﲿﳀﳁﳂﳃﳄﳅﳆﳇﳈﳉﳊﳋﳌﳍﳎﳏﳐﳑﳒﳓﳔﳕﳖﳗﳘﳙﳚﳛﳜﳝﳞﳟﳠﳡﳢﳣﳤﳥﳦﳧﳨﳩﳪﳫﳬﳭﳮﳯﳰﳱﳲﳳﳴﳵﳶﳷﳸﳹﳺﳻﳼﳽﳾﳿﴀﴁﴂﴃﴄﴅﴆﴇﴈﴉﴊﴋﴌﴍﴎﴏﴐﴑﴒﴓﴔﴕﴖﴗﴘﴙﴚﴛﴜﴝﴞﴟﴠﴡﴢﴣﴤﴥﴦﴧﴨﴩﴪﴫﴬﴭﴮﴯﴰﴱﴲﴳﴴﴵﴶﴷﴸﴹﴺﴻﴼﴽ﵀﵁﵂﵃﵄﵅﵆﵇﵈﵉﵊﵋﵌﵍﵎﵏ﵐﵑﵒﵓﵔﵕﵖﵗﵘﵙﵚﵛﵜﵝﵞﵟﵠﵡﵢﵣﵤﵥﵦﵧﵨﵩﵪﵫﵬﵭﵮﵯﵰﵱﵲﵳﵴﵵﵶﵷﵸﵹﵺﵻﵼﵽﵾﵿﶀﶁﶂﶃﶄﶅﶆﶇﶈﶉﶊﶋﶌﶍﶎﶏﶒﶓﶔﶕﶖﶗﶘﶙﶚﶛﶜﶝﶞﶟﶠﶡﶢﶣﶤﶥﶦﶧﶨﶩﶪﶫﶬﶭﶮﶯﶰﶱﶲﶳﶴﶵﶶﶷﶸﶹﶺﶻﶼﶽﶾﶿﷀﷁﷂﷃﷄﷅﷆﷇ﷏ﷰﷱﷲﷳﷴﷵﷶﷷﷸﷹﷺﷻ﷽﷾﷿ﺀﺁﺂﺃﺄﺅﺆﺇﺈﺉﺊﺋﺌﺍﺎﺏﺐﺑﺒﺓﺔﺕﺖﺗﺘﺙﺚﺛﺜﺝﺞﺟﺠﺡﺢﺣﺤﺥﺦﺧﺨﺩﺪﺫﺬﺭﺮﺯﺰﺱﺲﺳﺴﺵﺶﺷﺸﺹﺺﺻﺼﺽﺾﺿﻀﻁﻂﻃﻄﻅﻆﻇﻈﻉﻊﻋﻌﻍﻎﻏﻐﻑﻒﻓﻔﻕﻖﻗﻘﻙﻚﻛﻜﻝﻞﻟﻠﻡﻢﻣﻤﻥﻦﻧﻨﻩﻪﻫﻬﻭﻮﻯﻰﻱﻲﻳﻴﻵﻶﻷﻸﻹﻺﻻﻼ ::numeral ݳݴݵݶݷݸݹݺݻݼݽ ::vowel-sign ٜٚٛ
|
5 |
+
::script-name Arabic-Indic ::n-char 2 ::char ؉؊ ::numeral ٠١٢٣٤٥٦٧٨٩
|
6 |
+
::script-name Armenian ::n-char 86 ::char ԱԲԳԴԵԶԷԸԹԺԻԼԽԾԿՀՁՂՃՄՅՆՇՈՉՊՋՌՍՎՏՐՑՒՓՔՕՖՙՠաբգդեզէըթժիլխծկհձղճմյնշոչպջռսվտրցւփքօֆևֈ֏ﬓﬔﬕﬖﬗ
|
7 |
+
::script-name Avestan ::n-char 54 ::char 𐬀𐬁𐬂𐬃𐬄𐬅𐬆𐬇𐬈𐬉𐬊𐬋𐬌𐬍𐬎𐬏𐬐𐬑𐬒𐬓𐬔𐬕𐬖𐬗𐬘𐬙𐬚𐬛𐬜𐬝𐬞𐬟𐬠𐬡𐬢𐬣𐬤𐬥𐬦𐬧𐬨𐬩𐬪𐬫𐬬𐬭𐬮𐬯𐬰𐬱𐬲𐬳𐬴𐬵
|
8 |
+
::script-name Balinese ::n-char 76 ::char ᬀᬁᬂᬃᬄᬅᬆᬇᬈᬉᬊᬋᬌᬍᬎᬏᬐᬑᬒᬓᬔᬕᬖᬗᬘᬙᬚᬛᬜᬝᬞᬟᬠᬡᬢᬣᬤᬥᬦᬧᬨᬩᬪᬫᬬᬭᬮᬯᬰᬱᬲᬳ᬴ᬵᬶᬷᬸᬹᬺᬻᬼᬽᬾᬿᭀᭁᭂᭃᭅᭆᭇᭈᭉᭊᭋᭌ ::numeral ᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙ ::vowel-sign ᬵᬶᬷᬸᬹᬺᬻᬼᬽᬾᬿᭀᭁᭂᭃ
|
9 |
+
::script-name Bamum ::n-char 649 ::char ꚠꚡꚢꚣꚤꚥꚦꚧꚨꚩꚪꚫꚬꚭꚮꚯꚰꚱꚲꚳꚴꚵꚶꚷꚸꚹꚺꚻꚼꚽꚾꚿꛀꛁꛂꛃꛄꛅꛆꛇꛈꛉꛊꛋꛌꛍꛎꛏꛐꛑꛒꛓꛔꛕꛖꛗꛘꛙꛚ���ꛜꛝꛞꛟꛠꛡꛢꛣꛤꛥꛦꛧꛨꛩꛪꛫꛬꛭꛮꛯ𖠀𖠁𖠂𖠃𖠄𖠅𖠆𖠇𖠈𖠉𖠊𖠋𖠌𖠍𖠎𖠏𖠐𖠑𖠒𖠓𖠔𖠕𖠖𖠗𖠘𖠙𖠚𖠛𖠜𖠝𖠞𖠟𖠠𖠡𖠢𖠣𖠤𖠥𖠦𖠧𖠨𖠩𖠪𖠫𖠬𖠭𖠮𖠯𖠰𖠱𖠲𖠳𖠴𖠵𖠶𖠷𖠸𖠹𖠺𖠻𖠼𖠽𖠾𖠿𖡀𖡁𖡂𖡃𖡄𖡅𖡆𖡇𖡈𖡉𖡊𖡋𖡌𖡍𖡎𖡏𖡐𖡑𖡒𖡓𖡔𖡕𖡖𖡗𖡘𖡙𖡚𖡛𖡜𖡝𖡞𖡟𖡠𖡡𖡢𖡣𖡤𖡥𖡦𖡧𖡨𖡩𖡪𖡫𖡬𖡭𖡮𖡯𖡰𖡱𖡲𖡳𖡴𖡵𖡶𖡷𖡸𖡹𖡺𖡻𖡼𖡽𖡾𖡿𖢀𖢁𖢂𖢃𖢄𖢅𖢆𖢇𖢈𖢉𖢊𖢋𖢌𖢍𖢎𖢏𖢐𖢑𖢒𖢓𖢔𖢕𖢖𖢗𖢘𖢙𖢚𖢛𖢜𖢝𖢞𖢟𖢠𖢡𖢢𖢣𖢤𖢥𖢦𖢧𖢨𖢩𖢪𖢫𖢬𖢭𖢮𖢯𖢰𖢱𖢲𖢳𖢴𖢵𖢶𖢷𖢸𖢹𖢺𖢻𖢼𖢽𖢾𖢿𖣀𖣁𖣂𖣃𖣄𖣅𖣆𖣇𖣈𖣉𖣊𖣋𖣌𖣍𖣎𖣏𖣐𖣑𖣒𖣓𖣔𖣕𖣖𖣗𖣘𖣙𖣚𖣛𖣜𖣝𖣞𖣟𖣠𖣡𖣢𖣣𖣤𖣥𖣦𖣧𖣨𖣩𖣪𖣫𖣬𖣭𖣮𖣯𖣰𖣱𖣲𖣳𖣴𖣵𖣶𖣷𖣸𖣹𖣺𖣻𖣼𖣽𖣾𖣿𖤀𖤁𖤂𖤃𖤄𖤅𖤆𖤇𖤈𖤉𖤊𖤋𖤌𖤍𖤎𖤏𖤐𖤑𖤒𖤓𖤔𖤕𖤖𖤗𖤘𖤙𖤚𖤛𖤜𖤝𖤞𖤟𖤠𖤡𖤢𖤣𖤤𖤥𖤦𖤧𖤨𖤩𖤪𖤫𖤬𖤭𖤮𖤯𖤰𖤱𖤲𖤳𖤴𖤵𖤶𖤷𖤸𖤹𖤺𖤻𖤼𖤽𖤾𖤿𖥀𖥁𖥂𖥃𖥄𖥅𖥆𖥇𖥈𖥉𖥊𖥋𖥌𖥍𖥎𖥏𖥐𖥑𖥒𖥓𖥔𖥕𖥖𖥗𖥘𖥙𖥚𖥛𖥜𖥝𖥞𖥟𖥠𖥡𖥢𖥣𖥤𖥥𖥦𖥧𖥨𖥩𖥪𖥫𖥬𖥭𖥮𖥯𖥰𖥱𖥲𖥳𖥴𖥵𖥶𖥷𖥸𖥹𖥺𖥻𖥼𖥽𖥾𖥿𖦀𖦁𖦂𖦃𖦄𖦅𖦆𖦇𖦈𖦉𖦊𖦋𖦌𖦍𖦎𖦏𖦐𖦑𖦒𖦓𖦔𖦕𖦖𖦗𖦘𖦙𖦚𖦛𖦜𖦝𖦞𖦟𖦠𖦡𖦢𖦣𖦤𖦥𖦦𖦧𖦨𖦩𖦪𖦫𖦬𖦭𖦮𖦯𖦰𖦱𖦲𖦳𖦴𖦵𖦶𖦷𖦸𖦹𖦺𖦻𖦼𖦽𖦾𖦿𖧀𖧁𖧂𖧃𖧄𖧅𖧆𖧇𖧈𖧉𖧊𖧋𖧌𖧍𖧎𖧏𖧐𖧑𖧒𖧓𖧔𖧕𖧖𖧗𖧘𖧙𖧚𖧛𖧜𖧝𖧞𖧟𖧠𖧡𖧢𖧣𖧤𖧥𖧦𖧧𖧨𖧩𖧪𖧫𖧬𖧭𖧮𖧯𖧰𖧱𖧲𖧳𖧴𖧵𖧶𖧷𖧸𖧹𖧺𖧻𖧼𖧽𖧾𖧿𖨀𖨁𖨂𖨃𖨄𖨅𖨆𖨇𖨈𖨉𖨊𖨋𖨌𖨍𖨎𖨏𖨐𖨑𖨒𖨓𖨔𖨕𖨖𖨗𖨘𖨙𖨚𖨛𖨜𖨝𖨞𖨟𖨠𖨡𖨢𖨣𖨤𖨥𖨦𖨧𖨨𖨩𖨪𖨫𖨬𖨭𖨮𖨯𖨰𖨱𖨲𖨳𖨴𖨵𖨶𖨷𖨸
|
10 |
+
::script-name Bassa Vah ::n-char 30 ::char 𖫐𖫑𖫒𖫓𖫔𖫕𖫖𖫗𖫘𖫙𖫚𖫛𖫜𖫝𖫞𖫟𖫠𖫡𖫢𖫣𖫤𖫥𖫦𖫧𖫨𖫩𖫪𖫫𖫬𖫭
|
11 |
+
::script-name Batak ::n-char 50 ::char ᯀᯁᯂᯃᯄᯅᯆᯇᯈᯉᯊᯋᯌᯍᯎᯏᯐᯑᯒᯓᯔᯕᯖᯗᯘᯙᯚᯛᯜᯝᯞᯟᯠᯡᯢᯣᯤᯥ᯦ᯧᯨᯩᯪᯫᯬᯭᯮᯯᯰᯱ ::vowel-sign ᯧᯨᯩᯪᯫᯬᯭᯮᯯ
|
12 |
+
::script-name Bengali ::n-char 75 ::char ঁংঃঅআইঈউঊঋঌএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ়ঽািীুূৃৄেৈোৌ্ৎৗড়ঢ়য়ৠৡৢৣৰৱ৳ৼ৽ ::numeral ০১২৩৪৫৬৭৮৯ ::sign-virama ্ ::vowel-sign ািীুূৃৄেৈোৌৢৣ
|
13 |
+
::script-name Bhaiksuki ::n-char 63 ::char 𑰀𑰁𑰂𑰃𑰄𑰅𑰆𑰇𑰈𑰊𑰋𑰌𑰍𑰎𑰏𑰐𑰑𑰒𑰓𑰔𑰕𑰖𑰗𑰘𑰙𑰚𑰛𑰜𑰝𑰞𑰟𑰠𑰡𑰢𑰣𑰤𑰥𑰦𑰧𑰨𑰩𑰪𑰫𑰬𑰭𑰮𑰯𑰰𑰱𑰲𑰳𑰴𑰵𑰶𑰸𑰹𑰺𑰻𑰼𑰽𑰾𑰿𑱀 ::numeral 𑱐𑱑𑱒𑱓𑱔𑱕𑱖𑱗𑱘𑱙𑱚𑱛𑱜𑱝𑱞𑱟𑱠𑱡𑱢𑱣𑱤𑱥𑱦𑱧𑱨𑱩𑱪𑱫 ::sign-virama 𑰿 ::vowel-sign 𑰯𑰰𑰱𑰲𑰳𑰴𑰵𑰶𑰸𑰹𑰺𑰻
|
14 |
+
::script-name Bopomofo ::n-char 75 ::char ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩㄪㄫㄬㄭㄮㄯㆠㆡㆢㆣㆤㆥㆦㆧㆨㆩㆪㆫㆬㆭㆮㆯㆰㆱㆲㆳㆴㆵㆶㆷㆸㆹㆺㆻㆼㆽㆾㆿ
|
15 |
+
::script-name Brahmi ::n-char 76 ::char 𑀀𑀁𑀂𑀃𑀄𑀅𑀆𑀇𑀈𑀉𑀊𑀋𑀌𑀍𑀎𑀏𑀐𑀑𑀒𑀓𑀔𑀕𑀖𑀗𑀘𑀙𑀚𑀛𑀜𑀝𑀞𑀟𑀠𑀡𑀢𑀣𑀤𑀥𑀦𑀧𑀨𑀩𑀪𑀫𑀬𑀭𑀮𑀯𑀰𑀱𑀲𑀳𑀴𑀵𑀶𑀷𑀸𑀹𑀺𑀻𑀼𑀽𑀾𑀿𑁀𑁁𑁂𑁃𑁄𑁅𑁰𑁱𑁲𑁳𑁴𑁵 ::numeral 𑁒𑁓𑁔𑁕𑁖𑁗𑁘𑁙𑁚𑁛𑁜𑁝𑁞𑁟𑁠𑁡𑁢𑁣𑁤𑁥𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯𑁿 ::vowel-sign 𑀸𑀹𑀺𑀻𑀼𑀽𑀾𑀿𑁀𑁁𑁂𑁃𑁄𑁅𑁳𑁴
|
16 |
+
::script-name Buginese ::n-char 28 ::char ᨀᨁᨂᨃᨄᨅᨆᨇᨈᨉᨊᨋᨌᨍᨎᨏᨐᨑᨒᨓᨔᨕᨖᨘᨗᨙᨚᨛ ::vowel-sign ᨘᨗᨙᨚᨛ
|
17 |
+
::script-name Buhid ::n-char 20 ::char ᝀᝁᝂᝃᝄᝅᝆᝇᝈᝉᝊᝋᝌᝍᝎᝏᝐᝑᝒᝓ ::vowel-sign ᝒᝓ
|
18 |
+
::script-name Carian ::n-char 49 ::char 𐊠𐊡𐊢𐊣𐊤𐊥𐊦𐊧𐊨𐊩𐊪𐊫𐊬𐊭𐊮𐊯𐊰𐊱𐊲𐊳𐊴𐊵𐊶𐊷𐊸𐊹𐊺𐊻𐊼𐊽𐊾𐊿𐋀𐋁𐋂𐋃𐋄𐋅𐋆𐋇𐋈𐋉𐋊𐋋𐋌𐋍𐋎𐋏𐋐
|
19 |
+
::script-name Caucasian Albanian ::n-char 52 ::char 𐔰𐔱𐔲𐔳𐔴𐔵𐔶𐔷𐔸𐔹𐔺𐔻𐔼𐔽𐔾𐔿𐕀𐕁𐕂𐕃𐕄𐕅𐕆𐕇𐕈𐕉𐕊𐕋𐕌𐕍𐕎𐕏𐕐𐕑𐕒𐕓𐕔𐕕𐕖𐕗𐕘𐕙𐕚𐕛𐕜𐕝𐕞𐕟𐕠𐕡𐕢𐕣
|
20 |
+
::script-name Chakma ::n-char 53 ::char 𑄀𑄁𑄂𑄃𑄄𑄅𑄆𑄇𑄈𑄉𑄊𑄋𑄌𑄍𑄎𑄏𑄐𑄑𑄒𑄓𑄔𑄕𑄖𑄗𑄘𑄙𑄚𑄛𑄜𑄝𑄞𑄟𑄠𑄡𑄢𑄣𑄤𑄥𑄦𑄧𑄨𑄩𑄪𑄫𑄬𑄭𑄮𑄯𑄰𑅄𑅅𑅆𑅇 ::numeral 𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿 ::vowel-sign 𑄧𑄨𑄩𑄪𑄫𑄬𑄭𑄮𑄯𑄰𑅅𑅆
|
21 |
+
::script-name Cham ::n-char 69 ::char ꨀꨁꨂꨃꨄꨅꨆꨇꨈꨉꨊꨋꨌꨍꨎꨏꨐꨑꨒꨓꨔꨕꨖꨗꨘꨙꨚꨛꨜꨝꨞꨟꨠꨡꨢꨣꨤꨥꨦꨧꨨꨩꨪꨫꨬꨭꨮꨯꨰꨱꨲꨳꨴꨵꨶꩀꩁꩂꩃꩄꩅꩆꩇꩈꩉꩊꩋꩌꩍ ::numeral ꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙ ::vowel-sign ꨩꨪꨫꨬꨭꨮꨯꨰꨱꨲ
|
22 |
+
::script-name Cherokee ::n-char 172 ::char ᎠᎡᎢᎣᎤᎥᎦᎧᎨᎩᎪᎫᎬᎭᎮᎯᎰᎱᎲᎳᎴᎵᎶᎷᎸᎹᎺᎻᎼᎽᎾᎿᏀᏁᏂᏃᏄᏅᏆᏇᏈᏉᏊᏋᏌᏍᏎᏏᏐᏑᏒᏓᏔᏕᏖᏗᏘᏙᏚᏛᏜᏝᏞᏟᏠᏡᏢᏣᏤᏥᏦᏧᏨᏩᏪᏫᏬᏭᏮᏯᏰᏱᏲᏳᏴᏵᏸᏹᏺᏻᏼᏽꭰꭱꭲꭳꭴꭵꭶꭷꭸꭹꭺꭻꭼꭽꭾꭿꮀꮁꮂꮃꮄꮅꮆꮇꮈꮉꮊꮋꮌꮍꮎꮏꮐꮑꮒꮓꮔꮕꮖꮗꮘꮙꮚꮛꮜꮝꮞꮟꮠꮡꮢꮣꮤꮥꮦꮧꮨꮩꮪꮫꮬꮭꮮꮯꮰꮱꮲꮳꮴꮵꮶꮷꮸꮹꮺꮻꮼꮽꮾꮿ
|
23 |
+
::script-name Chorasmian ::n-char 21 ::char 𐾰𐾱𐾲𐾳𐾴𐾵𐾶𐾷𐾸𐾹𐾺𐾻𐾼𐾽𐾾𐾿𐿀𐿁𐿂𐿃𐿄 ::numeral 𐿅𐿆𐿇𐿈𐿉𐿊𐿋
|
24 |
+
::script-name Coptic ::n-char 120 ::char ϢϣϤϥϦϧϨϩϪϫϬϭϮϯⲀⲁⲂⲃⲄⲅⲆⲇⲈⲉⲊⲋⲌⲍⲎⲏⲐⲑⲒⲓⲔⲕⲖⲗⲘⲙⲚⲛⲜⲝⲞⲟⲠⲡⲢⲣⲤⲥⲦⲧⲨⲩⲪⲫⲬⲭⲮⲯⲰⲱⲲⲳⲴⲵⲶⲷⲸⲹⲺⲻⲼⲽⲾⲿⳀⳁⳂⳃⳄⳅⳆⳇⳈⳉⳊⳋⳌⳍⳎⳏⳐⳑⳒⳓⳔⳕⳖⳗⳘⳙⳚⳛⳜⳝⳞⳟⳠⳡⳢⳣⳫⳬⳭⳮⳲⳳ ::numeral ⳽𐋡𐋢𐋣𐋤𐋥𐋦𐋧𐋨𐋩𐋪𐋫𐋬𐋭𐋮𐋯𐋰𐋱𐋲𐋳𐋴𐋵𐋶𐋷𐋸𐋹𐋺𐋻
|
25 |
+
::script-name Cuneiform ::n-char 1234 ::char 𒀀𒀁𒀂𒀃𒀄𒀅𒀆𒀇𒀈𒀉𒀊𒀋𒀌𒀍𒀎𒀏𒀐𒀑𒀒𒀓𒀔𒀕𒀖𒀗𒀘𒀙𒀚𒀛𒀜𒀝𒀞𒀟𒀠𒀡𒀢𒀣𒀤𒀥𒀦𒀧𒀨𒀩𒀪𒀫𒀬𒀭𒀮𒀯𒀰𒀱𒀲𒀳𒀴𒀵𒀶𒀷𒀸𒀹𒀺𒀻𒀼𒀽𒀾𒀿𒁀𒁁𒁂𒁃𒁄𒁅𒁆𒁇𒁈𒁉𒁊𒁋𒁌𒁍𒁎𒁏𒁐𒁑𒁒𒁓𒁔𒁕𒁖𒁗𒁘𒁙𒁚𒁛𒁜𒁝𒁞𒁟𒁠𒁡𒁢𒁣𒁤𒁥𒁦𒁧𒁨𒁩𒁪𒁫𒁬𒁭𒁮𒁯𒁰𒁱𒁲𒁳𒁴𒁵𒁶𒁷𒁸𒁹𒁺𒁻𒁼𒁽𒁾𒁿𒂀𒂁𒂂𒂃𒂄𒂅𒂆𒂇𒂈𒂉𒂊𒂋𒂌𒂍𒂎𒂏𒂐𒂑𒂒𒂓𒂔𒂕𒂖𒂗𒂘𒂙𒂚𒂛𒂜𒂝𒂞𒂟𒂠𒂡𒂢𒂣𒂤𒂥𒂦𒂧𒂨𒂩𒂪𒂫𒂬𒂭𒂮𒂯𒂰𒂱𒂲𒂳𒂴𒂵𒂶𒂷𒂸𒂹𒂺𒂻𒂼𒂽𒂾𒂿𒃀𒃁𒃂𒃃𒃄𒃅𒃆𒃇𒃈𒃉𒃊𒃋𒃌𒃍𒃎𒃏𒃐𒃑𒃒𒃓𒃔𒃕𒃖𒃗𒃘𒃙𒃚𒃛𒃜𒃝𒃞𒃟𒃠𒃡𒃢𒃣𒃤𒃥𒃦𒃧𒃨𒃩𒃪𒃫𒃬𒃭𒃮𒃯𒃰𒃱𒃲𒃳𒃴𒃵𒃶𒃷𒃸𒃹𒃺𒃻𒃼𒃽𒃾𒃿𒄀𒄁𒄂𒄃𒄄𒄅𒄆𒄇𒄈𒄉𒄊𒄋𒄌𒄍𒄎𒄏𒄐𒄑𒄒𒄓𒄔𒄕𒄖𒄗𒄘𒄙𒄚𒄛𒄜𒄝𒄞𒄟𒄠𒄡𒄢𒄣𒄤𒄥𒄦𒄧𒄨𒄩𒄪𒄫𒄬𒄭𒄮𒄯𒄰𒄱𒄲𒄳𒄴𒄵𒄶𒄷𒄸𒄹𒄺𒄻𒄼𒄽𒄾𒄿𒅀𒅁𒅂𒅃𒅄𒅅𒅆𒅇𒅈𒅉𒅊𒅋𒅌𒅍𒅎𒅏𒅐𒅑𒅒𒅓𒅔𒅕𒅖𒅗𒅘𒅙𒅚𒅛𒅜𒅝𒅞𒅟𒅠𒅡𒅢𒅣𒅤𒅥𒅦𒅧𒅨𒅩𒅪𒅫𒅬𒅭𒅮𒅯𒅰𒅱𒅲𒅳𒅴𒅵𒅶𒅷𒅸𒅹𒅺𒅻𒅼𒅽𒅾𒅿𒆀𒆁𒆂𒆃𒆄𒆅𒆆𒆇𒆈𒆉𒆊𒆋𒆌𒆍𒆎𒆏𒆐𒆑𒆒𒆓𒆔𒆕𒆖𒆗𒆘𒆙𒆚𒆛𒆜𒆝𒆞𒆟𒆠𒆡𒆢𒆣𒆤𒆥𒆦𒆧𒆨𒆩𒆪𒆫𒆬𒆭𒆮𒆯𒆰𒆱𒆲𒆳𒆴𒆵𒆶𒆷𒆸𒆹𒆺𒆻𒆼𒆽𒆾𒆿𒇀𒇁𒇂𒇃𒇄𒇅𒇆𒇇𒇈𒇉𒇊𒇋𒇌𒇍𒇎𒇏𒇐𒇑𒇒𒇓𒇔𒇕𒇖𒇗𒇘𒇙𒇚𒇛𒇜𒇝𒇞𒇟𒇠𒇡𒇢𒇣𒇤𒇥𒇦𒇧𒇨𒇩𒇪𒇫𒇬𒇭𒇮𒇯𒇰𒇱𒇲𒇳𒇴𒇵𒇶𒇷𒇸𒇹𒇺𒇻𒇼𒇽𒇾𒇿𒈀𒈁𒈂𒈃𒈄𒈅𒈆𒈇𒈈𒈉𒈊𒈋𒈌𒈍𒈎𒈏𒈐𒈑𒈒𒈓𒈔𒈕𒈖𒈗𒈘𒈙𒈚𒈛𒈜𒈝𒈞𒈟𒈠𒈡𒈢𒈣𒈤𒈥𒈦𒈧𒈨𒈩𒈪𒈫𒈬𒈭𒈮𒈯𒈰𒈱𒈲𒈳𒈴𒈵𒈶𒈷𒈸𒈹𒈺𒈻𒈼𒈽𒈾𒈿𒉀𒉁𒉂𒉃𒉄𒉅𒉆𒉇𒉈𒉉𒉊𒉋𒉌𒉍𒉎𒉏𒉐𒉑𒉒𒉓𒉔𒉕𒉖𒉗𒉘𒉙𒉚𒉛𒉜𒉝𒉞𒉟𒉠𒉡𒉢𒉣𒉤𒉥𒉦𒉧𒉨𒉩𒉪𒉫𒉬𒉭𒉮𒉯𒉰𒉱𒉲𒉳𒉴𒉵𒉶𒉷𒉸𒉹𒉺𒉻𒉼𒉽𒉾𒉿𒊀𒊁𒊂𒊃𒊄𒊅𒊆𒊇𒊈𒊉𒊊𒊋𒊌𒊍𒊎𒊏𒊐𒊑𒊒𒊓𒊔𒊕𒊖𒊗𒊘𒊙𒊚𒊛𒊜𒊝𒊞𒊟𒊠𒊡𒊢𒊣𒊤𒊥𒊦𒊧𒊨𒊩𒊪𒊫𒊬𒊭𒊮𒊯𒊰𒊱𒊲𒊳𒊴𒊵𒊶𒊷𒊸𒊹𒊺𒊻𒊼𒊽𒊾𒊿𒋀𒋁𒋂𒋃𒋄𒋅𒋆𒋇𒋈𒋉𒋊𒋋𒋌𒋍𒋎𒋏𒋐𒋑𒋒𒋓𒋔𒋕𒋖𒋗𒋘𒋙𒋚𒋛𒋜𒋝𒋞𒋟𒋠𒋡𒋢𒋣𒋤𒋥𒋦𒋧𒋨𒋩𒋪𒋫𒋬𒋭𒋮𒋯𒋰��𒋲𒋳𒋴𒋵𒋶𒋷𒋸𒋹𒋺𒋻𒋼𒋽𒋾𒋿𒌀𒌁𒌂𒌃𒌄𒌅𒌆𒌇𒌈𒌉𒌊𒌋𒌌𒌍𒌎𒌏𒌐𒌑𒌒𒌓𒌔𒌕𒌖𒌗𒌘𒌙𒌚𒌛𒌜𒌝𒌞𒌟𒌠𒌡𒌢𒌣𒌤𒌥𒌦𒌧𒌨𒌩𒌪𒌫𒌬𒌭𒌮𒌯𒌰𒌱𒌲𒌳𒌴𒌵𒌶𒌷𒌸𒌹𒌺𒌻𒌼𒌽𒌾𒌿𒍀𒍁𒍂𒍃𒍄𒍅𒍆𒍇𒍈𒍉𒍊𒍋𒍌𒍍𒍎𒍏𒍐𒍑𒍒𒍓𒍔𒍕𒍖𒍗𒍘𒍙𒍚𒍛𒍜𒍝𒍞𒍟𒍠𒍡𒍢𒍣𒍤𒍥𒍦𒍧𒍨𒍩𒍪𒍫𒍬𒍭𒍮𒍯𒍰𒍱𒍲𒍳𒍴𒍵𒍶𒍷𒍸𒍹𒍺𒍻𒍼𒍽𒍾𒍿𒎀𒎁𒎂𒎃𒎄𒎅𒎆𒎇𒎈𒎉𒎊𒎋𒎌𒎍𒎎𒎏𒎐𒎑𒎒𒎓𒎔𒎕𒎖𒎗𒎘𒎙𒐀𒐁𒐂𒐃𒐄𒐅𒐆𒐇𒐈𒐉𒐊𒐋𒐌𒐍𒐎𒐏𒐐𒐑𒐒𒐓𒐔𒐕𒐖𒐗𒐘𒐙𒐚𒐛𒐜𒐝𒐞𒐟𒐠𒐡𒐢𒐣𒐤𒐥𒐦𒐧𒐨𒐩𒐪𒐫𒐬𒐭𒐮𒐯𒐰𒐱𒐲𒐳𒐴𒐵𒐶𒐷𒐸𒐹𒐺𒐻𒐼𒐽𒐾𒐿𒑀𒑁𒑂𒑃𒑄𒑅𒑆𒑇𒑈𒑉𒑊𒑋𒑌𒑍𒑎𒑏𒑐𒑑𒑒𒑓𒑔𒑕𒑖𒑗𒑘𒑙𒑚𒑛𒑜𒑝𒑞𒑟𒑠𒑡𒑢𒑣𒑤𒑥𒑦𒑧𒑨𒑩𒑪𒑫𒑬𒑭𒑮𒑰𒑱𒑲𒑳𒑴𒒀𒒁𒒂𒒃𒒄𒒅𒒆𒒇𒒈𒒉𒒊𒒋𒒌𒒍𒒎𒒏𒒐𒒑𒒒𒒓𒒔𒒕𒒖𒒗𒒘𒒙𒒚𒒛𒒜𒒝𒒞𒒟𒒠𒒡𒒢𒒣𒒤𒒥𒒦𒒧𒒨𒒩𒒪𒒫𒒬𒒭𒒮𒒯𒒰𒒱𒒲𒒳𒒴𒒵𒒶𒒷𒒸𒒹𒒺𒒻𒒼𒒽𒒾𒒿𒓀𒓁𒓂𒓃𒓄𒓅𒓆𒓇𒓈𒓉𒓊𒓋𒓌𒓍𒓎𒓏𒓐𒓑𒓒𒓓𒓔𒓕𒓖𒓗𒓘𒓙𒓚𒓛𒓜𒓝𒓞𒓟𒓠𒓡𒓢𒓣𒓤𒓥𒓦𒓧𒓨𒓩𒓪𒓫𒓬𒓭𒓮𒓯𒓰𒓱𒓲𒓳𒓴𒓵𒓶𒓷𒓸𒓹𒓺𒓻𒓼𒓽𒓾𒓿𒔀𒔁𒔂𒔃𒔄𒔅𒔆𒔇𒔈𒔉𒔊𒔋𒔌𒔍𒔎𒔏𒔐𒔑𒔒𒔓𒔔𒔕𒔖𒔗𒔘𒔙𒔚𒔛𒔜𒔝𒔞𒔟𒔠𒔡𒔢𒔣𒔤𒔥𒔦𒔧𒔨𒔩𒔪𒔫𒔬𒔭𒔮𒔯𒔰𒔱𒔲𒔳𒔴𒔵𒔶𒔷𒔸𒔹𒔺𒔻𒔼𒔽𒔾𒔿𒕀𒕁𒕂𒕃
|
26 |
+
::script-name Cypriot ::n-char 55 ::char 𐠀𐠁𐠂𐠃𐠄𐠅𐠈𐠊𐠋𐠌𐠍𐠎𐠏𐠐𐠑𐠒𐠓𐠔𐠕𐠖𐠗𐠘𐠙𐠚𐠛𐠜𐠝𐠞𐠟𐠠𐠡𐠢𐠣𐠤𐠥𐠦𐠧𐠨𐠩𐠪𐠫𐠬𐠭𐠮𐠯𐠰𐠱𐠲𐠳𐠴𐠵𐠷𐠸𐠼𐠿
|
27 |
+
::script-name Cypro-Minoan ::n-char 99 ::char 𒾐𒾑𒾒𒾓𒾔𒾕𒾖𒾗𒾘𒾙𒾚𒾛𒾜𒾝𒾞𒾟𒾠𒾡𒾢𒾣𒾤𒾥𒾦𒾧𒾨𒾩𒾪𒾫𒾬𒾭𒾮𒾯𒾰𒾱𒾲𒾳𒾴𒾵𒾶𒾷𒾸𒾹𒾺𒾻𒾼𒾽𒾾𒾿𒿀𒿁𒿂𒿃𒿄𒿅𒿆𒿇𒿈𒿉𒿊𒿋𒿌𒿍𒿎𒿏𒿐𒿑𒿒𒿓𒿔𒿕𒿖𒿗𒿘𒿙𒿚𒿛𒿜𒿝𒿞𒿟𒿠𒿡𒿢𒿣𒿤𒿥𒿦𒿧𒿨𒿩𒿪𒿫𒿬𒿭𒿮𒿯𒿰𒿱𒿲
|
28 |
+
::script-name Cyrillic ::n-char 382 ::char ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿԀԁԂԃԄԅԆԇԈԉԊԋԌԍԎԏԐԑԒԓԔԕԖԗԘԙԚԛԜԝԞԟԠԡԢԣԤԥԦԧԨԩԪԫԬԭԮԯᲀᲁᲂᲃᲄᲅᲆᲇᲈᴫꙀꙁꙂꙃꙄꙅꙆꙇꙈꙉꙊꙋꙌꙍꙎꙏꙐꙑꙒꙓꙔꙕꙖꙗꙘꙙꙚꙛꙜꙝꙞꙟꙠꙡꙢꙣꙤꙥꙦꙧꙨꙩꙪꙫꙬꙭꙮꚀꚁꚂꚃꚄꚅꚆꚇꚈꚉꚊꚋꚌꚍꚎꚏꚐꚑꚒꚓꚔꚕꚖꚗꚘꚙꚚꚛ
|
29 |
+
::script-name Deseret ::n-char 80 ::char 𐐀𐐁𐐂𐐃𐐄𐐅𐐆𐐇𐐈𐐉𐐊𐐋𐐌𐐍𐐎𐐏𐐐𐐑𐐒𐐓𐐔𐐕𐐖𐐗𐐘𐐙𐐚𐐛𐐜𐐝𐐞𐐟𐐠𐐡𐐢𐐣𐐤𐐥𐐦𐐧𐐨𐐩𐐪𐐫𐐬𐐭𐐮𐐯𐐰𐐱𐐲𐐳𐐴𐐵𐐶𐐷𐐸𐐹𐐺𐐻𐐼𐐽𐐾𐐿𐑀𐑁𐑂𐑃𐑄𐑅𐑆𐑇𐑈𐑉𐑊𐑋𐑌𐑍𐑎𐑏
|
30 |
+
::script-name Devanagari ::n-char 125 ::char ऀँंःऄअआइईउऊऋऌऍऎएऐऑऒओऔकखगघङचछजझञटठडढणतथदधनऩपफबभमयरऱलळऴवशषसहऺऻ़ऽािीुूृॄॅॆेैॉॊोौ्ॎॏ॒॑॓॔ॕॖॗक़ख़ग़ज़ड़ढ़फ़य़ॠॡॢॣ॰ॱॲॳॴॵॶॷॸॹॺॻॼॽॾॿꣲꣳꣴꣵꣶꣷ꣸꣼ꣾꣿ ::numeral ०१२३४५६७८९ ::sign-virama ् ::vowel-sign ऺऻािीुूृॄॅॆेैॉॊोौॎॏॕॖॗॢॣꣿ
|
31 |
+
::script-name Dives Akuru ::n-char 55 ::char 𑤀𑤁𑤂𑤃𑤄𑤅𑤆𑤉𑤌𑤍𑤎𑤏𑤐𑤑𑤒𑤓𑤕𑤖𑤘𑤙𑤚𑤛𑤜𑤝𑤞𑤟𑤠𑤡𑤢𑤣𑤤𑤥𑤦𑤧𑤨𑤩𑤪𑤫𑤬𑤭𑤮𑤯𑤰𑤱𑤲𑤳𑤴𑤵𑤷𑤸𑤻𑤼𑤽𑤿𑥃 ::numeral 𑥐𑥑𑥒𑥓𑥔𑥕𑥖𑥗𑥘𑥙 ::vowel-sign 𑤰𑤱𑤲𑤳𑤴𑤵𑤷𑤸
|
32 |
+
::script-name Dogra ::n-char 60 ::char 𑠀𑠁𑠂𑠃𑠄𑠅𑠆𑠇𑠈𑠉𑠊𑠋𑠌𑠍𑠎𑠏𑠐𑠑𑠒𑠓𑠔𑠕𑠖𑠗𑠘𑠙𑠚𑠛𑠜𑠝𑠞𑠟𑠠𑠡𑠢𑠣𑠤𑠥𑠦𑠧𑠨𑠩𑠪𑠫𑠬𑠭𑠮𑠯𑠰𑠱𑠲𑠳𑠴𑠵𑠶𑠷𑠸𑠺𑠹𑠻 ::sign-virama 𑠹 ::vowel-sign 𑠬𑠭𑠮𑠯𑠰𑠱𑠲𑠳𑠴𑠵𑠶
|
33 |
+
::script-name Duployan ::n-char 109 ::char 𛰀𛰁𛰂𛰃𛰄𛰅𛰆𛰇𛰈𛰉𛰊𛰋𛰌𛰍𛰎𛰏𛰐𛰑𛰒𛰓𛰔𛰕𛰖𛰗𛰘𛰙𛰚𛰛𛰜𛰝𛰞𛰟𛰠𛰡𛰢𛰣𛰤𛰥𛰦𛰧𛰨𛰩𛰪𛰫𛰬𛰭𛰮𛰯𛰰𛰱𛰲𛰳𛰴𛰵𛰶𛰷𛰸𛰹𛰺𛰻𛰼𛰽𛰾𛰿𛱀𛱁𛱂𛱃𛱄𛱅𛱆𛱇𛱈𛱉𛱊𛱋𛱌𛱍𛱎𛱏𛱐𛱑𛱒𛱓𛱔𛱕𛱖𛱗𛱘𛱙𛱚𛱛𛱜𛱝𛱞𛱟𛱠𛱡𛱢𛱣𛱤𛱥𛱦𛱧𛱨𛱩𛱪𛲜𛲝
|
34 |
+
::script-name Egyptian Hieroglyph ::n-char 1080 ::char 𓀀𓀁𓀂𓀃𓀄𓀅𓀆𓀇𓀈𓀉𓀊𓀋𓀌𓀍𓀎𓀏𓀐𓀑𓀒𓀓𓀔𓀕𓀖𓀗𓀘𓀙𓀚𓀛𓀜𓀝𓀞𓀟𓀠𓀡𓀢𓀣𓀤𓀥𓀦𓀧𓀨𓀩𓀪𓀫𓀬𓀭𓀮𓀯𓀰𓀱𓀲𓀳𓀴𓀵𓀶𓀷𓀸𓀹𓀺𓀻𓀼𓀽𓀾𓀿𓁀𓁁𓁂𓁃𓁄𓁅𓁆𓁇𓁈𓁉𓁊𓁋𓁌𓁍𓁎𓁏𓁐𓁑𓁒𓁓𓁔𓁕𓁖𓁗𓁘𓁙𓁚𓁛𓁜𓁝𓁞𓁟𓁠𓁡𓁢𓁣𓁤𓁥𓁦𓁧𓁨𓁩𓁪𓁫𓁬𓁭𓁮𓁯𓁰𓁱𓁲𓁳𓁴𓁵𓁶𓁷𓁸𓁹𓁺𓁻𓁼𓁽𓁾𓁿𓂀𓂁𓂂𓂃𓂄𓂅𓂆𓂇𓂈𓂉𓂊𓂋𓂌𓂍𓂎𓂏𓂐𓂑𓂒𓂓𓂔𓂕𓂖𓂗𓂘𓂙𓂚𓂛𓂜𓂝𓂞𓂟𓂠𓂡𓂢𓂣𓂤𓂥𓂦𓂧𓂨𓂩𓂪𓂫𓂬𓂭𓂮𓂯𓂰𓂱𓂲𓂳𓂴𓂵𓂶𓂷𓂸𓂹𓂺𓂻𓂼𓂽𓂾𓂿𓃀𓃁𓃂𓃃𓃄𓃅𓃆𓃇𓃈𓃉𓃊𓃋𓃌𓃍𓃎𓃏𓃐𓃑𓃒𓃓𓃔𓃕𓃖𓃗𓃘𓃙𓃚𓃛𓃜𓃝𓃞𓃟𓃠𓃡𓃢𓃣𓃤𓃥𓃦𓃧𓃨𓃩𓃪𓃫𓃬𓃭𓃮𓃯𓃰𓃱𓃲𓃳𓃴𓃵𓃶𓃷𓃸𓃹𓃺𓃻𓃼𓃽𓃾𓃿𓄀𓄁𓄂𓄃𓄄𓄅𓄆𓄇𓄈𓄉𓄊𓄋𓄌𓄍𓄎𓄏𓄐𓄑𓄒𓄓𓄔𓄕𓄖𓄗𓄘𓄙𓄚𓄛𓄜𓄝𓄞𓄟𓄠𓄡𓄢𓄣𓄤𓄥𓄦𓄧𓄨𓄩𓄪𓄫𓄬𓄭𓄮𓄯𓄰𓄱𓄲𓄳𓄴𓄵𓄶𓄷𓄸𓄹𓄺𓄻𓄼𓄽𓄾𓄿𓅀𓅁𓅂𓅃𓅄𓅅𓅆𓅇𓅈𓅉𓅊𓅋𓅌𓅍𓅎𓅏𓅐𓅑𓅒𓅓𓅔𓅕𓅖𓅗𓅘𓅙𓅚𓅛𓅜𓅝𓅞𓅟𓅠𓅡𓅢𓅣𓅤𓅥𓅦𓅧𓅨𓅩𓅪𓅫𓅬𓅭𓅮𓅯𓅰𓅱𓅲𓅳𓅴𓅵𓅶𓅷𓅸𓅹𓅺𓅻𓅼𓅽𓅾𓅿𓆀𓆁𓆂𓆃𓆄𓆅𓆆𓆇𓆈𓆉𓆊𓆋𓆌𓆍𓆎𓆏𓆐𓆑𓆒𓆓𓆔𓆕𓆖𓆗𓆘𓆙𓆚𓆛𓆜𓆝𓆞𓆟𓆠𓆡𓆢𓆣𓆤𓆥𓆦𓆧𓆨𓆩𓆪𓆫𓆬𓆭𓆮𓆯𓆰𓆱𓆲𓆳𓆴𓆵𓆶𓆷𓆸𓆹𓆺𓆻𓆼𓆽𓆾𓆿𓇀𓇁𓇂𓇃𓇄𓇅𓇆𓇇𓇈𓇉𓇊𓇋𓇌𓇍𓇎𓇏𓇐𓇑𓇒𓇓𓇔𓇕𓇖𓇗𓇘𓇙𓇚𓇛𓇜𓇝𓇞𓇟𓇠𓇡𓇢𓇣𓇤𓇥𓇦𓇧𓇨𓇩𓇪𓇫𓇬𓇭𓇮𓇯𓇰𓇱𓇲𓇳𓇴𓇵𓇶𓇷𓇸𓇹𓇺𓇻𓇼𓇽𓇾𓇿𓈀𓈁𓈂𓈃𓈄𓈅𓈆𓈇𓈈𓈉𓈊𓈋𓈌𓈍𓈎𓈏𓈐𓈑𓈒𓈓𓈔𓈕𓈖𓈗𓈘𓈙𓈚𓈛𓈜𓈝𓈞𓈟𓈠𓈡𓈢𓈣𓈤𓈥𓈦𓈧𓈨𓈩𓈪𓈫𓈬𓈭𓈮𓈯𓈰𓈱𓈲𓈳𓈴𓈵𓈶𓈷𓈸𓈹𓈺𓈻𓈼𓈽𓈾𓈿𓉀𓉁𓉂𓉃𓉄𓉅𓉆𓉇𓉈𓉉𓉊𓉋𓉌𓉍𓉎𓉏𓉐𓉑𓉒𓉓𓉔𓉕𓉖𓉗𓉘𓉙𓉚𓉛𓉜𓉝𓉞𓉟𓉠𓉡𓉢𓉣𓉤𓉥𓉦𓉧𓉨𓉩𓉪𓉫𓉬𓉭𓉮𓉯𓉰𓉱𓉲𓉳𓉴𓉵𓉶𓉷𓉸𓉹𓉺𓉻𓉼𓉽𓉾𓉿𓊀𓊁𓊂𓊃𓊄𓊅𓊆𓊇𓊈𓊉𓊊𓊋𓊌𓊍𓊎𓊏𓊐𓊑𓊒𓊓𓊔𓊕𓊖𓊗𓊘𓊙𓊚𓊛𓊜𓊝𓊞𓊟𓊠𓊡𓊢𓊣𓊤𓊥𓊦𓊧𓊨𓊩𓊪𓊫𓊬𓊭𓊮𓊯𓊰𓊱𓊲𓊳𓊴𓊵𓊶𓊷𓊸𓊹𓊺𓊻𓊼𓊽𓊾𓊿𓋀𓋁𓋂𓋃𓋄𓋅𓋆𓋇𓋈𓋉𓋊𓋋𓋌𓋍𓋎𓋏𓋐𓋑𓋒𓋓𓋔𓋕𓋖𓋗𓋘𓋙𓋚𓋛𓋜𓋝𓋞𓋟𓋠𓋡𓋢𓋣𓋤𓋥𓋦𓋧𓋨𓋩𓋪𓋫𓋬𓋭𓋮𓋯𓋰𓋱𓋲𓋳𓋴𓋵𓋶𓋷𓋸𓋹𓋺𓋻𓋼𓋽𓋾𓋿𓌀𓌁𓌂𓌃𓌄𓌅𓌆𓌇𓌈𓌉𓌊𓌋𓌌𓌍𓌎𓌏𓌐𓌑𓌒𓌓𓌔𓌕𓌖𓌗𓌘𓌙𓌚𓌛𓌜𓌝𓌞𓌟𓌠𓌡𓌢𓌣𓌤𓌥𓌦𓌧𓌨𓌩𓌪𓌫𓌬𓌭𓌮𓌯𓌰𓌱𓌲𓌳𓌴𓌵𓌶𓌷𓌸𓌹𓌺𓌻𓌼𓌽𓌾𓌿𓍀𓍁𓍂𓍃𓍄𓍅𓍆𓍇𓍈𓍉𓍊𓍋𓍌𓍍𓍎𓍏𓍐𓍑𓍒𓍓𓍔𓍕𓍖𓍗𓍘𓍙𓍚𓍛𓍜𓍝𓍞𓍟𓍠𓍡𓍢𓍣𓍤𓍥𓍦𓍧𓍨𓍩𓍪𓍫𓍬𓍭𓍮𓍯𓍰𓍱𓍲𓍳𓍴𓍵𓍶𓍷𓍸𓍹𓍺𓍻𓍼𓍽𓍾𓍿𓎀𓎁𓎂𓎃𓎄𓎅𓎆𓎇𓎈𓎉𓎊𓎋𓎌𓎍𓎎𓎏𓎐𓎑𓎒𓎓𓎔𓎕𓎖𓎗𓎘𓎙𓎚𓎛𓎜𓎝𓎞𓎟𓎠𓎡𓎢𓎣𓎤𓎥𓎦𓎧𓎨𓎩𓎪𓎫𓎬𓎭𓎮𓎯𓎰𓎱𓎲𓎳𓎴𓎵𓎶𓎷𓎸𓎹𓎺𓎻𓎼𓎽𓎾𓎿𓏀𓏁𓏂𓏃𓏄𓏅𓏆𓏇𓏈𓏉𓏊𓏋𓏌𓏍𓏎𓏏𓏐𓏑𓏒𓏓𓏔𓏕𓏖𓏗𓏘𓏙𓏚𓏛𓏜𓏝𓏞𓏟𓏠𓏡𓏢𓏣𓏤𓏥𓏦𓏧𓏨𓏩𓏪𓏫𓏬𓏭𓏮𓏯𓏰𓏱𓏲𓏳𓏴𓏵𓏶𓏷𓏸𓏹𓏺𓏻𓏼𓏽𓏾𓏿𓐀𓐁𓐂𓐃𓐄𓐅𓐆𓐇𓐈𓐉𓐊𓐋𓐌𓐍𓐎𓐏𓐐𓐑𓐒𓐓𓐔𓐕𓐖𓐗𓐘𓐙𓐚𓐛𓐜𓐝𓐞𓐟𓐠𓐡𓐢𓐣𓐤𓐥𓐦𓐧𓐨𓐩𓐪𓐫𓐬𓐭𓐮
|
35 |
+
::script-name Elbasan ::n-char 40 ::char 𐔀𐔁𐔂𐔃𐔄𐔅𐔆𐔇𐔈𐔉𐔊𐔋𐔌𐔍𐔎𐔏𐔐𐔑𐔒𐔓𐔔𐔕𐔖𐔗𐔘𐔙𐔚𐔛𐔜𐔝𐔞𐔟𐔠𐔡𐔢𐔣𐔤𐔥𐔦𐔧
|
36 |
+
::script-name Elymaic ::n-char 23 ::char 𐿠𐿡𐿢𐿣𐿤𐿥𐿦𐿧𐿨𐿩𐿪𐿫𐿬𐿭𐿮𐿯𐿰𐿱𐿲𐿳𐿴𐿵𐿶
|
37 |
+
::script-name Ethiopic ::n-char 483 ::char ሀሁሂሃሄህሆሇለሉሊላሌልሎሏሐሑሒሓሔሕሖሗመሙሚማሜምሞሟሠሡሢሣሤሥሦሧረሩሪራሬርሮሯሰሱሲሳሴስሶሷሸሹሺሻሼሽሾሿቀቁቂቃቄቅቆቇቈቊቋቌቍቐቑቒቓቔቕቖቘቚቛቜቝበቡቢባቤብቦቧቨቩቪቫቬቭቮቯተቱቲታቴትቶቷቸቹቺቻቼችቾቿኀኁኂኃኄኅኆኇኈኊኋኌኍነኑኒናኔንኖኗኘኙኚኛኜኝኞኟአኡኢኣኤእኦኧከኩኪካኬክኮኯኰኲኳኴኵኸኹኺኻኼኽኾዀዂዃዄዅወዉዊዋዌውዎዏዐዑዒዓዔዕዖዘዙዚዛዜዝዞዟዠዡዢዣዤዥዦዧየዩዪያዬይዮዯደዱዲዳዴድዶዷዸዹዺዻዼዽዾዿጀጁጂጃጄጅጆጇገጉጊጋጌግጎጏጐጒጓጔጕጘጙጚጛጜጝጞጟጠጡጢጣጤጥጦጧጨጩጪጫጬጭጮጯጰጱጲጳጴጵጶጷጸጹጺጻጼጽጾጿፀፁፂፃፄፅፆፇፈፉፊፋፌፍፎፏፐፑፒፓፔፕፖፗፘፙፚ፝፞ᎀᎁᎂᎃᎄᎅᎆᎇᎈᎉᎊᎋᎌᎍᎎᎏⶀⶁⶂⶃⶄⶅⶆⶇⶈⶉⶊⶋⶌⶍⶎⶏⶐⶑⶒⶓⶔⶕⶖⶠⶡⶢⶣⶤⶥⶦⶨⶩⶪⶫⶬⶭⶮⶰⶱⶲⶳⶴⶵⶶⶸⶹⶺⶻⶼⶽⶾⷀⷁⷂⷃⷄⷅⷆⷈⷉⷊⷋⷌⷍⷎⷐⷑⷒⷓⷔⷕⷖⷘⷙⷚⷛⷜⷝⷞꬁꬂꬃꬄꬅꬆꬉꬊꬋꬌꬍꬎꬑꬒꬓꬔꬕꬖꬠꬡꬢꬣꬤꬥꬦꬨꬩꬪꬫꬬꬭꬮ𞟠𞟡𞟢𞟣𞟤𞟥𞟦𞟨𞟩𞟪𞟫𞟭𞟮𞟰𞟱𞟲𞟳𞟴𞟵𞟶𞟷𞟸𞟹𞟺𞟻𞟼𞟽𞟾 ::numeral ፩፪፫፬፭፮፯፰፱፲፳፴፵፶፷፸፹፺፻፼
|
38 |
+
::script-name Extended Arabic-Indic ::numeral ۰۱۲۳۴۵۶۷۸۹
|
39 |
+
::script-name Georgian ::n-char 172 ::char ႠႡႢႣႤႥႦႧႨႩႪႫႬႭႮႯႰႱႲႳႴႵႶႷႸႹႺႻႼႽႾႿჀჁჂჃჄჅჇჍაბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰჱჲჳჴჵჶჷჸჹჺჽჾჿᲐᲑᲒᲓᲔᲕᲖᲗᲘᲙᲚᲛᲜᲝᲞᲟᲠᲡᲢᲣᲤᲥᲦᲧᲨᲩᲪᲫᲬᲭᲮᲯᲰᲱᲲᲳᲴᲵᲶᲷᲸᲹᲺᲽᲾᲿⴀⴁⴂⴃⴄⴅⴆⴇⴈⴉⴊⴋⴌⴍⴎⴏⴐⴑⴒⴓⴔⴕⴖⴗⴘⴙⴚⴛⴜⴝⴞⴟⴠⴡⴢⴣⴤⴥⴧⴭ
|
40 |
+
::script-name Glagolitic ::n-char 96 ::char ⰀⰁⰂⰃⰄⰅⰆⰇⰈⰉⰊⰋⰌⰍⰎⰏⰐⰑⰒⰓⰔⰕⰖⰗⰘⰙⰚⰛⰜⰝⰞⰟⰠⰡⰢⰣⰤⰥⰦⰧⰨⰩⰪⰫⰬⰭⰮⰯⰰⰱⰲⰳⰴⰵⰶⰷⰸⰹⰺⰻⰼⰽⰾⰿⱀⱁⱂⱃⱄⱅⱆⱇⱈⱉⱊⱋⱌⱍⱎⱏⱐⱑⱒⱓⱔⱕⱖⱗⱘⱙⱚⱛⱜⱝⱞⱟ
|
41 |
+
::script-name Gothic ::n-char 27 ::char 𐌰𐌱𐌲𐌳𐌴𐌵𐌶𐌷𐌸𐌹𐌺𐌻𐌼𐌽𐌾𐌿𐍀𐍁𐍂𐍃𐍄𐍅𐍆𐍇𐍈𐍉𐍊
|
42 |
+
::script-name Grantha ::n-char 72 ::char 𑌀𑌁𑌂𑌃𑌅𑌆𑌇𑌈𑌉𑌊𑌋𑌌𑌏𑌐𑌓𑌔𑌕𑌖𑌗𑌘𑌙𑌚𑌛𑌜𑌝𑌞𑌟𑌠𑌡𑌢𑌣𑌤𑌥𑌦𑌧𑌨𑌪𑌫𑌬𑌭𑌮𑌯𑌰𑌲𑌳𑌵𑌶𑌷𑌸𑌹𑌼𑌽𑌾𑌿𑍀𑍁𑍂𑍃𑍄𑍇𑍈𑍋𑍌𑍍𑍗𑍝𑍞𑍟𑍠𑍡𑍢𑍣 ::sign-virama 𑍍 ::vowel-sign 𑌾𑌿𑍀𑍁𑍂𑍃𑍄𑍇𑍈𑍋𑍌𑍢𑍣
|
43 |
+
::script-name Greek ::n-char 346 ::char ͰͱͲͳʹ͵ͶͷͿΆΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϘϙϚϛϜϝϞϟϠϡϳϷϸϺϻᴦᴧᴨᴩᴪᵦᵧᵨᵩᵪἀἁἂἃἄἅἆἇἈἉἊἋἌἍἎἏἐἑἒἓἔἕἘἙἚἛἜἝἠἡἢἣἤἥἦἧἨἩἪἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἺἻἼἽἾἿὀὁὂὃὄὅὈὉὊὋὌὍὐὑὒὓὔὕὖὗὙὛὝὟὠὡὢὣὤὥὦὧὨὩὪὫὬὭὮὯὰάὲέὴήὶίὸόὺύὼώᾀᾁᾂᾃᾄᾅᾆᾇᾈᾉᾊᾋᾌᾍᾎᾏᾐᾑᾒᾓᾔᾕᾖᾗᾘᾙᾚᾛᾜᾝᾞᾟᾠᾡᾢᾣᾤᾥᾦᾧᾨᾩᾪᾫᾬᾭᾮᾯᾰᾱᾲᾳᾴᾶᾷᾸᾹᾺΆᾼῂῃῄῆῇῈΈῊΉῌῐῑῒΐῖῗῘῙῚΊῠῡῢΰῤῥῦῧῨῩῪΎῬῲῳῴῶῷῸΌῺΏῼꭥ𐅵𐅶𐅷𐅸𐅹𐅺𐅻𐅼𐅽𐅾𐅿𐆀𐆁𐆂𐆃𐆄𐆅𐆆𐆇𐆈𐆉𐆊𐆋𐆌𐆍 ::numeral ʹ͵
|
44 |
+
::script-name Gujarati ::n-char 80 ::char ઁંઃઅઆઇઈઉઊઋઌઍએઐઑઓઔકખગઘઙચછજઝઞટઠડઢણતથદધનપફબભમયરલળવશષસહ઼ઽાિીુૂૃૄૅેૈૉોૌ્ૠૡૢૣ૰૱ૹૺૻૼ૽૾૿ ::numeral ૦૧૨૩૪૫૬૭૮૯ ::sign-virama ્ ::vowel-sign ાિીુૂૃૄૅેૈૉોૌૢૣ
|
45 |
+
::script-name Gunjala Gondi ::n-char 51 ::char 𑵠𑵡𑵢𑵣𑵤𑵥𑵧𑵨𑵪𑵫𑵬𑵭𑵮𑵯𑵰𑵱𑵲𑵳𑵴𑵵𑵶𑵷𑵸𑵹𑵺𑵻𑵼𑵽𑵾𑵿𑶀𑶁𑶂𑶃𑶄𑶅𑶆𑶇𑶈𑶉𑶊𑶋𑶌𑶍𑶎𑶐𑶑𑶓𑶔𑶕𑶖 ::numeral 𑶠𑶡𑶢𑶣𑶤𑶥𑶦𑶧𑶨𑶩 ::vowel-sign 𑶊𑶋𑶌𑶍𑶎𑶐𑶑𑶓𑶔
|
46 |
+
::script-name Gurmukhi ::n-char 69 ::char ਁਂਃਅਆਇਈਉਊਏਐਓਔਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਲ਼ਵਸ਼ਸਹ਼ਾਿੀੁੂੇੈੋੌ੍ੑਖ਼ਗ਼ਜ਼ੜਫ਼ੰੱੲੳੵ੶ ::numeral ੦੧੨੩੪੫੬੭੮੯ ::sign-virama ੍ ::vowel-sign ਾਿੀੁੂੇੈੋੌ
|
47 |
+
::script-name Hangzhou ::numeral 〡〢〣〤〥〦〧〨〩〸〹〺
|
48 |
+
::script-name Hanifi Rohingya ::n-char 38 ::char 𐴀𐴁𐴂𐴃𐴄𐴅𐴆𐴇𐴈𐴉𐴊𐴋𐴌𐴍𐴎𐴏𐴐𐴑𐴒𐴓𐴔𐴕𐴖𐴗𐴘𐴙𐴚𐴛𐴜𐴝𐴞𐴟𐴠𐴡𐴤𐴥𐴦𐴧 ::numeral 𐴰𐴱𐴲𐴳𐴴𐴵𐴶𐴷𐴸𐴹
|
49 |
+
::script-name Hanunoo ::n-char 21 ::char ᜠᜡᜢᜣᜤᜥᜦᜧᜨᜩᜪᜫᜬᜭᜮᜯᜰᜱᜲᜳ᜴ ::vowel-sign ᜲᜳ
|
50 |
+
::script-name Hatran ::n-char 21 ::char 𐣠𐣡𐣢𐣣𐣤𐣥𐣦𐣧𐣨𐣩𐣪𐣫𐣬𐣭𐣮𐣯𐣰𐣱𐣲𐣴𐣵 ::numeral 𐣻𐣼𐣽𐣾𐣿
|
51 |
+
::script-name Hebrew ::n-char 124 ::char ְֱֲֳִֵֶַָׇֹֺֻּֽֿׁׂ֑֖֛֢֣֤֥֦֧֪֚֭֮֒֓֔֕֗֘֙֜֝֞֟֠֡֨֩֫֬אבגדהוזחטיךכלםמןנסעףפץצקרשתװױײיִﬞײַﬠﬡﬢﬣﬤﬥﬦﬧﬨ﬩שׁשׂשּׁשּׂאַאָאּבּגּדּהּוּזּטּיּךּכּלּמּנּסּףּפּצּקּרּשּתּוֹבֿכֿפֿﭏ
|
52 |
+
::script-name Hiragana ::n-char 91 ::char ぁあぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ𛀁𛄟𛅐𛅑𛅒
|
53 |
+
::script-name Imperial Aramaic ::n-char 23 ::char 𐡀𐡁𐡂𐡃𐡄𐡅𐡆𐡇𐡈𐡉𐡊𐡋𐡌𐡍𐡎𐡏𐡐𐡑𐡒𐡓𐡔𐡕𐡗 ::numeral 𐡘𐡙𐡚𐡛𐡜𐡝𐡞𐡟
|
54 |
+
::script-name Indic Siyaq ::numeral 𞱱𞱲𞱳𞱴𞱵𞱶𞱷𞱸𞱹𞱺𞱻𞱼𞱽𞱾𞱿𞲀𞲁𞲂𞲃𞲄𞲅𞲆𞲇𞲈𞲉𞲊𞲋𞲌𞲍𞲎𞲏𞲐𞲑𞲒𞲓𞲔𞲕𞲖𞲗𞲘𞲙𞲚𞲛𞲜𞲝𞲞𞲟𞲡𞲢𞲣𞲤𞲥𞲦𞲧𞲨𞲩𞲪𞲫𞲭𞲮𞲯𞲱𞲲𞲳
|
55 |
+
::script-name Inscriptional Pahlavi ::n-char 19 ::char 𐭠𐭡𐭢𐭣𐭤𐭥𐭦𐭧𐭨𐭩𐭪𐭫𐭬𐭭𐭮𐭯𐭰𐭱𐭲 ::numeral 𐭸𐭹𐭺𐭻𐭼𐭽𐭾𐭿
|
56 |
+
::script-name Inscriptional Parthian ::n-char 22 ::char 𐭀𐭁𐭂𐭃𐭄𐭅𐭆𐭇𐭈𐭉𐭊𐭋𐭌𐭍𐭎𐭏𐭐𐭑𐭒𐭓𐭔𐭕 ::numeral 𐭘𐭙𐭚𐭛𐭜𐭝𐭞𐭟
|
57 |
+
::script-name Javanese ::n-char 64 ::char ꦀꦁꦂꦃꦄꦅꦆꦇꦈꦉꦊꦋꦌꦍꦎꦏꦐꦑꦒꦓꦔꦕꦖꦗꦘꦙꦚꦛꦜꦝꦞꦟꦠꦡꦢꦣꦤꦥꦦꦧꦨꦩꦪꦫꦬꦭꦮꦯꦰꦱꦲ꦳ꦴꦵꦶꦷꦸꦹꦺꦻꦼꦽꦾꦿ ::numeral ꧐꧑꧒꧓꧔꧕꧖꧗꧘꧙ ::vowel-sign ꦴꦵꦶꦷꦸꦹꦺꦻꦼ
|
58 |
+
::script-name Kaithi ::n-char 64 ::char 𑂀𑂁𑂂𑂃𑂄𑂅𑂆𑂇𑂈𑂉𑂊𑂋𑂌𑂍𑂎𑂏𑂐𑂑𑂒𑂓𑂔𑂕𑂖𑂗𑂘𑂙𑂚𑂛𑂜𑂝𑂞𑂟𑂠𑂡𑂢𑂣𑂤𑂥𑂦𑂧𑂨𑂩𑂪𑂫𑂬𑂭𑂮𑂯𑂰𑂱𑂲𑂳𑂴𑂵𑂶𑂷𑂸𑂺𑂹𑂻𑂼𑃂 ::numeral ::sign-virama 𑂹 ::vowel-sign 𑂰𑂱𑂲𑂳𑂴𑂵𑂶𑂷𑂸𑃂
|
59 |
+
::script-name Kannada ::n-char 78 ::char ಀಁಂಃ಄ಅಆಇಈಉಊಋಌಎಏಐಒಓಔಕಖಗಘಙಚಛಜಝಞಟಠಡಢಣತಥದಧನಪಫಬಭಮಯರಱಲಳವಶಷಸಹ಼ಽಾಿೀುೂೃೄೆೇೈೊೋೌ್ೝೞೠೡೢೣೱೲ ::numeral ೦೧೨೩೪೫೬೭೮೯ ::sign-virama ್ ::vowel-sign ಾಿೀುೂೃೄೆೇೈೊೋೌೢೣ
|
60 |
+
::script-name Katakana ::n-char 127 ::char ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺㇰㇱㇲㇳㇴㇵㇶㇷㇸㇹㇺㇻㇼㇽㇾㇿ𚿰𚿱𚿲𚿳𚿵𚿶𚿷𚿸𚿹𚿺𚿻𚿽𚿾𛀀𛄠𛄡𛄢𛅤𛅥𛅦𛅧
|
61 |
+
::script-name Kayah Li ::n-char 35 ::char ꤊꤋꤌꤍꤎꤏꤐꤑꤒꤓꤔꤕꤖꤗꤘꤙꤚꤛꤜꤝꤞꤟꤠꤡꤢꤣꤤꤥꤦꤧꤨꤩꤪ꤮꤯ ::numeral ꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉
|
62 |
+
::script-name Kharoshthi ::n-char 49 ::char 𐨀𐨁𐨂𐨃𐨅𐨆𐨌𐨍𐨎𐨏𐨐𐨑𐨒𐨓𐨕𐨖𐨗𐨙𐨚𐨛𐨜𐨝𐨞𐨟𐨠𐨡𐨢𐨣𐨤𐨥𐨦𐨧𐨨𐨩𐨪𐨫𐨬𐨭𐨮𐨯𐨰𐨱𐨲𐨳𐨴𐨵𐨹𐨺𐨸 ::numeral 𐩀𐩁𐩂𐩃𐩄𐩅𐩆𐩇𐩈 ::vowel-sign 𐨁𐨂𐨃𐨅𐨆
|
63 |
+
::script-name Khitan Small Script ::n-char 470 ::char 𘬀𘬁𘬂𘬃𘬄𘬅𘬆𘬇𘬈𘬉𘬊𘬋𘬌𘬍𘬎𘬏𘬐𘬑𘬒𘬓𘬔𘬕𘬖𘬗𘬘𘬙𘬚𘬛𘬜𘬝𘬞𘬟𘬠𘬡𘬢𘬣𘬤𘬥𘬦𘬧𘬨𘬩𘬪𘬫𘬬𘬭𘬮𘬯𘬰𘬱𘬲𘬳𘬴𘬵𘬶𘬷𘬸𘬹𘬺𘬻𘬼𘬽𘬾𘬿𘭀𘭁𘭂𘭃𘭄𘭅𘭆𘭇𘭈𘭉𘭊𘭋𘭌𘭍𘭎𘭏𘭐𘭑𘭒𘭓𘭔𘭕𘭖𘭗𘭘𘭙𘭚𘭛𘭜𘭝𘭞𘭟𘭠𘭡𘭢𘭣𘭤𘭥𘭦𘭧𘭨𘭩𘭪𘭫𘭬𘭭𘭮𘭯𘭰𘭱𘭲𘭳𘭴𘭵𘭶𘭷𘭸𘭹𘭺𘭻𘭼𘭽𘭾𘭿𘮀𘮁𘮂𘮃𘮄𘮅𘮆𘮇𘮈𘮉𘮊𘮋𘮌𘮍𘮎𘮏𘮐����𘮒𘮓𘮔𘮕𘮖𘮗𘮘𘮙𘮚𘮛𘮜𘮝𘮞𘮟𘮠𘮡𘮢𘮣𘮤𘮥𘮦𘮧𘮨𘮩𘮪𘮫𘮬𘮭𘮮𘮯𘮰𘮱𘮲𘮳𘮴𘮵𘮶𘮷𘮸𘮹𘮺𘮻𘮼𘮽𘮾𘮿𘯀𘯁𘯂𘯃𘯄𘯅𘯆𘯇𘯈𘯉𘯊𘯋𘯌𘯍𘯎𘯏𘯐𘯑𘯒𘯓𘯔𘯕𘯖𘯗𘯘𘯙𘯚𘯛𘯜𘯝𘯞𘯟𘯠𘯡𘯢𘯣𘯤𘯥𘯦𘯧𘯨𘯩𘯪𘯫𘯬𘯭𘯮𘯯𘯰𘯱𘯲𘯳𘯴𘯵𘯶𘯷𘯸𘯹𘯺𘯻𘯼𘯽𘯾𘯿𘰀𘰁𘰂𘰃𘰄𘰅𘰆𘰇𘰈𘰉𘰊𘰋𘰌𘰍𘰎𘰏𘰐𘰑𘰒𘰓𘰔𘰕𘰖𘰗𘰘𘰙𘰚𘰛𘰜𘰝𘰞𘰟𘰠𘰡𘰢𘰣𘰤𘰥𘰦𘰧𘰨𘰩𘰪𘰫𘰬𘰭𘰮𘰯𘰰𘰱𘰲𘰳𘰴𘰵𘰶𘰷𘰸𘰹𘰺𘰻𘰼𘰽𘰾𘰿𘱀𘱁𘱂𘱃𘱄𘱅𘱆𘱇𘱈𘱉𘱊𘱋𘱌𘱍𘱎𘱏𘱐𘱑𘱒𘱓𘱔𘱕𘱖𘱗𘱘𘱙𘱚𘱛𘱜𘱝𘱞𘱟𘱠𘱡𘱢𘱣𘱤𘱥𘱦𘱧𘱨𘱩𘱪𘱫𘱬𘱭𘱮𘱯𘱰𘱱𘱲𘱳𘱴𘱵𘱶𘱷𘱸𘱹𘱺𘱻𘱼𘱽𘱾𘱿𘲀𘲁𘲂𘲃𘲄𘲅𘲆𘲇𘲈𘲉𘲊𘲋𘲌𘲍𘲎𘲏𘲐𘲑𘲒𘲓𘲔𘲕𘲖𘲗𘲘𘲙𘲚𘲛𘲜𘲝𘲞𘲟𘲠𘲡𘲢𘲣𘲤𘲥𘲦𘲧𘲨𘲩𘲪𘲫𘲬𘲭𘲮𘲯𘲰𘲱𘲲𘲳𘲴𘲵𘲶𘲷𘲸𘲹𘲺𘲻𘲼𘲽𘲾𘲿𘳀𘳁𘳂𘳃𘳄𘳅𘳆𘳇𘳈𘳉𘳊𘳋𘳌𘳍𘳎𘳏𘳐𘳑𘳒𘳓𘳔𘳕
|
64 |
+
::script-name Khmer ::n-char 93 ::char កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវឝឞសហឡអឣឤឥឦឧឨឩឪឫឬឭឮឯឰឱឲឳ឴឵ាិីឹឺុូួើឿៀេែៃោៅំះៈ៉៊់៌៍៎៏័៑្៓។៕៖ៗ៘៙៚ៜ៝ ::numeral ០១២៣៤៥៦៧៨៩ ::sign-virama ្ ::vowel-sign ាិីឹឺុូួើឿៀេែៃោៅ
|
65 |
+
::script-name Khojki ::n-char 57 ::char 𑈀𑈁𑈂𑈃𑈄𑈅𑈆𑈇𑈈𑈉𑈊𑈋𑈌𑈍𑈎𑈏𑈐𑈑𑈓𑈔𑈕𑈖𑈗𑈘𑈙𑈚𑈛𑈜𑈝𑈞𑈟𑈠𑈡𑈢𑈣𑈤𑈥𑈦𑈧𑈨𑈩𑈪𑈫𑈬𑈭𑈮𑈯𑈰𑈱𑈲𑈳𑈴𑈶𑈵𑈷𑈽𑈾 ::sign-virama 𑈵 ::vowel-sign 𑈬𑈭𑈮𑈯𑈰𑈱𑈲𑈳
|
66 |
+
::script-name Khudawadi ::n-char 59 ::char 𑊰𑊱𑊲𑊳𑊴𑊵𑊶𑊷𑊸𑊹𑊺𑊻𑊼𑊽𑊾𑊿𑋀𑋁𑋂𑋃𑋄𑋅𑋆𑋇𑋈𑋉𑋊𑋋𑋌𑋍𑋎𑋏𑋐𑋑𑋒𑋓𑋔𑋕𑋖𑋗𑋘𑋙𑋚𑋛𑋜𑋝𑋞𑋟𑋠𑋡𑋢𑋣𑋤𑋥𑋦𑋧𑋨𑋩𑋪 ::numeral 𑋰𑋱𑋲𑋳𑋴𑋵𑋶𑋷𑋸𑋹 ::sign-virama 𑋪 ::vowel-sign 𑋠𑋡𑋢𑋣𑋤𑋥𑋦𑋧𑋨
|
67 |
+
::script-name Klingon ::n-char 26 ::char ::numeral
|
68 |
+
::script-name Lao ::n-char 62 ::char ກຂຄຆງຈຉຊຌຍຎຏຐຑຒຓດຕຖທຘນບປຜຝພຟຠມຢຣລວຨຩສຫຬອຮະັາຳິີຶື຺ຸູົຼຽເແໂໃໄໞໟ ::numeral ໐໑໒໓໔໕໖໗໘໙ ::vowel-sign ະັາຳິີຶືຸູົເແໂໃໄ
|
69 |
+
::script-name Latin ::n-char 1207 ::char ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷƸƹƺƻƼƽƾƿǀǁǂǃDŽDždžLJLjljNJNjnjǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩǪǫǬǭǮǯǰDZDzdzǴǵǶǷǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȜȝȞȟȠȡȢȣȤȥȦȧȨȩȪȫȬȭȮȯȰȱȲȳȴȵȶȷȸȹȺȻȼȽȾȿɀɁɂɃɄɅɆɇɈɉɊɋɌɍɎɏɐɑɒɓɔɕɖɗɘəɚɛɜɝɞɟɠɡɢɣɤɥɦɧɨɩɪɫɬɭɮɯɰɱɲɳɴɵɶɷɸɹɺɻɼɽɾɿʀʁʂʃʄʅʆʇʈʉʊʋʌʍʎʏʐʑʒʓʔʕʖʗʘʙʚʛʜʝʞʟʠʡʢʣʤʥʦʧʨʩʪʫʬʭʮʯᴀᴁᴂᴃᴄᴅᴆᴇᴈᴉᴊᴋᴌᴍᴎᴏᴐᴑᴒᴓᴔᴕᴖᴗᴘᴙᴚᴛᴜᴝᴞᴟᴠᴡᴢᴣᴤᴥᵢᵣᵤᵥᵫᵬᵭᵮᵯᵰᵱᵲᵳᵴᵵᵶᵷᵹᵺᵻᵼᵽᵾᵿᶀᶁᶂᶃᶄᶅᶆᶇᶈᶉᶊᶋᶌᶍᶎᶏᶐᶑᶒᶓᶔᶕᶖᶗᶘᶙᶚḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿẀẁẂẃẄẅẆẇẈẉẊẋẌẍẎẏẐẑẒẓẔẕẖẗẘẙẚẛẜẝẞẟẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹỺỻỼỽỾỿₐₑₒₓₔₕₖₗₘₙₚₛₜↄⱠⱡⱢⱣⱤⱥⱦⱧⱨⱩⱪⱫⱬⱭⱮⱯⱰⱱⱲⱳⱴⱵⱶⱷⱸⱹⱺⱻⱼⱾⱿꜢꜣꜤꜥꜦꜧꜨꜩꜪꜫꜬꜭꜮꜯꜰꜱꜲꜳꜴꜵꜶꜷꜸꜹꜺꜻꜼꜽꜾꜿꝀꝁꝂꝃꝄꝅꝆꝇꝈꝉꝊꝋꝌꝍꝎꝏꝐꝑꝒꝓꝔꝕꝖꝗꝘꝙꝚꝛꝜꝝꝞꝟꝠꝡꝢꝣꝤꝥꝦꝧꝨꝩꝪꝫꝬꝭꝮꝯꝱꝲꝳꝴꝵꝶꝷꝸꝹꝺꝻꝼꝽꝾꝿꞀꞁꞂꞃꞄꞅꞆꞇꞋꞌꞍꞎꞏꞐꞑꞒꞓꞔꞕꞖꞗꞘꞙꞚꞛꞜꞝꞞꞟꞠꞡꞢꞣꞤꞥꞦꞧꞨꞩꞪꞫꞬꞭꞮꞯꞰꞱꞲꞳꞴꞵꞶꞷꞸꞹꞺꞻꞼꞽꞾꞿꟀꟁꟂꟃꟄꟅꟆꟇꟈꟉꟊꟐꟑꟓꟕꟖꟗꟘꟙꟵꟶꟷꟺꟻꟼꟽꟾꟿꬰꬱꬲꬳꬴꬵꬶꬷꬸꬹꬺꬻꬼꬽꬾꬿꭀꭁꭂꭃꭄꭅꭆꭇꭈꭉꭊꭋꭌꭍꭎꭏꭐꭑꭒꭓꭔꭕꭖꭗꭘꭙꭚꭠꭡꭢꭣꭤꭦꭧꭨfffiflffifflſtst𝼀𝼁𝼂𝼃𝼄𝼅𝼆𝼇𝼈𝼉𝼊𝼋𝼌𝼍𝼎𝼏𝼐𝼑𝼒𝼓𝼔𝼕𝼖𝼗𝼘𝼙𝼚𝼛𝼜𝼝𝼞
|
70 |
+
::script-name Lepcha ::n-char 59 ::char ᰀᰁᰂᰃᰄᰅᰆᰇᰈᰉᰊᰋᰌᰍᰎᰏᰐᰑᰒᰓᰔᰕᰖᰗᰘᰙᰚᰛᰜᰝᰞᰟᰠᰡᰢᰣᰤᰥᰦᰧᰨᰩᰪᰫᰬᰭᰮᰯᰰᰱᰲᰳᰴᰵᰶ᰷ᱍᱎᱏ ::numeral ᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉ ::vowel-sign ᰦᰧᰨᰩᰪᰫᰬ
|
71 |
+
::script-name Limbu ::n-char 56 ::char ᤀᤁᤂᤃᤄᤅᤆᤇᤈᤉᤊᤋᤌᤍᤎᤏᤐᤑᤒᤓᤔᤕᤖᤗᤘᤙᤚᤛᤜᤝᤞᤠᤡᤢᤣᤤᤥᤦᤧᤨᤩᤪᤫᤰᤱᤲᤳᤴᤵᤶᤷᤸ᤻᤹᤺᥀ ::numeral ᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏ ::vowel-sign ᤠᤡᤢᤣᤤᤥᤦᤧᤨ
|
72 |
+
::script-name Linear A ::n-char 341 ::char 𐘀𐘁𐘂𐘃𐘄𐘅𐘆𐘇𐘈𐘉𐘊𐘋𐘌𐘍𐘎𐘏𐘐𐘑𐘒𐘓𐘔𐘕𐘖𐘗𐘘𐘙𐘚𐘛𐘜𐘝𐘞𐘟𐘠𐘡𐘢𐘣𐘤𐘥𐘦𐘧𐘨𐘩𐘪𐘫𐘬𐘭𐘮𐘯𐘰𐘱𐘲𐘳𐘴𐘵𐘶𐘷𐘸𐘹𐘺𐘻𐘼𐘽𐘾𐘿𐙀𐙁𐙂𐙃𐙄𐙅𐙆𐙇𐙈𐙉𐙊𐙋𐙌𐙍𐙎𐙏𐙐𐙑𐙒𐙓𐙔𐙕𐙖𐙗𐙘𐙙𐙚𐙛𐙜𐙝𐙞𐙟𐙠𐙡𐙢𐙣𐙤𐙥𐙦𐙧𐙨𐙩𐙪𐙫𐙬𐙭𐙮𐙯𐙰𐙱𐙲𐙳𐙴𐙵𐙶𐙷𐙸𐙹𐙺𐙻𐙼𐙽𐙾𐙿𐚀𐚁𐚂𐚃𐚄𐚅𐚆𐚇𐚈𐚉𐚊𐚋𐚌𐚍𐚎𐚏𐚐𐚑𐚒𐚓𐚔𐚕𐚖𐚗𐚘𐚙𐚚𐚛𐚜𐚝𐚞𐚟𐚠𐚡𐚢𐚣𐚤𐚥𐚦𐚧𐚨𐚩𐚪𐚫𐚬𐚭𐚮𐚯𐚰𐚱𐚲𐚳𐚴𐚵𐚶𐚷𐚸𐚹𐚺𐚻𐚼𐚽𐚾𐚿𐛀𐛁𐛂𐛃𐛄𐛅𐛆𐛇𐛈𐛉𐛊𐛋𐛌𐛍𐛎𐛏𐛐𐛑𐛒𐛓𐛔𐛕𐛖𐛗𐛘𐛙𐛚𐛛𐛜𐛝𐛞𐛟𐛠𐛡𐛢𐛣𐛤𐛥𐛦𐛧𐛨𐛩𐛪𐛫𐛬𐛭𐛮𐛯𐛰𐛱𐛲𐛳𐛴𐛵𐛶𐛷𐛸𐛹𐛺𐛻𐛼𐛽𐛾𐛿𐜀𐜁𐜂𐜃𐜄𐜅𐜆𐜇𐜈𐜉𐜊𐜋𐜌𐜍𐜎𐜏𐜐𐜑𐜒𐜓𐜔𐜕𐜖𐜗𐜘𐜙𐜚𐜛𐜜𐜝𐜞𐜟𐜠𐜡𐜢𐜣𐜤𐜥𐜦𐜧𐜨𐜩𐜪𐜫𐜬𐜭𐜮𐜯𐜰𐜱𐜲𐜳𐜴𐜵𐜶𐝀𐝁𐝂𐝃𐝄𐝅𐝆𐝇𐝈𐝉𐝊𐝋𐝌𐝍𐝎𐝏𐝐𐝑𐝒𐝓𐝔𐝕𐝠𐝡𐝢𐝣𐝤𐝥𐝦𐝧
|
73 |
+
::script-name Linear B ::n-char 74 ::char 𐀀𐀁𐀂𐀃𐀄𐀅𐀆𐀇𐀈𐀉𐀊𐀋𐀍𐀎𐀏𐀐𐀑𐀒𐀓𐀔𐀕𐀖𐀗𐀘𐀙𐀚𐀛𐀜𐀝𐀞𐀟𐀠𐀡𐀢𐀣𐀤𐀥𐀦𐀨𐀩𐀪𐀫𐀬𐀭𐀮𐀯𐀰𐀱𐀲𐀳𐀴𐀵𐀶𐀷𐀸𐀹𐀺𐀼𐀽𐀿𐁀𐁁𐁂𐁃𐁄𐁅𐁆𐁇𐁈𐁉𐁊𐁋𐁌𐁍
|
74 |
+
::script-name Lisu ::n-char 47 ::char ꓐꓑꓒꓓꓔꓕꓖꓗꓘꓙꓚꓛꓜꓝꓞꓟꓠꓡꓢꓣꓤꓥꓦꓧꓨꓩꓪꓫꓬꓭꓮꓯꓰꓱꓲꓳꓴꓵꓶꓷꓸꓹꓺꓻꓼꓽ𑾰
|
75 |
+
::script-name Lycian ::n-char 29 ::char 𐊀𐊁𐊂𐊃𐊄𐊅𐊆𐊇𐊈𐊉𐊊𐊋𐊌𐊍𐊎𐊏𐊐𐊑𐊒𐊓𐊔𐊕𐊖𐊗𐊘𐊙𐊚𐊛𐊜
|
76 |
+
::script-name Lydian ::n-char 26 ::char 𐤠𐤡𐤢𐤣𐤤𐤥𐤦𐤧𐤨𐤩𐤪𐤫𐤬𐤭𐤮𐤯𐤰𐤱𐤲𐤳𐤴𐤵𐤶𐤷𐤸𐤹
|
77 |
+
::script-name Mahajani ::n-char 38 ::char 𑅐𑅑𑅒𑅓𑅔𑅕𑅖𑅗𑅘𑅙𑅚𑅛𑅜𑅝𑅞𑅟𑅠𑅡𑅢𑅣𑅤𑅥𑅦𑅧𑅨𑅩𑅪𑅫𑅬𑅭𑅮𑅯𑅰𑅱𑅲𑅳𑅴𑅶
|
78 |
+
::script-name Makasar ::n-char 22 ::char 𑻠𑻡𑻢𑻣𑻤𑻥𑻦𑻧𑻨𑻩𑻪𑻫𑻬𑻭𑻮𑻯𑻰𑻱𑻳𑻴𑻵𑻶 ::vowel-sign 𑻳𑻴𑻵𑻶
|
79 |
+
::script-name Malayalam ::n-char 91 ::char ഀഁംഃഄഅആഇഈഉഊഋഌഎഏഐഒഓഔകഖഗഘങചഛജഝഞടഠഡഢണതഥദധനഩപഫബഭമയരറലളഴവശഷസഹഺ഻഼ഽാിീുൂൃൄെേൈൊോൌ്ൎ൏ൔൕൖൗൟൠൡൢൣൺൻർൽൾൿ ::numeral ൘൙൚൛൜൝൞൦൧൨൩൪൫൬൭൮൯൰൱൲൳൴൵൶൷൸ ::sign-virama ് ::vowel-sign ാിീുൂൃൄെേൈൊോൌൢൣ
|
80 |
+
::script-name Mandaic ::n-char 25 ::char ࡀࡁࡂࡃࡄࡅࡆࡇࡈࡉࡊࡋࡌࡍࡎࡏࡐࡑࡒࡓࡔࡕࡖࡗࡘ
|
81 |
+
::script-name Manichaean ::n-char 37 ::char 𐫀𐫁𐫂𐫃𐫄𐫅𐫆𐫇𐫈𐫉𐫊𐫋𐫌𐫍𐫎𐫏𐫐𐫑𐫒𐫓𐫔𐫕𐫖𐫗𐫘𐫙𐫚𐫛𐫜𐫝𐫞𐫟𐫠𐫡𐫢𐫣𐫤 ::numeral 𐫫𐫬𐫭𐫮𐫯
|
82 |
+
::script-name Marchen ::n-char 66 ::char 𑱲𑱳𑱴𑱵𑱶𑱷𑱸𑱹𑱺𑱻𑱼𑱽𑱾𑱿𑲀𑲁𑲂𑲃𑲄𑲅𑲆𑲇𑲈𑲉𑲊𑲋𑲌���𑲎𑲏𑲒𑲓𑲔𑲕𑲖𑲗𑲘𑲙𑲚𑲛𑲜𑲝𑲞𑲟𑲠𑲡𑲢𑲣𑲤𑲥𑲦𑲧𑲩𑲪𑲫𑲬𑲭𑲮𑲯𑲰𑲱𑲲𑲳𑲴𑲵𑲶 ::vowel-sign 𑲰𑲱𑲲𑲳𑲴
|
83 |
+
::script-name Masaram Gondi ::n-char 62 ::char 𑴀𑴁𑴂𑴃𑴄𑴅𑴆𑴈𑴉𑴋𑴌𑴍𑴎𑴏𑴐𑴑𑴒𑴓𑴔𑴕𑴖𑴗𑴘𑴙𑴚𑴛𑴜𑴝𑴞𑴟𑴠𑴡𑴢𑴣𑴤𑴥𑴦𑴧𑴨𑴩𑴪𑴫𑴬𑴭𑴮𑴯𑴰𑴱𑴲𑴳𑴴𑴵𑴶𑴺𑴼𑴽𑴿𑵀𑵁𑵂𑵃𑵄 ::numeral 𑵐𑵑𑵒𑵓𑵔𑵕𑵖𑵗𑵘𑵙 ::vowel-sign 𑴱𑴲𑴳𑴴𑴵𑴶𑴺𑴼𑴽𑴿
|
84 |
+
::script-name Mayan ::numeral 𝋠𝋡𝋢𝋣𝋤𝋥𝋦𝋧𝋨𝋩𝋪𝋫𝋬𝋭𝋮𝋯𝋰𝋱𝋲𝋳
|
85 |
+
::script-name Medefaidrin ::n-char 64 ::char 𖹀𖹁𖹂𖹃𖹄𖹅𖹆𖹇𖹈𖹉𖹊𖹋𖹌𖹍𖹎𖹏𖹐𖹑𖹒𖹓𖹔𖹕𖹖𖹗𖹘𖹙𖹚𖹛𖹜𖹝𖹞𖹟𖹠𖹡𖹢𖹣𖹤𖹥𖹦𖹧𖹨𖹩𖹪𖹫𖹬𖹭𖹮𖹯𖹰𖹱𖹲𖹳𖹴𖹵𖹶𖹷𖹸𖹹𖹺𖹻𖹼𖹽𖹾𖹿 ::numeral 𖺀𖺁𖺂𖺃𖺄𖺅𖺆𖺇𖺈𖺉𖺊𖺋𖺌𖺍𖺎𖺏𖺐𖺑𖺒𖺓𖺔𖺕𖺖
|
86 |
+
::script-name Meetei Mayek ::n-char 61 ::char ꫠꫡꫢꫣꫤꫥꫦꫧꫨꫩꫪꫫꫬꫭꫮꫯꫳꫵꯀꯁꯂꯃꯄꯅꯆꯇꯈꯉꯊꯋꯌꯍꯎꯏꯐꯑꯒꯓꯔꯕꯖꯗꯘꯙꯚꯛꯜꯝꯞꯟꯠꯡꯢꯣꯤꯥꯦꯧꯨꯩꯪ ::numeral ꯰꯱꯲꯳꯴꯵꯶꯷꯸꯹ ::vowel-sign ꫫꫬꫭꫮꫯꫵꯣꯤꯥꯦꯧꯨꯩꯪ
|
87 |
+
::script-name Mende Kikakui ::n-char 197 ::char 𞠀𞠁𞠂𞠃𞠄𞠅𞠆𞠇𞠈𞠉𞠊𞠋𞠌𞠍𞠎𞠏𞠐𞠑𞠒𞠓𞠔𞠕𞠖𞠗𞠘𞠙𞠚𞠛𞠜𞠝𞠞𞠟𞠠𞠡𞠢𞠣𞠤𞠥𞠦𞠧𞠨𞠩𞠪𞠫𞠬𞠭𞠮𞠯𞠰𞠱𞠲𞠳𞠴𞠵𞠶𞠷𞠸𞠹𞠺𞠻𞠼𞠽𞠾𞠿𞡀𞡁𞡂𞡃𞡄𞡅𞡆𞡇𞡈𞡉𞡊𞡋𞡌𞡍𞡎𞡏𞡐𞡑𞡒𞡓𞡔𞡕𞡖𞡗𞡘𞡙𞡚𞡛𞡜𞡝𞡞𞡟𞡠𞡡𞡢𞡣𞡤𞡥𞡦𞡧𞡨𞡩𞡪𞡫𞡬𞡭𞡮𞡯𞡰𞡱𞡲𞡳𞡴𞡵𞡶𞡷𞡸𞡹𞡺𞡻𞡼𞡽𞡾𞡿𞢀𞢁𞢂𞢃𞢄𞢅𞢆𞢇𞢈𞢉𞢊𞢋𞢌𞢍𞢎𞢏𞢐𞢑𞢒𞢓𞢔𞢕𞢖𞢗𞢘𞢙𞢚𞢛𞢜𞢝𞢞𞢟𞢠𞢡𞢢𞢣𞢤𞢥𞢦𞢧𞢨𞢩𞢪𞢫𞢬𞢭𞢮𞢯𞢰𞢱𞢲𞢳𞢴𞢵𞢶𞢷𞢸𞢹𞢺𞢻𞢼𞢽𞢾𞢿𞣀𞣁𞣂𞣃𞣄 ::numeral 𞣇𞣈𞣉𞣊𞣋𞣌𞣍𞣎𞣏𞣐𞣑𞣒𞣓𞣔𞣕𞣖
|
88 |
+
::script-name Meroitic Cursive ::n-char 24 ::char 𐦠𐦡𐦢𐦣𐦤𐦥𐦦𐦧𐦨𐦩𐦪𐦫𐦬𐦭𐦮𐦯𐦰𐦱𐦲𐦳𐦴𐦵𐦶𐦷 ::numeral 𐦼𐦽𐧀𐧁𐧂𐧃𐧄𐧅𐧆𐧇𐧈𐧉𐧊𐧋𐧌𐧍𐧎𐧏𐧒𐧓𐧔𐧕𐧖𐧗𐧘𐧙𐧚𐧛𐧜𐧝𐧞𐧟𐧠𐧡𐧢𐧣𐧤𐧥𐧦𐧧𐧨𐧩𐧪𐧫𐧬𐧭𐧮𐧯𐧰𐧱𐧲𐧳𐧴𐧵𐧶𐧷𐧸𐧹𐧺𐧻𐧼𐧽𐧾𐧿
|
89 |
+
::script-name Meroitic Hieroglyphic ::n-char 30 ::char 𐦀𐦁𐦂𐦃𐦄𐦅𐦆𐦇𐦈𐦉𐦊𐦋𐦌𐦍𐦎𐦏𐦐𐦑𐦒𐦓𐦔𐦕𐦖𐦗𐦘𐦙𐦚𐦛𐦜𐦝
|
90 |
+
::script-name Miao ::n-char 145 ::char 𖼀𖼁𖼂𖼃𖼄𖼅𖼆𖼇𖼈𖼉𖼊𖼋𖼌𖼍𖼎𖼏𖼐𖼑𖼒𖼓𖼔𖼕𖼖𖼗𖼘𖼙𖼚𖼛𖼜𖼝𖼞𖼟𖼠𖼡𖼢𖼣𖼤𖼥𖼦𖼧𖼨𖼩𖼪𖼫𖼬𖼭𖼮𖼯𖼰𖼱𖼲𖼳𖼴𖼵𖼶𖼷𖼸𖼹𖼺𖼻𖼼𖼽𖼾𖼿𖽀𖽁𖽂𖽃𖽄𖽅𖽆𖽇𖽈𖽉𖽊𖽏𖽐𖽑𖽒𖽓𖽔𖽕𖽖𖽗𖽘𖽙𖽚𖽛𖽜𖽝𖽞𖽟𖽠𖽡𖽢𖽣𖽤𖽥𖽦𖽧𖽨𖽩𖽪𖽫𖽬𖽭𖽮𖽯𖽰𖽱𖽲𖽳𖽴𖽵𖽶𖽷𖽸𖽹𖽺𖽻𖽼𖽽𖽾𖽿𖾀𖾁𖾂𖾃𖾄𖾅𖾆𖾇𖾓𖾔𖾕𖾖𖾗𖾘𖾙𖾚𖾛𖾜𖾝𖾞𖾟 ::vowel-sign 𖽔𖽕𖽖𖽗𖽘𖽙𖽚𖽛𖽜𖽝𖽞𖽟𖽠𖽡𖽢𖽣𖽤𖽥𖽦𖽧𖽨𖽩𖽪𖽫𖽬𖽭𖽮𖽯𖽰𖽱𖽲𖽳𖽴𖽵𖽶𖽷𖽸𖽹𖽺𖽻𖽼𖽽𖽾𖽿𖾀𖾁𖾂𖾃𖾄𖾅𖾆𖾇
|
91 |
+
::script-name Modi ::n-char 67 ::char 𑘀𑘁𑘂𑘃𑘄𑘅𑘆𑘇𑘈𑘉𑘊𑘋𑘌𑘍𑘎𑘏𑘐𑘑𑘒𑘓𑘔𑘕𑘖𑘗𑘘𑘙𑘚𑘛𑘜𑘝𑘞𑘟𑘠𑘡𑘢𑘣𑘤𑘥𑘦𑘧𑘨𑘩𑘪𑘫𑘬𑘭𑘮𑘯𑘰𑘱𑘲𑘳𑘴𑘵𑘶𑘷𑘸𑘹𑘺𑘻𑘼𑘽𑘾𑘿𑙀𑙃𑙄 ::numeral 𑙐𑙑𑙒𑙓𑙔𑙕𑙖𑙗𑙘𑙙 ::sign-virama 𑘿 ::vowel-sign 𑘰𑘱𑘲𑘳𑘴𑘵𑘶𑘷𑘸𑘹𑘺𑘻𑘼
|
92 |
+
::script-name Mongolian ::n-char 134 ::char ᠇ᠠᠡᠢᠣᠤᠥᠦᠧᠨᠩᠪᠫᠬᠭᠮᠯᠰᠱᠲᠳᠴᠵᠶᠷᠸᠹᠺᠻᠼᠽᠾᠿᡀᡁᡂᡃᡄᡅᡆᡇᡈᡉᡊᡋᡌᡍᡎᡏᡐᡑᡒᡓᡔᡕᡖᡗᡘᡙᡚᡛᡜᡝᡞᡟᡠᡡᡢᡣᡤᡥᡦᡧᡨᡩᡪᡫᡬᡭᡮᡯᡰᡱᡲᡳᡴᡵᡶᡷᡸᢀᢁᢂᢃᢄᢅᢆᢇᢈᢉᢊᢋᢌᢍᢎᢏᢐᢑᢒᢓᢔᢕᢖᢗᢘᢙᢚᢛᢜᢝᢞᢟᢠᢡᢢᢣᢤᢥᢦᢧᢨᢩᢪ ::numeral ᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙ ::vowel-sign ᡃ
|
93 |
+
::script-name Mro ::n-char 31 ::char 𖩀𖩁𖩂𖩃𖩄𖩅𖩆𖩇𖩈𖩉𖩊𖩋𖩌𖩍𖩎𖩏𖩐𖩑𖩒𖩓𖩔𖩕𖩖𖩗𖩘𖩙𖩚𖩛𖩜𖩝𖩞 ::numeral 𖩠𖩡𖩢𖩣𖩤𖩥𖩦𖩧𖩨𖩩
|
94 |
+
::script-name Multani ::n-char 37 ::char 𑊀𑊁𑊂𑊃𑊄𑊅𑊆𑊈𑊊𑊋𑊌𑊍𑊏𑊐𑊑𑊒𑊓𑊔𑊕𑊖𑊗𑊘𑊙𑊚𑊛𑊜𑊝𑊟𑊠𑊡𑊢𑊣𑊤𑊥𑊦𑊧𑊨
|
95 |
+
::script-name Myanmar ::n-char 183 ::char ကခဂဃငစဆဇဈဉညဋဌဍဎဏတထဒဓနပဖဗဘမယရလဝသဟဠအဢဣဤဥဦဧဨဩဪါာိီုူေဲဳဴဵံ့း္်ျြွှဿ၊။၌၍၎၏ၐၑၒၓၔၕၖၗၘၙၚၛၜၝၞၟၠၡၢၥၦၧၨၩၪၫၬၭၮၯၰၱၲၳၴၵၶၷၸၹၺၻၼၽၾၿႀႁႂႃႄႅႆႇႈႉႊႋႌႍႎႏႚႛႜႝꧠꧡꧢꧣꧤꧥꧦꧧꧨꧩꧪꧫꧬꧭꧮꧯꧺꧻꧼꧽꧾꩠꩡꩢꩣꩤꩥꩦꩧꩨꩩꩪꩫꩬꩭꩮꩯꩰꩱꩲꩳꩺꩻꩼꩽꩾꩿ ::medial-consonant-sign ျြွှၞၟၠႂ ::numeral ၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙꧰꧱꧲꧳꧴꧵꧶꧷꧸꧹ ::sign-virama ္် ::vowel-sign ါာိီုူေဲဳဴဵၖၗၘၙၢၧၨၱၲၳၴႃႄႅႆႜႝ
|
96 |
+
::script-name Nabataean ::n-char 31 ::char 𐢀𐢁𐢂𐢃𐢄𐢅𐢆𐢇𐢈𐢉𐢊𐢋𐢌𐢍𐢎𐢏𐢐𐢑𐢒𐢓𐢔𐢕𐢖𐢗𐢘𐢙𐢚𐢛𐢜𐢝𐢞 ::numeral 𐢧𐢨𐢩𐢪𐢫𐢬𐢭𐢮𐢯
|
97 |
+
::script-name Nandinagari ::n-char 64 ::char 𑦠𑦡𑦢𑦣𑦤𑦥𑦦𑦧𑦪𑦫𑦬𑦭𑦮𑦯𑦰𑦱𑦲𑦳𑦴𑦵𑦶𑦷𑦸𑦹𑦺𑦻𑦼𑦽𑦾𑦿𑧀𑧁𑧂𑧃𑧄𑧅𑧆𑧇𑧈𑧉𑧊𑧋𑧌𑧍𑧎𑧏𑧐𑧑𑧒𑧓𑧔𑧕𑧖𑧗𑧚𑧛𑧜𑧝𑧞𑧟𑧠𑧡𑧢𑧤 ::sign-virama 𑧠 ::vowel-sign 𑧑𑧒𑧓𑧔𑧕𑧖𑧗𑧚𑧛𑧜𑧝𑧤
|
98 |
+
::script-name New Tai Lue ::n-char 70 ::char ᦀᦁᦂᦃᦄᦅᦆᦇᦈᦉᦊᦋᦌᦍᦎᦏᦐᦑᦒᦓᦔᦕᦖᦗᦘᦙᦚᦛᦜᦝᦞᦟᦠᦡᦢᦣᦤᦥᦦᦧᦨᦩᦪᦫᦰᦱᦲᦳᦴᦵᦶᦷᦸᦹᦺᦻᦼᦽᦾᦿᧀᧁᧂᧃᧄᧅᧆᧇ᧞᧟ ::numeral ᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᧚ ::vowel-sign ᦰᦱᦲᦳᦴᦵᦶᦷᦸᦹᦺᦻᦼᦽᦾᦿᧀ
|
99 |
+
::script-name Newa ::n-char 78 ::char 𑐀𑐁𑐂𑐃𑐄𑐅𑐆𑐇𑐈𑐉𑐊𑐋𑐌𑐍𑐎𑐏𑐐𑐑𑐒𑐓𑐔𑐕𑐖𑐗𑐘𑐙𑐚𑐛𑐜𑐝𑐞𑐟𑐠𑐡𑐢𑐣𑐤𑐥𑐦𑐧𑐨𑐩𑐪𑐫𑐬𑐭𑐮𑐯𑐰𑐱𑐲𑐳𑐴𑐵𑐶𑐷𑐸𑐹𑐺𑐻𑐼𑐽𑐾𑐿𑑀𑑁𑑂𑑃𑑄𑑅𑑆𑑇𑑈𑑏𑑝𑑟𑑠𑑡 ::numeral 𑑐𑑑𑑒𑑓𑑔𑑕𑑖𑑗𑑘𑑙 ::sign-virama 𑑂 ::vowel-sign 𑐵𑐶𑐷𑐸𑐹𑐺𑐻𑐼𑐽𑐾𑐿𑑀𑑁
|
100 |
+
::script-name Nko ::n-char 35 ::char ߊߋߌߍߎߏߐߑߒߓߔߕߖߗߘߙߚߛߜߝߞߟߠߡߢߣߤߥߦߧߨߩߪ߾߿ ::numeral ߀߁߂߃߄߅߆߇߈߉
|
101 |
+
::script-name North Indic ::numeral ꠰꠱꠲꠳꠴꠵
|
102 |
+
::script-name Nushu ::n-char 396 ::char 𛅰𛅱𛅲𛅳𛅴𛅵𛅶𛅷𛅸𛅹𛅺𛅻𛅼𛅽𛅾𛅿𛆀𛆁𛆂𛆃𛆄𛆅𛆆𛆇𛆈𛆉𛆊𛆋𛆌𛆍𛆎𛆏𛆐𛆑𛆒𛆓𛆔𛆕𛆖𛆗𛆘𛆙𛆚𛆛𛆜𛆝𛆞𛆟𛆠𛆡𛆢𛆣𛆤𛆥𛆦𛆧𛆨𛆩𛆪𛆫𛆬𛆭𛆮𛆯𛆰𛆱𛆲𛆳𛆴𛆵𛆶𛆷𛆸𛆹𛆺𛆻𛆼𛆽𛆾𛆿𛇀𛇁𛇂𛇃𛇄𛇅𛇆𛇇𛇈𛇉𛇊𛇋𛇌𛇍𛇎𛇏𛇐𛇑𛇒𛇓𛇔𛇕𛇖𛇗𛇘𛇙𛇚𛇛𛇜𛇝𛇞𛇟𛇠𛇡𛇢𛇣𛇤𛇥𛇦𛇧𛇨𛇩𛇪𛇫𛇬𛇭𛇮𛇯𛇰𛇱𛇲𛇳𛇴𛇵𛇶𛇷𛇸𛇹𛇺𛇻𛇼𛇽𛇾𛇿𛈀𛈁𛈂𛈃𛈄𛈅𛈆𛈇𛈈𛈉𛈊𛈋𛈌𛈍𛈎𛈏𛈐𛈑𛈒𛈓𛈔𛈕𛈖𛈗𛈘𛈙𛈚𛈛𛈜𛈝𛈞𛈟𛈠𛈡𛈢𛈣𛈤𛈥𛈦𛈧𛈨𛈩𛈪𛈫𛈬𛈭𛈮𛈯𛈰𛈱𛈲𛈳𛈴𛈵𛈶𛈷𛈸𛈹𛈺𛈻𛈼𛈽𛈾𛈿𛉀𛉁𛉂𛉃𛉄𛉅𛉆𛉇𛉈𛉉𛉊𛉋𛉌𛉍𛉎𛉏𛉐𛉑𛉒𛉓𛉔𛉕𛉖𛉗𛉘𛉙𛉚𛉛𛉜𛉝𛉞𛉟𛉠𛉡𛉢𛉣𛉤𛉥𛉦𛉧𛉨𛉩𛉪𛉫𛉬𛉭𛉮𛉯𛉰𛉱𛉲𛉳𛉴𛉵𛉶𛉷𛉸𛉹𛉺𛉻𛉼𛉽𛉾𛉿𛊀𛊁𛊂𛊃𛊄𛊅𛊆𛊇𛊈𛊉𛊊𛊋𛊌𛊍𛊎𛊏𛊐𛊑𛊒𛊓𛊔𛊕𛊖𛊗𛊘𛊙𛊚𛊛𛊜𛊝𛊞𛊟𛊠𛊡𛊢𛊣𛊤𛊥𛊦𛊧𛊨𛊩𛊪𛊫𛊬𛊭𛊮𛊯𛊰𛊱𛊲𛊳𛊴𛊵𛊶𛊷𛊸𛊹𛊺𛊻𛊼𛊽𛊾𛊿𛋀𛋁𛋂𛋃𛋄𛋅𛋆𛋇𛋈𛋉𛋊𛋋𛋌𛋍𛋎𛋏𛋐𛋑𛋒𛋓𛋔𛋕𛋖𛋗𛋘𛋙𛋚𛋛𛋜𛋝𛋞𛋟𛋠𛋡𛋢𛋣𛋤𛋥𛋦𛋧𛋨𛋩𛋪𛋫𛋬𛋭𛋮𛋯𛋰𛋱𛋲𛋳𛋴𛋵𛋶𛋷𛋸𛋹𛋺𛋻
|
103 |
+
::script-name Nyiakeng Puachue Hmong ::n-char 52 ::char 𞄀𞄁𞄂𞄃𞄄𞄅𞄆𞄇𞄈𞄉𞄊𞄋𞄌𞄍𞄎𞄏𞄐𞄑𞄒𞄓𞄔𞄕𞄖𞄗𞄘𞄙𞄚𞄛𞄜𞄝𞄞𞄟𞄠𞄡𞄢𞄣𞄤𞄥𞄦𞄧𞄨𞄩𞄪𞄫𞄬𞄷𞄸𞄹𞄺𞄻𞄼𞄽 ::numeral 𞅀𞅁𞅂𞅃𞅄𞅅𞅆𞅇𞅈𞅉
|
104 |
+
::script-name Ogham ::n-char 26 ::char ᚁᚂᚃᚄᚅᚆᚇᚈᚉᚊᚋᚌᚍᚎᚏᚐᚑᚒᚓᚔᚕᚖᚗᚘᚙᚚ
|
105 |
+
::script-name Ol Chiki ::n-char 30 ::char ᱚᱛᱜᱝᱞᱟᱠᱡᱢᱣᱤᱥᱦᱧᱨᱩᱪᱫᱬᱭᱮᱯᱰᱱᱲᱳᱴᱵᱶᱷ ::numeral ᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙
|
106 |
+
::script-name Old Hungarian ::n-char 102 ::char 𐲀𐲁𐲂𐲃𐲄𐲅𐲆𐲇𐲈𐲉𐲊𐲋𐲌𐲍𐲎𐲏𐲐𐲑𐲒𐲓𐲔𐲕𐲖𐲗𐲘𐲙𐲚𐲛𐲜𐲝𐲞𐲟𐲠𐲡𐲢𐲣𐲤𐲥𐲦𐲧𐲨𐲩𐲪𐲫𐲬𐲭𐲮𐲯𐲰𐲱𐲲𐳀𐳁𐳂𐳃𐳄𐳅𐳆𐳇𐳈𐳉𐳊𐳋𐳌𐳍𐳎𐳏𐳐𐳑���𐳓𐳔𐳕𐳖𐳗𐳘𐳙𐳚𐳛𐳜𐳝𐳞𐳟𐳠𐳡𐳢𐳣𐳤𐳥𐳦𐳧𐳨𐳩𐳪𐳫𐳬𐳭𐳮𐳯𐳰𐳱𐳲 ::numeral 𐳺𐳻𐳼𐳽𐳾𐳿
|
107 |
+
::script-name Old Italic ::n-char 35 ::char 𐌀𐌁𐌂𐌃𐌄𐌅𐌆𐌇𐌈𐌉𐌊𐌋𐌌𐌍𐌎𐌏𐌐𐌑𐌒𐌓𐌔𐌕𐌖𐌗𐌘𐌙𐌚𐌛𐌜𐌝𐌞𐌟𐌭𐌮𐌯 ::numeral 𐌠𐌡𐌢𐌣
|
108 |
+
::script-name Old North Arabian ::n-char 29 ::char 𐪀𐪁𐪂𐪃𐪄𐪅𐪆𐪇𐪈𐪉𐪊𐪋𐪌𐪍𐪎𐪏𐪐𐪑𐪒𐪓𐪔𐪕𐪖𐪗𐪘𐪙𐪚𐪛𐪜 ::numeral 𐪝𐪞𐪟
|
109 |
+
::script-name Old Permic ::n-char 38 ::char 𐍐𐍑𐍒𐍓𐍔𐍕𐍖𐍗𐍘𐍙𐍚𐍛𐍜𐍝𐍞𐍟𐍠𐍡𐍢𐍣𐍤𐍥𐍦𐍧𐍨𐍩𐍪𐍫𐍬𐍭𐍮𐍯𐍰𐍱𐍲𐍳𐍴𐍵
|
110 |
+
::script-name Old Persian ::n-char 44 ::char 𐎠𐎡𐎢𐎣𐎤𐎥𐎦𐎧𐎨𐎩𐎪𐎫𐎬𐎭𐎮𐎯𐎰𐎱𐎲𐎳𐎴𐎵𐎶𐎷𐎸𐎹𐎺𐎻𐎼𐎽𐎾𐎿𐏀𐏁𐏂𐏃𐏈𐏉𐏊𐏋𐏌𐏍𐏎𐏏 ::numeral 𐏑𐏒𐏓𐏔𐏕
|
111 |
+
::script-name Old Sogdian ::n-char 30 ::char 𐼀𐼁𐼂𐼃𐼄𐼅𐼆𐼇𐼈𐼉𐼊𐼋𐼌𐼍𐼎𐼏𐼐𐼑𐼒𐼓𐼔𐼕𐼖𐼗𐼘𐼙𐼚𐼛𐼜𐼧 ::numeral 𐼝𐼞𐼟𐼠𐼡𐼢𐼣𐼤𐼥𐼦
|
112 |
+
::script-name Old South Arabian ::n-char 29 ::char 𐩠𐩡𐩢𐩣𐩤𐩥𐩦𐩧𐩨𐩩𐩪𐩫𐩬𐩭𐩮𐩯𐩰𐩱𐩲𐩳𐩴𐩵𐩶𐩷𐩸𐩹𐩺𐩻𐩼 ::numeral 𐩽𐩾
|
113 |
+
::script-name Old Turkic ::n-char 73 ::char 𐰀𐰁𐰂𐰃𐰄𐰅𐰆𐰇𐰈𐰉𐰊𐰋𐰌𐰍𐰎𐰏𐰐𐰑𐰒𐰓𐰔𐰕𐰖𐰗𐰘𐰙𐰚𐰛𐰜𐰝𐰞𐰟𐰠𐰡𐰢𐰣𐰤𐰥𐰦𐰧𐰨𐰩𐰪𐰫𐰬𐰭𐰮𐰯𐰰𐰱𐰲𐰳𐰴𐰵𐰶𐰷𐰸𐰹𐰺𐰻𐰼𐰽𐰾𐰿𐱀𐱁𐱂𐱃𐱄𐱅𐱆𐱇𐱈
|
114 |
+
::script-name Old Uyghur ::n-char 18 ::char 𐽰𐽱𐽲𐽳𐽴𐽵𐽶𐽷𐽸𐽹𐽺𐽻𐽼𐽽𐽾𐽿𐾀𐾁
|
115 |
+
::script-name Oriya ::n-char 73 ::char ଁଂଃଅଆଇଈଉଊଋଌଏଐଓଔକଖଗଘଙଚଛଜଝଞଟଠଡଢଣତଥଦଧନପଫବଭମଯରଲଳଵଶଷସହ଼ଽାିୀୁୂୃୄେୈୋୌ୍୕ୗଡ଼ଢ଼ୟୠୡୢୣୱ ::numeral ୦୧୨୩୪୫୬୭୮୯୲୳୴୵୶୷ ::sign-virama ୍ ::vowel-sign ାିୀୁୂୃୄେୈୋୌୢୣ
|
116 |
+
::script-name Osage ::n-char 72 ::char 𐒰𐒱𐒲𐒳𐒴𐒵𐒶𐒷𐒸𐒹𐒺𐒻𐒼𐒽𐒾𐒿𐓀𐓁𐓂𐓃𐓄𐓅𐓆𐓇𐓈𐓉𐓊𐓋𐓌𐓍𐓎𐓏𐓐𐓑𐓒𐓓𐓘𐓙𐓚𐓛𐓜𐓝𐓞𐓟𐓠𐓡𐓢𐓣𐓤𐓥𐓦𐓧𐓨𐓩𐓪𐓫𐓬𐓭𐓮𐓯𐓰𐓱𐓲𐓳𐓴𐓵𐓶𐓷𐓸𐓹𐓺𐓻
|
117 |
+
::script-name Osmanya ::n-char 30 ::char 𐒀𐒁𐒂𐒃𐒄𐒅𐒆𐒇𐒈𐒉𐒊𐒋𐒌𐒍𐒎𐒏𐒐𐒑𐒒𐒓𐒔𐒕𐒖𐒗𐒘𐒙𐒚𐒛𐒜𐒝 ::numeral 𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩
|
118 |
+
::script-name Ottoman Siyaq ::numeral 𞴁𞴂𞴃𞴄𞴅𞴆𞴇𞴈𞴉𞴊𞴋𞴌𞴍𞴎𞴏𞴐𞴑𞴒𞴓𞴔𞴕𞴖𞴗𞴘𞴙𞴚𞴛𞴜𞴝𞴞𞴟𞴠𞴡𞴢𞴣𞴤𞴥𞴦𞴧𞴨𞴩𞴪𞴫𞴬𞴭𞴯𞴰𞴱𞴲𞴳𞴴𞴵𞴶𞴷𞴸𞴹𞴺𞴻𞴼𞴽
|
119 |
+
::script-name Pahawh Hmong ::n-char 103 ::char 𖬀𖬁𖬂𖬃𖬄𖬅𖬆𖬇𖬈𖬉𖬊𖬋𖬌𖬍𖬎𖬏𖬐𖬑𖬒𖬓𖬔𖬕𖬖𖬗𖬘𖬙𖬚𖬛𖬜𖬝𖬞𖬟𖬠𖬡𖬢𖬣𖬤𖬥𖬦𖬧𖬨𖬩𖬪𖬫𖬬𖬭𖬮𖬯𖬷𖬸𖬹𖬺𖬻𖬼𖬽𖬾𖬿𖭀𖭁𖭂𖭃𖭄𖭅𖭣𖭤𖭥𖭦𖭧𖭨𖭩𖭪𖭫𖭬𖭭𖭮𖭯𖭰𖭱𖭲𖭳𖭴𖭵𖭶𖭷𖭽𖭾𖭿𖮀𖮁𖮂𖮃𖮄𖮅𖮆𖮇𖮈𖮉𖮊𖮋𖮌𖮍𖮎𖮏 ::numeral 𖭐𖭑𖭒𖭓𖭔𖭕𖭖𖭗𖭘𖭙𖭛𖭜𖭝𖭞𖭟𖭠𖭡
|
120 |
+
::script-name Palmyrene ::n-char 23 ::char 𐡠𐡡𐡢𐡣𐡤𐡥𐡦𐡧𐡨𐡩𐡪𐡫𐡬𐡭𐡮𐡯𐡰𐡱𐡲𐡳𐡴𐡵𐡶 ::numeral 𐡹𐡺𐡻𐡼𐡽𐡾𐡿
|
121 |
+
::script-name Pau Cin Hau ::n-char 37 ::char 𑫀𑫁𑫂𑫃𑫄𑫅𑫆𑫇𑫈𑫉𑫊𑫋𑫌𑫍𑫎𑫏𑫐𑫑𑫒𑫓𑫔𑫕𑫖𑫗𑫘𑫙𑫚𑫛𑫜𑫝𑫞𑫟𑫠𑫡𑫢𑫣𑫤
|
122 |
+
::script-name Phags-Pa ::n-char 52 ::char ꡀꡁꡂꡃꡄꡅꡆꡇꡈꡉꡊꡋꡌꡍꡎꡏꡐꡑꡒꡓꡔꡕꡖꡗꡘꡙꡚꡛꡜꡝꡞꡟꡠꡡꡢꡣꡤꡥꡦꡧꡨꡩꡪꡫꡬꡭꡮꡯꡰꡱꡲꡳ
|
123 |
+
::script-name Phaistos Disc ::n-char 46 ::char 𐇐𐇑𐇒𐇓𐇔𐇕𐇖𐇗𐇘𐇙𐇚𐇛𐇜𐇝𐇞𐇟𐇠𐇡𐇢𐇣𐇤𐇥𐇦𐇧𐇨𐇩𐇪𐇫𐇬𐇭𐇮𐇯𐇰𐇱𐇲𐇳𐇴𐇵𐇶𐇷𐇸𐇹𐇺𐇻𐇼𐇽
|
124 |
+
::script-name Phoenician ::n-char 22 ::char 𐤀𐤁𐤂𐤃𐤄𐤅𐤆𐤇𐤈𐤉𐤊𐤋𐤌𐤍𐤎𐤏𐤐𐤑𐤒𐤓𐤔𐤕 ::numeral 𐤖𐤗𐤘𐤙𐤚𐤛
|
125 |
+
::script-name Psalter Pahlavi ::n-char 18 ::char 𐮀𐮁𐮂𐮃𐮄𐮅𐮆𐮇𐮈𐮉𐮊𐮋𐮌𐮍𐮎𐮏𐮐𐮑 ::numeral 𐮩𐮪𐮫𐮬𐮭𐮮𐮯
|
126 |
+
::script-name Rejang ::n-char 35 ::char ꤰꤱꤲꤳꤴꤵꤶꤷꤸꤹꤺꤻꤼꤽꤾꤿꥀꥁꥂꥃꥄꥅꥆꥇꥈꥉꥊꥋꥌꥍꥎꥏꥐꥑꥒ ::vowel-sign ꥇꥈꥉꥊꥋꥌꥍꥎ
|
127 |
+
::script-name Rumi ::numeral 𐹠𐹡𐹢𐹣𐹤𐹥𐹦𐹧𐹨𐹩𐹪𐹫𐹬𐹭𐹮𐹯𐹰𐹱𐹲𐹳𐹴𐹵𐹶𐹷𐹸𐹹𐹺𐹻𐹼𐹽𐹾
|
128 |
+
::script-name Runic ::n-char 83 ::char ᚠᚡᚢᚣᚤᚥᚦᚧᚨᚩᚪᚫᚬᚭᚮᚯᚰᚱᚲᚳᚴᚵᚶᚷᚸᚹᚺᚻᚼᚽᚾᚿᛀᛁᛂᛃᛄᛅᛆᛇᛈᛉᛊᛋᛌᛍᛎᛏᛐᛑᛒᛓᛔᛕᛖᛗᛘᛙᛚᛛᛜᛝᛞᛟᛠᛡᛢᛣᛤᛥᛦᛧᛨᛩᛪᛱᛲᛳᛴᛵᛶᛷᛸ
|
129 |
+
::script-name Samaritan ::n-char 40 ::char ࠀࠁࠂࠃࠄࠅࠆࠇࠈࠉࠊࠋࠌࠍࠎࠏࠐࠑࠒࠓࠔࠕࠚࠜࠝࠞࠟࠠࠡࠢࠣࠤࠥࠦࠧࠨࠩࠪࠫࠬ ::vowel-sign ࠜࠝࠞࠟࠠࠡࠢࠣࠥࠦࠧࠩࠪࠫࠬ
|
130 |
+
::script-name Saurashtra ::n-char 70 ::char ꢀꢁꢂꢃꢄꢅꢆꢇꢈꢉꢊꢋꢌꢍꢎꢏꢐꢑꢒꢓꢔꢕꢖꢗꢘꢙꢚꢛꢜꢝꢞꢟꢠꢡꢢꢣꢤꢥꢦꢧꢨꢩꢪꢫꢬꢭꢮꢯꢰꢱꢲꢳꢴꢵꢶꢷꢸꢹꢺꢻꢼꢽꢾꢿꣀꣁꣂꣃ꣄ꣅ ::numeral ꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙ ::sign-virama ꣄ ::vowel-sign ꢵꢶꢷꢸꢹꢺꢻꢼꢽꢾꢿꣀꣁꣂꣃ
|
131 |
+
::script-name Sharada ::n-char 76 ::char 𑆀𑆁𑆂𑆃𑆄𑆅𑆆𑆇𑆈𑆉𑆊𑆋𑆌𑆍𑆎𑆏𑆐𑆑𑆒𑆓𑆔𑆕𑆖𑆗𑆘𑆙𑆚𑆛𑆜𑆝𑆞𑆟𑆠𑆡𑆢𑆣𑆤𑆥𑆦𑆧𑆨𑆩𑆪𑆫𑆬𑆭𑆮𑆯𑆰𑆱𑆲𑆳𑆴𑆵𑆶𑆷𑆸𑆹𑆺𑆻𑆼𑆽𑆾𑆿𑇀𑇁𑇂𑇃𑇇𑇊𑇋𑇌𑇎𑇏𑇛𑇝 ::numeral 𑇐𑇑𑇒𑇓𑇔𑇕𑇖𑇗𑇘𑇙 ::sign-virama 𑇀 ::vowel-sign 𑆳𑆴𑆵𑆶𑆷𑆸𑆹𑆺𑆻𑆼𑆽𑆾𑆿𑇎
|
132 |
+
::script-name Shavian ::n-char 48 ::char 𐑐𐑑𐑒𐑓𐑔𐑕𐑖𐑗𐑘𐑙𐑚𐑛𐑜𐑝𐑞𐑟𐑠𐑡𐑢𐑣𐑤𐑥𐑦𐑧𐑨𐑩𐑪𐑫𐑬𐑭𐑮𐑯𐑰𐑱𐑲𐑳𐑴𐑵𐑶𐑷𐑸𐑹𐑺𐑻𐑼𐑽𐑾𐑿
|
133 |
+
::script-name Siddham ::n-char 70 ::char 𑖀𑖁𑖂𑖃𑖄𑖅𑖆𑖇𑖈𑖉𑖊𑖋𑖌𑖍𑖎𑖏𑖐𑖑𑖒𑖓𑖔𑖕𑖖𑖗𑖘𑖙𑖚𑖛𑖜𑖝𑖞𑖟𑖠𑖡𑖢𑖣𑖤𑖥𑖦𑖧𑖨𑖩𑖪𑖫𑖬𑖭𑖮𑖯𑖰𑖱𑖲𑖳𑖴𑖵𑖸𑖹𑖺𑖻𑖼𑖽𑖾𑗀𑖿𑗁𑗘𑗙𑗚𑗛𑗜𑗝 ::sign-virama 𑖿 ::vowel-sign 𑖯𑖰𑖱𑖲𑖳𑖴𑖵𑖸𑖹𑖺𑖻𑗜𑗝
|
134 |
+
::script-name Sinhala ::n-char 80 ::char ඁංඃඅආඇඈඉඊඋඌඍඎඏඐඑඒඓඔඕඖකඛගඝඞඟචඡජඣඤඥඦටඨඩඪණඬතථදධනඳපඵබභමඹයරලවශෂසහළෆ්ාැෑිීුූෘෙේෛොෝෞෟෲෳ ::numeral ෦෧෨෩෪෫෬෭෮෯𑇡𑇢𑇣𑇤𑇥𑇦𑇧𑇨𑇩𑇪𑇫𑇬𑇭𑇮𑇯𑇰𑇱𑇲𑇳𑇴 ::sign-virama ් ::vowel-sign ාැෑිීුූෘෙේෛොෝෞෟෲෳ
|
135 |
+
::script-name Sogdian ::n-char 21 ::char 𐼰𐼱𐼲𐼳𐼴𐼵𐼶𐼷𐼸𐼹𐼺𐼻𐼼𐼽𐼾𐼿𐽀𐽁𐽂𐽃𐽄 ::numeral 𐽑𐽒𐽓𐽔
|
136 |
+
::script-name Sora Sompeng ::n-char 25 ::char 𑃐𑃑𑃒𑃓𑃔𑃕𑃖𑃗𑃘𑃙𑃚𑃛𑃜𑃝𑃞𑃟𑃠𑃡𑃢𑃣𑃤𑃥𑃦𑃧𑃨 ::numeral 𑃰𑃱𑃲𑃳𑃴𑃵𑃶𑃷𑃸𑃹
|
137 |
+
::script-name Soyombo ::n-char 72 ::char 𑩐𑩑𑩒𑩓𑩔𑩕𑩖𑩗𑩘𑩙𑩚𑩛𑩜𑩝𑩞𑩟𑩠𑩡𑩢𑩣𑩤𑩥𑩦𑩧𑩨𑩩𑩪𑩫𑩬𑩭𑩮𑩯𑩰𑩱𑩲𑩳𑩴𑩵𑩶𑩷𑩸𑩹𑩺𑩻𑩼𑩽𑩾𑩿𑪀𑪁𑪂𑪃𑪄𑪅𑪆𑪇𑪈𑪉𑪊𑪋𑪌𑪍𑪎𑪏𑪐𑪑𑪒𑪓𑪔𑪕𑪖𑪗 ::vowel-sign 𑩑𑩒𑩓𑩔𑩕𑩖𑩗𑩘𑩙𑩚
|
138 |
+
::script-name Sundanese ::n-char 53 ::char ᮀᮁᮂᮃᮄᮅᮆᮇᮈᮉᮊᮋᮌᮍᮎᮏᮐᮑᮒᮓᮔᮕᮖᮗᮘᮙᮚᮛᮜᮝᮞᮟᮠᮡᮢᮣᮤᮥᮦᮧᮨᮩ᮪᮫ᮬᮭᮮᮯᮻᮼᮽᮾᮿ ::numeral ᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹ ::sign-virama ᮪᮫ ::vowel-sign ᮤᮥᮦᮧᮨᮩ
|
139 |
+
::script-name Syloti Nagri ::n-char 41 ::char ꠀꠁꠂꠃꠄꠅ꠆ꠇꠈꠉꠊꠋꠌꠍꠎꠏꠐꠑꠒꠓꠔꠕꠖꠗꠘꠙꠚꠛꠜꠝꠞꠟꠠꠡꠢꠣꠤꠥꠦꠧ꠬ ::vowel-sign ꠣꠤꠥꠦꠧ
|
140 |
+
::script-name Syriac ::n-char 46 ::char ܐܑܒܓܔܕܖܗܘܙܚܛܜܝܞܟܠܡܢܣܤܥܦܧܨܩܪܫܬܭܮܯݍݎݏࡠࡡࡢࡣࡤࡥࡦࡧࡨࡩࡪ
|
141 |
+
::script-name Tagalog ::n-char 23 ::char ᜀᜁᜂᜃᜄᜅᜆᜇᜈᜉᜊᜋᜌᜍᜎᜏᜐᜑᜒᜓ᜔᜕ᜟ ::sign-virama ᜔ ::vowel-sign ᜒᜓ
|
142 |
+
::script-name Tagbanwa ::n-char 18 ::char ᝠᝡᝢᝣᝤᝥᝦᝧᝨᝩᝪᝫᝬᝮᝯᝰᝲᝳ ::vowel-sign ᝲᝳ
|
143 |
+
::script-name Tai Le ::n-char 35 ::char ᥐᥑᥒᥓᥔᥕᥖᥗᥘᥙᥚᥛᥜᥝᥞᥟᥠᥡᥢᥣᥤᥥᥦᥧᥨᥩᥪᥫᥬᥭᥰᥱᥲᥳᥴ
|
144 |
+
::script-name Tai Tham ::n-char 106 ::char ᨠᨡᨢᨣᨤᨥᨦᨧᨨᨩᨪᨫᨬᨭᨮᨯᨰᨱᨲᨳᨴᨵᨶᨷᨸᨹᨺᨻᨼᨽᨾᨿᩀᩁᩂᩃᩄᩅᩆᩇᩈᩉᩊᩋᩌᩍᩎᩏᩐᩑᩒᩓᩔᩕᩖᩗᩘᩙᩚᩛᩜᩝᩞ᩠ᩡᩢᩣᩤᩥᩦᩧᩨᩩᩪᩫᩬᩭᩮᩯᩰᩱᩲᩳᩴ᩵᩶᩷᩸᩹᩺᩻᩼᪠᪡᪢᪣᪤᪥᪦ᪧ᪨᪩᪪᪫᪬᪭ ::medial-consonant-sign ᩕᩖ ::numeral ᪀᪁᪂᪃᪄᪅᪆᪇᪈᪉᪐᪑᪒᪓᪔᪕᪖᪗᪘᪙ ::vowel-sign ᩡᩢᩣᩤᩥᩦᩧᩨᩩᩪᩫᩬᩭᩮᩯᩰᩱᩲᩳ
|
145 |
+
::script-name Tai Viet ::n-char 61 ::char ꪀꪁꪂꪃꪄꪅꪆꪇꪈꪉꪊꪋꪌꪍꪎꪏꪐꪑꪒꪓꪔꪕꪖꪗꪘꪙꪚꪛꪜꪝꪞꪟꪠꪡꪢꪣꪤꪥꪦꪧꪨꪩꪪꪫꪬꪭꪮꪯꪱꪴꪲꪳꪵꪶꪸꪹꪺꪻꪼꪽꪾ
|
146 |
+
::script-name Takri ::n-char 58 ::char 𑚀𑚁𑚂𑚃𑚄𑚅𑚆𑚇𑚈𑚉𑚊𑚋𑚌𑚍𑚎𑚏𑚐𑚑𑚒𑚓𑚔𑚕𑚖𑚗𑚘𑚙𑚚𑚛𑚜𑚝𑚞𑚟𑚠𑚡𑚢𑚣𑚤𑚥𑚦𑚧𑚨𑚩𑚪𑚫𑚬𑚭𑚮𑚯𑚰𑚱𑚲𑚳𑚴𑚵𑚷𑚶𑚸𑚹 ::numeral 𑛀𑛁𑛂𑛃𑛄𑛅𑛆𑛇𑛈𑛉 ::sign-virama 𑚶 ::vowel-sign 𑚭𑚮𑚯𑚰𑚱𑚲𑚳𑚴𑚵
|
147 |
+
::script-name Tamil ::n-char 87 ::char ஂஃஅஆஇஈஉஊஎஏஐஒஓஔகஙசஜஞடணதநனபமயரறலளழவஶஷஸஹாிீுூெேைொோௌ்ௗ௳௴௵௶௷௸௹௺𑿕𑿖𑿗𑿘𑿙𑿚𑿛𑿜𑿝𑿞𑿟𑿠𑿡𑿢𑿣𑿤𑿥𑿦𑿧𑿨𑿩𑿪𑿫𑿬𑿭𑿮𑿯𑿰𑿱 ::numeral ௦௧௨௩௪௫௬௭௮௯௰௱௲௺𑿀𑿁𑿂𑿃𑿄𑿅𑿆𑿇𑿈𑿉𑿊𑿋𑿌𑿍𑿎𑿏𑿐𑿑𑿒𑿓𑿔𑿩 ::sign-virama ் ::vowel-sign ாிீுூெேைொோௌ
|
148 |
+
::script-name Tangsa ::n-char 79 ::char 𖩰𖩱𖩲𖩳𖩴𖩵𖩶𖩷𖩸𖩹𖩺𖩻𖩼𖩽𖩾𖩿𖪀𖪁𖪂𖪃𖪄𖪅𖪆𖪇𖪈𖪉𖪊𖪋𖪌𖪍𖪎𖪏𖪐𖪑𖪒𖪓𖪔𖪕𖪖𖪗𖪘𖪙𖪚𖪛𖪜𖪝𖪞𖪟𖪠𖪡𖪢𖪣𖪤𖪥𖪦𖪧𖪨𖪩𖪪𖪫𖪬𖪭𖪮𖪯𖪰𖪱𖪲𖪳𖪴𖪵𖪶𖪷𖪸𖪹𖪺𖪻𖪼𖪽𖪾 ::numeral 𖫀𖫁𖫂𖫃𖫄𖫅𖫆𖫇𖫈𖫉
|
149 |
+
::script-name Telugu ::n-char 81 ::char ఀఁంఃఄఅఆఇఈఉఊఋఌఎఏఐఒఓఔకఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరఱలళఴవశషసహ఼ఽాిీుూృౄెేైొోౌ్ౘౙౚౝౠౡౢౣ౷౿ ::numeral ౦౧౨౩౪౫౬౭౮౯౸౸౹౹౺౺౻౻౼౼౽౽౾౾ ::sign-virama ్ ::vowel-sign ాిీుూృౄెేైొోౌౢౣ
|
150 |
+
::script-name Thaana ::n-char 39 ::char ހށނރބޅކއވމފދތލގޏސޑޒޓޔޕޖޗޘޙޚޛޜޝޞޟޠޡޢޣޤޥޱ
|
151 |
+
::script-name Thai ::n-char 76 ::char กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤลฦวศษสหฬอฮฯะัาำิีึืฺุูเแโใไๅๆ็่้๊๋์ํ๎๏๚๛ ::numeral ๐๑๒๓๔๕๖๗๘๙ ::sign-virama ฺ
|
152 |
+
::script-name Tibetan ::n-char 137 ::char ༀ༕༖༗༘༙༚༛༜༝༞༟༾༿ཀཁགགྷངཅཆཇཉཊཋཌཌྷཎཏཐདདྷནཔཕབབྷམཙཚཛཛྷཝཞཟའཡརལཤཥསཧཨཀྵཪཫཬཱཱཱིིུུྲྀཷླྀཹེཻོཽཾཿཱྀྀྂྃ྆྇ྈྉྊྋྌྍྎྏྐྑྒྒྷྔྕྖྗྙྚྛྜྜྷྞྟྠྡྡྷྣྤྥྦྦྷྨྩྪྫྫྷྭྮྯྰྱྲླྴྵྶྷྸྐྵྺྻྼ࿀࿁࿂࿃࿎࿏ ::numeral ༠༡༢༣༤༥༦༧༨༩༪༫༬༭༮༯༰༱༲༳ ::vowel-sign ཱཱཱིིུུྲྀཷླྀཹཱེཻོཽྀྀ
|
153 |
+
::script-name Tifinagh ::n-char 58 ::char ⴰⴱⴲⴳⴴⴵⴶⴷⴸⴹⴺⴻⴼⴽⴾⴿⵀⵁⵂⵃⵄⵅⵆⵇⵈⵉⵊⵋⵌⵍⵎⵏⵐⵑⵒⵓⵔⵕⵖⵗⵘⵙⵚⵛⵜⵝⵞⵟⵠⵡⵢⵣⵤⵥⵦⵧⵯ⵿
|
154 |
+
::script-name Tirhuta ::n-char 69 ::char 𑒁𑒂𑒃𑒄𑒅𑒆𑒇𑒈𑒉𑒊𑒋𑒌𑒍𑒎𑒏𑒐𑒑𑒒𑒓𑒔𑒕𑒖𑒗𑒘𑒙𑒚𑒛𑒜𑒝𑒞𑒟𑒠𑒡𑒢𑒣𑒤𑒥𑒦𑒧𑒨𑒩𑒪𑒫𑒬𑒭𑒮𑒯𑒰𑒱𑒲𑒳𑒴𑒵𑒶𑒷𑒸𑒻𑒻𑒼𑒽𑒾𑒿𑓀𑓁𑓃𑓂𑓄𑓆 ::numeral 𑓐𑓑𑓒𑓓𑓔𑓕𑓖𑓗𑓘𑓙 ::sign-virama 𑓂 ::vowel-sign 𑒰𑒱𑒲𑒳𑒴𑒵𑒶𑒷𑒸𑒻𑒻𑒼𑒽𑒾
|
155 |
+
::script-name Toto ::n-char 31 ::char 𞊐𞊑𞊒𞊓𞊔𞊕𞊖𞊗𞊘𞊙𞊚𞊛𞊜𞊝𞊞𞊟𞊠𞊡𞊢𞊣𞊤𞊥𞊦𞊧𞊨𞊩𞊪𞊫𞊬𞊭𞊮
|
156 |
+
::script-name Ugaritic ::n-char 30 ::char 𐎀𐎁𐎂𐎃𐎄𐎅𐎆𐎇𐎈𐎉𐎊𐎋𐎌𐎍𐎎𐎏𐎐𐎑𐎒𐎓𐎔𐎕𐎖𐎗𐎘𐎙𐎚𐎛𐎜𐎝
|
157 |
+
::script-name Vai ::n-char 274 ::char ꔀꔁꔂꔃꔄꔅꔆꔇꔈꔉꔊꔋꔌꔍꔎꔏꔐꔑꔒꔓꔔꔕꔖꔗꔘꔙꔚꔛꔜꔝꔞꔟꔠꔡꔢꔣꔤꔥꔦꔧꔨꔩꔪꔫꔬꔭꔮꔯꔰꔱꔲꔳꔴꔵꔶꔷꔸꔹꔺꔻꔼꔽꔾꔿꕀꕁꕂꕃꕄꕅꕆꕇꕈꕉꕊꕋꕌꕍꕎꕏꕐꕑꕒꕓꕔꕕꕖꕗꕘꕙꕚꕛꕜꕝꕞꕟꕠꕡꕢꕣꕤꕥꕦꕧꕨꕩꕪꕫꕬꕭꕮꕯꕰꕱꕲꕳꕴꕵꕶꕷꕸꕹꕺꕻꕼꕽꕾꕿꖀꖁꖂꖃꖄꖅꖆꖇꖈꖉꖊꖋꖌꖍꖎꖏꖐꖑꖒꖓꖔꖕꖖꖗꖘꖙꖚꖛꖜꖝꖞꖟꖠꖡꖢꖣꖤꖥꖦꖧꖨꖩꖪꖫꖬꖭꖮꖯꖰꖱꖲꖳꖴꖵꖶꖷꖸꖹꖺꖻꖼꖽꖾꖿꗀꗁꗂꗃꗄꗅꗆꗇꗈꗉꗊꗋꗌꗍꗎꗏꗐꗑꗒꗓꗔꗕꗖꗗꗘꗙꗚꗛꗜꗝꗞꗟꗠꗡꗢꗣꗤꗥꗦꗧꗨꗩꗪꗫꗬꗭꗮꗯꗰꗱꗲꗳꗴꗵꗶꗷꗸꗹꗺꗻꗼꗽꗾꗿꘀꘁꘂꘃꘄꘅꘆꘇꘈꘉꘊꘋꘌꘐꘑꘒꘪꘫ ::numeral ꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩
|
158 |
+
::script-name Vedic ::n-char 24 ::char ᳓᳔᳢᳣᳤᳥᳦᳧᳨ᳩᳪᳫᳬ᳭ᳮᳯᳰᳱᳲᳳᳵᳶ᳷ᳺ
|
159 |
+
::script-name Vithkuqi ::n-char 70 ::char 𐕰𐕱𐕲𐕳𐕴𐕵𐕶𐕷𐕸𐕹𐕺𐕼𐕽𐕾𐕿𐖀𐖁𐖂𐖃𐖄𐖅𐖆𐖇𐖈𐖉𐖊𐖌𐖍𐖎𐖏𐖐𐖑𐖒𐖔𐖕𐖗𐖘𐖙𐖚𐖛𐖜𐖝𐖞𐖟𐖠𐖡𐖣𐖤𐖥𐖦𐖧𐖨𐖩𐖪𐖫𐖬𐖭𐖮𐖯𐖰𐖱���𐖴𐖵𐖶𐖷𐖸𐖹𐖻𐖼
|
160 |
+
::script-name Wancho ::n-char 45 ::char 𞋀𞋁𞋂𞋃𞋄𞋅𞋆𞋇𞋈𞋉𞋊𞋋𞋌𞋍𞋎𞋏𞋐𞋑𞋒𞋓𞋔𞋕𞋖𞋗𞋘𞋙𞋚𞋛𞋜𞋝𞋞𞋟𞋠𞋡𞋢𞋣𞋤𞋥𞋦𞋧𞋨𞋩𞋪𞋫𞋿 ::numeral 𞋰𞋱𞋲𞋳𞋴𞋵𞋶𞋷𞋸𞋹
|
161 |
+
::script-name Warang Citi ::n-char 64 ::char 𑢠𑢡𑢢𑢣𑢤𑢥𑢦𑢧𑢨𑢩𑢪𑢫𑢬𑢭𑢮𑢯𑢰𑢱𑢲𑢳𑢴𑢵𑢶𑢷𑢸𑢹𑢺𑢻𑢼𑢽𑢾𑢿𑣀𑣁𑣂𑣃𑣄𑣅𑣆𑣇𑣈𑣉𑣊𑣋𑣌𑣍𑣎𑣏𑣐𑣑𑣒𑣓𑣔𑣕𑣖𑣗𑣘𑣙𑣚𑣛𑣜𑣝𑣞𑣟 ::numeral 𑣠𑣡𑣢𑣣𑣤𑣥𑣦𑣧𑣨𑣩𑣪𑣫𑣬𑣭𑣮𑣯𑣰𑣱𑣲
|
162 |
+
::script-name Yezidi ::n-char 44 ::char 𐺀𐺁𐺂𐺃𐺄𐺅𐺆𐺇𐺈𐺉𐺊𐺋𐺌𐺍𐺎𐺏𐺐𐺑𐺒𐺓𐺔𐺕𐺖𐺗𐺘𐺙𐺚𐺛𐺜𐺝𐺞𐺟𐺠𐺡𐺢𐺣𐺤𐺥𐺦𐺧𐺨𐺩𐺰𐺱
|
163 |
+
::script-name Yi ::n-char 1165 ::char ꀀꀁꀂꀃꀄꀅꀆꀇꀈꀉꀊꀋꀌꀍꀎꀏꀐꀑꀒꀓꀔꀕꀖꀗꀘꀙꀚꀛꀜꀝꀞꀟꀠꀡꀢꀣꀤꀥꀦꀧꀨꀩꀪꀫꀬꀭꀮꀯꀰꀱꀲꀳꀴꀵꀶꀷꀸꀹꀺꀻꀼꀽꀾꀿꁀꁁꁂꁃꁄꁅꁆꁇꁈꁉꁊꁋꁌꁍꁎꁏꁐꁑꁒꁓꁔꁕꁖꁗꁘꁙꁚꁛꁜꁝꁞꁟꁠꁡꁢꁣꁤꁥꁦꁧꁨꁩꁪꁫꁬꁭꁮꁯꁰꁱꁲꁳꁴꁵꁶꁷꁸꁹꁺꁻꁼꁽꁾꁿꂀꂁꂂꂃꂄꂅꂆꂇꂈꂉꂊꂋꂌꂍꂎꂏꂐꂑꂒꂓꂔꂕꂖꂗꂘꂙꂚꂛꂜꂝꂞꂟꂠꂡꂢꂣꂤꂥꂦꂧꂨꂩꂪꂫꂬꂭꂮꂯꂰꂱꂲꂳꂴꂵꂶꂷꂸꂹꂺꂻꂼꂽꂾꂿꃀꃁꃂꃃꃄꃅꃆꃇꃈꃉꃊꃋꃌꃍꃎꃏꃐꃑꃒꃓꃔꃕꃖꃗꃘꃙꃚꃛꃜꃝꃞꃟꃠꃡꃢꃣꃤꃥꃦꃧꃨꃩꃪꃫꃬꃭꃮꃯꃰꃱꃲꃳꃴꃵꃶꃷꃸꃹꃺꃻꃼꃽꃾꃿꄀꄁꄂꄃꄄꄅꄆꄇꄈꄉꄊꄋꄌꄍꄎꄏꄐꄑꄒꄓꄔꄕꄖꄗꄘꄙꄚꄛꄜꄝꄞꄟꄠꄡꄢꄣꄤꄥꄦꄧꄨꄩꄪꄫꄬꄭꄮꄯꄰꄱꄲꄳꄴꄵꄶꄷꄸꄹꄺꄻꄼꄽꄾꄿꅀꅁꅂꅃꅄꅅꅆꅇꅈꅉꅊꅋꅌꅍꅎꅏꅐꅑꅒꅓꅔꅕꅖꅗꅘꅙꅚꅛꅜꅝꅞꅟꅠꅡꅢꅣꅤꅥꅦꅧꅨꅩꅪꅫꅬꅭꅮꅯꅰꅱꅲꅳꅴꅵꅶꅷꅸꅹꅺꅻꅼꅽꅾꅿꆀꆁꆂꆃꆄꆅꆆꆇꆈꆉꆊꆋꆌꆍꆎꆏꆐꆑꆒꆓꆔꆕꆖꆗꆘꆙꆚꆛꆜꆝꆞꆟꆠꆡꆢꆣꆤꆥꆦꆧꆨꆩꆪꆫꆬꆭꆮꆯꆰꆱꆲꆳꆴꆵꆶꆷꆸꆹꆺꆻꆼꆽꆾꆿꇀꇁꇂꇃꇄꇅꇆꇇꇈꇉꇊꇋꇌꇍꇎꇏꇐꇑꇒꇓꇔꇕꇖꇗꇘꇙꇚꇛꇜꇝꇞꇟꇠꇡꇢꇣꇤꇥꇦꇧꇨꇩꇪꇫꇬꇭꇮꇯꇰꇱꇲꇳꇴꇵꇶꇷꇸꇹꇺꇻꇼꇽꇾꇿꈀꈁꈂꈃꈄꈅꈆꈇꈈꈉꈊꈋꈌꈍꈎꈏꈐꈑꈒꈓꈔꈕꈖꈗꈘꈙꈚꈛꈜꈝꈞꈟꈠꈡꈢꈣꈤꈥꈦꈧꈨꈩꈪꈫꈬꈭꈮꈯꈰꈱꈲꈳꈴꈵꈶꈷꈸꈹꈺꈻꈼꈽꈾꈿꉀꉁꉂꉃꉄꉅꉆꉇꉈꉉꉊꉋꉌꉍꉎꉏꉐꉑꉒꉓꉔꉕꉖꉗꉘꉙꉚꉛꉜꉝꉞꉟꉠꉡꉢꉣꉤꉥꉦꉧꉨꉩꉪꉫꉬꉭꉮꉯꉰꉱꉲꉳꉴꉵꉶꉷꉸꉹꉺꉻꉼꉽꉾꉿꊀꊁꊂꊃꊄꊅꊆꊇꊈꊉꊊꊋꊌꊍꊎꊏꊐꊑꊒꊓꊔꊕꊖꊗꊘꊙꊚꊛꊜꊝꊞꊟꊠꊡꊢꊣꊤꊥꊦꊧꊨꊩꊪꊫꊬꊭꊮꊯꊰꊱꊲꊳꊴꊵꊶꊷꊸꊹꊺꊻꊼꊽꊾꊿꋀꋁꋂꋃꋄꋅꋆꋇꋈꋉꋊꋋꋌꋍꋎꋏꋐꋑꋒꋓꋔꋕꋖꋗꋘꋙꋚꋛꋜꋝꋞꋟꋠꋡꋢꋣꋤꋥꋦꋧꋨꋩꋪꋫꋬꋭꋮꋯꋰꋱꋲꋳꋴꋵꋶꋷꋸꋹꋺꋻꋼꋽꋾꋿꌀꌁꌂꌃꌄꌅꌆꌇꌈꌉꌊꌋꌌꌍꌎꌏꌐꌑꌒꌓꌔꌕꌖꌗꌘꌙꌚꌛꌜꌝꌞꌟꌠꌡꌢꌣꌤꌥꌦꌧꌨꌩꌪꌫꌬꌭꌮꌯꌰꌱꌲꌳꌴꌵꌶꌷꌸꌹꌺꌻꌼꌽꌾꌿꍀꍁꍂꍃꍄꍅꍆꍇꍈꍉꍊꍋꍌꍍꍎꍏꍐꍑꍒꍓꍔꍕꍖꍗꍘꍙꍚꍛꍜꍝꍞꍟꍠꍡꍢꍣꍤꍥꍦꍧꍨꍩꍪꍫꍬꍭꍮꍯꍰꍱꍲꍳꍴꍵꍶꍷꍸꍹꍺꍻꍼꍽꍾꍿꎀꎁꎂꎃꎄꎅꎆꎇꎈꎉꎊꎋꎌꎍꎎꎏꎐꎑꎒꎓꎔꎕꎖꎗꎘꎙꎚꎛꎜꎝꎞꎟꎠꎡꎢꎣꎤꎥꎦꎧꎨꎩꎪꎫꎬꎭꎮꎯꎰꎱꎲꎳꎴꎵꎶꎷꎸꎹꎺꎻꎼꎽꎾꎿꏀꏁꏂꏃꏄꏅꏆꏇꏈꏉꏊꏋꏌꏍꏎꏏꏐꏑꏒꏓꏔꏕꏖꏗꏘꏙꏚꏛꏜꏝꏞꏟꏠꏡꏢꏣꏤꏥꏦꏧꏨꏩꏪꏫꏬꏭꏮꏯꏰꏱꏲꏳꏴꏵꏶꏷꏸꏹꏺꏻꏼꏽꏾꏿꐀꐁꐂꐃꐄꐅꐆꐇꐈꐉꐊꐋꐌꐍꐎꐏꐐꐑꐒꐓꐔꐕꐖꐗꐘꐙꐚꐛꐜꐝꐞꐟꐠꐡꐢꐣꐤꐥꐦꐧꐨꐩꐪꐫꐬꐭꐮꐯꐰꐱꐲꐳꐴꐵꐶꐷꐸꐹꐺꐻꐼꐽꐾꐿꑀꑁꑂꑃꑄꑅꑆꑇꑈꑉꑊꑋꑌꑍꑎꑏꑐꑑꑒꑓꑔꑕꑖꑗꑘꑙꑚꑛꑜꑝꑞꑟꑠꑡꑢꑣꑤꑥꑦꑧꑨꑩꑪꑫꑬꑭꑮꑯꑰꑱꑲꑳꑴꑵꑶꑷꑸꑹꑺꑻꑼꑽꑾꑿꒀꒁꒂꒃꒄꒅꒆꒇꒈꒉꒊꒋꒌ
|
164 |
+
::script-name Zanabazar Square ::n-char 63 ::char 𑨀𑨁𑨂𑨃𑨄𑨅𑨆𑨇𑨈𑨉𑨊𑨋𑨌𑨍𑨎𑨏𑨐𑨑𑨒𑨓𑨔𑨕𑨖𑨗𑨘𑨙𑨚𑨛𑨜𑨝𑨞𑨟𑨠𑨡𑨢𑨣𑨤𑨥𑨦𑨧𑨨𑨩𑨪𑨫𑨬𑨭𑨮𑨯𑨰𑨱𑨲𑨳𑨴𑨵𑨶𑨷𑨸𑨹𑨺𑨻𑨼𑨽𑨾 ::sign-virama 𑨴 ::vowel-sign 𑨁𑨂𑨃𑨄𑨅𑨆𑨇𑨈𑨉
|
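Each record above is a single line of `::key value` fields (`::script-name`, `::n-char`, `::char`, `::numeral`, `::sign-virama`, `::vowel-sign`, ...). As a rough illustration of how such a line could be read, here is a minimal Python sketch. It is not uroman's own loader (which is not shown in this diff); the function name `parse_props_line` and the whitespace-token approach are assumptions made for this example only.

```python
def parse_props_line(line: str) -> dict[str, str]:
    """Split one '::key value ::key value ...' record into a dict.

    Hypothetical helper for illustration; assumes field values never
    contain a whitespace-separated token that itself starts with '::'.
    """
    record: dict[str, str] = {}
    key = None
    for token in line.split():
        if token.startswith("::") and len(token) > 2:
            # A new field begins; it may stay empty (e.g. '::numeral' with no value).
            key = token[2:]
            record[key] = ""
        elif key is not None:
            # Multi-word values such as 'Gunjala Gondi' are re-joined with spaces.
            record[key] = f"{record[key]} {token}".strip()
    return record


# Build an Ogham-style record from code points U+1681..U+169A (26 letters).
ogham_chars = "".join(chr(cp) for cp in range(0x1681, 0x169B))
ogham = parse_props_line(f"::script-name Ogham ::n-char 26 ::char {ogham_chars}")

# A record whose character fields are empty, like the Klingon entry above.
klingon = parse_props_line("::script-name Klingon ::n-char 26 ::char ::numeral")

# Sanity check: the declared count should match the number of listed characters.
assert int(ogham["n-char"]) == len(ogham["char"])
```

Comparing `::n-char` against the length of `::char` in this way is one cheap method of spotting records where characters were lost or mangled in transit.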
uroman/data/UnicodeDataPropsCJK.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
uroman/data/UnicodeDataPropsHangul.txt
ADDED
@@ -0,0 +1 @@
|
1 |
+
::script-name Hangul ::n-char 11265 ::char ㄱㄲㄳㄴㄵㄶㄷㄸㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅃㅄㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣㅥㅦㅧㅨㅩㅪㅫㅬㅭㅮㅯㅰㅱㅲㅳㅴㅵㅶㅷㅸㅹㅺㅻㅼㅽㅾㅿㆀㆁㆂㆃㆄㆅㆆㆇㆈㆉㆊㆋㆌㆍㆎ가각갂갃간갅갆갇갈갉갊갋갌갍갎갏감갑값갓갔강갖갗갘같갚갛개객갞갟갠갡갢갣갤갥갦갧갨갩갪갫갬갭갮갯갰갱갲갳갴갵갶갷갸갹갺갻갼갽갾갿걀걁걂걃걄걅걆걇걈걉걊걋걌걍걎걏걐걑걒걓걔걕걖걗걘걙걚걛걜걝걞걟걠걡걢걣걤걥걦걧걨걩걪걫걬걭걮걯거걱걲걳건걵걶걷걸걹걺걻걼걽걾걿검겁겂것겄겅겆겇겈겉겊겋게겍겎겏겐겑겒겓겔겕겖겗겘겙겚겛겜겝겞겟겠겡겢겣겤겥겦겧겨격겪겫견겭겮겯결겱겲겳겴겵겶겷겸겹겺겻겼경겾겿곀곁곂곃계곅곆곇곈곉곊곋곌곍곎곏곐곑곒곓곔곕곖곗곘곙곚곛곜곝곞곟고곡곢곣곤곥곦곧골곩곪곫곬곭곮곯곰곱곲곳곴공곶곷곸곹곺곻과곽곾곿관괁괂괃괄괅괆괇괈괉괊괋괌괍괎괏괐광괒괓괔괕괖괗괘괙괚괛괜괝괞괟괠괡괢괣괤괥괦괧괨괩괪괫괬괭괮괯괰괱괲괳괴괵괶괷괸괹괺괻괼괽괾괿굀굁굂굃굄굅굆굇굈굉굊굋굌굍굎굏교굑굒굓굔굕굖굗굘굙굚굛굜굝굞굟굠굡굢굣굤굥굦굧굨굩굪굫구국굮굯군굱굲굳굴굵굶굷굸굹굺굻굼굽굾굿궀궁궂궃궄궅궆궇궈궉궊궋권궍궎궏궐궑궒궓궔궕궖궗궘궙궚궛궜궝궞궟궠궡궢궣궤궥궦궧궨궩궪궫궬궭궮궯궰궱궲궳궴궵궶궷궸궹궺궻궼궽궾궿귀귁귂귃귄귅귆귇귈귉귊귋귌귍귎귏귐귑귒귓귔귕귖귗귘귙귚귛규귝귞귟균귡귢귣귤귥귦귧귨귩귪귫귬귭귮귯귰귱귲귳귴귵귶귷그극귺귻근귽귾귿글긁긂긃긄긅긆긇금급긊긋긌긍긎긏긐긑긒긓긔긕긖긗긘긙긚긛긜긝긞긟긠긡긢긣긤긥긦긧긨긩긪긫긬긭긮긯기긱긲긳긴긵긶긷길긹긺긻긼긽긾긿김깁깂깃깄깅깆깇깈깉깊깋까깍깎깏깐깑깒깓깔깕깖깗깘깙깚깛깜깝깞깟깠깡깢깣깤깥깦깧깨깩깪깫깬깭깮깯깰깱깲깳깴깵깶깷깸깹깺깻깼깽깾깿꺀꺁꺂꺃꺄꺅꺆꺇꺈꺉꺊꺋꺌꺍꺎꺏꺐꺑꺒꺓꺔꺕꺖꺗꺘꺙꺚꺛꺜꺝꺞꺟꺠꺡꺢꺣꺤꺥꺦꺧꺨꺩꺪꺫꺬꺭꺮꺯꺰꺱꺲꺳꺴꺵꺶꺷꺸꺹꺺꺻꺼꺽꺾꺿껀껁껂껃껄껅껆껇껈껉껊껋껌껍껎껏껐껑껒껓껔껕껖껗께껙껚껛껜껝껞껟껠껡껢껣껤껥껦껧껨껩껪껫껬껭껮껯껰껱껲껳껴껵껶껷껸껹껺껻껼껽껾껿꼀꼁꼂꼃꼄꼅꼆꼇꼈꼉꼊꼋꼌꼍꼎꼏꼐꼑꼒꼓꼔꼕꼖꼗꼘꼙꼚꼛꼜꼝꼞꼟꼠꼡꼢꼣꼤꼥꼦꼧꼨꼩꼪꼫꼬꼭꼮꼯꼰꼱꼲꼳꼴꼵꼶꼷꼸꼹꼺꼻꼼꼽꼾꼿꽀꽁꽂꽃꽄꽅꽆꽇꽈꽉꽊꽋꽌꽍꽎꽏꽐꽑꽒꽓꽔꽕꽖꽗꽘꽙꽚꽛꽜꽝꽞꽟꽠꽡꽢꽣꽤꽥꽦꽧꽨꽩꽪꽫꽬꽭꽮꽯꽰꽱꽲꽳꽴꽵꽶꽷꽸꽹꽺꽻꽼꽽꽾꽿꾀꾁꾂꾃꾄꾅꾆꾇꾈꾉꾊꾋꾌꾍꾎꾏꾐꾑꾒꾓꾔꾕꾖꾗꾘꾙꾚꾛꾜꾝꾞꾟꾠꾡꾢꾣꾤꾥꾦꾧꾨꾩꾪꾫꾬꾭꾮꾯꾰꾱꾲꾳꾴꾵꾶꾷꾸꾹꾺꾻꾼꾽꾾꾿꿀꿁꿂꿃꿄꿅꿆꿇꿈꿉꿊꿋꿌꿍꿎꿏꿐꿑꿒꿓꿔꿕꿖꿗꿘꿙꿚꿛꿜꿝꿞꿟꿠꿡꿢꿣꿤꿥꿦꿧꿨꿩꿪꿫꿬꿭꿮꿯꿰꿱꿲꿳꿴꿵꿶꿷꿸꿹꿺꿻꿼꿽꿾꿿뀀뀁뀂뀃뀄뀅뀆뀇뀈뀉뀊뀋뀌뀍뀎뀏뀐뀑뀒뀓뀔뀕뀖뀗뀘뀙뀚뀛뀜뀝뀞뀟뀠뀡뀢뀣뀤뀥뀦뀧뀨뀩뀪뀫뀬뀭뀮뀯뀰뀱뀲뀳뀴뀵뀶뀷뀸뀹뀺뀻뀼뀽뀾뀿끀끁끂끃끄끅끆끇끈끉끊끋끌끍끎끏끐끑끒끓끔끕끖끗끘끙끚끛끜끝끞끟끠끡끢끣끤끥끦끧끨끩끪끫끬끭끮끯끰끱끲끳끴끵끶끷끸끹끺끻끼끽끾끿낀낁낂낃낄낅낆낇낈낉낊낋낌낍낎낏낐낑낒낓낔낕낖낗나낙낚낛난낝낞낟날낡낢낣낤낥낦낧남납낪낫났낭낮낯낰낱낲낳내낵낶낷낸낹낺낻낼낽낾낿냀냁냂냃냄냅냆냇냈냉냊냋냌냍냎냏냐냑냒냓냔냕냖냗냘냙냚냛냜냝냞냟냠냡냢냣냤냥냦냧냨냩냪냫냬냭냮냯냰냱냲냳냴냵냶냷냸냹냺냻냼냽냾냿넀넁넂넃넄넅넆넇너넉넊넋넌넍넎넏널넑넒넓넔넕넖넗넘넙넚넛넜넝넞넟넠넡넢넣네넥넦넧넨넩넪넫넬넭넮넯넰넱넲넳넴넵넶넷넸넹넺넻넼넽넾넿녀녁녂녃년녅녆녇녈녉녊녋녌녍녎녏념녑녒녓녔녕녖녗녘녙녚녛녜녝녞녟녠녡녢녣녤녥녦녧녨녩녪녫녬녭녮녯녰녱녲녳녴녵녶녷노녹녺녻논녽녾녿놀놁놂놃놄놅놆놇놈놉놊놋놌농놎놏놐놑높놓놔놕놖놗놘놙놚놛놜놝놞놟놠놡놢놣놤놥놦놧놨놩놪놫놬놭놮놯놰놱놲놳놴놵놶놷놸놹놺놻놼놽놾놿뇀뇁뇂뇃뇄뇅뇆뇇뇈뇉뇊뇋뇌뇍뇎뇏뇐뇑뇒뇓뇔뇕뇖뇗뇘뇙뇚뇛뇜뇝뇞뇟뇠뇡뇢뇣뇤뇥뇦뇧뇨뇩뇪뇫뇬뇭뇮뇯뇰뇱뇲뇳뇴뇵뇶뇷뇸뇹뇺뇻뇼뇽뇾뇿눀눁눂눃누눅눆눇눈눉눊눋눌눍눎눏눐눑눒눓눔눕눖눗눘눙눚눛눜눝눞눟눠눡눢눣눤눥눦눧눨눩눪눫눬눭눮눯눰눱눲눳눴눵눶눷눸눹���눻눼눽눾눿뉀뉁뉂뉃뉄뉅뉆뉇뉈뉉뉊뉋뉌뉍뉎뉏뉐뉑뉒뉓뉔뉕뉖뉗뉘뉙뉚뉛뉜뉝뉞뉟뉠뉡뉢뉣뉤뉥뉦뉧뉨뉩뉪뉫뉬뉭뉮뉯뉰뉱뉲뉳뉴뉵뉶뉷뉸뉹뉺뉻뉼뉽뉾뉿늀늁늂늃늄늅늆늇늈늉늊늋늌늍늎늏느늑늒늓는늕늖늗늘늙늚늛늜늝늞늟늠늡늢늣늤능늦늧늨늩늪늫늬늭늮늯늰늱늲늳늴늵늶늷늸늹늺늻늼늽늾늿닀닁닂닃닄닅닆닇니닉닊닋닌닍닎닏닐닑닒닓닔닕닖닗님닙닚닛닜닝닞닟닠닡닢닣다닥닦닧단닩닪닫달닭닮닯닰닱닲닳담답닶닷닸당닺닻닼닽닾닿대댁댂댃댄댅댆댇댈댉댊댋댌댍댎댏댐댑댒댓댔댕댖댗댘댙댚댛댜댝댞댟댠댡댢댣댤댥댦댧댨댩댪댫댬댭댮댯댰댱댲댳댴댵댶댷댸댹댺댻댼댽댾댿덀덁덂덃덄덅
덆덇덈덉덊덋덌덍덎덏덐덑덒덓더덕덖덗던덙덚덛덜덝덞덟덠덡덢덣덤덥덦덧덨덩덪덫덬덭덮덯데덱덲덳덴덵덶덷델덹덺덻덼덽덾덿뎀뎁뎂뎃뎄뎅뎆뎇뎈뎉뎊뎋뎌뎍뎎뎏뎐뎑뎒뎓뎔뎕뎖뎗뎘뎙뎚뎛뎜뎝뎞뎟뎠뎡뎢뎣뎤뎥뎦뎧뎨뎩뎪뎫뎬뎭뎮뎯뎰뎱뎲뎳뎴뎵뎶뎷뎸뎹뎺뎻뎼뎽뎾뎿돀돁돂돃도독돆돇돈돉돊돋돌돍돎돏돐돑돒돓돔돕돖돗돘동돚돛돜돝돞돟돠돡돢돣돤돥돦돧돨돩돪돫돬돭돮돯돰돱돲돳돴돵돶돷돸돹돺돻돼돽돾돿됀됁됂됃됄됅됆됇됈됉됊됋됌됍됎됏됐됑됒됓됔됕됖됗되됙됚됛된됝됞됟될됡됢됣됤됥됦됧됨됩됪됫됬됭됮됯됰됱됲됳됴됵됶됷됸됹됺됻됼됽됾됿둀둁둂둃둄둅둆둇둈둉둊둋둌둍둎둏두둑둒둓둔둕둖둗둘둙둚둛둜둝둞둟둠둡둢둣둤둥둦둧둨둩둪둫둬둭둮둯둰둱둲둳둴둵둶둷둸둹둺둻둼둽둾둿뒀뒁뒂뒃뒄뒅뒆뒇뒈뒉뒊뒋뒌뒍뒎뒏뒐뒑뒒뒓뒔뒕뒖뒗뒘뒙뒚뒛뒜뒝뒞뒟뒠뒡뒢뒣뒤뒥뒦뒧뒨뒩뒪뒫뒬뒭뒮뒯뒰뒱뒲뒳뒴뒵뒶뒷뒸뒹뒺뒻뒼뒽뒾뒿듀듁듂듃듄듅듆듇듈듉듊듋듌듍듎듏듐듑듒듓듔듕듖듗듘듙듚듛드득듞듟든듡듢듣들듥듦듧듨듩듪듫듬듭듮듯듰등듲듳듴듵듶듷듸듹듺듻듼듽듾듿딀딁딂딃딄딅딆딇딈딉딊딋딌딍딎딏딐딑딒딓디딕딖딗딘딙딚딛딜딝딞딟딠딡딢딣딤딥딦딧딨딩딪딫딬딭딮딯따딱딲딳딴딵딶딷딸딹딺딻딼딽딾딿땀땁땂땃땄땅땆땇땈땉땊땋때땍땎땏땐땑땒땓땔땕땖땗땘땙땚땛땜땝땞땟땠땡땢땣땤땥땦땧땨땩땪땫땬땭땮땯땰땱땲땳땴땵땶땷땸땹땺땻땼땽땾땿떀떁떂떃떄떅떆떇떈떉떊떋떌떍떎떏떐떑떒떓떔떕떖떗떘떙떚떛떜떝떞떟떠떡떢떣떤떥떦떧떨떩떪떫떬떭떮떯떰떱떲떳떴떵떶떷떸떹떺떻떼떽떾떿뗀뗁뗂뗃뗄뗅뗆뗇뗈뗉뗊뗋뗌뗍뗎뗏뗐뗑뗒뗓뗔뗕뗖뗗뗘뗙뗚뗛뗜뗝뗞뗟뗠뗡뗢뗣뗤뗥뗦뗧뗨뗩뗪뗫뗬뗭뗮뗯뗰뗱뗲뗳뗴뗵뗶뗷뗸뗹뗺뗻뗼뗽뗾뗿똀똁똂똃똄똅똆똇똈똉똊똋똌똍똎똏또똑똒똓똔똕똖똗똘똙똚똛똜똝똞똟똠똡똢똣똤똥똦똧똨똩똪똫똬똭똮똯똰똱똲똳똴똵똶똷똸똹똺똻똼똽똾똿뙀뙁뙂뙃뙄뙅뙆뙇뙈뙉뙊뙋뙌뙍뙎뙏뙐뙑뙒뙓뙔뙕뙖뙗뙘뙙뙚뙛뙜뙝뙞뙟뙠뙡뙢뙣뙤뙥뙦뙧뙨뙩뙪뙫뙬뙭뙮뙯뙰뙱뙲뙳뙴뙵뙶뙷뙸뙹뙺뙻뙼뙽뙾뙿뚀뚁뚂뚃뚄뚅뚆뚇뚈뚉뚊뚋뚌뚍뚎뚏뚐뚑뚒뚓뚔뚕뚖뚗뚘뚙뚚뚛뚜뚝뚞뚟뚠뚡뚢뚣뚤뚥뚦뚧뚨뚩뚪뚫뚬뚭뚮뚯뚰뚱뚲뚳뚴뚵뚶뚷뚸뚹뚺뚻뚼뚽뚾뚿뛀뛁뛂뛃뛄뛅뛆뛇뛈뛉뛊뛋뛌뛍뛎뛏뛐뛑뛒뛓뛔뛕뛖뛗뛘뛙뛚뛛뛜뛝뛞뛟뛠뛡뛢뛣뛤뛥뛦뛧뛨뛩뛪뛫뛬뛭뛮뛯뛰뛱뛲뛳뛴뛵뛶뛷뛸뛹뛺뛻뛼뛽뛾뛿뜀뜁뜂뜃뜄뜅뜆뜇뜈뜉뜊뜋뜌뜍뜎뜏뜐뜑뜒뜓뜔뜕뜖뜗뜘뜙뜚뜛뜜뜝뜞뜟뜠뜡뜢뜣뜤뜥뜦뜧뜨뜩뜪뜫뜬뜭뜮뜯뜰뜱뜲뜳뜴뜵뜶뜷뜸뜹뜺뜻뜼뜽뜾뜿띀띁띂띃띄띅띆띇띈띉띊띋띌띍띎띏띐띑띒띓띔띕띖띗띘띙띚띛띜띝띞띟띠띡띢띣띤띥띦띧띨띩띪띫띬띭띮띯띰띱띲띳띴띵띶띷띸띹띺띻라락띾띿란랁랂랃랄랅랆랇랈랉랊랋람랍랎랏랐랑랒랓랔랕랖랗래랙랚랛랜랝랞랟랠랡랢랣랤랥랦랧램랩랪랫랬랭랮랯랰랱랲랳랴략랶랷랸랹랺랻랼랽랾랿럀럁럂럃럄럅럆럇럈량럊럋럌럍럎럏럐럑럒럓럔럕럖럗럘럙럚럛럜럝럞럟럠럡럢럣럤럥럦럧럨럩럪럫러럭럮럯런럱럲럳럴럵럶럷럸럹럺럻럼럽럾럿렀렁렂렃렄렅렆렇레렉렊렋렌렍렎렏렐렑렒렓렔렕렖렗렘렙렚렛렜렝렞렟렠렡렢렣려력렦렧련렩렪렫렬렭렮렯렰렱렲렳렴렵렶렷렸령렺렻렼렽렾렿례롁롂롃롄롅롆롇롈롉롊롋롌롍롎롏롐롑롒롓롔롕롖롗롘롙롚롛로록롞롟론롡롢롣롤롥롦롧롨롩롪롫롬롭롮롯롰롱롲롳롴롵롶롷롸롹롺롻롼롽롾롿뢀뢁뢂뢃뢄뢅뢆뢇뢈뢉뢊뢋뢌뢍뢎뢏뢐뢑뢒뢓뢔뢕뢖뢗뢘뢙뢚뢛뢜뢝뢞뢟뢠뢡뢢뢣뢤뢥뢦뢧뢨뢩뢪뢫뢬뢭뢮뢯뢰뢱뢲뢳뢴뢵뢶뢷뢸뢹뢺뢻뢼뢽뢾뢿룀룁룂룃룄룅룆룇룈룉룊룋료룍룎룏룐룑룒룓룔룕룖룗룘룙룚룛룜룝룞룟룠룡룢룣룤룥룦룧루룩룪룫룬룭룮룯룰룱룲룳룴룵룶룷룸룹룺룻룼룽룾룿뤀뤁뤂뤃뤄뤅뤆뤇뤈뤉뤊뤋뤌뤍뤎뤏뤐뤑뤒뤓뤔뤕뤖뤗뤘뤙뤚뤛뤜뤝뤞뤟뤠뤡뤢뤣뤤뤥뤦뤧뤨뤩뤪뤫뤬뤭뤮뤯뤰뤱뤲뤳뤴뤵뤶뤷뤸뤹뤺뤻뤼뤽뤾뤿륀륁륂륃륄륅륆륇륈륉륊륋륌륍륎륏륐륑륒륓륔륕륖륗류륙륚륛륜륝륞륟률륡륢륣륤륥륦륧륨륩륪륫륬륭륮륯륰륱륲륳르륵륶륷른륹륺륻를륽륾륿릀릁릂릃름릅릆릇릈릉릊릋릌릍릎릏릐릑릒릓릔릕릖릗릘릙릚릛릜릝릞릟릠릡릢릣릤릥릦릧릨릩릪릫리릭릮릯린릱릲릳릴릵릶릷릸릹릺릻림립릾릿맀링맂맃맄맅맆맇마막맊맋만맍많맏말맑맒맓맔맕맖맗맘맙맚맛맜망맞맟맠맡맢맣매맥맦맧맨맩맪맫맬맭맮맯맰맱맲맳맴맵맶맷맸맹맺맻맼맽맾맿먀먁먂먃먄먅먆먇먈먉먊먋먌먍먎먏먐먑먒먓먔먕먖먗먘먙먚먛먜먝먞먟먠먡먢먣먤먥먦먧먨먩먪먫먬먭먮먯먰먱먲먳먴먵먶먷머먹먺먻먼먽먾먿멀멁멂멃멄멅멆멇멈멉멊멋멌멍멎멏멐멑멒멓메멕멖멗멘멙멚멛멜멝멞멟멠멡멢멣멤멥멦멧멨멩멪멫멬멭멮멯며멱멲멳면멵멶멷멸멹멺멻멼멽멾멿몀몁몂몃몄명몆몇몈몉몊몋몌몍몎몏몐몑몒몓몔몕몖몗몘몙몚몛몜몝몞몟몠몡몢몣몤몥몦몧모목몪몫몬몭몮몯몰몱몲몳몴몵몶몷몸몹몺못몼몽몾몿뫀뫁뫂뫃뫄뫅뫆뫇뫈뫉뫊뫋뫌뫍뫎뫏뫐뫑뫒뫓뫔뫕뫖뫗뫘뫙뫚뫛뫜뫝뫞뫟뫠뫡뫢뫣뫤뫥뫦뫧뫨뫩뫪뫫뫬뫭뫮뫯뫰뫱뫲뫳뫴뫵뫶뫷뫸뫹뫺뫻뫼뫽뫾뫿묀묁묂묃묄묅묆묇묈묉묊묋묌묍묎묏묐묑묒묓묔묕
묖묗묘묙묚묛묜묝묞묟묠묡묢묣묤묥묦묧묨묩묪묫묬묭묮묯묰묱묲묳무묵묶묷문묹묺묻물묽묾묿뭀뭁뭂뭃뭄뭅뭆뭇뭈뭉뭊뭋뭌뭍뭎뭏뭐뭑뭒뭓뭔뭕뭖뭗뭘뭙뭚뭛뭜뭝뭞뭟뭠뭡뭢뭣뭤뭥뭦뭧뭨뭩뭪뭫뭬뭭뭮뭯뭰뭱뭲뭳뭴뭵뭶뭷뭸뭹뭺뭻뭼뭽뭾뭿뮀뮁뮂뮃뮄뮅뮆뮇뮈뮉뮊뮋뮌뮍뮎뮏뮐뮑뮒뮓뮔뮕뮖뮗뮘뮙뮚뮛뮜뮝뮞뮟뮠뮡뮢뮣뮤뮥뮦뮧뮨뮩뮪뮫뮬뮭뮮뮯뮰뮱뮲뮳뮴뮵뮶뮷뮸뮹뮺뮻뮼뮽뮾뮿므믁믂믃믄믅믆믇믈믉믊믋믌믍믎믏믐믑믒믓믔믕믖믗믘믙믚믛믜믝믞믟믠믡믢믣믤믥믦믧믨믩믪믫믬믭믮믯믰믱믲믳믴믵믶믷미믹믺믻민믽믾믿밀밁밂밃밄밅밆밇밈밉밊밋밌밍밎및밐밑밒밓바박밖밗반밙밚받발밝밞밟밠밡밢밣밤밥밦밧밨방밪밫밬밭밮밯배백밲밳밴밵밶밷밸밹밺밻밼밽밾밿뱀뱁뱂뱃뱄뱅뱆뱇뱈뱉뱊뱋뱌뱍뱎뱏뱐뱑뱒뱓뱔뱕뱖뱗뱘뱙뱚뱛뱜뱝뱞뱟뱠뱡뱢뱣뱤뱥뱦뱧뱨뱩뱪뱫뱬뱭뱮뱯뱰뱱뱲뱳뱴뱵뱶뱷뱸뱹뱺뱻뱼뱽뱾뱿벀벁벂벃버벅벆벇번벉벊벋벌벍벎벏벐벑벒벓범법벖벗벘벙벚벛벜벝벞벟베벡벢벣벤벥벦벧벨벩벪벫벬벭벮벯벰벱벲벳벴벵벶벷벸벹벺벻벼벽벾벿변볁볂볃별볅볆볇볈볉볊볋볌볍볎볏볐병볒볓볔볕볖볗볘볙볚볛볜볝볞볟볠볡볢볣볤볥볦볧볨볩볪볫볬볭볮볯볰볱볲볳보복볶볷본볹볺볻볼볽볾볿봀봁봂봃봄봅봆봇봈봉봊봋봌봍봎봏봐봑봒봓봔봕봖봗봘봙봚봛봜봝봞봟봠봡봢봣봤봥봦봧봨봩봪봫봬봭봮봯봰봱봲봳봴봵봶봷봸봹봺봻봼봽봾봿뵀뵁뵂뵃뵄뵅뵆뵇뵈뵉뵊뵋뵌뵍뵎뵏뵐뵑뵒뵓뵔뵕뵖뵗뵘뵙뵚뵛뵜뵝뵞뵟뵠뵡뵢뵣뵤뵥뵦뵧뵨뵩뵪뵫뵬뵭뵮뵯뵰뵱뵲뵳뵴뵵뵶뵷뵸뵹뵺뵻뵼뵽뵾뵿부북붂붃분붅붆붇불붉붊붋붌붍붎붏붐붑붒붓붔붕붖붗붘붙붚붛붜붝붞붟붠붡붢붣붤붥붦붧붨붩붪붫붬붭붮붯붰붱붲붳붴붵붶붷붸붹붺붻붼붽붾붿뷀뷁뷂뷃뷄뷅뷆뷇뷈뷉뷊뷋뷌뷍뷎뷏뷐뷑뷒뷓뷔뷕뷖뷗뷘뷙뷚뷛뷜뷝뷞뷟뷠뷡뷢뷣뷤뷥뷦뷧뷨뷩뷪뷫뷬뷭뷮뷯뷰뷱뷲뷳뷴뷵뷶뷷뷸뷹뷺뷻뷼뷽뷾뷿븀븁븂븃븄븅븆븇븈븉븊븋브븍븎븏븐븑븒븓블븕븖븗븘븙븚븛븜븝븞븟븠븡븢븣븤븥븦븧븨븩븪븫븬븭븮븯븰븱븲븳븴븵븶븷븸븹븺븻븼븽븾븿빀빁빂빃비빅빆빇빈빉빊빋빌빍빎빏빐빑빒빓빔빕빖빗빘빙빚빛빜빝빞빟빠빡빢빣빤빥빦빧빨빩빪빫빬빭빮빯빰빱빲빳빴빵빶빷빸빹빺빻빼빽빾빿뺀뺁뺂뺃뺄뺅뺆뺇뺈뺉뺊뺋뺌뺍뺎뺏뺐뺑뺒뺓뺔뺕뺖뺗뺘뺙뺚뺛뺜뺝뺞뺟뺠뺡뺢뺣뺤뺥뺦뺧뺨뺩뺪뺫뺬뺭뺮뺯뺰뺱뺲뺳뺴뺵뺶뺷뺸뺹뺺뺻뺼뺽뺾뺿뻀뻁뻂뻃뻄뻅뻆뻇뻈뻉뻊뻋뻌뻍뻎뻏뻐뻑뻒뻓뻔뻕뻖뻗뻘뻙뻚뻛뻜뻝뻞뻟뻠뻡뻢뻣뻤뻥뻦뻧뻨뻩뻪뻫뻬뻭뻮뻯뻰뻱뻲뻳뻴뻵뻶뻷뻸뻹뻺뻻뻼뻽뻾뻿뼀뼁뼂뼃뼄뼅뼆뼇뼈뼉뼊뼋뼌뼍뼎뼏뼐뼑뼒뼓뼔뼕뼖뼗뼘뼙뼚뼛뼜뼝뼞뼟뼠뼡뼢뼣뼤뼥뼦뼧뼨뼩뼪뼫뼬뼭뼮뼯뼰뼱뼲뼳뼴뼵뼶뼷뼸뼹뼺뼻뼼뼽뼾뼿뽀뽁뽂뽃뽄뽅뽆뽇뽈뽉뽊뽋뽌뽍뽎뽏뽐뽑뽒뽓뽔뽕뽖뽗뽘뽙뽚뽛뽜뽝뽞뽟뽠뽡뽢뽣뽤뽥뽦뽧뽨뽩뽪뽫뽬뽭뽮뽯뽰뽱뽲뽳뽴뽵뽶뽷뽸뽹뽺뽻뽼뽽뽾뽿뾀뾁뾂뾃뾄뾅뾆뾇뾈뾉뾊뾋뾌뾍뾎��뾐뾑뾒뾓뾔뾕뾖뾗뾘뾙뾚뾛뾜뾝뾞뾟뾠뾡뾢뾣뾤뾥뾦뾧뾨뾩뾪뾫뾬뾭뾮뾯뾰뾱뾲뾳뾴뾵뾶뾷뾸뾹뾺뾻뾼뾽뾾뾿뿀뿁뿂뿃뿄뿅뿆뿇뿈뿉뿊뿋뿌뿍뿎뿏뿐뿑뿒뿓뿔뿕뿖뿗뿘뿙뿚뿛뿜뿝뿞뿟뿠뿡뿢뿣뿤뿥뿦뿧뿨뿩뿪뿫뿬뿭뿮뿯뿰뿱뿲뿳뿴뿵뿶뿷뿸뿹뿺뿻뿼뿽뿾뿿쀀쀁쀂쀃쀄쀅쀆쀇쀈쀉쀊쀋쀌쀍쀎쀏쀐쀑쀒쀓쀔쀕쀖쀗쀘쀙쀚쀛쀜쀝쀞쀟쀠쀡쀢쀣쀤쀥쀦쀧쀨쀩쀪쀫쀬쀭쀮쀯쀰쀱쀲쀳쀴쀵쀶쀷쀸쀹쀺쀻쀼쀽쀾쀿쁀쁁쁂쁃쁄쁅쁆쁇쁈쁉쁊쁋쁌쁍쁎쁏쁐쁑쁒쁓쁔쁕쁖쁗쁘쁙쁚쁛쁜쁝쁞쁟쁠쁡쁢쁣쁤쁥쁦쁧쁨쁩쁪쁫쁬쁭쁮쁯쁰쁱쁲쁳쁴쁵쁶쁷쁸쁹쁺쁻쁼쁽쁾쁿삀삁삂삃삄삅삆삇삈삉삊삋삌삍삎삏삐삑삒삓삔삕삖삗삘삙삚삛삜삝삞삟삠삡삢삣삤삥삦삧삨삩삪삫사삭삮삯산삱삲삳살삵삶삷삸삹삺삻삼삽삾삿샀상샂샃샄샅샆샇새색샊샋샌샍샎샏샐샑샒샓샔샕샖샗샘샙샚샛샜생샞샟샠샡샢샣샤샥샦샧샨샩샪샫샬샭샮샯샰샱샲샳샴샵샶샷샸샹샺샻샼샽샾샿섀섁섂섃섄섅섆섇섈섉섊섋섌섍섎섏섐섑섒섓섔섕섖섗섘섙섚섛서석섞섟선섡섢섣설섥섦섧섨섩섪섫섬섭섮섯섰성섲섳섴섵섶섷세섹섺섻센섽섾섿셀셁셂셃셄셅셆셇셈셉셊셋셌셍셎셏셐셑셒셓셔셕셖셗션셙셚셛셜셝셞셟셠셡셢셣셤셥셦셧셨셩셪셫셬셭셮셯셰셱셲셳셴셵셶셷셸셹셺셻셼셽셾셿솀솁솂솃솄솅솆솇솈솉솊솋소속솎솏손솑솒솓솔솕솖솗솘솙솚솛솜솝솞솟솠송솢솣솤솥솦솧솨솩솪솫솬솭솮솯솰솱솲솳솴솵솶솷솸솹솺솻솼솽솾솿쇀쇁쇂쇃쇄쇅쇆쇇쇈쇉쇊쇋쇌쇍쇎쇏쇐쇑쇒쇓쇔쇕쇖쇗쇘쇙쇚쇛쇜쇝쇞쇟쇠쇡쇢쇣쇤쇥쇦쇧쇨쇩쇪쇫쇬쇭쇮쇯쇰쇱쇲쇳쇴쇵쇶쇷쇸쇹쇺쇻쇼쇽쇾쇿숀숁숂숃숄숅숆숇숈숉숊숋숌숍숎숏숐숑숒숓숔숕숖숗수숙숚숛순숝숞숟술숡숢숣숤숥숦숧숨숩숪숫숬숭숮숯숰숱숲숳숴숵숶숷숸숹숺숻숼숽숾숿쉀쉁쉂쉃쉄쉅쉆쉇쉈쉉쉊쉋쉌쉍쉎쉏쉐쉑쉒쉓쉔쉕쉖쉗쉘쉙쉚쉛쉜쉝쉞쉟쉠쉡쉢쉣쉤쉥쉦쉧쉨쉩쉪쉫쉬쉭쉮쉯쉰쉱쉲쉳쉴쉵쉶쉷쉸쉹쉺쉻쉼쉽쉾쉿슀슁슂슃슄슅슆슇슈슉슊슋슌슍슎슏슐슑슒슓슔슕슖슗슘슙슚슛슜슝슞슟슠슡슢슣스슥슦슧슨슩슪슫슬슭슮슯슰슱슲슳슴습슶슷슸승슺슻슼슽슾슿싀싁싂싃싄싅싆싇싈싉싊싋싌싍싎싏싐싑싒싓싔싕싖싗싘싙싚싛시식싞싟신싡싢싣실
싥싦싧싨싩싪싫심십싮싯싰싱싲싳싴싵싶싷싸싹싺싻싼싽싾싿쌀쌁쌂쌃쌄쌅쌆쌇쌈쌉쌊쌋쌌쌍쌎쌏쌐쌑쌒쌓쌔쌕쌖쌗쌘쌙쌚쌛쌜쌝쌞쌟쌠쌡쌢쌣쌤쌥쌦쌧쌨쌩쌪쌫쌬쌭쌮쌯쌰쌱쌲쌳쌴쌵쌶쌷쌸쌹쌺쌻쌼쌽쌾쌿썀썁썂썃썄썅썆썇썈썉썊썋썌썍썎썏썐썑썒썓썔썕썖썗썘썙썚썛썜썝썞썟썠썡썢썣썤썥썦썧써썩썪썫썬썭썮썯썰썱썲썳썴썵썶썷썸썹썺썻썼썽썾썿쎀쎁쎂쎃쎄쎅쎆쎇쎈쎉쎊쎋쎌쎍쎎쎏쎐쎑쎒쎓쎔쎕쎖쎗쎘쎙쎚쎛쎜쎝쎞쎟쎠쎡쎢쎣쎤쎥쎦쎧쎨쎩쎪쎫쎬쎭쎮쎯쎰쎱쎲쎳쎴쎵쎶쎷쎸쎹쎺쎻쎼쎽쎾쎿쏀쏁쏂쏃쏄쏅쏆쏇쏈쏉쏊쏋쏌쏍쏎쏏쏐쏑쏒쏓쏔쏕쏖쏗쏘쏙쏚쏛쏜쏝쏞쏟쏠쏡쏢쏣쏤쏥쏦쏧쏨쏩쏪쏫쏬쏭쏮쏯쏰쏱쏲쏳쏴쏵쏶쏷쏸쏹쏺쏻쏼쏽쏾쏿쐀쐁쐂쐃쐄쐅쐆쐇쐈쐉쐊쐋쐌쐍쐎쐏쐐쐑쐒쐓쐔쐕쐖쐗쐘쐙쐚쐛쐜쐝쐞쐟쐠쐡쐢쐣쐤쐥쐦쐧쐨쐩쐪쐫쐬쐭쐮쐯쐰쐱쐲쐳쐴쐵쐶쐷쐸쐹쐺쐻쐼쐽쐾쐿쑀쑁쑂쑃쑄쑅쑆쑇쑈쑉쑊쑋쑌쑍쑎쑏쑐쑑쑒쑓쑔쑕쑖쑗쑘쑙쑚쑛쑜쑝쑞쑟쑠쑡쑢쑣쑤쑥쑦쑧쑨쑩쑪쑫쑬쑭쑮쑯쑰쑱쑲쑳쑴쑵쑶쑷쑸쑹쑺쑻쑼쑽쑾쑿쒀쒁쒂쒃쒄쒅쒆쒇쒈쒉쒊쒋쒌쒍쒎쒏쒐쒑쒒쒓쒔쒕쒖쒗쒘쒙쒚쒛쒜쒝쒞쒟쒠쒡쒢쒣쒤쒥쒦쒧쒨쒩쒪쒫쒬쒭쒮쒯쒰쒱쒲쒳쒴쒵쒶쒷쒸쒹쒺쒻쒼쒽쒾쒿쓀쓁쓂쓃쓄쓅쓆쓇쓈쓉쓊쓋쓌쓍쓎쓏쓐쓑쓒쓓쓔쓕쓖쓗쓘쓙쓚쓛쓜쓝쓞쓟쓠쓡쓢쓣쓤쓥쓦쓧쓨쓩쓪쓫쓬쓭쓮쓯쓰쓱쓲쓳쓴쓵쓶쓷쓸쓹쓺쓻쓼쓽쓾쓿씀씁씂씃씄씅씆씇씈씉씊씋씌씍씎씏씐씑씒씓씔씕씖씗씘씙씚씛씜씝씞씟씠씡씢씣씤씥씦씧씨씩씪씫씬씭씮씯씰씱씲씳씴씵씶씷씸씹씺씻씼씽씾씿앀앁앂앃아악앆앇안앉않앋알앍앎앏앐앑앒앓암압앖앗았앙앚앛앜앝앞앟애액앢앣앤앥앦앧앨앩앪앫앬앭앮앯앰앱앲앳앴앵앶앷앸앹앺앻야약앾앿얀얁얂얃얄얅얆얇얈얉얊얋얌얍얎얏얐양얒얓얔얕얖얗얘얙얚얛얜얝얞얟얠얡얢얣얤얥얦얧얨얩얪얫얬얭얮얯얰얱얲얳어억얶얷언얹얺얻얼얽얾얿엀엁엂엃엄업없엇었엉엊엋엌엍엎엏에엑엒엓엔엕엖엗엘엙엚엛엜엝엞엟엠엡엢엣엤엥엦엧엨엩엪엫여역엮엯연엱엲엳열엵엶엷엸엹엺엻염엽엾엿였영옂옃옄옅옆옇예옉옊옋옌옍옎옏옐옑옒옓옔옕옖옗옘옙옚옛옜옝옞옟옠옡옢옣오옥옦옧온옩옪옫올옭옮옯옰옱옲옳옴옵옶옷옸옹���옻옼옽옾옿와왁왂왃완왅왆왇왈왉왊왋왌왍왎왏왐왑왒왓왔왕왖왗왘왙왚왛왜왝왞왟왠왡왢왣왤왥왦왧왨왩왪왫왬왭왮왯왰왱왲왳왴왵왶왷외왹왺왻왼왽왾왿욀욁욂욃욄욅욆욇욈욉욊욋욌욍욎욏욐욑욒욓요욕욖욗욘욙욚욛욜욝욞욟욠욡욢욣욤욥욦욧욨용욪욫욬욭욮욯우욱욲욳운욵욶욷울욹욺욻욼욽욾욿움웁웂웃웄웅웆웇웈웉웊웋워웍웎웏원웑웒웓월웕웖웗웘웙웚웛웜웝웞웟웠웡웢웣웤웥웦웧웨웩웪웫웬웭웮웯웰웱웲웳웴웵웶웷웸웹웺웻웼웽웾웿윀윁윂윃위윅윆윇윈윉윊윋윌윍윎윏윐윑윒윓윔윕윖윗윘윙윚윛윜윝윞윟유육윢윣윤윥윦윧율윩윪윫윬윭윮윯윰윱윲윳윴융윶윷윸윹윺윻으윽윾윿은읁읂읃을읅읆읇읈읉읊읋음읍읎읏읐응읒읓읔읕읖읗의읙읚읛읜읝읞읟읠읡읢읣읤읥읦읧읨읩읪읫읬읭읮읯읰읱읲읳이익읶읷인읹읺읻일읽읾읿잀잁잂잃임입잆잇있잉잊잋잌잍잎잏자작잒잓잔잕잖잗잘잙잚잛잜잝잞잟잠잡잢잣잤장잦잧잨잩잪잫재잭잮잯잰잱잲잳잴잵잶잷잸잹잺잻잼잽잾잿쟀쟁쟂쟃쟄쟅쟆쟇쟈쟉쟊쟋쟌쟍쟎쟏쟐쟑쟒쟓쟔쟕쟖쟗쟘쟙쟚쟛쟜쟝쟞쟟쟠쟡쟢쟣쟤쟥쟦쟧쟨쟩쟪쟫쟬쟭쟮쟯쟰쟱쟲쟳쟴쟵쟶쟷쟸쟹쟺쟻쟼쟽쟾쟿저적젂젃전젅젆젇절젉젊젋젌젍젎젏점접젒젓젔정젖젗젘젙젚젛제젝젞젟젠젡젢젣젤젥젦젧젨젩젪젫젬젭젮젯젰젱젲젳젴젵젶젷져젹젺젻젼젽젾젿졀졁졂졃졄졅졆졇졈졉졊졋졌졍졎졏졐졑졒졓졔졕졖졗졘졙졚졛졜졝졞졟졠졡졢졣졤졥졦졧졨졩졪졫졬졭졮졯조족졲졳존졵졶졷졸졹졺졻졼졽졾졿좀좁좂좃좄종좆좇좈좉좊좋좌좍좎좏좐좑좒좓좔좕좖좗좘좙좚좛좜좝좞좟좠좡좢좣좤좥좦좧좨좩좪좫좬좭좮좯좰좱좲좳좴좵좶좷좸좹좺좻좼좽좾좿죀죁죂죃죄죅죆죇죈죉죊죋죌죍죎죏죐죑죒죓죔죕죖죗죘죙죚죛죜죝죞죟죠죡죢죣죤죥죦죧죨죩죪죫죬죭죮죯죰죱죲죳죴죵죶죷죸죹죺죻주죽죾죿준줁줂줃줄줅줆줇줈줉줊줋줌줍줎줏줐중줒줓줔줕줖줗줘줙줚줛줜줝줞줟줠줡줢줣줤줥줦줧줨줩줪줫줬줭줮줯줰줱줲줳줴줵줶줷줸줹줺줻줼줽줾줿쥀쥁쥂쥃쥄쥅쥆쥇쥈쥉쥊쥋쥌쥍쥎쥏쥐쥑쥒쥓쥔쥕쥖쥗쥘쥙쥚쥛쥜쥝쥞쥟쥠쥡쥢쥣쥤쥥쥦쥧쥨쥩쥪쥫쥬쥭쥮쥯쥰쥱쥲쥳쥴쥵쥶쥷쥸쥹쥺쥻쥼쥽쥾쥿즀즁즂즃즄즅즆즇즈즉즊즋즌즍즎즏즐즑즒즓즔즕즖즗즘즙즚즛즜증즞즟즠즡즢즣즤즥즦즧즨즩즪즫즬즭즮즯즰즱즲즳즴즵즶즷즸즹즺즻즼즽즾즿지직짂짃진짅짆짇질짉짊짋짌짍짎짏짐집짒짓짔징짖짗짘짙짚짛짜짝짞짟짠짡짢짣짤짥짦짧짨짩짪짫짬짭짮짯짰짱짲짳짴짵짶짷째짹짺짻짼짽짾짿쨀쨁쨂쨃쨄쨅쨆쨇쨈쨉쨊쨋쨌쨍쨎쨏쨐쨑쨒쨓쨔쨕쨖쨗쨘쨙쨚쨛쨜쨝쨞쨟쨠쨡쨢쨣쨤쨥쨦쨧쨨쨩쨪쨫쨬쨭쨮쨯쨰쨱쨲쨳쨴쨵쨶쨷쨸쨹쨺쨻쨼쨽쨾쨿쩀쩁쩂쩃쩄쩅쩆쩇쩈쩉쩊쩋쩌쩍쩎쩏쩐쩑쩒쩓쩔쩕쩖쩗쩘쩙쩚쩛쩜쩝쩞쩟쩠쩡쩢쩣쩤쩥쩦쩧쩨쩩쩪쩫쩬쩭쩮쩯쩰쩱쩲쩳쩴쩵쩶쩷쩸쩹쩺쩻쩼쩽쩾쩿쪀쪁쪂쪃쪄쪅쪆쪇쪈쪉쪊쪋쪌쪍쪎쪏쪐쪑쪒쪓쪔쪕쪖쪗쪘쪙쪚쪛쪜쪝쪞쪟쪠쪡쪢쪣쪤쪥쪦쪧쪨쪩쪪쪫쪬쪭쪮쪯쪰쪱쪲
쪳쪴쪵쪶쪷쪸쪹쪺쪻쪼쪽쪾쪿쫀쫁쫂쫃쫄쫅쫆쫇쫈쫉쫊쫋쫌쫍쫎쫏쫐쫑쫒쫓쫔쫕쫖쫗쫘쫙쫚쫛쫜쫝쫞쫟쫠쫡쫢쫣쫤쫥쫦쫧쫨쫩쫪쫫쫬쫭쫮쫯쫰쫱쫲쫳쫴쫵쫶쫷쫸쫹쫺쫻쫼쫽쫾쫿쬀쬁쬂쬃쬄쬅쬆쬇쬈쬉쬊쬋쬌쬍쬎쬏쬐쬑쬒쬓쬔쬕쬖쬗쬘쬙쬚쬛쬜쬝쬞쬟쬠쬡쬢쬣쬤쬥쬦쬧쬨쬩쬪쬫쬬쬭쬮쬯쬰쬱쬲쬳쬴쬵쬶쬷쬸쬹쬺쬻쬼쬽쬾쬿쭀쭁쭂쭃쭄쭅쭆쭇쭈쭉쭊쭋쭌쭍쭎쭏쭐쭑쭒쭓쭔쭕쭖쭗쭘쭙쭚쭛쭜쭝쭞쭟쭠쭡쭢쭣쭤쭥쭦쭧쭨쭩쭪쭫쭬쭭쭮쭯쭰쭱쭲쭳쭴쭵쭶쭷쭸쭹쭺쭻쭼쭽쭾쭿쮀쮁쮂쮃쮄쮅쮆쮇쮈쮉쮊쮋쮌쮍쮎쮏쮐쮑쮒쮓쮔쮕쮖쮗쮘쮙쮚쮛쮜쮝쮞쮟쮠쮡쮢쮣쮤쮥쮦쮧쮨쮩쮪쮫쮬쮭쮮쮯쮰쮱쮲쮳쮴쮵쮶쮷쮸쮹쮺쮻쮼쮽쮾쮿쯀쯁쯂쯃쯄쯅쯆쯇쯈쯉쯊쯋쯌쯍쯎쯏쯐쯑쯒쯓쯔쯕쯖쯗쯘쯙쯚쯛쯜쯝쯞쯟쯠쯡쯢쯣쯤쯥쯦쯧쯨쯩쯪쯫쯬쯭쯮쯯쯰쯱쯲쯳쯴쯵쯶쯷쯸쯹쯺쯻쯼쯽쯾쯿찀찁찂찃찄찅찆찇찈찉찊찋찌찍찎찏찐찑찒찓찔찕찖찗찘찙찚찛찜찝찞찟찠찡찢찣찤찥찦찧차착찪찫찬찭찮찯찰찱찲찳찴찵찶찷참찹찺찻찼창찾찿챀챁챂챃채책챆챇챈챉챊챋챌챍챎챏챐챑챒챓챔챕챖챗챘챙챚챛챜챝챞챟챠챡챢챣챤챥챦챧챨챩챪챫챬챭챮챯챰챱챲챳챴챵챶챷챸챹챺챻챼챽챾챿첀첁첂첃첄첅첆첇첈첉첊첋첌첍첎첏첐첑첒첓첔첕첖첗처척첚첛천첝첞첟철첡첢첣첤첥첦첧첨첩첪첫첬청첮첯첰첱첲첳체첵첶첷첸첹첺첻첼첽첾첿쳀쳁쳂쳃쳄쳅쳆쳇쳈쳉쳊쳋쳌쳍쳎쳏쳐쳑쳒쳓쳔쳕쳖쳗쳘쳙쳚쳛쳜쳝쳞쳟쳠쳡쳢쳣쳤쳥쳦쳧쳨쳩쳪쳫쳬쳭쳮쳯쳰쳱쳲쳳쳴쳵쳶쳷쳸쳹쳺쳻쳼쳽쳾쳿촀촁촂촃촄촅촆촇초촉촊촋촌촍촎촏촐촑촒촓촔촕촖촗촘촙촚촛촜총촞촟촠촡촢촣촤촥촦촧촨촩촪촫촬촭촮촯촰촱촲촳촴촵촶촷촸촹촺촻촼촽촾촿쵀쵁쵂쵃쵄쵅쵆쵇쵈쵉쵊쵋쵌쵍쵎쵏쵐쵑쵒쵓쵔쵕쵖쵗쵘쵙쵚쵛최쵝쵞쵟쵠쵡쵢쵣쵤쵥쵦쵧쵨쵩쵪쵫쵬쵭쵮쵯쵰쵱쵲쵳쵴쵵쵶쵷쵸쵹쵺쵻쵼쵽쵾쵿춀춁춂춃춄춅춆춇춈춉춊춋춌춍춎춏춐춑춒춓추축춖춗춘춙춚춛출춝춞춟춠춡춢춣춤춥춦춧춨충춪춫춬춭춮춯춰춱춲춳춴춵춶춷춸춹춺춻춼춽춾춿췀췁췂췃췄췅췆췇췈췉췊췋췌췍췎췏췐췑췒췓췔췕췖췗췘췙췚췛췜췝췞췟췠췡췢췣췤췥췦췧취췩췪췫췬췭췮췯췰췱췲췳췴췵췶췷췸췹췺췻췼췽췾췿츀츁츂츃츄츅츆츇츈츉츊츋츌츍츎츏츐츑츒츓츔츕츖츗츘츙츚츛츜츝츞츟츠측츢츣츤츥츦츧츨츩츪츫츬츭츮츯츰츱츲츳츴층츶츷츸츹츺츻츼츽츾츿칀칁칂칃칄칅칆칇칈칉칊칋칌칍칎칏칐칑칒칓칔칕칖칗치칙칚칛친칝칞칟칠칡칢칣칤칥칦칧침칩칪칫칬칭칮칯칰칱칲칳카칵칶칷칸칹칺칻칼칽칾칿캀캁캂캃캄캅캆캇캈캉캊캋캌캍캎캏캐캑캒캓캔캕캖캗캘캙캚캛캜캝캞캟캠캡캢캣캤캥캦캧캨캩캪캫캬캭캮캯캰캱캲캳캴캵캶캷캸캹캺캻캼캽캾캿컀컁컂컃컄컅컆컇컈컉컊컋컌컍컎컏컐컑컒컓컔컕컖컗컘컙컚컛컜컝컞컟컠컡컢컣커컥컦컧컨컩컪컫컬컭컮컯컰컱컲컳컴컵컶컷컸컹컺컻컼컽컾컿케켁켂켃켄켅켆켇켈켉켊켋켌켍켎켏켐켑켒켓켔켕켖켗켘켙켚켛켜켝켞켟켠켡켢켣켤켥켦켧켨켩켪켫켬켭켮켯켰켱켲켳켴켵켶켷켸켹켺켻켼켽켾켿콀콁콂콃콄콅콆콇콈콉콊콋콌콍콎콏콐콑콒콓코콕콖콗콘콙콚콛콜콝콞콟콠콡콢콣콤콥콦콧콨콩콪콫콬콭콮콯콰콱콲콳콴콵콶콷콸콹콺콻콼콽콾콿쾀쾁쾂쾃쾄쾅쾆쾇쾈쾉쾊쾋쾌쾍쾎쾏쾐쾑쾒쾓쾔쾕쾖쾗쾘쾙쾚쾛쾜쾝쾞쾟쾠쾡쾢쾣쾤쾥쾦쾧쾨쾩쾪쾫쾬쾭쾮쾯쾰쾱쾲쾳쾴쾵쾶쾷쾸쾹쾺쾻쾼쾽쾾쾿쿀쿁쿂쿃쿄쿅쿆쿇쿈쿉쿊쿋쿌쿍쿎쿏쿐쿑쿒쿓쿔쿕쿖쿗쿘쿙쿚쿛쿜쿝쿞쿟쿠쿡쿢쿣쿤쿥쿦쿧쿨쿩쿪쿫쿬쿭쿮쿯쿰쿱쿲쿳쿴쿵쿶쿷쿸쿹쿺쿻쿼쿽쿾쿿퀀퀁퀂퀃퀄퀅퀆퀇퀈퀉퀊퀋퀌퀍퀎퀏퀐퀑퀒퀓퀔퀕퀖퀗퀘퀙퀚퀛퀜퀝퀞퀟퀠퀡퀢퀣퀤퀥퀦퀧퀨퀩퀪퀫퀬퀭퀮퀯퀰퀱퀲퀳퀴퀵퀶퀷퀸퀹퀺퀻퀼퀽퀾퀿큀큁큂큃큄큅큆큇큈큉큊큋큌큍큎큏큐큑큒큓큔큕큖큗큘큙큚큛큜큝큞큟큠큡큢큣큤큥큦큧큨큩큪큫크큭큮큯큰큱큲큳클큵큶큷큸큹큺큻큼큽큾큿킀킁킂킃킄킅킆킇킈킉킊킋킌킍킎킏킐킑킒킓킔킕킖킗킘킙킚킛킜킝킞킟킠킡킢킣키킥킦킧킨킩킪킫킬킭킮킯킰킱킲킳킴킵킶킷킸킹킺킻킼킽킾킿타탁탂탃탄탅탆탇탈탉탊탋탌탍탎탏탐탑탒탓탔탕탖탗탘탙탚탛태택탞탟탠탡탢탣탤탥탦탧탨탩탪탫탬탭탮탯탰탱탲탳탴탵탶탷탸탹탺탻탼탽탾탿턀턁턂턃턄턅턆턇턈턉턊턋턌턍턎턏턐턑턒턓턔턕턖턗턘턙턚턛턜턝턞턟턠턡턢턣턤턥턦턧턨턩턪턫턬턭턮턯터턱턲턳턴턵턶턷털턹턺턻턼턽턾턿텀텁텂텃텄텅텆텇텈텉텊텋테텍텎텏텐텑텒텓텔텕텖텗텘텙텚텛템텝텞텟텠텡텢텣텤텥텦텧텨텩텪텫텬텭텮텯텰텱텲텳텴텵텶텷텸텹텺텻텼텽텾텿톀톁톂톃톄톅톆톇톈톉톊톋톌톍톎톏톐톑톒톓톔톕톖톗톘톙톚톛톜톝톞톟토톡톢톣톤톥톦톧톨톩톪톫톬톭톮톯톰톱톲톳톴통톶톷톸톹톺톻톼톽톾톿퇀퇁퇂퇃퇄퇅퇆퇇퇈퇉퇊퇋퇌퇍퇎퇏퇐퇑퇒퇓퇔퇕퇖퇗퇘퇙퇚퇛퇜퇝퇞퇟퇠퇡퇢퇣퇤퇥퇦퇧퇨퇩퇪퇫퇬퇭퇮퇯퇰퇱퇲퇳퇴퇵퇶퇷퇸퇹퇺퇻퇼퇽퇾퇿툀툁툂툃툄툅툆툇툈툉툊툋툌툍툎툏툐툑툒툓툔툕툖툗툘툙툚툛툜툝툞툟툠툡툢툣툤툥툦툧툨툩툪툫투툭툮툯툰툱툲툳툴툵툶툷툸툹툺툻툼툽툾툿퉀퉁퉂퉃퉄퉅퉆퉇퉈퉉퉊퉋퉌퉍퉎퉏퉐퉑퉒퉓퉔퉕퉖퉗퉘퉙퉚퉛퉜퉝퉞퉟퉠퉡퉢퉣퉤퉥퉦퉧퉨퉩퉪퉫퉬퉭퉮퉯퉰퉱퉲퉳퉴퉵퉶퉷퉸퉹퉺퉻퉼퉽퉾퉿튀튁튂
튃튄튅튆튇튈튉튊튋튌튍튎튏튐튑튒튓튔튕튖튗튘튙튚튛튜튝튞튟튠튡튢튣튤튥튦튧튨튩튪튫튬튭튮튯튰튱튲튳튴튵튶튷트특튺튻튼튽튾튿틀틁틂틃틄틅틆틇틈틉틊틋틌틍틎틏틐틑틒틓틔틕틖틗틘틙틚틛틜틝틞틟틠틡틢틣틤틥틦틧틨틩틪틫틬틭틮틯티틱틲틳틴틵틶틷틸틹틺틻틼틽틾틿팀팁팂팃팄팅팆팇팈팉팊팋파팍팎팏판팑팒팓팔팕팖팗팘팙팚팛팜팝팞팟팠팡팢팣팤팥팦팧패팩팪팫팬팭팮팯팰팱팲팳팴팵팶팷팸팹팺팻팼팽팾팿퍀퍁퍂퍃퍄퍅퍆퍇퍈퍉퍊퍋퍌퍍퍎퍏퍐퍑퍒퍓퍔퍕퍖퍗퍘퍙퍚퍛퍜퍝퍞퍟퍠퍡퍢퍣퍤퍥퍦퍧퍨퍩퍪퍫퍬퍭퍮퍯퍰퍱퍲퍳퍴퍵퍶퍷퍸퍹퍺퍻퍼퍽퍾퍿펀펁펂펃펄펅펆펇펈펉펊펋펌펍펎��펐펑펒펓펔펕펖펗페펙펚펛펜펝펞펟펠펡펢펣펤펥펦펧펨펩펪펫펬펭펮펯펰펱펲펳펴펵펶펷편펹펺펻펼펽펾펿폀폁폂폃폄폅폆폇폈평폊폋폌폍폎폏폐폑폒폓폔폕폖폗폘폙폚폛폜폝폞폟폠폡폢폣폤폥폦폧폨폩폪폫포폭폮폯폰폱폲폳폴폵폶폷폸폹폺폻폼폽폾폿퐀퐁퐂퐃퐄퐅퐆퐇퐈퐉퐊퐋퐌퐍퐎퐏퐐퐑퐒퐓퐔퐕퐖퐗퐘퐙퐚퐛퐜퐝퐞퐟퐠퐡퐢퐣퐤퐥퐦퐧퐨퐩퐪퐫퐬퐭퐮퐯퐰퐱퐲퐳퐴퐵퐶퐷퐸퐹퐺퐻퐼퐽퐾퐿푀푁푂푃푄푅푆푇푈푉푊푋푌푍푎푏푐푑푒푓푔푕푖푗푘푙푚푛표푝푞푟푠푡푢푣푤푥푦푧푨푩푪푫푬푭푮푯푰푱푲푳푴푵푶푷푸푹푺푻푼푽푾푿풀풁풂풃풄풅풆풇품풉풊풋풌풍풎풏풐풑풒풓풔풕풖풗풘풙풚풛풜풝풞풟풠풡풢풣풤풥풦풧풨풩풪풫풬풭풮풯풰풱풲풳풴풵풶풷풸풹풺풻풼풽풾풿퓀퓁퓂퓃퓄퓅퓆퓇퓈퓉퓊퓋퓌퓍퓎퓏퓐퓑퓒퓓퓔퓕퓖퓗퓘퓙퓚퓛퓜퓝퓞퓟퓠퓡퓢퓣퓤퓥퓦퓧퓨퓩퓪퓫퓬퓭퓮퓯퓰퓱퓲퓳퓴퓵퓶퓷퓸퓹퓺퓻퓼퓽퓾퓿픀픁픂픃프픅픆픇픈픉픊픋플픍픎픏픐픑픒픓픔픕픖픗픘픙픚픛픜픝픞픟픠픡픢픣픤픥픦픧픨픩픪픫픬픭픮픯픰픱픲픳픴픵픶픷픸픹픺픻피픽픾픿핀핁핂핃필핅핆핇핈핉핊핋핌핍핎핏핐핑핒핓핔핕핖핗하학핚핛한핝핞핟할핡핢핣핤핥핦핧함합핪핫핬항핮핯핰핱핲핳해핵핶핷핸핹핺핻핼핽핾핿햀햁햂햃햄햅햆햇했행햊햋햌햍햎햏햐햑햒햓햔햕햖햗햘햙햚햛햜햝햞햟햠햡햢햣햤향햦햧햨햩햪햫햬햭햮햯햰햱햲햳햴햵햶햷햸햹햺햻햼햽햾햿헀헁헂헃헄헅헆헇허헉헊헋헌헍헎헏헐헑헒헓헔헕헖헗험헙헚헛헜헝헞헟헠헡헢헣헤헥헦헧헨헩헪헫헬헭헮헯헰헱헲헳헴헵헶헷헸헹헺헻헼헽헾헿혀혁혂혃현혅혆혇혈혉혊혋혌혍혎혏혐협혒혓혔형혖혗혘혙혚혛혜혝혞혟혠혡혢혣혤혥혦혧혨혩혪혫혬혭혮혯혰혱혲혳혴혵혶혷호혹혺혻혼혽혾혿홀홁홂홃홄홅홆홇홈홉홊홋홌홍홎홏홐홑홒홓화확홖홗환홙홚홛활홝홞홟홠홡홢홣홤홥홦홧홨황홪홫홬홭홮홯홰홱홲홳홴홵홶홷홸홹홺홻홼홽홾홿횀횁횂횃횄횅횆횇횈횉횊횋회획횎횏횐횑횒횓횔횕횖횗횘횙횚횛횜횝횞횟횠횡횢횣횤횥횦횧효횩횪횫횬횭횮횯횰횱횲횳횴횵횶횷횸횹횺횻횼횽횾횿훀훁훂훃후훅훆훇훈훉훊훋훌훍훎훏훐훑훒훓훔훕훖훗훘훙훚훛훜훝훞훟훠훡훢훣훤훥훦훧훨훩훪훫훬훭훮훯훰훱훲훳훴훵훶훷훸훹훺훻훼훽훾훿휀휁휂휃휄휅휆휇휈휉휊휋휌휍휎휏휐휑휒휓휔휕휖휗휘휙휚휛휜휝휞휟휠휡휢휣휤휥휦휧휨휩휪휫휬휭휮휯휰휱휲휳휴휵휶휷휸휹휺휻휼휽휾휿흀흁흂흃흄흅흆흇흈흉흊흋흌흍흎흏흐흑흒흓흔흕흖흗흘흙흚흛흜흝흞흟흠흡흢흣흤흥흦흧흨흩흪흫희흭흮흯흰흱흲흳흴흵흶흷흸흹흺흻흼흽흾흿힀힁힂힃힄힅힆힇히힉힊힋힌힍힎힏힐힑힒힓힔힕힖힗힘힙힚힛힜힝힞힟힠힡힢힣
|
uroman/data/romanization-auto-table.txt
ADDED
The diff for this file is too large to render.
See raw diff
|
|
uroman/data/romanization-table-arabic-block.txt
ADDED
@@ -0,0 +1,179 @@
|
1 |
+
::s ، ::t , ::comment ARABIC COMMA
|
2 |
+
::s ؛ ::t ; ::comment ARABIC SEMICOLON
|
3 |
+
::s ؟ ::t ? ::comment ARABIC QUESTION MARK
|
4 |
+
::s ء ::t ' ::comment ARABIC LETTER HAMZA
|
5 |
+
::s آ ::t a ::comment ARABIC LETTER ALEF WITH MADDA ABOVE
|
6 |
+
::s أ ::t a ::comment ARABIC LETTER ALEF WITH HAMZA ABOVE
|
7 |
+
::s ؤ ::t w ::comment ARABIC LETTER WAW WITH HAMZA ABOVE
|
8 |
+
::s إ ::t i ::comment ARABIC LETTER ALEF WITH HAMZA BELOW
|
9 |
+
::s ئ ::t ye ::comment ARABIC LETTER YEH WITH HAMZA ABOVE
|
10 |
+
::s ا ::t a ::comment ARABIC LETTER ALEF
|
11 |
+
::s ب ::t b ::comment ARABIC LETTER BEH
|
12 |
+
::s ة ::t a ::comment ARABIC LETTER TEH MARBUTA
|
13 |
+
::s ت ::t t ::comment ARABIC LETTER TEH
|
14 |
+
::s ث ::t th ::comment ARABIC LETTER THEH
|
15 |
+
::s ج ::t j ::comment ARABIC LETTER JEEM
|
16 |
+
::s ح ::t h ::comment ARABIC LETTER HAH
|
17 |
+
::s خ ::t kh ::comment ARABIC LETTER KHAH
|
18 |
+
::s د ::t d ::comment ARABIC LETTER DAL
|
19 |
+
::s ذ ::t th ::comment ARABIC LETTER THAL
|
20 |
+
::s ر ::t r ::comment ARABIC LETTER REH
|
21 |
+
::s ز ::t z ::comment ARABIC LETTER ZAIN
|
22 |
+
::s س ::t s ::comment ARABIC LETTER SEEN
|
23 |
+
::s ش ::t sh ::comment ARABIC LETTER SHEEN
|
24 |
+
::s ص ::t s ::comment ARABIC LETTER SAD
|
25 |
+
::s ض ::t d ::comment ARABIC LETTER DAD
|
26 |
+
::s ط ::t t ::comment ARABIC LETTER TAH
|
27 |
+
::s ظ ::t z ::comment ARABIC LETTER ZAH
|
28 |
+
::s ع ::t ' ::comment ARABIC LETTER AIN
|
29 |
+
::s غ ::t gh ::comment ARABIC LETTER GHAIN
|
30 |
+
::s ـ ::t - ::comment ARABIC TATWEEL
|
31 |
+
::s ف ::t f ::comment ARABIC LETTER FEH
|
32 |
+
::s ق ::t q ::comment ARABIC LETTER QAF
|
33 |
+
::s ك ::t k ::comment ARABIC LETTER KAF
|
34 |
+
::s ل ::t l ::comment ARABIC LETTER LAM
|
35 |
+
::s م ::t m ::comment ARABIC LETTER MEEM
|
36 |
+
::s ن ::t n ::comment ARABIC LETTER NOON
|
37 |
+
::s ه ::t h ::comment ARABIC LETTER HEH
|
38 |
+
::s و ::t w ::comment ARABIC LETTER WAW
|
39 |
+
::s ى ::t a ::comment ARABIC LETTER ALEF MAKSURA
|
40 |
+
::s ي ::t y ::comment ARABIC LETTER YEH
|
41 |
+
::s َ ::t a ::comment ARABIC FATHA
|
42 |
+
::s ُ ::t u ::comment ARABIC DAMMA
|
43 |
+
::s ِ ::t i ::comment ARABIC KASRA
|
44 |
+
::s ْ ::t ::comment ARABIC SUKUN
|
45 |
+
::s ٔ ::t ' ::comment ARABIC HAMZA ABOVE
|
46 |
+
::s ٕ ::t ' ::comment ARABIC HAMZA BELOW
|
47 |
+
::s ٠ ::t 0 ::comment ARABIC-INDIC DIGIT ZERO
|
48 |
+
::s ١ ::t 1 ::comment ARABIC-INDIC DIGIT ONE
|
49 |
+
::s ٢ ::t 2 ::comment ARABIC-INDIC DIGIT TWO
|
50 |
+
::s ٣ ::t 3 ::comment ARABIC-INDIC DIGIT THREE
|
51 |
+
::s ٤ ::t 4 ::comment ARABIC-INDIC DIGIT FOUR
|
52 |
+
::s ٥ ::t 5 ::comment ARABIC-INDIC DIGIT FIVE
|
53 |
+
::s ٦ ::t 6 ::comment ARABIC-INDIC DIGIT SIX
|
54 |
+
::s ٧ ::t 7 ::comment ARABIC-INDIC DIGIT SEVEN
|
55 |
+
::s ٨ ::t 8 ::comment ARABIC-INDIC DIGIT EIGHT
|
56 |
+
::s ٩ ::t 9 ::comment ARABIC-INDIC DIGIT NINE
|
57 |
+
::s ٪ ::t % ::comment ARABIC PERCENT SIGN
|
58 |
+
::s ٫ ::t , ::comment ARABIC DECIMAL SEPARATOR
|
59 |
+
::s ٬ ::t , ::comment ARABIC THOUSANDS SEPARATOR
|
60 |
+
::s ٮ ::t b ::comment ARABIC LETTER DOTLESS BEH
|
61 |
+
::s ٯ ::t q ::comment ARABIC LETTER DOTLESS QAF
|
62 |
+
::s ٰ ::t a ::comment ARABIC LETTER SUPERSCRIPT ALEF
|
63 |
+
::s ٱ ::t a ::comment ARABIC LETTER ALEF WASLA
|
64 |
+
::s ٲ ::t a ::comment ARABIC LETTER ALEF WITH WAVY HAMZA ABOVE
|
65 |
+
::s ٳ ::t a ::comment ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
|
66 |
+
::s ٷ ::t u ::comment ARABIC LETTER U WITH HAMZA ABOVE
|
67 |
+
::s ٹ ::t tt ::comment ARABIC LETTER TTEH
|
68 |
+
::s ٺ ::t tt ::comment ARABIC LETTER TTEHEH
|
69 |
+
::s ٻ ::t b ::comment ARABIC LETTER BEEH
|
70 |
+
::s ټ ::t t ::comment ARABIC LETTER TEH WITH RING
|
71 |
+
::s ٽ ::t t ::comment ARABIC LETTER TEH WITH THREE DOTS ABOVE DOWNWARDS
|
72 |
+
::s پ ::t p ::comment ARABIC LETTER PEH
|
73 |
+
::s ٿ ::t t ::comment ARABIC LETTER TEHEH
|
74 |
+
::s ڀ ::t b ::comment ARABIC LETTER BEHEH
|
75 |
+
::s ځ ::t h ::comment ARABIC LETTER HAH WITH HAMZA ABOVE
|
76 |
+
::s ڂ ::t h ::comment ARABIC LETTER HAH WITH TWO DOTS VERTICAL ABOVE
|
77 |
+
::s ڃ ::t ny ::comment ARABIC LETTER NYEH
|
78 |
+
::s ڄ ::t dy ::comment ARABIC LETTER DYEH
|
79 |
+
::s څ ::t h ::comment ARABIC LETTER HAH WITH THREE DOTS ABOVE
|
80 |
+
::s چ ::t tch ::comment ARABIC LETTER TCHEH
|
81 |
+
::s ڇ ::t tch ::comment ARABIC LETTER TCHEHEH
|
82 |
+
::s ڈ ::t dd ::comment ARABIC LETTER DDAL
|
83 |
+
::s ډ ::t d ::comment ARABIC LETTER DAL WITH RING
|
84 |
+
::s ڊ ::t d ::comment ARABIC LETTER DAL WITH DOT BELOW
|
85 |
+
::s ڋ ::t d ::comment ARABIC LETTER DAL WITH DOT BELOW AND SMALL TAH
|
86 |
+
::s ڌ ::t d ::comment ARABIC LETTER DAHAL
|
87 |
+
::s ڍ ::t dd ::comment ARABIC LETTER DDAHAL
|
88 |
+
::s ڎ ::t d ::comment ARABIC LETTER DUL
|
89 |
+
::s ڏ ::t d ::comment ARABIC LETTER DAL WITH THREE DOTS ABOVE DOWNWARDS
|
90 |
+
::s ڐ ::t d ::comment ARABIC LETTER DAL WITH FOUR DOTS ABOVE
|
91 |
+
::s ڑ ::t rr ::comment ARABIC LETTER RREH
|
92 |
+
::s ڒ ::t r ::comment ARABIC LETTER REH WITH SMALL V
|
93 |
+
::s ړ ::t r ::comment ARABIC LETTER REH WITH RING
|
94 |
+
::s ڔ ::t r ::comment ARABIC LETTER REH WITH DOT BELOW
|
95 |
+
::s ڕ ::t r ::comment ARABIC LETTER REH WITH SMALL V BELOW
|
96 |
+
::s ږ ::t r ::comment ARABIC LETTER REH WITH DOT BELOW AND DOT ABOVE
|
97 |
+
::s ڗ ::t r ::comment ARABIC LETTER REH WITH TWO DOTS ABOVE
|
98 |
+
::s ژ ::t j ::comment ARABIC LETTER JEH
|
99 |
+
::s ڙ ::t r ::comment ARABIC LETTER REH WITH FOUR DOTS ABOVE
|
100 |
+
::s ښ ::t s ::comment ARABIC LETTER SEEN WITH DOT BELOW AND DOT ABOVE
|
101 |
+
::s ڛ ::t s ::comment ARABIC LETTER SEEN WITH THREE DOTS BELOW
|
102 |
+
::s ڜ ::t s ::comment ARABIC LETTER SEEN WITH THREE DOTS BELOW AND THREE DOTS ABOVE
|
103 |
+
::s ڝ ::t s ::comment ARABIC LETTER SAD WITH TWO DOTS BELOW
|
104 |
+
::s ڞ ::t s ::comment ARABIC LETTER SAD WITH THREE DOTS ABOVE
|
105 |
+
::s ڟ ::t t ::comment ARABIC LETTER TAH WITH THREE DOTS ABOVE
|
106 |
+
::s ڠ ::t n ::comment ARABIC LETTER AIN WITH THREE DOTS ABOVE
|
107 |
+
::s ڡ ::t f ::comment ARABIC LETTER DOTLESS FEH
|
108 |
+
::s ڢ ::t f ::comment ARABIC LETTER FEH WITH DOT MOVED BELOW
|
109 |
+
::s ڣ ::t f ::comment ARABIC LETTER FEH WITH DOT BELOW
|
110 |
+
::s ڤ ::t v ::comment ARABIC LETTER VEH
|
111 |
+
::s ڥ ::t f ::comment ARABIC LETTER FEH WITH THREE DOTS BELOW
|
112 |
+
::s ڦ ::t p ::comment ARABIC LETTER PEHEH
|
113 |
+
::s ڧ ::t q ::comment ARABIC LETTER QAF WITH DOT ABOVE
|
114 |
+
::s ڨ ::t q ::comment ARABIC LETTER QAF WITH THREE DOTS ABOVE
|
115 |
+
::s ک ::t k ::comment ARABIC LETTER KEHEH
|
116 |
+
::s ڪ ::t k ::comment ARABIC LETTER SWASH KAF
|
117 |
+
::s ګ ::t k ::comment ARABIC LETTER KAF WITH RING
|
118 |
+
::s ڬ ::t k ::comment ARABIC LETTER KAF WITH DOT ABOVE
|
119 |
+
::s ڭ ::t ng ::comment ARABIC LETTER NG
|
120 |
+
::s ڮ ::t k ::comment ARABIC LETTER KAF WITH THREE DOTS BELOW
|
121 |
+
::s گ ::t g ::comment ARABIC LETTER GAF
|
122 |
+
::s ڰ ::t g ::comment ARABIC LETTER GAF WITH RING
|
123 |
+
::s ڱ ::t ng ::comment ARABIC LETTER NGOEH
|
124 |
+
::s ڲ ::t g ::comment ARABIC LETTER GAF WITH TWO DOTS BELOW
|
125 |
+
::s ڳ ::t g ::comment ARABIC LETTER GUEH
|
126 |
+
::s ڴ ::t g ::comment ARABIC LETTER GAF WITH THREE DOTS ABOVE
|
127 |
+
::s ڵ ::t l ::comment ARABIC LETTER LAM WITH SMALL V
|
128 |
+
::s ڶ ::t l ::comment ARABIC LETTER LAM WITH DOT ABOVE
|
129 |
+
::s ڷ ::t l ::comment ARABIC LETTER LAM WITH THREE DOTS ABOVE
|
130 |
+
::s ڸ ::t l ::comment ARABIC LETTER LAM WITH THREE DOTS BELOW
|
131 |
+
::s ڹ ::t n ::comment ARABIC LETTER NOON WITH DOT BELOW
|
132 |
+
::s ں ::t n ::comment ARABIC LETTER NOON GHUNNA
|
133 |
+
::s ڻ ::t rn ::comment ARABIC LETTER RNOON
|
134 |
+
::s ڼ ::t n ::comment ARABIC LETTER NOON WITH RING
|
135 |
+
::s ڽ ::t n ::comment ARABIC LETTER NOON WITH THREE DOTS ABOVE
|
136 |
+
::s ھ ::t h ::comment ARABIC LETTER HEH DOACHASHMEE
|
137 |
+
::s ڿ ::t tch ::comment ARABIC LETTER TCHEH WITH DOT ABOVE
|
138 |
+
::s ۀ ::t h ::comment ARABIC LETTER HEH WITH YEH ABOVE
|
139 |
+
::s ہ ::t h ::comment ARABIC LETTER HEH GOAL
|
140 |
+
::s ۂ ::t h ::comment ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
|
141 |
+
::s ۃ ::t a ::comment ARABIC LETTER TEH MARBUTA GOAL
|
142 |
+
::s ۄ ::t w ::comment ARABIC LETTER WAW WITH RING
|
143 |
+
::s ۅ ::t oe ::comment ARABIC LETTER KIRGHIZ OE
|
144 |
+
::s ۆ ::t oe ::comment ARABIC LETTER OE
|
145 |
+
::s ۇ ::t u ::comment ARABIC LETTER U
|
146 |
+
::s ۈ ::t yu ::comment ARABIC LETTER YU
|
147 |
+
::s ۉ ::t yu ::comment ARABIC LETTER KIRGHIZ YU
|
148 |
+
::s ۊ ::t w ::comment ARABIC LETTER WAW WITH TWO DOTS ABOVE
|
149 |
+
::s ۋ ::t v ::comment ARABIC LETTER VE
|
150 |
+
::s ی ::t y ::comment ARABIC LETTER FARSI YEH
|
151 |
+
::s ۍ ::t y ::comment ARABIC LETTER YEH WITH TAIL
|
152 |
+
::s ێ ::t y ::comment ARABIC LETTER YEH WITH SMALL V
|
153 |
+
::s ۏ ::t w ::comment ARABIC LETTER WAW WITH DOT ABOVE
|
154 |
+
::s ې ::t e ::comment ARABIC LETTER E
|
155 |
+
::s ۑ ::t y ::comment ARABIC LETTER YEH WITH THREE DOTS BELOW
|
156 |
+
::s ے ::t y ::comment ARABIC LETTER YEH BARREE
|
157 |
+
::s ۓ ::t y ::comment ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
|
158 |
+
::s ۔ ::t . ::comment ARABIC FULL STOP
|
159 |
+
::s ە ::t ae ::comment ARABIC LETTER AE
|
160 |
+
::s ۮ ::t d ::comment ARABIC LETTER DAL WITH INVERTED V
|
161 |
+
::s ۯ ::t r ::comment ARABIC LETTER REH WITH INVERTED V
|
162 |
+
::s ۰ ::t 0 ::comment EXTENDED ARABIC-INDIC DIGIT ZERO
|
163 |
+
::s ۱ ::t 1 ::comment EXTENDED ARABIC-INDIC DIGIT ONE
|
164 |
+
::s ۲ ::t 2 ::comment EXTENDED ARABIC-INDIC DIGIT TWO
|
165 |
+
::s ۳ ::t 3 ::comment EXTENDED ARABIC-INDIC DIGIT THREE
|
166 |
+
::s ۴ ::t 4 ::comment EXTENDED ARABIC-INDIC DIGIT FOUR
|
167 |
+
::s ۵ ::t 5 ::comment EXTENDED ARABIC-INDIC DIGIT FIVE
|
168 |
+
::s ۶ ::t 6 ::comment EXTENDED ARABIC-INDIC DIGIT SIX
|
169 |
+
::s ۷ ::t 7 ::comment EXTENDED ARABIC-INDIC DIGIT SEVEN
|
170 |
+
::s ۸ ::t 8 ::comment EXTENDED ARABIC-INDIC DIGIT EIGHT
|
171 |
+
::s ۹ ::t 9 ::comment EXTENDED ARABIC-INDIC DIGIT NINE
|
172 |
+
::s ۺ ::t sh ::comment ARABIC LETTER SHEEN WITH DOT BELOW
|
173 |
+
::s ۻ ::t d ::comment ARABIC LETTER DAD WITH DOT BELOW
|
174 |
+
::s ۼ ::t gh ::comment ARABIC LETTER GHAIN WITH DOT BELOW
|
175 |
+
::s ۽ ::t & ::comment ARABIC SIGN SINDHI AMPERSAND
|
176 |
+
::s ﷲ ::t allah ::comment ARABIC LIGATURE ALLAH ISOLATED FORM
|
177 |
+
|
178 |
+
::s ::t ::comment ZERO WIDTH NON-JOINER
|
179 |
+
::s ::t ::comment ZERO WIDTH JOINER
|
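The rows of this table all follow one pattern: a sequence of `::key value` slots, with `::s` for the source character(s), `::t` for the romanized target, and optional slots such as `::comment`. As a minimal sketch of how such rows might be read (this is not uroman's own loader; `parse_rule` and `load_table` are hypothetical helper names introduced only for illustration), each row can be split into a dictionary keyed by slot name:

```python
import re

# One "::key value" slot; the value runs until the next " ::key" or end of line.
SLOT_RE = re.compile(r"::(\S+)(.*?)(?=\s::\S|$)")

def parse_rule(line):
    """Parse one table row into a dict of slot name -> value.

    Returns None for blank lines and '#' comment lines. Slots that carry
    no value (for example a bare '::t' or a flag) map to an empty string.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    return {m.group(1): m.group(2).strip() for m in SLOT_RE.finditer(line)}

def load_table(lines):
    """Collect every parsed row that has a source ('s') slot."""
    rules = []
    for line in lines:
        rule = parse_rule(line)
        if rule and "s" in rule:
            rules.append(rule)
    return rules

if __name__ == "__main__":
    sample = [
        "# rows copied from the table above",
        "::s ث ::t th ::comment ARABIC LETTER THEH",
        "::s خ ::t kh ::comment ARABIC LETTER KHAH",
    ]
    for rule in load_table(sample):
        print(rule["s"], "->", rule["t"])
```

Lines without any `::` slot (such as the `+` and line-number markers of the diff view) yield an empty dictionary and are skipped by `load_table`.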
uroman/data/romanization-table.txt
ADDED
@@ -0,0 +1,2193 @@
1 |
+
|
2 |
+
## European Latin extensions
|
3 |
+
# Vowels
|
4 |
+
::s Ä ::t Ae
|
5 |
+
::s Ö ::t Oe
|
6 |
+
::s Ü ::t Ue
|
7 |
+
::s Å ::t Aa
|
8 |
+
::s Æ ::t Ae
|
9 |
+
::s Ø ::t oe
|
10 |
+
::s Œ ::t Oe
|
11 |
+
::s ä ::t ae
|
12 |
+
::s ö ::t oe
|
13 |
+
::s ü ::t ue
|
14 |
+
::s å ::t aa
|
15 |
+
::s æ ::t ae
|
16 |
+
::s ø ::t oe
|
17 |
+
::s œ ::t oe
|
18 |
+
# Consonants
|
19 |
+
::s Ç ::t S
|
20 |
+
::s ç ::t s
|
21 |
+
::s Ç ::t Ch ::lcode tur
|
22 |
+
::s ç ::t ch ::lcode tur
|
23 |
+
::s Ş ::t Sh
|
24 |
+
::s ş ::t sh
|
25 |
+
::s Ș ::t Sh
|
26 |
+
::s ș ::t sh
|
27 |
+
::s ß ::t ss
|
28 |
+
::s Ț ::t Ts
|
29 |
+
::s ț ::t ts
|
30 |
+
|
31 |
+
# Digraphs
|
32 |
+
# ::s ʣ ::t dz
|
33 |
+
::s ʤ ::t dzh ::comment Latin small letter dezh digraph
|
34 |
+
# ::s ʥ ::t dz
|
35 |
+
# ::s ʦ ::t ts
|
36 |
+
::s ʧ ::t tsh ::comment Latin small letter tesh digraph
|
37 |
+
# ::s ʨ ::t tc
|
38 |
+
|
39 |
+
# Miscellaneous
|
40 |
+
::s ə ::t e
|
41 |
+
|
42 |
+
# English
|
43 |
+
::s chr ::t chr ::t-alt kr ::example chromosome, synchronize
|
44 |
+
::s Chr ::t Chr ::t-alt Kr ::example Christmas, Chrysler
|
45 |
+
::s eight ::t eight ::t-alt eit ::example eight, weight
|
46 |
+
::s Eight ::t Eight ::t-alt Eit ::example Eighteen
|
47 |
+
::s ight ::t ight ::t-alt ait ::example Knight
|
48 |
+
::s gh ::t gh ::t-alt f, ph, "" ::example laugh, daughter
|
49 |
+
::s high ::t high ::t-alt hai ::example highlight
|
50 |
+
::s High ::t High ::t-alt Hai ::example High School
|
51 |
+
::s Isle ::t Isle ::t-alt Ail ::use-only-for-whole-word ::example Isle
|
52 |
+
::s Island ::t Island ::t-alt Ailand ::use-only-for-whole-word ::example Island
|
53 |
+
::s kn ::t kn ::t-alt n ::use-only-at-start-of-word ::example knowledge
|
54 |
+
::s Kn ::t Kn ::t-alt N ::use-only-at-start-of-word ::example Knight
|
55 |
+
::s Mc ::t Mc ::t-alt Mac ::use-only-at-start-of-word ::example McNulty
|
56 |
+
::s mc ::t mc ::t-alt mac ::use-only-at-start-of-word
|
57 |
+
::s oo ::t oo ::t-alt u ::lcode eng ::example Brooklyn; Goose Bay
|
58 |
+
::s ph ::t ph ::t-alt f ::example alpha
|
59 |
+
::s Ph ::t Ph ::t-alt F ::example Philip
|
60 |
+
::s Thom ::t Thom ::t-alt Tom ::use-only-at-start-of-word ::example Thomas, Thompson
|
61 |
+
::s tion ::t tion ::t-alt shen ::example
|
62 |
+
::s Sean ::t Sean ::t-alt Shawn ::use-only-for-whole-word
|
63 |
+
::s ssion ::t ssion ::t-alt shen ::example Sessions
|
64 |
+
::s St ::t St ::t-alt Saint ::use-only-for-whole-word
|
65 |
+
::s St. ::t St. ::t-alt Saint ::use-only-for-whole-word
|
66 |
+
::s Wr ::t Wr ::t-alt R ::example Wren
|
67 |
+
::s wr ::t wr ::t-alt r ::example Cartwright
|
68 |
+
::s x ::t x ::t-alt ks ::example Mexico
|
69 |
+
::s x ::t x ::t-alt gz ::example example, anxiety, exhaust, exit
|
70 |
+
|
71 |
+
# French
|
72 |
+
::s â ::t a ::t-alt as ::example pâte/paste, pastry
|
73 |
+
::s ê ::t e ::t-alt es ::example fête/feast
|
74 |
+
::s î ::t i ::t-alt is ::example île/isle
|
75 |
+
::s ô ::t o ::t-alt os ::example côte/coast
|
76 |
+
::s û ::t u ::t-alt us ::example août/August
|
77 |
+
::s eaux ::t eaux ::t-alt o ::example Bordeaux
|
78 |
+
::s eau ::t eau ::t-alt o ::example Chateau
|
79 |
+
::s auld ::t auld ::t-alt o ::use-only-at-end-of-word ::example Renauld
|
80 |
+
::s ault ::t ault ::t-alt o ::use-only-at-end-of-word ::example Renault
|
81 |
+
::s oux ::t oux ::t-alt u
|
82 |
+
::s ois ::t ois ::t-alt oa ::use-only-at-end-of-word ::example Dubois
|
83 |
+
|
84 |
+
# German
|
85 |
+
::s Sch ::t Sch ::t-alt Sh
|
86 |
+
::s sch ::t sch ::t-alt sh
|
87 |
+
::s stein ::t stein ::t-alt shtain
|
88 |
+
::s dt ::t dt ::t-alt tt ::use-only-at-end-of-word ::example Schmidt
|
89 |
+
|
90 |
+
# Dutch
|
91 |
+
::s ij ::t ij ::t-alt ai
|
92 |
+
::s Ij ::t Ij ::t-alt Ai
|
93 |
+
|
94 |
+
# Latvian
|
95 |
+
::s Ā ::t A ::t-alt Aa ::lcode lav
|
96 |
+
::s ā ::t a ::t-alt aa ::lcode lav
|
97 |
+
::s Ē ::t E ::t-alt Ee ::lcode lav
|
98 |
+
::s ē ::t e ::t-alt ee ::lcode lav
|
99 |
+
::s Ī ::t I ::t-alt Ii ::lcode lav
|
100 |
+
::s ī ::t i ::t-alt ii ::lcode lav
|
101 |
+
::s Ū ::t U ::t-alt Uu ::lcode lav
|
102 |
+
::s ū ::t u ::t-alt uu ::lcode lav
|
103 |
+
::s Ģ ::t G ::t-alt Gj ::lcode lav
|
104 |
+
::s ģ ::t g ::t-alt gj ::lcode lav
|
105 |
+
::s Ķ ::t K ::t-alt Kj ::lcode lav
|
106 |
+
::s ķ ::t k ::t-alt kj ::lcode lav
|
107 |
+
::s Ļ ::t L ::t-alt Lj ::lcode lav
|
108 |
+
::s ļ ::t l ::t-alt lj ::lcode lav
|
109 |
+
::s Ņ ::t N ::t-alt Nj ::lcode lav
|
110 |
+
::s ņ ::t n ::t-alt nj ::lcode lav
|
111 |
+
::s C ::t C ::t-alt Ts ::lcode lav
|
112 |
+
::s c ::t c ::t-alt ts ::lcode lav
|
113 |
+
::s Č ::t C ::t-alt Tsh ::lcode lav
|
114 |
+
::s č ::t c ::t-alt tsh ::lcode lav
|
115 |
+
::s Š ::t Sh ::t-alt s ::lcode lav
|
116 |
+
::s š ::t sh ::t-alt s ::lcode lav
|
117 |
+
::s Ž ::t Z ::t-alt Zh ::lcode lav
|
118 |
+
::s ž ::t z ::t-alt zh ::lcode lav
|
119 |
+
|
120 |
+
# Lithuanian
|
121 |
+
::s C ::t C ::t-alt Ts ::lcode lit
|
122 |
+
::s c ::t c ::t-alt ts ::lcode lit
|
123 |
+
::s Č ::t C ::t-alt Tsh ::lcode lit
|
124 |
+
::s č ::t c ::t-alt tsh ::lcode lit
|
125 |
+
::s Š ::t Sh ::t-alt s ::lcode lit
|
126 |
+
::s š ::t sh ::t-alt s ::lcode lit
|
127 |
+
::s Ž ::t Z ::t-alt Zh ::lcode lit
|
128 |
+
::s ž ::t z ::t-alt zh ::lcode lit
|
129 |
+
|
130 |
+
# Greek letter mu used as similarly looking micro sign for units such as µm
|
131 |
+
::s μs ::t µs ::use-only-for-whole-word ::comment microsecond
|
132 |
+
::s μm ::t µm ::use-only-for-whole-word ::comment micrometer
|
133 |
+
::s μg ::t µg ::use-only-for-whole-word ::comment microgram
|
134 |
+
::s μl ::t µl ::use-only-for-whole-word ::comment microliter
|
135 |
+
::s μV ::t µV ::use-only-for-whole-word ::comment microvolt
|
136 |
+
::s μC ::t µC ::use-only-for-whole-word ::comment microcoulomb
|
137 |
+
::s μF ::t µF ::use-only-for-whole-word ::comment microfarad
|
138 |
+
::s μJ ::t µJ ::use-only-for-whole-word ::comment microjoule
|
139 |
+
::s μT ::t µT ::use-only-for-whole-word ::comment microtesla
|
140 |
+
::s μA ::t µA ::use-only-for-whole-word ::comment microampere
|
141 |
+
::s μW ::t µW ::use-only-for-whole-word ::comment microwatt
|
142 |
+
::s μK ::t µK ::use-only-for-whole-word ::comment microkelvin
|
143 |
+
::s μHz ::t µHz ::use-only-for-whole-word ::comment microhertz
|
144 |
+
::s μcd ::t µcd ::use-only-for-whole-word ::comment microcandela
|
145 |
+
::s μmol ::t µmol ::use-only-for-whole-word ::comment micromol
|
146 |
+
|
147 |
+
# International Greek (e.g. as used in chemical compounds)
|
148 |
+
::s β ::t b
|
149 |
+
::s Β ::t B
|
150 |
+
::s ϐ ::t b
|
151 |
+
|
152 |
+
# Ancient Greek
|
153 |
+
::s β ::t b ::lcode grc
|
154 |
+
::s Β ::t B ::lcode grc
|
155 |
+
::s γγ ::t ng ::lcode grc
|
156 |
+
::s γκ ::t nk ::lcode grc
|
157 |
+
::s γξ ::t nx ::lcode grc
|
158 |
+
::s γχ ::t nch ::lcode grc
|
159 |
+
::s ϱ ::t r ::lcode grc
|
160 |
+
|
161 |
+
# Pontic Greek
|
162 |
+
::s β ::t v ::t-alt b ::lcode pnt
|
163 |
+
::s Β ::t V ::t-alt B ::lcode pnt
|
164 |
+
::s ϐ ::t v ::t-alt b ::lcode pnt
|
165 |
+
|
166 |
+
# Modern Greek (generally the default)
|
167 |
+
::s β ::t v ::t-alt b ::lcode ell
|
168 |
+
::s Β ::t V ::t-alt B ::lcode ell
|
169 |
+
::s ϐ ::t v ::t-alt b ::lcode ell
|
170 |
+
::s Ι ::t I
|
171 |
+
::s ι ::t i
|
172 |
+
::s ί ::t i
|
173 |
+
::s ἶ ::t i
|
174 |
+
::s Υ ::t Y
|
175 |
+
::s υ ::t y
|
176 |
+
::s Ρ ::t R
|
177 |
+
::s ρ ::t r
|
178 |
+
::s ϱ ::t r
|
179 |
+
::s Χ ::t Ch ::t-alt Kh
|
180 |
+
::s χ ::t ch ::t-alt kh
|
181 |
+
::s φ ::t f ::t-alt ph
|
182 |
+
::s Φ ::t F ::t-alt Ph
|
183 |
+
::s Ντ ::t D
|
184 |
+
::s ντ ::t nd ::t-alt d, nt
|
185 |
+
# ::s ντζ ::t ntz
|
186 |
+
::s Μπ ::t B
|
187 |
+
::s μπ ::t b ::use-only-at-start-of-word
|
188 |
+
::s μπ ::t mb ::t-alt b, mp ::dont-use-at-start-of-word
|
189 |
+
::s λμπ ::t lb
|
190 |
+
::s νμπ ::t nb
|
191 |
+
::s ρμπ ::t rb
|
192 |
+
::s γγ ::t ng
|
193 |
+
::s Γκ ::t G
|
194 |
+
::s γκ ::t ng ::t-alt g ::dont-use-at-start-of-word
|
195 |
+
::s γκ ::t g ::use-only-at-start-of-word
|
196 |
+
::s γξ ::t nx ::lcode grc
|
197 |
+
::s γχ ::t nch ::lcode grc
|
198 |
+
::s ει ::t ei ::t-alt i
|
199 |
+
::s Ει ::t Ei ::t-alt I
|
200 |
+
::s ευ ::t eu ::t-alt ev ::comment donated by Constantine
|
201 |
+
::s Ευ ::t Eu ::t-alt Ev ::comment donated by Constantine
|
202 |
+
::s αυ ::t au ::t-alt av
|
203 |
+
::s Αυ ::t Au ::t-alt Av
|
204 |
+
::s ου ::t ou ::t-alt u
|
205 |
+
::s Ου ::t Ou ::t-alt U
|
206 |
+
::s ηυ ::t eu
|
207 |
+
::s Ηυ ::t Eu
|
208 |
+
::s υι ::t ui
|
209 |
+
::s Υι ::t Ui
|
210 |
+
::s ωυ ::t ou
|
211 |
+
::s Ωυ ::t Ou
|
212 |
+
::s ͺ ::t ::comment GREEK YPOGEGRAMMENI (U+037A)
|
213 |
+
::s ϒ ::t Y ::comment GREEK UPSILON WITH HOOK SYMBOL (U+03D2)
|
214 |
+
::s ϓ ::t Y ::comment GREEK UPSILON WITH ACUTE AND HOOK SYMBOL (U+03D3)
|
215 |
+
::s ϔ ::t Y ::comment GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL (U+03D4)
|
216 |
+
::s ι ::t ::comment GREEK PROSGEGRAMMENI (U+1FBE)
|
217 |
+
::s ᾿ ::t ::comment GREEK PSILI (U+1FBF)
|
218 |
+
::s ῀ ::t ::comment GREEK PERISPOMENI (U+1FC0)
|
219 |
+
::s ` ::t ::comment GREEK VARIA (U+1FEF)
|
220 |
+
::s ´ ::t ::comment GREEK OXIA (U+1FFD)
|
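Several rows above carry position flags: for example, `μπ` maps to `b` with `::use-only-at-start-of-word` and to `mb` with `::dont-use-at-start-of-word`. The sketch below shows one plausible reading of those two flags, inferred only from their names; how uroman itself resolves them is not shown in this diff. The function `pick_target` is a hypothetical name, and the rules are written as plain dictionaries for illustration.

```python
def pick_target(rules, word, pos):
    """Return the target of the first rule that matches word at pos
    and whose position flags allow that position, else None."""
    at_start = pos == 0
    for rule in rules:
        if not word.startswith(rule["s"], pos):
            continue
        # Assumed meaning: rule applies only when the match begins the word.
        if "use-only-at-start-of-word" in rule and not at_start:
            continue
        # Assumed meaning: rule is excluded when the match begins the word.
        if "dont-use-at-start-of-word" in rule and at_start:
            continue
        return rule["t"]
    return None

# The two rows for the same source string, copied from the table above.
MP_RULES = [
    {"s": "μπ", "t": "b", "use-only-at-start-of-word": ""},
    {"s": "μπ", "t": "mb", "t-alt": "b, mp", "dont-use-at-start-of-word": ""},
]

if __name__ == "__main__":
    # Illustrative unaccented strings, chosen only to place the
    # source sequence at the start and in the middle of a word.
    for word in ["μπαρ", "λαμπα"]:
        pos = word.index("μπ")
        print(word, pos, pick_target(MP_RULES, word, pos))
```

Keeping both rows in one list and filtering by flag means the order of the rows does not matter for this pair, since their conditions are mutually exclusive.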
221 |
+
|
222 |
+
# Coptic
|
223 |
+
::s ⲁ ::t a ::comment
|
224 |
+
::s ⲁ̀ ::t a ::comment
|
225 |
+
::s Ⲁ ::t A ::comment
|
226 |
+
::s Ⲁ̀ ::t A ::comment
|
227 |
+
::s ⲉ ::t e ::comment
|
228 |
+
::s ⲉ̀ ::t e ::comment
|
229 |
+
::s Ⲉ ::t e ::comment
|
230 |
+
::s ⲓ ::t i ::comment
|
231 |
+
::s ⲓ̀ ::t i ::comment
|
232 |
+
::s Ⲓ ::t i ::comment
|
233 |
+
::s ⲟ ::t o ::comment
|
234 |
+
::s ⲟⲩ ::t u ::comment
|
235 |
+
::s ⲟⲩⲁ ::t owe ::comment
|
236 |
+
::s Ⲟ ::t O ::comment
|
237 |
+
::s Ⲟⲩ ::t U ::comment
|
238 |
+
::s ⲱ ::t o ::comment
|
239 |
+
::s ⲱ̀ ::t o ::comment
|
240 |
+
::s Ⲱ ::t o ::comment
|
241 |
+
::s ⲏ ::t e ::comment
|
242 |
+
::s Ⲏ ::t E ::comment
|
243 |
+
::s Ⲩ ::t Y ::comment
|
244 |
+
::s Ⲩ̀ ::t U ::comment
|
245 |
+
::s ⲉⲩ ::t ev ::comment Ⲡⲓⲡ̀ⲛⲉⲩⲙⲁ
|
246 |
+
::s Ⲉⲩ ::t Ev ::comment Ⲡⲓⲡ̀ⲛⲉⲩⲙⲁ
|
247 |
+
::s ⲩ ::t u ::comment
|
248 |
+
::s Ⲩ ::t U ::comment
|
249 |
+
::s ⲃ ::t b ::t-alt v ::comment
|
250 |
+
::s ϣ ::t sh ::comment
|
251 |
+
::s Ϣ ::t Sh ::comment
|
252 |
+
::s ⲧ ::t t ::t-alt d ::comment
|
253 |
+
::s Ⲧ ::t T ::t-alt D ::comment
|
254 |
+
::s ϯ ::t ti ::comment
|
255 |
+
::s Ϯ ::t TI ::comment
|
256 |
+
::s ϫ ::t j ::comment
|
257 |
+
::s Ϫ ::t J ::comment
|
258 |
+
::s ϭ ::t ch ::comment tsh
|
259 |
+
|
260 |
+
# Glagolitic
|
261 |
+
::s Ⰿ ::t M ::comment GLAGOLITIC CAPITAL LETTER MYSLITE (U+2C0F)
|
262 |
+
::s Ⱞ ::t M ::comment GLAGOLITIC CAPITAL LETTER LATINATE MYSLITE (U+2C2E)
|
263 |
+
::s ⰿ ::t m ::comment GLAGOLITIC SMALL LETTER MYSLITE (U+2C3F)
|
264 |
+
::s ⱞ ::t m ::comment GLAGOLITIC SMALL LETTER LATINATE MYSLITE (U+2C5E)
|
265 |
+
::s 𞀏 ::t m ::comment COMBINING GLAGOLITIC LETTER MYSLITE (U+1E00F)
|
266 |
+
|
267 |
+
# Cyrillic
|
268 |
+
::s Г ::t G ::t-alt H ::comment Cyrillic capital ghe
|
269 |
+
::s г ::t g ::t-alt h ::comment Cyrillic small ghe
|
270 |
+
::s Е ::t E ::t-alt Ye ::comment Cyrillic capital ie
|
271 |
+
::s е ::t e ::t-alt ye ::comment Cyrillic small ie
|
272 |
+
::s Ё ::t E ::t-alt Yo
|
273 |
+
::s ё ::t e ::t-alt yo
|
274 |
+
::s Х ::t Kh ::t-alt Ch, H ::comment Cyrillic capital ha
|
275 |
+
::s х ::t kh ::t-alt ch, h ::comment Cyrillic small ha
|
276 |
+
::s Щ ::t Shch ::t-alt Sh
|
277 |
+
::s щ ::t shch ::t-alt sh
|
278 |
+
::s Ъ ::t ::comment Cyrillic capital hard sign
|
279 |
+
::s ъ ::t ::comment Cyrillic small hard sign
|
280 |
+
::s ᲆ ::t ::comment CYRILLIC SMALL LETTER TALL HARD SIGN
|
281 |
+
::s Ы ::t Y ::comment Cyrillic capital yeru
|
282 |
+
::s ы ::t y ::comment Cyrillic small yeru
|
283 |
+
::s Ь ::t ::comment Cyrillic capital soft sign
|
284 |
+
::s ь ::t ::comment Cyrillic small soft sign
|
285 |
+
::s Ж ::t Zh ::comment Cyrillic capital letter zhe
|
286 |
+
::s Ш ::t Sh ::comment Cyrillic capital letter sha
|
287 |
+
::s Ч ::t Ch ::comment Cyrillic capital letter che
|
288 |
+
::s Џ ::t Dzh ::comment Cyrillic capital letter dzhe
|
289 |
+
::s Є ::t Ie ::comment Cyrillic capital letter ie
|
290 |
+
::s Ю ::t Yu ::comment Cyrillic capital letter yu
|
291 |
+
::s Я ::t Ya ::comment Cyrillic capital letter ya
|
292 |
+
|
293 |
+
::s Ҥ ::t Ng ::comment Cyrillic capital ligature EN GHE
|
294 |
+
::s ҥ ::t ng ::comment Cyrillic small ligature EN GHE
|
295 |
+
::s Ә ::t e ::comment Cyrillic capital schwa
|
296 |
+
::s ә ::t e ::comment Cyrillic small schwa
|
297 |
+
::s Ӏ ::t ' ::comment Cyrillic palochka
|
298 |
+
::s Ҵ ::t TS ::comment Cyrillic capital ligature te tse, used in Abkhasian
|
299 |
+
::s ҵ ::t ts ::comment Cyrillic small ligature te tse, used in Abkhasian
|
300 |
+
::s Ӕ ::t AE ::comment Cyrillic capital ligature a ie
|
301 |
+
::s ӕ ::t ae ::comment Cyrillic small ligature a ie
|
302 |
+
::s ʹ ::t "'" ::comment modifier letter prime
|
303 |
+
::s ʺ ::t '"' ::comment modifier letter double prime
|
304 |
+
::s ий ::t iy ::dont-use-at-end-of-word
|
305 |
+
::s ий ::t y ::use-only-at-end-of-word
|
306 |
+
|
307 |
+
::s ᲈ ::t u ::comment CYRILLIC SMALL LETTER UNBLENDED UK ligature ou
|
308 |
+
|
309 |
+
# Russian
|
310 |
+
::s Г ::t G ::t-alt _NONE_ ::lcode rus ::comment Cyrillic capital letter ghe
|
311 |
+
::s г ::t g ::t-alt _NONE_ ::lcode rus ::comment Cyrillic small letter ghe
|
312 |
+
::s Й ::t Y ::t-alt I, J ::lcode rus ::comment Cyrillic capital letter short i
|
313 |
+
::s й ::t y ::t-alt i, j ::lcode rus ::comment Cyrillic small letter short i
|
314 |
+
::s Ц ::t Ts ::t-alt C ::lcode rus ::comment Cyrillic capital letter tse
|
315 |
+
::s ц ::t ts ::t-alt c ::lcode rus ::comment Cyrillic small letter tse
|
316 |
+
::s Щ ::t Shch ::t-alt _NONE_ ::lcode rus ::comment Cyrillic capital letter shcha
|
317 |
+
::s щ ::t shch ::t-alt _NONE_ ::lcode rus ::comment Cyrillic small letter shcha
|
318 |
+
::s Ѣ ::t E ::t-alt Ie ::lcode rus ::comment archaic Cyrillic capital letter yat
|
319 |
+
::s ѣ ::t e ::t-alt ie ::lcode rus ::comment archaic Cyrillic small letter yat
|
320 |
+
::s Е ::t E ::t-alt Ye ::dont-use-at-start-of-word ::lcode rus ::comment Cyrillic capital ie
|
321 |
+
::s Е ::t Ye ::t-alt E ::use-only-at-start-of-word ::lcode rus
|
322 |
+
::s е ::t e ::t-alt ye ::dont-use-at-start-of-word ::lcode rus ::comment Cyrillic small ie
|
323 |
+
::s е ::t ye ::t-alt e ::use-only-at-start-of-word ::lcode rus
|
324 |
+
::s ае ::t aye ::lcode rus
|
325 |
+
::s а́е ::t aye ::lcode rus
|
326 |
+
::s ее ::t eye ::lcode rus
|
327 |
+
::s е́е ::t eye ::lcode rus
|
328 |
+
::s ие ::t iye ::lcode rus
|
329 |
+
::s и́е ::t iye ::lcode rus
|
330 |
+
::s ое ::t oye ::lcode rus
|
331 |
+
::s о́е ::t oye ::lcode rus
|
332 |
+
::s уе ::t uye ::lcode rus
|
333 |
+
::s у́е ::t uye ::lcode rus
|
334 |
+
::s ье ::t ye ::lcode rus
|
335 |
+
::s ъе ::t ye ::lcode rus
|
336 |
+
::s Ё ::t Yo ::t-alt E ::lcode rus ::comment Cyrillic capital io
|
337 |
+
::s ё ::t yo ::t-alt e ::lcode rus
|
338 |
+
::s аё ::t ayo ::lcode rus
|
339 |
+
::s а́ё ::t ayo ::lcode rus
|
340 |
+
::s её ::t eyo ::lcode rus
|
341 |
+
::s е́ё ::t eyo ::lcode rus
|
342 |
+
::s иё ::t iyo ::lcode rus
|
343 |
+
::s и́ё ::t iyo ::lcode rus
|
344 |
+
::s оё ::t oyo ::lcode rus
|
345 |
+
::s о́ё ::t oyo ::lcode rus
|
346 |
+
::s уё ::t uyo ::lcode rus
|
347 |
+
::s у́ё ::t uyo ::lcode rus
|
348 |
+
::s ьё ::t yo ::lcode rus
|
349 |
+
::s ъё ::t yo ::lcode rus
|
350 |
+
::s ий ::t y ::lcode rus
|
351 |
+
|
352 |
+
# Ukrainian
|
353 |
+
::s Г ::t H ::lcode ukr ::comment Ukrainian capital letter he
|
354 |
+
::s г ::t h ::lcode ukr ::comment Ukrainian small letter he
|
355 |
+
::s Ґ ::t G ::lcode ukr ::comment Ukrainian capital letter ghe
|
356 |
+
::s ґ ::t g ::lcode ukr ::comment Ukrainian small letter ghe
|
357 |
+
::s Е ::t E ::t-alt _NONE_ ::lcode ukr ::comment Cyrillic capital ie
|
358 |
+
::s е ::t e ::t-alt _NONE_ ::lcode ukr ::comment Cyrillic small ie
|
359 |
+
::s И ::t Y ::lcode ukr ::comment Ukrainian capital letter i
|
360 |
+
::s и ::t y ::lcode ukr ::comment Ukrainian small letter i
|
361 |
+
::s Ї ::t Yi ::lcode ukr ::comment Ukrainian capital letter yi
|
362 |
+
::s ї ::t yi ::lcode ukr ::comment Ukrainian small letter yi
|
363 |
+
::s Й ::t I ::t-alt Y ::lcode ukr ::comment Cyrillic capital letter short i
|
364 |
+
::s й ::t i ::t-alt y ::lcode ukr ::comment Cyrillic small letter short i
|
365 |
+
::s Ц ::t Ts ::t-alt C ::lcode ukr ::comment Cyrillic capital letter tse
|
366 |
+
::s ц ::t ts ::t-alt c ::lcode ukr ::comment Cyrillic small letter tse
|
367 |
+
::s Щ ::t Shch ::t-alt _NONE_ ::lcode ukr ::comment Cyrillic capital letter shcha
|
368 |
+
::s щ ::t shch ::t-alt _NONE_ ::lcode ukr ::comment Cyrillic small letter shcha
|
369 |
+
::s Ѣ ::t E ::t-alt Ie ::lcode ukr ::comment archaic Cyrillic capital letter yat
|
370 |
+
::s ѣ ::t e ::t-alt ie ::lcode ukr ::comment archaic Cyrillic small letter yat
|
371 |
+
::s Иї ::t Yi ::lcode ukr ::comment avoid Yyi
|
372 |
+
::s иї ::t yi ::lcode ukr ::comment avoid yyi
|
373 |
+
::s ій ::t iy ::lcode ukr
|
374 |
+
::s і́й ::t iy ::lcode ukr
|
375 |
+
::s ий ::t yi ::lcode ukr
|
376 |
+
::s ий ::t y ::lcode ukr ::use-only-at-end-of-word ::comment Зеленський/Zelensky
|
377 |
+
::s ий ::t yi ::lcode ukr ::dont-use-at-end-of-word
|
378 |
+
|
379 |
+
# Belarusian
|
380 |
+
::s Г ::t H ::t-alt G ::lcode bel ::comment capital letter he
|
381 |
+
::s г ::t h ::t-alt g ::lcode bel ::comment small letter he
|
382 |
+
::s Ґ ::t G ::lcode bel ::comment capital letter ghe
|
383 |
+
::s ґ ::t g ::lcode bel ::comment small letter ghe
|
384 |
+
::s Й ::t J ::t-alt Y ::lcode bel ::comment Cyrillic capital letter short i
|
385 |
+
::s й ::t j ::t-alt y ::lcode bel ::comment Cyrillic small letter short i
|
386 |
+
::s Ц ::t Ts ::t-alt C ::lcode bel ::comment Cyrillic capital letter tse
|
387 |
+
::s ц ::t ts ::t-alt c ::lcode bel ::comment Cyrillic small letter tse
|
388 |
+
::s Щ ::t Shch ::t-alt _NONE_ ::lcode bel ::comment Cyrillic capital letter shcha
|
389 |
+
::s щ ::t shch ::t-alt _NONE_ ::lcode bel ::comment Cyrillic small letter shcha
|
390 |
+
::s Ѣ ::t E ::t-alt Ie ::lcode bel ::comment archaic Cyrillic capital letter yat
|
391 |
+
::s ѣ ::t e ::t-alt ie ::lcode bel ::comment archaic Cyrillic small letter yat
|
392 |
+
::s 'я ::t ya ::lcode bel
|
393 |
+
::s ’я ::t ya ::lcode bel
|
394 |
+
::s 'і ::t i ::lcode bel
|
395 |
+
::s ’і ::t i ::lcode bel
|
396 |
+
::s Ё ::t Yo ::t-alt E ::lcode bel ::comment Cyrillic capital io
|
397 |
+
::s ё ::t yo ::t-alt e ::lcode bel
|
398 |
+
::s ёў ::t you ::lcode bel
|
399 |
+
::s ий ::t y ::lcode bel
|
400 |
+
|
401 |
+
# Serbian
|
402 |
+
::s Г ::t G ::t-alt _NONE_ ::lcode srp ::comment Cyrillic capital ghe
|
403 |
+
::s г ::t g ::t-alt _NONE_ ::lcode srp ::comment Cyrillic small ghe
|
404 |
+
::s Х ::t H ::t-alt _NONE_ ::lcode srp ::comment Cyrillic capital ha
|
405 |
+
::s х ::t h ::t-alt _NONE_ ::lcode srp ::comment Cyrillic small ha
|
406 |
+
::s Е ::t E ::t-alt _NONE_ ::lcode srp ::comment Cyrillic capital ie
|
407 |
+
::s е ::t e ::t-alt _NONE_ ::lcode srp ::comment Cyrillic small ie
|
408 |
+
::s Ђ ::t Dj ::lcode srp ::comment Cyrillic capital dje
|
409 |
+
::s Љ ::t Lj ::lcode srp ::comment Cyrillic capital lje
|
410 |
+
::s Ћ ::t Tsh ::lcode srp ::comment Cyrillic capital tshe
|
411 |
+
::s Ж ::t Zh ::lcode srp ::comment Cyrillic capital zhe
|
412 |
+
::s Ц ::t C ::t-alt Ts ::lcode srp ::comment Cyrillic capital tse
|
413 |
+
::s ц ::t c ::t-alt ts ::lcode srp ::comment Cyrillic small tse
|
414 |
+
::s Đ ::t Dj ::lcode srp ::comment Latin capital d with stroke
|
415 |
+
::s đ ::t dj ::lcode srp ::comment Latin small d with stroke
|
416 |
+
::s Ž ::t Zh ::lcode srp ::comment Latin capital z with caron
|
417 |
+
::s ž ::t zh ::lcode srp ::comment Latin small z with caron
|
418 |
+
::s Ć ::t Tsh ::lcode srp ::comment Latin capital c with acute
|
419 |
+
::s ć ::t tsh ::lcode srp ::comment Latin small c with acute
|
420 |
+
::s Č ::t Ch ::lcode srp ::comment Latin capital c with caron
|
421 |
+
::s č ::t ch ::lcode srp ::comment Latin small c with caron
|
422 |
+
::s Š ::t Sh ::lcode srp ::comment Latin capital s with caron
|
423 |
+
::s š ::t sh ::lcode srp ::comment Latin small s with caron
|
424 |
+
|
425 |
+
::s Г ::t G ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital ghe
|
426 |
+
::s г ::t g ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small ghe
|
427 |
+
::s Х ::t H ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital ha
|
428 |
+
::s х ::t h ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small ha
|
429 |
+
::s Ц ::t C ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital letter tse
|
430 |
+
::s ц ::t c ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small letter tse
|
431 |
+
::s Ч ::t C ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital letter che
|
432 |
+
::s ч ::t c ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small letter che
|
433 |
+
::s Џ ::t Dz ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital letter dzhe
|
434 |
+
::s џ ::t dz ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small letter dzhe
|
435 |
+
::s Е ::t E ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital ie
|
436 |
+
::s е ::t e ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small ie
|
437 |
+
::s Ш ::t S ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital sha
|
438 |
+
::s ш ::t s ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small sha
|
439 |
+
::s Ж ::t Z ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital zhe
|
440 |
+
::s ж ::t z ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small zhe
|
441 |
+
::s Љ ::t Lj ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital lje
|
442 |
+
::s љ ::t lj ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small lje
|
443 |
+
::s Њ ::t Nj ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital nje
|
444 |
+
::s њ ::t nj ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small nje
|
445 |
+
::s Ђ ::t Dj ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital dje
|
446 |
+
::s ђ ::t dj ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small dje
|
447 |
+
::s Ћ ::t C ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic capital tshe
|
448 |
+
::s ћ ::t c ::t-alt _NONE_ ::lcode srp2 ::comment Cyrillic small tshe
|
449 |
+
::s Đ ::t Dj ::lcode srp2 ::comment Latin capital d with stroke
|
450 |
+
::s đ ::t dj ::lcode srp2 ::comment Latin small d with stroke
|
451 |
+
|
452 |
+
# Montenegrin extension (controversial)
|
453 |
+
::s З́ ::t Zj ::lcode srp ::comment Cyrillic capital zje
|
454 |
+
::s з́ ::t zj ::lcode srp ::comment Cyrillic small zje
|
455 |
+
::s С́ ::t Sj ::lcode srp ::comment Cyrillic capital sje
|
456 |
+
::s с́ ::t sj ::lcode srp ::comment Cyrillic small sje
|
457 |
+
::s Ź ::t Zj ::lcode srp ::comment Latin capital z with acute
|
458 |
+
::s ź ::t zj ::lcode srp ::comment Latin small z with acute
|
459 |
+
::s Ś ::t Sj ::lcode srp ::comment Latin capital s with acute
|
460 |
+
::s ś ::t sj ::lcode srp ::comment Latin small s with acute
|
461 |
+
|
462 |
+
::s З́ ::t Z ::lcode srp2 ::comment Cyrillic capital zje
|
463 |
+
::s з́ ::t z ::lcode srp2 ::comment Cyrillic small zje
|
464 |
+
::s С́ ::t S ::lcode srp2 ::comment Cyrillic capital sje
|
465 |
+
::s с́ ::t s ::lcode srp2 ::comment Cyrillic small sje
|
466 |
+
::s Ź ::t Z ::lcode srp2 ::comment Latin capital z with acute
|
467 |
+
::s ź ::t z ::lcode srp2 ::comment Latin small z with acute
|
468 |
+
::s Ś ::t S ::lcode srp2 ::comment Latin capital s with acute
|
469 |
+
::s ś ::t s ::lcode srp2 ::comment Latin small s with acute
|
470 |
+
|
471 |
+
# Bulgarian
|
472 |
+
::s Г ::t G ::t-alt _NONE_ ::lcode bul ::comment Cyrillic capital ghe
|
473 |
+
::s г ::t g ::t-alt _NONE_ ::lcode bul ::comment Cyrillic small ghe
|
474 |
+
::s Х ::t H ::t-alt Kh ::lcode bul ::comment Cyrillic capital letter ha
|
475 |
+
::s х ::t h ::t-alt kh ::lcode bul ::comment Cyrillic small letter ha
|
476 |
+
::s Ц ::t C ::t-alt Ts ::lcode bul ::comment Cyrillic capital letter tse
|
477 |
+
::s ц ::t c ::t-alt ts ::lcode bul ::comment Cyrillic small letter tse
|
478 |
+
::s Щ ::t Sht ::t-alt _NONE_ ::lcode bul ::comment Cyrillic capital letter shcha
|
479 |
+
::s щ ::t sht ::t-alt _NONE_ ::lcode bul ::comment Cyrillic small letter shcha
|
480 |
+
::s Е ::t E ::t-alt _NONE_ ::lcode bul ::comment Cyrillic capital ie
|
481 |
+
::s е ::t e ::t-alt _NONE_ ::lcode bul ::comment Cyrillic small ie
|
482 |
+
::s Ж ::t Zh ::t-alt Z, J ::lcode bul ::comment Cyrillic capital zhe
|
483 |
+
::s ж ::t zh ::t-alt z, j ::lcode bul ::comment Cyrillic small zhe
|
484 |
+
::s Й ::t I ::t-alt Y, J ::lcode bul ::comment Cyrillic capital letter short i
|
485 |
+
::s й ::t i ::t-alt y, j ::lcode bul ::comment Cyrillic small letter short i
|
486 |
+
::s Ю ::t Yu ::t-alt U, Ju, Iu ::lcode bul ::comment Cyrillic capital letter yu
|
487 |
+
::s ю ::t yu ::t-alt u, ju, iu ::lcode bul ::comment Cyrillic small letter yu
|
488 |
+
::s Ъ ::t U ::t-alt A ::lcode bul ::comment Cyrillic capital letter hard sign
|
489 |
+
::s ъ ::t u ::t-alt a ::lcode bul ::comment Cyrillic small letter hard sign
|
490 |
+
::s Ѣ ::t E ::t-alt Ie ::lcode bul ::comment archaic Cyrillic capital letter yat
|
491 |
+
::s ѣ ::t e ::t-alt ie ::lcode bul ::comment archaic Cyrillic small letter yat
|
492 |
+
::s Ѫ ::t U ::lcode bul ::comment archaic Cyrillic capital letter yus
|
493 |
+
::s ѫ ::t u ::lcode bul ::comment archaic Cyrillic small letter yus
|
494 |
+
::s ИЯ ::t IA ::lcode bul ::use-only-at-end-of-word
|
495 |
+
::s ия ::t ia ::lcode bul ::use-only-at-end-of-word
|
496 |
+
|
497 |
+
::s Ž ::t Zh ::lcode bul ::comment Latin capital z with caron
|
498 |
+
::s ž ::t zh ::lcode bul ::comment Latin small z with caron
|
499 |
+
::s Č ::t Ch ::lcode bul ::comment Latin capital c with caron
|
500 |
+
::s č ::t ch ::lcode bul ::comment Latin small c with caron
|
501 |
+
::s Š ::t Sh ::lcode bul ::comment Latin capital s with caron
|
502 |
+
::s š ::t sh ::lcode bul ::comment Latin small s with caron
|
503 |
+
::s Ŝ ::t Sht ::lcode bul ::comment Latin capital s with circumflex
|
504 |
+
::s ŝ ::t sht ::lcode bul ::comment Latin small s with circumflex
|
505 |
+
::s Û ::t Yu ::t-alt U, Ju, Iu ::lcode bul ::comment Latin capital u with circumflex
|
506 |
+
::s û ::t yu ::t-alt u, ju, iu ::lcode bul ::comment Latin small u with circumflex
|
507 |
+
::s  ::t Ya ::t-alt _NONE_ ::lcode bul ::comment Latin capital a with circumflex
|
508 |
+
::s â ::t ya ::t-alt _NONE_ ::lcode bul ::comment Latin small a with circumflex
|
509 |
+
::s Ŭ ::t U ::t-alt A ::lcode bul ::comment Latin capital u with breve (for hard sign)
|
510 |
+
::s ŭ ::t u ::t-alt a ::lcode bul ::comment Latin small u with breve (for hard sign)
|
511 |
+
::s Ǎ ::t U ::t-alt A ::lcode bul ::comment Latin capital a with caron (for hard sign)
|
512 |
+
::s ǎ ::t u ::t-alt a ::lcode bul ::comment Latin small a with caron (for hard sign)
|
513 |
+
|
514 |
+
# Macedonian
|
515 |
+
::s Г ::t G ::t-alt _NONE_ ::lcode mkd ::comment Cyrillic capital ghe
|
516 |
+
::s г ::t g ::t-alt _NONE_ ::lcode mkd ::comment Cyrillic small ghe
|
517 |
+
::s Х ::t H ::lcode mkd ::comment Cyrillic capital ha
|
518 |
+
::s х ::t h ::lcode mkd ::comment Cyrillic small ha
|
519 |
+
::s Ц ::t C ::t-alt Ts ::lcode mkd ::comment Cyrillic capital letter tse
|
520 |
+
::s ц ::t c ::t-alt ts ::lcode mkd ::comment Cyrillic small letter tse
|
521 |
+
::s Џ ::t Dzh ::t-alt Dj, Dz ::lcode mkd ::comment Cyrillic capital letter dzhe
|
522 |
+
::s џ ::t dzh ::t-alt dj, dz ::lcode mkd ::comment Cyrillic small letter dzhe
|
523 |
+
::s Е ::t E ::t-alt _NONE_ ::lcode mkd ::comment Cyrillic capital ie
|
524 |
+
::s е ::t e ::t-alt _NONE_ ::lcode mkd ::comment Cyrillic small ie
|
525 |
+
::s Ž ::t Zh ::lcode mkd ::comment Latin capital z with caron
|
526 |
+
::s ž ::t zh ::lcode mkd ::comment Latin small z with caron
|
527 |
+
::s Č ::t Ch ::lcode mkd ::comment Latin capital c with caron
|
528 |
+
::s č ::t ch ::lcode mkd ::comment Latin small c with caron
|
529 |
+
::s Š ::t Sh ::lcode mkd ::comment Latin capital s with caron
|
530 |
+
::s š ::t sh ::lcode mkd ::comment Latin small s with caron
|
531 |
+
::s Ǵ ::t Gj ::lcode mkd
|
532 |
+
::s ǵ ::t gj ::lcode mkd
|
533 |
+
::s Đ ::t Gj ::lcode mkd
|
534 |
+
::s đ ::t gj ::lcode mkd
|
535 |
+
::s Ẑ ::t Dz ::lcode mkd
|
536 |
+
::s ẑ ::t dz ::lcode mkd
|
537 |
+
::s J̌ ::t J ::lcode mkd
|
538 |
+
::s ǰ ::t j ::lcode mkd
|
539 |
+
::s L̂ ::t Lj ::lcode mkd
|
540 |
+
::s l̂ ::t lj ::lcode mkd
|
541 |
+
::s N̂ ::t Nj ::lcode mkd
|
542 |
+
::s n̂ ::t nj ::lcode mkd
|
543 |
+
::s Ḱ ::t Kj ::lcode mkd
|
544 |
+
::s ḱ ::t kj ::lcode mkd
|
545 |
+
::s Ć ::t Kj ::lcode mkd
|
546 |
+
::s ć ::t kj ::lcode mkd
|
547 |
+
::s D̂ ::t Dzh ::lcode mkd
|
548 |
+
::s d̂ ::t dzh ::lcode mkd
|
549 |
+
|
550 |
+
::s Г ::t G ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital ghe
|
551 |
+
::s г ::t g ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small ghe
|
552 |
+
::s Х ::t H ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital ha
|
553 |
+
::s х ::t h ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small ha
|
554 |
+
::s Ц ::t C ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital letter tse
|
555 |
+
::s ц ::t c ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small letter tse
|
556 |
+
::s Ч ::t C ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital letter che
|
557 |
+
::s ч ::t c ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small letter che
|
558 |
+
::s Џ ::t D ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital letter dzhe
|
559 |
+
::s џ ::t d ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small letter dzhe
|
560 |
+
::s Е ::t E ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital ie
|
561 |
+
::s е ::t e ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small ie
|
562 |
+
::s Ш ::t S ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital sha
|
563 |
+
::s ш ::t s ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small sha
|
564 |
+
::s Ѓ ::t G ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital gje
|
565 |
+
::s ѓ ::t g ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small gje
|
566 |
+
::s Ж ::t Z ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital zhe
|
567 |
+
::s ж ::t z ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small zhe
|
568 |
+
::s Ѕ ::t Z ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital dze
|
569 |
+
::s ѕ ::t z ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small dze
|
570 |
+
::s Ќ ::t K ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital kje
|
571 |
+
::s ќ ::t k ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small kje
|
572 |
+
::s Љ ::t L ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital lje
|
573 |
+
::s љ ::t l ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small lje
|
574 |
+
::s Њ ::t N ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic capital nje
|
575 |
+
::s њ ::t n ::t-alt _NONE_ ::lcode mkd2 ::comment Cyrillic small nje
|
576 |
+
::s Ž ::t Z ::lcode mkd2 ::comment Latin capital z with caron
|
577 |
+
::s ž ::t z ::lcode mkd2 ::comment Latin small z with caron
|
578 |
+
::s Č ::t C ::lcode mkd2 ::comment Latin capital c with caron
|
579 |
+
::s č ::t c ::lcode mkd2 ::comment Latin small c with caron
|
580 |
+
::s Š ::t S ::lcode mkd2 ::comment Latin capital s with caron
|
581 |
+
::s š ::t s ::lcode mkd2 ::comment Latin small s with caron
|
582 |
+
::s Ǵ ::t G ::lcode mkd2
|
583 |
+
::s ǵ ::t g ::lcode mkd2
|
584 |
+
::s Đ ::t G ::lcode mkd2
|
585 |
+
::s đ ::t g ::lcode mkd2
|
586 |
+
::s Ẑ ::t D ::lcode mkd2
|
587 |
+
::s ẑ ::t d ::lcode mkd2
|
588 |
+
::s J̌ ::t J ::lcode mkd2
|
589 |
+
::s ǰ ::t j ::lcode mkd2
|
590 |
+
::s L̂ ::t L ::lcode mkd2
|
591 |
+
::s l̂ ::t l ::lcode mkd2
|
592 |
+
::s N̂ ::t N ::lcode mkd2
|
593 |
+
::s n̂ ::t n ::lcode mkd2
|
594 |
+
::s Ḱ ::t K ::lcode mkd2
|
595 |
+
::s ḱ ::t k ::lcode mkd2
|
596 |
+
::s Ć ::t K ::lcode mkd2
|
597 |
+
::s ć ::t k ::lcode mkd2
|
598 |
+
::s D̂ ::t D ::lcode mkd2
|
599 |
+
::s d̂ ::t d ::lcode mkd2
|
600 |
+
|
601 |
+
# Kazakh
|
602 |
+
::s Ә ::t A ::lcode kaz
|
603 |
+
::s ә ::t a ::lcode kaz
|
604 |
+
::s Г ::t G ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic capital ghe
|
605 |
+
::s г ::t g ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic small ghe
|
606 |
+
::s Ғ ::t G ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic capital ghe with stroke
|
607 |
+
::s ғ ::t g ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic small ghe with stroke
|
608 |
+
::s Е ::t E ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic capital ie
|
609 |
+
::s е ::t e ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic small ie
|
610 |
+
::s Ё ::t Yo ::t-alt _NONE_ ::lcode kaz
|
611 |
+
::s ё ::t yo ::t-alt _NONE_ ::lcode kaz
|
612 |
+
::s Х ::t H ::t-alt X ::lcode kaz ::comment Cyrillic capital ha
|
613 |
+
::s х ::t h ::t-alt x ::lcode kaz ::comment Cyrillic small ha
|
614 |
+
::s Һ ::t H ::lcode kaz ::comment Cyrillic capital shha
|
615 |
+
::s һ ::t h ::lcode kaz ::comment Cyrillic small shha
|
616 |
+
::s Қ ::t Q ::t-alt K ::lcode kaz
|
617 |
+
::s қ ::t q ::t-alt k ::lcode kaz
|
618 |
+
::s Ц ::t Ts ::t-alt C ::lcode kaz ::comment Cyrillic capital letter tse
|
619 |
+
::s ц ::t ts ::t-alt c ::lcode kaz ::comment Cyrillic small letter tse
|
620 |
+
::s Щ ::t Sh ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic capital letter shcha
|
621 |
+
::s щ ::t sh ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic small letter shcha
|
622 |
+
::s У ::t U ::t-alt Y ::lcode kaz
|
623 |
+
::s у ::t u ::t-alt y ::lcode kaz
|
624 |
+
::s уы ::t wy ::lcode kaz
|
625 |
+
::s Ж ::t J ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic capital zhe
|
626 |
+
::s ж ::t j ::t-alt _NONE_ ::lcode kaz ::comment Cyrillic small zhe
|
627 |
+
::s Ю ::t Yw ::t-alt Yuw, Yiw ::lcode kaz ::comment Cyrillic capital letter yu
|
628 |
+
::s ю ::t yw ::t-alt yuw, yiw ::lcode kaz ::comment Cyrillic small letter yu
|
629 |
+
|
630 |
+
# Kyrgyz
|
631 |
+
::s Г ::t G ::t-alt _NONE_ ::lcode kir ::comment Cyrillic capital ghe
|
632 |
+
::s г ::t g ::t-alt _NONE_ ::lcode kir ::comment Cyrillic small ghe
|
633 |
+
::s Е ::t E ::t-alt Ye ::lcode kir ::comment Cyrillic capital ie
|
634 |
+
::s е ::t e ::t-alt ye ::lcode kir ::comment Cyrillic small ie
|
635 |
+
::s Ё ::t Yo ::t-alt _NONE_ ::lcode kir
|
636 |
+
::s ё ::t yo ::t-alt _NONE_ ::lcode kir
|
637 |
+
::s Х ::t Kh ::t-alt X, H ::lcode kir ::comment Cyrillic capital ha
|
638 |
+
::s х ::t kh ::t-alt x, h ::lcode kir ::comment Cyrillic small ha
|
639 |
+
::s Ж ::t Zh ::t-alt J ::lcode kir ::comment Cyrillic capital zhe
|
640 |
+
::s ж ::t zh ::t-alt j ::lcode kir ::comment Cyrillic small zhe
|
641 |
+
::s Й ::t Y ::t-alt I ::lcode kir ::comment Cyrillic capital letter short i
|
642 |
+
::s й ::t y ::t-alt i ::lcode kir ::comment Cyrillic small letter short i
|
643 |
+
::s Ц ::t Ts ::t-alt C ::lcode kir ::comment Cyrillic capital letter tse
|
644 |
+
::s ц ::t ts ::t-alt c ::lcode kir ::comment Cyrillic small letter tse
|
645 |
+
::s Ң ::t Ng ::lcode kir
|
646 |
+
::s ң ::t ng ::lcode kir
|
647 |
+
::s Ө ::t O ::t-alt Oe ::lcode kir
|
648 |
+
::s ө ::t o ::t-alt oe ::lcode kir
|
649 |
+
::s Ү ::t U ::t-alt Y, Ue ::lcode kir
|
650 |
+
::s ү ::t u ::t-alt y, ue ::lcode kir
|
651 |
+
::s Ы ::t I ::t-alt Y ::lcode kir
|
652 |
+
::s ы ::t i ::t-alt y ::lcode kir
|
653 |
+
::s йы ::t yi ::lcode kir
|
654 |
+
::s ый ::t iy ::lcode kir
|
655 |
+
|
656 |
+
# Ossetian
|
657 |
+
::s ийы ::t iy ::lcode oss
|
658 |
+
|
659 |
+
# Gothic
|
660 |
+
::s 𐌴 ::t e ::comment Gothic letter aihvus
|
661 |
+
::s 𐌹 ::t i ::comment Gothic letter eis
|
662 |
+
::s 𐍇 ::t x ::comment Gothic letter iggws
|
663 |
+
|
664 |
+
# Runic
|
665 |
+
::s ᛫ ::t " " ::comment Runic single punctuation, used as word separator
|
666 |
+
::s ᛬ ::t . ::comment Runic multiple punctuation, used as sentence separator
|
667 |
+
|
668 |
+
# Ogham
|
669 |
+
::s ᚁ ::t b ::comment Ogham letter Beith
|
670 |
+
::s ᚂ ::t l ::comment Ogham letter Luis
|
671 |
+
::s ᚃ ::t f ::comment Ogham letter Fearn
|
672 |
+
::s ᚄ ::t s ::comment Ogham letter Sail
|
673 |
+
::s ᚅ ::t n ::comment Ogham letter Nion
|
674 |
+
::s ᚋ ::t m ::comment Ogham letter Muin
|
675 |
+
::s ᚌ ::t g ::comment Ogham letter Gort
|
676 |
+
::s ᚍ ::t v ::t-alt ng ::comment Ogham letter nGéadal
|
677 |
+
::s ᚎ ::t z ::comment Ogham letter Straif
|
678 |
+
::s ᚏ ::t r ::comment Ogham letter Ruis
|
679 |
+
::s ᚆ ::t h ::t-alt j ::comment Ogham letter Uath
|
680 |
+
::s ᚇ ::t d ::comment Ogham letter Dair
|
681 |
+
::s ᚈ ::t t ::comment Ogham letter Tinne
|
682 |
+
::s ᚉ ::t k ::comment Ogham letter Coll
|
683 |
+
::s ᚊ ::t q ::t-alt kw ::comment Ogham letter Ceirt
|
684 |
+
::s ᚐ ::t a ::comment Ogham letter Ailm
|
685 |
+
::s ᚑ ::t o ::comment Ogham letter Onn
|
686 |
+
::s ᚒ ::t u ::comment Ogham letter Úr
|
687 |
+
::s ᚓ ::t e ::comment Ogham letter Eadhadh
|
688 |
+
::s ᚔ ::t i ::comment Ogham letter Iodhadh
|
689 |
+
::s ᚚ ::t p ::comment Ogham letter Peith
|
690 |
+
# Additional Ogham letters (outside standard alphabet)
|
691 |
+
::s ᚕ ::t eo ::t-alt ea ::comment Ogham additional letter Éabhadh
|
692 |
+
::s ᚖ ::t oi ::t-alt oe ::comment Ogham additional letter Ór
|
693 |
+
::s ᚗ ::t ui ::t-alt ua ::comment Ogham additional letter Uilleann
|
694 |
+
::s ᚘ ::t p ::t-alt io ::comment Ogham additional letter Ifín
|
695 |
+
::s ᚙ ::t ch ::t-alt x, ai ::comment Ogham additional letter Eamhancholl
|
696 |
+
::s " " ::t " " ::comment Ogham space mark
|
697 |
+
::s ᚛ ::t "" ::comment Ogham feather mark
|
698 |
+
::s ᚜ ::t "" ::comment Ogham reversed feather mark
|
699 |
+
|
700 |
+
# Georgian
|
701 |
+
::s ა ::t a ::comment Georgian letter an
|
702 |
+
::s ე ::t e ::comment Georgian letter en
|
703 |
+
::s ი ::t i ::comment Georgian letter in
|
704 |
+
::s ო ::t o ::comment Georgian letter on
|
705 |
+
::s უ ::t u ::comment Georgian letter un
|
706 |
+
::s ჱ ::t ey ::comment archaic Georgian letter he
|
707 |
+
::s ჲ ::t i ::comment archaic Georgian letter hie
|
708 |
+
::s ჳ ::t w ::comment archaic Georgian letter we
|
709 |
+
::s ჴ ::t q ::comment archaic Georgian letter har
|
710 |
+
::s ჵ ::t o ::comment archaic Georgian letter hoe
|
711 |
+
::s ჶ ::t f ::comment Georgian letter fi (Greek phi)
|
712 |
+
::s ჷ ::t e ::comment Georgian letter yn (schwa)
|
713 |
+
::s ჸ ::t a ::comment Georgian letter elifi
|
714 |
+
::s ჹ ::t g ::comment Georgian letter gan
|
715 |
+
::s ჺ ::t ' ::comment Georgian letter ain
|
716 |
+
::s ჼ ::t n ::comment Georgian letter nar
|
717 |
+
::s ჽ ::t e ::comment Georgian letter aen
|
718 |
+
::s ჾ ::t ::comment Georgian letter hard sign
|
719 |
+
::s ჿ ::t w ::comment Georgian letter labial sign
|
720 |
+
|
721 |
+
::s Ⴚ ::t TS ::comment GEORGIAN CAPITAL LETTER CAN
|
722 |
+
::s ც ::t ts ::comment GEORGIAN LETTER CAN
|
723 |
+
::s Ც ::t TS ::comment GEORGIAN MTAVRULI CAPITAL LETTER CAN
|
724 |
+
::s ⴚ ::t ts ::comment GEORGIAN SMALL LETTER CAN
|
725 |
+
::s Ⴜ ::t TS ::comment GEORGIAN CAPITAL LETTER CIL
|
726 |
+
::s წ ::t ts ::comment GEORGIAN LETTER CIL
|
727 |
+
::s Წ ::t TS ::comment GEORGIAN MTAVRULI CAPITAL LETTER CIL
|
728 |
+
::s ⴜ ::t ts ::comment GEORGIAN SMALL LETTER CIL
|
729 |
+
::s Ⴛ ::t DZ ::comment GEORGIAN CAPITAL LETTER JIL
|
730 |
+
::s ძ ::t dz ::comment GEORGIAN LETTER JIL
|
731 |
+
::s Ძ ::t DZ ::comment GEORGIAN MTAVRULI CAPITAL LETTER JIL
|
732 |
+
::s ⴛ ::t dz ::comment GEORGIAN SMALL LETTER JIL
|
733 |
+
::s Ⴟ ::t J ::comment GEORGIAN CAPITAL LETTER JHAN
|
734 |
+
::s ჯ ::t j ::comment GEORGIAN LETTER JHAN
|
735 |
+
::s Ჯ ::t J ::comment GEORGIAN MTAVRULI CAPITAL LETTER JHAN
|
736 |
+
::s ⴟ ::t j ::comment GEORGIAN SMALL LETTER JHAN
|
737 |
+
|
738 |
+
|
739 |
+
::s Ⴀ ::t A ::comment Georgian capital letter an
|
740 |
+
::s Ⴄ ::t E ::comment Georgian capital letter en
|
741 |
+
::s Ⴈ ::t I ::comment Georgian capital letter in
|
742 |
+
::s Ⴍ ::t O ::comment Georgian capital letter on
|
743 |
+
::s Ⴓ ::t U ::comment Georgian capital letter un
|
744 |
+
::s Ⴡ ::t EY ::comment archaic Georgian capital letter he
|
745 |
+
::s Ⴢ ::t I ::comment archaic Georgian capital letter hie
|
746 |
+
::s Ⴣ ::t W ::comment archaic Georgian capital letter we
|
747 |
+
::s Ⴤ ::t Q ::comment archaic Georgian capital letter har
|
748 |
+
::s Ⴥ ::t O ::comment archaic Georgian capital letter hoe
|
749 |
+
::s Ⴧ ::t E ::comment archaic Georgian capital letter yn (schwa)
|
750 |
+
::s Ⴭ ::t E ::comment archaic Georgian capital letter aen
|
751 |
+
|
752 |
+
::s Ა ::t A ::comment Georgian Mtavruli capital letter an
|
753 |
+
::s Ე ::t E ::comment Georgian Mtavruli capital letter en
|
754 |
+
::s Ი ::t I ::comment Georgian Mtavruli capital letter in
|
755 |
+
::s Ო ::t O ::comment Georgian Mtavruli capital letter on
|
756 |
+
::s Უ ::t U ::comment Georgian Mtavruli capital letter un
|
757 |
+
::s Ჱ ::t EY ::comment archaic Georgian Mtavruli capital letter he
|
758 |
+
::s Ჲ ::t I ::comment archaic Georgian Mtavruli capital letter hie
|
759 |
+
::s Ჳ ::t W ::comment archaic Georgian Mtavruli capital letter we
|
760 |
+
::s Ჴ ::t Q ::comment archaic Georgian Mtavruli capital letter har
|
761 |
+
::s Ჵ ::t O ::comment archaic Georgian Mtavruli capital letter hoe
|
762 |
+
::s Ჶ ::t F ::comment Georgian Mtavruli capital letter fi (Greek phi)
|
763 |
+
::s Ჷ ::t E ::comment Georgian Mtavruli capital letter yn (schwa)
|
764 |
+
::s Ჸ ::t A ::comment Georgian Mtavruli capital letter elifi
|
765 |
+
::s Ჹ ::t G ::comment Georgian Mtavruli capital letter gan
|
766 |
+
::s Ჺ ::t ' ::comment Georgian Mtavruli capital letter ain
|
767 |
+
::s Ჽ ::t E ::comment Georgian Mtavruli capital letter aen
|
768 |
+
::s Ჾ ::t ::comment Georgian Mtavruli capital letter hard sign
|
769 |
+
::s Ჿ ::t W ::comment Georgian Mtavruli capital letter labial sign
|
770 |
+
|
771 |
+
::s ⴀ ::t a ::comment Georgian small letter an
|
772 |
+
::s ⴄ ::t e ::comment Georgian small letter en
|
773 |
+
::s ⴈ ::t i ::comment Georgian small letter in
|
774 |
+
::s ⴍ ::t o ::comment Georgian small letter on
|
775 |
+
::s ⴓ ::t u ::comment Georgian small letter un
|
776 |
+
::s ⴡ ::t ey ::comment archaic Georgian small letter he
|
777 |
+
::s ⴢ ::t i ::comment archaic Georgian small letter hie
|
778 |
+
::s ⴣ ::t w ::comment archaic Georgian small letter we
|
779 |
+
::s ⴤ ::t q ::comment archaic Georgian small letter har
|
780 |
+
::s ⴥ ::t o ::comment archaic Georgian small letter hoe
|
781 |
+
::s ⴧ ::t e ::comment Georgian small letter yn (schwa)
|
782 |
+
::s ⴭ ::t e ::comment Georgian small letter aen
|
783 |
+
|
784 |
+
# Armenian
|
785 |
+
::s Ա ::t A ::comment Armenian capital letter ayb
|
786 |
+
::s ա ::t a ::comment Armenian small letter ayb
|
787 |
+
::s ՠ ::t a ::comment ARMENIAN SMALL LETTER TURNED AYB (CHECK)
|
788 |
+
::s Ե ::t E ::comment Armenian capital letter ech ::dont-use-at-start-of-word
|
789 |
+
::s ե ::t e ::comment Armenian small letter ech ::dont-use-at-start-of-word
|
790 |
+
::s Ե ::t Ye ::comment Armenian capital letter ech ::use-only-at-start-of-word
|
791 |
+
::s ե ::t ye ::comment Armenian small letter ech ::use-only-at-start-of-word
|
792 |
+
::s Է ::t E ::comment Armenian capital letter eh
|
793 |
+
::s է ::t e ::comment Armenian small letter eh
|
794 |
+
::s Ը ::t E ::comment Armenian capital letter et
|
795 |
+
::s ը ::t e ::comment Armenian small letter et
|
796 |
+
::s Ի ::t I ::comment Armenian capital letter ini
|
797 |
+
::s ի ::t i ::comment Armenian small letter ini
|
798 |
+
::s Յ ::t Y ::comment Armenian capital letter yi
|
799 |
+
::s յ ::t y ::comment Armenian small letter yi
|
800 |
+
::s ֈ ::t y ::comment ARMENIAN SMALL LETTER YI WITH STROKE (CHECK)
|
801 |
+
::s Ո ::t Vo ::comment Armenian capital letter vo ::use-only-at-start-of-word
|
802 |
+
::s ո ::t vo ::comment Armenian small letter vo ::use-only-at-start-of-word
|
803 |
+
::s Ո ::t O ::comment Armenian capital letter vo ::dont-use-at-start-of-word
|
804 |
+
::s ո ::t o ::comment Armenian small letter vo ::dont-use-at-start-of-word
|
805 |
+
::s Ւ ::t W ::comment Armenian capital letter yiwn
|
806 |
+
::s ւ ::t w ::comment Armenian small letter yiwn
|
807 |
+
::s Օ ::t O ::comment Armenian capital letter oh
|
808 |
+
::s օ ::t o ::comment Armenian small letter oh
|
809 |
+
::s Խ ::t Kh ::comment Armenian capital letter xeh
|
810 |
+
::s խ ::t kh ::comment Armenian small letter xeh
|
811 |
+
|
812 |
+
::s Ժ ::t Zh ::comment Armenian capital letter zhe
|
813 |
+
::s Ղ ::t Gh ::comment Armenian capital letter ghad
|
814 |
+
::s Ճ ::t Tch ::comment Armenian capital letter cheh
|
815 |
+
::s ճ ::t tch ::comment Armenian small letter cheh
|
816 |
+
::s Շ ::t Sh ::comment Armenian capital letter sha
|
817 |
+
::s Չ ::t Ch ::comment Armenian capital letter cha
|
818 |
+
::s Ջ ::t J ::comment Armenian capital letter jheh
|
819 |
+
::s ջ ::t j ::comment Armenian small letter jheh
|
820 |
+
::s Վ ::t V ::comment Armenian capital letter vew
|
821 |
+
::s վ ::t v ::comment Armenian small letter vew
|
822 |
+
::s Ձ ::t Dz ::comment Armenian capital letter ja
|
823 |
+
::s ձ ::t dz ::comment Armenian small letter ja
|
824 |
+
::s Ծ ::t Ts ::comment Armenian capital letter ca
|
825 |
+
::s ծ ::t ts ::comment Armenian small letter ca
|
826 |
+
::s Ք ::t K ::t-alt Q ::comment Armenian capital letter keh - sometimes romanized as K' or Q
|
827 |
+
::s ք ::t k ::t-alt q ::comment Armenian small letter keh - sometimes romanized as k' or q
|
828 |
+
|
829 |
+
::s են ::t en ::use-only-for-whole-word ::comment exception (auxiliary verb)
|
830 |
+
::s եմ ::t em ::use-only-for-whole-word ::comment exception (auxiliary verb)
|
831 |
+
::s ենք ::t enk ::use-only-for-whole-word ::comment exception (auxiliary verb)
|
832 |
+
::s ես ::t es ::use-only-for-whole-word ::comment exception (auxiliary verb)
|
833 |
+
::s եք ::t ek ::use-only-for-whole-word ::comment exception (auxiliary verb)
|
834 |
+
|
835 |
+
::s և ::t ev ::comment Armenian small ligature ech yiwn
|
836 |
+
::s ՈՒ ::t U ::comment Armenian capital vo+yiwn
|
837 |
+
::s Ու ::t U ::comment Armenian capital/small vo+yiwn
|
838 |
+
::s ու ::t u ::comment Armenian small vo+yiwn
|
839 |
+
|
840 |
+
::s իւ ::t yu
|
841 |
+
|
842 |
+
## Japanese
|
843 |
+
# Katakana
|
844 |
+
::s シ ::t shi
|
845 |
+
::s チ ::t chi
|
846 |
+
::s フ ::t fu
|
847 |
+
::s ジ ::t ji
|
848 |
+
::s ヂ ::t ji
|
849 |
+
::s ヅ ::t zu
|
850 |
+
::s ャ ::t ya
|
851 |
+
::s ュ ::t yu
|
852 |
+
::s ョ ::t yo
|
853 |
+
::s シャ ::t sha
|
854 |
+
::s シュ ::t shu
|
855 |
+
::s ショ ::t sho
|
856 |
+
::s チャ ::t cha
|
857 |
+
::s チェ ::t che
|
858 |
+
::s チュ ::t chu
|
859 |
+
::s チョ ::t cho
|
860 |
+
::s ジャ ::t ja
|
861 |
+
::s ジュ ::t ju
|
862 |
+
::s ジョ ::t jo
|
863 |
+
::s ジェ ::t je
|
864 |
+
::s ヂャ ::t ja
|
865 |
+
::s ヂュ ::t ju
|
866 |
+
::s ヂョ ::t jo
|
867 |
+
::s フェ ::t fe
|
868 |
+
::s ヴェ ::t ve
|
869 |
+
::s フィ ::t fi
|
870 |
+
::s ウィ ::t wi
|
871 |
+
::s ヴィ ::t vi
|
872 |
+
::s ティ ::t ti
|
873 |
+
::s ディ ::t di
|
874 |
+
::s ッ ::t tsu ::comment katakana double following consonant
|
875 |
+
::s ー ::t "" ::comment katakana prolonged sound mark
|
876 |
+
::s 𛅤 ::t i ::comment KATAKANA LETTER SMALL WI
|
877 |
+
::s 𛅥 ::t e ::comment KATAKANA LETTER SMALL WE
|
878 |
+
::s 𛅦 ::t o ::comment KATAKANA LETTER SMALL WO
|
879 |
+
# Hiragana
|
880 |
+
::s し ::t shi
|
881 |
+
::s ち ::t chi
|
882 |
+
::s つ ::t tsu
|
883 |
+
::s ふ ::t fu
|
884 |
+
::s を ::t o
|
885 |
+
::s じ ::t ji
|
886 |
+
::s ぢ ::t ji
|
887 |
+
::s づ ::t zu
|
888 |
+
::s ゃ ::t ya
|
889 |
+
::s ゅ ::t yu
|
890 |
+
::s ょ ::t yo
|
891 |
+
::s しゃ ::t sha
|
892 |
+
::s しゅ ::t shu
|
893 |
+
::s しょ ::t sho
|
894 |
+
::s ちゃ ::t cha
|
895 |
+
::s ちゅ ::t chu
|
896 |
+
::s ちょ ::t cho
|
897 |
+
::s じゃ ::t ja
|
898 |
+
::s じゅ ::t ju
|
899 |
+
::s じょ ::t jo
|
900 |
+
::s ぢゃ ::t ja
|
901 |
+
::s ぢゅ ::t ju
|
902 |
+
::s ぢょ ::t jo
|
903 |
+
::s 𛅐 ::t i ::comment HIRAGANA LETTER SMALL WI
|
904 |
+
::s 𛅑 ::t e ::comment HIRAGANA LETTER SMALL WE
|
905 |
+
::s 𛅒 ::t o ::comment HIRAGANA LETTER SMALL WO
|
906 |
+
::s っ ::t tsu ::comment hiragana double following consonant
|
907 |
+
::s 々 ::t ² ::comment ideographic iteration mark ::annotation repetition-sign
|
908 |
+
|
909 |
+
::s フ ::t fu ::t-alt f
|
910 |
+
::s キ ::t ki ::t-alt k
|
911 |
+
::s ク ::t ku ::t-alt k
|
912 |
+
::s ラ ::t ra ::t-alt la
|
913 |
+
::s リ ::t ri ::t-alt li
|
914 |
+
::s ル ::t ru ::t-alt lu, l, r
|
915 |
+
::s レ ::t re ::t-alt le
|
916 |
+
::s ロ ::t ro ::t-alt lo
|
917 |
+
::s ム ::t mu ::t-alt m ::example キム = Kim
|
918 |
+
::s シ ::t shi ::t-alt si ::example メキシコ = meksiko (Mexico)
|
919 |
+
::s ス ::t su ::t-alt s
|
920 |
+
::s ト ::t to ::t-alt t
|
921 |
+
::s ツ ::t tsu ::t-alt tu, ts ::example シュルツ = Schultz
|
922 |
+
|
923 |
+
::s ㋿ ::t Reiwa ::comment SQUARE ERA NAME REIWA
|
924 |
+
|
925 |
+
# Chinese
|
926 |
+
::s 邦 ::t bang ::t-alt bon, bum, bun, pon
|
927 |
+
::s 鲍 ::t bao ::t-alt bow
|
928 |
+
::s 堡 ::t bao ::t-alt berg, burg, bourg, burgh
|
929 |
+
::s 贝 ::t bei ::t-alt ber
|
930 |
+
::s 本 ::t ben ::t-alt bern, bon, bourn, burn
|
931 |
+
::s 彼得 ::t bide ::t-alt peter, pet
|
932 |
+
::s 伯 ::t bo ::t-alt ber
|
933 |
+
::s 波 ::t bo ::t-alt po
|
934 |
+
::s 布 ::t bu ::t-alt b
|
935 |
+
::s 策 ::t ce ::t-alt tze, tzer
|
936 |
+
::s 曾 ::t ceng ::t-alt tzen, zen
|
937 |
+
::s 彻 ::t che ::t-alt tche
|
938 |
+
::s 茨 ::t ci ::t-alt ts, tz, z
|
939 |
+
::s 兹 ::t ci ::t-alt ds, dz, tz, z, zi
|
940 |
+
::s 蒂 ::t di ::t-alt ti, tti
|
941 |
+
::s 丁 ::t ding ::t-alt din, tin
|
942 |
+
::s 顿 ::t dun ::t-alt ton
|
943 |
+
::s 多 ::t duo ::t-alt do, dor, to
|
944 |
+
::s 尔 ::t er ::t-alt l, le, ll, r
|
945 |
+
::s 弗 ::t fu ::t-alt f, fer, pher, v, ver, vir
|
946 |
+
::s 夫 ::t fu ::t-alt f, v
|
947 |
+
::s 福 ::t fu ::t-alt faw, for, ford
|
948 |
+
::s 哥 ::t ge ::t-alt go, co
|
949 |
+
::s 戈 ::t ge ::t-alt go
|
950 |
+
::s 各 ::t ge ::t-alt go, co
|
951 |
+
::s 赫 ::t he ::t-alt ch, che, cher, ge
|
952 |
+
::s 华 ::t hua ::t-alt ver, wa, war, wer ::example Washington
|
953 |
+
::s 怀 ::t huai ::t-alt whi, wi, wy
|
954 |
+
::s 惠 ::t hui ::t-alt wha, whea
|
955 |
+
::s 基 ::t ji ::t-alt ki, chi
|
956 |
+
::s 吉 ::t ji ::t-alt gi, gui
|
957 |
+
::s 加 ::t jia ::t-alt ca, ga, ka ::example Canada
|
958 |
+
::s 杰 ::t jie ::t-alt ger
|
959 |
+
::s 金 ::t jin ::t-alt kin, gin
|
960 |
+
::s 斤 ::t jin ::t-alt zin
|
961 |
+
::s 康 ::t kang ::t-alt con, corn
|
962 |
+
::s 考 ::t kao ::t-alt cow, cour
|
963 |
+
::s 克 ::t ke ::t-alt k, che, cher
|
964 |
+
::s 科 ::t ke ::t-alt ko
|
965 |
+
::s 拉 ::t la ::t-alt ra ::example Tirana
|
966 |
+
::s 朗 ::t lang ::t-alt lon, ron
|
967 |
+
::s 赖 ::t lai ::t-alt ri
|
968 |
+
::s 劳 ::t lao ::t-alt low
|
969 |
+
::s 勒 ::t lei ::t-alt ler
|
970 |
+
::s 伦 ::t lun ::t-alt lon, ran, ron
|
971 |
+
::s 里 ::t li ::t-alt ri
|
972 |
+
::s 利 ::t li ::t-alt ri ::example Ferrari
|
973 |
+
::s 隆 ::t long ::t-alt lon, lum, lund
|
974 |
+
::s 罗 ::t luo ::t-alt l, lo, lu, ro, row, ru
|
975 |
+
::s 洛 ::t luo ::t-alt lo, low, ro
|
976 |
+
::s 默 ::t mo ::t-alt mer
|
977 |
+
::s 纳 ::t na ::t-alt ne, ner
|
978 |
+
::s 珀 ::t po ::t-alt per
|
979 |
+
::s 奇 ::t qi ::t-alt chi, dge, ge, tch
|
980 |
+
::s 齐 ::t qi ::t-alt tsi, zi
|
981 |
+
::s 乔 ::t qiao ::t-alt jo
|
982 |
+
::s 青 ::t qing ::t-alt tsing
|
983 |
+
::s 琼 ::t qiong ::t-alt jon, jum, jun
|
984 |
+
::s 瑟 ::t se ::t-alt the
|
985 |
+
::s 什 ::t shen ::t-alt sh
|
986 |
+
::s 圣 ::t sheng ::t-alt san, sao, saint
|
987 |
+
::s 斯 ::t si ::t-alt s, rth, th ::example Alaska
|
988 |
+
::s 索 ::t suo ::t-alt tho
|
989 |
+
::s 特 ::t te ::t-alt t
|
990 |
+
::s 翁 ::t weng ::t-alt on
|
991 |
+
::s 沃 ::t wo ::t-alt ver, vo, war, wer
|
992 |
+
::s 乌 ::t wu ::t-alt ou, u
|
993 |
+
::s 希 ::t xi ::t-alt chi, hi, shi
|
994 |
+
::s 西 ::t xi ::t-alt s, si
|
995 |
+
::s 锡 ::t xi ::t-alt ci, si, thi, zi
|
996 |
+
::s 夏 ::t xia ::t-alt ha, cha, cia, sha, tia
|
997 |
+
::s 香 ::t xiang ::t-alt chan, cham
|
998 |
+
::s 歇 ::t xie ::t-alt she
|
999 |
+
::s 谢 ::t xie ::t-alt che, she
|
1000 |
+
::s 辛 ::t xin ::t-alt cin, sen, sin, sing, sun, zen
|
1001 |
+
::s 欣 ::t xin ::t-alt hin, shin
|
1002 |
+
::s 休 ::t xiu ::t-alt hu, hue
|
1003 |
+
::s 修 ::t xiu ::t-alt ciu, siu, thew, tiu
|
1004 |
+
::s 许 ::t xu ::t-alt hue, schue
|
1005 |
+
::s 逊 ::t xun ::t-alt son
|
1006 |
+
::s 耶 ::t ye ::t-alt yer, ier
|
1007 |
+
::s 泽 ::t ze ::t-alt ser
|
1008 |
+
::s 扎 ::t zha ::t-alt za
|
1009 |
+
::s 詹 ::t zhan ::t-alt ja, jam, jan, jen, jon
|
1010 |
+
::s 治 ::t zhi ::t-alt ge ::example George
|
1011 |
+
|
1012 |
+
## Numbers
|
1013 |
+
# Chinese and Japanese numbers
|
1014 |
+
::s 零 ::num 0 ::style formal
|
1015 |
+
::s 〇 ::num 0 ::style informal
|
1016 |
+
::s 一 ::num 1
|
1017 |
+
::s 壹 ::num 1 ::style financial
|
1018 |
+
::s 二 ::num 2
|
1019 |
+
::s 贰 ::num 2 ::style financial ::standard simplified
|
1020 |
+
::s 貳 ::num 2 ::style financial ::standard traditional
|
1021 |
+
::s 三 ::num 3
|
1022 |
+
::s 叁 ::num 3 ::style financial ::standard simplified
|
1023 |
+
::s 參 ::num 3 ::style financial ::standard traditional
|
1024 |
+
::s 四 ::num 4
|
1025 |
+
::s 肆 ::num 4 ::style financial
|
1026 |
+
::s 五 ::num 5
|
1027 |
+
::s 伍 ::num 5 ::style financial
|
1028 |
+
::s 六 ::num 6
|
1029 |
+
::s 陆 ::num 6 ::style financial ::standard simplified
|
1030 |
+
::s 陸 ::num 6 ::style financial ::standard traditional
|
1031 |
+
::s 七 ::num 7
|
1032 |
+
::s 柒 ::num 7 ::style financial
|
1033 |
+
::s 八 ::num 8
|
1034 |
+
::s 捌 ::num 8 ::style financial
|
1035 |
+
::s 九 ::num 9
|
1036 |
+
::s 玖 ::num 9 ::style financial
|
1037 |
+
::s 十 ::num 10
|
1038 |
+
::s 拾 ::num 10 ::style financial
|
1039 |
+
::s 百 ::num 100
|
1040 |
+
::s 佰 ::num 100 ::style financial
|
1041 |
+
::s 千 ::num 1000
|
1042 |
+
::s 仟 ::num 1000 ::style financial
|
1043 |
+
::s 万 ::num 10000 ::is-large-power ::standard simplified
|
1044 |
+
::s 萬 ::num 10000 ::is-large-power ::standard traditional
|
1045 |
+
::s 萬 ::num 10000 ::is-large-power ::style financial
|
1046 |
+
::s 亿 ::num 100000000 ::is-large-power ::standard simplified
|
1047 |
+
::s 億 ::num 100000000 ::is-large-power ::standard traditional
|
1048 |
+
::s 億 ::num 100000000 ::is-large-power ::style financial
|
1049 |
+
::s 兆 ::num 1000000000000 ::is-large-power
|
1050 |
+
::s 京 ::num 10000000000000000 ::is-large-power
|
1051 |
+
::s 负 ::is-minus-sign
|
1052 |
+
::s 正 ::is-plus-sign
|
1053 |
+
::s 点 ::is-decimal-point
|
1054 |
+
::s 分之 ::fraction-connector denominator * numerator
|
1055 |
+
::s 百分之 ::percentage-marker * numerator
|
1056 |
+
::s 又 ::int-frac-connector integer * fraction
|
1057 |
+
|
1058 |
+
# numbers in non-number words (to be extended)
|
1059 |
+
::s 一贯 ::t yiguan ::comment consistent
|
1060 |
+
|
1061 |
+
::s 红十字会 ::t hongshizihui ::comment Red Cross
|
1062 |
+
|
1063 |
+
::s 百度 ::t baidu ::comment Baidu (company)
|
1064 |
+
::s 百分 ::t baifen ::comment percent
|
1065 |
+
::s 百合 ::t baihe ::comment lily
|
1066 |
+
::s 百货 ::t baihuo ::comment general merchandise
|
1067 |
+
::s 百科 ::t baike ::comment encyclopedia
|
1068 |
+
::s 百老汇 ::t bailaohui
|
1069 |
+
::s 百灵 ::t bailing
|
1070 |
+
::s 百慕大 ::t baimuda
|
1071 |
+
::s 百日咳 ::t bairike
|
1072 |
+
::s 百色市 ::t baiseshi
|
1073 |
+
::s 百事可乐 ::t baishikele ::comment Pepsi Cola
|
1074 |
+
::s 百無 ::t baiwu
|
1075 |
+
::s 百香 ::t baixiang
|
1076 |
+
::s 百姓 ::t baixing
|
1077 |
+
::s 百叶 ::t baiye
|
1078 |
+
::s 百色 ::t bose
|
1079 |
+
::s 杨百翰 ::t yangbaihan ::comment Brigham Young
|
1080 |
+
|
1081 |
+
::s 北京 ::t beijing
|
1082 |
+
::s 京都 ::t jingdu
|
1083 |
+
::s 东京 ::t dongjing
|
1084 |
+
::s 京胡 ::t jinghu
|
1085 |
+
::s 南京 ::t nanjing
|
1086 |
+
::s 普京 ::t pujing ::comment Putin
|
1087 |
+
::s 東京 ::t dongjing ::comment Tokyo
|
1088 |
+
::s 京兆 ::t jingzhao
|
1089 |
+
|
1090 |
+
::s ㎢ ::t km²
|
1091 |
+
::s ㎥ ::t m³
|
1092 |
+
::s ㎝ ::t cm
|
1093 |
+
|
1094 |
+
## Indian
|
1095 |
+
# see mostly under UnicodeDataOverwrite.txt
|
1096 |
+
|
1097 |
+
# Malayalam
|
1098 |
+
::s ൗ ::t au ::comment MALAYALAM AU LENGTH MARK
|
1099 |
+
|
1100 |
+
# Tamil
|
1101 |
+
::s ட ::t d ::comment most commonly d, but t when word-initial or in a doubled consonant
|
1102 |
+
::s ஃப ::t f ::comment h+p=f
|
1103 |
+
::s ஃஜ ::t z ::comment h+j=z
|
1104 |
+
::s ௐ ::t om ::comment TAMIL OM
|
1105 |
+
|
1106 |
+
# Myanmar/Burmese
|
1107 |
+
# ::s ့ ::t ::comment dot below, denotes creaky tone
|
1108 |
+
# ::s း ::t ::comment visarga, denotes high tone
|
1109 |
+
::s ၌ ::t -nai ::comment locative
|
1110 |
+
::s ၍ ::t -jwe ::comment completed
|
1111 |
+
::s ၎ ::t legau ::comment aforementioned
|
1112 |
+
::s ၏ ::t -i ::comment genitive
|
1113 |
+
|
1114 |
+
# Lao
|
1115 |
+
::s ັ ::t a ::comment vowel sign mai kan
|
1116 |
+
::s ົ ::t o ::comment vowel sign mai kon
|
1117 |
+
::s ູ ::t uu ::comment vowel sign uu
|
1118 |
+
::s ຽ ::t y ::comment semivowel sign nyo
|
1119 |
+
::s ຼ ::t l ::comment semivowel sign lo
|
1120 |
+
::s ລ ::t l ::comment lo loot
|
1121 |
+
::s ຣ ::t l ::comment lo ling
|
1122 |
+
::s ໝ ::t m ::comment ho mo
|
1123 |
+
::s ໜ ::t n ::comment ho no
|
1124 |
+
::s ຢ ::t y ::comment yo
|
1125 |
+
::s ໍ ::t oo ::comment niggahita (possibly also nasal -m in final position)
|
1126 |
+
::s ໆ ::t ² ::comment Lao ko la ::annotation repetition-sign
|
1127 |
+
::s ຯ ::t ... ::comment Lao ellipsis
|
1128 |
+
|
1129 |
+
# Thai
|
1130 |
+
::s ข ::t kh ::t-end-of-syllable k
|
1131 |
+
::s ฃ ::t kh ::t-end-of-syllable k
|
1132 |
+
::s ค ::t kh ::t-end-of-syllable k
|
1133 |
+
::s ฅ ::t kh ::t-end-of-syllable k
|
1134 |
+
::s ฆ ::t kh ::t-end-of-syllable k
|
1135 |
+
::s จ ::t ch ::t-end-of-syllable t
|
1136 |
+
::s ฉ ::t ch ::t-end-of-syllable t
|
1137 |
+
::s ช ::t ch ::t-end-of-syllable t
|
1138 |
+
::s ฌ ::t ch
|
1139 |
+
::s ฎ ::t d ::t-end-of-syllable t
|
1140 |
+
::s ด ::t d ::t-end-of-syllable t
|
1141 |
+
::s บ ::t b ::t-end-of-syllable p
|
1142 |
+
::s พ ::t ph ::t-end-of-syllable p
|
1143 |
+
::s ภ ::t ph ::t-end-of-syllable p
|
1144 |
+
::s ฟ ::t f ::t-end-of-syllable p
|
1145 |
+
::s ฝ ::t f
|
1146 |
+
::s ฐ ::t th ::t-end-of-syllable t
|
1147 |
+
::s ฑ ::t th ::t-end-of-syllable t
|
1148 |
+
::s ฒ ::t th ::t-end-of-syllable t
|
1149 |
+
::s ถ ::t th ::t-end-of-syllable t
|
1150 |
+
::s ท ::t th ::t-end-of-syllable t
|
1151 |
+
::s ธ ::t th ::t-end-of-syllable t
|
1152 |
+
::s ศ ::t s ::t-end-of-syllable t
|
1153 |
+
::s ษ ::t s ::t-end-of-syllable t
|
1154 |
+
::s ส ::t s ::t-end-of-syllable t
|
1155 |
+
::s ห ::t h ::t-end-of-syllable ::comment dropped at end of syllable
|
1156 |
+
::s ฮ ::t h
|
1157 |
+
::s ฬ ::t l ::t-end-of-syllable n
|
1158 |
+
::s ล ::t l ::t-end-of-syllable n
|
1159 |
+
::s ร ::t r ::t-end-of-syllable n
|
1160 |
+
::s ญ ::t y ::t-end-of-syllable n
|
1161 |
+
::s ย ::t y ::t-end-of-syllable i
|
1162 |
+
::s ว ::t w ::t-end-of-syllable ua
|
1163 |
+
::s ฦ ::t lue
|
1164 |
+
::s ฦๅ ::t lue
|
1165 |
+
::s ฤ ::t rue
|
1166 |
+
::s ฤๅ ::t rue
|
1167 |
+
::s ๆ ::t ² ::comment Thai character maiyamok ::annotation repetition-sign
|
1168 |
+
# ::s ๅ ::comment THAI CHARACTER LAKKHANGYAO vowel lengthener
|
1169 |
+
::s ะ ::t a ::comment THAI CHARACTER SARA A
|
1170 |
+
::s ั ::t a ::comment THAI CHARACTER MAI HAN-AKAT "stick turning in the air"
|
1171 |
+
::s รร ::t an ::t-end-of-syllable a
|
1172 |
+
::s อั ::t a
|
1173 |
+
::s แ–ะ ::t ae
|
1174 |
+
::s แ–็ว ::t aeo
|
1175 |
+
::s แ–ว ::t aeo
|
1176 |
+
::s ไ–ย ::t ai
|
1177 |
+
::s ัย ::t ai
|
1178 |
+
::s าย ::t ai
|
1179 |
+
::s เ–า ::t ao
|
1180 |
+
::s เ–้า ::t ao
|
1181 |
+
::s าว ::t ao
|
1182 |
+
::s เ–ะ ::t e
|
1183 |
+
::s เ–็ ::t e
|
1184 |
+
::s ็ ::t o ::comment THAI CHARACTER MAITAIKHU U+0E47 vowel shortener/short vowel (ɔː)
|
1185 |
+
::s เ–็ว ::t eo
|
1186 |
+
::s เ–ว ::t eo
|
1187 |
+
::s อิ ::t i
|
1188 |
+
::s ีย ::t ia
|
1189 |
+
::s เ–ียะ ::t ia
|
1190 |
+
::s เ–ีย ::t ia
|
1191 |
+
::s เ–ียว ::t iao
|
1192 |
+
::s ออ ::t o
|
1193 |
+
::s โ–ะ ::t o
|
1194 |
+
::s โ– ::t o
|
1195 |
+
::s เ–าะ ::t o
|
1196 |
+
::s เ–อะ ::t oe
|
1197 |
+
::s เ–ิ ::t oe
|
1198 |
+
::s เ–อ ::t oe
|
1199 |
+
::s เ–ย ::t oei
|
1200 |
+
::s โ–ย ::t oi
|
1201 |
+
# ::s อย ::t oi ::comment problematic if followed by further vowels
|
1202 |
+
::s วย ::t uai
|
1203 |
+
::s เ–ือะ ::t uea
|
1204 |
+
::s เ–ือ ::t uea
|
1205 |
+
::s เ–ือย ::t ueai
|
1206 |
+
|
1207 |
+
# Khmer
|
1208 |
+
::s ័ ::t "" ::comment Khmer samyok sannya: indicates deviation from the general rules of pronunciation
|
1209 |
+
::s ៏ ::t "" ::comment Khmer sign ahsda: denotes stressed intonation in some single-consonant words
|
1210 |
+
::s ៍ ::t "" ::comment Khmer sign toandakhiat: indicates that the base character is not pronounced
|
1211 |
+
::s ៌ ::t "" ::comment Khmer sign robat: a diacritic historically corresponding to the repha form of ra in Devanagari
|
1212 |
+
::s ប៉ ::t pa ::comment Khmer ba + musĕkâtônd -> pa
|
1213 |
+
::s ៗ ::t ² ::comment Khmer sign lek too ::annotation repetition-sign
|
1214 |
+
|
1215 |
+
## Semitic languages
|
1216 |
+
# Arabic
|
1217 |
+
::s و ::t w ::comment Arabic letter waw ::t-alt o, u ::lcode ara
|
1218 |
+
::s ء ::t ' ::comment hamza
|
1219 |
+
::s ٔ ::t ' ::comment hamza above
|
1220 |
+
::s ٕ ::t ' ::comment hamza below
|
1221 |
+
::s ع ::t ' ::comment ain
|
1222 |
+
::s آ ::t a ::comment alef madda
|
1223 |
+
::s ٓا ::t a ::comment Arabic maddah above plus alef (presumably an ill-formed version of آ; found 1 instance in Urdu text)
|
1224 |
+
::s إ ::t i ::comment alef with hamza below
|
1225 |
+
::s ٱ ::t a ::comment alef wasla ::comment typically indicates liaison with preceding word
|
1226 |
+
::s ة ::t a ::comment teh marbuta
|
1227 |
+
::s ۃ ::t a ::comment teh marbuta goal ::comment Used in Punjabi, Sindhi. Different from plain 'teh marbuta'?
|
1228 |
+
::s ي ::t y ::comment Arabic yeh
|
1229 |
+
::s ى ::t a ::comment alef maksura
|
1230 |
+
::s ﻯ ::t a ::comment alef maksura isolated form
|
1231 |
+
::s ﻰ ::t a ::comment alef maksura final form
|
1232 |
+
::s ﯨ ::t a ::comment Uighur Kazakh Kirghiz alef maksura initial form
|
1233 |
+
::s ﯩ ::t a ::comment Uighur Kazakh Kirghiz alef maksura medial form
|
1234 |
+
::s ٰ ::t a ::comment Arabic letter superscript alef
|
1235 |
+
::s ـ ::t ::comment tatweel (filler)
|
1236 |
+
::s َ ::t a ::comment fatha ("-a")
|
1237 |
+
::s ُ ::t u ::comment damma ("-u")
|
1238 |
+
::s ِ ::t i ::comment kasra ("-i")
|
1239 |
+
::s ْ ::t ::comment sukun (no vowel)
|
1240 |
+
::s ۡ ::t ::comment small high dotless head of khah; like sukun (no vowel); used in Kashmiri, Assamese
|
1241 |
+
::s ً ::t ::comment fathatan ("-an")
|
1242 |
+
::s اً ::t an ::comment alef + fathatan
|
1243 |
+
::s ٌ ::t ::comment dammatan ("-un")
|
1244 |
+
::s ٍ ::t ::comment kasratan ("-in")
|
1245 |
+
::s ّ ::t ::comment shadda (consonant doubler)
|
1246 |
+
::s ڃ ::t ny ::comment Arabic letter nyeh U+0683 (used in Sindhi (snd))
|
1247 |
+
::s ڄ ::t dy ::comment Arabic letter dyeh U+0684 (used in Sindhi (snd))
|
1248 |
+
::s ۾ ::t men ::comment Sindhi postposition men
|
1249 |
+
::s ؑ ::t alayhe wasallam ::comment "upon him be peace"
|
1250 |
+
::s ﷴ ::t mohammad ::comment "Mohammad"
|
1251 |
+
::s ﷸ ::t wasallam ::comment "and peace"
|
1252 |
+
::s ﷺ ::t sallallahou alayhe wasallam ::comment "prayer of God be upon him and his family and peace"
|
1253 |
+
|
1254 |
+
::s ࣓ ::t waw ::comment ARABIC SMALL LOW WAW
|
1255 |
+
::s ࣔ ::t al-rub ::comment ARABIC SMALL HIGH WORD AR-RUB
|
1256 |
+
::s ࣕ ::t s ::comment ARABIC SMALL HIGH SAD
|
1257 |
+
::s ࣖ ::t ' ::comment ARABIC SMALL HIGH AIN
|
1258 |
+
::s ࣗ ::t q ::comment ARABIC SMALL HIGH QAF
|
1259 |
+
::s ࣘ ::t n ::comment ARABIC SMALL HIGH NOON WITH KASRA
|
1260 |
+
::s ࣙ ::t n ::comment ARABIC SMALL LOW NOON WITH KASRA
|
1261 |
+
::s ࣚ ::t al-thalatha ::comment ARABIC SMALL HIGH WORD ATH-THALATHA
|
1262 |
+
::s ࣛ ::t al-sajda ::comment ARABIC SMALL HIGH WORD AS-SAJDA
|
1263 |
+
::s ࣜ ::t al-nisf ::comment ARABIC SMALL HIGH WORD AN-NISF
|
1264 |
+
::s ࣝ ::t sakta ::comment ARABIC SMALL HIGH WORD SAKTA
|
1265 |
+
::s ࣞ ::t qif ::comment ARABIC SMALL HIGH WORD QIF
|
1266 |
+
::s ࣟ ::t waqfa ::comment ARABIC SMALL HIGH WORD WAQFA
|
1267 |
+
::s ࣠ ::t ::comment ARABIC SMALL HIGH FOOTNOTE MARKER (CHECK)
|
1268 |
+
::s ࣡ ::t ::comment ARABIC SMALL HIGH SIGN SAFHA (CHECK)
|
1269 |
+
::s ::t ::comment ARABIC DISPUTED END OF AYAH (CHECK)
|
1270 |
+
|
1271 |
+
# Farsi
|
1272 |
+
::s ی ::t i ::t-alt y ::comment Contributed by Nima
|
1273 |
+
::s ای ::t i ::t-alt ai ::use-only-at-start-of-word ::comment Contributed by Nima
|
1274 |
+
::s هٔ ::t eye ::use-only-at-end-of-word ::lcode fas ::comment Contributed by Nima
|
1275 |
+
::s و ::t v ::t-alt o, u ::lcode fas ::comment Arabic letter waw
|
1276 |
+
::s ض ::t z ::t-alt d ::lcode fas ::comment Contributed by Marjan
|
1277 |
+
::s ث ::t s ::t-alt th ::lcode fas ::comment Contributed by Marjan
|
1278 |
+
::s ذ ::t z ::t-alt th ::lcode fas ::comment Contributed by Nima
|
1279 |
+
::s ع ::t a ::t-alt ' ::lcode fas ::comment Contributed by Nima
|
1280 |
+
::s عا ::t a ::lcode fas ::comment Contributed by Nima
|
1281 |
+
::s عی ::t i ::t-alt iy ::lcode fas ::comment Contributed by Nima
|
1282 |
+
::s عو ::t u ::t-alt o, av ::lcode fas ::comment Contributed by Nima
|
1283 |
+
::s چ ::t ch ::t-alt tch, tsh ::lcode fas ::comment Contributed by Nima
|
1284 |
+
::s ه ::t e ::t-alt h ::use-only-at-end-of-word ::lcode fas ::comment Contributed by Nima
|
1285 |
+
::s ::t "" ::t-alt " " ::lcode fas ::comment source is character "zero-width non-joiner" (U+200C); Contributed by Nima
|
1286 |
+
::s غ ::t gh ::t-alt g ::lcode fas
|
1287 |
+
::s آئی ::t ai ::t-alt ae ::lcode fas
|
1288 |
+
::s ائی ::t ai ::t-alt ae ::lcode fas
|
1289 |
+
::s آئو ::t au ::t-alt ao ::lcode fas
|
1290 |
+
::s ائو ::t au ::t-alt ao ::lcode fas
|
1291 |
+
|
1292 |
+
# Kashmiri (so far: educated guesses)
|
1293 |
+
::s ٖ ::t a ::comment Arabic subscript alef U+0656
|
1294 |
+
::s ٗ ::t u ::comment Arabic inverted damma U+0657
|
1295 |
+
::s ۚ ::t j ::comment Arabic small high jeem U+06DA
|
1296 |
+
::s ۪ ::t ::comment Arabic empty centre low stop U+06EA
|
1297 |
+
::s ۬ ::t ::comment Arabic rounded high stop with filled center U+06EC
|
1298 |
+
|
1299 |
+
# Pashto
|
1300 |
+
::s ٙ ::t e ::comment Arabic zwarakay
|
1301 |
+
::s ځ ::t z ::t-alt dz ::comment Pashto letter zim; Arabic letter "hah with hamza above"
|
1302 |
+
::s څ ::t ts ::t-alt c ::comment Pashto letter tsim; Arabic letter "h with three dots above"
|
1303 |
+
::s ګ ::t g ::comment Pashto letter gaf; Arabic letter "kaf with ring"
|
1304 |
+
::s ڼ ::t n ::comment Arabic letter "noon with ring"
|
1305 |
+
::s ږ ::t g ::t-alt z, zh, j ::comment pronunciation varies regionally
|
1306 |
+
::s ښ ::t kh ::t-alt sh ::comment pronunciation varies regionally
|
1307 |
+
::s ه ::t h ::t-alt a ::lcode pus
|
1308 |
+
::s ۀ ::t e ::lcode pus ::comment Arabic letter "heh with yeh above"
|
1309 |
+
::s و ::t w ::t-alt o, u ::lcode pus
|
1310 |
+
::s ی ::t ay ::t-alt y ::lcode pus
|
1311 |
+
::s وی ::t wy ::t-alt oy, uy ::lcode pus
|
1312 |
+
::s ای ::t ay ::lcode pus
|
1313 |
+
::s ۍ ::t ay ::lcode pus
|
1314 |
+
::s ئ ::t ay ::t-alt y ::lcode pus
|
1315 |
+
::s ژ ::t zh ::t-alt z ::lcode pus ::comment [ʒ]
|
1316 |
+
::s ض ::t z ::t-alt d ::lcode pus
|
1317 |
+
::s ث ::t s ::lcode pus ::t-alt th ::comment Arabic letter theh (unvoiced th/θ)
|
1318 |
+
::s ذ ::t z ::lcode pus ::t-alt th ::comment Arabic letter thal (voiced th/ð)
|
1319 |
+
|
1320 |
+
# Hebrew
|
1321 |
+
::s ב ::t v ::comment Hebrew letter bet ::t-alt b
|
1322 |
+
::s כ ::t k ::comment Hebrew letter kaf ::t-alt kh
|
1323 |
+
::s ך ::t k ::comment Hebrew letter kaf ::t-alt kh
|
1324 |
+
::s פ ::t f ::comment Hebrew letter pe ::t-alt p
|
1325 |
+
::s ש ::t sh ::comment Hebrew letter shin ::t-alt s
|
1326 |
+
::s ו ::t v ::comment Hebrew letter vav ::t-alt o, u
|
1327 |
+
::s ח ::t ch ::comment Hebrew letter het ::t-alt h ::use-alt-in-pointed
|
1328 |
+
::s ק ::t q ::t-alt k ::use-alt-in-pointed
|
1329 |
+
::s וֹ ::t o
|
1330 |
+
::s וּ ::t u
|
1331 |
+
::s קְוָ ::t qva ::t-alt kva ::use-alt-in-pointed
|
1332 |
+
::s י ::t y
|
1333 |
+
::s יּ ::t y
|
1334 |
+
::s יָּ ::t ya
|
1335 |
+
::s ײ ::t yy ::comment Hebrew ligature Yiddish double Yod (CHECK)
|
1336 |
+
::s ׯ ::t yyy ::comment HEBREW YOD TRIANGLE (CHECK)
|
1337 |
+
::s ע ::t '
|
1338 |
+
::s ִי ::t i ::t-alt iy ::use-alt-in-pointed
|
1339 |
+
::s ֵי ::t e
|
1340 |
+
::s ִיּ ::t iy
|
1341 |
+
::s ִיָּ ::t iya
|
1342 |
+
::s ױ ::t oy
|
1343 |
+
::s א ::t a ::t-alt '
|
1344 |
+
::s אָ ::t a
|
1345 |
+
::s ֹא ::t o
|
1346 |
+
::s אַ ::t 'a
|
1347 |
+
::s אֲ ::t 'a
|
1348 |
+
::s אֶ ::t e
|
1349 |
+
::s אֱ ::t e
|
1350 |
+
::s פ ::t f
|
1351 |
+
::s פּ ::t p
|
1352 |
+
::s פַּ ::t pa
|
1353 |
+
::s פְּ ::t pe ::t-alt p ::use-alt-in-pointed
|
1354 |
+
::s שׁ ::t sh
|
1355 |
+
::s שָׁ ::t sha
|
1356 |
+
::s שָּׁ ::t sha ::comment ?
|
1357 |
+
::s שְׁ ::t she ::t-alt sh ::use-alt-in-pointed
|
1358 |
+
::s שֶׁ ::t she
|
1359 |
+
::s שִׁ ::t shi
|
1360 |
+
::s שֻׁ ::t shu
|
1361 |
+
::s שׂ ::t s
|
1362 |
+
::s שָׂ ::t sa
|
1363 |
+
::s שְׂ ::t s ::t-alt se ::use-alt-in-pointed
|
1364 |
+
::s כּ ::t k
|
1365 |
+
::s כֶּ ::t ke
|
1366 |
+
::s כֹּ ::t ko
|
1367 |
+
::s בּ ::t b
|
1368 |
+
::s בַּ ::t ba
|
1369 |
+
::s בָּ ::t ba
|
1370 |
+
::s בְּ ::t be ::t-alt b ::use-alt-in-pointed
|
1371 |
+
::s בֶּ ::t be
|
1372 |
+
::s תּ ::t t
|
1373 |
+
::s תַּ ::t ta
|
1374 |
+
::s תֵּ ::t te
|
1375 |
+
::s תִּ ::t ti
|
1376 |
+
::s דָּ ::t da
|
1377 |
+
::s דְּ ::t de ::t-alt d ::use-alt-in-pointed
|
1378 |
+
::s גּ ::t g
|
1379 |
+
::s לֵּ ::t le
|
1380 |
+
::s ד׳ ::t dh
|
1381 |
+
::s ג׳ ::t j
|
1382 |
+
::s ת׳ ::t th
|
1383 |
+
::s ז׳ ::t zh
|
1384 |
+
::s חַ ::t ach ::comment furtive patah ::use-only-at-end-of-word
|
1385 |
+
::s עַ ::t a' ::comment furtive patah ::use-only-at-end-of-word
|
1386 |
+
::s הַּ ::t ah ::comment furtive patah ::use-only-at-end-of-word
|
1387 |
+
::s ַ ::t a ::comment Hebrew point patah
|
1388 |
+
::s ֲ ::t a ::comment Hebrew point hataf patah (hataf = reduced)
|
1389 |
+
::s ֳ ::t o ::comment Hebrew point hataf qamats
|
1390 |
+
::s ָ ::t a ::comment Hebrew point qamats ::t-alt o ::use-alt-in-pointed
|
1391 |
+
::s ֶ ::t e ::comment Hebrew point segol
|
1392 |
+
::s ֱ ::t e ::comment Hebrew point hataf segol (hataf = reduced)
|
1393 |
+
::s ְ ::t e ::comment Hebrew point sheva ::t-alt "" ::use-alt-in-pointed
|
1394 |
+
::s ֵ ::t e ::comment Hebrew point tsere
|
1395 |
+
::s ִ ::t i ::comment Hebrew point hiriq
|
1396 |
+
::s ֹ ::t o ::comment Hebrew point holam
|
1397 |
+
::s ֻ ::t u ::comment Hebrew point qubuts
|
1398 |
+
# ::s ּ ::t "" ::comment Hebrew point dagesh or mapiq
|
1399 |
+
|
1400 |
+
# Yiddish
|
1401 |
+
::s א ::t a ::lcode yid ::comment called "silent" alef
|
1402 |
+
::s אי ::t y ::lcode yid
|
1403 |
+
::s איי ::t ey ::lcode yid
|
1404 |
+
::s או ::t u ::lcode yid
|
1405 |
+
::s אוי ::t oy ::lcode yid
|
1406 |
+
::s אַ ::t a ::lcode yid
|
1407 |
+
::s אָ ::t o ::lcode yid
|
1408 |
+
::s ב ::t b ::lcode yid
|
1409 |
+
::s בֿ ::t v ::lcode yid
|
1410 |
+
::s דזש ::t dzh ::lcode yid
|
1411 |
+
::s ו ::t u ::lcode yid
|
1412 |
+
::s וּ ::t u ::lcode yid
|
1413 |
+
::s וֹ ::t o ::lcode yid
|
1414 |
+
::s װ ::t v ::lcode yid
|
1415 |
+
::s ווא ::t wa ::lcode yid
|
1416 |
+
::s וואַ ::t wa ::lcode yid
|
1417 |
+
::s ווע ::t we ::lcode yid
|
1418 |
+
::s ווי ::t wi ::lcode yid
|
1419 |
+
::s וואוי ::t wo ::lcode yid
|
1420 |
+
::s וי ::t oy ::lcode yid
|
1421 |
+
::s זש ::t zh ::lcode yid
|
1422 |
+
::s ח ::t ch ::lcode yid
|
1423 |
+
::s טש ::t tsh ::lcode yid
|
1424 |
+
::s יִ ::t i ::lcode yid
|
1425 |
+
::s יי ::t ey ::lcode yid ::comment maybe "yi" at beginning of word
|
1426 |
+
::s ײַ ::t ay ::lcode yid
|
1427 |
+
::s כּ ::t k ::lcode yid
|
1428 |
+
::s כ ::t ch ::lcode yid
|
1429 |
+
::s ך ::t ch ::lcode yid
|
1430 |
+
::s ע ::t e ::lcode yid
|
1431 |
+
::s פּ ::t p ::lcode yid
|
1432 |
+
::s פֿ ::t f ::lcode yid
|
1433 |
+
::s ף ::t f ::lcode yid ::comment sometimes p
|
1434 |
+
::s ק ::t k ::lcode yid
|
1435 |
+
::s ת ::t s ::lcode yid
|
1436 |
+
|
1437 |
+
# Syriac/Aramaic (should be vetted by expert)
|
1438 |
+
::s ܰ ::t a ::comment Syriac pthaha above
|
1439 |
+
::s ܲ ::t a ::comment Syriac pthaha dotted
|
1440 |
+
::s ܳ ::t aa ::comment Syriac zqapha above
|
1441 |
+
::s ܴ ::t aa ::comment Syriac zqapha below
|
1442 |
+
::s ܵ ::t aa ::comment Syriac zqapha dotted
|
1443 |
+
::s ܶ ::t e ::comment Syriac rbasa above
|
1444 |
+
::s ܷ ::t e ::comment Syriac rbasa below
|
1445 |
+
::s ܿ ::t o ::comment Syriac rwaha
|
1446 |
+
::s ܸ ::t e ::comment Syriac dotted zlama horizontal
|
1447 |
+
::s ܹ ::t e ::comment Syriac dotted zlama angular
|
1448 |
+
::s ܺ ::t i ::comment Syriac hbasa above
|
1449 |
+
::s ܝܺ ::t i ::comment Syriac yudh + hbasa above
|
1450 |
+
::s ܼ ::t u ::comment Syriac hbasa-esasa dotted
|
1451 |
+
::s ܽ ::t o ::comment Syriac esasa above
|
1452 |
+
::s ܾ ::t u ::comment Syriac esasa below
|
1453 |
+
::s ݇ ::t "" ::comment Syriac oblique line above; indication of a silent letter
|
1454 |
+
|
1455 |
+
::s ܖ ::t d ::comment Syriac letter dotless dalath rish; ambiguous form for undifferentiated early dalath/rish
|
1456 |
+
::s ܜ ::t t ::comment Syriac letter teth garshuni; used in Garshuni documents
|
1457 |
+
::s ܒ݂ ::t v ::comment Syriac beth + rukkakha
|
1458 |
+
::s ܒ̥ ::t v ::comment Syriac beth + ring-below
|
1459 |
+
::s ܓ݂ ::t g ::comment Syriac gammal + rukkakha [IPA: ɣ]
|
1460 |
+
::s ܓ̥ ::t g ::comment Syriac gammal + ring-below [IPA: ɣ]
|
1461 |
+
::s ܕ݂ ::t d ::comment Syriac dalath + rukkakha [IPA: ð]
|
1462 |
+
::s ܕ̥ ::t d ::comment Syriac dalath + ring-below [IPA: ð]
|
1463 |
+
::s ܟ݂ ::t kh ::comment Syriac kaph + rukkakha [IPA: x]
|
1464 |
+
::s ܟ̥ ::t kh ::comment Syriac kaph + ring-below [IPA: x]
|
1465 |
+
::s ܦ݂ ::t f ::comment Syriac pe + rukkakha
|
1466 |
+
::s ܦ̥ ::t f ::comment Syriac pe + ring-below
|
1467 |
+
::s ܦ݁ ::t p ::comment Syriac pe + qushshaya
|
1468 |
+
::s ܬ݂ ::t th ::comment Syriac taw + rukkakha [IPA: θ]
|
1469 |
+
::s ܬ̥ ::t th ::comment Syriac taw + ring-below [IPA: θ]
|
1470 |
+
|
1471 |
+
::s ܄ ::t : ::comment Syriac sublinear colon; used at the end of verses of supplications
|
1472 |
+
::s ܆ ::t , ::comment Syriac colon skewed left; marks a dependent clause
|
1473 |
+
::s ܇ ::t , ::comment Syriac colon skewed right; marks the end of a subdivision of the apodosis, or latter part of a Biblical verse
|
1474 |
+
|
1475 |
+
# Uzbek
|
1476 |
+
::s ʻ ::t ' ::comment modifies pronunciation of preceding "o" and "g"
|
1477 |
+
::s ʼ ::t ' ::comment glottal stop (tutuq belgisi)
|
1478 |
+
|
1479 |
+
# Uyghur
|
1480 |
+
::s ئا ::t a ::lcode uig
|
1481 |
+
::s ە ::t e ::lcode uig
|
1482 |
+
::s ئې ::t e ::lcode uig ::latinplus ë
|
1483 |
+
::s ې ::t e ::lcode uig ::latinplus ë
|
1484 |
+
::s ئە ::t e ::lcode uig
|
1485 |
+
::s يە ::t e ::lcode uig
|
1486 |
+
::s ئى ::t i ::lcode uig
|
1487 |
+
::s ى ::t i ::lcode uig
|
1488 |
+
::s ئو ::t o ::lcode uig
|
1489 |
+
::s و ::t o ::lcode uig
|
1490 |
+
::s ئۇ ::t u ::lcode uig
|
1491 |
+
::s ۇ ::t u ::lcode uig
|
1492 |
+
::s چ ::t ch ::t-alt q ::lcode uig
|
1493 |
+
::s خ ::t x ::lcode uig
|
1494 |
+
::s ژ ::t zh ::lcode uig
|
1495 |
+
::s ئۆ ::t oe ::t-alt o ::lcode uig ::latinplus ö
|
1496 |
+
::s ۆ ::t oe ::t-alt o ::lcode uig ::latinplus ö
|
1497 |
+
::s ئۈ ::t ue ::t-alt u ::lcode uig ::latinplus ü
|
1498 |
+
::s ۈ ::t ue ::t-alt u ::lcode uig ::latinplus ü
|
1499 |
+
::s ۋ ::t w ::lcode uig
|
1500 |
+
|
1501 |
+
# Maldivian
|
1502 |
+
::s ް ::t ::comment thaana sukun
|
1503 |
+
::s ަ ::t a ::comment thaana abafili
|
1504 |
+
::s ާ ::t aa ::comment thaana aabaafili
|
1505 |
+
::s ި ::t i ::comment thaana ibifili
|
1506 |
+
::s ީ ::t ee ::comment thaana eebeefili
|
1507 |
+
::s ު ::t u ::comment thaana ubufili
|
1508 |
+
::s ޫ ::t oo ::comment thaana ooboofili
|
1509 |
+
::s ެ ::t e ::comment thaana ebefili
|
1510 |
+
::s ޭ ::t ey ::comment thaana eybeyfili
|
1511 |
+
::s ޮ ::t o ::comment thaana obofili
|
1512 |
+
::s ޯ ::t oa ::comment thaana oaboafili
|
1513 |
+
|
1514 |
+
# Canadian syllabics (Inuktitut)
|
1515 |
+
::s ᑊ ::t p ::comment syllable final
|
1516 |
+
::s ᐟ ::t t ::comment syllable final
|
1517 |
+
::s ᐠ ::t k ::comment syllable final
|
1518 |
+
::s ᐨ ::t c ::comment syllable final
|
1519 |
+
::s ᒼ ::t m ::comment syllable final
|
1520 |
+
::s ᐣ ::t n ::comment syllable final
|
1521 |
+
::s ᐢ ::t s ::comment syllable final
|
1522 |
+
::s ᐧ ::t y ::comment syllable final
|
1523 |
+
::s ᐤ ::t w ::comment syllable final
|
1524 |
+
::s ᐦ ::t h ::comment syllable final
|
1525 |
+
::s ᕽ ::t hk ::comment syllable final
|
1526 |
+
::s ᓫ ::t l ::comment syllable final
|
1527 |
+
::s ᕑ ::t r ::comment syllable final
|
1528 |
+
|
1529 |
+
# Mongolian
|
1530 |
+
::s ᢅ ::t ::comment MONGOLIAN LETTER ALI GALI BALUDA (CHECK) indicates assimilation
|
1531 |
+
::s ᢆ ::t ::comment MONGOLIAN LETTER ALI GALI THREE BALUDA (CHECK) indicates assimilation
|
1532 |
+
|
1533 |
+
# Limbu
|
1534 |
+
::s ॽ ::t ' ::comment glottal stop (U+097D)
|
1535 |
+
|
1536 |
+
## Punctuation
|
1537 |
+
# delete
|
1538 |
+
::s ¿ ::t "" ::comment inverted question mark
|
1539 |
+
::s ¡ ::t "" ::comment inverted exclamation mark
|
1540 |
+
# decompose double-punctuation
|
1541 |
+
::s ‼ ::t !!
|
1542 |
+
::s ⁇ ::t ??
|
1543 |
+
::s ⁉ ::t !?
|
1544 |
+
::s ⁈ ::t ?!
|
1545 |
+
# preserve
|
1546 |
+
::s ′ ::t ′
|
1547 |
+
::s ∩ ::t ∩
|
1548 |
+
::s ‡ ::t ‡
|
1549 |
+
# Cyrillic
|
1550 |
+
::s ⁙ ::t . ::comment five dot punctuation
|
1551 |
+
# Amharic/Ethiopian
|
1552 |
+
::s ። ::t .
|
1553 |
+
::s ፣ ::t ,
|
1554 |
+
::s ፤ ::t ;
|
1555 |
+
::s ፥ ::t :
|
1556 |
+
::s ፧ ::t ? ::comment Ethiopic question mark
|
1557 |
+
::s ፡ ::t " " ::comment Ethiopic wordspace
|
1558 |
+
::s ፦ ::t : ::comment Ethiopic preface colon
|
1559 |
+
# Ethiopic wordspace often appropriated for other purposes:
|
1560 |
+
::s ፡፡ ::t .
|
1561 |
+
::s ፡- ::t :
|
1562 |
+
::s "፡ " ::t ", "
|
1563 |
+
::s ቸ ::t cha ::comment Ethiopic syllable ca
|
1564 |
+
::s ቹ ::t chu ::comment Ethiopic syllable cu
|
1565 |
+
::s ቺ ::t chi ::comment Ethiopic syllable ci
|
1566 |
+
::s ቻ ::t chaa ::comment Ethiopic syllable caa
|
1567 |
+
::s ቼ ::t chee ::comment Ethiopic syllable cee
|
1568 |
+
::s ች ::t che ::comment Ethiopic syllable ce
|
1569 |
+
::s ቾ ::t cho ::comment Ethiopic syllable co
|
1570 |
+
::s ሠ ::t sa ::comment Ethiopic syllable sza
|
1571 |
+
::s ሡ ::t su ::comment Ethiopic syllable szu
|
1572 |
+
::s ሢ ::t si ::comment Ethiopic syllable szi
|
1573 |
+
::s ሣ ::t saa ::comment Ethiopic syllable szaa
|
1574 |
+
::s ሤ ::t see ::comment Ethiopic syllable szee
|
1575 |
+
::s ሥ ::t se ::comment Ethiopic syllable sze
|
1576 |
+
::s ሦ ::t so ::comment Ethiopic syllable szo
|
1577 |
+
::s ጠ ::t ta ::comment Ethiopic syllable tha with ejective 't'
|
1578 |
+
::s ጡ ::t tu ::comment Ethiopic syllable thu with ejective 't'
|
1579 |
+
::s ጢ ::t ti ::comment Ethiopic syllable thi with ejective 't'
|
1580 |
+
::s ጣ ::t taa ::comment Ethiopic syllable thaa with ejective 't'
|
1581 |
+
::s ጤ ::t tee ::comment Ethiopic syllable thee with ejective 't'
|
1582 |
+
::s ጥ ::t te ::comment Ethiopic syllable the with ejective 't'
|
1583 |
+
::s ጦ ::t to ::comment Ethiopic syllable tho with ejective 't'
|
1584 |
+
::s ፻ ::num 100 ::is-large-power
|
1585 |
+
::s ፼ ::num 10000 ::is-large-power
|
1586 |
+
|
1587 |
+
# Devanagari (Hindi etc.)
|
1588 |
+
::s ॺ ::t y ::comment DEVANAGARI LETTER HEAVY YA
|
1589 |
+
::s । ::t . ::comment danda
|
1590 |
+
::s ॥ ::t . ::comment double danda
|
1591 |
+
::s ৷ ::t . ::comment Bengali currency numerator four; used as danda
|
1592 |
+
::s ॰ ::t . ::comment Devanagari abbreviation sign
|
1593 |
+
# Bengali
|
1594 |
+
::s ৽ ::t . ::comment BENGALI ABBREVIATION SIGN
|
1595 |
+
::s ৾ ::t ::comment BENGALI SANDHI MARK (CHECK)
|
1596 |
+
# Gurmukhi
|
1597 |
+
::s ੱ ::t ' ::comment GURMUKHI ADDAK U+0A71, which normally doubles following consonant; otherwise marked by '
|
1598 |
+
::s ੶ ::t . ::comment GURMUKHI ABBREVIATION SIGN
|
1599 |
+
# Oriya/Odia (India)
|
1600 |
+
::s ::t . ::comment danda (deprecated, should use Devanagari danda ।)
|
1601 |
+
::s ::t . ::comment double danda (deprecated, should use Devanagari double danda ॥)
|
1602 |
+
# Tibetan
|
1603 |
+
::s ྅ ::t ::comment TIBETAN MARK PALUTA (CHECK) indicates assimilation
|
1604 |
+
::s འ ::t ' ::comment Tibetan letter -a (U+0F60)
|
1605 |
+
::s ྰ ::t ' ::comment Tibetan letter -a (U+0FB0)
|
1606 |
+
::s ཨཱ ::t aa ::comment TIBETAN LETTER A + TIBETAN VOWEL SIGN AA
|
1607 |
+
::s ཨེ ::t e ::comment TIBETAN LETTER A + TIBETAN VOWEL SIGN E
|
1608 |
+
::s ཨི ::t i ::comment TIBETAN LETTER A + TIBETAN VOWEL SIGN I
|
1609 |
+
::s ཨོ ::t o ::comment TIBETAN LETTER A + TIBETAN VOWEL SIGN O
|
1610 |
+
::s ཨུ ::t u ::comment TIBETAN LETTER A + TIBETAN VOWEL SIGN U
|
1611 |
+
::s ཪ ::t r ::comment TIBETAN LETTER FIXED-FORM RA (U+0F6A)
|
1612 |
+
::s ། ::t ,
|
1613 |
+
::s །: ::t :
|
1614 |
+
::s ༏ ::t ;
|
1615 |
+
::s ༎ ::t .
|
1616 |
+
::s ༑ ::t , ::comment Tibetan mark run chen spungs shad
|
1617 |
+
::s ༼ ::t ( ::comment Tibetan open roof punctuation
|
1618 |
+
::s ༽ ::t ) ::comment Tibetan close roof punctuation
|
1619 |
+
::s ༈ ::t "" ::comment Tibetan mark srbul shad
|
1620 |
+
::s 【 ::t [ ::comment left black lenticular bracket
|
1621 |
+
::s 】 ::t ] ::comment right black lenticular bracket
|
1622 |
+
::s ༄ ::t "" ::comment Tibetan head mark
|
1623 |
+
::s ༄༅ ::t "" ::comment Tibetan head mark
|
1624 |
+
::s ༆ ::t "" ::comment Tibetan head mark
|
1625 |
+
# Myanmar/Burmese
|
1626 |
+
::s ၊ ::t ,
|
1627 |
+
::s ။ ::t .
|
1628 |
+
# Khmer
|
1629 |
+
::s ៖ ::t ; ::comment Khmer sign camnuc pii kuuh
|
1630 |
+
::s ។ ::t . ::comment Khmer sign khan
|
1631 |
+
# Arabic
|
1632 |
+
::s ، ::t ,
|
1633 |
+
::s ؛ ::t ;
|
1634 |
+
::s ٬ ::t ,
|
1635 |
+
::s ۔ ::t .
|
1636 |
+
::s ؟ ::t ?
|
1637 |
+
::s ٪ ::t %
|
1638 |
+
::s ٫ ::t , ::comment Arabic decimal separator
|
1639 |
+
::s ۽ ::t & ::comment Arabic sign Sindhi ampersand
|
1640 |
+
# Aramaic
|
1641 |
+
::s ܀ ::t .
|
1642 |
+
::s ܂ ::t .
|
1643 |
+
# Hebrew
|
1644 |
+
::s ־ ::t - ::comment maqaf
|
1645 |
+
# Armenian
|
1646 |
+
::s ։ ::t .
|
1647 |
+
::s ՝ ::t , ::comment Armenian comma
|
1648 |
+
# Chinese
|
1649 |
+
::s , ::t ", "
|
1650 |
+
::s 、 ::t ", "
|
1651 |
+
::s 。 ::t ". "
|
1652 |
+
::s ! ::t "! "
|
1653 |
+
::s ? ::t "? "
|
1654 |
+
::s 「 ::t ' "'
|
1655 |
+
::s 」 ::t '" '
|
1656 |
+
::s 《 ::t ' "'
|
1657 |
+
::s 》 ::t '" '
|
1658 |
+
::s ( ::t " ("
|
1659 |
+
::s ) ::t ") "
|
1660 |
+
::s ; ::t ;
|
1661 |
+
::s : ::t ": "
|
1662 |
+
::s ︰ ::t ": "
|
1663 |
+
::s - ::t -
|
1664 |
+
::s / ::t /
|
1665 |
+
::s = ::t =
|
1666 |
+
::s ~ ::t ~
|
1667 |
+
::s & ::t &
|
1668 |
+
::s < ::t <
|
1669 |
+
::s > ::t >
|
1670 |
+
::s % ::t %
|
1671 |
+
::s _ ::t _ ::comment FULLWIDTH LOW LINE (U+FF3F)
|
1672 |
+
::s { ::t { ::comment FULLWIDTH LEFT CURLY BRACKET (U+FF5B)
|
1673 |
+
::s } ::t } ::comment FULLWIDTH RIGHT CURLY BRACKET (U+FF5D)
|
1674 |
+
::s ::t " " ::comment ideographic space
|
1675 |
+
# Japanese
|
1676 |
+
::s 『 ::t ' "'
|
1677 |
+
::s 』 ::t '" '
|
1678 |
+
::s ・ ::t " " ::comment Katakana middle dot; separates name elements such as first and last name
|
1679 |
+
# N'ko
|
1680 |
+
::s ߽ ::t . ::comment NKO DANTAYALAN used to abbreviate units of measure
|
1681 |
+
# Medefaidrin
|
1682 |
+
::s 𖺗 ::t , ::comment MEDEFAIDRIN COMMA
|
1683 |
+
::s 𖺘 ::t . ::comment MEDEFAIDRIN FULL STOP
|
1684 |
+
# Khitan
|
1685 |
+
::s 𖿤 ::t ::comment KHITAN SMALL SCRIPT FILLER
|
1686 |
+
|
1687 |
+
# Symbols
|
1688 |
+
::s ∞ ::t ∞ ::comment infinity
|
1689 |
+
::s ::t ::comment soft hyphen; used to indicate preferred line breaks; remove
|
1690 |
+
::s ֊ ::t - ::comment Armenian hyphen; map to regular hyphen-minus
|
1691 |
+
::s ᐩ ::t + ::comment Canadian syllabics final plus; map to regular plus
|
1692 |
+
::s ﹐ ::t , ::comment small comma; map to regular comma
|
1693 |
+
::s ˚ ::t ° ::comment ring above; map to degree sign
|
1694 |
+
::s ⇒ ::t ⇒ ::comment rightwards double arrow
|
1695 |
+
::s † ::t † ::comment dagger
|
1696 |
+
::s • ::t • ::comment bullet
|
1697 |
+
::s ℃ ::t °C ::comment degree Celsius; split into 2 characters
|
1698 |
+
::s ℉ ::t °F ::comment degree Fahrenheit; split into 2 characters
|
1699 |
+
::s ― ::t ― ::comment horizontal bar
|
1700 |
+
::s ˇ ::t ˇ ::comment caron (sometimes apparently used for "Arabic vowel sign small v above" U+065A, e.g. in Gilaki language (glk))
|
1701 |
+
::s ″ ::t ″ ::comment double prime
|
1702 |
+
::s ﴾ ::t ( ::comment ornate left parenthesis
|
1703 |
+
::s ﴿ ::t ) ::comment ornate right parenthesis
|
1704 |
+
::s 〔 ::t [ ::comment left tortoise shell bracket
|
1705 |
+
::s 〕 ::t ] ::comment right tortoise shell bracket
|
1706 |
+
::s ﹝ ::t ( ::comment small left tortoise shell bracket
|
1707 |
+
::s ﹞ ::t ) ::comment small right tortoise shell bracket
|
1708 |
+
::s ¦ ::t ¦ ::comment BROKEN BAR (U+00A6)
|
1709 |
+
::s ¨ ::t ::comment DIAERESIS (U+00A8)
|
1710 |
+
::s ¯ ::t ::comment MACRON (U+00AF)
|
1711 |
+
::s ¸ ::t ::comment CEDILLA (U+00B8)
|
1712 |
+
::s Ƿ ::t W ::comment LATIN CAPITAL LETTER WYNN (U+01F7)
|
1713 |
+
::s ˘ ::t ::comment BREVE (U+02D8)
|
1714 |
+
::s ˛ ::t ::comment OGONEK (U+02DB)
|
1715 |
+
::s ˜ ::t ~ ::comment SMALL TILDE (U+02DC)
|
1716 |
+
::s ̒ ::t ::comment COMBINING TURNED COMMA ABOVE (U+0312)
|
1717 |
+
::s ̔ ::t ::comment COMBINING REVERSED COMMA ABOVE (U+0314)
|
1718 |
+
::s ̜ ::t ::comment COMBINING LEFT HALF RING BELOW (U+031C)
|
1719 |
+
::s ̧ ::t ::comment COMBINING CEDILLA (U+0327)
|
1720 |
+
::s ̫ ::t ::comment COMBINING INVERTED DOUBLE ARCH BELOW (U+032B)
|
1721 |
+
::s ̲ ::t ::comment COMBINING LOW LINE (U+0332)
|
1722 |
+
::s ̳ ::t ::comment COMBINING DOUBLE LOW LINE (U+0333)
|
1723 |
+
::s ̹ ::t ::comment COMBINING RIGHT HALF RING BELOW (U+0339)
|
1724 |
+
::s ̺ ::t ::comment COMBINING INVERTED BRIDGE BELOW (U+033A)
|
1725 |
+
::s ̿ ::t ::comment COMBINING DOUBLE OVERLINE (U+033F)
|
1726 |
+
::s ͅ ::t ::comment COMBINING GREEK YPOGEGRAMMENI (U+0345)
|
1727 |
+
::s ͑ ::t ::comment COMBINING LEFT HALF RING ABOVE (U+0351)
|
1728 |
+
::s ͗ ::t ::comment COMBINING RIGHT HALF RING ABOVE (U+0357)
|
1729 |
+
::s ͚ ::t ::comment COMBINING DOUBLE RING BELOW (U+035A)
|
1730 |
+
::s ͜ ::t ::comment COMBINING DOUBLE BREVE BELOW (U+035C)
|
1731 |
+
::s ͝ ::t ::comment COMBINING DOUBLE BREVE (U+035D)
|
1732 |
+
::s ͞ ::t ::comment COMBINING DOUBLE MACRON (U+035E)
|
1733 |
+
::s ͟ ::t ::comment COMBINING DOUBLE MACRON BELOW (U+035F)
|
1734 |
+
::s ͠ ::t ::comment COMBINING DOUBLE TILDE (U+0360)
|
1735 |
+
|
1736 |
+
::s ‐ ::t - ::comment HYPHEN (U+2010)
|
1737 |
+
::s ‗ ::t ‗ ::comment DOUBLE LOW LINE (U+2017)
|
1738 |
+
::s ‵ ::t ‵ ::comment REVERSED PRIME (U+2035)
|
1739 |
+
::s ‶ ::t ‶ ::comment REVERSED DOUBLE PRIME (U+2036)
|
1740 |
+
::s ‸ ::t ‸ ::comment CARET (U+2038)
|
1741 |
+
::s ‽ ::t ?! ::comment INTERROBANG (U+203D)
|
1742 |
+
::s ‾ ::t ‾ ::comment OVERLINE (U+203E)
|
1743 |
+
::s ‿ ::t ‿ ::comment UNDERTIE (U+203F)
|
1744 |
+
::s ⁂ ::t ⁂ ::comment ASTERISM (U+2042)
|
1745 |
+
::s ⁎ ::t * ::comment LOW ASTERISK (U+204E)
|
1746 |
+
::s ⁏ ::t ; ::comment REVERSED SEMICOLON (U+204F)
|
1747 |
+
::s ⁔ ::t ⁔ ::comment INVERTED UNDERTIE (U+2054)
|
1748 |
+
::s ⁝ ::t ⁝ ::comment TRICOLON (U+205D)
|
1749 |
+
::s ::t " " ::comment MEDIUM MATHEMATICAL SPACE (U+205F)
|
1750 |
+
::s ₋ ::t - ::comment SUBSCRIPT MINUS (U+208B)
|
1751 |
+
::s ⃩ ::t ::comment COMBINING WIDE BRIDGE ABOVE (U+20E9)
|
1752 |
+
|
1753 |
+
::s ﹔ ::t ; ::comment SMALL SEMICOLON (U+FE54)
|
1754 |
+
::s ﹕ ::t : ::comment SMALL COLON (U+FE55)
|
1755 |
+
::s ﹛ ::t { ::comment SMALL LEFT CURLY BRACKET (U+FE5B)
|
1756 |
+
::s ﹜ ::t } ::comment SMALL RIGHT CURLY BRACKET (U+FE5C)
|
1757 |
+
::s ﹠ ::t & ::comment SMALL AMPERSAND (U+FE60)
|
1758 |
+
::s ﹡ ::t * ::comment SMALL ASTERISK (U+FE61)
|
1759 |
+
::s ﹣ ::t - ::comment SMALL HYPHEN-MINUS (U+FE63)
|
1760 |
+
|
1761 |
+
::s ℈ ::t ℈ ::comment SCRUPLE (U+2108)
|
1762 |
+
::s ℟ ::t ℟ ::comment RESPONSE (U+211F)
|
1763 |
+
::s ℣ ::t ℣ ::comment VERSICLE (U+2123)
|
1764 |
+
::s ℽ ::t ℽ ::comment DOUBLE-STRUCK SMALL GAMMA (U+213D)
|
1765 |
+
::s ℾ ::t ℾ ::comment DOUBLE-STRUCK CAPITAL GAMMA (U+213E)
|
1766 |
+
::s ⅋ ::t ⅋ ::comment TURNED AMPERSAND (U+214B)
|
1767 |
+
::s ⅍ ::t A/S ::comment AKTIESELSKAB (U+214D)
|
1768 |
+
|
1769 |
+
::s ⑃ ::t ⑃ ::comment OCR INVERTED FORK (U+2443)
|
1770 |
+
::s ⑊ ::t \\ ::comment OCR DOUBLE BACKSLASH (U+244A)
|
1771 |
+
::s ⟮ ::t ( ::comment MATHEMATICAL LEFT FLATTENED PARENTHESIS (U+27EE)
|
1772 |
+
::s ⟯ ::t ) ::comment MATHEMATICAL RIGHT FLATTENED PARENTHESIS (U+27EF)
|
1773 |
+
::s ⸨ ::t (( ::comment LEFT DOUBLE PARENTHESIS (U+2E28)
|
1774 |
+
::s ⸩ ::t )) ::comment RIGHT DOUBLE PARENTHESIS (U+2E29)
|
1775 |
+
|
1776 |
+
::s Ω ::t Ω ::comment OHM SIGN (NEW)
|
1777 |
+
|
1778 |
+
# kavyka indicates alternative reading
|
1779 |
+
::s ᷶ ::t ::comment COMBINING KAVYKA ABOVE RIGHT (U+1DF6)
|
1780 |
+
::s ᷷ ::t ::comment COMBINING KAVYKA ABOVE LEFT (U+1DF7)
|
1781 |
+
::s ⹅ ::t ::comment INVERTED LOW KAVYKA (U+2E45)
|
1782 |
+
::s ⹆ ::t ::comment INVERTED LOW KAVYKA WITH KAVYKA ABOVE (U+2E46)
|
1783 |
+
::s ⹇ ::t ::comment LOW KAVYKA (U+2E47)
|
1784 |
+
::s ⹈ ::t ::comment LOW KAVYKA WITH DOT (U+2E48)
|
1785 |
+
::s ꙾ ::t ::comment CYRILLIC KAVYKA (U+A67E)
|
1786 |
+
|
1787 |
+
# Braille
|
1788 |
+
::s ⠁ ::t a
|
1789 |
+
::s ⠃ ::t b
|
1790 |
+
::s ⠉ ::t c
|
1791 |
+
::s ⠙ ::t d
|
1792 |
+
::s ⠑ ::t e
|
1793 |
+
::s ⠋ ::t f
|
1794 |
+
::s ⠛ ::t g
|
1795 |
+
::s ⠓ ::t h
|
1796 |
+
::s ⠊ ::t i
|
1797 |
+
::s ⠚ ::t j
|
1798 |
+
::s ⠅ ::t k
|
1799 |
+
::s ⠇ ::t l
|
1800 |
+
::s ⠍ ::t m
|
1801 |
+
::s ⠝ ::t n
|
1802 |
+
::s ⠕ ::t o
|
1803 |
+
::s ⠏ ::t p
|
1804 |
+
::s ⠟ ::t q
|
1805 |
+
::s ⠗ ::t r
|
1806 |
+
::s ⠎ ::t s
|
1807 |
+
::s ⠞ ::t t
|
1808 |
+
::s ⠥ ::t u
|
1809 |
+
::s ⠧ ::t v
|
1810 |
+
::s ⠺ ::t w
|
1811 |
+
::s ⠭ ::t x
|
1812 |
+
::s ⠽ ::t y
|
1813 |
+
::s ⠵ ::t z
|
1814 |
+
|
1815 |
+
::s ⠜ ::t ae
|
1816 |
+
::s ⠪ ::t oe
|
1817 |
+
::s ⠳ ::t ue
|
1818 |
+
::s ⠷ ::t a ::comment à
|
1819 |
+
::s ⠡ ::t a ::comment â
|
1820 |
+
::s ⠿ ::t e ::comment é
|
1821 |
+
::s ⠮ ::t e ::comment è
|
1822 |
+
::s ⠣ ::t e ::comment ê
|
1823 |
+
::s ⠫ ::t e ::comment ë
|
1824 |
+
::s ⠩ ::t i ::comment î
|
1825 |
+
::s ⠻ ::t i ::comment ï
|
1826 |
+
::s ⠹ ::t o ::comment ô
|
1827 |
+
::s ⠾ ::t u ::comment ù
|
1828 |
+
::s ⠱ ::t u ::comment û
|
1829 |
+
|
1830 |
+
::s ⠡ ::t au ::lcode deu
|
1831 |
+
::s ⠌ ::t aeu ::lcode deu
|
1832 |
+
::s ⠹ ::t ch ::lcode deu
|
1833 |
+
::s ⠩ ::t ei ::lcode deu
|
1834 |
+
::s ⠣ ::t eu ::lcode deu
|
1835 |
+
::s ⠬ ::t ie ::lcode deu
|
1836 |
+
::s ⠱ ::t sch ::lcode deu
|
1837 |
+
::s ⠮ ::t ss ::lcode deu
|
1838 |
+
::s ⠾ ::t st ::lcode deu
|
1839 |
+
|
1840 |
+
::s ⠠⠠ ::t "" ::comment start of word all-caps mode
|
1841 |
+
# ::s ⠠⠁ ::t A
|
1842 |
+
# ::s ⠠⠃ ::t B
|
1843 |
+
# ::s ⠠⠉ ::t C
|
1844 |
+
# ::s ⠠⠙ ::t D
|
1845 |
+
# ::s ⠠⠑ ::t E
|
1846 |
+
# ::s ⠠⠋ ::t F
|
1847 |
+
# ::s ⠠⠛ ::t G
|
1848 |
+
# ::s ⠠⠓ ::t H
|
1849 |
+
# ::s ⠠⠊ ::t I
|
1850 |
+
# ::s ⠠⠚ ::t J
|
1851 |
+
# ::s ⠠⠅ ::t K
|
1852 |
+
# ::s ⠠⠇ ::t L
|
1853 |
+
# ::s ⠠⠍ ::t M
|
1854 |
+
# ::s ⠠⠝ ::t N
|
1855 |
+
# ::s ⠠⠕ ::t O
|
1856 |
+
# ::s ⠠⠏ ::t P
|
1857 |
+
# ::s ⠠⠟ ::t Q
|
1858 |
+
# ::s ⠠⠗ ::t R
|
1859 |
+
# ::s ⠠⠎ ::t S
|
1860 |
+
# ::s ⠠⠞ ::t T
|
1861 |
+
# ::s ⠠⠥ ::t U
|
1862 |
+
# ::s ⠠⠧ ::t V
|
1863 |
+
# ::s ⠠⠺ ::t W
|
1864 |
+
# ::s ⠠⠭ ::t X
|
1865 |
+
# ::s ⠠⠽ ::t Y
|
1866 |
+
# ::s ⠠⠵ ::t Z
|
1867 |
+
|
1868 |
+
::s ⠼⠁ ::t 1
|
1869 |
+
::s ⠼⠃ ::t 2
|
1870 |
+
::s ⠼⠉ ::t 3
|
1871 |
+
::s ⠼⠙ ::t 4
|
1872 |
+
::s ⠼⠑ ::t 5
|
1873 |
+
::s ⠼⠋ ::t 6
|
1874 |
+
::s ⠼⠛ ::t 7
|
1875 |
+
::s ⠼⠓ ::t 8
|
1876 |
+
::s ⠼⠊ ::t 9
|
1877 |
+
::s ⠼⠚ ::t 0
|
1878 |
+
|
1879 |
+
::s ⠂ ::t ,
|
1880 |
+
::s ⠆ ::t ;
|
1881 |
+
::s ⠒ ::t :
|
1882 |
+
::s ⠲ ::t .
|
1883 |
+
::s ⠦ ::t ?
|
1884 |
+
::s ⠖ ::t !
|
1885 |
+
::s ⠄ ::t '
|
1886 |
+
::s ⠤ ::t -
|
1887 |
+
::s ⠨⠤ ::t _
|
1888 |
+
|
1889 |
+
::s ⠀ ::t " " ::comment blank
|
1890 |
+
# ::s ⠐ ::t " " ::comment blank in numeric mode
|
1891 |
+
::s ⠈ ::t "" ::comment accent
|
1892 |
+
# ::s ⠌ ::t / ::comment in numeric mode only
|
1893 |
+
# ::s ⠐ ::comment abbreviation sign
|
1894 |
+
# ::s ⠘ ::comment abbreviation sign
|
1895 |
+
# ::s ⠠ ::comment capital indicator
|
1896 |
+
::s ⠨ ::t . ::comment decimal point; emphasis
|
1897 |
+
::s ⠰ ::t "" ::comment letter indicator
|
1898 |
+
# ::s ⠴ ::t ”
|
1899 |
+
# ::s ⠶ ::t ()
|
1900 |
+
# ::s ⠸ ::comment abbreviation sign
|
1901 |
+
::s ⠼ ::t "" ::comment number indicator
|
1902 |
+
::s ⠘⠚ ::t ° ::word-external-punctuation
|
1903 |
+
::s ⠘⠚⠠⠉ ::t °C
|
1904 |
+
::s ⠘⠚⠉ ::t °C
|
1905 |
+
::s ⠘⠚⠠⠋ ::t °F
|
1906 |
+
::s ⠘⠚⠋ ::t °F
|
1907 |
+
|
1908 |
+
::s ⠠⠶ ::t " ::word-external-punctuation
|
1909 |
+
::s ⠘⠦ ::t “ ::word-external-punctuation
|
1910 |
+
::s ⠘⠴ ::t ” ::word-external-punctuation
|
1911 |
+
::s ⠄⠦ ::t ‘
|
1912 |
+
::s ⠄⠴ ::t ’
|
1913 |
+
::s ⠠⠴ ::t ’
|
1914 |
+
::s ⠐⠣ ::t ( ::word-external-punctuation
|
1915 |
+
::s ⠐⠜ ::t ) ::word-external-punctuation
|
1916 |
+
::s ⠨⠣ ::t [ ::word-external-punctuation
|
1917 |
+
::s ⠨⠜ ::t ] ::word-external-punctuation
|
1918 |
+
::s ⠸⠣ ::t { ::word-external-punctuation
|
1919 |
+
::s ⠸⠜ ::t } ::word-external-punctuation
|
1920 |
+
::s ⠈⠣ ::t < ::word-external-punctuation
|
1921 |
+
::s ⠈⠜ ::t > ::word-external-punctuation
|
1922 |
+
::s ⠸⠌ ::t / ::word-external-punctuation
|
1923 |
+
::s ⠸⠡ ::t \ ::word-external-punctuation
|
1924 |
+
::s ⠠⠤ ::t – ::word-external-punctuation
|
1925 |
+
::s ⠐⠠⠤ ::t — ::word-external-punctuation
|
1926 |
+
::s ⠈⠯ ::t & ::word-external-punctuation
|
1927 |
+
::s ⠐⠔ ::t * ::word-external-punctuation
|
1928 |
+
::s ⠨⠦ ::t ∩ ::word-external-punctuation
|
1929 |
+
::s ⠨⠴ ::t % ::word-external-punctuation
|
1930 |
+
::s ⠐⠖ ::t + ::word-external-punctuation
|
1931 |
+
::s ⠐⠤ ::t − ::word-external-punctuation
|
1932 |
+
::s ⠐⠶ ::t = ::word-external-punctuation
|
1933 |
+
::s ⠈⠎ ::t $ ::word-external-punctuation
|
1934 |
+
::s ⠈⠉ ::t ¢ ::word-external-punctuation
|
1935 |
+
::s ⠈⠇ ::t £ ::word-external-punctuation
|
1936 |
+
::s ⠈⠽ ::t ¥ ::word-external-punctuation
|
1937 |
+
::s ⠈⠁ ::t @ ::word-external-punctuation
|
1938 |
+
::s ⠸⠹ ::t # ::word-external-punctuation
|
1939 |
+
::s ⠸⠲ ::t • ::word-external-punctuation
|
1940 |
+
::s ⠈⠢ ::t ^ ::word-external-punctuation
|
1941 |
+
::s ⠈⠔ ::t ~ ::word-external-punctuation
|
1942 |
+
::s ⠘⠉ ::t © ::word-external-punctuation
|
1943 |
+
::s ⠐⠌ ::t ÷ ::word-external-punctuation
|
1944 |
+
::s ⠐⠦ ::t × ::word-external-punctuation
|
1945 |
+
::s ⠈⠠⠹ ::t † ::word-external-punctuation
|
1946 |
+
::s ⠈⠠⠻ ::t ‡ ::word-external-punctuation
|
1947 |
+
::s ⠘⠏ ::t ¶ ::word-external-punctuation
|
1948 |
+
::s ⠘⠎ ::t § ::word-external-punctuation
|
1949 |
+
::s ⠘⠗ ::t ® ::word-external-punctuation
|
1950 |
+
::s ⠘⠞ ::t ™ ::word-external-punctuation
|
1951 |
+
|
1952 |
+
# English Braille
|
1953 |
+
::s ⠁⠃ ::t about ::lcode eng ::use-only-for-whole-word
|
1954 |
+
::s ⠁⠃⠧ ::t above ::lcode eng ::use-only-for-whole-word
|
1955 |
+
::s ⠁⠉ ::t according ::lcode eng ::use-only-for-whole-word
|
1956 |
+
::s ⠁⠉⠗ ::t across ::lcode eng ::use-only-for-whole-word
|
1957 |
+
::s ⠁⠋ ::t after ::lcode eng ::use-only-for-whole-word
|
1958 |
+
::s ⠁⠋⠝ ::t afternoon ::lcode eng ::use-only-for-whole-word
|
1959 |
+
::s ⠁⠋⠺ ::t afterward ::lcode eng ::use-only-for-whole-word
|
1960 |
+
::s ⠁⠛ ::t again ::lcode eng ::use-only-for-whole-word
|
1961 |
+
::s ⠁⠛⠌ ::t against ::lcode eng ::use-only-for-whole-word
|
1962 |
+
::s ⠠⠽ ::t ally ::lcode eng ::use-only-at-end-of-word ::use-only-in-lower-case-environment
|
1963 |
+
::s ⠁⠇⠍ ::t almost ::lcode eng ::use-only-for-whole-word
|
1964 |
+
::s ⠁⠇⠗ ::t already ::lcode eng ::use-only-for-whole-word
|
1965 |
+
::s ⠁⠇ ::t also ::lcode eng ::use-only-for-whole-word
|
1966 |
+
::s ⠁⠇⠹ ::t although ::lcode eng ::use-only-for-whole-word
|
1967 |
+
::s ⠁⠇⠞ ::t altogether ::lcode eng ::use-only-for-whole-word
|
1968 |
+
::s ⠁⠇⠺ ::t always ::lcode eng ::use-only-for-whole-word
|
1969 |
+
::s ⠨⠑ ::t ance ::lcode eng
|
1970 |
+
::s ⠯ ::t and ::lcode eng
|
1971 |
+
::s ⠜ ::t ar ::lcode eng
|
1972 |
+
::s ⠵ ::t as ::lcode eng ::use-only-for-whole-word
|
1973 |
+
::s ⠠⠝ ::t ation ::lcode eng ::use-only-at-end-of-word ::use-only-in-lower-case-environment
|
1974 |
+
::s ⠃ ::t b ::lcode eng
|
1975 |
+
::s ⠆ ::t bb ::lcode eng ::dont-use-at-start-of-word ::dont-use-at-end-of-word
|
1976 |
+
::s ⠆ ::t be ::lcode eng ::use-only-at-start-of-word
|
1977 |
+
::s ⠆⠉ ::t because ::lcode eng ::use-only-for-whole-word
|
1978 |
+
::s ⠆⠋ ::t before ::lcode eng ::use-only-for-whole-word
|
1979 |
+
::s ⠆⠓ ::t behind ::lcode eng ::use-only-for-whole-word
|
1980 |
+
::s ⠆⠇ ::t below ::lcode eng ::use-only-for-whole-word
|
1981 |
+
::s ⠆⠝ ::t beneath ::lcode eng ::use-only-for-whole-word
|
1982 |
+
::s ⠆⠎ ::t beside ::lcode eng ::use-only-for-whole-word
|
1983 |
+
::s ⠆⠞ ::t between ::lcode eng ::use-only-for-whole-word
|
1984 |
+
::s ⠆⠽ ::t beyond ::lcode eng ::use-only-for-whole-word
|
1985 |
+
::s ⠃⠇ ::t blind ::lcode eng ::use-only-for-whole-word
|
1986 |
+
::s ⠃⠗⠇ ::t Braille ::lcode eng ::use-only-for-whole-word
|
1987 |
+
::s ⠃ ::t but ::lcode eng ::use-only-for-whole-word
|
1988 |
+
::s ⠉ ::t c ::lcode eng
|
1989 |
+
::s ⠉ ::t can ::lcode eng ::use-only-for-whole-word
|
1990 |
+
::s ⠸⠉ ::t cannot ::lcode eng
|
1991 |
+
::s ⠒ ::t cc ::lcode eng ::dont-use-at-start-of-word ::dont-use-at-end-of-word
|
1992 |
+
::s ⠉⠧ ::t ceive ::lcode eng ::use-only-at-end-of-word
|
1993 |
+
::s ⠉⠧⠙ ::t ceived ::lcode eng ::use-only-at-end-of-word
|
1994 |
+
::s ⠉⠧⠎ ::t ceives ::lcode eng ::use-only-at-end-of-word
|
1995 |
+
::s ⠉⠧⠛ ::t ceiving ::lcode eng
|
1996 |
+
::s ⠡ ::t ch ::lcode eng
|
1997 |
+
::s ⠐⠡ ::t character ::lcode eng
|
1998 |
+
::s ⠡ ::t child ::lcode eng ::use-only-for-whole-word
|
1999 |
+
::s ⠡⠝ ::t children ::lcode eng ::use-only-for-whole-word
|
2000 |
+
::s ⠒ ::t con ::lcode eng ::use-only-at-start-of-word
|
2001 |
+
::s ⠒ ::t : ::lcode eng ::use-only-at-end-of-word
|
2002 |
+
::s ⠉⠙ ::t could ::lcode eng ::use-only-for-whole-word
|
2003 |
+
::s ⠙ ::t d ::lcode eng
|
2004 |
+
::s ⠙ ::t do ::lcode eng ::use-only-for-whole-word
|
2005 |
+
::s ⠐⠙ ::t day ::lcode eng
|
2006 |
+
# ::s ⠲ ::t dd ::t-alt . ::lcode eng ::dont-use-at-start-of-word ::dont-use-at-end-of-word ::comment abolished; interferes with period in abbreviations such as U.S.
|
2007 |
+
::s ⠙⠉⠇ ::t declare ::lcode eng
|
2008 |
+
::s ⠙⠉⠇⠛ ::t declaring ::lcode eng
|
2009 |
+
::s ⠲ ::t dis ::lcode eng ::use-only-at-start-of-word
|
2010 |
+
::s ⠲ ::t . ::lcode eng ::dont-use-at-start-of-word
|
2011 |
+
::s ⠑ ::t e ::lcode eng
|
2012 |
+
::s ⠂ ::t ea ::lcode eng ::dont-use-at-end-of-word
|
2013 |
+
::s ⠂ ::t , ::lcode eng ::use-only-at-end-of-word
|
2014 |
+
::s ⠫ ::t ed ::lcode eng
|
2015 |
+
::s ⠑⠊ ::t either ::lcode eng ::use-only-for-whole-word
|
2016 |
+
::s ⠢ ::t en ::lcode eng
|
2017 |
+
::s ⠰⠑ ::t ence ::lcode eng ::dont-use-at-start-of-word
|
2018 |
+
::s ⠢ ::t enough ::lcode eng ::use-only-for-whole-word
|
2019 |
+
::s ⠻ ::t er ::lcode eng
|
2020 |
+
::s ⠐⠑ ::t ever ::lcode eng
|
2021 |
+
::s ⠑ ::t every ::lcode eng ::use-only-for-whole-word
|
2022 |
+
::s ⠋ ::t f ::lcode eng
|
2023 |
+
::s ⠐⠋ ::t father ::lcode eng
|
2024 |
+
::s ⠖ ::t ff ::lcode eng ::dont-use-at-start-of-word ::dont-use-at-end-of-word
|
2025 |
+
::s ⠋⠌ ::t first ::lcode eng
|
2026 |
+
::s ⠿ ::t for ::lcode eng
|
2027 |
+
::s ⠋⠗ ::t friend ::lcode eng ::use-only-for-whole-word
|
2028 |
+
::s ⠋⠗⠎ ::t friends ::lcode eng ::use-only-for-whole-word
|
2029 |
+
::s ⠋ ::t from ::lcode eng ::use-only-for-whole-word
|
2030 |
+
::s ⠰⠇ ::t ful ::lcode eng ::dont-use-at-start-of-word
|
2031 |
+
::s ⠛ ::t g ::lcode eng
|
2032 |
+
::s ⠶ ::t gg ::lcode eng ::dont-use-at-start-of-word ::dont-use-at-end-of-word
|
2033 |
+
::s ⠣ ::t gh ::lcode eng
|
2034 |
+
::s ⠛ ::t go ::lcode eng ::use-only-for-whole-word
|
2035 |
+
::s ⠛⠙ ::t good ::lcode eng ::use-only-at-start-of-word
|
2036 |
+
::s ⠛⠗⠞ ::t great ::lcode eng
|
2037 |
+
::s ⠓ ::t h ::lcode eng
|
2038 |
+
::s ⠸⠓ ::t had ::lcode eng
|
2039 |
+
::s ⠓ ::t have ::lcode eng ::use-only-for-whole-word
|
2040 |
+
::s ⠐⠓ ::t here ::lcode eng
|
2041 |
+
::s ⠓⠻⠋ ::t herself ::lcode eng ::use-only-for-whole-word
|
2042 |
+
::s ⠓⠍ ::t him ::lcode eng ::use-only-for-whole-word
|
2043 |
+
::s ⠓⠍⠋ ::t himself ::lcode eng ::use-only-for-whole-word
|
2044 |
+
::s ⠦ ::t ? ::lcode eng
|
2045 |
+
::s ⠦ ::t his ::lcode eng ::use-only-for-whole-word
|
2046 |
+
::s ⠊⠍⠍ ::t immediate ::lcode eng ::use-only-for-whole-word
|
2047 |
+
::s ⠊⠍⠍⠇⠽ ::t immediately ::lcode eng ::use-only-for-whole-word
|
2048 |
+
::s ⠔ ::t in ::lcode eng
|
2049 |
+
::s ⠔⠒ ::t incon ::lcode eng ::use-only-at-start-of-word
|
2050 |
+
::s ⠬ ::t ing ::lcode eng
|
2051 |
+
::s ⠭ ::t it ::lcode eng ::use-only-for-whole-word
|
2052 |
+
::s ⠭⠎ ::t its ::lcode eng ::use-only-for-whole-word
|
2053 |
+
::s ⠭⠋ ::t itself ::lcode eng ::use-only-for-whole-word
|
2054 |
+
::s ⠰⠽ ::t ity ::lcode eng ::dont-use-at-start-of-word
|
2055 |
+
::s ⠚ ::t j ::lcode eng
|
2056 |
+
::s ⠚ ::t just ::lcode eng ::use-only-for-whole-word
|
2057 |
+
::s ⠅ ::t k ::lcode eng
|
2058 |
+
::s ⠐⠅ ::t know ::lcode eng
|
2059 |
+
::s ⠅ ::t knowledge ::lcode eng ::use-only-for-whole-word
|
2060 |
+
::s ⠇ ::t l ::lcode eng
|
2061 |
+
::s ⠨⠎ ::t less ::lcode eng ::dont-use-at-start-of-word
|
2062 |
+
::s ⠇⠗ ::t letter ::lcode eng ::use-only-for-whole-word
|
2063 |
+
::s ⠇⠗⠎ ::t letters ::lcode eng ::use-only-for-whole-word
|
2064 |
+
::s ⠇ ::t like ::lcode eng ::use-only-for-whole-word
|
2065 |
+
::s ⠇⠇ ::t little ::lcode eng ::use-only-for-whole-word
|
2066 |
+
::s ⠐⠇ ::t lord ::lcode eng
|
2067 |
+
::s ⠍ ::t m ::lcode eng
|
2068 |
+
::s ⠸⠍ ::t many ::lcode eng
|
2069 |
+
::s ⠰⠞ ::t ment ::lcode eng ::dont-use-at-start-of-word
|
2070 |
+
::s ⠍ ::t more ::lcode eng ::use-only-for-whole-word
|
2071 |
+
::s ⠐⠍ ::t mother ::lcode eng
|
2072 |
+
::s ⠍⠡ ::t much ::lcode eng ::use-only-for-whole-word
|
2073 |
+
::s ⠍⠌ ::t must ::lcode eng ::use-only-for-whole-word
|
2074 |
+
::s ⠍⠽⠋ ::t myself ::lcode eng ::use-only-for-whole-word
|
2075 |
+
::s ⠝ ::t n ::lcode eng
|
2076 |
+
::s ⠐⠝ ::t name ::lcode eng
|
2077 |
+
::s ⠝⠑⠉ ::t necessary ::lcode eng ::use-only-for-whole-word
|
2078 |
+
::s ⠝⠑⠊ ::t neither ::lcode eng ::use-only-for-whole-word
|
2079 |
+
::s ⠰⠎ ::t ness ::lcode eng ::dont-use-at-start-of-word
|
2080 |
+
::s ⠝ ::t not ::lcode eng ::use-only-for-whole-word
|
2081 |
+
::s ⠕⠄⠉ ::t o'clock ::lcode eng ::use-only-for-whole-word
|
2082 |
+
::s ⠷ ::t of ::lcode eng
|
2083 |
+
::s ⠐⠕ ::t one ::lcode eng
|
2084 |
+
::s ⠰⠛ ::t ong ::lcode eng ::dont-use-at-start-of-word
|
2085 |
+
::s ⠳ ::t ou ::lcode eng
|
2086 |
+
::s ⠨⠙ ::t ound ::lcode eng
|
2087 |
+
::s ⠨⠞ ::t ount ::lcode eng
|
2088 |
+
::s ⠐⠳ ::t ought ::lcode eng
|
2089 |
+
::s ⠳⠗⠧⠎ ::t ourselves ::lcode eng ::use-only-for-whole-word
|
2090 |
+
::s ⠳ ::t out ::lcode eng ::use-only-for-whole-word
|
2091 |
+
::s ⠪ ::t ow ::lcode eng
|
2092 |
+
::s ⠏ ::t p ::lcode eng
|
2093 |
+
::s ⠏⠙ ::t paid ::lcode eng ::use-only-for-whole-word
|
2094 |
+
::s ⠐⠏ ::t part ::lcode eng
|
2095 |
+
::s ⠏ ::t people ::lcode eng ::use-only-for-whole-word
|
2096 |
+
::s ⠏⠻⠓ ::t perhaps ::lcode eng ::use-only-for-whole-word
|
2097 |
+
::s ⠟ ::t q ::lcode eng
|
2098 |
+
::s ⠐⠟ ::t question ::lcode eng
|
2099 |
+
::s ⠟⠅ ::t quick ::lcode eng ::use-only-for-whole-word
|
2100 |
+
::s ⠟⠅⠻ ::t quicker ::lcode eng ::use-only-for-whole-word
|
2101 |
+
::s ⠟⠅⠑⠌ ::t quickest ::lcode eng ::use-only-for-whole-word
|
2102 |
+
::s ⠟ ::t quite ::lcode eng ::use-only-for-whole-word
|
2103 |
+
::s ⠗ ::t r ::lcode eng
|
2104 |
+
::s ⠗ ::t rather ::lcode eng ::use-only-for-whole-word
|
2105 |
+
::s ⠐⠗ ::t right ::lcode eng
|
2106 |
+
::s ⠗⠚⠉ ::t rejoice ::lcode eng
|
2107 |
+
::s ⠗⠚⠉⠛ ::t rejoicing ::lcode eng
|
2108 |
+
::s ⠎ ::t s ::lcode eng
|
2109 |
+
::s ⠎⠙ ::t said ::lcode eng ::use-only-for-whole-word
|
2110 |
+
::s ⠩ ::t sh ::lcode eng
|
2111 |
+
::s ⠩ ::t shall ::lcode eng ::use-only-for-whole-word
|
2112 |
+
::s ⠩⠙ ::t should ::lcode eng ::use-only-for-whole-word
|
2113 |
+
::s ⠨⠝ ::t sion ::lcode eng
|
2114 |
+
::s ⠎ ::t so ::lcode eng ::use-only-for-whole-word
|
2115 |
+
::s ⠐⠎ ::t some ::lcode eng
|
2116 |
+
::s ⠸⠎ ::t spirit ::lcode eng
|
2117 |
+
::s ⠌ ::t st ::lcode eng
|
2118 |
+
::s ⠌ ::t still ::lcode eng ::use-only-for-whole-word
|
2119 |
+
::s ⠎⠡ ::t such ::lcode eng ::use-only-for-whole-word
|
2120 |
+
::s ⠞ ::t t ::lcode eng
|
2121 |
+
::s ⠹ ::t th ::lcode eng
|
2122 |
+
::s ⠞ ::t that ::lcode eng ::use-only-for-whole-word
|
2123 |
+
::s ⠹ ::t this ::lcode eng ::use-only-for-whole-word
|
2124 |
+
::s ⠮ ::t the ::lcode eng
|
2125 |
+
::s ⠸⠮ ::t their ::lcode eng
|
2126 |
+
::s ⠮⠍⠧⠎ ::t themselves ::lcode eng ::use-only-for-whole-word
|
2127 |
+
::s ⠐⠮ ::t there ::lcode eng
|
2128 |
+
::s ⠘⠮ ::t these ::lcode eng
|
2129 |
+
::s ⠘⠹ ::t those ::lcode eng
|
2130 |
+
::s ⠐⠹ ::t through ::lcode eng
|
2131 |
+
::s ⠐⠞ ::t time ::lcode eng
|
2132 |
+
::s ⠰⠝ ::t tion ::lcode eng ::dont-use-at-start-of-word
|
2133 |
+
::s ⠖ ::t to ::lcode eng ::use-only-for-whole-word
|
2134 |
+
::s ⠞⠙ ::t today ::lcode eng ::use-only-for-whole-word
|
2135 |
+
::s ⠞⠛⠗ ::t together ::lcode eng ::use-only-for-whole-word
|
2136 |
+
::s ⠞⠍ ::t tomorrow ::lcode eng ::use-only-for-whole-word
|
2137 |
+
::s ⠞⠝ ::t tonight ::lcode eng ::use-only-for-whole-word
|
2138 |
+
::s ⠥ ::t u ::lcode eng
|
2139 |
+
::s ⠥⠝⠒ ::t uncon ::lcode eng ::use-only-at-start-of-word
|
2140 |
+
::s ⠥ ::t us ::lcode eng ::use-only-for-whole-word
|
2141 |
+
::s ⠠⠥⠲⠎⠲ ::t U.S. ::lcode eng
|
2142 |
+
::s ⠐⠥ ::t under ::lcode eng
|
2143 |
+
::s ⠘⠥ ::t upon ::lcode eng
|
2144 |
+
::s ⠧ ::t v ::lcode eng
|
2145 |
+
::s ⠧ ::t very ::lcode eng ::use-only-for-whole-word
|
2146 |
+
::s ⠺ ::t w ::lcode eng
|
2147 |
+
::s ⠴ ::t " ::lcode eng
|
2148 |
+
::s ⠴ ::t was ::lcode eng ::use-only-for-whole-word
|
2149 |
+
::s ⠶ ::t were ::lcode eng ::use-only-for-whole-word
|
2150 |
+
::s ⠱ ::t wh ::lcode eng
|
2151 |
+
::s ⠐⠱ ::t where ::lcode eng
|
2152 |
+
::s ⠱ ::t which ::lcode eng ::use-only-for-whole-word
|
2153 |
+
::s ⠘⠱ ::t whose ::lcode eng
|
2154 |
+
::s ⠺ ::t will ::lcode eng ::use-only-for-whole-word
|
2155 |
+
::s ⠾ ::t with ::lcode eng
|
2156 |
+
::s ⠘⠺ ::t word ::lcode eng
|
2157 |
+
::s ⠐⠺ ::t work ::lcode eng
|
2158 |
+
::s ⠸⠺ ::t world ::lcode eng
|
2159 |
+
::s ⠺⠙ ::t would ::lcode eng ::use-only-for-whole-word
|
2160 |
+
::s ⠭ ::t x ::lcode eng
|
2161 |
+
::s ⠽ ::t y ::lcode eng
|
2162 |
+
::s ⠽ ::t you ::lcode eng ::use-only-for-whole-word
|
2163 |
+
::s ⠽⠗ ::t your ::lcode eng ::use-only-for-whole-word
|
2164 |
+
::s ⠽⠗⠎ ::t yours ::lcode eng ::use-only-for-whole-word
|
2165 |
+
::s ⠽⠗⠋ ::t yourself ::lcode eng ::use-only-for-whole-word
|
2166 |
+
::s ⠽⠗⠧⠎ ::t yourselves ::lcode eng ::use-only-for-whole-word
|
2167 |
+
::s ⠐⠽ ::t young ::lcode eng
|
2168 |
+
::s ⠵ ::t z ::lcode eng
|
2169 |
+
::s ⠠⠴ ::t ’ ::lcode eng
|
2170 |
+
|
2171 |
+
::preserve ::from U+2190 ::to U+21FF ::comment Arrows
|
2172 |
+
::preserve ::from U+2200 ::to U+22FF ::comment Mathematical Operators
|
2173 |
+
::preserve ::from U+2300 ::to U+23FF ::comment Miscellaneous Technical
|
2174 |
+
::preserve ::from U+2500 ::to U+257F ::comment Box Drawing
|
2175 |
+
::preserve ::from U+2580 ::to U+259F ::comment Block Elements
|
2176 |
+
::preserve ::from U+25A0 ::to U+25FF ::comment Geometric Shapes
|
2177 |
+
::preserve ::from U+2600 ::to U+26FF ::comment Miscellaneous Symbols
|
2178 |
+
::preserve ::from U+27C0 ::to U+27ED ::comment Miscellaneous Mathematical Symbols-A
|
2179 |
+
::preserve ::from U+27F0 ::to U+27FF ::comment Supplemental Arrows-A
|
2180 |
+
::preserve ::from U+2900 ::to U+297F ::comment Supplemental Arrows-B
|
2181 |
+
::preserve ::from U+2980 ::to U+29FF ::comment Miscellaneous Mathematical Symbols-B
|
2182 |
+
::preserve ::from U+2A00 ::to U+2AFF ::comment Supplemental Mathematical Operators
|
2183 |
+
::preserve ::from U+2B00 ::to U+2BFF ::comment Miscellaneous Symbols and Arrows
|
2184 |
+
::preserve ::from U+2E00 ::to U+2E27 ::comment Supplemental Punctuation (excluding ⸨⸩)
|
2185 |
+
::preserve ::from U+2E2A ::to U+2E7F ::comment Supplemental Punctuation (cont'd)
|
2186 |
+
::preserve ::from U+18B00 ::to U+18CD5 ::comment Khitan Small Script
|
2187 |
+
::preserve ::from U+1D100 ::to U+1D1FF ::comment Musical Symbols
|
2188 |
+
::preserve ::from U+1D6A8 ::to U+1D7CB ::comment Mathematical Alphanumeric Symbols (Greek)
|
2189 |
+
::preserve ::from U+1D800 ::to U+1DAAF ::comment Sutton SignWriting
|
2190 |
+
::preserve ::from U+1F800 ::to U+1F8FF ::comment Supplemental Arrows-C
|
2191 |
+
::preserve ::from U+1FA00 ::to U+1FA6F ::comment Chess Symbols
|
2192 |
+
::preserve ::from U+1FB00 ::to U+1FBCF ::comment Symbols for Legacy Computing
|
2193 |
+
::preserve ::from U+1FA70 ::to U+1FAFF ::comment Symbols and Pictographs Extended-A
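The rule lines above follow a simple `::field value` layout: `::s` gives the source string, `::t` the romanization target, and further fields such as `::comment`, `::lcode`, `::num`, or value-less flags like `::is-large-power` qualify the rule. The Python sketch below shows one way such lines could be read. It is an illustration only, not uroman's own loader: the function names `parse_rule` and `preserve_range` are hypothetical, and it assumes that every field name is preceded by whitespace (or starts the line) and that a value wrapped in matching quotes should have the quotes stripped.

```python
import re

# A field starts with "::name" at the beginning of the line or after whitespace.
# (Assumption for this sketch; not taken from uroman's code.)
KEY_RE = re.compile(r"(?:^|\s)::([a-z][a-z-]*)(?=\s|$)")


def unquote(value):
    """Strip one pair of matching quotes, e.g. '", "' becomes ', '."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


def parse_rule(line):
    """Return a dict of fields for a rule line, or None for blanks/comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    matches = list(KEY_RE.finditer(stripped))
    fields = {}
    for i, match in enumerate(matches):
        # The value runs up to the next field marker (or the end of the line).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(stripped)
        raw = stripped[match.end():end].strip()
        fields[match.group(1)] = unquote(raw)  # flags get an empty string
    return fields


def preserve_range(fields):
    """For a '::preserve ::from U+XXXX ::to U+YYYY' rule, return the code point range."""
    if fields is None or "preserve" not in fields:
        return None
    start = int(fields["from"][2:], 16)  # drop the "U+" prefix
    stop = int(fields["to"][2:], 16)
    return range(start, stop + 1)        # inclusive of the '::to' code point


if __name__ == "__main__":
    rule = parse_rule('::s ፡ ::t " " ::comment Ethiopic wordspace')
    print(rule)
    arrows = parse_rule("::preserve ::from U+2190 ::to U+21FF ::comment Arrows")
    print(ord("→") in preserve_range(arrows))
```

Keeping flags as empty-string entries lets a caller test for them with `"is-large-power" in fields` while reading valued fields the same way. How uroman itself resolves overlapping rules (for example `፡` versus `፡፡`) is not covered by this sketch.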
|
uroman/data/romanization-table.v1.2.1.txt
ADDED
@@ -0,0 +1,814 @@
|
1 |
+
|
2 |
+
## European Latin extensions
|
3 |
+
# Vowels
|
4 |
+
::s Ä ::t Ae
|
5 |
+
::s Ö ::t Oe
|
6 |
+
::s Ü ::t Ue
|
7 |
+
::s Å ::t Aa
|
8 |
+
::s Æ ::t Ae
|
9 |
+
::s Ø ::t Oe
|
10 |
+
::s Œ ::t Oe
|
11 |
+
::s ä ::t ae
|
12 |
+
::s ö ::t oe
|
13 |
+
::s ü ::t ue
|
14 |
+
::s å ::t aa
|
15 |
+
::s æ ::t ae
|
16 |
+
::s ø ::t oe
|
17 |
+
::s œ ::t oe
|
18 |
+
# Consonants
|
19 |
+
::s Ç ::t S
|
20 |
+
::s ç ::t s
|
21 |
+
::s Ç ::t Ch ::lcode tur
|
22 |
+
::s ç ::t ch ::lcode tur
|
23 |
+
::s Ş ::t Sh
|
24 |
+
::s ş ::t sh
|
25 |
+
::s Ș ::t Sh
|
26 |
+
::s ș ::t sh
|
27 |
+
::s ß ::t ss
|
28 |
+
::s Ț ::t Ts
|
29 |
+
::s ț ::t ts
|
30 |
+
|
31 |
+
# Miscellaneous
|
32 |
+
::s ə ::t e
|
33 |
+
|
34 |
+
# English
|
35 |
+
::s chr ::t chr ::t-alt kr ::example chromosome, synchronize
|
36 |
+
::s Chr ::t Chr ::t-alt Kr ::example Christmas, Chrysler
|
37 |
+
::s eight ::t eight ::t-alt eit ::example eight, weight
|
38 |
+
::s Eight ::t Eight ::t-alt Eit ::example Eighteen
|
39 |
+
::s ight ::t ight ::t-alt ait ::example Knight
|
40 |
+
::s gh ::t gh ::t-alt f, ph, "" ::example laugh, daughter
|
41 |
+
::s high ::t high ::t-alt hai ::example highlight
|
42 |
+
::s High ::t High ::t-alt Hai ::example High School
|
43 |
+
::s Isle ::t Isle ::t-alt Ail ::use-only-at-start-of-word ::use-only-at-end-of-word ::example Isle
|
44 |
+
::s Island ::t Island ::t-alt Ailand ::use-only-at-start-of-word ::use-only-at-end-of-word ::example Island
|
45 |
+
::s kn ::t kn ::t-alt n ::use-only-at-start-of-word ::example knowledge
|
46 |
+
::s Kn ::t Kn ::t-alt N ::use-only-at-start-of-word ::example Knight
|
47 |
+
::s Mc ::t Mc ::t-alt Mac ::use-only-at-start-of-word ::example McNulty
|
48 |
+
::s mc ::t mc ::t-alt mac ::use-only-at-start-of-word
|
49 |
+
::s oo ::t oo ::t-alt u ::lcode eng ::example Brooklyn; Goose Bay
|
50 |
+
::s ph ::t ph ::t-alt f ::example alpha
|
51 |
+
::s Ph ::t Ph ::t-alt F ::example Philip
|
52 |
+
::s Thom ::t Thom ::t-alt Tom ::use-only-at-start-of-word ::example Thomas, Thompson
|
53 |
+
::s tion ::t tion ::t-alt shen ::example
|
54 |
+
::s Sean ::t Sean ::t-alt Shawn ::use-only-at-start-of-word ::use-only-at-end-of-word
|
55 |
+
::s ssion ::t ssion ::t-alt shen ::example Sessions
|
56 |
+
::s St ::t St ::t-alt Saint ::use-only-at-start-of-word ::use-only-at-end-of-word
|
57 |
+
::s St. ::t St. ::t-alt Saint ::use-only-at-start-of-word ::use-only-at-end-of-word
|
58 |
+
::s Wr ::t Wr ::t-alt R ::example Wren
|
59 |
+
::s wr ::t wr ::t-alt r ::example Cartwright
|
60 |
+
::s x ::t x ::t-alt ks ::example Mexico
|
61 |
+
::s x ::t x ::t-alt gz ::example example, anxiety, exhaust, exit
|
62 |
+
|
63 |
+
# French
|
64 |
+
::s â ::t a ::t-alt as ::example pâte/paste, pastry
|
65 |
+
::s ê ::t e ::t-alt es ::example fête/feast
|
66 |
+
::s î ::t i ::t-alt is ::example île/isle
|
67 |
+
::s ô ::t o ::t-alt os ::example côte/coast
|
68 |
+
::s û ::t u ::t-alt us ::example août/August
|
69 |
+
::s eaux ::t eaux ::t-alt o ::example Bordeaux
|
70 |
+
::s eau ::t eau ::t-alt o ::example Chateau
|
71 |
+
::s auld ::t auld ::t-alt o ::use-only-at-end-of-word ::example Renauld
|
72 |
+
::s ault ::t ault ::t-alt o ::use-only-at-end-of-word ::example Renault
|
73 |
+
::s oux ::t oux ::t-alt u
|
74 |
+
::s ois ::t ois ::t-alt oa ::use-only-at-end-of-word ::example Dubois
|
75 |
+
|
76 |
+
# German
|
77 |
+
::s Sch ::t Sch ::t-alt Sh
|
78 |
+
::s sch ::t sch ::t-alt sh
|
79 |
+
::s stein ::t stein ::t-alt shtain
|
80 |
+
::s dt ::t dt ::t-alt tt ::use-only-at-end-of-word ::example Schmidt
|
81 |
+
|
82 |
+
# Dutch
|
83 |
+
::s ij ::t ij ::t-alt ai
|
84 |
+
::s Ij ::t Ij ::t-alt Ai
|
85 |
+
|
86 |
+
# Greek
|
87 |
+
::s Ι ::t I
|
88 |
+
::s ι ::t i
|
89 |
+
::s ί ::t i
|
90 |
+
::s ἶ ::t i
|
91 |
+
::s Υ ::t Y
|
92 |
+
::s υ ::t y
|
93 |
+
::s Ρ ::t R
|
94 |
+
::s ρ ::t r
|
95 |
+
::s Ντ ::t D
|
96 |
+
::s ντ ::t nd ::t-alt d
|
97 |
+
# ::s ντζ ::t ntz
|
98 |
+
::s Μπ ::t B
|
99 |
+
::s μπ ::t mb ::t-alt b
|
100 |
+
::s γγ ::t ng
|
101 |
+
::s γκ ::t ng ::t-alt g
|
102 |
+
::s ει ::t ei ::t-alt i
|
103 |
+
::s ου ::t ou ::t-alt u
|
104 |
+
::s χ ::t ch ::t-alt kh
|
105 |
+
|
106 |
+
# Cyrillic
|
107 |
+
::s Г ::t G ::t-alt H
|
108 |
+
::s г ::t g ::t-alt h
|
109 |
+
::s Е ::t E ::t-alt Ye
|
110 |
+
::s е ::t e ::t-alt ye
|
111 |
+
::s Ё ::t E ::t-alt Yo
|
112 |
+
::s ё ::t e ::t-alt yo
|
113 |
+
::s Х ::t Kh ::t-alt Ch, H ::comment Cyrillic capital ha
|
114 |
+
::s х ::t kh ::t-alt ch, h ::comment Cyrillic small ha
|
115 |
+
::s Щ ::t Shch ::t-alt Sh
|
116 |
+
::s щ ::t shch ::t-alt sh
|
117 |
+
::s Ъ ::t ::comment Cyrillic capital hard sign
|
118 |
+
::s ъ ::t ::comment Cyrillic small hard sign
|
119 |
+
::s Ы ::t Y ::comment Cyrillic capital yeru
|
120 |
+
::s ы ::t y ::comment Cyrillic small yeru
|
121 |
+
::s Ь ::t ::comment Cyrillic capital soft sign
|
122 |
+
::s ь ::t ::comment Cyrillic small soft sign
|
123 |
+
|
124 |
+
::s Ҥ ::t Ng ::comment Cyrillic capital ligature EN GHE
|
125 |
+
::s ҥ ::t ng ::comment Cyrillic small ligature EN GHE
|
126 |
+
::s Ә ::t e ::comment Cyrillic capital schwa
|
127 |
+
::s ә ::t e ::comment Cyrillic small schwa
|
128 |
+
::s Ӏ ::t ' ::comment Cyrillic palochka
|
129 |
+
::s Ҵ ::t TS ::comment Cyrillic capital ligature te tse, used in Abkhasian
|
130 |
+
::s ҵ ::t ts ::comment Cyrillic small ligature te tse, used in Abkhasian
|
131 |
+
::s Ӕ ::t AE ::comment Cyrillic capital ligature a ie
|
132 |
+
::s ӕ ::t ae ::comment Cyrillic small ligature a ie
|
133 |
+
::s Г ::t H ::lcode ukr ::comment Ukrainian capital letter he
|
134 |
+
::s г ::t h ::lcode ukr ::comment Ukrainian small letter he
|
135 |
+
::s Ґ ::t G ::lcode ukr ::comment Ukrainian capital letter ghe
|
136 |
+
::s ґ ::t g ::lcode ukr ::comment Ukrainian small letter ghe
|
137 |
+
|
138 |
+
# Gothic
|
139 |
+
::s 𐌴 ::t e ::comment Gothic letter aihvus
|
140 |
+
::s 𐌹 ::t i ::comment Gothic letter eis
|
141 |
+
::s 𐍇 ::t x ::comment Gothic letter iggws
|
142 |
+
|
143 |
+
# Georgian
|
144 |
+
::s ა ::t a ::comment Georgian letter an
|
145 |
+
::s ე ::t e ::comment Georgian letter en
|
146 |
+
::s ი ::t i ::comment Georgian letter in
|
147 |
+
::s ო ::t o ::comment Georgian letter on
|
148 |
+
::s უ ::t u ::comment Georgian letter un
|
149 |
+
|
150 |
+
# Armenian
|
151 |
+
::s Ա ::t a ::comment Armenian capital letter ayb
|
152 |
+
::s ա ::t a ::comment Armenian small letter ayb
|
153 |
+
::s Ե ::t e ::comment Armenian capital letter ech
|
154 |
+
::s ե ::t e ::comment Armenian small letter ech
|
155 |
+
::s և ::t ev ::comment Armenian small ligature ech yiwn
|
156 |
+
::s Է ::t e ::comment Armenian capital letter eh
|
157 |
+
::s է ::t e ::comment Armenian small letter eh
|
158 |
+
::s Ի ::t i ::comment Armenian capital letter ini
|
159 |
+
::s ի ::t i ::comment Armenian small letter ini
|
160 |
+
::s Օ ::t o ::comment Armenian capital letter oh
|
161 |
+
::s օ ::t o ::comment Armenian small letter oh
|
162 |
+
|
163 |
+
## Japanese
|
164 |
+
# Katakana
|
165 |
+
::s シ ::t shi
|
166 |
+
::s チ ::t chi
|
167 |
+
::s フ ::t fu
|
168 |
+
::s ジ ::t ji
|
169 |
+
::s ヂ ::t ji
|
170 |
+
::s ヅ ::t zu
|
171 |
+
::s シャ ::t sha
|
172 |
+
::s シュ ::t shu
|
173 |
+
::s ショ ::t sho
|
174 |
+
::s チャ ::t cha
|
175 |
+
::s チェ ::t che
|
176 |
+
::s チュ ::t chu
|
177 |
+
::s チョ ::t cho
|
178 |
+
::s ジャ ::t ja
|
179 |
+
::s ジュ ::t ju
|
180 |
+
::s ジョ ::t jo
|
181 |
+
::s ジェ ::t je
|
182 |
+
::s ヂャ ::t ja
|
183 |
+
::s ヂュ ::t ju
|
184 |
+
::s ヂョ ::t jo
|
185 |
+
::s フェ ::t fe
|
186 |
+
::s ヴェ ::t ve
|
187 |
+
::s フィ ::t fi
|
188 |
+
::s ウィ ::t wi
|
189 |
+
::s ヴィ ::t vi
|
190 |
+
::s ティ ::t ti
|
191 |
+
::s ディ ::t di
|
192 |
+
::s ッ ::t (__SOKUON__) ::comment katakana double following consonant
|
193 |
+
::s ー ::t (__CHOONPU__) ::comment katakana prolonged sound mark
|
194 |
+
# Hiragana
|
195 |
+
::s し ::t shi
|
196 |
+
::s ち ::t chi
|
197 |
+
::s つ ::t tsu
|
198 |
+
::s ふ ::t fu
|
199 |
+
::s を ::t o
|
200 |
+
::s じ ::t ji
|
201 |
+
::s ぢ ::t ji
|
202 |
+
::s づ ::t zu
|
203 |
+
::s しゃ ::t sha
|
204 |
+
::s しゅ ::t shu
|
205 |
+
::s しょ ::t sho
|
206 |
+
::s ちゃ ::t cha
|
207 |
+
::s ちゅ ::t chu
|
208 |
+
::s ちょ ::t cho
|
209 |
+
::s じゃ ::t ja
|
210 |
+
::s じゅ ::t ju
|
211 |
+
::s じょ ::t jo
|
212 |
+
::s ぢゃ ::t ja
|
213 |
+
::s ぢゅ ::t ju
|
214 |
+
::s ぢょ ::t jo
|
215 |
+
::s っ ::t (__SOKUON__) ::comment hiragana double following consonant
|
216 |
+
::s 々 ::t ² ::comment ideographic iteration mark ::annotation repetition-sign
|
217 |
+
|
218 |
+
::s フ ::t fu ::t-alt f
|
219 |
+
::s キ ::t ki ::t-alt k
|
220 |
+
::s ク ::t ku ::t-alt k
|
221 |
+
::s ラ ::t ra ::t-alt la
|
222 |
+
::s リ ::t ri ::t-alt li
|
223 |
+
::s ル ::t ru ::t-alt lu, l, r
|
224 |
+
::s レ ::t re ::t-alt le
|
225 |
+
::s ロ ::t ro ::t-alt lo
|
226 |
+
::s ム ::t mu ::t-alt m ::example キム = Kim
|
227 |
+
::s シ ::t shi ::t-alt si ::example メキシコ = meksiko (Mexico)
|
228 |
+
::s ス ::t su ::t-alt s
|
229 |
+
::s ト ::t to ::t-alt t
|
230 |
+
::s ツ ::t tsu ::t-alt tu, ts ::example シュルツ = Schultz
|
231 |
+
|
232 |
+
# Chinese
|
233 |
+
::s 邦 ::t bang ::t-alt bon, bum, bun, pon
|
234 |
+
::s 鲍 ::t bao ::t-alt bow
|
235 |
+
::s 堡 ::t bao ::t-alt berg, burg, bourg, burgh
|
236 |
+
::s 贝 ::t bei ::t-alt ber
|
237 |
+
::s 本 ::t ben ::t-alt bern, bon, bourn, burn
|
238 |
+
::s 彼得 ::t bide ::t-alt peter, pet
|
239 |
+
::s 伯 ::t bo ::t-alt ber
|
240 |
+
::s 波 ::t bo ::t-alt po
|
241 |
+
::s 布 ::t bu ::t-alt b
|
242 |
+
::s 策 ::t ce ::t-alt tze, tzer
|
243 |
+
::s 曾 ::t ceng ::t-alt tzen, zen
|
244 |
+
::s 彻 ::t che ::t-alt tche
|
245 |
+
::s 茨 ::t ci ::t-alt ts, tz, z
|
246 |
+
::s 兹 ::t ci ::t-alt ds, dz, tz, z, zi
|
247 |
+
::s 蒂 ::t di ::t-alt ti, tti
|
248 |
+
::s 丁 ::t ding ::t-alt din, tin
|
249 |
+
::s 顿 ::t dun ::t-alt ton
|
250 |
+
::s 多 ::t duo ::t-alt do, dor, to
|
251 |
+
::s 尔 ::t er ::t-alt l, le, ll, r
|
252 |
+
::s 弗 ::t fu ::t-alt f, fer, pher, v, ver, vir
|
253 |
+
::s 夫 ::t fu ::t-alt f, v
|
254 |
+
::s 福 ::t fu ::t-alt faw, for, ford
|
255 |
+
::s 哥 ::t ge ::t-alt go, co
|
256 |
+
::s 戈 ::t ge ::t-alt go
|
257 |
+
::s 各 ::t ge ::t-alt go, co
|
258 |
+
::s 赫 ::t he ::t-alt ch, che, cher, ge
|
259 |
+
::s 华 ::t hua ::t-alt ver, wa, war, wer ::example Washington
|
260 |
+
::s 怀 ::t huai ::t-alt whi, wi, wy
|
261 |
+
::s 惠 ::t hui ::t-alt wha, whea
|
262 |
+
::s 基 ::t ji ::t-alt ki, chi
|
263 |
+
::s 吉 ::t ji ::t-alt gi, gui
|
264 |
+
::s 加 ::t jia ::t-alt ca, ga, ka ::example Canada
|
265 |
+
::s 杰 ::t jie ::t-alt ger
|
266 |
+
::s 金 ::t jin ::t-alt kin, gin
|
267 |
+
::s 斤 ::t jin ::t-alt zin
|
268 |
+
::s 康 ::t kang ::t-alt con, corn
|
269 |
+
::s 考 ::t kao ::t-alt cow, cour
|
270 |
+
::s 克 ::t ke ::t-alt k, che, cher
|
271 |
+
::s 科 ::t ke ::t-alt ko
|
272 |
+
::s 拉 ::t la ::t-alt ra ::example Tirana
|
273 |
+
::s 朗 ::t lang ::t-alt lon, ron
|
274 |
+
::s 赖 ::t lai ::t-alt ri
|
275 |
+
::s 劳 ::t lao ::t-alt low
|
276 |
+
::s 勒 ::t lei ::t-alt ler
|
277 |
+
::s 伦 ::t lun ::t-alt lon, ran, ron
|
278 |
+
::s 里 ::t li ::t-alt ri
|
279 |
+
::s 利 ::t li ::t-alt ri ::example Ferrari
|
280 |
+
::s 隆 ::t long ::t-alt lon, lum, lund
|
281 |
+
::s 罗 ::t luo ::t-alt l, lo, lu, ro, row, ru
|
282 |
+
::s 洛 ::t luo ::t-alt lo, low, ro
|
283 |
+
::s 默 ::t mo ::t-alt mer
|
284 |
+
::s 纳 ::t na ::t-alt ne, ner
|
285 |
+
::s 珀 ::t po ::t-alt per
|
286 |
+
::s 奇 ::t qi ::t-alt chi, dge, ge, tch
|
287 |
+
::s 齐 ::t qi ::t-alt tsi, zi
|
288 |
+
::s 乔 ::t qiao ::t-alt jo
|
289 |
+
::s 青 ::t qing ::t-alt tsing
|
290 |
+
::s 琼 ::t qiong ::t-alt jon, jum, jun
|
291 |
+
::s 瑟 ::t se ::t-alt the
|
292 |
+
::s 什 ::t shen ::t-alt sh
|
293 |
+
::s 圣 ::t sheng ::t-alt san, sao, saint
|
294 |
+
::s 斯 ::t si ::t-alt s, rth, th ::example Alaska
|
295 |
+
::s 索 ::t suo ::t-alt tho
|
296 |
+
::s 特 ::t te ::t-alt t
|
297 |
+
::s 翁 ::t weng ::t-alt on
|
298 |
+
::s 沃 ::t wo ::t-alt ver, vo, war, wer
|
299 |
+
::s 乌 ::t wu ::t-alt ou, u
|
300 |
+
::s 希 ::t xi ::t-alt chi, hi, shi
|
301 |
+
::s 西 ::t xi ::t-alt s, si
|
302 |
+
::s 锡 ::t xi ::t-alt ci, si, thi, zi
|
303 |
+
::s 夏 ::t xia ::t-alt ha, cha, cia, sha, tia
|
304 |
+
::s 香 ::t xiang ::t-alt chan, cham
|
305 |
+
::s 歇 ::t xie ::t-alt she
|
306 |
+
::s 谢 ::t xie ::t-alt che, she
|
307 |
+
::s 辛 ::t xin ::t-alt cin, sen, sin, sing, sun, zen
|
308 |
+
::s 欣 ::t xin ::t-alt hin, shin
|
309 |
+
::s 休 ::t xiu ::t-alt hu, hue
|
310 |
+
::s 修 ::t xiu ::t-alt ciu, siu, thew, tiu
|
311 |
+
::s 许 ::t xu ::t-alt hue, schue
|
312 |
+
::s 逊 ::t xun ::t-alt son
|
313 |
+
::s 耶 ::t ye ::t-alt yer, ier
|
314 |
+
::s 泽 ::t ze ::t-alt ser
|
315 |
+
::s 扎 ::t zha ::t-alt za
|
316 |
+
::s 詹 ::t zhan ::t-alt ja, jam, jan, jen, jon
|
317 |
+
::s 治 ::t zhi ::t-alt ge ::example George
|
318 |
+
|
319 |
+
## Numbers
|
320 |
+
# Chinese and Japanese numbers
|
321 |
+
::s 零 ::num 0
|
322 |
+
::s 〇 ::num 0
|
323 |
+
::s 一 ::num 1
|
324 |
+
::s 二 ::num 2
|
325 |
+
::s 三 ::num 3
|
326 |
+
::s 四 ::num 4
|
327 |
+
::s 五 ::num 5
|
328 |
+
::s 六 ::num 6
|
329 |
+
::s 七 ::num 7
|
330 |
+
::s 八 ::num 8
|
331 |
+
::s 九 ::num 9
|
332 |
+
::s 十 ::num 10
|
333 |
+
::s 百 ::num 100
|
334 |
+
::s 千 ::num 1000
|
335 |
+
::s 万 ::num 10000
|
336 |
+
::s 萬 ::num 10000
|
337 |
+
::s 亿 ::num 100000000
|
338 |
+
::s 億 ::num 100000000
|
339 |
+
::s 兆 ::num 1000000000000
|
340 |
+
::s 京 ::num 10000000000000000
|
341 |
+
|
342 |
+
::s 北京 ::t beijing
|
343 |
+
::s 京都 ::t jingdou
|
344 |
+
::s 东京 ::t dongjing
|
345 |
+
::s 京胡 ::t jinghu
|
346 |
+
::s 南京 ::t nanjing
|
347 |
+
::s 普京 ::t pujing ::comment Putin
|
348 |
+
::s 東京 ::t dongjing ::comment Tokyo
|
349 |
+
::s 京兆 ::t jingzhao
|
350 |
+
|
351 |
+
::s ㎢ ::t km²
|
352 |
+
::s ㎥ ::t m³
|
353 |
+
::s ㎝ ::t cm
|
354 |
+
|
355 |
+
## Indian
|
356 |
+
# see mostly under UnicodeDataOverwrite.txt
|
357 |
+
|
358 |
+
# Malayalam
|
359 |
+
::s ൗ ::t au ::comment MALAYALAM AU LENGTH MARK
|
360 |
+
|
361 |
+
# Tamil
|
362 |
+
::s ட ::t d ::comment most commonly d, but t when word-initial or in a doubled consonant
|
363 |
+
::s ஃப ::t f ::comment h+p=f
|
364 |
+
::s ஃஜ ::t z ::comment h+j=z
|
365 |
+
|
366 |
+
# Myanmar/Burmese
|
367 |
+
# ::s ့ ::t ::comment dot below, denotes creaky tone
|
368 |
+
# ::s း ::t ::comment visarga, denotes high tone
|
369 |
+
::s ၌ ::t -nai ::comment locative
|
370 |
+
::s ၍ ::t -jwe ::comment completed
|
371 |
+
::s ၎ ::t legau ::comment aforementioned
|
372 |
+
::s ၏ ::t -i ::comment genitive
|
373 |
+
|
374 |
+
# Lao
|
375 |
+
::s ັ ::t a ::comment vowel sign mai kan
|
376 |
+
::s ົ ::t o ::comment vowel sign mai kon
|
377 |
+
::s ູ ::t uu ::comment vowel sign uu
|
378 |
+
::s ຽ ::t y ::comment semivowel sign nyo
|
379 |
+
::s ຼ ::t l ::comment semivowel sign lo
|
380 |
+
::s ລ ::t l ::comment lo loot
|
381 |
+
::s ຣ ::t l ::comment lo ling
|
382 |
+
::s ໝ ::t m ::comment ho mo
|
383 |
+
::s ໜ ::t n ::comment ho no
|
384 |
+
::s ຢ ::t y ::comment yo
|
385 |
+
::s ໍ ::t oo ::comment niggahita (possibly also nasal -m in final position)
|
386 |
+
::s ໆ ::t ² ::comment Lao ko la ::annotation repetition-sign
|
387 |
+
::s ຯ ::t ... ::comment Lao ellipsis
|
388 |
+
|
389 |
+
# Thai
|
390 |
+
::s ออ ::t o
|
391 |
+
::s อั ::t a
|
392 |
+
::s อิ ::t i
|
393 |
+
::s ๆ ::t ² ::comment Thai character maiyamok ::annotation repetition-sign
|
394 |
+
|
395 |
+
# Khmer
|
396 |
+
::s ័ ::t "" ::comment Khmer samyok sannya: indicates deviation from the general rules of pronunciation
|
397 |
+
::s ៏ ::t "" ::comment Khmer sign ahsda: denotes stressed intonation in some single-consonant words
|
398 |
+
::s ៍ ::t "" ::comment Khmer sign toandakhiat: indicates that the base character is not pronounced
|
399 |
+
::s ៌ ::t "" ::comment Khmer sign robat: a diacritic historically corresponding to the repha form of ra in Devanagari
|
400 |
+
::s ប៉ ::t pa ::comment Khmer ba + musĕkâtônd -> pa
|
401 |
+
::s ៗ ::t ² ::comment Khmer sign lek too ::annotation repetition-sign
|
402 |
+
|
403 |
+
## Semitic languages
|
404 |
+
# Arabic
|
405 |
+
::s و ::t w ::comment Arabic letter waw ::t-alt o, u ::lcode ara
|
406 |
+
::s ء ::t ' ::comment hamza
|
407 |
+
::s ٔ ::t ' ::comment hamza above
|
408 |
+
::s ٕ ::t ' ::comment hamza below
|
409 |
+
::s ع ::t ' ::comment ain
|
410 |
+
::s آ ::t a ::comment alef madda
|
411 |
+
::s ٓا ::t a ::comment Arabic maddah above plus alef (presumably an ill-formed version of آ; found 1 instance in Urdu text)
|
412 |
+
::s إ ::t i ::comment alef with hamza below
|
413 |
+
::s ٱ ::t a ::comment alef wasla ::comment typically indicates liaison with preceding word
|
414 |
+
::s ة ::t a ::comment teh marbuta
|
415 |
+
::s ۃ ::t a ::comment teh marbuta goal ::comment Used in Punjabi, Sindhi. Different from plain 'teh marbuta'?
|
416 |
+
::s ي ::t y ::comment Arabic yeh
|
417 |
+
::s ى ::t a ::comment alef maksura
|
418 |
+
::s ﻯ ::t a ::comment alef maksura isolated form
|
419 |
+
::s ﻰ ::t a ::comment alef maksura final form
|
420 |
+
::s ﯨ ::t a ::comment Uighur Kazach Kirghiz alef maksura initial form
|
421 |
+
::s ﯩ ::t a ::comment Uighur Kazach Kirghiz alef maksura medial form
|
422 |
+
::s ٰ ::t a ::comment Arabic letter superscript alef
|
423 |
+
::s ـ ::t ::comment tatweel (filler)
|
424 |
+
::s َ ::t a ::comment fatha ("-a")
|
425 |
+
::s ُ ::t u ::comment damma ("-u")
|
426 |
+
::s ِ ::t i ::comment kasra ("-i")
|
427 |
+
::s ْ ::t ::comment sukun (no vowel)
|
428 |
+
::s ۡ ::t ::comment small high dotless head of khah; like sukun (no vowel); used in Kashmiri, Assamese
|
429 |
+
::s ً ::t ::comment fathatan ("-an")
|
430 |
+
::s اً ::t an ::comment alef + fathatan
|
431 |
+
::s ٌ ::t ::comment dammatan ("-un")
|
432 |
+
::s ٍ ::t ::comment kasratan ("-in")
|
433 |
+
::s ّ ::t ::comment shadda (consonant doubler)
|
434 |
+
::s ڃ ::t ny ::comment Arabic letter nyeh U+0683 (used in Sindhi (snd))
|
435 |
+
::s ڄ ::t dy ::comment Arabic letter dyeh U+0684 (used in Sindhi (snd))
|
436 |
+
::s ۾ ::t men ::comment Sindhi postposition men
|
437 |
+
::s ؑ ::t alayhe wasallam ::comment "upon him be peace"
|
438 |
+
::s ﷴ ::t mohammad ::comment "Mohammad"
|
439 |
+
::s ﷸ ::t wasallam ::comment "and peace"
|
440 |
+
::s ﷺ ::t sallallahou alayhe wasallam ::comment "prayer of God be upon him and his family and peace"
|
441 |
+
|
442 |
+
# Farsi
|
443 |
+
::s ی ::t i ::t-alt y ::comment Contributed by Nima
|
444 |
+
::s ای ::t i ::t-alt ai ::use-only-at-start-of-word ::comment Contributed by Nima
|
445 |
+
::s هٔ ::t eye ::use-only-at-end-of-word ::lcode fas ::comment Contributed by Nima
|
446 |
+
::s و ::t v ::t-alt o, u ::lcode fas ::comment Arabic letter waw
|
447 |
+
::s ض ::t z ::t-alt d ::lcode fas ::comment Contributed by Marjan
|
448 |
+
::s ث ::t s ::t-alt th ::lcode fas ::comment Contributed by Marjan
|
449 |
+
::s ذ ::t z ::t-alt th ::lcode fas ::comment Contributed by Nima
|
450 |
+
::s ع ::t a ::t-alt ' ::lcode fas ::comment Contributed by Nima
|
451 |
+
::s عا ::t a ::lcode fas ::comment Contributed by Nima
|
452 |
+
::s عی ::t i ::t-alt iy ::lcode fas ::comment Contributed by Nima
|
453 |
+
::s عو ::t u ::t-alt o, av ::lcode fas ::comment Contributed by Nima
|
454 |
+
::s چ ::t ch ::t-alt tch, tsh ::lcode fas ::comment Contributed by Nima
|
455 |
+
::s ه ::t e ::t-alt h ::use-only-at-end-of-word ::lcode fas ::comment Contributed by Nima
|
456 |
+
::s ::t "" ::t-alt " " ::lcode fas ::comment source is character "zero-width non-joiner" (U+200C); Contributed by Nima
|
457 |
+
::s غ ::t gh ::t-alt g ::lcode fas
|
458 |
+
::s آئی ::t ai ::t-alt ae ::lcode fas
|
459 |
+
::s ائی ::t ai ::t-alt ae ::lcode fas
|
460 |
+
::s آئو ::t au ::t-alt ao ::lcode fas
|
461 |
+
::s ائو ::t au ::t-alt ao ::lcode fas
|
462 |
+
|
463 |
+
# Kashmiri (so far: educated guesses)
|
464 |
+
::s ٖ ::t a ::comment Arabic subscript alef U+0656
|
465 |
+
::s ٗ ::t u ::comment Arabic inverted damma U+0657
|
466 |
+
::s ۚ ::t j ::comment Arabic small high jeem U+06DA
|
467 |
+
::s ۪ ::t ::comment Arabic empty centre low stop U+06EA
|
468 |
+
::s ۬ ::t ::comment Arabic rounded high stop with filled center U+06EC
|
469 |
+
|
470 |
+
# Pashto
|
471 |
+
::s ٙ ::t e
|
472 |
+
|
473 |
+
# Hebrew
|
474 |
+
::s ב ::t v ::comment Hebrew letter bet ::t-alt b
|
475 |
+
::s כ ::t k ::comment Hebrew letter kaf ::t-alt kh
|
476 |
+
::s ך ::t k ::comment Hebrew letter kaf ::t-alt kh
|
477 |
+
::s פ ::t f ::comment Hebrew letter pe ::t-alt p
|
478 |
+
::s ש ::t sh ::comment Hebrew letter shin ::t-alt s
|
479 |
+
::s ו ::t v ::comment Hebrew letter vav ::t-alt o, u
|
480 |
+
::s ח ::t ch ::comment Hebrew letter het ::t-alt h ::use-alt-in-pointed
|
481 |
+
::s ק ::t q ::t-alt k ::use-alt-in-pointed
|
482 |
+
::s וֹ ::t o
|
483 |
+
::s וּ ::t u
|
484 |
+
::s קְוָ ::t qva ::t-alt kva ::use-alt-in-pointed
|
485 |
+
::s י ::t y
|
486 |
+
::s יּ ::t y
|
487 |
+
::s יָּ ::t ya
|
488 |
+
::s ע ::t '
|
489 |
+
::s ִי ::t i ::t-alt iy ::use-alt-in-pointed
|
490 |
+
::s ֵי ::t e
|
491 |
+
::s ִיּ ::t iy
|
492 |
+
::s ִיָּ ::t iya
|
493 |
+
::s ױ ::t oy
|
494 |
+
::s א ::t a ::t-alt '
|
495 |
+
::s אָ ::t a
|
496 |
+
::s ֹא ::t o
|
497 |
+
::s אַ ::t 'a
|
498 |
+
::s אֲ ::t 'a
|
499 |
+
::s אֶ ::t e
|
500 |
+
::s אֱ ::t e
|
501 |
+
::s פ ::t f
|
502 |
+
::s פּ ::t p
|
503 |
+
::s פַּ ::t pa
|
504 |
+
::s פְּ ::t pe ::t-alt p ::use-alt-in-pointed
|
505 |
+
::s שׁ ::t sh
|
506 |
+
::s שָׁ ::t sha
|
507 |
+
::s שָּׁ ::t sha ::comment ?
|
508 |
+
::s שְׁ ::t she ::t-alt sh ::use-alt-in-pointed
|
509 |
+
::s שֶׁ ::t she
|
510 |
+
::s שִׁ ::t shi
|
511 |
+
::s שֻׁ ::t shu
|
512 |
+
::s שׂ ::t s
|
513 |
+
::s שָׂ ::t sa
|
514 |
+
::s שְׂ ::t s ::t-alt se ::use-alt-in-pointed
|
515 |
+
::s כּ ::t k
|
516 |
+
::s כֶּ ::t ke
|
517 |
+
::s כֹּ ::t ko
|
518 |
+
::s בּ ::t b
|
519 |
+
::s בַּ ::t ba
|
520 |
+
::s בָּ ::t ba
|
521 |
+
::s בְּ ::t be ::t-alt b ::use-alt-in-pointed
|
522 |
+
::s בֶּ ::t be
|
523 |
+
::s תּ ::t t
|
524 |
+
::s תַּ ::t ta
|
525 |
+
::s תֵּ ::t te
|
526 |
+
::s תִּ ::t ti
|
527 |
+
::s דָּ ::t da
|
528 |
+
::s דְּ ::t de ::t-alt d ::use-alt-in-pointed
|
529 |
+
::s גּ ::t g
|
530 |
+
::s לֵּ ::t le
|
531 |
+
::s ד׳ ::t dh
|
532 |
+
::s ג׳ ::t j
|
533 |
+
::s ת׳ ::t th
|
534 |
+
::s ז׳ ::t zh
|
535 |
+
::s חַ ::t ach ::comment furtive patah ::use-only-at-end-of-word
|
536 |
+
::s עַ ::t a' ::comment furtive patah ::use-only-at-end-of-word
|
537 |
+
::s הַּ ::t ah ::comment furtive patah ::use-only-at-end-of-word
|
538 |
+
::s ַ ::t a ::comment Hebrew point patah
|
539 |
+
::s ֲ ::t a ::comment Hebrew point hataf patah (hataf = reduced)
|
540 |
+
::s ֳ ::t o ::comment Hebrew point hataf qamats
|
541 |
+
::s ָ ::t a ::comment Hebrew point qamats ::t-alt o ::use-alt-in-pointed
|
542 |
+
::s ֶ ::t e ::comment Hebrew point segol
|
543 |
+
::s ֱ ::t e ::comment Hebrew point hataf segol (hataf = reduced)
|
544 |
+
::s ְ ::t e ::comment Hebrew point sheva ::t-alt "" ::use-alt-in-pointed
|
545 |
+
::s ֵ ::t e ::comment Hebrew point tsere
|
546 |
+
::s ִ ::t i ::comment Hebrew point hiriq
|
547 |
+
::s ֹ ::t o ::comment Hebrew point holam
|
548 |
+
::s ֻ ::t u ::comment Hebrew point qubuts
|
549 |
+
# ::s ּ ::t "" ::comment Hebrew point dagesh or mapiq
|
550 |
+
|
551 |
+
# Yiddish
|
552 |
+
::s א ::t a ::lcode yid ::comment called "silent" alef
|
553 |
+
::s אי ::t y ::lcode yid
|
554 |
+
::s איי ::t ey ::lcode yid
|
555 |
+
::s או ::t u ::lcode yid
|
556 |
+
::s אוי ::t oy ::lcode yid
|
557 |
+
::s אַ ::t a ::lcode yid
|
558 |
+
::s אָ ::t o ::lcode yid
|
559 |
+
::s ב ::t b ::lcode yid
|
560 |
+
::s בֿ ::t v ::lcode yid
|
561 |
+
::s דזש ::t dzh ::lcode yid
|
562 |
+
::s ו ::t u ::lcode yid
|
563 |
+
::s וּ ::t u ::lcode yid
|
564 |
+
::s וֹ ::t o ::lcode yid
|
565 |
+
::s װ ::t v ::lcode yid
|
566 |
+
::s ווא ::t wa ::lcode yid
|
567 |
+
::s וואַ ::t wa ::lcode yid
|
568 |
+
::s ווע ::t we ::lcode yid
|
569 |
+
::s ווי ::t wi ::lcode yid
|
570 |
+
::s וואוי ::t wo ::lcode yid
|
571 |
+
::s וי ::t oy ::lcode yid
|
572 |
+
::s זש ::t zh ::lcode yid
|
573 |
+
::s ח ::t ch ::lcode yid
|
574 |
+
::s טש ::t tsh ::lcode yid
|
575 |
+
::s יִ ::t i ::lcode yid
|
576 |
+
::s יי ::t ey ::lcode yid ::comment maybe "yi" at beginning of word
|
577 |
+
::s ײַ ::t ay ::lcode yid
|
578 |
+
::s כּ ::t k ::lcode yid
|
579 |
+
::s כ ::t ch ::lcode yid
|
580 |
+
::s ך ::t ch ::lcode yid
|
581 |
+
::s ע ::t e ::lcode yid
|
582 |
+
::s פּ ::t p ::lcode yid
|
583 |
+
::s פֿ ::t f ::lcode yid
|
584 |
+
::s ף ::t f ::lcode yid ::comment sometimes p
|
585 |
+
::s ק ::t k ::lcode yid
|
586 |
+
::s ת ::t s ::lcode yid
|
587 |
+
|
588 |
+
# Syriac/Aramaic (should be vetted by expert)
|
589 |
+
::s ܰ ::t a ::comment Syriac pthaha above
|
590 |
+
::s ܲ ::t a ::comment Syriac pthaha dotted
|
591 |
+
::s ܳ ::t aa ::comment Syriac zqapha above
|
592 |
+
::s ܴ ::t aa ::comment Syriac zqapha below
|
593 |
+
::s ܵ ::t aa ::comment Syriac zqapha dotted
|
594 |
+
::s ܶ ::t e ::comment Syriac rbasa above
|
595 |
+
::s ܷ ::t e ::comment Syriac rbasa below
|
596 |
+
::s ܿ ::t o ::comment Syriac rwaha
|
597 |
+
::s ܸ ::t e ::comment Syriac dotted zlama horizontal
|
598 |
+
::s ܹ ::t e ::comment Syriac dotted zlama angular
|
599 |
+
::s ܺ ::t i ::comment Syriac hbasa above
|
600 |
+
::s ܝܺ ::t i ::comment Syriac yudh + hbasa above
|
601 |
+
::s ܼ ::t u ::comment Syriac hbasa-esasa dotted
|
602 |
+
::s ܽ ::t o ::comment Syriac esasa above
|
603 |
+
::s ܾ ::t u ::comment Syriac esasa below
|
604 |
+
::s ݇ ::t "" ::comment Syriac oblique line above; indication of a silent letter
|
605 |
+
|
606 |
+
::s ܖ ::t d ::comment Syriac letter dotless dalath rish; ambiguous form for undifferentiated early dalath/rish
|
607 |
+
::s ܜ ::t t ::comment Syriac letter teth garshuni; used in Garshuni documents
|
608 |
+
::s ܒ݂ ::t v ::comment Syriac beth + rukkakha
|
609 |
+
::s ܒ̥ ::t v ::comment Syriac beth + ring-below
|
610 |
+
::s ܓ݂ ::t g ::comment Syriac gammal + rukkakha [IPA: ɣ]
|
611 |
+
::s ܓ̥ ::t g ::comment Syriac gammal + ring-below [IPA: ɣ]
|
612 |
+
::s ܕ݂ ::t d ::comment Syriac dalath + rukkakha [IPA: ð]
|
613 |
+
::s ܕ̥ ::t d ::comment Syriac dalath + ring-below [IPA: ð]
|
614 |
+
::s ܟ݂ ::t kh ::comment Syriac kaph + rukkakha [IPA: x]
|
615 |
+
::s ܟ̥ ::t kh ::comment Syriac kaph + ring-below [IPA: x]
|
616 |
+
::s ܦ݂ ::t f ::comment Syriac pe + rukkakha
|
617 |
+
::s ܦ̥ ::t f ::comment Syriac pe + ring-below
|
618 |
+
::s ܦ݁ ::t p ::comment Syriac pe + qushshaya
|
619 |
+
::s ܬ݂ ::t th ::comment Syriac taw + rukkakha [IPA: θ]
|
620 |
+
::s ܬ̥ ::t th ::comment Syriac taw + ring-below [IPA: θ]
|
621 |
+
|
622 |
+
::s ܄ ::t : ::comment Syriac sublinear colon; used at the end of verses of supplications
|
623 |
+
::s ܆ ::t , ::comment Syriac colon skewed left; marks a dependent clause
|
624 |
+
::s ܇ ::t , ::comment Syriac colon skewed right; marks the end of a subdivision of the apodosis, or latter part of a Biblical verse
|
625 |
+
|
626 |
+
# Uzbek
|
627 |
+
::s ʻ ::t ' ::comment modifies pronunciation of preceding "o" and "g"
|
628 |
+
::s ʼ ::t ' ::comment glottal stop (tutuq belgisi)
|
629 |
+
|
630 |
+
# Uyghur
|
631 |
+
::s ئا ::t a ::lcode uig
|
632 |
+
::s ە ::t e ::lcode uig
|
633 |
+
::s ئې ::t e ::lcode uig ::latinplus ë
|
634 |
+
::s ې ::t e ::lcode uig ::latinplus ë
|
635 |
+
::s ئە ::t e ::lcode uig
|
636 |
+
::s يە ::t e ::lcode uig
|
637 |
+
::s ئى ::t i ::lcode uig
|
638 |
+
::s ى ::t i ::lcode uig
|
639 |
+
::s ئو ::t o ::lcode uig
|
640 |
+
::s و ::t o ::lcode uig
|
641 |
+
::s ئۇ ::t u ::lcode uig
|
642 |
+
::s ۇ ::t u ::lcode uig
|
643 |
+
::s چ ::t ch ::t-alt q ::lcode uig
|
644 |
+
::s خ ::t x ::lcode uig
|
645 |
+
::s ژ ::t zh ::lcode uig
|
646 |
+
::s ئۆ ::t oe ::t-alt o ::lcode uig ::latinplus ö
|
647 |
+
::s ۆ ::t oe ::t-alt o ::lcode uig ::latinplus ö
|
648 |
+
::s ئۈ ::t ue ::t-alt u ::lcode uig ::latinplus ü
|
649 |
+
::s ۈ ::t ue ::t-alt u ::lcode uig ::latinplus ü
|
650 |
+
::s ۋ ::t w ::lcode uig
|
651 |
+
|
652 |
+
# Maldivian
|
653 |
+
::s ް ::t ::comment thaana sukun
|
654 |
+
::s ަ ::t a ::comment thaana abafili
|
655 |
+
::s ާ ::t aa ::comment thaana aabaafili
|
656 |
+
::s ި ::t i ::comment thaana ibifili
|
657 |
+
::s ީ ::t ee ::comment thaana eebeefili
|
658 |
+
::s ު ::t u ::comment thaana ubufili
|
659 |
+
::s ޫ ::t oo ::comment thaana ooboofili
|
660 |
+
::s ެ ::t e ::comment thaana ebefili
|
661 |
+
::s ޭ ::t ey ::comment thaana eybeyfili
|
662 |
+
::s ޮ ::t o ::comment thaana obofili
|
663 |
+
::s ޯ ::t oa ::comment thaana oaboafili
|
664 |
+
|
665 |
+
# Canadian syllabics (Inuktitut)
|
666 |
+
::s ᑊ ::t p ::comment syllable final
|
667 |
+
::s ᐟ ::t t ::comment syllable final
|
668 |
+
::s ᐠ ::t k ::comment syllable final
|
669 |
+
::s ᐨ ::t c ::comment syllable final
|
670 |
+
::s ᒼ ::t m ::comment syllable final
|
671 |
+
::s ᐣ ::t n ::comment syllable final
|
672 |
+
::s ᐢ ::t s ::comment syllable final
|
673 |
+
::s ᐧ ::t y ::comment syllable final
|
674 |
+
::s ᐤ ::t w ::comment syllable final
|
675 |
+
::s ᐦ ::t h ::comment syllable final
|
676 |
+
::s ᕽ ::t hk ::comment syllable final
|
677 |
+
::s ᓫ ::t l ::comment syllable final
|
678 |
+
::s ᕑ ::t r ::comment syllable final
|
679 |
+
|
680 |
+
## Punctuation
|
681 |
+
# delete
|
682 |
+
::s ¿ ::t "" ::comment inverted question mark
|
683 |
+
::s ¡ ::t "" ::comment inverted exclamation mark
|
684 |
+
# preserve
|
685 |
+
::s ′ ::t ′
|
686 |
+
# Cyrillic
|
687 |
+
::s ⁙ ::t . ::comment five dot punctuation
|
688 |
+
# Amharic/Ethiopian
|
689 |
+
::s ። ::t .
|
690 |
+
::s ፣ ::t ,
|
691 |
+
::s ፤ ::t ;
|
692 |
+
::s ፥ ::t :
|
693 |
+
::s ፡ ::t " " ::comment Ethiopic wordspace
|
694 |
+
::s ፦ ::t : ::comment Ethiopic preface colon
|
695 |
+
::s ቸ ::t cha ::comment Ethiopic syllable ca
|
696 |
+
::s ቹ ::t chu ::comment Ethiopic syllable cu
|
697 |
+
::s ቺ ::t chi ::comment Ethiopic syllable ci
|
698 |
+
::s ቻ ::t chaa ::comment Ethiopic syllable caa
|
699 |
+
::s ቼ ::t chee ::comment Ethiopic syllable cee
|
700 |
+
::s ች ::t che ::comment Ethiopic syllable ce
|
701 |
+
::s ቾ ::t cho ::comment Ethiopic syllable co
|
702 |
+
::s ሠ ::t sa ::comment Ethiopic syllable sza
|
703 |
+
::s ሡ ::t su ::comment Ethiopic syllable szu
|
704 |
+
::s ሢ ::t si ::comment Ethiopic syllable szi
|
705 |
+
::s ሣ ::t saa ::comment Ethiopic syllable szaa
|
706 |
+
::s ሤ ::t see ::comment Ethiopic syllable szee
|
707 |
+
::s ሥ ::t se ::comment Ethiopic syllable sze
|
708 |
+
::s ሦ ::t so ::comment Ethiopic syllable szo
|
709 |
+
::s ጠ ::t ta ::comment Ethiopic syllable tha with ejective 't'
|
710 |
+
::s ጡ ::t tu ::comment Ethiopic syllable thu with ejective 't'
|
711 |
+
::s ጢ ::t ti ::comment Ethiopic syllable thi with ejective 't'
|
712 |
+
::s ጣ ::t taa ::comment Ethiopic syllable thaa with ejective 't'
|
713 |
+
::s ጤ ::t tee ::comment Ethiopic syllable thee with ejective 't'
|
714 |
+
::s ጥ ::t te ::comment Ethiopic syllable the with ejective 't'
|
715 |
+
::s ጦ ::t to ::comment Ethiopic syllable tho with ejective 't'
|
716 |
+
|
717 |
+
# Devanagari (Hindi etc.)
|
718 |
+
::s । ::t . ::comment danda
|
719 |
+
::s ॥ ::t . ::comment double danda
|
720 |
+
::s ৷ ::t . ::comment Bengali currency numerator four; used as danda
|
721 |
+
::s ॰ ::t . ::comment Devanagari abbreviation sign
|
722 |
+
# Oriya/Odia (India)
|
723 |
+
::s ::t . ::comment danda (deprecated, should use Devanagari danda ।)
|
724 |
+
::s ::t . ::comment double danda (deprecated, should use Devanagari double danda ॥)
|
725 |
+
# Tibetan
|
726 |
+
::s ། ::t ,
|
727 |
+
::s །: ::t :
|
728 |
+
::s ༏ ::t ;
|
729 |
+
::s ༎ ::t .
|
730 |
+
::s ༑ ::t , ::comment Tibetan mark run chen spungs shad
|
731 |
+
::s ༼ ::t ( ::comment Tibetan open roof punctuation
|
732 |
+
::s ༽ ::t ) ::comment Tibetan close roof punctuation
|
733 |
+
::s ༈ ::t "" ::comment Tibetan mark srbul shad
|
734 |
+
::s 【 ::t [ ::comment left black lenticular bracket
|
735 |
+
::s 】 ::t ] ::comment right black lenticular bracket
|
736 |
+
::s ༄ ::t "" ::comment Tibetan head mark
|
737 |
+
::s ༄༅ ::t "" ::comment Tibetan head mark
|
738 |
+
::s ༆ ::t "" ::comment Tibetan head mark
|
739 |
+
# Myanmar/Burmese
|
740 |
+
::s ၊ ::t ,
|
741 |
+
::s ။ ::t .
|
742 |
+
# Khmer
|
743 |
+
::s ៖ ::t ; ::comment Khmer sign camnuc pii kuuh
|
744 |
+
::s ។ ::t . ::comment Khmer sign khan
|
745 |
+
# Arabic
|
746 |
+
::s ، ::t ,
|
747 |
+
::s ؛ ::t ;
|
748 |
+
::s ٬ ::t ,
|
749 |
+
::s ۔ ::t .
|
750 |
+
::s ؟ ::t ?
|
751 |
+
::s ٪ ::t %
|
752 |
+
::s ٫ ::t , ::comment Arabic decimal separator
|
753 |
+
::s ۽ ::t & ::comment Arabic sign Sindhi ampersand
|
754 |
+
# Aramaic
|
755 |
+
::s ܀ ::t .
|
756 |
+
::s ܂ ::t .
|
757 |
+
# Hebrew
|
758 |
+
::s ־ ::t - ::comment maqaf
|
759 |
+
# Armenian
|
760 |
+
::s ։ ::t .
|
761 |
+
::s ՝ ::t , ::comment Armenian comma
|
762 |
+
# Chinese
|
763 |
+
::s , ::t ", "
|
764 |
+
::s 、 ::t ", "
|
765 |
+
::s 。 ::t ". "
|
766 |
+
::s ! ::t "! "
|
767 |
+
::s ? ::t "? "
|
768 |
+
::s 「 ::t ' "'
|
769 |
+
::s 」 ::t '" '
|
770 |
+
::s 《 ::t ' "'
|
771 |
+
::s 》 ::t '" '
|
772 |
+
::s ( ::t " ("
|
773 |
+
::s ) ::t ") "
|
774 |
+
::s ; ::t ;
|
775 |
+
::s : ::t ": "
|
776 |
+
::s ︰ ::t ": "
|
777 |
+
::s - ::t -
|
778 |
+
::s / ::t /
|
779 |
+
::s = ::t =
|
780 |
+
::s ~ ::t ~
|
781 |
+
::s & ::t &
|
782 |
+
::s < ::t <
|
783 |
+
::s > ::t >
|
784 |
+
::s % ::t %
|
785 |
+
::s ::t " " ::comment ideographic space
|
786 |
+
# Japanese
|
787 |
+
::s 『 ::t ' "'
|
788 |
+
::s 』 ::t '" '
|
789 |
+
::s ・ ::t " " ::comment Katakana middle dot; separates name elements such as first and last name
|
790 |
+
|
791 |
+
# Symbols
|
792 |
+
::s ∞ ::t ∞ ::comment infinity
|
793 |
+
::s ::t ::comment soft hyphen; used to indicate preferred line breaks; remove
|
794 |
+
::s ֊ ::t - ::comment Armenian hyphen; map to regular hyphen-minus
|
795 |
+
::s ᐩ ::t + ::comment Canadian syllabics final plus; map to regular plus
|
796 |
+
::s ﹐ ::t , ::comment small comma; map to regular comma
|
797 |
+
::s ˚ ::t ° ::comment ring above; map to degree sign
|
798 |
+
::s ⇒ ::t ⇒ ::comment rightwards double arrow
|
799 |
+
::s † ::t † ::comment dagger
|
800 |
+
::s • ::t • ::comment bullet
|
801 |
+
::s ℃ ::t °C ::comment degree Celsius; split into 2 characters
|
802 |
+
::s ℉ ::t °F ::comment degree Fahrenheit; split into 2 characters
|
803 |
+
::s ― ::t ― ::comment horizontal bar
|
804 |
+
::s ˇ ::t ˇ ::comment caron (sometimes apparently used for "Arabic vowel sign small v above" U+065A, e.g. in Gilaki language (glk))
|
805 |
+
::s ″ ::t ″ ::comment double prime
|
806 |
+
::s ﴾ ::t ( ::comment ornate left parenthesis
|
807 |
+
::s ﴿ ::t ) ::comment ornate right parenthesis
|
808 |
+
::s 〔 ::t [ ::comment left tortoise shell bracket
|
809 |
+
::s 〕 ::t ] ::comment right tortoise shell bracket
|
810 |
+
::s ﹝ ::t ( ::comment small left tortoise shell bracket
|
811 |
+
::s ﹞ ::t ) ::comment small right tortoise shell bracket
|
812 |
+
::s ♄ ::t ♄ ::comment Saturn
|
813 |
+
::s ♆ ::t ♆ ::comment Neptune
|
814 |
+
::s ♋ ::t ♋ ::comment Cancer
|
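The table lines above share one layout: `::s` gives the source string, `::t` its romanization, and optional fields such as `::comment`, `::lcode` or `::t-alt` follow. As a minimal sketch only (uroman's own loader lives in its Perl modules and may behave differently), the hypothetical Python helper `parse_rule` below splits such a line into a dictionary. It assumes that a field marker must be preceded by whitespace and that quotes around a value are there only to protect spaces or an empty string.

```python
import re

# A "::key" marker counts as a field marker only at line start or after whitespace (assumption).
_FIELD = re.compile(r"(?:^|\s)::([A-Za-z0-9-]+)")


def parse_rule(line):
    """Split one '::key value ...' rule line into a dict (hypothetical helper)."""
    line = line.rstrip("\n")
    if not line.strip() or line.lstrip().startswith("#"):
        return None  # blank line or '#' section header
    markers = list(_FIELD.finditer(line))
    rule = {}
    for i, m in enumerate(markers):
        # A value runs up to the next marker, or to the end of the line.
        end = markers[i + 1].start() if i + 1 < len(markers) else len(line)
        value = line[m.end():end].strip()
        # Matching outer quotes are removed so that inner spaces survive.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        rule[m.group(1)] = value
    return rule


if __name__ == "__main__":
    print(parse_rule('::s 。 ::t ". "'))
    print(parse_rule('::s ¿ ::t "" ::comment inverted question mark'))
```

Under these assumptions a field marker that is glued to the preceding value (no space before `::`) is not recognized as a separate field, which is why consistent spacing in the table matters.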
uroman/data/string-distance-cost-rules.txt
ADDED
@@ -0,0 +1,896 @@
|
1 |
+
# String distance
|
2 |
+
|
3 |
+
::s1 a ::s2 ::cost 0.1
|
4 |
+
::s1 b ::s2 ::cost 1
|
5 |
+
::s1 b ::s2 ::cost 0.2 ::left1 /[aou]m$/ ::right1 [e] ::lc1 eng ::lc2 zho ::example Balcombe
|
6 |
+
::s1 c ::s2 ::cost 1
|
7 |
+
::s1 c ::s2 ::cost 0.2 ::left1 /[aeou]$/ ::right1 [cgkq] ::lc2 zho
|
8 |
+
::s1 c ::s2 ::cost 0.5 ::left1 /[aeou][lnr]?$/ ::right1 [h] ::lc2 zho
|
9 |
+
::s1 d ::s2 ::cost 1
|
10 |
+
::s1 d ::s2 ::cost 0.5 ::left1 /[aeiou][lnr]$/ ::right1 [-,$ ]
|
11 |
+
::s1 d ::s2 ::cost 0.4 ::lc1 eng ::lc2 zho ::right1 [bcfgklmnpqrstvwxz]
|
12 |
+
::s1 e ::s2 ::cost 0.1
|
13 |
+
::s1 é ::s2 ::cost 0.1
|
14 |
+
::s1 e ::s2 ::cost 0.02 ::lc2 fas
|
15 |
+
::s1 e ::s2 ::cost 0.02 ::lc1 amh ::lc2 eng
|
16 |
+
::s1 f ::s2 ::cost 1
|
17 |
+
::s1 g ::s2 ::cost 1
|
18 |
+
::s1 g ::s2 ::cost 0.4 ::right1 [bcdfghklmnpqrstvwxz] ::lc2 zho
|
19 |
+
::s1 g ::s2 ::cost 0.2 ::right1 [k] ::lc2 zho
|
20 |
+
::s1 h ::s2 ::cost 0.5
|
21 |
+
::s1 h ::s2 ::cost 0.1 ::left1 /[aeiouy]$/ ::right1 [-,bcdfghklmnpqrstvwxz$ ]
|
22 |
+
::s1 h ::s2 ::cost 0.2 ::left1 /[bdlnr]$/ ::right1 [-,$ aeiouy] ::example Delhi, Minh, Riyadh
|
23 |
+
::s1 i ::s2 ::cost 0.1
|
24 |
+
::s1 j ::s2 ::cost 0.5
|
25 |
+
::s1 k ::s2 ::cost 1
|
26 |
+
::s1 l ::s2 ::cost 1
|
27 |
+
::s1 l ::s2 ::cost 0.3 ::left1 /eui$/ ::right1 [-,$ ] ::example Argenteuil
|
28 |
+
::s1 l ::s2 ::cost 0.3 ::left1 /a$/ ::right1 [km] ::comment walk, palm
|
29 |
+
::s1 l ::s2 ::cost 0.3 ::left1 /[aeiou]$/ ::right1 [bdfgkmpstvwz] ::lc2 zho
|
30 |
+
::s1 m ::s2 ::cost 1
|
31 |
+
::s1 n ::s2 ::cost 1
|
32 |
+
::s1 n ::s2 ::cost 0.7 ::right1 [-,$ ]
|
33 |
+
::s1 o ::s2 ::cost 0.1
|
34 |
+
::s1 p ::s2 ::cost 1
|
35 |
+
::s1 q ::s2 ::cost 1
|
36 |
+
::s1 r ::s2 ::cost 1
|
37 |
+
::s1 r ::s2 ::cost 0.5 ::left1 /[aou]$/ ::right1 [-,bcdfghjklmnpqrstvwxz$ ]
|
38 |
+
::s1 r ::s2 ::cost 0.3 ::left1 /[aeiou]$/ ::right1 [-,bcdfghjklmnpqrstvwxz$ ] ::lc2 zho
|
39 |
+
::s1 re ::s2 ::cost 0.4 ::left1 /[ou]$/ ::right1 [-,$ ] ::lc2 zho
|
40 |
+
::s1 re ::s2 ::cost 0.5 ::left1 /[aeiou]$/ ::right1 [-,bcdfghjklmnpqrstvwxz$ ] ::lc2 zho
|
41 |
+
::s1 rr ::s2 ::cost 0.5 ::left1 /[aeiou]$/ ::right1 [-,bcdfghjklmnpqrstvwxz$ ] ::lc2 zho
|
42 |
+
::s1 s ::s2 ::cost 1
|
43 |
+
::s1 s ::s2 ::cost 0.6 ::right1 [-,$ ]
|
44 |
+
::s1 t ::s2 ::cost 1
|
45 |
+
::s1 t ::s2 ::cost 0.5 ::left1 /[aeiou][lnr]?$/ ::right1 [-,$ ]
|
46 |
+
::s1 t ::s2 ::cost 0.6 ::left1 /[bcdfghklmnpqrstvwxz]$/ ::right1 [bcdfghklmnpqrstvwxz]
|
47 |
+
::s1 u ::s2 ::cost 0.1
|
48 |
+
::s1 v ::s2 ::cost 1
|
49 |
+
::s1 w ::s2 ::cost 1
|
50 |
+
::s1 w ::s2 ::cost 0.4 ::lc1 eng ::right1 [i][c][hk][-,$ ] ::example Greenwich, Alnwick
|
51 |
+
::s1 x ::s2 ::cost 1
|
52 |
+
::s1 y ::s2 ::cost 0.3
|
53 |
+
::s1 z ::s2 ::cost 1
|
54 |
+
::s1 ı ::s2 ::cost 0.3
|
55 |
+
::s1 0 ::s2 ::cost 1
|
56 |
+
::s1 1 ::s2 ::cost 1
|
57 |
+
::s1 2 ::s2 ::cost 1
|
58 |
+
::s1 3 ::s2 ::cost 1
|
59 |
+
::s1 4 ::s2 ::cost 1
|
60 |
+
::s1 5 ::s2 ::cost 1
|
61 |
+
::s1 6 ::s2 ::cost 1
|
62 |
+
::s1 7 ::s2 ::cost 1
|
63 |
+
::s1 8 ::s2 ::cost 1
|
64 |
+
::s1 9 ::s2 ::cost 1
|
65 |
+
::s1 ' ::s2 ::cost 0.1
|
66 |
+
::s1 ` ::s2 ::cost 0.1
|
67 |
+
::s1 ( ::s2 ::cost 0.1
|
68 |
+
::s1 ) ::s2 ::cost 0.1
|
69 |
+
::s1 , ::s2 ::cost 0.1
|
70 |
+
::s1 ; ::s2 ::cost 0.1
|
71 |
+
::s1 - ::s2 ::cost 0.1
|
72 |
+
::s1 . ::s2 ::cost 0.1
|
73 |
+
::s1 .. ::s2 ::cost 0.12
|
74 |
+
::s1 ... ::s2 ::cost 0.14
|
75 |
+
::s1 ? ::s2 ::cost 0.2
|
76 |
+
::s1 ! ::s2 ::cost 0.2
|
77 |
+
::s1 ‼ ::s2 ::cost 0.2
|
78 |
+
::s1 ‼ ::s2 !! ::cost 0.02
|
79 |
+
::s1 ‼ ::s2 ! ::cost 0.1
|
80 |
+
::s1 / ::s2 ::cost 0.1
|
81 |
+
::s1 : ::s2 ::cost 0.1
|
82 |
+
::s1 ː ::s2 ::cost 0.1
|
83 |
+
::s1 ː ::s2 : ::cost 0.1
|
84 |
+
::s1 « ::s2 ::cost 0.1
|
85 |
+
::s1 » ::s2 ::cost 0.1
|
86 |
+
::s1 – ::s2 ::cost 0.1
|
87 |
+
::s1 – ::s2 - ::cost 0.05
|
88 |
+
::s1 — ::s2 ::cost 0.15
|
89 |
+
::s1 — ::s2 - ::cost 0.1
|
90 |
+
::s1 — ::s2 – ::cost 0.05
|
91 |
+
::s1 ─ ::s2 ::cost 0.2
|
92 |
+
::s1 ─ ::s2 - ::cost 0.15
|
93 |
+
::s1 ─ ::s2 – ::cost 0.1
|
94 |
+
::s1 ─ ::s2 — ::cost 0.05
|
95 |
+
::s1 ’ ::s2 ::cost 0.1
|
96 |
+
::s1 ʼ ::s2 ::cost 0.1
|
97 |
+
::s1 " " ::s2 ::cost 0.1
|
98 |
+
::s1 “ ::s2 ::cost 0.1
|
99 |
+
::s1 ” ::s2 ::cost 0.1
|
100 |
+
::s1 ″ ::s2 ::cost 0.1
|
101 |
+
::s1 # ::s2 ::cost 0.3
|
102 |
+
::s1 + ::s2 ::cost 0.3
|
103 |
+
::s1 * ::s2 ::cost 0.3
|
104 |
+
::s1 = ::s2 ::cost 0.3
|
105 |
+
::s1 < ::s2 ::cost 0.3
|
106 |
+
::s1 > ::s2 ::cost 0.3
|
107 |
+
::s1 [ ::s2 ::cost 0.3
|
108 |
+
::s1 ] ::s2 ::cost 0.3
|
109 |
+
::s1 { ::s2 ::cost 0.3
|
110 |
+
::s1 } ::s2 ::cost 0.3
|
111 |
+
::s1 | ::s2 ::cost 0.3
|
112 |
+
::s1 & ::s2 ::cost 0.3
|
113 |
+
::s1 _ ::s2 ::cost 0.3
|
114 |
+
::s1 • ::s2 ::cost 0.1
|
115 |
+
::s1 · ::s2 ::cost 0.1
|
116 |
+
::s1 ◦ ::s2 ::cost 0.1
|
117 |
+
::s1 ° ::s2 ::cost 0.1
|
118 |
+
::s1 … ::s2 ::cost 0.1
|
119 |
+
::s1 … ::s2 ... ::cost 0
|
120 |
+
::s1 @ ::s2 ::cost 0.3
|
121 |
+
::s1 © ::s2 ::cost 0.3
|
122 |
+
::s1 © ::s2 (c) ::cost 0.1
|
123 |
+
|
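Several of the cost rules above restrict when a cheaper cost applies, using `::left1 /regex/` for the text before `::s1` and `::right1 [..][..]` for the characters after it. The sketch below shows one plausible reading of those two fields in Python. It is an assumption, not the logic of uroman's `stringDistance.pm`: the helper names `left_ok` and `right_ok` are hypothetical, each bracket class is taken to constrain one following character, and `$` inside a class is taken to mean "end of string".

```python
import re


def left_ok(pattern, left):
    """pattern looks like '/[aou]m$/'; it is searched in the text left of the match."""
    return re.search(pattern.strip("/"), left) is not None


def right_ok(classes, right):
    """classes looks like '[i][c][hk][-,$ ]': one bracket class per following character.

    Assumption: '$' inside a class stands for end of string.
    """
    for i, cls in enumerate(re.findall(r"\[([^\]]*)\]", classes)):
        if i < len(right):
            # A real character must be one of the listed characters.
            if right[i] not in cls.replace("$", ""):
                return False
        elif "$" not in cls:
            # The text ran out but this class does not allow end of string.
            return False
    return True


if __name__ == "__main__":
    # 'w' in "Greenwich": the text after 'w' is "ich", then the string ends.
    print(right_ok("[i][c][hk][-,$ ]", "ich"))
    # 'b' in "Balcombe": the text before 'b' is "Balcom", the text after is "e".
    print(left_ok("/[aou]m$/", "Balcom") and right_ok("[e]", "e"))
```

With this reading, the `w` rule with its `Greenwich, Alnwick` example would apply to a `w` followed by `ich` or `ick` at the end of a word, but not to the `w` in `wicker`.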
124 |
+
|
125 |
+
::s1 a ::s2 aa ::cost 0.02
|
126 |
+
::s1 a ::s2 aaa ::cost 0.03
|
127 |
+
::s1 a ::s2 aaaa ::cost 0.03
|
128 |
+
::s1 a ::s2 aaaaa ::cost 0.03
|
129 |
+
::s1 a ::s2 aaaaaa ::cost 0.04
|
130 |
+
::s1 a ::s2 aaaaaaa ::cost 0.04
|
131 |
+
::s1 a ::s2 aaaaaaaa ::cost 0.04
|
132 |
+
::s1 a ::s2 aaaaaaaaa ::cost 0.04
|
133 |
+
::s1 a ::s2 aaaaaaaaaa ::cost 0.04
|
134 |
+
::s1 a ::s2 aaaaaaaaaaa ::cost 0.04
|
135 |
+
::s1 a ::s2 aaaaaaaaaaaa ::cost 0.04
|
136 |
+
::s1 a ::s2 aaaaaaaaaaaaa ::cost 0.04
|
137 |
+
::s1 a ::s2 aaaaaaaaaaaaaa ::cost 0.04
|
138 |
+
::s1 a ::s2 aaaaaaaaaaaaaaa ::cost 0.04
|
139 |
+
::s1 a ::s2 aaaaaaaaaaaaaaaa ::cost 0.04
|
140 |
+
::s1 b ::s2 bb ::cost 0.02
|
141 |
+
::s1 b ::s2 bbb ::cost 0.03
|
142 |
+
::s1 b ::s2 bbbb ::cost 0.03
|
143 |
+
::s1 b ::s2 bbbbb ::cost 0.03
|
144 |
+
::s1 c ::s2 cc ::cost 0.02
|
145 |
+
::s1 c ::s2 ccc ::cost 0.03
|
146 |
+
::s1 c ::s2 cccc ::cost 0.03
|
147 |
+
::s1 c ::s2 ccccc ::cost 0.03
|
148 |
+
::s1 d ::s2 dd ::cost 0.02
|
149 |
+
::s1 d ::s2 ddd ::cost 0.03
|
150 |
+
::s1 d ::s2 dddd ::cost 0.03
|
151 |
+
::s1 d ::s2 ddddd ::cost 0.03
|
152 |
+
::s1 e ::s2 ee ::cost 0.02
|
153 |
+
::s1 e ::s2 eee ::cost 0.03
|
154 |
+
::s1 e ::s2 eeee ::cost 0.03
|
155 |
+
::s1 e ::s2 eeeee ::cost 0.03
|
156 |
+
::s1 e ::s2 eeeeee ::cost 0.04
|
157 |
+
::s1 e ::s2 eeeeeee ::cost 0.04
|
158 |
+
::s1 e ::s2 eeeeeeee ::cost 0.04
|
159 |
+
::s1 e ::s2 eeeeeeeee ::cost 0.04
|
160 |
+
::s1 e ::s2 eeeeeeeeee ::cost 0.04
|
161 |
+
::s1 e ::s2 eeeeeeeeeee ::cost 0.04
|
162 |
+
::s1 e ::s2 eeeeeeeeeeee ::cost 0.04
|
163 |
+
::s1 e ::s2 eeeeeeeeeeeee ::cost 0.04
|
164 |
+
::s1 e ::s2 eeeeeeeeeeeeee ::cost 0.04
|
165 |
+
::s1 e ::s2 eeeeeeeeeeeeeee ::cost 0.04
|
166 |
+
::s1 e ::s2 eeeeeeeeeeeeeeee ::cost 0.04
|
167 |
+
::s1 f ::s2 ff ::cost 0.02
|
168 |
+
::s1 f ::s2 fff ::cost 0.03
|
169 |
+
::s1 f ::s2 ffff ::cost 0.03
|
170 |
+
::s1 f ::s2 fffff ::cost 0.03
|
171 |
+
::s1 g ::s2 gg ::cost 0.02
|
172 |
+
::s1 g ::s2 ggg ::cost 0.03
|
173 |
+
::s1 g ::s2 gggg ::cost 0.03
|
174 |
+
::s1 g ::s2 ggggg ::cost 0.03
|
175 |
+
::s1 h ::s2 hh ::cost 0.02
|
176 |
+
::s1 h ::s2 hhh ::cost 0.03
|
177 |
+
::s1 h ::s2 hhhh ::cost 0.03
|
178 |
+
::s1 h ::s2 hhhhh ::cost 0.03
|
179 |
+
::s1 i ::s2 ii ::cost 0.02
|
180 |
+
::s1 i ::s2 iii ::cost 0.03
|
181 |
+
::s1 i ::s2 iiii ::cost 0.03
|
182 |
+
::s1 i ::s2 iiiii ::cost 0.03
|
183 |
+
::s1 i ::s2 iiiiii ::cost 0.04
|
184 |
+
::s1 i ::s2 iiiiiii ::cost 0.04
|
185 |
+
::s1 i ::s2 iiiiiiii ::cost 0.04
|
186 |
+
::s1 i ::s2 iiiiiiiii ::cost 0.04
|
187 |
+
::s1 i ::s2 iiiiiiiiii ::cost 0.04
|
188 |
+
::s1 i ::s2 iiiiiiiiiii ::cost 0.04
|
189 |
+
::s1 i ::s2 iiiiiiiiiiii ::cost 0.04
|
190 |
+
::s1 i ::s2 iiiiiiiiiiiii ::cost 0.04
|
191 |
+
::s1 i ::s2 iiiiiiiiiiiiii ::cost 0.04
|
192 |
+
::s1 i ::s2 iiiiiiiiiiiiiii ::cost 0.04
|
193 |
+
::s1 i ::s2 iiiiiiiiiiiiiiii ::cost 0.04
|
194 |
+
::s1 j ::s2 jj ::cost 0.02
|
195 |
+
::s1 j ::s2 jjj ::cost 0.03
|
196 |
+
::s1 j ::s2 jjjj ::cost 0.03
|
197 |
+
::s1 j ::s2 jjjjj ::cost 0.03
|
198 |
+
::s1 k ::s2 kk ::cost 0.02
|
199 |
+
::s1 k ::s2 kkk ::cost 0.03
|
200 |
+
::s1 k ::s2 kkkk ::cost 0.03
|
201 |
+
::s1 k ::s2 kkkkk ::cost 0.03
|
202 |
+
::s1 l ::s2 ll ::cost 0.02
|
203 |
+
::s1 l ::s2 lll ::cost 0.03
|
204 |
+
::s1 l ::s2 llll ::cost 0.03
|
205 |
+
::s1 l ::s2 lllll ::cost 0.03
|
206 |
+
::s1 m ::s2 mm ::cost 0.02
|
207 |
+
::s1 m ::s2 mmm ::cost 0.03
|
208 |
+
::s1 m ::s2 mmmm ::cost 0.03
|
209 |
+
::s1 m ::s2 mmmmm ::cost 0.03
|
210 |
+
::s1 n ::s2 nn ::cost 0.02
|
211 |
+
::s1 n ::s2 nnn ::cost 0.03
|
212 |
+
::s1 n ::s2 nnnn ::cost 0.03
|
213 |
+
::s1 n ::s2 nnnnn ::cost 0.03
|
214 |
+
::s1 o ::s2 oo ::cost 0.02
|
215 |
+
::s1 o ::s2 ooo ::cost 0.03
|
216 |
+
::s1 o ::s2 oooo ::cost 0.03
|
217 |
+
::s1 o ::s2 ooooo ::cost 0.03
|
218 |
+
::s1 o ::s2 oooooo ::cost 0.04
|
219 |
+
::s1 o ::s2 ooooooo ::cost 0.04
|
220 |
+
::s1 o ::s2 oooooooo ::cost 0.04
|
221 |
+
::s1 o ::s2 ooooooooo ::cost 0.04
|
222 |
+
::s1 o ::s2 oooooooooo ::cost 0.04
|
223 |
+
::s1 o ::s2 ooooooooooo ::cost 0.04
|
224 |
+
::s1 o ::s2 oooooooooooo ::cost 0.04
|
225 |
+
::s1 o ::s2 ooooooooooooo ::cost 0.04
|
226 |
+
::s1 o ::s2 oooooooooooooo ::cost 0.04
|
227 |
+
::s1 o ::s2 ooooooooooooooo ::cost 0.04
|
228 |
+
::s1 o ::s2 oooooooooooooooo ::cost 0.04
|
229 |
+
::s1 p ::s2 pp ::cost 0.02
|
230 |
+
::s1 p ::s2 ppp ::cost 0.03
|
231 |
+
::s1 p ::s2 pppp ::cost 0.03
|
232 |
+
::s1 p ::s2 ppppp ::cost 0.03
|
233 |
+
::s1 q ::s2 qq ::cost 0.02
|
234 |
+
::s1 q ::s2 qqq ::cost 0.03
|
235 |
+
::s1 q ::s2 qqqq ::cost 0.03
|
236 |
+
::s1 q ::s2 qqqqq ::cost 0.03
|
237 |
+
::s1 r ::s2 rr ::cost 0.02
|
238 |
+
::s1 r ::s2 rrr ::cost 0.03
|
239 |
+
::s1 r ::s2 rrrr ::cost 0.03
|
240 |
+
::s1 r ::s2 rrrrr ::cost 0.03
|
241 |
+
::s1 s ::s2 ss ::cost 0.02
|
242 |
+
::s1 s ::s2 sss ::cost 0.03
|
243 |
+
::s1 s ::s2 ssss ::cost 0.03
|
244 |
+
::s1 s ::s2 sssss ::cost 0.03
|
245 |
+
::s1 t ::s2 tt ::cost 0.02
|
246 |
+
::s1 t ::s2 ttt ::cost 0.03
|
247 |
+
::s1 t ::s2 tttt ::cost 0.03
|
248 |
+
::s1 t ::s2 ttttt ::cost 0.03
|
249 |
+
::s1 u ::s2 uu ::cost 0.02
|
250 |
+
::s1 u ::s2 uuu ::cost 0.03
|
251 |
+
::s1 u ::s2 uuuu ::cost 0.03
|
252 |
+
::s1 u ::s2 uuuuu ::cost 0.03
|
253 |
+
::s1 u ::s2 uuuuuu ::cost 0.04
|
254 |
+
::s1 u ::s2 uuuuuuu ::cost 0.04
|
255 |
+
::s1 u ::s2 uuuuuuuu ::cost 0.04
|
256 |
+
::s1 u ::s2 uuuuuuuuu ::cost 0.04
|
257 |
+
::s1 u ::s2 uuuuuuuuuu ::cost 0.04
|
258 |
+
::s1 u ::s2 uuuuuuuuuuu ::cost 0.04
|
259 |
+
::s1 u ::s2 uuuuuuuuuuuu ::cost 0.04
|
260 |
+
::s1 u ::s2 uuuuuuuuuuuuu ::cost 0.04
|
261 |
+
::s1 u ::s2 uuuuuuuuuuuuuu ::cost 0.04
|
262 |
+
::s1 u ::s2 uuuuuuuuuuuuuuu ::cost 0.04
|
263 |
+
::s1 u ::s2 uuuuuuuuuuuuuuuu ::cost 0.04
|
264 |
+
::s1 v ::s2 vv ::cost 0.02
|
265 |
+
::s1 v ::s2 vvv ::cost 0.03
|
266 |
+
::s1 v ::s2 vvvv ::cost 0.03
|
267 |
+
::s1 v ::s2 vvvvv ::cost 0.03
|
268 |
+
::s1 w ::s2 ww ::cost 0.02
|
269 |
+
::s1 w ::s2 www ::cost 0.03
|
270 |
+
::s1 w ::s2 wwww ::cost 0.03
|
271 |
+
::s1 w ::s2 wwwww ::cost 0.03
|
272 |
+
::s1 x ::s2 xx ::cost 0.02
|
273 |
+
::s1 x ::s2 xxx ::cost 0.03
|
274 |
+
::s1 x ::s2 xxxx ::cost 0.03
|
275 |
+
::s1 x ::s2 xxxxx ::cost 0.03
|
276 |
+
::s1 y ::s2 yy ::cost 0.02
|
277 |
+
::s1 y ::s2 yyy ::cost 0.03
|
278 |
+
::s1 y ::s2 yyyy ::cost 0.03
|
279 |
+
::s1 y ::s2 yyyyy ::cost 0.03
|
280 |
+
::s1 z ::s2 zz ::cost 0.02
|
281 |
+
::s1 z ::s2 zzz ::cost 0.03
|
282 |
+
::s1 z ::s2 zzzz ::cost 0.03
|
283 |
+
::s1 z ::s2 zzzzz ::cost 0.03
|
284 |
+
::s1 " " ::s2 " " ::cost 0
|
285 |
+
::s1 . ::s2 ::left1 /\./ ::left2 /\./ ::cost 0.02
|
286 |
+
::s1 … ::s2 ::left1 /…/ ::left2 /…/ ::cost 0.01
|
287 |
+
::s1 _ ::s2 ::left1 /_/ ::left2 /_/ ::cost 0.01
|
288 |
+
::s1 = ::s2 ::left1 /=/ ::left2 /=/ ::cost 0.01
|
289 |
+
::s1 ! ::s2 ::left1 /!/ ::left2 /!/ ::cost 0.02
|
290 |
+
::s1 ? ::s2 ::left1 /\?/ ::left2 /\?/ ::cost 0.02
|
291 |
+
::s1 aa ::s2 aː ::cost 0.02
|
292 |
+
::s1 ee ::s2 eː ::cost 0.02
|
293 |
+
::s1 ii ::s2 iː ::cost 0.02
|
294 |
+
::s1 oo ::s2 oː ::cost 0.02
|
295 |
+
::s1 uu ::s2 uː ::cost 0.02
|
296 |
+
|
297 |
+
::s1 a ::s2 e ::cost 0.1
|
298 |
+
::s1 au ::s2 o ::cost 0.1 ::lc1 eng
|
299 |
+
::s1 aw ::s2 o ::cost 0.3 ::right1 [-,bcdfghklmnpqrstvwxz$ ]
|
300 |
+
::s1 aw ::s2 o ::cost 0.1 ::right1 [-,bcdfghklmnpqrstvwxz$ ] ::lc1 eng
|
301 |
+
::s1 aw ::s2 a ::cost 0.2 ::right1 [-,bcdfghklmnpqrstvwxz$ ] ::lc1 eng
|
302 |
+
::s1 ay ::s2 i ::cost 0.02 ::lc1 fas ::lc2 eng
|
303 |
+
::s1 aye ::s2 ae ::cost 0.05 ::lc1 fas
|
304 |
+
::s1 é ::s2 e ::cost 0.05
|
305 |
+
::s1 e ::s2 i ::cost 0.15
|
306 |
+
::s1 e ::s2 i ::cost 0.1 ::lc1 uig ::lc2 uig
|
307 |
+
::s1 e ::s2 y ::cost 0.15
|
308 |
+
::s1 ew ::s2 u ::cost 0.3 ::right1 [-,bcdfghklmnpqrstvwxz$ ]
|
309 |
+
::s1 ew ::s2 u ::cost 0.1 ::right1 [-,bcdfghklmnpqrstvwxz$ ] ::lc1 eng
|
310 |
+
::s1 ew ::s2 u ::cost 0.3 ::right1 [aei][lgnrst] ::lc1 eng
|
311 |
+
::s1 ew ::s2 e ::cost 0.3 ::right1 [-,bcdfghklmnpqrstvwxz$ ] ::lc1 eng
|
312 |
+
::s1 i ::s2 a ::cost 0.1 ::right1 [-,$ ] ::lc1 fas
|
313 |
+
::s1 i ::s2 ea ::cost 0.03 ::lc2 eng
|
314 |
+
::s1 i ::s2 ee ::cost 0.03 ::lc2 eng
|
315 |
+
::s1 i ::s2 ei ::cost 0.05 ::lc2 eng
|
316 |
+
::s1 i ::s2 ie ::cost 0.03 ::lc2 eng
|
317 |
+
::s1 i ::s2 ı ::cost 0.05
|
318 |
+
::s1 i ::s2 e ::cost 0.1 ::lc2 eng
|
319 |
+
::s1 i ::s2 y ::cost 0.15
|
320 |
+
::s1 i ::s2 y ::cost 0.1 ::right2 [-,bcdfghklmnpqrstvwxz$ ]
|
321 |
+
::s1 ie ::s2 ei ::cost 0.15
|
322 |
+
::s1 ie ::s2 y ::cost 0.15
|
323 |
+
::s1 ij ::s2 ai ::cost 0.15
|
324 |
+
::s1 o ::s2 u ::cost 0.1
|
325 |
+
::s1 oo ::s2 u ::cost 0.1
|
326 |
+
::s1 ow ::s2 au ::cost 0.2 ::right1 [-,bcdfghklmnpqrstvwxz$ ]
|
327 |
+
::s1 ow ::s2 o ::cost 0.2 ::right1 [-,bcdfghklmnpqrstvwxz$ ]
|
328 |
+
::s1 ow ::s2 o ::cost 0.2 ::lc1 eng ::lc2 zho ::right1 [e]
|
329 |
+
::s1 ow ::s2 o ::cost 0.4 ::lc1 eng ::lc2 zho ::right1 [iy]
|
330 |
+
::s1 u ::s2 a ::cost 0.1 ::lc1 eng ::right1 [-,bcdfghklmnpqrstvwxz][bcdfghklmnpqrstvwxz$ ]
|
331 |
+
::s1 u ::s2 ou ::cost 0.05
|
332 |
+
::s1 u ::s2 yu ::cost 0.05 ::left1 /^(.*[- ])?$/
|
333 |
+
::s1 yeo ::s2 eo ::cost 0.1 ::lc1 fas
|
334 |
+
|
335 |
+
# Amharic
|
336 |
+
::s1 a ::s2 e ::cost 0.05 ::lc1 amh
|
337 |
+
::s1 aa ::s2 o ::cost 0.15 ::lc1 amh
|
338 |
+
::s1 aawe ::s2 au ::cost 0.05 ::lc1 amh
|
339 |
+
::s1 aawe ::s2 ao ::cost 0.1 ::lc1 amh
|
340 |
+
::s1 aawe ::s2 ou ::cost 0.1 ::lc1 amh
|
341 |
+
::s1 aawo ::s2 ao ::cost 0.05 ::lc1 amh
|
342 |
+
::s1 aaye ::s2 ai ::cost 0.05 ::lc1 amh
|
343 |
+
::s1 aaye ::s2 i ::cost 0.1 ::lc1 amh
|
344 |
+
::s1 aaye ::s2 ei ::cost 0.1 ::lc1 amh
|
345 |
+
::s1 awe ::s2 au ::cost 0.05 ::lc1 amh
|
346 |
+
::s1 awe ::s2 ao ::cost 0.1 ::lc1 amh
|
347 |
+
::s1 awe ::s2 ou ::cost 0.1 ::lc1 amh
|
348 |
+
::s1 ee ::s2 ai ::cost 0.1 ::lc1 amh
|
349 |
+
::s1 eewo ::s2 eo ::cost 0.05 ::lc1 amh
|
350 |
+
::s1 eeyaa ::s2 ea ::cost 0.1 ::lc1 amh
|
351 |
+
::s1 eeye ::s2 ai ::cost 0.1 ::lc1 amh
|
352 |
+
::s1 ewee ::s2 ue ::cost 0.1 ::lc1 amh
|
353 |
+
::s1 gwaa ::s2 gua ::cost 0.05 ::lc1 amh
|
354 |
+
::s1 iya ::s2 ie ::cost 0.05 ::lc1 amh
|
355 |
+
::s1 iyaa ::s2 ia ::cost 0.05 ::lc1 amh
|
356 |
+
::s1 iyo ::s2 io ::cost 0.05 ::lc1 amh
|
357 |
+
::s1 kxaa ::s2 kha ::cost 0.05 ::lc1 amh
|
358 |
+
::s1 liyaa ::s2 llia ::cost 0.05 ::lc1 amh
|
359 |
+
::s1 qaa ::s2 cca ::cost 0.05 ::lc1 amh
|
360 |
+
::s1 uwaa ::s2 ua ::cost 0.05 ::lc1 amh
|
361 |
+
::s1 uwee ::s2 ue ::cost 0.05 ::lc1 amh
|
362 |
+
::s1 uwi ::s2 oui ::cost 0.05 ::lc1 amh
|
363 |
+
::s1 uwi ::s2 ui ::cost 0.05 ::lc1 amh
|
364 |
+
::s1 xaaye ::s2 hai ::cost 0.1 ::lc1 amh
|
365 |
+
::s1 xwaa ::s2 jua ::cost 0.1 ::lc1 amh
|
366 |
+
::s1 ziyaa ::s2 sia ::cost 0.05 ::lc1 amh
|
367 |
+
::s1 w ::s2 ::cost 0.3 ::lc1 amh ::left1 /[aeiou]$/ ::right1 [aeiou]
|
368 |
+
::s1 y ::s2 ::cost 0.1 ::lc1 amh ::left1 /[aeiou]$/ ::right1 [aeiou]
|
369 |
+
# abbreviations
|
370 |
+
::s1 ee. ::s2 a ::cost 0.02 ::lc1 amh ::left1 /^(.*[- ])?$/
|
371 |
+
::s1 si. ::s2 c ::cost 0.02 ::lc1 amh ::left1 /^(.*[- ])?$/
|
372 |
+
::s1 di. ::s2 d ::cost 0.02 ::lc1 amh ::left1 /^(.*[- ])?$/
|
373 |
+
::s1 eefe. ::s2 f ::cost 0.02 ::lc1 amh ::left1 /^(.*[- ])?$/
|
374 |
+
::s1 are. ::s2 r ::cost 0.02 ::lc1 amh ::left1 /^(.*[- ])?$/
|
375 |
+
|
376 |
+
# Arabic
|
377 |
+
::s1 ::s2 a ::cost 0.02 ::lc1 ara
|
378 |
+
::s1 ::s2 e ::cost 0.02 ::lc1 ara
|
379 |
+
::s1 ::s2 i ::cost 0.05 ::lc1 ara
|
380 |
+
::s1 ::s2 o ::cost 0.05 ::lc1 ara
|
381 |
+
::s1 ::s2 p ::cost 0.15 ::lc1 ara ::left2 /m$/ ::right2 [dfgklmnpqrstvwz]
|
382 |
+
::s1 ::s2 u ::cost 0.05 ::lc1 ara
|
383 |
+
::s1 y ::s2 a ::cost 0.15 ::lc1 ara
|
384 |
+
::s1 y ::s2 e ::cost 0.05 ::lc1 ara
|
385 |
+
::s1 y ::s2 ea ::cost 0.02 ::lc1 ara
|
386 |
+
::s1 y ::s2 ee ::cost 0.02 ::lc1 ara
|
387 |
+
::s1 y ::s2 i ::cost 0.02 ::lc1 ara
|
388 |
+
::s1 y ::s2 ie ::cost 0.02 ::lc1 ara
|
389 |
+
::s1 b ::s2 p ::cost 0.02 ::lc1 ara
|
390 |
+
::s1 b ::s2 pp ::cost 0.03 ::lc1 ara
|
391 |
+
::s1 f ::s2 v ::cost 0.02 ::lc1 ara
|
392 |
+
::s1 fyl ::s2 ville ::right2 [-,$ ] ::cost 0.05 ::lc1 ara
|
393 |
+
::s1 gh ::s2 g ::right2 [abcdfgklmnopqrstuvwz] ::cost 0.05 ::lc1 ara
|
394 |
+
::s1 ghz ::s2 gs ::cost 0.05 ::lc1 ara
|
395 |
+
::s1 j ::s2 g ::cost 0.2 ::lc1 ara
|
396 |
+
::s1 kh ::s2 g ::cost 0.3 ::lc1 ara ::right2 [eiy]
|
397 |
+
::s1 q ::s2 g ::cost 0.2 ::lc1 ara ::right2 [arouz]
|
398 |
+
::s1 q ::s2 gg ::cost 0.2 ::lc1 ara ::right2 [arouz]
|
399 |
+
::s1 th ::s2 z ::cost 0.4 ::lc1 ara ::right2 [aou] ::comment Spanish
|
400 |
+
::s1 " (" ::s2 ", " ::cost 0.02 ::lc1 ara
|
401 |
+
::s1 ) ::s2 ::right2 [-,$ ] ::cost 0.02 ::lc1 ara
|
402 |
+
|
403 |
+
# Bengali
|
404 |
+
::s1 aoyaa ::s2 wa ::cost 0.1 ::lc1 ben
|
405 |
+
::s1 aoye ::s2 way ::cost 0.1 ::lc1 ben
|
406 |
+
::s1 bhaa ::s2 ve ::cost 0.1 ::lc1 ben
|
407 |
+
::s1 bh ::s2 v ::cost 0.2 ::lc1 ben
|
408 |
+
::s1 bh ::s2 w ::cost 0.2 ::lc1 ben
|
409 |
+
::s1 b ::s2 v ::cost 0.3 ::lc1 ben
|
410 |
+
::s1 b ::s2 w ::cost 0.3 ::lc1 ben
|
411 |
+
::s1 dda ::s2 rh ::right2 [-,$ ] ::cost 0.2 ::lc1 ben
|
412 |
+
::s1 dd ::s2 r ::cost 0.4 ::lc1 ben
|
413 |
+
::s1 gk ::s2 k ::cost 0.05 ::lc1 ben
|
414 |
+
::s1 h ::s2 g ::right2 [eiy] ::cost 0.4 ::lc1 ben
|
415 |
+
::s1 h ::s2 j ::cost 0.4 ::lc1 ben
|
416 |
+
::s1 hoyaai ::s2 whi ::cost 0.05 ::lc1 ben
|
417 |
+
::s1 j ::s2 z ::cost 0.1 ::lc1 ben
|
418 |
+
::s1 j ::s2 s ::cost 0.3 ::lc1 ben
|
419 |
+
::s1 myaaka ::s2 mc ::cost 0.1 ::lc1 ben
|
420 |
+
::s1 myaaka ::s2 mac ::cost 0.1 ::lc1 ben
|
421 |
+
::s1 oyaa ::s2 wa ::cost 0.02 ::lc1 ben
|
422 |
+
::s1 oyaa ::s2 wo ::cost 0.1 ::lc1 ben
|
423 |
+
::s1 oyena ::s2 owen ::cost 0.1 ::lc1 ben
|
424 |
+
::s1 ph ::s2 v ::cost 0.1 ::lc1 ben
|
425 |
+
::s1 phana ::s2 von ::cost 0.1 ::lc1 ben
|
426 |
+
::s1 rhio ::s2 gio ::cost 0.2 ::lc1 ben
|
427 |
+
::s1 sh ::s2 s ::cost 0.4 ::lc1 ben
|
428 |
+
::s1 ss ::s2 sh ::left1 /[k]$/ ::cost 0.15 ::lc1 ben
|
429 |
+
::s1 ss ::s2 sh ::cost 0.3 ::lc1 ben
|
430 |
+
::s1 o ::s2 wo ::cost 0.2 ::lc1 ben ::left1 /^(.*[-, ]?)$/
|
431 |
+
::s1 oye ::s2 we ::cost 0.2 ::lc1 ben
|
432 |
+
::s1 tta ::s2 tho ::cost 0.3 ::lc1 ben
|
433 |
+
::s1 tthaa ::s2 ta ::cost 0.3 ::lc1 ben
|
434 |
+
::s1 u ::s2 wo ::cost 0.2 ::lc1 ben ::left1 /^(.*[-, ]?)$/
|
435 |
+
::s1 u ::s2 woo ::cost 0.2 ::lc1 ben ::left1 /^(.*[-, ]?)$/
|
436 |
+
::s1 u ::s2 wu ::cost 0.2 ::lc1 ben ::left1 /^(.*[-, ]?)$/
|
437 |
+
::s1 ui ::s2 wi ::cost 0.02 ::lc1 ben ::left1 /^(.*[-, ]?)$/
|
438 |
+
::s1 yaa ::s2 wa ::cost 0.3 ::lc1 ben
|
439 |
+
::s1 ye ::s2 we ::cost 0.3 ::lc1 ben
|
440 |
+
|
441 |
+
# Russian
|
442 |
+
::s1 ::s2 os ::cost 0.4 ::left2 /[bcdfghilmnprstvx]$/ ::right2 [-,$ ] ::lc1 rus
|
443 |
+
::s1 ::s2 us ::cost 0.4 ::left2 /[bcdfghilmnprstvx]$/ ::right2 [-,$ ] ::lc1 rus
|
444 |
+
::s1 av ::s2 au ::cost 0.05 ::lc1 rus
|
445 |
+
::s1 ch ::s2 cz ::cost 0.1 ::lc1 rus ::comment Polish
|
446 |
+
::s1 chch ::s2 cci ::right2 [aou] ::cost 0.1 ::lc1 rus
|
447 |
+
::s1 chch ::s2 cc ::right2 [eiy] ::cost 0.1 ::lc1 rus
|
448 |
+
::s1 chzh ::s2 zh ::cost 0.1 ::lc1 rus
|
449 |
+
::s1 dz ::s2 zz ::cost 0.1 ::lc1 rus ::right2 [aeiouy]
|
450 |
+
::s1 dz ::s2 j ::cost 0.3 ::lc1 rus ::right2 [aeiouy] ::comment Japanese
|
451 |
+
::s1 dzh ::s2 g ::cost 0.05 ::lc1 rus ::right2 [eiy]
|
452 |
+
::s1 dzh ::s2 gg ::cost 0.05 ::lc1 rus ::right2 [eiy]
|
453 |
+
::s1 dzh ::s2 j ::cost 0.05 ::lc1 rus
|
454 |
+
::s1 ev ::s2 eu ::cost 0.1 ::lc1 rus
|
455 |
+
::s1 f ::s2 th ::cost 0.6 ::lc1 rus
|
456 |
+
::s1 ievye ::s2 iaceae ::cost 0.02 ::right1 [-,$ ] ::lc1 rus ::comment scientific names for families of species
|
457 |
+
::s1 ii ::s2 ius ::cost 0.2 ::right1 [-,$ ] ::lc1 rus
|
458 |
+
::s1 i ::s2 j ::cost 0.2 ::lc1 rus
|
459 |
+
::s1 naya ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::suffix adjective
|
460 |
+
::s1 nyi ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::suffix adjective
|
461 |
+
::s1 ovye ::s2 aceae ::cost 0.02 ::right1 [-,$ ] ::lc1 rus ::comment scientific names for families of species
|
462 |
+
::s1 shsh ::s2 sh ::cost 0 ::lc1 rus
|
463 |
+
::s1 skaya ::s2 ian ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::suffix possessive
|
464 |
+
::s1 skaya ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::suffix possessive
|
465 |
+
::s1 skii ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::suffix possessive
|
466 |
+
::s1 skii ::s2 ian ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::suffix adjective
|
467 |
+
::s1 tsian ::s2 tian ::cost 0.05 ::lc1 rus
|
468 |
+
::s1 tsion ::s2 tion ::cost 0.05 ::lc1 rus
|
469 |
+
::s1 ts ::s2 c ::cost 0.3 ::lc1 rus
|
470 |
+
::s1 ts ::s2 c ::cost 0.02 ::right1 [-,$ ] ::lc1 rus
|
471 |
+
::s1 tsz ::s2 z ::cost 0.1 ::lc1 rus
|
472 |
+
::s1 itsa ::s2 ica ::cost 0.02 ::right1 [-,$ ] ::lc1 rus
|
473 |
+
::s1 etski ::s2 ecky ::cost 0.02 ::right1 [-,$ ] ::lc1 rus
|
474 |
+
::s1 tsiya ::s2 tion ::cost 0.02 ::right1 [-,$ ] ::lc1 rus
|
475 |
+
::s1 tsi ::s2 qi ::cost 0.15 ::lc1 rus ::comment Chinese names
|
476 |
+
::s1 tsy ::s2 qi ::cost 0.15 ::lc1 rus ::comment Chinese names
|
477 |
+
::s1 tszi ::s2 ji ::cost 0.15 ::lc1 rus ::comment Chinese names
|
478 |
+
::s1 tszy ::s2 ji ::cost 0.15 ::lc1 rus ::comment Chinese names
|
479 |
+
::s1 u ::s2 w ::right2 [aeio] ::cost 0.05 ::lc1 rus
|
480 |
+
::s1 u ::s2 w ::cost 0.2 ::lc1 rus
|
481 |
+
::s1 uo ::s2 wa ::cost 0.2 ::lc1 rus ::right2 [lnrst]
|
482 |
+
::s1 v ::s2 u ::cost 0.05 ::lc1 rus ::left1 /[bcdfghjklmnpqrstvwxz]$/ ::right1 [aeiou]
|
483 |
+
::s1 gva ::s2 gua ::cost 0.02 ::lc1 rus
|
484 |
+
::s1 gvi ::s2 gui ::cost 0.02 ::lc1 rus
|
485 |
+
::s1 x ::s2 sh ::cost 0.2 ::left2 /[aeiou]$/ ::right2 [-,aouct$-] ::lc1 rus
|
486 |
+
::s1 y ::s2 s ::cost 0.4 ::right2 [-,$ ] ::lc1 rus
|
487 |
+
::s1 zh ::s2 rz ::cost 0.1 ::lc1 rus ::comment Polish rz
|
488 |
+
|
489 |
+
# Russian case endings
|
490 |
+
::s1 em ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
491 |
+
::s1 ey ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
492 |
+
::s1 om ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
493 |
+
::s1 oy ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
494 |
+
::s1 oyu ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
495 |
+
::s1 y ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
496 |
+
::s1 ya ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
497 |
+
::s1 ye ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
498 |
+
::s1 yem ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
499 |
+
::s1 ym ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
500 |
+
::s1 ymi ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
501 |
+
::s1 yu ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
502 |
+
::s1 ii ::s2 iya ::cost 0.1 ::right1 [-,$ ] ::right2 [-,$ ] ::lc1 rus ::lc2 rus ::comment Russian case endings
|
503 |
+
::s1 ii ::s2 iye ::cost 0.1 ::right1 [-,$ ] ::right2 [-,$ ] ::lc1 rus ::lc2 rus ::comment Russian case endings
|
504 |
+
|
505 |
+
::s1 am ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
506 |
+
::s1 ami ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
507 |
+
::s1 em ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
508 |
+
::s1 ev ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
509 |
+
::s1 eri ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
510 |
+
::s1 eryu ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
511 |
+
::s1 om ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
512 |
+
::s1 ov ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
513 |
+
::s1 akh ::s2 ::cost 0.3 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
514 |
+
::s1 ykh ::s2 ::cost 0.3 ::right1 [-,$ ] ::lc1 rus ::comment Russian case ending
|
515 |
+
|
516 |
+
# Ukrainian case endings
|
517 |
+
::s1 eyu ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
518 |
+
::s1 oyu ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
519 |
+
::s1 ya ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
520 |
+
::s1 yi ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
521 |
+
::s1 yu ::s2 ::cost 0.1 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
522 |
+
|
523 |
+
::s1 am ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
524 |
+
::s1 amy ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
525 |
+
::s1 em ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
526 |
+
::s1 evy ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
527 |
+
::s1 iv ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
528 |
+
::s1 om ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
529 |
+
::s1 ovy ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
530 |
+
::s1 yam ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
531 |
+
::s1 yamy ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
532 |
+
::s1 yiv ::s2 ::cost 0.2 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
533 |
+
::s1 akh ::s2 ::cost 0.3 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
534 |
+
::s1 yakh ::s2 ::cost 0.3 ::right1 [-,$ ] ::lc1 ukr ::comment Ukrainian case ending
|
535 |
+
|
536 |
+
# Uyghur
|
537 |
+
::s1 aw ::s2 ao ::cost 0.05 ::lc1 uig
|
538 |
+
::s1 aw ::s2 au ::cost 0.05 ::lc1 uig
|
539 |
+
::s1 gwi ::s2 gui ::cost 0.05 ::lc1 uig
|
540 |
+
::s1 iye ::s2 ia ::cost 0.05 ::lc1 uig
|
541 |
+
::s1 istan ::s2 ia ::cost 0.1 ::right1 [-,$ ] ::lc1 uig
|
542 |
+
::s1 j ::s2 c ::cost 0.4 ::lc1 uig
|
543 |
+
::s1 q ::s2 h ::cost 0.2 ::lc1 uig
|
544 |
+
::s1 sey ::s2 cai ::cost 0.2 ::lc1 uig
|
545 |
+
::s1 sh ::s2 x ::cost 0.2 ::lc1 uig
|
546 |
+
|
547 |
+
::s1 b ::s2 p ::cost 0.3
|
548 |
+
::s1 b ::s2 v ::cost 0.5 ::left2 /^(.*[- ])?$/
|
549 |
+
::s1 b ::s2 v ::cost 0.7
|
550 |
+
::s1 c ::s2 ch ::cost 0.25 ::right1 [eiy]
|
551 |
+
::s1 c ::s2 ck ::cost 0.02 ::right1 [-,abcdfghklmnpoqrstuvwxz$ ]
|
552 |
+
::s1 c ::s2 k ::cost 0.4
|
553 |
+
::s1 c ::s2 k ::cost 0.05 ::left1 /^(.* )?ma?$/ ::comment MacIntyre
|
554 |
+
::s1 c ::s2 k ::cost 0.02 ::right1 [-,abcdfghklmnpoqrstuvwxz$ ]
|
555 |
+
::s1 c ::s2 kk ::cost 0.02 ::right1 [-,abcdfghklmnpoqrstuvwxz$ ]
|
556 |
+
::s1 c ::s2 s ::cost 0.7
|
557 |
+
::s1 c ::s2 s ::cost 0.1 ::right1 [eiy]
|
558 |
+
::s1 c ::s2 ts ::cost 0.15 ::right1 [eiy]
|
559 |
+
::s1 c ::s2 z ::cost 0.3
|
560 |
+
::s1 ch ::s2 ck ::cost 0.2
|
561 |
+
::s1 ch ::s2 g ::cost 0.3 ::right1 [eiy] ::right2 [eiy]
|
562 |
+
::s1 ch ::s2 k ::cost 0.2
|
563 |
+
::s1 ch ::s2 kk ::cost 0.2
|
564 |
+
::s1 ch ::s2 sh ::cost 0.3
|
565 |
+
::s1 ch ::s2 sh ::cost 0.2 ::left1 /[eiy]$/ ::right1 [$ ]
|
566 |
+
::s1 ch ::s2 tch ::cost 0.1
|
567 |
+
::s1 ch ::s2 tsh ::cost 0.1
|
568 |
+
::s1 ch ::s2 z ::cost 0.5
|
569 |
+
::s1 ck ::s2 kk ::cost 0.02
|
570 |
+
::s1 cz ::s2 ch ::cost 0.2 ::left1 /i$/
|
571 |
+
::s1 d ::s2 t ::cost 0.3
|
572 |
+
::s1 de ::s2 dre ::cost 0.3 ::lc1 zho ::right2 [-,$ ]
|
573 |
+
::s1 dg ::s2 j ::cost 0.6 ::lc1 eng ::comment Cambridge
|
574 |
+
::s1 dg ::s2 j ::cost 0.3 ::right1 [eiy] ::lc1 eng
|
575 |
+
::s1 dg ::s2 j ::cost 0.1 ::right1 [eiy] ::lc1 eng ::lc2 fas, jpn
|
576 |
+
::s1 dt ::s2 d ::cost 0.3
|
577 |
+
::s1 dt ::s2 t ::cost 0.03
|
578 |
+
::s1 dt ::s2 tt ::cost 0.03
|
579 |
+
::s1 f ::s2 p ::cost 0.8
|
580 |
+
::s1 f ::s2 ph ::cost 0.01
|
581 |
+
::s1 ff ::s2 ph ::cost 0.02
|
582 |
+
::s1 f ::s2 pf ::cost 0.1
|
583 |
+
::s1 f ::s2 v ::cost 0.3
|
584 |
+
::s1 f ::s2 v ::cost 0.1 ::right1 [-,$ ]
|
585 |
+
::s1 ef ::s2 ev ::cost 0.1 ::right1 [-,bcdfghklmnpqrstvwxz$ ]
|
586 |
+
::s1 f ::s2 w ::cost 0.3
|
587 |
+
::s1 g ::s2 j ::cost 0.6
|
588 |
+
::s1 g ::s2 j ::cost 0.3 ::right1 [eiy]
|
589 |
+
::s1 g ::s2 j ::cost 0.1 ::right1 [eiy] ::lc2 amh, ara, fas, jpn, som
|
590 |
+
::s1 g ::s2 k ::cost 0.3
|
591 |
+
::s1 g ::s2 gh ::cost 0.3
|
592 |
+
::s1 g ::s2 ch ::cost 0.4 ::left1 /[eiy]$/ ::right1 [-,$ ] ::comment German: Ludwig, Braunschweig
|
593 |
+
::s1 gh ::s2 f ::cost 0.2 ::lc1 eng ::comment laughter
|
594 |
+
::s1 gh ::s2 "" ::cost 0.2 ::lc1 eng ::comment daughter
|
595 |
+
::s1 gh ::s2 g ::cost 0.2 ::lc1 eng ::comment Afghanistan
|
596 |
+
::s1 gl ::s2 l ::cost 0.2 ::lc1 eng ::right1 [i]
|
597 |
+
::s1 gn ::s2 n ::cost 0.05 ::left1 /^(.* )?$/ ::lc1 eng
|
598 |
+
::s1 gn ::s2 n ::cost 0.2 ::lc1 eng
|
599 |
+
::s1 gz ::s2 ks ::cost 0.2
|
600 |
+
::s1 h ::s2 e ::cost 0.4 ::lc1 fas
|
601 |
+
::s1 ise ::s2 ize ::cost 0.1
|
602 |
+
::s1 j ::s2 y ::cost 0.2
|
603 |
+
::s1 j ::s2 dj ::cost 0.2
|
604 |
+
::s1 j ::s2 h ::cost 0.4 ::right2 [aeiou] ::lc2 amh ::example Jose
|
605 |
+
::s1 j ::s2 hh ::cost 0.4 ::right2 [aeiou] ::lc2 amh ::example Tardajos
|
606 |
+
::s1 j ::s2 zh ::cost 0.2
|
607 |
+
::s1 k ::s2 cc ::cost 0.02 ::right2 [aour]
|
608 |
+
::s1 k ::s2 cc ::cost 0.3
|
609 |
+
::s1 k ::s2 cch ::cost 0.15
|
610 |
+
::s1 k ::s2 ck ::cost 0.02
|
611 |
+
::s1 k ::s2 cq ::cost 0.05
|
612 |
+
::s1 k ::s2 cqu ::cost 0.05
|
613 |
+
::s1 k ::s2 cque ::cost 0.1
|
614 |
+
::s1 k ::s2 cque ::cost 0.05 ::right2 [-,$ ]
|
615 |
+
::s1 k ::s2 cques ::cost 0.05 ::right2 [-,$ ]
|
616 |
+
::s1 k ::s2 q ::cost 0.05
|
617 |
+
::s1 k ::s2 qu ::cost 0.05
|
618 |
+
::s1 k ::s2 que ::cost 0.1
|
619 |
+
::s1 k ::s2 que ::cost 0.05 ::right2 [-,$ ]
|
620 |
+
::s1 k ::s2 ques ::cost 0.1 ::right2 [-,$ ]
|
621 |
+
::s1 kh ::s2 j ::cost 0.2
|
622 |
+
::s1 kh ::s2 q ::cost 0.2
|
623 |
+
::s1 kh ::s2 k ::cost 0.25 ::right1 [aeiouy]
|
624 |
+
::s1 kh ::s2 k ::cost 0.1 ::right1 [aeiouys] ::lc2 amh
|
625 |
+
::s1 kn ::s2 n ::cost 0.05 ::left1 /^(.* )?$/ ::lc1 eng
|
626 |
+
::s1 kj ::s2 sh ::cost 0.2 ::comment Swedish
|
627 |
+
::s1 l ::s2 r ::cost 0.1 ::lc1 zho
|
628 |
+
::s1 aib ::s2 alb ::cost 0.1 ::lc1 zho
|
629 |
+
::s1 al ::s2 ::cost 0.5 ::left1 /^(.* )?$/
|
630 |
+
::s1 al- ::s2 ::cost 0.3 ::left1 /^(.* )?$/
|
631 |
+
::s1 el ::s2 ::cost 0.5 ::left1 /^(.* )?$/
|
632 |
+
::s1 el- ::s2 ::cost 0.3 ::left1 /^(.* )?$/
|
633 |
+
::s1 ll ::s2 y ::cost 0.1 ::left1 /[aeiouy]$/ ::right1 [aeiouy] ::comment Guillermo, Guillaume
|
634 |
+
::s1 mb ::s2 m ::cost 0.2 ::right1 [-,bcdfghklmnpqstvwxz$ ] ::lc1 eng ::comment bomb
|
635 |
+
::s1 n ::s2 m ::cost 0.5 ::left1 /[aeiou]$/ ::left2 /[aeiou]$/ ::right1 [bcdfghklmnpqrstvwxz$ ] ::right2 [-,bcdfghklmnpqrstvwxz$ ]
|
636 |
+
::s1 ng ::s2 n ::cost 0.1 ::left1 /[aeiou]$/ ::lc1 zho
|
637 |
+
::s1 ng ::s2 m ::cost 0.25 ::left1 /[aeiou]$/ ::lc1 zho
|
638 |
+
::s1 ng ::s2 n ::cost 0.1 ::left2 /[aeiou]$/ ::lc2 ara, ben, rus, zho
|
639 |
+
::s1 nm ::s2 m ::cost 0.25 ::lc1 zho ::left1
|
640 |
+
::s1 pn ::s2 n ::cost 0.05 ::left1 /^(.* )?$/ ::lc1 eng
|
641 |
+
::s1 ph ::s2 p ::cost 0.3 ::lc1 amh
|
642 |
+
::s1 q ::s2 c ::cost 0.15
|
643 |
+
::s1 q ::s2 ch ::cost 0.2 ::right2 [eiy]
|
644 |
+
::s1 q ::s2 ck ::cost 0.2
|
645 |
+
::s1 q ::s2 kk ::cost 0.2
|
646 |
+
::s1 q ::s2 gh ::cost 0.2 ::lc1 fas ::right2 [aeiouy]
|
647 |
+
::s1 qi ::s2 ch ::cost 0.2 ::lc1 zho ::right1 [aeou]
|
648 |
+
::s1 qi ::s2 cci ::cost 0.1 ::lc1 zho
|
649 |
+
::s1 qi ::s2 chi ::cost 0.1 ::lc1 zho
|
650 |
+
::s1 qi ::s2 tch ::cost 0.2 ::lc1 zho ::right1 [aeou]
|
651 |
+
::s1 qi ::s2 ts ::cost 0.4 ::lc1 zho ::right1 [aeou]
|
652 |
+
::s1 qi ::s2 tsch ::cost 0.2 ::lc1 zho ::right1 [aeou]
|
653 |
+
::s1 qi ::s2 tzsch ::cost 0.2 ::lc1 zho ::right1 [aeou]
|
654 |
+
::s1 qi ::s2 czy ::cost 0.2 ::lc1 zho
|
655 |
+
::s1 qu ::s2 kw ::cost 0.15
|
656 |
+
::s1 qu ::s2 kv ::cost 0.15
|
657 |
+
::s1 e ::s2 er ::cost 0.25 ::left1 /[bcdfghklmnpqrstvwxz]$/ ::lc1 zho
|
658 |
+
::s1 re ::s2 er ::cost 0.1
|
659 |
+
::s1 rh ::s2 r ::cost 0.05 ::left1 /^(.*[- ])?$/ ::example Rhine
|
660 |
+
::s1 s ::s2 sh ::cost 0.03 ::right2 [aeiou] ::lc2 amh
|
661 |
+
::s1 s ::s2 sz ::cost 0.3 ::lc2 eng ::example Liszt (Hungarian)
|
662 |
+
::s1 s ::s2 ts ::cost 0.4 ::lc1 amh, zho
|
663 |
+
::s1 s ::s2 z ::cost 0.4
|
664 |
+
::s1 s ::s2 z ::cost 0.1 ::left1 /[aeiouy]$/ ::right1 [aeiouy] ::lc1 eng
|
665 |
+
::s1 s ::s2 z ::cost 0.1 ::left1 /[aeiouy][bdglmnrvw]?$/ ::right1 [-,$ ] ::lc1 eng
|
666 |
+
::s1 s ::s2 z ::cost 0.2 ::lc2 fas
|
667 |
+
::s1 sc ::s2 s ::cost 0.2 ::right1 [i] ::example Nascimento
|
668 |
+
::s1 sci ::s2 sh ::cost 0.2 ::example Brescia
|
669 |
+
::s1 sch ::s2 sh ::cost 0.1
|
670 |
+
::s1 sh ::s2 sz ::cost 0.2 ::example Mariusz (Polish) ::lc2 eng
|
671 |
+
::s1 si ::s2 j ::cost 0.1 ::right2 [a] ::lc1 eng
|
672 |
+
::s1 ss ::s2 z ::cost 0.5
|
673 |
+
# ::s1 smith ::s2 mith ::cost 0.75 ::lc2 zho ::comment weird, but several different Xinhua examples
|
674 |
+
::s1 tch ::s2 c ::cost 0.2 ::left2 /[aeiou]$/ ::right2 [-,e$ ]
|
675 |
+
::s1 te ::s2 tre ::cost 0.3 ::lc1 zho ::right2 [-,$ ]
|
676 |
+
::s1 th ::s2 t ::cost 0.2 ::lc2 amh, fas, uig
|
677 |
+
::s1 th ::s2 s ::cost 0.4 ::lc2 zho
|
678 |
+
::s1 th ::s2 sth ::cost 0.4 ::lc1 zho
|
679 |
+
::s1 th ::s2 ths ::cost 0.4 ::lc1 zho
|
680 |
+
::s1 th ::s2 z ::cost 0.3 ::lc2 amh ::right2 [-,$ aeot]
|
681 |
+
::s1 v ::s2 w ::cost 0.02
|
682 |
+
::s1 v ::s2 wh ::cost 0.02 ::left1 /^(.* )?$/
|
683 |
+
::s1 vv ::s2 w ::cost 0.02
|
684 |
+
::s1 w ::s2 u ::cost 0.1 ::lc2 uig
|
685 |
+
::s1 wa ::s2 ua ::cost 0.05
|
686 |
+
::s1 wh ::s2 w ::cost 0.05 ::left1 /^(.* )?$/
|
687 |
+
::s1 wr ::s2 r ::cost 0.05 ::left1 /^(.* )?$/ ::lc1 eng
|
688 |
+
::s1 x ::s2 ks ::cost 0.05
|
689 |
+
::s1 x ::s2 s ::cost 0.2 ::left1 /^(.* )?$/
|
690 |
+
::s1 x ::s2 sh ::cost 0.2 ::lc1 uig ::left1 /^(.* )?$/ ::right1 [aeiou]
|
691 |
+
::s1 x ::s2 z ::cost 0.2 ::left1 /^(.* )?$/ ::right1 [aeiouy]
|
692 |
+
::s1 x ::s2 h ::cost 0.3 ::lc1 uig
|
693 |
+
::s1 x ::s2 h ::cost 0.05 ::lc1 uig ::left1 /^(.* )?$/ ::right1 [aeiou]
|
694 |
+
::s1 x ::s2 kh ::cost 0.1 ::lc1 uig
|
695 |
+
::s1 xi ::s2 sch ::cost 0.2 ::right1 [aeou] ::lc1 zho
|
696 |
+
::s1 xi ::s2 sh ::cost 0.2 ::right1 [aeou] ::lc1 zho
|
697 |
+
::s1 xi ::s2 ch ::cost 0.4 ::right1 [aeou] ::lc1 zho
|
698 |
+
::s1 xi ::s2 sci ::cost 0.4 ::right1 [aeou] ::lc1 zho
|
699 |
+
::s1 xi ::s2 s ::cost 0.6 ::right1 [aeou] ::lc1 zho
|
700 |
+
::s1 z ::s2 dz ::cost 0.1 ::left1 /^(.*[ aeiouy])?[lnr]?$/
|
701 |
+
::s1 z ::s2 ts ::cost 0.15
|
702 |
+
::s1 z ::s2 tz ::cost 0.15
|
703 |
+
::s1 zh ::s2 g ::cost 0.2 ::right2 [eiy]
|
704 |
+
::s1 zh ::s2 g ::cost 0.1 ::right2 [eiy] ::lc2 amh
|
705 |
+
::s1 zz ::s2 ts ::cost 0.15
|
706 |
+
::s1 zz ::s2 tz ::cost 0.1
|
707 |
+
|
708 |
+
# Oromo
|
709 |
+
::s1 nb ::s2 mb ::cost 0.4 ::lc1 orm ::lc2 orm ::left1 /[aeiou]$/ ::left2 /[aeiou]$/
|
710 |
+
::s1 np ::s2 mp ::cost 0.4 ::lc1 orm ::lc2 orm ::left1 /[aeiou]$/ ::left2 /[aeiou]$/
|
711 |
+
::s1 ph ::s2 p ::cost 0.3 ::lc1 orm ::lc2 orm
|
712 |
+
|
713 |
+
# Tigrinya
|
714 |
+
::s1 aaye ::s2 a ::cost 0.4 ::lc1 tir ::lc2 tir ::left1 /[bcdfghklmnpqrstvwxz]$/ ::right1 [bcdfghklmnpqrstvwxz] ::comment internal plural
|
715 |
+
::s1 aaye ::s2 i ::cost 0.4 ::lc1 tir ::lc2 tir ::left1 /[bcdfghklmnpqrstvwxz]$/ ::right1 [bcdfghklmnpqrstvwxz] ::comment internal plural
|
716 |
+
|
717 |
+
# Somali
|
718 |
+
::s1 ay ::s2 ey ::cost 0.1 ::lc1 som ::lc2 som
|
719 |
+
::s1 ay ::s2 eey ::cost 0.15 ::lc1 som ::lc2 som
|
720 |
+
::s1 aha ::s2 ihii ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
721 |
+
::s1 aha ::s2 ihi ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
722 |
+
::s1 aha ::s2 uhu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
723 |
+
::s1 ihii ::s2 uhu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
724 |
+
::s1 ihi ::s2 uhu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
725 |
+
::s1 ha ::s2 hii ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
726 |
+
::s1 ha ::s2 hi ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
727 |
+
::s1 ha ::s2 hu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
728 |
+
::s1 hii ::s2 hu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
729 |
+
::s1 hi ::s2 hu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
730 |
+
::s1 aka ::s2 ikii ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
731 |
+
::s1 aka ::s2 iki ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
732 |
+
::s1 aka ::s2 uku ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
733 |
+
::s1 ikii ::s2 uku ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
734 |
+
::s1 iki ::s2 uku ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
735 |
+
::s1 ka ::s2 kii ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
736 |
+
::s1 ka ::s2 ki ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
737 |
+
::s1 ka ::s2 ku ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
738 |
+
::s1 kii ::s2 ku ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
739 |
+
::s1 ki ::s2 ku ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
740 |
+
::s1 aga ::s2 ugu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
741 |
+
::s1 ga ::s2 gu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
742 |
+
::s1 ata ::s2 itii ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
743 |
+
::s1 ata ::s2 iti ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
744 |
+
::s1 ata ::s2 utu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
745 |
+
::s1 itii ::s2 utu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
746 |
+
::s1 iti ::s2 utu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
747 |
+
::s1 ta ::s2 tii ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
748 |
+
::s1 ta ::s2 ti ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
749 |
+
::s1 ta ::s2 tu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
750 |
+
::s1 tii ::s2 tu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
751 |
+
::s1 ti ::s2 tu ::cost 0.15 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [-,$ ]
|
752 |
+
::s1 ata ::s2 ete ::cost 0.15 ::lc1 som ::lc2 som
|
753 |
+
::s1 ata ::s2 iti ::cost 0.2 ::lc1 som ::lc2 som
|
754 |
+
::s1 ete ::s2 iti ::cost 0.15 ::lc1 som ::lc2 som
|
755 |
+
::s1 g ::s2 k ::cost 0.2 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [aeiou]
|
756 |
+
::s1 g ::s2 k ::cost 0.25 ::lc1 som ::lc2 som
|
757 |
+
::s1 g ::s2 kh ::cost 0.25 ::lc1 som ::lc2 som
|
758 |
+
::s1 gh ::s2 kh ::cost 0.1 ::lc1 som ::lc2 som
|
759 |
+
::s1 gh ::s2 k ::cost 0.2 ::lc1 som ::lc2 som
|
760 |
+
::s1 g ::s2 q ::cost 0.25 ::lc1 som ::lc2 som
|
761 |
+
::s1 g ::s2 q ::cost 0.2 ::lc1 som ::lc2 som ::right1 [aou] ::right2 [aou]
|
762 |
+
::s1 ga ::s2 q ::cost 0.2 ::lc1 som ::lc2 som ::left1 /^(.*[aeiou])?$/ ::left2 /^(.*[aeiou])?$/ ::right1 [bcdfghklmnpqrstvwxz] ::right2 [bcdfghklmnpqrstvwxz]
|
763 |
+
::s1 g ::s2 j ::cost 0.25 ::lc1 som ::lc2 som
|
764 |
+
::s1 g ::s2 j ::cost 0.15 ::lc1 som ::lc2 som ::right1 [ei] ::right2 [ei]
|
765 |
+
::s1 gi ::s2 j ::cost 0.15 ::lc1 som ::lc2 som ::right2 [ei]
|
766 |
+
::s1 n ::s2 m ::cost 0.2 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [aeiou]
|
767 |
+
::s1 n ::s2 mm ::cost 0.2 ::lc1 som ::lc2 som ::right1 [-,$ ] ::right2 [aeiou]
|
768 |
+
::s1 n ::s2 m ::cost 0.25 ::lc1 som ::lc2 som ::right2 [aeiko]
|
769 |
+
::s1 n ::s2 mm ::cost 0.25 ::lc1 som ::lc2 som ::right2 [aeiko]
|
770 |
+
::s1 ii ::s2 a ::cost 0.15 ::lc1 som ::lc2 som
|
771 |
+
::s1 y ::s2 dj ::cost 0.2 ::lc2 som
|
772 |
+
::s1 ca ::s2 a ::cost 0.15 ::left1 /^(.*[-, ])?$/ ::lc1 som
|
773 |
+
::s1 c ::s2 ::cost 0.25 ::left1 /^(.*[-, ])?$/ ::lc1 som
|
774 |
+
::s1 x ::s2 h ::cost 0.25 ::lc1 som
|
775 |
+
::s1 x ::s2 h ::cost 0.05 ::lc1 som ::left1 /^(.* )?$/ ::right1 [aeiou]
|
776 |
+
::s1 x ::s2 h ::cost 0.1 ::lc1 som ::left1 /[aeiou]$/
|
777 |
+
::s1 b ::s2 p ::cost 0.1 ::lc1 som
|
778 |
+
::s1 majm ::s2 mahm ::cost 0.1 ::lc1 som
|
779 |
+
::s1 chalim ::s2 halim ::cost 0.1 ::lc1 som ::lc2 som
|
780 |
+
::s1 chalim ::s2 jalim ::cost 0.1 ::lc1 som ::lc2 som
|
781 |
+
::s1 chalim ::s2 kalim ::cost 0.1 ::lc1 som ::lc2 som
|
782 |
+
::s1 halim ::s2 jalim ::cost 0.1 ::lc1 som ::lc2 som
|
783 |
+
::s1 halim ::s2 kalim ::cost 0.1 ::lc1 som ::lc2 som
|
784 |
+
::s1 jalim ::s2 kalim ::cost 0.1 ::lc1 som ::lc2 som
|
785 |
+
::s1 dh ::s2 r ::cost 0.25 ::lc1 som ::lc2 som ::left1 /[aeiou]$/
|
786 |
+
::s1 j ::s2 ch ::cost 0.25 ::lc1 som ::lc2 som
|
787 |
+
::s1 j ::s2 kh ::cost 0.25 ::lc1 som ::lc2 som
|
788 |
+
::s1 ch ::s2 sh ::cost 0.2 ::lc1 som ::lc2 som
|
789 |
+
|
790 |
+
# French
|
791 |
+
::s1 aud ::s2 o ::cost 0.3 ::right1 [-,$ ] ::lc1 eng, fra
|
792 |
+
::s1 aux ::s2 o ::cost 0.05 ::right1 [-,$ ]
|
793 |
+
::s1 eaux ::s2 o ::cost 0.05 ::right1 [-,$ ]
|
794 |
+
::s1 eux ::s2 o ::cost 0.05 ::right1 [-,$ ]
|
795 |
+
::s1 eux ::s2 e ::cost 0.15 ::right1 [-,$ ]
|
796 |
+
|
797 |
+
::s1 - ::s2 " " ::cost 0.1
|
798 |
+
::s1 : ::s2 , ::cost 0.1 ::lc1 amh
|
799 |
+
|
800 |
+
# mini dictionary Amharic-English
|
801 |
+
::s1 dabube ::s2 south ::cost 0 ::lc1 amh ::lc2 eng
|
802 |
+
::s1 daseete ::s2 island ::cost 0 ::lc1 amh ::lc2 eng
|
803 |
+
::s1 daseetoche ::s2 islands ::cost 0 ::lc1 amh ::lc2 eng
|
804 |
+
::s1 kaaweneti ::s2 county ::cost 0 ::lc1 amh ::lc2 eng
|
805 |
+
::s1 katamaa ::s2 city ::cost 0 ::lc1 amh ::lc2 eng
|
806 |
+
::s1 kelele ::s2 region ::cost 0 ::lc1 amh ::lc2 eng
|
807 |
+
::s1 meseraaqe ::s2 east ::cost 0 ::lc1 amh ::lc2 eng
|
808 |
+
::s1 sameene ::s2 north ::cost 0 ::lc1 amh ::lc2 eng
|
809 |
+
::s1 setaadiyame ::s2 stadium ::cost 0 ::lc1 amh ::lc2 eng
|
810 |
+
::s1 waneze ::s2 river ::cost 0 ::lc1 amh ::lc2 eng
|
811 |
+
|
812 |
+
# mini dictionary Arabic-English
|
813 |
+
::s1 " " ::s2 " of " ::cost 0 ::lc1 ara ::lc2 eng
|
814 |
+
::s1 " alawl" ::s2 " i" ::cost 0 ::lc1 ara ::lc2 eng ::right2 [-,$ ]
|
815 |
+
|
816 |
+
# mini dictionary Bengali-English
|
817 |
+
::s1 anychala ::s2 zone ::cost 0 ::lc1 ben ::lc2 eng
|
818 |
+
::s1 pradesha ::s2 province ::cost 0 ::lc1 ben ::lc2 eng
|
819 |
+
::s1 saamraajya ::s2 empire ::cost 0 ::lc1 ben ::lc2 eng
|
820 |
+
::s1 upajelaa ::s2 upazila ::cost 0 ::lc1 ben ::lc2 eng
|
821 |
+
::s1 uttara ::s2 north ::cost 0 ::lc1 ben ::lc2 eng
|
822 |
+
::s1 "dya " ::s2 "the " ::left1 /^(.*[-, ])?$/ ::cost 0.2 ::lc1 ben ::lc2 eng
|
823 |
+
::s1 " aba " ::s2 " of " ::cost 0 ::lc1 ben ::lc2 eng
|
824 |
+
|
825 |
+
# mini dictionary Russian-English
|
826 |
+
::s1 akademiya ::s2 academy ::cost 0 ::lc1 rus ::lc2 eng
|
827 |
+
::s1 eparkhiya ::s2 diocese ::cost 0 ::lc1 rus ::lc2 eng
|
828 |
+
::s1 gorod ::s2 city ::cost 0 ::lc1 rus ::lc2 eng
|
829 |
+
::s1 gosudarstvennyi ::s2 state ::cost 0 ::lc1 rus ::lc2 eng
|
830 |
+
::s1 gubernator ::s2 governor ::cost 0 ::lc1 rus ::lc2 eng
|
831 |
+
::s1 guberniya ::s2 governorate ::cost 0 ::lc1 rus ::lc2 eng
|
832 |
+
::s1 imperator ::s2 emperor ::cost 0 ::lc1 rus ::lc2 eng
|
833 |
+
::s1 komitet ::s2 committee ::cost 0 ::lc1 rus ::lc2 eng
|
834 |
+
::s1 korolevstvo ::s2 kingdom ::cost 0 ::lc1 rus ::lc2 eng
|
835 |
+
::s1 koroli ::s2 king ::cost 0 ::lc1 rus ::lc2 eng
|
836 |
+
::s1 mezhdunarodnaya ::s2 international ::cost 0 ::lc1 rus ::lc2 eng
|
837 |
+
::s1 natsionalnyi ::s2 national ::cost 0 ::lc1 rus ::lc2 eng
|
838 |
+
::s1 novyi ::s2 new ::cost 0 ::lc1 rus ::lc2 eng
|
839 |
+
::s1 oblast ::s2 province ::cost 0 ::lc1 rus ::lc2 eng
|
840 |
+
::s1 oblast ::s2 region ::cost 0 ::lc1 rus ::lc2 eng
|
841 |
+
::s1 obshchestvo ::s2 society ::cost 0 ::lc1 rus ::lc2 eng
|
842 |
+
::s1 okrug ::s2 district ::cost 0 ::lc1 rus ::lc2 eng
|
843 |
+
::s1 okrug ::s2 region ::cost 0 ::lc1 rus ::lc2 eng
|
844 |
+
::s1 ostrova ::s2 island ::cost 0 ::lc1 rus ::lc2 eng
|
845 |
+
::s1 partiya ::s2 party ::cost 0 ::lc1 rus ::lc2 eng
|
846 |
+
::s1 raion ::s2 district ::cost 0 ::lc1 rus ::lc2 eng
|
847 |
+
::s1 respublika ::s2 republic ::cost 0 ::lc1 rus ::lc2 eng
|
848 |
+
::s1 respublik ::s2 republic ::cost 0 ::lc1 rus ::lc2 eng
|
849 |
+
::s1 sbornaya ::s2 team ::cost 0 ::lc1 rus ::lc2 eng
|
850 |
+
::s1 severnaya ::s2 north ::cost 0 ::lc1 rus ::lc2 eng
|
851 |
+
::s1 sovet ::s2 council ::cost 0 ::lc1 rus ::lc2 eng
|
852 |
+
::s1 soyuz ::s2 alliance ::cost 0 ::lc1 rus ::lc2 eng
|
853 |
+
::s1 soyuz ::s2 association ::cost 0 ::lc1 rus ::lc2 eng
|
854 |
+
::s1 soyuz ::s2 league ::cost 0 ::lc1 rus ::lc2 eng
|
855 |
+
::s1 soyuz ::s2 union ::cost 0 ::lc1 rus ::lc2 eng
|
856 |
+
::s1 svyataya ::s2 saint ::cost 0 ::lc1 rus ::lc2 eng
|
857 |
+
::s1 svobodnyi ::s2 free ::cost 0 ::lc1 rus ::lc2 eng
|
858 |
+
::s1 tserkov ::s2 church ::cost 0 ::lc1 rus ::lc2 eng
|
859 |
+
::s1 uezd ::s2 county ::cost 0 ::lc1 rus ::lc2 eng
|
860 |
+
::s1 universitet ::s2 university ::cost 0 ::lc1 rus ::lc2 eng
|
861 |
+
::s1 vostochnaya ::s2 east ::cost 0 ::lc1 rus ::lc2 eng
|
862 |
+
::s1 vostochnaya ::s2 eastern ::cost 0 ::lc1 rus ::lc2 eng
|
863 |
+
::s1 yuzhnaya ::s2 south ::cost 0 ::lc1 rus ::lc2 eng
|
864 |
+
::s1 yuzhnaya ::s2 southern ::cost 0 ::lc1 rus ::lc2 eng
|
865 |
+
::s1 yuzhnoi ::s2 south ::cost 0 ::lc1 rus ::lc2 eng
|
866 |
+
::s1 yuzhnoi ::s2 southern ::cost 0 ::lc1 rus ::lc2 eng
|
867 |
+
::s1 yuzhnyi ::s2 south ::cost 0 ::lc1 rus ::lc2 eng
|
868 |
+
# often dropped in Russian names
|
869 |
+
::s1 ::s2 county ::cost 0 ::lc1 rus ::lc2 eng
|
870 |
+
::s1 ::s2 island ::cost 0 ::lc1 rus ::lc2 eng
|
871 |
+
::s1 ::s2 pope ::cost 0 ::lc1 rus ::lc2 eng
|
872 |
+
::s1 ::s2 river ::cost 0 ::lc1 rus ::lc2 eng
|
873 |
+
::s1 ::s2 "the " ::cost 0 ::lc1 rus ::lc2 eng ::left2 /^(.*[- ])?$/
|
874 |
+
::s1 " " ::s2 " of " ::cost 0 ::lc1 rus ::lc2 eng
|
875 |
+
|
876 |
+
|
877 |
+
# mini dictionary Uyghur-English
|
878 |
+
::s1 aptonom ::s2 autonomous ::cost 0 ::lc1 uig ::lc2 eng
|
879 |
+
::s1 aralliri ::s2 islands ::cost 0 ::lc1 uig ::lc2 eng
|
880 |
+
::s1 aralliri ::s2 ::cost 0 ::lc1 uig ::lc2 eng
|
881 |
+
::s1 arili ::s2 island ::cost 0 ::lc1 uig ::lc2 eng
|
882 |
+
::s1 arili ::s2 ::cost 0 ::lc1 uig ::lc2 eng
|
883 |
+
::s1 nahiyisi ::s2 county ::cost 0 ::lc1 uig ::lc2 eng
|
884 |
+
::s1 oelkisi ::s2 province ::cost 0 ::lc1 uig ::lc2 eng
|
885 |
+
::s1 oelkisi ::s2 ::cost 0 ::lc1 uig ::lc2 eng
|
886 |
+
::s1 ottura ::s2 central ::cost 0 ::lc1 uig ::lc2 eng
|
887 |
+
::s1 rayoni ::s2 region ::cost 0 ::lc1 uig ::lc2 eng
|
888 |
+
::s1 shehiri ::s2 city ::cost 0 ::lc1 uig ::lc2 eng
|
889 |
+
::s1 shehiri ::s2 ::cost 0 ::lc1 uig ::lc2 eng
|
890 |
+
::s1 shitati ::s2 state ::cost 0 ::lc1 uig ::lc2 eng
|
891 |
+
::s1 shitati ::s2 ::cost 0 ::lc1 uig ::lc2 eng
|
892 |
+
::s1 shtati ::s2 state ::cost 0 ::lc1 uig ::lc2 eng
|
893 |
+
::s1 shtati ::s2 ::cost 0 ::lc1 uig ::lc2 eng
|
894 |
+
::s1 uniwersiteti ::s2 university ::cost 0 ::lc1 uig ::lc2 eng
|
895 |
+
::s1 yengi ::s2 new ::cost 0 ::lc1 uig ::lc2 eng
|
896 |
+
|
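The rule lines above share one layout: a sequence of `::slot value` pairs, where a value may be empty (as in `::s2 ::cost 0.2`), may be quoted to preserve spaces (as in `" of "`), and may itself contain spaces (as in `[-,$ ]`). As a rough illustration of that layout only, here is a minimal Python sketch of a reader for such lines. The function name `parse_rule`, the quote handling, and the conversion of `cost` to a float are assumptions made for this example; they are not taken from uroman's own Perl code, which may parse these rules differently.

```python
import re

# A slot marker is '::' followed by a lowercase name with optional digits,
# e.g. ::s1, ::s2, ::cost, ::right1, ::lc2, ::comment (assumed from the rules above).
SLOT = re.compile(r"::([a-z]+[0-9]*)")


def parse_rule(line):
    """Parse one '::slot value' rule line into a dict.

    Returns None for blank lines and for '#' comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Splitting on a pattern with a capture group keeps the slot names:
    # [text before first slot, name1, value1, name2, value2, ...]
    pieces = SLOT.split(line)
    rule = {}
    for name, raw in zip(pieces[1::2], pieces[2::2]):
        value = raw.strip()
        # Assumption: surrounding double quotes only protect spaces in the value.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        rule[name] = value
    if "cost" in rule:
        rule["cost"] = float(rule["cost"])
    return rule


if __name__ == "__main__":
    print(parse_rule("::s1 skaya ::s2 ian ::cost 0.1 ::right1 [-,$ ] ::lc1 rus ::suffix possessive"))
    print(parse_rule('::s1 " " ::s2 " of " ::cost 0 ::lc1 ara ::lc2 eng'))
```

A reader like this would also make it easy to flag lines that lack an `s1`, `s2`, or `cost` slot before the rules are used.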
uroman/lib/JSON.pm
ADDED
@@ -0,0 +1,2317 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
package JSON;
|
2 |
+
|
3 |
+
|
4 |
+
use strict;
|
5 |
+
use Carp ();
|
6 |
+
use base qw(Exporter);
|
7 |
+
@JSON::EXPORT = qw(from_json to_json jsonToObj objToJson encode_json decode_json);
|
8 |
+
|
9 |
+
BEGIN {
|
10 |
+
$JSON::VERSION = '2.90';
|
11 |
+
$JSON::DEBUG = 0 unless (defined $JSON::DEBUG);
|
12 |
+
$JSON::DEBUG = $ENV{ PERL_JSON_DEBUG } if exists $ENV{ PERL_JSON_DEBUG };
|
13 |
+
}
|
14 |
+
|
15 |
+
my $Module_XS = 'JSON::XS';
|
16 |
+
my $Module_PP = 'JSON::PP';
|
17 |
+
my $Module_bp = 'JSON::backportPP'; # included in JSON distribution
|
18 |
+
my $PP_Version = '2.27203';
|
19 |
+
my $XS_Version = '2.34';
|
20 |
+
|
21 |
+
|
22 |
+
# XS and PP common methods
|
23 |
+
|
24 |
+
my @PublicMethods = qw/
|
25 |
+
ascii latin1 utf8 pretty indent space_before space_after relaxed canonical allow_nonref
|
26 |
+
allow_blessed convert_blessed filter_json_object filter_json_single_key_object
|
27 |
+
shrink max_depth max_size encode decode decode_prefix allow_unknown
|
28 |
+
/;
|
29 |
+
|
30 |
+
my @Properties = qw/
|
31 |
+
ascii latin1 utf8 indent space_before space_after relaxed canonical allow_nonref
|
32 |
+
allow_blessed convert_blessed shrink max_depth max_size allow_unknown
|
33 |
+
/;
|
34 |
+
|
35 |
+
my @XSOnlyMethods = qw/allow_tags/; # Currently nothing
|
36 |
+
|
37 |
+
my @PPOnlyMethods = qw/
|
38 |
+
indent_length sort_by
|
39 |
+
allow_singlequote allow_bignum loose allow_barekey escape_slash as_nonblessed
|
40 |
+
/; # JSON::PP specific
|
41 |
+
|
42 |
+
|
43 |
+
# used in _load_xs and _load_pp ($INSTALL_ONLY is not used currently)
|
44 |
+
my $_INSTALL_DONT_DIE = 1; # When _load_xs fails to load XS, don't die.
|
45 |
+
my $_INSTALL_ONLY = 2; # Don't call _set_methods()
|
46 |
+
my $_ALLOW_UNSUPPORTED = 0;
|
47 |
+
my $_UNIV_CONV_BLESSED = 0;
|
48 |
+
my $_USSING_bpPP = 0;
|
49 |
+
|
50 |
+
|
51 |
+
# Check the environment variable to decide worker module.
|
52 |
+
|
53 |
+
unless ($JSON::Backend) {
|
54 |
+
$JSON::DEBUG and Carp::carp("Check used worker module...");
|
55 |
+
|
56 |
+
my $backend = exists $ENV{PERL_JSON_BACKEND} ? $ENV{PERL_JSON_BACKEND} : 1;
|
57 |
+
|
58 |
+
if ($backend eq '1' or $backend =~ /JSON::XS\s*,\s*JSON::PP/) {
|
59 |
+
_load_xs($_INSTALL_DONT_DIE) or _load_pp();
|
60 |
+
}
|
61 |
+
elsif ($backend eq '0' or $backend eq 'JSON::PP') {
|
62 |
+
_load_pp();
|
63 |
+
}
|
64 |
+
elsif ($backend eq '2' or $backend eq 'JSON::XS') {
|
65 |
+
_load_xs();
|
66 |
+
}
|
67 |
+
elsif ($backend eq 'JSON::backportPP') {
|
68 |
+
$_USSING_bpPP = 1;
|
69 |
+
_load_pp();
|
70 |
+
}
|
71 |
+
else {
|
72 |
+
Carp::croak "The value of environmental variable 'PERL_JSON_BACKEND' is invalid.";
|
73 |
+
}
|
74 |
+
}
|
75 |
+
|
76 |
+
|
77 |
+
sub import {
|
78 |
+
my $pkg = shift;
|
79 |
+
my @what_to_export;
|
80 |
+
my $no_export;
|
81 |
+
|
82 |
+
for my $tag (@_) {
|
83 |
+
if ($tag eq '-support_by_pp') {
|
84 |
+
if (!$_ALLOW_UNSUPPORTED++) {
|
85 |
+
JSON::Backend::XS
|
86 |
+
->support_by_pp(@PPOnlyMethods) if ($JSON::Backend eq $Module_XS);
|
87 |
+
}
|
88 |
+
next;
|
89 |
+
}
|
90 |
+
elsif ($tag eq '-no_export') {
|
91 |
+
$no_export++, next;
|
92 |
+
}
|
93 |
+
elsif ( $tag eq '-convert_blessed_universally' ) {
|
94 |
+
eval q|
|
95 |
+
require B;
|
96 |
+
*UNIVERSAL::TO_JSON = sub {
|
97 |
+
my $b_obj = B::svref_2object( $_[0] );
|
98 |
+
return $b_obj->isa('B::HV') ? { %{ $_[0] } }
|
99 |
+
: $b_obj->isa('B::AV') ? [ @{ $_[0] } ]
|
100 |
+
: undef
|
101 |
+
;
|
102 |
+
}
|
103 |
+
| if ( !$_UNIV_CONV_BLESSED++ );
|
104 |
+
next;
|
105 |
+
}
|
106 |
+
push @what_to_export, $tag;
|
107 |
+
}
|
108 |
+
|
109 |
+
return if ($no_export);
|
110 |
+
|
111 |
+
__PACKAGE__->export_to_level(1, $pkg, @what_to_export);
|
112 |
+
}
|
113 |
+
|
114 |
+
|
115 |
+
# OBSOLETED
|
116 |
+
|
117 |
+
sub jsonToObj {
|
118 |
+
my $alternative = 'from_json';
|
119 |
+
if (defined $_[0] and UNIVERSAL::isa($_[0], 'JSON')) {
|
120 |
+
shift @_; $alternative = 'decode';
|
121 |
+
}
|
122 |
+
Carp::carp "'jsonToObj' will be obsoleted. Please use '$alternative' instead.";
|
123 |
+
return JSON::from_json(@_);
|
124 |
+
};
|
125 |
+
|
126 |
+
sub objToJson {
|
127 |
+
my $alternative = 'to_json';
|
128 |
+
if (defined $_[0] and UNIVERSAL::isa($_[0], 'JSON')) {
|
129 |
+
shift @_; $alternative = 'encode';
|
130 |
+
}
|
131 |
+
Carp::carp "'objToJson' will be obsoleted. Please use '$alternative' instead.";
|
132 |
+
JSON::to_json(@_);
|
133 |
+
};
|
134 |
+
|
135 |
+
|
136 |
+
# INTERFACES
|
137 |
+
|
138 |
+
sub to_json ($@) {
|
139 |
+
if (
|
140 |
+
ref($_[0]) eq 'JSON'
|
141 |
+
or (@_ > 2 and $_[0] eq 'JSON')
|
142 |
+
) {
|
143 |
+
Carp::croak "to_json should not be called as a method.";
|
144 |
+
}
|
145 |
+
my $json = JSON->new;
|
146 |
+
|
147 |
+
if (@_ == 2 and ref $_[1] eq 'HASH') {
|
148 |
+
my $opt = $_[1];
|
149 |
+
for my $method (keys %$opt) {
|
150 |
+
$json->$method( $opt->{$method} );
|
151 |
+
}
|
152 |
+
}
|
153 |
+
|
154 |
+
$json->encode($_[0]);
|
155 |
+
}
|
156 |
+
|
157 |
+
|
158 |
+
sub from_json ($@) {
|
159 |
+
if ( ref($_[0]) eq 'JSON' or $_[0] eq 'JSON' ) {
|
160 |
+
Carp::croak "from_json should not be called as a method.";
|
161 |
+
}
|
162 |
+
my $json = JSON->new;
|
163 |
+
|
164 |
+
if (@_ == 2 and ref $_[1] eq 'HASH') {
|
165 |
+
my $opt = $_[1];
|
166 |
+
for my $method (keys %$opt) {
|
167 |
+
$json->$method( $opt->{$method} );
|
168 |
+
}
|
169 |
+
}
|
170 |
+
|
171 |
+
return $json->decode( $_[0] );
|
172 |
+
}
|
173 |
+
|
174 |
+
|
175 |
+
|
176 |
+
sub true { $JSON::true }
|
177 |
+
|
178 |
+
sub false { $JSON::false }
|
179 |
+
|
180 |
+
sub null { undef; }
|
181 |
+
|
182 |
+
|
183 |
+
sub require_xs_version { $XS_Version; }
|
184 |
+
|
185 |
+
sub backend {
|
186 |
+
my $proto = shift;
|
187 |
+
$JSON::Backend;
|
188 |
+
}
|
189 |
+
|
190 |
+
#*module = *backend;
|
191 |
+
|
192 |
+
|
193 |
+
sub is_xs {
|
194 |
+
return $_[0]->backend eq $Module_XS;
|
195 |
+
}
|
196 |
+
|
197 |
+
|
198 |
+
sub is_pp {
|
199 |
+
return not $_[0]->is_xs;
|
200 |
+
}
|
201 |
+
|
202 |
+
|
203 |
+
sub pureperl_only_methods { @PPOnlyMethods; }
|
204 |
+
|
205 |
+
|
206 |
+
sub property {
|
207 |
+
my ($self, $name, $value) = @_;
|
208 |
+
|
209 |
+
if (@_ == 1) {
|
210 |
+
my %props;
|
211 |
+
for $name (@Properties) {
|
212 |
+
my $method = 'get_' . $name;
|
213 |
+
if ($name eq 'max_size') {
|
214 |
+
my $value = $self->$method();
|
215 |
+
$props{$name} = $value == 1 ? 0 : $value;
|
216 |
+
next;
|
217 |
+
}
|
218 |
+
$props{$name} = $self->$method();
|
219 |
+
}
|
220 |
+
return \%props;
|
221 |
+
}
|
222 |
+
elsif (@_ > 3) {
|
223 |
+
Carp::croak('property() can take only the option within 2 arguments.');
|
224 |
+
}
|
225 |
+
elsif (@_ == 2) {
|
226 |
+
if ( my $method = $self->can('get_' . $name) ) {
|
227 |
+
if ($name eq 'max_size') {
|
228 |
+
my $value = $self->$method();
|
229 |
+
return $value == 1 ? 0 : $value;
|
230 |
+
}
|
231 |
+
$self->$method();
|
232 |
+
}
|
233 |
+
}
|
234 |
+
else {
|
235 |
+
$self->$name($value);
|
236 |
+
}
|
237 |
+
|
238 |
+
}
|
239 |
+
|
240 |
+
|
241 |
+
|
242 |
+
# INTERNAL
|
243 |
+
|
244 |
+
sub _load_xs {
|
245 |
+
my $opt = shift;
|
246 |
+
|
247 |
+
$JSON::DEBUG and Carp::carp "Load $Module_XS.";
|
248 |
+
|
249 |
+
# if called after install module, overload is disable.... why?
|
250 |
+
JSON::Boolean::_overrride_overload($Module_XS);
|
251 |
+
JSON::Boolean::_overrride_overload($Module_PP);
|
252 |
+
|
253 |
+
eval qq|
|
254 |
+
use $Module_XS $XS_Version ();
|
255 |
+
|;
|
256 |
+
|
257 |
+
if ($@) {
|
258 |
+
if (defined $opt and $opt & $_INSTALL_DONT_DIE) {
|
259 |
+
$JSON::DEBUG and Carp::carp "Can't load $Module_XS...($@)";
|
260 |
+
return 0;
|
261 |
+
}
|
262 |
+
Carp::croak $@;
|
263 |
+
}
|
264 |
+
|
265 |
+
unless (defined $opt and $opt & $_INSTALL_ONLY) {
|
266 |
+
_set_module( $JSON::Backend = $Module_XS );
|
267 |
+
my $data = join("", <DATA>); # this code is from Jcode 2.xx.
|
268 |
+
close(DATA);
|
269 |
+
eval $data;
|
270 |
+
JSON::Backend::XS->init;
|
271 |
+
}
|
272 |
+
|
273 |
+
return 1;
|
274 |
+
};
|
275 |
+
|
276 |
+
|
277 |
+
sub _load_pp {
|
278 |
+
my $opt = shift;
|
279 |
+
my $backend = $_USSING_bpPP ? $Module_bp : $Module_PP;
|
280 |
+
|
281 |
+
$JSON::DEBUG and Carp::carp "Load $backend.";
|
282 |
+
|
283 |
+
# if called after install module, overload is disable.... why?
|
284 |
+
JSON::Boolean::_overrride_overload($Module_XS);
|
285 |
+
JSON::Boolean::_overrride_overload($backend);
|
286 |
+
|
287 |
+
if ( $_USSING_bpPP ) {
|
288 |
+
eval qq| require $backend |;
|
289 |
+
}
|
290 |
+
else {
|
291 |
+
eval qq| use $backend $PP_Version () |;
|
292 |
+
}
|
293 |
+
|
294 |
+
if ($@) {
|
295 |
+
if ( $backend eq $Module_PP ) {
|
296 |
+
$JSON::DEBUG and Carp::carp "Can't load $Module_PP ($@), so try to load $Module_bp";
|
297 |
+
$_USSING_bpPP++;
|
298 |
+
$backend = $Module_bp;
|
299 |
+
JSON::Boolean::_overrride_overload($backend);
|
300 |
+
local $^W; # if PP installed but invalid version, backportPP redefines methods.
|
301 |
+
eval qq| require $Module_bp |;
|
302 |
+
}
|
303 |
+
Carp::croak $@ if $@;
|
304 |
+
}
|
305 |
+
|
306 |
+
unless (defined $opt and $opt & $_INSTALL_ONLY) {
|
307 |
+
_set_module( $JSON::Backend = $Module_PP ); # even if backportPP, set $Backend with 'JSON::PP'
|
308 |
+
JSON::Backend::PP->init;
|
309 |
+
}
|
310 |
+
};
|
311 |
+
|
312 |
+
|
313 |
+
sub _set_module {
|
314 |
+
return if defined $JSON::true;
|
315 |
+
|
316 |
+
my $module = shift;
|
317 |
+
|
318 |
+
local $^W;
|
319 |
+
no strict qw(refs);
|
320 |
+
|
321 |
+
$JSON::true = ${"$module\::true"};
|
322 |
+
$JSON::false = ${"$module\::false"};
|
323 |
+
|
324 |
+
push @JSON::ISA, $module;
|
325 |
+
if ( JSON->is_xs and JSON->backend->VERSION < 3 ) {
|
326 |
+
eval 'package JSON::PP::Boolean';
|
327 |
+
push @{"$module\::Boolean::ISA"}, qw(JSON::PP::Boolean);
|
328 |
+
}
|
329 |
+
|
330 |
+
*{"JSON::is_bool"} = \&{"$module\::is_bool"};
|
331 |
+
|
332 |
+
for my $method ($module eq $Module_XS ? @PPOnlyMethods : @XSOnlyMethods) {
|
333 |
+
*{"JSON::$method"} = sub {
|
334 |
+
Carp::carp("$method is not supported in $module.");
|
335 |
+
$_[0];
|
336 |
+
};
|
337 |
+
}
|
338 |
+
|
339 |
+
return 1;
|
340 |
+
}
|
341 |
+
|
342 |
+
|
343 |
+
|
344 |
+
#
|
345 |
+
# JSON Boolean
|
346 |
+
#
|
347 |
+
|
348 |
+
package JSON::Boolean;
|
349 |
+
|
350 |
+
my %Installed;
|
351 |
+
|
352 |
+
sub _overrride_overload {
|
353 |
+
return; # this function is currently disabled.
|
354 |
+
return if ($Installed{ $_[0] }++);
|
355 |
+
|
356 |
+
my $boolean = $_[0] . '::Boolean';
|
357 |
+
|
358 |
+
eval sprintf(q|
|
359 |
+
package %s;
|
360 |
+
use overload (
|
361 |
+
'""' => sub { ${$_[0]} == 1 ? 'true' : 'false' },
|
362 |
+
'eq' => sub {
|
363 |
+
my ($obj, $op) = ref ($_[0]) ? ($_[0], $_[1]) : ($_[1], $_[0]);
|
364 |
+
if ($op eq 'true' or $op eq 'false') {
|
365 |
+
return "$obj" eq 'true' ? 'true' eq $op : 'false' eq $op;
|
366 |
+
}
|
367 |
+
else {
|
368 |
+
return $obj ? 1 == $op : 0 == $op;
|
369 |
+
}
|
370 |
+
},
|
371 |
+
);
|
372 |
+
|, $boolean);
|
373 |
+
|
374 |
+
if ($@) { Carp::croak $@; }
|
375 |
+
|
376 |
+
if ( exists $INC{'JSON/XS.pm'} and $boolean eq 'JSON::XS::Boolean' ) {
|
377 |
+
local $^W;
|
378 |
+
my $true = do { bless \(my $dummy = 1), $boolean };
|
379 |
+
my $false = do { bless \(my $dummy = 0), $boolean };
|
380 |
+
*JSON::XS::true = sub () { $true };
|
381 |
+
*JSON::XS::false = sub () { $false };
|
382 |
+
}
|
383 |
+
elsif ( exists $INC{'JSON/PP.pm'} and $boolean eq 'JSON::PP::Boolean' ) {
|
384 |
+
local $^W;
|
385 |
+
my $true = do { bless \(my $dummy = 1), $boolean };
|
386 |
+
my $false = do { bless \(my $dummy = 0), $boolean };
|
387 |
+
*JSON::PP::true = sub { $true };
|
388 |
+
*JSON::PP::false = sub { $false };
|
389 |
+
}
|
390 |
+
|
391 |
+
return 1;
|
392 |
+
}
|
393 |
+
|
394 |
+
|
395 |
+
#
|
396 |
+
# Helper classes for Backend Module (PP)
|
397 |
+
#
|
398 |
+
|
399 |
+
package JSON::Backend::PP;
|
400 |
+
|
401 |
+
sub init {
|
402 |
+
local $^W;
|
403 |
+
no strict qw(refs); # this routine may be called after JSON::Backend::XS init was called.
|
404 |
+
*{"JSON::decode_json"} = \&{"JSON::PP::decode_json"};
|
405 |
+
*{"JSON::encode_json"} = \&{"JSON::PP::encode_json"};
|
406 |
+
*{"JSON::PP::is_xs"} = sub { 0 };
|
407 |
+
*{"JSON::PP::is_pp"} = sub { 1 };
|
408 |
+
return 1;
|
409 |
+
}
|
410 |
+
|
411 |
+
#
|
412 |
+
# To save memory, the below lines are read only when XS backend is used.
|
413 |
+
#
|
414 |
+
|
415 |
+
package JSON;
|
416 |
+
|
417 |
+
1;
|
418 |
+
__DATA__
|
419 |
+
|
420 |
+
|
421 |
+
#
|
422 |
+
# Helper classes for Backend Module (XS)
|
423 |
+
#
|
424 |
+
|
425 |
+
package JSON::Backend::XS;
|
426 |
+
|
427 |
+
use constant INDENT_LENGTH_FLAG => 15 << 12;
|
428 |
+
|
429 |
+
use constant UNSUPPORTED_ENCODE_FLAG => {
|
430 |
+
ESCAPE_SLASH => 0x00000010,
|
431 |
+
ALLOW_BIGNUM => 0x00000020,
|
432 |
+
AS_NONBLESSED => 0x00000040,
|
433 |
+
EXPANDED => 0x10000000, # for developer's
|
434 |
+
};
|
435 |
+
|
436 |
+
use constant UNSUPPORTED_DECODE_FLAG => {
|
437 |
+
LOOSE => 0x00000001,
|
438 |
+
ALLOW_BIGNUM => 0x00000002,
|
439 |
+
ALLOW_BAREKEY => 0x00000004,
|
440 |
+
ALLOW_SINGLEQUOTE => 0x00000008,
|
441 |
+
EXPANDED => 0x20000000, # for developer's
|
442 |
+
};
|
443 |
+
|
444 |
+
|
445 |
+
sub init {
|
446 |
+
local $^W;
|
447 |
+
no strict qw(refs);
|
448 |
+
*{"JSON::decode_json"} = \&{"JSON::XS::decode_json"};
|
449 |
+
*{"JSON::encode_json"} = \&{"JSON::XS::encode_json"};
|
450 |
+
*{"JSON::XS::is_xs"} = sub { 1 };
|
451 |
+
*{"JSON::XS::is_pp"} = sub { 0 };
|
452 |
+
return 1;
|
453 |
+
}
|
454 |
+
|
455 |
+
|
456 |
+
sub support_by_pp {
|
457 |
+
my ($class, @methods) = @_;
|
458 |
+
|
459 |
+
local $^W;
|
460 |
+
no strict qw(refs);
|
461 |
+
|
462 |
+
my $JSON_XS_encode_orignal = \&JSON::XS::encode;
|
463 |
+
my $JSON_XS_decode_orignal = \&JSON::XS::decode;
|
464 |
+
my $JSON_XS_incr_parse_orignal = \&JSON::XS::incr_parse;
|
465 |
+
|
466 |
+
*JSON::XS::decode = \&JSON::Backend::XS::Supportable::_decode;
|
467 |
+
*JSON::XS::encode = \&JSON::Backend::XS::Supportable::_encode;
|
468 |
+
*JSON::XS::incr_parse = \&JSON::Backend::XS::Supportable::_incr_parse;
|
469 |
+
|
470 |
+
*{JSON::XS::_original_decode} = $JSON_XS_decode_orignal;
|
471 |
+
*{JSON::XS::_original_encode} = $JSON_XS_encode_orignal;
|
472 |
+
*{JSON::XS::_original_incr_parse} = $JSON_XS_incr_parse_orignal;
|
473 |
+
|
474 |
+
push @JSON::Backend::XS::Supportable::ISA, 'JSON';
|
475 |
+
|
476 |
+
my $pkg = 'JSON::Backend::XS::Supportable';
|
477 |
+
|
478 |
+
*{JSON::new} = sub {
|
479 |
+
my $proto = JSON::XS->new; $$proto = 0;
|
480 |
+
bless $proto, $pkg;
|
481 |
+
};
|
482 |
+
|
483 |
+
|
484 |
+
for my $method (@methods) {
|
485 |
+
my $flag = uc($method);
|
486 |
+
my $type |= (UNSUPPORTED_ENCODE_FLAG->{$flag} || 0);
|
487 |
+
$type |= (UNSUPPORTED_DECODE_FLAG->{$flag} || 0);
|
488 |
+
|
489 |
+
next unless($type);
|
490 |
+
|
491 |
+
$pkg->_make_unsupported_method($method => $type);
|
492 |
+
}
|
493 |
+
|
494 |
+
# push @{"JSON::XS::Boolean::ISA"}, qw(JSON::PP::Boolean);
|
495 |
+
# push @{"JSON::PP::Boolean::ISA"}, qw(JSON::Boolean);
|
496 |
+
|
497 |
+
$JSON::DEBUG and Carp::carp("set -support_by_pp mode.");
|
498 |
+
|
499 |
+
return 1;
|
500 |
+
}
|
501 |
+
|
502 |
+
|
503 |
+
|
504 |
+
|
505 |
+
#
|
506 |
+
# Helper classes for XS
|
507 |
+
#
|
508 |
+
|
509 |
+
package JSON::Backend::XS::Supportable;
|
510 |
+
|
511 |
+
$Carp::Internal{'JSON::Backend::XS::Supportable'} = 1;
|
512 |
+
|
513 |
+
sub _make_unsupported_method {
|
514 |
+
my ($pkg, $method, $type) = @_;
|
515 |
+
|
516 |
+
local $^W;
|
517 |
+
no strict qw(refs);
|
518 |
+
|
519 |
+
*{"$pkg\::$method"} = sub {
|
520 |
+
local $^W;
|
521 |
+
if (defined $_[1] ? $_[1] : 1) {
|
522 |
+
${$_[0]} |= $type;
|
523 |
+
}
|
524 |
+
else {
|
525 |
+
${$_[0]} &= ~$type;
|
526 |
+
}
|
527 |
+
$_[0];
|
528 |
+
};
|
529 |
+
|
530 |
+
*{"$pkg\::get_$method"} = sub {
|
531 |
+
${$_[0]} & $type ? 1 : '';
|
532 |
+
};
|
533 |
+
|
534 |
+
}
|
535 |
+
|
536 |
+
|
537 |
+
sub _set_for_pp {
|
538 |
+
JSON::_load_pp( $_INSTALL_ONLY );
|
539 |
+
|
540 |
+
my $type = shift;
|
541 |
+
my $pp = JSON::PP->new;
|
542 |
+
my $prop = $_[0]->property;
|
543 |
+
|
544 |
+
for my $name (keys %$prop) {
|
545 |
+
$pp->$name( $prop->{$name} ? $prop->{$name} : 0 );
|
546 |
+
}
|
547 |
+
|
548 |
+
my $unsupported = $type eq 'encode' ? JSON::Backend::XS::UNSUPPORTED_ENCODE_FLAG
|
549 |
+
: JSON::Backend::XS::UNSUPPORTED_DECODE_FLAG;
|
550 |
+
my $flags = ${$_[0]} || 0;
|
551 |
+
|
552 |
+
for my $name (keys %$unsupported) {
|
553 |
+
next if ($name eq 'EXPANDED'); # for developer's
|
554 |
+
my $enable = ($flags & $unsupported->{$name}) ? 1 : 0;
|
555 |
+
my $method = lc $name;
|
556 |
+
$pp->$method($enable);
|
557 |
+
}
|
558 |
+
|
559 |
+
$pp->indent_length( $_[0]->get_indent_length );
|
560 |
+
|
561 |
+
return $pp;
|
562 |
+
}
|
563 |
+
|
564 |
+
sub _encode { # using with PP encode
|
565 |
+
if (${$_[0]}) {
|
566 |
+
_set_for_pp('encode' => @_)->encode($_[1]);
|
567 |
+
}
|
568 |
+
else {
|
569 |
+
$_[0]->_original_encode( $_[1] );
|
570 |
+
}
|
571 |
+
}
|
572 |
+
|
573 |
+
|
574 |
+
sub _decode { # if unsupported-flag is set, use PP
|
575 |
+
if (${$_[0]}) {
|
576 |
+
_set_for_pp('decode' => @_)->decode($_[1]);
|
577 |
+
}
|
578 |
+
else {
|
579 |
+
$_[0]->_original_decode( $_[1] );
|
580 |
+
}
|
581 |
+
}
|
582 |
+
|
583 |
+
|
584 |
+
sub decode_prefix { # if unsupported-flag is set, use PP
|
585 |
+
_set_for_pp('decode' => @_)->decode_prefix($_[1]);
|
586 |
+
}
|
587 |
+
|
588 |
+
|
589 |
+
sub _incr_parse {
|
590 |
+
if (${$_[0]}) {
|
591 |
+
_set_for_pp('decode' => @_)->incr_parse($_[1]);
|
592 |
+
}
|
593 |
+
else {
|
594 |
+
$_[0]->_original_incr_parse( $_[1] );
|
595 |
+
}
|
596 |
+
}
|
597 |
+
|
598 |
+
|
599 |
+
sub get_indent_length {
|
600 |
+
${$_[0]} << 4 >> 16;
|
601 |
+
}
|
602 |
+
|
603 |
+
|
604 |
+
sub indent_length {
|
605 |
+
my $length = $_[1];
|
606 |
+
|
607 |
+
if (!defined $length or $length > 15 or $length < 0) {
|
608 |
+
Carp::carp "The acceptable range of indent_length() is 0 to 15.";
|
609 |
+
}
|
610 |
+
else {
|
611 |
+
local $^W;
|
612 |
+
$length <<= 12;
|
613 |
+
${$_[0]} &= ~ JSON::Backend::XS::INDENT_LENGTH_FLAG;
|
614 |
+
${$_[0]} |= $length;
|
615 |
+
*JSON::XS::encode = \&JSON::Backend::XS::Supportable::_encode;
|
616 |
+
}
|
617 |
+
|
618 |
+
$_[0];
|
619 |
+
}
|
620 |
+
|
621 |
+
|
622 |
+
1;
|
623 |
+
__END__
|
624 |
+
|
625 |
+
=head1 NAME
|
626 |
+
|
627 |
+
JSON - JSON (JavaScript Object Notation) encoder/decoder
|
628 |
+
|
629 |
+
=head1 SYNOPSIS
|
630 |
+
|
631 |
+
use JSON; # imports encode_json, decode_json, to_json and from_json.
|
632 |
+
|
633 |
+
# simple and fast interfaces (expect/generate UTF-8)
|
634 |
+
|
635 |
+
$utf8_encoded_json_text = encode_json $perl_hash_or_arrayref;
|
636 |
+
$perl_hash_or_arrayref = decode_json $utf8_encoded_json_text;
|
637 |
+
|
638 |
+
# OO-interface
|
639 |
+
|
640 |
+
$json = JSON->new->allow_nonref;
|
641 |
+
|
642 |
+
$json_text = $json->encode( $perl_scalar );
|
643 |
+
$perl_scalar = $json->decode( $json_text );
|
644 |
+
|
645 |
+
$pretty_printed = $json->pretty->encode( $perl_scalar ); # pretty-printing
|
646 |
+
|
647 |
+
# To use features supported only by JSON::PP, import with '-support_by_pp'.
|
648 |
+
# When a feature unsupported by XS is enabled, PP (de|en)code is used instead of the XS ones.
|
649 |
+
|
650 |
+
use JSON -support_by_pp;
|
651 |
+
|
652 |
+
# option-acceptable interfaces (expect/generate UNICODE by default)
|
653 |
+
|
654 |
+
$json_text = to_json( $perl_scalar, { ascii => 1, pretty => 1 } );
|
655 |
+
$perl_scalar = from_json( $json_text, { utf8 => 1 } );
|
656 |
+
|
657 |
+
# When choosing between (en|de)code_json and (to|from)_json: if you are writing
|
658 |
+
# code that communicates with the outside world (with data encoded in UTF-8),
|
659 |
+
# (en|de)code_json is recommended.
|
660 |
+
|
661 |
+
=head1 VERSION
|
662 |
+
|
663 |
+
2.90
|
664 |
+
|
665 |
+
This version is compatible with JSON::XS B<2.34> and later.
|
666 |
+
(Not yet compatible with JSON::XS B<3.0x>.)
|
667 |
+
|
668 |
+
|
669 |
+
=head1 NOTE
|
670 |
+
|
671 |
+
JSON::PP was earlier included in the C<JSON> distribution, but
|
672 |
+
has since Perl 5.14 been a core module. For this reason,
|
673 |
+
L<JSON::PP> was removed from the JSON distribution and can now
|
674 |
+
be found also in the Perl5 repository at
|
675 |
+
|
676 |
+
=over
|
677 |
+
|
678 |
+
=item * L<http://perl5.git.perl.org/perl.git>
|
679 |
+
|
680 |
+
=back
|
681 |
+
|
682 |
+
(The newest JSON::PP version still exists in CPAN.)
|
683 |
+
|
684 |
+
Instead, the C<JSON> distribution will include JSON::backportPP
|
685 |
+
for backward compatibility. JSON.pm should thus work as it did
|
686 |
+
before.
|
687 |
+
|
688 |
+
=head1 DESCRIPTION
|
689 |
+
|
690 |
+
*************************** CAUTION **************************************
|
691 |
+
* *
|
692 |
+
* INCOMPATIBLE CHANGE (JSON::XS version 2.90) *
|
693 |
+
* *
|
694 |
+
* JSON.pm had patched JSON::XS::Boolean and JSON::PP::Boolean internally *
|
695 |
+
* on loading time for making these modules inherit JSON::Boolean. *
|
696 |
+
* But since JSON::XS v3.0 it uses Types::Serialiser as boolean class. *
|
697 |
+
* So JSON.pm now breaks the boolean class overload features and *
|
698 |
+
* -support_by_pp if JSON::XS v3.0 or later is installed. *
|
699 |
+
* *
|
700 |
+
* JSON::true and JSON::false returned JSON::Boolean objects. *
|
701 |
+
* For workaround, they return JSON::PP::Boolean objects in this version. *
|
702 |
+
* *
|
703 |
+
* isa_ok(JSON::true, 'JSON::PP::Boolean'); *
|
704 |
+
* *
|
705 |
+
* And it discards a feature: *
|
706 |
+
* *
|
707 |
+
* ok(JSON::true eq 'true'); *
|
708 |
+
* *
|
709 |
+
* In other words, JSON::PP::Boolean overloads numeric only. *
|
710 |
+
* *
|
711 |
+
* ok( JSON::true == 1 ); *
|
712 |
+
* *
|
713 |
+
**************************************************************************
|
714 |
+
|
715 |
+
************************** CAUTION ********************************
|
716 |
+
* This is 'JSON module version 2' and there are many differences *
|
717 |
+
* to version 1.xx *
|
718 |
+
* Please check your applications using old version. *
|
719 |
+
* See to 'INCOMPATIBLE CHANGES TO OLD VERSION' *
|
720 |
+
*******************************************************************
|
721 |
+
|
722 |
+
JSON (JavaScript Object Notation) is a simple data format.
|
723 |
+
See L<http://www.json.org/> and C<RFC4627> (L<http://www.ietf.org/rfc/rfc4627.txt>).
|
724 |
+
|
725 |
+
This module converts Perl data structures to JSON and vice versa using either
|
726 |
+
L<JSON::XS> or L<JSON::PP>.
|
727 |
+
|
728 |
+
JSON::XS is the fastest and most proper JSON module on CPAN which must be
|
729 |
+
compiled and installed in your environment.
|
730 |
+
JSON::PP is a pure-Perl module which is bundled in this distribution and
|
731 |
+
has a strong compatibility to JSON::XS.
|
732 |
+
|
733 |
+
This module tries to use JSON::XS by default and, if that fails, uses JSON::PP instead.
|
734 |
+
So its features completely depend on JSON::XS or JSON::PP.
|
735 |
+
|
736 |
+
See L<BACKEND MODULE DECISION>.
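The backend selection described above can be inspected at run time. The following is a minimal sketch, assuming the C<JSON> module is installed; it only uses the C<backend>, C<is_xs> and C<is_pp> methods defined in this file, and the printed backend name depends on what is installed on your system.

```perl
use strict;
use warnings;
use JSON ();

# Ask JSON.pm which worker module it picked when it was loaded.
my $backend = JSON->backend;
my $json    = JSON->new;

print "backend:  $backend\n";
print "using XS: ", ( $json->is_xs ? "yes" : "no" ), "\n";
print "using PP: ", ( $json->is_pp ? "yes" : "no" ), "\n";
```

According to the loading code in this file, setting the environment variable PERL_JSON_BACKEND to C<JSON::PP> before the module is loaded should force the pure-Perl backend.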
|
737 |
+
|
738 |
+
To distinguish the module name 'JSON' and the format type JSON,
|
739 |
+
the former is quoted by CE<lt>E<gt> (how it renders depends on your viewing medium),
|
740 |
+
and the latter is left just as it is.
|
741 |
+
|
742 |
+
Module name : C<JSON>
|
743 |
+
|
744 |
+
Format type : JSON
|
745 |
+
|
746 |
+
=head2 FEATURES
|
747 |
+
|
748 |
+
=over
|
749 |
+
|
750 |
+
=item * correct unicode handling
|
751 |
+
|
752 |
+
This module (i.e. backend modules) knows how to handle Unicode, documents
|
753 |
+
how and when it does so, and even documents what "correct" means.
|
754 |
+
|
755 |
+
Even though there are limitations, this feature is available since Perl version 5.6.
|
756 |
+
|
757 |
+
JSON::XS requires Perl 5.8.2 (but works correctly in 5.8.8 or later), so on older versions
|
758 |
+
C<JSON> should use JSON::PP as the backend, which works with Perl 5.005 and later.
|
759 |
+
|
760 |
+
JSON::PP works with Perl 5.8.x, but from 5.8.0 to 5.8.2 it runs more slowly
|
761 |
+
because of a problem on the Perl side. In 5.005, Unicode handling is not available.
|
762 |
+
See L<JSON::PP/UNICODE HANDLING ON PERLS> for more information.
|
763 |
+
|
764 |
+
See also L<JSON::XS/A FEW NOTES ON UNICODE AND PERL>
|
765 |
+
and L<JSON::XS/ENCODING/CODESET_FLAG_NOTES>.
|
766 |
+
|
767 |
+
|
768 |
+
=item * round-trip integrity
|
769 |
+
|
770 |
+
When you serialise a perl data structure using only data types supported
|
771 |
+
by JSON and Perl, the deserialised data structure is identical on the Perl
|
772 |
+
level. (e.g. the string "2.0" doesn't suddenly become "2" just because
|
773 |
+
it looks like a number). There I<are> minor exceptions to this, read the
|
774 |
+
L</MAPPING> section below to learn about those.
|
775 |
+
|
776 |
+
|
777 |
+
=item * strict checking of JSON correctness
|
778 |
+
|
779 |
+
There is no guessing, no generating of illegal JSON texts by default,
|
780 |
+
and only JSON is accepted as input by default (the latter is a security
|
781 |
+
feature).
|
782 |
+
|
783 |
+
See L<JSON::XS/FEATURES> and L<JSON::PP/FEATURES>.
|
784 |
+
|
785 |
+
=item * fast
|
786 |
+
|
787 |
+
This module returns a JSON::XS object itself if available.
|
788 |
+
Compared to other JSON modules and other serialisers such as Storable,
|
789 |
+
JSON::XS usually compares favorably in terms of speed, too.
|
790 |
+
|
791 |
+
If JSON::XS is not available, C<JSON> returns a JSON::PP object instead,
|
792 |
+
which is much slower because it is pure Perl.
|
793 |
+
|
794 |
+
=item * simple to use
|
795 |
+
|
796 |
+
This module has both a simple functional interface as well as an
|
797 |
+
object-oriented interface.
|
798 |
+
|
799 |
+
=item * reasonably versatile output formats
|
800 |
+
|
801 |
+
You can choose between the most compact guaranteed-single-line format possible
|
802 |
+
(nice for simple line-based protocols), a pure-ASCII format (for when your transport
|
803 |
+
is not 8-bit clean, still supports the whole Unicode range), or a pretty-printed
|
804 |
+
format (for when you want to read that stuff). Or you can combine those features
|
805 |
+
in whatever way you like.
|
806 |
+
|
807 |
+
=back
|
808 |
+
|
809 |
+
=head1 FUNCTIONAL INTERFACE
|
810 |
+
|
811 |
+
Some documents are copied and modified from L<JSON::XS/FUNCTIONAL INTERFACE>.
|
812 |
+
C<to_json> and C<from_json> are additional functions.
|
813 |
+
|
814 |
+
=head2 encode_json
|
815 |
+
|
816 |
+
$json_text = encode_json $perl_scalar
|
817 |
+
|
818 |
+
Converts the given Perl data structure to a UTF-8 encoded, binary string.
|
819 |
+
|
820 |
+
This function call is functionally identical to:
|
821 |
+
|
822 |
+
$json_text = JSON->new->utf8->encode($perl_scalar)
|
823 |
+
|
824 |
+
=head2 decode_json
|
825 |
+
|
826 |
+
$perl_scalar = decode_json $json_text
|
827 |
+
|
828 |
+
The opposite of C<encode_json>: expects a UTF-8 (binary) string and tries
|
829 |
+
to parse that as a UTF-8 encoded JSON text, returning the resulting
|
830 |
+
reference.
|
831 |
+
|
832 |
+
This function call is functionally identical to:
|
833 |
+
|
834 |
+
$perl_scalar = JSON->new->utf8->decode($json_text)
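As an illustration of the two functions above, here is a small round-trip sketch. The sample data is made up for the example; the point is that C<encode_json> produces UTF-8 bytes and C<decode_json> turns them back into Perl character data.

```perl
use strict;
use warnings;
use JSON qw(encode_json decode_json);

# Hypothetical sample data containing one non-ASCII character (U+00E9).
my $data = { name => "caf\x{e9}", tags => [ 'a', 'b' ] };

my $bytes = encode_json($data);     # UTF-8 encoded byte string
my $copy  = decode_json($bytes);    # back to a Perl data structure

print $copy->{name} eq $data->{name} ? "round trip ok\n" : "mismatch\n";
```

Because the encoded text is a byte string, it can be written to a socket or a file opened without an encoding layer.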
|
835 |
+
|
836 |
+
|
837 |
+
=head2 to_json
|
838 |
+
|
839 |
+
$json_text = to_json($perl_scalar)
|
840 |
+
|
841 |
+
Converts the given Perl data structure to a JSON string.
|
842 |
+
|
843 |
+
This function call is functionally identical to:
|
844 |
+
|
845 |
+
$json_text = JSON->new->encode($perl_scalar)
|
846 |
+
|
847 |
+
Takes a hash reference as the second argument.
|
848 |
+
|
849 |
+
$json_text = to_json($perl_scalar, $flag_hashref)
|
850 |
+
|
851 |
+
So,
|
852 |
+
|
853 |
+
$json_text = to_json($perl_scalar, {utf8 => 1, pretty => 1})
|
854 |
+
|
855 |
+
is equivalent to:
|
856 |
+
|
857 |
+
$json_text = JSON->new->utf8(1)->pretty(1)->encode($perl_scalar)
|
858 |
+
|
859 |
+
If you are writing modern Perl code that communicates with the outside world,
|
860 |
+
you should use C<encode_json> (assuming the JSON data is encoded in UTF-8).
|
861 |
+
|
862 |
+
=head2 from_json
|
863 |
+
|
864 |
+
$perl_scalar = from_json($json_text)
|
865 |
+
|
866 |
+
The opposite of C<to_json>: expects a JSON string and tries
|
867 |
+
to parse it, returning the resulting reference.
|
868 |
+
|
869 |
+
This function call is functionally identical to:
|
870 |
+
|
871 |
+
$perl_scalar = JSON->new->decode($json_text)
|
872 |
+
|
873 |
+
Takes a hash reference as the second argument.
|
874 |
+
|
875 |
+
$perl_scalar = from_json($json_text, $flag_hashref)
|
876 |
+
|
877 |
+
So,
|
878 |
+
|
879 |
+
$perl_scalar = from_json($json_text, {utf8 => 1})
|
880 |
+
|
881 |
+
is equivalent to:
|
882 |
+
|
883 |
+
$perl_scalar = JSON->new->utf8(1)->decode($json_text)
|
884 |
+
|
885 |
+
If you are writing modern Perl code that communicates with the outside world,
|
886 |
+
you should use C<decode_json> (assuming the JSON data is encoded in UTF-8).
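The option-taking forms of C<to_json> and C<from_json> can be sketched as follows. The data is invented for the example; C<canonical> is one of the options listed among the module's properties and is used here only so that the key order is predictable.

```perl
use strict;
use warnings;
use JSON qw(to_json from_json);

# Each key of the option hash is called as a method on a fresh JSON object.
my $text = to_json( { b => 2, a => 1 }, { canonical => 1 } );
print "$text\n";    # keys are sorted because canonical is enabled

# Without options, from_json expects a character string, not UTF-8 bytes.
my $list = from_json('[1, "two", null]');
print scalar(@$list), " elements\n";
```

Passing C<< { utf8 => 1 } >> as the second argument would make these behave like C<encode_json> and C<decode_json>.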
|
887 |
+
|
888 |
+
=head2 JSON::is_bool
|
889 |
+
|
890 |
+
$is_boolean = JSON::is_bool($scalar)
|
891 |
+
|
892 |
+
Returns true if the passed scalar represents either JSON::true or
|
893 |
+
JSON::false, two constants that act like C<1> and C<0> respectively
|
894 |
+
and are also used to represent JSON C<true> and C<false> in Perl strings.
|
895 |
+
|
896 |
+
=head2 JSON::true
|
897 |
+
|
898 |
+
Returns the JSON true value, which is a blessed object.
|
899 |
+
It is a C<JSON::Boolean> object.
|
900 |
+
|
901 |
+
=head2 JSON::false
|
902 |
+
|
903 |
+
Returns the JSON false value, which is a blessed object.
|
904 |
+
It is a C<JSON::Boolean> object.
|
905 |
+
|
906 |
+
=head2 JSON::null
|
907 |
+
|
908 |
+
Returns C<undef>.
|
909 |
+
|
910 |
+
See L<MAPPING>, below, for more information on how JSON values are mapped to
|
911 |
+
Perl.
|
912 |
+
|
913 |
+
=head1 HOW DO I DECODE A DATA FROM OUTER AND ENCODE TO OUTER
|
914 |
+
|
915 |
+
This section assumes that your Perl version is 5.8 or later.
|
916 |
+
|
917 |
+
If you know that JSON text from the outside world (a network, a file, and so on)
|
918 |
+
is encoded in UTF-8, you should use C<decode_json> or a C<JSON> object
|
919 |
+
with C<utf8> enabled. The decoded result will contain Unicode characters.
|
920 |
+
|
921 |
+
# from network
|
922 |
+
my $json = JSON->new->utf8;
|
923 |
+
my $json_text = CGI->new->param( 'json_data' );
|
924 |
+
my $perl_scalar = $json->decode( $json_text );
|
925 |
+
|
926 |
+
# from file content
|
927 |
+
local $/;
|
928 |
+
open( my $fh, '<', 'json.data' );
|
929 |
+
$json_text = <$fh>;
|
930 |
+
$perl_scalar = decode_json( $json_text );
|
931 |
+
|
932 |
+
If the external data is not encoded in UTF-8, you should first C<decode> it.
|
933 |
+
|
934 |
+
use Encode;
|
935 |
+
local $/;
|
936 |
+
open( my $fh, '<', 'json.data' );
|
937 |
+
my $encoding = 'cp932';
|
938 |
+
my $unicode_json_text = decode( $encoding, <$fh> ); # UNICODE
|
939 |
+
|
940 |
+
# or you can write the below code.
|
941 |
+
#
|
942 |
+
# open( my $fh, "<:encoding($encoding)", 'json.data' );
|
943 |
+
# $unicode_json_text = <$fh>;
|
944 |
+
|
945 |
+
In this case, C<$unicode_json_text> is of course a Unicode string.
|
946 |
+
So you B<cannot> use C<decode_json> or a C<JSON> object with C<utf8> enabled.
|
947 |
+
Instead, use a C<JSON> object with C<utf8> disabled, or C<from_json>.
|
948 |
+
|
949 |
+
$perl_scalar = $json->utf8(0)->decode( $unicode_json_text );
|
950 |
+
# or
|
951 |
+
$perl_scalar = from_json( $unicode_json_text );
|
952 |
+
|
953 |
+
Or C<encode 'utf8'> and C<decode_json>:
|
954 |
+
|
955 |
+
$perl_scalar = decode_json( encode( 'utf8', $unicode_json_text ) );
|
956 |
+
# this way is not efficient.
|
957 |
+
|
958 |
+
And now, you want to convert your C<$perl_scalar> into JSON data and
|
959 |
+
send it to the outside world (a network, a file, and so on).
|
960 |
+
|
961 |
+
If your data contains Unicode strings and you want the converted data to be encoded
|
962 |
+
in UTF-8, you should use C<encode_json> or a C<JSON> object with C<utf8> enabled.
|
963 |
+
|
964 |
+
print encode_json( $perl_scalar ); # to a network? file? or display?
|
965 |
+
# or
|
966 |
+
print $json->utf8->encode( $perl_scalar );
|
967 |
+
|
968 |
+
If C<$perl_scalar> does not contain Unicode but C<$encoding>-encoded strings
|
969 |
+
for some reason, then Perl regards its characters as B<latin1>
|
970 |
+
(because Perl knows nothing about your C<$encoding>).
|
971 |
+
You B<cannot> use C<encode_json> or a C<JSON> object with C<utf8> enabled.
|
972 |
+
Instead, use a C<JSON> object with C<utf8> disabled, or C<to_json>.
|
973 |
+
Note that the resulting text is a Unicode string, but printing it is not a problem.
|
974 |
+
|
975 |
+
# $perl_scalar contains $encoding encoded string values
|
976 |
+
$unicode_json_text = $json->utf8(0)->encode( $perl_scalar );
|
977 |
+
# or
|
978 |
+
$unicode_json_text = to_json( $perl_scalar );
|
979 |
+
# $unicode_json_text consists of characters less than 0x100
|
980 |
+
print $unicode_json_text;
|
981 |
+
|
982 |
+
Or C<decode $encoding> all string values and C<encode_json>:
|
983 |
+
|
984 |
+
$perl_scalar->{ foo } = decode( $encoding, $perl_scalar->{ foo } );
|
985 |
+
# ... do this for each string value, then encode_json
|
986 |
+
$json_text = encode_json( $perl_scalar );
|
987 |
+
|
988 |
+
This is the proper way, but it is probably not efficient.
|
989 |
+
|
990 |
+
See L<Encode> and L<perluniintro>.
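The whole path for non-UTF-8 input can be sketched as follows. This is only an illustration: it uses ISO-8859-1 in place of C<cp932> to keep the input short, JSON::PP as a stand-in backend, and a made-up JSON text.

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;   # assumption: JSON::PP stands in for JSON

# Hypothetical input: JSON octets in ISO-8859-1, where 0xF6 is o-umlaut.
my $encoding = 'iso-8859-1';
my $octets   = qq({"city":"K\xf6ln"});

# Step 1: turn the octets into a Unicode string.
my $unicode_json_text = decode( $encoding, $octets );

# Step 2: decode with utf8 disabled, because the text is already Unicode.
my $perl_scalar = JSON::PP->new->utf8(0)->decode($unicode_json_text);

# Step 3: for output, encode with utf8 enabled to get UTF-8 octets.
my $utf8_json = JSON::PP->new->utf8->encode($perl_scalar);

printf "%d characters in, %d octets out\n",
    length($unicode_json_text), length($utf8_json);
```

The octet count is larger than the character count because the one non-ASCII character takes two octets in UTF-8.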
|
991 |
+
|
992 |
+
|
993 |
+
=head1 COMMON OBJECT-ORIENTED INTERFACE
|
994 |
+
|
995 |
+
=head2 new
|
996 |
+
|
997 |
+
$json = JSON->new
|
998 |
+
|
999 |
+
Returns a new C<JSON> object inherited from either JSON::XS or JSON::PP
|
1000 |
+
that can be used to de/encode JSON strings.
|
1001 |
+
|
1002 |
+
All boolean flags described below are by default I<disabled>.
|
1003 |
+
|
1004 |
+
The mutators for flags all return the JSON object again and thus calls can
|
1005 |
+
be chained:
|
1006 |
+
|
1007 |
+
my $json = JSON->new->utf8->space_after->encode({a => [1,2]})
|
1008 |
+
=> {"a": [1, 2]}
|
1009 |
+
|
1010 |
+
=head2 ascii
|
1011 |
+
|
1012 |
+
$json = $json->ascii([$enable])
|
1013 |
+
|
1014 |
+
$enabled = $json->get_ascii
|
1015 |
+
|
1016 |
+
If $enable is true (or missing), then the encode method will not generate characters outside
|
1017 |
+
the code range 0..127. Any Unicode characters outside that range will be escaped using either
|
1018 |
+
a single \uXXXX or a double \uHHHH\uLLLL escape sequence, as per RFC4627.
|
1019 |
+
|
1020 |
+
If $enable is false, then the encode method will not escape Unicode characters unless
|
1021 |
+
required by the JSON syntax or other flags. This results in a faster and more compact format.
|
1022 |
+
|
1023 |
+
This feature depends on the used Perl version and environment.
|
1024 |
+
|
1025 |
+
See L<JSON::PP/UNICODE HANDLING ON PERLS> if the backend is PP.
|
1026 |
+
|
1027 |
+
JSON->new->ascii(1)->encode([chr 0x10401])
|
1028 |
+
=> ["\ud801\udc01"]
|
1029 |
+
|
1030 |
+
=head2 latin1
|
1031 |
+
|
1032 |
+
$json = $json->latin1([$enable])
|
1033 |
+
|
1034 |
+
$enabled = $json->get_latin1
|
1035 |
+
|
1036 |
+
If $enable is true (or missing), then the encode method will encode the resulting JSON
|
1037 |
+
text as latin1 (or iso-8859-1), escaping any characters outside the code range 0..255.
|
1038 |
+
|
1039 |
+
If $enable is false, then the encode method will not escape Unicode characters
|
1040 |
+
unless required by the JSON syntax or other flags.
|
1041 |
+
|
1042 |
+
JSON->new->latin1->encode (["\x{89}\x{abc}"])
|
1043 |
+
=> ["\x{89}\\u0abc"] # (perl syntax, U+abc escaped, U+89 not)
|
1044 |
+
|
1045 |
+
=head2 utf8
|
1046 |
+
|
1047 |
+
$json = $json->utf8([$enable])
|
1048 |
+
|
1049 |
+
$enabled = $json->get_utf8
|
1050 |
+
|
1051 |
+
If $enable is true (or missing), then the encode method will encode the JSON result
|
1052 |
+
into UTF-8, as required by many protocols, while the decode method expects to be handed
|
1053 |
+
a UTF-8-encoded string. Please note that UTF-8-encoded strings do not contain any
|
1054 |
+
characters outside the range 0..255, they are thus useful for bytewise/binary I/O.
|
1055 |
+
|
1056 |
+
In future versions, enabling this option might enable autodetection of the UTF-16 and UTF-32
|
1057 |
+
encoding families, as described in RFC4627.
|
1058 |
+
|
1059 |
+
If $enable is false, then the encode method will return the JSON string as a (non-encoded)
|
1060 |
+
Unicode string, while decode thus expects a Unicode string. Any decoding or encoding
|
1061 |
+
(e.g. to UTF-8 or UTF-16) needs to be done yourself, e.g. using the Encode module.
|
1062 |
+
|
1063 |
+
|
1064 |
+
Example, output UTF-16BE-encoded JSON:
|
1065 |
+
|
1066 |
+
use Encode;
|
1067 |
+
$jsontext = encode "UTF-16BE", JSON::XS->new->encode ($object);
|
1068 |
+
|
1069 |
+
Example, decode UTF-32LE-encoded JSON:
|
1070 |
+
|
1071 |
+
use Encode;
|
1072 |
+
$object = JSON::XS->new->decode (decode "UTF-32LE", $jsontext);
|
1073 |
+
|
1074 |
+
See L<JSON::PP/UNICODE HANDLING ON PERLS> if the backend is PP.
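To make the difference between C<ascii>, C<latin1> and C<utf8> concrete, the sketch below encodes the same two-character string with each flag. It assumes the JSON::PP backend; the test string is arbitrary.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

# One character inside Latin-1 (U+00E9) and one outside it (U+20AC).
my $data = [ "\x{e9}\x{20ac}" ];

my $ascii  = JSON::PP->new->ascii->encode($data);    # both characters escaped as \uXXXX
my $latin1 = JSON::PP->new->latin1->encode($data);   # only U+20AC is escaped
my $utf8   = JSON::PP->new->utf8->encode($data);     # nothing escaped, UTF-8 octets

# The three results differ in length because of the escaping choices.
printf "ascii=%d latin1=%d utf8=%d\n",
    length($ascii), length($latin1), length($utf8);
```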
|
1075 |
+
|
1076 |
+
|
1077 |
+
=head2 pretty
|
1078 |
+
|
1079 |
+
$json = $json->pretty([$enable])
|
1080 |
+
|
1081 |
+
This enables (or disables) all of the C<indent>, C<space_before> and
|
1082 |
+
C<space_after> (and in the future possibly more) flags in one call to
|
1083 |
+
generate the most readable (or most compact) form possible.
|
1084 |
+
|
1085 |
+
Equivalent to:
|
1086 |
+
|
1087 |
+
$json->indent->space_before->space_after
|
1088 |
+
|
1089 |
+
The indent space length is three and JSON::XS cannot change the indent
|
1090 |
+
space length.
|
1091 |
+
|
1092 |
+
=head2 indent
|
1093 |
+
|
1094 |
+
$json = $json->indent([$enable])
|
1095 |
+
|
1096 |
+
$enabled = $json->get_indent
|
1097 |
+
|
1098 |
+
If C<$enable> is true (or missing), then the C<encode> method will use a multiline
|
1099 |
+
format as output, putting every array member or object/hash key-value pair
|
1100 |
+
into its own line, indenting them properly.
|
1101 |
+
|
1102 |
+
If C<$enable> is false, no newlines or indenting will be produced, and the
|
1103 |
+
resulting JSON text is guaranteed not to contain any C<newlines>.
|
1104 |
+
|
1105 |
+
This setting has no effect when decoding JSON texts.
|
1106 |
+
|
1107 |
+
The indent space length is three.
|
1108 |
+
With JSON::PP, you can also access C<indent_length> to change indent space length.
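A small sketch contrasting compact and pretty output. It assumes the JSON::PP backend, and C<canonical> is enabled only so that the key order is predictable; the data is made up.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

my $data = { name => 'demo', tags => [ 'a', 'b' ] };

# canonical is added only to make the key order predictable in this sketch.
my $compact = JSON::PP->new->canonical->encode($data);
my $pretty  = JSON::PP->new->canonical->pretty->encode($data);

print "compact: $compact\n";
print "pretty:\n$pretty";
```

The compact form stays on one line, while the pretty form puts each member on its own indented line with spaces around the C<:>.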
|
1109 |
+
|
1110 |
+
|
1111 |
+
=head2 space_before
|
1112 |
+
|
1113 |
+
$json = $json->space_before([$enable])
|
1114 |
+
|
1115 |
+
$enabled = $json->get_space_before
|
1116 |
+
|
1117 |
+
If C<$enable> is true (or missing), then the C<encode> method will add an extra
|
1118 |
+
optional space before the C<:> separating keys from values in JSON objects.
|
1119 |
+
|
1120 |
+
If C<$enable> is false, then the C<encode> method will not add any extra
|
1121 |
+
space at those places.
|
1122 |
+
|
1123 |
+
This setting has no effect when decoding JSON texts.
|
1124 |
+
|
1125 |
+
Example, space_before enabled, space_after and indent disabled:
|
1126 |
+
|
1127 |
+
{"key" :"value"}
|
1128 |
+
|
1129 |
+
|
1130 |
+
=head2 space_after
|
1131 |
+
|
1132 |
+
$json = $json->space_after([$enable])
|
1133 |
+
|
1134 |
+
$enabled = $json->get_space_after
|
1135 |
+
|
1136 |
+
If C<$enable> is true (or missing), then the C<encode> method will add an extra
|
1137 |
+
optional space after the C<:> separating keys from values in JSON objects
|
1138 |
+
and extra whitespace after the C<,> separating key-value pairs and array
|
1139 |
+
members.
|
1140 |
+
|
1141 |
+
If C<$enable> is false, then the C<encode> method will not add any extra
|
1142 |
+
space at those places.
|
1143 |
+
|
1144 |
+
This setting has no effect when decoding JSON texts.
|
1145 |
+
|
1146 |
+
Example, space_before and indent disabled, space_after enabled:
|
1147 |
+
|
1148 |
+
{"key": "value"}
|
1149 |
+
|
1150 |
+
|
1151 |
+
=head2 relaxed
|
1152 |
+
|
1153 |
+
$json = $json->relaxed([$enable])
|
1154 |
+
|
1155 |
+
$enabled = $json->get_relaxed
|
1156 |
+
|
1157 |
+
If C<$enable> is true (or missing), then C<decode> will accept some
|
1158 |
+
extensions to normal JSON syntax (see below). C<encode> will not be
|
1159 |
+
affected in any way. I<Be aware that this option makes you accept invalid
|
1160 |
+
JSON texts as if they were valid!>. I suggest using this option only to
|
1161 |
+
parse application-specific files written by humans (configuration files,
|
1162 |
+
resource files etc.)
|
1163 |
+
|
1164 |
+
If C<$enable> is false (the default), then C<decode> will only accept
|
1165 |
+
valid JSON texts.
|
1166 |
+
|
1167 |
+
Currently accepted extensions are:
|
1168 |
+
|
1169 |
+
=over 4
|
1170 |
+
|
1171 |
+
=item * list items can have an end-comma
|
1172 |
+
|
1173 |
+
JSON I<separates> array elements and key-value pairs with commas. This
|
1174 |
+
can be annoying if you write JSON texts manually and want to be able to
|
1175 |
+
quickly append elements, so this extension accepts comma at the end of
|
1176 |
+
such items not just between them:
|
1177 |
+
|
1178 |
+
[
|
1179 |
+
1,
|
1180 |
+
2, <- this comma not normally allowed
|
1181 |
+
]
|
1182 |
+
{
|
1183 |
+
"k1": "v1",
|
1184 |
+
"k2": "v2", <- this comma not normally allowed
|
1185 |
+
}
|
1186 |
+
|
1187 |
+
=item * shell-style '#'-comments
|
1188 |
+
|
1189 |
+
Whenever JSON allows whitespace, shell-style comments are additionally
|
1190 |
+
allowed. They are terminated by the first carriage-return or line-feed
|
1191 |
+
character, after which more white-space and comments are allowed.
|
1192 |
+
|
1193 |
+
[
|
1194 |
+
1, # this comment not allowed in JSON
|
1195 |
+
# neither this one...
|
1196 |
+
]
|
1197 |
+
|
1198 |
+
=back
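Both extensions can be tried together, as in the following sketch. It assumes the JSON::PP backend and uses a hypothetical hand-written configuration text.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

# Hypothetical hand-written configuration text using both extensions.
my $text = <<'END_JSON';
{
    # ports the service listens on
    "ports": [ 80, 443, ],
    "name": "demo",
}
END_JSON

my $config = JSON::PP->new->relaxed->decode($text);

printf "%s listens on %d ports\n", $config->{name}, scalar @{ $config->{ports} };
```

Without C<relaxed>, the same text is rejected, so the flag is best limited to files you control.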
|
1199 |
+
|
1200 |
+
|
1201 |
+
=head2 canonical
|
1202 |
+
|
1203 |
+
$json = $json->canonical([$enable])
|
1204 |
+
|
1205 |
+
$enabled = $json->get_canonical
|
1206 |
+
|
1207 |
+
If C<$enable> is true (or missing), then the C<encode> method will output JSON objects
|
1208 |
+
by sorting their keys. This adds a comparatively high overhead.
|
1209 |
+
|
1210 |
+
If C<$enable> is false, then the C<encode> method will output key-value
|
1211 |
+
pairs in the order Perl stores them (which will likely change between runs
|
1212 |
+
of the same script).
|
1213 |
+
|
1214 |
+
This option is useful if you want the same data structure to be encoded as
|
1215 |
+
the same JSON text (given the same overall settings). If it is disabled,
|
1216 |
+
the same hash might be encoded differently even if it contains the same data,
|
1217 |
+
as key-value pairs have no inherent ordering in Perl.
|
1218 |
+
|
1219 |
+
This setting has no effect when decoding JSON texts.
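A minimal sketch of the effect, assuming the JSON::PP backend and an arbitrary three-key hash:

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

my %settings = ( c => 3, a => 1, b => 2 );

# With canonical enabled, keys are emitted in sorted order on every run.
my $stable = JSON::PP->new->canonical->encode( \%settings );

print "$stable\n";   # {"a":1,"b":2,"c":3}
```

This makes the output suitable for comparing, hashing or diffing encoded texts.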
|
1220 |
+
|
1221 |
+
=head2 allow_nonref
|
1222 |
+
|
1223 |
+
$json = $json->allow_nonref([$enable])
|
1224 |
+
|
1225 |
+
$enabled = $json->get_allow_nonref
|
1226 |
+
|
1227 |
+
If C<$enable> is true (or missing), then the C<encode> method can convert a
|
1228 |
+
non-reference into its corresponding string, number or null JSON value,
|
1229 |
+
which is an extension to RFC4627. Likewise, C<decode> will accept those JSON
|
1230 |
+
values instead of croaking.
|
1231 |
+
|
1232 |
+
If C<$enable> is false, then the C<encode> method will croak if it isn't
|
1233 |
+
passed an arrayref or hashref, as JSON texts must either be an object
|
1234 |
+
or array. Likewise, C<decode> will croak if given something that is not a
|
1235 |
+
JSON object or array.
|
1236 |
+
|
1237 |
+
JSON->new->allow_nonref->encode ("Hello, World!")
|
1238 |
+
=> "Hello, World!"
|
1239 |
+
|
1240 |
+
=head2 allow_unknown
|
1241 |
+
|
1242 |
+
$json = $json->allow_unknown ([$enable])
|
1243 |
+
|
1244 |
+
$enabled = $json->get_allow_unknown
|
1245 |
+
|
1246 |
+
If C<$enable> is true (or missing), then C<encode> will I<not> throw an
|
1247 |
+
exception when it encounters values it cannot represent in JSON (for
|
1248 |
+
example, filehandles) but instead will encode a JSON C<null> value.
|
1249 |
+
Note that blessed objects are not included here and are handled
|
1250 |
+
separately by C<allow_blessed>.
|
1251 |
+
|
1252 |
+
If C<$enable> is false (the default), then C<encode> will throw an
|
1253 |
+
exception when it encounters anything it cannot encode as JSON.
|
1254 |
+
|
1255 |
+
This option does not affect C<decode> in any way, and it is
|
1256 |
+
recommended to leave it off unless you know your communications
|
1257 |
+
partner.
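The difference can be sketched with a code reference, which JSON cannot represent. The example assumes the JSON::PP backend; the data is made up.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

# A code reference has no JSON representation.
my $data = [ sub { 1 }, 'ok' ];

# With allow_unknown, the unrepresentable value is encoded as null.
my $lenient = JSON::PP->new->allow_unknown->encode($data);

# Without it, encode throws an exception, which is caught here.
my $error = eval { JSON::PP->new->encode($data); 1 } ? '' : $@;

print "lenient: $lenient\n";
print "strict encode failed\n" if $error;
```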
|
1258 |
+
|
1259 |
+
=head2 allow_blessed
|
1260 |
+
|
1261 |
+
$json = $json->allow_blessed([$enable])
|
1262 |
+
|
1263 |
+
$enabled = $json->get_allow_blessed
|
1264 |
+
|
1265 |
+
If C<$enable> is true (or missing), then the C<encode> method will not
|
1266 |
+
barf when it encounters a blessed reference. Instead, the value of the
|
1267 |
+
B<convert_blessed> option will decide whether C<null> (C<convert_blessed>
|
1268 |
+
disabled or no C<TO_JSON> method found) or a representation of the
|
1269 |
+
object (C<convert_blessed> enabled and C<TO_JSON> method found) is being
|
1270 |
+
encoded. Has no effect on C<decode>.
|
1271 |
+
|
1272 |
+
If C<$enable> is false (the default), then C<encode> will throw an
|
1273 |
+
exception when it encounters a blessed object.
|
1274 |
+
|
1275 |
+
|
1276 |
+
=head2 convert_blessed
|
1277 |
+
|
1278 |
+
$json = $json->convert_blessed([$enable])
|
1279 |
+
|
1280 |
+
$enabled = $json->get_convert_blessed
|
1281 |
+
|
1282 |
+
If C<$enable> is true (or missing), then C<encode>, upon encountering a
|
1283 |
+
blessed object, will check for the availability of the C<TO_JSON> method
|
1284 |
+
on the object's class. If found, it will be called in scalar context
|
1285 |
+
and the resulting scalar will be encoded instead of the object. If no
|
1286 |
+
C<TO_JSON> method is found, the value of C<allow_blessed> will decide what
|
1287 |
+
to do.
|
1288 |
+
|
1289 |
+
The C<TO_JSON> method may safely call die if it wants. If C<TO_JSON>
|
1290 |
+
returns other blessed objects, those will be handled in the same
|
1291 |
+
way. C<TO_JSON> must take care of not causing an endless recursion cycle
|
1292 |
+
(== crash) in this case. The name of C<TO_JSON> was chosen because other
|
1293 |
+
methods called by the Perl core (== not by the user of the object) are
|
1294 |
+
usually in upper case letters and to avoid collisions with the C<to_json>
|
1295 |
+
function or method.
|
1296 |
+
|
1297 |
+
This setting does not yet influence C<decode> in any way.
|
1298 |
+
|
1299 |
+
If C<$enable> is false, then the C<allow_blessed> setting will decide what
|
1300 |
+
to do when a blessed object is found.
|
1301 |
+
|
1302 |
+
=over
|
1303 |
+
|
1304 |
+
=item convert_blessed_universally mode
|
1305 |
+
|
1306 |
+
If you use C<JSON> with C<-convert_blessed_universally>, the C<UNIVERSAL::TO_JSON>
|
1307 |
+
subroutine is defined as follows:
|
1308 |
+
|
1309 |
+
*UNIVERSAL::TO_JSON = sub {
|
1310 |
+
my $b_obj = B::svref_2object( $_[0] );
|
1311 |
+
return $b_obj->isa('B::HV') ? { %{ $_[0] } }
|
1312 |
+
: $b_obj->isa('B::AV') ? [ @{ $_[0] } ]
|
1313 |
+
: undef
|
1314 |
+
;
|
1315 |
+
}
|
1316 |
+
|
1317 |
+
This causes the C<encode> method to convert simple blessed objects into
|
1318 |
+
JSON objects, as if they were non-blessed.
|
1319 |
+
|
1320 |
+
use JSON -convert_blessed_universally;
|
1321 |
+
$json->allow_blessed->convert_blessed->encode( $blessed_object )
|
1322 |
+
|
1323 |
+
This feature is experimental and may be removed in the future.
|
1324 |
+
|
1325 |
+
=back
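How C<allow_blessed> and C<convert_blessed> interact can be sketched with a small class. The C<Point> class is hypothetical and exists only for this illustration; JSON::PP is assumed as the backend.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

{
    # Hypothetical class used only for this illustration.
    package Point;
    sub new { my ( $class, %args ) = @_; return bless {%args}, $class; }

    # Called by encode when convert_blessed is enabled.
    sub TO_JSON { my ($self) = @_; return { x => $self->{x}, y => $self->{y} }; }
}

my $point = Point->new( x => 1, y => 2 );

# convert_blessed: the object is replaced by whatever TO_JSON returns.
my $converted = JSON::PP->new->canonical->convert_blessed->encode( [$point] );

# allow_blessed alone: the object is encoded as null.
my $nulled = JSON::PP->new->allow_blessed->encode( [$point] );

print "$converted\n$nulled\n";
```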
|
1326 |
+
|
1327 |
+
=head2 filter_json_object
|
1328 |
+
|
1329 |
+
$json = $json->filter_json_object([$coderef])
|
1330 |
+
|
1331 |
+
When C<$coderef> is specified, it will be called from C<decode> each
|
1332 |
+
time it decodes a JSON object. The only argument passed to the coderef
|
1333 |
+
is a reference to the newly-created hash. If the code reference returns
|
1334 |
+
a single scalar (which need not be a reference), this value
|
1335 |
+
(i.e. a copy of that scalar to avoid aliasing) is inserted into the
|
1336 |
+
deserialised data structure. If it returns an empty list
|
1337 |
+
(NOTE: I<not> C<undef>, which is a valid scalar), the original deserialised
|
1338 |
+
hash will be inserted. This setting can slow down decoding considerably.
|
1339 |
+
|
1340 |
+
When C<$coderef> is omitted or undefined, any existing callback will
|
1341 |
+
be removed and C<decode> will not change the deserialised hash in any
|
1342 |
+
way.
|
1343 |
+
|
1344 |
+
Example, convert all JSON objects into the integer 5:
|
1345 |
+
|
1346 |
+
my $js = JSON->new->filter_json_object (sub { 5 });
|
1347 |
+
# returns [5]
|
1348 |
+
$js->decode ('[{}]'); # the given subroutine takes a hash reference.
|
1349 |
+
# throw an exception because allow_nonref is not enabled
|
1350 |
+
# so a lone 5 is not allowed.
|
1351 |
+
$js->decode ('{"a":1, "b":2}');
|
1352 |
+
|
1353 |
+
|
1354 |
+
=head2 filter_json_single_key_object
|
1355 |
+
|
1356 |
+
$json = $json->filter_json_single_key_object($key [=> $coderef])
|
1357 |
+
|
1358 |
+
Works remotely similar to C<filter_json_object>, but is only called for
|
1359 |
+
JSON objects having a single key named C<$key>.
|
1360 |
+
|
1361 |
+
This C<$coderef> is called before the one specified via
|
1362 |
+
C<filter_json_object>, if any. It gets passed the single value in the JSON
|
1363 |
+
object. If it returns a single value, it will be inserted into the data
|
1364 |
+
structure. If it returns nothing (not even C<undef> but the empty list),
|
1365 |
+
the callback from C<filter_json_object> will be called next, as if no
|
1366 |
+
single-key callback were specified.
|
1367 |
+
|
1368 |
+
If C<$coderef> is omitted or undefined, the corresponding callback will be
|
1369 |
+
disabled. There can only ever be one callback for a given key.
|
1370 |
+
|
1371 |
+
As this callback gets called less often than the C<filter_json_object>
|
1372 |
+
one, decoding speed will not usually suffer as much. Therefore, single-key
|
1373 |
+
objects make excellent targets to serialise Perl objects into, especially
|
1374 |
+
as single-key JSON objects are as close to the type-tagged value concept
|
1375 |
+
as JSON gets (it's basically an ID/VALUE tuple). Of course, JSON does not
|
1376 |
+
support this in any way, so you need to make sure your data never looks
|
1377 |
+
like a serialised Perl hash.
|
1378 |
+
|
1379 |
+
Typical names for the single object key are C<__class_whatever__>, or
|
1380 |
+
C<$__dollars_are_rarely_used__$> or C<}ugly_brace_placement>, or even
|
1381 |
+
things like C<__class_md5sum(classname)__>, to reduce the risk of clashing
|
1382 |
+
with real hashes.
|
1383 |
+
|
1384 |
+
Example, decode JSON objects of the form C<< { "__widget__" => <id> } >>
|
1385 |
+
into the corresponding C<< $WIDGET{<id>} >> object:
|
1386 |
+
|
1387 |
+
# return whatever is in $WIDGET{5}:
|
1388 |
+
JSON
|
1389 |
+
->new
|
1390 |
+
->filter_json_single_key_object (__widget__ => sub {
|
1391 |
+
$WIDGET{ $_[0] }
|
1392 |
+
})
|
1393 |
+
->decode ('{"__widget__": 5}')
|
1394 |
+
|
1395 |
+
# this can be used with a TO_JSON method in some "widget" class
|
1396 |
+
# for serialisation to json:
|
1397 |
+
sub WidgetBase::TO_JSON {
|
1398 |
+
my ($self) = @_;
|
1399 |
+
|
1400 |
+
unless ($self->{id}) {
|
1401 |
+
$self->{id} = ..get..some..id..;
|
1402 |
+
$WIDGET{$self->{id}} = $self;
|
1403 |
+
}
|
1404 |
+
|
1405 |
+
{ __widget__ => $self->{id} }
|
1406 |
+
}
|
1407 |
+
|
1408 |
+
|
1409 |
+
=head2 shrink
|
1410 |
+
|
1411 |
+
$json = $json->shrink([$enable])
|
1412 |
+
|
1413 |
+
$enabled = $json->get_shrink
|
1414 |
+
|
1415 |
+
With JSON::XS, this flag resizes strings generated by either
|
1416 |
+
C<encode> or C<decode> to their minimum size possible. This can save
|
1417 |
+
memory when your JSON texts are either very very long or you have many
|
1418 |
+
short strings. It will also try to downgrade any strings to octet-form
|
1419 |
+
if possible: perl stores strings internally either in an encoding called
|
1420 |
+
UTF-X or in octet-form. The latter cannot store everything but uses less
|
1421 |
+
space in general (and some buggy Perl or C code might even rely on that
|
1422 |
+
internal representation being used).
|
1423 |
+
|
1424 |
+
With JSON::PP, this flag is a no-op with respect to resizing strings, but it tries
|
1425 |
+
to C<utf8::downgrade> the string returned by C<encode>. See L<utf8>.
|
1426 |
+
|
1427 |
+
See L<JSON::XS/OBJECT-ORIENTED INTERFACE> and L<JSON::PP/METHODS>.
|
1428 |
+
|
1429 |
+
=head2 max_depth
|
1430 |
+
|
1431 |
+
$json = $json->max_depth([$maximum_nesting_depth])
|
1432 |
+
|
1433 |
+
$max_depth = $json->get_max_depth
|
1434 |
+
|
1435 |
+
Sets the maximum nesting level (default C<512>) accepted while encoding
|
1436 |
+
or decoding. If a higher nesting level is detected in JSON text or a Perl
|
1437 |
+
data structure, then the encoder and decoder will stop and croak at that
|
1438 |
+
point.
|
1439 |
+
|
1440 |
+
Nesting level is defined by number of hash- or arrayrefs that the encoder
|
1441 |
+
needs to traverse to reach a given point or the number of C<{> or C<[>
|
1442 |
+
characters without their matching closing parenthesis crossed to reach a
|
1443 |
+
given character in a string.
|
1444 |
+
|
1445 |
+
If no argument is given, the highest possible setting will be used, which
|
1446 |
+
is rarely useful.
|
1447 |
+
|
1448 |
+
Note that nesting is implemented by recursion in C. The default value has
|
1449 |
+
been chosen to be as large as typical operating systems allow without
|
1450 |
+
crashing. (JSON::XS)
|
1451 |
+
|
1452 |
+
With JSON::PP as the backend, when a large value (100 or more) is set and
|
1453 |
+
a deeply nested object/text is de/encoded, Perl may raise the warning
|
1454 |
+
'Deep recursion on subroutine' at run time.
|
1455 |
+
|
1456 |
+
See L<JSON::XS/SECURITY CONSIDERATIONS> for more info on why this is useful.
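A sketch of the limit in action, assuming the JSON::PP backend and an artificially low limit of 2 chosen for the demonstration:

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

# An artificially low limit, chosen only to trigger the check.
my $json = JSON::PP->new->max_depth(2);

# Two levels of nesting are within the limit.
my $ok = $json->decode('[[1]]');

# Three levels exceed it, so decode croaks; the error is caught here.
my $error = eval { $json->decode('[[[1]]]'); 1 } ? '' : $@;

print "two levels decoded\n"  if $ok;
print "three levels refused\n" if $error;
```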
|
1457 |
+
|
1458 |
+
=head2 max_size
|
1459 |
+
|
1460 |
+
$json = $json->max_size([$maximum_string_size])
|
1461 |
+
|
1462 |
+
$max_size = $json->get_max_size
|
1463 |
+
|
1464 |
+
Set the maximum length a JSON text may have (in bytes) where decoding is
|
1465 |
+
being attempted. The default is C<0>, meaning no limit. When C<decode>
|
1466 |
+
is called on a string that is longer than this many bytes, it will not
|
1467 |
+
attempt to decode the string but throw an exception. This setting has no
|
1468 |
+
effect on C<encode> (yet).
|
1469 |
+
|
1470 |
+
If no argument is given, the limit check will be deactivated (same as when
|
1471 |
+
C<0> is specified).
|
1472 |
+
|
1473 |
+
See L<JSON::XS/SECURITY CONSIDERATIONS> for more info on why this is useful.
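The size check can be sketched in the same way. The limit of 16 bytes is arbitrary, and JSON::PP is assumed as the backend.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

# An arbitrary small limit, in bytes, for demonstration.
my $json = JSON::PP->new->max_size(16);

# A 7-byte text is accepted.
my $small = $json->decode('[1,2,3]');

# A text far above the limit is refused before any parsing is attempted.
my $big_text = '[' . join( ',', 1 .. 50 ) . ']';
my $error    = eval { $json->decode($big_text); 1 } ? '' : $@;

printf "small text gave %d elements\n", scalar @$small;
print "large text refused\n" if $error;
```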
|
1474 |
+
|
1475 |
+
=head2 encode
|
1476 |
+
|
1477 |
+
$json_text = $json->encode($perl_scalar)
|
1478 |
+
|
1479 |
+
Converts the given Perl data structure (a simple scalar or a reference
|
1480 |
+
to a hash or array) to its JSON representation. Simple scalars will be
|
1481 |
+
converted into JSON string or number sequences, while references to arrays
|
1482 |
+
become JSON arrays and references to hashes become JSON objects. Undefined
|
1483 |
+
Perl values (e.g. C<undef>) become JSON C<null> values.
|
1484 |
+
References to the integers C<0> and C<1> are converted into C<false> and C<true>, respectively.
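These mappings can be checked with a single call. The sketch assumes the JSON::PP backend and uses arbitrary values.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

# \1 and \0 become true and false, undef becomes null,
# and plain scalars become a string or a number.
my $json_text = JSON::PP->new->encode( [ \1, \0, undef, 'text', 42 ] );

print "$json_text\n";   # [true,false,null,"text",42]
```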
|
1485 |
+
|
1486 |
+
=head2 decode
|
1487 |
+
|
1488 |
+
$perl_scalar = $json->decode($json_text)
|
1489 |
+
|
1490 |
+
The opposite of C<encode>: expects a JSON text and tries to parse it,
|
1491 |
+
returning the resulting simple scalar or reference. Croaks on error.
|
1492 |
+
|
1493 |
+
JSON numbers and strings become simple Perl scalars. JSON arrays become
|
1494 |
+
Perl arrayrefs and JSON objects become Perl hashrefs. C<true> becomes
|
1495 |
+
C<1> (C<JSON::true>), C<false> becomes C<0> (C<JSON::false>) and
|
1496 |
+
C<null> becomes C<undef>.
|
1497 |
+
|
1498 |
+
=head2 decode_prefix
|
1499 |
+
|
1500 |
+
($perl_scalar, $characters) = $json->decode_prefix($json_text)
|
1501 |
+
|
1502 |
+
This works like the C<decode> method, but instead of raising an exception
|
1503 |
+
when there is trailing garbage after the first JSON object, it will
|
1504 |
+
silently stop parsing there and return the number of characters consumed
|
1505 |
+
so far.
|
1506 |
+
|
1507 |
+
JSON->new->decode_prefix ("[1] the tail")
|
1508 |
+
=> ([1], 3)
|
1509 |
+
|
1510 |
+
See L<JSON::XS/OBJECT-ORIENTED INTERFACE>
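The returned character count can be used to recover the unparsed tail, as in this sketch (JSON::PP backend assumed; the variable names are illustrative):

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

my $text = '[1] the tail';

# decode_prefix returns the decoded value and the number of characters consumed.
my ( $data, $consumed ) = JSON::PP->new->decode_prefix($text);

# Everything after the consumed prefix is left for the caller.
my $tail = substr( $text, $consumed );

print "consumed $consumed characters, tail is '$tail'\n";
```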
|
1511 |
+
|
1512 |
+
=head2 property
|
1513 |
+
|
1514 |
+
$boolean = $json->property($property_name)
|
1515 |
+
|
1516 |
+
Returns the boolean value of one of the properties described above.
|
1517 |
+
|
1518 |
+
The available properties are C<ascii>, C<latin1>, C<utf8>,
|
1519 |
+
C<indent>, C<space_before>, C<space_after>, C<relaxed>, C<canonical>,
|
1520 |
+
C<allow_nonref>, C<allow_unknown>, C<allow_blessed>, C<convert_blessed>,
|
1521 |
+
C<shrink>, C<max_depth> and C<max_size>.
|
1522 |
+
|
1523 |
+
$boolean = $json->property('utf8');
|
1524 |
+
=> 0
|
1525 |
+
$json->utf8;
|
1526 |
+
$boolean = $json->property('utf8');
|
1527 |
+
=> 1
|
1528 |
+
|
1529 |
+
Sets the property with a given boolean value.
|
1530 |
+
|
1531 |
+
$json = $json->property($property_name => $boolean);
|
1532 |
+
|
1533 |
+
With no argument, it returns all the above properties as a hash reference.
|
1534 |
+
|
1535 |
+
$flag_hashref = $json->property();
|
1536 |
+
|
1537 |
+
=head1 INCREMENTAL PARSING
|
1538 |
+
|
1539 |
+
Most of this section is copied and adapted from L<JSON::XS/INCREMENTAL PARSING>.
|
1540 |
+
|
1541 |
+
In some cases, there is the need for incremental parsing of JSON texts.
|
1542 |
+
This module does allow you to parse a JSON stream incrementally.
|
1543 |
+
It does so by accumulating text until it has a full JSON object, which
|
1544 |
+
it then can decode. This process is similar to using C<decode_prefix>
|
1545 |
+
to see if a full JSON object is available, but is much more efficient
|
1546 |
+
(and can be implemented with a minimum of method calls).
|
1547 |
+
|
1548 |
+
The backend module will only attempt to parse the JSON text once it is sure it
|
1549 |
+
has enough text to get a decisive result, using a very simple but
|
1550 |
+
truly incremental parser. This means that it sometimes won't stop as
|
1551 |
+
early as the full parser, for example, it doesn't detect parenthesis
|
1552 |
+
mismatches. The only thing it guarantees is that it starts decoding as
|
1553 |
+
soon as a syntactically valid JSON text has been seen. This means you need
|
1554 |
+
to set resource limits (e.g. C<max_size>) to ensure the parser will stop
|
1555 |
+
parsing in the presence of syntax errors.
|
1556 |
+
|
1557 |
+
The following methods implement this incremental parser.
|
1558 |
+
|
1559 |
+
=head2 incr_parse
|
1560 |
+
|
1561 |
+
$json->incr_parse( [$string] ) # void context
|
1562 |
+
|
1563 |
+
$obj_or_undef = $json->incr_parse( [$string] ) # scalar context
|
1564 |
+
|
1565 |
+
@obj_or_empty = $json->incr_parse( [$string] ) # list context
|
1566 |
+
|
1567 |
+
This is the central parsing function. It can both append new text and
|
1568 |
+
extract objects from the stream accumulated so far (both of these
|
1569 |
+
functions are optional).
|
1570 |
+
|
1571 |
+
If C<$string> is given, then this string is appended to the already
|
1572 |
+
existing JSON fragment stored in the C<$json> object.
|
1573 |
+
|
1574 |
+
After that, if the function is called in void context, it will simply
|
1575 |
+
return without doing anything further. This can be used to add more text
|
1576 |
+
in as many chunks as you want.
|
1577 |
+
|
1578 |
+
If the method is called in scalar context, then it will try to extract
|
1579 |
+
exactly I<one> JSON object. If that is successful, it will return this
|
1580 |
+
object, otherwise it will return C<undef>. If there is a parse error,
|
1581 |
+
this method will croak just as C<decode> would do (one can then use
|
1582 |
+
C<incr_skip> to skip the erroneous part). This is the most common way of
|
1583 |
+
using the method.
|
1584 |
+
|
1585 |
+
And finally, in list context, it will try to extract as many objects
|
1586 |
+
from the stream as it can find and return them, or the empty list
|
1587 |
+
otherwise. For this to work, there must be no separators between the JSON
|
1588 |
+
objects or arrays, instead they must be concatenated back-to-back. If
|
1589 |
+
an error occurs, an exception will be raised as in the scalar context
|
1590 |
+
case. Note that in this case, any previously-parsed JSON texts will be
|
1591 |
+
lost.
|
1592 |
+
|
1593 |
+
Example: Parse some JSON arrays/objects in a given string and return them.
|
1594 |
+
|
1595 |
+
my @objs = JSON->new->incr_parse ("[5][7][1,2]");
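The void and scalar contexts are often combined when text arrives in pieces. The sketch below assumes the JSON::PP backend and uses hypothetical chunks that split two objects at arbitrary points.

```perl
use strict;
use warnings;
use JSON::PP;   # assumption: JSON::PP backend

my $json = JSON::PP->new;

# Hypothetical chunks, split at arbitrary points as a socket read might do.
my @chunks = ( '{"a":', '1}{"b"', ':2}' );

my @objects;
for my $chunk (@chunks) {
    $json->incr_parse($chunk);                 # void context: only append
    while ( my $obj = $json->incr_parse ) {    # scalar context: extract one
        push @objects, $obj;
    }
}

printf "parsed %d objects\n", scalar @objects;
```

Each object is returned as soon as its closing brace has been seen, regardless of where the chunk boundaries fall.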
|
1596 |
+
|
1597 |
+
=head2 incr_text
|
1598 |
+
|
1599 |
+
$lvalue_string = $json->incr_text
|
1600 |
+
|
1601 |
+
This method returns the currently stored JSON fragment as an lvalue, that
|
1602 |
+
is, you can manipulate it. This I<only> works when a preceding call to
|
1603 |
+
C<incr_parse> in I<scalar context> successfully returned an object. Under
|
1604 |
+
all other circumstances you must not call this function (I mean it.
|
1605 |
+
although in simple tests it might actually work, it I<will> fail under
|
1606 |
+
real world conditions). As a special exception, you can also call this
|
1607 |
+
method before having parsed anything.
|
1608 |
+
|
1609 |
+
This function is useful in two cases: a) finding the trailing text after a
|
1610 |
+
JSON object or b) parsing multiple JSON objects separated by non-JSON text
|
1611 |
+
(such as commas).
|
1612 |
+
|
1613 |
+
$json->incr_text =~ s/\s*,\s*//;
|
1614 |
+
|
1615 |
+
In Perl 5.005, the C<lvalue> attribute is not available,
so you must write code like the following:
|
1617 |
+
|
1618 |
+
$string = $json->incr_text;
|
1619 |
+
$string =~ s/\s*,\s*//;
|
1620 |
+
$json->incr_text( $string );
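Case b) above, several JSON texts separated by commas, might look like the following. It is a minimal sketch that assumes C<JSON::PP> is used directly and that the input string is a made-up example; C<incr_text> is only touched right after a successful scalar-context C<incr_parse>, as the text requires.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new;

# Hypothetical input: JSON arrays separated by commas (not valid JSON as a whole).
$json->incr_parse('[1], [2] ,[3]');

my @items;
while (defined(my $item = $json->incr_parse)) {
    push @items, $item;

    # Allowed here: the preceding scalar-context call returned an object.
    # Strip the separator so the next call starts at the next array.
    $json->incr_text =~ s/^\s*,\s*//;
}

printf "%d items parsed\n", scalar @items;
```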
|
1621 |
+
|
1622 |
+
=head2 incr_skip
|
1623 |
+
|
1624 |
+
$json->incr_skip
|
1625 |
+
|
1626 |
+
This will reset the state of the incremental parser and will remove the
|
1627 |
+
parsed text from the input buffer. This is useful after C<incr_parse>
|
1628 |
+
died, in which case the input buffer and incremental parser state are left
|
1629 |
+
unchanged, to skip the text parsed so far and to reset the parse state.
|
1630 |
+
|
1631 |
+
=head2 incr_reset
|
1632 |
+
|
1633 |
+
$json->incr_reset
|
1634 |
+
|
1635 |
+
This completely resets the incremental parser, that is, after this call,
|
1636 |
+
it will be as if the parser had never parsed anything.
|
1637 |
+
|
1638 |
+
This is useful if you want to repeatedly parse JSON objects and want to
|
1639 |
+
ignore any trailing data, which means you have to reset the parser after
|
1640 |
+
each successful decode.
|
1641 |
+
|
1642 |
+
See L<JSON::XS/INCREMENTAL PARSING> for examples.
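As a rough illustration of resetting after each successful decode, the sketch below parses one object, discards whatever follows it, and then parses the next. It assumes C<JSON::PP> is called directly, and the input strings are invented for the example.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new;

# Hypothetical record with unwanted text after the first object.
my $first = $json->incr_parse('{"id":1} trailing junk');

# Forget the leftover text so it cannot interfere with the next parse.
$json->incr_reset;

my $second = $json->incr_parse('{"id":2}');

print "$first->{id} $second->{id}\n";
```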
|
1643 |
+
|
1644 |
+
|
1645 |
+
=head1 JSON::PP SUPPORT METHODS
|
1646 |
+
|
1647 |
+
The methods below are JSON::PP's own methods, so they are available when C<JSON>
works with JSON::PP (i.e. the created object is a JSON::PP object).
See L<JSON::PP/JSON::PP OWN METHODS> for details.
|
1650 |
+
|
1651 |
+
If you use C<JSON> with the additional C<-support_by_pp> option, some of these methods
are available even with JSON::XS. See L<USE PP FEATURES EVEN THOUGH XS BACKEND>.
|
1653 |
+
|
1654 |
+
BEGIN { $ENV{PERL_JSON_BACKEND} = 'JSON::XS' }
|
1655 |
+
|
1656 |
+
use JSON -support_by_pp;
|
1657 |
+
|
1658 |
+
my $json = JSON->new;
|
1659 |
+
$json->allow_nonref->escape_slash->encode("/");
|
1660 |
+
|
1661 |
+
# functional interfaces too.
|
1662 |
+
print to_json(["/"], {escape_slash => 1});
|
1663 |
+
print from_json('["foo"]', {utf8 => 1});
|
1664 |
+
|
1665 |
+
If you want C<-support_by_pp> but do not want any functions exported,
use C<-no_export>.
|
1667 |
+
|
1668 |
+
use JSON -support_by_pp, -no_export;
|
1669 |
+
# functional interfaces are not exported.
|
1670 |
+
|
1671 |
+
=head2 allow_singlequote
|
1672 |
+
|
1673 |
+
$json = $json->allow_singlequote([$enable])
|
1674 |
+
|
1675 |
+
If C<$enable> is true (or missing), then C<decode> will accept
|
1676 |
+
JSON strings quoted with single quotes, which are invalid in standard JSON.
|
1678 |
+
|
1679 |
+
$json->allow_singlequote->decode(q|{"foo":'bar'}|);
$json->allow_singlequote->decode(q|{'foo':"bar"}|);
$json->allow_singlequote->decode(q|{'foo':'bar'}|);
|
1682 |
+
|
1683 |
+
As with the C<relaxed> option, this option may be used to parse
|
1684 |
+
application-specific files written by humans.
|
1685 |
+
|
1686 |
+
=head2 allow_barekey
|
1687 |
+
|
1688 |
+
$json = $json->allow_barekey([$enable])
|
1689 |
+
|
1690 |
+
If C<$enable> is true (or missing), then C<decode> will accept
|
1691 |
+
bare (unquoted) keys in JSON objects, which are invalid in standard JSON.
|
1692 |
+
|
1693 |
+
As with the C<relaxed> option, this option may be used to parse
|
1694 |
+
application-specific files written by humans.
|
1695 |
+
|
1696 |
+
$json->allow_barekey->decode('{foo:"bar"}');
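The two relaxations can also be combined. The following is a small sketch rather than a recommended configuration: it assumes C<JSON::PP> is used directly, and the hand-written input string is hypothetical.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new->allow_singlequote->allow_barekey;

# Hypothetical hand-written text: a bare key, a single-quoted key
# and a single-quoted value, none of which is valid standard JSON.
my $cfg = $json->decode(q|{name:'demo', 'level':"high"}|);

print "$cfg->{name} / $cfg->{level}\n";
```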
|
1697 |
+
|
1698 |
+
=head2 allow_bignum
|
1699 |
+
|
1700 |
+
$json = $json->allow_bignum([$enable])
|
1701 |
+
|
1702 |
+
If C<$enable> is true (or missing), then C<decode> will convert
|
1703 |
+
any integer too big for Perl to handle as an integer into a L<Math::BigInt>
object and convert any floating-point number into a L<Math::BigFloat> object.
|
1705 |
+
|
1706 |
+
Conversely, C<encode> converts C<Math::BigInt> objects and C<Math::BigFloat>
objects into JSON numbers when C<allow_blessed> is enabled.
|
1708 |
+
|
1709 |
+
$json->allow_nonref->allow_blessed->allow_bignum;
|
1710 |
+
$bigfloat = $json->decode('2.000000000000000000000000001');
|
1711 |
+
print $json->encode($bigfloat);
|
1712 |
+
# => 2.000000000000000000000000001
|
1713 |
+
|
1714 |
+
See L<MAPPING> for the conversion of JSON numbers.
|
1715 |
+
|
1716 |
+
=head2 loose
|
1717 |
+
|
1718 |
+
$json = $json->loose([$enable])
|
1719 |
+
|
1720 |
+
Unescaped [\x00-\x1f\x22\x2f\x5c] characters are invalid in JSON strings
and by default the module refuses to C<decode> them (except for \x2f).
If C<$enable> is true (or missing), then C<decode> will accept these
unescaped characters.
|
1724 |
+
|
1725 |
+
$json->loose->decode(qq|["abc
|
1726 |
+
def"]|);
|
1727 |
+
|
1728 |
+
See L<JSON::PP/JSON::PP OWN METHODS>.
|
1729 |
+
|
1730 |
+
=head2 escape_slash
|
1731 |
+
|
1732 |
+
$json = $json->escape_slash([$enable])
|
1733 |
+
|
1734 |
+
The JSON grammar allows I<slash> (U+002F) to be escaped, but by default
the JSON backend modules encode strings without escaping slashes.
|
1736 |
+
|
1737 |
+
If C<$enable> is true (or missing), then C<encode> will escape slashes.
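The difference is easiest to see side by side. This sketch assumes C<JSON::PP> is called directly; C<allow_nonref> is added only so that a bare string can be encoded on older versions.

```perl
use strict;
use warnings;
use JSON::PP;

my $plain   = JSON::PP->new->allow_nonref->encode('a/b');
my $escaped = JSON::PP->new->allow_nonref->escape_slash->encode('a/b');

print "$plain $escaped\n";   # "a/b" "a\/b"
```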
|
1738 |
+
|
1739 |
+
=head2 indent_length
|
1740 |
+
|
1741 |
+
$json = $json->indent_length($length)
|
1742 |
+
|
1743 |
+
With JSON::XS, the indent space length is 3 and cannot be changed.
With JSON::PP, this method sets the indent space length to the given C<$length>.
The default is 3. The acceptable range is 0 to 15.
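A brief sketch of a non-default indent width, assuming C<JSON::PP> is used directly (with JSON::XS the width stays fixed, as noted above); the data structure is arbitrary.

```perl
use strict;
use warnings;
use JSON::PP;

# indent turns on multi-line output; indent_length(2) sets two
# spaces per nesting level instead of the default three.
my $pretty = JSON::PP->new->indent->indent_length(2)->encode({ k => [1] });

print $pretty;
```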
|
1746 |
+
|
1747 |
+
=head2 sort_by
|
1748 |
+
|
1749 |
+
$json = $json->sort_by($function_name)
|
1750 |
+
$json = $json->sort_by($subroutine_ref)
|
1751 |
+
|
1752 |
+
If C<$function_name> or C<$subroutine_ref> is set, that sort routine is used.
|
1753 |
+
|
1754 |
+
$js = $pc->sort_by(sub { $JSON::PP::a cmp $JSON::PP::b })->encode($obj);
|
1755 |
+
# is($js, q|{"a":1,"b":2,"c":3,"d":4,"e":5,"f":6,"g":7,"h":8,"i":9}|);
|
1756 |
+
|
1757 |
+
$js = $pc->sort_by('own_sort')->encode($obj);
|
1758 |
+
# is($js, q|{"a":1,"b":2,"c":3,"d":4,"e":5,"f":6,"g":7,"h":8,"i":9}|);
|
1759 |
+
|
1760 |
+
sub JSON::PP::own_sort { $JSON::PP::a cmp $JSON::PP::b }
|
1761 |
+
|
1762 |
+
As the sorting routine runs in the JSON::PP scope, the given
|
1763 |
+
subroutine name and the special variables C<$a>, C<$b> will begin
|
1764 |
+
with 'JSON::PP::'.
|
1765 |
+
|
1766 |
+
If $integer is set, then the effect is the same as turning C<canonical> on.
|
1767 |
+
|
1768 |
+
See L<JSON::PP/JSON::PP OWN METHODS>.
|
1769 |
+
|
1770 |
+
=head1 MAPPING
|
1771 |
+
|
1772 |
+
This section is copied from JSON::XS and modified to C<JSON>.
|
1773 |
+
JSON::XS and JSON::PP mapping mechanisms are almost equivalent.
|
1774 |
+
|
1775 |
+
See L<JSON::XS/MAPPING>.
|
1776 |
+
|
1777 |
+
=head2 JSON -> PERL
|
1778 |
+
|
1779 |
+
=over 4
|
1780 |
+
|
1781 |
+
=item object
|
1782 |
+
|
1783 |
+
A JSON object becomes a reference to a hash in Perl. No ordering of object
|
1784 |
+
keys is preserved (JSON does not preserve object key ordering itself).
|
1785 |
+
|
1786 |
+
=item array
|
1787 |
+
|
1788 |
+
A JSON array becomes a reference to an array in Perl.
|
1789 |
+
|
1790 |
+
=item string
|
1791 |
+
|
1792 |
+
A JSON string becomes a string scalar in Perl - Unicode codepoints in JSON
|
1793 |
+
are represented by the same codepoints in the Perl string, so no manual
|
1794 |
+
decoding is necessary.
|
1795 |
+
|
1796 |
+
=item number
|
1797 |
+
|
1798 |
+
A JSON number becomes either an integer, numeric (floating point) or
|
1799 |
+
string scalar in perl, depending on its range and any fractional parts. On
|
1800 |
+
the Perl level, there is no difference between those as Perl handles all
|
1801 |
+
the conversion details, but an integer may take slightly less memory and
|
1802 |
+
might represent more values exactly than floating point numbers.
|
1803 |
+
|
1804 |
+
If the number consists of digits only, C<JSON> will try to represent
|
1805 |
+
it as an integer value. If that fails, it will try to represent it as
|
1806 |
+
a numeric (floating point) value if that is possible without loss of
|
1807 |
+
precision. Otherwise it will preserve the number as a string value (in
|
1808 |
+
which case you lose roundtripping ability, as the JSON number will be
|
1809 |
+
re-encoded to a JSON string).
|
1810 |
+
|
1811 |
+
Numbers containing a fractional or exponential part will always be
|
1812 |
+
represented as numeric (floating point) values, possibly at a loss of
|
1813 |
+
precision (in which case you might lose perfect roundtripping ability, but
|
1814 |
+
the JSON number will still be re-encoded as a JSON number).
|
1815 |
+
|
1816 |
+
Note that precision is not accuracy - binary floating point values cannot
|
1817 |
+
represent most decimal fractions exactly, and when converting from and to
|
1818 |
+
floating point, C<JSON> only guarantees precision up to but not including
|
1819 |
+
the least significant bit.
|
1820 |
+
|
1821 |
+
If the backend is JSON::PP and C<allow_bignum> is enabled, big integers
and floating-point numbers can optionally be converted into L<Math::BigInt> and
L<Math::BigFloat> objects.
|
1824 |
+
|
1825 |
+
=item true, false
|
1826 |
+
|
1827 |
+
These JSON atoms become C<JSON::true> and C<JSON::false>,
|
1828 |
+
respectively. They are overloaded to act almost exactly like the numbers
|
1829 |
+
C<1> and C<0>. You can check whether a scalar is a JSON boolean by using
|
1830 |
+
the C<JSON::is_bool> function.
|
1831 |
+
|
1832 |
+
print JSON::true + 1;
|
1833 |
+
=> 2
|
1834 |
+
|
1835 |
+
ok(JSON::true eq '1');
|
1836 |
+
ok(JSON::true == 1);
|
1837 |
+
|
1838 |
+
C<JSON> will install these missing overloading features to the backend modules.
|
1839 |
+
|
1840 |
+
|
1841 |
+
=item null
|
1842 |
+
|
1843 |
+
A JSON null atom becomes C<undef> in Perl.
|
1844 |
+
|
1845 |
+
C<JSON::null> returns C<undef>.
|
1846 |
+
|
1847 |
+
=back
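The mapping listed above can be checked with a short script. This is a sketch under the assumption that C<JSON::PP> is used directly (so C<JSON::PP::is_bool> stands in for C<JSON::is_bool>); the JSON text is invented for the example.

```perl
use strict;
use warnings;
use JSON::PP;

my $data = JSON::PP->new->decode(
    '{"obj":{"k":"v"},"arr":[1,2],"str":"x","num":1.5,"yes":true,"nil":null}'
);

my $obj_type = ref $data->{obj};                          # hash reference
my $arr_type = ref $data->{arr};                          # array reference
my $is_bool  = JSON::PP::is_bool($data->{yes}) ? 1 : 0;   # boolean object
my $is_undef = defined $data->{nil} ? 0 : 1;              # null becomes undef

print "$obj_type $arr_type bool=$is_bool undef=$is_undef\n";
```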
|
1848 |
+
|
1849 |
+
|
1850 |
+
=head2 PERL -> JSON
|
1851 |
+
|
1852 |
+
The mapping from Perl to JSON is slightly more difficult, as Perl is a
|
1853 |
+
truly typeless language, so we can only guess which JSON type is meant by
|
1854 |
+
a Perl value.
|
1855 |
+
|
1856 |
+
=over 4
|
1857 |
+
|
1858 |
+
=item hash references
|
1859 |
+
|
1860 |
+
Perl hash references become JSON objects. As there is no inherent ordering
|
1861 |
+
in hash keys (or JSON objects), they will usually be encoded in a
|
1862 |
+
pseudo-random order that can change between runs of the same program but
|
1863 |
+
stays generally the same within a single run of a program. C<JSON>
|
1864 |
+
can optionally sort the hash keys (determined by the I<canonical> flag), so
|
1865 |
+
the same data structure will serialise to the same JSON text (given same
|
1866 |
+
settings and version of JSON::XS), but this incurs a runtime overhead
|
1867 |
+
and is only rarely useful, e.g. when you want to compare some JSON text
|
1868 |
+
against another for equality.
|
1869 |
+
|
1870 |
+
In the future, an ordered object feature will be added to JSON::PP using the C<tie> mechanism.
|
1871 |
+
|
1872 |
+
|
1873 |
+
=item array references
|
1874 |
+
|
1875 |
+
Perl array references become JSON arrays.
|
1876 |
+
|
1877 |
+
=item other references
|
1878 |
+
|
1879 |
+
Other unblessed references are generally not allowed and will cause an
|
1880 |
+
exception to be thrown, except for references to the integers C<0> and
|
1881 |
+
C<1>, which get turned into C<false> and C<true> atoms in JSON. You can
|
1882 |
+
also use C<JSON::false> and C<JSON::true> to improve readability.
|
1883 |
+
|
1884 |
+
to_json [\0,JSON::true] # yields [false,true]
|
1885 |
+
|
1886 |
+
=item JSON::true, JSON::false, JSON::null
|
1887 |
+
|
1888 |
+
These special values become JSON true and JSON false values,
|
1889 |
+
respectively. You can also use C<\1> and C<\0> directly if you want.
|
1890 |
+
|
1891 |
+
JSON::null returns C<undef>.
|
1892 |
+
|
1893 |
+
=item blessed objects
|
1894 |
+
|
1895 |
+
Blessed objects are not directly representable in JSON. See the
C<allow_blessed> and C<convert_blessed> methods for the various options on
|
1897 |
+
how to deal with this: basically, you can choose between throwing an
|
1898 |
+
exception, encoding the reference as if it weren't blessed, or provide
|
1899 |
+
your own serialiser method.
|
1900 |
+
|
1901 |
+
With C<convert_blessed_universally> mode, C<encode> converts blessed
hash references or blessed array references (which may contain other blessed
references) into JSON objects and arrays.
|
1904 |
+
|
1905 |
+
use JSON -convert_blessed_universally;
|
1906 |
+
JSON->new->allow_blessed->convert_blessed->encode( $blessed_object );
|
1907 |
+
|
1908 |
+
See L<convert_blessed>.
|
1909 |
+
|
1910 |
+
=item simple scalars
|
1911 |
+
|
1912 |
+
Simple Perl scalars (any scalar that is not a reference) are the most
|
1913 |
+
difficult objects to encode: JSON::XS and JSON::PP will encode undefined scalars as
|
1914 |
+
JSON C<null> values, scalars that have last been used in a string context
|
1915 |
+
before encoding as JSON strings, and anything else as number value:
|
1916 |
+
|
1917 |
+
# dump as number
|
1918 |
+
encode_json [2] # yields [2]
|
1919 |
+
encode_json [-3.0e17] # yields [-3e+17]
|
1920 |
+
my $value = 5; encode_json [$value] # yields [5]
|
1921 |
+
|
1922 |
+
# used as string, so dump as string
|
1923 |
+
print $value;
|
1924 |
+
encode_json [$value] # yields ["5"]
|
1925 |
+
|
1926 |
+
# undef becomes null
|
1927 |
+
encode_json [undef] # yields [null]
|
1928 |
+
|
1929 |
+
You can force the type to be a string by stringifying it:
|
1930 |
+
|
1931 |
+
my $x = 3.1; # some variable containing a number
|
1932 |
+
"$x"; # stringified
|
1933 |
+
$x .= ""; # another, more awkward way to stringify
|
1934 |
+
print $x; # perl does it for you, too, quite often
|
1935 |
+
|
1936 |
+
You can force the type to be a number by numifying it:
|
1937 |
+
|
1938 |
+
my $x = "3"; # some variable containing a string
|
1939 |
+
$x += 0; # numify it, ensuring it will be dumped as a number
|
1940 |
+
$x *= 1; # same thing, the choice is yours.
|
1941 |
+
|
1942 |
+
You can not currently force the type in other, less obscure, ways.
|
1943 |
+
|
1944 |
+
Note that numerical precision has the same meaning as under Perl (so
|
1945 |
+
binary to decimal conversion follows the same rules as in Perl, which
|
1946 |
+
can differ from other languages). Also, your perl interpreter might expose
|
1947 |
+
extensions to the floating point numbers of your platform, such as
|
1948 |
+
infinities or NaN's - these cannot be represented in JSON, and it is an
|
1949 |
+
error to pass those in.
|
1950 |
+
|
1951 |
+
=item Big Number
|
1952 |
+
|
1953 |
+
If the backend is JSON::PP and C<allow_bignum> is enabled,
|
1954 |
+
C<encode> converts C<Math::BigInt> objects and C<Math::BigFloat>
|
1955 |
+
objects into JSON numbers.
|
1956 |
+
|
1957 |
+
|
1958 |
+
=back
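Several of the rules above can be seen in one call. The sketch below assumes C<JSON::PP> is used directly and enables C<canonical> only to make the key order, and therefore the output, predictable; the data is arbitrary.

```perl
use strict;
use warnings;
use JSON::PP;

my $json = JSON::PP->new->canonical;

my $text = $json->encode({
    b    => \1,           # reference to 1 becomes true
    a    => \0,           # reference to 0 becomes false
    n    => undef,        # undef becomes null
    list => [ 1, "1" ],   # numeric literal vs. string literal
});

print "$text\n";   # {"a":false,"b":true,"list":[1,"1"],"n":null}
```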
|
1959 |
+
|
1960 |
+
=head1 JSON and ECMAscript
|
1961 |
+
|
1962 |
+
See L<JSON::XS/JSON and ECMAscript>.
|
1963 |
+
|
1964 |
+
=head1 JSON and YAML
|
1965 |
+
|
1966 |
+
JSON is not a subset of YAML.
|
1967 |
+
See L<JSON::XS/JSON and YAML>.
|
1968 |
+
|
1969 |
+
|
1970 |
+
=head1 BACKEND MODULE DECISION
|
1971 |
+
|
1972 |
+
When you use C<JSON>, C<JSON> tries to C<use> JSON::XS. If this call fails, it will
C<use> JSON::PP. The required JSON::XS version is I<2.2> or later.
|
1974 |
+
|
1975 |
+
The C<JSON> constructor method returns an object inherited from the backend module,
|
1976 |
+
and a JSON::XS object is a blessed scalar reference while a JSON::PP object is a blessed hash
|
1977 |
+
reference.
|
1978 |
+
|
1979 |
+
So your program should not depend on the backend module; in particular,
returned objects should not be modified.
|
1981 |
+
|
1982 |
+
my $json = JSON->new; # XS or PP?
|
1983 |
+
$json->{stash} = 'this is xs object'; # this code may raise an error!
|
1984 |
+
|
1985 |
+
To check the backend module, there are some methods - C<backend>, C<is_pp> and C<is_xs>.
|
1986 |
+
|
1987 |
+
JSON->backend; # 'JSON::XS' or 'JSON::PP'
|
1988 |
+
|
1989 |
+
JSON->backend->is_pp; # 0 or 1
|
1990 |
+
|
1991 |
+
JSON->backend->is_xs; # 1 or 0
|
1992 |
+
|
1993 |
+
$json->is_xs; # 1 or 0
|
1994 |
+
|
1995 |
+
$json->is_pp; # 0 or 1
|
1996 |
+
|
1997 |
+
|
1998 |
+
If you set the environment variable C<PERL_JSON_BACKEND>, the backend selection changes as follows.
|
1999 |
+
|
2000 |
+
=over
|
2001 |
+
|
2002 |
+
=item PERL_JSON_BACKEND = 0 or PERL_JSON_BACKEND = 'JSON::PP'
|
2003 |
+
|
2004 |
+
Always use JSON::PP
|
2005 |
+
|
2006 |
+
=item PERL_JSON_BACKEND == 1 or PERL_JSON_BACKEND = 'JSON::XS,JSON::PP'
|
2007 |
+
|
2008 |
+
(The default) Use compiled JSON::XS if it is properly compiled & installed,
|
2009 |
+
otherwise use JSON::PP.
|
2010 |
+
|
2011 |
+
=item PERL_JSON_BACKEND == 2 or PERL_JSON_BACKEND = 'JSON::XS'
|
2012 |
+
|
2013 |
+
Always use compiled JSON::XS, die if it isn't properly compiled & installed.
|
2014 |
+
|
2015 |
+
=item PERL_JSON_BACKEND = 'JSON::backportPP'
|
2016 |
+
|
2017 |
+
Always use JSON::backportPP.
|
2018 |
+
JSON::backportPP is a backport of JSON::PP.
|
2019 |
+
C<JSON> includes JSON::backportPP instead of JSON::PP.
|
2020 |
+
|
2021 |
+
=back
|
2022 |
+
|
2023 |
+
These ideas come from the L<DBI::PurePerl> mechanism.
|
2024 |
+
|
2025 |
+
example:
|
2026 |
+
|
2027 |
+
BEGIN { $ENV{PERL_JSON_BACKEND} = 'JSON::PP' }
|
2028 |
+
use JSON; # always uses JSON::PP
|
2029 |
+
|
2030 |
+
In the future, it may become possible to specify another module.
|
2031 |
+
|
2032 |
+
=head1 USE PP FEATURES EVEN THOUGH XS BACKEND
|
2033 |
+
|
2034 |
+
Many methods are available with either JSON::XS or JSON::PP. When the backend
module is JSON::XS and a JSON::PP-specific (i.e. unsupported by JSON::XS)
method is called, it will C<warn> and be a no-op.
|
2037 |
+
|
2038 |
+
But if you C<use> C<JSON> with the optional string C<-support_by_pp>,
some of those unsupported methods become available.
|
2040 |
+
This feature is achieved by using JSON::PP in C<de/encode>.
|
2041 |
+
|
2042 |
+
BEGIN { $ENV{PERL_JSON_BACKEND} = 2 } # with JSON::XS
|
2043 |
+
use JSON -support_by_pp;
|
2044 |
+
my $json = JSON->new;
|
2045 |
+
$json->allow_nonref->escape_slash->encode("/");
|
2046 |
+
|
2047 |
+
At this time, the returned object is a C<JSON::Backend::XS::Supportable>
|
2048 |
+
object (re-blessed XS object), and by checking JSON::XS unsupported flags
|
2049 |
+
in de/encoding, can support some unsupported methods - C<loose>, C<allow_bignum>,
|
2050 |
+
C<allow_barekey>, C<allow_singlequote>, C<escape_slash> and C<indent_length>.
|
2051 |
+
|
2052 |
+
When none of the unsupported methods is enabled, C<XS de/encode> will be
used as is. The switch is achieved by changing the symbol tables.
|
2054 |
+
|
2055 |
+
C<-support_by_pp> is effective only when the backend module is JSON::XS
|
2056 |
+
and it slows de/encoding down a bit.
|
2057 |
+
|
2058 |
+
See L<JSON::PP SUPPORT METHODS>.
|
2059 |
+
|
2060 |
+
=head1 INCOMPATIBLE CHANGES TO OLD VERSION
|
2061 |
+
|
2062 |
+
There are big incompatibilities between the new version (2.00) and the old one (1.xx).
If you use the old C<JSON> 1.xx in your code, please check it.
|
2064 |
+
|
2065 |
+
See L<Transition ways from 1.xx to 2.xx.>
|
2066 |
+
|
2067 |
+
=over
|
2068 |
+
|
2069 |
+
=item jsonToObj and objToJson are obsoleted.
|
2070 |
+
|
2071 |
+
The non-Perl-style names C<jsonToObj> and C<objToJson> are obsolete
|
2072 |
+
(but not yet deleted from the source).
|
2073 |
+
If you use these functions in your code, please replace them
|
2074 |
+
with C<from_json> and C<to_json>.
|
2075 |
+
|
2076 |
+
|
2077 |
+
=item Global variables are no longer available.
|
2078 |
+
|
2079 |
+
C<JSON> class variables - C<$JSON::AUTOCONVERT>, C<$JSON::BareKey>, etc...
|
2080 |
+
- are not available any longer.
|
2081 |
+
Instead, various features can be used through object methods.
|
2082 |
+
|
2083 |
+
|
2084 |
+
=item Package JSON::Converter and JSON::Parser are deleted.
|
2085 |
+
|
2086 |
+
C<JSON> now bundles JSON::PP, which handles JSON more properly than they did.
|
2087 |
+
|
2088 |
+
=item Package JSON::NotString is deleted.
|
2089 |
+
|
2090 |
+
There was a C<JSON::NotString> class which represented the JSON values C<true>, C<false>, C<null>
and numbers. It was deleted and replaced by C<JSON::Boolean>.
|
2092 |
+
|
2093 |
+
C<JSON::Boolean> represents C<true> and C<false>.
|
2094 |
+
|
2095 |
+
C<JSON::Boolean> does not represent C<null>.
|
2096 |
+
|
2097 |
+
C<JSON::null> returns C<undef>.
|
2098 |
+
|
2099 |
+
C<JSON> makes L<JSON::XS::Boolean> and L<JSON::PP::Boolean> is-a relation
|
2100 |
+
to L<JSON::Boolean>.
|
2101 |
+
|
2102 |
+
=item function JSON::Number is obsoleted.
|
2103 |
+
|
2104 |
+
C<JSON::Number> is now needless because JSON::XS and JSON::PP have
|
2105 |
+
round-trip integrity.
|
2106 |
+
|
2107 |
+
=item JSONRPC modules are deleted.
|
2108 |
+
|
2109 |
+
The Perl implementations of the JSON-RPC protocol, C<JSONRPC>, C<JSONRPC::Transport::HTTP>
and C<Apache::JSONRPC>, have been removed from this distribution.
Use L<JSON::RPC> instead, which supports JSON-RPC protocol version 1.1.
|
2112 |
+
|
2113 |
+
=back
|
2114 |
+
|
2115 |
+
=head2 Transition ways from 1.xx to 2.xx.
|
2116 |
+
|
2117 |
+
You should set C<-support_by_pp> mode first, because with it
the code below always works, even with JSON::XS.
|
2119 |
+
|
2120 |
+
use JSON -support_by_pp;
|
2121 |
+
|
2122 |
+
=over
|
2123 |
+
|
2124 |
+
=item Exported jsonToObj (simple)
|
2125 |
+
|
2126 |
+
from_json($json_text);
|
2127 |
+
|
2128 |
+
=item Exported objToJson (simple)
|
2129 |
+
|
2130 |
+
to_json($perl_scalar);
|
2131 |
+
|
2132 |
+
=item Exported jsonToObj (advanced)
|
2133 |
+
|
2134 |
+
$flags = {allow_barekey => 1, allow_singlequote => 1};
|
2135 |
+
from_json($json_text, $flags);
|
2136 |
+
|
2137 |
+
equivalent to:
|
2138 |
+
|
2139 |
+
$JSON::BareKey = 1;
|
2140 |
+
$JSON::QuotApos = 1;
|
2141 |
+
jsonToObj($json_text);
|
2142 |
+
|
2143 |
+
=item Exported objToJson (advanced)
|
2144 |
+
|
2145 |
+
$flags = {allow_blessed => 1, allow_barekey => 1};
|
2146 |
+
to_json($perl_scalar, $flags);
|
2147 |
+
|
2148 |
+
equivalent to:
|
2149 |
+
|
2150 |
+
$JSON::BareKey = 1;
|
2151 |
+
objToJson($perl_scalar);
|
2152 |
+
|
2153 |
+
=item jsonToObj as object method
|
2154 |
+
|
2155 |
+
$json->decode($json_text);
|
2156 |
+
|
2157 |
+
=item objToJson as object method
|
2158 |
+
|
2159 |
+
$json->encode($perl_scalar);
|
2160 |
+
|
2161 |
+
=item new method with parameters
|
2162 |
+
|
2163 |
+
The C<new> method in 2.x no longer takes any parameters.
Set options through methods instead:
|
2165 |
+
|
2166 |
+
$json = JSON->new->pretty;
|
2167 |
+
|
2168 |
+
=item $JSON::Pretty, $JSON::Indent, $JSON::Delimiter
|
2169 |
+
|
2170 |
+
Enabling C<indent> corresponds to setting the C<$JSON::Pretty> flag, and
C<$JSON::Delimiter> was replaced by C<space_before> and C<space_after>.
In conclusion:
|
2173 |
+
|
2174 |
+
$json->indent->space_before->space_after;
|
2175 |
+
|
2176 |
+
Equivalent to:
|
2177 |
+
|
2178 |
+
$json->pretty;
|
2179 |
+
|
2180 |
+
To change indent length, use C<indent_length>.
|
2181 |
+
|
2182 |
+
(Only with JSON::PP, if C<-support_by_pp> is not used.)
|
2183 |
+
|
2184 |
+
$json->pretty->indent_length(2)->encode($perl_scalar);
|
2185 |
+
|
2186 |
+
=item $JSON::BareKey
|
2187 |
+
|
2188 |
+
(Only with JSON::PP, if C<-support_by_pp> is not used.)
|
2189 |
+
|
2190 |
+
$json->allow_barekey->decode($json_text)
|
2191 |
+
|
2192 |
+
=item $JSON::ConvBlessed
|
2193 |
+
|
2194 |
+
Use C<-convert_blessed_universally>. See L<convert_blessed>.
|
2195 |
+
|
2196 |
+
=item $JSON::QuotApos
|
2197 |
+
|
2198 |
+
(Only with JSON::PP, if C<-support_by_pp> is not used.)
|
2199 |
+
|
2200 |
+
$json->allow_singlequote->decode($json_text)
|
2201 |
+
|
2202 |
+
=item $JSON::SingleQuote
|
2203 |
+
|
2204 |
+
Disabled. C<JSON> no longer produces such an invalid JSON string.
|
2205 |
+
|
2206 |
+
=item $JSON::KeySort
|
2207 |
+
|
2208 |
+
$json->canonical->encode($perl_scalar)
|
2209 |
+
|
2210 |
+
This is an ASCII sort.
|
2211 |
+
|
2212 |
+
If you want to use your own sort routine, see the C<sort_by> method.
|
2213 |
+
|
2214 |
+
(Only with JSON::PP, even if C<-support_by_pp> is used currently.)
|
2215 |
+
|
2216 |
+
$json->sort_by($sort_routine_ref)->encode($perl_scalar)
|
2217 |
+
|
2218 |
+
$json->sort_by(sub { $JSON::PP::a <=> $JSON::PP::b })->encode($perl_scalar)
|
2219 |
+
|
2220 |
+
C<$a> and C<$b> are not accessible; use C<$JSON::PP::a> and C<$JSON::PP::b> instead.
|
2221 |
+
|
2222 |
+
=item $JSON::SkipInvalid
|
2223 |
+
|
2224 |
+
$json->allow_unknown
|
2225 |
+
|
2226 |
+
=item $JSON::AUTOCONVERT
|
2227 |
+
|
2228 |
+
Not needed. C<JSON> backend modules have round-trip integrity.
|
2229 |
+
|
2230 |
+
=item $JSON::UTF8
|
2231 |
+
|
2232 |
+
Not needed because C<JSON> (JSON::XS/JSON::PP) sets
the UTF8 flag properly.
|
2234 |
+
|
2235 |
+
# With UTF8-flagged strings
|
2236 |
+
|
2237 |
+
$json->allow_nonref;
|
2238 |
+
$str = chr(1000); # UTF8-flagged
|
2239 |
+
|
2240 |
+
$json_text = $json->utf8(0)->encode($str);
|
2241 |
+
utf8::is_utf8($json_text);
|
2242 |
+
# true
|
2243 |
+
$json_text = $json->utf8(1)->encode($str);
|
2244 |
+
utf8::is_utf8($json_text);
|
2245 |
+
# false
|
2246 |
+
|
2247 |
+
$str = '"' . chr(1000) . '"'; # UTF8-flagged
|
2248 |
+
|
2249 |
+
$perl_scalar = $json->utf8(0)->decode($str);
|
2250 |
+
utf8::is_utf8($perl_scalar);
|
2251 |
+
# true
|
2252 |
+
$perl_scalar = $json->utf8(1)->decode($str);
|
2253 |
+
# died because of 'Wide character in subroutine'
|
2254 |
+
|
2255 |
+
See L<JSON::XS/A FEW NOTES ON UNICODE AND PERL>.
|
2256 |
+
|
2257 |
+
=item $JSON::UnMapping
|
2258 |
+
|
2259 |
+
Disabled. See L<MAPPING>.
|
2260 |
+
|
2261 |
+
=item $JSON::SelfConvert
|
2262 |
+
|
2263 |
+
This option was deleted.
|
2264 |
+
Instead, if a given blessed object has a C<TO_JSON> method,
C<TO_JSON> will be called when C<convert_blessed> is enabled.
|
2266 |
+
|
2267 |
+
$json->convert_blessed->encode($blessed_hashref_or_arrayref)
|
2268 |
+
# call allow_blessed too if needed
|
2269 |
+
|
2270 |
+
Note that the method was named C<toJson> in the old version; it is now C<TO_JSON>.
|
2271 |
+
|
2272 |
+
=back
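As an illustration of the C<TO_JSON> replacement for C<$JSON::SelfConvert>, here is a minimal sketch. The C<Point> class is hypothetical and exists only for this example, and C<JSON::PP> is called directly as an assumption so the script runs with core Perl; C<canonical> is enabled just to fix the key order.

```perl
use strict;
use warnings;

package Point;   # hypothetical class, for illustration only

sub new { my ($class, %args) = @_; return bless {%args}, $class }

# Called by encode when convert_blessed is enabled; must return
# something that can itself be encoded as JSON.
sub TO_JSON { my ($self) = @_; return { x => $self->{x}, y => $self->{y} } }

package main;

use JSON::PP;

my $json = JSON::PP->new->canonical->convert_blessed;
my $text = $json->encode([ Point->new(x => 1, y => 2) ]);

print "$text\n";   # [{"x":1,"y":2}]
```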
|
2273 |
+
|
2274 |
+
=head1 TODO
|
2275 |
+
|
2276 |
+
=over
|
2277 |
+
|
2278 |
+
=item example programs
|
2279 |
+
|
2280 |
+
=back
|
2281 |
+
|
2282 |
+
=head1 THREADS
|
2283 |
+
|
2284 |
+
Not tested with JSON::PP. With JSON::XS, see L<JSON::XS/THREADS>.
|
2285 |
+
|
2286 |
+
|
2287 |
+
=head1 BUGS
|
2288 |
+
|
2289 |
+
Please report bugs relevant to C<JSON> to E<lt>makamaka[at]cpan.orgE<gt>.
|
2290 |
+
|
2291 |
+
|
2292 |
+
=head1 SEE ALSO
|
2293 |
+
|
2294 |
+
Most of the document is copied and modified from JSON::XS doc.
|
2295 |
+
|
2296 |
+
L<JSON::XS>, L<JSON::PP>
|
2297 |
+
|
2298 |
+
C<RFC4627>(L<http://www.ietf.org/rfc/rfc4627.txt>)
|
2299 |
+
|
2300 |
+
=head1 AUTHOR
|
2301 |
+
|
2302 |
+
Makamaka Hannyaharamitu, E<lt>makamaka[at]cpan.orgE<gt>
|
2303 |
+
|
2304 |
+
JSON::XS was written by Marc Lehmann <schmorp[at]schmorp.de>
|
2305 |
+
|
2306 |
+
The release of this new version owes to the courtesy of Marc Lehmann.
|
2307 |
+
|
2308 |
+
|
2309 |
+
=head1 COPYRIGHT AND LICENSE
|
2310 |
+
|
2311 |
+
Copyright 2005-2013 by Makamaka Hannyaharamitu
|
2312 |
+
|
2313 |
+
This library is free software; you can redistribute it and/or modify
|
2314 |
+
it under the same terms as Perl itself.
|
2315 |
+
|
2316 |
+
=cut
|
2317 |
+
|
uroman/lib/JSON/backportPP.pm
ADDED
@@ -0,0 +1,2806 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
package # This is JSON::backportPP
|
2 |
+
JSON::PP;
|
3 |
+
|
4 |
+
# JSON-2.0
|
5 |
+
|
6 |
+
use 5.005;
|
7 |
+
use strict;
|
8 |
+
use base qw(Exporter);
|
9 |
+
use overload ();
|
10 |
+
|
11 |
+
use Carp ();
|
12 |
+
use B ();
|
13 |
+
#use Devel::Peek;
|
14 |
+
|
15 |
+
use vars qw($VERSION);
|
16 |
+
$VERSION = '2.27204';
|
17 |
+
|
18 |
+
@JSON::PP::EXPORT = qw(encode_json decode_json from_json to_json);
|
19 |
+
|
20 |
+
# Instead of hash access, I tried index access for speed.
|
21 |
+
# But this approach is not as fast as I expected, so it will be changed.
|
22 |
+
|
23 |
+
use constant P_ASCII => 0;
|
24 |
+
use constant P_LATIN1 => 1;
|
25 |
+
use constant P_UTF8 => 2;
|
26 |
+
use constant P_INDENT => 3;
|
27 |
+
use constant P_CANONICAL => 4;
|
28 |
+
use constant P_SPACE_BEFORE => 5;
|
29 |
+
use constant P_SPACE_AFTER => 6;
|
30 |
+
use constant P_ALLOW_NONREF => 7;
|
31 |
+
use constant P_SHRINK => 8;
|
32 |
+
use constant P_ALLOW_BLESSED => 9;
|
33 |
+
use constant P_CONVERT_BLESSED => 10;
|
34 |
+
use constant P_RELAXED => 11;
|
35 |
+
|
36 |
+
use constant P_LOOSE => 12;
|
37 |
+
use constant P_ALLOW_BIGNUM => 13;
|
38 |
+
use constant P_ALLOW_BAREKEY => 14;
|
39 |
+
use constant P_ALLOW_SINGLEQUOTE => 15;
|
40 |
+
use constant P_ESCAPE_SLASH => 16;
|
41 |
+
use constant P_AS_NONBLESSED => 17;
|
42 |
+
|
43 |
+
use constant P_ALLOW_UNKNOWN => 18;
|
44 |
+
|
45 |
+
use constant OLD_PERL => $] < 5.008 ? 1 : 0;
|
46 |
+
|
47 |
+
BEGIN {
|
48 |
+
my @xs_compati_bit_properties = qw(
|
49 |
+
latin1 ascii utf8 indent canonical space_before space_after allow_nonref shrink
|
50 |
+
allow_blessed convert_blessed relaxed allow_unknown
|
51 |
+
);
|
52 |
+
my @pp_bit_properties = qw(
|
53 |
+
allow_singlequote allow_bignum loose
|
54 |
+
allow_barekey escape_slash as_nonblessed
|
55 |
+
);
|
56 |
+
|
57 |
+
# Perl version check: is Unicode handling enabled?
|
58 |
+
# Helper module sets @JSON::PP::_properties.
|
59 |
+
if ($] < 5.008 ) {
|
60 |
+
my $helper = $] >= 5.006 ? 'JSON::backportPP::Compat5006' : 'JSON::backportPP::Compat5005';
|
61 |
+
eval qq| require $helper |;
|
62 |
+
if ($@) { Carp::croak $@; }
|
63 |
+
}
|
64 |
+
|
65 |
+
for my $name (@xs_compati_bit_properties, @pp_bit_properties) {
|
66 |
+
my $flag_name = 'P_' . uc($name);
|
67 |
+
|
68 |
+
eval qq/
|
69 |
+
sub $name {
|
70 |
+
my \$enable = defined \$_[1] ? \$_[1] : 1;
|
71 |
+
|
72 |
+
if (\$enable) {
|
73 |
+
\$_[0]->{PROPS}->[$flag_name] = 1;
|
74 |
+
}
|
75 |
+
else {
|
76 |
+
\$_[0]->{PROPS}->[$flag_name] = 0;
|
77 |
+
}
|
78 |
+
|
79 |
+
\$_[0];
|
80 |
+
}
|
81 |
+
|
82 |
+
sub get_$name {
|
83 |
+
\$_[0]->{PROPS}->[$flag_name] ? 1 : '';
|
84 |
+
}
|
85 |
+
/;
|
86 |
+
}
|
87 |
+
|
88 |
+
}
|
89 |
+
|
90 |
+
|
91 |
+
|
92 |
+
# Functions
|
93 |
+
|
94 |
+
my %encode_allow_method
|
95 |
+
= map {($_ => 1)} qw/utf8 pretty allow_nonref latin1 self_encode escape_slash
|
96 |
+
allow_blessed convert_blessed indent indent_length allow_bignum
|
97 |
+
as_nonblessed
|
98 |
+
/;
|
99 |
+
my %decode_allow_method
|
100 |
+
= map {($_ => 1)} qw/utf8 allow_nonref loose allow_singlequote allow_bignum
|
101 |
+
allow_barekey max_size relaxed/;
|
102 |
+
|
103 |
+
|
104 |
+
my $JSON; # cache
|
105 |
+
|
106 |
+
sub encode_json ($) { # encode
|
107 |
+
($JSON ||= __PACKAGE__->new->utf8)->encode(@_);
|
108 |
+
}
|
109 |
+
|
110 |
+
|
111 |
+
sub decode_json { # decode
|
112 |
+
($JSON ||= __PACKAGE__->new->utf8)->decode(@_);
|
113 |
+
}
|
114 |
+
|
115 |
+
# Obsoleted
|
116 |
+
|
117 |
+
sub to_json($) {
|
118 |
+
Carp::croak ("JSON::PP::to_json has been renamed to encode_json.");
|
119 |
+
}
|
120 |
+
|
121 |
+
|
122 |
+
sub from_json($) {
|
123 |
+
Carp::croak ("JSON::PP::from_json has been renamed to decode_json.");
|
124 |
+
}
|
125 |
+
|
126 |
+
|
127 |
+
# Methods
|
128 |
+
|
129 |
+
sub new {
|
130 |
+
my $class = shift;
|
131 |
+
my $self = {
|
132 |
+
max_depth => 512,
|
133 |
+
max_size => 0,
|
134 |
+
indent => 0,
|
135 |
+
FLAGS => 0,
|
136 |
+
fallback => sub { encode_error('Invalid value. JSON can only reference.') },
|
137 |
+
indent_length => 3,
|
138 |
+
};
|
139 |
+
|
140 |
+
bless $self, $class;
|
141 |
+
}
|
142 |
+
|
143 |
+
|
144 |
+
sub encode {
|
145 |
+
return $_[0]->PP_encode_json($_[1]);
|
146 |
+
}
|
147 |
+
|
148 |
+
|
149 |
+
sub decode {
|
150 |
+
return $_[0]->PP_decode_json($_[1], 0x00000000);
|
151 |
+
}
|
152 |
+
|
153 |
+
|
154 |
+
sub decode_prefix {
|
155 |
+
return $_[0]->PP_decode_json($_[1], 0x00000001);
|
156 |
+
}
|
157 |
+
|
158 |
+
|
159 |
+
# accessor
|
160 |
+
|
161 |
+
|
162 |
+
# pretty printing
|
163 |
+
|
164 |
+
sub pretty {
|
165 |
+
my ($self, $v) = @_;
|
166 |
+
my $enable = defined $v ? $v : 1;
|
167 |
+
|
168 |
+
if ($enable) { # indent_length(3) for JSON::XS compatibility
|
169 |
+
$self->indent(1)->indent_length(3)->space_before(1)->space_after(1);
|
170 |
+
}
|
171 |
+
else {
|
172 |
+
$self->indent(0)->space_before(0)->space_after(0);
|
173 |
+
}
|
174 |
+
|
175 |
+
$self;
|
176 |
+
}
|
177 |
+
|
178 |
+
# etc
|
179 |
+
|
180 |
+
sub max_depth {
|
181 |
+
my $max = defined $_[1] ? $_[1] : 0x80000000;
|
182 |
+
$_[0]->{max_depth} = $max;
|
183 |
+
$_[0];
|
184 |
+
}
|
185 |
+
|
186 |
+
|
187 |
+
sub get_max_depth { $_[0]->{max_depth}; }
|
188 |
+
|
189 |
+
|
190 |
+
sub max_size {
|
191 |
+
my $max = defined $_[1] ? $_[1] : 0;
|
192 |
+
$_[0]->{max_size} = $max;
|
193 |
+
$_[0];
|
194 |
+
}
|
195 |
+
|
196 |
+
|
197 |
+
sub get_max_size { $_[0]->{max_size}; }
|
198 |
+
|
199 |
+
|
200 |
+
sub filter_json_object {
|
201 |
+
$_[0]->{cb_object} = defined $_[1] ? $_[1] : 0;
|
202 |
+
$_[0]->{F_HOOK} = ($_[0]->{cb_object} or $_[0]->{cb_sk_object}) ? 1 : 0;
|
203 |
+
$_[0];
|
204 |
+
}
|
205 |
+
|
206 |
+
sub filter_json_single_key_object {
|
207 |
+
if (@_ > 1) {
|
208 |
+
$_[0]->{cb_sk_object}->{$_[1]} = $_[2];
|
209 |
+
}
|
210 |
+
$_[0]->{F_HOOK} = ($_[0]->{cb_object} or $_[0]->{cb_sk_object}) ? 1 : 0;
|
211 |
+
$_[0];
|
212 |
+
}
|
213 |
+
|
214 |
+
sub indent_length {
|
215 |
+
if (!defined $_[1] or $_[1] > 15 or $_[1] < 0) {
|
216 |
+
Carp::carp "The acceptable range of indent_length() is 0 to 15.";
|
217 |
+
}
|
218 |
+
else {
|
219 |
+
$_[0]->{indent_length} = $_[1];
|
220 |
+
}
|
221 |
+
$_[0];
|
222 |
+
}
|
223 |
+
|
224 |
+
sub get_indent_length {
|
225 |
+
$_[0]->{indent_length};
|
226 |
+
}
|
227 |
+
|
228 |
+
sub sort_by {
|
229 |
+
$_[0]->{sort_by} = defined $_[1] ? $_[1] : 1;
|
230 |
+
$_[0];
|
231 |
+
}
|
232 |
+
|
233 |
+
sub allow_bigint {
|
234 |
+
Carp::carp("allow_bigint() is obsolete. Use allow_bignum() instead.");
|
235 |
+
}
|
236 |
+
|
237 |
+
###############################
|
238 |
+
|
239 |
+
###
|
240 |
+
### Perl => JSON
|
241 |
+
###
|
242 |
+
|
243 |
+
|
244 |
+
{ # Convert
|
245 |
+
|
246 |
+
my $max_depth;
|
247 |
+
my $indent;
|
248 |
+
my $ascii;
|
249 |
+
my $latin1;
|
250 |
+
my $utf8;
|
251 |
+
my $space_before;
|
252 |
+
my $space_after;
|
253 |
+
my $canonical;
|
254 |
+
my $allow_blessed;
|
255 |
+
my $convert_blessed;
|
256 |
+
|
257 |
+
my $indent_length;
|
258 |
+
my $escape_slash;
|
259 |
+
my $bignum;
|
260 |
+
my $as_nonblessed;
|
261 |
+
|
262 |
+
my $depth;
|
263 |
+
my $indent_count;
|
264 |
+
my $keysort;
|
265 |
+
|
266 |
+
|
267 |
+
sub PP_encode_json {
|
268 |
+
my $self = shift;
|
269 |
+
my $obj = shift;
|
270 |
+
|
271 |
+
$indent_count = 0;
|
272 |
+
$depth = 0;
|
273 |
+
|
274 |
+
my $idx = $self->{PROPS};
|
275 |
+
|
276 |
+
($ascii, $latin1, $utf8, $indent, $canonical, $space_before, $space_after, $allow_blessed,
|
277 |
+
$convert_blessed, $escape_slash, $bignum, $as_nonblessed)
|
278 |
+
= @{$idx}[P_ASCII .. P_SPACE_AFTER, P_ALLOW_BLESSED, P_CONVERT_BLESSED,
|
279 |
+
P_ESCAPE_SLASH, P_ALLOW_BIGNUM, P_AS_NONBLESSED];
|
280 |
+
|
281 |
+
($max_depth, $indent_length) = @{$self}{qw/max_depth indent_length/};
|
282 |
+
|
283 |
+
$keysort = $canonical ? sub { $a cmp $b } : undef;
|
284 |
+
|
285 |
+
if ($self->{sort_by}) {
|
286 |
+
$keysort = ref($self->{sort_by}) eq 'CODE' ? $self->{sort_by}
|
287 |
+
: $self->{sort_by} =~ /\D+/ ? $self->{sort_by}
|
288 |
+
: sub { $a cmp $b };
|
289 |
+
}
|
290 |
+
|
291 |
+
encode_error("hash- or arrayref expected (not a simple scalar, use allow_nonref to allow this)")
|
292 |
+
if(!ref $obj and !$idx->[ P_ALLOW_NONREF ]);
|
293 |
+
|
294 |
+
my $str = $self->object_to_json($obj);
|
295 |
+
|
296 |
+
$str .= "\n" if ( $indent ); # JSON::XS 2.26 compatible
|
297 |
+
|
298 |
+
unless ($ascii or $latin1 or $utf8) {
|
299 |
+
utf8::upgrade($str);
|
300 |
+
}
|
301 |
+
|
302 |
+
if ($idx->[ P_SHRINK ]) {
|
303 |
+
utf8::downgrade($str, 1);
|
304 |
+
}
|
305 |
+
|
306 |
+
return $str;
|
307 |
+
}
|
308 |
+
|
309 |
+
|
310 |
+
sub object_to_json {
|
311 |
+
my ($self, $obj) = @_;
|
312 |
+
my $type = ref($obj);
|
313 |
+
|
314 |
+
if($type eq 'HASH'){
|
315 |
+
return $self->hash_to_json($obj);
|
316 |
+
}
|
317 |
+
elsif($type eq 'ARRAY'){
|
318 |
+
return $self->array_to_json($obj);
|
319 |
+
}
|
320 |
+
elsif ($type) { # blessed object?
|
321 |
+
if (blessed($obj)) {
|
322 |
+
|
323 |
+
return $self->value_to_json($obj) if ( $obj->isa('JSON::PP::Boolean') );
|
324 |
+
|
325 |
+
if ( $convert_blessed and $obj->can('TO_JSON') ) {
|
326 |
+
my $result = $obj->TO_JSON();
|
327 |
+
if ( defined $result and ref( $result ) ) {
|
328 |
+
if ( refaddr( $obj ) eq refaddr( $result ) ) {
|
329 |
+
encode_error( sprintf(
|
330 |
+
"%s::TO_JSON method returned same object as was passed instead of a new one",
|
331 |
+
ref $obj
|
332 |
+
) );
|
333 |
+
}
|
334 |
+
}
|
335 |
+
|
336 |
+
return $self->object_to_json( $result );
|
337 |
+
}
|
338 |
+
|
339 |
+
return "$obj" if ( $bignum and _is_bignum($obj) );
|
340 |
+
return $self->blessed_to_json($obj) if ($allow_blessed and $as_nonblessed); # will be removed.
|
341 |
+
|
342 |
+
encode_error( sprintf("encountered object '%s', but neither allow_blessed "
|
343 |
+
. "nor convert_blessed settings are enabled", $obj)
|
344 |
+
) unless ($allow_blessed);
|
345 |
+
|
346 |
+
return 'null';
|
347 |
+
}
|
348 |
+
else {
|
349 |
+
return $self->value_to_json($obj);
|
350 |
+
}
|
351 |
+
}
|
352 |
+
else{
|
353 |
+
return $self->value_to_json($obj);
|
354 |
+
}
|
355 |
+
}
|
356 |
+
|
357 |
+
|
358 |
+
sub hash_to_json {
|
359 |
+
my ($self, $obj) = @_;
|
360 |
+
my @res;
|
361 |
+
|
362 |
+
encode_error("json text or perl structure exceeds maximum nesting level (max_depth set too low?)")
|
363 |
+
if (++$depth > $max_depth);
|
364 |
+
|
365 |
+
my ($pre, $post) = $indent ? $self->_up_indent() : ('', '');
|
366 |
+
my $del = ($space_before ? ' ' : '') . ':' . ($space_after ? ' ' : '');
|
367 |
+
|
368 |
+
for my $k ( _sort( $obj ) ) {
|
369 |
+
if ( OLD_PERL ) { utf8::decode($k) } # decode key on Perl 5.6; OLD_PERL is a constant, so this branch is optimized away on newer perls
|
370 |
+
push @res, string_to_json( $self, $k )
|
371 |
+
. $del
|
372 |
+
. ( $self->object_to_json( $obj->{$k} ) || $self->value_to_json( $obj->{$k} ) );
|
373 |
+
}
|
374 |
+
|
375 |
+
--$depth;
|
376 |
+
$self->_down_indent() if ($indent);
|
377 |
+
|
378 |
+
return '{' . ( @res ? $pre : '' ) . ( @res ? join( ",$pre", @res ) . $post : '' ) . '}';
|
379 |
+
}
|
380 |
+
|
381 |
+
|
382 |
+
sub array_to_json {
|
383 |
+
my ($self, $obj) = @_;
|
384 |
+
my @res;
|
385 |
+
|
386 |
+
encode_error("json text or perl structure exceeds maximum nesting level (max_depth set too low?)")
|
387 |
+
if (++$depth > $max_depth);
|
388 |
+
|
389 |
+
my ($pre, $post) = $indent ? $self->_up_indent() : ('', '');
|
390 |
+
|
391 |
+
for my $v (@$obj){
|
392 |
+
push @res, $self->object_to_json($v) || $self->value_to_json($v);
|
393 |
+
}
|
394 |
+
|
395 |
+
--$depth;
|
396 |
+
$self->_down_indent() if ($indent);
|
397 |
+
|
398 |
+
return '[' . ( @res ? $pre : '' ) . ( @res ? join( ",$pre", @res ) . $post : '' ) . ']';
|
399 |
+
}
|
400 |
+
|
401 |
+
|
402 |
+
sub value_to_json {
|
403 |
+
my ($self, $value) = @_;
|
404 |
+
|
405 |
+
return 'null' if(!defined $value);
|
406 |
+
|
407 |
+
my $b_obj = B::svref_2object(\$value); # for round trip problem
|
408 |
+
my $flags = $b_obj->FLAGS;
|
409 |
+
|
410 |
+
return $value # as is
|
411 |
+
if $flags & ( B::SVp_IOK | B::SVp_NOK ) and !( $flags & B::SVp_POK ); # SvTYPE is IV or NV?
|
412 |
+
|
413 |
+
my $type = ref($value);
|
414 |
+
|
415 |
+
if(!$type){
|
416 |
+
return string_to_json($self, $value);
|
417 |
+
}
|
418 |
+
elsif( blessed($value) and $value->isa('JSON::PP::Boolean') ){
|
419 |
+
return $$value == 1 ? 'true' : 'false';
|
420 |
+
}
|
421 |
+
elsif ($type) {
|
422 |
+
if ((overload::StrVal($value) =~ /=(\w+)/)[0]) {
|
423 |
+
return $self->value_to_json("$value");
|
424 |
+
}
|
425 |
+
|
426 |
+
if ($type eq 'SCALAR' and defined $$value) {
|
427 |
+
return $$value eq '1' ? 'true'
|
428 |
+
: $$value eq '0' ? 'false'
|
429 |
+
: $self->{PROPS}->[ P_ALLOW_UNKNOWN ] ? 'null'
|
430 |
+
: encode_error("cannot encode reference to scalar");
|
431 |
+
}
|
432 |
+
|
433 |
+
if ( $self->{PROPS}->[ P_ALLOW_UNKNOWN ] ) {
|
434 |
+
return 'null';
|
435 |
+
}
|
436 |
+
else {
|
437 |
+
if ( $type eq 'SCALAR' or $type eq 'REF' ) {
|
438 |
+
encode_error("cannot encode reference to scalar");
|
439 |
+
}
|
440 |
+
else {
|
441 |
+
encode_error("encountered $value, but JSON can only represent references to arrays or hashes");
|
442 |
+
}
|
443 |
+
}
|
444 |
+
|
445 |
+
}
|
446 |
+
else {
|
447 |
+
return $self->{fallback}->($value)
|
448 |
+
if ($self->{fallback} and ref($self->{fallback}) eq 'CODE');
|
449 |
+
return 'null';
|
450 |
+
}
|
451 |
+
|
452 |
+
}
|
453 |
+
|
454 |
+
|
455 |
+
my %esc = (
|
456 |
+
"\n" => '\n',
|
457 |
+
"\r" => '\r',
|
458 |
+
"\t" => '\t',
|
459 |
+
"\f" => '\f',
|
460 |
+
"\b" => '\b',
|
461 |
+
"\"" => '\"',
|
462 |
+
"\\" => '\\\\',
|
463 |
+
"\'" => '\\\'',
|
464 |
+
);
|
465 |
+
|
466 |
+
|
467 |
+
sub string_to_json {
|
468 |
+
my ($self, $arg) = @_;
|
469 |
+
|
470 |
+
$arg =~ s/([\x22\x5c\n\r\t\f\b])/$esc{$1}/g;
|
471 |
+
$arg =~ s/\//\\\//g if ($escape_slash);
|
472 |
+
$arg =~ s/([\x00-\x08\x0b\x0e-\x1f])/'\\u00' . unpack('H2', $1)/eg;
|
473 |
+
|
474 |
+
if ($ascii) {
|
475 |
+
$arg = JSON_PP_encode_ascii($arg);
|
476 |
+
}
|
477 |
+
|
478 |
+
if ($latin1) {
|
479 |
+
$arg = JSON_PP_encode_latin1($arg);
|
480 |
+
}
|
481 |
+
|
482 |
+
if ($utf8) {
|
483 |
+
utf8::encode($arg);
|
484 |
+
}
|
485 |
+
|
486 |
+
return '"' . $arg . '"';
|
487 |
+
}
|
488 |
+
|
489 |
+
|
490 |
+
sub blessed_to_json {
|
491 |
+
my $reftype = reftype($_[1]) || '';
|
492 |
+
if ($reftype eq 'HASH') {
|
493 |
+
return $_[0]->hash_to_json($_[1]);
|
494 |
+
}
|
495 |
+
elsif ($reftype eq 'ARRAY') {
|
496 |
+
return $_[0]->array_to_json($_[1]);
|
497 |
+
}
|
498 |
+
else {
|
499 |
+
return 'null';
|
500 |
+
}
|
501 |
+
}
|
502 |
+
|
503 |
+
|
504 |
+
sub encode_error {
|
505 |
+
my $error = shift;
|
506 |
+
Carp::croak "$error";
|
507 |
+
}
|
508 |
+
|
509 |
+
|
510 |
+
sub _sort {
|
511 |
+
defined $keysort ? (sort $keysort (keys %{$_[0]})) : keys %{$_[0]};
|
512 |
+
}
|
513 |
+
|
514 |
+
|
515 |
+
sub _up_indent {
|
516 |
+
my $self = shift;
|
517 |
+
my $space = ' ' x $indent_length;
|
518 |
+
|
519 |
+
my ($pre,$post) = ('','');
|
520 |
+
|
521 |
+
$post = "\n" . $space x $indent_count;
|
522 |
+
|
523 |
+
$indent_count++;
|
524 |
+
|
525 |
+
$pre = "\n" . $space x $indent_count;
|
526 |
+
|
527 |
+
return ($pre,$post);
|
528 |
+
}
|
529 |
+
|
530 |
+
|
531 |
+
sub _down_indent { $indent_count--; }
|
532 |
+
|
533 |
+
|
534 |
+
sub PP_encode_box {
|
535 |
+
{
|
536 |
+
depth => $depth,
|
537 |
+
indent_count => $indent_count,
|
538 |
+
};
|
539 |
+
}
|
540 |
+
|
541 |
+
} # Convert
|
542 |
+
|
543 |
+
|
544 |
+
sub _encode_ascii {
|
545 |
+
join('',
|
546 |
+
map {
|
547 |
+
$_ <= 127 ?
|
548 |
+
chr($_) :
|
549 |
+
$_ <= 65535 ?
|
550 |
+
sprintf('\u%04x', $_) : sprintf('\u%x\u%x', _encode_surrogates($_));
|
551 |
+
} unpack('U*', $_[0])
|
552 |
+
);
|
553 |
+
}
|
554 |
+
|
555 |
+
|
556 |
+
sub _encode_latin1 {
|
557 |
+
join('',
|
558 |
+
map {
|
559 |
+
$_ <= 255 ?
|
560 |
+
chr($_) :
|
561 |
+
$_ <= 65535 ?
|
562 |
+
sprintf('\u%04x', $_) : sprintf('\u%x\u%x', _encode_surrogates($_));
|
563 |
+
} unpack('U*', $_[0])
|
564 |
+
);
|
565 |
+
}
|
566 |
+
|
567 |
+
|
568 |
+
sub _encode_surrogates { # from perlunicode
|
569 |
+
my $uni = $_[0] - 0x10000;
|
570 |
+
return ($uni / 0x400 + 0xD800, $uni % 0x400 + 0xDC00);
|
571 |
+
}
|
572 |
+
|
573 |
+
|
574 |
+
sub _is_bignum {
|
575 |
+
$_[0]->isa('Math::BigInt') or $_[0]->isa('Math::BigFloat');
|
576 |
+
}
|
577 |
+
|
578 |
+
|
579 |
+
|
580 |
+
#
|
581 |
+
# JSON => Perl
|
582 |
+
#
|
583 |
+
|
584 |
+
my $max_intsize;
|
585 |
+
|
586 |
+
BEGIN {
|
587 |
+
my $checkint = 1111;
|
588 |
+
for my $d (5..64) {
|
589 |
+
$checkint .= 1;
|
590 |
+
my $int = eval qq| $checkint |;
|
591 |
+
if ($int =~ /[eE]/) {
|
592 |
+
$max_intsize = $d - 1;
|
593 |
+
last;
|
594 |
+
}
|
595 |
+
}
|
596 |
+
}
|
597 |
+
|
598 |
+
{ # PARSE
|
599 |
+
|
600 |
+
my %escapes = ( # by Jeremy Muhlich <jmuhlich [at] bitflood.org>
|
601 |
+
b => "\x8",
|
602 |
+
t => "\x9",
|
603 |
+
n => "\xA",
|
604 |
+
f => "\xC",
|
605 |
+
r => "\xD",
|
606 |
+
'\\' => '\\',
|
607 |
+
'"' => '"',
|
608 |
+
'/' => '/',
|
609 |
+
);
|
610 |
+
|
611 |
+
my $text; # json data
|
612 |
+
my $at; # offset
|
613 |
+
my $ch; # current character
|
614 |
+
my $len; # text length (changed according to UTF8 or NON UTF8)
|
615 |
+
# INTERNAL
|
616 |
+
my $depth; # nest counter
|
617 |
+
my $encoding; # json text encoding
|
618 |
+
my $is_valid_utf8; # temp variable
|
619 |
+
my $utf8_len; # utf8 byte length
|
620 |
+
# FLAGS
|
621 |
+
my $utf8; # must be utf8
|
622 |
+
my $max_depth; # max nest number of objects and arrays
|
623 |
+
my $max_size;
|
624 |
+
my $relaxed;
|
625 |
+
my $cb_object;
|
626 |
+
my $cb_sk_object;
|
627 |
+
|
628 |
+
my $F_HOOK;
|
629 |
+
|
630 |
+
my $allow_bigint; # using Math::BigInt
|
631 |
+
my $singlequote; # loosely quoting
|
632 |
+
my $loose; #
|
633 |
+
my $allow_barekey; # bareKey
|
634 |
+
|
635 |
+
# $opt flag
|
636 |
+
# 0x00000001 .... decode_prefix
|
637 |
+
# 0x10000000 .... incr_parse
|
638 |
+
|
639 |
+
sub PP_decode_json {
|
640 |
+
my ($self, $opt); # $opt is an effective flag during this decode_json.
|
641 |
+
|
642 |
+
($self, $text, $opt) = @_;
|
643 |
+
|
644 |
+
($at, $ch, $depth) = (0, '', 0);
|
645 |
+
|
646 |
+
if ( !defined $text or ref $text ) {
|
647 |
+
decode_error("malformed JSON string, neither array, object, number, string or atom");
|
648 |
+
}
|
649 |
+
|
650 |
+
my $idx = $self->{PROPS};
|
651 |
+
|
652 |
+
($utf8, $relaxed, $loose, $allow_bigint, $allow_barekey, $singlequote)
|
653 |
+
= @{$idx}[P_UTF8, P_RELAXED, P_LOOSE .. P_ALLOW_SINGLEQUOTE];
|
654 |
+
|
655 |
+
if ( $utf8 ) {
|
656 |
+
utf8::downgrade( $text, 1 ) or Carp::croak("Wide character in subroutine entry");
|
657 |
+
}
|
658 |
+
else {
|
659 |
+
utf8::upgrade( $text );
|
660 |
+
}
|
661 |
+
|
662 |
+
$len = length $text;
|
663 |
+
|
664 |
+
($max_depth, $max_size, $cb_object, $cb_sk_object, $F_HOOK)
|
665 |
+
= @{$self}{qw/max_depth max_size cb_object cb_sk_object F_HOOK/};
|
666 |
+
|
667 |
+
if ($max_size > 1) {
|
668 |
+
use bytes;
|
669 |
+
my $bytes = length $text;
|
670 |
+
decode_error(
|
671 |
+
sprintf("attempted decode of JSON text of %s bytes size, but max_size is set to %s"
|
672 |
+
, $bytes, $max_size), 1
|
673 |
+
) if ($bytes > $max_size);
|
674 |
+
}
|
675 |
+
|
676 |
+
# Currently no effect
|
677 |
+
# should use regexp
|
678 |
+
my @octets = unpack('C4', $text);
|
679 |
+
$encoding = ( $octets[0] and $octets[1]) ? 'UTF-8'
|
680 |
+
: (!$octets[0] and $octets[1]) ? 'UTF-16BE'
|
681 |
+
: (!$octets[0] and !$octets[1]) ? 'UTF-32BE'
|
682 |
+
: ( $octets[2] ) ? 'UTF-16LE'
|
683 |
+
: (!$octets[2] ) ? 'UTF-32LE'
|
684 |
+
: 'unknown';
|
685 |
+
|
686 |
+
white(); # remove head white space
|
687 |
+
|
688 |
+
my $valid_start = defined $ch; # Is there a first character for JSON structure?
|
689 |
+
|
690 |
+
my $result = value();
|
691 |
+
|
692 |
+
return undef if ( !$result && ( $opt & 0x10000000 ) ); # for incr_parse
|
693 |
+
|
694 |
+
decode_error("malformed JSON string, neither array, object, number, string or atom") unless $valid_start;
|
695 |
+
|
696 |
+
if ( !$idx->[ P_ALLOW_NONREF ] and !ref $result ) {
|
697 |
+
decode_error(
|
698 |
+
'JSON text must be an object or array (but found number, string, true, false or null,'
|
699 |
+
. ' use allow_nonref to allow this)', 1);
|
700 |
+
}
|
701 |
+
|
702 |
+
Carp::croak('something wrong.') if $len < $at; # we won't arrive here.
|
703 |
+
|
704 |
+
my $consumed = defined $ch ? $at - 1 : $at; # consumed JSON text length
|
705 |
+
|
706 |
+
white(); # remove tail white space
|
707 |
+
|
708 |
+
if ( $ch ) {
|
709 |
+
return ( $result, $consumed ) if ($opt & 0x00000001); # all right if decode_prefix
|
710 |
+
decode_error("garbage after JSON object");
|
711 |
+
}
|
712 |
+
|
713 |
+
( $opt & 0x00000001 ) ? ( $result, $consumed ) : $result;
|
714 |
+
}
|
715 |
+
|
716 |
+
|
717 |
+
sub next_chr {
|
718 |
+
return $ch = undef if($at >= $len);
|
719 |
+
$ch = substr($text, $at++, 1);
|
720 |
+
}
|
721 |
+
|
722 |
+
|
723 |
+
sub value {
|
724 |
+
white();
|
725 |
+
return if(!defined $ch);
|
726 |
+
return object() if($ch eq '{');
|
727 |
+
return array() if($ch eq '[');
|
728 |
+
return string() if($ch eq '"' or ($singlequote and $ch eq "'"));
|
729 |
+
return number() if($ch =~ /[0-9]/ or $ch eq '-');
|
730 |
+
return word();
|
731 |
+
}
|
732 |
+
|
733 |
+
sub string {
|
734 |
+
my ($i, $s, $t, $u);
|
735 |
+
my $utf16;
|
736 |
+
my $is_utf8;
|
737 |
+
|
738 |
+
($is_valid_utf8, $utf8_len) = ('', 0);
|
739 |
+
|
740 |
+
$s = ''; # basically UTF8 flag on
|
741 |
+
|
742 |
+
if($ch eq '"' or ($singlequote and $ch eq "'")){
|
743 |
+
my $boundChar = $ch;
|
744 |
+
|
745 |
+
OUTER: while( defined(next_chr()) ){
|
746 |
+
|
747 |
+
if($ch eq $boundChar){
|
748 |
+
next_chr();
|
749 |
+
|
750 |
+
if ($utf16) {
|
751 |
+
decode_error("missing low surrogate character in surrogate pair");
|
752 |
+
}
|
753 |
+
|
754 |
+
utf8::decode($s) if($is_utf8);
|
755 |
+
|
756 |
+
return $s;
|
757 |
+
}
|
758 |
+
elsif($ch eq '\\'){
|
759 |
+
next_chr();
|
760 |
+
if(exists $escapes{$ch}){
|
761 |
+
$s .= $escapes{$ch};
|
762 |
+
}
|
763 |
+
elsif($ch eq 'u'){ # UNICODE handling
|
764 |
+
my $u = '';
|
765 |
+
|
766 |
+
for(1..4){
|
767 |
+
$ch = next_chr();
|
768 |
+
last OUTER if($ch !~ /[0-9a-fA-F]/);
|
769 |
+
$u .= $ch;
|
770 |
+
}
|
771 |
+
|
772 |
+
# U+D800 - U+DBFF
|
773 |
+
if ($u =~ /^[dD][89abAB][0-9a-fA-F]{2}/) { # UTF-16 high surrogate?
|
774 |
+
$utf16 = $u;
|
775 |
+
}
|
776 |
+
# U+DC00 - U+DFFF
|
777 |
+
elsif ($u =~ /^[dD][c-fC-F][0-9a-fA-F]{2}/) { # UTF-16 low surrogate?
|
778 |
+
unless (defined $utf16) {
|
779 |
+
decode_error("missing high surrogate character in surrogate pair");
|
780 |
+
}
|
781 |
+
$is_utf8 = 1;
|
782 |
+
$s .= JSON_PP_decode_surrogates($utf16, $u) || next;
|
783 |
+
$utf16 = undef;
|
784 |
+
}
|
785 |
+
else {
|
786 |
+
if (defined $utf16) {
|
787 |
+
decode_error("surrogate pair expected");
|
788 |
+
}
|
789 |
+
|
790 |
+
if ( ( my $hex = hex( $u ) ) > 127 ) {
|
791 |
+
$is_utf8 = 1;
|
792 |
+
$s .= JSON_PP_decode_unicode($u) || next;
|
793 |
+
}
|
794 |
+
else {
|
795 |
+
$s .= chr $hex;
|
796 |
+
}
|
797 |
+
}
|
798 |
+
|
799 |
+
}
|
800 |
+
else{
|
801 |
+
unless ($loose) {
|
802 |
+
$at -= 2;
|
803 |
+
decode_error('illegal backslash escape sequence in string');
|
804 |
+
}
|
805 |
+
$s .= $ch;
|
806 |
+
}
|
807 |
+
}
|
808 |
+
else{
|
809 |
+
|
810 |
+
if ( ord $ch > 127 ) {
|
811 |
+
if ( $utf8 ) {
|
812 |
+
unless( $ch = is_valid_utf8($ch) ) {
|
813 |
+
$at -= 1;
|
814 |
+
decode_error("malformed UTF-8 character in JSON string");
|
815 |
+
}
|
816 |
+
else {
|
817 |
+
$at += $utf8_len - 1;
|
818 |
+
}
|
819 |
+
}
|
820 |
+
else {
|
821 |
+
utf8::encode( $ch );
|
822 |
+
}
|
823 |
+
|
824 |
+
$is_utf8 = 1;
|
825 |
+
}
|
826 |
+
|
827 |
+
if (!$loose) {
|
828 |
+
if ($ch =~ /[\x00-\x1f\x22\x5c]/) { # '/' ok
|
829 |
+
$at--;
|
830 |
+
decode_error('invalid character encountered while parsing JSON string');
|
831 |
+
}
|
832 |
+
}
|
833 |
+
|
834 |
+
$s .= $ch;
|
835 |
+
}
|
836 |
+
}
|
837 |
+
}
|
838 |
+
|
839 |
+
decode_error("unexpected end of string while parsing JSON string");
|
840 |
+
}
|
841 |
+
|
842 |
+
|
843 |
+
sub white {
|
844 |
+
while( defined $ch ){
|
845 |
+
if($ch le ' '){
|
846 |
+
next_chr();
|
847 |
+
}
|
848 |
+
elsif($ch eq '/'){
|
849 |
+
next_chr();
|
850 |
+
if(defined $ch and $ch eq '/'){
|
851 |
+
1 while(defined(next_chr()) and $ch ne "\n" and $ch ne "\r");
|
852 |
+
}
|
853 |
+
elsif(defined $ch and $ch eq '*'){
|
854 |
+
next_chr();
|
855 |
+
while(1){
|
856 |
+
if(defined $ch){
|
857 |
+
if($ch eq '*'){
|
858 |
+
if(defined(next_chr()) and $ch eq '/'){
|
859 |
+
next_chr();
|
860 |
+
last;
|
861 |
+
}
|
862 |
+
}
|
863 |
+
else{
|
864 |
+
next_chr();
|
865 |
+
}
|
866 |
+
}
|
867 |
+
else{
|
868 |
+
decode_error("Unterminated comment");
|
869 |
+
}
|
870 |
+
}
|
871 |
+
next;
|
872 |
+
}
|
873 |
+
else{
|
874 |
+
$at--;
|
875 |
+
decode_error("malformed JSON string, neither array, object, number, string or atom");
|
876 |
+
}
|
877 |
+
}
|
878 |
+
else{
|
879 |
+
if ($relaxed and $ch eq '#') { # correctly?
|
880 |
+
pos($text) = $at;
|
881 |
+
$text =~ /\G([^\n]*(?:\r\n|\r|\n|$))/g;
|
882 |
+
$at = pos($text);
|
883 |
+
next_chr;
|
884 |
+
next;
|
885 |
+
}
|
886 |
+
|
887 |
+
last;
|
888 |
+
}
|
889 |
+
}
|
890 |
+
}
|
891 |
+
|
892 |
+
|
893 |
+
sub array {
|
894 |
+
my $a = $_[0] || []; # you can use this code to use another array ref object.
|
895 |
+
|
896 |
+
decode_error('json text or perl structure exceeds maximum nesting level (max_depth set too low?)')
|
897 |
+
if (++$depth > $max_depth);
|
898 |
+
|
899 |
+
next_chr();
|
900 |
+
white();
|
901 |
+
|
902 |
+
if(defined $ch and $ch eq ']'){
|
903 |
+
--$depth;
|
904 |
+
next_chr();
|
905 |
+
return $a;
|
906 |
+
}
|
907 |
+
else {
|
908 |
+
while(defined($ch)){
|
909 |
+
push @$a, value();
|
910 |
+
|
911 |
+
white();
|
912 |
+
|
913 |
+
if (!defined $ch) {
|
914 |
+
last;
|
915 |
+
}
|
916 |
+
|
917 |
+
if($ch eq ']'){
|
918 |
+
--$depth;
|
919 |
+
next_chr();
|
920 |
+
return $a;
|
921 |
+
}
|
922 |
+
|
923 |
+
if($ch ne ','){
|
924 |
+
last;
|
925 |
+
}
|
926 |
+
|
927 |
+
next_chr();
|
928 |
+
white();
|
929 |
+
|
930 |
+
if ($relaxed and $ch eq ']') {
|
931 |
+
--$depth;
|
932 |
+
next_chr();
|
933 |
+
return $a;
|
934 |
+
}
|
935 |
+
|
936 |
+
}
|
937 |
+
}
|
938 |
+
|
939 |
+
decode_error(", or ] expected while parsing array");
|
940 |
+
}
|
941 |
+
|
942 |
+
|
943 |
+
sub object {
|
944 |
+
my $o = $_[0] || {}; # you can use this code to use another hash ref object.
|
945 |
+
my $k;
|
946 |
+
|
947 |
+
decode_error('json text or perl structure exceeds maximum nesting level (max_depth set too low?)')
|
948 |
+
if (++$depth > $max_depth);
|
949 |
+
next_chr();
|
950 |
+
white();
|
951 |
+
|
952 |
+
if(defined $ch and $ch eq '}'){
|
953 |
+
--$depth;
|
954 |
+
next_chr();
|
955 |
+
if ($F_HOOK) {
|
956 |
+
return _json_object_hook($o);
|
957 |
+
}
|
958 |
+
return $o;
|
959 |
+
}
|
960 |
+
else {
|
961 |
+
while (defined $ch) {
|
962 |
+
$k = ($allow_barekey and $ch ne '"' and $ch ne "'") ? bareKey() : string();
|
963 |
+
white();
|
964 |
+
|
965 |
+
if(!defined $ch or $ch ne ':'){
|
966 |
+
$at--;
|
967 |
+
decode_error("':' expected");
|
968 |
+
}
|
969 |
+
|
970 |
+
next_chr();
|
971 |
+
$o->{$k} = value();
|
972 |
+
white();
|
973 |
+
|
974 |
+
last if (!defined $ch);
|
975 |
+
|
976 |
+
if($ch eq '}'){
|
977 |
+
--$depth;
|
978 |
+
next_chr();
|
979 |
+
if ($F_HOOK) {
|
980 |
+
return _json_object_hook($o);
|
981 |
+
}
|
982 |
+
return $o;
|
983 |
+
}
|
984 |
+
|
985 |
+
if($ch ne ','){
|
986 |
+
last;
|
987 |
+
}
|
988 |
+
|
989 |
+
next_chr();
|
990 |
+
white();
|
991 |
+
|
992 |
+
if ($relaxed and $ch eq '}') {
|
993 |
+
--$depth;
|
994 |
+
next_chr();
|
995 |
+
if ($F_HOOK) {
|
996 |
+
return _json_object_hook($o);
|
997 |
+
}
|
998 |
+
return $o;
|
999 |
+
}
|
1000 |
+
|
1001 |
+
}
|
1002 |
+
|
1003 |
+
}
|
1004 |
+
|
1005 |
+
$at--;
|
1006 |
+
decode_error(", or } expected while parsing object/hash");
|
1007 |
+
}
|
1008 |
+
|
1009 |
+
|
1010 |
+
sub bareKey { # doesn't strictly follow Standard ECMA-262 3rd Edition
|
1011 |
+
my $key;
|
1012 |
+
while($ch =~ /[^\x00-\x23\x25-\x2F\x3A-\x40\x5B-\x5E\x60\x7B-\x7F]/){
|
1013 |
+
$key .= $ch;
|
1014 |
+
next_chr();
|
1015 |
+
}
|
1016 |
+
return $key;
|
1017 |
+
}
|
1018 |
+
|
1019 |
+
|
1020 |
+
sub word {
|
1021 |
+
my $word = substr($text,$at-1,4);
|
1022 |
+
|
1023 |
+
if($word eq 'true'){
|
1024 |
+
$at += 3;
|
1025 |
+
next_chr;
|
1026 |
+
return $JSON::PP::true;
|
1027 |
+
}
|
1028 |
+
elsif($word eq 'null'){
|
1029 |
+
$at += 3;
|
1030 |
+
next_chr;
|
1031 |
+
return undef;
|
1032 |
+
}
|
1033 |
+
elsif($word eq 'fals'){
|
1034 |
+
$at += 3;
|
1035 |
+
if(substr($text,$at,1) eq 'e'){
|
1036 |
+
$at++;
|
1037 |
+
next_chr;
|
1038 |
+
return $JSON::PP::false;
|
1039 |
+
}
|
1040 |
+
}
|
1041 |
+
|
1042 |
+
$at--; # for decode_error report
|
1043 |
+
|
1044 |
+
decode_error("'null' expected") if ($word =~ /^n/);
|
1045 |
+
decode_error("'true' expected") if ($word =~ /^t/);
|
1046 |
+
decode_error("'false' expected") if ($word =~ /^f/);
|
1047 |
+
decode_error("malformed JSON string, neither array, object, number, string or atom");
|
1048 |
+
}
|
1049 |
+
|
1050 |
+
|
1051 |
+
sub number {
|
1052 |
+
my $n = '';
|
1053 |
+
my $v;
|
1054 |
+
|
1055 |
+
# According to RFC4627, hex or oct digits are invalid.
|
1056 |
+
if($ch eq '0'){
|
1057 |
+
my $peek = substr($text,$at,1);
|
1058 |
+
my $hex = $peek =~ /[xX]/; # 0 or 1
|
1059 |
+
|
1060 |
+
if($hex){
|
1061 |
+
decode_error("malformed number (leading zero must not be followed by another digit)");
|
1062 |
+
($n) = ( substr($text, $at+1) =~ /^([0-9a-fA-F]+)/);
|
1063 |
+
}
|
1064 |
+
else{ # oct
|
1065 |
+
($n) = ( substr($text, $at) =~ /^([0-7]+)/);
|
1066 |
+
if (defined $n and length $n > 1) {
|
1067 |
+
decode_error("malformed number (leading zero must not be followed by another digit)");
|
1068 |
+
}
|
1069 |
+
}
|
1070 |
+
|
1071 |
+
if(defined $n and length($n)){
|
1072 |
+
if (!$hex and length($n) == 1) {
|
1073 |
+
decode_error("malformed number (leading zero must not be followed by another digit)");
|
1074 |
+
}
|
1075 |
+
$at += length($n) + $hex;
|
1076 |
+
next_chr;
|
1077 |
+
return $hex ? hex($n) : oct($n);
|
1078 |
+
}
|
1079 |
+
}
|
1080 |
+
|
1081 |
+
if($ch eq '-'){
|
1082 |
+
$n = '-';
|
1083 |
+
next_chr;
|
1084 |
+
if (!defined $ch or $ch !~ /\d/) {
|
1085 |
+
decode_error("malformed number (no digits after initial minus)");
|
1086 |
+
}
|
1087 |
+
}
|
1088 |
+
|
1089 |
+
while(defined $ch and $ch =~ /\d/){
|
1090 |
+
$n .= $ch;
|
1091 |
+
next_chr;
|
1092 |
+
}
|
1093 |
+
|
1094 |
+
if(defined $ch and $ch eq '.'){
|
1095 |
+
$n .= '.';
|
1096 |
+
|
1097 |
+
next_chr;
|
1098 |
+
if (!defined $ch or $ch !~ /\d/) {
|
1099 |
+
decode_error("malformed number (no digits after decimal point)");
|
1100 |
+
}
|
1101 |
+
else {
|
1102 |
+
$n .= $ch;
|
1103 |
+
}
|
1104 |
+
|
1105 |
+
while(defined(next_chr) and $ch =~ /\d/){
|
1106 |
+
$n .= $ch;
|
1107 |
+
}
|
1108 |
+
}
|
1109 |
+
|
1110 |
+
if(defined $ch and ($ch eq 'e' or $ch eq 'E')){
|
1111 |
+
$n .= $ch;
|
1112 |
+
next_chr;
|
1113 |
+
|
1114 |
+
if(defined($ch) and ($ch eq '+' or $ch eq '-')){
|
1115 |
+
$n .= $ch;
|
1116 |
+
next_chr;
|
1117 |
+
if (!defined $ch or $ch =~ /\D/) {
|
1118 |
+
decode_error("malformed number (no digits after exp sign)");
|
1119 |
+
}
|
1120 |
+
$n .= $ch;
|
1121 |
+
}
|
1122 |
+
elsif(defined($ch) and $ch =~ /\d/){
|
1123 |
+
$n .= $ch;
|
1124 |
+
}
|
1125 |
+
else {
|
1126 |
+
decode_error("malformed number (no digits after exp sign)");
|
1127 |
+
}
|
1128 |
+
|
1129 |
+
while(defined(next_chr) and $ch =~ /\d/){
|
1130 |
+
$n .= $ch;
|
1131 |
+
}
|
1132 |
+
|
1133 |
+
}
|
1134 |
+
|
1135 |
+
$v .= $n;
|
1136 |
+
|
1137 |
+
if ($v !~ /[.eE]/ and length $v > $max_intsize) {
|
1138 |
+
if ($allow_bigint) { # from Adam Sussman
|
1139 |
+
require Math::BigInt;
|
1140 |
+
return Math::BigInt->new($v);
|
1141 |
+
}
|
1142 |
+
else {
|
1143 |
+
return "$v";
|
1144 |
+
}
|
1145 |
+
}
|
1146 |
+
elsif ($allow_bigint) {
|
1147 |
+
require Math::BigFloat;
|
1148 |
+
return Math::BigFloat->new($v);
|
1149 |
+
}
|
1150 |
+
|
1151 |
+
return 0+$v;
|
1152 |
+
}
|
1153 |
+
|
1154 |
+
|
1155 |
+
sub is_valid_utf8 {
|
1156 |
+
|
1157 |
+
$utf8_len = $_[0] =~ /[\x00-\x7F]/ ? 1
|
1158 |
+
: $_[0] =~ /[\xC2-\xDF]/ ? 2
|
1159 |
+
: $_[0] =~ /[\xE0-\xEF]/ ? 3
|
1160 |
+
: $_[0] =~ /[\xF0-\xF4]/ ? 4
|
1161 |
+
: 0
|
1162 |
+
;
|
1163 |
+
|
1164 |
+
return unless $utf8_len;
|
1165 |
+
|
1166 |
+
my $is_valid_utf8 = substr($text, $at - 1, $utf8_len);
|
1167 |
+
|
1168 |
+
return ( $is_valid_utf8 =~ /^(?:
|
1169 |
+
[\x00-\x7F]
|
1170 |
+
|[\xC2-\xDF][\x80-\xBF]
|
1171 |
+
|[\xE0][\xA0-\xBF][\x80-\xBF]
|
1172 |
+
|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]
|
1173 |
+
|[\xED][\x80-\x9F][\x80-\xBF]
|
1174 |
+
|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]
|
1175 |
+
|[\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
|
1176 |
+
|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|
1177 |
+
|[\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
|
1178 |
+
)$/x ) ? $is_valid_utf8 : '';
|
1179 |
+
}
|
1180 |
+
|
1181 |
+
|
1182 |
+
sub decode_error {
|
1183 |
+
my $error = shift;
|
1184 |
+
my $no_rep = shift;
|
1185 |
+
my $str = defined $text ? substr($text, $at) : '';
|
1186 |
+
my $mess = '';
|
1187 |
+
my $type = $] >= 5.008 ? 'U*'
|
1188 |
+
: $] < 5.006 ? 'C*'
|
1189 |
+
: utf8::is_utf8( $str ) ? 'U*' # 5.6
|
1190 |
+
: 'C*'
|
1191 |
+
;
|
1192 |
+
|
1193 |
+
for my $c ( unpack( $type, $str ) ) { # emulate pv_uni_display() ?
|
1194 |
+
$mess .= $c == 0x07 ? '\a'
|
1195 |
+
: $c == 0x09 ? '\t'
|
1196 |
+
: $c == 0x0a ? '\n'
|
1197 |
+
: $c == 0x0d ? '\r'
|
1198 |
+
: $c == 0x0c ? '\f'
|
1199 |
+
: $c < 0x20 ? sprintf('\x{%x}', $c)
|
1200 |
+
: $c == 0x5c ? '\\\\'
|
1201 |
+
: $c < 0x80 ? chr($c)
|
1202 |
+
: sprintf('\x{%x}', $c)
|
1203 |
+
;
|
1204 |
+
if ( length $mess >= 20 ) {
|
1205 |
+
$mess .= '...';
|
1206 |
+
last;
|
1207 |
+
}
|
1208 |
+
}
|
1209 |
+
|
1210 |
+
unless ( length $mess ) {
|
1211 |
+
$mess = '(end of string)';
|
1212 |
+
}
|
1213 |
+
|
1214 |
+
Carp::croak (
|
1215 |
+
$no_rep ? "$error" : "$error, at character offset $at (before \"$mess\")"
|
1216 |
+
);
|
1217 |
+
|
1218 |
+
}
|
1219 |
+
|
1220 |
+
|
1221 |
+
sub _json_object_hook {
|
1222 |
+
my $o = $_[0];
|
1223 |
+
my @ks = keys %{$o};
|
1224 |
+
|
1225 |
+
if ( $cb_sk_object and @ks == 1 and exists $cb_sk_object->{ $ks[0] } and ref $cb_sk_object->{ $ks[0] } ) {
|
1226 |
+
my @val = $cb_sk_object->{ $ks[0] }->( $o->{$ks[0]} );
|
1227 |
+
if (@val == 1) {
|
1228 |
+
return $val[0];
|
1229 |
+
}
|
1230 |
+
}
|
1231 |
+
|
1232 |
+
my @val = $cb_object->($o) if ($cb_object);
|
1233 |
+
if (@val == 0 or @val > 1) {
|
1234 |
+
return $o;
|
1235 |
+
}
|
1236 |
+
else {
|
1237 |
+
return $val[0];
|
1238 |
+
}
|
1239 |
+
}
|
1240 |
+
|
1241 |
+
|
1242 |
+
sub PP_decode_box {
|
1243 |
+
{
|
1244 |
+
text => $text,
|
1245 |
+
at => $at,
|
1246 |
+
ch => $ch,
|
1247 |
+
len => $len,
|
1248 |
+
depth => $depth,
|
1249 |
+
encoding => $encoding,
|
1250 |
+
is_valid_utf8 => $is_valid_utf8,
|
1251 |
+
};
|
1252 |
+
}
|
1253 |
+
|
1254 |
+
} # PARSE
|
1255 |
+
|
1256 |
+
|
1257 |
+
sub _decode_surrogates { # from perlunicode
|
1258 |
+
my $uni = 0x10000 + (hex($_[0]) - 0xD800) * 0x400 + (hex($_[1]) - 0xDC00);
|
1259 |
+
my $un = pack('U*', $uni);
|
1260 |
+
utf8::encode( $un );
|
1261 |
+
return $un;
|
1262 |
+
}
|
1263 |
+
|
1264 |
+
|
1265 |
+
sub _decode_unicode {
|
1266 |
+
my $un = pack('U', hex shift);
|
1267 |
+
utf8::encode( $un );
|
1268 |
+
return $un;
|
1269 |
+
}
|
1270 |
+
|
1271 |
+
#
|
1272 |
+
# Setup for various Perl versions (the code from JSON::PP58)
|
1273 |
+
#
|
1274 |
+
|
1275 |
+
BEGIN {
|
1276 |
+
|
1277 |
+
unless ( defined &utf8::is_utf8 ) {
|
1278 |
+
require Encode;
|
1279 |
+
*utf8::is_utf8 = *Encode::is_utf8;
|
1280 |
+
}
|
1281 |
+
|
1282 |
+
if ( $] >= 5.008 ) {
|
1283 |
+
*JSON::PP::JSON_PP_encode_ascii = \&_encode_ascii;
|
1284 |
+
*JSON::PP::JSON_PP_encode_latin1 = \&_encode_latin1;
|
1285 |
+
*JSON::PP::JSON_PP_decode_surrogates = \&_decode_surrogates;
|
1286 |
+
*JSON::PP::JSON_PP_decode_unicode = \&_decode_unicode;
|
1287 |
+
}
|
1288 |
+
|
1289 |
+
if ($] >= 5.008 and $] < 5.008003) { # join() in 5.8.0 - 5.8.2 is broken.
|
1290 |
+
package # hide from PAUSE
|
1291 |
+
JSON::PP;
|
1292 |
+
require subs;
|
1293 |
+
subs->import('join');
|
1294 |
+
eval q|
|
1295 |
+
sub join {
|
1296 |
+
return '' if (@_ < 2);
|
1297 |
+
my $j = shift;
|
1298 |
+
my $str = shift;
|
1299 |
+
for (@_) { $str .= $j . $_; }
|
1300 |
+
return $str;
|
1301 |
+
}
|
1302 |
+
|;
|
1303 |
+
}
|
1304 |
+
|
1305 |
+
|
1306 |
+
sub JSON::PP::incr_parse {
|
1307 |
+
local $Carp::CarpLevel = 1;
|
1308 |
+
( $_[0]->{_incr_parser} ||= JSON::PP::IncrParser->new )->incr_parse( @_ );
|
1309 |
+
}
|
1310 |
+
|
1311 |
+
|
1312 |
+
sub JSON::PP::incr_skip {
|
1313 |
+
( $_[0]->{_incr_parser} ||= JSON::PP::IncrParser->new )->incr_skip;
|
1314 |
+
}
|
1315 |
+
|
1316 |
+
|
1317 |
+
sub JSON::PP::incr_reset {
|
1318 |
+
( $_[0]->{_incr_parser} ||= JSON::PP::IncrParser->new )->incr_reset;
|
1319 |
+
}
|
1320 |
+
|
1321 |
+
eval q{
|
1322 |
+
sub JSON::PP::incr_text : lvalue {
|
1323 |
+
$_[0]->{_incr_parser} ||= JSON::PP::IncrParser->new;
|
1324 |
+
|
1325 |
+
if ( $_[0]->{_incr_parser}->{incr_parsing} ) {
|
1326 |
+
Carp::croak("incr_text can not be called when the incremental parser already started parsing");
|
1327 |
+
}
|
1328 |
+
$_[0]->{_incr_parser}->{incr_text};
|
1329 |
+
}
|
1330 |
+
} if ( $] >= 5.006 );
|
1331 |
+
|
1332 |
+
} # Setup for various Perl versions (the code from JSON::PP58)
|
1333 |
+
|
1334 |
+
|
1335 |
+
###############################
|
1336 |
+
# Utilities
|
1337 |
+
#
|
1338 |
+
|
1339 |
+
BEGIN {
|
1340 |
+
eval 'require Scalar::Util';
|
1341 |
+
unless($@){
|
1342 |
+
*JSON::PP::blessed = \&Scalar::Util::blessed;
|
1343 |
+
*JSON::PP::reftype = \&Scalar::Util::reftype;
|
1344 |
+
*JSON::PP::refaddr = \&Scalar::Util::refaddr;
|
1345 |
+
}
|
1346 |
+
else{ # This code is from Scalar::Util.
|
1347 |
+
# warn $@;
|
1348 |
+
eval 'sub UNIVERSAL::a_sub_not_likely_to_be_here { ref($_[0]) }';
|
1349 |
+
*JSON::PP::blessed = sub {
|
1350 |
+
local($@, $SIG{__DIE__}, $SIG{__WARN__});
|
1351 |
+
ref($_[0]) ? eval { $_[0]->a_sub_not_likely_to_be_here } : undef;
|
1352 |
+
};
|
1353 |
+
my %tmap = qw(
|
1354 |
+
B::NULL SCALAR
|
1355 |
+
B::HV HASH
|
1356 |
+
B::AV ARRAY
|
1357 |
+
B::CV CODE
|
1358 |
+
B::IO IO
|
1359 |
+
B::GV GLOB
|
1360 |
+
B::REGEXP REGEXP
|
1361 |
+
);
|
1362 |
+
*JSON::PP::reftype = sub {
|
1363 |
+
my $r = shift;
|
1364 |
+
|
1365 |
+
return undef unless length(ref($r));
|
1366 |
+
|
1367 |
+
my $t = ref(B::svref_2object($r));
|
1368 |
+
|
1369 |
+
return
|
1370 |
+
exists $tmap{$t} ? $tmap{$t}
|
1371 |
+
: length(ref($$r)) ? 'REF'
|
1372 |
+
: 'SCALAR';
|
1373 |
+
};
|
1374 |
+
*JSON::PP::refaddr = sub {
|
1375 |
+
return undef unless length(ref($_[0]));
|
1376 |
+
|
1377 |
+
my $addr;
|
1378 |
+
if(defined(my $pkg = blessed($_[0]))) {
|
1379 |
+
$addr .= bless $_[0], 'Scalar::Util::Fake';
|
1380 |
+
bless $_[0], $pkg;
|
1381 |
+
}
|
1382 |
+
else {
|
1383 |
+
$addr .= $_[0]
|
1384 |
+
}
|
1385 |
+
|
1386 |
+
$addr =~ /0x(\w+)/;
|
1387 |
+
local $^W;
|
1388 |
+
#no warnings 'portable';
|
1389 |
+
hex($1);
|
1390 |
+
}
|
1391 |
+
}
|
1392 |
+
}
|
1393 |
+
|
1394 |
+
|
1395 |
+
# shamelessly copied and modified from JSON::XS code.
|
1396 |
+
|
1397 |
+
unless ( $INC{'JSON/PP.pm'} ) {
|
1398 |
+
eval q|
|
1399 |
+
package
|
1400 |
+
JSON::PP::Boolean;
|
1401 |
+
|
1402 |
+
use overload (
|
1403 |
+
"0+" => sub { ${$_[0]} },
|
1404 |
+
"++" => sub { $_[0] = ${$_[0]} + 1 },
|
1405 |
+
"--" => sub { $_[0] = ${$_[0]} - 1 },
|
1406 |
+
fallback => 1,
|
1407 |
+
);
|
1408 |
+
|;
|
1409 |
+
}
|
1410 |
+
|
1411 |
+
$JSON::PP::true = do { bless \(my $dummy = 1), "JSON::PP::Boolean" };
|
1412 |
+
$JSON::PP::false = do { bless \(my $dummy = 0), "JSON::PP::Boolean" };
|
1413 |
+
|
1414 |
+
sub is_bool { defined $_[0] and UNIVERSAL::isa($_[0], "JSON::PP::Boolean"); }
|
1415 |
+
|
1416 |
+
sub true { $JSON::PP::true }
|
1417 |
+
sub false { $JSON::PP::false }
|
1418 |
+
sub null { undef; }
|
1419 |
+
|
1420 |
+
###############################
|
1421 |
+
|
1422 |
+
###############################
|
1423 |
+
|
1424 |
+
package # hide from PAUSE
|
1425 |
+
JSON::PP::IncrParser;
|
1426 |
+
|
1427 |
+
use strict;
|
1428 |
+
|
1429 |
+
use constant INCR_M_WS => 0; # initial whitespace skipping
|
1430 |
+
use constant INCR_M_STR => 1; # inside string
|
1431 |
+
use constant INCR_M_BS => 2; # inside backslash
|
1432 |
+
use constant INCR_M_JSON => 3; # outside anything, count nesting
|
1433 |
+
use constant INCR_M_C0 => 4;
|
1434 |
+
use constant INCR_M_C1 => 5;
|
1435 |
+
|
1436 |
+
use vars qw($VERSION);
|
1437 |
+
$VERSION = '1.01';
|
1438 |
+
|
1439 |
+
my $unpack_format = $] < 5.006 ? 'C*' : 'U*';
|
1440 |
+
|
1441 |
+
sub new {
|
1442 |
+
my ( $class ) = @_;
|
1443 |
+
|
1444 |
+
bless {
|
1445 |
+
incr_nest => 0,
|
1446 |
+
incr_text => undef,
|
1447 |
+
incr_parsing => 0,
|
1448 |
+
incr_p => 0,
|
1449 |
+
}, $class;
|
1450 |
+
}
|
1451 |
+
|
1452 |
+
|
1453 |
+
sub incr_parse {
|
1454 |
+
my ( $self, $coder, $text ) = @_;
|
1455 |
+
|
1456 |
+
$self->{incr_text} = '' unless ( defined $self->{incr_text} );
|
1457 |
+
|
1458 |
+
if ( defined $text ) {
|
1459 |
+
if ( utf8::is_utf8( $text ) and !utf8::is_utf8( $self->{incr_text} ) ) {
|
1460 |
+
utf8::upgrade( $self->{incr_text} ) ;
|
1461 |
+
utf8::decode( $self->{incr_text} ) ;
|
1462 |
+
}
|
1463 |
+
$self->{incr_text} .= $text;
|
1464 |
+
}
|
1465 |
+
|
1466 |
+
|
1467 |
+
my $max_size = $coder->get_max_size;
|
1468 |
+
|
1469 |
+
if ( defined wantarray ) {
|
1470 |
+
|
1471 |
+
$self->{incr_mode} = INCR_M_WS unless defined $self->{incr_mode};
|
1472 |
+
|
1473 |
+
if ( wantarray ) {
|
1474 |
+
my @ret;
|
1475 |
+
|
1476 |
+
$self->{incr_parsing} = 1;
|
1477 |
+
|
1478 |
+
do {
|
1479 |
+
push @ret, $self->_incr_parse( $coder, $self->{incr_text} );
|
1480 |
+
|
1481 |
+
unless ( !$self->{incr_nest} and $self->{incr_mode} == INCR_M_JSON ) {
|
1482 |
+
$self->{incr_mode} = INCR_M_WS if $self->{incr_mode} != INCR_M_STR;
|
1483 |
+
}
|
1484 |
+
|
1485 |
+
} until ( length $self->{incr_text} >= $self->{incr_p} );
|
1486 |
+
|
1487 |
+
$self->{incr_parsing} = 0;
|
1488 |
+
|
1489 |
+
return @ret;
|
1490 |
+
}
|
1491 |
+
else { # in scalar context
|
1492 |
+
$self->{incr_parsing} = 1;
|
1493 |
+
my $obj = $self->_incr_parse( $coder, $self->{incr_text} );
|
1494 |
+
$self->{incr_parsing} = 0 if defined $obj; # pointed by Martin J. Evans
|
1495 |
+
return $obj ? $obj : undef; # $obj is an empty string, parsing was completed.
|
1496 |
+
}
|
1497 |
+
|
1498 |
+
}
|
1499 |
+
|
1500 |
+
}
|
1501 |
+
|
1502 |
+
|
1503 |
+
sub _incr_parse {
|
1504 |
+
my ( $self, $coder, $text, $skip ) = @_;
|
1505 |
+
my $p = $self->{incr_p};
|
1506 |
+
my $restore = $p;
|
1507 |
+
|
1508 |
+
my @obj;
|
1509 |
+
my $len = length $text;
|
1510 |
+
|
1511 |
+
if ( $self->{incr_mode} == INCR_M_WS ) {
|
1512 |
+
while ( $len > $p ) {
|
1513 |
+
my $s = substr( $text, $p, 1 );
|
1514 |
+
$p++ and next if ( 0x20 >= unpack($unpack_format, $s) );
|
1515 |
+
$self->{incr_mode} = INCR_M_JSON;
|
1516 |
+
last;
|
1517 |
+
}
|
1518 |
+
}
|
1519 |
+
|
1520 |
+
while ( $len > $p ) {
|
1521 |
+
my $s = substr( $text, $p++, 1 );
|
1522 |
+
|
1523 |
+
if ( $s eq '"' ) {
|
1524 |
+
if (substr( $text, $p - 2, 1 ) eq '\\' ) {
|
1525 |
+
next;
|
1526 |
+
}
|
1527 |
+
|
1528 |
+
if ( $self->{incr_mode} != INCR_M_STR ) {
|
1529 |
+
$self->{incr_mode} = INCR_M_STR;
|
1530 |
+
}
|
1531 |
+
else {
|
1532 |
+
$self->{incr_mode} = INCR_M_JSON;
|
1533 |
+
unless ( $self->{incr_nest} ) {
|
1534 |
+
last;
|
1535 |
+
}
|
1536 |
+
}
|
1537 |
+
}
|
1538 |
+
|
1539 |
+
if ( $self->{incr_mode} == INCR_M_JSON ) {
|
1540 |
+
|
1541 |
+
if ( $s eq '[' or $s eq '{' ) {
|
1542 |
+
if ( ++$self->{incr_nest} > $coder->get_max_depth ) {
|
1543 |
+
Carp::croak('json text or perl structure exceeds maximum nesting level (max_depth set too low?)');
|
1544 |
+
}
|
1545 |
+
}
|
1546 |
+
elsif ( $s eq ']' or $s eq '}' ) {
|
1547 |
+
last if ( --$self->{incr_nest} <= 0 );
|
1548 |
+
}
|
1549 |
+
elsif ( $s eq '#' ) {
|
1550 |
+
while ( $len > $p ) {
|
1551 |
+
last if substr( $text, $p++, 1 ) eq "\n";
|
1552 |
+
}
|
1553 |
+
}
|
1554 |
+
|
1555 |
+
}
|
1556 |
+
|
1557 |
+
}
|
1558 |
+
|
1559 |
+
$self->{incr_p} = $p;
|
1560 |
+
|
1561 |
+
return if ( $self->{incr_mode} == INCR_M_STR and not $self->{incr_nest} );
|
1562 |
+
return if ( $self->{incr_mode} == INCR_M_JSON and $self->{incr_nest} > 0 );
|
1563 |
+
|
1564 |
+
return '' unless ( length substr( $self->{incr_text}, 0, $p ) );
|
1565 |
+
|
1566 |
+
local $Carp::CarpLevel = 2;
|
1567 |
+
|
1568 |
+
$self->{incr_p} = $restore;
|
1569 |
+
$self->{incr_c} = $p;
|
1570 |
+
|
1571 |
+
my ( $obj, $tail ) = $coder->PP_decode_json( substr( $self->{incr_text}, 0, $p ), 0x10000001 );
|
1572 |
+
|
1573 |
+
$self->{incr_text} = substr( $self->{incr_text}, $p );
|
1574 |
+
$self->{incr_p} = 0;
|
1575 |
+
|
1576 |
+
return $obj || '';
|
1577 |
+
}
|
1578 |
+
|
1579 |
+
|
1580 |
+
sub incr_text {
|
1581 |
+
if ( $_[0]->{incr_parsing} ) {
|
1582 |
+
Carp::croak("incr_text can not be called when the incremental parser already started parsing");
|
1583 |
+
}
|
1584 |
+
$_[0]->{incr_text};
|
1585 |
+
}
|
1586 |
+
|
1587 |
+
|
1588 |
+
sub incr_skip {
|
1589 |
+
my $self = shift;
|
1590 |
+
$self->{incr_text} = substr( $self->{incr_text}, $self->{incr_c} );
|
1591 |
+
$self->{incr_p} = 0;
|
1592 |
+
}
|
1593 |
+
|
1594 |
+
|
1595 |
+
sub incr_reset {
|
1596 |
+
my $self = shift;
|
1597 |
+
$self->{incr_text} = undef;
|
1598 |
+
$self->{incr_p} = 0;
|
1599 |
+
$self->{incr_mode} = 0;
|
1600 |
+
$self->{incr_nest} = 0;
|
1601 |
+
$self->{incr_parsing} = 0;
|
1602 |
+
}
|
1603 |
+
|
1604 |
+
###############################
|
1605 |
+
|
1606 |
+
|
1607 |
+
1;
|
1608 |
+
__END__
|
1609 |
+
=pod
|
1610 |
+
|
1611 |
+
=head1 NAME
|
1612 |
+
|
1613 |
+
JSON::PP - JSON::XS compatible pure-Perl module.
|
1614 |
+
|
1615 |
+
=head1 SYNOPSIS
|
1616 |
+
|
1617 |
+
use JSON::PP;
|
1618 |
+
|
1619 |
+
# exported functions, they croak on error
|
1620 |
+
# and expect/generate UTF-8
|
1621 |
+
|
1622 |
+
$utf8_encoded_json_text = encode_json $perl_hash_or_arrayref;
|
1623 |
+
$perl_hash_or_arrayref = decode_json $utf8_encoded_json_text;
|
1624 |
+
|
1625 |
+
# OO-interface
|
1626 |
+
|
1627 |
+
$json = JSON::PP->new->ascii->pretty->allow_nonref;
|
1628 |
+
|
1629 |
+
$json_text = $json->encode( $perl_scalar );
|
1630 |
+
$perl_scalar = $json->decode( $json_text );
|
1631 |
+
|
1632 |
+
$pretty_printed = $json->pretty->encode( $perl_scalar ); # pretty-printing
|
1633 |
+
|
1634 |
+
# Note that JSON version 2.0 and above will automatically use
|
1635 |
+
# JSON::XS or JSON::PP, so you should be able to just:
|
1636 |
+
|
1637 |
+
use JSON;
|
1638 |
+
|
1639 |
+
|
1640 |
+
=head1 VERSION
|
1641 |
+
|
1642 |
+
2.27200
|
1643 |
+
|
1644 |
+
L<JSON::XS> 2.27 (~2.30) compatible.
|
1645 |
+
|
1646 |
+
=head1 DESCRIPTION
|
1647 |
+
|
1648 |
+
This module is a L<JSON::XS>-compatible pure-Perl module.
|
1649 |
+
(Perl 5.8 or later is recommended)
|
1650 |
+
|
1651 |
+
JSON::XS is the fastest and most proper JSON module on CPAN.
|
1652 |
+
It is written by Marc Lehmann in C, so it must be compiled and
|
1653 |
+
installed in the environment where it is used.
|
1654 |
+
|
1655 |
+
JSON::PP is a pure-Perl module that is compatible with JSON::XS.
|
1656 |
+
|
1657 |
+
|
1658 |
+
=head2 FEATURES
|
1659 |
+
|
1660 |
+
=over
|
1661 |
+
|
1662 |
+
=item * correct unicode handling
|
1663 |
+
|
1664 |
+
This module knows how to handle Unicode (depending on Perl version).
|
1665 |
+
|
1666 |
+
See L<JSON::XS/A FEW NOTES ON UNICODE AND PERL> and
|
1667 |
+
L<UNICODE HANDLING ON PERLS>.
|
1668 |
+
|
1669 |
+
|
1670 |
+
=item * round-trip integrity
|
1671 |
+
|
1672 |
+
When you serialise a perl data structure using only data types
|
1673 |
+
supported by JSON and Perl, the deserialised data structure is
|
1674 |
+
identical on the Perl level. (e.g. the string "2.0" doesn't suddenly
|
1675 |
+
become "2" just because it looks like a number). There I<are> minor
|
1676 |
+
exceptions to this, read the MAPPING section below to learn about
|
1677 |
+
those.
|
1678 |
+
|
1679 |
+
|
1680 |
+
=item * strict checking of JSON correctness
|
1681 |
+
|
1682 |
+
There is no guessing, no generating of illegal JSON texts by default,
|
1683 |
+
and only JSON is accepted as input by default (the latter is a
|
1684 |
+
security feature). But when some options are set, loose checking
|
1685 |
+
features are available.
|
1686 |
+
|
1687 |
+
=back
|
1688 |
+
|
1689 |
+
=head1 FUNCTIONAL INTERFACE
|
1690 |
+
|
1691 |
+
Some of this documentation is copied and modified from L<JSON::XS/FUNCTIONAL INTERFACE>.
|
1692 |
+
|
1693 |
+
=head2 encode_json
|
1694 |
+
|
1695 |
+
$json_text = encode_json $perl_scalar
|
1696 |
+
|
1697 |
+
Converts the given Perl data structure to a UTF-8 encoded, binary string.
|
1698 |
+
|
1699 |
+
This function call is functionally identical to:
|
1700 |
+
|
1701 |
+
$json_text = JSON::PP->new->utf8->encode($perl_scalar)
|
1702 |
+
|
1703 |
+
=head2 decode_json
|
1704 |
+
|
1705 |
+
$perl_scalar = decode_json $json_text
|
1706 |
+
|
1707 |
+
The opposite of C<encode_json>: expects a UTF-8 (binary) string and tries
|
1708 |
+
to parse that as a UTF-8 encoded JSON text, returning the resulting
|
1709 |
+
reference.
|
1710 |
+
|
1711 |
+
This function call is functionally identical to:
|
1712 |
+
|
1713 |
+
$perl_scalar = JSON::PP->new->utf8->decode($json_text)
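
The two exported functions can be tried together in a short round trip. This is a minimal sketch; the hash contents (C<name>, C<scripts>) are made-up sample data, not anything defined by the module.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical sample data, for illustration only.
my $data = { name => 'uroman', scripts => [ 'Latin', 'Arabic' ] };

# encode_json returns a UTF-8 encoded byte string.
my $json_text = encode_json($data);

# decode_json expects UTF-8 bytes and returns a reference.
my $copy = decode_json($json_text);

print $copy->{name}, "\n";
print scalar @{ $copy->{scripts} }, "\n";
```

Both functions croak on error, so wrap the calls in C<eval> if malformed input is expected.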
|
1714 |
+
|
1715 |
+
=head2 JSON::PP::is_bool
|
1716 |
+
|
1717 |
+
$is_boolean = JSON::PP::is_bool($scalar)
|
1718 |
+
|
1719 |
+
Returns true if the passed scalar represents either JSON::PP::true or
|
1720 |
+
JSON::PP::false, two constants that act like C<1> and C<0> respectively
|
1721 |
+
and are also used to represent JSON C<true> and C<false> in Perl strings.
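
As a rough illustration of C<JSON::PP::is_bool>, the sketch below decodes a small JSON object and reports which values are booleans. The key names (C<ok>, C<retry>, C<count>) are invented for the example.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical input: two JSON booleans and one plain number.
my $decoded = decode_json('{"ok":true,"retry":false,"count":1}');

for my $key ( sort keys %$decoded ) {
    my $value = $decoded->{$key};
    my $kind  = JSON::PP::is_bool($value) ? 'boolean' : 'not boolean';
    print "$key: $kind\n";
}

# The boolean objects behave like 1 and 0 in conditions.
print "ok is true\n"     if $decoded->{ok};
print "retry is false\n" if !$decoded->{retry};
```

Note that the number C<1> is not reported as a boolean, even though it is true in Perl.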
|
1722 |
+
|
1723 |
+
=head2 JSON::PP::true
|
1724 |
+
|
1725 |
+
Returns the JSON true value, which is a blessed object.
|
1726 |
+
It C<isa> JSON::PP::Boolean.
|
1727 |
+
|
1728 |
+
=head2 JSON::PP::false
|
1729 |
+
|
1730 |
+
Returns the JSON false value, which is a blessed object.
|
1731 |
+
It C<isa> JSON::PP::Boolean.
|
1732 |
+
|
1733 |
+
=head2 JSON::PP::null
|
1734 |
+
|
1735 |
+
Returns C<undef>.
|
1736 |
+
|
1737 |
+
See L<MAPPING>, below, for more information on how JSON values are mapped to
|
1738 |
+
Perl.
|
1739 |
+
|
1740 |
+
|
1741 |
+
=head1 HOW DO I DECODE A DATA FROM OUTER AND ENCODE TO OUTER
|
1742 |
+
|
1743 |
+
This section assumes that your Perl version is 5.8 or later.
|
1744 |
+
|
1745 |
+
If you know that JSON text from an external source (a network, a file, and so on)
|
1746 |
+
is encoded in UTF-8, you should use C<decode_json> or a C<JSON> module object
|
1747 |
+
with C<utf8> enabled. The decoded result will contain Unicode characters.
|
1748 |
+
|
1749 |
+
# from network
|
1750 |
+
my $json = JSON::PP->new->utf8;
|
1751 |
+
my $json_text = CGI->new->param( 'json_data' );
|
1752 |
+
my $perl_scalar = $json->decode( $json_text );
|
1753 |
+
|
1754 |
+
# from file content
|
1755 |
+
local $/;
|
1756 |
+
open( my $fh, '<', 'json.data' );
|
1757 |
+
$json_text = <$fh>;
|
1758 |
+
$perl_scalar = decode_json( $json_text );
|
1759 |
+
|
1760 |
+
If the external data is not encoded in UTF-8, you should first C<decode> it.
|
1761 |
+
|
1762 |
+
use Encode;
|
1763 |
+
local $/;
|
1764 |
+
open( my $fh, '<', 'json.data' );
|
1765 |
+
my $encoding = 'cp932';
|
1766 |
+
my $unicode_json_text = decode( $encoding, <$fh> ); # UNICODE
|
1767 |
+
|
1768 |
+
# or you can write the below code.
|
1769 |
+
#
|
1770 |
+
# open( my $fh, "<:encoding($encoding)", 'json.data' );
|
1771 |
+
# $unicode_json_text = <$fh>;
|
1772 |
+
|
1773 |
+
In this case, C<$unicode_json_text> is of course a Unicode string.
|
1774 |
+
So you B<cannot> use C<decode_json> or a C<JSON> module object with C<utf8> enabled.
|
1775 |
+
Instead, use a C<JSON> module object with C<utf8> disabled.
|
1776 |
+
|
1777 |
+
$perl_scalar = $json->utf8(0)->decode( $unicode_json_text );
|
1778 |
+
|
1779 |
+
Or C<encode 'utf8'> and C<decode_json>:
|
1780 |
+
|
1781 |
+
$perl_scalar = decode_json( encode( 'utf8', $unicode_json_text ) );
|
1782 |
+
# this way is not efficient.
|
1783 |
+
|
1784 |
+
Now suppose you want to convert your C<$perl_scalar> into JSON data and
|
1785 |
+
send it to the outside world (a network, a file, and so on).
|
1786 |
+
|
1787 |
+
If your data contains Unicode strings and you want the converted data to be encoded
|
1788 |
+
in UTF-8, you should use C<encode_json> or a C<JSON> module object with C<utf8> enabled.
|
1789 |
+
|
1790 |
+
print encode_json( $perl_scalar ); # to a network? file? or display?
|
1791 |
+
# or
|
1792 |
+
print $json->utf8->encode( $perl_scalar );
|
1793 |
+
|
1794 |
+
If C<$perl_scalar> contains C<$encoding>-encoded strings rather than Unicode strings
|
1795 |
+
for some reason, then Perl regards its characters as B<latin1>
|
1796 |
+
(because Perl knows nothing about your $encoding).
|
1797 |
+
You B<cannot> use C<encode_json> or a C<JSON> module object with C<utf8> enabled.
|
1798 |
+
Instead, use a C<JSON> module object with C<utf8> disabled.
|
1799 |
+
Note that the resulting text is a Unicode string, but it can be printed without problems.
|
1800 |
+
|
1801 |
+
# $perl_scalar contains $encoding encoded string values
|
1802 |
+
$unicode_json_text = $json->utf8(0)->encode( $perl_scalar );
|
1803 |
+
# $unicode_json_text consists of characters less than 0x100
|
1804 |
+
print $unicode_json_text;
|
1805 |
+
|
1806 |
+
Or C<decode $encoding> all string values and C<encode_json>:
|
1807 |
+
|
1808 |
+
$perl_scalar->{ foo } = decode( $encoding, $perl_scalar->{ foo } );
|
1809 |
+
# ... do this for each string value, then encode_json
|
1810 |
+
$json_text = encode_json( $perl_scalar );
|
1811 |
+
|
1812 |
+
This is the proper way, but it is probably not efficient.
|
1813 |
+
|
1814 |
+
See L<Encode> and L<perluniintro>.
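
The decode-then-parse path described in this section can be sketched end to end without touching a file. The example below assumes the input arrives as ISO-8859-1 bytes; the byte string and the C<city> key are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;

# Hypothetical input: {"city":"Zürich"} as ISO-8859-1 bytes (0xFC is "ü").
my $latin1_bytes = qq({"city":"Z\xFCrich"});

# Step 1: turn the bytes into a Perl character (Unicode) string.
my $unicode_json_text = decode( 'iso-8859-1', $latin1_bytes );

# Step 2: parse with utf8 disabled, because the text is already decoded.
my $perl_scalar = JSON::PP->new->utf8(0)->decode($unicode_json_text);

# Step 3: re-encode as UTF-8 bytes for output.
my $utf8_json = encode_json($perl_scalar);

print length( $perl_scalar->{city} ), "\n";    # 6 characters
print length($utf8_json), "\n";                # 18 bytes ("ü" takes two)
```

Passing C<$latin1_bytes> straight to C<decode_json> would croak, because a lone 0xFC byte is not valid UTF-8.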
|
1815 |
+
|
1816 |
+
|
1817 |
+
=head1 METHODS
|
1818 |
+
|
1819 |
+
For details, see L<JSON> or L<JSON::XS>.
|
1820 |
+
|
1821 |
+
=head2 new
|
1822 |
+
|
1823 |
+
$json = JSON::PP->new
|
1824 |
+
|
1825 |
+
Returns a new JSON::PP object that can be used to de/encode JSON
|
1826 |
+
strings.
|
1827 |
+
|
1828 |
+
All boolean flags described below are by default I<disabled>.
|
1829 |
+
|
1830 |
+
The mutators for flags all return the JSON object again and thus calls can
|
1831 |
+
be chained:
|
1832 |
+
|
1833 |
+
my $json = JSON::PP->new->utf8->space_after->encode({a => [1,2]})
|
1834 |
+
=> {"a": [1, 2]}
|
1835 |
+
|
1836 |
+
=head2 ascii
|
1837 |
+
|
1838 |
+
$json = $json->ascii([$enable])
|
1839 |
+
|
1840 |
+
$enabled = $json->get_ascii
|
1841 |
+
|
1842 |
+
If $enable is true (or missing), then the encode method will not generate characters outside
|
1843 |
+
the code range 0..127. Any Unicode characters outside that range will be escaped using either
|
1844 |
+
a single \uXXXX or a double \uHHHH\uLLLL escape sequence, as per RFC4627.
|
1845 |
+
(See L<JSON::XS/OBJECT-ORIENTED INTERFACE>.)
|
1846 |
+
|
1847 |
+
In Perl 5.005, there are no characters with values above 255.
|
1848 |
+
See L<UNICODE HANDLING ON PERLS>.
|
1849 |
+
|
1850 |
+
If $enable is false, then the encode method will not escape Unicode characters unless
|
1851 |
+
required by the JSON syntax or other flags. This results in a faster and more compact format.
|
1852 |
+
|
1853 |
+
JSON::PP->new->ascii(1)->encode([chr 0x10401])
|
1854 |
+
=> ["\ud801\udc01"]
|
1855 |
+
|
1856 |
+
=head2 latin1
|
1857 |
+
|
1858 |
+
$json = $json->latin1([$enable])
|
1859 |
+
|
1860 |
+
$enabled = $json->get_latin1
|
1861 |
+
|
1862 |
+
If $enable is true (or missing), then the encode method will encode the resulting JSON
|
1863 |
+
text as latin1 (or iso-8859-1), escaping any characters outside the code range 0..255.
|
1864 |
+
|
1865 |
+
If $enable is false, then the encode method will not escape Unicode characters
|
1866 |
+
unless required by the JSON syntax or other flags.
|
1867 |
+
|
1868 |
+
JSON::PP->new->latin1->encode (["\x{89}\x{abc}"])
|
1869 |
+
=> ["\x{89}\\u0abc"] # (perl syntax, U+abc escaped, U+89 not)
|
1870 |
+
|
1871 |
+
See L<UNICODE HANDLING ON PERLS>.
|
1872 |
+
|
1873 |
+
=head2 utf8
|
1874 |
+
|
1875 |
+
$json = $json->utf8([$enable])
|
1876 |
+
|
1877 |
+
$enabled = $json->get_utf8
|
1878 |
+
|
1879 |
+
If $enable is true (or missing), then the encode method will encode the JSON result
|
1880 |
+
into UTF-8, as required by many protocols, while the decode method expects to be handed
|
1881 |
+
a UTF-8-encoded string. Please note that UTF-8-encoded strings do not contain any
|
1882 |
+
characters outside the range 0..255, they are thus useful for bytewise/binary I/O.
|
1883 |
+
|
1884 |
+
(In Perl 5.005, any character outside the range 0..255 does not exist.
|
1885 |
+
See L<UNICODE HANDLING ON PERLS>.)
|
1886 |
+
|
1887 |
+
In future versions, enabling this option might enable autodetection of the UTF-16 and UTF-32
|
1888 |
+
encoding families, as described in RFC4627.
|
1889 |
+
|
1890 |
+
If $enable is false, then the encode method will return the JSON string as a (non-encoded)
|
1891 |
+
Unicode string, while decode thus expects a Unicode string. Any decoding or encoding
|
1892 |
+
(e.g. to UTF-8 or UTF-16) must be done by you, e.g. using the Encode module.
|
1893 |
+
|
1894 |
+
Example, output UTF-16BE-encoded JSON:
|
1895 |
+
|
1896 |
+
use Encode;
|
1897 |
+
$jsontext = encode "UTF-16BE", JSON::PP->new->encode ($object);
|
1898 |
+
|
1899 |
+
Example, decode UTF-32LE-encoded JSON:
|
1900 |
+
|
1901 |
+
use Encode;
|
1902 |
+
$object = JSON::PP->new->decode (decode "UTF-32LE", $jsontext);
|
1903 |
+
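The difference between the two settings can be sketched as follows. This is an illustration only; the variable names are made up here, and the sample character is arbitrary.

```perl
use strict;
use warnings;
use JSON::PP;

my $smiley = "\x{263a}";    # one character, three bytes in UTF-8

# With utf8 enabled, encode returns bytes ready for binary I/O.
my $byte_text = JSON::PP->new->utf8->encode([$smiley]);

# With utf8 disabled, encode returns a Perl character string.
my $char_text = JSON::PP->new->encode([$smiley]);

printf "%d bytes vs. %d characters\n", length($byte_text), length($char_text);

# decode expects input in the form matching its own utf8 setting.
my $from_bytes = JSON::PP->new->utf8->decode($byte_text);
my $from_chars = JSON::PP->new->decode($char_text);
```

Both decoders recover the same one-character string; only the form of the JSON text differs.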
|
1904 |
+
|
1905 |
+
=head2 pretty
|
1906 |
+
|
1907 |
+
$json = $json->pretty([$enable])
|
1908 |
+
|
1909 |
+
This enables (or disables) all of the C<indent>, C<space_before> and
|
1910 |
+
C<space_after> flags in one call to generate the most readable
|
1911 |
+
(or most compact) form possible.
|
1912 |
+
|
1913 |
+
Equivalent to:
|
1914 |
+
|
1915 |
+
$json->indent->space_before->space_after
|
1916 |
+
|
1917 |
+
=head2 indent
|
1918 |
+
|
1919 |
+
$json = $json->indent([$enable])
|
1920 |
+
|
1921 |
+
$enabled = $json->get_indent
|
1922 |
+
|
1923 |
+
The default indent space length is three.
|
1924 |
+
You can use C<indent_length> to change the length.
|
1925 |
+
|
1926 |
+
=head2 space_before
|
1927 |
+
|
1928 |
+
$json = $json->space_before([$enable])
|
1929 |
+
|
1930 |
+
$enabled = $json->get_space_before
|
1931 |
+
|
1932 |
+
If C<$enable> is true (or missing), then the C<encode> method will add an extra
|
1933 |
+
optional space before the C<:> separating keys from values in JSON objects.
|
1934 |
+
|
1935 |
+
If C<$enable> is false, then the C<encode> method will not add any extra
|
1936 |
+
space at those places.
|
1937 |
+
|
1938 |
+
This setting has no effect when decoding JSON texts.
|
1939 |
+
|
1940 |
+
Example, space_before enabled, space_after and indent disabled:
|
1941 |
+
|
1942 |
+
{"key" :"value"}
|
1943 |
+
|
1944 |
+
=head2 space_after
|
1945 |
+
|
1946 |
+
$json = $json->space_after([$enable])
|
1947 |
+
|
1948 |
+
$enabled = $json->get_space_after
|
1949 |
+
|
1950 |
+
If C<$enable> is true (or missing), then the C<encode> method will add an extra
|
1951 |
+
optional space after the C<:> separating keys from values in JSON objects
|
1952 |
+
and extra whitespace after the C<,> separating key-value pairs and array
|
1953 |
+
members.
|
1954 |
+
|
1955 |
+
If C<$enable> is false, then the C<encode> method will not add any extra
|
1956 |
+
space at those places.
|
1957 |
+
|
1958 |
+
This setting has no effect when decoding JSON texts.
|
1959 |
+
|
1960 |
+
Example, space_before and indent disabled, space_after enabled:
|
1961 |
+
|
1962 |
+
{"key": "value"}
|
1963 |
+
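The two documented examples can be reproduced with a short script. This is a sketch; the hash and variable names are illustrative only.

```perl
use strict;
use warnings;
use JSON::PP;

my $pair = { key => "value" };

# Only space_before: a space is added before the colon.
my $before = JSON::PP->new->space_before->encode($pair);

# Only space_after: a space is added after the colon.
my $after = JSON::PP->new->space_after->encode($pair);

print "$before\n$after\n";
```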
|
1964 |
+
=head2 relaxed
|
1965 |
+
|
1966 |
+
$json = $json->relaxed([$enable])
|
1967 |
+
|
1968 |
+
$enabled = $json->get_relaxed
|
1969 |
+
|
1970 |
+
If C<$enable> is true (or missing), then C<decode> will accept some
|
1971 |
+
extensions to normal JSON syntax (see below). C<encode> will not be
|
1972 |
+
affected in any way. I<Be aware that this option makes you accept invalid
|
1973 |
+
JSON texts as if they were valid!>. I suggest using this option only to
|
1974 |
+
parse application-specific files written by humans (configuration files,
|
1975 |
+
resource files etc.)
|
1976 |
+
|
1977 |
+
If C<$enable> is false (the default), then C<decode> will only accept
|
1978 |
+
valid JSON texts.
|
1979 |
+
|
1980 |
+
Currently accepted extensions are:
|
1981 |
+
|
1982 |
+
=over 4
|
1983 |
+
|
1984 |
+
=item * list items can have an end-comma
|
1985 |
+
|
1986 |
+
JSON I<separates> array elements and key-value pairs with commas. This
|
1987 |
+
can be annoying if you write JSON texts manually and want to be able to
|
1988 |
+
quickly append elements, so this extension accepts a comma at the end of
|
1989 |
+
such items not just between them:
|
1990 |
+
|
1991 |
+
[
|
1992 |
+
1,
|
1993 |
+
2, <- this comma not normally allowed
|
1994 |
+
]
|
1995 |
+
{
|
1996 |
+
"k1": "v1",
|
1997 |
+
"k2": "v2", <- this comma not normally allowed
|
1998 |
+
}
|
1999 |
+
|
2000 |
+
=item * shell-style '#'-comments
|
2001 |
+
|
2002 |
+
Wherever JSON allows whitespace, shell-style comments are additionally
|
2003 |
+
allowed. They are terminated by the first carriage-return or line-feed
|
2004 |
+
character, after which more white-space and comments are allowed.
|
2005 |
+
|
2006 |
+
[
|
2007 |
+
1, # this comment not allowed in JSON
|
2008 |
+
# neither this one...
|
2009 |
+
]
|
2010 |
+
|
2011 |
+
=back
|
2012 |
+
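Both extensions can be tried together. The following is a minimal sketch; the heredoc content stands in for a hand-written configuration file and is not taken from any real application.

```perl
use strict;
use warnings;
use JSON::PP;

my $config_text = <<'END_CONFIG';
[
  1,   # shell-style comment
  2,   # a trailing comma follows this element
]
END_CONFIG

# relaxed accepts the comments and the trailing comma.
my $relaxed = JSON::PP->new->relaxed->decode($config_text);

# The default (strict) decoder croaks on the same text.
my $strict_ok = eval { JSON::PP->new->decode($config_text); 1 };

print scalar(@$relaxed), " elements; strict decode ",
    ($strict_ok ? "succeeded" : "failed"), "\n";
```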
|
2013 |
+
=head2 canonical
|
2014 |
+
|
2015 |
+
$json = $json->canonical([$enable])
|
2016 |
+
|
2017 |
+
$enabled = $json->get_canonical
|
2018 |
+
|
2019 |
+
If C<$enable> is true (or missing), then the C<encode> method will output JSON objects
|
2020 |
+
by sorting their keys. This adds a comparatively high overhead.
|
2021 |
+
|
2022 |
+
If C<$enable> is false, then the C<encode> method will output key-value
|
2023 |
+
pairs in the order Perl stores them (which will likely change between runs
|
2024 |
+
of the same script).
|
2025 |
+
|
2026 |
+
This option is useful if you want the same data structure to be encoded as
|
2027 |
+
the same JSON text (given the same overall settings). If it is disabled,
|
2028 |
+
the same hash might be encoded differently even if it contains the same data,
|
2029 |
+
as key-value pairs have no inherent ordering in Perl.
|
2030 |
+
|
2031 |
+
This setting has no effect when decoding JSON texts.
|
2032 |
+
|
2033 |
+
If you want your own sorting routine, you can give a code reference
|
2034 |
+
or a subroutine name to C<sort_by>. See L</JSON::PP OWN METHODS>.
|
2035 |
+
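A short sketch of the effect, using an arbitrary three-key hash:

```perl
use strict;
use warnings;
use JSON::PP;

my %data = (b => 2, a => 1, c => 3);

# With canonical enabled, keys are emitted in sorted order on every run.
my $sorted = JSON::PP->new->canonical->encode(\%data);

print "$sorted\n";
```

Stable output of this kind is mainly useful when JSON texts are compared or hashed.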
|
2036 |
+
=head2 allow_nonref
|
2037 |
+
|
2038 |
+
$json = $json->allow_nonref([$enable])
|
2039 |
+
|
2040 |
+
$enabled = $json->get_allow_nonref
|
2041 |
+
|
2042 |
+
If C<$enable> is true (or missing), then the C<encode> method can convert a
|
2043 |
+
non-reference into its corresponding string, number or null JSON value,
|
2044 |
+
which is an extension to RFC4627. Likewise, C<decode> will accept those JSON
|
2045 |
+
values instead of croaking.
|
2046 |
+
|
2047 |
+
If C<$enable> is false, then the C<encode> method will croak if it isn't
|
2048 |
+
passed an arrayref or hashref, as JSON texts must either be an object
|
2049 |
+
or array. Likewise, C<decode> will croak if given something that is not a
|
2050 |
+
JSON object or array.
|
2051 |
+
|
2052 |
+
JSON::PP->new->allow_nonref->encode ("Hello, World!")
|
2053 |
+
=> "Hello, World!"
|
2054 |
+
|
2055 |
+
=head2 allow_unknown
|
2056 |
+
|
2057 |
+
$json = $json->allow_unknown ([$enable])
|
2058 |
+
|
2059 |
+
$enabled = $json->get_allow_unknown
|
2060 |
+
|
2061 |
+
If $enable is true (or missing), then "encode" will *not* throw an
|
2062 |
+
exception when it encounters values it cannot represent in JSON (for
|
2063 |
+
example, filehandles) but instead will encode a JSON "null" value.
|
2064 |
+
Note that blessed objects are not included here and are handled
|
2065 |
+
separately by C<allow_blessed>.
|
2066 |
+
|
2067 |
+
If $enable is false (the default), then "encode" will throw an
|
2068 |
+
exception when it encounters anything it cannot encode as JSON.
|
2069 |
+
|
2070 |
+
This option does not affect "decode" in any way, and it is
|
2071 |
+
recommended to leave it off unless you know your communications
|
2072 |
+
partner.
|
2073 |
+
|
2074 |
+
=head2 allow_blessed
|
2075 |
+
|
2076 |
+
$json = $json->allow_blessed([$enable])
|
2077 |
+
|
2078 |
+
$enabled = $json->get_allow_blessed
|
2079 |
+
|
2080 |
+
If C<$enable> is true (or missing), then the C<encode> method will not
|
2081 |
+
barf when it encounters a blessed reference. Instead, the value of the
|
2082 |
+
B<convert_blessed> option will decide whether C<null> (C<convert_blessed>
|
2083 |
+
disabled or no C<TO_JSON> method found) or a representation of the
|
2084 |
+
object (C<convert_blessed> enabled and C<TO_JSON> method found) is being
|
2085 |
+
encoded. Has no effect on C<decode>.
|
2086 |
+
|
2087 |
+
If C<$enable> is false (the default), then C<encode> will throw an
|
2088 |
+
exception when it encounters a blessed object.
|
2089 |
+
|
2090 |
+
=head2 convert_blessed
|
2091 |
+
|
2092 |
+
$json = $json->convert_blessed([$enable])
|
2093 |
+
|
2094 |
+
$enabled = $json->get_convert_blessed
|
2095 |
+
|
2096 |
+
If C<$enable> is true (or missing), then C<encode>, upon encountering a
|
2097 |
+
blessed object, will check for the availability of the C<TO_JSON> method
|
2098 |
+
on the object's class. If found, it will be called in scalar context
|
2099 |
+
and the resulting scalar will be encoded instead of the object. If no
|
2100 |
+
C<TO_JSON> method is found, the value of C<allow_blessed> will decide what
|
2101 |
+
to do.
|
2102 |
+
|
2103 |
+
The C<TO_JSON> method may safely call die if it wants. If C<TO_JSON>
|
2104 |
+
returns other blessed objects, those will be handled in the same
|
2105 |
+
way. C<TO_JSON> must take care of not causing an endless recursion cycle
|
2106 |
+
(== crash) in this case. The name of C<TO_JSON> was chosen because other
|
2107 |
+
methods called by the Perl core (== not by the user of the object) are
|
2108 |
+
usually in upper case letters and to avoid collisions with the C<to_json>
|
2109 |
+
function or method.
|
2110 |
+
|
2111 |
+
This setting does not yet influence C<decode> in any way.
|
2112 |
+
|
2113 |
+
If C<$enable> is false, then the C<allow_blessed> setting will decide what
|
2114 |
+
to do when a blessed object is found.
|
2115 |
+
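The interplay of C<allow_blessed> and C<convert_blessed> can be sketched as below. The C<Point> class is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical class used only for this illustration.
{
    package Point;
    sub new     { my ($class, %args) = @_; return bless {%args}, $class }
    sub TO_JSON { my ($self) = @_; return { x => $self->{x}, y => $self->{y} } }
}

my $point = Point->new(x => 1, y => 2);

# convert_blessed: TO_JSON is called and its return value is encoded.
my $converted = JSON::PP->new->canonical->convert_blessed->encode([$point]);

# allow_blessed alone: the object is encoded as null.
my $nulled = JSON::PP->new->allow_blessed->encode([$point]);

print "$converted\n$nulled\n";
```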
|
2116 |
+
=head2 filter_json_object
|
2117 |
+
|
2118 |
+
$json = $json->filter_json_object([$coderef])
|
2119 |
+
|
2120 |
+
When C<$coderef> is specified, it will be called from C<decode> each
|
2121 |
+
time it decodes a JSON object. The only argument passed to the coderef
|
2122 |
+
is a reference to the newly-created hash. If the code reference returns
|
2123 |
+
a single scalar (which need not be a reference), this value
|
2124 |
+
(i.e. a copy of that scalar to avoid aliasing) is inserted into the
|
2125 |
+
deserialised data structure. If it returns an empty list
|
2126 |
+
(NOTE: I<not> C<undef>, which is a valid scalar), the original deserialised
|
2127 |
+
hash will be inserted. This setting can slow down decoding considerably.
|
2128 |
+
|
2129 |
+
When C<$coderef> is omitted or undefined, any existing callback will
|
2130 |
+
be removed and C<decode> will not change the deserialised hash in any
|
2131 |
+
way.
|
2132 |
+
|
2133 |
+
Example, convert all JSON objects into the integer 5:
|
2134 |
+
|
2135 |
+
my $js = JSON::PP->new->filter_json_object (sub { 5 });
|
2136 |
+
# returns [5]
|
2137 |
+
$js->decode ('[{}]'); # the given subroutine takes a hash reference.
|
2138 |
+
# throws an exception because allow_nonref is not enabled
|
2139 |
+
# so a lone 5 is not allowed.
|
2140 |
+
$js->decode ('{"a":1, "b":2}');
|
2141 |
+
|
2142 |
+
=head2 filter_json_single_key_object
|
2143 |
+
|
2144 |
+
$json = $json->filter_json_single_key_object($key [=> $coderef])
|
2145 |
+
|
2146 |
+
Works similarly to C<filter_json_object>, but is only called for
|
2147 |
+
JSON objects having a single key named C<$key>.
|
2148 |
+
|
2149 |
+
This C<$coderef> is called before the one specified via
|
2150 |
+
C<filter_json_object>, if any. It gets passed the single value in the JSON
|
2151 |
+
object. If it returns a single value, it will be inserted into the data
|
2152 |
+
structure. If it returns nothing (not even C<undef> but the empty list),
|
2153 |
+
the callback from C<filter_json_object> will be called next, as if no
|
2154 |
+
single-key callback were specified.
|
2155 |
+
|
2156 |
+
If C<$coderef> is omitted or undefined, the corresponding callback will be
|
2157 |
+
disabled. There can only ever be one callback for a given key.
|
2158 |
+
|
2159 |
+
As this callback gets called less often than the C<filter_json_object>
|
2160 |
+
one, decoding speed will not usually suffer as much. Therefore, single-key
|
2161 |
+
objects make excellent targets to serialise Perl objects into, especially
|
2162 |
+
as single-key JSON objects are as close to the type-tagged value concept
|
2163 |
+
as JSON gets (it's basically an ID/VALUE tuple). Of course, JSON does not
|
2164 |
+
support this in any way, so you need to make sure your data never looks
|
2165 |
+
like a serialised Perl hash.
|
2166 |
+
|
2167 |
+
Typical names for the single object key are C<__class_whatever__>, or
|
2168 |
+
C<$__dollars_are_rarely_used__$> or C<}ugly_brace_placement>, or even
|
2169 |
+
things like C<__class_md5sum(classname)__>, to reduce the risk of clashing
|
2170 |
+
with real hashes.
|
2171 |
+
|
2172 |
+
Example, decode JSON objects of the form C<< { "__widget__" => <id> } >>
|
2173 |
+
into the corresponding C<< $WIDGET{<id>} >> object:
|
2174 |
+
|
2175 |
+
# return whatever is in $WIDGET{5}:
|
2176 |
+
JSON::PP
|
2177 |
+
->new
|
2178 |
+
->filter_json_single_key_object (__widget__ => sub {
|
2179 |
+
$WIDGET{ $_[0] }
|
2180 |
+
})
|
2181 |
+
->decode ('{"__widget__": 5}')
|
2182 |
+
|
2183 |
+
# this can be used with a TO_JSON method in some "widget" class
|
2184 |
+
# for serialisation to json:
|
2185 |
+
sub WidgetBase::TO_JSON {
|
2186 |
+
my ($self) = @_;
|
2187 |
+
|
2188 |
+
unless ($self->{id}) {
|
2189 |
+
$self->{id} = ..get..some..id..;
|
2190 |
+
$WIDGET{$self->{id}} = $self;
|
2191 |
+
}
|
2192 |
+
|
2193 |
+
{ __widget__ => $self->{id} }
|
2194 |
+
}
|
2195 |
+
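A self-contained variant of the decoding half of this example, as a sketch. The contents of C<%WIDGET> are invented for illustration.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical registry of live objects, keyed by id.
my %WIDGET = (5 => { name => 'button' });

my $widget = JSON::PP->new
    ->filter_json_single_key_object(__widget__ => sub { $WIDGET{ $_[0] } })
    ->decode('{"__widget__": 5}');

# $widget is the registry entry itself, not a one-key hash.
print $widget->{name}, "\n";
```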
|
2196 |
+
=head2 shrink
|
2197 |
+
|
2198 |
+
$json = $json->shrink([$enable])
|
2199 |
+
|
2200 |
+
$enabled = $json->get_shrink
|
2201 |
+
|
2202 |
+
In JSON::XS, this flag resizes strings generated by either
|
2203 |
+
C<encode> or C<decode> to their minimum size possible.
|
2204 |
+
It will also try to downgrade any strings to octet-form if possible.
|
2205 |
+
|
2206 |
+
In JSON::PP, it is a no-op with respect to resizing strings, but it tries
|
2207 |
+
C<utf8::downgrade> on the string returned by C<encode>.
|
2208 |
+
See L<utf8>.
|
2209 |
+
|
2210 |
+
See L<JSON::XS/OBJECT-ORIENTED INTERFACE>.
|
2211 |
+
|
2212 |
+
=head2 max_depth
|
2213 |
+
|
2214 |
+
$json = $json->max_depth([$maximum_nesting_depth])
|
2215 |
+
|
2216 |
+
$max_depth = $json->get_max_depth
|
2217 |
+
|
2218 |
+
Sets the maximum nesting level (default C<512>) accepted while encoding
|
2219 |
+
or decoding. If a higher nesting level is detected in JSON text or a Perl
|
2220 |
+
data structure, then the encoder and decoder will stop and croak at that
|
2221 |
+
point.
|
2222 |
+
|
2223 |
+
Nesting level is defined by the number of hash- or arrayrefs that the encoder
|
2224 |
+
needs to traverse to reach a given point or the number of C<{> or C<[>
|
2225 |
+
characters without their matching closing parenthesis crossed to reach a
|
2226 |
+
given character in a string.
|
2227 |
+
|
2228 |
+
If no argument is given, the highest possible setting will be used, which
|
2229 |
+
is rarely useful.
|
2230 |
+
|
2231 |
+
See L<JSON::XS/SECURITY CONSIDERATIONS> for more info on why this is useful.
|
2232 |
+
|
2233 |
+
When a large value (100 or more) is set and a deeply nested object/text is de/encoded,
|
2234 |
+
it may raise a warning 'Deep recursion on subroutine' at the perl runtime phase.
|
2235 |
+
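A minimal sketch of the limit in action when decoding, with a deliberately small depth chosen for illustration:

```perl
use strict;
use warnings;
use JSON::PP;

my $shallow = JSON::PP->new->max_depth(2);

# Two levels of nesting are within the limit.
my $ok = eval { $shallow->decode('[[1]]'); 1 };

# Three levels exceed it, so decode croaks.
my $too_deep = eval { $shallow->decode('[[[1]]]'); 1 };

print "depth 2: ", ($ok ? "ok" : "rejected"),
    "; depth 3: ", ($too_deep ? "ok" : "rejected"), "\n";
```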
|
2236 |
+
=head2 max_size
|
2237 |
+
|
2238 |
+
$json = $json->max_size([$maximum_string_size])
|
2239 |
+
|
2240 |
+
$max_size = $json->get_max_size
|
2241 |
+
|
2242 |
+
Set the maximum length a JSON text may have (in bytes) where decoding is
|
2243 |
+
being attempted. The default is C<0>, meaning no limit. When C<decode>
|
2244 |
+
is called on a string that is longer than this many bytes, it will not
|
2245 |
+
attempt to decode the string but throw an exception. This setting has no
|
2246 |
+
effect on C<encode> (yet).
|
2247 |
+
|
2248 |
+
If no argument is given, the limit check will be deactivated (same as when
|
2249 |
+
C<0> is specified).
|
2250 |
+
|
2251 |
+
See L<JSON::XS/SECURITY CONSIDERATIONS> for more info on why this is useful.
|
2252 |
+
|
2253 |
+
=head2 encode
|
2254 |
+
|
2255 |
+
$json_text = $json->encode($perl_scalar)
|
2256 |
+
|
2257 |
+
Converts the given Perl data structure (a simple scalar or a reference
|
2258 |
+
to a hash or array) to its JSON representation. Simple scalars will be
|
2259 |
+
converted into JSON string or number sequences, while references to arrays
|
2260 |
+
become JSON arrays and references to hashes become JSON objects. Undefined
|
2261 |
+
Perl values (e.g. C<undef>) become JSON C<null> values.
|
2262 |
+
References to the integers C<0> and C<1> are converted into C<false> and C<true>, respectively.
|
2263 |
+
|
2264 |
+
=head2 decode
|
2265 |
+
|
2266 |
+
$perl_scalar = $json->decode($json_text)
|
2267 |
+
|
2268 |
+
The opposite of C<encode>: expects a JSON text and tries to parse it,
|
2269 |
+
returning the resulting simple scalar or reference. Croaks on error.
|
2270 |
+
|
2271 |
+
JSON numbers and strings become simple Perl scalars. JSON arrays become
|
2272 |
+
Perl arrayrefs and JSON objects become Perl hashrefs. C<true> becomes
|
2273 |
+
C<1> (C<JSON::true>), C<false> becomes C<0> (C<JSON::false>) and
|
2274 |
+
C<null> becomes C<undef>.
|
2275 |
+
|
2276 |
+
=head2 decode_prefix
|
2277 |
+
|
2278 |
+
($perl_scalar, $characters) = $json->decode_prefix($json_text)
|
2279 |
+
|
2280 |
+
This works like the C<decode> method, but instead of raising an exception
|
2281 |
+
when there is trailing garbage after the first JSON object, it will
|
2282 |
+
silently stop parsing there and return the number of characters consumed
|
2283 |
+
so far.
|
2284 |
+
|
2285 |
+
JSON->new->decode_prefix ("[1] the tail")
|
2286 |
+
=> ([1], 3)
|
2287 |
+
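The returned count can be used to recover the unparsed remainder. The following is a sketch with an arbitrary input string; the tail starts directly after the JSON text so that whitespace handling does not affect the count.

```perl
use strict;
use warnings;
use JSON::PP;

my $input = '{"a":1}trailing text';

# decode_prefix stops after the first JSON text instead of croaking.
my ($data, $consumed) = JSON::PP->new->decode_prefix($input);

# Everything past the consumed characters is the unparsed tail.
my $rest = substr $input, $consumed;

print "consumed $consumed characters, rest: '$rest'\n";
```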
|
2288 |
+
=head1 INCREMENTAL PARSING
|
2289 |
+
|
2290 |
+
Most of this section is copied and modified from L<JSON::XS/INCREMENTAL PARSING>.
|
2291 |
+
|
2292 |
+
In some cases, there is the need for incremental parsing of JSON texts.
|
2293 |
+
This module does allow you to parse a JSON stream incrementally.
|
2294 |
+
It does so by accumulating text until it has a full JSON object, which
|
2295 |
+
it then can decode. This process is similar to using C<decode_prefix>
|
2296 |
+
to see if a full JSON object is available, but is much more efficient
|
2297 |
+
(and can be implemented with a minimum of method calls).
|
2298 |
+
|
2299 |
+
This module will only attempt to parse the JSON text once it is sure it
|
2300 |
+
has enough text to get a decisive result, using a very simple but
|
2301 |
+
truly incremental parser. This means that it sometimes won't stop as
|
2302 |
+
early as the full parser, for example, it doesn't detect parenthesis
|
2303 |
+
mismatches. The only thing it guarantees is that it starts decoding as
|
2304 |
+
soon as a syntactically valid JSON text has been seen. This means you need
|
2305 |
+
to set resource limits (e.g. C<max_size>) to ensure the parser will stop
|
2306 |
+
parsing in the presence of syntax errors.
|
2307 |
+
|
2308 |
+
The following methods implement this incremental parser.
|
2309 |
+
|
2310 |
+
=head2 incr_parse
|
2311 |
+
|
2312 |
+
$json->incr_parse( [$string] ) # void context
|
2313 |
+
|
2314 |
+
$obj_or_undef = $json->incr_parse( [$string] ) # scalar context
|
2315 |
+
|
2316 |
+
@obj_or_empty = $json->incr_parse( [$string] ) # list context
|
2317 |
+
|
2318 |
+
This is the central parsing function. It can both append new text and
|
2319 |
+
extract objects from the stream accumulated so far (both of these
|
2320 |
+
functions are optional).
|
2321 |
+
|
2322 |
+
If C<$string> is given, then this string is appended to the already
|
2323 |
+
existing JSON fragment stored in the C<$json> object.
|
2324 |
+
|
2325 |
+
After that, if the function is called in void context, it will simply
|
2326 |
+
return without doing anything further. This can be used to add more text
|
2327 |
+
in as many chunks as you want.
|
2328 |
+
|
2329 |
+
If the method is called in scalar context, then it will try to extract
|
2330 |
+
exactly I<one> JSON object. If that is successful, it will return this
|
2331 |
+
object, otherwise it will return C<undef>. If there is a parse error,
|
2332 |
+
this method will croak just as C<decode> would do (one can then use
|
2333 |
+
C<incr_skip> to skip the erroneous part). This is the most common way of
|
2334 |
+
using the method.
|
2335 |
+
|
2336 |
+
And finally, in list context, it will try to extract as many objects
|
2337 |
+
from the stream as it can find and return them, or the empty list
|
2338 |
+
otherwise. For this to work, there must be no separators between the JSON
|
2339 |
+
objects or arrays, instead they must be concatenated back-to-back. If
|
2340 |
+
an error occurs, an exception will be raised as in the scalar context
|
2341 |
+
case. Note that in this case, any previously-parsed JSON texts will be
|
2342 |
+
lost.
|
2343 |
+
|
2344 |
+
Example: Parse some JSON arrays/objects in a given string and return them.
|
2345 |
+
|
2346 |
+
my @objs = JSON->new->incr_parse ("[5][7][1,2]");
|
2347 |
+
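The scalar and list context behaviours described above can be sketched as follows. The chunk boundaries are arbitrary and only simulate text arriving in pieces.

```perl
use strict;
use warnings;
use JSON::PP;

my $parser = JSON::PP->new;

# Scalar context: undef is returned until a complete JSON text is buffered.
my $partial = $parser->incr_parse('[1,');
my $whole   = $parser->incr_parse('2]');

# List context: every complete text concatenated back-to-back is extracted.
my @objs = JSON::PP->new->incr_parse('[5][7][1,2]');

print "got ", scalar(@objs), " arrays in list context\n";
```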
|
2348 |
+
=head2 incr_text
|
2349 |
+
|
2350 |
+
$lvalue_string = $json->incr_text
|
2351 |
+
|
2352 |
+
This method returns the currently stored JSON fragment as an lvalue, that
|
2353 |
+
is, you can manipulate it. This I<only> works when a preceding call to
|
2354 |
+
C<incr_parse> in I<scalar context> successfully returned an object. Under
|
2355 |
+
all other circumstances you must not call this function (I mean it.
|
2356 |
+
Although in simple tests it might actually work, it I<will> fail under
|
2357 |
+
real world conditions). As a special exception, you can also call this
|
2358 |
+
method before having parsed anything.
|
2359 |
+
|
2360 |
+
This function is useful in two cases: a) finding the trailing text after a
|
2361 |
+
JSON object or b) parsing multiple JSON objects separated by non-JSON text
|
2362 |
+
(such as commas).
|
2363 |
+
|
2364 |
+
$json->incr_text =~ s/\s*,\s*//;
|
2365 |
+
|
2366 |
+
In Perl 5.005, the C<lvalue> attribute is not available.
|
2367 |
+
You must write code like the following:
|
2368 |
+
|
2369 |
+
$string = $json->incr_text;
|
2370 |
+
$string =~ s/\s*,\s*//;
|
2371 |
+
$json->incr_text( $string );
|
2372 |
+
|
2373 |
+
=head2 incr_skip
|
2374 |
+
|
2375 |
+
$json->incr_skip
|
2376 |
+
|
2377 |
+
This will reset the state of the incremental parser and will remove the
|
2378 |
+
parsed text from the input buffer. This is useful after C<incr_parse>
|
2379 |
+
died, in which case the input buffer and incremental parser state is left
|
2380 |
+
unchanged, to skip the text parsed so far and to reset the parse state.
|
2381 |
+
|
2382 |
+
=head2 incr_reset
|
2383 |
+
|
2384 |
+
$json->incr_reset
|
2385 |
+
|
2386 |
+
This completely resets the incremental parser, that is, after this call,
|
2387 |
+
it will be as if the parser had never parsed anything.
|
2388 |
+
|
2389 |
+
This is useful if you want to repeatedly parse JSON objects and want to
|
2390 |
+
ignore any trailing data, which means you have to reset the parser after
|
2391 |
+
each successful decode.
|
2392 |
+
|
2393 |
+
See L<JSON::XS/INCREMENTAL PARSING> for examples.
|
2394 |
+
|
2395 |
+
|
2396 |
+
=head1 JSON::PP OWN METHODS
|
2397 |
+
|
2398 |
+
=head2 allow_singlequote
|
2399 |
+
|
2400 |
+
$json = $json->allow_singlequote([$enable])
|
2401 |
+
|
2402 |
+
If C<$enable> is true (or missing), then C<decode> will accept
|
2403 |
+
JSON strings quoted with single quotes, which are invalid in the JSON
|
2404 |
+
format.
|
2405 |
+
|
2406 |
+
$json->allow_singlequote->decode(q|{"foo":'bar'}|);
|
2407 |
+
$json->allow_singlequote->decode(q|{'foo':"bar"}|);
|
2408 |
+
$json->allow_singlequote->decode(q|{'foo':'bar'}|);
|
2409 |
+
|
2410 |
+
As with the C<relaxed> option, this option may be used to parse
|
2411 |
+
application-specific files written by humans.
|
2412 |
+
|
2413 |
+
|
2414 |
+
=head2 allow_barekey
|
2415 |
+
|
2416 |
+
$json = $json->allow_barekey([$enable])
|
2417 |
+
|
2418 |
+
If C<$enable> is true (or missing), then C<decode> will accept
|
2419 |
+
bare (unquoted) keys in JSON objects, which are invalid in the JSON format.
|
2420 |
+
|
2421 |
+
As with the C<relaxed> option, this option may be used to parse
|
2422 |
+
application-specific files written by humans.
|
2423 |
+
|
2424 |
+
$json->allow_barekey->decode('{foo:"bar"}');
|
2425 |
+
|
2426 |
+
=head2 allow_bignum
|
2427 |
+
|
2428 |
+
$json = $json->allow_bignum([$enable])
|
2429 |
+
|
2430 |
+
If C<$enable> is true (or missing), then C<decode> will convert
|
2431 |
+
a big integer that Perl cannot handle as an integer into a L<Math::BigInt>
|
2432 |
+
object and convert a floating-point number (any) into a L<Math::BigFloat>.
|
2433 |
+
|
2434 |
+
Conversely, C<encode> converts C<Math::BigInt> objects and C<Math::BigFloat>
|
2435 |
+
objects into JSON numbers when C<allow_blessed> is enabled.
|
2436 |
+
|
2437 |
+
$json->allow_nonref->allow_blessed->allow_bignum;
|
2438 |
+
$bigfloat = $json->decode('2.000000000000000000000000001');
|
2439 |
+
print $json->encode($bigfloat);
|
2440 |
+
# => 2.000000000000000000000000001
|
2441 |
+
|
2442 |
+
See L<JSON::XS/MAPPING> for the normal conversion of JSON numbers.
|
2443 |
+
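The integer case can be sketched in the same way. The 30-digit literal is arbitrary; it is only assumed to be larger than any native Perl integer.

```perl
use strict;
use warnings;
use JSON::PP;

my $json   = JSON::PP->new->allow_blessed->allow_bignum;
my $digits = '123456789012345678901234567890';   # too large for a native integer

# decode turns the oversized integer into a Math::BigInt object.
my $data = $json->decode("[$digits]");

# encode writes the object back as a plain JSON number.
my $round_trip = $json->encode($data);

print ref($data->[0]), " => $round_trip\n";
```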
|
2444 |
+
=head2 loose
|
2445 |
+
|
2446 |
+
$json = $json->loose([$enable])
|
2447 |
+
|
2448 |
+
Unescaped [\x00-\x1f\x22\x2f\x5c] characters are invalid in JSON strings
|
2449 |
+
and the module does not allow C<decode> to accept them (except for \x2f).
|
2450 |
+
If C<$enable> is true (or missing), then C<decode> will accept these
|
2451 |
+
unescaped characters.
|
2452 |
+
|
2453 |
+
$json->loose->decode(qq|["abc
|
2454 |
+
def"]|);
|
2455 |
+
|
2456 |
+
See L<JSON::XS/SECURITY CONSIDERATIONS>.
|
2457 |
+
|
2458 |
+
=head2 escape_slash
|
2459 |
+
|
2460 |
+
$json = $json->escape_slash([$enable])
|
2461 |
+
|
2462 |
+
According to the JSON grammar, a I<slash> (U+002F) may be escaped. But by default,
|
2463 |
+
JSON::PP (like JSON::XS) encodes strings without escaping slashes.
|
2464 |
+
|
2465 |
+
If C<$enable> is true (or missing), then C<encode> will escape slashes.
|
2466 |
+
|
2467 |
+
=head2 indent_length
|
2468 |
+
|
2469 |
+
$json = $json->indent_length($length)
|
2470 |
+
|
2471 |
+
JSON::XS indent space length is 3 and cannot be changed.
|
2472 |
+
JSON::PP sets the indent space length to the given $length.
|
2473 |
+
The default is 3. The acceptable range is 0 to 15.
|
2474 |
+
|
2475 |
+
=head2 sort_by
|
2476 |
+
|
2477 |
+
$json = $json->sort_by($function_name)
|
2478 |
+
$json = $json->sort_by($subroutine_ref)
|
2479 |
+
|
2480 |
+
If $function_name or $subroutine_ref is set, that sort routine is used
|
2481 |
+
in encoding JSON objects.
|
2482 |
+
|
2483 |
+
$js = $pc->sort_by(sub { $JSON::PP::a cmp $JSON::PP::b })->encode($obj);
|
2484 |
+
# is($js, q|{"a":1,"b":2,"c":3,"d":4,"e":5,"f":6,"g":7,"h":8,"i":9}|);
|
2485 |
+
|
2486 |
+
$js = $pc->sort_by('own_sort')->encode($obj);
|
2487 |
+
# is($js, q|{"a":1,"b":2,"c":3,"d":4,"e":5,"f":6,"g":7,"h":8,"i":9}|);
|
2488 |
+
|
2489 |
+
sub JSON::PP::own_sort { $JSON::PP::a cmp $JSON::PP::b }
|
2490 |
+
|
2491 |
+
As the sorting routine runs in the JSON::PP scope, the given
|
2492 |
+
subroutine name and the special variables C<$a> and C<$b> must begin with
|
2493 |
+
'JSON::PP::'.
|
2494 |
+
|
2495 |
+
If $integer is set, the effect is the same as enabling C<canonical>.
|
2496 |
+
|
2497 |
+
=head1 INTERNAL
|
2498 |
+
|
2499 |
+
For developers.
|
2500 |
+
|
2501 |
+
=over
|
2502 |
+
|
2503 |
+
=item PP_encode_box
|
2504 |
+
|
2505 |
+
Returns
|
2506 |
+
|
2507 |
+
{
|
2508 |
+
depth => $depth,
|
2509 |
+
indent_count => $indent_count,
|
2510 |
+
}
|
2511 |
+
|
2512 |
+
|
2513 |
+
=item PP_decode_box
|
2514 |
+
|
2515 |
+
Returns
|
2516 |
+
|
2517 |
+
{
|
2518 |
+
text => $text,
|
2519 |
+
at => $at,
|
2520 |
+
ch => $ch,
|
2521 |
+
len => $len,
|
2522 |
+
depth => $depth,
|
2523 |
+
encoding => $encoding,
|
2524 |
+
is_valid_utf8 => $is_valid_utf8,
|
2525 |
+
};
|
2526 |
+
|
2527 |
+
=back
|
2528 |
+
|
2529 |
+
=head1 MAPPING
|
2530 |
+
|
2531 |
+
This section is copied from JSON::XS and modified to C<JSON::PP>.
|
2532 |
+
JSON::XS and JSON::PP mapping mechanisms are almost equivalent.
|
2533 |
+
|
2534 |
+
See L<JSON::XS/MAPPING>.
|
2535 |
+
|
2536 |
+
=head2 JSON -> PERL
|
2537 |
+
|
2538 |
+
=over 4
|
2539 |
+
|
2540 |
+
=item object
|
2541 |
+
|
2542 |
+
A JSON object becomes a reference to a hash in Perl. No ordering of object
|
2543 |
+
keys is preserved (JSON does not preserve object key ordering itself).
|
2544 |
+
|
2545 |
+
=item array
|
2546 |
+
|
2547 |
+
A JSON array becomes a reference to an array in Perl.
|
2548 |
+
|
2549 |
+
=item string
|
2550 |
+
|
2551 |
+
A JSON string becomes a string scalar in Perl - Unicode codepoints in JSON
|
2552 |
+
are represented by the same codepoints in the Perl string, so no manual
|
2553 |
+
decoding is necessary.
|
2554 |
+
|
2555 |
+
=item number
|
2556 |
+
|
2557 |
+
A JSON number becomes either an integer, numeric (floating point) or
|
2558 |
+
string scalar in perl, depending on its range and any fractional parts. On
|
2559 |
+
the Perl level, there is no difference between those as Perl handles all
|
2560 |
+
the conversion details, but an integer may take slightly less memory and
|
2561 |
+
might represent more values exactly than floating point numbers.
|
2562 |
+
|
2563 |
+
If the number consists of digits only, C<JSON> will try to represent
|
2564 |
+
it as an integer value. If that fails, it will try to represent it as
|
2565 |
+
a numeric (floating point) value if that is possible without loss of
|
2566 |
+
precision. Otherwise it will preserve the number as a string value (in
|
2567 |
+
which case you lose roundtripping ability, as the JSON number will be
|
2568 |
+
re-encoded to a JSON string).
|
2569 |
+
|
2570 |
+
Numbers containing a fractional or exponential part will always be
|
2571 |
+
represented as numeric (floating point) values, possibly at a loss of
|
2572 |
+
precision (in which case you might lose perfect roundtripping ability, but
|
2573 |
+
the JSON number will still be re-encoded as a JSON number).
|
2574 |
+
|
2575 |
+
Note that precision is not accuracy - binary floating point values cannot
|
2576 |
+
represent most decimal fractions exactly, and when converting from and to
|
2577 |
+
floating point, C<JSON> only guarantees precision up to but not including
|
2578 |
+
the least significant bit.
|
2579 |
+
|
2580 |
+
When C<allow_bignum> is enabled, big integers
|
2581 |
+
and floating-point numbers can optionally be converted into L<Math::BigInt> and
|
2582 |
+
L<Math::BigFloat> objects.
|
2583 |
+
|
2584 |
+
=item true, false
|
2585 |
+
|
2586 |
+
These JSON atoms become C<JSON::PP::true> and C<JSON::PP::false>,
|
2587 |
+
respectively. They are overloaded to act almost exactly like the numbers
|
2588 |
+
C<1> and C<0>. You can check whether a scalar is a JSON boolean by using
|
2589 |
+
the C<JSON::is_bool> function.
|
2590 |
+
|
2591 |
+
print JSON::PP::true . "\n";
|
2592 |
+
=> true
|
2593 |
+
print JSON::PP::true + 1;
|
2594 |
+
=> 2
|
2595 |
+
|
2596 |
+
ok(JSON::true eq '1');
|
2597 |
+
ok(JSON::true == 1);
|
2598 |
+
|
2599 |
+
C<JSON> will install these missing overloading features to the backend modules.
|
2600 |
+
|
2601 |
+
|
2602 |
+
=item null
|
2603 |
+
|
2604 |
+
A JSON null atom becomes C<undef> in Perl.
|
2605 |
+
|
2606 |
+
C<JSON::PP::null> returns C<undef>.
|
2607 |
+
|
2608 |
+
=back
|
2609 |
+
|
2610 |
+
|
2611 |
+
=head2 PERL -> JSON
|
2612 |
+
|
2613 |
+
The mapping from Perl to JSON is slightly more difficult, as Perl is a
|
2614 |
+
truly typeless language, so we can only guess which JSON type is meant by
|
2615 |
+
a Perl value.
|
2616 |
+
|
2617 |
+
=over 4
|
2618 |
+
|
2619 |
+
=item hash references
|
2620 |
+
|
2621 |
+
Perl hash references become JSON objects. As there is no inherent ordering
|
2622 |
+
in hash keys (or JSON objects), they will usually be encoded in a
|
2623 |
+
pseudo-random order that can change between runs of the same program but
|
2624 |
+
stays generally the same within a single run of a program. C<JSON>
|
2625 |
+
can optionally sort the hash keys (determined by the I<canonical> flag), so
|
2626 |
+
the same data structure will serialise to the same JSON text (given same
|
2627 |
+
settings and version of JSON::XS), but this incurs a runtime overhead
|
2628 |
+
and is only rarely useful, e.g. when you want to compare some JSON text
|
2629 |
+
against another for equality.
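
For illustration, a minimal sketch of the I<canonical> flag using the stock C<JSON::PP> module. The hash contents are made up.

```perl
use strict;
use warnings;
use JSON::PP;

my %record = (name => 'uroman', lang => 'perl', files => 68);

# With canonical enabled, keys are emitted in sorted order, so the
# output text is stable across runs.
my $sorted = JSON::PP->new->canonical->encode(\%record);

print "$sorted\n";
```

Two structures with equal content then serialise to the same text, which is what makes a plain string comparison meaningful.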
|
2630 |
+
|
2631 |
+
|
2632 |
+
=item array references
|
2633 |
+
|
2634 |
+
Perl array references become JSON arrays.
|
2635 |
+
|
2636 |
+
=item other references
|
2637 |
+
|
2638 |
+
Other unblessed references are generally not allowed and will cause an
|
2639 |
+
exception to be thrown, except for references to the integers C<0> and
|
2640 |
+
C<1>, which get turned into C<false> and C<true> atoms in JSON. You can
|
2641 |
+
also use C<JSON::false> and C<JSON::true> to improve readability.
|
2642 |
+
|
2643 |
+
to_json [\0,JSON::PP::true] # yields [false,true]
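
The same idea as a self-contained sketch, assuming the stock C<JSON::PP> module and its exported C<encode_json> function:

```perl
use strict;
use warnings;
use JSON::PP;

# References to 0 and 1 and the JSON::PP boolean constants all map
# to JSON false/true atoms.
my $text = encode_json([\0, \1, JSON::PP::false(), JSON::PP::true()]);

print "$text\n";
```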
|
2644 |
+
|
2645 |
+
=item JSON::PP::true, JSON::PP::false, JSON::PP::null
|
2646 |
+
|
2647 |
+
These special values become JSON true and JSON false values,
|
2648 |
+
respectively. You can also use C<\1> and C<\0> directly if you want.
|
2649 |
+
|
2650 |
+
C<JSON::PP::null> returns C<undef>, which is encoded as a JSON C<null>.
|
2651 |
+
|
2652 |
+
=item blessed objects
|
2653 |
+
|
2654 |
+
Blessed objects are not directly representable in JSON. See the
|
2655 |
+
C<allow_blessed> and C<convert_blessed> methods for the options on
|
2656 |
+
how to deal with this: basically, you can choose between throwing an
|
2657 |
+
exception, encoding the reference as if it weren't blessed, or providing
|
2658 |
+
your own serialiser method.
|
2659 |
+
|
2660 |
+
See L<convert_blessed>.
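
A minimal sketch of the serialiser-method approach, assuming the stock C<JSON::PP> module. The C<Point> class is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use JSON::PP;

package Point;    # hypothetical class, for illustration only

sub new { my ($class, %args) = @_; return bless {%args}, $class }

# Called by the encoder when convert_blessed is enabled.
sub TO_JSON { my ($self) = @_; return { x => $self->{x}, y => $self->{y} } }

package main;

my $point = Point->new(x => 1, y => 2);

# By default, encoding a blessed object throws an exception.
my $default_ok = eval { JSON::PP->new->encode([$point]); 1 } ? 1 : 0;

# With convert_blessed, the object's TO_JSON method supplies the data.
my $text = JSON::PP->new->canonical->convert_blessed->encode([$point]);

print "default ok: $default_ok, converted: $text\n";
```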
|
2661 |
+
|
2662 |
+
=item simple scalars
|
2663 |
+
|
2664 |
+
Simple Perl scalars (any scalar that is not a reference) are the most
|
2665 |
+
difficult objects to encode: JSON::XS and JSON::PP will encode undefined scalars as
|
2666 |
+
JSON C<null> values, scalars that have last been used in a string context
|
2667 |
+
before encoding as JSON strings, and anything else as a number value:
|
2668 |
+
|
2669 |
+
# dump as number
|
2670 |
+
encode_json [2] # yields [2]
|
2671 |
+
encode_json [-3.0e17] # yields [-3e+17]
|
2672 |
+
my $value = 5; encode_json [$value] # yields [5]
|
2673 |
+
|
2674 |
+
# used as string, so dump as string
|
2675 |
+
print $value;
|
2676 |
+
encode_json [$value] # yields ["5"]
|
2677 |
+
|
2678 |
+
# undef becomes null
|
2679 |
+
encode_json [undef] # yields [null]
|
2680 |
+
|
2681 |
+
You can force the type to be a string by stringifying it:
|
2682 |
+
|
2683 |
+
my $x = 3.1; # some variable containing a number
|
2684 |
+
"$x"; # stringified
|
2685 |
+
$x .= ""; # another, more awkward way to stringify
|
2686 |
+
print $x; # perl does it for you, too, quite often
|
2687 |
+
|
2688 |
+
You can force the type to be a number by numifying it:
|
2689 |
+
|
2690 |
+
my $x = "3"; # some variable containing a string
|
2691 |
+
$x += 0; # numify it, ensuring it will be dumped as a number
|
2692 |
+
$x *= 1; # same thing, the choice is yours.
|
2693 |
+
|
2694 |
+
You cannot currently force the type in other, less obscure ways.
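
The two directions can be sketched together as follows, assuming the stock C<JSON::PP> module. The example stringifies into a fresh variable, which avoids depending on how a particular perl version caches string values on the original scalar.

```perl
use strict;
use warnings;
use JSON::PP;

my $from_text = "3";
$from_text += 0;                 # numify: now encoded as a number

my $number    = 3.1;
my $as_string = "$number";       # stringified copy: encoded as a string

my $text = encode_json([$from_text, $as_string]);

print "$text\n";
```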
|
2695 |
+
|
2696 |
+
Note that numerical precision has the same meaning as under Perl (so
|
2697 |
+
binary to decimal conversion follows the same rules as in Perl, which
|
2698 |
+
can differ from other languages). Also, your perl interpreter might expose
|
2699 |
+
extensions to the floating point numbers of your platform, such as
|
2700 |
+
infinities or NaN's - these cannot be represented in JSON, and it is an
|
2701 |
+
error to pass those in.
|
2702 |
+
|
2703 |
+
=item Big Number
|
2704 |
+
|
2705 |
+
When C<allow_bignum> is enabled,
|
2706 |
+
C<encode> converts C<Math::BigInt> objects and C<Math::BigFloat>
|
2707 |
+
objects into JSON numbers.
|
2708 |
+
|
2709 |
+
|
2710 |
+
=back
|
2711 |
+
|
2712 |
+
=head1 UNICODE HANDLING ON PERLS
|
2713 |
+
|
2714 |
+
If you are not familiar with Unicode handling in Perl,
|
2715 |
+
please check L<JSON::XS/A FEW NOTES ON UNICODE AND PERL>.
|
2716 |
+
|
2717 |
+
=head2 Perl 5.8 and later
|
2718 |
+
|
2719 |
+
Perl can handle Unicode and the JSON::PP de/encode methods also work properly.
|
2720 |
+
|
2721 |
+
$json->allow_nonref->encode(chr hex 3042);
|
2722 |
+
$json->allow_nonref->encode(chr hex 12345);
|
2723 |
+
|
2724 |
+
With the C<ascii> option enabled, these return C<"\u3042"> and C<"\ud808\udf45"> respectively.
|
2725 |
+
|
2726 |
+
$json->allow_nonref->decode('"\u3042"');
|
2727 |
+
$json->allow_nonref->decode('"\ud808\udf45"');
|
2728 |
+
|
2729 |
+
These return strings with the UTF8 flag set, containing the characters C<U+3042> and C<U+12345>.
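
A runnable sketch of the round trip on a modern perl, assuming the stock C<JSON::PP> module. The C<ascii> option is enabled here so that the encoder emits C<\u> escapes instead of the raw characters.

```perl
use strict;
use warnings;
use JSON::PP;

my $ascii = JSON::PP->new->allow_nonref->ascii;

# Characters outside ASCII are escaped; code points above U+FFFF
# become a surrogate pair.
my $bmp    = $ascii->encode(chr 0x3042);
my $astral = $ascii->encode(chr 0x12345);

# Decoding a surrogate pair yields a single character.
my $char = JSON::PP->new->allow_nonref->decode('"\ud808\udf45"');

printf "%s %s U+%X\n", $bmp, $astral, ord $char;
```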
|
2730 |
+
|
2731 |
+
Note that in Perl 5.8.0 through 5.8.2 the built-in C<join> was broken,
|
2732 |
+
so JSON::PP wraps C<join> in a subroutine. As a result, JSON::PP is slower on those versions.
|
2733 |
+
|
2734 |
+
|
2735 |
+
=head2 Perl 5.6
|
2736 |
+
|
2737 |
+
Perl can handle Unicode and the JSON::PP de/encode methods also work.
|
2738 |
+
|
2739 |
+
=head2 Perl 5.005
|
2740 |
+
|
2741 |
+
Perl 5.005 is a byte semantics world -- all strings are sequences of bytes.
|
2742 |
+
That means Unicode handling is not available.
|
2743 |
+
|
2744 |
+
When encoding,
|
2745 |
+
|
2746 |
+
$json->allow_nonref->encode(chr hex 3042); # hex 3042 is 12354.
|
2747 |
+
$json->allow_nonref->encode(chr hex 12345); # hex 12345 is 74565.
|
2748 |
+
|
2749 |
+
These return C<B> and C<E>: when C<chr> is given a value greater than 255, it treats it
|
2750 |
+
as C<$value % 256>, so the code above is equivalent to:
|
2751 |
+
|
2752 |
+
$json->allow_nonref->encode(chr 66);
|
2753 |
+
$json->allow_nonref->encode(chr 69);
|
2754 |
+
|
2755 |
+
When decoding,
|
2756 |
+
|
2757 |
+
$json->decode('"\u00e3\u0081\u0082"');
|
2758 |
+
|
2759 |
+
The returned value is the byte sequence C<0xE3 0x81 0x82>, the UTF-8 encoding of the
|
2760 |
+
Japanese character (C<HIRAGANA LETTER A>).
|
2761 |
+
As a Unicode code point, this character is C<U+3042>.
|
2762 |
+
|
2763 |
+
Next,
|
2764 |
+
|
2765 |
+
$json->decode('"\u3042"');
|
2766 |
+
|
2767 |
+
We would ordinarily expect the returned value to be the Unicode character C<U+3042>.
|
2768 |
+
But in the 5.005 world, the result is the byte sequence C<0xE3 0x81 0x82>.
|
2769 |
+
|
2770 |
+
$json->decode('"\ud808\udf45"');
|
2771 |
+
|
2772 |
+
This is not a character C<U+12345> but bytes - C<0xf0 0x92 0x8d 0x85>.
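
The byte sequence above can be reproduced on a modern perl. The sketch below combines a surrogate pair into a code point with the standard formula and then UTF-8 encodes it with the built-in C<utf8::encode>. The helper name C<surrogates_to_codepoint> is invented for this example and is not part of the module.

```perl
use strict;
use warnings;

# Combine a high and a low surrogate (given as hex strings) into the
# code point they represent.
sub surrogates_to_codepoint {
    my ($high, $low) = @_;
    return 0x10000 + (hex($high) - 0xD800) * 0x400 + (hex($low) - 0xDC00);
}

my $codepoint = surrogates_to_codepoint('d808', 'df45');

# Turn the character into its UTF-8 byte sequence.
my $bytes = chr $codepoint;
utf8::encode($bytes);

my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
printf "U+%X -> %s\n", $codepoint, $hex;
```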
|
2773 |
+
|
2774 |
+
|
2775 |
+
=head1 TODO
|
2776 |
+
|
2777 |
+
=over
|
2778 |
+
|
2779 |
+
=item speed
|
2780 |
+
|
2781 |
+
=item memory saving
|
2782 |
+
|
2783 |
+
=back
|
2784 |
+
|
2785 |
+
|
2786 |
+
=head1 SEE ALSO
|
2787 |
+
|
2788 |
+
Most of this document is copied and modified from the JSON::XS documentation.
|
2789 |
+
|
2790 |
+
L<JSON::XS>
|
2791 |
+
|
2792 |
+
RFC4627 (L<http://www.ietf.org/rfc/rfc4627.txt>)
|
2793 |
+
|
2794 |
+
=head1 AUTHOR
|
2795 |
+
|
2796 |
+
Makamaka Hannyaharamitu, E<lt>makamaka[at]cpan.orgE<gt>
|
2797 |
+
|
2798 |
+
|
2799 |
+
=head1 COPYRIGHT AND LICENSE
|
2800 |
+
|
2801 |
+
Copyright 2007-2012 by Makamaka Hannyaharamitu
|
2802 |
+
|
2803 |
+
This library is free software; you can redistribute it and/or modify
|
2804 |
+
it under the same terms as Perl itself.
|
2805 |
+
|
2806 |
+
=cut
|
uroman/lib/JSON/backportPP/Boolean.pm
ADDED
@@ -0,0 +1,27 @@
|
1 |
+
=head1 NAME
|
2 |
+
|
3 |
+
JSON::PP::Boolean - dummy module providing JSON::PP::Boolean
|
4 |
+
|
5 |
+
=head1 SYNOPSIS
|
6 |
+
|
7 |
+
# do not "use" yourself
|
8 |
+
|
9 |
+
=head1 DESCRIPTION
|
10 |
+
|
11 |
+
This module exists only to provide overload resolution for Storable
|
12 |
+
and similar modules. See L<JSON::PP> for more info about this class.
|
13 |
+
|
14 |
+
=cut
|
15 |
+
|
16 |
+
use JSON::backportPP ();
|
17 |
+
use strict;
|
18 |
+
|
19 |
+
1;
|
20 |
+
|
21 |
+
=head1 AUTHOR
|
22 |
+
|
23 |
+
This idea is from L<JSON::XS::Boolean> written by
|
24 |
+
Marc Lehmann <schmorp[at]schmorp.de>
|
25 |
+
|
26 |
+
=cut
|
27 |
+
|
uroman/lib/JSON/backportPP/Compat5005.pm
ADDED
@@ -0,0 +1,131 @@
|
1 |
+
package # This is JSON::backportPP
|
2 |
+
JSON::backportPP5005;
|
3 |
+
|
4 |
+
use 5.005;
|
5 |
+
use strict;
|
6 |
+
|
7 |
+
my @properties;
|
8 |
+
|
9 |
+
$JSON::PP5005::VERSION = '1.10';
|
10 |
+
|
11 |
+
BEGIN {
|
12 |
+
|
13 |
+
sub utf8::is_utf8 {
|
14 |
+
0; # The UTF8 flag is always considered off in Perl 5.005.
|
15 |
+
}
|
16 |
+
|
17 |
+
sub utf8::upgrade {
|
18 |
+
}
|
19 |
+
|
20 |
+
sub utf8::downgrade {
|
21 |
+
1; # must always return true.
|
22 |
+
}
|
23 |
+
|
24 |
+
sub utf8::encode {
|
25 |
+
}
|
26 |
+
|
27 |
+
sub utf8::decode {
|
28 |
+
}
|
29 |
+
|
30 |
+
*JSON::PP::JSON_PP_encode_ascii = \&_encode_ascii;
|
31 |
+
*JSON::PP::JSON_PP_encode_latin1 = \&_encode_latin1;
|
32 |
+
*JSON::PP::JSON_PP_decode_surrogates = \&_decode_surrogates;
|
33 |
+
*JSON::PP::JSON_PP_decode_unicode = \&_decode_unicode;
|
34 |
+
|
35 |
+
# missing in B module.
|
36 |
+
sub B::SVp_IOK () { 0x01000000; }
|
37 |
+
sub B::SVp_NOK () { 0x02000000; }
|
38 |
+
sub B::SVp_POK () { 0x04000000; }
|
39 |
+
|
40 |
+
$INC{'bytes.pm'} = 1; # dummy
|
41 |
+
}
|
42 |
+
|
43 |
+
|
44 |
+
|
45 |
+
sub _encode_ascii {
|
46 |
+
join('', map { $_ <= 127 ? chr($_) : sprintf('\u%04x', $_) } unpack('C*', $_[0]) );
|
47 |
+
}
|
48 |
+
|
49 |
+
|
50 |
+
sub _encode_latin1 {
|
51 |
+
join('', map { chr($_) } unpack('C*', $_[0]) );
|
52 |
+
}
|
53 |
+
|
54 |
+
|
55 |
+
sub _decode_surrogates { # from http://homepage1.nifty.com/nomenclator/unicode/ucs_utf.htm
|
56 |
+
my $uni = 0x10000 + (hex($_[0]) - 0xD800) * 0x400 + (hex($_[1]) - 0xDC00); # from perlunicode
|
57 |
+
my $bit = unpack('B32', pack('N', $uni));
|
58 |
+
|
59 |
+
if ( $bit =~ /^00000000000(...)(......)(......)(......)$/ ) {
|
60 |
+
my ($w, $x, $y, $z) = ($1, $2, $3, $4);
|
61 |
+
return pack('B*', sprintf('11110%s10%s10%s10%s', $w, $x, $y, $z));
|
62 |
+
}
|
63 |
+
else {
|
64 |
+
Carp::croak("Invalid surrogate pair");
|
65 |
+
}
|
66 |
+
}
|
67 |
+
|
68 |
+
|
69 |
+
sub _decode_unicode {
|
70 |
+
my ($u) = @_;
|
71 |
+
my ($utf8bit);
|
72 |
+
|
73 |
+
if ( $u =~ /^00([89a-f][0-9a-f])$/i ) { # 0x80-0xff
|
74 |
+
return pack( 'H2', $1 );
|
75 |
+
}
|
76 |
+
|
77 |
+
my $bit = unpack("B*", pack("H*", $u));
|
78 |
+
|
79 |
+
if ( $bit =~ /^00000(.....)(......)$/ ) {
|
80 |
+
$utf8bit = sprintf('110%s10%s', $1, $2);
|
81 |
+
}
|
82 |
+
elsif ( $bit =~ /^(....)(......)(......)$/ ) {
|
83 |
+
$utf8bit = sprintf('1110%s10%s10%s', $1, $2, $3);
|
84 |
+
}
|
85 |
+
else {
|
86 |
+
Carp::croak("Invalid escaped unicode");
|
87 |
+
}
|
88 |
+
|
89 |
+
return pack('B*', $utf8bit);
|
90 |
+
}
|
91 |
+
|
92 |
+
|
93 |
+
sub JSON::PP::incr_text {
|
94 |
+
$_[0]->{_incr_parser} ||= JSON::PP::IncrParser->new;
|
95 |
+
|
96 |
+
if ( $_[0]->{_incr_parser}->{incr_parsing} ) {
|
97 |
+
Carp::croak("incr_text can not be called when the incremental parser already started parsing");
|
98 |
+
}
|
99 |
+
|
100 |
+
$_[0]->{_incr_parser}->{incr_text} = $_[1] if ( @_ > 1 );
|
101 |
+
$_[0]->{_incr_parser}->{incr_text};
|
102 |
+
}
|
103 |
+
|
104 |
+
|
105 |
+
1;
|
106 |
+
__END__
|
107 |
+
|
108 |
+
=pod
|
109 |
+
|
110 |
+
=head1 NAME
|
111 |
+
|
112 |
+
JSON::PP5005 - Helper module for using JSON::PP with Perl 5.005
|
113 |
+
|
114 |
+
=head1 DESCRIPTION
|
115 |
+
|
116 |
+
This module is called internally by JSON::PP.
|
117 |
+
|
118 |
+
=head1 AUTHOR
|
119 |
+
|
120 |
+
Makamaka Hannyaharamitu, E<lt>makamaka[at]cpan.orgE<gt>
|
121 |
+
|
122 |
+
|
123 |
+
=head1 COPYRIGHT AND LICENSE
|
124 |
+
|
125 |
+
Copyright 2007-2012 by Makamaka Hannyaharamitu
|
126 |
+
|
127 |
+
This library is free software; you can redistribute it and/or modify
|
128 |
+
it under the same terms as Perl itself.
|
129 |
+
|
130 |
+
=cut
|
131 |
+
|
uroman/lib/JSON/backportPP/Compat5006.pm
ADDED
@@ -0,0 +1,173 @@
|
1 |
+
package # This is JSON::backportPP
|
2 |
+
JSON::backportPP56;
|
3 |
+
|
4 |
+
use 5.006;
|
5 |
+
use strict;
|
6 |
+
|
7 |
+
my @properties;
|
8 |
+
|
9 |
+
$JSON::PP56::VERSION = '1.08';
|
10 |
+
|
11 |
+
BEGIN {
|
12 |
+
|
13 |
+
sub utf8::is_utf8 {
|
14 |
+
my $len = length $_[0]; # char length
|
15 |
+
{
|
16 |
+
use bytes; # byte length;
|
17 |
+
return $len != length $_[0]; # if !=, UTF8-flagged on.
|
18 |
+
}
|
19 |
+
}
|
20 |
+
|
21 |
+
|
22 |
+
sub utf8::upgrade {
|
23 |
+
; # noop;
|
24 |
+
}
|
25 |
+
|
26 |
+
|
27 |
+
sub utf8::downgrade ($;$) {
|
28 |
+
return 1 unless ( utf8::is_utf8( $_[0] ) );
|
29 |
+
|
30 |
+
if ( _is_valid_utf8( $_[0] ) ) {
|
31 |
+
my $downgrade;
|
32 |
+
for my $c ( unpack( "U*", $_[0] ) ) {
|
33 |
+
if ( $c < 256 ) {
|
34 |
+
$downgrade .= pack("C", $c);
|
35 |
+
}
|
36 |
+
else {
|
37 |
+
$downgrade .= pack("U", $c);
|
38 |
+
}
|
39 |
+
}
|
40 |
+
$_[0] = $downgrade;
|
41 |
+
return 1;
|
42 |
+
}
|
43 |
+
else {
|
44 |
+
Carp::croak("Wide character in subroutine entry") unless ( $_[1] );
|
45 |
+
0;
|
46 |
+
}
|
47 |
+
}
|
48 |
+
|
49 |
+
|
50 |
+
sub utf8::encode ($) { # UTF8 flag off
|
51 |
+
if ( utf8::is_utf8( $_[0] ) ) {
|
52 |
+
$_[0] = pack( "C*", unpack( "C*", $_[0] ) );
|
53 |
+
}
|
54 |
+
else {
|
55 |
+
$_[0] = pack( "U*", unpack( "C*", $_[0] ) );
|
56 |
+
$_[0] = pack( "C*", unpack( "C*", $_[0] ) );
|
57 |
+
}
|
58 |
+
}
|
59 |
+
|
60 |
+
|
61 |
+
sub utf8::decode ($) { # UTF8 flag on
|
62 |
+
if ( _is_valid_utf8( $_[0] ) ) {
|
63 |
+
utf8::downgrade( $_[0] );
|
64 |
+
$_[0] = pack( "U*", unpack( "U*", $_[0] ) );
|
65 |
+
}
|
66 |
+
}
|
67 |
+
|
68 |
+
|
69 |
+
*JSON::PP::JSON_PP_encode_ascii = \&_encode_ascii;
|
70 |
+
*JSON::PP::JSON_PP_encode_latin1 = \&_encode_latin1;
|
71 |
+
*JSON::PP::JSON_PP_decode_surrogates = \&JSON::PP::_decode_surrogates;
|
72 |
+
*JSON::PP::JSON_PP_decode_unicode = \&JSON::PP::_decode_unicode;
|
73 |
+
|
74 |
+
unless ( defined &B::SVp_NOK ) { # missing in B module.
|
75 |
+
eval q{ sub B::SVp_NOK () { 0x02000000; } };
|
76 |
+
}
|
77 |
+
|
78 |
+
}
|
79 |
+
|
80 |
+
|
81 |
+
|
82 |
+
sub _encode_ascii {
|
83 |
+
join('',
|
84 |
+
map {
|
85 |
+
$_ <= 127 ?
|
86 |
+
chr($_) :
|
87 |
+
$_ <= 65535 ?
|
88 |
+
sprintf('\u%04x', $_) : sprintf('\u%x\u%x', JSON::PP::_encode_surrogates($_));
|
89 |
+
} _unpack_emu($_[0])
|
90 |
+
);
|
91 |
+
}
|
92 |
+
|
93 |
+
|
94 |
+
sub _encode_latin1 {
|
95 |
+
join('',
|
96 |
+
map {
|
97 |
+
$_ <= 255 ?
|
98 |
+
chr($_) :
|
99 |
+
$_ <= 65535 ?
|
100 |
+
sprintf('\u%04x', $_) : sprintf('\u%x\u%x', JSON::PP::_encode_surrogates($_));
|
101 |
+
} _unpack_emu($_[0])
|
102 |
+
);
|
103 |
+
}
|
104 |
+
|
105 |
+
|
106 |
+
sub _unpack_emu { # for Perl 5.6 unpack warnings
|
107 |
+
return !utf8::is_utf8($_[0]) ? unpack('C*', $_[0])
|
108 |
+
: _is_valid_utf8($_[0]) ? unpack('U*', $_[0])
|
109 |
+
: unpack('C*', $_[0]);
|
110 |
+
}
|
111 |
+
|
112 |
+
|
113 |
+
sub _is_valid_utf8 {
|
114 |
+
my $str = $_[0];
|
115 |
+
my $is_utf8;
|
116 |
+
|
117 |
+
while ($str =~ /(?:
|
118 |
+
(
|
119 |
+
[\x00-\x7F]
|
120 |
+
|[\xC2-\xDF][\x80-\xBF]
|
121 |
+
|[\xE0][\xA0-\xBF][\x80-\xBF]
|
122 |
+
|[\xE1-\xEC][\x80-\xBF][\x80-\xBF]
|
123 |
+
|[\xED][\x80-\x9F][\x80-\xBF]
|
124 |
+
|[\xEE-\xEF][\x80-\xBF][\x80-\xBF]
|
125 |
+
|[\xF0][\x90-\xBF][\x80-\xBF][\x80-\xBF]
|
126 |
+
|[\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|
127 |
+
|[\xF4][\x80-\x8F][\x80-\xBF][\x80-\xBF]
|
128 |
+
)
|
129 |
+
| (.)
|
130 |
+
)/xg)
|
131 |
+
{
|
132 |
+
if (defined $1) {
|
133 |
+
$is_utf8 = 1 if (!defined $is_utf8);
|
134 |
+
}
|
135 |
+
else {
|
136 |
+
$is_utf8 = 0 if (!defined $is_utf8);
|
137 |
+
if ($is_utf8) { # eventually, not utf8
|
138 |
+
return;
|
139 |
+
}
|
140 |
+
}
|
141 |
+
}
|
142 |
+
|
143 |
+
return $is_utf8;
|
144 |
+
}
|
145 |
+
|
146 |
+
|
147 |
+
1;
|
148 |
+
__END__
|
149 |
+
|
150 |
+
=pod
|
151 |
+
|
152 |
+
=head1 NAME
|
153 |
+
|
154 |
+
JSON::PP56 - Helper module for using JSON::PP with Perl 5.6
|
155 |
+
|
156 |
+
=head1 DESCRIPTION
|
157 |
+
|
158 |
+
This module is called internally by JSON::PP.
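
For comparison, on a modern perl the question answered by this module's C<_is_valid_utf8> helper (is a byte string well-formed UTF-8?) can be approximated with the built-in C<utf8::decode>, which returns false for malformed input. This is a sketch of an alternative technique, not the regex-based check used in this module, and the name C<looks_like_utf8> is invented for the example.

```perl
use strict;
use warnings;

# Return 1 if the byte string decodes cleanly as UTF-8, 0 otherwise.
# Works on a copy so the caller's string is left untouched.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    return utf8::decode($copy) ? 1 : 0;
}

my $complete  = looks_like_utf8("\xE3\x81\x82");   # HIRAGANA LETTER A
my $truncated = looks_like_utf8("\xE3\x81");       # last byte missing

print "complete: $complete, truncated: $truncated\n";
```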
|
159 |
+
|
160 |
+
=head1 AUTHOR
|
161 |
+
|
162 |
+
Makamaka Hannyaharamitu, E<lt>makamaka[at]cpan.orgE<gt>
|
163 |
+
|
164 |
+
|
165 |
+
=head1 COPYRIGHT AND LICENSE
|
166 |
+
|
167 |
+
Copyright 2007-2012 by Makamaka Hannyaharamitu
|
168 |
+
|
169 |
+
This library is free software; you can redistribute it and/or modify
|
170 |
+
it under the same terms as Perl itself.
|
171 |
+
|
172 |
+
=cut
|
173 |
+
|
uroman/lib/NLP/Chinese.pm
ADDED
@@ -0,0 +1,239 @@
|
1 |
+
################################################################
|
2 |
+
# #
|
3 |
+
# Chinese #
|
4 |
+
# #
|
5 |
+
################################################################
|
6 |
+
|
7 |
+
package NLP::Chinese;
|
8 |
+
|
9 |
+
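$utf8 = NLP::UTF8;
$util = NLP::utilities; # $util->trim and $util->member are used below but $util was never set in this package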
|
10 |
+
%empty_ht = ();
|
11 |
+
|
12 |
+
sub read_chinese_tonal_pinyin_files {
|
13 |
+
local($caller, *ht, @filenames) = @_;
|
14 |
+
|
15 |
+
$n_kHanyuPinlu = 0;
|
16 |
+
$n_kXHC1983 = 0;
|
17 |
+
$n_kHanyuPinyin = 0;
|
18 |
+
$n_kMandarin = 0;
|
19 |
+
$n_cedict = 0;
|
20 |
+
$n_simple_pinyin = 0;
|
21 |
+
|
22 |
+
foreach $filename (@filenames) {
|
23 |
+
if ($filename =~ /unihan/i) {
|
24 |
+
my $line_number = 0;
|
25 |
+
if (open(IN, $filename)) {
|
26 |
+
while (<IN>) {
|
27 |
+
$line_number++;
|
28 |
+
next if /^#/;
|
29 |
+
s/\s*$//;
|
30 |
+
if (($u, $type, $value) = split(/\t/, $_)) {
|
31 |
+
if ($type =~ /^(kHanyuPinlu|kXHC1983|kHanyuPinyin|kMandarin)$/) {
|
32 |
+
$u = $util->trim($u);
|
33 |
+
$type = $util->trim($type);
|
34 |
+
$value = $util->trim($value);
|
35 |
+
$f = $utf8->unicode_string2string($u);
|
36 |
+
|
37 |
+
if ($type eq "kHanyuPinlu") {
|
38 |
+
$value =~ s/\(.*?\)//g;
|
39 |
+
$value = $util->trim($value);
|
40 |
+
$translit = $caller->number_to_accent_tone($value);
|
41 |
+
$ht{"kHanyuPinlu"}->{$f} = $translit;
|
42 |
+
$n_kHanyuPinlu++;
|
43 |
+
} elsif ($type eq "kXHC1983") {
|
44 |
+
@translits = ($value =~ /:(\S+)/g);
|
45 |
+
$translit = join(" ", @translits);
|
46 |
+
$ht{"kXHC1983"}->{$f} = $translit;
|
47 |
+
$n_kXHC1983++;
|
48 |
+
} elsif ($type eq "kHanyuPinyin") {
|
49 |
+
$value =~ s/^.*://;
|
50 |
+
$value =~ s/,/ /g;
|
51 |
+
$ht{"kHanyuPinyin"}->{$f} = $value;
|
52 |
+
$n_kHanyuPinyin++;
|
53 |
+
} elsif ($type eq "kMandarin") {
|
54 |
+
$ht{"kMandarin"}->{$f} = $value;
|
55 |
+
$n_kMandarin++;
|
56 |
+
}
|
57 |
+
}
|
58 |
+
}
|
59 |
+
}
|
60 |
+
close(IN);
|
61 |
+
print "Read in $n_kHanyuPinlu kHanyuPinlu, $n_kXHC1983 n_kXHC1983, $n_kHanyuPinyin n_kHanyuPinyin $n_kMandarin n_kMandarin\n";
|
62 |
+
} else {
|
63 |
+
print STDERR "Can't open $filename\n";
|
64 |
+
}
|
65 |
+
} elsif ($filename =~ /cedict/i) {
|
66 |
+
if (open(IN, $filename)) {
|
67 |
+
my $line_number = 0;
|
68 |
+
while (<IN>) {
|
69 |
+
$line_number++;
|
70 |
+
next if /^#/;
|
71 |
+
s/\s*$//;
|
72 |
+
if (($f, $translit) = ($_ =~ /^\S+\s+(\S+)\s+\[([^\[\]]+)\]/)) {
|
73 |
+
$translit = $utf8->extended_lower_case($translit);
|
74 |
+
$translit = $caller->number_to_accent_tone($translit);
|
75 |
+
$translit =~ s/\s//g;
|
76 |
+
if ($old_translit = $ht{"cedict"}->{$f}) {
|
77 |
+
# $ht{CONFLICT}->{("DUPLICATE " . $f)} = "CEDICT($f): $old_translit\nCEDICT($f): $translit (duplicate)\n" unless $translit eq $old_translit;
|
78 |
+
$ht{"cedicts"}->{$f} = join(" ", $ht{"cedicts"}->{$f}, $translit) unless $old_translit eq $translit;
|
79 |
+
} else {
|
80 |
+
$ht{"cedict"}->{$f} = $translit;
|
81 |
+
$ht{"cedicts"}->{$f} = $translit;
|
82 |
+
}
|
83 |
+
$n_cedict++;
|
84 |
+
}
|
85 |
+
}
|
86 |
+
close(IN);
|
87 |
+
# print "Read in $n_cedict n_cedict\n";
|
88 |
+
} else {
|
89 |
+
print STDERR "Can't open $filename\n";
|
90 |
+
}
|
91 |
+
} elsif ($filename =~ /chinese_to_pinyin/i) {
|
92 |
+
if (open(IN, $filename)) {
|
93 |
+
my $line_number = 0;
|
94 |
+
while (<IN>) {
|
95 |
+
$line_number++;
|
96 |
+
next if /^#/;
|
97 |
+
if (($f, $translit) = ($_ =~ /^(\S+)\t(\S+)\s*$/)) {
|
98 |
+
$ht{"simple_pinyin"}->{$f} = $translit;
|
99 |
+
$n_simple_pinyin++;
|
100 |
+
}
|
101 |
+
}
|
102 |
+
close(IN);
|
103 |
+
# print "Read in $n_simple_pinyin n_simple_pinyin\n";
|
104 |
+
} else {
|
105 |
+
print STDERR "Can't open $filename\n";
|
106 |
+
}
|
107 |
+
} else {
|
108 |
+
print STDERR "Don't know what to do with file $filename (in read_chinese_tonal_pinyin_files)\n";
|
109 |
+
}
|
110 |
+
}
|
111 |
+
}
|
112 |
+
|
113 |
+
sub tonal_pinyin {
|
114 |
+
local($caller, $s, *ht, $gloss) = @_;
|
115 |
+
|
116 |
+
return $result if defined($result = $ht{COMBINED}->{$s});
|
117 |
+
|
118 |
+
$cedict_pinyin = $ht{"cedict"}->{$s} || "";
|
119 |
+
$cedicts_pinyin = $ht{"cedicts"}->{$s} || "";
|
120 |
+
$unihan_pinyin = "";
|
121 |
+
@characters = $utf8->split_into_utf8_characters($s, "return only chars", *empty_ht);
|
122 |
+
foreach $c (@characters) {
|
123 |
+
if ($pinyin = $ht{"simple_pinyin"}->{$c}) {
|
124 |
+
$unihan_pinyin .= $pinyin;
|
125 |
+
} elsif ($pinyin = $ht{"kHanyuPinlu"}->{$c}) {
|
126 |
+
$pinyin =~ s/^(\S+)\s.*$/$1/;
|
127 |
+
$unihan_pinyin .= $pinyin;
|
128 |
+
} elsif ($pinyin = $ht{"kXHC1983"}->{$c}) {
|
129 |
+
$pinyin =~ s/^(\S+)\s.*$/$1/;
|
130 |
+
$unihan_pinyin .= $pinyin;
|
131 |
+
} elsif ($pinyin = $ht{"kHanyuPinyin"}->{$c}) {
|
132 |
+
$pinyin =~ s/^(\S+)\s.*$/$1/;
|
133 |
+
$unihan_pinyin .= $pinyin;
|
134 |
+
} elsif ($pinyin = $ht{"cedicts"}->{$c}) {
|
135 |
+
$pinyin =~ s/^(\S+)\s.*$/$1/;
|
136 |
+
$unihan_pinyin .= $pinyin;
|
137 |
+
# middle dot, katakana middle dot, multiplication sign
|
138 |
+
} elsif ($c =~ /^(\xC2\xB7|\xE3\x83\xBB|\xC3\x97)$/) {
|
139 |
+
$unihan_pinyin .= $c;
|
140 |
+
# ASCII
|
141 |
+
} elsif ($c =~ /^([\x21-\x7E])$/) {
|
142 |
+
$unihan_pinyin .= $c;
|
143 |
+
} else {
|
144 |
+
$unihan_pinyin .= "?";
|
145 |
+
$hex = $utf8->utf8_to_hex($c);
|
146 |
+
$unicode = uc $utf8->utf8_to_4hex_unicode($c);
|
147 |
+
# print STDERR "Tonal pinyin: Unknown character $c ($hex/U+$unicode) -> ?\n";
|
148 |
+
}
|
149 |
+
}
|
150 |
+
$pinyin_title = "";
|
151 |
+
if (($#characters >= 1) && $cedicts_pinyin) {
|
152 |
+
foreach $pinyin (split(/\s+/, $cedicts_pinyin)) {
|
153 |
+
$pinyin_title .= "$s $pinyin (CEDICT)\n";
|
154 |
+
}
|
155 |
+
$pinyin_title .= "\n";
|
156 |
+
}
|
157 |
+
foreach $c (@characters) {
|
158 |
+
my %local_ht = ();
|
159 |
+
@pinyins = ();
|
160 |
+
foreach $type (("kHanyuPinlu", "kXHC1983", "kHanyuPinyin", "cedicts")) {
|
161 |
+
if ($pinyin_s = $ht{$type}->{$c}) {
|
162 |
+
foreach $pinyin (split(/\s+/, $pinyin_s)) {
|
163 |
+
push(@pinyins, $pinyin) unless $util->member($pinyin, @pinyins);
|
164 |
+
$type2 = ($type eq "cedicts") ? "CEDICT" : $type;
|
165 |
+
$local_ht{$pinyin} = ($local_ht{$pinyin}) ? join(", ", $local_ht{$pinyin}, $type2) : $type2;
|
166 |
+
}
|
167 |
+
}
|
168 |
+
}
|
169 |
+
foreach $pinyin (@pinyins) {
|
170 |
+
$type_s = $local_ht{$pinyin};
|
171 |
+
$pinyin_title .= "$c $pinyin ($type_s)\n";
|
172 |
+
}
|
173 |
+
}
|
174 |
+
$pinyin_title =~ s/\n$//;
|
175 |
+
$pinyin_title =~ s/\n/&#xA;/g;
|
176 |
+
$unihan_pinyin = "" if $unihan_pinyin =~ /^\?+$/;
|
177 |
+
if (($#characters >= 1) && $cedict_pinyin && $unihan_pinyin && ($unihan_pinyin ne $cedict_pinyin)) {
|
178 |
+
$log = "Gloss($s): $gloss\nCEdict($s): $cedicts_pinyin\nUnihan($s): $unihan_pinyin\n";
|
179 |
+
foreach $type (("kHanyuPinlu", "kXHC1983", "kHanyuPinyin")) {
|
180 |
+
$log_line = "$type($s): ";
|
181 |
+
foreach $c (@characters) {
|
182 |
+
$pinyin = $ht{$type}->{$c} || "";
|
183 |
+
if ($pinyin =~ / /) {
|
184 |
+
$log_line .= "($pinyin)";
|
185 |
+
} elsif ($pinyin) {
|
186 |
+
$log_line .= $pinyin;
|
187 |
+
} else {
|
188 |
+
$log_line .= "?";
|
189 |
+
}
|
190 |
+
}
|
191 |
+
$log .= "$log_line\n";
|
192 |
+
}
|
193 |
+
$ht{CONFLICT}->{$s} = $log;
|
194 |
+
}
|
195 |
+
$result = $unihan_pinyin || $cedict_pinyin;
|
196 |
+
$result = $cedict_pinyin if ($#characters > 0) && $cedict_pinyin;
|
197 |
+
$ht{COMBINED}->{$s} = $result;
|
198 |
+
$ht{PINYIN_TITLE}->{$s} = $pinyin_title;
|
199 |
+
return $result;
|
200 |
+
}
|
201 |
+
|
202 |
+
%number_to_accent_tone_ht = (
|
203 |
+
"a1", "\xC4\x81", "a2", "\xC3\xA1", "a3", "\xC7\x8E", "a4", "\xC3\xA0",
|
204 |
+
"e1", "\xC4\x93", "e2", "\xC3\xA9", "e3", "\xC4\x9B", "e4", "\xC3\xA8",
|
205 |
+
"i1", "\xC4\xAB", "i2", "\xC3\xAD", "i3", "\xC7\x90", "i4", "\xC3\xAC",
|
206 |
+
"o1", "\xC5\x8D", "o2", "\xC3\xB3", "o3", "\xC7\x92", "o4", "\xC3\xB2",
|
207 |
+
"u1", "\xC5\xAB", "u2", "\xC3\xBA", "u3", "\xC7\x94", "u4", "\xC3\xB9",
|
208 |
+
"u:1","\xC7\x96", "u:2","\xC7\x98", "u:3","\xC7\x9A", "u:4","\xC7\x9C",
|
209 |
+
"\xC3\xBC1","\xC7\x96","\xC3\xBC2","\xC7\x98","\xC3\xBC3","\xC7\x9A","\xC3\xBC4","\xC7\x9C"
|
210 |
+
);
|
211 |
+
|
212 |
+
sub number_to_accent_tone {
|
213 |
+
local($caller, $s) = @_;
|
214 |
+
|
215 |
+
my $result = "";
|
216 |
+
while (($pre,$alpha,$tone_number,$rest) = ($s =~ /^(.*?)((?:[a-z]|u:|\xC3\xBC)+)([1-5])(.*)$/i)) {
|
217 |
+
if ($tone_number eq "5") {
|
218 |
+
$result .= "$pre$alpha";
|
219 |
+
} elsif ((($pre_acc,$acc_letter,$post_acc) = ($alpha =~ /^(.*)([ae])(.*)$/))
|
220 |
+
|| (($pre_acc,$acc_letter,$post_acc) = ($alpha =~ /^(.*)(o)(u.*)$/))
|
221 |
+
|| (($pre_acc,$acc_letter,$post_acc) = ($alpha =~ /^(.*)(u:|[iou]|\xC3\xBC)([^aeiou]*)$/))) {
|
222 |
+
$result .= "$pre$pre_acc" . ($number_to_accent_tone_ht{($acc_letter . $tone_number)} || ($acc_letter . $tone_number)) . $post_acc;
|
223 |
+
} else {
|
224 |
+
$result .= "$pre$alpha$tone_number";
|
225 |
+
}
|
226 |
+
$s = $rest;
|
227 |
+
}
|
228 |
+
$result .= $s;
|
229 |
+
$result =~ s/u:/\xC3\xBC/g;
|
230 |
+
return $result;
|
231 |
+
}
|
232 |
+
|
233 |
+
sub string_contains_utf8_cjk_unified_ideograph_p {
|
234 |
+
local($caller, $s) = @_;
|
235 |
+
|
236 |
+
return ($s =~ /([\xE4-\xE9]|\xE3[\x90-\xBF]|\xF0[\xA0-\xAC])/);
|
237 |
+
}
|
238 |
+
|
239 |
+
1;
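
The tone-number conversion performed by number_to_accent_tone above can be sketched in a simplified, standalone form. This is an illustrative approximation only: the name accent_syllable and the %marked table layout are invented here, it handles one lower-case syllable at a time, and it leaves out the "u:" and ü handling of the original. The placement rules follow the original: "a" or "e" takes the mark, "ou" marks the "o", otherwise the last vowel is marked.

```perl
use strict;
use warnings;

# Tone-marked vowels, indexed by tone number 1..4 (tone 5 is neutral).
my %marked = (
    a => ["\x{101}", "\x{E1}", "\x{1CE}", "\x{E0}"],
    e => ["\x{113}", "\x{E9}", "\x{11B}", "\x{E8}"],
    i => ["\x{12B}", "\x{ED}", "\x{1D0}", "\x{EC}"],
    o => ["\x{14D}", "\x{F3}", "\x{1D2}", "\x{F2}"],
    u => ["\x{16B}", "\x{FA}", "\x{1D4}", "\x{F9}"],
);

# Convert one numbered syllable, e.g. "zhong1", to its accented form.
sub accent_syllable {
    my ($syllable) = @_;
    my ($letters, $tone) = $syllable =~ /^([a-z]+)([1-5])$/
        or return $syllable;                  # not a numbered syllable
    return $letters if $tone == 5;            # neutral tone: drop the digit

    my $mark = sub { $marked{ $_[0] }[ $tone - 1 ] };
    $letters =~ s/([ae])/$mark->($1)/e                          # "a" or "e" takes the mark
        or $letters =~ s/o(?=u)/$mark->('o')/e                  # in "ou", mark the "o"
        or $letters =~ s/([iou])(?=[^aeiou]*$)/$mark->($1)/e;   # otherwise the last vowel
    return $letters;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', map { accent_syllable($_) } qw(zhong1 guo2 ma5)), "\n";
```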
|
uroman/lib/NLP/English.pm
ADDED
The diff for this file is too large to render.
See raw diff
|
|
uroman/lib/NLP/Romanizer.pm
ADDED
@@ -0,0 +1,2020 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
################################################################
|
2 |
+
# #
|
3 |
+
# Romanizer #
|
4 |
+
# #
|
5 |
+
################################################################
|
6 |
+
|
7 |
+
package NLP::Romanizer;
|
8 |
+
|
9 |
+
use NLP::Chinese;
|
10 |
+
use NLP::UTF8;
|
11 |
+
use NLP::utilities;
|
12 |
+
use JSON;
|
13 |
+
$utf8 = NLP::UTF8;
|
14 |
+
$util = NLP::utilities;
|
15 |
+
$chinesePM = NLP::Chinese;
|
16 |
+
|
17 |
+
my $verbosePM = 0;
|
18 |
+
%empty_ht = ();
|
19 |
+
|
20 |
+
my $braille_capital_letter_indicator = "\xE2\xA0\xA0";
|
21 |
+
my $braille_number_indicator = "\xE2\xA0\xBC";
|
22 |
+
my $braille_decimal_point = "\xE2\xA0\xA8";
|
23 |
+
my $braille_comma = "\xE2\xA0\x82";
|
24 |
+
my $braille_solidus = "\xE2\xA0\x8C";
|
25 |
+
my $braille_numeric_space = "\xE2\xA0\x90";
|
26 |
+
my $braille_letter_indicator = "\xE2\xA0\xB0";
|
27 |
+
my $braille_period = "\xE2\xA0\xB2";
|
28 |
+
|
29 |
+
sub new {
|
30 |
+
local($caller) = @_;
|
31 |
+
|
32 |
+
my $object = {};
|
33 |
+
my $class = ref( $caller ) || $caller;
|
34 |
+
bless($object, $class);
|
35 |
+
return $object;
|
36 |
+
}
|
37 |
+
|
38 |
+
sub load_unicode_data {
|
39 |
+
local($this, *ht, $filename) = @_;
|
40 |
+
# ../../data/UnicodeData.txt
|
41 |
+
|
42 |
+
$n = 0;
|
43 |
+
if (open(IN, $filename)) {
|
44 |
+
while (<IN>) {
|
45 |
+
if (($unicode_value, $char_name, $general_category, $canon_comb_classes, $bidir_category, $char_decomp_mapping, $decimal_digit_value, $digit_value, $numeric_value, $mirrored, $unicode_1_0_name, $comment_field, $uc_mapping, $lc_mapping, $title_case_mapping) = split(";", $_)) {
|
46 |
+
$utf8_code = $utf8->unicode_hex_string2string($unicode_value);
|
47 |
+
$ht{UTF_TO_CHAR_NAME}->{$utf8_code} = $char_name;
|
48 |
+
$ht{UTF_NAME_TO_UNICODE}->{$char_name} = $unicode_value;
|
49 |
+
$ht{UTF_NAME_TO_CODE}->{$char_name} = $utf8_code;
|
50 |
+
$ht{UTF_TO_CAT}->{$utf8_code} = $general_category;
|
51 |
+
$ht{UTF_TO_NUMERIC}->{$utf8_code} = $numeric_value unless $numeric_value eq "";
|
52 |
+
$n++;
|
53 |
+
}
|
54 |
+
}
|
55 |
+
close(IN);
|
56 |
+
# print STDERR "Loaded $n entries from $filename\n";
|
57 |
+
} else {
|
58 |
+
print STDERR "Can't open $filename\n";
|
59 |
+
}
|
60 |
+
}
|
61 |
+
|
62 |
+
sub load_unicode_overwrite_romanization {
|
63 |
+
local($this, *ht, $filename) = @_;
|
64 |
+
# ../../data/UnicodeDataOverwrite.txt
|
65 |
+
|
66 |
+
$n = 0;
|
67 |
+
if (open(IN, $filename)) {
|
68 |
+
while (<IN>) {
|
69 |
+
next if /^#/;
|
70 |
+
$unicode_value = $util->slot_value_in_double_colon_del_list($_, "u");
|
71 |
+
$romanization = $util->slot_value_in_double_colon_del_list($_, "r");
|
72 |
+
$numeric = $util->slot_value_in_double_colon_del_list($_, "num");
|
73 |
+
$picture = $util->slot_value_in_double_colon_del_list($_, "pic");
|
74 |
+
$syllable_info = $util->slot_value_in_double_colon_del_list($_, "syllable-info");
|
75 |
+
$tone_mark = $util->slot_value_in_double_colon_del_list($_, "tone-mark");
|
76 |
+
$char_name = $util->slot_value_in_double_colon_del_list($_, "name");
|
77 |
+
$entry_processed_p = 0;
|
78 |
+
$utf8_code = $utf8->unicode_hex_string2string($unicode_value);
|
79 |
+
if ($unicode_value) {
|
80 |
+
$ht{UTF_TO_CHAR_ROMANIZATION}->{$utf8_code} = $romanization if $romanization;
|
81 |
+
$ht{UTF_TO_NUMERIC}->{$utf8_code} = $numeric if defined($numeric) && ($numeric ne "");
|
82 |
+
$ht{UTF_TO_PICTURE_DESCR}->{$utf8_code} = $picture if $picture;
|
83 |
+
$ht{UTF_TO_SYLLABLE_INFO}->{$utf8_code} = $syllable_info if $syllable_info;
|
84 |
+
$ht{UTF_TO_TONE_MARK}->{$utf8_code} = $tone_mark if $tone_mark;
|
85 |
+
$ht{UTF_TO_CHAR_NAME}->{$utf8_code} = $char_name if $char_name;
|
86 |
+
$entry_processed_p = 1 if $romanization || $numeric || $picture || $syllable_info || $tone_mark;
|
87 |
+
}
|
88 |
+
$n++ if $entry_processed_p;
|
89 |
+
}
|
90 |
+
close(IN);
|
91 |
+
} else {
|
92 |
+
print STDERR "Can't open $filename\n";
|
93 |
+
}
|
94 |
+
}
|
95 |
+
|
96 |
+
sub load_script_data {
|
97 |
+
local($this, *ht, $filename) = @_;
|
98 |
+
# ../../data/Scripts.txt
|
99 |
+
|
100 |
+
$n = 0;
|
101 |
+
if (open(IN, $filename)) {
|
102 |
+
while (<IN>) {
|
103 |
+
next unless $script_name = $util->slot_value_in_double_colon_del_list($_, "script-name");
|
104 |
+
$abugida_default_vowel_s = $util->slot_value_in_double_colon_del_list($_, "abugida-default-vowel");
|
105 |
+
$alt_script_name_s = $util->slot_value_in_double_colon_del_list($_, "alt-script-name");
|
106 |
+
$language_s = $util->slot_value_in_double_colon_del_list($_, "language");
|
107 |
+
$direction = $util->slot_value_in_double_colon_del_list($_, "direction"); # right-to-left
|
108 |
+
$font_family_s = $util->slot_value_in_double_colon_del_list($_, "font-family");
|
109 |
+
$ht{SCRIPT_P}->{$script_name} = 1;
|
110 |
+
$ht{SCRIPT_NORM}->{(uc $script_name)} = $script_name;
|
111 |
+
$ht{DIRECTION}->{$script_name} = $direction if $direction;
|
112 |
+
foreach $language (split(/,\s*/, $language_s)) {
|
113 |
+
$ht{SCRIPT_LANGUAGE}->{$script_name}->{$language} = 1;
|
114 |
+
$ht{LANGUAGE_SCRIPT}->{$language}->{$script_name} = 1;
|
115 |
+
}
|
116 |
+
foreach $alt_script_name (split(/,\s*/, $alt_script_name_s)) {
|
117 |
+
$ht{SCRIPT_NORM}->{$alt_script_name} = $script_name;
|
118 |
+
$ht{SCRIPT_NORM}->{(uc $alt_script_name)} = $script_name;
|
119 |
+
}
|
120 |
+
foreach $abugida_default_vowel (split(/,\s*/, $abugida_default_vowel_s)) {
|
121 |
+
$ht{SCRIPT_ABUDIGA_DEFAULT_VOWEL}->{$script_name}->{$abugida_default_vowel} = 1 if $abugida_default_vowel;
|
122 |
+
}
|
123 |
+
foreach $font_family (split(/,\s*/, $font_family_s)) {
|
124 |
+
$ht{SCRIPT_FONT}->{$script_name}->{$font_family} = 1 if $font_family;
|
125 |
+
}
|
126 |
+
$n++;
|
127 |
+
}
|
128 |
+
close(IN);
|
129 |
+
# print STDERR "Loaded $n entries from $filename\n";
|
130 |
+
} else {
|
131 |
+
print STDERR "Can't open $filename\n";
|
132 |
+
}
|
133 |
+
}
|
134 |
+
|
135 |
+
sub unicode_hangul_romanization {
|
136 |
+
local($this, $s, $pass_through_p) = @_;
|
137 |
+
|
138 |
+
$pass_through_p = 0 unless defined($pass_through_p);
|
139 |
+
@leads = split(/\s+/, "g gg n d dd r m b bb s ss - j jj c k t p h");
|
140 |
+
# @vowels = split(/\s+/, "a ae ya yai e ei ye yei o oa oai oi yo u ue uei ui yu w wi i");
|
141 |
+
@vowels = split(/\s+/, "a ae ya yae eo e yeo ye o wa wai oe yo u weo we wi yu eu yi i");
|
142 |
+
@tails = split(/\s+/, "- g gg gs n nj nh d l lg lm lb ls lt lp lh m b bs s ss ng j c k t p h");
|
143 |
+
$result = "";
|
144 |
+
@chars = $utf8->split_into_utf8_characters($s, "return only chars", *empty_ht);
|
145 |
+
foreach $char (@chars) {
|
146 |
+
$unicode = $utf8->utf8_to_unicode($char);
|
147 |
+
if (($unicode >= 0xAC00) && ($unicode <= 0xD7A3)) {
|
148 |
+
$code = $unicode - 0xAC00;
|
149 |
+
$lead_index = int($code / (28*21));
|
150 |
+
$vowel_index = int($code/28) % 21;
|
151 |
+
$tail_index = $code % 28;
|
152 |
+
$rom = $leads[$lead_index] . $vowels[$vowel_index] . $tails[$tail_index];
|
153 |
+
$rom =~ s/-//g;
|
154 |
+
$result .= $rom;
|
155 |
+
} elsif ($pass_through_p) {
|
156 |
+
$result .= $char;
|
157 |
+
}
|
158 |
+
}
|
159 |
+
return $result;
|
160 |
+
}
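The index arithmetic in `unicode_hangul_romanization` can be tried in isolation. The following is a minimal sketch, assuming the input is already a Unicode code point (the original works on UTF-8 strings through `NLP::UTF8`); the helper name `hangul_syllable_to_roman` is hypothetical and not part of the module. The three tables are copied from the sub above.

```perl
use strict;
use warnings;

# Jamo tables copied from unicode_hangul_romanization: 19 leads, 21 vowels, 28 tails.
my @LEADS  = split ' ', "g gg n d dd r m b bb s ss - j jj c k t p h";
my @VOWELS = split ' ', "a ae ya yae eo e yeo ye o wa wai oe yo u weo we wi yu eu yi i";
my @TAILS  = split ' ', "- g gg gs n nj nh d l lg lm lb ls lt lp lh m b bs s ss ng j c k t p h";

# Hypothetical helper: romanize one precomposed Hangul syllable given its code point.
# Returns undef for code points outside the syllable block U+AC00..U+D7A3.
sub hangul_syllable_to_roman {
    my ($code_point) = @_;
    return undef unless $code_point >= 0xAC00 && $code_point <= 0xD7A3;
    my $code        = $code_point - 0xAC00;
    my $lead_index  = int($code / (28 * 21));   # each lead spans 21 vowels x 28 tails
    my $vowel_index = int($code / 28) % 21;     # each vowel spans 28 tails
    my $tail_index  = $code % 28;               # index 0 means "no tail"
    my $rom = $LEADS[$lead_index] . $VOWELS[$vowel_index] . $TAILS[$tail_index];
    $rom =~ s/-//g;                             # "-" marks a silent or absent jamo
    return $rom;
}

# U+D55C and U+AE00 together spell the word "Hangul".
print join("", map { hangul_syllable_to_roman($_) } (0xD55C, 0xAE00)), "\n";  # prints "hangeul"
```

For U+D55C the offset is 10588, which gives lead 18 (`h`), vowel 0 (`a`) and tail 4 (`n`).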
|
161 |
+
|
162 |
+
sub listify_comma_sep_string {
|
163 |
+
local($this, $s) = @_;
|
164 |
+
|
165 |
+
@result_list = ();
|
166 |
+
return @result_list unless $s =~ /\S/;
|
167 |
+
$s = $util->trim2($s);
|
168 |
+
my $elem;
|
169 |
+
|
170 |
+
while (($elem, $rest) = ($s =~ /^("(?:\\"|[^"])*"|'(?:\\'|[^'])*'|[^"', ]+),\s*(.*)$/)) {
|
171 |
+
push(@result_list, $util->dequote_string($elem));
|
172 |
+
$s = $rest;
|
173 |
+
}
|
174 |
+
push(@result_list, $util->dequote_string($s)) if $s =~ /\S/;
|
175 |
+
|
176 |
+
return @result_list;
|
177 |
+
}
|
178 |
+
|
179 |
+
sub braille_string_p {
|
180 |
+
local($this, $s) = @_;
|
181 |
+
|
182 |
+
return ($s =~ /^(\xE2[\xA0-\xA3][\x80-\xBF])+$/);
|
183 |
+
}
|
184 |
+
|
185 |
+
sub register_word_boundary_info {
|
186 |
+
local($this, *ht, $lang_code, $utf8_source_string, $utf8_target_string, $use_only_for_whole_word_p,
|
187 |
+
$use_only_at_start_of_word_p, $use_only_at_end_of_word_p,
|
188 |
+
$dont_use_at_start_of_word_p, $dont_use_at_end_of_word_p) = @_;
|
189 |
+
|
190 |
+
if ($use_only_for_whole_word_p) {
|
191 |
+
if ($lang_code) {
|
192 |
+
$ht{USE_ONLY_FOR_WHOLE_WORD_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
193 |
+
} else {
|
194 |
+
$ht{USE_ONLY_FOR_WHOLE_WORD}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
195 |
+
}
|
196 |
+
}
|
197 |
+
if ($use_only_at_start_of_word_p) {
|
198 |
+
if ($lang_code) {
|
199 |
+
$ht{USE_ONLY_AT_START_OF_WORD_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
200 |
+
} else {
|
201 |
+
$ht{USE_ONLY_AT_START_OF_WORD}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
202 |
+
}
|
203 |
+
}
|
204 |
+
if ($use_only_at_end_of_word_p) {
|
205 |
+
if ($lang_code) {
|
206 |
+
$ht{USE_ONLY_AT_END_OF_WORD_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
207 |
+
} else {
|
208 |
+
$ht{USE_ONLY_AT_END_OF_WORD}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
209 |
+
}
|
210 |
+
}
|
211 |
+
if ($dont_use_at_start_of_word_p) {
|
212 |
+
if ($lang_code) {
|
213 |
+
$ht{DONT_USE_AT_START_OF_WORD_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
214 |
+
} else {
|
215 |
+
$ht{DONT_USE_AT_START_OF_WORD}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
216 |
+
}
|
217 |
+
}
|
218 |
+
if ($dont_use_at_end_of_word_p) {
|
219 |
+
if ($lang_code) {
|
220 |
+
$ht{DONT_USE_AT_END_OF_WORD_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
221 |
+
} else {
|
222 |
+
$ht{DONT_USE_AT_END_OF_WORD}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
223 |
+
}
|
224 |
+
}
|
225 |
+
}
|
226 |
+
|
227 |
+
sub load_romanization_table {
|
228 |
+
local($this, *ht, $filename) = @_;
|
229 |
+
# ../../data/romanization-table.txt
|
230 |
+
|
231 |
+
$n = 0;
|
232 |
+
$line_number = 0;
|
233 |
+
if (open(IN, $filename)) {
|
234 |
+
while (<IN>) {
|
235 |
+
$line_number++;
|
236 |
+
next if /^#/;
|
237 |
+
if ($_ =~ /^::preserve\s/) {
|
238 |
+
$from_unicode = $util->slot_value_in_double_colon_del_list($_, "from");
|
239 |
+
$to_unicode = $util->slot_value_in_double_colon_del_list($_, "to");
|
240 |
+
if ($from_unicode =~ /^(?:U\+|\\u)[0-9A-F]{4,}$/i) {
|
241 |
+
$from_unicode =~ s/^(?:U\+|\\u)//;
|
242 |
+
$from_code_point = hex($from_unicode);
|
243 |
+
} else {
|
244 |
+
$from_code_point = "";
|
245 |
+
}
|
246 |
+
if ($to_unicode =~ /^(?:U\+|\\u)[0-9A-F]{4,}$/i) {
|
247 |
+
$to_unicode =~ s/^(?:U\+|\\u)//;
|
248 |
+
$to_code_point = hex($to_unicode);
|
249 |
+
} else {
|
250 |
+
$to_code_point = $from_code_point;
|
251 |
+
}
|
252 |
+
if ($from_code_point ne "") {
|
253 |
+
# print STDERR "Preserve code-points $from_unicode--$to_unicode = $from_code_point--$to_code_point\n";
|
254 |
+
foreach $code_point (($from_code_point .. $to_code_point)) {
|
255 |
+
$utf8_string = $utf8->unicode2string($code_point);
|
256 |
+
$ht{UTF_CHAR_MAPPING}->{$utf8_string}->{$utf8_string} = 1;
|
257 |
+
}
|
258 |
+
$n++;
|
259 |
+
}
|
260 |
+
next;
|
261 |
+
}
|
262 |
+
$utf8_source_string = $util->slot_value_in_double_colon_del_list($_, "s");
|
263 |
+
$utf8_target_string = $util->slot_value_in_double_colon_del_list($_, "t");
|
264 |
+
$utf8_alt_target_string_s = $util->slot_value_in_double_colon_del_list($_, "t-alt");
|
265 |
+
$use_alt_in_pointed_p = ($_ =~ /::use-alt-in-pointed\b/);
|
266 |
+
$use_only_for_whole_word_p = ($_ =~ /::use-only-for-whole-word\b/);
|
267 |
+
$use_only_at_start_of_word_p = ($_ =~ /::use-only-at-start-of-word\b/);
|
268 |
+
$use_only_at_end_of_word_p = ($_ =~ /::use-only-at-end-of-word\b/);
|
269 |
+
$dont_use_at_start_of_word_p = ($_ =~ /::dont-use-at-start-of-word\b/);
|
270 |
+
$dont_use_at_end_of_word_p = ($_ =~ /::dont-use-at-end-of-word\b/);
|
271 |
+
$use_only_in_lower_case_enviroment_p = ($_ =~ /::use-only-in-lower-case-enviroment\b/);
|
272 |
+
$word_external_punctuation_p = ($_ =~ /::word-external-punctuation\b/);
|
273 |
+
$utf8_source_string =~ s/\s*$//;
|
274 |
+
$utf8_target_string =~ s/\s*$//;
|
275 |
+
$utf8_alt_target_string_s =~ s/\s*$//;
|
276 |
+
$utf8_target_string =~ s/^"(.*)"$/$1/;
|
277 |
+
$utf8_target_string =~ s/^'(.*)'$/$1/;
|
278 |
+
@utf8_alt_targets = $this->listify_comma_sep_string($utf8_alt_target_string_s);
|
279 |
+
$numeric = $util->slot_value_in_double_colon_del_list($_, "num");
|
280 |
+
$numeric =~ s/\s*$//;
|
281 |
+
$annotation = $util->slot_value_in_double_colon_del_list($_, "annotation");
|
282 |
+
$annotation =~ s/\s*$//;
|
283 |
+
$lang_code = $util->slot_value_in_double_colon_del_list($_, "lcode");
|
284 |
+
$prob = $util->slot_value_in_double_colon_del_list($_, "p") || 1;
|
285 |
+
unless (($utf8_target_string eq "") && ($numeric =~ /\d/)) {
|
286 |
+
if ($lang_code) {
|
287 |
+
$ht{UTF_CHAR_MAPPING_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_target_string} = $prob;
|
288 |
+
} else {
|
289 |
+
$ht{UTF_CHAR_MAPPING}->{$utf8_source_string}->{$utf8_target_string} = $prob;
|
290 |
+
}
|
291 |
+
if ($word_external_punctuation_p) {
|
292 |
+
if ($lang_code) {
|
293 |
+
$ht{WORD_EXTERNAL_PUNCTUATION_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_target_string} = $prob;
|
294 |
+
} else {
|
295 |
+
$ht{WORD_EXTERNAL_PUNCTUATION}->{$utf8_source_string}->{$utf8_target_string} = $prob;
|
296 |
+
}
|
297 |
+
}
|
298 |
+
if ($this->braille_string_p($utf8_source_string)) {
|
299 |
+
if (($utf8_target_string =~ /^[a-z]+$/)
|
300 |
+
&& (! ($utf8_source_string =~ /^$braille_capital_letter_indicator/))) {
|
301 |
+
my $uc_utf8_source_string = "$braille_capital_letter_indicator$utf8_source_string";
|
302 |
+
my $uc_utf8_target_string = ucfirst $utf8_target_string;
|
303 |
+
if ($lang_code) {
|
304 |
+
$ht{UTF_CHAR_MAPPING_LANG_SPEC}->{$lang_code}->{$uc_utf8_source_string}->{$uc_utf8_target_string} = $prob;
|
305 |
+
} else {
|
306 |
+
$ht{UTF_CHAR_MAPPING}->{$uc_utf8_source_string}->{$uc_utf8_target_string} = $prob;
|
307 |
+
}
|
308 |
+
$this->register_word_boundary_info(*ht, $lang_code, $uc_utf8_source_string, $uc_utf8_target_string,
|
309 |
+
$use_only_for_whole_word_p, $use_only_at_start_of_word_p, $use_only_at_end_of_word_p,
|
310 |
+
$dont_use_at_start_of_word_p, $dont_use_at_end_of_word_p);
|
311 |
+
}
|
312 |
+
if (($utf8_target_string =~ /^[0-9]$/)
|
313 |
+
&& ($utf8_source_string =~ /^$braille_number_indicator./)) {
|
314 |
+
my $core_number_char = $utf8_source_string;
|
315 |
+
$core_number_char =~ s/$braille_number_indicator//;
|
316 |
+
$ht{BRAILLE_TO_DIGIT}->{$core_number_char} = $utf8_target_string;
|
317 |
+
}
|
318 |
+
}
|
319 |
+
}
|
320 |
+
if ($use_only_in_lower_case_enviroment_p) {
|
321 |
+
if ($lang_code) {
|
322 |
+
$ht{USE_ONLY_IN_LOWER_CASE_ENVIROMENT_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
323 |
+
} else {
|
324 |
+
$ht{USE_ONLY_IN_LOWER_CASE_ENVIROMENT}->{$utf8_source_string}->{$utf8_target_string} = 1;
|
325 |
+
}
|
326 |
+
}
|
327 |
+
$this->register_word_boundary_info(*ht, $lang_code, $utf8_source_string, $utf8_target_string,
|
328 |
+
$use_only_for_whole_word_p, $use_only_at_start_of_word_p, $use_only_at_end_of_word_p,
|
329 |
+
$dont_use_at_start_of_word_p, $dont_use_at_end_of_word_p);
|
330 |
+
foreach $utf8_alt_target (@utf8_alt_targets) {
|
331 |
+
if ($lang_code) {
|
332 |
+
$ht{UTF_CHAR_ALT_MAPPING_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_alt_target} = $prob;
|
333 |
+
$ht{USE_ALT_IN_POINTED_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_alt_target} = 1 if $use_alt_in_pointed_p;
|
334 |
+
} else {
|
335 |
+
$ht{UTF_CHAR_ALT_MAPPING}->{$utf8_source_string}->{$utf8_alt_target} = $prob;
|
336 |
+
$ht{USE_ALT_IN_POINTED}->{$utf8_source_string}->{$utf8_alt_target} = 1 if $use_alt_in_pointed_p;
|
337 |
+
}
|
338 |
+
if ($use_only_for_whole_word_p) {
|
339 |
+
if ($lang_code) {
|
340 |
+
$ht{USE_ALT_ONLY_FOR_WHOLE_WORD_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_alt_target} = 1;
|
341 |
+
} else {
|
342 |
+
$ht{USE_ALT_ONLY_FOR_WHOLE_WORD}->{$utf8_source_string}->{$utf8_alt_target} = 1;
|
343 |
+
}
|
344 |
+
}
|
345 |
+
if ($use_only_at_start_of_word_p) {
|
346 |
+
if ($lang_code) {
|
347 |
+
$ht{USE_ALT_ONLY_AT_START_OF_WORD_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_alt_target} = 1;
|
348 |
+
} else {
|
349 |
+
$ht{USE_ALT_ONLY_AT_START_OF_WORD}->{$utf8_source_string}->{$utf8_alt_target} = 1;
|
350 |
+
}
|
351 |
+
}
|
352 |
+
if ($use_only_at_end_of_word_p) {
|
353 |
+
if ($lang_code) {
|
354 |
+
$ht{USE_ALT_ONLY_AT_END_OF_WORD_LANG_SPEC}->{$lang_code}->{$utf8_source_string}->{$utf8_alt_target} = 1;
|
355 |
+
} else {
|
356 |
+
$ht{USE_ALT_ONLY_AT_END_OF_WORD}->{$utf8_source_string}->{$utf8_alt_target} = 1;
|
357 |
+
}
|
358 |
+
}
|
359 |
+
}
|
360 |
+
if ($numeric =~ /\d/) {
|
361 |
+
$ht{UTF_TO_NUMERIC}->{$utf8_source_string} = $numeric;
|
362 |
+
}
|
363 |
+
if ($annotation =~ /\S/) {
|
364 |
+
$ht{UTF_ANNOTATION}->{$utf8_source_string} = $annotation;
|
365 |
+
}
|
366 |
+
$n++;
|
367 |
+
}
|
368 |
+
close(IN);
|
369 |
+
# print STDERR "Loaded $n entries from $filename\n";
|
370 |
+
} else {
|
371 |
+
print STDERR "Can't open $filename\n";
|
372 |
+
}
|
373 |
+
}
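The `::preserve` branch of `load_romanization_table` turns a `from`/`to` pair written as `U+XXXX` or `\uXXXX` into a range of code points that map to themselves. A minimal sketch of just that parsing step, under the assumption that the two slot values have already been extracted as plain strings; `parse_code_point` and `preserve_range` are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: accept "U+0041" or "\u0041" (at least four hex digits)
# and return the numeric code point, or undef if the string has another shape.
sub parse_code_point {
    my ($s) = @_;
    return undef unless defined($s) && $s =~ /^(?:U\+|\\u)([0-9A-F]{4,})$/i;
    return hex($1);
}

# Hypothetical helper: list of code points to preserve. As in the sub above,
# a missing or malformed "to" value falls back to "from" (a one-element range),
# and a malformed "from" value yields nothing.
sub preserve_range {
    my ($from, $to) = @_;
    my $from_cp = parse_code_point($from);
    return () unless defined $from_cp;
    my $to_cp = parse_code_point($to);
    $to_cp = $from_cp unless defined $to_cp;
    return ($from_cp .. $to_cp);
}

my @range = preserve_range("U+0041", "U+0043");
print "@range\n";  # prints "65 66 67"
```

In the module each code point in the range is then converted to its UTF-8 string and stored in `UTF_CHAR_MAPPING` as a mapping onto itself; that step is left out here because it depends on `NLP::UTF8`.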
|
374 |
+
|
375 |
+
sub char_name_to_script {
|
376 |
+
local($this, $char_name, *ht) = @_;
|
377 |
+
|
378 |
+
return $cached_result if $cached_result = $ht{CHAR_NAME_TO_SCRIPT}->{$char_name};
|
379 |
+
$orig_char_name = $char_name;
|
380 |
+
$char_name =~ s/\s+(CONSONANT|LETTER|LIGATURE|SIGN|SYLLABLE|SYLLABICS|VOWEL)\b.*$//;
|
381 |
+
my $script_name;
|
382 |
+
while ($char_name) {
|
383 |
+
last if $script_name = $ht{SCRIPT_NORM}->{(uc $char_name)};
|
384 |
+
$char_name =~ s/\s*\S+\s*$//;
|
385 |
+
}
|
386 |
+
$script_name = "" unless defined($script_name);
|
387 |
+
$ht{CHAR_NAME_TO_SCRIPT}->{$orig_char_name} = $script_name;
|
388 |
+
return $script_name;
|
389 |
+
}
|
390 |
+
|
391 |
+
sub letter_plus_char_p {
|
392 |
+
local($this, $char_name) = @_;
|
393 |
+
|
394 |
+
return $cached_result if $cached_result = $ht{CHAR_NAME_LETTER_PLUS}->{$char_name};
|
395 |
+
my $letter_plus_p = ($char_name =~ /\b(?:LETTER|VOWEL SIGN|AU LENGTH MARK|CONSONANT SIGN|SIGN VIRAMA|SIGN PAMAAEH|SIGN COENG|SIGN AL-LAKUNA|SIGN ASAT|SIGN ANUSVARA|SIGN ANUSVARAYA|SIGN BINDI|TIPPI|SIGN NIKAHIT|SIGN CANDRABINDU|SIGN VISARGA|SIGN REAHMUK|SIGN NUKTA|SIGN DOT BELOW|HEBREW POINT)\b/) ? 1 : 0;
|
396 |
+
$ht{CHAR_NAME_LETTER_PLUS}->{$char_name} = $letter_plus_p;
|
397 |
+
return $letter_plus_p;
|
398 |
+
}
|
399 |
+
|
400 |
+
sub subjoined_char_p {
|
401 |
+
local($this, $char_name) = @_;
|
402 |
+
|
403 |
+
return $cached_result if $cached_result = $ht{CHAR_NAME_SUBJOINED}->{$char_name};
|
404 |
+
my $subjoined_p = (($char_name =~ /\b(?:SUBJOINED LETTER|VOWEL SIGN|AU LENGTH MARK|EMPHASIS MARK|CONSONANT SIGN|SIGN VIRAMA|SIGN PAMAAEH|SIGN COENG|SIGN ASAT|SIGN ANUSVARA|SIGN ANUSVARAYA|SIGN BINDI|TIPPI|SIGN NIKAHIT|SIGN CANDRABINDU|SIGN VISARGA|SIGN REAHMUK|SIGN DOT BELOW|HEBREW (POINT|PUNCTUATION GERESH)|ARABIC (?:DAMMA|DAMMATAN|FATHA|FATHATAN|HAMZA|KASRA|KASRATAN|MADDAH|SHADDA|SUKUN))\b/)) ? 1 : 0;
|
405 |
+
$ht{CHAR_NAME_SUBJOINED}->{$char_name} = $subjoined_p;
|
406 |
+
return $subjoined_p;
|
407 |
+
}
|
408 |
+
|
409 |
+
sub new_node_id {
|
410 |
+
local($this, *chart_ht) = @_;
|
411 |
+
|
412 |
+
my $n_nodes = $chart_ht{N_NODES};
|
413 |
+
$n_nodes++;
|
414 |
+
$chart_ht{N_NODES} = $n_nodes;
|
415 |
+
return $n_nodes;
|
416 |
+
}
|
417 |
+
|
418 |
+
sub add_node {
|
419 |
+
local($this, $s, $start, $end, *chart_ht, $type, $comment) = @_;
|
420 |
+
|
421 |
+
my $node_id = $this->new_node_id(*chart_ht);
|
422 |
+
# print STDERR "add_node($node_id, $start-$end): $s [$comment]\n" if $comment =~ /number/;
|
423 |
+
# print STDERR "add_node($node_id, $start-$end): $s [$comment]\n" if ($start >= 0) && ($start < 50);
|
424 |
+
$chart_ht{NODE_START}->{$node_id} = $start;
|
425 |
+
$chart_ht{NODE_END}->{$node_id} = $end;
|
426 |
+
$chart_ht{NODES_STARTING_AT}->{$start}->{$node_id} = 1;
|
427 |
+
$chart_ht{NODES_ENDING_AT}->{$end}->{$node_id} = 1;
|
428 |
+
$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end}->{$node_id} = 1;
|
429 |
+
$chart_ht{NODE_TYPE}->{$node_id} = $type;
|
430 |
+
$chart_ht{NODE_COMMENT}->{$node_id} = $comment;
|
431 |
+
$chart_ht{NODE_ROMAN}->{$node_id} = $s;
|
432 |
+
return $node_id;
|
433 |
+
}
|
434 |
+
|
435 |
+
sub get_node_for_span {
|
436 |
+
local($this, $start, $end, *chart_ht) = @_;
|
437 |
+
|
438 |
+
return "" unless defined($chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end});
|
439 |
+
my @node_ids = sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end}};
|
440 |
+
|
441 |
+
return (@node_ids) ? $node_ids[0] : "";
|
442 |
+
}
|
443 |
+
|
444 |
+
sub get_node_for_span_and_type {
|
445 |
+
local($this, $start, $end, *chart_ht, $type) = @_;
|
446 |
+
|
447 |
+
return "" unless defined($chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end});
|
448 |
+
my @node_ids = sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end}};
|
449 |
+
|
450 |
+
foreach $node_id (@node_ids) {
|
451 |
+
return $node_id if $chart_ht{NODE_TYPE}->{$node_id} eq $type;
|
452 |
+
}
|
453 |
+
return "";
|
454 |
+
}
|
455 |
+
|
456 |
+
sub get_node_roman {
|
457 |
+
local($this, $node_id, *chart_ht, $default) = @_;
|
458 |
+
|
459 |
+
$default = "" unless defined($default);
|
460 |
+
my $roman = $chart_ht{NODE_ROMAN}->{$node_id};
|
461 |
+
return (defined($roman)) ? $roman : $default;
|
462 |
+
}
|
463 |
+
|
464 |
+
sub set_node_id_slot_value {
|
465 |
+
local($this, $node_id, $slot, $value, *chart_ht) = @_;
|
466 |
+
|
467 |
+
$chart_ht{NODE_SLOT}->{$node_id}->{$slot} = $value;
|
468 |
+
}
|
469 |
+
|
470 |
+
sub copy_slot_values {
|
471 |
+
local($this, $old_node_id, $new_node_id, *chart_ht, @slots) = @_;
|
472 |
+
|
473 |
+
if (@slots) {
|
474 |
+
foreach $slot (keys %{$chart_ht{NODE_SLOT}->{$old_node_id}}) {
|
475 |
+
if (($slots[0] eq "all") || $util->member($slot, @slots)) {
|
476 |
+
my $value = $chart_ht{NODE_SLOT}->{$old_node_id}->{$slot};
|
477 |
+
$chart_ht{NODE_SLOT}->{$new_node_id}->{$slot} = $value if defined($value);
|
478 |
+
}
|
479 |
+
}
|
480 |
+
}
|
481 |
+
}
|
482 |
+
|
483 |
+
sub get_node_id_slot_value {
|
484 |
+
local($this, $node_id, $slot, *chart_ht, $default) = @_;
|
485 |
+
|
486 |
+
$default = "" unless defined($default);
|
487 |
+
my $value = $chart_ht{NODE_SLOT}->{$node_id}->{$slot};
|
488 |
+
return (defined($value)) ? $value : $default;
|
489 |
+
}
|
490 |
+
|
491 |
+
sub get_node_for_span_with_slot_value {
|
492 |
+
local($this, $start, $end, $slot, *chart_ht, $default) = @_;
|
493 |
+
|
494 |
+
$default = "" unless defined($default);
|
495 |
+
return $default unless defined($chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end});
|
496 |
+
my @node_ids = sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end}};
|
497 |
+
foreach $node_id (@node_ids) {
|
498 |
+
my $value = $chart_ht{NODE_SLOT}->{$node_id}->{$slot};
|
499 |
+
return $value if defined($value);
|
500 |
+
}
|
501 |
+
return $default;
|
502 |
+
}
|
503 |
+
|
504 |
+
sub get_node_for_span_with_slot {
|
505 |
+
local($this, $start, $end, $slot, *chart_ht, $default) = @_;
|
506 |
+
|
507 |
+
$default = "" unless defined($default);
|
508 |
+
return $default unless defined($chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end});
|
509 |
+
my @node_ids = sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end}};
|
510 |
+
foreach $node_id (@node_ids) {
|
511 |
+
my $value = $chart_ht{NODE_SLOT}->{$node_id}->{$slot};
|
512 |
+
return $node_id if defined($value);
|
513 |
+
}
|
514 |
+
return $default;
|
515 |
+
}
|
516 |
+
|
517 |
+
sub register_new_complex_number_span_segment {
|
518 |
+
local($this, $start, $mid, $end, *chart_ht, $line_number) = @_;
|
519 |
+
# e.g. 4 10 (= 40); 20 5 (= 25)
|
520 |
+
# might become part of larger complex number span, e.g. 4 1000 3 100 20 1
|
521 |
+
|
522 |
+
# print STDERR "register_new_complex_number_span_segment $start-$mid-$end\n" if $line_number == 43;
|
523 |
+
if (defined($old_start = $chart_ht{COMPLEX_NUMERIC_END_START}->{$mid})) {
|
524 |
+
undef($chart_ht{COMPLEX_NUMERIC_END_START}->{$mid});
|
525 |
+
$chart_ht{COMPLEX_NUMERIC_START_END}->{$old_start} = $end;
|
526 |
+
$chart_ht{COMPLEX_NUMERIC_END_START}->{$end} = $old_start;
|
527 |
+
} else {
|
528 |
+
$chart_ht{COMPLEX_NUMERIC_START_END}->{$start} = $end;
|
529 |
+
$chart_ht{COMPLEX_NUMERIC_END_START}->{$end} = $start;
|
530 |
+
}
|
531 |
+
}
|
532 |
+
|
533 |
+
sub romanize_by_token_with_caching {
|
534 |
+
local($this, $s, $lang_code, $output_style, *ht, *pinyin_ht, $initial_char_offset, $control, $line_number) = @_;
|
535 |
+
|
536 |
+
$control = "" unless defined($control);
|
537 |
+
my $return_chart_p = ($control =~ /return chart/i);
|
538 |
+
my $return_offset_mappings_p = ($control =~ /return offset mappings/i);
|
539 |
+
return $this->romanize($s, $lang_code, $output_style, *ht, *pinyin_ht, $initial_char_offset, $control, $line_number)
|
540 |
+
if $return_chart_p || $return_offset_mappings_p;
|
541 |
+
my $result = "";
|
542 |
+
my @separators = ();
|
543 |
+
my @tokens = ();
|
544 |
+
$s =~ s/\n$//; # Added May 2, 2019 as bug-fix (duplicate empty lines)
|
545 |
+
while (($sep, $token, $rest) = ($s =~ /^(\s*)(\S+)(.*)$/)) {
|
546 |
+
push(@separators, $sep);
|
547 |
+
push(@tokens, $token);
|
548 |
+
$s = $rest;
|
549 |
+
}
|
550 |
+
push(@separators, $s);
|
551 |
+
while (@tokens) {
|
552 |
+
my $sep = shift @separators;
|
553 |
+
my $token = shift @tokens;
|
554 |
+
$result .= $sep;
|
555 |
+
if ($token =~ /^[\x00-\x7F]*$/) { # all ASCII
|
556 |
+
$result .= $token;
|
557 |
+
} else {
|
558 |
+
my $rom_token = $ht{CACHED_ROMANIZATION}->{$lang_code}->{$token};
|
559 |
+
unless (defined($rom_token)) {
|
560 |
+
$rom_token = $this->romanize($token, $lang_code, $output_style, *ht, *pinyin_ht, $initial_char_offset, $control, $line_number);
|
561 |
+
$ht{CACHED_ROMANIZATION}->{$lang_code}->{$token} = $rom_token if defined($rom_token);
|
562 |
+
}
|
563 |
+
$result .= $rom_token;
|
564 |
+
}
|
565 |
+
}
|
566 |
+
my $sep = shift @separators;
|
567 |
+
$result .= $sep if defined($sep);
|
568 |
+
|
569 |
+
return $result;
|
570 |
+
}
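The caching strategy of `romanize_by_token_with_caching` is: split on whitespace, copy pure-ASCII tokens through unchanged, and romanize each distinct non-ASCII token only once. The sketch below illustrates that idea on its own. It assumes a single language (so the cache has no `$lang_code` level), works on character strings rather than UTF-8 bytes, needs Perl 5.10 or later for `//=`, and uses a hypothetical stand-in `romanize_token_stub` in place of the real `romanize` method.

```perl
use strict;
use warnings;

my %cache;          # token => romanized token
my $stub_calls = 0; # counts how often the expensive step actually runs

# Hypothetical stand-in for the real romanizer: returns a placeholder
# showing the token length, so cache hits are easy to observe.
sub romanize_token_stub {
    my ($token) = @_;
    $stub_calls++;
    return "<" . length($token) . ">";
}

sub romanize_by_token {
    my ($s) = @_;
    $s =~ s/\n$//;
    my $result = "";
    # Walk the string as (separator, token) pairs; /c keeps pos() after the last match.
    while ($s =~ /\G(\s*)(\S+)/gc) {
        my ($sep, $token) = ($1, $2);
        $result .= $sep;
        if ($token =~ /^[\x00-\x7F]*$/) {       # all ASCII: nothing to romanize
            $result .= $token;
        } else {
            $cache{$token} //= romanize_token_stub($token);
            $result .= $cache{$token};
        }
    }
    $result .= substr($s, pos($s) // 0);        # trailing whitespace, if any
    return $result;
}

my $word = "\x{D55C}\x{AE00}";                  # two Hangul syllables
print romanize_by_token("abc $word  abc $word\n"), "\n";  # prints "abc <2>  abc <2>"
print "$stub_calls\n";                                     # prints "1"
```

Caching per token rather than per line pays off on corpora where the same words recur; the original bypasses the cache when a chart or offset mappings are requested, since those results depend on the whole input string.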
|
571 |
+
|
572 |
+
sub romanize {
|
573 |
+
local($this, $s, $lang_code, $output_style, *ht, *pinyin_ht, $initial_char_offset, $control, $line_number, $initial_rom_char_offset) = @_;
|
574 |
+
|
575 |
+
my $orig_lang_code = $lang_code;
|
576 |
+
# Check whether the text (to be romanized) starts with a language code directive.
|
577 |
+
if (($line_lang_code) = ($s =~ /^::lcode\s+([a-z][a-z][a-z])\s/)) {
|
578 |
+
$lang_code = $line_lang_code;
|
579 |
+
}
|
580 |
+
$initial_char_offset = 0 unless defined($initial_char_offset);
|
581 |
+
$initial_rom_char_offset = 0 unless defined($initial_rom_char_offset);
|
582 |
+
$control = "" unless defined($control);
|
583 |
+
my $return_chart_p = ($control =~ /return chart/i);
|
584 |
+
my $return_offset_mappings_p = ($control =~ /return offset mappings/i);
|
585 |
+
$line_number = "" unless defined($line_number);
|
586 |
+
my @chars = $utf8->split_into_utf8_characters($s, "return only chars", *empty_ht);
|
587 |
+
my $n_characters = $#chars + 1;
|
588 |
+
%chart_ht = ();
|
589 |
+
$chart_ht{N_CHARS} = $n_characters;
|
590 |
+
$chart_ht{N_NODES} = 0;
|
591 |
+
my $char = "";
|
592 |
+
my $char_name = "";
|
593 |
+
my $prev_script = "";
|
594 |
+
my $current_script = "";
|
595 |
+
my $script_start = 0;
|
596 |
+
my $script_end = 0;
|
597 |
+
my $prev_letter_plus_script = "";
|
598 |
+
my $current_letter_plus_script = "";
|
599 |
+
my $letter_plus_script_start = 0;
|
600 |
+
my $letter_plus_script_end = 0;
|
601 |
+
my $log = "";
|
602 |
+
my $n_right_to_left_chars = 0;
|
603 |
+
my $n_left_to_right_chars = 0;
|
604 |
+
my $hebrew_word_start = ""; # used to identify Hebrew words with points
|
605 |
+
my $hebrew_word_contains_point = 0;
|
606 |
+
my $current_word_start = "";
|
607 |
+
my $current_word_script = "";
|
608 |
+
my $braille_all_caps_p = 0;
|
609 |
+
|
610 |
+
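# prep: first pass over the characters; records per-character properties and marks word, script, and letter-token segments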
|
611 |
+
foreach $i ((0 .. ($#chars + 1))) {
|
612 |
+
if ($i <= $#chars) {
|
613 |
+
$char = $chars[$i];
|
614 |
+
$chart_ht{ORIG_CHAR}->{$i} = $char;
|
615 |
+
$char_name = $ht{UTF_TO_CHAR_NAME}->{$char} || "";
|
616 |
+
$chart_ht{CHAR_NAME}->{$i} = $char_name;
|
617 |
+
$current_script = $this->char_name_to_script($char_name, *ht);
|
618 |
+
$current_script_direction = $ht{DIRECTION}->{$current_script} || '';
|
619 |
+
if ($current_script_direction eq 'right-to-left') {
|
620 |
+
$n_right_to_left_chars++;
|
621 |
+
} elsif (($char =~ /^[a-z]$/i) || ! ($char =~ /^[\x00-\x7F]$/)) {
|
622 |
+
$n_left_to_right_chars++;
|
623 |
+
}
|
624 |
+
$chart_ht{CHAR_SCRIPT}->{$i} = $current_script;
|
625 |
+
$chart_ht{SCRIPT_SEGMENT_START}->{$i} = ""; # default value, to be updated later
|
626 |
+
$chart_ht{SCRIPT_SEGMENT_END}->{$i} = ""; # default value, to be updated later
|
627 |
+
$chart_ht{LETTER_TOKEN_SEGMENT_START}->{$i} = ""; # default value, to be updated later
|
628 |
+
$chart_ht{LETTER_TOKEN_SEGMENT_END}->{$i} = ""; # default value, to be updated later
|
629 |
+
$subjoined_char_p = $this->subjoined_char_p($char_name);
|
630 |
+
$chart_ht{CHAR_SUBJOINED}->{$i} = $subjoined_char_p;
|
631 |
+
$letter_plus_char_p = $this->letter_plus_char_p($char_name);
|
632 |
+
$chart_ht{CHAR_LETTER_PLUS}->{$i} = $letter_plus_char_p;
|
633 |
+
$current_letter_plus_script = ($letter_plus_char_p) ? $current_script : "";
|
634 |
+
$numeric_value = $ht{UTF_TO_NUMERIC}->{$char};
|
635 |
+
$numeric_value = "" unless defined($numeric_value);
|
636 |
+
$annotation = $ht{UTF_ANNOTATION}->{$char};
|
637 |
+
$annotation = "" unless defined($annotation);
|
638 |
+
$chart_ht{CHAR_NUMERIC_VALUE}->{$i} = $numeric_value;
|
639 |
+
$chart_ht{CHAR_ANNOTATION}->{$i} = $annotation;
|
640 |
+
$syllable_info = $ht{UTF_TO_SYLLABLE_INFO}->{$char} || "";
|
641 |
+
$chart_ht{CHAR_SYLLABLE_INFO}->{$i} = $syllable_info;
|
642 |
+
$tone_mark = $ht{UTF_TO_TONE_MARK}->{$char} || "";
|
643 |
+
$chart_ht{CHAR_TONE_MARK}->{$i} = $tone_mark;
|
644 |
+
} else {
|
645 |
+
$char = "";
|
646 |
+
$char_name = "";
|
647 |
+
$current_script = "";
|
648 |
+
$current_letter_plus_script = "";
|
649 |
+
}
|
650 |
+
if ($char_name =~ /^HEBREW (LETTER|POINT|PUNCTUATION GERESH) /) {
|
651 |
+
$hebrew_word_start = $i if $hebrew_word_start eq "";
|
652 |
+
$hebrew_word_contains_point = 1 if $char_name =~ /^HEBREW POINT /;
|
653 |
+
} elsif ($hebrew_word_start ne "") {
|
654 |
+
if ($hebrew_word_contains_point) {
|
655 |
+
foreach $j (($hebrew_word_start .. ($i-1))) {
|
656 |
+
$chart_ht{CHAR_PART_OF_POINTED_HEBREW_WORD}->{$j} = 1;
|
657 |
+
}
|
658 |
+
$chart_ht{CHAR_START_OF_WORD}->{$hebrew_word_start} = 1;
|
659 |
+
$chart_ht{CHAR_END_OF_WORD}->{($i-1)} = 1;
|
660 |
+
}
|
661 |
+
$hebrew_word_start = "";
|
662 |
+
$hebrew_word_contains_point = 0;
|
663 |
+
}
|
664 |
+
my $part_of_word_p = $current_script
|
665 |
+
&& ($this->letter_plus_char_p($char_name)
|
666 |
+
|| $this->subjoined_char_p($char_name)
|
667 |
+
|| ($char_name =~ /\b(LETTER|SYLLABLE|SYLLABICS|LIGATURE)\b/));
|
668 |
+
|
669 |
+
# Braille punctuation
|
670 |
+
my $end_offset = 0;
|
671 |
+
if ($char_name =~ /^Braille\b/i) {
|
672 |
+
if (($char =~ /^\s*$/) || ($char_name =~ /BLANK/)) {
|
673 |
+
$part_of_word_p = 0;
|
674 |
+
$braille_all_caps_p = 0;
|
675 |
+
} elsif ($chart_ht{NOT_PART_OF_WORD_P}->{$i}) {
|
676 |
+
$part_of_word_p = 0;
|
677 |
+
$braille_all_caps_p = 0;
|
678 |
+
} elsif ((keys %{$ht{WORD_EXTERNAL_PUNCTUATION_LANG_SPEC}->{$lang_code}->{$char}})
|
679 |
+
|| (keys %{$ht{WORD_EXTERNAL_PUNCTUATION}->{$char}})) {
|
680 |
+
$part_of_word_p = 0;
|
681 |
+
$braille_all_caps_p = 0;
|
682 |
+
} elsif (($i+1 <= $#chars)
|
683 |
+
&& ($s1 = $char . $chars[$i+1])
|
684 |
+
&& ((keys %{$ht{WORD_EXTERNAL_PUNCTUATION_LANG_SPEC}->{$lang_code}->{$s1}})
|
685 |
+
|| (keys %{$ht{WORD_EXTERNAL_PUNCTUATION}->{$s1}}))) {
|
686 |
+
$part_of_word_p = 0;
|
687 |
+
$braille_all_caps_p = 0;
|
688 |
+
$chart_ht{NOT_PART_OF_WORD_P}->{($i+1)} = 1;
|
689 |
+
} elsif (($i+2 <= $#chars)
|
690 |
+
&& ($s2 = $char . $chars[$i+1] . $chars[$i+2])
|
691 |
+
&& ((keys %{$ht{WORD_EXTERNAL_PUNCTUATION_LANG_SPEC}->{$lang_code}->{$s2}})
|
692 |
+
|| (keys %{$ht{WORD_EXTERNAL_PUNCTUATION}->{$s2}}))) {
|
693 |
+
$part_of_word_p = 0;
|
694 |
+
$braille_all_caps_p = 0;
|
695 |
+
$chart_ht{NOT_PART_OF_WORD_P}->{($i+1)} = 1;
|
696 |
+
$chart_ht{NOT_PART_OF_WORD_P}->{($i+2)} = 1;
|
697 |
+
} elsif (($i+1 <= $#chars)
|
698 |
+
&& ($char eq $braille_capital_letter_indicator)
|
699 |
+
&& ($chars[$i+1] eq $braille_capital_letter_indicator)) {
|
700 |
+
$braille_all_caps_p = 1;
|
701 |
+
} else {
|
702 |
+
$part_of_word_p = 1;
|
703 |
+
}
|
704 |
+
# last period in Braille text is also not part_of_word_p
|
705 |
+
if (($char eq $braille_period)
|
706 |
+
&& (($i == $#chars)
|
707 |
+
|| (($i < $#chars)
|
708 |
+
&& (! $this->braille_string_p($chars[$i+1]))))) {
|
709 |
+
$part_of_word_p = 0;
|
710 |
+
}
|
711 |
+
# period before other word-external punctuation is also not part_of_word_p
|
712 |
+
if (($i > 0)
|
713 |
+
&& ($chars[$i-1] eq $braille_period)
|
714 |
+
&& (! $part_of_word_p)
|
715 |
+
&& ($current_word_start ne "")) {
|
716 |
+
$end_offset = -1;
|
717 |
+
}
|
718 |
+
} else {
|
719 |
+
$braille_all_caps_p = 0;
|
720 |
+
}
|
721 |
+
$chart_ht{BRAILLE_ALL_CAPS_P}->{$i} = $braille_all_caps_p;
|
722 |
+
|
723 |
+
if (($current_word_start ne "")
|
724 |
+
&& ((! $part_of_word_p)
|
725 |
+
|| ($current_script ne $current_word_script))) {
|
726 |
+
# END OF WORD
|
727 |
+
$chart_ht{CHAR_START_OF_WORD}->{$current_word_start} = 1;
|
728 |
+
$chart_ht{CHAR_END_OF_WORD}->{($i-1+$end_offset)} = 1;
|
729 |
+
my $word = join("", @chars[$current_word_start .. ($i-1+$end_offset)]);
|
730 |
+
$chart_ht{WORD_START_END}->{$current_word_start}->{$i} = $word;
|
731 |
+
$chart_ht{WORD_END_START}->{$i+$end_offset}->{$current_word_start} = $word;
|
732 |
+
# print STDERR "Word ($current_word_start-$i+$end_offset): $word ($current_word_script)\n";
|
733 |
+
$current_word_start = "";
|
734 |
+
$current_word_script = "";
|
735 |
+
}
|
736 |
+
if ($part_of_word_p && ($current_word_start eq "")) {
|
737 |
+
# START OF WORD
|
738 |
+
$current_word_start = $i;
|
739 |
+
$current_word_script = $current_script;
|
740 |
+
}
|
741 |
+
# print STDERR "$i char: $char ($current_script)\n";
|
742 |
+
unless ($current_script eq $prev_script) {
|
743 |
+
if ($prev_script && ($i-1 >= $script_start)) {
|
744 |
+
my $script_end = $i;
|
745 |
+
$chart_ht{SCRIPT_SEGMENT_START_TO_END}->{$script_start} = $script_end;
|
746 |
+
$chart_ht{SCRIPT_SEGMENT_END_TO_START}->{$script_end} = $script_start;
|
747 |
+
foreach $i (($script_start .. $script_end)) {
|
748 |
+
$chart_ht{SCRIPT_SEGMENT_START}->{$i} = $script_start;
|
749 |
+
$chart_ht{SCRIPT_SEGMENT_END}->{$i} = $script_end;
|
750 |
+
}
|
751 |
+
# print STDERR "Script segment $script_start-$script_end: $prev_script\n";
|
752 |
+
}
|
753 |
+
$script_start = $i;
|
754 |
+
}
|
755 |
+
unless ($current_letter_plus_script eq $prev_letter_plus_script) {
|
756 |
+
if ($prev_letter_plus_script && ($i-1 >= $letter_plus_script_start)) {
|
757 |
+
my $letter_plus_script_end = $i;
|
758 |
+
$chart_ht{LETTER_TOKEN_SEGMENT_START_TO_END}->{$letter_plus_script_start} = $letter_plus_script_end;
|
759 |
+
$chart_ht{LETTER_TOKEN_SEGMENT_END_TO_START}->{$letter_plus_script_end} = $letter_plus_script_start;
|
760 |
+
foreach $i (($letter_plus_script_start .. $letter_plus_script_end)) {
|
761 |
+
$chart_ht{LETTER_TOKEN_SEGMENT_START}->{$i} = $letter_plus_script_start;
|
762 |
+
$chart_ht{LETTER_TOKEN_SEGMENT_END}->{$i} = $letter_plus_script_end;
|
763 |
+
}
|
764 |
+
# print STDERR "Script token segment $letter_plus_script_start-$letter_plus_script_end: $prev_letter_plus_script\n";
|
765 |
+
}
|
766 |
+
$letter_plus_script_start = $i;
|
767 |
+
}
|
768 |
+
$prev_script = $current_script;
|
769 |
+
$prev_letter_plus_script = $current_letter_plus_script;
|
770 |
+
}
|
771 |
+
$ht{STRING_IS_DOMINANTLY_RIGHT_TO_LEFT}->{$s} = 1 if $n_right_to_left_chars > $n_left_to_right_chars;
|
772 |
+
|
773 |
+
# main: second pass over the characters; adds romanization nodes to the chart for each position
|
774 |
+
my $i = 0;
|
775 |
+
while ($i <= $#chars) {
|
776 |
+
my $char = $chart_ht{ORIG_CHAR}->{$i};
|
777 |
+
my $current_script = $chart_ht{CHAR_SCRIPT}->{$i};
|
778 |
+
$chart_ht{CHART_CONTAINS_SCRIPT}->{$current_script} = 1;
|
779 |
+
my $script_segment_start = $chart_ht{SCRIPT_SEGMENT_START}->{$i};
|
780 |
+
my $script_segment_end = $chart_ht{SCRIPT_SEGMENT_END}->{$i};
|
781 |
+
my $char_name = $chart_ht{CHAR_NAME}->{$i};
|
782 |
+
my $subjoined_char_p = $chart_ht{CHAR_SUBJOINED}->{$i};
|
783 |
+
my $letter_plus_char_p = $chart_ht{CHAR_LETTER_PLUS}->{$i};
|
784 |
+
my $numeric_value = $chart_ht{CHAR_NUMERIC_VALUE}->{$i};
|
785 |
+
my $annotation = $chart_ht{CHAR_ANNOTATION}->{$i};
|
786 |
+
# print STDERR " $char_name annotation: $annotation\n" if $annotation;
|
787 |
+
my $tone_mark = $chart_ht{CHAR_TONE_MARK}->{$i};
|
788 |
+
my $found_char_mapping_p = 0;
|
789 |
+
my $prev_char_name = ($i >= 1) ? $chart_ht{CHAR_NAME}->{($i-1)} : "";
|
790 |
+
my $prev2_script = ($i >= 2) ? $chart_ht{CHAR_SCRIPT}->{($i-2)} : "";
|
791 |
+
my $prev_script = ($i >= 1) ? $chart_ht{CHAR_SCRIPT}->{($i-1)} : "";
|
792 |
+
my $next_script = ($i < $#chars) ? $chart_ht{CHAR_SCRIPT}->{($i+1)} : "";
|
793 |
+
my $next_char = ($i < $#chars) ? $chart_ht{ORIG_CHAR}->{($i+1)} : "";
|
794 |
+
my $next_char_name = $ht{UTF_TO_CHAR_NAME}->{$next_char} || "";
|
795 |
+
my $prev2_letter_plus_char_p = ($i >= 2) ? $chart_ht{CHAR_LETTER_PLUS}->{($i-2)} : 0;
|
796 |
+
my $prev_letter_plus_char_p = ($i >= 1) ? $chart_ht{CHAR_LETTER_PLUS}->{($i-1)} : 0;
|
797 |
+
my $next_letter_plus_char_p = ($i < $#chars) ? $chart_ht{CHAR_LETTER_PLUS}->{($i+1)} : 0;
|
798 |
+
my $next_index = $i + 1;
|
799 |
+
|
800 |
+
# Braille numeric mode
|
801 |
+
if ($char eq $braille_number_indicator) {
|
802 |
+
my $offset = 0;
|
803 |
+
my $numeric_value = "";
|
804 |
+
my $digit;
|
805 |
+
while ($i+$offset < $#chars) {
|
806 |
+
$offset++;
|
807 |
+
my $offset_char = $chart_ht{ORIG_CHAR}->{$i+$offset};
|
808 |
+
if (defined($digit = $ht{BRAILLE_TO_DIGIT}->{$offset_char})) {
|
809 |
+
$numeric_value .= $digit;
|
810 |
+
} elsif (($offset_char eq $braille_decimal_point)
|
811 |
+
|| ($ht{UTF_CHAR_MAPPING}->{$offset_char}->{"."})) {
|
812 |
+
$numeric_value .= ".";
|
813 |
+
} elsif ($offset_char eq $braille_comma) {
|
814 |
+
$numeric_value .= ",";
|
815 |
+
} elsif ($offset_char eq $braille_numeric_space) {
|
816 |
+
$numeric_value .= " ";
|
817 |
+
} elsif ($offset_char eq $braille_solidus) {
|
818 |
+
$numeric_value .= "/";
|
819 |
+
} elsif ($offset_char eq $braille_number_indicator) {
|
820 |
+
# stay in Braille numeric mode
|
821 |
+
} elsif ($offset_char eq $braille_letter_indicator) {
|
822 |
+
# consider as part of number, but without contributing to numeric_value
|
823 |
+
last;
|
824 |
+
} else {
|
825 |
+
$offset--;
|
826 |
+
last;
|
827 |
+
}
|
828 |
+
}
|
829 |
+
if ($offset) {
|
830 |
+
$next_index = $i + $offset + 1;
|
831 |
+
$node_id = $this->add_node($numeric_value, $i, $next_index, *chart_ht, "", "braille number");
|
832 |
+
$found_char_mapping_p = 1;
|
833 |
+
}
|
834 |
+
}
|
835 |
+
|
836 |
+
unless ($found_char_mapping_p) {
|
837 |
+
foreach $string_length (reverse(1 .. 6)) {
|
838 |
+
next if ($i + $string_length-1) > $#chars;
|
839 |
+
my $start_of_word_p = $chart_ht{CHAR_START_OF_WORD}->{$i} || 0;
|
840 |
+
my $end_of_word_p = $chart_ht{CHAR_END_OF_WORD}->{($i+$string_length-1)} || 0;
|
841 |
+
my $multi_char_substring = join("", @chars[$i..($i+$string_length-1)]);
|
842 |
+
my @mappings = keys %{$ht{UTF_CHAR_MAPPING_LANG_SPEC}->{$lang_code}->{$multi_char_substring}};
|
843 |
+
@mappings = keys %{$ht{UTF_CHAR_MAPPING}->{$multi_char_substring}} unless @mappings;
|
844 |
+
my @mappings_whole = ();
|
845 |
+
my @mappings_start_or_end = ();
|
846 |
+
my @mappings_other = ();
|
847 |
+
foreach $mapping (@mappings) {
|
848 |
+
next if $mapping =~ /\(__.*__\)/;
|
849 |
+
if ($ht{USE_ONLY_FOR_WHOLE_WORD_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$mapping}
|
850 |
+
|| $ht{USE_ONLY_FOR_WHOLE_WORD}->{$multi_char_substring}->{$mapping}) {
|
851 |
+
push(@mappings_whole, $mapping) if $start_of_word_p && $end_of_word_p;
|
852 |
+
} elsif ($ht{USE_ONLY_AT_START_OF_WORD_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$mapping}
|
853 |
+
|| $ht{USE_ONLY_AT_START_OF_WORD}->{$multi_char_substring}->{$mapping}) {
|
854 |
+
push(@mappings_start_or_end, $mapping) if $start_of_word_p;
|
855 |
+
} elsif ($ht{USE_ONLY_AT_END_OF_WORD_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$mapping}
|
856 |
+
|| $ht{USE_ONLY_AT_END_OF_WORD}->{$multi_char_substring}->{$mapping}) {
|
857 |
+
push(@mappings_start_or_end, $mapping) if $end_of_word_p;
|
858 |
+
} else {
|
859 |
+
push(@mappings_other, $mapping);
|
860 |
+
}
|
861 |
+
}
|
862 |
+
@mappings = @mappings_whole;
|
863 |
+
@mappings = @mappings_start_or_end unless @mappings;
|
864 |
+
@mappings = @mappings_other unless @mappings;
|
865 |
+
foreach $mapping (@mappings) {
|
866 |
+
next if $mapping =~ /\(__.*__\)/;
|
867 |
+
if ($ht{DONT_USE_AT_START_OF_WORD_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$mapping}
|
868 |
+
|| $ht{DONT_USE_AT_START_OF_WORD}->{$multi_char_substring}->{$mapping}) {
|
869 |
+
next if $start_of_word_p;
|
870 |
+
}
|
871 |
+
if ($ht{DONT_USE_AT_END_OF_WORD_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$mapping}
|
872 |
+
|| $ht{DONT_USE_AT_END_OF_WORD}->{$multi_char_substring}->{$mapping}) {
|
873 |
+
next if $end_of_word_p;
|
874 |
+
}
|
875 |
+
my $mapping2 = ($chart_ht{BRAILLE_ALL_CAPS_P}->{$i}) ? (uc $mapping) : $mapping;
|
876 |
+
$node_id = $this->add_node($mapping2, $i, $i+$string_length, *chart_ht, "", "multi-char-mapping");
|
877 |
+
$next_index = $i + $string_length;
|
878 |
+
$found_char_mapping_p = 1;
|
879 |
+
if ($annotation) {
|
880 |
+
@annotation_elems = split(/,\s*/, $annotation);
|
881 |
+
foreach $annotation_elem (@annotation_elems) {
|
882 |
+
if (($a_slot, $a_value) = ($annotation_elem =~ /^(\S+?):(\S+)\s*$/)) {
|
883 |
+
$this->set_node_id_slot_value($node_id, $a_slot, $a_value, *chart_ht);
|
884 |
+
} else {
|
885 |
+
$this->set_node_id_slot_value($node_id, $annotation_elem, 1, *chart_ht);
|
886 |
+
}
|
887 |
+
}
|
888 |
+
}
|
889 |
+
}
|
890 |
+
my @alt_mappings = keys %{$ht{UTF_CHAR_ALT_MAPPING_LANG_SPEC}->{$lang_code}->{$multi_char_substring}};
|
891 |
+
@alt_mappings = keys %{$ht{UTF_CHAR_ALT_MAPPING}->{$multi_char_substring}} unless @alt_mappings;
|
892 |
+
@alt_mappings = () if ($#alt_mappings == 0) && ($alt_mappings[0] eq "_NONE_");
|
893 |
+
foreach $alt_mapping (@alt_mappings) {
|
894 |
+
if ($chart_ht{CHAR_PART_OF_POINTED_HEBREW_WORD}->{$i}) {
|
895 |
+
next unless
|
896 |
+
$ht{USE_ALT_IN_POINTED_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$alt_mapping}
|
897 |
+
|| $ht{USE_ALT_IN_POINTED}->{$multi_char_substring}->{$alt_mapping};
|
898 |
+
}
|
899 |
+
if ($ht{USE_ALT_ONLY_FOR_WHOLE_WORD_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$alt_mapping}
|
900 |
+
|| $ht{USE_ALT_ONLY_FOR_WHOLE_WORD}->{$multi_char_substring}->{$alt_mapping}) {
|
901 |
+
next unless $start_of_word_p && $end_of_word_p;
|
902 |
+
}
|
903 |
+
if ($ht{USE_ALT_ONLY_AT_START_OF_WORD_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$alt_mapping}
|
904 |
+
|| $ht{USE_ALT_ONLY_AT_START_OF_WORD}->{$multi_char_substring}->{$alt_mapping}) {
|
905 |
+
next unless $start_of_word_p;
|
906 |
+
}
|
907 |
+
if ($ht{USE_ALT_ONLY_AT_END_OF_WORD_LANG_SPEC}->{$lang_code}->{$multi_char_substring}->{$alt_mapping}
|
908 |
+
|| $ht{USE_ALT_ONLY_AT_END_OF_WORD}->{$multi_char_substring}->{$alt_mapping}) {
|
909 |
+
next unless $end_of_word_p;
|
910 |
+
}
|
911 |
+
my $alt_mapping2 = ($chart_ht{BRAILLE_ALL_CAPS_P}->{$i}) ? (uc $alt_mapping) : $alt_mapping;
|
912 |
+
$node_id = $this->add_node($alt_mapping2, $i, $i+$string_length, *chart_ht, "alt", "multi-char-mapping");
|
913 |
+
if ($annotation) {
|
914 |
+
@annotation_elems = split(/,\s*/, $annotation);
|
915 |
+
foreach $annotation_elem (@annotation_elems) {
|
916 |
+
if (($a_slot, $a_value) = ($annotation_elem =~ /^(\S+?):(\S+)\s*$/)) {
|
917 |
+
$this->set_node_id_slot_value($node_id, $a_slot, $a_value, *chart_ht);
|
918 |
+
} else {
|
919 |
+
$this->set_node_id_slot_value($node_id, $annotation_elem, 1, *chart_ht);
|
920 |
+
}
|
921 |
+
}
|
922 |
+
}
|
923 |
+
}
|
924 |
+
}
|
925 |
+
}
|
926 |
+
unless ($found_char_mapping_p) {
|
927 |
+
my $prev_node_id = $this->get_node_for_span($i-4, $i, *chart_ht)
|
928 |
+
|| $this->get_node_for_span($i-3, $i, *chart_ht)
|
929 |
+
|| $this->get_node_for_span($i-2, $i, *chart_ht)
|
930 |
+
|| $this->get_node_for_span($i-1, $i, *chart_ht);
|
931 |
+
my $prev_char_roman = ($prev_node_id) ? $this->get_node_roman($prev_node_id, *chart_ht) : "";
|
932 |
+
my $prev_node_start = ($prev_node_id) ? $chart_ht{NODE_START}->{$prev_node_id} : "";
|
933 |
+
|
934 |
+
# Number
|
935 |
+
if (($numeric_value =~ /\d/)
|
936 |
+
&& (! ($char_name =~ /SUPERSCRIPT/))) {
|
937 |
+
my $prev_numeric_value = $this->get_node_for_span_with_slot_value($i-1, $i, "numeric-value", *chart_ht);
|
938 |
+
my $sep = "";
|
939 |
+
$sep = " " if ($char_name =~ /^vulgar fraction /i) && ($prev_numeric_value =~ /\d/);
|
940 |
+
$node_id = $this->add_node("$sep$numeric_value", $i, $i+1, *chart_ht, "", "number");
|
941 |
+
$this->set_node_id_slot_value($node_id, "numeric-value", $numeric_value, *chart_ht);
|
942 |
+
if ((($prev_numeric_value =~ /\d/) && ($numeric_value =~ /\d\d/))
|
943 |
+
|| (($prev_numeric_value =~ /\d\d/) && ($numeric_value =~ /\d/))) {
|
944 |
+
# pull in any other parts of single digits
|
945 |
+
my $j = 1;
|
946 |
+
# pull in any single digits adjoining on left
|
947 |
+
if ($prev_numeric_value =~ /^\d$/) {
|
948 |
+
while (1) {
|
949 |
+
if (($i-$j-1 >= 0)
|
950 |
+
&& defined($digit_value = $this->get_node_for_span_with_slot_value($i-$j-1, $i-$j, "numeric-value", *chart_ht))
|
951 |
+
&& ($digit_value =~ /^\d$/)) {
|
952 |
+
$j++;
|
953 |
+
} elsif (($i-$j-2 >= 0)
|
954 |
+
&& ($chart_ht{ORIG_CHAR}->{($i-$j-1)} =~ /^[.,]$/)
|
955 |
+
&& defined($digit_value = $this->get_node_for_span_with_slot_value($i-$j-2, $i-$j-1, "numeric-value", *chart_ht))
|
956 |
+
&& ($digit_value =~ /^\d$/)) {
|
957 |
+
$j += 2;
|
958 |
+
} else {
|
959 |
+
last;
|
960 |
+
}
|
961 |
+
}
|
962 |
+
}
|
963 |
+
# pull in any single digits adjoining on right
|
964 |
+
my $k = 0;
|
965 |
+
if ($numeric_value =~ /^\d$/) {
|
966 |
+
while (1) {
|
967 |
+
if (defined($next_numeric_value = $chart_ht{CHAR_NUMERIC_VALUE}->{($i+$k+1)})
|
968 |
+
&& ($next_numeric_value =~ /^\d$/)) {
|
969 |
+
$k++;
|
970 |
+
} else {
|
971 |
+
last;
|
972 |
+
}
|
973 |
+
}
|
974 |
+
}
|
975 |
+
$this->register_new_complex_number_span_segment($i-$j, $i, $i+$k+1, *chart_ht, $line_number);
|
976 |
+
}
|
977 |
+
if ($chinesePM->string_contains_utf8_cjk_unified_ideograph_p($char)
|
978 |
+
&& ($tonal_translit = $chinesePM->tonal_pinyin($char, *pinyin_ht, ""))) {
|
979 |
+
$de_accented_translit = $util->de_accent_string($tonal_translit);
|
980 |
+
if ($numeric_value =~ /^(10000|1000000000000|10000000000000000)$/) {
|
981 |
+
$chart_ht{NODE_TYPE}->{$node_id} = "alt"; # keep, but demote
|
982 |
+
$alt_node_id = $this->add_node($de_accented_translit, $i, $i+1, *chart_ht, "", "CJK");
|
983 |
+
} else {
|
984 |
+
$alt_node_id = $this->add_node($de_accented_translit, $i, $i+1, *chart_ht, "alt", "CJK");
|
985 |
+
}
|
986 |
+
}
|
987 |
+
|
988 |
+
# ASCII
|
989 |
+
} elsif ($char =~ /^[\x00-\x7F]$/) {
|
990 |
+
$this->add_node($char, $i, $i+1, *chart_ht, "", "ASCII"); # ASCII character, incl. control characters
|
991 |
+
|
992 |
+
# Emoji, dingbats, pictographs
|
993 |
+
} elsif ($char =~ /^(\xE2[\x98-\x9E]|\xF0\x9F[\x8C-\xA7])/) {
|
994 |
+
$this->add_node($char, $i, $i+1, *chart_ht, "", "pictograph");
|
995 |
+
|
996 |
+
# Hangul (Korean)
|
997 |
+
} elsif (($char =~ /^[\xEA-\xED]/)
|
998 |
+
&& ($romanized_char = $this->unicode_hangul_romanization($char))) {
|
999 |
+
$this->add_node($romanized_char, $i, $i+1, *chart_ht, "", "Hangul");
|
1000 |
+
|
1001 |
+
# CJK (Chinese, Japanese, Korean)
|
1002 |
+
} elsif ($chinesePM->string_contains_utf8_cjk_unified_ideograph_p($char)
|
1003 |
+
&& ($tonal_translit = $chinesePM->tonal_pinyin($char, *pinyin_ht, ""))) {
|
1004 |
+
$de_accented_translit = $util->de_accent_string($tonal_translit);
|
1005 |
+
$this->add_node($de_accented_translit, $i, $i+1, *chart_ht, "", "CJK");
|
1006 |
+
|
1007 |
+
# Virama (cancel preceding vowel in Abugida scripts)
|
1008 |
+
} elsif ($char_name =~ /\bSIGN (?:VIRAMA|AL-LAKUNA|ASAT|COENG|PAMAAEH)\b/) {
|
1009 |
+
# VIRAMA: cancel preceding default vowel (in Abugida scripts)
|
1010 |
+
if (($prev_script eq $current_script)
|
1011 |
+
&& (($prev_char_roman_consonant, $prev_char_roman_vowel) = ($prev_char_roman =~ /^(.*[bcdfghjklmnpqrstvwxyz])([aeiou]+)$/i))
|
1012 |
+
&& ($ht{SCRIPT_ABUDIGA_DEFAULT_VOWEL}->{$current_script}->{(lc $prev_char_roman_vowel)})) {
|
1013 |
+
$this->add_node($prev_char_roman_consonant, $prev_node_start, $i+1, *chart_ht, "", "virama");
|
1014 |
+
} else {
|
1015 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "unexpected-virama");
|
1016 |
+
}
|
1017 |
+
|
1018 |
+
# Nukta (special (typically foreign) variant)
|
1019 |
+
} elsif ($char_name =~ /\bSIGN (?:NUKTA)\b/) {
|
1020 |
+
# NUKTA (dot): indicates special (typically foreign) variant; normally covered by multi-mappings
|
1021 |
+
if ($prev_script eq $current_script) {
|
1022 |
+
my $node_id = $this->add_node($prev_char_roman, $prev_node_start, $i+1, *chart_ht, "", "nukta");
|
1023 |
+
$this->copy_slot_values($prev_node_id, $node_id, *chart_ht, "all");
|
1024 |
+
$this->set_node_id_slot_value($node_id, "nukta", 1, *chart_ht);
|
1025 |
+
} else {
|
1026 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "unexpected-nukta");
|
1027 |
+
}
|
1028 |
+
|
1029 |
+
# Zero-width character, incl. zero width space/non-joiner/joiner, left-to-right/right-to-left mark
|
1030 |
+
} elsif ($char =~ /^\xE2\x80[\x8B-\x8F\xAA-\xAE]$/) {
|
1031 |
+
if ($prev_node_id) {
|
1032 |
+
my $node_id = $this->add_node($prev_char_roman, $prev_node_start, $i+1, *chart_ht, "", "zero-width-char");
|
1033 |
+
$this->copy_slot_values($prev_node_id, $node_id, *chart_ht, "all");
|
1034 |
+
} else {
|
1035 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "zero-width-char");
|
1036 |
+
}
|
1037 |
+
} elsif (($char =~ /^\xEF\xBB\xBF$/) && $prev_node_id) { # OK to leave byte-order-mark at beginning of line
|
1038 |
+
my $node_id = $this->add_node($prev_char_roman, $prev_node_start, $i+1, *chart_ht, "", "zero-width-char");
|
1039 |
+
$this->copy_slot_values($prev_node_id, $node_id, *chart_ht, "all");
|
1040 |
+
|
1041 |
+
# Tone mark
|
1042 |
+
} elsif ($tone_mark) {
|
1043 |
+
if ($prev_script eq $current_script) {
|
1044 |
+
my $node_id = $this->add_node($prev_char_roman, $prev_node_start, $i+1, *chart_ht, "", "tone-mark");
|
1045 |
+
$this->copy_slot_values($prev_node_id, $node_id, *chart_ht, "all");
|
1046 |
+
$this->set_node_id_slot_value($node_id, "tone-mark", $tone_mark, *chart_ht);
|
1047 |
+
} else {
|
1048 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "unexpected-tone-mark");
|
1049 |
+
}
|
1050 |
+
|
1051 |
+
# Diacritic
|
1052 |
+
} elsif (($char_name =~ /\b(ACCENT|TONE|COMBINING DIAERESIS|COMBINING DIAERESIS BELOW|COMBINING MACRON|COMBINING VERTICAL LINE ABOVE|COMBINING DOT ABOVE RIGHT|COMBINING TILDE|COMBINING CYRILLIC|MUUSIKATOAN|TRIISAP)\b/) && ($ht{UTF_TO_CAT}->{$char} =~ /^Mn/)) {
|
1053 |
+
if ($prev_script eq $current_script) {
|
1054 |
+
my $node_id = $this->add_node($prev_char_roman, $prev_node_start, $i+1, *chart_ht, "", "diacritic");
|
1055 |
+
$this->copy_slot_values($prev_node_id, $node_id, *chart_ht, "all");
|
1056 |
+
$diacritic = lc $char_name;
|
1057 |
+
$diacritic =~ s/^.*(?:COMBINING CYRILLIC|COMBINING|SIGN)\s+//i;
|
1058 |
+
$diacritic =~ s/^.*(ACCENT|TONE)/$1/i;
|
1059 |
+
$diacritic =~ s/^\s*//;
|
1060 |
+
$this->set_node_id_slot_value($node_id, "diacritic", $diacritic, *chart_ht);
|
1061 |
+
# print STDERR "diacritic: $diacritic\n";
|
1062 |
+
} else {
|
1063 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "unexpected-diacritic");
|
1064 |
+
}
|
1065 |
+
|
1066 |
+
# Romanize to find out more
|
1067 |
+
} elsif ($char_name) {
|
1068 |
+
if (defined($romanized_char = $this->romanize_char_at_position($i, $lang_code, $output_style, *ht, *chart_ht))) {
|
1069 |
+
# print STDERR "ROM l.$line_number/$i: $romanized_char\n" if $line_number =~ /^[12]$/;
|
1070 |
+
print STDOUT "ROM l.$line_number/$i: $romanized_char\n" if $verbosePM;
|
1071 |
+
|
1072 |
+
# Empty string mapping
|
1073 |
+
if ($romanized_char eq "\"\"") {
|
1074 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "empty-string-mapping");
|
1075 |
+
# consider adding something for implausible romanizations of length 6+
|
1076 |
+
|
1077 |
+
# keep original character (instead of romanized_char lengthener, character-18b00 etc.)
|
1078 |
+
} elsif (($romanized_char =~ /^(character|lengthener|modifier)/)) {
|
1079 |
+
$this->add_node($char, $i, $i+1, *chart_ht, "", "nevermind-keep-original");
|
1080 |
+
|
1081 |
+
# Syllabic suffix in languages written in Abugida scripts, e.g. -m, -ng
|
1082 |
+
} elsif (($romanized_char =~ /^\+(H|M|N|NG)$/i)
|
1083 |
+
&& ($prev_script eq $current_script)
|
1084 |
+
&& ($ht{SCRIPT_ABUDIGA_DEFAULT_VOWEL}->{$current_script}->{"a"})) {
|
1085 |
+
my $core_suffix = $romanized_char;
|
1086 |
+
$core_suffix =~ s/^\+//;
|
1087 |
+
if ($prev_char_roman =~ /[aeiou]$/i) {
|
1088 |
+
$this->add_node($core_suffix, $i, $i+1, *chart_ht, "", "syllable-end-consonant");
|
1089 |
+
} else {
|
1090 |
+
$this->add_node(join("", $prev_char_roman, "a", $core_suffix), $prev_node_start, $i+1, *chart_ht, "", "syllable-end-consonant-with-added-a");
|
1091 |
+
$this->add_node(join("", "a", $core_suffix), $i, $i+1, *chart_ht, "backup", "syllable-end-consonant");
|
1092 |
+
}
|
1093 |
+
|
1094 |
+
# Japanese special cases
|
1095 |
+
} elsif ($char_name =~ /(?:HIRAGANA|KATAKANA) LETTER SMALL Y/) {
|
1096 |
+
if (($prev_script eq $current_script)
|
1097 |
+
&& (($prev_char_roman_consonant) = ($prev_char_roman =~ /^(.*[bcdfghjklmnpqrstvwxyz])i$/i))) {
|
1098 |
+
unless ($this->get_node_for_span_and_type($prev_node_start, $i+1, *chart_ht, "")) {
|
1099 |
+
$this->add_node("$prev_char_roman_consonant$romanized_char", $prev_node_start, $i+1, *chart_ht, "", "japanese-contraction");
|
1100 |
+
}
|
1101 |
+
} else {
|
1102 |
+
$this->add_node($romanized_char, $i, $i+1, *chart_ht, "", "unexpected-japanese-contraction-character");
|
1103 |
+
}
|
1104 |
+
} elsif (($prev_script =~ /^(HIRAGANA|KATAKANA)$/i)
|
1105 |
+
&& ($char_name eq "KATAKANA-HIRAGANA PROLONGED SOUND MARK") # Choonpu
|
1106 |
+
&& (($prev_char_roman_vowel) = ($prev_char_roman =~ /([aeiou])$/i))) {
|
1107 |
+
$this->add_node("$prev_char_roman$prev_char_roman_vowel", $prev_node_start, $i+1, *chart_ht, "", "japanese-vowel-lengthening");
|
1108 |
+
} elsif (($current_script =~ /^(Hiragana|Katakana)$/i)
|
1109 |
+
&& ($char_name =~ /^(HIRAGANA|KATAKANA) LETTER SMALL TU$/i) # Sokuon/Sukun
|
1110 |
+
&& ($next_script eq $current_script)
|
1111 |
+
&& ($romanized_next_char = $this->romanize_char_at_position_incl_multi($i+1, $lang_code, $output_style, *ht, *chart_ht))
|
1112 |
+
&& (($doubled_consonant) = ($romanized_next_char =~ /^(ch|[bcdfghjklmnpqrstwz])/i))) {
|
1113 |
+
# Note: $romanized_next_char could be part of a multi-character mapping
|
1114 |
+
# print STDERR "current_script: $current_script char_name: $char_name next_script: $next_script romanized_next_char: $romanized_next_char doubled_consonant: $doubled_consonant\n";
|
1115 |
+
$doubled_consonant = "t" if $doubled_consonant eq "ch";
|
1116 |
+
$this->add_node($doubled_consonant, $i, $i+1, *chart_ht, "", "japanese-consonant-doubling");
|
1117 |
+
|
1118 |
+
# Greek small letter mu to micro-sign (instead of to "m") as used in abbreviations for microgram/micrometer/microliter/microsecond/micromolar/microfarad etc.
|
1119 |
+
} elsif (($char_name eq "GREEK SMALL LETTER MU")
|
1120 |
+
&& (! ($prev_script =~ /^GREEK$/))
|
1121 |
+
&& ($i < $#chars)
|
1122 |
+
&& ($chart_ht{ORIG_CHAR}->{($i+1)} =~ /^[cfgjlmstv]$/i)) {
|
1123 |
+
$this->add_node("\xC2\xB5", $i, $i+1, *chart_ht, "", "greek-mu-to-micro-sign");
|
1124 |
+
|
1125 |
+
# Gurmukhi addak (doubles following consonant)
|
1126 |
+
} elsif (($current_script eq "Gurmukhi")
|
1127 |
+
&& ($char_name eq "GURMUKHI ADDAK")) {
|
1128 |
+
if (($next_script eq $current_script)
|
1129 |
+
&& ($romanized_next_char = $this->romanize_char_at_position_incl_multi($i+1, $lang_code, $output_style, *ht, *chart_ht))
|
1130 |
+
&& (($doubled_consonant) = ($romanized_next_char =~ /^([bcdfghjklmnpqrstvwxz])/i))) {
|
1131 |
+
$this->add_node($doubled_consonant, $i, $i+1, *chart_ht, "", "gurmukhi-consonant-doubling");
|
1132 |
+
} else {
|
1133 |
+
$this->add_node("'", $i, $i+1, *chart_ht, "", "gurmukhi-unexpected-addak");
|
1134 |
+
}
|
1135 |
+
|
1136 |
+
# Subjoined character
|
1137 |
+
} elsif ($subjoined_char_p
|
1138 |
+
&& ($prev_script eq $current_script)
|
1139 |
+
&& (($prev_char_roman_consonant, $prev_char_roman_vowel) = ($prev_char_roman =~ /^(.*[bcdfghjklmnpqrstvwxyz])([aeiou]+)$/i))
|
1140 |
+
&& ($ht{SCRIPT_ABUDIGA_DEFAULT_VOWEL}->{$current_script}->{(lc $prev_char_roman_vowel)})) {
|
1141 |
+
my $new_roman = "$prev_char_roman_consonant$romanized_char";
|
1142 |
+
$this->add_node($new_roman, $prev_node_start, $i+1, *chart_ht, "", "subjoined-character");
|
1143 |
+
# print STDERR " Subjoin l.$line_number/$i: $new_roman\n" if $line_number =~ /^[12]$/;
|
1144 |
+
|
1145 |
+
# Thai special case: written-pre-consonant-spoken-post-consonant
|
1146 |
+
} elsif (($char_name =~ /THAI CHARACTER/)
|
1147 |
+
&& ($prev_script eq $current_script)
|
1148 |
+
&& ($chart_ht{CHAR_SYLLABLE_INFO}->{($i-1)} =~ /written-pre-consonant-spoken-post-consonant/i)
|
1149 |
+
&& ($prev_char_roman =~ /^[aeiou]+$/i)
|
1150 |
+
&& ($romanized_char =~ /^[bcdfghjklmnpqrstvwxyz]/)) {
|
1151 |
+
$this->add_node("$romanized_char$prev_char_roman", $prev_node_start, $i+1, *chart_ht, "", "thai-vowel-consonant-swap");
|
1152 |
+
|
1153 |
+
# Thai special case: THAI CHARACTER O ANG (U+0E2D "\xE0\xB8\xAD")
|
1154 |
+
} elsif ($char_name eq "THAI CHARACTER O ANG") {
|
1155 |
+
if ($prev_script ne $current_script) {
|
1156 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "thai-initial-o-ang-drop");
|
1157 |
+
} elsif ($next_script ne $current_script) {
|
1158 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "thai-final-o-ang-drop");
|
1159 |
+
} else {
|
1160 |
+
my $romanized_next_char = $this->romanize_char_at_position($i+1, $lang_code, $output_style, *ht, *chart_ht);
|
1161 |
+
my $romanized_prev2_char = $this->romanize_char_at_position($i-2, $lang_code, $output_style, *ht, *chart_ht);
|
1162 |
+
if (($prev_char_roman =~ /^[bcdfghjklmnpqrstvwxz]+$/i)
|
1163 |
+
&& ($romanized_next_char =~ /^[bcdfghjklmnpqrstvwxz]+$/i)) {
|
1164 |
+
$this->add_node("o", $i, $i+1, *chart_ht, "", "thai-middle-o-ang"); # keep between consonants
|
1165 |
+
} elsif (($prev2_script eq $current_script)
|
1166 |
+
&& 0
|
1167 |
+
&& ($prev_char_name =~ /^THAI CHARACTER MAI [A-Z]+$/) # Thai tone
|
1168 |
+
&& ($romanized_prev2_char =~ /^[bcdfghjklmnpqrstvwxz]+$/i)
|
1169 |
+
&& ($romanized_next_char =~ /^[bcdfghjklmnpqrstvwxz]+$/i)) {
|
1170 |
+
$this->add_node("o", $i, $i+1, *chart_ht, "", "thai-middle-o-ang"); # keep between consonant+tone-mark and consonant
|
1171 |
+
} else {
|
1172 |
+
$this->add_node("", $i, $i+1, *chart_ht, "", "thai-middle-o-ang-drop"); # drop next to vowel
|
1173 |
+
}
|
1174 |
+
}
|
1175 |
+
|
1176 |
+
# Romanization with space
|
1177 |
+
} elsif ($romanized_char =~ /\s/) {
|
1178 |
+
$this->add_node($char, $i, $i+1, *chart_ht, "", "space");
|
1179 |
+
|
1180 |
+
# Tibetan special cases
|
1181 |
+
} elsif ($current_script eq "Tibetan") {
|
1182 |
+
|
1183 |
+
if ($subjoined_char_p
|
1184 |
+
&& ($prev_script eq $current_script)
|
1185 |
+
&& $prev_letter_plus_char_p
|
1186 |
+
&& ($prev_char_roman =~ /^[bcdfghjklmnpqrstvwxyz]+$/i)) {
|
1187 |
+
$this->add_node("$prev_char_roman$romanized_char", $prev_node_start, $i+1, *chart_ht, "", "subjoined-tibetan-character");
|
1188 |
+
} elsif ($romanized_char =~ /^-A$/i) {
|
1189 |
+
my $romanized_next_char = $this->romanize_char_at_position($i+1, $lang_code, $output_style, *ht, *chart_ht);
|
1190 |
+
if (! $prev_letter_plus_char_p) {
|
1191 |
+
$this->add_node("'", $i, $i+1, *chart_ht, "", "tibetan-frontal-dash-a");
|
1192 |
+
} elsif (($prev_script eq $current_script)
|
1193 |
+
&& ($next_script eq $current_script)
|
1194 |
+
&& ($prev_char_roman =~ /[bcdfghjklmnpqrstvwxyz]$/)
|
1195 |
+
&& ($romanized_next_char =~ /^[aeiou]/)) {
|
1196 |
+
$this->add_node("a'", $i, $i+1, *chart_ht, "", "tibetan-medial-dash-a");
|
1197 |
+
} elsif (($prev_script eq $current_script)
|
1198 |
+
&& ($next_script eq $current_script)
|
1199 |
+
&& ($prev_char_roman =~ /[aeiou]$/)
|
1200 |
+
&& ($romanized_next_char =~ /[aeiou]/)) {
|
1201 |
+
$this->add_node("'", $i, $i+1, *chart_ht, "", "tibetan-reduced-medial-dash-a");
|
1202 |
+
} elsif (($prev_script eq $current_script)
|
1203 |
+
&& (! ($prev_char_roman =~ /[aeiou]/))
|
1204 |
+
&& (! $next_letter_plus_char_p)) {
|
1205 |
+
$this->add_node("a", $i, $i+1, *chart_ht, "", "tibetan-final-dash-a");
|
1206 |
+
} else {
|
1207 |
+
$this->add_node("a", $i, $i+1, *chart_ht, "", "unexpected-tibetan-dash-a");
|
1208 |
+
}
|
1209 |
+
} elsif (($romanized_char =~ /^[AEIOU]/i)
|
1210 |
+
&& ($prev_script eq $current_script)
|
1211 |
+
&& ($prev_char_roman =~ /^A$/i)
|
1212 |
+
&& (! $prev2_letter_plus_char_p)) {
|
1213 |
+
$this->add_node($romanized_char, $prev_node_start, $i+1, *chart_ht, "", "tibetan-dropped-word-initial-a");
|
1214 |
+
} else {
|
1215 |
+
$this->add_node($romanized_char, $i, $i+1, *chart_ht, "", "standard-unicode-based-romanization");
|
1216 |
+
}
|
1217 |
+
|
1218 |
+
# Khmer (for MUUSIKATOAN etc. see under "Diacritic" above)
|
1219 |
+
} elsif (($current_script eq "Khmer")
|
1220 |
+
&& (($char_roman_consonant, $char_roman_vowel) = ($romanized_char =~ /^(.*[bcdfghjklmnpqrstvwxyz])([ao]+)-$/i))) {
|
1221 |
+
my $romanized_next_char = $this->romanize_char_at_position($i+1, $lang_code, $output_style, *ht, *chart_ht);
|
1222 |
+
if (($next_script eq $current_script)
|
1223 |
+
&& ($romanized_next_char =~ /^[aeiouy]/i)) {
|
1224 |
+
$this->add_node($char_roman_consonant, $i, $i+1, *chart_ht, "", "khmer-vowel-drop");
|
1225 |
+
} else {
|
1226 |
+
$this->add_node("$char_roman_consonant$char_roman_vowel", $i, $i+1, *chart_ht, "", "khmer-standard-unicode-based-romanization");
|
1227 |
+
}
|
1228 |
+
|
1229 |
+
# Abudiga add default vowel
|
1230 |
+
} elsif ((@abudiga_default_vowels = sort keys %{$ht{SCRIPT_ABUDIGA_DEFAULT_VOWEL}->{$current_script}})
|
1231 |
+
&& ($abudiga_default_vowel = $abudiga_default_vowels[0])
|
1232 |
+
&& ($romanized_char =~ /^[bcdfghjklmnpqrstvwxyz]+$/i)) {
|
1233 |
+
my $new_roman = join("", $romanized_char, $abudiga_default_vowel);
|
1234 |
+
$this->add_node($new_roman, $i, $i+1, *chart_ht, "", "standard-unicode-based-romanization-plus-abudiga-default-vowel");
|
1235 |
+
# print STDERR " Abudiga add default vowel l.$line_number/$i: $new_roman\n" if $line_number =~ /^[12]$/;
|
1236 |
+
|
1237 |
+
# Standard romanization
|
1238 |
+
} else {
|
1239 |
+
$node_id = $this->add_node($romanized_char, $i, $i+1, *chart_ht, "", "standard-unicode-based-romanization");
|
1240 |
+
}
|
1241 |
+
} else {
|
1242 |
+
$this->add_node($char, $i, $i+1, *chart_ht, "", "unexpected-original");
|
1243 |
+
}
|
1244 |
+
} elsif (defined($romanized_char = $this->romanize_char_at_position($i, $lang_code, $output_style, *ht, *chart_ht))
|
1245 |
+
&& ((length($romanized_char) <= 2)
|
1246 |
+
|| ($ht{UTF_TO_CHAR_ROMANIZATION}->{$char}))) { # or from unicode_overwrite_romanization table
|
1247 |
+
$romanized_char =~ s/^""$//;
|
1248 |
+
$this->add_node($romanized_char, $i, $i+1, *chart_ht, "", "romanized-without-character-name");
|
1249 |
+
} else {
|
1250 |
+
$this->add_node($char, $i, $i+1, *chart_ht, "", "unexpected-original-without-character-name");
|
1251 |
+
}
|
1252 |
+
}
|
1253 |
+
$i = $next_index;
|
1254 |
+
}
|
1255 |
+
|
1256 |
+
$this->schwa_deletion(0, $n_characters, *chart_ht, $lang_code);
|
1257 |
+
$this->default_vowelize_tibetan(0, $n_characters, *chart_ht, $lang_code, $line_number) if $chart_ht{CHART_CONTAINS_SCRIPT}->{"Tibetan"};
|
1258 |
+
$this->assemble_numbers_in_chart(*chart_ht, $line_number);
|
1259 |
+
|
1260 |
+
if ($return_chart_p) {
|
1261 |
+
} elsif ($return_offset_mappings_p) {
|
1262 |
+
($result, $offset_mappings, $new_char_offset, $new_rom_char_offset) = $this->best_romanized_string(0, $n_characters, *chart_ht, $control, $initial_char_offset, $initial_rom_char_offset);
|
1263 |
+
} else {
|
1264 |
+
$result = $this->best_romanized_string(0, $n_characters, *chart_ht) unless $return_chart_p;
|
1265 |
+
}
|
1266 |
+
|
1267 |
+
if ($verbosePM) {
|
1268 |
+
my $logfile = "/nfs/isd/ulf/cgi-mt/amr-tmp/uroman-log.txt";
|
1269 |
+
$util->append_to_file($logfile, $log) if $log && (-r $logfile);
|
1270 |
+
}
|
1271 |
+
|
1272 |
+
return ($result, $offset_mappings) if $return_offset_mappings_p;
|
1273 |
+
return *chart_ht if $return_chart_p;
|
1274 |
+
return $result;
|
1275 |
+
}
|
1276 |
+
|
1277 |
+
sub string_to_json_string {
|
1278 |
+
local($this, $s) = @_;
|
1279 |
+
|
1280 |
+
utf8::decode($s);
|
1281 |
+
my $j = JSON->new->utf8->encode([$s]);
|
1282 |
+
$j =~ s/^\[(.*)\]$/$1/;
|
1283 |
+
return $j;
|
1284 |
+
}
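The bracket-stripping trick used by `string_to_json_string` can be tried in isolation. The following is only a sketch under assumptions: it uses the core module `JSON::PP` instead of the bundled `JSON` wrapper, the helper name `json_quote` is hypothetical, and the input is assumed to be an already-decoded character string.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical helper (not part of uroman): JSON-quote a single scalar by
# encoding a one-element array and stripping the surrounding brackets.
sub json_quote {
    my ($s) = @_;
    my $j = JSON::PP->new->encode([$s]);
    $j =~ s/^\[(.*)\]$/$1/s;
    return $j;
}

print json_quote('say "hi"'), "\n";   # prints "say \"hi\""
```

Encoding a one-element array and removing the brackets leaves all escaping rules to the JSON encoder instead of re-implementing them by hand.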
|
1285 |
+
|
1286 |
+
sub chart_to_json_romanization_elements {
|
1287 |
+
local($this, $chart_start, $chart_end, *chart_ht, $line_number) = @_;
|
1288 |
+
|
1289 |
+
my $result = "";
|
1290 |
+
my $start = $chart_start;
|
1291 |
+
my $end;
|
1292 |
+
while ($start < $chart_end) {
|
1293 |
+
$end = $this->find_end_of_rom_segment($start, $chart_end, *chart_ht);
|
1294 |
+
my @best_romanizations;
|
1295 |
+
if (($end && ($start < $end))
|
1296 |
+
&& (@best_romanizations = $this->best_romanizations($start, $end, *chart_ht))) {
|
1297 |
+
$orig_segment = $this->orig_string_at_span($start, $end, *chart_ht);
|
1298 |
+
$next_start = $end;
|
1299 |
+
} else {
|
1300 |
+
$orig_segment = $chart_ht{ORIG_CHAR}->{$start};
|
1301 |
+
@best_romanizations = ($orig_segment);
|
1302 |
+
$next_start = $start + 1;
|
1303 |
+
}
|
1304 |
+
$exclusive_end = $next_start - 1;
|
1305 |
+
# $guarded_orig = $util->string_guard($orig_segment);
|
1306 |
+
$guarded_orig = $this->string_to_json_string($orig_segment);
|
1307 |
+
$result .= " { \"line\": $line_number, \"start\": $start, \"end\": $exclusive_end, \"orig\": $guarded_orig, \"roms\": [";
|
1308 |
+
foreach $i ((0 .. $#best_romanizations)) {
|
1309 |
+
my $rom = $best_romanizations[$i];
|
1310 |
+
# my $guarded_rom = $util->string_guard($rom);
|
1311 |
+
my $guarded_rom = $this->string_to_json_string($rom);
|
1312 |
+
$result .= " { \"rom\": $guarded_rom";
|
1313 |
+
# $result .= ", \"alt\": true" if $i >= 1;
|
1314 |
+
$result .= " }";
|
1315 |
+
$result .= "," if $i < $#best_romanizations;
|
1316 |
+
}
|
1317 |
+
$result .= " ] },\n";
|
1318 |
+
$start = $next_start;
|
1319 |
+
}
|
1320 |
+
return $result;
|
1321 |
+
}
|
1322 |
+
|
1323 |
+
sub default_vowelize_tibetan {
|
1324 |
+
local($this, $chart_start, $chart_end, *chart_ht, $lang_code, $line_number) = @_;
|
1325 |
+
|
1326 |
+
# my $verbose = ($line_number == 103);
|
1327 |
+
# print STDERR "\nStart default_vowelize_tibetan l.$line_number $chart_start-$chart_end\n" if $verbose;
|
1328 |
+
my $token_start = $chart_start;
|
1329 |
+
my $next_token_start = $chart_start;
|
1330 |
+
while (($token_start = $next_token_start) < $chart_end) {
|
1331 |
+
$next_token_start = $token_start + 1;
|
1332 |
+
|
1333 |
+
next unless $chart_ht{CHAR_LETTER_PLUS}->{$token_start};
|
1334 |
+
my $current_script = $chart_ht{CHAR_SCRIPT}->{$token_start};
|
1335 |
+
next unless ($current_script eq "Tibetan");
|
1336 |
+
my $token_end = $chart_ht{LETTER_TOKEN_SEGMENT_START_TO_END}->{$token_start};
|
1337 |
+
next unless $token_end;
|
1338 |
+
next unless $token_end > $token_start;
|
1339 |
+
$next_token_start = $token_end;
|
1340 |
+
|
1341 |
+
my $start = $token_start;
|
1342 |
+
my $end;
|
1343 |
+
my @node_ids = ();
|
1344 |
+
while ($start < $token_end) {
|
1345 |
+
$end = $this->find_end_of_rom_segment($start, $chart_end, *chart_ht);
|
1346 |
+
last unless $end && ($end > $start);
|
1347 |
+
my @alt_node_ids = sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end}};
|
1348 |
+
last unless @alt_node_ids;
|
1349 |
+
push(@node_ids, $alt_node_ids[0]);
|
1350 |
+
$start = $end;
|
1351 |
+
}
|
1352 |
+
my $contains_vowel_p = 0;
|
1353 |
+
my @romanizations = ();
|
1354 |
+
foreach $node_id (@node_ids) {
|
1355 |
+
my $roman = $chart_ht{NODE_ROMAN}->{$node_id};
|
1356 |
+
$roman = "" unless defined($roman);
|
1357 |
+
push(@romanizations, $roman);
|
1358 |
+
$contains_vowel_p = 1 if $roman =~ /[aeiou]/i;
|
1359 |
+
}
|
1360 |
+
# print STDERR " old: $token_start-$token_end @romanizations\n" if $verbose;
|
1361 |
+
unless ($contains_vowel_p) {
|
1362 |
+
my $default_vowel_target_index;
|
1363 |
+
if ($#node_ids <= 1) {
|
1364 |
+
$default_vowel_target_index = 0;
|
1365 |
+
} elsif ($romanizations[$#romanizations] eq "s") {
|
1366 |
+
if ($romanizations[($#romanizations-1)] eq "y") {
|
1367 |
+
$default_vowel_target_index = $#romanizations-1;
|
1368 |
+
} else {
|
1369 |
+
$default_vowel_target_index = $#romanizations-2;
|
1370 |
+
}
|
1371 |
+
} else {
|
1372 |
+
$default_vowel_target_index = $#romanizations-1;
|
1373 |
+
}
|
1374 |
+
$romanizations[$default_vowel_target_index] .= "a";
|
1375 |
+
my $old_node_id = $node_ids[$default_vowel_target_index];
|
1376 |
+
my $old_start = $chart_ht{NODE_START}->{$old_node_id};
|
1377 |
+
my $old_end = $chart_ht{NODE_END}->{$old_node_id};
|
1378 |
+
my $old_roman = $chart_ht{NODE_ROMAN}->{$old_node_id};
|
1379 |
+
my $new_roman = $old_roman . "a";
|
1380 |
+
my $new_node_id = $this->add_node($new_roman, $old_start, $old_end, *chart_ht, "", "tibetan-default-vowel");
|
1381 |
+
$this->copy_slot_values($old_node_id, $new_node_id, *chart_ht, "all");
|
1382 |
+
$chart_ht{NODE_TYPE}->{$old_node_id} = "backup"; # keep, but demote
|
1383 |
+
}
|
1384 |
+
if (($romanizations[0] eq "'")
|
1385 |
+
&& ($#romanizations >= 1)
|
1386 |
+
&& ($romanizations[1] =~ /^[o]$/)) {
|
1387 |
+
my $old_node_id = $node_ids[0];
|
1388 |
+
my $old_start = $chart_ht{NODE_START}->{$old_node_id};
|
1389 |
+
my $old_end = $chart_ht{NODE_END}->{$old_node_id};
|
1390 |
+
my $new_node_id = $this->add_node("", $old_start, $old_end, *chart_ht, "", "tibetan-delete-apostrophe");
|
1391 |
+
$this->copy_slot_values($old_node_id, $new_node_id, *chart_ht, "all");
|
1392 |
+
$chart_ht{NODE_TYPE}->{$old_node_id} = "alt"; # keep, but demote
|
1393 |
+
}
|
1394 |
+
if (($#node_ids >= 1)
|
1395 |
+
&& ($romanizations[$#romanizations] =~ /^[bcdfghjklmnpqrstvwxz]+y$/)) {
|
1396 |
+
my $old_node_id = $node_ids[$#romanizations];
|
1397 |
+
my $old_start = $chart_ht{NODE_START}->{$old_node_id};
|
1398 |
+
my $old_end = $chart_ht{NODE_END}->{$old_node_id};
|
1399 |
+
my $old_roman = $chart_ht{NODE_ROMAN}->{$old_node_id};
|
1400 |
+
my $new_roman = $old_roman . "a";
|
1401 |
+
my $new_node_id = $this->add_node($new_roman, $old_start, $old_end, *chart_ht, "", "tibetan-syllable-final-vowel");
|
1402 |
+
$this->copy_slot_values($old_node_id, $new_node_id, *chart_ht, "all");
|
1403 |
+
$chart_ht{NODE_TYPE}->{$old_node_id} = "alt"; # keep, but demote
|
1404 |
+
}
|
1405 |
+
foreach $old_node_id (@node_ids) {
|
1406 |
+
my $old_roman = $chart_ht{NODE_ROMAN}->{$old_node_id};
|
1407 |
+
next unless $old_roman =~ /-a/;
|
1408 |
+
my $old_start = $chart_ht{NODE_START}->{$old_node_id};
|
1409 |
+
my $old_end = $chart_ht{NODE_END}->{$old_node_id};
|
1410 |
+
my $new_roman = $old_roman;
|
1411 |
+
$new_roman =~ s/-a/a/;
|
1412 |
+
my $new_node_id = $this->add_node($new_roman, $old_start, $old_end, *chart_ht, "", "tibetan-syllable-delete-dash");
|
1413 |
+
$this->copy_slot_values($old_node_id, $new_node_id, *chart_ht, "all");
|
1414 |
+
$chart_ht{NODE_TYPE}->{$old_node_id} = "alt"; # keep, but demote
|
1415 |
+
}
|
1416 |
+
}
|
1417 |
+
}
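The index rule that `default_vowelize_tibetan` uses to place the default vowel can be illustrated on plain lists. This is a simplified sketch, not the module's code: `add_default_vowel` is a hypothetical name, it works on a list of per-segment romanizations instead of chart nodes, and it ignores the apostrophe and dash handling that follows in the real routine.

```perl
use strict;
use warnings;

# Hypothetical helper: @roms holds the per-segment romanizations of one
# Tibetan syllable. If none of them contains a vowel, append "a" to the
# segment selected by the same index rule as default_vowelize_tibetan.
sub add_default_vowel {
    my @roms = @_;
    return @roms unless @roms;
    return @roms if grep { /[aeiou]/i } @roms;   # already has a vowel
    my $index;
    if ($#roms <= 1) {
        $index = 0;                              # one or two segments: first
    } elsif ($roms[-1] eq "s") {
        # trailing "s": skip back one extra segment unless preceded by "y"
        $index = ($roms[-2] eq "y") ? $#roms - 1 : $#roms - 2;
    } else {
        $index = $#roms - 1;                     # default: second-to-last
    }
    $roms[$index] .= "a";
    return @roms;
}

print join("", add_default_vowel("b", "k", "r")), "\n";        # bkar
print join("", add_default_vowel("b", "s", "g", "s")), "\n";   # bsags
```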
|
1418 |
+
|
1419 |
+
sub schwa_deletion {
|
1420 |
+
local($this, $chart_start, $chart_end, *chart_ht, $lang_code) = @_;
|
1421 |
+
# delete word-final simple "a" in Devanagari (e.g. nepaala -> nepaal)
|
1422 |
+
# see Wikipedia article "Schwa deletion in Indo-Aryan languages"
|
1423 |
+
|
1424 |
+
if ($chart_ht{CHART_CONTAINS_SCRIPT}->{"Devanagari"}) {
|
1425 |
+
my $script_start = $chart_start;
|
1426 |
+
my $next_script_start = $chart_start;
|
1427 |
+
while (($script_start = $next_script_start) < $chart_end) {
|
1428 |
+
$next_script_start = $script_start + 1;
|
1429 |
+
|
1430 |
+
my $current_script = $chart_ht{CHAR_SCRIPT}->{$script_start};
|
1431 |
+
next unless ($current_script eq "Devanagari");
|
1432 |
+
my $script_end = $chart_ht{SCRIPT_SEGMENT_START_TO_END}->{$script_start};
|
1433 |
+
next unless $script_end;
|
1434 |
+
next unless $script_end - $script_start >= 2;
|
1435 |
+
$next_script_start = $script_end;
|
1436 |
+
my $end_node_id = $this->get_node_for_span($script_end-1, $script_end, *chart_ht);
|
1437 |
+
next unless $end_node_id;
|
1438 |
+
my $end_roman = $chart_ht{NODE_ROMAN}->{$end_node_id};
|
1439 |
+
next unless ($end_consonant) = ($end_roman =~ /^([bcdfghjklmnpqrstvwxz]+)a$/i);
|
1440 |
+
my $prev_node_id = $this->get_node_for_span($script_end-4, $script_end-1, *chart_ht)
|
1441 |
+
|| $this->get_node_for_span($script_end-3, $script_end-1, *chart_ht)
|
1442 |
+
|| $this->get_node_for_span($script_end-2, $script_end-1, *chart_ht);
|
1443 |
+
next unless $prev_node_id;
|
1444 |
+
my $prev_roman = $chart_ht{NODE_ROMAN}->{$prev_node_id};
|
1445 |
+
next unless $prev_roman =~ /[aeiou]/i;
|
1446 |
+
# TO DO: check further back for vowel (e.g. if $prev_roman eq "r" due to vowel cancelation)
|
1447 |
+
|
1448 |
+
$chart_ht{NODE_TYPE}->{$end_node_id} = "alt"; # keep, but demote
|
1449 |
+
# print STDERR "* Schwa deletion " . ($script_end-1) . "-$script_end $end_roman->$end_consonant\n";
|
1450 |
+
$this->add_node($end_consonant, $script_end-1, $script_end, *chart_ht, "", "devanagari-with-deleted-final-schwa");
|
1451 |
+
}
|
1452 |
+
}
|
1453 |
+
}
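The word-final schwa rule in `schwa_deletion` can be seen more easily on a list of romanized segments. The sketch below is an assumption-based simplification: `delete_final_schwa` is a hypothetical name, the input is one Devanagari word already split into per-character romanizations, and only the immediately preceding segment is checked for a vowel (the original looks up the preceding chart node over several span widths).

```perl
use strict;
use warnings;

# Hypothetical helper: drop a word-final "a" when the last segment is
# consonant(s) + "a" and the preceding segment contains a vowel.
sub delete_final_schwa {
    my @segments = @_;
    return @segments if @segments < 2;
    my ($end_consonant) = ($segments[-1] =~ /^([bcdfghjklmnpqrstvwxz]+)a$/i);
    return @segments unless defined $end_consonant;
    return @segments unless $segments[-2] =~ /[aeiou]/i;
    $segments[-1] = $end_consonant;
    return @segments;
}

print join("", delete_final_schwa("ne", "paa", "la")), "\n";   # nepaal
print join("", delete_final_schwa("ka")), "\n";                # ka
```

In the module itself the original node is kept and demoted to type "alt", so the undeleted form stays available as an alternative romanization.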
|
1454 |
+
|
1455 |
+
sub best_romanized_string {
|
1456 |
+
local($this, $chart_start, $chart_end, *chart_ht, $control, $orig_char_offset, $rom_char_offset) = @_;
|
1457 |
+
|
1458 |
+
$control = "" unless defined($control);
|
1459 |
+
my $current_orig_char_offset = $orig_char_offset || 0;
|
1460 |
+
my $current_rom_char_offset = $rom_char_offset || 0;
|
1461 |
+
my $return_offset_mappings_p = ($control =~ /\breturn offset mappings\b/);
|
1462 |
+
my $result = "";
|
1463 |
+
my $start = $chart_start;
|
1464 |
+
my $end;
|
1465 |
+
my @char_offsets = ("$current_orig_char_offset:$current_rom_char_offset");
|
1466 |
+
while ($start < $chart_end) {
|
1467 |
+
$end = $this->find_end_of_rom_segment($start, $chart_end, *chart_ht);
|
1468 |
+
my $n_orig_chars_in_segment = 0;
|
1469 |
+
my $n_rom_chars_in_segment = 0;
|
1470 |
+
if ($end && ($start < $end)) {
|
1471 |
+
my @best_romanizations = $this->best_romanizations($start, $end, *chart_ht);
|
1472 |
+
my $best_romanization = (@best_romanizations) ? $best_romanizations[0] : undef;
|
1473 |
+
if (defined($best_romanization)) {
|
1474 |
+
$result .= $best_romanization;
|
1475 |
+
if ($return_offset_mappings_p) {
|
1476 |
+
$n_orig_chars_in_segment = $end-$start;
|
1477 |
+
$n_rom_chars_in_segment = $utf8->length_in_utf8_chars($best_romanization);
|
1478 |
+
}
|
1479 |
+
$start = $end;
|
1480 |
+
} else {
|
1481 |
+
my $best_romanization = $chart_ht{ORIG_CHAR}->{$start};
|
1482 |
+
$result .= $best_romanization;
|
1483 |
+
$start++;
|
1484 |
+
if ($return_offset_mappings_p) {
|
1485 |
+
$n_orig_chars_in_segment = 1;
|
1486 |
+
$n_rom_chars_in_segment = $utf8->length_in_utf8_chars($best_romanization);
|
1487 |
+
}
|
1488 |
+
}
|
1489 |
+
} else {
|
1490 |
+
my $best_romanization = $chart_ht{ORIG_CHAR}->{$start};
|
1491 |
+
$result .= $best_romanization;
|
1492 |
+
$start++;
|
1493 |
+
if ($return_offset_mappings_p) {
|
1494 |
+
$n_orig_chars_in_segment = 1;
|
1495 |
+
$n_rom_chars_in_segment = $utf8->length_in_utf8_chars($best_romanization);
|
1496 |
+
}
|
1497 |
+
}
|
1498 |
+
if ($return_offset_mappings_p) {
|
1499 |
+
my $new_orig_char_offset = $current_orig_char_offset + $n_orig_chars_in_segment;
|
1500 |
+
my $new_rom_char_offset = $current_rom_char_offset + $n_rom_chars_in_segment;
|
1501 |
+
my $offset_mapping = "$new_orig_char_offset:$new_rom_char_offset";
|
1502 |
+
push(@char_offsets, $offset_mapping);
|
1503 |
+
$current_orig_char_offset = $new_orig_char_offset;
|
1504 |
+
$current_rom_char_offset = $new_rom_char_offset;
|
1505 |
+
}
|
1506 |
+
}
|
1507 |
+
return ($result, join(",", @char_offsets), $current_orig_char_offset, $current_rom_char_offset) if $return_offset_mappings_p;
|
1508 |
+
return $result;
|
1509 |
+
}
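The offset-mapping string built by `best_romanized_string` is a comma-separated list of cumulative `orig:rom` character offsets. A minimal sketch of that bookkeeping follows, with assumptions: `offset_mappings` is a hypothetical name, each segment is given as `[number_of_original_chars, romanization]`, and `length` on decoded strings stands in for `$utf8->length_in_utf8_chars`.

```perl
use strict;
use warnings;

# Hypothetical helper: accumulate "orig_offset:rom_offset" pairs, one per
# segment boundary, starting from the given initial offsets.
sub offset_mappings {
    my ($orig_offset, $rom_offset, @segments) = @_;
    my @mappings = ("$orig_offset:$rom_offset");
    for my $segment (@segments) {
        my ($n_orig_chars, $rom) = @$segment;
        $orig_offset += $n_orig_chars;
        $rom_offset  += length($rom);
        push @mappings, "$orig_offset:$rom_offset";
    }
    return join(",", @mappings);
}

print offset_mappings(0, 0, [1, "ne"], [2, "paa"], [1, "l"]), "\n";
# prints 0:0,1:2,3:5,4:6
```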
|
1510 |
+
|
1511 |
+
sub orig_string_at_span {
|
1512 |
+
local($this, $start, $end, *chart_ht) = @_;
|
1513 |
+
|
1514 |
+
my $result = "";
|
1515 |
+
foreach $i (($start .. ($end-1))) {
|
1516 |
+
$result .= $chart_ht{ORIG_CHAR}->{$i};
|
1517 |
+
}
|
1518 |
+
return $result;
|
1519 |
+
}
|
1520 |
+
|
1521 |
+
sub find_end_of_rom_segment {
|
1522 |
+
local($this, $start, $chart_end, *chart_ht) = @_;
|
1523 |
+
|
1524 |
+
my @ends = sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}};
|
1525 |
+
my $end_index = $#ends;
|
1526 |
+
while (($end_index >= 0) && ($ends[$end_index] > $chart_end)) {
|
1527 |
+
$end_index--;
|
1528 |
+
}
|
1529 |
+
if (($end_index >= 0)
|
1530 |
+
&& defined($end = $ends[$end_index])
|
1531 |
+
&& ($start < $end)) {
|
1532 |
+
return $end;
|
1533 |
+
} else {
|
1534 |
+
return "";
|
1535 |
+
}
|
1536 |
+
}
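`find_end_of_rom_segment` picks the longest chart span that starts at a given position and does not run past the chart end. The same selection can be sketched on a plain nested hash. `longest_span_end` and the `$spans` layout (`start => { end => ... }`) are illustrative assumptions that mirror the shape of `NODES_STARTING_AND_ENDING_AT`.

```perl
use strict;
use warnings;

# Hypothetical helper: return the largest end position reachable from
# $start that stays within $chart_end, or "" if there is none.
sub longest_span_end {
    my ($spans, $start, $chart_end) = @_;
    my @ends = sort { $b <=> $a }
               grep { $_ > $start && $_ <= $chart_end }
               keys %{ $spans->{$start} || {} };
    return @ends ? $ends[0] : "";
}

my $spans = { 0 => { 1 => 1, 3 => 1, 5 => 1 } };
print longest_span_end($spans, 0, 4), "\n";    # 3
print longest_span_end($spans, 0, 10), "\n";   # 5
```

Returning an empty string rather than `undef` matches the original, whose callers test the result for truth before using it.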
|
1537 |
+
|
1538 |
+
sub best_romanizations {
|
1539 |
+
local($this, $start, $end, *chart_ht) = @_;
|
1540 |
+
|
1541 |
+
@regular_romanizations = ();
|
1542 |
+
@alt_romanizations = ();
|
1543 |
+
@backup_romanizations = ();
|
1544 |
+
|
1545 |
+
foreach $node_id (sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end}}) {
|
1546 |
+
my $type = $chart_ht{NODE_TYPE}->{$node_id};
|
1547 |
+
my $roman = $chart_ht{NODE_ROMAN}->{$node_id};
|
1548 |
+
if (! defined($roman)) {
|
1549 |
+
# ignore
|
1550 |
+
} elsif (($type eq "backup") && ! defined($backup_romanization)) {
|
1551 |
+
push(@backup_romanizations, $roman) unless $util->member($roman, @backup_romanizations);
|
1552 |
+
} elsif (($type eq "alt") && ! defined($alt_romanization)) {
|
1553 |
+
push(@alt_romanizations, $roman) unless $util->member($roman, @alt_romanizations);
|
1554 |
+
} else {
|
1555 |
+
push(@regular_romanizations, $roman) unless $util->member($roman, @regular_romanizations);
|
1556 |
+
}
|
1557 |
+
}
|
1558 |
+
@regular_alt_romanizations = sort @regular_romanizations;
|
1559 |
+
foreach $alt_romanization (sort @alt_romanizations) {
|
1560 |
+
push(@regular_alt_romanizations, $alt_romanization) unless $util->member($alt_romanization, @regular_alt_romanizations);
|
1561 |
+
}
|
1562 |
+
return @regular_alt_romanizations if @regular_alt_romanizations;
|
1563 |
+
return sort @backup_romanizations;
|
1564 |
+
}
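The tiering in `best_romanizations` (regular candidates first, then "alt", with "backup" used only when nothing else exists) can be sketched without the chart structure. The names below are hypothetical, and each node is assumed to be a hash reference with `roman` and `type` keys.

```perl
use strict;
use warnings;

# Hypothetical helper: order candidate romanizations by tier.
sub rank_romanizations {
    my @nodes = @_;
    my (@regular, @alt, @backup, %seen);
    for my $node (@nodes) {
        my $roman = $node->{roman};
        next unless defined $roman;
        my $type = defined $node->{type} ? $node->{type} : "";
        my $tier = $type eq "backup" ? "backup"
                 : $type eq "alt"    ? "alt"
                 :                     "regular";
        next if $seen{$tier}{$roman}++;          # de-duplicate within a tier
        if    ($tier eq "backup") { push @backup,  $roman }
        elsif ($tier eq "alt")    { push @alt,     $roman }
        else                      { push @regular, $roman }
    }
    my @result = sort @regular;
    my %listed = map { $_ => 1 } @result;
    push @result, grep { !$listed{$_} } sort @alt;
    return @result if @result;
    return sort @backup;                         # last resort only
}

my @ranked = rank_romanizations(
    { roman => "nepaala", type => "alt" },
    { roman => "nepaal",  type => "devanagari-with-deleted-final-schwa" },
);
print join(", ", @ranked), "\n";   # nepaal, nepaala
```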
|
1565 |
+
|
1566 |
+
sub join_alt_romanizations_for_viz {
|
1567 |
+
local($this, @list) = @_;
|
1568 |
+
|
1569 |
+
my @viz_romanizations = ();
|
1570 |
+
|
1571 |
+
foreach $alt_rom (@list) {
|
1572 |
+
if ($alt_rom eq "") {
|
1573 |
+
push(@viz_romanizations, "-");
|
1574 |
+
} else {
|
1575 |
+
push(@viz_romanizations, $alt_rom);
|
1576 |
+
}
|
1577 |
+
}
|
1578 |
+
return join(", ", @viz_romanizations);
|
1579 |
+
}
|
1580 |
+
|
1581 |
+
sub markup_orig_rom_strings {
|
1582 |
+
local($this, $chart_start, $chart_end, *ht, *chart_ht, *pinyin_ht, $last_group_id_index) = @_;
|
1583 |
+
|
1584 |
+
my $marked_up_rom = "";
|
1585 |
+
my $marked_up_orig = "";
|
1586 |
+
my $start = $chart_start;
|
1587 |
+
my $end;
|
1588 |
+
while ($start < $chart_end) {
|
1589 |
+
my $segment_start = $start;
|
1590 |
+
my $segment_end = $start+1;
|
1591 |
+
my $end = $this->find_end_of_rom_segment($start, $chart_end, *chart_ht);
|
1592 |
+
my $rom_segment = "";
|
1593 |
+
my $orig_segment = "";
|
1594 |
+
my $rom_title = "";
|
1595 |
+
my $orig_title = "";
|
1596 |
+
my $contains_alt_romanizations = 0;
|
1597 |
+
if ($end) {
|
1598 |
+
$segment_end = $end;
|
1599 |
+
my @best_romanizations = $this->best_romanizations($start, $end, *chart_ht);
|
1600 |
+
my $best_romanization = (@best_romanizations) ? $best_romanizations[0] : undef;
|
1601 |
+
if (defined($best_romanization)) {
|
1602 |
+
$rom_segment .= $best_romanization;
|
1603 |
+
$orig_segment .= $this->orig_string_at_span($start, $end, *chart_ht);
|
1604 |
+
$segment_end = $end;
|
1605 |
+
if ($#best_romanizations >= 1) {
|
1606 |
+
$rom_title .= $util->guard_html("Alternative romanizations: " . $this->join_alt_romanizations_for_viz(@best_romanizations) . "\n");
|
1607 |
+
$contains_alt_romanizations = 1;
|
1608 |
+
}
|
1609 |
+
} else {
|
1610 |
+
my $segment = $this->orig_string_at_span($start, $start+1, *chart_ht);
|
1611 |
+
$rom_segment .= $segment;
|
1612 |
+
$orig_segment .= $segment;
|
1613 |
+
$segment_end = $start+1;
|
1614 |
+
}
|
1615 |
+
$start = $segment_end;
|
1616 |
+
} else {
|
1617 |
+
$rom_segment .= $chart_ht{ORIG_CHAR}->{$start};
|
1618 |
+
$orig_segment .= $this->orig_string_at_span($start, $start+1, *chart_ht);
|
1619 |
+
$segment_end = $start+1;
|
1620 |
+
$start = $segment_end;
|
1621 |
+
}
|
1622 |
+
my $next_char = $chart_ht{ORIG_CHAR}->{$segment_end};
|
1623 |
+
my $next_char_is_combining_p = $this->char_is_combining_char($next_char, *ht);
|
1624 |
+
while ($next_char_is_combining_p
|
1625 |
+
&& ($segment_end < $chart_end)
|
1626 |
+
&& ($end = $this->find_end_of_rom_segment($segment_end, $chart_end, *chart_ht))
|
1627 |
+
&& ($end > $segment_end)
|
1628 |
+
&& (@best_romanizations = $this->best_romanizations($segment_end, $end, *chart_ht))
|
1629 |
+
&& defined($best_romanization = $best_romanizations[0])) {
|
1630 |
+
$orig_segment .= $this->orig_string_at_span($segment_end, $end, *chart_ht);
|
1631 |
+
$rom_segment .= $best_romanization;
|
1632 |
+
if ($#best_romanizations >= 1) {
|
1633 |
+
$rom_title .= $util->guard_html("Alternative romanizations: " . $this->join_alt_romanizations_for_viz(@best_romanizations) . "\n");
|
1634 |
+
$contains_alt_romanizations = 1;
|
1635 |
+
}
|
1636 |
+
$segment_end = $end;
|
1637 |
+
$start = $segment_end;
|
1638 |
+
$next_char = $chart_ht{ORIG_CHAR}->{$segment_end};
|
1639 |
+
$next_char_is_combining_p = $this->char_is_combining_char($next_char, *ht);
|
1640 |
+
}
|
1641 |
+
foreach $i (($segment_start .. ($segment_end-1))) {
|
1642 |
+
$orig_title .= "+&#x200E; &#x200E;" unless $orig_title eq "";
|
1643 |
+
my $char = $chart_ht{ORIG_CHAR}->{$i};
|
1644 |
+
my $numeric = $ht{UTF_TO_NUMERIC}->{$char};
|
1645 |
+
$numeric = "" unless defined($numeric);
|
1646 |
+
my $pic_descr = $ht{UTF_TO_PICTURE_DESCR}->{$char};
|
1647 |
+
$pic_descr = "" unless defined($pic_descr);
|
1648 |
+
if ($char =~ /^\xE4\xB7[\x80-\xBF]$/) {
|
1649 |
+
$orig_title .= "$char_name\n";
|
1650 |
+
} elsif (($char =~ /^[\xE3-\xE9][\x80-\xBF]{2,2}$/) && $chinesePM->string_contains_utf8_cjk_unified_ideograph_p($char)) {
|
1651 |
+
my $unicode = $utf8->utf8_to_unicode($char);
|
1652 |
+
$orig_title .= "CJK Unified Ideograph U+" . (uc sprintf("%04x", $unicode)) . "\n";
|
1653 |
+
$orig_title .= "Chinese: $tonal_translit\n" if $tonal_translit = $chinesePM->tonal_pinyin($char, *pinyin_ht, "");
|
1654 |
+
$orig_title .= "Number: $numeric\n" if $numeric =~ /\d/;
|
1655 |
+
} elsif ($char_name = $ht{UTF_TO_CHAR_NAME}->{$char}) {
|
1656 |
+
$orig_title .= "$char_name\n";
|
1657 |
+
$orig_title .= "Number: $numeric\n" if $numeric =~ /\d/;
|
1658 |
+
$orig_title .= "Picture: $pic_descr\n" if $pic_descr =~ /\S/;
|
1659 |
+
} else {
|
1660 |
+
my $unicode = $utf8->utf8_to_unicode($char);
|
1661 |
+
if (($unicode >= 0xAC00) && ($unicode <= 0xD7A3)) {
|
1662 |
+
$orig_title .= "Hangul syllable U+" . (uc sprintf("%04x", $unicode)) . "\n";
|
1663 |
+
} else {
|
1664 |
+
$orig_title .= "Unicode character U+" . (uc sprintf("%04x", $unicode)) . "\n";
|
1665 |
+
}
|
1666 |
+
}
|
1667 |
+
}
|
1668 |
+
(@non_ascii_roms) = ($rom_segment =~ /([\xC0-\xFF][\x80-\xBF]*)/g);
|
1669 |
+
foreach $char (@non_ascii_roms) {
|
1670 |
+
my $char_name = $ht{UTF_TO_CHAR_NAME}->{$char};
|
1671 |
+
my $unicode = $utf8->utf8_to_unicode($char);
|
1672 |
+
my $unicode_s = "U+" . (uc sprintf("%04x", $unicode));
|
1673 |
+
if ($char_name) {
|
1674 |
+
$rom_title .= "$char_name\n";
|
1675 |
+
} else {
|
1676 |
+
$rom_title .= "$unicode_s\n";
|
1677 |
+
}
|
1678 |
+
}
|
1679 |
+
$last_group_id_index++;
|
1680 |
+
$rom_title =~ s/\s*$//;
|
1681 |
+
$rom_title =~ s/\n/&#xA;/g;
|
1682 |
+
$orig_title =~ s/\s*$//;
|
1683 |
+
$orig_title =~ s/\n/&#xA;&#x200E;/g;
|
1684 |
+
$orig_title = "&#x202D;" . $orig_title . "&#x202C;";
|
1685 |
+
my $rom_title_clause = ($rom_title eq "") ? "" : " title=\"$rom_title\"";
|
1686 |
+
my $orig_title_clause = ($orig_title eq "") ? "" : " title=\"$orig_title\"";
|
1687 |
+
my $alt_rom_clause = ($contains_alt_romanizations) ? "border-bottom:1px dotted;" : "";
|
1688 |
+
$marked_up_rom .= "<span id=\"span-$last_group_id_index-1\" onmouseover=\"highlight_elems('span-$last_group_id_index','1');\" onmouseout=\"highlight_elems('span-$last_group_id_index','0');\" style=\"color:#00BB00;$alt_rom_clause\"$rom_title_clause>" . $util->guard_html($rom_segment) . "<\/span>";
|
1689 |
+
$marked_up_orig .= "<span id=\"span-$last_group_id_index-2\" onmouseover=\"highlight_elems('span-$last_group_id_index','1');\" onmouseout=\"highlight_elems('span-$last_group_id_index','0');\"$orig_title_clause>" . $util->guard_html($orig_segment) . "<\/span>";
|
1690 |
+
if (($last_char = $chart_ht{ORIG_CHAR}->{($segment_end-1)})
|
1691 |
+
&& ($last_char_name = $ht{UTF_TO_CHAR_NAME}->{$last_char})
|
1692 |
+
&& ($last_char_name =~ /^(FULLWIDTH COLON|FULLWIDTH COMMA|FULLWIDTH RIGHT PARENTHESIS|IDEOGRAPHIC COMMA|IDEOGRAPHIC FULL STOP|RIGHT CORNER BRACKET|BRAILLE PATTERN BLANK|TIBETAN MARK .*)$/)) {
|
1693 |
+
$marked_up_orig .= "<wbr>";
|
1694 |
+
$marked_up_rom .= "<wbr>";
|
1695 |
+
}
|
1696 |
+
}
|
1697 |
+
return ($marked_up_rom, $marked_up_orig, $last_group_id_index);
|
1698 |
+
}
|
1699 |
+
|
1700 |
+
sub romanizations_with_alternatives {
|
1701 |
+
local($this, *ht, *chart_ht, *pinyin_ht, $chart_start, $chart_end) = @_;
|
1702 |
+
|
1703 |
+
$chart_start = 0 unless defined($chart_start);
|
1704 |
+
$chart_end = $chart_ht{N_CHARS} unless defined($chart_end);
|
1705 |
+
my $result = "";
|
1706 |
+
my $start = $chart_start;
|
1707 |
+
my $end;
|
1708 |
+
# print STDOUT "romanizations_with_alternatives $chart_start-$chart_end\n";
|
1709 |
+
while ($start < $chart_end) {
|
1710 |
+
my $segment_start = $start;
|
1711 |
+
my $segment_end = $start+1;
|
1712 |
+
my $end = $this->find_end_of_rom_segment($start, $chart_end, *chart_ht);
|
1713 |
+
my $rom_segment = "";
|
1714 |
+
# print STDOUT " $start-$end\n";
|
1715 |
+
if ($end) {
|
1716 |
+
$segment_end = $end;
|
1717 |
+
my @best_romanizations = $this->best_romanizations($start, $end, *chart_ht);
|
1718 |
+
# print STDOUT " $start-$end @best_romanizations\n";
|
1719 |
+
if (@best_romanizations) {
|
1720 |
+
if ($#best_romanizations == 0) {
|
1721 |
+
$rom_segment .= $best_romanizations[0];
|
1722 |
+
} else {
|
1723 |
+
$rom_segment .= "{" . join("|", @best_romanizations) . "}";
|
1724 |
+
}
|
1725 |
+
$segment_end = $end;
|
1726 |
+
} else {
|
1727 |
+
my $segment = $this->orig_string_at_span($start, $start+1, *chart_ht);
|
1728 |
+
$rom_segment .= $segment;
|
1729 |
+
$segment_end = $start+1;
|
1730 |
+
}
|
1731 |
+
$start = $segment_end;
|
1732 |
+
} else {
|
1733 |
+
$rom_segment .= $chart_ht{ORIG_CHAR}->{$start};
|
1734 |
+
$segment_end = $start+1;
|
1735 |
+
$start = $segment_end;
|
1736 |
+
}
|
1737 |
+
# print STDOUT " $start-$end ** $rom_segment\n";
|
1738 |
+
$result .= $rom_segment;
|
1739 |
+
}
|
1740 |
+
return $result;
|
1741 |
+
}
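The output convention of `romanizations_with_alternatives` is that a segment with one candidate is emitted as-is, while several candidates are wrapped as `{a|b}`. A small sketch of just that formatting step, assuming each segment is a non-empty array reference of candidates and using the hypothetical name `format_alternatives`:

```perl
use strict;
use warnings;

# Hypothetical helper: join per-segment candidate lists into one string,
# marking ambiguous segments with {candidate1|candidate2|...}.
sub format_alternatives {
    my @segments = @_;
    my $result = "";
    for my $candidates (@segments) {
        if (@$candidates == 1) {
            $result .= $candidates->[0];
        } else {
            $result .= "{" . join("|", @$candidates) . "}";
        }
    }
    return $result;
}

print format_alternatives(["ne"], ["paal", "paala"]), "\n";   # ne{paal|paala}
```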
|
1742 |
+
|
1743 |
+
sub quick_romanize {
|
1744 |
+
local($this, $s, $lang_code, *ht) = @_;
|
1745 |
+
|
1746 |
+
my $result = "";
|
1747 |
+
my @chars = $utf8->split_into_utf8_characters($s, "return only chars", *empty_ht);
|
1748 |
+
while (@chars) {
|
1749 |
+
my $found_match_in_table_p = 0;
|
1750 |
+
foreach $string_length (reverse(1..4)) {
|
1751 |
+
next if ($string_length-1) > $#chars;
|
1752 |
+
$multi_char_substring = join("", @chars[0..($string_length-1)]);
|
1753 |
+
my @mappings = keys %{$ht{UTF_CHAR_MAPPING_LANG_SPEC}->{$lang_code}->{$multi_char_substring}};
|
1754 |
+
@mappings = keys %{$ht{UTF_CHAR_MAPPING}->{$multi_char_substring}} unless @mappings;
|
1755 |
+
if (@mappings) {
|
1756 |
+
my $mapping = $mappings[0];
|
1757 |
+
$result .= $mapping;
|
1758 |
+
foreach $_ ((1 .. $string_length)) {
|
1759 |
+
shift @chars;
|
1760 |
+
}
|
1761 |
+
$found_match_in_table_p = 1;
|
1762 |
+
last;
|
1763 |
+
}
|
1764 |
+
}
|
1765 |
+
unless ($found_match_in_table_p) {
|
1766 |
+
$result .= $chars[0];
|
1767 |
+
shift @chars;
|
1768 |
+
}
|
1769 |
+
}
|
1770 |
+
return $result;
|
1771 |
+
}
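`quick_romanize` is a greedy longest-match lookup: at each position it tries substrings of four, then three, two and one characters against the mapping table, and copies the character through unchanged if nothing matches. The sketch below shows the same idea under assumptions: `greedy_romanize` is a hypothetical name, the table is a flat hash reference rather than the `UTF_CHAR_MAPPING` structures, and the input is a decoded character string rather than UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: greedy longest-match transliteration.
sub greedy_romanize {
    my ($s, $table, $max_len) = @_;
    $max_len ||= 4;
    my @chars = split //, $s;
    my $result = "";
    while (@chars) {
        my $matched = 0;
        for my $len (reverse 1 .. $max_len) {
            next if $len > @chars;
            my $key = join "", @chars[0 .. $len - 1];
            next unless exists $table->{$key};
            $result .= $table->{$key};
            splice @chars, 0, $len;              # consume the matched chars
            $matched = 1;
            last;
        }
        $result .= shift @chars unless $matched; # pass through unknown char
    }
    return $result;
}

# Illustrative Greek entries: mu+pi => "b", mu => "m", pi => "p", alpha => "a"
my %table = (
    "\x{3BC}\x{3C0}" => "b",
    "\x{3BC}"        => "m",
    "\x{3C0}"        => "p",
    "\x{3B1}"        => "a",
);
print greedy_romanize("\x{3BC}\x{3C0}\x{3B1}", \%table), "\n";   # ba
print greedy_romanize("\x{3C0}\x{3B1}!", \%table), "\n";         # pa!
```

Trying the longest substring first is what lets a digraph entry win over its single-character components.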
|
1772 |
+
|
1773 |
+
sub char_is_combining_char {
|
1774 |
+
local($this, $c, *ht) = @_;
|
1775 |
+
|
1776 |
+
return 0 unless $c;
|
1777 |
+
my $category = $ht{UTF_TO_CAT}->{$c};
|
1778 |
+
return 0 unless $category;
|
1779 |
+
return $category =~ /^M/;
|
1780 |
+
}
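`char_is_combining_char` treats any character whose Unicode general category starts with `M` as combining. Outside the module, the same test can be approximated with Perl's built-in property classes. This sketch assumes decoded character strings and a hypothetical name `is_combining_mark`, whereas the original looks the category up in its own `UTF_TO_CAT` table keyed by UTF-8 byte strings.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $c is a single character in general
# category M (Mn, Mc or Me).
sub is_combining_mark {
    my ($c) = @_;
    return 0 unless defined($c) && length($c);
    return ($c =~ /^\p{M}$/) ? 1 : 0;
}

print is_combining_mark("\x{0301}"), "\n";   # 1 (COMBINING ACUTE ACCENT)
print is_combining_mark("a"), "\n";          # 0
```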
|
1781 |
+
|
1782 |
+
sub mark_up_string_for_mouse_over {
|
1783 |
+
local($this, $s, *ht, $control, *pinyin_ht) = @_;
|
1784 |
+
|
1785 |
+
$control = "" unless defined($control);
|
1786 |
+
$no_ascii_p = ($control =~ /NO-ASCII/);
|
1787 |
+
my $result = "";
|
1788 |
+
@chars = $utf8->split_into_utf8_characters($s, "return only chars", *empty_ht);
|
1789 |
+
while (@chars) {
|
1790 |
+
$char = shift @chars;
|
1791 |
+
$numeric = $ht{UTF_TO_NUMERIC}->{$char};
|
1792 |
+
$numeric = "" unless defined($numeric);
|
1793 |
+
$pic_descr = $ht{UTF_TO_PICTURE_DESCR}->{$char};
|
1794 |
+
$pic_descr = "" unless defined($pic_descr);
|
1795 |
+
$next_char = ($#chars >= 0) ? $chars[0] : "";
|
1796 |
+
$next_char_is_combining_p = $this->char_is_combining_char($next_char, *ht);
|
1797 |
+
if ($no_ascii_p
|
1798 |
+
&& ($char =~ /^[\x00-\x7F]*$/)
|
1799 |
+
&& ! $next_char_is_combining_p) {
|
1800 |
+
$result .= $util->guard_html($char);
|
1801 |
+
} elsif (($char =~ /^[\xE3-\xE9][\x80-\xBF]{2,2}$/) && $chinesePM->string_contains_utf8_cjk_unified_ideograph_p($char)) {
|
1802 |
+
$unicode = $utf8->utf8_to_unicode($char);
|
1803 |
+
$title = "CJK Unified Ideograph U+" . (uc sprintf("%04x", $unicode));
|
1804 |
+
$title .= "&#xA;Chinese: $tonal_translit" if $tonal_translit = $chinesePM->tonal_pinyin($char, *pinyin_ht, "");
|
1805 |
+
$title .= "\nNumber: $numeric" if $numeric =~ /\d/;
|
1806 |
+
$result .= "<span title=\"$title\">" . $util->guard_html($char) . "<\/span>";
|
1807 |
+
} elsif ($char_name = $ht{UTF_TO_CHAR_NAME}->{$char}) {
|
1808 |
+
$title = $char_name;
|
1809 |
+
$title .= "\nNumber: $numeric" if $numeric =~ /\d/;
|
1810 |
+
$title .= "\nPicture: $pic_descr" if $pic_descr =~ /\S/;
|
1811 |
+
$char_plus = $char;
|
1812 |
+
while ($next_char_is_combining_p) {
|
1813 |
+
# combining marks (Mn: non-spacing, Mc: spacing combining, Me: enclosing)
|
1814 |
+
$next_char_name = $ht{UTF_TO_CHAR_NAME}->{$next_char};
|
1815 |
+
$title .= "\n+ $next_char_name";
|
1816 |
+
$char = shift @chars;
|
1817 |
+
$char_plus .= $char;
|
1818 |
+
$next_char = ($#chars >= 0) ? $chars[0] : "";
|
1819 |
+
$next_char_is_combining_p = $this->char_is_combining_char($next_char, *ht);
|
1820 |
+
}
|
1821 |
+
$result .= "<span title=\"$title\">" . $util->guard_html($char_plus) . "<\/span>";
|
1822 |
+
$result .= "<wbr>" if $char_name =~ /^(FULLWIDTH COLON|FULLWIDTH COMMA|FULLWIDTH RIGHT PARENTHESIS|IDEOGRAPHIC COMMA|IDEOGRAPHIC FULL STOP|RIGHT CORNER BRACKET)$/;
|
1823 |
+
} elsif (($unicode = $utf8->utf8_to_unicode($char))
|
1824 |
+
&& ($unicode >= 0xAC00) && ($unicode <= 0xD7A3)) {
|
1825 |
+
$title = "Hangul syllable U+" . (uc sprintf("%04x", $unicode));
|
1826 |
+
$result .= "<span title=\"$title\">" . $util->guard_html($char) . "<\/span>";
|
1827 |
+
} else {
|
1828 |
+
$result .= $util->guard_html($char);
|
1829 |
+
}
|
1830 |
+
}
|
1831 |
+
return $result;
|
1832 |
+
}
|
1833 |
+
|
1834 |
+
sub romanize_char_at_position_incl_multi {
|
1835 |
+
local($this, $i, $lang_code, $output_style, *ht, *chart_ht) = @_;
|
1836 |
+
|
1837 |
+
my $char = $chart_ht{ORIG_CHAR}->{$i};
|
1838 |
+
return "" unless defined($char);
|
1839 |
+
my @mappings = keys %{$ht{UTF_CHAR_MAPPING_LANG_SPEC}->{$lang_code}->{$char}};
|
1840 |
+
return $mappings[0] if @mappings;
|
1841 |
+
@mappings = keys %{$ht{UTF_CHAR_MAPPING}->{$char}};
|
1842 |
+
return $mappings[0] if @mappings;
|
1843 |
+
return $this->romanize_char_at_position($i, $lang_code, $output_style, *ht, *chart_ht);
|
1844 |
+
}
|
1845 |
+
|
1846 |
+
sub romanize_char_at_position {
|
1847 |
+
local($this, $i, $lang_code, $output_style, *ht, *chart_ht) = @_;
|
1848 |
+
|
1849 |
+
my $char = $chart_ht{ORIG_CHAR}->{$i};
|
1850 |
+
return "" unless defined($char);
|
1851 |
+
return $char if $char =~ /^[\x00-\x7F]$/; # ASCII
|
1852 |
+
my $romanization = $ht{UTF_TO_CHAR_ROMANIZATION}->{$char};
|
1853 |
+
return $romanization if $romanization;
|
1854 |
+
my $char_name = $chart_ht{CHAR_NAME}->{$i};
|
1855 |
+
$romanization = $this->romanize_charname($char_name, $lang_code, $output_style, *ht, $char);
|
1856 |
+
$ht{SUSPICIOUS_ROMANIZATION}->{$char_name}->{$romanization}
|
1857 |
+
= ($ht{SUSPICIOUS_ROMANIZATION}->{$char_name}->{$romanization} || 0) + 1
|
1858 |
+
unless (length($romanization) < 4)
|
1859 |
+
|| ($romanization =~ /\s/)
|
1860 |
+
|| ($romanization =~ /^[bcdfghjklmnpqrstvwxyz]{2,3}[aeiou]-$/) # Khmer ngo-/nyo-/pho- OK
|
1861 |
+
|| ($romanization =~ /^[bcdfghjklmnpqrstvwxyz]{2,2}[aeiougw][aeiou]{1,2}$/) # Canadian, Ethiopic syllable OK
|
1862 |
+
|| ($romanization =~ /^(allah|bbux|nyaa|nnya|quuv|rrep|shch|shur|syrx)$/i) # Arabic; Yi; Ethiopic syllable nyaa; Cyrillic letter shcha
|
1863 |
+
|| (($char_name =~ /^(YI SYLLABLE|VAI SYLLABLE|ETHIOPIC SYLLABLE|CANADIAN SYLLABICS|CANADIAN SYLLABICS CARRIER)\s+(\S+)$/) && (length($romanization) <= 5));
|
1864 |
+
# print STDERR "romanize_char_at_position $i $char_name :: $romanization\n" if $char_name =~ /middle/i;
|
1865 |
+
return $romanization;
|
1866 |
+
}
|
1867 |
+
|
1868 |
+
sub romanize_charname {
|
1869 |
+
local($this, $char_name, $lang_code, $output_style, *ht, $char) = @_;
|
1870 |
+
|
1871 |
+
my $cached_result = $ht{ROMANIZE_CHARNAME}->{$char_name}->{$lang_code}->{$output_style};
|
1872 |
+
# print STDERR "(C) romanize_charname($char_name): $cached_result\n" if $cached_result && ($char_name =~ /middle/i);
|
1873 |
+
return $cached_result if defined($cached_result);
|
1874 |
+
$orig_char_name = $char_name;
|
1875 |
+
$char_name =~ s/^.* LETTER\s+([A-Z]+)-\d+$/$1/; # HENTAIGANA LETTER A-3
|
1876 |
+
$char_name =~ s/^.* LETTER\s+//;
|
1877 |
+
$char_name =~ s/^.* SYLLABLE\s+B\d\d\d\s+//; # Linear B syllables
|
1878 |
+
$char_name =~ s/^.* SYLLABLE\s+//;
|
1879 |
+
$char_name =~ s/^.* SYLLABICS\s+//;
|
1880 |
+
$char_name =~ s/^.* LIGATURE\s+//;
|
1881 |
+
$char_name =~ s/^.* VOWEL SIGN\s+//;
|
1882 |
+
$char_name =~ s/^.* CONSONANT SIGN\s+//;
|
1883 |
+
$char_name =~ s/^.* CONSONANT\s+//;
|
1884 |
+
$char_name =~ s/^.* VOWEL\s+//;
|
1885 |
+
$char_name =~ s/ WITH .*$//;
|
1886 |
+
$char_name =~ s/ WITHOUT .*$//;
|
1887 |
+
$char_name =~ s/\s+(ABOVE|AGUNG|BAR|BARREE|BELOW|CEDILLA|CEREK|DIGRAPH|DOACHASHMEE|FINAL FORM|GHUNNA|GOAL|INITIAL FORM|ISOLATED FORM|KAWI|LELET|LELET RASWADI|LONSUM|MAHAPRANA|MEDIAL FORM|MURDA|MURDA MAHAPRANA|REVERSED|ROTUNDA|SASAK|SUNG|TAM|TEDUNG|TYPE ONE|TYPE TWO|WOLOSO)\s*$//;
|
1888 |
+
$char_name =~ s/^([A-Z]+)\d+$/$1/; # Linear B syllables etc.
|
1889 |
+
foreach $_ ((1 .. 3)) {
|
1890 |
+
$char_name =~ s/^.*\b(?:ABKHASIAN|ACADEMY|AFRICAN|AIVILIK|AITON|AKHMIMIC|ALEUT|ALI GALI|ALPAPRAANA|ALTERNATE|ALTERNATIVE|AMBA|ARABIC|ARCHAIC|ASPIRATED|ATHAPASCAN|BASELINE|BLACKLETTER|BARRED|BASHKIR|BERBER|BHATTIPROLU|BIBLE-CREE|BIG|BINOCULAR|BLACKFOOT|BLENDED|BOTTOM|BROAD|BROKEN|CANDRA|CAPITAL|CARRIER|CHILLU|CLOSE|CLOSED|COPTIC|CROSSED|CRYPTOGRAMMIC|CURLED|CURLY|CYRILLIC|DANTAJA|DENTAL|DIALECT-P|DIAERESIZED|DOTLESS|DOUBLE|DOUBLE-STRUCK|EASTERN PWO KAREN|EGYPTOLOGICAL|FARSI|FINAL|FLATTENED|GLOTTAL|GREAT|GREEK|HALF|HIGH|INITIAL|INSULAR|INVERTED|IOTIFIED|JONA|KANTAJA|KASHMIRI|KHAKASSIAN|KHAMTI|KHANDA|KINNA|KIRGHIZ|KOMI|L-SHAPED|LATINATE|LITTLE|LONG|LONG-LEGGED|LOOPED|LOW|MAHAAPRAANA|MALAYALAM|MANCHU|MANDAILING|MATHEMATICAL|MEDIAL|MIDDLE-WELSH|MON|MONOCULAR|MOOSE-CREE|MULTIOCULAR|MUURDHAJA|N-CREE|NARROW|NASKAPI|NDOLE|NEUTRAL|NIKOLSBURG|NORTHERN|NUBIAN|NUNAVIK|NUNAVUT|OJIBWAY|OLD|OPEN|ORKHON|OVERLONG|PALI|PERSIAN|PHARYNGEAL|PRISHTHAMATRA|R-CREE|REDUPLICATION|REVERSED|ROMANIAN|ROUND|ROUNDED|RUDIMENTA|RUMAI PALAUNG|SANSKRIT|SANYAKA|SARA|SAYISI|SCRIPT|SEBATBEIT|SEMISOFT|SGAW KAREN|SHAN|SHARP|SHWE PALAUNG|SHORT|SIBE|SIDEWAYS|SIMALUNGUN|SMALL|SOGDIAN|SOFT|SOUTH-SLAVEY|SOUTHERN|SPIDERY|STIRRUP|STRAIGHT|STRETCHED|SUBSCRIPT|SWASH|TAI LAING|TAILED|TAILLESS|TAALUJA|TH-CREE|TALL|THREE-LEGGED|TURNED|TODO|TOP|TROKUTASTI|TUAREG|UKRAINIAN|UNBLENDED|VISIGOTHIC|VOCALIC|VOICED|VOICELESS|VOLAPUK|WAVY|WESTERN PWO KAREN|WEST-CREE|WESTERN|WIDE|WOODS-CREE|Y-CREE|YENISEI|YIDDISH)\s+//;
|
1891 |
+
}
|
1892 |
+
$char_name =~ s/\s+(ABOVE|AGUNG|BAR|BARREE|BELOW|CEDILLA|CEREK|DIGRAPH|DOACHASHMEE|FINAL FORM|GHUNNA|GOAL|INITIAL FORM|ISOLATED FORM|KAWI|LELET|LELET RASWADI|LONSUM|MAHAPRANA|MEDIAL FORM|MURDA|MURDA MAHAPRANA|REVERSED|ROTUNDA|SASAK|SUNG|TAM|TEDUNG|TYPE ONE|TYPE TWO|WOLOSO)\s*$//;
|
1893 |
+
if ($char_name =~ /THAI CHARACTER/) {
|
1894 |
+
$char_name =~ s/^THAI CHARACTER\s+//;
|
1895 |
+
if ($char =~ /^\xE0\xB8[\x81-\xAE]/) {
|
1896 |
+
# Thai consonants
|
1897 |
+
$char_name =~ s/^([^AEIOU]*).*/$1/i;
|
1898 |
+
} elsif ($char_name =~ /^SARA [AEIOU]/) {
|
1899 |
+
# Thai vowels
|
1900 |
+
$char_name =~ s/^SARA\s+//;
|
1901 |
+
} else {
|
1902 |
+
$char_name = $char;
|
1903 |
+
}
|
1904 |
+
}
|
1905 |
+
if ($orig_char_name =~ /(HIRAGANA LETTER|KATAKANA LETTER|SYLLABLE|LIGATURE)/) {
|
1906 |
+
$char_name = lc $char_name;
|
1907 |
+
} elsif ($char_name =~ /\b(ANUSVARA|ANUSVARAYA|NIKAHIT|SIGN BINDI|TIPPI)\b/) {
|
1908 |
+
$char_name = "+m";
|
1909 |
+
} elsif ($char_name =~ /\bSCHWA\b/) {
|
1910 |
+
$char_name = "e";
|
1911 |
+
} elsif ($char_name =~ /\bIOTA\b/) {
|
1912 |
+
$char_name = "i";
|
1913 |
+
} elsif ($char_name =~ /\s/) {
|
1914 |
+
} elsif ($orig_char_name =~ /KHMER LETTER/) {
|
1915 |
+
$char_name .= "-";
|
1916 |
+
} elsif ($orig_char_name =~ /CHEROKEE LETTER/) {
|
1917 |
+
# use whole letter as is
|
1918 |
+
} elsif ($orig_char_name =~ /KHMER INDEPENDENT VOWEL/) {
|
1919 |
+
$char_name =~ s/q//;
|
1920 |
+
} elsif ($orig_char_name =~ /LETTER/) {
|
1921 |
+
$char_name =~ s/^[AEIOU]+([^AEIOU]+)$/$1/i;
|
1922 |
+
$char_name =~ s/^([^-AEIOUY]+)[AEIOU].*/$1/i;
|
1923 |
+
$char_name =~ s/^(Y)[AEIOU].*/$1/i if $orig_char_name =~ /\b(?:BENGALI|DEVANAGARI|GURMUKHI|GUJARATI|KANNADA|MALAYALAM|MODI|MYANMAR|ORIYA|TAMIL|TELUGU|TIBETAN)\b.*\bLETTER YA\b/;
|
1924 |
+
$char_name =~ s/^(Y[AEIOU]+)[^AEIOU].*$/$1/i;
|
1925 |
+
$char_name =~ s/^([AEIOU]+)[^AEIOU]+[AEIOU].*/$1/i;
|
1926 |
+
}
|
1927 |
+
|
1928 |
+
my $result = ($orig_char_name =~ /\bCAPITAL\b/) ? (uc $char_name) : (lc $char_name);
|
1929 |
+
# print STDERR "(R) romanize_charname($orig_char_name): $result\n" if $orig_char_name =~ /middle/i;
|
1930 |
+
$ht{ROMANIZE_CHARNAME}->{$orig_char_name}->{$lang_code}->{$output_style} = $result; # cache under the original name, which is the lookup key above
|
1931 |
+
return $result;
|
1932 |
+
}
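The routine above derives a romanization from a Unicode character name by stripping descriptive prefixes and suffixes and then trimming what is left. The following is a heavily reduced sketch of that idea, keeping only three of the rules (the `LETTER` prefix, the `WITH ...` suffix, and the cut at the first vowel). The name `romanize_from_name_sketch` is hypothetical, and the sketch ignores script-specific cases such as Thai, Khmer, and kana.

```perl
use strict;
use warnings;

# Hypothetical, reduced version of the name-based fallback: keep the part
# after "LETTER", cut it at the first vowel, and case it by CAPITAL.
sub romanize_from_name_sketch {
    my ($orig_name) = @_;
    my $name = $orig_name;
    $name =~ s/^.* LETTER\s+//;                # "CYRILLIC CAPITAL LETTER ZHE" -> "ZHE"
    $name =~ s/ WITH .*$//;                    # drop diacritic descriptions
    $name =~ s/^([^-AEIOUY]+)[AEIOU].*/$1/i;   # "ZHE" -> "ZH"
    return ($orig_name =~ /\bCAPITAL\b/) ? uc($name) : lc($name);
}

print romanize_from_name_sketch("CYRILLIC CAPITAL LETTER ZHE"), "\n";   # prints ZH
print romanize_from_name_sketch("ARMENIAN SMALL LETTER BEN"), "\n";     # prints b
```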
|
1933 |
+
|
1934 |
+
sub assemble_numbers_in_chart {
|
1935 |
+
local($this, *chart_ht, $line_number) = @_;
|
1936 |
+
|
1937 |
+
foreach $start (sort { $a <=> $b } keys %{$chart_ht{COMPLEX_NUMERIC_START_END}}) {
|
1938 |
+
my $end = $chart_ht{COMPLEX_NUMERIC_START_END}->{$start};
|
1939 |
+
my @numbers = ();
|
1940 |
+
foreach $i (($start .. ($end-1))) {
|
1941 |
+
my $orig_char = $chart_ht{ORIG_CHAR}->{$i};
|
1942 |
+
my $node_id = $this->get_node_for_span_with_slot($i, $i+1, "numeric-value", *chart_ht);
|
1943 |
+
if (defined($node_id)) {
|
1944 |
+
my $number = $chart_ht{NODE_ROMAN}->{$node_id};
|
1945 |
+
if (defined($number)) {
|
1946 |
+
push(@numbers, $number);
|
1947 |
+
} elsif ($orig_char =~ /^[.,]$/) { # decimal point, comma separator
|
1948 |
+
push(@numbers, $orig_char);
|
1949 |
+
} else {
|
1950 |
+
print STDERR "Found no romanization for node_id $node_id ($i-" . ($i+1) . ") in assemble_numbers_in_chart\n" if $verbosePM;
|
1951 |
+
}
|
1952 |
+
} else {
|
1953 |
+
print STDERR "Found no node_id for span $i-" . ($i+1) . " in assemble_numbers_in_chart\n" if $verbosePM;
|
1954 |
+
}
|
1955 |
+
}
|
1956 |
+
my $complex_number = $this->assemble_number(join("\xC2\xB7", @numbers), $line_number);
|
1957 |
+
# print STDERR "assemble_numbers_in_chart l.$line_number $start-$end $complex_number (@numbers)\n";
|
1958 |
+
$this->add_node($complex_number, $start, $end, *chart_ht, "", "complex-number");
|
1959 |
+
}
|
1960 |
+
}
|
1961 |
+
|
1962 |
+
sub assemble_number {
|
1963 |
+
local($this, $s, $line_number) = @_;
|
1964 |
+
# e.g. 10 9 100 7 10 8 = 1978
|
1965 |
+
|
1966 |
+
my $middot = "\xC2\xB7";
|
1967 |
+
my @tokens = split(/$middot/, $s); # middle dot U+00B7
|
1968 |
+
my $i = 0;
|
1969 |
+
my @orig_tokens = @tokens;
|
1970 |
+
|
1971 |
+
# assemble single digit numbers, e.g. 1 7 5 -> 175
|
1972 |
+
while ($i < $#tokens) {
|
1973 |
+
if ($tokens[$i] =~ /^\d$/) {
|
1974 |
+
my $j = $i+1;
|
1975 |
+
while (($j <= $#tokens) && ($tokens[$j] =~ /^[0-9.,]$/)) {
|
1976 |
+
$j++;
|
1977 |
+
}
|
1978 |
+
$j--;
|
1979 |
+
if ($j>$i) {
|
1980 |
+
my $new_token = join("", @tokens[$i .. $j]);
|
1981 |
+
$new_token =~ s/,//g;
|
1982 |
+
splice(@tokens, $i, $j-$i+1, $new_token);
|
1983 |
+
}
|
1984 |
+
}
|
1985 |
+
$i++;
|
1986 |
+
}
|
1987 |
+
|
1988 |
+
foreach $power ((10, 100, 1000, 10000, 100000, 1000000, 100000000, 1000000000, 1000000000000)) {
|
1989 |
+
for (my $i=0; $i <= $#tokens; $i++) {
|
1990 |
+
if ($tokens[$i] == $power) {
|
1991 |
+
if (($i > 0) && ($tokens[($i-1)] < $power)) {
|
1992 |
+
splice(@tokens, $i-1, 2, ($tokens[($i-1)] * $tokens[$i]));
|
1993 |
+
$i--;
|
1994 |
+
if (($i < $#tokens) && ($tokens[($i+1)] < $power)) {
|
1995 |
+
splice(@tokens, $i, 2, ($tokens[$i] + $tokens[($i+1)]));
|
1996 |
+
$i--;
|
1997 |
+
}
|
1998 |
+
}
|
1999 |
+
}
|
2000 |
+
# 400 30 (e.g. Egyptian)
|
2001 |
+
my $gen_pattern = $power;
|
2002 |
+
$gen_pattern =~ s/^1/\[1-9\]/;
|
2003 |
+
if (($tokens[$i] =~ /^$gen_pattern$/) && ($i < $#tokens) && ($tokens[($i+1)] < $power)) {
|
2004 |
+
splice(@tokens, $i, 2, ($tokens[$i] + $tokens[($i+1)]));
|
2005 |
+
$i--;
|
2006 |
+
}
|
2007 |
+
}
|
2008 |
+
last if $#tokens == 0;
|
2009 |
+
}
|
2010 |
+
my $result = join($middot, @tokens);
|
2011 |
+
if ($verbosePM) {
|
2012 |
+
my $logfile = "/nfs/isd/ulf/cgi-mt/amr-tmp/uroman-number-log.txt";
|
2013 |
+
$util->append_to_file($logfile, "$s -> $result\n") if -r $logfile;
|
2014 |
+
# print STDERR " assemble number l.$line_number @orig_tokens -> $result\n" if $line_number == 43;
|
2015 |
+
}
|
2016 |
+
return $result;
|
2017 |
+
}
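The power loop above combines a multiplier with a following power of ten, then adds a smaller remainder, as in the comment's example `10 9 100 7 10 8 = 1978`. Below is a condensed sketch of just that loop with lexical variables. The name `assemble_number_sketch` is hypothetical, the list of powers is shortened, and an explicit `$i >= 0` guard is added as an assumption. The digit-joining pre-pass and the logging are left out.

```perl
use strict;
use warnings;

# Hypothetical condensed version of the power loop in assemble_number.
sub assemble_number_sketch {
    my @tokens = @_;
    foreach my $power (10, 100, 1000) {
        for (my $i = 0; $i <= $#tokens; $i++) {
            if ($tokens[$i] == $power && $i > 0 && $tokens[$i-1] < $power) {
                splice(@tokens, $i-1, 2, $tokens[$i-1] * $tokens[$i]);    # 7 10 -> 70
                $i--;
                if ($i < $#tokens && $tokens[$i+1] < $power) {
                    splice(@tokens, $i, 2, $tokens[$i] + $tokens[$i+1]);  # 70 8 -> 78
                    $i--;
                }
            }
            # a single-digit multiple of the power followed by a smaller number, e.g. 400 30
            (my $pattern = $power) =~ s/^1/[1-9]/;
            if ($i >= 0 && $tokens[$i] =~ /^$pattern$/ && $i < $#tokens && $tokens[$i+1] < $power) {
                splice(@tokens, $i, 2, $tokens[$i] + $tokens[$i+1]);
                $i--;
            }
        }
        last if $#tokens == 0;
    }
    return @tokens;
}

print join(" ", assemble_number_sketch(10, 9, 100, 7, 10, 8)), "\n";   # prints 1978
```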
|
2018 |
+
|
2019 |
+
1;
|
2020 |
+
|
uroman/lib/NLP/UTF8.pm
ADDED
@@ -0,0 +1,1404 @@
|
1 |
+
################################################################
|
2 |
+
# #
|
3 |
+
# UTF8 #
|
4 |
+
# #
|
5 |
+
################################################################
|
6 |
+
|
7 |
+
package NLP::UTF8;
|
8 |
+
|
9 |
+
use NLP::utilities;
|
10 |
+
$util = NLP::utilities;
|
11 |
+
|
12 |
+
%empty_ht = ();
|
13 |
+
|
14 |
+
sub new {
|
15 |
+
local($caller) = @_;
|
16 |
+
|
17 |
+
my $object = {};
|
18 |
+
my $class = ref( $caller ) || $caller;
|
19 |
+
bless($object, $class);
|
20 |
+
return $object;
|
21 |
+
}
|
22 |
+
|
23 |
+
sub unicode_string2string {
|
24 |
+
# input: string that might contain unicode sequences such as "U+0627"
|
25 |
+
# output: string in pure utf-8
|
26 |
+
local($caller,$s) = @_;
|
27 |
+
|
28 |
+
my $pre;
|
29 |
+
my $unicode;
|
30 |
+
my $post;
|
31 |
+
my $r1;
|
32 |
+
my $r2;
|
33 |
+
my $r3;
|
34 |
+
|
35 |
+
($pre,$unicode,$post) = ($s =~ /^(.*)(?:U\+|\\u)([0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f])(.*)$/);
|
36 |
+
return $s unless defined($post);
|
37 |
+
$r1 = $caller->unicode_string2string($pre);
|
38 |
+
$r2 = $caller->unicode_hex_string2string($unicode);
|
39 |
+
$r3 = $caller->unicode_string2string($post);
|
40 |
+
$result = $r1 . $r2 . $r3;
|
41 |
+
return $result;
|
42 |
+
}
|
43 |
+
|
44 |
+
sub unicode_hex_string2string {
|
45 |
+
# input: "0627" (interpreted as hex code)
|
46 |
+
# output: utf-8 string for Arabic letter alef
|
47 |
+
local($caller,$unicode) = @_;
|
48 |
+
return "" unless defined($unicode);
|
49 |
+
my $d = hex($unicode);
|
50 |
+
return $caller->unicode2string($d);
|
51 |
+
}
|
52 |
+
|
53 |
+
sub unicode2string {
|
54 |
+
# input: non-neg integer, e.g. 0x627
|
55 |
+
# output: utf-8 string for Arabic letter alef
|
56 |
+
local($caller,$d) = @_;
|
57 |
+
return "" unless defined($d) && $d >= 0;
|
58 |
+
return sprintf("%c",$d) if $d <= 0x7F;
|
59 |
+
|
60 |
+
my $lastbyte1 = ($d & 0x3F) | 0x80;
|
61 |
+
$d >>= 6;
|
62 |
+
return sprintf("%c%c",$d | 0xC0, $lastbyte1) if $d <= 0x1F;
|
63 |
+
|
64 |
+
my $lastbyte2 = ($d & 0x3F) | 0x80;
|
65 |
+
$d >>= 6;
|
66 |
+
return sprintf("%c%c%c",$d | 0xE0, $lastbyte2, $lastbyte1) if $d <= 0xF;
|
67 |
+
|
68 |
+
my $lastbyte3 = ($d & 0x3F) | 0x80;
|
69 |
+
$d >>= 6;
|
70 |
+
return sprintf("%c%c%c%c",$d | 0xF0, $lastbyte3, $lastbyte2, $lastbyte1) if $d <= 0x7;
|
71 |
+
|
72 |
+
my $lastbyte4 = ($d & 0x3F) | 0x80;
|
73 |
+
$d >>= 6;
|
74 |
+
return sprintf("%c%c%c%c%c",$d | 0xF8, $lastbyte4, $lastbyte3, $lastbyte2, $lastbyte1) if $d <= 0x3;
|
75 |
+
|
76 |
+
my $lastbyte5 = ($d & 0x3F) | 0x80;
|
77 |
+
$d >>= 6;
|
78 |
+
return sprintf("%c%c%c%c%c%c",$d | 0xFC, $lastbyte5, $lastbyte4, $lastbyte3, $lastbyte2, $lastbyte1) if $d <= 0x1;
|
79 |
+
return ""; # bad input
|
80 |
+
}
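The function above builds the UTF-8 byte sequence for a code point by peeling off six bits at a time. A compact sketch of the same bit layout, limited to the one- to four-byte forms that cover U+0000 to U+10FFFF, might look as follows. The name `codepoint_to_utf8_sketch` is hypothetical, and the upper bound is an assumption that differs from the original, which also emits five- and six-byte sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a code point as a UTF-8 byte string.
sub codepoint_to_utf8_sketch {
    my ($d) = @_;
    return "" unless defined($d) && $d >= 0 && $d <= 0x10FFFF;
    return chr($d) if $d <= 0x7F;                                    # 0xxxxxxx
    return pack("C2", 0xC0 | ($d >> 6), 0x80 | ($d & 0x3F))          # 110xxxxx 10xxxxxx
        if $d <= 0x7FF;
    return pack("C3", 0xE0 | ($d >> 12), 0x80 | (($d >> 6) & 0x3F),  # 1110xxxx + 2 continuation bytes
                0x80 | ($d & 0x3F))
        if $d <= 0xFFFF;
    return pack("C4", 0xF0 | ($d >> 18), 0x80 | (($d >> 12) & 0x3F), # 11110xxx + 3 continuation bytes
                0x80 | (($d >> 6) & 0x3F), 0x80 | ($d & 0x3F));
}

# Arabic letter alef, U+0627, encodes as the two bytes D8 A7.
printf("%vX\n", codepoint_to_utf8_sketch(0x627));   # prints D8.A7
```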
|
81 |
+
|
82 |
+
sub html2utf8 {
|
83 |
+
local($caller, $string) = @_;
|
84 |
+
|
85 |
+
return $string unless $string =~ /\&\#\d{3,5};/;
|
86 |
+
|
87 |
+
my $prev = "";
|
88 |
+
my $s = $string;
|
89 |
+
while ($s ne $prev) {
|
90 |
+
$prev = $s;
|
91 |
+
($pre,$d,$post) = ($s =~ /^(.*)\&\#(\d+);(.*)$/);
|
92 |
+
if (defined($d) && ((($d >= 160) && ($d <= 255))
|
93 |
+
|| (($d >= 1500) && ($d <= 1699))
|
94 |
+
|| (($d >= 19968) && ($d <= 40879)))) {
|
95 |
+
$html_code = "\&\#" . $d . ";";
|
96 |
+
$utf8_code = $caller->unicode2string($d);
|
97 |
+
$s =~ s/$html_code/$utf8_code/;
|
98 |
+
}
|
99 |
+
}
|
100 |
+
return $s;
|
101 |
+
}
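The function above replaces decimal character references such as `&#233;` with UTF-8 bytes, but only for a few code point ranges. As a simplified sketch of the general idea, the substitution below converts every decimal or hexadecimal numeric reference in one pass. The name `decode_numeric_refs_sketch` is hypothetical, it does not apply the range restrictions of the original, and it uses the built-in `utf8::encode` instead of the module's own `unicode2string`.

```perl
use strict;
use warnings;

# Hypothetical helper: turn &#NNN; and &#xHHH; into UTF-8 byte sequences.
sub decode_numeric_refs_sketch {
    my ($s) = @_;
    $s =~ s/&#(\d+);/my $c = chr($1); utf8::encode($c); $c/ge;
    $s =~ s/&#x([0-9a-fA-F]+);/my $c = chr(hex($1)); utf8::encode($c); $c/ge;
    return $s;
}

printf("%vX\n", decode_numeric_refs_sketch("&#x627;"));   # prints D8.A7
```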
|
102 |
+
|
103 |
+
sub xhtml2utf8 {
|
104 |
+
local($caller, $string) = @_;
|
105 |
+
|
106 |
+
return $string unless $string =~ /\&\#x[0-9a-fA-F]{2,5};/;
|
107 |
+
|
108 |
+
my $prev = "";
|
109 |
+
my $s = $string;
|
110 |
+
while ($s ne $prev) {
|
111 |
+
$prev = $s;
|
112 |
+
if (($pre, $html_code, $x, $post) = ($s =~ /^(.*)(\&\#x([0-9a-fA-F]{2,5});)(.*)$/)) {
|
113 |
+
$utf8_code = $caller->unicode_hex_string2string($x);
|
114 |
+
$s =~ s/$html_code/$utf8_code/;
|
115 |
+
}
|
116 |
+
}
|
117 |
+
return $s;
|
118 |
+
}
|
119 |
+
|
120 |
+
sub utf8_marker {
|
121 |
+
return sprintf("%c%c%c\n", 0xEF, 0xBB, 0xBF);
|
122 |
+
}
|
123 |
+
|
124 |
+
sub enforcer {
|
125 |
+
# input: string that might not conform to utf-8
|
126 |
+
# output: string in pure utf-8, with a few "smart replacements" and possibly "?"
|
127 |
+
local($caller,$s,$no_repair) = @_;
|
128 |
+
|
129 |
+
my $ascii;
|
130 |
+
my $utf8;
|
131 |
+
my $rest;
|
132 |
+
|
133 |
+
return $s if $s =~ /^[\x00-\x7F]*$/;
|
134 |
+
|
135 |
+
$no_repair = 0 unless defined($no_repair);
|
136 |
+
$orig = $s;
|
137 |
+
$result = "";
|
138 |
+
|
139 |
+
while ($s ne "") {
|
140 |
+
($ascii,$rest) = ($s =~ /^([\x00-\x7F]+)(.*)$/);
|
141 |
+
if (defined($ascii)) {
|
142 |
+
$result .= $ascii;
|
143 |
+
$s = $rest;
|
144 |
+
next;
|
145 |
+
}
|
146 |
+
($utf8,$rest) = ($s =~ /^([\xC0-\xDF][\x80-\xBF])(.*)$/);
|
147 |
+
($utf8,$rest) = ($s =~ /^([\xE0-\xEF][\x80-\xBF][\x80-\xBF])(.*)$/)
|
148 |
+
unless defined($rest);
|
149 |
+
($utf8,$rest) = ($s =~ /^([\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])(.*)$/)
|
150 |
+
unless defined($rest);
|
151 |
+
($utf8,$rest) = ($s =~ /^([\xF8-\xFB][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF])(.*)$/)
|
152 |
+
unless defined($rest);
|
153 |
+
if (defined($utf8)) {
|
154 |
+
$result .= $utf8;
|
155 |
+
$s = $rest;
|
156 |
+
next;
|
157 |
+
}
|
158 |
+
($c,$rest) = ($s =~ /^(.)(.*)$/);
|
159 |
+
if (defined($c)) {
|
160 |
+
if ($no_repair) { $result .= "?"; }
|
161 |
+
elsif ($c =~ /\x85/) { $result .= "..."; }
|
162 |
+
elsif ($c =~ /\x91/) { $result .= "'"; }
|
163 |
+
elsif ($c =~ /\x92/) { $result .= "'"; }
|
164 |
+
elsif ($c =~ /\x93/) { $result .= $caller->unicode2string(0x201C); }
|
165 |
+
elsif ($c =~ /\x94/) { $result .= $caller->unicode2string(0x201D); }
|
166 |
+
elsif ($c =~ /[\xC0-\xFF]/) {
|
167 |
+
$c2 = $c;
|
168 |
+
$c2 =~ tr/\xC0-\xFF/\x80-\xBF/;
|
169 |
+
$result .= "\xC3$c2";
|
170 |
+
} else {
|
171 |
+
$result .= "?";
|
172 |
+
}
|
173 |
+
$s = $rest;
|
174 |
+
next;
|
175 |
+
}
|
176 |
+
$s = "";
|
177 |
+
}
|
178 |
+
$result .= "\n" if ($orig =~ /\n$/) && ! ($result =~ /\n$/);
|
179 |
+
return $result;
|
180 |
+
}
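The enforcer above copies well-formed UTF-8 sequences through unchanged and repairs or replaces every other byte. A minimal sketch of that loop is shown below. The name `enforce_utf8_sketch` and its `$repair` flag are hypothetical. Only the Latin-1 repair (bytes C0 to FF become `C3` plus the byte shifted down by 0x40) is kept, and the special cases for the Windows-1252 quotes and ellipsis are omitted.

```perl
use strict;
use warnings;

# Hypothetical helper: keep valid UTF-8 sequences, repair or replace stray bytes.
sub enforce_utf8_sketch {
    my ($s, $repair) = @_;
    my $result = "";
    while ($s ne "") {
        if ($s =~ s/^([\x00-\x7F]+|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})//) {
            $result .= $1;                      # ASCII run or one well-formed sequence
        } else {
            my $c = substr($s, 0, 1, "");       # remove and return the offending byte
            $result .= ($repair && $c =~ /^[\xC0-\xFF]$/)
                ? "\xC3" . chr(ord($c) - 0x40)  # treat as Latin-1, re-encode as UTF-8
                : "?";
        }
    }
    return $result;
}

printf("%vX\n", enforce_utf8_sketch("\xE9", 1));   # Latin-1 e-acute; prints C3.A9
```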
|
181 |
+
|
182 |
+
sub split_into_utf8_characters {
|
183 |
+
# input: utf8 string
|
184 |
+
# output: list of sub-strings, each representing a utf8 character
|
185 |
+
local($caller,$string,$group_control, *ht) = @_;
|
186 |
+
|
187 |
+
@characters = ();
|
188 |
+
$end_of_token_p_string = "";
|
189 |
+
$skipped_bytes = "";
|
190 |
+
$group_control = "" unless defined($group_control);
|
191 |
+
$group_ascii_numbers = ($group_control =~ /ASCII numbers/);
|
192 |
+
$group_ascii_spaces = ($group_control =~ /ASCII spaces/);
|
193 |
+
$group_ascii_punct = ($group_control =~ /ASCII punct/);
|
194 |
+
$group_ascii_chars = ($group_control =~ /ASCII chars/);
|
195 |
+
$group_xml_chars = ($group_control =~ /XML chars/);
|
196 |
+
$group_xml_tags = ($group_control =~ /XML tags/);
|
197 |
+
$return_only_chars = ($group_control =~ /return only chars/);
|
198 |
+
$return_trailing_whitespaces = ($group_control =~ /return trailing whitespaces/);
|
199 |
+
if ($group_control =~ /ASCII all/) {
|
200 |
+
$group_ascii_numbers = 1;
|
201 |
+
$group_ascii_spaces = 1;
|
202 |
+
$group_ascii_chars = 1;
|
203 |
+
$group_ascii_punct = 1;
|
204 |
+
}
|
205 |
+
if ($group_control =~ /(XML chars and tags|XML tags and chars)/) {
|
206 |
+
$group_xml_chars = 1;
|
207 |
+
$group_xml_tags = 1;
|
208 |
+
}
|
209 |
+
$orig_string = $string;
|
210 |
+
$string .= " ";
|
211 |
+
while ($string =~ /\S/) {
|
212 |
+
# one-byte UTF-8 = ASCII
|
213 |
+
if ($string =~ /^[\x00-\x7F]/) {
|
214 |
+
if ($group_xml_chars
|
215 |
+
&& (($dec_unicode, $rest) = ($string =~ /^&#(\d+);(.*)$/s))
|
216 |
+
&& ($utf8_char = $caller->unicode2string($dec_unicode))) {
|
217 |
+
push(@characters, $utf8_char);
|
218 |
+
$string = $rest;
|
219 |
+
} elsif ($group_xml_chars
|
220 |
+
&& (($hex_unicode, $rest) = ($string =~ /^&#x([0-9a-f]{1,6});(.*)$/is))
|
221 |
+
&& ($utf8_char = $caller->unicode_hex_string2string($hex_unicode))) {
|
222 |
+
push(@characters, $utf8_char);
|
223 |
+
$string = $rest;
|
224 |
+
} elsif ($group_xml_chars
|
225 |
+
&& (($html_entity_name, $rest) = ($string =~ /^&([a-z]{1,6});(.*)$/is))
|
226 |
+
&& ($dec_unicode = $ht{HTML_ENTITY_NAME_TO_DECUNICODE}->{$html_entity_name})
|
227 |
+
&& ($utf8_char = $caller->unicode2string($dec_unicode))
|
228 |
+
) {
|
229 |
+
push(@characters, $utf8_char);
|
230 |
+
$string = $rest;
|
231 |
+
} elsif ($group_xml_tags
|
232 |
+
&& (($tag, $rest) = ($string =~ /^(<\/?[a-zA-Z][-_:a-zA-Z0-9]*(\s+[a-zA-Z][-_:a-zA-Z0-9]*=\"[^"]*\")*\s*\/?>)(.*)$/s))) {
|
233 |
+
push(@characters, $tag);
|
234 |
+
$string = $rest;
|
235 |
+
} elsif ($group_ascii_numbers && ($string =~ /^[12]\d\d\d\.[01]?\d.[0-3]?\d([^0-9].*)?$/)) {
|
236 |
+
($date) = ($string =~ /^(\d\d\d\d\.\d?\d.\d?\d)([^0-9].*)?$/);
|
237 |
+
push(@characters,$date);
|
238 |
+
$string = substr($string, length($date));
|
239 |
+
} elsif ($group_ascii_numbers && ($string =~ /^\d/)) {
|
240 |
+
($number) = ($string =~ /^(\d+(,\d\d\d)*(\.\d+)?)/);
|
241 |
+
push(@characters,$number);
|
242 |
+
$string = substr($string, length($number));
|
243 |
+
} elsif ($group_ascii_spaces && ($string =~ /^(\s+)/)) {
|
244 |
+
($space) = ($string =~ /^(\s+)/);
|
245 |
+
$string = substr($string, length($space));
|
246 |
+
} elsif ($group_ascii_punct && (($punct_seq) = ($string =~ /^(-+|\.+|[:,%()"])/))) {
|
247 |
+
push(@characters,$punct_seq);
|
248 |
+
$string = substr($string, length($punct_seq));
|
249 |
+
} elsif ($group_ascii_chars && (($word) = ($string =~ /^(\$[A-Z]*|[A-Z]{1,3}\$)/))) {
|
250 |
+
push(@characters,$word);
|
251 |
+
$string = substr($string, length($word));
|
252 |
+
} elsif ($group_ascii_chars && (($abbrev) = ($string =~ /^((?:Jan|Feb|Febr|Mar|Apr|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec|Mr|Mrs|Dr|a\.m|p\.m)\.)/))) {
|
253 |
+
push(@characters,$abbrev);
|
254 |
+
$string = substr($string, length($abbrev));
|
255 |
+
} elsif ($group_ascii_chars && (($word) = ($string =~ /^(second|minute|hour|day|week|month|year|inch|foot|yard|meter|kilometer|mile)-(?:long|old)/i))) {
|
256 |
+
push(@characters,$word);
|
257 |
+
$string = substr($string, length($word));
|
258 |
+
} elsif ($group_ascii_chars && (($word) = ($string =~ /^(zero|one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|hundred|thousand|million|billion|trillion)-/i))) {
|
259 |
+
push(@characters,$word);
|
260 |
+
$string = substr($string, length($word));
|
261 |
+
} elsif ($group_ascii_chars && (($word) = ($string =~ /^([a-zA-Z]+)(?:[ ,;%?|()"]|'s |' |\. |\d+[:hms][0-9 ])/))) {
|
262 |
+
push(@characters,$word);
|
263 |
+
$string = substr($string, length($word));
|
264 |
+
} elsif ($group_ascii_chars && ($string =~ /^([\x21-\x27\x2A-\x7E]+)/)) { # exclude ()
|
265 |
+
($ascii) = ($string =~ /^([\x21-\x27\x2A-\x7E]+)/); # ASCII black-characters
|
266 |
+
push(@characters,$ascii);
|
267 |
+
$string = substr($string, length($ascii));
|
268 |
+
} elsif ($group_ascii_chars && ($string =~ /^([\x21-\x7E]+)/)) {
|
269 |
+
($ascii) = ($string =~ /^([\x21-\x7E]+)/); # ASCII black-characters
|
270 |
+
push(@characters,$ascii);
|
271 |
+
$string = substr($string, length($ascii));
|
272 |
+
} elsif ($group_ascii_chars && ($string =~ /^([\x00-\x7F]+)/)) {
|
273 |
+
($ascii) = ($string =~ /^([\x00-\x7F]+)/);
|
274 |
+
push(@characters,$ascii);
|
275 |
+
$string = substr($string, length($ascii));
|
276 |
+
} else {
|
277 |
+
push(@characters,substr($string, 0, 1));
|
278 |
+
$string = substr($string, 1);
|
279 |
+
}
|
280 |
+
|
281 |
+
# two-byte UTF-8
|
282 |
+
} elsif ($string =~ /^[\xC0-\xDF][\x80-\xBF]/) {
|
283 |
+
push(@characters,substr($string, 0, 2));
|
284 |
+
$string = substr($string, 2);
|
285 |
+
|
286 |
+
# three-byte UTF-8
|
287 |
+
} elsif ($string =~ /^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]/) {
|
288 |
+
push(@characters,substr($string, 0, 3));
|
289 |
+
$string = substr($string, 3);
|
290 |
+
|
291 |
+
# four-byte UTF-8
|
292 |
+
} elsif ($string =~ /^[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]/) {
|
293 |
+
push(@characters,substr($string, 0, 4));
|
294 |
+
$string = substr($string, 4);
|
295 |
+
|
296 |
+
# five-byte UTF-8
|
297 |
+
} elsif ($string =~ /^[\xF8-\xFB][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF]/) {
|
298 |
+
push(@characters,substr($string, 0, 5));
|
299 |
+
$string = substr($string, 5);
|
300 |
+
|
301 |
+
# six-byte UTF-8
|
302 |
+
} elsif ($string =~ /^[\xFC-\xFD][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF]/) {
|
303 |
+
push(@characters,substr($string, 0, 6));
|
304 |
+
$string = substr($string, 6);
|
305 |
+
|
306 |
+
# not a UTF-8 character
|
307 |
+
} else {
|
308 |
+
$skipped_bytes .= substr($string, 0, 1);
|
309 |
+
$string = substr($string, 1);
|
310 |
+
}
|
311 |
+
|
312 |
+
$end_of_token_p_string .= ($string =~ /^\S/) ? "0" : "1"
|
313 |
+
if $#characters >= length($end_of_token_p_string);
|
314 |
+
}
|
315 |
+
$string =~ s/ $//; # remove previously added space, but keep original spaces
|
316 |
+
if ($return_trailing_whitespaces) {
|
317 |
+
while ($string =~ /^[ \t]/) {
|
318 |
+
push(@characters,substr($string, 0, 1));
|
319 |
+
$string = substr($string, 1);
|
320 |
+
}
|
321 |
+
push(@characters, "\n") if $orig_string =~ /\n$/;
|
322 |
+
}
|
323 |
+
return ($return_only_chars) ? @characters : ($skipped_bytes, $end_of_token_p_string, @characters);
|
324 |
+
}
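With no grouping options set, the splitter above walks the byte string and cuts it into one substring per UTF-8 character according to the lead byte. That core step can be sketched with a single global match, as below. The name `split_utf8_sketch` is hypothetical, and the sketch is limited to sequences of up to four bytes. It silently drops bytes that do not start a well-formed sequence, whereas the original collects them in `$skipped_bytes`, and it does none of the ASCII or XML grouping.

```perl
use strict;
use warnings;

# Hypothetical helper: split a UTF-8 byte string into per-character substrings.
sub split_utf8_sketch {
    my ($bytes) = @_;
    my @chars = ($bytes =~ /([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})/g);
    return @chars;
}

# "a", Arabic alef (2 bytes), CJK ideograph U+4E2D (3 bytes)
my @chars = split_utf8_sketch("a\xD8\xA7\xE4\xB8\xAD");
print scalar(@chars), "\n";   # prints 3
```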
|
325 |
+
|
326 |
+
sub max_substring_info {
|
327 |
+
local($caller,$s1,$s2,$info_type) = @_;
|
328 |
+
|
329 |
+
($skipped_bytes1, $end_of_token_p_string1, @char_list1) = $caller->split_into_utf8_characters($s1, "", *empty_ht);
|
330 |
+
($skipped_bytes2, $end_of_token_p_string2, @char_list2) = $caller->split_into_utf8_characters($s2, "", *empty_ht);
|
331 |
+
return 0 if $skipped_bytes1 || $skipped_bytes2;
|
332 |
+
|
333 |
+
$best_substring_start1 = 0;
|
334 |
+
$best_substring_start2 = 0;
|
335 |
+
$best_substring_length = 0;
|
336 |
+
|
337 |
+
foreach $start_pos2 ((0 .. $#char_list2)) {
|
338 |
+
last if $start_pos2 + $best_substring_length > $#char_list2;
|
339 |
+
foreach $start_pos1 ((0 .. $#char_list1)) {
|
340 |
+
last if $start_pos1 + $best_substring_length > $#char_list1;
|
341 |
+
$matching_length = 0;
|
342 |
+
while (($start_pos1 + $matching_length <= $#char_list1)
|
343 |
+
&& ($start_pos2 + $matching_length <= $#char_list2)
|
344 |
+
&& ($char_list1[$start_pos1+$matching_length] eq $char_list2[$start_pos2+$matching_length])) {
|
345 |
+
$matching_length++;
|
346 |
+
}
|
347 |
+
if ($matching_length > $best_substring_length) {
|
348 |
+
$best_substring_length = $matching_length;
|
349 |
+
$best_substring_start1 = $start_pos1;
|
350 |
+
$best_substring_start2 = $start_pos2;
|
351 |
+
}
|
352 |
+
}
|
353 |
+
}
|
354 |
+
if ($info_type =~ /^max-ratio1$/) {
|
355 |
+
$length1 = $#char_list1 + 1;
|
356 |
+
return ($length1 > 0) ? ($best_substring_length / $length1) : 0;
|
357 |
+
} elsif ($info_type =~ /^max-ratio2$/) {
|
358 |
+
$length2 = $#char_list2 + 1;
|
359 |
+
return ($length2 > 0) ? ($best_substring_length / $length2) : 0;
|
360 |
+
} elsif ($info_type =~ /^substring$/) {
|
361 |
+
return join("", @char_list1[$best_substring_start1 .. $best_substring_start1+$best_substring_length-1]);
|
362 |
+
} else {
|
363 |
+
$length1 = $#char_list1 + 1;
|
364 |
+
$length2 = $#char_list2 + 1;
|
365 |
+
$info = "s1=$s1;s2=$s2";
|
366 |
+
$info .= ";best_substring_length=$best_substring_length";
|
367 |
+
$info .= ";best_substring_start1=$best_substring_start1";
|
368 |
+
$info .= ";best_substring_start2=$best_substring_start2";
|
369 |
+
$info .= ";length1=$length1";
|
370 |
+
$info .= ";length2=$length2";
|
371 |
+
return $info;
|
372 |
+
}
|
373 |
+
}
|
374 |
+
|
375 |
+
sub n_shared_chars_at_start {
|
376 |
+
local($caller,$s1,$s2) = @_;
|
377 |
+
|
378 |
+
my $n = 0;
|
379 |
+
while (($s1 ne "") && ($s2 ne "")) {
|
380 |
+
($c1, $rest1) = ($s1 =~ /^(.[\x80-\xBF]*)(.*)$/);
|
381 |
+
($c2, $rest2) = ($s2 =~ /^(.[\x80-\xBF]*)(.*)$/);
|
382 |
+
if ($c1 eq $c2) {
|
383 |
+
$n++;
|
384 |
+
$s1 = $rest1;
|
385 |
+
$s2 = $rest2;
|
386 |
+
} else {
|
387 |
+
last;
|
388 |
+
}
|
389 |
+
}
|
390 |
+
return $n;
|
391 |
+
}
|
392 |
+
|
393 |
+
sub char_length {
|
394 |
+
local($caller,$string,$byte_offset) = @_;
|
395 |
+
|
396 |
+
my $char = ($byte_offset) ? substr($string, $byte_offset) : $string;
|
397 |
+
return 1 if $char =~ /^[\x00-\x7F]/;
|
398 |
+
return 2 if $char =~ /^[\xC0-\xDF]/;
|
399 |
+
return 3 if $char =~ /^[\xE0-\xEF]/;
|
400 |
+
return 4 if $char =~ /^[\xF0-\xF7]/;
|
401 |
+
return 5 if $char =~ /^[\xF8-\xFB]/;
|
402 |
+
return 6 if $char =~ /^[\xFC-\xFD]/;
|
403 |
+
return 0;
|
404 |
+
}
|
405 |
+
|
406 |
+
sub length_in_utf8_chars {
|
407 |
+
local($caller,$s) = @_;
|
408 |
+
|
409 |
+
$s =~ s/[\x80-\xBF]//g;
|
410 |
+
$s =~ s/[\x00-\x7F\xC0-\xFF]/c/g;
|
411 |
+
return length($s);
|
412 |
+
}
|
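The counting trick in `length_in_utf8_chars` relies on the fact that every UTF-8 character has exactly one non-continuation byte, so dropping bytes in the range `\x80-\xBF` leaves one byte per character. A minimal standalone sketch of the same idea follows; the helper name `count_utf8_chars` is hypothetical and not part of the module, and the input is assumed to be a byte string of well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of UTF8.pm): count UTF-8 characters in a
# byte string by subtracting the number of continuation bytes (0x80-0xBF).
sub count_utf8_chars {
    my ($s) = @_;
    # tr with an empty replacement list counts matches without modifying $s.
    my $n_continuation_bytes = ($s =~ tr/\x80-\xBF//);
    return length($s) - $n_continuation_bytes;
}

print count_utf8_chars("na\xC3\xAFve"), "\n";   # 6 bytes, 5 characters
print count_utf8_chars("\xE2\x82\xAC"), "\n";   # 3 bytes, 1 character (Euro sign)
```

For malformed input (for example a stray continuation byte) the count is only an approximation, which also holds for the original sub.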
413 |
+
|
414 |
+
sub byte_length_of_n_chars {
|
415 |
+
local($caller,$char_length,$string,$byte_offset,$undef_return_value) = @_;
|
416 |
+
|
417 |
+
$byte_offset = 0 unless defined($byte_offset);
|
418 |
+
$undef_return_value = -1 unless defined($undef_return_value);
|
419 |
+
my $result = 0;
|
420 |
+
my $len;
|
421 |
+
foreach $i ((1 .. $char_length)) {
|
422 |
+
$len = $caller->char_length($string,($byte_offset+$result));
|
423 |
+
return $undef_return_value unless $len;
|
424 |
+
$result += $len;
|
425 |
+
}
|
426 |
+
return $result;
|
427 |
+
}
|
428 |
+
|
429 |
+
sub replace_non_ASCII_bytes {
|
430 |
+
local($caller,$string,$replacement) = @_;
|
431 |
+
|
432 |
+
$replacement = "HEX" unless defined($replacement);
|
433 |
+
if ($replacement =~ /^(Unicode|U\+4|\\u|HEX)$/) {
|
434 |
+
$new_string = "";
|
435 |
+
while (($pre,$utf8_char, $post) = ($string =~ /^([\x09\x0A\x20-\x7E]*)([\x00-\x08\x0B-\x1F\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF][\x80-\xBF]|[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]|[\xF8-\xFF][\x80-\xBF]+|[\x80-\xBF])(.*)$/s)) {
|
436 |
+
if ($replacement =~ /Unicode/) {
|
437 |
+
$new_string .= $pre . "<U" . (uc $caller->utf8_to_unicode($utf8_char)) . ">";
|
438 |
+
} elsif ($replacement =~ /\\u/) {
|
439 |
+
$new_string .= $pre . "\\u" . (uc sprintf("%04x", $caller->utf8_to_unicode($utf8_char)));
|
440 |
+
} elsif ($replacement =~ /U\+4/) {
|
441 |
+
$new_string .= $pre . "<U+" . (uc $caller->utf8_to_4hex_unicode($utf8_char)) . ">";
|
442 |
+
} else {
|
443 |
+
$new_string .= $pre . "<HEX-" . $caller->utf8_to_hex($utf8_char) . ">";
|
444 |
+
}
|
445 |
+
$string = $post;
|
446 |
+
}
|
447 |
+
$new_string .= $string;
|
448 |
+
} else {
|
449 |
+
$new_string = $string;
|
450 |
+
$new_string =~ s/[\x80-\xFF]/$replacement/g;
|
451 |
+
}
|
452 |
+
return $new_string;
|
453 |
+
}
|
454 |
+
|
455 |
+
sub valid_utf8_string_p {
|
456 |
+
local($caller,$string) = @_;
|
457 |
+
|
458 |
+
return $string =~ /^(?:[\x09\x0A\x20-\x7E]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF][\x80-\xBF]|[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])*$/;
|
459 |
+
}
|
460 |
+
|
461 |
+
sub valid_utf8_string_incl_ascii_control_p {
|
462 |
+
local($caller,$string) = @_;
|
463 |
+
|
464 |
+
return $string =~ /^(?:[\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF][\x80-\xBF]|[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF])*$/;
|
465 |
+
}
|
466 |
+
|
467 |
+
sub utf8_to_hex {
|
468 |
+
local($caller,$s) = @_;
|
469 |
+
|
470 |
+
$hex = "";
|
471 |
+
foreach $i ((0 .. length($s)-1)) {
|
472 |
+
$hex .= uc sprintf("%2.2x",ord(substr($s, $i, 1)));
|
473 |
+
}
|
474 |
+
return $hex;
|
475 |
+
}
|
476 |
+
|
477 |
+
sub hex_to_utf8 {
|
478 |
+
local($caller,$s) = @_;
|
479 |
+
# surface string \xE2\x80\xBA to UTF8
|
480 |
+
|
481 |
+
my $utf8 = "";
|
482 |
+
while (($hex, $rest) = ($s =~ /^(?:\\x)?([0-9A-Fa-f]{2,2})(.*)$/)) {
|
483 |
+
$utf8 .= sprintf("%c", hex($hex));
|
484 |
+
$s = $rest;
|
485 |
+
}
|
486 |
+
return $utf8;
|
487 |
+
}
|
488 |
+
|
489 |
+
sub utf8_to_4hex_unicode {
|
490 |
+
local($caller,$s) = @_;
|
491 |
+
|
492 |
+
return sprintf("%4.4x", $caller->utf8_to_unicode($s));
|
493 |
+
}
|
494 |
+
|
495 |
+
sub utf8_to_unicode {
|
496 |
+
local($caller,$s) = @_;
|
497 |
+
|
498 |
+
$unicode = 0;
|
499 |
+
foreach $i ((0 .. length($s)-1)) {
|
500 |
+
$c = substr($s, $i, 1);
|
501 |
+
if ($c =~ /^[\x80-\xBF]$/) {
|
502 |
+
$unicode = $unicode * 64 + (ord($c) & 0x3F);
|
503 |
+
} elsif ($c =~ /^[\xC0-\xDF]$/) {
|
504 |
+
$unicode = $unicode * 32 + (ord($c) & 0x1F);
|
505 |
+
} elsif ($c =~ /^[\xE0-\xEF]$/) {
|
506 |
+
$unicode = $unicode * 16 + (ord($c) & 0x0F);
|
507 |
+
} elsif ($c =~ /^[\xF0-\xF7]$/) {
|
508 |
+
$unicode = $unicode * 8 + (ord($c) & 0x07);
|
509 |
+
} elsif ($c =~ /^[\xF8-\xFB]$/) {
|
510 |
+
$unicode = $unicode * 4 + (ord($c) & 0x03);
|
511 |
+
} elsif ($c =~ /^[\xFC-\xFD]$/) {
|
512 |
+
$unicode = $unicode * 2 + (ord($c) & 0x01);
|
513 |
+
}
|
514 |
+
}
|
515 |
+
return $unicode;
|
516 |
+
}
|
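As an illustration of the bit arithmetic in `utf8_to_unicode`, here is a minimal standalone sketch that decodes one UTF-8 encoded character to its code point. The name `utf8_bytes_to_code_point` is hypothetical, the sketch only covers sequences of up to four bytes, and, unlike the sub above, it returns the byte value itself for ASCII input.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of UTF8.pm): decode a single UTF-8 encoded
# character, given as a byte string, to its Unicode code point.
sub utf8_bytes_to_code_point {
    my ($bytes) = @_;
    my $code_point = 0;
    foreach my $byte (unpack("C*", $bytes)) {
        if (($byte & 0xC0) == 0x80) {
            # continuation byte: shift in its 6 payload bits
            $code_point = ($code_point << 6) | ($byte & 0x3F);
        } elsif (($byte & 0xE0) == 0xC0) {
            $code_point = $byte & 0x1F;   # lead byte of a 2-byte sequence
        } elsif (($byte & 0xF0) == 0xE0) {
            $code_point = $byte & 0x0F;   # lead byte of a 3-byte sequence
        } elsif (($byte & 0xF8) == 0xF0) {
            $code_point = $byte & 0x07;   # lead byte of a 4-byte sequence
        } else {
            $code_point = $byte;          # ASCII (other bytes are not validated here)
        }
    }
    return $code_point;
}

printf("U+%04X\n", utf8_bytes_to_code_point("\xE2\x82\xAC"));   # Euro sign
```

The lead byte contributes 5, 4, or 3 payload bits depending on the sequence length, and each continuation byte contributes 6 more, which is what the multiplications by 64 in the original sub express.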
517 |
+
|
518 |
+
sub charhex {
|
519 |
+
local($caller,$string) = @_;
|
520 |
+
|
521 |
+
my $result = "";
|
522 |
+
while ($string ne "") {
|
523 |
+
$char = substr($string, 0, 1);
|
524 |
+
$string = substr($string, 1);
|
525 |
+
if ($char =~ /^[ -~]$/) {
|
526 |
+
$result .= $char;
|
527 |
+
} else {
|
528 |
+
$hex = sprintf("%2.2x",ord($char));
|
529 |
+
$hex =~ tr/a-f/A-F/;
|
530 |
+
$result .= "<HEX-$hex>";
|
531 |
+
}
|
532 |
+
}
|
533 |
+
return $result;
|
534 |
+
}
|
535 |
+
|
536 |
+
sub windows1252_to_utf8 {
|
537 |
+
local($caller,$s, $norm_to_ascii_p, $preserve_potential_utf8s_p) = @_;
|
538 |
+
|
539 |
+
return $s if $s =~ /^[\x00-\x7F]*$/; # all ASCII
|
540 |
+
|
541 |
+
$norm_to_ascii_p = 1 unless defined($norm_to_ascii_p);
|
542 |
+
$preserve_potential_utf8s_p = 1 unless defined($preserve_potential_utf8s_p);
|
543 |
+
my $result = "";
|
544 |
+
my $c = "";
|
545 |
+
while ($s ne "") {
|
546 |
+
$n_bytes = 1;
|
547 |
+
if ($s =~ /^[\x00-\x7F]/) {
|
548 |
+
$result .= substr($s, 0, 1); # ASCII
|
549 |
+
} elsif ($preserve_potential_utf8s_p && ($s =~ /^[\xC0-\xDF][\x80-\xBF]/)) {
|
550 |
+
$result .= substr($s, 0, 2); # valid 2-byte UTF8
|
551 |
+
$n_bytes = 2;
|
552 |
+
} elsif ($preserve_potential_utf8s_p && ($s =~ /^[\xE0-\xEF][\x80-\xBF][\x80-\xBF]/)) {
|
553 |
+
$result .= substr($s, 0, 3); # valid 3-byte UTF8
|
554 |
+
$n_bytes = 3;
|
555 |
+
} elsif ($preserve_potential_utf8s_p && ($s =~ /^[\xF0-\xF7][\x80-\xBF][\x80-\xBF][\x80-\xBF]/)) {
|
556 |
+
$result .= substr($s, 0, 4); # valid 4-byte UTF8
|
557 |
+
$n_bytes = 4;
|
558 |
+
} elsif ($preserve_potential_utf8s_p && ($s =~ /^[\xF8-\xFB][\x80-\xBF][\x80-\xBF][\x80-\xBF][\x80-\xBF]/)) {
|
559 |
+
$result .= substr($s, 0, 5); # valid 5-byte UTF8
|
560 |
+
$n_bytes = 5;
|
561 |
+
} elsif ($s =~ /^[\xA0-\xBF]/) {
|
562 |
+
$c = substr($s, 0, 1);
|
563 |
+
$result .= "\xC2$c";
|
564 |
+
} elsif ($s =~ /^[\xC0-\xFF]/) {
|
565 |
+
$c = substr($s, 0, 1);
|
566 |
+
$c =~ tr/[\xC0-\xFF]/[\x80-\xBF]/;
|
567 |
+
$result .= "\xC3$c";
|
568 |
+
} elsif ($s =~ /^\x80/) {
|
569 |
+
$result .= "\xE2\x82\xAC"; # Euro sign
|
570 |
+
} elsif ($s =~ /^\x82/) {
|
571 |
+
$result .= "\xE2\x80\x9A"; # single low quotation mark
|
572 |
+
} elsif ($s =~ /^\x83/) {
|
573 |
+
$result .= "\xC6\x92"; # Latin small letter f with hook
|
574 |
+
} elsif ($s =~ /^\x84/) {
|
575 |
+
$result .= "\xE2\x80\x9E"; # double low quotation mark
|
576 |
+
} elsif ($s =~ /^\x85/) {
|
577 |
+
$result .= ($norm_to_ascii_p) ? "..." : "\xE2\x80\xA6"; # horizontal ellipsis (three dots)
|
578 |
+
} elsif ($s =~ /^\x86/) {
|
579 |
+
$result .= "\xE2\x80\xA0"; # dagger
|
580 |
+
} elsif ($s =~ /^\x87/) {
|
581 |
+
$result .= "\xE2\x80\xA1"; # double dagger
|
582 |
+
} elsif ($s =~ /^\x88/) {
|
583 |
+
$result .= "\xCB\x86"; # circumflex
|
584 |
+
} elsif ($s =~ /^\x89/) {
|
585 |
+
$result .= "\xE2\x80\xB0"; # per mille sign
|
586 |
+
} elsif ($s =~ /^\x8A/) {
|
587 |
+
$result .= "\xC5\xA0"; # Latin capital letter S with caron
|
588 |
+
} elsif ($s =~ /^\x8B/) {
|
589 |
+
$result .= "\xE2\x80\xB9"; # single left-pointing angle quotation mark
|
590 |
+
} elsif ($s =~ /^\x8C/) {
|
591 |
+
$result .= "\xC5\x92"; # OE ligature
|
592 |
+
} elsif ($s =~ /^\x8E/) {
|
593 |
+
$result .= "\xC5\xBD"; # Latin capital letter Z with caron
|
594 |
+
} elsif ($s =~ /^\x91/) {
|
595 |
+
$result .= ($norm_to_ascii_p) ? "`" : "\xE2\x80\x98"; # left single quotation mark
|
596 |
+
} elsif ($s =~ /^\x92/) {
|
597 |
+
$result .= ($norm_to_ascii_p) ? "'" : "\xE2\x80\x99"; # right single quotation mark
|
598 |
+
} elsif ($s =~ /^\x93/) {
|
599 |
+
$result .= "\xE2\x80\x9C"; # left double quotation mark
|
600 |
+
} elsif ($s =~ /^\x94/) {
|
601 |
+
$result .= "\xE2\x80\x9D"; # right double quotation mark
|
602 |
+
} elsif ($s =~ /^\x95/) {
|
603 |
+
$result .= "\xE2\x80\xA2"; # bullet
|
604 |
+
} elsif ($s =~ /^\x96/) {
|
605 |
+
$result .= ($norm_to_ascii_p) ? "-" : "\xE2\x80\x93"; # en dash
|
606 |
+
} elsif ($s =~ /^\x97/) {
|
607 |
+
$result .= ($norm_to_ascii_p) ? "-" : "\xE2\x80\x94"; # em dash
|
608 |
+
} elsif ($s =~ /^\x98/) {
|
609 |
+
$result .= ($norm_to_ascii_p) ? "~" : "\xCB\x9C"; # small tilde
|
610 |
+
} elsif ($s =~ /^\x99/) {
|
611 |
+
$result .= "\xE2\x84\xA2"; # trade mark sign
|
612 |
+
} elsif ($s =~ /^\x9A/) {
|
613 |
+
$result .= "\xC5\xA1"; # Latin small letter s with caron
|
614 |
+
} elsif ($s =~ /^\x9B/) {
|
615 |
+
$result .= "\xE2\x80\xBA"; # single right-pointing angle quotation mark
|
616 |
+
} elsif ($s =~ /^\x9C/) {
|
617 |
+
$result .= "\xC5\x93"; # oe ligature
|
618 |
+
} elsif ($s =~ /^\x9E/) {
|
619 |
+
$result .= "\xC5\xBE"; # Latin small letter z with caron
|
620 |
+
} elsif ($s =~ /^\x9F/) {
|
621 |
+
$result .= "\xC5\xB8"; # Latin capital letter Y with diaeresis
|
622 |
+
} else {
|
623 |
+
$result .= "?";
|
624 |
+
}
|
625 |
+
$s = substr($s, $n_bytes);
|
626 |
+
}
|
627 |
+
return $result;
|
628 |
+
}
|
629 |
+
|
630 |
+
sub delete_weird_stuff {
|
631 |
+
local($caller, $s) = @_;
|
632 |
+
|
633 |
+
# delete control characters (except tab and linefeed), zero-width characters, byte order mark,
|
634 |
+
# directional marks, join marks, variation selectors, Arabic tatweel
|
635 |
+
$s =~ s/([\x00-\x08\x0B-\x1F\x7F]|\xC2[\x80-\x9F]|\xD9\x80|\xE2\x80[\x8B-\x8F]|\xEF\xB8[\x80-\x8F]|\xEF\xBB\xBF|\xF3\xA0[\x84-\x87][\x80-\xBF])//g;
|
636 |
+
return $s;
|
637 |
+
}
|
638 |
+
|
639 |
+
sub number_of_utf8_character {
|
640 |
+
local($caller, $s) = @_;
|
641 |
+
|
642 |
+
$s2 = $s;
|
643 |
+
$s2 =~ s/[\x80-\xBF]//g;
|
644 |
+
return length($s2);
|
645 |
+
}
|
646 |
+
|
647 |
+
sub cap_letter_reg_exp {
|
648 |
+
# includes A-Z and other Latin-based capital letters with accents, umlauts and other decorations etc.
|
649 |
+
return "[A-Z]|\xC3[\x80-\x96\x98-\x9E]|\xC4[\x80\x82\x84\x86\x88\x8A\x8C\x8E\x90\x94\x96\x98\x9A\x9C\x9E\xA0\xA2\xA4\xA6\xA8\xAA\xAC\xAE\xB0\xB2\xB4\xB6\xB9\xBB\xBD\xBF]|\xC5[\x81\x83\x85\x87\x8A\x8C\x8E\x90\x92\x96\x98\x9A\x9C\x9E\xA0\xA2\xA4\xA6\xA8\xAA\xAC\xB0\xB2\xB4\xB6\xB8\xB9\xBB\xBD]";
|
650 |
+
}
|
651 |
+
|
652 |
+
sub regex_extended_case_expansion {
|
653 |
+
local($caller, $s) = @_;
|
654 |
+
|
655 |
+
if ($s =~ /\xC3/) {
|
656 |
+
$s =~ s/\xC3\xA0/\xC3\[\x80\xA0\]/g;
|
657 |
+
$s =~ s/\xC3\xA1/\xC3\[\x81\xA1\]/g;
|
658 |
+
$s =~ s/\xC3\xA2/\xC3\[\x82\xA2\]/g;
|
659 |
+
$s =~ s/\xC3\xA3/\xC3\[\x83\xA3\]/g;
|
660 |
+
$s =~ s/\xC3\xA4/\xC3\[\x84\xA4\]/g;
|
661 |
+
$s =~ s/\xC3\xA5/\xC3\[\x85\xA5\]/g;
|
662 |
+
$s =~ s/\xC3\xA6/\xC3\[\x86\xA6\]/g;
|
663 |
+
$s =~ s/\xC3\xA7/\xC3\[\x87\xA7\]/g;
|
664 |
+
$s =~ s/\xC3\xA8/\xC3\[\x88\xA8\]/g;
|
665 |
+
$s =~ s/\xC3\xA9/\xC3\[\x89\xA9\]/g;
|
666 |
+
$s =~ s/\xC3\xAA/\xC3\[\x8A\xAA\]/g;
|
667 |
+
$s =~ s/\xC3\xAB/\xC3\[\x8B\xAB\]/g;
|
668 |
+
$s =~ s/\xC3\xAC/\xC3\[\x8C\xAC\]/g;
|
669 |
+
$s =~ s/\xC3\xAD/\xC3\[\x8D\xAD\]/g;
|
670 |
+
$s =~ s/\xC3\xAE/\xC3\[\x8E\xAE\]/g;
|
671 |
+
$s =~ s/\xC3\xAF/\xC3\[\x8F\xAF\]/g;
|
672 |
+
$s =~ s/\xC3\xB0/\xC3\[\x90\xB0\]/g;
|
673 |
+
$s =~ s/\xC3\xB1/\xC3\[\x91\xB1\]/g;
|
674 |
+
$s =~ s/\xC3\xB2/\xC3\[\x92\xB2\]/g;
|
675 |
+
$s =~ s/\xC3\xB3/\xC3\[\x93\xB3\]/g;
|
676 |
+
$s =~ s/\xC3\xB4/\xC3\[\x94\xB4\]/g;
|
677 |
+
$s =~ s/\xC3\xB5/\xC3\[\x95\xB5\]/g;
|
678 |
+
$s =~ s/\xC3\xB6/\xC3\[\x96\xB6\]/g;
|
679 |
+
$s =~ s/\xC3\xB8/\xC3\[\x98\xB8\]/g;
|
680 |
+
$s =~ s/\xC3\xB9/\xC3\[\x99\xB9\]/g;
|
681 |
+
$s =~ s/\xC3\xBA/\xC3\[\x9A\xBA\]/g;
|
682 |
+
$s =~ s/\xC3\xBB/\xC3\[\x9B\xBB\]/g;
|
683 |
+
$s =~ s/\xC3\xBC/\xC3\[\x9C\xBC\]/g;
|
684 |
+
$s =~ s/\xC3\xBD/\xC3\[\x9D\xBD\]/g;
|
685 |
+
$s =~ s/\xC3\xBE/\xC3\[\x9E\xBE\]/g;
|
686 |
+
}
|
687 |
+
if ($s =~ /\xC5/) {
|
688 |
+
$s =~ s/\xC5\x91/\xC5\[\x90\x91\]/g;
|
689 |
+
$s =~ s/\xC5\xA1/\xC5\[\xA0\xA1\]/g;
|
690 |
+
$s =~ s/\xC5\xB1/\xC5\[\xB0\xB1\]/g;
|
691 |
+
}
|
692 |
+
|
693 |
+
return $s;
|
694 |
+
}
|
695 |
+
|
696 |
+
sub extended_lower_case {
|
697 |
+
local($caller, $s) = @_;
|
698 |
+
|
699 |
+
$s =~ tr/A-Z/a-z/;
|
700 |
+
|
701 |
+
# Latin-1
|
702 |
+
if ($s =~ /\xC3[\x80-\x9F]/) {
|
703 |
+
$s =~ s/À/à/g;
|
704 |
+
$s =~ s/Á/á/g;
|
705 |
+
$s =~ s/Â/â/g;
|
706 |
+
$s =~ s/Ã/ã/g;
|
707 |
+
$s =~ s/Ä/ä/g;
|
708 |
+
$s =~ s/Å/å/g;
|
709 |
+
$s =~ s/Æ/æ/g;
|
710 |
+
$s =~ s/Ç/ç/g;
|
711 |
+
$s =~ s/È/è/g;
|
712 |
+
$s =~ s/É/é/g;
|
713 |
+
$s =~ s/Ê/ê/g;
|
714 |
+
$s =~ s/Ë/ë/g;
|
715 |
+
$s =~ s/Ì/ì/g;
|
716 |
+
$s =~ s/Í/í/g;
|
717 |
+
$s =~ s/Î/î/g;
|
718 |
+
$s =~ s/Ï/ï/g;
|
719 |
+
$s =~ s/Ð/ð/g;
|
720 |
+
$s =~ s/Ñ/ñ/g;
|
721 |
+
$s =~ s/Ò/ò/g;
|
722 |
+
$s =~ s/Ó/ó/g;
|
723 |
+
$s =~ s/Ô/ô/g;
|
724 |
+
$s =~ s/Õ/õ/g;
|
725 |
+
$s =~ s/Ö/ö/g;
|
726 |
+
$s =~ s/Ø/ø/g;
|
727 |
+
$s =~ s/Ù/ù/g;
|
728 |
+
$s =~ s/Ú/ú/g;
|
729 |
+
$s =~ s/Û/û/g;
|
730 |
+
$s =~ s/Ü/ü/g;
|
731 |
+
$s =~ s/Ý/ý/g;
|
732 |
+
$s =~ s/Þ/þ/g;
|
733 |
+
}
|
734 |
+
# Latin Extended-A
|
735 |
+
if ($s =~ /[\xC4-\xC5][\x80-\xBF]/) {
|
736 |
+
$s =~ s/Ā/ā/g;
|
737 |
+
$s =~ s/Ă/ă/g;
|
738 |
+
$s =~ s/Ą/ą/g;
|
739 |
+
$s =~ s/Ć/ć/g;
|
740 |
+
$s =~ s/Ĉ/ĉ/g;
|
741 |
+
$s =~ s/Ċ/ċ/g;
|
742 |
+
$s =~ s/Č/č/g;
|
743 |
+
$s =~ s/Ď/ď/g;
|
744 |
+
$s =~ s/Đ/đ/g;
|
745 |
+
$s =~ s/Ē/ē/g;
|
746 |
+
$s =~ s/Ĕ/ĕ/g;
|
747 |
+
$s =~ s/Ė/ė/g;
|
748 |
+
$s =~ s/Ę/ę/g;
|
749 |
+
$s =~ s/Ě/ě/g;
|
750 |
+
$s =~ s/Ĝ/ĝ/g;
|
751 |
+
$s =~ s/Ğ/ğ/g;
|
752 |
+
$s =~ s/Ġ/ġ/g;
|
753 |
+
$s =~ s/Ģ/ģ/g;
|
754 |
+
$s =~ s/Ĥ/ĥ/g;
|
755 |
+
$s =~ s/Ħ/ħ/g;
|
756 |
+
$s =~ s/Ĩ/ĩ/g;
|
757 |
+
$s =~ s/Ī/ī/g;
|
758 |
+
$s =~ s/Ĭ/ĭ/g;
|
759 |
+
$s =~ s/Į/į/g;
|
760 |
+
$s =~ s/İ/ı/g;
|
761 |
+
$s =~ s/Ĳ/ĳ/g;
|
762 |
+
$s =~ s/Ĵ/ĵ/g;
|
763 |
+
$s =~ s/Ķ/ķ/g;
|
764 |
+
$s =~ s/Ĺ/ĺ/g;
|
765 |
+
$s =~ s/Ļ/ļ/g;
|
766 |
+
$s =~ s/Ľ/ľ/g;
|
767 |
+
$s =~ s/Ŀ/ŀ/g;
|
768 |
+
$s =~ s/Ł/ł/g;
|
769 |
+
$s =~ s/Ń/ń/g;
|
770 |
+
$s =~ s/Ņ/ņ/g;
|
771 |
+
$s =~ s/Ň/ň/g;
|
772 |
+
$s =~ s/Ŋ/ŋ/g;
|
773 |
+
$s =~ s/Ō/ō/g;
|
774 |
+
$s =~ s/Ŏ/ŏ/g;
|
775 |
+
$s =~ s/Ő/ő/g;
|
776 |
+
$s =~ s/Œ/œ/g;
|
777 |
+
$s =~ s/Ŕ/ŕ/g;
|
778 |
+
$s =~ s/Ŗ/ŗ/g;
|
779 |
+
$s =~ s/Ř/ř/g;
|
780 |
+
$s =~ s/Ś/ś/g;
|
781 |
+
$s =~ s/Ŝ/ŝ/g;
|
782 |
+
$s =~ s/Ş/ş/g;
|
783 |
+
$s =~ s/Š/š/g;
|
784 |
+
$s =~ s/Ţ/ţ/g;
|
785 |
+
$s =~ s/Ť/ť/g;
|
786 |
+
$s =~ s/Ŧ/ŧ/g;
|
787 |
+
$s =~ s/Ũ/ũ/g;
|
788 |
+
$s =~ s/Ū/ū/g;
|
789 |
+
$s =~ s/Ŭ/ŭ/g;
|
790 |
+
$s =~ s/Ů/ů/g;
|
791 |
+
$s =~ s/Ű/ű/g;
|
792 |
+
$s =~ s/Ų/ų/g;
|
793 |
+
$s =~ s/Ŵ/ŵ/g;
|
794 |
+
$s =~ s/Ŷ/ŷ/g;
|
795 |
+
$s =~ s/Ź/ź/g;
|
796 |
+
$s =~ s/Ż/ż/g;
|
797 |
+
$s =~ s/Ž/ž/g;
|
798 |
+
}
|
799 |
+
# Greek letters
|
800 |
+
if ($s =~ /\xCE[\x86-\xAB]/) {
|
801 |
+
$s =~ s/Α/α/g;
|
802 |
+
$s =~ s/Β/β/g;
|
803 |
+
$s =~ s/Γ/γ/g;
|
804 |
+
$s =~ s/Δ/δ/g;
|
805 |
+
$s =~ s/Ε/ε/g;
|
806 |
+
$s =~ s/Ζ/ζ/g;
|
807 |
+
$s =~ s/Η/η/g;
|
808 |
+
$s =~ s/Θ/θ/g;
|
809 |
+
$s =~ s/Ι/ι/g;
|
810 |
+
$s =~ s/Κ/κ/g;
|
811 |
+
$s =~ s/Λ/λ/g;
|
812 |
+
$s =~ s/Μ/μ/g;
|
813 |
+
$s =~ s/Ν/ν/g;
|
814 |
+
$s =~ s/Ξ/ξ/g;
|
815 |
+
$s =~ s/Ο/ο/g;
|
816 |
+
$s =~ s/Π/π/g;
|
817 |
+
$s =~ s/Ρ/ρ/g;
|
818 |
+
$s =~ s/Σ/σ/g;
|
819 |
+
$s =~ s/Τ/τ/g;
|
820 |
+
$s =~ s/Υ/υ/g;
|
821 |
+
$s =~ s/Φ/φ/g;
|
822 |
+
$s =~ s/Χ/χ/g;
|
823 |
+
$s =~ s/Ψ/ψ/g;
|
824 |
+
$s =~ s/Ω/ω/g;
|
825 |
+
$s =~ s/Ϊ/ϊ/g;
|
826 |
+
$s =~ s/Ϋ/ϋ/g;
|
827 |
+
$s =~ s/Ά/ά/g;
|
828 |
+
$s =~ s/Έ/έ/g;
|
829 |
+
$s =~ s/Ή/ή/g;
|
830 |
+
$s =~ s/Ί/ί/g;
|
831 |
+
$s =~ s/Ό/ό/g;
|
832 |
+
$s =~ s/Ύ/ύ/g;
|
833 |
+
$s =~ s/Ώ/ώ/g;
|
834 |
+
}
|
835 |
+
# Cyrillic letters
|
836 |
+
if ($s =~ /\xD0[\x80-\xAF]/) {
|
837 |
+
$s =~ s/А/а/g;
|
838 |
+
$s =~ s/Б/б/g;
|
839 |
+
$s =~ s/В/в/g;
|
840 |
+
$s =~ s/Г/г/g;
|
841 |
+
$s =~ s/Д/д/g;
|
842 |
+
$s =~ s/Е/е/g;
|
843 |
+
$s =~ s/Ж/ж/g;
|
844 |
+
$s =~ s/З/з/g;
|
845 |
+
$s =~ s/И/и/g;
|
846 |
+
$s =~ s/Й/й/g;
|
847 |
+
$s =~ s/К/к/g;
|
848 |
+
$s =~ s/Л/л/g;
|
849 |
+
$s =~ s/М/м/g;
|
850 |
+
$s =~ s/Н/н/g;
|
851 |
+
$s =~ s/О/о/g;
|
852 |
+
$s =~ s/П/п/g;
|
853 |
+
$s =~ s/Р/р/g;
|
854 |
+
$s =~ s/С/с/g;
|
855 |
+
$s =~ s/Т/т/g;
|
856 |
+
$s =~ s/У/у/g;
|
857 |
+
$s =~ s/Ф/ф/g;
|
858 |
+
$s =~ s/Х/х/g;
|
859 |
+
$s =~ s/Ц/ц/g;
|
860 |
+
$s =~ s/Ч/ч/g;
|
861 |
+
$s =~ s/Ш/ш/g;
|
862 |
+
$s =~ s/Щ/щ/g;
|
863 |
+
$s =~ s/Ъ/ъ/g;
|
864 |
+
$s =~ s/Ы/ы/g;
|
865 |
+
$s =~ s/Ь/ь/g;
|
866 |
+
$s =~ s/Э/э/g;
|
867 |
+
$s =~ s/Ю/ю/g;
|
868 |
+
$s =~ s/Я/я/g;
|
869 |
+
$s =~ s/Ѐ/ѐ/g;
|
870 |
+
$s =~ s/Ё/ё/g;
|
871 |
+
$s =~ s/Ђ/ђ/g;
|
872 |
+
$s =~ s/Ѓ/ѓ/g;
|
873 |
+
$s =~ s/Є/є/g;
|
874 |
+
$s =~ s/Ѕ/ѕ/g;
|
875 |
+
$s =~ s/І/і/g;
|
876 |
+
$s =~ s/Ї/ї/g;
|
877 |
+
$s =~ s/Ј/ј/g;
|
878 |
+
$s =~ s/Љ/љ/g;
|
879 |
+
$s =~ s/Њ/њ/g;
|
880 |
+
$s =~ s/Ћ/ћ/g;
|
881 |
+
$s =~ s/Ќ/ќ/g;
|
882 |
+
$s =~ s/Ѝ/ѝ/g;
|
883 |
+
$s =~ s/Ў/ў/g;
|
884 |
+
$s =~ s/Џ/џ/g;
|
885 |
+
}
|
886 |
+
# Fullwidth A-Z
|
887 |
+
if ($s =~ /\xEF\xBC[\xA1-\xBA]/) {
|
888 |
+
$s =~ s/Ａ/ａ/g;
|
889 |
+
$s =~ s/Ｂ/ｂ/g;
|
890 |
+
$s =~ s/Ｃ/ｃ/g;
|
891 |
+
$s =~ s/Ｄ/ｄ/g;
|
892 |
+
$s =~ s/Ｅ/ｅ/g;
|
893 |
+
$s =~ s/Ｆ/ｆ/g;
|
894 |
+
$s =~ s/Ｇ/ｇ/g;
|
895 |
+
$s =~ s/Ｈ/ｈ/g;
|
896 |
+
$s =~ s/Ｉ/ｉ/g;
|
897 |
+
$s =~ s/Ｊ/ｊ/g;
|
898 |
+
$s =~ s/Ｋ/ｋ/g;
|
899 |
+
$s =~ s/Ｌ/ｌ/g;
|
900 |
+
$s =~ s/Ｍ/ｍ/g;
|
901 |
+
$s =~ s/Ｎ/ｎ/g;
|
902 |
+
$s =~ s/Ｏ/ｏ/g;
|
903 |
+
$s =~ s/Ｐ/ｐ/g;
|
904 |
+
$s =~ s/Ｑ/ｑ/g;
|
905 |
+
$s =~ s/Ｒ/ｒ/g;
|
906 |
+
$s =~ s/Ｓ/ｓ/g;
|
907 |
+
$s =~ s/Ｔ/ｔ/g;
|
908 |
+
$s =~ s/Ｕ/ｕ/g;
|
909 |
+
$s =~ s/Ｖ/ｖ/g;
|
910 |
+
$s =~ s/Ｗ/ｗ/g;
|
911 |
+
$s =~ s/Ｘ/ｘ/g;
|
912 |
+
$s =~ s/Ｙ/ｙ/g;
|
913 |
+
$s =~ s/Ｚ/ｚ/g;
|
914 |
+
}
|
915 |
+
|
916 |
+
return $s;
|
917 |
+
}
|
918 |
+
|
919 |
+
sub extended_upper_case {
|
920 |
+
local($caller, $s) = @_;
|
921 |
+
|
922 |
+
$s =~ tr/a-z/A-Z/;
|
923 |
+
return $s unless $s =~ /[\xC3-\xC5][\x80-\xBF]/;
|
924 |
+
|
925 |
+
$s =~ s/\xC3\xA0/\xC3\x80/g;
|
926 |
+
$s =~ s/\xC3\xA1/\xC3\x81/g;
|
927 |
+
$s =~ s/\xC3\xA2/\xC3\x82/g;
|
928 |
+
$s =~ s/\xC3\xA3/\xC3\x83/g;
|
929 |
+
$s =~ s/\xC3\xA4/\xC3\x84/g;
|
930 |
+
$s =~ s/\xC3\xA5/\xC3\x85/g;
|
931 |
+
$s =~ s/\xC3\xA6/\xC3\x86/g;
|
932 |
+
$s =~ s/\xC3\xA7/\xC3\x87/g;
|
933 |
+
$s =~ s/\xC3\xA8/\xC3\x88/g;
|
934 |
+
$s =~ s/\xC3\xA9/\xC3\x89/g;
|
935 |
+
$s =~ s/\xC3\xAA/\xC3\x8A/g;
|
936 |
+
$s =~ s/\xC3\xAB/\xC3\x8B/g;
|
937 |
+
$s =~ s/\xC3\xAC/\xC3\x8C/g;
|
938 |
+
$s =~ s/\xC3\xAD/\xC3\x8D/g;
|
939 |
+
$s =~ s/\xC3\xAE/\xC3\x8E/g;
|
940 |
+
$s =~ s/\xC3\xAF/\xC3\x8F/g;
|
941 |
+
$s =~ s/\xC3\xB0/\xC3\x90/g;
|
942 |
+
$s =~ s/\xC3\xB1/\xC3\x91/g;
|
943 |
+
$s =~ s/\xC3\xB2/\xC3\x92/g;
|
944 |
+
$s =~ s/\xC3\xB3/\xC3\x93/g;
|
945 |
+
$s =~ s/\xC3\xB4/\xC3\x94/g;
|
946 |
+
$s =~ s/\xC3\xB5/\xC3\x95/g;
|
947 |
+
$s =~ s/\xC3\xB6/\xC3\x96/g;
|
948 |
+
$s =~ s/\xC3\xB8/\xC3\x98/g;
|
949 |
+
$s =~ s/\xC3\xB9/\xC3\x99/g;
|
950 |
+
$s =~ s/\xC3\xBA/\xC3\x9A/g;
|
951 |
+
$s =~ s/\xC3\xBB/\xC3\x9B/g;
|
952 |
+
$s =~ s/\xC3\xBC/\xC3\x9C/g;
|
953 |
+
$s =~ s/\xC3\xBD/\xC3\x9D/g;
|
954 |
+
$s =~ s/\xC3\xBE/\xC3\x9E/g;
|
955 |
+
|
956 |
+
$s =~ s/\xC5\x91/\xC5\x90/g;
|
957 |
+
$s =~ s/\xC5\xA1/\xC5\xA0/g;
|
958 |
+
$s =~ s/\xC5\xB1/\xC5\xB0/g;
|
959 |
+
return $s unless $s =~ /[\xC3-\xC5][\x80-\xBF]/;
|
960 |
+
|
961 |
+
return $s;
|
962 |
+
}
|
963 |
+
|
964 |
+
sub extended_first_upper_case {
|
965 |
+
local($caller, $s) = @_;
|
966 |
+
|
967 |
+
if (($first_char, $rest) = ($s =~ /^([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF][\x80-\xBF])(.*)$/)) {
|
968 |
+
return $caller->extended_upper_case($first_char) . $rest;
|
969 |
+
} else {
|
970 |
+
return $s;
|
971 |
+
}
|
972 |
+
}
|
973 |
+
|
974 |
+
sub repair_doubly_converted_utf8_strings {
|
975 |
+
local($caller, $s) = @_;
|
976 |
+
|
977 |
+
if ($s =~ /\xC3[\x82-\x85]\xC2[\x80-\xBF]/) {
|
978 |
+
$s =~ s/\xC3\x82\xC2([\x80-\xBF])/\xC2$1/g;
|
979 |
+
$s =~ s/\xC3\x83\xC2([\x80-\xBF])/\xC3$1/g;
|
980 |
+
$s =~ s/\xC3\x84\xC2([\x80-\xBF])/\xC4$1/g;
|
981 |
+
$s =~ s/\xC3\x85\xC2([\x80-\xBF])/\xC5$1/g;
|
982 |
+
}
|
983 |
+
return $s;
|
984 |
+
}
|
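To see what `repair_doubly_converted_utf8_strings` undoes, the sketch below first produces a doubly converted string by treating UTF-8 bytes as Latin-1 and encoding them again, then reverses that step for lead bytes `\xC2` to `\xC5`, the same range the sub above handles. Both helper names are hypothetical, and the single regex is a compact stand-in for the four substitutions in the original, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: encode every byte >= 0x80 as a 2-byte UTF-8 sequence,
# i.e. what a Latin-1-to-UTF-8 converter does to each high byte.
sub latin1_bytes_to_utf8 {
    my ($s) = @_;
    $s =~ s/([\x80-\xFF])/chr(0xC0 | (ord($1) >> 6)) . chr(0x80 | (ord($1) & 0x3F))/ge;
    return $s;
}

# Hypothetical helper: collapse a doubly converted 2-byte character back to
# its original form. A lead byte 0xC2..0xC5 shows up as \xC3 plus 0x82..0x85,
# and its continuation byte as \xC2 plus the unchanged byte.
sub repair_double_encoding {
    my ($s) = @_;
    $s =~ s/\xC3([\x82-\x85])\xC2([\x80-\xBF])/chr(ord($1) + 0x40) . $2/ge;
    return $s;
}

my $once     = "caf\xC3\xA9";                 # "café" as UTF-8 bytes
my $twice    = latin1_bytes_to_utf8($once);   # accidentally converted a second time
my $repaired = repair_double_encoding($twice);
print(($repaired eq $once) ? "repaired\n" : "not repaired\n");
```

Restricting the repair to lead bytes `\xC2` to `\xC5` keeps it conservative: it only touches byte patterns that are very unlikely to occur in correctly encoded text.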
985 |
+
|
986 |
+
sub repair_misconverted_windows_to_utf8_strings {
|
987 |
+
local($caller, $s) = @_;
|
988 |
+
|
989 |
+
# correcting conversions of UTF8 using Latin1-to-UTF converter
|
990 |
+
if ($s =~ /\xC3\xA2\xC2\x80\xC2[\x90-\xEF]/) {
|
991 |
+
my $result = "";
|
992 |
+
while (($pre,$last_c,$post) = ($s =~ /^(.*?)\xC3\xA2\xC2\x80\xC2([\x90-\xEF])(.*)$/s)) {
|
993 |
+
$result .= "$pre\xE2\x80$last_c";
|
994 |
+
$s = $post;
|
995 |
+
}
|
996 |
+
$result .= $s;
|
997 |
+
$s = $result;
|
998 |
+
}
|
999 |
+
# correcting conversions of Windows1252-to-UTF8 using Latin1-to-UTF converter
|
1000 |
+
if ($s =~ /\xC2[\x80-\x9F]/) {
|
1001 |
+
my $result = "";
|
1002 |
+
while (($pre,$c_windows,$post) = ($s =~ /^(.*?)\xC2([\x80-\x9F])(.*)$/s)) {
|
1003 |
+
$c_utf8 = $caller->windows1252_to_utf8($c_windows, 0);
|
1004 |
+
$result .= ($c_utf8 eq "?") ? ($pre . "\xC2" . $c_windows) : "$pre$c_utf8";
|
1005 |
+
$s = $post;
|
1006 |
+
}
|
1007 |
+
$result .= $s;
|
1008 |
+
$s = $result;
|
1009 |
+
}
|
1010 |
+
if ($s =~ /\xC3/) {
|
1011 |
+
$s =~ s/\xC3\xA2\xE2\x80\x9A\xC2\xAC/\xE2\x82\xAC/g; # x80 -> Euro sign
|
1012 |
+
# x81 codepoint undefined in Windows 1252
|
1013 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC5\xA1/\xE2\x80\x9A/g; # x82 -> single low-9 quotation mark
|
1014 |
+
$s =~ s/\xC3\x86\xE2\x80\x99/\xC6\x92/g; # x83 -> Latin small letter f with hook
|
1015 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC5\xBE/\xE2\x80\x9E/g; # x84 -> double low-9 quotation mark
|
1016 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC2\xA6/\xE2\x80\xA6/g; # x85 -> horizontal ellipsis
|
1017 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC2\xA0/\xE2\x80\xA0/g; # x86 -> dagger
|
1018 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC2\xA1/\xE2\x80\xA1/g; # x87 -> double dagger
|
1019 |
+
$s =~ s/\xC3\x8B\xE2\x80\xA0/\xCB\x86/g; # x88 -> modifier letter circumflex accent
|
1020 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC2\xB0/\xE2\x80\xB0/g; # x89 -> per mille sign
|
1021 |
+
$s =~ s/\xC3\x85\xC2\xA0/\xC5\xA0/g; # x8A -> Latin capital letter S with caron
|
1022 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC2\xB9/\xE2\x80\xB9/g; # x8B -> single left-pointing angle quotation mark
|
1023 |
+
$s =~ s/\xC3\x85\xE2\x80\x99/\xC5\x92/g; # x8C -> Latin capital ligature OE
|
1024 |
+
# x8D codepoint undefined in Windows 1252
|
1025 |
+
$s =~ s/\xC3\x85\xC2\xBD/\xC5\xBD/g; # x8E -> Latin capital letter Z with caron
|
1026 |
+
# x8F codepoint undefined in Windows 1252
|
1027 |
+
# x90 codepoint undefined in Windows 1252
|
1028 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xCB\x9C/\xE2\x80\x98/g; # x91 a-circumflex+euro+small tilde -> left single quotation mark
|
1029 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2/\xE2\x80\x99/g; # x92 a-circumflex+euro+trademark -> right single quotation mark
|
1030 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC5\x93/\xE2\x80\x9C/g; # x93 a-circumflex+euro+Latin small ligature oe -> left double quotation mark
|
1031 |
+
# x94 maps through undefined intermediate code point
|
1032 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC2\xA2/\xE2\x80\xA2/g; # x95 a-circumflex+euro+cent sign -> bullet
|
1033 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xE2\x80\x9C/\xE2\x80\x93/g; # x96 a-circumflex+euro+left double quotation mark -> en dash
|
1034 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xE2\x80\x9D/\xE2\x80\x94/g; # x97 a-circumflex+euro+right double quotation mark -> em dash
|
1035 |
+
$s =~ s/\xC3\x8B\xC5\x93/\xCB\x9C/g; # x98 Latin capital e diaeresis+Latin small ligature oe -> small tilde
|
1036 |
+
$s =~ s/\xC3\xA2\xE2\x80\x9E\xC2\xA2/\xE2\x84\xA2/g; # x99 -> trade mark sign
|
1037 |
+
$s =~ s/\xC3\x85\xC2\xA1/\xC5\xA1/g; # x9A -> Latin small letter s with caron
|
1038 |
+
$s =~ s/\xC3\xA2\xE2\x82\xAC\xC2\xBA/\xE2\x80\xBA/g; # x9B -> single right-pointing angle quotation mark
|
1039 |
+
$s =~ s/\xC3\x85\xE2\x80\x9C/\xC5\x93/g; # x9C -> Latin small ligature oe
|
1040 |
+
# x9D codepoint undefined in Windows 1252
|
1041 |
+
$s =~ s/\xC3\x85\xC2\xBE/\xC5\xBE/g; # x9E -> Latin small letter z with caron
|
1042 |
+
$s =~ s/\xC3\x85\xC2\xB8/\xC5\xB8/g; # x9F -> Latin capital letter Y with diaeresis
|
1043 |
+
$s =~ s/\xC3\xAF\xC2\xBF\xC2\xBD/\xEF\xBF\xBD/g; # replacement character
|
1044 |
+
}
|
1045 |
+
|
1046 |
+
return $s;
|
1047 |
+
}
|
1048 |
+
|
1049 |
+
sub latin1_to_utf {
|
1050 |
+
local($caller, $s) = @_;
|
1051 |
+
|
1052 |
+
my $result = "";
|
1053 |
+
while (($pre,$c,$post) = ($s =~ /^(.*?)([\x80-\xFF])(.*)$/s)) {
|
1054 |
+
$result .= $pre;
|
1055 |
+
if ($c =~ /^[\x80-\xBF]$/) {
|
1056 |
+
$result .= "\xC2$c";
|
1057 |
+
} elsif ($c =~ /^[\xC0-\xFF]$/) {
|
1058 |
+
$c =~ tr/[\xC0-\xFF]/[\x80-\xBF]/;
|
1059 |
+
$result .= "\xC3$c";
|
1060 |
+
}
|
1061 |
+
$s = $post;
|
1062 |
+
}
|
1063 |
+
$result .= $s;
|
1064 |
+
return $result;
|
1065 |
+
}
|
1066 |
+
|
1067 |
+
sub character_type_is_letter_type {
|
1068 |
+
local($caller, $char_type) = @_;
|
1069 |
+
|
1070 |
+
return ($char_type =~ /\b((CJK|hiragana|kana|katakana)\s+character|diacritic|letter|syllable)\b/);
|
1071 |
+
}
|
1072 |
+
|
1073 |
+
sub character_type {
|
1074 |
+
local($caller, $c) = @_;
|
1075 |
+
|
1076 |
+
if ($c =~ /^[\x00-\x7F]/) {
|
1077 |
+
return "XML tag" if $c =~ /^<.*>$/;
|
1078 |
+
return "ASCII Latin letter" if $c =~ /^[a-z]$/i;
|
1079 |
+
return "ASCII digit" if $c =~ /^[0-9]$/i;
|
1080 |
+
return "ASCII whitespace" if $c =~ /^[\x09-\x0D\x20]$/;
|
1081 |
+
return "ASCII control-character" if $c =~ /^[\x00-\x1F\x7F]$/;
|
1082 |
+
return "ASCII currency" if $c eq "\$";
|
1083 |
+
return "ASCII punctuation";
|
1084 |
+
} elsif ($c =~ /^[\xC0-\xDF]/) {
|
1085 |
+
return "non-UTF8 (invalid)" unless $c =~ /^[\xC0-\xDF][\x80-\xBF]$/;
|
1086 |
+
return "non-shortest-UTF8 (invalid)" if $c =~ /[\xC0-\xC1]/;
|
1087 |
+
return "non-ASCII control-character" if $c =~ /\xC2[\x80-\x9F]/;
|
1088 |
+
return "non-ASCII whitespace" if $c =~ /\xC2\xA0/;
|
1089 |
+
return "non-ASCII currency" if $c =~ /\xC2[\xA2-\xA5]/;
|
1090 |
+
return "fraction" if $c =~ /\xC2[\xBC-\xBE]/; # NEW
|
1091 |
+
return "superscript digit" if $c =~ /\xC2[\xB2\xB3\xB9]/;
|
1092 |
+
return "non-ASCII Latin letter" if $c =~ /\xC2\xB5/; # micro sign
|
1093 |
+
return "non-ASCII punctuation" if $c =~ /\xC2[\xA0-\xBF]/;
|
1094 |
+
return "non-ASCII punctuation" if $c =~ /\xC3[\x97\xB7]/;
|
1095 |
+
return "non-ASCII Latin letter" if $c =~ /\xC3[\x80-\xBF]/;
|
1096 |
+
return "Latin ligature letter" if $c =~ /\xC4[\xB2\xB3]/;
|
1097 |
+
return "Latin ligature letter" if $c =~ /\xC5[\x92\x93]/;
|
1098 |
+
return "non-ASCII Latin letter" if $c =~ /[\xC4-\xC8]/;
|
1099 |
+
return "non-ASCII Latin letter" if $c =~ /\xC9[\x80-\x8F]/;
|
1100 |
+
return "IPA" if $c =~ /\xC9[\x90-\xBF]/;
|
1101 |
+
return "IPA" if $c =~ /\xCA[\x80-\xBF]/;
|
1102 |
+
return "IPA" if $c =~ /\xCB[\x80-\xBF]/;
|
1103 |
+
return "combining-diacritic" if $c =~ /\xCC[\x80-\xBF]/;
|
1104 |
+
return "combining-diacritic" if $c =~ /\xCD[\x80-\xAF]/;
|
1105 |
+
return "Greek punctuation" if $c =~ /\xCD[\xBE]/; # Greek question mark
|
1106 |
+
return "Greek punctuation" if $c =~ /\xCE[\x87]/; # Greek semicolon
|
1107 |
+
return "Greek letter" if $c =~ /\xCD[\xB0-\xBF]/;
|
1108 |
+
return "Greek letter" if $c =~ /\xCE/;
|
1109 |
+
return "Greek letter" if $c =~ /\xCF[\x80-\xA1\xB3\xB7\xB8\xBA\xBB]/;
|
1110 |
+
return "Coptic letter" if $c =~ /\xCF[\xA2-\xAF]/;
|
1111 |
+
return "Cyrillic letter" if $c =~ /[\xD0-\xD3]/;
|
1112 |
+
return "Cyrillic letter" if $c =~ /\xD4[\x80-\xAF]/;
|
1113 |
+
return "Armenian punctuation" if $c =~ /\xD5[\x9A-\x9F]/;
|
1114 |
+
return "Armenian punctuation" if $c =~ /\xD6[\x89-\x8F]/;
|
1115 |
+
return "Armenian letter" if $c =~ /\xD4[\xB0-\xBF]/;
|
1116 |
+
return "Armenian letter" if $c =~ /\xD5/;
|
1117 |
+
return "Armenian letter" if $c =~ /\xD6[\x80-\x8F]/;
|
1118 |
+
return "Hebrew accent" if $c =~ /\xD6[\x91-\xAE]/;
|
1119 |
+
return "Hebrew punctuation" if $c =~ /\xD6\xBE/;
|
1120 |
+
return "Hebrew punctuation" if $c =~ /\xD7[\x80\x83\x86\xB3\xB4]/;
|
1121 |
+
return "Hebrew point" if $c =~ /\xD6[\xB0-\xBF]/;
|
1122 |
+
return "Hebrew point" if $c =~ /\xD7[\x81\x82\x87]/;
|
1123 |
+
return "Hebrew letter" if $c =~ /\xD7[\x90-\xB2]/;
|
1124 |
+
return "other Hebrew" if $c =~ /\xD6[\x90-\xBF]/;
|
1125 |
+
return "other Hebrew" if $c =~ /\xD7/;
|
1126 |
+
return "Arabic currency" if $c =~ /\xD8\x8B/; # Afghani sign
|
1127 |
+
return "Arabic punctuation" if $c =~ /\xD8[\x89-\x8D\x9B\x9E\x9F]/;
|
1128 |
+
return "Arabic punctuation" if $c =~ /\xD9[\xAA-\xAD]/;
|
1129 |
+
return "Arabic punctuation" if $c =~ /\xDB[\x94]/;
|
1130 |
+
return "Arabic tatweel" if $c =~ /\xD9\x80/;
|
1131 |
+
return "Arabic letter" if $c =~ /\xD8[\xA0-\xBF]/;
|
1132 |
+
return "Arabic letter" if $c =~ /\xD9[\x81-\x9F]/;
|
1133 |
+
return "Arabic letter" if $c =~ /\xD9[\xAE-\xBF]/;
|
1134 |
+
return "Arabic letter" if $c =~ /\xDA[\x80-\xBF]/;
|
1135 |
+
return "Arabic letter" if $c =~ /\xDB[\x80-\x95]/;
|
1136 |
+
return "Arabic Indic digit" if $c =~ /\xD9[\xA0-\xA9]/;
|
1137 |
+
return "Arabic Indic digit" if $c =~ /\xDB[\xB0-\xB9]/;
|
1138 |
+
return "other Arabic" if $c =~ /[\xD8-\xDB]/;
|
1139 |
+
return "Syriac punctuation" if $c =~ /\xDC[\x80-\x8F]/;
|
1140 |
+
return "Syriac letter" if $c =~ /\xDC[\x90-\xAF]/;
|
1141 |
+
return "Syriac diacritic" if $c =~ /\xDC[\xB0-\xBF]/;
|
1142 |
+
return "Syriac diacritic" if $c =~ /\xDD[\x80-\x8A]/;
|
1143 |
+
return "Thaana letter" if $c =~ /\xDE/;
|
1144 |
+
} elsif ($c =~ /^[\xE0-\xEF]/) {
|
1145 |
+
return "non-UTF8 (invalid)" unless $c =~ /^[\xE0-\xEF][\x80-\xBF]{2,2}$/;
|
1146 |
+
return "non-shortest-UTF8 (invalid)" if $c =~ /\xE0[\x80-\x9F]/;
|
1147 |
+
return "Arabic letter" if $c =~ /\xE0\xA2[\xA0-\xBF]/; # extended letters
|
1148 |
+
return "other Arabic" if $c =~ /\xE0\xA3/; # extended characters
|
1149 |
+
return "Devanagari punctuation" if $c =~ /\xE0\xA5[\xA4\xA5]/; # danda, double danda
|
1150 |
+
return "Devanagari digit" if $c =~ /\xE0\xA5[\xA6-\xAF]/;
|
1151 |
+
return "Devanagari letter" if $c =~ /\xE0[\xA4-\xA5]/;
|
1152 |
+
return "Bengali digit" if $c =~ /\xE0\xA7[\xA6-\xAF]/;
|
1153 |
+
return "Bengali currency" if $c =~ /\xE0\xA7[\xB2-\xB9]/;
|
1154 |
+
return "Bengali letter" if $c =~ /\xE0[\xA6-\xA7]/;
|
1155 |
+
return "Gurmukhi digit" if $c =~ /\xE0\xA9[\xA6-\xAF]/;
|
1156 |
+
return "Gurmukhi letter" if $c =~ /\xE0[\xA8-\xA9]/;
|
1157 |
+
return "Gujarati digit" if $c =~ /\xE0\xAB[\xA6-\xAF]/;
|
1158 |
+
return "Gujarati letter" if $c =~ /\xE0[\xAA-\xAB]/;
|
1159 |
+
return "Oriya digit" if $c =~ /\xE0\xAD[\xA6-\xAF]/;
|
1160 |
+
return "Oriya fraction" if $c =~ /\xE0\xAD[\xB2-\xB7]/;
|
1161 |
+
return "Oriya letter" if $c =~ /\xE0[\xAC-\xAD]/;
|
1162 |
+
return "Tamil digit" if $c =~ /\xE0\xAF[\xA6-\xAF]/;
|
1163 |
+
return "Tamil number" if $c =~ /\xE0\xAF[\xB0-\xB2]/; # number (10, 100, 1000)
|
1164 |
+
return "Tamil letter" if $c =~ /\xE0[\xAE-\xAF]/;
|
1165 |
+
return "Telegu digit" if $c =~ /\xE0\xB1[\xA6-\xAF]/;
|
1166 |
+
return "Telegu fraction" if $c =~ /\xE0\xB1[\xB8-\xBE]/;
|
1167 |
+
return "Telegu letter" if $c =~ /\xE0[\xB0-\xB1]/;
|
1168 |
+
return "Kannada digit" if $c =~ /\xE0\xB3[\xA6-\xAF]/;
|
1169 |
+
return "Kannada letter" if $c =~ /\xE0[\xB2-\xB3]/;
|
1170 |
+
return "Malayalam digit" if $c =~ /\xE0\xB5[\x98-\x9E\xA6-\xB8]/;
|
1171 |
+
return "Malayalam punctuation" if $c =~ /\xE0\xB5\xB9/; # date mark
|
1172 |
+
return "Malayalam letter" if $c =~ /\xE0[\xB4-\xB5]/;
|
1173 |
+
return "Sinhala digit" if $c =~ /\xE0\xB7[\xA6-\xAF]/;
|
1174 |
+
return "Sinhala punctuation" if $c =~ /\xE0\xB7\xB4/;
|
1175 |
+
return "Sinhala letter" if $c =~ /\xE0[\xB6-\xB7]/;
|
1176 |
+
return "Thai currency" if $c =~ /\xE0\xB8\xBF/;
|
1177 |
+
return "Thai digit" if $c =~ /\xE0\xB9[\x90-\x99]/;
|
1178 |
+
return "Thai character" if $c =~ /\xE0[\xB8-\xB9]/;
|
1179 |
+
return "Lao punctuation" if $c =~ /\xE0\xBA\xAF/; # Lao ellipsis
|
1180 |
+
return "Lao digit" if $c =~ /\xE0\xBB[\x90-\x99]/;
|
1181 |
+
return "Lao character" if $c =~ /\xE0[\xBA-\xBB]/;
|
1182 |
+
return "Tibetan punctuation" if $c =~ /\xE0\xBC[\x81-\x94]/;
|
1183 |
+
return "Tibetan sign" if $c =~ /\xE0\xBC[\x95-\x9F]/;
|
1184 |
+
return "Tibetan digit" if $c =~ /\xE0\xBC[\xA0-\xB3]/;
|
1185 |
+
return "Tibetan punctuation" if $c =~ /\xE0\xBC[\xB4-\xBD]/;
|
1186 |
+
return "Tibetan letter" if $c =~ /\xE0[\xBC-\xBF]/;
|
1187 |
+
return "Myanmar digit" if $c =~ /\xE1\x81[\x80-\x89]/;
|
1188 |
+
return "Myanmar digit" if $c =~ /\xE1\x82[\x90-\x99]/; # Myanmar Shan digits
|
1189 |
+
return "Myanmar punctuation" if $c =~ /\xE1\x81[\x8A-\x8B]/;
|
1190 |
+
return "Myanmar letter" if $c =~ /\xE1[\x80-\x81]/;
|
1191 |
+
return "Myanmar letter" if $c =~ /\xE1\x82[\x80-\x9F]/;
|
1192 |
+
return "Georgian punctuation" if $c =~ /\xE1\x83\xBB/;
|
1193 |
+
return "Georgian letter" if $c =~ /\xE1\x82[\xA0-\xBF]/;
|
1194 |
+
return "Georgian letter" if $c =~ /\xE1\x83/;
|
1195 |
+
return "Georgian letter" if $c =~ /\xE1\xB2[\x90-\xBF]/; # Georgian Mtavruli capital letters
|
1196 |
+
return "Georgian letter" if $c =~ /\xE2\xB4[\x80-\xAF]/; # Georgian small letters (Khutsuri)
|
1197 |
+
return "Korean Hangul letter" if $c =~ /\xE1[\x84-\x87]/;
|
1198 |
+
return "Ethiopic punctuation" if $c =~ /\xE1\x8D[\xA0-\xA8]/;
|
1199 |
+
return "Ethiopic digit" if $c =~ /\xE1\x8D[\xA9-\xB1]/;
|
1200 |
+
return "Ethiopic number" if $c =~ /\xE1\x8D[\xB2-\xBC]/;
|
1201 |
+
return "Ethiopic syllable" if $c =~ /\xE1[\x88-\x8D]/;
|
1202 |
+
return "Cherokee letter" if $c =~ /\xE1\x8E[\xA0-\xBF]/;
|
1203 |
+
return "Cherokee letter" if $c =~ /\xE1\x8F/;
|
1204 |
+
return "Canadian punctuation" if $c =~ /\xE1\x90\x80/; # Canadian Syllabics hyphen
|
1205 |
+
return "Canadian punctuation" if $c =~ /\xE1\x99\xAE/; # Canadian Syllabics full stop
|
1206 |
+
return "Canadian syllable" if $c =~ /\xE1[\x90-\x99]/;
|
1207 |
+
return "Canadian syllable" if $c =~ /\xE1\xA2[\xB0-\xBF]/;
|
1208 |
+
return "Canadian syllable" if $c =~ /\xE1\xA3/;
|
1209 |
+
return "Ogham whitespace" if $c =~ /\xE1\x9A\x80/;
|
1210 |
+
return "Ogham letter" if $c =~ /\xE1\x9A[\x81-\x9A]/;
|
1211 |
+
return "Ogham punctuation" if $c =~ /\xE1\x9A[\x9B-\x9C]/;
|
1212 |
+
return "Runic punctuation" if $c =~ /\xE1\x9B[\xAB-\xAD]/;
|
1213 |
+
return "Runic letter" if $c =~ /\xE1\x9A[\xA0-\xBF]/;
|
1214 |
+
return "Runic letter" if $c =~ /\xE1\x9B/;
|
1215 |
+
return "Khmer currency" if $c =~ /\xE1\x9F\x9B/;
|
1216 |
+
return "Khmer digit" if $c =~ /\xE1\x9F[\xA0-\xA9]/;
|
1217 |
+
return "Khmer letter" if $c =~ /\xE1[\x9E-\x9F]/;
|
1218 |
+
return "Mongolian punctuation" if $c =~ /\xE1\xA0[\x80-\x8A]/;
|
1219 |
+
return "Mongolian digit" if $c =~ /\xE1\xA0[\x90-\x99]/;
|
1220 |
+
return "Mongolian letter" if $c =~ /\xE1[\xA0-\xA1]/;
|
1221 |
+
return "Mongolian letter" if $c =~ /\xE1\xA2[\x80-\xAF]/;
|
1222 |
+
return "Buginese letter" if $c =~ /\xE1\xA8[\x80-\x9B]/;
|
1223 |
+
return "Buginese punctuation" if $c =~ /\xE1\xA8[\x9E-\x9F]/;
|
1224 |
+
return "Balinese letter" if $c =~ /\xE1\xAC/;
|
1225 |
+
return "Balinese letter" if $c =~ /\xE1\xAD[\x80-\x8F]/;
|
1226 |
+
return "Balinese digit" if $c =~ /\xE1\xAD[\x90-\x99]/;
|
1227 |
+
return "Balinese punctuation" if $c =~ /\xE1\xAD[\x9A-\xA0]/;
|
1228 |
+
return "Balinese symbol" if $c =~ /\xE1\xAD[\xA1-\xBF]/;
|
1229 |
+
return "Sundanese digit" if $c =~ /\xE1\xAE[\xB0-\xB9]/;
|
1230 |
+
return "Sundanese letter" if $c =~ /\xE1\xAE/;
|
1231 |
+
return "Cyrillic letter" if $c =~ /\xE1\xB2[\x80-\x8F]/;
|
1232 |
+
return "Sundanese punctuation" if $c =~ /\xE1\xB3[\x80-\x8F]/;
|
1233 |
+
return "IPA" if $c =~ /\xE1[\xB4-\xB6]/;
|
1234 |
+
return "non-ASCII Latin letter" if $c =~ /\xE1[\xB8-\xBB]/;
|
1235 |
+
return "Greek letter" if $c =~ /\xE1[\xBC-\xBF]/;
|
1236 |
+
return "non-ASCII whitespace" if $c =~ /\xE2\x80[\x80-\x8A\xAF]/;
|
1237 |
+
return "zero-width space" if $c =~ /\xE2\x80\x8B/;
|
1238 |
+
return "zero-width non-space" if $c =~ /\xE2\x80\x8C/;
|
1239 |
+
return "zero-width joiner" if $c =~ /\xE2\x80\x8D/;
|
1240 |
+
return "directional mark" if $c =~ /\xE2\x80[\x8E-\x8F\xAA-\xAE]/;
|
1241 |
+
return "non-ASCII punctuation" if $c =~ /\xE2\x80[\x90-\xBF]/;
|
1242 |
+
return "non-ASCII punctuation" if $c =~ /\xE2\x81[\x80-\x9E]/;
|
1243 |
+
return "superscript letter" if $c =~ /\xE2\x81[\xB1\xBF]/;
|
1244 |
+
return "superscript digit" if $c =~ /\xE2\x81[\xB0-\xB9]/;
|
1245 |
+
return "superscript punctuation" if $c =~ /\xE2\x81[\xBA-\xBE]/;
|
1246 |
+
return "subscript digit" if $c =~ /\xE2\x82[\x80-\x89]/;
|
1247 |
+
return "subscript punctuation" if $c =~ /\xE2\x82[\x8A-\x8E]/;
|
1248 |
+
return "non-ASCII currency" if $c =~ /\xE2\x82[\xA0-\xBF]/;
|
1249 |
+
return "letterlike symbol" if $c =~ /\xE2\x84/;
|
1250 |
+
return "letterlike symbol" if $c =~ /\xE2\x85[\x80-\x8F]/;
|
1251 |
+
return "fraction" if $c =~ /\xE2\x85[\x90-\x9E]/; # NEW
|
1252 |
+
return "Roman number" if $c =~ /\xE2\x85[\xA0-\xBF]/; # NEW
|
1253 |
+
return "arrow symbol" if $c =~ /\xE2\x86[\x90-\xBF]/;
|
1254 |
+
return "arrow symbol" if $c =~ /\xE2\x87/;
|
1255 |
+
return "mathematical operator" if $c =~ /\xE2[\x88-\x8B]/;
|
1256 |
+
return "technical symbol" if $c =~ /\xE2[\x8C-\x8F]/;
|
1257 |
+
return "enclosed alphanumeric" if $c =~ /\xE2\x91[\xA0-\xBF]/;
|
1258 |
+
return "enclosed alphanumeric" if $c =~ /\xE2[\x92-\x93]/;
|
1259 |
+
return "box drawing" if $c =~ /\xE2[\x94-\x95]/;
|
1260 |
+
return "geometric shape" if $c =~ /\xE2\x96[\xA0-\xBF]/;
|
1261 |
+
return "geometric shape" if $c =~ /\xE2\x97/;
|
1262 |
+
return "pictograph" if $c =~ /\xE2[\x98-\x9E]/;
|
1263 |
+
return "arrow symbol" if $c =~ /\xE2\xAC[\x80-\x91\xB0-\xBF]/;
|
1264 |
+
return "geometric shape" if $c =~ /\xE2\xAC[\x92-\xAF]/;
|
1265 |
+
return "arrow symbol" if $c =~ /\xE2\xAD[\x80-\x8F\x9A-\xBF]/;
|
1266 |
+
return "geometric shape" if $c =~ /\xE2\xAD[\x90-\x99]/;
|
1267 |
+
return "arrow symbol" if $c =~ /\xE2\xAE[\x80-\xB9]/;
|
1268 |
+
return "geometric shape" if $c =~ /\xE2\xAE[\xBA-\xBF]/;
|
1269 |
+
return "geometric shape" if $c =~ /\xE2\xAF[\x80-\x88\x8A-\x8F]/;
|
1270 |
+
return "symbol" if $c =~ /\xE2[\xAC-\xAF]/;
|
1271 |
+
return "Coptic fraction" if $c =~ /\xE2\xB3\xBD/;
|
1272 |
+
return "Coptic punctuation" if $c =~ /\xE2\xB3[\xB9-\xBF]/;
|
1273 |
+
return "Coptic letter" if $c =~ /\xE2[\xB2-\xB3]/;
|
1274 |
+
return "Georgian letter" if $c =~ /\xE2\xB4[\x80-\xAF]/;
|
1275 |
+
return "Tifinagh punctuation" if $c =~ /\xE2\xB5\xB0/;
|
1276 |
+
return "Tifinagh letter" if $c =~ /\xE2\xB4[\xB0-\xBF]/;
|
1277 |
+
return "Tifinagh letter" if $c =~ /\xE2\xB5/;
|
1278 |
+
return "Ethiopic syllable" if $c =~ /\xE2\xB6/;
|
1279 |
+
return "Ethiopic syllable" if $c =~ /\xE2\xB7[\x80-\x9F]/;
|
1280 |
+
return "non-ASCII punctuation" if $c =~ /\xE3\x80[\x80-\x91\x94-\x9F\xB0\xBB-\xBD]/;
|
1281 |
+
return "symbol" if $c =~ /\xE3\x80[\x92\x93\xA0\xB6\xB7]/;
|
1282 |
+
return "Japanese hiragana character" if $c =~ /\xE3\x81/;
|
1283 |
+
return "Japanese hiragana character" if $c =~ /\xE3\x82[\x80-\x9F]/;
|
1284 |
+
return "Japanese katakana character" if $c =~ /\xE3\x82[\xA0-\xBF]/;
|
1285 |
+
return "Japanese katakana character" if $c =~ /\xE3\x83/;
|
1286 |
+
return "Bopomofo letter" if $c =~ /\xE3\x84[\x80-\xAF]/;
|
1287 |
+
return "Korean Hangul letter" if $c =~ /\xE3\x84[\xB0-\xBF]/;
|
1288 |
+
return "Korean Hangul letter" if $c =~ /\xE3\x85/;
|
1289 |
+
return "Korean Hangul letter" if $c =~ /\xE3\x86[\x80-\x8F]/;
|
1290 |
+
return "Bopomofo letter" if $c =~ /\xE3\x86[\xA0-\xBF]/;
|
1291 |
+
return "CJK stroke" if $c =~ /\xE3\x87[\x80-\xAF]/;
|
1292 |
+
return "Japanese kana character" if $c =~ /\xE3\x87[\xB0-\xBF]/;
|
1293 |
+
return "CJK symbol" if $c =~ /\xE3[\x88-\x8B]/;
|
1294 |
+
return "CJK square Latin abbreviation" if $c =~ /\xE3\x8D[\xB1-\xBA]/;
|
1295 |
+
return "CJK square Latin abbreviation" if $c =~ /\xE3\x8E/;
|
1296 |
+
return "CJK square Latin abbreviation" if $c =~ /\xE3\x8F[\x80-\x9F\xBF]/;
|
1297 |
+
return "CJK character" if $c =~ /\xE4[\xB8-\xBF]/;
|
1298 |
+
return "CJK character" if $c =~ /[\xE5-\xE9]/;
|
1299 |
+
return "Yi syllable" if $c =~ /\xEA[\x80-\x92]/;
|
1300 |
+
return "Lisu letter" if $c =~ /\xEA\x93[\x90-\xBD]/;
|
1301 |
+
return "Lisu punctuation" if $c =~ /\xEA\x93[\xBE-\xBF]/;
|
1302 |
+
return "Cyrillic letter" if $c =~ /\xEA\x99/;
|
1303 |
+
return "Cyrillic letter" if $c =~ /\xEA\x9A[\x80-\x9F]/;
|
1304 |
+
return "modifier tone" if $c =~ /\xEA\x9C[\x80-\xA1]/;
|
1305 |
+
return "Javanese punctuation" if $c =~ /\xEA\xA7[\x81-\x8D\x9E-\x9F]/;
|
1306 |
+
return "Javanese digit" if $c =~ /\xEA\xA7[\x90-\x99]/;
|
1307 |
+
return "Javanese letter" if $c =~ /\xEA\xA6/;
|
1308 |
+
return "Javanese letter" if $c =~ /\xEA\xA7[\x80-\x9F]/;
|
1309 |
+
return "Ethiopic syllable" if $c =~ /\xEA\xAC[\x80-\xAF]/;
|
1310 |
+
return "Cherokee letter" if $c =~ /\xEA\xAD[\xB0-\xBF]/;
|
1311 |
+
return "Cherokee letter" if $c =~ /\xEA\xAE/;
|
1312 |
+
return "Meetai Mayek digit" if $c =~ /\xEA\xAF[\xB0-\xB9]/;
|
1313 |
+
return "Meetai Mayek letter" if $c =~ /\xEA\xAF/;
|
1314 |
+
return "Korean Hangul syllable" if $c =~ /\xEA[\xB0-\xBF]/;
|
1315 |
+
return "Korean Hangul syllable" if $c =~ /[\xEB-\xEC]/;
|
1316 |
+
return "Korean Hangul syllable" if $c =~ /\xED[\x80-\x9E]/;
|
1317 |
+
return "Klingon letter" if $c =~ /\xEF\xA3[\x90-\xA9]/;
|
1318 |
+
return "Klingon digit" if $c =~ /\xEF\xA3[\xB0-\xB9]/;
|
1319 |
+
return "Klingon punctuation" if $c =~ /\xEF\xA3[\xBD-\xBE]/;
|
1320 |
+
return "Klingon symbol" if $c =~ /\xEF\xA3\xBF/;
|
1321 |
+
return "private use character" if $c =~ /\xEE/;
|
1322 |
+
return "Latin typographic ligature" if $c =~ /\xEF\xAC[\x80-\x86]/;
|
1323 |
+
return "Hebrew presentation letter" if $c =~ /\xEF\xAC[\x9D-\xBF]/;
|
1324 |
+
return "Hebrew presentation letter" if $c =~ /\xEF\xAD[\x80-\x8F]/;
|
1325 |
+
return "Arabic presentation letter" if $c =~ /\xEF\xAD[\x90-\xBF]/;
|
1326 |
+
return "Arabic presentation letter" if $c =~ /\xEF[\xAE-\xB7]/;
|
1327 |
+
return "non-ASCII punctuation" if $c =~ /\xEF\xB8[\x90-\x99]/;
|
1328 |
+
return "non-ASCII punctuation" if $c =~ /\xEF\xB8[\xB0-\xBF]/;
|
1329 |
+
return "non-ASCII punctuation" if $c =~ /\xEF\xB9[\x80-\xAB]/;
|
1330 |
+
return "Arabic presentation letter" if $c =~ /\xEF\xB9[\xB0-\xBF]/;
|
1331 |
+
return "Arabic presentation letter" if $c =~ /\xEF\xBA/;
|
1332 |
+
return "Arabic presentation letter" if $c =~ /\xEF\xBB[\x80-\xBC]/;
|
1333 |
+
return "byte-order mark/zero-width no-break space" if $c eq "\xEF\xBB\xBF";
|
1334 |
+
return "fullwidth currency" if $c =~ /\xEF\xBC\x84/;
|
1335 |
+
return "fullwidth digit" if $c =~ /\xEF\xBC[\x90-\x99]/;
|
1336 |
+
return "fullwidth Latin letter" if $c =~ /\xEF\xBC[\xA1-\xBA]/;
|
1337 |
+
return "fullwidth Latin letter" if $c =~ /\xEF\xBD[\x81-\x9A]/;
|
1338 |
+
return "fullwidth punctuation" if $c =~ /\xEF\xBC/;
|
1339 |
+
return "fullwidth punctuation" if $c =~ /\xEF\xBD[\x9B-\xA0]/;
|
1340 |
+
return "halfwidth Japanese punctuation" if $c =~ /\xEF\xBD[\xA1-\xA4]/;
|
1341 |
+
return "halfwidth Japanese katakana character" if $c =~ /\xEF\xBD[\xA5-\xBF]/;
|
1342 |
+
return "halfwidth Japanese katakana character" if $c =~ /\xEF\xBE[\x80-\x9F]/;
|
1343 |
+
return "fullwidth currency" if $c =~ /\xEF\xBF[\xA0-\xA6]/;
|
1344 |
+
return "replacement character" if $c eq "\xEF\xBF\xBD";
|
1345 |
+
} elsif ($c =~ /[\xF0-\xF7]/) {
|
1346 |
+
return "non-UTF8 (invalid)" unless $c =~ /[\xF0-\xF7][\x80-\xBF]{3,3}$/;
|
1347 |
+
return "non-shortest-UTF8 (invalid)" if $c =~ /\xF0[\x80-\x8F]/;
|
1348 |
+
return "Linear B syllable" if $c =~ /\xF0\x90\x80/;
|
1349 |
+
return "Linear B syllable" if $c =~ /\xF0\x90\x81[\x80-\x8F]/;
|
1350 |
+
return "Linear B symbol" if $c =~ /\xF0\x90\x81[\x90-\x9F]/;
|
1351 |
+
return "Linear B ideogram" if $c =~ /\xF0\x90[\x82-\x83]/;
|
1352 |
+
return "Gothic letter" if $c =~ /\xF0\x90\x8C[\xB0-\xBF]/;
|
1353 |
+
return "Gothic letter" if $c =~ /\xF0\x90\x8D[\x80-\x8F]/;
|
1354 |
+
return "Phoenician letter" if $c =~ /\xF0\x90\xA4[\x80-\x95]/;
|
1355 |
+
return "Phoenician number" if $c =~ /\xF0\x90\xA4[\x96-\x9B]/;
|
1356 |
+
return "Phoenician punctuation" if $c =~ /\xF0\x90\xA4\x9F/; # word separator
|
1357 |
+
return "Old Hungarian number" if $c =~ /\xF0\x90\xB3[\xBA-\xBF]/;
|
1358 |
+
return "Old Hungarian letter" if $c =~ /\xF0\x90[\xB2-\xB3]/;
|
1359 |
+
return "Cuneiform digit" if $c =~ /\xF0\x92\x90/; # numeric sign
|
1360 |
+
return "Cuneiform digit" if $c =~ /\xF0\x92\x91[\x80-\xAF]/; # numeric sign
|
1361 |
+
return "Cuneiform punctuation" if $c =~ /\xF0\x92\x91[\xB0-\xBF]/;
|
1362 |
+
return "Cuneiform sign" if $c =~ /\xF0\x92[\x80-\x95]/;
|
1363 |
+
return "Egyptian hieroglyph number" if $c =~ /\xF0\x93\x81\xA8/;
|
1364 |
+
return "Egyptian hieroglyph number" if $c =~ /\xF0\x93\x82[\xAD-\xB6]/;
|
1365 |
+
return "Egyptian hieroglyph number" if $c =~ /\xF0\x93\x86[\x90\xBC-\xBF]/;
|
1366 |
+
return "Egyptian hieroglyph number" if $c =~ /\xF0\x93\x87[\x80-\x84]/;
|
1367 |
+
return "Egyptian hieroglyph number" if $c =~ /\xF0\x93\x8D[\xA2-\xAB]/;
|
1368 |
+
return "Egyptian hieroglyph number" if $c =~ /\xF0\x93\x8E[\x86-\x92]/;
|
1369 |
+
return "Egyptian hieroglyph number" if $c =~ /\xF0\x93\x8F[\xBA-\xBF]/;
|
1370 |
+
return "Egyptian hieroglyph number" if $c =~ /\xF0\x93\x90[\x80-\x83]/;
|
1371 |
+
return "Egyptian hieroglyph" if $c =~ /\xF0\x93[\x80-\x90]/;
|
1372 |
+
return "enclosed alphanumeric" if $c =~ /\xF0\x9F[\x84-\x87]/;
|
1373 |
+
return "Mahjong symbol" if $c =~ /\xF0\x9F\x80[\x80-\xAF]/;
|
1374 |
+
return "Domino symbol" if $c =~ /\xF0\x9F\x80[\xB0-\xBF]/;
|
1375 |
+
return "Domino symbol" if $c =~ /\xF0\x9F\x81/;
|
1376 |
+
return "Domino symbol" if $c =~ /\xF0\x9F\x82[\x80-\x9F]/;
|
1377 |
+
return "Playing card symbol" if $c =~ /\xF0\x9F\x82[\xA0-\xBF]/;
|
1378 |
+
return "Playing card symbol" if $c =~ /\xF0\x9F\x83/;
|
1379 |
+
return "CJK symbol" if $c =~ /\xF0\x9F[\x88-\x8B]/;
|
1380 |
+
return "pictograph" if $c =~ /\xF0\x9F[\x8C-\x9B]/;
|
1381 |
+
return "geometric shape" if $c =~ /\xF0\x9F[\x9E-\x9F]/;
|
1382 |
+
return "non-ASCII punctuation" if $c =~ /\xF0\x9F[\xA0-\xA3]/;
|
1383 |
+
return "pictograph" if $c =~ /\xF0\x9F[\xA4-\xAB]/;
|
1384 |
+
return "CJK character" if $c =~ /\xF0[\xA0-\xAF]/;
|
1385 |
+
return "tag" if $c =~ /\xF3\xA0[\x80-\x81]/;
|
1386 |
+
return "variation selector" if $c =~ /\xF3\xA0[\x84-\x87]/;
|
1387 |
+
return "private use character" if $c =~ /\xF3[\xB0-\xBF]/;
|
1388 |
+
return "private use character" if $c =~ /\xF4[\x80-\x8F]/;
|
1389 |
+
# ...
|
1390 |
+
} elsif ($c =~ /[\xF8-\xFB]/) {
|
1391 |
+
return "non-UTF8 (invalid)" unless $c =~ /[\xF8-\xFB][\x80-\xBF]{4,4}$/;
|
1392 |
+
} elsif ($c =~ /[\xFC-\xFD]/) {
|
1393 |
+
return "non-UTF8 (invalid)" unless $c =~ /[\xFC-\xFD][\x80-\xBF]{5,5}$/;
|
1394 |
+
} elsif ($c =~ /\xFE/) {
|
1395 |
+
return "non-UTF8 (invalid)" unless $c =~ /\xFE[\x80-\xBF]{6,6}$/;
|
1396 |
+
} else {
|
1397 |
+
return "non-UTF8 (invalid)";
|
1398 |
+
}
|
1399 |
+
return "other character";
|
1400 |
+
}
|
1401 |
+
|
1402 |
+
1;
|
1403 |
+
|
1404 |
+
|
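The classification routine above branches on the lead byte and then checks that the right number of continuation bytes (`\x80`-`\xBF`) follows. The sketch below isolates that length check. It is a minimal illustration, not part of `NLP::UTF8`; the helper names `expected_utf8_length` and `is_well_formed_sequence` are hypothetical, and the sketch only covers sequences of up to four bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not in NLP::UTF8): expected byte length of a UTF-8
# sequence, judged from its lead byte alone. Returns 0 for an invalid lead.
sub expected_utf8_length {
    my ($c) = @_;
    return 0 if $c eq "";
    my $lead = ord(substr($c, 0, 1));
    return 1 if $lead <= 0x7F;    # ASCII
    return 0 if $lead <= 0xBF;    # continuation byte, cannot start a sequence
    return 2 if $lead <= 0xDF;
    return 3 if $lead <= 0xEF;
    return 4 if $lead <= 0xF7;
    return 0;                     # longer forms are not handled in this sketch
}

# Hypothetical helper: true if $c has exactly the expected length and every
# byte after the lead byte is a continuation byte.
sub is_well_formed_sequence {
    my ($c) = @_;
    my $n = expected_utf8_length($c);
    return 0 unless $n && length($c) == $n;
    return 1 if $n == 1;
    return (substr($c, 1) =~ /^[\x80-\xBF]+$/) ? 1 : 0;
}

print is_well_formed_sequence("\xCE\xB1") ? "ok\n" : "bad\n";   # two-byte Greek letter
print is_well_formed_sequence("\xE0\xA4") ? "ok\n" : "bad\n";   # truncated three-byte sequence
```

This sketch does not reject overlong encodings, which the routine above reports separately as "non-shortest-UTF8 (invalid)".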
uroman/lib/NLP/stringDistance.pm
ADDED
@@ -0,0 +1,724 @@
1 |
+
################################################################
|
2 |
+
# #
|
3 |
+
# stringDistance #
|
4 |
+
# #
|
5 |
+
################################################################
|
6 |
+
|
7 |
+
package NLP::stringDistance;
|
8 |
+
|
9 |
+
use List::Util qw(min max);
|
10 |
+
$utf8 = NLP::UTF8;
|
11 |
+
$util = NLP::utilities;
|
12 |
+
$romanizer = NLP::Romanizer;
|
13 |
+
|
14 |
+
%dummy_ht = ();
|
15 |
+
|
16 |
+
sub rule_string_expansion {
|
17 |
+
local($this, *ht, $s, $lang_code) = @_;
|
18 |
+
|
19 |
+
my @characters = $utf8->split_into_utf8_characters($s, "return only chars, return trailing whitespaces", *dummy_ht);
|
20 |
+
foreach $sub_len ((0 .. ($#characters-1))) {
|
21 |
+
my $sub = join("", @characters[0 .. $sub_len]);
|
22 |
+
foreach $super_len ((($sub_len + 1) .. $#characters)) {
|
23 |
+
my $super = join("", @characters[0 .. $super_len]);
|
24 |
+
# print STDERR " $sub -> $super\n" unless $ht{RULE_STRING_EXPANSION}->{$lang_code}->{$sub}->{$super};
|
25 |
+
$ht{RULE_STRING_EXPANSION}->{$lang_code}->{$sub}->{$super} = 1;
|
26 |
+
$ht{RULE_STRING_HAS_EXPANSION}->{$lang_code}->{$sub} = 1;
|
27 |
+
# print STDERR " RULE_STRING_HAS_EXPANSION $lang_code $sub\n";
|
28 |
+
}
|
29 |
+
}
|
30 |
+
}
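As a reading aid for `rule_string_expansion` above: for every proper prefix of a rule string it records each longer prefix as an "expansion". The stand-alone sketch below reproduces that double loop on a plain hash. The name `prefix_expansions` is hypothetical, and `split(//, ...)` is an assumption that stands in for `split_into_utf8_characters`, so it is only equivalent for single-byte characters.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the prefix bookkeeping:
# returns a hash ref of the form { prefix => { longer_prefix => 1 } }.
sub prefix_expansions {
    my ($s) = @_;
    my @chars = split(//, $s);    # assumption: one element per character
    my %expansion;
    for my $sub_len (0 .. $#chars - 1) {
        my $sub = join("", @chars[0 .. $sub_len]);
        for my $super_len ($sub_len + 1 .. $#chars) {
            my $super = join("", @chars[0 .. $super_len]);
            $expansion{$sub}{$super} = 1;
        }
    }
    return \%expansion;
}

my $e = prefix_expansions("abc");
for my $sub (sort keys %$e) {
    print "$sub -> ", join(", ", sort keys %{ $e->{$sub} }), "\n";
}
```

For the input `"abc"` the prefixes `a` and `ab` receive entries, while the full string `abc` does not, since it has no longer prefix.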
|
31 |
+
|
32 |
+
sub load_string_distance_data {
|
33 |
+
local($this, $filename, *ht, $verbose) = @_;
|
34 |
+
|
35 |
+
$verbose = 0 unless defined($verbose);
|
36 |
+
open(IN,$filename) || die "Could not open $filename";
|
37 |
+
my $line_number = 0;
|
38 |
+
my $n_cost_rules = 0;
|
39 |
+
while (<IN>) {
|
40 |
+
$line_number++;
|
41 |
+
my $line = $_;
|
42 |
+
$line =~ s/^\xEF\xBB\xBF//;
|
43 |
+
$line =~ s/\s*$//;
|
44 |
+
next if $line =~ /^\s*(\#.*)?$/;
|
45 |
+
print STDERR "** Warning: line $line_number contains suspicious control character: $line\n" if $line =~ /[\x00-\x1F]/;
|
46 |
+
my $s1 = $util->slot_value_in_double_colon_del_list($line, "s1");
|
47 |
+
my $s2 = $util->slot_value_in_double_colon_del_list($line, "s2");
|
48 |
+
$s1 = $util->dequote_string($s1); # 'can\'t' => can't
|
49 |
+
$s2 = $util->dequote_string($s2);
|
50 |
+
my $cost = $util->slot_value_in_double_colon_del_list($line, "cost");
|
51 |
+
if (($s1 eq "") && ($s2 eq "")) {
|
52 |
+
print STDERR "Ignoring bad line $line_number in $filename, because both s1 and s2 are empty strings\n";
|
53 |
+
next;
|
54 |
+
}
|
55 |
+
unless ($cost =~ /^\d+(\.\d+)?$/) {
|
56 |
+
if ($cost eq "") {
|
57 |
+
print STDERR "Ignoring bad line $line_number in $filename, because of missing cost\n";
|
58 |
+
} else {
|
59 |
+
print STDERR "Ignoring bad line $line_number in $filename, because of ill-formed cost $cost\n";
|
60 |
+
}
|
61 |
+
next;
|
62 |
+
}
|
63 |
+
my $lang_code1_s = $util->slot_value_in_double_colon_del_list($line, "lc1");
|
64 |
+
my $lang_code2_s = $util->slot_value_in_double_colon_del_list($line, "lc2");
|
65 |
+
my @lang_codes_1 = ($lang_code1_s eq "") ? ("") : split(/,\s*/, $lang_code1_s);
|
66 |
+
my @lang_codes_2 = ($lang_code2_s eq "") ? ("") : split(/,\s*/, $lang_code2_s);
|
67 |
+
my $left_context1 = $util->slot_value_in_double_colon_del_list($line, "left1");
|
68 |
+
my $left_context2 = $util->slot_value_in_double_colon_del_list($line, "left2");
|
69 |
+
my $right_context1 = $util->slot_value_in_double_colon_del_list($line, "right1");
|
70 |
+
my $right_context2 = $util->slot_value_in_double_colon_del_list($line, "right2");
|
71 |
+
my $bad_left = $util->slot_value_in_double_colon_del_list($line, "left");
|
72 |
+
if ($bad_left) {
|
73 |
+
print STDERR "** Warning: slot '::left $bad_left' in line $line_number\n";
|
74 |
+
next;
|
75 |
+
}
|
76 |
+
my $bad_right = $util->slot_value_in_double_colon_del_list($line, "right");
|
77 |
+
if ($bad_right) {
|
78 |
+
print STDERR "** Warning: slot '::right $bad_right' in line $line_number\n";
|
79 |
+
next;
|
80 |
+
}
|
81 |
+
my $in_lang_codes1 = $util->slot_value_in_double_colon_del_list($line, "in-lc1");
|
82 |
+
my $in_lang_codes2 = $util->slot_value_in_double_colon_del_list($line, "in-lc2");
|
83 |
+
my $out_lang_codes1 = $util->slot_value_in_double_colon_del_list($line, "out-lc1");
|
84 |
+
my $out_lang_codes2 = $util->slot_value_in_double_colon_del_list($line, "out-lc2");
|
85 |
+
if ($left_context1) {
|
86 |
+
if ($left_context1 =~ /^\/.*\/$/) {
|
87 |
+
$left_context1 =~ s/^\///;
|
88 |
+
$left_context1 =~ s/\/$//;
|
89 |
+
} else {
|
90 |
+
print STDERR "Ignoring unrecognized non-regular-expression ::left1 $left_context1 in line $line_number of $filename\n";
|
91 |
+
$left_context1 = "";
|
92 |
+
}
|
93 |
+
}
|
94 |
+
if ($left_context2) {
|
95 |
+
if ($left_context2 =~ /^\/.*\/$/) {
|
96 |
+
$left_context2 =~ s/^\///;
|
97 |
+
$left_context2 =~ s/\/$//;
|
98 |
+
} else {
|
99 |
+
print STDERR "Ignoring unrecognized non-regular-expression ::left2 $left_context2 in line $line_number of $filename\n";
|
100 |
+
$left_context2 = "";
|
101 |
+
}
|
102 |
+
}
|
103 |
+
if ($right_context1) {
|
104 |
+
unless ($right_context1 =~ /^(\[[^\[\]]*\])+$/) {
|
105 |
+
print STDERR "Ignoring unrecognized right-context ::right1 $right_context1 in line $line_number of $filename\n";
|
106 |
+
$right_context1 = "";
|
107 |
+
}
|
108 |
+
}
|
109 |
+
if ($right_context2) {
|
110 |
+
unless ($right_context2 =~ /^(\[[^\[\]]*\])+$/) {
|
111 |
+
print STDERR "Ignoring unrecognized right-context ::right2 $right_context2 in line $line_number of $filename\n";
|
112 |
+
$right_context2 = "";
|
113 |
+
}
|
114 |
+
}
|
115 |
+
foreach $lang_code1 (@lang_codes_1) {
|
116 |
+
foreach $lang_code2 (@lang_codes_2) {
|
117 |
+
$n_cost_rules++;
|
118 |
+
my $cost_rule_id = $n_cost_rules;
|
119 |
+
$ht{COST}->{$lang_code1}->{$lang_code2}->{$s1}->{$s2}->{$cost_rule_id} = $cost;
|
120 |
+
$ht{RULE_STRING}->{$lang_code1}->{$s1} = 1;
|
121 |
+
$ht{RULE_STRING}->{$lang_code2}->{$s2} = 1;
|
122 |
+
$ht{LEFT1}->{$cost_rule_id} = $left_context1;
|
123 |
+
$ht{LEFT2}->{$cost_rule_id} = $left_context2;
|
124 |
+
$ht{RIGHT1}->{$cost_rule_id} = $right_context1;
|
125 |
+
$ht{RIGHT2}->{$cost_rule_id} = $right_context2;
|
126 |
+
$ht{INLC1}->{$cost_rule_id} = $in_lang_codes1;
|
127 |
+
$ht{INLC2}->{$cost_rule_id} = $in_lang_codes2;
|
128 |
+
$ht{OUTLC1}->{$cost_rule_id} = $out_lang_codes1;
|
129 |
+
$ht{OUTLC2}->{$cost_rule_id} = $out_lang_codes2;
|
130 |
+
unless (($s1 eq $s2)
|
131 |
+
&& ($lang_code1 eq $lang_code2)
|
132 |
+
&& ($left_context1 eq $left_context2)
|
133 |
+
&& ($right_context1 eq $right_context2)
|
134 |
+
&& ($in_lang_codes1 eq $in_lang_codes2)
|
135 |
+
&& ($out_lang_codes1 eq $out_lang_codes2)) {
|
136 |
+
$n_cost_rules++;
|
137 |
+
$cost_rule_id = $n_cost_rules;
|
138 |
+
$ht{COST}->{$lang_code2}->{$lang_code1}->{$s2}->{$s1}->{$cost_rule_id} = $cost;
|
139 |
+
$ht{LEFT1}->{$cost_rule_id} = $left_context2;
|
140 |
+
$ht{LEFT2}->{$cost_rule_id} = $left_context1;
|
141 |
+
$ht{RIGHT1}->{$cost_rule_id} = $right_context2;
|
142 |
+
$ht{RIGHT2}->{$cost_rule_id} = $right_context1;
|
143 |
+
$ht{INLC1}->{$cost_rule_id} = $in_lang_codes2;
|
144 |
+
$ht{INLC2}->{$cost_rule_id} = $in_lang_codes1;
|
145 |
+
$ht{OUTLC1}->{$cost_rule_id} = $out_lang_codes2;
|
146 |
+
$ht{OUTLC2}->{$cost_rule_id} = $out_lang_codes1;
|
147 |
+
# print STDERR " Flip rule in line $line: $line\n";
|
148 |
+
}
|
149 |
+
$this->rule_string_expansion(*ht, $s1, $lang_code1);
|
150 |
+
$this->rule_string_expansion(*ht, $s2, $lang_code2);
|
151 |
+
}
|
152 |
+
}
|
153 |
+
}
|
154 |
+
close(IN);
|
155 |
+
print STDERR "Read in $n_cost_rules rules from $line_number lines in $filename\n" if $verbose;
|
156 |
+
}
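The loader above reads each rule line as a list of `::slot value` pairs (`s1`, `s2`, `cost`, `lc1`, ...), validates the cost, and splits language codes on commas. The real extraction is done by `slot_value_in_double_colon_del_list` in `NLP::utilities`, which is not shown in this diff. The sketch below is only a guess at its behavior; `slot_value` and the sample rule line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for slot_value_in_double_colon_del_list:
# returns the text after "::$slot" up to the next "::slot" or end of line,
# or "" if the slot is absent.
sub slot_value {
    my ($line, $slot) = @_;
    return ($line =~ /(?:^|\s)::\Q$slot\E\s+(.*?)\s*(?=\s::\S|$)/) ? $1 : "";
}

my $line = "::s1 ph ::s2 f ::cost 0.1 ::lc1 eng, deu";   # made-up rule line

my $cost       = slot_value($line, "cost");
my @lang_codes = split(/,\s*/, slot_value($line, "lc1"));

print "s1=", slot_value($line, "s1"), " s2=", slot_value($line, "s2"), "\n";
print "cost is well-formed\n" if $cost =~ /^\d+(\.\d+)?$/;   # same check as the loader
print "language codes: @lang_codes\n";
```

Requiring whitespace after the slot name keeps a lookup for `left` from matching `::left1`, which matters because the loader treats a bare `::left` or `::right` slot as an error.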
|
157 |
+
|
158 |
+
sub romanized_string_to_simple_chart {
|
159 |
+
local($this, $s, *chart_ht) = @_;
|
160 |
+
|
161 |
+
my @characters = $utf8->split_into_utf8_characters($s, "return only chars, return trailing whitespaces", *dummy_ht);
|
162 |
+
$chart_ht{N_CHARS} = $#characters + 1;
|
163 |
+
$chart_ht{N_NODES} = 0;
|
164 |
+
foreach $i ((0 .. $#characters)) {
|
165 |
+
$romanizer->add_node($characters[$i], $i, ($i+1), *chart_ht, "", "");
|
166 |
+
}
|
167 |
+
}
|
168 |
+
|
169 |
+
sub linearize_chart_points {
|
170 |
+
local($this, *chart_ht, $chart_id, *sd_ht, $verbose) = @_;
|
171 |
+
|
172 |
+
$verbose = 0 unless defined($verbose);
|
173 |
+
print STDERR "Linearize $chart_id\n" if $verbose;
|
174 |
+
my $current_chart_pos = 0;
|
175 |
+
my $current_linear_chart_pos = 0;
|
176 |
+
$sd_ht{POS2LINPOS}->{$chart_id}->{$current_chart_pos} = $current_linear_chart_pos;
|
177 |
+
$sd_ht{LINPOS2POS}->{$chart_id}->{$current_linear_chart_pos} = $current_chart_pos;
|
178 |
+
print STDERR " LINPOS2POS.$chart_id LIN: $current_linear_chart_pos POS: $current_chart_pos\n" if $verbose;
|
179 |
+
my @end_chart_positions = keys %{$chart_ht{NODES_ENDING_AT}};
|
180 |
+
my $end_chart_pos = (@end_chart_positions) ? max(@end_chart_positions) : 0;
|
181 |
+
$sd_ht{MAXPOS}->{$chart_id} = $end_chart_pos;
|
182 |
+
print STDERR " Chart span: $current_chart_pos-$end_chart_pos\n" if $verbose;
|
183 |
+
while ($current_chart_pos < $end_chart_pos) {
|
184 |
+
my @node_ids = keys %{$chart_ht{NODES_STARTING_AT}->{$current_chart_pos}};
|
185 |
+
foreach $node_id (@node_ids) {
|
186 |
+
my $roman_s = $chart_ht{NODE_ROMAN}->{$node_id};
|
187 |
+
my @roman_chars = $utf8->split_into_utf8_characters($roman_s, "return only chars, return trailing whitespaces", *dummy_ht);
|
188 |
+
print STDERR " $current_chart_pos/$current_linear_chart_pos node: $node_id $roman_s (@roman_chars)\n" if $verbose;
|
189 |
+
if ($#roman_chars >= 1) {
|
190 |
+
foreach $i ((1 .. $#roman_chars)) {
|
191 |
+
$current_linear_chart_pos++;
|
192 |
+
$sd_ht{SPLITPOS2LINPOS}->{$chart_id}->{$current_chart_pos}->{$node_id}->{$i} = $current_linear_chart_pos;
|
193 |
+
$sd_ht{LINPOS2SPLITPOS}->{$chart_id}->{$current_linear_chart_pos}->{$current_chart_pos}->{$node_id}->{$i} = 1;
|
194 |
+
print STDERR " LINPOS2SPLITPOS.$chart_id LIN: $current_linear_chart_pos POS: $current_chart_pos NODE: $node_id I: $i\n" if $verbose;
|
195 |
+
}
|
196 |
+
}
|
197 |
+
}
|
198 |
+
$current_chart_pos++;
|
199 |
+
if ($util->member($current_chart_pos, @end_chart_positions)) {
|
200 |
+
$current_linear_chart_pos++;
|
201 |
+
$sd_ht{POS2LINPOS}->{$chart_id}->{$current_chart_pos} = $current_linear_chart_pos;
|
202 |
+
$sd_ht{LINPOS2POS}->{$chart_id}->{$current_linear_chart_pos} = $current_chart_pos;
|
203 |
+
print STDERR " LINPOS2POS.$chart_id LIN: $current_linear_chart_pos POS: $current_chart_pos\n" if $verbose;
|
204 |
+
}
|
205 |
+
}
|
206 |
+
$current_chart_pos = 0;
|
207 |
+
while ($current_chart_pos <= $end_chart_pos) {
|
208 |
+
my $current_linear_chart_pos = $sd_ht{POS2LINPOS}->{$chart_id}->{$current_chart_pos};
|
209 |
+
$current_linear_chart_pos = "?" unless defined($current_linear_chart_pos);
|
210 |
+
my @node_ids = keys %{$chart_ht{NODES_STARTING_AT}->{$current_chart_pos}};
|
211 |
+
# print STDERR " LINROM.$chart_id LIN: $current_linear_chart_pos POS: $current_chart_pos NODES: @node_ids\n" if $verbose;
|
212 |
+
foreach $node_id (@node_ids) {
|
213 |
+
my $end_pos = $chart_ht{NODE_END}->{$node_id};
|
214 |
+
my $end_linpos = $sd_ht{POS2LINPOS}->{$chart_id}->{$end_pos};
|
215 |
+
my $roman_s = $chart_ht{NODE_ROMAN}->{$node_id};
|
216 |
+
my @roman_chars = $utf8->split_into_utf8_characters($roman_s, "return only chars, return trailing whitespaces", *dummy_ht);
|
217 |
+
print STDERR " LINROM.$chart_id LIN: $current_linear_chart_pos POS: $current_chart_pos NODE: $node_id CHARS: @roman_chars\n" if $verbose;
|
218 |
+
if (@roman_chars) {
|
219 |
+
foreach $i ((0 .. $#roman_chars)) {
|
220 |
+
my $from_linear_chart_pos
|
221 |
+
= (($i == 0)
|
222 |
+
? $sd_ht{POS2LINPOS}->{$chart_id}->{$current_chart_pos}
|
223 |
+
: $sd_ht{SPLITPOS2LINPOS}->{$chart_id}->{$current_chart_pos}->{$node_id}->{$i});
|
224 |
+
print STDERR " FROM.$chart_id I: $i POS: $current_chart_pos NODE: $node_id FROM: $from_linear_chart_pos\n" if $verbose;
|
225 |
+
my $to_linear_chart_pos
|
226 |
+
= (($i == $#roman_chars)
|
227 |
+
? $end_linpos
|
228 |
+
: $sd_ht{SPLITPOS2LINPOS}->{$chart_id}->{$current_chart_pos}->{$node_id}->{($i+1)});
|
229 |
+
print STDERR " TO.$chart_id I: $i POS: $current_chart_pos NODE: $node_id TO: $to_linear_chart_pos\n" if $verbose;
|
230 |
+
my $roman_char = $roman_chars[$i];
|
231 |
+
$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$from_linear_chart_pos}->{$to_linear_chart_pos}->{$roman_char} = 1;
|
232 |
+
}
|
233 |
+
} else {
|
234 |
+
my $from_linear_chart_pos = $sd_ht{POS2LINPOS}->{$chart_id}->{$current_chart_pos};
|
235 |
+
my $to_linear_chart_pos = $sd_ht{POS2LINPOS}->{$chart_id}->{($current_chart_pos+1)};
|
236 |
+
# HHERE check this out
|
237 |
+
my $i = 1;
|
238 |
+
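while ((! (defined($to_linear_chart_pos))) && (($current_chart_pos+$i) < $end_chart_pos)) { # bounded so the scan cannot run past the end of the chart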
|
239 |
+
$i++;
|
240 |
+
$to_linear_chart_pos = $sd_ht{POS2LINPOS}->{$chart_id}->{($current_chart_pos+$i)};
|
241 |
+
}
|
242 |
+
if (defined($from_linear_chart_pos) && defined($to_linear_chart_pos)) {
|
243 |
+
$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$from_linear_chart_pos}->{$to_linear_chart_pos}->{""} = 1
|
244 |
+
} else {
|
245 |
+
print STDERR " UNDEF.$chart_id from: "
|
246 |
+
. ((defined($from_linear_chart_pos)) ? $from_linear_chart_pos : "?")
|
247 |
+
. " to: "
|
248 |
+
. ((defined($to_linear_chart_pos)) ? $to_linear_chart_pos : "?")
|
249 |
+
. "\n";
|
250 |
+
}
|
251 |
+
}
|
252 |
+
}
|
253 |
+
$current_chart_pos++;
|
254 |
+
}
|
255 |
+
$sd_ht{MAXLINPOS}->{$chart_id} = $sd_ht{POS2LINPOS}->{$chart_id}->{$end_chart_pos};
|
256 |
+
}
|
257 |
+
|
258 |
+
sub expand_lin_ij_roman {
|
259 |
+
local($this, *sd_ht, $chart_id, $lang_code, *ht) = @_;
|
260 |
+
|
261 |
+
foreach $start (sort { $a <=> $b } keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}}) {
|
262 |
+
foreach $end (sort { $a <=> $b } keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$start}}) {
|
263 |
+
foreach $roman (sort keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$start}->{$end}}) {
|
264 |
+
if ($ht{RULE_STRING_HAS_EXPANSION}->{$lang_code}->{$roman}
|
265 |
+
|| $ht{RULE_STRING_HAS_EXPANSION}->{""}->{$roman}) {
|
266 |
+
$this->expand_lin_ij_roman_rec(*sd_ht, $chart_id, $start, $end, $roman, $lang_code, *ht);
|
267 |
+
}
|
268 |
+
}
|
269 |
+
}
|
270 |
+
}
|
271 |
+
}
|
272 |
+
|
273 |
+
sub expand_lin_ij_roman_rec {
|
274 |
+
local($this, *sd_ht, $chart_id, $start, $end, $roman, $lang_code, *ht) = @_;
|
275 |
+
|
276 |
+
# print STDERR " expand_lin_ij_roman_rec.$chart_id $start-$end $lang_code $roman\n";
|
277 |
+
return unless $ht{RULE_STRING_HAS_EXPANSION}->{$lang_code}->{$roman}
|
278 |
+
|| $ht{RULE_STRING_HAS_EXPANSION}->{""}->{$roman};
|
279 |
+
foreach $new_end (keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$end}}) {
|
280 |
+
foreach $next_roman (sort keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$end}->{$new_end}}) {
|
281 |
+
my $exp_roman = join("", $roman, $next_roman);
|
282 |
+
if ($ht{RULE_STRING}->{$lang_code}->{$exp_roman}
|
283 |
+
|| $ht{RULE_STRING}->{""}->{$exp_roman}) {
|
284 |
+
$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$start}->{$new_end}->{$exp_roman} = 1;
|
285 |
+
# print STDERR " Expansion ($start-$new_end) $exp_roman\n";
|
286 |
+
}
|
287 |
+
if ($ht{RULE_STRING_HAS_EXPANSION}->{$lang_code}->{$exp_roman}
|
288 |
+
|| $ht{RULE_STRING_HAS_EXPANSION}->{""}->{$exp_roman}) {
|
289 |
+
$this->expand_lin_ij_roman_rec(*sd_ht, $chart_id, $start, $new_end, $exp_roman, $lang_code, *ht);
|
290 |
+
}
|
291 |
+
}
|
292 |
+
}
|
293 |
+
}
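The two subs above join adjacent single-character edges into longer strings whenever the joined string is a known rule string. The sketch below illustrates that idea in isolation. It is not the module's code: `%rule_string`, `%has_expansion`, `%edges` and `expand_rec` are hypothetical stand-ins for `$ht{RULE_STRING}`, `$ht{RULE_STRING_HAS_EXPANSION}`, `$sd_ht{LIN_IJ_ROMAN}` and `expand_lin_ij_roman_rec`, and language codes are left out.

```perl
use strict;
use warnings;

# Hypothetical rule data: known rule strings, and strings that are a
# proper prefix of some longer rule string.
my %rule_string   = map { $_ => 1 } qw(s h sh c);
my %has_expansion = map { $_ => 1 } qw(s);

# $edges{start}{end}{roman} = 1; initially one character per edge.
my %edges = (0 => { 1 => { s => 1 } }, 1 => { 2 => { h => 1 } });

sub expand_rec {
    my ($edges, $start, $end, $roman) = @_;
    # Stop unless $roman can still grow into a longer rule string.
    return unless $has_expansion{$roman};
    foreach my $new_end (sort { $a <=> $b } keys %{ $edges->{$end} || {} }) {
        foreach my $next (sort keys %{ $edges->{$end}{$new_end} }) {
            my $joined = $roman . $next;
            # Record a new edge spanning both original edges.
            $edges->{$start}{$new_end}{$joined} = 1 if $rule_string{$joined};
            expand_rec($edges, $start, $new_end, $joined);
        }
    }
}

foreach my $start (sort { $a <=> $b } keys %edges) {
    foreach my $end (sort { $a <=> $b } keys %{ $edges{$start} }) {
        foreach my $roman (sort keys %{ $edges{$start}{$end} }) {
            expand_rec(\%edges, $start, $end, $roman);
        }
    }
}
```

After the loop, the chart holds an extra edge from position 0 to position 2 labelled `sh`, next to the original `s` and `h` edges, so a later cost lookup can treat `sh` as one unit.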
|
294 |
+
|
295 |
+
sub trace_string_distance {
|
296 |
+
local($this, *sd_ht, $chart1_id, $chart2_id, $control, $line_number, $cost) = @_;
|
297 |
+
|
298 |
+
my $chart_comb_id = join("/", $chart1_id, $chart2_id);
|
299 |
+
return "mismatch" if $sd_ht{MISMATCH}->{$chart_comb_id};
|
300 |
+
my $chart1_end = $sd_ht{MAXLINPOS}->{$chart1_id};
|
301 |
+
my $chart2_end = $sd_ht{MAXLINPOS}->{$chart2_id};
|
302 |
+
my $verbose = ($control =~ /verbose/);
|
303 |
+
my $chunks_p = ($control =~ /chunks/);
|
304 |
+
my @traces = ();
|
305 |
+
my @s1_s = ();
|
306 |
+
my @s2_s = ();
|
307 |
+
my @e1_s = ();
|
308 |
+
my @e2_s = ();
|
309 |
+
my @r1_s = ();
|
310 |
+
my @r2_s = ();
|
311 |
+
my @ic_s = ();
|
312 |
+
|
313 |
+
# print STDERR "trace_string_distance $chart1_id $chart2_id $line_number\n";
|
314 |
+
while ($chart1_end || $chart2_end) {
|
315 |
+
my $incr_cost = $sd_ht{INCR_COST_IJ}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
316 |
+
my $prec_i = $sd_ht{PREC_I}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
317 |
+
my $prec_j = $sd_ht{PREC_J}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
318 |
+
if ($incr_cost || $verbose || $chunks_p) {
|
319 |
+
my $roman1 = $sd_ht{ROMAN1}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
320 |
+
my $roman2 = $sd_ht{ROMAN2}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
321 |
+
if ($verbose) {
|
322 |
+
push(@traces, "$prec_i-$chart1_end/$prec_j-$chart2_end:$roman1/$roman2:$incr_cost");
|
323 |
+
} else {
|
324 |
+
if (defined($roman1)) {
|
325 |
+
push(@traces, "$roman1/$roman2:$incr_cost");
|
326 |
+
} else {
|
327 |
+
$print_prec_i = (defined($prec_i)) ? $prec_i : "?";
|
328 |
+
$print_prec_j = (defined($prec_j)) ? $prec_j : "?";
|
329 |
+
print STDERR " $print_prec_i-$chart1_end, $print_prec_j-$chart2_end\n";
|
330 |
+
}
|
331 |
+
}
|
332 |
+
if ($chunks_p) {
|
333 |
+
push(@s1_s, $prec_i);
|
334 |
+
push(@s2_s, $prec_j);
|
335 |
+
push(@e1_s, $chart1_end);
|
336 |
+
push(@e2_s, $chart2_end);
|
337 |
+
push(@r1_s, $roman1);
|
338 |
+
push(@r2_s, $roman2);
|
339 |
+
push(@ic_s, $incr_cost);
|
340 |
+
}
|
341 |
+
}
|
342 |
+
$chart1_end = $prec_i;
|
343 |
+
$chart2_end = $prec_j;
|
344 |
+
}
|
345 |
+
if ($chunks_p) {
|
346 |
+
my $r1 = "";
|
347 |
+
my $r2 = "";
|
348 |
+
my $tc = 0;
|
349 |
+
my $in_chunk = 0;
|
350 |
+
foreach $i ((0 .. $#ic_s)) {
|
351 |
+
if ($ic_s[$i]) {
|
352 |
+
$r1 = $r1_s[$i] . $r1;
|
353 |
+
$r2 = $r2_s[$i] . $r2;
|
354 |
+
$tc += $ic_s[$i];
|
355 |
+
$in_chunk = 1;
|
356 |
+
} elsif ($in_chunk) {
|
357 |
+
$chunk = "$r1/$r2/$tc";
|
358 |
+
$chunk .= "*" if $cost > 5;
|
359 |
+
$sd_ht{N_COST_CHUNK}->{$chunk} = ($sd_ht{N_COST_CHUNK}->{$chunk} || 0) + 1;
|
360 |
+
$sd_ht{EX_COST_CHUNK}->{$chunk}->{$line_number} = 1;
|
361 |
+
$r1 = "";
|
362 |
+
$r2 = "";
|
363 |
+
$tc = 0;
|
364 |
+
$in_chunk = 0;
|
365 |
+
}
|
366 |
+
}
|
367 |
+
if ($in_chunk) {
|
368 |
+
$chunk = "$r1/$r2/$tc";
|
369 |
+
$chunk .= "*" if $cost > 5;
|
370 |
+
$sd_ht{N_COST_CHUNK}->{$chunk} = ($sd_ht{N_COST_CHUNK}->{$chunk} || 0) + 1;
|
371 |
+
$sd_ht{EX_COST_CHUNK}->{$chunk}->{$line_number} = 1;
|
372 |
+
}
|
373 |
+
} else {
|
374 |
+
return join(" ", reverse @traces);
|
375 |
+
}
|
376 |
+
}
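In non-verbose, non-chunk mode, `trace_string_distance` walks back from the end positions through the `PREC_I`/`PREC_J` pointers and keeps only the steps with a non-zero incremental cost. The sketch below shows that backtrace on a flat table. The `%step` layout and `trace_sketch` are assumptions made for illustration, not structures from the module. Returning "mismatch" on a missing pointer is also a simplification, since the original only returns it when the `MISMATCH` flag is set.

```perl
use strict;
use warnings;

# Hypothetical back-pointer table:
# $step{i}{j} = [prev_i, prev_j, roman1, roman2, incremental_cost]
my %step = (
    1 => { 1 => [0, 0, 'a', 'a', 0] },
    2 => { 2 => [1, 1, 'i', 'e', 0.2] },
    3 => { 3 => [2, 2, 't', 't', 0] },
);

sub trace_sketch {
    my ($step, $i, $j) = @_;
    my @traces;
    while ($i || $j) {
        my $s = $step->{$i}{$j} or return "mismatch";
        my ($prev_i, $prev_j, $r1, $r2, $cost) = @$s;
        # Keep only the steps that contributed to the total cost.
        push @traces, "$r1/$r2:$cost" if $cost;
        ($i, $j) = ($prev_i, $prev_j);
    }
    # Steps were collected end-to-start, so reverse them for display.
    return join(" ", reverse @traces);
}

print trace_sketch(\%step, 3, 3), "\n";
```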
|
377 |
+
|
378 |
+
sub right_context_match {
|
379 |
+
local($this, $right_context_rule, *sd_ht, $chart_id, $start_pos) = @_;
|
380 |
+
|
381 |
+
return 1 if $right_context_rule eq "";
|
382 |
+
if (($right_context_item, $right_context_rest) = ($right_context_rule =~ /^\[([^\[\]]*)\]*(.*)$/)) {
|
383 |
+
my $guarded_right_context_item = $right_context_item;
|
384 |
+
$guarded_right_context_item =~ s/\$/\\\$/g;
|
385 |
+
my @end_positions = keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$start_pos}};
|
386 |
+
return 1 if ($#end_positions == -1)
|
387 |
+
&& (($right_context_item eq "")
|
388 |
+
|| ($right_context_item =~ /\$/));
|
389 |
+
foreach $end_pos (@end_positions) {
|
390 |
+
my @romans = keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$start_pos}->{$end_pos}};
|
391 |
+
foreach $roman (@romans) {
|
392 |
+
if ($roman =~ /^[$guarded_right_context_item]/) {
|
393 |
+
return $this->right_context_match($right_context_rest, *sd_ht, $chart_id, $end_pos);
|
394 |
+
}
|
395 |
+
}
|
396 |
+
}
|
397 |
+
}
|
398 |
+
return 0;
|
399 |
+
}
|
400 |
+
|
401 |
+
sub string_distance {
|
402 |
+
local($this, *sd_ht, $chart1_id, $chart2_id, $lang_code1, $lang_code2, *ht, $control) = @_;
|
403 |
+
|
404 |
+
my $verbose = ($control =~ /verbose/i);
|
405 |
+
my $chart_comb_id = join("/", $chart1_id, $chart2_id);
|
406 |
+
|
407 |
+
my $chart1_end_pos = $sd_ht{MAXLINPOS}->{$chart1_id};
|
408 |
+
my $chart2_end_pos = $sd_ht{MAXLINPOS}->{$chart2_id};
|
409 |
+
print STDERR "string_distance.$chart_comb_id $chart1_end_pos/$chart2_end_pos\n" if $verbose;
|
410 |
+
$sd_ht{COST_IJ}->{$chart_comb_id}->{0}->{0} = 0;
|
411 |
+
$sd_ht{COMB_LEFT_ROMAN1}->{$chart_comb_id}->{0}->{0} = "";
|
412 |
+
$sd_ht{COMB_LEFT_ROMAN2}->{$chart_comb_id}->{0}->{0} = "";
|
413 |
+
# HHERE
|
414 |
+
foreach $chart1_start ((0 .. $chart1_end_pos)) {
|
415 |
+
# print STDERR " C1 $chart1_start- ($chart1_start .. $chart1_end_pos)\n";
|
416 |
+
my $prev_further_expansion_possible = 0;
|
417 |
+
my @chart1_ends = sort { $a <=> $b } keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart1_id}->{$chart1_start}};
|
418 |
+
my $max_chart1_ends = (@chart1_ends) ? $chart1_ends[$#chart1_ends] : -1;
|
419 |
+
foreach $chart1_end (($chart1_start .. $chart1_end_pos)) {
|
420 |
+
my $further_expansion_possible = ($chart1_start == $chart1_end)
|
421 |
+
|| defined($sd_ht{LINPOS2SPLITPOS}->{$chart1_id}->{$chart1_start})
|
422 |
+
|| ($chart1_end < $max_chart1_ends);
|
423 |
+
my @romans1 = (($chart1_start == $chart1_end)
|
424 |
+
? ("")
|
425 |
+
: (sort keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart1_id}->{$chart1_start}->{$chart1_end}}));
|
426 |
+
if ($#romans1 == -1) {
|
427 |
+
$further_expansion_possible = 1 if $prev_further_expansion_possible;
|
428 |
+
} else {
|
429 |
+
$prev_further_expansion_possible = 0;
|
430 |
+
}
|
431 |
+
# print STDERR " C1 $chart1_start-$chart1_end romans1: @romans1 {$further_expansion_possible} *l*\n";
|
432 |
+
foreach $roman1 (@romans1) {
|
433 |
+
# print STDERR " C1 $chart1_start-$chart1_end $roman1 {$further_expansion_possible} *?*\n";
|
434 |
+
next unless $ht{RULE_STRING}->{$lang_code1}->{$roman1}
|
435 |
+
|| $ht{RULE_STRING}->{""}->{$roman1};
|
436 |
+
# print STDERR " C1 $chart1_start-$chart1_end $roman1 {$further_expansion_possible} ***\n";
|
437 |
+
foreach $lang_code1o (($lang_code1, "")) {
|
438 |
+
foreach $lang_code2o (($lang_code2, "")) {
|
439 |
+
my @chart2_starts = (sort { $a <=> $b } keys %{$sd_ht{COST_IJ}->{$chart_comb_id}->{$chart1_start}});
|
440 |
+
foreach $chart2_start (@chart2_starts) {
|
441 |
+
# print STDERR " C1 $chart1_start-$chart1_end $roman1 C2 $chart2_start- (@chart2_starts)\n";
|
442 |
+
foreach $chart2_end (($chart2_start .. $chart2_end_pos)) {
|
443 |
+
print STDERR " C1 $chart1_start-$chart1_end $roman1 C2 $chart2_start-$chart2_end\n" if $verbose;
|
444 |
+
my @romans2 = (($chart2_start == $chart2_end)
|
445 |
+
? ("")
|
446 |
+
: (sort keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart2_id}->{$chart2_start}->{$chart2_end}}));
|
447 |
+
foreach $roman2 (@romans2) {
|
448 |
+
if ($roman1 eq $roman2) {
|
449 |
+
print STDERR " C1 $chart1_start-$chart1_end $roman1 C2 $chart2_start-$chart2_end $roman2 (IDENTITY)\n" if $verbose;
|
450 |
+
my $cost = 0;
|
451 |
+
my $preceding_cost = $sd_ht{COST_IJ}->{$chart_comb_id}->{$chart1_start}->{$chart2_start};
|
452 |
+
my $combined_cost = $preceding_cost + $cost;
|
453 |
+
my $old_cost = $sd_ht{COST_IJ}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
454 |
+
if ((! defined($old_cost)) || ($combined_cost < $old_cost)) {
|
455 |
+
$sd_ht{COST_IJ}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $combined_cost;
|
456 |
+
push(@chart2_starts, $chart2_end) unless $util->member($chart2_end, @chart2_starts);
|
457 |
+
$sd_ht{PREC_I}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $chart1_start;
|
458 |
+
$sd_ht{PREC_J}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $chart2_start;
|
459 |
+
$sd_ht{ROMAN1}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $roman1;
|
460 |
+
$sd_ht{ROMAN2}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $roman2;
|
461 |
+
$sd_ht{COMB_LEFT_ROMAN1}->{$chart_comb_id}->{$chart1_end}->{$chart2_end}
|
462 |
+
= $sd_ht{COMB_LEFT_ROMAN1}->{$chart_comb_id}->{$chart1_start}->{$chart2_start} . $roman1;
|
463 |
+
$sd_ht{COMB_LEFT_ROMAN2}->{$chart_comb_id}->{$chart1_end}->{$chart2_end}
|
464 |
+
= $sd_ht{COMB_LEFT_ROMAN2}->{$chart_comb_id}->{$chart1_start}->{$chart2_start} . $roman2;
|
465 |
+
$comb_left_roman1 = $sd_ht{COMB_LEFT_ROMAN1}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
466 |
+
$sd_ht{INCR_COST_IJ}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $cost;
|
467 |
+
$sd_ht{COST_RULE}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = "IDENTITY";
|
468 |
+
print STDERR " New cost $chart1_end/$chart2_end: $combined_cost (+$cost from $chart1_start/$chart2_start $roman1/$roman2)\n" if $verbose;
|
469 |
+
}
|
470 |
+
} else {
|
471 |
+
next unless $ht{RULE_STRING}->{$lang_code2o}->{$roman2};
|
472 |
+
print STDERR " C1 $chart1_start-$chart1_end $roman1 C2 $chart2_start-$chart2_end $roman2\n" if $verbose;
|
473 |
+
next unless defined($ht{COST}->{$lang_code1o}->{$lang_code2o}->{$roman1}->{$roman2});
|
474 |
+
my @cost_rule_ids = keys %{$ht{COST}->{$lang_code1o}->{$lang_code2o}->{$roman1}->{$roman2}};
|
475 |
+
foreach $cost_rule_id (@cost_rule_ids) {
|
476 |
+
## check whether any context requirements are satisfied
|
477 |
+
# left context rules are regular expressions
|
478 |
+
my $left_context_rule1 = $ht{LEFT1}->{$cost_rule_id};
|
479 |
+
if ($left_context_rule1) {
|
480 |
+
my $comb_left_roman1 = $sd_ht{COMB_LEFT_ROMAN1}->{$chart_comb_id}->{$chart1_start}->{$chart2_start};
|
481 |
+
if (defined($comb_left_roman1)) {
|
482 |
+
next unless $comb_left_roman1 =~ /$left_context_rule1/;
|
483 |
+
} else {
|
484 |
+
print STDERR " No comb_left_roman1 value for $chart_comb_id $chart1_start,$chart2_start\n";
|
485 |
+
}
|
486 |
+
}
|
487 |
+
my $left_context_rule2 = $ht{LEFT2}->{$cost_rule_id};
|
488 |
+
if ($left_context_rule2) {
|
489 |
+
my $comb_left_roman2 = $sd_ht{COMB_LEFT_ROMAN2}->{$chart_comb_id}->{$chart1_start}->{$chart2_start};
|
490 |
+
if (defined($comb_left_roman2)) {
|
491 |
+
next unless $comb_left_roman2 =~ /$left_context_rule2/;
|
492 |
+
} else {
|
493 |
+
print STDERR " No comb_left_roman2 value for $chart_comb_id $chart1_start,$chart2_start\n";
|
494 |
+
}
|
495 |
+
}
|
496 |
+
my $right_context_rule1 = $ht{RIGHT1}->{$cost_rule_id};
|
497 |
+
if ($right_context_rule1) {
|
498 |
+
my $match_p = $this->right_context_match($right_context_rule1, *sd_ht, $chart1_id, $chart1_end);
|
499 |
+
# print STDERR " Match?($right_context_rule1, 1, $chart1_end) = $match_p\n";
|
500 |
+
next unless $match_p;
|
501 |
+
}
|
502 |
+
my $right_context_rule2 = $ht{RIGHT2}->{$cost_rule_id};
|
503 |
+
if ($right_context_rule2) {
|
504 |
+
my $match_p = $this->right_context_match($right_context_rule2, *sd_ht, $chart2_id, $chart2_end);
|
505 |
+
# print STDERR " Match?($right_context_rule2, 2, $chart2_end) = $match_p\n";
|
506 |
+
next unless $match_p;
|
507 |
+
}
|
508 |
+
my $cost = $ht{COST}->{$lang_code1o}->{$lang_code2o}->{$roman1}->{$roman2}->{$cost_rule_id};
|
509 |
+
my $preceding_cost = $sd_ht{COST_IJ}->{$chart_comb_id}->{$chart1_start}->{$chart2_start};
|
510 |
+
my $combined_cost = $preceding_cost + $cost;
|
511 |
+
my $old_cost = $sd_ht{COST_IJ}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
512 |
+
if ((! defined($old_cost)) || ($combined_cost < $old_cost)) {
|
513 |
+
$sd_ht{COST_IJ}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $combined_cost;
|
514 |
+
push(@chart2_starts, $chart2_end) unless $util->member($chart2_end, @chart2_starts);
|
515 |
+
$sd_ht{PREC_I}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $chart1_start;
|
516 |
+
$sd_ht{PREC_J}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $chart2_start;
|
517 |
+
$sd_ht{ROMAN1}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $roman1;
|
518 |
+
$sd_ht{ROMAN2}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $roman2;
|
519 |
+
$sd_ht{COMB_LEFT_ROMAN1}->{$chart_comb_id}->{$chart1_end}->{$chart2_end}
|
520 |
+
= $sd_ht{COMB_LEFT_ROMAN1}->{$chart_comb_id}->{$chart1_start}->{$chart2_start} . $roman1;
|
521 |
+
$sd_ht{COMB_LEFT_ROMAN2}->{$chart_comb_id}->{$chart1_end}->{$chart2_end}
|
522 |
+
= $sd_ht{COMB_LEFT_ROMAN2}->{$chart_comb_id}->{$chart1_start}->{$chart2_start} . $roman2;
|
523 |
+
$comb_left_roman1 = $sd_ht{COMB_LEFT_ROMAN1}->{$chart_comb_id}->{$chart1_end}->{$chart2_end};
|
524 |
+
# print STDERR " Comb-left-roman1($chart_comb_id,$chart1_end,$chart2_end) = $comb_left_roman1\n";
|
525 |
+
$sd_ht{INCR_COST_IJ}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $cost;
|
526 |
+
$sd_ht{COST_RULE}->{$chart_comb_id}->{$chart1_end}->{$chart2_end} = $cost_rule_id;
|
527 |
+
print STDERR " New cost $chart1_end/$chart2_end: $combined_cost (+$cost from $chart1_start/$chart2_start $roman1/$roman2)\n" if $verbose;
|
528 |
+
}
|
529 |
+
}
|
530 |
+
}
|
531 |
+
}
|
532 |
+
}
|
533 |
+
}
|
534 |
+
}
|
535 |
+
}
|
536 |
+
$further_expansion_possible = 1
|
537 |
+
if $ht{RULE_STRING_HAS_EXPANSION}->{$lang_code1}->{$roman1}
|
538 |
+
|| $ht{RULE_STRING_HAS_EXPANSION}->{""}->{$roman1};
|
539 |
+
# print STDERR " further_expansion_possible: $further_expansion_possible (lc: $lang_code1 r1: $roman1) ***\n";
|
540 |
+
}
|
541 |
+
# print STDERR " last C1 $chart1_start-$chart1_end (@romans1)\n" unless $further_expansion_possible;
|
542 |
+
last unless $further_expansion_possible;
|
543 |
+
$prev_further_expansion_possible = 1 if $further_expansion_possible;
|
544 |
+
}
|
545 |
+
}
|
546 |
+
my $total_cost = $sd_ht{COST_IJ}->{$chart_comb_id}->{$chart1_end_pos}->{$chart2_end_pos};
|
547 |
+
unless (defined($total_cost)) {
|
548 |
+
$total_cost = 99.9999;
|
549 |
+
$sd_ht{MISMATCH}->{$chart_comb_id} = 1;
|
550 |
+
}
|
551 |
+
return $total_cost;
|
552 |
+
}
|
553 |
+
|
554 |
+
sub print_sd_ht {
|
555 |
+
local($this, *sd_ht, $chart1_id, $chart2_id, *OUT) = @_;
|
556 |
+
|
557 |
+
print OUT "string-distance chart:\n";
|
558 |
+
foreach $chart_id (($chart1_id, $chart2_id)) {
|
559 |
+
print OUT "SD chart $chart_id:\n";
|
560 |
+
foreach $from_linear_chart_pos (sort { $a <=> $b } keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}}) {
|
561 |
+
foreach $to_linear_chart_pos (sort { $a <=> $b } keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$from_linear_chart_pos}}) {
|
562 |
+
foreach $roman_char (sort keys %{$sd_ht{LIN_IJ_ROMAN}->{$chart_id}->{$from_linear_chart_pos}->{$to_linear_chart_pos}}) {
|
563 |
+
print OUT " Lnode($from_linear_chart_pos-$to_linear_chart_pos): $roman_char\n";
|
564 |
+
}
|
565 |
+
}
|
566 |
+
}
|
567 |
+
}
|
568 |
+
}
|
569 |
+
|
570 |
+
sub print_chart_ht {
|
571 |
+
local($this, *chart_ht, *OUT) = @_;
|
572 |
+
|
573 |
+
print OUT "uroman chart:\n";
|
574 |
+
foreach $start (sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AT}}) {
|
575 |
+
foreach $end (sort { $a <=> $b } keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}}) {
|
576 |
+
foreach $node_id (keys %{$chart_ht{NODES_STARTING_AND_ENDING_AT}->{$start}->{$end}}) {
|
577 |
+
$roman_s = $chart_ht{NODE_ROMAN}->{$node_id};
|
578 |
+
print OUT " Node $node_id ($start-$end): $roman_s\n";
|
579 |
+
}
|
580 |
+
}
|
581 |
+
}
|
582 |
+
}
|
583 |
+
|
584 |
+
sub normalize_string {
|
585 |
+
local($this, $s) = @_;
|
586 |
+
|
587 |
+
# $s =~ s/(\xE2\x80\x8C)//g; # delete zero width non-joiner
|
588 |
+
$s =~ s/(\xE2\x80[\x93-\x94])/-/g; # en-dash, em-dash
|
589 |
+
$s =~ s/([\x00-\x7F\xC0-\xFE][\x80-\xBF]*)\1+/$1$1/g; # shorten 3 or more occurrences of same character in a row to 2
|
590 |
+
$s =~ s/[ \t]+/ /g;
|
591 |
+
|
592 |
+
return $s;
|
593 |
+
}
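To make the effect of `normalize_string` concrete, the sketch below applies the same three substitutions to a few byte strings. `normalize_sketch` is a hypothetical standalone copy written as a plain function for illustration. It assumes the input is a UTF-8 encoded byte string, as the byte-level patterns in the original suggest.

```perl
use strict;
use warnings;

sub normalize_sketch {
    my ($s) = @_;
    # En dash (E2 80 93) and em dash (E2 80 94) become an ASCII hyphen.
    $s =~ s/\xE2\x80[\x93\x94]/-/g;
    # A run of the same UTF-8 character is capped at two occurrences.
    $s =~ s/([\x00-\x7F\xC0-\xFE][\x80-\xBF]*)\1+/$1$1/g;
    # Runs of spaces and tabs collapse to a single space.
    $s =~ s/[ \t]+/ /g;
    return $s;
}

foreach my $in ("coooool", "a\xE2\x80\x93b", "a \t  b") {
    print normalize_sketch($in), "\n";
}
```

Capping runs at two rather than one keeps genuine double letters intact while still folding emphatic spellings together.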
|
594 |
+
|
595 |
+
my $string_distance_chart_id = 0;
|
596 |
+
sub string_distance_by_chart {
|
597 |
+
local($this, $s1, $s2, $lang_code1, $lang_code2, *ht, *pinyin_ht, $control) = @_;
|
598 |
+
|
599 |
+
$control = "" unless defined($control);
|
600 |
+
%sd_ht = ();
|
601 |
+
|
602 |
+
$s1 = $this->normalize_string($s1);
|
603 |
+
my $lc_s1 = $utf8->extended_lower_case($s1);
|
604 |
+
$string_distance_chart_id++;
|
605 |
+
my $chart1_id = $string_distance_chart_id;
|
606 |
+
*chart_ht = $romanizer->romanize($lc_s1, $lang_code1, "", *ht, *pinyin_ht, 0, "return chart", $chart1_id);
|
607 |
+
$this->linearize_chart_points(*chart_ht, $chart1_id, *sd_ht);
|
608 |
+
$this->expand_lin_ij_roman(*sd_ht, $chart1_id, $lang_code1, *ht);
|
609 |
+
|
610 |
+
$s2 = $this->normalize_string($s2);
|
611 |
+
my $lc_s2 = $utf8->extended_lower_case($s2);
|
612 |
+
$string_distance_chart_id++;
|
613 |
+
my $chart2_id = $string_distance_chart_id;
|
614 |
+
*chart_ht = $romanizer->romanize($lc_s2, $lang_code2, "", *ht, *pinyin_ht, 0, "return chart", $chart2_id);
|
615 |
+
$this->linearize_chart_points(*chart_ht, $chart2_id, *sd_ht);
|
616 |
+
$this->expand_lin_ij_roman(*sd_ht, $chart2_id, $lang_code2, *ht);
|
617 |
+
|
618 |
+
my $cost = $this->string_distance(*sd_ht, $chart1_id, $chart2_id, $lang_code1, $lang_code2, *ht, $control);
|
619 |
+
return $cost;
|
620 |
+
}
|
621 |
+
|
622 |
+
my $n_quick_romanized_string_distance = 0;
|
623 |
+
sub quick_romanized_string_distance_by_chart {
|
624 |
+
local($this, $s1, $s2, *ht, $control, $lang_code1, $lang_code2) = @_;
|
625 |
+
|
626 |
+
# my $verbose = ($s1 eq "apit") && ($s2 eq "apet");
|
627 |
+
# print STDERR "Start quick_romanized_string_distance_by_chart\n";
|
628 |
+
$s1 = lc $s1;
|
629 |
+
$s2 = lc $s2;
|
630 |
+
$control = "" unless defined($control);
|
631 |
+
$lang_code1 = "" unless defined($lang_code1);
|
632 |
+
$lang_code2 = "" unless defined($lang_code2);
|
633 |
+
my $cache_p = ($control =~ /cache/);
|
634 |
+
my $total_cost;
|
635 |
+
if ($cache_p) {
|
636 |
+
$total_cost = $ht{CACHED_QRSD}->{$s1}->{$s2};
|
637 |
+
if (defined($total_cost)) {
|
638 |
+
return $total_cost;
|
639 |
+
}
|
640 |
+
}
|
641 |
+
my @lang_codes1 = ($lang_code1 eq "") ? ("") : ($lang_code1, "");
|
642 |
+
my @lang_codes2 = ($lang_code2 eq "") ? ("") : ($lang_code2, "");
|
643 |
+
my $chart1_end_pos = length($s1);
|
644 |
+
my $chart2_end_pos = length($s2);
|
645 |
+
my %sd_ht = ();
|
646 |
+
$sd_ht{COST_IJ}->{0}->{0} = 0;
|
647 |
+
foreach $chart1_start ((0 .. $chart1_end_pos)) {
|
648 |
+
foreach $chart1_end (($chart1_start .. $chart1_end_pos)) {
|
649 |
+
my $substr1 = substr($s1, $chart1_start, ($chart1_end-$chart1_start));
|
650 |
+
foreach $lang_code1o (@lang_codes1) {
|
651 |
+
foreach $lang_code2o (@lang_codes2) {
|
652 |
+
# next unless defined($ht{COST}->{$lang_code1o}->{$lang_code2o}->{$substr1});
|
653 |
+
}
|
654 |
+
}
|
655 |
+
my @chart2_starts = (sort { $a <=> $b } keys %{$sd_ht{COST_IJ}->{$chart1_start}});
|
656 |
+
foreach $chart2_start (@chart2_starts) {
|
657 |
+
foreach $chart2_end (($chart2_start .. $chart2_end_pos)) {
|
658 |
+
my $substr2 = substr($s2, $chart2_start, ($chart2_end-$chart2_start));
|
659 |
+
foreach $lang_code1o (@lang_codes1) {
|
660 |
+
foreach $lang_code2o (@lang_codes2) {
|
661 |
+
if ($substr1 eq $substr2) {
|
662 |
+
my $cost = 0;
|
663 |
+
my $preceding_cost = $sd_ht{COST_IJ}->{$chart1_start}->{$chart2_start};
|
664 |
+
if (defined($preceding_cost)) {
|
665 |
+
my $combined_cost = $preceding_cost + $cost;
|
666 |
+
my $old_cost = $sd_ht{COST_IJ}->{$chart1_end}->{$chart2_end};
|
667 |
+
if ((! defined($old_cost)) || ($combined_cost < $old_cost)) {
|
668 |
+
$sd_ht{COST_IJ}->{$chart1_end}->{$chart2_end} = $combined_cost;
|
669 |
+
push(@chart2_starts, $chart2_end) unless $util->member($chart2_end, @chart2_starts);
|
670 |
+
}
|
671 |
+
}
|
672 |
+
} else {
|
673 |
+
next unless defined($ht{COST}->{$lang_code1o}->{$lang_code2o}->{$substr1}->{$substr2});
|
674 |
+
my @cost_rule_ids = keys %{$ht{COST}->{$lang_code1o}->{$lang_code2o}->{$substr1}->{$substr2}};
|
675 |
+
my $best_cost = 99.99;
|
676 |
+
foreach $cost_rule_id (@cost_rule_ids) {
|
677 |
+
my $cost = $ht{COST}->{$lang_code1o}->{$lang_code2o}->{$substr1}->{$substr2}->{$cost_rule_id};
|
678 |
+
my $left_context_rule1 = $ht{LEFT1}->{$cost_rule_id};
|
679 |
+
next if $left_context_rule1
|
680 |
+
&& (! (substr($s1, 0, $chart1_start) =~ /$left_context_rule1/));
|
681 |
+
my $left_context_rule2 = $ht{LEFT2}->{$cost_rule_id};
|
682 |
+
next if $left_context_rule2
|
683 |
+
&& (! (substr($s2, 0, $chart2_start) =~ /$left_context_rule2/));
|
684 |
+
my $right_context_rule1 = $ht{RIGHT1}->{$cost_rule_id};
|
685 |
+
my $right_context1 = substr($s1, $chart1_end);
|
686 |
+
next if $right_context_rule1
|
687 |
+
&& (! (($right_context1 =~ /^$right_context_rule1/)
|
688 |
+
|| (($right_context_rule1 =~ /^\[[^\[\]]*\$/)
|
689 |
+
&& ($right_context1 eq ""))));
|
690 |
+
my $right_context_rule2 = $ht{RIGHT2}->{$cost_rule_id};
|
691 |
+
my $right_context2 = substr($s2, $chart2_end);
|
692 |
+
next if $right_context_rule2
|
693 |
+
&& (! (($right_context2 =~ /^$right_context_rule2/)
|
694 |
+
|| (($right_context_rule2 =~ /^\[[^\[\]]*\$/)
|
695 |
+
&& ($right_context2 eq ""))));
|
696 |
+
$best_cost = $cost if $cost < $best_cost;
|
697 |
+
my $preceding_cost = $sd_ht{COST_IJ}->{$chart1_start}->{$chart2_start};
|
698 |
+
my $combined_cost = $preceding_cost + $cost;
|
699 |
+
my $old_cost = $sd_ht{COST_IJ}->{$chart1_end}->{$chart2_end};
|
700 |
+
if ((! defined($old_cost)) || ($combined_cost < $old_cost)) {
|
701 |
+
$sd_ht{COST_IJ}->{$chart1_end}->{$chart2_end} = $combined_cost;
|
702 |
+
push(@chart2_starts, $chart2_end) unless $util->member($chart2_end, @chart2_starts);
|
703 |
+
}
|
704 |
+
}
|
705 |
+
}
|
706 |
+
}
|
707 |
+
}
|
708 |
+
}
|
709 |
+
}
|
710 |
+
}
|
711 |
+
}
|
712 |
+
$total_cost = $sd_ht{COST_IJ}->{$chart1_end_pos}->{$chart2_end_pos};
|
713 |
+
$total_cost = 99.99 unless defined($total_cost);
|
714 |
+
$ht{CACHED_QRSD}->{$s1}->{$s2} = $total_cost if $cache_p;
|
715 |
+
$n_quick_romanized_string_distance++;
|
716 |
+
return $total_cost;
|
717 |
+
}
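The quick variant above is a dynamic program over pairs of string positions. Identical substrings align at cost 0, and differing substrings align only when a cost rule exists for the pair. The sketch below reduces that to its core. It is an illustration under stated assumptions, not the module's implementation: `%cost` and `sketch_distance` are hypothetical, and language codes, context rules and caching are omitted.

```perl
use strict;
use warnings;

# Hypothetical cost table: $cost{substring1}{substring2} = cost.
# The empty string on one side models a deletion or insertion.
my %cost = (
    'i'  => { 'e' => 0.2 },
    'ph' => { 'f' => 0.1 },
    'h'  => { ''  => 0.3 },
);

sub sketch_distance {
    my ($s1, $s2) = @_;
    my ($n1, $n2) = (length($s1), length($s2));
    my @best;                 # $best[$i][$j]: cheapest cost to align the two prefixes
    $best[0][0] = 0;
    for my $i1 (0 .. $n1) {
        for my $j1 (0 .. $n2) {
            my $prev = $best[$i1][$j1];
            next unless defined $prev;          # position pair not reachable
            for my $i2 ($i1 .. $n1) {
                my $sub1 = substr($s1, $i1, $i2 - $i1);
                for my $j2 ($j1 .. $n2) {
                    next if $i2 == $i1 && $j2 == $j1;   # no progress
                    my $sub2 = substr($s2, $j1, $j2 - $j1);
                    my $c = ($sub1 eq $sub2) ? 0
                          : exists $cost{$sub1} ? $cost{$sub1}{$sub2}
                          : undef;
                    next unless defined $c;     # no rule for this pair
                    my $old = $best[$i2][$j2];
                    $best[$i2][$j2] = $prev + $c
                        if !defined($old) || $prev + $c < $old;
                }
            }
        }
    }
    my $total = $best[$n1][$n2];
    return defined($total) ? $total : 99.99;    # same sentinel as the original
}

printf "%.2f\n", sketch_distance("apit", "apet");
```

The original sub tracks reachable start positions through a hash and a growing `@chart2_starts` list instead of scanning a full table. The dense table here trades that economy for brevity.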
|
718 |
+
|
719 |
+
sub get_n_quick_romanized_string_distance {
|
720 |
+
return $n_quick_romanized_string_distance;
|
721 |
+
}
|
722 |
+
|
723 |
+
1;
|
724 |
+
|
uroman/lib/NLP/utilities.pm
ADDED
The diff for this file is too large to render.
See raw diff
|
|
uroman/tarballs/uroman-v1.0.tar.gz
ADDED
Binary file (440 kB). View file
|
|
uroman/tarballs/uroman-v1.1.tar.gz
ADDED
Binary file (507 kB). View file
|
|
uroman/tarballs/uroman-v1.2.4.tar.gz
ADDED
Binary file (504 kB). View file
|
|
uroman/tarballs/uroman-v1.2.5.tar.gz
ADDED
Binary file (576 kB). View file
|
|
uroman/tarballs/uroman-v1.2.6.tar.gz
ADDED
Binary file (568 kB). View file
|
|
uroman/tarballs/uroman-v1.2.7.tar.gz
ADDED
Binary file (567 kB). View file
|
|
uroman/tarballs/uroman-v1.2.tar.gz
ADDED
Binary file (495 kB). View file
|
|
uroman/test/multi-script.txt
ADDED
@@ -0,0 +1,32 @@
1 |
+
::lcode deu Grüße aus Bordeaux
|
2 |
+
::lcode tur İstanbul, Türkiye'de yer alan şehir ve ülkenin 81 ilinden biri.
|
3 |
+
::lcode eng ⠠⠺⠑⠀⠓⠕⠇⠙⠀⠘⠮⠀⠞⠗⠥⠹⠎⠀⠞⠕⠀⠆⠀⠎⠑⠇⠋⠤⠑⠧⠊⠙⠢⠞⠂⠀⠞⠀⠁⠇⠇⠀⠍⠑⠝⠀⠜⠑⠀⠉⠗⠂⠞⠫⠀⠑⠟⠥⠁⠇⠂⠀⠞⠀⠮⠽⠀⠜⠑⠀⠑⠝⠙⠪⠫⠀⠃⠽⠀⠸⠮⠀⠠⠉⠗⠑⠁⠞⠕⠗⠀⠾⠀⠉⠻⠞⠁⠔⠀⠥⠝⠁⠇⠊⠑⠝⠁⠃⠇⠑⠀⠠⠐⠗⠎⠂⠀⠞⠀⠁⠍⠰⠛⠀⠘⠮⠀⠜⠑⠀⠠⠇⠊⠋⠑⠂⠀⠠⠇⠊⠃⠻⠞⠽⠀⠯⠀⠮⠀⠏⠥⠗⠎⠥⠊⠞⠀⠷⠀⠠⠓⠁⠏⠏⠊⠰⠎⠲
|
4 |
+
::lcode ell Το Λος Άντζελες (στα ισπανικά Los Angeles = Οι Άγγελοι) ή στην Αμερικανική αργκό L.A., ελ έι) είναι η δεύτερη μεγαλύτερη πόλη των Ηνωμένων Πολιτειών από άποψη πληθυσμού, καθώς και ένα από τα σημαντικότερα οικονομικά, πολιτιστικά επιστημονικά και ψυχαγωγικά κέντρα του κόσμου.
|
5 |
+
::lcode rus Герма́ния (нем. Deutschland), официальное название — Федерати́вная Респу́блика Герма́ния (нем. Bundesrepublik Deutschland), ФРГ (нем. BRD) — государство в Западной Европе. Площадь территории — 357 021 км². Численность населения по переписи 2011 года — более 80 миллионов человек. [2][6].
|
6 |
+
::lcode ukr Володи́мир Олекса́ндрович Зеле́нський (нар. 25 січня 1978, Кривий Ріг) — український державний діяч, політик, шоумен, актор, комік, режисер, продюсер та сценарист, шостий Президент України з 20 травня 2019 року.
|
7 |
+
::lcode srp Сва људска бића рађају се слободна и једнака у достојанству и правима. Она су обдарена разумом и свешћу и треба једни према другима да поступају у духу братства.
|
8 |
+
::lcode ara كندا (بالإنجليزية: Canada) هي دولة في أمريكا الشمالية تتألف من 10 مقاطعات وثلاثة أقاليم. تقع في القسم الشمالي من القارة وتمتد من المحيط الأطلسي في الشرق إلى المحيط الهادئ في الغرب وتمتد شمالاً في المحيط المتجمد الشمالي. كندا هي البلد الثاني عالمياً من حيث المساحة الكلية. كما أن حدود كندا المشتركة مع الولايات المتحدة من الجنوب والشمال الغربي هي الأطول في العالم.
|
9 |
+
::lcode fas کالیفرنیا (به انگلیسی: California) ایالتی در غرب آمریکا بر کرانهٔ اقیانوس آرام است. مرکز آن ساکرامنتو و شهرهای مهم آن لسآنجلس، سن دیگو، سن خوزه و سانفرانسیسکو هستند.همچنین این ایالت پر جمعیت ترین ایالت امریکا است.
|
10 |
+
::lcode uig ئامېرىكا قوشما شتاتلىرى بولسا شىمالىي ئامېرىكاغا جايلاشقان بىر دۆلەت. ئۇنىڭ پايتەختى بولسا ۋاشىنگتون، ئەڭ چوڭ شەھىرى بولسا نيۇيورك شەھىرى. دۆلەت تىلى بولسا ئېنگلىزتىلى. ھازىرقى زۇڭتۇڭ باراك ئوباما. بۇ دۆلەت ئەسلىدە ئەنگىلىيەنىڭ مۇستەملىكىسى بولۇپ ۋاشىنگىتوننىڭ رەھپەرلىكىدە 1776 يىلى 7 ئاينىڭ 4 كۇنى مۇستەقىل بولغان، يەر مەيدانى 9 مىلىيون 826 مىڭ 630 كۋادىرات كلومېتىر، نوپۇسى 306 مىللىيون 142 مىڭ، بۇلارنىڭ ئاسساسلىق دىنى خرىستىئان دىنى.
|
11 |
+
::lcode amh ኢትዮጵያ ከዓለም ሶስቱ ትልቅ የአብርሃም ሀይማኖቶች ጋር ታሪካዊ ግንኙነት አላት።
|
12 |
+
::lcode hin कैलिफ़ोर्निया शब्द का पहला अर्थ था जो क्षेत्र जहाँ आज बाहा कैलिफ़ोर्निया प्रायद्वीप, नेवाडा, यूटा और एरिज़ोना, नया मेक्सिको, और वायोमिंग के कई विभाग स्थित हैं।
|
13 |
+
::lcode mar लंडन (इंग्लिश: London ) हे इंग्लंडचे व युनायटेड किंग्डमचे राजधानीचे व सर्वात मोठे शहर तसेच युरोपियन संघामधील सर्वात मोठे महानगर क्षेत्र आहे.
|
14 |
+
::lcode nep यसको उचाइ समुन्द्र सतहबाट ८,८४८ मीटर (२९,०२८ फीट) छ। यो नेपालको सोलुखुम्बु जिल्लाको खुम्जुङ्ग गा. वि. स. मा पर्छ ।
|
15 |
+
::lcode tam தமிழ்நாடு (Tamil Nadu) இந்தியாவின் 29 மாநிலங்களில் ஒன்றாகும். தமிழ்நாடு, தமிழகம் என்றும் பரவலாக அழைக்கப்படுகிறது.
|
16 |
+
::lcode mal ഇന്ത്യയുടെ തെക്കുപടിഞ്ഞാറെ അറ്റത്തുള്ള സംസ്ഥാനമാണ് കേരളം.
|
17 |
+
::lcode ori ଓଡ଼ିଶା ଭାରତର ପୂର୍ବ ଉପକୂଳରେ ଥିବା ଏକ ପ୍ରଶାସନିକ ରାଜ୍ୟ । ଏହାର ଉତ୍ତର-ପୂର୍ବରେ ପଶ୍ଚିମବଙ୍ଗ, ଉତ୍ତରରେ ଝାଡ଼ଖଣ୍ଡ, ପଶ୍ଚିମ ଓ ଉତ୍ତର-ପଶ୍ଚିମରେ ଛତିଶଗଡ଼, ଦକ୍ଷିଣ ଓ ଦକ୍ଷିଣ-ପଶ୍ଚିମରେ ଆନ୍ଧ୍ରପ୍ରଦେଶ ଅବସ୍ଥିତ । ଏହା ଆୟତନ ହିସାବରେ ନବମ ଓ ଜନସଂଖ୍ୟା ହିସାବରେ ଏଗାରତମ ରାଜ୍ୟ । ଓଡ଼ିଆ ଭାଷା ରାଜ୍ୟର ସରକାରୀ ଭାଷା । ୨୦୦୧ ଜନଗଣନା ଅନୁସାରେ ରାଜ୍ୟର ପ୍ରାୟ ୩୩.୨ ନିୟୁତ ଲୋକ ଓଡ଼ିଆ ଭାଷା ବ୍ୟବହାର କରନ୍ତି ।
|
18 |
+
::lcode zho 加拿大在一万四千年前即有原住民在此生活。
|
19 |
+
::lcode heb כֹּל עוֹד בַּלֵּבָב פְּנִימָה נֶפֶשׁ יְהוּדִי הוֹמִיָּה וּלְפַאֲתֵי מִזְרָח, קָדִימָה, עַיִן לְצִיּוֹן צוֹפִיָּה, עוֹד לֹא אָבְדָה תִּקְוָתֵנוּ, הַתִּקְוָה בַּת שְׁנוֹת אַלְפַּיִם לִהְיוֹת עַם חָפְשִׁי בְּאַרְצֵנוּ, אֶרֶץ צִיּוֹן וִירוּשָׁלַיִם.
|
20 |
+
::lcode yid דווקא איז אן העברעישער זשורנאל וואס באשרייבט די יידיש־שפראכיקע קולטור. עס איז דערשינען געווארן תמוז ה'תשס"ז (יולי 2006).
|
21 |
+
::lcode hye Տալնոեի շրջան (ուկր.՝ Тальнівський район), շրջան Ուկրաինայի Չերկասիի մարզում։ Ստեղծվել է 1923 թվականին։ Վարչական կենտրոնը՝ Տալնոե։ Աշխարհագրությունը Շրջանի տարածքի մակերեսը կազմում է 917 կմ²։ Բնակչություն
|
22 |
+
::lcode tai มีประเทศอิสระ 2 ประเทศ คือ ซานมารีโนและนครรัฐวาติกัน เป็นดินแดนที่ล้อมรอบไปด้วยพื้นที่ของอิตาลี ในขณะที่เมืองกัมปีโอเนดีตาเลีย เป็นดินแดนส่วนแยกของอิตาลีที่ถูกล้อมรอบด้วยพื้นที่ประเทศสวิตเซอร์แลนด์
|
23 |
+
북쪽에는 인도네시아와 동티모르, 파푸아 뉴기니, 북동쪽에는 솔로몬 제도와 바누아투, 누벨칼레도니, 그리고 남동쪽에는 뉴질랜드가 있다.
|
24 |
+
ಬಾ ಇಲ್ಲಿ ಸಂಭವಿಸು ಇಂದೆನ್ನ ಹೃದಯದಲಿ ನಿತ್ಯವೂ ಅವತರಿಪ ಸತ್ಯಾವತಾರ ಮಣ್ಣಾಗಿ ಮರವಾಗಿ ಮಿಗವಾಗಿ ಕಗವಾಗೀ... ಮಣ್ಣಾಗಿ ಮರವಾಗಿ ಮಿಗವಾಗಿ ಕಗವಾಗಿ ಭವ ಭವದಿ ಭತಿಸಿಹೇ ಭವತಿ ದೂರ ನಿತ್ಯವೂ ಅವತರಿಪ ಸತ್ಯಾವತಾರ || ಬಾ ಇಲ್ಲಿ ||
|
25 |
+
ვეპხის ტყაოსანი შოთა რუსთაველი ღმერთსი შემვედრე, ნუთუ კვლა დამხსნას სოფლისა შრომასა, ცეცხლს, წყალსა და მიწასა, ჰაერთა თანა მრომასა; მომცნეს ფრთენი და აღვფრინდე, მივჰხვდე მას ჩემსა ნდომასა, დღისით და ღამით ვჰხედვიდე მზისა ელვათა კრთომაასა.
|
26 |
+
᚛ᚐᚅᚋ ᚋᚖᚂᚓᚌᚖᚋᚏᚔᚇ ᚋᚐᚉᚔ ᚍᚓᚉᚒᚋᚓᚅ᚜
|
27 |
+
ᛁᚳ᛫ᛗᚨᚷ᛫ᚷᛚᚨᛋ᛫ᛖᚩᛏᚪᚾ᛫ᚩᚾᛞ᛫ᚻᛁᛏ᛫ᚾᛖ᛫ᚻᛖᚪᚱᛗᛁᚪᚧ᛫ᛗᛖ᛬
|
28 |
+
𓊪𓏏𓍯𓃭𓐝𓇌𓋴
|
29 |
+
チェコスロバキア
|
30 |
+
ལྷ་ས་གྲོང་ཁྱེར
|
31 |
+
ᓵᓕ ᓴᕕᐊᕐᔪᒃ ᐃᒻᒥᓂᒃ ᓂᓪᓕᕈᑎᖃᓲᖑᕗᖅ ᑕᐃᑦᓱᒪᓂᑕᑦᓴᔭᐅᓂᕋᕐᓱᓂ. ᐃᒻᒥᓂᓪᓗᑕᐅᖅ ᓂᓪᓕᕈᑎᖃᓱᖑᒻᒥᓱᓂ ᐅᓪᓗᒥᓂᑕᑦᓴᔭᐅᓂᕋᕐᓱᓂ.
|
32 |
+
ⴰⵎⴰⴳⵔⴰⴷ 1 ⴰⵔ ⴷ ⵜⵜⵍⴰⵍⴰⵏ ⵎⵉⴷⴷⵏ ⴳⴰⵏ ⵉⵍⴻⵍⵍⵉⵜⵏ ⵎⴳⴰⴷⴷⴰⵏ ⵖ ⵡⴰⴷⴷⵓⵔ ⴷ ⵉⵣⵔⴼⴰⵏ, ⵢⵉⵍⵉ ⴰⴽⵯ ⴷⴰⵔⵙⵏ ⵓⵏⵍⵍⵉ ⴷ ⵓⴼⵔⴰⴽ, ⵉⵍⵍⴰ ⴼⵍⵍⴰ ⵙⵏ ⴰⴷ ⵜⵜⵎⵢⴰⵡⴰⵙⵏ ⵏⴳⵔⴰⵜⵙⵏ ⵙ ⵜⴰⴳⵎⴰⵜ.
|
uroman/test/multi-script.uroman-ref.txt
ADDED
@@ -0,0 +1,32 @@
1 |
+
::lcode deu Gruesse aus Bordeaux
|
2 |
+
::lcode tur Istanbul, Tuerkiye'de yer alan shehir ve uelkenin 81 ilinden biri.
|
3 |
+
::lcode eng We hold ⠘e truos to ; self-evid⠢t, t all men aee cr,te equal, t ey aee endoee by ⠸e Creator u cita⠔ unalienable ⠠⠐rs, t amg ⠘e aee Life, Libity ⠯ e pursuit a Happis.
|
4 |
+
::lcode ell To Los Andzeles (sta ispanika Los Angeles = Oi Angeloi) e sten Amerikanike arngo L.A., el ei) einai e deutere megalutere pole ton Enomenon Politeion apo apopse plethysmou, kathos kai ena apo ta semandikotera oikonomika, politistika epistemonika kai psychagogika kendra tou kosmou.
|
5 |
+
::lcode rus Germaniya (nem. Deutschland), ofitsialnoe nazvanie — Federativnaya Respublika Germaniya (nem. Bundesrepublik Deutschland), FRG (nem. BRD) — gosudarstvo v Zapadnoi Evrope. Ploshchad territorii — 357 021 km². Chislennost naseleniya po perepisi 2011 goda — bolee 80 millionov chelovek. [2][6].
|
6 |
+
::lcode ukr Volodimir Oleksandrovich Zelensky (nar. 25 sichnya 1978, Krivy Rig) — ukrayinsky derzhavny diyach, politik, shoumen, aktor, komik, rezhiser, prodyuser ta stsenarist, shosty Prezident Ukrayini z 20 travnya 2019 roku.
|
7 |
+
::lcode srp Sva ljudska bitsha radjaju se slobodna i jednaka u dostojanstvu i pravima. Ona su obdarena razumom i sveshtshu i treba jedni prema drugima da postupaju u dukhu bratstva.
|
8 |
+
::lcode ara knda (balinjlyzya: Canada) hy dwla fy amryka alshmalya ttalf mn 10 mqat'at wthlatha aqalym. tq' fy alqsm alshmaly mn alqara wtmtd mn almhyt alatlsy fy alshrq ila almhyt alhadye fy alghrb wtmtd shmalan fy almhyt almtjmd alshmaly. knda hy albld althany 'almyan mn hyth almsaha alklya. kma an hdwd knda almshtrka m' alwlayat almthda mn aljnwb walshmal alghrby hy alatwl fy al'alm.
|
9 |
+
::lcode fas kalifrnia (bh anglisi: California) ialti dr ghrb amrika br kranh' aqianws aram ast. mrkz an sakramntw w shhrhai mhm an lsanjls, sn digw, sn khwzh w sanfransiskw hstnd.hmtchnin in ialt pr jm'it trin ialt amrika ast.
|
10 |
+
::lcode uig yeameraka qwshma shtatlara bwlsa shamalay yeamerakagha jaylashqan bar doelaet. yeunang paytaekhta bwlsa vashangtwn, yeaeng tchwng shaehara bwlsa nyuywrk shaehara. doelaet tala bwlsa yeenglaztala. hazarqa zungtung barak yewbama. bu doelaet yeaesladae yeaengalayaenang mustaemlakasa bwlup vashangatwnnang raehpaerlakadae 1776 yala 7 yeaynang 4 kuna mustaeqal bwlghan, yaer maeydana 9 malaywn 826 mang 630 kvadarat klwmetar, nwpusa 306 mallaywn 142 mang, bularnang yeassaslaq dana khrastayean dana.
|
11 |
+
::lcode amh iteyopheyaa kaaalame sosetu teleqe yaaberehaame hayemaanotoche gaare taarikaawi genenyunate alaate.
|
12 |
+
::lcode hin kailiphorniyaa shabda kaa pahalaa artha thaa jo kssetra jahaam aaj baahaa kailiphorniyaa praayadviip, nevaaddaa, yuuttaa aur erijonaa, nayaa meksiko, aur vaayomimga ke kaii vibhaag sthit haim.
|
13 |
+
::lcode mar lamddan (imglish: London ) he imglamddace va yunaayattedd kimgddamace raajadhaaniice va sarvaat motthe shahar tasec yuropiyan samghaamadhiil sarvaat motthe mahaanagar kssetra aahe.
|
14 |
+
::lcode nep yasako ucaai samundra satahabaatt 8,848 miittar (29,028 phiitt) cha. yo nepaalako solukhumbu jillaako khumjungga gaa. vi. sa. maa parcha .
|
15 |
+
::lcode tam tamilnaadu (Tamil Nadu) intiyaavin 29 maanilangkalil onraakum. tamilnaadu, tamilakam enrum paravalaaka alaikkappadukiratu.
|
16 |
+
::lcode mal intyayutte tekkupattinynyaarre arrrrattulllla samsthaanamaann keerallam.
|
17 |
+
::lcode ori oddishaa bhaaratara puurba upakuullare thibaa eka prashaasanika raajya . ehaara uttara-puurbare pashcimabangga, uttarare jhaaddakhanndda, pashcima o uttara-pashcimare chatishagadda, dakssinna o dakssinna-pashcimare aandhrapradesha abasthita . ehaa aayatana hisaabare nabama o janasamkhyaa hisaabare egaaratama raajya . oddiaa bhaassaa raajyara sarakaarii bhaassaa . 2001 janagannanaa anusaare raajyara praaya 33.2 niyuta loka oddiaa bhaassaa byabahaara karanti .
|
18 |
+
::lcode zho jianadazai14000nianqianjiyouyuanzhuminzaicishenghuo.
|
19 |
+
::lcode heb kol 'od balevav penimah nefesh yehudi homiyah ulefa'ate mizerach, qadimah, 'ayin letsiyon tsofiyah, 'od lo avedah tiqvatenu, hatiqvah bat shenot 'alepayim liheyot 'am chafeshiy be'aretsenu, erets tsiyon virushalayim.
|
20 |
+
::lcode yid dvvqa ayz an h'vr'ysh'r zshvrnal vvas vashryyvt dy yydysh-shfrakyq' qvltvr. 's ayz d'rshyn'n g'vvarn tmvz h'tshs"z (yvly 2006).
|
21 |
+
::lcode hye Talnoei shrjan (ukr., Talnivsky raion), shrjan Ukrainayi Cherkasii marzum. Steghtsvel e 1923 tvakanin. Varchakan kentrone, Talnoe. Ashkharhagrutyune Shrjani taratski makerese kazmum e 917 km². Bnakchutyun
|
22 |
+
::lcode tai miipratesisra 2 prates kuee saanmaariinolaeankrratwaatikan peondindaentiilomrobpaidwypueentiikongitaalii naiknatiimeueengkampiionediitaaleiiy peondindaenswnyaekkongitaaliitiituuklomrobdwypueentiipratesswitserlaend
|
23 |
+
bugjjogeneun indonesiawa dongtimoreu, papua nyugini, bugdongjjogeneun solromon jedowa banuatu, nubelkalredoni, geurigo namdongjjogeneun nyujilraendeuga issda.
|
24 |
+
baa illi sambhavisu imdenna hrdayadali nityavuu avataripa satyaavataara mannnnaagi maravaagi migavaagi kagavaagii... mannnnaagi maravaagi migavaagi kagavaagi bhava bhavadi bhatisihee bhavati duura nityavuu avataripa satyaavataara || baa illi ||
|
25 |
+
vepxis tqaosani shota rustaveli ghmertsi shemvedre, nutu kvla damxsnas sophlisa shromasa, tsetsxls, tsqalsa da mitsasa, haerta tana mromasa; momtsnes phrteni da aghvphrinde, mivhxvde mas chemsa ndomasa, dghisit da ghamit vhxedvide mzisa elvata krtomaasa.
|
26 |
+
anm moilegoimrid maki vekumen
|
27 |
+
ic mag glas eotan ond hit ne hearmiath me.
|
28 |
+
ptolmys
|
29 |
+
chekosurobakia
|
30 |
+
lha·sa·grong·khyer
|
31 |
+
saali safiaryok imminik nillirotiqasoongofoq taitsomanitatsayaonirarsoni. imminillotaoq nillirotiqasongommisoni ollominitatsayaonirarsoni.
|
32 |
+
amagrad 1 ar d ttlalan middn gan ilellitn mgaddan gh waddur d izrfan, yili ak darsn unlli d ufrak, illa flla sn ad ttmyawasn ngratsn s tagmat.
|
uroman/test/string-similarity-test-input.txt
ADDED
@@ -0,0 +1,7 @@
1 |
+
trap strap
|
2 |
+
colour color
|
3 |
+
labeling labelling
|
4 |
+
organisation organization
|
5 |
+
Philadelphia Filadelfia
|
6 |
+
Vladimir Volodymyr
|
7 |
+
Moskva Moskvoy
|
uroman/test/string-similarity-test-output-ref.txt
ADDED
@@ -0,0 +1,8 @@
1 |
+
# Lang-code-1: eng Lang-code-2: eng
|
2 |
+
trap strap 1
|
3 |
+
colour color 0.1
|
4 |
+
labeling labelling 0.02
|
5 |
+
organisation organization 0.1
|
6 |
+
Philadelphia Filadelfia 0.02
|
7 |
+
Vladimir Volodymyr 0.5
|
8 |
+
Moskva Moskvoy 0.5
|
uroman/text/amh.txt
ADDED
@@ -0,0 +1,7 @@
1 |
+
ኢትዮጵያ ከዓለም ሶስቱ ትልቅ የአብርሃም ሀይማኖቶች ጋር ታሪካዊ ግንኙነት አላት።
|
2 |
+
ክርስትናን በአራተኛው ምዕተ-ዓመት ተቀብላለች።
|
3 |
+
ከሕዝቡ አንድ ሶስተኛው እስላም ነው።
|
4 |
+
የመጀመሪያው የእስላም ሂጅራ ወደ ኢትዮጵያ ነው የተከናወነው።
|
5 |
+
ነጋሽ በአፍሪካ የመጀመሪያው የእስላም መቀመጫ ናት።
|
6 |
+
እስከ ፲፱፻፸ ዎቹ ድረስ ብዙ ቤተ-እስራኤሎች በኢትዮጵያ ይኖሩ ነበር።
|
7 |
+
የራስ ተፈሪ እንቅስቃሴ ኢትዮጵያን በትልቅ ክብር ነው የሚያያት።
|
uroman/text/ara.txt
ADDED
@@ -0,0 +1,3 @@
1 |
+
كندا (بالإنجليزية: Canada) هي دولة في أمريكا الشمالية تتألف من 10 مقاطعات وثلاثة أقاليم. تقع في القسم الشمالي من القارة وتمتد من المحيط الأطلسي في الشرق إلى المحيط الهادئ في الغرب وتمتد شمالاً في المحيط المتجمد الشمالي. كندا هي البلد الثاني عالمياً من حيث المساحة الكلية. كما أن حدود كندا المشتركة مع الولايات المتحدة من الجنوب والشمال الغربي هي الأطول في العالم.
|
2 |
+
أراضي كندا مأهولة منذ آلاف السنين من قبل مجموعات مختلفة من السكان الأصليين. مع حلول أواخر القرن الخامس عشر بدأت الحملات البريطانية والفرنسية استكشاف المنطقة ومن ثم استوطنتها على طول ساحل المحيط الأطلسي. تنازلت فرنسا عن ما يقرب من جميع مستعمراتها في أمريكا الشمالية في عام 1763 بعد حرب السنوات السبع. في عام 1867، مع اتحاد ثلاثة مستعمرات بريطانية في أمريكا الشمالية عبر كونفدرالية تشكلت كندا باعتبارها كيانًا فدراليًا ذا سيادة يضم أربع مقاطعات. بدأ ذلك عملية اتسعت فيها مساحة كندا وتوسع حكمها الذاتي عن المملكة المتحدة. تجلت هذه الاستقلالية من خلال تشريع وستمنستر عام 1931 وبلغت ذروتها في صورة قانون كندا عام 1982 والذي قطع الاعتماد القانوني لكندا على البرلمان البريطاني.
|
3 |
+
كندا دولة فيدرالية يحكمها نظام ديمقراطي تمثيلي وملكية دستورية حيث الملكة إليزابيث الثانية قائدة للدولة. الأمة الكندية أمة ثنائية اللغة حيث الإنكليزية والفرنسية لغتان رسميتان على المستوى الاتحادي. تعد كندا واحدة من أكثر دول العالم تطوراً، حيث تمتلك اقتصاداً متنوعاً وتعتمد على مواردها الطبيعية الوفيرة، وعلى التجارة وبخاصة مع الولايات المتحدة اللتان تربطهما علاقة طويلة ومعقدة. كندا عضو في مجموعة الدول الصناعية السبع ومجموعة الثماني ومجموعة العشرين وحلف شمال الأطلسي ومنظمة التعاون والتنمية الاقتصادية ومنظمة التجارة العالمية ودول الكومنولث والفرنكوفونية ومنظمة الدول الأمريكية والإبيك والأمم المتحدة. تمتلك كندا واحداً من أعلى مستويات المعيشة في العالم حيث مؤشر التنمية البشرية يضعها في المرتبة الثامنة عالمياً.
|
uroman/text/ben.txt
ADDED
@@ -0,0 +1,8 @@
1 |
+
বার্লিন (জার্মান: Berlin বেয়ালিন্') জার্মানির রাজধানী, এবং ইউরোপ মহাদেশের একটি ঐতিহাসিক শহর। বার্লিন শহরে ৩৪ লক্ষেরও বেশি লোক বাস করেন। শহরটি একাধারে একটি শহর এবং জার্মানির একটি রাজ্য। বার্লিনের আয়তন ৩৪৩ বর্গমাইল; এটির আয়তন প্যারিস শহরের প্রায় ৯ গুণ।
|
2 |
+
বার্লিন একটি বহুসাংস্কৃতিক শহর। বিশ্বের ১৮৪টি দেশ থেকে আগত প্রায় ৪ লক্ষ ৩০ হাজার অভিবাসী বার্লিনে বাস করে। এদের মধ্যে তুরস্ক থেকে আগত অভিবাসীরা সংখ্যা সবচেয়ে বেশি; বার্লিনে প্রায় ১ লক্ষ ১৯ হাজার তুর্কি অভিবাসী বাস করে। তুরস্কের বাইরে বার্লিনেই ইউরোপে তুর্কিদের সবচেয়ে বড় সম্প্রদায় অবস্থিত।
|
3 |
+
১৯৪৯ সাল থেকে ১৯৯০ পর্যন্ত বার্লিন পূর্ব বার্লিন ও পশ্চিম বার্লিন---এই দুই ভাগে বিভক্ত ছিল। ১৯৬১ সালে পূর্ব জার্মান সরকার সেখানকার নাগরিকদের পশ্চিম বার্লিনে পালিয়ে যাওয়া ঠেকাতে দুই বার্লিনের মাঝে একটি দেয়াল তুলে দেয়। দেয়ালটি ১৯৬১ সাল থেকে ১৯৮৯ সাল পর্যন্ত টিকে ছিল। ঐ সময় ৫ হাজারেরও বেশি ব্যক্তি দেয়ালটি টপকানোর চেষ্টা করে; এদের মধ্যে ৩২০০ জনকে গ্রেফতার করা হয় এবং ১৯১ জন নিহত হয়।
|
4 |
+
১৯৮৯ সালে দেয়ালটি ভেঙে ফেলার পর বার্লিনের ব্রান্ডেনবুর্গ ফটক পূর্ব ও পশ্চিম বার্লিনের পুনঃএকত্রীকরণের প্রতীক হিসেবে দাঁড়িয়ে আছে।
|
5 |
+
বার্লিনের স্থানীয় ফুটবল দলের নাম হের্টা বে এস ৎসে বের্লিন। তারা ঘরোয়া ম্যাচগুলি বার্লিনের "অলিম্পিয়াষ্টাডিয়ন" নামের স্টেডিয়ামে খেলে থাকে। এই স্টেডিয়ামেই ১৯৩৬ সালের গ্রীষ্মকালীন অলিম্পিক্স অনুষ্ঠিত হয়।
|
6 |
+
বার্লিনে কুকুর পোষা খুবই ব্যয়বহুল একটি কাজ। কুকুরের মালিককে প্রতি বছর দেড়শ ইউরো কর দিতে হয়।
|
7 |
+
বার্লিনের কাউফ্হাউস ডেস ভেস্টেন্স (Kaufhaus des Westens, সংক্ষেপে KaDeWe, কাডেভে) ইউরোপের বৃহত্তম ডিপার্টমেন্ট স্টোর। এর আট তলাবিশিষ্ট ভবনে প্রায় ৪ লক্ষ জিনিস বেচা কেনা হয়।
|
8 |
+
মার্কিন যুক্তরাষ্ট্রের লস অ্যাঞ্জেলেস বার্লিনের ভগ্নী শহর।
|
uroman/text/bod.txt
ADDED
@@ -0,0 +1,3 @@
1 |
+
ཁྲིན་ཀོན་ཆུས
|
2 |
+
ལྷ་ས་གྲོང་ཁྱེར
|
3 |
+
[[ཁྲིན་ཀོན་ཆུས་ཞེས་པ་ནི་རྒྱ་ནག་གཞུང་གིས་བཙན་འཛུལ་བྱས་རྗེས་བཏགས་པའི་མིང་ཞིག་ཡིན་པ་དང། དེ་ནི་ད་ལྟའི་ཆར་ལྷ་ས་གྲོང་ཁྱེར་གྱི་ཁོངས་གཏོགས་རྫོང་ཁག་བདུན་པོ་ཕུད་པའི་གྲོང་ཁྱེར་ནང་ཁུལ་གྱི་ས་ཁུལ་ཁག་བསྡུས་པའི་གནས་དེར་ཁྲེང་ཀོན་ཆུས་ཞེས་པའི་ཁོངས་སུ་གཏོགས་པར་བཤད་ཡོད་ཅིང། ནུབ་ཏུ་སྟོད་ལུང་ས་འབྲེལ་འབྲས་སྤུངས་དན་བག་ཡན་དང་ཤར་དུ་གཤོངས་ཀ་གླིང་ཡན་ཙམ་དུ་ཡིན་ཚོད་འདུག]]
|
uroman/text/egy.txt
ADDED
@@ -0,0 +1,5 @@
1 |
+
𓈎𓃭𓇋𓍯𓊪𓄿𓆓𓂋𓄿𓏏𓆇
|
2 |
+
𓊪𓏏𓍯𓃭𓐝𓇌𓋴
|
3 |
+
𓆿𓍧𓎇𓏻
|
4 |
+
𓇌𓊪𓏲𓌙𓈉
|
5 |
+
|