freemt committed
Commit · 7fd4e54
1 Parent(s): d6448a5
Update slow-track for more lang pairs
- data/xiyouji-ch1-de.txt +0 -115
- docs/build/doctrees/environment.pickle +0 -0
- docs/build/doctrees/intro.doctree +0 -0
- docs/build/doctrees/userguide-zh.doctree +0 -0
- docs/build/html/_sources/intro.rst.txt +4 -4
- docs/build/html/_sources/userguide-zh.rst.txt +1 -1
- docs/build/html/intro.html +4 -4
- docs/build/html/searchindex.js +1 -1
- docs/build/html/userguide-zh.html +1 -1
- docs/source/intro.rst +4 -4
- docs/source/userguide-zh.rst +1 -1
- gradio_queue.db +0 -0
- img/plt.png +0 -0
- radiobee/__main__.py +1 -1
- radiobee/detect.py +32 -16
- radiobee/detect_alt.py +66 -0
- radiobee/gradiobee.py +11 -6
- radiobee/text2lists.py +17 -4
- tests/test_detect.py +17 -3
- tests/test_text2lists.py +14 -6
- tests/test_text2lists_bug2.py +4 -6
data/xiyouji-ch1-de.txt
CHANGED
@@ -2,125 +2,10 @@ Wu Ch’êng-ên
|
|
2 |
|
3 |
Monkeys Pilgerfahrt
|
4 |
|
5 |
-
Hugendubel
|
6 |
-
|
7 |
-
|
8 |
-
|
9 |
|
10 |
|
11 |
Nach der englischen Übersetzung von Arthur Waley übertragen von Georgette Boner und Maria Nils.
|
12 |
|
13 |
-
1980 © der deutschen Ausgabe Heinrich Hugendubel Verlag, München, Titel der Originalausgabe MONKEY © George Allen & Unwin Ltd. London
|
14 |
-
|
15 |
-
Alle Rechte vorbehalten
|
16 |
-
|
17 |
-
Umschlaggestaltung: Dieter Bonhorst, mit einer Illustration von Maja Weber
|
18 |
-
|
19 |
-
Druck und Bindung: May & Co., Darmstadt
|
20 |
-
|
21 |
-
ISBN 3 88034 9
|
22 |
-
|
23 |
-
Printed in Germany
|
24 |
-
|
25 |
-
|
26 |
-
|
27 |
-
* * *
|
28 |
-
|
29 |
-
|
30 |
-
|
31 |
-
Die Rechtschreibung und Interpunktion der Originalausgabe sind unverändert. Offensichtliche Fehler wurden stillschweigend korrigiert.
|
32 |
-
|
33 |
-
|
34 |
-
|
35 |
-
|
36 |
-
|
37 |
-
Inhalt
|
38 |
-
|
39 |
-
|
40 |
-
Vorwort zur englischen Ausgabe von Arthur Waley
|
41 |
-
|
42 |
-
1. Kapitel: Die Geburt des magischen Affen Monkey
|
43 |
-
|
44 |
-
2. Kapitel: Monkey’s Lehrjahre beim Patriarchen
|
45 |
-
|
46 |
-
3. Kapitel: Die Waffen des Drachenkönigs; Monkey streicht seinen Namen aus der Liste Yamas, des Königs der Toten und erregt den Zorn des Jade-Kaisers
|
47 |
-
|
48 |
-
4. Kapitel: Monkey erhält den Posten eines Pferdeknechts im Himmel und kehrt wegen dieser Beleidigung schnellstens auf die Erde zurück
|
49 |
-
|
50 |
-
5. Kapitel: ›Der Große Weise Himmelsebenbürtige‹
|
51 |
-
|
52 |
-
6. Kapitel: Der Zauberer Erh-lang und Lao-tsu nehmen Monkey gefangen
|
53 |
-
|
54 |
-
7. Kapitel: Monkey verliert eine Wette gegen Buddha
|
55 |
-
|
56 |
-
8. Kapitel: Ein Bote für die Heiligen Schriften
|
57 |
-
|
58 |
-
9. Kapitel: Die Gesetze des Karma
|
59 |
-
|
60 |
-
10. Kapitel: Ein gebrochenes Versprechen
|
61 |
-
|
62 |
-
11. Kapitel: Der Kaiser vor dem Totengericht
|
63 |
-
|
64 |
-
12. Kapitel: Tripitaka erhält den Auftrag, die Heiligen Schriften aus Indien zu holen
|
65 |
-
|
66 |
-
13. Kapitel: Der Tod von Tripitakas Reisegefährten
|
67 |
-
|
68 |
-
14. Kapitel: Tripitaka hebt den Bann von Monkey auf und macht ihn zu seinem Reisegefährten
|
69 |
-
|
70 |
-
15. Kapitel: Monkeys Kampf mit dem verwunschenen Drachen
|
71 |
-
|
72 |
-
16. Kapitel: Monkey vertreibt einen ›Unhold‹
|
73 |
-
|
74 |
-
17. Kapitel: Der ›Unhold‹ Pigsy beschließt, Tripitaka und Monkey zu begleiten
|
75 |
-
|
76 |
-
18. Kapitel: ›Das Ungeheuer vom Strom‹ schließt sich der Pilgerfahrt an
|
77 |
-
|
78 |
-
19. Kapitel: Der Geist des toten Königs bittet Monkey um seine Hilfe
|
79 |
-
|
80 |
-
20. Kapitel: Die durch bösen Zauber verwunschene Stadt Kräh-Hahn
|
81 |
-
|
82 |
-
21. Kapitel: Lao-tsu’s Elexier erweckt den toten König wieder zum Leben; der falsche Zauberer wird in seine ursprüngliche Gestalt, einen Löwen, zurückverwandelt
|
83 |
-
|
84 |
-
22. Kapitel: 500 Buddhisten werden von Monkey aus der Sklaverei befreit
|
85 |
-
|
86 |
-
23. Kapitel: Monkey verulkt Taoisten, die einen Gottesdienst feiern
|
87 |
-
|
88 |
-
24. Kapitel: Eine Wette mit tödlichem Ausgang
|
89 |
-
|
90 |
-
25. Kapitel: Menschenopfer
|
91 |
-
|
92 |
-
26. Kapitel: Der Flußkönig stellt Tripitaka eine Falle
|
93 |
-
|
94 |
-
27. Kapitel: Göttliche Intervention und Rettung Tripitakas
|
95 |
-
|
96 |
-
28. Kapitel: Tripitaka erhält die Heiligen Schriften
|
97 |
-
|
98 |
-
29. Kapitel: Die Heimreise
|
99 |
-
|
100 |
-
30. Kapitel: Willkommensfest in Ch’ang-an
|
101 |
-
|
102 |
-
Arthur Waley zur deutschen Ausgabe
|
103 |
-
|
104 |
-
|
105 |
-
|
106 |
-
|
107 |
-
|
108 |
-
Vorwort zur englischen Ausgabe von Arthur Waley
|
109 |
-
|
110 |
-
|
111 |
-
Die vorliegende Erzählung wurde von Wu Ch’êng-ên aus Huai-an in Kiangsu niedergeschrieben. Seine genauen Daten sind nicht bekannt. Doch scheint er zwischen 1505 und 1580 n. Chr. gelebt und sich als Dichter eines gewissen Ruhmes erfreut zu haben. Einige seiner eher unbedeutenden Verse sind in einer Anthologie der Ming-Dichtung überliefert.
|
112 |
-
|
113 |
-
Tripitaka, dessen Pilgerfahrt nach Indien das Thema der Erzählung bildet, ist eine wirkliche Person, in der Geschichte besser bekannt als Hsüan Tsang. Er lebte im siebten Jahrhundert n. Chr. Über seine Reise gibt es eingehende zeitgenössische Berichte. Bereits im zehnten Jahrhundert, und vermutlich schon früher, war Tripitakas Pilgerfahrt Gegenstand eines ganzen Zyklus phantastischer Legenden. Seit dem dreizehnten Jahrhundert sind diese Legenden ständig auf der chinesischen Bühne dargestellt worden. Wu Ch’êng-ên standen daher für seine lange Märchenerzählung eine Menge Bausteine zur Verfügung. Das ursprüngliche Buch ist von unendlichem Umfang und wird gewöhnlich in gekürzten Fassungen gelesen. Bei diesen Bearbeitungen blieb die ursprüngliche Anzahl der einzelnen Episoden bestehen; ihre Länge jedoch wurde, besonders durch Streichen von Dialogen, erheblich gekürzt. — Ich habe meist das entgegengesetzte Prinzip angewandt, indem ich zahlreiche Episoden ausließ, die beibehaltenen jedoch nahezu ungekürzt übersetzte, mit Ausnahme der meisten eingestreuten, für eine Übertragung ins Englische ungeeigneten Verse.
|
114 |
-
|
115 |
-
Monkey ist ein wahrhaft einzigartiges Werk in seiner Verbindung von Schönheit mit Ungereimtheit, von Tiefe mit Unsinn. Folklore, Allegorie, Religion, Geschichte, antibürokratische Satire und reine Poesie — dies sind die außerordentlich verschiedenen Elemente, aus denen das Buch sich zusammenfügt. Die Bürokraten der Erzählung sind Heilige im Himmel, und man könnte auf die Vermutung kommen, daß die Satire sich noch eher gegen die Religion als gegen die Bürokratie wandte. Dem ist aber nicht so. Es ist nämlich eine in China geläufige Anschauung, daß die Hierarchie im Himmel ein Spiegelbild der Regierungsform auf Erden sei. Hier wie so oft lassen die Chinesen die Katze aus dem Sack, wo andere Völker uns Rätsel aufgeben. Es ist häufig als Theorie geltend gemacht worden, daß eines Volkes Götter die Spiegelung seiner irdischen Regenten darstellen. In den meisten Fällen bleibt die Ableitung im Dunkeln. Im Volksglauben der Chinesen jedoch gibt es keinerlei Doppelsinn. Der Himmel ist einfach das gesamte bürokratische System, leibhaftig ins Empyreum versetzt.
|
116 |
-
|
117 |
-
Was die Allegorie anbelangt, so versinnbildlicht Tripitaka unverkennbar den ängstlich und beflissen durch die Schwierigkeiten des Lebens tappenden Menschen, während Monkey die ewige Unruhe des Genies personifiziert. Pigsy wiederum symbolisiert offensichtlich die physischen Begierden, primitive Kraft und eine Art schwerfälliger Geduld. Sandy ist rätselhafter. Die Kommentatoren sagen, er stelle ch’êng dar, was gewöhnlich mit ›Redlichkeit‹ übersetzt wird, allein noch eher etwas im Sinne von ›Integrität des Herzens‹ bedeutet. Er kam nicht als nachträglicher Einfall in die Erzählung, erscheint er doch bereits in einigen der frühesten Fassungen der Legende. Aber es muß zugegeben werden, daß sein Bild, obgleich für die Erzählung in unerklärlicher Weise nötig, dennoch in den Umrissen seltsam undeutlich und farblos bleibt.
|
118 |
-
|
119 |
-
Auszüge des vorliegenden Buches sind erschienen in Giles’ History of Chinese Literature und in Timothy Richard’s Mission to Heaven, zu einer Zeit, als nur die gekürzten Fassungen bekannt waren. Eine zugängliche, doch recht ungenaue Beschreibung des Werkes gibt Helen Hayes in A Buddhist Pilgrim’s Progress (Wisdom of the East Series). Ferner existiert eine recht freie japanische Paraphrase von verschiedenen Händen, mit einer 1806 datierten Einleitung des bekannten Novellisten Bakin und Illustrationen, deren einige von Hokusai stammen. Einer der Übersetzer, Hokusais Schüler Gakutei, gesteht, daß er keine Kenntnis von der Chinesischen Umgangssprache hatte, als er die Arbeit unternahm.
|
120 |
-
|
121 |
-
Der meiner Übersetzung zugrundeliegende Text erschien 1921 in der Oriental Press, Shanghai, mit einer ausführlichen und gelehrten Einleitung von Dr. Hu Shih, derzeitigem chinesischen Botschafter in Washington.
|
122 |
-
|
123 |
-
|
124 |
|
125 |
|
126 |
|
|
|
2 |
|
3 |
Monkeys Pilgerfahrt
|
4 |
|
5 |
|
6 |
|
7 |
Nach der englischen Übersetzung von Arthur Waley übertragen von Georgette Boner und Maria Nils.
|
8 |
|
9 |
|
10 |
|
11 |
|
docs/build/doctrees/environment.pickle
CHANGED
Binary files a/docs/build/doctrees/environment.pickle and b/docs/build/doctrees/environment.pickle differ
|
|
docs/build/doctrees/intro.doctree
CHANGED
Binary files a/docs/build/doctrees/intro.doctree and b/docs/build/doctrees/intro.doctree differ
|
|
docs/build/doctrees/userguide-zh.doctree
CHANGED
Binary files a/docs/build/doctrees/userguide-zh.doctree and b/docs/build/doctrees/userguide-zh.doctree differ
|
|
docs/build/html/_sources/intro.rst.txt
CHANGED
@@ -3,19 +3,19 @@ Introduction
|
|
3 |
|
4 |
``radiobee`` (or ``radiobee aligner`` in full) is a powerful dualtext aligner.
|
5 |
|
6 |
-
The aim
|
7 |
|
8 |
The current implementation has been developed in Python 3 and ``gradio``.
|
9 |
|
10 |
Motivation
|
11 |
**********
|
12 |
|
13 |
-
Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.
|
14 |
|
15 |
Limitations
|
16 |
***********
|
17 |
|
18 |
-
Currently, only zh-en/en-zh pairs are supported for fast-track
|
19 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
|
20 |
|
21 |
-
An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other
|
|
|
3 |
|
4 |
``radiobee`` (or ``radiobee aligner`` in full) is a powerful dualtext aligner.
|
5 |
|
6 |
+
The aim is to provide an interface to align two texts.
|
7 |
|
8 |
The current implementation has been developed in Python 3 and ``gradio``.
|
9 |
|
10 |
Motivation
|
11 |
**********
|
12 |
|
13 |
+
Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find many applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.
|
14 |
|
15 |
Limitations
|
16 |
***********
|
17 |
|
18 |
+
Currently, only zh-en/en-zh pairs are supported for fast-track mode although further pairs will be added if and when time permits.
|
19 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
|
20 |
|
21 |
+
An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other laugnage pairs.
|
docs/build/html/_sources/userguide-zh.rst.txt
CHANGED
@@ -3,7 +3,7 @@
|
|
3 |
|
4 |
- ``radiobee aligner`` 是 ``bumblebee aligner`` 的孪生兄弟。请加入qq群 ``316287378`` 了解这些对齐工具。
|
5 |
|
6 |
-
- ``radiobee``
|
7 |
- ``radiobee`` 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 ``docx``, ``pdf``, ``srt``, ``html`` 等格式。
|
8 |
- ``file 2`` 为空白时,``radiobee`` 则会视 ``file 1`` 为中英文混合文本及试着分离中英文,然后进行对齐。
|
9 |
- 英中、中英非空行限制在 ``2000`` 以内,其他语言对的对齐(``500`` 对约需5分钟)则限制在 ``200`` 以内。
|
|
|
3 |
|
4 |
- ``radiobee aligner`` 是 ``bumblebee aligner`` 的孪生兄弟。请加入qq群 ``316287378`` 了解这些对齐工具。
|
5 |
|
6 |
+
- ``radiobee`` 快对模式目前仅支持中英、英中对齐。
|
7 |
- ``radiobee`` 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 ``docx``, ``pdf``, ``srt``, ``html`` 等格式。
|
8 |
- ``file 2`` 为空白时,``radiobee`` 则会视 ``file 1`` 为中英文混合文本及试着分离中英文,然后进行对齐。
|
9 |
- 英中、中英非空行限制在 ``2000`` 以内,其他语言对的对齐(``500`` 对约需5分钟)则限制在 ``200`` 以内。
|
docs/build/html/intro.html
CHANGED
@@ -77,17 +77,17 @@
|
|
77 |
<section id="introduction">
|
78 |
<h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline"></a></h1>
|
79 |
<p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> (or <code class="docutils literal notranslate"><span class="pre">radiobee</span> <span class="pre">aligner</span></code> in full) is a powerful dualtext aligner.</p>
|
80 |
-
<p>The aim
|
81 |
<p>The current implementation has been developed in Python 3 and <code class="docutils literal notranslate"><span class="pre">gradio</span></code>.</p>
|
82 |
<section id="motivation">
|
83 |
<h2>Motivation<a class="headerlink" href="#motivation" title="Permalink to this headline"></a></h2>
|
84 |
-
<p>Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.</p>
|
85 |
</section>
|
86 |
<section id="limitations">
|
87 |
<h2>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline"></a></h2>
|
88 |
-
<p>Currently, only zh-en/en-zh pairs are supported for fast-track
|
89 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.</p>
|
90 |
-
<p>An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other
|
91 |
</section>
|
92 |
</section>
|
93 |
|
|
|
77 |
<section id="introduction">
|
78 |
<h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this headline"></a></h1>
|
79 |
<p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> (or <code class="docutils literal notranslate"><span class="pre">radiobee</span> <span class="pre">aligner</span></code> in full) is a powerful dualtext aligner.</p>
|
80 |
+
<p>The aim is to provide an interface to align two texts.</p>
|
81 |
<p>The current implementation has been developed in Python 3 and <code class="docutils literal notranslate"><span class="pre">gradio</span></code>.</p>
|
82 |
<section id="motivation">
|
83 |
<h2>Motivation<a class="headerlink" href="#motivation" title="Permalink to this headline"></a></h2>
|
84 |
+
<p>Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find many applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.</p>
|
85 |
</section>
|
86 |
<section id="limitations">
|
87 |
<h2>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline"></a></h2>
|
88 |
+
<p>Currently, only zh-en/en-zh pairs are supported for fast-track mode although further pairs will be added if and when time permits.
|
89 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.</p>
|
90 |
+
<p>An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other laugnage pairs.</p>
|
91 |
</section>
|
92 |
</section>
|
93 |
|
docs/build/html/searchindex.js
CHANGED
@@ -1 +1 @@
|
|
1 |
-
Search.setIndex({docnames:["examples","index","intro","modules","radiobee","userguide","userguide-zh"],envversion:{"sphinx.domains.c":2,"sphinx.domains.changeset":1,"sphinx.domains.citation":1,"sphinx.domains.cpp":4,"sphinx.domains.index":1,"sphinx.domains.javascript":2,"sphinx.domains.math":2,"sphinx.domains.python":3,"sphinx.domains.rst":2,"sphinx.domains.std":2,sphinx:56},filenames:["examples.rst","index.rst","intro.rst","modules.rst","radiobee.rst","userguide.rst","userguide-zh.rst"],objects:{},objnames:{},objtypes:{},terms:{"1":[5,6],"12":[5,6],"2":[5,6],"200":[5,6],"2000":[5,6],"3":2,"316287378":[5,6],"4":[5,6],"5":2,"500":[2,6],"8":[5,6],"\u4e00\u822c\u65e0\u9700\u7406\u4f1a\u8fd9\u4e9b\u53c2\u6570":6,"\u4e2d\u82f1\u975e\u7a7a\u884c\u9650\u5236\u5728":6,"\u4e3a\u4e2d\u82f1\u6587\u6df7\u5408\u6587\u672c\u53ca\u8bd5\u7740\u5206\u79bb\u4e2d\u82f1\u6587":6,"\u4e3a\u7a7a\u767d\u65f6":6,"\u4e86\u89e3\u8fd9\u4e9b\u5bf9\u9f50\u5de5\u5177":6,"\u4ee5\u5185":6,"\u4ee5\u540e\u53ef\u80fd\u4f1a\u652f\u6301":6,"\u4f18\u8d28\u5bf9":6,"\u4f7f\u7528\u8bf4\u660e":1,"\u5176\u4ed6\u8bed\u8a00\u5bf9\u7684\u5bf9\u9f50":6,"\u5219\u4f1a\u89c6":6,"\u5219\u9650\u5236\u5728":6,"\u53e6\u4e00\u65b9\u9762":6,"\u53ef\u4ee5\u53f3\u51fb\u62f7\u51fa\u56fe\u7684\u94fe\u63a5\u7528\u6d4f\u89c8\u5668\u72ec\u7acb\u8bbf\u95ee\u62f7\u51fa\u6765\u7684\u94fe\u63a5\u6216\u53f3\u51fb\u5b58\u76d8\u518d\u7528\u770b\u56fe\u7a0b\u5e8f\u6253\u5f00\u5b58\u76d8\u7684\u56fe\u6587\u4ef6":6,"\u548c":6,"\u5acc\u56fe\u592a\u5c0f\u7684\u8bdd":6,"\u5b58\u4e0b\u6709\u5173\u53c2\u6570\u67e5\u770b\u6216\u901a\u77e5\u5f00\u53d1\u8005":6,"\u5bf9\u7ea6\u97005\u5206\u949f":6,"\u662f":6,"\u6700\u5c0f":6,"\u7136\u540e\u8fdb\u884c\u5bf9\u9f50":6,"\u7684\u5b6a\u751f\u5144\u5f1f":6,"\u7684\u5efa\u8bae\u503c":6,"\u76ee\u524d\u4ec5\u652f\u6301\u4e2d\u82f1":
|
|
|
1 |
+
Search.setIndex({docnames:["examples","index","intro","modules","radiobee","userguide","userguide-zh"],envversion:{"sphinx.domains.c":2,"sphinx.domains.changeset":1,"sphinx.domains.citation":1,"sphinx.domains.cpp":4,"sphinx.domains.index":1,"sphinx.domains.javascript":2,"sphinx.domains.math":2,"sphinx.domains.python":3,"sphinx.domains.rst":2,"sphinx.domains.std":2,sphinx:56},filenames:["examples.rst","index.rst","intro.rst","modules.rst","radiobee.rst","userguide.rst","userguide-zh.rst"],objects:{},objnames:{},objtypes:{},terms:{"1":[5,6],"12":[5,6],"2":[5,6],"200":[5,6],"2000":[5,6],"3":2,"316287378":[5,6],"4":[5,6],"5":2,"500":[2,6],"8":[5,6],"\u4e00\u822c\u65e0\u9700\u7406\u4f1a\u8fd9\u4e9b\u53c2\u6570":6,"\u4e2d\u82f1\u975e\u7a7a\u884c\u9650\u5236\u5728":6,"\u4e3a\u4e2d\u82f1\u6587\u6df7\u5408\u6587\u672c\u53ca\u8bd5\u7740\u5206\u79bb\u4e2d\u82f1\u6587":6,"\u4e3a\u7a7a\u767d\u65f6":6,"\u4e86\u89e3\u8fd9\u4e9b\u5bf9\u9f50\u5de5\u5177":6,"\u4ee5\u5185":6,"\u4ee5\u540e\u53ef\u80fd\u4f1a\u652f\u6301":6,"\u4f18\u8d28\u5bf9":6,"\u4f7f\u7528\u8bf4\u660e":1,"\u5176\u4ed6\u8bed\u8a00\u5bf9\u7684\u5bf9\u9f50":6,"\u5219\u4f1a\u89c6":6,"\u5219\u9650\u5236\u5728":6,"\u53e6\u4e00\u65b9\u9762":6,"\u53ef\u4ee5\u53f3\u51fb\u62f7\u51fa\u56fe\u7684\u94fe\u63a5\u7528\u6d4f\u89c8\u5668\u72ec\u7acb\u8bbf\u95ee\u62f7\u51fa\u6765\u7684\u94fe\u63a5\u6216\u53f3\u51fb\u5b58\u76d8\u518d\u7528\u770b\u56fe\u7a0b\u5e8f\u6253\u5f00\u5b58\u76d8\u7684\u56fe\u6587\u4ef6":6,"\u548c":6,"\u5acc\u56fe\u592a\u5c0f\u7684\u8bdd":6,"\u5b58\u4e0b\u6709\u5173\u53c2\u6570\u67e5\u770b\u6216\u901a\u77e5\u5f00\u53d1\u8005":6,"\u5bf9\u7ea6\u97005\u5206\u949f":6,"\u5feb\u5bf9\u6a21\u5f0f\u76ee\u524d\u4ec5\u652f\u6301\u4e2d\u82f1":6,"\u662f":6,"\u6700\u5c0f":6,"\u7136\u540e\u8fdb\u884c\u5bf9\u9f50":6,"\u7684\u5b6a\u751f\u5144\u5f1f":6,"\u7684\u5efa\u8bae\u503c":6,"\u76ee\u524d\u4ec5\u652f\u6301\u4e2d\u82f1":[],"\u76ee\u524d\u4ec5\u652f\u6301\u7eaf\u6587\u672c\u6587\u4ef6\u4e0a\u8f7d":6,"\u7b2c\u4e8c\u6b21\u4e0a\u
8f7d\u6587\u4ef6\u524d\u8bf7\u70b9\u51fb":6,"\u7b49":6,"\u7b49\u683c\u5f0f":6,"\u82f1\u4e2d":6,"\u82f1\u4e2d\u5bf9\u9f50":6,"\u8bbe\u5927\u4e9b\u5219\u4f1a\u5f97\u5230\u5c11\u4e00\u4e9b\u5bf9\u9f50\u5bf9\u56e0\u4e3a\u53ef\u80fd\u9519\u5931\u4e86\u4e00\u4e9b":6,"\u8bbe\u5927\u4e9b\u6216":6,"\u8bbe\u5c0f\u4e9b\u53ef\u4ee5\u5f97\u5230\u66f4\u591a\u7684\u5bf9\u9f50\u5bf9\u4f46\u4e5f\u4f1a\u6709\u66f4\u591a":6,"\u8bbe\u5c0f\u4e9b\u6216":6,"\u8bef\u62a5\u5bf9":6,"\u8bf7\u52a0\u5165qq\u7fa4":6,"\u8fd0\u884c\u51fa\u9519\u65f6\u53ef\u4ee5\u70b9\u51fb":6,"\u9519\u8bef\u5224\u65ad\u4e3a\u5bf9\u9f50\u7684\u5bf9":6,"do":5,"new":5,As:0,For:0,If:[2,5],On:5,The:[2,5],To:5,about:5,ad:2,address:5,aim:2,align:[0,2,5,6],align_s:[1,3],align_text:[1,3],also:5,although:2,amend_avec:[1,3],an:2,app:[1,3],applic:2,approxim:2,ar:[2,5],attempt:5,been:[0,2],befor:5,better:5,blank:5,browser:5,built:0,bumblebe:[5,6],can:5,candid:5,cannot:0,cat:2,chines:5,clear:[5,6],click:[0,5],cmat2tset:[1,3],co:0,contact:2,content:3,copi:5,csv:[5,6],current:2,de:2,develop:[2,5],dl_type:[5,6],docterm_scor:[1,3],docx:[5,6],download:0,dual:2,dualtext:2,e:2,ebook:2,educ:2,en2zh:[1,3],en2zh_token:[1,3],en:[2,5],english:5,epsilon:[5,6],esp:[5,6],etc:[2,5],exampl:[1,2,5],experiment:2,fals:5,fast:2,file2text:[1,3],file:[5,6],files2df:[1,3],find:2,first:5,flag:[5,6],format:5,full:2,further:2,g:2,gen_aset:[1,3],gen_eps_minsampl:[1,3],gen_model:[1,3],gen_pset:[1,3],gen_row_align:[1,3],go:5,good:5,gradio:2,group:5,ha:[0,2],hand:5,have:5,help:2,here:[],how:1,html:[5,6],http:0,huggingfac:0,identifi:5,idf_typ:[5,6],imag:5,implement:2,index:1,inform:5,insert_spac:[1,3],instal:1,interfac:2,interpolate_pset:[1,3],introduct:1,introduec:2,ja:2,join:5,just:0,know:5,languag:2,languang:5,larger:5,later:5,laugnag:2,learn:2,left:5,limit:[1,5],line:5,lists2cmat:[1,3],loadtext:[1,3],look:5,machin:2,mai:5,mani:2,md:[5,6],mdx_e2c:[1,3],method:0,mikee:0,min_sampl:[5,6],minimum:5,minut:2,miss:5,mix:5,mode:2,modul:[1,3],more:5,motiv:1,need:5,
non:5,norm:[5,6],normal:5,now:0,number:5,one:0,onli:2,onlin:0,open:5,other:[2,5],output:5,packag:[0,1,3],page:1,pair:[2,5],paragraph:2,particular:2,pdf:[5,6],per:2,permit:2,pip:0,pleas:5,plot_cmat:[1,3],plot_df:[1,3],posit:5,power:2,proced:5,process_upload:[1,3],properli:2,provid:2,publish:0,pure:5,pypi:0,python:2,qq:5,radiobe:[0,2,5,6],result:5,right:5,row:0,ru:2,save:5,search:1,seg_text:[1,3],select:5,sentenc:2,separ:5,should:5,shuffle_s:[1,3],sibl:5,slow:2,smaller:5,smatrix:[1,3],someth:5,space:0,srt:[5,6],submit:[0,5],submodul:[1,3],subsequ:5,suggest:[0,5],support:[2,5],tab:5,tabl:0,tend:5,term:2,testrun:0,text:[2,5],tf_type:[5,6],them:5,time:2,tmx:2,touch:5,track:2,translat:2,treat:5,trim_df:[1,3],two:2,txt:[5,6],unless:5,upload:5,us:[0,1],usag:1,valu:5,version:0,wa:[],welcom:2,what:5,when:[2,5],willing:2,wrong:5,yet:0,you:[2,5],zh:[2,5],zip:0},titles:["Examples","Welcome to radiobee\u2019s documentation!","Introduction","radiobee","radiobee package","How to use","\u4f7f\u7528\u8bf4\u660e"],titleterms:{"\u4f7f\u7528\u8bf4\u660e":6,align_s:4,align_text:4,amend_avec:4,app:4,cmat2tset:4,content:[1,4],docterm_scor:4,document:1,en2zh:4,en2zh_token:4,exampl:0,file2text:4,files2df:4,gen_aset:4,gen_eps_minsampl:4,gen_model:4,gen_pset:4,gen_row_align:4,how:5,indic:1,insert_spac:4,instal:0,interpolate_pset:4,introduct:2,limit:2,lists2cmat:4,loadtext:4,mdx_e2c:4,modul:4,motiv:2,packag:4,plot_cmat:4,plot_df:4,process_upload:4,radiobe:[1,3,4],s:1,seg_text:4,shuffle_s:4,smatrix:4,submodul:4,tabl:1,trim_df:4,us:5,usag:0,welcom:1}})
|
docs/build/html/userguide-zh.html
CHANGED
@@ -74,7 +74,7 @@
|
|
74 |
<h1>使用说明<a class="headerlink" href="#id1" title="Permalink to this headline"></a></h1>
|
75 |
<ul class="simple">
|
76 |
<li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span> <span class="pre">aligner</span></code> 是 <code class="docutils literal notranslate"><span class="pre">bumblebee</span> <span class="pre">aligner</span></code> 的孪生兄弟。请加入qq群 <code class="docutils literal notranslate"><span class="pre">316287378</span></code> 了解这些对齐工具。</p></li>
|
77 |
-
<li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code>
|
78 |
<li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 <code class="docutils literal notranslate"><span class="pre">docx</span></code>, <code class="docutils literal notranslate"><span class="pre">pdf</span></code>, <code class="docutils literal notranslate"><span class="pre">srt</span></code>, <code class="docutils literal notranslate"><span class="pre">html</span></code> 等格式。</p></li>
|
79 |
<li><p><code class="docutils literal notranslate"><span class="pre">file</span> <span class="pre">2</span></code> 为空白时,<code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 则会视 <code class="docutils literal notranslate"><span class="pre">file</span> <span class="pre">1</span></code> 为中英文混合文本及试着分离中英文,然后进行对齐。</p></li>
|
80 |
<li><p>英中、中英非空行限制在 <code class="docutils literal notranslate"><span class="pre">2000</span></code> 以内,其他语言对的对齐(<code class="docutils literal notranslate"><span class="pre">500</span></code> 对约需5分钟)则限制在 <code class="docutils literal notranslate"><span class="pre">200</span></code> 以内。</p></li>
|
|
|
74 |
<h1>使用说明<a class="headerlink" href="#id1" title="Permalink to this headline"></a></h1>
|
75 |
<ul class="simple">
|
76 |
<li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span> <span class="pre">aligner</span></code> 是 <code class="docutils literal notranslate"><span class="pre">bumblebee</span> <span class="pre">aligner</span></code> 的孪生兄弟。请加入qq群 <code class="docutils literal notranslate"><span class="pre">316287378</span></code> 了解这些对齐工具。</p></li>
|
77 |
+
<li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 快对模式目前仅支持中英、英中对齐。</p></li>
|
78 |
<li><p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 <code class="docutils literal notranslate"><span class="pre">docx</span></code>, <code class="docutils literal notranslate"><span class="pre">pdf</span></code>, <code class="docutils literal notranslate"><span class="pre">srt</span></code>, <code class="docutils literal notranslate"><span class="pre">html</span></code> 等格式。</p></li>
|
79 |
<li><p><code class="docutils literal notranslate"><span class="pre">file</span> <span class="pre">2</span></code> 为空白时,<code class="docutils literal notranslate"><span class="pre">radiobee</span></code> 则会视 <code class="docutils literal notranslate"><span class="pre">file</span> <span class="pre">1</span></code> 为中英文混合文本及试着分离中英文,然后进行对齐。</p></li>
|
80 |
<li><p>英中、中英非空行限制在 <code class="docutils literal notranslate"><span class="pre">2000</span></code> 以内,其他语言对的对齐(<code class="docutils literal notranslate"><span class="pre">500</span></code> 对约需5分钟)则限制在 <code class="docutils literal notranslate"><span class="pre">200</span></code> 以内。</p></li>
|
docs/source/intro.rst
CHANGED
@@ -3,19 +3,19 @@ Introduction
|
|
3 |
|
4 |
``radiobee`` (or ``radiobee aligner`` in full) is a powerful dualtext aligner.
|
5 |
|
6 |
-
The aim
|
7 |
|
8 |
The current implementation has been developed in Python 3 and ``gradio``.
|
9 |
|
10 |
Motivation
|
11 |
**********
|
12 |
|
13 |
-
Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.
|
14 |
|
15 |
Limitations
|
16 |
***********
|
17 |
|
18 |
-
Currently, only zh-en/en-zh pairs are supported for fast-track
|
19 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
|
20 |
|
21 |
-
An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other
|
|
|
3 |
|
4 |
``radiobee`` (or ``radiobee aligner`` in full) is a powerful dualtext aligner.
|
5 |
|
6 |
+
The aim is to provide an interface to align two texts.
|
7 |
|
8 |
The current implementation has been developed in Python 3 and ``gradio``.
|
9 |
|
10 |
Motivation
|
11 |
**********
|
12 |
|
13 |
+
Properly aligned texts (paragraph-to-paragraph or sentence-to-sentence) find many applications in machine learning (e.g. machine translation), CAT (tmx, translation terms etc.) and education (dual-language ebook), etc.
|
14 |
|
15 |
Limitations
|
16 |
***********
|
17 |
|
18 |
+
Currently, only zh-en/en-zh pairs are supported for fast-track mode although further pairs will be added if and when time permits.
|
19 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
|
20 |
|
21 |
+
An experimental slow-track mode (approximately 500 pairs per 5 minutes) is introdueced for other laugnage pairs.
|
docs/source/userguide-zh.rst
CHANGED
@@ -3,7 +3,7 @@
|
|
3 |
|
4 |
- ``radiobee aligner`` 是 ``bumblebee aligner`` 的孪生兄弟。请加入qq群 ``316287378`` 了解这些对齐工具。
|
5 |
|
6 |
-
- ``radiobee``
|
7 |
- ``radiobee`` 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 ``docx``, ``pdf``, ``srt``, ``html`` 等格式。
|
8 |
- ``file 2`` 为空白时,``radiobee`` 则会视 ``file 1`` 为中英文混合文本及试着分离中英文,然后进行对齐。
|
9 |
- 英中、中英非空行限制在 ``2000`` 以内,其他语言对的对齐(``500`` 对约需5分钟)则限制在 ``200`` 以内。
|
|
|
3 |
|
4 |
- ``radiobee aligner`` 是 ``bumblebee aligner`` 的孪生兄弟。请加入qq群 ``316287378`` 了解这些对齐工具。
|
5 |
|
6 |
+
- ``radiobee`` 快对模式目前仅支持中英、英中对齐。
|
7 |
- ``radiobee`` 目前仅支持纯文本文件上载 (txt, md, csv 等)。 以后可能会支持 ``docx``, ``pdf``, ``srt``, ``html`` 等格式。
|
8 |
- ``file 2`` 为空白时,``radiobee`` 则会视 ``file 1`` 为中英文混合文本及试着分离中英文,然后进行对齐。
|
9 |
- 英中、中英非空行限制在 ``2000`` 以内,其他语言对的对齐(``500`` 对约需5分钟)则限制在 ``200`` 以内。
|
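The limits stated in the docs above (fast-track only for zh-en/en-zh, at most 2000 non-blank lines; other pairs via slow-track, at most 200 non-blank lines, at roughly 500 pairs per 5 minutes) can be sketched as a small helper. This is only an illustration and not code from this commit: the names `choose_track`, `within_limit` and `slow_track_minutes` are hypothetical, and treating the limits as inclusive is an assumption.

```python
# Hypothetical sketch of the documented track rules; not part of radiobee.
FAST_TRACK_PAIRS = {("en", "zh"), ("zh", "en")}  # per docs: fast-track pairs
FAST_TRACK_MAX_LINES = 2000  # non-blank line limit for en-zh/zh-en
SLOW_TRACK_MAX_LINES = 200   # non-blank line limit for other pairs
SLOW_TRACK_PAIRS_PER_MINUTE = 500 / 5  # "approximately 500 pairs per 5 minutes"


def choose_track(lang1: str, lang2: str) -> str:
    """Return "fast" for zh-en/en-zh, "slow" for every other pair."""
    return "fast" if (lang1, lang2) in FAST_TRACK_PAIRS else "slow"


def within_limit(track: str, n_nonblank_lines: int) -> bool:
    """Check the documented line limit (assumed inclusive)."""
    limit = FAST_TRACK_MAX_LINES if track == "fast" else SLOW_TRACK_MAX_LINES
    return n_nonblank_lines <= limit


def slow_track_minutes(n_pairs: int) -> float:
    """Rough slow-track running time implied by the documented rate."""
    return n_pairs / SLOW_TRACK_PAIRS_PER_MINUTE


if __name__ == "__main__":
    track = choose_track("de", "zh")
    print(track, within_limit(track, 150), slow_track_minutes(150))
```

The estimate is only as good as the "approximately" in the docs; actual running time will depend on the host.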
gradio_queue.db
CHANGED
Binary files a/gradio_queue.db and b/gradio_queue.db differ
|
|
img/plt.png
CHANGED
radiobee/__main__.py
CHANGED
@@ -309,7 +309,7 @@ if __name__ == "__main__":
|
|
309 |
else:
|
310 |
raise SystemExit(f"Tried {numb} times to no avail, giving up...")
|
311 |
|
312 |
-
description = "WIP showcasing a blazing fast dualtext aligner, currrently supported language pairs: en-zh/zh-en"
|
313 |
|
314 |
# moved to userguide.rst in docs
|
315 |
article = dedent(
|
|
|
309 |
else:
|
310 |
raise SystemExit(f"Tried {numb} times to no avail, giving up...")
|
311 |
|
312 |
+
description = "WIP showcasing a blazing fast dualtext aligner, currrently supported language pairs: en-zh/zh-en for fast-track, other language pairs are handled by slow-track"
|
313 |
|
314 |
# moved to userguide.rst in docs
|
315 |
article = dedent(
|
radiobee/detect.py
CHANGED
@@ -27,12 +27,23 @@ def with_func_attrs(**attrs: Any) -> Callable:
|
|
27 |
# @with_func_attrs(set_languages=None)
|
28 |
# def detect(text: str) -> str:
|
29 |
def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
|
30 |
-
"""Detect language via polyglot and fastlid.
|
31 |
# if not text.strip(): return "en"
|
32 |
try:
|
33 |
-
|
34 |
-
detect.lang_conf =
|
35 |
-
lang, conf = _[0]
|
36 |
except UnknownLanguage:
|
37 |
if set_languages is None:
|
38 |
def_lang = "en"
|
@@ -40,26 +51,31 @@ def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
|
|
40 |
# def_lang = set_languages[-1]
|
41 |
def_lang = set_languages[0]
|
42 |
logger.warning(" UnknownLanguage exception: probably snippet too short, setting to %s", def_lang)
|
43 |
-
|
44 |
except Exception as exc:
|
45 |
logger.error(exc)
|
46 |
-
|
47 |
|
48 |
del conf
|
49 |
|
50 |
-
#
|
51 |
if set_languages is None:
|
52 |
-
|
53 |
|
54 |
# set_languages is set
|
55 |
if not isinstance(set_languages, (list, tuple)):
|
56 |
logger.warning("set_languages (%s) ought to be a list/tuple")
|
57 |
|
58 |
-
|
59 |
-
return lang
|
60 |
-
|
61 |
-
# lang not in set_languages, use fastlid
|
62 |
-
fastlid.set_languages = set_languages
|
63 |
-
lang, _ = fastlid(text)
|
64 |
-
|
65 |
-
return lang
|
|
|
27 |
# @with_func_attrs(set_languages=None)
|
28 |
# def detect(text: str) -> str:
|
29 |
def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
|
30 |
+
"""Detect language via polyglot and fastlid.
|
31 |
+
|
32 |
+
check first with fastlid, if conf < 0.3, check with
|
33 |
+
|
34 |
+
Alternative in detect_alt.py
|
35 |
+
"""
|
36 |
# if not text.strip(): return "en"
|
37 |
+
fastlid.set_languages = set_languages
|
38 |
+
lang, conf = fastlid(text)
|
39 |
+
detect.lang_conf = lang, conf
|
40 |
+
if conf >= 0.3 or lang in ["zh"]:
|
41 |
+
return lang
|
42 |
+
|
43 |
try:
|
44 |
+
langs = [(elm.code[:2], elm.confidence) for elm in Detector(text).languages]
|
45 |
+
detect.lang_conf = langs
|
46 |
+
# lang, conf = _[0]
|
47 |
except UnknownLanguage:
|
48 |
if set_languages is None:
|
49 |
def_lang = "en"
|
|
|
51 |
# def_lang = set_languages[-1]
|
52 |
def_lang = set_languages[0]
|
53 |
logger.warning(" UnknownLanguage exception: probably snippet too short, setting to %s", def_lang)
|
54 |
+
langs = [(def_lang, 0)]
|
55 |
except Exception as exc:
|
56 |
logger.error(exc)
|
57 |
+
langs = [("en", 0)]
|
58 |
|
59 |
del conf
|
60 |
|
61 |
+
# return first entry's lang
|
62 |
if set_languages is None:
|
63 |
+
def_lang = langs[0][0]
|
64 |
+
else:
|
65 |
+
def_lang = "en"
|
66 |
+
|
67 |
+
# pick the first in Detector(text).languages
|
68 |
+
|
69 |
+
# just to silence pyright
|
70 |
+
# set_languages_: List[str] = [""] if set_languages is None else set_languages
|
71 |
+
|
72 |
+
for elm in langs:
|
73 |
+
if elm[0] in set_languages: # type: ignore
|
74 |
+
def_lang = elm[0]
|
75 |
+
break
|
76 |
|
77 |
# set_languages is set
|
78 |
if not isinstance(set_languages, (list, tuple)):
|
79 |
logger.warning("set_languages (%s) ought to be a list/tuple", set_languages)
|
80 |
|
81 |
+
return def_lang
|
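The new `detect` flow in the diff above (trust fastlid when its confidence is at least 0.3 or the result is `zh`, otherwise fall back to a ranked candidate list) can be sketched without either library installed. This is a minimal sketch, not the project's code: `pick_language`, `primary` and `fallback` are hypothetical names, and the two detectors are passed in as plain callables standing in for `fastlid` and `polyglot`'s `Detector`.

```python
from typing import Callable, List, Optional, Tuple


def pick_language(
    text: str,
    primary: Callable[[str], Tuple[str, float]],
    fallback: Callable[[str], List[Tuple[str, float]]],
    set_languages: Optional[List[str]] = None,
    threshold: float = 0.3,
) -> str:
    """Return a language code, trusting `primary` when it is confident."""
    # Stage 1: the fast detector (assumed to return (code, confidence)).
    lang, conf = primary(text)
    if conf >= threshold or lang == "zh":
        return lang

    # Stage 2: ranked candidates from the slower detector; default to "en"
    # when it yields nothing (e.g. the snippet was too short).
    candidates = fallback(text) or [("en", 0.0)]
    if set_languages is None:
        return candidates[0][0]

    # Prefer the first candidate that is in the allowed set, else "en".
    for code, _ in candidates:
        if code in set_languages:
            return code
    return "en"
```

Passing the detectors as arguments keeps the sketch testable with stubs; the real module calls the libraries directly and also records the result on `detect.lang_conf`.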
radiobee/detect_alt.py
ADDED
@@ -0,0 +1,66 @@
|
|
1 |
+
"""Detect language via polyglot and fastlid."""
|
2 |
+
# pylint: disable=
|
3 |
+
|
4 |
+
from typing import Any, Callable, List, Optional
|
5 |
+
|
6 |
+
from polyglot.text import Detector
|
7 |
+
import polyglot.detect.base
|
8 |
+
from polyglot.detect.base import UnknownLanguage
|
9 |
+
from fastlid import fastlid
|
10 |
+
|
11 |
+
from logzero import logger
|
12 |
+
|
13 |
+
polyglot.detect.base.logger.setLevel("ERROR")
|
14 |
+
|
15 |
+
|
16 |
+
def with_func_attrs(**attrs: Any) -> Callable:
|
17 |
+
"""Define func_attrs."""
|
18 |
+
|
19 |
+
def with_attrs(fct: Callable) -> Callable:
|
20 |
+
for key, val in attrs.items():
|
21 |
+
setattr(fct, key, val)
|
22 |
+
return fct
|
23 |
+
|
24 |
+
return with_attrs
|
25 |
+
|
26 |
+
|
27 |
+
# @with_func_attrs(set_languages=None)
|
28 |
+
# def detect(text: str) -> str:
|
29 |
+
def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
|
30 |
+
"""Detect language via polyglot and fastlid."""
|
31 |
+
# if not text.strip(): return "en"
|
32 |
+
try:
|
33 |
+
_ = [(elm.code[:2], elm.confidence) for elm in Detector(text).languages]
|
34 |
+
detect.lang_conf = _
|
35 |
+
lang, conf = _[0]
|
36 |
+
except UnknownLanguage:
|
37 |
+
if set_languages is None:
|
38 |
+
def_lang = "en"
|
39 |
+
else:
|
40 |
+
# def_lang = set_languages[-1]
|
41 |
+
def_lang = set_languages[0]
|
42 |
+
logger.warning(" UnknownLanguage exception: probably snippet too short, setting to %s", def_lang)
|
43 |
+
lang, conf = def_lang, 0
|
44 |
+
except Exception as exc:
|
45 |
+
logger.error(exc)
|
46 |
+
lang, conf = "en", 0
|
47 |
+
|
48 |
+
del conf
|
49 |
+
|
50 |
+
# if set_languages is None,
|
51 |
+
# trust polyglot.text.Detector
|
52 |
+
if set_languages is None:
|
53 |
+
return lang
|
54 |
+
|
55 |
+
# set_languages is set
|
56 |
+
if not isinstance(set_languages, (list, tuple)):
|
57 |
+
logger.warning("set_languages (%s) ought to be a list/tuple", set_languages)
|
58 |
+
|
59 |
+
if lang in set_languages:
|
60 |
+
return lang
|
61 |
+
|
62 |
+
# lang not in set_languages, use fastlid
|
63 |
+
fastlid.set_languages = set_languages
|
64 |
+
lang, _ = fastlid(text)
|
65 |
+
|
66 |
+
return lang
|
radiobee/gradiobee.py
CHANGED
@@ -2,6 +2,7 @@
|
|
2 |
# pylint: disable=invalid-name
|
3 |
from pathlib import Path
|
4 |
import platform
|
|
|
5 |
from itertools import zip_longest
|
6 |
|
7 |
# import tempfile
|
@@ -32,7 +33,7 @@ uname = platform.uname()
|
|
32 |
HFSPACES = False
|
33 |
if "amzn2" in uname.release: # on hf spaces
|
34 |
HFSPACES = True
|
35 |
-
import SentenceTransformer
|
36 |
model_s = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')
|
37 |
sns.set()
|
38 |
sns.set_style("darkgrid")
|
@@ -102,7 +103,7 @@ def gradiobee(
|
|
102 |
# process file1/text1: split text1 to text1 text2 to zh-en
|
103 |
|
104 |
len_max = 2000
|
105 |
-
if not text2.strip():
|
106 |
_ = [elm.strip() for elm in text1.splitlines() if elm.strip()]
|
107 |
if not _: # essentially empty file1
|
108 |
return error_msg("Nothing worthy of processing in file 1")
|
@@ -151,7 +152,9 @@ def gradiobee(
|
|
151 |
# return df_trimmed, output_plot, file_dl, file_dl_xlsx, df_aligned
|
152 |
|
153 |
# end if single file
|
|
|
154 |
else: # file1 file 2: proceed
|
|
|
155 |
lang1, _ = fastlid(text1)
|
156 |
lang2, _ = fastlid(text2)
|
157 |
|
@@ -175,13 +178,14 @@ def gradiobee(
|
|
175 |
df_trimmed = trim_df(df1)
|
176 |
# --- end else single
|
177 |
|
|
|
|
|
178 |
logger.debug("lang1: %s, lang2: %s", lang1, lang2)
|
179 |
if debug:
|
180 |
-
print("gradiobee ln
|
181 |
print("fast track? ", lang1 in lang_en_zh and lang2 in lang_en_zh)
|
182 |
|
183 |
# fast track
|
184 |
-
lang_en_zh = ["en", "zh"]
|
185 |
if lang1 in lang_en_zh and lang2 in lang_en_zh:
|
186 |
try:
|
187 |
cmat = lists2cmat(
|
@@ -208,10 +212,11 @@ def gradiobee(
|
|
208 |
try:
|
209 |
vec1 = model_s.encode(list1)
|
210 |
vec2 = model_s.encode(list2)
|
211 |
-
cmat = vec1.dot(vec2.T)
|
|
|
212 |
except Exception as exc:
|
213 |
logger.error(exc)
|
214 |
-
return error_msg(exc)
|
215 |
|
216 |
tset = pd.DataFrame(cmat2tset(cmat))
|
217 |
tset.columns = ["x", "y", "cos"]
|
|
|
2 |
# pylint: disable=invalid-name
|
3 |
from pathlib import Path
|
4 |
import platform
|
5 |
+
import inspect
|
6 |
from itertools import zip_longest
|
7 |
|
8 |
# import tempfile
|
|
|
33 |
HFSPACES = False
|
34 |
if "amzn2" in uname.release: # on hf spaces
|
35 |
HFSPACES = True
|
36 |
+
from sentence_transformers import SentenceTransformer
|
37 |
model_s = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')
|
38 |
sns.set()
|
39 |
sns.set_style("darkgrid")
|
|
|
103 |
# process file1/text1: split text1 to text1 text2 to zh-en
|
104 |
|
105 |
len_max = 2000
|
106 |
+
if not text2.strip(): # empty file2
|
107 |
_ = [elm.strip() for elm in text1.splitlines() if elm.strip()]
|
108 |
if not _: # essentially empty file1
|
109 |
return error_msg("Nothing worthy of processing in file 1")
|
|
|
152 |
# return df_trimmed, output_plot, file_dl, file_dl_xlsx, df_aligned
|
153 |
|
154 |
# end if single file
|
155 |
+
# not single file
|
156 |
else: # file1 file 2: proceed
|
157 |
+
fastlid.set_languages = None
|
158 |
lang1, _ = fastlid(text1)
|
159 |
lang2, _ = fastlid(text2)
|
160 |
|
|
|
178 |
df_trimmed = trim_df(df1)
|
179 |
# --- end else single
|
180 |
|
181 |
+
lang_en_zh = ["en", "zh"]
|
182 |
+
|
183 |
logger.debug("lang1: %s, lang2: %s", lang1, lang2)
|
184 |
if debug:
|
185 |
+
print("gradiobee.py ln 82 lang1: %s, lang2: %s" % (lang1, lang2))
|
186 |
print("fast track? ", lang1 in lang_en_zh and lang2 in lang_en_zh)
|
187 |
|
188 |
# fast track
|
|
|
189 |
if lang1 in lang_en_zh and lang2 in lang_en_zh:
|
190 |
try:
|
191 |
cmat = lists2cmat(
|
|
|
212 |
try:
|
213 |
vec1 = model_s.encode(list1)
|
214 |
vec2 = model_s.encode(list2)
|
215 |
+
# cmat = vec1.dot(vec2.T)
|
216 |
+
cmat = vec2.dot(vec1.T)
|
217 |
except Exception as exc:
|
218 |
logger.error(exc)
|
219 |
+
return error_msg(f"{exc}, {__file__} {inspect.currentframe().f_lineno}, period")
|
220 |
|
221 |
tset = pd.DataFrame(cmat2tset(cmat))
|
222 |
tset.columns = ["x", "y", "cos"]
|
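The slow-track change above swaps `vec1.dot(vec2.T)` for `vec2.dot(vec1.T)`. The two products hold the same values; one is the transpose of the other, so the change only affects which list indexes the rows. A minimal sketch of that relationship, assuming random arrays as stand-ins for the `model_s.encode(...)` outputs (the sizes 3, 5 and 8 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
vec1 = rng.normal(size=(3, 8))  # stand-in for model_s.encode(list1), 3 items
vec2 = rng.normal(size=(5, 8))  # stand-in for model_s.encode(list2), 5 items

cmat_old = vec1.dot(vec2.T)  # rows follow list1: shape (3, 5)
cmat_new = vec2.dot(vec1.T)  # rows follow list2: shape (5, 3)

# Same numbers, transposed layout.
same_values = np.allclose(cmat_new, cmat_old.T)
```

Note that a plain dot product equals cosine similarity only when the embedding rows are unit-normalized; whether the model used in the diff normalizes its output is not stated there.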
radiobee/text2lists.py
CHANGED
@@ -7,6 +7,7 @@ from typing import Iterable, List, Optional, Tuple, Union # noqa
|
|
7 |
import numpy as np
|
8 |
|
9 |
# from fastlid import fastlid
|
|
|
10 |
from logzero import logger
|
11 |
|
12 |
from radiobee.lists2cmat import lists2cmat
|
@@ -21,9 +22,8 @@ def text2lists(
|
|
21 |
|
22 |
Args:
|
23 |
text: mixed text
|
24 |
-
set_languages: default
|
25 |
-
|
26 |
-
set_languages = ["en", "zh"]
|
27 |
|
28 |
Attributes:
|
29 |
cmat: correlation matrix (len(list_l) x len(list_r))
|
@@ -42,7 +42,19 @@ def text2lists(
|
|
42 |
|
43 |
# set_languages default to ["en", "zh"]
|
44 |
if set_languages is None:
|
45 |
-
|
46 |
|
47 |
# fastlid.set_languages = set_languages
|
48 |
|
@@ -51,6 +63,7 @@ def text2lists(
|
|
51 |
|
52 |
# lang0, _ = fastlid(text[:15000])
|
53 |
lang0 = detect(text, set_languages)
|
|
|
54 |
res = []
|
55 |
left = True # start with left list1
|
56 |
|
|
|
7 |
import numpy as np
|
8 |
|
9 |
# from fastlid import fastlid
|
10 |
+
from polyglot.text import Detector
|
11 |
from logzero import logger
|
12 |
|
13 |
from radiobee.lists2cmat import lists2cmat
|
|
|
22 |
|
23 |
Args:
|
24 |
text: mixed text
|
25 |
+
set_languages: no default (open-end)
|
26 |
+
use polyglot.text.Detector to pick two languages
|
|
|
27 |
|
28 |
Attributes:
|
29 |
cmat: correlation matrix (len(list_l) x len(list_r))
|
|
|
42 |
|
43 |
# set_languages default to ["en", "zh"]
|
44 |
if set_languages is None:
|
45 |
+
lang12 = [elm.code for elm in Detector(text).languages]
|
46 |
+
|
47 |
+
# set_languages = ["en", "zh"]
|
48 |
+
|
49 |
+
# set 'un' to 'en'
|
50 |
+
# set_languages = ['en' if elm in ['un'] else elm for elm in lang12[:2]]
|
51 |
+
set_languages = []
|
52 |
+
for elm in lang12[:2]:
|
53 |
+
if elm in ["un"]:
|
54 |
+
logger.warning(" Unknown language, set to en")
|
55 |
+
set_languages.append("en")
|
56 |
+
else:
|
57 |
+
set_languages.append(elm)
|
58 |
|
59 |
# fastlid.set_languages = set_languages
|
60 |
|
|
|
63 |
|
64 |
# lang0, _ = fastlid(text[:15000])
|
65 |
lang0 = detect(text, set_languages)
|
66 |
+
|
67 |
res = []
|
68 |
left = True # start with left list1
|
69 |
|
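When `set_languages` is not given, the new `text2lists` code above keeps the first two detected language codes and maps the unknown marker `un` to `en`. A minimal sketch of just that selection step, assuming the detected codes are already available as a list of strings; `two_languages` is a hypothetical helper name, not part of radiobee.

```python
from typing import List


def two_languages(codes: List[str], default: str = "en") -> List[str]:
    """Keep the first two detected codes, replacing the unknown marker 'un'."""
    return [default if code == "un" else code for code in codes[:2]]
```

The diff spells this out as an explicit loop so it can log a warning for each unknown code; the comprehension here drops the logging for brevity.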
tests/test_detect.py
CHANGED
@@ -21,6 +21,20 @@ def test_detect(test_input, expected):
|
|
21 |
|
22 |
def test_detect_de():
|
23 |
"""Test detect de."""
|
24 |
-
|
25 |
-
assert detect(
|
26 |
-
assert detect(
|
|
|
21 |
|
22 |
def test_detect_de():
|
23 |
"""Test detect de."""
|
24 |
+
text_de = "4\u3000In der Beschränkung zeigt sich erst der Meister, / Und das Gesetz nur kann uns Freiheit geben. 参见http://www.business-it.nl/files/7d413a5dca62fc735a072b16fbf050b1-27.php." # noqa
|
25 |
+
assert detect(text_de) == "de"
|
26 |
+
assert detect(text_de, ["en", "zh"]) == "zh"
|
27 |
+
|
28 |
+
|
29 |
+
def test_elm1():
|
30 |
+
"""Test ——撰文:Thomas Gibbons-Neff和Fahim Abed,摄影:Jim Huylebroek=."""
|
31 |
+
elm1 = "——撰文:Thomas Gibbons-Neff和Fahim Abed,摄影:Jim Huylebroek"
|
32 |
+
assert detect(elm1) == "ja"
|
33 |
+
assert detect(elm1, ["en", "zh"]) == "zh"
|
34 |
+
|
35 |
+
|
36 |
+
def test_elm2():
|
37 |
+
"""Test 在卢旺达基加利的一家牛奶吧。 JACQUES NKINZINGABO FOR THE NEW YORK TIMES."""
|
38 |
+
elm2 = "在卢旺达基加利的一家牛奶吧。 JACQUES NKINZINGABO FOR THE NEW YORK TIMES"
|
39 |
+
assert detect(elm2) == "zh"
|
40 |
+
assert detect(elm2, ["en", "zh"]) == "zh"
|
tests/test_text2lists.py
CHANGED
@@ -4,18 +4,19 @@ from radiobee.loadtext import loadtext
|
|
4 |
from radiobee.text2lists import text2lists
|
5 |
|
6 |
|
7 |
-
def
|
8 |
"""Test text2lists data\test-dual.txt."""
|
9 |
filename = r"data\test-dual.txt"
|
10 |
text = loadtext(filename) # noqa
|
11 |
l1, l2 = text2lists(text)
|
12 |
assert l2[0] in [""]
|
13 |
-
assert "国际\n中\n双语" in l1[0]
|
|
|
14 |
|
15 |
|
16 |
def test_shakespeare1000():
|
17 |
"""Separate first 1000.
|
18 |
-
|
19 |
from pathlib import Path
|
20 |
import zipfile
|
21 |
dir_loc = r""
|
@@ -34,11 +35,11 @@ def test_shakespeare1000():
|
|
34 |
break
|
35 |
line += 1
|
36 |
Path(f"data/shakespeare-zh-en-{numb_lines}.txt").write_text("\n".join(text1000), encoding="utf8")
|
37 |
-
|
38 |
tset = cmat2test(cmat)
|
39 |
df = pd.DataFrame(tset).rename(columns=dict(zip(range(0, 3), ['x', 'y', 'cos'])))
|
40 |
plot_df(df)
|
41 |
-
|
42 |
"""
|
43 |
# text1000a = Path("data/shakespeare-zh-en-1000.txt").read_text(encoding="utf8")
|
44 |
# text2000 = Path("data/shakespeare-zh-en-1000.txt").read_text(encoding="utf8")
|
@@ -46,5 +47,12 @@ def test_shakespeare1000():
|
|
46 |
|
47 |
# l1000a, l10002b = text2lists(text1000)
|
48 |
# l2000a, l2000b = text2lists(text2000)
|
49 |
-
|
50 |
l4000, r4000 = text2lists(text4000)
|
|
|
4 |
from radiobee.text2lists import text2lists
|
5 |
|
6 |
|
7 |
+
def test_text2lists_dual1():
|
8 |
"""Test text2lists data\test-dual.txt."""
|
9 |
filename = r"data\test-dual.txt"
|
10 |
text = loadtext(filename) # noqa
|
11 |
l1, l2 = text2lists(text)
|
12 |
assert l2[0] in [""]
|
13 |
+
assert "国际\n中\n双语"[:2] in l1[0]
|
14 |
+
assert '2021' in l2[5]
|
15 |
|
16 |
|
17 |
def test_shakespeare1000():
|
18 |
"""Separate first 1000.
|
19 |
+
|
20 |
from pathlib import Path
|
21 |
import zipfile
|
22 |
dir_loc = r""
|
|
|
35 |
break
|
36 |
line += 1
|
37 |
Path(f"data/shakespeare-zh-en-{numb_lines}.txt").write_text("\n".join(text1000), encoding="utf8")
|
38 |
+
|
39 |
tset = cmat2test(cmat)
|
40 |
df = pd.DataFrame(tset).rename(columns=dict(zip(range(0, 3), ['x', 'y', 'cos'])))
|
41 |
plot_df(df)
|
42 |
+
|
43 |
"""
|
44 |
# text1000a = Path("data/shakespeare-zh-en-1000.txt").read_text(encoding="utf8")
|
45 |
# text2000 = Path("data/shakespeare-zh-en-1000.txt").read_text(encoding="utf8")
|
|
|
47 |
|
48 |
# l1000a, l10002b = text2lists(text1000)
|
49 |
# l2000a, l2000b = text2lists(text2000)
|
50 |
+
|
51 |
l4000, r4000 = text2lists(text4000)
|
52 |
+
|
53 |
+
|
54 |
+
def test_test_dual2():
|
55 |
+
"""Test data/test-dual.txt."""
|
56 |
+
test_dual = Path("data/test-dual.txt").read_text(encoding="utf8")
|
57 |
+
|
58 |
+
l_dual, r_dual = text2lists(test_dual)
|
tests/test_text2lists_bug2.py
CHANGED
@@ -7,10 +7,8 @@ from radiobee.text2lists import text2lists
|
|
7 |
def test_text2lists_bug2():
|
8 |
"""Test text2lists data\问题2测试文件.txt."""
|
9 |
filename = r"data\问题2测试文件.txt"
|
10 |
-
|
11 |
-
l1, l2 = text2lists(
|
12 |
-
# assert l2[0] in [""]
|
13 |
-
# assert "国际\n中\n双语" in l1[0]
|
14 |
|
15 |
-
assert len(l1) ==
|
16 |
-
assert len(l2) ==
|
|
|
7 |
def test_text2lists_bug2():
|
8 |
"""Test text2lists data\问题2测试文件.txt."""
|
9 |
filename = r"data\问题2测试文件.txt"
|
10 |
+
textbug2 = loadtext(filename) # noqa
|
11 |
+
l1, l2 = text2lists(textbug2)
|
|
|
|
|
12 |
|
13 |
+
assert len(l1) == 5
|
14 |
+
assert len(l2) == 4
|