freemt
committed on
Commit
•
1ca37ad
1
Parent(s):
265100f
Update docs
- data/en.txt +2 -2
- data/zh.txt +1 -1
- docs/build/doctrees/environment.pickle +0 -0
- docs/build/doctrees/examples.doctree +0 -0
- docs/build/doctrees/intro.doctree +0 -0
- docs/build/html/_sources/examples.rst.txt +2 -0
- docs/build/html/_sources/intro.rst.txt +1 -1
- docs/build/html/examples.html +1 -0
- docs/build/html/intro.html +1 -1
- docs/build/html/searchindex.js +1 -1
- docs/source/examples.rst +2 -0
- docs/source/intro.rst +1 -1
- radiobee/__init__.py +1 -0
- radiobee/detect.py +1 -1
- radiobee/radiobee_cli.py +545 -0
- radiobee/trim_df.py +2 -6
- requirements.txt +4 -1
data/en.txt
CHANGED
@@ -1,5 +1,5 @@
|
|
1 |
-
[Young Warrior] Kingold(
|
2 |
-
It seems that the standalone version can
|
3 |
omit the GUI and specify the two files to be aligned directly on the command line.
|
4 |
|
5 |
|
|
|
1 |
+
[Young Warrior] Kingold(...) 2021-12-30 22:27:37
|
2 |
+
It seems that the standalone version can
|
3 |
omit the GUI and specify the two files to be aligned directly on the command line.
|
4 |
|
5 |
|
data/zh.txt
CHANGED
@@ -1,4 +1,4 @@
|
|
1 |
-
【少侠】Kingold(
|
2 |
单机版貌似可以省略掉图形界面,直接
|
3 |
命令行指定两个待对齐文件。
|
4 |
|
|
|
1 |
+
【少侠】Kingold(...) 2021-12-30 22:27:37
|
2 |
单机版貌似可以省略掉图形界面,直接
|
3 |
命令行指定两个待对齐文件。
|
4 |
|
docs/build/doctrees/environment.pickle
CHANGED
Binary files a/docs/build/doctrees/environment.pickle and b/docs/build/doctrees/environment.pickle differ
|
|
docs/build/doctrees/examples.doctree
CHANGED
Binary files a/docs/build/doctrees/examples.doctree and b/docs/build/doctrees/examples.doctree differ
|
|
docs/build/doctrees/intro.doctree
CHANGED
Binary files a/docs/build/doctrees/intro.doctree and b/docs/build/doctrees/intro.doctree differ
|
|
docs/build/html/_sources/examples.rst.txt
CHANGED
@@ -3,6 +3,8 @@ Examples
|
|
3 |
|
4 |
``radiobee`` has in-built examples. Just click one of the rows in the ``Examples`` table and click ``Submit`` to testrun.
|
5 |
|
|
|
|
|
6 |
Installation/Usage:
|
7 |
*******************
|
8 |
As the package has not been published on PyPi yet, it CANNOT be installed using pip.
|
|
|
3 |
|
4 |
``radiobee`` has in-built examples. Just click one of the rows in the ``Examples`` table and click ``Submit`` to testrun.
|
5 |
|
6 |
+
`gradio 3` (run in hf spaces) seems to have trouble with examples. Hence, examples may be taken off line until the problem is fixed.
|
7 |
+
|
8 |
Installation/Usage:
|
9 |
*******************
|
10 |
As the package has not been published on PyPi yet, it CANNOT be installed using pip.
|
docs/build/html/_sources/intro.rst.txt
CHANGED
@@ -18,4 +18,4 @@ Limitations
|
|
18 |
Currently, only zh-en/en-zh pairs are supported in fast-track mode although further pairs will be added if and when time permits.
|
19 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
|
20 |
|
21 |
-
An experimental slow-track mode (time required approximately 10 times that of fast-track mode) is
|
|
|
18 |
Currently, only zh-en/en-zh pairs are supported in fast-track mode although further pairs will be added if and when time permits.
|
19 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
|
20 |
|
21 |
+
An experimental slow-track mode (time required approximately 10 times that of fast-track mode) is introduced for other language pairs.
|
docs/build/html/examples.html
CHANGED
@@ -76,6 +76,7 @@
|
|
76 |
<section id="examples">
|
77 |
<h1>Examples<a class="headerlink" href="#examples" title="Permalink to this headline"></a></h1>
|
78 |
<p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> has in-built examples. Just click one of the rows in the <code class="docutils literal notranslate"><span class="pre">Examples</span></code> table and click <code class="docutils literal notranslate"><span class="pre">Submit</span></code> to testrun.</p>
|
|
|
79 |
<section id="installation-usage">
|
80 |
<h2>Installation/Usage:<a class="headerlink" href="#installation-usage" title="Permalink to this headline"></a></h2>
|
81 |
<p>As the package has not been published on PyPi yet, it CANNOT be installed using pip.</p>
|
|
|
76 |
<section id="examples">
|
77 |
<h1>Examples<a class="headerlink" href="#examples" title="Permalink to this headline"></a></h1>
|
78 |
<p><code class="docutils literal notranslate"><span class="pre">radiobee</span></code> has in-built examples. Just click one of the rows in the <code class="docutils literal notranslate"><span class="pre">Examples</span></code> table and click <code class="docutils literal notranslate"><span class="pre">Submit</span></code> to testrun.</p>
|
79 |
+
<p><cite>gradio 3</cite> (run in hf spaces) seems to have trouble with examples. Hence, examples may be taken off line until the problem is fixed.</p>
|
80 |
<section id="installation-usage">
|
81 |
<h2>Installation/Usage:<a class="headerlink" href="#installation-usage" title="Permalink to this headline"></a></h2>
|
82 |
<p>As the package has not been published on PyPi yet, it CANNOT be installed using pip.</p>
|
docs/build/html/intro.html
CHANGED
@@ -87,7 +87,7 @@
|
|
87 |
<h2>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline"></a></h2>
|
88 |
<p>Currently, only zh-en/en-zh pairs are supported in fast-track mode although further pairs will be added if and when time permits.
|
89 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.</p>
|
90 |
-
<p>An experimental slow-track mode (time required approximately 10 times that of fast-track mode) is
|
91 |
</section>
|
92 |
</section>
|
93 |
|
|
|
87 |
<h2>Limitations<a class="headerlink" href="#limitations" title="Permalink to this headline"></a></h2>
|
88 |
<p>Currently, only zh-en/en-zh pairs are supported in fast-track mode although further pairs will be added if and when time permits.
|
89 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.</p>
|
90 |
+
<p>An experimental slow-track mode (time required approximately 10 times that of fast-track mode) is introduced for other language pairs.</p>
|
91 |
</section>
|
92 |
</section>
|
93 |
|
docs/build/html/searchindex.js
CHANGED
@@ -1 +1 @@
|
|
1 |
-
Search.setIndex({docnames:["examples","index","intro","modules","radiobee","userguide","userguide-zh"],envversion:{"sphinx.domains.c":2,"sphinx.domains.changeset":1,"sphinx.domains.citation":1,"sphinx.domains.cpp":4,"sphinx.domains.index":1,"sphinx.domains.javascript":2,"sphinx.domains.math":2,"sphinx.domains.python":3,"sphinx.domains.rst":2,"sphinx.domains.std":2,sphinx:56},filenames:["examples.rst","index.rst","intro.rst","modules.rst","radiobee.rst","userguide.rst","userguide-zh.rst"],objects:{},objnames:{},objtypes:{},terms:{"1":[5,6],"10":2,"12":[5,6],"2":[5,6],"200":[5,6],"2000":[5,6],"3":2,"316287378":[5,6],"4":[5,6],"
|
|
|
1 |
+
Search.setIndex({docnames:["examples","index","intro","modules","radiobee","userguide","userguide-zh"],envversion:{"sphinx.domains.c":2,"sphinx.domains.changeset":1,"sphinx.domains.citation":1,"sphinx.domains.cpp":4,"sphinx.domains.index":1,"sphinx.domains.javascript":2,"sphinx.domains.math":2,"sphinx.domains.python":3,"sphinx.domains.rst":2,"sphinx.domains.std":2,sphinx:56},filenames:["examples.rst","index.rst","intro.rst","modules.rst","radiobee.rst","userguide.rst","userguide-zh.rst"],objects:{},objnames:{},objtypes:{},terms:{"1":[5,6],"10":2,"12":[5,6],"2":[5,6],"200":[5,6],"2000":[5,6],"3":[0,2],"316287378":[5,6],"4":[5,6],"500":6,"8":[5,6],"\u4e00\u822c\u65e0\u9700\u7406\u4f1a\u8fd9\u4e9b\u53c2\u6570":6,"\u4e2d\u82f1\u975e\u7a7a\u884c\u9650\u5236\u5728":6,"\u4e3a\u4e2d\u82f1\u6587\u6df7\u5408\u6587\u672c\u53ca\u8bd5\u7740\u5206\u79bb\u4e2d\u82f1\u6587":6,"\u4e3a\u7a7a\u767d\u65f6":6,"\u4e86\u89e3\u8fd9\u4e9b\u5bf9\u9f50\u5de5\u5177":6,"\u4ee5\u5185":6,"\u4ee5\u540e\u53ef\u80fd\u4f1a\u652f\u6301":6,"\u4f18\u8d28\u5bf9":6,"\u4f7f\u7528\u8bf4\u660e":1,"\u5176\u4ed6\u8bed\u8a00\u5bf9\u7684\u5bf9\u9f50":6,"\u5219\u4f1a\u89c6":6,"\u5219\u9650\u5236\u5728":6,"\u53e6\u4e00\u65b9\u9762":6,"\u53ef\u4ee5\u53f3\u51fb\u62f7\u51fa\u56fe\u7684\u94fe\u63a5\u7528\u6d4f\u89c8\u5668\u72ec\u7acb\u8bbf\u95ee\u62f7\u51fa\u6765\u7684\u94fe\u63a5\u6216\u53f3\u51fb\u5b58\u76d8\u518d\u7528\u770b\u56fe\u7a0b\u5e8f\u6253\u5f00\u5b58\u76d8\u7684\u56fe\u6587\u4ef6":6,"\u548c":6,"\u5acc\u56fe\u592a\u5c0f\u7684\u8bdd":6,"\u5b58\u4e0b\u6709\u5173\u53c2\u6570\u67e5\u770b\u6216\u901a\u77e5\u5f00\u53d1\u8005":6,"\u5bf9\u7ea6\u97005\u5206\u949f":6,"\u5feb\u5bf9\u6a21\u5f0f\u76ee\u524d\u4ec5\u652f\u6301\u4e2d\u82f1":6,"\u662f":6,"\u6700\u5c0f":6,"\u7136\u540e\u8fdb\u884c\u5bf9\u9f50":6,"\u7684\u5b6a\u751f\u5144\u5f1f":6,"\u7684\u5efa\u8bae\u503c":6,"\u76ee\u524d\u4ec5\u652f\u6301\u7eaf\u6587\u672c\u6587\u4ef6\u4e0a\u8f7d":6,"\u7b2c\u4e8c\u6b21\u4e0a\u8f7d\u6587\u4ef6\u524d\u8bf7\u70b9\u51fb":6,"\u
7b49":6,"\u7b49\u683c\u5f0f":6,"\u82f1\u4e2d":6,"\u82f1\u4e2d\u5bf9\u9f50":6,"\u8bbe\u5927\u4e9b\u5219\u4f1a\u5f97\u5230\u5c11\u4e00\u4e9b\u5bf9\u9f50\u5bf9\u56e0\u4e3a\u53ef\u80fd\u9519\u5931\u4e86\u4e00\u4e9b":6,"\u8bbe\u5927\u4e9b\u6216":6,"\u8bbe\u5c0f\u4e9b\u53ef\u4ee5\u5f97\u5230\u66f4\u591a\u7684\u5bf9\u9f50\u5bf9\u4f46\u4e5f\u4f1a\u6709\u66f4\u591a":6,"\u8bbe\u5c0f\u4e9b\u6216":6,"\u8bef\u62a5\u5bf9":6,"\u8bf7\u52a0\u5165qq\u7fa4":6,"\u8fd0\u884c\u51fa\u9519\u65f6\u53ef\u4ee5\u70b9\u51fb":6,"\u9519\u8bef\u5224\u65ad\u4e3a\u5bf9\u9f50\u7684\u5bf9":6,"do":5,"new":5,As:0,For:0,If:[2,5],On:5,The:[2,5],To:5,about:5,ad:2,address:5,aim:2,align:[0,2,5,6],align_s:[1,3],align_text:[1,3],also:5,although:2,amend_avec:[1,3],an:2,app:[1,3],applic:2,approxim:2,ar:[2,5],attempt:5,been:[0,2],befor:5,better:5,blank:5,browser:5,built:0,bumblebe:[5,6],can:5,candid:5,cannot:0,cat:2,chines:5,clear:[5,6],click:[0,5],cmat2tset:[1,3],co:0,contact:2,content:3,copi:5,csv:[5,6],current:2,de:2,develop:[2,5],dl_type:[5,6],docterm_scor:[1,3],docx:[5,6],download:0,dual:2,dualtext:2,e:2,ebook:2,educ:2,en2zh:[1,3],en2zh_token:[1,3],en:[2,5],english:5,epsilon:[5,6],esp:[5,6],etc:[2,5],exampl:[1,2,5],experiment:2,fals:5,fast:2,file2text:[1,3],file:[5,6],files2df:[1,3],find:2,first:5,fix:0,flag:[5,6],format:5,full:2,further:2,g:2,gen_aset:[1,3],gen_eps_minsampl:[1,3],gen_model:[1,3],gen_pset:[1,3],gen_row_align:[1,3],go:5,good:5,gradio:[0,2],group:5,ha:[0,2],hand:5,have:[0,5],help:2,henc:0,hf:0,how:1,html:[5,6],http:0,huggingfac:0,identifi:5,idf_typ:[5,6],imag:5,implement:2,index:1,inform:5,insert_spac:[1,3],instal:1,interfac:2,interpolate_pset:[1,3],introduc:2,introduct:1,ja:2,join:5,just:0,know:5,languag:2,languang:5,larger:5,later:5,laugnag:2,learn:2,left:5,limit:[1,5],line:[0,5],lists2cmat:[1,3],loadtext:[1,3],look:5,machin:2,mai:[0,5],mani:2,md:[5,6],mdx_e2c:[1,3],method:0,mikee:0,min_sampl:[5,6],minimum:5,miss:5,mix:5,mode:2,modul:[1,3],more:5,motiv:1,need:5,non:5,norm:[5,6],normal:5,now:
0,number:5,off:0,one:0,onli:2,onlin:0,open:5,other:[2,5],output:5,packag:[0,1,3],page:1,pair:[2,5],paragraph:2,particular:2,pdf:[5,6],permit:2,pip:0,pleas:5,plot_cmat:[1,3],plot_df:[1,3],posit:5,power:2,problem:0,proced:5,process_upload:[1,3],properli:2,provid:2,publish:0,pure:5,pypi:0,python:2,qq:5,radiobe:[0,2,5,6],requir:2,result:5,right:5,row:0,ru:2,run:0,save:5,search:1,seem:0,seg_text:[1,3],select:5,sentenc:2,separ:5,should:5,shuffle_s:[1,3],sibl:5,slow:2,smaller:5,smatrix:[1,3],someth:5,space:0,srt:[5,6],submit:[0,5],submodul:[1,3],subsequ:5,suggest:[0,5],support:[2,5],tab:5,tabl:0,taken:0,tend:5,term:2,testrun:0,text:[2,5],tf_type:[5,6],them:5,time:2,tmx:2,touch:5,track:2,translat:2,treat:5,trim_df:[1,3],troubl:0,two:2,txt:[5,6],unless:5,until:0,upload:5,us:[0,1],usag:1,valu:5,version:0,welcom:2,what:5,when:[2,5],willing:2,wrong:5,yet:0,you:[2,5],zh:[2,5],zip:0},titles:["Examples","Welcome to radiobee\u2019s documentation!","Introduction","radiobee","radiobee package","How to use","\u4f7f\u7528\u8bf4\u660e"],titleterms:{"\u4f7f\u7528\u8bf4\u660e":6,align_s:4,align_text:4,amend_avec:4,app:4,cmat2tset:4,content:[1,4],docterm_scor:4,document:1,en2zh:4,en2zh_token:4,exampl:0,file2text:4,files2df:4,gen_aset:4,gen_eps_minsampl:4,gen_model:4,gen_pset:4,gen_row_align:4,how:5,indic:1,insert_spac:4,instal:0,interpolate_pset:4,introduct:2,limit:2,lists2cmat:4,loadtext:4,mdx_e2c:4,modul:4,motiv:2,packag:4,plot_cmat:4,plot_df:4,process_upload:4,radiobe:[1,3,4],s:1,seg_text:4,shuffle_s:4,smatrix:4,submodul:4,tabl:1,trim_df:4,us:5,usag:0,welcom:1}})
|
docs/source/examples.rst
CHANGED
@@ -3,6 +3,8 @@ Examples
|
|
3 |
|
4 |
``radiobee`` has in-built examples. Just click one of the rows in the ``Examples`` table and click ``Submit`` to testrun.
|
5 |
|
|
|
|
|
6 |
Installation/Usage:
|
7 |
*******************
|
8 |
As the package has not been published on PyPi yet, it CANNOT be installed using pip.
|
|
|
3 |
|
4 |
``radiobee`` has in-built examples. Just click one of the rows in the ``Examples`` table and click ``Submit`` to testrun.
|
5 |
|
6 |
+
`gradio 3` (run in hf spaces) seems to have trouble with examples. Hence, examples may be taken off line until the problem is fixed.
|
7 |
+
|
8 |
Installation/Usage:
|
9 |
*******************
|
10 |
As the package has not been published on PyPi yet, it CANNOT be installed using pip.
|
docs/source/intro.rst
CHANGED
@@ -18,4 +18,4 @@ Limitations
|
|
18 |
Currently, only zh-en/en-zh pairs are supported in fast-track mode although further pairs will be added if and when time permits.
|
19 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
|
20 |
|
21 |
-
An experimental slow-track mode (time required approximately 10 times that of fast-track mode) is
|
|
|
18 |
Currently, only zh-en/en-zh pairs are supported in fast-track mode although further pairs will be added if and when time permits.
|
19 |
If you are willing to help with a particular pair (for example, de-zh, ja-zh, ru-zh, etc.), you are welcome to contact the developer.
|
20 |
|
21 |
+
An experimental slow-track mode (time required approximately 10 times that of fast-track mode) is introduced for other language pairs.
|
radiobee/__init__.py
CHANGED
@@ -1 +1,2 @@
|
|
1 |
"""Init."""
|
|
|
|
1 |
"""Init."""
|
2 |
+
__version__ = "0.1.0b"
|
radiobee/detect.py
CHANGED
@@ -29,7 +29,7 @@ def with_func_attrs(**attrs: Any) -> Callable:
|
|
29 |
def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
|
30 |
"""Detect language via polyglot and fastlid.
|
31 |
|
32 |
-
check first with fastlid, if conf < 0.3, check with
|
33 |
|
34 |
Alternative in detec_alt.py
|
35 |
"""
|
|
|
29 |
def detect(text: str, set_languages: Optional[List[str]] = None) -> str:
|
30 |
"""Detect language via polyglot and fastlid.
|
31 |
|
32 |
+
check first with fastlid, if conf < 0.3, check with polyglot.text.Detector
|
33 |
|
34 |
Alternative in detec_alt.py
|
35 |
"""
|
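The docstring above describes a two-stage check: ask one detector first and fall back to a second one when the confidence is below 0.3. A minimal sketch of that control flow follows. The `Detector` signature and the three stand-in detectors are hypothetical and only illustrate the fallback; they are not the fastlid or polyglot APIs.

```python
from typing import Callable, Tuple

# Hypothetical detector signature: text -> (language code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]


def detect_with_fallback(
    text: str,
    primary: Detector,
    fallback: Detector,
    threshold: float = 0.3,
) -> str:
    """Return the primary guess unless its confidence is below threshold."""
    lang, conf = primary(text)
    if conf >= threshold:
        return lang
    # Low confidence: defer to the second detector.
    lang, _ = fallback(text)
    return lang


# Stand-in detectors for illustration only (not fastlid or polyglot).
def sure_detector(text: str) -> Tuple[str, float]:
    return "en", 0.9


def unsure_detector(text: str) -> Tuple[str, float]:
    return "en", 0.1


def other_detector(text: str) -> Tuple[str, float]:
    return "zh", 0.8
```

Passing the detectors in as callables keeps the fallback logic testable without installing either library.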
radiobee/radiobee_cli.py
ADDED
@@ -0,0 +1,545 @@
|
1 |
+
"""Run radiobee-cli, based on gradiobee.
|
2 |
+
|
3 |
+
https://stackoverflow.com/questions/71007924/how-can-i-get-a-version-to-the-root-of-a-typer-typer-application
|
4 |
+
"""
|
5 |
+
# pylint: disable=invalid-name, too-many-arguments, too-many-branches, too-many-locals, too-many-statements, unused-variable, too-many-return-statements, unused-import
|
6 |
+
|
7 |
+
from typing import Optional
|
8 |
+
from pathlib import Path
|
9 |
+
import platform
|
10 |
+
import inspect
|
11 |
+
from itertools import zip_longest
|
12 |
+
|
13 |
+
# import tempfile
|
14 |
+
|
15 |
+
# from click import click
|
16 |
+
import typer
|
17 |
+
from sklearn.cluster import DBSCAN
|
18 |
+
from fastlid import fastlid
|
19 |
+
from logzero import logger
|
20 |
+
from icecream import ic
|
21 |
+
|
22 |
+
import numpy as np # noqa
|
23 |
+
import pandas as pd
|
24 |
+
import matplotlib # noqa
|
25 |
+
import matplotlib.pyplot as plt
|
26 |
+
import seaborn as sns
|
27 |
+
|
28 |
+
import sys
|
29 |
+
if "." not in sys.path:
|
30 |
+
sys.path.append(".")
|
31 |
+
|
32 |
+
# from radiobee.process_upload import process_upload
|
33 |
+
from radiobee.files2df import files2df
|
34 |
+
from radiobee.file2text import file2text
|
35 |
+
from radiobee.lists2cmat import lists2cmat
|
36 |
+
from radiobee.gen_pset import gen_pset
|
37 |
+
from radiobee.gen_aset import gen_aset
|
38 |
+
from radiobee.align_texts import align_texts
|
39 |
+
from radiobee.cmat2tset import cmat2tset
|
40 |
+
from radiobee.trim_df import trim_df
|
41 |
+
from radiobee.error_msg import error_msg
|
42 |
+
from radiobee.text2lists import text2lists
|
43 |
+
|
44 |
+
from radiobee.align_sents import align_sents
|
45 |
+
from radiobee.shuffle_sents import shuffle_sents # type: ignore
|
46 |
+
from radiobee.paras2sents import paras2sents # type: ignore
|
47 |
+
from radiobee import __version__
|
48 |
+
|
49 |
+
sns.set()
|
50 |
+
sns.set_style("darkgrid")
|
51 |
+
pd.options.display.float_format = "{:,.2f}".format
|
52 |
+
|
53 |
+
debug = False
|
54 |
+
debug = True
|
55 |
+
|
56 |
+
_ = """
|
57 |
+
def gradiobee( # noqa
|
58 |
+
file1,
|
59 |
+
file2,
|
60 |
+
tf_type,
|
61 |
+
idf_type,
|
62 |
+
dl_type,
|
63 |
+
norm,
|
64 |
+
eps,
|
65 |
+
min_samples,
|
66 |
+
# debug=False,
|
67 |
+
sent_ali_algo,
|
68 |
+
):
|
69 |
+
# """
|
70 |
+
|
71 |
+
app = typer.Typer(
|
72 |
+
add_completion=False,
|
73 |
+
)
|
74 |
+
|
75 |
+
|
76 |
+
def version_callback(value: bool):
|
77 |
+
if value:
|
78 |
+
ver = typer.style(f"{__version__}", fg=typer.colors.GREEN, bold=True)
|
79 |
+
typer.echo(f"radiobee-cli {ver}")
|
80 |
+
raise typer.Exit()
|
81 |
+
|
82 |
+
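The callback above prints the version and exits before the rest of the command runs. The same idea can be sketched with the standard library alone. This uses `argparse` as a stand-in, not typer, and `build_parser` and `version_output` are hypothetical helper names.

```python
import argparse
import contextlib
import io

__version__ = "0.1.0b"  # mirrors the value added to radiobee/__init__.py


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="radiobee-cli")
    # The version action prints and exits as soon as the flag is parsed,
    # so the required positional argument is never checked.
    parser.add_argument(
        "-V", "--version", action="version", version=f"%(prog)s {__version__}"
    )
    parser.add_argument("file1", help="first file name")
    return parser


def version_output() -> str:
    """Run the parser with --version and capture what it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        try:
            build_parser().parse_args(["--version"])
        except SystemExit:
            pass  # argparse exits after printing the version
    return buf.getvalue().strip()
```

In typer the equivalent early exit comes from combining `callback=` with `is_eager=True`, as in the option declared further down in this file.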
|
83 |
+
@app.command()
|
84 |
+
def radiobee_cli(
|
85 |
+
file1: str = typer.Argument(..., help="first file name"),
|
86 |
+
file2: str = typer.Argument(None, help="optional second file name (if not provided, the first file will be split into two texts)"),
|
87 |
+
tf_type: str = typer.Option("linear", help="tf type [linear, sqrt, log, binary]"),
|
88 |
+
idf_type: str = typer.Option(None, help="idf type [None, standard, smooth, bm25]"),
|
89 |
+
dl_type: str = typer.Option("", help="dl type [None, linear, sqrt, log]"),
|
90 |
+
norm: str = typer.Option("", help="norm [None, l1, l2]"),
|
91 |
+
eps: float = typer.Option(10, help="epsilon, typically between 1 and 20"),
|
92 |
+
min_samples: int = typer.Option(6, help="minimum samples, typically between 1 and 20"),
|
93 |
+
sent_ali_algo: str = typer.Option("", help="sentence align algorithm [None, fast, slow]"),
|
94 |
+
version: Optional[bool] = typer.Option(
|
95 |
+
None, "--version", "-V", callback=version_callback, is_eager=True,
|
96 |
+
),
|
97 |
+
):
|
98 |
+
"""Align dualtext."""
|
99 |
+
logger.debug(" *debug* ")
|
100 |
+
|
101 |
+
# possible further switches
|
102 |
+
# para_sent: para/sent
|
103 |
+
# sent_ali: default/radio/gale-church
|
104 |
+
plot_dia = True # noqa
|
105 |
+
|
106 |
+
# outputs: check return
|
107 |
+
# if outputs is modified, also need to modify error_msg's outputs
|
108 |
+
|
109 |
+
# convert "None" (or an empty string) to None for those Radio types
|
110 |
+
idf_type, dl_type, norm = [
|
111 |
+
None if _ in (None, "", "None") else _ for _ in (idf_type, dl_type, norm)
|
112 |
+
]
|
113 |
+
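Several options here accept the literal string "None" (or an empty string) to mean "no value". One way to normalize them is sketched below. `none_if_blank` is a hypothetical helper, not part of radiobee, and the sample values are for illustration only.

```python
from typing import Optional


def none_if_blank(value: Optional[str]) -> Optional[str]:
    """Map None, "" and the literal string "None" to None; keep anything else."""
    if value is None or value.strip() in ("", "None"):
        return None
    return value


# Rebinding a loop variable leaves the original names untouched,
# so the normalized values are assigned back explicitly.
idf_type, dl_type, norm = "None", "", "l2"
idf_type, dl_type, norm = (none_if_blank(v) for v in (idf_type, dl_type, norm))
```

Checking `value is None` first also avoids the `TypeError` that `None in "None"` would raise.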
|
114 |
+
# logger.info("file1: *%s*, file2: *%s*", file1, file2)
|
115 |
+
if file2 is not None:
|
116 |
+
logger.info("file1: *%s*, file2: *%s*", file1, file2)
|
117 |
+
else:
|
118 |
+
logger.info("file1: *%s*, file2: *%s*", file1, file2)
|
119 |
+
|
120 |
+
# bypass if file1 or file2 is str input
|
121 |
+
# if not (isinstance(file1, str) or isinstance(file2, str)):
|
122 |
+
text1 = file2text(file1)
|
123 |
+
|
124 |
+
if file2 is None:
|
125 |
+
logger.debug("file2 is None")
|
126 |
+
text2 = ""
|
127 |
+
else:
|
128 |
+
logger.debug("file2: %s", file2)
|
129 |
+
text2 = file2text(file2)
|
130 |
+
|
131 |
+
# if not text1.strip() or not text2.strip():
|
132 |
+
if not text1.strip():
|
133 |
+
msg = (
|
134 |
+
"file 1 is apparently empty... Provide a non-empty file and try again."
|
135 |
+
# f"text1[:10]: [{text1[:10]}], "
|
136 |
+
# f"text2[:10]: [{text2[:10]}]"
|
137 |
+
)
|
138 |
+
return error_msg(msg)
|
139 |
+
|
140 |
+
# single file
|
141 |
+
# when text2 is empty
|
142 |
+
# process file1/text1: split text1 to text1 text2 to zh-en
|
143 |
+
|
144 |
+
len_max = 2000
|
145 |
+
if not text2.strip(): # empty file2
|
146 |
+
_ = [elm.strip() for elm in text1.splitlines() if elm.strip()]
|
147 |
+
if not _: # essentially empty file1
|
148 |
+
return error_msg("Nothing worthy of processing in file 1")
|
149 |
+
|
150 |
+
logger.info(
|
151 |
+
"single file: len %s, max %s",
|
152 |
+
len(_), 2 * len_max
|
153 |
+
)
|
154 |
+
# exit if there are too many lines
|
155 |
+
if len(_) > 2 * len_max:
|
156 |
+
return error_msg(f" Too many lines ({len(_)}) > {2 * len_max}, alignment op halted, sorry.", "info")
|
157 |
+
|
158 |
+
_ = zip_longest(_, [""], fillvalue="")
|
159 |
+
_ = pd.DataFrame(_, columns=["text1", "text2"])
|
160 |
+
df_trimmed = trim_df(_)
|
161 |
+
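The single-file branch pads a one-column list of lines to two columns with `zip_longest`. A small sketch of how the `fillvalue` argument changes the padding (the sample lines are made up):

```python
from itertools import zip_longest

lines = ["first line", "second line", "third line"]

# Without fillvalue the missing cells are None.
rows_default = list(zip_longest(lines, [""]))

# With fillvalue="" every missing cell is an empty string.
rows_filled = list(zip_longest(lines, [""], fillvalue=""))
```

Empty strings are usually easier to display and export than `None` when the rows end up in a two-column table.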
|
162 |
+
# text1 = loadtext("data/test-dual.txt")
|
163 |
+
list1, list2 = text2lists(text1)
|
164 |
+
|
165 |
+
lang1 = text2lists.lang1
|
166 |
+
lang2 = text2lists.lang2
|
167 |
+
offset = text2lists.offset # noqa
|
168 |
+
|
169 |
+
_ = """
|
170 |
+
ax = sns.heatmap(lists2cmat(list1, list2), cmap="gist_earth_r") # ax=plt.gca()
|
171 |
+
ax.invert_yaxis()
|
172 |
+
ax.set(
|
173 |
+
xlabel=lang1,
|
174 |
+
ylabel=lang2,
|
175 |
+
title=f"cos similarity heatmap \n(offset={offset})",
|
176 |
+
)
|
177 |
+
plt_loc = "img/plt.png"
|
178 |
+
plt.savefig(plt_loc)
|
179 |
+
# """
|
180 |
+
|
181 |
+
# output_plot = plt_loc # for gr.outputs.Image
|
182 |
+
|
183 |
+
#
|
184 |
+
_ = zip_longest(list1, list2, fillvalue="")
|
185 |
+
df_aligned = pd.DataFrame(
|
186 |
+
_,
|
187 |
+
columns=["text1", "text2"]
|
188 |
+
)
|
189 |
+
|
190 |
+
file_dl = Path(f"{Path(file1).stem[:-8]}-{lang1}-{lang2}.csv")
|
191 |
+
file_dl_xlsx = Path(
|
192 |
+
f"{Path(file1).stem[:-8]}-{lang1}-{lang2}.xlsx"
|
193 |
+
)
|
194 |
+
|
195 |
+
# return df_trimmed, output_plot, file_dl, file_dl_xlsx, df_aligned
|
196 |
+
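The output names above are built from the stem of the first file plus the two language codes. The sketch below shows that construction in isolation. `output_names` and its `strip` parameter are hypothetical. The `[:-8]` slice in the code above appears to drop an eight-character suffix from the stem, which is an assumption here and is modelled by `strip=8`.

```python
from pathlib import Path


def output_names(file1: str, lang1: str, lang2: str, strip: int = 0):
    """Build CSV/XLSX names from file1's stem; strip drops trailing stem characters."""
    stem = Path(file1).stem
    if strip:
        stem = stem[:-strip]
    base = f"{stem}-{lang1}-{lang2}"
    return Path(f"{base}.csv"), Path(f"{base}.xlsx")
```

With plain command-line paths there is no extra suffix to remove, so `strip=0` keeps the whole stem.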
|
197 |
+
# end if single file
|
198 |
+
# not single file
|
199 |
+
else: # file1 file 2: proceed
|
200 |
+
fastlid.set_languages = None
|
201 |
+
lang1, _ = fastlid(text1)
|
202 |
+
lang2, _ = fastlid(text2)
|
203 |
+
|
204 |
+
df1 = files2df(file1, file2)
|
205 |
+
|
206 |
+
list1 = [elm for elm in df1.text1 if elm]
|
207 |
+
list2 = [elm for elm in df1.text2 if elm]
|
208 |
+
# len1 = len(list1) # noqa
|
209 |
+
# len2 = len(list2) # noqa
|
210 |
+
|
211 |
+
# exit if there are too many lines
|
212 |
+
len12 = len(list1) + len(list2)
|
213 |
+
logger.info(
|
214 |
+
"fast track: len1 %s, len2 %s, tot %s, max %s",
|
215 |
+
len(list1), len(list2), len(list1) + len(list2), 3 * len_max
|
216 |
+
)
|
217 |
+
if len12 > 3 * len_max:
|
218 |
+
return error_msg(f" Too many lines ({len(list1)} + {len(list2)} > {3 * len_max}), alignment op halted, sorry.", "info")
|
219 |
+
|
220 |
+
file_dl = Path(f"{Path(file1).stem[:-8]}-{Path(file2).stem[:-8]}.csv")
|
221 |
+
file_dl_xlsx = Path(
|
222 |
+
f"{Path(file1).stem[:-8]}-{Path(file2).stem[:-8]}.xlsx"
|
223 |
+
)
|
224 |
+
|
225 |
+
df_trimmed = trim_df(df1)
|
226 |
+
# --- end else single
|
227 |
+
|
228 |
+
lang_en_zh = ["en", "zh"]
|
229 |
+
|
230 |
+
logger.debug("lang1: %s, lang2: %s", lang1, lang2)
|
231 |
+
if debug:
|
232 |
+
ic(f" lang1: {lang1}, lang2: {lang2}")
|
233 |
+
ic("fast track? ", lang1 in lang_en_zh and lang2 in lang_en_zh)
|
234 |
+
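The choice between the two tracks depends only on the detected language codes. A minimal sketch of that decision follows; `choose_track` and `FAST_TRACK_LANGS` are hypothetical names.

```python
FAST_TRACK_LANGS = ("en", "zh")


def choose_track(lang1: str, lang2: str) -> str:
    """Return "fast" when both detected languages are en or zh, else "slow"."""
    if lang1 in FAST_TRACK_LANGS and lang2 in FAST_TRACK_LANGS:
        return "fast"
    return "slow"
```

As in the code here, the test is a membership check on each side, so a same-language pair such as en-en also goes to the fast track.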
|
235 |
+
# fast track
|
236 |
+
if lang1 in lang_en_zh and lang2 in lang_en_zh:
|
237 |
+
try:
|
238 |
+
cmat = lists2cmat(
|
239 |
+
list1,
|
240 |
+
list2,
|
241 |
+
tf_type=tf_type,
|
242 |
+
idf_type=idf_type,
|
243 |
+
dl_type=dl_type,
|
244 |
+
norm=norm,
|
245 |
+
)
|
246 |
+
except Exception as exc:
|
247 |
+
logger.error(exc)
|
248 |
+
return error_msg(exc)
|
249 |
+
# slow track
|
250 |
+
else:
|
251 |
+
logger.info(
|
252 |
+
"slow track: len1 %s, len2 %s, tot: %s, max %s",
|
253 |
+
len(list1), len(list2), len(list1) + len(list2),
|
254 |
+
3 * len_max
|
255 |
+
)
|
256 |
+
if len(list1) + len(list2) > 3 * len_max:
|
257 |
+
msg = (
|
258 |
+
f" len1 {len(list1)} + len2 {len(list2)} > {3 * len_max}. "
|
259 |
+
"This will take too long to complete "
|
260 |
+
"and will hog this experimental server and hinder "
|
261 |
+
"other users from trying the service. "
|
262 |
+
"Aborted...sorry"
|
263 |
+
)
|
264 |
+
return error_msg(msg, "info")
|
265 |
+
try:
|
266 |
+
from radiobee.model_s import model_s # pylint: disable=import-outside-toplevel
|
267 |
+
vec1 = model_s.encode(list1)
|
268 |
+
vec2 = model_s.encode(list2)
|
269 |
+
# cmat = vec1.dot(vec2.T)
|
270 |
+
cmat = vec2.dot(vec1.T)
|
271 |
+
except Exception as exc:
|
272 |
+
logger.error(exc)
|
273 |
+
_ = inspect.currentframe().f_lineno # type: ignore
|
274 |
+
return error_msg(
|
275 |
+
f"{exc}, {Path(__file__).name} ln{_}, period"
|
276 |
+
)
|
277 |
+
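In the slow track the correlation matrix is a dot product of two sets of sentence vectors, with rows following the second text and columns the first. A dot product equals cosine similarity only for unit-length vectors. Whether `model_s.encode` returns unit vectors is not stated here, so this plain-Python sketch normalizes explicitly. `normalize` and `cos_matrix` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)


def cos_matrix(vecs1, vecs2):
    """Rows follow vecs2 and columns follow vecs1, like vec2.dot(vec1.T)."""
    unit1 = [normalize(v) for v in vecs1]
    unit2 = [normalize(v) for v in vecs2]
    return [
        [sum(a * b for a, b in zip(row, col)) for col in unit1]
        for row in unit2
    ]
```

With NumPy arrays the same result is a single matrix product once the rows have been normalized.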
|
278 |
+
tset = pd.DataFrame(cmat2tset(cmat))
|
279 |
+
tset.columns = ["x", "y", "cos"]
|
280 |
+
|
281 |
+
_ = """
|
282 |
+
df_trimmed = pd.concat(
|
283 |
+
[
|
284 |
+
df1.iloc[:4, :],
|
285 |
+
pd.DataFrame(
|
286 |
+
[
|
287 |
+
[
|
288 |
+
"...",
|
289 |
+
"...",
|
290 |
+
]
|
291 |
+
],
|
292 |
+
columns=df1.columns,
|
293 |
+
),
|
294 |
+
df1.iloc[-4:, :],
|
295 |
+
],
|
296 |
+
ignore_index=1,
|
297 |
+
)
|
298 |
+
# """
|
299 |
+
|
300 |
+
# process list1, list2 to obtain df_aligned
|
301 |
+
# quick fix ValueError: not enough values to unpack (expected at least 1, got 0)
|
302 |
+
# fixed in gen_pset, but we leave the loop here
|
303 |
+
for min_s in range(min_samples):
|
304 |
+
logger.info(" min_samples, using %s", min_samples - min_s)
|
305 |
+
try:
|
306 |
+
pset = gen_pset(
|
307 |
+
cmat,
|
308 |
+
eps=eps,
|
309 |
+
min_samples=min_samples - min_s,
|
310 |
+
delta=7,
|
311 |
+
)
|
312 |
+
break
|
313 |
+
except ValueError:
|
314 |
+
logger.info(" decrease min_samples by %s", min_s + 1)
|
315 |
+
continue
|
316 |
+
except Exception as e:
|
317 |
+
logger.error(e)
|
318 |
+
continue
|
319 |
+
else:
|
320 |
+
# break should happen above when min_samples = 2
|
321 |
+
raise Exception("bummer, this shouldn't happen, probably another bug")
|
322 |
+
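The loop above retries with a progressively smaller `min_samples` and relies on Python's `for ... else` to detect that no attempt succeeded. A self-contained sketch of that pattern follows. `fit_with_relaxing` and `toy_fit` are hypothetical stand-ins, not radiobee functions.

```python
def fit_with_relaxing(fit, start: int):
    """Call fit(n) for n = start, start-1, ..., 1 and return the first success."""
    for step in range(start):
        n = start - step
        try:
            result = fit(n)
            break  # success: skip the else clause
        except ValueError:
            continue  # too strict, relax and retry
    else:
        # Runs only when the loop finished without a break.
        raise RuntimeError("no setting worked")
    return n, result


def toy_fit(n: int) -> str:
    """Hypothetical stand-in that only succeeds for n <= 3."""
    if n > 3:
        raise ValueError("not enough values")
    return f"fitted with {n}"
```

The `else` branch belongs to the `for`, not to the `try`, so it fires only when every attempt has failed (or when there were no attempts at all).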
|
323 |
+
min_samples = gen_pset.min_samples
|
324 |
+
|
325 |
+
# will result in error message:
|
326 |
+
# UserWarning: Starting a Matplotlib GUI outside of
|
327 |
+
# the main thread will likely fail."
|
328 |
+
_ = """
|
329 |
+
plot_cmat(
|
330 |
+
cmat,
|
331 |
+
eps=eps,
|
332 |
+
min_samples=min_samples,
|
333 |
+
xlabel=lang1,
|
334 |
+
ylabel=lang2,
|
335 |
+
)
|
336 |
+
# """
|
337 |
+
|
338 |
+
# move plot_cmat's code to the main thread here
|
339 |
+
# to make it work
|
340 |
+
xlabel = lang1
|
341 |
+
ylabel = lang2
|
342 |
+
|
343 |
+
len1, len2 = cmat.shape
|
344 |
+
ylim, xlim = len1, len2
|
345 |
+
|
346 |
+
# does not seem to show up
|
347 |
+
ic(f" len1 (ylim): {len1}, len2 (xlim): {len2}")
|
348 |
+
logger.debug(" len1 (ylim): %s, len2 (xlim): %s", len1, len2)
|
349 |
+
if debug:
|
350 |
+
print(f" len1 (ylim): {len1}, len2 (xlim): {len2}")
|
351 |
+
|
352 |
+
df_ = pd.DataFrame(cmat2tset(cmat))
|
353 |
+
df_.columns = ["x", "y", "cos"]
|
354 |
+
|
355 |
+
sns.set()
|
356 |
+
sns.set_style("darkgrid")
|
357 |
+
|
358 |
+
# close all existing figures, necessary for hf spaces
|
359 |
+
plt.close("all")
|
360 |
+
|
361 |
+
# if sys.platform not in ["win32", "linux"]:
|
362 |
+
# going for non-interactive
|
363 |
+
# to cater for Mac, thanks to WhiteFox
|
364 |
+
plt.switch_backend("Agg")
|
365 |
+
|
366 |
+
# figsize=(13, 8), (339, 212) mm on '1280x800+0+0'
|
367 |
+
fig = plt.figure(figsize=(13, 8))
|
368 |
+
|
369 |
+
# gs = fig.add_gridspec(2, 2, wspace=0.4, hspace=0.58)
|
370 |
+
gs = fig.add_gridspec(1, 2, wspace=0.4, hspace=0.58)
|
371 |
+
ax_heatmap = fig.add_subplot(gs[0, 0]) # ax2
|
372 |
+
ax0 = fig.add_subplot(gs[0, 1])
|
373 |
+
# ax1 = fig.add_subplot(gs[1, 0])
|
374 |
+
|
375 |
+
cmap = "viridis_r"
|
376 |
+
sns.heatmap(cmat, cmap=cmap, ax=ax_heatmap).invert_yaxis()
|
377 |
+
ax_heatmap.set_xlabel(xlabel)
|
378 |
+
ax_heatmap.set_ylabel(ylabel)
|
379 |
+
ax_heatmap.set_title("cos similarity heatmap")
|
380 |
+
|
381 |
+
fig.suptitle(f"alignment projection\n(eps={eps}, min_samples={min_samples})")
|
382 |
+
|
383 |
+
_ = DBSCAN(min_samples=min_samples, eps=eps).fit(df_).labels_ > -1
|
384 |
+
|
385 |
+
# _x = DBSCAN(min_samples=min_samples, eps=eps).fit(df_).labels_ < 0
|
386 |
+
_x = ~_
|
387 |
+
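DBSCAN labels noise points with -1, so `labels_ > -1` selects the clustered points and its complement selects the outliers. The sketch below shows the two masks on a made-up label list, in plain Python rather than with NumPy arrays.

```python
# Hypothetical cluster labels; DBSCAN marks noise points with -1.
labels = [0, 0, -1, 1, 1, -1, 0, 0]

clustered = [label > -1 for label in labels]  # True for points in a cluster
outliers = [not flag for flag in clustered]   # the complement

# Fraction of points that belong to some cluster.
share_clustered = sum(clustered) / len(labels)
```

A fraction of this kind is what the plot title below reports as a percentage of potential aligned pairs.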
|
388 |
+
# max cos along columns
|
389 |
+
df_.plot.scatter("x", "y", c="cos", cmap=cmap, ax=ax0)
|
390 |
+
|
391 |
+
# outliers
|
392 |
+
df_[_x].plot.scatter("x", "y", c="r", marker="x", alpha=0.6, ax=ax0)
|
393 |
+
ax0.set_xlabel(xlabel)
|
394 |
+
ax0.set_ylabel(ylabel)
|
395 |
+
ax0.set_xlim(xmin=0, xmax=xlim)
|
396 |
+
ax0.set_ylim(ymin=0, ymax=ylim)
|
397 |
+
ax0.set_title(
|
398 |
+
"max along columns (x: outliers)\n"
|
399 |
+
"potential aligned pairs (green line), "
|
400 |
+
f"{round(sum(_) / xlim, 2):.0%}"
|
401 |
+
)
|
402 |
+
|
403 |
+
plt_loc = "img/plt.png"
|
404 |
+
ic(f" plotting to {plt_loc}")
|
405 |
+
plt.savefig(plt_loc)
|
406 |
+
|
407 |
+
# clustered
|
408 |
+
# df_[_].plot.scatter("x", "y", c="cos", cmap=cmap, ax=ax1)
|
409 |
+
# ax1.set_xlabel(xlabel)
|
410 |
+
# ax1.set_ylabel(ylabel)
|
411 |
+
# ax1.set_xlim(0, len1)
|
412 |
+
# ax1.set_title(f"potential aligned pairs ({round(sum(_) / len1, 2):.0%})")
|
413 |
+
# end of plot_cmat
|
414 |
+
|
415 |
+
src_len, tgt_len = cmat.shape
|
416 |
+
aset = gen_aset(pset, src_len, tgt_len)
|
417 |
+
final_list = align_texts(aset, list2, list1) # note the order
|
418 |
+
|
419 |
+
# df_aligned
|
420 |
+
df_aligned = pd.DataFrame(final_list, columns=["text1", "text2", "likelihood"])
|
421 |
+
|
422 |
+
# swap text1 text2
|
423 |
+
df_aligned = df_aligned[["text2", "text1", "likelihood"]]
|
424 |
+
df_aligned.columns = ["text1", "text2", "likelihood"]
|
425 |
+
|
426 |
+
ic("paras aligned: ", df_aligned.head(10))
|
427 |
+
|
428 |
+
# round the last column to 2
|
429 |
+
# df_aligned.likelihood = df_aligned.likelihood.round(2)
|
430 |
+
# df_aligned = df_aligned.round({"likelihood": 2})
|
431 |
+
|
432 |
+
# df_aligned.likelihood = df_aligned.likelihood.apply(lambda x: np.nan if x in [""] else x)
|
433 |
+
|
434 |
+
if len(df_aligned) > 200:
|
435 |
+
df_html = None
|
436 |
+
else: # show a one-batch table in html
|
437 |
+
# style
|
438 |
+
styled = df_aligned.style.set_properties(
|
439 |
+
**{
|
440 |
+
"font-size": "10pt",
|
441 |
+
"border-color": "black",
|
442 |
+
"border": "1px black solid !important"
|
443 |
+
}
|
444 |
+
# border-color="black",
|
445 |
+
).set_table_styles([{
|
446 |
+
"selector": "", # noqs
|
447 |
+
"props": [("border", "2px black solid !important")]}] # noqs
|
448 |
+
).format(precision=2)
|
449 |
+
|
450 |
+
# .bar(subset="likelihood", color="#5fba7d")
|
451 |
+
|
452 |
+
# .background_gradient("Greys")
|
453 |
+
|
454 |
+
# df_html = df_aligned.to_html()
|
455 |
+
# df_html = styled.to_html()
|
456 |
+
df_html = styled.to_html()
|
457 |
+
|
458 |
+
# ===
|
459 |
+
if plot_dia:
|
460 |
+
output_plot = "img/plt.png"
|
461 |
+
else:
|
462 |
+
output_plot = None
|
463 |
+
|
464 |
+
_ = df_aligned.to_csv(index=False)
|
465 |
+
file_dl.write_text(_, encoding="utf8")
|
466 |
+
|
467 |
+
# file_dl.write_text(_, encoding="gb2312") # no go
|
468 |
+
df_aligned.to_excel(file_dl_xlsx)
|
469 |
+
|
470 |
+
# return df_trimmed, plt
|
471 |
+
|
472 |
+
# return df_trimmed, plt, file_dl, file_dl_xlsx, df_aligned
|
473 |
+
|
474 |
+
# output_plot: gr.outputs.Image(type="auto", label="...")
|
475 |
+
# return df_trimmed, output_plot, file_dl, file_dl_xlsx, df_aligned
|
476 |
+
# return df_trimmed, output_plot, file_dl, file_dl_xlsx, styled, df_html # gradio can't handle style
|
477 |
+
|
478 |
+
ic("sent-ali-algo: ", sent_ali_algo)
|
479 |
+
|
480 |
+
# ### sent-ali-algo is None: para align
|
481 |
+
if sent_ali_algo in ["None"]:
|
482 |
+
ic("returning para-ali outputs")
|
483 |
+
return df_trimmed, output_plot, file_dl, file_dl_xlsx, None, None, df_aligned, df_html
|
484 |
+
|
485 |
+
# ### proceed with sent align
|
486 |
+
if sent_ali_algo in ["fast"]:
|
487 |
+
ic(sent_ali_algo)
|
488 |
+
align_func = align_sents
|
489 |
+
|
490 |
+
ic(df_aligned.shape, df_aligned.columns)
|
491 |
+
|
492 |
+
aligned_sents = paras2sents(df_aligned, align_func)
|
493 |
+
|
494 |
+
# ic(pd.DataFrame(aligned_sents).shape, aligned_sents)
|
495 |
+
ic(pd.DataFrame(aligned_sents).shape)
|
496 |
+
|
497 |
+
df_aligned_sents = pd.DataFrame(aligned_sents, columns=["text1", "text2"])
|
498 |
+
else: # ["slow"]
|
499 |
+
ic(sent_ali_algo)
|
500 |
+
align_func = shuffle_sents
|
501 |
+
aligned_sents = paras2sents(df_aligned, align_func, lang1, lang2)
|
502 |
+
|
503 |
+
# add extra entry if necessary
|
504 |
+
aligned_sents = [list(sent) + [""] if len(sent) == 2 else list(sent) for sent in aligned_sents]
|
505 |
+
|
506 |
+
df_aligned_sents = pd.DataFrame(aligned_sents, columns=["text1", "text2", "likelihood"])
|
507 |
+
|
508 |
+
# prepare sents downloads
|
509 |
+
file_dl_sents = Path(f"{file_dl.stem}-sents{file_dl.suffix}")
|
510 |
+
file_dl_xlsx_sents = Path(f"{file_dl_xlsx.stem}-sents{file_dl_xlsx.suffix}")
|
511 |
+
_ = df_aligned_sents.to_csv(index=False)
|
512 |
+
file_dl_sents.write_text(_, encoding="utf8")
|
513 |
+
|
514 |
+
df_aligned_sents.to_excel(file_dl_xlsx_sents)
|
515 |
+
|
516 |
+
# prepare html output
|
517 |
+
if len(df_aligned_sents) > 200:
|
518 |
+
df_html = None
|
519 |
+
else: # show a one-batch table in html
|
520 |
+
# style
|
521 |
+
styled = df_aligned_sents.style.set_properties(
|
522 |
+
**{
|
523 |
+
"font-size": "10pt",
|
524 |
+
"border-color": "black",
|
525 |
+
"border": "1px black solid !important"
|
526 |
+
}
|
527 |
+
# border-color="black",
|
528 |
+
).set_table_styles([{
|
529 |
+
"selector": "", # noqs
|
530 |
+
"props": [("border", "2px black solid !important")]}] # noqs
|
531 |
+
).format(
|
532 |
+
precision=2
|
533 |
+
)
|
534 |
+
df_html = styled.to_html()
|
535 |
+
|
536 |
+
# aligned sents outputs
|
537 |
+
ic("aligned sents outputs")
|
538 |
+
|
539 |
+
# return df_trimmed, output_plot, file_dl, file_dl_xlsx, None, None, df_aligned, df_html
|
540 |
+
return df_trimmed, output_plot, file_dl, file_dl_xlsx, file_dl_sents, file_dl_xlsx_sents, df_aligned_sents, df_html
|
541 |
+
|
542 |
+
|
543 |
+
if __name__ == "__main__":
|
544 |
+
# typer.run(radiobee_cli)
|
545 |
+
app()
|
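Note on the sentence-alignment branch of radiobee/radiobee_cli.py above: the "slow" path pads two-item results with an empty string so every row has a likelihood column, and the "-sents" download names are built from the stem and suffix only. The sketch below reproduces both steps with the standard library. The helper names (pad_rows, sents_name, sents_name_keep_dir) are hypothetical and not part of radiobee, and pad_rows generalizes the diff's "len(sent) == 2" check to any short row. It also shows that the stem/suffix form drops the parent directory, which may or may not be intended here.

```python
from pathlib import Path


def pad_rows(rows, width=3, fill=""):
    """Pad each row with `fill` so that every row has at least `width` entries."""
    return [list(row) + [fill] * (width - len(row)) for row in rows]


def sents_name(path):
    """Build '<stem>-sents<suffix>' as in the diff; the parent directory is dropped."""
    path = Path(path)
    return Path(f"{path.stem}-sents{path.suffix}")


def sents_name_keep_dir(path):
    """Hypothetical variant: same file name, kept next to the original file."""
    path = Path(path)
    return path.with_name(f"{path.stem}-sents{path.suffix}")


# Rows as (text1, text2) or (text1, text2, likelihood)
rows = [("a", "b"), ("c", "d", 0.87)]
padded = pad_rows(rows)

plain = sents_name("out/aligned.csv")
beside = sents_name_keep_dir("out/aligned.csv")
```

If the download files are always written to the current working directory, the two naming variants give the same result and the form used in the diff is sufficient.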
radiobee/trim_df.py
CHANGED
@@ -14,12 +14,8 @@ def trim_df(
|
|
14 |
[
|
15 |
df1.iloc[:len_, :],
|
16 |
pd.DataFrame(
|
17 |
-
[
|
18 |
-
|
19 |
-
"...",
|
20 |
-
"...",
|
21 |
-
]
|
22 |
-
],
|
23 |
columns=df1.columns,
|
24 |
),
|
25 |
df1.iloc[-len_:, :],
|
|
|
14 |
[
|
15 |
df1.iloc[:len_, :],
|
16 |
pd.DataFrame(
|
17 |
+
# [["...", "...",]],
|
18 |
+
[["..."] * len(df1.columns)],
|
|
|
|
|
|
|
|
|
19 |
columns=df1.columns,
|
20 |
),
|
21 |
df1.iloc[-len_:, :],
|
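Note on the radiobee/trim_df.py hunk above: the placeholder row is now sized from len(df1.columns) instead of being hard-coded to two cells. Since the full body of trim_df is not shown in this diff, the following is only a minimal sketch of the idea. The function name trim_rows, its len_ default and the early return for short frames are assumptions made for illustration.

```python
import pandas as pd


def trim_rows(df, len_=2):
    """Keep the first and last `len_` rows, separated by one row of "..." cells."""
    if len(df) <= 2 * len_:
        return df  # nothing to trim (assumed behavior, not taken from the diff)
    # One "..." per column, so the filler row fits any number of columns
    filler = pd.DataFrame([["..."] * len(df.columns)], columns=df.columns)
    return pd.concat(
        [df.iloc[:len_, :], filler, df.iloc[-len_:, :]],
        ignore_index=True,
    )


# Three columns, as in the aligned output (text1, text2, likelihood)
df = pd.DataFrame(
    {
        "text1": [f"s{i}" for i in range(10)],
        "text2": [f"t{i}" for i in range(10)],
        "likelihood": [i / 10 for i in range(10)],
    }
)
trimmed = trim_rows(df, len_=2)
```

With the earlier two-cell row, the filler would not match a three-column frame such as the one with a likelihood column, which is presumably what motivated the change.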
requirements.txt
CHANGED
@@ -27,4 +27,7 @@ nltk
|
|
27 |
sentence_splitter
|
28 |
icecream
|
29 |
# lazy
|
30 |
-
alive-progress
|
|
|
|
|
|
|
|
27 |
sentence_splitter
|
28 |
icecream
|
29 |
# lazy
|
30 |
+
alive-progress
|
31 |
+
|
32 |
+
# cli
|
33 |
+
click
|