drewThomasson committed
Commit f045c49 · verified · 1 Parent(s): 30fa9fe

Upload 115 files

This view is limited to 50 files because the commit contains too many changes.
Files changed (50)
  1. nltk_data/tokenizers/punkt.zip +3 -0
  2. nltk_data/tokenizers/punkt/PY3/README +98 -0
  3. nltk_data/tokenizers/punkt/PY3/czech.pickle +3 -0
  4. nltk_data/tokenizers/punkt/PY3/danish.pickle +3 -0
  5. nltk_data/tokenizers/punkt/PY3/dutch.pickle +3 -0
  6. nltk_data/tokenizers/punkt/PY3/english.pickle +3 -0
  7. nltk_data/tokenizers/punkt/PY3/estonian.pickle +3 -0
  8. nltk_data/tokenizers/punkt/PY3/finnish.pickle +3 -0
  9. nltk_data/tokenizers/punkt/PY3/french.pickle +3 -0
  10. nltk_data/tokenizers/punkt/PY3/german.pickle +3 -0
  11. nltk_data/tokenizers/punkt/PY3/greek.pickle +3 -0
  12. nltk_data/tokenizers/punkt/PY3/italian.pickle +3 -0
  13. nltk_data/tokenizers/punkt/PY3/malayalam.pickle +3 -0
  14. nltk_data/tokenizers/punkt/PY3/norwegian.pickle +3 -0
  15. nltk_data/tokenizers/punkt/PY3/polish.pickle +3 -0
  16. nltk_data/tokenizers/punkt/PY3/portuguese.pickle +3 -0
  17. nltk_data/tokenizers/punkt/PY3/russian.pickle +3 -0
  18. nltk_data/tokenizers/punkt/PY3/slovene.pickle +3 -0
  19. nltk_data/tokenizers/punkt/PY3/spanish.pickle +3 -0
  20. nltk_data/tokenizers/punkt/PY3/swedish.pickle +3 -0
  21. nltk_data/tokenizers/punkt/PY3/turkish.pickle +3 -0
  22. nltk_data/tokenizers/punkt/README +98 -0
  23. nltk_data/tokenizers/punkt/czech.pickle +3 -0
  24. nltk_data/tokenizers/punkt/danish.pickle +3 -0
  25. nltk_data/tokenizers/punkt/dutch.pickle +3 -0
  26. nltk_data/tokenizers/punkt/english.pickle +3 -0
  27. nltk_data/tokenizers/punkt/estonian.pickle +3 -0
  28. nltk_data/tokenizers/punkt/finnish.pickle +3 -0
  29. nltk_data/tokenizers/punkt/french.pickle +3 -0
  30. nltk_data/tokenizers/punkt/german.pickle +3 -0
  31. nltk_data/tokenizers/punkt/greek.pickle +3 -0
  32. nltk_data/tokenizers/punkt/italian.pickle +3 -0
  33. nltk_data/tokenizers/punkt/malayalam.pickle +3 -0
  34. nltk_data/tokenizers/punkt/norwegian.pickle +3 -0
  35. nltk_data/tokenizers/punkt/polish.pickle +3 -0
  36. nltk_data/tokenizers/punkt/portuguese.pickle +3 -0
  37. nltk_data/tokenizers/punkt/russian.pickle +3 -0
  38. nltk_data/tokenizers/punkt/slovene.pickle +3 -0
  39. nltk_data/tokenizers/punkt/spanish.pickle +3 -0
  40. nltk_data/tokenizers/punkt/swedish.pickle +3 -0
  41. nltk_data/tokenizers/punkt/turkish.pickle +3 -0
  42. nltk_data/tokenizers/punkt_tab.zip +3 -0
  43. nltk_data/tokenizers/punkt_tab/README +98 -0
  44. nltk_data/tokenizers/punkt_tab/czech/abbrev_types.txt +118 -0
  45. nltk_data/tokenizers/punkt_tab/czech/collocations.tab +96 -0
  46. nltk_data/tokenizers/punkt_tab/czech/ortho_context.tab +0 -0
  47. nltk_data/tokenizers/punkt_tab/czech/sent_starters.txt +54 -0
  48. nltk_data/tokenizers/punkt_tab/danish/abbrev_types.txt +211 -0
  49. nltk_data/tokenizers/punkt_tab/danish/collocations.tab +101 -0
  50. nltk_data/tokenizers/punkt_tab/danish/ortho_context.tab +0 -0
nltk_data/tokenizers/punkt.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51c3078994aeaf650bfc8e028be4fb42b4a0d177d41c012b6a983979653660ec
+ size 13905355
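Each binary in this commit is stored as a Git LFS pointer: a short text file giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer, assuming the `key value` layout shown in the hunk above (the function name `parse_lfs_pointer` is hypothetical and not part of any Git LFS tooling):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:51c3078994aeaf650bfc8e028be4fb42b4a0d177d41c012b6a983979653660ec
size 13905355
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```

The `size` field is the byte length of the real object, so the pointer above stands in for an archive of roughly 13.9 MB.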
nltk_data/tokenizers/punkt/PY3/README ADDED
@@ -0,0 +1,98 @@
+ Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
+
+ Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
+ been contributed by various people using NLTK for sentence boundary detection.
+
+ For information about how to use these models, please confer the tokenization HOWTO:
+ http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
+ and chapter 3.8 of the NLTK book:
+ http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
+
+ There are pretrained tokenizers for the following languages:
+
+ File                Language        Source                              Contents                      Size of training corpus(in tokens)    Model contributed by
+ =======================================================================================================================================================================
+ czech.pickle        Czech           Multilingual Corpus 1 (ECI)         Lidove Noviny                 ~345,000                              Jan Strunk / Tibor Kiss
+                                                                         Literarni Noviny
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ danish.pickle       Danish          Avisdata CD-Rom Ver. 1.1. 1995      Berlingske Tidende            ~550,000                              Jan Strunk / Tibor Kiss
+                                     (Berlingske Avisdata, Copenhagen)   Weekend Avisen
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ dutch.pickle        Dutch           Multilingual Corpus 1 (ECI)         De Limburger                  ~340,000                              Jan Strunk / Tibor Kiss
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ english.pickle      English         Penn Treebank (LDC)                 Wall Street Journal           ~469,000                              Jan Strunk / Tibor Kiss
+                     (American)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ estonian.pickle     Estonian        University of Tartu, Estonia        Eesti Ekspress                ~359,000                              Jan Strunk / Tibor Kiss
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ finnish.pickle      Finnish         Finnish Parole Corpus, Finnish      Books and major national      ~364,000                              Jan Strunk / Tibor Kiss
+                                     Text Bank (Suomen Kielen            newspapers
+                                     Tekstipankki)
+                                     Finnish Center for IT Science
+                                     (CSC)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ french.pickle       French          Multilingual Corpus 1 (ECI)         Le Monde                      ~370,000                              Jan Strunk / Tibor Kiss
+                     (European)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ german.pickle       German          Neue Zürcher Zeitung AG             Neue Zürcher Zeitung          ~847,000                              Jan Strunk / Tibor Kiss
+                     (Switzerland)   CD-ROM
+                                     (Uses "ss"
+                                     instead of "ß")
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ greek.pickle        Greek           Efstathios Stamatatos               To Vima (TO BHMA)             ~227,000                              Jan Strunk / Tibor Kiss
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ italian.pickle      Italian         Multilingual Corpus 1 (ECI)         La Stampa, Il Mattino         ~312,000                              Jan Strunk / Tibor Kiss
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ norwegian.pickle    Norwegian       Centre for Humanities               Bergens Tidende               ~479,000                              Jan Strunk / Tibor Kiss
+                     (Bokmål and     Information Technologies,
+                     Nynorsk)        Bergen
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ polish.pickle       Polish          Polish National Corpus              Literature, newspapers, etc.  ~1,000,000                            Krzysztof Langner
+                                     (http://www.nkjp.pl/)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ portuguese.pickle   Portuguese      CETENFolha Corpus                   Folha de São Paulo            ~321,000                              Jan Strunk / Tibor Kiss
+                     (Brazilian)     (Linguateca)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ slovene.pickle      Slovene         TRACTOR                             Delo                          ~354,000                              Jan Strunk / Tibor Kiss
+                                     Slovene Academy for Arts
+                                     and Sciences
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ spanish.pickle      Spanish         Multilingual Corpus 1 (ECI)         Sur                           ~353,000                              Jan Strunk / Tibor Kiss
+                     (European)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ swedish.pickle      Swedish         Multilingual Corpus 1 (ECI)         Dagens Nyheter                ~339,000                              Jan Strunk / Tibor Kiss
+                                                                         (and some other texts)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ turkish.pickle      Turkish         METU Turkish Corpus                 Milliyet                      ~333,000                              Jan Strunk / Tibor Kiss
+                                     (Türkçe Derlem Projesi)
+                                     University of Ankara
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
+ The corpora contained about 400,000 tokens on average and mostly consisted of newspaper text converted to
+ Unicode using the codecs module.
+
+ Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+ Computational Linguistics 32: 485-525.
+
+ ---- Training Code ----
+
+ # import punkt
+ import nltk.tokenize.punkt
+
+ # Make a new Tokenizer
+ tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
+
+ # Read in training corpus (one example: Slovene)
+ import codecs
+ text = codecs.open("slovene.plain","Ur","iso-8859-2").read()
+
+ # Train tokenizer
+ tokenizer.train(text)
+
+ # Dump pickled tokenizer
+ import pickle
+ out = open("slovene.pickle","wb")
+ pickle.dump(tokenizer, out)
+ out.close()
+
+ ---------
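The README's "about 400,000 tokens on average" can be checked against its own table. The sketch below uses the approximate sizes listed in the table; the README does not say which rows the average covers, so leaving out the separately contributed Polish model is an assumption made here for illustration.

```python
# Approximate training-corpus sizes (in tokens), copied from the README table.
sizes = {
    "czech": 345_000, "danish": 550_000, "dutch": 340_000,
    "english": 469_000, "estonian": 359_000, "finnish": 364_000,
    "french": 370_000, "german": 847_000, "greek": 227_000,
    "italian": 312_000, "norwegian": 479_000, "polish": 1_000_000,
    "portuguese": 321_000, "slovene": 354_000, "spanish": 353_000,
    "swedish": 339_000, "turkish": 333_000,
}

mean_all = sum(sizes.values()) / len(sizes)

# Assumption: the README's figure refers to the Strunk/Kiss models only,
# so the Polish model contributed by Krzysztof Langner is excluded.
strunk_kiss = {name: n for name, n in sizes.items() if name != "polish"}
mean_strunk_kiss = sum(strunk_kiss.values()) / len(strunk_kiss)

print(f"all {len(sizes)} models: {mean_all:,.0f}")
print(f"{len(strunk_kiss)} Strunk/Kiss models: {mean_strunk_kiss:,.0f}")
```

Under that assumption the mean is roughly 398,000 tokens, consistent with the README's figure; including the Polish corpus raises it to roughly 433,000.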
nltk_data/tokenizers/punkt/PY3/czech.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:64b0734b6fbe8e8d7cac79f48d1dd9f853824e57c4e3594dadd74ba2c1d97f50
+ size 1119050
nltk_data/tokenizers/punkt/PY3/danish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6189c7dd254e29e2bd406a7f6a4336297c8953214792466a790ea4444223ceb3
+ size 1191710
nltk_data/tokenizers/punkt/PY3/dutch.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fda0d6a13f02e8898daec7fe923da88e25abe081bcfa755c0e015075c215fe4c
+ size 693759
nltk_data/tokenizers/punkt/PY3/english.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5cad3758596392364e3be9803dbd7ebeda384b68937b488a01365f5551bb942c
+ size 406697
nltk_data/tokenizers/punkt/PY3/estonian.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b364f72538d17b146a98009ad239a8096ce6c0a8b02958c0bc776ecd0c58a25f
+ size 1499502
nltk_data/tokenizers/punkt/PY3/finnish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6a4b5ff5500ee851c456f9dd40d5fc0d8c1859c88eb3178de1317d26b7d22833
+ size 1852226
nltk_data/tokenizers/punkt/PY3/french.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:28e3a4cd2971989b3cb9fd3433a6f15d17981e464db2be039364313b5de94f29
+ size 553575
nltk_data/tokenizers/punkt/PY3/german.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ddcbbe85e2042a019b1a6e37fd8c153286c38ba201fae0f5bfd9a3f74abae25c
+ size 1463575
nltk_data/tokenizers/punkt/PY3/greek.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:85dabc44ab90a5f208ef37ff6b4892ebe7e740f71fb4da47cfd95417ca3e22fd
+ size 876006
nltk_data/tokenizers/punkt/PY3/italian.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:68a94007b1e4ffdc4d1a190185ca5442c3dafeb17ab39d30329e84cd74a43947
+ size 615089
nltk_data/tokenizers/punkt/PY3/malayalam.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1f8cf58acbdb7f472ac40affc13663be42dafb47c15030c11ade0444c9e0e53d
+ size 221207
nltk_data/tokenizers/punkt/PY3/norwegian.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4ff7a46d1438b311457d15d7763060b8d3270852c1850fd788c5cee194dc4a1d
+ size 1181271
nltk_data/tokenizers/punkt/PY3/polish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:624900ae3ddfb4854a98c5d3b8b1c9bb719975f33fee61ce1441dab9f8a00718
+ size 1738386
nltk_data/tokenizers/punkt/PY3/portuguese.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:02a0b7b25c3c7471e1791b66a31bbb530afbb0160aee4fcecf0107652067b4a1
+ size 611919
nltk_data/tokenizers/punkt/PY3/russian.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:549762f8190024d89b511472df21a3a135eee5d9233e63ac244db737c2c61d7e
+ size 33020
nltk_data/tokenizers/punkt/PY3/slovene.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:52ef2cc0ed27d79b3aa635cbbc40ad811883a75a4b8a8be1ae406972870fd864
+ size 734444
nltk_data/tokenizers/punkt/PY3/spanish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:164a50fadc5a49f8ec7426eae11d3111ee752b48a3ef373d47745011192a5984
+ size 562337
nltk_data/tokenizers/punkt/PY3/swedish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0f7d538bfd5266633b09e842cd92e9e0ac10f1d923bf211e1497972ddc47318
+ size 979681
nltk_data/tokenizers/punkt/PY3/turkish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ae68ef5863728ac5332e87eb1f6bae772ff32a13a4caa2b01a5c68103e853c5b
+ size 1017038
nltk_data/tokenizers/punkt/README ADDED
@@ -0,0 +1,98 @@
(98 added lines, identical to nltk_data/tokenizers/punkt/PY3/README above)
nltk_data/tokenizers/punkt/czech.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5ba73d293c7d7953956bcf02f3695ec5c1f0d527f2a3c38097f5593394fa1690
+ size 1265552
nltk_data/tokenizers/punkt/danish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ea29760a0a9197f52ca59e78aeafc5a6f55d05258faf7db1709b2b9eb321ef20
+ size 1264725
nltk_data/tokenizers/punkt/dutch.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a8e26b3d68c45c38e594d19e2d5677447bfdcaa636d3b1e7acfed0e9272d73c
+ size 742624
nltk_data/tokenizers/punkt/english.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dda37972ae88998a6fd3e3ec002697a6bd362b32d050fda7d7ca5276873092aa
+ size 433305
nltk_data/tokenizers/punkt/estonian.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3867fee26a36bdb197c64362aa13ac683f5f33fa4d0d225a5d56707582a55a1d
+ size 1596714
nltk_data/tokenizers/punkt/finnish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1a9e17b3d5b4df76345d812b8a65b1da0767eda5086eadcc11e625eef0942835
+ size 1951656
nltk_data/tokenizers/punkt/french.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:de05f3d5647d3d2296626fb83f68428e4c6ad6e05a00ed4694c8bdc8f2f197ee
+ size 583482
nltk_data/tokenizers/punkt/german.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eab497fa085413130c8fd0fb13b929128930afe2f6a26ea8715c95df7088e97c
+ size 1526714
nltk_data/tokenizers/punkt/greek.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:21752a6762fad5cfe46fb5c45fad9a85484a0e8e81c67e6af6fb973cfc27d67c
+ size 1953106
nltk_data/tokenizers/punkt/italian.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dcb2717d7be5f26e860a92e05acf69b1123a5f4527cd7a269a9ab9e9e668c805
+ size 658331
nltk_data/tokenizers/punkt/malayalam.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1f8cf58acbdb7f472ac40affc13663be42dafb47c15030c11ade0444c9e0e53d
+ size 221207
nltk_data/tokenizers/punkt/norwegian.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e4a97f8f9a03a0338dd746bcc89a0ae0f54ae43b835fa37d83e279e1ca794faf
+ size 1259779
nltk_data/tokenizers/punkt/polish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:16127b6d10933427a3e90fb20e9be53e1fb371ff79a730c1030734ed80b90c92
+ size 2042451
nltk_data/tokenizers/punkt/portuguese.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bb01bf7c79a4eadc2178bbd209665139a0e4b38f2d1c44fef097de93955140e0
+ size 649051
nltk_data/tokenizers/punkt/russian.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bc984432fbe31f7000014f8047502476889169c60f09be5413ca09276b16c909
+ size 33027
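Because an LFS object ID is the SHA-256 of the file contents, two pointers with the same `oid` refer to identical bytes. In this commit the Malayalam model has the same `oid` and size under `punkt/` and `punkt/PY3/`, whereas the two Russian models differ. A sketch that groups paths by object ID follows; the `pointers` mapping is hand-copied from the hunks on this page, paths are shortened relative to `nltk_data/tokenizers/`, and `group_by_oid` is a hypothetical helper.

```python
from collections import defaultdict

# (oid, size) per path, copied from the LFS pointers shown in this diff.
pointers = {
    "punkt/PY3/malayalam.pickle":
        ("1f8cf58acbdb7f472ac40affc13663be42dafb47c15030c11ade0444c9e0e53d", 221207),
    "punkt/malayalam.pickle":
        ("1f8cf58acbdb7f472ac40affc13663be42dafb47c15030c11ade0444c9e0e53d", 221207),
    "punkt/PY3/russian.pickle":
        ("549762f8190024d89b511472df21a3a135eee5d9233e63ac244db737c2c61d7e", 33020),
    "punkt/russian.pickle":
        ("bc984432fbe31f7000014f8047502476889169c60f09be5413ca09276b16c909", 33027),
}


def group_by_oid(entries):
    """Return {oid: [paths]} for object IDs shared by more than one path."""
    groups = defaultdict(list)
    for path, (oid, _size) in entries.items():
        groups[oid].append(path)
    return {oid: sorted(paths) for oid, paths in groups.items() if len(paths) > 1}


duplicates = group_by_oid(pointers)
for oid, paths in duplicates.items():
    print(oid[:12], paths)
```

Only the Malayalam pair is reported, which suggests that file was uploaded unchanged to both directories while the other models were stored separately for each layout.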
nltk_data/tokenizers/punkt/slovene.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7dac650212b3787b39996c01bd2084115493e6f6ec390bab61f767525b08b8ea
+ size 832867
nltk_data/tokenizers/punkt/spanish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:271dc6027c4aae056f72a9bfab5645cf67e198bf4f972895844e40f5989ccdc3
+ size 597831
nltk_data/tokenizers/punkt/swedish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:40d50ebdad6caa87715f2e300b1217ec92c42de205a543cc4a56903bd2c9acfa
+ size 1034496
nltk_data/tokenizers/punkt/turkish.pickle ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d3ae47d76501d027698809d12e75292c9c392910488543342802f95db9765ccc
+ size 1225013
nltk_data/tokenizers/punkt_tab.zip ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c2b16c23d738effbdc5789d7aa601397c13ba2819bf922fb904687f3f16657ed
+ size 4259017
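Once the real object has been fetched, it can be checked against its pointer by comparing the byte length and the SHA-256 digest. A minimal sketch using only the standard library follows; `verify_lfs_object` is a hypothetical helper, and the demo hashes a throwaway temporary file rather than the actual `punkt_tab.zip`.

```python
import hashlib
import os
import tempfile


def verify_lfs_object(path: str, expected_sha256: str, expected_size: int) -> bool:
    """Return True if the file's size and SHA-256 match the pointer values."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


# Demo on a temporary file standing in for a downloaded object.
payload = b"example payload"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

ok = verify_lfs_object(demo_path, hashlib.sha256(payload).hexdigest(), len(payload))
bad = verify_lfs_object(demo_path, "0" * 64, len(payload))
os.remove(demo_path)
print(ok, bad)
```

For the archive above, the expected values would be the hex digest after `sha256:` in the `oid` line and the `size` of 4259017 bytes.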
nltk_data/tokenizers/punkt_tab/README ADDED
@@ -0,0 +1,98 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Pretrained Punkt Models -- Jan Strunk (New version trained after issues 313 and 514 had been corrected)
2
+
3
+ Most models were prepared using the test corpora from Kiss and Strunk (2006). Additional models have
4
+ been contributed by various people using NLTK for sentence boundary detection.
5
+
6
+ For information about how to use these models, please confer the tokenization HOWTO:
7
+ http://nltk.googlecode.com/svn/trunk/doc/howto/tokenize.html
8
+ and chapter 3.8 of the NLTK book:
9
+ http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html#sec-segmentation
10
+
11
+ There are pretrained tokenizers for the following languages:
12
+
13
+ File Language Source Contents Size of training corpus(in tokens) Model contributed by
14
+ =======================================================================================================================================================================
15
+ czech.pickle Czech Multilingual Corpus 1 (ECI) Lidove Noviny ~345,000 Jan Strunk / Tibor Kiss
16
+ Literarni Noviny
17
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
18
+ danish.pickle Danish Avisdata CD-Rom Ver. 1.1. 1995 Berlingske Tidende ~550,000 Jan Strunk / Tibor Kiss
19
+ (Berlingske Avisdata, Copenhagen) Weekend Avisen
20
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
21
+ dutch.pickle Dutch Multilingual Corpus 1 (ECI) De Limburger ~340,000 Jan Strunk / Tibor Kiss
22
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
23
+ english.pickle English Penn Treebank (LDC) Wall Street Journal ~469,000 Jan Strunk / Tibor Kiss
24
+ (American)
25
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
26
+ estonian.pickle Estonian University of Tartu, Estonia Eesti Ekspress ~359,000 Jan Strunk / Tibor Kiss
27
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
28
+ finnish.pickle Finnish Finnish Parole Corpus, Finnish Books and major national ~364,000 Jan Strunk / Tibor Kiss
29
+ Text Bank (Suomen Kielen newspapers
30
+ Tekstipankki)
31
+ Finnish Center for IT Science
32
+ (CSC)
33
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
34
+ french.pickle French Multilingual Corpus 1 (ECI) Le Monde ~370,000 Jan Strunk / Tibor Kiss
35
+ (European)
36
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
37
+ german.pickle German Neue Zürcher Zeitung AG Neue Zürcher Zeitung ~847,000 Jan Strunk / Tibor Kiss
38
+ (Switzerland) CD-ROM
39
+ (Uses "ss"
40
+ instead of "ß")
41
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
42
+ greek.pickle Greek Efstathios Stamatatos To Vima (TO BHMA) ~227,000 Jan Strunk / Tibor Kiss
43
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ italian.pickle      Italian             Multilingual Corpus 1 (ECI)        La Stampa, Il Mattino           ~312,000                             Jan Strunk / Tibor Kiss
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ norwegian.pickle    Norwegian           Centre for Humanities              Bergens Tidende                 ~479,000                             Jan Strunk / Tibor Kiss
+                     (Bokmål and         Information Technologies,
+                     Nynorsk)            Bergen
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ polish.pickle       Polish              Polish National Corpus             Literature, newspapers, etc.    ~1,000,000                           Krzysztof Langner
+                                         (http://www.nkjp.pl/)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ portuguese.pickle   Portuguese          CETENFolha Corpus                  Folha de São Paulo              ~321,000                             Jan Strunk / Tibor Kiss
+                     (Brazilian)         (Linguateca)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ slovene.pickle      Slovene             TRACTOR                            Delo                            ~354,000                             Jan Strunk / Tibor Kiss
+                                         Slovene Academy for Arts
+                                         and Sciences
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ spanish.pickle      Spanish             Multilingual Corpus 1 (ECI)        Sur                             ~353,000                             Jan Strunk / Tibor Kiss
+                     (European)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ swedish.pickle      Swedish             Multilingual Corpus 1 (ECI)        Dagens Nyheter                  ~339,000                             Jan Strunk / Tibor Kiss
+                                                                            (and some other texts)
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+ turkish.pickle      Turkish             METU Turkish Corpus                Milliyet                        ~333,000                             Jan Strunk / Tibor Kiss
+                                         (Türkçe Derlem Projesi)
+                                         University of Ankara
+ -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
+
+ The corpora contained about 400,000 tokens each on average and consisted mostly of newspaper text,
+ converted to Unicode with Python's codecs module.
+
+ Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
+ Computational Linguistics 32: 485-525.
+
+ ---- Training Code ----
+
+ # Import punkt and the standard-library modules used below
+ import codecs
+ import pickle
+ import nltk.tokenize.punkt
+
+ # Make a new Tokenizer
+ tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
+
+ # Read in training corpus (one example: Slovene), decoding ISO-8859-2 to Unicode
+ with codecs.open("slovene.plain", "r", "iso-8859-2") as corpus:
+     text = corpus.read()
+
+ # Train tokenizer
+ tokenizer.train(text)
+
+ # Dump pickled tokenizer; the with-block closes the file even if dumping fails
+ with open("slovene.pickle", "wb") as out:
+     pickle.dump(tokenizer, out)
+
+ ---------
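The README above says the training corpora were converted to Unicode with the codecs module. As a minimal sketch of that decoding step, assuming Python 3, the built-in `open` with an `encoding` argument performs the same ISO-8859-2 decoding. The helper name `read_corpus`, the file name `sample.plain`, and the sample sentence are illustrative stand-ins and do not come from the README.

```python
import os
import tempfile


def read_corpus(path, encoding):
    """Return the text of `path`, decoded from the given legacy encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Hypothetical sample standing in for a real training corpus.
sample = "Živjo. To je kratek stavek."

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.plain")
    # Write the sample as raw ISO-8859-2 bytes, the way a legacy corpus is stored.
    with open(path, "wb") as raw:
        raw.write(sample.encode("iso-8859-2"))
    # Decode the bytes back into a Python str (Unicode).
    decoded = read_corpus(path, "iso-8859-2")

print(decoded == sample)
```

The decoded `str` is what would then be passed to `tokenizer.train(text)` in the training code.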
nltk_data/tokenizers/punkt_tab/czech/abbrev_types.txt ADDED
@@ -0,0 +1,118 @@
1
+ t
2
+ množ
3
+ např
4
+ j.h
5
+ man
6
+ ú
7
+ jug
8
+ dr
9
+ bl
10
+ ml
11
+ okr
12
+ st
13
+ uh
14
+ šp
15
+ judr
16
+ u.s.a
17
+ p
18
+ arg
19
+ žitě
20
+ st.celsia
21
+ etc
22
+ p.s
23
+ t.r
24
+ lok
25
+ mil
26
+ ict
27
+ n
28
+ tl
29
+ min
30
+ č
31
+ d
32
+ al
33
+ ravenně
34
+ mj
35
+ nar
36
+ plk
37
+ s.p
38
+ a.g
39
+ roč
40
+ b
41
+ zdi
42
+ r.s.c
43
+ přek
44
+ m
45
+ gen
46
+ csc
47
+ mudr
48
+ vic
49
+ š
50
+ sb
51
+ resp
52
+ tzn
53
+ iv
54
+ s.r.o
55
+ mar
56
+ w
57
+ čs
58
+ vi
59
+ tzv
60
+ ul
61
+ pen
62
+ zv
63
+ str
64
+ čp
65
+ org
66
+ rak
67
+ sv
68
+ pplk
69
+ u.s
70
+ prof
71
+ c.k
72
+ op
73
+ g
74
+ vii
75
+ kr
76
+ ing
77
+ j.o
78
+ drsc
79
+ m3
80
+ l
81
+ tr
82
+ ceo
83
+ ch
84
+ fuk
85
+ vl
86
+ viii
87
+ líp
88
+ hl.m
89
+ t.zv
90
+ phdr
91
+ o.k
92
+ tis
93
+ doc
94
+ kl
95
+ ard
96
+ čkd
97
+ pok
98
+ apod
99
+ r
100
+
101
+ a.s
102
+ j
103
+ jr
104
+ i.m
105
+ e
106
+ kupř
107
+ f
108
+
109
+ xvi
110
+ mir
111
+ atď
112
+ vr
113
+ r.i.v
114
+ hl
115
+ kv
116
+ t.j
117
+ y
118
+ q.p.r
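The entries above appear to be lowercase abbreviation types stored without their final period. A minimal sketch of how such a list might be consulted, assuming that convention holds; the function names and the three-entry sample are hypothetical, and this is not NLTK's own lookup code.

```python
def load_abbrev_types(lines):
    """Collect non-empty entries, one abbreviation type per line."""
    return {line.strip() for line in lines if line.strip()}


def is_known_abbreviation(token, abbrev_types):
    """Check a token such as 'Např.' against the lowercase, period-less set."""
    if not token.endswith("."):
        return False
    # Drop the final period and lowercase before the lookup (assumed convention).
    return token[:-1].lower() in abbrev_types


# Three entries copied from the Czech list above.
abbrevs = load_abbrev_types(["např\n", "s.r.o\n", "prof\n"])

print(is_known_abbreviation("Např.", abbrevs))
print(is_known_abbreviation("S.r.o.", abbrevs))
print(is_known_abbreviation("Praha.", abbrevs))
```

A period after a token found in the set would be treated as part of the abbreviation and not as a sentence boundary.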
nltk_data/tokenizers/punkt_tab/czech/collocations.tab ADDED
@@ -0,0 +1,96 @@
1
+ i dejmala
2
+ ##number## prosince
3
+ h steina
4
+ ##number## listopadu
5
+ a dvořák
6
+ v klaus
7
+ i čnhl
8
+ ##number## wladyslawowo
9
+ ##number## letech
10
+ a jiráska
11
+ a dubček
12
+ ##number## štrasburk
13
+ ##number## juniorské
14
+ ##number## století
15
+ ##number## kola
16
+ ##number## pád
17
+ ##number## května
18
+ ##number## týdne
19
+ v dlouhý
20
+ k design
21
+ ##number## červenec
22
+ i ligy
23
+ ##number## kolo
24
+ z svěrák
25
+ ##number## mája
26
+ ##number## šimková
27
+ a bělého
28
+ a bradáč
29
+ ##number## ročníku
30
+ ##number## dubna
31
+ a vivaldiho
32
+ v mečiara
33
+ c carrićre
34
+ ##number## sjezd
35
+ ##number## výroční
36
+ ##number## kole
37
+ ##number## narozenin
38
+ k maleevová
39
+ i čnfl
40
+ ##number## pádě
41
+ ##number## září
42
+ ##number## výročí
43
+ a dvořáka
44
+ h g.
45
+ ##number## ledna
46
+ a dvorský
47
+ h měsíc
48
+ ##number## srpna
49
+ ##number## tř.
50
+ a mozarta
51
+ ##number## sudetoněmeckých
52
+ o sokolov
53
+ k škrach
54
+ v benda
55
+ ##number## symfonie
56
+ ##number## července
57
+ x šalda
58
+ c abrahama
59
+ a tichý
60
+ ##number## místo
61
+ k bielecki
62
+ v havel
63
+ ##number## etapu
64
+ a dubčeka
65
+ i liga
66
+ ##number## světový
67
+ v klausem
68
+ ##number## ženy
69
+ ##number## létech
70
+ ##number## minutě
71
+ ##number## listopadem
72
+ ##number## místě
73
+ o vlček
74
+ k peteraje
75
+ i sponzor
76
+ ##number## června
77
+ ##number## min.
78
+ ##number## oprávněnou
79
+ ##number## květnu
80
+ ##number## aktu
81
+ ##number## květnem
82
+ ##number## října
83
+ i rynda
84
+ ##number## února
85
+ i snfl
86
+ a mozart
87
+ z košler
88
+ a dvorskému
89
+ v marhoul
90
+ v mečiar
91
+ ##number## ročník
92
+ ##number## máje
93
+ v havla
94
+ k gott
95
+ s bacha
96
+ ##number## ad
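Each row above pairs two word types, and `##number##` appears to stand in for any numeric token. A minimal sketch of reading such rows into pairs, assuming two whitespace-separated fields per row; the helper name is hypothetical, and the raw file's exact delimiter is not visible in this view.

```python
def parse_collocations(lines):
    """Return a set of (first, second) word-type pairs, skipping malformed rows."""
    pairs = set()
    for line in lines:
        fields = line.split()
        # Keep only rows with exactly two fields; blank or odd rows are ignored.
        if len(fields) == 2:
            pairs.add((fields[0], fields[1]))
    return pairs


# Rows copied from the Czech collocations listing above, plus one blank row.
rows = ["##number## prosince", "v klaus", "h g.", ""]
pairs = parse_collocations(rows)

print(("##number##", "prosince") in pairs)
print(len(pairs))
```

Storing the pairs in a set keeps the membership test cheap when checking whether a period between two tokens falls inside a known collocation.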
nltk_data/tokenizers/punkt_tab/czech/ortho_context.tab ADDED
The diff for this file is too large to render. See raw diff
 
nltk_data/tokenizers/punkt_tab/czech/sent_starters.txt ADDED
@@ -0,0 +1,54 @@
1
+
2
+ milena
3
+ tomáš
4
+ oznámila
5
+ podle
6
+ my
7
+ vyplývá
8
+ hlavní
9
+ jelikož
10
+ musíme
11
+ kdyby
12
+ foto
13
+ rozptylové
14
+ snad
15
+ zároveň
16
+ jaroslav
17
+ po
18
+ v
19
+ kromě
20
+ pokud
21
+ toto
22
+ jenže
23
+ oba
24
+ jak
25
+ zatímco
26
+ ten
27
+ myslím
28
+ navíc
29
+ dušan
30
+ zdá
31
+ dnes
32
+ přesto
33
+ tato
34
+ ti
35
+ bratislava
36
+ ale
37
+ když
38
+ nicméně
39
+ tento
40
+ mirka
41
+ přitom
42
+ dokud
43
+ jan
44
+ bohužel
45
+ ta
46
+ díky
47
+ prohlásil
48
+ praha
49
+ jestliže
50
+ jde
51
+ vždyť
52
+ moskva
53
+ proto
54
+ to
nltk_data/tokenizers/punkt_tab/danish/abbrev_types.txt ADDED
@@ -0,0 +1,211 @@
1
+ t
2
+ tlf
3
+ b.p
4
+ evt
5
+ j.h
6
+ lenz
7
+ mht
8
+ gl
9
+ bl
10
+ stud.polit
11
+ e.j
12
+ st
13
+ o
14
+ dec
15
+ mag
16
+ h.b
17
+ p
18
+ adm
19
+ el.lign
20
+ e.s
21
+ saalba
22
+ styrt
23
+ nr
24
+ m.a.s.h
25
+ etc
26
+ pharm
27
+ hg
28
+ j.j
29
+ dj
30
+ mountainb
31
+ f.kr
32
+ h.r
33
+ cand.jur
34
+ sp
35
+ osv
36
+ s.g
37
+ ndr
38
+ inc
39
+ b.i.g
40
+ dk-sver
41
+ sl
42
+ v.s.o.d
43
+ cand.mag
44
+ d.v.s
45
+ v.i
46
+ bøddel
47
+ fr
48
+ ø«
49
+ dr.phil
50
+ chr
51
+ p.d
52
+ bj
53
+ fhv
54
+ tilskudsforhold
55
+ m.a
56
+ sek
57
+ p.g.a
58
+ int
59
+ pokalf
60
+ ik
61
+ dir
62
+ em-lodtrækn
63
+ a.h
64
+ o.lign
65
+ p.t
66
+ m.v
67
+ n.j
68
+ m.h.t
69
+ m.m
70
+ a.p
71
+ pers
72
+ 4-bakketurn
73
+ dr.med
74
+ w.ø
75
+ polit
76
+ fremsættes
77
+ techn
78
+ tidl
79
+ o.g
80
+ i.c.i
81
+ mill
82
+ skt
83
+ m.fl
84
+ cand.merc
85
+ kbh
86
+ indiv
87
+ stk
88
+ dk-maked
89
+ memorandum
90
+ mestersk
91
+ mag.art
92
+ kitzb
93
+ h
94
+ lic
95
+ fig
96
+ dressurst
97
+ sportsg
98
+ r.e.m
99
+ d.u.m
100
+ sct
101
+ kld
102
+ bl.a
103
+ hf
104
+ g.a
105
+ corp
106
+ w
107
+ konk
108
+ zoeterm
109
+ b.t
110
+ a.d
111
+ l.b
112
+ jf
113
+ s.b
114
+ kgl
115
+ ill
116
+ beck
117
+ tosset
118
+ afd
119
+ johs
120
+ pct
121
+ k.b
122
+ sv
123
+ verbalt
124
+ kgs
125
+ l.m.k
126
+ j.l
127
+ aus
128
+ superl
129
+ t.v
130
+ mia
131
+ kr
132
+ pr
133
+ præmien
134
+ j.b.s
135
+ j.o
136
+ o.s.v
137
+ edb-oplysninger
138
+ o.m.a
139
+ ca
140
+ 1b
141
+ f.eks
142
+ rens
143
+ ch
144
+ mr
145
+ schw
146
+ d.c
147
+ utraditionelt
148
+ idrætsgym
149
+ hhv
150
+ e.l
151
+ s.s
152
+ eks
153
+ f.o.m
154
+ dk-storbrit
155
+ dk-jugo
156
+ n.z
157
+ derivater
158
+ c
159
+ pt
160
+ vm-kval
161
+ kl
162
+ hr
163
+ cand
164
+ jur
165
+ sav
166
+ h.c
167
+ arab.-danm
168
+ d.a.d
169
+ fl
170
+ o.a
171
+ a.s
172
+ cand.polit
173
+ grundejerform
174
+ j
175
+ faglærte
176
+ cr
177
+ a.a
178
+ mou
179
+ f.r.i
180
+ årh
181
+ o.m.m
182
+ sve
183
+ c.a
184
+ engl
185
+ sikkerhedssystemerne
186
+ m.f
187
+ j.k
188
+ phil
189
+ f
190
+ vet
191
+ mio
192
+ k.e
193
+ m.k
194
+ atla
195
+ idrætsg
196
+ n.n
197
+ 4-bakketur
198
+ dvs
199
+ sdr
200
+ s.j
201
+ hol
202
+ s.h
203
+ pei
204
+ kbhvn
205
+ aa
206
+ m.g.i
207
+ fvt
208
+
209
+ b.c
210
+ th
211
+ lrs
nltk_data/tokenizers/punkt_tab/danish/collocations.tab ADDED
@@ -0,0 +1,101 @@
1
+ ##number## skak
2
+ ##number## speedway
3
+ ##number## rally
4
+ ##number## april
5
+ ##number## dm-fin
6
+ ##number## viceformand
7
+ m jensen
8
+ ##number## kano/kajak
9
+ ##number## bowling
10
+ ##number## dm-finale
11
+ ##number## årh.
12
+ ##number## januar
13
+ ##number## august
14
+ ##number## marathon
15
+ ##number## kamp
16
+ ##number## skihop
17
+ ##number## etage
18
+ ##number## tennis
19
+ ##number## cykling
20
+ e andersen
21
+ ##number## december
22
+ g h.
23
+ ##number## neb
24
+ ##number## sektion
25
+ ##number## afd.
26
+ ##number## klasse
27
+ ##number## trampolin
28
+ ##number## bordtennis
29
+ ##number## formel
30
+ ##number## århundredes
31
+ ##number## dm-semifin
32
+ ##number## heks
33
+ ##number## taekwondo
34
+ ##number## galop
35
+ ##number## basketball
36
+ ##number## dm
37
+ m skræl
38
+ ##number## trav
39
+ ##number## provins
40
+ ##number## triathlon
41
+ k axel
42
+ ##number## rugby
43
+ s h.
44
+ ##number## klaverkoncert
45
+ a p.
46
+ e løgstrup
47
+ k telefax
48
+ ##number## gyldendal
49
+ ##number## fodbold
50
+ e rosenfeldt
51
+ ##number## oktober
52
+ k o.
53
+ ##number## september
54
+ ##number## dec.
55
+ ##number## juledag
56
+ ##number## badminton
57
+ ##number## sejlsport
58
+ ##number## håndbold
59
+ r førsund
60
+ e jørgensen
61
+ d ##number##
62
+ k e
63
+ ##number## alp.ski
64
+ ##number## judo
65
+ ##number## roning
66
+ ##number## november
67
+ ##number## atletik
68
+ ##number## århundrede
69
+ ##number## ridning
70
+ ##number## marts
71
+ m andersen
72
+ d roosevelt
73
+ ##number## brydning
74
+ s kr.
75
+ ##number## runde
76
+ ##number## division
77
+ ##number## sal
78
+ ##number## boksning
79
+ ##number## minut
80
+ ##number## golf
81
+ ##number## juni
82
+ ##number## symfoni
83
+ ##number## hurtigløb
84
+ k jørgensen
85
+ ##number## jörgen
86
+ ##number## klasses
87
+ e jacobsen
88
+ k jensen
89
+ ##number## februar
90
+ k nielsen
91
+ ##number## volleyball
92
+ ##number## maj
93
+ ##number## verdenskrig
94
+ ##number## juli
95
+ ##number## ishockey
96
+ ##number## kunstskøjteløb
97
+ b jørgensen
98
+ ##number## gymnastik
99
+ ##number## svømning
100
+ ##number## tw
101
+ i pedersens
nltk_data/tokenizers/punkt_tab/danish/ortho_context.tab ADDED
The diff for this file is too large to render. See raw diff