DebasishDhal99 committed
Commit d043c48
Parent(s): 0a1267d
Updating the tokenizer method from nltk to regular splitting
turkish.py CHANGED (+2, -1)
@@ -58,7 +58,8 @@ def turkish_word_to_latin(word):
 
 
 def turkish_sentence_to_latin(sentence):
-    word_list = word_tokenize(sentence)
+    # word_list = word_tokenize(sentence)  # The NLTK tokenizer didn't work out: it sometimes also splits on "'", so İstanbul'u becomes İstanbul ' u.
+    word_list = sentence.split(" ")
     processed_word_list = []
 
     for word in word_list:
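For context, a minimal sketch (not part of the commit) of the behaviour the new comment describes. The İstanbul'u example comes from the commit itself; the rest of the sample sentence is made up, and the exact NLTK split depends on the installed punkt model:

# Minimal illustration of why word_tokenize was dropped: NLTK's
# English-oriented tokenizer treats the apostrophe as a token boundary,
# so Turkish suffixes such as the 'u in İstanbul'u get split off.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK versions may need "punkt_tab" instead

sentence = "İstanbul'u çok seviyorum"

print(word_tokenize(sentence))  # apostrophe split off, e.g. ['İstanbul', "'u", 'çok', 'seviyorum']
print(sentence.split(" "))      # suffix stays attached: ["İstanbul'u", 'çok', 'seviyorum']

One trade-off of the plain split worth noting: sentence.split(" ") yields empty strings for consecutive spaces and leaves punctuation glued to words, whereas sentence.split() with no argument would at least collapse whitespace runs.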