---
license: apache-2.0
datasets:
- common_language
language:
- ar
- eu
- br
- ca
- zh
- cv
- cs
- nl
- en
- eo
- et
- fr
- ka
- de
- el
- id
- ia
- it
- ja
- rw
- ky
- lv
- mt
- mn
- fa
- pl
- pt
- ro
- rm
- ru
- sl
- es
- sv
- ta
- tt
- tr
- uk
- cy
metrics:
- accuracy
- precision
- recall
- f1
tags:
- language-detection
- Frisian
- Dhivehi
- Hakha_Chin
- Kabyle
- Sakha
---

### Overview

This model detects **45** languages. It was fine-tuned from the **multilingual-e5-base** model on the **common_language** dataset. The overall accuracy is **98.37%**; detailed evaluation results are shown below.
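The training script itself is not included in this card. As a rough illustration only, a fine-tune along these lines could be reproduced with the 🤗 `Trainer` API; the base checkpoint id (`intfloat/multilingual-e5-base`), the `sentence`/`language` column names, and all hyperparameters below are assumptions, not the author's actual setup.

```python
# A minimal fine-tuning sketch, NOT the author's actual training script.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# May require trust_remote_code=True depending on your datasets version.
dataset = load_dataset("common_language")
# Keep only the transcript and label columns so map() never touches audio.
dataset = dataset.remove_columns(
    [c for c in dataset["train"].column_names if c not in ("sentence", "language")]
)

tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "intfloat/multilingual-e5-base", num_labels=45)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True).rename_column("language", "labels")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="e5-langid", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```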
### Download the model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('Mike0307/multilingual-e5-language-detection')
model = AutoModelForSequenceClassification.from_pretrained(
    'Mike0307/multilingual-e5-language-detection', num_labels=45)
```

### Example of language detection

```python
import torch

# Label names in index order; index i corresponds to model output class i.
languages = [
    "Arabic", "Basque", "Breton", "Catalan", "Chinese_China", "Chinese_Hongkong",
    "Chinese_Taiwan", "Chuvash", "Czech", "Dhivehi", "Dutch", "English",
    "Esperanto", "Estonian", "French", "Frisian", "Georgian", "German", "Greek",
    "Hakha_Chin", "Indonesian", "Interlingua", "Italian", "Japanese", "Kabyle",
    "Kinyarwanda", "Kyrgyz", "Latvian", "Maltese", "Mongolian", "Persian",
    "Polish", "Portuguese", "Romanian", "Romansh_Sursilvan", "Russian", "Sakha",
    "Slovenian", "Spanish", "Swedish", "Tamil", "Tatar", "Turkish", "Ukrainian",
    "Welsh"
]

def predict(text, model, tokenizer, device=torch.device('cpu')):
    model.to(device)
    model.eval()
    tokenized = tokenizer(text, padding='max_length', truncation=True,
                          max_length=128, return_tensors="pt")
    input_ids = tokenized['input_ids'].to(device)
    attention_mask = tokenized['attention_mask'].to(device)
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # Convert logits to a probability distribution over the 45 languages.
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=1)
    return probabilities

def get_topk(probabilities, languages, k=3):
    topk_prob, topk_indices = torch.topk(probabilities, k)
    topk_prob = topk_prob.cpu().numpy()[0].tolist()
    topk_indices = topk_indices.cpu().numpy()[0].tolist()
    topk_labels = [languages[index] for index in topk_indices]
    return topk_prob, topk_labels

text = "你的測試句子"
probabilities = predict(text, model, tokenizer)
topk_prob, topk_labels = get_topk(probabilities, languages)
print(topk_prob, topk_labels)
# [0.999620258808, 0.00025940246996469, 2.7690215574693e-05]
# ['Chinese_Taiwan', 'Chinese_Hongkong', 'Chinese_China']
```
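For scoring many sentences at once, batching is usually faster than calling `predict` in a loop. The `predict_batch` helper below is an illustrative sketch, not part of the original card; it reuses the `model`, `tokenizer`, and `languages` objects defined above.

```python
def predict_batch(texts, model, tokenizer, device=torch.device('cpu'), k=3):
    # Pad to the longest sentence in the batch instead of a fixed max_length.
    model.to(device)
    model.eval()
    tokenized = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**tokenized).logits
    probabilities = torch.nn.functional.softmax(logits, dim=1)
    topk_prob, topk_indices = torch.topk(probabilities, k)
    # One list of (label, probability) pairs per input sentence.
    return [
        [(languages[i], p) for i, p in zip(idx.tolist(), prob.tolist())]
        for idx, prob in zip(topk_indices, topk_prob)
    ]

print(predict_batch(["Bonjour tout le monde", "你的測試句子"], model, tokenizer))
```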
### Evaluation Results

The test data is the **common_language** test split.

| language | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| Arabic | 1.00 | 1.00 | 1.00 | 151 |
| Basque | 0.99 | 1.00 | 1.00 | 111 |
| Breton | 1.00 | 0.90 | 0.95 | 252 |
| Catalan | 0.96 | 0.99 | 0.97 | 96 |
| Chinese_China | 0.98 | 1.00 | 0.99 | 100 |
| Chinese_Hongkong | 0.97 | 0.87 | 0.92 | 115 |
| Chinese_Taiwan | 0.92 | 0.98 | 0.95 | 170 |
| Chuvash | 0.98 | 1.00 | 0.99 | 137 |
| Czech | 0.98 | 1.00 | 0.99 | 128 |
| Dhivehi | 1.00 | 1.00 | 1.00 | 111 |
| Dutch | 0.99 | 1.00 | 0.99 | 144 |
| English | 0.96 | 1.00 | 0.98 | 98 |
| Esperanto | 0.98 | 0.98 | 0.98 | 107 |
| Estonian | 1.00 | 0.99 | 0.99 | 93 |
| French | 0.95 | 1.00 | 0.98 | 106 |
| Frisian | 1.00 | 0.98 | 0.99 | 117 |
| Georgian | 1.00 | 1.00 | 1.00 | 110 |
| German | 1.00 | 1.00 | 1.00 | 101 |
| Greek | 1.00 | 1.00 | 1.00 | 153 |
| Hakha_Chin | 0.99 | 1.00 | 0.99 | 202 |
| Indonesian | 0.99 | 0.99 | 0.99 | 150 |
| Interlingua | 0.96 | 0.97 | 0.96 | 182 |
| Italian | 0.99 | 0.94 | 0.96 | 100 |
| Japanese | 1.00 | 1.00 | 1.00 | 144 |
| Kabyle | 1.00 | 0.96 | 0.98 | 156 |
| Kinyarwanda | 0.97 | 1.00 | 0.99 | 103 |
| Kyrgyz | 0.98 | 1.00 | 0.99 | 129 |
| Latvian | 0.98 | 0.98 | 0.98 | 171 |
| Maltese | 0.99 | 0.98 | 0.98 | 152 |
| Mongolian | 1.00 | 1.00 | 1.00 | 112 |
| Persian | 1.00 | 1.00 | 1.00 | 123 |
| Polish | 0.91 | 0.99 | 0.95 | 128 |
| Portuguese | 0.94 | 0.99 | 0.96 | 124 |
| Romanian | 1.00 | 1.00 | 1.00 | 152 |
| Romansh_Sursilvan | 0.99 | 0.95 | 0.97 | 106 |
| Russian | 0.99 | 0.99 | 0.99 | 100 |
| Sakha | 0.99 | 1.00 | 1.00 | 105 |
| Slovenian | 0.99 | 1.00 | 1.00 | 166 |
| Spanish | 0.96 | 0.95 | 0.95 | 94 |
| Swedish | 0.99 | 1.00 | 0.99 | 190 |
| Tamil | 1.00 | 1.00 | 1.00 | 135 |
| Tatar | 1.00 | 0.96 | 0.98 | 173 |
| Turkish | 1.00 | 1.00 | 1.00 | 137 |
| Ukrainian | 0.99 | 1.00 | 1.00 | 126 |
| Welsh | 0.98 | 1.00 | 0.99 | 103 |
| *macro avg* | 0.98 | 0.99 | 0.98 | 5963 |
| *weighted avg* | 0.98 | 0.98 | 0.98 | 5963 |
| *overall accuracy* | | | 0.9837 | 5963 |
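The table follows the layout of scikit-learn's `classification_report`. As a sketch, numbers like these could be reproduced on the test split as follows; the `sentence` and `language` field names, and the assumption that the dataset's label ids match the alphabetical `languages` list above, should be verified against the dataset card.

```python
from datasets import load_dataset
from sklearn.metrics import classification_report

# The full dataset download includes audio; only the transcripts are used here.
test = load_dataset("common_language", split="test")
test = test.remove_columns([c for c in test.column_names
                            if c not in ("sentence", "language")])

y_true, y_pred = [], []
for example in test:
    probabilities = predict(example["sentence"], model, tokenizer)
    y_pred.append(int(probabilities.argmax(dim=1)))
    # Assumes dataset label ids share the ordering of the `languages` list.
    y_true.append(example["language"])

print(classification_report(y_true, y_pred, target_names=languages, digits=2))
```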