Language Identification

This is a language identification model trained with AllenNLP on the qgyd2021/language_identification dataset.

Accuracy on the valid (validation) split:

Language Samples Accuracy
af 6221 0.8666
ar 19808 0.9994
bg 19913 0.9958
bn 7396 0.9968
bs 1653 0.8232
cs 19122 0.9615
da 19500 0.9727
de 19702 0.996
el 19455 0.9761
en 39710 0.9942
eo 18542 0.9944
es 19924 0.9937
et 19482 0.9727
fi 19223 0.9554
fo 4612 0.9697
fr 19990 0.9957
ga 19949 0.9973
gl 508 0.822
hi 19984 0.9965
hi_en 1358 0.951
hr 18840 0.9789
hu 669 0.8873
hy 124 0.9688
id 4669 0.9968
is 19795 0.9876
it 19742 0.9941
ja 20130 0.9996
ko 20098 0.9998
lt 19280 0.9721
lv 19459 0.9931
mr 10300 0.9961
mt 19708 0.993
nl 18452 0.9258
no 19404 0.9714
pl 19920 0.9973
pt 19996 0.9946
ro 19804 0.9944
ru 20003 0.9954
sk 19804 0.9861
sl 19665 0.9926
sv 18941 0.95
sw 19768 0.9871
th 19917 0.9991
tl 19572 0.9991
tn 19883 0.9933
tr 19809 0.9939
ts 19752 0.9854
uk 17643 0.9994
ur 19895 0.992
vi 19836 0.9982
yo 1936 0.9827
zh 40108 0.9996
zu 5406 0.9905
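
For readers who want to produce a table of this shape for their own predictions, the following is a minimal sketch, not the author's evaluation script. It assumes the gold and predicted language codes are available as two parallel lists; the function name `per_language_accuracy` and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_language_accuracy(
    gold_labels: List[str],
    predicted_labels: List[str],
) -> Dict[str, Tuple[int, float]]:
    """Return {language: (sample_count, accuracy)} grouped by gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold_labels and predicted_labels must have the same length")

    # Number of validation samples per gold language.
    total = Counter(gold_labels)
    # Number of samples per gold language whose prediction matches.
    correct = Counter(g for g, p in zip(gold_labels, predicted_labels) if g == p)

    return {
        lang: (count, round(correct[lang] / count, 4))
        for lang, count in sorted(total.items())
    }


# Toy data (hypothetical), only to show the output shape.
gold = ["en", "en", "zh", "zh", "af", "af", "af", "af"]
pred = ["en", "en", "zh", "zh", "af", "nl", "af", "af"]

for lang, (count, acc) in per_language_accuracy(gold, pred).items():
    print(lang, count, acc)
```

Because accuracy is computed per gold label, languages with few validation samples (for example gl or hy in the table above) yield noisier estimates than languages with around 20000 samples.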

Test code:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import argparse
import time

from allennlp.models.archival import load_archive
from allennlp.predictors.text_classifier import TextClassifierPredictor

from project_settings import project_path


def get_args():
    """
    Parse command-line arguments.

    Usage: python3 step_5_predict_by_archive.py
    """
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--text",
        default="hello guy.",
        type=str
    )
    parser.add_argument(
        "--archive_file",
        default=(project_path / "trained_models/language_identification").as_posix(),
        type=str
    )
    args = parser.parse_args()
    return args


def main():
    args = get_args()

    # Load the trained model and its dataset reader from the archive.
    archive = load_archive(archive_file=args.archive_file)

    predictor = TextClassifierPredictor(
        model=archive.model,
        dataset_reader=archive.dataset_reader,
    )

    json_dict = {
        "sentence": args.text
    }

    begin_time = time.time()
    outputs = predictor.predict_json(
        json_dict
    )
    # "label" is the predicted language code; the largest value in "probs" is its probability.
    label = outputs["label"]
    prob = round(max(outputs["probs"]), 4)
    print(label)
    print(prob)

    print('time cost: {}'.format(time.time() - begin_time))
    return


if __name__ == '__main__':
    main()
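
The script above reports only the best label and its probability. When an input is ambiguous between closely related languages, it can help to look at several candidates. The following is a minimal sketch under stated assumptions, not part of the original script: `top_k_languages` is a hypothetical helper, and it assumes you already have the `probs` list from the predictor output and an index-to-label mapping. With AllenNLP such a mapping can usually be read from the model vocabulary via `archive.model.vocab.get_index_to_token_vocabulary("labels")`, assuming the classifier uses the default "labels" namespace.

```python
from typing import Dict, List, Tuple


def top_k_languages(
    probs: List[float],
    index_to_label: Dict[int, str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k most probable (language, probability) pairs, best first."""
    # Pair each probability with its class index, then sort by probability.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(index_to_label[index], round(prob, 4)) for index, prob in ranked[:k]]


# Toy values (hypothetical); real values would come from outputs["probs"]
# and the model vocabulary.
toy_probs = [0.02, 0.90, 0.08]
toy_labels = {0: "de", 1: "en", 2: "nl"}

print(top_k_languages(toy_probs, toy_labels, k=2))
```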

requirements.txt

allennlp==2.10.1
allennlp-models==2.10.1
torch==1.12.1
overrides==1.9.0
pytorch_pretrained_bert==0.6.2