bert-base-chinese-finetuned-cmrc2018
This model is a fine-tuned version of bert-base-chinese on the CMRC2018 (Chinese Machine Reading Comprehension) dataset.
Model Description
This is a BERT-based extractive question answering model for Chinese text. The model is designed to locate and extract answer spans from given contexts in response to questions.
Key Features:
- Base Model: bert-base-chinese
- Task: Extractive Question Answering
- Language: Chinese
- Training Dataset: CMRC2018
Performance Metrics
Evaluation results on the test set:
- Exact Match: 59.708
- F1 Score: 60.0723
- Number of evaluation samples: 6,254
- Evaluation speed: 283.054 samples/second
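For readers unfamiliar with the two metrics, the sketch below shows one simplified way Exact Match and F1 can be computed at the character level for Chinese answers, each on a 0 to 1 scale (the figures above are on a 0 to 100 scale). It is an illustration only: the function names `exact_match` and `char_f1` are hypothetical, and the official CMRC2018 evaluation script may normalize and match text differently, so this is not expected to reproduce the numbers above.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the two strings are identical after trimming whitespace."""
    return float(prediction.strip() == reference.strip())


def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1 (SQuAD-style, treating each character as a token)."""
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    if not pred_chars or not ref_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == ref_chars)
    # Multiset intersection counts characters shared by both strings.
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)


prediction, reference = "2万公里", "超过2万公里"
print(exact_match(prediction, reference))
print(round(char_f1(prediction, reference), 4))
```

In this toy case the prediction covers four of the six reference characters, so it fails Exact Match while still earning partial F1 credit.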
Intended Uses & Limitations
Intended Uses
- Chinese reading comprehension tasks
- Answer extraction from given documents
- Context-based question answering systems
Limitations
- Only supports extractive QA (cannot generate new answers)
- Answers must be present in the context
- Does not support multi-hop reasoning
- Cannot handle unanswerable questions
Training Details
Training Hyperparameters
- Learning rate: 3e-05
- Train batch size: 12
- Eval batch size: 8
- Seed: 42
- Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
- LR scheduler: linear
- Number of epochs: 5.0
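To make the "linear" scheduler concrete, here is a minimal sketch of a learning rate that decays linearly from its peak to zero. Several things are assumptions not stated in this card: zero warmup steps, a single device, and no gradient accumulation. Under those assumptions, the 18,960 training samples at batch size 12 give 1,580 optimizer steps per epoch and 7,900 steps over 5 epochs. The function name `linear_lr` is hypothetical.

```python
import math

PEAK_LR = 3e-5
TRAIN_SAMPLES = 18_960
BATCH_SIZE = 12
EPOCHS = 5
WARMUP_STEPS = 0  # assumption: the card does not mention warmup

STEPS_PER_EPOCH = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, linear_lr(step))
```

With these values the rate starts at the peak, reaches half the peak at the midpoint of training, and hits zero on the final step.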
Training Results
- Training time: 892.86 seconds
- Training samples: 18,960
- Training speed: 106.175 samples/second
- Training loss: 0.5625
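The reported training speed can be sanity-checked from the other figures. The short calculation below assumes the speed is simply the total number of samples processed across all epochs divided by the wall-clock training time; the variable names are illustrative.

```python
train_samples = 18_960
epochs = 5
train_seconds = 892.86

# Every sample is seen once per epoch.
samples_processed = train_samples * epochs
throughput = samples_processed / train_seconds

print(samples_processed)
print(f"{throughput:.2f} samples/second")
```

The result agrees with the reported 106.175 samples/second to within rounding.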
Framework Versions
- Transformers: 4.47.0.dev0
- PyTorch: 2.5.1+cu124
- Datasets: 3.1.0
- Tokenizers: 0.20.3
Usage
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
# Load model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained("real-jiakai/bert-base-chinese-finetuned-cmrc2018")
tokenizer = AutoTokenizer.from_pretrained("real-jiakai/bert-base-chinese-finetuned-cmrc2018")
# Prepare inputs
question = "长城有多长?"
context = "长城是中国古代的伟大建筑工程,全长超过2万公里,横跨中国北部多个省份。"
# Tokenize inputs
inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    max_length=384,
    truncation=True,
)
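# Get the answer span (inference only, so gradients are not needed)
with torch.no_grad():
    outputs = model(**inputs)
answer_start = int(torch.argmax(outputs.start_logits))
answer_end = int(torch.argmax(outputs.end_logits)) + 1  # slice end is exclusive
# The BERT tokenizer decodes Chinese characters separated by spaces; remove them
answer = tokenizer.decode(inputs["input_ids"][0][answer_start:answer_end]).replace(" ", "")
print("Answer:", answer)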
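Taking the argmax of the start and end logits independently, as the snippet above does, can occasionally produce an end position that comes before the start. One common remedy is to score start and end positions jointly and keep only valid spans. The sketch below illustrates that idea on plain Python lists so it stays self-contained; `best_span` and `max_answer_len` are hypothetical names, and with the model above you could pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the inclusive (start, end) pair with the highest summed logit,
    restricted to start <= end and at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy logits where the independent argmaxes give start=1 and end=0 (invalid).
start_logits = [0.1, 2.0, 0.3, 0.2]
end_logits = [3.0, 0.1, 1.5, 0.2]
print(best_span(start_logits, end_logits))  # (1, 2)
```

The returned end index is inclusive, so add 1 before slicing `input_ids`. This sketch does not exclude question or special tokens from the candidate spans; a fuller post-processing step would also mask those positions.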
Citation
If you use this model, please cite the CMRC2018 dataset:
@inproceedings{cui-emnlp2019-cmrc2018,
  title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
  author = "Cui, Yiming and
    Liu, Ting and
    Che, Wanxiang and
    Xiao, Li and
    Chen, Zhipeng and
    Ma, Wentao and
    Wang, Shijin and
    Hu, Guoping",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
  month = nov,
  year = "2019",
  address = "Hong Kong, China",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/D19-1600",
  doi = "10.18653/v1/D19-1600",
  pages = "5886--5891",
}