Evaluation scores differ from the Google paper
#2, opened by zhongwei
I just evaluated the model using run_summarization.py on the Hugging Face dataset ccdv/arxiv-summarization and got a ROUGE-1 score of 41.68.
The ROUGE-1 score reported in the Google paper ( https://arxiv.org/pdf/2208.04347.pdf ) for PEGASUS-X Base on the arXiv evaluation is 49.4.
What explains this large difference? How can we reproduce the paper's score with the Hugging Face setup?
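
For reference, here is a minimal sketch of what I understand ROUGE-1 to measure: an F-measure over unigram overlap between the reference summary and the generated summary. The function name `rouge1_f1` and the lowercase whitespace tokenizer are assumptions made for illustration only; this is not the scorer used by run_summarization.py or by the paper, and real scorers may tokenize or stem differently.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1 (hypothetical helper, illustration only).

    Assumption: tokens are lowercase, whitespace-separated words.
    Real ROUGE implementations apply their own tokenization rules.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped unigram overlap: each word counts at most as often
    # as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    score = rouge1_f1("the cat sat on the mat", "the cat")
    # Scores such as 41.68 and 49.4 are this value scaled by 100.
    print(f"ROUGE-1 F1: {100 * score:.2f}")
```

Because the score depends on how both texts are tokenized and on which text the model actually generates, it would help to know which scorer settings and generation settings the paper's number corresponds to.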