Longform QA #8
opened by shehzaadzd
The FActScore paper (https://arxiv.org/pdf/2305.14251.pdf) offers an automatic method for evaluating hallucination in long-form QA. The authors also provide a benchmark of biographies covering a mix of entities.
Can this be integrated into the leaderboard?
@shehzaadzd from a quick glance, it requires access to an OpenAI key -- for example, see this snippet from https://github.com/shmsw25/FActScore:
from factscore.factscorer import FactScorer
fs = FactScorer(openai_key="...")
# topics: list of strings (human entities used to generate bios)
# generations: list of strings (model generations)
out = fs.get_score(topics, generations, gamma=10)
print(out["score"])  # FActScore
[..]
How would you implement/include it?
The FActScore Llama-7b model uses InstructGPT only for splitting sentences into facts. It might be possible to train an open-source model to do this if OpenAI models cannot be included in this benchmark.
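To make that idea a bit more concrete, here is a rough sketch of how the fact-splitting step could be kept swappable. It is not the FActScore implementation: `fraction_supported`, `naive_sentence_splitter`, `split_into_facts`, and `is_supported` are all hypothetical names, the splitter is a trivial stand-in for whatever open-source model would replace InstructGPT, and the score is just the plain fraction of supported facts, without the `gamma` length penalty from the snippet above.

```python
from typing import Callable, List


def naive_sentence_splitter(text: str) -> List[str]:
    """Hypothetical placeholder: treat every sentence as a single fact.

    A trained open-source model would replace this function.
    """
    return [part.strip() for part in text.split(".") if part.strip()]


def fraction_supported(
    generations: List[str],
    split_into_facts: Callable[[str], List[str]],
    is_supported: Callable[[str], bool],
) -> float:
    """Average, over generations, of the share of facts judged supported.

    Both callables are assumptions of this sketch, not FActScore APIs.
    Generations that yield no facts are skipped.
    """
    per_generation = []
    for generation in generations:
        facts = split_into_facts(generation)
        if not facts:
            continue
        supported = sum(1 for fact in facts if is_supported(fact))
        per_generation.append(supported / len(facts))
    if not per_generation:
        return 0.0
    return sum(per_generation) / len(per_generation)


# Toy usage: a set lookup stands in for the knowledge-source check.
known_facts = {"Paris is in France", "Berlin is in Germany"}
generations = ["Paris is in France. Paris is in Spain.", "Berlin is in Germany."]
score = fraction_supported(
    generations, naive_sentence_splitter, lambda fact: fact in known_facts
)
print(score)
```

With an interface like this, the leaderboard would only need a splitter and a verifier that run without an OpenAI key; how well an open-source splitter matches InstructGPT would still have to be checked separately.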