How to use sigmoid() to filter irrelevant documents out

#11
by shaunxu - opened

The score from the model represents the relevance between the query and each passage. I would like to use sigmoid() to filter out the irrelevant ones, but it did not work: all scores are greater than 0.5, which would mean every passage is somehow related.

Could you please give some advice on how to filter out the irrelevant passages? Which function could I use, and what threshold would you prefer? Maybe sigmoid() with >= 0.5?
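For reference, here is a minimal sketch of the sigmoid-plus-threshold filter being asked about. It assumes the model returns raw, unbounded scores (logits); the passages and score values are hypothetical. Note that sigmoid(x) >= 0.5 is the same as x >= 0, and if the scores are already in the 0 to 1 range, sigmoid maps them into roughly 0.5 to 0.73, which would explain why everything ends up above 0.5.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw score (logit) to the interval [0, 1] in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def filter_by_sigmoid(passages, raw_scores, threshold=0.5):
    """Keep (passage, probability) pairs whose sigmoid score reaches the threshold."""
    kept = []
    for passage, raw in zip(passages, raw_scores):
        prob = sigmoid(raw)
        if prob >= threshold:
            kept.append((passage, prob))
    return kept

# Hypothetical raw scores, for illustration only.
passages = ["passage A", "passage B", "passage C"]
raw_scores = [2.3, -1.1, 0.4]
kept = filter_by_sigmoid(passages, raw_scores, threshold=0.5)
kept_names = [name for name, _ in kept]
```

With these made-up numbers only the passages with a non-negative raw score survive. Whether 0.5 is a sensible cut-off for a particular model is a separate question that this sketch does not answer.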

I would use k-means to do a binary classification.
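A rough sketch of that idea, written as a plain 1-D two-cluster k-means so it has no dependencies. The function name and the sample scores are made up for illustration. Keep in mind that k-means always produces two groups, even when every passage is relevant or none is.

```python
def two_means_cut(scores, iterations=100):
    """Split 1-D scores into a low and a high cluster with 2-means.

    Returns the cut point, i.e. the midpoint between the two final centroids.
    """
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return lo  # all scores identical, nothing to separate
    for _ in range(iterations):
        cut = (lo + hi) / 2.0
        low = [s for s in scores if s < cut]
        high = [s for s in scores if s >= cut]
        if not low or not high:
            break  # degenerate split, keep the previous centroids
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0

# Hypothetical scores, for illustration only.
scores = [0.91, 0.88, 0.19, 0.12, 0.85, 0.05]
cut = two_means_cut(scores)
relevant = [s for s in scores if s >= cut]
```

Here the cut point is derived from the scores of one query's candidates, so it moves with the data instead of being a fixed constant.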

Also, if the "spread" of the scores is even from 0 to 1, binary classification is equivalent to a threshold of 0.5.

You can check whether it is even using the Kolmogorov-Smirnov test; then take the top 10% instead of half of it and cross your fingers.
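One way to read that suggestion, as a sketch only: compute the one-sample Kolmogorov-Smirnov statistic of the scores against a uniform distribution on [0, 1], and fall back to keeping the top 10% when the scores do not look uniform. The helper names are hypothetical, the scores are assumed to lie in [0, 1], and the 1.36 / sqrt(n) cut-off is the usual large-sample approximation for a 5% level, so it is only rough for small samples.

```python
import math

def ks_statistic_uniform(scores):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0, 1)."""
    xs = sorted(min(max(s, 0.0), 1.0) for s in scores)  # clamp to [0, 1]
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # The uniform CDF is F(x) = x; compare it with the empirical CDF
        # just after (i / n) and just before ((i - 1) / n) each sample.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d

def looks_uniform(scores, coefficient=1.36):
    """Rough uniformity check using the asymptotic critical value."""
    return ks_statistic_uniform(scores) <= coefficient / math.sqrt(len(scores))

def top_fraction(scores, fraction=0.10):
    """Return the indices of the highest-scoring fraction (at least one item)."""
    k = max(1, math.ceil(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical score lists, for illustration only.
even_scores = [(i + 0.5) / 20 for i in range(20)]
skewed_scores = [0.9 + i * 0.005 for i in range(20)]
sample = [0.1, 0.9, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.05]
best = top_fraction(sample, fraction=0.10)
```

If SciPy is available, `scipy.stats.kstest(scores, "uniform")` performs the same test and also returns a p-value.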

Thanks for your response. I didn't find any information in the model card about the range of the score, so based on some tests I have to guess that it is between 0 and 1. But in some cases it gave me a very low score (less than 0.5) even though the query and passage are related. For example:
query: "What is the grain production in the government work report?",
passage: "Government Work Report This dataset contains our country's annual government work reports, including information on food, education, technology, etc."
From my perspective they are strongly related, but the score was 0.19072403013706207.

This means it would be filtered out if the threshold were 0.5.

So I have no idea how to set a proper threshold to filter out the irrelevant documents.
