Commit 16cd190 by anakin87 · 1 parent: bcb986c
generate info page from readme
- README.md +26 -10
- Rock_fact_checker.py +1 -1
- app_utils/frontend_utils.py +1 -1
- pages/Info.py +6 -1
README.md
CHANGED
@@ -10,23 +10,39 @@ pinned: false
 license: apache-2.0
 ---
 
-# Fact Checking
+# Fact Checking 🎸 Rocks! [![Generic badge](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/anakin87/fact-checking-rocks) [![Generic badge](https://img.shields.io/github/stars/anakin87/fact-checking-rocks?label=Github&style=social)](https://github.com/anakin87/fact-checking-rocks)
 
 ## *Fact checking baseline combining dense retrieval and textual entailment*
 
 ### Idea 💡
-This project aims to show that a naive and simple baseline for fact checking can be built by combining dense retrieval and a textual entailment task (based on Natural Language Inference models).
+This project aims to show that a *naive and simple baseline* for fact checking can be built by combining dense retrieval and a textual entailment task (based on Natural Language Inference models).
+In a nutshell, the flow is as follows:
+* the user enters a factual statement
+* the relevant passages are retrieved from the knowledge base using dense retrieval
+* the system computes the text entailment between each relevant passage and the statement, using a Natural Language Inference model
+* the entailment scores are aggregated to produce a summary score.
+
+### System description 🪄
+This project is strongly based on [🔎 Haystack](https://github.com/deepset-ai/haystack), an open source NLP framework to realize search systems. The main components of our system are an indexing pipeline and a search pipeline.
 
-### System description
-This project is strongly based on [Haystack](https://github.com/deepset-ai/haystack), an open source NLP framework to realize search systems. The main components of our system are an indexing pipeline and a search pipeline.
 
 #### Indexing pipeline
 * [Crawling](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/get_wikipedia_data.ipynb): Crawl data from Wikipedia, starting from the page [List of mainstream rock performers](https://en.wikipedia.org/wiki/List_of_mainstream_rock_performers) and using the [python wrapper](https://github.com/goldsmith/Wikipedia)
-* [Indexing
-*
-*
-*
-*
-*
+* [Indexing](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/indexing.ipynb)
+* preprocess the downloaded documents into chunks consisting of 2 sentences
+* chunks with fewer than 10 words are discarded, because they are not very informative
+* instantiate a [FAISS](https://github.com/facebookresearch/faiss) Document store and store the passages on it
+* create embeddings for the passages, using a Sentence Transformer model and save them in FAISS. The retrieval task will involve [*asymmetric semantic search*](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search) (statements to be verified are usually shorter than inherent passages), therefore I choose the model `msmarco-distilbert-base-tas-b`.
+* save FAISS index
 
 #### Search pipeline
+
+* the user enters a factual statement
+* compute the embedding of the user statement using the same Sentence Transformer (`msmarco-distilbert-base-tas-b`)
+* retrieve the K most relevant text passages stored in FAISS (along with their relevance scores)
+* **text entailment task**: compute the text entailment between each text passage (premise) and the user statement (hypothesis), using a Natural Language Inference model (`microsoft/deberta-v2-xlarge-mnli`). For every text passage, we have 3 scores (summing to 1): entailment, contradiction, neutral. *(For this task, I developed a custom Haystack node: `EntailmentChecker`)*
+* aggregate the text entailment scores: compute the weighted average of them, where the weight is the relevance score. **Now it is possible to tell if the knowledge base confirms, is neutral or disproves the user statement.**
+* *empirical consideration: if in the first N documents (N<K), there is strong evidence of entailment/contradiction (partial aggregate scores > 0.5), it is better not to consider less relevant documents*
+
+### Limits and possible improvements ✨
+
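The chunk filtering and score aggregation described in the README text above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's actual Haystack code or its `EntailmentChecker` node: the function names `chunk_sentences` and `aggregate_entailment`, the regex sentence splitter, and the input layout (a list of `(relevance, scores)` pairs sorted by decreasing relevance) are all hypothetical. Only the 2-sentence chunks, the 10-word minimum, the relevance-weighted average, and the 0.5 early-stop threshold come from the README.

```python
import re
from typing import Dict, List, Tuple

LABELS = ("entailment", "contradiction", "neutral")


def chunk_sentences(text: str, per_chunk: int = 2, min_words: int = 10) -> List[str]:
    """Split text into chunks of `per_chunk` sentences and drop short chunks."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = [
        " ".join(sentences[i:i + per_chunk])
        for i in range(0, len(sentences), per_chunk)
    ]
    # Chunks with fewer than `min_words` words are discarded.
    return [c for c in chunks if len(c.split()) >= min_words]


def aggregate_entailment(
    passages: List[Tuple[float, Dict[str, float]]],
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Relevance-weighted average of per-passage NLI scores.

    `passages` must be ordered by decreasing relevance. Each item is
    (relevance_score, {"entailment": e, "contradiction": c, "neutral": n}).
    """
    weighted = {label: 0.0 for label in LABELS}
    aggregate = dict(weighted)
    total_weight = 0.0
    for relevance, scores in passages:
        total_weight += relevance
        for label in LABELS:
            weighted[label] += relevance * scores[label]
        if total_weight == 0:
            continue  # nothing to normalise by yet
        aggregate = {label: weighted[label] / total_weight for label in LABELS}
        # Early stop: the partial aggregate already shows strong evidence,
        # so less relevant passages are not considered.
        if max(aggregate["entailment"], aggregate["contradiction"]) > threshold:
            break
    return aggregate
```

With this sketch, a highly relevant passage that strongly entails the statement settles the result on its own, while weaker evidence is averaged over all K retrieved passages.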
Rock_fact_checker.py
CHANGED
@@ -30,7 +30,7 @@ def main():
     set_state_if_absent("raw_json", None)
     set_state_if_absent("random_statement_requested", False)
 
-    st.write("# Fact
+    st.write("# Fact Checking 🎸 Rocks!")
     st.write()
     st.markdown(
         """
app_utils/frontend_utils.py
CHANGED
@@ -11,7 +11,7 @@ entailment_html_messages = {
 
 def build_sidebar():
     sidebar="""
-    <h1 style='text-align: center'>Fact
+    <h1 style='text-align: center'>Fact Checking 🎸 Rocks!</h1>
     <div style='text-align: center'>
     <i>Fact checking baseline combining dense retrieval and textual entailment</i>
     <p><br/><a href='https://github.com/anakin87/fact-checking-rocks'>Github project</a> - Based on <a href='https://github.com/deepset-ai/haystack'>Haystack</a></p>
pages/Info.py
CHANGED
@@ -1,4 +1,9 @@
 import streamlit as st
 from app_utils.frontend_utils import build_sidebar
 
-build_sidebar()
+build_sidebar()
+
+with open('README.md','r') as fin:
+    readme = fin.read().rpartition('---')[-1]
+
+st.markdown(readme, unsafe_allow_html=True)
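The `rpartition('---')[-1]` call in the diff above keeps only the text after the last `---` in the README, which is what drops the YAML front matter block. The sketch below reproduces that behaviour on an in-memory string. The helper name `strip_front_matter` and the sample text are hypothetical and used only for illustration.

```python
def strip_front_matter(readme_text: str) -> str:
    """Return everything after the last '---' in the text."""
    # str.rpartition splits at the LAST occurrence of the separator and
    # returns (head, separator, tail); [-1] selects the tail.
    return readme_text.rpartition("---")[-1]


sample = "---\ntitle: Demo\nlicense: apache-2.0\n---\n\n# Heading\nBody text.\n"
body = strip_front_matter(sample)
print(body)
```

Two properties follow from how `str.rpartition` works. If the text contains no `---`, the whole text is returned unchanged. If the body itself contains a later `---` (for example a Markdown horizontal rule), everything before that last separator is dropped as well, so this approach assumes the README body has none.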