espejelomar committed
Commit 5e4b6fe • 1 Parent(s): 91a2987

Fix writing style

app.py CHANGED
@@ -16,14 +16,14 @@ st.subheader("Code Search: An Introduction to Semantic Search")
 st.markdown(
 """
-Suppose you have a database of texts and you want to find which entries best match the meaning of a single text you have.

-*"**Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines which only find documents based on lexical matches, semantic search can also find synonyms.**" - [SBert Documentation](https://www.sbert.net/examples/applications/semantic-search/README.html)*.


-Let's make this interactive and use the power of **[Streamlit](https://docs.streamlit.io/) + [HF Spaces](https://huggingface.co/spaces) + the '[sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)' model + the [Code Search Net Dataset](https://huggingface.co/datasets/code_search_net) in Hugging Face Datasets** by

-Disclaimer
 """
 )
@@ -54,9 +54,9 @@ st.subheader("What happens underneath? Obtaining embeddings")
 st.markdown(
 f"""
-First we embed our text database: we convert each
-
-

 """
 )
@@ -73,11 +73,11 @@ st.subheader("Obtaining the closest observations in the vector space")
 st.markdown(
 f"""
-We now have two numerical representations of texts (embeddings):

-

-The following figure (obtained from the [Sentence Transformers documentation](https://www.sbert.net/examples/applications/semantic-search/README.html)) shows in blue how we would represent the code search net dataset

 """
 )
@@ -91,7 +91,7 @@ st.image(
 st.markdown(
 """

-We compare the embedding of our query with the embeddings of each of the texts in the database (there are easier ways to do it but in this case it won't be necessary) using the cosine similarity function, better explained in the [Pytorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.cosine_similarity.html). The results of the cosine similarity function will detect which of the texts in the database are closest to our query in vector space.

 This is what we did in the dynamic example above!
 """
@@ -16,14 +16,14 @@ st.subheader("Code Search: An Introduction to Semantic Search")
 st.markdown(
 """
+Suppose you have a database of texts and you want to find which entries best match the meaning of a single text you have. For example, we want to see which function on GitHub best matches a description of a function we want to create in Python. However, word-for-word matching alone will not work because of the complexity of programming questions. We need to do a "smart" search: that's what semantic search is for.

+*"**Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.**" - [SBert Documentation](https://www.sbert.net/examples/applications/semantic-search/README.html)*.


+Let's make this interactive and use the power of **[Streamlit](https://docs.streamlit.io/) + [HF Spaces](https://huggingface.co/spaces) + the '[sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)' model + the [Code Search Net Dataset](https://huggingface.co/datasets/code_search_net) in Hugging Face Datasets** by describing a function we want to create and searching for the GitHub function most similar to that description. Right here:

+**Disclaimer**: We use a sample of 200,000 Python functions from GitHub to keep the demo fast, so the results may not be optimal.
 """
 )
@@ -54,9 +54,9 @@ st.subheader("What happens underneath? Obtaining embeddings")
 st.markdown(
 f"""
+First, we embed our text database: we map each function description from the Code Search Net dataset (the 'func_documentation_string' column) into a vector space. That is, we convert the descriptions from words to numbers that capture the meaning of each description. Each text is represented by a vector called an embedding.
+
+The system looks like the following figure. We embed (1) our text database (in this case, the set of GitHub function descriptions) and (2) our description (in this case: 'Create a dictionary') **with the same model**. Note that we could embed several descriptions at once, which would give a matrix of embeddings.

 """
 )
@@ -73,11 +73,11 @@ st.subheader("Obtaining the closest observations in the vector space")
 st.markdown(
 f"""
+We now have two numerical representations of texts (embeddings): our original text database and our query (here, the description of a Python function). Our goal: get the texts in the database that have the closest meaning to our query.

+Texts with similar meanings will be closer together in the vector space, and texts that differ most will be farther apart.

+The following figure (obtained from the [Sentence Transformers documentation](https://www.sbert.net/examples/applications/semantic-search/README.html)) shows in blue how we would represent the Code Search Net dataset and in orange your '**{anchor}**' query, which is outside the original dataset. The blue dot with the annotation 'Relevant Document' would be the GitHub function most similar to our search.

 """
 )
@@ -91,7 +91,7 @@ st.image(
 st.markdown(
 """

+We compare the embedding of our query with the embeddings of each of the texts in the database (there are easier ways to do it, but in this case, it won't be necessary) using the cosine similarity function, better explained in the [PyTorch documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.cosine_similarity.html). The cosine similarity scores tell us which of the texts in the database are closest to our query in vector space.

 This is what we did in the dynamic example above!
 """
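
The last hunk describes comparing the query embedding with every database embedding using cosine similarity. A minimal sketch of that ranking step follows, assuming toy 3-dimensional vectors in place of real all-mpnet-base-v2 embeddings. `cosine_similarity` and `rank_by_similarity` are hypothetical helpers written in plain Python for illustration; they are not functions from app.py or PyTorch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as not similar to anything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, database, top_k=3):
    """Return (index, score) pairs for the top_k database vectors closest to the query."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(database)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


# Toy stand-ins for embeddings (assumption: real embeddings have far more dimensions).
database = [
    [0.9, 0.1, 0.0],  # hypothetical description 0
    [0.0, 1.0, 0.2],  # hypothetical description 1
    [0.7, 0.3, 0.1],  # hypothetical description 2
]
query = [1.0, 0.2, 0.0]  # hypothetical query embedding

top = rank_by_similarity(query, database, top_k=2)
for idx, score in top:
    print(f"description {idx}: similarity {score:.3f}")
```

With these toy vectors, descriptions 0 and 2 rank above description 1 because they point in nearly the same direction as the query. Vector length does not affect the score.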