import time

import streamlit as st

# EmbeddingRetrievalSystem and load_dataset are expected to come from this star import.
from fns import *

st.set_page_config(
    page_title="Synthesist",
    page_icon="👋",
)

# st.write("# Welcome to Pathfinder! 👋")
st.image('local_files/synth_logo.png')

st.sidebar.success("Select a function above.")
st.sidebar.markdown(
    "Current functions include visualizing papers in the arXiv embedding, "
    "searching for papers similar to an input paper or prompt phrase, "
    "or answering quick questions."
)

st.markdown("")

st.markdown(
    """
**Synthesist** (from Peter Watts's [Blindsight](https://scalar.usc.edu/works/network-ecologies/on-peter-watts-blindsight)) is a framework for searching and visualizing papers on the [arXiv](https://arxiv.org/), using the context sensitivity of modern large language models (LLMs) to better parse patterns in paper contexts.

This tool was built during the [JSALT workshop](https://www.clsp.jhu.edu/2024-jelinek-summer-workshop-on-speech-and-language-technology/) to do awesome things.

**👈 Select a tool from the sidebar** to see some examples of what this framework can do!

### Tool summary:
- Please wait while the initial data loads and compiles; this takes about a minute.
- `Paper search` looks for relevant papers given an arXiv ID or a question.

This is not meant to be a replacement for existing tools like the [ADS](https://ui.adsabs.harvard.edu/), [arxivsorter](https://www.arxivsorter.org/), semantic search or Google Scholar, but rather a supplement for finding papers that might otherwise be missed during a literature survey. It is trained on astro-ph.GA (astrophysics of galaxies) papers up to roughly last year, mined from the arXiv and supplemented with ADS metadata. If you are interested in extending it, please reach out!

Planned additions: more pages, actual generation, different toggles for retrieval/generation, feedback form, socials, literature, contact us, copyright, collaboration, etc.

The image below shows a representation of all the astro-ph.GA papers, which can be explored in more detail using the `Arxiv embedding` page.
The papers tend to cluster together by similarity, resulting in an atlas that shows well-studied areas (forests) and currently uncharted areas (water).
"""
)

# Load the retrieval system and corpus once, and report how long it took.
start = time.time()
st.markdown('Loading data for retrieval system, please wait before jumping to one of the pages...')
st.session_state.retrieval_system = EmbeddingRetrievalSystem()
st.session_state.dataset = load_dataset('arxiv_corpus/', split="train")
st.markdown(f'Loaded retrieval system, time taken: {time.time() - start:.1f} sec')