import streamlit as st
import time
from utils import valid_url
from model import get_tidy_tab_t5, predict_model_t5
from model import get_tidy_tab_pegasus, predict_model_pegasus
from model import load_model_bart, predict_model_bart
from loadhtml import get_content
## Site
st.title("Tab Recall Simplified 🚀")
st.markdown("Condense your browser tabs into a few impactful words. Inspired by Arc Max.")
# Models are loaded at the bottom of this file, inside the sidebar status block
# Sidebar
st.sidebar.title("🔬 Descriptive Tabs: Try the Models!")
st.sidebar.caption("Tidy Tabs - Title")
user_input_url = st.sidebar.text_input('Enter your URL:')
error_message_url = None
button_clicked = st.sidebar.button("Rename the tab")
def load_tab():
    global error_message_url
    if user_input_url:
        # Clear the previous error message, if any
        if error_message_url:
            error_message_url.empty()
            error_message_url = None
        is_url_valid, url = valid_url(user_input_url)
        if is_url_valid:
            text, title = get_content(url)
            if text == "":
                st.sidebar.error("Could not extract any text from this URL.")
            else:
                with st.spinner('Wait for it...'):
                    st.sidebar.write(f'**<title>:** {title}')
                    time.sleep(1)
                with st.spinner('Wait for it...'):
                    st.sidebar.write(f'**T5-small:** {predict_model_t5(text)}')
                with st.spinner('Wait for it...'):
                    st.sidebar.write(f'**Pegasus-xsum:** {predict_model_pegasus(text)}')
                with st.spinner('Wait for it...'):
                    st.sidebar.write(f'**Bart-large-cnn:** {predict_model_bart(text)}')
        else:
            error_message_url = st.sidebar.error('This is not a valid URL. Please enter a valid URL.')

if button_clicked:
    load_tab()
st.sidebar.divider()
###
# Content
###
st.image('./assets/banner_tabs.png', width=350, caption='Navigate Through Powerful Features with Intuitive Tabs')
st.info("All three models are deployed in a single Hugging Face Space using the free tier. Specifications: CPU-based (no GPU), 2 vCPU cores, 16 GB RAM, and 50 GB storage.", icon="ℹ️")
###
# Examples
###
st.markdown("""
Here are some examples you can try that aren't included in the training or test datasets.
# How to Test with URL Examples
On the left side, you can test the models with the example URLs below or with any URL you like. These examples are not part of the dataset. Note: websites that require a JavaScript engine to render content will not work correctly (see the sketch after the URL list).
```
URLs:
High Similarity to Training Data:
https://www.nytimes.com/2007/01/10/technology/10apple.html
https://www.nytimes.com/2021/04/15/arts/design/Met-museum-roof-garden-da-corte.html
https://github.com/torvalds
Less than 2% Overlap with Training Data:
https://substack.com/browse/staff-picks/post/145699191
https://brentcates.substack.com/p/julian-assange-is-now-free-to-collapse
Moderate Similarity to Training Data:
https://techcrunch.com/2024/07/05/openai-breach-is-a-reminder-that-ai-companies-are-treasure-troves-for-hackers/
https://www.forbes.com/sites/davidphelan/2024/07/09/apple-iphone-16-pro-major-design-upgrade-coming-new-report-claims/
https://www.crn.com/news/channel-programs/18828789/microsoft-to-release-windows-xp-service-pack-1
https://www.rickbayless.com/recipe/pastor-style-tacos/
No Similarity to Training Data:
https://www.notioneverything.com/blog/notion-note-taking-templates
https://www.eluniverso.com/noticias/ecuador/quito-prohibido-circular-dos-personas-moto-seguridad-nota/
https://www.swift.org/blog/swift-on-windows/
https://arc.net/max
Some websites, like x.com or instagram.com, are not accessible because they use JavaScript engines to load content, which is beyond the scope of this project.
Feel free to try any URL 🧪🌐
```
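As context for the JavaScript caveat: a fetch along the following lines only sees server-rendered HTML, so script-driven pages come back nearly empty. This is a simplified sketch, not the actual `get_content` from `loadhtml`.
```
# Simplified static-HTML extraction; pages rendered by JavaScript
# (e.g., x.com) return little or no text this way.
import requests
from bs4 import BeautifulSoup

def fetch_static_text(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return text, title
```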
# The Dataset
The project's creator collected the dataset from various sources on the internet.
The dataset includes:
| Feature | Description |
|----------------------------|-----------------------------------------------------------------------------------------------|
| URL | URL of the webpage |
| title | Extracted from the HTML `<title>` |
| description | Extracted from `<meta name="description" content="description">` |
| paragraphs | Extracted from `<p>` tags |
| headings | Extracted from `<h1>`, `<h2>`, `<h3>` tags |
| combined text              | Formatted as `[title] title \\n [description] description`                                    |
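As a rough illustration, a `combined text` entry can be assembled like this (a hypothetical helper; the real collection scripts are not part of this Space):
```
# Hypothetical sketch of building the "combined text" field
def build_combined_text(title, description):
    return f"[title] {title} \\n [description] {description}"
```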
The dataset primarily comprises data gathered from nytimes.com and GitHub.com, supplemented by roughly 60 other websites with diverse content. From GitHub, 1,226 summaries were generated programmatically in the format `Username GitHub Profile`, to probe the models' ability to produce patterns containing new words. For The New York Times, 1,056 pages were summarized from their text content using Anthropic's Claude 3.5 Sonnet with the prompt below.
## Prompt for labels Generation
```
Claude 3.5 Sonnet Prompt

I'm going to share with you a CSV file with one column. I want you to create a summary of 1 to 3 words maximum of the text. The text could have HTML tags. The title is the title of the page, and the description is the page's description.

Give me the result like this:

summary 1
summary 2
...
summary n

Only plain text and no additional instructions.
```
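A labeling run along these lines could automate the process. This is a hedged sketch with the public `anthropic` SDK; the model snapshot name, prompt wording, and sample rows are assumptions, not the exact script that was used.
```
# Sketch of programmatic label generation with the anthropic SDK
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

csv_chunk = "Apple unveils new iPhone at keynote\\nLinus Torvalds GitHub profile"  # sample rows
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed snapshot name
    max_tokens=1024,
    messages=[{"role": "user", "content": "Create a 1-3 word summary per line, plain text only:\\n" + csv_chunk}],
)
print(response.content[0].text)
```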
- This small dataset aims to provide an initial assessment of model performance on a pre-trained task limited to concise summaries of 1 to 4 words. Given the inherent complexity of the task, future efforts should focus on building a larger dataset of 50,000 to 500,000 websites to evaluate model capabilities more comprehensively.
- Testing revealed that the description meta tag significantly enhanced result generation. Increasing the dataset size and incorporating contextual data should further improve model performance in larger-scale applications with millions of data points.
- Of the roughly 60 additional websites, only 41 pages come from substack.com, so less than 2% of the dataset contains substack.com content. This is valuable for understanding the impact of small groups of examples.
- P.S. I also tested ChatGPT-4.0, and the results were highly discouraging for a chunk of 100 text field values.
- In the future, the dataset should grow to at least 10,000-15,000 samples, with an improved train/test/validation split methodology.
""", unsafe_allow_html=False, help=None)
st.info("I crafted this dataset with a larger LLM and a few scripts, so no tedious manual labeling was needed. The goal is to eliminate human labeling.", icon="ℹ️")
st.markdown("""
#### Access to the data
`https://huggingface.co/datasets/wgcv/website-title-description`
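For example, assuming the default configuration, it can be loaded with the `datasets` library:
```
# Load the dataset from the Hugging Face Hub
from datasets import load_dataset

ds = load_dataset("wgcv/website-title-description")
print(ds)
```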
# Models
The objective of this project was to show that a small ML model, fine-tuned with labels from a larger LLM, can achieve results as good as or better than the original LLM on a specific task.
Training a model from scratch was deemed impractical given the volume of data it would require. Instead, the approach focused on evaluating existing pre-trained models as a baseline. This served as an optimal starting point for developing a custom, lightweight model tailored to the specific use case: enhancing browser tab organization and summarizing the core concept of favorited websites efficiently.
### T5-small
- The [T5-small](https://huggingface.co/wgcv/tidy-tab-model-t5-small) model is a fine-tuned version of google-t5/t5-small.
- It's a text-to-text model: one general model for many NLP tasks, with the task defined by the input format.
- To perform summarization, prefix the input with `summarize:` (see the sketch below).
- 60.5M parameters.
- Disclaimer: the model was retrained once after poor inference quality was observed.
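A minimal inference sketch using the standard `transformers` seq2seq API (not this Space's `model.py` helpers; the sample input is made up):
```
# Summarize with the fine-tuned T5-small checkpoint
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("wgcv/tidy-tab-model-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("wgcv/tidy-tab-model-t5-small")

page_text = "[title] Pastor-Style Tacos [description] Classic tacos al pastor recipe."
inputs = tokenizer("summarize: " + page_text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```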
### Pegasus-xsum
- The [Pegasus-xsum](https://huggingface.co/wgcv/tidy-tab-model-pegasus-xsum) model is a fine-tuned version of google/pegasus-xsum.
- It's a text-to-text model specialized for summarization.
- 570M parameters.
### Bart-large
- The [Bart-large](https://huggingface.co/wgcv/tidy-tab-model-bart-large-cnn) model is a fine-tuned version of facebook/bart-large-cnn.
- Before our fine-tuning, it had already been fine-tuned on the CNN/Daily Mail dataset.
- It's a BART model, using a transformer encoder-decoder (seq2seq) architecture.
- BART models typically perform better on small datasets than text-to-text models.
- 406M parameters (see the usage sketch after this list).
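All three checkpoints expose the same seq2seq interface, so a one-line `pipeline` call is enough to try any of them. Again a sketch, not this Space's `model.py`:
```
# Compare checkpoints via the summarization pipeline
from transformers import pipeline

summarizer = pipeline("summarization", model="wgcv/tidy-tab-model-bart-large-cnn")
text = "[title] Swift on Windows [description] The Swift project now supports building on Windows."
print(summarizer(text, max_length=8, min_length=1)[0]["summary_text"])
```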
### Potential avenues for performance enhancement
- Data preprocessing optimization
- Dataset expansion
- Comprehensive hyperparameter tuning
- Adding more languages to the dataset

These strategies could significantly improve model efficacy.
### Access to the Models
`https://huggingface.co/wgcv/tidy-tab-model-t5-small`
`https://huggingface.co/wgcv/tidy-tab-model-pegasus-xsum`
`https://huggingface.co/wgcv/tidy-tab-model-bart-large-cnn`
## co2_eq_emissions
- emissions: 0.16 grams of CO2eq
- source: mlco2.github.io
- training_type: fine-tuning
- geographical_location: U.S.
- hardware_used: 1x T4 GPU
""", unsafe_allow_html=False, help=None)
with st.sidebar.status("Loading models...", expanded=True, state="complete") as models:
    st.write("Loading 1/3... (https://huggingface.co/wgcv/tidy-tab-model-t5-small)")
    get_tidy_tab_t5()
    st.write("Loaded T5-small")
    st.write("Loading 2/3... (https://huggingface.co/wgcv/tidy-tab-model-pegasus-xsum)")
    get_tidy_tab_pegasus()
    st.write("Loaded Pegasus-xsum")
    st.write("Loading 3/3... (https://huggingface.co/wgcv/tidy-tab-model-bart-large-cnn)")
    load_model_bart()
    st.write("Loaded Bart-large")
    models.update(label="All models loaded!", state="complete", expanded=False)