really awesome model – how did you generate the descriptions?

#1
by julien-c - opened

What model did you use to generate the spaces descriptions? πŸ”₯

Gpt-3.5/4 and Palm 2 πŸ™‚

and what did you use as input/prompt? the app.py content, or the Webpage URL? we can discuss privately if you prefer not to share the secret sauce πŸ˜ƒ

cc @victor who was super interested too

No secrets πŸ˜„

  1. Parsed all spaces using HfApi list_spaces().
  2. Downloaded their README.md files and extracted the main app file's name (app_file).
  3. Selected only those spaces that have 40+ lines of code and 1000+ characters in the main app file.
  4. Generated descriptions by feeding the contents of the app file into chat LLMs with the following system message:
    Write a short, one-sentence description of the provided app's purpose. Example descriptions: "Remove background from images.", "Generate captions for images using ViT and GPT2.", "Predict the nutritional value of food based on an image of the food."
  5. Created embeddings of the resulting descriptions using the all-MiniLM-L6-v2 SentenceTransformer model.
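The selection filter in step 3 is simple enough to sketch. This is a hypothetical helper, not the author's actual script; the surrounding pipeline (`list_spaces()`, downloading the README, the LLM call, the SentenceTransformer embedding) is only outlined in the comments:

```python
def qualifies(app_source: str, min_lines: int = 40, min_chars: int = 1000) -> bool:
    """Keep only spaces whose main app file has 40+ lines and 1000+ characters,
    so the LLM has enough code to infer a meaningful description."""
    return (
        len(app_source.splitlines()) >= min_lines
        and len(app_source) >= min_chars
    )

# Rough outline of the rest of the pipeline (pseudocode):
#   for space in HfApi().list_spaces():
#       read app_file from the space's README.md metadata
#       download the app file; skip unless qualifies(source)
#       description = chat_llm(system_message, source)
#       embedding = sentence_transformer.encode(description)
```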

The descriptions and embeddings are open in the dataset. Feel free to use it. πŸ™‚

thanks a lot this is super helpful! cc @osanseviero too

Super interesting!

Another option could be to generate descriptions from the description, title, and article attributes when the Space is a Gradio demo. But this seems to be working super well and looks great!

nice use of all-MiniLM-L6-v2, @Xenova. Maybe we can run an experiment loading/searching the embeddings all via JS in the browser. What's the best way to load all these embeddings into memory?

@radames That would be awesome! :o Probably the easiest is to literally store the embeddings as a concatenated list of Float32Arrays, saved in a file as a large byte array (i.e., ArrayBuffer or Uint8Array). Compression might help, but depending on the number of embeddings, it might not be necessary.
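The "concatenated Float32Arrays in one byte buffer" idea can be sketched server-side with the stdlib `array` module (a sketch with hypothetical helper names; on the JS side the same bytes would be read back as a single `Float32Array`):

```python
import array

def embeddings_to_bytes(embeddings: list[list[float]]) -> bytes:
    """Flatten rows into one contiguous float32 byte buffer (4 bytes/dim)."""
    flat = array.array("f")  # "f" = 32-bit float, matching JS Float32Array
    for row in embeddings:
        flat.extend(row)
    return flat.tobytes()

def bytes_to_embeddings(buf: bytes, dim: int) -> list[list[float]]:
    """Inverse: slice the flat buffer back into dim-sized rows."""
    flat = array.array("f")
    flat.frombytes(buf)
    return [list(flat[i:i + dim]) for i in range(0, len(flat), dim)]
```

Because the layout is just raw little-endian float32s, no parsing is needed in the browser: `new Float32Array(arrayBuffer)` views the same bytes directly.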

Let me do some rough calculations for the full dataset:
30 000 items * 384 dimensions * 4 bytes per dimension = 46 MB ... almost as large as the model itself!

But I don't even think we need all 30k... we could probably just choose the top 1000-5000 spaces, and store them in 8-bit or 16-bit precision.

So, we are looking at an uncompressed file between 0.4MB and 3.8MB.

The biggest performance bottleneck would be computing similarity across that many rows, but since it's literally just cosine similarity (or a dot product, if the embeddings are normalized), we can use web workers for parallelism.
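The search itself is a few lines. A pure-Python sketch (hypothetical helpers; in the browser the same loop would run over the `Float32Array`, and if the vectors are pre-normalized, `cosine` reduces to the plain dot product):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query: list[float], rows: list[list[float]], k: int = 5) -> list[int]:
    """Indices of the k rows most similar to the query."""
    order = sorted(range(len(rows)),
                   key=lambda i: cosine(query, rows[i]),
                   reverse=True)
    return order[:k]
```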

Anyway, looking forward to brainstorming! I think this can be a super cool demo.

Hi @anzorq – i am not sure if you have seen this but we now have the ability to set a short_description for Spaces:

[Screenshot (2024-03-05): the short_description field in Space settings]

I was thinking it'd be very neat to generate descriptions using an LLM and open PRs for Space authors to merge/tweak them.

Then we could provide your semantic search in a built-in way.

WDYT?

i played with a quick Space to experiment with prompts btw: https://huggingface.co/spaces/julien-c/description-generator


Hi @julien-c , that sounds awesome!
