really awesome model – how did you generate the descriptions?

#1
by julien-c - opened

What model did you use to generate the spaces descriptions? πŸ”₯

Gpt-3.5/4 and Palm 2 πŸ™‚

and what did you use as input/prompt? the app.py content, or the Webpage URL? we can discuss privately if you prefer not to share the secret sauce πŸ˜ƒ

cc @victor who was super interested too

No secrets πŸ˜„

  1. Parsed all spaces using HfApi list_spaces().
  2. Downloaded their README.md files and extracted the main app file's name (app_file).
  3. Selected only those spaces that have 40+ lines of code and 1000+ characters in the main app file.
  4. Generated descriptions by feeding the contents of the app file into chat LLMs with the following system message:
    Write a short, one-sentence description of the provided app's purpose. Example descriptions: "Remove background from images.", "Generate captions for images using ViT and GPT2.", "Predict the nutritional value of food based on an image of the food."
  5. Created embeddings of the resulting descriptions using the all-MiniLM-L6-v2 SentenceTransformer model.
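The selection filter in step 3 is simple enough to sketch. This is a hypothetical helper, not the author's actual script; the surrounding pipeline (`list_spaces()`, downloading the README, the LLM call, the SentenceTransformer embedding) is only outlined in the comments:

```python
def qualifies(app_source: str, min_lines: int = 40, min_chars: int = 1000) -> bool:
    """Keep only spaces whose main app file has 40+ lines and 1000+ characters,
    so the LLM has enough code to infer a meaningful description."""
    return (
        len(app_source.splitlines()) >= min_lines
        and len(app_source) >= min_chars
    )

# Rough outline of the rest of the pipeline (pseudocode):
#   for space in HfApi().list_spaces():
#       read app_file from the space's README.md metadata
#       download the app file; skip unless qualifies(source)
#       description = chat_llm(system_message, source)
#       embedding = sentence_transformer.encode(description)
```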

The descriptions and embeddings are open in the dataset. Feel free to use it. πŸ™‚

thanks a lot this is super helpful! cc @osanseviero too

Super interesting!

Another option could be to generate descriptions from the description, title, and article attributes when the Space is a Gradio demo. But this seems to be working super well and looks great!

nice use of all-MiniLM-L6-v2, @Xenova. Maybe we can run an experiment loading/searching the embeddings all via JS in the browser. What's the best way to load all these embeddings into memory?

@radames That would be awesome! :o Probably the easiest is to literally store the embeddings as a concatenated list of Float32Arrays, saved in a file as a large byte array (i.e., ArrayBuffer or Uint8Array). Compression might help, but depending on the number of embeddings, it might not be necessary.
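The "concatenated Float32Arrays in one byte buffer" idea can be sketched server-side with the stdlib `array` module (a sketch with hypothetical helper names; on the JS side the same bytes would be read back as a single `Float32Array`):

```python
import array

def embeddings_to_bytes(embeddings: list[list[float]]) -> bytes:
    """Flatten rows into one contiguous float32 byte buffer (4 bytes/dim)."""
    flat = array.array("f")  # "f" = 32-bit float, matching JS Float32Array
    for row in embeddings:
        flat.extend(row)
    return flat.tobytes()

def bytes_to_embeddings(buf: bytes, dim: int) -> list[list[float]]:
    """Inverse: slice the flat buffer back into dim-sized rows."""
    flat = array.array("f")
    flat.frombytes(buf)
    return [list(flat[i:i + dim]) for i in range(0, len(flat), dim)]
```

Because the layout is just raw little-endian float32s, no parsing is needed in the browser: `new Float32Array(arrayBuffer)` views the same bytes directly.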

Let me do some rough calculations for the full dataset:
30 000 items * 384 dimensions * 4 bytes per dimension = 46 MB ... almost as large as the model itself!

But I don't even think we need all 30k... we could probably just choose the top 1000-5000 spaces, and store them in 8-bit or 16-bit precision.

So, we are looking at an uncompressed file between 0.4MB and 3.8MB.

The biggest performance bottleneck would be computing similarity across that many rows, but since it's literally just cosine similarity (or a dot product, if the embeddings are normalized), we can use web workers for parallelism.
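The search itself is a few lines. A pure-Python sketch (hypothetical helpers; in the browser the same loop would run over the `Float32Array`, and if the vectors are pre-normalized, `cosine` reduces to the plain dot product):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query: list[float], rows: list[list[float]], k: int = 5) -> list[int]:
    """Indices of the k rows most similar to the query."""
    order = sorted(range(len(rows)),
                   key=lambda i: cosine(query, rows[i]),
                   reverse=True)
    return order[:k]
```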

Anyway, looking forward to brainstorming! I think this can be a super cool demo.

Hi @anzorq – i am not sure if you have seen this but we now have the ability to set a short_description for Spaces:

[Screenshot (2024-03-05): the short_description field in Space settings]

I was thinking it'd be very neat to generate descriptions using an LLM and open PRs for Space authors to merge/tweak them.

Then we could provide your semantic search in a built-in way.

WDYT?

i played with a quick Space to experiment with prompts btw: https://huggingface.co/spaces/julien-c/description-generator


Hi @julien-c , that sounds awesome!
