2023 version?

#2
by narugo - opened

do we have a 2023 version of this index?

or, if we want to make a 2023 one, any suggestions?

thank you :)

I'm not planning on an update right now.

If you wanted to do it locally, you would have to use one of the models to output the intermediate features (I'd wager this would be easier to do using TIMM), then use FAISS to generate the index.
If you decide to use AutoFAISS I'd recommend shuffling the embeddings (taking care to keep embeddings and image IDs in the same order of course) so that it trains the dimensionality reduction part of the index on images uniformly distributed along the years. In my first attempt I kept the embeddings sorted by image ID and the index was good on old images and remarkably worse on newer images.
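
A minimal sketch of that shuffling step, assuming the embeddings and image IDs are already saved as aligned NumPy arrays (the file names here are placeholders) and that AutoFAISS's build_index handles training and writing the index:

import numpy as np
from autofaiss import build_index

# Placeholder file names; embeddings is (N, dim) and image_ids is (N,), aligned row by row.
embeddings = np.load("embeddings.npy")
image_ids = np.load("image_ids.npy")

# Shuffle both arrays with the same permutation so row i still corresponds to image_ids[i].
rng = np.random.default_rng(42)
perm = rng.permutation(len(embeddings))
embeddings, image_ids = embeddings[perm], image_ids[perm]
np.save("image_ids_shuffled.npy", image_ids)

# L2-normalize and use inner product so the index behaves as cosine similarity.
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
build_index(
    embeddings=embeddings.astype(np.float32),
    index_path="cosine_knn.index",
    index_infos_path="cosine_infos.json",
    metric_type="ip",
)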

thank you, that helped us a lot. now we can run this index code in script form and do plenty of searches.

one more question: if I need to build an index for more images, which models are recommended as feature extractors? I see your wd14 v3 taggers, and the swin model works best for tagging. any suggestions on this?

update: by the way, I read the source code of the vit/swinv2/convnext models. I see you extracted the output of the predictions_norm node from the convnext model, but in the other 2 models there is a GAP layer after predictions_norm, so which output should I use for the swinv2/vit models?

update2: I visualized the v3 taggers, and the structure is quite different: no predictions_norm found, but something like head and fc instead. In the vit model, /core_model/head is a fc layer, and in the swin/convnext models, /core_model/head/fc is the fc layer. So can I treat the input of the fc layer as the suitable embedding for index building? Or any other suggestions?

My gut reaction would be to use the SwinV2 (in fact, I'm doing just that for some SemDeDup experiments right now), but you may want to consider throughput and VRAM consumption, in which case either ViT or ConvNext might be a better fit.

I generally use whatever goes into the final linear layer as embeddings, yeah.
That usually means either GAP or LayerNorm output, depending on whichever comes last.
I'm not 100% sure what the literature says or other practitioners do in this case, it's one of those things that worked well enough for me without digging too deep.

In the JAX codebase, which is the canonical one for the v3 models, you'd have to initialize the models with num_classes = 0 to get the embeddings.
I think either TIMM or PyTorch make the process easier though, so you may want to look into that. Ease of use is the reason I started releasing TIMM-compatible weights in the first place.
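
A minimal sketch of the TIMM route, assuming the TIMM-compatible v3 weights load via the hf-hub: prefix and that the generic TIMM transform is close enough to the taggers' own preprocessing (check the model card for the exact details):

import timm
import torch
from PIL import Image

# num_classes=0 drops the classifier head, so the model outputs pooled features (the embedding).
model = timm.create_model(
    "hf-hub:SmilingWolf/wd-swinv2-tagger-v3",
    pretrained=True,
    num_classes=0,
).eval()

# Generic preprocessing derived from the model's pretrained config.
config = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**config)

image = Image.open("/path/to/your/image.png").convert("RGB")
with torch.no_grad():
    embedding = model(transform(image).unsqueeze(0))
print(embedding.shape)  # (1, embedding_dim)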

You can use the PoC tagger wdv3-jax as a starting point for image loading, model init and whatnot.
Keep in mind it is not made for batch extraction; that would take some more fiddling, but nothing too complicated.

Actually tell you what, here: https://we.tl/t-x8tKkdMnwF
This is the piece of crap contraption I made for this very purpose. Careful, the edges are rusty and dirty, you might get tetanus using it.
If you polish it feel free to upload it on github and post a link for everyone else.

thank you for the reply, it really helped me a lot.

actually I have integrated the wd taggers into my tools, which have been published to pypi.

in dghs-imgutils>=0.4.5, the embeddings are exportable, like this:

from imgutils.tagging import get_wd14_tags

# fmt='embedding' returns the feature vector instead of the usual tag scores
emb = get_wd14_tags(
    '/path/to/your/image.png',
    fmt='embedding',
    model_name="ConvNext"
)
print(emb.shape)  # (1024, )

and, btw, I used this function to re-implement the search against your danbooru2022 index in script form, like this:

import json
from functools import lru_cache
from typing import Dict

import faiss
import numpy as np
from huggingface_hub import hf_hub_download

from imgutils.data import ImageTyping
from imgutils.tagging import get_wd14_tags


@lru_cache()
def _load_index():
    # Load the prebuilt FAISS index from the image similarity space.
    knn_index = faiss.read_index(hf_hub_download(
        repo_id='SmilingWolf/danbooru2022_image_similarity',
        repo_type='space',
        filename='index/cosine_knn.index',
    ))
    # Apply the search-time parameters the index was tuned with.
    with open(hf_hub_download(
        repo_id='SmilingWolf/danbooru2022_image_similarity',
        repo_type='space',
        filename='index/cosine_infos.json',
    )) as f:
        config = json.load(f)["index_param"]
    faiss.ParameterSpace().set_index_parameters(knn_index, config)
    # Row i of the index corresponds to image_ids[i].
    image_ids = np.load(hf_hub_download(
        repo_id='SmilingWolf/danbooru2022_image_similarity',
        repo_type='space',
        filename='index/cosine_ids.npy',
    ))
    return knn_index, image_ids


def get_nearest_images_ids(image: ImageTyping, n_neighbours: int = 20) -> Dict[int, float]:
    # Extract the embedding and L2-normalize it, since the index uses cosine similarity.
    target = get_wd14_tags(image, fmt='embedding', model_name="ConvNext")[None, ...]
    target = (target / np.linalg.norm(target)).astype(np.float32)

    knn_index, image_ids = _load_index()
    dists, indexes = knn_index.search(target, k=n_neighbours)
    neighbour_ids = [int(x) for x in image_ids[indexes][0]]
    return dict(zip(neighbour_ids, dists[0].tolist()))


print(get_nearest_images_ids(
    '/path/to/your/image.png',
    n_neighbours=5000
))

as I tested, the results differ a little from your space (mainly because of small differences in image preprocessing), but it's good enough for searching images.

btw2, I modified the onnx models you uploaded (including all v2/v3 models) to expose the embeddings as a second output, and uploaded them here: https://huggingface.co/deepghs/wd14_tagger_with_embeddings
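
For anyone who wants to use those modified models directly, a minimal sketch with onnxruntime; the file path inside the repo and the dummy input are assumptions, so check the repo card for the real file names and preprocessing:

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# Hypothetical file name inside the repo; see the repo card for the actual layout.
model_path = hf_hub_download(
    repo_id='deepghs/wd14_tagger_with_embeddings',
    filename='ConvNext/model.onnx',
)
session = ort.InferenceSession(model_path)
input_name = session.get_inputs()[0].name

# Dummy batch just to show the two outputs; real inputs need the tagger's own preprocessing.
batch = np.zeros((1, 448, 448, 3), dtype=np.float32)
preds, embeddings = session.run(None, {input_name: batch})
print(preds.shape, embeddings.shape)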

and, one more question about indexing.

I just computed the pseudo-inverse of the fc layers in all wd14 taggers. What I can do now is reverse the prediction result back into the embedding (inverse sigmoid, then apply the pinv of the fc matrix; quite simple, code here). As I tested, the cosine similarity between the inverted and actual embeddings is approx 0.997.
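
A minimal numpy sketch of that inversion, assuming W (num_tags, emb_dim) and b (num_tags,) are placeholder names for the weight and bias pulled out of the tagger's final fc layer, and preds are the per-tag sigmoid scores:

import numpy as np

def preds_to_embedding(preds, W, b, eps=1e-6):
    # Undo the sigmoid: logits = log(p / (1 - p)), clipping to avoid infinities at 0 and 1.
    p = np.clip(preds, eps, 1.0 - eps)
    logits = np.log(p / (1.0 - p))
    # The head computed logits = W @ emb + b, so recover emb with the pseudo-inverse of W.
    return np.linalg.pinv(W) @ (logits - b)

With real model scores this recovers the embedding almost exactly; with hard 0/1 scores the clipped logits land far from the true ones, which would be consistent with the ~0.5 similarity mentioned below.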

Then I tried to make something like "creating embeddings from simple tags": just set 1.0 as the pred score for the given tags and 0.0 for the others. When I use that to search the index, the result is poor.

After this, I took some images and got their prediction results and embeddings, and I found the actual scores in the prediction result are extremely important: when I use the real scores to invert the embedding, the sim is approx 0.997, but when I just set all the scores for positive tags to 1.0 and the rest to 0.0, the sim is approx 0.5. T_T

In a word, I failed at inverting tags to embeddings with this approach. What I need is to search images by tags from indices like yours. The embedding space supports better fuzzy search, it runs much faster than a SQL select, and the index files are much smaller than a sqlite database. That's why I chose to use indices.

So the goal is to bridge the search tags to embeddings (what I have in mind is pos tags + neg tags, e.g. pos: ['1girl', 'surtr_(arknights)'], neg: ['blue_hair']; the expected search results should be close to 1girl and surtr_(arknights) but not close to blue_hair, and the other tags don't matter) to make it easier to search from the indices. Any ideas or suggestions?

Yeah I tested a couple things but it doesn't look like that's possible.

So what do we do? We reinvent CLIP and make an MLP adapter to try and reverse it. Basically, we throw more black box stuff at the problem.
I've got a PoC "pipeline" (emphasis on PoC) to fetch images, predictions, embeddings and train such an MLP and use it to reverse the one hot encoded preds.
I'm using the new features in imgutils to generate the embeddings :)

Like in CLIP the objective is to generate an embedding whose cosine distance from the one generated from the image is as low as possible.
One possible augmentation might involve randomly dropping tags during training.
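
A minimal sketch of that kind of adapter and training objective, written in PyTorch purely for illustration rather than mirroring the actual PoC; the layer sizes, hidden width and dropout rate here are made up:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TagsToEmbedding(nn.Module):
    # Maps a multi-hot tag vector to a vector the size of the image embedding.
    def __init__(self, num_tags: int, emb_dim: int = 1024, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_tags, hidden),
            nn.GELU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, tags):
        return self.net(tags)

def train_step(model, optimizer, tag_vecs, image_embs, drop_p=0.2):
    # Augmentation: randomly drop tags so the model learns from partial tag sets.
    keep = (torch.rand_like(tag_vecs) > drop_p).float()
    pred = model(tag_vecs * keep)
    # CLIP-like objective: minimize cosine distance to the image embedding.
    loss = 1.0 - F.cosine_similarity(pred, image_embs, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()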

Currently experimenting with how much I can push it. I'm not really hopeful, but who knows.

(I know there's already pre-made data in the deepghs/wd14_tagger_inversion repo, but I wanted to write something from scratch again :P)

We reinvent CLIP and make an MLP adapter to try and reverse it. Basically, we throw more black box stuff at the problem.

Indeed, this requirement is quite similar to clip, but still different: clip takes unstructured natural language tokens as input, while ours takes structured tags. The structured input is still useful in many respects; I don't believe clip can be strictly better than it.

Like in CLIP the objective is to generate an embedding whose cosine distance from the one generated from the image is as low as possible.

I'm considering another model: the input contains positive tags and negative tags, and the output should be an embedding that reflects the positive tags and, ideally, not the negative tags. I think this could be used for searching images from the index, like danbooru.

Here are two options for this approach:

  1. Input pos/neg tags and output a single "midpoint" embedding of an image that fits the requirement, then use this midpoint directly for retrieval.
  2. Input pos/neg tags to generate embeddings of images that fit the requirement. During querying, generate a batch of such embeddings and calculate their midpoint, retrieve based on the midpoint, and then pick the results with the highest average similarity to all the generated embeddings.

Option 1 is simpler to implement (a rough sketch of its query flow is below) but carries some risk, since samples close to the midpoint embedding might not all closely match the pos/neg tags. Option 2 is closer to a generative model but suffers from lower computational efficiency.
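
A minimal sketch of the option 1 query flow, assuming a hypothetical tags_to_embedding(pos_tags, neg_tags) model (e.g. an adapter like the one sketched earlier) that returns a single numpy vector, and reusing the index loaded in the earlier snippet:

import numpy as np

def search_by_tags(knn_index, image_ids, tags_to_embedding, pos_tags, neg_tags, k=20):
    # tags_to_embedding is the hypothetical pos/neg-aware model from option 1,
    # returning one embedding with the same dimension as the index vectors.
    query = tags_to_embedding(pos_tags, neg_tags)
    query = (query / np.linalg.norm(query)).astype(np.float32)[None, :]
    dists, indexes = knn_index.search(query, k=k)
    return {int(image_ids[i]): float(d) for i, d in zip(indexes[0], dists[0])}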

To be honest, I'm not entirely sure which option is more appropriate. Just try them, and I will also try some of them. The envisioned usage pattern resembles an offline danbooru with limited fuzzy querying support. It should be more efficient than querying a sqlite database, and the index file should be smaller than a sqlite database. Maybe this is just daydreaming? I don't know. But if it can be realized, it will be very, very useful. lol

Soooo, try this one: SmilingWolf/danbooru2022_text_to_image

It's not 100% reliable, but the examples should show off a few of the capabilities.

  • First example: it fetches images of a character that this version of ConvNext (trained on Dataset v2) was not supposed to know. That is, the images existed in the dataset, but the tag didn't. It does exist in Dataset V3 though. This is a consequence of how the tags-to-embeddings network was trained.
  • The second example loads images of Power. I very much like Power from CSM. That is all. It also shows that a couple of core tags can be enough to find a whole bunch of images of a character, but that's not really important.
  • Third example: it tries to steer the embeddings from Saber towards Saber Alter using negative tags. Crafting this example was much more challenging than I thought; Saber really appears in a broad range of shapes and forms.

A couple of things to note:

  • I'm developing all of this in JAX; once I'm satisfied, it would be possible to port the weights to ONNX or whatever, since the network structure is very, very simple right now.
  • it uses the same index from the Image Similarity space. No changes necessary on that front.
  • it should be interrogated using the tags from Dataset V3 (ie. using the selected_tags.csv from the -v3 repos).
  • it makes no effort to accommodate different prompting formats or incomplete tags. You either follow the format in the examples to a T (that also means no spaces after the commas) or it will silently fail.

I also realize the name might be a bit misleading, since most people think of text-to-image as an image generator rather than a tool for retrieval. I'm open to suggestions on how to properly call it.

Moved the space to SmilingWolf/danbooru2022_embeddings_playground.

Added support for images; now it is at least theoretically possible to use all modalities to guide retrieval.
In practice... I dunno, it's kinda neat?

In example 4 I use a base image with a single character and add tags (both positive and negative) to retrieve images of Quetzalcoatl together with Gudako.
In example 5 I use a positive image and a negative image to remove Gudako from the scene, which correctly finds images with Musashi alone. The general idea was that it would remove the character in the intersection between the two images and surprisingly enough, it worked on the first try.
