Clustering

#4
by ReggieMontoya - opened

There are thousands of artists in SDXL, and that number might be even higher in other models (both released and soon to come). Rather than knowing which artist you want, or trying to find a random one that works, how about clustering them together based on non-biased image analysis?

I am running a pilot on some of them I have already tested for SDXL awareness. If that goes well, I plan on doing a cladogram of base model artists with minimal prompt influence (no prompt, "a woman", "a man", "a scene").

Any interest in incorporating the results into your dataset? Happy to share the workflow in Python and ComfyUI.

For sure, I'm interested. Do you mean that you know of an image analysis technique that can determine relationships between artists based on their degree of similarity? I've never heard of anything like that.

Yeah, check out this article:
https://towardsdatascience.com/how-to-cluster-images-based-on-visual-similarity-cd6e7209fe34

I tested it last night and it seemed to work pretty well. There is a lot of tweaking to be done, for sure. I am currently running the 1070 artists you had in your database last time I pulled it, generating a collage of 4 tiled images for use in the clustering algorithm. Once that's done, I will run them all through together with both k-means and hierarchical clustering. That should generate a "family tree" of artists, e.g., the oil painters all clumping together, the Japanese woodblock artists, anime, etc.
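To make the "family tree" idea concrete, here is a minimal sketch of hierarchical (agglomerative) clustering in plain Python. It is not the workflow from the linked article: the average-linkage rule, the Euclidean distance, the artist labels, and the 2-D toy vectors are all assumptions chosen for illustration. Real feature vectors would come from an image model.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(features, n_clusters):
    """Average-linkage agglomerative clustering.

    features: dict mapping a name to its feature vector.
    Returns (clusters, merges). `merges` is the merge history, which
    is the family tree read from the leaves upward.
    """
    clusters = [[name] for name in features]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Average distance over all cross-cluster pairs.
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            d = sum(euclidean(features[a], features[b]) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters], merges

# Hypothetical toy data: made-up labels and 2-D vectors.
toy = {
    "oil_a": [0.0, 0.1],
    "oil_b": [0.1, 0.0],
    "woodblock_a": [5.0, 5.1],
    "woodblock_b": [5.1, 5.0],
    "anime_a": [10.0, 0.0],
}
clusters, merges = agglomerate(toy, n_clusters=3)
print(sorted(clusters))
```

On this toy data the two "oil" entries and the two "woodblock" entries pair up, and the lone "anime" entry stays on its own. With around a thousand artists, a library implementation would likely be a better fit than this quadratic loop.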

I haven't seen anyone do this before. If it works, it could be hugely helpful as a resource. I'll let my computer burn up running this for a while and see if I can post some prelim results sometime soon.

Oh, and I have some other ideas on how to test the "strength" of an artist's weight in the models, too. I was thinking of generating some LORAs of varying strength that have a very overwhelming style (e.g. a black square on a white background) and seeing how strong the LORA needs to be before it overwhelms the artist's style in the prompt. Maybe put some feelers out on Reddit to see if people have interest and other ideas.

Very cool. The Medium article makes it seem like the only result is that each image is labeled with a cluster. In that case it'll be easy to represent that in this app by just naming each cluster and adding a tag to each artist.

But if the result is a similarity score between images and/or a parent/child hierarchy of some kind, that would be extremely interesting. It'll take a lot of extra code to represent that in this app. But I imagine that an app user would first select an artist, then set a degree of "distance" with a slider, and then see all the artists within that distance of similarity or on the family tree.

I recommend using the images from the SDXL_1_0 folder. Those images were generated from the SDXL 1.0 vanilla model, so they are closest to the real artists' styles. The other 2 models' images are more aesthetic but have much less fidelity. It might also be best to use the portrait images. I would expect they are more likely to result in clusters based on artist style rather than on subject matter (e.g. this cluster has faces, while this had bridges). If it matters for speed/memory, you could probably reduce the images to 128x128 resolution or smaller and still have identifiable image features.

Once I figure out the methodology, then I will start looking into ways to describe it. It could be something as simple as having X number of artist families and randomly showing an image from that family, or as complicated as a browsable tree applet (the code for which surely must exist already, and if not, ChatGPT can help write).

I was thinking of trying to run it on the dataset if I can easily download it. If not, I do have the index images from a few similar sites downloaded.

However, I worry that a single outlier would totally tank the whole thing. Instead, I am trying first on a batch of 378 iterations of the following dynamic prompt:

(art style of {Leonid Afremov|Chiho Aoshima|Anna Dittmann|Pablo Picasso|Victo Ngai|an advertising photograph|Alphonse Mucha}:1.3), a {blue|green|red} {rose|mushroom|glass bottle of liquid|pile of sand|gemstone|plastic cube} {in an empty room|on a table|sitting in grass}

From this, I am generating 4 images. I'll then extract all the features and take the average of the 4 replicates, and cluster on that. That should get closer to the "average" of the style. I'll have to see if I can reconstruct an average image backwards from the imaging features.
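As a quick sanity check on the numbers, the dynamic prompt above can be expanded with a few lines of Python. This is only a sketch of the combinatorics, not the actual ComfyUI dynamic-prompt machinery; the list names and the `expand_prompts` helper are made up for illustration.

```python
from itertools import product

# Option lists copied from the dynamic prompt above.
artists = ["Leonid Afremov", "Chiho Aoshima", "Anna Dittmann", "Pablo Picasso",
           "Victo Ngai", "an advertising photograph", "Alphonse Mucha"]
colors = ["blue", "green", "red"]
subjects = ["rose", "mushroom", "glass bottle of liquid", "pile of sand",
            "gemstone", "plastic cube"]
settings = ["in an empty room", "on a table", "sitting in grass"]

def expand_prompts():
    """Yield every combination of the dynamic prompt's options."""
    for artist, color, subject, setting in product(artists, colors, subjects, settings):
        yield f"(art style of {artist}:1.3), a {color} {subject} {setting}"

prompts = list(expand_prompts())
images_per_prompt = 4

print(len(prompts))                      # 7 * 3 * 6 * 3 = 378
print(len(prompts) * images_per_prompt)  # 378 * 4 = 1512
```

So 378 prompts at 4 images each gives 1512 images in total.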

Hopefully some time for this in the next day or two.
Example training images:
Chiho Aoshima green pile of sand in an empty room_0002.jpg
Chiho Aoshima green pile of sand in an empty room_0004.jpg
Chiho Aoshima green pile of sand in an empty room_0001.jpg
Chiho Aoshima green pile of sand in an empty room_0003.jpg

Chiho Aoshima green plastic cube in an empty room_0001.jpg
Chiho Aoshima green plastic cube in an empty room_0002.jpg
Chiho Aoshima green plastic cube in an empty room_0003.jpg
Chiho Aoshima green plastic cube in an empty room_0004.jpg

Chiho Aoshima green rose in an empty room_0001.jpg
Chiho Aoshima green rose in an empty room_0002.jpg
Chiho Aoshima green rose in an empty room_0003.jpg
Chiho Aoshima green rose in an empty room_0004.jpg

Progress: the clustering algorithm is working on average feature vectors of 4 images. The raw output correctly clusters the subject of the prompt, but that varies more than the style so the styles are often lumped together. This will be useful to know when considering what a raw artist's output is, without prompting on what the contents of the images should be. For example, landscape artists will tend to output landscapes, which will cluster together. Portraits together, abstracts together, etc.

Here are some examples from the above dynamic prompt, applied across all 1512 images:

hierarchy 2 cluster_1.jpg

hierarchy 2 cluster_18.jpg

hierarchy 2 cluster_13.jpg

Next, I will try clustering only on a single subject (varying color) across the small set of artists, then lastly on a single subject and color to isolate the differences in artists' styles.

That was quick. I can tell it largely ignores colors; rather, it's analyzing other aspects of the image. I guess that's both good and bad: bad because color is an important part of an artist's style, but good because it's easy for us to prompt around and also easy for our stupid monkey brains to parse.

Here are 4 clusters of the results for rose. I think it nailed which are similar and which are outliers.
hierarchy 3 cluster_1.jpg
hierarchy 3 cluster_2.jpg
hierarchy 3 cluster_3.jpg
hierarchy 3 cluster_4.jpg

Not sure which is the easiest next step: generating 4 images per artist with a simple portrait prompt and running the whole thing on the average result, or attempting the single-image clustering on the images you have already generated. Both are going to be herculean computational tasks!

EDIT: maybe the easiest way to display this is to have the option to select an artist and get shown a list of every other artist, ordered by their "distance" from the selected one in the similarity matrix, nearest first.

That way, the user can hop from artist to artist and find the islands of similarity themselves without having to do any complex datavis. The list of neighbors can include the images from that artist so you can see the visual similarity get farther and farther away as you scroll down.
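The neighbor list described above boils down to sorting one row of the distance matrix. Here is a minimal sketch, assuming the matrix is available as a dict of dicts; the function name, the single-letter artist labels, and the numbers are hypothetical.

```python
def neighbors_by_distance(distances, pinned):
    """Return (artist, distance) pairs for everyone except `pinned`, nearest first.

    distances: dict of dicts, where distances[a][b] is the distance
    between artists a and b (smaller means more similar).
    """
    others = ((name, d) for name, d in distances[pinned].items() if name != pinned)
    return sorted(others, key=lambda pair: pair[1])

# Hypothetical toy matrix; names and values are made up for illustration.
toy = {
    "A": {"A": 0.0, "B": 12.5, "C": 80.0, "D": 41.0},
    "B": {"A": 12.5, "B": 0.0, "C": 75.0, "D": 30.0},
    "C": {"A": 80.0, "B": 75.0, "C": 0.0, "D": 55.0},
    "D": {"A": 41.0, "B": 30.0, "C": 55.0, "D": 0.0},
}

for name, d in neighbors_by_distance(toy, "A"):
    print(name, d)
```

Hopping from artist to artist is then just calling the same function again with the newly selected name.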

Yeah, I think that representation of the data with the app will work well! That'll be an interesting way to explore, and it won't be too hard to implement. So the output of your computation will be edge weights between every artist and every other artist?

In another discussion thread, someone has suggested ~90 more artist names that they think SDXL knows. It's likely that I'll add many of those to the db at some point.

Right now, I have outputs of cluster memberships (the "correct" cutoff of how many clusters has been a tough nut to crack, unfortunately). I am 2/3 of the way through the training set so far and ran some prelim numbers through the system with promising results! Check this out, noting that sometimes the content of the image is more important than the style. Like in the last one... The hat cluster? I have to look into how I can assess certain "kinds" of features. Again, tough nut to crack.

Still is an amazing way to analyze the data, though! And nobody is doing this yet.

hierarchy 3 cluster_6.jpg

hierarchy 3 cluster_8.jpg

hierarchy 3 cluster_12.jpg

hierarchy 3 cluster_10.jpg

hierarchy 3 cluster_11.jpg

hierarchy 3 cluster_39.jpg

hierarchy 3 cluster_35.jpg

hierarchy 3 cluster_31.jpg

hierarchy 3 cluster_23.jpg

hierarchy 3 cluster_14.jpg

hierarchy 3 cluster_21.jpg

Wow, that's so cool! It's definitely finding stylistic similarities. Seems like color-palette has an influence as well. The two cartoonish clusters, one starting with Allie Brosh and another starting with Alex Toth, seem similar to me. E.g. dark thick lines and flat colors. If they were mixed together, I couldn't guess which belonged to which cluster. I wonder if the difference would be obvious if pointed out.

I added a feature today that sorts artists by similarity based purely on their tags. You pin an artist, then I sort based on each artist's Jaccard similarity coefficient with the pinned artist. This method is crude but useful.

If you end up completing your clustering approach and send me a matrix of scores, I can switch over, and the UI would work the same. Meanwhile, this feature might help you validate your clusters.

The Jaccard score ignores tag semantics, e.g. "dystopian" vs. "utopian" would ideally cause a lower score, but that's far beyond my capabilities.
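For anyone unfamiliar with it, the Jaccard coefficient is the size of the intersection of two tag sets divided by the size of their union. The sketch below is an illustration in Python, not the app's actual code; the tag sets and function names are invented.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity: |A intersect B| / |A union B|, in the range 0..1."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty tag sets
    return len(a & b) / len(a | b)

def rank_by_tags(artist_tags, pinned):
    """Sort all other artists by tag similarity to `pinned`, most similar first."""
    scores = [(name, jaccard(artist_tags[pinned], tags))
              for name, tags in artist_tags.items() if name != pinned]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical tags, made up for illustration.
artist_tags = {
    "X": ["painting", "dark", "dystopian"],
    "Y": ["painting", "dark", "utopian"],
    "Z": ["photography", "bright"],
}

print(rank_by_tags(artist_tags, "X"))
```

Here "X" and "Y" share 2 of 4 distinct tags and score 0.5, even though "dystopian" and "utopian" are opposites. The tags are compared as opaque strings, which is the limitation described above.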

There is value to both... similar tags is a good match in some ways, visual similarity is a good match in a different way. Both are useful.

I'm back! And I'm done!

I took all the SDXL names you have currently and generated 8 images in a batch from seed 47.
All:
Steps: 20, Sampler: DPM++ 2M Karras, CFG scale: 5, Seed: 47, Size: 1024x1024, (refiner not used)

positive:
(artwork in the style of ARTIST:1.5), (an image:1.5)

negative:
edges, borders

Then I downsampled them to 224 px square and ran them all through the VGG16 image feature extraction, took the mean 4096-dimensional space position of each artist, then looked at their distances in that space. The result is a number for every pair of artists to compare.
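The averaging and distance step can be sketched in plain Python as follows. The VGG16 feature extraction itself is not shown, and the post does not say which distance metric was used, so the Euclidean distance here is an assumption. The function names and the 3-D toy vectors (standing in for the 4096-dimensional ones) are likewise made up.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Straight-line distance between two vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def artist_distance_matrix(features_by_artist):
    """Build a pairwise distance matrix from per-image features.

    features_by_artist: dict mapping artist name to a list of feature
    vectors, one per generated image.
    Returns a dict of dicts: matrix[a][b] is the distance between the
    mean feature vectors of artists a and b.
    """
    means = {name: mean_vector(vecs) for name, vecs in features_by_artist.items()}
    return {a: {b: euclidean(means[a], means[b]) for b in means} for a in means}

# Toy 3-D stand-ins for the real features; values are invented.
features = {
    "X": [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],  # mean is [1, 0, 0]
    "Y": [[1.0, 3.0, 0.0], [1.0, 5.0, 0.0]],  # mean is [1, 4, 0]
}
matrix = artist_distance_matrix(features)
print(matrix["X"]["Y"])
```

Each artist collapses to a single point (the mean of their images), so the result is one number per pair of artists.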

I did this 4 times with different layers of the VGG16 model. Layers 1-4 seem to correspond to a gradient from style similarity to content similarity. I am overall partial to layer 2, but I will let you decide. Here are some samples of the top 8 images that are closest to the upper-left image, layer 2. I will give you the data for the first 4 layers and you can decide which you want to add to the database.

Willy Pogany_layer_2_stitched (Custom).jpg
A. J. Casson_layer_2_stitched (Custom).jpg
Aaron Horkey_layer_2_stitched (Custom).jpg
Alexej von Jawlensky_layer_2_stitched (Custom).jpg
Audrey Kawasaki_layer_2_stitched (Custom).jpg
Carne Griffiths_layer_2_stitched (Custom).jpg
Fabian Perez_layer_2_stitched (Custom).jpg
He Jiaying_layer_2_stitched (Custom).jpg
Henry Asencio_layer_2_stitched (Custom).jpg
Herman Brood_layer_2_stitched (Custom).jpg
James Gilleard_layer_2_stitched (Custom).jpg

Hmmmm not sure of the best way to post the data here. Too big for pastebin, and I can't upload a csv to Huggingface. Maybe just paste raw data in a comment?

EDIT: I uploaded them to the files section. You might need to edit the non-standard English placeholder characters in the artist names (only a few of them at the bottom of the artist list).

Cool! Thanks! When you say, "I uploaded them to the files section.", what repository do you mean?

"The result is a number for every pair of artists to compare", do you mean you have like a CSV with "artistA, artistB, similarity score of layer 1, similarity score of layer 2..."?

Based on the images I can see in this thread, layer 2 seems to contain a clear understanding of stylistic similarity. It's interesting because it's often not what I would have picked, but I see the similarities.

I wonder if feeding output images from SD3.5 vs. SDXL into VGG16 will result in very different comparison scores.

> Cool! Thanks! When you say, "I uploaded them to the files section.", what repository do you mean?

It's in the pull requests. I have no idea how Hugging Face works lol. Don't you see the files in a pull request or something? They say they uploaded, they're a "new thread" under community.

> I wonder if feeding output images from SD3.5 vs. SDXL into VGG16 will result in very different comparison scores.

Glad you asked! I have been working on SD3.5L for the last week, putting dozens of hours at 100% GPU into image generation and another dozen or so CPU hours into feature extraction and clustering.

Here are some good examples of layer 2 distance clumps.

Aaron Horkey_layer_2_stitched.jpg

Arshile Gorky_layer_2_stitched.jpg

Atey Ghailan_layer_2_stitched.jpg

Fabian Perez_layer_2_stitched.jpg

Georges Rousse_layer_2_stitched.jpg

Richard Caldicott_layer_2_stitched.jpg

Ursula von Rydingsvard_layer_2_stitched.jpg

Wendy Vecchi_layer_2_stitched.jpg

I didn't think to look in discussions. I've merged those files.

Just looking at the raw CSV, I'm wondering if there's a way to normalize the scores. The number that shows each artist's match to themselves is different for each artist, and they are the smallest numbers. So I'm guessing that smaller means more similar? Do you know if all of the numbers are relative to all others, or relative to only the other numbers for a single artist? It probably doesn't matter for the app, since I'll just sort them, but I'm curious.

Those are all within negligible error. I think it had to do with the precision of the model or something.

They are on the order of 10^-5 while the real distance is 10^2 or so. So that's a 10-million-fold difference, which won't affect the ranking at all.

You can arbitrarily set them to 0 if you want, but it won't matter because once a user pins an artist, they'll always be first and everyone else will follow in ascending order of distance. All you have to do is choose which layer dataset gives you the most coherent answer. Since I didn't use prompting much (no portrait vs landscape, etc) and only used base model, the layers might sync up better or worse for the other models on your site.
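To illustrate why the tiny self-distances are harmless, here is a small sketch that zeroes the diagonal and checks that the ranking for a pinned artist does not change. The matrix values and artist labels are invented, picked only to match the rough magnitudes mentioned above (around 10^-5 on the diagonal, around 10^2 elsewhere).

```python
def zero_self_distances(distances):
    """Return a copy of the matrix with every artist's self-distance set to 0."""
    return {a: {b: (0.0 if a == b else d) for b, d in row.items()}
            for a, row in distances.items()}

def ranking(distances, pinned):
    """Artist names in ascending order of distance from `pinned`."""
    row = distances[pinned]
    return [name for name, _ in sorted(row.items(), key=lambda kv: kv[1])]

# Hypothetical matrix with floating-point noise on the diagonal.
noisy = {
    "A": {"A": 3e-5, "B": 120.0, "C": 95.0},
    "B": {"A": 120.0, "B": 1e-5, "C": 140.0},
    "C": {"A": 95.0, "B": 140.0, "C": 2e-5},
}
clean = zero_self_distances(noisy)

for artist in noisy:
    # The pinned artist is first either way; the rest keep their order.
    assert ranking(noisy, artist) == ranking(clean, artist)
print(ranking(clean, "A"))
```

The pinned artist lands first with or without the cleanup, so zeroing the diagonal is purely cosmetic.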
