Hello everyone, while a couple of my experiments are blocked for various reasons, I decided to look into LAION’s image datasets and explore what’s inside.
Why?
I have two main reasons for doing this:
It’s important to understand the different parts that make up Stable Diffusion (or other AI models), and the images used to train the model are at the very core of it. Understanding what went in should allow us to optimize different workflows.
It sharpens my technical skills and gets me used to working with such databases for future tasks, whether for fine-tuning or other possible use cases.
Potential Practical Use-cases
There is more to it than just learning about what’s inside the dataset. There are very practical things that these image datasets can be used for:
Regularization image sets: for Dreambooth fine-tunes in particular, one can scrape these datasets for specific class images and use them for regularization purposes. We haven’t experimented much with this vs. using SD-generated images, but there is a good chance of achieving interesting results.
Fine-tuning on specific subsets: one can scrape images for a specific subject or object, select a high-quality subset, potentially re-caption them, and then fine-tune the model with this dataset. Higher-quality output on that subject is very likely.
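As a rough sketch of the first selection step behind both ideas, the snippet below picks candidate rows by caption keyword and minimum image size. It only selects rows and downloads nothing. The function name `select_class_rows`, the `min_side` threshold, and the sample rows are made up for illustration, and the column names are assumed to match the column list further down in this post.

```python
import re

def select_class_rows(rows, class_word, min_side=512):
    """Keep rows whose caption contains class_word as a whole word
    and whose shorter side is at least min_side pixels."""
    pattern = re.compile(rf"\b{re.escape(class_word)}\b", re.IGNORECASE)
    return [
        row for row in rows
        if pattern.search(row["TEXT"])
        and min(row["WIDTH"], row["HEIGHT"]) >= min_side
    ]

# Hypothetical rows for illustration; URLs and sizes are made up.
sample_rows = [
    {"URL": "https://example.com/1.jpg", "TEXT": "Paper Doll - Print Dress",
     "WIDTH": 600, "HEIGHT": 800},
    {"URL": "https://example.com/2.jpg", "TEXT": "Red dress on a hanger",
     "WIDTH": 300, "HEIGHT": 400},
    {"URL": "https://example.com/3.jpg", "TEXT": "Dressing table",
     "WIDTH": 1024, "HEIGHT": 1024},
]

selected = select_class_rows(sample_rows, "dress")
print([row["URL"] for row in selected])
```

The whole-word match is a deliberate choice here: a plain substring test would also pick up captions such as "Dressing table", which is probably not what you want in a class image set.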
Note: I won’t cover how to scrape images using this dataset in this post, but you can look into the EveryDream toolkit for a fairly straightforward way to do so. Link
What Are the LAION Dataset and LAION-Aesthetics
The most relevant part to mention here is that this is THE dataset that was used to create the Stable Diffusion model. Link
LAION 5B is a large-scale dataset for research purposes consisting of 5.85B CLIP-filtered image-text pairs. 2.3B contain English language, 2.2B samples from 100+ other languages, and 1B samples have texts that do not allow a certain language assignment.
There are also several subsets generated from this large dataset. In this case, we will be exploring the so-called LAION-Aesthetics subsets. These subsets should contain higher-quality images; to select them, a separate model was trained to predict the rating people gave when asked, “How much do you like this image on a scale from 1 to 10?”. As a result, each image in the LAION database got an aesthetic score. Link
For this particular exploration, we decided to use the Version 1 subset of English-captioned images that have an aesthetic score of 7 or higher. Link to the subset.
A few findings
The dataset has 52,068,913 rows, i.e., this subset is ~0.9% of the full 5.85B dataset, so if you fail to find images of specific things here, consider using larger subsets.
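As a quick sanity check on that percentage, here is the arithmetic spelled out. It assumes the full-dataset size from the LAION description quoted earlier (5.85 billion pairs); the constant and function names are mine.

```python
# Row count of the subset, as reported above.
SUBSET_ROWS = 52_068_913
# Full dataset size, taken from the LAION description quoted earlier (assumption:
# the rounded 5.85 billion figure is close enough for a percentage estimate).
FULL_ROWS = 5_850_000_000

def subset_share(subset_rows: int, full_rows: int) -> float:
    """Return the subset size as a percentage of the full dataset."""
    return 100.0 * subset_rows / full_rows

share = subset_share(SUBSET_ROWS, FULL_ROWS)
print(f"{share:.2f}% of the full dataset")
```

The result comes out just under 0.9%, which matches the rounded figure above.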
The dataset has the following columns:
URL: public URL of the image (i.e., the images are not stored in this database itself but rather their URLs all over the internet)
TEXT: caption of the image, which should describe it
WIDTH and HEIGHT: width and height of the image in pixels
Similarity: I assume this is a score that assesses how close the caption is to the image (most likely the CLIP similarity, given that the dataset is CLIP-filtered)
Hash: a hash of the image; I’m unsure which hash function was used
Punsafe: I assume this is the probability that the image is NSFW
Pwatermark: the probability that the image has a watermark on it
Aesthetic: the aesthetic score predicted by the scoring model described above
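To make the column list concrete, here is a minimal sketch of a typed record for one row, parsed from CSV-like text with the standard library. The class name `LaionRow`, the helper `parse_row`, and the sample record are made up for illustration, and the exact spelling and case of the column names in the real files is an assumption.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class LaionRow:
    url: str
    text: str
    width: int
    height: int
    similarity: float
    image_hash: str   # kept as a string because the hash type is unknown
    punsafe: float
    pwatermark: float
    aesthetic: float

def parse_row(raw: dict) -> LaionRow:
    """Convert one raw record (all values as strings) into typed fields."""
    return LaionRow(
        url=raw["URL"],
        text=raw["TEXT"],
        width=int(raw["WIDTH"]),
        height=int(raw["HEIGHT"]),
        similarity=float(raw["similarity"]),
        image_hash=raw["hash"],
        punsafe=float(raw["punsafe"]),
        pwatermark=float(raw["pwatermark"]),
        aesthetic=float(raw["aesthetic"]),
    )

# Made-up sample record for illustration; not taken from the real dataset.
SAMPLE_CSV = """URL,TEXT,WIDTH,HEIGHT,similarity,hash,punsafe,pwatermark,aesthetic
https://example.com/a.jpg,Balcony at sunset,1024,768,0.31,12345,0.01,0.05,7.4
"""

rows = [parse_row(record) for record in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0])
```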
The distribution of aesthetic scores looks like this, so we can see a very rapid drop in counts as the score goes up:
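If you want to reproduce a distribution like this yourself, a minimal sketch is to bin the scores and count rows per bin. The bin width of 0.25, the function name `score_histogram`, and the sample scores are my own assumptions, not the settings used for the chart in this post.

```python
import math
from collections import Counter

def score_histogram(scores, bin_width=0.25):
    """Count scores per bin; each key is the lower edge of its bin.

    The default width of 0.25 is exactly representable in binary floating
    point, which avoids off-by-one bins caused by rounding error.
    """
    counts = Counter()
    for score in scores:
        lower_edge = math.floor(score / bin_width) * bin_width
        counts[round(lower_edge, 2)] += 1
    return dict(sorted(counts.items()))

# Made-up scores for illustration only.
sample_scores = [7.0, 7.1, 7.2, 7.3, 7.6, 8.1]
for edge, count in score_histogram(sample_scores).items():
    # Print a simple text bar per bin.
    print(f"{edge:.2f}+  {'#' * count}")
```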
Another useful metric is the frequency of different image resolutions. From this, we can deduce how some images were cropped during training and why we might see those headless image generations or weapons that look like they are zoomed into the center. This also has implications for various fine-tuning needs. Here are the top 20 resolution combinations of this subset:
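A minimal sketch of both ideas follows: counting the most common resolutions, and estimating what a crop would throw away. It assumes a centered square crop, which is my assumption about the preprocessing rather than something this post confirms; the function names and sample sizes are also made up.

```python
from collections import Counter

def top_resolutions(sizes, n=20):
    """Return the n most common (width, height) pairs with their counts."""
    return Counter(sizes).most_common(n)

def center_square_crop(width, height):
    """Return the (left, top, right, bottom) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

def fraction_kept(width, height):
    """Fraction of the original pixels that survive the centered square crop."""
    side = min(width, height)
    return side * side / (width * height)

# Made-up sizes for illustration only.
sample_sizes = [(1024, 768), (1024, 768), (683, 1024), (1024, 1024)]

for (width, height), count in top_resolutions(sample_sizes, n=3):
    box = center_square_crop(width, height)
    kept = fraction_kept(width, height)
    print(f"{width}x{height}: {count} images, crop box {box}, {kept:.0%} kept")
```

For a portrait image such as 683x1024, the crop removes a strip from both the top and the bottom, which is exactly where a head tends to be in a full-body photo.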
Scores vs. Images
This section is more anecdotal than analytical, but I thought it would be fun to look at images with top and bottom scores based on the various fields in the database (similarity, aesthetic, Pwatermark, and Punsafe).
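For anyone who wants to pull similar top and bottom samples, here is a minimal sketch using `heapq` from the standard library, which avoids sorting the whole table. The function name `top_and_bottom`, the lowercase field names, and the sample rows are assumptions for illustration.

```python
import heapq

def top_and_bottom(rows, field, n=5):
    """Return (top, bottom): the n rows with the highest and lowest values of field."""
    top = heapq.nlargest(n, rows, key=lambda row: row[field])
    bottom = heapq.nsmallest(n, rows, key=lambda row: row[field])
    return top, bottom

# Made-up rows for illustration only.
sample_rows = [
    {"TEXT": "caption a", "aesthetic": 7.9, "similarity": 0.28},
    {"TEXT": "caption b", "aesthetic": 7.1, "similarity": 0.41},
    {"TEXT": "caption c", "aesthetic": 8.4, "similarity": 0.19},
    {"TEXT": "caption d", "aesthetic": 7.5, "similarity": 0.33},
]

for field in ("aesthetic", "similarity"):
    top, bottom = top_and_bottom(sample_rows, field, n=2)
    print(field, "top:", [row["TEXT"] for row in top],
          "bottom:", [row["TEXT"] for row in bottom])
```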
Top Aesthetic Scores
It’s interesting to see a certain pattern of what the V1 scorer thinks deserves the highest score:
Bottom Aesthetic Scores
I’m unsure whether these images are objectively worse, but the difference is apparent.
Top Pwatermark Scores
As you can see, false positives are possible, especially if images contain some text, so I wouldn’t rely too heavily on this score to filter out images, especially for some specific use cases.
Bottom Pwatermark Scores
On the other hand, there are zero false negatives among these examples, so it’s probably ok to rely on this score if we strictly want to find images that do not have a watermark on them.
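In practice, that suggests keeping only rows with a low watermark probability. The sketch below shows the idea; the cutoff of 0.1, the function name `likely_unwatermarked`, and the sample rows are assumptions for illustration, not values tested in this post.

```python
def likely_unwatermarked(rows, max_pwatermark=0.1):
    """Keep rows whose watermark probability is at or below max_pwatermark."""
    return [row for row in rows if row["pwatermark"] <= max_pwatermark]

# Made-up rows for illustration only.
sample_rows = [
    {"TEXT": "clean photo", "pwatermark": 0.02},
    {"TEXT": "stock photo preview", "pwatermark": 0.93},
    {"TEXT": "poster with text", "pwatermark": 0.55},
]

kept = likely_unwatermarked(sample_rows)
print([row["TEXT"] for row in kept])
```

A strict cutoff like this trades recall for precision: it will drop some clean images that happen to contain text, but what remains should rarely carry a watermark.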
Top Punsafe Scores
I thought this part would be risky, but none of the top images were NSFW. Remember that this is a very small subset and an anecdotal observation, so do not assume there won’t be any.
Bottom Punsafe Scores
The main difference is that these images do not even contain humans and are not remotely NSFW.
Top Similarity Scores
Judge for yourself, but I think this is quite telling about the quality of overall tagging and the room for improvement:
a modern twist on affogato, this dirty chai affogato drowns a generous scoop of homemade chai ice cream with a shot of hot espresso
Danville Bookcase With Doors 42 Wide
Isobel dress from Ohh by Gum €97.95 Connemara Life 2015 Seasons of Ireland on the Wild Atlantic Way
Paper Doll - Print Dress
Park Icon Sign Set Bear Chasing Man Into Trailer
Bottom Similarity Scores
To be honest, these are better than I expected, but I also observed a lot of digits-only captions among these images that I decided not to include here as examples. Interestingly, lower similarity scores seem to correlate with smaller image sizes:
031 copyblogthumbnail
1947 Plymouth
Balcony
South by Southwest Cornbread Salad
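Both observations about the low-similarity rows can be checked with a few lines of code. The sketch below flags digits-only captions and computes a Pearson correlation between similarity and image area. The function names and sample rows are made up for illustration, and using pixel area as the measure of "image size" is my assumption.

```python
import math

def is_digits_only(caption):
    """True when the caption contains only digit characters and whitespace."""
    stripped = "".join(caption.split())
    return stripped.isdigit()

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up rows for illustration only.
sample_rows = [
    {"TEXT": "031", "WIDTH": 100, "HEIGHT": 100, "similarity": 0.10},
    {"TEXT": "Balcony", "WIDTH": 200, "HEIGHT": 200, "similarity": 0.20},
    {"TEXT": "1947 Plymouth", "WIDTH": 400, "HEIGHT": 300, "similarity": 0.30},
    {"TEXT": "Cornbread Salad", "WIDTH": 800, "HEIGHT": 600, "similarity": 0.40},
]

digit_captions = [row["TEXT"] for row in sample_rows if is_digits_only(row["TEXT"])]
areas = [row["WIDTH"] * row["HEIGHT"] for row in sample_rows]
similarities = [row["similarity"] for row in sample_rows]

print("digits-only captions:", digit_captions)
print(f"correlation between similarity and area: {pearson(similarities, areas):.2f}")
```

A positive coefficient on the real data would support the observation above, although a correlation on its own says nothing about why small images get lower scores.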