The Power of Synthetic Data: Infinite Loop to Improve Fine-Tuning Results with Stable Diffusion Models
How Reusing Fine-Tuned Output Can Improve Model Flexibility and Precision
Hello FollowFox readers!
We thought you already missed Damon’s face, and to celebrate his new Gorillaz album, we did another Stable Diffusion experiment on his photos.
This time we wanted to see if fine-tuning outcomes could be improved using the images generated from the previous fine-tunes. These images are usually referred to as “synthetic data” (wiki explanation), and we will use this term throughout the post. The implications of this methodology can be compelling, and we discuss them below.
To give you an idea of what we mean by reusing the fine-tuned output, here is a simplified overview of what we are doing.
And since you love spoilers, here is a preview of the results that we got using this approach:
At the very least, it looks pretty promising, even though we have been observing some tradeoffs between flexibility and precision.
Implications
This methodology can open up some exciting possibilities and further push what is possible with Stable Diffusion models.
First, it allows us to work with very little data. In many cases, the availability of photos and images is not a problem, but in some cases, only a certain amount can be collected. For example, think about pictures of individuals from the past, limited art and monuments from history, and so on. We wrote about these limitations in the previous post (link). This approach can allow us to train a model with whatever data we have, try hard to generate new variations that are different and precise enough, train again, and repeat. One can create a whole new character from a few photos of a monument in the street!
Moreover, this can be a step towards an automated, self-improving loop of fine-tuned Stable Diffusion models. The individual elements are already here: this approach of fine-tuning on the output data, plus methods to judge the quality of the output, either by calculating loss values against the originals or by using subject recognition models.
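To make the "judge the quality of the output" step more concrete, here is a minimal sketch of how generated images could be filtered before the next round of training. It assumes each image has already been turned into an embedding vector by some subject recognition model (not shown here), and the function names, the toy vectors, and the 0.8 threshold are all hypothetical. It is an illustration of the idea, not the pipeline used for this post.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_candidate(candidate, originals):
    """Mean similarity between one generated image and all original photos."""
    return sum(cosine_similarity(candidate, o) for o in originals) / len(originals)

def select_for_next_round(candidates, originals, threshold=0.8):
    """Keep the names of generated images that resemble the subject closely enough."""
    return [
        name
        for name, embedding in candidates.items()
        if score_candidate(embedding, originals) >= threshold
    ]

# Toy 2-D embeddings standing in for real face or subject embeddings (assumption).
originals = [[1.0, 0.0], [0.9, 0.1]]
candidates = {"good.png": [1.0, 0.05], "off.png": [0.0, 1.0]}

kept = select_for_next_round(candidates, originals)
print(kept)
```

In a real loop the threshold would need tuning: too strict and the new dataset loses the variability we are after, too loose and imprecise images leak into training.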
Approach
We started with our previous model (link) and generated about 5x more images of the subject. We tried to combine high-quality, high-precision images of the subject with more variability to put the subject in various situations and poses. Even though we spent a couple of hours on this, the dataset we gathered was not perfect, and with even more time and prompt gurus, something better could be achieved. In the end, our data looked something like this:
For the fine-tuning part, we used the same EveryDream2 settings as in the original post (link), with two exceptions: we trained two models, one with a 7.5e-07 learning rate and the second with 3e-07. In both cases, we trained for 55 epochs on 100 images in total. We believe these settings were far from optimal: we got the best outputs at very low CFG settings (2 to 4), which suggests that the learning rates and step counts can be further optimized for this larger dataset.
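For a rough sense of how much training this amounts to, the number of optimizer steps can be estimated from the image count and the epochs. The sketch below is simple arithmetic that assumes one step per batch and ignores any bucketing or gradient accumulation the trainer may apply. The batch sizes are hypothetical, since the batch size is not restated in this post.

```python
import math

def total_optimizer_steps(num_images, epochs, batch_size):
    """Estimate optimizer steps: batches per epoch (rounded up) times epochs."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return steps_per_epoch * epochs

NUM_IMAGES = 100  # from the post
EPOCHS = 55       # from the post

# Batch sizes below are illustrative assumptions, not the values we trained with.
for batch_size in (1, 2, 4, 6):
    steps = total_optimizer_steps(NUM_IMAGES, EPOCHS, batch_size)
    print(f"batch_size={batch_size}: {steps} steps")
```

Because the dataset is about 5x larger than before, the same epoch count means roughly 5x more steps at a given batch size, which is one plausible reason the models behave as if overtrained at the default CFG.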
Results
TLDR: we think the new models have significantly increased flexibility, and generating decent results on the subject was much easier in various settings. However, there was some decrease in the precision, and the ultra-realistic features and details were less present.
To compare results, we did a few tests: first of all, we repeated our three usual tests used in the original post (link) - realistic photo, video game avatar, and Superman. Then, we tested a few images resembling those used for synthetic data generation. And finally, a couple of totally new prompts.
All results are presented in three rows: the model trained on original photos, the model trained on synthetic data at a 7.5e-07 learning rate, and finally, the model trained on synthetic data at a 3e-07 learning rate. The seeds and prompts were the same between the models, and the only difference was the CFG scale, which defaulted to 7 for the original model, while we had to use much lower values for the new, synthetic ones. All detailed prompts and parameters are at the end of this post.
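The comparison setup can be sketched as a small grid of generation jobs: every model gets the same prompts and seeds, and only the CFG scale differs per model. The `Job` class, the model labels, and the CFG value of 3 for the synthetic models are assumptions for illustration (the post only says the best values were between 2 and 4), and the prompt and seeds are placeholders.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Job:
    model: str
    prompt: str
    seed: int
    cfg_scale: float

# 7 is the default used for the original model; 3 is a hypothetical pick
# from the 2 to 4 range that worked best for the synthetic models.
CFG_BY_MODEL = {
    "original": 7.0,
    "synthetic_lr_7.5e-07": 3.0,
    "synthetic_lr_3e-07": 3.0,
}

def build_jobs(prompts, seeds, cfg_by_model):
    """One job per (model, prompt, seed); prompts and seeds are shared by all models."""
    return [
        Job(model, prompt, seed, cfg)
        for (model, cfg), prompt, seed in product(cfg_by_model.items(), prompts, seeds)
    ]

jobs = build_jobs(["photo of loeb"], [19911, 19912], CFG_BY_MODEL)
for job in jobs:
    print(job)
```

Keeping prompts and seeds fixed means any difference within a column of the grid comes from the model and its CFG scale, not from sampling noise.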
Let’s take a look at some of the outputs:
Realistic Photos
This part was where we noticed the greatest regression: some details are just missing. This could likely be addressed through further fine-tuning and by using a weighted combination of the original and synthetic data.
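One simple way to weight the two sources is to repeat the original photos a few times in the training list so they make up a larger share of what the model sees. The sketch below shows that idea only; we have not tested it, and the file names, the image counts, and the repeat factor are hypothetical.

```python
def build_mixed_dataset(originals, synthetic, original_repeats):
    """Return a training list where each original photo appears `original_repeats` times."""
    return list(originals) * original_repeats + list(synthetic)

def original_share(originals, synthetic, original_repeats):
    """Fraction of the mixed training list that comes from original photos."""
    n_original = len(originals) * original_repeats
    return n_original / (n_original + len(synthetic))

# Hypothetical counts and file names, for illustration only.
originals = [f"original_{i:02d}.jpg" for i in range(20)]
synthetic = [f"synthetic_{i:03d}.png" for i in range(100)]

mixed = build_mixed_dataset(originals, synthetic, original_repeats=3)
share = original_share(originals, synthetic, original_repeats=3)
print(len(mixed), round(share, 3))
```

Raising the repeat factor should pull the model back towards the fine details of the real photos, at the likely cost of some of the flexibility gained from the synthetic set.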
Avatars
We noticed some exciting changes with the avatars - some realism and precision were missing, but the results were more fun and had higher variability.
SuperDamon
A bit subjective, but we felt it was easier to generate images that look less fried but still like Damon, especially with the higher LR one.
NeonDamon
The first image on this grid is the one that was used in the new synthetic dataset. While the new model didn’t replicate the “coolness” of the first image, the subsequent ones seemed way more interesting from the new model while still having high precision.
TrippyDamon
Similar to the last one, the first image is from the synthetic dataset. We think the newer models do worse in this particular case, but we tried a few more seeds manually, and it comes down to the RNG of these specific seeds; the other ones still looked excellent and interesting.
VikingDamon
This is a new prompt not used in the synthetic data. We saw the most significant improvements in tests like this one: higher-quality generations were more consistent, judged by both accuracy and flexibility.
Exact Prompts Used
Please note that we had to adjust CFG values to much lower ones for the new synthetic models. Lowering CFG values for the original model didn’t increase output quality.
photo of loeb, professional close-up portrait, hyper-realistic, highly detailed, 24mm, dim lighting, high resolution, iPhoneX, by Peter Kemp
Negative prompt: Disfigured, (cartoon), blurry, black and white, female, woman, shadow, painting, shine, reflection, photoshop
Steps: 90, Sampler: Euler a, CFG scale: 7, Seed: 19911, Size: 512x512, Model hash: 593b7249c5
loeb, ((manga)) cover art, manly face with ((scar)) and (blood), warrior, detailed color portrait, trending on artstation, greg rutkowski, 8 k, smooth render, unreal engine 5 rendered, octane rendered, art style by klimt and nixeu and ian sprigger and wlop and krenz cushart, digital art
Negative prompt: photo, realistic, iphone, (((disfigured))), out of frame, ugly, extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck)))
Steps: 60, Sampler: DPM++ SDE Karras, CFG scale: 7, Seed: 3312531442, Size: 512x512, Model hash: 593b7249c5
photo of ((loeb)) as superman, superhero pose, Superman Returns, detailed face, (close up) shot, cinematic, 8k, sharp focus, canon 5d, high-resolution, professional, hyper-realistic, highly detailed, 24mm, sun lighting, high resolution, iPhoneX, by Peter Kemp, city background
Negative prompt: plastic, toy, blurry, ((far)), letters, dark, ((shadow))), (((disfigured face))), (((female))), (((woman))), blurry, bad art, ((((ugly)))), (((duplicate))), ((morbid)), ((mutilated)), out of frame, extra fingers, mutated hands, ((poorly drawn hands)), ((poorly drawn face)), (((mutation))), (((deformed))), ((ugly)), blurry, ((bad anatomy)), (((bad proportions))), ((extra limbs)), cloned face, (((disfigured))), out of frame, ugly, extra limbs, (bad anatomy), gross proportions, (malformed limbs), ((missing arms)), ((missing legs)), (((extra arms))), (((extra legs))), mutated hands, (fused fingers), (too many fingers), (((long neck)))
Steps: 100, Sampler: Euler a, CFG scale: 7, Seed: 2215890059, Size: 512x512, Model hash: 593b7249c5
loeb synthwave style, nvinkpunk Detailed portrait cyberpunk (photo of loeb), futuristic neon reflective wear, sci-fi, robot parts, perfect face, ((tattoo)), (long hair), matte skin, pores, sharp detail, sharpness, wrinkles, hyperdetailed, hyperrealistic, subsurface scattering, Hasselblad Award Winner, Soft Diffuse Lighting, Smirk, machine face, fine details, realistic shaded, intricate, elegant, award winning half body portrait of a woman in a croptop and cargo pants with ombre navy red teal hairstyle with head in motion and hair flying, paint splashes, splatter, outrun, vaporware, shaded flat illustration, digital art, highly detailed, fine detail, intricate
Negative prompt: lowres, poorly drawn, crippled, crooked, broken, weird, odd, distorted, (big breasts), (big tits), erased, cut, mutilated, sloppy, hideous, ((ugly)), pixelated, ((bad hands)), aliasing, lowres, (monochrome), (black and white), ((b&w)), poorly drawn, sloppy, over exposed, over saturated, burnt image, sloppy, broken, fuzzy, aliasing, cheap, oldschool, poor quality, pixelated, sleepy, closed-eyes, lowres, pixelated, aliasing, old, granny, ugly, ((bad anatomy)), hideous, deformed, mutant, butchered, gore, sloppy, artifacts, mutilated, poorly drawn, poorly detailed, smudged, sketch, pencil, glossy skin, doll, plastic, (signature), (watermark), (words), (letters), (logo), (username), ((disfigured)), ((close up))
Steps: 44, Sampler: DPM++ SDE Karras, CFG scale: 7.5, Seed: 257107453, Size: 512x512, Model hash: 6df795013b
character portrait of loeb as Painting illustration painting a map of the universe, a Pale skin hippie, Masterpiece, best quality art by Mati Klarwein, jungle color palette, realistic highly detailed occult, kawaii, heavenly, ominous lighting, witchcore, pantone, super wide angle, adorable, high quality
Negative prompt: b&w, deformed, photo, photograph, closeup, camera, film still, extra limbs, extra fingers, extra digits, mutated hands, bad anatomy, bad proportions, blur, blurry, incoherent, poorly drawn hands, sketch
Steps: 55, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 14036059, Face restoration: CodeFormer, Size: 512x512, Model hash: 6df795013b
loeb as a viking
Negative prompt: naked, helmet, old, ugly, smile, 3d, disfigured, glossy, plastic, ((bad art)), (deformed), blurry, out of frame, (mutation), (bad anatomy), (bad proportions), Photoshop, video game, tiling, cross-eye, 3d render
Steps: 49, Sampler: Euler a, CFG scale: 6, Seed: 2493697232, Face restoration: CodeFormer, Size: 512x512, Model hash: 6df795013b
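If you want to rerun these prompts on the synthetic models, the settings lines above are easy to parse and rewrite with a lower CFG scale. The helper names below are hypothetical, and the parsing assumes the simple "Key: value, Key: value" layout seen in this post, with no commas inside values.

```python
def parse_settings(line):
    """Split a 'Key: value, Key: value' settings line into an ordered dict of strings."""
    settings = {}
    for part in line.split(", "):
        key, _, value = part.partition(": ")
        settings[key] = value
    return settings

def with_cfg_scale(line, cfg_scale):
    """Return the same settings line with only the CFG scale replaced."""
    settings = parse_settings(line)
    settings["CFG scale"] = f"{cfg_scale:g}"
    return ", ".join(f"{key}: {value}" for key, value in settings.items())

line = "Steps: 90, Sampler: Euler a, CFG scale: 7, Seed: 19911, Size: 512x512, Model hash: 593b7249c5"

settings = parse_settings(line)
print(settings["CFG scale"])

# 3 is an illustrative value from the 2 to 4 range mentioned earlier.
low_cfg_line = with_cfg_scale(line, 3)
print(low_cfg_line)
```

Note that the model hash in the rewritten line still refers to the original model, so it would also need to change when switching checkpoints.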