In this project, I used the DeepFloyd IF diffusion model, a two-stage text-to-image model that takes text prompts as input and outputs images aligned with the text. To begin, I instantiated DeepFloyd's stage_1 and stage_2 objects used for generation, along with several text prompts for sample generation: "An oil painting of a snowy mountain village," "A man wearing a hat," and "A rocket ship." I used random seed 1213 and tried different combinations of num_inference_steps for both stages to generate several versions of images for the same three prompts. Here are the results:
stage 1 num_inference_steps = 20, stage 2 num_inference_steps = 20
Oil Painting of Snowy Mountain Village | Man Wearing a Hat | Rocket Ship |
---|---|---|
stage 1 num_inference_steps = 40, stage 2 num_inference_steps = 50
stage 1 num_inference_steps = 5, stage 2 num_inference_steps = 5
stage 1 num_inference_steps = 10, stage 2 num_inference_steps = 10
Thoughts:
I ran the three text prompts, 'an oil painting of a snowy mountain village', 'a man wearing a hat', and 'a rocket ship', with four different settings of num_inference_steps. The first run used 20 inference steps for both Stage 1 and Stage 2, which produced images that were not very realistic and more "cartoon-like", except for the man in the hat, which was somewhat realistic but almost a bit too soft (painting-like). The second run used 40 inference steps for Stage 1 and 50 for Stage 2, which led to more realistic representations of all three prompts; I found the man in the hat to be the most realistic of them all. For my third run, I tried 5 inference steps for both stages, which led to fairly faulty pictures: the oil painting of the snowy mountain village and the rocket ship were both splotchy with a fair bit of leftover noise, and the man wearing a hat came out completely broken, almost as if the inference stopped before reaching the final image. For my last run, I used 10 inference steps for both stages, which resulted in three good pictures that were less cartoon-like than the first run and fairly solid representations of the prompts.
In this section, I write my own sampling loops, using the pretrained DeepFloyd denoisers, to implement tasks such as producing optical illusions and inpainting images. The sampling loop performs reverse diffusion: starting from pure noise, it repeatedly applies the denoiser to remove noise and produce a clean image after T timesteps.
I first implemented the forward process of diffusion: taking a clean image and adding noise to it, as well as scaling it appropriately.
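For reference, the forward process is x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε with ε ~ N(0, I). A minimal sketch of this step is below; the linear-beta schedule is only a generic stand-in so the snippet is self-contained, since DeepFloyd provides its own alphas_cumprod values.

```python
import torch

# Generic linear-beta schedule, included only so this sketch is self-contained;
# in practice alphas_cumprod comes from DeepFloyd's own noise schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def forward(im, t):
    """Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)                                      # sample Gaussian noise
    return alpha_bar_t.sqrt() * im + (1 - alpha_bar_t).sqrt() * eps
```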
I implemented the noisy_im = forward(im, t) function, which adds the amount of noise corresponding to a given timestep t to an image. Shown below are the results of the forward function on an image of the Campanile at noise timesteps [250, 500, 750], which progressively add more noise to the image.
Original Campanile Image | Noise Timestep 250 | Noise Timestep 500 | Noise Timestep 750 |
---|---|---|---|
We can try to use Gaussian blurring to denoise the noisy images, but as we can see below, blurring does a poor job of removing the noise and recovering the original photo.
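For reference, this baseline is just a Gaussian filter applied to each noisy image, something like the sketch below (the kernel size and sigma are illustrative placeholders, not necessarily the exact values I used):

```python
import torchvision.transforms.functional as TF

# Classical denoising baseline: blur the noisy image with a Gaussian kernel.
# kernel_size and sigma are illustrative; they trade noise removal for lost detail.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```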
Noisy Image at Timestep 250 | Noisy Image at Timestep 500 | Noisy Image at Timestep 750 |
---|---|---|
Blurred Noise Timestep 250 | Blurred Noise Timestep 500 | Blurred Noise Timestep 750 |
---|---|---|
To attempt to denoise the images properly, I used DeepFloyd's pretrained diffusion model, a UNet trained on a large dataset of pairs of clean and noisy images, which predicts the Gaussian noise that was added to an image. I then scaled this noise estimate appropriately and subtracted it from the noisy image to obtain an estimate of the original image.
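Concretely, solving the forward-process equation for the clean image gives x0_hat = (x_t − sqrt(1 − ᾱ_t) · eps_hat) / sqrt(ᾱ_t). A sketch of this step, where eps_hat stands in for the UNet's noise prediction at timestep t:

```python
# One-step denoising: given the UNet's noise estimate eps_hat for noisy_im at
# timestep t, invert the forward process to estimate the clean image.
alpha_bar_t = alphas_cumprod[t]
x0_hat = (noisy_im - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
```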
Shown below are the noisy images at timesteps 250, 500, and 750, and the corresponding one-step denoised images using UNet.
Noisy Image at Timestep 250 | Noisy Image at Timestep 500 | Noisy Image at Timestep 750 |
---|---|---|
One-Step Denoised Image at Timestep 250 | One-Step Denoised Image at Timestep 500 | One-Step Denoised Image at Timestep 750 |
---|---|---|
At higher timesteps, due to more noise, the diffusion model struggles more to accurately estimate the noise added, and thus the estimated original image progressively gets less accurate.
Instead of one-step denoising, iterative denoising is a much more accurate way to estimate and remove the noise and obtain a clean image. Rather than stepping through every single timestep, which becomes inefficient, we can use strided timesteps and still obtain accurate results, thanks to properties of diffusion models discussed in https://yang-song.net/blog/2021/score/. I created strided_timesteps, and then implemented the function iterative_denoise(image, i_start), which starts at strided_timesteps[i_start] and loops over the remaining strided timesteps, at each step estimating the clean image from the predicted noise and interpolating toward it.
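Here is a rough sketch of that loop (not my exact code): predict_noise is a hypothetical wrapper around the DeepFloyd stage-1 UNet, strided_timesteps is a decreasing list of timesteps, and the update is the standard DDPM posterior mean, with the variance term omitted for brevity.

```python
def iterative_denoise(image, i_start):
    """Sketch of strided iterative denoising, starting from strided_timesteps[i_start]."""
    x_t = image
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = a_bar_t / a_bar_prev                  # effective alpha for this strided step
        beta_t = 1 - alpha_t

        eps_hat = predict_noise(x_t, t)                 # UNet noise estimate (hypothetical wrapper)
        x0_hat = (x_t - (1 - a_bar_t).sqrt() * eps_hat) / a_bar_t.sqrt()   # current clean estimate

        # DDPM posterior mean: interpolate between the clean estimate and x_t
        x_t = (a_bar_prev.sqrt() * beta_t / (1 - a_bar_t)) * x0_hat \
            + (alpha_t.sqrt() * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    return x_t
```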
Noisy Campanile at Timestep 690 | Noisy Campanile at Timestep 540 | Noisy Campanile at Timestep 390 |
---|---|---|
Noisy Campanile at Timestep 240 | Noisy Campanile at Timestep 90 | Original Campanile Image |
---|---|---|
Iteratively Denoised Campanile | One-Step Denoised Campanile | Gaussian Blurred Campanile |
---|---|---|
As we can see, the iteratively denoised and one-step denoised campanile perform much better than the Gaussian-blurred campanile, and the iteratively denoised image more accurately captures details of the campanile than the one-step denoised, albeit not perfectly.
By using the function I made in the previous part, iterative_denoise(image, i_start), setting i_start to 0, and passing in pure random noise as the image, we can generate completely new images from scratch. Below are 5 results for the prompt "a high quality photo", generated using these steps.
We were able to create 5 new images from scratch, but we can create even better-quality photos using classifier-free guidance (CFG), which reduces hallucination by incorporating both an unconditional and a conditional noise estimate. I ran the UNet model twice per step, once with the conditional prompt and once with an empty prompt for the unconditional estimate, and blended the two estimates using the formula noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond), with guidance_scale > 1. For the conditional prompt I again used "a high quality photo" to guide the synthesis, ensuring that the resulting images are of higher quality than those generated without CFG. The images produced using this method showed improvements in clarity and detail compared to plain diffusion sampling from the previous part. Below are 5 results from using CFG!
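The per-step change is small; here is a sketch using the same hypothetical predict_noise wrapper as above, now also taking the text embedding (the guidance scale just needs to be greater than 1; around 7 is a common choice):

```python
# Classifier-free guidance: run the noise predictor twice per step and
# extrapolate past the conditional estimate. cond_embeds / uncond_embeds are
# the text embeddings for the prompt and for the empty prompt "".
guidance_scale = 7
eps_cond = predict_noise(x_t, t, cond_embeds)
eps_uncond = predict_noise(x_t, t, uncond_embeds)
eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```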
I applied the classifier-free guidance technique from the previous part to edit existing images (rather than creating completely new ones from scratch) by adding noise to them and then denoising them, leveraging the model's capacity to introduce creative changes. This process follows the SDEdit algorithm: noise the original image, then use the iterative_denoise_cfg function to iteratively denoise it, forcing the noisy image back onto the natural image manifold and producing an edited version of the original. I ran this denoising process at starting indices [1, 3, 5, 7, 10, 20]; the larger the i_start, the less noise is added and the more the result resembles the input image. Here are three examples of this image-to-image translation, where I used the text prompt "a high quality photo" and fed in a different image each time.
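In code, the edit itself is just the forward process followed by the CFG denoising loop; a sketch, where iterative_denoise_cfg is the CFG variant of the loop above:

```python
# SDEdit-style editing (sketch): noise the input image up to strided_timesteps[i_start],
# then denoise from there. A larger i_start means less noise is added, so the
# result stays closer to the input image.
for i_start in [1, 3, 5, 7, 10, 20]:
    noisy = forward(input_im, strided_timesteps[i_start])
    edited = iterative_denoise_cfg(noisy, i_start)
```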
Example 1: Campanile as Input Image
SDEdit, i_start = 1 | SDEdit, i_start = 3 | SDEdit, i_start = 5 | SDEdit, i_start = 7 | SDEdit, i_start = 10 | SDEdit, i_start = 20 | Original Campanile |
---|---|---|---|---|---|---|
Example 2: Autumn City as Input Image
SDEdit, i_start = 1 | SDEdit, i_start = 3 | SDEdit, i_start = 5 | SDEdit, i_start = 7 | SDEdit, i_start = 10 | SDEdit, i_start = 20 | Original Autumn City |
---|---|---|---|---|---|---|
Example 3: Lantern as Input Image
SDEdit, i_start = 1 | SDEdit, i_start = 3 | SDEdit, i_start = 5 | SDEdit, i_start = 7 | SDEdit, i_start = 10 | SDEdit, i_start = 20 | Original Lantern |
---|---|---|---|---|---|---|
Here are the results of using this image-to-image translation on three non-realistic images: a cartoon image of a bee that I took from the web, and my own (admittedly rough) hand-drawn attempts at a cartoon flower and a colorful cartoon person.
With a lower i_start, the results do not look much like the original image, but as i_start gets larger, the result becomes more representative of the original. Looking specifically at Example 3, the flower didn't turn out great, but that could be because the hand-drawn image I fed in wasn't very good to begin with. It's still interesting to see the progression of the output looking more and more like the input image as it is given more initial information (larger i_start).
Example 1: Web Image of a Bee!
SDEdit, i_start = 1 | SDEdit, i_start = 3 | SDEdit, i_start = 5 | SDEdit, i_start = 7 | SDEdit, i_start = 10 | SDEdit, i_start = 20 | Original Bee |
---|---|---|---|---|---|---|
Example 2: Hand-Drawn Image of a Colorful Person!
SDEdit, i_start = 1 | SDEdit, i_start = 3 | SDEdit, i_start = 5 | SDEdit, i_start = 7 | SDEdit, i_start = 10 | SDEdit, i_start = 20 | Original Person |
---|---|---|---|---|---|---|
Example 3: Hand-Drawn Image of a Flower!
SDEdit, i_start = 1 | SDEdit, i_start = 3 | SDEdit, i_start = 5 | SDEdit, i_start = 7 | SDEdit, i_start = 10 | SDEdit, i_start = 20 | Original Flower |
---|---|---|---|---|---|---|
We can use this same logic, but now inpaint a certain part of the image and keep the rest of the image intact.
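Concretely, the only change to the CFG denoising loop is one extra line after each step, which forces every pixel outside the mask back to a freshly noised copy of the original image; a sketch, where mask is 1 in the region to regenerate and 0 elsewhere:

```python
# Inpainting constraint, applied after each denoising step at timestep t:
# keep the generated content inside the mask, and the (re-noised) original outside it.
x_t = mask * x_t + (1 - mask) * forward(original_im, t)
```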
Example 1: Inpainting the Top of the Campanile
Campanile | Mask | Area to Replace | Inpainted Campanile |
---|---|---|---|
Example 2: Inpainting the Eiffel Tower
Eiffel Tower | Mask | Area to Replace | Inpainted Eiffel Tower |
---|---|---|---|
Example 3: Inpainting a House in the Mountains
House in the Mountains | Mask | Area to Replace | Inpainted House in the Mountains |
---|---|---|---|
We can also include a text prompt along with the input image to give us more control over the images we generate. With larger noise levels, the resulting image starts to look more and more like the original image.
Example 1: Prompted with “A Rocket Ship” on Campanile Image
Rocket Ship, Noise Level 1 | Rocket Ship, Noise Level 3 | Rocket Ship, Noise Level 5 | Rocket Ship, Noise Level 7 | Rocket Ship, Noise Level 10 | Rocket Ship, Noise Level 20 | Original Campanile |
---|---|---|---|---|---|---|
Example 2: Prompted with “A Photo of a Dog” on Image of House on the Mountains
Dog, Noise Level 1 | Dog, Noise Level 3 | Dog, Noise Level 5 | Dog, Noise Level 7 | Dog, Noise Level 10 | Dog, Noise Level 20 | Original House on the Mountains |
---|---|---|---|---|---|---|
Example 3: Prompted with “A Photo of the Amalfi Coast” on Image of New York City!
Amalfi Coast, Noise Level 1 | Amalfi Coast, Noise Level 3 | Amalfi Coast, Noise Level 5 | Amalfi Coast, Noise Level 7 | Amalfi Coast, Noise Level 10 | Amalfi Coast, Noise Level 20 | Original NYC Image |
---|---|---|---|---|---|---|
In this section, I implemented the Visual Anagrams task, using the pretrained diffusion model to create optical illusions that display different images depending on their orientation. The process denoises the same noisy image twice: once right-side up with one prompt (e.g., "an oil painting of people around a campfire"), and once flipped upside-down with a different prompt (e.g., "an oil painting of an old man"). After flipping the second estimate back to the original orientation, the two noise estimates are averaged to produce the final noise, which is then used for the denoising step.
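In code, the per-step noise estimate looks roughly like this (predict_noise is the same hypothetical wrapper as before; torch.flip along the height axis turns the image upside-down):

```python
import torch

# Visual anagram noise estimate (sketch): average the noise predicted for the
# upright image under prompt 1 with the un-flipped noise predicted for the
# flipped image under prompt 2.
eps_1 = predict_noise(x_t, t, prompt_1_embeds)                                  # e.g. "...people around a campfire"
eps_2_flipped = predict_noise(torch.flip(x_t, dims=[-2]), t, prompt_2_embeds)   # e.g. "...an old man"
eps_2 = torch.flip(eps_2_flipped, dims=[-2])                                    # flip the estimate back
eps = (eps_1 + eps_2) / 2                                                       # use this in the usual update
```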
Shown below are the visual anagrams I created. I show the same image side by side, flipping one of them to show a different prompt!
Oil Painting of an Old Man | Oil Painting of People Around a Campfire |
---|---|
Photo of Amalfi Coast | Photo of Dog |
---|---|
Photo of Pencil | Photo of Rocket Ship |
---|---|
In this section, I made “hybrid images” using a diffusion model and the concept of Factorized Diffusion, where up close, an image looks like a certain prompt, and far away, it looks like a different prompt. This involves creating a composite noise estimate 𝜖 by combining the low frequencies of one noise estimate with the high frequencies of another. To achieve this, I used two separate text prompts to denoise the same noisy image, followed by applying a low-pass filter to the first noise and a high-pass filter to the second noise. The final noise estimate is the sum of the filtered components.
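A sketch of the composite noise estimate; the Gaussian kernel size and sigma used for the low-pass filter here are illustrative choices:

```python
import torchvision.transforms.functional as TF

# Factorized Diffusion noise estimate (sketch): low frequencies from prompt 1,
# high frequencies from prompt 2.
eps_1 = predict_noise(x_t, t, prompt_1_embeds)                        # e.g. "a lithograph of a skull"
eps_2 = predict_noise(x_t, t, prompt_2_embeds)                        # e.g. "a lithograph of waterfalls"
low = TF.gaussian_blur(eps_1, kernel_size=33, sigma=2.0)              # low-pass component
high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=33, sigma=2.0)     # high-pass component
eps = low + high
```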
Using this method, I generated hybrid images that exhibit different visual features depending on the viewer’s distance. For example, in the first result, the image is a “skull” from afar but transforms into waterfalls up close.
Example 1: Skull + Waterfalls
Low-pass: ‘a lithograph of a skull’
High-pass: ‘a lithograph of waterfalls’
Example 2: Mountains + Barista
Low-pass: ‘an oil painting of a snowy mountain village’
High-pass: ‘a photo of a hipster barista’
Example 3: Dog + Amalfi Coast
Low-pass: ‘a photo of the amalfi coast’
High-pass: ‘a photo of a dog’
In this part of the project, I worked on training a diffusion model from scratch on the MNIST dataset.
I started by building a simple one-step denoiser, which I implemented as a UNet. This is the model architecture that I followed, as given on the class website.
In this section, I trained a UNet-based denoiser to map noisy images z to clean images x, using an L2 loss. The noisy images were generated by adding Gaussian noise to clean MNIST digits, scaled by a noise level 𝜎. Here is the visualization of the noising process for various 𝜎 values [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0], showing progressively noisier images.
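Unlike the DDPM forward process from earlier, this noising does not rescale the clean image; it is purely additive Gaussian noise. A minimal sketch:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps, eps ~ N(0, I): the noising process for the one-step denoiser."""
    return x + sigma * torch.randn_like(x)

sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]   # the noise levels visualized above
```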
I then trained the UNet model to denoise noisy images z generated by adding Gaussian noise (with noise level σ = 0.5) to clean MNIST digits. The model was trained on the MNIST training dataset for 5 epochs with a batch size of 256, using the Adam optimizer. A new noisy version of each batch was generated on the fly to improve generalization. The figure below shows the training loss curve, which demonstrates that the loss steadily decreased as the model converged.
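A sketch of that training loop under these settings (UNet stands in for the architecture above, and the learning rate shown is a placeholder rather than a value taken from my code):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.MNIST(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UNet().to(device)                               # the denoiser architecture above
opt = torch.optim.Adam(model.parameters(), lr=1e-4)     # learning rate is a placeholder
sigma = 0.5

for epoch in range(5):
    for x, _ in loader:
        x = x.to(device)
        z = x + sigma * torch.randn_like(x)                 # fresh noise for every batch
        loss = torch.nn.functional.mse_loss(model(z), x)    # L2 loss against the clean digits
        opt.zero_grad()
        loss.backward()
        opt.step()
```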
Here are the input, noisy, and denoised results (for random images selected from the test set) after the 1st and 5th epochs of training.
Results after the 1st epoch:
Results after the 5th epoch:
I trained the denoiser on noisy images with noise_level 0.5, but we can visualize the results on test set digits for various 𝜎 values [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].
As you can see, the denoising becomes less effective (more artifacts appear) as the noise level increases.
As we learned in the previous part of this project, iterative denoising works much more effectively than one-step denoising. To do this, I implemented a Denoising Diffusion Probabilistic Model (DDPM) to iteratively denoise noisy images. Unlike single-step denoising, DDPM conditions the model on a timestep t, allowing it to handle varying levels of noise. The noise is progressively reduced at each step using a predefined variance schedule and its cumulative product. The model was trained using a simplified schedule of 300 timesteps, optimizing an L2 loss between the predicted and actual noise at each step, which improves the denoising results.
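The schedule itself is just a list of betas and their cumulative products; a sketch (the beta endpoints follow the standard DDPM choice and are an assumption here):

```python
import torch

T = 300                                         # number of timesteps in the simplified schedule
betas = torch.linspace(1e-4, 0.02, T)           # variance schedule (endpoints assumed)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)   # the cumulative product used by the forward process
```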
To add time conditioning, I slightly adjusted the UNet architecture, adding 2 FCBlocks positioned as shown below.
In order to train the UNet, I followed this algorithm:
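In code, one step of that training algorithm looks roughly like the sketch below; here model is the time-conditioned UNet (with its own optimizer opt), and feeding it the normalized timestep t / T is an assumption about how the conditioning is passed in.

```python
import torch
import torch.nn.functional as F

for x0, _ in loader:                                          # clean MNIST digits
    x0 = x0.to(device)
    t = torch.randint(0, T, (x0.shape[0],), device=device)    # random timestep per image
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(device)[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps        # forward process: noise x0 to level t
    loss = F.mse_loss(model(x_t, t / T), eps)                 # the UNet predicts the added noise
    opt.zero_grad()
    loss.backward()
    opt.step()
```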
These were the results for the training losses, which show the loss steadily decreasing over the course of training.
To sample from the UNet, I implemented the algorithm below; results from Epoch 5 and Epoch 20 are shown underneath.
Sampling algorithm for Time-Conditioned UNet:
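A rough sketch of that sampling loop, written from the standard DDPM update rather than my exact code:

```python
import math
import torch

@torch.no_grad()
def sample(model, n=16):
    """Ancestral DDPM sampling with the time-conditioned UNet (sketch)."""
    x = torch.randn(n, 1, 28, 28, device=device)              # start from pure noise
    for t in range(T - 1, -1, -1):
        a_t = alphas[t].item()
        a_bar_t = alphas_cumprod[t].item()
        b_t = betas[t].item()
        t_batch = torch.full((n,), t, device=device)
        eps_hat = model(x, t_batch / T)                       # predicted noise at step t
        mean = (x - (1 - a_t) / math.sqrt(1 - a_bar_t) * eps_hat) / math.sqrt(a_t)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + math.sqrt(b_t) * z                         # add noise back, except at the last step
    return x
```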
Results after Epoch 5:
Results after Epoch 20:
As we can see, the results improve with successive epochs, and compared to the previous part (without iterative denoising), they look a lot better after adding time conditioning and iterative denoising.
To further improve the results of the UNet on denoising MNIST images, we can also condition the UNet on the digit's class (0-9). To do this, I added 2 new FCBlocks to the architecture, as shown below. To give the UNet more flexibility and ensure it still works without class conditioning, I one-hot encoded the class-conditioning vector c as c_one_hot and dropped it to zero 10% of the time.
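A sketch of how the class vector is prepared (the helper name is hypothetical; the 10% drop rate is as described above):

```python
import torch
import torch.nn.functional as F

def make_class_vector(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels, zeroing out the whole vector 10% of the
    time so the UNet also learns an unconditional estimate."""
    c_one_hot = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], device=labels.device) >= p_uncond).float()
    return c_one_hot * keep.unsqueeze(1)        # dropped rows become all zeros
```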
Here were the results for the training losses:
To sample from the UNet, I implemented the algorithm shown below; results from Epochs 1, 5, 10, and 20 follow.
Sampling algorithm for Class-Conditioned UNet:
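In code, the classifier-free guidance part of each sampling step looks roughly like this sketch (the guidance scale value is illustrative):

```python
# Class-conditioned sampling step with classifier-free guidance (sketch):
# run the UNet once with the one-hot class vector and once with an all-zero
# vector, then extrapolate past the conditional estimate.
guidance_scale = 5                                             # any value > 1; illustrative
eps_cond = model(x, t_batch / T, c_one_hot)
eps_uncond = model(x, t_batch / T, torch.zeros_like(c_one_hot))
eps_hat = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```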
Results after Epoch 1:
Results after Epoch 5:
Results after Epoch 10:
Results after Epoch 20:
As we can see, the class-conditioned UNet improves with successive epochs, and implementing classifier-free guidance allows the UNet to produce digits that look slightly different from the MNIST dataset. The results get thicker and smoother with more and more epochs, and by Epoch 20, they are maybe a bit too thick.
Overall, this project was really fun, interactive, and challenging! I learned a lot about diffusion models and generative models. Very cool stuff!