GIF: Interpretability
GIF: Sketch-based Editing 1
GIF: Sketch-based Editing 2
GIF: Palette-based Editing 1
GIF: Palette-based Editing 2
Abstract
Diffusion-based generative models excel at perceptually impressive synthesis but remain hard to interpret. To this end, we introduce ToddlerDiffusion, an interpretable 2D diffusion image-synthesis framework inspired by the human generation system. Unlike traditional diffusion models with opaque denoising steps, our approach decomposes the generation process into simpler, interpretable stages: for instance, first generating contours, then a palette, and finally a detailed colored image. Each stage is conditioned on the previous stage's output. Instead of relying on the naive concatenation conditioning mechanism used in LDM, we employ a Schrödinger Bridge to jump efficiently from one stage to the next. This not only enhances overall performance but also enables robust editing and interaction capabilities, faster convergence, and faster sampling. Although more stages could lead to a more complex architecture, each stage is meticulously formulated for efficiency and accuracy, surpassing Stable Diffusion (LDM) performance. Extensive experiments on datasets such as LSUN-Churches and CelebHQ validate our approach, which consistently outperforms existing methods. ToddlerDiffusion achieves notable efficiency, matching LDM performance on LSUN-Churches while operating 2x faster with a 3x smaller architecture.

Main Idea
At the top is an overview of our proposed pipeline, termed ToddlerDiffusion. First, we unconditionally generate an abstract structure with coarse contours. Second, starting from this coarse structure, we generate a tentative palette that matches it. Then, we overlay the outputs of both stages, which condition the final stage that produces the detailed RGB image.
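The staged pipeline can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are assumptions, and each placeholder body stands in for a learned (Schrödinger Bridge) diffusion stage.

```python
import numpy as np

def stage1_sketch(shape=(64, 64), rng=None):
    """Stage 1: unconditionally generate a coarse contour map.
    Placeholder: sparse random edges stand in for the learned sketch model."""
    rng = rng or np.random.default_rng(0)
    return (rng.random(shape) > 0.9).astype(np.float32)

def stage2_palette(sketch, rng=None):
    """Stage 2: generate a tentative palette conditioned on the sketch.
    Placeholder: random colors stand in for the learned palette model."""
    rng = rng or np.random.default_rng(1)
    h, w = sketch.shape
    return rng.random((h, w, 3)).astype(np.float32)

def stage3_rgb(sketch, palette):
    """Stage 3: produce the detailed RGB image conditioned on the
    overlay of the sketch and palette. Placeholder: the overlay itself."""
    overlay = palette * (1.0 - sketch[..., None])  # contours darken the palette
    return np.clip(overlay, 0.0, 1.0)

def toddler_pipeline():
    """Chain the stages: contours -> palette -> detailed RGB image."""
    sketch = stage1_sketch()
    palette = stage2_palette(sketch)
    return stage3_rgb(sketch, palette)
```

The key design point is the chaining itself: each stage consumes the previous stage's output, so intermediate results (sketch, palette) are inspectable and editable before the final image is produced.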


Architecture
An overview of the proposed architecture, dubbed ToddlerDiffusion. The first block illustrates the first stage, which generates a sketch unconditionally; thanks to our efficient formulation, this stage operates in image space at 64 x 64 resolution. The bottom module depicts the third stage, which generates an RGB image given either a sketch alone or both a sketch and a palette.
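The jump from one stage's output to the next can be illustrated with a Brownian bridge pinned at both endpoints. This is a heavily simplified stand-in for the Schrödinger Bridge transition: the actual model learns a drift between the two distributions, whereas this closed form only shows the pinned-endpoints idea.

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, rng=None):
    """Sample x_t on a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Illustrative only: a Schrödinger Bridge learns the optimal transition
    between two stage distributions; here the mean linearly interpolates
    the endpoints and the noise vanishes at t=0 and t=1."""
    rng = rng or np.random.default_rng(0)
    mean = (1.0 - t) * x0 + t * x1
    std = np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x0.shape)
```

Because the noise scale is zero at both endpoints, the trajectory is guaranteed to start exactly at the previous stage's output and end exactly at the next stage's target, which is what makes stage-to-stage conditioning well-behaved.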


Editing Ability


Toddler vs. SDEdit


Robustness


Faster Convergence


Summary