Diffusion Mumbling
This post collects some of my recent thoughts and readings on diffusion models. As things are not formal, I simply call them “diffusion mumbling”, as they are literally some relatively “reasonable” mumblings, lol. While many of the notes and opinions are very personal (it is my blog, after all, yay), I’ll try to keep them clear and reasonable. A more structured summary can be found in these slides. I won’t repeat everything here: partly because I’m lazy, partly because this is just a personal note. In short, there are two major sections (weighting and ELBO/likelihood) and two minor sections (prior and sampling), which I’ll briefly outline below. Many ideas are inspired by these two fantastic blogs (1 and 2). For paper references, see the slides.
1. Weighting in the Unifying Framework
When we put the loss functions of a variety of diffusion models (e.g., DDPM, score matching, and flow matching) together, they look strikingly similar. Therefore, a natural idea is to find a framework that unifies them. As far as I know, there are at least two papers on this: the “stochastic interpolant” paper and the “understanding diffusion objectives” paper. I read the “stochastic interpolant” one first, which unifies things mainly from the ODE/SDE perspective. This work is great, comprehensive, and mathematically sophisticated, and the unified stochastic interpolant framework provides a lot of inspiration for model design. But, at least for me, diffusion models still felt somewhat “magical” afterwards (maybe because I received more statistical than physics training, lol). However, when I later read the “understanding diffusion” one, which frames things mainly from the loss and likelihood perspective, I would say it is breathtaking! For anyone interested in diffusion models, I would recommend it as a must-read.
OK, here’s my very personal take. For diffusion models, both the ODE/SDE view and the loss/probability view are important. The ODE/SDE side helps us dig into technical details and inspires new model designs, while the loss/probability side gives a higher-level understanding. Why do I say this? Because diffusion models are basically stacked (or hierarchical) VAEs, and the ODE/SDE is what “glues” the different layers together. To really get the big picture, we usually have to go back to the original purpose and motivation, which is all about the VAE/loss/probability side. But to actually make diffusion models work, we need clever ways to design the “junctions” between layers. That’s the hard part, and the design really depends on the data. This is exactly where the ODE/SDE perspective kicks in.
In the “understanding” paper, they first unify almost all diffusion models with 1) a Gaussian source and 2) a linear noise schedule/interpolation ($z_t = \alpha_t x + \sigma_t \epsilon$, where $x$ is data and $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise). After some reparametrizations/rearrangements, everything reduces to a weighting of the loss (and if we separate the noise schedule out, the schedule becomes essentially importance sampling); a minimal sketch of this weighted loss is given right after the list below. This is fantastic and very inspiring! When we look closely at the weighting for different diffusion models, we can see that:
- Diffusion models put a large weight on low SNR, i.e., closer to noise. This is why diffusion models are fantastic on perceptual data (e.g., images, video, sound): we humans are mainly sensitive to the low-frequency/low-resolution part of perceptual data, and diffusion models cater to exactly this. Basically, the model has limited capacity and cannot do perfectly in both the high- and low-frequency domains. Diffusion models simply sacrifice the high-frequency part to make things “look”/“sound” perfect.
- Flow matching (FM) is very efficient, since it aggressively puts almost all the weight on the low-SNR region. This is why we usually see FM almost complete the generation in the first few steps! FM is too aggressive on the low-SNR part, and we need to let it slow down. Therefore, in Stable Diffusion 3, the weighting is redesigned to focus a bit more on the middle region.
- If we separate the noise schedule out from the weighting, the noise schedule becomes essentially importance sampling. This suggests at least two things: 1) we can estimate the weighting by tracking an EMA of the binned weighted loss, which is described briefly in the paper (a rough sketch of this appears a bit further below); 2) we can use different noise schedules for training (e.g., to reduce variance) and sampling (e.g., to reduce integration error and obtain straighter paths).
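To make the “everything reduces to a weighting” point concrete, here is the rough form I have in mind (my paraphrase, modulo constants and the exact way the time distribution is folded in), with $\lambda_t = \log(\alpha_t^2/\sigma_t^2)$ the log-SNR:

$$\mathcal{L}_w(x) = \mathbb{E}_{t,\, \epsilon}\big[\, w(\lambda_t)\, \|\hat{\epsilon}_\theta(z_t; t) - \epsilon\|_2^2 \,\big], \qquad z_t = \alpha_t x + \sigma_t \epsilon.$$

And a minimal PyTorch-style sketch of the same thing; all names here (`linear_schedule`, `uniform_weight`, `weighted_denoising_loss`, `model`) are placeholders I made up, and the model is assumed to be an $\epsilon$-prediction network:

```python
import torch

def log_snr(alpha, sigma):
    """lambda_t = log(alpha_t^2 / sigma_t^2)."""
    return 2.0 * (torch.log(alpha) - torch.log(sigma))

def linear_schedule(t):
    """A simple linear interpolation schedule (flow-matching style):
    alpha_t = 1 - t, sigma_t = t. Other schedules would slot in here."""
    return 1.0 - t, t

def uniform_weight(lam):
    """w(lambda) = 1. Different diffusion variants effectively correspond
    to different choices of this function."""
    return torch.ones_like(lam)

def weighted_denoising_loss(model, x, schedule=linear_schedule, weight=uniform_weight):
    """One Monte Carlo estimate of E_{t, eps}[ w(lambda_t) * ||eps_hat - eps||^2 ]."""
    b = x.shape[0]
    t = torch.rand(b, device=x.device).clamp(1e-4, 1.0 - 1e-4)
    alpha, sigma = schedule(t)
    eps = torch.randn_like(x)
    # Broadcast the per-sample scalars over the data dimensions.
    a = alpha.view(b, *([1] * (x.dim() - 1)))
    s = sigma.view(b, *([1] * (x.dim() - 1)))
    z_t = a * x + s * eps                      # z_t = alpha_t * x + sigma_t * eps
    eps_hat = model(z_t, t)                    # epsilon-prediction network
    mse = ((eps_hat - eps) ** 2).flatten(1).mean(dim=1)
    lam = log_snr(alpha, sigma)
    return (weight(lam) * mse).mean()
```

Swapping `uniform_weight` for something that concentrates mass on low log-SNR, or reweighting the distribution of $t$ (the importance-sampling view from the last bullet above), is, in this picture, the main knob that distinguishes the different models.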
Currently, there are a lot of heuristic methods for weighting design, and this blog gives a great summary.
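Related to point 3 in the list above, and as an alternative to hand-designing the weighting, here is a rough sketch of how I imagine the EMA trick: track an EMA of the loss in bins over $t$ (the paper bins over log-SNR, if I remember correctly), sample $t$ proportional to that EMA, and divide by the sampling density so the objective stays unbiased. The class and all its names are hypothetical, not taken from any library or from the paper’s code:

```python
import torch

class AdaptiveTimeSampler:
    """Importance sampling over t: sample t proportional to an EMA of the
    binned loss, and return the density so the loss can be re-weighted."""

    def __init__(self, n_bins=100, decay=0.99):
        self.n_bins = n_bins
        self.decay = decay
        self.ema = torch.ones(n_bins)             # start from a uniform proposal

    def sample_t(self, batch_size):
        probs = self.ema / self.ema.sum()
        bins = torch.multinomial(probs, batch_size, replacement=True)
        t = (bins.float() + torch.rand(batch_size)) / self.n_bins  # jitter within bin
        density = probs[bins] * self.n_bins       # density relative to uniform
        return t, density

    def update(self, t, loss_per_sample):
        bins = (t * self.n_bins).long().clamp(max=self.n_bins - 1)
        for b, l in zip(bins.tolist(), loss_per_sample.detach().tolist()):
            self.ema[b] = self.decay * self.ema[b] + (1.0 - self.decay) * l

# Rough usage inside a training step (mse, lam, weight as in the previous sketch):
#   t, density = sampler.sample_t(batch_size)
#   loss = (weight(lam) * mse / density).mean()
#   sampler.update(t, weight(lam) * mse)
```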
OK, now everything is reduced to weighting design, and we accidentally (I think it’s purely a coincidence) put a lot of weight on the low-frequency (low-SNR) region to make diffusion models excel in the perceptual domain. But is the current weighting design optimal (at least for human/animal perception)? What is the optimal weighting, given different tasks and datasets? If we view animal perception through the lens of a diffusion-style objective, what weighting does it imply (I guess it may be adaptive according to different situations)? Can we use those insights to design better optimization objectives? There are many interesting neuroscience-related questions to ask through the lens of diffusion weighting.
2. Likelihood/ELBO and Diffusion Loss
The second piece of the “understanding” paper is the linkage between the ELBO and the diffusion loss. Although the authors claim it’s the main result, I personally think it’s quite standard and a bit less exciting than the unifying framework (sorry!). Furthermore, the relationship requires a monotonic weighting function, but many widely used weighting functions are not monotonic, which further reduces the generality of the result.
But the linkage between the diffusion loss and the ELBO under monotonic weighting opens up a lot of potential, as many ML methods are based on likelihood. Recently, I began to play around with RL, and in the slides I show two examples of how RL people use this result. Basically, they replace the likelihood with the ELBO, and then replace the ELBO with the diffusion loss, and that’s it (a toy sketch of this recipe is given below). Well, it looks trivial, but if a simple idea works well, then why not?
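Here is a toy sketch of what I have in mind for the RL side: use the negative diffusion loss as an ELBO-style surrogate for $\log p_\theta(x)$, then plug it into an ordinary reward-weighted likelihood update. Everything here (the function names, the reward-weighted form, reusing `weighted_denoising_loss` from the first sketch) is my own illustration, not the exact objective from any specific paper:

```python
import torch

def elbo_surrogate_logp(model, x, n_samples=8):
    """Monte Carlo ELBO-style surrogate for log p_theta(x): the negative
    (uniformly weighted) denoising loss, up to additive constants."""
    losses = [weighted_denoising_loss(model, x) for _ in range(n_samples)]
    return -torch.stack(losses).mean()

def reward_weighted_step(model, optimizer, x, reward):
    """Toy reward-weighted likelihood step: increase the surrogate
    log-likelihood more strongly on samples with higher reward."""
    logp = torch.stack([elbo_surrogate_logp(model, xi.unsqueeze(0)) for xi in x])
    loss = -(reward * logp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point is just the substitution: wherever an RL objective asks for $\log p_\theta(x)$, the ELBO stands in for it, and the (negative, suitably weighted) diffusion loss stands in for the ELBO.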
The remaining two sections of the slides are some other immature ideas and are quite easy to follow. I won’t make extra notes for them, yay.