The key idea of our method is to formulate 3D-aware image generation as the generation of a set of multiview 2D images. We assume that the distribution of 3D assets is equivalent to the joint distribution of their corresponding multiview images. This assumption follows from the bijective correspondence between a 3D asset and its multiview projections, given a sufficient number of views. To generate a set of multiview images that follows this joint distribution, we factorize it with the chain rule of probability into the product of an unconditional distribution and a series of conditional distributions:

\[ \begin{split} q_a(\mathbf x)=\ &q_i(\Gamma(\mathbf x, \boldsymbol\pi_0))\cdot\\ &q_i(\Gamma(\mathbf x, \boldsymbol\pi_1)\mid\Gamma(\mathbf x, \boldsymbol\pi_0))\cdot\\ &\cdots\\ &q_i(\Gamma(\mathbf x, \boldsymbol\pi_N)\mid\Gamma(\mathbf x, \boldsymbol\pi_0),\cdots,\Gamma(\mathbf x, \boldsymbol\pi_{N-1})). \end{split} \]
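As a sanity check on the factorization, the following minimal sketch verifies the chain rule numerically on a toy discrete distribution. The setup is purely illustrative and not part of the method: three "views" that each take one of four discrete values stand in for the multiview images, and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution over three "views", each taking one of 4 discrete values.
joint = rng.random((4, 4, 4))
joint /= joint.sum()

# Unconditional term: q(v0).
p_v0 = joint.sum(axis=(1, 2))                 # shape (4,)

# Conditional terms: q(v1 | v0) and q(v2 | v0, v1).
p_v01 = joint.sum(axis=2)                     # q(v0, v1), shape (4, 4)
p_v1_given_v0 = p_v01 / p_v0[:, None]
p_v2_given_v01 = joint / p_v01[:, :, None]

# The product of the unconditional and conditional factors recovers the joint.
reconstructed = (
    p_v0[:, None, None] * p_v1_given_v0[:, :, None] * p_v2_given_v01
)
max_abs_error = float(np.abs(reconstructed - joint).max())
print(f"max |reconstructed - joint| = {max_abs_error:.2e}")
```

The same identity is what allows sampling one view at a time: draw the first view from the unconditional factor, then draw each further view from its conditional factor given all views drawn so far.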

In practice, however, multiview images are themselves difficult to obtain. To train on unstructured 2D image collections instead, we construct training data using depth-based image warping. Two diffusion models are then trained to fit the unconditional and conditional distributions, respectively.
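To make depth-based image warping concrete, here is a generic forward-warping sketch. It is not the exact procedure used to build the training data, which this section does not specify. It assumes a pinhole camera with intrinsics `K` shared by both views, a per-pixel depth map for the source view, and a relative pose `(R, t)` from source to target camera coordinates. The function name `warp_with_depth` and all parameters are hypothetical.

```python
import numpy as np

def warp_with_depth(image, depth, K, R, t):
    """Forward-warp `image` (H, W, C) from a source view to a target view.

    depth: (H, W) depth along the source camera's z-axis (assumed known).
    K:     (3, 3) pinhole intrinsics, assumed shared by both views.
    R, t:  rotation (3, 3) and translation (3,) mapping source-camera
           coordinates to target-camera coordinates.
    Returns the warped image and a boolean mask of pixels that received a value.
    """
    H, W = depth.shape
    C = image.shape[2]
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project every source pixel to a 3D point in the source camera frame.
    pts_src = (np.linalg.inv(K) @ pix.T) * depth.reshape(1, -1)
    # Move the points into the target camera frame.
    pts_tgt = R @ pts_src + t.reshape(3, 1)

    # Keep points in front of the target camera and project them.
    front = pts_tgt[2] > 1e-8
    proj = (K @ pts_tgt)[:, front]
    u_t = np.rint(proj[0] / proj[2]).astype(np.int64)
    v_t = np.rint(proj[1] / proj[2]).astype(np.int64)
    inside = (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    src_idx = np.flatnonzero(front)[inside]
    flat_tgt = v_t[inside] * W + u_t[inside]
    z_t = proj[2][inside]

    # Z-buffer: for each target pixel keep only the nearest projected point.
    zbuf = np.full(H * W, np.inf)
    np.minimum.at(zbuf, flat_tgt, z_t)
    nearest = z_t == zbuf[flat_tgt]  # exact depth ties are interchangeable

    out = np.zeros((H * W, C), dtype=image.dtype)
    mask = np.zeros(H * W, dtype=bool)
    out[flat_tgt[nearest]] = image.reshape(H * W, C)[src_idx[nearest]]
    mask[flat_tgt[nearest]] = True
    return out.reshape(H, W, C), mask.reshape(H, W)
```

Target pixels that no source pixel lands on stay empty and are flagged `False` in the returned mask. Under the assumptions above, these are the regions a conditional generator would have to complete.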

Our method comprises two diffusion models, \(\mathcal{G}_u\) and \(\mathcal{G}_c\). \(\mathcal{G}_u\) is an unconditional model that randomly generates the first view, and \(\mathcal{G}_c\) is a conditional model that generates novel views. With aggregated conditioning, multiview images are obtained iteratively by refining and completing previously synthesized views.
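The iterative procedure can be sketched as a simple loop. This is a structural illustration under stated assumptions, not the actual implementation. The callables `G_u`, `G_c`, and `warp` are hypothetical stand-ins for the two diffusion models and a warping operator. The aggregation rule (earlier views take priority where several cover the same pixel) is one possible choice that we assume for illustration. The toy stubs treat poses as integer horizontal pixel offsets.

```python
import numpy as np

def generate_multiview(G_u, G_c, warp, poses, rng):
    """Sample views one pose at a time, mirroring the chain-rule factorization."""
    views = [G_u(rng)]  # first view: unconditional sample
    for n in range(1, len(poses)):
        cond = np.zeros_like(views[0])
        filled = np.zeros(views[0].shape[:2], dtype=bool)
        # Aggregate all previously synthesized views in the target view.
        # Assumed rule: earlier views win where several cover the same pixel.
        for m, prev in enumerate(views):
            warped, valid = warp(prev, poses[m], poses[n])
            take = valid & ~filled
            cond[take] = warped[take]
            filled |= take
        # The conditional model completes whatever the condition leaves empty.
        views.append(G_c(cond, filled, rng))
    return views

# --- Toy stand-ins (hypothetical; real models would be diffusion samplers) ---

def toy_G_u(rng):
    return rng.random((8, 8, 3))

def toy_G_c(cond, mask, rng):
    out = rng.random(cond.shape)   # "complete" missing pixels with noise
    out[mask] = cond[mask]         # keep pixels provided by the condition
    return out

def toy_warp(view, src_pose, tgt_pose):
    shift = tgt_pose - src_pose    # poses are integer pixel offsets here
    warped = np.roll(view, shift, axis=1)
    valid = np.zeros(view.shape[:2], dtype=bool)
    if shift >= 0:
        valid[:, shift:] = True
    else:
        valid[:, :shift] = True
    return warped, valid

views = generate_multiview(
    toy_G_u, toy_G_c, toy_warp, poses=[0, 1, 2], rng=np.random.default_rng(0)
)
```

Each new view is conditioned on every view generated before it, which matches the conditional factors in the factorization. In this toy setup the pixels visible from earlier views are copied unchanged, whereas a real conditional model could also refine them.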