ControlNet clones the diffusion model into two copies: a trainable copy and a locked copy.
The trainable copy learns conditional control for a specific task in an end-to-end manner.
The locked copy preserves the generative capability the network learned from billions of images, so it retains good robustness.
The two copies are connected through zero convolutions, whose weights grow from zero toward the optimized parameters during training; because no noise is added to the deep features, training is about as fast as fine-tuning the model.
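To make the mechanism concrete, here is a minimal PyTorch sketch of how one pretrained block can be wrapped with a locked copy, a trainable copy, and two zero convolutions, following the paper's formulation y = F(x; Θ) + Z(F(x + Z(c); Θ_c)); the names ControlledBlock and zero_conv are illustrative assumptions, not identifiers from the released code:

```python
import copy
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution with weight and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Wraps a pretrained block with a frozen (locked) copy and a
    trainable copy joined by zero convolutions. Assumes the condition
    feature c has the same shape as the block input x."""
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block
        for p in self.locked.parameters():
            p.requires_grad_(False)                       # locked copy stays frozen
        self.trainable = copy.deepcopy(pretrained_block)  # trainable copy
        self.zero_in = zero_conv(channels)                # injects the condition
        self.zero_out = zero_conv(channels)               # gates the control signal

    def forward(self, x, c):
        # Both zero convs output zeros at step 0, so initially the block
        # behaves exactly like the locked copy: no noise reaches the deep
        # features, and training starts from the intact pretrained model.
        return self.locked(x) + self.zero_out(self.trainable(x + self.zero_in(c)))
```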
Conditions: Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, etc.
Related Work
Text-to-Image Diffusion
Diffusion models can be applied to text-to-image generation tasks to achieve state-of-the-art image synthesis results. This is often achieved by encoding text prompts into latent vectors with pretrained language models such as CLIP.
GLIDE is a text-guided diffusion model that supports both image generation and editing.
Disco Diffusion is a CLIP-guided diffusion implementation for processing text prompts.
Stable Diffusion is a large-scale implementation of latent diffusion for text-to-image generation.
Imagen is a text-to-image architecture that does not use latent images and instead diffuses pixels directly through a pyramid structure.
Stable Diffusion converts 512×512 images into 64×64 latent images, so ControlNet needs to encode the image-space condition into a 64×64 latent image space c_f; a small convolutional network designed for this is sufficient.
We use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model).
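A minimal PyTorch sketch of this tiny encoder, following the quoted description (the class name ConditionEncoder, the padding of 1, and the Gaussian std are assumptions; note that four stride-2 layers downsample 512×512 inputs to 32×32 rather than 64×64, so the shipped network presumably arranges its layers somewhat differently from this literal reading):

```python
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Tiny encoder E(.) mapping a condition image toward latent space.

    Follows the paper's textual spec: four convolutions with 4x4 kernels
    and 2x2 strides, ReLU activations, channels 16/32/64/128, and
    Gaussian-initialized weights. The class name, padding, and init std
    are assumptions made for this sketch.
    """
    def __init__(self, in_channels: int = 3):
        super().__init__()
        layers, prev = [], in_channels
        for ch in (16, 32, 64, 128):
            layers += [nn.Conv2d(prev, ch, kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            prev = ch
        self.net = nn.Sequential(*layers)
        for m in self.net:
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, std=0.02)  # "Gaussian weights"; std is a guess
                nn.init.zeros_(m.bias)

    def forward(self, c):
        # Four stride-2 convs: a 512x512 image becomes a 32x32x128 feature map.
        return self.net(c)
```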