Generative models have achieved impressive advances in various vision tasks. However, these gains often rely on increasing model size, which raises computational complexity and memory demands. The increased computational demand poses challenges for deployment, elevates inference costs, and impacts the environment. While some studies have explored pruning techniques to improve the memory efficiency of diffusion models, most existing methods require extensive retraining to maintain model performance. Retraining a large model is extremely costly and resource-intensive, which limits the practicality of pruning methods. In this work, we propose a general pruning framework for vision generative models that achieves low-cost pruning by learning a differentiable mask to sparsify the model. To learn a mask that minimally degrades model performance, we design a novel end-to-end pruning objective that spans the entire generation process over all steps. Since end-to-end pruning is memory-intensive, we further design a time step gradient checkpointing technique that significantly reduces memory usage during optimization, enabling end-to-end pruning within a limited memory budget. Results on state-of-the-art U-Net diffusion models (SDXL) and DiT-based flow models (FLUX) show that our method efficiently prunes 20% of parameters in just 10 A100 GPU hours, outperforming previous pruning approaches.
Our pruning method, EcoDiff, achieves efficient yet effective pruning through two key innovations: end-to-end pruning and time step gradient checkpointing.
To mitigate the error accumulation inherent in conventional diffusion model pruning, we propose an end-to-end pruning objective that considers the entire denoising process. Our goal is to learn masking parameters \(\boldsymbol{\mathcal{M}}\) that minimize the difference between the final denoised latent \(z_0\) produced by the original denoising model \(\epsilon_{\theta}\) and the predicted \(\hat{z}_0\) from the masked model \(\epsilon_{\theta}^{\text{mask}}\), under the same initial noise \(z_T\) and conditioning input \(y\). The pruning objective is formulated as:
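A sketch of this objective, reconstructed from the surrounding definitions (the squared-error distance is an assumption here; the original may use a different reconstruction metric):

\[
\min_{\boldsymbol{\mathcal{M}}}\; \mathbb{E}_{y \sim \mathcal{C},\, z_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})} \Big[ \big\| \mathcal{F}(\epsilon_{\theta}, z_T, y) - \mathcal{F}(\epsilon_{\theta}^{\text{mask}}, z_T, y) \big\|_2^2 \Big] \;+\; \beta \,\| \boldsymbol{\mathcal{M}} \|_0,
\]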
where \(\mathcal{F}\) denotes the full denoising process, \(\mathcal{C}\) is the dataset of conditioning inputs, and \(\beta\) is a regularization coefficient promoting sparsity via the \(L_0\) norm of \(\boldsymbol{\mathcal{M}}\). To make optimization tractable, we apply a continuous relaxation of the discrete masks using hard-concrete distributions, which enables gradient-based optimization: a continuous mask \(\hat{\mathcal{M}}\) is obtained from learnable mask parameters \(\boldsymbol{\lambda}\). The \(L_0\) complexity loss \(\mathcal{L}_0\) given the hard-concrete parameters \(\boldsymbol{\lambda}\) can be described as:
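A commonly used closed form for this term under the hard-concrete relaxation (following Louizos et al., 2018) is sketched below; mapping \(\boldsymbol{\lambda}\) to the location parameter and \(\alpha\) to the temperature is an assumption of this sketch, and \(\sigma(\cdot)\) denotes the sigmoid:

\[
\mathcal{L}_0(\boldsymbol{\lambda}) = \sum_{j} \sigma\!\left( \lambda_j - \alpha \log \frac{-\gamma}{\zeta} \right),
\]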
where \(\alpha\), \(\gamma\), and \(\zeta\) are hyperparameters controlling the hard-concrete distribution. The end-to-end pruning loss for \(\boldsymbol{\lambda}\) is formulated as:
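A form consistent with the definitions above (writing the expectation over conditioning inputs and initial noise explicitly is an assumption of this sketch):

\[
\mathcal{L}(\boldsymbol{\lambda}) = \mathbb{E}_{y \sim \mathcal{C},\, z_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})} \big[ \mathcal{L}_E(z_0, \hat{z}_0) \big] \;+\; \beta\, \mathcal{L}_0(\boldsymbol{\lambda}),
\]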
where \(\mathcal{L}_E\) is the reconstruction loss between \(z_0\) and \(\hat{z}_0\). After optimization, we threshold \(\boldsymbol{\lambda}\) to obtain the final discrete masks \(\boldsymbol{\mathcal{M}}\), achieving effective structural pruning.
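As a concrete illustration, a simple thresholding rule of this kind is shown below; the threshold \(\tau\) is a hypothetical hyperparameter introduced here for exposition:

\[
\mathcal{M}_j = \mathbb{1}\!\left[ \lambda_j > \tau \right],
\]

so that entries with \(\mathcal{M}_j = 0\) are removed from the network and the remaining ones are kept at full strength.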
Optimizing the end-to-end pruning objective requires backpropagating through all diffusion steps, which is memory-intensive because intermediate activations must be stored for every step. To address this, we introduce a time step gradient checkpointing technique that significantly reduces memory consumption. Instead of storing all intermediate activations, we retain only the denoised latent variables \(\hat{z}_t\) after each denoising step. During backpropagation, we recompute the necessary intermediate activations on-the-fly using the stored \(\hat{z}_t\) as checkpoints. This reduces the activation memory complexity from \(O(T)\) to \(O(1)\), where \(T\) is the number of diffusion steps, while keeping the runtime complexity at \(O(T)\), at the cost of one additional forward computation per step. The approach enables efficient scaling to large diffusion models by trading a modest amount of extra computation for substantial memory savings, making it practical to optimize the pruning masks across the entire denoising process.
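A minimal PyTorch sketch of this idea is shown below. It assumes a diffusers-style interface in which the model predicts noise as model(latent, t, cond) and the scheduler exposes step(...).prev_sample; these names and signatures are illustrative assumptions, not the actual EcoDiff implementation.

```python
import torch
from torch.utils.checkpoint import checkpoint


def denoise_with_step_checkpointing(model, scheduler, z_T, cond, timesteps):
    """Run the full denoising loop while checkpointing each time step.

    Only the latent entering each step is kept alive for backpropagation;
    each step's internal activations are recomputed on-the-fly when
    gradients flow back, so activation memory stays constant in the
    number of steps at the cost of one extra forward pass per step.
    """
    z_t = z_T
    for t in timesteps:
        def denoise_step(latent, t=t):
            eps = model(latent, t, cond)                       # noise prediction
            return scheduler.step(eps, t, latent).prev_sample  # advance the latent

        # use_reentrant=False lets gradients reach the (masked) model
        # parameters even though only the latent is passed explicitly.
        z_t = checkpoint(denoise_step, z_t, use_reentrant=False)
    return z_t  # final denoised latent, i.e. the predicted z_0
```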
Our pruning approach is not limited to standard diffusion models; it can also be effectively applied to time-step distilled models. Time-step distillation reduces the number of diffusion steps required during inference, leading to faster image generation. By integrating our structural pruning method with time-step distilled models, we can further compress the model while benefiting from accelerated sampling. This combination allows us to maintain high-quality image generation with reduced computational resources, demonstrating the flexibility and broad applicability of EcoDiff.
After pruning, our models remain fully compatible with caching methods designed to speed up inference, such as DeepCache. These methods reuse computations across diffusion steps, thereby reducing redundant calculations. Since our pruning technique structurally reduces model parameters without altering the underlying architecture or input-output dimensions, it does not interfere with caching mechanisms. Our pruning method is therefore orthogonal to such acceleration techniques: it can be combined with caching and potentially other optimizations to achieve even greater efficiency gains. Users can leverage the strengths of both approaches to enhance performance without compromising the quality of the generated images.
Interestingly, pruning not only reduces model complexity but can also enhance the semantic alignment of generated images with the conditioning inputs. By selectively removing redundant or less significant neurons, pruning can eliminate noise and focus the model's capacity on the features that matter for semantic understanding. This refinement leads to sharper attention and more precise feature representations, allowing the pruned model to follow input prompts more faithfully. Empirically, we observe that pruned models often produce images that better capture the intended semantics, yielding outputs that are not only computationally cheaper but also better aligned with the provided descriptions.
While our pruning method maintains model performance without any retraining, we also explore post-pruning retraining to push compression further: with light retraining, the pruned model can recover the performance lost to pruning, which in turn allows higher pruning ratios. We investigate two strategies: LoRA fine-tuning and full-model retraining. Both are trained for 10,000 steps, taking approximately 10 hours on an A100 GPU. This retraining allows the model to adapt to the pruned architecture and enhances its performance. The results show that even such a modest retraining regimen lets the model improve further in efficiency without compromising image quality, highlighting the potential of combining pruning with targeted retraining.
@inproceedings{zhang2026learnable,
title={Learnable Sparsity for Vision Generative Models},
author={Zhang, Yang and Jin, Er and Liang, Wenzhong and Dong, Yanfei and Khakzar, Ashkan and Torr, Philip and Stegmaier, Johannes and Kawaguchi, Kenji},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=9pNWZLVZ4r}
}