Generative models have achieved impressive advances in various vision tasks. However, these gains often rely on increasing model size, which raises computational complexity and memory demands. The increased computational demand poses challenges for deployment, elevates inference costs, and impacts the environment. While some studies have explored pruning techniques to improve the memory efficiency of diffusion models, most existing methods require extensive retraining to maintain model performance. Retraining a large model is extremely costly and resource-intensive, which limits the practicality of pruning methods. In this work, we propose a general pruning framework for vision generative models that achieves low-cost pruning by learning a differentiable mask to sparsify the model. To learn a mask that minimally degrades model performance, we design a novel end-to-end pruning objective that spans the entire generation process over all steps. Since end-to-end pruning is memory-intensive, we further design a time step gradient checkpointing technique that significantly reduces memory usage during optimization, enabling end-to-end pruning within a limited memory budget. Results on state-of-the-art U-Net diffusion models (SDXL) and DiT-based flow models (FLUX) show that our method efficiently prunes 20% of parameters in just 10 A100 GPU hours, outperforming previous pruning approaches.
Our pruning method, EcoDiff, achieves efficient yet effective pruning through two key innovations: end-to-end pruning and time step gradient checkpointing.
To mitigate the error accumulation inherent in conventional diffusion model pruning, we propose an end-to-end pruning objective that considers the entire denoising process. Our goal is to learn masking parameters \(\boldsymbol{\mathcal{M}}\) that minimize the difference between the final denoised latent \(z_0\) produced by the original denoising model \(\epsilon_{\theta}\) and the predicted \(\hat{z}_0\) from the masked model \(\epsilon_{\theta}^{\text{mask}}\), under the same initial noise \(z_T\) and conditioning input \(y\). The pruning objective is formulated as:
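A sketch of this objective, reconstructed from the surrounding definitions (the squared-error distance is an assumption here; the original may use a different reconstruction metric):

\[
\min_{\boldsymbol{\mathcal{M}}}\; \mathbb{E}_{y \sim \mathcal{C},\, z_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})} \Big[ \big\| \mathcal{F}(\epsilon_{\theta}, z_T, y) - \mathcal{F}(\epsilon_{\theta}^{\text{mask}}, z_T, y) \big\|_2^2 \Big] \;+\; \beta \,\| \boldsymbol{\mathcal{M}} \|_0,
\]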
where \(\mathcal{F}\) denotes the full denoising process, \(\mathcal{C}\) is the dataset of conditioning inputs, and \(\beta\) is a regularization coefficient promoting sparsity via the \(L_0\) norm of \(\boldsymbol{\mathcal{M}}\). To make optimization tractable, we apply a continuous relaxation of the discrete masks using hard-concrete distributions, which enables gradient-based optimization: a continuous mask \(\hat{\mathcal{M}}\) is obtained from learnable mask parameters \(\boldsymbol{\lambda}\). The \(L_0\) complexity loss \(\mathcal{L}_0\) given the hard-concrete parameters \(\boldsymbol{\lambda}\) can be described as:
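A commonly used closed form for this term under the hard-concrete relaxation (following Louizos et al., 2018) is sketched below; mapping \(\boldsymbol{\lambda}\) to the location parameter and \(\alpha\) to the temperature is an assumption of this sketch, and \(\sigma(\cdot)\) denotes the sigmoid:

\[
\mathcal{L}_0(\boldsymbol{\lambda}) = \sum_{j} \sigma\!\left( \lambda_j - \alpha \log \frac{-\gamma}{\zeta} \right),
\]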
where \(\alpha\), \(\gamma\), and \(\zeta\) are hyperparameters controlling the hard-concrete distribution. The end-to-end pruning loss for \(\boldsymbol{\lambda}\) is formulated as:
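A form consistent with the definitions above (writing the expectation over conditioning inputs and initial noise explicitly is an assumption of this sketch):

\[
\mathcal{L}(\boldsymbol{\lambda}) = \mathbb{E}_{y \sim \mathcal{C},\, z_T \sim \mathcal{N}(\mathbf{0},\mathbf{I})} \big[ \mathcal{L}_E(z_0, \hat{z}_0) \big] \;+\; \beta\, \mathcal{L}_0(\boldsymbol{\lambda}),
\]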
where \(\mathcal{L}_E\) is the reconstruction loss between \(z_0\) and \(\hat{z}_0\). After optimization, we threshold \(\boldsymbol{\lambda}\) to obtain the final discrete masks \(\boldsymbol{\mathcal{M}}\), achieving effective structural pruning.
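As a concrete illustration, a simple thresholding rule of this kind is shown below; the threshold \(\tau\) is a hypothetical hyperparameter introduced here for exposition:

\[
\mathcal{M}_j = \mathbb{1}\!\left[ \lambda_j > \tau \right],
\]

so that entries with \(\mathcal{M}_j = 0\) are removed from the network and the remaining ones are kept at full strength.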
Optimizing the end-to-end pruning objective requires backpropagating through all diffusion steps, which is memory-intensive because intermediate activations must be stored for every step. To address this, we introduce a time step gradient checkpointing technique that significantly reduces memory consumption. Instead of storing all intermediate activations, we retain only the denoised latent variables \(\hat{z}_t\) after each denoising step. During backpropagation, we recompute the necessary intermediate activations on-the-fly using the stored \(\hat{z}_t\) as checkpoints. This reduces the activation memory complexity from \(O(T)\) to \(O(1)\), where \(T\) is the number of diffusion steps, while keeping the runtime complexity at \(O(T)\), at the cost of one additional forward computation per step. The approach enables efficient scaling to large diffusion models by trading a modest amount of extra computation for substantial memory savings, making it practical to optimize the pruning masks across the entire denoising process.
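A minimal PyTorch sketch of this idea is shown below. It assumes a diffusers-style interface in which the model predicts noise as model(latent, t, cond) and the scheduler exposes step(...).prev_sample; these names and signatures are illustrative assumptions, not the actual EcoDiff implementation.

```python
import torch
from torch.utils.checkpoint import checkpoint


def denoise_with_step_checkpointing(model, scheduler, z_T, cond, timesteps):
    """Run the full denoising loop while checkpointing each time step.

    Only the latent entering each step is kept alive for backpropagation;
    each step's internal activations are recomputed on-the-fly when
    gradients flow back, so activation memory stays constant in the
    number of steps at the cost of one extra forward pass per step.
    """
    z_t = z_T
    for t in timesteps:
        def denoise_step(latent, t=t):
            eps = model(latent, t, cond)                       # noise prediction
            return scheduler.step(eps, t, latent).prev_sample  # advance the latent

        # use_reentrant=False lets gradients reach the (masked) model
        # parameters even though only the latent is passed explicitly.
        z_t = checkpoint(denoise_step, z_t, use_reentrant=False)
    return z_t  # final denoised latent, i.e. the predicted z_0
```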
Our pruning approach is not limited to standard diffusion models; it can also be effectively applied to time-step distilled models. Time-step distillation reduces the number of diffusion steps required during inference, leading to faster image generation. By integrating our structural pruning method with time-step distilled models, we can further compress the model while benefiting from accelerated sampling. This combination allows us to maintain high-quality image generation with reduced computational resources, demonstrating the flexibility and broad applicability of EcoDiff.
After pruning, our models remain fully compatible with caching methods designed to speed up inference, such as DeepCache. These methods reuse computations across diffusion steps, thereby reducing redundant calculations. Since our pruning technique structurally reduces model parameters without altering the underlying architecture or input-output dimensions, it does not interfere with caching mechanisms. Our pruning method is therefore orthogonal to such acceleration techniques: it can be combined with caching and potentially other optimizations to achieve even greater efficiency gains. Users can leverage the strengths of both approaches to enhance performance without compromising the quality of the generated images.
Interestingly, pruning not only reduces model complexity but can also enhance the semantic alignment of generated images with the conditioning inputs. By selectively removing redundant or less significant neurons, pruning can eliminate noise and focus the model's capacity on the features that matter for semantic understanding. This refinement leads to sharper attention and more precise feature representations, allowing the pruned model to follow input prompts more faithfully. Empirically, we observe that pruned models often produce images that better capture the intended semantics, yielding outputs that are not only computationally cheaper but also better aligned with the provided descriptions.
While our pruning method maintains model performance without any retraining, we also explore post-pruning retraining to push compression further: with light retraining, the pruned model can recover the performance lost to pruning, which in turn allows higher pruning ratios. We investigate two strategies: LoRA fine-tuning and full-model retraining. Both are trained for 10,000 steps, taking approximately 10 hours on an A100 GPU. This retraining allows the model to adapt to the pruned architecture and enhances its performance. The results show that even such a modest retraining regimen lets the model improve further in efficiency without compromising image quality, highlighting the potential of combining pruning with targeted retraining.
@inproceedings{zhang2026learnable,
title={Learnable Sparsity for Vision Generative Models},
author={Zhang, Yang and Jin, Er and Liang, Wenzhong and Dong, Yanfei and Khakzar, Ashkan and Torr, Philip and Stegmaier, Johannes and Kawaguchi, Kenji},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=9pNWZLVZ4r}
}