VIVAT: Virtuous Improving VAE Training through Artifact Mitigation

arXiv — cs.LG · Tuesday, December 2, 2025 at 5:00:00 AM
  • A new paper introduces VIVAT, a systematic approach designed to mitigate common artifacts in the training of Variational Autoencoders (VAEs), which are crucial for generative computer vision. The study identifies five prevalent artifacts and proposes modifications to improve VAE performance, achieving state-of-the-art results in image reconstruction metrics and enhancing text-to-image generation quality.
  • The development of VIVAT is significant as it addresses the persistent issue of artifacts that degrade the quality of VAE outputs, thereby enhancing the reliability and effectiveness of generative models in various applications, including image reconstruction and generation.
  • This advancement reflects a broader trend in AI toward improving model robustness and performance through refined training techniques. Related studies exploring generative learning methods and the integration of visual attributes into model predictions highlight the same ongoing challenges in AI model training.
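For context on what such training modifications act on, the standard VAE objective balances a reconstruction term against a KL regularizer, and artifacts are often traced to imbalances between the two. A minimal NumPy sketch of that baseline objective (illustrative only, not the paper's implementation; `beta` is a common tuning knob, not a VIVAT parameter):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Standard VAE objective: reconstruction error plus a KL
    divergence pulling the latent posterior toward N(0, I)."""
    recon = np.mean((x - x_recon) ** 2)  # pixel-wise reconstruction error
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form, per latent dimension
    kl = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
    return recon + beta * kl

# Toy usage with a 4-dim latent posterior matching the prior exactly,
# so the KL term vanishes and only reconstruction error remains.
x = np.ones((2, 8))
x_recon = 0.9 * x
mu = np.zeros((2, 4))
log_var = np.zeros((2, 4))
loss = vae_loss(x, x_recon, mu, log_var)  # ~0.01 (pure reconstruction)
```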
— via World Pulse Now AI Editorial System


Continue Reading
AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Neutral · Artificial Intelligence
AlignBench has been introduced as a benchmark for evaluating fine-grained image-text alignment using synthetic image-caption pairs, addressing limitations in existing models like CLIP that rely on rule-based perturbations or short captions. This benchmark allows for a more detailed assessment of visual-language models (VLMs) by annotating each sentence for correctness.
CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios
Positive · Artificial Intelligence
CT-GLIP, a new 3D Grounded Language-Image Pretrained model, has been introduced to enhance the alignment of CT scans with radiology reports, addressing limitations in existing methods that rely on global embeddings. This model constructs fine-grained CT-report pairs to improve cross-modal contrastive learning, enabling better identification of organs and abnormalities in a zero-shot manner.
Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution
Positive · Artificial Intelligence
A new Mixture-of-Ranks (MoR) architecture has been proposed for one-step real-world image super-resolution (Real-ISR), integrating sparse Mixture-of-Experts (MoE) to enhance the adaptability of models in reconstructing high-resolution images from degraded samples. This approach utilizes a fine-grained expert partitioning strategy, treating each rank in Low-Rank Adaptation (LoRA) as an independent expert, thereby improving the model's ability to capture heterogeneous characteristics of real-world images.
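The rank-as-expert idea can be sketched as follows: a LoRA update W + BA is decomposed into rank-1 pieces b_r a_r^T, each treated as an expert that a router gates on or off. This is a hedged illustration under stated assumptions (routing here is conditioned on the input itself, whereas the paper routes on degradation cues; all names are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 4                       # feature dim, number of rank-1 experts

A = rng.normal(size=(r, d))       # LoRA "down" factors, one row per rank
B = rng.normal(size=(d, r))       # LoRA "up" factors, one column per rank
W_router = rng.normal(size=(d, r))  # hypothetical routing weights

def mixture_of_ranks(x, top_k=2):
    """Treat each LoRA rank as an expert: score all r ranks, keep the
    top-k, softmax their gates, and sum the gated rank-1 updates."""
    scores = x @ W_router                    # (r,) routing logits
    keep = np.argsort(scores)[-top_k:]       # indices of active ranks
    gates = np.exp(scores[keep])
    gates /= gates.sum()                     # softmax over kept experts
    # Each expert contributes g * b_i * (a_i . x), a rank-1 residual.
    delta = sum(g * B[:, i] * (A[i] @ x) for g, i in zip(gates, keep))
    return delta                             # low-rank update to W @ x

y = mixture_of_ranks(rng.normal(size=d))
```

Sparse routing means only `top_k` of the `r` rank-1 updates are computed per input, which is what lets the adapter specialize to heterogeneous degradations without paying for all ranks at once.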
Zero-shot self-supervised learning of single breath-hold magnetic resonance cholangiopancreatography (MRCP) reconstruction
Positive · Artificial Intelligence
A recent study explored the feasibility of zero-shot self-supervised learning for reconstructing magnetic resonance cholangiopancreatography (MRCP) images, aiming to reduce breath-hold times during scans. The research involved 11 healthy volunteers and compared zero-shot reconstruction with traditional methods, achieving significant acceleration in image acquisition times.
Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation
Positive · Artificial Intelligence
A new framework named Prompt-OT has been introduced to enhance the adaptation of vision-language models (VLMs) like CLIP, addressing challenges related to overfitting and zero-shot generalization during fine-tuning. This optimal transport-guided approach preserves the structural consistency of feature distributions between pre-trained and fine-tuned models, ensuring effective prompt learning.
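The regularization idea can be illustrated with a generic entropic optimal transport cost between pre-trained and fine-tuned feature sets: a small transport cost means fine-tuning has not distorted the feature distribution much. This is a standard Sinkhorn sketch, not the Prompt-OT implementation, and all parameter values are illustrative:

```python
import numpy as np

def sinkhorn_ot(F_pre, F_ft, eps=0.1, n_iter=100):
    """Entropic OT cost between two feature sets (rows = samples),
    usable as a regularizer penalizing distribution drift."""
    # Pairwise squared Euclidean cost matrix between the two sets
    C = ((F_pre[:, None, :] - F_ft[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / eps)                    # Gibbs kernel
    a = np.ones(len(F_pre)) / len(F_pre)    # uniform marginals
    b = np.ones(len(F_ft)) / len(F_ft)
    u = np.ones_like(a)
    for _ in range(n_iter):                 # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]         # transport plan
    return (P * C).sum()                    # OT cost = regularization term

rng = np.random.default_rng(1)
F = rng.normal(size=(5, 3))
cost_same = sinkhorn_ot(F, F)  # near zero: identical distributions
```

In a fine-tuning loop, a term like this would be added to the task loss so the adapted encoder's features stay close, in transport distance, to the frozen pre-trained encoder's features.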
LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders
Positive · Artificial Intelligence
The introduction of Lagrangian-Optimized Robust Embeddings (LORE) presents a new unsupervised adversarial fine-tuning framework aimed at enhancing the robustness of visual encoders against adversarial perturbations. This framework addresses critical limitations in existing fine-tuning strategies, particularly their instability and suboptimal trade-offs between robustness and accuracy on clean data.