Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
Positive · Artificial Intelligence
- A new framework named BayesVLA has been introduced to enhance Vision-Language-Action (VLA) models by addressing catastrophic forgetting during fine-tuning. The framework factorizes the policy, via Bayes' rule, into a visual-action prior and a language-conditioned likelihood, promoting better generalization and instruction following.
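The factorization described above can be sketched in a toy form. The sketch below is an illustrative assumption, not the paper's implementation: it treats the policy as proportional to a visual-action prior p(a | o) times a language-conditioned likelihood p(l | a, o), combining the two in log space. All action names and probability values are hypothetical.

```python
import math

def select_action(prior, likelihood):
    """Combine a visual-action prior with a language-conditioned likelihood
    in log space and return the normalized posterior over actions:
    pi(a | o, l) ∝ p(a | o) * p(l | a, o)."""
    log_post = {a: math.log(prior[a]) + math.log(likelihood[a]) for a in prior}
    # Subtract the max before exponentiating for numerical stability.
    m = max(log_post.values())
    unnorm = {a: math.exp(v - m) for a, v in log_post.items()}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

# Toy example: vision alone favors "grasp", but the instruction's
# likelihood shifts the posterior toward "push".
prior = {"grasp": 0.6, "push": 0.3, "place": 0.1}       # p(a | o), hypothetical
likelihood = {"grasp": 0.1, "push": 0.8, "place": 0.1}  # p(l | a, o), hypothetical
posterior = select_action(prior, likelihood)
print(max(posterior, key=posterior.get))  # → push
```

The point of the decomposition is visible even in this toy: the language term only reweights actions the visual prior already supports, which is one way such a factorization can guard against both visual shortcuts and language forgetting.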
- The development of BayesVLA is significant because it mitigates the modality imbalance inherent in VLA datasets, which previously biased models toward visual shortcuts and caused them to forget language conditioning. This advance is expected to improve the performance and reliability of VLA models across diverse applications.
- The introduction of BayesVLA aligns with ongoing efforts to refine VLA frameworks, echoing approaches that enhance action generation, improve efficiency, and strengthen spatial understanding. These innovations reflect a broader trend in AI research toward more robust, adaptable models that generalize across tasks and environments.
— via World Pulse Now AI Editorial System