Structure over Pixels: Learning Variable-Length Visual Programs

arXiv (cs.LG), Friday, May 29, 2026 at 4:00:00 AM
  • What Happened

    Researchers have introduced STROP, a discrete visual tokenizer architecture that translates each image into an ordered sequence of codes (a variable-length "visual program"). It optimizes the length of that program per image and prioritizes structural representation over pixel reconstruction. A four-phase training curriculum lets the model estimate the active prefix length in a single forward pass.

  • Why It Matters

    STROP addresses a limitation of existing adaptive tokenizers, which often rely on pre-trained rates or post-hoc searches. It aims to improve how structural elements in visual data are represented and to enhance the capabilities of models like DINOv3.

  • The Bigger Picture

    This advancement reflects a broader trend in artificial intelligence where researchers are increasingly focusing on integrating structural understanding with visual representation, as seen in frameworks like SpatialBoost and Granulon, which also aim to enhance visual encoding through innovative methodologies and multi-granularity semantics.
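
To make the "active prefix" idea from the summary concrete, here is a minimal Python sketch. It is not STROP's implementation: the padding id `PAD`, the helper names, and the toy quality function are all hypothetical and chosen only for illustration. It contrasts a post-hoc search, which evaluates one prefix length after another, with simply truncating an ordered code sequence to a given length (the length a model would have to predict in one pass).

```python
from typing import Callable, List, Sequence

PAD = -1  # hypothetical padding id marking inactive positions


def truncate_to_prefix(codes: Sequence[int], active_len: int) -> List[int]:
    """Keep the first `active_len` codes and pad the remaining positions."""
    n = max(0, min(active_len, len(codes)))  # clamp to a valid length
    return list(codes[:n]) + [PAD] * (len(codes) - n)


def post_hoc_search(
    codes: Sequence[int],
    quality: Callable[[Sequence[int]], float],
    threshold: float,
) -> int:
    """Baseline: try each prefix length in turn and return the shortest
    one whose quality score reaches the threshold."""
    for n in range(1, len(codes) + 1):
        if quality(codes[:n]) >= threshold:
            return n
    return len(codes)


# Toy data: six ordered codes and a stand-in quality score that grows
# with prefix length (an assumption, not a real reconstruction metric).
codes = [17, 4, 92, 8, 33, 61]
toy_quality = lambda prefix: len(prefix) / len(codes)

searched_len = post_hoc_search(codes, toy_quality, threshold=0.5)
program = truncate_to_prefix(codes, searched_len)
print(searched_len, program)
```

The search loop calls the quality function once per candidate length, which is the cost a single-pass length estimate would avoid; `truncate_to_prefix` is the only step needed once a length is known.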

— via World Pulse Now AI Editorial System


