arXiv:2511.08133v1 Announce Type: new 
Abstract: Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text, collectively degrading recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, establishing new records across 9 out of 14 evaluation scenarios.
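
The abstract does not say how DAME computes its differential attention maps. As a rough illustration only, the sketch below assumes one common reading of the idea: two softmax attention maps are computed from separate query/key projections, and the second is subtracted from the first so that attention both maps place on the same (presumably background) positions cancels out. The function names, the weighting factor `lam`, and the toy inputs are hypothetical and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Assumed form: (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) applied to values.

    Returns the differential map and the attended output.
    """
    map_a = attention_map(q1, k1)
    map_b = attention_map(q2, k2)
    # Weights shared by both maps are reduced; weights unique to map_a survive.
    diff = [
        [a - lam * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]
    width = len(values[0])
    out = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(width)]
        for row in diff
    ]
    return diff, out


if __name__ == "__main__":
    # Toy example: 2 query positions attending over 3 key/value positions.
    q = [[1.0, 0.0], [0.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    zero_q = [[0.0, 0.0], [0.0, 0.0]]  # second map becomes uniform
    diff, out = differential_attention(q, k, zero_q, k, v, lam=0.5)
    print(diff)
    print(out)
```

Because each softmax row sums to 1, every row of the differential map sums to `1 - lam`, and individual weights can be negative. A real encoder would typically learn `lam` and normalize the result, which this sketch leaves out.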

OTSNet, a new framework for Scene Text Recognition (STR), addresses the difficulty of recognizing text in complex real-world environments. Its neurocognitive-inspired Observation-Thinking-Spelling pipeline aims to improve accuracy by reducing error propagation between the visual and linguistic stages and by improving visual feature extraction. The work targets limitations of existing STR systems and could lead to better performance in applications such as autonomous driving and augmented reality.

OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition
