arXiv:2510.23554v1 Announce Type: cross 
Abstract: This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset, to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.
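
The hand-off between the detection and recognition stages can be illustrated with a minimal sketch. The abstract does not say how text regions are extracted from the U-Net output, so this assumes the network yields a binary mask and uses plain connected-component labelling (a common technique, not necessarily the authors' method) to turn it into bounding boxes that could then be cropped and passed to an OCR engine. The names `mask_to_boxes` and `min_area` are hypothetical.

```python
from collections import deque
from typing import List, Tuple

# (top, left, bottom, right), all inclusive pixel indices
Box = Tuple[int, int, int, int]


def mask_to_boxes(mask: List[List[int]], min_area: int = 1) -> List[Box]:
    """Return one bounding box per 4-connected region of nonzero pixels.

    Assumption: `mask` is a binary segmentation map (rows of 0/1 values),
    standing in for a thresholded U-Net output.
    """
    if not mask:
        return []
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []

    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over the region containing (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Drop tiny regions, which are likely segmentation noise.
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes
```

Each returned box would be cropped from the source image before recognition; the `min_area` filter is one simple way to discard speckle in the mask.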

A new multilingual image translation pipeline has been developed, combining a U-Net model for text detection, the Tesseract engine for text recognition, and a custom Transformer for Neural Machine Translation. This innovative approach enhances the accuracy of translating text within images, making it easier for users to access information across different languages. The integration of these technologies not only streamlines the translation process but also opens up new possibilities for applications in various fields, such as education and global communication.

A U-Net and Transformer Pipeline for Multilingual Image Translation

arXiv:2512.07590v1 Announce Type: new 
Abstract: To address the challenge of segmenting noisy images with blurred or fragmented boundaries, this paper presents a robust version of Variational Model Based Tailored UNet (VM_TUNet), a hybrid framework that integrates variational methods with deep learning. The proposed approach incorporates physical priors, an edge detector and a mean curvature term, into a modified Cahn-Hilliard equation, aiming to combine the interpretability and boundary-smoothing advantages of variational partial differential equations (PDEs) with the strong representational ability of deep neural networks. The architecture consists of two collaborative modules: an F module, which conducts efficient frequency domain preprocessing to alleviate poor local minima, and a T module, which ensures accurate and stable local computations, backed by a stability estimate. Extensive experiments on three benchmark datasets indicate that the proposed method achieves a balanced trade-off between performance and computational efficiency, which yields competitive quantitative results and improved visual quality compared to pure convolutional neural network (CNN) based models, while achieving performance close to that of transformer-based methods with reasonable computational expense.
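
For orientation, the standard building blocks named in the abstract can be written out. The abstract does not give the modified equation itself, so what follows is only the textbook Cahn-Hilliard equation with a double-well potential, the usual level-set mean curvature, and one commonly used edge-stopping function. The symbols $u$ (phase field), $\varepsilon$ (interface width), $I$ (input image), and $\beta$ (edge sensitivity) are assumptions for illustration, not the paper's notation.

```latex
% Standard Cahn-Hilliard equation with double-well potential W
\frac{\partial u}{\partial t}
  = \Delta\!\left( W'(u) - \varepsilon^{2}\,\Delta u \right),
\qquad
W(u) = \tfrac{1}{4}\left(u^{2} - 1\right)^{2}

% Mean curvature of the level sets of u
\kappa = \nabla \cdot \left( \frac{\nabla u}{\lvert \nabla u \rvert} \right)

% A common edge detector (edge-stopping function) on the image I
g\!\left(\lvert \nabla I \rvert\right)
  = \frac{1}{1 + \beta\,\lvert \nabla I \rvert^{2}}
```

How the edge detector and curvature term enter the modified equation is specific to the paper and is not reproduced here.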

A new study introduces a robust version of the Variational Model Based Tailored UNet (VM_TUNet), which integrates variational methods with deep learning to enhance image segmentation, particularly in noisy images with blurred boundaries. The framework employs an edge detector and a mean curvature term within a modified Cahn-Hilliard equation, demonstrating improved performance through two collaborative modules for efficient preprocessing and stable local computations.

Robust Variational Model Based Tailored UNet: Leveraging Edge Detector and Mean Curvature for Improved Image Segmentation

arXiv:2512.07747v1 Announce Type: new 
Abstract: Unified understanding and generation is a highly appealing research direction in multimodal learning. There exist two approaches: one trains a transformer via an auto-regressive paradigm, and the other adopts a two-stage scheme connecting pre-trained understanding and generative models for alignment fine-tuning. The former demands massive data and computing resources unaffordable for ordinary researchers. Though the latter requires a lower training cost, existing works often suffer from limited task coverage or poor generation quality. Both approaches lack the ability to parse input meta-information (such as task type, image resolution, video duration, etc.) and require manual parameter configuration that is tedious and non-intelligent. In this paper, we propose Unison, which adopts the two-stage scheme while preserving the capabilities of the pre-trained models well. With an extremely low training cost, we cover a variety of multimodal understanding tasks, including text, image, and video understanding, as well as diverse generation tasks, such as text-to-visual content generation, editing, controllable generation, and IP-based reference generation. We also equip our model with the ability to automatically parse user intentions, determine the target task type, and accurately extract the meta-information required for the corresponding task. This enables full automation of various multimodal tasks without human intervention. Experiments demonstrate that, under a low-cost setting of only 500k training samples and 50 GPU hours, our model can accurately and automatically identify tasks and extract relevant parameters, and achieve superior performance across a variety of understanding and generation tasks.

A new framework named Unison has been introduced, designed for unified understanding and generation in multimodal learning. This framework adopts a two-stage scheme that effectively utilizes pre-trained models while significantly reducing training costs, addressing the limitations of existing approaches that either require extensive data or suffer from poor generation quality.

