arXiv:2510.19368v2 Announce Type: replace-cross 
Abstract: Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. Thus, AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.

The Augmentation-driven Multiview Audio Transformer (AMAuT) has been introduced as a novel framework that trains from scratch, overcoming limitations of existing foundational models in audio processing. This framework supports arbitrary sample rates and audio lengths, enhancing its versatility in various applications.

AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch
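The abstract names a conv1 + conv7 + conv1 one-dimensional CNN bottleneck but does not give channel counts, strides, or padding. The following is a minimal sketch of that general pattern, assuming stride 1 and zero padding of 3 around the kernel-7 layer so that the output length tracks the input length. The helper names (`conv1d`, `bottleneck`, `constant_weights`) and the constant weights are hypothetical and purely illustrative; they are not taken from the paper.

```python
def conv1d(x, weight, padding=0):
    """Plain 1-D convolution (cross-correlation) with stride 1.

    x:      list of C_in channels, each a list of T samples
    weight: list of C_out filters, each a list of C_in kernels of length K
    """
    c_in, t = len(x), len(x[0])
    k = len(weight[0][0])
    # Zero-pad both ends of every channel.
    padded = [[0.0] * padding + list(ch) + [0.0] * padding for ch in x]
    t_out = t + 2 * padding - k + 1
    out = []
    for filt in weight:
        row = []
        for i in range(t_out):
            acc = 0.0
            for c in range(c_in):
                for j in range(k):
                    acc += filt[c][j] * padded[c][i + j]
            row.append(acc)
        out.append(row)
    return out


def bottleneck(x, w_reduce, w_temporal, w_expand):
    """conv1 -> conv7 -> conv1 (assumed: stride 1, padding 3 on the conv7)."""
    h = conv1d(x, w_reduce)               # kernel size 1: mix/reduce channels
    h = conv1d(h, w_temporal, padding=3)  # kernel size 7: temporal context
    return conv1d(h, w_expand)            # kernel size 1: expand channels


def constant_weights(c_out, c_in, k, value):
    """Hypothetical stand-in for learned weights."""
    return [[[value] * k for _ in range(c_in)] for _ in range(c_out)]


w_reduce = constant_weights(2, 4, 1, 0.25)   # 4 -> 2 channels
w_temporal = constant_weights(2, 2, 7, 0.1)  # 2 -> 2 channels, 7-tap kernel
w_expand = constant_weights(4, 2, 1, 0.5)    # 2 -> 4 channels

# The same weights apply to clips of different lengths.
for length in (50, 137):
    clip = [[1.0] * length for _ in range(4)]
    out = bottleneck(clip, w_reduce, w_temporal, w_expand)
    print(length, len(out), len(out[0]))
```

Because a convolution slides the same kernel along the time axis, nothing in this sketch fixes the clip length in advance, which is one plausible reason a convolutional front end suits variable-length audio.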

arXiv:2511.18698v1 Announce Type: cross 
Abstract: This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initial lightweight implementation using YOLOv8, ByteTrack, and the Audio Spectrogram Transformer (AST), and an advanced version that incorporates multi-model audio ensembles, hybrid object detection, bidirectional cross-modal attention, and multi-method anomaly detection. The evolution demonstrates significant improvements in accuracy, robustness, and industrial applicability. The advanced system combines three audio models (AST, Wav2Vec2, and HuBERT) for comprehensive audio understanding, dual object detectors (YOLO and DETR) for improved accuracy, and sophisticated fusion mechanisms for enhanced cross-modal learning. Experimental evaluation shows the system's effectiveness in general monitoring scenarios as well as specialized industrial safety applications, achieving real-time performance on standard hardware while maintaining high accuracy.

A comprehensive multimodal room-monitoring system has been developed, integrating synchronized video and audio processing for real-time activity recognition and anomaly detection. The system has undergone two iterations, with the advanced version featuring multi-model audio ensembles and hybrid object detection methods, significantly enhancing its accuracy and robustness.

Multimodal Real-Time Anomaly Detection and Industrial Applications
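The abstract says the advanced system combines three audio models (AST, Wav2Vec2, and HuBERT) but does not specify how their outputs are merged. One common option is weighted late fusion of per-model class probabilities, sketched below. The function name `fuse_probabilities`, the class labels, and the probability values are all hypothetical and chosen only for illustration.

```python
def fuse_probabilities(model_probs, weights=None):
    """Weighted average of class-probability vectors from several models.

    model_probs: dict mapping model name -> list of class probabilities
    weights:     optional dict mapping model name -> non-negative weight
    """
    names = sorted(model_probs)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_classes = len(model_probs[names[0]])
    fused = [0.0] * n_classes
    for name in names:
        probs = model_probs[name]
        if len(probs) != n_classes:
            raise ValueError("all models must score the same classes")
        for i, p in enumerate(probs):
            fused[i] += weights[name] * p / total
    return fused


# Hypothetical labels and scores for a single audio window.
labels = ["speech", "glass_break", "silence"]
scores = {
    "ast": [0.6, 0.3, 0.1],
    "wav2vec2": [0.5, 0.4, 0.1],
    "hubert": [0.2, 0.7, 0.1],
}

fused = fuse_probabilities(scores)
best = labels[max(range(len(fused)), key=fused.__getitem__)]
print(best, fused)
```

Averaging probabilities keeps the ensemble simple and lets a strongly confident model outvote two weakly confident ones; per-model weights could be tuned on validation data if some models are more reliable for a given deployment.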

arXiv:2509.00221v3 Announce Type: replace 
Abstract: Speech and sensor time-series data both encode information in the time and frequency domains, such as spectral powers and waveform shapelets. We show that speech foundation models learn representations that generalize beyond the speech domain and achieve state-of-the-art performance on diverse time-series tasks from wearable sensors. Probes trained on features extracted from HuBERT and wav2vec 2.0 outperform those extracted from self-supervised models trained directly on modality-specific datasets for mood classification, arrhythmia detection, and activity classification tasks. We find that the convolutional feature encoders of speech models are particularly relevant for wearable sensor applications. The proposed approach enhances performance on data-scarce time-series tasks using simple probing methods. This work takes a step toward developing generalized time-series models that unify speech and sensor modalities.

Recent research demonstrates that speech foundation models, such as HuBERT and wav2vec 2.0, can effectively generalize to time series tasks derived from wearable sensor data, achieving state-of-the-art performance in areas like mood classification and arrhythmia detection.
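The probing setup described in the abstract can be illustrated with a minimal sketch: frame-level features from a frozen encoder are pooled over time, and only a small linear classifier is trained on top. The code below assumes mean pooling and a logistic-regression probe, neither of which is confirmed by the abstract, and it uses synthetic random vectors as a stand-in for real HuBERT or wav2vec 2.0 features. All function names are hypothetical.

```python
import math
import random


def mean_pool(frames):
    """Average a [T][D] sequence of frame features into one D-dim vector."""
    t = len(frames)
    return [sum(f[d] for f in frames) / t for d in range(len(frames[0]))]


def train_linear_probe(xs, ys, lr=0.5, epochs=200):
    """Logistic-regression probe trained with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            for d in range(dim):
                gw[d] += err * x[d]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Synthetic stand-in for frozen-encoder outputs: each recording is a
# variable-length sequence of 4-dim frames whose mean depends on the class.
rng = random.Random(0)
xs, ys = [], []
for label, mu in ((0, -1.0), (1, 1.0)):
    for _ in range(20):
        n_frames = rng.randint(10, 20)
        frames = [[rng.gauss(mu, 0.5) for _ in range(4)] for _ in range(n_frames)]
        xs.append(mean_pool(frames))
        ys.append(label)

w, b = train_linear_probe(xs, ys)
accuracy = sum(predict(w, b, x) == y for x, y in zip(xs, ys)) / len(xs)
print(accuracy)
```

The encoder is never updated in this setup; only the probe's few parameters are fit, which is why probing is attractive when labeled sensor data are scarce.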
