UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework

arXiv cs.CV • Thursday, December 4, 2025 at 5:00:00 AM

UniMo is an autoregressive model that jointly generates and understands 2D human videos and 3D human motions within a single framework. It targets the structural and distributional differences between 2D and 3D data, a problem that existing methods have largely left unexplored.
By modeling both modalities together, UniMo could improve how AI systems generate coherent, contextually rich human motion and video, with potential applications in animation, gaming, and virtual reality.
The work reflects a broader trend in AI research toward unifying diverse data modalities, also seen in other recent frameworks that use large language models (LLMs) for generative tasks such as storytelling and scene synthesis. Integrating modalities is becoming increasingly important for building more sophisticated, interactive AI systems.

— via World Pulse Now AI Editorial System
