WebMall - A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]

arXiv (cs.CL), Wednesday, December 3, 2025 at 5:00:00 AM
  • WebMall has been introduced as a new offline benchmark designed to evaluate large language model (LLM)-based web agents in performing complex e-commerce tasks across multiple shops, addressing a gap in existing benchmarks that focus on simpler, single-shop scenarios.
  • The benchmark allows more comprehensive testing of how well web agents automate tasks such as product searching and ordering, which could in turn lead to more efficient e-commerce interactions.
  • The introduction of WebMall reflects a broader trend in AI research towards creating more sophisticated evaluation frameworks that can better simulate real-world scenarios, as seen in related advancements like DEPO, which aims to optimize the efficiency of LLM agents.
— via World Pulse Now AI Editorial System
