arXiv:2508.13024v2 Announce Type: replace 
Abstract: LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the user's needs. Benchmarks for evaluating web agents require agents to perform tasks either online, using the live Web, or offline, using simulated environments that allow for the exact reproduction of the experimental setup. While DeepShop provides an online benchmark that requires agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, or Mind2Web cover only comparatively simple e-commerce tasks that are performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison-shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced queries for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark: even the best-performing agents achieve task completion rates below 55% in the task categories cheapest product search and vague product search.
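
The abstract reports task completion rates per task category. As a minimal sketch of how such a rate could be computed, assuming each task outcome is recorded as a (category, solved) pair: the function name `completion_rates` and the sample outcomes below are hypothetical and are not taken from the WebMall code or its results.

```python
from collections import defaultdict


def completion_rates(results):
    """Return {category: fraction of tasks solved} for (category, solved) pairs."""
    totals = defaultdict(int)
    solved_counts = defaultdict(int)
    for category, solved in results:
        totals[category] += 1
        if solved:
            solved_counts[category] += 1
    return {category: solved_counts[category] / totals[category] for category in totals}


# Made-up outcomes purely for illustration; these are NOT results from the paper.
example = [
    ("cheapest product search", True),
    ("cheapest product search", False),
    ("vague product search", True),
    ("vague product search", True),
    ("vague product search", False),
    ("vague product search", False),
]
rates = completion_rates(example)
```

Here a task counts as completed only if it is marked solved; any partial-credit scoring the benchmark may use is outside the scope of this sketch.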

WebMall has been introduced as a new offline benchmark designed to evaluate large language model (LLM)-based web agents in performing complex e-commerce tasks across multiple shops, addressing a gap in existing benchmarks that focus on simpler, single-shop scenarios.

WebMall - A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]
