OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
The OSWorld-MCP benchmark evaluates computer-use agents on tool invocation, a capability that prior assessments focused on GUI interaction have largely neglected. It comprises 158 high-quality tools, validated for functionality, across seven common applications. In the reported evaluations, agents given MCP tools achieved higher task success rates: OpenAI o3 rose from 8.3% to 20.4%, and Claude 4 Sonnet from 40.1% to 43.3%. Even so, the strongest models invoke tools at a rate of only 36.3%, which leaves room for improvement and underscores the value of assessing tool invocation alongside GUI operations. The OSWorld-MCP benchmark not only sets a new standard for evaluation but also emphasizes the potential of multimodal agents in real-world scenarios, paving the way for future…
— via World Pulse Now AI Editorial System