MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Zhejiang University · vivo AI Lab · Huzhou Institute of Zhejiang University
Figure: The pipeline of MAS-Bench. The GUI-shortcut hybrid agent first filters products with the search_product shortcut, selects an item via GUI operations, and then adds it to the cart with the add_to_cart shortcut. The entire process is monitored by an automated evaluation module, which outputs metrics such as success rate and efficiency.
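
To make the hybrid control flow concrete, here is a minimal, self-contained Python sketch of the trajectory in the figure. The shortcut names (search_product, add_to_cart) come from the caption; the action schema itself is an assumption for illustration, not the MAS-Bench implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str              # "shortcut" or "gui" (hypothetical schema)
    name: str              # shortcut name or GUI primitive
    args: dict = field(default_factory=dict)

# Hybrid trajectory from the figure: shortcuts bracket one flexible GUI step.
trajectory = [
    # One shortcut call replaces the whole open-app -> type -> filter flow.
    Action("shortcut", "search_product", {"query": "earbuds", "max_price": 50}),
    # Visual item selection still needs a flexible GUI operation.
    Action("gui", "tap", {"element": "best-rated item"}),
    # A second shortcut skips the multi-screen add-to-cart dialog.
    Action("shortcut", "add_to_cart", {"quantity": 1}),
]

gui_steps = sum(a.kind == "gui" for a in trajectory)
shortcut_calls = sum(a.kind == "shortcut" for a in trajectory)
print(f"{gui_steps} GUI step(s) + {shortcut_calls} shortcut call(s)")
```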

Abstract

To enhance the efficiency of GUI agents on various platforms such as smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., APIs, deep links) is emerging as a promising direction. However, frameworks for systematically benchmarking these hybrid agents remain underexplored. As a first step toward bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents, with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent's capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations, but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut-generation capabilities. MAS-Bench fills a critical evaluation gap and provides a foundational platform for future advancements in creating more efficient and robust intelligent agents.
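
As an illustration of what a shortcut knowledge base can contain, the sketch below shows one hypothetical entry per shortcut category named above (API, deep link, RPA script). The schema, names, and URI values are assumptions for illustration, not the actual MAS-Bench knowledge base.

```python
# Hypothetical shortcut knowledge-base entries; only the three categories
# (API, deep link, RPA script) come from the paper.
SHORTCUT_KB = [
    {
        "name": "search_product",
        "type": "deep_link",                      # jumps straight to a screen
        "uri": "exampleshop://search?q={query}",  # placeholder URI scheme
        "params": ["query"],
    },
    {
        "name": "add_to_cart",
        "type": "api",                            # backend/intent-level call
        "endpoint": "cart.add",
        "params": ["item_id", "quantity"],
    },
    {
        "name": "clear_notifications",
        "type": "rpa_script",                     # fixed recorded UI steps
        "steps": [("swipe_down", {}), ("tap", {"text": "Clear all"})],
        "params": [],
    },
]
```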

Highlights

  • Comprehensive Benchmark: We introduce MAS-Bench, the first benchmark for systematically evaluating GUI-shortcut hybrid mobile agents, featuring 139 tasks, 11 apps, 88 predefined shortcuts, and 7 evaluation metrics.
  • Hybrid Agent Baselines: We show that hybrid agents on MAS-Bench significantly outperform their GUI-only counterparts in both success rate and efficiency.
  • Shortcut Generation Evaluation: We propose the first framework to evaluate an agent's ability to generate new shortcuts, revealing a key research gap between the performance of predefined and agent-generated shortcuts.

MAS-Bench

Experimental Results

Our experiments show that hybrid agents evaluated on MAS-Bench achieve significantly higher success rates and lower interaction costs than GUI-only agents. Detailed quantitative comparisons and ablations are provided below.

Single-app Tasks (92 tasks)

| Agent | SR ↑ | MSRS ↓ | MET (s) ↓ | MToC (kTokens) ↓ | MSC ↑ | GSAR ↑ |
|---|---|---|---|---|---|---|
| Human | – | 1.000 | – | – | – | – |
| T3A | 0.511 | 1.056 | 137.641 | 346.382 | 0 | 0 |
| M3A | 0.565 | 1.064 | 192.775 | 155.281 | 0 | 0 |
| MobileAgentV2 | 0.446 | 1.058 | 1013.386 | 120.212 | 0 | 0 |
| MobileAgent-E | 0.359 | 0.818 | 459.574 | 88.772 | 0.378 | 0.081 |
| MAS-T3A (Ours) | 0.576 | 0.915 | 129.279 | 291.391 | 1.043 | 0.117 |
| MAS-MobileAgent (Ours) | 0.641 | 0.613 | 682.547 | 99.780 | 1.348 | 0.345 |

Cross-app Tasks (47 tasks)

| Agent | SR ↑ | MSRS ↓ | MET (s) ↓ | MToC (kTokens) ↓ | MSC ↑ | GSAR ↑ |
|---|---|---|---|---|---|---|
| Human | – | 1.000 | – | – | – | – |
| T3A | 0.340 | 1.087 | 257.122 | 625.970 | 0 | 0 |
| M3A | 0.383 | 1.262 | 411.145 | 288.833 | 0 | 0 |
| MobileAgentV2 | 0.170 | 1.247 | 2053.133 | 227.128 | 0 | 0 |
| MobileAgent-E | 0.064 | 0.934 | 469.109 | 85.859 | 2.250 | 0.177 |
| MAS-T3A (Ours) | 0.511 | 0.643 | 185.911 | 440.222 | 2.213 | 0.185 |
| MAS-MobileAgent (Ours) | 0.617 | 0.829 | 1441.586 | 189.836 | 3.128 | 0.320 |

Table: Performance comparison of our MAS agents and baseline methods on MAS-Bench with the predefined-shortcut knowledge base. All agents use Gemini-1.5-Pro as the base model. SS and VH denote the Screenshot and View Hierarchy (UI tree) input modalities available to the agents. MSRS is the Mean Step Ratio on Successful tasks (agent steps relative to human steps), MET is the Mean Execution Time in seconds, MToC is the Mean Token Cost in thousands of tokens (kTokens), and MSC is the Mean Shortcut Call count. The SSR (Shortcut Success Rate) of the predefined shortcuts is 1.0.
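
To make the efficiency metric concrete, the snippet below computes MSRS under the stated definition, assuming the step ratio is taken relative to human steps (consistent with Human = 1.000 in the table) and averaged over successful tasks only. The task records are made-up toy data, not benchmark results.

```python
def msrs(tasks):
    """Mean Step Ratio over Successful tasks (human steps = denominator)."""
    ratios = [t["agent_steps"] / t["human_steps"]
              for t in tasks if t["success"]]
    return sum(ratios) / len(ratios)

tasks = [
    {"success": True,  "agent_steps": 6,  "human_steps": 8},   # shortcut saved steps
    {"success": True,  "agent_steps": 12, "human_steps": 10},  # some detours
    {"success": False, "agent_steps": 30, "human_steps": 9},   # failed: excluded
]
print(msrs(tasks))  # 0.975 -- below 1.0 means fewer steps than a human
```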

Figure: Performance comparison of MAS-MobileAgent with and without shortcuts. The base models are Gemini-2.5-Pro and Gemini-2.0-Flash. Data points show the relationship between SR and MET for single-app and cross-app tasks, with circle size representing mean cost. Results demonstrate that shortcuts benefit both models, with more significant improvements for the weaker Gemini-2.0-Flash.

Shortcut Examples

Figure: Examples of the generated shortcut types. Action Replay shortcuts (Task-Level and Subtask-Level) replay a recorded sequence of actions with fixed element indices, while Dynamic Shortcuts take variable arguments that correspond to UI elements.
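
The contrast between the two generated-shortcut families can be sketched in a few lines of Python. The representation below is hypothetical (including the `device` driver interface); it only illustrates the fixed-index-replay versus argument-grounding distinction described in the caption.

```python
# Action Replay shortcut: a frozen action sequence addressed by fixed
# element indices; it breaks if the screen layout changes.
replay_search = [
    ("tap",  {"index": 3}),         # search box at a fixed element index
    ("type", {"text": "earbuds"}),  # hard-coded query
    ("tap",  {"index": 7}),         # search button at a fixed element index
]

# Dynamic shortcut: the same flow with variable arguments that are grounded
# to UI elements at run time, so the query and targets can vary per screen.
def search_product(device, query: str):
    device.tap(device.find("search box"))     # element resolved at run time
    device.type_text(query)                   # caller-supplied argument
    device.tap(device.find("search button"))
```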

| Shortcut | SR ↑ | SSR ↑ | MSRS ↓ | MSC ↑ | MET (s) ↓ |
|---|---|---|---|---|---|
| Human | – | – | 1.00 | – | – |
| Baseline | 0.43 | – | 0.96 | – | 188.93 |
| S_Predefined | 0.52 | 1.00 | 0.71 | 1.45 | 152.15 |
| S_Replay-Task | 0.34 | 0.10 | 0.91 | 3.04 | 244.61 |
| S_Replay-Subtask | 0.43 | 0.73 | 1.13 | 1.22 | 236.67 |
| S_Dynamic | 0.38 | 0.75 | 0.82 | 0.91 | 216.24 |
| S_MobileAgent-E | 0.49 | 0.71 | 1.00 | 1.01 | 224.87 |

Table: Results of different shortcut generation methods. Column definitions: SR (success rate), SSR (shortcut success rate), MSRS (mean step ratio on successful tasks), MSC (mean shortcut call count), MET (mean execution time in seconds).

Failure Case Analysis

Figure: Failure cases for shortcut generation. This figure illustrates common failure modes, including (a) incorrect tool selection, (b) improper parameter grounding, and (c) catastrophic forgetting, which hinder the agent's ability to create robust and reliable shortcuts.

BibTeX

@misc{zhao2025masbench,
  title={MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents},
  author={Pengxiang Zhao and Guangyi Liu and Yaozhen Liang and Weiqing He and Zhengxi Lu and Yuehao Huang and Yaxuan Guo and Kexin Zhang and Hao Wang and Liang Liu and Yong Liu},
  year={2025},
  eprint={2509.06477},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.06477},
}