MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Abstract

To enhance the efficiency of GUI agents on platforms such as smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., APIs, deep links) is emerging as a promising direction. However, systematic benchmarking of these hybrid agents remains underexplored. As a first step toward bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent's capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations, but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. The results also demonstrate the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents.

Highlights

Comprehensive Benchmark: We introduce MAS-Bench, the first benchmark for systematically evaluating GUI-shortcut hybrid mobile agents, featuring 139 tasks, 11 apps, 88 predefined shortcuts, and 7 evaluation metrics.
Hybrid Agent Baselines: We show that hybrid agents on MAS-Bench significantly outperform GUI-only counterparts in both success rate and efficiency.
Shortcut Generation Evaluation: We propose the first framework to evaluate an agent's ability to generate new shortcuts, revealing a key research gap between the performance of predefined and agent-generated shortcuts.

MAS-Bench

Figure: Workflow of GUI-Only vs. GUI-Shortcut Hybrid Agent. Shortcuts improve agent execution efficiency by bypassing GUI operations.

Figure: Functional Comparison of APIs, Deep Links, and RPA Scripts. Example from the Amazon app: (a) the open_cart() API opens the cart; (b) the search_product() deep link performs a product search; (c) an RPA script combines APIs, deep links, and GUI operations to automate a workflow.
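To make the distinction between the three shortcut types concrete, the following is a minimal sketch that models each one as a function producing trace entries. Only the names open_cart() and search_product() come from the caption above; the `Action` record, the `tap` helper, the `exampleshop://` URI scheme, the element names, and the particular RPA workflow are hypothetical and chosen purely for illustration. Nothing here talks to a real device.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import quote_plus


@dataclass
class Action:
    """One step in an execution trace; kind is 'gui', 'api', or 'deep_link'."""
    kind: str
    detail: str


def open_cart() -> Action:
    # API-style shortcut: a single call that jumps straight to a target page.
    return Action("api", "open_cart")


def search_product(keyword: str) -> Action:
    # Deep-link-style shortcut: a parameterized URI (hypothetical scheme).
    return Action("deep_link", f"exampleshop://search?q={quote_plus(keyword)}")


def tap(element: str) -> Action:
    # Ordinary GUI operation on a named UI element (hypothetical names).
    return Action("gui", f"tap:{element}")


def rpa_search_and_open_cart(keyword: str) -> List[Action]:
    # RPA-style script: a fixed workflow that mixes shortcuts with GUI steps.
    return [
        search_product(keyword),
        tap("first_result"),
        tap("add_to_cart_button"),
        open_cart(),
    ]


trace = rpa_search_and_open_cart("usb cable")
for step in trace:
    print(step.kind, step.detail)
```

In this sketch the RPA script is simply a composition of the other two shortcut kinds plus GUI taps, which mirrors panel (c) of the figure.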

Figure: The pipeline of MAS-Bench. The GUI-Shortcut agent first filters products using the search_product() shortcut, selects an item via GUI operations, and then adds it to the cart using the add_to_cart() shortcut. The entire process is monitored by an automated evaluation module, which outputs metrics such as success rate and efficiency.

Figure: Evaluation Workflow for Agents' Shortcut Generation Capability. The process consists of two stages: (a) the Shortcut Generation Stage, where the agent creates its shortcut knowledge base; and (b) the Quality Evaluation Stage, where the generated shortcuts are imported into a baseline agent for performance testing.

Experimental Results

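Our experiments show that hybrid agents evaluated on MAS-Bench achieve higher success rates and reduced interaction costs compared to GUI-only agents. For example, MAS-MobileAgent reaches an SR of 0.641 on single-app tasks and 0.617 on cross-app tasks, compared with 0.565 and 0.383 for M3A, the strongest GUI-only baseline, and its MSRS falls to 0.613 and 0.829 from M3A's 1.064 and 1.262. Detailed quantitative comparisons and ablations are provided below.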

Agent	Input: SS	Input: VH	SR ↑	Efficiency: MSRS ↓	Efficiency: MET ↓	Cost: MToC ↓	Cost: MSC ↑	GSAR ↑
Single-app Tasks (92 Tasks)
Human	✓		-	1.000	-	-	-	-
T3A		✓	0.511	1.056	137.641	346.382	0	0
M3A	✓	✓	0.565	1.064	192.775	155.281	0	0
MobileAgentV2	✓		0.446	1.058	1013.386	120.212	0	0
MobileAgent-E	✓		0.359	0.818	459.574	88.772	0.378	0.081
MAS-T3A (Ours)		✓	0.576	0.915	129.279	291.391	1.043	0.117
MAS-MobileAgent (Ours)	✓		0.641	0.613	682.547	99.780	1.348	0.345
Cross-app Tasks (47 Tasks)
Human	✓		-	1.000	-	-	-	-
T3A		✓	0.340	1.087	257.122	625.970	0	0
M3A	✓	✓	0.383	1.262	411.145	288.833	0	0
MobileAgentV2	✓		0.170	1.247	2053.133	227.128	0	0
MobileAgent-E	✓		0.064	0.934	469.109	85.859	2.250	0.177
MAS-T3A (Ours)		✓	0.511	0.643	185.911	440.222	2.213	0.185
MAS-MobileAgent (Ours)	✓		0.617	0.829	1441.586	189.836	3.128	0.320

Table: Performance comparison of our MAS agents and baseline methods on MAS-Bench with a predefined shortcut knowledge base. All agents use Gemini-1.5-Pro. SS and VH denote the Screenshot and View Hierarchy (UI tree) input modalities. SR is the success rate, MSRS is the Mean Step Ratio on Successful tasks, MET is the Mean Execution Time in seconds, MToC is the Mean Token Cost in thousands of tokens (kTokens), and MSC is the Mean Shortcut Call Count. The Shortcut Success Rate (SSR) of the predefined shortcuts is 1.0.
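The metrics named in the caption can be illustrated with a small aggregation sketch over per-task episode logs. The `Episode` record and its field names are hypothetical, and two details are assumptions rather than statements from the source: that the step ratio is the agent's step count divided by a human reference step count (consistent with the Human row having MSRS 1.000), and that MET, MToC, and MSC are averaged over all tasks rather than only successful ones. GSAR is left out because this section does not define it.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Episode:
    success: bool
    agent_steps: int
    human_steps: int      # reference step count from a human run (assumed)
    seconds: float        # wall-clock execution time
    tokens: int           # total tokens consumed by the agent
    shortcut_calls: int   # number of shortcut invocations


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    ok = [e for e in episodes if e.success]
    return {
        # Fraction of tasks completed successfully.
        "SR": len(ok) / len(episodes),
        # Mean step ratio, computed on successful tasks only.
        "MSRS": mean(e.agent_steps / e.human_steps for e in ok) if ok else float("nan"),
        # Mean execution time in seconds (assumed: over all tasks).
        "MET": mean(e.seconds for e in episodes),
        # Mean token cost in kTokens (assumed: over all tasks).
        "MToC": mean(e.tokens for e in episodes) / 1000,
        # Mean shortcut call count (assumed: over all tasks).
        "MSC": mean(e.shortcut_calls for e in episodes),
    }


demo = [
    Episode(True, 6, 10, 100.0, 80_000, 2),
    Episode(False, 20, 10, 300.0, 120_000, 0),
]
summary = summarize(demo)
print(summary)
```

Under these assumptions, an MSRS below 1.0 means the agent finished successful tasks in fewer steps than the human reference, which is the regime the hybrid agents in the table reach.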

Figure: Performance comparison of MAS-MobileAgent with and without shortcuts. The base models are Gemini-2.5-Pro and Gemini-2.0-Flash. Data points show the relationship between SR and MET for single-app and cross-app tasks, with circle size representing mean cost. Results demonstrate that shortcuts benefit both models, with more significant improvements for the weaker Gemini-2.0-Flash.

Figure: Examples of the resulting shortcut types. Action Replay shortcuts (Task-Level and Subtask-Level) use a sequence of actions with fixed indices, while Dynamic Shortcuts use variable arguments that correspond to UI elements.
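The difference between fixed-index replay and argument-based dynamic shortcuts can be sketched with a toy screen model. Everything below is hypothetical: a screen is just a list of element labels whose list position stands in for the UI index, and the two runner functions are illustrative stand-ins, not the benchmark's implementation.

```python
from typing import Dict, List, Tuple

# A screen is modeled as a list of element labels; list position = UI index.
Screen = List[str]


def run_replay(steps: List[Tuple[str, int]], screen: Screen) -> List[str]:
    """Action Replay: every step targets a fixed element index recorded earlier."""
    return [f"{op}:{screen[idx]}" for op, idx in steps]


def run_dynamic(steps: List[Tuple[str, str]], args: Dict[str, str],
                screen: Screen) -> List[str]:
    """Dynamic shortcut: every step names an argument that is grounded
    to a UI element on the current screen at call time."""
    out = []
    for op, arg_name in steps:
        label = args[arg_name]
        idx = screen.index(label)  # raises ValueError if the element is absent
        out.append(f"{op}:{screen[idx]}")
    return out


recorded_screen = ["Search", "Cart", "Settings"]
shifted_screen = ["Banner", "Search", "Cart", "Settings"]  # one element inserted

replay_shortcut = [("tap", 1)]          # recorded when index 1 was "Cart"
dynamic_shortcut = [("tap", "target")]  # argument resolved at run time

on_recorded = run_replay(replay_shortcut, recorded_screen)
on_shifted = run_replay(replay_shortcut, shifted_screen)
dynamic_on_shifted = run_dynamic(dynamic_shortcut, {"target": "Cart"}, shifted_screen)
print(on_recorded, on_shifted, dynamic_on_shifted)
```

In this toy setting the replayed index lands on a different element once the layout shifts, while the dynamic shortcut still finds the intended one. Real UI grounding is considerably harder than a label lookup, so this only illustrates the structural difference between the two shortcut forms.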

Shortcut	SR ↑	SSR ↑	MSRS ↓	MSC ↑	MET ↓
Human	-	-	1.00	-	-
Baseline	0.43	-	0.96	-	188.93
S_Predefined	0.52	1.00	0.71	1.45	152.15
S_Replay-Task	0.34	0.10	0.91	3.04	244.61
S_Replay-Subtask	0.43	0.73	1.13	1.22	236.67
S_Dynamic	0.38	0.75	0.82	0.91	216.24
S_MobileAgent-E	0.49	0.71	1.00	1.01	224.87

Table: Results of different shortcut generation methods. Column definitions: SR (Success Rate), SSR (Shortcut Success Rate), MSRS (Mean Step Ratio on Successful tasks), MSC (Mean Shortcut Call Count), MET (Mean Execution Time, in seconds).
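As a reading aid for the table above, the short sketch below computes each shortcut set's change in SR and MET relative to the Baseline row. The values are copied from the table; the `delta_vs_baseline` helper and the dictionary layout are illustrative names, not part of the benchmark.

```python
from typing import Dict

# SR and MET values copied from the table above.
rows: Dict[str, Dict[str, float]] = {
    "Baseline": {"SR": 0.43, "MET": 188.93},
    "S_Predefined": {"SR": 0.52, "MET": 152.15},
    "S_Replay-Task": {"SR": 0.34, "MET": 244.61},
    "S_Replay-Subtask": {"SR": 0.43, "MET": 236.67},
    "S_Dynamic": {"SR": 0.38, "MET": 216.24},
    "S_MobileAgent-E": {"SR": 0.49, "MET": 224.87},
}


def delta_vs_baseline(table: Dict[str, Dict[str, float]],
                      baseline: str = "Baseline") -> Dict[str, Dict[str, float]]:
    """Return metric differences (row minus baseline), rounded to 2 decimals."""
    base = table[baseline]
    return {
        name: {k: round(vals[k] - base[k], 2) for k in base}
        for name, vals in table.items()
        if name != baseline
    }


deltas = delta_vs_baseline(rows)
# Shortcut sets that are faster than the baseline (negative MET change).
faster = [name for name, d in deltas.items() if d["MET"] < 0]
for name, d in deltas.items():
    print(f"{name}: SR {d['SR']:+.2f}, MET {d['MET']:+.2f} s")
```

Read this way, only the predefined shortcuts reduce mean execution time relative to the baseline, while every agent-generated set increases it, which is the gap between predefined and generated shortcuts noted in the highlights.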

Figure: Failure cases for shortcut generation. This figure illustrates common failure modes, including (a) incorrect tool selection, (b) improper parameter grounding, and (c) catastrophic forgetting, which hinder the agent's ability to create robust and reliable shortcuts.

BibTeX

@misc{zhao2025masbench,
  title={MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents},
  author={Pengxiang Zhao and Guangyi Liu and Yaozhen Liang and Weiqing He and Zhengxi Lu and Yuehao Huang and Yaxuan Guo and Kexin Zhang and Hao Wang and Liang Liu and Yong Liu},
  year={2025},
  eprint={2509.06477},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.06477}
}