MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

1 Zhejiang University    2 vivo AI Lab    3 Peking University
Equal Contribution     Project Lead     Corresponding Author
Accepted to ACL 2026 main conference
Figure: The pipeline of MAS-Bench. The GUI-shortcut agent first filters products using the search_product shortcut, selects an item via GUI operations, and then adds it to the cart using the add_to_cart shortcut. The entire process is monitored by an automated evaluation module, which outputs metrics such as success rate and efficiency.
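The hybrid trajectory in the pipeline figure can be written down as a short list of steps. The following is only an illustrative sketch: the shortcut names search_product and add_to_cart come from the figure, while the Step class, the argument names, and the summary fields are assumptions made for this example and not the benchmark's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One agent action; `kind` is either "shortcut" or "gui" (hypothetical schema)."""
    kind: str
    name: str
    args: dict = field(default_factory=dict)


def shopping_trajectory(query: str) -> list:
    """Mirror the three stages described in the pipeline figure."""
    return [
        Step("shortcut", "search_product", {"query": query}),  # filter products
        Step("gui", "tap", {"target": "first result"}),        # select an item via the GUI
        Step("shortcut", "add_to_cart"),                        # add the item to the cart
    ]


def summarize(steps: list) -> dict:
    """Count how many steps were shortcut calls and how many were GUI actions."""
    shortcut_calls = sum(1 for s in steps if s.kind == "shortcut")
    return {
        "total_steps": len(steps),
        "shortcut_calls": shortcut_calls,
        "gui_actions": len(steps) - shortcut_calls,
    }


trajectory = shopping_trajectory("wireless earbuds")
print(summarize(trajectory))
```

Counts of this kind are the raw material for step-based and shortcut-usage metrics; how MAS-Bench aggregates them across tasks is defined by the benchmark itself, not by this sketch.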

Abstract

Shortcuts such as APIs and deep-links have emerged as efficient complements to flexible GUI operations, fostering a promising hybrid paradigm for MLLM-based mobile automation. However, systematic evaluation of GUI-shortcut hybrid agents remains largely underexplored. To bridge this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent's capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep-links, RPA scripts), and 9 evaluation metrics. Experiments demonstrate that hybrid agents achieve up to 68.3% success rate and 39% greater execution efficiency than GUI-only counterparts. Furthermore, our evaluation framework effectively reveals the quality gap between predefined and agent-generated shortcuts, validating its capability to assess shortcut generation methods. MAS-Bench addresses the lack of systematic benchmarks for GUI-shortcut hybrid mobile agents, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents.

Highlights

  • Comprehensive Benchmark: We introduce MAS-Bench, the first benchmark for systematically evaluating GUI-shortcut hybrid mobile agents, featuring 139 complex tasks, 11 real-world apps, 88 predefined shortcuts, and 9 evaluation metrics.
  • Hybrid Agent Baselines: We establish extensive baselines across agentic workflows, general-purpose models, and specialized GUI models, demonstrating that GUI-shortcut hybrid operation substantially improves success rate and efficiency while exposing shortcut misuse.
  • Shortcut Generation Evaluation: We propose the first framework to evaluate an agent's ability to generate shortcuts from interaction trajectories. Using predefined structural shortcuts as a reference upper-bound baseline, our experiments show that current generated shortcuts still lag behind predefined ones in efficiency and robustness.

MAS-Bench

Experiment Results

Experiments demonstrate that using the predefined knowledge base of 88 shortcuts improves task success rate (SR) and efficiency across multiple agent families. MAS-GLM-4.5V achieves the best overall SR of 68.3%. MAS-MobileAgent raises SR from 35.2% for its GUI-only counterpart, MobileAgentV2, to 63.3%, a 79.8% relative gain, and lowers the mean step ratio on successful tasks (MSRS) from 1.122 to 0.686, a 38.9% reduction.
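The two relative figures quoted for MAS-MobileAgent can be reproduced from the table values for MobileAgentV2 and MAS-MobileAgent (SR 0.352 and 0.633, MSRS 1.122 and 0.686). The helper names below are illustrative only.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative increase of `after` over `before`."""
    return (after - before) / before


def relative_reduction(before: float, after: float) -> float:
    """Relative decrease of `after` with respect to `before`."""
    return (before - after) / before


# SR and MSRS for MobileAgentV2 (GUI-only) and MAS-MobileAgent, taken from the table.
sr_gain = relative_gain(0.352, 0.633)
msrs_drop = relative_reduction(1.122, 0.686)

print(f"SR relative gain: {sr_gain:.1%}")      # 79.8%
print(f"MSRS reduction:   {msrs_drop:.1%}")    # 38.9%
```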

| Agent | SR ↑ | MS ↓ | MSRS ↓ | MET ↓ | MToC ↓ | MSC ↑ | S2GR ↑ |
|---|---|---|---|---|---|---|---|
| Agentic Workflow (Gemini-2.5-Pro) | | | | | | | |
| Human | - | 12.11 | 1.000 | - | - | - | - |
| M3A | 0.503 | 16.655 | 1.131 | 266.505 | 200.509 | 0 | 0 |
| MobileAgent-E | 0.259 | 4.808 | 0.857 | 462.636 | 87.819 | 1.011 | 0.114 |
| T3A | 0.453 | 15.755 | 1.067 | 178.005 | 440.919 | 0 | 0 |
| + MAS-T3A (Ours) | 0.554 | 12.768 | 0.823 | 148.505 | 341.402 | 1.438 | 0.140 |
| MobileAgentV2 | 0.352 | 19.252 | 1.122 | 1364.979 | 156.441 | 0 | 0 |
| + MAS-MobileAgent (Ours) | 0.633 | 12.928 | 0.686 | 951.918 | 130.980 | 1.948 | 0.341 |
| General-Purpose Models | | | | | | | |
| Qwen2.5-VL-3B | 0.036 | 21.626 | 1.341 | 120.975 | - | 0 | 0 |
| Qwen2.5-VL-7B | 0.022 | 21.835 | 1.498 | 168.832 | - | 0 | 0 |
| Qwen3-VL-4B | 0.228 | 17.547 | 1.273 | 179.208 | - | 0 | 0 |
| + MAS-Qwen3-VL-4B (Ours) | 0.237 | 20.403 | 0.971 | 188.078 | - | 2.461 | 0.170 |
| Qwen3-VL-8B | 0.259 | 16.576 | 1.183 | 179.377 | - | 0 | 0 |
| + MAS-Qwen3-VL-8B (Ours) | 0.425 | 14.921 | 0.868 | 155.977 | - | 1.309 | 0.110 |
| Qwen3-VL-32B | 0.338 | 17.834 | 1.207 | 394.102 | - | 0 | 0 |
| + MAS-Qwen3-VL-32B (Ours) | 0.446 | 15.913 | 0.848 | 358.282 | - | 3.216 | 0.276 |
| Qwen3-VL-235B | 0.417 | 16.604 | 1.111 | 185.201 | - | 0 | 0 |
| + MAS-Qwen3-VL-235B (Ours) | 0.525 | 14.424 | 0.785 | 152.652 | - | 2.784 | 0.233 |
| GLM-4.5V | 0.526 | 17.237 | 1.194 | 281.928 | - | 0 | 0 |
| + MAS-GLM-4.5V (Ours) | 0.683 | 14.050 | 0.900 | 238.579 | - | 1.051 | 0.097 |
| Specialized GUI Models | | | | | | | |
| UI-TARS-1.5-7B | 0.287 | 19.209 | 1.191 | 188.143 | - | 0 | 0 |
| GUI-Owl-7B | 0.295 | 15.568 | 1.239 | 168.148 | - | 0 | 0 |
| ScaleCUA-7B | 0.108 | 18.194 | 1.202 | 123.018 | - | 0 | 0 |
| + MAS-ScaleCUA-7B (Ours) | 0.115 | 19.446 | 1.358 | 141.535 | - | 0.007 | 0.000 |
| ScaleCUA-32B | 0.231 | 19.209 | 1.286 | 141.654 | - | 0 | 0 |
| + MAS-ScaleCUA-32B (Ours) | 0.216 | 18.935 | 1.260 | 140.126 | - | 0.000 | 0.000 |
| MAI-UI-8B | 0.489 | 19.237 | 1.247 | 180.678 | - | 0 | 0 |
| + MAS-MAI-UI-8B (Ours) | 0.583 | 18.065 | 1.179 | 169.463 | - | 0.273 | 0.018 |

Table: Overall performance comparison on MAS-Bench with the predefined shortcut knowledge base (139 tasks). Results are weighted averages based on the task distribution (92 single-app tasks, 47 cross-app tasks). Methods marked with "+" are shortcut-augmented variants of their corresponding baselines. SS: Screenshot; VH: View Hierarchy; SR: Success Rate; MS: Mean Steps; MSRS: Mean Step Ratio on Successful tasks; MET: Mean Execution Time (seconds); MToC: Mean Token Cost (in thousands); MSC: Mean Shortcut Call Count; S2GR: Shortcut-to-GUI action Ratio.
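As a rough illustration of the weighting mentioned in the caption, an overall score can be formed from per-subset scores using the 92 and 47 task counts. The per-subset values used below are hypothetical placeholders, since only the combined numbers are reported here, and the function name is an assumption made for this sketch.

```python
def weighted_average(single_app: float, cross_app: float,
                     n_single: int = 92, n_cross: int = 47) -> float:
    """Combine per-subset metric values, weighting each by its number of tasks."""
    total = n_single + n_cross  # 139 tasks in MAS-Bench
    return (single_app * n_single + cross_app * n_cross) / total


# Hypothetical per-subset success rates, used only to show the arithmetic.
overall = weighted_average(single_app=0.70, cross_app=0.50)
print(round(overall, 3))  # 0.632
```

Because single-app tasks make up roughly two thirds of the benchmark, the overall value sits closer to the single-app score than a plain mean of the two subsets would.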

Figure: Performance comparison of MAS-MobileAgent with and without shortcuts. The base models are Gemini-2.5-Pro and Gemini-2.0-Flash. Data points show the relationship between SR and MET for single-app and cross-app tasks, with circle size representing mean cost. Shortcuts benefit both models, and the improvement is larger for the weaker Gemini-2.0-Flash.

Shortcut Examples

Figure: Examples of the resulting shortcut types. Action Replay shortcuts (Task-Level and Subtask-Level) use a sequence of actions with fixed indices, while Dynamic Shortcuts use variable arguments that correspond to UI elements.
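The difference between the two shortcut families can be sketched with two small data structures. This is a simplified illustration under stated assumptions: the class names, the (action, index) representation, and the example shortcut names are hypothetical and are not taken from the MAS-Bench code.

```python
from dataclasses import dataclass


@dataclass
class ReplayShortcut:
    """Action Replay: a recorded action sequence with fixed UI element indices."""
    name: str
    actions: list  # list of (action, element_index) pairs

    def expand(self) -> list:
        # Nothing to bind: the recorded indices are replayed as-is.
        return list(self.actions)


@dataclass
class DynamicShortcut:
    """Dynamic shortcut: the element slots are variables bound at call time."""
    name: str
    template: list  # list of (action, parameter_name) pairs

    def expand(self, **bindings) -> list:
        # Substitute each parameter with the UI element index chosen for this call.
        return [(action, bindings[param]) for action, param in self.template]


replay = ReplayShortcut("open_wifi_settings", [("tap", 3), ("tap", 7)])
dynamic = DynamicShortcut("search_and_open",
                          [("tap", "search_box"), ("tap", "result_item")])

print(replay.expand())
print(dynamic.expand(search_box=2, result_item=5))
```

A replayed sequence only works while the recorded indices still point at the same elements, whereas a dynamic shortcut leaves that choice to the caller on the current screen.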

| Shortcut | SR ↑ | SSR ↑ | MSRS ↓ | MSC ↑ | MET ↓ |
|---|---|---|---|---|---|
| Human | - | - | 1.00 | - | - |
| Baseline | 0.43 | - | 0.96 | - | 188.93 |
| S_Predefined | 0.52 | 1.00 | 0.71 | 1.45 | 152.15 |
| S_Replay-Task | 0.34 | 0.10 | 0.91 | 3.04 | 244.61 |
| S_Replay-Subtask | 0.43 | 0.73 | 1.13 | 1.22 | 236.67 |
| S_Dynamic | 0.38 | 0.75 | 0.82 | 0.91 | 216.24 |
| S_MobileAgent-E | 0.49 | 0.71 | 1.00 | 1.01 | 224.87 |

Table: Results of different shortcut generation methods. SR: Success Rate; SSR: Shortcut Success Rate; MSRS: Mean Step Ratio on Successful tasks; MSC: Mean Shortcut Call Count; MET: Mean Execution Time (seconds).
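To make the gap between predefined and generated shortcuts concrete, the SR column of the table can be compared against the shortcut-free baseline (SR 0.43). The snippet only restates table values; shortcut names are written with an underscore for the subscript, and the variable names are illustrative.

```python
BASELINE_SR = 0.43  # "Baseline" row of the table

# SR per shortcut source, copied from the table.
sr_by_method = {
    "S_Predefined": 0.52,
    "S_Replay-Task": 0.34,
    "S_Replay-Subtask": 0.43,
    "S_Dynamic": 0.38,
    "S_MobileAgent-E": 0.49,
}

# Change in success rate relative to the baseline without shortcuts.
delta = {name: sr - BASELINE_SR for name, sr in sr_by_method.items()}
best = max(sr_by_method, key=sr_by_method.get)

for name, d in sorted(delta.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {d:+.2f}")
print("highest SR:", best)
```

On these numbers only the predefined shortcuts and S_MobileAgent-E exceed the baseline SR, which is consistent with the statement that generated shortcuts still lag behind predefined ones.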

Failure Case Analysis

Figure: Examples of GUI-shortcut hybrid agent failure cases. (a) Selection and Planning Error: the agent incorrectly invokes the search_hotel() shortcut instead of searching for attractions. (b) Behavioral and Adaptation Error: the agent repeatedly calls the open_shorts() shortcut instead of switching to a GUI-only action to select the video.
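The second failure pattern, repeated calls to the same shortcut, is easy to flag in an action log. The check below is a minimal sketch and not part of MAS-Bench: the function name, the log format (a list of action names), and the repeat threshold are all assumptions chosen for illustration.

```python
def is_stuck_on_shortcut(history: list, candidate: str, max_repeats: int = 2) -> bool:
    """Return True if the last `max_repeats` actions were all the candidate shortcut.

    A planner could use such a signal to prefer a GUI action on the next step.
    """
    if len(history) < max_repeats:
        return False
    return all(action == candidate for action in history[-max_repeats:])


# The agent has already called open_shorts twice in a row and proposes it again.
print(is_stuck_on_shortcut(["open_shorts", "open_shorts"], "open_shorts"))  # True
# A GUI action intervened, so the repeat pattern is broken.
print(is_stuck_on_shortcut(["open_shorts", "tap"], "open_shorts"))          # False
```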

BibTeX

@misc{zhao2025masbench,
  title={MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents},
  author={Pengxiang Zhao and Guangyi Liu and YaoZhen Liang and Weiqing He and Zhengxi Lu and WenHao Wang and Yuehao Huang and Yuxiang Chai and Zhaolu Kang and Yaxuan Guo and Hao Wang and Kexin Zhang and Liang Liu and Yong Liu},
  year={2025},
  eprint={2509.06477},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.06477},
}