VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos


TL;DR

This paper proposes VideoAgentTrek, a scalable method that automatically mines structured training data from unlabeled public screen-recording videos via an inverse dynamics module (VADM), removing the reliance on large-scale manually annotated data when training computer-use agents.

Background

At present, there are three main ways to obtain training data for computer-use agents:

  1. Manual annotation: Manually recording operation trajectories yields high-quality, highly accurate data, but the cost is extremely high, scaling is difficult, and the covered application scenarios are limited.
  2. Programmatic synthesis: Large amounts of interaction data are generated automatically in simulators or scripted environments. The scale is large and the action parameters are exact, but the data often lacks the diversity and complexity of real-world UIs and deviates from real scenarios.
  3. Web mining: Data is harvested from online tutorials, RPA logs, and similar resources. Coverage is broad and diversity is good, but the data usually lacks precise action time boundaries and structured action parameters, and its quality is uneven.

The key bottleneck in this research area is the lack of a data acquisition method that balances scale, diversity, and quality. This paper targets that core problem: how to automatically transform the vast amount of unstructured screen-recording video on the internet into structured interaction trajectories with precise parameters that can be used directly for agent training, thereby eliminating dependence on expensive manual annotation.

Method

The proposed VideoAgentTrek is a three-stage automated pipeline that converts unlabeled web videos into structured training data for computer-use agents.

VideoAgentTrek Overview Overview of VideoAgentTrek. (1) Video collection and preprocessing: crawl screen-recording tutorials and use GUI-Filter to select GUI operation clips. (2) VADM: an inverse dynamics module that first performs dense action event detection to localize clips and assign action types, then performs action parameterization (e.g., click coordinates, typed text) to produce structured $(\text{screenshot}, \text{action}, \text{parameter})$ trajectories. (3) Model pretraining and fine-tuning: use the mined trajectories for continued pretraining and supervised fine-tuning of computer-use agents.

Video Collection and Preprocessing

Video Collection

This paper adopts a scalable video collection strategy. It first uses seed keywords such as “Excel tutorial” to search for videos, then ingests all videos from high-quality channels (sample pass rate $\geq$ 80%) wholesale, using each channel’s tags and metadata for iterative discovery. This “channel consistency” strategy efficiently collected about 55,000 candidate videos (about 10,000 hours) with only a small amount of human supervision.

Video Collection Process Starting from seed keywords and tags, videos are searched and evaluated, expanded to related videos and high-quality channels (pass rate $\geq$80%), and GUI-containing videos are iteratively collected for VAT.
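The channel-consistency idea can be sketched as a simple filter: manually spot-check a sample of each channel, then ingest every video from channels whose sampled pass rate clears the 80% bar. The `Channel` schema and function names below are illustrative, not the paper's actual crawler.

```python
from dataclasses import dataclass

@dataclass
class Channel:
    name: str
    video_ids: list       # all video IDs published by the channel
    sampled_pass: int     # videos in a manual sample judged GUI-relevant
    sampled_total: int    # size of the manual sample

def pass_rate(ch: Channel) -> float:
    """Fraction of sampled videos that contain GUI content."""
    return ch.sampled_pass / ch.sampled_total if ch.sampled_total else 0.0

def collect(channels, threshold=0.8):
    """Ingest every video from channels whose sampled pass rate >= threshold;
    this amortizes a small amount of human review over whole channels."""
    accepted = []
    for ch in channels:
        if pass_rate(ch) >= threshold:
            accepted.extend(ch.video_ids)
    return accepted
```

In practice the accepted channels' tags and metadata would seed the next round of keyword search, giving the iterative discovery loop described above.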

Video Preprocessing

To precisely extract GUI-interaction clips from candidate videos, this paper developed the GUI-Filter model, a lightweight cursor detection model based on YOLOv8x that filters out non-interactive content such as slides. The filtering criterion is: a clip is retained only if it lasts at least 6 consecutive seconds and at least 80% of its frames contain a cursor. With this tool, 7,377 hours of valid GUI interaction recordings were extracted from the 10,000 hours of raw video.
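The retention rule reduces to two thresholds over per-frame cursor detections. A minimal sketch, assuming the cursor detector has already produced a boolean per frame (the `fps` and function name are illustrative):

```python
def keep_clip(cursor_flags, fps=30, min_seconds=6.0, min_cursor_ratio=0.8):
    """Decide whether to retain a candidate clip.

    cursor_flags: per-frame booleans from a cursor detector (e.g. YOLOv8x),
    True when a cursor is visible in that frame. The clip is kept only if it
    is at least min_seconds long AND the cursor appears in at least
    min_cursor_ratio of its frames.
    """
    if not cursor_flags:
        return False
    duration = len(cursor_flags) / fps
    if duration < min_seconds:
        return False
    ratio = sum(cursor_flags) / len(cursor_flags)
    return ratio >= min_cursor_ratio
```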

Data Analysis

The collected video data shows high quality in both resolution (97% are 720p or higher) and topic coverage. Analysis of titles and descriptions shows that the videos are mainly tutorials (69.6%), covering multiple domains such as operating systems, professional software, office work, and everyday applications, with operating system (OS)-related content accounting for the largest share (about 36%), ensuring both breadth and practicality of the data.

Domain Distribution Domain distribution chart

VADM: Inverse Dynamics Module

VADM is the technical core of this paper. It mimics the inverse dynamics idea in robotics, inferring the executed actions from observations (video pixels). Without manual annotation, this module can convert videos into structured \((screenshot, action, thought)\) sequences.

VADM Flowchart VADM pipeline: given a screen-recording video (optional subtitles), the module (1) detects GUI action events and segments clips, (2) parameterizes each action (type and parameters), and (3) generates step-level thoughts, ultimately producing a trainable {action clip, action, thought} sequence.

Action Event Detection

The goal of this stage is to perform dense event detection in the unlabeled video \(v\) and output a set of actions with precise start and end times \(\mathcal{S}=\{(a_k, t_k^{\mathrm{s}}, t_k^{\mathrm{e}})\}_{k=1}^{K}\).

\[f_{\theta}(v) \rightarrow \mathcal{S}=\{(a_{k},t_{k}^{\mathrm{s}},t_{k}^{\mathrm{e}})\}_{k=1}^{K},\quad a_{k}\in\mathcal{A},\ 0\leq t_{k}^{\mathrm{s}}<t_{k}^{\mathrm{e}}\leq T.\]

The paper uses the OpenCUA dataset to automatically generate timestamped GUI events as supervision data, and fully fine-tunes the Qwen2.5-VL-7B-Instruct model so that it can directly predict action types and their millisecond-level time boundaries from video.
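The constraints in the formulation above (known action type, $0 \leq t_k^{\mathrm{s}} < t_k^{\mathrm{e}} \leq T$) translate directly into a validity check that a pipeline would apply to the detector's raw predictions before using them downstream. A small sketch, with an assumed action vocabulary matching the evaluation tables later in the paper:

```python
# Assumed action vocabulary; the paper's full action space may differ.
ACTION_SPACE = frozenset({"click", "drag", "press", "scroll", "type"})

def valid_events(events, T, action_space=ACTION_SPACE):
    """Check detected events (a_k, t_s, t_e) against the paper's constraints:
    a_k in the action space and 0 <= t_s < t_e <= T (video duration)."""
    return all(
        a in action_space and 0 <= ts < te <= T
        for (a, ts, te) in events
    )
```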

Action Parameterization

After detecting the action clip $v_k=v[t_k^s:t_k^e]$, the goal of this stage is to identify the specific action parameters \(\pi_k\).

\[h_{\phi}(v_{k}) \rightarrow (\hat{a}_{k},\pi_{k}).\]

For example, for a click action, the output is \((click, (x,y))\); for a typing action, the output is \((type, <content>)\). Similarly, the paper uses the original logs from OpenCUA to generate supervision data and fine-tunes the Qwen2.5-VL model so that it can directly decode the action type and specific parameters from video clips.

Inner Monologue Generation

To help the model learn the intent behind each action, this paper also generates a short “inner monologue” \(r_k\) for every action. GPT-4.5 Medium is given contextual information such as the action type, parameters, screenshots before and after the action, and related automatic speech recognition (ASR) text, and produces a short description of the intent and plan. This turns the final data format into a ReAct-like \((screenshot, thought, action, parameters)\) sequence, which helps improve the model’s planning and reasoning abilities.
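Putting the three VADM outputs together, one mined step can be assembled into a ReAct-style training record. The field names below are illustrative, not the paper's exact schema:

```python
def to_react_step(screenshot, thought, action, params):
    """Assemble one ReAct-style step (screenshot, thought, action, parameters)
    from VADM outputs. Keys are hypothetical; the paper's trajectory schema
    may use different field names."""
    return {
        "observation": screenshot,                 # frame before the action
        "thought": thought,                        # generated inner monologue
        "action": {"type": action, "params": params},
    }

step = to_react_step(
    screenshot="frame_00123.png",
    thought="The Save dialog is open; click Save to confirm the filename.",
    action="click",
    params=(640, 360),
)
```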

Pretraining Computer-Use Models

This paper adopts a two-stage training strategy to validate the effectiveness of the VideoAgentTrek data.

Data Preparation

  1. VideoAgentTrek data: 39,000 videos were processed through the above pipeline, generating about 15.2 million interaction steps (about 26 billion tokens).
  2. Human-annotated data: Human-labeled trajectories from public datasets such as OpenCUA and AGUVIS were integrated, totaling about 8 billion tokens.
  3. GUI localization data: Localization pairs from the OSWorld-G dataset were introduced to enhance the model’s perception of interface elements, totaling about 1 billion tokens.
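The three sources above (roughly 26B, 8B, and 1B tokens) imply a heavily video-weighted mixture for stage 1. A quick sketch of the resulting fractions, assuming simple token-proportional sampling (the paper's actual sampling weights are not stated):

```python
def mixture_fractions(token_counts):
    """Convert per-source token counts into sampling fractions,
    assuming token-proportional mixing (an assumption, not the
    paper's stated recipe)."""
    total = sum(token_counts.values())
    return {name: count / total for name, count in token_counts.items()}

mix = mixture_fractions({
    "videoagenttrek": 26e9,  # mined video trajectories
    "human":          8e9,   # OpenCUA / AGUVIS human trajectories
    "grounding":      1e9,   # OSWorld-G localization pairs
})
# VideoAgentTrek data dominates at roughly 74% of tokens.
```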

Training Strategy

  1. Stage 1: Continued pretraining: The large-scale, diverse, but potentially noisy data generated by VideoAgentTrek is used to pretrain the Qwen2.5-VL-7B model. The goal of this stage is to let the model learn broad GUI interaction patterns and visual foundations.
  2. Stage 2: Supervised fine-tuning (SFT): Fine-tuning is performed on a small amount of high-quality, human-annotated data. The goal of this stage is to sharpen the model’s policy execution and instruction-following abilities on specific tasks.

This “learn broadly first, then refine carefully” strategy aims to fully leverage the breadth of large-scale video data to build robust foundational capabilities, and then use the precision of high-quality labeled data to optimize higher-level policies.
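The two-stage schedule can be expressed as a minimal orchestration sketch, with the actual training routines left as injected placeholders (none of the function names below come from the paper):

```python
def train_pipeline(base_model, vat_data, sft_data, pretrain_fn, finetune_fn):
    """Two-stage schedule: (1) continued pretraining on large-scale mined
    video trajectories to learn broad GUI interaction patterns, then
    (2) supervised fine-tuning on smaller, high-quality human data to
    sharpen policy execution. pretrain_fn / finetune_fn stand in for the
    real training loops (e.g. over Qwen2.5-VL-7B)."""
    model = pretrain_fn(base_model, vat_data)   # stage 1: learn broadly
    model = finetune_fn(model, sft_data)        # stage 2: refine carefully
    return model
```

The key design choice this encodes is ordering: the noisy-but-broad mined data never touches the final SFT stage, so the precise human-annotated distribution has the last word on the policy.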

Experimental Results

This paper validates the effectiveness of the method on two mainstream computer-use agent benchmarks: OSWorld-Verified (online real-world environment) and AgentNetBench (offline evaluation).

Experimental results Experimental results on OSWorld-Verified and AgentNetBench. VideoAgentTrek shows significant improvements over the baseline model, and test-time scaling in the number of steps brings additional performance gains.

Main Results:

Performance scaling with data size Performance scaling

Action Event Detector Evaluation (Held-out Test Set)

| Action Type | F1 Score | Precision |
| --- | --- | --- |
| Click | 0.817 | 0.949 |
| Drag | 0.449 | 0.583 |
| Press | 0.449 | 0.596 |
| Scroll | 0.840 | 0.985 |
| Type | 0.771 | 0.902 |
| Micro Avg. | 0.784 | 0.879 |


Action Parameterization Evaluation (Manual Blind Review)

| Action Type | Number Evaluated | Accuracy |
| --- | --- | --- |
| Click | 324 | 0.713 |
| Drag | 22 | 0.366 |
| Press | 47 | 0.362 |
| Scroll | 34 | 0.735 |
| Type | 73 | 0.671 |


Final Conclusion: The experimental results demonstrate that the vast amount of passive screen-recording video on the internet can be transformed into high-quality supervision signals. This provides a scalable and effective path for training more capable and robust computer-use agents, and a viable alternative to expensive human annotation.