AndroidControl Evaluation Pipeline

This folder contains the evaluation pipeline for the AndroidControl benchmark.

You can download the AndroidControl test split images folder from Hugging Face. Each subfolder is named after an episode_id, and each image inside it is named screenshot_{step}.png.
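
The folder layout described above can be sketched as a small path helper. This is only an illustration: the function name `screenshot_path` is hypothetical and not part of this repo, and it assumes `step` appears in the file name exactly as given (no zero padding).

```python
from pathlib import Path


def screenshot_path(screenshot_dir, episode_id, step):
    """Build <screenshot_dir>/<episode_id>/screenshot_<step>.png.

    Hypothetical helper: assumes the layout described in this README and
    that the step index is written without zero padding.
    """
    return Path(screenshot_dir) / str(episode_id) / f"screenshot_{step}.png"


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(screenshot_path("images", 123, 4).as_posix())
```

Returning a `Path` rather than a string keeps the helper portable across operating systems.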

For more information about the benchmark, please refer to the official AndroidControl repo.

For more details on our experiments, please refer to Sections 3.2 and E.3 of our paper.

Quick Start

To use our sample of 500 steps, skip to step 2 (gpt_plan.py) and use data/500_steps as the input file.

To use our GPT-4o generated plan results, skip to step 4: Grounding Model Inference and use data/query_gpt-4o_{level}.jsonl as the question file.

Pipeline Steps

1. sample.py

Sample a subset of data from the whole dataset.

python sample.py --input_file <full_test_steps_json> --output_file <sample_json> -n <num_samples>

The input_file was generated by processing the original AndroidControl data according to the rules specified in the paper. You can download it from Hugging Face. As in the paper, we use n=500, and the samples are in data/500_steps.
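
As a rough idea of what drawing such a subset involves, here is a minimal sketch of uniform sampling without replacement. It is not the code in sample.py: the function name `sample_steps`, the fixed seed, and the assumption that the steps are held in a Python list are all illustrative.

```python
import random


def sample_steps(steps, n, seed=0):
    """Draw n steps uniformly at random, without replacement.

    Assumptions (not taken from sample.py): `steps` is a list, and a fixed
    seed is used so that the same subset is drawn on every run.
    """
    if n > len(steps):
        raise ValueError(f"cannot sample {n} items from {len(steps)} steps")
    rng = random.Random(seed)  # local generator, leaves global state alone
    return rng.sample(steps, n)


if __name__ == "__main__":
    # Toy data standing in for the full list of test steps.
    toy_steps = [{"id": i} for i in range(2000)]
    subset = sample_steps(toy_steps, 500)
    print(len(subset))
```

A fixed seed makes the subset reproducible, which matters when results are compared across models.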

2. gpt_plan.py

Generate plan files using GPT models.

export OPENAI_API_KEY="Your OpenAI API Key"
python gpt_plan.py --model <gpt_model> --input_file <sample_jsonl> --output_file <plan_jsonl> --screenshot_dir <screenshot_dir> --level <task_level>

For gpt_model, we use gpt-4o-2024-05-13 and gpt-4-turbo-2024-04-09.
For task_level, use "high" or "low".

The GPT-4o generated plan files we use are in data/plan_gpt-4o_{level}.jsonl.
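
To run the planner for both levels, the command above can be assembled programmatically. The sketch below only builds and prints the command lines using the flags shown in this step; the helper name `gpt_plan_command` and the example file names are hypothetical.

```python
import shlex

LEVELS = ("high", "low")  # the two task levels named in this README


def gpt_plan_command(model, input_file, output_file, screenshot_dir, level):
    """Return the gpt_plan.py command as an argument list.

    Hypothetical helper: it only mirrors the flags documented above.
    """
    if level not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}, got {level!r}")
    return [
        "python", "gpt_plan.py",
        "--model", model,
        "--input_file", input_file,
        "--output_file", output_file,
        "--screenshot_dir", screenshot_dir,
        "--level", level,
    ]


if __name__ == "__main__":
    for level in LEVELS:
        # Placeholder paths; replace with your own files.
        cmd = gpt_plan_command(
            "gpt-4o-2024-05-13", "sample.jsonl",
            f"plan_{level}.jsonl", "images", level,
        )
        print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute it; remember that OPENAI_API_KEY must be set in the environment first.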

3. extract_grounding_query.py

Extract grounding queries from the plan files.

python extract_grounding_query.py --sample_file <sample_jsonl> --input_file <plan_jsonl> --output_file <query_jsonl> --screenshot_dir <screenshot_dir>

The queries extracted from GPT-4o plan files are in data/query_gpt-4o_{level}.jsonl.
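
The plan and query files use the .jsonl extension, which conventionally means one JSON object per line. A minimal reader for inspecting them might look like the sketch below; the helper name `read_jsonl` is illustrative, and no field names are assumed because the record schema is not described here.

```python
import json


def read_jsonl(path):
    """Load a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate empty lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records
```

For example, `len(read_jsonl(path))` gives the number of records in a file, which is a quick sanity check before running inference on it.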

4. Grounding Model Inference

Perform grounding model inference using the query file generated in the previous step.

To use UGround-V1, please refer to the UGround-V1 Inference Guidelines and the scripts provided in the ../../grounding folder.

To compare with our results, use data/query_gpt-4o_{level}.jsonl as the question file.

5. eval.py

Evaluate the Step Accuracy and Grounding Accuracy based on plan and grounding results.

python eval.py --sample_file <sample_jsonl> --plan_file <plan_jsonl> --ans_file <grounding_answer_jsonl>

To compare with our results, use the following:

data/500_steps as the sample_file
data/plan_gpt-4o_{level}.jsonl as the plan_file
the grounding model's inference output on data/query_gpt-4o_{level}.jsonl as the ans_file
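
The exact metric definitions live in eval.py and the paper. As a rough intuition only, grounding is commonly scored by checking whether a predicted click point falls inside the target element's bounding box, and an accuracy is the fraction of steps that pass. The sketch below assumes that convention; the function names and the (left, top, right, bottom) box format are hypothetical and may differ from eval.py.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside (left, top, right, bottom), edges included.

    Assumed convention, not taken from eval.py.
    """
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    # Made-up predictions and target boxes for illustration.
    predictions = [(5, 5), (11, 5), (0, 10), (3, 7)]
    targets = [(0, 0, 10, 10)] * 4
    hits = [point_in_box(p, b) for p, b in zip(predictions, targets)]
    print(accuracy(hits))
```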
