Qwen3.5 env-0 Mobile SFT: 300 Trajectories, 300 Steps, and a 3-Point Eval Lift


A technical report on the BenchFlow env-0 mobile SFT pipeline: 300 GPT-5.4-mini teacher trajectories, Prime-SFT conversion, 300-step Qwen3.5 BF16 LoRA training, and a Fireworks-hosted standard60 eval that moved from 19/180 to 24/180.

This post documents a small but complete post-training loop for an agent model: collect environment trajectories, convert them into SFT data, train a LoRA adapter, host the adapter, and evaluate it through the same agent harness.

The result is intentionally modest. A 300-trajectory SFT run improved Qwen/Qwen3.5-9B on the original 300-task denominator from 4/300 to 16/300. On the final Fireworks-hosted standard60 eval, the same baseline-versus-SFT comparison moved from 19/180 to 24/180, a +2.78 percentage point change.

That is directionally positive, but not a claim of broad generalization. The useful result is that the full loop works end to end and leaves behind reviewable artifacts: task sets, teacher trajectories, SFT rows, training logs, model commits, and eval trajectories.

Result Summary

The final same-host Fireworks comparison used 60 env-0 tasks with 3 trials per task:

| Eval | Baseline | SFT | Delta |
| --- | --- | --- | --- |
| Fireworks standard60, 3 trials | 19/180 (10.56%) | 24/180 (13.33%) | +5 passes, +2.78 pp |

The earlier same-denominator mobile eval, using the 300-task set that produced the SFT data, showed a larger movement:

| Eval | Baseline | SFT | Delta |
| --- | --- | --- | --- |
| env-0-mobile tasks-eval, 300 rows | 4/300 (1.33%) | 16/300 (5.33%) | +12 passes, +4.00 pp |

Held-out mobile generalization was weaker. On an unseen100 split from env-0-mobile/tasks-train, baseline was 3/100 and SFT was 2/100.

The headline should therefore be precise:

We validated the SFT pipeline and saw a +2.78 point Fireworks standard60 lift, but the result is not statistically strong and does not yet prove broad held-out generalization.

Artifact Index

GitHub:

Hugging Face:

Pipeline

env-0 task-set curation
  -> GPT-5.4-mini teacher rollouts with BenchFlow + OpenHands + Daytona
  -> BenchFlow PR828 results.jsonl emission
  -> bench train convert to Prime SFT JSONL
  -> Qwen/Qwen3.5-9B BF16 LoRA SFT, 300 steps
  -> adapter publication to benchflow/benchflow-qwen35-9b
  -> base/SFT eval on 300-task denominator
  -> held-out unseen100 check
  -> Fireworks-hosted standard60 3-trial comparison

1. Task Set And Data Boundary

The training-data source was the 300-task env-0-mobile/tasks-eval set. That set was drawn from generated mobile tasks that had reward-1 strong-model trajectories, and was then balanced across six task families: auth, calendar, docs, drive, mail, and multi-app.

The task-set metadata records env-0 commit 1de52f23aceeb65bc08a01cfad75593910e14c09. The PR828 reproduction report records the runnable submodule snapshot used during the run as 21e358f7360a9704355556c3d0c8f6466bf5e9c2.

The relevant data boundary:

  • tasks-eval: 300 copied task directories used for teacher trajectory collection and the same-denominator SFT comparison.
  • tasks-train: 1703 tasks after removing those 300 eval tasks.
  • Final Fireworks standard60 eval: env-0/tasks, not the 300 env-0-mobile/tasks-eval training set.

We separately checked exact task ids. The SFT training task set and the final Fireworks standard60 task set had zero exact task-id overlap. This is not the same as saying the eval is held out by generator family or domain.
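An exact-id overlap check of this kind can be sketched in a few lines. This is a minimal illustration, not the script used for the report: it assumes each task is an immediate subdirectory whose name is its task id, and the helper names `task_ids` and `exact_overlap` are hypothetical.

```python
from pathlib import Path


def task_ids(task_root: Path) -> set[str]:
    """Collect task ids, assuming each immediate subdirectory name is one id."""
    return {p.name for p in task_root.iterdir() if p.is_dir()}


def exact_overlap(train_ids: set[str], eval_ids: set[str]) -> set[str]:
    """Return ids present in both sets; an empty set means no exact-id overlap."""
    return train_ids & eval_ids
```

An empty result only rules out identical ids. It says nothing about near-duplicate tasks that share a generator family or domain, which is the caveat stated above.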

2. Teacher Trajectory Generation

The 300 SFT rows came from real BenchFlow rollouts:

  • Teacher: Azure GPT-5.4-mini.
  • Agent: OpenHands.
  • Sandbox: Daytona.
  • Task set: env-0/env-0-mobile/tasks-eval.
  • Output contract: Verifiers/Prime-RL-shaped results.jsonl, plus original trajectory artifacts.

The worker-sharded run initially produced 81/300 passes and 5 infra errors. The five infra-error tasks were refilled, then the run was canonicalized to one healthy LLM trajectory per task.
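The canonicalization step can be pictured as a simple group-and-select pass over all attempts, including refills. The sketch below is an assumption-laden illustration: the field names `task_id` and `infra_error` are hypothetical, and keeping the first healthy attempt per task is one possible tie-break rule, not necessarily the one used in the run.

```python
from typing import Iterable


def canonicalize(rows: Iterable[dict]) -> dict[str, dict]:
    """Keep one healthy row per task id.

    `task_id` and `infra_error` are hypothetical field names. The first
    healthy attempt seen for a task wins; later attempts are ignored.
    """
    canonical: dict[str, dict] = {}
    for row in rows:
        if row.get("infra_error"):
            continue  # infra failures never become the canonical trajectory
        canonical.setdefault(row["task_id"], row)
    return canonical
```

With refills appended after the original shard output, a task whose first attempt hit an infra error ends up represented by its refill.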

Canonical teacher set:

| Metric | Value |
| --- | --- |
| Canonical pass count | 83/300 |
| results.jsonl rows | 300 |
| prime-sft.jsonl rows | 300 |
| Rows with tool calls | 175/300 |
| Source LLM exchanges | 2163 |
| Skipped conversion rows | 0 |

The SFT data used all 300 teacher trajectories, not just the 83 teacher-pass trajectories. That choice gave broader coverage but also mixed successful and unsuccessful behavior.

3. BenchFlow PR828 Data Bridge

BenchFlow PR #828 added the bridge from evaluation artifacts to training data:

  • job-level results.jsonl aggregation;
  • bench train validate;
  • bench train convert;
  • Prime-SFT export and normalization.

The bridge was validated with a synthetic no-spend row, a one-task Docker oracle sanity check, a 10-task spendful canary, and the full 300-task run. The 10-task canary produced 10 Verifiers-shaped rows, 10 Prime SFT rows, 7 rows with tool calls, and zero skipped rows.
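The row counts quoted for the canary (rows, rows with tool calls) are the kind of summary that a small independent script can recompute from a JSONL file. The sketch below is not the `bench train validate` implementation. It assumes a hypothetical row shape with a `messages` list whose entries may carry a `tool_calls` field, and the real schema may differ.

```python
import json
from typing import Iterable


def summarize_rows(lines: Iterable[str]) -> dict[str, int]:
    """Count JSONL rows and rows containing at least one tool call.

    Assumes a hypothetical shape: {"messages": [{"role": ..., "tool_calls": [...]}]}.
    """
    total = 0
    with_tools = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        row = json.loads(line)
        total += 1
        if any(msg.get("tool_calls") for msg in row.get("messages", [])):
            with_tools += 1
    return {"rows": total, "rows_with_tool_calls": with_tools}
```

Passing an open file handle works because a file iterates line by line, for example `summarize_rows(open("prime-sft.jsonl"))`.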

4. SFT Training

The student was Qwen/Qwen3.5-9B, loaded from the full non-prequantized checkpoint. Training updated BF16 LoRA adapters rather than all model parameters.

Training run:

env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

Configuration:

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen3.5-9B |
| Base precision | BF16 |
| Base loading | Full, non-quantized |
| Sequence length | 8192 |
| Train rows | 300 |
| Training steps | 300 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Batch size | 1 |
| Gradient accumulation | 8 |
| Learning rate | 1e-4 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
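A few derived quantities follow from the table and help when comparing against other runs. The sketch below rests on assumptions that are not stated in the report: one training step is taken to be one optimizer step, the run is taken to use a single device, and the adapter is taken to use the standard LoRA scaling factor alpha / rank. The `RunConfig` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    train_rows: int = 300
    steps: int = 300          # assumed to be optimizer steps
    batch_size: int = 1
    grad_accum: int = 8
    lora_rank: int = 32
    lora_alpha: int = 64
    num_devices: int = 1      # assumption: single-GPU run

    @property
    def effective_batch(self) -> int:
        # Sequences contributing to each optimizer update.
        return self.batch_size * self.grad_accum * self.num_devices

    @property
    def samples_seen(self) -> int:
        return self.steps * self.effective_batch

    @property
    def approx_epochs(self) -> float:
        return self.samples_seen / self.train_rows

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scaling; rank-stabilized variants scale differently.
        return self.lora_alpha / self.lora_rank


cfg = RunConfig()
print(cfg.effective_batch, cfg.samples_seen, cfg.approx_epochs, cfg.lora_scale)
# prints: 8 2400 8.0 2.0
```

Under those assumptions the run makes roughly eight passes over the 300 rows. If a step instead counted micro-batches, the figure would be one pass.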

An A100 40GB feasibility check OOMed before training. The accepted run used a Prime Intellect H100 80GB instance under the BenchFlow team context.

Training completed all 300 steps:

| Metric | Value |
| --- | --- |
| Completed steps | 300/300 |
| Best step | 300 |
| Best eval loss | 0.4590291380882263 |
| Saved checkpoints | 100, 200, 300 |

The final adapter was published to benchflow/benchflow-qwen35-9b.

5. Same-Denominator 300-Task Eval

The trained adapter was first evaluated on the same 300-task denominator used for teacher trajectory collection.

Baseline:

  • Model: official Qwen/Qwen3.5-9B.
  • Serving: SGLang on Lambda A100.
  • Result: 4/300 pass, 217/300 rows with tool calls.

SFT:

  • Model: Qwen3.5 SFT adapter.
  • Serving: SGLang runtime LoRA on H100.
  • Result: 16/300 pass, 215/300 rows with tool calls.

Comparison:

| Stage | Pass | Pass rate | Rows with tool calls |
| --- | --- | --- | --- |
| Base | 4/300 | 1.33% | 217/300 |
| SFT | 16/300 | 5.33% | 215/300 |

On this seen denominator, the adapter added 12 passes and 4.00 percentage points. On the 83 tasks solved by GPT-5.4-mini, base was 3/83 and SFT was 13/83.

6. Held-Out Unseen100 Check

A held-out 100-task split was selected from env-0-mobile/tasks-train, after excluding all 300 task ids from the canonical SFT denominator.

| Stage | Pass | Pass rate |
| --- | --- | --- |
| Base | 3/100 | 3.00% |
| SFT | 2/100 | 2.00% |

This is why the 300-task result should not be presented as broad generalization. The adapter improved the seen denominator, but the held-out mobile split did not improve.
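The split construction described in this section amounts to filtering out the 300 seen ids and then choosing 100 of the remainder. A minimal sketch follows, assuming seeded uniform sampling. The report does not say how the 100 tasks were chosen, so the sampling rule, the seed, and the function name `select_held_out` are all illustrative.

```python
import random
from typing import Iterable


def select_held_out(
    candidate_ids: Iterable[str],
    excluded_ids: Iterable[str],
    k: int = 100,
    seed: int = 0,
) -> list[str]:
    """Sample k task ids from the candidates after dropping every excluded id."""
    excluded = set(excluded_ids)
    # Sort the pool so the same seed always yields the same split.
    pool = sorted(set(candidate_ids) - excluded)
    if len(pool) < k:
        raise ValueError(f"only {len(pool)} eligible tasks, need {k}")
    return random.Random(seed).sample(pool, k)
```

Writing the selected ids to a file before any rollout starts keeps the held-out list auditable later.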

7. Fireworks Standard60 Eval

The final comparison moved to Fireworks managed deployments:

  • Baseline deployment: openai/accounts/bingran-you/deployments/env0-qwen3p5-9b-standard60
  • SFT deployment: openai/accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live
  • Benchmark: env-0 standard60 (env-0/tasks).
  • Agent: OpenHands.
  • Sandbox: Daytona.
  • Rollout count: 60 tasks x 3 trials = 180 rows per model.
  • Skill mode: with-skill.

Final aggregate:

| Model | Strict pass | Pass rate | Rows with tools | Unscored |
| --- | --- | --- | --- | --- |
| Fireworks Qwen3.5 baseline | 19/180 | 10.56% | 175/180 | 0 |
| Fireworks BenchFlow Qwen3.5 SFT | 24/180 | 13.33% | 180/180 | 0 |

Trial breakdown:

| Model | Trial | Pass |
| --- | --- | --- |
| Baseline | trial-01-20260626T165651Z | 5/60 |
| Baseline | trial-02-20260627T055735Z | 7/60 |
| Baseline | trial-03-20260627T072748Z-proxy | 7/60 |
| SFT | trial-01-20260627T022446Z | 8/60 |
| SFT | trial-02-20260627T082221Z | 6/60 |
| SFT | trial-03-20260627T090839Z | 10/60 |

The strict pass delta:

24/180 - 19/180 = +5/180
= +2.78 percentage points

A rough two-proportion check gives p ~= 0.42, with a 95% confidence interval for the delta of about [-3.92 pp, +9.47 pp]. The result is directionally positive, operationally validated, and not yet statistically strong.
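One way to reproduce a check of this shape is a pooled two-proportion z-test for the p-value combined with an unpooled (Wald) standard error for the interval. The sketch below is an assumption about the method, not a record of the exact calculation behind the numbers above. It also treats all 180 rows per model as independent, although each task was run three times.

```python
import math


def two_proportion_check(pass_a: int, n_a: int, pass_b: int, n_b: int,
                         z_crit: float = 1.959964):
    """Return (delta, two-sided p-value, 95% CI) for rate_b - rate_a."""
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    delta = p_b - p_a

    # Pooled standard error under the null hypothesis of equal pass rates.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = delta / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

    # Unpooled (Wald) standard error for the confidence interval.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (delta - z_crit * se_unpooled, delta + z_crit * se_unpooled)
    return delta, p_value, ci


delta, p_value, (lo, hi) = two_proportion_check(19, 180, 24, 180)
print(f"delta={delta * 100:.2f} pp, p={p_value:.2f}, "
      f"CI=[{lo * 100:.2f}, {hi * 100:.2f}] pp")
```

Run on 19/180 versus 24/180, this should land close to the figures quoted above. Because repeated trials on the same task are correlated, the true uncertainty is likely somewhat wider than this interval suggests.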

What We Learned

This experiment validated several pieces of the training loop:

  1. BenchFlow can emit trainable, verifier-shaped trajectory records through results.jsonl.
  2. The PR828 bridge can convert real OpenHands rollouts into Prime-SFT rows without dropping rows.
  3. A small 300-row SFT set can change Qwen/Qwen3.5-9B behavior on the same denominator.
  4. The trained adapter can be hosted behind Fireworks and still produce structured tool calls for OpenHands.
  5. The final 180-row standard60 comparison is directionally positive with zero unscored rows.

The limitation is equally important: 300 mixed-quality teacher trajectories were not enough to show robust held-out transfer.

Next Step

The next credible experiment should keep the proven loop but improve the data:

  1. Build a larger disjoint train set from env-0-mobile/tasks-train.
  2. Prefer verifier-selected or action-quality-filtered trajectories over all raw teacher trajectories.
  3. Preserve exact task-list provenance before spendful runs.
  4. Train from a fixed Qwen/Qwen3.5-9B source checkpoint.
  5. Evaluate on both held-out mobile tasks and env-0 standard60.
  6. Keep Fireworks hosting for the final same-host baseline/SFT comparison.
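Step 2 in the list above could start from something as simple as a reward threshold applied before conversion. This is a minimal sketch under stated assumptions: the `reward` field name is hypothetical, and a reward of 1 is assumed to mean a verifier pass.

```python
from typing import Iterable


def select_training_rows(rows: Iterable[dict], min_reward: float = 1.0) -> list[dict]:
    """Keep rows whose verifier reward meets the threshold.

    `reward` is a hypothetical field name; rows without it count as reward 0.
    """
    return [row for row in rows if float(row.get("reward", 0.0)) >= min_reward]
```

If reward 1 does correspond to a teacher pass, applying this filter to the current teacher set would keep 83 of the 300 rows, which is one reason a larger source pool is listed as step 1.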

The current run shows that the post-training loop works end to end. The next run should test whether stronger data selection and a larger denominator turn that loop into a generalizable OpenHands tool-use improvement.

Cite this work

@article{li2026skillsbench,
  title={SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks},
  author={Li, Xiangyi and Liu, Yimin and Chen, Wenbo and You, Bingran and Di, Zonglin and He, Yifeng and Zheng, Shenghan and Choe, Kyoung Whan and Sun, Jiankai and Wang, Shuyi and Tao, Chujun and Li, Binxu and Zhao, Xuandong and Geng, Hejia and Wu, Xiaojun and Zhou, Junwei and Chen, Xiaokun and Xing, Hanwen and Li, Yubo and Zeng, Qunhong and Wang, Di and Wang, Yuanli and Chaim, Roey Ben and Jiang, Penghao and Shen, Haotian and Kong, Luyang and Liu, Xinyi and Wang, Runhui and Liu, Xuanqing and Li, Jiachen and Lan, Xin and Lin, Yueqian and Ye, Wengao and He, Junwei and Li, Songlin and Zhang, Yue and Gao, Yipeng and Li, Yijiang and Ma, Ze and Jing, Liqiang and Wang, Tianyu and Li, Kaixin and Xue, Yiqi and Lyu, Haoran and He, Yizhuo and Tian, Yuchen and Wu, Shutong and Wang, Bowei and Gao, Yixuan and Chen, Bo and Liu, Litong and Cheng, Sikai and Bao, Jiajun and Tong, Shuaicheng and Xu, Shuwen and Zhuo, Terry Yue and Ye, Tinghan and Qi, Qi and Li, Miao and Liao, Longtai and Tan, Zelin and Shi, Chang and Tang, Xilin and Tankasala, Srinath and Yuan, Boqin and Qian, Yaoyao and Tu, Jianhong and Wang, Chenguang and Sun, Yizhou and Wang, Wei and Taylor, Aaron and Yang, Ziyue and Guan, Changkun and Dong, Zhikang and Zhang, Xinyu and Dillmann, Steven and Lee, Han-chung and Song, Dawn},
  journal={arXiv preprint arXiv:2602.12670},
  year={2026}
}