A technical report on the BenchFlow env-0 mobile SFT pipeline: 300 GPT-5.4-mini teacher trajectories, Prime-SFT conversion, 300-step Qwen3.5 BF16 LoRA training, and a Fireworks-hosted standard60 eval that moved from 19/180 to 24/180.

This post documents a small but complete post-training loop for an agent model: collect environment trajectories, convert them into SFT data, train a LoRA adapter, host the adapter, and evaluate it through the same agent harness.

The result is intentionally modest. A 300-trajectory SFT run improved Qwen/Qwen3.5-9B on the original 300-task denominator from 4/300 to 16/300. On the final Fireworks-hosted standard60 eval, the same line moved from 19/180 to 24/180, a +2.78 percentage point change.

That is directionally positive, but not a claim of broad generalization. The useful result is that the full loop works end to end and leaves behind reviewable artifacts: task sets, teacher trajectories, SFT rows, training logs, model commits, and eval trajectories.

Result Summary

The final same-host Fireworks comparison used 60 env-0 tasks with 3 trials per task:

Eval	Baseline	SFT	Delta
Fireworks standard60, 3 trials	`19/180` (`10.56%`)	`24/180` (`13.33%`)	`+5` passes, `+2.78` pp

The earlier same-denominator mobile eval, using the 300-task set that produced the SFT data, showed a larger movement:

Eval	Baseline	SFT	Delta
env-0-mobile `tasks-eval`, 300 rows	`4/300` (`1.33%`)	`16/300` (`5.33%`)	`+12` passes, `+4.00` pp

Held-out mobile generalization was weaker. On an unseen100 split from env-0-mobile/tasks-train, baseline was 3/100 and SFT was 2/100.

The headline should therefore be precise:

We validated the SFT pipeline and saw a +2.78 point Fireworks standard60 lift, but the result is not statistically strong and does not yet prove broad held-out generalization.

Artifact Index

GitHub:

env-0-experiment repository: benchflow-ai/env-0-experiment
Full technical report: 2026-06-27-qwen35-env0-mobile-sft-technical-report.md
Final Fireworks report: experiments/fireworks-qwen35-180/EXPERIMENT_REPORT.md
300-task SFT reproduction report: experiments/env0-mobile-prime-sft-pr828/REPRODUCTION_REPORT.md
Task-set registry: task_set_series.json and TASK_SET_SERIES.md
BenchFlow PR that added the train-data bridge: benchflow-ai/benchflow#828
env-0 task-set registry commit: 1de52f23aceeb65bc08a01cfad75593910e14c09

Hugging Face:

Trajectory dataset root: benchflow/env0-experiment-trajectories
Final Fireworks index: experiments/fireworks-qwen35
Fireworks baseline 180 trajectories: qwen3p5-9b-standard60-3trials
Fireworks SFT 180 trajectories: benchflow-qwen35-9b-env0-mobile-sft-live-standard60-3trials
Fireworks baseline aggregate: summary.json
Fireworks SFT aggregate: summary.json
300-task teacher trajectory set: pr828-env0-mobile-full300-azure-openhands-daytona-20260625T041236Z
300-step SFT training artifacts: env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z
300-task post-SFT eval: pr828-env0-mobile-sft-full300-qwen35-9b-h100-20260625T100226Z
SFT model repository: benchflow/benchflow-qwen35-9b
SFT model commit used for final Fireworks verification: 65613510b42644d44fdf06d4b3d31bc3e9f4ef8e

Pipeline

env-0 task-set curation
  -> GPT-5.4-mini teacher rollouts with BenchFlow + OpenHands + Daytona
  -> BenchFlow PR828 results.jsonl emission
  -> bench train convert to Prime SFT JSONL
  -> Qwen/Qwen3.5-9B BF16 LoRA SFT, 300 steps
  -> adapter publication to benchflow/benchflow-qwen35-9b
  -> base/SFT eval on 300-task denominator
  -> held-out unseen100 check
  -> Fireworks-hosted standard60 3-trial comparison

1. Task Set And Data Boundary

The training-data source was the 300-task env-0-mobile/tasks-eval set. That set came from generated mobile tasks with reward-1 strong-model trajectories, then was balanced across six task families: auth, calendar, docs, drive, mail, and multi-app tasks.

The task-set metadata records env-0 commit 1de52f23aceeb65bc08a01cfad75593910e14c09. The PR828 reproduction report records the runnable submodule snapshot used during the run as 21e358f7360a9704355556c3d0c8f6466bf5e9c2.

The relevant data boundary:

tasks-eval: 300 copied task directories used for teacher trajectory collection and the same-denominator SFT comparison.
tasks-train: 1703 tasks after removing those 300 eval tasks.
Final Fireworks standard60 eval: env-0/tasks, not the 300 env-0-mobile/tasks-eval training set.

We separately checked exact task ids. The SFT training task set and the final Fireworks standard60 task set had zero exact task-id overlap. This is not the same as saying the eval is held out by generator family or domain.

2. Teacher Trajectory Generation

The 300 SFT rows came from real BenchFlow rollouts:

Teacher: Azure GPT-5.4-mini.
Agent: OpenHands.
Sandbox: Daytona.
Task set: env-0/env-0-mobile/tasks-eval.
Output contract: Verifiers/Prime-RL-shaped results.jsonl, plus original trajectory artifacts.

The worker-sharded run initially produced 81/300 passes and 5 infra errors. The five infra-error tasks were refilled, then the run was canonicalized to one healthy LLM trajectory per task.

Canonical teacher set:

Metric	Value
Canonical pass count	`83/300`
`results.jsonl` rows	`300`
`prime-sft.jsonl` rows	`300`
Rows with tool calls	`175/300`
Source LLM exchanges	`2163`
Skipped conversion rows	`0`

The SFT data used all 300 teacher trajectories, not just the 83 teacher-pass trajectories. That choice gave broader coverage but also mixed successful and unsuccessful behavior.

3. BenchFlow PR828 Data Bridge

BenchFlow PR #828 added the bridge from evaluation artifacts to training data:

job-level results.jsonl aggregation;
bench train validate;
bench train convert;
Prime-SFT export and normalization.

The bridge was validated with a synthetic no-spend row, a one-task Docker oracle sanity check, a 10-task spendful canary, and the full 300-task run. The 10-task canary produced 10 Verifiers-shaped rows, 10 Prime SFT rows, 7 rows with tool calls, and zero skipped rows.

4. SFT Training

The student was Qwen/Qwen3.5-9B, using the full non-prequantized checkpoint as the source model. The training update was adapter-based BF16 LoRA rather than full-parameter AdamW.

Training run:

env0-mobile-pr828-qwen35-bf16-lora-8k-h100-20260625T084605Z

Configuration:

Parameter	Value
Base model	`Qwen/Qwen3.5-9B`
Base precision	BF16
Base loading	Full, non-quantized
Sequence length	`8192`
Train rows	`300`
Training steps	`300`
LoRA rank	`32`
LoRA alpha	`64`
LoRA dropout	`0.05`
Batch size	`1`
Gradient accumulation	`8`
Learning rate	`1e-4`
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`

An A100 40GB feasibility check OOMed before training. The accepted run used a Prime Intellect H100 80GB instance under the BenchFlow team context.

Training completed all 300 steps:

Metric	Value
Completed steps	`300/300`
Best step	`300`
Best eval loss	`0.4590291380882263`
Saved checkpoints	`100`, `200`, `300`

The final adapter was published to benchflow/benchflow-qwen35-9b.

5. Same-Denominator 300-Task Eval

The trained adapter was first evaluated on the same 300-task denominator used for teacher trajectory collection.

Baseline:

Model: official Qwen/Qwen3.5-9B.
Serving: SGLang on Lambda A100.
Result: 4/300 pass, 217/300 rows with tool calls.

SFT:

Model: Qwen3.5 SFT adapter.
Serving: SGLang runtime LoRA on H100.
Result: 16/300 pass, 215/300 rows with tool calls.

Comparison:

Stage	Pass	Pass rate	Rows with tool calls
Base	`4/300`	`1.33%`	`217/300`
SFT	`16/300`	`5.33%`	`215/300`

On this seen denominator, the adapter added 12 passes and 4.00 percentage points. On the 83 tasks solved by GPT-5.4-mini, base was 3/83 and SFT was 13/83.

6. Held-Out Unseen100 Check

A held-out 100-task split was selected from env-0-mobile/tasks-train, after excluding all 300 task ids from the canonical SFT denominator.

Stage	Pass	Pass rate
Base	`3/100`	`3.00%`
SFT	`2/100`	`2.00%`

This is why the 300-task result should not be presented as broad generalization. The adapter improved the seen denominator, but the held-out mobile split did not improve.

7. Fireworks Standard60 Eval

The final comparison moved to Fireworks managed deployments:

Baseline deployment: openai/accounts/bingran-you/deployments/env0-qwen3p5-9b-standard60
SFT deployment: openai/accounts/bingran-you/deployments/benchflow-qwen35-9b-env0-mobile-sft-live
Benchmark: env-0 standard60 (env-0/tasks).
Agent: OpenHands.
Sandbox: Daytona.
Rollout count: 60 tasks x 3 trials = 180 rows per model.
Skill mode: with-skill.

Final aggregate:

Model	Strict pass	Pass rate	Rows with tools	Unscored
Fireworks Qwen3.5 baseline	`19/180`	`10.56%`	`175/180`	`0`
Fireworks BenchFlow Qwen3.5 SFT	`24/180`	`13.33%`	`180/180`	`0`

Trial breakdown:

Model	Trial	Pass
Baseline	`trial-01-20260626T165651Z`	`5/60`
Baseline	`trial-02-20260627T055735Z`	`7/60`
Baseline	`trial-03-20260627T072748Z-proxy`	`7/60`
SFT	`trial-01-20260627T022446Z`	`8/60`
SFT	`trial-02-20260627T082221Z`	`6/60`
SFT	`trial-03-20260627T090839Z`	`10/60`

The strict pass delta:

24/180 - 19/180 = +5/180
= +2.78 percentage points

A rough two-proportion check gives p ~= 0.42, with a 95% confidence interval for the delta of about [-3.92 pp, +9.47 pp]. The result is directionally positive, operationally validated, and not yet statistically strong.

What We Learned

This experiment validated several pieces of the training loop:

BenchFlow can emit trainable, verifier-shaped trajectory records through results.jsonl.
The PR828 bridge can convert real OpenHands rollouts into Prime-SFT rows without dropping rows.
A small 300-row SFT set can change Qwen/Qwen3.5-9B behavior on the same denominator.
The trained adapter can be hosted behind Fireworks and still produce structured tool calls for OpenHands.
The final 180-row standard60 comparison is directionally positive with zero unscored rows.

The limitation is equally important: 300 mixed-quality teacher trajectories were not enough to show robust held-out transfer.

Next Step

The next credible experiment should keep the proven loop but improve the data:

Build a larger disjoint train set from env-0-mobile/tasks-train.
Prefer verifier-selected or action-quality-filtered trajectories over all raw teacher trajectories.
Preserve exact task-list provenance before spendful runs.
Train from a fixed Qwen/Qwen3.5-9B source checkpoint.
Evaluate on both held-out mobile tasks and env-0 standard60.
Keep Fireworks hosting for the final same-host baseline/SFT comparison.

The current run is a proof that the post-training loop is real. The next run should test whether stronger data selection and a larger denominator turn that loop into a generalizable OpenHands tool-use improvement.

Qwen3.5 env-0 Mobile SFT: 300 Trajectories, 300 Steps, and a 3-Point Eval Lift