SMART-PAIR

Predictive Agent Interaction Representation for SMART-Style Traffic Simulation

Achyut Morang

2026-04-01

Research deck | traffic simulation | WOSAC-style evaluation

A checkpoint-first plan for improving SMART-style multi-agent traffic simulation with a training-time predictive interaction regularizer and counterfactual stress diagnosis.

SMART baseline | CAT-K reference | PAIR fine-tuning | Counterfactual stress tests

Core idea

\mathcal{L}_{\text{SMART-PAIR}} = \mathcal{L}_{\text{task}} + \lambda_{\text{PAIR}}\mathcal{L}_{\text{PAIR}}

Keep the deployed SMART generator unchanged. Add a training-only objective that forces hidden states to predict future multi-agent interaction structure.

Inference footprint: unchanged SMART rollout path. PAIR is a fine-tuning regularizer, not a bigger deployed simulator.

Why This Project Exists

1. SMART is strong

SMART treats driving as next-token prediction over learned motion tokens and map context.

This gives an elegant language-like formulation for multi-agent rollout.

2. CAT-K is useful

CAT-K improves closed-loop behavior by selecting model-likely tokens closer to logged futures.

It is a strong fine-tuning reference, not a strawman.

3. A gap remains

Logged-future realism may not fully test whether agents react correctly when another agent changes behavior.

This is the opening for counterfactual interaction robustness.

Research claim boundary: we are not claiming CAT-K is weak. We are asking whether logged-future metrics under-test changed-interaction response quality.

First-Principles Distinction

Standard realism

\hat{\tau}_{1:N} \sim p_{\theta}(\tau_{1:N}\mid H,M)

Can the simulator sample joint futures that look like the dataset?

Counterfactual interaction realism

\hat{\tau}_{-j}^{cf} \sim p_{\theta}(\tau_{-j}\mid H,M,\operatorname{do}(\tau_j=\tilde{\tau}_j))

If one agent behaves differently, do the nearby agents react plausibly?

\text{closeness to logged future}\;\neq\;\text{correct response under changed interaction}

The Baseline Architecture We Keep

SMART generator path

  • Vectorized map tokens enter RoadNet.
  • Agent motion tokens enter MotionNet.
  • Temporal, interaction, and map-agent attention produce next-token distributions.
  • Rollout samples or selects the next motion token autoregressively.

SMART-PAIR: Small Training-Time Addition

What changes?

  • Use existing hidden states from the SMART agent decoder.
  • Predict future interaction latents through a compact PAIR branch.
  • Train with matched future targets and a shuffled-target control.
  • Remove the PAIR branch at inference.

\mathcal{L}_{\text{PAIR}} = 2 - 2\cos\left(\hat{z}_{i,r},\operatorname{sg}(z^+_{i,r})\right)
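
As a sanity check on the loss above, the following minimal sketch evaluates 2 - 2cos on plain Python vectors and confirms that it equals the squared Euclidean distance between the unit-normalized vectors. The function names are hypothetical. The stop-gradient sg(.) has no effect in this gradient-free illustration; in an autodiff framework the target would be detached before the comparison.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pair_loss(z_hat, z_target):
    """2 - 2*cos(z_hat, z_target): 0 when aligned, 2 when orthogonal, 4 when opposite."""
    return 2.0 - 2.0 * cosine(z_hat, z_target)

def unit(v):
    """Scale a nonzero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def squared_distance_of_units(a, b):
    """Squared Euclidean distance between the unit-normalized inputs."""
    ua, ub = unit(a), unit(b)
    return sum((x - y) ** 2 for x, y in zip(ua, ub))
```

Because the loss depends only on direction, rescaling either the prediction or the target leaves it unchanged.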

Experiment Logic: Checkpoint First

The project should not scale new training until checkpoint trust and stress diagnosis are reproducible.

Current Substrate Status

Author checkpoints: 2

BC and closed-loop fine-tuned reference checkpoints are available locally.

Smoke cache

Small fixed cache is enough for load, rollout, and validation-prefix checks.

Stress pairs: 20

Diverse lead-braking pairs across 10 scenarios for first diagnosis.

PAIR scaffold: off

Implemented behind disabled-by-default config flags.

Baseline Smoke Results

Scope: 4 validation batches, batch size 2, one closed-loop rollout, same fixed prefix. This is a mechanics/trust table, not a paper-grade benchmark.

Baseline Table

Model | Open acc | Open loss | Closed ADE ↓ | RMM ↑  | Interaction ↑ | Map ↑  | Scenarios
BC    | 0.8198   | 2.3412    | 0.8633       | 0.6966 | 0.7701        | 0.7724 | 8
CLSFT | 0.8163   | 2.9705    | 0.7836       | 0.7182 | 0.7889        | 0.8076 | 8

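Interpretation: the closed-loop fine-tuned checkpoint (CLSFT) improves closed-loop ADE (0.8633 to 0.7836), RMM, interaction, and map metrics on this small fixed prefix, while its open-loop accuracy is marginally lower (0.8163 vs 0.8198) and its open-loop loss is higher (2.9705 vs 2.3412). This makes it a credible closed-loop reference before testing PAIR.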

Counterfactual Stress Diagnosis

Lead-braking intervention

A lead vehicle is forced to follow a plausible braking token sequence. The model must generate the follower response.

\tilde{\tau}_{lead} = \operatorname{Brake}(\tau_{lead}^{*}), \qquad \hat{\tau}_{follower}^{cf} \sim p_{\theta}(\cdot \mid \operatorname{do}(\tilde{\tau}_{lead}))
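
The do(.) intervention above can be read as clamping one agent's tokens during an otherwise ordinary autoregressive rollout. The sketch below illustrates that reading on toy data. It is not the SMART rollout code: integer tokens stand in for motion tokens, `follow_the_lead` is a hypothetical stand-in for the model, and all names are assumptions made for illustration.

```python
def rollout_with_intervention(history, policy, steps, forced=None):
    """Autoregressive rollout in which `forced` agents follow a fixed token sequence.

    history: dict agent_id -> list of past tokens
    policy:  callable(agent_id, context) -> next token (toy stand-in for the model)
    forced:  dict agent_id -> list of `steps` tokens, i.e. the do(...) intervention
    """
    forced = forced or {}
    tokens = {agent: list(past) for agent, past in history.items()}
    for t in range(steps):
        # Every agent decides from the same pre-step context (one joint step).
        context = {agent: list(seq) for agent, seq in tokens.items()}
        for agent in tokens:
            if agent in forced:
                next_token = forced[agent][t]  # clamp: the model's choice is ignored
            else:
                next_token = policy(agent, context)
            tokens[agent].append(next_token)
    return tokens

def follow_the_lead(agent, context):
    """Toy policy: the follower copies the lead's latest token; others repeat their own."""
    return context["lead"][-1] if agent == "follower" else context[agent][-1]

history = {"lead": [5], "follower": [5]}
normal = rollout_with_intervention(history, follow_the_lead, steps=3)
braked = rollout_with_intervention(
    history, follow_the_lead, steps=3, forced={"lead": [3, 1, 0]}
)
```

Running the same history with and without `forced` yields the normal and forced rollouts that the stress metrics compare. In this toy case the follower lags the braking lead by one step.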

Metrics

  • forced collision rate
  • normal vs forced minimum gap
  • minimum time-to-collision
  • follower speed delta at 3 seconds
  • paired videos for qualitative inspection
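
The pair-level metrics listed above could be computed along the following lines. This is a minimal sketch under a strong simplification: both vehicles are treated as points on a shared lane axis, the gap is the bumper-to-bumper distance in metres, and a collision is counted when the minimum gap reaches a threshold. The function names, the threshold, and the 1D treatment are assumptions for illustration, not the deck's evaluator.

```python
def min_gap(gaps):
    """Smallest lead-follower gap (metres) over the rollout."""
    return min(gaps)

def min_ttc(gaps, lead_speeds, follower_speeds):
    """Minimum time-to-collision in seconds.

    Steps where the follower is not closing on the lead are skipped;
    returns infinity if the follower never closes.
    """
    best = float("inf")
    for gap, v_lead, v_follow in zip(gaps, lead_speeds, follower_speeds):
        closing = v_follow - v_lead
        if closing > 0:
            best = min(best, gap / closing)
    return best

def collision_rate(min_gaps, threshold=0.0):
    """Fraction of pairs whose minimum gap is at or below `threshold` metres."""
    return sum(g <= threshold for g in min_gaps) / len(min_gaps)

def speed_delta_at(normal_speeds, forced_speeds, dt, horizon_s=3.0):
    """Follower speed under intervention minus normal speed at `horizon_s` seconds."""
    index = round(horizon_s / dt)
    return forced_speeds[index] - normal_speeds[index]
```

A negative speed delta means the follower is slower under the braking intervention than in the normal rollout, which is the direction a plausible reaction should take.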

Focal Rollout Comparison: Lead-Braking Case

Ground truth

Ground truth rollout animation
Logged scenario reference for case 17a010edfe6d47b3, follower 15, lead 22.

SMART-BC rollout

SMART-BC rollout animation
visual collision evident | selected-pair gap 5.21 m | selected-pair TTC 2.62 s
Behavior-cloned SMART-tiny checkpoint. The pair metric does not capture every visible rollout collision, so this panel is treated as qualitative evidence.

SMART-CLSFT rollout

SMART-CLSFT rollout animation
visual collision evident | selected-pair gap 0.19 m | selected-pair TTC 0.017 s
Closed-loop fine-tuned reference checkpoint. The selected-pair metric and visual rollout both indicate a severe interaction failure.
GIFs are shown side-by-side for rapid visual comparison. Pair metrics are not global collision metrics; use this slide as qualitative diagnosis, not a benchmark claim.

First Stress Signal

Scope: 20 selected lead/follower pairs from 10 scenarios. This is diagnostic evidence only; visual inspection and larger coverage are still required.

What The Stress Result Means

Useful signal

On the 20-pair diagnostic subset, the fine-tuned checkpoint shows more forced lead-braking collisions than the BC checkpoint despite similar normal-condition gaps.

This suggests the stress harness may reveal behavior not captured by the small aggregate realism prefix.

Not yet a claim

The subset is small. The intervention family is narrow. No videos were generated in this pass. The result must be replicated with stronger controls before being treated as evidence.

Decision gate: do not claim PAIR helps until matched PAIR, shuffled PAIR, BC, and CLSFT are compared on the same frozen stress suite.

PAIR Fine-Tuning Design

Matched PAIR

\theta_{PAIR} = \operatorname{FT}\left( \theta_{BC}, \mathcal{L}_{task}+\lambda\mathcal{L}_{PAIR} \right)

Target latents come from the true future token structure.

Shuffled control

\theta_{shuffle} = \operatorname{FT}\left( \theta_{BC}, \mathcal{L}_{task}+\lambda\mathcal{L}_{PAIR}^{shuffle} \right)

Same capacity and training path, but future targets are mismatched.

If matched PAIR improves stress behavior while shuffled PAIR does not, the evidence points toward future-interaction structure rather than extra parameters or extra optimization steps.
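
One simple way to build mismatched targets for the shuffled control is a cyclic shift by a random nonzero offset, which guarantees that no sample keeps its own target. The deck does not specify how targets are shuffled, so this is an assumed construction with a hypothetical function name, shown on placeholder target labels.

```python
import random

def shuffle_targets(targets, seed):
    """Cyclically shift targets by a random nonzero offset so none stays matched."""
    n = len(targets)
    if n < 2:
        raise ValueError("need at least two targets to mismatch them")
    # Offset is drawn from 1..n-1, so position i never receives its own target.
    offset = random.Random(seed).randrange(1, n)
    return [targets[(i + offset) % n] for i in range(n)]
```

A plain random permutation would leave some targets in place by chance, which weakens the control. Seeding the generator keeps the mismatch reproducible across runs.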

Minimal Experiment Matrix

Variant                 | Initialization               | Extra objective         | Purpose
BC                      | author BC checkpoint         | none                    | behavior-cloning baseline
CLSFT / CAT-K reference | author fine-tuned checkpoint | closed-loop fine-tuning | strong reference
SMART-PAIR              | author BC checkpoint         | matched PAIR            | proposed method
Shuffled PAIR           | author BC checkpoint         | shuffled PAIR           | negative control
Every variant must be evaluated on the same validation subset, same stress manifest, same rollout settings, and same random seeds.
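
The same-settings requirement can be enforced mechanically by fingerprinting each variant's evaluation settings and refusing to compare results when the fingerprints differ. The sketch below is one possible guard. The settings keys in the usage note are hypothetical, and nothing here reflects the project's actual config format.

```python
import hashlib
import json

def settings_fingerprint(settings):
    """Stable SHA-256 hex digest of a JSON-serializable settings dict."""
    # Sorting keys makes the digest independent of dict insertion order.
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_same_settings(settings_by_variant):
    """Return the shared fingerprint, or raise if any variant's settings differ."""
    prints = {
        name: settings_fingerprint(settings)
        for name, settings in settings_by_variant.items()
    }
    if len(set(prints.values())) > 1:
        raise ValueError(f"evaluation settings differ across variants: {prints}")
    return next(iter(prints.values()))
```

For example, passing a dict such as `{"seeds": [0, 1, 2], "stress_manifest": "manifest-v1"}` for each of BC, CLSFT, SMART-PAIR, and Shuffled PAIR returns a single digest that can be stored next to the results table.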

Evidence Package Required

Standard realism

  • RMM
  • interaction metrics
  • map metrics
  • kinematic metrics
  • closed-loop ADE

Counterfactual diagnosis

  • forced collision rate
  • min gap distribution
  • TTC distribution
  • speed-response curves
  • side-by-side videos

Representation evidence

  • hidden-state future probes
  • matched vs shuffled ablation
  • PAIR cosine curves
  • no-inference-size increase proof

Engineering Invariants

Invariant 1: with pair.enabled=false, author checkpoints must strict-load with zero missing or unexpected keys.

Invariant 2: PAIR must be training-only unless an explicit inference-time reranking experiment is introduced.

Invariant 3: no restricted checkpoints, processed WOMD files, or private raw logs enter public slides or GitHub history.
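
Invariant 1 amounts to a set comparison between the parameter names the model expects and the names stored in the checkpoint. The following sketch shows that check on toy name lists. The parameter names and `build_param_names` are hypothetical and do not correspond to the real SMART modules.

```python
def strict_load_report(model_keys, checkpoint_keys):
    """Report keys the model expects but lacks, and keys the checkpoint adds."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_keys - checkpoint_keys),
        "unexpected": sorted(checkpoint_keys - model_keys),
    }

def build_param_names(pair_enabled):
    """Toy model: the PAIR branch adds parameters only when enabled."""
    names = ["agent_decoder.weight", "map_encoder.weight"]
    if pair_enabled:
        names.append("pair_head.weight")
    return names

author_checkpoint = build_param_names(pair_enabled=False)
report_off = strict_load_report(build_param_names(False), author_checkpoint)
report_on = strict_load_report(build_param_names(True), author_checkpoint)
```

With the flag off, both lists in the report are empty, which is the strict-load condition. With the flag on, the PAIR parameters show up as missing, so a PAIR run has to initialize them separately rather than expect them in the author checkpoint.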

Current Limitations

Scientific limitations

  • Stress suite currently covers one intervention family.
  • 20-pair diagnosis is too small for a claim.
  • Need visual inspection to rule out metric artifacts.
  • Need non-interference controls and shuffled PAIR comparison.

Engineering limitations

  • PAIR scaffold exists but has not been scaled.
  • Stress evaluator needs faster scenario caching and video export.
  • Validation must expand beyond the smoke prefix.
  • Documentation still needs cleanup of old J-SMART names.

Near-Term Plan

1. Freeze deck inputs: public-safe notes, plots, diagrams
2. Review code changes: PAIR and stress harness regression checks
3. Generate videos: BC vs CLSFT stress cases
4. Run PAIR smoke: matched and shuffled fine-tune
5. Expand evidence: fixed stress suite plus WOSAC-style metrics

Takeaway

Technical thesis

SMART-PAIR is a conservative modification: keep SMART’s deployed autoregressive generator, but regularize the training representation toward future interaction structure.

Scientific thesis

A simulator should be judged not only by logged-future realism, but also by whether it reacts plausibly when another agent’s future changes.

\text{standard realism} + \text{counterfactual reaction quality} + \text{representation evidence}

Backup: Source Grounding

  • SMART: next-token multi-agent motion generation.
  • CAT-K: closed-loop fine-tuning reference through top-K token targets.
  • WOSAC: realism metrics covering kinematic, interaction, and map behavior.
  • JEPA-style learning: latent predictive objectives that organize hidden states around future structure.
This deck is public-safe: it includes only derived metrics, diagrams, equations, and research planning. It excludes checkpoints and restricted dataset artifacts.