AI Agent Engineering Delivery Framework
Roll out features with AI agents — Move fast, no sprints.
BIPO · 2026
AI coding tools have evolved from "auto-completing a few lines" to "delivering entire features."
The developer's role is shifting from writing code to directing AI.
Before: developers wrote every line themselves. Quality depended on individual skill and experience.
Companies constrained people through architecture standards, code reviews, and QA processes.
Now: AI handles the actual coding. But with the same AI tool and the same model,
different people get dramatically different results.
The problem isn't "which AI tool to use" — just pick one and standardize.
The problem is: why does the same tool produce great results for some and garbage for others?
We used to have an entire system to constrain people — architecture standards, processes, code reviews.
Now AI is the primary executor, but the constraint system for AI is almost nonexistent.
AI doesn't know your architecture conventions, module boundaries, or forbidden zones. Different prompts = different outputs.
Constraints must be built into the project so AI follows them automatically — not documents people read once.
ROLL essentially solves a people problem
ROLL bakes constraints and methodology into the project, so output quality is consistent regardless of who's driving.
An autonomous delivery system for software teams — AI agents pick stories from your BACKLOG,
execute them with encoded engineering discipline, and ship continuously while you stay focused on what to build next.
With roll loop on, BACKLOG items run hourly. Dream scans code health nightly and surfaces maintenance tasks. Humans retain sole release authority: the system never ships to production without your approval.
22 skills encode TDD, TCR, and INVEST practices as reliable, repeatable workflows any agent can follow. Works with Claude, Cursor, Codex, or your own agent — swap the tool, keep the discipline.
ROLL turns a BACKLOG into shipped code continuously. Engineering practices are encoded as executable skills — reliable enough for an agent to run unattended, disciplined enough to ship production code.
From raw idea to production — five stages, three loops, one continuous flow.
Idea: Raw thought · Anyone can submit
Backlog: AC ready · Feature Doc done
Build: TCR micro-steps · Spar · Review · CI
Verify: Deploy to Test/UAT · Live Evidence
Release: Deploy to Prod · Sentinel takes over
↩ Loop C finds issues → auto-creates new Idea → back to Pipeline
The two ends need human judgment. The middle runs on autopilot.
Idea → Backlog: "Help me research this, break it into Stories"
Backlog → Build → Verify: ROLL Loop auto-delivers. No babysitting needed.
Verify → Release: "UAT passed. Ship it?"
The more automated the middle is, the more humans can focus on the two ends — deciding what to build and when to ship. These are judgment calls AI shouldn't make alone.
From vague idea to executable Backlog — powered by DDD and structured design.
Establish Bounded Contexts, Ubiquitous Language, and Context Maps. Ensure engineering speaks the same language as business.
HV Analysis: vertical traces the full lifecycle, horizontal compares competitors. Cross-axis produces insights. Output: PDF report.
Solution exploration with DDD modeling, architecture decisions, interface definitions, data models. When uncertain, explores multiple options before committing.
Break into INVEST-compliant User Stories with acceptance criteria. Write to BACKLOG.md + docs/features/. Each Story is independently deliverable.
Everything flows through BACKLOG.md — four work item types, each with different Loop A depth.
US-XXX (User Story): Business requirements, product features. Full Loop A: DDD → Research → Design → AC. The heaviest investment upfront.
REFACTOR-XXX (Refactor): Tech debt, architecture improvements. Often generated by $roll-.dream's nightly scans. Medium Loop A: impact analysis → plan → AC.
FIX-XXX (Fix): Bug fixes, Sentinel alerts, user reports. Light Loop A: locate root cause → AC. Fast path from diagnosis to delivery.
SPIKE-XXX (Spike): Exploratory research. Loop A only: the output is knowledge, not code. May produce new User Stories or Refactors.
FIX (bugs first) > US (user stories) > REFACTOR (tech debt). Automated by $roll-loop — the autonomous executor scans BACKLOG hourly and routes each item to the right skill.
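The priority rule above can be sketched as a small selection function. This is a minimal illustration, not ROLL's actual implementation: the `next_item` and `parse_id` helpers are hypothetical, the position of SPIKE in the ordering is an assumption (the deck does not rank it), and the type-to-skill mapping is a guess based on the skill names that appear in the deck.

```python
import re

# Priority order stated on the slide: FIX > US > REFACTOR.
# Assumption: SPIKE is ranked last; the source does not say where it belongs.
PRIORITY = {"FIX": 0, "US": 1, "REFACTOR": 2, "SPIKE": 3}

# Hypothetical routing table. Only the skill names come from the deck.
SKILL_FOR = {
    "FIX": "$roll-fix",
    "US": "$roll-build",
    "REFACTOR": "$roll-build",
    "SPIKE": "$roll-research",
}

ID_PATTERN = re.compile(r"^(FIX|US|REFACTOR|SPIKE)-(\d+)$")


def parse_id(item_id):
    """Split an ID such as 'FIX-012' into its type and number."""
    match = ID_PATTERN.match(item_id)
    if match is None:
        raise ValueError(f"unrecognized backlog ID: {item_id!r}")
    return match.group(1), int(match.group(2))


def next_item(item_ids):
    """Return (item_id, skill) for the highest-priority item, or None if empty."""
    if not item_ids:
        return None

    def sort_key(item_id):
        kind, number = parse_id(item_id)
        # Lower priority value wins; ties fall back to the lower number.
        return PRIORITY[kind], number

    best = min(item_ids, key=sort_key)
    return best, SKILL_FOR[parse_id(best)[0]]
```

Given a queue holding `US-003`, `REFACTOR-015`, and `FIX-012`, this sketch would pick `FIX-012` first and hand it to `$roll-fix`.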
Test-Driven Development writes the standard first.
TCR (Test && Commit || Revert) enforces it mechanically.
Per micro-step
Worst case: lose 5 minutes
Every save is verified
Code is always runnable
AI executes autonomously
No human babysitting
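One way to picture a single TCR micro-step is the sketch below. It assumes a Git working tree and a test command that exits non-zero on failure; the `tcr_step` and `run_cmd` names are hypothetical, and ROLL's own skills may work differently.

```python
import subprocess


def run_cmd(cmd):
    """Run a command (list of arguments) and report whether it exited with 0."""
    return subprocess.run(cmd).returncode == 0


def tcr_step(test_cmd, message, runner=run_cmd):
    """Test && Commit || Revert for one micro-step.

    `runner` is injectable so the step can be exercised without touching Git.
    """
    if runner(test_cmd):
        # Tests passed: keep the change.
        runner(["git", "add", "-A"])
        runner(["git", "commit", "-m", message])
        return "committed"
    # Tests failed: discard changes to tracked files since the last commit.
    # Note: untracked files survive `git reset --hard`; removing them would
    # need an extra `git clean` step, omitted here.
    runner(["git", "reset", "--hard"])
    return "reverted"
```

Because every failing step is thrown away, the most that can be lost is the work since the previous commit, which is why the micro-steps are kept to a few minutes.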
The full delivery pipeline — from Backlog item to verified deployment.
Decompose into minimal deliverable Actions. Independent Actions run in parallel.
Write test (RED) → Write code (GREEN) → Self-review ($roll-.review) → Auto-commit. Fail = auto-revert.
Lint + type check + full test suite + build. All must pass before push.
CI re-verifies in a clean environment — the final ruling on "shippable."
Live Evidence required — screenshots, curl responses, test outputs. AI saying "I checked, it works" doesn't count.
Human approves. BACKLOG status: ✅ Done. Sentinel takes over monitoring.
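The "all must pass before push" gate in the steps above can be sketched as an ordered list of checks that stops at the first failure. The `run_gate` helper and the check names are hypothetical; each check is assumed to be a zero-argument callable returning True on success.

```python
def run_gate(checks):
    """Run (name, check) pairs in order and stop at the first failure.

    Returns (True, None) if every check passes, otherwise (False, name)
    where `name` identifies the check that failed.
    """
    for name, check in checks:
        if not check():
            return False, name
    return True, None
```

A caller would push only when the first element of the result is True, for example after passing lint, type check, test suite, and build callables in that order.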
Every layer is automated. None depend on human patience or memory.
Verify every 2-5 minutes. Define the standard before writing code — auto-revert if it fails. Bugs are eliminated the instant they appear.
$roll-build
For critical modules (payments, auth), red-blue testing. One AI attacks, another defends. Max 5 rounds of escalating intensity.
$roll-spar
Self Review — per-commit 6-dimension check.
Peer Review — cross-agent negotiation for risky decisions.
Dream — nightly code health scan (6 dimensions).
$roll-.review · $roll-peer · $roll-.dream
24/7 random-sample monitoring. Alerts only after 3 consecutive failures (false-positive prevention). Auto-creates fix tasks.
$roll-sentinel
Two complementary monitors: Sentinel watches runtime, Dream watches code structure.
Random-sample monitoring of production systems. Cost-controlled AI validation with intelligent spot-checking.
Patrol Modes:
Light: 5 checks/day · Intensive: 20 checks/hour (post-release) · Full sweep: weekly
Output: FIX-XXX entries in BACKLOG.md
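The 3-strike alerting rule can be illustrated with a per-check streak counter. This is a sketch under assumptions: the `StrikeCounter` class is hypothetical, and the deck does not say whether Sentinel tracks streaks per check or globally.

```python
class StrikeCounter:
    """Track consecutive failures per check and signal when a streak hits the threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streaks = {}

    def record(self, check_name, ok):
        """Record one probe result.

        Returns True exactly once per streak, on the probe that brings the
        number of consecutive failures up to the threshold.
        """
        if ok:
            # Any success resets the streak, which filters out one-off blips.
            self.streaks[check_name] = 0
            return False
        self.streaks[check_name] = self.streaks.get(check_name, 0) + 1
        return self.streaks[check_name] == self.threshold
```

When `record` returns True, a caller could append a new FIX entry to BACKLOG.md; a single failed probe followed by a success never triggers an alert.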
Runs at 3am. Six dimensions of code health analysis:
1. Dead Code · 2. Architectural Drift · 3. Pruning Candidates · 4. Emerging Patterns · 5. Doc Coverage · 6. Doc Freshness
Output: REFACTOR-XXX entries in BACKLOG.md
Sentinel monitors behavior. Dream monitors structure.
Together they detect degradation in both runtime health and code quality before users notice.
ROLL operates at three levels of autonomy, each with clear boundaries.
Set goals, review proposals, approve releases. The judgment calls.
Scans BACKLOG hourly (10am–6pm). Auto-routes each item to the right skill. FIX > US > REFACTOR.
3am nightly code health scan. 6 dimensions. Generates REFACTOR entries autonomously.
Cross-agent negotiation for high-risk decisions. Up to 3 rounds. No consensus → escalate to human.
Humans set direction and approve releases. Everything else — building, reviewing, monitoring, refactoring — can run autonomously. The system never ships to production without human approval.
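The bounded peer negotiation described above (up to 3 rounds, then escalate) can be sketched as a simple loop. The `negotiate` function and its `review` and `revise` callbacks are hypothetical stand-ins for two agents; the deck does not describe the real protocol.

```python
def negotiate(proposal, review, revise, max_rounds=3):
    """Run a bounded review loop between two agents.

    review(proposal) returns None to approve, or an objection string.
    revise(proposal, objection) returns an updated proposal.
    """
    for round_no in range(1, max_rounds + 1):
        objection = review(proposal)
        if objection is None:
            return {"status": "consensus", "rounds": round_no, "proposal": proposal}
        proposal = revise(proposal, objection)
    # No agreement within the round limit: hand the decision to a human.
    return {"status": "escalate_to_human", "rounds": max_rounds, "proposal": proposal}
```

The hard cap is the point of the design: agents cannot argue indefinitely, and unresolved high-risk decisions always land with a person.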
13 Core + 4 Autonomous (Auto) + 5 Support (Hidden); each maps to a specific phase.
| Skill | Phase / Tier | What It Does |
|---|---|---|
| $roll-research | Research | HV analysis — timeline + competitive landscape → PDF report |
| $roll-design | Design | DDD modeling, solution design, INVEST story breakdown |
| $roll-idea | Capture | Fast backlog capture — one-liner in, classified BACKLOG entry out |
| $roll-propose | Propose | Generate 1-3 structured US drafts → PROPOSALS.md for human review |
| $roll-build | Build | Universal entry: US-XXX / FIX-XXX / plain text → TCR delivery |
| $roll-spar | Adversarial | Red-blue drill: Attacker writes exploits, Defender patches |
| $roll-fix | Fix | Single-bug fix + mandatory regression test |
| $roll-debug | Diagnose | Black Box probe: Console/Network/DOM/Perf → root cause |
| $roll-sentinel | Patrol | Production random-sample monitoring, 3-strike alerting |
| $roll-doc | Document | Auto-scan, index, gap analysis, fill for project docs |
| $roll-notes | Journal | Project diary — records dev moments chronologically |
| $roll-doctor | Maintain | ROLL self-health check (symlinks/config/skill status) |
| $roll-release | Release | Version bump, tag, push — triggers npm auto-publish |
| $roll-loop | Auto | Hourly BACKLOG executor — routes items to skills |
| $roll-peer | Auto | Cross-agent peer review, up to 3 negotiation rounds |
| $roll-brief | Auto | Owner-facing briefing: done, in-progress, queue, escalations |
| $roll-.dream | Auto | Nightly 6-dimension code health scan → REFACTOR entries |
| $roll-.review | Hidden | Per-commit self-review: correctness, security, maintainability |
| $roll-.changelog | Hidden | Auto-generates CHANGELOG.md from completed stories |
| $roll-.qa | Hidden | Test pyramid standards: unit/E2E/visual/smoke + CI gates |
| $roll-.echo | Hidden | Passive intent clarification for vague inputs |
| $roll-.clarify | Hidden | Scope clarification for under-specified Fly mode inputs |
Quality assurance isn't removed — the implementation is upgraded.
roll init — one command, 5 seconds.
roll setup syncs conventions to all AI tools simultaneously.
Never overwrites existing configs. Writes to its own file, appends via @include.
Update ROLL, re-run setup — every AI tool upgrades in seconds.
roll setup · roll init · roll update
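The "never overwrites, appends via @include" behavior can be pictured as an idempotent append. This is a sketch only: the `ensure_include` helper and the example path are hypothetical, and the exact include syntax each AI tool accepts is not specified in the deck.

```python
def ensure_include(config_text, include_line):
    """Return config_text with include_line appended, unless it is already present.

    Existing content is never modified, so re-running setup is safe.
    """
    existing = {line.strip() for line in config_text.splitlines()}
    if include_line in existing:
        return config_text
    # Add a newline first if the file does not already end with one.
    separator = "" if config_text == "" or config_text.endswith("\n") else "\n"
    return config_text + separator + include_line + "\n"
```

Calling it twice with the same line yields the same text as calling it once, which matches the idea that updating ROLL and re-running setup should not duplicate or clobber a user's own configuration.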
Example: building a "User Login" feature across all three loops.
PM submits: "We need user login." $roll-design runs DDD modeling, decomposes into 3 Stories (password auth, OAuth, remember me), writes AC. Idea → Backlog.
$roll-build starts TCR delivery. Micro-step rhythm: verify + commit every 3 minutes. 12 tests passing in 30 min. 8 micro-commits.
Auth module detected as high-risk. Attacker attempts SQL injection, brute force, session hijacking. Defender patches. 5 rounds, coverage: 71% → 93%.
Cross-agent negotiation flags a session management concern. 2 rounds of discussion. Consensus reached, implementation adjusted.
Screenshots + curl responses captured. Verify stage complete. AI nudges: "UAT passed. Ready to release?"
Deployed. BACKLOG status: ✅ Done. $roll-sentinel begins monitoring.
Detects emerging pattern: 3 similar auth helpers could be extracted. Creates REFACTOR-015 in BACKLOG.
OAuth endpoint response time degrading (3 consecutive failures). Auto-creates FIX-012. $roll-fix patches + regression test. Resolved before users notice.
Pipeline · Loops · Human×AI · Four Defenses · Three-Layer Autonomy — everything on one page.
Take 20 years of proven engineering practices (TDD / TCR / CI / DDD / SRE)
and encode them as standardized AI Agent work instructions.
AI won't cut corners, won't get tired, won't "skip the tests this time" —
because that branch doesn't exist in its instructions.
Requirement to production: hours
Zero-rework micro-step delivery
New dev onboards in 5 minutes
Four automated defense lines
Live evidence verification
Sentinel + Dream 24/7 watch
22 skills, one unified system
Three-layer autonomy
Human decides, AI delivers
@seanyao/roll · MIT · 22 Skills · npm install -g @seanyao/roll