This is my current agentic development workflow in May 2026.
The short version: I run almost everything through Codex now. Not because every other tool is bad. The whole space is moving extremely fast, and a lot of tools are interesting. But for the work I actually do every day, the current GPT model in Codex is the best fit.
The longer version is more about the loop than the model.
The model matters. A lot. But the workflow around it matters just as much: repo instructions, small helper scripts, local skills, review passes, docs discovery, commit discipline, and a way to make the agent prove what it changed.
That is the part that changed the most for me.
Where I Started
For me this started with the first public version of ChatGPT. As soon as it was available, I was using it wherever it helped.
At first I used it mostly as an explainer and learning tool. I still do. It was useful for understanding an unfamiliar API, asking why something worked, or getting a second explanation when the docs did not quite click.
Pretty quickly, I also saw the power for coding. Not as something to blindly copy from, but as an aid in the process. Copy a piece of code in. Ask for a fix. Paste the error back. Ask again. It was useful, but it was not really a developer sitting in the repo.
I was also using GitHub Copilot: useful completions, useful model swaps now and then, and still a subscription I keep around. It can be handy with OpenCode when I want to try another model path. But I rarely use it these days.
The first version that felt more agentic came at the end of 2024 and beginning of 2025, through the VS Code extension era: Roo Code, Cline-style workflows, and tools that could read files, edit files, run commands, and ask for approval along the way.
That was a big shift.
It is history for me now. I have not used Roo Code for about a year, but that period mattered because it showed what the workflow could become.
It also created a lot of bloat.
The agent could do more, so it did more. More files. More plans. More scaffolding. More “helpful” abstractions. The result was often impressive in the moment and annoying the next day. You got motion, but not always good taste.
Then Claude Code changed the shape of the workflow again.
Terminal-first mattered. Better tool use mattered. Letting the agent live closer to the repo instead of inside a chat box mattered. It still needed supervision, but it felt like a real step toward agentic development instead of prompt-and-paste development.
Boris Cherny’s way of talking about Claude Code also influenced me in that early phase. The useful idea was not “let an agent do anything”. It was more specific: keep the agent close to the terminal, make it search and inspect before it edits, and treat the workflow as a real engineering loop instead of a smarter autocomplete box.
I also used OpenCode with oh-my-opencode. That was much better than the early extension loops in a lot of ways. More structured. More configurable. More serious about agents as a harness.
But my workflow was still too often:
That can work. But it is slow, and it encourages the agent to spend too much energy narrating the work instead of closing the loop.
The models improved. The harnesses improved. The results improved. But the real unlock for me was reducing the distance between task, edit, test, review, and commit.
Why I Moved Fully To Codex
This is intentionally a May 2026 snapshot, not a timeless model ranking.
At the beginning of the year, I was still using Claude every day. Opus 4.5 and Opus 4.6 were good for this kind of work. They had strong taste, strong language, and they were often excellent at turning a messy request into a plausible implementation plan.
The move was not because Claude suddenly became bad. It was because my workflow changed. I wanted less plan negotiation and more closed-loop repo work: read the rules, inspect the code, make the smallest useful patch, run the checks, review the patch, fix the real findings, and hand back proof.
I moved fully to Codex when GPT-5.3-Codex arrived.
That was the first point where Codex clearly fit my daily coding work better than Claude Opus 4.6. Opus was still very good, but it could be trigger-happy. It liked to do a lot. Sometimes too much. Codex needed better guardrails, but once those guardrails were there, it just got the job done more reliably.
The workflow also felt more natural to me. I could talk to the agent, let it search, let it build up the intent and context, and then let it implement. Less giant planning ceremony. More discovery into action. Fewer moments where I felt like I was managing the assistant instead of delegating work to it.
What I did not like was losing some of the personality of Opus. It was more fun to work with. It had a nicer tone. But that is not the main reason I am here. I need to build.
With GPT-5.5, the quality jump is big again. The same workflow works better than before: fewer misses, better judgment, better use of instructions, and less need to constantly steer the model back onto the rails.
At the same time, Claude moved in a direction I like less for agentic coding. Opus 4.7 did not reverse that for me. In some ways it felt worse before it felt better. The giant context window sounds attractive, but in practice I do not want the workflow to depend on one specific Claude setup, or on manually picking the special context mode in Claude’s own CLI. I want the model to work well inside the harness I choose.
I also do not love the ecosystem gravity around CLAUDE.md and Claude-specific conventions while the rest of the industry is slowly moving toward more shared agent instructions. The leaked-code discussion around Claude Code also reinforced my feeling that the harness itself was not the magic. I did not read it as “Claude Code is bad”. I read it as another reminder that the wrapper layer matters, and that I want more of that layer open, inspectable, and portable.
That is the broader issue for me: Anthropic often makes the experience pleasant, especially if you are not technical, but it also hides too much from the user. For technical work, I want more control, more portability, and more visibility into the harness. The Claude CLI never felt like the best version of that.
So for coding, right now, Codex with GPT-5.5 is my default. I am fully in Codex and not really looking back, at least for now.
That does not mean I never use anything else. For some web design work I still sometimes reach for Claude Opus, especially when I want a different taste pass. Smaller and cheaper models are also worth watching: GLM, Kimi, and the rest of that ecosystem are getting better quickly.
The reason Codex is my default is not just “it writes better code”.
It follows the repo. It uses tools well. It handles boring constraints. It can keep a large task in its head without trying to turn everything into a framework. And when the instructions are good, it can work in a way that feels less like prompting and more like delegating.
That is the bar.
I do not want an agent that only produces code. I want an agent that can:
- read the repo rules
- find the relevant docs
- make a narrow change
- run the right checks
- review its own patch
- fix what the review finds
- commit only the intended files
- tell me what is still uncertain
The model is important because it makes that whole loop less fragile.
The Base Layer: Agent Scripts
My shared base layer is agent-scripts, my fork of steipete/agent-scripts.
It is not glamorous. That is why it works.
Peter Steinberger influenced this part of my workflow a lot. His agent scripts, skills, review habits, and general bias toward running many small agent workflows in parallel helped me stop thinking of this as one heroic chat session and start treating it as an operating layer around the repo.
There are three parts I care about most:
- shared instructions
- repo docs discovery
- small helpers that remove bad defaults
AGENTS.MD
agent-scripts gives my repos a common set of instructions and small tools. The important file is AGENTS.MD: the shared rules that tell agents how I want them to work.
Things like:
- read repo docs before coding
- do not reset my worktree
- do not swap package managers
- use repo-local commands
- add tests when the bug shape fits
- run the full gate before handoff
- keep diffs narrow
- commit with the helper
- do not print secrets
- ask before doing risky infrastructure work
That sounds like process. It is really context.
Agents are much better when the rules are not hidden in my head. The goal is to remove “Bram would have wanted…” guessing from every session.
Each repo can still add local rules. This blog repo, for example, says to read the shared AGENTS.MD, then run bin/docs-list, then read the matching docs by read_when. That is exactly the kind of small, boring instruction that saves time. The agent does not need to search the whole repo to understand how posts work. It asks the docs inventory first.
docs-list
docs-list is one of those tiny helpers that sounds too simple to matter until you use it every day.
The pattern is:
docs/*.md
summary: ...
read_when:
- ...
Then the agent runs:
bin/docs-list
and gets a small map of the repo.
Not a giant context dump. Not a wiki search. Just enough to know which doc matters for the current task.
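A minimal sketch of what a helper like this could look like. This is not the actual bin/docs-list script. It assumes each doc opens with a front matter block delimited by `---` lines, and the function names `parse_front_matter` and `docs_list` are made up for illustration.

```python
from pathlib import Path


def parse_front_matter(text: str) -> dict:
    """Parse a tiny front matter subset: `summary:` and a `read_when:` list."""
    meta = {"summary": "", "read_when": []}
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return meta  # no front matter: report the doc with empty metadata
    in_read_when = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            break  # end of the front matter block
        if stripped.startswith("summary:"):
            meta["summary"] = stripped[len("summary:"):].strip().strip("\"'")
            in_read_when = False
        elif stripped.startswith("read_when:"):
            in_read_when = True
        elif in_read_when and stripped.startswith("- "):
            meta["read_when"].append(stripped[2:].strip().strip("\"'"))
        elif stripped:
            in_read_when = False  # any other key ends the read_when list
    return meta


def docs_list(docs_dir: str = "docs") -> list[str]:
    """Return a compact map of the docs: path, summary, and read_when hints."""
    out = []
    for path in sorted(Path(docs_dir).glob("*.md")):
        meta = parse_front_matter(path.read_text(encoding="utf-8"))
        out.append(f"{path.as_posix()}: {meta['summary']}")
        for hint in meta["read_when"]:
            out.append(f"  read when: {hint}")
    return out
```

Printing `"\n".join(docs_list())` from the repo root would give the agent the small map described above. Only the front matter is parsed, so the output stays short no matter how long the docs get.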
For this post, the relevant doc is the content workflow. It says new posts start in Obsidian, under:
/Users/bram/obsidian/bvrA/posts
Then the repo syncs them into Hugo with:
bin/sync-obsidian-posts
That is the kind of thing I want the model to discover and follow without me repeating it every time.
committer
I use a small committer helper from agent-scripts.
The important behavior is simple: start from a clean staging area, stage only the files I list, then commit with the message I give it.
That matters because agent sessions often leave noise nearby: generated files, local notes, experiments, unrelated user changes, maybe a synced artifact. I do not want “git add .” as a habit.
The helper enforces the opposite habit. It rejects . as a path, checks that the listed files exist or are real deletions, unstages everything first, stages only the explicit paths, and refuses to commit if nothing changed.
The commit step should be boring and explicit:
committer "fix: handle empty export path" src/exporter.py tests/test_exporter.py
That is also a good forcing function for the agent. If it cannot name the files it wants to commit, it probably does not understand the patch well enough.
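The behavior described above can be sketched roughly like this. It is an illustration under assumptions, not the real committer script: the function names are hypothetical, and the git invocations are one plausible way to get the same effect.

```python
import os
import subprocess


def is_tracked(path: str) -> bool:
    """True if git knows the path, which makes a missing file a real deletion."""
    result = subprocess.run(
        ["git", "ls-files", "--error-unmatch", "--", path],
        capture_output=True,
    )
    return result.returncode == 0


def validate_paths(paths, exists=os.path.exists, tracked=is_tracked):
    """Reject blanket paths and anything that is neither a file nor a deletion."""
    if not paths:
        raise ValueError("list the files to commit explicitly")
    for path in paths:
        if path.strip() in {"", ".", "./"}:
            raise ValueError("refusing '.'; name each file")
        if not exists(path) and not tracked(path):
            raise ValueError(f"not a file or a tracked deletion: {path}")
    return list(paths)


def plan_commit(message, paths):
    """Build the git commands in order: unstage all, stage listed, check, commit."""
    return [
        ["git", "restore", "--staged", ":/"],    # start from a clean staging area
        ["git", "add", "-A", "--", *paths],      # stage only the explicit paths
        ["git", "diff", "--cached", "--quiet"],  # exit code 0 means nothing staged
        ["git", "commit", "-m", message],
    ]


def commit(message, paths):
    """Validate, then run the plan; stop before committing if nothing is staged."""
    paths = validate_paths(paths)
    unstage, stage, check, do_commit = plan_commit(message, paths)
    subprocess.run(unstage, check=True)
    subprocess.run(stage, check=True)
    if subprocess.run(check).returncode == 0:
        raise SystemExit("nothing staged; refusing to create an empty commit")
    subprocess.run(do_commit, check=True)
```

Validation is kept separate from execution so the rules can be tested without touching a real repository.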
Skills
The other big piece is local skills.
Skills are small operational playbooks. They are not long essays. They tell the agent what to do when a certain kind of task appears.
In my setup, a skill has a SKILL.md with front matter:
---
name: codex-review
description: "Codex code review closeout: local dirty changes, PR branch vs main, parallel tests."
---
The description is for routing. The body is for execution.
That has been surprisingly powerful.
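The routing/execution split can be sketched as a two-step load. This is an illustration, not how Codex actually loads skills. It assumes a hypothetical layout of one directory per skill with a SKILL.md inside, and the names `split_skill` and `routing_index` are made up.

```python
from pathlib import Path


def split_skill(text: str):
    """Split SKILL.md into (front matter fields, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:]).strip()
            return fields, body
        # Split on the first colon only, so descriptions may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return {}, text  # unterminated front matter: treat the whole file as body


def routing_index(skills_dir: str) -> list[str]:
    """One line per skill: name and description only, never the body."""
    entries = []
    for path in sorted(Path(skills_dir).glob("*/SKILL.md")):
        fields, _body = split_skill(path.read_text(encoding="utf-8"))
        if "name" in fields and "description" in fields:
            entries.append(f"{fields['name']}: {fields['description']}")
    return entries
```

The index is what stays in context all the time. The body only needs to be read once a task matches a description.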
What Belongs In Skills
The useful skills are the ones that capture a repeatable workflow I do not want the model to rediscover.
In my setup that includes:
- code review closeout
- Oracle second-model review
- Obsidian note work
- GitHub deep review
- 1Password rules
- npm release work
- frontend design
- screenshots and UI inspection
- remote host workflows
The point is not to turn everything into a skill. The point is to capture the workflows where I already know the rules.
Do not make the agent rediscover the 1Password policy. Do not make it guess how npm release verification works. Do not make it remember the exact review loop. Put that into a skill and let the model pick it up when the task matches.
Review Loop
The review loop has been the biggest recent game changer.
The old version was:
The better version is:
The codex-review skill makes that explicit.
This is one of the most powerful pieces of the whole setup, and I use it daily now.
It treats review output as advisory, not sacred. The agent still has to verify each finding against the real code path, adjacent files, and dependency behavior when needed. Bad findings get rejected. Speculative rewrites get rejected. Real findings get fixed at the smallest useful ownership boundary. If a review-triggered fix changes code, the focused tests run again and review runs again.
The important bit is that this closes the loop before the work gets back to me.
I still review the diff. I still make the call. But I am no longer the first person to notice the obvious bug the agent introduced ten minutes ago.
For non-trivial changes, this is now part of my default expectation:
codex review --uncommitted
or, on a branch:
codex review --base origin/main
Sometimes the review finds nothing. Sometimes it catches exactly the kind of edge case that would have been annoying in production. Either way, the discipline matters.
I also wrapped that in a helper:
codex-review --parallel-tests "go test ./..."
The helper chooses the right review target. If the checkout is dirty, it runs codex review --uncommitted. If the work is already committed or on a PR branch, it reviews against the branch base instead, usually origin/main or the actual PR base. That matters because a clean --uncommitted review only proves there is no local patch. It says nothing about a branch that is already committed.
It can run tests and review in parallel, captures the review output, fails if Codex reports P0-P3 findings, treats empty review output as non-clean, and prints a clean closeout only when there are no accepted/actionable findings:
codex-review clean: no accepted/actionable findings reported
The point is not to obey a second model. The point is to add another pressure test to the patch before I see it. The agent has to read the finding, decide whether it is real, fix it if it is real, reject it if it is not, and keep looping until the review is clean.
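The two decisions the helper makes, which target to review and whether the result counts as clean, can be sketched as below. This is not the real codex-review helper. The function names are hypothetical, and it assumes findings show up in the review output as bracketed priority tags such as `[P1]`, which is a guess about the output format.

```python
import re

# Assumption: each finding carries a bracketed priority tag like "[P0]" to "[P3]".
FINDING = re.compile(r"\[P[0-3]\]")

CLEAN_MESSAGE = "codex-review clean: no accepted/actionable findings reported"


def review_command(dirty: bool, base: str = "origin/main") -> list[str]:
    """Pick the review target: the local patch if dirty, else the branch base."""
    if dirty:
        return ["codex", "review", "--uncommitted"]
    return ["codex", "review", "--base", base]


def closeout(review_output: str, tests_passed: bool) -> tuple[bool, str]:
    """Return (clean, reason). Empty output is a failed review, not a pass."""
    if not tests_passed:
        return False, "tests failed"
    if not review_output.strip():
        return False, "empty review output"
    if FINDING.search(review_output):
        return False, "review reported P0-P3 findings"
    return True, CLEAN_MESSAGE
```

Treating empty output as non-clean is the important design choice: a review that crashed or printed nothing should never look the same as a review that found nothing.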
Oracle
Oracle used to be more central in my workflow.
The idea is good: bundle a prompt plus selected files and hand it to another model as a second opinion. I found it very useful, especially when I wanted a separate pass on architecture, debugging, or a hard refactor.
In my setup it lives as a skill and a CLI:
npx -y @steipete/oracle --help
These days I need it less.
That is mostly because GPT-5.5 in Codex is already strong enough for a lot of the work where I previously wanted a second model. The better the primary loop gets, the less often I need to stop and ask a different model to reason from a bundle.
But Oracle still has a place.
For really hard tasks, I still like having another model look at the problem with a tight file set and a standalone prompt. It is especially useful when I want to separate “implement this” from “tell me what I am missing”.
The trick is not to overuse it. A second opinion is useful when the task is hard. It is a drag when the primary loop can solve the problem faster.
Google Models
Google models have their spot. I just do not currently reach for them first for code development.
At the end of last year, Google models were often the best fit for UI design work in my setup. They had a good feel for layout, visual hierarchy, and turning a rough interface idea into something that looked more intentional.
For coding, they were less compelling for me. Not useless. Just less optimized for the loop I care about. They did not expose or maintain the same kind of thinking-and-correction rhythm, and they did not correct themselves after failed checks as often as I wanted.
I still think Google’s DESIGN.md spec is a good idea: a plain Markdown design-system file with structured tokens and human-readable design rationale that coding agents can follow.
That is a nice idea.
It fits a broader pattern I like: put the important context in files, make it readable by humans, and make it precise enough for agents.
But I think Google is playing a different game. World models, robotics, multimodal systems, big context, and other things that are not just “make my repo patch better today”.
That is not a criticism. It is just a different center of gravity.
For coding, in my current day-to-day work, GPT-5.5 in Codex is the tool I trust most.
Smaller Models And Other CLIs
The cheap and fast models are worth paying attention to.
GLM 5 was nice to use for a while. Kimi too. Cheap, fast, and good enough for certain slices of work: summarizing, searching, mechanical edits, draft review, simple scripts, and background exploration.
Then Opus 4.6 and GPT-5.2-Codex were good enough that I stopped caring as much about cheaper routing in my own workflow. The cost difference mattered less than the cost of reviewing weaker output.
Now even the small GPT-5.5 models are cheap relative to what they can do. I am not using that layer heavily right now, but I am watching it. The useful split may come back: frontier model for final patch judgment, smaller models for background exploration and cheap parallel passes.
The same is true for the CLI ecosystem.
Codex, Claude Code, OpenCode, Gemini CLI, and all the wrappers around them are not just interchangeable chat frontends. They encode different opinions:
- planning versus direct execution
- terminal versus editor
- one agent versus subagents
- MCP-heavy versus CLI-heavy
- local files versus cloud sessions
- cheap model routing versus frontier model first
This world is changing fast.
That is why I like keeping my own workflow in small files and scripts. If the best harness changes, I can move the layer above it. AGENTS.MD, skills, docs-list, and commit discipline are portable ideas.
The Current Loop
My current loop looks roughly like this:
It still often starts with just talking to the model.
I usually know what I have in mind. I guide the model toward that shape, but I also want it to push back, surface missing information, or suggest a path I had not considered. That conversational part matters. Just talking to the model can get you surprisingly far, especially when you use it to clarify the task before touching files.
Once I have enough information, and the model has enough context, then I want it to switch into implementation mode.
This is the main proof shape of the workflow: conversational discovery first, then a set of gates that force repo context, checks, review, and explicit handoff.
For app work, there may be screenshots or Playwright checks.
For Maya work, there may be a live Maya session involved.
For release work, there are changelog and registry checks.
The shape changes per repo. The principle stays the same:
The agent should close the loop as much as it reasonably can before handing work back.
A Real Example: gobankcli
gobankcli is a good example because the problem was boring in exactly the right way: build a local-first, read-only bank transaction archive without letting an agent make unsafe assumptions.
It took me a few hours to create end to end. That is the part that still feels different: not a weekend of scaffolding, not a week of wiring, but a focused session where the agent could keep moving because the constraints were clear.
The first commit was not code. It was AGENTS.md.
That file set the boundaries before implementation started:
- no scraping
- no payment initiation
- no bank password storage
- no real bank data in tests, docs, examples, logs, or commits
- no float64 for money
- stdout is data only; stderr is hints and progress
- --json, --plain, and --no-input must be scriptable
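The float64 rule is about Go, but the reason behind it is easy to show in Python, because a Python float is the same IEEE 754 double. The sketch below illustrates one common alternative, integer minor units. The post does not say which representation gobankcli actually uses, and the helper names are hypothetical.

```python
from decimal import Decimal


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Parse a decimal string such as "12.34" into integer minor units (1234)."""
    scaled = Decimal(amount).scaleb(exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError(f"more than {exponent} decimal places: {amount}")
    return int(scaled)


def format_minor_units(value: int, exponent: int = 2) -> str:
    """Render integer minor units back to a decimal string without float math."""
    return str(Decimal(value).scaleb(-exponent))


# Binary floating point cannot represent most decimal fractions exactly.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# Integer minor units add exactly.
cents_sum_is_exact = (
    to_minor_units("0.10") + to_minor_units("0.20") == to_minor_units("0.30")
)
```

Here `float_sum_is_exact` is `False` and `cents_sum_is_exact` is `True`, which is the whole argument for keeping floats away from an archive that has to reconcile.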
Then the repo grew as a stack of small commits:
2026-05-17 2817f85 chore: add agent instructions
2026-05-17 65e9d05 feat: add provider abstraction
2026-05-17 d802a51 feat: add sqlite archive store
2026-05-17 a9fc856 feat: add normalized csv export
2026-05-17 15f8f0c feat: add gocardless provider normalization
2026-05-17 47c4b81 feat: wire provider commands
2026-05-17 61e7376 feat: add read-only archive query command
2026-05-17 7a0d51c fix: handle provider sync edge cases
2026-05-17 eb80840 fix: keep remittance out of transaction identity
2026-05-18 d433841 feat: add enable banking local HTTPS setup
That sequence matters more than the size of any one patch. The agent did not start by inventing a giant banking framework. It started with repo rules, then added the provider boundary, archive store, export path, provider normalization, command wiring, query surface, and docs.
The fix commits are the useful part. fix: handle provider sync edge cases added command and normalization coverage around real provider-shape problems. fix: keep remittance out of transaction identity is exactly the kind of bug I want this workflow to catch: one provider field may be useful descriptive data, but it should not accidentally become part of a stable transaction identity.
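The identity bug is easier to see in code. The sketch below is in Python rather than Go, and the field names are hypothetical since the real schema is not shown here. The idea is only that the identity is derived from an explicit allowlist of stable fields, so descriptive text cannot leak into it.

```python
import hashlib
import json

# Hypothetical field names; remittance text is deliberately not in this list.
IDENTITY_FIELDS = (
    "provider",
    "account_id",
    "booking_date",
    "amount_minor",
    "currency",
    "provider_tx_id",
)


def transaction_identity(tx: dict) -> str:
    """Hash only the stable fields so descriptive text cannot change the ID."""
    stable = {name: tx[name] for name in IDENTITY_FIELDS}
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With an allowlist, a provider that rewrites or truncates the remittance text between syncs still produces the same identity, so the archive does not store the same transaction twice.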
The proof is not that the agent wrote a lot of Go. The proof is that the work stayed inside the constraints: read-only, local-first, synthetic fixtures only, explicit command behavior, tests beside behavior changes, and commits narrow enough that I could inspect the shape of the system afterwards.
What I Want From Agents Now
I do not want maximal autonomy. I want useful autonomy.
That means:
- know the repo rules
- keep changes small
- use the right local tools
- prove the important behavior
- review the patch
- say what was not verified
- do not hide uncertainty
The best agentic workflow is not the one with the most moving parts. It is the one where the model has enough context and enough tools to finish the boring parts without turning the repo into a mess.
Right now, for me, that is Codex plus agent-scripts, local skills, docs discovery, review loops, and explicit commits.
No mysticism. Just a tighter loop.