12-Month Roadmap

A month-by-month plan to go from "engineer who uses an agent runtime" to hireable AI Application Engineer. You already drive Copilot Agent Mode / Claude Code at work — this closes the engineering gaps around that skill.

Destination (month 12): an offer at a meaningfully higher salary, backed by 2–3 deployed AI projects and demonstrable skill in evals, RAG, agent design, and production engineering.

Companion docs: Start Here (reading order) · Glossary (vocabulary) · Building an Agentic System (the build recipe) · AI-First Methodology (the six components).

The 8 gaps this year closes

You can already run an agent. These gaps are what separate that from "AI Application Engineer." Evals is the highest-leverage gap; do not skip it because it is unglamorous.

#	Gap	Why it matters
1	Evals	Measure whether a feature works; iterate from data, not vibes. The #1 skill.
2	RAG design & tuning	Ground models in your data; measure retrieval quality deliberately.
3	Architecture judgment	Choose workflow vs single-agent vs multi-agent — and defend it.
4	Production engineering	Cost, latency, retries, caching, observability.
5	Structured outputs	Schemas downstream code can rely on, not prose.
6	Methodical prompt tuning	Measured improvements, not guesswork.
7	Context engineering	Curate the right context at scale; most "AI doesn't work" complaints trace back to wrong context.
8	Public writing	Artifacts that move hiring managers; start mid-year.

Year at a glance

Month	Focus	Ship
1	Structured foundations	Flagship agent v1 scaffold + system prompt + 5 examples
2	Structured outputs & prompt library	v1 with parseable output; 1 other user
3	Evals (highest leverage)	Eval harness + baseline on a personal project
4	Eval-driven iteration + production basics	Flagship v2: measured prompt gains, Langfuse instrumented
5	RAG at production quality	Personal RAG project with measured retrieval quality
6	Context engineering + go public	Refined context; first eval blog post
7	Architecture: workflow vs agent vs multi-agent	Flagship v3 pipeline (analyzer → drafter → reviewer)
8	Reliability + cost/latency	Hardened flagship; full-pipeline evals; failure-mode writeup
9	2nd public project + system design	Ambitious public project, 5+ real users
10	Portfolio + narrative; start applying	READMEs, pinned repos, resume, pivot story; first applications
11	Interview execution	3+ active processes; synthesis blog post
12	Land	1+ offer at target; clean handover; retrospective

Phases compress and expand: months 1–2 are short (you know the runtime); months 3–6 (evals + production + RAG) get the most weight.

Month-by-month

Month 1 — Structured foundations

Why this month: turn months of ad-hoc agent use into a deliberately structured project.

Learn	Why it helps	Build
Anthropic prompt engineering docs (deep read)	Closes prompt-tuning gap with named techniques	Repo structure + first system prompt
Repo structure for an agentic system	Forces architecture decisions before content	5 curated reference examples
~5h math max (cosine similarity, softmax/temperature)	Enough to read the field; math is not the bottleneck	—

Resource: Anthropic prompt engineering overview · Building an Agentic System
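
The two math items in the table above fit in a few lines each. This is a minimal standard-library sketch, not a production implementation; the function names are illustrative. Cosine similarity is what embedding search compares, and temperature is a divisor applied to logits before softmax.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `softmax([2.0, 1.0, 0.0], temperature=0.5)` puts more probability on the top score than `temperature=2.0` does, which is why low temperature reads as "more deterministic."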

Month 2 — Structured outputs & prompt library

Why this month: make output machine-trustworthy and capture what works.

Learn	Why it helps	Build
Structured outputs / JSON schema (Zod, tool calling)	Downstream code can rely on the agent	v1 returns parseable JSON, not prose
Reusable prompt library (`prompts/`)	Stops reinventing prompts; composable	`draft` / `review` / `extract-intent` prompts
10 advanced runtime techniques (table below)	Pushes runtime use to expert	1 other user on a real task

Resource: OpenAI Structured Outputs · Anthropic tool use
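
"Parseable JSON, not prose" means the caller rejects anything that does not match an agreed shape. The committed stack uses Zod for this; the sketch below is a standard-library Python stand-in for the same idea, not Zod's API. The `IntentResult` fields and the intent labels are hypothetical, chosen only to illustrate an `extract-intent` style prompt.

```python
import json
from dataclasses import dataclass

# Hypothetical label set for an extract-intent prompt.
ALLOWED_INTENTS = {"question", "bug_report", "feature_request"}


@dataclass(frozen=True)
class IntentResult:
    intent: str
    confidence: float


def parse_intent(raw: str) -> IntentResult:
    """Validate raw model output; raise ValueError instead of passing bad data on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    intent = data.get("intent")
    confidence = data.get("confidence")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return IntentResult(intent=intent, confidence=float(confidence))
```

A prose reply such as `"Sure! The intent is a question."` fails at the first step, which gives the calling code one well-defined error path to retry or log.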

Month 3 — Evals (the make-or-break month)

Why this month: without evals every change is a guess. This is the single biggest hire signal.

Learn	Why it helps	Build
Golden datasets, error analysis	Source of truth for "is it better?"	Eval harness on a personal project
LLM-as-judge with rubrics	Scales evaluation past manual review	Baseline scores you can quote
Metrics that correlate with usefulness	Avoids measuring the wrong thing	Regression-catching loop

Resource: Hamel Husain — Your AI Product Needs Evals · LLM-as-a-Judge
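
A hand-rolled harness can start very small: a golden set, a scorer, a pass rate, and a regression check. The sketch below assumes exact-match scoring and a stubbed system under test; the cases, `fake_system`, and all function names are hypothetical. An LLM-as-judge would slot in as a different `scorer` with the same signature.

```python
from typing import Callable

# Hypothetical golden set: inputs paired with the answer a reviewer accepted.
GOLDEN_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def exact_match(output: str, expected: str) -> bool:
    return output.strip().lower() == expected.strip().lower()


def run_eval(system: Callable[[str], str], cases, scorer=exact_match) -> dict:
    """Run every case through the system and collect failures for error analysis."""
    if not cases:
        raise ValueError("golden set is empty")
    failures = []
    for case in cases:
        output = system(case["input"])
        if not scorer(output, case["expected"]):
            failures.append({**case, "got": output})
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases), "failures": failures}


def is_regression(baseline: float, current: float, tolerance: float = 0.0) -> bool:
    """True when the new pass rate falls below the recorded baseline."""
    return current < baseline - tolerance


def fake_system(prompt: str) -> str:
    """Stand-in for the real model call."""
    answers = {"2 + 2": "4", "capital of France": "paris"}
    return answers.get(prompt, "unknown")
```

`run_eval(fake_system, GOLDEN_SET)` reports a pass rate of two out of three and lists the `3 * 3` case under `failures`; that list is the raw material for error analysis, and the pass rate is the baseline number to quote.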

Month 4 — Eval-driven iteration + production basics

Why this month: connect evals to real engineering — cost, latency, observability.

Learn	Why it helps	Build
Streaming, retries, exponential backoff	Production-grade reliability (write the loop yourself)	Flagship v2 iterated against evals
Prompt caching	Cuts cost and latency materially	Cost monitoring in place
Langfuse tracing	See every call: prompt, response, tokens, cost	Every LLM call instrumented

Resource: Langfuse docs · Anthropic prompt caching
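
"Write the loop yourself" for retries can look like the sketch below: exponential backoff with full jitter, retrying only errors known to be transient. `TransientError` is a hypothetical stand-in for whatever rate-limit or timeout exception the client library raises, and the default delays are illustrative rather than recommended values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for rate-limit or timeout errors."""


def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on TransientError wait base_delay * 2^(attempt-1), capped, with jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: a random wait up to the ceiling avoids synchronized retries.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so tests can inject a fake and run instantly. Non-transient errors (a malformed request, for example) are deliberately not caught, since retrying them only adds cost and latency.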

Month 5 — RAG at production quality

Why this month: grounding is a core production pattern — and you must measure retrieval, not just generation.

Learn	Why it helps	Build
Chunking + embedding model choice	Retrieval quality starts here	Personal RAG project on your own corpus
Hybrid search (vector + keyword), re-ranking	Beats naive vector search	Measured retrieval quality (not just vibes)
pgvector in Postgres	Committed, boring, scalable store	Working semantic search

Resource: Eugene Yan — Patterns for LLM Systems · pgvector
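
Measuring retrieval separately from generation needs only ranked document IDs and a set of known-relevant IDs per query. The sketch below shows two common retrieval metrics (recall@k and reciprocal rank) and reciprocal rank fusion, one simple way to merge a vector ranking with a keyword ranking. The document IDs are made up, and `k=60` is the constant commonly used for RRF, taken here as an assumption rather than a tuned value.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("need at least one relevant document")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
keyword_hits = ["c", "a", "d"]  # hypothetical keyword-search ranking
fused = rrf_fuse([vector_hits, keyword_hits])
```

Averaging `reciprocal_rank` over all queries in a golden set gives MRR. Running the same metrics on the vector-only, keyword-only, and fused rankings is what turns "hybrid beats naive vector search" into a measured claim for a specific corpus.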

Month 6 — Context engineering + go public

Why this month: context discipline is "the #1 job," and public writing must start now, not in month 10.

Learn	Why it helps	Build
Context curation & window management	Most agent failures are context failures	Refined flagship context / instructions
What to include vs exclude per task	Fewer hallucinations, lower cost	First substantive eval blog post
Networking: join a practitioner community	Compounds; referrals beat cold applications	Follow 30 builders; reply weekly

Resource: Anthropic — Effective context engineering for AI agents
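
"What to include vs exclude per task" can be made mechanical: rank candidate snippets and keep only what fits a budget. The sketch below is a greedy selection under stated assumptions: each snippet already carries a `relevance` score from some earlier step, and whitespace-separated words stand in for tokens (a real tokenizer would replace `count_tokens`). All names are hypothetical.

```python
def select_context(snippets, budget, count_tokens=lambda text: len(text.split())):
    """Greedily keep the most relevant snippets whose total size fits the budget."""
    chosen = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s["relevance"], reverse=True):
        cost = count_tokens(snippet["text"])
        if used + cost <= budget:
            chosen.append(snippet)
            used += cost
    return chosen
```

Greedy selection is not optimal in general, but it is easy to reason about and easy to evaluate: change the budget or the relevance scoring, rerun the evals, and compare.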

Month 7 — Architecture judgment

Why this month: interviewers probe "why this design?" The disagreement between the two essays listed under Resource below (Anthropic's and Cognition's) is the lesson.

Learn	Why it helps	Build
Workflow vs single-agent vs multi-agent	Choose the simplest thing that works	Flagship v3: analyzer → drafter → reviewer
When multi-agent backfires (context fragmentation)	Avoid speculative complexity	Per-stage evals across the pipeline
Composing patterns (routing, orchestrator-workers)	Vocabulary + judgment for design rounds	Defensible architecture writeup

Resource: Anthropic — Building effective agents · Cognition — Don't Build Multi-Agents
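
In a workflow, the order of steps is fixed in code; in an agent, the model decides the next step. The sketch below shows the analyzer → drafter → reviewer pipeline as a plain workflow with stubbed stages, so each stage can be evaluated on its own. The stage bodies are placeholders for real model calls, and every name here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    request: str
    analysis: str = ""
    draft: str = ""
    approved: bool = False
    trace: list = field(default_factory=list)  # which stages ran, in order


def analyzer(state: PipelineState) -> PipelineState:
    state.analysis = f"intent: {state.request.lower()}"  # stand-in for a model call
    state.trace.append("analyzer")
    return state


def drafter(state: PipelineState) -> PipelineState:
    state.draft = f"Draft reply for [{state.analysis}]"  # stand-in for a model call
    state.trace.append("drafter")
    return state


def reviewer(state: PipelineState) -> PipelineState:
    # Toy check: the draft must exist and must reflect the analysis.
    state.approved = bool(state.draft) and state.analysis in state.draft
    state.trace.append("reviewer")
    return state


def run_workflow(request: str, stages) -> PipelineState:
    """Run stages in a fixed order; the code, not the model, picks the next step."""
    state = PipelineState(request=request)
    for stage in stages:
        state = stage(state)
    return state
```

Because every stage takes and returns the same state object, a per-stage eval is just a call to one function with a hand-built `PipelineState`, and the `trace` field gives the architecture writeup something concrete to point at.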

Month 8 — Reliability + cost/latency

Why this month: "I can talk about failure modes and tradeoffs" is what makes you sound senior.

Learn	Why it helps	Build
Failure recovery, idempotency, checkpoints	Production survives partial failures	Hardened flagship + better logging
Human-in-the-loop placement	Trust + adoption; catches the bad 5%	Real failure-mode writeup
Model routing, parallel tool calls, token budgets	Cheaper + faster without quality loss	Measured time-saving claim

Resource: Anthropic — Effective harnesses for long-running agents
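
Checkpointing is one way to survive partial failures: persist each step's result as it completes, and skip completed steps on rerun. This is a minimal sketch assuming step results are JSON-serializable and a local file is an acceptable store; the function and file names are hypothetical. Steps that have external side effects still need to be idempotent themselves, since a crash can land between the side effect and the checkpoint write.

```python
import json
import os


def run_with_checkpoints(steps, path):
    """Run (name, fn) steps in order, skipping any already recorded at path.

    Each fn receives the dict of earlier results and returns its own result.
    """
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue  # finished in a previous run; do not repeat the work
        done[name] = fn(done)
        # Write to a temp file, then rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, path)
    return done
```

If the second of two steps raises, rerunning with the same `path` resumes at that step and reuses the first step's stored result, which matters when the first step was an expensive model call.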

Month 9 — Second public project + system design

Why this month: two non-trivial deployed projects beat one; system-design fluency is interview gold.

Learn	Why it helps	Build
AI system design (application chapters)	Frame problems like an AI engineer	Ambitious public project on a personal corpus
Vector store tradeoffs (pgvector vs hosted)	Defend infra choices	5+ strangers actually using it
Basic Python to read AI codebases	Most AI code/examples are Python	Architecture feedback from a senior peer

Resource: Chip Huyen — AI Engineering · aie-book repo

Month 10 — Portfolio, narrative, start applying

Why this month: convert nine months of building into evidence; begin the funnel.

Learn	Why it helps	Build
Take-home + AI system-design formats	Know what you're being tested on	READMEs: problem, approach, tradeoffs, lessons
Resume framing for the pivot	"Engineer deepening into AI," not "beginner"	Pinned repos + resume + pivot story
Light LeetCode only if screened	Small signal, capped at ~30 problems	First batch of applications (target list of ~30)

Resource: Hamel Husain — Your AI Product Needs Evals (re-read for talking points)

Month 11 — Interview execution

Why this month: performing, not learning. Rehearse the stories until they're boring.

Learn	Why it helps	Build
System-design + prompt-debug rehearsal	Fluency under pressure	3+ active interview processes
Behavioral: pivot story, hard project	Narrative is half the decision	Synthesis blog post on what you shipped
RAG / eval architecture Q&A	Your strongest topics — own them	—

Resource: Eugene Yan — Patterns for LLM Systems (design-round checklist)

Month 12 — Land

Why this month: close, don't panic-accept. Extending the search by 1–3 months beats accepting a bad fit.

Learn	Why it helps	Build
Offer evaluation vs your constraints	Anchor to market, not current pay	1+ offer at/above target
—	—	Clean handover + retrospective doc

Resource: Cognition — Multi-Agents: What's Actually Working (stay current)

Advanced agent-runtime techniques

Apply 3 per week until reflexive. These push runtime use from competent to expert.

Technique	One-liner
Plan, not destination	Ask for a plan + approval before changes; agents over-act on destinations.
Explain rejected options	"List 2–3 approaches you ruled out" surfaces wrong-architecture early.
Diff, not change	Demand diffs; split >50-line diffs — big changes hide bugs.
Name failure modes	Specify empty/null/large/concurrent inputs — beats default happy-path.
Self-review	"Review this as a senior reviewer who didn't write it."
Anchor to real patterns	"Read 3 similar files first; match their conventions exactly."
Constrain the surface	"Only modify `src/auth/`; stop and ask before touching anything else."
Instructions as files	Save recurring context to `.github/prompts/[task].md`; reference by name.
Skepticism on hallucination	"List APIs you used + confidence they exist + how to verify."
Tight test loop	"Make this failing test pass without weakening it; then run the suite."

Committed tech stack (stop tool-shopping)

Layer	Choice
Runtime (work / personal)	Copilot Agent Mode / Claude Code
Context & prompts	Markdown — `copilot-instructions.md` / `CLAUDE.md`, version-controlled `prompts/`
Structured outputs	JSON schema + Zod validation
Vector store	Postgres + pgvector
Integrations	Native runtime tools first; MCP servers when available
Evals	Hand-rolled: golden set + LLM-as-judge
Observability	Langfuse (free tier)
Docs	Docusaurus repo (this one)

Avoid LangChain / CrewAI / AutoGen for the flagship — hand-write the loop so you understand it.

Common traps

Skipping evals because they're boring — the single most important new skill.
Tutorial loops and tool roulette — build and commit, don't keep shopping.
Networking late — start month 6, not month 10.
Perfectionist publishing — a rough post at month 6 compounds.
Underselling the pivot — you're an engineer deepening into production AI, not a beginner.
Letting current pay anchor your salary expectations.
Crowding out the day job — it's paid AI practice; protect it.

Sources

Verified June 2026.
