12-Month Roadmap
A month-by-month plan to go from "engineer who uses an agent runtime" to hireable AI Application Engineer. You already drive Copilot Agent Mode / Claude Code at work โ this closes the engineering gaps around that skill.
Destination (month 12): an offer at a meaningfully higher salary, backed by 2โ3 deployed AI projects and demonstrable evals, RAG, agent-design, and production skill.
Companion docs: Start Here (reading order) ยท Glossary (vocabulary) ยท Building an Agentic System (the build recipe) ยท AI-First Methodology (the six components).
The 8 gaps this year closesโ
You can already run an agent. These are what separate that from "AI Application Engineer." Evals is the highest-leverage โ do not skip it because it's unglamorous.
| # | Gap | Why it matters |
|---|---|---|
| 1 | Evals | Measure if a feature works; iterate from data, not vibes. The #1 skill. |
| 2 | RAG design & tuning | Ground models in your data; measure retrieval quality deliberately. |
| 3 | Architecture judgment | Choose workflow vs single-agent vs multi-agent โ and defend it. |
| 4 | Production engineering | Cost, latency, retries, caching, observability. |
| 5 | Structured outputs | Schemas downstream code can rely on, not prose. |
| 6 | Methodical prompt tuning | Measured improvements, not guesswork. |
| 7 | Context engineering | Curate the right context at scale โ most "AI doesn't work" is wrong-context. |
| 8 | Public writing | Artifacts that move hiring managers; start mid-year. |
Year at a glanceโ
| Month | Focus | Ship |
|---|---|---|
| 1 | Structured foundations | Flagship agent v1 scaffold + system prompt + 5 examples |
| 2 | Structured outputs & prompt library | v1 with parseable output; 1 other user |
| 3 | Evals (highest leverage) | Eval harness + baseline on a personal project |
| 4 | Eval-driven iteration + production basics | Flagship v2: measured prompt gains, Langfuse instrumented |
| 5 | RAG at production quality | Personal RAG project with measured retrieval quality |
| 6 | Context engineering + go public | Refined context; first eval blog post |
| 7 | Architecture: workflow vs agent vs multi-agent | Flagship v3 pipeline (analyzer โ drafter โ reviewer) |
| 8 | Reliability + cost/latency | Hardened flagship; full-pipeline evals; failure-mode writeup |
| 9 | 2nd public project + system design | Ambitious public project, 5+ real users |
| 10 | Portfolio + narrative; start applying | READMEs, pinned repos, resume, pivot story; first applications |
| 11 | Interview execution | 3+ active processes; synthesis blog post |
| 12 | Land | 1+ offer at target; clean handover; retrospective |
Phases compress and expand: months 1โ2 are short (you know the runtime); months 3โ6 (evals + production + RAG) get the most weight.
Month-by-monthโ
Month 1 โ Structured foundationsโ
Why this month: turn months of ad-hoc agent use into a deliberately structured project.
| Learn | Why it helps | Build |
|---|---|---|
| Anthropic prompt engineering docs (deep read) | Closes prompt-tuning gap with named techniques | Repo structure + first system prompt |
| Repo structure for an agentic system | Forces architecture decisions before content | 5 curated reference examples |
| ~5h math max (cosine similarity, softmax/temperature) | Enough to read the field; math is not the bottleneck | โ |
Resource: Anthropic prompt engineering overview ยท Building an Agentic System
Month 2 โ Structured outputs & prompt libraryโ
Why this month: make output machine-trustworthy and capture what works.
| Learn | Why it helps | Build |
|---|---|---|
| Structured outputs / JSON schema (Zod, tool calling) | Downstream code can rely on the agent | v1 returns parseable JSON, not prose |
Reusable prompt library (prompts/) | Stops reinventing prompts; composable | draft / review / extract-intent prompts |
| 10 advanced runtime techniques (table below) | Pushes runtime use to expert | 1 other user on a real task |
Resource: OpenAI Structured Outputs ยท Anthropic tool use
Month 3 โ Evals (the make-or-break month)โ
Why this month: without evals every change is a guess. This is the single biggest hire signal.
| Learn | Why it helps | Build |
|---|---|---|
| Golden datasets, error analysis | Source of truth for "is it better?" | Eval harness on a personal project |
| LLM-as-judge with rubrics | Scales evaluation past manual review | Baseline scores you can quote |
| Metrics that correlate with usefulness | Avoids measuring the wrong thing | Regression-catching loop |
Resource: Hamel Husain โ Your AI Product Needs Evals ยท LLM-as-a-Judge
Month 4 โ Eval-driven iteration + production basicsโ
Why this month: connect evals to real engineering โ cost, latency, observability.
| Learn | Why it helps | Build |
|---|---|---|
| Streaming, retries, exponential backoff | Production-grade reliability (write the loop yourself) | Flagship v2 iterated against evals |
| Prompt caching | Cuts cost and latency materially | Cost monitoring in place |
| Langfuse tracing | See every call: prompt, response, tokens, cost | Every LLM call instrumented |
Resource: Langfuse docs ยท Anthropic prompt caching
Month 5 โ RAG at production qualityโ
Why this month: grounding is a core production pattern โ and you must measure retrieval, not just generation.
| Learn | Why it helps | Build |
|---|---|---|
| Chunking + embedding model choice | Retrieval quality starts here | Personal RAG project on your own corpus |
| Hybrid search (vector + keyword), re-ranking | Beats naive vector search | Measured retrieval quality (not just vibes) |
| pgvector in Postgres | Committed, boring, scalable store | Working semantic search |
Resource: Eugene Yan โ Patterns for LLM Systems ยท pgvector
Month 6 โ Context engineering + go publicโ
Why this month: context discipline is "the #1 job"; and public writing must start now, not in month 10.
| Learn | Why it helps | Build |
|---|---|---|
| Context curation & window management | Most agent failures are context failures | Refined flagship context / instructions |
| What to include vs exclude per task | Fewer hallucinations, lower cost | First substantive eval blog post |
| Networking: join a practitioner community | Compounds; referrals beat cold applies | Follow 30 builders; reply weekly |
Resource: Anthropic โ Effective context engineering for AI agents
Month 7 โ Architecture judgmentโ
Why this month: interviewers probe "why this design?" The disagreement between the two essays IS the lesson.
| Learn | Why it helps | Build |
|---|---|---|
| Workflow vs single-agent vs multi-agent | Choose the simplest thing that works | Flagship v3: analyzer โ drafter โ reviewer |
| When multi-agent backfires (context fragmentation) | Avoid speculative complexity | Per-stage evals across the pipeline |
| Composing patterns (routing, orchestrator-workers) | Vocabulary + judgment for design rounds | Defensible architecture writeup |
Resource: Anthropic โ Building effective agents ยท Cognition โ Don't Build Multi-Agents
Month 8 โ Reliability + cost/latencyโ
Why this month: "I can talk about failure modes and tradeoffs" is what makes you sound senior.
| Learn | Why it helps | Build |
|---|---|---|
| Failure recovery, idempotency, checkpoints | Production survives partial failures | Hardened flagship + better logging |
| Human-in-the-loop placement | Trust + adoption; catches the bad 5% | Real failure-mode writeup |
| Model routing, parallel tool calls, token budgets | Cheaper + faster without quality loss | Measured time-saving claim |
Resource: Anthropic โ Effective harnesses for long-running agents
Month 9 โ Second public project + system designโ
Why this month: two non-trivial deployed projects beat one; system-design fluency is interview gold.
| Learn | Why it helps | Build |
|---|---|---|
| AI system design (application chapters) | Frame problems like an AI engineer | Ambitious public project on a personal corpus |
| Vector store tradeoffs (pgvector vs hosted) | Defend infra choices | 5+ strangers actually using it |
| Basic Python to read AI codebases | Most AI code/examples are Python | Architecture feedback from a senior peer |
Resource: Chip Huyen โ AI Engineering ยท aie-book repo
Month 10 โ Portfolio, narrative, start applyingโ
Why this month: convert nine months of building into evidence; begin the funnel.
| Learn | Why it helps | Build |
|---|---|---|
| Take-home + AI system-design formats | Know what you're being tested on | READMEs: problem, approach, tradeoffs, lessons |
| Resume framing for the pivot | "Engineer deepening into AI," not "beginner" | Pinned repos + resume + pivot story |
| Light LeetCode only if screened | Small signal, capped at ~30 problems | First batch of applications (~30 target list) |
Resource: Hamel Husain โ Your AI Product Needs Evals (re-read for talking points)
Month 11 โ Interview executionโ
Why this month: performing, not learning. Rehearse the stories until they're boring.
| Learn | Why it helps | Build |
|---|---|---|
| System-design + prompt-debug rehearsal | Fluency under pressure | 3+ active interview processes |
| Behavioral: pivot story, hard project | Narrative is half the decision | Synthesis blog post on what you shipped |
| RAG / eval architecture Q&A | Your strongest topics โ own them | โ |
Resource: Eugene Yan โ Patterns for LLM Systems (design-round checklist)
Month 12 โ Landโ
Why this month: close, don't panic-accept. Extend 1โ3 months over a bad fit.
| Learn | Why it helps | Build |
|---|---|---|
| Offer evaluation vs your constraints | Anchor to market, not current pay | 1+ offer at/above target |
| โ | โ | Clean handover + retrospective doc |
Resource: Cognition โ Multi-Agents: What's Actually Working (stay current)
Advanced agent-runtime techniquesโ
Apply 3 per week until reflexive. These push runtime use from competent to expert.
| Technique | One-liner |
|---|---|
| Plan, not destination | Ask for a plan + approval before changes; agents over-act on destinations. |
| Explain rejected options | "List 2โ3 approaches you ruled out" surfaces wrong-architecture early. |
| Diff, not change | Demand diffs; split >50-line diffs โ big changes hide bugs. |
| Name failure modes | Specify empty/null/large/concurrent inputs โ beats default happy-path. |
| Self-review | "Review this as a senior reviewer who didn't write it." |
| Anchor to real patterns | "Read 3 similar files first; match their conventions exactly." |
| Constrain the surface | "Only modify src/auth/; stop and ask before touching anything else." |
| Instructions as files | Save recurring context to .github/prompts/[task].md; reference by name. |
| Skepticism on hallucination | "List APIs you used + confidence they exist + how to verify." |
| Tight test loop | "Make this failing test pass without weakening it; then run the suite." |
Committed tech stack (stop tool-shopping)โ
| Layer | Choice |
|---|---|
| Runtime (work / personal) | Copilot Agent Mode / Claude Code |
| Context & prompts | Markdown โ copilot-instructions.md / CLAUDE.md, version-controlled prompts/ |
| Structured outputs | JSON schema + Zod validation |
| Vector store | Postgres + pgvector |
| Integrations | Native runtime tools first; MCP servers when available |
| Evals | Hand-rolled: golden set + LLM-as-judge |
| Observability | Langfuse (free tier) |
| Docs | Docusaurus repo (this one) |
Avoid LangChain / CrewAI / AutoGen for the flagship โ hand-write the loop so you understand it.
Common trapsโ
- Skipping evals because they're boring โ the single most important new skill.
- Tutorial loops and tool roulette โ build and commit, don't keep shopping.
- Networking late โ start month 6, not month 10.
- Perfectionist publishing โ a rough post at month 6 compounds.
- Underselling the pivot โ you're an engineer deepening into production AI, not a beginner.
- Letting current pay anchor your salary expectations.
- Crowding out the day job โ it's paid AI practice; protect it.
Sourcesโ
Verified June 2026.
- Anthropic โ Building effective agents
- Anthropic โ Effective context engineering for AI agents
- Anthropic โ Effective harnesses for long-running agents
- Anthropic โ Prompt engineering overview ยท Prompt caching ยท Tool use
- Hamel Husain โ Your AI Product Needs Evals ยท LLM-as-a-Judge
- Eugene Yan โ Patterns for Building LLM-based Systems & Products
- Cognition โ Don't Build Multi-Agents ยท Multi-Agents: What's Actually Working
- Chip Huyen โ AI Engineering (O'Reilly) ยท aie-book resources
- OpenAI โ Structured Outputs
- Langfuse โ LLM observability docs ยท pgvector
- Simon Willison's blog