Skip to main content

12-Month Roadmap

A month-by-month plan to go from "engineer who uses an agent runtime" to hireable AI Application Engineer. You already drive Copilot Agent Mode / Claude Code at work โ€” this closes the engineering gaps around that skill.

Destination (month 12): an offer at a meaningfully higher salary, backed by 2โ€“3 deployed AI projects and demonstrable evals, RAG, agent-design, and production skill.

Companion docs: Start Here (reading order) ยท Glossary (vocabulary) ยท Building an Agentic System (the build recipe) ยท AI-First Methodology (the six components).


The 8 gaps this year closesโ€‹

You can already run an agent. These are what separate that from "AI Application Engineer." Evals is the highest-leverage โ€” do not skip it because it's unglamorous.

#GapWhy it matters
1EvalsMeasure if a feature works; iterate from data, not vibes. The #1 skill.
2RAG design & tuningGround models in your data; measure retrieval quality deliberately.
3Architecture judgmentChoose workflow vs single-agent vs multi-agent โ€” and defend it.
4Production engineeringCost, latency, retries, caching, observability.
5Structured outputsSchemas downstream code can rely on, not prose.
6Methodical prompt tuningMeasured improvements, not guesswork.
7Context engineeringCurate the right context at scale โ€” most "AI doesn't work" is wrong-context.
8Public writingArtifacts that move hiring managers; start mid-year.

Year at a glanceโ€‹

MonthFocusShip
1Structured foundationsFlagship agent v1 scaffold + system prompt + 5 examples
2Structured outputs & prompt libraryv1 with parseable output; 1 other user
3Evals (highest leverage)Eval harness + baseline on a personal project
4Eval-driven iteration + production basicsFlagship v2: measured prompt gains, Langfuse instrumented
5RAG at production qualityPersonal RAG project with measured retrieval quality
6Context engineering + go publicRefined context; first eval blog post
7Architecture: workflow vs agent vs multi-agentFlagship v3 pipeline (analyzer โ†’ drafter โ†’ reviewer)
8Reliability + cost/latencyHardened flagship; full-pipeline evals; failure-mode writeup
92nd public project + system designAmbitious public project, 5+ real users
10Portfolio + narrative; start applyingREADMEs, pinned repos, resume, pivot story; first applications
11Interview execution3+ active processes; synthesis blog post
12Land1+ offer at target; clean handover; retrospective

Phases compress and expand: months 1โ€“2 are short (you know the runtime); months 3โ€“6 (evals + production + RAG) get the most weight.


Month-by-monthโ€‹

Month 1 โ€” Structured foundationsโ€‹

Why this month: turn months of ad-hoc agent use into a deliberately structured project.

LearnWhy it helpsBuild
Anthropic prompt engineering docs (deep read)Closes prompt-tuning gap with named techniquesRepo structure + first system prompt
Repo structure for an agentic systemForces architecture decisions before content5 curated reference examples
~5h math max (cosine similarity, softmax/temperature)Enough to read the field; math is not the bottleneckโ€”

Resource: Anthropic prompt engineering overview ยท Building an Agentic System

Month 2 โ€” Structured outputs & prompt libraryโ€‹

Why this month: make output machine-trustworthy and capture what works.

LearnWhy it helpsBuild
Structured outputs / JSON schema (Zod, tool calling)Downstream code can rely on the agentv1 returns parseable JSON, not prose
Reusable prompt library (prompts/)Stops reinventing prompts; composabledraft / review / extract-intent prompts
10 advanced runtime techniques (table below)Pushes runtime use to expert1 other user on a real task

Resource: OpenAI Structured Outputs ยท Anthropic tool use

Month 3 โ€” Evals (the make-or-break month)โ€‹

Why this month: without evals every change is a guess. This is the single biggest hire signal.

LearnWhy it helpsBuild
Golden datasets, error analysisSource of truth for "is it better?"Eval harness on a personal project
LLM-as-judge with rubricsScales evaluation past manual reviewBaseline scores you can quote
Metrics that correlate with usefulnessAvoids measuring the wrong thingRegression-catching loop

Resource: Hamel Husain โ€” Your AI Product Needs Evals ยท LLM-as-a-Judge

Month 4 โ€” Eval-driven iteration + production basicsโ€‹

Why this month: connect evals to real engineering โ€” cost, latency, observability.

LearnWhy it helpsBuild
Streaming, retries, exponential backoffProduction-grade reliability (write the loop yourself)Flagship v2 iterated against evals
Prompt cachingCuts cost and latency materiallyCost monitoring in place
Langfuse tracingSee every call: prompt, response, tokens, costEvery LLM call instrumented

Resource: Langfuse docs ยท Anthropic prompt caching

Month 5 โ€” RAG at production qualityโ€‹

Why this month: grounding is a core production pattern โ€” and you must measure retrieval, not just generation.

LearnWhy it helpsBuild
Chunking + embedding model choiceRetrieval quality starts herePersonal RAG project on your own corpus
Hybrid search (vector + keyword), re-rankingBeats naive vector searchMeasured retrieval quality (not just vibes)
pgvector in PostgresCommitted, boring, scalable storeWorking semantic search

Resource: Eugene Yan โ€” Patterns for LLM Systems ยท pgvector

Month 6 โ€” Context engineering + go publicโ€‹

Why this month: context discipline is "the #1 job"; and public writing must start now, not in month 10.

LearnWhy it helpsBuild
Context curation & window managementMost agent failures are context failuresRefined flagship context / instructions
What to include vs exclude per taskFewer hallucinations, lower costFirst substantive eval blog post
Networking: join a practitioner communityCompounds; referrals beat cold appliesFollow 30 builders; reply weekly

Resource: Anthropic โ€” Effective context engineering for AI agents

Month 7 โ€” Architecture judgmentโ€‹

Why this month: interviewers probe "why this design?" The disagreement between the two essays IS the lesson.

LearnWhy it helpsBuild
Workflow vs single-agent vs multi-agentChoose the simplest thing that worksFlagship v3: analyzer โ†’ drafter โ†’ reviewer
When multi-agent backfires (context fragmentation)Avoid speculative complexityPer-stage evals across the pipeline
Composing patterns (routing, orchestrator-workers)Vocabulary + judgment for design roundsDefensible architecture writeup

Resource: Anthropic โ€” Building effective agents ยท Cognition โ€” Don't Build Multi-Agents

Month 8 โ€” Reliability + cost/latencyโ€‹

Why this month: "I can talk about failure modes and tradeoffs" is what makes you sound senior.

LearnWhy it helpsBuild
Failure recovery, idempotency, checkpointsProduction survives partial failuresHardened flagship + better logging
Human-in-the-loop placementTrust + adoption; catches the bad 5%Real failure-mode writeup
Model routing, parallel tool calls, token budgetsCheaper + faster without quality lossMeasured time-saving claim

Resource: Anthropic โ€” Effective harnesses for long-running agents

Month 9 โ€” Second public project + system designโ€‹

Why this month: two non-trivial deployed projects beat one; system-design fluency is interview gold.

LearnWhy it helpsBuild
AI system design (application chapters)Frame problems like an AI engineerAmbitious public project on a personal corpus
Vector store tradeoffs (pgvector vs hosted)Defend infra choices5+ strangers actually using it
Basic Python to read AI codebasesMost AI code/examples are PythonArchitecture feedback from a senior peer

Resource: Chip Huyen โ€” AI Engineering ยท aie-book repo

Month 10 โ€” Portfolio, narrative, start applyingโ€‹

Why this month: convert nine months of building into evidence; begin the funnel.

LearnWhy it helpsBuild
Take-home + AI system-design formatsKnow what you're being tested onREADMEs: problem, approach, tradeoffs, lessons
Resume framing for the pivot"Engineer deepening into AI," not "beginner"Pinned repos + resume + pivot story
Light LeetCode only if screenedSmall signal, capped at ~30 problemsFirst batch of applications (~30 target list)

Resource: Hamel Husain โ€” Your AI Product Needs Evals (re-read for talking points)

Month 11 โ€” Interview executionโ€‹

Why this month: performing, not learning. Rehearse the stories until they're boring.

LearnWhy it helpsBuild
System-design + prompt-debug rehearsalFluency under pressure3+ active interview processes
Behavioral: pivot story, hard projectNarrative is half the decisionSynthesis blog post on what you shipped
RAG / eval architecture Q&AYour strongest topics โ€” own themโ€”

Resource: Eugene Yan โ€” Patterns for LLM Systems (design-round checklist)

Month 12 โ€” Landโ€‹

Why this month: close, don't panic-accept. Extend 1โ€“3 months over a bad fit.

LearnWhy it helpsBuild
Offer evaluation vs your constraintsAnchor to market, not current pay1+ offer at/above target
โ€”โ€”Clean handover + retrospective doc

Resource: Cognition โ€” Multi-Agents: What's Actually Working (stay current)


Advanced agent-runtime techniquesโ€‹

Apply 3 per week until reflexive. These push runtime use from competent to expert.

TechniqueOne-liner
Plan, not destinationAsk for a plan + approval before changes; agents over-act on destinations.
Explain rejected options"List 2โ€“3 approaches you ruled out" surfaces wrong-architecture early.
Diff, not changeDemand diffs; split >50-line diffs โ€” big changes hide bugs.
Name failure modesSpecify empty/null/large/concurrent inputs โ€” beats default happy-path.
Self-review"Review this as a senior reviewer who didn't write it."
Anchor to real patterns"Read 3 similar files first; match their conventions exactly."
Constrain the surface"Only modify src/auth/; stop and ask before touching anything else."
Instructions as filesSave recurring context to .github/prompts/[task].md; reference by name.
Skepticism on hallucination"List APIs you used + confidence they exist + how to verify."
Tight test loop"Make this failing test pass without weakening it; then run the suite."

Committed tech stack (stop tool-shopping)โ€‹

LayerChoice
Runtime (work / personal)Copilot Agent Mode / Claude Code
Context & promptsMarkdown โ€” copilot-instructions.md / CLAUDE.md, version-controlled prompts/
Structured outputsJSON schema + Zod validation
Vector storePostgres + pgvector
IntegrationsNative runtime tools first; MCP servers when available
EvalsHand-rolled: golden set + LLM-as-judge
ObservabilityLangfuse (free tier)
DocsDocusaurus repo (this one)

Avoid LangChain / CrewAI / AutoGen for the flagship โ€” hand-write the loop so you understand it.


Common trapsโ€‹

  • Skipping evals because they're boring โ€” the single most important new skill.
  • Tutorial loops and tool roulette โ€” build and commit, don't keep shopping.
  • Networking late โ€” start month 6, not month 10.
  • Perfectionist publishing โ€” a rough post at month 6 compounds.
  • Underselling the pivot โ€” you're an engineer deepening into production AI, not a beginner.
  • Letting current pay anchor your salary expectations.
  • Crowding out the day job โ€” it's paid AI practice; protect it.

Sourcesโ€‹

Verified June 2026.