Building an Agentic System
A project-agnostic guide for building an agentic system on top of an agent runtime (GitHub Copilot Agent Mode, Claude Code, Cursor, etc.).
This is the recipe. Apply it to a work tool, a personal side project, or any future agentic system you build.
What this guide assumes
- You have an agent runtime available (Copilot Agent Mode, Claude Code, etc.)
- You're not building the agent loop yourself — you're configuring an existing runtime for a specific task
- You can read code and use Git
- You have a clear problem in mind that an agent should solve
If those aren't all true, fix them first.
Mental model
You're not "writing software" in the traditional sense. You're curating context for an agent and defining its behavior through structured files. The agent runtime handles the loop, tools, and execution; you handle the inputs.
The four pieces of every agentic system:
- The system prompt — who the agent is, what it does, what constraints it follows
- The reference context — examples, documentation, domain knowledge the agent uses
- The reusable prompts — task-specific instructions composed across runs
- The workflow — how a user invokes the system and what the agent does in response
That's the whole structure. Everything below is the build sequence.
Step 0 — Decide before you code (~15 minutes)
Three decisions to make before a single file is created:
1. What problem is this solving? Write it in one sentence. If you can't, you're not ready to build.
Example: "Drafting routine scripts is repetitive because most new tickets resemble past ones; an agent that retrieves similar past examples and drafts a starting point would save engineers ~30 minutes per ticket."
2. Where does this live?
- Personal repo or team-owned repo?
- What's the data classification of anything you'll put in it?
- Does the project lifecycle match the repo (long-lived team tool vs short-lived experiment)?
If you're doing this at work, get explicit clearance before storing anything sensitive. Don't assume.
3. What's the success criterion? What does "v1 works" mean? Be specific.
Example: "v1 produces draft scripts I'd rather start from than from scratch on at least 4 of 5 test tickets."
Without this, you'll iterate forever.
Step 1 — Create the repo structure (~30 minutes)
Create the folder structure first, before any content. This is mechanical but valuable: it forces you to commit to the architecture before you're knee-deep in prompt-writing.
Standard structure for an agentic system:
[project-name]/
├── .github/
│   └── copilot-instructions.md   # System prompt for Copilot
├── CLAUDE.md                     # System prompt for Claude Code (if used)
├── examples/                     # Reference examples the agent reads
│   └── (empty for now)
├── prompts/                      # Reusable task-specific prompts
│   └── (empty for now)
├── inputs/                       # Where new tasks/inputs are placed
│   └── (empty for now)
├── outputs/                      # Where agent outputs are saved (audit trail)
│   └── (empty for now)
├── docs/
│   └── README.md                 # How to use this tool
└── .gitignore
Notes:
- inputs/ is often called new-tickets/, tasks/, or similar, depending on what the agent processes
- outputs/ is your audit trail: every run gets saved here so you can compare across iterations
- Use .gitkeep files in empty folders, because Git does not track empty directories
- Initial commit is just the empty structure plus a one-line README
Why this order: decoupling structure from content means you don't make folder-naming decisions while also making prompt-content decisions. Two different cognitive tasks.
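If you would rather script the skeleton than create it by hand, a minimal sketch follows. The function name `scaffold` is an assumption for illustration, and the folder names are taken from the structure above; rename them to suit your project.

```python
from pathlib import Path

# Folder and file names mirror the structure listed above.
FOLDERS = [".github", "examples", "prompts", "inputs", "outputs", "docs"]
EMPTY_FOR_NOW = ["examples", "prompts", "inputs", "outputs"]
FILES = {
    ".github/copilot-instructions.md": "",
    "CLAUDE.md": "",
    "docs/README.md": "# How to use this tool\n",
    ".gitignore": "",
}


def scaffold(root: Path) -> None:
    """Create the empty project skeleton under root (hypothetical helper)."""
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for folder in EMPTY_FOR_NOW:
        # Git does not track empty directories, so add a placeholder file.
        (root / folder / ".gitkeep").touch()
    for name, content in FILES.items():
        path = root / name
        if not path.exists():  # never overwrite content that is already there
            path.write_text(content, encoding="utf-8")
```

Calling `scaffold(Path("my-agent"))` twice is harmless: existing files are left untouched, so the script can be re-run after you start filling in content.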
Step 2 — Curate reference examples (~60–90 minutes)
One of the two most important steps, alongside the system prompt in Step 3. The agent's output tends to mirror the examples it reads, so weak examples produce weak drafts.
Pick 5 examples (not 3, not 20). Five is the right size for v1:
- Few enough that you can curate each carefully
- Enough variety that the agent learns patterns, not specific cases
- Small enough to fit in one context window
For each example, capture:
# [ID] — [Short description]
## Context
[The problem this example solved, in 1–2 sentences]
## Input
[The original input — request, ticket, question, document, whatever the agent receives]
## Solution / Output
[The actual output that was produced, in its final form]
## Notes
[1–2 sentences on what was tricky, what could have gone wrong, or why
this approach was chosen over alternatives]
The Notes section is what makes the example teach, not just demonstrate.
Variety guidelines:
- 1–2 simple cases (agent should nail these)
- 1–2 medium cases (agent should mostly nail with minor cleanup)
- 1 edge case or "almost went wrong" example (teaches the agent what to be paranoid about)
Avoid:
- 5 examples that are too similar — agent learns one pattern, fails on the rest
- 5 examples that are wildly different — agent learns no pattern at all
- Examples without Notes — agent sees what was done but not why
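A quick check can catch examples that are missing a section before the agent ever reads them. This is a sketch only: `missing_sections` is a hypothetical helper, and it assumes each example uses the `##` headings from the template above.

```python
import re

# Section headings from the example template above.
REQUIRED_SECTIONS = ["Context", "Input", "Solution / Output", "Notes"]


def missing_sections(example_text: str) -> list[str]:
    """Return required sections that are absent or have an empty body."""
    # Splitting on level-2 headings yields: preamble, heading, body, heading, body, ...
    parts = re.split(r"^## +(.+?)[ \t]*$", example_text, flags=re.MULTILINE)
    bodies = {
        parts[i].strip(): parts[i + 1].strip()
        for i in range(1, len(parts) - 1, 2)
    }
    return [name for name in REQUIRED_SECTIONS if not bodies.get(name)]
```

Running it over every file in examples/ and printing the non-empty results is enough to spot an example that demonstrates without teaching, such as one with an empty Notes section.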
Step 3 — Write the system prompt (~60 minutes)
The system prompt is the single highest-leverage file in the project. Write it deliberately.
Save as .github/copilot-instructions.md (Copilot) or CLAUDE.md (Claude Code) or both. The agent runtime loads it automatically.
Use this template:
# [Project Name]
## Role
[What persona should the agent adopt? Be specific.]
Example: "Senior engineer at [company type], drafting defensive, reviewable
[type of artifact]. You write code other engineers can read and confidently merge."
## Domain context
- [What does the codebase or domain do?]
- [Key entities, terms, structures]
- [Database, framework, language specifics]
- [Anything unusual]
## Style
- [Specific naming conventions]
- [Code formatting preferences]
- [Common patterns the codebase uses]
- [Patterns the codebase avoids]
- Pull this from looking at your real examples — don't invent generic best practices.
## Workflow
When asked to perform the main task:
1. [Step 1]
2. [Step 2]
3. [Step 3]
4. [Step 4]
5. [Step 5 — typically output]
## Constraints
- [What should the agent never do?]
- [What should it always check?]
- [When should it stop and ask vs proceed?]
## Output format
[How should responses be structured? Be specific.]
1. [Section 1]
2. [Section 2]
3. [Section 3 — usually concerns/risks/uncertainties]
The two biggest mistakes when writing this:
Mistake 1: Generic instructions.
Bad: "Write good code that follows best practices."
Good: "Always alias tables when joining (from customers c). One condition per line in WHERE clauses. Comment any non-obvious business logic above the SQL."
The difference is specificity. Generic instructions teach the agent nothing it didn't already know.
Mistake 2: Style rules from imagination, not observation. Look at your actual examples. Pull the style rules from what's already there. The agent should match your team's existing patterns, not generic conventions from training data.
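Before the first run, it can help to confirm that the template was actually filled in. The sketch below is one possible check, not part of any runtime: `prompt_problems` is a hypothetical helper that assumes the `##` section names from the template above and treats a line consisting only of bracketed text as an unfilled slot.

```python
import re

# Section names from the system prompt template above.
TEMPLATE_SECTIONS = [
    "Role", "Domain context", "Style", "Workflow", "Constraints", "Output format",
]


def prompt_problems(prompt_text: str) -> list[str]:
    """List missing template sections and lines that are still [placeholders]."""
    headings = set(re.findall(r"^## +(.+?)[ \t]*$", prompt_text, flags=re.MULTILINE))
    problems = [
        f"missing section: {name}"
        for name in TEMPLATE_SECTIONS
        if name not in headings
    ]
    for number, line in enumerate(prompt_text.splitlines(), start=1):
        # Matches "[...]" alone on a line, optionally as a bullet or numbered item.
        if re.fullmatch(r"\s*(?:[-*]|\d+\.)?\s*\[[^\]]+\]\s*", line):
            problems.append(f"line {number}: unfilled placeholder")
    return problems
```

This only checks structure. Whether the instructions are specific enough is still a judgment you make by reading them against your real examples.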
Step 4 — Run the first draft (~30 minutes)
Pick a real input — a new task that needs solving. Save it to inputs/[id].md in the same format your examples use.
Open the agent runtime, switch to its agent mode if it has one (Agent Mode in Copilot), and run a prompt like:
Read inputs/[id].md and the examples/ folder. Following the workflow in [copilot-instructions.md / CLAUDE.md], [perform the task] and tell me which past examples you drew from.
Watch what happens. Specifically:
- Does the agent read the right files in the right order?
- Does it identify reasonable similar examples?
- Does the output follow the format you specified?
- Does the substance look right?
- What's clearly wrong?
Save the run. Copy the output to outputs/[id]-v1.md. This becomes your audit trail. Without saved outputs, you can't compare iterations later.
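Copying outputs by hand is easy to forget, so a small helper can do the versioning for you. This is a sketch under the assumption that outputs are named [id]-vN.md as described above; `save_run` is a hypothetical name.

```python
import re
from pathlib import Path


def save_run(outputs_dir: Path, input_id: str, output_text: str) -> Path:
    """Write output_text to <id>-vN.md, where N is one past the highest saved version."""
    outputs_dir.mkdir(parents=True, exist_ok=True)
    pattern = re.compile(re.escape(input_id) + r"-v(\d+)\.md")
    versions = []
    for path in outputs_dir.iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            versions.append(int(match.group(1)))
    target = outputs_dir / f"{input_id}-v{max(versions, default=0) + 1}.md"
    target.write_text(output_text, encoding="utf-8")
    return target
```

Because the version number is derived from what is already on disk, earlier runs are never overwritten, which keeps the audit trail intact.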
Step 5 — Iterate (the real work — ongoing)
This is where most engineers stop too early. The first run is rarely good. The system gets sharp over the iterations that follow.
After each run, ask:
| Symptom | Likely fix |
|---|---|
| Agent's behavior is wrong (style, format, missing constraints) | Update system prompt |
| Agent learned the wrong pattern | Add a clearer example, or replace a confusing one |
| Agent's outputs are inconsistent | Tighten the output format section in system prompt |
| Agent gets confused on certain types of inputs | Add an example covering that case |
| Agent ignores a constraint | Move the constraint earlier in the system prompt; restate it in the workflow |
| Agent hallucinates non-existent things | Add an explicit "if uncertain, say so rather than invent" constraint |
Run, observe, edit, run again. This is the loop. You'll do it 10–20 times to get to v1 quality.
A trap to avoid: changing two things between runs and not knowing which one helped. Change one thing at a time.
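To see what a single change actually did, compare the saved output from before and after it. The sketch below uses Python's standard `difflib`; the function name `diff_runs` and the labels passed to it are assumptions for illustration.

```python
import difflib


def diff_runs(old_text: str, new_text: str,
              old_name: str = "previous", new_name: str = "current") -> str:
    """Return a unified diff between two saved outputs (empty string if identical)."""
    lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(lines)
```

An empty diff after a prompt edit is useful information too: it suggests the change had no effect on that input, at least for that run.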
Step 6 — Add the prompt library (week 2 onwards)
Once the main workflow works, start extracting reusable sub-task prompts into prompts/.
Common sub-task prompts:
- prompts/[main-task].md — the standard task prompt (mostly restating the workflow from system prompt)
- prompts/review-[output].md — given a draft output, review it for issues
- prompts/extract-[intent].md — given an ambiguous input, extract clear intent
- prompts/explain-[output].md — explain an existing output in plain language
- prompts/compare-[options].md — compare multiple approaches
Each prompt file:
- Single, clear purpose
- Specifies the role, workflow, and output format for that one task
- Can be referenced by name: "Use the workflow in prompts/review-output.md."
Why this matters: prompts in a folder are version-controlled, reviewable, and shareable. Prompts in your head are forgotten and reinvented inconsistently.
Step 7 — Build a workflow chain (week 2–3)
Once you have a prompt library, you can compose them into multi-step workflows.
Example: handling an ambiguous input
- Use extract-intent.md to clarify what's being asked
- Use [main-task].md to produce a draft
- Use review-output.md to check the draft
Each step is a separate agent invocation with its own focused prompt. The user (or you) chains them.
This is the simplest form of multi-agent orchestration: same agent, different prompts, sequenced manually. It's enough for v1–v2. Multi-agent with separate orchestration is a Phase 3 concern.
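If sequencing by hand becomes tedious, the three steps in the example above can be expressed as a loop. This is a sketch only: `run_agent` is a hypothetical stand-in for however you invoke your runtime (no real runtime API is implied), and `main-task.md` is a placeholder file name.

```python
from typing import Callable

# Step order from the example above; "main-task.md" is a placeholder name.
CHAIN = ["extract-intent.md", "main-task.md", "review-output.md"]


def run_chain(task_input: str, prompts: dict[str, str],
              run_agent: Callable[[str], str]) -> list[str]:
    """Run each prompt as a separate invocation, feeding each output into the next."""
    outputs = []
    current = task_input
    for name in CHAIN:
        # Each invocation sees only its own prompt plus the previous step's output.
        message = f"{prompts[name]}\n\n---\n\n{current}"
        current = run_agent(message)
        outputs.append(current)
    return outputs
```

Returning every intermediate output, not just the last one, lets you save each step and see where a chain went wrong.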
Step 8 — Get one user other than yourself (week 2–3)
Pick someone who'd plausibly use the tool. Walk them through the workflow on a real task:
- They place their input into inputs/[id].md
- They open the agent runtime
- They run the workflow
- They tell you what helped and what didn't
Their friction is your most valuable input for further iteration. The first user always finds problems you missed.
Update the README based on what they actually did, not what you assumed they'd do.
Step 9 — Add evals (later phase, but plan for it)
Once you have ~10 documented runs in outputs/, you have the start of an eval set.
For each saved output, add a metadata.json next to it:
{
  "input_id": "[id]",
  "run_version": "v3",
  "verdict": "PASS / WARN / FAIL",
  "issues": ["missing rollback", "wrong table alias style"],
  "notes": "..."
}
When you have ~10 of these labeled, you can build a runner that:
- Re-runs the agent on each input under the current system
- Scores the new output against the labeled expected behavior
- Flags regressions when prompt changes degrade quality
Without evals, every prompt change is a guess. With them, every change is verified.
This is full Phase 2 work — don't try to build it in week 1.
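When you do get there, the regression check at the heart of such a runner can be very small. The sketch below assumes each metadata record carries a single verdict (one of PASS, WARN, FAIL) and that new verdicts come from whatever scoring you choose, human or automated. The names `parse_metadata` and `find_regressions` are hypothetical.

```python
import json

# Higher rank is better; a drop in rank counts as a regression.
RANK = {"FAIL": 0, "WARN": 1, "PASS": 2}


def parse_metadata(text: str) -> dict:
    """Parse one metadata.json document and check that its verdict is a known value."""
    record = json.loads(text)
    if record["verdict"] not in RANK:
        raise ValueError(f"unknown verdict: {record['verdict']!r}")
    return record


def find_regressions(labeled: list[dict], new_verdicts: dict[str, str]) -> list[str]:
    """Return input_ids whose new verdict ranks below the labeled one."""
    regressions = []
    for record in labeled:
        new = new_verdicts.get(record["input_id"])
        if new is not None and RANK[new] < RANK[record["verdict"]]:
            regressions.append(record["input_id"])
    return regressions
```

Inputs that were not re-run are skipped, not counted as failures, so a partial re-run still gives a meaningful answer.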
Anti-patterns to avoid
1. Skipping Step 0. Building before deciding what success looks like is how you build forever.
2. Fake examples when real ones exist. If you're building something for real work, real examples beat fake ones every time. If you can't use real ones for policy reasons, sanitize aggressively but preserve structure.
3. Generic system prompts. "Write good code" teaches the agent nothing. Specificity is everything.
4. Iterating without saving outputs. If you can't compare run 1 to run 5, you can't tell if you're improving.
5. Adding more examples instead of fixing prompts. When the agent's behavior is off, the system prompt is usually the lever, not the example count.
6. Building the prompt library before the basic flow works. Prompt libraries make sense once the main workflow is solid. Building them upfront is over-engineering.
7. Skipping the first user. Your own use isn't enough. Every system has friction the builder doesn't see.
8. Aiming for autonomy too early. A reviewable, supervised v1 beats a fully-autonomous v1 that breaks unpredictably. Autonomy comes after trust, which comes from evals.
Time budget for v1
For a focused agentic system on a clear problem:
| Step | Time |
|---|---|
| 0 — Decide before code | 15 min |
| 1 — Create structure | 30 min |
| 2 — Curate examples | 60–90 min |
| 3 — System prompt | 60 min |
| 4 — First run | 30 min |
| 5 — Iterate to v1 | 5–10 hours, spread across days |
| 6 — Prompt library | 2–4 hours, week 2 |
| 7 — Workflow chain | 1–2 hours, week 2 |
| 8 — First user | 1–2 hours observing + iterating |
Total to a useful v1: roughly 15–20 hours of focused work spread across 2–3 weeks. Less if the problem is simple; more if domain context is complex.
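As a sanity check on that total, the rows of the table can be summed directly. The figures below are copied from the table, converted to hours; the strict sum is a little wider than the rounded estimate.

```python
# (step, low hours, high hours), copied from the table above.
BUDGET = [
    ("0 Decide before code", 0.25, 0.25),
    ("1 Create structure", 0.5, 0.5),
    ("2 Curate examples", 1.0, 1.5),
    ("3 System prompt", 1.0, 1.0),
    ("4 First run", 0.5, 0.5),
    ("5 Iterate to v1", 5.0, 10.0),
    ("6 Prompt library", 2.0, 4.0),
    ("7 Workflow chain", 1.0, 2.0),
    ("8 First user", 1.0, 2.0),
]

low = sum(lo for _, lo, _ in BUDGET)
high = sum(hi for _, _, hi in BUDGET)
print(f"{low:.2f} to {high:.2f} hours")  # 12.25 to 21.75 hours
```

Step 5 accounts for roughly 40 to 45 percent of either total, which is why cutting iteration short is the most tempting and most costly shortcut.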
Mental model recap
Building an agentic system is not writing software. It's:
- Curating context for an existing agent runtime
- Defining behavior through structured prompts
- Iterating against real inputs until quality is acceptable
- Adding rigor (evals, observability) once value is proven
The agent runtime is the engine. You're tuning it for a specific job. Done well, this is fast and produces real value. Done poorly, it produces a brittle demo.
The difference is mostly in Steps 2 and 3 (examples and system prompt) and in patient iteration in Step 5. Most of the value compresses into "good examples + good prompt + many iterations."
Now go build something.