Building an Agentic System

A project-agnostic guide for building an agentic system on top of an agent runtime (GitHub Copilot Agent Mode, Claude Code, Cursor, etc.).

This is the recipe. Apply it to a work tool, a personal side project, or any future agentic system you build.

What this guide assumes

You have an agent runtime available (Copilot Agent Mode, Claude Code, etc.)
You're not building the agent loop yourself — you're configuring an existing runtime for a specific task
You can read code and use Git
You have a clear problem in mind that an agent should solve

If those aren't all true, fix them first.

Mental model

You're not "writing software" in the traditional sense. You're curating context for an agent and defining its behavior through structured files. The agent runtime handles the loop, tools, and execution; you handle the inputs.

The four pieces of every agentic system:

The system prompt — who the agent is, what it does, what constraints it follows
The reference context — examples, documentation, domain knowledge the agent uses
The reusable prompts — task-specific instructions composed across runs
The workflow — how a user invokes the system and what the agent does in response

That's the whole structure. Everything below is the build sequence.

Step 0 — Decide before you code (~15 minutes)

Three decisions to make before a single file is created:

1. What problem is this solving? Write it in one sentence. If you can't, you're not ready to build.

Example: "Drafting routine scripts is repetitive because most new tickets resemble past ones; an agent that retrieves similar past examples and drafts a starting point would save engineers ~30 minutes per ticket."

2. Where does this live?

Personal repo or team-owned repo?
What's the data classification of anything you'll put in it?
Does the project lifecycle match the repo (long-lived team tool vs short-lived experiment)?

If you're doing this at work, get explicit clearance before storing anything sensitive. Don't assume.

3. What's the success criterion? What does "v1 works" mean? Be specific.

Example: "v1 produces draft scripts I'd rather start from than from scratch on at least 4 of 5 test tickets."

Without this, you'll iterate forever.

Step 1 — Create the repo structure (~30 minutes)

Create the folder structure first, before any content. This is mechanical but valuable: it forces you to commit to the architecture before you're knee-deep in prompt-writing.

Standard structure for an agentic system:

[project-name]/
├── .github/
│   └── copilot-instructions.md      # System prompt for Copilot
├── CLAUDE.md                         # System prompt for Claude Code (if used)
├── examples/                         # Reference examples the agent reads
│   └── (empty for now)
├── prompts/                          # Reusable task-specific prompts
│   └── (empty for now)
├── inputs/                           # Where new tasks/inputs are placed
│   └── (empty for now)
├── outputs/                          # Where agent outputs are saved (audit trail)
│   └── (empty for now)
├── docs/
│   └── README.md                     # How to use this tool
└── .gitignore

Notes:

inputs/ is often called new-tickets/, tasks/, or similar depending on what the agent processes
outputs/ is your audit trail — every run gets saved here so you can compare across iterations
Git does not track empty directories, so add a .gitkeep file to each empty folder (the name is a convention, not a Git feature)
Initial commit is just the empty structure plus a one-line README

Why this order: decoupling structure from content means you don't make folder-naming decisions while also making prompt-content decisions. Two different cognitive tasks.
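Because this step is mechanical, it can be scripted. The following is a minimal sketch, assuming the layout shown above; the `scaffold` function and its constants are illustrative names, not part of any runtime.

```python
from pathlib import Path

# Folders that start empty and need a .gitkeep so Git will track them.
EMPTY_FOLDERS = ["examples", "prompts", "inputs", "outputs"]
# Files from the layout above; all are created empty at this stage.
FILES = [".github/copilot-instructions.md", "CLAUDE.md", "docs/README.md", ".gitignore"]


def scaffold(root: Path) -> list[Path]:
    """Create the empty project skeleton under root and return the files created."""
    created = []
    for folder in EMPTY_FOLDERS:
        keep = root / folder / ".gitkeep"
        keep.parent.mkdir(parents=True, exist_ok=True)
        keep.touch()
        created.append(keep)
    for name in FILES:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
        created.append(path)
    return created
```

Calling `scaffold(Path("my-agent"))` (a hypothetical project name) leaves you with a structure you can commit before writing any content.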

Step 2 — Curate reference examples (~60–90 minutes)

This is the most important step. The quality of the agent's output depends heavily on the examples it reads.

Pick 5 examples (not 3, not 20). Five is the right size for v1:

Few enough that you can curate each carefully
Enough variety that the agent learns patterns, not specific cases
Small enough to fit in one context window

For each example, capture:

# [ID] — [Short description]

## Context

[The problem this example solved, in 1–2 sentences]

## Input

[The original input — request, ticket, question, document, whatever the agent receives]

## Solution / Output

[The actual output that was produced, in its final form]

## Notes

[1–2 sentences on what was tricky, what could have gone wrong, or why
this approach was chosen over alternatives]

The Notes section is what makes the example teach, not just demonstrate.

Variety guidelines:

1–2 simple cases (agent should nail these)
1–2 medium cases (agent should mostly nail with minor cleanup)
1 edge case or "almost went wrong" example (teaches the agent what to be paranoid about)

Avoid:

5 examples that are too similar — agent learns one pattern, fails on the rest
5 examples that are wildly different — agent learns no pattern at all
Examples without Notes — agent sees what was done but not why
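A small check can catch examples that are missing a section, especially Notes. This is a sketch only, assuming each example is a markdown file that uses the four `##` headings from the template above; `missing_sections` is an illustrative helper name.

```python
import re

# Section headings taken from the example template above.
REQUIRED_SECTIONS = ["Context", "Input", "Solution / Output", "Notes"]


def missing_sections(example_text: str) -> list[str]:
    """Return required sections that are absent or have no body text."""
    # re.split with one capture group yields [preamble, title1, body1, title2, body2, ...].
    parts = re.split(r"^## +(.+?)[ \t]*$", example_text, flags=re.MULTILINE)
    bodies = {
        title.strip(): body.strip()
        for title, body in zip(parts[1::2], parts[2::2])
    }
    return [name for name in REQUIRED_SECTIONS if not bodies.get(name)]
```

Running it over every file in examples/ before a run is a cheap way to make sure no example demonstrates without teaching.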

Step 3 — Write the system prompt (~60 minutes)

The system prompt is the single highest-leverage file in the project. Write it deliberately.

Save as .github/copilot-instructions.md (Copilot) or CLAUDE.md (Claude Code) or both. The agent runtime loads it automatically.

Use this template:

# [Project Name]

## Role

[What persona should the agent adopt? Be specific.]
Example: "Senior engineer at [company type], drafting defensive, reviewable
[type of artifact]. You write code other engineers can read and confidently merge."

## Domain context

- [What does the codebase or domain do?]
- [Key entities, terms, structures]
- [Database, framework, language specifics]
- [Anything unusual]

## Style

- [Specific naming conventions]
- [Code formatting preferences]
- [Common patterns the codebase uses]
- [Patterns the codebase avoids]
- Pull this from looking at your real examples — don't invent generic best practices.

## Workflow

When asked to perform the main task:

1. [Step 1]
2. [Step 2]
3. [Step 3]
4. [Step 4]
5. [Step 5 — typically output]

## Constraints

- [What should the agent never do?]
- [What should it always check?]
- [When should it stop and ask vs proceed?]

## Output format

[How should responses be structured? Be specific.]

1. [Section 1]
2. [Section 2]
3. [Section 3 — usually concerns/risks/uncertainties]

The two biggest mistakes when writing this:

Mistake 1: Generic instructions. Bad: "Write good code that follows best practices." Good: "Always alias tables when joining (from customers c). One condition per line in WHERE clauses. Comment any non-obvious business logic above the SQL."

The difference is specificity. Generic instructions teach the agent nothing it didn't already know.

Mistake 2: Style rules from imagination, not observation. Look at your actual examples. Pull the style rules from what's already there. The agent should match your team's existing patterns, not generic conventions from training data.
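Mistake 1 and leftover template slots can both be caught mechanically. The sketch below is an assumption-laden starting point: `lint_system_prompt` is an illustrative name, and the phrase list contains only the two generic phrases quoted above, so extend it with your own.

```python
import re

# Phrases the guide calls out as too generic; extend with your own.
GENERIC_PHRASES = ["best practices", "write good code"]


def lint_system_prompt(text: str) -> dict[str, list[str]]:
    """Report unfilled template slots and generic phrases in a system prompt."""
    # Bracketed slots such as "[Step 1]"; the lookahead skips markdown links "[text](url)".
    # Legitimate bracketed text would also be flagged, so treat hits as hints.
    placeholders = re.findall(r"\[[^\[\]\n]+\](?!\()", text)
    lowered = text.lower()
    generic = [phrase for phrase in GENERIC_PHRASES if phrase in lowered]
    return {"unfilled_placeholders": placeholders, "generic_phrases": generic}
```

An empty report does not mean the prompt is good. It only means the two most mechanical problems are absent.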

Step 4 — Run the first draft (~30 minutes)

Pick a real input — a new task that needs solving. Save it to inputs/[id].md in the same format your examples use.

Open the agent runtime, switch to Agent Mode if your runtime has a separate mode for it, and run a prompt like:

Read inputs/[id].md and the examples/ folder. Following the workflow in [copilot-instructions.md / CLAUDE.md], [perform the task] and tell me which past examples you drew from.

Watch what happens. Specifically:

Does the agent read the right files in the right order?
Does it identify reasonable similar examples?
Does the output follow the format you specified?
Does the substance look right?
What's clearly wrong?

Save the run. Copy the output to outputs/[id]-v1.md. This becomes your audit trail. Without saved outputs, you can't compare iterations later.
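Saving each run under the next free version number can be automated so that no earlier output is overwritten. This is a minimal sketch, assuming the outputs/[id]-vN.md naming used above; `next_version_path` and `save_run` are illustrative names.

```python
import re
from pathlib import Path


def next_version_path(outputs_dir: Path, input_id: str) -> Path:
    """Return <outputs_dir>/<id>-vN.md, where N is one past the highest saved version."""
    pattern = re.compile(rf"^{re.escape(input_id)}-v(\d+)\.md$")
    versions = [
        int(m.group(1))
        for p in outputs_dir.glob("*.md")
        if (m := pattern.match(p.name))
    ]
    return outputs_dir / f"{input_id}-v{max(versions, default=0) + 1}.md"


def save_run(outputs_dir: Path, input_id: str, output_text: str) -> Path:
    """Write one agent output to the next free versioned file and return its path."""
    outputs_dir.mkdir(parents=True, exist_ok=True)
    path = next_version_path(outputs_dir, input_id)
    path.write_text(output_text, encoding="utf-8")
    return path
```

The first call for a given input writes `-v1.md`, the next `-v2.md`, and so on, which keeps the audit trail append-only.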

Step 5 — Iterate (the real work — ongoing)

This is where most engineers stop too early. The first run is rarely good. Iterations 2–10 are where the system gets sharp.

After each run, ask:

Symptom	Likely fix
Agent's behavior is wrong (style, format, missing constraints)	Update system prompt
Agent learned the wrong pattern	Add a clearer example, or replace a confusing one
Agent's outputs are inconsistent	Tighten the output format section in the system prompt
Agent gets confused on certain types of inputs	Add an example covering that case
Agent ignores a constraint	Move the constraint earlier in the system prompt; restate it in the workflow
Agent hallucinates non-existent things	Add an explicit "if uncertain, say so rather than invent" constraint

Run, observe, edit, run again. This is the loop. You'll do it 10–20 times to get to v1 quality.

A trap to avoid: changing two things between runs and not knowing which one helped. Change one thing at a time.
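If you change one thing per run, a diff between consecutive saved outputs shows what that change did. The sketch below uses Python's standard `difflib`; `diff_runs` is an illustrative helper, and the default labels are placeholders.

```python
import difflib


def diff_runs(old: str, new: str, old_name: str = "v1", new_name: str = "v2") -> str:
    """Return a unified diff between two saved outputs, or "" if they are identical."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(lines)
```

Agent output varies between runs even with no prompt change, so a non-empty diff is something to read, not proof that the edit caused it.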

Step 6 — Add the prompt library (week 2 onwards)

Once the main workflow works, start extracting reusable sub-task prompts into prompts/.

Common sub-task prompts:

prompts/[main-task].md — the standard task prompt (mostly restating the workflow from system prompt)
prompts/review-[output].md — given a draft output, review it for issues
prompts/extract-[intent].md — given an ambiguous input, extract clear intent
prompts/explain-[output].md — explain an existing output in plain language
prompts/compare-[options].md — compare multiple approaches

Each prompt file:

Single, clear purpose
Specifies the role, workflow, and output format for that one task
Can be referenced by name: "Use the workflow in prompts/review-output.md."

Why this matters: prompts in a folder are version-controlled, reviewable, and shareable. Prompts in your head are forgotten and reinvented inconsistently.

Step 7 — Build a workflow chain (week 2–3)

Once you have a prompt library, you can compose them into multi-step workflows.

Example: handling an ambiguous input

Use extract-intent.md to clarify what's being asked
Use [main-task].md to produce a draft
Use review-output.md to check the draft

Each step is a separate agent invocation with its own focused prompt. The user (or you) chains them.

This is the simplest form of multi-agent orchestration: same agent, different prompts, sequenced manually. It's enough for v1–v2. Multi-agent with separate orchestration is a Phase 3 concern.
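The chaining itself is simple enough to sketch in a few lines. In the example below, `run_agent` is a hypothetical stand-in for however you invoke your runtime (by hand, or through whatever interface it offers); it is not a real API of any tool named in this guide. The `---` separator between instructions and input is likewise an arbitrary choice.

```python
from typing import Callable

# Hypothetical signature: takes the full prompt text, returns the agent's reply.
AgentFn = Callable[[str], str]


def run_chain(run_agent: AgentFn, prompt_texts: list[str], user_input: str) -> list[str]:
    """Run each prompt in order, feeding every step the previous step's output."""
    outputs = []
    current = user_input
    for instructions in prompt_texts:
        # One focused invocation per step: instructions first, then the material to work on.
        current = run_agent(f"{instructions}\n\n---\n\n{current}")
        outputs.append(current)
    return outputs
```

Here `prompt_texts` would hold the contents of extract-intent.md, [main-task].md, and review-output.md, in that order. Keeping every intermediate output makes it easier to see which step went wrong.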

Step 8 — Get one user other than yourself (week 2–3)

Pick someone who'd plausibly use the tool. Walk them through the workflow on a real task:

They place their input into inputs/[id].md
They open the agent runtime
They run the workflow
They tell you what helped and what didn't

Their friction is your most valuable input for further iteration. The first user always finds problems you missed.

Update the README based on what they actually did, not what you assumed they'd do.

Step 9 — Add evals (later phase, but plan for it)

Once you have ~10 documented runs in outputs/, you have the start of an eval set.

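For each saved output, add a metadata file next to it (for example [id]-v3.metadata.json, so that files for different runs don't collide). Set verdict to exactly one of PASS, WARN, or FAIL: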

{
  "input_id": "[id]",
  "run_version": "v3",
  "verdict": "PASS / WARN / FAIL",
  "issues": ["missing rollback", "wrong table alias style"],
  "notes": "..."
}

When you have ~10 of these labeled, you can build a runner that:

Re-runs the agent on each input under the current system prompt and examples
Scores the new output against the labeled expected behavior
Flags regressions when prompt changes degrade quality

Without evals, every prompt change is a guess. With them, every change is verified.

This is full Phase 2 work — don't try to build it in week 1.
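When you do get there, the core of such a runner might look like the sketch below. Everything here is an assumption layered on the metadata format above: `run_agent` and `score` are hypothetical callables you would supply, the `*.metadata.json` file naming is invented for illustration, and the code expects one label per input.

```python
import json
from pathlib import Path
from typing import Callable

# Higher rank means worse quality.
RANK = {"PASS": 0, "WARN": 1, "FAIL": 2}


def load_labels(outputs_dir: Path) -> list[dict]:
    """Load every label file; the '*.metadata.json' naming is an assumption."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(outputs_dir.glob("*.metadata.json"))
    ]


def find_regressions(
    labels: list[dict],
    run_agent: Callable[[str], str],    # hypothetical: input id -> new output
    score: Callable[[str, dict], str],  # hypothetical: (output, label) -> verdict
) -> list[tuple[str, str, str]]:
    """Return (input_id, labeled_verdict, new_verdict) where the new run is worse."""
    regressions = []
    for label in labels:
        new_output = run_agent(label["input_id"])
        new_verdict = score(new_output, label)
        if RANK[new_verdict] > RANK[label["verdict"]]:
            regressions.append((label["input_id"], label["verdict"], new_verdict))
    return regressions
```

The hard part is `score`: it might be a set of rules, a human reviewer, or another model call, and this sketch deliberately leaves that choice open.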

Anti-patterns to avoid

1. Skipping Step 0. Building before deciding what success looks like is how you build forever.

2. Fake examples when real ones exist. If you're building something for real work, real examples beat fake ones every time. If you can't use real ones for policy reasons, sanitize aggressively but preserve structure.

3. Generic system prompts. "Write good code" teaches the agent nothing. Specificity is everything.

4. Iterating without saving outputs. If you can't compare run 1 to run 5, you can't tell if you're improving.

5. Adding more examples instead of fixing prompts. When the agent's behavior is off, the system prompt is usually the lever, not the example count.

6. Building the prompt library before the main workflow works. Prompt libraries make sense once the basic flow is solid. Building them upfront is over-engineering.

7. Skipping the first user. Your own use isn't enough. Every system has friction the builder doesn't see.

8. Aiming for autonomy too early. A reviewable, supervised v1 beats a fully-autonomous v1 that breaks unpredictably. Autonomy comes after trust, which comes from evals.

Time budget for v1

For a focused agentic system on a clear problem:

Step	Time
0 — Decide before code	15 min
1 — Create structure	30 min
2 — Curate examples	60–90 min
3 — System prompt	60 min
4 — First run	30 min
5 — Iterate to v1	5–10 hours, spread across days
6 — Prompt library	2–4 hours, week 2
7 — Workflow chain	1–2 hours, week 2
8 — First user	1–2 hours observing + iterating

Total to a useful v1: roughly 15–20 hours of focused work spread across 2–3 weeks. Less if the problem is simple; more if domain context is complex.
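As a sanity check on that estimate, the rows of the table can be summed directly. The snippet below copies the low and high ends of each row, converted to hours; `total_hours` is an illustrative helper.

```python
# (low, high) hours per step, copied from the time budget table.
BUDGET_HOURS = {
    "0 Decide before code": (0.25, 0.25),
    "1 Create structure": (0.5, 0.5),
    "2 Curate examples": (1.0, 1.5),
    "3 System prompt": (1.0, 1.0),
    "4 First run": (0.5, 0.5),
    "5 Iterate to v1": (5.0, 10.0),
    "6 Prompt library": (2.0, 4.0),
    "7 Workflow chain": (1.0, 2.0),
    "8 First user": (1.0, 2.0),
}


def total_hours(budget: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the low and high estimates across all steps."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


print(total_hours(BUDGET_HOURS))  # (12.25, 21.75)
```

The rows add up to between 12.25 and 21.75 hours, so the 15–20 hour figure sits in the middle of that range.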

Mental model recap

Building an agentic system is not writing software. It's:

Curating context for an existing agent runtime
Defining behavior through structured prompts
Iterating against real inputs until quality is acceptable
Adding rigor (evals, observability) once value is proven

The agent runtime is the engine. You're tuning it for a specific job. Done well, this is fast and produces real value. Done poorly, it produces a brittle demo.

The difference is mostly in Steps 2 and 3 (examples and system prompt) and in patient iteration in Step 5. Most of the value compresses into "good examples + good prompt + many iterations."

Now go build something.
