Build an AI Agent on Llama: The 2026 Local Stack
Build an AI agent on Llama: hardware sizing, model picks, the agent loop in working Python, tool design, costs, and when self-hosted Llama beats the frontier APIs.
Updated May 2026. Rewritten for the Llama 4 generation, working agent-loop code, real hardware costs, and the narrow cases where a Llama agent earns its place over Claude or GPT.
Building an AI agent on Llama means running a tool-calling loop where Llama is the model picking the next step. In 2026 far fewer teams do this than build on Claude or GPT, and that is how it should be. For most use cases, a frontier-API agent ships faster, costs less at low volume, and has fewer sharp edges. The cases where a Llama agent wins are real but specific: workloads where data cannot leave your network, high-volume inference where API costs dominate, and research or product work where you need to ship the model itself with the agent.
We are Osher Digital, an AI and automation consultancy based in Brisbane. We have shipped Llama-based agents into a healthcare client (patient records, data sovereignty drove the local-only stance), a recruitment client (high-volume CV screening where API costs were unsustainable), and several internal tools. This guide is the production recipe with the parts we wish we had known at the start. If you have already read our companion guide to running Llama locally, this is the agent-shaped follow-up; for the broader picture see our build an AI agent in Python guide.
This guide is for ML engineers, platform engineers, and tech leads picking a stack for a Llama agent. It assumes you know what an AI agent is in principle (loop, tools, observation) and want the production specifics.
When an AI Agent with Llama Actually Makes Sense
We are starting with the elimination criteria because most projects that consider a Llama agent should not build one. This is the decision aid we run with clients.
Use a frontier API (Claude Sonnet 4.5 or GPT-4.1) when: token volume is under about 30 million per month, the data is not subject to residency or VPC constraints, the team does not have an MLOps function, and latency budgets allow 200 to 800 ms per token. This covers 80 per cent of agentic AI projects.
Build an AI agent on Llama when: data residency requires it (regulated health, finance, government), token volume exceeds 30 to 50 million per month and API costs become the largest line item, you have an MLOps capability or willingness to build one, or the model itself needs to be embedded in the product (on-device agents, edge deployment).
If you are building a Llama agent for the first reason (data residency), check our Llama local hosting guide for the host-and-deploy bit before reading on. If you are building it for the second reason (cost at volume), the break-even on Claude Sonnet 4.5 against a self-hosted Llama 3.3 70B setup is roughly 35 million input tokens or 8 million output tokens per month on our measurements.
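The decision aid above can be sketched as a small function. This is one possible reading of the criteria, not a formal rule: the `Workload` fields, the `recommend_stack` name, and the way the conditions are combined (residency and embedding treated as hard constraints, volume only counting when MLOps capacity exists) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a project, used only for this sketch."""
    tokens_per_month: int
    residency_required: bool = False   # regulated health, finance, government
    has_mlops: bool = False            # existing capability or willingness to build one
    embedded_model: bool = False       # on-device or edge deployment


def recommend_stack(w: Workload, volume_threshold: int = 30_000_000) -> str:
    """Return "llama" or "frontier_api" using the guide's elimination criteria."""
    # Hard constraints: the data cannot leave, or the model ships in the product.
    if w.residency_required or w.embedded_model:
        return "llama"
    # Cost at volume only pays off if someone can run the infrastructure.
    if w.tokens_per_month > volume_threshold and w.has_mlops:
        return "llama"
    return "frontier_api"


print(recommend_stack(Workload(tokens_per_month=5_000_000)))
print(recommend_stack(Workload(tokens_per_month=80_000_000, has_mlops=True)))
```

The threshold defaults to the 30 million tokens per month figure quoted above; move it towards 50 million if you want the more conservative end of the range.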
Picking the Right Llama Model for an Agent in 2026
Three tiers matter for agentic use in the Llama 4 era: Maverick (the flagship reasoner, ~400B total parameters with MoE activation), Scout (the mid-size workhorse, a 109B-parameter MoE), and the smaller 8B and 70B models from the Llama 3 generation, which are still widely deployed.
For an agent (which means tool-calling), the smallest model we recommend is Llama 3.3 70B Instruct or Llama 4 Scout. Below that, tool-selection accuracy drops sharply: in our internal evals, an 8B model picks the wrong tool roughly 22 per cent of the time, a 70B picks wrong 7 per cent, Scout drops that to about 4 per cent, and Maverick lands close to Claude Sonnet 4.5 at around 3 per cent. The 8B is fine for classification and extraction, not for agents.
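To see why a per-call error rate that looks tolerable becomes a problem in a loop, here is a rough worked calculation. It assumes each tool pick fails independently and that a run needs five tool calls; both are simplifying assumptions rather than measurements, and the function name is illustrative.

```python
def clean_run_probability(per_call_error: float, tool_calls: int) -> float:
    """Chance that every tool pick in a run is correct, assuming independent errors."""
    return (1.0 - per_call_error) ** tool_calls


# Wrong-tool rates quoted above from the internal evals
ERROR_RATES = {"8B": 0.22, "70B": 0.07, "Scout": 0.04, "Maverick": 0.03}

for name, rate in ERROR_RATES.items():
    p = clean_run_probability(rate, tool_calls=5)
    print(f"{name}: {p:.0%} of five-call runs contain no wrong tool pick")
```

Under these assumptions the 8B model completes fewer than one in three five-call runs without a wrong pick, while Scout and Maverick stay above 80 per cent.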
Recommended defaults: Llama 4 Scout for new builds because it fits comfortably on a single H100 with AWQ 4-bit quantisation and the tool-calling quality is production-grade. Llama 3.3 70B when you need a smaller VRAM footprint or established deployment stacks. Llama 4 Maverick only when reasoning quality on long-horizon tasks is the priority and you can afford the hardware.
Hardware Sizing for an AI Agent on Llama
The rule of thumb after a year of deployments: roughly 0.5 to 0.6 GB of VRAM per billion active parameters at 4-bit AWQ quantisation, plus a 20 to 30 per cent overhead for KV cache during agent loops (which carry long contexts).
Llama 3.3 70B at 4-bit fits on a single 80 GB H100 with room for a 32k context. Llama 4 Scout (109B MoE, but ~17B active per token) fits on a single H100 80 GB with breathing room. Llama 4 Maverick needs a dual-H100 or single H200 setup.
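The rule of thumb can be turned into a quick estimate. This sketch applies it to the dense 70B model, where the parameter count is unambiguous; the function names are illustrative, and treating the KV-cache overhead as a multiplier on the weight footprint is an assumption about how the rule is meant to be applied.

```python
def estimate_vram_gb(params_billion, gb_per_billion=(0.5, 0.6), kv_overhead=(0.20, 0.30)):
    """Return a (low, high) VRAM estimate in GB for 4-bit AWQ weights plus KV cache."""
    low = params_billion * gb_per_billion[0] * (1 + kv_overhead[0])
    high = params_billion * gb_per_billion[1] * (1 + kv_overhead[1])
    return low, high


def fits(params_billion, gpu_vram_gb):
    """True if even the pessimistic end of the estimate fits on the given VRAM."""
    _, high = estimate_vram_gb(params_billion)
    return high <= gpu_vram_gb


low, high = estimate_vram_gb(70)
print(f"70B at 4-bit: {low:.1f} to {high:.1f} GB")
print("Fits on one 80 GB H100:", fits(70, 80))
```

For 70B this gives roughly 42 to 55 GB, which is consistent with the single-H100 claim above and leaves room for a longer context.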
Throughput numbers from our production environments on vLLM with AWQ 4-bit. Dual RTX 4090 workstation running 70B at 4-bit: 800 to 1,200 tokens/sec sustained. Single H100 PCIe running Scout at 4-bit: 2,500 to 3,500 tokens/sec sustained. Dual H100 SXM running Maverick at 4-bit: 1,800 to 2,400 tokens/sec sustained (lower than Scout because of the bigger model). H200 single GPU is competitive with dual H100 SXM for Maverick at lower total cost.
For Australian deployments, AWS Sydney has H100 capacity in ap-southeast-2; Vultr Sydney has H100 single instances; RunPod and Vast.ai are cheaper if data residency does not require AU-based hosting. For local dev, a workstation with two RTX 4090s and 128 GB of system RAM (around 12,000 to 16,000 AUD) will run 70B agents at usable speed.
The Agent Loop for an AI Agent with Llama
Here is the minimal working loop. This runs against a vLLM server exposing Llama 4 Scout via the OpenAI-compatible API. The structure is identical for any tool-calling-capable Llama model.
```python
import json

from openai import OpenAI

# Point the stock OpenAI client at the vLLM server's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://llama.internal:8000/v1", api_key="not-used")

# Tool catalogue passed to the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_records",
            "description": "Search internal records by customer ID or surname.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer", "default": 10},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "send_summary_email",
            "description": "Send a summary email to a named recipient.",
            "parameters": {
                "type": "object",
                "properties": {
                    "recipient": {"type": "string"},
                    "subject": {"type": "string"},
                    "body": {"type": "string"},
                },
                "required": ["recipient", "subject", "body"],
            },
        },
    },
]


# Local tool implementations
def search_records(query, limit=10):
    # Replace with a real database query
    return [{"id": 1, "name": "Sample", "match": query}]


def send_summary_email(recipient, subject, body):
    # Replace with a real mailer
    return {"status": "sent", "to": recipient}


TOOL_FUNCS = {"search_records": search_records, "send_summary_email": send_summary_email}


def run_agent(goal, max_turns=8, token_budget=15000):
    messages = [
        {"role": "system", "content": "You are an internal records assistant."},
        {"role": "user", "content": goal},
    ]
    total_tokens = 0
    for turn in range(max_turns):
        resp = client.chat.completions.create(
            model="llama-4-scout-awq",
            messages=messages,
            tools=tools,
            tool_choice="auto",
            temperature=0.2,
        )
        # Stop once cumulative usage across all turns passes the budget
        total_tokens += resp.usage.total_tokens
        if total_tokens > token_budget:
            return {"status": "budget_exceeded", "messages": messages}
        msg = resp.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))
        # No tool calls means the model has produced its final answer
        if not msg.tool_calls:
            return {"status": "complete", "result": msg.content, "messages": messages}
        for call in msg.tool_calls:
            fn = TOOL_FUNCS.get(call.function.name)
            if fn is None:
                # Invented tool name: report it as data so the model can self-correct
                result = {
                    "status": "error",
                    "error": f"Tool '{call.function.name}' not found.",
                    "available_tools": sorted(TOOL_FUNCS),
                }
            else:
                try:
                    result = fn(**json.loads(call.function.arguments))
                except (json.JSONDecodeError, TypeError) as exc:
                    # Malformed JSON or arguments that do not match the signature
                    result = {"status": "error", "error": f"Invalid arguments: {exc}"}
            # Every tool call needs a matching tool message with the same id
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return {"status": "max_turns_reached", "messages": messages}
```
That is a working agent loop in roughly 60 lines. The non-obvious parts: the max_turns cap stops runaway loops; the token_budget cap stops runaway costs (still relevant at self-hosted scale because long contexts get slow); and the explicit tool-result append in the OpenAI message shape is required for Llama to keep its tool-calling state coherent.
Tool Design for an AI Agent with Llama
The single biggest determinant of agent quality is tool description quality. Llama, especially 70B and below, leans more heavily on the tool description than Claude or GPT do. In our evals, the same tool catalogue rewritten with cleaner descriptions lifted Llama 4 Scout’s task completion rate from 79 to 94 per cent. The same change moved Claude Sonnet 4.5 from 92 to 95 per cent. Llama gains more from the work.
Three rules that matter.
First, tool names are part of the contract. Use snake_case verbs with clear semantics: search_records, send_summary_email, create_calendar_event. Avoid cute names that the model has to interpret.
Second, descriptions answer “when to use this tool” not “what this tool is”. The model already sees the schema. The description should say: “Use this when the user is asking about a specific customer’s order history. Do not use this for product catalogue queries (use search_products instead).”
Third, return errors as data, not as exceptions. If search_records finds nothing, return {"status": "not_found", "query": query}, not an empty list. The model can reason about an explicit error; it tends to skip over an empty result.
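The three rules can be seen together in one small tool definition. This is a sketch only: the in-memory `RECORDS` store, the customer data in it, and the exact wording of the description are made up for illustration.

```python
# Hypothetical in-memory store standing in for a real database
RECORDS = {
    "C-1001": {"id": "C-1001", "surname": "Nguyen"},
    "C-1002": {"id": "C-1002", "surname": "Patel"},
}

SEARCH_RECORDS_TOOL = {
    "type": "function",
    "function": {
        # Rule 1: a snake_case verb with clear semantics
        "name": "search_records",
        # Rule 2: say when to use the tool, and when not to
        "description": (
            "Use this when the user asks about a specific customer, identified "
            "by customer ID or surname. Do not use this for product catalogue "
            "queries (use search_products instead)."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer", "default": 10},
            },
            "required": ["query"],
        },
    },
}


def search_records(query, limit=10):
    """Exact match on customer ID or surname, case-insensitive."""
    needle = query.lower()
    matches = [
        r for r in RECORDS.values()
        if needle in (r["id"].lower(), r["surname"].lower())
    ]
    # Rule 3: an explicit not_found status instead of an empty list
    if not matches:
        return {"status": "not_found", "query": query}
    return {"status": "ok", "results": matches[:limit]}


print(search_records("nguyen"))
print(search_records("Smith"))
```

Returning a status on the success path as well keeps the shape of the result uniform, so the model sees the same keys whether or not the search found anything.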
Serving an AI Agent with Llama in Production
For production, we deploy Llama agents behind vLLM with AWQ 4-bit quantisation, an OpenAI-compatible router, and continuous batching. The OpenAI-compatible shape means the agent code is the same whether it points at vLLM or a frontier API, which is operationally important during fallback events.
Three patterns that work.
Two-tier model serving. Run a small, fast model (Llama 3.3 70B AWQ on RTX 4090s) for the agent’s bounded sub-tasks, and a larger model (Llama 4 Scout or Maverick on H100s) for the orchestration loop. Same OpenAI-compatible interface, different endpoints. Cuts inference cost roughly in half while keeping orchestration quality.
Frontier-API fallback. Wire a fallback to Claude Sonnet 4.5 when the self-hosted endpoint is degraded. The fallback should be visible in logs and rate-limited so a bug does not blow your API budget overnight.
Observability layer. Treat every agent run like a distributed trace. Log every model call (prompt hash, completion length, tokens, latency), every tool invocation, and the final outcome. We default to OpenTelemetry traces into Grafana Tempo or Sentry for application-side logging; for ML-specific observability, LangSmith and Phoenix both work even when your stack is not LangChain.
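A rate-limited, logged fallback might look like the sketch below. The `FallbackRouter` class and its parameters are hypothetical; `primary` and `fallback` are any callables that take a message list and return a completion, so the same wrapper works whichever clients sit behind them. The hourly cap of 50 is a placeholder, not a recommendation.

```python
import logging
import time

logger = logging.getLogger("agent.fallback")


class FallbackRouter:
    """Try the self-hosted endpoint first; fall back at most N times per hour."""

    def __init__(self, primary, fallback, max_fallbacks_per_hour=50, clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.max_fallbacks_per_hour = max_fallbacks_per_hour
        self.clock = clock          # injectable so the limiter can be tested
        self._fallback_times = []   # timestamps of recent fallback calls

    def complete(self, messages):
        try:
            return self.primary(messages)
        except Exception as exc:
            now = self.clock()
            # Keep only fallbacks from the last hour (sliding window)
            self._fallback_times = [t for t in self._fallback_times if now - t < 3600]
            if len(self._fallback_times) >= self.max_fallbacks_per_hour:
                logger.error("Fallback budget exhausted; surfacing primary error")
                raise  # re-raise the primary failure rather than spend more
            self._fallback_times.append(now)
            logger.warning(
                "Primary endpoint failed (%s); using fallback (%d/%d this hour)",
                exc, len(self._fallback_times), self.max_fallbacks_per_hour,
            )
            return self.fallback(messages)
```

Every fallback emits a warning with the running count, which keeps the event visible in logs, and once the cap is hit the original error propagates instead of silently burning API budget.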
Real AUD Costs for an AI Agent with Llama
Numbers from three live deployments running through Q1 and Q2 2026, all in ap-southeast-2 or Vultr Sydney.
Healthcare client running Llama 3.3 70B for document triage at ~12 million tokens/day. Single H100 reserved instance, around 5,800 AUD/month. Throughput supports the workload with 40 per cent headroom. The Claude Sonnet 4.5 equivalent would be roughly 14,000 AUD/month at this volume. Break-even reached after about month two when API parity costs caught up with the reserved hardware.
Recruitment client running Llama 4 Scout for CV screening, ~6 million tokens/day. Spot H100 on Vast.ai with auto-restart, around 1,800 USD/month (~2,700 AUD). The Claude alternative for this workload is around 5,200 AUD/month. Worth it once we accepted the spot-restart latency on overnight rebalances.
Internal research agent running Llama 4 Scout for code review, low volume (~1 million tokens/day). Vultr Sydney H100 at about 4,200 AUD/month. The Claude equivalent is around 600 AUD/month. We run this on Llama because the codebase cannot leave our network. The cost premium is the residency tax.
The pattern: Llama agents pay off above roughly 5 to 8 million tokens per day, or at any volume when data residency is a hard constraint.
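Normalising the three deployments to a cost per million tokens makes the pattern easier to see. The figures below are the ones quoted above; the 30-day month is an assumption, and the variable and function names are illustrative.

```python
DAYS_PER_MONTH = 30  # assumption for converting daily volume to monthly

# (name, million tokens per day, Llama AUD/month, Claude-equivalent AUD/month)
DEPLOYMENTS = [
    ("healthcare", 12, 5800, 14000),
    ("recruitment", 6, 2700, 5200),
    ("internal research", 1, 4200, 600),
]


def aud_per_million(monthly_aud, million_tokens_per_day):
    """Monthly spend divided by monthly volume in millions of tokens."""
    return monthly_aud / (million_tokens_per_day * DAYS_PER_MONTH)


for name, volume, llama_aud, claude_aud in DEPLOYMENTS:
    llama = aud_per_million(llama_aud, volume)
    claude = aud_per_million(claude_aud, volume)
    print(f"{name}: Llama {llama:.2f} vs Claude {claude:.2f} AUD per million tokens")
```

On these numbers the two high-volume deployments land around 15 to 16 AUD per million tokens on Llama against 29 to 39 on Claude, while the low-volume research agent pays about 140 against 20, which is the residency premium in per-token terms.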
Things That Break for an AI Agent with Llama
The production gotchas we have hit.
JSON-mode regressions across model versions. Llama 3.3 70B and Llama 4 Scout differ subtly in how strictly they obey response_format. Test schemas against the exact model you deploy before believing the docs.
Tool-call hallucination. Llama models will sometimes invent a tool name that is not in the catalogue. Validate every tool name on the loop runner side and return a structured error (“Tool ‘foo’ not found. Available tools: […]”). The model usually self-corrects on the next turn.
KV cache pressure under long agent contexts. A 30-turn agent on Maverick can exhaust the KV cache and cause vLLM to start preempting requests. Tune --max-model-len against your real context distribution and provision 20 to 30 per cent of headroom.
Context degradation past about 60,000 tokens. Despite the published context length, agent quality drops noticeably above 60k. Compact the conversation history (drop irrelevant tool results, summarise old turns) once you cross that line.
Spot capacity reclaims at the worst time. If you run on Vast.ai or RunPod spot, build a queue and a hot-standby pattern so a reclaim does not bring down the agent.
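For the context-degradation item, one simple form of compaction is to blank out old tool results while leaving the message sequence intact. This is a minimal sketch under stated assumptions: the four-characters-per-token estimate is crude (use the model's tokenizer for real counts), and `compact_history`, `keep_recent`, and the placeholder text are illustrative names, not part of any library.

```python
def estimate_tokens(messages):
    """Crude token estimate: about four characters per token (an assumption)."""
    return sum(len(str(m.get("content") or "")) for m in messages) // 4


def compact_history(messages, max_tokens=60_000, keep_recent=6,
                    placeholder="[older tool result removed]"):
    """Replace the content of old tool messages once the history gets too long.

    Messages are kept in place, so every tool_call_id still has its matching
    tool message; only the bulky payload is dropped.
    """
    if estimate_tokens(messages) <= max_tokens:
        return messages
    cutoff = max(len(messages) - keep_recent, 0)
    compacted = []
    for index, message in enumerate(messages):
        if index < cutoff and message.get("role") == "tool":
            compacted.append({**message, "content": placeholder})
        else:
            compacted.append(message)
    return compacted
```

Calling this on the message list before each model call keeps the most recent turns verbatim. Summarising old turns with a cheaper model is the natural next step, but it adds a model call and is left out here.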
When Not to Build an AI Agent with Llama
The cases where we steer clients away.
Frontier reasoning matters more than cost or residency. For very long-horizon planning or hard multi-step reasoning, Claude Opus 4.5 still beats Maverick by enough that the operational complexity of self-hosting is not worth it. Llama 5 may close this; we will revisit.
The team has no MLOps capacity and no appetite to build one. A failed self-hosted Llama deployment is the most expensive AI mistake you can make: hardware sunk cost, opportunity cost on the project, and a slow recovery to a frontier API anyway. Pick the API.
The use case is single-shot generation, not agentic. If you do not have tools in the loop, you are paying agent overhead for nothing. Drop to a single-call extraction with Pydantic and move on.
Frequently Asked Questions
Can Llama do tool-calling well enough for agents?
Yes, from Llama 3.3 70B upwards. Tool-selection accuracy is around 93 per cent on Llama 3.3 70B, 96 per cent on Llama 4 Scout, and roughly 97 per cent on Maverick in our internal evals, against 97 per cent for Claude Sonnet 4.5 and 95 per cent for GPT-4.1. The smaller Llama 3.x 8B is not reliable enough for agentic work.
How much does it cost to run an AI agent on Llama?
Around 4,000 to 6,500 AUD/month for a single reserved H100 in Sydney supporting roughly 10 to 12 million tokens/day on Scout AWQ. Cheaper on Vast.ai spot at 1,500 to 2,500 USD/month (~2,300 to 3,800 AUD) if you can tolerate occasional restarts. Workstation builds with dual RTX 4090s land at 12,000 to 16,000 AUD upfront and run smaller agents (70B AWQ) at usable speed.
Which Llama model should I use for an agent?
For new builds in 2026, Llama 4 Scout AWQ on a single H100 is the default. Llama 3.3 70B AWQ if you have established deployment infrastructure or constrained VRAM. Llama 4 Maverick when long-horizon reasoning is the priority and dual-H100 or H200 hardware is available. Avoid sub-70B models for agentic work; tool selection becomes unreliable.
What is the best stack for serving a Llama agent?
vLLM with AWQ 4-bit quantisation and continuous batching is the production default in 2026. Expose the OpenAI-compatible API so agent code is portable to frontier APIs. Add observability via OpenTelemetry to Grafana Tempo or Sentry. For experimental work, Ollama and llama.cpp are easier to spin up but deliver 5 to 10x less throughput than vLLM at production load.
Can I fine-tune Llama for better agent performance?
Sometimes worth it, often not. Most agent failures are about tool description quality, not model weights. Try better tool descriptions, better system prompts, and few-shot examples first. Only fine-tune when you have 5,000+ labelled examples and you have already eliminated the easier wins. LoRA adapters with Unsloth or Axolotl are the cost-effective fine-tuning path; expect 500 to 4,000 AUD in compute per training run.
How does an AI agent with Llama compare to Claude or GPT agents?
Claude Sonnet 4.5 and GPT-4.1 ship faster (no infrastructure work), have slightly better tool-calling at the top end, and cost less for low to moderate volume (under 30 million tokens/month). Llama agents win on data residency, cost at high volume, and the ability to embed the model in your product. Most teams should default to a frontier API and switch to Llama only when one of the three Llama-winning conditions clearly applies.
Do I need LangChain or the Claude Agent SDK to build a Llama agent?
No. The 60-line loop in this guide is the production pattern. Frameworks help with retries, persistence, and observability but the core loop is small enough to own. LangGraph is useful for very complex multi-agent setups. Pydantic AI is a clean middle ground if you want typed tools without a full framework. Our default for Llama agents is custom loop plus FastAPI plus Pydantic, no framework.
How do I keep a Llama agent from running up the bill?
Self-hosted Llama bills are predictable (the reserved hardware does not care how many tokens you push through it). The cost discipline is on the spot side: budget caps, max-turn caps, context compaction past 30k tokens, and a daily cost alarm tied to spot-instance hours. We have a hard rule: the token budget cap is visible in every agent log, and the loop runner refuses to continue once the budget is spent, even if the model tries to keep going.
Building an AI agent on Llama is the right answer for a smaller set of projects than the hype suggests, but for those projects it is the only answer. If you are weighing self-hosted Llama against a frontier API for a specific workload, our team runs a fixed-fee assessment that lands on a defensible recommendation. Book a call or get in touch through the contact page with your token volume, residency constraints, and target latency, and we will tell you which way the numbers point.