How to Build an AI Agent in Python: SDKs, Tools, and Patterns
How to build production AI agents in Python: stack choice, working ReAct loop, error handling, and cost data from real client deployments.
Updated May 2026. Rewritten from the ground up to reflect how AI agents are actually built in Python in 2026. The previous version covered classical machine learning libraries that are not what most teams reach for when building agents today.
AI agents are running in production right now, processing documents, handling customer queries, and automating workflows inside real businesses. We know because we have built them. This guide is how we approach building them in Python.
At Osher Digital, we are a Brisbane-based AI consultancy that has shipped agents across healthcare, recruitment, finance, and professional services. Some of those run on n8n. Some run on raw Python. This post covers the Python path: when it is the right choice, what stack to use in 2026, and the patterns that hold up under production traffic.
This guide is aimed at Python developers building their first production AI agent and at technical leads weighing whether Python is the right runtime for an AI agent development project. If you are already evaluating Claude specifically, our Claude AI agent guide covers the same ground from that angle.
What an AI Agent Actually Is in 2026
The textbook definition of an AI agent talks about perceiving an environment, reasoning, and taking action. That description is technically correct but not very useful. Here is the practical version.
A chatbot tells a customer about your return policy. An agent reads the customer’s order history, checks whether the item is eligible for return, processes the refund through your payment system, updates the CRM, and sends the confirmation email. It reasons through the problem and takes action across multiple systems.
In 2026, when people say “AI agent” they almost always mean a system built around a large language model that can call tools (also called functions). The LLM does the reasoning. Your code defines the tools and runs the loop. Frameworks help with the plumbing. That is the entire mental model.
Anything else you read about agents in Python that does not start with this mental model is probably out of date.
Why Python for AI Agents
You can build agents in JavaScript, Go, or Rust. Python is the most common choice in 2026 for three reasons.
Library maturity. The OpenAI, Anthropic, and Google SDKs all have first-class Python support, and the agent orchestration frameworks (Pydantic AI, LangGraph, the OpenAI Agents SDK, Anthropic’s agent SDK) all started in Python. JavaScript and TypeScript libraries exist and are improving, but Python still has more options and more documentation.
Integration surface. Python is what most data, machine learning, and DevOps tooling already speaks. If your agent needs to talk to a Salesforce export, a Postgres database, a vector store, or an internal data pipeline, the Python ecosystem already has a tested library for it.
Iteration speed. Agents are non-deterministic. You will rewrite prompts, swap models, and tune tool definitions dozens of times before you ship. Python’s REPL, Jupyter, and dynamic typing make that loop faster than the equivalent in compiled languages.
That said, Python is not always the right choice. If your agent is one part of a larger TypeScript application, building in TypeScript avoids a service boundary. If you are deploying to a serverless environment with strict cold-start budgets, Go or Rust may serve better. We use Python for most agent work and reach for other languages when there is a specific reason to.
The 2026 Python AI Agent Stack
Here is the stack we reach for when we start a new Python agent project. Treat this as a reasonable starting point, not a doctrine.
- LLM SDK: anthropic for Claude, openai for GPT and the OpenAI Responses API, google-genai for Gemini. We default to Claude Sonnet for most work.
- Structured outputs: pydantic for response schemas. The major SDKs all support Pydantic models directly.
- Agent orchestration: pydantic-ai when we want a clean, opinionated framework. langgraph when we need explicit graph-based control over multi-agent workflows. The official OpenAI Agents SDK or Anthropic's agent SDK if we are committed to a single provider.
- Tool integration: the mcp Python SDK for Model Context Protocol. MCP has become the standard way for agents to discover and use external tools, so if you are integrating with several systems it pays to learn it.
- HTTP and serving: fastapi if you are exposing the agent as an API.
- Background work: celery or arq for queueing long-running agent tasks, apscheduler for simple scheduled jobs.
- Observability: logfire (from the Pydantic team) or Langfuse for agent-specific tracing. Standard OpenTelemetry where you need it.
- Testing: pytest with snapshot testing for agent outputs, plus an evaluation framework like inspect-ai or promptfoo for harder evals.
What you will notice is missing: TensorFlow, PyTorch, scikit-learn, NLTK. Those are the tools of classical machine learning and they are not central to building LLM-based agents. You may use them downstream (a custom embeddings model, a classifier alongside the agent) but they are not where you start.
Setting Up Your Environment
We use uv for Python project management. It is fast, handles virtual environments well, and reads the same pyproject.toml as everything else.
uv init my-agent
cd my-agent
uv add anthropic openai pydantic pydantic-ai mcp fastapi
uv add --dev pytest ruff
Set your API keys as environment variables. Never commit them. We use direnv with a .envrc file that sources from a password manager, but a plain .env file with python-dotenv works for development.
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
That is the entire setup. Modern Python tooling has made the boring parts boring, which is exactly what you want.
Tool Calling: The Foundation of Agent Behaviour
Tool calling (also called function calling) is what turns a language model into an agent. You define functions the model can call. When you send a message along with these tool definitions, the model decides whether to call a tool, picks the right one, and provides the arguments. Your code executes the tool and returns the result. The model continues reasoning until it is done.
Here is the simplest possible example using the Anthropic SDK and Pydantic for tool definitions.
import anthropic
from pydantic import BaseModel

client = anthropic.Anthropic()

class LookupCustomer(BaseModel):
    """Look up a customer record by email address."""
    email: str

class CreateSupportTicket(BaseModel):
    """Create a new support ticket in the helpdesk system."""
    customer_email: str
    subject: str
    priority: str  # low, medium, high, urgent
    description: str

tools = [
    {
        "name": "lookup_customer",
        "description": LookupCustomer.__doc__,
        "input_schema": LookupCustomer.model_json_schema(),
    },
    {
        "name": "create_support_ticket",
        "description": CreateSupportTicket.__doc__,
        "input_schema": CreateSupportTicket.model_json_schema(),
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{
        "role": "user",
        "content": "Customer [email protected] says her last order arrived damaged. She is upset. Look up her account and open a high-priority ticket.",
    }],
)
Claude responds with a tool_use block, choosing lookup_customer first and providing the email. You execute the function, return the result, and Claude continues, likely calling create_support_ticket with appropriate detail and high priority given the customer’s sentiment.
The model decides the sequence. You define the tools. Pydantic gives you type-safe argument validation for free.
The ReAct Loop in Python
ReAct (Reason + Act) is the workhorse pattern for single-purpose agents. The agent observes state, reasons about what to do next, takes an action, observes the result, and repeats until the task is complete.
from typing import Callable

# client = anthropic.Anthropic() as in the previous example

def react_agent(
    task: str,
    tools: list[dict],
    tool_handlers: dict[str, Callable],
    system_prompt: str = "",
    max_steps: int = 10,
) -> str:
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            # A single response can contain several tool_use blocks,
            # and each one needs a matching tool_result.
            results = []
            for block in response.content:
                if block.type == "tool_use":
                    handler = tool_handlers[block.name]
                    result = handler(**block.input)
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
            messages.append({"role": "user", "content": results})
        else:
            return next(
                b.text for b in response.content if b.type == "text"
            )

    raise RuntimeError(f"Agent did not finish within {max_steps} steps")
That is the core loop. About fifty lines of code with no framework dependency. We use this pattern for most single-purpose agents because it is simple, predictable, and easy to debug. When something goes wrong in production, you can trace exactly which tool was called with what arguments and what came back.
If you would rather use a framework, Pydantic AI gives you the same loop with cleaner ergonomics around type-safe tool definitions and dependency injection. The OpenAI Agents SDK is similar in shape and tied to the OpenAI Responses API. LangGraph adds explicit graph-based control flow when your agent has branching logic that does not map cleanly to a linear loop.
Practical Example: A Document Processing Agent
Document processing is one of the most common use cases we build for Australian businesses. Here is a simplified version of what we deploy.
from pathlib import Path

import anthropic
from pydantic import BaseModel, Field

client = anthropic.Anthropic()

class MedicalRecord(BaseModel):
    patient_name: str
    date_of_birth: str = Field(description="DD/MM/YYYY format")
    medicare_number: str
    diagnosis: str
    treatment_plan: str
    confidence_score: float = Field(ge=0.0, le=1.0)

def extract_text_from_pdf(file_path: str) -> str:
    """Extract raw text content from a PDF document."""
    # Use pypdf, pdfplumber, or unstructured here
    ...

def classify_document(text_content: str) -> str:
    """Classify document type. Returns: invoice, contract, medical_record, resume, correspondence, other."""
    ...

def save_to_database(record: MedicalRecord) -> str:
    """Save a validated medical record to the patient database."""
    ...

DOCUMENT_TOOLS = [
    {
        "name": "extract_text_from_pdf",
        "description": extract_text_from_pdf.__doc__,
        "input_schema": {
            "type": "object",
            "properties": {"file_path": {"type": "string"}},
            "required": ["file_path"],
        },
    },
    # ...other tools omitted for brevity
]
SYSTEM_PROMPT = """You are a document processing agent for an Australian healthcare provider.
Your job is to:
1. Extract text from the PDF
2. Classify the document type
3. For medical records, extract structured fields and validate them
4. Save validated records to the database
Required medical record fields: patient_name, date_of_birth (DD/MM/YYYY),
medicare_number, diagnosis, treatment_plan.
If your confidence in any field is below 0.85, do not save. Flag the document
for human review by responding with the issues you found.
Always use Australian date formats and Medicare number conventions."""
# Run the agent
react_agent(
    task=f"Process this document: {Path('inbox/2026-05-07-record.pdf').absolute()}",
    tools=DOCUMENT_TOOLS,
    tool_handlers={
        "extract_text_from_pdf": extract_text_from_pdf,
        "classify_document": classify_document,
        "save_to_database": save_to_database,
    },
    system_prompt=SYSTEM_PROMPT,
)
The agent handles the full pipeline: read, classify, extract, validate, store. The system prompt encodes the business rules including the confidence threshold that triggers human review and the Australian date format requirement. In production we add audit logging, retries, idempotency keys, and metrics, but the core shape stays the same.
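Belt and braces: the 0.85 threshold lives in the system prompt, but it is worth enforcing in code as well so a confidently wrong extraction never slips through. A minimal sketch using an abbreviated version of the MedicalRecord model above (the route names are illustrative, not a real API):

```python
from pydantic import BaseModel, Field

class MedicalRecord(BaseModel):
    # Abbreviated version of the model from the example above
    patient_name: str
    confidence_score: float = Field(ge=0.0, le=1.0)

CONFIDENCE_THRESHOLD = 0.85

def route_record(record: MedicalRecord) -> str:
    """Enforce the human-review threshold in code, not just in the prompt."""
    if record.confidence_score < CONFIDENCE_THRESHOLD:
        return "human_review"  # e.g. push to a review queue
    return "saved"             # e.g. call save_to_database(record)
```

The prompt tells the model what to do; this gate guarantees it.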
We deployed a similar architecture for a healthcare client where staff used to spend hours manually sorting and entering patient documents. The agent now handles the routine cases and flags the ambiguous ones for review. Same staff, much higher throughput.
Integrating With Business Systems Using MCP
Real agents need to talk to real systems: CRMs, databases, communication platforms, internal APIs. In 2026 the standard way to do this is the Model Context Protocol (MCP). Instead of writing custom integration code for every system, you connect to an MCP server that exposes that system’s capabilities through a consistent interface.
The Python mcp SDK lets you both consume MCP servers (so your agent can use them as tools) and expose your own tools as MCP servers (so other agents can call them). Both major LLM providers have native MCP support, so the same agent code works whether you are using Claude or GPT.
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Connect to an MCP server (this could be a Salesforce MCP, a Postgres MCP, etc.)
server_params = StdioServerParameters(
    command="uv",
    args=["run", "--directory", "/path/to/salesforce-mcp", "salesforce-mcp"],
    env={"SALESFORCE_TOKEN": "..."},
)

async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        tools = await session.list_tools()
        # The MCP tools can now be passed to Claude or OpenAI as agent tools
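Handing those tools to the Anthropic API means converting them into its tools format. A small sketch, assuming each item from list_tools() exposes name, description, and inputSchema attributes, which is the shape of the mcp SDK's Tool model at the time of writing:

```python
def mcp_tools_to_anthropic(mcp_tools) -> list[dict]:
    """Convert MCP tool definitions to the Anthropic tools format.

    Assumes each item exposes .name, .description, and .inputSchema
    (the mcp SDK's Tool model shape).
    """
    return [
        {
            "name": tool.name,
            "description": tool.description or "",
            "input_schema": tool.inputSchema,
        }
        for tool in mcp_tools
    ]
```

The same mapping works in reverse for the OpenAI tools format with minor renaming.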
For workflow-heavy use cases we often pair Python agents with n8n. n8n handles the workflow layer (event triggers, retries, branching) and calls the Python agent for the parts that need reasoning. This separation works well because each tool is doing what it is best at.
Error Handling and Fallback Strategies
Agents fail. APIs time out. Models hallucinate tool calls or invent fields. External systems return data the agent did not expect. Building a reliable agent means planning for failure at every step.
- Retry with exponential backoff for transient API failures. Anthropic returns 529 (overloaded) errors under heavy load, and both providers return 429s when you hit rate limits. The official SDKs include retry logic, but tune the defaults for your workload.
- Fallback models. If your primary model is rate-limited, fall back to a cheaper or alternative model so the workflow does not stall.
- Confidence thresholds. Have the agent assess and report confidence. Below the threshold, escalate to a human. Critical for any high-stakes decision such as financial processing or medical document classification.
- Structured output validation. Always validate the agent’s structured output against a Pydantic schema before writing it anywhere. Catch malformed outputs at the boundary.
- Circuit breakers. If an external system is consistently failing, stop calling it and alert your team rather than burning through API credits on retries that will not succeed.
import time

from anthropic import APIError, RateLimitError

def call_with_retry(messages, tools, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=4096,
                tools=tools,
                messages=messages,
            )
        except (RateLimitError, APIError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # exponential backoff
    raise RuntimeError("Max retries exceeded")
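Retries cover transient failures; the circuit breaker bullet covers persistent ones. A minimal sketch (the threshold and cooldown values are illustrative):

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency after `threshold` consecutive
    failures; allow one attempt through after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 60.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: let one probe attempt through
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None
```

Wrap each external tool call in allow() / record_failure() / record_success(), and alert your team when the breaker opens.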
Cost Management
Token costs add up faster than you expect on agent workloads because every tool call round-trip resends the conversation history. Here is how we keep costs under control.
Pick the right model. Not every task needs the largest model. We use Haiku for routing and classification, Sonnet for most agent reasoning, Opus or GPT-4o only when the task genuinely needs it. The cost difference is large.
Use prompt caching. If your agent runs many requests with the same system prompt and tool definitions, prompt caching cuts your input token bill significantly. Both Anthropic and OpenAI support it natively.
Be deliberate with context. Trim tool results to only the relevant fields before sending them back to the model. Summarise long conversation histories rather than including every message verbatim.
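Trimming tool results can be as simple as an allow-list of fields before the result goes back into the conversation. A sketch (the field names are illustrative):

```python
def trim_tool_result(record: dict, keep_fields: set[str]) -> dict:
    """Drop fields the model does not need before returning a tool result.

    A CRM lookup might return dozens of fields; the agent often needs
    only a handful, and every extra field is billed on every subsequent
    turn of the conversation.
    """
    return {k: v for k, v in record.items() if k in keep_fields}
```

Usage: trim_tool_result(crm_record, {"name", "email", "order_status"}) before str()-ing the result into the tool_result block.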
Use batch APIs for non-urgent work. Anthropic’s message batches API and OpenAI’s batch API both offer roughly 50% off list price for workloads that do not need real-time responses. Overnight document processing is a good candidate.
For a Sonnet-based document processing agent handling a few hundred documents per day with caching, expect to spend in the range of $50 to $200 AUD per month in API fees. High-volume customer support agents typically run $200 to $1,000 AUD per month depending on conversation length.
Testing and Evaluation
Testing agents is harder than testing traditional software because outputs are non-deterministic. Our approach has three layers.
Unit test the tools. The Python functions your agent calls are normal code. Test them like any other code. Mock external systems and verify edge cases.
Snapshot the agent. Build a set of representative inputs with expected outputs (or expected output shape). Run the agent against these regularly. Track tool selection accuracy, completion rate, and cost per run. Catch regressions when you tweak prompts or swap models.
Adversarial evals. Deliberately try to break the agent. Feed it ambiguous inputs, conflicting instructions, edge cases. Verify it fails gracefully and escalates rather than producing confident-but-wrong output.
Tools that help: inspect-ai from the UK AI Safety Institute, promptfoo for prompt-level evaluation, and standard pytest with snapshot fixtures for the simpler cases.
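Tracking tool selection accuracy across a snapshot set needs only a small scoring helper even in plain pytest. A sketch with a hypothetical eval-case shape (the field names are our own, not from any framework):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    task: str
    expected_tools: list[str]  # tools the agent should call, in order

def tool_selection_accuracy(
    cases: list[EvalCase],
    actual_calls: list[list[str]],
) -> float:
    """Fraction of eval cases where the agent called exactly the
    expected tools in the expected order."""
    if not cases:
        return 0.0
    hits = sum(
        case.expected_tools == calls
        for case, calls in zip(cases, actual_calls)
    )
    return hits / len(cases)
```

Run the agent over the snapshot set, record which tools it called per case, and assert the accuracy stays above your bar in CI.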
Production Deployment Considerations
Moving an agent from a Jupyter notebook to production involves a few things that are easy to overlook.
Logging and observability. Log every agent interaction: full message history, tool calls, tool results, final output, model choice, token counts, and latency. When something goes wrong in production (and it will), you need to be able to reconstruct exactly what happened. logfire and Langfuse both make this much easier than rolling your own.
Rate limiting and concurrency. The LLM provider rate limits will bite you faster than you expect under concurrent load. A queue-based architecture using Celery, arq, or Redis Streams gives you a clean way to throttle requests and retry transient failures.
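Before reaching for a full queue, an in-process concurrency cap already helps under asyncio. A minimal sketch, not a replacement for Celery or arq in production:

```python
import asyncio

class ThrottledRunner:
    """Cap the number of concurrent agent runs so provider rate
    limits are not exceeded under bursty load."""

    def __init__(self, max_concurrent: int = 5):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro_fn, *args):
        async with self._sem:
            return await coro_fn(*args)
```

Wrap each agent invocation in runner.run(...) and excess requests simply wait for a slot instead of triggering 429s.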
Latency budgets. An agentic loop with four tool calls will be noticeably slower than a single API call. Set expectations with stakeholders. For customer-facing flows, stream the response so the user sees output as it arrives.
Version control your prompts. Treat system prompts and tool definitions as code. Commit them, review changes, deploy through CI. A small prompt change can completely alter agent behaviour and you want that change to go through the same review process as any other code change.
Australian Data Considerations
If you are building agents for Australian businesses, data handling matters. Both Anthropic’s and OpenAI’s APIs process data on servers in the United States. For many use cases this is fine, but if you are handling sensitive personal information, health data, or government data, you need to consider the implications under the Privacy Act 1988, the Australian Privacy Principles, and any industry-specific regulations such as the My Health Records Act.
Practical approaches that work:
- Data minimisation. Send only the data the agent needs. Extract and send relevant sections rather than entire documents where possible.
- Anonymisation before API calls. Strip personally identifiable information before sending data to the LLM, then re-associate it with the original record on your side.
- Hybrid architectures for sensitive data. For clients with strict data sovereignty requirements, we run open-source models (Llama, Qwen, Mistral) locally for the parts that touch sensitive data and send only de-identified reasoning tasks to the commercial APIs.
- Data Processing Addendums. Both Anthropic and OpenAI offer DPAs for enterprise customers. If you are processing personal information, make sure one is in place.
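The anonymisation step can be sketched as placeholder substitution with a mapping kept on your side. The patterns below are illustrative only (emails and 10-digit Medicare-style numbers); production PII detection needs a much broader ruleset:

```python
import re

# Illustrative patterns only -- real PII detection needs far more coverage
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "MEDICARE": re.compile(r"\b\d{4} ?\d{5} ?\d\b"),
}

def anonymise(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with placeholders; return the mapping so results
    can be re-associated with the original record on your side."""
    mapping: dict[str, str] = {}
    for label, pattern in PII_PATTERNS.items():
        for i, match in enumerate(pattern.findall(text)):
            placeholder = f"<{label}_{i}>"
            mapping[placeholder] = match
            text = text.replace(match, placeholder)
    return text, mapping

def reassociate(text: str, mapping: dict[str, str]) -> str:
    """Restore the original values after the LLM call."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

The LLM sees placeholders; the real identifiers never leave your infrastructure.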
These are not theoretical. We work with healthcare providers and financial services firms in Australia where getting data handling wrong has real regulatory consequences.
When to Build Custom vs Use a Framework
The Python agent ecosystem has matured quickly. Should you use Pydantic AI, LangGraph, the OpenAI Agents SDK, or write the loop yourself?
Use a framework when you want to move fast, the framework’s architecture matches your use case, you need built-in features like memory or multi-agent orchestration, or your team is new to agent development and wants useful guardrails.
Build custom when you need precise control over the agent loop, retry logic, or tool selection, your use case has performance or cost requirements that a general framework adds overhead to, or you are building something that will run for years and you want to minimise dependencies on fast-moving open-source projects.
In practice, most production agents end up somewhere in between. We often start with Pydantic AI for prototyping and progressively replace framework components with custom code as we optimise. The core ReAct loop is simple enough that building it from scratch is often less work than learning and working around a framework’s opinions.
Getting Started
If you are ready to build your first Python AI agent, here is where to start.
- Define a narrow, high-value use case. Pick one repetitive, time-consuming task and automate it well. Avoid trying to build something that does everything.
- Map the tools. What systems does the agent need to read from and write to? Define your tool set before writing any agent code.
- Pick your stack. The default choice in 2026: Python, the Anthropic or OpenAI SDK, Pydantic for schemas, optionally Pydantic AI or LangGraph for orchestration. Start simple. Add complexity only when you need it.
- Build the eval set first. Decide how you will measure success before you start building. This saves enormous time later.
- Plan for failure. Every tool call can fail. Every API can time out. Build error handling from day one.
If you want help building production-grade Python AI agents for your business, book a call with our team. We have shipped agents across healthcare, recruitment, property, and professional services in Australia, and we can help you identify the right use case and architecture.
Frequently Asked Questions
What is the difference between LangChain, LangGraph, and Pydantic AI?
LangChain is the original Python framework for LLM applications. It is broad in scope and includes abstractions for many use cases beyond agents. LangGraph is a newer LangChain project focused specifically on agent workflows expressed as graphs, which gives you explicit control over branching and looping. Pydantic AI is a more recent framework from the Pydantic team focused on type-safe agent development with Pydantic models for tools and outputs. We use LangGraph for complex multi-agent workflows with explicit branching, Pydantic AI for simpler agents where type safety matters most, and the raw provider SDKs when we want minimum framework overhead.
Should I use the OpenAI Agents SDK or build my agent from scratch?
If your agent will only ever use OpenAI models and the SDK’s patterns match your use case, it is the fastest way to get to production. If you might switch providers, want to use Claude alongside GPT, or need behaviour the SDK does not support, building the loop yourself is straightforward. The ReAct loop is about 50 lines of code. Frameworks save time on the second and third agent more than the first.
How do I deploy a Python AI agent to production?
The most common pattern is to wrap the agent in a FastAPI service, deploy it to a container platform (Render, Fly.io, Railway, AWS App Runner, or Kubernetes if you have it), and put a queue in front for any long-running work. Add observability (logfire or Langfuse), structured logging, and proper secret management. For low-traffic agents you can run them as scheduled jobs without a service at all.
What does it cost to run a Python AI agent in production?
API costs depend heavily on model choice, prompt caching, and conversation length. As a rough guide, a Sonnet-based document processing agent handling a few hundred documents per day with caching typically costs $50 to $200 AUD per month. Customer support agents range from $200 to $1,000 AUD per month depending on conversation length and tool usage. Infrastructure cost is usually small relative to API cost (a $20/month VPS is enough for most agents). Talk to our team if you want help estimating costs for a specific use case.
Can I run a Python AI agent locally without OpenAI or Anthropic?
Yes. Open-source models such as Llama 3, Qwen, and Mistral run well via Ollama or vLLM, and most of them now support tool calling. The trade-off is that the largest open models are still behind GPT-4o and Claude Sonnet on agent benchmarks, especially for multi-step reasoning. For sensitive workloads where data must stay on your infrastructure, the gap is worth it. For most other use cases, the commercial APIs are cheaper than self-hosting GPU infrastructure.
How do AI agents differ from traditional Python automation scripts?
A traditional script follows a fixed sequence of steps that you write. An AI agent decides the sequence at runtime based on the input. If the steps your task needs are always the same, write a script. If the steps vary based on the content of the input, an agent is the right tool. The other big difference is that agents handle ambiguity. A script breaks when it encounters something unexpected. An agent reasons about what to do.
Can Python AI agents handle Australian data formats like Medicare numbers and ABNs?
Yes. The major LLMs handle Australian data formats well, including Medicare numbers, ABNs, ACNs, Australian phone numbers, addresses, and DD/MM/YYYY dates. We build validation layers using Pydantic that verify these formats and check digits where applicable. For healthcare and financial services clients, we also implement data handling practices that align with the Privacy Act and relevant industry regulations.
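The ABN check is a good example of where a code-level validation layer beats trusting the model: the ATO's published algorithm is to subtract 1 from the first digit, weight the 11 digits by (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19), and require the weighted sum to be divisible by 89. A validator sketch:

```python
ABN_WEIGHTS = (10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19)

def is_valid_abn(abn: str) -> bool:
    """Validate an ABN check digit using the ATO's published algorithm.

    Accepts ABNs with or without spaces.
    """
    digits = [int(c) for c in abn if c.isdigit()]
    if len(digits) != 11:
        return False
    digits[0] -= 1  # the algorithm subtracts 1 from the first digit
    return sum(d * w for d, w in zip(digits, ABN_WEIGHTS)) % 89 == 0
```

Hang this off a Pydantic field validator and malformed ABNs are rejected before they ever reach your database.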
Building Python AI agents that actually hold up in production takes more than good prompts. You need the right architecture, solid error handling, observability you can trust, and a thorough understanding of the business process you are automating. If you want to explore what a Python AI agent could do for your organisation, get in touch with our team. We are based in Brisbane and work with businesses across Australia.