Train AI on Your Own Data: Fine-Tuning, RAG, and Prompts
Train AI on your own data without overengineering: a 2026 decision tree for fine-tuning, RAG, and prompts, with real cost numbers and working code.
Updated May 2026. Rewritten for the 2026 stack: prompts vs RAG vs fine-tuning decision tree, working code with claude-sonnet-4-5 and gpt-4.1, and honest cost numbers in AUD and USD.
“How do I train AI on my own data” is the most common question we get from teams new to LLMs. It is also one of the most misunderstood. In 2026, the answer for almost everyone is not fine-tuning. It is a prompt with your data attached, or a retrieval system that fetches the relevant slice of your data at query time. Fine-tuning a model is the exception, not the default.
This guide walks through the three real ways to train AI on your own data, when each one is the right pick, what they cost, and the code we actually ship for clients. We are a small AI consultancy based in Brisbane, and these are the patterns we use on every client engagement that involves a domain-specific model output.
Related reading: our guide to building an AI agent on Claude, our LangChain primer, and our piece on data validation techniques for AI pipelines.
Three Ways to Train AI on Your Own Data
When people say “train AI on my own data” they usually mean one of three different things, and conflating them is where most projects go wrong.
- Inline context (prompts with your data attached). The model sees your data as part of the prompt at query time. No training, no infrastructure. Works for anything that fits in the context window.
- Retrieval-augmented generation (RAG). Your data lives in a vector store or document store. At query time the system fetches the relevant slice and passes it to the model alongside the question. Right answer for almost any “answer questions about my docs” use case.
- Fine-tuning. You modify the model’s weights using your data. Real, expensive, and rarely the right answer. Reserved for narrow cases: very specific output formats, very low latency requirements, or sensitive data that cannot leave a VPC.
The decision tree is almost always: try prompt with context first. If your data exceeds the context window, try RAG. Only if those two fail do you fine-tune.
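The decision tree can be written down as a few lines of code. This is only a sketch of the ordering described above; the function name, the parameters, and the idea of passing in a single "cheaper options failed" flag are illustrative assumptions, not a tool we ship.

```python
def choose_approach(data_tokens: int, context_window: int,
                    cheaper_options_failed: bool = False) -> str:
    """Return the first approach to try, following the order in the text.

    data_tokens: rough size of the data needed to answer one query.
    context_window: the model's token limit.
    cheaper_options_failed: set only after prompts and RAG were both tested
    and missed the target (hypothetical flag for this sketch).
    """
    if cheaper_options_failed:
        return "fine-tuning"
    if data_tokens <= context_window:
        return "prompt with context"
    return "rag"


print(choose_approach(50_000, 200_000))
print(choose_approach(5_000_000, 200_000))
```

The point of the flag is that fine-tuning is never reachable from data size alone; it only becomes an option once the two cheaper routes have been tried.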
Prompts with Context: The Default Way to Train AI on Your Own Data
This is the cheapest, fastest, and most accurate way to train AI on your own data for any task where the relevant data fits in the model’s context window. Claude Sonnet 4.5 has a 200K-token context. GPT-4.1 has a 1M-token context. That is roughly 150,000 words for Claude and 750,000 words for GPT-4.1. Many real business documents fit comfortably.
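The word counts above are consistent with a rule of thumb of about 0.75 English words per token. A quick way to sanity-check whether a document fits is sketched below; the 0.75 ratio is an approximation that varies by tokenizer and content, and the `reserve_tokens` headroom for the question and the answer is an assumed value.

```python
import math

WORDS_PER_TOKEN = 0.75  # assumption: rough average for English prose


def approx_words(tokens: int) -> int:
    """Estimate how many words a token budget can hold."""
    return round(tokens * WORDS_PER_TOKEN)


def approx_tokens(words: int) -> int:
    """Estimate how many tokens a word count will consume."""
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_context(words: int, context_window: int,
                    reserve_tokens: int = 8_000) -> bool:
    """True when the document plus some headroom fits in the window."""
    return approx_tokens(words) + reserve_tokens <= context_window


print(approx_words(200_000), approx_words(1_000_000))
print(fits_in_context(60_000, 200_000))
```

For a real decision, count tokens with the provider's own tokenizer rather than relying on the ratio.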
Here is a working extraction example we ship for an Australian finance client, simplified for brevity:
import json

from anthropic import Anthropic
from pydantic import BaseModel

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


class Invoice(BaseModel):
    supplier_name: str
    invoice_number: str
    invoice_date: str
    total_aud: float
    gst_amount: float
    line_items: list[dict]


def extract_invoice(pdf_text: str) -> Invoice:
    # Serialise the schema as real JSON so the model sees double-quoted keys,
    # not a Python dict repr.
    schema = json.dumps(Invoice.model_json_schema(), indent=2)
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4000,
        system="You extract structured invoice data. Return only valid JSON matching the schema. AU GST is 10 percent of the ex-GST amount.",
        messages=[{
            "role": "user",
            "content": f"Schema:\n{schema}\n\nInvoice text:\n{pdf_text}",
        }],
    )
    # Parse and validate in one step; raises pydantic.ValidationError
    # if the model returns anything that does not match the schema.
    return Invoice.model_validate_json(response.content[0].text)
That is the whole “training” step. No fine-tuning, no vector database, no MLOps stack. The “training” is the system prompt and the schema. We have run this pattern in production for over a year with an extraction accuracy north of 96 percent on Australian tax invoices.
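Because the system prompt states the GST rule, the extracted figures can be cross-checked in plain code after the model returns. The sketch below assumes `total_aud` is the GST-inclusive total and that every line item is taxable, in which case GST is one eleventh of the total; the function name and the five-cent tolerance are illustrative choices.

```python
def gst_is_consistent(total_aud: float, gst_amount: float,
                      tolerance: float = 0.05) -> bool:
    """Check that GST is roughly 1/11 of a GST-inclusive total.

    If GST is 10 percent of the ex-GST amount, then
    total = ex_gst * 1.1 and gst = ex_gst * 0.1 = total / 11.
    """
    expected_gst = total_aud / 11
    return abs(expected_gst - gst_amount) <= tolerance


print(gst_is_consistent(110.00, 10.00))
print(gst_is_consistent(110.00, 11.00))
```

Invoices with GST-free items will fail this check legitimately, so a failure is a reason to send the invoice for review rather than to reject it outright.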
Use prompts with context when: the relevant data per query fits in 100K tokens or less, your accuracy target is 90 to 98 percent, and you can iterate on prompts faster than you can iterate on training runs (which is almost always).
RAG: Train AI on Your Own Data When It Does Not Fit
If your corpus is bigger than the context window (handbooks, contracts library, knowledge base, codebase), RAG is the right way to train AI on your own data. The pattern: embed each chunk of your data into a vector, store the vectors, and at query time fetch the most relevant chunks to pass to the model.
The minimal stack:
- Chunker. Break documents into 500-to-1500 token chunks at semantic boundaries (sections, not sentences).
- Embedding model. OpenAI text-embedding-3-large or Voyage voyage-3-large. Both are fine. Voyage tends to win on technical content.
- Vector store. Postgres + pgvector for under 5M chunks. Pinecone or Qdrant for serious scale. We default to pgvector because it sits next to the data already.
- Retriever. Top-k similarity search plus a reranker (Cohere Rerank or Voyage Rerank) for any corpus over 10,000 chunks.
- Generator. Claude Sonnet 4.5 or GPT-4.1 with the retrieved chunks as context.
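To make the chunk, embed, and retrieve steps concrete, here is a toy version that runs with the standard library only. It is a sketch under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, blank-line splitting stands in for section-aware chunking, and there is no vector store or reranker. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_by_section(document: str) -> list[str]:
    """Split on blank lines, a crude stand-in for section-aware chunking."""
    return [part.strip() for part in re.split(r"\n\s*\n", document) if part.strip()]


def toy_embed(text: str) -> Counter:
    """Bag-of-words counts. A real build would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


handbook = (
    "Annual leave\nStaff accrue four weeks of annual leave per year.\n\n"
    "Expenses\nSubmit expense claims within 30 days.\n\n"
    "Security\nRotate passwords every 90 days."
)
chunks = chunk_by_section(handbook)
print(retrieve("how much annual leave do staff get", chunks, k=1))
```

The retrieved chunks would then be placed in the generator's prompt, in the same way the invoice text is passed in the extraction example.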
The thing nobody tells you about RAG: retrieval quality is the bottleneck, not the model. A great model with bad retrieval gives confident wrong answers. A solid model with great retrieval is reliable. Spend your engineering time on chunking, metadata filters, and reranking, not on picking between Claude and GPT.
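One way to treat retrieval as its own measurable component is to track recall at k on a small set of questions with known relevant chunks. The sketch below is a generic implementation of that metric; the chunk IDs and function names are made up for illustration, and it is not a description of our internal tooling.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(cases: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall at k over (retrieved, relevant) pairs."""
    if not cases:
        return 0.0
    return sum(recall_at_k(got, want, k) for got, want in cases) / len(cases)


cases = [
    (["leave-01", "leave-02", "pay-07"], {"leave-01"}),   # relevant chunk found
    (["sec-03", "sec-04", "pay-01"], {"expenses-02"}),    # relevant chunk missed
]
print(mean_recall_at_k(cases, k=3))
```

If this number is low, changing the generator model will not help; the fix is in chunking, metadata filters, or reranking.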
A production RAG build for a mid-sized AU client typically lands in the $30,000 to $80,000 AUD range. Ongoing costs at 50,000 queries per month sit at $200 to $600 AUD for inference plus $50 to $200 AUD for embedding and storage.
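The ongoing figures translate into a per-query cost with simple division. The snippet below only restates the ranges quoted above; the function name is illustrative.

```python
def cost_per_query(monthly_cost_aud: float, queries_per_month: int) -> float:
    """Running cost per query in AUD."""
    return monthly_cost_aud / queries_per_month


QUERIES = 50_000
low = cost_per_query(200 + 50, QUERIES)    # low end: inference + embedding/storage
high = cost_per_query(600 + 200, QUERIES)  # high end of both ranges

print(f"{low:.3f} to {high:.3f} AUD per query")  # 0.005 to 0.016 AUD per query
```

At half a cent to under two cents per query, running costs stay small next to the build cost for a long time.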
Fine-Tuning: When It Actually Earns Its Place
Fine-tuning earns its place in a narrow set of cases. We have shipped fine-tuned models for clients perhaps four times in three years. Each time it was because prompts and RAG had been tested first and one of the following was true:
- The output format is unusual and stable. A specific markup language, a specific JSON dialect, or a custom DSL where prompt-following accuracy on a frontier model was still around 85 percent and we needed 99 percent.
- Latency requirement under 200ms. Frontier models have non-trivial first-token latency. A smaller fine-tuned model (Llama 3.1 8B AWQ, Mistral 7B) running on a local GPU can hit sub-100ms.
- Data cannot leave a VPC. Some regulated AU clients require inference inside their own AWS account with no external API call. Fine-tuning a self-hosted Llama gets you there.
- Token cost dominates at huge scale. A fine-tuned smaller model is cheaper per token than a frontier model. Crossover usually happens around 50M tokens per month.
If none of those apply, fine-tuning is the wrong answer. The cost is real: $15,000 to $80,000 AUD for the build, plus $500 to $5,000 AUD per month for hosting if you self-host the result.
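The token-cost crossover can be estimated as a break-even between a fixed hosting bill and a per-token API price. The sketch below uses the $500 AUD low-end hosting figure from above and a blended API price of $10 AUD per million tokens, which is a hypothetical number chosen for illustration. Check current pricing before relying on the result.

```python
def breakeven_tokens_per_month(hosting_aud_per_month: float,
                               api_aud_per_million: float,
                               self_host_aud_per_million: float = 0.0) -> float:
    """Monthly token volume at which self-hosting matches the API bill."""
    saving_per_million = api_aud_per_million - self_host_aud_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return hosting_aud_per_month / saving_per_million * 1_000_000


# Hypothetical inputs: $500/month hosting, $10 per million tokens via API.
print(breakeven_tokens_per_month(500, 10))
```

Under those assumed inputs the break-even lands at 50M tokens per month. The build cost is not included here, so the real payback point sits higher.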
How Much Data Do You Need to Train AI on Your Own Data?
The three approaches need very different amounts of data, often differing by an order of magnitude or more.
Prompts with context. Zero training examples. You just need the relevant data for the current query. Some teams include 3 to 10 few-shot examples in the prompt to lock the output format.
RAG. The whole corpus you want to query over. A starting point is anywhere from a few hundred documents (a handbook) to a few million chunks (a large knowledge base). Quality of metadata matters more than quantity.
Fine-tuning. Several hundred to several thousand labelled examples in the exact format you want the model to learn. For a classification task, around 500 well-labelled examples is the floor. For a more complex generation task, 2,000 to 10,000. The label quality matters more than quantity, and labelling is where most fine-tuning projects spend the bulk of their budget.
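"The exact format you want the model to learn" usually means one training record per example. The sketch below writes classification examples as chat-style JSONL, a layout several fine-tuning services accept; check your provider's documentation for the exact fields. The ticket text, labels, and function names are invented for illustration.

```python
import json

CLASSIFIER_FLOOR = 500  # floor quoted above for classification tasks


def to_jsonl(examples: list[tuple[str, str]], system: str) -> str:
    """Serialise (input text, label) pairs, one JSON object per line."""
    lines = []
    for text, label in examples:
        record = {"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},
        ]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


def enough_for_classifier(examples: list[tuple[str, str]]) -> bool:
    """True once the labelled set reaches the classification floor."""
    return len(examples) >= CLASSIFIER_FLOOR


examples = [("Card declined at checkout", "billing"),
            ("Cannot reset my password", "account")]
print(to_jsonl(examples, "Classify the support ticket."))
print(enough_for_classifier(examples))
```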
Evaluating Models Trained on Your Own Data
This is where most “we trained an AI on our data” projects fail quietly. The model produces plausible output. Nobody measures whether it is correct. Six months later somebody notices it has been wrong about a category of inputs since launch.
Our minimum eval setup for any production AI on client data:
- 30 to 100 labelled examples. Real inputs from production, with the expected output written by a human. The set evolves as edge cases appear.
- Automated eval on every prompt change. Pass-fail per example, aggregate accuracy reported in CI.
- Drift check. Re-run the eval suite weekly. Models change behaviour. APIs change behaviour. Catch it before the user does.
- Human spot check. A sample of production outputs reviewed by a human every week, even when the eval is passing.
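A pass-fail eval of the kind described above needs very little code. The following is a minimal sketch, not our production harness: `predict` stands for whatever function wraps the model call, exact string match is the simplest possible scoring rule, and the 0.95 gate is an assumed threshold.

```python
from typing import Callable


def run_eval(predict: Callable[[str], str],
             examples: list[tuple[str, str]]) -> dict:
    """Score predict() against (input, expected output) pairs."""
    failures = []
    for given, expected in examples:
        actual = predict(given)
        if actual.strip() != expected.strip():
            failures.append({"input": given, "expected": expected, "actual": actual})
    total = len(examples)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "failures": failures,
    }


def passes_gate(report: dict, minimum_accuracy: float = 0.95) -> bool:
    """Decide whether CI should accept this prompt change."""
    return report["accuracy"] >= minimum_accuracy


def fake_predict(text: str) -> str:
    """Stand-in for a model call so the sketch runs offline."""
    return "billing" if "card" in text.lower() else "account"


labelled = [("Card declined", "billing"),
            ("Password reset", "account"),
            ("Refund to my card", "refund")]
report = run_eval(fake_predict, labelled)
print(report["accuracy"], passes_gate(report))
```

Keeping the failures in the report matters as much as the aggregate number, because they are what gets reviewed when accuracy drops.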
The single highest-leverage day we spend on a new build is the day we set up the eval suite. The single most expensive failure mode we see is shipping without one.
Data Preparation and the Things That Actually Break
Some things you only learn from running this in production. A short list of what we have lost time on.
PDF extraction failures. Scanned PDFs, two-column layouts, table-heavy invoices. PyMuPDF gets most of it; the failures are silent. We run a length-check (extracted text under 200 chars on a multi-page PDF = bug) and route to OCR fallback (Tesseract or AWS Textract).
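The length check is a one-line rule. In the sketch below the 200-character threshold comes from the paragraph above, while the function name and the way page count is passed in are assumptions; the extraction and OCR calls themselves are left out.

```python
MIN_CHARS = 200  # threshold from the text: less than this on a multi-page PDF is suspect


def needs_ocr_fallback(extracted_text: str, page_count: int,
                       min_chars: int = MIN_CHARS) -> bool:
    """Flag multi-page PDFs whose text layer is suspiciously short."""
    return page_count > 1 and len(extracted_text.strip()) < min_chars


print(needs_ocr_fallback("   \n", page_count=4))
print(needs_ocr_fallback("Invoice 1043 " * 50, page_count=4))
```

A flagged document goes to the OCR path instead of the model, which turns a silent failure into a visible routing decision.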
PII and consent. Whatever data you train AI on, it is now in the model’s context window. For regulated clients we strip PII before it ever reaches the API, or route to a Sydney-hosted model on AWS Bedrock (Claude Sonnet 4.5 is available there) to stay onshore.
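A first pass at stripping PII can be done with pattern matching before the text reaches the API. The patterns below are illustrative assumptions that cover only email addresses and Australian mobile numbers; they will miss names, street addresses, landlines, and most identifiers, so treat this as a sketch of the shape and not as a complete redaction layer.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Australian mobiles such as 0412 345 678 or +61 412 345 678.
AU_MOBILE = re.compile(r"(?:\+61\s?4|\b04)\d{2}(?:\s?\d{3}){2}\b")


def redact(text: str) -> str:
    """Replace matched emails and AU mobile numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return AU_MOBILE.sub("[PHONE]", text)


print(redact("Call Jo on 0412 345 678 or jo@example.com."))
```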
Duplicate and near-duplicate data. Same document attached to ten support tickets shows up ten times in your training set or your RAG store. The model gets unduly confident about that one document. Deduplicate aggressively before ingestion.
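Deduplication can combine an exact check on normalised text with a near-duplicate check on overlapping word shingles. The sketch below is one simple way to do it, with an assumed similarity threshold of 0.8 and three-word shingles. It compares every document against every kept document, so a large corpus would need something like MinHash instead.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences vanish."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping word n-grams for a document."""
    words = normalise(text).split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap between two shingle sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first copy of each exact or near-duplicate document."""
    kept, kept_shingles, seen_hashes = [], [], set()
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalisation
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of something already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept


docs = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Refund  policy: refunds are issued within 14 days of purchase.\n",
    "Refund policy: refunds are issued within 14 days of purchase. Thanks",
    "Shipping takes 3 to 5 business days within Australia.",
]
print(len(deduplicate(docs)))
```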
Label drift. The way the business categorised tickets in 2024 is not the way it categorises them in 2026. Old training data labels conflict with the current taxonomy. We re-label periodically rather than just adding new examples.
When Not to Train AI on Your Own Data
The honest “do not bother” cases.
- Off-the-shelf tools fit. If Notion AI, Glean, or ChatGPT Team can answer the questions on your data, use them. Building a custom RAG to replicate Glean is a year and $200K of work you do not need.
- The data does not exist yet. Cleaning data is a project in itself. If your records live in spreadsheets, emails, and a tribal-knowledge oral tradition, the AI project will fail. Build the data foundation first.
- You want a chatbot for the website. Out-of-the-box vendors (Intercom Fin, Zendesk AI Agents) ship faster than custom builds for the standard customer-support-bot pattern. Build custom only if you have an unusual integration requirement.
Want help deciding which of the three patterns fits your data? Send us a note via the contact page or book a call. We have run this decision dozens of times and the answer is almost never the one teams expect at the start.
Frequently Asked Questions
How do I train AI on my own data without fine-tuning?
Use prompts with your data in the context window, or build a RAG system. Both let the model “know” your data without modifying its weights. For most business use cases this is the right call. Fine-tuning is reserved for unusual output formats, very low latency requirements, or VPC-only data.
How much data do I need to train AI on my own data?
None if you are using prompts with context. Your entire corpus if you are using RAG. Several hundred to several thousand labelled examples if you are fine-tuning. The 500-example minimum for fine-tuning a classifier is a useful threshold to start with.
How much does it cost to train AI on your own data?
Prompt-with-context builds: $5,000 to $20,000 AUD. RAG builds: $30,000 to $80,000 AUD for production-grade. Fine-tuning: $15,000 to $80,000 AUD for the training plus $500 to $5,000 AUD per month for hosting. Inference costs scale with usage and are usually a small fraction of build cost for the first year.
Which model should I use to train AI on my own data?
For extraction and structured output: Claude Sonnet 4.5. For agentic workflows with tool use: GPT-4.1 or Claude Sonnet 4.5. For embeddings: OpenAI text-embedding-3-large or Voyage voyage-3-large. For self-hosted: Llama 3.1 8B or Llama 3.3 70B with vLLM. Mix and match by component, not by single vendor commitment.
Can I train AI on PDFs and Word documents?
Yes, but extraction quality varies. Born-digital PDFs work well with PyMuPDF. Scanned PDFs need OCR (Tesseract or AWS Textract). Word docs convert cleanly via python-docx. The data-prep step is usually the bulk of the work, not the AI step.
Is it safe to train AI on customer data?
Depends on the deployment. Sending customer data to a frontier model via the public API exposes you to data residency questions. Sydney-hosted endpoints (AWS Bedrock ap-southeast-2, Vertex AI Sydney, Azure OpenAI with AU data residency) keep inference onshore. For any regulated workload (health, finance, legal) we default to Sydney inference and on-the-fly PII redaction.
How long does it take to train AI on your own data?
Prompt with context: a few days to a working prototype, two to four weeks to production. RAG: four to eight weeks to production. Fine-tuning: eight to sixteen weeks to production, mostly because of labelling and evaluation time, not training time itself.
Will the AI stay accurate after launch?
Not without an eval suite running on a schedule. Models change behaviour silently. Data drifts. New edge cases appear. We re-run a 50-to-100 example eval suite weekly on every production AI we have shipped, and we have caught two regressions that way that would otherwise have shipped silently to users.