Run Llama 3 Locally: Hardware, Tooling, and Setup

Run Llama models on your own hardware: realistic 2026 specs, the tooling we use in production, and the setup steps that actually work end-to-end.

Updated May 2026. Rewritten to cover Llama 3.x and 4 model sizes, current local-inference tooling, and the hardware reality of running open models in Australian environments.

Running a Llama model on your own hardware used to be a research exercise. In 2026, it is a real production option. The tooling has matured, the model quality has caught up to the closed APIs for many tasks, and the hardware that can run a useful model is sitting on a lot of desks already. We help Australian businesses make this call regularly, and the answers depend on your data, your throughput, and the work you actually need the model to do.

At Osher Digital, we are a Brisbane-based AI consultancy that has shipped on-premise and self-hosted LLM infrastructure for healthcare, finance, and professional services clients. This guide covers what actually works in 2026: which Llama variant to pick, what hardware you genuinely need, the inference stack we deploy, and the cost tradeoff against using a hosted API like Anthropic or OpenAI.

It is aimed at developers, ops engineers, and technical leaders evaluating local inference for sensitive workloads or cost-sensitive scale. If you are weighing local versus hosted before you start, our guide to Claude-based agents is the better starting point for that decision.


Why Run Llama Locally

Local inference solves three problems that hosted APIs cannot. The first is data sovereignty. If your prompts contain personal health information, banking data, or unreleased commercial records, sending them to a US-based API may not be acceptable under your contractual or regulatory obligations. A Llama model running on hardware you own keeps the data on your infrastructure end-to-end.

The second is unit economics at scale. For workloads that run thousands or millions of inferences per day with predictable token sizes, a dedicated GPU running a quantised Llama model often costs less per million tokens than the equivalent hosted API. The crossover point depends on your workload, but for high-volume document processing, classification, and embedding generation, local inference is increasingly the rational choice.

The third is latency. A local inference server in your data centre or on a workstation in your office responds in tens of milliseconds for the first token. A hosted API call adds 100-300ms of network latency before any tokens come back. For interactive applications where every hundred milliseconds matters, local inference is faster.

Local inference does not solve every problem. Llama is not better than Claude or GPT-4o at the hardest reasoning tasks. We pair them in production: hosted API for the genuinely hard work, local Llama for the high-volume, well-bounded tasks where its accuracy is good enough.


Hardware Requirements: Real Numbers

Hardware sizing is where most local LLM projects go wrong. The honest version, from running these workloads in production:

Llama 3.x 8B (or Llama 3.2 11B). The smallest useful model class. At 4-bit quantisation, it fits in 6-8 GB of VRAM. A consumer card like an RTX 4060 Ti 16GB runs it comfortably. A Mac with 16 GB of unified memory runs it through Ollama or LM Studio. This size handles classification, summarisation, and structured extraction at strong quality. It is what we deploy for high-volume document processing.

Llama 3.x 70B. The middle class. At 4-bit quantisation, it needs around 40 GB of VRAM. A single RTX 6000 Ada (48 GB) or two RTX 4090s (24 GB each, model split across them) runs it. Cloud-equivalent: a single L40S or two L40 cards. This is the size at which Llama gets close to GPT-4-class quality on many tasks.

Llama 3.1 405B (and the Llama 4 Maverick class at roughly 400B total parameters). The frontier class. At 4-bit quantisation, around 220 GB of VRAM is the minimum. Realistically, you are looking at four H100s (80 GB each) or eight A100s. This is data centre territory. Most Australian businesses do not run this themselves; they use it via a managed inference provider or via hosted Llama API offerings.

The simple rule of thumb is that the model needs 0.5 to 0.6 GB of VRAM per billion parameters at 4-bit quantisation, plus another 2-4 GB for the KV cache and operating overhead. Add 50-100% headroom if you want to serve multiple concurrent requests rather than one at a time.
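
As a quick illustration of that rule of thumb, a back-of-envelope calculation (the 0.55 GB-per-billion figure and the 3 GB overhead are the estimates above, not exact numbers) looks like this:

def estimate_vram_gb(params_billion, gb_per_billion=0.55, overhead_gb=3.0):
    """Weights at 4-bit quantisation plus KV cache / runtime overhead."""
    return params_billion * gb_per_billion + overhead_gb

for name, size_b in [("Llama 3.x 8B", 8), ("Llama 3.3 70B", 70), ("Llama 3.1 405B", 405)]:
    base = estimate_vram_gb(size_b)
    # Add 50-100% headroom for concurrent serving rather than one request at a time
    print(f"{name}: ~{base:.0f} GB single-stream, ~{base * 1.5:.0f}-{base * 2:.0f} GB with serving headroom")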


Choosing a Llama Model in 2026

Meta has released several Llama generations since the original Llama 3. The choice for most production work is between three model classes:

Llama 3.3 70B remains a sweet spot for general-purpose work in mid-2026. The quality is high, the context window is workable (128K tokens), and the hardware footprint is achievable on a single high-end workstation GPU.

Llama 3.2 vision models (11B and 90B) are the option if you need image understanding alongside text. We use the 11B vision model for invoice and receipt processing where the OCR-to-text path loses too much information.

Llama 4 introduced a mixture-of-experts architecture that delivers higher quality at the same active parameter count. The Maverick (17B active, 400B total) and Scout (17B active, 109B total) models are the production-relevant variants. Scout fits comfortably on a single H100 or two L40S cards. Maverick is data centre territory.

For most teams starting fresh in 2026, we recommend Llama 3.3 70B as the default. It is well-supported by every major inference engine, runs on attainable hardware, and the quality is genuinely production-ready for classification, extraction, summarisation, and most agent-style tasks. Reach for vision variants when the workload needs them, and for Llama 4 when you have hardware sized for it.


Quick Start with Ollama

The fastest path from “I have a machine” to “Llama is responding” is Ollama. It runs on Linux, macOS, and Windows, manages model downloads, and exposes a clean OpenAI-compatible API.

# Install (macOS)
brew install ollama

# Or Linux one-liner
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.3:70b

# Run it interactively
ollama run llama3.3:70b "Summarise this paragraph: ..."

# Start the server (auto-runs in most installs)
ollama serve

The server exposes http://localhost:11434 with an OpenAI-compatible /v1/chat/completions endpoint. Most client libraries (the OpenAI SDK, LangChain, LlamaIndex) work against it by setting a custom base URL. This makes the swap from a hosted model to local Ollama a single configuration change.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores the value but the SDK requires one
)

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[
        {"role": "system", "content": "You classify support tickets."},
        {"role": "user", "content": "My order arrived damaged."},
    ],
)
print(response.choices[0].message.content)

Ollama is excellent for development, evaluation, and small-scale internal tools. It is not what we run in production for high-throughput workloads; for that, we move to vLLM.


Production Inference with vLLM

For production workloads beyond a few requests per minute, vLLM is the inference engine we deploy. It implements PagedAttention, continuous batching, and optimised CUDA kernels that deliver several times the throughput of naive implementations. Llama is a first-class supported model.

pip install vllm

# Start an OpenAI-compatible server
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --quantization awq

This command starts an inference server on two GPUs, splits the model across them with tensor parallelism, allows up to 32K-token contexts, and uses AWQ 4-bit quantisation. In practice, point the model argument at a pre-quantised AWQ checkpoint: vLLM loads AWQ weights rather than quantising FP16 weights on the fly. The endpoint is OpenAI-compatible at http://localhost:8000/v1.

For on-CPU or low-VRAM deployments, llama.cpp remains the reference. It uses the GGUF format and runs Llama models on consumer hardware (including Apple Silicon) faster than most alternatives. We use llama.cpp for edge deployments and for workloads that need to run on a developer’s MacBook.
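
For a quick sanity check of a GGUF build on a laptop, the llama-cpp-python bindings are the usual route; a minimal sketch, assuming you have already downloaded a Q4_K_M GGUF file (the path below is a placeholder):

from llama_cpp import Llama

# Placeholder path: point at whichever Q4_K_M GGUF file you have downloaded
llm = Llama(
    model_path="./models/llama-3.3-70b-instruct.Q4_K_M.gguf",
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to GPU/Metal where available
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Classify this ticket: 'My order arrived damaged.'"}],
    max_tokens=64,
)
print(result["choices"][0]["message"]["content"])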


GPU Selection and Costs

For Australian businesses, the GPU choice usually comes down to three options:

Buy and self-host. An RTX 6000 Ada (48 GB) sits at around $12,000 AUD landed. A dual RTX 4090 build (24 GB each) lands at around $7,500 AUD for a complete workstation. For workloads that run continuously, the capital cost amortises in 12-18 months against equivalent cloud rental. We have built in-office inference servers for clients who want absolute data sovereignty.

Cloud GPU rental in Australia. Vast.ai and RunPod offer hourly GPU rentals starting around $0.50 USD per hour for an RTX 4090, with availability in Asia-Pacific regions. AWS p5 instances in ap-southeast-2 are an option but materially more expensive. For unpredictable workloads or proof of concept, hourly rental is the rational starting point.

Managed Llama API. Together AI, Fireworks, and Anyscale Endpoints all offer hosted Llama 3.3 70B at around $0.50-$0.90 USD per million output tokens. This is dramatically cheaper than provisioning your own GPU for low-volume work. The catch is that the API request still leaves Australia, which removes the data sovereignty benefit. For internal-only or non-sensitive workloads it is a strong middle option.

The decision framework we use with clients: if data must stay in Australia and volume is high, buy hardware. If data is non-sensitive and volume is moderate, use a managed Llama API. Everything in between needs a cost model and an honest workload calculation; the sketch below shows the shape of it. Book a call if you want help running that calculation for your specific scenario.
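
A minimal sketch of that calculation, with placeholder prices rather than quotes:

# Illustrative break-even sketch; every price here is a placeholder, not a quote
api_cost_per_m_tokens_aud = 1.20      # hosted Llama API, per million output tokens
hardware_cost_aud = 7_500             # dual-4090 workstation, landed
power_cost_per_month_aud = 60
monthly_output_tokens_m = 500         # millions of output tokens per month

api_monthly = monthly_output_tokens_m * api_cost_per_m_tokens_aud
local_monthly = power_cost_per_month_aud
months_to_break_even = hardware_cost_aud / (api_monthly - local_monthly)
print(f"Hosted API: ~${api_monthly:,.0f}/month, local power: ~${local_monthly}/month")
print(f"Hardware pays for itself in ~{months_to_break_even:.1f} months at this volume")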


Performance and Quantisation Tradeoffs

Quantisation reduces the precision of model weights from 16-bit floats to 8-bit, 4-bit, or even 2-bit integers. The smaller the precision, the smaller the memory footprint and the faster the inference, at the cost of some output quality.

For Llama 3.3 70B, the quality difference between 16-bit (full precision) and 4-bit AWQ or GPTQ is small for most tasks: typically 1-3% on standard benchmarks, often imperceptible in production output. We deploy 4-bit by default. We move to 8-bit only when we have a specific quality regression to address, and we have not yet found a production workload that genuinely needed 16-bit.

2-bit quantisation is too aggressive for production. Models lose enough quality that we no longer trust them for client work. The market consensus has converged on 4-bit AWQ for vLLM and 4-bit GGUF (Q4_K_M variant) for llama.cpp.

Throughput numbers we see in practice: a Llama 3.3 70B 4-bit model on two RTX 4090s with vLLM serves around 800-1,200 output tokens per second across 16 concurrent requests. The same model on a single H100 lands around 2,500-3,500 tokens per second. These are real numbers from our deployments, not lab benchmarks.
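
If you would rather measure than take our word for it, a rough aggregate-throughput check against any OpenAI-compatible endpoint can look like the sketch below (the endpoint, model name, and concurrency are whatever your deployment uses):

import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

async def one_request():
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=[{"role": "user", "content": "Write a 200-word product description."}],
        max_tokens=300,
    )
    return resp.usage.completion_tokens

async def main(concurrency=16):
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens)} output tokens in {elapsed:.1f}s -> {sum(tokens)/elapsed:.0f} tok/s aggregate")

asyncio.run(main())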


Connecting Llama to Your Applications

The OpenAI-compatible API on Ollama and vLLM is the integration glue. Most production stacks treat the local Llama endpoint as a drop-in replacement for OpenAI in code, with the model name and base URL configured via environment variables.
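
In practice the swap is a pair of environment variables rather than a code change; a small sketch, where the LLM_BASE_URL and LLM_MODEL names are our own convention rather than a standard:

import os
from openai import OpenAI

# LLM_BASE_URL / LLM_MODEL / LLM_API_KEY are our convention, not a standard
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
    api_key=os.environ.get("LLM_API_KEY", "not-needed-locally"),
)
MODEL = os.environ.get("LLM_MODEL", "llama3.3:70b")
# The same code path serves hosted and local deployments; only the env vars change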

For workflow automation, n8n’s HTTP Request node calls the local endpoint directly. We pair this with our self-hosted n8n stack so the entire pipeline runs in-region.

For agent-style applications, Llama 3.3 70B supports function calling well enough for production use. The format is a near-clone of OpenAI’s tool-calling schema. We use it for internal classification and routing agents where the data must stay local.
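
A hedged sketch of what that looks like against a local endpoint, using the OpenAI-style tools schema (the route_ticket tool here is an invented example, not one of our production tools):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Invented example tool; Llama 3.3 returns tool_calls in the OpenAI-style schema
tools = [{
    "type": "function",
    "function": {
        "name": "route_ticket",
        "description": "Route a support ticket to a team",
        "parameters": {
            "type": "object",
            "properties": {"team": {"type": "string", "enum": ["billing", "shipping", "technical"]}},
            "required": ["team"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "My order arrived damaged."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)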

For embedding workloads, Llama is not the right tool; use a dedicated embedding model like BGE-large or NV-Embed-v2. Both run cheaply on a fraction of the hardware needed for Llama itself, and they are what we deploy alongside the chat models for retrieval-augmented generation pipelines.
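
As an illustration, BGE-large runs in a few lines through sentence-transformers; the checkpoint below is the commonly used English BGE-large build, so substitute whichever embedding model you standardise on:

from sentence_transformers import SentenceTransformer

# BAAI/bge-large-en-v1.5 is the widely used English BGE-large checkpoint
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

docs = ["Invoice 1043: net 30 days", "Receipt: coffee, $4.50"]
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024) -- 1024-dimensional vectors for BGE-large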


When Not to Run Llama Locally

Local Llama is the wrong call when:

The volume is low. If you are making a few thousand requests per day, hosted APIs are cheaper and require no operational work. Spinning up a GPU server costs more in time than it saves in API fees.

You need frontier reasoning quality. Claude Opus 4.5 and GPT-4.1 still outperform every open model on the hardest reasoning tasks. If your workload depends on the very top of the quality curve, a closed API is the right tool. We pair them: Llama for the bulk volume, Claude for the genuinely hard cases.

You do not have ops capacity. Running a GPU inference server is not difficult, but it is real ops work. Driver updates, CUDA versions, model updates, monitoring, on-call. If your team is two engineers and they are already at capacity, the hosted API is a better tradeoff.

You need long context with high quality. Llama’s 128K context works, but the quality at the back of the context window degrades faster than the closed-API competition. For tasks that genuinely need 100K+ tokens of input, Claude or Gemini still wins.


Australian Hosting Options for GPU Inference

Where you run the GPUs matters for the same data sovereignty reasons that drive the local-inference choice in the first place. Options for Australian deployments:

AWS p5 in ap-southeast-2 (Sydney). H100-class instances on demand, expensive at around $50+ AUD per hour but officially in Australia. Use spot capacity to bring this down where the workload tolerates interruption.

NextDC and other Australian colocation. For owned hardware, NextDC’s Sydney and Melbourne facilities house several Australian AI infrastructure providers. We have placed client GPU servers in colocation when in-office cooling and power were limiting factors.

On-premise with dedicated cooling. A single RTX 6000 Ada or a pair of RTX 4090s runs comfortably in a small server room with adequate airflow. We have set this up for clients who wanted absolute on-site control.

Compliance with the Australian Privacy Principles, particularly APP 8 (cross-border disclosure), is far simpler when the GPU and the storage layer are both in Australia. For healthcare clients under the My Health Records Act, this is effectively required.


Frequently Asked Questions

How do I install Llama 3 on my laptop?

The simplest path is Ollama. Install it with Homebrew on macOS or the install script on Linux, then run ollama pull llama3.1:8b followed by ollama run llama3.1:8b (the 8B class ships as Llama 3.1; Llama 3.3 is 70B only). The 8B model runs comfortably on a Mac with 16 GB of unified memory or any laptop with an Nvidia GPU that has 8 GB of VRAM. Larger models need more hardware.

What hardware do I need to run Llama 3 locally?

For Llama 3.x 8B at 4-bit quantisation, 8 GB of VRAM is enough. A consumer card like the RTX 4060 Ti 16GB or an Apple Silicon Mac with 16 GB unified memory works. For Llama 3.3 70B at 4-bit, you need around 40 GB of VRAM, which means an RTX 6000 Ada (48 GB), two RTX 4090s, or a cloud GPU instance. CPU-only inference works for the smaller model classes through llama.cpp but throughput is around 5-10 tokens per second.

Can I run Llama offline?

Yes. Once the model weights are downloaded (a one-time download of 5-50 GB depending on size), inference happens entirely on your hardware. No internet connection is required for normal operation. This is one of the key reasons we deploy local Llama for clients in air-gapped environments and for use cases where outbound network access is blocked.

Is Llama as good as Claude or GPT-4?

For many production workloads, yes. For the hardest reasoning tasks, the closed APIs still hold an edge. Llama 3.3 70B is competitive with GPT-4-class models for classification, summarisation, structured extraction, and most tool-using agent work. It tends to underperform on multi-step reasoning, complex code generation, and very long context tasks. We commonly run a hybrid in production: Llama for the high-volume bulk work and Claude or GPT-4o for the genuinely hard cases.

How much does it cost to run Llama locally in Australia?

For a workstation with a 4090 running Llama 8B, the hardware is around $4,000 AUD landed and electricity adds maybe $20 AUD per month. For Llama 70B on dual 4090s, hardware around $7,500 AUD landed. For cloud rental, an RTX 4090 hour costs around $0.50 USD on Vast.ai or RunPod. AWS Sydney H100 capacity runs around $50 AUD per hour. The break-even with hosted APIs depends entirely on volume; high-throughput workloads cross over to favouring local hardware quickly.

What is the best way to run Llama 3 in production?

vLLM is the inference engine we deploy in production. It supports continuous batching, paged attention, and OpenAI-compatible APIs out of the box. Pair it with a reverse proxy for TLS, a queue (Redis or RabbitMQ) if you need request shaping, and a monitoring stack (Prometheus plus Grafana). For smaller workloads where ops simplicity matters more than throughput, Ollama works fine in production and is what we run behind internal tools.

How do I fine-tune Llama on my own data?

For most workloads, you do not need to. Retrieval-augmented generation (RAG) plus a strong system prompt covers about 90% of “make Llama know about my data” cases. When fine-tuning genuinely helps, parameter-efficient methods like LoRA on top of the base weights need a single GPU with 24-48 GB and a few hours. Full fine-tuning of a 70B model is data centre work. We help clients pick the right approach as part of our AI agent development engagements.
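
For completeness, a minimal sketch of a LoRA setup with Hugging Face peft; the hyperparameters and the 8B base checkpoint are illustrative starting points, not tuned values:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", device_map="auto"
)

# Illustrative starting values; rank and target modules are workload-dependent
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of total weights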

Is running Llama locally suitable for Australian healthcare data?

It is one of the few options that comfortably satisfies the My Health Records Act and APP 8 cross-border disclosure restrictions. With on-premise hardware or in-region cloud GPUs, the patient data never leaves your infrastructure or your in-region cloud account. We have shipped this configuration for healthcare providers who could not use a US-based hosted API for regulatory reasons.


If you want help deciding between local Llama and a hosted API, or if you are designing a GPU inference stack for sensitive Australian workloads, get in touch with our team. We are based in Brisbane and ship hybrid local-and-hosted AI infrastructure for businesses across Australia.

Ready to streamline your operations?

Get in touch for a free consultation to see how we can streamline your operations and increase your productivity.