System Integration Best Practices That Prevent Outages
System integration best practices that hold up in production: API contracts, idempotency, retries, schema versioning, and the monitoring that catches drift.
Updated May 2026. Rewritten from a generic listicle into the engineering practices we actually hold ourselves to, with working patterns for idempotency, retries, and schema versioning.
Most system integration best practices exist for one blunt reason: integrations that work fine in the demo fall over three weeks into production. The code was never the problem. The problem is the stuff that only shows up under real load: a webhook delivered twice, a partner API that rate-limits you at the worst moment, a field that changes type without warning. The habits below are what keep those failures from becoming outages.
We are Osher Digital, a Brisbane integration and automation consultancy. We have built and inherited a lot of integrations, and we have been paged at 2am by enough of them to know which practices earn their keep and which are cargo-cult ceremony. This is the opinionated version. For the failure modes themselves see our piece on system integration challenges, and for the rollout lifecycle see our guide on taking integrations from pilot to production. This article is about the engineering practices underneath both.
System Integration Best Practices Start With the Contract
The single highest-leverage practice is to agree the API contract before anyone writes the implementation. Define the endpoints, the payload shapes, the error responses, and the auth model up front, and write them down in an OpenAPI spec. That spec becomes the thing both sides build against in parallel, instead of one team waiting on the other and discovering the mismatch at integration time.
Contract-first buys you two things that matter later. You can mock the other side and test against the mock from day one. And you have a written record of what “correct” looks like, so when a payload arrives malformed you can point at the contract rather than argue about intent. Stripe built an entire business on a stable, well-documented API contract. The lesson scales down to a two-system integration just as well.
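As a rough illustration of testing against a mock from day one, the sketch below stands in for a real OpenAPI spec with a plain dictionary. The names `ORDER_CONTRACT`, `contract_violations`, and `mock_get_order` are hypothetical, and the fields are assumed for the example rather than taken from any real partner API.

```python
# Hypothetical contract fragment: required fields and their types for one response.
# A real project would generate this from the OpenAPI spec rather than hand-write it.
ORDER_CONTRACT = {
    "external_id": str,
    "status": str,
    "total_aud": float,
}


def contract_violations(payload: dict, contract: dict) -> list:
    """Return human-readable violations; an empty list means the payload conforms."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


def mock_get_order(external_id: str) -> dict:
    """Stand-in for the partner endpoint: a canned response that follows the contract."""
    return {"external_id": external_id, "status": "paid", "total_aud": 129.95}
```

Both teams can run the same check: the consumer against the mock today, and the provider against the real endpoint once it exists.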
Make Every Write Idempotent
If there is one practice we would tattoo on every integration developer, it is this. Networks retry. Queues redeliver. Users double-click. At some point the same message will arrive twice, and if your write is not idempotent you will create duplicate orders, double-charge a customer, or send two confirmation emails.
The fix is to make a repeated operation produce the same result as a single one. Two patterns cover almost everything. For your own writes, upsert against a stable business key rather than blind insert. For calls to someone else’s API, send an idempotency key and let them deduplicate, the way Stripe’s idempotency keys work.
# Idempotent upsert on a stable external key, not a blind insert
def save_order(db, order):
    db.execute(
        """
        INSERT INTO orders (external_id, customer, total_aud, status)
        VALUES (%(external_id)s, %(customer)s, %(total_aud)s, %(status)s)
        ON CONFLICT (external_id)
        DO UPDATE SET status = EXCLUDED.status, total_aud = EXCLUDED.total_aud
        """,
        order,
    )
# Replaying the same webhook now updates in place instead of duplicating.
We have cleaned up the aftermath of integrations that skipped this. The worst was a billing sync that created a fresh invoice on every retry. By the time anyone noticed, one customer had 14 copies of the same invoice. An afternoon of idempotency work up front would have saved a week of reconciliation.
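The second pattern, sending an idempotency key on outbound calls, can be sketched as follows. Deriving the key from the operation name and the business key is one possible choice, assumed here for illustration; a random key generated once and stored with the job works too, as long as every retry reuses it. The function names are hypothetical.

```python
import hashlib


def idempotency_key(operation: str, external_id: str) -> str:
    """Derive a stable key so every retry of the same operation sends the same value."""
    raw = f"{operation}:{external_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def request_headers(operation: str, external_id: str) -> dict:
    # Stripe reads the key from the Idempotency-Key header.
    # Other partners may use a different header name; check their documentation.
    return {"Idempotency-Key": idempotency_key(operation, external_id)}
```

The tradeoff with a derived key is that a legitimate second operation on the same record looks like a replay, so include whatever distinguishes the two (an invoice period, a version number) in the key material.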
Retry With Backoff and a Dead-Letter Queue
Transient failures are normal, not exceptional. A partner API returns a 503, a connection times out, a rate limit kicks in. The wrong response is to fail the whole job. The right one is to retry with exponential backoff and jitter, and after a few attempts, park the message in a dead-letter queue for a human to look at rather than dropping it on the floor.
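import httpx
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential_jitter

PARTNER_URL = "https://partner.example.com/orders"  # placeholder; substitute the real endpoint

def is_transient(exc):  # retry network errors, 429 and 5xx; other 4xx will not succeed on retry
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code == 429 or exc.response.status_code >= 500
    return isinstance(exc, httpx.TransportError)

@retry(stop=stop_after_attempt(5),
       wait=wait_exponential_jitter(initial=1, max=30),
       retry=retry_if_exception(is_transient))
def push_to_partner(payload):
    resp = httpx.post(PARTNER_URL, json=payload, timeout=10)
    resp.raise_for_status()  # raises httpx.HTTPStatusError on any 4xx/5xx
    return resp.json()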
Two caveats that bite people. Retries are only safe if the operation is idempotent, which is why the previous section comes first. And you need a ceiling: retrying forever against a partner that is genuinely down just burns your own resources. After the retries are exhausted, the dead-letter queue is what stops you losing the data silently. Silent data loss is the worst integration failure there is, because you do not know to fix it.
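The dead-letter half of the pattern can be sketched with the standard library alone. This is a simplified illustration, not a production queue: `dead_letters` is an in-memory list standing in for a real dead-letter queue, and `deliver_or_park` and `backoff_delay` are hypothetical names.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def deliver_or_park(message, handler, dead_letters, max_attempts=5, sleep=time.sleep):
    """Call handler(message) up to max_attempts times; park the message if all fail."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return handler(message)
        except Exception as exc:  # real code should catch only transient error types
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    # Retries exhausted: keep the payload and the reason so a human can replay it.
    dead_letters.append(
        {"message": message, "error": repr(last_error), "attempts": max_attempts}
    )
    return None
```

Storing the error alongside the message matters as much as storing the message: it is what lets someone decide whether to replay, fix, or discard.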
Validate at Every Boundary
Never trust data crossing a system boundary, even from a system you own. Validate the shape and the values the moment data arrives, with a schema library like Pydantic or Zod, and reject what does not fit before it pollutes your database. The alternative is discovering a string where you expected a number three tables deep, after it has already corrupted a report.
from pydantic import BaseModel, field_validator

class IncomingOrder(BaseModel):
    external_id: str
    total_aud: float
    currency: str

    @field_validator("total_aud")
    @classmethod
    def positive(cls, v):
        if v <= 0:
            raise ValueError("total must be positive")
        return v
# Bad payloads raise at the door, not deep in the pipeline.
Validation at the boundary is also your early-warning system for the next practice on the list. When a partner quietly changes their payload, your validation starts rejecting events immediately, which is exactly when you want to find out.
Version Schemas and Plan for Drift
Schemas change. A partner adds a field, renames one, or switches a date format. If your integration assumes the schema is frozen, the first change breaks it. Pin to an explicit API version where the partner offers one, and treat your own published payloads as contracts you cannot break without a version bump.
The practical rule we follow: additive changes (a new optional field) are safe and need no version bump. Anything that removes a field, renames one, or changes a type is a breaking change and gets a new version, with the old version supported until consumers migrate. This is dull discipline. It is also the difference between a partner’s release being a non-event for you and being an outage.
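The additive-versus-breaking rule is mechanical enough to check in code. The sketch below compares two schemas written as simple `{field: type_name}` dictionaries, a deliberately reduced stand-in for a real schema format, and `classify_change` is a hypothetical helper name.

```python
def classify_change(old: dict, new: dict) -> str:
    """Compare two {field: type_name} schemas; return 'none', 'additive', or 'breaking'.

    Simplification: every added field is assumed to be optional.
    """
    removed = old.keys() - new.keys()
    retyped = {field for field in old.keys() & new.keys() if old[field] != new[field]}
    if removed or retyped:
        # A rename shows up as one removal plus one addition, so it lands here too.
        return "breaking"
    if new.keys() - old.keys():
        return "additive"
    return "none"
```

A check like this could run in CI against the previous published schema, failing the build when a breaking change arrives without a version bump.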
Instrument Everything With Correlation IDs
When an integration misbehaves in production, you need to trace one transaction across every system it touched. That is only possible if you stamp a correlation ID on the event when it enters and carry it through every hop, log line, and downstream call. Without it, debugging a distributed failure is archaeology.
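One way to carry the ID through every log line without passing it to each function by hand is a context variable plus a JSON log formatter. This is a sketch under assumptions: the `X-Correlation-ID` header name is a common convention rather than a standard, and `start_transaction` and `outbound_headers` are hypothetical helpers.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID for whichever transaction the current thread or task is handling.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object that includes the correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def start_transaction(incoming_id=None):
    """Reuse the ID that arrived with the event if there is one, else mint a new one."""
    cid = incoming_id or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid


def outbound_headers():
    # Assumed header name; use whatever the systems involved have agreed on.
    return {"X-Correlation-ID": correlation_id.get()}
```

Reusing an incoming ID rather than always minting a fresh one is the important part: it is what lets a single search span systems owned by different teams.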
Pair the correlation ID with structured logging and a few alerts that watch the things that actually predict trouble: the dead-letter queue depth, the validation rejection rate, and the age of the oldest unprocessed message. Alert on those three and you catch most integration problems before a user reports them. We learned the value of the rejection-rate alert the hard way, after a partner’s silent schema change went unnoticed for four days and quietly dropped a small slice of records the whole time.
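The three signals can be evaluated with a few lines of code once the numbers are collected. The thresholds below are illustrative assumptions, not recommendations, and `Thresholds` and `integration_alerts` are hypothetical names; in practice this logic usually lives in the monitoring tool's alert rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Illustrative defaults only; tune them to your own traffic.
    max_dlq_depth: int = 0
    max_rejection_rate: float = 0.01
    max_oldest_age_seconds: float = 900.0


def integration_alerts(dlq_depth, rejected, received, oldest_age_seconds,
                       thresholds=Thresholds()):
    """Return the names of the signals that are currently over threshold."""
    alerts = []
    if dlq_depth > thresholds.max_dlq_depth:
        alerts.append("dead_letter_queue_depth")
    # Guard against dividing by zero during quiet periods.
    rejection_rate = rejected / received if received else 0.0
    if rejection_rate > thresholds.max_rejection_rate:
        alerts.append("validation_rejection_rate")
    if oldest_age_seconds > thresholds.max_oldest_age_seconds:
        alerts.append("oldest_unprocessed_message_age")
    return alerts
```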
Secure the Integration From Day One
Security retrofitted onto a working integration is always more painful than security designed in. Use OAuth 2.0 or signed requests for authentication, TLS for everything in transit, and the principle of least privilege for every credential. An integration token that can read one endpoint cannot do much damage if it leaks. One with full admin rights is a breach waiting to happen.
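For the signed-request option, a minimal HMAC-SHA256 sketch looks like this. It assumes a shared secret and a signature sent as a hex string; real schemes vary by provider and usually also sign a timestamp to block replays, which is omitted here. The function names are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature of a request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Check a received signature against the one we compute ourselves."""
    # compare_digest runs in constant time, so it does not leak how much matched.
    return hmac.compare_digest(sign(secret, body), signature)
```

Verify against the raw bytes as received, before any JSON parsing, because re-serialising the payload can change whitespace or key order and invalidate the signature.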
The operational half of security is credential lifecycle. Store secrets in a vault, not in environment files committed to a repo, and put expiry dates on a calendar. Expired credentials are one of the most common causes of an integration that “just stopped working” overnight, and they are entirely preventable. We keep a shared expiry calendar for every client integration for exactly this reason.
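An expiry calendar can also be checked by a scheduled job. The sketch below assumes the expiry dates are available as a simple mapping of credential name to date; `expiring_soon` and the 30-day warning window are hypothetical choices for illustration.

```python
from datetime import date, timedelta


def expiring_soon(credentials: dict, today: date, warn_days: int = 30) -> list:
    """Return (name, days_left) pairs for credentials due within warn_days, soonest first.

    A negative days_left means the credential has already expired.
    """
    cutoff = today + timedelta(days=warn_days)
    due = [
        (name, (expires - today).days)
        for name, expires in credentials.items()
        if expires <= cutoff
    ]
    return sorted(due, key=lambda item: item[1])
```

Running this daily and posting the result to a team channel turns an overnight failure into a ticket raised weeks ahead.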
Choose the Right Integration Pattern
Not every integration should be a real-time API call. Match the pattern to the need:
- Request-response API when the caller needs an answer immediately, such as a stock check at checkout.
- Event-driven messaging when systems should react to changes without being tightly coupled, using a broker like Kafka or a managed queue.
- Scheduled batch when volume is high, latency does not matter, and a nightly reconciliation is simpler and cheaper than streaming.
- A workflow tool like n8n when the integration is mostly glue between SaaS apps and a full custom build is overkill.
Teams default to real-time APIs because they feel modern, then pay for it in complexity. A nightly batch job that reconciles two systems is unglamorous and often exactly right. If you are deciding how to connect your systems and want a second opinion, book a call and we will talk through the tradeoffs for your case. For data-heavy syncs specifically, our notes on data integration solutions go deeper.
When System Integration Best Practices Mean Not Building It
The best integration is sometimes none. Walk away, or wait, when:
- The volume is tiny. If two people copy 10 records a day between systems, a custom integration costs more to build and maintain than it saves. Automate it later, when the volume justifies it.
- One of the systems is being retired. Building a careful integration to a platform that is going away in six months is effort you will throw out.
- A native connector already exists. If your two tools have a supported native integration, use it. A maintained connector beats a bespoke one you have to babysit, even if it is slightly less flexible.
This overlaps with modernisation decisions more often than people expect. If the integration pain is really a symptom of an ageing core system, our legacy system modernisation guide is the better starting point than another point-to-point connector.
Frequently Asked Questions
What are the most important system integration best practices?
If you only do three things: design the API contract before building, make every write idempotent, and validate data at every boundary. Those three prevent some of the most common production failures, namely mismatched expectations, duplicate records, and corrupt data. Retries with a dead-letter queue and correlation-ID logging come next.
What is the most common cause of integration failure in production?
Expired or misconfigured credentials, closely followed by silent schema drift from a partner. Both are preventable. Credential expiry calendars catch the first, and validating payloads at the boundary with alerting on the rejection rate catches the second before it does much damage.
Why does idempotency matter so much in integration?
Because messages get delivered more than once as a normal part of distributed systems. Without idempotency, a retried webhook creates a duplicate order or a double charge. With it, processing the same message twice produces the same result as processing it once. It is the foundation that makes safe retries possible.
Should I use real-time APIs or batch integration?
Use real-time when the caller needs an immediate answer, event-driven messaging when systems should react without tight coupling, and scheduled batch when volume is high and latency does not matter. Teams over-reach for real-time. A nightly batch reconciliation is often simpler, cheaper, and entirely sufficient.
How do I handle a partner changing their API?
Pin to an explicit API version where the partner offers one, validate incoming payloads so a change surfaces immediately, and alert on the validation rejection rate. Treat additive changes as safe and any removal, rename, or type change as breaking. That way a partner’s release is something you handle deliberately rather than discover through an outage.
How much does a system integration cost in Australia?
A straightforward point-to-point integration between two systems with good APIs typically runs 8,000 to 25,000 AUD to build and test properly. Complex multi-system work with messaging, reconciliation, and monitoring runs higher. The biggest cost driver is rarely the happy path; it is handling the errors, retries, and edge cases that the practices in this guide address.
What is the difference between data integration and system integration?
System integration connects applications so they can trigger actions and exchange data in workflows, often in real time. Data integration focuses on moving and combining data from multiple sources into a single store for analysis, usually in batches. They share practices like validation and schema versioning, but the goals differ: one is about behaviour, the other about a unified dataset.
None of these system integration best practices are clever. They are dull, and dull is the point: the integrations that never page you are the ones built on boring, disciplined habits. If you want help designing an integration that holds up in production, or rescuing one that does not, get in touch with our team. We are based in Brisbane and work with businesses across Australia.