System Integration Best Practices: From Pilot to Production
System integration best practices that survive contact with production: idempotency, retries, observability, and the patterns we use across client builds.
Updated May 2026. Replaced the textbook checklist with the integration patterns we actually run for clients, the failure modes we have hit in production, and the design choices we will not bend on.
Every system integration that has burned us in production failed for one of about six reasons. They are the same reasons. Different vendors, different APIs, different domains, same pathologies. Once you know the list, you stop building integrations that hit them.
We are Osher Digital, an automation and AI consultancy in Brisbane. We build and maintain integrations for clients in healthcare, recruitment, finance, and professional services. The advice below is not a textbook. It is the set of practices we will not skip, the patterns that have survived production, and the trade-offs that look obvious in hindsight.
This guide is for anyone designing or owning an integration that has to keep running for years. If you are sizing up a one-off data migration, this is overkill. For automation patterns that wrap these integrations, see our AI workflow patterns in n8n and n8n consulting work.
The six failures that produce most integration outages
Before the patterns, the failure modes. Every production incident we have shipped a fix for in the last three years lands in one of these buckets.
Silent data corruption. The integration runs, no errors, but the data on one side does not match the data on the other. The hardest failure to detect and the most damaging when it surfaces.
Replays after a partial failure. The job died halfway through. Someone restarts it. Half the records process twice. Two duplicate invoices, two duplicate emails, one customer charged twice.
Rate limit hits at the worst possible time. End of month, big batch, suddenly the upstream API is throttling and your integration is failing in a way that looks like a bug.
Schema drift. The vendor adds a field. Or removes one. Or changes a date format. Your code does not error; it just starts producing nonsense. Caught two weeks later by a finance team noticing weird numbers.
Auth credential expiry. The OAuth refresh token got revoked. The service account password rotated. The certificate hit its expiry date. The integration goes dark, monitoring stays quiet because no records are flowing to generate errors, and someone notices on day three.
The dependency you forgot was a dependency. The integration depends on a script on someone’s laptop, an outdated DNS entry, an undocumented allowlist, an export that someone runs every Friday. They go on leave. The integration breaks. Nobody can find the dependency.
The patterns below are designed to make these failures either impossible or loud. Anything you build that does not address them is going to bite you within 12 months.
Design for idempotency from day one
Idempotency means running the same operation twice produces the same result as running it once. This is the single most valuable property an integration can have. It turns “we cannot retry safely because we do not know what got committed” into “retry as much as you want.”
Two patterns we use almost every time:
Upsert with a stable external key. Every record we sync has a unique identifier from the source system. The destination upserts on that key. Run it twice, you get one record, not two. Most modern APIs (Salesforce, HubSpot, Xero, Stripe) support an external key field. Use it. The integrations that do not are the ones that produce duplicates when something goes wrong.
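A minimal sketch of the upsert pattern, assuming a Postgres destination reached via psycopg and a unique constraint on the external key; the table and field names are illustrative, not from any specific client build:

```python
import psycopg  # psycopg 3

# Assumes: CREATE UNIQUE INDEX ON contacts (external_id);
UPSERT_SQL = """
    INSERT INTO contacts (external_id, email, full_name, updated_at)
    VALUES (%(external_id)s, %(email)s, %(full_name)s, %(updated_at)s)
    ON CONFLICT (external_id) DO UPDATE
       SET email = EXCLUDED.email,
           full_name = EXCLUDED.full_name,
           updated_at = EXCLUDED.updated_at
"""

def sync_contact(dsn: str, record: dict) -> None:
    """Idempotent write: running this twice for the same source record
    leaves exactly one row in the destination."""
    with psycopg.connect(dsn) as conn:
        conn.execute(UPSERT_SQL, record)
```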
Idempotency keys for non-upsertable operations. Sending an email, charging a card, posting a webhook. These actions cannot be made naturally idempotent because the side effect is the point. Generate a unique idempotency key per logical operation, send it as a header, store it in the source system. The receiving system uses it to deduplicate. Stripe’s idempotency-key header is the best-known example; most modern APIs have an equivalent.
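And a sketch of the idempotency-key pattern for a side-effecting call. The billing endpoint and field names are hypothetical; the header follows Stripe's Idempotency-Key convention, and the key is derived deterministically from the logical operation so a replayed job sends the same key:

```python
import hashlib
import requests

def idempotency_key(invoice_id: str, action: str) -> str:
    # Same invoice + same action always yields the same key,
    # so a replay after a partial failure cannot charge twice.
    return hashlib.sha256(f"{action}:{invoice_id}".encode()).hexdigest()

def charge_invoice(session: requests.Session, invoice: dict) -> dict:
    resp = session.post(
        "https://billing.example.com/v1/charges",          # hypothetical API
        json={"invoice_id": invoice["id"], "amount": invoice["amount_due"]},
        headers={"Idempotency-Key": idempotency_key(invoice["id"], "charge")},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```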
If neither pattern is available, you need a “did I already do this” check before every write. Slower and more code, but it is the only safe path for non-idempotent destinations.
Retries with backoff, not loops
Networks fail. APIs throttle. Servers restart. Roughly 1 to 3 percent of API calls in any moderately busy integration will fail for transient reasons. Your code has to handle this without human intervention.
The pattern. Exponential backoff with jitter. First retry after 1 second, then 2, then 4, up to a cap. Add a random jitter of 0 to 50 percent so retries do not all happen at the same instant. Stop after 5 to 8 attempts. Anything still failing after that is not transient and should escalate.
The trap. Retrying every error. A 401 is not transient; do not retry. A 422 validation error is not transient; do not retry. A 429 rate limit usually has a Retry-After header; honour it instead of guessing. A 5xx is usually transient. A connection timeout is usually transient. Categorise the failure before deciding to retry.
n8n’s built-in retry settings do this competently. For Python integrations we use tenacity. For Node we use p-retry. Whatever the language, do not write your own retry logic from scratch in 2026; the libraries handle the edge cases you will forget.
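As a rough illustration of both the pattern and the trap using tenacity: exponential backoff with jitter, capped attempts, and a predicate that only retries transient failures. The function and endpoint are placeholders:

```python
import requests
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential_jitter

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}

def is_transient(exc: BaseException) -> bool:
    """Retry network faults and 429/5xx responses; never retry 401 or 422."""
    if isinstance(exc, (requests.ConnectionError, requests.Timeout)):
        return True
    if isinstance(exc, requests.HTTPError) and exc.response is not None:
        return exc.response.status_code in TRANSIENT_STATUSES
    return False

@retry(
    retry=retry_if_exception(is_transient),
    wait=wait_exponential_jitter(initial=1, max=30),  # 1s, 2s, 4s... plus jitter
    stop=stop_after_attempt(6),
    reraise=True,                                     # escalate after the last attempt
)
def push_record(session: requests.Session, url: str, payload: dict) -> dict:
    resp = session.post(url, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()
```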
Rate limit handling that actually works
Every API has a rate limit. Some publish it clearly, some leave you to discover it. The integrations that work in production all assume the limit is closer than the documentation says.
Three habits.
Token-bucket throttling on the client side. Pace your outbound calls so you stay under the limit even at peak. Most modern HTTP client libraries support this; if not, write a small wrapper. The cost of throttling is patience. The cost of not throttling is being throttled at the wrong time.
Honour the Retry-After header religiously. When you do hit a 429, the response usually tells you how long to wait. Use that value. Do not guess.
Bulk operations where the API supports them. The classic example: 1,000 records updated as 1,000 single-record API calls is 1,000 chances to hit a rate limit. The same 1,000 records sent as 10 batches of 100 is 10 chances. Salesforce, HubSpot, NetSuite, Xero, Shopify all have bulk APIs. Use them.
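To make the first habit concrete, a minimal client-side token bucket; the numbers assume a hypothetical limit of 100 calls per minute and pace comfortably below it:

```python
import threading
import time

class TokenBucket:
    """Client-side throttle: refill `rate` tokens per second up to `capacity`,
    and block before each outbound call until a token is available."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                wait = (1 - self.tokens) / self.rate
            time.sleep(wait)

# ~90 calls/minute with a small burst allowance, under an assumed 100/minute limit.
bucket = TokenBucket(rate=1.5, capacity=5)

def throttled_get(session, url):
    bucket.acquire()  # wait for a token before every outbound call
    return session.get(url, timeout=10)
```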
Data contracts and schema drift
Schema drift is the silent killer of integrations. The vendor adds a “Region” field. Your code happily ignores it. Six months later, a business rule changes and the team needs to filter by region. Your integration has been dropping that field on the floor the entire time.
Three habits we hold to.
Validate inbound payloads against an explicit schema. Pydantic for Python, Zod for TypeScript, JSON Schema for almost anything else. The validation step is short and the alert when a payload no longer matches the schema is invaluable.
Pin to specific API versions where the vendor offers them. Salesforce, Stripe, Twilio, Xero all let you pin to a specific API version in headers. Pin it. Upgrade deliberately. The default of “follow latest” means the vendor controls when your integration breaks.
Capture the raw payload, not just the parsed version. We log the raw JSON of inbound webhooks for 30 days, separate from the parsed records. When something looks wrong, we can compare what we received against what we processed. This has caught silent schema changes more than once.
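A minimal sketch of the validation habit with Pydantic v2; the contact fields are illustrative, and the point is that a missing or retyped field fails loudly instead of flowing through:

```python
import logging
from datetime import datetime
from pydantic import BaseModel, ValidationError

log = logging.getLogger("integration")

class ContactPayload(BaseModel):
    # The explicit contract for the inbound webhook. Unknown extra fields are
    # ignored by default; a required field that disappears or changes type
    # raises a ValidationError instead of silently producing nonsense.
    id: str
    email: str
    full_name: str
    updated_at: datetime

def parse_webhook(raw: dict) -> ContactPayload | None:
    try:
        return ContactPayload.model_validate(raw)
    except ValidationError as exc:
        log.error("schema drift detected: %s", exc)  # alert on contract breakage
        return None
```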
Auth and credential management that does not bite
Auth failures are the most embarrassing kind of integration outage because they are 100 percent preventable and yet keep happening. The pattern that prevents them is operational, not architectural.
Centralise credentials. We use 1Password for client integrations, with shared vaults scoped per project. The encryption key for our hosted n8n instances lives in 1Password and nowhere else. Credentials never appear in code, in config files committed to git, or in chat messages.
Monitor expiry dates. OAuth refresh tokens that expire after 90 days of inactivity. Service account passwords with 12-month rotation policies. SSL certificates. Build a calendar with every credential’s expiry. We have a single Notion database with every client credential’s renewal date and an alert two weeks ahead.
Use service accounts, never personal accounts. The integration that runs under Mary’s login dies when Mary leaves. The integration that runs under a dedicated service account survives the org chart.
Test the renewal path. Renew the credential in a dev environment before the production expiry forces your hand. If you have never rehearsed the renewal, you do not actually know it works.
Observability is not optional
If you cannot answer “is my integration healthy right now?” in under 60 seconds, you do not have an integration. You have a liability.
The minimum we ship for any client integration:
- Structured logs with a correlation ID per logical transaction (the same ID flows from inbound webhook through every downstream call)
- A health-check dashboard showing volume, success rate, average latency, and error rate over the last 24 hours and 7 days
- An alert when error rate exceeds a threshold over a rolling window (not on individual errors; the noise will get ignored)
- An alert when volume drops to zero for longer than expected (the silent failure case)
- An alert when latency p95 spikes (often the first sign that an upstream system is degrading)
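For the first item on that list, a minimal sketch of structured JSON logs carrying one correlation ID through every downstream call; the names are illustrative:

```python
import json
import logging
import uuid
from contextvars import ContextVar

correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, always carrying the correlation ID."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "correlation_id": correlation_id.get(),
            "event": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("integration")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_webhook(payload: dict) -> None:
    # One ID per logical transaction, reused on every downstream call.
    correlation_id.set(payload.get("event_id") or str(uuid.uuid4()))
    log.info("webhook received")
    # ... downstream API calls log under the same correlation_id ...
    log.info("sync complete")
```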
Tools we use most: Grafana with Loki for logs, BetterStack for uptime, Sentry for error grouping, n8n’s native execution log for workflow-level visibility. None of this is novel. The discipline is in actually setting it up before something goes wrong.
Security non-negotiables
Every integration moves data, which means every integration is a potential leak. The list below is short because the basics are the basics for a reason.
TLS for everything in transit. No exceptions. If a vendor only supports HTTP, do not use them.
Encryption at rest for any persistent store with sensitive data. Postgres column-level encryption for PII. AWS KMS or equivalent for storage encryption.
Least privilege on service accounts. The Salesforce integration user that only needs Read on Account and CRUD on Contact gets exactly that. Not “API Enabled, System Administrator.” We have audited client setups and found 80 percent of integration accounts had wider permissions than the integration actually used.
For Australian clients with Australian Privacy Principles obligations, the data residency question matters. The integration may be invisible, but the data flowing through it is yours. Hosting integration components in ap-southeast-2 (Sydney) keeps the data on-shore. For health data with state-level legislation, this stops being a preference and starts being a requirement.
Audit logs that are tamper-resistant. Append-only stores, shipped to a separate account or vendor. The audit log lives outside the system being audited.
Testing in a world where you do not control the dependencies
Integration testing is harder than unit testing because the systems you depend on are real, slow, and not yours.
The pyramid we use:
Unit tests for transformation logic. Pure functions that take an input shape and produce an output shape. These should be fast and run on every commit.
Contract tests against recorded payloads. Capture real responses from each API once, replay them in tests. This catches breakage when your code changes. It does not catch vendor-side breakage.
Periodic smoke tests against the live API in a sandbox account. Run hourly or daily. These are slow and flaky and you should still have them. They are the only way to catch vendor changes before your users do.
End-to-end test in staging before any production deploy. A real round-trip through the integration with synthetic test data. Boring, slow, occasionally annoying, and the single most reliable predictor of production safety.
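As a sketch of the bottom two layers with pytest, assuming a hypothetical `to_crm_contact` transformation and a recorded fixture captured from a vendor sandbox:

```python
import json
from pathlib import Path

from my_integration.transform import to_crm_contact  # hypothetical module under test

def test_transform_maps_required_fields():
    """Unit test: pure transformation, no network, runs on every commit."""
    source = {"Id": "0031x00000ABC", "Email": "jane@example.com", "Name": "Jane Cook"}
    assert to_crm_contact(source) == {
        "external_id": "0031x00000ABC",
        "email": "jane@example.com",
        "full_name": "Jane Cook",
    }

def test_contract_recorded_payload_still_parses():
    """Contract test: replay a real payload recorded from the vendor sandbox."""
    recorded = json.loads(Path("tests/fixtures/salesforce_contact.json").read_text())
    result = to_crm_contact(recorded)
    assert result["external_id"]  # key fields survive the transformation
```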
Picking the right integration pattern
The big buckets we choose between:
Webhook-driven, near-real-time. Source system pushes events, your integration reacts. Best for low-latency requirements (a customer signs up, kick off onboarding within seconds). Worst for systems that do not support reliable webhooks or where you need to backfill historical data.
Polled, scheduled. Your integration asks the source system “what changed since I last asked?” every N minutes. Best for systems with reliable change-tracking endpoints (Salesforce, HubSpot, modern APIs with updated_at fields). Trades latency for simplicity.
Bulk batch. Daily or hourly extracts of large data sets. Best for analytics destinations and when the source system has poor incremental APIs. Hardest to make idempotent.
Event bus. Source publishes to a queue (SQS, Kafka, EventBridge), one or more consumers pick up. Best when multiple systems care about the same event or when you need durable delivery. Adds operational overhead; do not reach for it for simple two-party integrations.
iPaaS platforms (Workato, Mulesoft, Boomi, Zapier, Make, n8n cloud). Best for non-developer ownership and rapid prototyping. We use n8n self-hosted heavily for client work; the others have their place but the lock-in and per-task pricing of the SaaS variants compound quickly. Past about 50 active workflows or moderate transaction volume, self-hosted is almost always cheaper.
Custom code (Python, TypeScript). Best when the integration is core to the business and the team can support it. Worst when there is no engineering capacity to maintain it long-term.
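For the polled pattern above, a minimal sketch of the watermark loop; the endpoint, parameter names, and `upsert_contact` helper are hypothetical, and the watermark only advances after the batch is processed so a crash re-fetches rather than skips:

```python
import requests

def poll_changes(session: requests.Session, base_url: str, get_watermark, set_watermark) -> None:
    since = get_watermark()                           # last high-water mark, stored on your side
    resp = session.get(
        f"{base_url}/contacts",                       # hypothetical change-tracking endpoint
        params={"updated_after": since, "limit": 200},
        timeout=30,
    )
    resp.raise_for_status()
    records = resp.json()["results"]
    for record in records:
        upsert_contact(record)                        # idempotent write (see the upsert sketch above)
    if records:
        set_watermark(max(r["updated_at"] for r in records))
```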
When not to build the integration at all
Some integrations are not worth building. Three honest cases.
The volume does not justify it. If a process runs five times a month and a person can do it in three minutes, building an integration is more effort and more risk than the manual run. Wait until the volume changes.
The system on the other end is going to be replaced. We have built integrations into systems that the client then migrated away from six months later. Ask whether the system you are integrating with will still be there in two years. If the answer is “probably not,” delay or skip.
The vendor offers a native integration that is good enough. The Salesforce-HubSpot connector. The Xero-Stripe one. The Slack-Jira one. The native option is usually 80 percent of what you would build, free or near-free, and somebody else maintains it. Use it. Build custom only when the native fails to do something you actually need.
The temptation to build something because the platform supports it is strong. The discipline to leave it alone is the difference between an integration estate that is maintainable and one that is a graveyard.
Ready to build something that lasts?
If you are about to commit to an integration build and want a second opinion on the design, that is a half-day conversation we run with most new clients. We will ask the awkward questions about retries, secrets, observability, and rollback before any code gets written. Get in touch or book a call if that would be useful. For broader context on automation work, see our AI consulting page.
Frequently Asked Questions
What are the four types of system integration?
The classical taxonomy is point-to-point, hub-and-spoke (or ESB), API-led, and event-driven. In practice modern integrations are usually a mix: API-led between systems that have good APIs, event-driven for high-volume or multi-consumer flows, with a workflow tool like n8n acting as the hub for cross-system orchestration. Pick the pattern that matches the data flow, not the textbook label.
What is the most important system integration best practice?
Idempotency. If you can only adopt one practice, make every operation safely retryable. It eliminates an entire category of production failure (duplicates from partial replays) and makes every other operational decision easier. Almost every other practice on this list assumes idempotency is in place.
How much does a system integration cost in Australia?
Workflow-style integrations between modern SaaS systems on n8n or similar typically run $5,000 to $25,000 AUD to build with $200 to $1,000/month to operate. Custom integrations between bespoke or legacy systems run $30,000 to $150,000 AUD with significantly higher ongoing maintenance. iPaaS-platform builds (Workato, Mulesoft) carry the platform licence cost on top, often $20,000 to $200,000 AUD per year.
How should I handle integration failures in production?
Retry transient errors with exponential backoff and jitter, capped at 5 to 8 attempts. Surface non-transient errors (auth, validation) immediately to monitoring. Hold failed records in a dead-letter queue for human review rather than silently dropping them. Always log the input that caused the failure so the issue can be debugged after the fact.
How do I keep data consistent across integrated systems?
Designate one system as the source of truth for each entity. Every other system reads from that source via integration. Avoid bidirectional sync where both sides can edit unless you have a strict conflict-resolution rule (last-writer-wins is usually wrong). For high-stakes data, run a daily reconciliation job that compares record counts and key fields between systems and alerts on drift.
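A rough sketch of such a reconciliation check; the entity counts and the `notify_oncall` hook are placeholders for whatever your systems and alerting actually provide:

```python
def reconcile(source_counts: dict[str, int], dest_counts: dict[str, int], tolerance: int = 0) -> list[str]:
    """Compare per-entity record counts between the source of truth and the
    destination; return the entities that have drifted beyond tolerance."""
    drift = []
    for entity, expected in source_counts.items():
        actual = dest_counts.get(entity, 0)
        if abs(expected - actual) > tolerance:
            drift.append(f"{entity}: source={expected} dest={actual}")
    return drift

# Counts would be pulled from each system's API or database earlier in the job.
alerts = reconcile({"contacts": 14802, "invoices": 3911}, {"contacts": 14802, "invoices": 3905})
if alerts:
    notify_oncall("reconciliation drift: " + "; ".join(alerts))  # hypothetical alert hook
```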
How do I integrate with a legacy system that has no API?
Three options in order of preference. First, look for a database export or replication option (Postgres logical replication, SQL Server CDC, scheduled CSV exports). Second, screen-scrape with Playwright or similar if the system has a web UI; brittle but works. Third, build a custom adapter that wraps the legacy system’s native interface (file drops, message queues, COBOL stored procedures). Avoid the third unless absolutely necessary; the maintenance burden is high.
What should I monitor on a production integration?
At minimum: throughput (records per minute), success rate, p95 latency, and error rate by category. Alert on volume dropping to zero unexpectedly (silent failure) and on error rate crossing a threshold (degradation). Avoid alerting on every individual error; the noise will get filtered out and the real outage will be missed. Add a daily reconciliation alert if the integration is high-stakes.
Should I use an iPaaS platform or build custom?
iPaaS for non-developer ownership, rapid prototyping, and integrations with well-known SaaS systems. Custom for performance-critical paths, integrations with legacy or bespoke systems, or when the team can support the code long-term. We use n8n self-hosted as the default workflow layer and drop into Python or TypeScript for the parts that warrant it. Pure Workato or Mulesoft engagements make sense at enterprise scale where the platform pays back the licence cost.