Idempotency and Retry Strategies for Robust Interfaces

In distributed systems there is an uncomfortable truth: a response that never arrives is not the same as an action that was never performed. When an order sync between store and ERP breaks off after a timeout, the middleware initially does not know whether the order was created in the ERP or not. If it blindly retries the call, it risks a duplicate order. If it does not retry, it risks a lost order. According to the AWS Builders Library, most of these errors can be retried safely if the interface has two properties: idempotency on the receiving side and an intelligent retry strategy on the sending side (AWS Builders Library, 2024). This article shows how idempotency keys, exponential backoff with jitter, deduplication and dead letter queues work together to make order and inventory syncs safely repeatable -- and how a store integration implements these mechanisms in practice.

[Figure: Idempotency and Retry Strategies. The store sends the order plus key to the idempotency store (dedup check: key known?). The retry engine (backoff + jitter, max 5 attempts) calls the ERP / SAP target system. Safe retry with exponential backoff: attempt 1 after 1s, attempt 2 after 2s + jitter, attempt 3 after 4s + jitter, attempt 4 after 8s + jitter, attempt 5 after 16s + jitter. On success the key is marked as processed and later duplicates are ignored. After the maximum number of attempts the message goes to the dead letter queue for manual clarification.]

[Figure notes: At-least-once delivery -- every message arrives at least once, possibly several times; duplicates are allowed and removed by the receiver, and idempotency makes the receiver immune to repetition. Deduplication in the store -- the idempotency key is stored before the order goes to the ERP; if the same key arrives again, the result of the first call is returned, so no duplicate orders reach the ERP.]

Why every interface must expect retries

Between store and ERP lie networks, load balancers, queues and several applications. Each of these stations can delay a request, abort it or swallow a response. The Google SRE Book states it as a basic assumption: in a sufficiently large system, every possible failure eventually does occur (Google SRE Book, 2017). For an order interface this means concretely: it is not a question of whether a call will hang once, but how often, and how the integration responds to it.

The core problem is the ambiguity of a timeout. If the API layer sends an order to the ERP and receives no response, there are three possible realities: the request never reached the ERP, it was processed but the response was lost, or it is still being processed right now. From the sender's point of view, all three cases look identical. A naive retry treats them the same -- with potentially fatal consequences if the order was in fact already created.

According to the AWS Builders Library, 76 percent of errors in service-oriented architectures are transient in nature -- throttling, brief overload, network jitter (AWS Builders Library, 2024). It is precisely these errors that disappear on a retry. The task is therefore not to avoid retries, but to design them so that they cause no harm. This is exactly where idempotency and well-considered retry strategies come in, which we establish as standard in every integration project we build.

At-least-once vs. exactly-once: the honest answer

Many specifications demand 'exactly-once' delivery: every order should be processed exactly once, never lost, never duplicated. In practice, however, true exactly-once delivery across system boundaries is extraordinarily hard and expensive to achieve, because sender and receiver would have to coordinate every single processing step across the network. The more honest and more robust answer is a combination: at-least-once delivery plus idempotent processing together produce 'effectively-once' behaviour.

| Property | At-least-once | Exactly-once (idealized) |
| --- | --- | --- |
| Delivery guarantee | At least once, duplicates possible | Exactly once, no duplicates |
| Implementation effort | Moderate, well manageable | High, often only theoretically clean |
| Behaviour on timeout | Safely repeatable | Hard to guarantee |
| Requirement at the receiver | Idempotency / deduplication | Distributed transaction / coordination |
| Practical recommendation | Standard for store-ERP syncs | Rarely needed, high cost |

The reason for this recommendation is pragmatic: at-least-once is easy to implement with queues and retries, and duplicates are a solvable problem -- you remove them on the receiving side using an idempotency key. With event-streaming platforms, exactly-once semantics are achievable only within a controlled system, while across external system boundaries one typically relies on idempotent receivers (project experience). For the connection to SAP or another ERP this means: we do not guarantee that a message will never arrive twice -- we make sure that a message arriving twice has no effect.

The central rule of thumb

Make the transport persistent but tolerant of duplicates (at-least-once) and the receiver duplicate-resistant (idempotent). This combination is simpler, cheaper and in practice more reliable than trying to build a perfect exactly-once transport.

Idempotency keys: the key to safe retries

An operation is idempotent if executing it multiple times has the same effect as executing it once. Creating an order is inherently not idempotent -- without protection, every call produces new records. The idempotency key makes the operation idempotent: the sender generates a unique key for each logical operation and sends it unchanged with every attempt. The receiver stores the key together with the result and, on a second call with the same key, simply returns the stored result without executing the operation again.

Stripe has prominently documented this pattern for payment APIs: an idempotency key passed via a header ensures that a call repeated due to a network error does not lead to a second charge (Stripe Engineering, 2024). The same principle protects an order interface from duplicate orders. The crucial point is that the key is generated once on the sending side and remains stable across all retries -- a key regenerated on every attempt would defeat the protection.

In the middleware we derive the key from business-relevant, stable data. For orders the Shopware order number in combination with the operation type works well. For inventory notifications, the combination of item number and source timestamp. This way the same business operation maps to the same key across any number of technical retries, while a genuinely new order necessarily receives a new key.
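
The derivation described above can be sketched roughly as follows. This is a minimal illustration, not the production implementation: the field names `orderNumber`, `sku` and `sourceTimestamp`, the function names and the key format are assumptions chosen for this example, and the real names depend on the store and ERP payloads.

```javascript
// Sketch: derive idempotency keys from stable business data.
// Field names and the key format are illustrative assumptions.
function orderKey(order, operation = 'create') {
  if (!order.orderNumber) throw new Error('order number missing');
  return `order:${operation}:${order.orderNumber}`;
}

function stockKey(movement) {
  // The timestamp comes from the source system, not from the middleware
  // clock, so it stays identical across all technical retries.
  const ts = new Date(movement.sourceTimestamp).toISOString();
  return `stock:${movement.sku}:${ts}`;
}

const firstAttempt = orderKey({ orderNumber: 'SW-100245' });
const laterRetry = orderKey({ orderNumber: 'SW-100245' });
console.log(firstAttempt === laterRetry); // true
console.log(orderKey({ orderNumber: 'SW-100246' }) === firstAttempt); // false
```

Because the key depends only on business data, nothing has to be remembered on the sending side between attempts: any retry can recompute the same key from the message itself.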

idempotency-check.js
// Simplified idempotency check in the middleware
async function processOrder(order) {
  const key = idempotencyKey(order); // e.g. 'order:SW-100245'

  const known = await store.getResult(key);
  if (known) {
    // Key already processed -> return stored result
    return known.result; // no second ERP creation
  }

  // Reserve the key before the ERP call happens
  await store.reserve(key);
  const result = await erp.createOrder(order);

  await store.saveResult(key, result, { ttlDays: 30 });
  return result;
}

Deduplication: detecting duplicates before they cause harm

Deduplication is the receiving side of the idempotency coin. It answers the question: have I seen this business operation before? For this, the integration layer maintains a persistent idempotency store -- a table that records every processed key with its status, timestamp and processing result. If an already known key arrives again, the operation is not executed but recorded as a duplicate, and the original result is returned.

A subtle but important point is the deduplication time window. Keys must be stored long enough to cover realistic retries -- a sync that is retried hours later from a dead letter queue must still be recognized as a duplicate. At the same time, the store should not grow indefinitely. In practice, retaining keys for several weeks has proven effective, aligned with the maximum lifetime of a message in the system.

Unique key

Derived from stable business data (order number, item plus timestamp) -- identical across all retries.

Persistent store

Every key is stored with status and result. Survives restarts, so even delayed retries are recognized.

Harmless second hit

A key arriving again triggers no second processing -- the ERP sees the order only once.

Retention window

Keys are kept long enough to recognize late retries from the dead letter queue as duplicates.

Status reservation

The key is reserved before the ERP call starts -- parallel attempts do not collide with one another.

Audit link

Every key is linked to the correlation ID, so duplicates remain traceable in the audit log.

The order matters: the key is reserved before the actual ERP call takes place. If it were only stored after successful processing, a time window would arise in which two parallel attempts both create the order. The API interface therefore uses a two-stage booking: first reserve the key with status 'in progress', then update it to 'completed' along with the result. If a second attempt hits a key with status 'in progress', it waits briefly and then adopts the result of the first attempt.
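
A minimal sketch of this two-stage booking follows. It uses an in-memory `Map` as a stand-in for the persistent idempotency store, which is an assumption for illustration only: a real store would need a unique constraint or a conditional write so that the reservation is atomic across processes. The names `reserve`, `processOnce`, `pollMs` and `maxPolls` are hypothetical.

```javascript
// Sketch: two-stage booking. The Map stands in for the persistent store.
const entries = new Map();
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Stage 1: reserve the key. Returns false if the key is already known.
function reserve(key) {
  if (entries.has(key)) return false;
  entries.set(key, { status: 'in_progress' });
  return true;
}

async function processOnce(key, action, pollMs = 10, maxPolls = 100) {
  if (reserve(key)) {
    try {
      const result = await action();
      // Stage 2: mark as completed together with the result
      entries.set(key, { status: 'completed', result });
      return result;
    } catch (error) {
      // Assumption: the failure is known to have had no effect in the ERP.
      // After an ambiguous timeout the reservation should be kept instead.
      entries.delete(key);
      throw error;
    }
  }
  // Key already reserved: wait briefly and adopt the first attempt's result
  for (let i = 0; i < maxPolls; i++) {
    const entry = entries.get(key);
    if (!entry) throw new Error(`first attempt for ${key} failed`);
    if (entry.status === 'completed') return entry.result;
    await sleep(pollMs);
  }
  throw new Error(`key ${key} is still in progress`);
}
```

Two parallel calls of `processOnce` with the same key run the action only once in this sketch; the second caller polls until the first has stored its result.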

Exponential backoff with jitter: spreading out retries

When a call fails and is retried immediately, it often hits the same problem -- an overloaded system stays overloaded. Worse still: if many senders retry simultaneously in the same rhythm, a synchronized surge arises that delays recovery. Exponential backoff solves the first problem by increasing the wait time after each failure (roughly 1, 2, 4, 8, 16 seconds). Jitter -- a random offset on the wait time -- solves the second problem by distributing the retries over time.

The AWS Builders Library has examined this interplay in detail and explicitly recommends backoff with jitter, because pure exponential backoff without a random component can still lead to load spikes (AWS Builders Library, 2024). The Google SRE Book adds that retries must generally be capped and budgeted across the entire call chain, so that retries do not themselves create an overload (Google SRE Book, 2017).
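
One way to picture such a budget is a shared counter per target system that allows retries only as a fraction of regular traffic. The sketch below is an assumption-laden illustration: the class name `RetryBudget`, the 10 percent share and the fixed allowance are chosen for this example, and a real budget would typically work on a sliding time window rather than on totals.

```javascript
// Sketch of a retry budget shared by all callers of one target system.
// Assumption: retries may amount to at most `percent` of regular requests,
// plus a small fixed allowance. Integer arithmetic avoids rounding issues.
class RetryBudget {
  constructor(percent = 10, allowance = 3) {
    this.percent = percent;
    this.allowance = allowance;
    this.requests = 0;
    this.retries = 0;
  }

  // Call once for every regular (first) request
  recordRequest() {
    this.requests += 1;
  }

  // Returns true and consumes budget if one more retry is allowed
  tryRetry() {
    const limit = this.allowance * 100 + this.requests * this.percent;
    if (this.retries * 100 >= limit) return false;
    this.retries += 1;
    return true;
  }
}

const budget = new RetryBudget(10, 0);
for (let i = 0; i < 20; i++) budget.recordRequest();
console.log(budget.tryRetry(), budget.tryRetry(), budget.tryRetry()); // true true false
```

When `tryRetry()` returns false, the operation would go to the dead letter queue instead of adding further load to a target system that is already struggling.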

backoff-jitter.js
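// Exponential backoff with full jitter
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Minimal classification (assumption): only throttling (429) and
// unavailability (503) count as transient here
const isTransient = (error) => error.status === 429 || error.status === 503;

function waitMs(attempt, base = 1000, max = 30000) {
  // Upper bound doubles per attempt (1s, 2s, 4s, 8s, 16s), capped at max
  const exp = Math.min(max, base * 2 ** attempt);
  // Full jitter: uniformly distributed between 0 and exp
  return Math.floor(Math.random() * exp);
}

async function withRetry(action, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await action();
    } catch (error) {
      if (!isTransient(error) || attempt === maxAttempts - 1) {
        throw error; // permanent or budget exhausted -> DLQ
      }
      await sleep(waitMs(attempt));
    }
  }
}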

Before retrying at all, the integration must distinguish between transient and permanent errors. An HTTP 429 (throttling) or 503 (service unavailable) is a clear candidate for a retry -- ideally honouring an existing 'Retry-After' header. An HTTP 400 or 422 (business validation error), on the other hand, does not improve with retries and goes straight to manual clarification. We describe this classification in detail in the sister article on interfaces in store integration, which shows when push-based and when pull-based transfer makes sense.
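
The classification can be sketched as follows. The error shape is an assumption for this example: `status` holds the HTTP status, `code` a Node.js network error code, and `headers` a plain object with lower-case header names. The function names and the list of network codes treated as transient are likewise illustrative, not a fixed rule.

```javascript
// Sketch: classify failures and honour Retry-After (assumed error shape).
const TRANSIENT_STATUS = new Set([429, 503]);
const TRANSIENT_CODES = new Set(['ECONNRESET', 'ETIMEDOUT', 'ECONNREFUSED']);

function isTransient(error) {
  // With an HTTP response, the status decides; 400 and 422 are permanent
  if (error.status !== undefined) return TRANSIENT_STATUS.has(error.status);
  // Without a response, only known network failures count as transient
  return TRANSIENT_CODES.has(error.code);
}

// Retry-After carries either a number of seconds or an HTTP date
function retryAfterMs(headers = {}, now = Date.now()) {
  const value = headers['retry-after'];
  if (value === undefined) return null;
  const text = String(value).trim();
  if (/^\d+$/.test(text)) return Number(text) * 1000;
  const date = Date.parse(text);
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

// Prefer the server's hint, otherwise fall back to the computed backoff
function nextWaitMs(error, backoffMs) {
  const hinted = retryAfterMs(error.headers);
  return hinted === null ? backoffMs : hinted;
}

console.log(isTransient({ status: 503 })); // true
console.log(isTransient({ status: 422 })); // false
console.log(nextWaitMs({ status: 429, headers: { 'retry-after': '7' } }, 1200)); // 7000
```

Anything that is not positively identified as transient is treated as permanent here, which errs on the side of sending an operation to clarification rather than retrying it blindly.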

Dead letter queues: the safety net after exhausted attempts

Even the best retry strategy ends at some point. If a call still fails after the maximum number of attempts, or is recognized as permanently faulty from the outset, the message must not simply be discarded. The dead letter queue (DLQ) catches precisely these messages. It stores the complete operation with context: original content, error reason, number of attempts, timestamp and correlation ID. This way no order and no inventory notification is lost, even if the target system is disrupted for an extended period.

The DLQ is far more than a trash bin. It is a deliberately curated queue for cases that require human attention or a corrected retry. In every integration solution we build, it is made accessible via a dashboard: a specialist sees every stranded operation, can correct the data and feed the operation back into the processing pipeline. Since the retry again carries the same idempotency key, even this late retry is protected against duplicates.

  • Complete context: Every DLQ entry contains the original message, error cause, attempt counter and correlation ID for seamless traceability.
  • Controlled resumption: Corrected operations are re-fed with an unchanged idempotency key -- deduplication continues to prevent duplicates.
  • Pattern recognition: Clusters of similar entries in the DLQ point to systematic problems, such as a changed ERP data model.
  • Alerting: Every new DLQ entry or a rapid growth of the queue triggers a notification to the operations team.
  • Retention: DLQ entries are kept in an audit-proof manner until they are resolved -- nothing is silently deleted.
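
The points above can be sketched as a small data structure plus a resume function. This is an illustration under stated assumptions: an array stands in for a durable queue, and the names `toDeadLetter`, `resume` and all entry fields are hypothetical.

```javascript
// Sketch: a DLQ entry with full context and a controlled resumption.
// The array stands in for a durable queue (assumption for illustration).
const deadLetters = [];

function toDeadLetter(message, error, attempts) {
  const entry = {
    idempotencyKey: message.idempotencyKey,
    correlationId: message.correlationId,
    payload: message.payload,
    errorReason: error.message,
    attempts,
    failedAt: new Date().toISOString(),
    status: 'open',
  };
  deadLetters.push(entry);
  return entry;
}

// Re-feed a corrected operation. Key and correlation ID stay unchanged,
// so deduplication downstream still recognizes the business operation.
async function resume(entry, correctedPayload, handler) {
  const result = await handler({
    idempotencyKey: entry.idempotencyKey,
    correlationId: entry.correlationId,
    payload: correctedPayload,
  });
  // Resolved entries are marked, not deleted, to keep the audit trail
  entry.status = 'resolved';
  entry.resolvedAt = new Date().toISOString();
  return result;
}
```

Here `handler` would be the regular processing pipeline, so a resumed operation passes through the same idempotency check as any other message.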

In our projects, the interplay of capped retries and a downstream DLQ achieves a processing rate of around 99.9 percent for temporarily failed operations (project experience), while the few remaining cases land safely in the DLQ instead of being lost. This architecture is standard in every store-ERP connection we implement.

Making the order sync safely repeatable

The order sync is the most critical data flow between store and ERP, because real money and real deliveries are at stake here. A duplicated order leads to double picking, double invoicing and frustrated customers. A lost order leads to a missing delivery. Both errors are expensive. The combination of a stable idempotency key, at-least-once transport and idempotent ERP creation closes both risks: an order is retried as often as needed until it has safely arrived, but never created twice.

In practice this means: on receiving the order, the integration layer generates the key from the order number, persists the message in a queue and attempts the ERP creation. If the call fails temporarily, backoff with jitter kicks in. If an identical order message arrives again during these attempts -- for example because the store resends the webhook -- deduplication recognizes the key and discards the duplicate. Only when the ERP returns an order number is the operation considered complete and the key marked as final.

Sales angle

We set up this order sync architecture as a fixed part of every integration -- not as an optional extra at a surcharge, but as self-evident baseline quality. We retrofit existing connections that still operate without idempotency during live operation, without taking the store offline.

Inventory sync: retries without phantom stock

The inventory sync hides a different trap. Stock changes are often transmitted as deltas -- 'minus 3 units' instead of 'new level: 17 units'. A repeated delta is dangerous: if 'minus 3' is accidentally processed twice, the stock drops by 6 instead of 3, and phantom stock levels arise that do not match reality. Here the value of idempotency shows particularly clearly, because delta operations are inherently not idempotent.

Two proven approaches defuse the problem. First: transmit absolute instead of relative values where possible -- an absolute target level ('level: 17') is inherently idempotent, because a retry sets the same level. Second: where deltas are unavoidable, an idempotency key per movement in combination with deduplication protects against double application. When using a middleware or iPaaS, the same holds for synchronization patterns: idempotent upsert operations are preferable to pure delta processing when the transfer may be unreliable (project experience).
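
The difference between the two approaches shows up in a few lines. The sketch below uses in-memory structures and hypothetical names (`setLevel`, `applyDelta`, the item number `A-100`) purely for illustration.

```javascript
// Sketch: absolute levels are idempotent by nature, deltas need a key.
const stock = new Map([['A-100', 20]]);
const appliedMovements = new Set();

// Absolute value: repeating the call sets the same level again
function setLevel(sku, level) {
  stock.set(sku, level);
}

// Delta: applied only if this movement key has not been seen before
function applyDelta(sku, delta, movementKey) {
  if (appliedMovements.has(movementKey)) return false; // duplicate
  appliedMovements.add(movementKey);
  stock.set(sku, (stock.get(sku) ?? 0) + delta);
  return true;
}

setLevel('A-100', 17);
setLevel('A-100', 17); // retry, no effect on the value
console.log(stock.get('A-100')); // 17

const key = 'stock:A-100:2024-05-01T10:00:00.000Z';
applyDelta('A-100', -3, key);
applyDelta('A-100', -3, key); // retry, recognized as duplicate
console.log(stock.get('A-100')); // 14
```

One caveat applies to absolute values: they are idempotent but still order-sensitive, so a delayed retry carrying an older level could overwrite a newer one unless the receiver also compares a version or source timestamp.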

Absolute values beat deltas

Whenever the ERP can supply the absolute target stock level, we transmit that instead of a relative change. An absolute stock level can be set any number of times without distorting the value -- the simplest form of idempotency, entirely without an additional key.

Observability: making retries measurable

A retry strategy that nobody observes can derail unnoticed. When the number of retries quietly rises, that is an early warning sign of a disrupted target system -- long before the first operations land in the DLQ. We therefore link every operation to a correlation ID and log each attempt in a structured way: timestamp, error class, wait time, attempt number. This makes the entire path of an order traceable across all retries.

Gartner points out that a lack of observability of integration flows is one of the most common causes of long error resolution times, and recommends end-to-end monitoring of interface health as a fixed component of modern integration platforms (Gartner, 2024). Concretely, we monitor the retry rate per target system, the average number of attempts until success, the depth of the dead letter queue and the hit rate of deduplication. A sudden rise in duplicate hits can, for example, indicate that the store is sending webhooks multiple times -- a clue that would remain hidden without measurement.
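
A minimal sketch of such instrumentation follows. It writes one structured log line per attempt and keeps simple in-memory counters per target system. All names (`logAttempt`, `retryRate`, `attemptsPerSuccess`, the log fields) are assumptions for this example, and a real setup would export the counters to a monitoring system instead of holding them in memory.

```javascript
// Sketch: structured attempt log plus per-target counters (in memory).
const metricsByTarget = new Map();

function metricsFor(target) {
  if (!metricsByTarget.has(target)) {
    metricsByTarget.set(target, { attempts: 0, retries: 0, successes: 0 });
  }
  return metricsByTarget.get(target);
}

// errorClass === null means the attempt succeeded
function logAttempt({ correlationId, target, attempt, errorClass = null, waitMs = 0 }) {
  const m = metricsFor(target);
  m.attempts += 1;
  if (attempt > 1) m.retries += 1;
  if (errorClass === null) m.successes += 1;
  const line = JSON.stringify({
    timestamp: new Date().toISOString(),
    correlationId, target, attempt, errorClass, waitMs,
  });
  console.log(line);
  return line;
}

// Share of all attempts that were retries
function retryRate(target) {
  const m = metricsFor(target);
  return m.attempts === 0 ? 0 : m.retries / m.attempts;
}

// Rough average of attempts needed per successful operation
function attemptsPerSuccess(target) {
  const m = metricsFor(target);
  return m.successes === 0 ? 0 : m.attempts / m.successes;
}
```

Because every line carries the correlation ID, filtering the log by that ID reconstructs the full path of one order across all its attempts.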

Hope is not a strategy. Do not rely on a call arriving -- build systems that expect the opposite and remain correct anyway.

Paraphrased from the Google SRE Book (2017)

Retrofitting into existing integrations

Many existing interfaces were built without idempotency and appear to work without problems in normal operation -- until the first network error produces a duplicate order. The good news: idempotency and retry strategies can be retrofitted into running store integrations without interrupting operations. We proceed in clearly delineated steps, each of which delivers a measurable resilience gain.

  1. Analysis (about 1 week): Map existing data flows, identify non-idempotent operations, define key derivation per operation type.
  2. Introduce idempotency store (1--2 weeks): Create a persistent key table, implement two-stage reservation, hook deduplication into processing.
  3. Retry with backoff and jitter (1 week): Add error classification, configure wait-time strategy and attempt budget, honour 'Retry-After'.
  4. Dead letter queue (1 week): Set up the DLQ with complete context, provide dashboard and resumption, connect alerting.
  5. Observability and testing (1 week): Thread correlation IDs through, instrument retry metrics, perform fault injection to verify repeatability.
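
The fault injection in step 5 can be sketched with a fake target system and a wrapper that simulates the ambiguous timeout: the order is created, but the response is lost. Everything here is a hypothetical test harness (`createFakeErp`, `loseFirstResponse`, `retrying`), and the fake ERP is assumed to deduplicate on the idempotency key.

```javascript
// Sketch: verify repeatability by losing the first response on purpose.
function createFakeErp() {
  const ordersByKey = new Map();
  let created = 0;
  return {
    // Idempotent creation: a known key returns the stored order
    async createOrder(key, order) {
      if (!ordersByKey.has(key)) {
        created += 1;
        ordersByKey.set(key, { erpOrderNumber: `ERP-${created}`, ...order });
      }
      return ordersByKey.get(key);
    },
    createdCount: () => created,
  };
}

// Fault injection: the call is executed, but its first response is dropped
function loseFirstResponse(fn) {
  let lost = false;
  return async (...args) => {
    const result = await fn(...args);
    if (!lost) {
      lost = true;
      const error = new Error('simulated timeout: response lost');
      error.code = 'ETIMEDOUT';
      throw error;
    }
    return result;
  };
}

// Plain retry loop without waiting, sufficient for the test
async function retrying(action, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try { return await action(); } catch (error) { lastError = error; }
  }
  throw lastError;
}

async function runFaultInjection() {
  const erp = createFakeErp();
  const flaky = loseFirstResponse((key, order) => erp.createOrder(key, order));
  const result = await retrying(() => flaky('order:create:SW-100245', { total: 99 }));
  return { created: erp.createdCount(), erpOrderNumber: result.erpOrderNumber };
}
```

In this sketch `runFaultInjection()` resolves with one created order although the call was executed twice, which is the property the retrofit is meant to establish.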

Across more than 50 implemented integration projects (project experience), it has become clear that retrofitting idempotency typically takes a few weeks and quickly pays for itself through eliminated manual corrections. If you want to know how your specific store-ERP connection can be secured, we are happy to discuss the effort involved in your case.

This article is based on data from: AWS Builders Library (2024), Google SRE Book (2017), Stripe Engineering (2024), Gartner Integration Best Practices (2024) and our own project experience.