Error Handling in Interfaces: Retry, DLQ and Monitoring
No interface is free from errors. Network interruptions, API timeouts, invalid data, temporary system outages and capacity bottlenecks are part of daily operations in every ERP-store integration. The critical question is not whether errors occur, but how the integration handles them. In our project experience, a central logging and monitoring system allows 73 percent of all synchronization problems to be resolved before they impact business processes. This article describes the four pillars of professional error handling in store integrations (retry logic, dead letter queues, circuit breakers and proactive monitoring) and shows how the middleware implements these mechanisms in practice.
Error Classes: Transient vs. Permanent Problems
The first step to effective error handling is error classification. Not every error requires the same response. Transient errors are temporary and resolve themselves: network timeouts, overloaded APIs (HTTP 429, 503), brief database outages. These errors can be resolved through automatic retry.
Permanent errors, on the other hand, do not improve with retries: invalid data (missing required field, wrong data type), missing references (non-existent customer in ERP), business logic violations (negative prices, impossible quantities). These errors require human correction and must be secured in a dead letter queue.
The middleware must automatically distinguish between these error classes. HTTP status codes provide initial guidance: 4xx errors (client errors) generally indicate permanent problems, with rate limiting (HTTP 429) as the notable exception, and 5xx errors (server errors) generally indicate transient ones. According to an AWS analysis, 68 percent of all API errors are transient in nature and can be resolved through automatic retry (AWS Architecture Center, 2025).
Transient Errors (Retryable)
Network timeouts, HTTP 429/503, overloaded APIs, brief DB outages. Automatic retry with exponential backoff resolves 68 percent (AWS, 2025) of all API errors.
Permanent Errors (Clarification Required)
Invalid data, missing references, business logic violations, HTTP 400/422. Require human correction, secured in the dead letter queue.
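The classification above can be sketched as a small lookup. This is a minimal illustration, not the middleware's actual code: the names `ErrorClass`, `RETRYABLE_STATUS` and `classify_http_status` are hypothetical, and the exact set of retryable status codes is an assumption that should be adjusted per target system.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"  # retry automatically
    PERMANENT = "permanent"  # needs human correction, goes to the DLQ


# Assumption: status codes treated as retryable even though some are 4xx.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def classify_http_status(status: int) -> ErrorClass:
    """Map the HTTP status of a failed call to an error class."""
    if status in RETRYABLE_STATUS:
        return ErrorClass.TRANSIENT
    if 400 <= status < 500:
        return ErrorClass.PERMANENT  # client errors: a retry repeats the error
    if status >= 500:
        return ErrorClass.TRANSIENT  # server errors: usually temporary
    raise ValueError(f"status {status} is not an error")
```

Status codes are only a first hint. A real integration would typically also inspect the response body, for example to recognize a validation error that the target system reports with a 500.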
Retry Logic with Exponential Backoff
Retry logic is the first line of defense against transient errors. When an API call fails, it is automatically retried after a wait time. The art lies in correct configuration: overly aggressive retries (too frequent, too fast) can further burden an already overloaded system. Overly conservative retries (too infrequent, too slow) unnecessarily delay processing.
The most proven strategy is exponential backoff with jitter: the wait time between attempts is multiplied by a fixed factor after each failure (with a factor of four: 1 second, 4 seconds, 16 seconds, 64 seconds, 256 seconds), and a random offset (jitter) prevents many simultaneously failed calls from retrying at the exact same moment. After a configurable maximum number of attempts (typically five), the message is moved to the dead letter queue.
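As a rough sketch, the wait times can be generated as follows. The function name `backoff_delays` and its parameters are illustrative assumptions; the defaults reproduce the 1, 4, 16, 64, 256 second schedule from the text, and the jitter width of 25 percent is an arbitrary example value.

```python
import random


def backoff_delays(base=1.0, factor=4.0, max_attempts=5, jitter=0.25, rng=random):
    """Yield the wait time in seconds before each retry attempt."""
    for attempt in range(max_attempts):
        delay = base * factor ** attempt
        # Jitter: shift the delay by up to +/- `jitter` of its value so that
        # many failed calls do not retry at the same moment.
        yield delay * (1 + rng.uniform(-jitter, jitter))


# Without jitter the schedule is exactly 1, 4, 16, 64, 256 seconds.
print(list(backoff_delays(jitter=0)))
```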
The middleware implements retry logic based on error classification: HTTP 429 (rate limit) is retried with the wait time specified in the response header. HTTP 500/502/503 is retried with exponential backoff. HTTP 400/422 (validation error) is immediately moved to the dead letter queue since a retry would produce the same error. This differentiated handling achieves a processing rate of 99.9 percent (project experience) for transient errors.
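The differentiated handling might look like the following sketch. Everything here is hypothetical: `send` stands for any callable that returns a status code and response headers, `dead_letter` for a callable that stores the message, and the `Retry-After` header is assumed to carry a number of seconds (it can also be an HTTP date, which this sketch does not handle). Treating every 4xx other than 429 as permanent is likewise an assumption.

```python
import time


def send_with_retry(send, message, dead_letter, max_attempts=5, sleep=time.sleep):
    """Send a message, retrying transient failures; return True on success."""
    for attempt in range(max_attempts):
        status, headers = send(message)
        if status < 400:
            return True
        if 400 <= status < 500 and status != 429:
            # Validation-type errors: a retry would fail the same way.
            dead_letter(message, f"HTTP {status}")
            return False
        if status == 429 and "Retry-After" in headers:
            wait = float(headers["Retry-After"])  # wait as told by the server
        else:
            wait = 4.0 ** attempt  # exponential backoff: 1, 4, 16, ... seconds
        if attempt < max_attempts - 1:
            sleep(wait)
    dead_letter(message, "retries exhausted")
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Jitter is omitted here for brevity.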
Dead Letter Queues: Securing Unprocessable Messages
The dead letter queue (DLQ) is the safety net of error handling. Messages that could not be processed after all retry attempts or were identified by validation as permanently faulty are secured in the DLQ. No message is lost -- every failed order, every invalid item master record and every faulty price import is stored completely with context (timestamp, error reason, source system, message content).
The DLQ serves multiple functions: it prevents data loss (even during severe errors, no information is lost), it enables subsequent processing (corrected data can be re-fed into the processing pipeline from the DLQ), and it provides valuable diagnostic data (recurring error patterns in the DLQ indicate systematic problems).
In the middleware, the DLQ is accessible via a dashboard that provides the administrator with an overview of all unprocessed messages. For each message, the error reason, number of retry attempts and complete message content are displayed. The administrator can correct messages and release them for reprocessing or mark them as resolved.
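To make the stored context concrete, a DLQ entry could be modeled roughly as below. This is an in-memory sketch under stated assumptions: the class and field names are invented for illustration, and a production DLQ would use a persistent queue or database table rather than a Python list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DeadLetter:
    payload: dict        # complete message content
    error_reason: str
    source_system: str
    attempts: int        # number of retry attempts made
    failed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    resolved: bool = False


class DeadLetterQueue:
    def __init__(self):
        self._entries = []

    def put(self, entry: DeadLetter) -> None:
        self._entries.append(entry)

    def open_entries(self):
        """Entries still waiting for correction, as a dashboard would list them."""
        return [e for e in self._entries if not e.resolved]

    def release(self, entry: DeadLetter, corrected_payload: dict, reprocess) -> None:
        """Feed corrected data back into the pipeline, then mark as resolved."""
        reprocess(corrected_payload)
        entry.resolved = True
```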
Circuit Breaker: Protecting Systems from Cascade Failures
The circuit breaker pattern protects the integration from cascade failures. When a target system -- for example the SAP API -- repeatedly returns errors, a naive retry strategy could worsen the problem: hundreds of parallel retry attempts additionally burden the already disrupted system and delay recovery.
The circuit breaker functions like a protective switch with three states. Closed (normal): messages are processed normally and the error rate is monitored. Open (disrupted): when the error rate exceeds a threshold (for example 50 percent of the last 20 calls), the circuit breaker trips. All further calls to the disrupted system are immediately rejected and returned to the queue without burdening the system. Half-Open (test phase): after a configurable wait time (for example 60 seconds), the circuit breaker allows a single test call through. If the call succeeds, the state returns to Closed; if it fails, the breaker returns to Open and the wait time starts again.
According to a study by Netflix, which popularized the circuit breaker as an architectural pattern, the pattern reduces recovery time after system disruptions by an average of 70 percent (Netflix Technology Blog, 2024), because the disrupted system is not further burdened by additional requests.
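The three states can be sketched as a small state machine. This is a simplified, single-threaded illustration with hypothetical names; the defaults use the example values from the text (last 20 calls, 50 percent, 60 seconds), and the choice to evaluate the error rate only once the window is full is an assumption made to avoid tripping on the very first failure.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window=20, threshold=0.5, open_seconds=60.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # outcomes of the most recent calls
        self.threshold = threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call to the target system may be attempted now."""
        if self.state == "open" and self.clock() - self.opened_at >= self.open_seconds:
            self.state = "half_open"  # let exactly one test call through
            return True
        return self.state == "closed"

    def record(self, success: bool) -> None:
        """Report the outcome of a call that was allowed."""
        if self.state == "half_open":
            self._close() if success else self._trip()
            return
        self.results.append(success)
        full = len(self.results) == self.results.maxlen
        if full and self.results.count(False) / len(self.results) > self.threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()

    def _close(self) -> None:
        self.state = "closed"
        self.results.clear()
```

A caller checks `allow()` before each request, returns the message to the queue when it is `False`, and reports the outcome with `record()` otherwise.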
Idempotency: Preventing Duplicate Processing
When a message is automatically retried, there is a risk of duplicate processing: the first attempt may have been successful, but the confirmation was lost. The retry attempt processes the message a second time -- and in the worst case, an order is created twice in the ERP.
The solution is idempotency: every message receives a unique ID (idempotency key). Before processing, the middleware checks whether a message with this ID has already been successfully processed. If yes, the message is marked as 'already processed' and ignored. If no, it is processed normally and the ID is stored as processed.
In practice, the idempotency key is derived from business-relevant data: for orders the Shopware order number, for item master data the combination of item number and change timestamp, for prices the combination of item number, price list and price value. This approach ensures that genuine updates are processed while identical messages are detected and deduplicated.
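A minimal sketch of this check, assuming the key is a hash over the business fields named above. The function and class names are hypothetical, and the in-memory `set` stands in for what would need to be a persistent store in a real middleware.

```python
import hashlib


def idempotency_key(message_type: str, *parts) -> str:
    """Derive a stable key from business-relevant fields."""
    raw = "|".join([message_type, *map(str, parts)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class IdempotentProcessor:
    def __init__(self, handler):
        self.handler = handler
        self.processed = set()  # assumption: persistent storage in production

    def process(self, key: str, message: dict) -> str:
        if key in self.processed:
            return "already processed"  # duplicate: ignore
        self.handler(message)
        self.processed.add(key)  # stored only after successful processing
        return "processed"


# Example keys: order number for orders; item, price list and value for prices.
order_key = idempotency_key("order", "SW-10042")
price_key = idempotency_key("price", "ITEM-7", "retail", "19.99")
```

Because the price value is part of the key, a genuine price change produces a new key and is processed, while a resent identical message is recognized as a duplicate.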
Correlation IDs: End-to-End Traceability of Data Flows
When an order makes its way from the store through the middleware to the ERP and on to DATEV, it passes through multiple processing steps and system boundaries. Without end-to-end traceability, it is virtually impossible to determine where in the processing chain a problem occurred.
The middleware generates a unique correlation ID for every incoming operation, which is passed to all subsequent processing steps and API calls. In case of error, the administrator can trace the entire processing path in the audit log using this ID: when was the order received? Was it successfully validated? To which system was it forwarded? Where did the error occur?
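The idea can be illustrated as follows. The header name `X-Correlation-ID` is a common convention but an assumption here, as are the `AuditTrail` class and the step names; the article does not specify how the middleware transports or stores the ID.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumption: a common convention


def new_correlation_id() -> str:
    """Generate a unique ID for one incoming operation."""
    return str(uuid.uuid4())


def outgoing_headers(correlation_id: str) -> dict:
    """Headers to attach to every downstream API call for this operation."""
    return {CORRELATION_HEADER: correlation_id}


class AuditTrail:
    def __init__(self):
        self.entries = []

    def record(self, correlation_id: str, step: str, outcome: str) -> None:
        self.entries.append(
            {"correlation_id": correlation_id, "step": step, "outcome": outcome}
        )

    def trace(self, correlation_id: str) -> list:
        """Return all steps of one operation in the order they were recorded."""
        return [e for e in self.entries if e["correlation_id"] == correlation_id]
```

With this in place, the questions above (received, validated, forwarded, failed where?) reduce to reading the result of `trace()` for the affected order.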
Proactive Monitoring: Detecting Problems Before They Escalate
Reactive error handling (responding to errors when they occur) is necessary but insufficient. Proactive monitoring detects problems through trends and anomalies before they impact business processes. The middleware continuously monitors several metrics.
- Error rate: Proportion of failed processing to total count. An increase from 0.1 to 2 percent within an hour indicates a systematic problem.
- Queue depth: Number of messages waiting for processing. Rising queue depth at constant processing rate indicates capacity bottlenecks.
- Processing latency: Time from entry queue to successful processing. Rising latencies may indicate performance problems in the target system.
- API response time: Response time of connected systems (SAP, Shopware, DATEV). A gradual increase warns of impending timeouts.
- DLQ depth: Number of messages in the dead letter queue. Every new DLQ message generates a notification.
- Circuit breaker status: Open circuit breakers signal disrupted target systems and require immediate attention.
Monitoring is visualized via a central dashboard showing real-time metrics and historical trends. Automatic alerts are triggered via configurable thresholds and sent to the operations team via email or messaging. In our project experience, companies with proactive monitoring invest 60 percent less in resolving integration problems than companies with a purely reactive approach.
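A threshold check over the metrics listed above could be sketched like this. The metric names and threshold values are illustrative assumptions, not the middleware's configuration; a DLQ threshold of zero reflects the rule that every new DLQ message generates a notification.

```python
def error_rate(failed: int, total: int) -> float:
    """Proportion of failed processing runs to the total count."""
    return failed / total if total else 0.0


def breached(metrics: dict, thresholds: dict) -> list:
    """Return the names of all metrics whose value exceeds its threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    )


# Hypothetical thresholds; real values depend on the normal load profile.
THRESHOLDS = {"error_rate": 0.01, "queue_depth": 500, "dlq_depth": 0}

current = {"error_rate": error_rate(20, 1000), "queue_depth": 120, "dlq_depth": 0}
print(breached(current, THRESHOLDS))  # the 2 percent error rate exceeds 1 percent
```

Static thresholds catch level breaches only. Detecting trends, such as a queue depth that keeps rising at a constant processing rate, would additionally require comparing successive samples.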
Audit Logging: Complete Documentation of All Operations
A complete audit log documents every processing step in the middleware: message receipt (timestamp, source system, message type), validation result (passed/rejected, error details), transformation result (source and target format), transfer to target system (HTTP status, response time, response body) and retry attempts (timestamp, error reason, attempt number).
Structured Logging in Practice
The audit log serves multiple purposes: error analysis (what went wrong, when and why?), compliance evidence (seamless documentation of all data movements for GoBD), performance analysis (where do bottlenecks arise?) and dispute resolution (in case of discrepancies between systems, all steps can be traced). Log retention duration is configured in alignment with regulatory requirements (typically 90 days for operational logs, ten years for accounting-relevant operations).
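One way to make such entries machine-searchable is to write each processing step as a single JSON line. The sketch below is an assumption about the format, with hypothetical field names; the article does not state which log format the middleware uses.

```python
import json
from datetime import datetime, timezone


def audit_entry(correlation_id: str, step: str, **details) -> str:
    """Serialize one processing step as a single JSON log line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "correlation_id": correlation_id,
        "step": step,
        **details,
    }
    return json.dumps(entry, sort_keys=True)


# Example: a transfer to the target system with status and response time.
line = audit_entry(
    "example-correlation-id", "transfer",
    target_system="ERP", http_status=503, response_ms=1840, attempt=2,
)
print(line)
```

Keeping one JSON object per line lets standard log tooling filter by correlation ID, step or status without custom parsing.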
Implementation: Retrofitting Error Handling in Existing Integrations
Professional error handling can also be retrofitted in existing integrations. When a store integration is already in production but has no or insufficient error handling, the middleware can be gradually extended with retry logic, dead letter queues and monitoring -- without interrupting ongoing operations.
- Error analysis (1 week): Document existing error patterns, identify error classes, define retry strategies per error type.
- Implement retry and DLQ (1--2 weeks): Extend middleware with retry logic and dead letter queue, introduce idempotency mechanism.
- Set up circuit breaker (1 week): Configure thresholds, implement half-open test logic, integrate with alerting.
- Build monitoring (1 week): Set up dashboard, define metrics, configure alert thresholds, establish notification channels.
- Test and optimize (1 week): Conduct error simulations, fine-tune retry intervals, adjust monitoring thresholds.
Error Notifications: Informing the Right Person at the Right Time
A monitoring system is only as good as its notification logic. Too many alerts lead to alert fatigue -- the team ignores notifications because there are too many false positives. Too few alerts leave critical problems undiscovered. The middleware implements a multi-level escalation logic that links problem severity with notification urgency.
Informational notices (for example a slightly elevated error rate) are displayed in the dashboard but not actively reported. Warnings (for example queue depth above normal) generate an email to the operations team. Critical alerts (for example open circuit breaker, rapidly growing DLQ) trigger immediate notifications via multiple channels. According to an industry study, well-configured escalation logic reduces mean response time to integration problems by 65 percent (State of Digital Operations, 2024).
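The three levels described above map naturally onto a severity-to-channel table. The sketch below is illustrative: the channel names and the `Severity` enum are assumptions, and a real setup would usually add deduplication and on-call schedules.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1      # e.g. slightly elevated error rate
    WARNING = 2   # e.g. queue depth above normal
    CRITICAL = 3  # e.g. open circuit breaker, rapidly growing DLQ


# Hypothetical channel names; each level includes the dashboard.
CHANNELS = {
    Severity.INFO: ["dashboard"],
    Severity.WARNING: ["dashboard", "email"],
    Severity.CRITICAL: ["dashboard", "email", "messaging"],
}


def channels_for(severity: Severity) -> list:
    """Return the notification channels for a given severity level."""
    return CHANNELS[severity]
```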
Graceful Degradation: Maintaining Operations Despite System Disruption
Professional error handling means not only detecting and fixing errors but also maintaining business operations despite a system disruption. The concept of graceful degradation describes a system's ability to continue functioning in a limited capacity during partial failures, rather than failing completely.
Concretely for store integration: when the ERP system is unreachable, customers can still place orders in the store. Orders are buffered in the queue and automatically processed after recovery. Prices are served from the cache (possibly with slight delay for updates). Inventory displays are based on the last known state. The store remains fully usable while the middleware waits in the background for ERP recovery.
Another example of graceful degradation is handling partial failures in batch operations. When an inventory update is performed for 100 items and three fail, the 97 successful updates are processed normally while the three failures are placed in the retry queue. According to an AWS analysis, systems with implemented graceful degradation achieve 99.95 percent availability from the customer perspective, even when backend systems offer only 99.5 percent availability (AWS Architecture Center, 2025).
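The batch example can be sketched as follows: each item is updated independently, so a failure affects only that item. The function name `apply_batch` and the `update` callable are hypothetical placeholders for the actual inventory update.

```python
def apply_batch(items, update):
    """Apply `update` to each item; collect failures instead of aborting."""
    succeeded, retry_queue = [], []
    for item in items:
        try:
            update(item)
            succeeded.append(item)
        except Exception as exc:  # broad on purpose: any failure is isolated
            retry_queue.append((item, str(exc)))
    return succeeded, retry_queue


def flaky_update(item_no: int) -> None:
    # Simulated target system that rejects three of the items.
    if item_no in (17, 42, 99):
        raise TimeoutError(f"update for item {item_no} timed out")


done, to_retry = apply_batch(range(100), flaky_update)
print(len(done), len(to_retry))  # 97 3
```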
Error Simulation: Systematically Testing Resilience
The best error handling is useless if not tested. Error simulations (also known as chaos engineering) systematically verify whether the integration functions correctly under error conditions. Typical test scenarios include: network interruption between middleware and ERP (is the queue correctly populated?), API timeout under high load (does retry logic engage?), invalid data from source system (is validation triggered?) and complete target system failure (does the circuit breaker activate?).
The middleware offers a test mode where errors can be deliberately injected without affecting production operations. Regular error simulations -- at least once per quarter -- ensure that error handling works reliably even as system landscapes evolve. According to a Gremlin study, companies that regularly conduct error simulations reduce their unplanned downtime by 45 percent (Gremlin, 2024).
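Error injection of this kind can be approximated with a thin wrapper around the call to the target system. The `FaultInjector` class is a hypothetical illustration and not the middleware's test mode; it simply raises a connection error for a configurable share of calls.

```python
import random


class FaultInjector:
    """Wrap a callable and make a share of its calls fail on purpose."""

    def __init__(self, target, failure_rate: float, rng=None):
        self.target = target
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.target(*args, **kwargs)


def send_to_erp(order: dict) -> str:
    # Stand-in for the real ERP call.
    return f"accepted {order['id']}"


# Simulate a complete target system failure in a test run.
broken_erp = FaultInjector(send_to_erp, failure_rate=1.0)
```

Running the retry, DLQ and circuit breaker logic against such a wrapper shows whether each mechanism engages as intended, without touching the production system.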
Sources and Studies