Delay in webhook message (Webhook API)

Incident Report for The Coupon Bureau (Prod)

Postmortem

Incident Date: September 4, 2025
Reported By: P&G on September 6, 2025 at 9:43 AM IST (resolved at 4:40 PM IST, ~7 hours)
System: The Coupon Bureau (TCB) Platform
Component: AWS Sharing API Lambda for DLQ (Dead Letter Queue) Processing

Summary

On September 4, 2025, the TCB platform experienced increased Lambda concurrency and a growing message backlog. The issue was triggered during DLQ processing: unhandled exceptions caused the DLQ-processing Lambda to exit abnormally. Because the messages were never acknowledged, they reappeared in the queue and were retried multiple times. This retry loop pushed concurrency close to the reserved limit (100), delaying processing across the system.
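The effect of a retry loop on concurrency can be estimated with the usual rule of thumb that steady-state Lambda concurrency is roughly the invocation rate multiplied by the average duration. The sketch below is illustrative only: the function names and every number in it are hypothetical and are not measurements from this incident.

```python
def estimated_concurrency(invocations_per_second: float, avg_duration_seconds: float) -> float:
    """Steady-state concurrency is approximately invocation rate x average duration."""
    return invocations_per_second * avg_duration_seconds


def worst_case_deliveries(message_count: int, max_receive_count: int) -> int:
    """Total deliveries if every message fails on every attempt up to the retry limit."""
    return message_count * max_receive_count


# Hypothetical figures, chosen only to illustrate the amplification.
baseline = estimated_concurrency(invocations_per_second=5.0, avg_duration_seconds=2.0)

# If redelivered messages multiply the invocation rate (here by an assumed factor of 8),
# concurrency grows by the same factor while the average duration stays unchanged.
with_retries = estimated_concurrency(invocations_per_second=5.0 * 8, avg_duration_seconds=2.0)

print(baseline, with_retries)
```

With a fixed reserved limit, any such multiplication of the invocation rate leaves less headroom for real-time traffic.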

Impact

  • Lambda Concurrency: Reached close to the reserved limit of 100 (baseline ≈10).
  • Message Processing: DLQ messages were retried repeatedly, which inflated the invocation count and delayed real-time workloads.
  • Customer Impact: Webhook processing delays and slower downstream processing.

Root Cause

  • DLQ-processing Lambda threw unhandled exceptions during execution.
  • Messages were not acknowledged, so they became visible in the queue again after the visibility timeout expired.
  • SQS retried messages up to the configured retry limit, amplifying concurrency usage.
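One common way to keep a single bad message from failing a whole SQS batch is to catch exceptions per record and return a partial batch response. The sketch below is not the fix that was deployed; it assumes the Lambda is driven by an SQS event source mapping with `ReportBatchItemFailures` enabled, and `process_message` and the `coupon_id` field are hypothetical placeholders for the real business logic.

```python
import json
import logging

logger = logging.getLogger(__name__)


def process_message(body: dict) -> None:
    """Hypothetical business logic; raises when the payload is unusable."""
    if "coupon_id" not in body:
        raise ValueError("missing coupon_id")


def handler(event: dict, context: object) -> dict:
    """Process an SQS batch and report only the failed records back to SQS."""
    failures = []
    for record in event.get("Records", []):
        try:
            process_message(json.loads(record["body"]))
        except Exception:
            # Log and record the failure instead of letting the invocation crash,
            # so successfully processed messages in the batch are not redelivered.
            logger.exception("failed to process message %s", record["messageId"])
            failures.append({"itemIdentifier": record["messageId"]})
    # Records listed here are retried by SQS; all others are deleted from the queue.
    return {"batchItemFailures": failures}
```

Failed records are still redelivered up to the configured retry limit, but the function exits normally and healthy messages are acknowledged, which limits how much a retry loop can amplify invocations.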

Corrective Actions Taken

  • Incident reviewed; root cause traced to exception handling in DLQ Lambda.
  • Manual monitoring enabled for the DLQ.
  • Fixed the code that was causing the exception.
  • Moved the Lambda to reserved concurrency.
  • Added monitoring of event age.
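The manual monitoring above could later be backed by an alarm. The sketch below builds the parameters for a CloudWatch alarm on the SQS metric `ApproximateAgeOfOldestMessage`, on the assumption that "event age" refers to the age of the oldest message in the queue. The queue name, alarm name, threshold, and helper function are hypothetical.

```python
def build_queue_age_alarm(queue_name: str, max_age_seconds: int) -> dict:
    """Build keyword arguments for CloudWatch's put_metric_alarm (nothing is created here)."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,                # evaluate the metric once per minute
        "EvaluationPeriods": 5,      # alarm after five consecutive breaching minutes
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Hypothetical queue name and threshold (15 minutes).
params = build_queue_age_alarm("example-dlq", max_age_seconds=900)
```

With boto3 installed and credentials configured, the alarm could be created with `boto3.client("cloudwatch").put_metric_alarm(**params)`, and a reserved concurrency limit could be set with `boto3.client("lambda").put_function_concurrency(FunctionName=..., ReservedConcurrentExecutions=...)`; the function name and limit are left unspecified here.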
Posted Sep 14, 2025 - 16:36 UTC

Resolved

Coupons were appearing in the C24 wallet approximately three hours after being clipped. The issue was linked to the coupon-sharing process from the PG website to the C24 app via the sharing API.
Posted Sep 06, 2025 - 08:30 UTC