Error redeeming paper coupon

Incident Report for The Coupon Bureau (Prod)

Postmortem

Bug Analysis Report: 8112 Paper Coupon Redemption Issue 

Issue Summary 

Over the last 24 hours, we investigated an issue affecting the redemption of 8112 paper coupons. The problem was isolated to the paper coupon redemption path and did not impact digital coupons. 

Root Cause 

The issue originated during a recent bulk deposit operation in which 8112 paper coupons (around 400k per MOF) were uploaded sequentially using our bulk uploader. Each serialized data string was written to four separate data stores to ensure high availability and low-latency API performance: 

  • Primary Database: MongoDB 

  • Real-time Processing: DynamoDB 

  • Reporting Database: AWS OpenSearch 

  • Cache Database: AWS ElastiCache 

However, because the deposits were sequential and the serialized data strings shared the same base GS1 prefix, all writes were directed to the same partition in DynamoDB. This created a hot partition, and DynamoDB rejected approximately 30–40% of the insert operations. Importantly, all other databases were populated successfully, including the reporting database, which is why the issue initially went undetected. 
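
The report does not describe the DynamoDB key schema, so the following is only a sketch under one assumption: that the partition key was derived from the shared GS1 prefix. In that case, a common mitigation is write sharding, where a suffix computed from the serial is appended to the partition key so that sequential deposits spread across several partitions. The names `sharded_partition_key` and `NUM_SHARDS`, and the placeholder prefix, are hypothetical and do not come from the incident.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 16  # hypothetical shard count, chosen for illustration only


def sharded_partition_key(base_prefix: str, serial: str,
                          num_shards: int = NUM_SHARDS) -> str:
    """Build a partition key that spreads items sharing one prefix over shards.

    The shard is derived from a hash of the serial, so the same serial always
    maps to the same key and sequential serials land on different shards.
    """
    digest = hashlib.sha256(serial.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % num_shards
    return f"{base_prefix}#{shard:02d}"


# Simulate a sequential bulk deposit in which every item shares one prefix.
prefix = "8112EXAMPLE"  # placeholder, not a real GS1 data string
keys = [sharded_partition_key(prefix, f"{n:010d}") for n in range(10_000)]
distribution = Counter(keys)
```

Because the suffix is computed from the serial itself, the redemption path could recompute the same key at read time without a lookup table.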

Why It Was Undetected 

Post-deposit, we validated success through the reporting database, which had complete records. Because this data source showed no discrepancies, we did not initially identify the missing entries in DynamoDB. The deposit had therefore propagated only partially across our system, which specifically impacted coupon redemption because it relies heavily on DynamoDB for real-time reads. 
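
A check against a single store cannot reveal a gap in another one. As a minimal sketch of what a per-store comparison could look like, the function below takes the list of deposited IDs and one existence check per store, and reports which IDs each store is missing. The store names mirror the four listed above, but the in-memory sets and the `find_missing` helper are illustrative stand-ins, not our actual tooling.

```python
from typing import Callable, Dict, Iterable, List


def find_missing(expected_ids: Iterable[str],
                 stores: Dict[str, Callable[[str], bool]]) -> Dict[str, List[str]]:
    """Return, per store, the expected IDs that the store does not contain.

    Stores with no missing IDs are omitted from the result.
    """
    expected = list(expected_ids)
    missing: Dict[str, List[str]] = {}
    for name, exists in stores.items():
        absent = [cid for cid in expected if not exists(cid)]
        if absent:
            missing[name] = absent
    return missing


# Demo: three stores are complete, one silently dropped 30% of the writes.
deposited = [f"coupon-{n}" for n in range(10)]
mongodb = set(deposited)
opensearch = set(deposited)
elasticache = set(deposited)
dynamodb = set(deposited[:7])

report = find_missing(deposited, {
    "mongodb": mongodb.__contains__,
    "dynamodb": dynamodb.__contains__,
    "opensearch": opensearch.__contains__,
    "elasticache": elasticache.__contains__,
})
```

In this demo, `report` names only the store that is incomplete, which is the signal a reporting-only check would have missed.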

Impact 

  • Affected redemption of paper coupons from June 7 to June 10 

  • Digital coupons remained unaffected 

  • Redemption failures returned HTTP 400 errors 

  • Our 400-error alarms had been temporarily disabled due to frequent false alarms from CVS duplicate redemption calls, allowing the issue to go unnoticed 
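
Disabling an alarm entirely hides real failures along with the noise. One alternative, sketched here under assumptions, is to classify each HTTP 400 and count only those that do not match a known false-positive pattern. The `ErrorEvent` fields, the `COUPON_NOT_FOUND` code, and the threshold are hypothetical; only the CVS `NO_REDEMPTION` pattern is taken from this report.

```python
from dataclasses import dataclass
from typing import Iterable

# Known false-positive pattern named in this report: CVS duplicate redemption
# calls that surface as NO_REDEMPTION errors.
KNOWN_NOISE = {("CVS", "NO_REDEMPTION")}


@dataclass(frozen=True)
class ErrorEvent:
    status: int       # HTTP status code returned to the caller
    error_code: str   # application-level error code (hypothetical field)
    caller: str       # retailer that made the call (hypothetical field)


def is_alarm_worthy(event: ErrorEvent) -> bool:
    """True for HTTP 400 responses that are not a known false positive."""
    if event.status != 400:
        return False
    return (event.caller, event.error_code) not in KNOWN_NOISE


def should_alarm(events: Iterable[ErrorEvent], threshold: int) -> bool:
    """Fire when the number of alarm-worthy events reaches the threshold."""
    return sum(1 for e in events if is_alarm_worthy(e)) >= threshold


# Demo: noisy duplicates alone stay quiet, real failures still fire.
noise = [ErrorEvent(400, "NO_REDEMPTION", "CVS")] * 50
real = [ErrorEvent(400, "COUPON_NOT_FOUND", "OtherRetailer")] * 5
```

With this split, the 50 duplicate-redemption events alone do not trip the alarm, while the 5 other failures do when the threshold is 5.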

Immediate Fix 

Once the issue was reported by Sigma, we deployed a hotfix within 40 minutes. The fix allowed the system to fall back to MongoDB (primary DB) when an 81121-series coupon was not found in DynamoDB. This ensured continuity at the cost of a slightly increased response time (up to 300ms), which was acceptable given the low redemption volume. 
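
The fallback logic can be sketched roughly as follows. This is an illustration of the read-through pattern described above, not the deployed hotfix: the `find_coupon` function, the lookup callables, and the sample IDs are hypothetical, and plain dictionaries stand in for DynamoDB and MongoDB.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[dict]]


def find_coupon(coupon_id: str, fast_lookup: Lookup, primary_lookup: Lookup,
                fallback_prefix: str = "81121") -> Optional[dict]:
    """Read from the fast store first, then fall back to the primary store.

    The fallback is limited to IDs starting with `fallback_prefix`, so other
    coupons keep their original not-found behavior.
    """
    record = fast_lookup(coupon_id)
    if record is not None:
        return record
    if coupon_id.startswith(fallback_prefix):
        return primary_lookup(coupon_id)  # slower path, used only on a miss
    return None


# Demo with dictionaries standing in for the two stores.
fast_store = {"811210001": {"id": "811210001", "source": "fast"}}
primary_store = {
    "811210001": {"id": "811210001", "source": "primary"},
    "811210002": {"id": "811210002", "source": "primary"},
    "811200003": {"id": "811200003", "source": "primary"},
}

hit = find_coupon("811210001", fast_store.get, primary_store.get)
recovered = find_coupon("811210002", fast_store.get, primary_store.get)
not_covered = find_coupon("811200003", fast_store.get, primary_store.get)
```

Only a miss in the fast store pays for the second read, which is consistent with the added latency applying to the affected coupons rather than to every redemption.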

Permanent Resolution 

  • Optimized Bulk Upload Process to reduce write contention in DynamoDB 

  • Implemented Retry Mechanisms for DynamoDB insert failures 

  • Introduced Post-Deposit Validation across all four databases to confirm full propagation 

  • New Alarm Strategy: Introduced refined alarms to exclude known false positives (e.g., CVS NO_REDEMPTION errors) while accurately flagging other HTTP 400 issues 
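
As an illustration of the retry item above, here is a minimal sketch of retrying a rejected write with exponential backoff and full jitter. It makes no claim about how our uploader is implemented: `ThrottledError`, `write_with_retry`, and the delay values are hypothetical, and the `write` callable stands in for a single DynamoDB insert.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Stand-in for a write rejected by the data store due to throttling."""


def write_with_retry(write: Callable[[], T], max_attempts: int = 5,
                     base_delay: float = 0.05, max_delay: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `write`, retrying throttled attempts with backoff and jitter.

    The delay cap doubles after each failure up to `max_delay`, and the
    actual wait is drawn uniformly from [0, cap] so parallel writers do not
    retry in lockstep. The last failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write()
        except ThrottledError:
            if attempt == max_attempts:
                raise
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")


# Demo: a write that is rejected twice and then accepted.
calls = {"count": 0}
sleeps: List[float] = []


def flaky_write() -> str:
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ThrottledError("simulated throttling")
    return "ok"


result = write_with_retry(flaky_write, sleep=sleeps.append)
```

Re-raising after the final attempt matters here: a write that still fails should be surfaced to the uploader instead of being dropped silently.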

Follow-up Actions 

  • Verified all 4M+ deposited coupons 

  • Fixed all missing entries in DynamoDB 

  • Confirmed redemption flow is now functioning normally 

  • Identified a minor UI bug where the redemption count is not visually updating on the portal (data is syncing correctly in the backend) 

Conclusion 

This incident has provided critical learnings, as it occurred during our first large-scale execution of 8112 paper coupons. We’ve taken comprehensive measures to ensure this does not recur and have enhanced observability across our redemption infrastructure. 

We will continue close monitoring of the paper coupon flow and further harden the system with enhanced validations and alerting mechanisms.

Posted Jun 13, 2025 - 18:35 UTC

Resolved

The incident has been resolved. The issue was caused by a bug in the print coupon deposit flow that prevented a limited number of coupons from being stored in the cache. We’ve identified the missing entries and successfully pushed them to the cache.
Posted Jun 11, 2025 - 19:41 UTC

Monitoring

A fix has been implemented and we are monitoring the result.
Posted Jun 11, 2025 - 19:38 UTC

Identified

The issue has been identified, and the fix is in progress.
Posted Jun 11, 2025 - 19:35 UTC

Investigating

Some of our paper coupons are currently encountering errors during redemption. Our team is actively investigating the issue.
Posted Jun 11, 2025 - 19:17 UTC
This incident affected: Retailer API.