Bug Analysis Report: 8112 Paper Coupon Redemption Issue
Issue Summary
Over the last 24 hours, we investigated an issue affecting the redemption of 8112 paper coupons. The problem was isolated to the paper coupon redemption path and did not impact digital coupons.
Root Cause
The issue originated during a recent bulk deposit in which 8112 paper coupons (around 400k per MOF) were uploaded sequentially through our bulk uploader. Each serialized data string was written to four separate data stores to provide high availability and low-latency API performance:
Primary Database: MongoDB
Real-time Processing: DynamoDB
Reporting Database: AWS OpenSearch
Cache Database: AWS ElastiCache
However, because the deposits were sequential and the serialized data strings shared the same base GS1 prefix, all writes were directed to the same DynamoDB partition. The resulting hot partition caused DynamoDB to reject approximately 30–40% of the insert operations. The other three data stores, including the reporting database, were populated successfully, which is why the issue initially went undetected.
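To illustrate the hot-partition effect, here is a minimal sketch. It assumes the partition key was derived from the shared GS1 prefix; the report does not state the actual key schema. The serial values, `prefix_len`, and `NUM_SHARDS` are made up for the example. The second function shows one common mitigation, a shard suffix calculated from the full serial, and is not necessarily the optimization that was deployed.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 16  # hypothetical shard count, chosen only for illustration


def prefix_key(serial: str, prefix_len: int = 10) -> str:
    """Partition key built only from the shared prefix (the assumed problem case)."""
    return serial[:prefix_len]


def sharded_key(serial: str, prefix_len: int = 10, num_shards: int = NUM_SHARDS) -> str:
    """Partition key with a suffix calculated from a hash of the full serial.

    The suffix is deterministic, so a reader holding the serial can
    recompute the same key at redemption time.
    """
    digest = hashlib.sha256(serial.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % num_shards
    return f"{serial[:prefix_len]}#{shard}"


# Made-up sequential serials that all share the same leading characters.
serials = [f"8112100123456{n:06d}" for n in range(10_000)]

hot = Counter(prefix_key(s) for s in serials)
spread = Counter(sharded_key(s) for s in serials)

print("distinct keys, prefix only:", len(hot))
print("distinct keys, with shard suffix:", len(spread))
```

With the prefix-only key, every write in the batch targets a single partition key. With the calculated suffix, the same batch is distributed over up to `NUM_SHARDS` keys.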
Why It Was Undetected
After the deposit, we validated success against the reporting database only, which held complete records. Because that source showed no discrepancies, we did not identify the missing entries in DynamoDB. The data was therefore only partially propagated across our system, and the gap specifically affected coupon redemption, which relies heavily on DynamoDB for real-time reads.
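A check that compares every store against the deposited set, rather than a single store, would surface this kind of gap. The sketch below uses in-memory sets as stand-ins for the four data stores; the store names, the `find_missing` helper, and the serial values are hypothetical, and a real check would query each store through its own client.

```python
from typing import Dict, Iterable, Set


def find_missing(deposited: Iterable[str], stores: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per store, the deposited serials that the store does not contain."""
    expected = set(deposited)
    gaps: Dict[str, Set[str]] = {}
    for name, present in stores.items():
        missing = expected - present
        if missing:
            gaps[name] = missing
    return gaps


# Made-up data: one store received only part of the deposit.
deposited = [f"SERIAL-{n}" for n in range(10)]
stores = {
    "mongodb": set(deposited),
    "dynamodb": set(deposited[:7]),
    "opensearch": set(deposited),
    "elasticache": set(deposited),
}

gaps = find_missing(deposited, stores)
for name, missing in gaps.items():
    print(f"{name}: {len(missing)} of {len(deposited)} serials missing")
```

An empty result means every store holds the full deposit. Any non-empty entry names the store that needs a backfill.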
Impact
Affected redemption of paper coupons between June 7–10
Digital coupons remained unaffected
Redemption failures returned HTTP 400 errors
Our HTTP 400 alarms had been temporarily disabled because of frequent false positives from CVS duplicate redemption calls, which allowed the failures to go unnoticed
Immediate Fix
Once Sigma reported the issue, we deployed a hotfix within 40 minutes. The fix lets the redemption path fall back to MongoDB (the primary database) when an 81121-series coupon is not found in DynamoDB. This restored redemption at the cost of slightly higher response times (up to 300 ms), which was acceptable given the low redemption volume.
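The fallback read can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the deployed hotfix: the `find_coupon` function, the `Lookup` type, and the dictionary-backed stores are hypothetical stand-ins for the real DynamoDB and MongoDB clients.

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[dict]]

FALLBACK_PREFIX = "81121"  # series named in the report; used here as a plain string prefix


def find_coupon(serial: str, fast_lookup: Lookup, primary_lookup: Lookup) -> Optional[dict]:
    """Read from the fast store first and consult the primary store only on a miss."""
    record = fast_lookup(serial)
    if record is None and serial.startswith(FALLBACK_PREFIX):
        # Miss in the real-time store: fall back to the slower primary database.
        record = primary_lookup(serial)
    return record


# In-memory stand-ins: the second serial never reached the fast store.
fast_store = {"811210001": {"status": "active"}}
primary_store = {
    "811210001": {"status": "active"},
    "811210002": {"status": "active"},
}

print(find_coupon("811210002", fast_store.get, primary_store.get))
```

Because the primary store is queried only after a miss, coupons that are present in the fast store keep their normal latency, and only the missing ones pay the extra round trip.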
Permanent Resolution
Optimized Bulk Upload Process to reduce write contention in DynamoDB
Implemented Retry Mechanisms for DynamoDB insert failures
Introduced Post-Deposit Validation across all four databases to confirm full propagation
New Alarm Strategy: Introduced refined alarms to exclude known false positives (e.g., CVS NO_REDEMPTION errors) while accurately flagging other HTTP 400 issues
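The retry mechanism listed above could look roughly like the following sketch, which uses exponential backoff with full jitter. The `ThrottledError` class, the `write_with_retry` helper, and the default attempt and delay values are hypothetical; a real implementation would catch the specific error raised by the DynamoDB client in use.

```python
import random
import time
from typing import Any, Callable


class ThrottledError(Exception):
    """Stand-in for a throttling or rejection error from the data store."""


def write_with_retry(
    write: Callable[[Any], None],
    item: Any,
    max_attempts: int = 5,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call write(item), retrying on ThrottledError; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            write(item)
            return attempt
        except ThrottledError:
            if attempt == max_attempts:
                # Out of attempts: surface the failure so the item can be
                # recorded and repaired instead of being silently dropped.
                raise
            # Full jitter: wait a random time up to the exponential ceiling.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

Re-raising after the last attempt matters here: an insert that still fails has to be visible to the uploader, otherwise the deposit would again appear complete while one store is missing records.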
Follow-up Actions
Verified all 4M+ deposited coupons
Fixed all missing entries in DynamoDB
Confirmed redemption flow is now functioning normally
Identified a minor UI bug where the redemption count does not update visually on the portal (the data is syncing correctly in the backend)
Conclusion
This incident provided critical lessons, as it marked our first large-scale execution of 8112 paper coupons. We have taken comprehensive measures to prevent a recurrence and have improved observability across our redemption infrastructure.
We will continue to monitor the paper coupon flow closely and further harden the system with additional validations and alerting.