Backup and Recovery Strategy for an Online Game Database System
This is a comprehensive backup and recovery strategy for an online game database system, which needs to handle player data, game states, transactions, and maintain high availability.
Example Scenario:
- Online multiplayer game with 100,000+ active players
- Player data includes accounts, characters, inventories, and achievements
- In-game transactions (purchases, trades, item acquisitions)
- 24/7 operation with players worldwide
- Database size: 2TB
- Monthly data growth: ~50GB
- Maximum acceptable data loss: 5 minutes
- Maximum acceptable downtime: 30 minutes
Backup Strategy:
- Primary Backups (On-Premise):
- Full Database Backup
- Every Sunday at 2 AM UTC (lowest player activity)
- Stored on high-speed SAN storage
- Retained for 4 weeks
- Compressed and encrypted
- Differential Backups
- Daily at 1 AM UTC
- Captures changes since last full backup
- Stored on local SSD storage for quick recovery
- Retained until next full backup
- Transaction Log Backups
- Every 5 minutes
- Critical for minimal data loss
- Stored on separate physical storage
- Retained for 1 week
- Secondary Backups (Off-site):
- Full Database Backup Replication
- Copied to cloud storage within 6 hours
- Encrypted during transit and at rest
- Geo-redundant storage across multiple regions
- Retained for 3 months
- Full Database Backup Replication
- Real-time Protection:
- Always-On Availability Groups
- Synchronous replica in the same datacenter
- Asynchronous replica in different geographical location
- Automatic failover capability
- Used for read-scaling during peak times
- Always-On Availability Groups
Recovery Procedures:
- Minor Issues (Corrupt Data/Accidental Changes):
-- Example point-in-time recovery
RESTORE DATABASE GameDB
FROM DISK = 'E:\Backup\GameDB_Full.bak'
WITH NORECOVERY;
RESTORE DATABASE GameDB
FROM DISK = 'E:\Backup\GameDB_Diff.bak'
WITH NORECOVERY;
RESTORE LOG GameDB
FROM DISK = 'E:\Backup\GameDB_Log.trn'
WITH RECOVERY,
STOPAT = '2024-12-11 14:30:00';
- Major Failure (Primary Server Down):
- Automatic failover to synchronous replica
- RTO: Less than 1 minute
- RPO: Zero data loss
- Disaster Recovery (Data Center Loss):
- Failover to asynchronous replica
- RTO: Less than 15 minutes
- RPO: Potential loss of last few transactions
Monitoring and Maintenance:
- Automated Health Checks
- Backup success/failure monitoring
- Replica synchronization status
- Storage space monitoring
- Backup file integrity checks
- Regular Testing
- Monthly restore tests of full backup
- Quarterly disaster recovery drills
- Validation of restored data integrity
- Performance Optimization
- Regular index maintenance
- Statistics updates
- Backup compression optimization
- Storage cleanup procedures
Example Recovery Time Estimates:
Scenario 1: Single table corruption
- Recovery time: 10-15 minutes
- Uses point-in-time recovery
- Minimal impact on players
Scenario 2: Complete server failure
- Recovery time: <1 minute
- Automatic failover to replica
- Players experience brief disconnection
Scenario 3: Data center disaster
- Recovery time: 10-15 minutes
- Manual failover to DR site
- Some data loss possible
Documentation:
- Maintain detailed recovery procedures
- Keep configuration scripts in version control
- Document all backup schedules and retention policies
- Regular updates to emergency contact list
Special Considerations for Game Data:
- Player Economy Protection
- Transaction log backup frequency increased during special events
- Additional verification steps for high-value item transactions
- Separate backup chain for economy-related tables
- Character Data Safety
- More frequent backups of character progress tables
- Hash verification of character data integrity
- Special restore procedures for character rollbacks