Monitoring & Alerting
Off-chain monitoring plan and alert system
Overview
Our project relies on a backend infrastructure that includes a Kubernetes cluster and managed databases (PostgreSQL and Redis) hosted on Google Cloud. The backend is a critical component: it triggers liquidations and acts as a price oracle, among other functions. A comprehensive monitoring and alerting system is therefore required to keep it reliable and performant.
Monitoring Plan
Exception and Error Monitoring -> Sentry
Sentry monitors exceptions and errors in the codebase. It tracks errors in real time so that issues can be identified and resolved quickly, and it is configured to capture and report errors, exceptions, and performance bottlenecks in our application code.
Integrate Sentry SDK into the codebase to capture exceptions and errors.
Configure Sentry to group and prioritize issues for efficient troubleshooting.
Utilize Sentry's alerting system to notify the team of critical issues.
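The Sentry steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual configuration: the exception names in FINGERPRINTS, the "priority" tag, the SENTRY_DSN and APP_ENV environment variables, and the 10% trace sample rate are all assumptions. It relies on the sentry-sdk package's `init` call and its `before_send` hook, which lets the application set a custom fingerprint so that related errors are grouped into one issue.

```python
import os
from typing import Optional

# Hypothetical mapping from exception class name to a stable grouping key.
FINGERPRINTS = {
    "LiquidationError": ["liquidation-failure"],
    "OracleTimeout": ["oracle-timeout"],
}


def before_send(event: dict, hint: dict) -> Optional[dict]:
    """Group known critical failures under one issue and tag their priority."""
    exc_info = hint.get("exc_info")  # (type, value, traceback) when an exception was captured
    if exc_info:
        name = exc_info[0].__name__
        if name in FINGERPRINTS:
            event["fingerprint"] = FINGERPRINTS[name]
            # "priority" is an illustrative tag that an alert rule could filter on.
            event.setdefault("tags", {})["priority"] = "critical"
    return event  # returning None instead would drop the event


def init_sentry() -> None:
    """Initialise the SDK once at process start-up."""
    import sentry_sdk  # requires the sentry-sdk package

    sentry_sdk.init(
        dsn=os.environ.get("SENTRY_DSN"),  # assumed env var; no DSN disables reporting
        environment=os.environ.get("APP_ENV", "production"),  # assumed env var
        traces_sample_rate=0.1,  # assumed rate for performance tracing
        before_send=before_send,
    )
```

Calling `init_sentry()` at start-up is enough for unhandled exceptions to be reported; an alert rule in Sentry could then match on the illustrative `priority` tag to notify the team.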
Infrastructure Monitoring -> New Relic
New Relic monitors the infrastructure, including the Kubernetes cluster and the managed databases. It provides insight into system performance, resource utilization, and application dependencies, which helps identify and resolve performance bottlenecks and tune resource usage.
Install New Relic agents on relevant infrastructure components (Kubernetes, PostgreSQL, Redis, etc.).
Configure New Relic to monitor key performance indicators, resource utilization, and dependencies.
Set up alert policies to notify the team of performance anomalies and potential issues.
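To make the idea of an alert policy concrete, the sketch below models a condition as "a metric stays above a threshold for N consecutive samples". It is plain Python for illustration only and does not use any New Relic API; in practice the conditions would be defined in New Relic itself. The metric names, thresholds, and sample counts are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertCondition:
    name: str
    threshold: float
    sustained_samples: int  # consecutive samples above threshold before firing (>= 1)


def violates(condition: AlertCondition, samples: Sequence[float]) -> bool:
    """True when the most recent `sustained_samples` values all exceed the threshold."""
    n = condition.sustained_samples
    if n < 1 or len(samples) < n:
        return False
    return all(value > condition.threshold for value in samples[-n:])


# Hypothetical conditions for the components named in this plan.
CONDITIONS = [
    AlertCondition("postgres_cpu_percent", threshold=85.0, sustained_samples=5),
    AlertCondition("redis_memory_percent", threshold=90.0, sustained_samples=3),
    AlertCondition("pod_restart_count", threshold=3.0, sustained_samples=1),
]
```

Requiring several consecutive samples avoids paging on a single short spike, at the cost of a slightly later notification.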
Alert Notification Channels -> Slack & PagerDuty
Slack Notifications
Slack is integrated into the monitoring system to centralize alert notifications. Alerts from both Sentry and New Relic are posted in dedicated channels, so team members stay informed about the system's health and can respond promptly to issues.
Sentry alerts are posted in a designated Sentry channel, providing visibility into code-level issues.
New Relic alerts are shared in a separate Infrastructure Monitoring channel, offering insights into system-level performance.
Ensure relevant team members are subscribed to these channels for immediate awareness.
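Sentry and New Relic both offer Slack integrations, so the channel routing above is normally a matter of configuration. For alerts raised by our own backend code, the same routing could be sketched with Slack incoming webhooks, which accept a JSON body with a `text` field. The environment variable names and the message format below are assumptions.

```python
import json
import os
import urllib.request

# Hypothetical env var names, each holding the incoming-webhook URL of one channel.
WEBHOOK_ENV = {
    "sentry": "SLACK_WEBHOOK_SENTRY",   # code-level issues
    "newrelic": "SLACK_WEBHOOK_INFRA",  # infrastructure monitoring
}


def build_slack_request(source: str, message: str) -> urllib.request.Request:
    """Build (but do not send) the webhook request for the channel that matches `source`."""
    url = os.environ[WEBHOOK_ENV[source]]
    body = json.dumps({"text": f"[{source}] {message}"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_alert(source: str, message: str, timeout: float = 5.0) -> int:
    """Send the alert and return the HTTP status code."""
    request = build_slack_request(source, message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from sending makes the routing easy to test without calling Slack.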
PagerDuty Notifications
PagerDuty ensures a timely response to critical incidents, especially outside working hours. It is configured to notify on-call engineers, and to escalate when needed, through phone calls, SMS, and mobile app notifications, so that urgent issues are addressed promptly and the system stays available and reliable.
Integrate PagerDuty with both Sentry and New Relic.
Configure PagerDuty escalation policies to alert on-call engineers in case of critical incidents.
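Where an alert has to be raised from custom code rather than through the built-in integrations, one option is the PagerDuty Events API v2. The sketch below builds a trigger event for that API; the routing key is a placeholder, and the rule that only "critical" and "error" severities page the on-call engineer is an assumption, not a documented policy of this project.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
VALID_SEVERITIES = {"critical", "error", "warning", "info"}
PAGE_SEVERITIES = {"critical", "error"}  # hypothetical policy: lower severities go to Slack only


def should_page(severity: str) -> bool:
    """Decide whether an alert is urgent enough to page the on-call engineer."""
    return severity in PAGE_SEVERITIES


def build_trigger_event(routing_key: str, summary: str, source: str,
                        severity: str, dedup_key: str) -> dict:
    """Build an Events API v2 'trigger' event; repeats with the same dedup_key join one incident."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    return {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "trigger",
        "dedup_key": dedup_key,
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def send_event(event: dict, timeout: float = 5.0) -> int:
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Who is notified, and in what order, is still decided by the escalation policy attached to the PagerDuty service; the code only opens the incident.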
Internal Status Page
An internal status page, built with PagerDuty, provides real-time updates on system health and status. It gives the team a single place to check the overall health of the system and to follow updates on ongoing incidents.
Conclusion
The monitoring and alerting system enables proactive identification and resolution of issues in the backend infrastructure that supports our lending smart contracts. By using Sentry for code-level monitoring, New Relic for infrastructure insights, Slack for team communication, and PagerDuty for 24/7 incident response, we maintain the reliability and availability the project requires. The monitoring setup will be reviewed and updated regularly to keep pace with evolving system requirements.