Monitoring & Alert

Off-chain monitoring plan and alert system

Overview

Our project relies on a backend infrastructure consisting of a Kubernetes cluster and managed databases (PostgreSQL and Redis) hosted on Google Cloud. The backend is a critical component: among other functions, it triggers liquidations and serves as a price oracle. To ensure its reliability and performance, it requires the comprehensive monitoring and alert system described below.

Monitoring Plan

Exception and Error Monitoring -> Sentry

Sentry monitors exceptions and errors within the codebase. Its real-time error tracking allows issues to be identified and resolved quickly. Sentry is configured to capture and report errors, exceptions, and performance bottlenecks in our application code.

  • Integrate Sentry SDK into the codebase to capture exceptions and errors.

  • Configure Sentry to group and prioritize issues for efficient troubleshooting.

  • Utilize Sentry's alerting system to notify the team of critical issues.
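
The integration steps above can be sketched as follows. This is a minimal sketch, assuming a Python backend and the official `sentry_sdk` package; the environment variables `SENTRY_DSN` and `APP_ENV`, the `component` tag, the component names, and both helper functions are hypothetical and not taken from the project.

```python
import os

# Hypothetical component names; adjust to the services that actually exist.
CRITICAL_COMPONENTS = {"liquidation", "price_oracle"}


def tag_critical(event, hint):
    """before_send hook: raise the level of events from critical components."""
    tags = event.get("tags") or {}
    if isinstance(tags, dict) and tags.get("component") in CRITICAL_COMPONENTS:
        event["level"] = "fatal"
    return event  # returning None instead would drop the event


def init_sentry():
    """Initialise the SDK once at process start-up."""
    # Imported lazily so this module still loads where the SDK is not installed.
    import sentry_sdk

    sentry_sdk.init(
        dsn=os.environ["SENTRY_DSN"],  # assumed variable name
        environment=os.environ.get("APP_ENV", "production"),
        traces_sample_rate=0.1,  # sample 10% of transactions for performance data
        before_send=tag_critical,
    )
```

Raising the level in `before_send` is one possible way to let a Sentry alert rule distinguish liquidation and oracle failures from routine errors; the same effect could be achieved with alert-rule filters on the tag alone.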

Infrastructure Monitoring -> New Relic

New Relic monitors the infrastructure, including the Kubernetes cluster and the managed databases. It offers insights into system performance, resource utilization, and application dependencies, which helps in identifying and resolving performance bottlenecks and in optimizing resource utilization.

  • Install New Relic agents on relevant infrastructure components (Kubernetes, PostgreSQL, Redis, etc.).

  • Configure New Relic to monitor key performance indicators, resource utilization, and dependencies.

  • Set up alert policies to notify the team of performance anomalies and potential issues.
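
The kind of condition an alert policy encodes can be sketched in plain Python. This illustrates the logic only and is not New Relic's API; the metric, the 90% threshold, and the five-sample window are assumed values.

```python
from collections import deque


class SustainedThreshold:
    """Fires when every one of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the newest samples

    def observe(self, value):
        """Record a sample and return True if the condition is violated."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(v > self.threshold for v in self.samples)


# Hypothetical rule: PostgreSQL CPU above 90% for five consecutive samples.
cpu_rule = SustainedThreshold(threshold=90.0, window=5)
readings = [95, 97, 92, 99, 96, 40]
fired = [cpu_rule.observe(r) for r in readings]
```

Requiring the whole window to exceed the threshold avoids alerting on short spikes, at the cost of a delay equal to the window length.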

Alert Notification Channels -> Slack & PagerDuty

Slack Notifications

Slack is integrated into our monitoring system as the main notification channel. Alerts from both Sentry and New Relic are posted in dedicated channels, so team members stay informed about the system's health and can respond promptly to any issues.

  • Sentry alerts are posted in a designated Sentry channel, providing visibility into code-level issues.

  • New Relic alerts are shared in a separate Infrastructure Monitoring channel, offering insights into system-level performance.

  • Ensure relevant team members are subscribed to these channels for immediate awareness.
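
Sentry and New Relic both offer Slack integrations, so the channels above are mostly a matter of configuration. For alerts raised by the backend itself, a minimal sketch assuming Slack incoming webhooks could look like this; the function names and the message layout are hypothetical.

```python
import json
import urllib.request


def build_slack_message(source, title, url):
    """Build the JSON body for a Slack incoming webhook."""
    return {"text": f"[{source}] {title}\n{url}"}


def post_to_slack(webhook_url, message):
    """POST the message to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Each dedicated channel would have its own webhook URL, so routing an alert to the right channel comes down to choosing which URL to pass to `post_to_slack`.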

PagerDuty Notifications

PagerDuty is employed to ensure timely response to critical incidents, especially during non-working hours. PagerDuty is configured to escalate and notify on-call engineers through various channels, such as phone calls, SMS, and mobile applications. This ensures that urgent issues are addressed promptly, maintaining the system's availability and reliability.

  • Integrate PagerDuty with both Sentry and New Relic.

  • Configure PagerDuty escalation policies to alert on-call engineers in case of critical incidents.
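
For incidents detected by custom backend checks, an event can also be sent to PagerDuty directly. The sketch below assumes the PagerDuty Events API v2; the rule that only `critical` and `error` severities page the on-call engineer, and the function names, are assumptions rather than the project's actual policy.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
# Assumed policy: lower severities go to Slack only and do not page anyone.
PAGING_SEVERITIES = {"critical", "error"}


def should_page(severity):
    """Decide whether an alert of this severity should reach the on-call engineer."""
    return severity in PAGING_SEVERITIES


def build_trigger_event(routing_key, summary, source, severity):
    """Build an Events API v2 body that opens a new incident."""
    return {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }


def send_event(event):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Who is notified, and in what order, is still decided by the escalation policy attached to the PagerDuty service; the code only decides whether an incident is opened.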

Internal Status Page

An internal status page, built with PagerDuty, provides real-time updates on system health and status. It serves as a central place for the team to monitor the overall health of the system and to follow updates on ongoing incidents.

Monitoring and Alert system architecture

Conclusion

The monitoring and alert system enables proactive identification and resolution of issues in the backend infrastructure that supports our lending smart contracts. By using Sentry for code-level monitoring, New Relic for infrastructure insights, Slack for team notifications, and PagerDuty for 24/7 incident response, we maintain the reliability and availability the project requires. We will review and update the monitoring setup regularly to keep pace with evolving system requirements.
