System Monitoring

This section describes the monitoring plan for Vizium.

Vizium has robust downtime alerting and recovery systems that monitor the Realtime Chain Monitoring Service (RTCM), various blockchain nodes, and the production machines themselves.

Each server runs a proprietary Python script that monitors vitals such as CPU, RAM, and disk space usage and confirms that the host is reachable. On node machines, the script also checks that a block has been received recently and that the chain is progressing as expected. We run backup nodes in alternative geographic locations so that, if a node has an issue, we can detect it and switch to a backup. This failover process is currently semi-automated and is planned to be fully automated in the near future.
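For illustration only, the two kinds of check described above might look like the following minimal sketch, which uses only the Python standard library. The actual script is proprietary and is not shown here; the function names `check_disk` and `chain_is_progressing` and the threshold values are hypothetical.

```python
import shutil
import time


def check_disk(path="/", max_used_fraction=0.90):
    """Return True if disk usage at `path` is below the (assumed) threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < max_used_fraction


def chain_is_progressing(last_block_time, now=None, max_block_age_s=120.0):
    """Return True if the latest block is no older than `max_block_age_s`.

    `last_block_time` is the Unix timestamp of the most recent block the
    node has seen. The 120-second default is an assumed value; a real
    threshold would depend on the chain's block interval.
    """
    if now is None:
        now = time.time()
    return (now - last_block_time) <= max_block_age_s
```

A monitoring loop could call both functions on a schedule and raise an alert whenever either returns False.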

The RTCM service itself is monitored with Monit, which watches local log files and output to ensure the service is up and running. Additionally, we run a separate client service that connects to the RTCM and subscribes to events that are expected to occur in most blocks. This lets us detect when the service is running but not broadcasting data as expected, in which case it is restarted automatically. Further, the RTCM API defines graceful reconnect logic so that existing clients can handle a disconnect/reconnect event with minimal to no disruption.
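The "running but silent" detection and client-side reconnect behavior can be sketched roughly as follows. This is not the RTCM client or API: `StreamWatchdog`, `backoff_delays`, and the timing values are hypothetical, and capped exponential backoff is simply one common reconnect strategy, not necessarily the one the RTCM API defines.

```python
import time


class StreamWatchdog:
    """Tracks how long it has been since the last expected event arrived."""

    def __init__(self, max_silence_s=60.0, clock=time.monotonic):
        self.max_silence_s = max_silence_s  # assumed silence threshold
        self._clock = clock
        self._last_event = clock()

    def record_event(self):
        """Call whenever a subscribed event is received."""
        self._last_event = self._clock()

    def is_stale(self):
        """True if no event has arrived within `max_silence_s` seconds."""
        return self._clock() - self._last_event > self.max_silence_s


def backoff_delays(base_s=1.0, cap_s=30.0, attempts=5):
    """Yield capped exponential delays (in seconds) between reconnect attempts."""
    for attempt in range(attempts):
        yield min(cap_s, base_s * 2 ** attempt)
```

A canary client would call `record_event()` on every received event and trigger a restart or alert when `is_stale()` returns True. Injecting the clock keeps the watchdog easy to test without real waiting.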

All of these monitoring processes feed into a centralized alerting system that currently sends emails and text messages to our technical support team. We are migrating to PagerDuty for more in-depth and robust alerting capabilities.
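As a rough sketch of the centralized fan-out idea, the example below routes each alert to every registered channel. The `Alert` and `AlertHub` names are hypothetical, and the channels are plain callables standing in for email, text message, or PagerDuty integrations; no real provider API is used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Alert:
    source: str               # e.g. "node-monitor" (illustrative label)
    message: str
    severity: str = "warning"


class AlertHub:
    """Sends each alert to all registered channels."""

    def __init__(self):
        self._channels: List[Callable[[Alert], None]] = []

    def add_channel(self, send: Callable[[Alert], None]) -> None:
        self._channels.append(send)

    def dispatch(self, alert: Alert) -> int:
        """Deliver to every channel; return how many deliveries succeeded."""
        delivered = 0
        for send in self._channels:
            try:
                send(alert)
                delivered += 1
            except Exception:
                # One failing channel should not block the others.
                continue
        return delivered
```

With this shape, adding a new destination means registering one more callable rather than changing each monitor.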
