Loan To Value (LTV) Monitor System

Documentation for the LTV Monitor Service

The LTV Monitor is a microservice designed to continuously monitor the loans that exist within the Concrete protocol and calculate their LTV ratios. Based on the calculated LTV ratios and the types of protections applied to a loan, it must perform specific actions like claim and foreclosure triggers. Due to the time-sensitivity of these actions, it must perform quickly even for a large number of loans.

High-level Requirements

  • Scalability - The system must be designed to scale and handle a large number of loans

  • Efficiency - The system must be able to prioritize the processing of risky loans over safer ones

  • Reliability - The system must be reasonably fault-tolerant

The Loan Leasing System

The loan leasing system is what allows many loans to be processed in parallel. It consists of a Postgres table containing all the loans plus some metadata, and a pool of workers that take "leases" on loans, process them, and then release them. The leasing system is built on Postgres' locking features. When selecting loans to lease, workers take a SELECT FOR UPDATE lock on the rows they select, and then update their 'worker node id' to signify that the loan is currently leased. Other workers will not lease already-leased loans because their select query filters out loans with a non-null worker node id. With this design, an arbitrary number of workers can work in parallel on an arbitrary number of loans, with the guarantee that two workers never execute actions for the same loan at the same time.

When processing the loan, the workers also calculate a 'next_processing_after' timestamp, which tells other workers after what time this loan becomes available again for reprocessing. This is important because not all loans need to be processed continuously, as described in the Dynamic Loan Reprocessing Intervals section below.
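
As a rough sketch, the lease and release queries might look like the following. The table and column names (`loans`, `worker_node_id`, `next_processing_after`) are assumptions, not the actual schema:

```python
# Hypothetical lease-acquisition query. The inner SELECT ... FOR UPDATE takes
# row locks, and the surrounding UPDATE marks the rows as leased by setting
# the worker's node id. Only unleased, due loans are considered.
LEASE_QUERY = """
UPDATE loans
SET worker_node_id = %(worker_node_id)s
WHERE id IN (
    SELECT id
    FROM loans
    WHERE worker_node_id IS NULL
      AND next_processing_after <= now()
    ORDER BY next_processing_after
    LIMIT %(batch_size)s
    FOR UPDATE
)
RETURNING id;
"""

# Releasing a lease clears the worker node id and schedules the next
# processing time in the same statement.
RELEASE_QUERY = """
UPDATE loans
SET worker_node_id = NULL,
    next_processing_after = %(next_processing_after)s
WHERE id = %(loan_id)s;
"""
```

A common refinement of this pattern is adding `SKIP LOCKED` to the inner select, so a worker skips rows momentarily locked by a peer instead of blocking on them; whether the actual implementation does this is not stated here.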

The Processing Flow

The processing of a loan consists of a few steps:

  • Acquire loan lease

  • Calculate the LTV

  • Execute applicable actions

  • Calculate next_processing_after

  • Release loan lease
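
In code, one pass over a single loan could be sketched like this, with each step injected as a callable. All names here are illustrative, not the actual worker implementation:

```python
def process_once(acquire_lease, calculate_ltv, execute_actions,
                 calculate_next_processing_after, release_lease):
    """Run the five processing steps for one leased loan."""
    loan = acquire_lease()                               # acquire loan lease
    if loan is None:                                     # nothing available to lease
        return None
    ltv = calculate_ltv(loan)                            # calculate the LTV
    execute_actions(loan, ltv)                           # execute applicable actions
    next_after = calculate_next_processing_after(loan, ltv)
    release_lease(loan, next_after)                      # release loan lease
    return ltv, next_after
```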

We already covered the loan leasing, so let's dive deeper into the rest of it.

The Flowchart of a Loan's Processing

The flowchart above contains three main paths, each determined by a loan's state. The state is a column stored on the loan table, and it drives the state machine that the continuous processing of a loan implements.

The open state

This is the simplest state - it means that the loan is currently open and we are not waiting for anything to happen. We simply recalculate the LTV for this loan to make sure no actions have to be triggered, and release it to be reprocessed later on.

To get data about the loan, like collateral amounts, borrow amounts, etc., we rely on the borrower and price oracle services. The borrower service provides the loan's data, and we use the pricing of the oracle service to convert the collateral and borrow amounts into USD so that we can easily calculate the LTV.
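
The LTV calculation itself reduces to a ratio of USD values. A minimal sketch, assuming dict-shaped loan data and a simple symbol-to-USD price map (both are assumptions about the service interfaces):

```python
def calculate_ltv(collateral, borrows, prices_usd):
    """LTV = total borrowed value in USD / total collateral value in USD.

    collateral/borrows map asset symbol -> amount; prices_usd maps
    asset symbol -> USD price (as provided by the price oracle service).
    """
    collateral_usd = sum(amount * prices_usd[sym] for sym, amount in collateral.items())
    borrow_usd = sum(amount * prices_usd[sym] for sym, amount in borrows.items())
    return borrow_usd / collateral_usd

# e.g. 10 ETH at $3,000 as collateral, 15,000 USDC borrowed -> LTV 0.5
```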

Once the LTV is calculated, we can determine whether it is above any protection thresholds. If it is not, we calculate when we should next process this loan and release our lease.

If the loan has breached any of its protection thresholds, we then trigger the claim/foreclosure process. First, we check whether this loan is eligible for a claim at all. To do this we use the data from the borrower service to check whether the loan has a protection enabled and, if it does, whether any claims remain to be disbursed. If the loan is not eligible for a claim, we check whether it is eligible for a foreclosure. This is the case if the user has purchased the 'foreclose-able' product, or if the loan is protected but no claims remain to be disbursed.

If the loan is eligible for a claim or foreclosure, we make a request to the appropriate HTTP endpoint of the borrower service to trigger it. Then, we update the loan's state to either claim_triggered or foreclosure_triggered - this way the state machine will know to check up on this claim or foreclosure the next time the loan is processed.
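
The eligibility rules above can be summarized as a small decision function. The parameter names are hypothetical, not the borrower service's actual fields:

```python
def decide_action(has_protection, claims_remaining, foreclosable_product):
    """For a loan that has breached a protection threshold, return
    'claim', 'foreclosure', or None, following the eligibility rules:
    claim if protected with claims left; foreclosure if the foreclose-able
    product was purchased, or if protected but no claims remain."""
    if has_protection and claims_remaining > 0:
        return "claim"
    if foreclosable_product or (has_protection and claims_remaining == 0):
        return "foreclosure"
    return None
```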

The claim_triggered state

This one is pretty straightforward - at this point in the loan's lifecycle we know that we have asked the borrower service to trigger a claim, so we check up on the borrower service to see if that claim has succeeded. If it has, we put the loan back into the open state for it to be reprocessed again - with the hope that the deposit made during the claim process has decreased the LTV back below the protection threshold.

The foreclosure_triggered state

This one is a mirror of the above state, but in this case we check whether the foreclosure has succeeded. If it has, we mark the loan's state as closed in the LTV monitor's database so that we don't process it again.
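
Taken together, the three states form a small state machine. A sketch of the transitions (the event names are illustrative):

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("open", "claim_triggered"): "claim_triggered",
    ("open", "foreclosure_triggered"): "foreclosure_triggered",
    # reprocess with the hope that the claim's deposit lowered the LTV
    ("claim_triggered", "claim_succeeded"): "open",
    # closed loans are never processed again
    ("foreclosure_triggered", "foreclosure_succeeded"): "closed",
}

def next_state(state, event):
    """Return the next state; stay in place for unknown events
    (e.g. a claim that has not succeeded yet)."""
    return TRANSITIONS.get((state, event), state)
```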

Dynamic Loan Reprocessing Intervals

During the processing of a loan, the workers have the ability to determine when the loan should be processed next. In an ideal scenario, loans would be continuously processed. But since resources are not infinite, the system has to be selective about which loans it spends its resources processing.

By having the ability to determine the time of next processing, the worker is tasked with answering the question of how far into the future it is confident that the loan will not pass its liquidation LTV. Naturally, loans with LTVs far from their protection thresholds will be scheduled further into the future. Therefore, the system can prioritize its resources on loans that are close to their protection threshold. This is also consistent with the data science team's observation that loans with higher LTVs have a higher rate of change in their LTV.

In terms of how the system does this, it sets a next_processing_after value on the loan. Then, when workers are looking for loans to lease, they look for loans whose next_processing_after is earlier than now(). The next_processing_after can be calculated from any property of the loan available to the worker at the time of processing.

An example function of LTV delta to the seconds until next processing

Above is a graph of the current implementation of this dynamic reprocessing interval function. It takes the LTV delta (the difference between the liquidation LTV and the current LTV) and squares it to get the number of seconds the system should wait before reprocessing. Notice that configured minimum and maximum values are also applied to this function, to make sure reprocessing doesn't happen at too short or too long an interval.
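
A sketch of such a function, squaring the LTV delta and clamping the result between configured bounds. Treating the delta as percentage points and the default bounds of 60 seconds / 1 hour are assumptions for illustration, not the production configuration:

```python
def seconds_until_next_processing(liquidation_ltv, current_ltv,
                                  min_seconds=60, max_seconds=3600):
    """Square the LTV delta (in percentage points) to get the number of
    seconds to wait before reprocessing, clamped to the configured bounds."""
    delta_pct = max(liquidation_ltv - current_ltv, 0.0) * 100
    return min(max(delta_pct ** 2, min_seconds), max_seconds)

# a delta of 0.20 (20 percentage points) -> ~400 seconds;
# a tiny delta hits the floor, a large one hits the ceiling
```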

Loan Creation

Loans are created in the borrower service. This means that the LTV monitor has to somehow be told to add a loan into its own database so that it starts processing it. This is done using the cb-ltv-monitor-pubsub-listener service. This service listens to a pubsub topic for a loan creation event which is sent by the borrower service. Once this event is received, it inserts the loan into the database.
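
A sketch of the listener's handler, assuming a JSON event payload with a `loan_id` field (the real event schema and column names are not documented here):

```python
import json

def handle_loan_created(message_data: bytes):
    """Turn a loan-creation pubsub message into an insert for the loans table.

    Returns (sql, params) so the caller can execute it against Postgres.
    The event schema and column names are assumptions.
    """
    event = json.loads(message_data)
    sql = (
        "INSERT INTO loans (id, state, next_processing_after) "
        "VALUES (%(id)s, 'open', now()) "
        "ON CONFLICT (id) DO NOTHING"  # pubsub may redeliver the same event
    )
    return sql, {"id": event["loan_id"]}
```

Making the insert idempotent matters here because GCP Pub/Sub delivery is at-least-once, so the same creation event can arrive more than once.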

LTV History

Saving the history of a loan's LTV is useful for the frontend and other parts of the system. Because of this, the LTV monitor workers add entries to a table called historicloandata. The contents of this table are then exposed to the frontend via a REST route served by a microservice called cb-ltv-monitor-rest.
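
The history write itself is a plain append, and the REST route reads it back per loan. A sketch with assumed column names:

```python
# Hypothetical append into the LTV history table, run by a worker during
# processing. Column names are assumptions.
HISTORY_INSERT = """
INSERT INTO historicloandata (loan_id, ltv, recorded_at)
VALUES (%(loan_id)s, %(ltv)s, now());
"""

# The kind of query a REST route in cb-ltv-monitor-rest might serve
# to back a frontend LTV chart.
HISTORY_FOR_LOAN = """
SELECT ltv, recorded_at
FROM historicloandata
WHERE loan_id = %(loan_id)s
ORDER BY recorded_at;
"""
```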

Horizontal Autoscaling

Having an arbitrary number of workers means that the system can be scaled up and down based on demand. The LTV monitor relies on Kubernetes' horizontal pod autoscaler to achieve this.

LTV Monitor Autoscaling Scheme

A service called cb-ltv-monitor-metrics-exporter continually queries the database for various metrics like "count of loans waiting for processing", "count of open loans", and "average processing delay", and exports them to GCP Stackdriver. A Kubernetes horizontal pod autoscaler is then configured to consume these metrics and determine how many workers to spin up.
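
The exported metrics could be backed by queries along these lines. The metric names come from the text above; the SQL and schema details are assumptions:

```python
# Hypothetical queries behind the exported metrics. The exporter would run
# these on an interval and push the results to Stackdriver as custom metrics,
# which the horizontal pod autoscaler then consumes.
METRIC_QUERIES = {
    # loans whose lease is free and whose reprocessing time has passed
    "loans_waiting_for_processing": (
        "SELECT count(*) FROM loans "
        "WHERE worker_node_id IS NULL AND next_processing_after <= now()"
    ),
    "open_loans": "SELECT count(*) FROM loans WHERE state = 'open'",
    # how far behind schedule the overdue loans are, on average
    "average_processing_delay_seconds": (
        "SELECT coalesce(avg(extract(epoch FROM now() - next_processing_after)), 0) "
        "FROM loans WHERE next_processing_after <= now()"
    ),
}
```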

Scale Testing

The system described above has been tested in both local and cloud environments to observe how it behaves at scale.

Methodology

For local environment testing, the repository contains a local kustomize overlay that lets you deploy all the components needed to process loans within a minikube cluster. This environment includes mock versions of the borrower and price oracle services, making it possible to simulate an arbitrary amount of loans and predefined market movements.

For cloud testing, the local kustomize overlay was applied to an out-of-the-box GKE Autopilot cluster. Due to the nature of the local overlay, no extra configuration was needed on GKE to make this work. The cloud environment was needed because the number of loans being tested required more resources than a single computer running minikube could provide.

For observation, a new database-monitor tool was created under the ./tools directory to allow monitoring of key metrics for a given environment. This tool generates useful charts for total loan count, loan processing throughput, loans awaiting processing, etc.

Findings

Timing of each function that runs during a loan's processing
  • The borrower service interface could be a key bottleneck - Each of the LTV monitor's workers relies on the borrower service's REST APIs to get loan data. As workers scale up, they hit this API harder and harder. The API call involved represents around 40% of a single loan's processing time, so there is a huge impact if these APIs get overloaded and become slow.

    • As a solution for this, the team split the borrower service into internal and external services. This ensures that load on the internal service does not affect the external service, and vice-versa. This has already been implemented.

    • At the end of the day, the solution above is a stop-gap. Both the internal and external services query the same database, and if high usage of the internal service impacts database resources then the problem still persists. The team has discussed creating a read replica of the borrower service database and pointing some of the read-only internal borrower service routes at it. However, it has been determined that this is not needed at this point, since the issue only becomes a factor when active loan counts reach the hundreds of thousands.

    • Another solution is to remove the borrower service from the equation completely. If a read-only replica exists, then the LTV monitor's workers can get loan data directly from that database. This would significantly reduce the time to resolve a loan's data and reduce network congestion.

  • The system slows down significantly when reprocessing intervals are capped at a low value. This is intuitive: if the maximum reprocessing interval is 60 seconds, i.e. every loan must be reprocessed at least once a minute, then for a million loans the system's throughput has to be 1M loans/minute. In the cloud-based testing, a per-worker loan processing throughput of 100 loans per minute was observed, meaning it would take 10,000 workers to handle 1M loans/minute. By raising this cap to something more sensible, the system can stagger the 1M loans across a longer timeframe, requiring fewer workers to meet the throughput requirements. During testing, it was observed that a maximum processing interval of 1 hour makes it possible for a reasonable number of workers (<100) to handle 1,000,000 loans. The team is currently working on a function to determine a loan's next processing interval, so this point will be revisited once that is complete.

  • Autoscaling lags behind real-time demand. When a burst of loans is created, it creates a need for immediate high throughput, which is a problem if scaling up new workers to meet that demand takes time. The saving grace in this scenario is that Concrete does not allow loans to be protected if they have extremely unhealthy opening LTVs. Still, the autoscaler must be configured to look at the most immediate metric - "how many loans have to be processed at this moment" - since lagging metrics like "what is the average loan processing delay in the past x minutes" will not work.
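
The throughput arithmetic in the findings above is easy to reproduce. This helper is purely illustrative; it assumes every loan is reprocessed exactly once per interval, with the per-worker throughput taken from the cloud test:

```python
import math

def required_workers(loan_count, reprocess_interval_seconds,
                     per_worker_loans_per_minute):
    """Workers needed if every loan must be reprocessed once per interval."""
    loans_per_minute = loan_count / (reprocess_interval_seconds / 60)
    return math.ceil(loans_per_minute / per_worker_loans_per_minute)

# 1M loans reprocessed every 60s at 100 loans/min/worker -> 10,000 workers
```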

Initial Requirements
Google Slides Presentation
Demo Video Recording
LTV Service Design Proposal
