Provider rating
The core of dRPC's load balancing algorithm is the provider rating. After filtering providers capable of serving the request, the final provider is chosen randomly, with the probability of being chosen proportional to its rating.
Below we describe how the rating is calculated at the time of writing.
Rating dimension#
The initial concept to discuss is rating dimensions. Each provider can host many different networks in different locations and perform differently on different methods. Therefore, each provider has a rating for each available combination of parameters. These combinations are called "dimensions".
A rating dimension can be described as a Go structure.
```go
type RatingDimension struct {
	Method       string
	Chain        dshackle.ChainRef
	SourceRegion reducerstates.Region
	Kind         DimensionKind
}
```
Let's describe what each field means.
Method directly represents the method of the request. Since different methods require varying amounts of time and effort to handle, it is essential to differentiate methods as one of the dimensions.
Chain is a network ID. Obviously, we need to have separate ratings for different chains.
SourceRegion helps us compare a provider's performance for requests originating from different regions. For example, providers in Europe typically have lower latency for requests from Europe than those from the US.
Kind is a subset of providers we want to calculate ratings for:
- All — contains all available providers
- Public — contains only public free providers
- PublicBestLatency — formed dynamically from the Public pool, contains well-performing public providers
- PublicLowPriority — special pool that aggregates rate-limited public RPCs (i.e. providers not working with dRPC)
- BestLatency — contains only paid, well-performing providers
- BestLatencyTop — contains only paid, top-performing providers
- MEV — special pool for providers with MEV services
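For orientation, the pool kinds above could be modeled as constants of the DimensionKind type from the struct; the constant names below are assumptions for illustration, not dRPC's internal definitions.

```go
// DimensionKind identifies the provider pool a rating is calculated for.
// An illustrative sketch; the real definition is internal to dRPC and the
// constant names below are assumptions.
type DimensionKind int

const (
	KindAll               DimensionKind = iota // all available providers
	KindPublic                                 // public free providers
	KindPublicBestLatency                      // well-performing public providers
	KindPublicLowPriority                      // rate-limited public RPCs
	KindBestLatency                            // paid, well-performing providers
	KindBestLatencyTop                         // paid, top-performing providers
	KindMEV                                    // providers with MEV services
)
```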
Rating calculation#
The rating uses a provider performance prediction and provider price as features to determine the probability of serving a request in a given dimension.
The performance prediction algorithm predicts a benchmark that combines latencies, error counts, and derived features such as EMA (Exponential Moving Average). The derived features let the system "remember" latencies observed more than a second ago.
EMA is a term from signal processing. It can be represented as follows: EMA_t = α · x_t + (1 − α) · EMA_(t−1), where x_t is the current observation and α is the smoothing factor.
EMA lets us create features that "remember" their previous values, since they literally use their previous values to calculate new ones. The alpha parameter controls how long-term the memory is.
For example, calculating an EMA with alpha = 0.06 will eventually produce a metric close to the average value over the last 30 seconds.
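A minimal Go sketch of such an EMA update is shown below; the type, field names, and example latencies are assumptions for illustration only.

```go
package main

import "fmt"

// LatencyEMA keeps an exponentially weighted average of per-second latencies.
// This is an illustrative sketch; names and values are not dRPC internals.
type LatencyEMA struct {
	Alpha       float64 // smoothing factor; 0.06 roughly "remembers" the last 30 seconds
	Value       float64
	initialized bool
}

// Update folds the latest per-second average latency into the EMA.
func (e *LatencyEMA) Update(latencyMs float64) float64 {
	if !e.initialized {
		e.Value, e.initialized = latencyMs, true
		return e.Value
	}
	e.Value = e.Alpha*latencyMs + (1-e.Alpha)*e.Value
	return e.Value
}

func main() {
	ema := LatencyEMA{Alpha: 0.06}
	// Per-second average latencies in ms; the 300 ms spike fades gradually
	// instead of disappearing after one second.
	for _, l := range []float64{50, 52, 300, 55, 51} {
		fmt.Printf("EMA: %.1f ms\n", ema.Update(l))
	}
}
```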
The rating algorithm takes the predictions and the prices and produces new ratings every 5 seconds. The ratings are calculated and stored on each dproxy independently, meaning the ratings could vary between dproxies, but not substantially.
Performance prediction#
The prediction algorithm uses the performance data of a provider in the last second. We operate in a certain rating dimension and use data only from this dimension (chain, region, etc). Let's see what is inside the performance data.
Average latency. The average latency in the last second for each provider in each dimension.
Error rate. We have very little tolerance for errors, so if a provider returns 10-20 errors, they will be heavily weighted by the ML algorithm and the next prediction of the provider's performance will be very low. It's worth noting that we filter out errors caused by clients or by dRPC itself. The effect of errors on a rating can last from seconds to hours, depending on the number and severity of the errors.
EMAs. As described above, they allow the rating algorithm to remember the history of the provider's performance.
While the performance data serves as the features, the latency with error penalties is the target. The error penalty is an artificially created large number that is put into the dataset. It increases the average benchmark and serves as a penalty.
Essentially, we try to predict how well the provider will perform in the next seconds, and then we create ratings based on this prediction and the prices.
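To make this concrete, one toy way to lay out the features and the penalized target is sketched below; the struct fields and the penalty constant are assumptions, not dRPC's actual schema.

```go
// PerformanceSample is a toy view of one per-second feature set for a
// provider in one rating dimension. Field names are illustrative.
type PerformanceSample struct {
	AvgLatencyMs float64   // average latency over the last second
	ErrorCount   int       // provider-caused errors in the last second
	LatencyEMAs  []float64 // EMAs with different alphas ("memory" features)
}

// errorPenaltyMs is an artificially large latency used as the target for
// failed requests; the concrete value here is an assumption.
const errorPenaltyMs = 10_000.0

// targetMs builds the value the model learns to predict: the observed
// latency, replaced by a large penalty when the request errored, which
// drags the predicted benchmark up for error-prone providers.
func targetMs(latencyMs float64, isError bool) float64 {
	if isError {
		return errorPenaltyMs
	}
	return latencyMs
}
```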
We have been testing Linear Regression and various time-series forecasting algorithms. Our priority was simplicity and rapid execution with decent prediction capabilities.
For Linear Regression, we use the VowpalWabbit machine learning framework, which is written in C++ and has Go bindings. We developed the training and inference processes of our models with this framework. It works very well at a large scale under high load and can be used for online learning, although we currently don't use online learning. Inference of a model on millions of inputs takes less than half a second.
Rating Calculation Algorithm
Our input is a set of providers, their performance predictions (one for each provider), and the prices offered by providers in a given dimension. Since predictions are highly correlated with latencies and error rates, low values are better than high values.
Firstly, we apply a transformation to each prediction.
We do this with the goal of placing low values closer together and placing high values further away.
The transformation is made with a sigmoid function.
It has low values at the beginning, then a slope, and high values at the end.
The x here is the difference between the lowest prediction in the dimension and the benchmark we apply the transformation to.
So values close to the best one are multiplied by a small number and become closer to each other, and high values are multiplied by a large number and grouped together far away from the low ones.
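A minimal sketch of this transformation, assuming a standard logistic sigmoid, is shown below; the steepness and midpoint constants are illustrative and not dRPC's actual parameters.

```go
package rating

import "math"

// sigmoid is the standard logistic function: low values on the left,
// a slope in the middle, high values on the right.
func sigmoid(x float64) float64 {
	return 1 / (1 + math.Exp(-x))
}

// transformPredictions multiplies each prediction by a sigmoid of its
// distance from the best (lowest) prediction in the dimension: predictions
// near the best are scaled by a small factor and stay close together, while
// poor predictions are scaled by a factor near 1 and end up far away.
// The steepness and midpoint constants are assumptions for illustration.
func transformPredictions(preds []float64) []float64 {
	best := preds[0]
	for _, p := range preds {
		if p < best {
			best = p
		}
	}
	const steepness, midpointMs = 0.05, 50.0
	out := make([]float64, len(preds))
	for i, p := range preds {
		x := p - best // distance from the best prediction in the dimension
		out[i] = p * sigmoid(steepness*(x-midpointMs))
	}
	return out
}
```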
After the transformation, the softmax function is applied to the transformed prediction vector multiplied by -1 to obtain a vector that sums up to 1 — a "pre"-rating with no discount effect so far. Here, high values imply more traffic and low values imply less traffic. These numbers could serve as ratings, but they don't consider prices yet.
Softmax is a function that maps a vector of numbers to a probability distribution. It's not a real probability but behaves like one.
We use an additional parameter in softmax called temperature to control how uniform the resulting distribution is. We usually make the ratings more equal to each other than a plain softmax would suggest.
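Continuing the sketch above, the temperature-controlled softmax over the negated transformed predictions might look like this; the exact temperature used in production is not specified here.

```go
// softmaxWithTemperature maps a vector of scores to a distribution that sums
// to 1. A higher temperature makes the resulting ratings more uniform.
func softmaxWithTemperature(scores []float64, temperature float64) []float64 {
	maxScore := scores[0]
	for _, s := range scores {
		if s > maxScore {
			maxScore = s
		}
	}
	exps := make([]float64, len(scores))
	sum := 0.0
	for i, s := range scores {
		exps[i] = math.Exp((s - maxScore) / temperature) // shift by max for numerical stability
		sum += exps[i]
	}
	for i := range exps {
		exps[i] /= sum
	}
	return exps
}

// preRatings negates the transformed predictions (lower is better) and turns
// them into "pre"-ratings that sum to 1; prices are not considered yet.
func preRatings(transformed []float64, temperature float64) []float64 {
	negated := make([]float64, len(transformed))
	for i, t := range transformed {
		negated[i] = -t
	}
	return softmaxWithTemperature(negated, temperature)
}
```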
Next, prices are transformed in such a way that, when multiplied by the pre-ratings, they increase the ratings of cheaper providers, provided the provider's performance is good and its discount is higher than the usual discount in the dimension.
Currently, the algorithm is adjusted to consider differences between latencies of less than 10 ms insignificant if all other factors are equal. Also, no discount can put a provider at the top if its performance is too low.
As a result, we have a set of numbers between 0 and 1 representing the share of traffic (or the probability of serving a request) for each provider in each dimension. When stored in the database and shown in dashboards, the ratings are typically multiplied by 10^9 for storage optimization.
The ratings are practically never equal to 0 for everyone but the leader. We don't want a winner-takes-all system, so that we can keep testing and updating the performance data of each provider as well as provide the most robust and reliable service to our clients.
Notes on how the algorithm behaves:
- If discounts are all the same in a dimension (e.g. 0%), they do not affect the rating; the rating is based on latencies only
- If latencies are all the same in a dimension, they do not affect the rating; the rating is based on discounts only
- Discounts of worse-performing providers have a lower effect than discounts of well-performing providers
- Ratings of providers with latencies close to the best latency are high and close to each other, ranked by the discounts
- Ratings of providers with latencies far from the best latency are low
- Difference in latencies of <= 10ms is now considered insignificant by the algorithm
The rating can be improved or downgraded by only 3 factors: latency, errors, and price. However, as you might have noticed, the rating depends on the ratings of all other providers in the dimension. Thus, since any provider may change their performance and price, your rating may change because of other providers' activity even if your own parameters stay untouched.
Sampling providers
On every request, we take the ratings of available providers in the request's dimension and sample a provider randomly where the probability of being chosen is the rating. If the request is served, the process ends. Otherwise, we can try to select another provider or move to the next pool and repeat this action.
Generally, we choose a provider for a request from pools in the following order:
- BestLatencyTop — Only for paid requests
- BestLatency — Only for paid requests
- Default — Only for paid requests
- PublicBestLatency
- Public
- PublicLowPriority
- All
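A simplified sketch of this sampling and pool fallback is shown below; the type names and function signatures are assumptions for illustration, not dRPC's internal code.

```go
package routing

import "math/rand"

// ProviderRating is one provider's traffic share in a dimension.
// Names here are illustrative, not dRPC's internal types.
type ProviderRating struct {
	ProviderID string
	Rating     float64 // share of traffic; ratings in a pool sum to roughly 1
}

// sampleProvider picks a provider at random with probability proportional
// to its rating; it reports false if the pool is empty or has zero weight.
func sampleProvider(pool []ProviderRating, rng *rand.Rand) (string, bool) {
	total := 0.0
	for _, p := range pool {
		total += p.Rating
	}
	if total <= 0 {
		return "", false
	}
	r := rng.Float64() * total
	for _, p := range pool {
		r -= p.Rating
		if r <= 0 {
			return p.ProviderID, true
		}
	}
	return pool[len(pool)-1].ProviderID, true
}

// pickProvider walks the pools in priority order and samples from the first
// pool that yields a provider, mirroring the fallback described above.
func pickProvider(poolsInOrder [][]ProviderRating, rng *rand.Rand) (string, bool) {
	for _, pool := range poolsInOrder {
		if id, ok := sampleProvider(pool, rng); ok {
			return id, ok
		}
	}
	return "", false
}
```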
Up to this point, we assumed that all considered providers can handle a given request. But determining whether it is true or not for each provider is a separate and crucial step happening in advance. It is described in routing strategies.