Provider rating
The core thing of DRPC’s load balancing algorithm is a provider rating. After filtering providers capable of serving the request, the final provider is chosen randomly, and the probability to be chosen is proportional to the rating.
Now we will tell more about how the rating is calculated as of the moment this doc is being created.
Rating dimension#
The initial concept to discuss is rating dimensions. Each provider can host many different networks in different locations, and perform differently on different methods. Therefore, each provider has a rating for each available combination. These combinaitons of parameters are called "dimensions".
A rating dimension can be described as a Go structure.
type RatingDimension struct {
Method string
Chain dshackle.ChainRef
SourceRegion reducerstates.Region
Kind DimensionKind
}Let’s describe what each field means.
Method. Directly represents the method of the request. Since different methods require varying amounts of time and effort to handle, it is essential to differentiate each method as one of the dimensions.Chainis a network ID. Obviously, we need to have separate ratings for different chains.SourceRegionhelps us to compare provider`s performance for requests originating from different regions. For example, providers in Europe typically have lower latency for requests from Europe than those from the US.Kindis a subset of providers we want to calculate ratings for.- All — contains all available providers
- Public — contains only public free providers
- PublicBestLatency — formed dynamycally from the Public pool, contains well-performing public providers
- PublicLowPriority — special pool that aggregates rate limited public RPCs (i.e. providers not working with dRPC)
- BestLatency — contains only paid, well performing providers
- BestLatency Top - contains only paid, top performing providers
- MEV - special pool for providers with MEV services
Rating calculation#
The rating uses a provider performance prediction, provider price, and alignment with incentives as features to determine the probability of serving a request in a given dimension.
The performance prediction algorithm predicts a benchmark that embraces latencies, number of errors, and derivative features, such as EMA (Exponential Moving Average). The last set of features lets the system "remember" latencies older than a second ago.
Note that the algorithm applies penalties to a part of errors, and it excludes errors caused by users.
The decision if a penalty must be applied is done in a few steps:
- Errors are divided into retriable and non-retriable. An error is called retriable if at least 5% of requests with a particular error are successfully processed by another provider after a retry.
- We check if an error is one of the following internal types:
- RetryableReplyErrorQualityScore (dshackle is unavailable or unauthenticated)
- NonRetryableReplyErrorQualityScore (timeouts and gRPC cancellations)
- RetryableErrorQualityScore (node errors)
- Any other retriable error
EMA (opens in a new tab) is a term from signal processing. It can be represented as follows
EMA lets create features that "remember" their previous values since they literally use their previous values to calculate new ones. The alpha parameter controls how long-term the memory is.
For example, calculating EMA with an alpha=0.06 will eventually produce a metric close to an average value in the last 30 seconds.
Incentives are special conditions by meeting which providers can increase their ratings. Examples of incentives include:
- Support of particular networks
- Support of special methods in particular networks
- Keeping local nodes in certain regions
- Keeping low error rate
The list of incentives is updated. We may introduce new incentives or end old ones. We always inform providers about news related to incentives.
The rating algorithm takes the prediction, the prices, incentives, and produce new ratings every 5 seconds. The ratings are calculated and stored on each dproxy independently, meaning the ratings could vary but not substantially.
Performance prediction#
The prediction algorithm uses the performance data of a provider in the last second. We operate in a certain rating dimension and use data only from this dimension (chain, region, etc). Let's see what is inside the performance data.
Average latency. The average latency in the last second for each provider in each dimension.
Error rate. We have very little tolerance for errors, so if a provider returns 10-20 errors, it will be heavily weighted within the ML algorithm and the next prediction of the performance will be very low. The error effect on rating can last from seconds to hours, depending on the number of errors.
Stability. In addition to average latency and error rate, we consider metric's stabilities. Specifically, providers with similar latency and error rate but lower standard deviation would be slightly prioritized. Standard deviation is a function describing how far a metric deviates from its average. The lower the value, the more stable the metric.
EMAs. As described above, it allow the rating algorithm to remember the history of the provider's performance.
While performance data is the features, the latency with error penalties is the target. The error penalty is an artifically created large number that is put in the dataset. It increases the average benchmark and serves as a penalty.
Essentially we try to predict how well the provider will perform in the next seconds, and then we create ratings based on this prediciton and the prices.
We have been testing Linear Regressions and various algorithms of time-series forecasting. Our priority was the simplicity and rapid execution with a decent prediciton capabilities.
For Linear Regression, we use VowpalWabbit (opens in a new tab) machine learning framework that is written in C++ with Go bindings. We developed training and inference processes of our models with this framework. It works very well on a large scale with a high load. It can be used for online learning. Now we don't use the online learning. An inference model on millions of inputs takes less than half a second.
Rating Calculation Algorithm
Our input is a set of providers and their performance predictions (one for each provider), prices offered by providers in a given dimension, and applicable incentives. If a provider has several dshackles, we use a dschackle with the best predicted performance for rating calculation. After calculation, the rating is deaggregated and assigned to each dshackle by their performances. This means that adding more dshackle can't increase the rating unless their latency is lower compared to other dshackles.
Since predictions are highly correlated with the latencies and error rates, low predictions usually cause higher ratings.
We start by calculating pre-ratings from predictions. First, we set a mapping between thresholds and multipliers. For example:
| Threshold | Multiplier |
|---|---|
| 0 ms | 1 |
| 10 ms | 1 |
| 20 ms | 2 |
| 50 ms | 4 |
| 75 ms | 8 |
| … | … |
| 30 s | 2^30 |
The exact mapping can change.
Threshold signifies a difference between the lowest latency in a dimension and a given provider's latency.
Multiplier is a ratio between the rating of the fastest provider and a given provider.
This means that, for example, a provider with a latency of 75 ms higher than the lowest one must have a rating 8 times lower than the fastest provider.
If a provider's latency is in between thresholds, linear interpolation between closest multipliers and thresholds is used to determine a multiplier.
Once set, a simple equation is constructed:
Where — pre-rating of the fastest provider, each term is a rating of some other provider, and — multipliers chosen according to the mapping. The expressions sums up to 1 because each rating is basically a share of traffic dedicated to a provider.
Since multipliers are known, we easily find and all other ratings.
Next, provider prices are transformed in a way that when multiplied by pre-ratings, they increase ratings of providers with higher discounts if their discount is higher than usual discount in a dimension.
Each price transofrmed in the following way:
for price in prices:
price_transformed = (dimension_max_price - price) / dimension_max_priceThe output of this step is a set of numbers between 0 and 1 where each number corresponds to a certain provider.
Incentives features are transofrmed in a number between 0 and 1 for each provider in a similar way. 1 means that a provider is the only one matching the incentive rule, and 0 means that a provider does not match the incentive conditions.
To transform standard deviations into a vector of real numbers between 0 and 1, we use a softmax (opens in a new tab) function with a very high temperature parameter.
Then, secondary features are summed up together with a vector of ones. The produces a vector of features where each feature is more or equal to 1.
Finally, pre-ratings and output features are multiplied pointwise and normalized, and we get provider ratings.
Final ratings are numbers between 0 and 1 representing a share of traffic (or the probability to serve a request) of each provider at a given dimension. When stored to the database and shown in dashboards, the ratings are typically multiplied by 1e9 for storage optimization.
This approach ensures that slower and more expensive providers get consistently lower ratings than the fastest providers. Latency and error rate are still the most important features. In case of equal discounts, incentives, and stability (we call it "secondary features"), ratings decrease exponentially, starting from highest and almost equal ratings for providers with latencies in 0-10ms proximity to the lowest latency, down to 1e-9 fraction of the highest pre-rating if the latency is 30s higher than the latency of the fastest provider.
The ratings are practically never equal to 0. We don't want the winner takes all system so that we can test and update the performance data of each provider as well as provide the most robust and reliable service to our clients.
Notes on how the algorithm behaves:
- If discounts or other secondary parameters are all equal in a dimension (e.x. 0%), they do not affect the rating; the rating is based on latencies and error rates only
- If latencies and error rates are all equal in a dimension, they do not affect the rating; the rating is based on secondary features only
- Secondary features of worse-performing providers have lower effect then secondary features of well-performing providers
- Ratings of providers with latencies close to the best latency are high and close, ranged by the secondary features
- Ratings of providers with latencies far from the best latency are low
- Difference in latencies of <= 10ms is now considered insignificant by the algorithm
The rating can be increased or decreased only by 4 factors: latency, errors (averages and stabilities), price and matching incentive conditions. However, as you might have noticed, the rating is dependant on ratings of all other providers in the dimension. Thus, since any provider may change their parameters, even if your parameters stay untouched, your rating may change because of actions of other providers.
Sampling providers
On every request, we take the ratings of available providers in the request's dimension and sample a provider randomly where the probability of being chosen is the rating. If the request is served, the process ends. Otherwise, we can try to select another provider or move to the next pool and repeat this action.
Generally, we choose a provider for a request from pools in the following order:
- BestLatencyTop — Only for paid requests
- BestLatency — Only for paid requests
- Default — Only for paid requests
- PublicBestLatency
- Public
- PublicLowPriority
- All
Up to this point, we assumed that all considered providers can handle a given request. But determining whether it is true or not for each provider is a separate and crucial step happening in advance. It is described in routing strategies.