Overview#
PERMANOVA for covariate shift monitoring#
One known problem of machine learning models in production that affects their predictive ability is covariate shift. It is defined as a change in the distribution of one or more independent variables used to train the model.
ANOVA is often adopted to assess if two samples are from the same population by comparing the variance of their means (H0: all \(\mu\)’s are equal; H1: at least one pair of \(\mu\)’s are not equal). This test relies, however, on the normality assumption of the samples, which makes it a non-viable solution to effectively monitor batches of data.
PERMANOVA is a multivariate version of ANOVA based on the pseudo-F statistic, which makes use of permutations, allowing for a non-parametric estimation.
In the case of covariates shift monitoring, the test compares the original sample \(s_0\) used at time \(t_0\) to train the model with a new, unseen sample \(s_1\) on which the model made predictions at time \(t_1\).
The test starts by computing the pseudo-F statistics on the two samples as follows:
where \(p\) is the number of groups (in this case \(p = 2\)), \(n\) is the number of observations in a group, and \(N\) is the total number of observations.
\(SS_W\) is the sum of squares within the groups (where \(\delta _{ij}\) is 1 if the observations i and j belong to the same group and 0 otherwise):
and \(SS_B\) is the sum of squares between the groups:
\(SS_T\) is the total sum of squares:
Then, for each permutation, \(\pi\) items are shuffled between each group, and the respective \(F^\pi\) is computed. The p-value is then computed as follows:
If the p-value is lower than the chosen significance level \(\alpha\) then the null hypothesis is rejected, so the variances of the two groups differ. This can be used as a first alert to further analyse the new sample and evaluate if a model retrain could be necessary.
A “lightweight” implementation of PERMANOVA#
The usual implementation of PERMANOVA relies on the distance matrix between every observation in the samples; this results in a computationally expensive test.
However, if the observations are considered as points in the Euclidean space, it is possible to revise the previous formulas leveraging distances from centroids.
\(SS_T\) = the sum of squared distances from the observations to the overall centroid
\(SS_W\) = the sum of squared distances from the observations to their own group centroid
This allows to reduce the time complexity of the algorithm from \(O(N^2)\) to \(O(N)\), at the expense of relying on Euclidean distance. If another metric could be more suitable to the variables that describe the samples, the complete distance matrix should be adopted.