Does Predictor Order Affect the Results of Linear Regression?
In the last few weeks I encountered the following question in two separate discussions with friends and colleagues: When doing linear regression in R, will your coefficients vary if you change the order of the predictors in your data?
My intuitive answer to this was a clear “No”. Mostly because the practical implications of the alternative would be too severe. It would make linear regression unusable in many research settings in which you rely on the fitted coefficients to understand the relationship between the predictors and the independent variable. After all, what would be the correct, “canonical” order of inputs that you should use to estimate these effects?
The problem might be less relevant in other settings where you are more focused on building models with high predictive performance, like it is the case in many business applications. In these scenarios you are likely to be using more powerful, black box models such as XGBoost or random forest models to begin with. However, even then it is common practice to use a regression model as a baseline to better understand your data and benchmark your production models against.
So the issue seemed very relevant to me but I realised I did not have many arguments handy to support my intuitive answer. Therefore I decided to spend some time to explore the issue further and in the process refresh some of the numerical theory behind how the linear regression problem is solved on a computer.
Turns out that during the course of my investigation which I want to summarise in this blog post my simple “No” has become more of an “It depends…”
Note that I will only be looking at R’s standard linear regression function, as implemented in the lm function from the stats package.
A First Look at the Problem
First of all, let’s try to get some more clarity about the exact question that we will be trying to answer here. Let’s say we are fitting two linear regression models on the good old mtcars dataset:
> m1 <-
+ lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + gear, data = mtcars)
> m2 <-
+ lm(formula = mpg ~ qsec + hp + gear + cyl + drat + wt + disp, data = mtcars)
These two models contain the same predictors, just in a different order. Now if we compare the fitted coefficients of both models, can we expect them to be the same?
So let’s put the coefficients of both models side by side and calculate their relative error:
## key coeff_m1 coeff_m2 rel_error
## 1 (Intercept) 18.58647751 18.58647751 1.338016e-15
## 2 cyl -0.50123400 -0.50123400 5.094453e-15
## 3 disp 0.01662624 0.01662624 1.043365e-15
## 4 hp -0.02424687 -0.02424687 1.430884e-16
## 5 drat 1.00091587 1.00091587 1.331049e-15
## 6 wt -4.33688975 -4.33688975 4.095923e-16
## 7 qsec 0.60667859 0.60667859 1.830002e-16
## 8 gear 1.04427289 1.04427289 6.378925e-16
So while the way the values are displayed suggests that they might be identical, we can see from the column rel_error that they are actually not exactly the same.
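For reference, a side-by-side comparison like the one above can be put together along the following lines. This is only a sketch: compare_coefficients is a made-up helper name, the tibble package is assumed to be loaded, and the rel_error column uses one plausible definition of the relative error.
> compare_coefficients <- function(m1, m2) {
+   c1 <- coefficients(m1)
+   c2 <- coefficients(m2)[names(c1)]   # align m2's coefficients with m1's order
+   tibble(key       = names(c1),
+          coeff_m1  = c1,
+          coeff_m2  = c2,
+          rel_error = abs((c1 - c2) / c1))
+ }
> compare_coefficients(m1, m2)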
To make sure that this fluctuation is actually a result of the predictor permutation and not a result of some potential non-determinism within the lm function, let’s see what happens when we fit the exact same model twice:
> all(coefficients(lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + gear, data = mtcars)) ==
+ coefficients(lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + gear, data = mtcars)))
## [1] TRUE
It is reassuring to see that in this case the coefficients are exactly the same.
So here we have a first interesting finding: We can expect at least slight fluctuations in the coefficients when we play around with the order of the predictor variables.
Later on we will get a better idea of where these small fluctuations come from when we look at how R solves the linear regression problem. However, at least in our example above, fluctuations on the order of \(\approx 10^{-16}\) will probably have no practical relevance. They also happen to be on the same order of magnitude as the machine precision for variables of type double.
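For reference, that machine precision can be queried directly in R; on a typical machine with IEEE 754 doubles it is:
> .Machine$double.eps
## [1] 2.220446e-16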
Simulation Study
So after our little experiment with the mtcars dataset, can we consider our question answered? Not quite, as running two models on one single dataset can hardly be sufficient to answer such a general question.
For a more solid answer we will need to take a look at the theory of linear regression, which we will do in a minute. But first let’s take an intermediate step and scale our little mtcars experiment to a larger number of datasets. This will allow us to see if there are other, more severe cases of fluctuating coefficients and will also serve as a nice transition into the more theoretical parts of this post.
Getting Datasets from OpenML
To do this simulation study we obtained a number of datasets from the great OpenML dataset repository using their provided R package. Overall 207 datasets with between 25 and 1000 observations and between 3 and 59 predictors were obtained this way. This is obviously not a big sample of datasets to base general statements on. However, it should provide us with enough educational examples to get a better understanding and intuition of the kinds of issues you may encounter when fitting linear models “in the wild”.
The details and results from the study can be reviewed and replicated from the R Notebook here.
For each of the processed datasets we generated a number of predictor permutations and fitted a model for each of them, using the same target variable each time. Overall we ended up with 3,764 fitted models with an average of 19.5 fitted coefficients each.
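Stripped of the OpenML plumbing, the core of this procedure boils down to something like the following sketch. The names fit_permutations, dat and target are placeholders for this post, not code from the actual notebook, which differs in its details:
> fit_permutations <- function(dat, target, n_permutations = 20) {
+   predictors <- setdiff(names(dat), target)
+   lapply(seq_len(n_permutations), function(i) {
+     shuffled <- sample(predictors)                  # random predictor order
+     f <- reformulate(shuffled, response = target)   # e.g. y ~ x3 + x1 + x2
+     coefficients(lm(f, data = dat))
+   })
+ }
> coef_list <- fit_permutations(mtcars, "mpg")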
Quantifying the Coefficient Fluctuations
As we have already learned from our mtcars example above, we cannot expect the coefficients across different fits to be exactly the same. So we need to come up with a meaningful way to detect fluctuations that we would consider problematic.
For this we will use the coefficient of variation (CV) which is equal to the standard deviation normalised by the mean. It is therefore a measure of spread/variation of a set of observations but unlike the standard deviation it is expressed on a unit-less scale relative to the mean. This will allow us to compare the CVs of the different coefficients.
For every predictor in each dataset we calculated the CV across all the fitted coefficients from the different models (i.e. predictor permutations). To identify extreme cases of fluctuations we flagged those predictors for which a CV of more than \(10^{-10}\) was measured. This limit has been chosen somewhat arbitrarily but should represent a quite conservative approach, as it limits the maximum deviation of any given coefficient from the mean to \(\sqrt{N-1} * 10^{-10}\) times that mean. Here \(N\) is the number of fits for that coefficient, which in our case is between 2 and 20 depending on the dataset.
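As a sketch, the CV and the flag could be computed like this for a single predictor (coefficient_cv and flag_high_cv are made-up helper names, coefs a numeric vector of the fitted values of one predictor across the permutations; dividing by the absolute value of the mean is a choice made here so that the sign of the coefficient does not matter):
> coefficient_cv <- function(coefs) {
+   sd(coefs) / abs(mean(coefs))        # standard deviation relative to the mean
+ }
> flag_high_cv <- function(coefs, threshold = 1e-10) {
+   coefficient_cv(coefs) > threshold   # TRUE marks a problematic predictor
+ }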
Simulation Results
In our simulation around 14.3 % of the coefficients showed a CV value above the chosen threshold of \(10^{-10}\), involving around 7.7 % of the datasets. The plot below also shows the overall distribution of the CV in a histogram on a log 10 scale.
From the plot we can see that some of the coefficients fluctuate on a magnitude of up to \(10^3\), way above our defined threshold of \(10^{-10}\) (dashed red line). Unlike the fluctuations on the mtcars dataset, these kinds of fluctuations can certainly not be ignored in practice.
Analysis
So let’s try to do some troubleshooting to find out how these fluctuations come about. First, let’s take a closer look at the datasets in which the problematic coefficients occur:
## # A tibble: 16 x 7
## dataset_id max_cv high_cv perc_coefficients_missi… dep_var n_obs n_cols
## <int> <dbl> <lgl> <dbl> <chr> <int> <int>
## 1 195 2.64 TRUE 0.0476 price 159 21
## 2 421 5321. TRUE 0.426 oz54 31 54
## 3 482 11.5 TRUE 0.0217 events 559 46
## 4 487 410. TRUE 0.268 response_1 30 41
## 5 513 3.91 TRUE 0.0217 events 559 46
## 6 518 1.02 TRUE 0.222 Violence_ti… 74 9
## 7 521 1284. TRUE 0.118 Team_1_wins 120 34
## 8 527 11.3 TRUE 0 Gore00 67 15
## 9 530 1.17 TRUE 0.0833 GDP 66 12
## 10 533 35.0 TRUE 0.0217 events 559 46
## 11 536 24.1 TRUE 0.0217 events 559 46
## 12 543 2462. TRUE 0.111 LSTAT 506 117
## 13 551 30.5 TRUE 0.188 Accidents 108 16
## 14 1051 70.4 TRUE 0.238 ACT_EFFORT 60 42
## 15 1076 194. TRUE 0.383 act_effort 93 107
## 16 1091 49.8 TRUE 0 NOx 59 16
One thing that stands out is that for almost all of the problematic datasets there seem to be coefficients that could not be fitted (see column perc_coefficients_missing). This suggests that the lm function ran into some major problems while trying to fit some of the models on these datasets.
Underdetermined Problems
To understand how that happens, let’s start by looking at a subset of the problematic cases: some of the datasets (dataset_ids 421, 487 and 1076) contain more predictors than there are observations in the dataset (n_pred > n_obs). For linear regression this poses a problem as it means that we are trying to estimate more parameters than we have data for (i.e. the problem is underdetermined).
It is easy to understand the issue when thinking about a scenario in which we have only one predictor \(x\) and a target variable \(y\). Let’s say we try to fit a regression model including an intercept (the offset from the y axis) and a slope for \(x\):
\[y = a + b*x\]
Further, let’s assume we only have one observation \((x,y) = (1,1)\) in our data to estimate the two coefficients \(a\) and \(b\) (i.e. more parameters than observations). It is easy to see that there are infinitely many pairs \(a\) and \(b\) that will fit our data, some of which are shown below:
So when you are asking R to fit a linear regression in this scenario you are giving it somewhat of an impossible task. However, it will not throw an error (at least not when using the default settings). It will produce a solution, in our particular example the flat red line in the plot above:
> lm(y ~ ., data = tibble(x = 1, y = 1))
##
## Call:
## lm(formula = y ~ ., data = tibble(x = 1, y = 1))
##
## Coefficients:
## (Intercept) x
## 1 NA
So in the case of n_pred > n_obs we have an explanation for how the missing coefficients occur: R is “running out of data” for the surplus coefficients and will assign NA to them.
The “running out of data” part in the sentence above is a bit of a simplification. However, the estimation of the coefficients does seem to happen sequentially. At least that is the impression we get from a slightly more complicated example:
> example_2predictors <-
+ tibble(x1 = c(1, 2),
+ x2 = c(-1, 1)) %>%
+ mutate(y = x1 + x2)
>
> lm(y ~ x1 + x2, data = example_2predictors)
##
## Call:
## lm(formula = y ~ x1 + x2, data = example_2predictors)
##
## Coefficients:
## (Intercept) x1 x2
## -3 3 NA
> lm(y ~ x2 + x1, data = example_2predictors)
##
## Call:
## lm(formula = y ~ x2 + x1, data = example_2predictors)
##
## Coefficients:
## (Intercept) x2 x1
## 1.5 1.5 NA
We can see that, at least in this example, lm seems to prioritise the estimation of the coefficients in the order in which we supply them in the model formula, assigning NA to whichever predictor comes last.
What is even better, the example also provides a case of fluctuating coefficients:
The intercept in the first model is -3 and becomes 1.5 in the second one.
And this intuitively makes sense as well. When we try to fit the model with too little data to estimate all the predictors, what lm ends up doing is fitting a “reduced” model with fewer predictors. This limited set of predictors will explain the data in a different way and therefore result in different coefficients.
The Least Squares Problem
So now we have a pretty good understanding of where the fluctuating coefficients originate in the case of underdetermined problems such as the ones above. The remaining datasets on the list above are not underdetermined, but upon closer inspection we can see that they are suffering from the same underlying issue.
For this it is necessary to understand the theoretical foundation of how linear regression problems are solved numerically, the least squares method. A good and concise derivation of it can be found in these lecture notes.
The least squares formulation for linear regression turns out to be as follows:
\[X^tX\beta = X^tY\] In the equation above \(Y\) is the target variable we want to fit our data to and \(\beta\) is the vector holding the coefficients we want to estimate. \(X\) is the design matrix representing our data. Each of its \(n\) rows corresponds to one observation and each of its \(p\) columns to one predictor.
Now in order for us to solve the least squares problem we need the matrix \(X^tX\) to be invertible. If that is the case we can simply multiply both sides of the equation above by \((X^tX)^{-1}\) and have a solution for our coefficients \(\beta\).
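To make the equation concrete, here is a sketch of solving it directly for the (well-behaved) mtcars model from the beginning of the post. This is only an illustration of the formula; as we will see further down, lm itself does not solve the normal equations this way but uses a QR decomposition:
> X <- model.matrix(mpg ~ cyl + disp + hp + drat + wt + qsec + gear, data = mtcars)
> y <- mtcars$mpg
> beta <- solve(t(X) %*% X, t(X) %*% y)            # solve X^t X beta = X^t y
> all.equal(as.vector(beta), unname(coefficients(m1)))  # TRUE up to numerical precision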
However, for \(X^tX\) to be invertible it needs to have full rank. The rank is a fundamental property of a matrix which has many different interpretations and definitions. For our purposes it is only important to know that \(X^tX\) has full rank if its rank is equal to the number of its columns \(p\) (which in our case is the same as the number of predictors). By one of the properties of the rank, the rank of \(X^tX\) can be at most the minimum of \(n\) and \(p\).
Now we can see why we ran into problems when we tried to fit a linear regression model for the datasets that had fewer observations (\(n\)) than predictors (\(p\)): the matrix \(X^tX\) had rank \(< p\) (i.e. not full rank) and was therefore not invertible.
As a consequence the underlying least squares problem for these cases was not solvable (at least not in a proper way as we will see in a bit).
Multicollinearity
With our understanding of the underdetermined problems and the issues they cause in the least squares formulation we now have all the necessary tools to explain the other cases of fluctuating coefficients. As we have pointed out, the remaining cases are not underdetermined (they have \(n \geq p\)); instead they suffer from a slightly more subtle issue called multicollinearity.
Multicollinearity arises when some of the columns in your data can be represented as a linear combination of a subset of the other columns.
For example, the columns of the following dataset have multicollinearity:
## # A tibble: 5 x 4
## y x1 x2 x3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 1 -1 0
## 2 1 2 1 3
## 3 1 3 -1 2
## 4 0 4 1 5
## 5 0 5 -1 4
If we know two of the three predictor columns we can always calculate the third (here x3 = x1 + x2). One of the columns simply represents the same data as the other two in a redundant fashion. We could remove any one of the three and still retain the same information in the data.
It might not be clear at first sight why having additional columns that represent the same data in a redundant way should pose problems but it becomes obvious when we analyse it with the same tools that we used above.
One of the definitions of the rank of a matrix is the number of linearly independent columns. But as we have just seen, in a dataset with multicollinearity some of the columns will be linear combinations of other columns. Therefore, if our matrix \(X\) has multicollinearity in its columns, its rank will be less than the number of its columns \(p\). And since \(rank(X^tX) = rank(X) < p\), this means that \(X^tX\) is not invertible.
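This is easy to verify numerically for the small example above (X_collinear is just an ad-hoc name for its predictor columns):
> X_collinear <- cbind(x1 = c(1, 2, 3, 4, 5),
+                      x2 = c(-1, 1, -1, 1, -1),
+                      x3 = c(0, 3, 2, 5, 4))    # x3 = x1 + x2
> qr(X_collinear)$rank                           # less than the 3 columns
## [1] 2
> qr(t(X_collinear) %*% X_collinear)$rank        # X^t X inherits the rank deficiency
## [1] 2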
Therefore we can trace the remaining cases of fluctuating coefficients back to the same underlying issue as in the \(n < p\) case, a rank-deficient \(X^tX\), and explain them in the same way!
Multicollinearity can sometimes be a bit difficult to detect, but it can be found in all of the remaining datasets in which we detected the coefficient fluctuations (see the accompanying R notebook document). Datasets 527 and 1091 are a special case: there the collinearity is between the predictors and the target variable, which also explains why there are no missing coefficients in these cases.
Why lm still returns results: QR Decomposition
So we have been able to explain all the cases of coefficient fluctuations by violations of basic assumptions of linear regression that every undergraduate textbook should warn you about:
- make sure you have at least as many observations as predictors you want to fit
- make sure your data has no multicollinearity in it
However, when we tried to fit a linear regression, R was able to provide us with results (albeit with some coefficients missing) without any error or even a warning.
How can it come to any solution if the inverse of \(X^tX\) in the matrix formulation above does not exist?
To understand this it is necessary to dive a bit deeper into the internal workings of the lm function in R. I highly recommend doing that as it will teach you a lot about how R works internally and how it interfaces with other languages (the actual numerical work uses Fortran code from the 1970s). This blog post offers a great guide to follow along if you are interested in those implementation details.
The bottom line for us here is that to solve the linear regression problem R will try to decompose our design matrix \(X\) into the product of two matrices \(Q\) and \(R\). With this factorization it becomes quite straightforward to solve the overall matrix equation for our coefficient vector \(\beta\).
Now without going into too much detail here, we can say that the QR decomposition can be carried out even when \(X\) does not have full rank, as the computations are essentially done for each coefficient sequentially.
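Base R exposes this decomposition directly, so we can sketch the same solution path outside of lm, reusing the design matrix X and response y from the normal-equations example above:
> qr_X <- qr(X)      # Householder-based QR decomposition of the design matrix
> qr_X$rank          # full rank here, so all coefficients can be estimated
> qr.coef(qr_X, y)   # least squares coefficients, matching coefficients(m1) up to numerical precision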
So the QR decomposition will produce meaningful values for the first few coefficients, up until the point where the algorithm aborts the calculation and returns NA for the remaining coefficients. This is exactly the behaviour we have seen in our little example datasets above and in our simulation study.
I was a bit surprised to see that R will just silently process the data even if it encounters these kinds of problems with the QR decomposition (I certainly remembered the behaviour differently). The lm function actually provides a way to change this behaviour by setting the singular.ok argument to FALSE. In this case lm will error out when it runs into a matrix with non-full rank and not return any results. The default for this parameter, however, is TRUE, resulting in the behaviour we have witnessed in our analysis. Interestingly, the help text of the lm function mentions that this parameter’s default was FALSE in R’s predecessor S.
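As a quick check of that switch, here is roughly what happens when we re-run the rank-deficient two-predictor example from above with singular.ok = FALSE (the exact wording of the error may differ between R versions):
> lm(y ~ x1 + x2, data = example_2predictors, singular.ok = FALSE)
## Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
##   singular fit encountered
The rank that the QR routine arrived at is also stored in the fitted object, e.g. lm(y ~ x1 + x2, data = example_2predictors)$rank returns 2 for this example, as only the intercept and x1 could be estimated.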
The output of lm provides some information about the QR fit, most importantly the rank of the matrix it encountered, which can be extracted from the fitted object. So let’s append this information to all the datasets in which we found fluctuations in the coefficients:
## # A tibble: 16 x 8
## dataset_id high_cv perc_coefficient… dep_var n_obs n_cols rank_qr singular_qr
## <int> <lgl> <dbl> <chr> <int> <int> <int> <lgl>
## 1 195 TRUE 0.0476 price 159 21 20 TRUE
## 2 421 TRUE 0.426 oz54 31 54 31 TRUE
## 3 482 TRUE 0.0217 events 559 46 45 TRUE
## 4 487 TRUE 0.268 respon… 30 41 30 TRUE
## 5 513 TRUE 0.0217 events 559 46 45 TRUE
## 6 518 TRUE 0.222 Violen… 74 9 7 TRUE
## 7 521 TRUE 0.118 Team_1… 120 34 30 TRUE
## 8 527 TRUE 0 Gore00 67 15 15 FALSE
## 9 530 TRUE 0.0833 GDP 66 12 11 TRUE
## 10 533 TRUE 0.0217 events 559 46 45 TRUE
## 11 536 TRUE 0.0217 events 559 46 45 TRUE
## 12 543 TRUE 0.111 LSTAT 506 117 104 TRUE
## 13 551 TRUE 0.188 Accide… 108 16 13 TRUE
## 14 1051 TRUE 0.238 ACT_EF… 60 42 32 TRUE
## 15 1076 TRUE 0.383 act_ef… 93 107 66 TRUE
## 16 1091 TRUE 0 NOx 59 16 16 FALSE
We can see that almost all of the datasets with coefficient fluctuations have encountered a singular QR decomposition. The only two exceptions (datasets 527 and 1091) are the ones we mentioned above in which there is collinearity between the target variable and the predictors. The issue with those datasets is therefore of a slightly different nature (see details in the R notebook), but they definitely represent ill-posed linear regression problems.
As a side note: the QR decomposition is computed numerically in finite-precision floating-point arithmetic (in the case of R using Householder reflections). The rounding errors this introduces depend on the column order and are likely the reason for the minimal coefficient fluctuations we have seen in the mtcars example and in our simulation study.
Pivoting in Fortran
One final thing I came across while trying to make sense of the Fortran code that does the actual QR decomposition is that it applies a pivoting strategy to improve the numerical stability of the problem. You can find the code and some documentation about the pivoting here.
The application of the pivoting strategy implies that in certain cases the Fortran function will try to improve the numerical stability of the QR decomposition by moving some of the columns to the end of the matrix.
This behaviour is relevant for our analysis here as it means there is some reordering of the columns going on which we do not have any control over in R.
This might influence our results: for the datasets in which we found no fluctuations, it could simply be that our prescribed column ordering was overridden in the Fortran code and therefore never had the chance to cause coefficient fluctuations.
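A sketch of how one could check this from R: a fitted lm object stores its QR decomposition by default (argument qr = TRUE), including the pivot vector that records the column order actually used:
> m1$qr$pivot                                  # column order used by the Fortran routine
> any(m1$qr$pivot != seq_along(m1$qr$pivot))   # TRUE if any columns were reordered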
Again, whether pivoting was applied is returned in the output of the lm function, so let’s append this information to our dataset summary and filter for all datasets on which pivoting was applied:
## # A tibble: 18 x 7
## dataset_id max_cv high_cv perc_coefficients_m… rank_qr singular_qr pivoting
## <int> <dbl> <lgl> <dbl> <int> <lgl> <lgl>
## 1 195 2.64e+ 0 TRUE 0.0476 20 TRUE TRUE
## 2 199 6.72e-15 FALSE 0.143 6 TRUE TRUE
## 3 217 2.00e-14 FALSE 0.0357 27 TRUE TRUE
## 4 421 5.32e+ 3 TRUE 0.426 31 TRUE TRUE
## 5 482 1.15e+ 1 TRUE 0.0217 45 TRUE TRUE
## 6 494 0. FALSE 0.333 2 TRUE TRUE
## 7 513 3.91e+ 0 TRUE 0.0217 45 TRUE TRUE
## 8 518 1.02e+ 0 TRUE 0.222 7 TRUE TRUE
## 9 521 1.28e+ 3 TRUE 0.118 30 TRUE TRUE
## 10 530 1.17e+ 0 TRUE 0.0833 11 TRUE TRUE
## 11 536 2.41e+ 1 TRUE 0.0217 45 TRUE TRUE
## 12 543 2.46e+ 3 TRUE 0.111 104 TRUE TRUE
## 13 546 1.94e-13 FALSE 0.0769 24 TRUE TRUE
## 14 551 3.05e+ 1 TRUE 0.188 13 TRUE TRUE
## 15 703 1.17e-11 FALSE 0.104 302 TRUE TRUE
## 16 1051 7.04e+ 1 TRUE 0.238 32 TRUE TRUE
## 17 1076 1.94e+ 2 TRUE 0.383 66 TRUE TRUE
## 18 1245 8.26e-15 FALSE 0.115 23 TRUE TRUE
As we can see, pivoting was only applied in datasets where we also encountered a singular QR decomposition, which makes sense as feeding a rank-deficient matrix into the algorithm is exactly the kind of numerically unstable situation the pivoting is meant to handle. Among those datasets there are a total of 6 where pivoting was applied but no coefficient fluctuations were detected. Whether the pivoting prevented the fluctuations from happening or whether it was another property of these datasets is hard to say.
Either way we can conclude that the pivoting has not interfered with the results of the vast majority of the datasets in which we did not see any fluctuations.
Conclusion
We started this post with a seemingly trivial question which in the end took us on quite a journey: it involved a simulation study, revisiting the linear algebra behind the linear regression problem and taking a closer look at how the resulting matrix equations are solved numerically in R (or rather in Fortran).
All the cases of coefficient fluctuations we could detect in our investigation were caused by violations of basic assumptions of linear regression and not the permutations of the input variables per se.
Therefore our conclusion is that as long as you do your due diligence when fitting regression models, you should not have to worry about the order of your predictors having any practically relevant impact on your estimated coefficients.