Sorting Data Independently Before Regression


This thread on StackExchange has been circulating on my Twitter timeline today and I couldn’t resist sharing it here:

Suppose we have data set (X_i, Y_i) with n points. We want to perform a linear regression, but first we sort the X_i values and the Y_i values independently of each other, forming data set (X_i, Y_j). Is there any meaningful interpretation of the regression on the new data set? Does this have a name?

I don’t want to blame the author of the question. The procedure itself, though, shows plain ignorance of basic statistical concepts. At first sight this might look like a beginner’s misunderstanding, but the next part is what really kills it:

But my manager says he gets “better regressions most of the time” when he does this […]. I have a feeling he is deceiving himself.

This isn’t incompetence anymore – this is deliberate torture of statistics.
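
To see why the manager gets “better regressions”, here is a minimal sketch in Python with NumPy. It is my own illustration, not code from the thread: the sample size, the seed, and the helper name `r_squared` are assumptions. It draws two variables that are independent by construction, then compares the R² of a simple linear regression on the original pairs with the R² after sorting each variable on its own.

```python
import numpy as np

# Assumed setup for illustration: 1000 points, fixed seed for reproducibility.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = rng.normal(size=n)  # independent of x by construction


def r_squared(a, b):
    """R^2 of a simple linear regression of b on a.

    With one predictor, R^2 equals the squared Pearson correlation.
    """
    r = np.corrcoef(a, b)[0, 1]
    return r ** 2


# Original pairing: there is no relationship, so R^2 should be near zero.
r2_paired = r_squared(x, y)

# Sorting each variable independently destroys the pairing: the smallest x
# is matched with the smallest y, and so on, which forces a monotone
# (and here nearly linear) relationship that is not in the data.
r2_sorted = r_squared(np.sort(x), np.sort(y))

print(f"R^2 with original pairs:    {r2_paired:.3f}")
print(f"R^2 after independent sort: {r2_sorted:.3f}")
```

The sorted version is essentially a quantile-quantile plot of the two samples. It says something about whether the two marginal distributions have a similar shape, and nothing about how Y depends on X for any individual observation.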

