TY - JOUR

T1 - Classification of microarray data with penalized logistic regression

AU - Eilers, P. H. C.

AU - Boer, J. M.

AU - Van Ommen, G. J.

AU - Van Houwelingen, H. C.

PY - 2001

Y1 - 2001

N2 - Classification of microarray data needs a firm statistical basis. In principle, logistic regression can provide it, modeling the probability of class membership with (transforms of) linear combinations of explanatory variables. However, classical logistic regression does not work for microarrays, because there are generally far more variables than observations. One problem is multicollinearity: the estimating equations become singular and have no unique, stable solution. A second problem is over-fitting: a model may fit a data set well but perform badly when used to classify new data. We propose penalized likelihood as a solution to both problems. The values of the regression coefficients are constrained in a similar way as in ridge regression. All variables play an equal role; there is no ad hoc selection of the most "relevant" or most "expressed" genes. The dimension of the resulting system of equations equals the number of variables and will generally be too large for most computers, but it can be dramatically reduced with the singular value decomposition of some matrices. The penalty is optimized with AIC (Akaike's Information Criterion), which is essentially a measure of prediction performance. We find that penalized logistic regression performs well on a public data set (the MIT ALL/AML data).

AB - Classification of microarray data needs a firm statistical basis. In principle, logistic regression can provide it, modeling the probability of class membership with (transforms of) linear combinations of explanatory variables. However, classical logistic regression does not work for microarrays, because there are generally far more variables than observations. One problem is multicollinearity: the estimating equations become singular and have no unique, stable solution. A second problem is over-fitting: a model may fit a data set well but perform badly when used to classify new data. We propose penalized likelihood as a solution to both problems. The values of the regression coefficients are constrained in a similar way as in ridge regression. All variables play an equal role; there is no ad hoc selection of the most "relevant" or most "expressed" genes. The dimension of the resulting system of equations equals the number of variables and will generally be too large for most computers, but it can be dramatically reduced with the singular value decomposition of some matrices. The penalty is optimized with AIC (Akaike's Information Criterion), which is essentially a measure of prediction performance. We find that penalized logistic regression performs well on a public data set (the MIT ALL/AML data).

KW - AIC

KW - Cross-validation

KW - Generalized linear models

KW - Genetic expression

KW - Multicollinearity

KW - Multivariate calibration

KW - Ridge regression

KW - Singular value decomposition

UR - http://www.scopus.com/inward/record.url?scp=0034863834&partnerID=8YFLogxK

U2 - 10.1117/12.427987

DO - 10.1117/12.427987

M3 - Conference article

AN - SCOPUS:0034863834

SN - 0277-786X

VL - 4266

SP - 187

EP - 198

JO - Proceedings of SPIE - The International Society for Optical Engineering

JF - Proceedings of SPIE - The International Society for Optical Engineering

T2 - Microarrays: Optical Technologies and Informatics

Y2 - 21 January 2001 through 22 January 2001

ER -