TY - JOUR
T1 - Predicting gene function through systematic analysis and quality assessment of high-throughput data
AU - Kemmeren, Patrick
AU - Kockelkorn, Thessa T.J.P.
AU - Bijma, Theo
AU - Bonders, Rogier
AU - Holstege, Frank C.P.
N1 - Funding Information:
We thank Philip Lijnzaad, Thomas Schlitt, Harm van Bakel, and Arnaud Leijen for discussions and technical support. Supported by grants from the Netherlands Organization for Scientific Research (NWO; 05050205, 016026009, 05071002) and by the European Union Fifth Framework project TEMBLOR.
PY - 2005/4/15
Y1 - 2005/4/15
N2 - Motivation: Determining gene function is an important challenge arising from the availability of whole genome sequences. Until recently, approaches based on sequence homology were the only high-throughput method for predicting gene function. Use of high-throughput generated experimental data sets for determining gene function has been limited for several reasons. Results: Here a new approach is presented for integration of high-throughput data sets, leading to prediction of function based on relationships supported by multiple types and sources of data. This is achieved with a database containing 125 different high-throughput data sets describing phenotypes, cellular localizations, protein interactions and mRNA expression levels from Saccharomyces cerevisiae, using a bit-vector representation and information content-based ranking. The approach takes characteristic and qualitative differences between the data sets into account, is highly flexible, efficient and scalable. Database queries result in predictions for 543 uncharacterized genes, based on multiple functional relationships each supported by at least three types of experimental data. Some of these are experimentally verified, further demonstrating their reliability. The results also generate insights into the relative merits of different data types and provide a coherent framework for functional genomic datamining.
AB - Motivation: Determining gene function is an important challenge arising from the availability of whole genome sequences. Until recently, approaches based on sequence homology were the only high-throughput method for predicting gene function. Use of high-throughput generated experimental data sets for determining gene function has been limited for several reasons. Results: Here a new approach is presented for integration of high-throughput data sets, leading to prediction of function based on relationships supported by multiple types and sources of data. This is achieved with a database containing 125 different high-throughput data sets describing phenotypes, cellular localizations, protein interactions and mRNA expression levels from Saccharomyces cerevisiae, using a bit-vector representation and information content-based ranking. The approach takes characteristic and qualitative differences between the data sets into account, is highly flexible, efficient and scalable. Database queries result in predictions for 543 uncharacterized genes, based on multiple functional relationships each supported by at least three types of experimental data. Some of these are experimentally verified, further demonstrating their reliability. The results also generate insights into the relative merits of different data types and provide a coherent framework for functional genomic datamining.
UR - http://www.scopus.com/inward/record.url?scp=17444416487&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/bti103
DO - 10.1093/bioinformatics/bti103
M3 - Article
C2 - 15531615
AN - SCOPUS:17444416487
SN - 1367-4803
VL - 21
SP - 1644
EP - 1652
JO - Bioinformatics
JF - Bioinformatics
IS - 8
ER -