TY - JOUR
T1 - Scaling bio-analyses from computational clusters to grids
AU - Byelas, Heorhiy
AU - Dijkstra, Martijn
AU - Neerincx, Pieter
AU - Van Dijk, Freerk
AU - Kanterakis, Alexandros
AU - Deelen, Patrick
AU - Swertz, Morris
PY - 2013
Y1 - 2013
N2 - Life sciences have moved rapidly into big data thanks to new parallel methods for gene expression, genome-wide association, proteomics and whole-genome DNA sequencing. The scale of these methods is growing faster than predicted by Moore's law. This has introduced new challenges and a need for methods for specifying computation protocols, e.g. for Next-Generation Sequencing (NGS) and genome-wide association study (GWAS) imputation analyses. Running these analyses on a large scale is a complicated task, due to the many steps involved, long runtimes, heterogeneous computational resources and large files. The process becomes error-prone when dealing with hundreds of samples, such as in genomic analysis facilities, if it is performed without an integrated workflow framework and data management system. From recent projects we learnt that bioinformaticians do not want to invest much time in learning advanced grid or cluster scheduling tools; they prefer to concentrate on their analyses, stay close to old-fashioned shell scripts that they can fully control, and have automatic mechanisms take care of all submission and monitoring details. We present a lightweight workflow declaration and execution system to address these needs, built on top of the MOLGENIS framework for data tracking. We describe lessons learnt when scaling NGS and imputation analyses from computational clusters to grids and show the application of our solution, in particular in the nation-wide "Genome of the Netherlands" project (GoNL, 700 TB of data and about 200,000 computing hours).
AB - Life sciences have moved rapidly into big data thanks to new parallel methods for gene expression, genome-wide association, proteomics and whole-genome DNA sequencing. The scale of these methods is growing faster than predicted by Moore's law. This has introduced new challenges and a need for methods for specifying computation protocols, e.g. for Next-Generation Sequencing (NGS) and genome-wide association study (GWAS) imputation analyses. Running these analyses on a large scale is a complicated task, due to the many steps involved, long runtimes, heterogeneous computational resources and large files. The process becomes error-prone when dealing with hundreds of samples, such as in genomic analysis facilities, if it is performed without an integrated workflow framework and data management system. From recent projects we learnt that bioinformaticians do not want to invest much time in learning advanced grid or cluster scheduling tools; they prefer to concentrate on their analyses, stay close to old-fashioned shell scripts that they can fully control, and have automatic mechanisms take care of all submission and monitoring details. We present a lightweight workflow declaration and execution system to address these needs, built on top of the MOLGENIS framework for data tracking. We describe lessons learnt when scaling NGS and imputation analyses from computational clusters to grids and show the application of our solution, in particular in the nation-wide "Genome of the Netherlands" project (GoNL, 700 TB of data and about 200,000 computing hours).
UR - http://www.scopus.com/inward/record.url?scp=84922572593&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:84922572593
SN - 1613-0073
VL - 993
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 5th International Workshop on Science Gateways, IWSG 2013
Y2 - 3 June 2013 through 5 June 2013
ER -