Parallelisation of specific R functions for the SPRINT framework to enable the statistical analysis of post-genomic high-throughput biological data
The analysis of genetic data requires large amounts of processing power and memory. The last few years have seen the widespread introduction of high-throughput and highly parallel experiments in biological research. Microarray-based techniques are a prominent example, allowing simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples. These studies generate an unprecedented amount of data and test the limits of existing bioinformatics computing infrastructure. SPRINT (www.r-sprint.org) is a collaborative project between EPCC (www.epcc.ed.ac.uk) and the Division of Pathway Medicine (DPM) (http://www.pathwaymedicine.ed.ac.uk/) which aims to provide the microarray community with easy access to High Performance Computing (HPC) for the efficient analysis of post-genomic microarray data.
A popular tool among biostatisticians for analysing microarray data is the free statistical software package R/Bioconductor. However, R is inherently sequential and cannot be used easily or efficiently on HPC platforms without substantial modifications to the R code. SPRINT provides a Simple Parallel R INTerface to HPC, allowing biological researchers to reap the benefits of HPC while hiding the complexity of programming for it.
- Parallelisation of specific R functions using the SPRINT framework
- Testing of the parallelised R functions within a genomics scenario
SPRINT has two main components: an intelligent HPC harness and a library of parallelised R functions. The purpose of this project is to add significant functionality to SPRINT by parallelising specific R functions used in the analysis of gene expression and genotyping data. The development work is to be carried out in C and MPI on NESS, and tested and benchmarked on Eddie and/or HECToR. The student(s) will work closely with the SPRINT team at EPCC and DPM.
Essential: C, parallel programming, MPI
Advantageous but not essential: R, statistical programming
Benefit to the Student
The student will:
- contribute to an ongoing research programme developing open source software;
- develop software for HPC clusters;
- gain an understanding of statistical methods used in the analysis of post-genomic high-throughput biological data;
- use general statistical tools such as R and Bioconductor.