I just spotted a link in my Twitter feed to a paper on a new read summarisation program, featureCounts. The second author, Gordon Smyth, is the guy who wrote limma, the linear model framework for microarrays. At present there is just a link to the paper on the arXiv. The program takes read alignments and summarises them according to the genomic feature they fall within or near.
Their program is a lot quicker and uses less memory than some commonly used tools, HTSeq (Python) and GenomicRanges (R/Bioconductor); see e.g. Table 1 of the paper.
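To make "summarising reads by feature" concrete, here is a minimal Python sketch of the idea. It is not how featureCounts, HTSeq or GenomicRanges do it; the function name `count_reads`, the tuple layouts and the `slack` padding parameter are all my own assumptions for illustration. It also scans every feature on a chromosome for every read, which is exactly the kind of naive approach that gets slow on real data.

```python
from collections import Counter, defaultdict


def count_reads(features, reads, slack=0):
    """Count reads per feature by simple interval overlap.

    features: list of (name, chrom, start, end), 1-based inclusive.
    reads:    iterable of (chrom, start, end), 1-based inclusive.
    slack:    hypothetical padding in bases, so that reads falling
              "near" a feature are counted as well.
    Returns (counts, unassigned), where unassigned covers reads that
    hit no feature or more than one feature.
    """
    # Group the (padded) features by chromosome.
    by_chrom = defaultdict(list)
    for name, chrom, start, end in features:
        by_chrom[chrom].append((start - slack, end + slack, name))

    counts = Counter({name: 0 for name, _, _, _ in features})
    unassigned = 0
    for chrom, r_start, r_end in reads:
        # Naive linear scan over the features on this chromosome.
        hits = {
            name
            for f_start, f_end, name in by_chrom.get(chrom, ())
            if r_start <= f_end and f_start <= r_end
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        else:
            unassigned += 1  # no overlap, or ambiguous between features
    return dict(counts), unassigned


# Toy data, made up for this example.
features = [
    ("geneA", "chr1", 100, 200),
    ("geneB", "chr1", 150, 300),
    ("geneC", "chr2", 50, 80),
]
reads = [
    ("chr1", 90, 120),   # overlaps geneA only
    ("chr1", 160, 180),  # overlaps geneA and geneB, so ambiguous
    ("chr1", 250, 260),  # overlaps geneB only
    ("chr2", 60, 70),    # overlaps geneC only
    ("chr3", 1, 10),     # no features on this chromosome
]
counts, unassigned = count_reads(features, reads)
print(counts, unassigned)
```

Real tools have to deal with strandedness, multi-exon features, paired-end reads and multi-mapping reads, and they use smarter lookups than a linear scan, which is where the speed and memory differences come from.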
People are always complaining about the number of assembly or read alignment programs. Yet it seems to me that there are a lot of simple tools that are pretty inefficient, and their inefficiency has a substantial economic cost in the hardware you need and the time it takes to run them. Perhaps people think it is only a fraction of the cost of alignment or assembly, so forget it … move on.
The thing is, GenomicRanges has C code underneath it, and featureCounts is written in C too. Granted, GenomicRanges is not really a summarisation tool but a more general framework for handling read alignments, yet it still seems hamstrung by R's memory handling. It is also very difficult (for me, anyway) to follow the C code underlying GenomicRanges and other Bioconductor sequencing packages.
At present a lot of the high-level numerical or statistical work in genomics is done with R packages after summarisation, so summarisation is perhaps just a step below the designed competency of R. However, I think that the work at companies like Revolution, packages like bigmemory and Rcpp, and efforts like bigvis or data.table all point to the fact that more of the genomic workflow could be done within R, or at least within an R wrapper.