I did a bit more work on the fastqc package after investigating multithreading.
- I found that I could multithread an Rcpp function simply by using the parallel package and its mclapply() function. I suppose this makes sense, but given the sometimes painful complexity of doing this directly in C (or C++) code I just didn't expect it to work.
- I was extra surprised that you can use mclapply() to split a list of fastq files across different processes and get a considerable speed-up (~1/3 faster with just 2 cores), even though the processes are essentially reading from the same disk. I expected disk reads to be a bottleneck that negated any multithreading gain. Apparently not.
- I read somewhere (I cannot recall where now) that setting mclapply() to use one more process than you actually have cores is close to optimal. Again this sounded weird, but it's true: mc.cores=detectCores()+1 ##FTW.
Oh yes, and I finally ran R CMD build and check and it all worked!
# This is a wrapper around the C++ workhorse, which uses Rcpp, Armadillo and especially seqan.
# The resulting object contains all the necessary data for the other functions.
# If mc=TRUE the function will attempt parallel processing.
# numreads controls sampling: by default only the first 200K reads of each fastq file are read.
fastqc = function(files, mc=TRUE, numreads=200000) {
  # for a single file
  if (length(files) == 1)
    res = fastqCpp(files, numreads)
  # multiple files, multiple cores
  if (length(files) > 1 & mc == TRUE)
    res = mclapply(X=files, FUN=fastqCpp, numreads=numreads,
                   mc.preschedule=TRUE, mc.cores=detectCores()+1)
  # multiple files, single core
  if (length(files) > 1 & mc == FALSE)
    res = lapply(X=files, FUN=fastqCpp, numreads=numreads)
  # .Call("fastqCpp", file, package="fastqc")
  res
}
The last couple of days have been a bit of a struggle. I think I know a bit of R and have got Rcpp and RcppArmadillo working OK. I'm learning C++ and making progress with seqan. I've got the hang of C++ package building and compilation. And I've got Git working with RStudio and managed to push local Git repositories to GitHub.
Putting it all together, though, has been mind-bending. Clearly I did a lot of stuff in the wrong order and then had trouble following tutorials. I would have thought making a GitHub repository was the last thing… but no:
GitHub -> RStudio -> Git -> Package -> C++ flags, NAMESPACE, Makevars etc. etc.
Next time I won't do it in reverse.
Anyway, my first GitHub package is at:
I just spotted a link in my Twitter feed to a paper on a new read-summarisation program, featureCounts. The second author, Gordon Smyth, is the guy who wrote limma – the linear-model framework for microarrays. Presently there is just a link to a paper on the arXiv. The program takes read alignments and summarises them according to the genome feature they fall within or near.
Their program is a lot quicker and uses less memory than some commonly used tools such as htseq (Python) and GenomicRanges (R/Bioconductor) – see e.g. their Table 1.
Firstly, people are always complaining about the number of assembly and read-alignment programs, but it seems to me that there are also a lot of simple tools that are pretty inefficient. And their inefficiency has a substantial economic cost in the hardware you need and the time you spend running them. Perhaps people think this step is a small fraction of the cost of alignment or assembly, so they forget it… move on.
The thing is, GenomicRanges is written in C, and so is featureCounts. And whilst GenomicRanges is not really a tool for summarisation but a more general framework for handling read alignments, it still seems hamstrung by R's memory handling. It is also very difficult (to me anyway) to follow the C code underlying GenomicRanges and the other Bioconductor sequencing packages.
Presently a lot of the high-level numerical and statistical work in genomics is done with R packages after summarisation, so this is perhaps just a step below the designed competency of R. However, I think the work demonstrated at companies like Revolution, with packages like bigmemory and Rcpp, and efforts like bigvis or data.table, all point to the fact that more of the genomic workflow could be done within R – or at least within an R wrapper.
One of the more surprising aspects of our continuing depression has been that whilst our economy has stagnated (I mean in Britain), there hasn't been as much unemployment as expected. For a while I've reckoned I knew why, as a lot of people I know seem to have gone self-employed as a last resort; whilst not unemployed, they just aren't working very much. Of course this is anecdotal, but recently studies have started to appear suggesting that people are indeed working fewer hours than they would like. Unfortunately it appears to be the low-paid "flexible" workers who are struggling most to find the hours to support themselves.
Then today I read an intriguing story in the FT about big US companies who are monitoring people's spending habits:
“We see a pronounced difference between how people are shopping today and before the recession,” the executive explained. “Consumers are living pay check by pay check, and they tend to spend accordingly. Then you have 50 million people on food stamps and that has cycles too. So for our business it has become critical to understand the cycle – when pay [and benefit] checks are arriving.”
It strikes me that companies like Tesco, Amazon and especially Wonga here in the UK probably have better Big Data explaining the state of our nation than the ONS.
Over at Nature they've just announced new guidelines for submitted papers: they want results to be more reproducible. I'm guessing that where Nature leads, other journals will follow.
From next month, Nature and the Nature research journals will introduce editorial measures to address the problem by improving the consistency and quality of reporting in life-sciences articles. To ease the interpretation and improve the reliability of published results we will more systematically ensure that key methodological details are reported, and we will give more space to methods sections. We will examine statistics more closely and encourage authors to be transparent, for example by including their raw data.
I guess this makes sense. For a lot of the genomics papers that get into Nature the article itself is really just a lot of results with a few highlight pictures; the Supplementary then contains the extended Methods, and even then it is sometimes difficult to see from that what has actually been done. Ensuring that people provide raw data is a no-brainer. They have been supposed to do this for years, but some groups are backsliding, so a re-emphasis is welcome.
They also provide a checklist which, amongst other things, has a rudimentary statistical section. This I agree with. The analysis sections of many big papers often seem to gloss over far too much. As an R person I'd like to actually see the code that did the calculations – the software settings and parameters even. I actually think the problems with most papers are not in the p-values and error bars, but generally in the normalisation of big datasets, the estimation of the null in resampling statistics, or just not having any appreciation of whether you might get different results with slightly different analysis choices.
Then again… it's getting harder and harder to write and publish papers in big journals. Each year, as competition gets fiercer, more and more "stuff" is requested. Sometimes this is a request for more and better work – sometimes it just feels like more box-ticking. So if we are going to have more thoroughness in Methods, stats and data sharing, it would be nice to see a relaxation elsewhere so we can genuinely concentrate on the important stuff. I doubt that will happen though.
This next algorithm was a bit tougher, as std::string is kryptonite to me at the best of times and seqan::StringSet<Dna5String> confused me too. I got there eventually by monkey-typewriting.
### Sequence Duplication
This function puts together two FastQC screens, as the data arise naturally from the same procedure. First there are bins of duplicate counts, and then, if any sequence is duplicated more than a certain threshold number of times, we record it and return it. Such highly duplicated sequences may be PCR artefacts.
In many ways seqan is great and mixes well with the STL. For instance I can simply std::sort all the sequences in a single line, then run through them and count the consecutive duplicates. On the other hand I can't just copy a Dna5String to a std::string (at least I don't know how) – I have to step through char by char. Maybe there is a way?
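To convince myself the sort-then-scan idea works, here is a stripped-down sketch of it in plain C++. std::string stands in for seqan's Dna5String, and the function name and 10-bin layout are my own assumptions for illustration, not the package's actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Histogram of duplication levels: sort the reads so identical sequences
// become consecutive, then scan once, measuring each run of duplicates.
// Bin i holds the number of distinct reads seen (i + 1) times; the last
// bin collects everything at 10 or more copies.
std::vector<int> duplicationBins(std::vector<std::string> reads) {
    std::vector<int> bins(10, 0);
    std::sort(reads.begin(), reads.end());  // duplicates are now adjacent
    std::size_t i = 0;
    while (i < reads.size()) {
        std::size_t run = 1;  // length of the run of identical reads at i
        while (i + run < reads.size() && reads[i + run] == reads[i])
            ++run;
        bins[std::min<std::size_t>(run, 10) - 1] += 1;
        i += run;  // jump to the start of the next run
    }
    return bins;
}
```

Sorting makes every duplicate adjacent, so a single linear scan is enough to size each run – the same trick the seqan version below relies on.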
dups = seqDuplCpp(argv)
dupFraction = sum(dups$Counts[2:10]) / sum(dups$Counts)
# data frame of the 10 duplication bins for plotting
dups.df = data.frame(Bins=1:10, Counts=dups$Counts)
seqs.df = data.frame(Duplicates=dups$Duplicates, Count=dups$SeqCount,
                     Fraction=dups$SeqCount/dups$Reads)
qplot(x=Bins, y=Counts, data=dups.df, geom='line',
      main=paste('Duplicate Fraction =', format(dupFraction, digits=3))) + theme_bw()
Duplicates Count Fraction
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 62336 0.0207786667
2 GGCGGACAGAGCCAAGGCGAACAGGGCGAACACGAGGATGCAAAT 754 0.0002513333
3 GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCA 777 0.0002590000
The C++ code is then:
// CUT THE TOP HALF OF THE CODE
// SIMILAR TO PREVIOUS POSTS
// (bins, a 10-element integer vector, is declared in the omitted half)
// A Dna5 StringSet iterator
typedef Iterator<StringSet<Dna5String> >::Type TStringSetIter;
std::sort(begin(seqs), end(seqs)); // YOU CAN USE STL!!!
TStringSetIter seqIt1 = begin(seqs);
TStringSetIter seqIt2 = seqIt1 + 1; // one position in front of the other iterator
const unsigned numcycles = length(*seqIt1); // number of cycles, i.e. read length
const unsigned numreads = length(seqs);     // length of the StringSet, i.e. number of reads
unsigned int dupCount = 0;
unsigned int dupThreshold = numreads / 5000;
std::string dupSeqs(numcycles, 'N');
std::vector<std::string> dupSeqsVec;
std::vector<int> dupSeqsCount;
// seqs is lexically sorted, so duplicates are consecutive
for ( ; seqIt2 != end(seqs); ++seqIt1, ++seqIt2) {
    // if this read and the next are the same, keep counting
    if (*seqIt1 == *seqIt2) {
        ++dupCount;
    } else { // we have come to a different read sequence
        // bins 1-10: increment
        if (dupCount <= 9)
            bins[dupCount] += 1;
        else // if more than 10 copies, increment the 10th bin
            bins[9] += 1;
        // if the threshold is met, record the sequence and how many there were
        if (dupCount > dupThreshold) {
            for (unsigned i = 0; i < numcycles; ++i)
                dupSeqs[i] = (*seqIt1)[i]; // copy the Dna5 char by char
            dupSeqsVec.push_back(dupSeqs);
            dupSeqsCount.push_back(dupCount + 1); // dupCount neighbours + the read itself
        }
        // then reset the counter after bin increments and recording are done
        dupCount = 0;
    }
}
// bin the final run, which the loop above leaves uncounted
if (dupCount <= 9)
    bins[dupCount] += 1;
else
    bins[9] += 1;
return List::create(Named("File") = argv,
                    Named("Reads") = numreads,
                    Named("Counts") = bins,
                    Named("Duplicates") = dupSeqsVec,
                    Named("SeqCount") = dupSeqsCount);
I really love this animated illustration of path-finding algorithms. A bit of pseudo-code or links to an explanation of each would make it even better:
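In that spirit, here is my own minimal sketch of the simplest such algorithm – breadth-first search on a 4-connected grid. Nothing here is taken from the linked page; the function name and grid encoding are just my assumptions:

```cpp
#include <queue>
#include <utility>
#include <vector>

// Breadth-first search: returns the length of the shortest path from
// (sr, sc) to (tr, tc) on a grid where cells marked 1 are walls,
// or -1 if the target is unreachable.
int shortestPath(const std::vector<std::vector<int>>& grid,
                 int sr, int sc, int tr, int tc) {
    const int R = (int)grid.size(), C = (int)grid[0].size();
    std::vector<std::vector<int>> dist(R, std::vector<int>(C, -1));
    std::queue<std::pair<int, int>> q;
    dist[sr][sc] = 0;          // distance to the start is zero
    q.push(std::make_pair(sr, sc));
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};
    while (!q.empty()) {
        int r = q.front().first, c = q.front().second;
        q.pop();
        if (r == tr && c == tc)
            return dist[r][c]; // BFS visits cells in distance order
        for (int k = 0; k < 4; ++k) {
            int nr = r + dr[k], nc = c + dc[k];
            // expand unvisited, in-bounds, non-wall neighbours
            if (nr >= 0 && nr < R && nc >= 0 && nc < C &&
                grid[nr][nc] == 0 && dist[nr][nc] == -1) {
                dist[nr][nc] = dist[r][c] + 1;
                q.push(std::make_pair(nr, nc));
            }
        }
    }
    return -1;
}
```

Because BFS expands cells strictly in order of distance from the start, the first time it dequeues the target it has found a shortest path – which is exactly what those animations show frontier by frontier.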