Rcpp multithreading surprise

Standard

I did a bit more work on the fasqtc package after investigating multithreading.

  1. I found that I could simply multithread an Rcpp function by using the parallel package and the mclapply() function. I suppose this makes sense but given the sometime painful complexity of doing this in C (or C++) code I just didn’t expect it to work.
  2. I was extra surprised that you can use mclapply() to split a list of fastq files to be handled by different processes and this gives a considerable speed up (~1/3rd faster with just 2 cores). Even though the processes are essentially reading from the same disk. Again I expected that reading from the disk would be a bottleneck that negated any multithreading gain. Apparently not.
  3. I read somewhere else (I cannot recall now) that if you set mclapply() to use more processes than you actually have cores this would be optimal. Again this sounded weird but it’s true!  mc.cores=detectCores()+1  ##FTW.

Oh yes and I finally ran build check and it all worked!

# This is the C++ workhorse using Rcpp, Armadillo and especially seqan
# The resulting object contains all the necessary data for other functions.
# If mc=TRUE the the function will attempt parallel processing
# If sampled = FALSE all the fastq file wil be read, if sampled= TRUE the first 200K
fastqc= function(files, mc=TRUE, numreads= 200000)
{
  # for a single file
  if (length(files)==1)
    res= fastqCpp(files, numreads)

  # multi files multi core
  if (length(files) > 1 & mc==TRUE)
  {
    require(parallel)
    res= mclapply(X=files, FUN=fastqCpp, numreads=numreads, mc.preschedule=TRUE, mc.cores=detectCores()+1)  
  }

  # multi files single core
  if (length(files) > 1 & mc==FALSE)
  {
    res= lapply(X=files, FUN=fastqCpp, numreads=numreads)
  }

  return(res)
  #.Call("fastqCpp", file, package="fastqc")  
}

Created by Pretty R at inside-R.org

Rcpp, RcppArmadillo, Seqan, Sequence Analysis part 4

Standard

This next algorithm was a bit tougher as std::string are kryptonite to me at the best of times and Seqan::StringSet<Dna5>> confused me too. I got there eventually by monkey-typewriting.

### Sequence Duplication

This function puts together two FastQC screens as the data arise naturally from the same procedure. First there are bins of the numbers of duplicate and then if there are more than a certain threshold of duplicates we record the sequence and return them. Such highly duplicated sequences maybe PCR errors.

In many ways seqan is great and mixes well with STL. For instance I simply std::sort all the sequences in a single line. Then I can run through them and count the consecutive duplicates. On the other hand I can’t just copy a Dna5 to a std::string (at least I don’t know how) – I have to step through char by char. Maybe there is a way?

seqDupl= function(argv)
  {
  require(ggplot2)
  dups = seqDuplCpp(argv)
  dups.df=data.frame(Bins=1:10, Counts=dups$Counts/dups$Counts[1])
  dupFraction = sum(dups$Counts[2:10])/sum(dups$Counts)

  seqs.df= data.frame(Duplicates=dups$Duplicates, Count=dups$SeqCount, Fraction= dups$SeqCount/dups$Reads)
  print(seqs.df)
  qplot(x=Bins, y=Counts, data=dups.df, geom='line', main = paste('Duplicate Fraction =', format(dupFraction, digits=3)))+theme_bw()

  }
seqDupl('test8.fastq')

seqDupl

                                     Duplicates Count     Fraction
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 62336 0.0207786667
2 GGCGGACAGAGCCAAGGCGAACAGGGCGAACACGAGGATGCAAAT   754 0.0002513333
3 GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCA   777 0.0002590000

The C++ code is then:



    // CUT THE TOP HALF OF THE CODE
    // SIMILAR TO PREVIOUS POSTS

    //A Dna5q Stringset Iterator
    typedef Iterator<StringSet<Dna5String> >::Type TStringSetIter;    
    std::sort(begin(seqs), end(seqs));// YOU CAN USE STL!!!

    TStringSetIter seqIt1 = begin(seqs);
    TStringSetIter seqIt2 = seqIt1 + 1 ; // this is 1 in front of the other iterator

    const unsigned numcycles = length(*seqIt1);  // get the number of cycles i.e. readlength
    const unsigned numreads = length(seqs);    // length of the StringSet, number of reads

    NumericVector bins(10);
    unsigned int dupCount = 0;
    unsigned int dupThreshold = numreads/5000;    
    std::vector<std::string> dupSeqsVec;
    std::string dupSeqs(numcycles, 'N');
    std::vector<int > dupSeqsCount;
    seqan::CharString seq1, seq2;

    // seqs is lexically sorted so duplicates are consecutive
    for( ; seqIt2 != end(seqs); ++seqIt1, ++seqIt2)
    {
      seq1= value(seqIt1);
      seq2=value(seqIt2);
      // if first and next are the same we start counting how many
      if (seq1 == seq2)
      {        
        ++dupCount;
      }
      // till we come to a different read sequence
      if (seq1 != seq2)
      {
        // bin 1-10 increment
        if(dupCount <= 9)  
          bins[dupCount] += 1;
        // if >10 increment 10th bin
        if(dupCount > 9)
          bins[9] += 1;

        // if a threshold is met we record the sequence and how many there were  
        if(dupCount > dupThreshold)
        {
          dupSeqsCount.push_back(dupCount);
          for(unsigned i=0; i< numcycles; ++i)
              dupSeqs[i] = seq1[i];

          dupSeqsVec.push_back(dupSeqs);
        }
      // then reset the counter after bin increments and recording is done  
      dupCount = 0;    
      }

    }
    return List::create(Named("File")     = argv,
                      Named("Reads")      = numreads,
                      Named("Counts")       = bins,
                      Named("Duplicates") = dupSeqsVec,
                      Named("SeqCount")      = dupSeqsCount);
}

Rcpp, RcppArmadillo, Seqan, Sequence Analysis part 3

Standard

I mentioned in the last interregnum post the concept of Algebraic, Distributive, and Holistic data algorithms mentioned by Hadley Wickham in a recent talk . Here now with simple DNA sequence counting is a more or less Algebraic example (mean is not quite totally Algebraic).

### Bases by Cycle
Ideally sequence reads should be a random draw from the underlying genomic material. That isn’t exactly uniform A/T/G/C as there are runs of G/C – for various reasons I won’t go into. However you can also get molecular biological artefacts or biases giving rise to repetitive or frequent sequences. You at least want to know about this and probably need to remove them before continuing. The next few posts deal with detcting such sequences.

This first function is probably best for detecting a CG bias at the start of reads (rather than outright artefacts). In many way it is the least useful diagnostic – as the bias does not imply any error in the reads themselves just an uneven sampling of the underlying sequence population (…that’s an ungainly sentence…).

baseByCycle = function(fastqFile)
{
  require(RcppArmadillo)
  require(ggplot2)
  require(reshape2)
  # This is the C++ function
  bbc= baseByCycleCpp(fastqFile)
  baseByCycleFrame = as.data.frame(t(bbc$baseByCycle))
  colnames(baseByCycleFrame)= c('A', 'C', 'G', 'T', 'N')
  baseByCycleFrame$Cycle = 1:bbc$Cycles
  baseByCycleFrame= melt(baseByCycleFrame, id.vars= 'Cycle', variable.name='Base', value.name='Fraction')

  g <- ggplot(baseByCycleFrame, aes(x = Cycle, y = Fraction, color = Base))
  g + geom_line()+theme_bw()+ labs(title='Base Fraction per Cycle')

}

This produces a plot like this in which you can clearly see a tendency for reads to start with G or C that then evens out:

baseByCycle

The C++ code is pretty similar to before with the exception that we don’t need to read the qualities in.

#include <seqan/sequence.h>
#include <iostream>
#include <seqan/file.h>
#include <seqan/seq_io.h>
#include <RcppArmadillo.h>

using namespace Rcpp;
using namespace seqan;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]

List baseByCycleCpp(std::string argv)
{

 // Open memory mapped string.
    seqan::String<char, seqan::MMap<> > mmapString;
    if (!open(mmapString, argv.c_str(), seqan::OPEN_RDONLY))
        return 1;  // Could not open file.

    // Create RecordReader.
    seqan::RecordReader<seqan::String<char, seqan::MMap<> >,
                        seqan::DoublePass<seqan::Mapped> > reader(mmapString);

    // Read file in one pass into optimised memory space
    seqan::StringSet<seqan::CharString> seqs; // this persists
    {
    // whereas I declare ids inside this block because I don't need them
    seqan::StringSet<seqan::CharString> ids;
    if (read2(ids, seqs, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.
    }

    //A Dna5q Stringset Iterator
    typedef Iterator<StringSet<CharString> >::Type TStringSetIter;
    TStringSetIter seqIt = begin(seqs);

    const unsigned numcycles = length(*seqIt);  // get the number of cycles i.e. readlength
    const unsigned numreads = length(seqs);    // length of the StringSet, number of reads

    //there are 5 bases ACGTN hence 5 rows.
    arma::mat baseCounterMatrix(5, numcycles);
    baseCounterMatrix.zeros();

    seqan::CharString seqn;

    for( ; seqIt != end(seqs); ++seqIt) // this puts the seq data into the arma format
    {
      seqn = *seqIt;

      for(unsigned cycle = 0; cycle< numcycles; ++cycle)
      {

          if(seqn[cycle] =='A') 
            baseCounterMatrix(0,cycle) += 1;

          if(seqn[cycle] =='C') 
            baseCounterMatrix(1,cycle) +=1;

          if(seqn[cycle] =='G') 
            baseCounterMatrix(2,cycle) +=1;

          if(seqn[cycle] =='T') 
            baseCounterMatrix(3,cycle) +=1;

          if(seqn[cycle] =='N') 
            baseCounterMatrix(4,cycle) +=1;

      }

    }
  // convert the base counts to a fraction  
  baseCounterMatrix = baseCounterMatrix/numreads;
  return List::create(Named("File")   = argv,
                      Named("Reads")   = numreads,
                      Named("Cycles") =  numcycles,
                      Named("baseByCycle")  = baseCounterMatrix);
}

Rcpp, RcppArmadillo, Seqan, Sequence Analysis (2)

Standard

I should talk a bit more about RcppArmadillo and the Seqan C++ libraries. RcppArmadillo is the Rcpp integration of the Armadillo linear algebra libraries. They and Rcpp bring some of that vectorised R style magic to C++ matrices (see below). The seqan library meanwhile has all the reading and parsing functionality for sequence files and data. In particular seqan uses memory efficiently through memory mapping and it can extract the phred quality scores to integer values.

Quality By Sequence

The next important quality control (QC) diagnostic you get from FastQC is the average sequence quality – or what I have called quality by sequence (qualBySeq). The code is similar to before but note you can calculate it with a single read through the data and no need to create a large matrix. The godsend here though is the RcppArmadillo hist() function which makes this trivial.

qualBySeq = function(fastqFile) {
    require(RcppArmadillo)
    require(ggplot2)

    # This is the C++ function
    qualSeqHist = qualBySeqCpp(fastqFile)

    # tidy into a data.frame of bins and read counts
    qualSeqFrame = data.frame(MeanSeqQual = (qualSeqHist$minQual + 1):qualSeqHist$maxQual, 
        Reads = qualSeqHist$qualBySeq)

    g <- ggplot(qualSeqFrame, aes(x = MeanSeqQual, y = Reads))
    g + geom_line() + theme_bw() + labs(list(x = "Mean Sequence Quality (Phred Scores)", 
        y = "Reads"))

}

The results look like this:

qualBySeq

Naturally due to the lower overheads this function is faster than the previous taking ~6 seconds to calculate averages on 2 million reads. It will probably process a lot more data without running out of space too. Note the lazy (I mean lazy as in “lazy” not any sophisticated programming sense of the word) use of scope to read lots of fastq sequence and ID data then just forget it. A major improvement would be to skip these fastq lines completely. There is also an unsavoury mixture of iterator and unsigned int loops …

#include <seqan/sequence.h>
#include <iostream>
#include <seqan/file.h>
#include <seqan/seq_io.h>
#include <RcppArmadillo.h>

using namespace Rcpp;
using namespace seqan;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]

List qualBySeqCpp(std::string argv)
{

 // Open memory mapped string.
    seqan::String<char, seqan::MMap<> > mmapString;
    if (!open(mmapString, argv.c_str(), seqan::OPEN_RDONLY))
        return 1;  // Could not open file.

    // Create RecordReader.
    seqan::RecordReader<seqan::String<char, seqan::MMap<> >,
                        seqan::DoublePass<seqan::Mapped> > reader(mmapString);

    // Read file in one pass into optimised memory space
    seqan::StringSet<seqan::CharString> quals; // this persists
    {
    // whereas I declare these inside this block because I dont need them
    seqan::StringSet<seqan::CharString> seqs;
    seqan::StringSet<seqan::CharString> ids;
    if (read2(ids, seqs, quals, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.
    }

    //A CharString Stringset Iterator
    typedef Iterator<StringSet<CharString> >::Type TStringSetIter;
    TStringSetIter qualIt;

    seqan::CharString qual;
    qualIt = begin(quals);                       // point to first Phred String of Stringset  
    const unsigned numcycles = length(*qualIt);  // get the number of cycles i.e. readlength
    const unsigned numreads = length(quals);    // length of the StringSet, number of reads

    //an armadillo vector of length of numreads set to zero
    arma::vec perReadQuality(numreads);
    perReadQuality.zeros();

    unsigned armaCounter = 0; 
    for( ; qualIt != end(quals); ++qualIt, ++armaCounter) // this puts the seq data into the arma format
    {
      qual = *qualIt;

      for(unsigned cycle = 0; cycle< numcycles; ++cycle)
      {
        // sum the quality scores across each sequence read
        perReadQuality[armaCounter] += (int)ordValue(qual[cycle])-33;

      }

    }

    // FastQC Per Sequence Quality Scores  
    perReadQuality = perReadQuality/numcycles; // make the sum phred score to a mean phred score    
    int minQual = min(perReadQuality);
    int maxQual = max(perReadQuality);
    int qualRange = maxQual-minQual;
    // calculates bins for the integer range of qualities
    arma::uvec qualityBins = hist(perReadQuality, qualRange);

    return List::create(Named("File")   = argv,
                      Named("Reads")   = numreads,
                      Named("minQual") = minQual,
                      Named("maxQual") = maxQual,
                      Named("qualBySeq")  = qualityBins);
}

** NEXT : FastQC style "Per Base Sequence Content" **

Rcpp, RcppArmadillo, Seqan, Sequence Analysis (1)

Image

Bioinformatics software – particularly dealing with sequences – is dominated by command line tools. Sadly these are written in a plethora of different languages meaning that analysis pipelines are a mish-mash of Java Jars, interpreted Perl, Ruby, or Python scripts, plus compiled C/C++ standalones usually piped together by Bash scripts. Whilst these are mostly efficient they sometimes fail in the unix design philosophy of doing one simple thing very well. Rather they often do lots of overlapping things in undocumented ways. Installing, maintaining and using them together can be a bit of a pain.

In R meanwhile there is an excellent set of Bioconductor packages that lend a unified framework for dealing with sequences. There are many packages for applications like ChIP-seq or RNA-seq when the data is already aligned and summarised. There are less great tools for processing unaligned sequence data. Mostly this is because there is so much of it – R just doesn’t handle large datasets that well – though the Bioconductor developers have made heroic strides in managing memory outside base R.

The Bioconductor ShortRead package does handle some quality control (QC). It doesn’t do everything I want though and the underlying code is in C – which I for one find difficult to hack. So I decided to write the functionality of the FastQC and trimmomatic in R and C++ (using Rcpp and the Seqan C++ library)…. for fun. The next few posts will show the functions

Quality By Cycle

This first post show code for the most important FastQC graph – what they call “per base sequence quality” – and I call “quality by cycle”. High throughput sequences are “read” from left to right in repeated photochemical cycles of declining quality. So when reads get too long the error rate becomes unacceptably high. Usually this is the main diagnostic used to select the ‘trimming’ length for reads before subsequent analyses.

This “quality” is coded in the reads fastq files by a character code knowna as a Phred Score phred quality scores (Q) per base. Where P is the base calling error probability:

\( Q = -10\log_{10}P \)

Here are 4 lines of fastq: an ID, the DNA sequence, a separator line, and the Phred Score:

@SRR031709.1 HWI-EAS299_4_308T2AAXX:6:1:901:1978/1
GTTTCATAATTTAGATGGTTCGGGAAGTTAAACAATATGCGTAAA
+
:::::::::::::::::::::7::::::::/2222-,,,,,-,,(

Here then is the R code for qualByCycle which as the name implies creates a boxplot of the quality scores by cycle using ggplot2.

qualByCycle = function(fastqFile)
  {
  require(RcppArmadillo)
  require(ggplot2)

  # This is the C++ function
  boxStats = qualByCycleCpp(fastqFile)

  # just some tidying and labels
  boxStatsFrame = boxStats$fivenum
  boxtitles = paste('no of reads = ', boxStats$Reads)
  boxStatsFrame= t(boxStatsFrame)
  colnames(boxStatsFrame)=c('min', 'Q1', 'median', 'Q3', 'max')
  boxStatsFrame=as.data.frame(boxStatsFrame)
  boxStatsFrame$Cycle=factor(1:45)

  # make a plot
  g <- ggplot(boxStatsFrame, aes(x = Cycle, ymin = min, lower = Q1, middle = median, upper = Q3, ymax = max))
  g + geom_boxplot(stat = "identity")+theme_bw()+labs(list(title=boxtitles, y= 'Phred Score'))
}

And it gives something like this:

QualByCycle

I’m using Rcpp + RcppArmadillo + C++ seqan to calculate quality summaries. C++ (can sometimes) handle big chunks of memory better than R. You could do some of this reading in fastq data into R using ShortRead, make a matrix of qualities then a boxplot. But for really big matrices e.g. >106 by 75 making a boxplot chokes an average desktop computer. Alternatively you could try the R fivenum() function (min, Q2, median, Q3, max) to calculate summary statistics for each cycle and plot these directly. This too chokes on a big matrix.

So you could pass the data out to a C++ function using Rcpp and calculate the fivenum summaries…better. Indeed a similar programming strategy is employed in Hadley Wickhams forthcoming (?) bigVis package. Regardless calculating fivenum is always going to be a cpu chore as whilst stats such as mean, min, max can be computed by a one pass read through an array (cycle columns in this case), Q1, median and Q3 imply sorting all values.

However reading the fastq file into R memory using Shortread and converting to a matrix before passing to Rcpp is also a memory/cpu hog. Better to do it all in C++ outside R then import just the summary data for plotting. For this I employ the outstanding sequence analysis C++ seqan library for the heavy lifting. In particular I use this due to it’s memory mapping which I’ll talk about more in subsequent posts….

Whilst I only really need R for the plotting – the Rcpp “Sugar” and RcppArmadillo packages are really useful too for simplifying C++ maths code. I’m not too proud of the following code…  It analyses  2 million reads in ~10 secs on my oldish mac desktop with 4GB RAM and 2 cpu cores. It will happily work on >6 million reads without disk churn. There are some glaring inefficiencies… but it is convenient… for now.

#include <seqan/sequence.h>
#include <iostream>
#include <seqan/file.h>
#include <seqan/seq_io.h>
#include <RcppArmadillo.h>

using namespace Rcpp;
using namespace seqan;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]

List qualByCycleCpp(std::string argv)
{

 // Open memory mapped string.
    seqan::String<char, seqan::MMap<> > mmapString;
    if (!open(mmapString, argv.c_str(), seqan::OPEN_RDONLY))
        return 1;  // Could not open file.

    // Create RecordReader.
    seqan::RecordReader<seqan::String<char, seqan::MMap<> >,
                        seqan::DoublePass<seqan::Mapped> > reader(mmapString);

    // Read file in two pass into optimised memory space
    seqan::StringSet<seqan::CharString> quals; // this persists
    {
    // ids and seqs are not used so I let go out of scope to free up memory -- HAXXXOR!
    seqan::StringSet<seqan::CharString> ids;
    seqan::StringSet<seqan::CharString > seqs;
    if (read2(ids, seqs, quals, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.
    }

    //A Dna5q Stringset Iterator
    typedef Iterator<StringSet<CharString> >::Type TStringSetIter;
    TStringSetIter qualIt;

    seqan::CharString qual;
    qualIt = begin(quals);                       // point to first Phred String of Stringset  
    const unsigned numcycles = length(*qualIt);  // get the number of cycles i.e. readlength
    const unsigned numreads = length(quals);    // length of the StringSet, number of reads

    // for boxplots by cycle a matrix sort is necessary you cant compute median etc on the stream.
    // Rcpp Armadillo matrices
    arma::mat qualMatrix = arma::mat(numreads, numcycles);
    qualMatrix.zeros();
    unsigned armaCounter = 0; 

    for( ; qualIt != end(quals); ++qualIt, ++armaCounter) // this puts the seq data into the arma format
    {
      qual = *qualIt;

      for(unsigned cycle = 0; cycle< numcycles; ++cycle)
      {
        // a matrix of quality scores
        qualMatrix(armaCounter, cycle) = (int)ordValue(qual[cycle])-33;

      }

    }// is there a quick way to get seqs into arma::matrix?

    // I sort the columns (cycles) then calculate fivenum for each cycle (column)
    qualMatrix= sort(qualMatrix,0,0);
    int q1 = numreads/4;
    int mid = numreads/ 2; // Euclidian division
    int q3 = q1*3;
    NumericMatrix fivenum(5,numcycles);    
    for (unsigned cycle = 0; cycle < numcycles; ++cycle) 
    { 

        fivenum(0,cycle) = qualMatrix(0,cycle);
        fivenum(1,cycle) = qualMatrix(q1,cycle);
        fivenum(2,cycle) = qualMatrix(mid,cycle);
        fivenum(3,cycle) = qualMatrix(q3,cycle);
        fivenum(4,cycle) = qualMatrix(numreads-1,cycle);        
    }

return List::create(Named("File")   = argv,
                      Named("Reads")   = numreads,
                      Named("fivenum")  = fivenum);
}

NEXT TIME: FastQC style “per Sequence Quality Score” – and a bit more about Seqan.