Rcpp, RcppArmadillo, Seqan, Sequence Analysis part 3

Standard

I mentioned in the last interregnum post the concept of Algebraic, Distributive, and Holistic data algorithms mentioned by Hadley Wickham in a recent talk . Here now with simple DNA sequence counting is a more or less Algebraic example (mean is not quite totally Algebraic).

### Bases by Cycle
Ideally sequence reads should be a random draw from the underlying genomic material. That isn’t exactly uniform A/T/G/C as there are runs of G/C – for various reasons I won’t go into. However you can also get molecular biological artefacts or biases giving rise to repetitive or frequent sequences. You at least want to know about this and probably need to remove them before continuing. The next few posts deal with detcting such sequences.

This first function is probably best for detecting a CG bias at the start of reads (rather than outright artefacts). In many way it is the least useful diagnostic – as the bias does not imply any error in the reads themselves just an uneven sampling of the underlying sequence population (…that’s an ungainly sentence…).

baseByCycle = function(fastqFile)
{
  require(RcppArmadillo)
  require(ggplot2)
  require(reshape2)
  # This is the C++ function
  bbc= baseByCycleCpp(fastqFile)
  baseByCycleFrame = as.data.frame(t(bbc$baseByCycle))
  colnames(baseByCycleFrame)= c('A', 'C', 'G', 'T', 'N')
  baseByCycleFrame$Cycle = 1:bbc$Cycles
  baseByCycleFrame= melt(baseByCycleFrame, id.vars= 'Cycle', variable.name='Base', value.name='Fraction')

  g <- ggplot(baseByCycleFrame, aes(x = Cycle, y = Fraction, color = Base))
  g + geom_line()+theme_bw()+ labs(title='Base Fraction per Cycle')

}

This produces a plot like this in which you can clearly see a tendency for reads to start with G or C that then evens out:

baseByCycle

The C++ code is pretty similar to before with the exception that we don’t need to read the qualities in.

#include <seqan/sequence.h>
#include <iostream>
#include <seqan/file.h>
#include <seqan/seq_io.h>
#include <RcppArmadillo.h>

using namespace Rcpp;
using namespace seqan;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]

List baseByCycleCpp(std::string argv)
{

 // Open memory mapped string.
    seqan::String<char, seqan::MMap<> > mmapString;
    if (!open(mmapString, argv.c_str(), seqan::OPEN_RDONLY))
        return 1;  // Could not open file.

    // Create RecordReader.
    seqan::RecordReader<seqan::String<char, seqan::MMap<> >,
                        seqan::DoublePass<seqan::Mapped> > reader(mmapString);

    // Read file in one pass into optimised memory space
    seqan::StringSet<seqan::CharString> seqs; // this persists
    {
    // whereas I declare ids inside this block because I don't need them
    seqan::StringSet<seqan::CharString> ids;
    if (read2(ids, seqs, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.
    }

    //A Dna5q Stringset Iterator
    typedef Iterator<StringSet<CharString> >::Type TStringSetIter;
    TStringSetIter seqIt = begin(seqs);

    const unsigned numcycles = length(*seqIt);  // get the number of cycles i.e. readlength
    const unsigned numreads = length(seqs);    // length of the StringSet, number of reads

    //there are 5 bases ACGTN hence 5 rows.
    arma::mat baseCounterMatrix(5, numcycles);
    baseCounterMatrix.zeros();

    seqan::CharString seqn;

    for( ; seqIt != end(seqs); ++seqIt) // this puts the seq data into the arma format
    {
      seqn = *seqIt;

      for(unsigned cycle = 0; cycle< numcycles; ++cycle)
      {

          if(seqn[cycle] =='A') 
            baseCounterMatrix(0,cycle) += 1;

          if(seqn[cycle] =='C') 
            baseCounterMatrix(1,cycle) +=1;

          if(seqn[cycle] =='G') 
            baseCounterMatrix(2,cycle) +=1;

          if(seqn[cycle] =='T') 
            baseCounterMatrix(3,cycle) +=1;

          if(seqn[cycle] =='N') 
            baseCounterMatrix(4,cycle) +=1;

      }

    }
  // convert the base counts to a fraction  
  baseCounterMatrix = baseCounterMatrix/numreads;
  return List::create(Named("File")   = argv,
                      Named("Reads")   = numreads,
                      Named("Cycles") =  numcycles,
                      Named("baseByCycle")  = baseCounterMatrix);
}
Advertisements

Some thoughts on a new Hadley Wickham talk

Standard

I just watched a talk by Hadley Wickham the author of the very popular ggplot2, plyr and reshape2 packages.

NY Open Stats Programming Meetup – Hadley Wickham

Screen Shot 2013-04-30 at 10.57.09

A lot of his talk was pertinent to some of the stuff I’m playing around with here. And there were a few new things I learned.

  1. readRDS( ), 45 mins 30 : That’s a new function to me. I’d never heard of it. Apparently it is used by base R and is really compact and efficient for single data objects. I’ve been using R for just less than 10 years. Why haven’t I heard of it? Actually the idea sounds similar to the memory mapping that helps seqan read data so quickly.. though I need to look into it more.
  2. Algebraic, Distributive, Holistic, 58 mins 30: Here he breaks down algorithms on data into 3 types based on whether they need a single pass through the data and a single stored value (Algebraic) e.g. counting or summing. A pass through the data and more than a single stored value (Distributive) e.g. mean or standard error or kurtosis. And finally a function on all the data that may require lots of intermediates e.g. medians, quantiles. Again these categories are new to me but they reflect several FastQC tasks I’m working through: the boxplots of base quality by cycle (Holistic), the bins of mean read quality (Distributive), and to follow shortly the GC content (largely Algebraic). Really it’s only the Holistic that present any problems.
  3. Rcpp, 1hr 14mins 28: I really had no idea of all the stuff that goes on under the hood of Rcpp to make it integrate with C. I’m glad it just seems to work. I spent a bit of time trying to understand the relation between RcppArmadillo, BLAS, LAPACK, ATLAS – multithreading (is RcppArmadillo compiling to multithread automatically?) and Mac a few days back. NEVER AGAIN – JUST LEAVE IT.
  4. data.frame data.table, 1hr 34 & 1hr 42: There were a couple of interesting comments on why data.frames don’t play nice with big datasets. He also at one point complained that though the data.table package was very powerful it achieved this at the expense of breaking the established syntax of the R language!!! “Hello Pot, meet my friend Kettle, you can call him Black”. Although he also said that plyr would be moving to use data.table as a back-end….

On these last couple of points it seems to me that there is a lot of struggling with R to do clever stuff with big data (OK I said big data- I was trying to avoid it) behind the scenes without breaking the user experience. In these posts I’m not even passing data in from R just a file string. In Bioconductor packages like GRanges they seem to do a lot of pointing (in the C sense) to external memory on huge genomics DNA strings. With Rcpp they are always trying to pass by reference to the R object. Then there are file backed packages like bigmemory, SQLdf.

Of course in a high level language like R you expect lots of low level stuff like XML, cURL, or database connections just to integrate with lower level code. But it seems that lots of the low level memory handling and access stuff in R is being hacked in a multitude of different ways. It’s a bit of a wild west. If I were king R would all be written in C++ (C++ 11 : auto everything).

And how does R.3.0 and the new big vectors effect memory and performance? I’d really like to see a practical talk on this – that I understood. Actually I’d like to see someone with grand vision of where all of this is going give a talk….

Anyway… there is a lot more detail on bigvis in his preprint: Bin-summarise-smooth: A framework for visualising large data.

Rcpp, RcppArmadillo, Seqan, Sequence Analysis (2)

Standard

I should talk a bit more about RcppArmadillo and the Seqan C++ libraries. RcppArmadillo is the Rcpp integration of the Armadillo linear algebra libraries. They and Rcpp bring some of that vectorised R style magic to C++ matrices (see below). The seqan library meanwhile has all the reading and parsing functionality for sequence files and data. In particular seqan uses memory efficiently through memory mapping and it can extract the phred quality scores to integer values.

Quality By Sequence

The next important quality control (QC) diagnostic you get from FastQC is the average sequence quality – or what I have called quality by sequence (qualBySeq). The code is similar to before but note you can calculate it with a single read through the data and no need to create a large matrix. The godsend here though is the RcppArmadillo hist() function which makes this trivial.

qualBySeq = function(fastqFile) {
    require(RcppArmadillo)
    require(ggplot2)

    # This is the C++ function
    qualSeqHist = qualBySeqCpp(fastqFile)

    # tidy into a data.frame of bins and read counts
    qualSeqFrame = data.frame(MeanSeqQual = (qualSeqHist$minQual + 1):qualSeqHist$maxQual, 
        Reads = qualSeqHist$qualBySeq)

    g <- ggplot(qualSeqFrame, aes(x = MeanSeqQual, y = Reads))
    g + geom_line() + theme_bw() + labs(list(x = "Mean Sequence Quality (Phred Scores)", 
        y = "Reads"))

}

The results look like this:

qualBySeq

Naturally due to the lower overheads this function is faster than the previous taking ~6 seconds to calculate averages on 2 million reads. It will probably process a lot more data without running out of space too. Note the lazy (I mean lazy as in “lazy” not any sophisticated programming sense of the word) use of scope to read lots of fastq sequence and ID data then just forget it. A major improvement would be to skip these fastq lines completely. There is also an unsavoury mixture of iterator and unsigned int loops …

#include <seqan/sequence.h>
#include <iostream>
#include <seqan/file.h>
#include <seqan/seq_io.h>
#include <RcppArmadillo.h>

using namespace Rcpp;
using namespace seqan;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]

List qualBySeqCpp(std::string argv)
{

 // Open memory mapped string.
    seqan::String<char, seqan::MMap<> > mmapString;
    if (!open(mmapString, argv.c_str(), seqan::OPEN_RDONLY))
        return 1;  // Could not open file.

    // Create RecordReader.
    seqan::RecordReader<seqan::String<char, seqan::MMap<> >,
                        seqan::DoublePass<seqan::Mapped> > reader(mmapString);

    // Read file in one pass into optimised memory space
    seqan::StringSet<seqan::CharString> quals; // this persists
    {
    // whereas I declare these inside this block because I dont need them
    seqan::StringSet<seqan::CharString> seqs;
    seqan::StringSet<seqan::CharString> ids;
    if (read2(ids, seqs, quals, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.
    }

    //A CharString Stringset Iterator
    typedef Iterator<StringSet<CharString> >::Type TStringSetIter;
    TStringSetIter qualIt;

    seqan::CharString qual;
    qualIt = begin(quals);                       // point to first Phred String of Stringset  
    const unsigned numcycles = length(*qualIt);  // get the number of cycles i.e. readlength
    const unsigned numreads = length(quals);    // length of the StringSet, number of reads

    //an armadillo vector of length of numreads set to zero
    arma::vec perReadQuality(numreads);
    perReadQuality.zeros();

    unsigned armaCounter = 0; 
    for( ; qualIt != end(quals); ++qualIt, ++armaCounter) // this puts the seq data into the arma format
    {
      qual = *qualIt;

      for(unsigned cycle = 0; cycle< numcycles; ++cycle)
      {
        // sum the quality scores across each sequence read
        perReadQuality[armaCounter] += (int)ordValue(qual[cycle])-33;

      }

    }

    // FastQC Per Sequence Quality Scores  
    perReadQuality = perReadQuality/numcycles; // make the sum phred score to a mean phred score    
    int minQual = min(perReadQuality);
    int maxQual = max(perReadQuality);
    int qualRange = maxQual-minQual;
    // calculates bins for the integer range of qualities
    arma::uvec qualityBins = hist(perReadQuality, qualRange);

    return List::create(Named("File")   = argv,
                      Named("Reads")   = numreads,
                      Named("minQual") = minQual,
                      Named("maxQual") = maxQual,
                      Named("qualBySeq")  = qualityBins);
}

** NEXT : FastQC style "Per Base Sequence Content" **

Rcpp, RcppArmadillo, Seqan, Sequence Analysis (1)

Image

Bioinformatics software – particularly dealing with sequences – is dominated by command line tools. Sadly these are written in a plethora of different languages meaning that analysis pipelines are a mish-mash of Java Jars, interpreted Perl, Ruby, or Python scripts, plus compiled C/C++ standalones usually piped together by Bash scripts. Whilst these are mostly efficient they sometimes fail in the unix design philosophy of doing one simple thing very well. Rather they often do lots of overlapping things in undocumented ways. Installing, maintaining and using them together can be a bit of a pain.

In R meanwhile there is an excellent set of Bioconductor packages that lend a unified framework for dealing with sequences. There are many packages for applications like ChIP-seq or RNA-seq when the data is already aligned and summarised. There are less great tools for processing unaligned sequence data. Mostly this is because there is so much of it – R just doesn’t handle large datasets that well – though the Bioconductor developers have made heroic strides in managing memory outside base R.

The Bioconductor ShortRead package does handle some quality control (QC). It doesn’t do everything I want though and the underlying code is in C – which I for one find difficult to hack. So I decided to write the functionality of the FastQC and trimmomatic in R and C++ (using Rcpp and the Seqan C++ library)…. for fun. The next few posts will show the functions

Quality By Cycle

This first post show code for the most important FastQC graph – what they call “per base sequence quality” – and I call “quality by cycle”. High throughput sequences are “read” from left to right in repeated photochemical cycles of declining quality. So when reads get too long the error rate becomes unacceptably high. Usually this is the main diagnostic used to select the ‘trimming’ length for reads before subsequent analyses.

This “quality” is coded in the reads fastq files by a character code knowna as a Phred Score phred quality scores (Q) per base. Where P is the base calling error probability:

\( Q = -10\log_{10}P \)

Here are 4 lines of fastq: an ID, the DNA sequence, a separator line, and the Phred Score:

@SRR031709.1 HWI-EAS299_4_308T2AAXX:6:1:901:1978/1
GTTTCATAATTTAGATGGTTCGGGAAGTTAAACAATATGCGTAAA
+
:::::::::::::::::::::7::::::::/2222-,,,,,-,,(

Here then is the R code for qualByCycle which as the name implies creates a boxplot of the quality scores by cycle using ggplot2.

qualByCycle = function(fastqFile)
  {
  require(RcppArmadillo)
  require(ggplot2)

  # This is the C++ function
  boxStats = qualByCycleCpp(fastqFile)

  # just some tidying and labels
  boxStatsFrame = boxStats$fivenum
  boxtitles = paste('no of reads = ', boxStats$Reads)
  boxStatsFrame= t(boxStatsFrame)
  colnames(boxStatsFrame)=c('min', 'Q1', 'median', 'Q3', 'max')
  boxStatsFrame=as.data.frame(boxStatsFrame)
  boxStatsFrame$Cycle=factor(1:45)

  # make a plot
  g <- ggplot(boxStatsFrame, aes(x = Cycle, ymin = min, lower = Q1, middle = median, upper = Q3, ymax = max))
  g + geom_boxplot(stat = "identity")+theme_bw()+labs(list(title=boxtitles, y= 'Phred Score'))
}

And it gives something like this:

QualByCycle

I’m using Rcpp + RcppArmadillo + C++ seqan to calculate quality summaries. C++ (can sometimes) handle big chunks of memory better than R. You could do some of this reading in fastq data into R using ShortRead, make a matrix of qualities then a boxplot. But for really big matrices e.g. >106 by 75 making a boxplot chokes an average desktop computer. Alternatively you could try the R fivenum() function (min, Q2, median, Q3, max) to calculate summary statistics for each cycle and plot these directly. This too chokes on a big matrix.

So you could pass the data out to a C++ function using Rcpp and calculate the fivenum summaries…better. Indeed a similar programming strategy is employed in Hadley Wickhams forthcoming (?) bigVis package. Regardless calculating fivenum is always going to be a cpu chore as whilst stats such as mean, min, max can be computed by a one pass read through an array (cycle columns in this case), Q1, median and Q3 imply sorting all values.

However reading the fastq file into R memory using Shortread and converting to a matrix before passing to Rcpp is also a memory/cpu hog. Better to do it all in C++ outside R then import just the summary data for plotting. For this I employ the outstanding sequence analysis C++ seqan library for the heavy lifting. In particular I use this due to it’s memory mapping which I’ll talk about more in subsequent posts….

Whilst I only really need R for the plotting – the Rcpp “Sugar” and RcppArmadillo packages are really useful too for simplifying C++ maths code. I’m not too proud of the following code…  It analyses  2 million reads in ~10 secs on my oldish mac desktop with 4GB RAM and 2 cpu cores. It will happily work on >6 million reads without disk churn. There are some glaring inefficiencies… but it is convenient… for now.

#include <seqan/sequence.h>
#include <iostream>
#include <seqan/file.h>
#include <seqan/seq_io.h>
#include <RcppArmadillo.h>

using namespace Rcpp;
using namespace seqan;

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]

List qualByCycleCpp(std::string argv)
{

 // Open memory mapped string.
    seqan::String<char, seqan::MMap<> > mmapString;
    if (!open(mmapString, argv.c_str(), seqan::OPEN_RDONLY))
        return 1;  // Could not open file.

    // Create RecordReader.
    seqan::RecordReader<seqan::String<char, seqan::MMap<> >,
                        seqan::DoublePass<seqan::Mapped> > reader(mmapString);

    // Read file in two pass into optimised memory space
    seqan::StringSet<seqan::CharString> quals; // this persists
    {
    // ids and seqs are not used so I let go out of scope to free up memory -- HAXXXOR!
    seqan::StringSet<seqan::CharString> ids;
    seqan::StringSet<seqan::CharString > seqs;
    if (read2(ids, seqs, quals, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.
    }

    //A Dna5q Stringset Iterator
    typedef Iterator<StringSet<CharString> >::Type TStringSetIter;
    TStringSetIter qualIt;

    seqan::CharString qual;
    qualIt = begin(quals);                       // point to first Phred String of Stringset  
    const unsigned numcycles = length(*qualIt);  // get the number of cycles i.e. readlength
    const unsigned numreads = length(quals);    // length of the StringSet, number of reads

    // for boxplots by cycle a matrix sort is necessary you cant compute median etc on the stream.
    // Rcpp Armadillo matrices
    arma::mat qualMatrix = arma::mat(numreads, numcycles);
    qualMatrix.zeros();
    unsigned armaCounter = 0; 

    for( ; qualIt != end(quals); ++qualIt, ++armaCounter) // this puts the seq data into the arma format
    {
      qual = *qualIt;

      for(unsigned cycle = 0; cycle< numcycles; ++cycle)
      {
        // a matrix of quality scores
        qualMatrix(armaCounter, cycle) = (int)ordValue(qual[cycle])-33;

      }

    }// is there a quick way to get seqs into arma::matrix?

    // I sort the columns (cycles) then calculate fivenum for each cycle (column)
    qualMatrix= sort(qualMatrix,0,0);
    int q1 = numreads/4;
    int mid = numreads/ 2; // Euclidian division
    int q3 = q1*3;
    NumericMatrix fivenum(5,numcycles);    
    for (unsigned cycle = 0; cycle < numcycles; ++cycle) 
    { 

        fivenum(0,cycle) = qualMatrix(0,cycle);
        fivenum(1,cycle) = qualMatrix(q1,cycle);
        fivenum(2,cycle) = qualMatrix(mid,cycle);
        fivenum(3,cycle) = qualMatrix(q3,cycle);
        fivenum(4,cycle) = qualMatrix(numreads-1,cycle);        
    }

return List::create(Named("File")   = argv,
                      Named("Reads")   = numreads,
                      Named("fivenum")  = fivenum);
}

NEXT TIME: FastQC style “per Sequence Quality Score” – and a bit more about Seqan.

How to work through the R tutorials

Video

I discuss here how you might want to work through the tutorials. All the tutorials and necessary data will be given to you in a single folder. You will probably want to open the tutorial in your web browser and find a keyboard shortcut that quickly switches between browser and RStudio.

For the exercises throughout the tutorial you will probably want to open up the RStudio code editor so you can edit small bits of code without retyping everything.

Some exercises are hard (some are easy!) so there are model answers for everything. I doubt anyone will be able to solve them all in good time. — but have a go before you look at the model answers.

R vs RStudio

Video

 

Essentially RStudio is a wrapper application for R. When you launch RStudio it launches the default version of R on your computer — but with a lot of handy features.

  • If you update your R version Rstudio will automatically use the new version.
  • If you install packages whilst running RStudio they will be installed when you launch R alone.
  • If you update RStudio it won’t alter your R version at all. The R version and packages will remain the same.