Rcpp, RcppArmadillo, Seqan, Sequence Analysis part 4

Standard

This next algorithm was a bit tougher as std::string are kryptonite to me at the best of times and Seqan::StringSet<Dna5>> confused me too. I got there eventually by monkey-typewriting.

### Sequence Duplication

This function puts together two FastQC screens as the data arise naturally from the same procedure. First there are bins of the numbers of duplicate and then if there are more than a certain threshold of duplicates we record the sequence and return them. Such highly duplicated sequences maybe PCR errors.

In many ways seqan is great and mixes well with STL. For instance I simply std::sort all the sequences in a single line. Then I can run through them and count the consecutive duplicates. On the other hand I can’t just copy a Dna5 to a std::string (at least I don’t know how) – I have to step through char by char. Maybe there is a way?

seqDupl= function(argv)
  {
  require(ggplot2)
  dups = seqDuplCpp(argv)
  dups.df=data.frame(Bins=1:10, Counts=dups$Counts/dups$Counts[1])
  dupFraction = sum(dups$Counts[2:10])/sum(dups$Counts)

  seqs.df= data.frame(Duplicates=dups$Duplicates, Count=dups$SeqCount, Fraction= dups$SeqCount/dups$Reads)
  print(seqs.df)
  qplot(x=Bins, y=Counts, data=dups.df, geom='line', main = paste('Duplicate Fraction =', format(dupFraction, digits=3)))+theme_bw()

  }
seqDupl('test8.fastq')

seqDupl

                                     Duplicates Count     Fraction
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 62336 0.0207786667
2 GGCGGACAGAGCCAAGGCGAACAGGGCGAACACGAGGATGCAAAT   754 0.0002513333
3 GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCA   777 0.0002590000

The C++ code is then:



    // CUT THE TOP HALF OF THE CODE
    // SIMILAR TO PREVIOUS POSTS

    //A Dna5q Stringset Iterator
    typedef Iterator<StringSet<Dna5String> >::Type TStringSetIter;    
    std::sort(begin(seqs), end(seqs));// YOU CAN USE STL!!!

    TStringSetIter seqIt1 = begin(seqs);
    TStringSetIter seqIt2 = seqIt1 + 1 ; // this is 1 in front of the other iterator

    const unsigned numcycles = length(*seqIt1);  // get the number of cycles i.e. readlength
    const unsigned numreads = length(seqs);    // length of the StringSet, number of reads

    NumericVector bins(10);
    unsigned int dupCount = 0;
    unsigned int dupThreshold = numreads/5000;    
    std::vector<std::string> dupSeqsVec;
    std::string dupSeqs(numcycles, 'N');
    std::vector<int > dupSeqsCount;
    seqan::CharString seq1, seq2;

    // seqs is lexically sorted so duplicates are consecutive
    for( ; seqIt2 != end(seqs); ++seqIt1, ++seqIt2)
    {
      seq1= value(seqIt1);
      seq2=value(seqIt2);
      // if first and next are the same we start counting how many
      if (seq1 == seq2)
      {        
        ++dupCount;
      }
      // till we come to a different read sequence
      if (seq1 != seq2)
      {
        // bin 1-10 increment
        if(dupCount <= 9)  
          bins[dupCount] += 1;
        // if >10 increment 10th bin
        if(dupCount > 9)
          bins[9] += 1;

        // if a threshold is met we record the sequence and how many there were  
        if(dupCount > dupThreshold)
        {
          dupSeqsCount.push_back(dupCount);
          for(unsigned i=0; i< numcycles; ++i)
              dupSeqs[i] = seq1[i];

          dupSeqsVec.push_back(dupSeqs);
        }
      // then reset the counter after bin increments and recording is done  
      dupCount = 0;    
      }

    }
    return List::create(Named("File")     = argv,
                      Named("Reads")      = numreads,
                      Named("Counts")       = bins,
                      Named("Duplicates") = dupSeqsVec,
                      Named("SeqCount")      = dupSeqsCount);
}
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s