Wednesday, August 01, 2012

Three ways to trim adaptor/primer sequences for paired-end reads

1. Understanding the adaptors (skip this part if you're familiar with the Illumina adaptor)

Before trimming anything from the reads, let's get clear what the reads content is.

Taking Trufseq reads (from Illumina HiSeq 2000) as example, here is the read file (fastq) looks like:

$ cat r1.fq 
@3VFXHS1:278:D13Y4ACXX:1:1101:1472:2209 1:N:0:CGATGT
CTGGTATTGTCTCTTCCCACACTGAACTCTGGGGAATTCGATGTGTGGCACAGCCCGGCTCAGCCTGCCCGCTGGTGGGAGCCCCTGGGAAGCTGCGGCGC
+
@@CFDDFFGH>CAEH:CGHIJJJJEIHJJHIJJJ?DHIDIJHGEGHJG;FHC9@B(5@6A=EH:B@B@2=>>B?BDCBD<B52<<ABD?<?B1@A9>B###
@3VFXHS1:278:D13Y4ACXX:1:1101:1434:2224 1:N:0:CGATGT
GGCAGAGCCAATCTTCGGACGTGGTGATTGTCTCCTCTAAGTACAAACAGCGCTATGAGTGTCGCCTGCCAGCTGGAGCTATTCACTTCCAGCGTGAAAGG
+
BC@FFFFFHHHGFHIIJJIJGFHICFCGIHGFHFGGCHD@F?B?BGGHJJIG6D@EHEHHEHCD259?AACD@AC59?,(5>A,;>:@C(::(029?8>@A
@3VFXHS1:278:D13Y4ACXX:1:1101:1712:2247 1:N:0:CGATGT
GTACACTTGAACACATTTTTCTAACCTTAGAAAATACCTACAAGGCCTGTTGTCTTGACCCATTACTCAATTGTCCCTGGCATATTATCTGATCTTCACGT
+
CCCFFFFFHHHHGJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJGIIJGGGDHHHIHIHHEIJIJJIIGGIIIIFEHHHHFEFFFDEEEDCEEEDEDC?A
@3VFXHS1:278:D13Y4ACXX:1:1101:3318:2215 1:N:0:CGATGT
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAGAAAAAAAAACAAGCGACAAGGACAGA
+
CCCFFFFFHHHHHJJJJJIJJJJJJJJJIJGHFFHIAHIFGGIIJJIJJFIJIHJIHHHGEGFE>CFFEB###############################
@3VFXHS1:278:D13Y4ACXX:2:1101:5344:2243 1:N:0:ACAGTG
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAACAACAGAAAAAACAAAGCGCGAACAGTGC
+
CCCFFFFFHHHHHJIJJIJIJHJJJJJJJHIGIGIHJFHFFDIIJGIIIJJJHHFIJJJJJJHCEHHFD################################

$ cat r2.fq 
@3VFXHS1:278:D13Y4ACXX:1:1101:1472:2209 2:N:0:CGATGT
CTGGATTTGAAATCTTTAGCGGAGCGGGAACGCCGGCGCGGAAGGGTCTCTACACAGGGCCCGGTCCGCCCTTGCGCTCTCCTTAATGNNNNNNNNNNCGC
+
@CCF?EFFHHHGHHHIFGI@HGGIEHIGIJGI6@@E>B8>??:DBD++399>ACCDDDD@DBDD58@BBDDDDD@@<@BDDDDC>CDC#############
@3VFXHS1:278:D13Y4ACXX:1:1101:1434:2224 2:N:0:CGATGT
CCCGGGGCCTCCCATTAAGGTCGCACTTGGACCCATTGCCATAGGTCTGGCTGTGGTAGCGTTTAAGACGATGCTGCTTGGAGGCCTTGGCTGTTTCATCA
+
BCCFFFFDHFFHGIHIHGJJIJJGGIIJJJIDHGJIJEIIIIIJJIIJIJHHHEEF;>DDA>BBB@CAABBBDDCDDDDDDAD@@?CDDDCCB?ACCDDC#
@3VFXHS1:278:D13Y4ACXX:1:1101:1712:2247 2:N:0:CGATGT
GTTACTCAGCATTTATTCATGCCTGCTGTGTACGGAAAGGGCAGTTACAAAGGAAAGCCTTGATGATTCTGCTTCCAAGAAACGTGAAGATCAGATAATAT
+
CBCFFFFFHHHHHJIJJIIJJJJJIIJIHIHIJJJIJIJIJJIIIIIIJJGGIIJJJIJJJJHIJJJJHIJHGHHHHFFFFFEDECBDDDDDCCDDDCDEE
@3VFXHS1:278:D13Y4ACXX:1:1101:3318:2215 2:N:0:CGATGT
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAAACAGAAAAAGAAAGAAGGAAAGGNGAGT
+
CCCFFFFFHHHHHJJJJFHHHJJJIJJGGIDGHIEIJJJIIJGIJ5@FHIB?DFFEEEDEEEDD?####################################
@3VFXHS1:278:D13Y4ACXX:2:1101:5344:2243 2:N:0:ACAGTG
GGATCGGGAAAGGGGGGGGGGGGGGAAAAGGGGGGATTTCCGGGGGGGCCGGTTCTTTTAAAAAAAAAAAAAAAGAAAACAGAAACAGAAGATGGACAACA
+
CCCFFFFFHHHHHJJJJFHHHJJJIJJGGIDGHIEIJJJIIJGIJ5@FHIB?DFFEEEDEEEDD?####################################

First of first, it's critical to understand what your reads file contain; do they contain adaptor sequences? do they contain primer sequence? I strongly recommend to read the description file here: http://genomics.med.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf, from which we could know that the constructed dsDNA (before binding to the flow cell for sequencing) looks like: 
Where 

Trufseq Universal Adaptor:
5´AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT3´
--> reverse complementary
5´AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT3´

TruSeq Indexed Adapter:
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC‐NNNNNN‐ATCTCGTATGCCGTCTTCTGCTTG3´

The 6-nt "NNNNNN" is barcode for multiplexing. We noticed that the 3´ of Universal adaptor is reverse-complementary to the 5´ of Indexed adaptor (Why? This is to form the Y-shape adaptor. See ZZ's dUTP figure in another post). 

Combining together the "Overrepresented sequences" of FASTQC output file (for example), we could infer that the reads are contaminated by adaptor/primers (for example the red-marked part in the fastq sequences).

So, next step is how to remove the contamination. 

2. Collect adaptor sequences

To remove contamination, we first should collect all possible "contamination" sources. For our case, we collect all used barcode sequences and generated the adaptor file:

http://zlab.umassmed.edu/~dongx/tracks/RNAseq/BU/Demultiplex_Stats.htm
copy content and save to txt file Demultiplex_Stats.htm
grep -v Undetermined barcode_stat.html | awk '{print ">"$2; print "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"$4"ATCTCGTATGCCGTCTTCTGCTTG";}' > adaptor.fa

The Demultiplex_Stats.htm file, which contains barcode information of each sample, is usually included in the output folder of sequencing. Otherwise, you can consult from the data producer. 

You may also want to append the universal adaptor:
$cat >> adaptor.fa
>Trufseq_Universal_Adaptor
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
(Ctrl+D)

A full list of commonly-used adaptors can be retrieved from the following URL (returned by Google search) by command:

curl -s http://www.omicsoft.com/downloads/ngs/contamination_list/v1.txt 2>&1 | sed "/^\s\+$/d;s/\t\+/\t/g;s/ /_/g;s/,//g;s/'//g" | awk '{print ">"$1; print $2;}' > adaptors_list.fa 

This does not include Trufseq barcodes used for smallRNA (see below), which you have to include yourself:
http://epigenome.usc.edu/docs/resources/core_protocols/TruSeq%20index%20sequences-3.pdf

3. Remove adaptor 

There might be many ways to remove adaptors, but specifically for PE reads (e.g. both reads are removed if one of the pairs is disqualified), I'd like to introduce three ways to do this.


usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]

Options:
    -h      This help
    -o FIL  Output file (stats to stdout)
    -s N.N  Log scale for clip pct to threshold (2.2)
    -t N    % occurance threshold before clipping (0.25)
    -m N    Minimum clip length, overrides scaled auto (1)
    -p N    Maximum adapter difference percentage (10)
    -l N    Minimum remaining sequence length (19)
    -L N    Maximum sequence length (none)
    -k N    sKew percentage-less-than causing trim (2)
    -q N    quality threshold causing trimming (10)
    -w N    window-size for quality trimming (1)
    -f      force output, even if not much will be done
    -F FIL  remove sequences that align to FIL
    -0      Set all trimming parameters to zero
    -U|u    Force disable/enable illumina PF filtering
    -P N    phred-scale (auto)
    -x N    'N' (Bad read) percentage causing trimming (20)
    -R      Don't remove N's from the fronts/ends of reads
    -n      Don't clip, just output what would be done
    -C N    Number of reads to use for subsampling (200k)
    -S FIL  Save clipped reads to file
    -d      Output lots of random debugging stuff

For example, 
fastq-mcf -o c1.fq -o c2.fq -l 16 -q 15 -w 4 -x 10 -u -P 33 adaptor.fa r1.fq r2.fq &>r.log


Paired End Mode:
java -classpath <path to trimmomatic jar> org.usadellab.trimmomatic.TrimmomaticPE [-threads <threads>] [-phred33 | -phred64] [-trimlog <logFile>] <input 1> <input 2> <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> <step 1> ...

Step options:
  • ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
    • fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc. The naming of the various sequences within this file determines how they are used. See below.
    • seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed
    • palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
    • simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.
  • SLIDINGWINDOW:<windowSize>:<requiredQuality>
    • windowSize: specifies the number of bases to average across
    • requiredQuality: specifies the average quality required.
  • LEADING:<quality>
    • quality: Specifies the minimum quality required to keep a base.
  • TRAILING:<quality>
    • quality: Specifies the minimum quality required to keep a base.
  • CROP:<length>
    • length: The number of bases to keep, from the start of the read.
  • HEADCROP:<length>
    • length: The number of bases to remove from the start of the read.
  • MINLENGTH:<length>
    • length: Specifies the minimum length of reads to be kept.
Following the above example:

java -classpath $CLASSPATH/trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticPE -phred33 -trimlog r.log r1.fq r2.fq t1.fq t1.unpaired.fq t2.fq t2.unpaired.fq LEADING:3 TRAILING:3 ILLUMINACLIP:adaptor.fa:2:40:15 SLIDINGWINDOW:4:15 MINLEN:16 

These two programs give same results:

@3VFXHS1:278:D13Y4ACXX:1:1101:1472:2209 1:N:0:CGATGT
CTGGTATTGTCTCTTCCCACACTGAACTCTGGGGAATTCGATGTGTGGCACAGCCCGGCTCAGCCTGCCCGCTGGTGGGAGCCCCTGGGAAGCTGCGG
+
@@CFDDFFGH>CAEH:CGHIJJJJEIHJJHIJJJ?DHIDIJHGEGHJG;FHC9@B(5@6A=EH:B@B@2=>>B?BDCBD<B52<<ABD?<?B1@A9>B
@3VFXHS1:278:D13Y4ACXX:1:1101:1434:2224 1:N:0:CGATGT
GGCAGAGCCAATCTTCGGACGTGGTGATTGTCTCCTCTAAGTACAAACAGCGCTATGAGTGTCGCCTGCCAGCTGGAGCTATTCACTTCCAGCGTGAAAGG
+
BC@FFFFFHHHGFHIIJJIJGFHICFCGIHGFHFGGCHD@F?B?BGGHJJIG6D@EHEHHEHCD259?AACD@AC59?,(5>A,;>:@C(::(029?8>@A
@3VFXHS1:278:D13Y4ACXX:1:1101:1712:2247 1:N:0:CGATGT
GTACACTTGAACACATTTTTCTAACCTTAGAAAATACCTACAAGGCCTGTTGTCTTGACCCATTACTCAATTGTCCCTGGCATATTATCTGATCTTCACGT
+
CCCFFFFFHHHHGJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJGIIJGGGDHHHIHIHHEIJIJJIIGGIIIIFEHHHHFEFFFDEEEDCEEEDEDC?A


Cite what Simon posted here, HTSeq has facilities useful for this.

You can write a little Python script like this:
import itertools
import HTSeq
in1 = iter( HTSeq.FastqReader( "mydata_1.fastq" ) )
in2 = iter( HTSeq.FastqReader( "mydata_2.fastq" ) )
out1 = open( "trimmed_1.fastq", "w" )
out2 = open( "trimmed_2.fastq", "w" )
for read1, read2 in itertools.izip( in1, in2 ):
   read1.trim_right_end( "ACGGTC" )
   read2.trim_left_end( "TTCGAC" )
   read1.write_to_fastq_file( out1 )
   read2.write_to_fastq_file( out2 )
out1.close()
out2.close()
I've not figured out how to apply a fasta file of adaptor. 

4 comments:

  1. How to sort these barcodes instead of trimming them?

    ReplyDelete
  2. Thanks for the info!
    My data is stardard Truseq+Hiseq, adaptors same as TruSeq3-SE.fa provided by Trimmomatic.
    My command are as follows:

    java -jar $trimmomatic PE -phred33 -trimlog logs/$pre.log ${pre}_1.fastq.gz ${pre}_2.fastq.gz ${pre}_1.trim.fastq.gz ${pre}_1.unpaired.fastq.gz ${pre}_2.trim.fastq.gz ${pre}_2.unpaired.fastq.gz LEADING:3 TRAILING:3 ILLUMINACLIP:$primer:2:30:10:4:true SLIDINGWINDOW:4:10 MINLEN:20

    And I'm quite confused about the output in the unmapped file. For example, the following read from read 1, I cannot see a reason why it is classified as unmapped. No adaptor sequenced in it and good quality.

    @HWI-C00135:113:C4RWGACXX:8:1101:3466:2612 1:N:0:ACAGTG
    GGGCACTGAGAGACACCAGCTTTCCTGTTTTCAGCTTCAGCTGTGCTTCCAGGGCCCCCACAGCACTGAAAGCCCAGCAAGCACCACAAGAACCTTGATAT
    +
    CCCFFFFFHHHHHJJIIHJFIIJEHIJHJJJIGHIJJIIJJIJGIJGIFHIJJBHJJJJIIJGHHHHFFFFFBEC=AACCCDBDBBDDDDDCDDDDCDDEC


    Another question, for those reads with only a few bad-quality bases at the end, how can still keep the trimmed part in the */trim.fastq file? It seems that the default behavior is to put it in the unmapped file too.

    Appreciate it very if you can help!

    ReplyDelete
    Replies
    1. You may want to read my another post about the essential background behind adaptor: http://onetipperday.blogspot.com/2013/12/illumina-hiseq2000-adaptor-and.html. Personally I prefer to use FastqMcf.

      Delete
  3. Hi xianjun,

    Do you still have a working link for the adaptor fasta file?

    Thanks,
    Ming

    ReplyDelete