
=Trinity RNA-Seq Pipeline=

Based on the khmer protocols [|khmer protocols online documents]

Starting with the following files:

<code>
filtered_106A_Female_Mix_GATCAG_L004_R1.fastq
filtered_106A_Female_Mix_GATCAG_L004_R2.fastq
filtered_106A_Male_Mix_TAGCTT_L004_R1.fq.fastq
filtered_106A_Male_Mix_TAGCTT_L004_R2.fq.fastq
filtered_108A_Female_Mix_GGCTAC_L004_R1.fastq
filtered_108A_Female_Mix_GGCTAC_L004_R2.fastq
filtered_108A_Male_Mix_AGTCAA_L004_R1.fastq
filtered_108A_Male_Mix_AGTCAA_L004_R2.fastq
</code>
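The scripts below derive a per-sample base name from each R1 filename using awk and sed. A quick sanity check of that extraction, using one of the filenames above:

```shell
# awk -F "." '{print $1}' keeps everything before the first dot;
# sed 's/...$//' then strips the trailing three characters ("_R1").
name="filtered_106A_Female_Mix_GATCAG_L004_R1.fastq"
base=$(echo "$name" | awk -F "." '{print $1}' | sed -e 's/...$//')
echo "$base"   # filtered_106A_Female_Mix_GATCAG_L004
```

Note that this relies on every file ending in exactly `_R1` before the first dot; differently named files would need a different pattern.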

Cleaning & Trimming

<code format="bash">
#!/bin/sh

FILES=`ls -1 *_R1.fastq | awk -F "." '{print $1}' | sed -e 's/...$//'`

for file in ${FILES}
do
    echo ${file}
    java -jar /usr/local/bioinformatics/Trimmomatic/trimmomatic-0.30.jar \
        PE ${file}_R1.fastq ${file}_R2.fastq \
        s1_pe s1_se s2_pe s2_se ILLUMINACLIP:illuminaClipping.fa:2:30:10

    interleave-reads.py s1_pe s2_pe > ${file}.pe.fq
    cat s1_se s2_se > ${file}.se.fq
    rm s1_pe s2_pe s1_se s2_se

    fastq_quality_filter -Q 33 -q 30 -p 50 -v -i ${file}.pe.fq \
        -o ${file}.pe.qc.fq

    fastq_quality_filter -Q 33 -q 30 -p 50 -v -i ${file}.se.fq \
        -o ${file}.se.qc.fq

    extract-paired-reads.py ${file}.pe.qc.fq
done
</code>

The first step, Trimmomatic, trims the paired read files and removes the adapters listed in the **illuminaClipping.fa** file. Trimmomatic saves the results in four files: s1_pe and s2_pe, the forward and reverse reads that are still properly paired, and s1_se and s2_se, which contain the orphaned forward and reverse reads whose mates were discarded.

Next, the reads that are still paired are interleaved into a single file, and the leftover orphaned reads are concatenated into a second file.
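interleave-reads.py comes from the khmer package; conceptually it just alternates records from the two files. A rough shell equivalent for well-formed 4-line FASTQ records (illustration only, with made-up reads; requires bash for the process substitution):

```shell
# Create two toy one-record FASTQ files, as Trimmomatic would leave behind.
printf '@r1/1\nACGT\n+\nIIII\n' > s1_pe
printf '@r1/2\nTGCA\n+\nJJJJ\n' > s2_pe

# Fold each 4-line record onto one tab-separated line, pair the two streams
# side by side, then unfold the tabs back into newlines.
# Prints the @r1/1 record followed by the @r1/2 record (8 lines total).
paste <(paste - - - - < s1_pe) <(paste - - - - < s2_pe) | tr '\t' '\n'
```

The real script also validates that the read names actually pair up, which this one-liner does not.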

Both the paired reads and the broken reads are then quality filtered using **fastq_quality_filter**. The last step, extract-paired-reads.py, removes any pairs newly broken during the QC step; it produces a file of reads that are still paired (ending in .pe) and a file of newly orphaned reads (ending in .se).
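Since a FASTQ record is always four lines, read counts are easy to check at each stage; after extract-paired-reads.py the .pe output should contain an even number of reads. A small sketch with a hypothetical toy file:

```shell
# Count reads in a FASTQ file: total lines divided by 4.
count_reads () {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Toy file with one pair (two reads).
printf '@a/1\nAC\n+\nII\n@a/2\nGT\n+\nII\n' > toy.pe.fq
n=$(count_reads toy.pe.fq)
echo "$n reads"      # 2 reads
echo $(( n % 2 ))    # 0 -> a properly paired file has an even read count
```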


**NOTE:** Trimmomatic is capable of doing both of these steps itself, but it's not clear why the authors of this pipeline chose to use fastq_quality_filter instead.

Files after cleaning:

<code>
etc...
.pe.fq
.pe.qc.fq
.pe.qc.fq.pe  # Cleaned/Trimmed paired reads
.pe.qc.fq.se  # Cleaned/Trimmed broken paired reads
.se.fq
.se.qc.fq     # Broken paired reads that were Cleaned/Trimmed
</code>

Cleaning up after trimming and filtering

<code format="bash">
#!/bin/sh

FILES=`ls -1 *_R1.fastq | awk -F "." '{print $1}' | sed -e 's/...$//'`

for file in ${FILES}
do
    echo ${file}
    cat ${file}.pe.qc.fq.se ${file}.se.qc.fq > ${file}.combined.se.qc.fq

    rm ${file}.pe.fq
    rm ${file}.pe.qc.fq
    rm ${file}.se.fq

    echo "Gzip-ing ${file}_R1"
    gzip -c ${file}_R1.fastq > ${file}_R1.fq.gz
    echo "Gzip-ing ${file}_R2"
    gzip -c ${file}_R2.fastq > ${file}_R2.fq.gz

    mv ${file}.pe.qc.fq.pe ${file}.cleaned.pe.qc.fq
    rm ${file}.pe.qc.fq.se
    rm ${file}.se.qc.fq
done
</code>

The leftover files are removed, the QC'd paired-end file is renamed, the raw files are compressed, and the QC'd broken paired reads are combined into a single file.

Your files should now look like this:

<code>
.cleaned.pe.qc.fq   # Cleaned PE sequences
.combined.se.qc.fq  # Cleaned SE sequences
_R1.fastq           # Original R1 file
_R1.fq.gz           # Compressed R1 file
_R2.fastq           # Original R2 file
_R2.fq.gz           # Compressed R2 file
</code>

You should now be able to remove the original .fastq files, since we have the compressed .fq.gz copies.
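Before deleting the originals, it may be worth verifying the compressed copies; `gzip -t` tests archive integrity without writing anything to disk. A cautious sketch (the base name and toy file here are illustrative, standing in for each ${file} in the loop):

```shell
# Hypothetical base name; in the pipeline this would be each ${file}.
f=sample_R1
printf '@x\nAC\n+\nII\n' > ${f}.fastq         # stand-in for a real raw file
gzip -c ${f}.fastq > ${f}.fq.gz

# gzip -t exits 0 only if the archive is intact; delete the original only then.
if gzip -t "${f}.fq.gz" 2>/dev/null; then
    rm "${f}.fastq"
else
    echo "compressed copy of ${f} failed verification; keeping the original"
fi
```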

Digital Normalization

<code format="bash">
#!/bin/sh

FILES=`ls -1 *cleaned* | awk -F "." '{print $1}'`

for file in ${FILES}
do
    echo ${file}

    # Digi-Norm PE reads
    normalize-by-median.py \
        -p \
        -k 20 \
        -C 20 \
        -N 4 \
        -x 3e9 \
        --savehash ${file}.normC20k20.kh \
        ${file}.cleaned.pe.qc.fq

    # Digi-Norm SE reads
    normalize-by-median.py \
        -C 20 \
        -N 4 \
        -x 3e9 \
        --loadhash ${file}.normC20k20.kh \
        --savehash ${file}.normC20k20.kh \
        ${file}.combined.se.qc.fq

    # Trim off errors
    filter-abund.py \
        -V \
        ${file}.normC20k20.kh \
        ${file}.*.keep
done
</code>

The first step reads through our cleaned PE file and keeps a read only while its median k-mer coverage is below the cutoff (-C 20), discarding redundant high-coverage reads; it also saves a database of the observed k-mer counts in the .kh file. The second step does the same for the SE sequences, starting from the .kh database built in the previous step. Finally, filter-abund.py trims low-abundance k-mers, which are likely sequencing errors, from the kept reads.

<code>
.cleaned.pe.qc.fq                  # cleaned PE file from previous steps
.cleaned.pe.qc.fq.keep             # sequences that made it past digital normalization
.cleaned.pe.qc.fq.keep.abundfilt   # previous file with errors trimmed off
.combined.se.qc.fq                 # cleaned SE file from previous steps
.combined.se.qc.fq.keep            # same as above, but for SE sequences
.combined.se.qc.fq.keep.abundfilt  # same as above, but for SE sequences
.normC20k20.kh                     # temporary database of k-mer counts
_R1.fq.gz
_R2.fq.gz
</code>
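Digital normalization can discard a large fraction of reads; comparing record counts between an input file and its .keep file gives a quick sense of how much was removed. A sketch using toy files in place of the real ones:

```shell
# Toy stand-ins: three reads in, two reads kept by diginorm.
printf '@a\nAC\n+\nII\n@b\nAC\n+\nII\n@c\nGT\n+\nII\n' > toy.cleaned.pe.qc.fq
printf '@a\nAC\n+\nII\n@c\nGT\n+\nII\n' > toy.cleaned.pe.qc.fq.keep

# One FASTQ record = 4 lines, so reads = lines / 4.
before=$(( $(wc -l < toy.cleaned.pe.qc.fq) / 4 ))
after=$(( $(wc -l < toy.cleaned.pe.qc.fq.keep) / 4 ))
echo "kept ${after} of ${before} reads"    # kept 2 of 3 reads
```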

Cleaning up after digital normalization

<code format="bash">
#!/bin/sh

FILES=`ls -1 *_R1.fq.gz | awk -F "." '{print $1}' | sed -e 's/...$//'`

for file in ${FILES}
do
    echo ${file}

    echo "Break out orphaned reads"
    extract-paired-reads.py ${file}.cleaned.pe.qc.fq.keep.abundfilt

    echo "Combining SE reads"
    cat ${file}.cleaned.pe.qc.fq.keep.abundfilt.se \
        ${file}.combined.se.qc.fq.keep.abundfilt \
        > ${file}.se.qc.keep.abundfilt.fq

    echo "Rename Compress"
    mv ${file}.cleaned.pe.qc.fq.keep.abundfilt.pe \
        ${file}.pe.qc.keep.abundfilt.fq

    gzip -c ${file}.pe.qc.keep.abundfilt.fq \
        > ${file}.pe.qc.keep.abundfilt.fq.gz
    gzip -c ${file}.se.qc.keep.abundfilt.fq \
        > ${file}.se.qc.keep.abundfilt.fq.gz

    echo "Removing Last files"
    rm ${file}.cleaned.pe.qc.fq.keep.abundfilt.se
    rm ${file}.pe.qc.keep.abundfilt.fq
    rm ${file}.se.qc.keep.abundfilt.fq
    rm ${file}.*.kh
    rm ${file}.*.keep
    rm ${file}.*.abundfilt
done
</code>

First we separate out any pairs broken during abundance filtering, then combine all of the orphaned SE sequences into a single file. The digitally normalized files are then compressed, and any leftover intermediate files are removed. Files should now look like this:

<code>
.cleaned.pe.qc.fq            # Cleaned, Trimmed PE file from before
.combined.se.qc.fq           # Cleaned, Trimmed SE file from before
.pe.qc.keep.abundfilt.fq.gz  # Cleaned, Trimmed, Normalized & Compressed PE file
.se.qc.keep.abundfilt.fq.gz  # Cleaned, Trimmed, Normalized & Compressed SE file
_R1.fq.gz
_R2.fq.gz
</code>

Setup for Trinity

<code format="bash">
#!/bin/sh

for file in *.pe.qc.keep.abundfilt.fq.gz
do
    echo ${file}
    split-paired-reads.py ${file}
done

cat *.1 > left.fq
cat *.2 > right.fq

gunzip -c *.se.qc.keep.abundfilt.fq.gz >> left.fq
</code>

Break each PE file into left and right read files, then append the SE sequences to the left file. Now we're ready to run Trinity.
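split-paired-reads.py is the inverse of the earlier interleaving step. For a well-formed interleaved FASTQ file (eight lines per pair), a rough awk equivalent looks like this (toy input; the real khmer script also handles naming and edge cases):

```shell
# Toy interleaved file: one pair, read /1 followed by read /2.
printf '@a/1\nAC\n+\nII\n@a/2\nGT\n+\nII\n' > interleaved.fq

# Lines 1-4 of every 8-line group go to left.fq, lines 5-8 to right.fq.
awk '{ if (((NR - 1) % 8) < 4) print > "left.fq"; else print > "right.fq" }' interleaved.fq

head -1 left.fq    # @a/1
head -1 right.fq   # @a/2
```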

Running Trinity

<code format="bash">
#!/bin/sh

Trinity.pl \
    --left left.fq \
    --right right.fq \
    --seqType fq \
    --JM 20G \
    --CPU 6 \
    > trinity.log 2>&1
</code>

Here we tell Trinity that we are using FASTQ files (--seqType fq) and that our paired reads are in the left.fq and right.fq files. --JM 20G caps the memory used by the Jellyfish k-mer counting step at 20 GB, and --CPU 6 limits Trinity to at most 6 cores. Finally, all output and error messages are redirected to the trinity.log file.

Once Trinity is finished you should see a folder called **trinity_out_dir** (unless you gave Trinity a different output directory). Inside it is a whole slew of files, but the important one is **Trinity.fasta**, which contains all of the assembled contigs. There is also a **Trinity.timing** text file that shows how long each step took.
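Since each contig in Trinity.fasta is one FASTA record, a quick way to see how many contigs were assembled is to count the header lines. Shown here on a toy file; in practice you would point it at trinity_out_dir/Trinity.fasta:

```shell
# Each FASTA header starts with ">", so the header count = contig count.
printf '>contig1\nACGT\n>contig2\nTTGG\n' > Trinity.toy.fasta
grep -c '^>' Trinity.toy.fasta    # 2
```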