Sequence alignment
In this practice session, we go through the process of aligning sequencing data The goal of this analysis is to align our sequences on our reference genome to generate alignment tracks and visualize the transcription signal on the reference genome.
For simplicity purposes, we will be using specialized tools command line in this practice session, but all data manipulation steps can be performed or modified using a scripting language. The tools that will be needed here are:
For the analysis we will use data from Bacillus subtilis culture infected by its phage SPP1 RNA-seq libraries. The reads are subsampled to accelerate the quality check. The reference genome from this library is the YB886 strain from Bacillus subtilis and its phage SPP1.
All the data are available in the genomics_supbiotech_2023/ folder.
Trimming the adaptors
As we saw in the first session there were some adaptors sequences at the beginning of the reads. The solution is to trim the reads to remove these bases to avoid mapping issues as they were artificially added.
Looking at the fastqc report from the previous session how much base pairs will you remove ?
Answer:
Looking at the per base sequence report, we can remove 8bp to be sure to remove all the adapters sequences.
To trim the sequence we will use cutadapt software.
Looking at the help menu of cutadapt what command do you propose to trim your reads ?
Clue:
cutadapt --help
Show code:
mkdir -p results/cutadapt/
cutadapt -u 8 -U 8 -q 33,33 --trim-n -m 20 -o results/cutadapt/Bacillus_infection_T0_R1_trimmed.fq.gz -p results/cutadapt/Bacillus_infection_T0_R2_trimmed.fq.gz fastq/Bacillus_infection_T0_R1.fq.gz fastq/Bacillus_infection_T0_R2.fq.gz
cutadapt -u 8 -U 8 -q 33,33 --trim-n -m 20 -o results/cutadapt/Bacillus_infection_T10_R1_trimmed.fq.gz -p results/cutadapt/Bacillus_infection_T10_R2_trimmed.fq.gz fastq/Bacillus_infection_T10_R1.fq.gz fastq/Bacillus_infection_T10_R2.fq.gz
Once is done you can launch a fastqc and fastq-screen again to see what have
changed.
Show code:
mkdir -p results/fastqc results/fastq_screen
fastqc -o results/fastqc results/cutadapt/*fq.gz
fastq_screen --outdir results/fastq_screen --conf reference/fastq_screen.conf --force results/cutadapt/*fq.gz
You saw that we have removed the biases at the beginning of the reads and select the reads with the highest quality. The length is not uniform now as some reads have been more trimmed than others. Finally, we have resolved the mapping issue from the reverse reads.
Aligning the sequences
Now that we have cleaned and well-prepared our sequences we can start our analysis. The first step is to
align the reads on the reference genome. The most known algorithms for mapping are bowtie2,
bwa and minimap. There are also some others aligners more specific to RNA-seq such
as STAR or salmon. For bacteria where the genomes are small with no repeated
regions, there are not necessary.
In this practical we will use bowtie2
as bwa and minimap are not well suited for very short reads (<100bp).
The aligner algorithm needs to index the fasta. An index will be like a dictionnary where the sequence
present in the genome are sorted. It will allow to find the matching sequences very fast. To build the index
using bowtie2 you can use bowtie2-build command. To simplify the analysis we will
just look at the genome of the bacteria at T0.
Clue:
bowtie2-build --help
Show code:
mkdir -p reference/index
bowtie2-build reference/YB886.fa reference/index/YB886
Once the index is done we can use the bowtie2 command to map the reads. We will use the
--very-sensitive method. You can look at the documentation:
bowtie2 --help
What’s the seed ?
Answer:
As trying to map a nicely to all sequence will be slow for the aligner, they used a subset of the sequence
called seed to find first potential position, then they will extend that seed to find the best match. The
seed is ste up by the -L parameter in bowtie2. The bigger the seed is the faster the algorithm
is. Indeed it will have to extedn on less positions but it may missed the perfect position if it didn’t take
a perfect sequence as the seed.
What’s the
localalignment mode ? Is it really useful in our case ?
Answer:
Opposed to end-to-end mode the local mode is a mode allowing the edges of the
reads to not be mapped by the aligner. Is it usually advised to use it for RNA-seq as there are some
splicing which will introduce issues with local alignment. However, as it’s not occuring in bacteria, it is
not necessary for us to use it.
Launch the alignment using bowtie2 paired end mode:
Show code:
mkdir -p results/alignment
bowtie2 --very-sensitive --maxins 1000 -x reference/index/YB886 -1 results/cutadapt/Bacillus_infection_T0_R1_trimmed.fq.gz -2 results/cutadapt/Bacillus_infection_T0_R2_trimmed.fq.gz -S results/alignment/T0.sam
Bowtie2 will return some alignment metrics. You see that you have more than 95% of sequence
mapped on the genome.
Why it’s not 100% ?
Answer:
There are multiple answers to that question: - You got contaminated by others DNA (another organism, vectors, adapters…). We checked this one so it will be less likely. - There are some sequencing errors which introduce wrong sequence in your files, that you won’t be able to map. - There are mistakes in your reference genome or it’s uncomplete.
Try to map the reads of the second time point (T10). Do you have the same rate of mapping ? Why ?
Answer:
bowtie2 --very-sensitive --maxins 1000 -x reference/index/YB886 -1 results/cutadapt/Bacillus_infection_T10_R1_trimmed.fq.gz -2 results/cutadapt/Bacillus_infection_T10_R2_trimmed.fq.gz -S results/alignment/T10.sam
The sam format output
Let’s look at the output that we have. The grep -v '^@' command allows showing lines
that don’t have a @ at the beginning, in our case it allows to skip the header.
grep -v '^@' results/alignment/T0.sam | head -n 4
NS500446:543:HNM5TBGX7:1:11101:11655:1062 77 * 0 0 **0 0 GCAACAACATCATCTGCTANTTCTANTATTGATGANACCCTTA EEEEEEEEEEEEAAEAEEE#EEAEE#EEAEEEEEA#EEE6E/E YT:Z:UP
NS500446:543:HNM5TBGX7:1:11101:11655:1062 141 * 0 0 **0 0 CAGCTTCCTTTGCAANNTGCTGCTT EEEEEEEAEAEAEEE##EEEEE/EE YT:Z:UP
NS500446:543:HNM5TBGX7:1:11101:13459:1063 77 * 0 0 **0 0 GGTGAACTTCACCTTGATANCATTGNTGACCGTATNAAACGCG EEEEEEEEEEEEEEEEEEE#EEEEE#EEEEEEEEE#EEEEEEE YT:Z:UP
NS500446:543:HNM5TBGX7:1:11101:13459:1063 141 * 0 0 **0 0 AGCTGGGATGTATTCACNNGGAACGAC EEEEEEEEEEEEEEEEE##EEEEEEEE YT:Z:UP
NS500446:543:HNM5TBGX7:1:11101:21545:1063 83 manual_scaffold_1 1527086 42 43M = 1526863 -266 TATAATANAATTTTACCNTATTTNAGGAGGGAGTTGAATTTTT EEEEEAE#EEEEEEEEE#EEEAE#EEEEE/EEEEEEEEEEEEE AS:i:-3 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:7A9A5A19 YS:i:-1 YT:Z:CP
NS500446:543:HNM5TBGX7:1:11101:21545:1063 163 manual_scaffold_1 1526863 42 26M = 1527086 266 ACGTGAATAATATATACNTCCTAGAT EEEEEEEE/EEEEEEEE#EEAEEEEE AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:17T8 YS:i:-3 YT:Z:CP
NS500446:543:HNM5TBGX7:1:11101:5515:1064 99 manual_scaffold_1 3635839 42 23M = 3635918 106 GTTTTATTTTAAATTCTCCTCAA EEEEEEEEEEEEA/EEEEEEEEE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:23 YS:i:-1 YT:Z:CP
NS500446:543:HNM5TBGX7:1:11101:5515:1064 147 manual_scaffold_1 3635918 42 27M = 3635839 -106 CTTGGTGTGNATTCCTTGTCTTCTTTT EEEEEEEEE#EEEEEEEEE66EEEEEE AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:9A17 YS:i:0 YT:Z:CP
NS500446:543:HNM5TBGX7:1:11101:19648:1065 83 manual_scaffold_1 272545 42 43M = 272509 -79 AGCGCTTNGGCCACAGAAACCGGAAGAACCTCTCCGCAAAACA EEEEEEE#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:7T35 YS:i:-1 YT:Z:CP
NS500446:543:HNM5TBGX7:1:11101:19648:1065 163 manual_scaffold_1 272509 42 27M = 272545 79 GTATTAAAAATCTTCGCNTTCGGAAAA EEEEEEEEEEEEEEEEE#EEEEEEEEE AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:17T9 YS:i:-1 YT:Z:C
The file is a sam file, which is a tab separated file of 11 columns:
Sam file format
The column that interest us especially is the position of the reads on the reference, the flag, information about the mapping and the mapping quality, how much a sequence is unique in the reference.
To see what a flag means, we can use that: https://broadinstitute.github.io/picard/explain-flags.html
Using it try to explain the different following flags 77, 141, 83, 163, 99, 147.
The sam file is quite heavy, as for the fastq we are used to work with compressed files. We will at the
same time filter the mapping reads, sort them and index them using samtools. The
bam file is the compressed version of sam.
Bonus:
Use the following command:
samtools --help
Program: samtools (Tools for alignments in the SAM format)
Version: 1.3.1 (using htslib 1.3.1)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
You see that samtools is package containing several modules to work with sam,
fasta and others alignment files.
For example, there is a module for the flag explanation that we checked earlier online. But you can also do a lot of operation and statistics on your alignment files.
If you want to use one of this module you have to call it usingsamtools flags --help. In that
case you will have the documentation of the flags module.
# Sort and compressed the sam file.
samtools sort -n -O BAM results/alignment/T0.sam -o T0_tmp1.bam
# Keep only pairs with both reads mapped.
samtools fixmate --output-fmt bam T0_tmp1.bam T0_tmp2.bam
# Filter reads with low mapping quality.
samtools view --output-fmt bam -f 2 -q 30 -1 -b T0_tmp2.bam -o T0_tmp1.bam
# Sort the resulting reads.
samtools sort --output-fmt bam -l 9 T0_tmp1.bam -o results/alignment/T0_sorted.bam
# Index the reads
samtools index results/alignment/T0_sorted.bam
# Remove the sam files and the temporary files.
rm results/alignment/T0.sam T0_tmp1.bam T0_tmp2.bam
The rm function allow removing files through the command lines. Be careful with this one, it’s
permanent deletion.
Generating the transcription tracks
Once you align the reads you want to compute the coverage on your genome at each sequence of interest, genes
in our case. To do that we will use the bamCoverage function from deeptools.
mkdir -p results/tracks
bamCoverage --bam results/alignment/T0_sorted.bam --outFileName results/tracks/T0_unstranded.bw --binSize 1 --normalizeUsing CPM --extendReads --ignoreDuplicates
bamCoverage --bam results/alignment/T0_sorted.bam --outFileName results/tracks/T0_forward.bw --binSize 1 --normalizeUsing CPM --extendReads --filterRNAstrand forward --ignoreDuplicates
bamCoverage --bam results/alignment/T0_sorted.bam --outFileName results/tracks/T0_reverse.bw --binSize 1 --normalizeUsing CPM --extendReads --filterRNAstrand reverse --ignoreDuplicates
The output files are bigwig which are compress files with coverage of a genomic track.
What’s the
extendReadsoption ?
Answer:
As we have a pair file, the extendreads option will just use also the sequence between the
forward and reverse to compute the coverage.
What does CPM mean ? Why are we using that normalization ?
Answer:
CPM means Count Per Millions. The idea of that normalization is to get rid of the difference between the numbers of mapping reads in a library (Sequencing twice more will yield twice more coverage otherwise).
What is the point to filter the reads based on their strand in a RNA-seq and to generate two tracks ?
Answer:
We did a stranded library, so we can generate a forward and reverse tracks which are both biologically relevant. Filtering the strand allows to generate the two tracks separately.
Visualizing the alignment using IGV
Now let’s visualize our tracks using IGV. In IGV you can load the fasta genome and then the sorted
bam, bigwig or gff. The gff is the genome annotation file
with the genes position.
For this step, you will have to download IGV on your own personal computer. Here is the page where the installation file is available for each OS (Windows, Mac or Linux).
https://software.broadinstitute.org/software/igv/download
Once it’s downloaded you can open it. Load the bacterial YB886 genome from the reference folder
and some bw files that you have generated. You can also load the gene annotation gff
file.
Now we can play around and explore your data.
Is it uniform ?
Answer:
No. It’s a RNA-seq, not a genomic DNA coverage.
If you load the stranded files, Do you see antisens RNA (it means coverage in the opposite strand of the gene) ?
Answer:
There are some coverage, but it’s very low. If you do the statistics, you will see that only 2-3% of the reads are antisens in these samples. For some species, samples, there are a lot of antisens RNA but it’s generally not more than 10-20%.
Looking at the strands of the genes on the whole genome (look at the tracks on the forward and reverse strand), is it a uniform distribution or not ? Based on your knowledge on bacteria, why do you think it’s not uniform ?
Answer:
Mainly the first half of the genome the genes are in the forward strand and there are in reverse in the second one. The reason is the origin of replication of the bacteria is the beginning/end of the reference genome (the genome is circular so the beginning and the end of the reference is the same region). Evolution tends to select genes which are in the same direction as the replication of the DNA. So in the first half the replication goes in the forward way, and the genes too, and the contrary happens in the second half.