11 Sequence QC
| Learning Objectives: |
| Perform standard quality control analyses on sequence data |
11.1 Quality control of sequence data
In this section we’re going to work on doing some standard quality checks for sequencing data. There are three basic topics to cover here:
- Collecting basic summaries
- Trimming
- Contamination detection
The only one that is absolutely required across all data types and experiments is the collection of basic summaries. Your sequencing center may be lovely and your lab scientists ultra-competent, but you can’t simply trust that you’ve got the data you expect; you have to actually look at it. There is no worse feeling than troubleshooting a pipeline step-by-step backward from the final products, only to discover that the problem is the data, and that if you had just taken a few minutes to scrutinize it before you got started, the last many hours (days? weeks?) of your work would not have been wasted. Not every problem will be apparent with standard QC, but many common ones will be. Once you look at summary data, you can decide whether you can proceed or whether further action is required.
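Even before running any QC software, it only takes a moment to peek at a fastq file and count its reads from the command line. Here is a quick sketch using a tiny fabricated three-read file so the commands are concrete (the real files and names will of course differ):

```shell
# Make a tiny three-read fastq to demonstrate (4 lines per record)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' \
    | gzip > toy.fastq.gz

# Peek at the first record
zcat toy.fastq.gz | head -n 4

# Count reads: total line count divided by 4
zcat toy.fastq.gz | wc -l | awk '{print $1/4}'   # prints 3
```

A glance like this immediately catches empty files, truncated downloads, and files that are not what their names claim.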
Unlike previous chapters, these sections contain dummy code that does not refer to real files on Xanadu. The exercise for this chapter will ask you to run this software and examine the output.
11.1.1 Collecting basic summaries
11.1.1.1 Illumina data
We typically use fastqc to characterize Illumina data. It’s been around a long time, and it does a fine job of covering the summaries we want to look at for most applications. It flags statistics when they are worrisome, but this flagging is not always useful, because it is tuned for whole-genome sequencing (WGS) data. If you have a sequencing library with different characteristics (e.g. RNA-seq), fastqc will often flag it as problematic even though it may not be.
Paired-end sequence is typically delivered in two separate files, which we analyze separately with fastqc.
The usage for fastqc is very simple:
module load fastqc/0.11.7
fastqc -o output_dir input_dir/mysequences.fastq.gz
You can also use a glob pattern, and fastqc will churn through all the files matching it:
fastqc -o output_dir input_dir/*.fastq.gz
The main output from fastqc is an HTML file which is viewed in a web browser. You can also aggregate many of these HTML files into a single report using multiqc (see below).
For an explanation of fastqc output, please watch this video created by the developers, look at their example reports here and read the brief descriptions of the analysis modules here.
11.1.1.2 Oxford Nanopore Technologies data
ONT data straight from the sequencer is a bit more complex than Illumina data. The raw signal data is often delivered alongside fastq files (or sometimes BAM alignment files - more on that later). Furthermore, the data are often chunked up into many smaller files that you may need to concatenate together for further analysis.
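Because gzip files can be concatenated directly, combining chunked fastq output into a single file is a one-liner. A minimal sketch (the directory name fastq_pass is typical of ONT output but is an assumption here, and the chunk files are fabricated for demonstration):

```shell
# Simulate two chunked fastq.gz files like those a basecaller produces
mkdir -p fastq_pass
printf '@r1\nACGT\n+\nIIII\n' | gzip > fastq_pass/chunk_0.fastq.gz
printf '@r2\nTTTT\n+\nIIII\n' | gzip > fastq_pass/chunk_1.fastq.gz

# Concatenated gzip streams are themselves a valid gzip stream,
# so plain cat is all that's needed
cat fastq_pass/*.fastq.gz > mysample.fastq.gz

zcat mysample.fastq.gz | wc -l   # 8 lines: two 4-line records
```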
The signal data is delivered in fast5 (deprecated) or pod5 format. You will need the signal data if you want to re-basecall, or to do modified basecalling (i.e., identify methylated bases). We won’t cover basecalling this semester.
When basecalling is complete, a file usually named sequencing_summary.txt is produced. This file can be used to quickly generate summaries of the quality and length distribution of the reads. The same information can be generated directly from the reads themselves, but doing so is slower. While there can’t be much length variation in Illumina data (even after quality trimming, every read will be <= 300bp), for Nanopore data the length distribution can be very wide and highly variable among runs, so it is a key variable for understanding your data.
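Because sequencing_summary.txt is just a tab-separated table, you can pull quick length statistics from it with awk. In real files the length column is usually named sequence_length_template, but treat that as an assumption and check your own header; the three-read file below is fabricated for illustration:

```shell
# Fabricate a minimal sequencing_summary.txt; real files have many more columns
printf 'read_id\tsequence_length_template\tmean_qscore_template\n' >  sequencing_summary.txt
printf 'r1\t1000\t12.1\nr2\t5000\t10.4\nr3\t300\t8.9\n'            >> sequencing_summary.txt

# Find the length column by name, then report read count, mean, and max length
awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) if ($i=="sequence_length_template") c=i; next }
            { n++; sum+=$c; if ($c>max) max=$c }
            END { printf "reads=%d mean=%.0f max=%d\n", n, sum/n, max }' sequencing_summary.txt
# prints: reads=3 mean=2100 max=5000
```

This is no substitute for NanoPlot or pycoQC, but it is handy for a first look at a run.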
Worth noting is that after basecalling a report.html file is also generated. This is worth looking at.
If you download Nanopore sequence data from a public archive, the raw signal data, sequencing_summary.txt, and report.html are unlikely to have been archived alongside the basecalls.
For Nanopore data, we typically use either pycoQC or NanoPlot to summarize because they emphasize length variation and show its relationship to sequence quality.
To run NanoPlot on fastq files:
module load NanoPlot/1.41.6
NanoPlot --fastq input_dir/mysample.fastq.gz -o output_dir -p mysample -t 4
Or on sequencing_summary.txt:
NanoPlot --summary input_dir/sequencing_summary.txt -o output_dir -p mysample -t 4
Here we use -t 4 to specify 4 CPUs. Remember:
- Most software that can use multiple CPUs needs to be told how many CPUs it can use.
- When you tell a piece of software it can use n CPUs, you need to request n CPUs (or possibly more, if you are piping commands together) from SLURM with --cpus-per-task n in your header or on the command line.
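To make the second point concrete, a minimal SLURM batch script wrapping the NanoPlot command might look like the sketch below. This is a config fragment, not a tested script: the memory request and output filenames are placeholder choices, and the input/output paths are the same dummy names used above.

```shell
#!/bin/bash
#SBATCH --job-name=nanoplot_qc
#SBATCH --cpus-per-task=4      # must cover the -t 4 given to NanoPlot
#SBATCH --mem=8G               # placeholder; size this to your data
#SBATCH -o %x_%j.out           # %x = job name, %j = job ID
#SBATCH -e %x_%j.err

module load NanoPlot/1.41.6
NanoPlot --fastq input_dir/mysample.fastq.gz -o output_dir -p mysample -t 4
```

If the SBATCH request is smaller than the thread count you pass to the program, the job will be confined to fewer CPUs than the software tries to use.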
pycoQC can only accept the summary file like this:
module load pycoQC/2.5.2
pycoQC -f sequencing_summary.txt -o pycoQC_output.html
Both NanoPlot and pycoQC produce HTML files as their primary outputs (but also some useful text summaries). Both of these programs can produce additional summaries if you provide an alignment file (we have not covered alignment yet!).
For NanoPlot, please read through the author’s gallery of example plots. There is not a ton of documentation for pycoQC’s output, so you’ll have to puzzle it out yourself, but it’s fairly self-explanatory.
ONT’s technologies have been changing more rapidly than other major platforms over the last several years. On both the instrument side (nanopores, chemistry) and the processing side (basecalling from raw signal) there have been substantial improvements in recent years. Because of this it is very important to keep track of the following:
- The instrument
- Nanopore version
- Sequencing kit
- Basecalling software version
- Basecalling model
Some downstream applications will want this information, as the implied error profiles can be different. Unfortunately this information is not always provided by researchers who upload data to public repositories.
11.1.1.3 PacBio
PacBio HiFi data is nice and simple. It’s not paired-end, and since we typically only look at the final base calls rather than scrutinizing the raw “subreads” that went into them, we are usually concerned with just a single fastq file.
PacBio sequencers produce an XML formatted file containing similar data to the ONT sequencing_summary.txt file. PacBio has proprietary software (SMRT LINK) that can read this file and produce a summary. Like the helper files from ONT, though, these are not typically available in public databases so we recommend just using NanoPlot on the fastq file, as above.
11.1.2 Trimming
Trimming is something we typically do only with Illumina data. There are three aspects to it:
- Removing adapter contamination (“read-through” contamination is the most common type by far).
- Trimming low quality ends from sequences.
- Removing sequences that steps 1 and 2 have trimmed too short to be useful.
As you will note in fastqc outputs, it is not uncommon to get adapter read-through in Illumina data. It’s good to remove this. In many applications, however, a small amount can be tolerated. Reference mapping programs mostly do “local” rather than “global” alignment, so they are capable of leaving some bases at the end of a sequence unmapped if they don’t seem to fit.
To trim adapter contamination, software needs to know what adapters to look for. For some software you will need to provide a fasta file with the specific adapters used in your study. Other software comes pre-loaded with common adapter sequences and you only need to provide them if you’re using something unusual.
We usually trim low quality ends from sequences because quality often declines towards the 3’ end on the Illumina platform. The longer the read, the worse the decline. For most common applications of Illumina data, however, we don’t need to be extremely stringent with the quality trimming. The Phred-scaled base qualities capture the uncertainty in the base calls reasonably well and are used by many downstream applications. For this reason, unless we have a specific need to tightly control potential base calling errors, we just try to trim out the worst bits.
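To make the Phred scale concrete: a base quality Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1% chance the call is wrong and Q30 means 0.1%. A quick check with awk:

```shell
# Error probability for a few Phred scores: P = 10^(-Q/10)
awk 'BEGIN { for (q=10; q<=40; q+=10) printf "Q%d\t%g\n", q, 10^(-q/10) }'
# Q10  0.1
# Q20  0.01
# Q30  0.001
# Q40  0.0001
```

This is why a sliding-window threshold around Q20-Q25 "trims out the worst bits" without being overly aggressive.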
Finally, if after trimming adapters and low quality sequence, we wind up with very short sequences (perhaps < 50bp), we typically remove them because they are of limited utility. This often leaves “orphaned” mate sequences that are long enough and high enough quality to be retained, which are directed to a separate file. Since there are usually not that many of these, we often ignore them downstream, as they introduce more complexity in handling the data for not much benefit.
Here we’ll use Trimmomatic, a venerable and widely used piece of software. It requires that you know what adapter sequences were used to generate the library and provide them. When that information is not available (sometimes published data do not state it), FastQC can usually infer it for you, provided there is detectable contamination from a commonly used adapter sequence. Trimmomatic comes packaged with fasta files for the most common adapters, so this is not really a large barrier.
There are many other similar useful tools for trimming, such as fastp, sickle, cutadapt, TrimGalore, and BBTools.
For bioinformatics workflows, we do not typically modify our raw data files in any way (rather, we set permissions to read-only). This means that some steps, like trimming, more or less double the size of our data during the course of analysis. Once you’re fairly certain you’re done with intermediate analysis products like trimmed reads, it’s a good idea to go back and remove them to save storage space.
To run Trimmomatic:
module load Trimmomatic/0.39
ADAPTERS=/isg/shared/apps/Trimmomatic/0.39/adapters/NexteraPE-PE.fa
java -jar ${Trimmomatic} PE -threads 4 \
indir/mysample_R1.fastq.gz \
indir/mysample_R2.fastq.gz \
trimdir/mysample_trimmed_R1.fastq.gz trimdir/mysample_trimmed_orphans_R1.fastq.gz \
trimdir/mysample_trimmed_R2.fastq.gz trimdir/mysample_trimmed_orphans_R2.fastq.gz \
ILLUMINACLIP:"${ADAPTERS}":2:30:10 \
SLIDINGWINDOW:4:25 MINLEN:45
A couple of notes:
- A bit of BASH usage: \ at the end of each line tells the bash interpreter to ignore the line break. All 7 lines above are treated as a single continuous line. The \ simply allows more clarity with long command lines.
- Java programs are often distributed as JAR files. java -jar myprogram.jar <options> is the command for executing a JAR file. In the module system we usually have the module create an environment variable that contains the path to the JAR file. Try loading Trimmomatic and doing echo $Trimmomatic now. You can see how it is set by doing module display Trimmomatic/0.39 and looking at the setenv line.
- Because it’s brief, you might overlook the PE in there, but it’s telling Trimmomatic you’re giving it paired-end data. If you had single-end data it would be SE.
- ILLUMINACLIP provides the parameters needed to search for adapters. Read through the documentation to understand it better.
- SLIDINGWINDOW gives parameters for sliding-window quality trimming (if the mean quality of any 4-base window drops below Q25, the rest of the read is trimmed).
- MINLEN should be self-explanatory.
We may do other quality filtering for ONT and PacBio data, depending on application, but that is typically length and/or mean quality filtering, which can be easily accomplished using seqkit, which we demonstrated previously, or taxonomy-based filtering, which we will cover in the next section. For more in-depth filtering of long reads by quality and length you can also try filtlong, but we won’t cover it here.
11.1.3 Detecting mixtures and contamination
Sometimes our sequence data is a mixture of DNA or RNA from different organisms. This may be a result of biology: for some organisms, tissues, or data types, we may be unable to completely separate our focal organism’s DNA or RNA from that of one or more other organisms. In these cases, the mixtures usually involve taxonomically distant organisms.
Mixtures may also be a result of inadvertent contamination in the field or laboratory. Depending on the context, mixtures may be of distant or closely related organisms.
There are also applications where you deliberately sequence mixtures, such as eDNA sequencing.
When mixtures or contamination are possible and unwanted, there are QC steps we can take to detect it. If it is detected, what we decide to do will depend on our downstream application. We may want to remove unwanted reads at a very early stage, or we may want to retain them and manage their influence on our analysis later.
In this section we’ll talk about how to use metagenomic classification software to try to identify the extent of mixing of more distantly related species. For cases of cross-contamination, there is a wide variety of software for different use cases. CalculateContamination and verifyBamId are a couple of examples.
Metagenomic classification essentially takes “query” sequences of interest and maps them to a database of “target” sequences. When matches are found, they are reported. In the software we use, both detailed match information for each sequence (which can be quite a large file) and summary information about the taxonomic distribution of matches is produced. The key element in all classification analyses is the database. A match can only be identified if it is present in the database. There are publicly available databases for the software we recommend, but they won’t apply to all use cases, so users can also build their own composed of potential contaminants.
For Illumina short-read data we recommend using kraken2. Databases can be obtained here, but the standard database is already available on the Xanadu cluster here: /isg/shared/databases/kraken/v06.2025/standard. This database contains bacterial, viral, archaeal and human sequences. There are several other databases in the parent directory.
module load kraken/2.1.2
kraken2 -db /isg/shared/databases/kraken/v06.2025/standard \
--paired indir/mysample_R1.fq.gz indir/mysample_R2.fq.gz \
--use-names \
--threads 16 \
--output outdir/kraken_general.out \
--unclassified-out outdir/unclassified#.fastq \
--classified-out outdir/classified#.fastq \
--report outdir/kraken_report.txt \
--use-mpa-style
In this case the output file format unclassified#.fastq will cause kraken to produce R1 and R2 fastq files without having to specify them both by name. kraken_general.out will give the classification status of ALL reads (or read pairs in this case). kraken_report.txt will give a taxonomic summary.
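An mpa-style report is just tab-separated lines of pipe-delimited taxonomy strings and read counts, so it is easy to slice with standard tools. A sketch with fabricated report contents (your actual taxa and counts will differ):

```shell
# Fabricated mpa-style report: taxonomy string, tab, read count.
# Nested levels are joined with "|", e.g. d__Bacteria|p__Proteobacteria.
printf 'd__Bacteria\t900\nd__Bacteria|p__Proteobacteria\t600\nd__Viruses\t50\n' > kraken_report.txt

# Keep only top-level (domain) lines, i.e. those with no "|" in the taxonomy
awk -F'\t' '$1 !~ /\|/ { print $1, $2 }' kraken_report.txt
# d__Bacteria 900
# d__Viruses 50
```

A glance at the domain-level totals is often all you need to decide whether contamination is worth chasing further.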
We recommend centrifuge for long reads from PacBio and ONT. You can run it like this:
module load centrifuge/1.0.4
centrifuge -q \
-x /isg/shared/databases/centrifuge/b+a+v+h/p_compressed+h+v \
--report-file outdir/report.tsv \
--quiet \
--min-hitlen 50 \
-U indir/mysequences.fastq.gz >outdir/centrifuge_results.txt
Note that centrifuge uses a different database format, but the database is still composed of the same sequence set. It is available at the link above, but also on Xanadu at the path given in the example command. This will run with unpaired reads (-U) and require database hits to be at least 50bp.
The stdout captured by outdir/centrifuge_results.txt will give the classification of every read. outdir/report.tsv will give the taxonomic summary.
11.2 MultiQC
MultiQC is a fantastic tool for aggregating quality reports across samples and pieces of software. It can read in reports from most things we’ll use and digest them into a readable format. This is particularly useful when you have many samples. You can imagine that checking fastqc html files individually for 20 or 30 (or 800) samples can range from tedious to virtually impossible. MultiQC gives you a quick overview of everything that can key you into problems when they exist.
MultiQC has very simple usage:
module load MultiQC/1.15
multiqc -f -o multiqc_out input_dir
-o specifies an output directory. -f means “force-overwrite” if the directory already exists and contains results (handy when re-running or testing scripts). MultiQC will search input_dir and its subdirectories for summary files it recognizes.