Quality Control for DNA and RNA Sequencing Data
Essential preprocessing workflows for quality control, trimming, and contamination removal of DNA and RNA sequencing data.
Description
MetaGEAR provides two quality control workflows for different types of sequencing data:
- qc_dna - Quality control for DNA metagenomic sequencing data
- qc_rna - Quality control for RNA sequencing (metatranscriptomic) data
Both workflows follow the same general process and accept identical parameters. The key difference is that qc_rna includes an additional rRNA removal step to filter out ribosomal RNA sequences, which is essential for metatranscriptomic analysis.
Common Processing Steps
Both workflows include:
- Quality assessment - Initial quality metrics and statistics
- Adapter trimming - Removal of sequencing adapters
- Quality trimming - Removal of low-quality bases
- Contamination removal - Removal of host and other contaminant sequences using Kneaddata
- Final quality assessment - Post-processing quality metrics
RNA-Specific Processing
The qc_rna workflow includes one additional step:
- rRNA removal - Removal of ribosomal RNA sequences (essential for metatranscriptomic data)
Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| --input | File path | Input .csv file for quality control (same format for both workflows) |
Global Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
| --outdir | No | results | Output directory |
| --help | No | - | Display help information |
| --debug | No | false | Enable debug mode |
| --config | No | - | Custom configuration file |
Syntax
# For DNA sequencing data
metagear qc_dna --input INPUT_FILE [GLOBAL_OPTIONS]
# For RNA sequencing data
metagear qc_rna --input INPUT_FILE [GLOBAL_OPTIONS]
Examples
Basic Usage
# Run DNA quality control with default settings
metagear qc_dna --input samples.csv
# Run RNA quality control with default settings
metagear qc_rna --input rna_samples.csv
# Run with custom output directory
metagear qc_dna --input samples.csv --outdir dna_qc_results
metagear qc_rna --input rna_samples.csv --outdir rna_qc_results
# Enable debug mode for troubleshooting
metagear qc_dna --input samples.csv --debug
metagear qc_rna --input rna_samples.csv --debug
Preview Mode
# Generate DNA QC script without executing
metagear qc_dna --input samples.csv --preview
# Generate RNA QC script without executing
metagear qc_rna --input rna_samples.csv --preview
This will create a metagear_qc_dna.sh or metagear_qc_rna.sh script that can be reviewed and executed manually.
Input Format
Both workflows use the same input format: a CSV file with a header row and one row per sample, containing the following columns:
sample,fastq_1,fastq_2
SAMPLE-01,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
SAMPLE-02,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
SAMPLE-03,/path/to/sample3_R1.fastq.gz,/path/to/sample3_R2.fastq.gz
Column Descriptions
- sample: Unique sample identifier
- fastq_1: Path to forward reads (R1) FASTQ file
- fastq_2: Path to reverse reads (R2) FASTQ file
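Because a wrong path in the samplesheet typically only surfaces once a run is under way, it can help to check the file first. The following is a minimal sketch, not part of MetaGEAR: `validate_samplesheet` is a hypothetical helper that assumes the three-column paired-end layout shown above and no commas inside file paths.

```shell
# Hypothetical helper (not a MetaGEAR command): check a samplesheet
# before passing it to --input. Assumes the documented header
# "sample,fastq_1,fastq_2" and paths that contain no commas.
validate_samplesheet() {
    local csv="$1"
    local rc=0 lineno=0 seen="," sample fq1 fq2 fq

    if [ ! -r "$csv" ]; then
        echo "ERROR: cannot read $csv" >&2
        return 1
    fi

    # The "|| [ -n ... ]" part keeps a last line without a trailing newline.
    while IFS=, read -r sample fq1 fq2 || [ -n "$sample" ]; do
        lineno=$((lineno + 1))
        if [ "$lineno" -eq 1 ]; then
            if [ "$sample,$fq1,$fq2" != "sample,fastq_1,fastq_2" ]; then
                echo "ERROR: unexpected header: $sample,$fq1,$fq2" >&2
                rc=1
            fi
            continue
        fi
        [ -z "$sample$fq1$fq2" ] && continue   # skip blank lines

        # Sample identifiers must be unique.
        case "$seen" in
            *",$sample,"*)
                echo "ERROR: line $lineno: duplicate sample '$sample'" >&2
                rc=1
                ;;
        esac
        seen="$seen$sample,"

        # Both FASTQ files must exist and be readable.
        for fq in "$fq1" "$fq2"; do
            if [ ! -r "$fq" ]; then
                echo "ERROR: line $lineno: cannot read '$fq'" >&2
                rc=1
            fi
        done
    done < "$csv"

    return "$rc"
}

# Example use:
#   validate_samplesheet samples.csv && metagear qc_dna --input samples.csv
```

The function reports every problem it finds rather than stopping at the first one, so a samplesheet can be fixed in a single pass.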
Output
Both workflows write the same directory structure under the output directory (--outdir, default results):
outdir/
├── fastqc/ # Raw and clean read quality reports
│ ├── {sample}_fastqc.html # Individual FastQC HTML reports (raw)
│ ├── {sample}_fastqc.zip # FastQC data files (raw)
│ ├── {sample}_clean_fastqc.html # Individual FastQC HTML reports (clean)
│ └── {sample}_clean_fastqc.zip # FastQC data files (clean)
├── trimgalore/ # Adapter and quality trimmed reads
│ ├── {sample}_1_val_1.fq.gz # Trimmed forward reads
│ ├── {sample}_2_val_2.fq.gz # Trimmed reverse reads
│ ├── {sample}_1.fastq.gz_trimming_report.txt # Trimming reports
│ └── {sample}_2.fastq.gz_trimming_report.txt
├── kneaddata/ # Host/contamination removal results
│ ├── {sample}_paired_1.fastq.gz # Final cleaned forward reads
│ ├── {sample}_paired_2.fastq.gz # Final cleaned reverse reads
│ ├── {sample}_kneaddata.log # Detailed processing log
│ └── {sample}_kneaddata_stats.csv # Read count statistics
├── multiqc/ # Consolidated quality control reports
│ ├── multiqc_report.html # Main QC summary report
│ ├── multiqc_data/ # Parsed statistics and data
│ └── multiqc_plots/ # Static plot images
└── pipeline_info/ # Pipeline execution metadata
    ├── execution_report.html # Nextflow execution report
    ├── execution_timeline.html # Processing timeline
    └── execution_trace.txt # Resource usage tracking
Key Output Files
Final cleaned reads for downstream analysis:
- kneaddata/{sample}_paired_1.fastq.gz - Host-decontaminated forward reads
- kneaddata/{sample}_paired_2.fastq.gz - Host-decontaminated reverse reads
Quality assessment reports:
- multiqc/multiqc_report.html - Comprehensive QC summary across all samples
- fastqc/{sample}_clean_fastqc.html - Post-processing quality metrics per sample
- kneaddata/{sample}_kneaddata_stats.csv - Read count statistics (raw, trimmed, decontaminated, final)
Processing logs and intermediate files:
- trimgalore/{sample}_*_trimming_report.txt - Adapter trimming statistics
- kneaddata/{sample}_kneaddata.log - Detailed contamination removal log
- trimgalore/{sample}_*_val_*.fq.gz - Quality and adapter trimmed reads (intermediate)
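To pass the cleaned reads on to a later step, the paired files under kneaddata/ can be collected into a new CSV. The sketch below is illustrative only: `make_clean_samplesheet` is a hypothetical helper, and it assumes the downstream tool accepts the same sample,fastq_1,fastq_2 layout as the QC input, which this page does not state.

```shell
# Hypothetical helper (not a MetaGEAR command): list the cleaned read
# pairs found in <outdir>/kneaddata as a sample,fastq_1,fastq_2 CSV.
make_clean_samplesheet() {
    local outdir="$1" r1 r2 sample

    echo "sample,fastq_1,fastq_2"
    for r1 in "$outdir"/kneaddata/*_paired_1.fastq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing

        # Derive the mate path and the sample name from the forward file.
        r2="${r1%_paired_1.fastq.gz}_paired_2.fastq.gz"
        sample="$(basename "$r1" _paired_1.fastq.gz)"

        if [ -e "$r2" ]; then
            echo "$sample,$r1,$r2"
        else
            echo "WARNING: no reverse-read file for $r1" >&2
        fi
    done
}

# Example use:
#   make_clean_samplesheet results > clean_samples.csv
```

Samples with a missing mate file are skipped with a warning on stderr, so the CSV written to stdout only ever contains complete pairs.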
RNA-Specific Considerations
For qc_rna workflows, the output files contain RNA-seq data with rRNA sequences removed. This additional processing step means:
- Processing time is typically longer due to the rRNA removal step
- Final read counts may be significantly lower due to rRNA filtering
- The cleaned reads are optimized for metatranscriptomic functional analysis
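To see how much the read count drops for a given sample, the raw and cleaned files can be compared directly. This is a rough sketch using hypothetical helpers (`count_reads`, `retained_percent`); it assumes gzip-compressed FASTQ with the usual four lines per record, and the file paths in the usage comment are placeholders.

```shell
# Hypothetical helpers (not MetaGEAR commands).

# Count reads in a gzip-compressed FASTQ file.
# Assumes four lines per record, so reads = lines / 4.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}

# Print the percentage of reads kept, given a raw and a cleaned file.
retained_percent() {
    local raw clean
    raw=$(count_reads "$1")
    clean=$(count_reads "$2")
    awk -v r="$raw" -v c="$clean" \
        'BEGIN { if (r > 0) printf "%.1f\n", 100 * c / r; else print "NA" }'
}

# Example use (placeholder paths):
#   retained_percent /path/to/sample1_R1.fastq.gz \
#       results/kneaddata/SAMPLE-01_paired_1.fastq.gz
```

The per-sample kneaddata_stats.csv and the MultiQC report remain the primary source for these numbers; a direct count like this is mainly useful as a quick cross-check.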
Prerequisites
Before running either workflow:
- Install databases: Run metagear download_databases first
- Prepare input file: Ensure all FASTQ file paths are correct and accessible
- Check disk space: Quality control can generate substantial intermediate files
- For RNA workflows: Ensure rRNA reference databases are available and properly configured
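For the disk space check, one simple starting point is to compare the total size of the input FASTQ files with the free space on the output filesystem. The helper below is a hypothetical sketch, not a MetaGEAR feature; how much headroom a run actually needs is not specified here, so the comparison is left to the reader.

```shell
# Hypothetical helper (not a MetaGEAR command): sum the sizes, in bytes,
# of all FASTQ files listed in a sample,fastq_1,fastq_2 samplesheet.
samplesheet_input_bytes() {
    local csv="$1" total=0 header sample fq1 fq2 fq size

    {
        read -r header || true      # skip the header row
        while IFS=, read -r sample fq1 fq2 || [ -n "$sample" ]; do
            for fq in "$fq1" "$fq2"; do
                [ -f "$fq" ] || continue    # ignore missing files
                size=$(wc -c < "$fq")
                total=$((total + size))
            done
        done
        echo "$total"
    } < "$csv"
}

# Example use: print input size and free space side by side.
#   echo "input bytes: $(samplesheet_input_bytes samples.csv)"
#   df -Pk .    # the "Available" column is in 1024-byte blocks
```

Since trimmed and decontaminated reads are kept alongside the inputs, the output directory can be expected to need space on the same order as the inputs, and possibly more.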
Notes
General Notes
- Both workflows automatically detect paired-end vs single-end reads
- Host contamination databases must be properly configured for your sample type
- Intermediate files are preserved for quality assessment and troubleshooting
- Processing time depends on input file sizes and available computational resources
DNA-Specific Notes (qc_dna)
- Optimized for metagenomic DNA sequencing data
- Focuses on host DNA contamination removal
- Suitable for taxonomic profiling and functional analysis
RNA-Specific Notes (qc_rna)
- Optimized for metatranscriptomic RNA sequencing data
- Includes rRNA removal which is crucial for downstream functional analysis
- Processing time is typically longer due to the additional rRNA removal step
- rRNA removal may significantly reduce final read counts (this is expected and beneficial)
- Host contamination removal may be less stringent for certain RNA sample types
Choosing Between qc_dna and qc_rna
- Use qc_dna for:
  - Metagenomic DNA sequencing data
  - Whole genome shotgun sequencing
  - Amplicon sequencing data
- Use qc_rna for:
  - Metatranscriptomic RNA sequencing data
  - Any RNA-seq data where rRNA removal is needed
  - Functional gene expression analysis