Microbial Profiles

Get microbial profiles with MetaPhlAn and HUMAnN

Comprehensive taxonomic and functional profiling of microbial communities using MetaPhlAn and HUMAnN.

Description

The microbial_profiles workflow performs comprehensive microbial community analysis using state-of-the-art tools. This workflow provides both taxonomic and functional profiling:

Taxonomic profiling - Species-level identification and abundance using MetaPhlAn
Functional profiling - Gene family and pathway analysis using HUMAnN

Parameters

Required Parameters

Parameter	Type	Description
`--input`	File path	Input .csv file for microbial profiles

Global Parameters

Parameter	Required	Default	Description
`--outdir`	No	`results`	Output directory
`--help`	No	-	Display help information
`--debug`	No	`false`	Enable debug mode
`--config`	No	-	Custom configuration file

Syntax

metagear microbial_profiles --input INPUT_FILE [GLOBAL_OPTIONS]

Examples

Basic Usage

# Run microbial profiling with default settings
metagear microbial_profiles --input samples.csv

# Run with custom output directory
metagear microbial_profiles --input samples.csv --outdir profiles_output

# Enable debug mode for troubleshooting
metagear microbial_profiles --input samples.csv --debug

Preview Mode

# Generate script without executing
metagear microbial_profiles --input samples.csv --preview

This will create a metagear_microbial_profiles.sh script that can be reviewed and executed manually.

Input Format

The input CSV file should contain sample information with the following columns:

sample,fastq_1,fastq_2
SAMPLE-01,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
SAMPLE-02,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
SAMPLE-03,/path/to/sample3_R1.fastq.gz,/path/to/sample3_R2.fastq.gz

Input Data Requirements

Quality-controlled reads: Input should be quality-controlled and contamination-removed reads
Sufficient depth: Adequate sequencing depth for reliable taxonomic and functional profiling
Paired-end preferred: While single-end reads are supported, paired-end provides better resolution

Output

The workflow generates comprehensive profiling results in the following directory structure:

outdir/
├── metaphlan/                   # MetaPhlAn taxonomic profiling results
│   ├── {sample}_microbial_profile.txt    # Individual taxonomic profiles
│   ├── {sample}.biom                     # BIOM format profiles
│   ├── {sample}.bowtie2out.txt          # Alignment files (optional)
│   └── merged_microbial_profiles.txt    # Combined abundance table across samples
├── humann/                      # HUMAnN functional profiling results
│   ├── {sample}/                        # Per-sample HUMAnN output directories
│   │   ├── {sample}_qc_genefamilies_cpm.tsv    # Gene family abundances (CPM normalized)
│   │   ├── {sample}_qc_pathabundance_cpm.tsv   # Pathway abundances (CPM normalized)
│   │   └── {sample}_qc_pathcoverage.tsv         # Pathway coverage
│   ├── gene_families_merged_profiles.tsv        # Merged gene family profiles
│   └── path_abundances_merged_profiles.tsv     # Merged pathway abundance profiles
├── multiqc/                     # Quality control and summary reports
│   ├── multiqc_report.html              # Consolidated analysis report
│   ├── multiqc_data/                    # Parsed statistics and data
│   └── multiqc_plots/                   # Static plot images
└── pipeline_info/               # Pipeline execution metadata
    ├── execution_report.html            # Nextflow execution report
    ├── execution_timeline.html          # Processing timeline
    └── execution_trace.txt              # Resource usage tracking

Key Output Files

Taxonomic profiling results:

metaphlan/merged_microbial_profiles.txt - Combined species abundance table across all samples
metaphlan/{sample}_microbial_profile.txt - Individual taxonomic profiles per sample
metaphlan/{sample}.biom - BIOM format profiles for downstream analysis tools

Functional profiling results:

humann/gene_families_merged_profiles.tsv - Combined gene family abundance matrix
humann/path_abundances_merged_profiles.tsv - Combined pathway abundance matrix
humann/{sample}/{sample}_qc_genefamilies_cpm.tsv - Per-sample gene family abundances (CPM normalized)
humann/{sample}/{sample}_qc_pathabundance_cpm.tsv - Per-sample pathway abundances (CPM normalized)
humann/{sample}/{sample}_qc_pathcoverage.tsv - Per-sample pathway coverage metrics

Analysis summary:

multiqc/multiqc_report.html - Comprehensive quality control and profiling summary report

Prerequisites

Before running this workflow:

Install databases: Run metagear download_databases first
Quality control: Run qc_dna (for DNA data) or qc_rna (for RNA data) on raw data first
Sufficient computational resources: Profiling can be computationally intensive
Adequate disk space: Profile generation requires substantial storage

Recommended Workflow Order

# 1. Download databases (run once)
metagear download_databases

# 2. Quality control
metagear qc_dna --input raw_samples.csv

# 3. Microbial profiling (using QC'd data)
metagear microbial_profiles --input qc_samples.csv

Analysis Features

Taxonomic Profiling (MetaPhlAn)

Species-level taxonomic assignment
Relative abundance estimation
Novel clade detection

Functional Profiling (HUMAnN)

Gene family abundance quantification
Metabolic pathway reconstruction
Pathway coverage and abundance
Species-stratified functional profiles

Performance Considerations

Memory requirements: HUMAnN requires substantial RAM (8-16GB recommended)
Processing time: Can take several hours for large datasets
Database size: MetaPhlAn and HUMAnN databases are large (>10GB)
I/O intensive: Frequent disk access during processing

Troubleshooting

Common issues and solutions:

Database not found: Ensure download_databases completed successfully
Memory errors: Increase available RAM or adjust configuration
Low profiling rate: May indicate poor quality input or low microbial content
Missing species: Some novel species may not be in reference databases

← Back to Workflows Overview