Gene Analysis
Gene centric analysis workflow
This advanced workflow performs comprehensive gene-centric analysis including gene prediction, functional annotation, pangenome reconstruction, and taxonomic classification.
Description
The gene_analysis workflow performs comprehensive gene-centric analysis of metagenomic data. This advanced workflow focuses on detailed functional annotation and analysis:
- Gene prediction - Identification of protein-coding genes
- Functional annotation - Assignment of functional categories and pathways
- Pangenome reconstruction - Metagenomic Species Pan-genome (MSP) reconstruction
- Taxonomic annotation - Assignment of taxonomy at MPS
Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
--input |
File path | Input .csv file for gene analysis |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--contig_catalog |
File path | - | Contigs catalog (if available) |
--metaphlan_profiles |
File path | - | MetaPhlan profiles (merged, if available) |
Global Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
--outdir |
No | results |
Output directory |
--help |
No | - | Display help information |
--debug |
No | false |
Enable debug mode |
--config |
No | - | Custom configuration file |
Syntax
metagear gene_analysis --input INPUT_FILE [--contig_catalog CATALOG_FILE] [GLOBAL_OPTIONS]
Examples
Basic Usage
# Run gene analysis with default settings
metagear gene_analysis --input samples.csv
# Run with custom output directory
metagear gene_analysis --input samples.csv --outdir gene_results
# Use pre-computed contigs catalog
metagear gene_analysis --input samples.csv --contig_catalog contigs_catalog.fasta
# Enable debug mode for troubleshooting
metagear gene_analysis --input samples.csv --debug
Preview Mode
# Generate script without executing
metagear gene_analysis --input samples.csv --preview
This will create a metagear_gene_analysis.sh script that can be reviewed and executed manually.
Input Format
The input CSV file should contain sample information with the following columns:
sample,fastq_1,fastq_2
SAMPLE-01,/path/to/sample1_R1.fastq.gz,/path/to/sample1_R2.fastq.gz
SAMPLE-02,/path/to/sample2_R1.fastq.gz,/path/to/sample2_R2.fastq.gz
SAMPLE-03,/path/to/sample3_R1.fastq.gz,/path/to/sample3_R2.fastq.gz
Optional Catalog Input
If you have a pre-computed contigs catalog (from a previous assembly), you can provide it with the --contig_catalog parameter:
metagear gene_analysis --input samples.csv --contig_catalog my_contigs_catalog.fasta
The catalog should be in FASTA format containing assembled contigs.
Output
The workflow generates comprehensive gene analysis results in the following directory structure:
outdir/
├── megahit/ # Assembly results (per sample)
│ ├── {sample}.contigs.fa # Assembled contigs
│ ├── {sample}.log # Assembly log files
│ └── {sample}.k{kmer}.fastg.gz # Assembly graph files
├── prodigal/ # Gene prediction results
│ ├── {sample}.fna.gz # Predicted nucleotide sequences
│ ├── {sample}.faa.gz # Predicted protein sequences
│ ├── {sample}.gff.gz # Gene annotation in GFF format
│ └── {sample}_all.txt.gz # Complete gene annotations
├── cdhit/ # Gene catalog construction
│ ├── merged_genes.nr_95_90.fa # Non-redundant gene catalog (95% similarity)
│ ├── merged_genes.nr_95_90.fa.clstr # Clustering information
│ └── protein_catalog.prot.faa # Protein catalog
├── coverm/ # Gene abundance quantification
│ ├── gene_abundance_count_merged.tsv # Raw read counts per gene
│ ├── gene_abundance_rpkm_merged.tsv # RPKM normalized abundances
│ ├── gene_abundance_tpm_merged.tsv # TPM normalized abundances
│ └── bwa_index/ # BWA alignment indices
├── interproscan/ # Functional annotation results
│ ├── protein_catalog.annotations.tsv # InterProScan annotations
│ └── protein_catalog.FG_IPS_Pfam.tsv # Functional group annotations
├── msp/ # Metagenomic Species Pan-genome (MSP) analysis
│ ├── msp_abundance.median.RPKM.txt # MSP abundance profiles
│ ├── all_msps.tsv # MSP definitions and gene content
│ ├── pangenome_sequences/ # MSP pangenome FASTA files
│ └── msp_metaphlan_LM.bestR2.txt # MSP-MetaPhlAn taxonomic mapping
├── gtdbtk/ # Taxonomic classification
│ ├── classify/ # GTDB-Tk classification results
│ ├── identify/ # Marker gene identification
│ └── align/ # Multiple sequence alignments
├── multiqc/ # Quality control and summary reports
│ ├── multiqc_report.html # Consolidated analysis report
│ ├── multiqc_data/ # Parsed statistics and data
│ └── multiqc_plots/ # Static plot images
└── pipeline_info/ # Pipeline execution metadata
├── execution_report.html # Nextflow execution report
├── execution_timeline.html # Processing timeline
└── execution_trace.txt # Resource usage tracking
Key Output Files
Gene catalog and abundance:
cdhit/merged_genes.nr_95_90.fa- Non-redundant gene catalog for all samplesabundance/gene_abundance_rpkm_merged.tsv- Gene abundance matrix (RPKM normalized)abundance/gene_abundance_count_merged.tsv- Raw read count matrix per gene
Functional annotation:
interproscan/protein_catalog.annotations.tsv- Comprehensive functional annotations (InterProScan)interproscan/protein_catalog.FG_IPS_Pfam.tsv- Functional group classifications
MSP (Metagenomic Species Pan-genome) analysis:
mspminer/msp_abundance.median.RPKM.txt- MSP abundance profiles across samplesmspminer/all_msps.tsv- MSP definitions with gene membershipmspminer/pangenome_sequences/{msp_id}.pangenome.fasta- Individual MSP pangenome sequencesmspminer/msp_metaphlan_LM.bestR2.txt- Best taxonomic assignments for MSPs
Taxonomic classification:
gtdbtk/classify/{sample}.bac120.summary.tsv- Bacterial taxonomic assignmentsgtdbtk/classify/{sample}.ar53.summary.tsv- Archaeal taxonomic assignments
Assembly and gene prediction (per sample):
megahit/{sample}.contigs.fa- Assembled contigs per sampleprodigal/{sample}.faa.gz- Predicted protein sequences per sample
Prerequisites
Before running this workflow:
- Install databases: Run
metagear download_databasesfirst - Quality control: Run
qc_dna(for DNA data) orqc_rna(for RNA data) on raw data first - Sufficient resources: Gene analysis is computationally intensive
- Large disk space: Assembly and annotation require substantial storage
Recommended Workflow Order
# 1. Download databases (run once)
metagear download_databases
# 2. Quality control
metagear qc_dna --input raw_samples.csv
# 3. Optional: Run microbial profiling first
metagear microbial_profiles --input qc_samples.csv
# 4. Gene-centric analysis
metagear gene_analysis --input qc_samples.csv
Troubleshooting
Common issues and solutions:
- Assembly failure: Check input data quality and quantity
- Memory errors: Increase available RAM or use catalog mode
- Annotation database errors: Ensure all databases are properly downloaded
- Low gene density: May indicate poor assembly quality or unusual sample type
- Long runtime: Consider using a pre-computed catalog or increasing CPU cores