
roary-pangenome

Compute the bacterial pan-genome from Prokka/Bakta GFF3 annotations with Roary's CD-HIT + BLAST + MCL clustering pipeline. Builds gene presence/absence matrices, core/soft-core/shell/cloud partitions, multi-FASTA core gene alignments (with `-e`), and a pan-genome reference. Use Panaroo for higher-accuracy pan-genomes from highly fragmented assemblies, PIRATE for paralog-aware clustering, or PPanGGOLiN for graph-based partitioning.
jaechang-hits/SciAgent-Skills · ★ 193 · AI & Automation · score 79
Install: claude install-skill jaechang-hits/SciAgent-Skills
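
The core/soft-core/shell/cloud split named in the description comes down to thresholding how many genomes carry each gene family. The following is a minimal Python sketch of that idea, not Roary's own code. It uses the cut-offs given in the Overview (99 %, 95 %, 15 %). The function names, the gene names in the usage note, and the choice to treat each lower bound as inclusive are assumptions made for illustration.

```python
from collections import Counter

# Cut-offs as percentages of genomes, taken from the Overview text.
# Treating each lower bound as inclusive is an assumption of this sketch.
CORE_PCT, SOFT_CORE_PCT, SHELL_PCT = 99, 95, 15


def classify_gene(n_present: int, n_genomes: int) -> str:
    """Return the partition for a gene family found in n_present of n_genomes."""
    if n_genomes <= 0 or not 0 <= n_present <= n_genomes:
        raise ValueError("need n_genomes > 0 and 0 <= n_present <= n_genomes")
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    scaled = n_present * 100
    if scaled >= CORE_PCT * n_genomes:
        return "core"
    if scaled >= SOFT_CORE_PCT * n_genomes:
        return "soft-core"
    if scaled >= SHELL_PCT * n_genomes:
        return "shell"
    return "cloud"


def partition_sizes(counts: dict[str, int], n_genomes: int) -> Counter:
    """Count how many gene families fall into each partition."""
    return Counter(classify_gene(n, n_genomes) for n in counts.values())
```

For example, with hypothetical per-gene genome counts `{"geneA": 100, "geneB": 97, "geneC": 40, "geneD": 3}` across 100 genomes, `partition_sizes` assigns one family to each of the four partitions.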
# Roary Pan-Genome Pipeline

## Overview

Roary is a high-throughput pan-genome pipeline for prokaryotes that takes per-sample GFF3 annotations (typically from Prokka or Bakta) and produces a clustered gene presence/absence matrix across the entire input set. It first reduces redundancy with iterative CD-HIT clustering, then runs an all-vs-all BLASTP on the reduced sequence set, and finally applies MCL graph clustering to define orthologous gene families. The output partitions the gene space by the fraction of genomes carrying each gene: core (≥ 99 %), soft-core (95–99 %), shell (15–95 %), and cloud (< 15 %). Optionally, it also builds a concatenated core-gene alignment suitable for phylogenetic inference.

## When to Use

- Computing a pan-genome from a set of bacterial isolate annotations (10–10,000 genomes)
- Producing a `gene_presence_absence.csv` matrix for downstream GWAS, accessory-gene mining, or core-gene phylogenetics
- Building a concatenated core-gene multi-FASTA alignment for ML/Bayesian phylogenetic trees
- Generating a pan-genome reference FASTA to use as a non-redundant gene catalog
- Comparative genomics across closely related strains where >95 % nucleotide identity is expected
- Use **Panaroo** instead when assemblies are highly fragmented or annotations are noisy (Panaroo aggressively cleans annotation errors)
- Use **PIRATE** instead when paralog-aware clustering with multiple identity thresholds is needed
- Use **PPanGGOLiN** instead when graph-based, statistically grounded gene-family partitioni