Supplementary data repository for Vizueta et al. Cell 2025

This repository contains the extended dataset accompanying the manuscript "Adaptive radiation and social evolution of the ants" in Cell.

# Detailed contents #

The GAGA-IDs used in the following directories and the corresponding species names are specified in the table "GAGA_species_list.txt".

01) Folder containing the genome sequences and annotations for each genome assembly. The following files can be found in the compressed file for each of the species, e.g. GAGA-0177 Lasius niger:
    - Genome assembly fasta file, with repeats in lower case
        GAGA-0177_chromosome_gapCloser_nextpolish_final_dupsrm_filt.softMasked.fasta.gz
    - Repeat annotation GFF file used before gene annotation
        GAGA-0177_chromosome_gapCloser_nextpolish_final_dupsrm_filt.repeats.gff.gz
    - Gene annotation GFF3 file, and the corresponding predicted cds and protein sequences in fasta
        GAGA-0177_final_annotation_repfilt_addreannot.cds.fasta.gz
        GAGA-0177_final_annotation_repfilt_addreannot_v3fixed.gff3.gz
        GAGA-0177_final_annotation_repfilt_addreannot.pep.fasta.gz
    - Gene annotation GFF3 containing a single isoform as representative* per gene, and the cds and protein fasta files
        GAGA-0177_final_annotation_repfilt_addreannot_noparpse_representative.cds.fasta.gz
        GAGA-0177_final_annotation_repfilt_addreannot_noparpse_representative_v3fixed.gff3.gz
        GAGA-0177_final_annotation_repfilt_addreannot_noparpse_representative.pep.fasta.gz
    * One representative isoform per gene was selected based on the following priority criteria: First, re-annotated gene models were preferred over other existing isoforms for a gene; second, homology-based predictions took precedence over RNA-seq based or de novo annotations; third, RNA-seq models were preferred over de novo based genes, which were only selected as representative when a gene was identified solely via de novo annotation.
      If a gene had several isoforms from one source, we kept the longest isoform as the representative one. Partial or pseudogene gene family copies (e.g. ORs and GRs) were excluded from this file.

02) Functional annotations containing a summary table combining all evidence (BLAST, eggNOG and InterProScan), as well as the GO and KEGG associated terms. Each folder within the compressed file contains the annotation files for the representative isoform in each genome assembly.

03) Whole genome alignments (WGA) - Cactus and MultiZ.
    - Ants_Cactus_aln_v2.hal: Cactus WGA
    - Cactus_commands.sh: Bash commands used to generate the Cactus WGA
    - Cactus_guide_tree.nwk: Guide tree used in Cactus alignment
    - Ants_Multiz_new.maf.gz: MultiZ WGA

04) Orthology files. Note that the GAGA-IDs used and the corresponding species names are specified in the table "GAGA_species_list.txt" located in the main directory.

    04a) Orthology assessment across the 163 ants covered in our analyses.
        - Orthogroup_table_N0.tsv.gz: Table containing the orthogroup ID, and the sequences (protein IDs) included in each species.
        - Orthogroup_table_N0_genecount.tsv.gz: Table with the orthogroup and the number of sequences in each species
        - Orthogroups_functional_annotation_summary.tsv.gz: Table with the functional annotations for each orthogroup based on consensus BLAST, eggNOG and InterPro hits across all protein sequences included in the orthologous group
        - Supp_table_all_ortholog_funct_annots_expr.xlsx: Excel file with the above functional annotation, as well as the best reciprocal hit in the fruit fly, and caste differential expression patterns across ants, and developmental transcriptomes in M. pharaonis and A.
          echinatior (see 07_Caste_gene_expression below)
        - Orthogroups_functional_annotation_GO.annot.gz and Orthogroups_functional_annotation_KEGG.annot.gz: Annotation tables with GO and KEGG terms for each orthogroup
        - All_orthogroups_sequences.tar.gz: Compressed folder containing all protein and nucleotide (cds) sequences for each orthogroup

    04b) Orthology assessment across all ants and the eight Apoidea outgroups. The tables include the orthogroup IDs and the sequences (protein IDs) in each species, the gene count per orthogroup, and the functional annotations as described above.

05) Phylogenies - Directory with the main phylogeny presented in the manuscript (WGA-iqtree), and the alternative phylogenies presented in Figure S1. The text files in this main folder contain the species names and GAGA-IDs, as well as the files to rename and color the tree using iTOL.

    05a) WGA trees: Species trees reconstructed from intergenic regions extracted from the Cactus whole-genome alignment using IQ-TREE and ASTRAL, and the respective sequence alignments.
    05b) Ortholog trees: Species trees reconstructed from single-copy ortholog alignments. The two trees presented in the manuscript were inferred either from codon alignments of 1000 genes selected with GeneSort using IQ-TREE, or from all gene trees using ASTRAL. The phylogenies resulting from alternative datasets and from protein alignments can be found in the Additional_trees folder.
    05c) BUSCO trees: Protein alignment matrix based on genes from the Hymenoptera_odb10 BUSCO dataset, and the phylogeny reconstructed using 1000 genes selected with GeneSort. Additional species trees, inferred from the DNA alignments with IQ-TREE and with ASTRAL, can be found in the Additional_trees folder.
    05d) UCE trees: Species tree based on UCE alignments. The Additional_trees folder contains phylogenies using different matrix occupancies.
    05e) Dating: Folder containing the input alignment, the tree with fossil prior distributions, the MCMCtree control (ctl) file, and the resulting calibrated ant phylogeny.

06) Ancestral state reconstructions (ASR) - plots presenting the results from the ASR analyses of genomic data and phenotypic traits.

07) Caste gene expression data.

    07a) Table with differential expression between gyne and worker across M. pharaonis developmental transcriptomes (from Qiu et al. 2022), a table with the transcript abundance (in TPM) across samples, and the individual gene plots with the expression levels across caste and developmental stage.
    07b) Table with differential expression between large and small worker castes across A. echinatior developmental transcriptomes, tables with raw counts and TPMs (from Qiu et al. 2022), and individual gene plots with the expression levels across caste and developmental stage.
    07c) Transcriptome data for adult castes across ant species.
        - GAGA_transcriptomes_all_sample_information.txt.gz: Table with the ID and sample information for each RNA-seq sample
        - GAGA_transcriptomes_all_wb_abundance_tpm_addTree.txt.gz: Table with the estimated abundance (in TPM) for each single-copy orthogroup and RNA-seq sample
        - get_expression_plots.pl: Script to plot the expression across castes for a list of orthogroups, using the two input tables described above
        - GAGA_DE_queen_worker_perspecies.zip: Tables containing the differential expression (DE) test results comparing queen and worker castes in each species separately
        - GAGA_DE_queen_worker.zip: Tables with queen and worker DE estimates for each orthogroup across ants, formicoids, poneroids or each subfamily, combining the species-specific DE.
          This information is also included in the functional annotation table for the orthogroups (see 04, Orthology, above).
            * Queen_worker_DE_summaries contains the summary for all orthogroups with the number of genes queen- or worker-biased across species
            * Queen_worker_DE_summaries_chisq_species contains only orthogroups that are significantly enriched in queen- or worker-bias after chi-squared tests.
            * Queen_worker_DE_summaries_50pc_species contains the orthogroups where more than 50% of the species had differentially expressed genes
            * Queen_worker_DE_summaries_50pc_species_strict contains the orthogroups where more than 50% of the species had differentially expressed genes and the difference between caste-specific percentages (gyne-worker or minor worker-large worker) was higher than 50%
            * Queen_worker_DE_summaries_75pc_species contains the orthogroups with more than 75% of the species having differentially expressed genes towards the same caste
        - GAGA_DE_large_minor_workers_perspecies.zip: Tables containing the differential expression results comparing large- and minor-worker castes in each species separately
        - GAGA_DE_large_minor_workers.zip: Tables with large- and minor-worker DE estimates for each orthogroup across formicoids or each subfamily, combining the species-specific DE. This information is also included in the functional annotation table for the orthogroups (see 04, Orthology, above), and the description of each folder and of the differentially expressed orthologs is the same as detailed above for queen-worker DE across ant clades
    07d) Extended data for Figure 4C containing gene expression of MAPK pathway and associated genes in developmental stages of Monomorium pharaonis. Each dot shows expression (in log2(TPM)) for gyne- or worker-destined individuals sampled as 2nd instar, 3rd instar, prepupa, young pupa, old pupa or imago. Differences in expression between gyne- and worker-destined individuals were tested with Wilcoxon rank-sum tests (ns: p > 0.05).
    07e) Extended data for Figure 4C containing gene expression of MAPK pathway and associated genes in developmental stages of Acromyrmex echinatior. Each dot shows expression (in log2(TPM)) for gyne- or worker-destined individuals sampled as 2nd instar, 3rd instar, prepupa, young pupa, old pupa or imago. Differences in expression between gyne- and worker-destined individuals were tested with Wilcoxon rank-sum tests (ns: p > 0.05).

08) Selective constraint analyses.

    08a) Compressed directory containing the codon alignments for all orthogroups using PRANK.
    08b) Compressed directory containing the quality scores for the codon alignments using ZORRO, and the retained alignments with good quality used in the next steps for the selective constraint analyses.
    08c) Output from HyPhy GARD, and the resulting codon alignments for each partitioned gene from GARD.
    08d) HmmCleaner curated codon alignments: Output from HmmCleaner and the resulting curated alignments and their gene trees. The second compressed directory (codon_alignments_genetrees_partitionsgard_hmmclean_blocks50perc.tar.gz) contains the curated alignments after removing blocks with gappy regions in more than 50% of the species, and sequences with fewer than 15 unmasked codons. These alignments and trees were used in the following selective constraint analyses.
    08e) aBSREL results: Output from HyPhy aBSREL and the resulting tables including dN/dS estimates and inferred branches in the gene trees under positive selection (indicated by significant p-values).
    08f) Positive selection inference in the species tree - data presented in Figure 3: Species tree with named internal nodes, and the table used to color the branches in iTOL as in Figure 3A and Figure S3A. Note that Table S5D in the manuscript contains all the information associated with the inferred positive selection events in single-copy genes across all nodes in the species tree.
    08g) Selection signatures associated with phenotypic traits: Directories containing the list of species assigned as "test" and "reference" for each of the analyzed traits (see Table S6A), and the following results:
        - Signatures of positive selection associated with phenotypic traits in folder "absrel_results_strict_split_v2rooted": Summary table with positive selection events (from HyPhy aBSREL results described above), and the list of orthogroups with convergent or enriched positive selection associated with the "test" or "reference" states (Table S6A describes these numbers; see Methods for a detailed description of the methodology).
        - *Signatures of positive selection associated with phenotypic traits in folder "absrel_results_split_v2rooted": Same files as described above, but these results were obtained using a more relaxed parameter allowing a difference of 40% or higher between the proportion of "test" and "reference" positive selection events, instead of the 50% used in the above results "absrel_results_strict_split_v2rooted" as described in the Methods.
          *These results are not presented in the manuscript (given the limited word count and space constraints we focused on the most significant results), but are provided here as they may be useful for future comparative genomics analyses of social and ecological traits in ants.
        - Signatures of relaxed or intensified selection associated with phenotypic traits in the folder "relax_results": Summary tables with the results from HyPhy RELAX for all partitioned genes (with and without dN/dS estimates), and the same tables after merging the results of multiple partitions for the same gene and correcting the p-values with FDR (files with name "mergeparts_fdr.txt"). In addition, the two files "Relax_candidates_" contain the lists of significant genes under relaxed or intensified selection associated with the "test" trait (i.e. intensified in "test" is relaxed in "reference", and vice versa).
          Note that the RELAX analysis was run only for a subset of traits because of the high computational costs (see Table S6A). Also note that Table S6A in the manuscript contains the summary of these analyses, and Table S6B describes all genes with selection signatures associated with the analyzed trait comparisons.

09) Gene family evolution - input and output files from CAFE analyses.
    - N0_GeneCounts_nostLFR.tsv.gz: Table containing the gene count per species for each orthogroup, retrieved from OrthoFinder using the 163 ant gene annotations and the 8 Apoidea outgroups (see 04 - Orthology files).
    - filter_genefams_forcafe.pl: Script used to filter gene families with high variability as well as TE-related orthogroups. The input is "N0_GeneCounts_nostLFR.tsv", and the resulting files are "N0_GeneCounts_nostLFR_sfilt20_nolow_addTE.tsv" and "N0_GeneCounts_nostLFR_sfilt20_nolow_addTE.tsv_summary.txt". This reduced table is used in CAFE to estimate lambda, as CAFE does not converge when the highly variable families are included (see Methods).
    - N0_GeneCounts_nostLFR_sfilt20_nolow_addTE.tsv.gz: Input table used for the first CAFE run to estimate lambda. It excludes gene families in which the difference between the minimum and maximum gene count is higher than 20, as well as TE-related genes.
    - N0_GeneCounts_nostLFR_nolow_addTE.tsv.gz: Input table used for the second CAFE run, which uses the lambda inferred from the above dataset. This table includes the gene families with high variability (filtered out of the above dataset).
    - GAGA_dated_phylogeny_woutgroup_DatesPeters2017_CurrBio_nostLFR.tre: Dated species tree for the ants and Apoidea outgroups used in CAFE analyses, after excluding the short-read based genomes with scaffold N50 lower than 500 kb.
    - CAFE_final_output_with_outgroups.tar.gz: Output folder from CAFE.
    - Significant_HOG_gene_family_plots.zip: Plots for each orthogroup significantly expanded or contracted in any branch of the species tree.
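The size filter described above for the first CAFE run (setting aside families whose maximum and minimum per-species gene counts differ by more than 20) can be sketched as follows. This is a minimal Python illustration and not the filter_genefams_forcafe.pl script itself: the table layout (family ID in the first column, one count column per species), the function name and the toy data are assumptions made for the example, and the TE-related filtering is omitted.

```python
import csv
import io


def filter_by_count_range(tsv_text, max_range=20):
    """Split families by the spread of their per-species gene counts.

    Assumed (hypothetical) layout: tab-separated text with a header row,
    the family ID in the first column and one integer count per species.
    Families with (max - min) > max_range are set aside as highly variable.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    kept, dropped = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        counts = [int(value) for value in row[1:]]
        if max(counts) - min(counts) > max_range:
            dropped.append(row[0])
        else:
            kept.append(row)
    return header, kept, dropped


# Toy table with made-up IDs and counts, for illustration only.
toy = (
    "HOG\tSpeciesA\tSpeciesB\tSpeciesC\n"
    "HOG0000001\t1\t1\t2\n"
    "HOG0000002\t3\t40\t5\n"
    "HOG0000003\t0\t20\t7\n"
)
header, kept, dropped = filter_by_count_range(toy)
print([row[0] for row in kept])  # families retained for the lambda estimation
print(dropped)                   # highly variable families set aside
```

In this sketch a family with a spread of exactly 20 is retained, following the "higher than 20" wording above; the real script may treat edge cases differently.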

10) Genome synteny - results from the gene synteny analyses, and the co-expression modules from developmental transcriptomes of Monomorium pharaonis.
    - Conserved_synteny_of_specific_genes_plots.zip: Plots showing the conserved gene synteny of specific sets of genes highlighted in Figure 2.
    - Synteny.tar.gz: Compressed file containing:
        - syntenic_nets/: The syntenic nets from the pair-wise whole genome alignments that were used to plot the chromosome-level synteny in Figure 2A and Figure S2A.
        - SYNPHONI.synt.output.csv: Output from SYNPHONI.
        - GAGA_species_breakpointRate_phylogeny.nhx: Estimated breakpoint rates across all branches in the phylogeny.
        - GAGA_species_BreakpointRate.csv: Estimated breakpoint rates across each branch of the ant phylogeny for all high-quality genomes.
        - GAGA_species_breakpointRate_phylogeny.node.nhx: Phylogenetic tree containing the node IDs used in "GAGA_species_BreakpointRate.csv".
        - Monomorium_pharaonis_Gyne_abundance_200samples.csv: Gene expression matrix containing abundance in TPM for the gyne samples from developmental transcriptomes of Monomorium pharaonis (samples from Qiu et al. 2022 NEE).
        - Monomorium_pharaonis_Worker_abundance_170samples.csv: Gene expression matrix containing abundance in TPM for the worker samples from developmental transcriptomes of Monomorium pharaonis (samples from Qiu et al. 2022 NEE).
        - Monomorium_pharaonis_Gyne_worker_sample_info.csv: Description of all gyne and worker samples and their developmental stages (from Qiu et al. 2022 NEE)
        - Monomorium_pharaonis_developmental_stages_log2foldchange.csv: The log2foldchange[Gyne/Worker] of Monomorium genes in developmental stages (from Qiu et al. 2022 NEE)
        - Monomorium_pharaonis_gene_co-expression_modules.csv: List of genes assigned to co-expression modules in the caste developmental transcriptomes of Monomorium pharaonis, using WGCNA.
    Note that the scripts are provided in the GAGA GitHub repository (https://github.com/schraderL/GAGA/)

11) Contamination screening - Files used in the contamination screening pipeline, described in https://github.com/dinhe878/GAGA-Metagenome-LGT
    - Insect_Assembly_Acc.txt: List of insect genomes accession IDs.
    - insect_43_genomes.fa.gz: Fasta file containing the genome sequences from the file above after filtering using blobtools2
    - PATRIC_genome_list_21112020.txt: 1908 complete bacterial genome sequences from PATRIC (doi:10.1093/nar/gkt1099)

##