# This repository contains the data accompanying the manuscript "Complexity of avian evolution revealed by family-level genomes" in Nature by Josefin Stiller, Shaohong Feng, Al-Aabid Chowdhury, Iker Rivas-González, David A. Duchêne, Qi Fang, Yuan Deng, Alexey Kozlov, Alexandros Stamatakis, Santiago Claramunt, Jacqueline M. T. Nguyen, Simon Y. W. Ho, Brant C. Faircloth, Julia Haag, Peter Houde, Joel Cracraft, Metin Balaban, Uyen Mai, Guangji Chen, Rongsheng Gao, Chengran Zhou, Yulong Xie, Zijian Huang, Zhen Cao, Zhi Yan, Huw A. Ogilvie, Luay Nakhleh, Bent Lindow, Benoit Morel, Jon Fjeldså, Peter A. Hosner, Rute R. da Fonseca, Bent Petersen, Joseph A. Tobias, Tamás Székely, Jonathan David Kennedy, Andrew Hart Reeve, Andras Liker, Martin Stervander, Agostinho Antunes, Dieter Thomas Tietze, Mads Bertelsen, Fumin Lei, Carsten Rahbek, Gary R. Graves, Mikkel H. Schierup, Tandy Warnow, Edward L. Braun, M. Thomas P. Gilbert, Erich D. Jarvis, Siavash Mirarab, Guojie Zhang.

# The deposited data can be accessed either through a graphical user interface or from the command line. Command-line links are given below for individual files and for batch downloads of large numbers of files.
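# A minimal sketch of how a command-line link can be put together: the links in this file consist of the share root followed by the path of a file inside the repository. The helper name erda_url and the variable ERDA_BASE are hypothetical and not part of the deposited data; the share root is copied from the links in this file.

```shell
# Hypothetical helper, not part of the repository: build a download URL
# from a path relative to the share root used by the links in this file.
ERDA_BASE="https://sid.erda.dk/share_redirect/ENhZODU9YE"

erda_url() {
    # Strip one leading slash so the result never contains "//" after the root.
    printf '%s/%s\n' "$ERDA_BASE" "${1#/}"
}

# Print the URL of the main species tree (nothing is downloaded here).
erda_url 03_species_trees/63K.tre
```

# To download a file, pass the printed URL to wget, e.g. wget -c "$(erda_url 04_timetree/main_alternative_mod.tre)". The -c option lets wget resume an interrupted download.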
The repository contains 8 main directories (some with subdirectories marked *):
01) Alignments and gene trees
    * Intergenic regions
    * Exons
    * Introns
    * UCEs
    * Total evidence datasets
    * Whole genome alignment in 10kb windows
02) Characteristics of alignments and gene trees
    * Locus Mastertable
    * GC content
    * Phylogenetic model adequacy
03) Species trees
04) Time trees
05) Subsetting experiments
06) Polytomy test
07) Scripts to reproduce figures
08) Supplementary results

###########################################################################
############################ DETAILED CONTENTS ############################
###########################################################################

#####################################
### 01) Alignments and gene trees ###
#####################################

##########################
### Intergenic regions ###
##########################
Path: 01_alignments_and_gene_trees/intergenic_regions/

# Alignments and gene trees of intergenic regions are named in the format {chromosome}_{start_coordinate_in_10kb_window}_{end_coordinate_in_10kb_window}_{start_coordinate_of_locus_within_the_window}.fasta/tre.
# E.g. chr10_10070000_10080000_1k_start38 is from chromosome 10, with the 10kb window starting at 10070000 and ending at 10080000 and the selected 1kb locus starting at position 38.

### 63K dataset (dataset for main analysis)
# 63430 alignments in FASTA format for which gene trees were built.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/intergenic_regions/63430.alns.tar.gz
# 63430 gene trees in newick format with support as aLRT values.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/63430.gene_trees.tar.gz
# 63430 gene trees; the 1st column is the name of the underlying locus, the 2nd column is the tree in newick format.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/63430.named.gene.trees.gz
# 63430 gene trees after collapsing branches with aLRT values below 0.95. This file was used as input for ASTRAL analysis.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/63430.aLRT-0.95-collapsed.gene.trees.gz
# 63430 gene trees after collapsing branches with aLRT values below 0.95, used for gene-only multi-locus bootstrapping (globalBS).
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/63430.aLRT-0.95-collapsed.GOBS-replicatetrees.tre.gz

### 94K dataset
# 94402 alignments in FASTA format for which gene trees were built.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/94402.alns.tar.gz
# 94402 gene trees, each as an individual file in newick format with support as aLRT values.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/94402.gene_trees.tar.gz
# 94402 gene trees; the 1st column is the name of the underlying locus, the 2nd column is the tree in newick format.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/94402.named.gene.trees.gz
# 94402 gene trees after collapsing branches with aLRT values below 0.95. This file was used as input for ASTRAL analysis.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/94402.aLRT-0.95-collapsed.gene.trees.gz

### 80K dataset
# 80047 alignments in FASTA format for which gene trees were built.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/80047.alns.tar.gz
# 80047 gene trees in newick format with support as aLRT values.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/80047.gene_trees.tar.gz
# 80047 gene trees; the 1st column is the name of the underlying locus, the 2nd column is the tree in newick format.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/80047.named.gene.trees.gz
# 80047 gene trees after collapsing branches with aLRT values below 0.95. This file was used as input for ASTRAL analysis.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/80047.aLRT-0.95-collapsed.gene.trees.gz

#############
### Exons ###
#############
Path: 01_alignments_and_gene_trees/exons/

# All alignments are nucleotides, with the third codon position excluded (C12).
# Alignments and gene trees have short names that can be translated to full species names with the mapping file Supplementary_File_1.tsv.

# All alignments in FASTA format, including alignments with <4 taxa.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/exons/all_alns.tar.gz
# 14972 alignments in FASTA format for which gene trees were built.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/exons/alns.tar.gz
# 14972 gene trees in newick format with support as aLRT values.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/exons/exon_c12_atleast4taxa.gene.trees.tar.gz
# 14972 gene trees as a single file; the 1st column is the name of the underlying locus, the 2nd column is the tree in newick format.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/exons/named.trees.gz

###############
### Introns ###
###############
Path: 01_alignments_and_gene_trees/introns/

# 44846 alignments in FASTA format for which gene trees were built.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/introns/alns.tar.gz
# 44846 gene trees in newick format with support as aLRT values.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/introns/trees.tar.gz
# 44846 gene trees as a single file; the 1st column is the name of the underlying locus, the 2nd column is the tree in newick format.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/introns/named.trees.gz

############
### UCEs ###
############
Path: 01_alignments_and_gene_trees/uces/

# Alignments and gene trees have extended names that can be translated to full species names with the mapping file:
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/uces/uce_name_map.txt
# 4985 alignments in FASTA format for which gene trees were built.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/uces/alns.tar.gz
# 4985 gene trees in newick format with support as aLRT values.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/uces/trees.tar.gz
# 4985 gene trees as a single file; the 1st column is the name of the underlying locus, the 2nd column is the tree in newick format.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/uces/named.trees.gz

###############################
### Total evidence datasets ###
###############################
Path: 01_alignments_and_gene_trees/total_evidence_datasets/

# 159205 gene trees; the 1st column is the name of the underlying locus, the 2nd column is the tree in newick format.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/159205.named.trees.gz
# 159205 gene trees after collapsing branches with aLRT values below 0.95. This file was used as input for ASTRAL analysis.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/159205.aLRT-0.95-collapsed.gene.trees.gz
# 128233 gene trees after collapsing branches with aLRT values below 0.95. This file was used as input for ASTRAL analysis.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/intergenic_regions/128233.aLRT-0.95-collapsed.gene.trees.gz

##############################################
### Whole genome alignment in 10kb windows ###
##############################################
Path: 01_alignments_and_gene_trees/whole_genome_fasta/

# Alignments are 10kb windows, each in FASTA format. Each tar.gz contains 100 alignments.
# Alignment names have the following format: {chromosome}_{startCoordinate}_{endCoordinate}.bed.maf.sort.all.fasta.merge, e.g. chr10_4730000_4740000.bed.maf.sort.all.fasta.merge
# All tar.gz files can be downloaded with the following command:
for x in $(seq -w 1 12); do
    mkdir -p "$x"
    cd "$x" || break
    for i in $(seq -w 1 100); do
        echo "Downloading $x $i ............."
        wget "https://sid.erda.dk/share_redirect/ENhZODU9YE/01_alignments_and_gene_trees/whole_genome_fasta/Gallus_gallus_extract_10k_package/$x/$i.tar.gz"
    done
    cd ..
done

########################################################
### 02) Characteristics of alignments and gene trees ###
########################################################

#########################
### Locus Mastertable ###
#########################
Path: 02_characteristics_of_alignments_and_gene_trees/

# The tab-delimited table contains different metrics pertaining to the alignment and the gene tree of each locus (159205 loci in total).
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/master_table_gene_trees.txt

##################
### GC content ###
##################
Path: 02_characteristics_of_alignments_and_gene_trees/gc_content/

# Counts of each nucleotide, of gap and N characters, and the proportions of GC and AT for each sequence of each alignment. Available for each datatype:
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/gc_content/exons.c12.gc_content.tab.gz
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/gc_content/intron.gc_content.tab.gz
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/gc_content/random_region.gc_content.tab.gz
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/gc_content/uces.gc_content.tab.gz

###################################
### Phylogenetic model adequacy ###
###################################
Path: 02_characteristics_of_alignments_and_gene_trees/phylogenetic_model_adequacy/

# Datasets giving the base heterogeneity and risk of sequence substitution for each locus of each datatype (exon, intron, intergenic regions, UCEs)
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/phylogenetic_model_adequacy/comp_sat_exons.csv
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/phylogenetic_model_adequacy/comp_sat_introns.csv
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/phylogenetic_model_adequacy/comp_sat_intergenic.csv
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/02_characteristics_of_alignments_and_gene_trees/phylogenetic_model_adequacy/comp_sat_uce.csv

#########################
### 03) Species trees ###
#########################
Path: 03_species_trees/

# The main species tree resulting from 63K intergenic regions analyzed with ASTRAL.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/03_species_trees/63K.tre
# The main species tree resulting from 63K intergenic regions analyzed with ASTRAL and using gene-only multi-locus bootstrapping (globalBS):
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/03_species_trees/63K-GOBS.tre
# The species tree resulting from 63K intergenic regions analyzed with RAxML-NG concatenation.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/03_species_trees/63K.RAxML.support_bs50.tre
# The species tree resulting from 14K loci of exons (1st and 2nd codon position) analyzed with ASTRAL.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/03_species_trees/exon_c12.tre
# The species tree resulting from 45K loci of introns analyzed with ASTRAL.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/03_species_trees/intron.tre
# The species tree resulting from 5K loci of UCEs analyzed with ASTRAL.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/03_species_trees/UCE.tre
# 1425 species trees resulting from subsetting experiments. The file has the name of the experiment in the first column and the species tree in newick format in the second column.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/03_species_trees/experiment.named.trees
# Comparison of the 1425 species trees resulting from subsetting experiments to the main 63K intergenic tree and other reference trees. The table gives Robinson-Foulds distances, quartet differences, and the proportion of highly supported nodes.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/03_species_trees/master_table_experiments.txt

######################
### 04) Time trees ###
######################
Path: 04_timetree/

# The main time-calibrated phylogeny in NEXUS format from MCMCTree.
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/04_timetree/main_alternative_mod.tre

##################################
### 05) Subsetting experiments ###
##################################
# Datasets used in subsetting experiments according to different metrics and characteristics. For each experiment, three files are given:
# 1. Collapsed gene trees. A gzipped text file of all gene trees with low-aLRT branches collapsed; these trees were analyzed with ASTRAL.
# 2. The resulting ASTRAL log file. Contains the command used to run ASTRAL, statistics and run time of the run.
# 3. The resulting ASTRAL species tree.

#####################
### Data quantity ###
#####################
Path: 05_subsetting_experiments/number_loci/

# Analyses were done for all data types (intergenic regions (called "random" in these folders), introns, exons, UCEs) and all gene trees from non-coding regions (129,878 loci, called "all_loci" in these folders). Files for each data type are compressed into a tar.gz (see download command below). For each experiment, three files are given:
# 1. Collapsed gene trees in the format "{datatype}.{number_of_loci_subsampled}.{repetition}.gz". E.g. exon.2000.rep11.gz contains exonic gene trees, of which 2000 were randomly sampled, in the 11th repeated subset of 2000 gene trees.
# 2. The resulting ASTRAL log file in the format "astral-{datatype}.{number_of_loci_subsampled}.{repetition}.log". Contains the command used to run ASTRAL, statistics and run time of the run.
# 3. The resulting ASTRAL species tree in the format "astral-{datatype}.{number_of_loci_subsampled}.{repetition}.tre".
# Analyses for each datatype $d can be downloaded as follows:
for d in all_loci exons introns intergenic_regions UCEs; do
    echo "Downloading $d ............."
    wget "https://sid.erda.dk/share_redirect/ENhZODU9YE/05_subsetting_experiments/number_loci/$d.tar.gz"
done

###############################
### Genomic characteristics ###
###############################
Path: 05_subsetting_experiments/genomic_characteristics/

# Analyses were done for loci of all data types (intergenic regions (called "random_region" in these folders), intron, exon, UCE), split into 4 quantiles based on different genomic characteristics. The genomic characteristics for all loci are given in "master_table_gene_trees.txt" in directory 02_characteristics_of_alignments_and_gene_trees/. Files are compressed into a tar.gz (see download command below). For each experiment, three files are given:
# 1. Collapsed gene trees in the format "{datatype}_{genomic_characteristic}_{quantile}.txt.aLRT0.05.gene.trees.gz". E.g. exon_clock_quantile2.txt.aLRT0.05.gene.trees.gz contains gene trees of exons constituting the 2nd quantile of the distribution of clocklikeness.
# 2. The resulting ASTRAL log file in the format "astral-{datatype}_{genomic_characteristic}_{quantile}.txt.aLRT0.05.gene.trees.log". Contains the command used to run ASTRAL, statistics and run time of the run.
# 3. The resulting ASTRAL species tree in the format "astral-{datatype}_{genomic_characteristic}_{quantile}.txt.aLRT0.05.gene.trees.tre".
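# The naming scheme of the genomic-characteristics gene tree files can be taken apart with plain shell parameter expansion. This is a minimal sketch: the function name parse_characteristic_name is hypothetical, and it assumes that "random_region" is the only datatype label containing an underscore and that the quantile is always the last underscore-separated field.

```shell
# Hypothetical helper: split
#   {datatype}_{genomic_characteristic}_{quantile}.txt.aLRT0.05.gene.trees.gz
# into its three parts and print them tab-separated.
parse_characteristic_name() {
    name=${1##*/}                               # drop any directory part
    stem=${name%.txt.aLRT0.05.gene.trees.gz}    # drop the fixed suffix
    quantile=${stem##*_}                        # last "_"-separated field
    rest=${stem%_*}                             # datatype + characteristic
    case $rest in
        random_region_*) datatype=random_region ;;   # assumed only multi-word datatype
        *)               datatype=${rest%%_*} ;;
    esac
    characteristic=${rest#"${datatype}_"}
    printf '%s\t%s\t%s\n' "$datatype" "$characteristic" "$quantile"
}

# Example file name taken from the description above.
parse_characteristic_name exon_clock_quantile2.txt.aLRT0.05.gene.trees.gz
```

# The same function can be applied to every gene tree file of an unpacked archive, e.g. in a loop over *.gene.trees.gz, to tabulate which datatype, characteristic and quantile each file belongs to.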
# Download all species trees subset by genomic characteristics
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/05_subsetting_experiments/genomic_characteristics/genomic_characteristics.tar.gz

#########################################
### Species trees for each chromosome ###
#########################################
Path: 05_subsetting_experiments/chromosome/by_chromosome/

# Analyses were done based on intergenic regions (from the 80k locus set) for each chromosome with >1000 gene trees. Three files are given:
# 1. Collapsed gene trees in the format "{chromosome_number}.gene.trees.gz". E.g. chr7.gene.trees.gz contains all gene trees of chromosome 7.
# 2. The resulting ASTRAL log file in the format "astral-{chromosome_number}.gene.trees.log". Contains the command used to run ASTRAL, statistics and run time of the run.
# 3. The resulting ASTRAL species tree in the format "astral-{chromosome_number}.gene.trees.tre".
# Download all chromosome trees
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/05_subsetting_experiments/chromosome/by_chromosome.tar.gz

##############################################
### Species trees for each chromosome type ###
##############################################
Path: 05_subsetting_experiments/chromosome/by_chromosome_type/

# Analyses were done based on intergenic regions (from the 80k locus set) for each chromosome with >1000 gene trees for the 3 major chromosomal categories. Three files are given:
# 1. Collapsed gene trees in the format "{chromosome_type}.{number_of_loci_analyzed}_gene.trees.gz". E.g. intermediate.12k_gene.trees.gz contains all 12k gene trees of chromosomes of the intermediate size category.
# 2. The resulting ASTRAL log file in the format "astral-{chromosome_type}.{number_of_loci_analyzed}_gene.trees.log". Contains the command used to run ASTRAL, statistics and run time of the run.
# 3. The resulting ASTRAL species tree in the format "astral-{chromosome_type}.{number_of_loci_analyzed}_gene.trees.gz.tre".
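# The two chromosome folders name their ASTRAL outputs slightly differently: in by_chromosome_type the species tree keeps the ".gz" of the input file name, in by_chromosome it does not. The sketch below derives the expected log and tree names from a gene tree file name, following the formats listed above. The function name astral_outputs_for and its "layout" argument are hypothetical.

```shell
# Hypothetical helper: print the expected ASTRAL log and species tree file
# names for a collapsed gene tree file, following the formats listed above.
# Usage: astral_outputs_for {by_chromosome|by_chromosome_type} FILE
astral_outputs_for() {
    layout=$1
    trees=${2##*/}          # gene tree file name without directory
    stem=${trees%.gz}       # same name without the .gz extension
    case $layout in
        by_chromosome)
            printf 'astral-%s.log\n' "$stem"
            printf 'astral-%s.tre\n' "$stem"
            ;;
        by_chromosome_type)
            printf 'astral-%s.log\n' "$stem"
            printf 'astral-%s.tre\n' "$trees"   # tree name keeps ".gz"
            ;;
        *)
            echo "unknown layout: $layout" >&2
            return 1
            ;;
    esac
}

# Examples using the file names given in the descriptions above.
astral_outputs_for by_chromosome chr7.gene.trees.gz
astral_outputs_for by_chromosome_type intermediate.12k_gene.trees.gz
```

# This can be used after unpacking the archives to check that each gene tree file has its matching log and species tree.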
# Download all chromosome type trees
wget https://sid.erda.dk/share_redirect/ENhZODU9YE/05_subsetting_experiments/chromosome/by_chromosome_type.tar.gz

#########################
### 06) Polytomy test ###
#########################
Path: 06_polytomy_test/

# Analyses done on the 63K locus set testing for polytomies with ASTRAL.

########################################
### 07) Scripts to reproduce figures ###
########################################
Path: 07_scripts_figures/

# Scripts needed to reproduce main text figures and extended data figures.
# Each folder is named according to the figure name. The directory contains the required data files and R scripts for reading, analysis and plotting.
# This includes datasets of basic statistics of alignments and gene trees, comparisons of various species trees resulting from different data types and subsetting experiments, substitution rates, Pagel's lambda, and analyses of the rate of change in body mass and relative brain size through time.

#################################
### 08) Supplementary Results ###
#################################
Path: 08_supplementary_results/

# Supplementary results from simulations on the effect of taxon sampling on Pagel's lambda.