How to build a phylogenetic tree

Okay, we are about to build a phylogenetic tree. What do we need for it? How is it done? What is it?

First of all, a phylogenetic tree is a branching diagram or “tree” showing the inferred evolutionary relationships (the phylogeny) among various biological species or other entities, based upon similarities and differences in their physical or genetic characteristics. Biologists use it to see how different populations are, where lineages have split, and how much the members of one population differ from those of another. In this article we provide a short manual for building a phylogenetic tree.

This is going to be a «building a phylogenetic tree for dummies» manual. Of course, there are more approaches available, and you can use whichever you like or need; this is just one example of how a phylogenetic tree can be built.

We are going to use the following workflow:

SETTING UP THE DATASET: ALIGNMENT ALGORITHMS AND DATA MANAGING

1. Identify a zoological group of interest (aiming to reconstruct a deep phylogeny, this should be a higher-level taxon, like a Phylum, a Class, an Order, or similar);

2. use CLC Sequence Viewer to download 15-20 complete mitochondrial genomes of the selected group;

3. extract the DNA sequences of a Protein Coding Gene (PCG) and a Ribosomal Gene (RG) and save the two datasets in FASTA format;

4. translate the PCG dataset into amino acids, using the correct mitochondrial genetic code;

5. align the three datasets using M-Coffee, trying different combinations of algorithms;

6. mask the data with GBlocks;

7. convert the translated PCG dataset into Phylip format using MEGA or CLC Sequence Viewer;

8. code nucleotide gaps as presence/absence data using GapCoder (which takes as input a modified FASTA file).
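Step 4 of the list above (translation with the mitochondrial code) can be sketched in a few lines of Python. This is only an illustration, not what CLC or MEGA do internally. It assumes a vertebrate group such as the birds used later in this tutorial, so it uses NCBI translation table 2 (vertebrate mitochondrial); other groups need a different table (for example, table 5 for invertebrate mitochondria). The name translate_mito is made up for this example.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order, NCBI translation table 2
# (vertebrate mitochondrial): TGA = W, ATA = M, AGA and AGG = stop.
VERT_MITO_AA = (
    "FFLLSSSSYY**CCWW"
    "LLLLPPPPHHQQRRRR"
    "IIMMTTTTNNKKSS**"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), VERT_MITO_AA)
}


def translate_mito(dna):
    """Translate an unaligned DNA sequence, reading frame starting at base 1."""
    dna = dna.upper().replace("-", "")  # drop alignment gaps, if any
    usable = len(dna) - len(dna) % 3    # ignore an incomplete last codon
    # Codons with ambiguous bases (e.g. N) are reported as X.
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3)
    )


# First 24 bases of the ND1 example used later in this tutorial.
print(translate_mito("ATGACCAACCATCCCATCTTAATC"))  # MTNHPILI
```

Translating unaligned sequences first and aligning the amino acids afterwards keeps the reading frame intact, which is why the gaps are stripped before translation here.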

MAXIMUM LIKELIHOOD

1 Execute ProtTest on the Phylip-formatted translated PCG dataset to identify the most suitable amino acid substitution model;

2 use RAxML to estimate 100 maximum likelihood trees from the translated PCG dataset;

3 create a 60%-threshold consensus tree using PhyUtility;

4 open the tree using Dendroscope.
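Step 3 above keeps only the groupings that are found in at least 60% of the estimated trees. The idea can be sketched in Python as follows. This is not PhyUtility's implementation: it assumes that each tree has already been reduced to the set of clades it contains (each clade a frozenset of taxon names), and the function name consensus_clades and the toy taxa A to E are made up for illustration.

```python
from collections import Counter


def consensus_clades(trees, threshold=0.6):
    """Return {clade: frequency} for clades found in >= threshold of the trees.

    trees: list of sets of clades, each clade a frozenset of taxon names.
    """
    if not trees:
        return {}
    # Count in how many trees each clade occurs (once per tree at most).
    counts = Counter(clade for tree in trees for clade in set(tree))
    n = len(trees)
    return {
        clade: count / n
        for clade, count in counts.items()
        if count / n >= threshold
    }


ab, ac, de = frozenset("AB"), frozenset("AC"), frozenset("DE")
# Five toy trees: three group A with B, two group A with C.
trees = [{ab, de}, {ab, de}, {ab, de}, {ac, de}, {ac, de}]
for clade, freq in consensus_clades(trees).items():
    print(sorted(clade), freq)
```

With a threshold above 50%, the surviving clades cannot contradict each other, so they can always be drawn as a single tree.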

BAYESIAN INFERENCE

1 Execute jModelTest on the FASTA nucleotide dataset to identify the most suitable nucleotide substitution model;

2 concatenate the amino acid and the nucleotide datasets (both data and presence/absence gap scores) in a single NEXUS file;

3 load this NEXUS file with MrBayes and instruct the software on how to partition, model, and analyze the data;

4 open the tree using Dendroscope.

NETWORKS AND CHRONOGRAMS

1 Use SplitsTree to perform different reconstructions of phylogenetic networks;

2 feed the output tree of MrBayes to r8s to obtain an ultrametric tree (parameters will be optimized group by group).

Part 1: SETTING UP THE DATASET. ALIGNMENT ALGORITHMS AND DATA MANAGING

First we need to decide which group of animals we are going to build our phylogenetic tree for. This is one of the hardest parts, so choose wisely. If you are reading this because you already have to build a particular tree, then it’s easy for you: you don’t have a choice.

Once you’ve chosen an interesting animal, plant or whatever, you can go directly to Wikipedia to look up its taxonomy, because if there is not much data for your target family, you’ll have to go one taxonomic level up.

In this tutorial we are going to explore the order Passeriformes, because the family Corvidae (crows) alone has too little data available. So this is what we put in the search box: “Passeriformes complete mitochondrion”

The search gives us a list of matching records.

We want to sort the results by accession, because the best-curated sequences are the NCBI reference sequences (RefSeq), whose accessions start with the NC_ prefix. If you don’t get enough species (as in our case), you might want to search for “complete mitochondrion genome” or just “mitochondrion” instead. This might work out.

Now you want to get one protein-coding gene and one ribosomal gene, so you should switch to the circular view for the selected species.

Then you should save these genes for every animal you need (at least ten), select the saved genes and create a new sequence list,

and then save it in FASTA format to obtain something like this:

>NC_014341_ND1

ATGACCAACCATCCCATCTTAATCAGCCTTATCATAGCCCTCTCCTACATCCTCCCCATT

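For the further steps you might want to remove the line breaks inside each sequence. This is easily done with a regular-expression search and replace (Notepad++ on Windows, gedit/vim on Linux): replace “$\n” (“$\r\n” on Windows) with nothing, then put a line break back before every “>” sign and after every header, so that each header and each sequence sits on its own line.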
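If you prefer a script to an editor, the same unwrapping can be sketched in Python, which also avoids gluing a header to its sequence by accident. This is a minimal sketch that assumes a plain-text FASTA file small enough to hold in memory; the function name unwrap_fasta is made up for this example.

```python
def unwrap_fasta(text):
    """Put every FASTA header and every sequence on a single line each."""
    records = []  # list of [header, sequence] pairs
    for line in text.splitlines():  # splitlines() copes with \n and \r\n
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line  # glue wrapped sequence lines together
    return "".join(f"{header}\n{seq}\n" for header, seq in records)


wrapped = ">NC_014341_ND1\nATGACC\nAACCAT\n\n>NC_020605\nTGCC\nAAAT\n"
print(unwrap_fasta(wrapped), end="")
```

To process a file, read it with open(path).read(), pass the text to unwrap_fasta and write the result to a new file.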

Congratulations! Now you have the needed sequences in FASTA format. The next step is to align them with any alignment tool you have. We will use T-Coffee, which returns the result as a FASTA file: after submitting the job to the online tool, we get an email confirming the alignment, together with the aligned sequences with gaps, which look something like this:

>NC_020605

TGCCAAATTCTAGCCCAATATACC-CAACCC

Right now we have a saved project with the search results, 2 sequence lists in CLC, 2 FASTA files and 2 aligned FASTA files.

So it is time to move on to MEGA.

In MEGA, we open the existing alignment and find out the number of sites.

Now we should export the alignment in MEGA format and then open the exported file with MEGA.

Select Display => Use identical symbols, and then export in Phylip format, because we will need it for the further steps.
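Both the number of sites and the Phylip conversion can also be obtained with a small script. The Python sketch below assumes an aligned FASTA file as input and writes the relaxed sequential Phylip layout (a header line with the number of taxa and sites, then one “name sequence” line per taxon). Strict Phylip limits names to 10 characters, so check what your downstream tool expects. The function names are made up for this example.

```python
def read_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Phylip names must not contain spaces.
            records.append([line[1:].strip().replace(" ", "_"), ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def fasta_to_phylip(text):
    """Convert aligned FASTA text to relaxed sequential Phylip text."""
    records = read_fasta(text)
    if not records:
        raise ValueError("no sequences found")
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; is this an alignment?")
    nsites = lengths.pop()  # the number of sites of the alignment
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {nsites}"]
    lines += [f"{name.ljust(width)} {seq}" for name, seq in records]
    return "\n".join(lines) + "\n"


print(fasta_to_phylip(">NC_020605\nTGCC-A\n>NC_021408\nTGCCAA\n"), end="")
```

The length check is a cheap way to catch a file that was not aligned, or that was truncated on download.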

The FASTA file will be used as input for GBlocks. This is the masking step: sites carrying low phylogenetic signal are removed from the alignment. The output of GBlocks is again a FASTA file, along with an HTML page showing the regions covered by the conserved blocks.
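To get a feeling for what masking does, here is a deliberately crude Python sketch that drops every alignment column with too many gaps. This is not the GBlocks algorithm (GBlocks selects conserved blocks using several criteria); it is only a stand-in that shows columns being removed. The function name mask_gappy_columns and the 50% default threshold are assumptions made for this example.

```python
def mask_gappy_columns(seqs, max_gap_fraction=0.5):
    """Drop alignment columns whose fraction of gaps exceeds the threshold.

    seqs: list of aligned sequences (strings of equal length, gaps as "-").
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences differ in length; is this an alignment?")
    # Indices of the columns that are kept.
    keep = [
        i for i in range(length)
        if sum(s[i] == "-" for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return ["".join(s[i] for i in keep) for s in seqs]


print(mask_gappy_columns(["AC-T", "A--T", "AG-T"]))  # ['ACT', 'A-T', 'AGT']
```

Every sequence loses the same columns, so the result is still a valid alignment, only shorter.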

The FASTA output of GBlocks may be used as input for jModelTest (for the ribosomal alignment) and converted into Phylip for ProtTest (for the amino acid alignment).

Once the molecular evolutionary models are known, tree inference can start. The Phylip file is used as input for RAxML, which performs a classical Maximum Likelihood analysis. The NEXUS file (which may also be obtained from MEGA) is used as input for MrBayes.

NEXUS syntax looks like this:
#NEXUS
[ Title result2]
begin taxa;
dimensions ntax= 11;
taxlabels
NC_020605
NC_021105_l-rRNA
NC_021408
NC_022839
NC_022840
NC_020603
NC_020601
NC_014341
NC_021641_l-rRNA
NC_021408_1
NC_020604
;
end;
begin characters;
dimensions nchar= 1625;
format missing=? gap=- matchchar=. datatype=nucleotide interleave=yes;
matrix
[!Domain=Data property=Coding CodonStart=1;]
NC_020605 TGCCAAATTCTAGCCCAATATACC-CAACCCAAAACAACAAAACTGCT-ACCCAAACCACAACTAAAGCATTTACTAG
NC_021105_l-rRNA TGCCAGACTCTAGCCCAATT-GCATTGACCTGGAATAACAAAGCTACTCCCCATACACCAAACTAAAGCATTTACTAG
NC_021408 TGCCAAACTCTAGCCCAACCGACA-CGACCTAGAATAACAAAGCCACTTACCCCACACCCAACTAAAGCATTCACCAG
You need to put the matrix of one sequence list after that of the other and to set some parameters in the MrBayes settings block.

For experimenting, my NEXUS file is provided here, along with the MrBayes settings:
#NEXUS
BEGIN DATA;
[!Comments are included in brackets, NTAX = number of taxonomical entries you have, NCHAR = number of sites (for both sequences);]
DIMENSIONS NTAX=11 NCHAR=2603;
[!DATATYPE=DNA cause we didn't switch to sequence yet, GAP sets gap-coding symbol;]
FORMAT MISSING=? GAP=- interleave=yes DATATYPE=DNA;
MATRIX
[!Domain=Data property=Coding CodonStart=1;]
NC_020605 ATGACTAACTACCAAACATTAATTAACCTAATCATAGCCCTCTCCTACGCCGTACCGATCTTAGTCGCAGTAGCCTTC
*********
begin mrbayes;
outgroup NC_020604;
charset atp6 = 1-978;
charset rrnS = 979 - 2603;
partition atp6_rrnS = 2: atp6, rrnS;
set partition = atp6_rrnS;
lset applyto=(2) nst=6 rates=invgamma;
prset applyto=(1) aamodelpr=fixed(gtr);
lset applyto=(1) rates=invgamma;
unlink statefreq=(all) revmat=(all) shape=(all) pinvar=(all) tratio=(all);
prset applyto=(all) ratepr=variable;
mcmcp ngen=1000000 burninfrac=0.10;
end;
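The charset boundaries and NCHAR in the file above follow directly from the lengths of the concatenated alignments: atp6 has 978 sites and rrnS has 1625, so the second block runs from 979 to 2603. A small Python sketch of this arithmetic, with a made-up function name, can help to avoid off-by-one mistakes when you have more partitions:

```python
def charset_lines(partitions):
    """Build MrBayes charset lines for alignments concatenated in order.

    partitions: list of (name, number_of_sites) tuples.
    Returns (list of charset lines, total number of sites).
    """
    lines = []
    start = 1  # NEXUS character positions are 1-based
    for name, length in partitions:
        end = start + length - 1
        lines.append(f"charset {name} = {start}-{end};")
        start = end + 1
    return lines, start - 1


lines, nchar = charset_lines([("atp6", 978), ("rrnS", 1625)])
print("\n".join(lines))
print("NCHAR =", nchar)
```

For the two alignments used here this prints the ranges 1-978 and 979-2603 and a total of 2603 sites, matching the values in the file.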

List of software for building a phylogenetic tree

CLC SEQUENCE VIEWER

Environment: Windows/Linux

Download link: http://www.clcbio.com/products/clc-sequence-viewer/

DENDROSCOPE

Environment: Java

Download link: http://ab.inf.uni-tuebingen.de/data/software/dendroscope3/download/welcome.html

GAPCODER

Environment: Windows

Download link: http://www.biomedcentral.com/1471-2105/4/6/

GBLOCKS

Environment: Windows/Linux

Download link: http://molevol.cmima.csic.es/castresana/Gblocks.html

JMODELTEST

Environment: Java

MRBAYES

Environment: Windows/Linux

Download link: http://mrbayes.sourceforge.net/download.php

PHYUTILITY

Environment: Java

Download link: https://code.google.com/p/phyutility/downloads/list

PROTTEST

Environment: Java

Download link: http://darwin.uvigo.es/software/prottest.html

SPLITSTREE

Environment: Java

Download link: http://ab.inf.uni-tuebingen.de/data/software/splitstree4/download/welcome.html

T-COFFEE

Environment: Online server/Linux

Download link: http://www.tcoffee.org/Packages/Stable/Latest/