How to build a phylogenetic tree

Okay, we are about to build a phylogenetic tree. What do we need for it? How is it done? What is it?

First of all, a phylogenetic tree is a branching diagram (a “tree”) showing the inferred evolutionary relationships, or phylogeny, among biological species or other entities, based on similarities and differences in their physical or genetic characteristics. Biologists use it to see how far populations have diverged, in what order lineages split, and how much the members of one population differ from those of another. In this article we provide a short manual for building a phylogenetic tree.

This is going to be a «building a phylogenetic tree for dummies» manual. Of course, other approaches are available and you can use whichever you like or need; this is just one example of how to build a phylogenetic tree.

We are going to use the following workflow:

SETTING UP THE DATASET: ALIGNMENT ALGORITHMS AND DATA MANAGING

1. Identify a zoological group of interest (aiming to reconstruct a deep phylogeny, this should be a higher-level taxon, like a Phylum, a Class, an Order, or similar);

2. use CLC Sequence Viewer to download 15-20 complete mitochondrial genomes of the selected group;

3. extract the DNA sequences of a Protein Coding Gene (PCG) and a Ribosomal Gene (RG) and save the two datasets in FASTA format;

4. translate the PCG dataset into amino acids, using the correct mitochondrial code;

5. align the three datasets using M-Coffee, trying different combinations of algorithms;

6. mask the data with GBlocks;

7. convert the translated PCG dataset into Phylip format using MEGA or CLC Sequence Viewer;

8. code nucleotide gaps as presence/absence data using GapCoder (which takes as input a modified FASTA file).
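Step 4 is where the choice of genetic code matters. As a minimal sketch (not what CLC Sequence Viewer or MEGA do internally), the snippet below translates the ND1 fragment shown later in this article with NCBI translation table 2 (vertebrate mitochondrial), which fits Passeriformes, and with the standard code for comparison. The function names are my own and only serve as illustration.

```python
from itertools import product

# Codons are enumerated in TCAG order, as in the NCBI genetic code tables.
BASES = "TCAG"
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def make_table(amino_acids):
    """Map each of the 64 codons to its one-letter amino acid."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


def translate(dna, table):
    """Translate an unaligned DNA string; unknown codons become 'X'."""
    dna = dna.upper().replace("-", "")
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


nd1_start = "ATGACCAACCATCCCATCTTAATCAGCCTTATCATAGCCCTCTCCTACATCCTCCCCATT"
print(translate(nd1_start, make_table(VERT_MITO)))  # MTNHPILISLIMALSYILPI
print(translate(nd1_start, make_table(STANDARD)))   # MTNHPILISLIIALSYILPI
```

The two results differ at the ATA codon (Met in the mitochondrial code, Ile in the standard one), which is exactly the kind of error the wrong table introduces.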

 MAXIMUM LIKELIHOOD

1. Execute ProtTest on the Phylip-formatted translated PCG dataset to identify the most suitable amino acid substitution model;

2. use RAxML to estimate 100 maximum likelihood trees from the translated PCG dataset;

3. create a 60%-threshold consensus tree using PhyUtility;

4. open the tree using Dendroscope.
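To make the 60% threshold concrete, here is a small sketch of the idea behind a threshold consensus: count how often each clade appears across the trees and keep the clades found in at least 60% of them. It is not PhyUtility's implementation; it treats trees as rooted nested tuples (real tools work on Newick files and usually on unrooted bipartitions), and the taxon names A to D are placeholders.

```python
from collections import Counter


def clades(tree):
    """Return (tips, clades) for a rooted tree given as nested tuples of tip names."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips, found = frozenset(), set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)
    return tips, found


def consensus_clades(trees, threshold=0.6):
    """Return {clade: frequency} for clades present in >= threshold of the trees."""
    counts = Counter()
    for tree in trees:
        all_tips, found = clades(tree)
        counts.update(c for c in found if c != all_tips)  # skip the trivial root clade
    total = len(trees)
    return {c: n / total for c, n in counts.items() if n / total >= threshold}


trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
    (("A", "C"), ("B", "D")),
    (("A", "B"), ("C", "D")),
]
for clade, freq in consensus_clades(trees).items():
    print(sorted(clade), freq)
```

Here the clade A+B is found in 4 of 5 trees and C+D in 3 of 5, so both survive; every other grouping falls below the threshold and would be collapsed in the consensus tree.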

 BAYESIAN INFERENCE

1. Execute jModelTest on the FASTA nucleotide dataset to identify the most suitable nucleotide substitution model;

2. concatenate the amino acid and nucleotide datasets (both data and presence/absence gap scores) in a single NEXUS file;

3. load this NEXUS file into MrBayes and instruct the software on partitioning, modelling, and analyzing the data;

4. open the tree using Dendroscope.
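Model selection tools such as jModelTest and ProtTest rank candidate models with information criteria. As a rough illustration of one of them, the Akaike Information Criterion (AIC = 2k - 2 lnL), the sketch below picks the model with the lowest AIC. The log-likelihood values are invented for the example, and the parameter counts cover only the substitution model, not branch lengths (which are the same for every model on a fixed topology).

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def best_model(fits):
    """fits maps model name -> (log-likelihood, number of free parameters)."""
    return min(fits, key=lambda name: aic(*fits[name]))


# Hypothetical fits, for illustration only.
fits = {
    "JC": (-5230.0, 0),
    "HKY+G": (-5105.0, 5),
    "GTR+I+G": (-5098.0, 10),
}
for name, (lnl, k) in fits.items():
    print(name, aic(lnl, k))
print("best:", best_model(fits))
```

With these made-up numbers GTR+I+G wins, which would correspond to the nst=6 rates=invgamma setting in a MrBayes block.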

NETWORKS AND CHRONOGRAMS

1. Use SplitsTree to perform different reconstructions of phylogenetic networks;

2. provide r8s with the output tree of MrBayes to obtain an ultrametric tree (parameters will be optimized group by group).
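“Ultrametric” means that every tip is at the same distance from the root. The sketch below only checks that property on a toy tree; it does not reproduce what r8s does to achieve it. The tree encoding (a node is a pair of tip name or child list, plus branch length) and the taxon names are assumptions made for this example.

```python
def tip_depths(node, depth=0.0):
    """Yield (tip name, root-to-tip distance) for a (content, branch_length) node."""
    content, length = node
    depth += length
    if isinstance(content, str):
        yield content, depth
    else:
        for child in content:
            yield from tip_depths(child, depth)


def is_ultrametric(tree, tol=1e-9):
    """True if all tips are equally distant from the root (within tol)."""
    depths = [d for _, d in tip_depths(tree)]
    return max(depths) - min(depths) <= tol


clock_like = ([([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)], 0.0)
not_clock_like = ([([("A", 0.5), ("B", 1.0)], 2.0), ("C", 3.0)], 0.0)
print(is_ultrametric(clock_like))      # True
print(is_ultrametric(not_clock_like))  # False
```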

 

Part 1: SETTING UP THE DATASET. ALIGNMENT ALGORITHMS AND DATA MANAGING

First we need to decide which group of animals we are going to build our phylogenetic tree for. This is one of the hardest parts, so it should be done wisely. If you are reading this because you already have to build a particular tree, it is easy: you don’t have a choice.

Once you have chosen an interesting animal, plant, or other organism, you can go directly to Wikipedia to look up its taxonomy, because if too little sequence data is available for your target family, you will have to go one level up.

In this tutorial we are going to explore the order Passeriformes, because the family Corvidae (crows) has too little information. So this is what we put in the search box: “Passeriformes complete mitochondrion”

[Screenshot: CLC viewer]

This gives us something like this:

[Screenshot: CLC search results]

We want to sort the results by accession, because the best sequences are the NCBI reference sequences, whose accession numbers start with the NC_ prefix. If you don’t find enough species (as in our case), you might want to search for “complete mitochondrion genome” or “mitochondrion” instead. This might work out.
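If you export the list of hits, the same preference can be expressed in a couple of lines. This is only a sketch of the sorting idea, not a CLC feature; the XX... accessions are placeholders that stand for non-reference records.

```python
def prefer_refseq(accessions):
    """Sort accessions, listing NC_ reference sequences before everything else."""
    return sorted(accessions, key=lambda acc: (not acc.startswith("NC_"), acc))


hits = ["XX000002", "NC_020605", "XX000001", "NC_014341"]  # XX... are placeholders
print(prefer_refseq(hits))
# ['NC_014341', 'NC_020605', 'XX000001', 'XX000002']
```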

Now you want to get one protein-coding gene and one ribosomal gene, so switch to the circular view for the selected species:

[Screenshot: CLC viewer]

Then save these genes for every animal you need (at least ten), select the saved genes, and create a new sequence list:

[Screenshot: CLC viewer sequence list]

Then save the list in FASTA format to obtain something like this:

>NC_014341_ND1

ATGACCAACCATCCCATCTTAATCAGCCTTATCATAGCCCTCTCCTACATCCTCCCCATT

For further steps, however, you might want to remove the line breaks inside each sequence. This is easily done with a regular expression search-and-replace (Notepad++ on Windows, gedit/vim on Linux): replace “$\n” (“$\r\n” on Windows) with nothing, then put a new line back before each “>” sign and after each header.
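If you prefer a script to an editor, the same clean-up can be sketched in a few lines of Python. The function name is my own; the snippet simply joins the wrapped sequence lines of each record while keeping every header on its own line.

```python
def unwrap_fasta(text):
    """Return FASTA text with each record's sequence joined onto a single line."""
    records = []
    for line in text.splitlines():  # splitlines() handles both \n and \r\n
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, ""])   # start a new record
        elif records:
            records[-1][1] += line       # append to the current sequence
    return "".join(f"{header}\n{seq}\n" for header, seq in records)


wrapped = ">NC_014341_ND1\r\nATGACCAACC\r\nATCCCATCTT\r\n\r\n>second\nACGT\nAC\n"
print(unwrap_fasta(wrapped))
```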

Congratulations! You now have the sequences you need in FASTA format. The next step is to align them with any alignment tool you have. We will use T-Coffee, which outputs the result as a FASTA file. After submitting the job to the online tool, we receive an email confirming the alignment, with the aligned sequences (now containing gaps) looking something like this:

>NC_020605

TGCCAAATTCTAGCCCAATATACC-CAACCC

Right now we have a saved project with the search results, 2 sequence lists in CLC, 2 FASTA files, and 2 aligned FASTA files.
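Before moving on, it is worth checking that each aligned file really is an alignment, that is, that all rows have the same length once gaps are included. The following is a small sanity check under that assumption; the helper names are hypothetical and the two rows are taken from the examples in this article.

```python
def read_fasta(text):
    """Parse FASTA text into a {name: sequence} dict (insertion-ordered)."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            seqs[name] = ""
        elif line and name is not None:
            seqs[name] += line
    return seqs


def alignment_length(seqs):
    """Return the common row length, or raise ValueError if rows differ."""
    lengths = {name: len(seq) for name, seq in seqs.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"not an alignment, row lengths differ: {lengths}")
    return next(iter(lengths.values()))


aligned = (
    ">NC_020605\nTGCCAAATTCTAGCCCAATATACC-CAACCC\n"
    ">NC_021408\nTGCCAAACTCTAGCCCAACCGACA-CGACCT\n"
)
print(alignment_length(read_fasta(aligned)))
```

The number it prints is also the “number of sites” you will need later for the NCHAR field of the NEXUS file.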

So it is time to move on to MEGA.

We open the existing alignment:

[Screenshot: Mega]

and find out the number of sites:

[Screenshot: Mega]

Now we should export it in MEGA format and then open it with MEGA:

select display => use identical symbols

and then export it in Phylip format, because we will need it for further steps.
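For reference, a Phylip file is simple enough to write by hand: a header line with the number of taxa and the number of sites, then one row per taxon. The sketch below writes the sequential, “relaxed” variant (names of any length, separated from the sequence by whitespace); strict Phylip limits names to 10 characters, so treat the layout as an assumption to check against the program you feed it to. It is not MEGA's exporter.

```python
def to_phylip(seqs):
    """Format an aligned {name: sequence} dict as sequential, relaxed Phylip."""
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    if any(" " in name for name in seqs):
        raise ValueError("taxon names must not contain spaces")
    width = max(len(name) for name in seqs) + 2  # pad names into one column
    lines = [f"{len(seqs)} {lengths.pop()}"]
    lines += [f"{name.ljust(width)}{seq}" for name, seq in seqs.items()]
    return "\n".join(lines) + "\n"


print(to_phylip({"NC_020605": "ACG-T", "NC_021408": "ACGTT"}))
```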

The FASTA file will be used as input for GBlocks. This is the masking step: poorly aligned or highly divergent sites, which carry little reliable phylogenetic signal, are removed from the alignment. The output of GBlocks is again a FASTA file, along with an HTML page showing the regions covered by conserved blocks.
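To get a feeling for what masking does, here is a deliberately crude stand-in: it drops every alignment column in which more than half of the sequences have a gap. GBlocks uses more elaborate criteria (conservation and block length), so this is not its algorithm, only the general idea of column filtering; the function name, the 0.5 cut-off, and the toy data are assumptions.

```python
def mask_columns(seqs, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    names = list(seqs)
    columns = zip(*(seqs[name] for name in names))
    kept = [col for col in columns if col.count("-") / len(col) <= max_gap_fraction]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


toy = {"a": "AC-GT", "b": "A--GT", "c": "ACTG-"}
print(mask_columns(toy))
# {'a': 'ACGT', 'b': 'A-GT', 'c': 'ACG-'}
```

Only the third column (two gaps out of three rows) is removed; columns with a single gap stay in.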

The FASTA output of GBlocks may be used as input for jModelTest (for the ribosomal alignment) and converted to Phylip for ProtTest (for the amino acid alignment).

Once the molecular evolutionary models have been chosen, tree inference can begin. The Phylip file is used as input for RAxML, which performs a classical Maximum Likelihood analysis. The NEXUS format (which may also be obtained from MEGA) is used as input for MrBayes.

NEXUS syntax looks like this:

#NEXUS
[ Title result2]
begin taxa;
dimensions ntax= 11;
taxlabels
NC_020605
NC_021105_l-rRNA
NC_021408
NC_022839
NC_022840
NC_020603
NC_020601
NC_014341
NC_021641_l-rRNA
NC_021408_1
NC_020604
;
end;
begin characters;
dimensions nchar= 1625;
format missing=? gap=- matchchar=. datatype=nucleotide interleave=yes;
matrix
[!Domain=Data property=Coding CodonStart=1;]
NC_020605        TGCCAAATTCTAGCCCAATATACC-CAACCCAAAACAACAAAACTGCT-ACCCAAACCACAACTAAAGCATTTACTAG
NC_021105_l-rRNA TGCCAGACTCTAGCCCAATT-GCATTGACCTGGAATAACAAAGCTACTCCCCATACACCAAACTAAAGCATTTACTAG
NC_021408        TGCCAAACTCTAGCCCAACCGACA-CGACCTAGAATAACAAAGCCACTTACCCCACACCCAACTAAAGCATTCACCAG

You need to put the character matrix of one sequence list after that of the other and set some parameters in the MrBayes block.
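The concatenation and the matching charset lines can be sketched as follows. The snippet assumes that both alignments use exactly the same taxon labels (rename them first if, as above, some carry a suffix such as _l-rRNA); the function name and the toy sequences are mine.

```python
def concatenate(partitions):
    """partitions: list of (name, {taxon: aligned sequence}).

    Returns the concatenated matrix and one MrBayes charset line per partition.
    """
    taxa = list(partitions[0][1])
    matrix = {taxon: "" for taxon in taxa}
    charsets, start = [], 1
    for name, seqs in partitions:
        if set(seqs) != set(taxa):
            raise ValueError(f"partition {name} has different taxon labels")
        lengths = {len(seq) for seq in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"partition {name} is not aligned")
        length = lengths.pop()
        for taxon in taxa:
            matrix[taxon] += seqs[taxon]
        charsets.append(f"charset {name} = {start}-{start + length - 1};")
        start += length
    return matrix, charsets


atp6 = {"NC_020605": "ATGA", "NC_020604": "ATGG"}
rrnS = {"NC_020605": "TG-", "NC_020604": "TGC"}
matrix, charsets = concatenate([("atp6", atp6), ("rrnS", rrnS)])
print(matrix)
print(charsets)
```

With real data, a 978-site atp6 alignment followed by the 1625-site ribosomal alignment gives NCHAR = 2603 and the charsets 1-978 and 979-2603.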

For the purpose of the experiment, my NEXUS file is provided, along with the settings:

#NEXUS
BEGIN DATA;
[!Comments are enclosed in square brackets. NTAX = number of taxa you have, NCHAR = total number of sites (both genes together);]
DIMENSIONS NTAX=11 NCHAR=2603;
[!DATATYPE=DNA because the data are still nucleotide sequences; GAP sets the gap symbol;]
FORMAT MISSING=? GAP=- interleave=yes DATATYPE=DNA;
MATRIX
[!Domain=Data property=Coding CodonStart=1;]
NC_020605   ATGACTAACTACCAAACATTAATTAACCTAATCATAGCCCTCTCCTACGCCGTACCGATCTTAGTCGCAGTAGCCTTC
*********
begin mrbayes;
outgroup NC_020604;
charset atp6 = 1-978;
charset rrnS = 979 - 2603;
partition atp6_rrnS = 2: atp6, rrnS;
set partition = atp6_rrnS;
lset applyto=(2) nst=6 rates=invgamma;
prset applyto=(1) aamodelpr=fixed(gtr);
lset applyto=(1) rates=invgamma;
unlink statefreq=(all) revmat=(all) shape=(all) pinvar=(all) tratio=(all);
prset applyto=(all) ratepr=variable;
mcmcp ngen=1000000 burninfrac=0.10;
end;
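A frequent mistake in such a block is a charset that overlaps another one or does not reach NCHAR. The sketch below pulls the charset lines out of the block with a regular expression and checks that they tile 1..NCHAR exactly. It is a convenience check of my own, not something MrBayes provides, and it only understands simple start-end ranges.

```python
import re

CHARSET = re.compile(r"charset\s+(\w+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;", re.IGNORECASE)


def charset_ranges(block):
    """Extract (name, start, end) from simple 'charset name = a-b;' lines."""
    return [(name, int(a), int(b)) for name, a, b in CHARSET.findall(block)]


def covers_exactly(ranges, nchar):
    """True if the ranges tile sites 1..nchar with no gap and no overlap."""
    expected = 1
    for _, start, end in sorted(ranges, key=lambda r: r[1]):
        if start != expected or end < start:
            return False
        expected = end + 1
    return expected == nchar + 1


block = """
charset atp6 = 1-978;
charset rrnS = 979 - 2603;
"""
ranges = charset_ranges(block)
print(ranges)
print(covers_exactly(ranges, 2603))  # True
```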

List of software for building phylogenetic tree

 CLC SEQUENCE VIEWER

Environment: Windows/Linux

Download link: http://www.clcbio.com/products/clc-sequence-viewer/

 DENDROSCOPE

Environment: Java

Download link: http://ab.inf.uni-tuebingen.de/data/software/dendroscope3/download/welcome.html

GAPCODER

Environment: Windows

Download link: http://www.biomedcentral.com/1471-2105/4/6/

GBLOCKS

Environment: Windows/Linux

Download link: http://molevol.cmima.csic.es/castresana/Gblocks.html

JMODELTEST

Environment: Java

MRBAYES

Environment: Windows/Linux

Download link: http://mrbayes.sourceforge.net/download.php

PHYUTILITY

Environment: Java

Download link: https://code.google.com/p/phyutility/downloads/list

PROTTEST

Environment: Java

Download link: http://darwin.uvigo.es/software/prottest.html

SPLITSTREE

Environment: Java

Download link: http://ab.inf.uni-tuebingen.de/data/software/splitstree4/download/welcome.html

T-COFFEE

Environment: Online server/Linux

Download link: http://www.tcoffee.org/Packages/Stable/Latest/