First exam is passed, and now I can show, what’ve been doing during it. The data provided with the knitr file, so the research is reproducible.
PCA for population analysis
Okay, today we use 5 textfiles, containing information about different alleles sequences to see how they differ from one population to another. We will load all the data sets, make one data frame for further interactions. Than we will convert this data to genind object to save some time and just get directly to the Principal Component Analysis.
We begin with loading datasets and required library. Than we add information about from which population, genotypes part were aquired. EastAsians already have a column “population”, so we just skip it.
## Warning: package 'adegenet' was built under R version 2.15.3
## Loading required package: ade4
## Warning: package 'ade4' was built under R version 2.15.3
## ========================== ## adegenet 1.3-9.2 is loaded ## ========================== ## ## - to start, type '?adegenet' ## - to browse adegenet website, type 'adegenetWeb()' ## - to post questions/comments: email@example.com
setwd("C:/R_PCA") # start reading the data ASW <- read.delim("ASW.txt") CEU <- read.delim("CEU.txt") CLM <- read.delim("CLM.txt") EastAsians <- read.delim("EastAsians.txt") YRI <- read.delim("YRI.txt") # Name populations ASW$population = "ASW" CEU$population = "CEU" CLM$population = "CLM" YRI$population = "YRI"
Than, we merge these datasets into one dataframe by using rbind function (which adds data by row)
# Start merging the population test <- rbind(ASW, CEU) test <- rbind(test, CLM) test <- rbind(test, EastAsians) test <- rbind(test, YRI) # Showing first 6 columns of the reslting dataset head(test[, 1:6])
## ind rs35744813 rs4920552 rs3935570 rs12137730 rs7534618 ## 1 NA19625 T/T T/T G/G A/A G/G ## 2 NA19700 T/T T/T G/G A/A G/G ## 3 NA19701 T/T C/C G/G C/C T/T ## 4 NA19703 T/T T/T G/G A/A G/G ## 5 NA19704 T/T T/T T/T A/A T/T ## 6 NA19707 T/T C/C T/T A/A G/G
Now we have resulting dataset. Rows are individuals, for which we have the information about the allels. Allels presented in the dataframe itself.
In the next chunk we convert data to genind format for easy data handling. We are replacing the missing the values with calculated mean, because otherwise we won’t be able to run the PCA.
# Converting the data to genind format data4PCA = df2genind(test, sep = "/", pop = test$population, missing = "mean", ploidy = 2, type = c("codom", "PA")) data.x = scaleGen(data4PCA) data_prepared = scaleGen(data4PCA) # Scaling the data, before perforing the PCA data_prepared = scaleGen(data4PCA)
We have also scaled the data to prepare it for the PCA.
# PCA itself data_pca = dudi.pca(data_prepared, cent = F, scale = F, scannf = FALSE, nf = 4)
Now we want to know, how many percents of variance are explained by each variable. They are already sorted, so all we have to do is to calculate it and then check that we have got 100% of sum.
eig = data.frame(data_pca$eig) eig$percentage = (eig[, 1]/sum(eig$data_pca.eig)) View(eig) # Check that we've got percentage correctly sum(eig$percentage)
##  1
# How much % of variance do first to Principal Components explain? sum(eig$percentage[1:2])
##  0.1425
Okay, check how much entrys we have per each population
# Now get the number of samples for each population with table function table(test$population)
## ## ASW CEU CHB CHS CLM YRI ## 61 85 97 100 60 88
# Check that sum is 491 sum(table(test$population))
##  491
And now we plot the results for first two principal components:
# Plotting the data plot(data_pca$li[, 1:2], type = "pch", pch = c(rep(21, 61), rep(22, 85), rep(23, 97), rep(24, 100), rep(25, 60)), bg = c(rep("orange", 61), rep("violet", 85), rep("red", 97), rep("blue", 100), rep("green", 60), rep("yellow", 88)))
## Warning: ??? ??????? 'pch' ????? ??????? ?? ??????? ?????
# Plotting the legend legend(7, 6, legend = c("ASW", "CEU", "CHB", "CHS", "CLM", "YRI"), col = c("orange", "violet", "red", "blue", "green", "yellow"), pch = 16)
This plot shows us, that people from European population (CEU) have a lot of common with people chinese populations (CHB,CHS). Though they are not really far from the african genotype (YRI). CLM is the population from the latin america, so, the plot is only approving the migration theory, showing how differences in allels were spread all other the world.
This is the image from wikipedia, showing the spread of human population: