Contributors: Ismat Ghazal, Damilola Adegbite, Omotoyosi Saba, Oreoluwa Akande, Tolani Omoleye and Mayowa
The human leukocyte antigen (HLA) system, the human version of the major histocompatibility complex (MHC) found in many animals, is a gene complex located on the short arm of chromosome 6. These polymorphic genes encode HLA molecules, which are primarily responsible for presenting processed peptide antigens.
HLA molecules have several other roles in the human body. HLA Class I molecules present peptides from inside the cell, and HLA Class II molecules present antigens from outside the cell to T lymphocytes, whilst the HLA region corresponding to MHC Class III encodes components of the complement system.
Clinically, the HLA system is important in hematopoietic stem cell transplantation and is associated with diseases such as cancer, type 1 diabetes, and systemic lupus erythematosus. In cancer, although HLA often plays a protective role, it has been observed to exhibit both pro- and anti-cancer properties.
The HLA loci are among the most genetically variable loci in mammals. This project therefore aims to compare HLA variants in 4 Asian population groups: Dai (CDX), Han (CHB), Southern Han Chinese (CHS), and Vietnamese (KHV).
The results of this analysis suggest possible biological implications of the identified Asian HLA variants, particularly for drug response.
A tab-delimited file containing the ID and population code of each sample was downloaded directly from the complete 1000 Genomes database, along with the binary plink files asia.bim, asia.bed.gz, and asia.fam.
The dataset was downloaded directly from this GitHub repository using the ‘wget’ command in the Linux terminal, and the compressed file “asia.bed.gz” was decompressed using the ‘gunzip’ command.
The complete 1000 Genomes sample dataset is a large catalogue of human genetic variation obtained from 26 populations representing Europe, East and South Asia, West Africa, and the Americas, among others.
Principal Component Analysis (PCA)
This was done to decompose the structure of the data and identify the different populations in it. PCA projects the data onto readable 2D plots that show clearly to what extent the 4 Asian populations in our genome dataset differ or overlap.
The first step was to generate eigenvalues by running the plink command below. Each eigenvalue indicates how much of the spread in the data lies along its corresponding direction.
The chr-set and no-xy parameters were not used during the analysis because our samples are human, which is plink's default setting.
To create a PCA plot, the plink output was downloaded to a PC and then imported into RStudio. After specifying the directory containing the datasets, we loaded eigenvec as pca1. Since eigenvec is split into multiple columns and has no header, this command was used:
Using library("ggplot2") to load ggplot2, we created a preliminary plot of pca1 with the default parameters.
To describe the properties of the 1000 Genomes sample list, a metadata table was created using this command:
The next step was to merge pca1 and metadata using a column common to both datasets.
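The merge step can be illustrated as a key-based join on sample ID. The sketch below is in Python rather than the R used in the project, and it is not the original command. It assumes a headerless, whitespace-separated eigenvec layout (family ID, sample ID, then one column per component) and a metadata table keyed by sample name. The sample IDs, values, and function names are invented for illustration.

```python
def parse_eigenvec(lines):
    """Parse headerless, whitespace-separated eigenvec-style rows.

    Assumed layout: family ID, sample ID, then one column per component.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append({
            "sample": fields[1],
            "pcs": [float(x) for x in fields[2:]],
        })
    return rows


def merge_on_sample(pca_rows, metadata):
    """Inner join: keep only samples present in both tables."""
    merged = []
    for row in pca_rows:
        meta = metadata.get(row["sample"])
        if meta is not None:
            merged.append({**row, **meta})
    return merged


# Hypothetical example data (not taken from the real files).
eigenvec_lines = [
    "FAM1 S001 0.012 -0.034",
    "FAM2 S002 -0.051 0.007",
    "FAM3 S003 0.020 0.015",
]
metadata = {
    "S001": {"population": "CHB"},
    "S002": {"population": "KHV"},
    # S003 has no metadata entry, so the inner join drops it.
}

merged = merge_on_sample(parse_eigenvec(eigenvec_lines), metadata)
for row in merged:
    print(row["sample"], row["population"], row["pcs"][:2])
```

An inner join is used here because it keeps only samples that appear in both tables, which is also the default behaviour of R's merge().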
This was done to highlight the Asian populations in the complete 1000 Genomes list. To generate the final PCA plot, colored by population, we ran the command below:
Multidimensional Scaling (MDS) Analysis
We performed this analysis in a Linux terminal using plink.
We created a pruned set of markers that are not highly correlated, using the whole-genome SNP binary fileset (asia.bed, asia.bim, asia.fam) as the input.
The chosen filtering values remove any SNP that has r-squared > 0.01 with another SNP within a 1000-SNP window; this window is shifted along the chromosome 10 SNPs at a time.
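The windowed pruning rule can be sketched in a few lines of Python. This is a simplified illustration of the idea, not plink's implementation. It assumes genotypes are coded as allele counts (0, 1, 2), uses the squared Pearson correlation as r-squared, and always drops the later SNP of a correlated pair, which is a simplification. The function names and toy genotypes are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    return cov * cov / (var_x * var_y)


def prune(genotypes, window, step, threshold):
    """Greedy sliding-window pruning; returns the indices of the SNPs kept."""
    keep = [True] * len(genotypes)
    start = 0
    while start < len(genotypes):
        end = min(start + window, len(genotypes))
        for i in range(start, end):
            if not keep[i]:
                continue
            for j in range(i + 1, end):
                if keep[j] and r_squared(genotypes[i], genotypes[j]) > threshold:
                    keep[j] = False  # simplification: drop the later SNP
        start += step
    return [i for i, kept in enumerate(keep) if kept]


# Toy data: one row per SNP, one column per individual (hypothetical values).
snps = [
    [0, 1, 2, 0, 1, 2],  # SNP 0
    [0, 1, 2, 0, 1, 2],  # SNP 1: identical to SNP 0, so r-squared is 1
    [1, 0, 1, 1, 0, 1],  # SNP 2: uncorrelated with SNP 0
]
kept = prune(snps, window=3, step=1, threshold=0.01)
print(kept)
```

With the values reported in the text, the call would use window=1000, step=10, and threshold=0.01.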
We then calculated genome-wide identity by descent score (allelic similarity) on the pruned marker list using:
Finally, using the previous .ibs result, we performed population stratification by clustering individuals into homogeneous groups and performing multidimensional scaling analysis.
To place constraints on the clusters, we used the Pairwise Population Concordance (PPC) test in the command
To visualize the MDS analysis, MDS component 1 (C1) was plotted against MDS component 2 (C2) from the strat1.mds file using ggplot2 in RStudio. After setting the working directory and loading ggplot2, we loaded strat1.mds as mdsdata using
Next, we created a metadata table as earlier stated and merged mdsdata with metadata using
Finally, we created a scatterplot color-coded by population codes using the command
Results and Discussion
Metadata provides simplified details on the structure, nature, and context of a dataset. Here, the metadata table was obtained from the complete 1000 Genomes list and shows the attributes of the data organised into: sample name, sex, biosample ID, population code, population name, superpopulation name, superpopulation code, population elastic ID and
Table 1: Metadata
Principal Component Analysis
Principal component analysis (PCA) is one of the most useful tools for population stratification. In this project, we carried out PCA on data from the 4 Asian populations using plink and RStudio.
The plink analysis yielded the eigenvalues and eigenvectors of 20 principal components. All eigenvalues were greater than 1 and thus all fulfilled the Kaiser criterion.
The two eigenvectors with the highest eigenvalues (V3 and V4) were used to make a PCA plot (Figure 1) of the different populations; together they account for approximately 17.9% of the total variation within the populations.
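The two checks described here, the Kaiser criterion and the share of variation explained by the leading components, reduce to simple arithmetic on the eigenvalues. The Python sketch below uses made-up eigenvalues, not the project's results, and its function names are hypothetical. It also assumes the denominator is the sum of the listed eigenvalues; the source does not state how the 17.9% figure was computed.

```python
def proportion_explained(eigenvalues):
    """Share of the summed eigenvalues attributable to each component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def kaiser_retained(eigenvalues):
    """Kaiser criterion: retain components whose eigenvalue exceeds 1."""
    return [value for value in eigenvalues if value > 1.0]


# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [4.0, 2.0, 1.5, 1.5, 1.0]

shares = proportion_explained(eigenvalues)
top_two = shares[0] + shares[1]
retained = kaiser_retained(eigenvalues)

print(f"PC1 + PC2 explain {top_two:.1%} of the summed eigenvalues")
print(f"{len(retained)} of {len(eigenvalues)} components pass the Kaiser criterion")
```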
Figure 1: PCA plot
From the PCA plot, we can see 4 different clusters. The CHB (Han Chinese) and CHS (Southern Han Chinese) populations overlap. Judging by the distance between clusters along the PC1 axis, the CDX (Dai Chinese) population differs more from the CHS and CHB populations than from the KHV (Vietnamese) population. PC1 separates the Han Chinese (CHB and CHS) from the southern populations (CDX and KHV). On PC2, CHB and CHS do not differ.
Multidimensional Scaling (MDS) Analysis
Multidimensional scaling is used to graphically depict the relationships between samples in a multidimensional space. It shows the degree of similarity or difference between samples based on their proximity and gives no information about the variables.
For our analysis, we first created a set of pruned markers (approx. 8700 SNPs) that were not highly correlated. Next, identity-by-descent (IBD) scores were calculated for all pairs of individuals to determine their degree of similarity.
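To make pairwise allelic similarity concrete, the Python sketch below computes a simple identity-by-state (IBS) sharing proportion for every pair of individuals. This is a deliberately simpler stand-in: IBD estimation also relies on population allele frequencies and is not reproduced here. Genotypes are assumed to be allele counts (0, 1, 2) with None for missing calls. The individuals and function names are invented.

```python
def ibs_similarity(g1, g2):
    """Proportion of alleles shared identical-by-state by two individuals.

    Equal counts share 2 alleles, counts differing by one share 1 allele,
    and 0 versus 2 shares none. SNPs with a missing call are skipped.
    """
    shared = 0
    compared = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)
        compared += 1
    if compared == 0:
        raise ValueError("no SNPs with calls in both individuals")
    return shared / (2 * compared)


def pairwise_distances(genotypes):
    """Symmetric matrix of 1 - IBS similarity for all pairs of individuals."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - ibs_similarity(genotypes[i], genotypes[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical genotypes: one list per individual, one entry per SNP.
individuals = {
    "A": [0, 1, 2, 0],
    "B": [0, 1, 2, 1],
    "C": [2, 1, 0, None],
}
names = list(individuals)
distances = pairwise_distances([individuals[name] for name in names])
for name, row in zip(names, distances):
    print(name, [round(d, 3) for d in row])
```

A distance matrix of this kind is the usual input to clustering and to multidimensional scaling.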
The IBD scores were then used to cluster individuals into homogeneous groups and to generate the first 2 MDS components for each individual (C1 and C2). These MDS components represent the position of each individual in the first and second dimensions.
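One common way to turn a pairwise distance matrix into coordinates such as C1 and C2 is classical (Torgerson) MDS. The pure-Python sketch below shows that technique on a made-up distance matrix. Whether plink follows exactly this procedure is an assumption, and the function name, the power-iteration eigen-solver, and the example distances are all illustrative choices.

```python
import math


def classical_mds(dist, dims=2, iterations=500):
    """Classical MDS: embed points so Euclidean distances approximate dist.

    Uses power iteration with deflation, which assumes the leading
    eigenvalues of the centred matrix are positive and distinct.
    """
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
        for _ in range(iterations):
            nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        eigval = sum(vec[i] * sum(b[i][j] * vec[j] for j in range(n))
                     for i in range(n))
        if eigval <= 0.0:
            break  # no further positive-variance dimension to extract
        scale = math.sqrt(eigval)
        for i in range(n):
            coords[i][k] = vec[i] * scale
        # Deflate so the next pass finds the next-largest eigenpair.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * vec[i] * vec[j]
    return coords


# Hypothetical distances between 4 individuals (corners of a 3-by-4 rectangle).
dist = [
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
]
coords = classical_mds(dist)
for c1, c2 in coords:
    print(round(c1, 3), round(c2, 3))
```

The signs and orientation of the resulting axes are arbitrary; only the relative positions of the points carry meaning.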
Figure 2: MDS plot
Plotting C1 against C2 and color-coding by population produced the scatter plot shown in Figure 2. There are 4 clusters, one for each Asian population, in which individuals are closely packed together.
CHB and CHS also overlap substantially in both dimensions, which suggests that the two populations are more similar to each other than to CDX or KHV. Likewise, CDX and KHV appear more closely related to each other, based on the degree of overlap between their clusters.
Collectively, the results from PCA and MDS suggest that the CHS and CHB populations are likely to show similar physiological responses to HLA-associated drugs, as both populations appear to be closely related, while the CDX and KHV populations, which overlap slightly on the MDS plot, are likely to show a distinct response to drugs.