Description:
Genome-wide association studies (GWAS) have become a popular method for the
discovery of genetic variants associated with complex diseases or traits. As the size and
scope of these studies increase in order to obtain higher power for determining significant
associations, careful consideration of population structure becomes paramount. If individuals
in a study come from different ethnic or ancestral backgrounds, variation in allele
frequencies and disproportionate ancestry representation in cases and controls can lead
to inflated Type I error rates. Over the years, several methods for controlling population
stratification have been introduced, many of which rely on the use of multivariate dimension
reduction methods. An important step in addressing population stratification is determining which
loci exhibit evidence of allele frequency differences between populations. We introduce a method
based on Hardy-Weinberg disequilibrium to find substructure-informative markers, coupled
with nonmetric multidimensional scaling (NMDS) to visualize population
structure in a sample. We extend the use of NMDS in conjunction with nonparametric
clustering to develop a test for association that corrects for population stratification. We
show that NMDS is a preferable visualization technique for detecting multiple levels of
relatedness within a set of individuals and that the resulting stratification-corrected test is
more powerful under realistic scenarios. Recent research has shown that technical bias
due to differential genotyping errors between cases and controls can also inflate the Type I
error rate and may be an even more severe source of bias in GWAS. Current genotype calling
algorithms process samples in batches because of computational constraints and because
differences in DNA collection, lab preparation, and sample heterogeneity
can skew genotype calls. This thesis also addresses possible bias caused
by differential genotyping due to batch size and composition effects for the widely used
BRLMM algorithm recommended for the Affymetrix GeneChip Human Mapping 500K array
set. Samples obtained from the Wellcome Trust Case Control Consortium are used
to assess how genotype calls differ across calling batches.
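
As a rough illustration of why Hardy-Weinberg disequilibrium can flag substructure-informative markers, the sketch below computes the textbook disequilibrium coefficient D = P(AA) - p^2 for a biallelic locus from genotype counts. This is not the procedure developed in the thesis; the function name `hwd_coefficient` and the example counts are hypothetical and chosen only for illustration.

```python
def hwd_coefficient(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Textbook Hardy-Weinberg disequilibrium coefficient D = P(AA) - p^2.

    Hypothetical helper for illustration only, not the thesis's method.
    n_aa, n_ab, n_bb are counts of the AA, AB, and BB genotypes.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped individual is required")
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    return n_aa / n - p * p           # observed minus expected AA frequency


# Two hypothetical subpopulations, each exactly in Hardy-Weinberg
# equilibrium but with different allele frequencies (0.9 and 0.1).
pop1 = (81, 18, 1)
pop2 = (1, 18, 81)

# Pooling them as if they were one population creates a heterozygote deficit.
pooled = tuple(a + b for a, b in zip(pop1, pop2))

d_pop1 = hwd_coefficient(*pop1)
d_pop2 = hwd_coefficient(*pop2)
d_pooled = hwd_coefficient(*pooled)
```

Each subpopulation on its own has D equal to zero (up to floating-point rounding), while the pooled sample has a clearly positive D. Under these assumptions, a locus whose allele frequencies differ between hidden subgroups departs from Hardy-Weinberg proportions in the combined sample, which is the kind of signal a disequilibrium-based marker screen can pick up.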