Genome-wide association studies (GWAS) have successfully identified thousands of genetic loci associated with a wide variety of human phenotypic traits. In this thesis, we evaluate the bias, precision and power of three statistical techniques employed in GWAS.
In Chapter 2, we assess bias and power for adjusted-trait regression (ATR). ATR is a modification of the traditional ordinary least-squares estimation and F-test hypothesis testing techniques for quantitative trait multiple linear regression models. ATR involves performing bivariate correlation analysis between a genetic variant (or set of genetic variants) and a covariate-adjusted trait, obtained by regressing the trait on covariates. We show that ATR effect size estimates for single variant analysis are biased towards the null by a factor equal to the coefficient of determination obtained from regressing the genetic variant on the covariates. We derive the exact distributions of ATR test statistics and show that ATR is less powerful than traditional methods when the genetic variants are correlated with covariates. The loss of power increases as the stringency of Type 1 error control increases. The maximum possible power loss for the ATR multi-variant test is completely characterized by the canonical correlation between genetic variants and covariates. We show that, for typical covariates like genetic principal components, the loss of power will likely be low in practice.
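The attenuation described for single-variant ATR can be checked numerically. The sketch below is an illustration only, not the analysis from the thesis: it assumes a simple linear model with one covariate, uses a continuous stand-in for the genotype, and the function name `simulate_atr` and all parameter values are hypothetical. In this setting the in-sample ATR slope equals the full-model OLS slope multiplied by (1 - R²), where R² comes from regressing the variant on the covariates.

```python
import numpy as np

def simulate_atr(n=20000, beta=0.5, rho=0.6, seed=0):
    """Return (full-model slope, ATR slope, R^2 of variant on covariates).

    Assumed toy model: y = beta * g + z + noise, with corr(g, z) = rho.
    g is a continuous stand-in for a genotype, chosen for simplicity.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)                                   # covariate
    g = rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(n)   # variant
    y = beta * g + z + rng.standard_normal(n)                    # trait

    ones = np.ones(n)

    # Traditional approach: regress the trait on variant and covariate jointly.
    X = np.column_stack([ones, g, z])
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0][1]

    # ATR step 1: adjust the trait by regressing it on the covariates only.
    Z = np.column_stack([ones, z])
    y_adj = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]

    # ATR step 2: regress the adjusted trait on the unadjusted variant.
    G = np.column_stack([ones, g])
    beta_atr = np.linalg.lstsq(G, y_adj, rcond=None)[0][1]

    # Coefficient of determination from regressing the variant on covariates.
    g_res = g - Z @ np.linalg.lstsq(Z, g, rcond=None)[0]
    r2 = 1 - g_res.var() / g.var()

    return beta_full, beta_atr, r2

if __name__ == "__main__":
    b_full, b_atr, r2 = simulate_atr()
    print(f"full-model slope: {b_full:.4f}")
    print(f"ATR slope:        {b_atr:.4f}")
    print(f"(1 - R^2) * full: {(1 - r2) * b_full:.4f}")
```

With a covariate that is nearly uncorrelated with the variant (small `rho`), R² is close to zero and the two slopes nearly coincide, which is consistent with the expectation of a small power loss for covariates such as genetic principal components.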
In Chapter 3, we assess three genetic imputation quality scores (allelic-RSQ, MACH-RSQ and INFO) as predictors for realized imputation quality (squared correlation between true genotypes and imputed dosages) for low-frequency and rare variants. We assess the impact of using different imputation algorithms (Beagle 4.2, minimac3 and IMPUTE 2) and reference panels (1000 Genomes [1KG] and Haplotype Reference Consortium [HRC]) on the relationship between imputation quality scores and realized quality.
We imputed genotypes into 8,378 participants using each imputation algorithm with the 1KG panel and minimac3 with the HRC panel. We show that MACH-RSQ and INFO are identical when calculated on the same data. We observe that allelic-RSQ predicts realized quality less well than MACH-RSQ/INFO for low-frequency and rare variants. Realized quality decreases as minor allele frequency (MAF) decreases. The mean absolute difference (MAD) between quality scores and realized quality increases as MAF decreases. Imputation with HRC resulted in better realized quality for low-frequency and rare variants compared to imputation with 1KG. However, the MAD between quality scores and realized quality for low-frequency and rare variants was similar for both panels.
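The quantities compared in Chapter 3 can be sketched in a few lines. The code below is a hedged illustration under stated assumptions, not the thesis's pipeline: `mach_rsq` uses one commonly quoted dosage-based form of MACH-RSQ (observed dosage variance divided by the binomial variance 2p(1 - p) implied by the estimated allele frequency), and the function names are hypothetical. Imputation software may compute its score from haplotype-level quantities instead.

```python
import numpy as np

def mach_rsq(dosage):
    """Dosage-based quality score: observed variance over expected variance.

    Assumption: the expected variance is 2p(1-p), with p estimated as
    half the mean dosage.
    """
    dosage = np.asarray(dosage, dtype=float)
    p = dosage.mean() / 2.0
    expected_var = 2.0 * p * (1.0 - p)
    if expected_var == 0.0:
        return float("nan")  # monomorphic variant: score is undefined
    return dosage.var() / expected_var

def realized_r2(true_genotype, dosage):
    """Squared correlation between true genotypes and imputed dosages."""
    g = np.asarray(true_genotype, dtype=float)
    d = np.asarray(dosage, dtype=float)
    if g.std() == 0.0 or d.std() == 0.0:
        return float("nan")  # correlation is undefined without variation
    return np.corrcoef(g, d)[0, 1] ** 2

def mean_abs_diff(scores, realized):
    """Mean absolute difference between quality scores and realized quality,
    skipping variants where either value is undefined."""
    s = np.asarray(scores, dtype=float)
    r = np.asarray(realized, dtype=float)
    keep = ~(np.isnan(s) | np.isnan(r))
    return float(np.mean(np.abs(s[keep] - r[keep])))
```

In an evaluation of this kind, `mean_abs_diff` would typically be computed separately within MAF bins so that the trend with decreasing MAF is visible.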
In Chapter 4, we assess the efficiency gained or lost by adding an external sample with missing case-control status to an (internal) case-control study sample. We propose a method for estimation and testing that accounts for the known (or presumed) proportion of cases in the external sample. Misspecification of the external sample case proportion leads to biased estimation; in particular, treating the external sample as a control sample leads to underestimation of the effect size. However, the proposed test controls Type 1 error regardless of the particular value chosen for the presumptive external sample case proportion. When treating the external participants as controls, addition of external participants improves power if the proportion of cases in the internal sample is at least twice that in the external sample.
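The direction of the bias from treating external participants as controls can be seen with simple arithmetic on allele frequencies. The sketch below is an illustration on the allele-frequency scale only, not the estimator proposed in the thesis; the function names and the example frequencies are hypothetical. If a fraction of the external sample consists of unrecognized cases, its allele frequency is a mixture of the case and control frequencies, so the apparent case-versus-external difference shrinks by that fraction.

```python
def external_allele_freq(p_case, p_control, case_proportion):
    """Allele frequency in an external sample that mixes cases and controls."""
    return case_proportion * p_case + (1.0 - case_proportion) * p_control

def apparent_difference(p_case, p_control, case_proportion):
    """Case-minus-external frequency difference when the external sample
    is treated as if it contained only controls.

    Algebraically this equals (1 - case_proportion) * (p_case - p_control).
    """
    return p_case - external_allele_freq(p_case, p_control, case_proportion)

if __name__ == "__main__":
    true_diff = 0.30 - 0.20
    for prop in (0.0, 0.1, 0.3):
        diff = apparent_difference(0.30, 0.20, prop)
        print(f"external case proportion {prop:.1f}: "
              f"apparent difference {diff:.3f} (true {true_diff:.3f})")
```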
PHD
Biostatistics
University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/163049/1/pyajnik_1.pdf