See also introduction for links to progress reports and the final report.
The search for the genes underlying susceptibility loci in complex disease has been fruitful ground for scientific advancement for nearly half a century. Over the past decade, advances in genotyping technology have allowed genetic epidemiologists to shift their focus from sparse restriction fragment length polymorphisms and linkage studies to genome-wide association studies involving from 10,000 to over a million single nucleotide polymorphisms. Unfortunately, the availability of new data has not been matched by methodological advances in statistics. Epistatic interactions, in particular those with little or no marginal statistical effect, remain difficult to detect. In spite of heroic efforts, multiple testing and the curse of dimensionality continue to plague the field. We propose a novel method to identify multi-locus association in genes and demonstrate its use on a cohort of systemic lupus erythematosus samples.
Problem Statement
The multi-locus genotype-phenotype association problem is to detect association between sets of genetic loci and a phenotype. Given as input a set of genotypes for affected cases and controls, parent-offspring trios, or other family structures, produce a set of associated genetic markers and the associated probability of their membership in this set.
Motivation
Everything in life is determined by a combination of two complementary influences: genetics and environment. These co-conspirators work together to determine such diverse traits as eye color, intelligence, flowering times, and degree of pathogen virulence. While systematic observation and characterization of environment remains a challenge in most circumstances, broad-scale evaluation of genetics is becoming increasingly feasible. Of particular interest to researchers are genotyping microarray chips, which can determine the state of hundreds of thousands of single nucleotide polymorphisms (SNPs) in a single experiment. SNPs may be selected from genome-wide panels or specifically chosen from candidate genes. The blessing of this mountain of easy-to-obtain data is met with a set of unique challenges and opportunities unavailable just a few years ago.
The most obvious question to ask of a new type or scale of data concerns its utility: how can one establish novel associations between genetic variants and phenotypes? For large-scale genotyping experiments, the answer thus far has taken two forms. The simple answer is to evaluate markers, or groups thereof representing specific genomic regions, by testing single SNPs or computationally inferred haplotypes. The literature is full of such approaches, which I will not attempt to recount. Broadly speaking, they have two attributes of present interest. First, they are simple to calculate. The most basic test statistic is computed from a two-by-two contingency table defined by a phenotype quality and allele counts for a single locus. Second, the overall algorithmic complexity is roughly linear with respect to the total number of regions evaluated. Though constructing a haplotype or evaluating a single locus may require polynomial or exponential time, only one genomic region is considered at a time, so adding new regions causes only an approximately linear increase in running time.
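As a rough illustration of the single-locus test described above, the following sketch computes a Pearson chi-square statistic on a two-by-two table of allele counts by affection status. The function name and the counts are hypothetical, chosen only for illustration; the sketch omits the continuity correction and assumes every row and column total is nonzero.

```python
import math

def allelic_chi_square(case_a, case_b, control_a, control_b):
    """Pearson chi-square (1 df, no continuity correction) on a 2x2 allele-count table.

    Assumes every row and column total is nonzero.
    """
    table = [[case_a, case_b], [control_a, control_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + control_a, case_b + control_b]
    total = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected

    # For one degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical allele counts among case and control chromosomes.
stat, p = allelic_chi_square(case_a=60, case_b=40, control_a=40, control_b=60)
print(f"chi-square = {stat:.3f}, p = {p:.5f}")
```

Because each locus is tested independently of the others, running this over every marker scales linearly with the number of markers, which is the property the paragraph above relies on.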
Unfortunately, the simple one-gene model does not reflect the underlying biological problem for a large number of interesting traits. Phenotypes with complex genetic underpinnings may be the result of a large number of variations working separately or in epistasis. In other words, many genetic effects may have incomplete penetrance or rely on the action of variants in other parts of the genome. The latter case is especially interesting, since purely epistatic events are undetectable by the marginal effects at a single locus.
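To make the notion of a purely epistatic effect concrete, here is a small worked example under stated assumptions: two unlinked biallelic loci, each with allele frequency 0.5 and genotypes in Hardy-Weinberg proportions, and an illustrative penetrance table of the kind often used in simulation studies. None of these numbers comes from the SLE data. The joint genotype clearly changes disease risk, yet the marginal penetrance at either locus is identical for every genotype.

```python
# Genotype frequencies (0, 1, or 2 copies of the minor allele) at each locus,
# assuming Hardy-Weinberg proportions with allele frequency 0.5.
GENOTYPE_FREQ = [0.25, 0.5, 0.25]

# PENETRANCE[a][b] = P(affected | genotype a at locus A, genotype b at locus B).
# Illustrative values only.
PENETRANCE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

def marginal_penetrance_a(penetrance, freq_b):
    """P(affected | genotype at A), averaging over genotypes at B."""
    return [sum(p * f for p, f in zip(row, freq_b)) for row in penetrance]

def marginal_penetrance_b(penetrance, freq_a):
    """P(affected | genotype at B), averaging over genotypes at A."""
    return [
        sum(penetrance[a][b] * freq_a[a] for a in range(3))
        for b in range(3)
    ]

marg_a = marginal_penetrance_a(PENETRANCE, GENOTYPE_FREQ)
marg_b = marginal_penetrance_b(PENETRANCE, GENOTYPE_FREQ)
print("marginal penetrance by genotype at A:", marg_a)
print("marginal penetrance by genotype at B:", marg_b)
```

Every marginal value comes out to 0.05 (up to floating-point rounding), so a single-locus test has nothing to detect even though the two-locus table ranges from 0 to 0.1.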
Detecting association of genes acting in epistasis with a phenotype can be difficult. Of particular interest are interactions between genetic loci which show no marginal statistical effect when considered in isolation. The straightforward solution of evaluating all possible pairs, triplets, or n-tuples of markers is fraught with problems, two of which make it unusable. First, the number of candidate n-tuples grows combinatorially with the number of markers, so an exhaustive search quickly becomes computationally intractable. Second, the type-I error rate of such an exhaustive search may be explosive, and standard corrections such as Bonferroni may be overly conservative.
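A back-of-the-envelope calculation shows the scale of both problems. The sketch below counts the tests in an exhaustive search of a given order and the corresponding Bonferroni-corrected significance threshold. The marker count of 317,000 is the approximate SNP panel size quoted in the Data section; the significance level of 0.05 and the function names are assumptions made for illustration.

```python
import math

def exhaustive_tests(n_markers, order):
    """Number of distinct marker sets of the given size (n choose order)."""
    return math.comb(n_markers, order)

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

N_SNPS = 317_000  # approximate panel size from the Data section
ALPHA = 0.05      # assumed family-wise significance level

for order in (1, 2, 3):
    n_tests = exhaustive_tests(N_SNPS, order)
    threshold = bonferroni_threshold(ALPHA, n_tests)
    print(f"order {order}: {n_tests:.3e} tests, per-test threshold {threshold:.3e}")
```

Moving from single markers to pairs multiplies the number of tests by roughly 158,000, and the per-test threshold shrinks by the same factor, which is why reducing the search space matters for both running time and power.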
Hypothesis
In order to make efficient use of materials, computational time, and statistical significance, it is clear that the size of the search space and/or the dimensionality of the problem must be reduced. I hypothesize that a reduction based on interactions in known or suspected pathways may provide an effective means to reduce the number of interactions which may be considered. In this reduced search space, machine learning techniques may be used to model interactions between genes and phenotype or define gene-based classification criteria.
Methods
There are two parts to the method I would like to implement. The first and part of the second overlap with the scope of my COMP 572 project, while the rest of the second would form a marked improvement and would be the basis of judgment in this course.
- Construct a network of genes likely to be associated with a phenotype of interest.
  - Identify kernel set of genes known to be associated by marginal effects.
  - Assemble network of genes from known interactions with kernel set using data from GeneNetwork (Franke 2006).
- Search for epistatic interactions between kernel genes and remaining network members for which data is available.
  - Contingency table (the extent of the scope of my COMP 572 project)
  - Model the phenotype using genotypes and either affection status or quantitative and qualitative clinical features and construct a Bayesian network. There are a number of ways genes could be represented as nodes. These must be explored, though it is certainly possible to fall back on a known method such as that used by Wang (2007).
  - Attempt to learn to distinguish the class of subjects with the phenotype from those without using SVM.
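The network-construction step and the resulting reduced search space could be sketched as follows. This is a minimal illustration, not the eventual implementation: it assumes the interaction data can be reduced to a list of undirected gene pairs, expands the kernel set by one step only, and uses placeholder gene identifiers (`K1`, `G1`, and so on) rather than real genes. The function names are hypothetical.

```python
def expand_network(kernel, interactions):
    """Return the kernel genes plus every gene directly linked to a kernel gene."""
    network = set(kernel)
    for gene_a, gene_b in interactions:
        if gene_a in kernel:
            network.add(gene_b)
        if gene_b in kernel:
            network.add(gene_a)
    return network

def candidate_pairs(kernel, network, genotyped):
    """Pairs (kernel gene, non-kernel network gene) where both genes have genotype data."""
    pairs = []
    for kernel_gene in sorted(kernel):
        if kernel_gene not in genotyped:
            continue
        for other in sorted(network - set(kernel)):
            if other in genotyped:
                pairs.append((kernel_gene, other))
    return pairs

# Placeholder data: K* are kernel genes, G* are other genes in the interaction list.
kernel = {"K1", "K2"}
interactions = [("K1", "G1"), ("G1", "G2"), ("K2", "G3"), ("K1", "K2")]
genotyped = {"K1", "K2", "G1", "G2"}

network = expand_network(kernel, interactions)
pairs = candidate_pairs(kernel, network, genotyped)
print(sorted(network))
print(pairs)
```

Each resulting pair would then be passed to whichever interaction test is in use (contingency table, Bayesian network, or SVM). The number of tests is bounded by the kernel size times the network size instead of the number of all marker pairs.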
Data
It is very likely that I will have access to the data from the recent Nature Genetics article announcing the discovery of 12 genes newly associated with systemic lupus erythematosus (SLE) (SLEGEN 2008). The consortium reported on 2,566 female cases of European descent and 4,162 matched controls. While they are unlikely to have significantly more male samples, it is very possible that data will be available on many more participants of other races. In the best case, I would expect about 4,000 cases. If I am only able to obtain data from the SLEGEN members with whom I am personally acquainted, this number may drop to 2,000-3,000.
For each provided sample, I will have access to SLE affection status, race, ethnicity and genotypes from about 317,000 SNPs. Data provided by the OMRF may be much richer, containing genotypes from as many as 500,000 SNPs, extended pedigrees, information about clinical features, and possibly expression microarray results. This reduced data set has been the source of dozens of publications in top tier journals (see my publications for a sample). It is also possible that other OMRF scientists who have conducted their own candidate gene and fine-mapping studies may contribute genotypes from the same collection.
Expected Results
I expect to have two major findings.
- Genes with known marginal effects will be found to be in epistasis with other genes known to be important to SLE etiology
- Candidate genes selected for the SLEGEN study with no marginal effects will be found to be in epistasis with genes known to be associated with SLE
I base these assertions on prevailing views that genetic interaction is important, as well as three anecdotes.
- Gray-McGuire identified linkage and epistasis at 4p16-15.2 (2000), though localization of the effect driving the linkage has eluded researchers looking for single-marker effects (Sestak 2005).
- Modeling linkage at FcGRIIa and other linked loci by separating subjects into liability classes based on genotype state at other unlinked loci caused a LOD (maximum likelihood log of odds) score increase from ~4 to >8 (Kilpatrick 2004).
- Informal exploration of GeneNetwork by OMRF scientists yielded several suggested interactions by presence of association effects in marginally significant loci (Harley 2008).
Related Work
The search for epistatic interactions has become quite popular of late. Theoretical work has been conducted by Wang (2007), Millstein (2006), Marchini (2005), and Sun (2005), among many others. Three machine learning approaches, including the immensely popular MDR, were compared very recently by Heidema (2007). Interest in multi-locus effects did not begin with the advent of genotyping microarrays, however. Zaykin (1995) and MacLean (1993) reported on multi-locus association and linkage, respectively.
Timetable
I have constructed a detailed timetable for the project. I have left significant time for implementing each of the machine learning methods and for interpreting and communicating results. This slack also allows for the contingency of slow collaborators: should the SLE data be delayed, I will temporarily replace it with a public or familiar private data set for development purposes.
| Milestone | Projected Date | Completed | Activity |
|---|---|---|---|
| 1 | 18. Jan, 2008 | 18. Jan, 2008 | Complete with an important note. The full GeneNetwork is not yet available for download, though an extensive interaction network (the true-positive interaction graph) can be found. The author assures me the full data set will be released soon. Either way, the method should work well. |
| 2 | 29. Jan, 2008 | 26. Jan, 2008 | Emailed JBH on 23. Jan. Received responses from data keepers whose work is most likely to be threatened by my possibly duplicated effort. OMRF researchers are eager to collaborate. Others did not voice a preference and wish to speak to me before passing judgment. Received email announcing availability of GAW data. This seems like a viable alternative if we can get the Rice IRB (or similar appropriate body) to cooperate quickly. There are three data sets: case/control data for 550,000 SNPs from the NARAC rheumatoid arthritis study; cohort and family data from the Framingham Heart Study including cardiovascular risk factors and genotypes for 550,000 SNPs; and a simulated data set that mimics the Framingham study. According to their dbGaP entry, Framingham has about 14,000 participants. NARAC has only 1,000 sibling pairs. Found and started integrating Ruby graph library RGL, whose design is influenced by the Boost Graph Library. This should work well for the present purposes. |
| 3 | 15. Feb, 2008 | | As of February 13th, I'm still waiting for data. The Rice IRB got involved, which is a bad thing for my purposes if it's anything like the IRBs I've had experience with. In the meantime, I'm trying to take care of mundane details and evaluating an alternative. Specifically, I'm evaluating Ruby-to-R bridge solutions and importing other data to refresh my memory on how it works. If data do not come through, I'm prepared to make use of a haplotype generation tool. If nothing happens soon, I'll generate simulated data so I can at least start implementing SVMs and Bayesian networks. |
| 4 | 22. Feb, 2008 | | |
| 5 | 7. March, 2008 | | |
| 6 | 21. March, 2008 | | |
| 7 | 17. April, 2008 | | |
| 8 | 1. May, 2008 | | |
References
- Franke L, van Bakel H, Fokkens K, de Jong ED, Egmont-Petersen M, Wijmenga C. Reconstruction of a Functional Human Gene Network, with an Application for Prioritizing Positional Candidate Genes. Am J Hum Genet 2006 Jun 78: 1011-25.
- Gray-McGuire C, Moser KL, Gaffney PM, Kelly K, Yu H, Olson JM, Jedrey CM, Jacobs KB, Kimberly RP, Neas BR, Rich SS, Behrens TW, Harley JB. Genome scan of human systemic lupus erythematosus by regression modeling: evidence of linkage and epistasis at 4p16-15.2. Am J Hum Genet 2000 Dec 67(6): 1460-9.
- Harley JB and Kaufman KM. Personal communication. 2008.
- Heidema AG, Feskens EJ, Doevendans PA, Ruven HJ, van Houwelingen HC, Mariman EC, Boer JM. Analysis of multiple SNPs in genetic association studies: comparison of three multi-locus methods to prioritize and select SNPs. Genet Epidemiol 2007 Dec 31(8): 910-21.
- Kilpatrick JR and Harley JB. Analysis for NIH R01 grant application. 2004.
- MacLean CJ, Sham PC, Kendler KS. Joint linkage of multiple loci for a complex disorder. Am J Hum Genet 1993 Aug 53(2): 353-66.
- Marchini J, Donnelly P, Cardon LR. Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 2005 Apr 37(4): 413-7.
- Millstein J, Conti DV, Gilliland FD, Gauderman WJ. A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet 2006 78: 15-27.
- The International Consortium for Systemic Lupus Erythematosus Genetics (SLEGEN), Harley JB, Alarcon-Riquelme ME, Criswell LA, Jacob CO, Kimberly RP, Moser KL, Tsao BP, Vyse TJ, Langefeld CD. Genome-wide association scan in women with systemic lupus erythematosus identifies susceptibility variants in ITGAM, PXK, KIAA1542 and other loci. Nat Genet 2008 Jan. Epub ahead of print.
- Sestak A. Personal communication. 2005.
- Sun X, Zhang Z, Zhang Y, Zhang X, Li Y. Multi-locus penetrance variance analysis method for association study in complex diseases. Hum Hered 2005 60(3): 143-9.
- Wang K, Li M, Bucan M. Pathway-Based Approaches for Analysis of Genomewide Association Studies. Am J Hum Genet 2007 Oct 81(6): 1278-83.
- Zaykin D, Zhivotovsky L, Weir BS. Exact tests for association between alleles at arbitrary numbers of loci. Genetica 1995 96(1-2): 169-78.
