Statistics and Data Mining

Various statistical approaches to the problem of associating multiple genomic loci with quantitative traits (e.g., diseases) have shown interesting, if sometimes inconclusive, results. Examples include U-statistics, stepwise algorithms and multifactor dimensionality reduction. Some researchers have suggested that a single best strategy for multi-locus disease association modeling might not exist. Different strategies can perform better under different conditions, depending on parameters such as allele frequencies or locus interaction models (e.g., additive vs. multiplicative effects). Moreover, these same parameters can affect the ability to replicate the association of interacting loci across different studies.
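To make the distinction between additive and multiplicative locus interaction models concrete, the following minimal Python sketch computes disease risk for two biallelic loci under each model. Genotypes are coded as the number of risk alleles (0, 1 or 2). The function names, the baseline risk and the effect sizes are hypothetical values chosen only for illustration; they are not taken from any study or from the software described here.

```python
# Illustrative two-locus risk models (all names and numbers are assumptions).
# g1, g2: number of risk alleles carried at locus 1 and locus 2 (0, 1 or 2).

def additive_risk(g1, g2, baseline=0.05, effect1=0.03, effect2=0.03):
    """Each risk allele adds a fixed amount to the baseline risk."""
    return baseline + effect1 * g1 + effect2 * g2


def multiplicative_risk(g1, g2, baseline=0.05, rr1=1.5, rr2=1.5):
    """Each risk allele multiplies the baseline risk by a fixed relative risk."""
    return baseline * (rr1 ** g1) * (rr2 ** g2)


def penetrance_table(model):
    """Return a 3x3 table of risks, indexed as table[g1][g2]."""
    return [[model(g1, g2) for g2 in range(3)] for g1 in range(3)]


if __name__ == "__main__":
    for name, model in [("additive", additive_risk),
                        ("multiplicative", multiplicative_risk)]:
        print(name)
        for row in penetrance_table(model):
            print("  ", ["%.4f" % risk for risk in row])
```

Under the additive model, the risk difference between adjacent genotypes at one locus is the same regardless of the genotype at the other locus. Under the multiplicative model, it is the risk ratio that stays constant, so the absolute effect of one locus grows with the number of risk alleles at the other. A method tuned to detect one pattern may have less power to detect the other.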

A number of data mining approaches, such as dimensionality reduction, neural networks, random forests and support vector machines (SVMs), have also been proposed for multi-locus association with traits. Data mining approaches are well suited to identifying trends in high-dimensional data sets. Some, however, can be computationally intensive (e.g., SVMs, neural networks) and some may be hard to interpret (e.g., neural networks).

Many statistical and data mining approaches can be applied when analyzing the association between genomic loci and disease, and new approaches are being developed. Previous studies suggest that there might be no single approach that is best in all settings. Different approaches may work better with different data sets and with different diseases. We developed Pathway/SNP, a software tool that allows an exploratory approach to integrative association analysis, combining the use of biological pathway information and different statistical and data mining algorithms.