Diverse Sources of Biological Domain Information

Database Integration and Modeling

Some of our recent research has focused on applications and tools that enable pathway-focused disease association analysis of high-density SNP genotype data sets. The pathways used in this analysis generally consist of sets of protein-coding genes that participate, through a series of interactions, in various biological processes. For the purpose of analyzing high-density genotype data, however, the concept of a pathway could be expanded into the more generic concept of a “set of biologically related genomic loci” or “network of genomic loci,” in which loci are related through biological domain-knowledge links such as RNA interference (RNAi), regulatory networks, or protein-protein interactions.
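
The generalized "set of biologically related genomic loci" can be pictured as a small data structure plus a lookup that assigns genotyped SNPs to a set. The following is a minimal sketch under stated assumptions, not the tooling described above: the names Locus, LocusSet, and snps_in_set, the flank parameter, and all coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Locus:
    """A genomic interval; it need not be a protein-coding gene."""
    name: str
    chrom: str
    start: int
    end: int


@dataclass(frozen=True)
class LocusSet:
    """A set of loci related by one kind of domain-knowledge link."""
    name: str
    link_type: str  # e.g. "pathway", "rnai", "regulatory", "ppi"
    loci: Tuple[Locus, ...]


def snps_in_set(snps: List[Tuple[str, str, int]],
                locus_set: LocusSet,
                flank: int = 0) -> List[str]:
    """Return IDs of SNPs that fall inside any locus of the set.

    Each SNP is (snp_id, chrom, position). `flank` widens every locus
    by that many base pairs on both sides (an illustrative choice).
    """
    hits = []
    for snp_id, chrom, pos in snps:
        for locus in locus_set.loci:
            same_chrom = chrom == locus.chrom
            if same_chrom and locus.start - flank <= pos <= locus.end + flank:
                hits.append(snp_id)
                break  # count each SNP once per set
    return hits


# Made-up loci and SNPs, for illustration only.
example_set = LocusSet(
    name="example_network",
    link_type="regulatory",
    loci=(Locus("locusA", "1", 100, 200), Locus("locusB", "2", 300, 400)),
)
example_snps = [("snp1", "1", 150), ("snp2", "1", 5000),
                ("snp3", "2", 320), ("snp4", "2", 1200)]
```

With these made-up inputs, snps_in_set(example_snps, example_set) selects snp1 and snp3, and a flank of 1000 base pairs additionally picks up snp4. A linear scan is adequate for a sketch; a genome-wide data set would call for an interval index instead.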

The use of gene networks has recently been proposed for prioritizing candidate genes in linkage studies, and the approach could be adapted to the analysis of high-density SNP data sets as well. Gene networks estimate the likelihood of an association between any two genes by combining multiple sources of publicly available information, such as:

  • Protein-Protein Interaction (e.g., OPHID)
  • Gene Co-Expression (e.g., NCBI’s Gene Expression Omnibus)
  • Similarity in Gene Ontology annotation for biological process and molecular function
  • Sequence Similarity (e.g., BLAST)
  • Protein Domain Similarity (e.g., Pfam)
  • Transcription Motifs (e.g., TRANSFAC)
  • Literature (e.g., PubMed abstracts)
  • Membership in the same pathway (e.g., KEGG, BioCarta, GenMAPP, Reactome)
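
Combining the evidence sources listed above into a single gene-gene score can be sketched as follows. This is only an illustration under stated assumptions: it uses a weighted mean of per-source scores that have already been scaled to the range 0 to 1, which is not necessarily the scheme used by published gene-network methods. The function name combine_evidence, the source keys, and the weights are hypothetical.

```python
from typing import Dict


def combine_evidence(scores: Dict[str, float],
                     weights: Dict[str, float]) -> float:
    """Weighted mean of per-source evidence scores for one gene pair.

    `scores` maps a source name to a value in [0, 1]; a source missing
    from `scores` is treated as providing no evidence (0.0).
    `weights` maps every source under consideration to a non-negative
    weight; the result is normalized by the total weight.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    for source, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {source!r} is outside [0, 1]")
    weighted_sum = sum(w * scores.get(source, 0.0)
                       for source, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical weights; real values would have to be chosen or learned.
example_weights = {"ppi": 2.0, "coexpression": 1.0, "same_pathway": 1.0}

# Made-up evidence for one gene pair: a known interaction and moderate
# co-expression, but no shared pathway annotation.
example_scores = {"ppi": 1.0, "coexpression": 0.5}

pair_score = combine_evidence(example_scores, example_weights)
```

Here pair_score evaluates to (2.0 × 1.0 + 1.0 × 0.5 + 1.0 × 0.0) / 4.0 = 0.625. A weighted mean keeps the result between 0 and 1 and makes the contribution of each source easy to inspect; other combination rules, such as probabilistic integration, would slot into the same interface.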