Integration and Querying of Genomic and Proteomic Semantic Annotations for Biomedical Knowledge Extraction

Abstract

Understanding complex biological phenomena involves answering complex biomedical questions about multiple types of biomolecular information simultaneously. This information is expressed through genomic and proteomic semantic annotations scattered across many distributed and heterogeneous data sources; such heterogeneity and dispersion hamper biologists’ ability to ask global queries and perform global evaluations. To overcome this problem, we developed a software architecture to create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources of such dispersed information (including Entrez Gene, UniProt, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome and OMIM). Our solution is general: it uses a flexible, modular and multilevel global data schema based on abstraction and generalization of integrated data features, together with a set of automatic procedures that ease data integration and maintenance, even when the integrated data sources evolve in data content, structure and number. These procedures also ensure consistency, quality and provenance tracking of all integrated data, and perform the semantic closure of the hierarchical relationships of the integrated biomedical ontologies. At http://www.bioinformatics.deib.polimi.it/GPKB/, a Web interface allows easy graphical composition of queries, even complex ones, on the knowledge base; it also supports semantic query expansion and comprehensive explorative search of the integrated data to better sustain biomedical knowledge extraction.
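The semantic closure mentioned in the abstract can be pictured as computing, for every ontology term, all of its ancestors along the hierarchical relationships, not only its direct parents. The following is a minimal sketch of that idea and not the GPKB implementation: the function name `semantic_closure`, the edge-list representation and the simplified term names are all assumptions made for illustration.

```python
from collections import defaultdict


def semantic_closure(is_a_edges):
    """Return every (descendant, ancestor) pair implied by direct is_a edges.

    `is_a_edges` is an iterable of (child, parent) pairs (hypothetical format).
    """
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)

    closure = set()
    for term in list(parents):
        # Walk up the hierarchy from each term, visiting each ancestor once.
        stack = list(parents[term])
        seen = set()
        while stack:
            ancestor = stack.pop()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            closure.add((term, ancestor))
            stack.extend(parents.get(ancestor, ()))
    return closure


# Toy hierarchy with simplified, illustrative term names (not real ontology data).
edges = [
    ("protein kinase activity", "kinase activity"),
    ("kinase activity", "catalytic activity"),
    ("catalytic activity", "molecular function"),
]

for descendant, ancestor in sorted(semantic_closure(edges)):
    print(f"{descendant} -> {ancestor}")
```

Storing the closed relationships up front means that a later query for a general term can find annotations made with more specific terms without walking the hierarchy again at query time.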

Existing System

- Genome research projects generate enormous quantities of data. GenBank is the National Institutes of Health (NIH) molecular database, composed of an annotated collection of all publicly available DNA sequences [Benson et al. 2000; Benson et al. 2003; 2007].
- There are many standalone databases that harbor important scientific data and are goldmines for biologists.
- These databases have expanded exponentially and typically double in size every 12-18 months, owing to the development of advanced DNA sequencing technologies.
- Most of these databases are standalone, text-only repositories containing highly specialized medical, mutational, sequence or coordinate data.

Disadvantages

- The scattering of genomic and proteomic annotation data across many complementary but also overlapping sources is an important and not yet completely solved challenge.
- Specifically, data source heterogeneity in data representation and format, their fast evolution in number, data content and structure, the high variety of available data types, and the great amount of data produced over time are the facets of a very hard data integration problem.
- Our data schema and software architecture are generic; thus, they have the potential to overcome the maintenance and extension issues posed by the warehousing technique.

Proposed System

- In this paper we provide an extensive survey of the databases and other resources related to current research in bioinformatics, and of the issues that confront database researchers in helping biologists.
- The study of genes and the study of proteins are becoming extremely important and are known as genomics and proteomics, respectively.
- Although there are numerous databases related to various subfields of biology, we have maintained a focus on genomic and proteomic databases, which are crucial stepping stones for other fields and are expected to play an important role in future applications of biology and medicine.

Advantages

- We also developed a user-friendly Web interface that supports the easy composition of complex multi-topic queries and their semantic expansion upon the integrated data; this interface enables users to comprehensively select, extract and display all data of interest that match the performed query, syntactically or semantically, and to take advantage of them for biomedical hypothesis formulation and knowledge discovery.
- In this Section we illustrate and discuss the global data schema that we defined to integrate numerous, heterogeneous, controlled annotation data, i.e. data regarding different features or topics represented through multiple controlled vocabularies and ontologies, as well as their associations.
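To make the difference between a syntactic and a semantic match concrete, the sketch below expands a queried ontology term to its descendants before matching annotations. It is only an illustration under stated assumptions: the function names `descendants` and `expanded_query`, the tuple-based data layout, and the gene and term names are hypothetical and do not reflect the actual GPKB schema or interface.

```python
from collections import defaultdict


def descendants(term, is_a_edges):
    """Collect every term that lies below `term` in the is_a hierarchy."""
    children = defaultdict(set)
    for child, parent in is_a_edges:
        children[parent].add(child)

    found, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def expanded_query(term, is_a_edges, annotations):
    """Return genes annotated with `term` or with any of its descendants."""
    terms = {term} | descendants(term, is_a_edges)
    return sorted({gene for gene, annotated in annotations if annotated in terms})


# Hypothetical toy data: (child, parent) edges and (gene, term) annotations.
edges = [
    ("protein kinase activity", "kinase activity"),
    ("kinase activity", "catalytic activity"),
]
annotations = [
    ("GENE_A", "protein kinase activity"),
    ("GENE_B", "catalytic activity"),
    ("GENE_C", "binding"),
]

print(expanded_query("catalytic activity", edges, annotations))
```

A purely syntactic match on "catalytic activity" would find only GENE_B in this toy data, whereas the expanded query also retrieves GENE_A, which is annotated with a more specific term.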
