| Publication Type | | Journal Article |
| Authors | | Katrin Fundel, Ralf Zimmer |
| Year of Publication | | 2006 |
| Journal | | BMC Bioinformatics |
| Volume | | 7 |
| Pages | | 372 |
| Keywords | | textmining |
| DOI | | 10.1186/1471-2105-7-372 |
| Citation Key | | bioinflmu-275 |
Abstract
BACKGROUND: Frequently, several alternative names are in use for
biological objects such as genes and proteins. Applications like manual
literature search, automated text-mining, named entity identification,
gene/protein annotation, and linking of knowledge from different information
sources require knowledge of all names in use for a given gene or
protein. Various organism-specific or general public databases aim at organizing
knowledge about genes and proteins. These databases can be used for deriving gene
and protein name dictionaries. So far, little is known about the differences
between databases in terms of size, ambiguities and overlap.

RESULTS: We compiled
five gene and protein name dictionaries for each of the five model organisms
(yeast, fly, mouse, rat, and human) from different organism-specific and general
public databases. We analyzed the degree of ambiguity of gene and protein names
within and between dictionaries, as well as against a lexicon of common English
words and domain-related non-gene terms, and we compared the data sources in
terms of the size of the extracted dictionaries and the overlap of synonyms
between them. The study
shows that the number of genes/proteins and synonyms covered in individual
databases varies significantly for a given organism, and that the degree of
ambiguity of synonyms varies significantly between different organisms.
Furthermore, it shows that, despite considerable efforts of co-curation, the
overlap of synonyms in different data sources is rather moderate and that the
degree of ambiguity of gene names with common English words and domain-related
non-gene terms varies depending on the considered organism.

CONCLUSION: These results indicate that the combination of data contained in
different databases allows the generation of gene and protein name dictionaries
that contain significantly more used names than dictionaries obtained from
individual data sources. Furthermore, curation of combined dictionaries
considerably increases size and decreases ambiguity. The entries of the curated
synonym dictionary are available for manual querying, editing, and PubMed- or
Google-search via the ProThesaurus-wiki. For automated querying via custom
software, we offer a web service and an exemplary client application.
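
The two quantities the abstract refers to, ambiguity within a dictionary and synonym overlap between data sources, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the gene identifiers and names are made-up toy data, the function names are hypothetical, and the Jaccard index is just one plausible overlap measure, not necessarily the one used in the study.

```python
from collections import defaultdict


def ambiguous_names(dictionary):
    """Return names that map to more than one gene ID within one dictionary."""
    ids_by_name = defaultdict(set)
    for gene_id, names in dictionary.items():
        for name in names:
            # Case-insensitive comparison is an assumption of this sketch.
            ids_by_name[name.lower()].add(gene_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


def synonym_overlap(dict_a, dict_b):
    """Jaccard overlap of (gene ID, name) pairs between two dictionaries."""
    pairs_a = {(g, n.lower()) for g, names in dict_a.items() for n in names}
    pairs_b = {(g, n.lower()) for g, names in dict_b.items() for n in names}
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


# Toy dictionaries (hypothetical entries, not taken from any real database).
# Both sources are assumed to share the same gene identifiers.
source_a = {"G1": {"abc1", "foo"}, "G2": {"foo", "bar"}}
source_b = {"G1": {"abc1"}, "G2": {"bar", "baz"}}

print(ambiguous_names(source_a))            # names shared by several genes
print(synonym_overlap(source_a, source_b))  # fraction of shared synonym pairs
```

In this toy setting, a name counts as ambiguous when it is attached to more than one gene in the same source, and the overlap is the share of (gene, name) pairs that both sources list.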