Annokey: an annotation tool based on key term search of the NCBI Entrez Gene database
© Park et al.; licensee BioMed Central Ltd. 2014
Received: 27 March 2014
Accepted: 5 June 2014
Published: 26 June 2014
The NCBI Entrez Gene and PubMed databases contain a wealth of high-quality information about genes for many different organisms. The NCBI Entrez online web-search interface is convenient for simple manual search for a small number of genes but impractical for the kinds of outputs seen in typical genomics projects.
We have developed an efficient open-source tool implemented in Python called Annokey, which annotates gene lists with the results of a keyword search of the NCBI Entrez Gene database and linked PubMed article information. The user steers the search by specifying a ranked list of keywords (including multi-word phrases and regular expressions) that are correlated with their topic of interest. Rank information of matched terms allows the user to guide further investigation.
We applied Annokey to the entire human Entrez Gene database using the key-term “DNA repair” and assessed its performance in identifying the 176 members of a published “gold standard” list of genes established to be involved in this pathway. For this test case we observed a sensitivity and specificity of 97% and 96%, respectively.
Annokey facilitates the identification of genes related to an area of interest, a task which can be onerous if performed manually on a large number of genes. Annokey provides a way to capitalize on the high-quality information provided by the Entrez Gene database, allowing both scalability and compatibility with automated analysis pipelines, thus offering the potential to significantly enhance research productivity.
Keywords: Gene annotation, Keyword search, NCBI gene database, PubMed article summaries
Recent advances in high-throughput DNA sequencing have allowed large regions of a sample genome (or genomes) to be sequenced rapidly at high coverage. The detection of rare variant associations with disease is one of many applications enabled by this new technology. However, current analysis pipelines tend to produce large numbers of genetic variants, only a small subset of which are likely to be relevant to the phenotype under study [3, 4]. This necessitates filtering and annotation to obtain manageable lists of candidates for further investigation. A typical workflow includes the identification of a list of variants in a sample, annotation of those variants with metadata drawn from a variety of sources, and filtering using research-specific criteria. Existing annotation tools, such as Annovar, are good at associating variants with basic biological information, such as which genes are affected. However, they are less helpful in deciding whether those genes are relevant to a given domain of interest. For a single DNA sample it is not uncommon for an experiment to produce thousands of variants, which may be collectively associated with hundreds of genes. It is therefore infeasible to manually inspect every candidate gene to determine if it is known to be associated with a particular domain of investigation. In order to address this problem we have implemented the Annokey tool described in this paper. We discuss its main features, how it is used and implemented, and analyse its performance on a simple test case relative to a published “gold standard” of DNA repair pathway genes.
Annokey’s inputs are a list of genes and a list of key terms (e.g. “DNA repair pathways”) that are considered by the user to be relevant to a research question. For each gene, Annokey searches for instances of those terms within the NCBI Entrez Gene database and PubMed article abstracts and records their frequency and occurrence contexts. This provides an indication of each gene’s relevance as a candidate for follow-up study. The key terms are listed by the user in descending order of significance, and the rankings of matched terms are included in the output. This offers an additional level of information for prioritisation, which becomes more important as the number of key terms increases. The tool automates what would otherwise be a labour-intensive task and can greatly increase research productivity. Whilst Annokey was originally developed in the context of rare variant detection, it is applicable to a much wider range of genomic studies: any situation where a researcher wants to know which of a set of genes are related to particular topics, e.g. RNA-Seq, ChIP-Seq, siRNA screens.
The Entrez Gene database is provided by the National Center for Biotechnology Information (NCBI) and contains detailed information about genomes from a variety of organisms. Gene records for a particular organism are curated by NCBI’s Reference Sequence project, drawing data from other databases within NCBI and other sources such as the Gene Ontology Database. These represent individual genes and each is labelled by a unique identifier called a GeneID. Each entry contains information about (amongst other things) nomenclature, gene sequences, gene product sequences, pathways, interactions, markers and phenotypes. Entries also indicate other resources within and outside the NCBI, including literature citations in the form of PubMed article references. The database can be accessed in two main ways. The first access method is via a web interface, which is suitable for interactive browsing and querying. The second access method is via a programming interface (which is used by Annokey). Whilst the web interface provides an expressive search engine, it is tedious to use for large numbers of genes or large numbers of search terms, and it is difficult to incorporate into an automated analysis pipeline. The programming interface is also not straightforward to work with for non-technical researchers, as also noted by Mrozek et al., who aim to provide a more user-friendly interface for complex exploration of the NCBI resources. Annokey is a targeted gene annotation tool that allows researchers to capitalize on the high-quality information provided by the Entrez Gene database in a way that scales to large numbers of genes and queries, is tailored to a specified user context, and is compatible with automated analysis pipelines.
- Online and offline search capability.
- Flexible search terms, allowing use of regular expressions. Search terms are ranked in importance by the user according to the order in which they are listed in the input.
- Summary search results are provided as annotations on the input gene list for quick inspection, prioritisation and integration with workflows based on spreadsheets.
- Detailed search results are provided in an HTML report with hyperlinks back to the Entrez Gene web interface. The HTML report also shows each matching instance of a search term and the context in which it was found.
Annokey’s online search retrieves data from the NCBI databases directly over the Internet. This provides access to the most recent version of the data but, due to network latencies, is typically only applicable to queries involving a small number of genes (<100), although the number of search terms is not a limiting factor. Offline search utilises a locally cached copy of the Entrez Gene database, which can be populated using a snapshot of the whole database for a given organism. Offline search is appropriate for queries involving hundreds or thousands of genes and search terms, or in cases where a large number of different searches are required. The Annokey distribution provides an additional tool that automates the process of downloading and preparing the latest database snapshot.
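The difference between the two modes can be pictured as a cache-first lookup. The following is a minimal sketch of that general pattern, not Annokey's actual code: the function name `load_record`, the one-file-per-identifier layout, and the `fetch_online` callable are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

def load_record(identifier, cache_dir, fetch_online):
    """Return the XML text for an identifier, preferring the local cache.

    fetch_online is a hypothetical callable that downloads the record;
    the '<identifier>.xml' file layout is likewise an assumption.
    """
    cache_file = Path(cache_dir) / f"{identifier}.xml"
    if cache_file.exists():
        # Offline path: no network access needed.
        return cache_file.read_text(encoding="utf-8")
    # Online path: fetch once, then store for future requests.
    xml_text = fetch_online(identifier)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(xml_text, encoding="utf-8")
    return xml_text

# Demonstration with a stand-in downloader that records how often it is called.
download_log = []

def fake_fetch(identifier):
    download_log.append(identifier)
    return f"<record id='{identifier}'/>"

with tempfile.TemporaryDirectory() as tmp:
    load_record("12345", tmp, fake_fetch)  # first call goes "online"
    load_record("12345", tmp, fake_fetch)  # second call is served from the cache

print(len(download_log))  # 1
```

With a pre-populated cache directory, every lookup takes the offline path, which is why this style of search is better suited to large gene lists.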
Search terms are provided by the user in a text file, one per line, and are written as regular expressions. This provides an easy interface for exact-match literal terms but also allows sophisticated search patterns to be constructed. For example, the search term “tumor cell” matches exactly (and only) that text, whereas “([Cc]ancer|[Tt]umou?r) [Cc]ell” generalises the search to allow for either “cancer cell” or “tumor cell”, an optional spelling of “tumour” (with a “u”), and variation in capitalisation of the first letter of each word. The use of regular expressions provides a good compromise between the needs and abilities of novice and advanced users. The terms are ranked by importance according to the relative order of the lines in the input file. In its summary output, Annokey annotates each gene with information about the search results. One of the annotations is the highest rank of any matching search term, which provides a simple way for the user to prioritise the results.
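The behaviour of the two example terms can be checked directly with Python's `re` module. This is a small illustration only; the helper `matching_ranks` is a hypothetical name, and the use of `search` (match anywhere in the text) is an assumption about how terms are applied.

```python
import re

# Terms as they might appear in a terms file, one per line,
# most important first (so list position gives the rank).
terms = ["tumor cell", "([Cc]ancer|[Tt]umou?r) [Cc]ell"]
patterns = [re.compile(term) for term in terms]

def matching_ranks(text):
    """Return the 1-based ranks of all terms found anywhere in text."""
    return [rank for rank, pattern in enumerate(patterns, start=1)
            if pattern.search(text)]

print(matching_ranks("a tumor cell line"))      # [1, 2]
print(matching_ranks("Tumour Cell migration"))  # [2]
print(matching_ranks("cell cycle arrest"))      # []
```

The literal term only matches its exact spelling, while the generalised pattern also accepts the "tumour" spelling and a capitalised first letter.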
For each gene in the input, Annokey retrieves its corresponding entry from the Entrez Gene database (if it exists), and then searches for each keyword in the fields of that entry. The location and frequency of each search term is recorded. The search optionally extends to the PubMed article summaries referenced in the gene record. The results of the search are presented in two ways: 1) as summary annotations to the input gene file; 2) as a more detailed HTML report. The summary output provides information that is easy to understand at a glance and practical to use in a typical bioinformatics workflow using spreadsheets (or CSV files). The detailed output provides a more fine-grained breakdown of the search results and includes hyperlinks back to the Entrez Gene web interface. We anticipate that most users will first look at the summary results to select a set of candidate genes, and then use the detailed report to investigate the candidates in more depth.
The algorithm is realised by the ANNOKEY_SEARCH procedure, which takes four parameters: 1) a list of gene names; 2) a list of search terms; 3) an organism name (which defaults to human); and 4) a Boolean flag indicating whether the search should include PubMed articles referenced from the gene entries. To avoid ambiguities such as those caused by synonyms, Annokey matches input gene names against the “Official Symbol” as specified by NCBI.
Each gene is processed separately by the loop spanning lines 4 to 26. The database record for a gene is retrieved (line 6) as an XML document, and parsed into its constituent fields (line 7). The precise set of fields that are searched are presented in the “NCBI Entrez Gene database fields” subsection below. Depending on how Annokey is used, the gene record might be found in the local file cache, or it might be fetched by an online query to the NCBI Entrez database. Each field of the gene record is processed separately by the loop spanning lines 9 to 13, and for a given field, each search term is processed separately by the loop spanning lines 11 to 13. All the matches of the search term are collected (line 12) and added to the set of hits (line 13). Each hit records the gene name, the search term, the field in which the matches were found, and a list of match locations. If the parameter “include_pubmed” is true then Annokey will also search through the titles and summaries of PubMed articles that are referred to in each gene record (lines 15 to 26). The set of PubMed articles referenced by a gene are collected (line 17) and the corresponding PubMed database entries are retrieved (line 18). Annokey first looks for PubMed database entries in a local file cache, and if that is unsuccessful, it then tries to fetch them from the online PubMed database (saving the downloaded entry into the local cache to optimise future requests). The XML PubMed entries are parsed into fields and searched in much the same way as the fields of the gene database. The total set of search hits is returned on line 28. In practice, to save memory, Annokey incrementally generates outputs after each gene is processed. It further reduces memory requirements by employing a streaming XML parser, which avoids the need to store the entire entry in memory for any given gene. Multiple different genes in a search could refer to the same PubMed article. Annokey avoids repeating the same search by saving the results of previous searches in a table indexed by PubMed ID.
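The nested loops described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published procedure: records are plain in-memory dictionaries rather than parsed XML, and the names `search_fields`, `annokey_search_sketch`, `gene_db` and `pubmed_db`, along with the toy gene names and PubMed ID, are all hypothetical.

```python
import re

def search_fields(fields, terms):
    """Search every field for every term; return one hit per (field, term) match."""
    hits = []
    for field, text in fields.items():
        for rank, term in enumerate(terms, start=1):
            spans = [m.span() for m in re.finditer(term, text)]
            if spans:
                hits.append({"rank": rank, "term": term,
                             "field": field, "spans": spans})
    return hits

def annokey_search_sketch(genes, terms, gene_db, pubmed_db, include_pubmed=False):
    """gene_db maps gene name -> {"fields": {...}, "pmids": [...]};
    pubmed_db maps PubMed ID -> {field name: text}."""
    results = {}
    pubmed_hits = {}  # table indexed by PubMed ID, so each article is searched once
    for gene in genes:
        record = gene_db.get(gene)
        if record is None:  # no database entry for this gene
            results[gene] = []
            continue
        hits = [dict(h, gene=gene, source="gene")
                for h in search_fields(record["fields"], terms)]
        if include_pubmed:
            for pmid in record["pmids"]:
                if pmid not in pubmed_hits:
                    pubmed_hits[pmid] = search_fields(pubmed_db.get(pmid, {}), terms)
                hits.extend(dict(h, gene=gene, source=pmid)
                            for h in pubmed_hits[pmid])
        results[gene] = hits
    return results

# Toy data: two genes that cite the same (made-up) article.
gene_db = {
    "GENE_A": {"fields": {"Summary": "Acts in DNA repair."}, "pmids": ["111"]},
    "GENE_B": {"fields": {"Summary": "Unknown function."}, "pmids": ["111"]},
}
pubmed_db = {"111": {"Title": "DNA Repair after damage", "Abstract": "No keyword here."}}

results = annokey_search_sketch(["GENE_A", "GENE_B", "GENE_C"], ["DNA [Rr]epair"],
                                gene_db, pubmed_db, include_pubmed=True)
print({gene: len(hits) for gene, hits in results.items()})
# {'GENE_A': 2, 'GENE_B': 1, 'GENE_C': 0}
```

The `pubmed_hits` table plays the role of the results table indexed by PubMed ID: the shared article is searched when GENE_A is processed and its hits are reused for GENE_B.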
Results and discussion
The following example illustrates how Annokey can be used to perform a simple search. More complex scenarios are described in the user documentation. Annokey requires two input files: a list of genes, and a list of search terms. The input gene list is a comma- or tab-separated file with at least one column headed “Gene”. Other columns are allowed; they will be ignored by the search but preserved in the output. For illustrative purposes, suppose we have supplied the following gene list of three genes (although typically the list would be much longer, and contain many more columns):
The input search term list is a text file with one term per line. Search terms can be literal text or regular expressions. The example below illustrates the contents of a file with two terms (again, in a real experiment, the list would be much longer):
If, for example, the gene list and the search term list are stored in files called “genes.csv” and “terms.txt” respectively, then an offline (cached) search can be performed by executing the following on the command line:
annokey --terms terms.txt --genes genes.csv
The summary output of the search is a new CSV file (printed to the standard output device) with three additional columns added, as follows:
Gene,Highest Rank,Highest Rank Term,Total Matched Entries
Highest Rank: the numerical rank of the highest ranked matching search term.
Highest Rank Term: the highest ranked matching search term.
Total Matched Entries: for each search term Annokey counts the number of database fields where that term is found at least once. This column contains the sum of all those counts over all the search terms.
The first two annotation columns show the highest ranked matching term (its numerical rank, and the term itself). The numerical rank allows the user to sort the search results in terms of the relative importance of matched terms. In the example above, we have ranked “breast cancer” as having higher importance than “DNA repair”. The third annotation column shows how many matching fields were found over all the search terms. This allows the user to sort the results in terms of the weight of evidence. In the example above, there were no matches for the CANT1 gene, so its annotations are empty. It is important to note that the annotated output provides a heavily summarised view of the search results. It is intended to provide an overview that is relatively easy to understand and prioritise. Brief, sortable summaries are particularly useful when dealing with large numbers (>100) of genes, as is typical in many genomics projects. Annokey also provides more detailed search results in the form of an HTML report, which we describe below.
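The way the three annotation columns follow from the raw hits can be sketched as below. This is an illustration under assumptions rather than Annokey's own code: hits are represented as hypothetical `(rank, term, field)` tuples, the field labels and the gene name GENE_A are made up, and leaving all three columns empty for a gene with no matches is inferred from the CANT1 example.

```python
import csv
import io

COLUMNS = ["Highest Rank", "Highest Rank Term", "Total Matched Entries"]

def summarise(hits):
    """hits: list of (rank, term, field) tuples for one gene."""
    if not hits:
        return dict.fromkeys(COLUMNS, "")  # no matches: annotations stay empty
    best_rank, best_term, _ = min(hits, key=lambda hit: hit[0])  # rank 1 is best
    # Count each (term, field) pair once, however often the term occurs there.
    total = len({(term, field) for _, term, field in hits})
    return dict(zip(COLUMNS, [best_rank, best_term, total]))

def annotate(gene_csv_text, hits_by_gene):
    """Append the summary columns to a comma-separated gene list."""
    reader = csv.DictReader(io.StringIO(gene_csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(reader.fieldnames) + COLUMNS,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row.update(summarise(hits_by_gene.get(row["Gene"], [])))
        writer.writerow(row)
    return out.getvalue()

hits_by_gene = {
    "GENE_A": [(2, "DNA repair", "Summary"),
               (1, "breast cancer", "GeneRIFs"),
               (2, "DNA repair", "GeneRIFs")],
}
print(annotate("Gene\nGENE_A\nCANT1\n", hits_by_gene), end="")
# Gene,Highest Rank,Highest Rank Term,Total Matched Entries
# GENE_A,1,breast cancer,3
# CANT1,,,
```

Sorting the resulting rows by the first added column orders genes by term importance, and sorting by the last orders them by weight of evidence.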
Analysis of performance
In order to gauge the effectiveness of Annokey, we measured its sensitivity and specificity against a published “gold standard” list of human DNA repair genes. This is by no means an exhaustive experiment but it gives us an idea about its behaviour in a reasonably well-controlled scenario. We ran Annokey with an input list of all the human genes contained in the Entrez Gene database, searched for the key term “DNA [Rr]epair”, and included linked PubMed articles in the search. We counted every gene with at least one search-term match as a positive result (and any other gene as a negative result), and compared the list of positives against the gold standard. Any positive result that was also in the gold standard list was considered a “true positive”, whereas the remaining positive results were considered “false positives”. Any gene in the gold standard list that was not among the positive results was considered a “false negative”, whereas the remaining negative results were considered “true negatives”. The results were as follows:
Number of human genes in Entrez Gene (at the time of the experiment): 43869
“True positives”: 170
“False positives”: 1543
“True negatives”: 42150
“False negatives”: 6
Sensitivity: 170/(170 + 6) = 0.97
Specificity: 42150/(42150 + 1543) = 0.96
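The classification rules and the two ratios above can be reproduced with a short calculation. The sketch below uses the counts reported in this section; the function names and the five-gene toy example are illustrative only.

```python
def confusion_counts(all_genes, positives, gold_standard):
    """Classify genes using the rules described above; returns (tp, fp, tn, fn)."""
    all_genes, positives, gold = set(all_genes), set(positives), set(gold_standard)
    tp = len(positives & gold)              # matched and in the gold standard
    fp = len(positives - gold)              # matched but not in the gold standard
    fn = len(gold - positives)              # in the gold standard but not matched
    tn = len(all_genes - positives - gold)  # neither matched nor in the gold standard
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

# Toy check of the classification on five made-up gene names.
print(confusion_counts("abcde", "ab", "ac"))  # (1, 1, 2, 1)

# Counts reported in this section.
tp, fp, tn, fn = 170, 1543, 42150, 6
assert tp + fp + tn + fn == 43869     # all human genes in the experiment
print(round(sensitivity(tp, fn), 2))  # 0.97
print(round(specificity(tn, fp), 2))  # 0.96
```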
Rules employed by a human expert to score the relevance of genes against the topic “DNA repair” using manual inspection of evidence from the Entrez Gene database
Is most likely relevant to DNA repair or DNA damage response, supported by biochemical evidence of involvement in DNA damage response affecting DNA repair.
Is possibly involved in DNA repair or DNA damage response e.g., DNA repair protein binding partners without necessarily evidence of involvement in DNA repair, altered regulation in response to DNA damage or DNA damaging treatments but without direct evidence of a role in DNA repair.
Has no discernable involvement in DNA repair or DNA damage response and the key term match appears off-target e.g., refers to another gene mentioned in the same text as the test gene.
Myriad tools exist to support functional analysis of gene lists. A 2009 survey identified nearly 70 tools that highlight “interesting” genes derived from high-throughput studies. Many such tools take advantage of structured information in gene databases, including Gene Ontology annotations. PubMed is used by some tools as a source of functional information about genes. However, these tools generally rely on statistical tests to establish the relevance of specific genes and their annotations with respect to a set of reference or control genes. In contrast, Annokey applies a user-directed ranking algorithm that relies on curated associations between genes and an area of interest.
There are also a number of tools that provide search functionality for the biomedical literature. Web-based, general search tools comparable to the PubMed system have been reviewed elsewhere (Lu, 2011). A few such tools offer the ability to limit results to particular biological entity types, such as genes, which can help to identify the literature most relevant to a given gene. A small number of literature search tools are also specifically designed to support the analysis of lists of genes. The most directly comparable in aims to Annokey are GoGene and GeneValorization.
GoGene is a tool that draws on associations between genes and functionally-related terms extracted from the literature, in combination with structured information in Entrez Gene and UniProt, to organise genes according to particular functional categories (e.g. Gene Ontology terms). Genes related to specific functional categories or diseases can be identified by browsing the relevant ontology structures. Like Annokey (when the PubMed search is activated), it performs a search against both PubMed and Entrez Gene. Unlike Annokey, the direct literature search is based only on matches to the gene names in the abstracts. Associations between genes and specific diseases or functional concepts are based on identification of those concepts in the abstracts mentioning the genes. The tools are therefore somewhat complementary; while Annokey relies on direct links to the literature to identify relevant abstracts rather than the occurrence of a gene name, it can support more flexible search for relevant terms (including in the Gene database itself, not restricted to terms found in the literature alone) and users have direct control over the relative importance of those terms.
GeneValorization is a web-based tool that shows the relationship between genes and contexts of study (keywords) using frequency of co-location of the gene name and the keyword in the literature. It employs a graphical format to display its results. Whilst having similar goals to Annokey, it differs functionally in a number of ways. Annokey is a command-line tool that is designed to work with existing analysis pipelines, whereas GeneValorization is designed for interactive use. Annokey allows more flexible search using regular expressions, whereas GeneValorization uses literal terms only. Annokey searches within various fields of the Entrez Gene databases plus linked PubMed articles, whereas GeneValorization focuses on literature search only. Annokey caches database entries, which allows it to support very large numbers of genes and terms, whereas GeneValorization performs all queries online.
Annokey is a freely available open-source tool that allows users to annotate a list of genes using a keyword search of the Entrez Gene database and linked PubMed article summaries. It produces two types of search results: 1) a summarised output of annotated genes, which is useful for filtering and prioritisation; and 2) a detailed search report which shows how each search term was matched within the various fields of a gene record. We believe that Annokey will be a useful addition to many bioinformatics workflows in which lists of candidate genes need to be prioritised with respect to a domain of interest.
Availability and requirements
Project name: Annokey
Project home page: http://bjpop.github.io/annokey/
Operating systems: POSIX-like operating systems (OS X, Linux)
Programming language: Python
Other requirements: Python libraries: Biopython, lxml, html
Any restrictions to use by non-academic: None
NCBI: National Center for Biotechnology Information
CSV: Comma separated value
HTML: Hypertext markup language
XML: Extensible markup language.
This work was supported by a Victorian Life Sciences Computation Initiative (VLSCI) grant number VR0182 on its Peak Computing Facility at the University of Melbourne, an initiative of the Victorian Government. This work was also supported by National Health and Medical Research Council (NHMRC) of Australia Project Grants 1028280 and 1025145. SK’s work at VLSCI was made possible through the 2012 AMSI Bioinformatics Internships supported by EMBL Australia and BioPlatforms Australia. KV participated in this work initially while working at NICTA. NICTA is supported by Australian Federal and Victorian State Governments and the Australian Research Council through the ICT Centre of Excellence program. TN-D is the recipient of a post-doctoral fellowship from Susan G. Komen®.
- Moorthie S, Mattocks CJ, Wright CF: Review of massively parallel DNA sequencing technologies. HUGO J. 2011, 5: 1-12.
- Southey MC: The role of new sequencing technologies in identifying rare mutations in new susceptibility genes for cancer. Curr Genet Med Rep. 2013, 1: 7.
- Do R, Kathiresan S, Abecasis GR: Exome sequencing and complex disease: practical aspects of rare variant association studies. Hum Mol Genet. 2012, 21: R1-R9.
- Snape K, Ruark E, Tarpey P, Renwick A, Turnbull C, Seal S, Murray A, Hanks S, Douglas J, Stratton MR, Rahman N: Predisposition gene identification in common cancers by exome sequencing: insights from familial breast cancer. Breast Cancer Res Treat. 2012, 134: 429-433.
- Wang K, Li M, Hakonarson H: ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010, 38: e164.
- Human DNA repair genes public database. http://sciencepark.mdanderson.org/labs/wood/dna_repair_genes.html - Human
- Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2011, 39: D52-D57.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29.
- The NCBI Entrez Gene database. http://www.ncbi.nlm.nih.gov/gene.
- Mrozek D, Malysiak-Mrozek B, Siaznik A: search GenBank: interactive orchestration and ad-hoc choreography of Web services in the exploration of the biomedical resources of the National Center For Biotechnology Information. BMC Bioinformatics. 2013, 14: 73.
- Python: Regular expressions documentation. http://docs.python.org/2/library/re.html.
- The lxml toolkit. http://lxml.de/.
- Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ: Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009, 25: 1422-1423.
- NCBI ftp server. [ftp.ncbi.nlm.nih.gov]
- HTML Python library. https://pypi.python.org/pypi/html/.
- W3C Validator. http://validator.w3.org/.
- Annokey User documentation. http://bjpop.github.io/annokey/.
- Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009, 37: 13.
- Lu Z: PubMed and beyond: a survey of web tools for searching biomedical literature. Database (Oxford). 2011, 2011: baq036.
- Plake C, Royer L, Winnenburg R, Hakenberg J, Schroeder M: GoGene: gene annotation in the fast lane. Nucleic Acids Res. 2009, 37: W300-W304.
- Brancotte B, Biton A, Bernard-Pierrot I, Radvanyi F, Reyal F, Cohen-Boulakia S: Gene List significance at-a-glance with GeneValorization. Bioinformatics. 2011, 27: 1187-1189.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.