PageRank as a method to rank biomedical literature by importance
© Yates and Dixon. 2015
Received: 5 February 2015
Accepted: 9 November 2015
Published: 9 December 2015
Optimal ranking of literature importance is vital in overcoming article overload. Existing ranking methods are typically based on raw citation counts, giving a sum of ‘inbound’ links with no consideration of citation importance. PageRank, an algorithm originally developed for ranking webpages at the search engine Google, could potentially be adapted to bibliometrics to quantify the relative importance weightings of a citation network. This article seeks to validate such an approach on the freely available PubMed Central open access subset (PMC-OAS) of biomedical literature.
On-demand cloud computing infrastructure was used to extract a citation network from over 600,000 full-text PMC-OAS articles. PageRanks and citation counts were calculated for each node in this network. PageRank is highly correlated with citation count (R = 0.905, P < 0.01) and we thus validate the former as a surrogate of literature importance. Furthermore, the algorithm can be run in trivial time on cheap, commodity cluster hardware, lowering the barrier of entry for resource-limited open access organisations.
PageRank can be trivially computed on commodity cluster hardware and is linearly correlated with citation count. Given its putative benefits in quantifying relative importance, we suggest it may enrich the citation network, thereby overcoming the existing inadequacy of citation counts alone. We thus suggest PageRank as a feasible supplement to, or replacement of, existing bibliometric ranking methods.
Keywords: PageRank, Bibliometrics, Citation count, Impact factor, Journal ranking
MEDLINE is the premier bibliographic database of the U.S. National Library of Medicine (NLM), containing over 22 million biomedicine-related entries. With approximately 750,000 new citations added in 2014, it is essential to identify literature of the highest quality for priority reading. High citation rates (in addition to journal impact factor and circulation rates) are proposed to be predictive of article quality and thus, in turn, of scientific importance. However, factors such as bias towards review articles and variable bibliographic lengths suggest that such methods are not always optimal.
Citation counts give no weighting towards articles of greater importance. Naturally, defining such importance is a subjective task. In a static system of inter-article referencing, a citation by an article from a low-distribution journal is equivalent to a citation from a large-scale systematic review. A weighting approach might favour articles of greater perceived ‘scientific gravity’; however, this may neglect the emerging relevance of an article’s spread through the scientific community. A method of objectively weighting literature importance would therefore be highly beneficial.
The PageRank algorithm, originally used for link analysis by the search engine Google, provides one such method of ranking by importance. The concept, originally applied to web pages, proposes that a web page itself carries a greater importance if linked to by other high importance pages. Thus for a closed system of total web pages online, a system of merit can be constructed based on assigning a relative weighting (as a proportion of the entire database) to each web page.
Much as web pages are interconnected through hyperlinks, scientific articles are themselves linked via their citations. As such, this study seeks to investigate PageRank-based bibliometrics as an alternative to citation counts alone.
The PubMed Central open access subset (PMC-OAS) represents a more liberally licensed part of the PubMed Central collection, freely available online. Contributing journals provide selected full text articles in eXtensible Markup Language (XML) format, specifically for data mining purposes.
With data ingestion going beyond the capability of traditional desktop computing, on-demand cloud-computing infrastructure was leveraged to parallelise metadata extraction. This commodity cluster environment represents a readily-available, low-cost method of scaling up ‘embarrassingly parallel’ computational tasks.
XML parsing was performed in parallel on four compute nodes (2 GB RAM, 2 virtual CPU cores) using a hand-written Python parser in under two hours (Appendix 1). PubMed identification (PMID) numbers of ‘outbound’ citations were extracted from each article’s reference list and used as reference keys for every citation edge in the graph of article nodes.
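The extraction step can be illustrated with a minimal sketch. This is not the authors' parser from Appendix 1; it assumes JATS-style markup in which the citing article's PMID sits in `article-meta` and cited PMIDs appear as `pub-id` elements with `pub-id-type="pmid"` inside `ref-list`. The sample document and the function name `extract_citation_edges` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy JATS-like article (invented content, for illustration only).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmid">111</article-id>
      <article-id pub-id-type="doi">10.0000/example</article-id>
    </article-meta>
  </front>
  <back>
    <ref-list>
      <ref id="r1"><element-citation><pub-id pub-id-type="pmid">222</pub-id></element-citation></ref>
      <ref id="r2"><element-citation><pub-id pub-id-type="doi">10.0000/other</pub-id></element-citation></ref>
      <ref id="r3"><mixed-citation><pub-id pub-id-type="pmid"> 333 </pub-id></mixed-citation></ref>
    </ref-list>
  </back>
</article>"""


def extract_citation_edges(xml_text):
    """Return (citing_pmid, [cited_pmids]) for a single article document."""
    root = ET.fromstring(xml_text)
    # The citing article's own PMID is assumed to live in the front matter.
    own = root.find(".//article-meta/article-id[@pub-id-type='pmid']")
    citing = own.text.strip() if own is not None and own.text else None
    cited = []
    # Outbound citations: only references that carry a PMID are kept.
    for pub_id in root.iterfind(".//ref-list//pub-id[@pub-id-type='pmid']"):
        if pub_id.text and pub_id.text.strip():
            cited.append(pub_id.text.strip())
    return citing, cited


citing_pmid, cited_pmids = extract_citation_edges(SAMPLE_XML)
```

For the toy document this yields the citing PMID `111` with outbound citations `222` and `333`; the reference identified only by DOI is skipped. Because each article is processed independently, files can be distributed across compute nodes without coordination.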
A damping factor was originally introduced in PageRank to model an imaginary surfer who randomly clicks on links but will eventually stop clicking. A value of 0.85 corresponds to an 85 % probability that, at any step, this imaginary surfer will continue to click. Because the algorithm is iterative, a convergence threshold (epsilon) of 0.00001 was used to ensure precision. The algorithm was used as per the reference implementation except where otherwise described.
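The roles of the damping factor and the convergence value can be seen in a short power-iteration sketch. This is not the reference implementation used in the study; the function name, the L1-norm stopping rule and the even redistribution of rank from articles with no outbound citations are assumptions made for illustration.

```python
def pagerank(out_links, damping=0.85, epsilon=1e-5, max_iter=1000):
    """Power-iteration PageRank over a dict {node: [cited nodes]}."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # uniform starting weights
    for _ in range(max_iter):
        # Assumption: rank held by nodes with no outbound citations
        # is spread evenly over the whole graph.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        # Assumption: stop when the total absolute change drops below epsilon.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < epsilon:
            break
    return rank


# Toy citation graph: articles A and B both cite C.
toy_ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
```

In the toy graph, C receives the highest rank because it is cited by both other articles, and the ranks sum to one, matching the description of PageRank as a relative weighting across the whole network.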
Inverted citation index creation
MapReduce, a programming model for large corpus processing, also developed at Google, was used to create an ‘inverted citation index’. This distributed computational approach allows near linear scalability with increasing cluster size, thus facilitating a route for future corpus expansion. The inverted citation index generates a list of ‘inbound’ citations for each article node in the graph, with a corresponding total citation count.
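The inversion itself can be sketched in plain Python as a map step followed by a reduce step. This single-process version only imitates the MapReduce model and is not the study's cluster job; the function names and the toy PMIDs are hypothetical.

```python
from collections import defaultdict


def map_phase(citing_pmid, cited_pmids):
    """Emit one (cited, citing) pair per outbound citation."""
    for cited in cited_pmids:
        yield cited, citing_pmid


def reduce_phase(pairs):
    """Group pairs by cited PMID and attach the inbound citation count."""
    grouped = defaultdict(list)
    for cited, citing in pairs:
        grouped[cited].append(citing)
    return {cited: (sorted(sources), len(sources))
            for cited, sources in grouped.items()}


def build_inverted_index(outbound):
    """outbound maps each citing PMID to its list of cited PMIDs."""
    pairs = (pair
             for citing, cited_list in outbound.items()
             for pair in map_phase(citing, cited_list))
    return reduce_phase(pairs)


# Toy input: article 111 cites 222 and 333; article 444 cites 222.
toy_index = build_inverted_index({"111": ["222", "333"], "444": ["222"]})
```

Here article 222 ends up with two inbound citations (from 111 and 444) and article 333 with one. In a real MapReduce job the map calls would run on separate nodes and the framework would perform the grouping by key.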
The high-level programming language Pig was used as a layer on top of MapReduce for near-natural language manipulation of the dataset. A Pig script was written to facilitate numeric comparison between derived citation count and calculated PageRank (Appendix 2).
Statistical analysis was performed using IBM SPSS.
The PageRank algorithm processed and ranked a total of 6,293,819 unique PMIDs as graph nodes, with 24,626,354 edges representing the corresponding outbound citations. A random 5 % sample of the data was taken (using SPSS randomisation) for statistical analysis. This sample comfortably exceeds the required sample size (n = 385; Raosoft), as detailed in Appendix 3.
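The figure of 385 can be reproduced with the standard sample size formula for a proportion with finite population correction. The parameters below (95 % confidence, 5 % margin of error, 50 % response distribution) are assumptions, since Appendix 3 is not reproduced here; the function name is hypothetical.

```python
import math


def required_sample_size(population, margin=0.05, z=1.96, proportion=0.5):
    """Sample size for estimating a proportion, with finite population correction."""
    # Size needed for an infinitely large population.
    n0 = (z ** 2) * proportion * (1.0 - proportion) / margin ** 2
    # Correct for the finite number of graph nodes.
    n = n0 / (1.0 + (n0 - 1.0) / population)
    return math.ceil(n)


total_nodes = 6293819                      # unique PMIDs reported above
needed = required_sample_size(total_nodes)
sampled = round(0.05 * total_nodes)        # size of the 5 % random sample
```

Under these assumptions `needed` is 385, while the 5 % sample contains roughly 315,000 nodes, several hundred times the requirement.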
PageRank is shown to be a surrogate of literature importance
Given the current role of citation count as a marker of literature importance, and the high degree of correlation between the two measures (R = 0.905, P < 0.01), we demonstrate PageRank to be a similar surrogate. In light of this finding, we suggest that novel rankings would likely remain broadly similar, and thus that implementing PageRank in the ranking of biomedical literature is feasible.
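For readers who wish to repeat the comparison outside SPSS, a correlation coefficient can be computed directly. The sketch below assumes that R denotes a Pearson coefficient, which the text does not state explicitly; the function name and the toy values are illustrative and are not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy values only: citation counts and PageRanks for five imaginary articles.
toy_counts = [1, 3, 4, 10, 25]
toy_pageranks = [0.01, 0.02, 0.05, 0.12, 0.30]
toy_r = pearson_r(toy_counts, toy_pageranks)
```

A value close to 1 indicates that articles with more inbound citations also tend to receive higher PageRanks, which is the relationship reported for the sampled corpus.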
Top of the corpus comparison
If the putative benefits of PageRank in quantifying importance are to be observed, it must be through outliers from those otherwise highly correlated with citation count. Such outliers may have been preferentially weighted by the algorithm, based on perceived importance. Due to the training subset size, it would be infeasible to account for all such examples; however, a top of the corpus comparison allows some speculative inspection.
Table: Top of the corpus comparison (article titles; PubMed ID (PMID) values not available)
- Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
- Basic local alignment search tool.
- Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.
- Analysis of relative gene expression data using real-time quantitative PCR and the 2(−Delta Delta C(T)) Method.
- CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
- A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding.
- MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods.
- MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0.
- Cleavage of structural proteins during the assembly of the head of bacteriophage T4.
- The neighbor-joining method: a new method for reconstructing phylogenetic trees.
Whilst this represents a single example, we hypothesize that a method article is likely to be widely cited by those utilising its techniques; however, this gives little information about the importance of such implementers. As such, we suggest that this correlation outlier has been proportionally ‘down-ranked’ by the PageRank algorithm in relation to the rest of the comparative head.
Whilst further work is required to validate such claims, we suggest this finding may build upon the notion of PageRank’s potential benefits in outweighing citation count alone. If the method is truly able to better weight those articles with higher importance rather than mass citation, we propose that its implementation into the ranking of biomedical literature may be warranted.
PageRank can be trivially calculated on commodity cluster hardware
The use of on-demand cloud computing infrastructure for data extraction and computation allows for scalability with increasing corpus size. In the event of increasing article burden, additional XML parsing nodes could be employed with linear cost and throughput. Despite the uncompressed corpus totalling approximately 40 GB, the fully citation-extracted form was <500 MB. Therefore, we suggest that growth by an order of magnitude (in the range of the entire MEDLINE database) could still be stored on a single commodity hard drive.
Whilst the PageRank calculation was performed on a single node, expansion beyond 2 GB of RAM on a single computer is becoming cheaper and widely available. The use of MapReduce for inverted citation network creation allows near-linear scalability, similar to XML parsing, and the index can thus be trivially re-evaluated as the corpus grows. PMC-OAS is updated daily, and all metrics can be recalculated in a matter of minutes (excluding the cost of data parsing), as required by the maintainer.
Expanding automated XML processing to MEDLINE as a whole is problematic
The PMC-OAS full-text articles are freely available in XML format, facilitating automated citation extraction. Unfortunately, the vast majority of MEDLINE articles are not open access, meaning that full-text access is not trivially available without bulk licensing programmes. Furthermore, the lack of XML-based metadata in non-open access articles limits the capability for rapid citation network generation.
Efforts have been made to parse bibliographic data from papers [15, 16]; however, such attempts are limited by paid access to the articles and by the efficiency of extraction from a variety of article distribution file formats. We thus identify expansion beyond this 600,000-article training corpus as a major barrier to non-proprietary bibliometrics.
Articles appearing in PMC-OAS referenced articles that were not included in the corpus. This means that the PMIDs of the latter appeared in the citation network and thus received a PageRank. However, due to the limited inclusion set of this work, the PageRank (and thus relative ordering) is by no means final and would inevitably change should expansion to the whole of MEDLINE become feasible.
Other methods of importance quantification
Thus far, importance analysis has been derived from article citation networks alone. However, importance is a non-static entity, with the impact of papers going beyond that of who cites whom. Indeed, the importance of a particular work may be represented by its spread through the scientific community, rather than by the ‘acknowledgement-based’ system of the traditional publishing model. Social media may provide a real-time window into this community dissemination.
Altmetrics, the use of the social web for insight into article impact, has previously shown promise in correlation with citation count and may therefore add to bibliometrics through real-time importance weighting. Consideration of social impact is beyond the scope of this research, though it provides an exciting avenue for further exploration, perhaps in conjunction with PageRank.
PageRank is a novel method for determining the importance of biomedical literature. The possibility of commodity cluster hardware use and value re-calculation following corpus expansion suggests that curation of an open access citation network is not beyond the limits of a single maintainer. Whilst further work will inevitably be required to expand the network beyond the XML data-mining corpus of the PubMed Central open access subset, the 600,000-article training corpus provides a starting platform for PageRank’s addition to existing importance ranking methods.
PMC-OAS: PubMed Central open access subset
NLM: National Library of Medicine
XML: eXtensible Markup Language
FTP: File Transfer Protocol
The authors thank Prof Jamie Coleman, School of Clinical and Experimental Medicine, College of Medical and Dental Sciences, University of Birmingham, for his kind support throughout.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- NLM: Fact Sheet MEDLINE®. [http://www.nlm.nih.gov/pubs/factsheets/medline.html]. Accessed 2 Apr 2015.
- Lee KP, Schotland M, Bacchetti P, Bero LA. Association of journal quality indicators with methodological quality of clinical research articles. JAMA. 2002;287(21):2805–8.
- Adam D. The counting house. Nature. 2002;415(6873):726–9.
- Page L, Brin S, Motwani R, Winograd T. The PageRank Citation Ranking: Bringing Order to the Web. In: Stanford Digital Library Working Paper SIDL-WP-1999-0120. Stanford University. 1999.
- PMC: Open Access Subset [http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/]. Accessed 2 Apr 2015.
- PMC: FTP Service [http://www.ncbi.nlm.nih.gov/pmc/tools/ftp/]. Accessed 2 Apr 2015.
- Leykin A, Verschelde J, Zhuang Y. “Parallel Homotopy Algorithms to Solve Polynomial Systems”. Proceedings of ICMS 2006. 2006.
- Python Software Foundation: Python programming language [https://www.python.org]. Accessed 2 Apr 2015.
- Github: Panos Louridas (louridas) PageRank C++ implementation [https://github.com/louridas/pagerank]. Accessed 2 Apr 2015.
- Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA. 2004.
- Apache Software Foundation: Pig [http://pig.apache.org]. Accessed 2 Apr 2015.
- IBM: SPSS [http://www-01.ibm.com/software/uk/analytics/spss]. Accessed 2 Apr 2015.
- Raosoft: Sample size calculator [http://www.raosoft.com/samplesize.html]. Accessed 2 Apr 2015.
- Brock DC, editor. Understanding Moore’s law: four decades of innovation. Philadelphia: Chemical Heritage Press; 2006. ISBN 0941901416.
- Zou J, Le D, Thoma GR. Locating and parsing bibliographic references in HTML medical articles. Int J Doc Anal Recognit. 2010;13(2):107–19.
- Zhang X, Zou J, Le DX, Thoma GR. A structural SVM approach for reference parsing. BMC Bioinformatics. 2011;12 Suppl 3:S7. doi:10.1186/1471-2105-12-S3-S7.
- Melero R. Altmetrics - a complement to conventional metrics. Biochem Med (Zagreb). 2015;25(2):152–60.
- Thelwall M, Haustein S, Larivière V, Sugimoto CR. Do altmetrics work? Twitter and ten other social web services. PLoS ONE. 2013;8(5):e64841.