Data from: Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis

  • Brian O'Meara (Contributor)
  • Dan F. Rosauer (Contributor)
  • Arlin Stoltzfus (Contributor)
  • Ross Mounce (Contributor)
  • Jamie Whitacre (Contributor)
  • Emily L. Gillespie (Contributor)
  • Sudhir Kumar (Arizona State University) (Contributor)
  • Rutger A. Vos (Contributor)



BACKGROUND: Recently, various evolution-related journals adopted policies to encourage or require archiving of phylogenetic trees and associated data. Such attention to practices that promote data sharing reflects rapidly improving information technology, and rapidly expanding potential to use this technology to aggregate and link data from previously published research. Nevertheless, little is known about current practices, or best practices, for publishing phylogenetic trees and associated data in a way that promotes re-use.

RESULTS: Here we summarize results of an ongoing analysis of current practices for archiving phylogenetic trees and associated data, current practices of re-use, and current barriers to re-use. We find that the technical infrastructure is available to support rudimentary archiving, but the frequency of archiving is low. Currently, most phylogenetic knowledge is not easily re-used due to a lack of archiving, lack of awareness of best practices, and lack of community-wide standards for formatting data, naming entities, and annotating data. Most attempts at data re-use seem to end in disappointment. Nevertheless, we find many positive examples of data re-use, particularly those that involve customized species trees generated by grafting to, and pruning from, a mega-tree.

CONCLUSIONS: The technologies and practices that facilitate data re-use can catalyze synthetic and integrative research. However, success will require engagement from various stakeholders including individual scientists who produce or consume shareable data, publishers, policy-makers, technology developers and resource-providers.
The critical challenges for facilitating re-use of phylogenetic trees and associated data, we suggest, include: a broader commitment to public archiving; more extensive use of globally meaningful identifiers; development of user-friendly technology for annotating, submitting, searching, and retrieving data and their metadata; and development of a minimum reporting standard (MIAPA) indicating which kinds of data and metadata are most important for a re-useable phylogenetic record.

Literature Sample 1: conducted in April 2011, journals: American Journal of Botany & Evolution
LitSample1. To get a sense of current practices, AS and BO picked 2 journals, Evolution and Am J Bot, and looked at every one of the 32 regular articles in the April 2011 issues. Evolution is the premier trade journal for organismal evolutionary biologists. American Journal of Botany is a frequent venue for phylogenetic systematics.
LitSample1_Apr2011_AmJBot_Evol.csv

Literature Sample 2: 40 recently published phylogeny-related articles
We searched Thomson Reuters Web of Science (WoS) in May of 2011 for articles matching 'phylogen*' in title or 'topic'. WoS sorted the results by 'relevance', and we picked 40 articles from the top of the list. We deliberately chose this approach to select articles likely to focus on phylogeny, rather than to mention it peripherally. However, because we do not know exactly what 'topic' and 'relevance' mean in this case (and WoS does not make its methodology clear to users), we cannot be certain what kind of a sample this represents. Of the 40 articles, 38 report new trees, considerably more than the 27/40 expected by chance for an article that matches 'phylogen*' anywhere (see below). The file "LitSample2_40RecentPhylogenInDepth.csv" contains extensive notes on the 40 articles.
This spreadsheet was populated by an online fillable form that is available from the authors on request (in case any reader would like to analyze their own literature sample).
LitSample2_40RecentPhylogenInDepth.csv

Literature Sample 3: 100 randomly-selected 'phylogen*' articles published in 2010
The sole purpose of this survey was to estimate the frequency of reports of new trees among 2010 publications. We first searched Web of Science for 2010 papers that matched 'phylogen*' in any field. Many of the 11,664 matching publications might be false positives, i.e., papers that refer to 'phylogen*' in some way, but do not report a new tree. To estimate this fraction, we picked 100 papers at random. Each paper was assigned to BO, AS or RM for individual evaluation, with the result that 66 of the 100 papers reported a new tree. The file "LitSample3_100RandomPhylogen2010.csv" contains results of the analysis of the sample of 100 publications. The spreadsheet contains little beyond a determination of whether each paper reports a new tree. This spreadsheet was populated by an online fillable form that is available from the authors on request (in case any reader would like to analyze their own literature sample).

We also considered false negatives due to papers that report a new phylogeny, but avoid the term 'phylogen*', using instead some term such as 'dendrogram', 'cladogram' or 'tree'. Because 'tree' has many non-phylogenetic uses, we used a restricted search methodology based on other terms associated with phylogenies, such as 'SSU' or 'cytb' and so on. By comparing matches to 'SSU + tree -phylogeny' to those for 'SSU + phylogeny', we can estimate how often authors use 'tree' as a synonym while avoiding 'phylogeny'. The 'tree'-only search returned only about 1/100 as many hits, and many of these referred to "trees" that were not phylogenetic trees. Thus, the results suggest that phylogeny synonyms would increase the yield by less than 1%.
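The arithmetic behind the Sample 3 estimate can be sketched as follows. The counts (66 of 100 sampled papers, 11,664 matching publications) come from the description above; the Wilson score interval is an assumption added here to convey the sampling uncertainty and is not part of the original analysis.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.
    (Assumption: chosen for illustration; the source reports only the raw count.)"""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

sample_size = 100       # papers picked at random
new_tree_count = 66     # sampled papers that reported a new tree
total_matches = 11664   # 2010 WoS publications matching 'phylogen*'

fraction = new_tree_count / sample_size
low, high = wilson_interval(new_tree_count, sample_size)

# Point estimate of how many 2010 'phylogen*' publications report a new tree.
estimated_new_tree_papers = round(fraction * total_matches)

print(f"fraction with new tree: {fraction:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
print(f"estimated publications with a new tree: {estimated_new_tree_papers}")
```

With a sample of only 100 papers the interval is wide, so the point estimate is best read as a rough figure.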
We did not estimate false negatives due to poor indexing, or non-indexing, in Web of Science. Web of Science may contain information on articles that are indexed very incompletely, e.g., articles for which only the citation information is available, without keywords or abstract. A poorly indexed article that reports a phylogeny will only be found if 'phylogen*' appears in the title. We also did not estimate the number of false negatives due to phylogeny reports that are not indexed at all in Web of Science. This is difficult to do, but one approach would be to take a very carefully researched review article, e.g., on phylogeny of major reptile groups, and then assess what fraction of cited phylogeny articles can be found in WoS. Apropos, TimeTree has nearly a thousand articles in its database, and a substantial fraction are not indexed in PubMed.
LitSample3_100RandomPhylogen2010.csv

Archive Sample Analysis of All Dryad 2010 studies matching keyword: 'phylogen*'
All TreeBASE entries have trees, but not all Dryad packages for phylogeny papers have decodable (i.e., not graphic) trees. Using the Dryad search interface in August 2011, AS found 32 entries for 2010 studies in Dryad that match "phylogen". In this group, AS found one server error. Among the remainder, there were 24 packages without any phylogeny in decodable form, and 7 packages with one or more phylogenies in decodable form.
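One way to picture what "decodable" might mean in practice is a rough text check for a NEXUS TREES block or a bare Newick string. This is a minimal sketch under stated assumptions: the function names are hypothetical, the checks are heuristics, and nothing here reproduces how AS actually inspected the packages.

```python
import re

def has_nexus_trees_block(text: str) -> bool:
    """Heuristic: a NEXUS file carries trees only if it has a 'BEGIN TREES;' block."""
    return re.search(r"begin\s+trees\s*;", text, flags=re.IGNORECASE) is not None

def looks_like_newick(text: str) -> bool:
    """Heuristic: starts with '(', ends with ';', and parentheses balance.
    (Assumption: quoted labels containing parentheses are not handled.)"""
    s = text.strip()
    if not (s.startswith("(") and s.endswith(";")):
        return False
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

nexus_with_trees = "#NEXUS\nBEGIN TREES;\n  tree t1 = ((A,B),C);\nEND;\n"
nexus_data_only = "#NEXUS\nBEGIN DATA;\n  dimensions ntax=3 nchar=4;\nEND;\n"

print(has_nexus_trees_block(nexus_with_trees))
print(has_nexus_trees_block(nexus_data_only))
print(looks_like_newick("((A,B),C);"))
```

A real survey would still need manual inspection, since a tree embedded in a figure or PDF passes neither check.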
Note that most of the NEXUS files do not have trees, and that there are trees in non-NEXUS formats, e.g., some are just Newick strings in text files. The file "ArchiveSample_AllDryad_2010_Phylogen.csv" is a spreadsheet with the results of this very brief analysis.
ArchiveSample_AllDryad_2010_Phylogen.csv

User stories of barriers to data re-use encountered
As part of a MIAPA exercise we gathered and analyzed stories of phylogeny use and re-use, based on our own experiences and those of colleagues who shared this information as personal communications. This material provides a basis for many aspects of the taxonomy of barriers to re-use in the text, and for individual comments about problems that users experience, such as inconsistent names, re-doing analyses, etc.
UserStories_BarriersToReUse.pdf

README
This file describes the contents of the supplementary data package for Stoltzfus et al., Sharing Phylogenetic Trees. The package includes this README file, a PDF file with user stories, and 4 spreadsheets (for 3 literature samples plus 1 quick analysis of Dryad content):
  * LitSample1_Apr2011_AmJBot_Evol.csv - all pubs from 2 April issues
  * LitSample2_40RecentPhylogenInDepth.csv - sample of 40 recent phylogen* pubs
  * LitSample3_100RandomPhylogen2010.csv - random sample of 2010 phylogen* pubs
  * ArchiveSample_AllDryad_2010_Phylogen.csv
  * UserStories_BarriersToReUse.pdf - user stories and taxonomy of barriers
  * README = this file
This README file is Unicode (UTF-8) with Unix/Linux line endings. The .csv files are also in Unicode (UTF-8), with field delimiter symbol , (comma) and text delimiter symbol " (double quote mark).
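The CSV settings stated in the README (UTF-8, comma field delimiter, double-quote text delimiter) map directly onto Python's standard csv module. The sketch below is illustrative only: the helper names and the column names in the in-memory sample are hypothetical, since the actual column headers are not listed here.

```python
import csv
import io

def read_rows(handle):
    """Parse CSV text using the delimiters stated in the README."""
    reader = csv.DictReader(handle, delimiter=",", quotechar='"')
    return list(reader)

def read_csv_file(path):
    """Open one of the package's .csv files as UTF-8 and return its rows as dicts."""
    with open(path, encoding="utf-8", newline="") as handle:
        return read_rows(handle)

# In-memory stand-in for a spreadsheet; 'article' and 'new_tree' are
# hypothetical column names, not taken from the real files.
sample = io.StringIO('article,new_tree\n"Doe, J. 2010",yes\n')
rows = read_rows(sample)
print(rows[0]["article"], rows[0]["new_tree"])
```

For the real files, a call such as read_csv_file("LitSample3_100RandomPhylogen2010.csv") would return one dict per row, keyed by whatever headers the file actually contains.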
Date made available: Jan 1 2012
