rOpenSci | When Field or Lab Work is not an Option - Leveraging Open Data Resources for Remote Research

The COVID-19 pandemic has dramatically impacted all of our lives in a very short period of time. Spring and summer are usually very busy as students prepare to go to the field to engage in various data collection efforts. The pandemic has disrupted these carefully planned activities as travel is suspended and local and remote field stations have closed indefinitely. A lost field season can be a major setback for a dissertation timeline, and students will have to improvise. One promising opportunity to continue research efforts during these unprecedented times is to take advantage of the massive amounts of open scientific data that are freely available. Open data can form the basis of a review, synthesis, or new research.

Inspired by tweets from Ethan White about “PhD research from a distance”, the rOpenSci team did an in-depth exploration of how we provide access to open data. Our goal is to inspire students to find research opportunities with open data and highlight some of the rOpenSci packages that already make programmatic access possible. We also highlight some examples of how specific collections of packages are being used right now in fields as varied as archaeology and climate science.

🔗 Exploring open data

Data are fundamental to scientific discovery, and leveraging new discoveries would not be possible without access to data 1. Although people rarely develop new research entirely on open data, these datasets provide an opportunity to reproduce and validate existing results and to improve models, and they can be combined with other data to generate new syntheses. The open science movement has been growing for over a decade, and all of that interest has surfaced numerous databases and repositories. The growing interest in reproducibility has also led to the creation of a plethora of open source software to access such data. rOpenSci’s core mission is to develop such tools, and to date we have built over 120 robust data-access packages. These packages provide access to an impressive variety and quantity of data:
eBird offers up 700 million observations, Crossref has 108 million records of scholarly works which include articles and books, Dryad makes available 13 terabytes of data associated with published papers, and GBIF has over 1.3 billion records of species worldwide.

We hope that this post and these tools provide inspiration for you to explore new data sources and research topics.

🔗 Data sources for your research

Many of rOpenSci’s tools are developed by practicing scientists and have strong communities behind them. We invited university faculty from our community of developer-researchers to highlight sources of open data for research in their fields.

🔗 Climate and weather

Brooke Anderson, Colorado State University

Research on weather and climate, and their impacts on humans and the environment, can draw on numerous excellent open data sources, including many made available through programmatic access to data collected and shared by institutions and monitoring networks. The US Geological Survey is a particularly exciting example, offering not only APIs for accessing their data, but also a full suite of R packages developed and shared through the USGS-R community. rOpenSci’s own rnoaa package provides access to data through a number of the US National Oceanic and Atmospheric Administration’s open data APIs, allowing for fast and convenient access from R to national or worldwide data on, among others, meteorological observations, sea ice, and tides and currents, while its bomrang package offers similar access to data from the Australian Government Bureau of Meteorology. Other rOpenSci packages provide access to weather- and climate-related data from the Iowa Environment Mesonet (riem), New Zealand’s National Climate Database (clifro), the US National Aeronautics and Space Administration’s Prediction of Worldwide Energy Resource (POWER) dataset (nasapower), the US National Centers for Environmental Information’s Global Surface Summary of the Day (GSOD) dataset (GSODR), the US National Hurricane Center (rrricanes), the Flanders Environment Agency and Flanders Hydraulics Research’s waterinfo.be dataset (wateRinfo), and Environment and Climate Change Canada (ECCC) (weathercan). bowerbird is a general-purpose package for maintaining local copies of a range of satellite- and model-derived environmental and climate data.

🔗 Water

Louise Slater, University of Oxford, Sam Zipper, University of Kansas, Ilaria Prosdocimi, Ca’ Foscari University, Sam Albers, Government of British Columbia, and Claudia Vitolo, European Centre for Medium-Range Weather Forecasts

In hydrology, there has been rapid growth in the number of streamflow data archives made publicly available online by countries such as the UK (rnrfa package), USA (dataRetrieval package), Greece (rOpenSci’s hydroscoper package), and Canada (rOpenSci’s tidyhydat package), although most countries sadly do not yet apply an open policy to their hydrological data. The Task View on Hydrological Data and Modelling and the accompanying blog post Getting your toes wet in R: Hydrology, meteorology, and more provide an exciting overview of the most up-to-date R packages available for downloading, analysing, and modelling these data. For an overview of the many advantages of using R for hydrological research, see the paper “Using R in Hydrology” 2, which describes approaches to retrieve, analyse, map, model, and visualise hydrological data.

🔗 Antarctic and Southern Ocean

Ben Raymond, Australian Antarctic Division, and Anton Van de Putte, Royal Belgian Institute of Natural Sciences

Antarctic science has a strong culture of open data - the Antarctic Treaty itself states that scientific observations and results from Antarctica should be openly shared, and the Scientific Committee on Antarctic Research has had an active data management group since the late 1980s. To find Antarctic and Southern Ocean data, search the Antarctic Master Directory (metadata catalogue) or portals such as the Antarctic Biodiversity portal or the Southern Ocean Observing System.

The Antarctic rOpenSci community is developing R resources to support Antarctic and Southern Ocean science, with a particular emphasis on simplifying data access and performing common analytical tasks. See this blog post and task view for an overview of some of the packages in development, and the types of analyses that we are aiming to support.

🔗 Archaeology

Ben Marwick, University of Washington

Research shuddered to a stop in the Geoarchaeology Lab in early March, with UW being one of the first US campuses to switch to remote work. No longer able to go to campus, we turned our attention to computational text analysis of a large corpus of archaeological conference abstracts to look at questions about gender imbalance and theory change in our field. Our quick pivot to this new area was only possible thanks to high-quality, well-documented software such as rOpenSci’s tesseract, pdftools and magick packages. These enabled us to generate data rapidly, giving us more time for exploring and testing hypotheses, and ensuring our students could get to the end of the term ready to share some really interesting results.

We’ve been keeping up with the literature through in-depth study of new journal articles, especially those that include open data. Archaeologists use specialised repositories such as the Digital Archaeological Record (tDAR) and Open Context, as well as several generic repositories, to share data (e.g. Zenodo, Figshare, Dataverse - each of these has an R package to access data). There are R packages for accessing data hosted by those archaeology repositories (tdar, opencontext), but many of our favourite recent articles (we keep a list here) had their data openly archived on the Open Science Framework data repository. While studying these articles we have enjoyed using rOpenSci’s osfr package to quickly and reproducibly access these materials for in-depth exploration. A favourite type of data for many archaeologists is radiocarbon ages, and our group has been working with these with ease thanks to the c14bazAAR package. We’ve been using this package to get data to study radiocarbon dates from hundreds of archaeological sites in Australia. While we’re missing the lab, rOpenSci’s packages for acquiring archaeological data have been invaluable tools that keep us active and engaged in our research.

Our task view for archaeological science shows the full range of tools we use, from data acquisition through environmental and geological analysis to writing reproducible manuscripts.

🔗 Transport

Robin Lovelace, University of Leeds

There has never been a better time for data-driven and reproducible transport research. The COVID-19 pandemic has disrupted transport patterns worldwide. This has led to changes, such as the construction of ‘pop-up’ active transport infrastructure, the prioritisation of which can be supported by reproducible and open data analysis, as outlined in a preprint on the topic 3 (the analysis for which was undertaken in R). There is a wealth of data out there that can be found with careful search queries, including many new datasets (like Uber’s micromobility datasets, released on May 6th of this year).

  • For downloading data representing transport networks, I recommend heading to the overpass website and for R users checking out osmdata and the in-development geofabric (to be renamed) R packages.

  • For open origin-destination data there are many resources, but the PCT package provides a way to access national-scale datasets quickly from the R command line, as outlined in stplanr’s Origin-destination vignette.

  • For road safety data there is a lack of open data in many countries, but you can access national road casualty data, with 60+ variables and 100,000+ records each year, using the stats19 package.

  • For links to additional resources I recommend Chapter 12 of Geocomputation with R and Chapter 11 of QGIS for transport researchers.

  • For inspiration, I recommend checking out the Propensity to Cycle Tool, an interactive free and open web app that is being used to inform active transport investment plans in dozens of cities across the UK (it also has many data download options at zone, route and route network levels).
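As a rough illustration of the first bullet above, the sketch below uses osmdata to request cycleway geometries for one city from the Overpass API. The place name and the highway=cycleway tag are illustrative choices rather than anything prescribed here, and the code assumes that the osmdata and sf packages are installed and that the Overpass and Nominatim services are reachable.

```r
# A minimal sketch, assuming osmdata and sf are installed:
# install.packages(c("osmdata", "sf"))
library(osmdata)

# Build an Overpass query for a named place (illustrative choice of city).
# Passing a place name makes osmdata look up the bounding box for us.
q <- opq(bbox = "Leeds, UK")

# Restrict the query to features tagged highway=cycleway (illustrative tag).
q <- add_osm_feature(q, key = "highway", value = "cycleway")

# Run the query and return the result as simple features (sf) objects.
cycleways <- osmdata_sf(q)

# The line geometries (the cycleways themselves) are in `osm_lines`.
print(nrow(cycleways$osm_lines))
```

Swapping the key and value pair (for example to other highway types) is usually enough to adapt the query to a different part of the network.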

🔗 Taxonomy, biodiversity, ecology

rOpenSci has its roots in software for biodiversity research, with many packages in the areas of taxonomy, biological occurrences, and natural history/traits.

  • taxonomy: A good place to start is the taxonomy task view, covering many options for working with online taxonomy data

  • occurrences: Occurrence data forms the basis of much ecological research. The largest source of occurrence data, GBIF, can be accessed with the rgbif package. Many more are listed in the README for the package spocc.

  • natural history/traits: Conservation researchers may want to fetch data from the IUCN Red List via rredlist, FishBase life history data via rfishbase, bird data via auk or rebird, or trait data for various marine taxa from WoRMS (called “attributes” by WoRMS) via worrms.
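To give a flavour of the occurrence workflow mentioned above, here is a minimal sketch with rgbif. The species name, the record limit, and the columns printed are illustrative assumptions; the code assumes rgbif is installed and that the GBIF API is reachable.

```r
# A minimal sketch, assuming rgbif is installed: install.packages("rgbif")
library(rgbif)

# Match a scientific name against the GBIF backbone taxonomy.
# The species here is an illustrative choice; any scientific name works.
backbone <- name_backbone(name = "Colibri cyanotus")

# Fetch a small sample of georeferenced occurrence records for that taxon.
# `limit` is kept low so this stays a quick exploratory query.
res <- occ_search(
  taxonKey = backbone$usageKey,
  hasCoordinate = TRUE,
  limit = 20
)

# Records come back as a data frame in `res$data`; show a few common
# columns, keeping only those present in this particular result.
cols <- intersect(
  c("scientificName", "decimalLatitude", "decimalLongitude", "eventDate"),
  names(res$data)
)
print(head(res$data[, cols]))
```

For larger or citable downloads, GBIF's asynchronous download service is generally the better route; the search call above is only meant for quick exploration.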

A good general resource for rOpenSci packages on biodiversity is the rOpenSci Community Call from March 2019: Research Applications of rOpenSci Taxonomy and Biodiversity Tools.

 

Browse our table of > 100 data-access packages (under the bird) or jump ahead to see where you come in.

Lesser Violetear (Colibri cyanotus). Carlos Sanchez, Macaulay Library | eBird.

🔗 rOpenSci data-access packages

The table below shows a subset of our full suite of R packages. You can find scientific use cases for a package on our main page by clicking on a package name.

R package | Data and source | Maintainer
antanym | Antarctic geographic names. Composite Gazetteer of Antarctica | Ben Raymond
AntWeb | Ant data. AntWeb database from the California Academy of Sciences | Karthik Ram
auk | bird sighting records. http://ebird.org | Matthew Strimas-Mackey
bikedata | Historic ride data from public hire bicycle systems. London, U.K.; in the U.S.A.: San Francisco CA, New York City NY, Chicago IL, Washington DC, Boston MA, Los Angeles CA, Philadelphia PA, and Minnesota; Montreal, Canada; and Guadalajara, Mexico. | Mark Padgham
biomartr | genomic data retrieval. ‘NCBI RefSeq’, ‘NCBI Genbank’, ‘ENSEMBL’, and ‘UniProt’ databases, plus interface to ‘BioMart’ database | Hajk-Georg Drost
bittrex | Bittrex crypto-currency exchange. https://bittrex.com | Michael Kane
bold | Bold Systems for genetic barcode data. http://www.boldsystems.org | Scott Chamberlain
brranching | phylogenetic data. ‘Phylomatic’ http://phylodiversity.net/phylomatic, and ‘Phylocom’ https://github.com/phylocom/phylocom | Scott Chamberlain
camsRad | Time series of global, direct, and diffuse irradiations on horizontal surface. Copernicus Atmosphere Monitoring Service (CAMS) | Lukas Lundstrom
ccafs | Climate Change, Agriculture, and Food Security (CCAFS) General Circulation Models. | Scott Chamberlain
chromer | Chromosome Counts Database. http://ccdb.tau.ac.il | Paula Andrea Martinez
clifro | New Zealand National Climate Database. https://cliflo.niwa.co.nz | Blake Seers
comtradr | United Nations Comtrade data. https://comtrade.un.org/data | Chris Muir
cRegulome | transcription factor/microRNA-gene correlations (co-expression) in cancer. Cistrome Cancer Liu et al. (2011) doi:10.1186/gb-2011-12-8-r83 and ‘miRCancerdb’ databases (in press). | Mahmoud Ahmed
dbhydroR | South Florida Water Management District’s ‘DBHYDRO’ database. https://www.sfwmd.gov/science-data/dbhydro | Joseph Stachelek
DoOR.data | Drosophila odorant response data for DoOR.functions. | Daniel Münch
ecoengine | Georeferenced specimen records from the University of California, Berkeley’s Natural History Museums. https://ecoengine.berkeley.edu | Karthik Ram
epubr | reading and parsing of internal e-book content from EPUB files. EPUB e-books. | Matthew Leonawicz
essurvey | European Social Survey data. http://www.europeansocialsurvey.org | Jorge Cimentada
FedData | Geospatial data from several federated data sources (mainly sources maintained by the US federal government). National Elevation Dataset, National Hydrography Dataset (USGS), the Soil Survey Geographic (SSURGO) database, the Global Historical Climatology Network (GHCN), the Daymet gridded estimates of daily weather parameters, the International Tree Ring Data Bank, and the National Land Cover Database (NLCD). | R. Kyle Bocinsky
fingertipsR | Data for many indicators of public health in England. http://fingertips.phe.org.uk | Sebastian Fox
genderdata | Historical datasets of first names and dates of birth. | Lincoln Mullen
getCRUCLdata | University of East Anglia Climate Research Unit gridded climatology of monthly means. https://crudata.uea.ac.uk/cru/data/hrg/tmc/readme.txt | Adam Sparks
getlandsat | Landsat 8 Data. https://registry.opendata.aws/landsat-8 | Scott Chamberlain
GSODR | Global Surface Summary of the Day (GSOD) weather data from USA National Centers for Environmental Information (NCEI). http://www1.ncdc.noaa.gov/pub/data/gsod/readme.txt | Adam Sparks
gtfsr | public GTFS feeds. | Danton Noriega-Goodwin
gutenbergr | Project Gutenberg collection. http://www.gutenberg.org | David Robinson
hathi | HathiTrust bibliographic API. https://www.hathitrust.org | Scott Chamberlain
hddtools | hydrological data. various data providers | Claudia Vitolo
helminthR | London Natural History Museum’s host-parasite database. http://www.nhm.ac.uk/research-curation/scientific-resources/taxonomy-systematics/host-parasites | Tad Dallas
historydata | sample data sets for historians on population, institutional, religious, military, and prosopographical data. | Lincoln Mullen
hydroscoper | Greek National Data Bank for Hydrological and Meteorological Information. http://www.hydroscope.gr | Konstantinos Vantas
internetarchive | Internet Archive. https://archive.org/ | Lincoln Mullen
isdparser | NOAA Integrated Surface Data. https://www.ncdc.noaa.gov/isd | Scott Chamberlain
jaod | Directory of Open Access Journals. https://doaj.org | Scott Chamberlain
MODIStsp | time series of rasters from MODIS Satellite Land Products data. | Lorenzo Busetto
musemeta | museum metadata. Many different museums, including the MET, Getty Museum, and more | Scott Chamberlain
nasapower | NASA POWER (Prediction Of Worldwide Energy Resource) global meteorology and surface solar energy climatology data. https://power.larc.nasa.gov | Adam H. Sparks
natserv | NatureServe. https://www.natureserve.org | Scott Chamberlain
neotoma | paleoecological datasets from the Neotoma Paleoecological Database. http://api.neotomadb.org | Simon J. Goring
nomisr | UK official statistics from the Nomis database, including data from the Census, the Labour Force Survey, DWP benefit statistics and other economic and demographic data from the Office for National Statistics. https://www.nomisweb.co.uk/api/v01/help | Evan Odell
onekp | Transcriptomes of over 1000 plant species. The 1000 Plants Initiative (www.onekp.com) | Zebulun Arendsee
opencontext | Open Context data. https://opencontext.org | Ben Marwick
originr | Species origin data from multiple sources. Encyclopedia of Life (http://eol.org), Flora ‘Europaea’ (http://rbg-web2.rbge.org.uk/FE/fe.html), Global Invasive Species Database (http://www.iucngisd.org/gisd), the Native Species Resolver (http://bien.nceas.ucsb.edu/bien/tools/nsr/), Integrated Taxonomic Information Service (http://www.itis.gov/), and Global Register of Introduced and Invasive Species (http://www.griis.org/). | Scott Chamberlain
osmdata | OpenStreetMap data. https://openstreetmap.org | Mark Padgham
ots | Ocean time series datasets, including BATS, HOT, and more. | Scott Chamberlain
paleobioDB | PaleobioDB fossil data. http://paleobiodb.org/data1.1 | Sara Varela
pangaear | Pangaea Database. https://www.pangaea.de | Scott Chamberlain
phylotaR | Orthologous sequence clusters within taxonomic groups from GenBank. https://www.ncbi.nlm.nih.gov/genbank | Dom Bennett
pleiades | Pleiades data. https://pleiades.stoa.org | Scott Chamberlain
prism | Oregon State Prism climate data. http://www.prism.oregonstate.edu/ | Alan Butler
qualtRics | Survey results from the Qualtrics API. https://www.qualtrics.com/about | Julia Silge
rAvis | proyectoavis database. http://proyectoavis.com | Sara Varela
rbace | Bielefeld Academic Search Engine (BASE) of more than 150 million scholarly documents from more than 7000 sources. https://www.base-search.net | Scott Chamberlain
rbhl | Biodiversity Heritage Library (BHL) of digitized literature on biodiversity studies. https://www.biodiversitylibrary.org | Scott Chamberlain
rbison | USGS BISON database for species occurrence data from the United States. https://bison.usgs.gov | Scott Chamberlain
rbraries | Libraries.io data from 36 different package managers for programming languages. https://libraries.io/api | Scott Chamberlain
rcoreoa | CORE API aggregates open access research outputs from repositories and journals. https://core.ac.uk/docs | Scott Chamberlain
rdatacite | DataCite metadata. https://www.datacite.org | Scott Chamberlain
rdataretriever | Data Retriever. http://data-retriever.org | Henry Senyondo
rdefra | DEFRA’s UK-AIR website. https://uk-air.defra.gov.uk | Claudia Vitolo
rdopa | DOPA (Digital Observatory for protected Areas) by the European Union Joint Research Centre. | Joona Lehtomaki
rdryad | Dryad ‘Solr’ data underlying scientific publications. https://datadryad.org | Scott Chamberlain
rebird | eBird database of bird observations and locations. https://ebird.org/home | Sebastian Pardo
rentrez | NCBI’s EUtils API for databases like GenBank and PubMed. https://www.ncbi.nlm.nih.gov/genbank https://www.ncbi.nlm.nih.gov/pubmed | David Winter
rerddap | ERDDAP servers. https://upwell.pfeg.noaa.gov/erddap/information.html | Scott Chamberlain
rfishbase | Fishbase data on over 30,000 species of fish, their biology, ecology, morphology and more. http://www.fishbase.org http://www.sealifebase.org | Carl Boettiger
rfisheries | openfisheries.org. http://www.openfisheries.org/ | Karthik Ram
rfna | Flora of North America website data. http://www.efloras.org | Scott Chamberlain
rgbif | Global Biodiversity Information Facility (GBIF) data of species occurrence. https://www.gbif.org/developer/summary | Scott Chamberlain
rglobi | Global Biotic Interactions (GloBI) data on spatial-temporal species interactions. https://www.globalbioticinteractions.org/ | Jorrit Poelen
rgpdd | Global Population Dynamics Database. https://ecologicaldata.org/wiki/global-population-dynamics-database | Carl Boettiger
riem | Weather data from Automated Surface Observing System (ASOS) stations. Iowa Environment Mesonet website. | Maëlle Salmon
rif | Neuroscience Information Framework (NIF) data. https://neuinfo.org | Scott Chamberlain
rinat | iNaturalist website of species occurrence data submitted by citizen scientists. http://inaturalist.org | Stéphane Guillou
rnaturalearthdata | Vector map data. http://www.naturalearthdata.com | Andy South
rnoaa | Many NOAA data sources including NCDC climate data, and data on sea ice, severe weather, historical metadata, storm and tornado data. https://www.ncdc.noaa.gov/cdo-web/webservices/v2 | Scott Chamberlain
rnpn | National Phenology Network data on various life history events that occur at specific times. https://usanpn.org | Scott Chamberlain
ropenaq | air quality data from the OpenAQ platform. https://docs.openaq.org | Maëlle Salmon
rotl | Open Tree of Life data on phylogenetic trees. https://tree.opentreeoflife.org/ | Francois Michonneau
rperseus | Perseus Digital Library collection of classical texts. http://cts.perseids.org | David Ranzolin
rppo | Global Plant Phenology Data Portal. https://www.plantphenology.org | John Deck
rredlist | IUCN Red List of threatened and endangered species. http://apiv3.iucnredlist.org/api/v3/docs | Scott Chamberlain
rrricanes | Data on past and current hurricanes and tropical storms for the Atlantic and eastern Pacific oceans. https://www.nhc.noaa.gov/archive/1998/1998archive.shtml | Tim Trice
rrricanesdata | Storm discussions, forecast/advisories, public advisories, wind speed probabilities, strike probabilities and more. National Hurricane Center | Tim Trice
rsnps | SNP datasets for SNPs, genotypes, and phenotypes. https://opensnp.org https://www.ncbi.nlm.nih.gov/projects/SNP | Julia Gustavsen
rusda | United States Department of Agriculture (USDA) data from the Systematic Mycology and Microbiology Laboratory (SMML). | Franz-Sebastian Krah
rvertnet | VertNet.org archives including taxonomic names, places, and dates. http://vertnet.org | Scott Chamberlain
rWBclimate | Model predictions from 15 different global circulation models in 20 years. | Edmund Hart
skynet | air transport statistics from the Bureau of Transport Statistics (BTS) in the United States. https://www.transtats.bts.gov/databases.asp?Mode_ID=1&Mode_Desc=Aviation&Subject_ID2=0 | Filipe Teixeira
smapr | NASA Soil Moisture Active Passive (SMAP) data. https://smap.jpl.nasa.gov/ | Maxwell Joseph
solrium | data from Solr. https://lucene.apache.org/solr | Scott Chamberlain
spocc | species occurrence data sources, including Global Biodiversity Information. | Scott Chamberlain
suppdata | Supplementary materials from published manuscripts. | William D. Pearse
tidyhydat | Historical and real-time national hydrometric data from Water Survey of Canada data sources. http://dd.weather.gc.ca/hydrometric/csv http://collaboration.cmc.ec.gc.ca/cmc/hydrometrics/www | Sam Albers
tradestatistics | Access Open Trade Statistics API from R to download international trade data. | Mauricio Vargas
traits | Species trait data from many different sources, including sequence data from NCBI, plant trait data from BETYdb, plant data from the USDA plants database, data from EOL Traitbank, Coral traits data, Birdlife International, and more. | Scott Chamberlain
treebase | TreeBASE repository of phylogenetic trees (of species, population, or genes). http://treebase.org | Carl Boettiger
USAboundaries | Boundaries for geographical units in the United States of America. U.S. Census Bureau, Newberry Library’s ‘Atlas of Historical County Boundaries’ | Lincoln Mullen
USAboundariesData | Higher resolution boundary data, for use in the USAboundaries package. U.S. Census Bureau, the Newberry Library’s ‘Historical Atlas of U.S. County Boundaries’, and Erik Steiner’s ‘United States Historical City Populations, 1790-2010’. | Lincoln Mullen
weathercan | Historical weather data from Environment and Climate Change Canada. http://climate.weather.gc.ca/historical_data/search_historic_data_e.html | Steffi LaZerte
webchem | Chemical information from around the web. | Tamás Stirling

 

🔗 This is where you come in!

Have you successfully used one or more of these data sources in your research? We want others to imagine what’s possible by seeing examples. Share your story in the comments and cite your paper or preprint if it’s published.

Is there a data source you want to access programmatically but there’s no R package to do that? Tell us about it in the comments.

Need help? Ask in our discussion forum and we’ll do our best to get you answers.


  1. Tierney, N. J., & Ram, K. (2020). A Realistic Guide to Making Data Available Alongside Code to Improve Reproducibility. arXiv preprint arXiv:2002.11626. https://arxiv.org/abs/2002.11626

  2. Slater, L. J., Thirel, G., Harrigan, S., Delaigue, O., Hurley, A., Khouakhi, A., Prosdocimi, I., Vitolo, C., & Smith, K. (2019). Using R in hydrology: a review of recent developments and future directions. Hydrology and Earth System Sciences, 23(7), 2939-2963. https://www.hydrol-earth-syst-sci.net/23/2939/2019/

  3. Lovelace, R., Morgan, M., Talbot, J., & Lucas-Smith, M. (2020, May 11). Methods to prioritise pop-up active transport infrastructure. https://doi.org/10.31219/osf.io/7wjb6