rOpenSci | Literature

Analyze Scientific Papers (and Text in General)

pdftools

Text Extraction, Rendering and Converting of PDF Documents

Jeroen Ooms
Description

Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a PDF file. Also supports high-quality rendering of PDF documents into PNG, JPEG, or TIFF format, or into raw bitmap vectors for further processing in R.

Scientific use cases
  1. Cole, C. B., Patel, S., French, L., & Knight, J. (2016). Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi. https://doi.org/10.1101/073460
  2. Krotov, V., & Tennyson, M. (2018). Scraping Financial Data from the Web Using R Language. Journal of Emerging Technologies in Accounting. https://doi.org/10.2308/jeta-52063
  3. Iqbal, J. (2019). Managerial Self-Attribution Bias and Banks’ Future Performance: Evidence from Emerging Economies. Journal of Risk and Financial Management, 12(2), 73. https://doi.org/10.3390/jrfm12020073
  4. Hanna, A., & Hanna, L.-A. (2019). Topic Analysis of UK Fitness to Practise Cases: What Lessons Can Be Learnt? Pharmacy, 7(3), 130. https://doi.org/10.3390/pharmacy7030130
  5. Hwang, L. J., Pauloo, R. A., & Carlen, J. (2019). Assessing Impact of Outreach through Software Citation for Community Software in Geodynamics. Computing in Science & Engineering, 1–1. https://doi.org/10.1109/mcse.2019.2940221
  6. Ulibarri, N., & Scott, T. A. (2019). Environmental hazards, rigid institutions, and transformative change: How drought affects the consideration of water and climate impacts in infrastructure management. Global Environmental Change, 59, 102005. https://doi.org/10.1016/j.gloenvcha.2019.102005
  7. Lope, D. J., & Dolgun, A. (2020). Measuring the inequality of accessible trams in Melbourne. Journal of Transport Geography, 83, 102657. https://doi.org/10.1016/j.jtrangeo.2020.102657
  8. Verde Arregoitia, L. D., Teta, P., & D’Elía, G. (2020). Patterns in research and data sharing for the study of form and function in caviomorph rodents. Journal of Mammalogy. https://doi.org/10.1093/jmammal/gyaa002
  9. Hagan, A. K., Pollet, R. M., & Libertucci, J. (2020). Suggestions for Improving Invited Speaker Diversity To Reflect Trainee Diversity. Journal of Microbiology & Biology Education, 21(1). https://doi.org/10.1128/jmbe.v21i1.2105
  10. Berkel, C., & Cacan, E. (2020). GAB2 and GAB3 are expressed in a tumor stage-, grade- and histotype-dependent manner and are associated with shorter progression-free survival in ovarian cancer. Journal of Cell Communication and Signaling. https://doi.org/10.1007/s12079-020-00582-3
  11. Scott, T. A., Ulibarri, N., & Perez Figueroa, O. (2020). NEPA and National Trends in Federal Infrastructure Siting in the United States. Review of Policy Research. https://doi.org/10.1111/ropr.12399
  12. Roa-Ureta, R. H., Henríquez, J., & Molinet, C. (2020). Achieving sustainable exploitation through co-management in three Chilean small-scale fisheries. Fisheries Research, 230, 105674. https://doi.org/10.1016/j.fishres.2020.105674
  13. Westgate, M. J., Barton, P. S., Lindenmayer, D. B., & Andrew, N. R. (2020). Quantifying shifts in topic popularity over 44 years of Austral Ecology. Austral Ecology, 45(6), 663–671. https://doi.org/10.1111/aec.12938
  14. Marshall, B. M., Strine, C., & Hughes, A. C. (2020). Thousands of reptile species threatened by under-regulated global trade. Nature Communications, 11(1). https://doi.org/10.1038/s41467-020-18523-4
  15. Li, B., Trueman, B. F., Rahman, M. S., & Gagnon, G. A. (2021). Controlling lead release due to uniform and galvanic corrosion — An evaluation of silicate-based inhibitors. Journal of Hazardous Materials, 407, 124707. https://doi.org/10.1016/j.jhazmat.2020.124707
  16. Hines, R. E., Grandage, A. J., & Willoughby, K. G. (2020). Staying Afloat: Planning and Managing Climate Change and Sea Level Rise Risk in Florida’s Coastal Counties. Urban Affairs Review, 107808742098052. https://doi.org/10.1177/1078087420980526
patentsview
CRAN Peer-reviewed

An R Client to the PatentsView API

Christopher Baker
Description

Provides functions to simplify the PatentsView API (https://patentsview.org/apis/purpose) query language, send GET and POST requests to the API’s seven endpoints, and parse the data that comes back.

cld3

Google's Compact Language Detector 3

Jeroen Ooms
Description

Google's Compact Language Detector 3 is a neural network model for language identification and the successor of cld2 (available from CRAN). The algorithm is still experimental and takes a novel approach to language detection with different properties and outcomes. It can be useful to combine this with the Bayesian classifier results from cld2. See https://github.com/google/cld3#readme for more information.

bibtex
CRAN

Bibtex Parser

James Joseph Balamuta
Description

Utility to parse a bibtex file.

rdatacite
CRAN

Client for the DataCite API

Bianca Kramer
Description

Client for the web service methods provided by DataCite (https://www.datacite.org/), including functions to interface with their RESTful search API. The API is backed by Elasticsearch, allowing expressive queries, including faceting.

Scientific use cases
  1. Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
  2. White, L., & Santy, S. (2018). DataDepsGenerators.jl: making reusing data easy by automatically generating DataDeps.jl registration code. Journal of Open Source Software, 3(31), 921. https://doi.org/10.21105/joss.00921
lingtypology
CRAN Peer-reviewed

Linguistic Typology and Mapping

George Moroz
Description

Provides R with the Glottolog database https://glottolog.org/ and additional tools for linguistic mapping. The Glottolog database contains a catalogue of the world's languages. This package helps researchers make linguistic maps following the philosophy of the Cross-Linguistic Linked Data project https://clld.org/, which facilitates uniform access to the data across publications. A tutorial for this package is available on GitHub Pages https://docs.ropensci.org/lingtypology/ and in the package vignette. Maps created by this package can be used for both research and linguistics teaching. In addition, the package can download data from typological databases such as WALS, AUTOTYP and some others, and can create your own database website.

Scientific use cases
  1. Maisak, T. (2017). Repetitive prefix in Agul: Morphological copy from a closely related language. International Journal of Bilingualism, 136700691774006. https://doi.org/10.1177/1367006917740060
  2. Roettger, T., & Gordon, M. (2017). Methodological issues in the study of word stress correlates. Linguistics Vanguard, 3(1). http://www.linguistics.ucsb.edu/faculty/gordon/Roettger&Gordon_AcousticMethodologoy.pdf
  3. Hantgan-Sonko, A. (2020). Synchronic and diachronic strategies of mora preservation in Gújjolaay Eegimaa. Journal of African Languages and Literatures, (1), 1-25. http://www.politics.unina.it/index.php/jalalit/article/download/6732/7790
  4. Ye, J. (2020). Independent and dependent possessive person forms. Studies in Language, 44(2), 363–406. https://doi.org/10.1075/sl.19020.ye

katex

Rendering Math to HTML, MathML, or R-Documentation Format

Jeroen Ooms
Description

Convert LaTeX math expressions to HTML and MathML for use in markdown documents or package manual pages. The rendering is done in R using the V8 engine (i.e. server-side), which eliminates the need for embedding the MathJax library into your web pages. In addition, a math-to-Rd wrapper is provided to automatically render beautiful math in R documentation files.

rcrossref
CRAN

Client for Various CrossRef APIs

Najko Jahn
Description

Client for various CrossRef APIs. Supports metadata search with their old and newer search APIs, getting citations in various formats (including bibtex, citeproc-json, rdf-xml, etc.), converting DOIs to PMIDs and vice versa, getting citations for DOIs, and getting links to the full text of articles when available.

Scientific use cases
  1. Jahn, N., & Tullney, M. (2016). A study of institutional spending on open access publication fees in Germany. PeerJ, 4, e2323. https://doi.org/10.7717/peerj.2323
  2. Lammey, R. (2016). Using the Crossref Metadata API to explore publisher content. Sci Ed, 3(2), 109–111. https://doi.org/10.6087/kcse.75
  3. Bauer, P. C., Barbera, P., & Munzert, S. (2016). The Quality of Citations: Towards Quantifying Qualitative Impact in Social Science Research. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2874549
  4. Cho, H., & Yu, Y. (2018). Link prediction for interdisciplinary collaboration via co-authorship network. arXiv preprint arXiv:1803.06249. https://arxiv.org/pdf/1803.06249.pdf
  5. Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
  6. Hicks, D. J., Coil, D. A., Stahmer, C. G., & Eisen, J. A. (2019). Network analysis to evaluate the impact of research funding on research community consolidation. https://doi.org/10.1101/534495
  7. Olsson-Collentine, A., van Assen, M. A. L. M., & Hartgerink, C. H. J. (2019). The Prevalence of Marginally Significant Results in Psychology Over Time. Psychological Science, 095679761983032. https://doi.org/10.1177/0956797619830326
  8. Matthias, L., Jahn, N., & Laakso, M. (2019). The Two-Way Street of Open Access Journal Publishing - Flip It and Reverse It. Publications, 7(2), 23. https://doi.org/10.3390/publications7020023
  9. Mishra, P., & Narayan Tripathi, L. (2019). Characterization of two‐dimensional materials from Raman spectral data. Journal of Raman Spectroscopy. https://doi.org/10.1002/jrs.5744
  10. Fu, D. Y., & Hughey, J. J. (2019). Releasing a preprint is associated with more attention and citations for the peer-reviewed article. eLife, 8. https://doi.org/10.7554/elife.52646
  11. Fraser, N., Momeni, F., Mayr, P., & Peters, I. (2020). The relationship between bioRxiv preprints, citations and altmetrics. Quantitative Science Studies, 1–21. https://doi.org/10.1162/qss_a_00043
  12. Dion, M. L., Mitchell, S. M., & Sumner, J. L. (2020). Gender, seniority, and self-citation practices in political science. Scientometrics, 125(1), 1–28. https://doi.org/10.1007/s11192-020-03615-1
  13. Puschmann, C., & Pentzold, C. (2020). A field comes of age: tracking research on the internet within communication studies, 1994 to 2018. Internet Histories, 1–19. https://doi.org/10.1080/24701475.2020.1749805
  14. Benard, S., & Correll, S. J. (2010). Normative Discrimination and the Motherhood Penalty. Gender & Society, 24(5), 616–646. https://doi.org/10.1177/0891243210383142
  15. Clayson, P. E., Baldwin, S., & Larson, M. J. (2020). The Open Access Advantage for Studies of Human Electrophysiology: Impact on Citations and Altmetrics. https://doi.org/10.31234/osf.io/5xagd

oai

General Purpose OAI-PMH Services Client

Michal Bojanowski
Description

A general purpose client to work with any OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) service. The OAI-PMH protocol is described at http://www.openarchives.org/OAI/openarchivesprotocol.html. Functions are provided to work with the OAI-PMH verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets.
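
As a rough illustration of what such a client does under the hood, the sketch below builds OAI-PMH request URLs for the six verbs listed above. It is written in Python for neutrality and is not the package's R interface: the function name `oai_request_url`, the `REQUIRED_ARGS` table, and the `https://example.org/oai` base URL are all hypothetical, while the verb and argument names follow the OAI-PMH protocol document linked above.

```python
from urllib.parse import urlencode

# Arguments the OAI-PMH protocol requires for each verb.
# (Optional arguments such as `from`, `until`, and `set` are not modelled here.)
REQUIRED_ARGS = {
    "Identify": [],
    "ListMetadataFormats": [],
    "ListSets": [],
    "ListIdentifiers": ["metadataPrefix"],
    "ListRecords": ["metadataPrefix"],
    "GetRecord": ["identifier", "metadataPrefix"],
}


def oai_request_url(base_url, verb, **args):
    """Return a GET request URL for one OAI-PMH verb (hypothetical helper)."""
    if verb not in REQUIRED_ARGS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # A resumptionToken is an exclusive argument: when present, it replaces
    # the other arguments, so the required-argument check is skipped.
    if "resumptionToken" not in args:
        missing = [a for a in REQUIRED_ARGS[verb] if a not in args]
        if missing:
            raise ValueError(f"{verb} requires: {', '.join(missing)}")
    return f"{base_url}?{urlencode({'verb': verb, **args})}"


# Example with a placeholder repository URL and identifier.
print(oai_request_url("https://example.org/oai", "Identify"))
print(oai_request_url(
    "https://example.org/oai",
    "GetRecord",
    identifier="oai:example.org:1234",
    metadataPrefix="oai_dc",
))
```

A real client would additionally send the request, parse the XML response, and follow any `resumptionToken` to page through long result lists.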

Scientific use cases
  1. Peters, I., Kraker, P., Lex, E., Gumpenberger, C., & Gorraiz, J. I. (2017). Zenodo in the Spotlight of Traditional and New Metrics. Frontiers in Research Metrics and Analytics, 2. https://doi.org/10.3389/frma.2017.00013
tabulizer
Peer-reviewed

Bindings for Tabula PDF Table Extractor Library

Mauricio Vargas
Description

Bindings for the Tabula http://tabula.technology/ Java library, which can extract tables from PDF documents. The tabulizerjars package https://github.com/ropensci/tabulizerjars provides versioned Java .jar files, including all dependencies, aligned to releases of Tabula.

Scientific use cases
  1. Baquero, O. S., & Machado, G. (2018). Spatiotemporal dynamics and risk factors for human Leptospirosis in Brazil. Scientific Reports, 8(1). https://doi.org/10.1038/s41598-018-33381-3
  2. Prats, J., & Danis, P.-A. (2019). An epilimnion and hypolimnion temperature model based on air temperature and lake characteristics. Knowledge & Management of Aquatic Ecosystems, (420), 8. https://doi.org/10.1051/kmae/2019001

cld2

Google's Compact Language Detector 2

Jeroen Ooms
Description

Bindings to Google's C++ library Compact Language Detector 2 (see https://github.com/cld2owners/cld2#readme for more information). Probabilistically detects over 80 languages in plain text or HTML. For mixed-language input it returns the top three detected languages and their approximate proportion of the total classified text bytes (e.g. 80% English and 20% French out of 1000 bytes). There is also a cld3 package on CRAN which uses a neural network model instead.

Scientific use cases
  1. Martín-Martín, A., Orduna-Malea, E., Thelwall, M., & López-Cózar, E. D. (2018). Google Scholar, Web of Science, and Scopus: a systematic comparison of citations in 252 subject categories. arXiv preprint arXiv:1808.05053. https://arxiv.org/abs/1808.05053
  2. Albrecht, U.-V., Hasenfuß, G., & von Jan, U. (2018). Description of Cardiological Apps From the German App Store: Semiautomated Retrospective App Store Analysis. JMIR mHealth and uHealth, 6(11), e11753. https://doi.org/10.2196/11753
  3. Green, E. P., Whitcomb, A., Kahumbura, C., Rosen, J. G., Goyal, S., Achieng, D., & Bellows, B. (2019). What is the best method of family planning for me?: a text mining analysis of messages between users and agents of a digital health service in Kenya. Gates Open Research, 3, 1475. https://doi.org/10.12688/gatesopenres.12999.1
  4. Jaric, I., & Djeric, M. (2019). Curriculum and labor market: Comparative analysis of the curricular outcomes of the study program in sociology at the Faculty of Philosophy, University of Belgrade and the required competences in the labor market. Sociologija, 61(Suppl. 1), 718–741. https://doi.org/10.2298/soc19s1718j

unrtf

Extract Text from Rich Text Format (RTF) Documents

Jeroen Ooms
Description

Wraps the unrtf utility to extract text from RTF files. Supports document conversion to HTML, LaTeX or plain text. Output in HTML is recommended because unrtf has limited support for converting between character encodings.

qpdf

Split, Combine and Compress PDF Files

Jeroen Ooms
Description

Content-preserving transformations of PDF files such as split, combine, and compress. This package interfaces directly to the qpdf C++ API and does not require any command line utilities. Note that qpdf does not read actual content from PDF files: to extract text and data you need the pdftools package.

rtika

R Interface to Apache Tika

Sasha Goodman
Description

Extract text or metadata from over a thousand file types, using Apache Tika https://tika.apache.org/. Get either plain text or structured XHTML content.

tif

Text Interchange Format

Taylor Arnold
Description

Provides validation functions for common interchange formats for representing text data in R. Includes formats for corpus objects, document term matrices, and tokens. Other annotations can be stored by overloading the tokens structure.

medrxivr
CRAN Peer-reviewed

Access and Search MedRxiv and BioRxiv Preprint Data

Luke McGuinness
Description

Preprints, preliminary versions of research articles that have yet to undergo peer review, are an increasingly important source of health-related bibliographic content. The two preprint repositories most relevant to health-related sciences are medRxiv https://www.medrxiv.org/ and bioRxiv https://www.biorxiv.org/, both of which are operated by the Cold Spring Harbor Laboratory. medrxivr provides programmatic access to the Cold Spring Harbor Laboratory (CSHL) API https://api.biorxiv.org/, allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list, etc.) into R. medrxivr also provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria.

hunspell

High-Performance Stemmer, Tokenizer, and Spell Checker

Jeroen Ooms
Description

Low-level spell checker and morphological analyzer based on the famous hunspell library https://hunspell.github.io. The package can analyze or check individual words as well as parse text, LaTeX, HTML or XML documents. For a more user-friendly interface, use the spelling package, which builds on this package to automate checking of files, documentation and vignettes in all common formats.

Scientific use cases
  1. Cichosz, P. (2018). A case study in text mining of discussion forum posts: classification with bag of words and global vectors. Int. J. Appl. Math. Comput. Sci., 28(4), 787–801. https://www.amcs.uz.zgora.pl/?action=paper&paper=1469
  2. Yeomans, M., Kantor, A., & Tingley, D. (2018). The politeness Package: Detecting Politeness in Natural Language. The R Journal. https://journal.r-project.org/archive/2018/RJ-2018-067/RJ-2018-067.pdf
  3. Lee, A. J., Jones, B. C., & DeBruine, L. M. (2019, January 21). Investigating the association between mating-relevant self-concepts and mate preferences through a data-driven analysis of online personal descriptions. https://doi.org/10.31234/osf.io/38zef
  4. Liu, C. H., Nowak, A., & Smith, P. S. (2018). Does the Asset Pricing Premium Reflect Asymmetric or Incomplete Information? Economics Faculty Working Papers Series, 5. https://researchrepository.wvu.edu/econ_working-papers/5
  5. Nicolas, G., Bai, X., & Fiske, S. T. (2019). Automated Dictionary Creation for Analyzing Text: An Illustration from Stereotype Content. https://psyarxiv.com/afm8k/download?format=pdf
  6. Bayer, D., & Michael, S. (2019). Exploring the Daschle Collection using Text Mining. arXiv preprint arXiv:1904.12623. https://arxiv.org/pdf/1904.12623
  7. Green, E. P., Whitcomb, A., Kahumbura, C., Rosen, J. G., Goyal, S., Achieng, D., & Bellows, B. (2019). What is the best method of family planning for me?: a text mining analysis of messages between users and agents of a digital health service in Kenya. Gates Open Research, 3, 1475. https://doi.org/10.12688/gatesopenres.12999.1
  8. Lin, C., Lou, Y.-S., Tsai, D.-J., Lee, C.-C., Hsu, C.-J., Wu, D.-C., … Fang, W.-H. (2019). Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study. JMIR Medical Informatics, 7(3), e14499. https://doi.org/10.2196/14499
  9. Luc, A., Lê, S., & Philippe, M. (2019). Nudging consumers for relevant data using Free JAR profiling: an application to product development. Food Quality and Preference, 103751. https://doi.org/10.1016/j.foodqual.2019.103751
  10. Ramagopalan, S. V., Malcolm, B., Merinopoulou, E., McDonald, L., & Cox, A. (2019). Automated extraction of treatment patterns from social media posts: an exploratory analysis in renal cell carcinoma. Future Oncology. https://doi.org/10.2217/fon-2019-0406
  11. Cinelli, M., Ficcadenti, V., & Riccioni, J. (2019). The interconnectedness of the economic content in the speeches of the US Presidents. Annals of Operations Research. https://doi.org/10.1007/s10479-019-03372-2
  12. Christensen, A. P., & Kenett, Y. (2019, October 22). Semantic Network Analysis (SemNA): A Tutorial on Preprocessing, Estimating, and Analyzing Semantic Networks. https://doi.org/10.31234/osf.io/eht87
  13. Booth, A., Bell, T., Halhol, S., Pan, S., Welch, V., Merinopoulou, E., … Cox, A. (2019). Using Social Media to Uncover Treatment Experiences and Decisions in Patients With Acute Myeloid Leukemia or Myelodysplastic Syndrome Who Are Ineligible for Intensive Chemotherapy: Patient-Centric Qualitative Data Analysis. Journal of Medical Internet Research, 21(11), e14285. https://doi.org/10.2196/14285
  14. Deng, H., Wang, Q., Turner, D. P., Sexton, K. E., Burns, S. M., Eikermann, M., … Houle, T. T. (2020). Sentiment analysis of real-world migraine tweets for population research. Cephalalgia Reports, 3, 251581631989886. https://doi.org/10.1177/2515816319898867
  15. Cinelli, M. (2019). Generalized rich-club ordering in networks. Journal of Complex Networks, 7(5), 702–719. https://doi.org/10.1093/comnet/cnz002
  16. Funk, B., Sadeh-Sharvit, S., Fitzsimmons-Craft, E. E., Trockel, M. T., Monterubio, G. E., Goel, N. J., … Taylor, C. B. (2020). A Framework for Applying Natural Language Processing in Digital Health Interventions. Journal of Medical Internet Research, 22(2), e13855. https://doi.org/10.2196/13855
  17. Cichosz, P. (2020). Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation. Natural Language Engineering, 1–28. https://doi.org/10.1017/s1351324920000066
  18. Pruchnik, P. (2020). Identification of Trends in the Polish Media on the Example of the Quarterly Studia Medioznawcze The Use of Big Data Tools. Media Studies, 80(1). http://yadda.icm.edu.pl/yadda/element/bwmeta1.element.desklight-e79ed2c7-fd7d-4a91-8895-c322743c8f48/c/04_Pruchnik_EN.pdf
  19. Hamilton, L. M., & Lahne, J. (2020). Fast and automated sensory analysis: Using natural language processing for descriptive lexicon development. Food Quality and Preference, 83, 103926. https://doi.org/10.1016/j.foodqual.2020.103926
  20. DellaPosta, D., & Nee, V. (2020). Emergence of diverse and specialized knowledge in a metropolitan tech cluster. Social Science Research, 86, 102377. https://doi.org/10.1016/j.ssresearch.2019.102377
  21. Geller, J., Davis, S. D., & Peterson, D. (2020, May 23). Sans forgetica is not desirable for learning. https://doi.org/10.31234/osf.io/ku5bz
  22. Morselli, D., Passini, S., & McGarty, C. (2020). Sos Venezuela: an analysis of the anti-Maduro protest movements using Twitter. Social Movement Studies, 1–22. https://doi.org/10.1080/14742837.2020.1770072
  23. Ficcadenti, V., Cerqueti, R., Ausloos, M., & Dhesi, G. (2020). Words ranking and Hirsch index for identifying the core of the hapaxes in political texts. Journal of Informetrics, 14(3), 101054. https://doi.org/10.1016/j.joi.2020.101054
  24. Garvey, M. D., Samuel, J., & Pelaez, A. (2021). Would you please like my tweet?! An artificially intelligent, generative probabilistic, and econometric based system design for popularity-driven tweet content generation. Decision Support Systems, 144, 113497. https://doi.org/10.1016/j.dss.2021.113497
handlr
CRAN

Convert Among Citation Formats

Scott Chamberlain
Description

Converts among many citation formats, including BibTeX, Citeproc, Codemeta, RDF XML, RIS, Schema.org, and Citation File Format. A low level R6 class is provided, as well as stand-alone functions for each citation format for both read and write.

antiword

Extract Text from Microsoft Word Documents

Jeroen Ooms
Description

Wraps the AntiWord utility to extract text from Microsoft Word documents. The utility only supports the old doc format, not the new XML-based docx format. Use the xml2 package to read the latter.

roadoi

Find Free Versions of Scholarly Publications via Unpaywall

Najko Jahn
Description

This web client interfaces with Unpaywall https://unpaywall.org/products/api (formerly oaDOI), a service that finds free full texts of academic papers by linking DOIs with open access journals and repositories. It provides unified access to various data sources for open access full-text links, including Crossref and the Directory of Open Access Journals (DOAJ). API usage is free and no registration is required.

Scientific use cases
  1. Ashby, M. P. J. (2020, March 6). Three quarters of new criminological knowledge is hidden from policy makers. https://doi.org/10.31235/osf.io/wnq7h
  2. Ashby, M. P. J. (2020). The Open-Access Availability of Criminological Research to Practitioners and Policy Makers. Journal of Criminal Justice Education, 1–21. https://doi.org/10.1080/10511253.2020.1838588
  3. Robinson-Garcia, N., van Leeuwen, T. N., & Torres-Salinas, D. (2020). Measuring Open Access Uptake: Data Sources, Expectations, and Misconceptions. Scholarly Assessment Reports, 2(1). https://doi.org/10.29024/sar.23
  4. Clayson, P. E., Baldwin, S., & Larson, M. J. (2020). The Open Access Advantage for Studies of Human Electrophysiology: Impact on Citations and Altmetrics. https://doi.org/10.31234/osf.io/5xagd
refsplitr
Peer-reviewed

Author Name Disambiguation, Author Georeferencing, and Mapping of Coauthorship Networks with Web of Science Data

Emilio Bruna
Description

Tools to parse and organize reference records downloaded from the Web of Science citation database into an R-friendly format, disambiguate the names of authors, geocode their locations, and generate/visualize coauthorship networks. This package has been peer-reviewed by rOpenSci (v. 1.0).

Scientific use cases
  1. Hazlett, M. A., Henderson, K. M., Zeitzer, I. F., & Drew, J. A. (2020). The geography of publishing in the Anthropocene. Conservation Science and Practice, 2(10). https://doi.org/10.1111/csp2.270
  2. Smith, T. B., Vacca, R., Krenz, T., & McCarty, C. (2021). Great minds think alike, or do they often differ? Research topic overlap and the formation of scientific teams. Journal of Informetrics, 15(1), 101104. https://doi.org/10.1016/j.joi.2020.101104
europepmc
CRAN Peer-reviewed

R Interface to the Europe PubMed Central RESTful Web Service

Najko Jahn
Description

An R Client for the Europe PubMed Central RESTful Web Service (see https://europepmc.org/RestfulWebService for more information). It gives access to both metadata on life science literature and open access full texts. Europe PMC indexes all PubMed content and other literature sources including Agricola, a bibliographic database of citations to the agricultural literature, or Biological Patents. In addition to bibliographic metadata, the client allows users to fetch citations and reference lists. Links between life-science literature and other EBI databases, including ENA, PDB or ChEMBL are also accessible. No registration or API key is required. See the vignettes for usage examples.

jstor

Read Data from JSTOR/DfR

Thomas Klebel
Description

Functions and helpers to import metadata, ngrams and full-texts delivered by Data for Research by JSTOR.

aRxiv
CRAN

Interface to the arXiv API

Karl Broman
Description

An interface to the API for arXiv, a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics.

Scientific use cases
  1. Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
citecorp
CRAN

Client for the Open Citations Corpus

Scott Chamberlain
Description

Client for the Open Citations Corpus (http://opencitations.net/). Includes a set of functions for getting one identifier type from another, as well as getting references and citations for a given identifier.

textreuse
CRAN Peer-reviewed

Detect Text Reuse and Document Similarity

Lincoln Mullen
Description

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
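
To make the terms above concrete, here is a minimal, language-neutral sketch of shingled n-gram tokenization, Jaccard similarity, and a minhash estimate of that similarity. It is written in Python and is not the package's R API: every function name is hypothetical, and the seeded SHA-1 hashing is just one simple way to get a family of hash functions, assumed here for illustration.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of overlapping word n-grams (shingles) in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=64):
    """One minimum hash value per seeded hash function.

    Assumes `shingle_set` is non-empty.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big"
            )
            for s in shingle_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
set_a, set_b = shingles(doc_a), shingles(doc_b)

print(jaccard(set_a, set_b))  # 4 shared shingles out of 10 distinct: 0.4
print(estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b)))
```

The minhash estimate only approximates the exact Jaccard value; its benefit is that fixed-length signatures can be bucketed with locality sensitive hashing, so candidate pairs are found without comparing every document against every other.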

Scientific use cases
  1. Funk, K. R., & Mullen, L. A. (2017). The Spine of American Law: Digital Text Analysis and US Legal Practice. The American Historical Review. https://doi.org/10.1093/ahr/123.1.132
  2. Mullen, L. A., Benoit, K., Keyes, O., Selivanov, D., & Arnold, J. (2018). Fast, Consistent Tokenization of Natural Language Text. Journal of Open Source Software, 3(23), 655. https://doi.org/10.21105/joss.00655
  3. García, F. T., Villalba, L. J. G., Orozco, A. L. S., Ruiz, F. D. A., Juárez, A. A., & Kim, T. H. (2018). Locating similar names through locality sensitive hashing and graph theory. Multimedia Tools and Applications, 1-14. https://link.springer.com/article/10.1007/s11042-018-6375-9
  4. Catalano, J. (2018). Digitally Analyzing the Uneven Ground: Language Borrowing Among Indian Treaties. Current Research in Digital History, 1. https://doi.org/10.31835/crdh.2018.02
  5. Schmidt, B. (2018). Stable random projection: lightweight, general-purpose dimensionality reduction for digitized libraries. Journal of Cultural Analytics. https://doi.org/10.22148/16.025
  6. Sanger, W., & Warin, T. (2019). Dataset of Jaccard similarity indices from 1,597 European political manifestos across 27 countries (1945–2017). Data in Brief, 103907. https://doi.org/10.1016/j.dib.2019.103907
  7. Jaric, I., & Djeric, M. (2019). Curriculum and labor market: Comparative analysis of the curricular outcomes of the study program in sociology at the Faculty of Philosophy, University of Belgrade and the required competences in the labor market. Sociologija, 61(Suppl. 1), 718–741. https://doi.org/10.2298/soc19s1718j
  8. Marple, T. (2020). The social management of complex uncertainty: Central Bank similarity and crisis liquidity swaps at the Federal Reserve. The Review of International Organizations. https://doi.org/10.1007/s11558-020-09378-x
  9. Callaghan, T., Karch, A., & Kroeger, M. (2020). Model State Legislation and Intergovernmental Tensions over the Affordable Care Act, Common Core, and the Second Amendment. Publius: The Journal of Federalism. https://doi.org/10.1093/publius/pjaa012
  10. Vogler, D., Udris, L., & Eisenegger, M. (2020). Measuring Media Content Concentration at a Large Scale Using Automated Text Comparisons. Journalism Studies, 1–20. https://doi.org/10.1080/1461670x.2020.1761865
  11. Vogler, D., & Schäfer, M. S. (2020). Growing Influence of University PR on Science News Coverage? A Longitudinal Automated Content Analysis of University Media Releases and Newspaper Coverage in Switzerland, 2003‒2017. International Journal of Communication, 14, 22. https://ijoc.org/index.php/ijoc/article/download/13498/3113
  12. James, S., Pagliari, S., & Young, K. L. (2020). The internationalization of European financial networks: a quantitative text analysis of EU consultation responses. Review of International Political Economy, 1–28. https://doi.org/10.1080/09692290.2020.1779781
  13. Hansen, E. R., & Jansa, J. M. (2020). Complexity, Resources, and Text Borrowing in State Legislatures. http://ehansen4.sites.luc.edu/documents/Hansen_Jansa_Complexity.pdf
googleLanguageR
CRAN Peer-reviewed

Call Google's Natural Language API, Cloud Translation API, Cloud Speech API and Cloud Text-to-Speech API

Mark Edmondson
Description

Call Google Cloud machine learning APIs for text and speech tasks. Call the Cloud Translation API https://cloud.google.com/translate/ for detection and translation of text, the Natural Language API https://cloud.google.com/natural-language/ to analyse text for sentiment, entities or syntax, the Cloud Speech API https://cloud.google.com/speech/ to transcribe sound files to text and the Cloud Text-to-Speech API https://cloud.google.com/text-to-speech/ to turn text into sound files.

tidypmc
CRAN

Parse Full Text XML Documents from PubMed Central

Chris Stubben
Description

Parse XML documents from the Open Access subset of Europe PubMed Central https://europepmc.org including section paragraphs, tables, captions and references.
