Literature

pdftools

CRAN Staff maintained

Text Extraction, Rendering and Converting of PDF Documents

Maintainer

Jeroen Ooms

Description

Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.

View Documentation
Scientific use cases

Cole, C. B., Patel, S., French, L., & Knight, J. (2016). Semi-Automated Identification of Ontological Labels in the Biomedical Literature with goldi. https://doi.org/10.1101/073460
Krotov, V., & Tennyson, M. (2018). Scraping Financial Data from the Web Using R Language. Journal of Emerging Technologies in Accounting. https://doi.org/10.2308/jeta-52063
Iqbal, J. (2019). Managerial Self-Attribution Bias and Banks’ Future Performance: Evidence from Emerging Economies. Journal of Risk and Financial Management, 12(2), 73. https://doi.org/10.3390/jrfm12020073
Hanna, A., & Hanna, L.-A. (2019). Topic Analysis of UK Fitness to Practise Cases: What Lessons Can Be Learnt? Pharmacy, 7(3), 130. https://doi.org/10.3390/pharmacy7030130
Hwang, L. J., Pauloo, R. A., & Carlen, J. (2019). Assessing Impact of Outreach through Software Citation for Community Software in Geodynamics. Computing in Science & Engineering, 1–1. https://doi.org/10.1109/mcse.2019.2940221
Ulibarri, N., & Scott, T. A. (2019). Environmental hazards, rigid institutions, and transformative change: How drought affects the consideration of water and climate impacts in infrastructure management. Global Environmental Change, 59, 102005. https://doi.org/10.1016/j.gloenvcha.2019.102005
Lope, D. J., & Dolgun, A. (2020). Measuring the inequality of accessible trams in Melbourne. Journal of Transport Geography, 83, 102657. https://doi.org/10.1016/j.jtrangeo.2020.102657
Verde Arregoitia, L. D., Teta, P., & D’Elía, G. (2020). Patterns in research and data sharing for the study of form and function in caviomorph rodents. Journal of Mammalogy. https://doi.org/10.1093/jmammal/gyaa002
Hagan, A. K., Pollet, R. M., & Libertucci, J. (2020). Suggestions for Improving Invited Speaker Diversity To Reflect Trainee Diversity. Journal of Microbiology & Biology Education, 21(1). https://doi.org/10.1128/jmbe.v21i1.2105
Berkel, C., & Cacan, E. (2020). GAB2 and GAB3 are expressed in a tumor stage-, grade- and histotype-dependent manner and are associated with shorter progression-free survival in ovarian cancer. Journal of Cell Communication and Signaling. https://doi.org/10.1007/s12079-020-00582-3
Scott, T. A., Ulibarri, N., & Perez Figueroa, O. (2020). NEPA and National Trends in Federal Infrastructure Siting in the United States. Review of Policy Research. https://doi.org/10.1111/ropr.12399
Roa-Ureta, R. H., Henríquez, J., & Molinet, C. (2020). Achieving sustainable exploitation through co-management in three Chilean small-scale fisheries. Fisheries Research, 230, 105674. https://doi.org/10.1016/j.fishres.2020.105674
Westgate, M. J., Barton, P. S., Lindenmayer, D. B., & Andrew, N. R. (2020). Quantifying shifts in topic popularity over 44 years of Austral Ecology. Austral Ecology, 45(6), 663–671. https://doi.org/10.1111/aec.12938
Marshall, B. M., Strine, C., & Hughes, A. C. (2020). Thousands of reptile species threatened by under-regulated global trade. Nature Communications, 11(1). https://doi.org/10.1038/s41467-020-18523-4
Li, B., Trueman, B. F., Rahman, M. S., & Gagnon, G. A. (2021). Controlling lead release due to uniform and galvanic corrosion — An evaluation of silicate-based inhibitors. Journal of Hazardous Materials, 407, 124707. https://doi.org/10.1016/j.jhazmat.2020.124707
Hines, R. E., Grandage, A. J., & Willoughby, K. G. (2020). Staying Afloat: Planning and Managing Climate Change and Sea Level Rise Risk in Florida’s Coastal Counties. Urban Affairs Review, 107808742098052. https://doi.org/10.1177/1078087420980526

patentsview

CRAN Peer-reviewed

An R Client to the PatentsView API

Maintainer

Christopher Baker

Description

Provides functions to simplify the PatentsView API (https://patentsview.org/apis/purpose) query language, send GET and POST requests to the API’s seven endpoints, and parse the data that comes back.

View Documentation

cld3

CRAN Staff maintained

Google's Compact Language Detector 3

Maintainer

Jeroen Ooms

Description

Google's Compact Language Detector 3 is a neural network model for language identification and the successor of cld2 (available from CRAN). The algorithm is still experimental and takes a novel approach to language detection with different properties and outcomes. It can be useful to combine this with the Bayesian classifier results from cld2. See https://github.com/google/cld3#readme for more information.

View Documentation

bibtex

CRAN

Bibtex Parser

Maintainer

James Joseph Balamuta

Description

Utility to parse a bibtex file.

View Documentation

rdatacite

CRAN

Client for the DataCite API

Maintainer

Bianca Kramer

Description

Client for the web service methods provided by DataCite (https://www.datacite.org/), including functions to interface with their RESTful search API. The API is backed by Elasticsearch, allowing expressive queries, including faceting.

View Documentation
Scientific use cases

Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
White, L., & Santy, S. (2018). DataDepsGenerators.jl: making reusing data easy by automatically generating DataDeps.jl registration code. Journal of Open Source Software, 3(31), 921. https://doi.org/10.21105/joss.00921

lingtypology

CRAN Peer-reviewed

Linguistic Typology and Mapping

Maintainer

George Moroz

Description

Provides R with the Glottolog database https://glottolog.org/ and some additional tools for linguistic mapping. The Glottolog database contains a catalogue of the world's languages. This package helps researchers make linguistic maps, following the philosophy of the Cross-Linguistic Linked Data project https://clld.org/, which facilitates uniform access to the data across publications. A tutorial for this package is available on GitHub pages https://docs.ropensci.org/lingtypology/ and in the package vignette. Maps created by this package can be used for both research and linguistic teaching. In addition, the package can download data from typological databases such as WALS, AUTOTYP and some others, and can create your own database website.

View Documentation
Scientific use cases

Maisak, T. (2017). Repetitive prefix in Agul: Morphological copy from a closely related language. International Journal of Bilingualism, 136700691774006. https://doi.org/10.1177/1367006917740060
Roettger, T., & Gordon, M. (2017). Methodological issues in the study of word stress correlates. Linguistics Vanguard, 3(1). http://www.linguistics.ucsb.edu/faculty/gordon/Roettger&Gordon_AcousticMethodologoy.pdf
Hantgan-Sonko, A. (2020). Synchronic and diachronic strategies of mora preservation in Gújjolaay Eegimaa. Journal of African Languages and Literatures, (1), 1-25. http://www.politics.unina.it/index.php/jalalit/article/download/6732/7790
Ye, J. (2020). Independent and dependent possessive person forms. Studies in Language, 44(2), 363–406. https://doi.org/10.1075/sl.19020.ye

katex

CRAN Staff maintained

Rendering Math to HTML, MathML, or R-Documentation Format

Maintainer

Jeroen Ooms

Description

Convert LaTeX math expressions to HTML and MathML for use in markdown documents or package manual pages. The rendering is done in R using the V8 engine (i.e. server-side), which eliminates the need for embedding the MathJax library into your web pages. In addition, a math-to-rd wrapper is provided to automatically render beautiful math in R documentation files.

View Documentation

rcrossref

CRAN

Client for Various CrossRef APIs

Maintainer

Najko Jahn

Description

Client for various CrossRef APIs, including metadata search with their old and newer search APIs, get citations in various formats (including bibtex, citeproc-json, rdf-xml, etc.), convert DOIs to PMIDs, and vice versa, get citations for DOIs, and get links to full text of articles when available.

View Documentation
Scientific use cases

Jahn, N., & Tullney, M. (2016). A study of institutional spending on open access publication fees in Germany. PeerJ, 4, e2323. https://doi.org/10.7717/peerj.2323
Lammey, R. (2016). Using the Crossref Metadata API to explore publisher content. Sci Ed, 3(2), 109–111. https://doi.org/10.6087/kcse.75
Bauer, P. C., Barbera, P., & Munzert, S. (2016). The Quality of Citations: Towards Quantifying Qualitative Impact in Social Science Research. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2874549
Cho, H., & Yu, Y. (2018). Link prediction for interdisciplinary collaboration via co-authorship network. arXiv preprint arXiv:1803.06249. https://arxiv.org/pdf/1803.06249.pdf
Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427
Hicks, D. J., Coil, D. A., Stahmer, C. G., & Eisen, J. A. (2019). Network analysis to evaluate the impact of research funding on research community consolidation. https://doi.org/10.1101/534495
Olsson-Collentine, A., van Assen, M. A. L. M., & Hartgerink, C. H. J. (2019). The Prevalence of Marginally Significant Results in Psychology Over Time. Psychological Science, 095679761983032. https://doi.org/10.1177/0956797619830326
Matthias, L., Jahn, N., & Laakso, M. (2019). The Two-Way Street of Open Access Journal Publishing - Flip It and Reverse It. Publications, 7(2), 23. https://doi.org/10.3390/publications7020023
Mishra, P., & Narayan Tripathi, L. (2019). Characterization of two‐dimensional materials from Raman spectral data. Journal of Raman Spectroscopy. https://doi.org/10.1002/jrs.5744
Fu, D. Y., & Hughey, J. J. (2019). Releasing a preprint is associated with more attention and citations for the peer-reviewed article. eLife, 8. https://doi.org/10.7554/elife.52646
Fraser, N., Momeni, F., Mayr, P., & Peters, I. (2020). The relationship between bioRxiv preprints, citations and altmetrics. Quantitative Science Studies, 1–21. https://doi.org/10.1162/qss_a_00043
Dion, M. L., Mitchell, S. M., & Sumner, J. L. (2020). Gender, seniority, and self-citation practices in political science. Scientometrics, 125(1), 1–28. https://doi.org/10.1007/s11192-020-03615-1
Puschmann, C., & Pentzold, C. (2020). A field comes of age: tracking research on the internet within communication studies, 1994 to 2018. Internet Histories, 1–19. https://doi.org/10.1080/24701475.2020.1749805
Benard, S., & Correll, S. J. (2010). Normative Discrimination and the Motherhood Penalty. Gender & Society, 24(5), 616–646. https://doi.org/10.1177/0891243210383142
Clayson, P. E., Baldwin, S., & Larson, M. J. (2020). The Open Access Advantage for Studies of Human Electrophysiology: Impact on Citations and Altmetrics. https://doi.org/10.31234/osf.io/5xagd

oai

CRAN Peer-reviewed

General Purpose OAI-PMH Services Client

Maintainer

Michal Bojanowski

Description

A general purpose client to work with any OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) service. The OAI-PMH protocol is described at http://www.openarchives.org/OAI/openarchivesprotocol.html. Functions are provided to work with the OAI-PMH verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets.
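To make the six verbs concrete, the sketch below builds OAI-PMH request URLs by hand. It is written in Python with the standard library only and is not the oai package's R interface. The repository base URL and the helper name `oai_request_url` are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# The six verbs defined by the OAI-PMH protocol.
OAI_VERBS = {
    "GetRecord", "Identify", "ListIdentifiers",
    "ListMetadataFormats", "ListRecords", "ListSets",
}


def oai_request_url(base_url, verb, **params):
    """Build the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # OAI-PMH passes the verb and its arguments as query parameters.
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# Hypothetical repository endpoint, for illustration only.
BASE = "https://repository.example.org/oai"

identify_url = oai_request_url(BASE, "Identify")
records_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
print(identify_url)
print(records_url)
```

A client such as oai wraps this URL construction, sends the request, and parses the XML response, so users normally never assemble these URLs themselves.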

View Documentation
Scientific use cases

Peters, I., Kraker, P., Lex, E., Gumpenberger, C., & Gorraiz, J. I. (2017). Zenodo in the Spotlight of Traditional and New Metrics. Frontiers in Research Metrics and Analytics, 2. https://doi.org/10.3389/frma.2017.00013

tabulizer

Peer-reviewed

Bindings for Tabula PDF Table Extractor Library

Maintainer

Mauricio Vargas

Description

Bindings for the Tabula http://tabula.technology/ Java library, which can extract tables from PDF documents. The tabulizerjars package https://github.com/ropensci/tabulizerjars provides versioned Java .jar files, including all dependencies, aligned to releases of Tabula.

View Documentation
Scientific use cases

Baquero, O. S., & Machado, G. (2018). Spatiotemporal dynamics and risk factors for human Leptospirosis in Brazil. Scientific Reports, 8(1). https://doi.org/10.1038/s41598-018-33381-3
Prats, J., & Danis, P.-A. (2019). An epilimnion and hypolimnion temperature model based on air temperature and lake characteristics. Knowledge & Management of Aquatic Ecosystems, (420), 8. https://doi.org/10.1051/kmae/2019001

cld2

CRAN Staff maintained

Google's Compact Language Detector 2

Maintainer

Jeroen Ooms

Description

Bindings to Google's C++ library Compact Language Detector 2 (see https://github.com/cld2owners/cld2#readme for more information). Probabilistically detects over 80 languages in plain text or HTML. For mixed-language input it returns the top three detected languages and their approximate proportion of the total classified text bytes (e.g. 80% English and 20% French out of 1000 bytes). There is also a cld3 package on CRAN which uses a neural network model instead.

View Documentation
Scientific use cases

Martín-Martín, A., Orduna-Malea, E., Thelwall, M., & López-Cózar, E. D. (2018). Google Scholar, Web of Science, and Scopus: a systematic comparison of citations in 252 subject categories. arXiv preprint arXiv:1808.05053 https://arxiv.org/abs/1808.05053
Albrecht, U.-V., Hasenfuß, G., & von Jan, U. (2018). Description of Cardiological Apps From the German App Store: Semiautomated Retrospective App Store Analysis. JMIR mHealth and uHealth, 6(11), e11753. https://doi.org/10.2196/11753
Green, E. P., Whitcomb, A., Kahumbura, C., Rosen, J. G., Goyal, S., Achieng, D., & Bellows, B. (2019). What is the best method of family planning for me?: a text mining analysis of messages between users and agents of a digital health service in Kenya. Gates Open Research, 3, 1475. https://doi.org/10.12688/gatesopenres.12999.1
Jaric, I., & Djeric, M. (2019). Curriculum and labor market: Comparative analysis of the curricular outcomes of the study program in sociology at the Faculty of Philosophy, University of Belgrade and the required competences in the labor market. Sociologija, 61(Suppl. 1), 718–741. https://doi.org/10.2298/soc19s1718j

unrtf

CRAN Staff maintained

Extract Text from Rich Text Format (RTF) Documents

Maintainer

Jeroen Ooms

Description

Wraps the unrtf utility to extract text from RTF files. Supports document conversion to HTML, LaTeX or plain text. Output in HTML is recommended because unrtf has limited support for converting between character encodings.

View Documentation

qpdf

CRAN Staff maintained

Split, Combine and Compress PDF Files

Maintainer

Jeroen Ooms

Description

Content-preserving transformations of PDF files such as split, combine, and compress. This package interfaces directly to the qpdf C++ API and does not require any command line utilities. Note that qpdf does not read actual content from PDF files: to extract text and data you need the pdftools package.

View Documentation

rtika

CRAN Peer-reviewed

R Interface to Apache Tika

Maintainer

Sasha Goodman

Description

Extract text or metadata from over a thousand file types, using Apache Tika https://tika.apache.org/. Get either plain text or structured XHTML content.

View Documentation

tif

Text Interchange Format

Maintainer

Taylor Arnold

Description

Provides validation functions for common interchange formats for representing text data in R. Includes formats for corpus objects, document term matrices, and tokens. Other annotations can be stored by overloading the tokens structure.

View Documentation

medrxivr

CRAN Peer-reviewed

Access and Search MedRxiv and BioRxiv Preprint Data

Maintainer

Luke McGuinness

Description

An increasingly important source of health-related bibliographic content is preprints - preliminary versions of research articles that have yet to undergo peer review. The two preprint repositories most relevant to health-related sciences are medRxiv https://www.medrxiv.org/ and bioRxiv https://www.biorxiv.org/, both of which are operated by the Cold Spring Harbor Laboratory. medrxivr provides programmatic access to the Cold Spring Harbor Laboratory (CSHL) API https://api.biorxiv.org/, allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list, etc.) into R. medrxivr also provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria.
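The idea of searching downloaded records with regular expressions and Boolean logic can be sketched as follows. This is a Python illustration of the general technique, not medrxivr's R interface. The records, the field names, and the helpers `matches` and `search` are all hypothetical.

```python
import re

# Hypothetical records standing in for downloaded preprint metadata.
records = [
    {"title": "Dementia risk and statin use",
     "abstract": "A cohort study of lipid-lowering drugs."},
    {"title": "Vaccine uptake in older adults",
     "abstract": "Survey data on influenza vaccination."},
    {"title": "Alzheimer's disease biomarkers",
     "abstract": "Plasma markers and statins."},
]


def matches(record, pattern):
    """True if the pattern matches the title or abstract, ignoring case."""
    text = f"{record['title']} {record['abstract']}"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def search(records, all_of=(), none_of=()):
    """Keep records matching every pattern in all_of (AND) and none in none_of (NOT).

    OR is expressed inside a single pattern with the regex alternation `|`.
    """
    return [
        r for r in records
        if all(matches(r, p) for p in all_of)
        and not any(matches(r, p) for p in none_of)
    ]


hits = search(
    records,
    all_of=[r"dementia|alzheimer", r"statins?"],
    none_of=[r"vaccin"],
)
print([r["title"] for r in hits])
```

Here the query reads as "(dementia OR alzheimer) AND statin(s) NOT vaccin*", which keeps the first and third records.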

View Documentation

hunspell

CRAN Staff maintained

High-Performance Stemmer, Tokenizer, and Spell Checker

Maintainer

Jeroen Ooms

Description

Low level spell checker and morphological analyzer based on the famous hunspell library https://hunspell.github.io. The package can analyze or check individual words as well as parse text, LaTeX, HTML or XML documents. For a more user-friendly interface use the spelling package which builds on this package to automate checking of files, documentation and vignettes in all common formats.

View Documentation
Scientific use cases

Cichosz, P. (2018). A case study in text mining of discussion forum posts: classification with bag of words and global vectors. Int. J. Appl. Math. Comput. Sci., 28(4), 787–801. https://www.amcs.uz.zgora.pl/?action=paper&paper=1469
Yeomans, M., Kantor, A., & Tingley, D. (2018). The politeness Package: Detecting Politeness in Natural Language. The R Journal. https://journal.r-project.org/archive/2018/RJ-2018-067/RJ-2018-067.pdf
Lee, A. J., Jones, B. C., & DeBruine, L. M. (2019, January 21). Investigating the association between mating-relevant self-concepts and mate preferences through a data-driven analysis of online personal descriptions. https://doi.org/10.31234/osf.io/38zef
Liu, C. H., Nowak, A., & Smith, P. S. (2018). Does the Asset Pricing Premium Reflect Asymmetric or Incomplete Information? Economics Faculty Working Papers Series, 5. https://researchrepository.wvu.edu/econ_working-papers/5
Nicolas, G., Bai, X., & Fiske, S. T. (2019). Automated Dictionary Creation for Analyzing Text: An Illustration from Stereotype Content. https://psyarxiv.com/afm8k/download?format=pdf
Bayer, D., & Michael, S. (2019). Exploring the Daschle Collection using Text Mining. arXiv preprint arXiv:1904.12623 https://arxiv.org/pdf/1904.12623
Green, E. P., Whitcomb, A., Kahumbura, C., Rosen, J. G., Goyal, S., Achieng, D., & Bellows, B. (2019). What is the best method of family planning for me?: a text mining analysis of messages between users and agents of a digital health service in Kenya. Gates Open Research, 3, 1475. https://doi.org/10.12688/gatesopenres.12999.1
Lin, C., Lou, Y.-S., Tsai, D.-J., Lee, C.-C., Hsu, C.-J., Wu, D.-C., … Fang, W.-H. (2019). Projection Word Embedding Model With Hybrid Sampling Training for Classifying ICD-10-CM Codes: Longitudinal Observational Study. JMIR Medical Informatics, 7(3), e14499. https://doi.org/10.2196/14499
Luc, A., Lê, S., & Philippe, M. (2019). Nudging consumers for relevant data using Free JAR profiling: an application to product development. Food Quality and Preference, 103751. https://doi.org/10.1016/j.foodqual.2019.103751
Ramagopalan, S. V., Malcolm, B., Merinopoulou, E., McDonald, L., & Cox, A. (2019). Automated extraction of treatment patterns from social media posts: an exploratory analysis in renal cell carcinoma. Future Oncology. https://doi.org/10.2217/fon-2019-0406
Cinelli, M., Ficcadenti, V., & Riccioni, J. (2019). The interconnectedness of the economic content in the speeches of the US Presidents. Annals of Operations Research. https://doi.org/10.1007/s10479-019-03372-2
Christensen, A. P., & Kenett, Y. (2019, October 22). Semantic Network Analysis (SemNA): A Tutorial on Preprocessing, Estimating, and Analyzing Semantic Networks. https://doi.org/10.31234/osf.io/eht87
Booth, A., Bell, T., Halhol, S., Pan, S., Welch, V., Merinopoulou, E., … Cox, A. (2019). Using Social Media to Uncover Treatment Experiences and Decisions in Patients With Acute Myeloid Leukemia or Myelodysplastic Syndrome Who Are Ineligible for Intensive Chemotherapy: Patient-Centric Qualitative Data Analysis. Journal of Medical Internet Research, 21(11), e14285. https://doi.org/10.2196/14285
Deng, H., Wang, Q., Turner, D. P., Sexton, K. E., Burns, S. M., Eikermann, M., … Houle, T. T. (2020). Sentiment analysis of real-world migraine tweets for population research. Cephalalgia Reports, 3, 251581631989886. https://doi.org/10.1177/2515816319898867
Cinelli, M. (2019). Generalized rich-club ordering in networks. Journal of Complex Networks, 7(5), 702–719. https://doi.org/10.1093/comnet/cnz002
Funk, B., Sadeh-Sharvit, S., Fitzsimmons-Craft, E. E., Trockel, M. T., Monterubio, G. E., Goel, N. J., … Taylor, C. B. (2020). A Framework for Applying Natural Language Processing in Digital Health Interventions. Journal of Medical Internet Research, 22(2), e13855. https://doi.org/10.2196/13855
Cichosz, P. (2020). Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation. Natural Language Engineering, 1–28. https://doi.org/10.1017/s1351324920000066
Pruchnik, P. (2020). Identification of Trends in the Polish Media on the Example of the Quarterly Studia Medioznawcze The Use of Big Data Tools. Media Studies, 80(1). http://yadda.icm.edu.pl/yadda/element/bwmeta1.element.desklight-e79ed2c7-fd7d-4a91-8895-c322743c8f48/c/04_Pruchnik_EN.pdf
Hamilton, L. M., & Lahne, J. (2020). Fast and automated sensory analysis: Using natural language processing for descriptive lexicon development. Food Quality and Preference, 83, 103926. https://doi.org/10.1016/j.foodqual.2020.103926
DellaPosta, D., & Nee, V. (2020). Emergence of diverse and specialized knowledge in a metropolitan tech cluster. Social Science Research, 86, 102377. https://doi.org/10.1016/j.ssresearch.2019.102377
Geller, J., Davis, S. D., & Peterson, D. (2020, May 23). Sans forgetica is not desirable for learning. https://doi.org/10.31234/osf.io/ku5bz
Morselli, D., Passini, S., & McGarty, C. (2020). Sos Venezuela: an analysis of the anti-Maduro protest movements using Twitter. Social Movement Studies, 1–22. https://doi.org/10.1080/14742837.2020.1770072
Ficcadenti, V., Cerqueti, R., Ausloos, M., & Dhesi, G. (2020). Words ranking and Hirsch index for identifying the core of the hapaxes in political texts. Journal of Informetrics, 14(3), 101054. https://doi.org/10.1016/j.joi.2020.101054
Garvey, M. D., Samuel, J., & Pelaez, A. (2021). Would you please like my tweet?! An artificially intelligent, generative probabilistic, and econometric based system design for popularity-driven tweet content generation. Decision Support Systems, 144, 113497. https://doi.org/10.1016/j.dss.2021.113497

handlr

CRAN

Convert Among Citation Formats

Maintainer

Scott Chamberlain

Description

Converts among many citation formats, including BibTeX, Citeproc, Codemeta, RDF XML, RIS, Schema.org, and Citation File Format. A low level R6 class is provided, as well as stand-alone functions for each citation format for both read and write.

View Documentation

antiword

CRAN Staff maintained

Extract Text from Microsoft Word Documents

Maintainer

Jeroen Ooms

Description

Wraps the AntiWord utility to extract text from Microsoft Word documents. The utility only supports the old doc format, not the newer XML-based docx format. Use the xml2 package to read the latter.

View Documentation

roadoi

CRAN Peer-reviewed

Find Free Versions of Scholarly Publications via Unpaywall

Maintainer

Najko Jahn

Description

This web client interfaces Unpaywall https://unpaywall.org/products/api, formerly oaDOI, a service finding free full-texts of academic papers by linking DOIs with open access journals and repositories. It provides unified access to various data sources for open access full-text links including Crossref and the Directory of Open Access Journals (DOAJ). API usage is free and no registration is required.

View Documentation
Scientific use cases

Ashby, M. P. J. (2020, March 6). Three quarters of new criminological knowledge is hidden from policy makers. https://doi.org/10.31235/osf.io/wnq7h
Ashby, M. P. J. (2020). The Open-Access Availability of Criminological Research to Practitioners and Policy Makers. Journal of Criminal Justice Education, 1–21. https://doi.org/10.1080/10511253.2020.1838588
Robinson-Garcia, N., van Leeuwen, T. N., & Torres-Salinas, D. (2020). Measuring Open Access Uptake: Data Sources, Expectations, and Misconceptions. Scholarly Assessment Reports, 2(1). https://doi.org/10.29024/sar.23
Clayson, P. E., Baldwin, S., & Larson, M. J. (2020). The Open Access Advantage for Studies of Human Electrophysiology: Impact on Citations and Altmetrics. https://doi.org/10.31234/osf.io/5xagd

refsplitr

Peer-reviewed

Author Name Disambiguation, Author Georeferencing, and Mapping of Coauthorship Networks with Web of Science Data

Maintainer

Emilio Bruna

Description

Tools to parse and organize reference records downloaded from the Web of Science citation database into an R-friendly format, disambiguate the names of authors, geocode their locations, and generate/visualize coauthorship networks. This package has been peer-reviewed by rOpenSci (v. 1.0).

View Documentation
Scientific use cases

Hazlett, M. A., Henderson, K. M., Zeitzer, I. F., & Drew, J. A. (2020). The geography of publishing in the Anthropocene. Conservation Science and Practice, 2(10). https://doi.org/10.1111/csp2.270
Smith, T. B., Vacca, R., Krenz, T., & McCarty, C. (2021). Great minds think alike, or do they often differ? Research topic overlap and the formation of scientific teams. Journal of Informetrics, 15(1), 101104. https://doi.org/10.1016/j.joi.2020.101104

europepmc

CRAN Peer-reviewed

R Interface to the Europe PubMed Central RESTful Web Service

Maintainer

Najko Jahn

Description

An R Client for the Europe PubMed Central RESTful Web Service (see https://europepmc.org/RestfulWebService for more information). It gives access to both metadata on life science literature and open access full texts. Europe PMC indexes all PubMed content and other literature sources including Agricola, a bibliographic database of citations to the agricultural literature, or Biological Patents. In addition to bibliographic metadata, the client allows users to fetch citations and reference lists. Links between life-science literature and other EBI databases, including ENA, PDB or ChEMBL are also accessible. No registration or API key is required. See the vignettes for usage examples.

View Documentation

jstor

CRAN Peer-reviewed

Read Data from JSTOR/DfR

Maintainer

Thomas Klebel

Description

Functions and helpers to import metadata, ngrams and full-texts delivered by Data for Research by JSTOR.

View Documentation

aRxiv

CRAN

Interface to the arXiv API

Maintainer

Karl Broman

Description

An interface to the API for arXiv, a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics.

View Documentation
Scientific use cases

Jaspers, S., De Troyer, E., & Aerts, M. (2018). Machine learning techniques for the automation of literature reviews and systematic reviews in EFSA. EFSA Supporting Publications, 15(6), 1427E. https://doi.org/10.2903/sp.efsa.2018.EN-1427

citecorp

CRAN

Client for the Open Citations Corpus

Maintainer

Scott Chamberlain

Description

Client for the Open Citations Corpus (http://opencitations.net/). Includes a set of functions for getting one identifier type from another, as well as getting references and citations for a given identifier.

View Documentation

textreuse

CRAN Peer-reviewed

Detect Text Reuse and Document Similarity

Maintainer

Lincoln Mullen

Description

Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
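The techniques named above (shingled n-grams, Jaccard similarity, and minhash) can be sketched in a few lines. This is a Python illustration of the general ideas, not the textreuse implementation or its R interface. The function names, the SHA-1 based hashing scheme, and the example sentences are assumptions made for the sketch.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def minhash(tokens, num_hashes=64):
    """Signature holding, per seeded hash function, the minimum hash over the tokens.

    Assumes `tokens` is non-empty.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return signature


def minhash_similarity(sig_a, sig_b):
    """Fraction of matching signature positions; estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"

a, b = shingles(doc_a), shingles(doc_b)
exact = jaccard(a, b)
estimate = minhash_similarity(minhash(a), minhash(b))
print(exact, estimate)
```

The two sentences share 4 of 10 distinct trigrams, so the exact Jaccard similarity is 0.4. The minhash estimate approximates that value from fixed-length signatures, which is what makes comparisons across a large corpus affordable.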

View Documentation
Scientific use cases

Funk, K. R., & Mullen, L. A. (2017). The Spine of American Law: Digital Text Analysis and US Legal Practice. The American Historical Review. https://doi.org/10.1093/ahr/123.1.132
Mullen, L. A., Benoit, K., Keyes, O., Selivanov, D., & Arnold, J. (2018). Fast, Consistent Tokenization of Natural Language Text. Journal of Open Source Software, 3(23), 655. https://doi.org/10.21105/joss.00655
García, F. T., Villalba, L. J. G., Orozco, A. L. S., Ruiz, F. D. A., Juárez, A. A., & Kim, T. H. (2018). Locating similar names through locality sensitive hashing and graph theory. Multimedia Tools and Applications, 1-14. https://link.springer.com/article/10.1007/s11042-018-6375-9
Catalano, J. (2018). Digitally Analyzing the Uneven Ground: Language Borrowing Among Indian Treaties. Current Research in Digital History, 1. https://doi.org/10.31835/crdh.2018.02
Schmidt, B. (2018). Stable random projection: lightweight, general-purpose dimensionality reduction for digitized libraries. Journal of Cultural Analytics. https://doi.org/10.22148/16.025
Sanger, W., & Warin, T. (2019). Dataset of Jaccard similarity indices from 1,597 European political manifestos across 27 countries (1945–2017). Data in Brief, 103907. https://doi.org/10.1016/j.dib.2019.103907
Jaric, I., & Djeric, M. (2019). Curriculum and labor market: Comparative analysis of the curricular outcomes of the study program in sociology at the Faculty of Philosophy, University of Belgrade and the required competences in the labor market. Sociologija, 61(Suppl. 1), 718–741. https://doi.org/10.2298/soc19s1718j
Marple, T. (2020). The social management of complex uncertainty: Central Bank similarity and crisis liquidity swaps at the Federal Reserve. The Review of International Organizations. https://doi.org/10.1007/s11558-020-09378-x
Callaghan, T., Karch, A., & Kroeger, M. (2020). Model State Legislation and Intergovernmental Tensions over the Affordable Care Act, Common Core, and the Second Amendment. Publius: The Journal of Federalism. https://doi.org/10.1093/publius/pjaa012
Vogler, D., Udris, L., & Eisenegger, M. (2020). Measuring Media Content Concentration at a Large Scale Using Automated Text Comparisons. Journalism Studies, 1–20. https://doi.org/10.1080/1461670x.2020.1761865
Vogler, D., & Schäfer, M. S. (2020). Growing Influence of University PR on Science News Coverage? A Longitudinal Automated Content Analysis of University Media Releases and Newspaper Coverage in Switzerland, 2003‒2017. International Journal of Communication, 14, 22. https://ijoc.org/index.php/ijoc/article/download/13498/3113
James, S., Pagliari, S., & Young, K. L. (2020). The internationalization of European financial networks: a quantitative text analysis of EU consultation responses. Review of International Political Economy, 1–28. https://doi.org/10.1080/09692290.2020.1779781
Hansen, E. R., & Jansa, J. M. (2020). Complexity, Resources, and Text Borrowing in State Legislatures. http://ehansen4.sites.luc.edu/documents/Hansen_Jansa_Complexity.pdf

googleLanguageR

CRAN Peer-reviewed

Call Google's Natural Language API, Cloud Translation API, Cloud Speech API and Cloud Text-to-Speech API

Maintainer

Mark Edmondson

Description

Call Google Cloud machine learning APIs for text and speech tasks. Call the Cloud Translation API https://cloud.google.com/translate/ for detection and translation of text, the Natural Language API https://cloud.google.com/natural-language/ to analyse text for sentiment, entities or syntax, the Cloud Speech API https://cloud.google.com/speech/ to transcribe sound files to text and the Cloud Text-to-Speech API https://cloud.google.com/text-to-speech/ to turn text into sound files.

View Documentation

tidypmc

CRAN

Parse Full Text XML Documents from PubMed Central

Maintainer

Chris Stubben

Description

Parse XML documents from the Open Access subset of Europe PubMed Central https://europepmc.org including section paragraphs, tables, captions and references.

View Documentation