Text Extraction, Rendering and Converting of PDF Documents
Utilities based on libpoppler for extracting text, fonts, attachments and metadata from a PDF file. Also supports high-quality rendering of PDF documents into PNG, JPEG, or TIFF format, or into raw bitmap vectors for further processing in R.
An R Client to the PatentsView API
Provides functions to simplify the PatentsView API (https://patentsview.org/apis/purpose) query language, send GET and POST requests to the API’s seven endpoints, and parse the data that comes back.
Google's Compact Language Detector 3
Google's Compact Language Detector 3 is a neural network model for language identification and the successor of cld2 (available from CRAN). The algorithm is still experimental and takes a novel approach to language detection with different properties and outcomes. It can be useful to combine this with the Bayesian classifier results from cld2. See https://github.com/google/cld3#readme for more information.
BibTeX Parser
Client for the DataCite API
Client for the web service methods provided by DataCite (https://www.datacite.org/), including functions to interface with their RESTful search API. The API is backed by Elasticsearch, allowing expressive queries, including faceting.
Linguistic Typology and Mapping
Provides R with the Glottolog database https://glottolog.org/ and some additional tools for linguistic mapping. The Glottolog database contains a catalogue of the languages of the world. This package helps researchers make linguistic maps, following the philosophy of the Cross-Linguistic Linked Data project https://clld.org/, which facilitates uniform access to the data across publications. A tutorial for this package is available on GitHub Pages https://docs.ropensci.org/lingtypology/ and in the package vignette. Maps created by this package can be used for both research and linguistics teaching. In addition, the package can download data from typological databases such as WALS, AUTOTYP and others, and can be used to create your own database website.
Rendering Math to HTML, MathML, or R-Documentation Format
Convert LaTeX math expressions to HTML and MathML for use in markdown documents or package manual pages. The rendering is done in R using the V8 engine (i.e. server-side), which eliminates the need for embedding the MathJax library into your web pages. In addition, a math-to-rd wrapper is provided to automatically render beautiful math in R documentation files.
Client for Various CrossRef APIs
Client for various CrossRef APIs, including metadata search with their old and newer search APIs, get citations in various formats (including bibtex, citeproc-json, rdf-xml, etc.), convert DOIs to PMIDs, and vice versa, get citations for DOIs, and get links to full text of articles when available.
General Purpose OAI-PMH Services Client
A general purpose client to work with any OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) service. The OAI-PMH protocol is described at http://www.openarchives.org/OAI/openarchivesprotocol.html. Functions are provided to work with the OAI-PMH verbs: GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets.
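As a rough illustration of how the six verbs map onto HTTP requests, the sketch below builds an OAI-PMH request URL in Python. It is not the package's API: the function name `build_oai_request` and the base URL `https://example.org/oai` are hypothetical, and only the verb names and the standard `metadataPrefix` argument come from the protocol.

```python
from urllib.parse import urlencode

# The six verbs defined by the OAI-PMH protocol.
OAI_VERBS = {
    "GetRecord", "Identify", "ListIdentifiers",
    "ListMetadataFormats", "ListRecords", "ListSets",
}


def build_oai_request(base_url, verb, params=None):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb!r}")
    query = {"verb": verb}
    # Extra arguments (e.g. metadataPrefix, set, from, until) are passed
    # through unchanged; entries whose value is None are skipped.
    for key, value in (params or {}).items():
        if value is not None:
            query[key] = value
    return f"{base_url}?{urlencode(query)}"


# Placeholder endpoint, used only to show the shape of the URL.
url = build_oai_request(
    "https://example.org/oai", "ListRecords", {"metadataPrefix": "oai_dc"}
)
print(url)
```

Rejecting unknown verbs before any request is sent keeps a mistyped verb from turning into a protocol-level error response.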
Bindings for Tabula PDF Table Extractor Library
Bindings for the Tabula http://tabula.technology/ Java library, which can extract tables from PDF documents. The tabulizerjars package https://github.com/ropensci/tabulizerjars provides versioned Java .jar files, including all dependencies, aligned to releases of Tabula.
Google's Compact Language Detector 2
Bindings to Google's C++ library Compact Language Detector 2 (see https://github.com/cld2owners/cld2#readme for more information). Probabilistically detects over 80 languages in plain text or HTML. For mixed-language input it returns the top three detected languages and their approximate proportion of the total classified text bytes (e.g. 80% English and 20% French out of 1000 bytes). There is also a cld3 package on CRAN which uses a neural network model instead.
Extract Text from Rich Text Format (RTF) Documents
Wraps the unrtf utility to extract text from RTF files. Supports document conversion to HTML, LaTeX or plain text. Output in HTML is recommended because unrtf has limited support for converting between character encodings.
Split, Combine and Compress PDF Files
Content-preserving transformations of PDF files such as split, combine, and compress. This package interfaces directly to the qpdf C++ API and does not require any command line utilities. Note that qpdf does not read actual content from PDF files: to extract text and data you need the pdftools package.
R Interface to Apache Tika
Extract text or metadata from over a thousand file types, using Apache Tika https://tika.apache.org/. Get either plain text or structured XHTML content.
Text Interchange Format
Provides validation functions for common interchange formats for representing text data in R. Includes formats for corpus objects, document term matrices, and tokens. Other annotations can be stored by overloading the tokens structure.
Access and Search MedRxiv and BioRxiv Preprint Data
An increasingly important source of health-related bibliographic content is preprints: preliminary versions of research articles that have yet to undergo peer review. The two preprint repositories most relevant to health-related sciences are medRxiv https://www.medrxiv.org/ and bioRxiv https://www.biorxiv.org/, both of which are operated by the Cold Spring Harbor Laboratory. medrxivr provides programmatic access to the Cold Spring Harbor Laboratory (CSHL) API https://api.biorxiv.org/, allowing users to easily download medRxiv and bioRxiv preprint metadata (e.g. title, abstract, publication date, author list, etc.) into R. medrxivr also provides functions to search the downloaded preprint records using regular expressions and Boolean logic, as well as helper functions that allow users to export their search results to a .BIB file for easy import to a reference manager and to download the full-text PDFs of preprints matching their search criteria.
High-Performance Stemmer, Tokenizer, and Spell Checker
Low-level spell checker and morphological analyzer based on the famous hunspell library https://hunspell.github.io. The package can analyze or check individual words as well as parse text, LaTeX, HTML or XML documents. For a more user-friendly interface use the spelling package, which builds on this package to automate checking of files, documentation and vignettes in all common formats.
Convert Among Citation Formats
Converts among many citation formats, including BibTeX, Citeproc, Codemeta, RDF XML, RIS, Schema.org, and Citation File Format. A low level R6 class is provided, as well as stand-alone functions for each citation format for both read and write.
Extract Text from Microsoft Word Documents
Wraps the AntiWord utility to extract text from Microsoft Word documents. The utility only supports the old doc format, not the new XML-based docx format. Use the xml2 package to read the latter.
Find Free Versions of Scholarly Publications via Unpaywall
This web client interfaces with Unpaywall https://unpaywall.org/products/api, formerly oaDOI, a service that finds free full texts of academic papers by linking DOIs with open access journals and repositories. It provides unified access to various data sources for open access full-text links, including Crossref and the Directory of Open Access Journals (DOAJ). API usage is free and no registration is required.
Author Name Disambiguation, Author Georeferencing, and Mapping of Coauthorship Networks with Web of Science Data
Tools to parse and organize reference records downloaded from the Web of Science citation database into an R-friendly format, disambiguate the names of authors, geocode their locations, and generate/visualize coauthorship networks. This package has been peer-reviewed by rOpenSci (v. 1.0).
R Interface to the Europe PubMed Central RESTful Web Service
An R Client for the Europe PubMed Central RESTful Web Service (see https://europepmc.org/RestfulWebService for more information). It gives access to both metadata on life science literature and open access full texts. Europe PMC indexes all PubMed content and other literature sources including Agricola, a bibliographic database of citations to the agricultural literature, or Biological Patents. In addition to bibliographic metadata, the client allows users to fetch citations and reference lists. Links between life-science literature and other EBI databases, including ENA, PDB or ChEMBL are also accessible. No registration or API key is required. See the vignettes for usage examples.
Read Data from JSTOR/DfR
Functions and helpers to import metadata, ngrams and full-texts delivered by Data for Research by JSTOR.
Interface to the arXiv API
An interface to the API for arXiv, a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics.
Client for the Open Citations Corpus
Client for the Open Citations Corpus (http://opencitations.net/). Includes a set of functions for getting one identifier type from another, as well as getting references and citations for a given identifier.
Detect Text Reuse and Document Similarity
Tools for measuring similarity among documents and detecting passages which have been reused. Implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language.
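To make the terms above concrete, here is a minimal Python sketch of word-level shingled n-grams, Jaccard similarity, and a minhash estimate of that similarity. It does not reproduce the package's functions: the names `shingles`, `jaccard`, `minhash_signature` and `estimate_similarity` are hypothetical, and the salted BLAKE2b hashing is just one convenient way to get a family of hash functions.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=64):
    """Keep the minimum hash value of the set under each of num_hashes salts."""
    if not shingle_set:
        raise ValueError("cannot compute a signature for an empty set")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """The share of matching signature positions approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")
exact = jaccard(doc_a, doc_b)  # 4 shared shingles out of 10 distinct ones
approx = estimate_similarity(minhash_signature(doc_a), minhash_signature(doc_b))
print(exact, approx)
```

The signature has a fixed length regardless of document size, which is what makes pairwise comparison and locality sensitive hashing practical on large corpora; the estimate becomes more accurate as `num_hashes` grows.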
Call Google's Natural Language API, Cloud Translation API, Cloud Speech API and Cloud Text-to-Speech API
Call Google Cloud machine learning APIs for text and speech tasks. Call the Cloud Translation API https://cloud.google.com/translate/ for detection and translation of text, the Natural Language API https://cloud.google.com/natural-language/ to analyse text for sentiment, entities or syntax, the Cloud Speech API https://cloud.google.com/speech/ to transcribe sound files to text and the Cloud Text-to-Speech API https://cloud.google.com/text-to-speech/ to turn text into sound files.
Parse Full Text XML Documents from PubMed Central
Parse XML documents from the Open Access subset of Europe PubMed Central https://europepmc.org including section paragraphs, tables, captions and references.