The HathiTrust Research Center: Mining the 17 Million Volumes of the HathiTrust Digital Library

Citation

Kloster, David, et al. "The HathiTrust Research Center (HTRC): Mining the 17 Million Volumes of the HathiTrust Digital Library." 11 Nov. 2020. Digital Library Brown Bag Series. Hazelbaker Hall, Scholar's Commons, Herman B Wells Library, Indiana University Bloomington.

Description

The HathiTrust Digital Library (HTDL) was founded in 2008 with just over 2 million volumes. Today it holds over 17 million volumes, ranging from 6th-century psalters to 21st-century academic texts. Its diverse contents include government documents, academic journal articles, and monographs from all the disciplines one would find represented in a typical academic research library. While the majority of materials are in English, there are many volumes in German, French, Spanish, Italian, Arabic, Chinese, Russian, and Latin.

Researchers can perform text analysis on the contents of the HTDL using the tools and data sets provided by the HathiTrust Research Center (HTRC). Based at IU Bloomington, the HTRC develops infrastructure, tools, and services to support text data mining of the HTDL corpus. These include off-the-shelf web-based text analysis tools, a secure data capsule computing environment for analysis of rights-restricted content, and the HTRC Extracted Features Data Set, which provides volume-level and page-level word counts and other metadata for the entire corpus.

This presentation will discuss the current contents of the HTDL collection and its benefits as a data source, and will provide examples of existing research facilitated by HTDL collections and HTRC resources. It will also give an overview of the HTRC text analysis tools and the options for analyzing public domain and copyrighted material.

Date

Nov 2020

Staff

David Kloster
From HTRC: Jennifer Christie, John Walsh

Client

HTRC

Services

Text Analytics

Type

Presentation