The HathiTrust Research Center is pleased to announce the release of its
Extracted Features Dataset (v.0.2), a dataset derived from 4.8 million public
domain volumes, totaling over 1.8 billion pages, currently available in the
HathiTrust Digital Library collection. The dataset includes over 734 billion
words in dozens of languages and spans multiple centuries. Features are
informative, quantified characteristics of a text, and include:
· Volume-level metadata
· Page-level features
    o Part-of-speech-tagged token counts
    o Header and footer identification
    o Sentence and line count
    o Algorithmic language detection
· Line-level features
    o Beginning and end line character count
    o Maximum length of the sequence of capital characters starting a line
These features allow for analysis of large worksets of volumes in the
HathiTrust public domain collection, at scales previously intractable for most
individual researchers. For example, page-level token (word) counts can be
used to help build topic models and classifiers and to perform other text
analytics. Similarly, features can be used to evaluate the readability of a
given volume or workset.
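As an illustration of the token-count use case, the sketch below collapses part-of-speech-tagged page counts into a volume-level bag of words, which is the usual input for topic models and classifiers. The in-memory layout (each page as a mapping of token to part-of-speech tag to count), the sample values, and the function names are assumptions made for this example; they are not the dataset's actual file format or an HTRC tool.

```python
from collections import Counter

# Hypothetical page records (an assumption for illustration):
# each page maps token -> {part-of-speech tag -> count}.
pages = [
    {"whale": {"NN": 3}, "the": {"DT": 5}, "sail": {"NN": 1, "VB": 2}},
    {"whale": {"NN": 1}, "sea": {"NN": 4}, "the": {"DT": 2}},
]


def page_token_counts(page):
    """Collapse POS-tagged counts into plain token counts for one page."""
    return Counter(
        {token: sum(pos_counts.values()) for token, pos_counts in page.items()}
    )


def volume_token_counts(pages):
    """Sum per-page token counts into a single volume-level bag of words."""
    total = Counter()
    for page in pages:
        total.update(page_token_counts(page))
    return total


counts = volume_token_counts(pages)
print(counts.most_common(2))
```

Keeping the per-page step separate from the volume-level sum means the same helper can supply page-sized documents to a topic model when a whole volume is too coarse a unit.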
How to get the data:
How to cite:
Boris Capitanu, Ted Underwood,
Peter Organisciak, Sayan Bhattacharyya, Loretta Auvil, Colleen Fallaw, J.
Stephen Downie (2015). Extracted Feature Dataset from 4.8 Million
HathiTrust Digital Library Public Domain Volumes (v0.2). [Dataset].
HathiTrust Research Center, doi:10.13012/j8td9v7m.
This feature dataset is provided
under a Creative Commons Attribution 4.0 International License.
About the HathiTrust Research Center:
The HTRC is a collaborative research center launched jointly by Indiana
University and the University of Illinois, along with the HathiTrust Digital
Library. It helps researchers meet the technical challenges of working with
massive amounts of digital text by developing cutting-edge software tools and
cyberinfrastructure that enable advanced computational access to the growing
digital record of human knowledge.
Posted: May 8, 2015