Newswire: A Large-Scale Structured Database of a Century of Historical News
CoRR (2024)
Abstract
Historically, local newspapers in the U.S. drew their content largely from
newswires like the Associated Press. Historians argue that newswires played a
pivotal role in creating a national identity and shared understanding of the
world, but there is no comprehensive archive of the content sent over
newswires. We reconstruct such an archive by applying a customized deep
learning pipeline to hundreds of terabytes of raw image scans from thousands of
local newspapers. The resulting dataset contains 2.7 million unique public
domain U.S. newswire articles, written between 1878 and 1977. Locations in
these articles are georeferenced, topics are tagged using customized neural
topic classification, named entities are recognized, and individuals are
disambiguated to Wikipedia using a novel entity disambiguation model. To
construct the Newswire dataset, we first recognize newspaper layouts and
transcribe around 138 million structured article texts from raw image scans.
We then use a customized neural bi-encoder model to de-duplicate reproduced
articles, in the presence of considerable abridgement and noise, quantifying
how widely each article was reproduced. A text classifier is used to ensure
that we only include newswire articles, which historically are in the public
domain. The structured data that accompany the texts provide rich information
about the who (disambiguated individuals), what (topics), and where
(georeferencing) of the news that millions of Americans read over the course of
a century. We also include Library of Congress metadata about the
newspapers that ran the articles on their front pages. The Newswire dataset is
useful both for large language modeling, where it expands training data beyond
what is available from modern web texts, and for studying a wide range of
questions in computational linguistics, social science, and the digital humanities.
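
The de-duplication step described above can be illustrated with a minimal sketch. It assumes each transcribed article has already been mapped to an embedding vector by some bi-encoder; the toy vectors, the 0.9 cosine threshold, and the single-linkage grouping via union-find are all assumptions made for illustration, not details taken from the paper. The size of each resulting cluster stands in for how widely an article was reproduced.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_duplicates(embeddings, threshold=0.9):
    """Group articles whose embeddings are similar (single linkage).

    Returns one cluster label per article. The threshold is a
    hypothetical value chosen for this toy example.
    """
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pairwise comparison; fine for a toy input only.
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(embeddings))]


# Hypothetical embeddings: articles 0/1 and 2/3 are noisy reprints of
# the same two wire stories; article 4 ran only once.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.98, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.02, 0.99, 0.0],
    [0.0, 0.0, 1.0],
]

labels = cluster_duplicates(toy_embeddings)
reproduction_counts = Counter(labels)  # cluster label -> number of reprints
```

The pairwise loop is quadratic in the number of articles, so a corpus of this scale would need an approximate nearest-neighbor index or similar in place of the brute-force comparison.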