A diverse Multilingual News Headlines Dataset from around the World
arXiv (2024)
Abstract
Babel Briefings is a novel dataset featuring 4.7 million news headlines from
August 2020 to November 2021, across 30 languages and 54 locations worldwide
with English translations of all articles included. Designed for natural
language processing and media studies, it serves as a high-quality dataset for
training or evaluating language models, and it offers a simple, accessible
collection of articles for tasks such as analyzing global news coverage and
cultural narratives. As a simple demonstration of the analyses this dataset
enables, we apply a basic procedure based on a TF-IDF weighted similarity
metric to group articles into clusters about the same event. We then visualize
the event signature of each cluster, which shows the languages in which
articles appear over time, revealing intuitive features related to the
proximity and unexpectedness of the event. The dataset is available on
Kaggle (https://www.kaggle.com/datasets/felixludos/babel-briefings) and
HuggingFace (https://huggingface.co/datasets/felixludos/babel-briefings),
with accompanying code on GitHub
(https://github.com/felixludos/babel-briefings).
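
The abstract does not spell out the clustering procedure, so the following is only a minimal sketch of one way a TF-IDF weighted similarity could group headlines by event and produce an event signature. Everything specific here is an assumption made for illustration: the record fields (`date`, `language`, `text`), the toy headlines, the smoothed IDF formula, the 0.3 similarity threshold, and the single-link grouping via union-find. It is not the authors' implementation.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy records; "text" stands in for an English-translated headline.
headlines = [
    {"date": "2021-03-24", "language": "en", "text": "Container ship blocks Suez Canal"},
    {"date": "2021-03-24", "language": "de", "text": "Giant container ship stuck in Suez Canal"},
    {"date": "2021-03-25", "language": "fr", "text": "Suez Canal still blocked by container ship"},
    {"date": "2021-03-24", "language": "en", "text": "Central bank keeps interest rates unchanged"},
    {"date": "2021-03-25", "language": "es", "text": "Interest rates unchanged after central bank meeting"},
]

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(texts):
    """Return L2-normalized sparse TF-IDF vectors (dicts), one per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for d in docs for term in d)
    vectors = []
    for d in docs:
        # Assumed weighting: raw term count times a smoothed IDF.
        vec = {t: tf * (math.log(n / df[t]) + 1.0) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def cluster(texts, threshold=0.3):
    """Link every pair above the threshold and return connected components."""
    vectors = tfidf_vectors(texts)
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(texts)):
        groups[find(i)].append(i)
    return sorted(groups.values())

def event_signature(records, members):
    """Count the articles of one cluster per (date, language) pair."""
    return Counter((records[i]["date"], records[i]["language"]) for i in members)

clusters = cluster([h["text"] for h in headlines])
signatures = [event_signature(headlines, members) for members in clusters]
```

Plotting each entry of `signatures` with dates on one axis and languages on the other would give a rough analogue of the event signatures described above. The pairwise comparison is quadratic in the number of headlines, so a corpus of millions of headlines would need a sparse-matrix or approximate nearest-neighbor approach instead.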