Hypertext Entity Extraction in Webpage
arXiv (2024)
Abstract
Webpage entity extraction is a fundamental natural language processing task
in both research and applications. Nowadays, the majority of webpage entity
extraction models are trained on structured datasets which strive to retain
textual content and its structure information. However, existing datasets all
overlook the rich hypertext features (e.g., font color, font size) that
previous work has shown to be effective. To this end, we first collect a
Hypertext Entity Extraction Dataset (HEED) from the e-commerce domain,
scraping both the text and the corresponding explicit hypertext features with
high-quality manual entity annotations. Furthermore, we present the MoE-based
Entity Extraction Framework (MoEEF), which efficiently
integrates multiple features to enhance model performance by Mixture of Experts
and outperforms strong baselines, including the state-of-the-art small-scale
models and GPT-3.5-turbo. Moreover, we analyze the effectiveness of the
hypertext features in HEED and of several model components in MoEEF.
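
To illustrate the general idea of combining several feature sources with a Mixture of Experts, here is a minimal sketch of softmax-gated fusion. It is a generic illustration, not the MoEEF architecture: the abstract does not specify the experts or the gating network, so the function names (`softmax`, `moe_combine`), the three experts, and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw gate scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_combine(expert_outputs, gate_scores):
    """Return the gate-weighted sum of the expert output vectors."""
    if len(expert_outputs) != len(gate_scores):
        raise ValueError("need exactly one gate score per expert")
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]


# Hypothetical per-token outputs from three experts (assumed, not from the paper):
text_expert = [1.0, 0.0]        # expert that sees only the text
font_size_expert = [0.0, 1.0]   # expert that sees the font size
font_color_expert = [0.5, 0.5]  # expert that sees the font color

# Equal gate scores give every expert the same weight.
fused = moe_combine(
    [text_expert, font_size_expert, font_color_expert],
    [0.0, 0.0, 0.0],
)
```

In a trained model the gate scores would come from a learned network conditioned on the input, so that different tokens can rely on different experts; here they are fixed constants to keep the example self-contained.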