Using HTML Formatting to Aid in Natural Language Processing on the World Wide Web

msra(1998)

Abstract
Because of its magnitude and the fact that it is not computer-understandable, the World Wide Web has become a prime candidate for automatic natural language tasks. This thesis argues that there is information in the layout of a web page, and that by looking at the HTML formatting in addition to the text on a page, one can improve performance in tasks such as learning to classify segments of documents. A rich representation for web pages, the HTML Struct Tree, is described. A parsing algorithm...
Keywords
world wide web,web pages,natural language processing,natural language