Learning n-ary tree-pattern queries for web information extraction

msra(2006)

Abstract
The problem of extracting information from the Web consists in building patterns that extract specific information from the documents of a given Web source. Up to now, most existing techniques use string-based representations of documents as well as string-based patterns. Using tree representations naturally overcomes the limitations of string-based approaches. While some tree-based approaches exist, they are either limited to learning unary queries or build n-ary queries by composing unary queries. In this paper we study the use of tree patterns as an n-ary extraction language and propose an algorithm capable of learning such queries. The learning algorithm we propose calculates the most information-conservative tree pattern that is a generalization of two input trees. Tree patterns have the double advantage of working explicitly with the tree structure of HTML/XML documents and of expressing n-ary queries. As our experiments will show, tree patterns can express many extraction tasks. They also have the advantage of being closely related to the now standard XPath language and are therefore easily understandable by human experts.
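
To make the idea of generalizing two input trees into a tree pattern more concrete, here is a minimal sketch in Python. It is not the algorithm proposed in the paper, which the abstract does not specify in detail. It is a simple top-down generalization that assumes ordered trees, aligns children by position, and replaces differing labels with a wildcard. The names `Node`, `WILDCARD`, `generalize`, `matches`, and `extract` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

WILDCARD = "*"  # hypothetical marker for "any label here"


@dataclass
class Node:
    """An ordered, labeled tree node (a stand-in for an HTML/XML element or text)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def generalize(a: Node, b: Node) -> Node:
    """Return a pattern that matches both input trees.

    Assumption: children are aligned by position. Where labels differ,
    the whole subtree is replaced by a childless wildcard.
    """
    if a.label != b.label:
        return Node(WILDCARD)
    kids = [generalize(x, y) for x, y in zip(a.children, b.children)]
    return Node(a.label, kids)


def matches(pattern: Node, tree: Node) -> bool:
    """Check whether the pattern matches the tree from the root down."""
    if pattern.label != WILDCARD and pattern.label != tree.label:
        return False
    if len(pattern.children) > len(tree.children):
        return False
    return all(matches(p, t) for p, t in zip(pattern.children, tree.children))


def extract(pattern: Node, tree: Node) -> Optional[List[str]]:
    """Return the labels bound to wildcards, left to right (one n-ary tuple),
    or None if the pattern does not match."""
    if not matches(pattern, tree):
        return None
    if pattern.label == WILDCARD:
        return [tree.label]
    out: List[str] = []
    for p, t in zip(pattern.children, tree.children):
        out.extend(extract(p, t) or [])
    return out


# Two table rows that share structure but differ in their text content.
row1 = Node("tr", [Node("td", [Node("Alice")]), Node("td", [Node("Paris")])])
row2 = Node("tr", [Node("td", [Node("Bob")]), Node("td", [Node("Rome")])])

pattern = generalize(row1, row2)   # tr(td(*), td(*))
pair = extract(pattern, row2)      # a binary (2-ary) extraction result
```

In this sketch a single pattern with two wildcards extracts both fields of a row at once, which is the sense in which a tree pattern can act as an n-ary query rather than a composition of unary ones.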