Research on Internet Corpus Collection Method

2022 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS) (2022)

Abstract
With the spread of websites and the rapid growth of online text data, the Internet has become an important channel for obtaining information resources. Internet corpora have become an essential resource for linguistic research because of their rich content, large scale, variety of language types, and low acquisition cost. Crawler technology is a common and effective way to collect corpora from the Internet. This paper systematically introduces the principles of Internet data transmission and describes how crawlers are used to collect Internet corpora. Finally, it introduces some common anti-crawling mechanisms, which can be circumvented to collect corpora more effectively.
Keywords
Internet corpus, Internet data transmission, web crawler