Extraction Technique of Micro-Content-Page Text Based on Page Region

  • Abstract: This paper studies the automatic filling of main-text content blocks in micro-content web pages. Combining page region segmentation with the structural features of HTML, it proposes a text extraction method based on page region segmentation and automatic content-block filling (RAF), which can be used to extract the main text of micro-content web pages automatically. An extraction tool implementing the method was built and used in experiments; the results indicate that the method extracts the main text of micro-content web pages effectively and accurately.

     
