Web scraping without knowledge of page structure

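【Question title】: Web scraping without knowledge of page structure
【Posted】: 2014-07-18 06:55:36
【Question】:

I am trying to teach myself a concept by writing a script. Basically, I want a Python script that, given a few keywords, crawls web pages until it finds the data I need. For example, say I want a list of venomous snakes that live in the US. I might run my script with the keywords list,venomous,snakes,US, and I want to be at least 80% confident that it returns a list of snakes in the US.

I already know how to implement the web spider part. I just want to learn how to determine a web page's relevance without knowing anything about the page's structure. I have researched web scraping techniques, but they all seem to assume knowledge of the page's HTML tag structure. Is there some algorithm that would let me pull data from a page and determine its relevance?

Any pointers would be greatly appreciated. I am using Python with urllib and BeautifulSoup.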

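【Comments】:

You are trying to do what Google has been trying to do for decades.
Yeah, if you can do this with any reasonable accuracy, you are Google. How about writing a script that takes keywords and then scrapes Google search results? I believe Google also has a search API. Both are great beginner projects.
OK. I was actually planning to have the script start from a Google search and then follow the links on the first page. So maybe my question is more about isolating the data on a page than about determining page relevance. If the page is on the first page of a Google search, I can fairly assume it is relevant.

Tags: python web-scraping beautifulsoup web-crawler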

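【Solution 1】:

You are basically asking "how do I write a search engine". This is... not trivial.

The right way to do this is to just use the search API of Google (or Bing, or Yahoo!, or...) and show the top n results. But if you are just working on a personal project to teach yourself some concepts (not sure exactly which ones, though), here are a few suggestions:

Search the text content of the appropriate tags (<p>, <div>, and so on) for the relevant keywords (duh)
Use the relevant keywords to check for the presence of tags that would contain what you are looking for. For example, if you are looking for a list of things, a page containing a <ul>, an <ol>, or even a <table> might be a good candidate
Build a synonym dictionary and search each page for synonyms of your keywords as well. Limiting yourself to "US" might mean an artificially low ranking for a page that only says "America"
Keep a list of words that are not in your keyword set and give a higher rank to pages that contain the most of them. These pages are (arguably) more likely to contain the answer you are looking for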
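
The first three suggestions could be sketched roughly as follows. This is only an illustration, not a tested ranking method: the `score_page` function, the `PageScanner` helper, the synonym table, and the `list_bonus` weight are all made up for this example. It uses the standard library's `html.parser` so that it runs without extra packages; BeautifulSoup's `get_text()` and `find()` could do the same extraction.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tags that suggest the page holds a list of things (suggestion 2).
LIST_TAGS = {"ul", "ol", "table"}


class PageScanner(HTMLParser):
    """Collects visible text and notes whether a list-like tag occurs."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.has_list_tag = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag in LIST_TAGS:
            self.has_list_tag = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def score_page(html, keywords, synonyms=None, list_bonus=5):
    """Return a crude relevance score; higher means more keyword hits.

    synonyms maps a lowercase keyword to a set of single-word alternatives.
    list_bonus is an arbitrary weight added when a list-like tag is present.
    """
    synonyms = synonyms or {}
    scanner = PageScanner()
    scanner.feed(html)
    words = re.findall(r"[a-z0-9]+", " ".join(scanner.text_parts).lower())
    counts = Counter(words)

    score = 0
    for keyword in keywords:
        key = keyword.lower()
        variants = {key} | synonyms.get(key, set())
        score += sum(counts[variant] for variant in variants)
    if scanner.has_list_tag:
        score += list_bonus
    return score
```

Pages could then be sorted by this score, for example `score_page(html, ["venomous", "snakes", "us"], {"us": {"america", "usa"}})`. The weights are guesses and would need tuning against pages you have judged by hand.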

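Good luck (you will need it)!

【Comments】:

Thanks! I know I am aiming too high, but for the sake of learning I will probably still try your pointers.

【Solution 2】: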

Using a crawler like scrapy (only to handle the concurrent downloads), you can write a simple spider like this one, and Wikipedia is probably a good starting point. This script is a complete example using scrapy and whoosh, plus an HTML-to-text conversion step. It never stops, and it indexes the pages it visits so that they can be searched later with whoosh. It is a small Google:

# Author: Farsheed Ashouri
import os
import re
from urllib.parse import urljoin

## Spider libraries
import scrapy
from main.items import MainItem  # project item; must define "title" and "links" fields

## indexer libraries
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in, open_dir

## html to text conversion module
## nltk.clean_html(), which this script originally called, raises
## NotImplementedError in NLTK 3, so BeautifulSoup does the conversion here.
from bs4 import BeautifulSoup

INDEX_DIR = "indexdir"
## Article URLs must not have any ":" after /wiki/ (this skips links such as /wiki/talk:company)
ARTICLE_URL = re.compile(r'https?://en\.wikipedia\.org/wiki/[^:]+$')


def open_writer():
    """Return a writer for the whoosh index, creating the index on first use."""
    if not os.path.isdir(INDEX_DIR):
        os.mkdir(INDEX_DIR)
        schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
        ix = create_in(INDEX_DIR, schema)
    else:
        ix = open_dir(INDEX_DIR)
    return ix.writer()


class Main(scrapy.Spider):
    name = "main"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Snakes"]

    def parse(self, response):
        ## These XPaths depend on Wikipedia's page markup; adjust them if the layout differs.
        titles = response.xpath('//div[@id="content"]//h1[@id="firstHeading"]//span/text()').extract()
        contents = response.xpath('//body/div[@id="content"]').extract()
        ## Skip pages that lack a title or a content block
        if not titles or not contents:
            return
        title = titles[0]
        content = contents[0]
        links = response.xpath('//a/@href').extract()

        ## Links already yielded from this page. Scrapy's built-in duplicate
        ## filter drops requests that repeat across pages.
        crawled_links = set()
        for link in links:
            # If it is a proper link and is not checked yet, yield it to the Spider
            url = urljoin(response.url, link)
            if url not in crawled_links and ARTICLE_URL.match(url):
                crawled_links.add(url)
                yield scrapy.Request(url, self.parse)

        print('*' * 80)
        print('crawled: %s | it has %s links.' % (title, len(links)))
        print('*' * 80)

        ## Open the writer only when needed so the index lock is not held across yields.
        writer = open_writer()  ## for indexing
        text = BeautifulSoup(content, "html.parser").get_text()
        writer.add_document(title=title, content=text)  ## I save only text from content.
        writer.commit()

        item = MainItem()
        item["title"] = title
        item["links"] = list(crawled_links)
        yield item

【Comments】:

Yes, I use this approach in a large enterprise search engine.
The link is dead...
Thanks, @Marnix.hoh. I removed the link.