Strategy for scraping web pages, maximizing information gathered

【Question title】: Strategy for scraping web pages, maximizing information gathered
【Posted】: 2013-04-03 23:08:45
【Question description】:

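Here's the problem:

Users register for a site and can choose one of 8 job categories, or choose to skip this step. I want to classify the users who skipped that step into job categories, based on the domain name in their email address.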
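As a small illustration of the first step, here is a minimal sketch of pulling the domain out of a registered email address so it can be handed to a scraper. The helper name `email_domain` is hypothetical (it is not part of the code posted below), and it makes no attempt to filter out free-mail providers.

```python
from email.utils import parseaddr


def email_domain(address):
    """Return the lower-cased domain of an email address, or None."""
    # parseaddr also copes with the "Display Name <user@host>" form.
    _, addr = parseaddr(address)
    local, at, domain = addr.rpartition("@")
    if not at or not local or not domain:
        return None  # no usable "@" in the input
    return domain.lower()
```

The returned string could then be passed to something like the `WebPage(domain)` constructor shown further down.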

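Current setup:

Using a combination of Beautiful Soup and nltk, I scrape the homepage and look for links on the site that contain the word "about". I scrape that page as well. I've copied the code that does the scraping at the end of this post.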

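Question:

I'm not getting enough data to build a good learning routine. I'd like to know whether my scraping algorithm is set up for success: in other words, are there any gaping holes in my logic, or is there a better way to make sure I get a good chunk of text describing what kind of work a company does?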

(Relevant) code:

import bs4 as bs
import httplib2 as http
import nltk


# Only these characters are valid in a url
ALLOWED_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;="


class WebPage(object):
    def __init__(self, domain):
        """
            Constructor

            :param domain: URL to look at
            :type domain: str
        """
        self.url = 'http://www.' + domain

        try:
            self._get_homepage()
        except Exception:  # Catch something more specific here?
            self.homepage = None

        try:
            self._get_about_us()
        except Exception:  # also reached when the homepage fetch failed
            self.about_us = None

    def _get_homepage(self):
        """
            Open the home page, looking for redirects
        """
        import re

        web = http.Http()
        response, pg = web.request(self.url)

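        # Check for redirects: very short pages are treated as redirect stubs
        if int(response.get('content-length', 251)) < 250:
            # httplib2 returns the body as bytes; decode before matching
            text = pg.decode('utf-8', 'ignore')
            matches = re.findall(r'(https?://\S+)', text)
            if matches:  # otherwise there's not much I can do...
                self.url = ''.join(x for x in matches[0] if x in ALLOWED_CHARS)
                response, pg = web.request(self.url)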

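        # nltk.clean_html() was removed in NLTK 3; BeautifulSoup's get_text()
        # is the replacement that NLTK itself points to.
        self.homepage = self._parse_html(
            bs.BeautifulSoup(pg, 'html.parser').get_text())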
        self._raw_homepage = pg

    def _get_about_us(self):
        """
            Soup-ify the home page, find the "About us" page, and store its contents in a
            string
        """
        # Name the parser explicitly so results do not depend on what is installed
        soup = bs.BeautifulSoup(self._raw_homepage, 'html.parser')
        # Keep only anchors that carry an href mentioning "about"
        about = [a['href'] for a in soup.find_all('a', href=True)
                 if 'about' in a['href'].lower()]

        # need to find about or about-us
        about_us_page = None
        for a in about:
            bits = a.strip('/').split('/')
            if len(bits) == 1:
                about_us_page = bits[0]
            elif 'about' in bits[-1].lower():
                about_us_page = bits[-1]

        # otherwise assume shortest string is top-level about pg.
        if about_us_page is None and len(about):
            about_us_page = min(about, key=len)

        self.about_us = None
        if about_us_page is not None:
            self.about_us_url = self.url + '/' + about_us_page
            web = http.Http()
            response, pg = web.request(self.about_us_url)
            if int(response.get('content-length', 251)) > 250:
                # nltk.clean_html() is gone from NLTK 3; use get_text() instead
                self.about_us = self._parse_html(
                    bs.BeautifulSoup(pg, 'html.parser').get_text())

    def _parse_html(self, raw_text):
        """
            Clean html coming from a web page. Gets rid of
                - all '\n' and '\r' characters
                - all zero length words
                - all tokens starting with '&' (i.e., leftover HTML entities)
        """
        lines = [x.strip() for x in raw_text.splitlines()]
        all_text = ' '.join([x for x in lines if len(x)]) # zero length strings
        return [x for x in all_text.split(' ') if len(x) and x[0] != '&']
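
One possible hole in the code above is that it keeps only the last path segment of a single "about" link and glues it onto `self.url`, so nested paths, absolute links and other descriptive pages are lost. The sketch below is one way to gather more candidate pages; it is not the poster's method. It uses only the standard library (`html.parser` and `urllib.parse`) so that it runs without extra packages; the same idea carries over to Beautiful Soup. The `KEYWORDS` tuple and the names `LinkCollector` and `candidate_pages` are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical keyword list (an assumption); tune it against real sites.
KEYWORDS = ("about", "company", "services", "products", "who-we-are")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def candidate_pages(base_url, html, keywords=KEYWORDS):
    """Return absolute, same-host URLs whose href or link text hits a keyword."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc.lower()
    found = []
    for href, text in collector.links:
        # Resolve relative links against the page URL and drop #fragments.
        url = urldefrag(urljoin(base_url, href)).url
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            continue  # mailto:, javascript:, ...
        if parsed.netloc.lower() != base_host:
            continue  # the link leaves the site
        haystack = (href + " " + text).lower()
        if any(k in haystack for k in keywords) and url not in found:
            found.append(url)
    return found
```

Matching on the link text as well as the href catches pages such as `team.html` labelled "Our Company", which a pure href match would miss.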

【Question discussion】:

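Since you have tagged this beautifulsoup, it would be useful if you mentioned the URL or provided a snippet of the web page you are trying to parse. It is also hard to understand the exact problem from the code you have provided (for connecting to the web page).
I've got a list of about 6,000 URLs here, so I'm not sure a list would be informative. I want to know whether there is a way to improve the scraping/parsing algorithm above so that it works in the most general way possible. Of course, any general tips would be appreciated as well.
Adding a single example would be enough to give some context. 1 >>> 0
@AaronD The point is that I want to do this for any domain I'm given. If I give an example of one domain I'm trying to scrape, I'll get a dozen answers telling me how to scrape that domain. But that isn't good enough, because I'd have to change my algorithm for every new domain I get. Does that make sense? In other words, I don't know who is going to sign up for my site, so I have to assume complete generality.
As a side note, I would split the scraping and the processing into two steps. First, download the information and store the raw results in a file or database. Then you can re-analyze your results as many times as you like until you get a good outcome, without hitting the websites of the companies you are looking at.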
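
The two-step split suggested in the last comment could look roughly like the sketch below, which stores raw HTML in SQLite and runs the analysis offline. The table name `pages`, the function names, and the caller-supplied `fetch` and `parse` callables are all assumptions made for illustration; nothing here comes from the original post.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the raw-page store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " html TEXT NOT NULL,"
        " fetched_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def download_all(conn, urls, fetch):
    """Step 1: fetch each URL once and store the raw HTML untouched."""
    for url in urls:
        if conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone():
            continue  # already stored; do not hit the site again
        try:
            html = fetch(url)  # fetch is supplied by the caller (assumption)
        except Exception:
            continue  # a real crawler would log the failure
        conn.execute("INSERT INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()


def analyze_all(conn, parse):
    """Step 2: re-run any parser over the stored pages, offline."""
    rows = conn.execute("SELECT url, html FROM pages ORDER BY url")
    return {url: parse(html) for url, html in rows}
```

With a file path instead of `":memory:"`, the download step can be run once over all the domains, and `analyze_all` can then be repeated with different `parse` functions without touching the network again.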

Tags: python web-scraping beautifulsoup classification

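【Solution 1】:

This goes beyond what you asked, but I would consider calling an external data source that has already collected this information. A good place to find such services is Programmable Web (for example, Mergent Company Fundamentals). Not all of the data on Programmable Web is up to date, but there seem to be a lot of API providers there.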

【Discussion】:

Very nice. I hadn't heard of Programmable Web before.