Scrapy: restrict_css with badly formatted HTML

【Question title】: Scrapy: restrict_css with badly formatted HTML
【Posted】: 2015-09-14 14:54:27
【Question】:

The HTML I am trying to scrape is malformed:

<html>
<head>...</head>
<body>
    My items here...
    My items here...
    My items here...

    Pagination here...
</body>
</head>
</html>

The problem is the second </head>. I have to patch the HTML inside the spider before I can use XPath expressions:

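# Import paths as of Scrapy 1.0; SgmlLinkExtractor is deprecated in favor of LinkExtractor.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor


class FooSpider(CrawlSpider):
    name = 'foo'
    allowed_domains = ['foo.bar']
    start_urls = ['http://foo.bar/index.php?page=1']
    rules = (Rule(SgmlLinkExtractor(allow=(r'\?page=\d',),),
                  callback="parse_start_url",
                  follow=True),)

    def parse_start_url(self, response):
        # Remove the second </head> here
        # Extract my items
        pass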

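Now I want to use the restrict_xpaths parameter in my Rule, but I can't, because the HTML is malformed: links are extracted before my replacement has run.

Do you have any ideas?

【Comments】:

Tags: python web-scraping scrapy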

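【Solution 1】:

What I would do is write a Downloader middleware and use, for example, the BeautifulSoup package to fix and prettify the HTML contained in response.body. response.replace() may come in handy here.

Note that if you go with BeautifulSoup, choose the parser carefully: each parser handles broken HTML in its own way, and some are more lenient than others. lxml.html would be the fastest option, though.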
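
To get a feel for how the parsers differ, one could feed each of them a page shaped like the one in the question and check whether the items survive. This is only a rough sketch: the markup, the "item" class and the count_items helper are made up for illustration, since the question does not show the real item markup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Same shape as the page in the question: a stray </head> after </body>.
# The item markup itself is an assumption made for this sketch.
BROKEN_HTML = b"""<html>
<head><title>Foo</title></head>
<body>
    <div class="item">Item 1</div>
    <div class="item">Item 2</div>
    <div class="item">Item 3</div>
    <a class="next" href="?page=2">Next</a>
</body>
</head>
</html>"""


def count_items(markup, parser):
    """Parse markup with the given parser and count the item nodes it recovers."""
    soup = BeautifulSoup(markup, parser)
    return len(soup.find_all("div", class_="item"))


if __name__ == "__main__":
    # html.parser ships with Python; lxml and html5lib are optional installs.
    for parser in ("html.parser", "lxml", "html5lib"):
        try:
            print(parser, count_items(BROKEN_HTML, parser))
        except FeatureNotFound:
            print(parser, "not installed")
```

If a parser drops or relocates nodes on the real page, the count for that parser will come out lower than expected, which is a cheap signal to pick a different one.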

Example middleware:

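from bs4 import BeautifulSoup
from scrapy.http import HtmlResponse


class MyMiddleware(object):
    def process_response(self, request, response, spider):
        # Leave non-HTML responses (images, JSON, ...) untouched.
        if not isinstance(response, HtmlResponse):
            return response
        # Let a lenient parser rebuild the tree, then swap in the repaired markup.
        soup = BeautifulSoup(response.body, "lxml")
        return response.replace(body=soup.prettify())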
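
The middleware only runs once it is enabled in the project settings. A minimal sketch, assuming the class lives in a module called myproject.middlewares (a hypothetical path; adjust it to your project):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    # "myproject.middlewares" is a hypothetical module path.
    # process_response hooks run from the highest order number to the lowest,
    # so a value below 590 (the built-in HttpCompressionMiddleware) means the
    # body has already been decompressed when MyMiddleware sees it.
    "myproject.middlewares.MyMiddleware": 543,
}
```

Because downloader middlewares run before the response reaches the spider, the link extractor in the Rule should then see the repaired markup, which is what restrict_xpaths needs.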

For an example of a custom middleware that modifies the downloaded HTML, see the scrapy-splash middleware.
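
If the stray closing tag is the only defect, a narrower fix than re-serializing the whole page might be to strip the extra </head> in the same process_response hook. A stdlib-only sketch, assuming the first </head> is the legitimate one; drop_extra_head_close is a hypothetical helper name, not part of Scrapy:

```python
import re

# Matches </head>, </HEAD>, </head > and similar variants in a raw bytes body.
_HEAD_CLOSE = re.compile(rb"</head\s*>", re.IGNORECASE)


def drop_extra_head_close(body: bytes) -> bytes:
    """Keep the first </head> and remove every later one."""
    seen = False

    def _keep_first(match):
        nonlocal seen
        if seen:
            return b""  # a later, stray closing tag: drop it
        seen = True
        return match.group(0)  # the first closing tag: keep it as-is

    return _HEAD_CLOSE.sub(_keep_first, body)
```

Inside the middleware this would be used as response.replace(body=drop_extra_head_close(response.body)). It leaves the rest of the page byte-for-byte unchanged, but it only helps with this one specific defect.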

【Comments】:

Oh, perfect! Thank you :)