【Question title】: Scrapy web crawling going bad
【Posted】: 2016-01-29 08:42:30
【Question】:

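I am new to Scrapy and am trying to understand it by crawling the yellowpages.com website.

My goal is to write Python code that fills in the search fields on the yellowpages.com home page (business and location) and then crawls the resulting URLs.

My code looks like this: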

import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from spider.items import Website

class YellowPages(Spider):
    name = "yellow"
    allowed_domains = ["yellowpages.com"]
    start_urls = [
        "http://www.yellowpages.com/"
    ]

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "query":"business",
                "location" : "78735" },
            callback=self.after_results
        )

    def after_results(self, response):
        self.logger.info("info msg")

I want to search for "business" in location "78735". However, these are not the values that get passed to the site. My log looks like this:

2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) <GET http://www.yellowpages.com/> (referer: None)

2016-01-28 23:55:36 [scrapy] DEBUG: Crawled (200) <GET http://www.yellowpages.com/search?search_terms=&geo_location_terms=Los+Angeles%2C+CA&query=business&location=78735> (referer: http://www.yellowpages.com/)

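In the second URL, the term Los+Angeles has somehow been inserted. When I fill in the search fields manually and submit, the URL looks like this, which is what I expect: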

http://www.yellowpages.com/search?search_terms=business&geo_location_terms=78735
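
One way to compare the two requests is to decode their query strings with the standard library. The sketch below uses only the two URLs quoted above; the helper name query_params is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The URL Scrapy actually requested (copied from the log above).
crawled = ("http://www.yellowpages.com/search?search_terms=&geo_location_terms="
           "Los+Angeles%2C+CA&query=business&location=78735")
# The URL produced by submitting the form manually.
expected = ("http://www.yellowpages.com/search?search_terms=business"
            "&geo_location_terms=78735")


def query_params(url):
    """Return the query string of url as a dict, keeping empty values."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


crawled_params = query_params(crawled)
expected_params = query_params(expected)

print(crawled_params)
print(expected_params)
```

The decoded output shows that the crawled URL carries four parameters (search_terms, geo_location_terms, query, location), while the manual one carries only the first two.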

Can anyone tell me what is going wrong and how to fix it?

Thanks a lot.

For reference, here is part of the HTML source of the yellowpages.com home page:

<div class="search-bar"><form id="search-form" action="/search" method="GET"><div><label><span>What do you want to find?</span><input id="query" type="text" value="" placeholder="What do you want to find?" autocomplete="off" data-onempty="recent-searches" name="search_terms" tabindex="1"/></label><ul id="recent-searches" class="search-dropdown recent-searches"><li class="search-hint">Search by<b> business name,</b> or<b> keyword</b></li></ul><ul id="autosuggest-term" data-analytics='{"moi":105}' class="search-dropdown autosuggest-term"></ul></div><em>near</em><div><label><span>Where?</span> <input id="location" type="text" value="78735" placeholder="Where?" autocomplete="off" data-onempty="menu-location" name="geo_location_terms" tabindex="2"/></label>

【Question comments】:

    Tags: python web-scraping scrapy scrapy-spider


    【Solution 1】:

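    Set the search_terms and geo_location_terms form parameters. The keys in formdata must match the name attributes of the input elements, not their id attributes (query and location). With the id values, the original request kept the form's own search_terms and geo_location_terms defaults and appended query and location as extra parameters: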

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath="//form[@id='search-form']",
            formdata={
                "search_terms": "business",
                "geo_location_terms" : "78735"},
            callback=self.after_results
        )
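
To illustrate why the key names matter, here is a minimal standard-library sketch of how form defaults and formdata end up in a GET query string. It is a simplified model and not Scrapy's actual implementation; the form_defaults list and the build_query helper are assumptions made for illustration, with the default values taken from the log in the question.

```python
from urllib.parse import urlencode

# Assumed snapshot of the form's inputs as (name, default value) pairs.
# The names come from the HTML excerpt; the values are the ones seen in the log.
form_defaults = [("search_terms", ""), ("geo_location_terms", "Los Angeles, CA")]


def build_query(defaults, formdata):
    """Keep defaults whose names are not overridden, then append formdata."""
    kept = [(name, value) for name, value in defaults if name not in formdata]
    return urlencode(kept + list(formdata.items()))


# Keys taken from the id attributes: nothing is overridden, so the defaults stay.
wrong = build_query(form_defaults, {"query": "business", "location": "78735"})
# Keys taken from the name attributes: both defaults are replaced.
right = build_query(
    form_defaults, {"search_terms": "business", "geo_location_terms": "78735"}
)

print(wrong)
print(right)
```

Under these assumptions, the first query string matches the one in the question's log and the second matches the URL produced by submitting the form manually.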
    

    Tested with the following spider:

    import scrapy
    from scrapy.spiders import Spider
    
    
    class YellowPages(Spider):
        name = "yellow"
        allowed_domains = ["yellowpages.com"]
        start_urls = [
            "http://www.yellowpages.com/"
        ]
    
        def parse(self, response):
            return scrapy.FormRequest.from_response(
                response,
                formxpath="//form[@id='search-form']",
                formdata={
                    "search_terms":"business",
                    "geo_location_terms" : "78735"},
                callback=self.after_results
            )
    
        def after_results(self, response):
            for result in response.css("div.result a[itemprop=name]::text").extract():
                print(result)
    

    It prints a list of businesses in Austin, Texas:

    Prism Solutions
    Time Agent
    Stuart Consulting
    Jones REX L
    Medical Informatics & Tech Inc
    J E Andrews INC
    ...
    Hicks Consulting
    

    【Comments】: