【Question Title】: Problem with Scrapy on the Yahoo Finance site
【Posted】: 2020-10-27 02:07:24
【Question】:

I'm running into a problem with Scrapy and I can't figure out what it is. Here is my spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class YFScreener(CrawlSpider):
    name = 'YFScreener'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0']

    rules = (
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('https://finance.yahoo.com/screener/.*count=\d+&offset=\d+')), callback='parse_item'),
    )

    def parse_item(self, response):
        return response.css('tr.simpTblRow:nth-child(1) > td:nth-child(1) > a:nth-child(2)::text').get()

Here is the log I got, and these are some of my settings:

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-GB,en;q=0.5',
  'User Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'
}
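
One detail worth noting in the headers above (an editorial observation, not something raised in the thread): HTTP header names are hyphenated, so the key should be `'User-Agent'`, not `'User Agent'`. With the space-separated key, Scrapy sends a literal `User Agent` header and falls back to its default `User-Agent` (`Scrapy/VERSION (+https://scrapy.org)`), which many sites block. A corrected sketch:

```python
# Corrected header block: the header name must be hyphenated ('User-Agent'),
# otherwise Scrapy sends a bogus 'User Agent' header and identifies itself
# with its default Scrapy user-agent string. (Setting the USER_AGENT setting
# instead is the more conventional Scrapy approach.)
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:77.0) '
                  'Gecko/20100101 Firefox/77.0',
}
```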

I don't know what's going on; I've used Scrapy before and never had any problems. It seems to be related to robots.txt, but I don't know what that is. Any help?
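
For reference, if robots.txt really is the cause, the crawl log would normally contain "Forbidden by robots.txt" lines: with `ROBOTSTXT_OBEY = True`, Scrapy's robots.txt middleware silently drops any request the site's robots.txt disallows. A minimal settings.py sketch to rule that out (whether this is the actual cause here is not confirmed by the thread, and ignoring robots.txt should only be done for debugging):

```python
# settings.py sketch: disable robots.txt enforcement to check whether the
# RobotsTxtMiddleware is the one filtering the screener requests.
ROBOTSTXT_OBEY = False  # debugging only; re-enable if the site permits crawling

# Set a realistic browser user agent the standard way, via the USER_AGENT
# setting, rather than through a hand-rolled request header.
USER_AGENT = ('Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:77.0) '
              'Gecko/20100101 Firefox/77.0')
```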

Thanks!

【Question Comments】:

    Tags: python web-scraping scrapy


    【Solution 1】:

    Your code runs smoothly on my machine. I just refactored the function as below and turned on my VPN:

    def parse_item(self, response):
            return {
                "url": response.css('tr.simpTblRow:nth-child(1) > td:nth-child(1) > a:nth-child(2)::text').get()
            }
    

    Here is the related log:

    2020-07-07 08:41:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0> (referer: https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0)
    2020-07-07 08:41:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0>
    {'url': 'TYT.L'}
    2020-07-07 08:41:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6/heatmap?count=25&offset=0> (referer: https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0)
    2020-07-07 08:41:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6/heatmap?count=25&offset=0>
    {'url': None}
    

    【Comments】:

    • I don't see why it doesn't run on my machine. I don't think it's the VPN, since I can reach this page in a browser. I can also wget the page, so I don't know why Scrapy fails here.
    • Paste this line into settings.py, replacing it with your own user agent: USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    • Here is my settings.py: link