【Question Title】: Problem with Scrapy on the Yahoo Finance site
【Posted】: 2020-10-27 02:07:24
【Question】:

I'm running into a problem with Scrapy and I can't figure out what it is. Here is my spider:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class YFScreener(CrawlSpider):
    name = 'YFScreener'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0']

    rules = (
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('https://finance.yahoo.com/screener/.*count=\d+&offset=\d+')), callback='parse_item'),
    )

    def parse_item(self, response):
        return response.css('tr.simpTblRow:nth-child(1) > td:nth-child(1) > a:nth-child(2)::text').get()

Here is the log I got, and these are some of my settings:

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en-GB,en;q=0.5',
  'User Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'
}
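
One detail worth noting in the headers above (an editorial observation, not something raised in the thread): HTTP header names are hyphenated, so the key should be `'User-Agent'`, not `'User Agent'`. With the space-separated key, Scrapy sends a literal `User Agent` header and falls back to its default `User-Agent` (`Scrapy/VERSION (+https://scrapy.org)`), which many sites block. A corrected sketch:

```python
# Corrected header block: the header name must be hyphenated ('User-Agent'),
# otherwise Scrapy sends a bogus 'User Agent' header and identifies itself
# with its default Scrapy user-agent string. (Setting the USER_AGENT setting
# instead is the more conventional Scrapy approach.)
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:77.0) '
                  'Gecko/20100101 Firefox/77.0',
}
```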

I don't know what's going on; I've used Scrapy before and never had any problems. It seems to be related to robots.txt, but I don't know what that is. Any help?
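
For reference, if robots.txt really is the cause, the crawl log would normally contain "Forbidden by robots.txt" lines: with `ROBOTSTXT_OBEY = True`, Scrapy's robots.txt middleware silently drops any request the site's robots.txt disallows. A minimal settings.py sketch to rule that out (whether this is the actual cause here is not confirmed by the thread, and ignoring robots.txt should only be done for debugging):

```python
# settings.py sketch: disable robots.txt enforcement to check whether the
# RobotsTxtMiddleware is the one filtering the screener requests.
ROBOTSTXT_OBEY = False  # debugging only; re-enable if the site permits crawling

# Set a realistic browser user agent the standard way, via the USER_AGENT
# setting, rather than through a hand-rolled request header.
USER_AGENT = ('Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:77.0) '
              'Gecko/20100101 Firefox/77.0')
```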

Thanks!

【Question Comments】:

    Tags: python web-scraping scrapy


    【Solution 1】:

    Your code runs smoothly on my machine. I just refactored the function as below and turned on my VPN:

    def parse_item(self, response):
            return {
                "url": response.css('tr.simpTblRow:nth-child(1) > td:nth-child(1) > a:nth-child(2)::text').get()
            }
    

    Here is the related log:

    2020-07-07 08:41:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0> (referer: https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0)
    2020-07-07 08:41:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0>
    {'url': 'TYT.L'}
    2020-07-07 08:41:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6/heatmap?count=25&offset=0> (referer: https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0)
    2020-07-07 08:41:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6/heatmap?count=25&offset=0>
    {'url': None}
    

    【Comments】:

    • I don't see why it doesn't run on my machine. I don't think it's the VPN, since I can reach this page in a browser. I can also wget the page, so I don't know why Scrapy fails here.
    • Paste this line into settings.py, replacing it with your own user agent: USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'
    • Here is my settings.py: link