[Posted]: 2020-10-27 02:07:24
[Problem description]:
I ran into this problem while using Scrapy, and I don't know what it is. Here is my spider:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class YFScreener(CrawlSpider):
    name = 'YFScreener'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['https://finance.yahoo.com/screener/unsaved/c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0']

    rules = (
        # Extract links matching the screener pagination URLs and parse them with parse_item
        Rule(LinkExtractor(allow=(r'https://finance\.yahoo\.com/screener/.*count=\d+&offset=\d+',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        return response.css('tr.simpTblRow:nth-child(1) > td:nth-child(1) > a:nth-child(2)::text').get()
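As a side note, the `allow` pattern in the Rule can be sanity-checked against a screener URL with the standard-library `re` module; LinkExtractor effectively applies `allow` patterns with `re.search` semantics. A quick sketch (the screener id in the URL is the one from `start_urls`):

```python
import re

# Same pattern as in the Rule's LinkExtractor, as a raw string.
pattern = r'https://finance\.yahoo\.com/screener/.*count=\d+&offset=\d+'

url = ('https://finance.yahoo.com/screener/unsaved/'
       'c97bc7b4-0e94-43dc-9df1-b46f936742e6?count=25&offset=0')

# LinkExtractor keeps a URL if any allow regex matches it via search().
print(bool(re.search(pattern, url)))  # True
```

So the rule itself does match the start URL's pagination format; the pattern is not the problem here.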
Here is the log I am getting, and these are some of my settings:
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'
}
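With `ROBOTSTXT_OBEY = True`, Scrapy downloads the site's robots.txt and silently drops any request that file disallows, logging "Forbidden by robots.txt". Whether a given URL is blocked can be checked with the standard-library `urllib.robotparser`; a minimal sketch, using a hypothetical robots.txt body and a made-up screener id (check https://finance.yahoo.com/robots.txt for the real rules):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only -- not the real
# finance.yahoo.com file.
robots_txt = """\
User-agent: *
Disallow: /screener/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

url = 'https://finance.yahoo.com/screener/unsaved/abc?count=25&offset=0'
# If the live robots.txt disallows /screener/ like this example does,
# Scrapy with ROBOTSTXT_OBEY = True would drop the request instead of crawling it.
print(rp.can_fetch('*', url))  # False for this example file
```

If the log shows "Forbidden by robots.txt" for the start URL, that would explain why the spider fetches nothing.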
I don't know what's going on; I've used Scrapy before and never had a problem. It seems to be related to robots.txt, but I don't know what that is. Any help?
Thanks!
[Discussion]:
Tags: python web-scraping scrapy