Your thinking is correct: the data is generated dynamically with JavaScript. If you disable JavaScript in your browser, open the drop-down, and try to switch the table from "CBLOL Split 1 2020" to "CBLOL Academy Split 2 2021", you will see that it never changes. This behavior is what is meant by dynamically populated JavaScript data: you cannot get it by scraping the static HTML, which is why you need a headless browser. In general, we cannot fix on one technique for scraping every website; rather, the website itself dictates which technique we have to use. Here I use Selenium together with Scrapy, and it still runs about as fast as a plain Scrapy spider.
My code:
import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from time import sleep


class TeamsSpider(scrapy.Spider):
    name = 'teams'
    allowed_domains = ['gol.gg']
    start_urls = ['https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/']

    def __init__(self):
        chrome_options = Options()
        # chrome_options.add_argument("--headless")
        chrome_path = which("chromedriver")
        self.driver = webdriver.Chrome(executable_path=chrome_path)  # , options=chrome_options)
        self.driver.set_window_size(1920, 1080)
        self.driver.get("https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/")
        sleep(5)

        # Select the tournament in the drop-down so JavaScript renders the table
        dropDown = self.driver.find_element_by_xpath('//*[@id="cbtournament"]/option[text()= "CBLOL Split 1 2020"]')
        dropDown.click()
        sleep(5)

        # Keep the fully rendered page source, then shut the browser down
        self.html = self.driver.page_source
        self.driver.close()

    def parse(self, response):
        # Parse the Selenium-rendered HTML instead of Scrapy's own (static) response
        resp = Selector(text=self.html)
        for tr in resp.xpath('(//tbody)[2]/tr'):
            yield {
                'Name': tr.xpath(".//td/a/text()").get(),
                'Season': tr.xpath(".//td[2]/text()").get(),
                'Region': tr.xpath(".//td[3]/text()").get(),
                'Games': tr.xpath(".//td[4]/text()").get(),
                'winRate': tr.xpath(".//td[5]/text()").get()
            }
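As a side note, the per-row extraction logic above can be sanity-checked without launching a browser at all. Here is a minimal, stdlib-only sketch that applies the same column positions to a small hypothetical HTML fragment (the markup below is made up to mirror the table structure the spider assumes, not copied from the site):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking one row of the rendered teams table
html = """
<table><tbody>
  <tr>
    <td><a>Flamengo eSports</a></td>
    <td>S10</td><td>BR</td><td>21</td><td>61.9%</td>
  </tr>
</tbody></table>
"""

root = ET.fromstring(html)
rows = []
for tr in root.findall(".//tbody/tr"):
    # Same column layout the spider's XPaths rely on:
    # td[1] holds the team link, td[2..5] hold season/region/games/win rate
    rows.append({
        "Name": tr.find("./td/a").text,
        "Season": tr.find("./td[2]").text,
        "Region": tr.find("./td[3]").text,
        "Games": tr.find("./td[4]").text,
        "winRate": tr.find("./td[5]").text,
    })

print(rows[0])
# {'Name': 'Flamengo eSports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '61.9%'}
```

If the site ever changes its column order, a quick check like this against a saved `page_source` will tell you which XPath index broke.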
Output:
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Flamengo eSports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '61.9%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'FURIA Uppercut', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'INTZ e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '38.1%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'KaBuM! e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'paiN Gaming', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '47.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Prodigy Esports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Redemption POA', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '28.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Vivo Keyd', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '66.7%'}
2021-08-04 13:59:37 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-04 13:59:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1078,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 2.668237,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 4, 7, 59, 37, 58803),
'httpcompression/response_bytes': 278,
'httpcompression/response_count': 2,
'item_scraped_count': 8,