【Question Title】: Scrapy Crawl returning None
【Posted】: 2021-08-03 22:18:16
【Question Description】:

I'm Brazilian, so sorry for the poor English. I've started learning Python and how to use Scrapy, and I'm trying to get information from a table, but for some reason the function I wrote returns 'None', as you can see:

DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/> {'teste': None}

Any class I put into response.css returns 'None'. I also tried the same code to get text from other sites and it worked, so I'm guessing it's something specific to this site, but I really don't know what. Can someone help me with this?

Here is the code I wrote:

import scrapy


class QuotesSpider(scrapy.Spider):

    name = "equipes"
    start_urls = ['https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/']

    def parse(self, response):
        yield {'teste': response.css('tbody tr td.tablesaw-cell-persist').get()}

【Comments】:

  • I can't check the site from my laptop right now, but are you sure it isn't a JavaScript site? Scrapy can't render JS.

Tags: python scrapy web-crawler


【Solution 1】:

Your guess is correct: the data is generated dynamically with JavaScript. If you disable JavaScript in your browser, open the dropdown, and try to switch the table — for example from "CBLOL Split 1 2020" to "CBLOL Academy Split 2 2021" — you'll see that nothing changes. That behaviour is what's meant by dynamically populated JavaScript data: the table rows aren't present in the static HTML, so you can't get them by scraping it. This is why you need a headless browser to get the data. In practice we don't get to pick one fixed scraping technique; the website dictates which technique we have to use. Here I use Selenium together with Scrapy, and it's still fast like a regular Scrapy spider.

My code:

import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from time import sleep


class TeamsSpider(scrapy.Spider):
    name = 'teams'
    allowed_domains = ['gol.gg']
    start_urls = ['https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/']
   
    def __init__(self):
        chrome_options = Options()
        # chrome_options.add_argument("--headless")  # enable to run without a visible window

        chrome_path = which("chromedriver")

        self.driver = webdriver.Chrome(executable_path=chrome_path)  # , options=chrome_options)
        self.driver.set_window_size(1920, 1080)
        self.driver.get("https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/")
        sleep(5)
        # Pick the tournament from the dropdown so JavaScript renders the table
        dropDown = self.driver.find_element_by_xpath('//*[@id="cbtournament"]/option[text()= "CBLOL Split 1 2020"]')
        dropDown.click()
        sleep(5)

        # Keep the fully rendered HTML, then close the browser
        self.html = self.driver.page_source
        self.driver.close()
    
    def parse(self, response):
        
        # Parse the Selenium-rendered HTML; the second tbody holds the team table
        resp = Selector(text=self.html)
        for tr in resp.xpath('(//tbody)[2]/tr'):
            yield {
                'Name': tr.xpath(".//td/a/text()").get(),
                'Season': tr.xpath(".//td[2]/text()").get(),
                'Region': tr.xpath(".//td[3]/text()").get(),
                'Games': tr.xpath(".//td[4]/text()").get(),
                'winRate': tr.xpath(".//td[5]/text()").get()
            }

Output:

2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Flamengo eSports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '61.9%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'FURIA Uppercut', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'INTZ e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '38.1%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'KaBuM! e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'paiN Gaming', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '47.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Prodigy Esports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Redemption POA', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '28.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Vivo Keyd', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '66.7%'}
2021-08-04 13:59:37 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-04 13:59:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 1078,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 2.668237,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 8, 4, 7, 59, 37, 58803),
 'httpcompression/response_bytes': 278,
 'httpcompression/response_count': 2,
 'item_scraped_count': 8,


       
        

【Discussion】:

  • If it works, please click the tick mark to the left of my answer to accept it, as the community guidelines suggest, so it helps others.