Your thinking is correct: the data is generated dynamically with JavaScript. If you disable JavaScript in your browser, open the drop-down, and try to switch the table from "CBLOL Split 1 2020" to "CBLOL Academy Split 2 2021", you will see that it never changes. This behavior is what is meant by dynamically populated JavaScript data: you cannot get it by scraping the static HTML, which is why you need a headless browser. In general, we cannot fix on one technique for scraping every website; rather, the website itself dictates which technique we have to use. Here I use Selenium together with Scrapy, and it still runs about as fast as a plain Scrapy spider.
My code:
import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from shutil import which
from time import sleep


class TeamsSpider(scrapy.Spider):
    name = 'teams'
    allowed_domains = ['gol.gg']
    start_urls = ['https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/']

    def __init__(self):
        chrome_options = Options()
        # chrome_options.add_argument("--headless")
        chrome_path = which("chromedriver")
        self.driver = webdriver.Chrome(executable_path=chrome_path)  # , options=chrome_options)
        self.driver.set_window_size(1920, 1080)
        self.driver.get("https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/")
        sleep(5)

        # Select the tournament in the drop-down so JavaScript renders the table
        dropDown = self.driver.find_element_by_xpath('//*[@id="cbtournament"]/option[text()= "CBLOL Split 1 2020"]')
        dropDown.click()
        sleep(5)

        # Keep the fully rendered page source, then shut the browser down
        self.html = self.driver.page_source
        self.driver.close()

    def parse(self, response):
        # Parse the Selenium-rendered HTML instead of Scrapy's own (static) response
        resp = Selector(text=self.html)
        for tr in resp.xpath('(//tbody)[2]/tr'):
            yield {
                'Name': tr.xpath(".//td/a/text()").get(),
                'Season': tr.xpath(".//td[2]/text()").get(),
                'Region': tr.xpath(".//td[3]/text()").get(),
                'Games': tr.xpath(".//td[4]/text()").get(),
                'winRate': tr.xpath(".//td[5]/text()").get()
            }
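As a side note, the per-row extraction logic above can be sanity-checked without launching a browser at all. Here is a minimal, stdlib-only sketch that applies the same column positions to a small hypothetical HTML fragment (the markup below is made up to mirror the table structure the spider assumes, not copied from the site):

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mimicking one row of the rendered teams table
html = """
<table><tbody>
  <tr>
    <td><a>Flamengo eSports</a></td>
    <td>S10</td><td>BR</td><td>21</td><td>61.9%</td>
  </tr>
</tbody></table>
"""

root = ET.fromstring(html)
rows = []
for tr in root.findall(".//tbody/tr"):
    # Same column layout the spider's XPaths rely on:
    # td[1] holds the team link, td[2..5] hold season/region/games/win rate
    rows.append({
        "Name": tr.find("./td/a").text,
        "Season": tr.find("./td[2]").text,
        "Region": tr.find("./td[3]").text,
        "Games": tr.find("./td[4]").text,
        "winRate": tr.find("./td[5]").text,
    })

print(rows[0])
# {'Name': 'Flamengo eSports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '61.9%'}
```

If the site ever changes its column order, a quick check like this against a saved `page_source` will tell you which XPath index broke.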
Output:
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Flamengo eSports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '61.9%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'FURIA Uppercut', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'INTZ e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '38.1%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'KaBuM! e-Sports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'paiN Gaming', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '47.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Prodigy Esports', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '52.4%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Redemption POA', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '28.6%'}
2021-08-04 13:59:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://gol.gg/teams/list/season-ALL/split-ALL/tournament-CBLOL%20Split%201%202020/>
{'Name': 'Vivo Keyd', 'Season': 'S10', 'Region': 'BR', 'Games': '21', 'winRate': '66.7%'}
2021-08-04 13:59:37 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-04 13:59:37 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 490,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1078,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 2.668237,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 4, 7, 59, 37, 58803),
'httpcompression/response_bytes': 278,
'httpcompression/response_count': 2,
'item_scraped_count': 8,