[Question Title]: Scraping web page with Requests Python
[Posted]: 2021-07-30 09:44:00
[Question Description]:

I want to scrape this page, but after making the request I can't find the table with BeautifulSoup.

Code:

import requests
from bs4 import BeautifulSoup

headers = {
    "Referer": "https://www.atptour.com/en/scores/results-archive",
    "User-Agent": "my-user-agent",
}
url = 'https://www.atptour.com/en/scores/results-archive?year=2016'
page = requests.get(url, headers=headers)
print(page)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', class_="results-archive-table mega-table")
print(table)

Output:

<Response [403]>
None

[Question Comments]:

  • Replace the spaces in the class name with periods, making it a CSS selector: results-archive-table.mega-table
  • The table is not in the initial HTML — look at what page.text actually contains.
  • Check page.status_code?
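The first comment's class-name advice can be sketched on a small stand-in snippet (the HTML below is a hypothetical fragment, not the real ATP page). BeautifulSoup's find() treats class as a multi-valued attribute, while select_one() takes a dot-joined CSS selector:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the ATP results page markup.
html = '<table class="results-archive-table mega-table"><tr><td>x</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# find() with a single class name matches any tag that has that class:
t1 = soup.find("table", class_="results-archive-table")

# A CSS selector joins multiple classes with dots and matches them
# regardless of their order in the class attribute:
t2 = soup.select_one("table.results-archive-table.mega-table")

print(t1 is not None, t2 is not None)
```

Note that neither form helps here: the 403 means the table never reaches page.text in the first place.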

Tags: python beautifulsoup python-requests


[Solution 1]:

I got Response [200] using scrapy-selenium and selenium-stealth.

Code:

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium_stealth import stealth
from selenium import webdriver
from shutil import which 
from selenium.webdriver.chrome.options import Options

class AtpSpider(scrapy.Spider):
    name = 'atptour'
    chrome_path = which("chromedriver")
    chrome_options = Options()
    chrome_options.add_argument("--headless")

    driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)
    stealth(
        driver,
        user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36',
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=False,
    )

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.atptour.com/en/scores/results-archive?year=2016',
            wait_time=5,
            callback=self.parse,
        )

    def parse(self, response):
        pass
 

Output:

2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-07-31 10:25:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.atptour.com/en/scores/results-archive> (referer: None)
2021-07-31 10:25:05 [scrapy.core.engine] INFO: Closing spider (finished)
2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: DELETE http://127.0.0.1:53662/session/039ca0bb0a64b7b9eb48ab26a0f464a0 {}
2021-07-31 10:25:05 [urllib3.connectionpool] DEBUG: http://127.0.0.1:53662 "DELETE /session/039ca0bb0a64b7b9eb48ab26a0f464a0 HTTP/1.1" 200 14
2021-07-31 10:25:05 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
2021-07-31 10:25:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 15142,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
      

[Comments]:

    [Solution 2]:

    The site is protected by Cloudflare and expects JavaScript to be enabled when you visit, as in a real browser, which the requests library cannot do. You can therefore try Selenium.

    I noticed that using Selenium in headless mode triggers a captcha, but non-headless works. Afterwards you can parse with BeautifulSoup.

    Try this:

    from selenium import webdriver
    from bs4 import BeautifulSoup
    
    chrome_path = 'Add your chromedriver path here'
    driver = webdriver.Chrome(executable_path=chrome_path)
    
    url = 'https://www.atptour.com/en/scores/results-archive?year=2016'
    driver.get(url)
    data = driver.page_source
    
    soup = BeautifulSoup(data, 'html.parser')
    table = soup.find('table', class_="results-archive-table mega-table")
    print(table)
    
    driver.quit()
    

    [Comments]:

      [Solution 3]:

      Look at the response:

      print(page)
      <Response [403]>
      

      Maybe you have to add some headers to the request.
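A way to check which headers would actually be sent, without hitting the network, is to prepare the request locally. The header values below are illustrative placeholders, and as the comments note, headers alone may not get past Cloudflare:

```python
import requests

# Illustrative browser-like headers; swap in values copied from your browser's
# developer tools if you experiment with this.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Referer": "https://www.atptour.com/en/scores/results-archive",
    "Accept-Language": "en-US,en;q=0.9",
}
url = "https://www.atptour.com/en/scores/results-archive?year=2016"

# Build and prepare the request without sending it, then inspect it:
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])
```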

      [Comments]:

      • I added some headers, but the problem persists
      • For heavily JavaScript-rendered sites, consider using selenium.
      • To bypass Cloudflare protection you need Selenium, just to "humanize" your bot.
      • @luka Take a look here: pypi.org/project/cloudscraper