【Question title】: How to extract values from dynamic website using selenium and PhantomJS
【Posted】: 2019-01-05 18:21:23
【Question description】:

I am trying to get the value of the timer (screenshot: http://prntscr.com/kcbwd8) shown on this website, https://www.whenisthenextsteamsale.com/, and I want to store it in a variable.

import urllib
from bs4 import BeautifulSoup as bs
import time
import requests
from selenium import webdriver
from urllib.request import urlopen, Request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}

browser = webdriver.PhantomJS()
browser.get('https://www.whenisthenextsteamsale.com/')

soup = bs(browser.page_source, "html.parser")
result = soup.find_all("p",{"id":"subTimer"})

for item in result:
    print(item.text)

browser.quit()

I have tried the code above, but it returns this error:

C:\Users\rober\Anaconda3\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
19:59:11

Is there a way to fix this? If not, is there another way to get the site's dynamic values and store them in variables?

Thanks.

【Comments】:

    Tags: javascript selenium selenium-webdriver web-scraping phantomjs


    【Solution 1】:

    Your code works as written, although you never use the headers you defined as:

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
    

    I executed your own script as follows:

    import urllib
    from bs4 import BeautifulSoup as bs
    import time
    import requests
    from selenium import webdriver
    from urllib.request import urlopen, Request
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
    browser = webdriver.PhantomJS(executable_path=r'C:\Utility\phantomjs-2.1.1-windows\bin\phantomjs.exe')
    browser.get('https://www.whenisthenextsteamsale.com/')
    soup = bs(browser.page_source, "html.parser")
    result = soup.find_all("p",{"id":"subTimer"})
    for item in result:
        print(item.text)
    browser.quit()
    

    I do see the same output on the console:

    C:\Python\lib\site-packages\selenium\webdriver\phantomjs\webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
      warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
    08:06:16
    

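    It is worth mentioning that the Selenium team has already dropped default support for PhantomJS in the Selenium Java Client and will follow the same approach in the Selenium Python Client. What you are observing is a warning, not an error: the script still runs and prints the timer value. The warning is emitted by the PhantomJS __init__() method, shown below: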

    def __init__(self, executable_path="phantomjs",
                 port=0, desired_capabilities=DesiredCapabilities.PHANTOMJS,
                 service_args=None, service_log_path=None):
        """
        Creates a new instance of the PhantomJS / Ghostdriver.
    
        Starts the service and then creates new instance of the driver.
    
        :Args:
         - executable_path - path to the executable. If the default is used it assumes the executable is in the $PATH
         - port - port you would like the service to run, if left as 0, a free port will be found.
         - desired_capabilities: Dictionary object with non-browser specific
           capabilities only, such as "proxy" or "loggingPref".
         - service_args : A List of command line arguments to pass to PhantomJS
         - service_log_path: Path for phantomjs service to log to.
        """
        warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
                      'versions of Chrome or Firefox instead')
        self.service = Service(
            executable_path,
            port=port,
            service_args=service_args,
            log_path=service_log_path)
        self.service.start()
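
Because the message comes from warnings.warn() rather than from a raised exception, the constructor keeps running after it is printed. The following is a minimal sketch of that behaviour using only the standard library; start_driver() is a hypothetical stand-in for the PhantomJS constructor, not part of Selenium.

```python
import warnings


def start_driver():
    """Hypothetical stand-in for PhantomJS.__init__: warns, then carries on."""
    warnings.warn("Selenium support for PhantomJS has been deprecated, "
                  "please use headless versions of Chrome or Firefox instead")
    return "driver started"


# Record warnings instead of printing them, to show that execution continues.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = start_driver()

print(result)                       # the call still returned normally
print(caught[0].category.__name__)  # UserWarning

# To hide only this message for the rest of the program (matched as a regex
# against the start of the warning text):
warnings.filterwarnings("ignore", message="Selenium support for PhantomJS")
```

Hiding the message does not change the fact that PhantomJS support is deprecated, so this is only a stopgap.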
    

    【Comments】:

      【Solution 2】:

      PhantomJS is no longer maintained. https://groups.google.com/forum/m/#!topic/phantomjs/9aI5d-LDuNE

      You should use headless Chrome / Firefox.

      You will have to replace this code:

      browser = webdriver.PhantomJS()
      browser.get('https://www.whenisthenextsteamsale.com/')

      with the following:
      

      from selenium import webdriver
      from selenium.webdriver.firefox.options import Options
      
      options = Options()
      options.add_argument("--headless")
      # executable_path works on Selenium 3; Selenium 4.10+ expects a Service object instead.
      browser = webdriver.Firefox(options=options, executable_path="Path to geckodriver.exe")
      browser.get('https://www.whenisthenextsteamsale.com/')
      

      Download GeckoDriver here: Download GeckoDriver
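
Putting the pieces together, here is a hedged sketch of reading the timer with headless Firefox and storing it in a variable. It assumes a recent Selenium 4 release, that geckodriver is on PATH (or can be located by Selenium), and that the timer text has the HH:MM:SS form seen in the output above. The function names fetch_timer_text and parse_timer are illustrative, not part of any library.

```python
from datetime import timedelta

URL = "https://www.whenisthenextsteamsale.com/"


def parse_timer(text):
    """Convert an 'HH:MM:SS' string (assumed format) into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.strip().split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def fetch_timer_text(timeout=10):
    """Open the page in headless Firefox and return the text of #subTimer."""
    # Imported here so parse_timer() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    # Assumption: geckodriver is on PATH or can be found by Selenium itself.
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(URL)
        # Wait until JavaScript has filled the element with non-empty text.
        return WebDriverWait(browser, timeout).until(
            lambda d: d.find_element(By.ID, "subTimer").text.strip()
        )
    finally:
        browser.quit()
```

With a browser and driver available, remaining = parse_timer(fetch_timer_text()) would leave the countdown in a variable. The explicit wait replaces the BeautifulSoup pass over page_source, which can run before the script on the page has filled in the timer.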

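      【Comments】:

      • Do I have to change a lot of the code I currently have, and if so, what exactly?
      • Yes! You need to change the way browser is created in your existing code. I have already shown in the answer how to create the driver. That replaces the browser creation. Everything else stays the same.
      • I have updated my code. Hope it answers your question.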