【Question Title】: How to scrape the first element of each parent from The Wall Street Journal market-data quotes using Selenium and Python?
【Posted】: 2020-10-17 16:37:47
【Question】:

Here is the HTML I want to scrape:

I am trying to use Selenium to get the first instance of "td" under each "tr" (BeautifulSoup does not work on this site). The list is long, so I am trying to do it iteratively. Here is my code:

from selenium import webdriver
import os


# define path to chrome driver
chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
browser = webdriver.Chrome(chrome_driver)
browser.get("https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement")

# get entire table
table = browser.find_element_by_xpath('//*[@id="cr_cashflow"]/div[2]/div/table')

#web element is not iterable
for row in table.find_element_by_tag_name('tr'):
    td = row.find_element_by_tag_name('td')
    print(td.text)

#web element is not subscriptable
for row in table.find_elements_by_tag_name('tr'):
    print(row[0].text)

I tried both of the for loops above; the first errors with "web element is not iterable" while the second says it is "not subscriptable". What is the difference between the two? How can I change my code so that it returns "Sales/Revenue, Premiums Earned, …"?
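The two errors stem from Selenium's singular/plural API split: find_element_by_tag_name returns a single WebElement (neither iterable nor subscriptable), while find_elements_by_tag_name returns a plain Python list. A browser-free sketch of the distinction, using a hypothetical stand-in class rather than a real WebElement:

```python
# A hypothetical stand-in for a Selenium WebElement (illustration only).
class FakeElement:
    text = "Sales/Revenue"

single = FakeElement()                  # what find_element_by_tag_name returns
many = [FakeElement(), FakeElement()]   # what find_elements_by_tag_name returns

messages = []
try:
    for _ in single:                    # a lone element cannot be looped over
        pass
except TypeError as exc:
    messages.append(str(exc))

try:
    single[0]                           # ...and cannot be indexed either
except TypeError as exc:
    messages.append(str(exc))

print(messages)

# The list, by contrast, supports both iteration and indexing.
first_cells = [el.text for el in many]
print(first_cells)
```

The working pattern therefore combines both calls: loop over the list returned by find_elements_*, and call the singular find_element_* on each row to grab its first td.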

【Question Discussion】:

    Tags: python selenium google-chrome selenium-chromedriver bots


    【Solution 1】:

    To get the first td under each tr, use this CSS selector:

    table.cr_dataTable tbody tr td[class]:nth-child(1)
    

    Try the code below:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import os
    
    chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
    browser = webdriver.Chrome(chrome_driver)
    
    browser.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')
    
    elements = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'table.cr_dataTable tbody tr td[class]:nth-child(1)')))
    for element in elements:
        print(element.text)
    
    browser.quit()
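    The selector's intent ("the first td of each tr") can be sanity-checked offline with nothing but the standard library's html.parser; the table fragment below is a made-up stand-in for the WSJ markup, not the real page source:

```python
from html.parser import HTMLParser

# Collects the text of the first <td> in every <tr>, mirroring the
# CSS selector "tr td:nth-child(1)".
class FirstCellParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_first_td = False
        self._seen_td_in_row = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._seen_td_in_row = False     # new row: reset the flag
        elif tag == "td" and not self._seen_td_in_row:
            self._seen_td_in_row = True      # this is the row's first td
            self._in_first_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_first_td = False

    def handle_data(self, data):
        if self._in_first_td and data.strip():
            self.cells.append(data.strip())

# Made-up sample rows, not the actual WSJ table values.
html = """
<table><tbody>
  <tr><td>Sales/Revenue</td><td>69,620</td></tr>
  <tr><td>Premiums Earned</td><td>42,430</td></tr>
</tbody></table>
"""
parser = FirstCellParser()
parser.feed(html)
print(parser.cells)
```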
    

    【Discussion】:

      【Solution 2】:

      You can try using pandas to get the table, as in Trying to scrape table using Pandas from Selenium's result:

      from selenium import webdriver
      import pandas as pd
      import os
      
      
      # define path to chrome driver
      chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
      browser = webdriver.Chrome(chrome_driver)
      browser.get("https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement")
      
      # get table
      df = pd.read_html(browser.page_source)[0]
      
      # get values
      val = [i for i in df["Fiscal year is January-December. All values USD Millions."].values if isinstance(i, str)]
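      The isinstance filter is there because pd.read_html fills blank spacer cells with float NaN, so the column mixes strings with floats. The filtering step itself is plain Python and can be sketched with made-up sample values:

```python
# A made-up stand-in for df["..."].values after read_html: string row
# labels interleaved with float NaN for blank spacer cells.
values = ["Sales/Revenue", float("nan"), "Premiums Earned",
          float("nan"), "Net Income"]

# isinstance(i, str) keeps the labels and drops the NaN floats,
# which is exactly what the comprehension in the answer does.
val = [i for i in values if isinstance(i, str)]
print(val)
```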
      

      【Discussion】:

        【Solution 3】:

        I took your code, simplified the structure, and ran the test with a minimal number of lines of code as follows:

        from selenium import webdriver
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.support.ui import WebDriverWait
        
        
        options = webdriver.ChromeOptions()
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)
        driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
        driver.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')
        print(driver.page_source)
        print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.cr_dataTable tbody tr>td[class]")))])
        print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='cr_dataTable']//tbody//tr/td[@class]")))])
        

        Similarly, in line with your observation, I hit the same roadblock: my test yielded no results.

        On inspecting the webpage's Page Source, I found an EventListener within a <script> that validates certain page metrics, some of which are:

        • window.utag_data
        • window.utag_data.page_performance
        • window.PerformanceTiming
        • window.PerformanceObserver
        • newrelic
        • first-contentful-paint

        Page Source:

        <script>
            "use strict";
        
            if (window.PerformanceTiming) {
              window.addEventListener('DOMContentLoaded', function () {
                if (window.utag_data && window.utag_data.page_performance) {
                  var dcl = 'DCL ' + parseInt(performance.timing.domContentLoadedEventStart - performance.timing.domLoading);
                  var pp = window.utag_data.page_performance.split('|');
                  pp[1] = dcl;
                  utag_data.page_performance = pp.join('|');
                } else {
                  console.warn('No utag_data.page_performance available');
                }
              });
            }
        
            if (window.PerformanceTiming && window.PerformanceObserver) {
              var observer = new PerformanceObserver(function (list) {
                var entries = list.getEntries();
        
                var _loop = function _loop(i) {
                  var entry = entries[i];
                  var metricName = entry.name;
                  var time = Math.round(entry.startTime + entry.duration);
        
                  if (typeof newrelic !== 'undefined') {
                    newrelic.setCustomAttribute(metricName, time);
                  }
        
                  if (entry.name === 'first-contentful-paint' && window.utag_data && window.utag_data.page_performance) {
                    var fcp = 'FCP ' + parseInt(entry.startTime);
                    var pp = utag_data.page_performance.split('|');
                    pp[0] = fcp;
                    utag_data.page_performance = pp.join('|');
                  } else {
                    window.addEventListener('DOMContentLoaded', function () {
                      if (window.utag_data && window.utag_data.page_performance) {
                        var _fcp = 'FCP ' + parseInt(entry.startTime);
        
                        var _pp = utag_data.page_performance.split('|');
        
                        _pp[0] = _fcp;
                        utag_data.page_performance = _pp.join('|');
                      } else {
                        console.warn('No utag_data.page_performance available');
                      }
                    });
                  }
                };
        
                for (var i = 0; i < entries.length; i++) {
                  _loop(i);
                }
              });
        
              if (window.PerformancePaintTiming) {
                observer.observe({
                  entryTypes: ['paint', 'mark', 'measure']
                });
              } else {
                observer.observe({
                  entryTypes: ['mark', 'measure']
                });
              }
            }
          </script> <script>
        if (window && typeof newrelic !== 'undefined') {
          newrelic.setCustomAttribute('browserWidth', window.innerWidth);
        }
        </script> <title>MET | MetLife Inc. Annual Income Statement - WSJ</title> <link rel="canonical" href="https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement">
        

        Conclusion

        This clearly indicates that the website is protected by a robust Bot Management service, and the navigation initiated by the Selenium-driven WebDriver gets detected and is subsequently blocked.



        【Discussion】:
