【Question Title】: How to scrape the first element of each parent from The Wall Street Journal market-data quotes using Selenium and Python?
【Posted】: 2020-10-17 16:37:47
【Question】:

Here is the HTML I want to scrape:

I am trying to use Selenium to get the first instance of "td" under each "tr" (BeautifulSoup does not work on this site). The list is long, so I am trying to do it iteratively. Here is my code:

from selenium import webdriver
import os


# define path to chrome driver
chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
browser = webdriver.Chrome(chrome_driver)
browser.get("https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement")

# get entire table
table = browser.find_element_by_xpath('//*[@id="cr_cashflow"]/div[2]/div/table')

#web element is not iterable
for row in table.find_element_by_tag_name('tr'):
    td = row.find_element_by_tag_name('td')
    print(td.text)

#web element is not subscriptable
for row in table.find_elements_by_tag_name('tr'):
    print(row[0].text)

I tried both of the for loops above; the first errors with "web element is not iterable" while the second says it is "not subscriptable". What is the difference between the two? How can I change my code so that it returns "Sales/Revenue, Premiums Earned, …"?
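The two errors stem from Selenium's singular/plural API split: find_element_by_tag_name returns a single WebElement (neither iterable nor subscriptable), while find_elements_by_tag_name returns a plain Python list. A browser-free sketch of the distinction, using a hypothetical stand-in class rather than a real WebElement:

```python
# A hypothetical stand-in for a Selenium WebElement (illustration only).
class FakeElement:
    text = "Sales/Revenue"

single = FakeElement()                  # what find_element_by_tag_name returns
many = [FakeElement(), FakeElement()]   # what find_elements_by_tag_name returns

messages = []
try:
    for _ in single:                    # a lone element cannot be looped over
        pass
except TypeError as exc:
    messages.append(str(exc))

try:
    single[0]                           # ...and cannot be indexed either
except TypeError as exc:
    messages.append(str(exc))

print(messages)

# The list, by contrast, supports both iteration and indexing.
first_cells = [el.text for el in many]
print(first_cells)
```

The working pattern therefore combines both calls: loop over the list returned by find_elements_*, and call the singular find_element_* on each row to grab its first td.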

【Question Discussion】:

    Tags: python selenium google-chrome selenium-chromedriver bots


    【Solution 1】:

    To get the first td under each tr, use this CSS selector:

    table.cr_dataTable tbody tr td[class]:nth-child(1)
    

    Try the code below:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    import os
    
    chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
    browser = webdriver.Chrome(chrome_driver)
    
    browser.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')
    
    elements = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'table.cr_dataTable tbody tr td[class]:nth-child(1)')))
    for element in elements:
        print(element.text)
    
    browser.quit()
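    The selector's intent ("the first td of each tr") can be sanity-checked offline with nothing but the standard library's html.parser; the table fragment below is a made-up stand-in for the WSJ markup, not the real page source:

```python
from html.parser import HTMLParser

# Collects the text of the first <td> in every <tr>, mirroring the
# CSS selector "tr td:nth-child(1)".
class FirstCellParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_first_td = False
        self._seen_td_in_row = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._seen_td_in_row = False     # new row: reset the flag
        elif tag == "td" and not self._seen_td_in_row:
            self._seen_td_in_row = True      # this is the row's first td
            self._in_first_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_first_td = False

    def handle_data(self, data):
        if self._in_first_td and data.strip():
            self.cells.append(data.strip())

# Made-up sample rows, not the actual WSJ table values.
html = """
<table><tbody>
  <tr><td>Sales/Revenue</td><td>69,620</td></tr>
  <tr><td>Premiums Earned</td><td>42,430</td></tr>
</tbody></table>
"""
parser = FirstCellParser()
parser.feed(html)
print(parser.cells)
```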
    

    【Discussion】:

      【Solution 2】:

      You can try using pandas to get the table, as in Trying to scrape table using Pandas from Selenium's result:

      from selenium import webdriver
      import pandas as pd
      import os
      
      
      # define path to chrome driver
      chrome_driver = os.path.abspath('C:/Users/USER/Desktop/chromedriver.exe')
      browser = webdriver.Chrome(chrome_driver)
      browser.get("https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement")
      
      # get table
      df = pd.read_html(browser.page_source)[0]
      
      # get values
      val = [i for i in df["Fiscal year is January-December. All values USD Millions."].values if isinstance(i, str)]
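      The isinstance filter is there because pd.read_html fills blank spacer cells with float NaN, so the column mixes strings with floats. The filtering step itself is plain Python and can be sketched with made-up sample values:

```python
# A made-up stand-in for df["..."].values after read_html: string row
# labels interleaved with float NaN for blank spacer cells.
values = ["Sales/Revenue", float("nan"), "Premiums Earned",
          float("nan"), "Net Income"]

# isinstance(i, str) keeps the labels and drops the NaN floats,
# which is exactly what the comprehension in the answer does.
val = [i for i in values if isinstance(i, str)]
print(val)
```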
      

      【Discussion】:

        【Solution 3】:

        I took your code, simplified the structure, and ran the test with a minimal number of lines of code as follows:

        from selenium import webdriver
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.support.ui import WebDriverWait
        
        
        options = webdriver.ChromeOptions()
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)
        driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
        driver.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')
        print(driver.page_source)
        print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.cr_dataTable tbody tr>td[class]")))])
        print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='cr_dataTable']//tbody//tr/td[@class]")))])
        

        Similarly, in line with your observation, I hit the same roadblock: my test yielded no results.

        On inspecting the webpage's Page Source, I found an EventListener within a <script> that validates certain page metrics, some of which are:

        • window.utag_data
        • window.utag_data.page_performance
        • window.PerformanceTiming
        • window.PerformanceObserver
        • newrelic
        • first-contentful-paint

        Page Source:

        <script>
            "use strict";
        
            if (window.PerformanceTiming) {
              window.addEventListener('DOMContentLoaded', function () {
                if (window.utag_data && window.utag_data.page_performance) {
                  var dcl = 'DCL ' + parseInt(performance.timing.domContentLoadedEventStart - performance.timing.domLoading);
                  var pp = window.utag_data.page_performance.split('|');
                  pp[1] = dcl;
                  utag_data.page_performance = pp.join('|');
                } else {
                  console.warn('No utag_data.page_performance available');
                }
              });
            }
        
            if (window.PerformanceTiming && window.PerformanceObserver) {
              var observer = new PerformanceObserver(function (list) {
                var entries = list.getEntries();
        
                var _loop = function _loop(i) {
                  var entry = entries[i];
                  var metricName = entry.name;
                  var time = Math.round(entry.startTime + entry.duration);
        
                  if (typeof newrelic !== 'undefined') {
                    newrelic.setCustomAttribute(metricName, time);
                  }
        
                  if (entry.name === 'first-contentful-paint' && window.utag_data && window.utag_data.page_performance) {
                    var fcp = 'FCP ' + parseInt(entry.startTime);
                    var pp = utag_data.page_performance.split('|');
                    pp[0] = fcp;
                    utag_data.page_performance = pp.join('|');
                  } else {
                    window.addEventListener('DOMContentLoaded', function () {
                      if (window.utag_data && window.utag_data.page_performance) {
                        var _fcp = 'FCP ' + parseInt(entry.startTime);
        
                        var _pp = utag_data.page_performance.split('|');
        
                        _pp[0] = _fcp;
                        utag_data.page_performance = _pp.join('|');
                      } else {
                        console.warn('No utag_data.page_performance available');
                      }
                    });
                  }
                };
        
                for (var i = 0; i < entries.length; i++) {
                  _loop(i);
                }
              });
        
              if (window.PerformancePaintTiming) {
                observer.observe({
                  entryTypes: ['paint', 'mark', 'measure']
                });
              } else {
                observer.observe({
                  entryTypes: ['mark', 'measure']
                });
              }
            }
          </script> <script>
        if (window && typeof newrelic !== 'undefined') {
          newrelic.setCustomAttribute('browserWidth', window.innerWidth);
        }
        </script> <title>MET | MetLife Inc. Annual Income Statement - WSJ</title> <link rel="canonical" href="https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement">
        

        Conclusion

        This clearly indicates that the website is protected by a robust Bot Management service, and the navigation initiated by the Selenium-driven WebDriver gets detected and is subsequently blocked.



        【Discussion】:
