I took your code, simplified the structure, and ran the test with a minimal number of lines, as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Hide the most obvious automation switches from the page
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
# Selenium 3 style; Selenium 4 takes the driver path through a Service object instead
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement')
print(driver.page_source)

# The same cells are located twice: once by CSS selector, once by XPath
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "table.cr_dataTable tbody tr>td[class]")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='cr_dataTable']//tbody//tr/td[@class]")))])
As you observed, I ran into the same roadblock: my test produced no results either.
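When the waits return nothing, it can help to check whether the table ever arrived in the HTML at all, or whether it arrived but never became visible. The following is only a rough sketch using the standard library; `DataTableFinder` and `count_data_cells` are hypothetical helper names, and the check is looser than the selector above (it ignores the `tbody`/`tr` nesting and only counts `td` cells with a `class` attribute inside a `table` whose class list contains `cr_dataTable`).

```python
from html.parser import HTMLParser


class DataTableFinder(HTMLParser):
    """Counts <td class=...> cells inside <table class="cr_dataTable">.

    Hypothetical helper for illustration; not part of Selenium.
    """

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while the parser is inside the target table
        self.cell_count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "table" and (self.table_depth or "cr_dataTable" in classes):
            # Entering the target table, or a table nested inside it
            self.table_depth += 1
        elif tag == "td" and self.table_depth and "class" in attrs:
            self.cell_count += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


def count_data_cells(page_source: str) -> int:
    """Return how many classed <td> cells the cr_dataTable table contains."""
    finder = DataTableFinder()
    finder.feed(page_source)
    finder.close()
    return finder.cell_count
```

Calling `count_data_cells(driver.page_source)` after `driver.get(...)` and getting `0` would suggest the table was never delivered to the browser, rather than delivered and hidden.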
Inspecting the page source of the webpage reveals an EventListener inside a <script> that validates certain page metrics. Some of them are:
window.utag_data
window.utag_data.page_performance
window.PerformanceTiming
window.PerformanceObserver
newrelic
first-contentful-paint
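To repeat this inspection on other pages, the names listed above can be searched for in the page source. This is a minimal sketch; `MARKERS` and `find_markers` are hypothetical names, and a plain substring match only shows that such a script is present. It says nothing on its own about what the site does with those metrics.

```python
# Names taken from the list above; the substring check is an assumption,
# not a reliable test for bot detection.
MARKERS = [
    "window.utag_data",
    "window.utag_data.page_performance",
    "window.PerformanceTiming",
    "window.PerformanceObserver",
    "newrelic",
    "first-contentful-paint",
]


def find_markers(page_source: str) -> list:
    """Return the marker strings that occur in the given page source."""
    return [marker for marker in MARKERS if marker in page_source]
```

For example, `find_markers(driver.page_source)` would return the subset of these names found in the page that the WebDriver received.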
Page Source:
<script>
  "use strict";
  if (window.PerformanceTiming) {
    window.addEventListener('DOMContentLoaded', function () {
      if (window.utag_data && window.utag_data.page_performance) {
        var dcl = 'DCL ' + parseInt(performance.timing.domContentLoadedEventStart - performance.timing.domLoading);
        var pp = window.utag_data.page_performance.split('|');
        pp[1] = dcl;
        utag_data.page_performance = pp.join('|');
      } else {
        console.warn('No utag_data.page_performance available');
      }
    });
  }
  if (window.PerformanceTiming && window.PerformanceObserver) {
    var observer = new PerformanceObserver(function (list) {
      var entries = list.getEntries();
      var _loop = function _loop(i) {
        var entry = entries[i];
        var metricName = entry.name;
        var time = Math.round(entry.startTime + entry.duration);
        if (typeof newrelic !== 'undefined') {
          newrelic.setCustomAttribute(metricName, time);
        }
        if (entry.name === 'first-contentful-paint' && window.utag_data && window.utag_data.page_performance) {
          var fcp = 'FCP ' + parseInt(entry.startTime);
          var pp = utag_data.page_performance.split('|');
          pp[0] = fcp;
          utag_data.page_performance = pp.join('|');
        } else {
          window.addEventListener('DOMContentLoaded', function () {
            if (window.utag_data && window.utag_data.page_performance) {
              var _fcp = 'FCP ' + parseInt(entry.startTime);
              var _pp = utag_data.page_performance.split('|');
              _pp[0] = _fcp;
              utag_data.page_performance = _pp.join('|');
            } else {
              console.warn('No utag_data.page_performance available');
            }
          });
        }
      };
      for (var i = 0; i < entries.length; i++) {
        _loop(i);
      }
    });
    if (window.PerformancePaintTiming) {
      observer.observe({
        entryTypes: ['paint', 'mark', 'measure']
      });
    } else {
      observer.observe({
        entryTypes: ['mark', 'measure']
      });
    }
  }
</script>
<script>
  if (window && typeof newrelic !== 'undefined') {
    newrelic.setCustomAttribute('browserWidth', window.innerWidth);
  }
</script>
<title>MET | MetLife Inc. Annual Income Statement - WSJ</title>
<link rel="canonical" href="https://www.wsj.com/market-data/quotes/MET/financials/annual/income-statement">
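Conclusion
This clearly indicates that the website is protected by strong Bot Management technology, and that navigation initiated by a Selenium-driven WebDriver is detected and subsequently blocked.
References
You can find related discussions in: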