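Web scraping data using Python Beautiful Soup and urllib with a specific data table: answers

【Question title】: webscraping data using python beautiful soup urllib with specific data-table
【Posted】: 2018-12-12 00:43:29
【Question description】:

I am trying to scrape data from a specific web portal. I have tried learning and experimenting before, but have had limited success with beautiful_soup and urllib.

Below is my code, which does not seem to scrape the data I need...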

import numpy as np
import requests
from bs4 import BeautifulSoup as soup

httpLoc = 'https://uk.investing.com/currencies/forex-options'
url = requests.get(httpLoc, headers={'User-Agent': 'Mozilla/5.0'})
fx_data = np.array([])

content_page = soup(url.content, 'html.parser')
containers = content_page.findAll('table', {'class': 'vol-data-col'})
for table in containers:
    for td in table.findAll('vol-data-col'):
        #print(td.text)
        fx_data = np.append(fx_data, td.text)

The HTML on the site has the format below. I am trying to iterate over and extract every cell of the form "14.77".

<td class="vol-data-col ng-binding ng-scope" ng-mouseover="PageSettings.setHoverInstrumentTitle(instruments[$parent.$index].title)" ng-mouseleave="PageSettings.clearHoverInstrumentTitle(instruments[$parent.$index].title)" ng-repeat="period in periods" ui-sref="currency" ng-click="PageSettings.clearHoverInstrumentTitle(); $parent.$parent.$parent.currentTenor = period.name; summaryClickFunc(period, instruments[$parent.$index]); periods[$index].active = true">14.77%</td>

The attached image shows how the data looks on the website.

---- Update from comments ----

I have started trying selenium, and this is what I have so far:

import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("C:\\Python\\chromedriver.exe")
# Initialize the webdriver session 
driver.get('https://uk.investing.com/currencies/forex-options')
# replaces "ie.navigate" 
test = driver.find_elements_by_xpath(("//*[@id='curr_table']/class"))

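【Comments】:

If you inspect url.content or content_page, can you see the table's data?
Have you tried pd.read_html(url)?
If all the relevant tables share the same class, try including the whole class string in findAll: "vol-data-col ng-binding ng-scope"
@alecxe, yes, I did... If I inspect content_page, I can see that the whole HTML page/code has loaded.
@alexce, no... using html5lib.parser does not work. I don't think that is the problem.

Tags: python-3.x web-scraping beautifulsoup urllib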

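【Solution 1】:

You are not getting any data because the page's source code does not contain the data you are trying to fetch. The data is retrieved and rendered dynamically with JavaScript.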
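
One way to confirm this diagnosis is to count the target cells in the HTML that the server returns and compare that with what the browser's inspector shows. The sketch below uses only the standard library. The two HTML strings are hypothetical stand-ins (not copied from the site) for the raw response and for the DOM after the JavaScript has run, and `count_cells` is an illustrative helper name.

```python
from html.parser import HTMLParser


class ClassCellCounter(HTMLParser):
    """Count <td> start tags whose class attribute contains a given token."""

    def __init__(self, token):
        super().__init__()
        self.token = token
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        # The class attribute is a space-separated list of tokens.
        classes = (dict(attrs).get("class") or "").split()
        if self.token in classes:
            self.count += 1


def count_cells(html, token="vol-data-col"):
    """Return how many <td> elements in `html` carry the class `token`."""
    parser = ClassCellCounter(token)
    parser.feed(html)
    return parser.count


# Hypothetical stand-in for the raw response: the table shell is there, the cells are not.
STATIC_SOURCE = '<table class="summary data-table"><tbody></tbody></table>'

# Hypothetical stand-in for the DOM after the page's JavaScript has filled the table.
RENDERED_DOM = (
    '<table class="summary data-table"><tbody><tr>'
    '<td class="vol-data-col ng-binding ng-scope">14.77%</td>'
    '</tr></tbody></table>'
)

print(count_cells(STATIC_SOURCE))
print(count_cells(RENDERED_DOM))
```

If the count for the real `url.content` is zero while the inspector shows populated cells, the data is being added client-side and no choice of parser will recover it from the raw response.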

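To get the data, you either have to emulate the dynamic retrieval yourself, or drive a browser (for example a headless one controlled by selenium) to load the page and read the data from the rendered result.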
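
A minimal sketch of the browser-driven route, under several assumptions: Selenium 4 with a recent Chrome (the `--headless=new` flag and automatic driver discovery depend on the installed versions), and that the rendered cells can be found with the CSS selector `td.vol-data-col`, which is taken from the markup quoted in the question. `fetch_vol_cells` and `parse_percent` are illustrative names, not part of any library.

```python
def parse_percent(text):
    """Convert a cell such as '14.77%' to the float 14.77."""
    return float(text.strip().rstrip("%"))


def fetch_vol_cells(url, timeout=20):
    """Render `url` in headless Chrome and return the text of each td.vol-data-col."""
    # Imported here so parse_percent stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: recent Chrome
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the JavaScript has inserted at least one matching cell.
        cells = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, "td.vol-data-col"))
        )
        return [cell.text for cell in cells]
    finally:
        driver.quit()


# Example call (needs Chrome and network access, so it is left commented out):
# texts = fetch_vol_cells("https://uk.investing.com/currencies/forex-options")
# values = [parse_percent(t) for t in texts if t.strip().endswith("%")]
```

The explicit wait matters here: reading the cells immediately after `driver.get` can return an empty list if the table has not been populated yet.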

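-- Update from comments --

Given that you have chosen to use Selenium:

With your current approach, you need to work out the xpath of the table you are looking for. You can get it by inspecting the table in your browser and then choosing copy > xpath on the element. If you would rather write your own xpath expressions, you can see how that is done here.

For the table you want, the xpath will be something like //table[@class="summary data-table"]

To test various xpaths, you can paste them into your browser's console as a lookup: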

$x('//table[@class="summary data-table"]')
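
The same attribute test can also be tried offline in Python. This is only a sketch: it relies on the limited XPath subset in the standard library's `xml.etree.ElementTree`, which requires well-formed XML, so it runs against a small hypothetical snippet rather than the real page (real-world HTML usually needs a more forgiving parser). The cell values in the snippet are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the rendered page.
SAMPLE = """
<html><body>
  <table class="other"><tr><td>ignored</td></tr></table>
  <table class="summary data-table">
    <tr><td class="vol-data-col">14.77%</td><td class="vol-data-col">13.20%</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# ElementTree needs a relative path, hence the leading dot.
tables = root.findall(".//table[@class='summary data-table']")

# Collect the text of every cell inside the matching table(s).
cells = [td.text for table in tables for td in table.iter("td")]
print(cells)
```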

If you want a faster method, you can use querySelectors or CSS:

document.querySelector('table.summary.data-table')

# output from the browser
<table class="summary data-table">…</table>
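
One difference between the two lookups is worth keeping in mind. The XPath test `@class="summary data-table"` compares the whole attribute string, whereas the CSS selector `table.summary.data-table` matches on individual class tokens, in any order, even when other classes are present. That matters for the cells in the question, whose class attribute is `vol-data-col ng-binding ng-scope`. The sketch below builds a token-aware XPath using a common idiom; `xpath_for_classes` is a hypothetical helper name.

```python
def xpath_for_classes(tag, *class_names):
    """Build an XPath that matches `tag` elements carrying all given class tokens."""
    predicates = "".join(
        # Pad @class with spaces so each token can be matched as a whole word.
        f"[contains(concat(' ', normalize-space(@class), ' '), ' {name} ')]"
        for name in class_names
    )
    return f"//{tag}{predicates}"


print(xpath_for_classes("td", "vol-data-col"))
```

The resulting string can be pasted into the browser console's `$x(...)` lookup, or passed to a Selenium XPath lookup, to check that it matches the cells.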

For a more in-depth look at how to use Selenium, you can visit https://wiki.saucelabs.com/display/DOCS/Getting+Started+with+Selenium+for+Automated+Website+Testing

【Discussion】:

Thanks @B.Adler, I installed selenium and am using it. But "opening" the page is as far as I can get... any ideas? ....... import os from selenium import webdriver from selenium.webdriver.common.keys import Keys driver = webdriver.Chrome("C:\\Python\\chromedriver.exe") # Initialize the webdriver session driver.get ('investing.com/rates-bonds/…) # replaces "ie.navigate" test = driver.find_elements_by_xpath(("//*[@id='curr_table']/class"))