Beautiful Soup cannot find table on iShares

【Question title】: Beautiful Soup cannot find table on iShares
【Posted】: 2022-01-06 12:23:57
【Question】:

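For a while now I have been trying to scrape ETF data from iShares.com for an ongoing project. I am trying to build web scrapers for several websites, but they are all identical. Basically, I run into two problems:

1. I keep getting the error "AttributeError: 'NoneType' object has no attribute 'tr'", although I am fairly sure I have selected the correct table.
2. When I look at "Inspect element" on some of the sites, I have to click "Show more" to see the code for all rows.

I am not a computer scientist, and I have tried many different approaches, but sadly none of them worked, so I hope you can help.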

URL: https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf

The table can be found at the URL under "Holdings". Alternatively, it can be found under the following paths: JS path: tbody")> XPath: //*[@id="allHoldingsTable"]/tbody

Code:

import requests
import pandas as pd
from bs4 import BeautifulSoup


urls = [
'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf'
]

all_data = []
for url in urls:
    print("Loading URL {}".format(url))

    # load the page into soup:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # find correct table:
    tbl = soup.select_one(".allHoldingsTable")

    # remove the first row (it's not header):
    tbl.tr.extract()

    # convert the html to pandas DF:
    df = pd.read_html(str(tbl),thousands='.', decimal=',')[0]

    # move the first row to header:
    df.columns = map(lambda x: str(x).replace("*", "").strip(), df.loc[0])
    df = df.loc[1:].reset_index(drop=True).rename(columns={"nan": "Name"})

    df["Company"] = soup.h1.text.split("\n")[0].strip()
    df["URL"] = url
    all_data.append(df.loc[:, ~df.isna().all()])

df = pd.concat(all_data, ignore_index=True)
print(df)


from openpyxl import load_workbook
path= '/Users/karlemilthulstrup/Downloads/ishares.xlsx'
book = load_workbook(path ,read_only = False, keep_vba=True)
writer = pd.ExcelWriter(path, engine='openpyxl')
writer.book = book
df.to_excel(writer, index=False)
writer.save()
writer.close()

【Comments】:

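You are better off searching in "View page source" rather than "Inspect element". bs4 will not see elements that have been added or modified by JavaScript; it only sees the original page source.
If you disable JavaScript in your browser and then view the page, you will see that the holdings table contains no data. It only has a thead element.
@ChrisDoyle Out of curiosity, is there any way to scrape added/modified elements with bs4?
@ChrisDoyle Thank you very much for taking the time to look into it. It makes sense that I cannot scrape it. Is there a way to solve it, as wolfstter asked? Otherwise, I suppose I should go the API route.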
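
The point made in the comments can be reproduced offline. The sketch below is only an illustration: the HTML string is a made-up stand-in for the raw page source, assuming the table carries the id allHoldingsTable (as the XPath in the question suggests) and contains nothing but a thead before JavaScript runs.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw (pre-JavaScript) page source:
# the table has an id, not a class, and no body rows yet.
RAW_HTML = """
<table id="allHoldingsTable">
  <thead><tr><th>Issuer Ticker</th><th>Name</th></tr></thead>
</table>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# ".allHoldingsTable" is a class selector, so nothing matches and
# select_one() returns None; calling .tr on it raises the AttributeError.
by_class = soup.select_one(".allHoldingsTable")

# "#allHoldingsTable" is an id selector and does match the table ...
by_id = soup.select_one("#allHoldingsTable")

# ... but the static source still holds no data rows to scrape.
body_rows = by_id.select("tbody tr")

print(by_class, by_id.tbody, len(body_rows))
```

So switching the selector from a class to an id removes the AttributeError in this stand-in, but it still yields no holdings, because the rows are not part of the original source.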

Tags: python pandas web-scraping beautifulsoup

【Solution 1】:

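As noted in the comments, the data is rendered dynamically. If you don't want to access the data directly, you could use something like Selenium, which lets the page render first; you can then go in and select the table the way you did above.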
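
A rough sketch of the Selenium route might look like the following. It is untested against the live site and rests on several assumptions: Selenium and a local Chrome are installed, the rows appear under the selector #allHoldingsTable tbody tr (taken from the XPath in the question), and no consent dialog blocks the page. The function name render_page_source is hypothetical.

```python
def render_page_source(url, row_selector="#allHoldingsTable tbody tr", timeout=30):
    """Load url in a real browser and return the HTML once rows matching
    row_selector are present in the DOM."""
    # Imported lazily so this module can be imported without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is available locally
    try:
        driver.get(url)
        # Block until JavaScript has added at least one matching row,
        # or raise TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, row_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be passed to BeautifulSoup and selected with "#allHoldingsTable". Rows hidden behind a "Show more" control, as described in the question, are not handled here.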

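Also, there is a button that downloads the data as a CSV for you. Why not just use that?

But if you must scrape the page, you can get the data in JSON format. Just parse it: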

import requests
import json
import pandas as pd

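url = 'https://www.ishares.com/uk/individual/en/products/251382/ishares-msci-world-minimum-volatility-ucits-etf/1506575576011.ajax?tab=all&fileType=json'
r = requests.get(url)
# 'utf-8-sig' strips a leading byte order mark (BOM), if present,
# which would otherwise make json.loads() fail on r.text
r.encoding = 'utf-8-sig'
jsonData = json.loads(r.text)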


rows = []
for each in jsonData['aaData']:
    row = {'Issuer Ticker':each[0],
     'Name':each[1],
     'Sector':each[2],
     'Asset Class':each[3],
     'Market Value':each[4]['display'],
     'Market Value Raw':each[4]['raw'],
     'Weight (%)':each[5]['display'],
     'Weight (%) Raw':each[5]['raw'],
     'Notional Value':each[6]['display'],
     'Notional Value Raw':each[6]['raw'],
     'Nominal':each[7]['display'],
     'Nominal Raw':each[7]['raw'],
     'ISIN':each[8],
     'Price':each[9]['display'],
     'Price Raw':each[9]['raw'],
     'Location':each[10],
     'Exchange':each[11],
     'Market Currency':each[12]}
     
    rows.append(row)
     
df = pd.DataFrame(rows)

Output:

print(df)
    Issuer Ticker  ... Market Currency
0              VZ  ...             USD
1             ROG  ...             CHF
2            NESN  ...             CHF
3              WM  ...             USD
4             PEP  ...             USD
..            ...  ...             ...
309          ESH2  ...             USD
310          TUH2  ...             USD
311           JPY  ...             USD
312    MARGIN_JPY  ...             JPY
313    MARGIN_SGD  ...             SGD

[314 rows x 18 columns]
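
The 18 columns come from 13 positions per row, five of which are dicts with 'display' and 'raw' keys. As a sketch of the same idea in a more generic form, the hypothetical helper flatten_row below splits any dict-valued cell into two columns instead of hard-coding each index. The sample row is made up and much shorter than a real 'aaData' row.

```python
def flatten_row(columns, values):
    """Map one row to a flat dict; dict cells become '<name>' and '<name> Raw'."""
    flat = {}
    for name, value in zip(columns, values):
        if isinstance(value, dict):
            flat[name] = value.get("display")
            flat[f"{name} Raw"] = value.get("raw")
        else:
            flat[name] = value
    return flat


# Hypothetical, shortened sample shaped like the rows parsed above.
COLUMNS = ["Issuer Ticker", "Name", "Market Value"]
sample = [["VZ", "EXAMPLE NAME", {"display": "USD 1,000", "raw": 1000.0}]]

rows = [flatten_row(COLUMNS, values) for values in sample]
print(rows)
```

The resulting list of dicts can be handed to pd.DataFrame(rows) exactly as in the answer's code.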

【Comments】:

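Thank you so much! This is great. I can definitely make this work for my needs. To answer your question about the CSV: yes, I know you can download the CSV, but I have had some problems with it. In short, I don't need the unique ticker ID (ISIN code). Also, they removed the CSV option for a while, so I am glad to have a workable solution now!
Hi, I have a very trivial question. I have built parsers for all the funds I need, except one. I keep getting the following error: "json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)". The only reason I can think of is that this page is a US page while the others are UK pages. Do you know how to fix it? I can't figure it out... URL: ishares.com/us/products/283378/…
@KarlEmilThulstrup. Try using https://www.ishares.com/us/products/283378/fund/1467271812596.ajax?tab=all&fileType=json
Works great, thanks! If I may ask, how did you get this?
Go to the dev tools (Ctrl-Shift-I). Look at Network -> XHR. Then find the URL in the Headers tab.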
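
When a URL copied from the Network tab is wrong for a given fund, the server may return something other than JSON, and the "Expecting value: line 1 column 1 (char 0)" error gives little to go on. The hypothetical helper below is a minimal sketch that decodes the response body with utf-8-sig and, on failure, reports how the body actually starts. It works on raw bytes, so with requests it would be called as parse_json_body(r.content).

```python
import json
from typing import Any


def parse_json_body(body: bytes) -> Any:
    """Decode a response body (dropping any leading BOM) and parse it as JSON."""
    text = body.decode("utf-8-sig")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Show the start of the body so an HTML page is easy to recognise.
        preview = text[:60].strip()
        raise ValueError(f"Response is not JSON; it starts with: {preview!r}") from exc


# A body with a byte order mark parses cleanly.
print(parse_json_body(b"\xef\xbb\xbf" + b'{"aaData": []}'))
```

If the raised message shows the beginning of an HTML document, the URL is most likely the page itself and not the XHR endpoint.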