【Question Title】: DataFrame not showing complete table data
【Posted】: 2022-01-25 08:50:12
【Question】:

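I scraped some information about S&P 500 stocks from this website: https://www.slickcharts.com/sp500. The scraping itself works fine: if I add a print statement inside the for loop, all the data is displayed. That is, this code: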

# Web-scraped S&P 500 data for 500+ US stocks.

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'}

url = 'https://www.slickcharts.com/sp500' # Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

table1 = soup.find('table', attrs={'class':'table table-hover table-borderless table-sm'})

for row in table1.find_all('tr'):
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        company = all_td_tags[1].text
        symbol = all_td_tags[2].text
        weight = all_td_tags[3].text
        price = all_td_tags[4].text
        chg = all_td_tags[5].text
        perChg = all_td_tags[6].text
        print(company, '|', symbol, '|', weight, '|', price, '|', chg, '|', perChg)

Output:

Apple Inc. | AAPL | 6.866056 |    176.34 | 0.06 | (0.03%)
Microsoft Corporation | MSFT | 6.279809 |    334.50 | -0.19 | (-0.06%)
Amazon.com Inc. | AMZN | 3.729209 |    3,418.46 | -2.91 | (-0.09%)
Alphabet Inc. Class A | GOOGL | 2.208863 |    2,938.00 | -0.33 | (-0.01%)
Tesla Inc | TSLA | 2.169114 |    1,069.30 | 2.30 | (0.22%)
Alphabet Inc. Class C | GOOG | 2.056323 |    2,942.00 | -0.85 | (-0.03%)
Meta Platforms Inc. Class A | FB | 1.982391 |    336.00 | 0.76 | (0.23%)
NVIDIA Corporation | NVDA | 1.851853 |    295.60 | -0.80 | (-0.27%)
...

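However, when I write the code with a DataFrame instead (I want to use it to look up data for a specific stock, e.g. I enter "AAPL" and get that stock's price, weight, and so on):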

# Web-scraped S&P 500 data for 500+ US stocks.

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'}

url = 'https://www.slickcharts.com/sp500' # Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

table1 = soup.find('table', attrs={'class':'table table-hover table-borderless table-sm'})

for row in table1.find_all('tr'):
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        company = all_td_tags[1].text
        symbol = all_td_tags[2].text
        weight = all_td_tags[3].text
        price = all_td_tags[4].text
        chg = all_td_tags[5].text
        perChg = all_td_tags[6].text

df = pd.DataFrame({'Company': [company], 'Symbol': [symbol], 'Weight': [weight], 'Price': [price], 'Change': [chg], 'Percent_Change': [perChg]})

print(df.head())

I only get the information for one stock, when I should get the whole table:

                    Company Symbol    Weight     Price Change Percent_Change
0  News Corporation Class B    NWS  0.006948     22.75   0.20        (0.89%)

What am I doing wrong with the DataFrame so that it only shows one stock (which happens to be the last one in the table)?

Update

I replaced the definition of df like this:

# Web-scraped S&P 500 data for 500+ US stocks.

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36 Edg/96.0.1054.62'}

url = 'https://www.slickcharts.com/sp500' # Data from SlickCharts
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')

table1 = soup.find('table', attrs={'class':'table table-hover table-borderless table-sm'})

for row in table1.find_all('tr'):
    all_td_tags = row.find_all('td')
    if len(all_td_tags) > 0:
        company = all_td_tags[1].text
        symbol = all_td_tags[2].text
        weight = all_td_tags[3].text
        price = all_td_tags[4].text
        chg = all_td_tags[5].text
        perChg = all_td_tags[6].text
        # print(company, '|', symbol, '|', weight, '|', price, '|', chg, '|', perChg)

df = pd.read_html(str(table1))[0]
print(df)

But my output looks like this:

       #                    Company Symbol    Weight    Price   Chg     % Chg
0      1                 Apple Inc.   AAPL  6.866056   176.34  0.06   (0.03%)
1      2      Microsoft Corporation   MSFT  6.279809   334.50 -0.19  (-0.06%)
2      3            Amazon.com Inc.   AMZN  3.729209  3418.46 -2.91  (-0.09%)
3      4      Alphabet Inc. Class A  GOOGL  2.208863  2938.00 -0.33  (-0.01%)
4      5                  Tesla Inc   TSLA  2.169114  1069.30  2.30   (0.22%)
..   ...                        ...    ...       ...      ...   ...       ...
500  501     Discovery Inc. Class A  DISCA  0.009951    24.25 -0.17  (-0.70%)
501  502  Under Armour Inc. Class A    UAA  0.009792    20.62  0.00   (0.00%)
502  503                   Gap Inc.    GPS  0.008945    17.28  0.00   (0.00%)
503  504  Under Armour Inc. Class C     UA  0.008667    17.55  0.00   (0.00%)
504  505   News Corporation Class B    NWS  0.006948    22.75  0.20   (0.89%)

How do I make the second column of numbers disappear?

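【Comments】:

  • Drop that column by index or by name. Also, the ... only means that rows are omitted from the display to save space. The data is still there in the variable.
  • Hi @QHarr, please forgive my ignorance, as I'm still very new to Python and pandas. I tried del df.column.name and df.column.name = None (and the same with index instead of column), but either nothing changed or I got an error message saying the attribute can't be deleted. What would you suggest?
  • df.drop(['#'], axis = 1, inplace = True)
  • You can use print(df.shape) to check the number of rows and columns.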
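The two suggestions above (dropping the "#" column and checking df.shape) can be tried on a small frame. This is only a sketch: the three-row frame below is an illustrative stand-in for the pd.read_html result, built from figures in the question, and `trimmed` is a name chosen here for the example.

```python
import pandas as pd

# Illustrative stand-in for the frame returned by pd.read_html in the question.
df = pd.DataFrame({
    "#": [1, 2, 3],
    "Company": ["Apple Inc.", "Microsoft Corporation", "Amazon.com Inc."],
    "Symbol": ["AAPL", "MSFT", "AMZN"],
    "Weight": [6.866056, 6.279809, 3.729209],
})

# shape is a (rows, columns) tuple, so it shows the full size
# even when the printed frame is truncated with "...".
print(df.shape)

# drop returns a new frame; the original is unchanged unless inplace=True is passed.
trimmed = df.drop(columns=["#"])
print(trimmed.shape)
print(list(trimmed.columns))
```

Assigning the result to a new name instead of using inplace=True keeps the original frame available for comparison.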

Tags: python pandas dataframe beautifulsoup


【Solution 1】:

Because you keep reassigning company, symbol, weight, etc. on every iteration, those variables only hold the values of the last row you parsed.
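If you would rather keep the manual loop, one option is to collect every row inside the loop and build the DataFrame once afterwards. The following is a minimal sketch under stated assumptions: `parsed_rows` is a hypothetical stand-in for the text of each row's td cells (so the example runs without a network request), and `records` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the text of each row's <td> cells:
# rank, company, symbol, weight, price, change, percent change.
parsed_rows = [
    ["1", "Apple Inc.", "AAPL", "6.866056", "176.34", "0.06", "(0.03%)"],
    ["2", "Microsoft Corporation", "MSFT", "6.279809", "334.50", "-0.19", "(-0.06%)"],
    ["3", "Amazon.com Inc.", "AMZN", "3.729209", "3,418.46", "-2.91", "(-0.09%)"],
]

records = []  # one dict per table row, appended inside the loop
for cells in parsed_rows:
    if len(cells) > 0:
        records.append({
            "Company": cells[1],
            "Symbol": cells[2],
            "Weight": cells[3],
            "Price": cells[4],
            "Change": cells[5],
            "Percent_Change": cells[6],
        })

# Build the frame once, after the loop, from all collected rows.
df = pd.DataFrame(records)
print(df)
```

In the original script, the same pattern would mean appending a dict inside the `if len(all_td_tags) > 0:` branch rather than constructing the frame from the loop's leftover variables.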

You can use pd.read_html instead. It returns a list of DataFrames, one for each <table> tag in the HTML snippet. What you found with soup.find is a single table, so it is element #0:

df = pd.read_html(str(table1))[0]

Output:

 #               Company Symbol   Weight   Price   Chg    % Chg
 1            Apple Inc.   AAPL 6.866056  176.34  0.06  (0.03%)
 2 Microsoft Corporation   MSFT 6.279809  334.50 -0.19 (-0.06%)
 3       Amazon.com Inc.   AMZN 3.729209 3418.46 -2.91 (-0.09%)
 4 Alphabet Inc. Class A  GOOGL 2.208863 2938.00 -0.33 (-0.01%)
 5             Tesla Inc   TSLA 2.169114 1069.30  2.30  (0.22%)
...

Trim and rename the frame as needed.
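As a rough illustration of trimming and renaming, the sketch below drops the rank column, renames two columns to match the names used in the question, and converts the percent strings to numbers. The two-row frame is an illustrative stand-in for the read_html result, and the target names and the `tidy` variable are assumptions made for this example.

```python
import pandas as pd

# Illustrative stand-in for the frame returned by pd.read_html.
df = pd.DataFrame({
    "#": [1, 2],
    "Company": ["Apple Inc.", "Microsoft Corporation"],
    "Symbol": ["AAPL", "MSFT"],
    "Weight": [6.866056, 6.279809],
    "Price": [176.34, 334.50],
    "Chg": [0.06, -0.19],
    "% Chg": ["(0.03%)", "(-0.06%)"],
})

# Trim the rank column and rename the abbreviated headers.
tidy = (
    df.drop(columns=["#"])
      .rename(columns={"Chg": "Change", "% Chg": "Percent_Change"})
)

# Strip the surrounding parentheses and percent sign, then convert to float.
tidy["Percent_Change"] = tidy["Percent_Change"].str.strip("()%").astype(float)

print(tidy)
```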

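【Comments】:

  • Hi, I made the necessary changes; you can see my newly edited output under the Update in the original post. How do I get rid of that second column of numbers? Also, how can I see all the entries rather than just the top and bottom 5? Finally, I was planning to use something like df.loc[df['symbol'] == 'AAPL'], in which case the output would be the row containing all of the Apple stock data. However, since I'm not using DataFrame, is there an alternative? Thanks!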
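On the last two points in that comment: each element of the list returned by pd.read_html is already a DataFrame, so .loc filtering works on it directly. The sketch below assumes the column labels shown in the question's output (note the capitalised "Symbol") and uses a small illustrative frame in place of the scraped table.

```python
import pandas as pd

# Illustrative stand-in for the frame returned by pd.read_html.
df = pd.DataFrame({
    "Company": ["Apple Inc.", "Microsoft Corporation", "Amazon.com Inc."],
    "Symbol": ["AAPL", "MSFT", "AMZN"],
    "Weight": [6.866056, 6.279809, 3.729209],
    "Price": [176.34, 334.50, 3418.46],
})

# Column labels are case-sensitive: the label here is "Symbol", not "symbol".
aapl = df.loc[df["Symbol"] == "AAPL"]
print(aapl)

# to_string renders every row, without the "..." truncation used by print(df).
print(df.to_string())

# Alternatively, lift the row limit only inside this block.
with pd.option_context("display.max_rows", None):
    print(df)
```

Using option_context rather than pd.set_option keeps the change local, so later prints fall back to the default truncated display.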