【Question title】: Pandas: Accessing multiple columns under different top-level column indexes in a MultiIndex-column DataFrame
【Posted】: 2021-12-01 21:23:01
【Question】:

I can't work out how to index the headers of a table that I want to scrape and write to a CSV file. I need the ResidualMaturity and Last columns, but I can only select by the main (top-level) header, not by the sub-header.

import pandas as pd
import requests

url = 'http://www.worldgovernmentbonds.com/country/japan/'
r = requests.get(url)
df_list = pd.read_html(r.text, flavor='html5lib')
df = df_list[4]
yc = df[["ResidualMaturity", "Yield"]]
print(yc)

Current output

     ResidualMaturity    Yield                   
   ResidualMaturity     Last    Chg 1M   Chg 6M
0           1 month  -0.114%   +9.0 bp  +7.4 bp
1          3 months  -0.109%    0.0 bp  -1.9 bp
2          6 months  -0.119%   -0.3 bp  -1.9 bp
3          9 months  -0.119%  +10.0 bp  +9.9 bp
4            1 year  -0.125%   -0.7 bp  +0.9 bp
5           2 years  -0.121%   +0.9 bp  +1.3 bp
6           3 years  -0.113%   +2.2 bp  +2.7 bp
7           4 years  -0.094%   +2.6 bp  +2.1 bp
8           5 years  -0.082%   +2.3 bp  +1.8 bp
9           6 years  -0.056%   +3.4 bp  +0.4 bp
10          7 years  -0.029%   +5.1 bp  -0.4 bp
11          8 years   0.007%   +5.6 bp  -0.7 bp
12          9 years   0.052%   +5.6 bp  -1.3 bp
13         10 years   0.087%   +4.7 bp  -1.2 bp
14         15 years   0.288%   +4.3 bp  -2.4 bp
15         20 years   0.460%   +3.7 bp  -1.5 bp
16         30 years   0.689%   +3.5 bp  +1.6 bp
17         40 years   0.757%   +3.5 bp  +7.3 bp

Desired output

 ResidualMaturity     Last    
    0           1 month  -0.114%   
    1          3 months  -0.109%    
    2          6 months  -0.119%   
    3          9 months  -0.119%  
    4            1 year  -0.125%   
    5           2 years  -0.121%   
    6           3 years  -0.113%   
    7           4 years  -0.094%   
    8           5 years  -0.082%   
    9           6 years  -0.056%   
    10          7 years  -0.029%   
    11          8 years   0.007%   
    12          9 years   0.052%   
    13         10 years   0.087%  
    14         15 years   0.288%   
    15         20 years   0.460%   
    16         30 years   0.689%   
    17         40 years   0.757%   

I tried df[('Yield', 'Last')], but that returns only that one column, not both.

【Comments】:

    Tags: python python-3.x pandas csv web-scraping


    【Solution 1】:

    Here is the code I used and the output I got:

    import pandas as pd
    import requests
    
    url = 'http://www.worldgovernmentbonds.com/country/japan/'
    r = requests.get(url)
    df_list = pd.read_html(r.text, flavor='html5lib')
    df = df_list[4]
    yc = df[df.columns[1:3]].droplevel(0, axis=1)
    print(yc)
    

    Output:

       ResidualMaturity     Last
    0           1 month  -0.110%
    1          3 months  -0.109%
    2          6 months  -0.119%
    3          9 months  -0.115%
    4            1 year  -0.125%
    5           2 years  -0.120%
    6           3 years  -0.113%
    7           4 years  -0.094%
    8           5 years  -0.084%
    9           6 years  -0.057%
    10          7 years  -0.031%
    11          8 years   0.005%
    12          9 years   0.050%
    13         10 years   0.086%
    14         15 years   0.287%
    15         20 years   0.461%
    16         30 years   0.689%
    17         40 years   0.757%
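
The positional slice `df.columns[1:3]` depends on the column order of the scraped table. As a hedged alternative, the same two columns can be selected by label with a list of tuples: a single tuple such as `('Yield', 'Last')` is one column key, whereas a list of tuples selects several columns at once. The sketch below assumes the column layout shown in the question and uses a small hand-built frame (`df`, with two rows copied from the question's output) instead of the live page:

```python
import pandas as pd

# Stand-in for the scraped table: two-level column headers as in the question.
columns = pd.MultiIndex.from_tuples([
    ("ResidualMaturity", "ResidualMaturity"),
    ("Yield", "Last"),
    ("Yield", "Chg 1M"),
    ("Yield", "Chg 6M"),
])
df = pd.DataFrame(
    [
        ["1 month", "-0.114%", "+9.0 bp", "+7.4 bp"],
        ["3 months", "-0.109%", "0.0 bp", "-1.9 bp"],
    ],
    columns=columns,
)

# Each tuple is a full (level 0, level 1) column key; the list picks both columns.
wanted = [("ResidualMaturity", "ResidualMaturity"), ("Yield", "Last")]

# Drop the top level so only the sub-headers remain.
yc = df[wanted].droplevel(0, axis=1)
print(yc)
```

This keeps working if the page later adds or reorders columns, as long as the two header labels stay the same.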
    

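    【Comments】:

    • Thank you! This solution works too. It finally displays correctly. Cheers
    • Thank you for asking such a good question
    【Solution 2】:

    Use pd.IndexSlice together with .loc: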

    idx = pd.IndexSlice
    yc.loc[:, idx[:, ['ResidualMaturity', 'Last']]]
    

    Alternatively, use .loc with axis=1, as follows:

    idx = pd.IndexSlice
    yc.loc(axis=1)[idx[:, ['ResidualMaturity', 'Last']]]
    

    pd.IndexSlice lets us specify the level-1 column labels without having to specify the level-0 column labels.

    Result:

       ResidualMaturity    Yield
       ResidualMaturity     Last
    0           1 month  -0.110%
    1          3 months  -0.109%
    2          6 months  -0.119%
    3          9 months  -0.115%
    4            1 year  -0.125%
    5           2 years  -0.120%
    6           3 years  -0.113%
    7           4 years  -0.094%
    8           5 years  -0.084%
    9           6 years  -0.057%
    10          7 years  -0.031%
    11          8 years   0.005%
    12          9 years   0.050%
    13         10 years   0.086%
    14         15 years   0.287%
    15         20 years   0.461%
    16         30 years   0.689%
    17         40 years   0.757%
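
For readers who want to try this without hitting the website, here is a minimal self-contained sketch of the same `pd.IndexSlice` selection. The frame `df` is a hand-built stand-in (an assumption, with two rows copied from the question) that mimics the two-level headers of the scraped table:

```python
import pandas as pd

# Stand-in for the scraped table: two-level column headers as in the question.
columns = pd.MultiIndex.from_tuples([
    ("ResidualMaturity", "ResidualMaturity"),
    ("Yield", "Last"),
    ("Yield", "Chg 1M"),
    ("Yield", "Chg 6M"),
])
df = pd.DataFrame(
    [
        ["1 month", "-0.114%", "+9.0 bp", "+7.4 bp"],
        ["3 months", "-0.109%", "0.0 bp", "-1.9 bp"],
    ],
    columns=columns,
)

idx = pd.IndexSlice

# The first ":" selects all rows. Inside idx[...], ":" means "any level-0 label"
# and the list names the level-1 labels to keep.
selected = df.loc[:, idx[:, ["ResidualMaturity", "Last"]]]

# Both header levels are still present in the result.
print(selected.columns.tolist())
```

Note that the level-1 filter applies under every top-level header, so a sub-header name that occurs under several top-level headers would be selected each time it appears.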
    

    If you don't want to show the level-0 column index:

    idx = pd.IndexSlice
    yc.loc(axis=1)[idx[:, ['ResidualMaturity', 'Last']]].droplevel(0, axis=1)
    

    Result:

       ResidualMaturity     Last
    0           1 month  -0.110%
    1          3 months  -0.109%
    2          6 months  -0.119%
    3          9 months  -0.115%
    4            1 year  -0.125%
    5           2 years  -0.120%
    6           3 years  -0.113%
    7           4 years  -0.094%
    8           5 years  -0.084%
    9           6 years  -0.057%
    10          7 years  -0.031%
    11          8 years   0.005%
    12          9 years   0.050%
    13         10 years   0.086%
    14         15 years   0.287%
    15         20 years   0.461%
    16         30 years   0.689%
    17         40 years   0.757%
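
Since the question's goal is a CSV file, the flattened result can then be passed to `DataFrame.to_csv`. The following is only a sketch under assumptions: `df` is a hand-built stand-in for the scraped table, and the file name `yield_curve.csv` in the comment is hypothetical:

```python
import pandas as pd

# Stand-in for the scraped table: two-level column headers as in the question.
columns = pd.MultiIndex.from_tuples([
    ("ResidualMaturity", "ResidualMaturity"),
    ("Yield", "Last"),
    ("Yield", "Chg 1M"),
    ("Yield", "Chg 6M"),
])
df = pd.DataFrame(
    [
        ["1 month", "-0.114%", "+9.0 bp", "+7.4 bp"],
        ["3 months", "-0.109%", "0.0 bp", "-1.9 bp"],
    ],
    columns=columns,
)

idx = pd.IndexSlice

# Select by level-1 label, then drop level 0 so the CSV gets a single header row.
flat = df.loc[:, idx[:, ["ResidualMaturity", "Last"]]].droplevel(0, axis=1)

# With no path given, to_csv returns the CSV text; index=False omits the row numbers.
csv_text = flat.to_csv(index=False)
print(csv_text)

# To write a file instead (hypothetical file name):
# flat.to_csv("yield_curve.csv", index=False)
```

Dropping level 0 before exporting avoids a second header row in the CSV, which many CSV readers would otherwise treat as data.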
    

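    【Comments】:

    • Thanks, this works! It took me a while, but I finally figured it out
    • @TerenceChew Yes, this approach isn't widely known, so it's normal that it took some time to figure out.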