【Question title】: How to remove unwanted data from a data column using pandas DataFrame
【Posted】: 2022-01-17 20:59:27
【Question description】:

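In my scraped data, each cell of the Date column contains the date twice, in two comma-separated formats. My goal is to remove the "December 13, 2021Mon, " part and to create a separate new column for the day names. I also want to remove the last column, i.e. the Volume column.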

Script

import requests
import pandas as pd

isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']

dfs = []
for isin in isins:
    html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
    dfs.extend(pd.read_html(html))

df = pd.concat(dfs) 
print(df)

Expected output

        Day     Date        Open   High    Low  Close 
      Monday  Dec 13, 2021  77.77  77.77  77.77  77.77 
      Friday  Dec 10, 2021  77.61  77.61  77.61  77.61  
    Thursday  Dec 09, 2021  77.60  77.60  77.60  77.60  
   Wednesday  Dec 08, 2021  77.47  77.47  77.47  77.47  
     Tuesday  Dec 07, 2021  77.64  77.64  77.64  77.64  

Current output

                                       Date   Open   High    Low  Close Volume
 Monday, December 13, 2021Mon, Dec 13, 2021  77.77  77.77  77.77  77.77  00.00
  Friday, December 10, 2021Fri, Dec 10, 2021  77.61  77.61  77.61  77.61   ----
Thursday, December 09, 2021Thu, Dec 09, 2021  77.60  77.60  77.60  77.60   ----
Wednesday, December 08, 2021Wed, Dec 08, 2021  77.47  77.47  77.47  77.47   ----
 Tuesday, December 07, 2021Tue, Dec 07, 2021  77.64  77.64  77.64  77.64   ----

Thanks in advance!

【Question comments】:

    Tags: python pandas dataframe web-scraping


    【Solution 1】:

    I added the necessary steps to your code:

    import requests
    import pandas as pd
    
    isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
    
    dfs = []
    for isin in isins:
        html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
        dfs.extend(pd.read_html(html))
    
    df = pd.concat(dfs)
    
    # get the Day column
    df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
    # reformat Date to the desired format
    df['Date'] = df['Date'].apply(lambda d: d[-12:])
    # remove the Volume column
    df.pop('Volume')
    
    print(df)
    

    After these three operations, df looks like this:

              Day          Date   Open   High    Low  Close
    0      Monday  Dec 13, 2021  77.77  77.77  77.77  77.77
    1      Friday  Dec 10, 2021  77.61  77.61  77.61  77.61
    2    Thursday  Dec 09, 2021  77.60  77.60  77.60  77.60
    3   Wednesday  Dec 08, 2021  77.47  77.47  77.47  77.47
    4     Tuesday  Dec 07, 2021  77.64  77.64  77.64  77.64
    5      Monday  Dec 06, 2021  77.70  77.70  77.70  77.70
    6      Friday  Dec 03, 2021  77.72  77.72  77.72  77.72
    ...
    

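    【Comments】:

    • But I am getting `raise KeyError(key) from err KeyError: 'Date'`
    • Please adjust the full code.
    • Edited accordingly; I hope it helps.
    • Could you explain what d[:d.find(',')] and d[-12:] mean?
    • d[:d.find(',')]: the slice starts at the beginning of the string and stops at d.find(','), which is the index of the first ',' in the string. d[-12:]: the slice starts 12 characters from the end and runs to the end of the string.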
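
To make the two slices concrete, here is a minimal sketch that applies them to a single cell copied from the question's current output. The variable names are only illustrative and are not part of the answer's code.

```python
# One raw cell as it appears in the scraped Date column.
d = "Monday, December 13, 2021Mon, Dec 13, 2021"

first_comma = d.find(",")   # index of the first comma in the string
day = d[:first_comma]       # everything before that comma
short_date = d[-12:]        # the last 12 characters of the string

print(first_comma)  # 6
print(day)          # Monday
print(short_date)   # Dec 13, 2021
```

The second slice works here because "Dec 13, 2021" is exactly 12 characters long.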
    【Solution 2】:

    I would use a regular expression to split here. You can then combine the pieces and do whatever further parsing you like afterwards:

    import requests
    import pandas as pd
    
    
    isins=['LU0526609390:EUR','IE00BHBX0Z19:EUR']
    
    dfs = []
    for isin in isins:
        html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
        dfs.extend(pd.read_html(html))
    
    df = pd.concat(dfs) 
    print(df)
        
    
    
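    # Split at the year followed by the abbreviated weekday (e.g. '2021Mon'); keep pieces 0 and 1
    df[['Date_alpha', 'Date_beta']] = df['Date'].str.split(r'(\d{4})(\w{1,3})', expand=True)[[0, 1]]
    # Rejoin the two pieces into the long date, then drop the helper columns
    df['Date'] = df['Date_alpha'] + df['Date_beta']
    df = df.drop(['Date_alpha', 'Date_beta'], axis=1)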
    

    Output:

    print(df)
                                Date   Open   High    Low  Close Volume
    0      Monday, December 13, 2021  77.77  77.77  77.77  77.77   ----
    1      Friday, December 10, 2021  77.61  77.61  77.61  77.61   ----
    2    Thursday, December 09, 2021  77.60  77.60  77.60  77.60   ----
    3   Wednesday, December 08, 2021  77.47  77.47  77.47  77.47   ----
    4     Tuesday, December 07, 2021  77.64  77.64  77.64  77.64   ----
    5      Monday, December 06, 2021  77.70  77.70  77.70  77.70   ----
    6      Friday, December 03, 2021  77.72  77.72  77.72  77.72   ----
    7    Thursday, December 02, 2021  77.56  77.56  77.56  77.56   ----
    8   Wednesday, December 01, 2021  77.51  77.51  77.51  77.51   ----
    9     Tuesday, November 30, 2021  77.52  77.52  77.52  77.52   ----
    10     Monday, November 29, 2021  77.37  77.37  77.37  77.37   ----
    11     Friday, November 26, 2021  77.44  77.44  77.44  77.44   ----
    12   Thursday, November 25, 2021  77.11  77.11  77.11  77.11   ----
    13  Wednesday, November 24, 2021  77.10  77.10  77.10  77.10   ----
    14    Tuesday, November 23, 2021  77.02  77.02  77.02  77.02   ----
    15     Monday, November 22, 2021  77.32  77.32  77.32  77.32   ----
    16     Friday, November 19, 2021  77.52  77.52  77.52  77.52   ----
    17   Thursday, November 18, 2021  77.38  77.38  77.38  77.38   ----
    18  Wednesday, November 17, 2021  77.26  77.26  77.26  77.26   ----
    19    Tuesday, November 16, 2021  77.24  77.24  77.24  77.24   ----
    20     Monday, November 15, 2021  77.30  77.30  77.30  77.30   ----
    0      Monday, December 13, 2021  11.09  11.09  11.09  11.09   ----
    1      Friday, December 10, 2021  11.08  11.08  11.08  11.08   ----
    2    Thursday, December 09, 2021  11.08  11.08  11.08  11.08   ----
    3   Wednesday, December 08, 2021  11.06  11.06  11.06  11.06   ----
    4     Tuesday, December 07, 2021  11.08  11.08  11.08  11.08   ----
    5      Monday, December 06, 2021  11.09  11.09  11.09  11.09   ----
    6      Friday, December 03, 2021  11.08  11.08  11.08  11.08   ----
    7    Thursday, December 02, 2021  11.08  11.08  11.08  11.08   ----
    8   Wednesday, December 01, 2021  11.05  11.05  11.05  11.05   ----
    9     Tuesday, November 30, 2021  11.07  11.07  11.07  11.07   ----
    10     Monday, November 29, 2021  11.07  11.07  11.07  11.07   ----
    11     Friday, November 26, 2021  11.08  11.08  11.08  11.08   ----
    12   Thursday, November 25, 2021  11.04  11.04  11.04  11.04   ----
    13  Wednesday, November 24, 2021  11.03  11.03  11.03  11.03   ----
    14    Tuesday, November 23, 2021  11.04  11.04  11.04  11.04   ----
    15     Monday, November 22, 2021  11.07  11.07  11.07  11.07   ----
    16     Friday, November 19, 2021  11.09  11.09  11.09  11.09   ----
    17   Thursday, November 18, 2021  11.06  11.06  11.06  11.06   ----
    18  Wednesday, November 17, 2021  11.05  11.05  11.05  11.05   ----
    19    Tuesday, November 16, 2021  11.05  11.05  11.05  11.05   ----
    20     Monday, November 15, 2021  11.05  11.05  11.05  11.05   ----
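
The output above still has the Volume column and no Day column. As a rough sketch only, the same split could be taken further to reach the layout the question asks for. The rows below are typed in by hand to mimic the scraped table (nothing is fetched from the site), and the names `sample` and `parts` are assumptions made for this example.

```python
import pandas as pd

# Two hand-written rows mimicking the scraped table (not fetched from the site).
sample = pd.DataFrame({
    "Date": [
        "Monday, December 13, 2021Mon, Dec 13, 2021",
        "Friday, December 10, 2021Fri, Dec 10, 2021",
    ],
    "Close": [77.77, 77.61],
    "Volume": ["----", "----"],
})

# Capture groups in the pattern are kept as their own pieces, so each row
# splits into four columns: long date without the year, the year,
# the abbreviated weekday, and the remainder.
parts = sample["Date"].str.split(r"(\d{4})(\w{1,3})", expand=True)

# Piece 0 starts with the weekday name; piece 3 looks like ", Dec 13, 2021".
sample.insert(0, "Day", parts[0].str.split(",").str[0])
sample["Date"] = parts[3].str.lstrip(", ")
sample = sample.drop(columns="Volume")

print(sample)
```

Unlike the regex answer as posted, this variant keeps the short date rather than the long one, because that is the format shown in the expected output.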
    

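    【Comments】:

    • Could you explain which part Date_alpha and Date_beta discard, and what expand=True)[[0,1]] and axis=1 mean?
    • This splits the column at the '2021Mon' part. Because the pattern has two capture groups and expand=True is set, each row produces 4 items, for example: ['Monday, December 06, ', '2021', 'Mon', ', Dec 06, 2021']. I put the columns at index 0 and 1 into "dummy" columns: 'Date_alpha' is 'Monday, December 06, ' and 'Date_beta' is '2021'. I can then concatenate them to get the full date. After concatenating, I drop the two dummy columns.
    • @Mohna, I would be careful with the other solution, because it assumes every month is 3 characters long. If you get dates with months such as 'June' or 'July', the slice is hard-coded to the last 12 characters and would return 'une 13, 2021' or 'uly 13, 2021'.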
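
One way to avoid depending on a fixed month length is to extract both parts with a single pattern. This is only a sketch and is not taken from either answer. The pattern and the names `dates` and `extracted` are assumptions, and the "June" row is a hypothetical cell added to show a four-letter month.

```python
import pandas as pd

# Hand-written cells; the "June" row is hypothetical and only illustrates
# a month name that is not three characters long.
dates = pd.Series([
    "Monday, December 13, 2021Mon, Dec 13, 2021",
    "Sunday, June 13, 2021Sun, June 13, 2021",
])

# Named groups: Day is the weekday before the first comma; Date is whatever
# follows the year + three-letter weekday + ", " in the middle of the cell.
pattern = r"^(?P<Day>\w+),.*\d{4}\w{3}, (?P<Date>.+)$"
extracted = dates.str.extract(pattern)

print(extracted)
```

The result is a DataFrame with a Day and a Date column, which could then be assigned back to the original frame. The pattern still assumes the abbreviated weekday is always three letters.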