【Question title】: Convert datetime pandas
【Posted】: 2020-05-16 22:08:41
【Question】:

Below is a sample of my df:

date                   value

0006-03-01 00:00:00    1   
0006-03-15 00:00:00    2   
0006-05-15 00:00:00    1   
0006-07-01 00:00:00    3   
0006-11-01 00:00:00    1   
2009-05-20 00:00:00    2   
2009-05-25 00:00:00    8   
2020-06-24 00:00:00    1   
2020-06-30 00:00:00    2   
2020-07-01 00:00:00    13  
2020-07-15 00:00:00    2   
2020-08-01 00:00:00    4   
2020-10-01 00:00:00    2   
2020-11-01 00:00:00    4    
2023-04-01 00:00:00    1   
2218-11-12 10:00:27    1   
4000-01-01 00:00:00    6 
5492-04-15 00:00:00    1    
5496-03-15 00:00:00    1    
5589-12-01 00:00:00    1    
7199-05-15 00:00:00    1    
9186-12-30 00:00:00    1  

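As you can see, the data contains some mistyped dates.

Questions:

  • How can I convert this column to dd.mm.yyyy format?
  • How can we replace the date with 01.01.2100 in rows where the year is greater than 2022?
  • How can we drop all rows where the year is less than 2005?

The final output should look like this: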

date                   value


20.05.2009    2   
25.05.2009     8   
26.04.2020     1   
30.06.2020     2   
01.07.2020     13  
15.07.2020     2   
01.08.2020    4   
01.10.2020    2   
01.11.2020    4    
01.01.2100    1   
01.01.2100    1      
01.01.2100    1   
01.01.2100    1   
01.01.2100    1   
01.01.2100    1      
01.01.2100    1   
01.01.2100    1   

I tried to convert the column with to_datetime, but it failed:

df[col] = pd.to_datetime(df[col], infer_datetime_format=True)

Out of bounds nanosecond timestamp: 5-03-01 00:00:00

Any help is appreciated!

【Comments】:

Tags: python pandas datetime


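【Solution 1】:

You can split the datetime string on '-', check its first element, and clean or replace the value based on that element's integer value. For small values such as '0006', calling pd.to_datetime with errors='coerce' is enough: it leaves 'NaT' for invalid dates, which you can then remove with dropna(). Example: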

import pandas as pd

df = pd.DataFrame({'date': ['0006-03-01 00:00:00',
                            '0006-03-15 00:00:00',
                            '0006-05-15 00:00:00',
                            '0006-07-01 00:00:00',
                            '0006-11-01 00:00:00',
                            'nan',
                            '2009-05-25 00:00:00',
                            '2020-06-24 00:00:00',
                            '2020-06-30 00:00:00',
                            '2020-07-01 00:00:00',
                            '2020-07-15 00:00:00',
                            '2020-08-01 00:00:00',
                            '2020-10-01 00:00:00',
                            '2020-11-01 00:00:00',
                            '2023-04-01 00:00:00',
                            '2218-11-12 10:00:27',
                            '4000-01-01 00:00:00',
                            'NaN',
                            '5496-03-15 00:00:00',
                            '5589-12-01 00:00:00',
                            '7199-05-15 00:00:00',
                            '9186-12-30 00:00:00']})

# first, drop rows where 'date' contains 'nan' (case-insensitive);
# .copy() avoids a SettingWithCopyWarning on the assignment below:
df = df.loc[~df['date'].str.contains('nan', case=False)].copy()

# now replace strings where the year is above a threshold:
df.loc[df['date'].str.split('-').str[0].astype(int) > 2022, 'date'] = '2100-01-01 00:00:00'

# convert to datetime, if year is too low, will result in NaT:
df['date'] = pd.to_datetime(df['date'], errors='coerce')
# df['date']
# 0           NaT
# 1           NaT
# 2           NaT
# 3           NaT
# 4           NaT
# 6    2009-05-25
# 7    2020-06-24
# ...

df = df.dropna()
# df
#          date
# 6  2009-05-25
# 7  2020-06-24
# 8  2020-06-30
# 9  2020-07-01
# 10 2020-07-15
# 11 2020-08-01
# 12 2020-10-01
# 13 2020-11-01
# 14 2100-01-01
# 15 2100-01-01
# ...
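
The example above stops at a datetime column, while the question also asks for dd.mm.yyyy output. A minimal sketch of the remaining step, assuming a small made-up sample and using Series.dt.strftime; the year filter is written out explicitly instead of relying only on NaT, since whether a year like 0006 is coerced to NaT may depend on the pandas version and timestamp resolution:

```python
import pandas as pd

# Small sample mirroring the question's data (strings, some out of range).
df = pd.DataFrame({
    "date": ["0006-03-01 00:00:00", "2009-05-20 00:00:00",
             "2020-06-24 00:00:00", "4000-01-01 00:00:00"],
    "value": [1, 2, 1, 6],
})

# Cap years above 2022 at 2100-01-01, as in the answer above.
year = df["date"].str.split("-").str[0].astype(int)
df.loc[year > 2022, "date"] = "2100-01-01 00:00:00"

# Years that cannot be represented are coerced to NaT instead of raising.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Drop NaT rows and anything before 2005 (NaT compares as False here),
# then render the remaining dates as dd.mm.yyyy strings.
df = df[df["date"].dt.year >= 2005].copy()
df["date"] = df["date"].dt.strftime("%d.%m.%Y")
print(df)
```

Note that after strftime the column holds plain strings, so any date arithmetic or sorting by date should happen before this last step.
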

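【Comments】:

  • Thanks @MrFuppes! It does the job. I would also like to know how I can adapt your code if the DF contains NaN (without dropping them), because applying your code gives me the following error: invalid literal for int() with base 10: 'nan'
  • @Annis15 do you mean that the 'date' column initially contains the string 'nan'?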
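
One possible way to handle the NaN case raised in the comments, as a minimal sketch rather than the answerer's own method: extract the year with pd.to_numeric(..., errors='coerce') instead of astype(int), so missing entries become NaN rather than raising. The sample data and the `year` helper variable are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real missing value among the date strings.
df = pd.DataFrame({"date": ["0006-03-01 00:00:00", np.nan,
                            "2020-06-24 00:00:00", "4000-01-01 00:00:00"]})

# Extract the year as a number; missing or non-numeric entries become NaN
# instead of raising "invalid literal for int()".
year = pd.to_numeric(df["date"].str.split("-").str[0], errors="coerce")

# Comparisons against NaN are False, so missing rows are left untouched.
df.loc[year > 2022, "date"] = "2100-01-01 00:00:00"

# Missing values are converted to NaT.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# Keep rows whose year was missing; drop only rows with a known year < 2005.
df = df[year.isna() | (year >= 2005)]
print(df)
```

Filtering on the numeric `year` rather than calling dropna() is what lets the originally missing rows survive while the mistyped early years are still removed.
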
【Solution 2】:

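The out-of-bounds error is raised because of a pandas limitation on the range of timestamps it can represent (https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html). This code removes the values that would cause the error before the DataFrame is created.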

import datetime as dt

import pandas as pd

data = [[dt.datetime(year=2022, month=3, day=1), 1],
        [dt.datetime(year=2009, month=5, day=20), 2],
        [dt.datetime(year=2001, month=5, day=20), 2],
        [dt.datetime(year=2023, month=12, day=30), 3],
        [dt.datetime(year=6, month=12, day=30), 3]]
dataCleaned = [elements for elements in data if pd.Timestamp.max > elements[0] > pd.Timestamp.min]

df = pd.DataFrame(dataCleaned, columns=['date', 'Value'])
print(df)
# OUTPUT
#         date  Value
# 0 2022-03-01      1
# 1 2009-05-20      2
# 2 2001-05-20      2
# 3 2023-12-30      3

df.loc[df.date.dt.year > 2022, 'date'] = dt.datetime(year=2100, month=1, day=1)
df.drop(df.loc[df.date.dt.year < 2005, 'date'].index, inplace=True)
print(df)
# OUTPUT
#         date  Value
# 0 2022-03-01      1
# 1 2009-05-20      2
# 3 2100-01-01      3

If you still want to keep the dates that raise the out-of-bounds error, have a look at How to work around Python Pandas DataFrame's "Out of bounds nanosecond timestamp" error?
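
As one possible illustration of keeping such dates around long enough to clean them (a sketch under my own assumptions, not necessarily what the linked question recommends): parse the strings with the standard library's datetime.strptime, which accepts years 1 to 9999, and store them in an object-dtype column so pandas does not try to convert them to nanosecond timestamps. The sample values and the `to_ddmmyyyy` helper are hypothetical.

```python
from datetime import datetime

import pandas as pd

raw = ["0006-03-01 00:00:00", "2009-05-20 00:00:00", "9186-12-30 00:00:00"]

# datetime.strptime handles years 1..9999 (zero-padded to four digits),
# so none of these values is out of bounds at this stage.
parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in raw]

# dtype=object keeps the plain datetime objects as they are.
df = pd.DataFrame({"date": pd.Series(parsed, dtype=object),
                   "value": [1, 2, 1]})

# Drop rows with a year before 2005.
years = df["date"].map(lambda d: d.year)
df = df[years >= 2005].copy()


def to_ddmmyyyy(d):
    """Cap years above 2022 at 2100-01-01 and format as dd.mm.yyyy."""
    capped = datetime(2100, 1, 1) if d.year > 2022 else d
    return capped.strftime("%d.%m.%Y")


df["date"] = df["date"].map(to_ddmmyyyy)
print(df)
```

The trade-off is that an object-dtype column gives up the vectorized .dt accessor, so this is mainly useful as a cleaning step before the final conversion or formatting.
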

【Comments】:

    【Solution 3】:

    I suggest the following:

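    import numpy as np
    import pandas as pd

    df = pd.DataFrame.from_dict({'date': ['0003-03-01 00:00:00',
                                          '7199-05-15 00:00:00',
                                          '2020-10-21 00:00:00'],
                                 'value': [1, 2, 3]})

    # Years are zero-padded to four digits, so comparing them as strings
    # gives the same ordering as comparing them as numbers.
    df['date'] = [d[8:10] + '.' + d[5:7] + '.' + d[:4] if '2004' < d[:4] < '2023'
                  else '01.01.2100' if d[:4] > '2022' else np.nan for d in df['date']]

    # Rows with a year below 2005 were set to NaN above and are dropped here.
    df.dropna(inplace=True)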
    

    This produces the desired output:

    date        value
    01.01.2100  2
    21.10.2020  3
    

    【Comments】:
