【Question title】: Extracting date time from a mixed letter and numeric column in pandas
【Posted】: 2026-01-04 00:25:02
【Question description】:

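I have a column in a pandas DataFrame that holds two kinds of information: (1) a date and time, and (2) a company name. I need to split it into two columns (date_time, full_company_name). I first tried splitting by character count (the first 19 characters into one column, the rest into the other), but then I realized the date is sometimes missing, so a fixed-width split may not work. I then tried a regular expression, but I can't seem to get the extraction right.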

Column:

Desired output:

【Question comments】:

    Tags: python regex pandas dataframe split


    【Solution 1】:

    If the dates are all correctly formatted, you may not need a regular expression at all:

    df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                             "2021-01-01 06:00:00Acme LLC"]})
    df["date"] = pd.to_datetime(df.A.str[:19])
    df["company"] = df.A.str[19:]
    df
    #                                     A                 date          company
    # 0  2021-01-01 05:00:00Acme Industries  2021-01-01 05:00:00  Acme Industries
    # 1         2021-01-01 06:00:00Acme LLC  2021-01-01 06:00:00         Acme LLC
    

    # Alternatively, extract both parts with a regex (a raw string avoids invalid-escape warnings)
    df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")
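
The regex call above returns two unnamed columns (0 and 1). As a minimal sketch, assuming the column names `date_time` and `full_company_name` from the question, named groups can label the result directly, and `pd.to_datetime` can then convert the extracted strings. The variable names `pattern` and `parts` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                         "2021-01-01 06:00:00Acme LLC"]})

# Named groups become the column names of the extracted DataFrame.
pattern = (r"(?P<date_time>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})"
           r"(?P<full_company_name>.*)")
parts = df["A"].str.extract(pattern)

# The extracted dates are still strings; convert them to datetime64.
parts["date_time"] = pd.to_datetime(parts["date_time"])
```

Because the date group is required here, a row that does not contain a timestamp would come back as NaN in both columns.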
    

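    【Comments】:

    • Thanks! The problem is that the dates may be missing in some later entries (I only shared a snippet of the column).
    【Solution 2】:

    Note: if you have the option to avoid concatenating these strings in the first place, do so. It is not a healthy habit.

    A solution (not that pretty):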

    import pandas as pd
    from datetime import datetime
    import re
    
    df = pd.DataFrame()
    # creating a list of companies
    companies = ['Google', 'Apple', 'Microsoft', 'Facebook', 'Amazon', 'IBM', 
                 'Oracle', 'Intel', 'Yahoo', 'Alphabet']
    # creating a list of sample datetime objects (one per company)
    dates = [datetime(year=2000 + i, month=1, day=1) for i in range(10)]
    # creating the column named 'date_time/full_company_name'
    df['date_time/full_company_name'] = [f'{d}{c}' for d, c in zip(dates, companies)]
    
    # Before:
    # date_time/full_company_name
    # 2000-01-01 00:00:00Google
    # 2001-01-01 00:00:00Apple
    # 2002-01-01 00:00:00Microsoft
    # 2003-01-01 00:00:00Facebook
    # 2004-01-01 00:00:00Amazon
    # 2005-01-01 00:00:00IBM
    # 2006-01-01 00:00:00Oracle
    # 2007-01-01 00:00:00Intel
    # 2008-01-01 00:00:00Yahoo
    # 2009-01-01 00:00:00Alphabet
    
    new_rows = []
    for row in df['date_time/full_company_name']:
        # match the date_time at the start of the row using regex
        date_time = re.match(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', row)
        # handle the case of a missing date_time
        date_time = date_time.group() if date_time else ''
        # the company name is whatever follows the date_time prefix
        company_name = row[len(date_time):]
        # create a new row with the extracted date_time and company_name
        new_rows.append([date_time, company_name])
    
    # drop the column 'date_time/full_company_name'
    df = df.drop(columns=['date_time/full_company_name'])
    # add the new columns to the dataframe: 'date_time' and 'company_name'
    df['date_time'] = [row[0] for row in new_rows]
    df['company_name'] = [row[1] for row in new_rows]
    
    # After:
    # date_time            company_name
    # 2000-01-01 00:00:00       Google
    # 2001-01-01 00:00:00       Apple
    # 2002-01-01 00:00:00       Microsoft
    # 2003-01-01 00:00:00       Facebook
    # 2004-01-01 00:00:00       Amazon
    # 2005-01-01 00:00:00       IBM
    # 2006-01-01 00:00:00       Oracle
    # 2007-01-01 00:00:00       Intel
    # 2008-01-01 00:00:00       Yahoo
    # 2009-01-01 00:00:00       Alphabet
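
The loop above can also be packaged as a small helper, which makes the missing-date case easy to check in isolation. This is only a sketch under assumptions: `split_row`, `DATE_RE`, and the sample row without a date are hypothetical names and data, not part of the original answer. It returns `None` rather than an empty string for a missing date so that `pd.to_datetime` produces `NaT` for those rows.

```python
import re
import pandas as pd

# Timestamp layout assumed from the sample data: YYYY-MM-DD HH:MM:SS
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def split_row(text):
    """Return (date_time, company_name); date_time is None when absent."""
    m = DATE_RE.match(text)          # match() only looks at the start of the string
    if m is None:
        return None, text
    return m.group(), text[m.end():]

raw = pd.Series(["2000-01-01 00:00:00Google",
                 "Apple",            # hypothetical row with the date missing
                 "2002-01-01 00:00:00Microsoft"])

result = pd.DataFrame([split_row(t) for t in raw],
                      columns=["date_time", "company_name"])
# Missing dates (None) become NaT after conversion.
result["date_time"] = pd.to_datetime(result["date_time"], errors="coerce")
```

Keeping the parsing in one function means the pattern can be adjusted in a single place if the real data turns out to use a different timestamp layout.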
    

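    【Comments】:

      【Solution 3】:

      Follow the date group with ?.* instead of (.*): the ? makes the date group optional, and the trailing .* is matched without being captured.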

      df = pd.DataFrame({"A": ["2021-01-01 05:00:00Acme Industries",
                               "2021-01-01 06:00:00Acme LLC"]})
      
      df.A.str.extract(r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})?.*")
      

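      【Comments】:

      • Thanks, but how does this differ from df.A.str.extract("(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(.*)")?
      • In ?.*, the ? makes the preceding date group optional, and .* is not inside a capturing group, so the remainder is matched but not returned as a group result (for example by re.findall).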
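
To make the difference asked about in the comments concrete, here is a small comparison sketch. The second sample row (a company name with no date) is a hypothetical stand-in for the missing-date entries mentioned in the question, and the variant that keeps `(.*)` as a second capturing group is an assumption about what the asker ultimately wants (both columns), not something the answer itself proposes.

```python
import pandas as pd

s = pd.Series(["2021-01-01 05:00:00Acme Industries",
               "Acme LLC"])  # hypothetical row with the date missing

date_re = r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}"

# Date group required: a row without a date does not match at all.
required = s.str.extract(rf"({date_re})(.*)")

# Date group optional: the row still matches and the name is captured.
optional = s.str.extract(rf"({date_re})?(.*)")
```

With the required group, the second row yields NaN in both columns; with the optional group, only the date column is NaN and the company name is preserved.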