【Question Title】: How to extract the first 8 characters from a string in pandas
【Posted】: 2019-01-07 11:55:48
【Question】:

I have a column in a DataFrame and I am trying to extract the first 8 digits from each string. How can I do that?

Input

Shipment ID
20180504-S-20000
20180514-S-20537
20180514-S-20541
20180514-S-20644
20180514-S-20644
20180516-S-20009
20180516-S-20009
20180516-S-20009
20180516-S-20009

Expected output

Order_Date
20180504
20180514
20180514
20180514
20180514
20180516
20180516
20180516
20180516

I tried the code below, but it did not work.

data['Order_Date'] = data['Shipment ID'][:8]

【Comments】:

    Tags: python-3.x pandas substring


    【Solution 1】:

    You can also remove everything from -S to the end of the string:

    df["Order_Date"]=df['Shipment ID'].replace(regex=r"\-.*",value="")
    df
            Shipment ID Order_Date
    0  20180504-S-20000   20180504
    1  20180514-S-20537   20180514
    2  20180514-S-20541   20180514
    3  20180514-S-20644   20180514
    4  20180514-S-20644   20180514
    5  20180516-S-20009   20180516
    6  20180516-S-20009   20180516
    7  20180516-S-20009   20180516
    8  20180516-S-20009   20180516
    

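    You can also capture the first 8 digits, match the rest of the string, and replace the whole match with a backreference to the capture group: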

    df['Shipment ID'].replace(regex=r"(\d{8}).*",value="\\1")
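The two regex approaches above can be sketched side by side on a two-row sample. The sample data and the names `by_strip` and `by_group` are assumptions for illustration. `Series.str.replace(..., regex=True)` is used here as an alternative spelling of the capture-group replacement; it is not the call the answer itself uses.

```python
import pandas as pd

# Hypothetical two-row sample in the question's format
df = pd.DataFrame({"Shipment ID": ["20180504-S-20000", "20180514-S-20537"]})

# Approach 1: drop everything from the first hyphen to the end
by_strip = df["Shipment ID"].replace(regex=r"-.*", value="")

# Approach 2: keep only the captured 8 digits via a backreference
by_group = df["Shipment ID"].str.replace(r"^(\d{8}).*$", r"\1", regex=True)

print(by_strip.tolist())
print(by_group.tolist())
```

Both variants leave a value unchanged when the pattern does not match, so malformed IDs pass through silently rather than raising an error.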
    

    【Discussion】:

      【Solution 2】:

      You can also use str.extract.

      For example:

      import pandas as pd
      
      df = pd.DataFrame({'Shipment ID': ['20180504-S-20000', '20180514-S-20537', '20180514-S-20541', '20180514-S-20644', '20180514-S-20644', '20180516-S-20009', '20180516-S-20009', '20180516-S-20009', '20180516-S-20009']})
      df["Order_Date"] = df["Shipment ID"].str.extract(r"(\d{8})")
      print(df)
      

      Output:

             Shipment ID Order_Date
      0  20180504-S-20000   20180504
      1  20180514-S-20537   20180514
      2  20180514-S-20541   20180514
      3  20180514-S-20644   20180514
      4  20180514-S-20644   20180514
      5  20180516-S-20009   20180516
      6  20180516-S-20009   20180516
      7  20180516-S-20009   20180516
      8  20180516-S-20009   20180516
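A minimal sketch of how `str.extract` behaves when a row does not match, assuming a made-up malformed ID (`"bad-id"`) and an anchored pattern. Passing `expand=False` returns a Series instead of a one-column DataFrame. The final conversion with `pd.to_datetime` is an optional extra that the answer does not include; it assumes the 8 digits are a `YYYYMMDD` date, as the column name Order_Date suggests.

```python
import pandas as pd

# Hypothetical sample: the middle row does not start with 8 digits
df = pd.DataFrame({"Shipment ID": ["20180504-S-20000", "bad-id", "20180516-S-20009"]})

# expand=False yields a Series; rows without a match become NaN
order_date = df["Shipment ID"].str.extract(r"^(\d{8})", expand=False)
df["Order_Date"] = order_date

# Optional: parse the digits as a date (assumes YYYYMMDD); NaN becomes NaT
df["Order_Date_parsed"] = pd.to_datetime(order_date, format="%Y%m%d")

print(df)
```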
      

      【Discussion】:

        【Solution 3】:

        You were close. You need to index through the str accessor, which applies the slice to each value of the Series:

        data['Order_Date'] = data['Shipment ID'].str[:8]
        

        If the column contains no NaN values, a list comprehension can give better performance:

        data['Order_Date'] = [x[:8] for x in data['Shipment ID']]
        

        print(data)
                Shipment ID Order_Date
        0  20180504-S-20000   20180504
        1  20180514-S-20537   20180514
        2  20180514-S-20541   20180514
        3  20180514-S-20644   20180514
        4  20180514-S-20644   20180514
        5  20180516-S-20009   20180516
        6  20180516-S-20009   20180516
        7  20180516-S-20009   20180516
        8  20180516-S-20009   20180516
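The NaN caveat can be illustrated with a short sketch, assuming a made-up frame with one missing value. A plain `x[:8]` raises a `TypeError` on NaN because a float cannot be sliced, whereas `.str[:8]` propagates the missing value. The `isinstance` guard below is one possible workaround, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a missing Shipment ID
data = pd.DataFrame({"Shipment ID": ["20180504-S-20000", np.nan, "20180516-S-20009"]})

# The .str accessor skips missing values and leaves them as NaN
via_str = data["Shipment ID"].str[:8]

# A guarded list comprehension slices strings and passes anything else through
via_list = [x[:8] if isinstance(x, str) else x for x in data["Shipment ID"]]

print(via_str.tolist())
print(via_list)
```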
        

        If you omit str, the code slices the column by position and returns the first N rows, for example:

        print(data['Shipment ID'][:2])
        0    20180504-S-20000
        1    20180514-S-20537
        Name: Shipment ID, dtype: object
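To make the failure in the question concrete, here is a small sketch using the nine Shipment IDs from the question. The names `rows` and `fixed` are chosen for illustration. Without `str`, `[:8]` selects the first 8 rows, and assigning that result back aligns on the index, so the ninth row ends up as NaN and no string is shortened.

```python
import pandas as pd

data = pd.DataFrame({"Shipment ID": [
    "20180504-S-20000", "20180514-S-20537", "20180514-S-20541",
    "20180514-S-20644", "20180514-S-20644", "20180516-S-20009",
    "20180516-S-20009", "20180516-S-20009", "20180516-S-20009",
]})

# Row slice: the first 8 rows, each string still complete
rows = data["Shipment ID"][:8]

# Assignment aligns on the index, so the last row has no value
data["Order_Date"] = rows

# Character slice through the str accessor: first 8 characters of every row
fixed = data["Shipment ID"].str[:8]

print(data)
print(fixed)
```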
        

        【Discussion】:
