【Question title】: Extract regex with special character
【Posted】: 2021-12-05 04:55:05
【Question description】:

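I want to create a column in a pandas DataFrame df based on another column, ID. For IDs that contain the string SAT, I want to extract the numbers joined by the special character "-" and put the extracted data into a new column called new_col. If the ID does not contain the string SAT, the value should stay NaN.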

df looks like this:

    Date        ID                   Time
0   2007-01-10  SAT 1 HHSP           900
1   2007-01-10  DOUBLE 7 HHSP        900
2   2007-01-10  SAT GF-06-5CSBG.431  1000
3   2007-01-10  MA HYDRO HHSP        900
4   2007-01-10  2.233 HHSP           900
5   2007-01-10  SAT L2-15-3CSB1.252  1000
6   2007-01-10  SECTION 6 HHSP       900

Expected output:

    Date        ID                   Time     new_col
0   2007-01-10  SAT 1 HHSP           900      NaN
1   2007-01-10  DOUBLE 7 HHSP        900      NaN
2   2007-01-10  SAT GF-06-5CSBG.431  1000     06-5
3   2007-01-10  MA HYDRO HHSP        900      NaN
4   2007-01-10  2.233 HHSP           900      NaN
5   2007-01-10  SAT L2-15-3CSB1.252  1000     15-3  * Here 15-3 is extracted rather than 2-15, because L2 is not made up entirely of digits.
6   2007-01-10  SECTION 6 HHSP       900      NaN

【Discussion】:

    Tags: python regex pandas


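    【Solution 1】:

    Use Series.str.extract with a pattern that matches digits joined by - and preceded by - or whitespace, and apply it only to the rows that Series.str.contains selects for SAT: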

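    # Boolean mask of the rows whose ID contains 'SAT'
    m = df['ID'].str.contains('SAT')
    # Capture digits-digits preceded by '-' or whitespace; rows outside the mask stay NaN
    df['new_col'] = df.loc[m, 'ID'].str.extract(r'[-\s](\d+-\d+)', expand=False)
    print(df)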
             Date                   ID  Time new_col
    0  2007-01-10           SAT 1 HHSP   900     NaN
    1  2007-01-10        DOUBLE 7 HHSP   900     NaN
    2  2007-01-10  SAT GF-06-5CSBG.431  1000    06-5
    3  2007-01-10        MA HYDRO HHSP   900     NaN
    4  2007-01-10           2.233 HHSP   900     NaN
    5  2007-01-10  SAT L2-15-3CSB1.252  1000    15-3
    6  2007-01-10       SECTION 6 HHSP   900     NaN
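For anyone who wants to reproduce the result above, here is a minimal, self-contained sketch. The DataFrame is rebuilt by hand from the table in the question; the raw-string pattern and expand=False are small adjustments assumed here for cleanliness, not necessarily what the answer's author ran.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "Date": ["2007-01-10"] * 7,
    "ID": [
        "SAT 1 HHSP",
        "DOUBLE 7 HHSP",
        "SAT GF-06-5CSBG.431",
        "MA HYDRO HHSP",
        "2.233 HHSP",
        "SAT L2-15-3CSB1.252",
        "SECTION 6 HHSP",
    ],
    "Time": [900, 900, 1000, 900, 900, 1000, 900],
})

# Rows whose ID contains the substring 'SAT'.
m = df["ID"].str.contains("SAT")

# The raw string keeps \s and \d as regex escapes; expand=False returns a Series.
# Rows outside the mask are missing from the right-hand side, so index
# alignment during assignment leaves them as NaN.
df["new_col"] = df.loc[m, "ID"].str.extract(r"[-\s](\d+-\d+)", expand=False)

print(df)
```

Row 0 (SAT 1 HHSP) passes the mask but has no digits-digits group, so it also ends up as NaN.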
    

    If SAT always appears at the start of the value, you can use:

    # Anchor on a leading 'SAT', so no separate mask is needed
    df['new_col'] = df['ID'].str.extract(r'^SAT.*[-\s](\d+-\d+)', expand=False)
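The two approaches differ only when SAT appears somewhere other than the start of the ID. The sketch below illustrates this with two IDs from the question plus one made-up ID (NONSAT 12-34 HHSP, which is not in the original data); the variable names are likewise just for illustration.

```python
import pandas as pd

# Two IDs from the question plus one hypothetical ID (not in the original
# data) that contains 'SAT' without starting with it.
ids = pd.Series([
    "SAT GF-06-5CSBG.431",
    "DOUBLE 7 HHSP",
    "NONSAT 12-34 HHSP",  # hypothetical
])

# Anchored pattern: only IDs that begin with 'SAT' can match.
anchored = ids.str.extract(r"^SAT.*[-\s](\d+-\d+)", expand=False)

# Mask-based approach: any ID containing 'SAT' is eligible.
m = ids.str.contains("SAT")
masked = ids[m].str.extract(r"[-\s](\d+-\d+)", expand=False).reindex(ids.index)

print(pd.DataFrame({"ID": ids, "anchored": anchored, "masked": masked}))
```

Under these assumptions the anchored version leaves the hypothetical row as NaN, while the mask-based version extracts 12-34 from it, so the choice depends on whether SAT is guaranteed to be a prefix.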
    

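    【Comments】:

    • Hi jezrael, thanks for your answer. The ID SAT GF 06-3 CSB G-407 returned NaN, whereas I expected 06-3, since there are two numbers joined by the - character. Is it because there are two - characters?
    • @nilsinelabore - Then use df.loc[m, 'ID'].str.extract(r'[-\s](\d+-\d+)') - it tests for a space or - before the INT-INT digits.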
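The case raised in the comment can be checked with Python's re module alone. The sketch below is only an illustration: the hyphen-only pattern is a guess at a stricter variant that would explain the NaN and is not quoted from the thread.

```python
import re

# ID quoted in the comment: the digit pair follows a space, not a '-'.
commented_id = "SAT GF 06-3 CSB G-407"

# Pattern from the reply: '-' or whitespace may precede the digit pair.
relaxed = re.search(r"[-\s](\d+-\d+)", commented_id)

# Hypothetical stricter pattern (an assumption, not taken from the thread)
# that only accepts '-' before the digit pair.
hyphen_only = re.search(r"-(\d+-\d+)", commented_id)

print(relaxed.group(1))
print(hyphen_only)
```

The trailing G-407 never matches either pattern, because the digits after its hyphen are not followed by a second hyphen and digit group.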