【Question title】: Remove the datetime string from each row in a DataFrame
【Posted】: 2021-05-30 13:05:29
【Question】:

I have a DataFrame that looks like this:

   Film      Description       
0  Batman    Viewed in 2021-10-04T14:30:31Z City Hall, London
1  Superman  Aired 2012-01-04R11:01:10Z in the USA first
2  Hulk      2010-07-04S07:22:02Z Still being produced

I want to remove the datetime from every row of the "Description" column, like this:

    Film      Description      
0   Batman    Viewed in City Hall, London
1   Superman  Aired in the USA first
2   Hulk      Still being produced

I have already tried this regular expression:

df['Description'] = df['Description '].str.replace(r'\^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z', '')

【Comments】:

    Tags: python regex pandas dataframe data-cleaning


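    【Solution 1】:

    \^ matches a literal caret character.

    Besides T, I also see R and S as the separator in your timestamps, so they must be added to the pattern.

    Use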

    \s*\b\d{4}-\d{2}-\d{2}[TRS]\d{2}:\d{2}:\d{2}Z\b
    

    proof

    Explanation

    --------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                               more times (matching the most amount
                               possible))
    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      \d{4}                    digits (0-9) (4 times)
    --------------------------------------------------------------------------------
      -                        '-'
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      -                        '-'
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      [TRS]                    any character of: 'T', 'R', 'S'
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      :                        ':'
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      :                        ':'
    --------------------------------------------------------------------------------
      \d{2}                    digits (0-9) (2 times)
    --------------------------------------------------------------------------------
      Z                        'Z'
    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
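
As a minimal sketch of how the pattern above could be applied to the DataFrame from the question: the sample data is copied from the question, while the variable name `TIMESTAMP` and the trailing `.str.strip()` are additions made here for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Film": ["Batman", "Superman", "Hulk"],
    "Description": [
        "Viewed in 2021-10-04T14:30:31Z City Hall, London",
        "Aired 2012-01-04R11:01:10Z in the USA first",
        "2010-07-04S07:22:02Z Still being produced",
    ],
})

# Pattern from this answer: optional leading whitespace, then the timestamp.
TIMESTAMP = r"\s*\b\d{4}-\d{2}-\d{2}[TRS]\d{2}:\d{2}:\d{2}Z\b"

# regex=True is passed explicitly because the default differs across pandas versions.
df["Description"] = (
    df["Description"]
    .str.replace(TIMESTAMP, "", regex=True)
    .str.strip()  # a timestamp at the very start leaves a leading space behind
)
print(df)
```

Because `\s*` sits in front of the timestamp, the space before it is removed together with the match, so mid-string timestamps do not leave a double space.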
    

    【Comments】:

    • Will this remove the datetime regardless of where it appears in the string (i.e. beginning, middle, end, etc.)?
    • @user341383 Yes.
    【Solution 2】:

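    I haven't reproduced your DataFrame, but your regex won't work with the caret: ^ locks the match to the start of the string (and escaped as \^, as in your pattern, it matches a literal caret character instead). You also hard-coded a "T" in it, so it would only match one of the descriptions.

    Try: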

    (\d{4}-\d{2}-\d{2}[TSR]\d{2}:\d{2}:\d{2})Z
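
The difference between the patterns can be checked with the standard `re` module alone. The following is a rough sketch, assuming the three descriptions from the question; the helper `count_matching` and the variable names are hypothetical and only serve the illustration.

```python
import re

# Descriptions copied from the question.
rows = [
    "Viewed in 2021-10-04T14:30:31Z City Hall, London",
    "Aired 2012-01-04R11:01:10Z in the USA first",
    "2010-07-04S07:22:02Z Still being produced",
]

# The asker's pattern: \^ requires a literal caret in the text.
asker = r"\^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z"
# Same pattern without the caret, still limited to the 'T' separator.
t_only = r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z"
# Pattern suggested in this answer: separator may be T, S or R.
fixed = r"(\d{4}-\d{2}-\d{2}[TSR]\d{2}:\d{2}:\d{2})Z"


def count_matching(pattern, texts):
    """Return how many of the texts contain at least one match."""
    return sum(1 for text in texts if re.search(pattern, text))


for name, pattern in [("asker", asker), ("t_only", t_only), ("fixed", fixed)]:
    print(name, count_matching(pattern, rows))

# This pattern does not consume surrounding whitespace,
# so a mid-string timestamp leaves two spaces behind.
cleaned_first = re.sub(fixed, "", rows[0])
print(repr(cleaned_first))
```

If the leftover double space matters, the whitespace can be consumed as part of the match (for example with a leading `\s*`) or collapsed in a separate step.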

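    【Comments】:

      【Solution 3】:

      Use str.replace to remove the following:

      any run of non-whitespace characters before a :, any run of non-whitespace characters after a :, or the : itself.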

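      # regex=True is needed on pandas 2.0+, where str.replace treats the pattern literally by default
      df['Description'] = df['Description'].str.replace(r'\S+(?=[:])|(?<=[:])\S+|[:]', '', regex=True)
      print(df)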
      
      
      
             Film             Description
      0    Batman  Viewed in  City Hall, London
      1  Superman       Aired  in the USA first
      2      Hulk          Still being produced
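
Two properties of this colon-based pattern are worth checking before relying on it: it leaves extra spaces behind, and it removes any token that touches a colon, not only timestamps. The sketch below uses the standard `re` module; the string `"Premiere: London"` and the helper `squeeze_spaces` are made-up illustrations, not part of the original answer.

```python
import re

# Pattern from this answer: text before a colon, text after a colon, or the colon itself.
COLON_TOKENS = r"\S+(?=[:])|(?<=[:])\S+|[:]"


def squeeze_spaces(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Description copied from the question.
raw = "Viewed in 2021-10-04T14:30:31Z City Hall, London"
removed = re.sub(COLON_TOKENS, "", raw)   # keeps the spaces on both sides of the timestamp
tidy = squeeze_spaces(removed)
print(repr(removed))
print(repr(tidy))

# Hypothetical description without any timestamp: the word before the colon is lost too.
side_effect = re.sub(COLON_TOKENS, "", "Premiere: London")
print(repr(side_effect))
```

If descriptions may contain colons outside of timestamps, a pattern that spells out the timestamp format explicitly is the safer choice.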
      

      【Comments】:
