Remove the datetime string from each row in a Dataframe: answers

【Question title】: Remove the datetime string from each row in a Dataframe
【Posted】: 2021-05-30 13:05:29
【Question description】:

I have a dataframe that looks like this:

   Film      Description       
0  Batman    Viewed in 2021-10-04T14:30:31Z City Hall, London
1  Superman  Aired 2012-01-04R11:01:10Z in the USA first
2  Hulk      2010-07-04S07:22:02Z Still being produced

I want to remove the datetime from each row of the "Description" column, like this:

    Film      Description      
0   Batman    Viewed in City Hall, London
1   Superman  Aired in the USA first
2   Hulk      Still being produced

I have already tried this string regex:

df['Description'] = df['Description '].str.replace(r'\^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z', '')

【Question comments】:

Tags: python regex pandas dataframe data-cleaning

【Solution 1】:

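\^ matches a literal caret character, not the start of the string.

Besides T, I see R and S as the date/time separator in your timestamps, so they must be added to the pattern.

Use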

\s*\b\d{4}-\d{2}-\d{2}[TRS]\d{2}:\d{2}:\d{2}Z\b

See the proof.

Explanation

--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  \d{4}                    digits (0-9) (4 times)
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  [TRS]                    any character of: 'T', 'R', 'S'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  :                        ':'
--------------------------------------------------------------------------------
  \d{2}                    digits (0-9) (2 times)
--------------------------------------------------------------------------------
  Z                        'Z'
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
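For reference, here is a minimal sketch of how the pattern above could be applied with pandas. The dataframe is rebuilt from the rows in the question; passing regex=True explicitly and adding a final .str.strip() (to drop the leading space left on the Hulk row) are my own additions, not part of the answer.

```python
import pandas as pd

# Rebuild the dataframe from the question.
df = pd.DataFrame({
    "Film": ["Batman", "Superman", "Hulk"],
    "Description": [
        "Viewed in 2021-10-04T14:30:31Z City Hall, London",
        "Aired 2012-01-04R11:01:10Z in the USA first",
        "2010-07-04S07:22:02Z Still being produced",
    ],
})

# Pattern from Solution 1: optional leading whitespace, then the timestamp.
pattern = r"\s*\b\d{4}-\d{2}-\d{2}[TRS]\d{2}:\d{2}:\d{2}Z\b"

df["Description"] = (
    df["Description"]
    .str.replace(pattern, "", regex=True)  # regex=True: treat the pattern as a regex
    .str.strip()                           # assumption: trim leftover edge whitespace
)
print(df)
```

Because \s* consumes the whitespace in front of the timestamp, the rows where the timestamp sits mid-sentence keep a single space between the remaining words.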

【Comments】:

Does this remove the datetime regardless of where it is in the string (i.e. beginning, middle, end, etc.)?
@user341383 Yes.

【Solution 2】:

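I haven't reproduced your dataframe, but your regex will not work with the caret: ^ locks the match to the start of the string. You also have a literal "T" in the pattern, which will match only one of the descriptions.

Try: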

(\d{4}-\d{2}-\d{2}[TSR]\d{2}:\d{2}:\d{2})Z
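To make the difference between the variants concrete, here is a small sketch using only the standard re module that counts how many of the three descriptions each pattern matches. The dictionary keys are labels I made up for illustration; the sample strings come from the question.

```python
import re

descriptions = [
    "Viewed in 2021-10-04T14:30:31Z City Hall, London",
    "Aired 2012-01-04R11:01:10Z in the USA first",
    "2010-07-04S07:22:02Z Still being produced",
]

patterns = {
    # The attempt from the question: \^ requires a literal caret character.
    "escaped_caret": r"\^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z",
    # Unescaped caret: the timestamp must sit at the very start of the string.
    "anchored": r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z",
    # No caret, but only 'T' accepted as the separator.
    "t_only": r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z",
    # Pattern from this answer: 'T', 'S' or 'R' as the separator.
    "t_s_r": r"(\d{4}-\d{2}-\d{2}[TSR]\d{2}:\d{2}:\d{2})Z",
}

# Count the descriptions in which each pattern finds a match.
match_counts = {
    name: sum(1 for text in descriptions if re.search(pattern, text))
    for name, pattern in patterns.items()
}
print(match_counts)
# {'escaped_caret': 0, 'anchored': 0, 't_only': 1, 't_s_r': 3}
```

Note that this pattern does not consume the whitespace around the timestamp, so substituting it with an empty string leaves a double space (or a leading space) behind.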

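【Comments】:

【Solution 3】:

Use str.replace to remove:

any run of non-whitespace characters immediately before a :, any run of non-whitespace characters immediately after a :, and the : itself.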

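# regex=True is required on pandas 2.0+, where str.replace defaults to a literal match
df['Description'] = df['Description'].str.replace(r'\S+(?=[:])|(?<=[:])\S+|[:]', '', regex=True)
print(df)
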
       Film             Description
0    Batman  Viewed in  City Hall, London
1  Superman       Aired  in the USA first
2      Hulk          Still being produced
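Two things are worth keeping in mind with this approach: the output above still contains doubled spaces where the timestamp used to be, and the pattern keys on colons rather than on the date format, so it removes any whitespace-delimited token containing a colon. A rough sketch of both points follows; the fourth row ("Runs 2:06 long") is a hypothetical example that is not in the question, and the whitespace clean-up step is my own addition.

```python
import pandas as pd

descriptions = pd.Series([
    "Viewed in 2021-10-04T14:30:31Z City Hall, London",
    "Aired 2012-01-04R11:01:10Z in the USA first",
    "2010-07-04S07:22:02Z Still being produced",
    "Runs 2:06 long",  # hypothetical row: a colon that is not part of a timestamp
])

# Pattern from Solution 3, written as a raw string.
colon_pattern = r"\S+(?=[:])|(?<=[:])\S+|[:]"

cleaned = (
    descriptions
    .str.replace(colon_pattern, "", regex=True)  # drop every colon-bearing token
    .str.replace(r"\s+", " ", regex=True)        # collapse the doubled spaces left behind
    .str.strip()                                 # trim leading/trailing whitespace
)
print(cleaned.tolist())
```

In the hypothetical row, "2:06" is removed as well, so this pattern is only safe when timestamps are the only colon-containing tokens in the column.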

【Comments】: