Extract the first character of each word in a pandas column: answers

【Question title】: Extract first string of each word in pandas column
【Posted】: 2021-12-06 13:24:28
【Question description】:

I have a DataFrame like the following:

col1
GRE MET HOCK 38 
ASS COM CORD EMERG  INIT

I want to create a column containing the first character of each word in col1, while keeping integers whole, as shown below:

col1                        col2
GRE MET HOCK 38             GMH38
ASS COM CORD EMERG  INIT    ACCEI

I found an approach that runs, but it does not give the expected output:

import re

text = "GRE MET HOCK 38"
# Takes the first character of every token, so the result is "GMH3", not "GMH38"
output = "".join(item[0].upper() for item in re.findall(r"\w+", text))

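【Question comments】:

Tags: python regex pandas

【Solution 1】:

Split the strings on whitespace, then stack the pieces into one long Series. You can then take the first letter of each piece, except for pieces where the string isnumeric, which are kept whole. Finally, join the results per row and assign them back; the assignment aligns on the original DataFrame index.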

import pandas as pd
df = pd.DataFrame({'col1': ['GRE MET HOCK 38', 'ASS COM CORD EMERG  INIT']})

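# Split on whitespace and stack the pieces into one long Series
s = df['col1'].str.split(r'\s+', expand=True).stack()
# First letter of each piece, except numeric pieces, which are kept whole
df['col2'] = s.str[0].mask(s.str.isnumeric(), s).groupby(level=0).agg(''.join)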

                       col1   col2
0           GRE MET HOCK 38  GMH38
1  ASS COM CORD EMERG  INIT  ACCEI
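
To make the intermediate steps easier to inspect, here is a minimal sketch of the same idea broken into named stages. Note that it swaps `stack` for `Series.explode` on a plain `str.split()`, which is a substitution and not the answer's exact method. The variable names `words`, `is_number` and `pieces` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['GRE MET HOCK 38', 'ASS COM CORD EMERG  INIT']})

# One row per word; the original row label is repeated in the index
words = df['col1'].str.split().explode()

# True for words made up only of numeric characters, such as '38'
is_number = words.str.isnumeric()

# First character of each word, but numeric words are kept whole
pieces = words.str[0].mask(is_number, words)

# Join the pieces that share a row label and align back on the index
df['col2'] = pieces.groupby(level=0).agg(''.join)

print(df)
```

Calling `str.split()` without a pattern splits on runs of whitespace and ignores leading and trailing whitespace, so a value with a trailing space does not produce an empty piece.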

【Comments】:

【Solution 2】:

You can use Series.str.replace:

import pandas as pd
df = pd.DataFrame({'col':['GRE MET HOCK 38', 'ASS COM CORD EMERG  INIT']})
df['col'].str.replace(r'\b(?!\d+\b)(\w)\w*|\s+', lambda x: x.group(1).upper() if x.group(1) else '', regex=True)
# => 0    GMH38
#    1    ACCEI
#    Name: col, dtype: object

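See the regex demo. The parts of the regex that handle digits and word boundaries can be adjusted further, depending on your requirements.

The current pattern matches:

\b(?!\d+\b)(\w)\w* - a word boundary, then a word character (captured into Group 1), then zero or more word characters, provided the word as a whole is not a sequence of digits
| - or
\s+ - one or more whitespace characters.

If Group 1 matched, its uppercased value is the replacement; otherwise, the match is removed (the replacement is an empty string).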
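
The behavior of the two alternatives can be checked outside pandas with the standard `re` module. The following is a small sketch using the same pattern and the same replacement logic; the function name `abbreviate` and the third sample string are assumptions added for illustration.

```python
import re

# Same pattern as in the answer: a non-numeric word, or a run of whitespace
PATTERN = re.compile(r'\b(?!\d+\b)(\w)\w*|\s+')

def abbreviate(text):
    """Hypothetical helper mirroring the callable passed to str.replace."""
    def repl(match):
        first = match.group(1)  # None when the \s+ branch matched
        return first.upper() if first else ''
    return PATTERN.sub(repl, text)

print(abbreviate('GRE MET HOCK 38'))           # GMH38
print(abbreviate('ASS COM CORD EMERG  INIT'))  # ACCEI
# A mixed token such as '4b' is not all digits, so it is reduced to '4'
print(abbreviate('room 4b 12'))                # R412
```

An all-digit word such as `38` is never matched by either alternative, which is why it passes through unchanged.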

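【Comments】:

【Solution 3】:

You can iterate over the column's items, split each item on whitespace to get its words, build a new word from the first letter of each word, collect the new words in a list, and then add that list to the DataFrame as a new column.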

from pandas import DataFrame

data = {
    'col1' : ['GRE MET HOCK 38', 'ASS COM CORD EMERG INIT'],
}

new_column = []

df = DataFrame(data)

for item in df['col1']:
    new_item = ""
    # split the item into words on whitespace
    words = item.split()
    for word in words:
        # keep all-digit words whole, otherwise add only the first letter
        new_item += word if word.isdigit() else word[0]
    # add the new item to the new column
    new_column.append(new_item)


#add new column to DataFrame
df['col2'] = new_column
print(df)
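
The same word-by-word idea can also be packaged as a small function that keeps all-digit words whole, as the question asks. This is a sketch, not part of the original answer, and the name `abbreviate_words` is hypothetical.

```python
def abbreviate_words(text):
    """Join the first character of each word, keeping all-digit words whole."""
    return ''.join(
        word if word.isdigit() else word[0]
        for word in text.split()  # split() ignores repeated and trailing spaces
    )

print(abbreviate_words('GRE MET HOCK 38'))           # GMH38
print(abbreviate_words('ASS COM CORD EMERG  INIT'))  # ACCEI
```

With a DataFrame, such a function could be applied per value with `df['col1'].map(abbreviate_words)`, which avoids building the list by hand.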

【Comments】: