String operation on pandas df: answers

【Question title】: string operation on pandas df
【Posted】: 2017-09-16 03:00:30
【Question description】:

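I have a pandas df with 11 columns. I need to modify the first 3 columns using a regular expression, add a new column built from the modified columns, and use it for a downstream join. I need to keep the elements of these columns as they are and combine them into a unique string, like this: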

column1 column2 column3 column4 ...column 11

I need to build new_col = column1:column2-column3(column4)

to produce this new column:

column1 column2 column3 newcol column4 ...column 11

I can do this with a simple one-liner in plain Python, but I don't know what the pandas syntax is:

l = cols[0] + ":" + cols[1] + "-" + cols[2] + "(" + cols[5] + ")"

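【Question discussion】:

If cols[0], cols[1], cols[2] and cols[5] are strings, your example code will work as is. If not, you need to convert them to strings before combining them. In standard Python code you can do this with str(cols[0]). With pandas columns you can do it with cols[0].astype(str).
Agreed, but I still don't know how to add a new column to an existing df.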

Tags: python string python-2.7 pandas

【Solution 1】:

As long as all the columns contain strings, you should be able to do this with the same syntax you posted.

You can also use the Series.str.cat method.

df['new_col'] = cols[0].str.cat(':' + cols[1] + '-' + cols[2] + '(' + cols[5]+ ')')
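
As a minimal, self-contained sketch of the Series.str.cat approach: the frame below is made up for illustration, with column names borrowed from the comment thread (chrom, start, end, strand), and it assumes every column already holds strings.

```python
import pandas as pd

# Hypothetical data; every column already holds strings.
df = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": ["100", "300"],
    "end": ["200", "400"],
    "strand": ["+", "-"],
})

# Build the right-hand part by plain concatenation, then append it
# element-wise to `chrom` with Series.str.cat.
suffix = ":" + df["start"] + "-" + df["end"] + "(" + df["strand"] + ")"
df["new_col"] = df["chrom"].str.cat(suffix)

print(df["new_col"].tolist())
# ['chr1:100-200(+)', 'chr2:300-400(-)']
```

Assigning to df["new_col"] is what adds the column to the existing frame.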

【Discussion】:

df1['unique_col'] = df1['chrom'].str.cat(':' + df1['start'] + '-' + df1['end'] + '(' + df1['strand'] + ')') gives me AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
@sbradbio As I said, "as long as all the columns contain strings". If they don't, you will need to convert them to strings with .astype(str), as piRSquared did.
Got it! I need more coffee, I missed the strings part. Thanks.
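
The exchange above can be reproduced with a small sketch. The data is hypothetical and assumes start and end are stored as integers. Calling .str on such a column raises the AttributeError, while converting with astype(str) first avoids it.

```python
import pandas as pd

# Hypothetical data: `start` and `end` are integers, not strings.
df1 = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [100, 300],
    "end": [200, 400],
    "strand": ["+", "-"],
})

# The .str accessor only exists for string-like columns, so using it
# on an integer column raises AttributeError.
try:
    df1["start"].str.cat(df1["strand"])
    raised = False
except AttributeError:
    raised = True
print(raised)  # True

# Converting the non-string columns with astype(str) first avoids the error.
df1["unique_col"] = df1["chrom"].str.cat(
    ":" + df1["start"].astype(str) + "-" + df1["end"].astype(str)
    + "(" + df1["strand"] + ")"
)
print(df1["unique_col"].tolist())
# ['chr1:100-200(+)', 'chr2:300-400(-)']
```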

【Solution 2】:

Consider the dataframe df

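import string
import numpy as np
import pandas as pd

# assumption: `a` is undefined in the answer; uppercase letters fit the output shown
a = list(string.ascii_uppercase)
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice(a, (5, 10))).add_prefix('col ')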

print(df)

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9
0     Q     L     C     K     P     X     N     L     N     T
1     I     X     A     W     Y     M     W     A     C     A
2     U     Z     H     T     N     S     M     E     D     T
3     N     W     H     X     N     U     F     D     X     F
4     Z     L     Y     H     M     G     E     H     W     S

Construct a custom format function

f = lambda row: '{col 1}:{col 2}-{col 3}({col 4})'.format(**row)

and apply it to df

df.astype(str).apply(f, 1)

0    L:C-K(P)
1    X:A-W(Y)
2    Z:H-T(N)
3    W:H-X(N)
4    L:Y-H(M)
dtype: object

Use assign to add a new column

df.assign(New=df.astype(str).apply(f, 1))
# assign in place with
# df['New'] = df.astype(str).apply(f, 1)

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9       New
0     Q     L     C     K     P     X     N     L     N     T  L:C-K(P)
1     I     X     A     W     Y     M     W     A     C     A  X:A-W(Y)
2     U     Z     H     T     N     S     M     E     D     T  Z:H-T(N)
3     N     W     H     X     N     U     F     D     X     F  W:H-X(N)
4     Z     L     Y     H     M     G     E     H     W     S  L:Y-H(M)
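
To illustrate the difference between assign and in-place assignment, here is a small sketch on a made-up two-column frame (the data is hypothetical). assign returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical two-row frame with string columns.
df = pd.DataFrame({"col 1": ["L", "X"], "col 2": ["C", "A"]})

# assign returns a new DataFrame; `df` itself is not modified.
out = df.assign(New=df["col 1"] + ":" + df["col 2"])

print(list(df.columns))    # ['col 1', 'col 2']
print(list(out.columns))   # ['col 1', 'col 2', 'New']
print(out["New"].tolist()) # ['L:C', 'X:A']
```

If the new column should live in the original frame, either rebind the result (df = df.assign(...)) or use df['New'] = ... as in the commented line above.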

Or you can wrap it in another function that operates on pd.Series. This requires you to pass the columns in the correct order.

def u(a, b, c, d):
    return a + ':' + b + '-' + c + '(' + d + ')'

df.assign(New=u(df['col 1'], df['col 2'], df['col 3'], df['col 4']))
# assign in place with
# df['New'] = u(df['col 1'], df['col 2'], df['col 3'], df['col 4'])

  col 0 col 1 col 2 col 3 col 4 col 5 col 6 col 7 col 8 col 9       New
0     Q     L     C     K     P     X     N     L     N     T  L:C-K(P)
1     I     X     A     W     Y     M     W     A     C     A  X:A-W(Y)
2     U     Z     H     T     N     S     M     E     D     T  Z:H-T(N)
3     N     W     H     X     N     U     F     D     X     F  W:H-X(N)
4     Z     L     Y     H     M     G     E     H     W     S  L:Y-H(M)

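【Discussion】:

It's not clear what a in the second line of the first code block is supposed to be.
@piRSquared Thanks a lot! Could you explain what you did in the second code block (the lambda) and with assign?
I use assign because it creates a copy of the dataframe, and I usually don't want to clobber your dataframe by overwriting it. However, you will often see answers that assign to a new column in the same dataframe. That's perfectly fine. It's just not what I usually do.
As for the second code block... honestly, it's the same as what @Grr did, except that I wrapped it in a function to make it more readable. By operating on whole Series, we avoid the inherent loop that apply performs.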
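
The two approaches discussed here can be compared side by side. This is a sketch on a hypothetical two-row frame that reuses the names f and u from the answer. The row-wise apply calls the format function once per row, whereas the Series version concatenates whole columns at once.

```python
import pandas as pd

# Hypothetical frame using the same column naming as the answer.
df = pd.DataFrame({
    "col 1": ["L", "X"],
    "col 2": ["C", "A"],
    "col 3": ["K", "W"],
    "col 4": ["P", "Y"],
})

# Row-wise: the format string is filled in once per row.
f = lambda row: "{col 1}:{col 2}-{col 3}({col 4})".format(**row)
by_row = df.astype(str).apply(f, axis=1)

# Column-wise: each + operates on entire Series.
def u(a, b, c, d):
    return a + ":" + b + "-" + c + "(" + d + ")"

by_col = u(df["col 1"], df["col 2"], df["col 3"], df["col 4"])

print(by_row.tolist())  # ['L:C-K(P)', 'X:A-W(Y)']
print(by_row.tolist() == by_col.tolist())  # True
```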

【Solution 3】:

Based on a recently deleted answer, this works fine:

df1 = pd.DataFrame({
    'chrom': ['a', 'b', 'c'], 
    'start': ['d', 'e', 'f'], 
    'end': ['g', 'h', 'i'], 
    'strand': ['j', 'k', 'l']}
)
df1['unique_col'] = df1.chrom + ':' + df1.start + '-' + df1.end + '(' + df1.strand + ')'

It sounds like your original dataframe may not contain strings. If it contains numbers, you need something like this:

df1 = pd.DataFrame({
    'chrom': [1.0, 2.0], 
    'start': [3.0, 4.0], 
    'end': [5.0, 6.0], 
    'strand': [7.0, 8.0]}
)
df1['unique_col'] = (
    df1.chrom.astype(str) + ':' 
    + df1.start.astype(str) + '-' + df1.end.astype(str)
    + '(' + df1.strand.astype(str) + ')'
)
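
The question also asks for the new column to sit right after the third column. One way to sketch that is with DataFrame.insert, shown here on hypothetical data with integer coordinates. The score column is only a stand-in for the remaining columns.

```python
import pandas as pd

# Hypothetical frame; `score` stands in for the remaining columns.
df1 = pd.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [100, 300],
    "end": [200, 400],
    "strand": ["+", "-"],
    "score": [0.5, 0.9],
})

# Convert the numeric columns to strings before concatenating.
key = (
    df1["chrom"] + ":"
    + df1["start"].astype(str) + "-" + df1["end"].astype(str)
    + "(" + df1["strand"] + ")"
)

# insert modifies df1 in place; position 3 puts the column right after `end`.
df1.insert(3, "unique_col", key)

print(list(df1.columns))
# ['chrom', 'start', 'end', 'unique_col', 'strand', 'score']
print(df1["unique_col"].tolist())
# ['chr1:100-200(+)', 'chr2:300-400(-)']
```

Note that float columns converted with astype(str) keep their decimal part (3.0 becomes '3.0'), so casting to int first may be preferable when the values are really whole numbers.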
