This reproduces the input shown in your screenshot:
    import numpy as np
    import pandas as pd

    # Keys are listed in the column order shown in the output below
    df = pd.DataFrame({
        "tags": ["col_t1, col_red, large", np.nan, "col_t2, col_black, small",
                 "col_t4, large, col_yellow", "col_gold, col_fancy,",
                 "col_t1, thick, col_k", np.nan, "col_t3, fancy, red"],
        "theme": ["t2", "t1", "t2", "t3", "t2", "t1", np.nan, "t3"],
        "type": ["a", "c", "d", "a", "b", "a", "a", "c"],
    })
    # Use a 1-based index
    df.index = np.arange(1, len(df) + 1)
    print(df)
Output:
                            tags  theme  type
    1     col_t1, col_red, large     t2     a
    2                        NaN     t1     c
    3   col_t2, col_black, small     t2     d
    4  col_t4, large, col_yellow     t3     a
    5       col_gold, col_fancy,     t2     b
    6       col_t1, thick, col_k     t1     a
    7                        NaN    NaN     a
    8         col_t3, fancy, red     t3     c
Code that produces the desired output:
    prefix = "col_"
    # Iterate over rows with a non-empty theme
    for row in df[df["theme"].notnull()].itertuples():
        if pd.isnull(row.tags):
            # Replace NaN in the tags column with a single tag built from the theme
            df.loc[row.Index, "tags"] = prefix + row.theme
        else:
            # Extract existing prefixed tags, stripping whitespace and the prefix
            inferred_tags = [t.strip()[len(prefix):] for t in row.tags.split(",")
                             if t.strip().startswith(prefix)]
            if row.theme not in inferred_tags:
                df.loc[row.Index, "tags"] = row.tags.rstrip(" ,") + ", " + prefix + row.theme
    print(df)
Output:
                                    tags  theme  type
    1     col_t1, col_red, large, col_t2     t2     a
    2                             col_t1     t1     c
    3           col_t2, col_black, small     t2     d
    4  col_t4, large, col_yellow, col_t3     t3     a
    5        col_gold, col_fancy, col_t2     t2     b
    6               col_t1, thick, col_k     t1     a
    7                                NaN    NaN     a
    8                 col_t3, fancy, red     t3     c
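Hope this is what you are looking for. itertuples() is reportedly faster than iterrows() for iterating over all rows. Also, keep in mind that I use numpy, and np.nan in particular, only to simulate the NaN values in your input; if your data comes from a CSV file, you do not need numpy.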
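As a minimal sketch of the CSV remark: pandas.read_csv parses empty fields as NaN by default, so the missing values show up without importing numpy. The inline CSV text and the name df_csv below are made up for illustration and only mirror the first three rows of the sample data.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file; the empty
# tags field in the second data row is parsed as NaN by read_csv.
csv_text = '''type,tags,theme
a,"col_t1, col_red, large",t2
c,,t1
d,"col_t2, col_black, small",t2
'''

df_csv = pd.read_csv(io.StringIO(csv_text))
df_csv.index = range(1, len(df_csv) + 1)  # 1-based index, as in the example

# True marks the rows whose tags value is missing
print(df_csv["tags"].isnull().tolist())
```

With a real file, the io.StringIO wrapper would simply be replaced by the file path.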
--- UPDATE ---
As noted in the comments, the code should replace the tag that matches a theme. Here is the updated solution:
    prefix = "col_"
    # Find all unique themes (notnull() excludes NaN from the list)
    themes = df[df["theme"].notnull()]["theme"].unique()
    # Add the prefix to all themes for comparison with tags; convert to a set
    prefixed_themes = set(prefix + t for t in themes)
    # Iterate over rows with a non-empty theme
    for row in df[df["theme"].notnull()].itertuples():
        if pd.isnull(row.tags):
            # Replace NaN in the tags column with a single tag built from the theme
            df.loc[row.Index, "tags"] = prefix + row.theme
        else:
            # Split into tags (keep the prefix; strip whitespace, drop empty entries)
            inferred_tags = [t.strip() for t in row.tags.split(",") if t.strip()]
            # Use sets to check for any intersection between tags and themes
            if set(inferred_tags) & prefixed_themes:
                # Replace every tag that matches a known theme with this row's theme
                for idx, t in enumerate(inferred_tags):
                    if t in prefixed_themes:
                        inferred_tags[idx] = prefix + row.theme
                df.loc[row.Index, "tags"] = ", ".join(inferred_tags)
            else:
                # No theme tag present: append the theme to the tags (no replacement)
                df.loc[row.Index, "tags"] = row.tags.rstrip(" ,") + ", " + prefix + row.theme
    print(df)
Output:
                                    tags  theme  type
    1             col_t2, col_red, large     t2     a
    2                             col_t1     t1     c
    3           col_t2, col_black, small     t2     d
    4  col_t4, large, col_yellow, col_t3     t3     a
    5        col_gold, col_fancy, col_t2     t2     b
    6               col_t1, thick, col_k     t1     a
    7                                NaN    NaN     a
    8                 col_t3, fancy, red     t3     c
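Note that the code checks tags against all values present in the theme column (with the prefix added). If a value such as t4 does not occur in the theme column, it is not treated as a legitimate theme tag, so col_t4 in row 4 is not replaced during processing. If you need every col_t* tag replaced, that has to be handled explicitly. Hopefully this is a useful starting point that you can build on.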
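If every col_t* tag should be replaced, one possible approach is to match theme tags by pattern instead of by the values found in the theme column. The sketch below assumes that a theme tag is the prefix followed by "t" and one or more digits; that pattern and the helper name merge_theme are my assumptions, not something stated in the question.

```python
import re

import pandas as pd

prefix = "col_"
# Assumption: a theme tag looks like col_t1, col_t2, ... (prefix + "t" + digits)
theme_tag = re.compile(r"^" + re.escape(prefix) + r"t\d+$")


def merge_theme(tags, theme):
    """Return tags with every pattern-matched theme tag replaced by the theme."""
    if pd.isnull(theme):
        return tags  # no theme: leave the tags untouched
    new_tag = prefix + theme
    if pd.isnull(tags):
        return new_tag  # no tags yet: the theme becomes the only tag
    parts = [t.strip() for t in tags.split(",") if t.strip()]
    if any(theme_tag.match(t) for t in parts):
        # Replace each theme-like tag with this row's theme
        parts = [new_tag if theme_tag.match(t) else t for t in parts]
    else:
        # No theme-like tag present: append the theme
        parts.append(new_tag)
    return ", ".join(parts)


print(merge_theme("col_t4, large, col_yellow", "t3"))
# → col_t3, large, col_yellow
```

To apply it to the whole frame, something like `df["tags"] = [merge_theme(t, th) for t, th in zip(df["tags"], df["theme"])]` should work. Note that a row containing several col_t* tags would end up with the same theme tag repeated; deduplicate if that matters for your data.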