Replacing duplicate words but leaving the middle word using regex with pandas dataframe

【Question title】: Replacing duplicate words but leaving the middle word using regex with pandas dataframe
【Posted】: 2026-01-22 21:30:01
【Question description】:

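I have imported some tables into a pd.DataFrame. One column of the dataframe contains company names, and I want to clean it up by removing duplicated words.

For example:

"Benz-Benz" => "Benz"
"Tesla 123-Tesla 123" => "Tesla 123"
"Apple Store Inc-Apple Store In" => "Apple Store Inc"

So far I have figured out how to handle the first two cases with a regex. However, I can't seem to work out how to handle the third case.

Here is my code for the third case: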

df_comp['comp_no_duplicate'] = df_comp['comp_name'].str \
                    .replace(r'(^\b[A-Z]{1,}.*\b)(.*)-{1}\b\1\b', r'\1\2', regex=True)

With this code, the third case gives: "Apple Store Inc-Apple Store In" => "Apple Store IncIn"

How do I write a regex for this case?
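The stray "In" can be reproduced outside pandas. The following is a minimal sketch using only the standard `re` module and the pattern exactly as posted. It prints the capture groups to show that the first group is forced to end on a word boundary, so the match stops before the final "In".

```python
import re

# Pattern from the question, unchanged
pattern = re.compile(r'(^\b[A-Z]{1,}.*\b)(.*)-{1}\b\1\b')

text = "Apple Store Inc-Apple Store In"
match = pattern.search(text)

# Group 1 has to end on a word boundary, so it stops after "Apple Store "
print(repr(match.group(1)))  # 'Apple Store '
print(repr(match.group(2)))  # 'Inc'
print(repr(match.group(0)))  # 'Apple Store Inc-Apple Store '

# The trailing "In" lies outside the match, so it survives the substitution
result = pattern.sub(r'\1\2', text)
print(result)  # Apple Store IncIn
```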

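【Question comments】:

Is it always separated by -? If so, a simple x.split('-')[0] would be enough...
df_comp['comp_no_duplicate'] = df_comp['comp_name'].str.replace(r'^(.*)-\1$', r'\1', regex=True). Note that in Apple Store Inc-Apple Store In the text after the - is not an exact duplicate of the text before it.
@WiktorStribiżew Yes, duplicated company names are always separated by "-", but there are also other company names without duplication that contain "-".
So, does r'^(.*)-\1$' work?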
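As a rough check of the pattern suggested in the comments, the sketch below runs it with the standard `re` module on the three examples from the question plus "Coca-Cola", which is a made-up non-duplicate name added for illustration. It also tries a relaxed pattern, `^((.+).*)-\2$`, which is not from the thread. It assumes that a truncated duplicate is always a prefix of the text before the hyphen.

```python
import re

# The last entry is a hypothetical non-duplicate name containing "-"
names = [
    "Benz-Benz",
    "Tesla 123-Tesla 123",
    "Apple Store Inc-Apple Store In",
    "Coca-Cola",
]

# Pattern from the comment: collapses only exact "X-X" duplicates
exact = re.compile(r'^(.*)-\1$')

# Hypothetical relaxed pattern (an assumption, not from the thread):
# the text after "-" only has to be a non-empty prefix of the text before it
prefix = re.compile(r'^((.+).*)-\2$')

exact_results = [exact.sub(r'\1', name) for name in names]
prefix_results = [prefix.sub(r'\1', name) for name in names]

print(exact_results)
# ['Benz', 'Tesla 123', 'Apple Store Inc-Apple Store In', 'Coca-Cola']
print(prefix_results)
# ['Benz', 'Tesla 123', 'Apple Store Inc', 'Coca-Cola']
```

The relaxed pattern has a cost: any name whose second part happens to be a prefix of the first, such as "Smith-S", would also be collapsed. Whether that is acceptable depends on the data. Either pattern string can be passed to `Series.str.replace` with `regex=True`, as in the comment above.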

Tags: python regex pandas

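【Solution 1】:

Hard-coding this many (potential) rules can get cumbersome. Maybe you can approach this a little differently. You don't want duplicate terms, so why not filter out the terms that occur more than once?

There are several ways to do this, depending on your needs. You can keep the first occurrence or the last occurrence, and you can either go for speed (which sacrifices the order of the terms) or insist on preserving the order. Here are some implementation suggestions: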

import re
import pandas

from typing import List


# Your data
df = pandas.DataFrame(
    [
        {"text": "Benz-Benz"},
        {"text": "Tesla 123-Tesla 123"},
        {"text": "Apple Store Inc-Apple Store In"},
    ]
)


def unordered_deduplicate(text: str) -> str:
    """Take a string and remove duplicate terms, without preserving
    the order of the terms.

    Args:
        text (str): The input text

    Returns:
        str: The cleaned output
    """
    return " ".join(set(re.split(r"\s|-", text)))


def ordered_deduplicate(text: str) -> str:
    """Take a string and remove duplicate terms, only keeping the
    first occurrence of a term.

    Args:
        text (str): The input string

    Returns:
        str: The cleaned output
    """

    # Map every distinct term to a counter of how often it has been kept
    unique_terms_count = {term: 0 for term in set(re.split(r"\s|-", text))}

    # Loop the terms
    cleaned: List[str] = []
    for term in re.split(r"\s|-", text):

        # Only keep them in the cleaned list if they haven't been seen before
        if unique_terms_count[term] == 0:
            cleaned.append(term)
            unique_terms_count[term] += 1

    return " ".join(cleaned)


# Create clean text columns in different ways
df["unordered_text"] = df["text"].apply(unordered_deduplicate)
df["ordered_text"] = df["text"].apply(ordered_deduplicate)

print(df)
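
The answer mentions keeping the last occurrence of a term, but the code above only keeps the first. A minimal sketch of the keep-last variant follows. The function name `ordered_deduplicate_keep_last` is made up for illustration, and the sketch relies on `dict` preserving insertion order (Python 3.7+).

```python
import re


def ordered_deduplicate_keep_last(text: str) -> str:
    """Remove duplicate terms, keeping only the last occurrence of each."""
    terms = re.split(r"\s|-", text)
    # dict.fromkeys keeps the first time each key is seen, so walking the
    # terms in reverse keeps the last occurrence of every term
    kept_reversed = list(dict.fromkeys(reversed(terms)))
    return " ".join(reversed(kept_reversed))


print(ordered_deduplicate_keep_last("Tesla 123-Tesla 123"))
# Tesla 123
print(ordered_deduplicate_keep_last("Apple Store Inc-Apple Store In"))
# Inc Apple Store In
```

It can be applied the same way as the other functions, for example `df["text"].apply(ordered_deduplicate_keep_last)`. Like the functions above, it compares whole terms, so the truncated "In" is treated as a different term from "Inc" and is kept.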

【Comments】: