How to filter a list of strings? Answers

【Question title】: How to filter a list of strings?
【Posted】: 2021-12-27 10:03:04
【Question description】:

I have a list of strings that mix non-English and English words. I want to keep only the English words.

Example:


phrases = [
    "S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
    "स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
    "भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",
]

My code so far:

import re
regex = re.compile("[^a-zA-Z0-9!@#$&()\\-`.+,/\"]+")
for i in phrases:
    print(regex.sub(' ', i))

My output:

["S/O , .-4 , S/O Ashok Kumar, Block no.-4D.",
  "-15, 5. Street-15, sector -5, Civic Centre",
  ", , , , Bhilai, Durg. Bhilai, Chhattisgarh",]

My desired output:

["S/O Ashok Kumar, Block no.-4D.",
 "Street-15, sector -5, Civic Centre",
 "Bhilai, Durg. Bhilai, Chhattisgarh,"]

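【Question discussion】:

It looks like you have an unescaped . in your regex, which will match any character. If you want to match a period, you need to escape it, i.e. \. You should also look at the regex special sequences such as \w and \d, which would make your expression shorter. It also looks like the strings you want to match start with an English letter, so you could require that before matching the rest of the string, e.g. \w[\w\d]+
@bicarlsen A dot inside a character class does not need to be escaped; the problem is elsewhere.
@bicarlsen, hi, can you tell me what the expression should be?
Thanks @mozway, I didn't know that.
Please don't use regular expressions for serious applications (e.g. Adhaar Card, banks). I have seen horrible ID misprints in not one but 20+ Indian languages. Better to invest in a dedicated Unicode parser.

Tags: python regex string list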
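
The point about the dot raised in the comments can be checked directly. The following is a minimal sketch using only Python's standard re module; the sample string and the variable names are made up for illustration.

```python
import re

sample = "a.b-c"

# Outside a character class, an unescaped dot matches any character except a newline.
outside_class = re.findall(r".", sample)

# Inside a character class, the dot is an ordinary literal and needs no escaping.
inside_class = re.findall(r"[.]", sample)

# Escaping it inside the class is allowed but changes nothing.
escaped_in_class = re.findall(r"[\.]", sample)

print(outside_class)     # ['a', '.', 'b', '-', 'c']
print(inside_class)      # ['.']
print(escaped_in_class)  # ['.']
```

So the dot is not what goes wrong in the question's pattern. That pattern keeps digits, commas, periods, and hyphens wherever they occur, including inside the Hindi half of each string, which is why fragments such as ", .-4 ," survive.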

【Solution 1】:

Looking at your data, it seems you could use the following:

import regex as re
lst=["S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
      "स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
      "भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",]
for i in lst:
    print(re.sub(r'^.*\p{Devanagari}.+?\b', '', i))

Prints:

S/O Ashok Kumar, Block no.-4D.
Street-15, sector -5, Civic Centre
Bhilai, Durg. Bhilai, Chhattisgarh,

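See the online regex demo.

^ - start-of-string anchor;
.*\p{Devanagari} - zero or more characters (greedy) up to the last Devanagari character;
.+?\b - one or more characters (lazy) up to the first word boundary.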
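
The pattern above relies on the third-party regex module for \p{Devanagari}. As a rough sketch only, the same idea can be tried with the standard re module by substituting the Devanagari Unicode block, U+0900 to U+097F, for the script property. That range is an assumption that happens to cover the sample data; it is not a full equivalent of \p{Devanagari}, and the helper name strip_devanagari_prefix is hypothetical.

```python
import re

# Assumption: the Devanagari block U+0900-U+097F stands in for \p{Devanagari}.
DEVANAGARI_PREFIX = re.compile(r'^.*[\u0900-\u097F].+?\b')


def strip_devanagari_prefix(text):
    """Drop everything through the last Devanagari character,
    plus the characters that follow it up to the next word boundary."""
    return DEVANAGARI_PREFIX.sub('', text)


phrases = [
    "S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
    "स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
    "भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",
]

cleaned = [strip_devanagari_prefix(p) for p in phrases]
for line in cleaned:
    print(line)
```

On the three sample strings this should print the same lines as the regex-module version. Text using Devanagari characters outside that block would need a wider range or the regex module itself.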

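【Discussion】:

It would be better to rename list to something else, since that shadows the built-in class.
@accdias, you are right.
@JudV, any pointers to where {Devanagari} and the other Unicode language properties are defined? I do a lot of scraping, and this would save me many lines of regex.
@MortenB, I found this to be a good source of information.

【Solution 2】:

If you mean that your characters can only be standard English letters, that your regex already works for that, and that you only want to get rid of the problematic ", , , ," values, you could do this: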

def format_output(current_output):
    results = []
    for row in current_output:
        # split on the ","
        sub_elements = row.split(",")
        # pieces that held only removed text are now empty or whitespace-only; drop them
        filtered = [x for x in sub_elements if x.strip()]
        # then join the remaining elements together and append to the final results list
        results.append(",".join(filtered))
    return results
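
To make the idea concrete, here is a small self-contained sketch applied to two rows of the question's current output. The helper name drop_empty_pieces and the choice to re-join with ", " are assumptions made for illustration, not part of the answer.

```python
def drop_empty_pieces(row, sep=","):
    """Split on sep, discard pieces that are empty after stripping, and re-join."""
    pieces = [piece.strip() for piece in row.split(sep)]
    return (sep + " ").join(piece for piece in pieces if piece)


rows = [
    ", , , , Bhilai, Durg. Bhilai, Chhattisgarh",
    "S/O , .-4 , S/O Ashok Kumar, Block no.-4D.",
]

tidied = [drop_empty_pieces(row) for row in rows]
print(tidied[0])  # Bhilai, Durg. Bhilai, Chhattisgarh
print(tidied[1])  # S/O, .-4, S/O Ashok Kumar, Block no.-4D.
```

The second row shows the limit of this approach: leftover fragments such as "S/O" and ".-4" from the Hindi half are not empty, so they are kept.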

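【Discussion】:

【Solution 3】:

It seems to me that the first part of each list element is the Hindi translation of the second part, and that the two parts contain the same number of words.

So, for the example you provided and for any input that follows exactly the same pattern (it will break otherwise), all you have to do is take the second half of the words in each element.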

phrases = ["S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
  "स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
  "भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",]


mod_list = []
for s in phrases:
    # split on whitespace; the second half of the words is the English part
    words = s.split()
    n = len(words)
    mod_list.append(' '.join(words[n // 2:]))

print(mod_list)

Output:

['S/O Ashok Kumar, Block no.-4D.', 
'Street-15, sector -5, Civic Centre', 
'Bhilai, Durg. Bhilai, Chhattisgarh,']

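【Discussion】:

It would be better to rename list to something else, since that shadows the built-in class.

【Solution 4】:

You can also achieve what you need with Python's built-in re module.

Approach 1: You can remove all text from the first non-ASCII letter through the last non-ASCII letter: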

import re
phrases = [ "S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.", "स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre", "भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,", ]
rx = re.compile(r'[^\W\d_A-Za-z].*[^\W\d_A-Za-z]\W*')
l = [rx.sub('', text) for text in phrases]
print(l)
# => ['S/O S/O Ashok Kumar, Block no.-4D.', 'Street-15, sector -5, Civic Centre', 'Bhilai, Durg. Bhilai, Chhattisgarh,']

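Here, [^\W\d_A-Za-z].*[^\W\d_A-Za-z]\W* matches a non-ASCII letter ([^\W\d_A-Za-z]), then as many characters as possible other than line breaks (.*), then another non-ASCII letter and zero or more non-word characters (\W*). Because matching starts at the first non-ASCII letter, anything before it, such as the leading "S/O " in the first string, is left in place.

See this Python demo.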
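
The character class [^\W\d_A-Za-z] works by subtraction, which can be easier to see on a tiny input. The sketch below uses a made-up sample string and variable names; only the pattern itself comes from the answer.

```python
import re

# Negated class: not a non-word character, not a digit, not "_", not an ASCII letter.
# What remains is a word character that is a letter outside A-Z / a-z.
non_ascii_letter = re.compile(r'[^\W\d_A-Za-z]')

sample = "aя1_क-Z"
matches = non_ascii_letter.findall(sample)
print(matches)  # ['я', 'क']
```

The ASCII letters, the digit, the underscore, and the hyphen are all excluded, leaving only the Cyrillic and Devanagari letters.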

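Approach 2: You can also remove all text from the start of the string through the last non-ASCII letter, together with any non-word characters at the end of the string: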

re.sub(r'.*[^\W\d_A-Za-z]\W*|\W+$', '', text)

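See the regex demo. Details:

.*[^\W\d_A-Za-z]\W* - zero or more characters other than line breaks (.*), then a non-ASCII letter ([^\W\d_A-Za-z]) and zero or more non-word characters (\W*)
| - or
\W+$ - one or more non-word characters (\W+) at the end of the string ($).

See the Python demo: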

import re
phrases = [ "S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.", "स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre", "भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,", ]
rx = re.compile(r'.*[^\W\d_A-Za-z]\W*|\W+$')
l = [rx.sub('', text) for text in phrases]
print(l)
# => ['S/O Ashok Kumar, Block no.-4D', 'Street-15, sector -5, Civic Centre', 'Bhilai, Durg. Bhilai, Chhattisgarh']

【Discussion】: