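How to filter all words that contain N or more characters? Answers

【Question title】: How to filter all words which contain N or more characters?
【Posted】: 2011-01-04 01:24:16
【Question】:

I want to process a text file to find all words that contain more than N characters. Any solution in Bash (grep, awk) or Python (re) is welcome! The shortest one is preferred, though.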

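【Comments】:

Why the restriction to re? It seems arbitrary to me.
@Lennart: Because I want a one-line solution for the shell that uses pipes, like cat a.txt | grep blabla...
@S.Lott: No reason, I'm just not a regex pro.
This is not a regular-expression problem in Python.
-1. "Write some code for me" is not the type of question I want to see on StackOverflow. (There is nothing specific against it in the FAQ, but I am willing to spend reputation points to register my disapproval.)

Tags: python regex linux bash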

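【Solution 1】:

egrep -o '[^ ]{N,}' <filename>

This finds all non-space constructs that are at least N characters long. If you are concerned about "words", you can try [a-zA-Z] instead.

【Discussion】:

+1 Better: grep -Eo '\<[[:alpha:]]{N,}\>' inputfile (or '\<\w{N,}\>').
@Dennis Williamson I considered using character classes, but was not sure whether egrep supports them.
Regular grep supports them too. They are part of basic regular expressions, as specified by POSIX.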
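
For readers working in Python, the difference between the two patterns discussed above can be sketched as follows. This is only an illustration: the function names and the sample text are made up, `\S` stands in for `[^ ]`, and `[^\W\d_]` is used as a rough substitute for `[[:alpha:]]`, which Python's `re` module does not support.

```python
import re

def non_space_runs(text, n):
    # Runs of at least n non-whitespace characters,
    # roughly what egrep -o '[^ ]{N,}' prints.
    return re.findall(r"\S{%d,}" % n, text)

def alpha_words(text, n):
    # Whole words made of at least n letters, roughly what
    # grep -Eo with word boundaries and [[:alpha:]] prints.
    return re.findall(r"\b[^\W\d_]{%d,}\b" % n, text)

sample = "Hello, world! abc123 is short-ish."
print(non_space_runs(sample, 6))  # ['Hello,', 'world!', 'abc123', 'short-ish.']
print(alpha_words(sample, 5))     # ['Hello', 'world', 'short']
```

The first variant keeps punctuation and digits attached to the match, while the second returns only runs of letters that form a complete word.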

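【Solution 2】:

Python

import fileinput
N = 5  # length threshold
# Reads the files named on the command line, or stdin if none are given.
for line in fileinput.input():
    for word in line.split():
        if len(word) > N:  # strictly longer than N; use >= for "N or more"
            print(word)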

【Discussion】:

【Solution 3】:

import re; [s for s in re.findall(r"\w+", open(filename, "r").read()) if len(s) >= N]

【Discussion】:

【Solution 4】:

Prints the words longer than 5 characters, together with their line numbers:

awk -F ' ' '{for(i=1;i<=NF;i++){ if(length($i)>=6) print NR, $i }}' your_file
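
The same idea can be sketched in Python for comparison. The function name and the sample lines below are hypothetical, chosen only for illustration; lines are numbered from 1 to mirror awk's NR, and `str.split()` splits on runs of whitespace much as awk's default field splitting does.

```python
def long_words_with_line_numbers(lines, min_len):
    # Yield (line_number, word) pairs for every word of at least min_len
    # characters, numbering lines from 1 as awk's NR does.
    for line_no, line in enumerate(lines, start=1):
        for word in line.split():
            if len(word) >= min_len:
                yield line_no, word

sample_lines = ["a short line", "another considerably longer line"]
for line_no, word in long_words_with_line_numbers(sample_lines, 6):
    print(line_no, word)
```

Passing an open file object instead of `sample_lines` should work the same way, since iterating over a file yields its lines.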

【Discussion】:

@goreSplatter - I think your egrep is better; I used awk because nobody had offered an awk answer...

【Solution 5】:

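#!/usr/bin/env python

import sys, re

def morethan(n, file_or_string):
    # Treat the argument as a filename; if it cannot be opened,
    # fall back to using it as the text itself.
    try:
        with open(file_or_string, 'r') as f:
            content = f.read()
    except (IOError, OSError):
        content = file_or_string
    # Match runs of at least n word characters.
    pattern = re.compile(r"\w{%s,}" % n)
    return pattern.findall(content)

if __name__ == '__main__':
    try:
        print(morethan(*sys.argv[1:]))
    except TypeError:
        # Raised when the wrong number of arguments is given.
        print('Usage: %s [COUNT] [FILENAME]' % sys.argv[0], file=sys.stderr)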

Usage example (via this gist):

$ git clone -q git://gist.github.com/763574.git && \
     cd 763574 && python morethan.py 7 morethan.py

['stackoverflow', 'questions', '4585255', 'contain', ...

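【Discussion】:

【Solution 6】:

You can use plain grep, but it returns the whole line:

grep '[^ ]\{N\}'

where N is your number.

I don't know how to get the individual words in grep or awk, but it is easy in Python: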

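import re
with open(filename, 'r') as f:
    text = f.read()
# \S rather than [^ ] so that newlines and tabs also separate words.
big_words = re.findall(r'\S{%d,}' % N, text)

Again, N is your number. big_words will be a list containing your words.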

【Discussion】:

【Solution 7】:

In this example, replace the value 5 with whatever length you are looking for. The second example shows it as a function.

1)

>>> import re
>>> filename = r'c:\temp\foo.txt'
>>> re.findall('\w{5}', open(filename).read())
['Lorem', 'ipsum', 'dolor', 'conse', 'ctetu', 'adipi', 'scing', 'digni', 'accum', 'congu', ...]

2)

def FindAllWordsLongerThanN(n=5, file='foo.txt'):
    # {n,} matches n or more word characters; a bare {n} would match exactly n.
    with open(file) as f:
        return re.findall(r'\w{%s,}' % n, f.read())

FindAllWordsLongerThanN(7, r'c:\temp\foo.txt')
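
One thing worth noting about the output in example 1: the quantifier `{5}` matches exactly five characters, so longer words come back as five-character fragments (for instance 'conse' and 'ctetu'). A minimal sketch of the difference, using a made-up sample string rather than the original foo.txt:

```python
import re

sample = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"

# {5} matches exactly five word characters, so long words are cut into chunks.
exact = re.findall(r"\w{5}", sample)

# {5,} matches five or more, so each long word is returned whole.
at_least = re.findall(r"\w{5,}", sample)

print(exact)     # ['Lorem', 'ipsum', 'dolor', 'conse', 'ctetu', 'adipi', 'scing']
print(at_least)  # ['Lorem', 'ipsum', 'dolor', 'consectetur', 'adipiscing']
```

If whole words of at least a given length are what you want, the open-ended form `{n,}` is probably the one to use.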

【Discussion】:

【Solution 8】:

re.findall(r'\w'*N+r'\w+',txt)

【Discussion】:

【Solution 9】:

Try this:

N = 5  # Threshold
# Iterate over the file directly; the with block closes it afterwards.
with open('test.txt', 'r') as f:
    for line in f:
        print(" ".join([w for w in line.split() if len(w) >= N]))

【Discussion】:

【Solution 10】:

For completeness (although the regex solution is probably better in this case):

>>> from string import punctuation
>>> with open('foreword.rst', 'rt') as infile:
...    for line in infile:
...       for x in line.split():
...           x = x.strip(punctuation)
...           if len(x) > 5:
...              print(x)

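This assumes you really meant "filter", i.e. that each word should be printed as many times as it occurs. If you want each word only once, I would do this: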

>>> from string import punctuation
>>> result = set()
>>> with open('foreword.rst', 'rt') as infile:
...    for line in infile:
...       for x in line.split():
...           x = x.strip(punctuation)
...           if len(x) > 5:
...              if x not in result:
...                  result.add(x)
...                  print(x)

【Discussion】:

【Solution 11】:

Hello, I believe this is a nice solution using a lambda function. The first argument is N.

def main():
    # t(n, s) returns the words in s that are longer than n characters.
    t = lambda n, s: list(filter(lambda w: len(w) > n, s.split()))
    with open("file.txt") as p_file:
        for line in p_file:
            print(t(3, line))
if __name__ == '__main__':
    main()

【Discussion】:

【Solution 12】:

Pure Bash:

N=10; set -o noglob; for word in $(<inputfile); do ((${#word} > N)) && echo "$word"; done; set +o noglob

If your input file does not contain any glob characters (*, ?, [), you can omit the set commands.

【Discussion】: