【Question title】: String searching with returning matched line in Python
【Posted】: 2011-07-20 11:23:01
【Question description】:

I am new to Python. I want to match strings in certain lines of a file. Say I have the strings:

british    7
German     8
France     90

and I have some lines in a file like these:

<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>

I want to get output like this:

<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.</s>

I tried the following code:

for i in file:
    if left in i and right in i:
        line = i.replace(left, '<w1>' + left + '</w1>')
        lineR = line.replace(right, '<w2>' + right + '</w2>')
        text = text + lineR + "\n"
return text

However, it also matches the string inside the id, e.g.:

<s id="69-<w2>7</w2>">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>

So, is there a way to search for the string as a word rather than as characters, so that I can avoid <s id="69-<w2>7</w2>">?

Thanks in advance for any help.

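【Comments】:

  • Are these <s> tags the only tags on the line, and is there only one per line?
  • Ah, but going by your demo in the first 2 lines of code snippet 3, I would expect France to be matched with 90 somewhere, or am I missing something?
  • @mcnemesis: sorry, that was a typo!

Tags: python string-matching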


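【Solution 1】:

I have something fairly complicated, but I wrote it in a hurry and it does the job for now.

Notes:

  • I added "in France" after "...Meanwhile is the studio 7 album by British pop band 10cc",
    and only British is modified.

  • In "by the german band Genesis 8 and was released in 1978", "1978" is not modified, while "8" is.

That is the reason for the complexity.

But I fear that, despite this complexity, not every possible sentence will be handled correctly.

An improvement would be to make idi always the name that really belongs to the music group, rather than always the first name found, as in the current solution. But without knowing exactly what is wanted, that is hard work.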

import re

ss = '''british    7
German     8
France     90'''

text = '''<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>
'''

# Map each lowercased name to its number, e.g. 'british' -> '7'.
regx = re.compile(r'^(.+?)[ \t]+(\d+)', re.MULTILINE)
dico = dict((a.lower(), b) for (a, b) in regx.findall(ss))
print('dico==', dico)
print('\n\n')

# Split on the opening and closing tags. The capturing group keeps the tags
# in the list (odd indices), so ''.join(splitted) rebuilds the whole text.
rogx = re.compile(r'(<s id="[\d-]+">|</s>\r?\n)')
splitted = rogx.split(text)
print('splitted==')
print(splitted)
print('=================\n')

def repl(mat):
    # idi: the first name found in the current sentence (read from global `the`).
    idi = next(b for (a, b) in the if b).lower()
    x, y = mat.groups()
    if x:
        # A number: tag it only if it is the number paired with idi.
        return '<w2>%s</w2>' % x if dico[idi] == x else x
    if y:
        # A name: tag it only if it is idi itself.
        return '<w1>%s</w1>' % y if y.lower() == idi else y

rigx = re.compile(r'(\d+)|(' + '|'.join(dico.keys()) + ')', re.IGNORECASE)

# The sentences sit at the even indices of splitted.
for i, el in enumerate(splitted[0::2]):
    if el:
        the = rigx.findall(el)
        print('-----------------------------')
        print('* index in splitted==', 2*i)
        print('\n* el==')
        print(repr(el))
        print('\n* rigx.findall(el)==')
        print(the)
        print('\n* modified el:')
        print(rigx.sub(repl, el))
        splitted[2*i] = rigx.sub(repl, el)

print('\n\n##################################\n\n')
print('modified splitted==')
print(splitted)
print()
print(''.join(splitted))

Result

dico== {'british': '7', 'german': '8', 'france': '90'}



splitted==
['', '<s id="69-7">', '...Meanwhile is the studio 7 album by British pop band 10cc in France.', '</s>\n', '', '<s id="15-8">', '...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.', '</s>\n', '', '<s id="1990-2">', 'Magnum Nitro Express is a France centerfire fire rifle cartridge 90.', '</s>\n', '']
=================

-----------------------------
* index in splitted== 2

* el==
'...Meanwhile is the studio 7 album by British pop band 10cc in France.'

* rigx.findall(el)==
[('7', ''), ('', 'British'), ('10', ''), ('', 'France')]

* modified el:
...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.
-----------------------------
* index in splitted== 6

* el==
'...And Then There Were Three... is the ninth studio album by the german band Genesis 8 and was released in 1978.'

* rigx.findall(el)==
[('', 'german'), ('8', ''), ('1978', '')]

* modified el:
...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.
-----------------------------
* index in splitted== 10

* el==
'Magnum Nitro Express is a France centerfire fire rifle cartridge 90.'

* rigx.findall(el)==
[('', 'France'), ('90', '')]

* modified el:
Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.


##################################


modified splitted==
['', '<s id="69-7">', '...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.', '</s>\n', '', '<s id="15-8">', '...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.', '</s>\n', '', '<s id="1990-2">', 'Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.', '</s>\n', '']

<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc in France.</s>
<s id="15-8">...And Then There Were Three... is the ninth studio album by the <w1>german</w1> band Genesis <w2>8</w2> and was released in 1978.</s>
<s id="1990-2">Magnum Nitro Express is a <w1>France</w1> centerfire fire rifle cartridge <w2>90</w2>.</s>

Edit 1

I eliminated replmodel().

repl() takes the value of rigx.findall(el);
I added the line the = rigx.findall(el) for this.
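
The improvement mentioned above (not always taking the first name found) could look roughly like the sketch below. It is only one possible reading of the requirement: the helper name find_pair and the sample sentence are hypothetical, and it assumes that a pair counts as present when both its name and its number occur in the sentence as standalone tokens.

```python
import re

pairs = {'british': '7', 'german': '8', 'france': '90'}

def find_pair(content, pairs):
    """Hypothetical helper: return the first (name, number) pair whose
    two parts both occur in content, or None if there is no such pair."""
    for name, number in pairs.items():
        # The name must not touch another word character (case-insensitive).
        name_re = re.compile(r'(?<!\w)%s(?!\w)' % re.escape(name), re.IGNORECASE)
        # The number must not be part of a longer number (8 vs 1978).
        number_re = re.compile(r'(?<!\d)%s(?!\d)' % re.escape(number))
        if name_re.search(content) and number_re.search(content):
            return name, number
    return None

# France comes first in the sentence, but only british/7 is a complete pair.
sentence = 'An album recorded in France by the British band, number 7.'
print(find_pair(sentence, pairs))
```

If several complete pairs occur in one sentence, this sketch simply returns the first one in dictionary order, so the ambiguity is reduced but not removed.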

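【Comments】:

  • Thanks! Is there some simple way to ignore <s id="69-7"> when matching the strings?
  • @Liza Yes, it consists of removing the parentheses in the regex pattern. But if you do that, you cannot perform a join() on the split strings to obtain the modified text: the separators will be missing. That is why I keep the separators and have to use the index (2*i).
  • @eyquem, your solution is very nice and very refined. But it is heavily over-engineered. Sometimes 3 lines of code that handle 99.9% of the input are better than 50 lines that handle 100%. One should move to such a variant only once the need has been demonstrated. Anyway, +1.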
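
The point about the parentheses can be checked with a short sketch. The two-line sample text is made up for illustration; the pattern is the one from the answer, once with its capturing group and once with a non-capturing group.

```python
import re

text = '<s id="69-7">studio 7 album</s>\n<s id="15-8">Genesis 8</s>\n'

# Capturing group: the separators stay in the result, at the odd indices.
with_group = re.split(r'(<s id="[\d-]+">|</s>\r?\n)', text)
# Non-capturing group: the separators are dropped.
without_group = re.split(r'(?:<s id="[\d-]+">|</s>\r?\n)', text)

print(with_group)
print(without_group)

# Only the capturing version can be joined back into the original text.
print(''.join(with_group) == text)
print(''.join(without_group) == text)
```

With the separators kept, the sentences are at the even indices, which is why the answer writes back to splitted[2*i].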
【Solution 2】:

You should use a regular expression to replace whole words specifically, not parts of words.

Something like

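import re
left = 'british'
right = '7'
# i is the current line, as in the loop from the question
i = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
i1 = re.sub(r'(?i)(\s+)(%s)(\s+)' % left, r'\1<w1>\2</w1>\3', i)
i2 = re.sub(r'(?i)(\s+)(%s)(\s+)' % right, r'\1<w2>\2</w2>\3', i1)
print(i2)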

This gives us '<s id="69-7">...Meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc.</s>'
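
One limit of requiring whitespace on both sides is that a term followed by punctuation, such as the 90 in "cartridge 90.", is not matched. A possible variant, as a minimal sketch rather than part of the original answer, uses lookarounds instead. The helper name tag_word is hypothetical. Note that a plain \b would not be enough here, because \b also matches between "-" and "7" in id="69-7"; the sketch therefore assumes that a hyphen next to the term means "not a standalone word".

```python
import re

def tag_word(line, word, tag):
    """Hypothetical helper: wrap standalone, case-insensitive matches
    of word in <tag>...</tag>."""
    # The match may not touch a letter, digit, underscore or hyphen,
    # so the 7 in id="69-7" and the 90 in id="1990-2" are skipped.
    pattern = r'(?<![\w-])(%s)(?![\w-])' % re.escape(word)
    return re.sub(pattern, r'<%s>\1</%s>' % (tag, tag), line, flags=re.IGNORECASE)

line = '<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>'
tagged = tag_word(tag_word(line, 'France', 'w1'), '90', 'w2')
print(tagged)
```

re.escape keeps characters such as "." or "+" in a search term from being read as regex syntax.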

If this approach leads to errors, you can try more elaborate code, such as

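import re

def do(left, right, line):
    # Split into tags and text; the capturing group keeps the tags in the list.
    parts = [x for x in re.split(r'(<[^>]+>)', line) if x]
    for idx, part in enumerate(parts):
        upper = part.upper()
        # Skip the <s ...> and </s> tags; only touch text containing both terms.
        if (not ('<s' in part or 's>' in part) and
                (left.upper() in upper and right.upper() in upper)):
            part = re.sub(r'(?i)(\s+)(%s)(\s+)' % left, r'\1<w1>\2</w1>\3', part)
            part = re.sub(r'(?i)(\s+)(%s)(\s+)' % right, r'\1<w2>\2</w2>\3', part)
            parts[idx] = part

    return ''.join(parts)


line = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
print(do('british', '7', line))   # tags British and 7 in the text only
print(do('british', '-7', line))  # '-7' occurs only in the id, so the line is unchanged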

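【Comments】:

  • However, a search term can consist of several words with spaces, digits, etc. So I keep the strings in an array and search for them in the line. For integer strings, I run into the problem above.
  • Oh, then you have to split the line into the tag and the content. If you are sure <s> is the only tag, that is really easy.
  • @Alex Laskin: yes, <s> is the only tag on each line.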
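
Splitting each line into the tag and the content, as suggested in the comments, might be sketched as follows. This assumes the layout confirmed above (one opening <s ...> tag, the content, one closing </s>); the names LINE_RE and tag_content are hypothetical, and the choice to tag only when both terms are present follows the code in the question.

```python
import re

# Assumed layout: one opening <s ...> tag, the content, one closing </s>.
LINE_RE = re.compile(r'(<s\b[^>]*>)(.*)(</s>)\s*$', re.DOTALL)

def tag_content(line, left, right):
    """Hypothetical helper: tag left/right in the content only, never in the tag."""
    m = LINE_RE.match(line)
    if m is None:
        return line  # not in the assumed layout: leave the line untouched
    opening, content, closing = m.groups()
    # re.escape lets a term contain spaces, digits or punctuation.
    patterns = [re.compile(r'(?<!\w)(%s)(?!\w)' % re.escape(term), re.IGNORECASE)
                for term in (left, right)]
    if not all(p.search(content) for p in patterns):
        return line  # tag only when both terms occur in the content
    for pattern, tag in zip(patterns, ('w1', 'w2')):
        content = pattern.sub(r'<%s>\1</%s>' % (tag, tag), content)
    return opening + content + closing

line = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
print(tag_content(line, 'british', '7'))
print(tag_content(line, 'pop band', '7'))
```

Because the id never reaches the substitution step, integer terms such as 7 can no longer end up inside <s id="69-7">.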
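【Solution 3】:

The best way is to use regular expressions. But if 'left' and 'right' always have at least one leading and one trailing space, you can use a simple trick (just add the leading and trailing spaces to your pattern):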

line = i.replace(' ' + left + ' ', ' <w1>' + left + '</w1> ')
lineR = line.replace(' ' + right + ' ', ' <w2>' + right + '</w2> ')

【Comments】:

  • Then you will get <w2> 7 </w2> instead of <w2>7</w2>
  • @Alex - yes, you're right, I have modified the code, but I'm not sure whether those spaces would break the html (or am I wrong?)
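
For reference, a small sketch of how the space trick behaves on the sample lines from the question. The helper name tag_spaced is hypothetical; it keeps the spaces outside the tags, as in the modified code above.

```python
def tag_spaced(line, word, tag):
    """Hypothetical helper: plain str.replace on the word surrounded
    by single spaces (case-sensitive)."""
    return line.replace(' %s ' % word, ' <%s>%s</%s> ' % (tag, word, tag))

a = '<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>'
b = '<s id="1990-2">Magnum Nitro Express is a France centerfire fire rifle cartridge 90.</s>'

print(tag_spaced(a, '7', 'w2'))        # the 7 inside the id is left alone
print(tag_spaced(b, '90', 'w2'))       # unchanged: "90." is followed by a period
print(tag_spaced(a, 'british', 'w1'))  # unchanged: the case differs from "British"
```

So the trick avoids the id problem, but it misses terms next to punctuation and terms whose case differs, which is where a regular expression becomes necessary.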