【Question title】: Extracting a string between two strings from HTML string
【Posted】: 2022-01-06 14:38:07
【Question】:

I have the following Python code. I am trying to use a regex to print each user together with its number, and this is what I did:

import re


txt = '''Element.update("to_users2", "\n\n\n<div class=\"label-field-pair\">\n  <div class=\"label-field-pair11\">\n    <label for=\"student_grade\">Select member</label>\n    <div class =\"scrolable\" >\n      <div class=\"scroll-inside\">\n        <div class=\"hover\"><a href=\"#\" class=\"all\" onClick=\"add_all_recipient('0000000,1111111,2222222,3333333,4444444,5555555,6666666,7777777,8888888,9999999')\">Select All  <span> Add </span></a>\n\n        </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(0000000)\" success=\"Element.hide('loader')\">user zero M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(1111111)\" success=\"Element.hide('loader')\">user One S ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(2222222)\" success=\"Element.hide('loader')\">user Two A ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(3333333)\" success=\"Element.hide('loader')\">user three H ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(4444444)\" success=\"Element.hide('loader')\">user four M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(5555555)\" success=\"Element.hide('loader')\">user Five O ...<span> Add </span></a>\n\n          </div>\n        \n          \n       
   <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(6666666)\" success=\"Element.hide('loader')\">user six F ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(7777777)\" success=\"Element.hide('loader')\">user Seven Mo ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">user eight ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">\u0650user nine M ...<span> Add </span></a>\n\n          </div>\n        \n      </div>\n    </div>\n  </div>\n</div>\n\n\n");'''


regexp = re.findall(
            r"add_recipient\(([0-9]+)\)\" success=.+>([a-zA-Z0-9\w]+) ", txt)

for x in regexp:
    print(x[1],  x[0])

Running the Python code above prints the following:

user 0000000
user 1111111
user 2222222
user 3333333
user 4444444
user 5555555
user 6666666
user 7777777
user 8888888

I need to get output like this:

user Zero 0000000
user One 1111111
...

How can I get output like this? In some cases re.findall returns only user 8888888, and I don't know why. How can I get the complete match?

【Question comments】:

    Tags: python html python-3.x regex


    【Solution 1】:

    Parsing XML/HTML with regular expressions is bad practice; use a parser instead (with a little help from a regex):

    from bs4 import BeautifulSoup
    import re
    
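    # Name the parser explicitly; otherwise bs4 warns and uses whichever parser is installed
    soup = BeautifulSoup(txt, 'html.parser')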
    
    out = []
    for e in soup.find_all('a', onclick=True):
        m = re.search(r'(?<=add_recipient\().*(?=\))', e['onclick'])
        if m:
            a = m.group()
            out.append((e.contents[0], a))
    

    Output:

    [('user zero M ...', '0000000'),
     ('user One S ...', '1111111'),
     ('user Two A ...', '2222222'),
     ('user three H ...', '3333333'),
     ('user four M ...', '4444444'),
     ('user Five O ...', '5555555'),
     ('user six F ...', '6666666'),
     ('user Seven Mo ...', '7777777'),
     ('user eight ...', '8888888'),
     ('ِuser nine M ...', '9999999')]
    

    Alternative output (only the first 2 words of the name); replace the last line with:

    out.append((' '.join(e.contents[0].split(maxsplit=2)[:2]), a))
    

    Output:

    [('user zero', '0000000'),
     ('user One', '1111111'),
     ('user Two', '2222222'),
     ('user three', '3333333'),
     ('user four', '4444444'),
     ('user Five', '5555555'),
     ('user six', '6666666'),
     ('user Seven', '7777777'),
     ('user eight', '8888888'),
     ('ِuser nine', '9999999')]
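As a dependency-free variation on the parser approach above, the same idea can be sketched with the standard library's `html.parser`. This is only an illustration, not part of the original answer: `RecipientParser` and the shortened `sample` string are hypothetical stand-ins for the real `txt`.

```python
import re
from html.parser import HTMLParser


class RecipientParser(HTMLParser):
    """Collects (name, number) pairs from <a onclick="add_recipient(N)"> links."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._number = None  # set while waiting for the link's first text node

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # HTMLParser lowercases attribute names, so onClick arrives as onclick.
        onclick = dict(attrs).get("onclick") or ""
        m = re.fullmatch(r"add_recipient\((\d+)\)", onclick)
        self._number = m.group(1) if m else None

    def handle_data(self, data):
        if self._number is not None and data.strip():
            # Keep only the first two words of the link text.
            name = " ".join(data.split()[:2])
            self.pairs.append((name, self._number))
            self._number = None


# Hypothetical, shortened sample with the same attribute layout as txt.
sample = '''<div class="hover"><a href="#" class="all" onClick="add_all_recipient('0000000,1111111')">Select All  <span> Add </span></a></div>
<div class="hover"><a href="#" class="individual" onClick="add_recipient(0000000)" success="Element.hide('loader')">user zero M ...<span> Add </span></a></div>
<div class="hover"><a href="#" class="individual" onClick="add_recipient(1111111)" success="Element.hide('loader')">user One S ...<span> Add </span></a></div>
'''

parser = RecipientParser()
parser.feed(sample)
parser.close()
for name, number in parser.pairs:
    print(name, number)
```

The "Select All" link is skipped because its `onclick` value does not fully match `add_recipient(<digits>)`.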
    

    【Comments】:

      【Solution 2】:

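      You can add an extra capture group and change the order in which the group values are printed.

      Note that you can write [a-zA-Z0-9\w]+ as \w+, because \w also matches a-zA-Z0-9.

      You can use [^<>]*> instead of .+> to prevent some backtracking: the negated character class cannot cross the angle brackets.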
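To see what the two notes above mean in practice, here is a small sketch on a hypothetical input in which two links sit on one line (the real `txt` has newlines between them, which is what stops `.+` there). The variable names and the shortened `success="x"` attribute are assumptions made for illustration.

```python
import re

# Hypothetical input: two links on ONE line, no newline between them.
one_line = (
    '<a onClick="add_recipient(1111111)" success="x">user One S ...<span> Add </span></a>'
    '<a onClick="add_recipient(2222222)" success="x">user Two A ...<span> Add </span></a>'
)

# Greedy .+ runs to the end of the line and backtracks to the LAST usable ">",
# so the first number gets paired with the second link's name.
greedy = re.findall(r'add_recipient\(([0-9]+)\)" success=.+>(\w+) (\w+)', one_line)

# [^<>]* cannot cross "<" or ">", so it stops at the end of the opening tag.
negated = re.findall(r'add_recipient\(([0-9]+)\)" success=[^<>]*>(\w+) (\w+)', one_line)

# [a-zA-Z0-9\w]+ and \w+ find the same tokens.
same_tokens = re.findall(r"[a-zA-Z0-9\w]+", one_line) == re.findall(r"\w+", one_line)

print(greedy)       # [('1111111', 'user', 'Two')]
print(negated)      # [('1111111', 'user', 'One'), ('2222222', 'user', 'Two')]
print(same_tokens)  # True
```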

      import re
      
      txt = '''Element.update("to_users2", "\n\n\n<div class=\"label-field-pair\">\n  <div class=\"label-field-pair11\">\n    <label for=\"student_grade\">Select member</label>\n    <div class =\"scrolable\" >\n      <div class=\"scroll-inside\">\n        <div class=\"hover\"><a href=\"#\" class=\"all\" onClick=\"add_all_recipient('0000000,1111111,2222222,3333333,4444444,5555555,6666666,7777777,8888888,9999999')\">Select All  <span> Add </span></a>\n\n        </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(0000000)\" success=\"Element.hide('loader')\">user zero M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(1111111)\" success=\"Element.hide('loader')\">user One S ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(2222222)\" success=\"Element.hide('loader')\">user Two A ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(3333333)\" success=\"Element.hide('loader')\">user three H ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(4444444)\" success=\"Element.hide('loader')\">user four M ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(5555555)\" success=\"Element.hide('loader')\">user Five O ...<span> Add </span></a>\n\n          </div>\n        \n          \n 
         <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(6666666)\" success=\"Element.hide('loader')\">user six F ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(7777777)\" success=\"Element.hide('loader')\">user Seven Mo ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(8888888)\" success=\"Element.hide('loader')\">user eight ...<span> Add </span></a>\n\n          </div>\n        \n          \n          <div class=\"hover\"><a href=\"#\" before=\"Element.show('loader')\" class=\"individual\" onClick=\"add_recipient(9999999)\" success=\"Element.hide('loader')\">\u0650user nine M ...<span> Add </span></a>\n\n          </div>\n        \n      </div>\n    </div>\n  </div>\n</div>\n\n\n");'''
      
      for x in re.findall(r"add_recipient\(([0-9]+)\)\" success=[^<>]*>(\w+) (\w+)", txt):
          print(x[1], x[2], x[0])
      

      Output:

      user zero 0000000
      user One 1111111
      user Two 2222222
      user three 3333333
      user four 4444444
      user Five 5555555
      user six 6666666
      user Seven 7777777
      user eight 8888888
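The output stops at user eight. The reason appears to be the `\u0650` escape in `txt`: the name for 9999999 starts with U+0650, a combining mark that Python's `re` does not treat as a word character, so `(\w+)` cannot start right after the `>`. The sketch below checks this on a shortened, hypothetical `sample`. The `[^\w<>]*` prefix is one possible workaround and is an assumption, not part of the original answer.

```python
import re

# Hypothetical, shortened stand-in for the last two links in txt.
sample = """<a onClick="add_recipient(8888888)" success="Element.hide('loader')">user eight ...<span> Add </span></a>
<a onClick="add_recipient(9999999)" success="Element.hide('loader')">\u0650user nine M ...<span> Add </span></a>
"""

# U+0650 is a combining mark; \w does not match it in Python's re module.
kasra_is_word_char = re.fullmatch(r"\w", "\u0650") is not None

# Pattern from the answer: the 9999999 entry is skipped.
strict = re.findall(r'add_recipient\(([0-9]+)\)" success=[^<>]*>(\w+) (\w+)', sample)

# Assumed workaround: allow non-word characters (but no angle brackets) before the name.
lenient = re.findall(
    r'add_recipient\(([0-9]+)\)" success=[^<>]*>[^\w<>]*(\w+) (\w+)', sample
)

print(kasra_is_word_char)  # False
for number, first, second in lenient:
    print(first, second, number)
```

With the lenient pattern, both `user eight 8888888` and `user nine 9999999` are printed for this sample.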
      

      【Comments】:

        【Solution 3】:

        I'm not a regex expert.

        You can try:

        out = re.findall(r"add_recipient\(([0-9]+)\)\" success=.+>(\w+\s+\w+)", txt)
        print(*[' '.join(i[::-1]) for i in out], sep='\n')
        
        # Output
        user zero 0000000
        user One 1111111
        user Two 2222222
        user three 3333333
        user four 4444444
        user Five 5555555
        user six 6666666
        user Seven 7777777
        user eight 8888888
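The tuple reversal `i[::-1]` works because each match is a `(number, name)` pair. A possibly more readable variant, sketched here as an assumption rather than as the answer's method, uses named groups with `re.finditer`. The group names `number` and `name` and the shortened `sample` are hypothetical.

```python
import re

# Hypothetical, shortened sample with the same layout as txt.
sample = """onClick="add_recipient(0000000)" success="Element.hide('loader')">user zero M ...<span> Add </span></a>
onClick="add_recipient(1111111)" success="Element.hide('loader')">user One S ...<span> Add </span></a>
"""

pattern = re.compile(
    r'add_recipient\((?P<number>[0-9]+)\)" success=[^<>]*>(?P<name>\w+\s+\w+)'
)

# Named groups make the output order explicit instead of relying on reversal.
lines = [f"{m['name']} {m['number']}" for m in pattern.finditer(sample)]
print(*lines, sep="\n")
```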
        

        【Comments】:
