【Question title】: Regex pattern to get HTML table information
【Posted】: 2016-09-08 23:17:10
【Question】:

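I would like to extract data from an HTML file using a regular expression, but I don't know what pattern to use. The HTML comes from an email.

Below is part of the HTML. I want to be able to get "40120 LBS".

What would the pattern look like?

I thought of something like: Shipment's weight [any characters] [0-9][0-9][0-9][0-9][0-9]

..etc

Maybe you know a more efficient way to get what I want. Thank you.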

<tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
 <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>

【Comments】:

  • You don't want to use regex for this. Find an HTML parsing library.
  • It is not clear what you need to get from the HTML.

Tags: html regex vba web-scraping


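【Solution 1】:

Rather than using RegExp to parse the HTML file, use a DOM parser.

The most direct way is to add a reference to the Microsoft HTML Object Library and use that. Getting to know the objects can be a little tricky, but not as tricky as trying to process HTML with regular expressions!

The key is to decide which rules to use to extract the values.

Here is an example that (hopefully) demonstrates the technique.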

Public Sub SimpleParser()
  Dim doc As MSHTML.HTMLDocument
  Dim b As MSHTML.HTMLBody
  Dim tr As MSHTML.HTMLTableRow, td As MSHTML.HTMLTableCell
  Dim columnNumber As Long, rowNumber As Long
  Dim trCells As MSHTML.IHTMLElementCollection
  Set doc = New MSHTML.HTMLDocument
  doc.body.innerHTML = "<table><tr style='mso-yfti-irow:8' id=""row_65""> <td width=170 valign=top style='width:127.5pt;background:white; padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""question_65""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>Shipment's weight<o:p></o:p></span></p> </td> <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""value_65""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>40120<o:p></o:p></span></p> </td> </tr> <tr style='mso-yfti-irow:9' id=""row_116""> <td width=170 valign=top style='width:127.5pt;background:#F3F3F3; padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""question_116""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>KG or LBS<o:p></o:p></span></p> </td> <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt' id=""value_116""> <p class=MsoNormal><span style='mso-fareast-font-family:""Times New Roman""'>LBS<o:p></o:p></span></p> </td> </tr></table>"
  Set b = doc.body
  'Example of looping through elements
  For Each tr In b.getElementsByTagName("tr")
    rowNumber = rowNumber + 1
    columnNumber = 0
    For Each td In tr.getElementsByTagName("td")
      columnNumber = columnNumber + 1
      Debug.Print rowNumber & "," & columnNumber, td.innerText
    Next
  Next
  'Go through each row; if the first cell is "Shipment's weight", display the next cell.
  For Each tr In b.getElementsByTagName("tr")
    Set trCells = tr.getElementsByTagName("td")
    If trCells.Item(0).innerText = "Shipment's weight" Then Debug.Print "Weight: " & trCells.Item(1).innerText
  Next

End Sub
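
The question asks for the combined string "40120 LBS". One way to get there, sketched below, is to look the two value cells up by id instead of looping over rows. This assumes the ids `value_65` and `value_116` from the sample are the same in every email, which the source does not confirm. `ShipmentWeight` and `CleanText` are hypothetical helper names, and the HTML passed in is assumed to be wrapped in `<table>` tags as in the example above.

```vba
' Requires a reference to the Microsoft HTML Object Library.
' Sketch only: the ids "value_65" and "value_116" come from the sample HTML
' and are assumed to be stable across emails.
Public Function ShipmentWeight(ByVal tableHtml As String) As String
  Dim doc As MSHTML.HTMLDocument
  Dim weightCell As MSHTML.IHTMLElement, unitCell As MSHTML.IHTMLElement

  Set doc = New MSHTML.HTMLDocument
  doc.body.innerHTML = tableHtml

  Set weightCell = doc.getElementById("value_65")
  Set unitCell = doc.getElementById("value_116")

  ' getElementById returns Nothing when the id is not present.
  If weightCell Is Nothing Or unitCell Is Nothing Then Exit Function

  ShipmentWeight = CleanText(weightCell.innerText) & " " & CleanText(unitCell.innerText)
End Function

' Removes line breaks and surrounding spaces that innerText may include.
Private Function CleanText(ByVal s As String) As String
  CleanText = Trim$(Replace(Replace(s, vbCr, ""), vbLf, ""))
End Function
```

The function returns an empty string when either cell is missing, so the caller can tell a failed lookup from a real value.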

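【Comments】:

  • Thank you. This is exactly the answer I was looking for. I will adapt this code to my needs.
  • Great! Glad to help!
【Solution 2】:

Parsing HTML in VBA

Although this parsing routine does not do exactly what you asked for, it should point you in the right direction in VBA.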

'Requires references to Microsoft Internet Controls and Microsoft HTML Object Library

Sub Extract_TD_text()

    Dim URL As String
    Dim IE As InternetExplorer
    Dim HTMLdoc As HTMLDocument
    Dim TDelements As IHTMLElementCollection
    Dim TDelement As HTMLTableCell
    Dim r As Long

    'Saved from www vbaexpress com/forum/forumdisplay.php?f=17
    URL = "file://C:\VBAExpress_Excel_Forum.html"

    Set IE = New InternetExplorer

    With IE
        .navigate URL
        .Visible = True

        'Wait for page to load
        While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend

        Set HTMLdoc = .document
    End With

    Set TDelements = HTMLdoc.getElementsByTagName("TD")

    Sheet1.Cells.ClearContents

    r = 0
    For Each TDelement In TDelements
        'Look for required TD elements - this check is specific to VBA Express forum - modify as required
        If TDelement.className = "alt2" And TDelement.Align = "center" Then
            Sheet1.Range("A1").Offset(r, 0).Value = TDelement.innerText
            r = r + 1
        End If
    Next

End Sub
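
When the HTML is already saved to disk, driving Internet Explorer may be more than is needed. The sketch below reads the file into a string and hands it to an `MSHTML.HTMLDocument` directly. `LoadHtmlFile` is a hypothetical helper name and the path in the usage routine is a placeholder. The file is assumed to be ANSI-encoded; a UTF-8 or UTF-16 file would need a different reader.

```vba
' Requires a reference to the Microsoft HTML Object Library.
' Sketch only: reads the file as ANSI text, which is an assumption about
' how the email was saved.
Public Function LoadHtmlFile(ByVal filePath As String) As MSHTML.HTMLDocument
    Dim fileNo As Integer
    Dim html As String
    Dim doc As MSHTML.HTMLDocument

    ' Read the whole file into a string in one call.
    fileNo = FreeFile
    Open filePath For Binary Access Read As #fileNo
    html = Space$(LOF(fileNo))
    Get #fileNo, , html
    Close #fileNo

    ' Let MSHTML parse the markup; no browser window is involved.
    Set doc = New MSHTML.HTMLDocument
    doc.body.innerHTML = html
    Set LoadHtmlFile = doc
End Function

Public Sub PrintAllCells()
    Dim td As MSHTML.HTMLTableCell

    ' "C:\mail.html" is a placeholder path.
    For Each td In LoadHtmlFile("C:\mail.html").getElementsByTagName("td")
        Debug.Print td.innerText
    Next
End Sub
```

The returned document can be queried with the same `getElementsByTagName` calls used in the routine above.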

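Doing it with a regular expression

Using regular expressions to parse HTML is not recommended because of all the obscure edge cases that can crop up. It seems you have some control over the HTML, though, so you should be able to avoid many of the edge cases the regex police cry about.

Description

This regular expression will do the following:

  • parse the sample text into individual rows
  • collect the row number
  • collect the two plain-text values
  • avoid many of the obscure edge cases that make HTML hard to parse with a regular expression

Regular expression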

<tr\s
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)row_([0-9]+)\1(?:\s|>))
(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
[^<]*</tr>

Note: this regular expression requires the following flags: ignore whitespace, case-insensitive, and dot matches all characters.
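
The VBScript.RegExp object commonly used from VBA exposes only the `Global`, `IgnoreCase`, and `MultiLine` switches, so two of the three flags above have no direct equivalent. A possible adaptation, sketched below, joins the pattern into a single string (replacing the ignore-whitespace flag) and swaps `.*?` for `[\s\S]*?` (replacing dot-matches-all). `PrintRows`, `ATTRS`, and `CELL` are hypothetical names introduced here, and the HTML is assumed to be passed in as a string.

```vba
' Sketch only: late-bound VBScript.RegExp, so no extra reference is needed.
Public Sub PrintRows(ByVal html As String)
  ' Attribute list shared by every tag in the pattern (quotes doubled for VBA).
  Const ATTRS As String = "(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)"
  ' One table cell: opening td/p/span tags, the captured text, then up to </td>.
  ' [\s\S]*? stands in for .*? because this engine has no dot-matches-all flag.
  Const CELL As String = "(?:[^<]*<(?:td|p|span)\s" & ATTRS & "*?>)+([^<]*)[\s\S]*?</td>"

  Dim re As Object, m As Object

  Set re = CreateObject("VBScript.RegExp")
  re.Global = True        ' find every row, not just the first
  re.IgnoreCase = True    ' the case-insensitive flag
  re.Pattern = "<tr\s" & _
               "(?=" & ATTRS & "*?\sid=(['""]?)row_([0-9]+)\1(?:\s|>))" & _
               ATTRS & "*>" & CELL & CELL & "[^<]*</tr>"

  For Each m In re.Execute(html)
    ' SubMatches is zero-based: index 1 is group 2 (row number),
    ' index 2 is group 3 (first cell), index 3 is group 4 (second cell).
    Debug.Print m.SubMatches(1), m.SubMatches(2), m.SubMatches(3)
  Next
End Sub
```

The first captured cell keeps any line break that appears inside the source text, so the label may need its whitespace collapsed before it is compared with "Shipment's weight".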

Example

Given your sample text

<tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
 <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>

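The regular expression will create the following capture groups:

  • Capture group 0 gets the entire row
  • Capture group 1 gets the quote character surrounding the value of the row's id attribute
  • Capture group 2 gets the row number
  • Capture group 3 gets the first table cell value
  • Capture group 4 gets the second table cell value

And the following matches: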

[0][0] = <tr style='mso-yfti-irow:8' id="row_65">
  <td width=170 valign=top style='width:127.5pt;background:white;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
  weight<o:p></o:p></span></p>
  </td>
  <td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_65">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
  </td>
 </tr>
[0][1] = "
[0][2] = 65
[0][3] = Shipment's
  weight
[0][4] = 40120

[1][0] = <tr style='mso-yfti-irow:9' id="row_116">
  <td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
  padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
  or LBS<o:p></o:p></span></p>
  </td>
  <td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
  id="value_116">
  <p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
  </td>
 </tr>
[1][1] = "
[1][2] = 116
[1][3] = KG
  or LBS
[1][4] = LBS

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <tr                      '<tr'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    id=                      'id='
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      ['"]?                    any character of: ''', '"' (optional
                               (matching the most amount possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    row_                     'row_'
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      [0-9]+                   any character of: '0' to '9' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    \1                       what was matched by capture \1
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      td                       'td'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      p                        'p'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  </td>                    '</td>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      td                       'td'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      p                        'p'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    [^<]*                    any character except: '<' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  .*?                      any character (0 or more times (matching
                           the least amount possible))
----------------------------------------------------------------------
  </td>                    '</td>'
----------------------------------------------------------------------
  [^<]*                    any character except: '<' (0 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
  </tr>                    '</tr>'

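【Comments】:

  • Thank you very much for your feedback. I promise I will find a more efficient way to do what I want.
  • Improving efficiency is the joy of programming. Good luck with your work, and if this or another answer helped you, please mark it as accepted.
  • The RegEx and the explanation are very instructive. I would skip the Internet Explorer object if possible, though; if you have to load from a file, I would use MSXML.
  • Agreed, loading the IE object is a pain, but I think the OP was asking about parsing HTML they already have, so it is a decent example.
  • You are right. I save the Outlook email as HTML and then load the file into VBA.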