Parsing HTML in VBA
This parsing routine will not do exactly what you need, but it should point you in the right direction in VBA.
'Requires references to Microsoft Internet Controls and Microsoft HTML Object Library
Sub Extract_TD_text()
    Dim URL As String
    Dim IE As InternetExplorer
    Dim HTMLdoc As HTMLDocument
    Dim TDelements As IHTMLElementCollection
    Dim TDelement As HTMLTableCell
    Dim r As Long

    'Saved from www.vbaexpress.com/forum/forumdisplay.php?f=17
    URL = "file://C:\VBAExpress_Excel_Forum.html"

    Set IE = New InternetExplorer
    With IE
        .navigate URL
        .Visible = True
        'Wait for page to load
        While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
        Set HTMLdoc = .document
    End With

    Set TDelements = HTMLdoc.getElementsByTagName("TD")
    Sheet1.Cells.ClearContents
    r = 0
    For Each TDelement In TDelements
        'Look for required TD elements - this check is specific to VBA Express forum - modify as required
        If TDelement.className = "alt2" And TDelement.Align = "center" Then
            Sheet1.Range("A1").Offset(r, 0).Value = TDelement.innerText
            r = r + 1
        End If
    Next
End Sub
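If the HTML is already available as a string, the same DOM approach may work without automating Internet Explorer. The following is a minimal sketch, assuming a late-bound "htmlfile" document is acceptable and that the rows of interest carry ids of the form row_<number> with at least two cells. ExtractRows, Demo_ExtractRows, and the sample markup are hypothetical and not part of the routine above.

```vba
' Sketch only: late binding, so no library references are needed.
' ExtractRows and the sample markup are hypothetical names/data.
Function ExtractRows(ByVal html As String) As Collection
    Dim doc As Object
    Dim trElements As Object
    Dim tr As Object
    Dim i As Long
    Dim result As New Collection

    ' Let MSHTML parse the string into a DOM
    Set doc = CreateObject("htmlfile")
    doc.body.innerHTML = html

    Set trElements = doc.getElementsByTagName("tr")
    For i = 0 To trElements.Length - 1
        Set tr = trElements.Item(i)
        ' Keep only rows whose id starts with "row_" and that have two cells
        If Left$(tr.ID, 4) = "row_" And tr.Cells.Length >= 2 Then
            ' Store: row number, first cell text, second cell text
            result.Add Array(Mid$(tr.ID, 5), _
                             tr.Cells.Item(0).innerText, _
                             tr.Cells.Item(1).innerText)
        End If
    Next i

    Set ExtractRows = result
End Function

Sub Demo_ExtractRows()
    Dim rowData As Variant
    For Each rowData In ExtractRows( _
        "<table><tr id=""row_65""><td>Shipment's weight</td><td>40120</td></tr></table>")
        Debug.Print rowData(0), rowData(1), rowData(2)
    Next rowData
End Sub
```

Because the DOM parser handles quoting and nesting itself, this route should avoid most of the edge cases that a text-based approach has to guard against.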
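Doing it with a regular expression
Using a regular expression to parse HTML is not recommended because of all the obscure edge cases that can crop up. However, it seems you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.
Description
This regular expression will do the following:
- parse the sample text into individual rows
- collect the row number
- collect the two plain-text values
- avoid many of the obscure edge cases that make parsing HTML with a regular expression difficult
Regular expression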
<tr\s
(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sid=(['"]?)row_([0-9]+)\1(?:\s|>))
(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
(?:[^<]*<(?:td|p|span)\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>)+([^<]*).*?</td>
[^<]*</tr>
Note: for this regular expression you need to use the following flags: ignore whitespace, case insensitive, and dot matches all characters.
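The VBScript.RegExp object commonly used from VBA exposes IgnoreCase, Global, and MultiLine, but has no ignore-whitespace or dot-matches-all option. One way to adapt the pattern, sketched here under that assumption, is to join it into a single line and replace each `.` with `[\s\S]`. BuildRowRegex is a hypothetical helper name, and the attr and cell variables exist only to avoid repeating the attribute-skipping fragment.

```vba
' Sketch only: builds the pattern above for VBScript.RegExp (late bound).
' BuildRowRegex is a hypothetical helper, not part of the original answer.
Function BuildRowRegex() As Object
    Dim attr As String
    Dim cell As String
    Dim re As Object

    ' One attribute character or one complete attribute value
    ' (double quotes are doubled inside a VBA string literal)
    attr = "(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)"

    ' Opening td/p/span tags, the captured text, then everything up to </td>.
    ' [\s\S] stands in for "." with the dot-matches-all flag.
    cell = "(?:[^<]*<(?:td|p|span)\s" & attr & "*?>)+([^<]*)[\s\S]*?</td>"

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True        ' find every row, not just the first
    re.IgnoreCase = True    ' the case-insensitive flag
    re.Pattern = "<tr\s" & _
                 "(?=" & attr & "*?\sid=(['""]?)row_([0-9]+)\1(?:\s|>))" & _
                 attr & "*>" & _
                 cell & cell & _
                 "[^<]*</tr>"

    Set BuildRowRegex = re
End Function
```

Calling `BuildRowRegex().Execute(htmlText)` would then return a collection of matches, one per row, where htmlText is whatever string holds your HTML.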
Example
Given your sample text
<tr style='mso-yfti-irow:8' id="row_65">
<td width=170 valign=top style='width:127.5pt;background:white;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
<tr style='mso-yfti-irow:9' id="row_116">
<td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
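The regular expression will create the following capture groups
- capture group 0 gets the entire row
- capture group 1 gets the quote character surrounding the row number in the row's id attribute
- capture group 2 gets the row number
- capture group 3 gets the first table cell value
- capture group 4 gets the second table cell value
and yields the following matches: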
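When reading these groups from VBA through VBScript.RegExp, note that the SubMatches collection is zero-based: capture group 0 is Match.Value, and capture group n is SubMatches(n - 1). The sketch below illustrates the offset using only the id fragment of the pattern. The procedure name and the input string are made up for illustration.

```vba
' Sketch only: shows how capture group numbers map onto SubMatches indexes.
Sub Show_SubMatch_Indexes()
    Dim re As Object
    Dim m As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Only the id fragment of the full pattern, to keep the sketch short
    re.Pattern = "\sid=(['""]?)row_([0-9]+)\1(?:\s|>)"

    ' Hypothetical input: one quoted id and one unquoted id
    For Each m In re.Execute("<tr style='x' id=""row_65""><tr id=row_116>")
        ' m.Value        -> capture group 0 (the whole match)
        ' m.SubMatches(0) -> capture group 1 (the quote character, may be empty)
        ' m.SubMatches(1) -> capture group 2 (the row number)
        Debug.Print "row " & m.SubMatches(1) & ", quote [" & m.SubMatches(0) & "]"
    Next m
End Sub
```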
[0][0] = <tr style='mso-yfti-irow:8' id="row_65">
<td width=170 valign=top style='width:127.5pt;background:white;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>Shipment's
weight<o:p></o:p></span></p>
</td>
<td style='background:white;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_65">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>40120<o:p></o:p></span></p>
</td>
</tr>
[0][1] = "
[0][2] = 65
[0][3] = Shipment's
weight
[0][4] = 40120
[1][0] = <tr style='mso-yfti-irow:9' id="row_116">
<td width=170 valign=top style='width:127.5pt;background:#F3F3F3;
padding:3.75pt 3.75pt 3.75pt 3.75pt' id="question_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>KG
or LBS<o:p></o:p></span></p>
</td>
<td style='background:#F3F3F3;padding:3.75pt 3.75pt 3.75pt 3.75pt'
id="value_116">
<p class=MsoNormal><span style='mso-fareast-font-family:"Times New Roman"'>LBS<o:p></o:p></span></p>
</td>
</tr>
[1][1] = "
[1][2] = 116
[1][3] = KG
or LBS
[1][4] = LBS
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
<tr '<tr'
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
id= 'id='
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
row_ 'row_'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
[0-9]+ any character of: '0' to '9' (1 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
td 'td'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
p 'p'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
</td> '</td>'
----------------------------------------------------------------------
(?: group, but do not capture (1 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
td 'td'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
p 'p'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
)+ end of grouping
----------------------------------------------------------------------
( group and capture to \4:
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \4
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
</td> '</td>'
----------------------------------------------------------------------
[^<]* any character except: '<' (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
</tr> '</tr>'