【Posted】: 2014-08-15 12:35:49
【Question】:
I have an HTML file like this:
<html>
<head>
<title>Page Name in a Folder</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<meta name="apple-mobile-web-app-capable" content="yes"/>
<link href="resources/css/jquery-ui-themes.css" type="text/css" rel="stylesheet"/>
<link href="resources/css/axure_rp_page.css" type="text/css" rel="stylesheet"/>
<link href="data/styles.css" type="text/css" rel="stylesheet"/>
<link href="files/page_name_in_a_folder/styles.css" type="text/css" rel="stylesheet"/>
</head>
<body>
<div id="base" class="">
<!-- Image Shape Name (Image) -->
<div id="u0" class="ax_image" data-label="Image Shape Name">
<img id="u0_img" class="img " src="images/page_name_not_in_a_folder/u0.png"/>
<!-- Unnamed () -->
<div id="u1" class="text">
<p><span> </span></p>
</div>
</div>
<!-- Heading 1 Shape Name (Shape) -->
<div id="u2" class="ax_h1" data-label="Heading 1 Shape Name">
<img id="u2_img" class="img " src="resources/images/transparent.gif"/>
<!-- Unnamed () -->
<div id="u3" class="text">
<p><span>Heading 1</span></p>
</div>
</div>
<!-- Heading 2 Shape Name (Shape) -->
<div id="u4" class="ax_h2" data-label="Heading 2 Shape Name">
<img id="u4_img" class="img " src="resources/images/transparent.gif"/>
<!-- Unnamed () -->
<div id="u5" class="text">
<p><span>Heading 2</span></p>
</div>
</div>
<!-- Label Shape Name (Shape) -->
<div id="u6" class="ax_paragraph" data-label="Label Shape Name">
<img id="u6_img" class="img " src="resources/images/transparent.gif"/>
<!-- Unnamed () -->
<div id="u7" class="text">
<p><span>Label</span></p>
</div>
</div>
<!-- Unnamed (HTML Button) -->
<div id="u26" class="ax_html_button">
<input id="u26_input" type="submit" value="Submit"/>
</div>
</div>
</body>
</html>
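I need to extract all DIVs together with their classes and attributes, for example:
- class name (ax_html_button): extract the button's value = "Submit"
- class name (ax_paragraph): extract the data-label value = "Label Shape Name"
and so on.
I tried using HtmlAgilityPack: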
Public Shared Sub parseAgility(fName As String)
Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
htmlDoc.OptionFixNestedTags = True
htmlDoc.Load(fName)
Dim classes As New List(Of String)()
For Each node As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//body//div")
classes.Add(node.InnerHtml)
Next
End Sub
But I'm not sure how to handle all the attributes. Any ideas?
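One possible way to read every attribute of each div is to walk the node's `Attributes` collection, which HtmlAgilityPack exposes on `HtmlNode`. The following is only a minimal sketch: the module name `AttributeDump`, the helper `ReadDivAttributes`, and the inline sample markup are assumptions made for illustration and are not part of the original code.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module AttributeDump
    ' Hypothetical helper: returns one name -> value dictionary per <div> under <body>.
    Public Function ReadDivAttributes(html As String) As List(Of Dictionary(Of String, String))
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim result As New List(Of Dictionary(Of String, String))()
        Dim divs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//body//div")
        ' SelectNodes returns Nothing when the XPath matches no nodes.
        If divs Is Nothing Then Return result

        For Each div As HtmlNode In divs
            Dim attrs As New Dictionary(Of String, String)()
            ' Copy every attribute (id, class, data-label, ...) of this div.
            For Each attr As HtmlAttribute In div.Attributes
                attrs(attr.Name) = attr.Value
            Next
            result.Add(attrs)
        Next
        Return result
    End Function

    Sub Main()
        ' Sample markup taken from one div of the page; shortened for the example.
        Dim html As String = "<html><body><div id=""u6"" class=""ax_paragraph"" data-label=""Label Shape Name""></div></body></html>"
        For Each attrs As Dictionary(Of String, String) In ReadDivAttributes(html)
            For Each pair As KeyValuePair(Of String, String) In attrs
                Console.WriteLine(pair.Key & " = " & pair.Value)
            Next
        Next
    End Sub
End Module
```

For a single known attribute, `node.GetAttributeValue("data-label", "")` is enough and avoids the loop.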
And how do I get the value of the input element ("Submit")?
<div id="u26" class="ax_html_button">
<input id="u26_input" type="submit" value="Submit"/>
</div>
If I use the code below, I get the value of the "u16_input" element instead of "u26"!?
For Each node As HtmlNode In htmlDoc.DocumentNode.SelectNodes("//body//div")
Dim className = node.GetAttributeValue("class", "")
Select Case className
Case "ax_html_button"
Dim node2 As HtmlNode = node.SelectSingleNode("//input")
value= node2.GetAttributeValue("value", "")
Case "ax_paragraph"
Case "ax_h1"
Case "ax_h2"
Case "ax_h3"
Case "ax_h4"
Case "ax_h5"
Case "ax_h6"
Case "ax_checkbox"
End Select
Next
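A likely cause of getting the wrong input is the XPath `//input`: an expression that starts with `//` is evaluated from the document root even when it is called on a child node, so it returns the first `input` in the whole document. Prefixing it with a dot (`.//input`) restricts the search to descendants of the current div. The sketch below illustrates this under that assumption; the module name `ButtonValue`, the helper `ReadButtonValue`, and the two-button sample markup are hypothetical.

```vbnet
Imports System
Imports HtmlAgilityPack

Module ButtonValue
    ' Hypothetical helper: returns the value of the <input> inside the div with the given id.
    Public Function ReadButtonValue(html As String, divId As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim div As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@id='" & divId & "']")
        If div Is Nothing Then Return ""

        ' ".//input" searches only below this div;
        ' "//input" would restart the search at the document root.
        Dim inputNode As HtmlNode = div.SelectSingleNode(".//input")
        If inputNode Is Nothing Then Return ""

        Return inputNode.GetAttributeValue("value", "")
    End Function

    Sub Main()
        ' Assumed sample: two buttons, so the difference between the two XPaths is visible.
        Dim html As String =
            "<html><body>" &
            "<div id=""u16"" class=""ax_html_button""><input id=""u16_input"" type=""submit"" value=""First""/></div>" &
            "<div id=""u26"" class=""ax_html_button""><input id=""u26_input"" type=""submit"" value=""Submit""/></div>" &
            "</body></html>"
        Console.WriteLine(ReadButtonValue(html, "u26"))
    End Sub
End Module
```

In the original loop, the same change would amount to calling `node.SelectSingleNode(".//input")` inside the `Case "ax_html_button"` branch.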
Edit: found a solution.
【Comments】:
- Take a look at CsQuery; in my opinion it is a better alternative to HtmlAgilityPack.
Tags: vb.net parsing html-agility-pack