【发布时间】:2011-06-04 05:08:16
【问题描述】:
我对 HTML Agility Pack 还很陌生,所以我需要一些帮助来了解下一步该去哪里。我可以做一些简单的事情,比如从 href 中提取一个值(知道我正在寻找的 url 字符串),我可以根据正在使用的特定类在跨度中提取值。但是我不明白如何在有大量或标签的情况下使用 HTML Agility Pack,而这些标签不是一个真正的可靠锚点?
这是我正在浏览的一段实际代码。我在单元格中放置了虚拟数据以展示我在寻找什么。
提取以下内容的最佳方法是什么:
1.) 公司名称?
2.) 电话号码?
3.) 电子邮件地址?
HTML....
<td>
<!-- Company Info -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>COMPANY NAME</th>
</tr>
<tr>
<td class="search">
<table cellpadding="5" cellspacing="0" border="0" width="100%">
<tr>
<td>
<table cellpadding="1" cellspacing="0" border="0" width="100%">
<tr>
<td colspan="2" align="center">Un-needed Links...</td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap>
<b>
<font color="FF0000">
Contact Person
<img src="/images/icon_contact.gif" align="absmiddle"> :
</font>
</b>
</td>
<td align="left" width="100%"> Judy Smith</td>
</tr>
<tr>
<td align="right" nowrap>
<b><font color="FF0000">Phone Number <img src="/images/icon_phone.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">E-mail Address <img src="/images/icon_email.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> <a HREF="mailto:judy.smith@companyname.com">judy.smith@companyname.com</a></td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Location <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> ATLANTA, GA</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Phone <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 555-555-5555</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Home Office Fax <img src="/images/icon_home.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 666-666-6666</td>
</tr>
<tr>
<td align="center" colspan="2"><hr></td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Broker MC Number <img src="/images/icon_number.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 123456</td>
</tr>
<tr>
<td align="right" nowrap><b><font color="FF0000">Carrier MC Number <img src="/images/icon_number.gif" align="absmiddle"> :</font></b></td>
<td align="left" width="100%"> 654321</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Starting Point -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Starting Point</th>
<th>Available</th>
</tr>
<tr>
<td class="search" width="270"> <b>ABBEVILLE, GA </b></td>
<td class="search" align="center" width="100"><span style="color: forestgreen"> 1/5/11 </span></td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Destination Point -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Destination Point</th>
<th>Direction</th>
</tr>
<tr>
<td class="search" width="270"> <b>ATLANTA, GA </b></td>
<td class="search" align="center" width="100"><span style="color: FF0000"> </span></td>
</tr>
</table>
</td>
</tr>
</table>
<br>
<!-- Truck Details -->
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="black">
<table cellspacing="1" cellpadding="0" border="0" width="370">
<tr>
<th>Truck Details</th>
</tr>
<tr>
<td class="search">
<table cellpadding="5" cellspacing="0" border="0">
<tr>
<td>
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td align="right"><b>Date Posted :</b></td>
<td align="left"> 1/5/2011 10:34:48 AM</td>
</tr>
<tr>
<td align="right"><b>Quantity :</b></td>
<td align="left"> 1</td>
</tr>
<tr>
<td align="right"><b>Equipment Type :</b></td>
<td align="left"> FT</td>
</tr>
<tr>
<td align="right"><b>Load Size :</b></td>
<td align="left"> Full</td>
</tr>
<tr>
<td align="right" valign="top"><b>Special Information :</b></td>
<td align="left"> </td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
</td>
</tr>
</table>
<br>
</td>
....更多 HTML
【问题讨论】:
标签: c# screen-scraping html-agility-pack