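【Question title】: How to extract data from an HTML table?
【Posted】: 2013-01-14 16:14:06
【Question description】:

I recently downloaded HtmlAgilityPack, but I have not found any real instructions on how to use it. I have tried to piece some code together from various discussion board posts and other sources. Here is what I have so far: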

Private Sub Button3_Click(ByVal sender As System.Object, ByVal e As System.EventArgs)
    Dim document As New HtmlAgilityPack.HtmlDocument
    document.LoadHtml("www.reuters.com/finance/stocks/overview?symbol=GOOG")

    Dim tabletag = document.DocumentNode.SelectSingleNode("//table[@class='data']/tr[1]/td[2]")
End Sub

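As you can see, I am working with the HTML from www.reuters.com/finance/stocks/overview?symbol=GOOG.

I am trying to extract the Beta value from this page. That value is currently 1.04.

When I run the code above, the Immediate window shows the following, repeated 100 times: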

1.04
$243,156.41
328.59
--
--
Trading Report for (GOOG). A detailed report, including free correlated market analysis, and updates.
ValuEngine Detailed Valuation Report for GOOG
GOOGLE INC CL A (GOOG)  12-months forecast
GOOGLE INC CL A (GOOG)  2-weeks forecast
Google Inc: Business description, financial summary, 3yr and interim financials, key statistics/ratios and historical ratio analysis.

I only want to return the first number (1.04). What am I doing wrong? Any suggestions?

【Question comments】:

    Tags: vb.net html-agility-pack


    【Solution 1】:

    You need to use cookies and a proxy. The following works well for me. Let me know what you think:

    Imports System.Net
    Imports System.Web
    
    Public Class Form1
    
        Public cookies As New CookieContainer
    
        Private Sub Button1_Click(sender As System.Object, e As System.EventArgs) Handles Button1.Click
    
    
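            ' WebRequest.Create returns the base WebRequest type, so cast explicitly (required under Option Strict On)
            Dim wreq As HttpWebRequest = DirectCast(WebRequest.Create("http://www.reuters.com/finance/stocks/overview?symbol=GOOG"), HttpWebRequest)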
    
            wreq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
    
            wreq.Method = "GET"
    
            Dim prox As IWebProxy = wreq.Proxy
    
            prox.Credentials = CredentialCache.DefaultCredentials
    
    
            Dim document As New HtmlAgilityPack.HtmlDocument
            Dim web As New HtmlAgilityPack.HtmlWeb
    
            web.UseCookies = True
            web.PreRequest = New HtmlAgilityPack.HtmlWeb.PreRequestHandler(AddressOf onPreReq)
    
            wreq.CookieContainer = cookies
    
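            ' Dispose the response once the document has been parsed so the connection is released
            Using res As HttpWebResponse = DirectCast(wreq.GetResponse(), HttpWebResponse)
                document.Load(res.GetResponseStream(), True)
            End Using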
    
            'just for testing:
            '   Dim tabletag = document.DocumentNode.SelectNodes("//table")
            '   MsgBox(tabletag.Count.ToString)
    
            'returns your field
            Dim tabletag2 = document.DocumentNode.SelectSingleNode("//td[@class='data']")
            MsgBox(tabletag2.InnerText)
    
        End Sub
    
        Private Function onPreReq(req As HttpWebRequest) As Boolean
    
            req.CookieContainer = cookies
            Return True
    
        End Function
    End Class
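
One detail worth spelling out: `HtmlDocument.LoadHtml` parses a string of markup that is already in memory and does not download anything, which is why the answer above fetches the page first and then hands the response stream to `document.Load`. The following is only a minimal sketch of the parsing step on its own. The `SampleHtml` markup and the `ExtractText` helper are hypothetical and made up for illustration; the real page's structure may differ.

```vbnet
Imports System

Module BetaExample

    ' Hypothetical sample markup standing in for the downloaded page.
    Public Const SampleHtml As String = _
        "<table class='dataTable'>" & _
        "<tr><td>Beta:</td><td class='data'>1.04</td></tr>" & _
        "<tr><td>Market Cap:</td><td class='data'>$243,156.41</td></tr>" & _
        "</table>"

    ' Hypothetical helper: parses in-memory HTML and returns the trimmed text
    ' of the first node matching the XPath, or Nothing if no node matches.
    Public Function ExtractText(html As String, xpath As String) As String
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html) ' parses the string; it does not fetch a URL
        Dim node As HtmlAgilityPack.HtmlNode = doc.DocumentNode.SelectSingleNode(xpath)
        If node Is Nothing Then Return Nothing
        Return node.InnerText.Trim()
    End Function

    Sub Main()
        ' SelectSingleNode returns only the first match in document order.
        Console.WriteLine(ExtractText(SampleHtml, "//td[@class='data']")) ' prints 1.04
    End Sub

End Module
```

Checking for `Nothing` before reading `InnerText` avoids a `NullReferenceException` when the XPath matches no node, for example if the page layout changes.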
    

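    【Discussion】:

    • It works! I still have to finish integrating it fully into my program, but this looks great! Thanks!
    • What if I want to extract the third row of that table? (328.59)
    • I think this would do it: Dim tabletag2 = document.DocumentNode.SelectSingleNode("//div[@id='overallRatios']//table[@class='dataTable']//tr[3]//td[2]")
    • Do I need to close anything before running it against another URL?
    • You don't need to. If I were you, I would create a function that returns the value you need so that it can be reused, and then call something like: Dim test As String = GetRemoteData(url, field)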
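
The reusable function suggested in the last comment might look roughly like the sketch below. The thread only gives the call shape `GetRemoteData(url, field)`; treating `field` as an XPath expression, creating a fresh `CookieContainer` for each call, and returning `Nothing` when nothing matches are all assumptions made for illustration.

```vbnet
Imports System.IO
Imports System.Net

Module RemoteData

    ' Hypothetical helper based on the call shape GetRemoteData(url, field).
    ' Assumption: "field" is an XPath expression selecting the wanted node.
    Public Function GetRemoteData(url As String, field As String) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Method = "GET"
        ' Same user agent string as in the answer above.
        request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
        request.CookieContainer = New CookieContainer()

        Dim document As New HtmlAgilityPack.HtmlDocument()

        ' The Using blocks dispose the response and its stream after parsing,
        ' so nothing has to be closed by hand before requesting another URL.
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using body As Stream = response.GetResponseStream()
                document.Load(body, True)
            End Using
        End Using

        Dim node As HtmlAgilityPack.HtmlNode = document.DocumentNode.SelectSingleNode(field)
        If node Is Nothing Then Return Nothing
        Return node.InnerText.Trim()
    End Function

End Module
```

With a helper like this, the Beta value would be read with `GetRemoteData("http://www.reuters.com/finance/stocks/overview?symbol=GOOG", "//td[@class='data']")`, and the third row with the longer XPath from the comment above, provided the page still has the structure described in this thread.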