HtmlAgilityPack doesn't get XPath in C#: answers

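【Question title】: HtmlAgilityPack doesn't get XPath in C#
【Posted】: 2015-02-04 17:02:53
【Question description】:

Previously I used this code and it could get XPath results from the site. But today, when I debugged the code, I saw that it no longer gets the HTML from the site webtruyen.com. I checked website.com/robots.txt, but nothing there disallows it. I tried adding a proxy to fetch the data, but the returned data is null. I don't know how to get XPath results from webtruyen.com. Can anyone help me? I want to know how to read data from the site http://webtruyen.com.

My code: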

string url = "http://webtruyen.com";
var web = new HtmlWeb();
var doc = web.Load(url);
String temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
     temps  = node.InnerHtml;
}

When I debug, it returns:

InnerHtml: 'doc.DocumentNode.InnerHtml' threw an exception of type 'System.NullReferenceException'    string {System.NullReferenceException}

My code using a proxy (it sets a custom User-Agent):

string url = "http://webtruyen.com";
var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Speedy Spider (http://www.entireweb.com/about/search_tech/speedy_spider/)";
var doc = web.Load(url);
String temps = "";
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
     temps  = node.InnerHtml;
}

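【Comments】:

Possible duplicate of HtmlAgilityPack HtmlWeb.Load returning empty Document
Maybe you need to enable cookies; see the linked question.
@Jodrell I tried using cookies, but it still doesn't get the HTML. Can you provide code for my problem?
Try node.Attributes["href"].Value
@SuncoastOwner Thanks. But the error occurs at var doc = web.Load(url); and doc never receives a value. In the debugger I see: Id: 'doc.DocumentNode.Id' threw an exception of type 'System.Exception'    string {System.Exception}. You will see the error yourself: running this code does not load the URL.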

Tags: c# xpath html-agility-pack

【Solution 1】:

I got the same error when using HtmlWeb.Load(), but your problem is easy to solve with HttpWebRequest (TL;DR: see step 3 for working code).

Step 1) Use the following code:

        HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
        using (Stream s = hwr.GetResponse().GetResponseStream())
        { }

You will see that you actually receive a 403 Forbidden error (thrown as a WebException).
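To confirm that the failure really is an HTTP 403 rather than, say, a connection problem, the status can be read from the exception itself. The following is a minimal sketch; the `WebFailure` class and `Describe` method are hypothetical names introduced here for illustration, not part of the original answer.

```csharp
using System;
using System.Net;

public static class WebFailure
{
    // Summarizes a WebException: the HTTP status if the server replied,
    // otherwise the transport-level status.
    public static string Describe(WebException wx)
    {
        // Response is null when the failure happened before any HTTP reply
        // arrived (DNS failure, refused connection, timeout, ...).
        if (wx.Response is HttpWebResponse response)
        {
            return $"HTTP {(int)response.StatusCode} {response.StatusDescription}";
        }
        return $"No HTTP response ({wx.Status})";
    }
}
```

Calling `WebFailure.Describe(wx)` inside a `catch (WebException wx)` block around `hwr.GetResponse()` should then report the 403 described above, assuming the site still responds the same way.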

Step 2)

        HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
        HtmlDocument doc = new HtmlDocument();
        try
        {
            // GetResponse() throws a WebException for the 403 reply.
            using (Stream s = hwr.GetResponse().GetResponseStream())
            { }
        }
        catch (WebException wx)
        {
            // The error response still has a body; load it to inspect it.
            doc.LoadHtml(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd());
        }

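In doc.DocumentNode.OuterHtml you will see the HTML of the Forbidden error page, which contains JavaScript that sets a cookie in your browser and then refreshes the page.

Step 3) So, to load the page outside of a real browser, you have to set that cookie manually and request the page again. That is: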

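        // Needs: using System.IO; using System.Net; using System.Text.RegularExpressions; using HtmlAgilityPack;
        string cookie = string.Empty;
        HttpWebRequest hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
        try
        {
            // First request: expected to fail with 403 and return the cookie-setting page.
            using (Stream s = hwr.GetResponse().GetResponseStream())
            { }
        }
        catch (WebException wx)
        {
            // Pull the cookie string out of the script in the error page.
            cookie = Regex.Match(new StreamReader(wx.Response.GetResponseStream()).ReadToEnd(), @"document\.cookie = '(.*?)';").Groups[1].Value;
        }
        // Second request: send the cookie the script would have set.
        hwr = (HttpWebRequest)WebRequest.Create("http://webtruyen.com");
        hwr.Headers.Add("Cookie", cookie);
        HtmlDocument doc = new HtmlDocument();
        using (Stream s = hwr.GetResponse().GetResponseStream())
        using (StreamReader sr = new StreamReader(s))
        {
            doc.LoadHtml(sr.ReadToEnd());
        }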

And you get the page :)
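The cookie-extraction step can also be pulled into a small helper so it can be checked without touching the network. This is only a sketch under assumptions: the `ChallengeCookie` class is a hypothetical name, and it assumes the error page assigns the cookie with a quoted string in the form `document.cookie = '...'`. Unlike the inline regex above, it keeps only the first `name=value` pair and drops attributes such as `path`, which is a design choice of this sketch rather than something the original answer does.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ChallengeCookie
{
    // Matches document.cookie = '...' or document.cookie = "..." with optional spaces.
    private static readonly Regex CookieAssignment = new Regex(
        @"document\.cookie\s*=\s*(['""])(?<value>.*?)\1",
        RegexOptions.Singleline);

    // Returns the first name=value pair assigned to document.cookie,
    // or an empty string if the page contains no such assignment.
    public static string Extract(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        Match m = CookieAssignment.Match(html);
        if (!m.Success) return string.Empty;

        // Drop attributes such as "; path=/" that do not belong in a Cookie request header.
        string raw = m.Groups["value"].Value;
        int semicolon = raw.IndexOf(';');
        string pair = semicolon >= 0 ? raw.Substring(0, semicolon) : raw;
        return pair.Trim();
    }
}
```

In the catch block, the body read from `wx.Response` would be passed to `ChallengeCookie.Extract(...)` and the result sent with `hwr.Headers.Add("Cookie", ...)` on the second request. An empty result means the page did not match the assumed form and should be inspected by hand.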

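Moral of the story: if your browser can do it, so can you.

【Discussion】:

Thank you very much. You solved my problem. I hope I can understand this as well as you do.
No problem @toan! :) Glad to help. Now mark it as the answer.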