Programmatically save the innerText of webpages from a website

【Question title】: Programmatically save the innerText of webpages from a website
【Posted】: 2012-02-25 17:13:18
【Question】:

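I'm taking online courses through Google's Apex system and would like to automatically save the data from some of the pages. When browsing normally, logging in and getting to the content goes like this: open the web app and log in, navigate to the course I want to view, and click the lesson. Clicking the lesson I want to study opens a new window containing the lesson. This is the part I can't do programmatically.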

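The first approach I thought of was to use PHP to request the pages and simply save them. The problem is that there is a login, plus some JavaScript events and other things that I don't know how to automate with PHP. I've managed to log in with a POST request, but I can't figure out the rest.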

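Today I tried doing it in C# with the .NET WebBrowser control in a Windows Forms application. I got it to log me in and navigate to the page where I need to pick the lesson to open, but when I click the link, it tries to open the page in Internet Explorer. If I use the link it opens, I get an error message from the website.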
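For the saving step itself, one option with the WebBrowser control is to read `Document.Body.InnerText` in a `DocumentCompleted` handler and write it to a file. The following is only a minimal sketch: the URL and the output file name are placeholders, and it assumes the page can be loaded in the control's current session (it does not perform the login).

```csharp
using System;
using System.IO;
using System.Windows.Forms;

static class SaveInnerText
{
    [STAThread] // the WebBrowser control requires a single-threaded apartment
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // DocumentCompleted also fires once per frame, so wait for the top-level page.
            if (e.Url != browser.Url)
            {
                return;
            }

            HtmlElement body = browser.Document != null ? browser.Document.Body : null;
            string text = body != null ? (body.InnerText ?? "") : "";

            // "lesson.txt" is a placeholder output path.
            File.WriteAllText("lesson.txt", text);
            Application.ExitThread();
        };

        // Placeholder URL; replace with the page to save.
        browser.Navigate("https://example.com/course/lesson1");
        Application.Run(); // message loop so the control can load the page
    }
}
```

In an existing Windows Forms application the handler can be attached to the form's own WebBrowser control instead of creating a separate one.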

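Inspecting the link on the page I'm having trouble with, I found the JavaScript event that opens the new window. It opens the window using a redirect link. Using this redirect link in a new tab rather than a new window works in Chrome, but I don't know how to get the redirect link from C#. The a element is inside an iframe, and that is where I have to get the link from. How can I, in C#, retrieve an element from within an iframe?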
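With the Windows Forms WebBrowser control, the frames of a loaded page are reachable through `Document.Window.Frames`, and each frame exposes its own `Document`. A rough sketch of collecting the anchors inside every frame follows; the class and method names are made up for illustration, and it assumes the frames are served from the same domain as the page, since reading a cross-domain frame's document throws `UnauthorizedAccessException`.

```csharp
using System;
using System.Collections.Generic;
using System.Windows.Forms;

static class FrameLinks
{
    // Returns the outer HTML of every <a> element found inside the frames
    // of the page currently loaded in the control. Call it after
    // DocumentCompleted has fired for the page.
    public static List<string> GetAnchorHtml(WebBrowser browser)
    {
        var result = new List<string>();

        HtmlDocument page = browser.Document;
        if (page == null || page.Window == null || page.Window.Frames == null)
        {
            return result;
        }

        foreach (HtmlWindow frame in page.Window.Frames)
        {
            try
            {
                foreach (HtmlElement anchor in frame.Document.GetElementsByTagName("a"))
                {
                    // OuterHtml includes the href and any inline onclick handler.
                    result.Add(anchor.OuterHtml);
                }
            }
            catch (UnauthorizedAccessException)
            {
                // The frame comes from another domain and cannot be read this way.
            }
        }

        return result;
    }
}
```

The redirect link would then have to be picked out of the returned markup (the `href` attribute or the script that opens the window) and passed to `browser.Navigate(...)` so that it loads inside the control rather than in a new Internet Explorer window.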

Also, is there a better way to do this?

【Comments】:

Tags: c# .net html web

【Solution 1】:

Use the WebClient class to fetch the HTML at a URL.

Example 1:

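// Requires: using System.Net;
string htmlTd;

using (WebClient client = new WebClient())
{
    // Some servers reject requests without a browser-like User-Agent.
    // With HttpWebRequest the equivalent would be:
    // request.UserAgent = "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)";
    client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";

    // myurl is the address of the page to download.
    htmlTd = client.DownloadString(myurl);
}

GetImagesInHTMLString(htmlTd);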

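// Gets the images from the page. My modifications have broken it for now and I'm still working on it, but it may help you get where you want to go.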

// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
private void GetImagesInHTMLString(string htmlString)
{
    List<string> images = new List<string>();

    // Match every <img ...> tag, then pull the value out of its src attribute.
    Regex imgTag = new Regex(@"<img\b[^>]*>", RegexOptions.IgnoreCase);
    Regex srcAttr = new Regex(@"\bsrc\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase);
    Uri baseUri = new Uri(myurl);

    foreach (Match tag in imgTag.Matches(htmlString))
    {
        Match src = srcAttr.Match(tag.Value);
        if (!src.Success)
        {
            continue; // <img> tag without a quoted src attribute
        }

        // Resolve relative image paths against the page URL (myurl).
        Uri absolute;
        if (Uri.TryCreate(baseUri, src.Groups[1].Value, out absolute))
        {
            images.Add(absolute.AbsoluteUri);
        }
    }

    // Response is the ASP.NET page's HttpResponse.
    foreach (var item in images)
    {
        Response.Write(item);
    }
}

Example from the linked WebClient class documentation:

// Requires: using System; using System.IO; using System.Net;
using (WebClient client = new WebClient())
{
    // Add a user agent header in case the
    // requested URI contains a query.
    client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

    // URl holds the address to read; the using blocks close the
    // reader and the underlying stream even if an exception is thrown.
    using (Stream data = client.OpenRead(URl))
    using (StreamReader reader = new StreamReader(data))
    {
        string s = reader.ReadToEnd();
        Console.WriteLine(s);
    }
}

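【Comments】:

The problem isn't getting the HTML from the web page; it's getting the PHP session variables set correctly on the server so that I can view and download the pages. If I try this with the URL, I get an error about not being logged in.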
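A plain WebClient does not carry cookies from one request to the next, so the session cookie set by the login is lost before the page request. A common workaround is to subclass WebClient and attach a shared `CookieContainer` to every request. The sketch below is only an illustration: the class name, the URLs, and the form field names are hypothetical, and it assumes the site tracks the login with an ordinary session cookie and needs no JavaScript to complete it.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;

// A WebClient that keeps cookies (for example a PHP session cookie) between requests.
class CookieAwareWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest http = request as HttpWebRequest;
        if (http != null)
        {
            // Send stored cookies and collect new ones on every HTTP request.
            http.CookieContainer = Cookies;
        }
        return request;
    }
}

static class Program
{
    static void Main()
    {
        using (var client = new CookieAwareWebClient())
        {
            // Hypothetical login form fields and URLs; use the real ones from the site.
            var form = new NameValueCollection
            {
                { "username", "me" },
                { "password", "secret" }
            };
            client.UploadValues("https://example.com/login.php", "POST", form);

            // The session cookie from the login is sent along with this request.
            string html = client.DownloadString("https://example.com/course/lesson1");
            Console.WriteLine(html.Length);
        }
    }
}
```

If the server also depends on values set by JavaScript, this approach alone will not be enough, and driving a real browser control may still be necessary.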