Weird problem accessing web page with Java: answers

【Question title】: Weird problem accessing web page with Java
【Posted】: 2011-04-23 19:41:05
【Question】:

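I'm trying to write a program that reads the HTML source of the website http://judgephilosophies.wikispaces.com. I wrote some simple Java code to read and print the source, but it just prints "null". The weird part is that if I replace "http://judgephilosophies.wikispaces.com" in the code with any other website, it works fine. The program seems to fail only for sites in the wikispaces.com domain, and I have no idea why. The code is below. Any help is much appreciated.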

import java.io.*;
import java.net.*;

public class AccessWebExample 
{
    public static void main (String[] args) throws Exception
    {
        //Create reader to access html source code
        URL url = new URL ("http://judgephilosophies.wikispaces.com/");
        InputStreamReader isr = new InputStreamReader (url.openStream());
        BufferedReader reader = new BufferedReader (isr);

        //Read and print the text
        do
        { 
            System.out.println(reader.readLine());
        }
        while(reader.readLine() != null);
    }
}

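【Comments】:

How does it not work? - If the site is an Ajax site, this won't work. Your program only fetches the HTML from the site.
@Romain - No, the server redirects. See the answer below.

Tags: java html-parsing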

【Solution 1】:

Take an HTTP trace with Wireshark or a similar tool and compare the two. If a bare URLConnection behaves differently from a browser, the cause is probably cookies or headers.
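If a packet capture is not convenient, a rough first look at the server's raw response is also possible from Java itself. The following is a minimal sketch, not a replacement for a real trace: the class name `HeaderDump` and the helper `format` are made up for illustration. It disables automatic redirects so that the first response is shown as-is. In the JDK's implementation the status line typically appears under a `null` key in the header map.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class HeaderDump {
    // Formats the map returned by URLConnection.getHeaderFields(), one header per line.
    // A null key (usually the status line) is shown as "(status)".
    static String format(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String name = (e.getKey() == null) ? "(status)" : e.getKey();
            sb.append(name).append(": ")
              .append(String.join(", ", e.getValue()))
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String target = args.length > 0 ? args[0] : "http://judgephilosophies.wikispaces.com/";
        HttpURLConnection conn = (HttpURLConnection) new URL(target).openConnection();
        conn.setInstanceFollowRedirects(false); // show the first response, not the final one
        System.out.println("Response code: " + conn.getResponseCode());
        System.out.print(format(conn.getHeaderFields()));
        conn.disconnect();
    }
}
```

Comparing this output (status code, `Location`, `Set-Cookie`) with what the browser receives should show where the two requests diverge.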

【Comments】:

【Solution 2】:

Running wget from the command line shows:

broach@broach-laptop:~$ wget http://judgephilosophies.wikispaces.com/
--2011-04-23 14:50:31--  http://judgephilosophies.wikispaces.com/
Resolving judgephilosophies.wikispaces.com... 208.43.192.33, 75.126.104.177
Connecting to judgephilosophies.wikispaces.com|208.43.192.33|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://session.wikispaces.com/1/auth/auth?authToken=e8ad55c0e2701a0e7da89807255609da [following]

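It redirects (several times, in fact), and your bare URLConnection doesn't handle that. The redirect response carries its information in the headers rather than in the body, so your program currently prints null.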
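The literal "null" also comes from the shape of the read loop in the question: it prints the first `readLine()` result without checking it, and then calls `readLine()` a second time in the loop condition, which discards every other line. The sketch below contrasts the two loops on in-memory input; the class and method names (`ReadLoopDemo`, `originalLoop`, `fixedLoop`) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReadLoopDemo {
    // Mirrors the do/while loop from the question, collecting what it would print.
    static List<String> originalLoop(BufferedReader reader) {
        List<String> printed = new ArrayList<>();
        try {
            do {
                // println(null) prints the text "null"; String.valueOf mimics that.
                printed.add(String.valueOf(reader.readLine()));
            } while (reader.readLine() != null); // this second read throws a line away
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return printed;
    }

    // Usual idiom: read once per iteration and stop at the first null.
    static List<String> fixedLoop(BufferedReader reader) {
        List<String> lines = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // An empty body, as a redirect response may have.
        System.out.println(originalLoop(new BufferedReader(new StringReader(""))));
        System.out.println(fixedLoop(new BufferedReader(new StringReader(""))));
        // A three-line body.
        System.out.println(originalLoop(new BufferedReader(new StringReader("a\nb\nc\n"))));
        System.out.println(fixedLoop(new BufferedReader(new StringReader("a\nb\nc\n"))));
    }
}
```

With the second loop an empty body produces no output at all, which makes it easier to tell an empty response apart from a page that happens to contain the word "null".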

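You really should consider using HttpURLConnection, since it can handle redirects for you. To do this with URL alone, you would need to inspect the returned headers and handle the HTTP response code yourself (which is what HttpURLConnection does).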
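One caveat: the JDK's HttpURLConnection follows redirects by default, but not when the redirect switches protocol, and the wget trace above goes from http to https. A manual loop over the `Location` header may therefore still be needed. The following is a minimal sketch under several assumptions: the names `RedirectFollower`, `fetch`, `isRedirect`, `resolveLocation` and the cap `MAX_REDIRECTS` are made up for illustration, the body is assumed to be UTF-8, and the cookie manager is installed only on the guess that the site sets session cookies during the redirect chain.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RedirectFollower {
    static final int MAX_REDIRECTS = 10; // arbitrary cap to avoid redirect loops

    // True for the status codes that normally carry a Location header.
    static boolean isRedirect(int code) {
        return code == 301 || code == 302 || code == 303 || code == 307 || code == 308;
    }

    // Resolves a Location value (absolute or relative) against the current URL.
    static String resolveLocation(String current, String location) {
        return URI.create(current).resolve(location).toString();
    }

    // Fetches the body, following redirects by hand, including http to https.
    static String fetch(String address) throws IOException {
        String current = address;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(current).openConnection();
            conn.setInstanceFollowRedirects(false); // handle every hop ourselves
            int code = conn.getResponseCode();
            if (isRedirect(code)) {
                String location = conn.getHeaderField("Location");
                conn.disconnect();
                if (location == null) {
                    throw new IOException("Redirect without Location header");
                }
                current = resolveLocation(current, location);
                continue;
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line).append('\n');
                }
            }
            return body.toString();
        }
        throw new IOException("Too many redirects");
    }

    public static void main(String[] args) throws IOException {
        // Assumption: keep cookies between hops in case the site relies on them.
        CookieHandler.setDefault(new CookieManager());
        System.out.print(fetch("http://judgephilosophies.wikispaces.com/"));
    }
}
```

Handling each hop explicitly also makes it easy to log the intermediate URLs, which helps when comparing against the wget output.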

【Comments】: