【Question title】:Downloaded PDF with Java is corrupt?
【Posted】:2009-09-04 09:45:41
【Question】:

I have read the excellent discussion on How to download and save a file from internet using Java. However, when I run the following code, I get a corrupt PDF. Any idea why?

import java.io.*;
import java.net.*;

public class PDFDownload {
    public static String URL = "http://www.nbc.com/Heroes/novels/downloads/";
    public static String FOLDER = "C:/Users/sdelamo/workspace/SandBox/HeroesNovel/";

    public static void main(String[] args) {
        String filename = "Heroes_novel_001.pdf";
        try {
            saveUrl(FOLDER + filename, URL + filename);
        } catch (MalformedURLException e) {
            System.out.println("MalformedURLException");
        } catch (IOException e) {
            System.out.println("IOException");                              
        }                       
    }       



    public static void saveUrl(String filename, String urlString) throws MalformedURLException, IOException {
        BufferedInputStream in = null;
        FileOutputStream fout = null;
        try {
            URL url = new URL(urlString);
            in = new BufferedInputStream(url.openStream());
            fout = new FileOutputStream(filename);

            byte data[] = new byte[1024];
            int count;
            while ((count = in.read(data, 0, 1024)) != -1) {
                fout.write(data, 0, count);
            }
        } finally {
            if (in != null)
                in.close();
            if (fout != null)
                fout.close();
        }
    }
}

The code above downloads HTML rather than a PDF. This is the output:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>

<meta name="viewport" content="width=240, user-scalable=yes" />
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">

<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css"  href="/style/default.css?sid=8a9212f822e1c675330ec418bc531169" />
<link rel="stylesheet" type="text/css"  href="/style/hro.css?sid=8a9212f822e1c675330ec418bc531169" /> 

</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/5/H.8--WAP/4aa0e4cb8b448?vid=8a9212f822e1c675330ec418bc531169&gn=NBC.com Front Door&c2=&c3=Miscellaneous&c4=&c6=m.nbc.com/show/hro&c8=TV Entertainment&c9=NBC Network&c10=&c11= | &c12= | &c25=offdeck&c27=internal&c29=&c44=D=User-Agent&r=" width="5" height="5" border="0" /></center>
<h1 id="fHeader">
<a  href="/?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/nbc_logo.gif" alt="NBC : logo" border="0" />
</a>
</h1>

<h2>
<a  href="/show/hro?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/shows/1221684699_Heroes_WAP_166x54.jpg" alt="Heroes : showheader" border="0" />
</a>
</h2>
<div id="tunein_nexton">
    <span id="tunein">Mondays 9/8c</span>
</div><!--end #tunein_nexton-->
<div id="tunein_nexton">
    <!--<span id="tunein">Mondays 8/7c</span>-->

    <p id="nexton"><span class="sectiontitle"></span></p>
</div><!--end #tunein_nexton-->
<div id="featuredcontent">
    <h3>FEATURED CONTENT</h3>
    <table id="featuredItemsTable">

        <tr>
            <td><a  href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="/images/hro/nbc_hro_pro_040X921HRO120FLYPSIDE_exp921_20090_543_large.jpg" alt="featured" /></a>
            </td>
            <td>
                <span class="ftitle">Dreams</span>
                <span class="fdesc">Heroes premieres Mon., Sept. 21s...</span>
            </td>
        </tr>
                                        <tr>
            <td><a  href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/Heroes/images/episodes/season3/325/hro_325_01.jpg" alt="featured" height="45" width="80"/></a>
            </td>
            <td>
                <span class="ftitle">Recap:</span>
                <span class="fdesc">Season 3 Episode An Invisible Thread</span>
            </td>
        </tr>
                                        <tr>
            <td><a  href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/app2/img/200x200xS/scet/photos/51/3736/NUP_110031_0323.JPG" alt="featured" height="45" width="80"/></a>
            </td>
            <td class="finfo">
                <span class="ftitle">Photo:</span>
                <span class="fdesc">Heroes "Cast Photos"</span>
            </td>
        </tr>
                    </table>


</div><!--end #featuredcontent-->

<h3>HEROES</h3>
<table class="showNav">
    <tr><td><a  href="/show/hro/about.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="1">About</a></td></tr>
        <tr><td><a  href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="2">Videos</a></td></tr>
                <tr><td><a  href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="3">Episode Recaps</a></td></tr>
                    <tr><td><a  href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="4">Photos</a></td></tr>
                <tr><td><a  href="/show/hro/community.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="5">Community</a></td></tr>
    <tr><td><a  href="/shows.shtml?sid=8a9212f822e1c675330ec418bc531169" accesskey="6">Shows List</a></td></tr>
</table>
<!-- <a  href="http://www.insightexpress.com/ix/Survey.aspx?id=151580&accessCode=3161643404&sid=8a9212f822e1c675330ec418bc531169" ><img src="/images/mNBCcom_166x54.jpg" border="0"></a> -->



<div class="footer" align="center"><a  href="http://m.nbc.com?sid=8a9212f822e1c675330ec418bc531169"><strong>NBC Mobile Main</strong></a> | <a  href="/terms.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Terms of Use</strong></a> | <a  href="/privacy.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Privacy</strong></a></div><div class="cpyrt" align="center">&#169; NBC Universal, Inc.</div>

</body>
</html>

Any idea how to download the PDF?

Solution

Set the User-Agent request header before connecting.

URL u = new URL(urlString); 
HttpURLConnection huc =  (HttpURLConnection)  u.openConnection();
huc.setRequestMethod("GET"); 
huc.setRequestProperty("User-Agent", "  Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
huc.connect();          

in = new BufferedInputStream(huc.getInputStream());
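
For reference, the fragment above can be folded back into the question's saveUrl method. This is only a minimal sketch under stated assumptions, not the poster's exact code: the class name UserAgentDownload, the extra userAgent parameter, and the copyStream helper are hypothetical names introduced for illustration.

```java
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

final class UserAgentDownload {

    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copyStream(InputStream in, OutputStream out) throws IOException {
        byte[] data = new byte[1024];
        long total = 0;
        int count;
        while ((count = in.read(data, 0, data.length)) != -1) {
            out.write(data, 0, count);
            total += count;
        }
        return total;
    }

    /** Downloads urlString into filename, sending the given User-Agent header (hypothetical helper). */
    static long saveUrl(String filename, String urlString, String userAgent) throws IOException {
        HttpURLConnection huc = (HttpURLConnection) new URL(urlString).openConnection();
        huc.setRequestMethod("GET");
        // Request properties have to be set before the connection is opened.
        huc.setRequestProperty("User-Agent", userAgent);
        try (InputStream in = new BufferedInputStream(huc.getInputStream());
             OutputStream fout = new FileOutputStream(filename)) {
            return copyStream(in, fout);
        } finally {
            huc.disconnect();
        }
    }
}
```

A call would look like saveUrl(FOLDER + filename, URL + filename, "Mozilla/5.0 ..."), with the browser string taken from the snippet above. The try-with-resources block closes both streams even if the copy fails.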

【Question comments】:

    Tags: java pdf url download


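    【Answer 1】:

    Have you tried looking at the downloaded file with, for example, a text editor?

    You will see that it contains an HTML page rather than a PDF. Probably the URL does not point to the PDF, or a redirect is taking place that the standard java.net classes do not follow by default (HttpURLConnection follows same-protocol redirects on its own, but not a redirect that switches between HTTP and HTTPS).

    Make sure the URL really points to a PDF. You can use Apache HttpClient to do more sophisticated things over HTTP, including handling HTTP redirects automatically.

    Note: the code you posted does not compile, because you misplaced a }.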
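
If a redirect is the suspect, it can also be followed by hand with plain java.net. The following is a rough sketch, assuming the redirect is an ordinary HTTP 3xx response with a Location header (a JavaScript or meta-refresh redirect cannot be followed this way). The class name RedirectFollower and the maxHops parameter are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

final class RedirectFollower {

    /** True for the HTTP status codes that ask the client to follow a Location header. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Opens urlString and follows at most maxHops redirects, including HTTP to HTTPS. */
    static HttpURLConnection open(String urlString, int maxHops) throws IOException {
        URL url = new URL(urlString);
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Turn off the built-in handling so every hop is visible here.
            conn.setInstanceFollowRedirects(false);
            int status = conn.getResponseCode();
            if (!isRedirect(status)) {
                return conn;
            }
            String location = conn.getHeaderField("Location");
            conn.disconnect();
            if (location == null) {
                throw new IOException("Redirect without Location header from " + url);
            }
            // The two-argument constructor resolves relative Location values.
            url = new URL(url, location);
        }
        throw new IOException("Gave up after " + maxHops + " redirects");
    }
}
```

The caller reads the body from the returned connection's getInputStream() and disconnects it afterwards.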

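    【Comments】:

    • I believe the code does point to the PDF. He appends the file name to the URL.
    • I opened the PDF with an editor, and it contains an HTML file.
    【Answer 2】:

    This is the same as your other question. If NBC.com thinks you are a crawler, it will not send the PDF back to you :)

    The same trick works here too: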

    conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");
    

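    【Comments】:

      【Answer 3】:

      Check the resulting file; I expect it is an HTML file. The site may return an error page if there is no referrer, or redirect the page using JavaScript, or something similar. You can use the HttpURLConnection class to inspect the HTTP headers returned by the server.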

      URL url = new URL(
          "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setRequestMethod("HEAD");
      try {
        for (Map.Entry<String, List<String>> header : conn.getHeaderFields()
            .entrySet()) {
          System.out.println(header.getKey() + "=" + header.getValue());
        }
      } finally {
        conn.disconnect();
      }
      

      The code above reports a Content-Type of text/html.
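
The same HEAD request can be turned into a yes/no check before downloading anything. This is a minimal sketch, assuming the server reports the same Content-Type for HEAD as for GET; the class name ContentTypeCheck and both method names are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Locale;

final class ContentTypeCheck {

    /** True if a Content-Type value names application/pdf, ignoring case and parameters. */
    static boolean isPdfContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=..." before comparing.
        String mediaType = contentType.split(";", 2)[0].trim();
        return mediaType.toLowerCase(Locale.ROOT).equals("application/pdf");
    }

    /** Issues a HEAD request and reports whether the server claims to send a PDF. */
    static boolean servesPdf(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("HEAD");
        try {
            return isPdfContentType(conn.getContentType());
        } finally {
            conn.disconnect();
        }
    }
}
```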

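      【Comments】:

      • You are right. I opened it with an editor and there is HTML inside.
      【Answer 4】:

      For this kind of exploration I highly recommend Jython (or Groovy, or ...). For example: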

      C:\Users\Vinay>jython
      Jython 2.5.0 (Release_2_5_0:6476, Jun 16 2009, 13:33:26)
      [Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_16
      Type "help", "copyright", "credits" or "license" for more information.
      >>> s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
      >>> import java.net
      >>> import jarray
      >>> u = java.net.URL(s)
      >>> os = u.openStream()
      >>> buffer = jarray.zeros(1024, 'b')
      >>> n = os.read(buffer, 0, 1024)
      >>> java.lang.String(buffer)
      <?xml version="1.0" encoding="UTF-8" ?>
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
          "http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">
      
      <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
      <head>
      
      <meta name="viewport" content="width=240, user-scalable=yes" />
      <HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
      <META HTTP-EQUIV="Expires" CONTENT="-1">
      <meta http-equiv="Cache-control" content="no-cache">
      <meta http-equiv="Cache-control" content="must-revalidate">
      <meta http-equiv="Cache-control" content="max-age=0">
      <meta http-equiv="refresh" content="200">
      <title>NBC.com: Heroes</title>
      <link rel="stylesheet" type="text/css"  href="/style/default.css?sid=c67ddc30f79
      ec4cc811f6e67e383fed7" />
      <link rel="stylesheet" type="text/css"  href="/style/hro.css?sid=c67ddc30f79ec4c
      c811f6e67e383fed7" />
      
      </head>
      <body>
      <center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/
      5/H.8--WAP/4aa0e7ce2535c?vid=c67ddc30f79ec4cc811f6e67e383fed7&gn=NBC.com Front
      >>>
      

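      This confirms your finding, but without an edit/compile cycle getting in the way. Just my 2 cents...

      As for how to get the data: you may have to spoof your User-Agent header. In Firefox, the same URL returns a Content-Type of application/pdf and a PDF file.

      Update: the following Jython script: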

      import java.net
      import jarray
      
      s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
      u = java.net.URL(s)
      c = u.openConnection()
      c.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090810 Ubuntu/9.10 (karmic) Firefox/3.5.2")
      BUFLEN = 4
      buffer = jarray.zeros(BUFLEN, 'b')
      c.connect()
      stream = c.getInputStream()
      stream.read(buffer, 0, BUFLEN)
      data = java.lang.String(buffer)
      print data
      

      prints

      %PDF

      So the site is looking at the User-Agent header.
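
The four-byte check from the Jython script translates directly to Java. A minimal sketch, assuming that a response starting with the bytes %PDF is good enough evidence of a PDF; the class name PdfSignature is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class PdfSignature {

    // The four leading bytes that the Jython script printed for a real PDF response.
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** True if head starts with the bytes %PDF. */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Passing the first bytes of the response (or of the saved file) to looksLikePdf catches the HTML page case, since that output starts with <?xml instead.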

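      【Comments】:

      • How do I spoof the User-Agent header?
      • If you stick with Java's HttpURLConnection, set it as a request property before connecting. (Note that spoofing the user agent may work in this case, but it is only one of many tricks a web server can use to tell real browsers from bots/spiders/etc.)
      【Answer 5】:

      If setting the User-Agent does not solve the problem, cookies may be the cause. Install a simple browser plugin (EditThisCookie, HTTP Spy for Chrome) and inspect the request and response headers. Take those cookie values and set them on the same HttpURLConnection.

      Code: (an extension of the solution posted by Sergio del Amo)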

      URL u = new URL(urlString); 
      HttpURLConnection huc =  (HttpURLConnection)  u.openConnection();
      huc.setRequestMethod("GET"); 
      huc.setRequestProperty("User-Agent", "  Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
      
      String myCookies = "cookie_name_1=cookie_value_1; cookie_name_2=cookie_value_2";
      huc.setRequestProperty("Cookie", myCookies);
      
      huc.connect();          
      
      in = new BufferedInputStream(huc.getInputStream());
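
When there are more than a couple of cookies, building the header value from a map avoids typos in the hand-written string. A small sketch, assuming the names and values were copied from the browser and need no further escaping; the class name CookieHeader is hypothetical.

```java
import java.util.Map;
import java.util.StringJoiner;

final class CookieHeader {

    /** Joins name=value pairs with "; " as expected in a Cookie request header. */
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joiner.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joiner.toString();
    }
}
```

The result is passed to huc.setRequestProperty("Cookie", ...) before connecting. A LinkedHashMap keeps the pairs in insertion order.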
      

      【Comments】:
