The crawler escapes "mydomain#!article" into "mydomain?_escaped_fragment_=article"; how do I get the original URL back?

【Question title】: The crawler escapes "mydomain#!article" into "mydomain?_escaped_fragment_=article", how to retrieve back the original url?
【Posted】: 2014-05-16 04:40:55
【Question】:

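OK, this is what Google says (https://developers.google.com/webmasters/ajax-crawling/docs/getting-started).

When the crawler sees a URL such as www.example.com/ajax.html#!key=value, it temporarily converts it to www.example.com/ajax.html?_escaped_fragment_=key=value

However, while doing so it also escapes certain characters in the fragment. For example: www.example.com/ajax.html#!key=value;car=% becomes www.example.com/ajax.html?_escaped_fragment_=key=value;car=%25

So if we want to convert www.example.com/ajax.html?_escaped_fragment_=key=value;car=%25 back to the original URL, we need to unescape all %XX sequences in the fragment.

Google says:

Note: The crawler escapes certain characters in the fragment during the transformation. To retrieve the original fragment, make sure to unescape all %XX characters in the fragment. More specifically, %26 should become &, %20 should become a space, %23 should become #, and %25 should become %, and so on.

But Google does not say how to do this in Java.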

String originalUrl=changedStr.replace("?_escaped_fragment_=", "#!");
// then what to do next so that all the escaped characters will go back to normal?

Would this work?

originalUrl=java.net.URLDecoder.decode(originalUrl, "UTF-8");

Which one do we have to use: "UTF-8" or "ASCII"?

And when the crawler escapes the URL, does it use URL.encode()?

If so, does it use "UTF-8" or "ASCII"?

【Question comments】:

Tags: java gwt

【Solution 1】:

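You might want to look at this SO working example. You will be particularly interested in the function at the end, rewriteQueryString.

In short, you are on the right track: the key is to call URLDecoder.decode. You may also be interested in the wrapper code around it.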
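As a rough illustration of the approach above (this is not the linked rewriteQueryString code, just a minimal sketch), the conversion could look like the following. The class name EscapedFragment and the helper toOriginalUrl are hypothetical. The sketch assumes the fragment's %XX escapes are UTF-8; for plain ASCII characters such as %25 or %26, "UTF-8" and "ASCII" decode to the same result, since UTF-8 is a superset of ASCII. It also assumes the parameter may follow either ? or &, and it protects any literal + first, because URLDecoder.decode would otherwise turn + into a space.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class EscapedFragment {
    private static final String PARAM = "_escaped_fragment_=";

    // Hypothetical helper: turns "...?_escaped_fragment_=X" back into "...#!X".
    static String toOriginalUrl(String crawlerUrl) {
        int idx = crawlerUrl.indexOf(PARAM);
        if (idx <= 0) {
            return crawlerUrl; // parameter absent: nothing to convert
        }
        char separator = crawlerUrl.charAt(idx - 1);
        if (separator != '?' && separator != '&') {
            return crawlerUrl; // not a query parameter: leave untouched
        }
        String base = crawlerUrl.substring(0, idx - 1);
        String fragment = crawlerUrl.substring(idx + PARAM.length());
        try {
            // Assumption: a literal '+' should stay '+', so re-escape it
            // before URLDecoder applies its form-style "+ means space" rule.
            String decoded = URLDecoder.decode(fragment.replace("+", "%2B"), "UTF-8");
            return base + "#!" + decoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toOriginalUrl(
            "www.example.com/ajax.html?_escaped_fragment_=key=value;car=%25"));
        // prints www.example.com/ajax.html#!key=value;car=%
    }
}
```

Splitting on the parameter name rather than doing a blind replace means only the fragment part is decoded, so any %XX sequences in the path or in other query parameters are left as they were.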

【Comments】: