How can I get the absolute URL when extracting links using Nokogiri? Answers

【Question title】: How can I get the absolute URL when extracting links using Nokogiri?
【Posted】: 2011-06-19 04:35:45
【Question description】:

I am using Nokogiri to extract links from a page, but I want the absolute URL even when the path on the page is relative. How can I do that?

【Question comments】:

Tags: ruby nokogiri

【Solution 1】:

Nokogiri is unrelated here, apart from the fact that it hands you the link anchors. Use Ruby's URI library to resolve the paths:

absolute_uri = URI.join( page_url, href ).to_s

Seen in action:

require 'uri'

# The URL of the page with the links
page_url = 'http://foo.com/zee/zaw/zoom.html'

# A variety of links to test.
hrefs = %w[
  http://zork.com/             http://zork.com/#id
  http://zork.com/bar          http://zork.com/bar#id
  http://zork.com/bar/         http://zork.com/bar/#id
  http://zork.com/bar/jim.html http://zork.com/bar/jim.html#id
  /bar                         /bar#id
  /bar/                        /bar/#id
  /bar/jim.html                /bar/jim.html#id
  jim.html                     jim.html#id
  ../jim.html                  ../jim.html#id
  ../                          ../#id
  #id
]

hrefs.each do |href|
  root_href = URI.join(page_url,href).to_s
  puts "%-32s -> %s" % [ href, root_href ]
end
#=> http://zork.com/                 -> http://zork.com/
#=> http://zork.com/#id              -> http://zork.com/#id
#=> http://zork.com/bar              -> http://zork.com/bar
#=> http://zork.com/bar#id           -> http://zork.com/bar#id
#=> http://zork.com/bar/             -> http://zork.com/bar/
#=> http://zork.com/bar/#id          -> http://zork.com/bar/#id
#=> http://zork.com/bar/jim.html     -> http://zork.com/bar/jim.html
#=> http://zork.com/bar/jim.html#id  -> http://zork.com/bar/jim.html#id
#=> /bar                             -> http://foo.com/bar
#=> /bar#id                          -> http://foo.com/bar#id
#=> /bar/                            -> http://foo.com/bar/
#=> /bar/#id                         -> http://foo.com/bar/#id
#=> /bar/jim.html                    -> http://foo.com/bar/jim.html
#=> /bar/jim.html#id                 -> http://foo.com/bar/jim.html#id
#=> jim.html                         -> http://foo.com/zee/zaw/jim.html
#=> jim.html#id                      -> http://foo.com/zee/zaw/jim.html#id
#=> ../jim.html                      -> http://foo.com/zee/jim.html
#=> ../jim.html#id                   -> http://foo.com/zee/jim.html#id
#=> ../                              -> http://foo.com/zee/
#=> ../#id                           -> http://foo.com/zee/#id
#=> #id                              -> http://foo.com/zee/zaw/zoom.html#id

An earlier, more convoluted version of this answer used URI.parse(root).merge(URI.parse(href)).to_s.
Thanks to @pguardiario for the improvement.

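【Discussion】:

Nokogiri may have something to do with it after all. Here is how: if the HTML document contains a base tag, the solution above will not work correctly. In that case the value of the base tag's href attribute should be used instead of page_url. See @david-thomas's more detailed explanation here: stackoverflow.com/questions/5559578/…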
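To illustrate the point about the base tag, here is a minimal sketch that only uses Ruby's URI library. The helper name absolute_url, its base_href parameter, and the bar.com URLs are hypothetical and not part of the original answer. It assumes the caller has already read the base tag's href; with Nokogiri that could be done with something like doc.at_css('base[href]'), as noted in the comment.

```ruby
require 'uri'

# Hypothetical helper (not from the original answer): resolve href against
# the document's effective base URL.
#
# base_href is the value of <base href="...">, or nil when the page has none.
# Assumption: with Nokogiri it could be obtained roughly like this:
#   base      = doc.at_css('base[href]')
#   base_href = base && base['href']
def absolute_url(page_url, href, base_href = nil)
  # The base href may itself be relative, so resolve it against the page URL first.
  effective_base = base_href ? URI.join(page_url, base_href).to_s : page_url
  URI.join(effective_base, href).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'

puts absolute_url(page_url, 'jim.html')
#=> http://foo.com/zee/zaw/jim.html
puts absolute_url(page_url, 'jim.html', 'http://bar.com/files/')
#=> http://bar.com/files/jim.html
puts absolute_url(page_url, 'jim.html', '/files/')
#=> http://foo.com/files/jim.html
```

When the page has no base tag, the helper behaves exactly like the plain URI.join call in the answer.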

【Solution 2】:

Phrogz's answer is fine, but this is simpler:

URI.join(base, url).to_s

【Discussion】:

Can you give an example of what base and url are?
base = "http://www.google.com/somewhere"; url = '/over/there'; I believe pguardino's variable names are a bit imprecise.

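【Solution 3】:

You need to check whether the URL is absolute or relative, for example by checking whether it starts with http:. If the URL is relative, you need to add the host to it. You cannot do this through Nokogiri; you have to process all of the URLs yourself so that they come out as absolute.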
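The manual check described here can be sketched as follows. This is only an illustration under stated assumptions: the helper name to_absolute is hypothetical, and it uses URI#absolute? instead of a string prefix test, since a check for "http:" alone would miss https: links.

```ruby
require 'uri'

# Hypothetical helper: make href absolute by hand, following the steps
# described in this answer.
def to_absolute(page_url, href)
  link = URI.parse(href)
  # A URI with a scheme (http, https, ...) is already absolute; keep it as is.
  return link.to_s if link.absolute?
  # Otherwise resolve the relative reference against the page URL.
  URI.parse(page_url).merge(link).to_s
end

page_url = 'http://foo.com/zee/zaw/zoom.html'

puts to_absolute(page_url, 'https://zork.com/bar') #=> https://zork.com/bar
puts to_absolute(page_url, '/bar')                 #=> http://foo.com/bar
puts to_absolute(page_url, '../jim.html')          #=> http://foo.com/zee/jim.html
```

In practice URI.join performs the same check internally, so the explicit branch is only needed if absolute and relative links must be treated differently.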

【Discussion】: