Using nokogiri to parse Google Picasa API XML - namespacing issue? (Answers)

【Question title】: Using nokogiri to parse Google Picasa API XML - namespacing issue?
【Posted】: 2011-03-30 18:37:35
【Question】:

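I'm trying to pull some data out of Google Picasa XML and am running into trouble.

Here is the actual XML (containing just one entry): http://pastie.org/1736008

Basically, I want to collect a few of the gphoto elements, so ideally I'd like to do something like this: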

doc.xpath('//entry').map do |entry|
  {:id => entry.children['gphoto:id'],
   :thumb => entry.children['gphoto:thumbnail'],
   :name => entry.children['gphoto:name'],
   :count => entry.children['gphoto:numphotos']}
end

However, this doesn't work. In fact, when I inspect the children of an entry, I don't see any 'gphoto:xxx' elements at all, so I'm confused about how to find them.

Thanks!

Tags: xml google-api nokogiri

【Solution 1】:

Here is some working code that uses nokogiri to extract the gphoto elements from your sample XML.

#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
content = File.read('input.xml')
doc = Nokogiri::XML(content) do |config|
  config.options = Nokogiri::XML::ParseOptions::STRICT
end

hashes = doc.xpath('//xmlns:entry').map do |entry|
  {
    :id => entry.xpath('gphoto:id').inner_text,
    :thumb => entry.parent.xpath('gphoto:thumbnail').inner_text,
    :name => entry.xpath('gphoto:name').inner_text,
    :count => entry.xpath('gphoto:numphotos').inner_text
  }
end

puts hashes.inspect

# yields: 
#
# [{:count=>"37", :name=>"Melody19Months", :thumb=>"http://lh3.ggpht.com/_Viv8WkAChHU/AAAAAAAAAAA/AAAAAAAAAAA/pNuu5PgnP1Y/s64-c/soopingsaw.jpg", :id=>"5582695833628950881"}]

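Notes:

The sample XML in the gist needs a closing "feed" tag. I fixed that in a copy (linked from the original answer).
In the XPath expression that finds the entry elements we have to use a namespace prefix, hence "xmlns:entry" rather than just "entry". The latter (used in your original code) finds no elements: it looks for elements in the null namespace, but in your sample they all inherit the default namespace declared on the feed element. Aaron Patterson wrote a (Nokogiri-centric) introduction to this issue, and there is a second write-up as well (both linked from the original answer).
The element gphoto:thumbnail is a child of the feed element, not of each entry. I made a small (hacky) adjustment for that, keeping the design of the original example, but it would be far better to look this element's value up just once per feed (and perhaps populate the entry hashes afterwards, if each one really needs its own copy).
Configuring Nokogiri to be strict isn't actually necessary, but it helps to catch problems early.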

【Solution 2】:

You can search for the entry nodes, then look inside each one to extract the nodes in the gphoto namespace:

require 'nokogiri'

doc = Nokogiri::XML(open('./test.xml'))
hashes = doc.search('//xmlns:entry').map do |entry|
  h = {}
  entry.search("*[namespace-uri()='http://schemas.google.com/photos/2007']").each do |gphoto|
    h[gphoto.name] = gphoto.text
  end
  h
end

require 'ap'
ap hashes
# >> [
# >>     [0] {
# >>                        "id" => "5582695833628950881",
# >>                      "name" => "Melody19Months",
# >>                  "location" => "",
# >>                    "access" => "public",
# >>                 "timestamp" => "1299649559000",
# >>                 "numphotos" => "37",
# >>                      "user" => "soopingsaw",
# >>                  "nickname" => "sooping",
# >>         "commentingEnabled" => "true",
# >>              "commentCount" => "0"
# >>     }
# >> ]

This returns all the //entry/gphoto:* nodes. If you only want certain ones, you can filter for those:

require 'nokogiri'

doc = Nokogiri::XML(open('./test.xml'))
hashes = doc.search('//xmlns:entry').map do |entry|
  h = {}
  entry.search("*[namespace-uri()='http://schemas.google.com/photos/2007']").each do |gphoto|
    h[gphoto.name] = gphoto.text if (%w[id thumbnail name numphotos].include?(gphoto.name))
  end
  h
end

require 'ap'
ap hashes

# >> [
# >>     [0] {
# >>                "id" => "5582695833628950881",
# >>              "name" => "Melody19Months",
# >>         "numphotos" => "37"
# >>     }
# >> ]

Note that the original question tried to access gphoto:thumbnail, but no node matches //entry/gphoto:thumbnail, so it cannot be found there.

Another way to write the search using namespaces is:

require 'nokogiri'

doc = Nokogiri::XML(open('./test.xml'))
hashes = doc.search('//xmlns:entry').map do |entry|
  h = {}
  entry.search("*").each do |gphoto|
    h[gphoto.name] = gphoto.text if (
      (gphoto.namespace.prefix=='gphoto') && 
      (%w[id thumbnail name numphotos].include?(gphoto.name))
    )
  end
  h
end

require 'ap'
ap hashes

# >> [
# >>     [0] {
# >>                "id" => "5582695833628950881",
# >>              "name" => "Melody19Months",
# >>         "numphotos" => "37"
# >>     }
# >> ]

Instead of using XPath, this asks Nokogiri to look at each node's namespace attribute.
