【Question title】: Grabbing Only the Entries With Specific Attribute Value On A Given Node
【Posted】: 2014-01-24 04:46:09
【Question】:

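I'm using an Xml::Parser wrapper on top of Nokogiri::XML::Reader to extract entries from an XML file. I only want to grab the entries matching "Property/PropertyID/Identification['OrganizationName' == 'northsteppe']", but I can't figure out the correct syntax. The entire rake task I've been building is below, followed by a sample node containing all of the information and tags. Any guidance would be greatly appreciated.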

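================ UPDATE ===============

The file I'm parsing will be pulled with open-uri, since it comes from an external source. For now I'm using an older hard copy on my local machine to speed things up during development, because the file is 300MB+ in size. I tried using a SAX parser, but that logic seemed a bit too complex for me to really grasp what was going on, and I ran into the same problem: limiting the properties I grab to only those with 'northsteppe' as the OrganizationName in the Identification tag. With that said, I chose to attempt the same task with my current approach. I can get nearly all of the information I need; I'm only missing the last piece mentioned above.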

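=============== AS SPECIFIC AS POSSIBLE =============

So, I feel that describing the exact task I'm trying to perform will help fill in any gaps. The task is as follows.

From the XML file, grab each property that has OrganizationName = 'northsteppe' in its <Identification> tag, then collect all of the corresponding information for that property and insert it into a hash. Once all of the information for a single property has been collected in that hash, it needs to be uploaded to the database as a single entry; the database is already structured the way it needs to be. After that property is inserted, the rake task moves on to the next Property entry that has OrganizationName = 'northsteppe' in its <Identification> tag and repeats the process until every property matching that specification has been inserted. The purpose is to let me run quick searches over the Northsteppe properties' data without bogging the system down with every property in the XML file.

Eventually, I'll use open-uri to pull the file from its external source and run a cron job that executes this rake task every 6 hours and replaces the database.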

================= CODE ==================

namespace :db do

# RAKE TASK DESCRIPTION
desc "Fetch property information and insert it into the database"

# RAKE TASK NAME    
task :print_properties => :environment do

    require 'rubygems'
    require 'nokogiri'

    module Xml
      class Parser
        def initialize(node, &block)
          @node = node
          @node.each do
            self.instance_eval &block
          end
        end

        def name
          @node.name
        end

        def inner_xml
          @node.inner_xml.strip
        end

        def is_start?
          @node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
        end

        def is_end?
          @node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
        end

        def attribute(attribute)
          @node.attribute(attribute)
        end

        def for_element(name, &block)
          return unless self.name == name and is_start?
          self.instance_eval &block
        end

        def inside_element(name=nil, &block)
          return if @node.self_closing?
          return unless name.nil? or (self.name == name and is_start?)

          name = @node.name
          depth = @node.depth

          @node.each do
            return if self.name == name and is_end? and @node.depth == depth
            self.instance_eval &block
          end
        end
      end
    end


    Xml::Parser.new(Nokogiri::XML::Reader(open("app/assets/xml/mits.xml"))) do
        inside_element 'Property' do

            # OPEN AND PARSE THE <PropertyID> TAG
            inside_element 'PropertyID' do

                inside_element 'Identification' do
                    puts attribute_nodes()
                end

                # OPEN AND PARSE THE <Address> TAG
                inside_element 'Address' do
                    for_element 'AddressLine1' do puts "Street Address: #{inner_xml}" end
                    for_element 'City' do puts "City: #{inner_xml}" end
                    for_element 'PostalCode' do puts "Zipcode: #{inner_xml}" end
                end

                for_element 'MarketingName' do puts "Short Description: #{inner_xml}" end
            end

            # OPEN AND PARSE THE <Information> TAG
            inside_element 'Information' do
                for_element 'LongDescription' do puts "Long Description: #{inner_xml}" end
                inside_element 'Rents' do
                    for_element 'StandardRent' do puts "Rent: #{inner_xml}" end
                end
            end

            inside_element 'Fee' do
                for_element 'ApplicationFee' do puts "Application Fee: #{inner_xml}" end
            end

            inside_element 'ILS_Identification' do
                for_element 'Latitude' do puts "Latitude: #{inner_xml}" end
                for_element 'Longitude' do puts "Longitude: #{inner_xml}" end
            end

        end
    end

end #END PRINT_PROPERTIES TASK

end #END NAMESPACE

And a sample of the XML:

<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<PropertyID>
  <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
  <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
  <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
  <WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite>
  <Address AddressType="property">
    <Description>Address of Available Listing</Description>
    <AddressLine1>1689 N 4th St </AddressLine1>
    <City>Columbus</City>
    <State>OH</State>
    <PostalCode>43201</PostalCode>
    <Country>US</Country>
  </Address>
  <Phone PhoneType="office">
    <PhoneNumber>(614) 299-4110</PhoneNumber>
  </Phone>
  <Email>northsteppe.nsr@gmail.com</Email>
</PropertyID>
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate">
  <Latitude>39.997694</Latitude>
  <Longitude>-82.99903</Longitude>
  <LastUpdate Month="11" Day="11" Year="2013"/>
</ILS_Identification>
<Information>
  <StructureType>Standard</StructureType>
  <UnitCount>1</UnitCount>
  <ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription>
  <LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription>
  <Rents>
    <StandardRent>2000.00</StandardRent>
  </Rents>
  <PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL>
</Information>
<Fee>
  <ProrateType>Standard</ProrateType>
  <LateType>Standard</LateType>
  <LatePercent>0</LatePercent>
  <LateMinFee>0</LateMinFee>
  <LateFeePerDay>0</LateFeePerDay>
  <NonRefundableHoldFee>0</NonRefundableHoldFee>
  <AdminFee>0</AdminFee>
  <ApplicationFee>30.00</ApplicationFee>
  <BrokerFee>0</BrokerFee>
</Fee>
<Deposit DepositType="Security Deposit">
  <Amount AmountType="Actual">
    <ValueRange Exact="2000.00" Currency="USD"/>
  </Amount>
</Deposit>
<Policy>
  <Pet Allowed="false"/>
</Policy>
<Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <Description/>
  <UnitCount>1</UnitCount>
  <RentableUnits>1</RentableUnits>
  <TotalSquareFeet>0</TotalSquareFeet>
  <RentableSquareFeet>0</RentableSquareFeet>
</Phase>
<Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <Description/>
  <UnitCount>1</UnitCount>
  <SquareFeet>0</SquareFeet>
</Building>
<Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <UnitCount>1</UnitCount>
  <Room RoomType="Bedroom">
    <Count>4</Count>
    <Comment/>
  </Room>
  <Room RoomType="Bathroom">
    <Count>1</Count>
    <Comment/>
  </Room>
  <SquareFeet Min="0" Max="0"/>
  <MarketRent Min="2000" Max="2000"/>
  <EffectiveRent Min="2000" Max="2000"/>
</Floorplan>
<ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Units>
    <Unit>
      <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/>
      <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
      <UnitBedrooms>4</UnitBedrooms>
      <UnitBathrooms>1.0</UnitBathrooms>
      <MinSquareFeet>0</MinSquareFeet>
      <MaxSquareFeet>0</MaxSquareFeet>
      <SquareFootType>internal</SquareFootType>
      <UnitRent>2000.00</UnitRent>
      <MarketRent>2000.00</MarketRent>
      <Address AddressType="property">
        <AddressLine1>1689 N 4th St </AddressLine1>
        <City>Columbus</City>
        <PostalCode>43201</PostalCode>
        <Country>US</Country>
      </Address>
    </Unit>
  </Units>
  <Availability>
    <VacateDate Month="7" Day="23" Year="2014"/>
    <VacancyClass>Unoccupied</VacancyClass>
    <MadeReadyDate Month="7" Day="23" Year="2014"/>
  </Availability>
  <Amenity AmenityType="Other">
    <Description>All new stainless steel appliances!  Refinished hardwood floors</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Ceramic tile</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Ceiling fans</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Wrap-around porch</Description>
  </Amenity>
  <Amenity AmenityType="Dryer">
    <Description>Free Washer and Dryer</Description>
  </Amenity>
  <Amenity AmenityType="Washer">
    <Description>Free Washer and Dryer</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>off-street parking available</Description>
  </Amenity>
</ILS_Unit>
<File Active="true" FileID="820982141">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src>
  <Width>360</Width>
  <Height>300</Height>
  <Rank>1</Rank>
</File>
<File Active="true" FileID="820982145">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>2</Rank>
</File>
<File Active="true" FileID="820982149">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>3</Rank>
</File>
<File Active="true" FileID="820982152">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>4</Rank>
</File>
<File Active="true" FileID="820982155">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>5</Rank>
</File>
<File Active="true" FileID="820982157">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>6</Rank>
</File>
<File Active="true" FileID="820982160">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>7</Rank>
</File>
  </Property>

【Comments】:

Tags: ruby xml xml-parsing nokogiri


【Answer 1】:

Try this first:

require 'nokogiri'

doc = Nokogiri::XML(File.read('test.xml'))
doc.search('*[OrganizationName="northsteppe"]') 
# => [#<Nokogiri::XML::Element:0x3fd82499131c name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd8249912b8 name="IDValue" value="642da00e-9be3-4a7c-bd50-66a4f0d70af8">, #<Nokogiri::XML::Attr:0x3fd8249912a4 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd824991290 name="IDType" value="property">]>, #<Nokogiri::XML::Element:0x3fd824990a70 name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd824990a0c name="IDValue" value="6e1e61523972d5f0e260e3d38eb488337424f21e">, #<Nokogiri::XML::Attr:0x3fd8249909f8 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd8249909e4 name="IDType" value="Company">]>]

To make what Nokogiri found more readable:

puts doc.search('*[OrganizationName="northsteppe"]').map{ |n| n.to_xml }
# >> <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
# >> <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>

I find CSS selectors are usually more readable than XPath. In this case it's a toss-up.


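...the actual file is 300MB, and loading it into a DOM crashes the server.

If your server can't handle the file size, your best option is a SAX parser, which is about as memory-efficient as it gets. Here's a simple example using your sample XML: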

require 'nokogiri'

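class MyDocument < Nokogiri::XML::SAX::Document
  # Matches are kept per handler instance rather than in a shared class variable.
  attr_reader :tags

  def initialize
    super
    @tags = []
  end

  # Called once per opening tag; attributes arrive as [name, value] pairs.
  def start_element(name, attributes = [])
    attribute_hash = Hash[attributes]
    if (name == 'Identification') && (attribute_hash['OrganizationName'] == 'northsteppe')
      @tags << {
        name: name,
        attributes: attribute_hash
      }
    end
  end
end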

doc = MyDocument.new

# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(doc)

# Feed the parser some XML
parser.parse(File.open('test.xml'))

doc.tags 
# => [{:name=>"Identification",
#      :attributes=>
#       {"IDValue"=>"642da00e-9be3-4a7c-bd50-66a4f0d70af8",
#        "OrganizationName"=>"northsteppe",
#        "IDType"=>"property"}},
#     {:name=>"Identification",
#      :attributes=>
#       {"IDValue"=>"6e1e61523972d5f0e260e3d38eb488337424f21e",
#        "OrganizationName"=>"northsteppe",
#        "IDType"=>"Company"}}]

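【Comments】:

  • Unfortunately, this approach won't work because the actual file is 300MB, and loading it into a DOM crashes the server. :/
  • You left out that *very* important piece of information. Put all of your constraints in the question itself; don't make us figure them out piece by piece.
  • My apologies, I didn't mean to leave that information out. I've added two updates to the question above to be as specific as possible. I ran your code, and it does pull all of the Identification tags with OrganizationName = 'northsteppe', which is as far as I had gotten with SAX before. :) Hopefully the updates above clarify the exact process I'm trying to accomplish, rather than me asking for just one piece of the puzzle and trying to figure out the rest myself (which has proven unsuccessful for this particular task).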
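【Answer 2】:

So the solution I found is a little gem called Saxerator (https://github.com/soulcutter/saxerator). It does SAX parsing without my having to use Nokogiri directly (thankfully), has excellent documentation, and runs super fast. I'd encourage anyone who needs a SAX parser in the future to look into this little gem (pun intended) and spare themselves the burden of dealing with all that poorly written Nokogiri documentation. The solution to my problem is below; it lives in my seeds.rb file.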

require 'saxerator'

parser = Saxerator.parser(File.new("app/assets/xml/mits_snip.xml")) do |config|
  config.put_attributes_in_hash!
  config.symbolize_keys!
end


parser.for_tag(:Property).each do |property|
    if property[:PropertyID][:Identification][1][:OrganizationName] == 'northsteppe'
        property_attributes = {
            street_address:     property[:PropertyID][:Address][:AddressLine1],
            city:               property[:PropertyID][:Address][:City],
            zipcode:            property[:PropertyID][:Address][:PostalCode],
            short_description:  property[:PropertyID][:MarketingName],
            long_description:   property[:Information][:LongDescription],
            rent:               property[:Information][:Rents][:StandardRent],
            application_fee:    property[:Fee][:ApplicationFee],
            vacancy_status:     property[:ILS_Unit][:Availability][:VacancyClass],
            month_available:    property[:ILS_Unit][:Availability][:MadeReadyDate][:Month],
            latitude:           property[:ILS_Identification][:Latitude],
            longitude:          property[:ILS_Identification][:Longitude]

        }

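        # create! raises on failure, so the else branch could never run;
        # use create and check whether the record was actually saved.
        if Property.create(property_attributes).persisted?
            puts "wahoo"
        else
            puts "nope"
        end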
    end
end
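One caveat with `[:Identification][1]`: it assumes every property has at least two `<Identification>` children and that the second one is the entry to test. A more forgiving check might look like the sketch below. It uses plain Ruby hashes as stand-ins for the parsed structure, on the assumption (mine, not confirmed here) that a repeated child arrives as an array and a single child as one hash; `identified_by?` is a hypothetical helper name.

```ruby
# Returns true when any <Identification> entry of the parsed property carries
# the given OrganizationName. Accepts a single hash, an array of hashes, or nil.
def identified_by?(property, org)
  ids = (property[:PropertyID] || {})[:Identification]
  ids = [ids] unless ids.is_a?(Array)
  ids.compact.any? { |id| id[:OrganizationName] == org }
end

# Stand-in data shaped like the sample node (assumed shape).
two_ids = { PropertyID: { Identification: [
  { OrganizationName: 'northsteppe', IDType: 'property' },
  { OrganizationName: 'northsteppe', IDType: 'Company' }
] } }
one_id = { PropertyID: { Identification: { OrganizationName: 'UL Portfolio' } } }
no_ids = { PropertyID: {} }

puts identified_by?(two_ids, 'northsteppe')
puts identified_by?(one_id, 'northsteppe')
puts identified_by?(no_ids, 'northsteppe')
```

With a helper like this, the `if` in the loop above could read `if identified_by?(property, 'northsteppe')`, and a property with a single identification would no longer raise or be skipped by accident.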

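============== UPDATE =================

So I actually rewrote this task and it works much better. I just wanted to share it here in case anyone stumbles across this question. This lives in my seeds.rb file: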

require 'saxerator'
require 'nokogiri'
require 'open-uri'
@company_name = 'northsteppe'
parser = Saxerator.parser(File.new("../../shared/assets/xml/mits.xml")) do |config|
  config.put_attributes_in_hash!
  config.symbolize_keys!
end
puts "DELETED ALL EXISTING PROPERTIES" if Property.delete_all
puts "PULLING RELEVANT XML ENTRIES"
# A plain local counter; a class variable at the top level is not needed here.
count = 0
file = File.new("../../shared/assets/xml/nsr_properties.xml",'w')
properties = []
parser.for_tag(:Property).each do |property|
    print '*'
    if property[:PropertyID][:Identification][1][:OrganizationName] == @company_name
        properties << property
        count += 1
    end
    # break if count == 417
end
file.write(properties.to_xml)
file.close
puts "ADDING PROPERTIES TO THE DATABASE"
nsr_properties = File.open("../../shared/assets/xml/nsr_properties.xml")
doc = Nokogiri::XML(nsr_properties)
doc.xpath("//saxerator-builder-hash-elements/saxerator-builder-hash-element").each do |property|
    print '.'
    @images =[]
    property.xpath("File/File").each do |image|
        @images << image.at_xpath("Src/text()").to_s
    end
    @amenities = []
    property.xpath("ILS-Unit/Amenity/Amenity").each do |amenity|
        @amenities << amenity.at_xpath("Description/text()").to_s
    end
    information = {
        "street_address" => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s,
        "city" => property.at_xpath("PropertyID/Address/City/text()").to_s,
        "zipcode" => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s,
        "short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s,
        "long_description" => property.at_xpath("Information/LongDescription/text()").to_s,
        "rent" => property.at_xpath("Information/Rents/StandardRent/text()").to_s,
        "application_fee" => property.at_xpath("Fee/ApplicationFee/text()").to_s,
        "bedrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBedrooms/text()").to_s,
        "bathrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBathrooms/text()").to_s,
        "vacancy_status" => property.at_xpath("ILS-Unit/Availability/VacancyClass/text()").to_s,
        "month_available" => property.at_xpath("ILS-Unit/Availability/MadeReadyDate/@Month").to_s,
        "latitude" => property.at_xpath("ILS-Identification/Latitude/text()").to_s,
        "longitude" => property.at_xpath("ILS-Identification/Longitude/text()").to_s,
        "images" => @images,
        "amenities" => @amenities
    }
    Property.create!(information)
end
puts "DONE, WAHOO"
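The round trip through nsr_properties.xml is mainly there to collect the repeated `<File>` and `<Amenity>` children. As a possible alternative, the same lists could be read straight from the parsed hash. The sketch below again uses plain hashes as stand-ins and assumes repeated children arrive as arrays and single children as one hash; `wrap`, `listing_attributes`, and the example.com URLs are hypothetical.

```ruby
# Wraps nil, a single value, or an array into an array.
def wrap(value)
  return [] if value.nil?
  value.is_a?(Array) ? value : [value]
end

# Builds part of the attribute hash for one listing from a parsed <Property>.
def listing_attributes(property)
  unit_info = property[:ILS_Unit] || {}
  {
    'short_description' => property.dig(:PropertyID, :MarketingName),
    'rent'              => property.dig(:Information, :Rents, :StandardRent),
    'images'            => wrap(property[:File]).map { |file| file[:Src] },
    'amenities'         => wrap(unit_info[:Amenity]).map { |amenity| amenity[:Description] }
  }
end

# Stand-in for one parsed property (assumed shape, placeholder URLs).
sample = {
  PropertyID:  { MarketingName: 'Spacious House Central Campus OSU, available fall' },
  Information: { Rents: { StandardRent: '2000.00' } },
  File:        [{ Src: 'http://example.com/1.jpg' }, { Src: 'http://example.com/2.jpg' }],
  ILS_Unit:    { Amenity: { AmenityType: 'Other', Description: 'Ceiling fans' } }
}

attrs = listing_attributes(sample)
puts attrs['images'].inspect
puts attrs['amenities'].inspect
```

Whether this is worth doing depends on how the real parsed objects behave; the two-pass version above has the advantage of already being known to work.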

【Comments】:
