[Posted]: 2014-01-24 04:46:09
[Question]:
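I am using Xml::Parser, built on top of Nokogiri::XML::Reader, to extract entries from an XML file. I only want to grab the entries matching "Property/PropertyID/Identification ['OrganizationName' == 'northsteppe']", but I cannot work out the correct syntax. Below is the entire rake task I have been building, followed by a sample node containing all of the information and tags. Any guidance would be much appreciated.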
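================ UPDATE ===============
The file I am parsing is pulled with open-uri because it comes from an external source. I am only using a hard copy of an older version on my local machine to speed things up during development, since the file is 300MB+. I tried a SAX parser, but the logic seemed too complicated for me to really grasp what was going on, and I hit the same problem: limiting the properties I grab to those with "northsteppe" as the OrganizationName in the Identification tag. That said, I chose to attempt the same task with the current approach. I can get almost all of the information I need; I am only missing the last piece mentioned above.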
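=============== AS SPECIFIC AS POSSIBLE =============
I feel that describing the exact task I am trying to perform will help fill in any gaps. The task is as follows.
From the XML file, get every property that has OrganizationName = 'northsteppe' in its <Identification> tag, then gather all of the corresponding information for each property and insert it into a hash. Once all of the information for a single property has been collected in that hash, it needs to be uploaded as a single entry to the database, which is already structured the way it needs to be. After that property is inserted into the database, the rake task moves on to the next Property entry that has OrganizationName = 'northsteppe' in its <Identification> tag and repeats the process, until every property matching that criterion has been inserted. The goal is to let me run fast searches over the Northsteppe properties' data without bogging the system down with every property in the XML file.
Eventually I will use open-uri to pull the file from its external source and run a cron job that executes this rake task every 6 hours and replaces the database.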
================= CODE ==================
namespace :db do
  # RAKE TASK DESCRIPTION
  desc "Fetch property information and insert it into the database"
  # RAKE TASK NAME
  task :print_properties => :environment do
    require 'rubygems'
    require 'nokogiri'
    module Xml
      class Parser
        def initialize(node, &block)
          @node = node
          @node.each do
            self.instance_eval &block
          end
        end
        def name
          @node.name
        end
        def inner_xml
          @node.inner_xml.strip
        end
        def is_start?
          @node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
        end
        def is_end?
          @node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
        end
        def attribute(attribute)
          @node.attribute(attribute)
        end
        def for_element(name, &block)
          return unless self.name == name and is_start?
          self.instance_eval &block
        end
        def inside_element(name=nil, &block)
          return if @node.self_closing?
          return unless name.nil? or (self.name == name and is_start?)
          name = @node.name
          depth = @node.depth
          @node.each do
            return if self.name == name and is_end? and @node.depth == depth
            self.instance_eval &block
          end
        end
      end
    end
    Xml::Parser.new(Nokogiri::XML::Reader(open("app/assets/xml/mits.xml"))) do
      inside_element 'Property' do
        # OPEN AND PARSE THE <PropertyID> TAG
        inside_element 'PropertyID' do
          inside_element 'Identification' do
            puts attribute_nodes()
          end
          # OPEN AND PARSE THE <Address> TAG
          inside_element 'Address' do
            for_element 'AddressLine1' do puts "Street Address: #{inner_xml}" end
            for_element 'City' do puts "City: #{inner_xml}" end
            for_element 'PostalCode' do puts "Zipcode: #{inner_xml}" end
          end
          for_element 'MarketingName' do puts "Short Description: #{inner_xml}" end
        end
        # OPEN AND PARSE THE <Information> TAG
        inside_element 'Information' do
          for_element 'LongDescription' do puts "Long Description: #{inner_xml}" end
          inside_element 'Rents' do
            for_element 'StandardRent' do puts "Rent: #{inner_xml}" end
          end
        end
        inside_element 'Fee' do
          for_element 'ApplicationFee' do puts "Application Fee: #{inner_xml}" end
        end
        inside_element 'ILS_Identification' do
          for_element 'Latitude' do puts "Latitude: #{inner_xml}" end
          for_element 'Longitude' do puts "Longitude: #{inner_xml}" end
        end
      end
    end
  end #END INSERT_PROPERTIES TASK
end #END NAMESPACE
And a sample of the XML:
<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<PropertyID>
<Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
<Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
<MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
<WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite>
<Address AddressType="property">
<Description>Address of Available Listing</Description>
<AddressLine1>1689 N 4th St </AddressLine1>
<City>Columbus</City>
<State>OH</State>
<PostalCode>43201</PostalCode>
<Country>US</Country>
</Address>
<Phone PhoneType="office">
<PhoneNumber>(614) 299-4110</PhoneNumber>
</Phone>
<Email>northsteppe.nsr@gmail.com</Email>
</PropertyID>
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate">
<Latitude>39.997694</Latitude>
<Longitude>-82.99903</Longitude>
<LastUpdate Month="11" Day="11" Year="2013"/>
</ILS_Identification>
<Information>
<StructureType>Standard</StructureType>
<UnitCount>1</UnitCount>
<ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription>
<LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription>
<Rents>
<StandardRent>2000.00</StandardRent>
</Rents>
<PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL>
</Information>
<Fee>
<ProrateType>Standard</ProrateType>
<LateType>Standard</LateType>
<LatePercent>0</LatePercent>
<LateMinFee>0</LateMinFee>
<LateFeePerDay>0</LateFeePerDay>
<NonRefundableHoldFee>0</NonRefundableHoldFee>
<AdminFee>0</AdminFee>
<ApplicationFee>30.00</ApplicationFee>
<BrokerFee>0</BrokerFee>
</Fee>
<Deposit DepositType="Security Deposit">
<Amount AmountType="Actual">
<ValueRange Exact="2000.00" Currency="USD"/>
</Amount>
</Deposit>
<Policy>
<Pet Allowed="false"/>
</Policy>
<Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<Name/>
<Description/>
<UnitCount>1</UnitCount>
<RentableUnits>1</RentableUnits>
<TotalSquareFeet>0</TotalSquareFeet>
<RentableSquareFeet>0</RentableSquareFeet>
</Phase>
<Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<Name/>
<Description/>
<UnitCount>1</UnitCount>
<SquareFeet>0</SquareFeet>
</Building>
<Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<Name/>
<UnitCount>1</UnitCount>
<Room RoomType="Bedroom">
<Count>4</Count>
<Comment/>
</Room>
<Room RoomType="Bathroom">
<Count>1</Count>
<Comment/>
</Room>
<SquareFeet Min="0" Max="0"/>
<MarketRent Min="2000" Max="2000"/>
<EffectiveRent Min="2000" Max="2000"/>
</Floorplan>
<ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<Units>
<Unit>
<Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/>
<MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
<UnitBedrooms>4</UnitBedrooms>
<UnitBathrooms>1.0</UnitBathrooms>
<MinSquareFeet>0</MinSquareFeet>
<MaxSquareFeet>0</MaxSquareFeet>
<SquareFootType>internal</SquareFootType>
<UnitRent>2000.00</UnitRent>
<MarketRent>2000.00</MarketRent>
<Address AddressType="property">
<AddressLine1>1689 N 4th St </AddressLine1>
<City>Columbus</City>
<PostalCode>43201</PostalCode>
<Country>US</Country>
</Address>
</Unit>
</Units>
<Availability>
<VacateDate Month="7" Day="23" Year="2014"/>
<VacancyClass>Unoccupied</VacancyClass>
<MadeReadyDate Month="7" Day="23" Year="2014"/>
</Availability>
<Amenity AmenityType="Other">
<Description>All new stainless steel appliances! Refinished hardwood floors</Description>
</Amenity>
<Amenity AmenityType="Other">
<Description>Ceramic tile</Description>
</Amenity>
<Amenity AmenityType="Other">
<Description>Ceiling fans</Description>
</Amenity>
<Amenity AmenityType="Other">
<Description>Wrap-around porch</Description>
</Amenity>
<Amenity AmenityType="Dryer">
<Description>Free Washer and Dryer</Description>
</Amenity>
<Amenity AmenityType="Washer">
<Description>Free Washer and Dryer</Description>
</Amenity>
<Amenity AmenityType="Other">
<Description>off-street parking available</Description>
</Amenity>
</ILS_Unit>
<File Active="true" FileID="820982141">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src>
<Width>360</Width>
<Height>300</Height>
<Rank>1</Rank>
</File>
<File Active="true" FileID="820982145">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>2</Rank>
</File>
<File Active="true" FileID="820982149">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>3</Rank>
</File>
<File Active="true" FileID="820982152">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>4</Rank>
</File>
<File Active="true" FileID="820982155">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>5</Rank>
</File>
<File Active="true" FileID="820982157">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>6</Rank>
</File>
<File Active="true" FileID="820982160">
<FileType>Photo</FileType>
<Description>Unit Photo</Description>
<Name/>
<Caption/>
<Format>image/jpeg</Format>
<Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src>
<Width>350</Width>
<Height>265</Height>
<Rank>7</Rank>
</File>
</Property>
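The filtering step described in the task (keep a Property only when its PropertyID/Identification has OrganizationName = 'northsteppe', collect its fields into a hash, hand over one hash per property) could be sketched roughly as below. This is not the task's code: it swaps in REXML's stream parser, which ships with Ruby, in place of Nokogiri so that it runs without extra gems. `PropertyListener`, `each_property_for`, the `<Feed>` wrapper element, the shortened sample and the hash keys are all made up for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener (not from the original task). It builds one hash per
# <Property> and yields it only when PropertyID/Identification carries the
# wanted OrganizationName.
class PropertyListener
  include REXML::StreamListener

  # Leaf elements to keep, keyed by their path below <Property>.
  # The symbols on the right are illustrative names.
  FIELDS = {
    'PropertyID/MarketingName'        => :short_description,
    'PropertyID/Address/AddressLine1' => :street_address,
    'PropertyID/Address/City'         => :city,
    'PropertyID/Address/PostalCode'   => :zipcode,
    'Information/LongDescription'     => :long_description,
    'Information/Rents/StandardRent'  => :rent,
    'Fee/ApplicationFee'              => :application_fee,
    'ILS_Identification/Latitude'     => :latitude,
    'ILS_Identification/Longitude'    => :longitude
  }.freeze

  def initialize(organization, &on_match)
    @organization = organization
    @on_match = on_match
    @current = nil    # hash under construction; nil while outside <Property>
    @path = []        # element names between <Property> and the current node
    @matched = false
    @text = +''
  end

  def tag_start(name, attrs)
    @text = +''       # start collecting text for this element
    if @current.nil?
      # Outside a property, only an opening <Property> is interesting.
      return unless name == 'Property'
      @current = {}
      @path = []
      @matched = false
      return
    end
    @path.push(name)
    # Comparing the full path skips the unit-level <Identification>
    # found under ILS_Unit/Units/Unit.
    if @path == %w[PropertyID Identification] &&
       attrs['OrganizationName'] == @organization
      @matched = true
    end
  end

  def text(chunk)
    @text << chunk
  end

  def tag_end(_name)
    return if @current.nil?
    if @path.empty?
      # This is </Property>: emit the hash only if the filter matched.
      @on_match.call(@current) if @matched
      @current = nil
      return
    end
    key = FIELDS[@path.join('/')]
    @current[key] = @text.strip if key
    @path.pop
  end
end

# Streams +source+ (a String or an IO) and yields one hash per matching property.
def each_property_for(organization, source, &block)
  REXML::Document.parse_stream(source, PropertyListener.new(organization, &block))
end

# Shortened, made-up sample in the shape of the node shown above.
SAMPLE_XML = <<~XML
  <Feed>
    <Property IDValue="1">
      <PropertyID>
        <Identification IDValue="1" OrganizationName="northsteppe" IDType="property"/>
        <MarketingName>Spacious House</MarketingName>
        <Address AddressType="property">
          <AddressLine1>1689 N 4th St </AddressLine1>
          <City>Columbus</City>
          <PostalCode>43201</PostalCode>
        </Address>
      </PropertyID>
      <Information><Rents><StandardRent>2000.00</StandardRent></Rents></Information>
      <ILS_Unit><Units><Unit>
        <Identification IDValue="1" OrganizationName="UL Portfolio"/>
      </Unit></Units></ILS_Unit>
    </Property>
    <Property IDValue="2">
      <PropertyID>
        <Identification IDValue="2" OrganizationName="someone-else" IDType="property"/>
        <MarketingName>Other Listing</MarketingName>
      </PropertyID>
    </Property>
  </Feed>
XML

each_property_for('northsteppe', SAMPLE_XML) do |row|
  puts "#{row[:short_description]}: #{row[:street_address]}, #{row[:city]}"
end
```

The block passed to `each_property_for` is where a single database insert per property would go, so only one property's hash is held at a time. In the Reader-based task above, the `Identification` elements are self-closing, so `inside_element 'Identification'` returns before its block runs; `for_element 'Identification'` combined with `attribute('OrganizationName')` would likely be the place for the same check.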
[Comments]:
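- amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb I also found ox to be 5 times faster than nokogiri when reading large XML. Also, I wrote a wrapper that lets you search a large XML file with ox, iterating over the elements you specify. gist.github.com/amolpujari/5966431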
Tags: ruby xml xml-parsing nokogiri