【Question title】: Python: read data from XML
【Posted】: 2014-03-14 11:11:49
【Question】:

I am using scrapy in Python.

I am trying to read my XPath expressions from an XML file, like this:

def getMasterContainers(self):
    containers=[]
    containersFromXML = self.doc.findall('MasterPage/Containers/xpath')
    for oneXpath in containersFromXML:
        containers.append(oneXpath.text)
    return containers

The XML file is:

<Containers>
  <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
</Containers>

When I print the result in cmd, I get this:

container = ''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''

My problem

When I call sel.xpath(self.containers[0]) I get no results, but when I write the XPath directly in the code, as in sel.xpath('xpath written by hand'), I get the correct data.

Please help.

【Question comments】:

    Tags: python xml python-2.7 xpath scrapy


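    【Solution 1】:

    Update: Are you sure the problem is with this XPath? Have you confirmed that nothing fails before or after it? I wasn't sure how to run a scrape with scrapy, so I ran the XML parsing by hand and then ran the code below against both a test document and the real document. Both worked for me.

    first.xml contains only the xpath and its parent structure: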

    <websiteInformation>
      <MasterPage>
        <Containers>
          <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
        </Containers>
      </MasterPage>
    </websiteInformation>
    

    Then parse first.xml:

    from lxml import etree
    
    doc = etree.parse(open('first.xml'))
    
    containers = []
    containersFromXML = doc.findall('MasterPage/Containers/xpath')
    for oneXpath in containersFromXML:
        print oneXpath.text
        containers.append(oneXpath.text)
    

    Output:

    .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
    

    Looks good.

    test.html is:

    <html>
      <body>
        <div id="results-list">
          <div class="item paid-featured-item">
            <div class="listing-item">Found A</div>
          </div>
          <div class="item paid-featured-item">
            <div class="listing-item">Found B</div>
          </div>
        </div>
      </body>
    </html>
    

    Then search it:

    from scrapy.selector import Selector
    
    sel = Selector(text=open('test.html').read())
    for container in containers:
        print "Xpath: {}".format(container)
        result = sel.xpath(container)
        print "Container: {}".format(len(result))
        for elem in result:
            print elem
    

    Output:

    Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
    Container: 2
    <Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found A</div>'>
    <Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found B</div>'>
    

    Results of the same search run against the real URL, using a copy of the page fetched with wget:

    Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
    Container: 25
    <Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n        \n    '>
    # omitted 23
    <Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n        \n    '>
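
    For readers without scrapy installed, the test-document check above can be approximated with the standard library alone. This is only a sketch: it swaps scrapy's Selector for xml.etree.ElementTree, which supports just a limited XPath subset and only works here because test.html happens to be well-formed XML. The helper name match_texts and the inline strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Inline copies of first.xml and test.html from the answer.
CONFIG_XML = """
<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>
"""

PAGE_HTML = """
<html>
  <body>
    <div id="results-list">
      <div class="item paid-featured-item">
        <div class="listing-item">Found A</div>
      </div>
      <div class="item paid-featured-item">
        <div class="listing-item">Found B</div>
      </div>
    </div>
  </body>
</html>
"""


def match_texts(page_root, xpath):
    """Hypothetical helper: text of every element matched by `xpath`."""
    return [elem.text for elem in page_root.findall(xpath)]


config_root = ET.fromstring(CONFIG_XML.strip())
page_root = ET.fromstring(PAGE_HTML.strip())

# Same lookup as in the answer: collect the text of every <xpath> element.
containers = [node.text for node in config_root.findall("MasterPage/Containers/xpath")]

for xpath in containers:
    texts = match_texts(page_root, xpath)
    print("Xpath: {}".format(xpath))
    print("Container: {}".format(len(texts)))
    print(texts)
```

    If this finds both listing items with a clean XPath string, the lookup logic itself is sound and the contents of the string read from the XML file become the main suspect.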
    

    It looks like your xpath string has extra single quotes (') where there shouldn't be any. In the XML it looks like this:

    <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
    

    When parsed, it comes out as (as you saw when you printed it):

    ''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''
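
    The doubled quotes come from two sources: the literal ' characters in the element text, plus the &apos; entities, which an XML parser decodes to ' as well. A minimal sketch of this, using the standard library's xml.etree.ElementTree instead of lxml (an assumption made so the snippet has no third-party dependency):

```python
import xml.etree.ElementTree as ET

# The <Containers> block from the question, quotes and entities included.
xml_snippet = (
    "<Containers>"
    "<xpath>'&apos;.//div[@id=\"results-list\"]"
    "/div[@class=\"item paid-featured-item\"]"
    "/div[@class=\"listing-item\"]&apos;'</xpath>"
    "</Containers>"
)

raw = ET.fromstring(xml_snippet).find("xpath").text

# repr() makes the leading and trailing quote characters easy to see.
print(repr(raw))
print(raw.startswith("''"), raw.endswith("''"))
```

    Printing with repr() rather than a plain print is a cheap way to compare a string loaded from a file with the one typed by hand in the code.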
    

    You don't want the surrounding 's. It should look like this:

    .//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]
    

    If you can edit the XML file that contains the xpaths, remove the leading '&apos; and the trailing &apos;' from each <xpath> element. So:

    <Containers>
      <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
    </Containers>
    

    should become:

    <Containers>
      <xpath>.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]</xpath>
    </Containers>
    

    But if for some reason you cannot edit the XML file, strip the surrounding 's after you read the xpath text. So:

    containers.append(oneXpath.text)
    

    should become:

    containers.append(oneXpath.text.strip("'"))
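
    The same clean-up can be wrapped in a small loader that also copes with stray whitespace and empty elements. This is a sketch under assumptions: load_xpaths is a hypothetical helper name, the sample config is invented for illustration, and xml.etree.ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET


def load_xpaths(root, path="MasterPage/Containers/xpath"):
    """Hypothetical helper: return cleaned XPath strings found under `path`."""
    xpaths = []
    for node in root.findall(path):
        text = (node.text or "").strip()  # drop surrounding whitespace/newlines
        text = text.strip("'")            # then any wrapping single quotes
        if text:                          # skip empty <xpath/> elements
            xpaths.append(text)
    return xpaths


# Invented sample config: one quoted entry, one padded entry, one empty entry.
config = ET.fromstring("""
<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>'&apos;.//div[@id="results-list"]/div[@class="listing-item"]&apos;'</xpath>
      <xpath>
        .//div[@class="listing-item"]
      </xpath>
      <xpath></xpath>
    </Containers>
  </MasterPage>
</websiteInformation>
""".strip())

print(load_xpaths(config))
```

    Note that strip("'") removes every leading and trailing single quote, so an expression that legitimately ends in a quoted literal would be damaged. Fixing the XML file remains the cleaner option.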
    

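    【Comments】:

    • I have tried removing both quotes a million times and still get the same error. Can I send you the whole code? It is just a small script.
    • @MarcoDinatsoli Edit your question and put the whole code there, and I will take a look.
    • I have posted the whole code along with the XML file. Please note that I will have to delete it after a while. Thanks for your help.
    • I have posted the code. If you are not able to look at it, please let me know so that I can delete it. Thanks for your understanding.
    • @MarcoDinatsoli Feel free to delete whatever you need to; I have made a copy.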