【Question Title】: Extract tag content from XML using BeautifulSoup
【Posted】: 2020-05-11 20:29:24
【Question Description】:

Thanks to the many threads I found here, I've managed to do some of what I wanted. But now I'm stuck. Any help would be much appreciated.

So I have this XML file with a few thousand records, from which I want to extract:

  • the content of tag 520 (the URL)
  • the content of tag 001 (the recno), but only for records where tag 520 is present

--> So the result should be a list of URLs + recnos.

Bonus points for helping me export the resulting list to a CSV instead of printing it to the screen ;)

# Import BeautifulSoup
from bs4 import BeautifulSoup as bs
content = []

# Read the XML file
with open("snippet_bilzen.xml", "r") as file:

    # Read each line in the file, readlines() returns a list of lines

    content = file.readlines()

    # Combine the lines in the list into a string
    content = "".join(content)
    bs_content = bs(content, "lxml")


#Get contents of tag 520 
rows_url = bs_content.find_all(tag="520")
for row in rows_url:          # Print all occurrences
    print(row.get_text())

    # trying to get contents of tag 001 where 520 occurs 
    rows_id = bs_content.find_all(tag="001")
    for row in rows_id:
        print(row.get_text())

Here is a part of the XML:

<record>
  <leader>00983nam a2200000 c 4500</leader>
  <controlfield tag="001">c:obg:160033</controlfield>
  <controlfield tag="005">20180605143926.1</controlfield>
  <controlfield tag="008">060214s1987    xx                u und  </controlfield>
  <datafield ind1="3" ind2=" " tag="024">
    <subfield code="a">0075992557726</subfield>
  </datafield>
  <datafield ind1="1" ind2="0" tag="245">
    <subfield code="a">Sign 'O' the times</subfield>
  </datafield>
  <datafield ind1="#" ind2="#" tag="260">
    <subfield code="b">Paisley Park</subfield>
    <subfield code="c">1987</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="300">
    <subfield code="a">2 cd's</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="306">
    <subfield code="a">01:19:51</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="340">
    <subfield code="a">cd</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="500">
    <subfield code="a">Met teksten</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="520">
    <subfield code="a">ill</subfield>
    <subfield code="u">http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.95.131.jpg.rm99991231.51210.17208</subfield>
  </datafield>
</record>
<record>
  <leader>00854nam a2200000 c 4500</leader>
  <controlfield tag="001">c:obg:157417</controlfield>
  <controlfield tag="005">20180725100810.1</controlfield>
  <controlfield tag="008">060214s1984    xx                u und  </controlfield>
  <datafield ind1="3" ind2=" " tag="024">
    <subfield code="a">0042282289827</subfield>
  </datafield>
  <datafield ind1="3" ind2=" " tag="024">
    <subfield code="a">4007196101944</subfield>
  </datafield>
  <datafield ind1="2" ind2=" " tag="024">
    <subfield code="a">JKX0823</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="028">
    <subfield code="a">IMCD 236/822 898-2</subfield>
  </datafield>
  <datafield ind1="1" ind2="3" tag="245">
    <subfield code="a">The unforgettable fire</subfield>
  </datafield>
  <datafield ind1="#" ind2="#" tag="260">
    <subfield code="b">Island Records</subfield>
    <subfield code="c">1984</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="300">
    <subfield code="a">1 cd</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="306">
    <subfield code="a">00:42:48</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="340">
    <subfield code="a">cd</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="520">
    <subfield code="a">ill</subfield>
    <subfield code="u">http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.31.88.jpg.rm99991231.19959.13742</subfield>
  </datafield>
</record>

【Comments】:

  • Could you edit your question and put a sample of the XML file in there? That gives us something to try code against...
  • I've just added a sample. So what I need are the 2 numbers from 001 and the 2 URLs from 520 u

Tags: python-3.x xml beautifulsoup


【Solution 1】:

Try this.

from simplified_scrapy import SimplifiedDoc,req,utils

html = utils.getFileContent('snippet_bilzen.xml')
doc = SimplifiedDoc(html)
rows_url = doc.selects('@tag=520').select('@code=u').text
rows_id = doc.selects('@tag=001').text
print(rows_url)
print(rows_id)

Result:

['http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.95.131.jpg.rm99991231.51210.17208', 'http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.31.88.jpg.rm99991231.19959.13742']
['c:obg:160033', 'c:obg:157417']
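
For anyone who prefers the standard library, the same pairing can also be sketched with xml.etree.ElementTree. A minimal sketch, assuming the records sit under a single wrapping root element; the two-record snippet from the question is embedded here (trimmed to the relevant fields) so the example is self-contained:

```python
# Pair each record's 001 recno with the URL in the 520/u subfield,
# using only xml.etree.ElementTree from the standard library.
import xml.etree.ElementTree as ET

# Trimmed two-record sample from the question, wrapped in a root element.
SAMPLE = """<collection>
<record>
  <controlfield tag="001">c:obg:160033</controlfield>
  <datafield tag="520">
    <subfield code="a">ill</subfield>
    <subfield code="u">http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.95.131.jpg.rm99991231.51210.17208</subfield>
  </datafield>
</record>
<record>
  <controlfield tag="001">c:obg:157417</controlfield>
  <datafield tag="520">
    <subfield code="u">http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.31.88.jpg.rm99991231.19959.13742</subfield>
  </datafield>
</record>
</collection>"""

root = ET.fromstring(SAMPLE)
pairs = []
for record in root.iter("record"):
    recno = record.find("controlfield[@tag='001']")
    url = record.find("datafield[@tag='520']/subfield[@code='u']")
    if recno is not None and url is not None:   # keep only records with both
        pairs.append((url.text, recno.text))

print(pairs)
```

For the real file you would replace `SAMPLE` with the file's contents (e.g. `ET.parse('snippet_bilzen.xml')`), provided the file has a single root element.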

【Comments】:

【Solution 2】:

If I understand correctly, you only want data from records where both a tag="520" and a tag="001" element are present:

from bs4 import BeautifulSoup

with open('snippet_bilzen.xml', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

data = []
for record in soup.select('record:has([tag="520"] > [code="u"]):has([tag="001"])'):
    tag_520 = record.select_one('[tag="520"] > [code="u"]') # select the URL subfield
    tag_001 = record.select_one('[tag="001"]')              # select the recno

    data.append([tag_520.get_text(strip=True), tag_001.get_text(strip=True)])

print(data)

Prints:

[['http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.95.131.jpg.rm99991231.51210.17208', 'c:obg:160033'],
 ['http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.31.88.jpg.rm99991231.19959.13742', 'c:obg:157417']]
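
The bonus CSV export the asker mentioned can be sketched with the stdlib csv module, assuming `data` is the list of [url, recno] pairs built above; the filename `urls_recnos.csv` and the header names are just examples:

```python
# Write the collected [url, recno] pairs to a CSV file instead of printing.
import csv

# Stand-in for the `data` list built in the answer above.
data = [
    ['http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.95.131.jpg.rm99991231.51210.17208', 'c:obg:160033'],
    ['http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.31.88.jpg.rm99991231.19959.13742', 'c:obg:157417'],
]

# newline='' is the documented way to open CSV files for writing on all platforms.
with open('urls_recnos.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    writer.writerow(['url', 'recno'])   # header row
    writer.writerows(data)              # one row per record
```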
    

【Comments】:

  • Perfect!! Thank you so much!