【Posted】: 2020-05-11 20:29:24
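【Question】:
Thanks to the many threads I found here, I have managed to get part of what I want working, but now I am stuck. Any help would be appreciated.
I have an XML file containing a few thousand records, and I want to extract from it:
- the content of tag 520 (the URL)
- the content of tag 001 (recno), but only for records where a tag 520 is present
--> So the result should be a list of URLs paired with their recnos.
Bonus points for helping me export the result to CSV instead of printing it to the screen ;)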
# Import BeautifulSoup
from bs4 import BeautifulSoup as bs

# Read the whole XML file into a single string
with open("snippet_bilzen.xml", "r") as file:
    content = file.read()

bs_content = bs(content, "lxml")

# Get the contents of every element with tag="520"
rows_url = bs_content.find_all(tag="520")
for row in rows_url:  # Print all occurrences
    print(row.get_text())

# Trying to get the contents of tag 001 where 520 occurs
rows_id = bs_content.find_all(tag="001")
for row in rows_id:
    print(row.get_text())
Here is part of the XML:
<record>
<leader>00983nam a2200000 c 4500</leader>
<controlfield tag="001">c:obg:160033</controlfield>
<controlfield tag="005">20180605143926.1</controlfield>
<controlfield tag="008">060214s1987 xx u und </controlfield>
<datafield ind1="3" ind2=" " tag="024">
<subfield code="a">0075992557726</subfield>
</datafield>
<datafield ind1="1" ind2="0" tag="245">
<subfield code="a">Sign 'O' the times</subfield>
</datafield>
<datafield ind1="#" ind2="#" tag="260">
<subfield code="b">Paisley Park</subfield>
<subfield code="c">1987</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="300">
<subfield code="a">2 cd's</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="306">
<subfield code="a">01:19:51</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="340">
<subfield code="a">cd</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="500">
<subfield code="a">Met teksten</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="520">
<subfield code="a">ill</subfield>
<subfield code="u">http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.95.131.jpg.rm99991231.51210.17208</subfield>
</datafield>
</record>
<record>
<leader>00854nam a2200000 c 4500</leader>
<controlfield tag="001">c:obg:157417</controlfield>
<controlfield tag="005">20180725100810.1</controlfield>
<controlfield tag="008">060214s1984 xx u und </controlfield>
<datafield ind1="3" ind2=" " tag="024">
<subfield code="a">0042282289827</subfield>
</datafield>
<datafield ind1="3" ind2=" " tag="024">
<subfield code="a">4007196101944</subfield>
</datafield>
<datafield ind1="2" ind2=" " tag="024">
<subfield code="a">JKX0823</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="028">
<subfield code="a">IMCD 236/822 898-2</subfield>
</datafield>
<datafield ind1="1" ind2="3" tag="245">
<subfield code="a">The unforgettable fire</subfield>
</datafield>
<datafield ind1="#" ind2="#" tag="260">
<subfield code="b">Island Records</subfield>
<subfield code="c">1984</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="300">
<subfield code="a">1 cd</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="306">
<subfield code="a">00:42:48</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="340">
<subfield code="a">cd</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="520">
<subfield code="a">ill</subfield>
<subfield code="u">http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.31.88.jpg.rm99991231.19959.13742</subfield>
</datafield>
</record>
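Working from the sample above, one possible way to keep each URL together with its record number is to loop over every <record> and look up both fields inside that record, rather than searching the whole document twice. The sketch below is only an illustration and makes several assumptions: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, it assumes the records sit inside a single root element (called collection here, a hypothetical name), and it assumes the elements carry no XML namespace. The sample string is a trimmed copy of the two records above plus one hypothetical record without a 520 field; extract_pairs is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample records; the <collection> wrapper and the
# third record are assumptions added for illustration only.
SAMPLE = """<collection>
<record>
<controlfield tag="001">c:obg:160033</controlfield>
<datafield ind1=" " ind2=" " tag="520">
<subfield code="a">ill</subfield>
<subfield code="u">http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.95.131.jpg.rm99991231.51210.17208</subfield>
</datafield>
</record>
<record>
<controlfield tag="001">c:obg:157417</controlfield>
<datafield ind1=" " ind2=" " tag="520">
<subfield code="a">ill</subfield>
<subfield code="u">http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.31.88.jpg.rm99991231.19959.13742</subfield>
</datafield>
</record>
<!-- hypothetical record without a 520 field -->
<record>
<controlfield tag="001">example:no-520</controlfield>
<datafield ind1=" " ind2=" " tag="500">
<subfield code="a">Met teksten</subfield>
</datafield>
</record>
</collection>"""


def extract_pairs(root):
    """Return (recno, url) tuples for each record that has a 520 'u' subfield."""
    pairs = []
    for record in root.iter("record"):
        # Field 001 holds the record number.
        recno = record.findtext("controlfield[@tag='001']")
        # A record could hold more than one 520 field, so collect every 'u' subfield.
        for sub in record.findall("datafield[@tag='520']/subfield[@code='u']"):
            if recno and sub.text:
                pairs.append((recno.strip(), sub.text.strip()))
    return pairs


root = ET.fromstring(SAMPLE)
pairs = extract_pairs(root)
for recno, url in pairs:
    print(recno, url)
```

For a file on disk, ET.parse("snippet_bilzen.xml").getroot() could stand in for ET.fromstring(SAMPLE), provided the file is well-formed XML with a single root element. Records without a 520 field simply contribute no rows.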
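【Comments】:
-
Could you edit your question and include a sample from the XML file? That would give us something to try code against...
-
I have just added a sample. What I need are the 2 record numbers in 001 and the 2 URLs in 520 u.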
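For the CSV part of the question, a minimal sketch using Python's standard csv module might look like the following. It assumes the (recno, URL) pairs have already been collected into a list of tuples; the two pairs below are copied from the sample records, while the helper name write_pairs_csv, the column headers and the output filename mentioned in the comment are hypothetical choices.

```python
import csv
import io

# (recno, url) pairs copied from the two sample records in the question.
pairs = [
    ("c:obg:160033",
     "http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.95.131.jpg.rm99991231.51210.17208"),
    ("c:obg:157417",
     "http://geapbib001.cipal.be/docman/docman.phtml?file=authorities.87.31.88.jpg.rm99991231.19959.13742"),
]


def write_pairs_csv(pairs, fileobj):
    """Write a header row followed by one row per (recno, url) pair."""
    writer = csv.writer(fileobj)
    writer.writerow(["recno", "url"])  # hypothetical column names
    writer.writerows(pairs)


# In-memory demonstration. To write a real file instead, pass something like
# open("urls_recnos.csv", "w", newline="", encoding="utf-8") (hypothetical filename).
buffer = io.StringIO()
write_pairs_csv(pairs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Opening the output file with newline="" is what the csv module documentation recommends, so that the writer controls line endings itself.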
Tags: python-3.x xml beautifulsoup