【Question title】: Substitute all xml text with beautifulsoup library
【Posted】: 2015-08-26 19:09:03
【Question】:

I need to replace all the text in an XML document using the BeautifulSoup library in Python. For example, I have this piece of XML:

<Paragraph>
Procedure general informations
<IntLink Target="il_0_mob_411" Type="MediaObject"/>
<Strong>DIFFICULTY: </Strong>
<IntLink Target="il_0_mob_231" Type="MediaObject"/>
<IntLink Target="il_0_mob_231" Type="MediaObject"/> - 
<Strong>DURATION:</Strong> 15 min.<br/>
<Strong>TOOLS REQUIRED:</Strong> 4mm Allen Key, Pin driver
</Paragraph>

I need it to become this:

<Paragraph>
0
<IntLink Target="il_0_mob_411" Type="MediaObject"/>
<Strong>1</Strong>
<IntLink Target="il_0_mob_231" Type="MediaObject"/>
<IntLink Target="il_0_mob_231" Type="MediaObject"/> - 
<Strong>2</Strong>3<br/>
<Strong>4</Strong>5
</Paragraph>

Thanks!

【Comments】:

Tags: python xml beautifulsoup placeholder


【Solution 1】:

The code is as follows:

# -*- coding: utf-8 -*-
# Python 3 version: html.unescape replaces HTMLParser.HTMLParser().unescape.
import html
import os

from bs4 import BeautifulSoup

# Parse as XML so tag names keep their case; the "xml" parser requires lxml.
with open("export_2.xml", encoding="utf-8") as xml_doc:
    soup = BeautifulSoup(xml_doc, "xml")

counter = 0
print("Starting export:")
with open("export.txt", "w", encoding="utf-8") as file_text:
    # find_all returns a list, so the tree can be modified inside the loop.
    for text in soup.find_all(string=True):
        # Skip whitespace-only nodes so they do not use up a number.
        if not text.strip():
            continue
        # Save "number;original text" (surrounding whitespace trimmed) ...
        file_text.write("{};{}\n".format(counter, html.unescape(text.strip())))
        # ... then put the number into the XML as the placeholder.
        text.replace_with(str(counter))
        counter += 1
        print(".", end="")

# Write the XML with placeholders in place of the original text.
with open("export_ph.xml", "w", encoding="utf-8") as file_xml:
    file_xml.write(str(soup))

file_xml_info = os.path.getsize("export_ph.xml")
file_txt_info = os.path.getsize("export.txt")
if file_txt_info > 0 and file_xml_info > 0:
    print("\nExport complete:\nXML file: {}B\n3-column text file: {}B".format(
        file_xml_info, file_txt_info))
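
To try the placeholder substitution without any files, here is a minimal sketch that works on an in-memory string. The function name replace_text_with_placeholders and the shortened sample are made up for illustration and are not part of the original answer. It assumes Python 3 with bs4 installed and uses the built-in "html.parser", which lowercases tag names; for real XML you would likely pass "xml" instead, which requires lxml.

```python
from bs4 import BeautifulSoup


def replace_text_with_placeholders(markup, parser="html.parser"):
    """Replace every non-blank text node with a running number.

    Returns the rewritten markup and a dict mapping number -> original text.
    (Hypothetical helper, written only for this illustration.)
    """
    soup = BeautifulSoup(markup, parser)
    originals = {}
    # find_all returns a list, so the tree can be modified while looping.
    for node in soup.find_all(string=True):
        stripped = node.strip()
        if not stripped:
            continue  # whitespace between tags gets no placeholder
        index = len(originals)
        originals[index] = stripped
        node.replace_with(str(index))
    return str(soup), originals


# Shortened sample based on the XML in the question.
sample = "<Paragraph>Intro <Strong>DURATION:</Strong> 15 min.<br/></Paragraph>"
new_markup, originals = replace_text_with_placeholders(sample)
print(new_markup)
print(originals)
```

The returned dict keeps the original strings keyed by placeholder number, so the same numbers can be used later to put text back into the document.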

【Comments】:
