Translating XLIFF files using BeautifulSoup: answers

【Question title】: Translating XLIFF files using BeautifulSoup
【Posted】: 2021-05-13 03:43:12
【Question description】:

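I am translating XLIFF files using BeautifulSoup and the googletrans package. I managed to extract all the strings and translate them, and to put the translations in place by creating new tags, e.g.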

<trans-unit id="100890::53706_004">
<source>Continue in store</source>
<target>Kontynuuj w sklepie</target>
</trans-unit>

The problem arises when the source tag contains other tags.

For example:

<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>

The number of these tags varies, and so does the order in which the strings appear, e.g. <source> text1 <x /> <x/> text2 <x/> text3 </source>. Each x tag is unique, with a different id and attributes.

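Is there a way to modify the text inside a tag without creating a new tag? I was thinking I could extract the x tags and their attributes, but the order of the strings and x tags differs a lot from line to line, and I am not sure how to do that. Or maybe there is another package better suited to translating XLIFF files?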

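【Comments】:

Please add the expected result for this <source> to the question. With BeautifulSoup you will probably have to use a for-loop or list() to get all the children of <source> and work with them.
Could you edit the question to show the output you want for the given source?
There are many tools (mostly commercial, some free) that make XLIFF translation a breeze. Try searching for "CAT tools".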

Tags: python beautifulsoup translation xliff

【Solution 1】:

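I would advise against using a generic XML parser to process XLIFF files. Instead, try to find a specialized XLIFF toolkit. There are some Python projects around, but I have no experience with them (I am mainly a Java person).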

【Discussion】:

【Solution 2】:

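You can use a for-loop to work with all the children of source.
You can duplicate each child with copy.copy(child) and append the copy to target.
At the same time you can check whether the child is a NavigableString and translate it.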

text = '''<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>'''

conversions = {
    'Choose your product': 'Wybierz swój produkt',
    'From a list: ': 'Z listy: ',
}

from bs4 import BeautifulSoup as BS
from bs4.element import NavigableString
import copy

#soup = BS(text, 'html.parser')  # has problems parsing this markup
#soup = BS(text, 'html5lib')     # has problems parsing this markup
soup = BS(text, 'lxml')

# create `<target>`
target = soup.new_tag('target')

# add `<target>` right after `<source>`
source = soup.find('source')
source.insert_after(target)

# work with children in `<source>`
for child in source:
    print('type:', type(child))

    # duplicate child and add to `<target>`
    child = copy.copy(child)
    target.append(child)

    # translate the text of the copied child in `<target>`;
    # fall back to the original text when no translation is known
    if isinstance(child, NavigableString):
        new_text = conversions.get(str(child), str(child))
        child.replace_with(new_text)

print('--- target ---')
print(target)
print('--- source ---')
print(source)
print('--- soup ---')
print(soup)

Result (slightly reformatted for readability):

type: <class 'bs4.element.Tag'>
type: <class 'bs4.element.NavigableString'>
type: <class 'bs4.element.Tag'>
type: <class 'bs4.element.NavigableString'>

--- target ---

<target>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Wybierz swój produkt
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  Z listy: 
</target>

--- source ---

<source>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Choose your product
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  From a list: 
</source>

--- soup ---

<html><body>
<source>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Choose your product
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  From a list: 
</source>
<target>
  <x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"></x>
  Wybierz swój produkt
  <x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"></x>
  Z listy: 
</target>
</body></html>
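
The output above shows that the 'lxml' HTML parser wraps everything in <html><body> and expands each <x/> into <x></x>. A minimal sketch of the same idea with BeautifulSoup's 'xml' parser (which also requires lxml) is given below; it does not add the HTML wrapper and loops over every trans-unit. The helper name add_target and the TRANSLATIONS lookup table are assumptions made for illustration, standing in for a real translation call such as googletrans.

```python
import copy

from bs4 import BeautifulSoup
from bs4.element import NavigableString

XLIFF = """<trans-unit id="100890::53706_004">
<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>
</trans-unit>"""

# Hypothetical lookup table; a real script would call a translation service here.
TRANSLATIONS = {
    'Choose your product': 'Wybierz swój produkt',
    'From a list: ': 'Z listy: ',
}


def add_target(soup, unit, translations):
    """Build a <target> that mirrors <source>, translating only the text nodes."""
    source = unit.find('source')
    target = soup.new_tag('target')
    for child in source.children:
        if isinstance(child, NavigableString):
            text = str(child)
            # Keep the original text when no translation is known.
            target.append(translations.get(text, text))
        else:
            # Copy inline tags such as <x/> unchanged, in their original order.
            target.append(copy.copy(child))
    source.insert_after(target)
    return target


soup = BeautifulSoup(XLIFF, 'xml')
for unit in soup.find_all('trans-unit'):
    add_target(soup, unit, TRANSLATIONS)

print(soup)
```

Because the children are processed in document order, the number and position of the x tags do not matter; each one ends up in <target> at the same position it had in <source>.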

【Discussion】:

【Solution 3】:

To extract the two text entries from <source>, you can use the following approach:

from bs4 import BeautifulSoup

html = """<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>"""

soup = BeautifulSoup(html, 'lxml')
print(list(soup.source.stripped_strings))

This gives you:

['Choose your product', 'From a list:']
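
The extraction above only reads the strings. To write translations back without creating new tags, as the question asks, each text node can be replaced in place. The following is a rough sketch, assuming a hypothetical translations dictionary keyed by the stripped text. Note that it overwrites the contents of <source>; in a real XLIFF file you would more likely apply it to a copy placed in <target>.

```python
from bs4 import BeautifulSoup

markup = """<source><x ctype="x-htmltag" equiv-text="&lt;b&gt;" id="html_tag_191"/>Choose your product\
<x ctype="x-htmltag" equiv-text="&lt;/b&gt;" id="html_tag_192"/>From a list: </source>"""

# Hypothetical lookup table keyed by the stripped source text.
translations = {
    'Choose your product': 'Wybierz swój produkt',
    'From a list:': 'Z listy:',
}

soup = BeautifulSoup(markup, 'lxml')

# find_all(string=True) returns the text nodes; the result is a list,
# so replacing nodes while looping over it is safe.
for node in soup.source.find_all(string=True):
    text = str(node)
    stripped = text.strip()
    if stripped in translations:
        # Swap only the stripped part so surrounding whitespace is preserved.
        node.replace_with(text.replace(stripped, translations[stripped]))

print(list(soup.source.stripped_strings))
```

The x tags are never touched, so their ids, attributes and positions stay exactly as they were.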

【Discussion】: