【Posted】: 2020-09-11 04:06:25
【Question】:
I am trying to remove some empty text tags from the XML file returned by a Python function, but I get this error: TypeError: object of type 'lxml.etree._ElementTree' has no len(). Why?
Here is the function:
def due(pdfpath):
    ntree = uniform_cm(pdfpath)
    etree.strip_tags(ntree, 'textline')
    # Search for all "textbox" elements
    for textbox in ntree.xpath('//textbox'):
        new_line = etree.Element("new_line")
        previous_bb = None
        # From a given textbox element, iterate over all the "text" elements
        for x in textbox.iter("text"):
            # Get the current bb value
            bb = getBBoxFirstValue(x)
            # Check that the current and previous values aren't empty
            if bb is not None and previous_bb is not None and (bb - previous_bb) > 20:
                # Insert new_line into the parent tag
                x.getparent().insert(x.getparent().index(x), new_line)
                # A new "new_line" element is created
                new_line = etree.Element("new_line")
            # Append the current element to the new_line tag
            new_line.append(x)
            # Keep the latest non-empty BBox first value
            if bb is not None:
                previous_bb = bb
        # Add the last new_line element
        textbox.append(new_line)
    tree = ntree
    soup = BeautifulSoup(tree, "lxml")
    for x in soup.find_all():
        if len(x.get_text(strip=True)) == 0:
            x.extract()
    return tree
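A note on the likely cause: the BeautifulSoup constructor expects markup as a string or a file-like object, and it appears to call len() on whatever it receives, which an lxml _ElementTree does not support. Also, even with a serialized string, the soup would be a separate copy, so calling extract() on it would not change the tree that the function returns. One possible way around both issues is to drop BeautifulSoup and remove the empty elements on the tree itself. The following is only a minimal sketch under assumptions: remove_empty_text and the sample XML are hypothetical, "empty" is taken to mean a text element with no children and only whitespace, and the standard-library ElementTree is imported so the snippet runs without extra dependencies (the same calls exist in lxml.etree, so swapping the import should behave the same way).

```python
import xml.etree.ElementTree as ET  # lxml.etree offers the same calls used here


def remove_empty_text(tree):
    """Drop every <text> element that has no children and only whitespace text.

    Hypothetical helper, not part of the original function.
    """
    # Accept either an ElementTree wrapper or a bare root element.
    root = tree.getroot() if hasattr(tree, "getroot") else tree
    # Snapshot the elements first so removals do not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            is_empty = len(child) == 0 and not (child.text or "").strip()
            if child.tag == "text" and is_empty:
                parent.remove(child)
    return tree


# Hypothetical sample input, loosely modelled on the element names in the question.
sample = ET.ElementTree(ET.fromstring(
    "<pages><textbox><new_line>"
    "<text>A</text><text> </text><text/><text>B</text>"
    "</new_line></textbox></pages>"
))
cleaned = remove_empty_text(sample)
remaining = [t.text for t in cleaned.getroot().iter("text")]
print(remaining)
```

In the function above, this would presumably replace the BeautifulSoup block, for example `return remove_empty_text(ntree)`. Note that removing an element this way also discards any tail text attached to it.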
【Comments】:
Tags: python python-3.x beautifulsoup lxml elementtree