Python: Import text from HTML or text document into Word

【Question title】: Python: Import text from HTML or text document into Word
【Posted】: 2020-07-19 00:50:12
【Question】:

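I have been looking over some of the documentation, but everything I have seen about docx is aimed mainly at working with text that is already in a Word document. What I would like to know is whether there is a simple way to take text from an HTML or text document and import it into a Word document wholesale, with all of the text from the HTML/text document. It does not seem to like the string; it is too long.

My understanding of the documentation is that you have to work with the text paragraph by paragraph. The task I want to do is relatively simple, but it is beyond my Python skills. I want to set the margins on a Word document and then import the text into it so that the text conforms to the margins I specified.

Does anyone have any thoughts? None of the previous posts I found were very helpful.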

import textwrap
import requests
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Inches


class DocumentWrapper(textwrap.TextWrapper):

    def wrap(self, text):
        split_text = text.split('\n\n')
        lines = [line for para in split_text for line in textwrap.TextWrapper.wrap(self, para)]
        return lines

page = requests.get("http://classics.mit.edu/Aristotle/prior.mb.txt")
soup = BeautifulSoup(page.text,"html.parser")

#we are going to pull in the text wrap extension that we have added.
#The typical width that we want tow
text_wrap_extension = DocumentWrapper(width=82,initial_indent="",fix_sentence_endings=True)
new_string = text_wrap_extension.fill(page.text)

final_document = "Prior_Analytics.txt"

with open(final_document, "w") as f:
    f.writelines(new_string)

document = Document(final_document)


### Specified margin specifications
sections = document.sections
for section in sections:
    section.top_margin = (Inches(1.00))
    section.bottom_margin = (Inches(1.00))
    section.right_margin = (Inches(1.00))
    section.left_margin = (Inches(1.00))

document.save(final_document)

The error I get is as follows:

docx.opc.exceptions.PackageNotFoundError: Package not found at 'Prior_Analytics.txt'

【Comments】:

Please help?

Tags: python beautifulsoup python-docx python-textprocessing

【Solution 1】:

I figured it out.

from docx import Document
from docx.shared import Inches

document = Document()  # start from a new, empty document
for section in document.sections:
    section.top_margin = Inches(2)
    section.bottom_margin = Inches(2)
    section.left_margin = Inches(2)
    section.right_margin = Inches(2)
# add_paragraph accepts text of whatever size; replace the placeholder with your text.
document.add_paragraph("Add your text here.")
# The name of the document goes here, as a string (this file name is only an example).
document.save("Prior_Analytics.docx")

【Comments】:

Also note that when you open the Word document, it may not look as though it reflects the dimensions you set. They are still there.

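【Solution 2】:

This error simply means that there is no .docx file at the location you specified. So you can modify your code to create the file if it does not exist.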

final_document = "Prior_Analytics.txt"

with open(final_document, "w+") as f:
    f.writelines(new_string)

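You are providing a relative path. How do you know what Python's current working directory is? That is the starting point for the relative path you gave.

A couple of lines like these will tell you: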

import os
print(os.path.realpath('./'))

Note:

docx is for opening .docx files
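
A .docx file is a ZIP-based package, so a plain-text file does not open as one even when it exists at the given path. A rough pre-check using only the standard library could look like the sketch below; `looks_like_docx` is a hypothetical helper name, and testing for `word/document.xml` is a heuristic rather than full validation.

```python
import zipfile
from pathlib import Path


def looks_like_docx(path):
    """Return True if path exists and looks like a Word package (heuristic check)."""
    path = Path(path)
    if not path.is_file() or not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as package:
        # The main document part of a Word file normally lives at this name.
        return "word/document.xml" in package.namelist()


candidate = Path("Prior_Analytics.txt")
print(candidate.resolve())         # where the relative path points from the current directory
print(looks_like_docx(candidate))  # whether that file could be opened as a Word package
```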

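【Comments】:

I am not sure how to import text into a Word document with specified margins. That is what I am trying to do.
Try here: geeksforgeeks.org/python-working-with-docx-module
I would really like to be able to create a Word document, manage the margins the way we discussed, and then dump the content into the document once the margins are set. Based on what you sent me, that does not seem possible. It looks like I have to manage the text paragraph by paragraph or line by line. Can you confirm whether that is your understanding as well?
I suppose I could use the HTML, create an object for each line, and iterate over those lines with some line management to dump it in that way. How would the script know where a new paragraph starts? That is what I am struggling with, along with headings/titles and so on...
I see. Let me see if I can help you.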