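Extracting text from plain HTML and writing it to a new file

【Question title】: Extracting text from plain HTML and write to new file
【Posted】: 2016-07-10 13:03:47
【Question description】:

I am extracting a certain part of an HTML document (to be fair: the basis is an iXBRL document, which means there is a lot of formatting code inside it) and writing my output, the original file without the extracted part, to a .txt file. My goal is to measure the difference in document size (how many KB of the original document belong to the extracted part). As far as I know, going from HTML to a text file should not change anything, so the difference should be reliable even though I am comparing two different document formats. My code so far is: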

import glob
import os
import contextlib
import re


@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()


def extractor():
    os.chdir(r"F:\Test")
    with stdout2file("FileShortened.txt"):
        for file in glob.iglob('*.html', recursive=True):
            with open(file) as f:
                contents = f.read()
                extract = re.compile(r'(This is the beginning of).*?Until the End', re.I | re.S)
                cut = extract.sub('', contents)
                print(file.split(os.path.sep)[-1], end="| ")
                print(cut, end="\n")
extractor()
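
A side note on the `stdout2file` helper above: if anything inside the `with` block raises, `sys.stdout` is never restored and the file is never closed. The following is a minimal sketch of the same idea built on the standard library's `contextlib.redirect_stdout` (Python 3.4+), which restores stdout on exit even after an exception. The name `stdout_to_file` is only an illustrative choice, not part of the original code.

```python
import contextlib


@contextlib.contextmanager
def stdout_to_file(fname):
    """Send everything print() writes inside the with-block to fname."""
    # Both context managers are unwound on exit, including on an exception:
    # stdout is restored first, then the file is closed.
    with open(fname, "w") as f, contextlib.redirect_stdout(f):
        yield
```

It would be used exactly like the original helper, for example `with stdout_to_file("FileShortened.txt"): ...`.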

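Note: I am not using BS4 or lxml because I am not only interested in the HTML text but in every line between my start and end RegEx, including all the formatting code.

My code runs without problems, but since I have a lot of files, my FileShortened.txt document quickly becomes very large. My problem is not the files or the extraction but redirecting my output to separate txt files. Right now everything ends up in one file; what I need is some kind of "for every file searched, create a new txt file with the same name as the original document" condition (arcpy module?!).

Something like: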

File1.html --> File1Short.txt

File2.html --> File2Short.txt ...

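Is there an easy way (without changing my code too much) to invert my code, so that it prints the RegEx match to a new .txt file instead of everything except the match?

Any help is appreciated!

【Comments】:

Can you add a sample of the file?
@PadraicCunningham: Not at the moment, as I am on a different computer. I will later. But my problem is not the files or the extraction but redirecting my output to separate txt files. Right now everything ends up in one file; what I need is some kind of "for every file searched, create a new txt file with the same name as the original document" condition (arcpy module?!).
Hmm, OK. That is straightforward.
I edited my answer to make it easier to understand.
So pastebin.com/MD0EM63G? I just used open and write because I find it more readable; I also forgot to remove the with stdout2 ...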

Tags: html python-3.x pycharm xbrl

【Solution 1】:

OK, I figured it out. The final code is:

import glob
import os
import re


def extractor():
    os.chdir(r"F:\Test")  # the directory containing my html
    extract = re.compile(r'Start.*?End', re.I | re.S)  # compiled once, reused for every file
    for file in glob.glob("*.html"):  # iterates over all files in the directory ending in .html
        # File1.html -> File1.txt; the with-block closes both files automatically
        with open(file) as f, open((file.rsplit(".", 1)[0]) + ".txt", "w") as out:
            contents = f.read()
            cut = extract.sub('', contents)  # remove the matched section, keep the rest
            out.write(cut)
extractor()
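
The question also asked for the inverse: writing only the matched section to a per-file `...Short.txt`. The sketch below is one possible way to do that, not part of the original answer. It assumes the same `Start.*?End` pattern as the code above and that only the first match in each file matters. The function name `split_html` and the returned byte counts are illustrative choices. It works on raw bytes so that the reported sizes are exact and no text encoding has to be guessed.

```python
import re
from pathlib import Path

# Same pattern as in the answer above, compiled as a bytes pattern.
PATTERN = re.compile(rb"Start.*?End", re.I | re.S)


def split_html(html_path, pattern=PATTERN):
    """Write the first regex match in html_path to <name>Short.txt.

    Returns (original_bytes, extracted_bytes). If the pattern does not
    match, no file is written and extracted_bytes is 0.
    """
    html_path = Path(html_path)
    data = html_path.read_bytes()
    match = pattern.search(data)  # search keeps the match; sub('') discards it
    if match is None:
        return len(data), 0
    # File1.html -> File1Short.txt, next to the original file
    out_path = html_path.with_name(html_path.stem + "Short.txt")
    out_path.write_bytes(match.group(0))
    return len(data), len(match.group(0))
```

Calling `split_html` on each path returned by `glob.glob("*.html")` would produce File1Short.txt, File2Short.txt and so on, and the two returned numbers give the share of each original document taken up by the extracted part.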

【Comments】: