Wikipedia Extractor as a parser for Wikipedia Data Dump File

【Question title】: Wikipedia Extractor as a parser for Wikipedia Data Dump File
【Posted】: 2020-03-11 03:05:35
【Question description】:

I tried to use "Wikipedia Extractor" (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to convert a bz2 dump to text. I downloaded the Wikipedia dump with the bz2 extension, then ran this line on the command line:

WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2

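This gave me the result that can be seen in the link:

However, when I follow the next instruction, "In order to combine the whole extracted text into a single file one can issue:"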

> find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
> rm -rf extracted

I get the following error:

File not found - '*bz2'

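What can I do?

【Comments】:

When you cd into the extracted directory, do you see any bz2 files?
After running your first command, WikiExtractor.py ..., a folder named extracted should (presumably) have been created. Try running cd extracted, then ls if you are on Linux/macOS, or dir if you are on Windows cmd. This should list the files in the extracted directory so you can check whether any of them end in bz2.
Are you running this command on Windows?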

Tags: python command-line xml-parsing wikipedia

【Solution 1】:

Please go through the following question; it should help.

Error using the 'find' command to generate a collection file on opencv

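The commands given on the WikiExtractor page are meant for Unix/Linux systems, not for Windows.

The find command you ran on Windows works differently from the one on Unix/Linux.

The extraction step works fine on both Windows and Linux as long as you run it with the python prefix.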

python WikiExtractor.py -cb 250K -o extracted your_bz2_file

You will see an extracted folder created in the same directory as your script.
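As a quick cross-platform check that the extraction actually produced compressed files, a sketch like the following lists them from Python instead of relying on find, ls, or dir. The helper name list_bz2_files is hypothetical, and the extracted folder name is simply the one passed to -o above.

```python
from pathlib import Path


def list_bz2_files(root):
    """Return all *.bz2 files below root, sorted by path (hypothetical helper)."""
    return sorted(p for p in Path(root).rglob("*.bz2") if p.is_file())


if __name__ == "__main__":
    # "extracted" is the folder that was passed to WikiExtractor.py via -o.
    for path in list_bz2_files("extracted"):
        print(path)
```

If nothing is printed, the folder contains no bz2 files and the merge step has nothing to work on.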

After that, the find command should work like this, on Linux only.

find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml

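It finds everything in the extracted folder whose name matches bz2, then executes the bzip2 command on those files and puts the result in the text.xml file.

Also, if you run the bzip2 -help command, which is the command the find invocation above is supposed to execute, you will see that it does not work on Windows, whereas on Linux you get the following output.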

gaurishankarbadola@ubuntu:~$ bzip2 -help
bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.

   usage: bzip2 [flags and input files in any order]

   -h --help           print this message
   -d --decompress     force decompression
   -z --compress       force compression
   -k --keep           keep (don't delete) input files
   -f --force          overwrite existing output files
   -t --test           test compressed file integrity
   -c --stdout         output to standard out
   -q --quiet          suppress noncritical error messages
   -v --verbose        be verbose (a 2nd -v gives more)
   -L --license        display software version & license
   -V --version        display software version & license
   -s --small          use less memory (at most 2500k)
   -1 .. -9            set block size to 100k .. 900k
   --fast              alias for -1
   --best              alias for -9

   If invoked as `bzip2', default action is to compress.
              as `bunzip2',  default action is to decompress.
              as `bzcat', default action is to decompress to stdout.

   If no file names are given, bzip2 compresses or decompresses
   from standard input to standard output.  You can combine
   short flags, so `-v -4' means the same as -v4 or -4v, &c.

As the help text above states, the default action of bzip2 is to compress, so use bzcat to decompress.
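The same compress/decompress distinction can be seen with Python's standard bz2 module, which behaves identically on Windows and Linux. This is only an illustrative sketch with made-up sample text; it is not part of WikiExtractor.

```python
import bz2

text = "<doc>sample article text</doc>\n".encode("utf8")  # made-up sample data

compressed = bz2.compress(text)        # what plain `bzip2` does by default
restored = bz2.decompress(compressed)  # what `bzcat` or `bzip2 -d` does

print(compressed[:3])    # bz2 streams start with the magic bytes b'BZh'
print(restored == text)  # the round trip gives back the original bytes
```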

The modified command, which works on Linux only, looks like this.

find extracted -name '*bz2' -exec bzcat -c {} \; > text.xml

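It works on my Ubuntu system.

Edit:

For Windows:

Before you try anything, please read the instructions first.

1. Create a separate folder and put the files in it. Files --> WikiExtractor.py and itwiki-latest-pages-articles1.xml-p1p277091.bz2 (in my case, since it was a small file that I could find).

2. Open a command prompt in the current directory and run the following command to extract all the files.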

python WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles1.xml-p1p277091.bz2

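This takes time depending on the file size, but afterwards the directory will look like this.

Note: If you already have the extracted folder, move it into the current directory so that it matches the image above; you do not have to run the extraction again.

3. Copy and paste the following code and save it in a file named bz2_Extractor.py.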

import argparse
import bz2
import logging

from datetime import datetime
from os import listdir
from os.path import isfile, join, isdir

FORMAT = '%(levelname)s: %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def get_all_files_recursively(root):
    files = [join(root, f) for f in listdir(root) if isfile(join(root, f))]
    dirs = [d for d in listdir(root) if isdir(join(root, d))]
    for d in dirs:
        files_in_d = get_all_files_recursively(join(root, d))
        if files_in_d:
            for f in files_in_d:
                files.append(join(f))
    return files


def bzip_decompress(list_of_files, output_file):
    start_time = datetime.now()
    # Use a separate name for the file handle so it does not shadow the output path.
    with open(output_file, 'w', encoding="utf8") as out:
        for file in list_of_files:
            with bz2.open(file, 'rt', encoding="utf8") as bz2_file:
                logger.info(f"Reading/Writing file ---> {file}")
                # write() emits the whole string at once; writelines() on a str
                # would iterate over it one character at a time.
                out.write(bz2_file.read())
                out.write('\n')
    stop_time = datetime.now()
    print(f"Total time taken to write out {len(list_of_files)} files = {(stop_time - start_time).total_seconds()}")


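def main():
    parser = argparse.ArgumentParser(description="Input fields")
    parser.add_argument("-r", required=True, help="root directory containing the bz2 files")
    parser.add_argument("-n", type=int, default=None, help="number of files to write out (default: all)")
    parser.add_argument("-o", required=True, help="output file name")
    args = parser.parse_args()

    all_files = get_all_files_recursively(args.r)
    # A slice bound of None means "no limit", so omitting -n writes out every file.
    bzip_decompress(all_files[:args.n], args.o)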


if __name__ == "__main__":
    main()

4. Now open a cmd window in the current directory and run the following command.

Please read what each input in the command does.

python bz2_Extractor.py -r extracted -o output.txt -n 10

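-r : The root directory that contains your bz2 files.

-o : The output file name.

-n : The number of files to write out. [If not provided, all files under the root directory are written out.]

Note: I can see that your file is several gigabytes in size and contains more than half a million articles. If you try to put all of it into a single file with the above command, I am not sure what will happen or whether your system will survive it. If it does, the output file will be very large, since it is extracted from a 2.8 GB file, and I do not think Windows will be able to open it directly.

So my suggestion is to process 10000 files at a time.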
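One way to follow that suggestion is to split the file list into batches and write one output file per batch. The sketch below is an assumption about how this could be done and is not part of the script above: the helper names chunked and write_batches and the output_N.txt naming scheme are hypothetical.

```python
import bz2
import shutil
from pathlib import Path


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_batches(root, out_dir, batch_size=10000):
    """Decompress the bz2 files under root into one text file per batch."""
    files = sorted(p for p in Path(root).rglob("*.bz2") if p.is_file())
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for index, batch in enumerate(chunked(files, batch_size)):
        target = out_dir / f"output_{index}.txt"  # hypothetical naming scheme
        with open(target, "w", encoding="utf8") as out:
            for path in batch:
                with bz2.open(path, "rt", encoding="utf8") as src:
                    # Copy in chunks so no file is held in memory as a whole.
                    shutil.copyfileobj(src, out)
                out.write("\n")
        outputs.append(target)
    return outputs
```

Calling write_batches("extracted", "merged") would then leave a series of smaller text files in a merged folder, each of which is easier to open than a single multi-gigabyte file.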

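Let me know if this works for you.

PS: For the above command, the output looks like this.

【Comments】:

By default, Windows does not support the bz2 command the way Linux does. Hold on, I will write a Python script for you that does the same thing.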