Extract specific file extensions from multiple 7-zip files

【Question title】: Extract specific file extensions from multiple 7-zip files
【Posted】: 2017-06-12 08:39:23
【Question】:

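I have a RAR file and a ZIP file. Each of them contains a folder. Inside the folder there are several 7-zip (.7z) files. Each 7z contains multiple files with the same extension but different names.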

RAR or ZIP file
  |___folder
        |_____Multiple 7z
                  |_____Multiple files with same extension and different name

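Out of thousands of files I want to extract only the ones I need... I need the files whose names contain certain substrings. For example, whether the name of a compressed file contains '[!]' or '(U)' or '(J)' is the criterion that determines which files to extract.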
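
The criterion above can be sketched in Python as a plain substring test. This is only an illustration: the names `WANTED_TAGS` and `is_wanted` and the sample file names are made up for the example and are not part of the question.

```python
# Substrings named in the question as the selection criterion.
WANTED_TAGS = ("[!]", "(U)", "(J)")

def is_wanted(filename, tags=WANTED_TAGS):
    """Return True if the file name contains at least one wanted substring."""
    return any(tag in filename for tag in tags)

# Hypothetical file names, used only to show the filter at work.
candidates = ["Game (U) [!].ext", "Game (E).ext", "Game (J) [b1].ext"]
selected = [name for name in candidates if is_wanted(name)]
```

A plain `in` test is used instead of a wildcard or regular expression because `[`, `]`, `(` and `)` have special meanings in both.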

I can extract the folder without problems, so I have this structure:

folder
   |_____Multiple 7z
                |_____Multiple files with same extension and different name

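I'm in a Windows environment, but I have Cygwin installed. How can I easily extract just the files I need? Ideally with a single command line.

Update

Some refinements to the question:

The inner 7z files, and the files inside them, can have spaces in their names.
Some 7z files contain only a single file, and it does not meet the given criteria. Since it is the only candidate, it must be extracted too.

Solution

Thanks, everyone. The bash solution is the one that helped me. I couldn't test the Python 3 solutions because I ran into problems installing the libraries with pip. I don't use Python, so I would have had to learn it and work through the errors those solutions ran into. For now I have found a suitable answer. Thank you all.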

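【Comments】:

"I only want to extract the ones I need..." How do you determine which ones you need?
@Tony My bad... I've updated the question with the criteria. Basically it is a substring in the names of the compressed files. Thanks for your interest.

Tags: windows cygwin extract 7zip compression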

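【Solution 1】:

This is the final version after several attempts. The earlier ones did not work, so I removed them rather than appending. Read to the end, because the final solution may not need everything described here.

To the point. I would use Python. If this is a one-off task it may be overkill, but in any other case you can log every step for later investigation, use regular expressions, orchestrate commands to provide input, and capture and process their output, every time. All of that is easy in Python, if you have it.

Now I'll describe how to set up the environment. Not all of it is mandatory, but I went through these steps while trying to get things installed, and perhaps the description of the process is useful in itself.

I have MinGW, the 32-bit version. It is not mandatory for extracting 7zip, though. After installation go to C:\MinGW\bin and run mingw-get.exe:

Under Basic Setup I installed msys-base (right-click, mark for installation, then Apply Changes from the Installation menu). That gives me bash, sed, grep and more.
Under All Packages there is mingw32-libarchive with dll as its class. Since the Python libarchive package is just a wrapper, you need this DLL to actually have a binary to wrap.

The examples are for Python 3. I used the 32-bit version. You can fetch it from their home page. I installed it in the default directory, which is a strange location, so I recommend installing it in the root of the disk, like mingw.

Other: conemu is much better than the default console.

Installing packages in Python. pip is used for this. From your console go to the Python home directory; there is a Scripts subdirectory there. For me it is: c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\Scripts. You can search with e.g. pip search archive, and install with pip install libarchive-c: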

> pip.exe install libarchive-c
Collecting libarchive-c
  Downloading libarchive_c-2.7-py2.py3-none-any.whl
Installing collected packages: libarchive-c
Successfully installed libarchive-c-2.7

After cd .. and starting python, the new library can be used/imported:

>>> import libarchive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 27, in <module>
    libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 426, in LoadLibrary
   return self._dlltype(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None

So it failed. I tried to fix this, but it failed again:

>>> import libarchive
read format "cab" is not supported
read format "7zip" is not supported
read format "rar" is not supported
read format "lha" is not supported
read filter "uu" is not supported
read filter "lzop" is not supported
read filter "grzip" is not supported
read filter "bzip2" is not supported
read filter "rpm" is not supported
read filter "xz" is not supported
read filter "none" is not supported
read filter "compress" is not supported
read filter "all" is not supported
read filter "lzma" is not supported
read filter "lzip" is not supported
read filter "lrzip" is not supported
read filter "gzip" is not supported
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 167, in <module>
    c_int, check_int)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 92, in ffi
    f = getattr(libarchive, 'archive_'+name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'archive_read_open_filename_w' not found

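I tried to supply the information directly with the set command, but that failed too... So I moved on to pylzma, since it does not need mingw. The pip installation failed: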

> pip.exe install pylzma
Collecting pylzma
  Downloading pylzma-0.4.9.tar.gz (115kB)
    100% |--------------------------------| 122kB 1.3MB/s
Installing collected packages: pylzma
  Running setup.py install for pylzma ... error
    Complete output from command c:\users\texxas\appdata\local\programs\python\python36-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\texxas\\AppData\\Local\\Temp\\pip-build-99t_zgmz\\pylzma\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\texxas\AppData\Local\Temp\pip-ffe3nbwk-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.6
    copying py7zlib.py -> build\lib.win32-3.6
    running build_ext
    adding support for multithreaded compression
    building 'pylzma' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

Failed again. But this one is easy: I installed the Visual Studio 2015 build tools and it worked. I have sevenzip installed, so I created a sample archive. So finally I can start python and do:

from py7zlib import Archive7z
f = open(r"C:\Users\texxas\Desktop\try.7z", 'rb')
a = Archive7z(f)
a.filenames

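I got an empty list. Looking more closely, it becomes clear that pylzma does not take empty files into account; just so you are aware of it. So after putting a single character into each of my sample files, the last line gives: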

>>> a.filenames
['try/a/test.txt', 'try/a/test1.txt', 'try/a/test2.txt', 'try/a/test3.txt', 'try/a/test4.txt', 'try/a/test5.txt', 'try/a/test6.txt', 'try/a/test7.txt', 'try/b/test.txt', 'try/b/test1.txt', 'try/b/test2.txt', 'try/b/test3.txt', 'try/b/test4.txt', 'try/b/test5.txt', 'try/b/test6.txt', 'try/b/test7.txt', 'try/c/test.txt', 'try/c/test1.txt', 'try/c/test11.txt', 'try/c/test2.txt', 'try/c/test3.txt', 'try/c/test4.txt', 'try/c/test5.txt', 'try/c/test6.txt', 'try/c/test7.txt']

So... the rest is a piece of cake. Actually this is part of the original post:

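import os
import py7zlib

DEST = './dest'

for folder, subfolders, files in os.walk('.'):
    for file in files:
        if not file.endswith('.7z'):
            continue
        # A 7z archive: extract the members we need.
        archive_path = os.path.join(folder, file)
        try:
            with open(archive_path, 'rb') as f:
                z = py7zlib.Archive7z(f)
                for member in z.getmembers():
                    # Adjust the suffix test to the files you want.
                    if member.filename.endswith('.py'):
                        target = os.path.join(DEST, member.filename)
                        os.makedirs(os.path.dirname(target), exist_ok=True)
                        with open(target, 'wb') as out:
                            out.write(member.read())
        except py7zlib.FormatError as e:
            print('file ' + archive_path)
            print(str(e))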

As a side note: Anaconda is a great tool, but a full installation takes 500+ MB, so it is too much here.

Let me also share the wmctrl.py tool, from my github:

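# Excerpt: `active` and `stored` are objects set up elsewhere in the tool.
# getoutput comes from the commands module (Python 2) or subprocess (Python 3).
cmd = 'wmctrl -ir ' + str(active.window) + \
      ' -e 0,' + str(stored.left) + ',' + str(stored.top) + ',' + str(stored.width) + ',' + str(stored.height)
print(cmd)
res = getoutput(cmd)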

This way you can orchestrate different commands, wmctrl in this case, and handle the results in a way that allows further data processing.
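
As a rough Python 3 sketch of the same idea (run a command, capture its output, then process it), something like the following could work. The helper name `run_and_capture` is made up for illustration, and the command being run is just the current Python interpreter standing in for a real tool such as wmctrl or 7z.

```python
import subprocess
import sys

def run_and_capture(args):
    """Run a command given as an argument list and return its stdout lines."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

# Stand-in command: the current Python interpreter printing two lines.
demo_cmd = [sys.executable, "-c", "print('a b.txt'); print('c.txt')"]
lines = run_and_capture(demo_cmd)
```

Passing the command as a list rather than one string means arguments that contain spaces need no extra quoting.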

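【Comments】:

Hi! I tried your solution in an AWS EC2 Ubuntu environment. I tried to install pylzma in order to use the py7zlib library, but I got the following error: UnsupportedPlatformWarning: Multithreading is not supported on the platform "linux2" I tried to do my homework and tried some of the solutions shown here without success: unix.stackexchange.com/questions/175231/… Thanks for your answer.
@Metafaniel I will edit my comment... instead of Cygwin use mingw; you will have bash.exe with all the goodies. Download Python from their home page; pip.exe is included, so you can use it to install, from cmd.exe or bash... Windows on my VirtualBox failed, which is why this is taking so much time, sorry.
@Metafaniel I will go through all the steps on my VirtualBox and describe them one by one.
Thanks again for your interest. I will certainly try your approach as soon as I have some free time, and I will give you some feedback when I do. Thanks again.
That would be great. I am going to help a friend, so I will get some practice :)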

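【Solution 2】:

You stated in the question's bounty footer that Linux can be used. And I don't use Windows; sorry about that. I am using Python 3, and you have to be in a Linux environment (I will test on Windows as soon as I can).

Archive structure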

datadir.rar
          |
          datadir/
                 |
                 zip1.7z
                 zip2.7z
                 zip3.7z
                 zip4.7z
                 zip5.7z

Extracted structure

extracted/
├── zip1
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip2
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip3
│   ├── (J) [!].txt
│   └── (U) [!].txt
└── zip5
    ├── (J).txt
    └── (U).txt

Here is how I did it.

import libarchive.public
import os, os.path
from os.path import basename
import errno
import rarfile

#========== FILE UTILS =================

#Make directories
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

#Open "path" for writing, creating any parent directories as needed.
def safe_open_w(path):
    mkdir_p(os.path.dirname(path))
    return open(path, 'wb')

#========== RAR TOOLS ==================

# List
def rar_list(rar_archive):
    with rarfile.RarFile(rar_archive) as rf:
        return rf.namelist()

# extract
def rar_extract(rar_archive, filename, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extract(filename,path)

# extract-all
def rar_extract_all(rar_archive, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extractall(path)

#========= 7ZIP TOOLS ==================

# List
def zip7_list(zip7file):
    filelist = []
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            filelist.append(entry.pathname.decode("utf-8"))
    return filelist

# extract
def zip7_extract(zip7file, filename, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if entry.pathname.decode("utf-8") == filename:
                with safe_open_w(os.path.join(path, filename)) as q:
                    for block in entry.get_blocks():
                        q.write(block)
                break

# extract-all
def zip7_extract_all(zip7file, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if os.path.isdir(entry.pathname.decode("utf-8")):
                continue
            with safe_open_w(os.path.join(path, entry.pathname.decode("utf-8"))) as q:
                for block in entry.get_blocks():
                    q.write(block)

#============ FILE FILTER =================

def exclamation_filter(filename):
    return ("[!]" in filename)

def optional_code_filter(filename):
    return not ("[" in filename)

def has_exclamation_files(filelist):
    for singlefile in filelist:
        if(exclamation_filter(singlefile)):
            return True
    return False

#============ MAIN PROGRAM ================

print("-------------------------")
print("Program Started")
print("-------------------------")

BIG_RAR = 'datadir.rar'
TEMP_DIR = 'temp'
EXTRACT_DIR = 'extracted'
newzip7filelist = []

#Extract big rar and get new file list
for zipfilepath in rar_list(BIG_RAR):
    rar_extract(BIG_RAR, zipfilepath, TEMP_DIR)
    newzip7filelist.append(os.path.join(TEMP_DIR, zipfilepath))

print("7z Files Extracted")
print("-------------------------")

for newzip7file in newzip7filelist:
    innerFiles = zip7_list(newzip7file)
    for singleFile in innerFiles:
        fileSelected = False
        if(has_exclamation_files(innerFiles)):
            if exclamation_filter(singleFile): fileSelected = True
        else:
            if optional_code_filter(singleFile): fileSelected = True
        if(fileSelected):
            print(singleFile)
            outputFile = os.path.join(EXTRACT_DIR, os.path.splitext(basename(newzip7file))[0])
            zip7_extract(newzip7file, singleFile, outputFile)

print("-------------------------")
print("Extraction Complete")
print("-------------------------")

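Above the main program I have prepared all the required functions. I didn't use all of them, but I kept them in case they are needed.

I used several Python libraries with Python 3, but you only need to install libarchive and rarfile with pip; the others are built-in libraries.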
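
Since the answer targets Python 3, the directory-creating helper could probably be shortened with the `exist_ok` flag of `os.makedirs`. The sketch below is an assumption about how that might look, not the author's code; the temporary directory and the file name are only there to make the example self-contained.

```python
import os
import tempfile

def safe_open_w(path):
    """Open path for binary writing, creating parent directories as needed."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(parent, exist_ok=True)
    return open(path, "wb")

# Demonstration inside a scratch directory that is removed afterwards.
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "zip1", "(J) [!].txt")
    with safe_open_w(target) as handle:
        handle.write(b"demo")
    written = os.path.exists(target)
```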

Here is a copy of my source tree.

Console output

This is the console output when running this Python file:

-------------------------
Program Started
-------------------------
7z Files Extracted
-------------------------
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(J).txt
(U).txt
-------------------------
Extraction Complete
-------------------------

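Issues

The only problem I have run into so far is that some temporary files are generated in the program's root directory. It doesn't affect the program anyway, but I will try to fix it.

Edit

You have to run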

sudo apt-get install libarchive-dev

to install the actual libarchive program. The Python library is just a wrapper around it. Take a look at the official documentation.
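
One way to check whether the shared library is visible to Python before importing the wrapper is the standard-library helper `ctypes.util.find_library`. This is a minimal sketch; whether a particular wrapper package locates the library in the same way is an assumption, and the function name `find_libarchive` is made up for the example.

```python
import ctypes.util

def find_libarchive():
    """Return the name or path of the libarchive shared library, or None."""
    return ctypes.util.find_library("archive")

location = find_libarchive()
if location is None:
    message = "libarchive shared library not found; install it first"
else:
    message = "libarchive found: " + location
```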

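【Comments】:

@MichałZaborowski File path separators and the like are Unix-specific, so this won't work on Windows. The OP said Linux can be used; please read the question in full before voting. And I don't use Windows, so I can't test on Windows. Quoting the OP: "maybe in command line not necessailly to be done in Windows,it can be done with Linux/bash, etc". Hateful comments on other users' answers won't do your own answer any good. Sorry.
The OP wrote that he will use Cygwin for this. Libarchive is a wrapper; there are many of them, and each has a slightly different interface. Are you sure that with pip the OP will be able to install anything into Python version 3? In Cygwin?
Yes, I originally asked about Cygwin, but given the lack of answers and my need to find one, a Linux-only answer is fine with me if it helps. Thanks to both of you.
About the answer: I did my homework and tried to install the libraries to test your answer. However, I am facing this error: error: libarchive.so: cannot open shared object file: No such file or directory I made sure I was using the python3 pip and not the python2 one, since I have both versions. I even upgraded pip, but that didn't help, so I haven't been able to test your idea yet. I don't use Python. Sorry...
You have to run apt-get install libarchive-dev to install the actual program. The Python library is just a wrapper around it. Take a look at the libarchive website I linked to.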

【Solution 3】:

How about using this command line:

7z -e c:\myDir\*.7z -oc:\outDir "*(U)*.ext" "*(J)*.ext" "*[!]*.ext" -y

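Where:

myDir is the folder you unpacked
outDir is your output directory
ext is your file extension

The -y option forces overwriting, in case the same file name occurs in different archives.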
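
If the same call had to be assembled from a script, the argument list could be built along these lines. This is a hedged sketch: `build_7z_command` is a made-up helper, 7z is assumed to be on the PATH, and the sketch uses the command spelling `e` (no leading dash), which is how 7z names its extract command.

```python
def build_7z_command(archives, out_dir, ext, tags=("(U)", "(J)", "[!]")):
    """Build the argument list for a `7z e` call with one wildcard per tag."""
    patterns = ["*%s*.%s" % (tag, ext) for tag in tags]
    return ["7z", "e", archives, "-o" + out_dir] + patterns + ["-y"]

cmd = build_7z_command(r"c:\myDir\*.7z", r"c:\outDir", "ext")
# The list could be handed to subprocess.run(cmd) on a machine that has 7z.
```

Keeping every pattern as its own list element avoids shell quoting problems with characters such as `(`, `[` and `!`.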

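【Comments】:

Thanks Johnny. The basic idea works. I tested this in Cygwin, and the only difference is that I needed to remove the - from -e to make it work. However, I wish there were a way to implement the logic of [!] taking priority over the other code combinations. This way I get far more results than expected. Maybe a regular expression is needed to narrow the results down? Thanks for your interest in helping me!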

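【Solution 4】:

This solution is based on bash, grep and awk; it works on Cygwin and on Ubuntu.

Since you need to search for (X) [!].ext files first and, only if there are no such files, look for (X).ext files, I don't think it is possible to write a single expression that handles this logic.

The solution needs some if/else conditional logic that tests the list of files inside the archive and decides which files to extract.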
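
To make the decision order concrete, here is the same if/else logic as a small Python sketch (the answer itself uses bash). The names `choose_files` and `REGION_ONLY` are illustrative only. The order assumed is: names containing [!] first, then names with a single capital letter in parentheses right before the extension, then the lone file of a single-file archive, which is the case described in the question's update.

```python
import re

# A single capital letter in parentheses directly before the extension dot.
REGION_ONLY = re.compile(r"\([A-Z]\)\.")

def choose_files(names):
    """Pick the files to extract from one archive's list of file names."""
    marked = [n for n in names if "[!]" in n]
    if marked:
        return marked
    region_only = [n for n in names if REGION_ONLY.search(n)]
    if region_only:
        return region_only
    if len(names) == 1:
        # Only one file in the archive: take it even though nothing matched.
        return list(names)
    return []

picked = choose_files(["(J) [b1].txt", "(J) [b2].txt", "(J) [o1].txt", "(J).txt"])
```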

This is the initial structure inside the zip/rar archive that I tested my script on (I wrote a script to prepare this structure):

folder
├── 7z_1.7z
│   ├── (E).txt
│   ├── (J) [!].txt
│   ├── (J).txt
│   ├── (U) [!].txt
│   └── (U).txt
├── 7z_2.7z
│   ├── (J) [b1].txt
│   ├── (J) [b2].txt
│   ├── (J) [o1].txt
│   └── (J).txt
├── 7z_3.7z
│   ├── (E) [!].txt
│   ├── (J).txt
│   └── (U).txt
└── 7z 4.7z
    └── test.txt

The output looks like this:

output
├── 7z_1.7z           # This is a folder, not an archive
│   ├── (J) [!].txt   # Here we extracted only files with [!]
│   └── (U) [!].txt
├── 7z_2.7z
│   └── (J).txt       # Here there are no [!] files, so we extracted (J)
├── 7z_3.7z
│   └── (E) [!].txt   # We had here both [!] and (J), extracted only file with [!]
└── 7z 4.7z
    └── test.txt      # We had only one file here, extracted it

This is the script that does the extraction:

#!/bin/bash

# Remove the output (if it's left from previous runs).
rm -rf output
mkdir -p output

# Unzip the zip archive.
unzip data.zip -d output
# For rar use
#  unrar x data.rar output
# OR
#  7z x -ooutput data.rar

for archive in output/folder/*.7z
do
  # See https://stackoverflow.com/questions/7148604
  # Get the list of file names, remove the extra output of "7z l"
  list=$(7z l "$archive" | awk '
      /----/ {p = ++p % 2; next}
      $NF == "Name" {pos = index($0,"Name")}
      p {print substr($0,pos)}
  ')
  # Get the list of files with [!].
  extract_list=$(echo "$list" | grep -F "[!]")
  if [[ -z $extract_list ]]; then
    # If we don't have files with [!], then look for ([A-Z]) pattern
    # to get files with single letter in brackets.
    extract_list=$(echo "$list" | grep "([A-Z])\.")
  fi
  if [[ -z $extract_list ]]; then
    # If we only have one file - extract it.
    if [[ $(echo "$list" | wc -l) -eq 1 ]]; then
      extract_list=$list
    fi
  fi
  if [[ ! -z $extract_list ]]; then
    # If we have files to extract, then do the extraction.
    # Output path is output/7zip_archive_name/
    out_path=output/$(basename "$archive")
    mkdir -p "$out_path"
    echo "$extract_list" | xargs -I {} 7z x -o"$out_path" "$archive" {}
  fi
done

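The basic idea here is to go over the 7zip archives and get the list of files in each of them using the 7z l command (list files).

The output of that command is quite verbose, so we use awk to clean it up and get just the list of file names.

After that we filter this list with grep to get either the list of [!] files or the list of (X) files. Then we simply pass this list to 7zip to extract the files we need.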
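
For readers more at home in Python than awk, the clean-up step can be sketched as below. It is not part of the script. The function name is made up, and the sample listing is a hand-made imitation of the column layout rather than output captured from a real 7z run.

```python
def names_from_listing(listing):
    """Extract the Name column from text shaped like `7z l` output."""
    names = []
    inside_table = False   # toggled by each dashed separator line
    name_col = None        # column where the "Name" header starts
    for line in listing.splitlines():
        if "----" in line:
            inside_table = not inside_table
            continue
        fields = line.split()
        if fields and fields[-1] == "Name":
            name_col = line.index("Name")
        if inside_table and name_col is not None:
            # Slice by column so names containing spaces stay intact.
            names.append(line[name_col:])
    return names

# Hand-made imitation of a listing (an assumption, not real 7z output).
ROW = "%-19s %5s %12s %12s  %s"
SAMPLE = "\n".join([
    "Listing archive: 7z_1.7z",
    "",
    ROW % ("   Date      Time", "Attr", "Size", "Compressed", "Name"),
    ROW % ("-" * 19, "-" * 5, "-" * 12, "-" * 12, "-" * 24),
    ROW % ("2017-06-12 08:39:23", "....A", 5, 9, "(J) [!].txt"),
    ROW % ("2017-06-12 08:39:23", "....A", 5, "", "(U).txt"),
    ROW % ("-" * 19, "-" * 5, "-" * 12, "-" * 12, "-" * 24),
    ROW % ("2017-06-12 08:39:23", "", 10, 9, "2 files"),
])
names = names_from_listing(SAMPLE)
```

Like the awk program, it remembers where the Name header starts and keeps only the rows between the two dashed lines.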

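【Comments】:

Hi! I like your approach. I tested the code after making the necessary changes to file names and paths. With my real-life files there are some problems: your script doesn't seem to take files with spaces in their names into account, because all the output produced comes from file names without spaces. There are also cases where a 7z file contains only a single file and it doesn't meet the criteria. That file must then be extracted, since it is the only one available... It would be great if you could help me improve the answer. I will also update the question with this. Thanks for helping me.
Side note: it seems unrar is not available for cygwin. I have read that 7z handles RAR files well. I made the necessary changes to the script, but I got System ERROR: Unknown error -2147024872. I don't have time right now to debug it and find out why.
I have added handling for the single-file archive case, and I also found and fixed a problem with spaces in archive names. Do you also have problems with spaces in file names? (I have tested it on files with spaces; see my input/output example.) I have also updated the scripts on github.
I just tested it, and it also works on Cygwin; to extract the rar archive use 7z x -ooutput data.rar.
Thanks, your solution has solved my problem. Regards, and have a nice day!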