在python中清理文件路径答案

【问题标题】：Sanitizing a file path in python在python中清理文件路径
【发布时间】：2012-12-18 18:22:49
【问题描述】：

我有一个文件浏览器应用程序，它向用户公开目录及其内容。

我想清理用户输入，这是一个文件路径，因此它不允许绝对路径（例如“/tmp/”）和相对路径（例如“../../etc”）

有没有跨平台的python函数？

【问题讨论】：

标签： python

【解决方案1】：

也适用于寻找摆脱路径中A/./B -> A/B 和A/B/../C -> A/C 的方法的人。你可以使用os.path.normpath。

【讨论】：

【解决方案2】：

一个全面的python文件路径清理器

我对清理路径的任何可用方法都不满意，因此我编写了自己的相对全面的路径清理器。这适用于*从公共端点（http 上传、REST 端点等）获取输入，并确保如果您将数据保存在生成的文件路径中，它不会损坏您的系统**。（注意：此代码针对 Python 3+，您可能需要进行一些更改才能使其在 2.x 上运行）

* 不保证！请不要在没有自己彻底检查的情况下依赖此代码。

** 同样，不保证！您仍然可以做一些疯狂的事情并将 *nix 系统上的根路径设置为 /dev/ 或 /bin/ 或类似的东西。不要那样做。 Windows 上也有一些可能导致损坏的边缘情况（例如设备文件名），您可以检查werkzeug 的utils 中的secure_filename 方法，以便在处理这些问题时获得良好的开端。面向 Windows。

工作原理

你需要指定一个根路径，清理器会确保所有返回的路径都在这个根下。检查get_root_path 函数以了解在哪里执行此操作。确保根路径的值来自您自己的配置，而不是用户输入！
有一个文件名清理程序：
- 将 unicode 转换为 ASCII
- 将路径分隔符转换为下划线
- 仅允许文件名中的白名单中的某些字符。白名单包括所有大小写字母、所有数字、连字符、下划线、空格、左圆括号和右圆括号以及句号（句点）。您可以根据需要自定义此白名单。
- 确保所有名称都至少包含一个字母或数字（以避免使用“..”之类的名称）
要获得有效的文件路径，您应该调用make_valid_file_path。您可以选择在 path 参数中将子目录路径传递给它。这是根路径下的路径，可以来自用户输入。您可以选择在filename 参数中传递一个文件名，这也可以来自用户输入。您传递的文件名中的任何路径信息都不会用于确定文件的路径，而是将其展平为文件名的有效、安全的组成部分。
- 如果没有路径或文件名，它将返回根路径，为主机文件系统正确格式化，结尾带有路径分隔符 (/)。
- 如果有子目录路径，它会将其拆分为其组成部分，使用文件名 sanitiser 对每个部分进行清理，并在没有前导路径分隔符的情况下重建路径。
- 如果有文件名，它会用消毒剂对文件名进行消毒。
- 它将os.path.join 路径组件获取文件的最终路径。
- 作为最终的双重检查结果路径是否有效且安全，它会检查结果路径是否位于根路径下的某个位置。通过拆分和比较路径的组成部分来正确完成此检查，而不是仅仅确保一个字符串以另一个字符串开头。

好的，足够的警告和描述，这里是代码：

import os

def ensure_directory_exists(path_directory):
    if not os.path.exists(path_directory):
        os.makedirs(path_directory)

def os_path_separators():
    seps = []
    for sep in os.path.sep, os.path.altsep:
        if sep:
            seps.append(sep)
    return seps

def sanitise_filesystem_name(potential_file_path_name):
    # Sort out unicode characters
    valid_filename = normalize('NFKD', potential_file_path_name).encode('ascii', 'ignore').decode('ascii')
    # Replace path separators with underscores
    for sep in os_path_separators():
        valid_filename = valid_filename.replace(sep, '_')
    # Ensure only valid characters
    valid_chars = "-_.() {0}{1}".format(string.ascii_letters, string.digits)
    valid_filename = "".join(ch for ch in valid_filename if ch in valid_chars)
    # Ensure at least one letter or number to ignore names such as '..'
    valid_chars = "{0}{1}".format(string.ascii_letters, string.digits)
    test_filename = "".join(ch for ch in potential_file_path_name if ch in valid_chars)
    if len(test_filename) == 0:
        # Replace empty file name or file path part with the following
        valid_filename = "(Empty Name)"
    return valid_filename

def get_root_path():
    # Replace with your own root file path, e.g. '/place/to/save/files/'
    filepath = get_file_root_from_config()
    filepath = os.path.abspath(filepath)
    # ensure trailing path separator (/)
    if not any(filepath[-1] == sep for sep in os_path_separators()):
        filepath = '{0}{1}'.format(filepath, os.path.sep)
    ensure_directory_exists(filepath)
    return filepath

def path_split_into_list(path):
    # Gets all parts of the path as a list, excluding path separators
    parts = []
    while True:
        newpath, tail = os.path.split(path)
        if newpath == path:
            assert not tail
            if path and path not in os_path_separators():
                parts.append(path)
            break
        if tail and tail not in os_path_separators():
            parts.append(tail)
        path = newpath
    parts.reverse()
    return parts

def sanitise_filesystem_path(potential_file_path):
    # Splits up a path and sanitises the name of each part separately
    path_parts_list = path_split_into_list(potential_file_path)
    sanitised_path = ''
    for path_component in path_parts_list:
        sanitised_path = '{0}{1}{2}'.format(sanitised_path, sanitise_filesystem_name(path_component), os.path.sep)
    return sanitised_path

def check_if_path_is_under(parent_path, child_path):
    # Using the function to split paths into lists of component parts, check that one path is underneath another
    child_parts = path_split_into_list(child_path)
    parent_parts = path_split_into_list(parent_path)
    if len(parent_parts) > len(child_parts):
        return False
    return all(part1==part2 for part1, part2 in zip(child_parts, parent_parts))

def make_valid_file_path(path=None, filename=None):
    root_path = get_root_path()
    if path:
        sanitised_path = sanitise_filesystem_path(path)
        if filename:
            sanitised_filename = sanitise_filesystem_name(filename)
            complete_path = os.path.join(root_path, sanitised_path, sanitised_filename)
        else:
            complete_path = os.path.join(root_path, sanitised_path)
    else:
        if filename:
            sanitised_filename = sanitise_filesystem_name(filename)
            complete_path = os.path.join(root_path, sanitised_filename)
        else:
            complete_path = complete_path
    complete_path = os.path.abspath(complete_path)
    if check_if_path_is_under(root_path, complete_path):
        return complete_path
    else:
        return None

【讨论】：

【解决方案3】：

这将阻止用户输入../../../../etc/shadow 之类的文件名，但也不允许basedir 以下子目录中的文件（即basedir/subdir/moredir 被阻止）：

from pathlib import Path
test_path = (Path(basedir) / user_input).resolve()
if test_path.parent != Path(basedir).resolve():
    raise Exception(f"Filename {test_path} is not in {Path(basedir)} directory")

如果你想允许basedir以下的子目录：

if not Path(basedir).resolve() in test_path.resolve().parents:
    raise Exception(f"Filename {test_path} is not in {Path(basedir)} directory")

【讨论】：

【解决方案4】：

我最终在这里寻找一种快速的方法来处理我的用例并最终编写了我自己的。我需要的是一种方法来获取路径并强制它进入 CWD。这适用于处理挂载文件的 CI 系统。

def relative_path(the_path: str) -> str:
    '''
    Force the spec path to be relative to the CI workspace
    Sandboxes the path so that you can't escape out of CWD
    '''
    # Make the path absolute
    the_path = os.path.abspath(the_path)
    # If it started with a . it'll now be /${PWD}/
    # We'll get the path relative to cwd
    if the_path.startswith(os.getcwd()):
        the_path = '{}{}'.format(os.sep, os.path.relpath(the_path))
    # Prepend the path with . and it'll now be ./the/path
    the_path = '.{}'.format(the_path)
    return the_path

就我而言，我不想引发异常。我只是想强制任何给定的路径都将成为 CWD 中的绝对路径。

测试：

def test_relative_path():
    assert relative_path('../test') == './test'
    assert relative_path('../../test') == './test'
    assert relative_path('../../abc/../test') == './test'
    assert relative_path('../../abc/../test/fixtures') == './test/fixtures'
    assert relative_path('../../abc/../.test/fixtures') == './.test/fixtures'
    assert relative_path('/test/foo') == './test/foo'
    assert relative_path('./test/bar') == './test/bar'
    assert relative_path('.test/baz') == './.test/baz'
    assert relative_path('qux') == './qux'

【讨论】：

【解决方案5】：

这是对@mneil 解决方案的改进，使用relpath 的第二个秘密参数：

import os.path

def sanitize_path(path):
    """
    Sanitize a path against directory traversals

    >>> sanitize_path('../test')
    'test'
    >>> sanitize_path('../../test')
    'test'
    >>> sanitize_path('../../abc/../test')
    'test'
    >>> sanitize_path('../../abc/../test/fixtures')
    'test/fixtures'
    >>> sanitize_path('../../abc/../.test/fixtures')
    '.test/fixtures'
    >>> sanitize_path('/test/foo')
    'test/foo'
    >>> sanitize_path('./test/bar')
    'test/bar'
    >>> sanitize_path('.test/baz')
    '.test/baz'
    >>> sanitize_path('qux')
    'qux'
    """
    # - pretending to chroot to the current directory
    # - cancelling all redundant paths (/.. = /)
    # - making the path relative
    return os.path.relpath(os.path.normpath(os.path.join("/", path)), "/")

if __name__ == '__main__':
    import doctest
    doctest.testmod()

【讨论】：

我相信这应该与浏览器用于生成相对于基础的 URL 的算法相同，您可以在 JavaScript 中利用该算法来完成相同的操作：new URL("../../abc/../test", "https://example.com").pathname.substring(1) == "test"
啊，实际上你不需要normpath，因为relpath 隐含地这样做了，但这样更清晰。

【解决方案6】：

要非常具体地针对所提出的问题，但会引发异常而不是将路径转换为相对路径：

path = 'your/path/../../to/reach/root'
if '../' in path or path[:1] == '/':
    raise Exception

【讨论】：