Get list of files from hdfs (hadoop) directory using python script: answers

【Question title】: Get list of files from hdfs (hadoop) directory using python script
【Posted】: 2016-03-30 20:33:10
【Question】:

How can I get a list of files from an hdfs (hadoop) directory using a python script?

I have tried the following line:

dir = sc.textFile("hdfs://127.0.0.1:1900/directory").collect()

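The directory contains the files "file1,file2,file3....fileN". With this line I only get the combined contents of all the files, but I need the list of file names.

Can anyone help me figure this out?

Thanks in advance.

【Comments】:

Tags: python file python-2.7 hadoop directory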

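【Answer 1】:

You can use the listdir function from the os library: files = os.listdir(path)

【Comments】:

files = os.listdir('hdfs://127.0.0.1:19000/Directory') print files gives the error "WindowsError: [Error 123] The filename, directory name or volume label syntax is incorrect: 'hdfs://127.0.0.1:19000/Directoryt/*.*'". Aren't hdfs files handled like normal files?
Yes, I need the list of file names in that directory. But when I use the code above, I get the error I mentioned earlier.
Hi Michal, I need to get the files from an hdfs (hadoop) directory, which is not a directory on the local system.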

【Answer 2】:

Use subprocess

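import subprocess

# Run the listing through the shell and keep only the 8th column (the path).
p = subprocess.Popen("hdfs dfs -ls <HDFS Location> |  awk '{print $8}'",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT)

# One path per line; the "Found N items" header comes through as an empty line.
for line in p.stdout.readlines():
    print(line.rstrip())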

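EDIT: An answer without Python. The first option can also be used to print all sub-directories recursively. The final redirect statement can be omitted or changed to suit your requirements.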

hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}' > output.txt
hdfs dfs -ls <HDFS LOCATION> | awk '{print $8}' > output.txt

EDIT: Corrected the missing quote in the awk command.

【Comments】:
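The awk step above can also be done in Python itself, which helps on machines where awk is not installed. The following is a minimal sketch, assuming the usual eight-column layout of `hdfs dfs -ls` (permissions, replication, owner, group, size, date, time, path). The function name `parse_ls_output` and the sample text are made up for illustration and were not captured from a real cluster.

```python
def parse_ls_output(text):
    """Return the path column from `hdfs dfs -ls` style output."""
    paths = []
    for line in text.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        # Skip the "Found N items" header and any blank or short line.
        if len(fields) < 8:
            continue
        paths.append(fields[7])
    return paths


# Hypothetical sample in the assumed eight-column layout.
sample = (
    "Found 2 items\n"
    "-rw-r--r--   3 hduser supergroup       1024 2016-03-30 20:33 /directory/file1\n"
    "drwxr-xr-x   - hduser supergroup          0 2016-03-30 20:33 /directory/sub dir\n"
)
print(parse_ls_output(sample))
```

Unlike `awk '{print $8}'`, splitting with a field limit keeps a path with spaces in one piece. In practice the text would come from the decoded stdout of the subprocess call rather than from a literal string.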

【Answer 3】:

import subprocess

path = "/data"
args = "hdfs dfs -ls "+path+" | awk '{print $8}'"
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

s_output, s_err = proc.communicate()
all_dart_dirs = s_output.split() #stores list of files and sub-directories in 'path'
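Because the command string is built by concatenation and run with shell=True, a path containing spaces or shell metacharacters would be misread by the shell. One possible safeguard, sketched here under the assumption of a POSIX shell and Python 3.3 or later, is to quote the path with `shlex.quote`. The helper name `build_ls_command` is hypothetical.

```python
import shlex  # shlex.quote is available from Python 3.3


def build_ls_command(path):
    """Build the same pipeline as above, with the path quoted for the shell."""
    return "hdfs dfs -ls " + shlex.quote(path) + " | awk '{print $8}'"


# A plain path is left untouched; a path with a space gets single quotes.
print(build_ls_command("/data"))
print(build_ls_command("/my data"))
```

The returned string can be passed to subprocess.Popen in place of args. Quoting protects the shell step only; awk would still split a path with spaces across columns.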

【Comments】:

This looks like an improvement on this answer. You may want to edit your answer to explain the improvement.

【Answer 4】:

For Python 3:

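from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
# str has no readlines(); splitlines() is used instead, and it leaves no trailing empty entry to trim.
# [1:] skips the "Found N items" header printed when listing a directory.
lines = std_out.decode().splitlines()[1:]
# The path is the last space-separated field (this breaks for paths that contain spaces).
list_of_file_names_with_full_address = [fn.split(' ')[-1] for fn in lines]
list_of_file_names = [path.split('/')[-1] for path in list_of_file_names_with_full_address]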

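【Comments】:

【Answer 5】:

Why not let the HDFS client do the hard work by using the -C flag, instead of relying on awk or python to print the specific columns of interest?

i.e. Popen(['hdfs', 'dfs', '-ls', '-C', dirname])

Afterwards, split the output on new lines and you will have your list of paths.

Here's an example along with logging and error handling (including for when the directory/file doesn't exist):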

from subprocess import Popen, PIPE
import logging
logger = logging.getLogger(__name__)

FAILED_TO_LIST_DIRECTORY_MSG = 'No such file or directory'

class HdfsException(Exception):
    pass

def hdfs_ls(dirname):
    """Returns list of HDFS directory entries."""
    logger.info('Listing HDFS directory ' + dirname)
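    # universal_newlines=True returns out/err as text, so the string handling below also works on Python 3.
    proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname], stdout=PIPE, stderr=PIPE,
                 universal_newlines=True)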
    (out, err) = proc.communicate()
    if out:
        logger.debug('stdout:\n' + out)
    if proc.returncode != 0:
        errmsg = 'Failed to list HDFS directory "' + dirname + '", return code ' + str(proc.returncode)
        logger.error(errmsg)
        logger.error(err)
        if FAILED_TO_LIST_DIRECTORY_MSG not in err:
            raise HdfsException(errmsg)
        return []
    elif err:
        logger.debug('stderr:\n' + err)
    return out.splitlines()
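The function above returns full paths, while the question asks for file names. A small follow-up step could strip the directory part. This is a sketch, assuming the `-C` output is one path per line; the helper name `entry_names` and the sample string are hypothetical.

```python
import posixpath


def entry_names(ls_c_output):
    """Turn `hdfs dfs -ls -C` style output into bare entry names."""
    names = []
    for path in ls_c_output.splitlines():
        path = path.strip()
        if path:
            # HDFS paths use '/', so posixpath behaves the same on any client OS.
            names.append(posixpath.basename(path.rstrip("/")))
    return names


# Hypothetical sample: one path per line, as assumed for the -C flag.
sample = "/directory/file1\n/directory/file2\n/directory/sub\n"
print(entry_names(sample))
```

Using posixpath rather than os.path keeps the result correct on a Windows client, where os.path would expect backslashes as separators.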

【Comments】: