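Python regular expression to get filename in a long path

【Question title】: Python regular expression to get filename in a long path
【Posted】: 2011-04-28 15:07:57
【Question】:

I need to analyze some log files like the ones below, and I want to retrieve three pieces of data:

1. The time.
2. Part of the directory; in this case it would be ABC and DEF from the input file.
3. The file name from the input file, in this case 2C.013000000B.dat, 20100722B.TXT, 20100722D1-XYZ.TXT and 2C.250B.

I use this regular expression, but it fails to get the third part.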

(\d\d:\d\d:\d\d).*(ABC|DEF).*\\(\d\w\.?\w\..*)\soutput.*

Any suggestions would be appreciated.

08:38:36   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-ABC\2C.013000000B.dat output file=c:\local\project1\data\2C.013000000B.dat.ext
06:40:37   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-ABC\20100722B.TXT output file=c:\local\project1\data\20100722B.TXT.ext
06:40:39   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-DEF\20100722D1-XYZ.TXT output file=c:\local\project1\data\20100722D1-YFP.TXT.ext
06:40:42   TestModule - [INFO]result success !! ftp_site=ftp.test.com file_dir=CPY input file=\root\level1\level2-DEF\2C.250B output file=c:\local\project1\data\2C.250B.ext

BR

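Edward

【Comments】:

Tags: python regex file

【Solution 1】:

Regular expressions are very good at this kind of problem, i.e. parsing log file records. MarcoS's answer solves your immediate problem nicely. However, another approach is to write a (reusable) general-purpose function that decomposes a log file record into its various components and returns a match object containing all of the parsed components. Once the record is decomposed, it is easy to apply tests to the component parts to check various requirements (e.g. that the input file path must end with ABC or DEF). Here is a Python script with such a function, decomposeLogEntry(), which demonstrates how to use it to solve the problem at hand: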

import re
def decomposeLogEntry(text):
    r""" Decompose log file entry into its various components.

    If text is a valid log entry, return regex match object of
    log entry components strings. Otherwise return None."""
    return re.match(r"""
        # Decompose log file entry into its various components.
        ^                            # Anchor to start of string
        (?P<time>\d\d:\d\d:\d\d)     # Capture: time
        \s+
        (?P<modname>\w+?)            # Capture module name
        \s-\s\[
        (?P<msgtype>[^]]+)           # Capture message type
        \]
        (?P<message>[^!]+)           # Capture message text
        !!\sftp_site=
        (?P<ftpsite>\S+?)            # Capture ftp URL
        \sfile_dir=
        (?P<filedir>\S+?)            # Capture file directory?
        \sinput\sfile=
        (?P<infile>                  # Capture input path and filename
          (?P<infilepath>\S+)\\      # Capture input file path
          (?P<infilename>[^\s\\]+)   # Capture input file filename
        )
        \soutput\sfile=
        (?P<outfile>                 # Capture output path and filename
          (?P<outfilepath>\S+)\\     # Capture output file path
          (?P<outfilename>[^\s\\]+)  # Capture output file filename
        )
        \s*                          # Optional whitespace at end.
        $                            # Anchor to end of string
        """, text, re.IGNORECASE | re.VERBOSE)

# Demonstrate decomposeLogEntry function. Print components of all log entries.
mcnt = 0
with open("testdata.log") as f:
    for line in f:
        # Decompose this line into its components.
        m = decomposeLogEntry(line)
        if m:
            mcnt += 1
            print("Match number %d" % mcnt)
            print("  Time:             %s" % m.group("time"))
            print("  Module name:      %s" % m.group("modname"))
            print("  Message type:     %s" % m.group("msgtype"))
            print("  Message:          %s" % m.group("message"))
            print("  FTP site URL:     %s" % m.group("ftpsite"))
            print("  Input file:       %s" % m.group("infile"))
            print("  Input file path:  %s" % m.group("infilepath"))
            print("  Input file name:  %s" % m.group("infilename"))
            print("  Output file:      %s" % m.group("outfile"))
            print("  Output file path: %s" % m.group("outfilepath"))
            print("  Output file name: %s" % m.group("outfilename"))
            print("")

# Next pick out only the desired data.
matches = []
with open("testdata.log") as f:
    for line in f:
        # Decompose this line into its components.
        m = decomposeLogEntry(line)
        # Keep the record only if its input file path ends with ABC or DEF.
        if m and re.search(r"ABC$|DEF$", m.group("infilepath")):
            matches.append(line)
print("There were %d matching records" % len(matches))

This function not only picks out the individual parts you are interested in, it also validates the input and rejects malformed records. Once written and debugged, it can be reused by other programs that need to analyze the log files for other requirements.

Here is the output of the script when applied to your test data:

r"""
Match number 1
  Time:             08:38:36
  Module name:      TestModule
  Message type:     INFO
  Message:          result success
  FTP site URL:     ftp.test.com
  Input file:       \root\level1\level2-ABC\2C.013000000B.dat
  Input file path:  \root\level1\level2-ABC
  Input file name:  2C.013000000B.dat
  Output file:      c:\local\project1\data\2C.013000000B.dat.ext
  Output file path: c:\local\project1\data
  Output file name: 2C.013000000B.dat.ext

Match number 2
  Time:             06:40:37
  Module name:      TestModule
  Message type:     INFO
  Message:          result success
  FTP site URL:     ftp.test.com
  Input file:       \root\level1\level2-ABC\20100722B.TXT
  Input file path:  \root\level1\level2-ABC
  Input file name:  20100722B.TXT
  Output file:      c:\local\project1\data\20100722B.TXT.ext
  Output file path: c:\local\project1\data
  Output file name: 20100722B.TXT.ext

Match number 3
  Time:             06:40:39
  Module name:      TestModule
  Message type:     INFO
  Message:          result success
  FTP site URL:     ftp.test.com
  Input file:       \root\level1\level2-DEF\20100722D1-XYZ.TXT
  Input file path:  \root\level1\level2-DEF
  Input file name:  20100722D1-XYZ.TXT
  Output file:      c:\local\project1\data\20100722D1-YFP.TXT.ext
  Output file path: c:\local\project1\data
  Output file name: 20100722D1-YFP.TXT.ext

Match number 4
  Time:             06:40:42
  Module name:      TestModule
  Message type:     INFO
  Message:          result success
  FTP site URL:     ftp.test.com
  Input file:       \root\level1\level2-DEF\2C.250B
  Input file path:  \root\level1\level2-DEF
  Input file name:  2C.250B
  Output file:      c:\local\project1\data\2C.250B.ext
  Output file path: c:\local\project1\data
  Output file name: 2C.250B.ext

There were 4 matching records
"""

【Comments】:

【Solution 2】:

Using split is a good idea. If you really want a regex, I would do it like this:

(\d\d:\d\d:\d\d).*?input file=.*?(ABC|DEF)\\\\(.*?)\soutput

Test it here
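For readers who want to try this from Python, here is a minimal sketch, assuming Python 3 and one of the sample lines from the question. The name LOG_RE is illustrative only. The pattern is written as a raw string, so two backslashes are enough to match one literal backslash; the four-backslash form above is the equivalent spelling for an ordinary (non-raw) string literal.

```python
import re

# Raw string: "\\" in the pattern matches one literal backslash in the log line.
LOG_RE = re.compile(
    r"(\d\d:\d\d:\d\d).*?input file=.*?(ABC|DEF)\\(.*?)\soutput",
    re.IGNORECASE,
)

# In an ordinary string literal the same two characters need four backslashes.
same_pattern_text = (r"\\" == "\\\\")

# Sample line copied from the question.
line = (
    r"06:40:39   TestModule - [INFO]result success !! ftp_site=ftp.test.com "
    r"file_dir=CPY input file=\root\level1\level2-DEF\20100722D1-XYZ.TXT "
    r"output file=c:\local\project1\data\20100722D1-YFP.TXT.ext"
)

m = LOG_RE.search(line)
if m:
    time_part, dir_part, filename = m.groups()
    print(time_part, dir_part, filename)  # 06:40:39 DEF 20100722D1-XYZ.TXT
```

Anchoring on "input file=" first is what keeps the directory match from drifting into the output path.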

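【Comments】:

I tried it in my program, which uses Python 2.7, but it fails: Traceback (most recent call last): File "t.py", line 21, in <module> S3_search1=re.compile('(\d\d:\d\d:\d\d).*?input file=.*?(ABC|DEF)\(.*?)\soutput',re.IGNORECASE) File "c:\python27\lib\re.py", line 190, in compile return _compile(pattern, flags) File "c:\python27\lib\re.py", line 245, in _compile raise error, v # invalid expression sre_constants.error: unbalanced parenthesis
@user729544: Sorry, in Python you have to watch out for The Backslash Plague ... another reason why a regex may not be a good idea in this case
Thanks, it works now. Yes, I had missed the backslash plague.

【Solution 3】:

Using a regex tool will make troubleshooting your regex much easier. Try this free one; there may be better ones, but this one works well. You can paste your log file into it and try out your regex one piece at a time, and it highlights the matches in real time.

【Comments】:

Thanks, I did try this expression builder first, but I got this result, which is why I asked here. Thank you very much. Python 2.7.1 (r271:86832, Nov 27 2010, 18:30:46) [MSC v.1500 32 bit (Intel)] on win32 >>> import re >>> s1=re.compile('\(\d\w\.?\w*\-?\w*.?(P3|TXT|))\s',re.IGNORECASE) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "c:\python27\lib\re.py", line 190, in compile return _compile(pattern, flags) File "c:\python27\lib\re.py", line 245, in _compile raise error, v # invalid expression sre_constants.error: unbalanced parenthesis
Whenever I have a complex regex that isn't working, I try it one piece at a time ... for example, I start with the first part and, once that works, add more, and so on. That makes it easy to troubleshoot.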
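The piece-by-piece approach can also be scripted. The following is only a sketch, assuming Python 3: the helper first_failing_piece and the way the pattern is cut into pieces are hypothetical choices made for illustration, not part of any tool mentioned in this thread. It grows the pattern one piece at a time and reports the first piece at which the pattern stops compiling or stops matching.

```python
import re

# Sample line copied from the question.
line = (
    r"08:38:36   TestModule - [INFO]result success !! ftp_site=ftp.test.com "
    r"file_dir=CPY input file=\root\level1\level2-ABC\2C.013000000B.dat "
    r"output file=c:\local\project1\data\2C.013000000B.dat.ext"
)

# The Solution 2 pattern, cut into pieces (the split points are arbitrary).
pieces = [
    r"(\d\d:\d\d:\d\d)",
    r".*?input file=",
    r".*?(ABC|DEF)\\",
    r"(.*?)\soutput",
]

def first_failing_piece(pieces, text):
    """Return the index of the first piece that breaks the pattern, or None."""
    pattern = ""
    for i, piece in enumerate(pieces):
        pattern += piece
        try:
            if re.search(pattern, text) is None:
                return i  # compiles, but no longer matches the text
        except re.error:
            return i      # does not even compile (e.g. unbalanced parenthesis)
    return None

print(first_failing_piece(pieces, line))  # None: every prefix matches

# Same pattern with the backslash under-escaped, as in the traceback above.
broken = pieces[:2] + [r".*?(ABC|DEF)\(.*?)\soutput"]
print(first_failing_piece(broken, line))  # 2
```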

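【Solution 4】:

Why a regex?

Consider using split to get all the words. That gives you the timestamp directly. Then go through the remaining words and check whether each one contains an =; if it does, split it again and you have your paths and other parameters. Standard Python path handling (os.path) will help you get the folder and the file name.

Of course, this approach fails if your path names can contain spaces; otherwise it is definitely worth considering.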
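The split-based approach described above might look like the following sketch, assuming Python 3 and paths without spaces. The function name parse_line and the "input file" / "output file" dictionary keys are made up for this example. It uses ntpath, the standard-library module that applies Windows path rules on any platform, because os.path only treats the backslash as a separator when the script itself runs on Windows.

```python
import ntpath  # Windows path rules, regardless of the OS running the script

def parse_line(line):
    """Split a log line into a dict holding the time and all key=value fields."""
    words = line.split()
    fields = {"time": words[0]}
    prev = ""
    for word in words[1:]:
        if "=" in word:
            key, value = word.split("=", 1)
            # "input file=" and "output file=" both yield the key "file",
            # so prefix the preceding word to tell them apart.
            if key == "file":
                key = prev + " " + key
            fields[key] = value
        prev = word
    return fields

# Sample line copied from the question.
line = (
    r"08:38:36   TestModule - [INFO]result success !! ftp_site=ftp.test.com "
    r"file_dir=CPY input file=\root\level1\level2-ABC\2C.013000000B.dat "
    r"output file=c:\local\project1\data\2C.013000000B.dat.ext"
)

fields = parse_line(line)
folder, filename = ntpath.split(fields["input file"])
# "level2-ABC" -> "ABC"
directory = ntpath.basename(folder).split("-")[-1]
print(fields["time"], directory, filename)  # 08:38:36 ABC 2C.013000000B.dat
```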

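【Comments】:

Thanks, I tried split and it works, but I would still like to know how a regex can do this.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

【Solution 5】:

You can do this simply with ordinary string processing: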

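with open("file") as f:
    for line in f:
        # Everything before "input" holds the timestamp; the rest holds the paths.
        date, b = line.split("input")
        print("time: %s" % date.split()[0])
        # Keep only the input path and drop the space that precedes "output".
        input_path = b.split("output")[0].strip()
        tokens = input_path.split("\\")
        filename = tokens[-1]
        # "level2-ABC" -> "ABC"
        directory = tokens[-2].split("-")[-1]
        print("%s %s" % (filename, directory))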

【Comments】:

Thanks, it works, but I would still like to know how to do this with a regex.

【Solution 6】:

This works for your examples:

r'(\d\d:\d\d:\d\d).*(ABC|DEF).*?([^\\]*)\soutput.*'

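While a well-written regex is appropriate here, I would approach this problem differently. More specifically, os.path.split is designed to separate a file name from its base path, and it handles all the corner cases that this regex ignores.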
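A rough sketch of that combination, assuming Python 3 and input paths without spaces: a small regex only locates the time and the whole input path, and the path module does the splitting. Since os.path treats the backslash as a separator only when running on Windows, the sketch calls ntpath.split, which is the Windows implementation of os.path.split and is importable on every platform.

```python
import ntpath  # the Windows flavour of os.path, available on any OS
import re

# Sample line copied from the question.
line = (
    r"06:40:42   TestModule - [INFO]result success !! ftp_site=ftp.test.com "
    r"file_dir=CPY input file=\root\level1\level2-DEF\2C.250B "
    r"output file=c:\local\project1\data\2C.250B.ext"
)

# The regex only grabs the time and the whole input path (no spaces assumed).
m = re.search(r"(\d\d:\d\d:\d\d).*?input file=(\S+)", line)
time_part = m.group(1)

# Let the path module separate the folder from the file name.
folder, filename = ntpath.split(m.group(2))
print(time_part, folder, filename)  # 06:40:42 \root\level1\level2-DEF 2C.250B

# The directory requirement becomes a plain string test.
print(folder.endswith(("ABC", "DEF")))  # True
```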

【Comments】:

Thanks, but this fails if the file name contains ABC or DEF, as in 20100722D1-ABC.TXT; the result is then just .TXT. (\d\d:\d\d:\d\d).*?input file=.*?(ABC|DEF)\\\\(.*?)\soutput works fine. Thanks, MarcoS.
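The difference is easy to reproduce. The sketch below assumes Python 3 and uses a made-up log line, built from the question's format with the file name 20100722D1-ABC.TXT from this comment; the names greedy and anchored are labels chosen for this example. Both patterns are written as raw strings.

```python
import re

# Hypothetical log line: same format as the question, but the file name
# itself contains "ABC".
line = (
    r"06:40:39   TestModule - [INFO]result success !! ftp_site=ftp.test.com "
    r"file_dir=CPY input file=\root\level1\level2-DEF\20100722D1-ABC.TXT "
    r"output file=c:\local\project1\data\20100722D1-ABC.TXT.ext"
)

# Solution 6: the greedy ".*" runs up to the last usable ABC/DEF, which is
# the one inside the file name.
greedy = re.compile(r"(\d\d:\d\d:\d\d).*(ABC|DEF).*?([^\\]*)\soutput.*")

# Solution 2: requires ABC/DEF to be followed directly by a backslash,
# i.e. to sit at the end of a directory name.
anchored = re.compile(r"(\d\d:\d\d:\d\d).*?input file=.*?(ABC|DEF)\\(.*?)\soutput")

print(greedy.search(line).groups())    # ('06:40:39', 'ABC', '.TXT')
print(anchored.search(line).groups())  # ('06:40:39', 'DEF', '20100722D1-ABC.TXT')
```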