Extracting image filename with extension from string that includes file path and other random text

[Question title]: Extracting image filename with extension from string that includes file path and other random text
[Posted]: 2021-11-20 00:37:10
[Question description]:

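I am using Python to build a list of the images used on my website, along with the pages that use them. I can extract the pages, but I can't figure out how to extract the image filenames. The regex examples I have looked at require the input to be just the path, but many of my HTML pages contain image links embedded in paragraphs. My guess is that I need to somehow extract only the path from the text string and then extract the filename from the path, but I don't know how to do that.

As for the website's file structure, all my images are in the same folder, so that will never change. However, I have about 5000 pages to scan, and the image links can appear almost anywhere, from paragraphs to lists or tables.

Here is my code so far: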

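    # html_filename is the page currently being scanned (defined elsewhere).
    with open(html_filename, 'r', encoding="utf-8") as f:
        file_str = f.readlines()

    img = 'img'

    # Append every line that mentions 'img', followed by the page name.
    with open('link_list.txt', 'a', encoding="utf-8") as f:
        for line in file_str:
            if img in line:
                # Strip the line's own newline so the page name stays on the same row.
                f.write(line.rstrip('\n') + ', ' + html_filename + '\n')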

Sample output looks like this:

    <img src="../../Resources/Images/top.png" />, page_one.htm
    <img src="../../Resources/Images/bottom.png" />, page_two.htm
    <p>Next, either double-click in Column A, or click <img alt="" border="0" src="..\..\Resources\IMAGES\ICON.png" style="border: none;" />  to search for it.</p>, page_three.htm

What I want to get:

    top.png, page_one.htm
    bottom.png, page_two.htm
    ICON.png, page_three.htm
    etc.

Any help would be greatly appreciated.

[Question discussion]:

Tags: python html filenames

[Solution 1]:

You can use the BeautifulSoup module to extract them. Here is an example:

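    import os

    from bs4 import BeautifulSoup

    # Parse the page (the 'lxml' parser requires the lxml package).
    with open('index.html', 'r', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # Print the file name part of each <img> tag's src attribute.
    # Note: on POSIX, os.path.split does not treat '\' as a separator.
    for el in soup.find_all('img'):
        src = el.get('src')  # None when the tag has no src attribute
        if src:
            print(os.path.split(src)[-1])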

This is the output for one HTML file:

    profile.png
    project-featured.jpg
    project-sdc.png
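
The sample lines in the question mix forward slashes and backslashes in `src`, and the desired output pairs each filename with its page. The following is a minimal sketch of one way to cover both points. It is not the answer author's code: it swaps BeautifulSoup for the standard library's `html.parser`, and uses `ntpath.basename`, which splits on both `/` and `\` regardless of the operating system. The names `ImgSrcCollector`, `image_filenames`, and `format_rows` are hypothetical helpers introduced only for illustration.

```python
import ntpath
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the file name part of every <img src=...> in a document."""

    def __init__(self):
        super().__init__()
        self.filenames = []

    def handle_starttag(self, tag, attrs):
        # Called for <img ...> and, through the default handle_startendtag,
        # for self-closing <img ... /> as well.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # ntpath.basename treats both "/" and "\" as separators.
            self.filenames.append(ntpath.basename(src))


def image_filenames(html_text):
    """Return the image file names found in html_text, in document order."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.filenames


def format_rows(html_text, page_name):
    """Build 'filename, page' rows in the shape the question asks for."""
    return [f"{name}, {page_name}" for name in image_filenames(html_text)]


# Sample input adapted from the question's third output line.
sample = (
    '<p>Next, click <img alt="" border="0" '
    'src="..\\..\\Resources\\IMAGES\\ICON.png" style="border: none;" /> '
    'to search.</p>'
)
print(format_rows(sample, "page_three.htm"))  # ['ICON.png, page_three.htm']
```

To scan many pages, one could read each file's text, call `format_rows(text, html_filename)`, and append the resulting rows to `link_list.txt`, one per line.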

[Discussion]: