How to isolate titles from these image URLs? Answers

【Question title】: How to isolate titles from these image URLs?
【Posted】: 2020-11-07 18:31:22
【Question description】:

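I have a list of image URLs stored in "images". I'm trying to isolate the titles from these image URLs so that I can display both the image (using the whole URL) and the corresponding title in the HTML.

So far, I have this: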

titles = [image[149:199].strip() for image in images]

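This gives me stripped titles in the following format (I've provided two examples to show the pattern):

le_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg

and

cene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg

The leading part of each string (shown in bold in the original post) is what I want to remove. From the start I want to remove everything before the title, up to and including 220px-, and from the end: _-_Google_Art_Project.jpg

I'm new to Python and struggling with the syntax. On top of that, I'm doing this inside a loop over images (a list), so the string manipulation isn't straightforward and I'm not sure how to approach it.

The whole code, for reference, is below: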

webscraper.py:

@app.route('/') #this is what we type into our browser to go to pages. we create these using routes
@app.route('/home')
def home():
    images=imagescrape()
    
    titles=[image[99:247].strip() for image in images]
    images_titles=zip(images,titles)
    return render_template('home.html',images=images,images_titles=images_titles)

What I have tried / am trying:

x = txt.strip("_-_Google_Art_Project.jpg")

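I'm looking at strip to remove the unwanted last part of the string.

I'm not sure how to combine it with removing the leading part I want gone, or how to do this in the most elegant way given the structure and code I already have.

Visually, I'm trying to remove the highlighted leading text, as well as the last part of the string, _-_Google_Art_Project.jpg.

Rendered HTML output: (screenshot not included)

Update:

Based on the answer below, which is very helpful but doesn't quite solve it, I'm trying this approach (if possible without the unquote import, using pure Python string manipulation):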

def titleextract(url):
    #return unquote(url[58:url.rindex("/",58)-8].replace('_',''))
    title=url[58:]
    return title

The above returns:

Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg

But I want:

Rembrandt_van_Rijn_-_Self-Portrait

Or, for the second title/image in the list:

Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg

I want:

Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist

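【Comments】:

It would be very helpful if you provided a complete example of the input (e.g. as actual formatted source code) and the expected output (again, as actual formatted source code).
So do all the strings always end with "_-_Google_Art_Project.jpg"? Note that .strip won't work the way you expect: it doesn't strip a substring, it treats the argument as a set of characters and removes any of them from both ends.
Please provide formatted text in the question itself. Don't make me write code just to recreate your example.
You've provided a link to an external site. Provide a fully self-contained example in the question itself. You have a bunch of third-party dependencies that are irrelevant to your actual question. This is not a minimal reproducible example.
I think his question is in there, but it is far from clear. Please be nicer people... tell him kindly to write a minimal question; he can and probably will in the future. @MissComputing take a look at stackoverflow.com/help/minimal-reproducible-example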
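
To illustrate the comment about .strip above, here is a minimal sketch. The sample string is taken from the question; the helper name remove_suffix is made up for this illustration. It shows that strip keeps removing any character contained in its argument, so it can eat into the title itself, whereas an endswith check plus a slice removes exactly the suffix.

```python
SUFFIX = "_-_Google_Art_Project.jpg"
name = "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg"

# strip() treats SUFFIX as a set of characters and removes any of them
# from BOTH ends, so characters that belong to the title can be lost too.
stripped = name.strip(SUFFIX)


def remove_suffix(text, suffix):
    """Hypothetical helper: drop suffix only if text really ends with it."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


clean = remove_suffix(name, SUFFIX)

print(stripped)  # Rembrandt_van_Rijn_-_Self-Portrai  (the final "t" is lost)
print(clean)     # Rembrandt_van_Rijn_-_Self-Portrait
```

On Python 3.9 and later, the built-in str.removesuffix does the same job as the helper.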

Tags: python regex string

【Solution 1】:

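cene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg

You have this string and want to trim it. Say I store it in x:

y = x.split("px-")[1]                # everything after "px-"
z = y.rsplit("_-_Google_Art")[0]     # everything before "_-_Google_Art"

The first line creates a list of 2 elements: what comes before "px-" in the string, and what comes after. We just grab the part after it, since you want to drop the part before. If "px-" isn't always in the string, then we need to find something else to split on. The second line then splits that result on something near the end and grabs what comes before it.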
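
For the case where "px-" may be missing, one option is str.partition, which returns an empty separator instead of raising when nothing is found. The following is only a sketch under assumptions: the prefix ends in "px-", the ending is "_-_Google_Art_Project.jpg" as in the question, and the function name trim_title is hypothetical.

```python
def trim_title(text, start_marker="px-", end_marker="_-_Google_Art_Project.jpg"):
    """Keep what lies between start_marker and end_marker, when they are present."""
    # partition returns (before, separator, after); separator is "" if not found.
    before, sep, after = text.partition(start_marker)
    if sep:
        text = after
    # rpartition does the same but searches from the right-hand side.
    before, sep, after = text.rpartition(end_marker)
    if sep:
        text = before
    return text


# A string shaped like the example above
x = ("cene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-"
     "Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son"
     "_-_Google_Art_Project.jpg")

print(trim_title(x))
print(trim_title("no_markers_here.jpg"))  # left unchanged, no IndexError
```

Unlike indexing the result of split with [1], this leaves a string untouched when a marker is absent.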

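Edit: addressing the comment about how to split within that loop. I think you're referring to: titles=[image[149:199].strip() for image in images]

List comprehensions are great, but sometimes it's easier to write the loop out. I haven't tested this, but the idea is: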

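titles = []
for image in images:
    title = image[149:199].strip()
    # str has no lsplit method; split scans from the left
    cleaned_left = title.split("px-")[1]
    # continue from cleaned_left so the left-hand trim is not discarded
    cleaned_title = cleaned_left.rsplit("_-_Google_Art")[0]
    titles.append(cleaned_title)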
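
The same idea can also be applied to the full URL rather than a fixed slice, which avoids depending on positions such as 149:199. The sketch below uses only string methods, in line with the "pure Python" wish in the question's update. The helper name title_from_url and the example.org URLs are made up for illustration; they only mimic the shape of the thumbnail URLs described in the question.

```python
SUFFIX = "_-_Google_Art_Project.jpg"


def title_from_url(url):
    """Hypothetical helper: derive a title from a thumbnail URL with str methods only."""
    # Last path segment, e.g. "220px-Some_Title_-_Google_Art_Project.jpg"
    filename = url.strip().rsplit("/", 1)[-1]
    # Drop the size prefix if present; maxsplit=1 splits at the first "px-" only
    if "px-" in filename:
        filename = filename.split("px-", 1)[1]
    # Drop the known ending only when the string really ends with it
    if filename.endswith(SUFFIX):
        filename = filename[:-len(SUFFIX)]
    return filename


# Made-up URLs, including the trailing newline that imagescrape() appends
images = [
    "https://example.org/thumb/a/ab/"
    "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg"
    "/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg\n",
    "https://example.org/thumb/c/cd/"
    "Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg"
    "/220px-Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg\n",
]

titles = [title_from_url(image) for image in images]
images_titles = list(zip(images, titles))
print(titles)
```

Percent-escapes such as %2C are left as they are here, since no decoding is done without urllib.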

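【Discussion】:

Could you add your answer using my existing code, so that I can test it in my existing program? In other words, how do I integrate your suggestion into this: titles=[image[99:247].strip() for image in images]
Stack Overflow isn't for fixing people's programs; it's for helping people understand what they need to reach their goal. Your post was downvoted because you asked someone to write code for you. Instead, you should show the problem you're having with as little code as possible so that we can help you.
Glad you got it working. Make sure you understand how! I'd suggest reading up on splitting, stripping and indexing.
Yes, but fewer lines isn't always better. The line count doesn't matter; readability does. If you think it's still readable, then it's fine to do things in fewer lines. To do it all at once, just chain the calls with dots. Again, untested, but something like: [image[149:199].strip().split("px-")[1].rsplit("_-_Google_Art")[0] for image in images] IMO that's too much on one line, but that's open to interpretation.
Since the question is closed, I can't leave an answer. I've revised my answer to the previous question.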

【Solution 2】:

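import re                          # regular expressions used to match strings
from bs4 import BeautifulSoup      # web scraping library
from flask import Flask, render_template  # web framework used for the routes below
from urllib.request import urlopen # open a url connection
from urllib.parse import unquote   # decode special url characters

app = Flask(__name__)              # assumed: the original snippet does not show where app is created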

@app.route('/')
@app.route('/home')
def home():
    images=imagescrape()
    # Iterate over all sources and extract the title from the URL
    titles=(titleextract(src) for src in images)
    
    # zip combines two lists into one.
    # It goes through all elements and takes one element from the first
    # and one element from the second list, combines them into a tuple 
    # and adds them to a sequence / generator.
    images_titles = zip(images, titles)
    # The keyword must match the name the template iterates over (images_titles)
    return render_template('home.html', images_titles=images_titles)

def imagescrape():
    result_images=[]
    #html = urlopen('https://en.wikipedia.org/wiki/Prince_Harry,_Duke_of_Sussex')
    html = urlopen('https://en.wikipedia.org/wiki/Rembrandt')
    bs = BeautifulSoup(html, 'html.parser')
    images = bs.find_all('img', {'src':re.compile(r'\.jpg')})  # escape the dot so it matches a literal "."
    for image in images:
        result_images.append("https:"+image['src']+'\n') #concatenation!
    return result_images

def titleextract(url):
    # Extract the part of the string between the last two "/" characters
    # Decode special URL characters and cut off the suffix
    # Replace all "_" with spaces
    return unquote(url[58:url.rindex("/", 58)-4]).replace('_', ' ')

home.html:

{% for image, title in images_titles %}
    <div class="card" style="width: 18rem;">
      <img src="{{image}}" class="card-img-top" alt="...">
      <div class="card-body">
        <h5 class="card-title">{{title}}</h5>
        <p class="card-text">Some quick example text to build on the card title and make up the bulk of the card's content.</p>
        <a href="#" class="btn btn-primary">Go somewhere</a>
      </div>
    </div>
{% endfor %}
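
The titleextract function above relies on the fixed index 58, which only works while every URL shares the same prefix length. As a hedged alternative, the sketch below locates the file name with urlsplit and removes the prefix and ending with regular expressions. The name titleextract_regex, the two patterns, and the example.org URL are assumptions made for illustration, not part of the original answer.

```python
import re
from urllib.parse import unquote, urlsplit

# Assumed patterns: a "<digits>px-" thumbnail prefix and the Google Art Project ending
SIZE_PREFIX = re.compile(r"^\d+px-")
ART_SUFFIX = re.compile(r"_-_Google_Art_Project\.jpg$")


def titleextract_regex(url):
    """Hypothetical variant of titleextract that does not depend on a fixed index."""
    # Take the URL path, then its last segment (the thumbnail file name)
    filename = urlsplit(url.strip()).path.rsplit("/", 1)[-1]
    filename = SIZE_PREFIX.sub("", filename)
    filename = ART_SUFFIX.sub("", filename)
    # Decode percent-escapes such as %2C, then show underscores as spaces
    return unquote(filename).replace("_", " ")


# Made-up URL shaped like the thumbnails in the question
url = ("https://example.org/thumb/c/cd/"
       "Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg"
       "/220px-Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg")

print(titleextract_regex(url))
# Rembrandt van Rijn - Saskia van Uylenburgh, the Wife of the Artist
```

File names that lack the size prefix or the Google Art Project ending pass through with only decoding and underscore replacement applied.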

【Discussion】: