【Question title】: Extract img src from HTML Using BeautifulSoup4
【Posted】: 2017-05-05 16:21:32
【Question】:
<div id="thumbnailsImagePreview">
     <img src="getImage.do?imageSize=Small&amp;imageId=730645&amp;r=150521020" imageindex="0" hspace="0" vspace="0" loaded="false" class="selected">
     <img src="getImage.do?imageSize=Small&amp;imageId=7589956&amp;r=150521020" imageindex="1" hspace="0" vspace="0" loaded="false">
     <img src="getImage.do?imageSize=Small&amp;imageId=7590018&amp;r=150521020" imageindex="2" hspace="0" vspace="0" loaded="false">
     <img src="getImage.do?imageSize=Small&amp;imageId=2803850&amp;r=150521020" imageindex="3" hspace="0" vspace="0" loaded="false">
     <img src="getImage.do?imageSize=Small&amp;imageId=2973197&amp;r=150521020" imageindex="4" hspace="0" vspace="0" loaded="false">
     <img src="getImage.do?imageSize=Small&amp;imageId=7589888&amp;r=150521020" imageindex="5" hspace="0" vspace="0" loaded="false">
     <img src="getImage.do?imageSize=Small&amp;imageId=7877267&amp;r=150521020" imageindex="6" hspace="0" vspace="0" loaded="false">
     <img src="getImage.do?imageSize=Small&amp;imageId=7877375&amp;r=150521020" imageindex="7" hspace="0" vspace="0" loaded="false">
     <img src="getImage.do?imageSize=Small&amp;imageId=6812892&amp;r=150521020" imageindex="8" hspace="0" vspace="0" loaded="false">

</div>

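I am trying to extract the img src links from this HTML (for the images that have an associated imageindex). Because they all sit inside the div with id "thumbnailsImagePreview", the following line of code gives me one big chunk of text, so I cannot parse out each individual img src link.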

images = soup.find_all('div', attrs = {'id' : 'thumbnailsImagePreview'})

How can I get an array of the links?

When I print images, this is what I get:

[<div id="thumbnailsImagePreview">\n<img class="selected" hspace="0" 
imageindex="0" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=730645&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="1" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=7589956&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="2" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=7590018&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="3" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=2803850&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="4" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=2973197&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="5" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=7589888&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="6" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=7877267&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="7" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=7877375&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="8" loaded="false" src="getImage.do?
imageSize=Small&amp;imageId=6812892&amp;r=150521020" vspace="0"/>\n<img 
hspace="0" imageindex="9" loaded="false" 
</div>]

【Comments】:

    Tags: python python-2.7 web-scraping beautifulsoup


    【Solution 1】:

    You need to locate the inner img elements and read each one's src attribute value by treating the element like a dictionary:

    image_srcs = [img['src'] for img in soup.select('#thumbnailsImagePreview img[src]')]
    

    Here, #thumbnailsImagePreview img[src] is a CSS selector. It matches every img element that has a src attribute and sits under the element with id="thumbnailsImagePreview".
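As a minimal, self-contained sketch of the selector in action, the snippet below parses a shortened copy of the question's HTML. The shortened markup, the extra img without a src, the choice of the built-in "html.parser" backend, and the variable names html, container and same_srcs are assumptions made for illustration; they are not from the original post. It also shows what should be an equivalent lookup using find and find_all.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question, plus one <img> without a src
# (added here only to show that the selector skips it).
html = """
<div id="thumbnailsImagePreview">
  <img src="getImage.do?imageSize=Small&amp;imageId=730645&amp;r=150521020" imageindex="0" class="selected">
  <img src="getImage.do?imageSize=Small&amp;imageId=7589956&amp;r=150521020" imageindex="1">
  <img imageindex="9">
</div>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

# CSS selector: every <img> with a src attribute under #thumbnailsImagePreview.
image_srcs = [img["src"] for img in soup.select("#thumbnailsImagePreview img[src]")]

# The same lookup without CSS: find the container, then its <img src=...> children.
container = soup.find("div", id="thumbnailsImagePreview")
same_srcs = [img["src"] for img in container.find_all("img", src=True)]

print(image_srcs)
```

Note that the parser decodes the &amp; entities, so the extracted strings contain plain & characters. The values are also relative URLs, so they would need to be combined with the page's base URL before being requested.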

    【Comments】:
