Extracting tag under tag values from HTML in python: answers

【Question title】: Extracting tag under tag values from HTML in python
【Posted】: 2019-06-30 13:10:18
【Question description】:

<div class="book-cover-image">
<img alt="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities" class="img-responsive" src="https://cdn.downtoearth.org.in/library/medium/2016-05-23/0.42611000_1463993925_book-cover.jpg" title="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities"/>
</div>

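I need to extract the value of the title attribute from every div tag like this one. What is the best way to do that? Please suggest.

I am trying to get the titles of all the books listed on this page.

This is what I have tried so far: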

import requests 
from bs4 import BeautifulSoup as bs


url1 ="https://www.downtoearth.org.in/books"
page1 = requests.get(url1, verify=False)

#print(page1.content)

soup1= bs(page1.content, 'html.parser')
class_names = soup1.find_all('div',{'class':'book-cover-image'} )

for class_name in class_names:
    title_text = class_name.text
    print(class_name)
    print(title_text)

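【Question discussion】:

Add sample input and the desired output.
cdn.downtoearth.org.in/library/medium/2016-05-23/… " title="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities"/>
The output should be the title: NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities
What have you tried so far?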
url1 ="downtoearth.org.in/books" page1 = requests.get(url1, verify=False) #print(page1.content) soup1= bs(page1.content, 'html.parser') class_names = soup1.find_all('div',{'class':'book-cover-image'} ) for class_name in class_names: title_text = class_name.text print(class_name) #print(title_text)

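Tags: python html text beautifulsoup tags

【Solution 1】:

To get the title attribute of every book cover, you can use the CSS selector .book-cover-image img[title] (it selects all <img> tags that have a title attribute and sit under a tag with the class book-cover-image):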

import requests
from bs4 import BeautifulSoup

url = 'https://www.downtoearth.org.in/books'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for i, img in enumerate(soup.select('.book-cover-image img[title]'), 1):
    print('{:>4}\t{}'.format(i, img['title']))

Prints:

   1    State of India’s Environment 2019: In Figures (eBook)                           
   2    Victim Africa (eBook)                                                           
   3    Frames of change - Heartening tales that define new India                       
   4    STATE OF INDIA’S ENVIRONMENT 2019                                               
   5    State of India’s Environment In Figures 2018 (eBook)                            
   6    Getting to know about environment                                               
   7    CLIMATE CHANGE NOW - The Story of Carbon Colonisation                           
   8    Climate change - For the young and curious                                      
   9    Conflicts of Interest: My Journey through India’s Green Movement                
  10    Body Burden: Lifestyle Diseases                                                 
  11    STATE OF INDIA’S ENVIRONMENT 2018                                               
  12    DROUGHT BUT WHY? How India can fight the scourge by abandoning drought relief   
  13    SOE 2017 (Print version) and SOE 2017 in Figures (Digital version) combo offer  
  14    State of India's Environment 2017 In Figures (eBook)                            
  15    Environment Reader for Universities                                             
  16    Not in My Backyard  (Book & DVD combo offer)                                    
  17    The Crow, Honey Hunter and the Kitchen Garden                                   
  18    BIOSCOPE OF PIU & POM                                                           
  19    SOE 2017 and Food book combo offer                                              
  20    FIRST FOOD: Culture of Taste                                                    
  21    Annual State Of India’s Environment - SOE 2017                                  
  22    An 8-million-year-old mysterious date with monsoon  (e-book)                    
  23    Why I Should be Tolerant                                                        
  24    NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities
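
The selector can also be tried offline against the sample markup from the question. The following is only a minimal sketch: the names SAMPLE_HTML and extract_titles are made up for illustration, and the built-in 'html.parser' is used instead of 'lxml' so that no extra parser is needed. It also shows why reading .text on the div, as in the question, prints nothing: the div holds only an <img>, and the title lives in an attribute, not in a text node.

```python
from bs4 import BeautifulSoup

# The sample markup from the question, inlined so no network access is needed.
SAMPLE_HTML = """
<div class="book-cover-image">
<img alt="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities" class="img-responsive" src="https://cdn.downtoearth.org.in/library/medium/2016-05-23/0.42611000_1463993925_book-cover.jpg" title="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities"/>
</div>
"""


def extract_titles(html):
    """Return the title attribute of every <img> under .book-cover-image."""
    soup = BeautifulSoup(html, 'html.parser')
    return [img['title'] for img in soup.select('.book-cover-image img[title]')]


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
div = soup.find('div', {'class': 'book-cover-image'})

# The div contains only an <img>, which has no text of its own,
# so .text yields nothing but whitespace.
div_text = div.text.strip()

titles = extract_titles(SAMPLE_HTML)
print(repr(div_text))
print(titles)
```

Because the selector includes [title], any <img> without a title attribute is skipped, so img['title'] cannot raise a KeyError inside the loop.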

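【Discussion】:

Thanks. If possible, please explain the last two lines of the code:
@AnchalSarraf soup.select('.book-cover-image img[title]') runs the CSS selector against the soup (as described in the answer), and print('{:>4}\t{}'.format(i, img['title'])) does basic string formatting. For example, {:>4} prints the value right-aligned in a field 4 characters wide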
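
The formatting part can be checked on its own, without any scraping. This is a small sketch using two of the titles from the output as a stand-in list; the variable names are only for illustration.

```python
# Two titles taken from the printed output, used as stand-in data.
titles = ['Victim Africa (eBook)', 'Body Burden: Lifestyle Diseases']

# enumerate(..., 1) starts counting at 1 instead of 0.
# {:>4} right-aligns the counter in a field at least 4 characters wide,
# \t inserts a tab, and the second {} is filled with the title.
lines = ['{:>4}\t{}'.format(i, title) for i, title in enumerate(titles, 1)]

for line in lines:
    print(line)

# The width is a minimum: a longer value is not truncated.
wide = '{:>4}'.format(12345)
print(wide)
```

The counter 1 is padded with three leading spaces, while 12345 is printed in full because the field width only sets a lower bound.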

【Solution 2】:

You can use XPath like this.

import requests
from lxml import html

url1 ="https://www.downtoearth.org.in/books"
res = requests.get(url1, verify=False)
tree = html.fromstring(res.text)
d = tree.xpath("//div[@class='book-cover-image']//img/@title")
for title in d:
    print(title)

Output:

State of India’s Environment 2019: In Figures (eBook)
Victim Africa (eBook)
Frames of change - Heartening tales that define new India
STATE OF INDIA’S ENVIRONMENT 2019
State of India’s Environment In Figures 2018 (eBook)
Getting to know about environment
CLIMATE CHANGE NOW - The Story of Carbon Colonisation
Climate change - For the young and curious
Conflicts of Interest: My Journey through India’s Green Movement
Body Burden: Lifestyle Diseases
STATE OF INDIA’S ENVIRONMENT 2018
DROUGHT BUT WHY? How India can fight the scourge by abandoning drought relief
SOE 2017 (Print version) and SOE 2017 in Figures (Digital version) combo offer
State of India's Environment 2017 In Figures (eBook)
Environment Reader for Universities
Not in My Backyard  (Book & DVD combo offer)
The Crow, Honey Hunter and the Kitchen Garden
BIOSCOPE OF PIU & POM
SOE 2017 and Food book combo offer
FIRST FOOD: Culture of Taste
Annual State Of India’s Environment - SOE 2017
An 8-million-year-old mysterious date with monsoon  (e-book) 
Why I Should be Tolerant
NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities
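
One thing to keep in mind with this XPath: @class='book-cover-image' compares the whole class attribute, so it would miss a div that carries an additional class. The sketch below uses made-up markup (the "featured" class and both titles are hypothetical, not taken from the page) to compare the exact match with a common token-based pattern.

```python
from lxml import html

# Hypothetical markup: the second div carries an extra class name.
SAMPLE_HTML = """
<div>
  <div class="book-cover-image"><img title="First book" src="a.jpg"/></div>
  <div class="book-cover-image featured"><img title="Second book" src="b.jpg"/></div>
</div>
""".strip()

tree = html.fromstring(SAMPLE_HTML)

# Exact comparison: matches only when the whole class attribute equals the string.
exact = tree.xpath("//div[@class='book-cover-image']//img/@title")

# Token match: pads the class list with spaces and looks for the padded name.
token = tree.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), "
    "' book-cover-image ')]//img/@title"
)

print(exact)
print(token)
```

For the page in the question the exact form was enough, as the output above shows; the token form is only worth the extra length when the class attribute may hold several names.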

【Discussion】:

Thanks Rahul, it is simple and easy to understand.
@AnchalSarraf Glad to help. :)