【Question title】: Accessing multiple tags inside one tag
【Posted】: 2020-03-29 14:21:53
【Question description】:

I have the following HTML to scrape:

<ul class="item-features">
        <li>
            <strong>Graphic Type:</strong> Dedicated Card
        </li>
        <li>
            <strong>Resolution:</strong> 3840 x 2160
        </li>
        <li>
            <strong>Weight:</strong> 4.40 lbs.
        </li>
        <li>
            <strong>Color:</strong> Black
        </li>
</ul>

I want to write each individual feature (Graphic Type, Resolution, Weight, etc.) to its own column in a .csv file.

I tried the following in Python:

import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup
url ='https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'
Client = req(url)
pagina = Client.read()
Client.close()
pagina_soup=soup(pagina,"html.parser")
productes = pagina_soup.findAll("div",{"class":"item-container"})
producte = productes[0]
features = producte.findAll("ul",{"class":"item-features"})
features[0].text

This returns all the features, but they end up in a single column of the .csv:

'\nGraphic Type: Dedicated CardResolution: 3840 x 2160Weight: 4.40 lbs.Color: Black\nModel #: AERO 15 OLED SA-7US5020SH\nItem #: N82E16834233268\nReturn Policy: Standard Return Policy\n'

I don't know how to export them one by one. Here is my full Python code:

import bs4
from urllib.request import urlopen as req
from bs4 import BeautifulSoup as soup

# URL of the page we will scrape
url = 'https://www.newegg.com/Laptops-Notebooks/SubCategory/ID-32?Tid=6740'

# Open a connection to the web page
Client = req(url)
#Offloads the content of the page into a variable
pagina = Client.read()
#Closes the client
Client.close()
#html parser
pagina_soup=soup(pagina,"html.parser")
#grabs each product
productes = pagina_soup.findAll("div",{"class":"item-container"})

# Open a .csv file
filename = "ordinadors.csv"
f = open(filename, "w")

# Headers of my .csv file
headers = "Marca; Producte; PreuActual; PreuAnterior; Rebaixa; CostEnvio\n"
# Write the header
f.write(headers)

# Loop over all the products
for producte in productes:

    # Get the product's brand
    marca_productes = producte.findAll("div",{"class":"item-info"})
    marca = marca_productes[0].div.a.img["title"]

    # Get the product's name
    name = producte.a.img["title"]

    # Current price
    actual_productes = producte.findAll("li",{"class":"price-current"})
    preuActual = actual_productes[0].strong.text

    # Previous price
    try:
        preuAbans = producte.find("li", class_="price-was").next_element.strip()
    except (AttributeError, TypeError):
        # No previous price for this product: leave the field empty
        preuAbans = ""
        print("Not found")

    # Get the shipping costs
    costos_productes = producte.findAll("li",{"class":"price-ship"})
    # findAll returns a list, so take the first element and strip it.
    costos = costos_productes[0].text.strip()

    # Write the row to the file
    f.write(marca + ";" + name.replace(","," ") + ";" + preuActual + ";"
            + preuAbans + ";" + costos + "\n")

f.close()

【Question discussion】:

    Tags: python html web-scraping tags


    【Solution 1】:
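    # Label: the text of the <strong> tag inside each <li>
    keys = [x.find('strong').text for x in pagina_soup.find_all('li')]
    # Value: the text node that follows the <strong> tag
    values = [x.find('strong').next_sibling.strip() for x in pagina_soup.find_all('li')]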
    print(keys)
    print(values)
    

    Output:

    Out[6]: ['Graphic Type:', 'Resolution:', 'Weight:', 'Color:']
    Out[7]: ['Dedicated Card', '3840 x 2160', '4.40 lbs.', 'Black']
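
    To get from the keys and values above to separate .csv columns, one option is to build a dictionary per product and let `csv.DictWriter` line the columns up. The following is only a sketch under assumptions: `SAMPLE_HTML` is made-up markup modeled on the snippet in the question, `extract_features` and `write_csv` are hypothetical helper names, and an in-memory buffer stands in for the real `ordinadors.csv` file. It has not been run against the live Newegg page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup: two products with partly different features.
SAMPLE_HTML = """
<div class="item-container">
  <ul class="item-features">
    <li><strong>Graphic Type:</strong> Dedicated Card</li>
    <li><strong>Resolution:</strong> 3840 x 2160</li>
    <li><strong>Weight:</strong> 4.40 lbs.</li>
    <li><strong>Color:</strong> Black</li>
  </ul>
</div>
<div class="item-container">
  <ul class="item-features">
    <li><strong>Resolution:</strong> 1920 x 1080</li>
    <li><strong>Color:</strong> Silver</li>
  </ul>
</div>
"""


def extract_features(product):
    """Return {feature name: value} for one product container."""
    features = {}
    for li in product.select("ul.item-features li"):
        label = li.find("strong")
        if label is None:
            continue  # skip list items that have no <strong> label
        key = label.get_text(strip=True).rstrip(":")
        # The value is the plain text that follows the <strong> tag.
        value = "".join(
            str(s) for s in label.next_siblings if isinstance(s, str)
        ).strip()
        features[key] = value
    return features


def write_csv(rows, out):
    """Write one row per product, one column per feature name."""
    # The header is the union of all feature names, in first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            delimiter=";", restval="")
    writer.writeheader()
    writer.writerows(rows)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [extract_features(p)
        for p in soup.find_all("div", class_="item-container")]

# Stand-in for open("ordinadors.csv", "w", newline="")
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

    Products that lack a given feature get an empty cell (`restval=""`), so the columns stay aligned even when the feature lists differ between products. Letting the `csv` module handle the delimiter also takes care of quoting values that contain `;`.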
    

    【Discussion】:
