【Question title】: How to scrape text of span classes that have the same class value?
【Posted】: 2022-01-13 12:15:34
【Question】:

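I want to collect data for my project from cimri.com by web scraping, and I am trying to get the detailed technical specifications of mobile phones. The problem appears when I want specific technical attributes, say, processor model and memory size: as the attached screenshot shows, all technical attributes share the same span class value.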

When I run the following code:

import requests
from bs4 import BeautifulSoup as bts

def getAndParseURL(url):
    # Fetch the page with a browser-like User-Agent and parse the HTML
    result = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = bts(result.text, 'html.parser')
    return soup

html = getAndParseURL("https://www.cimri.com/cep-telefonlari/en-ucuz-oppo-a74-128gb-4gb-ram-6-43-inc-48mp-akilli-cep-telefonu-siyah-fiyatlari,775993409")

# Walk down the nested containers to the value spans
for i in html.findAll("div", {"class": "s10v53f3-0 bfgzQt"}):
    for b in i.findAll("ul", {"class": "s10v53f3-2 goYFek"}):
        for c in b.findAll("li", {"class": "s10v53f3-4 rKbMg"}):
            for d in c.findAll("span", {"class": "s10v53f3-6 geozbR"}):
                print(d)

it gives me all the technical attributes, as follows:

<span class="s10v53f3-6 geozbR">6.43 inç</span>
<span class="s10v53f3-6 geozbR">AMOLED</span>
<span class="s10v53f3-6 geozbR">FHD+</span>
<span class="s10v53f3-6 geozbR">1080x2400 Piksel</span>
<span class="s10v53f3-6 geozbR">84.4 %</span>
<span class="s10v53f3-6 geozbR">409 PPI</span>
<span class="s10v53f3-6 geozbR">Kapasitif Ekran</span>
<span class="s10v53f3-6 geozbR">800</span>
<span class="s10v53f3-6 geozbR">1000000:1</span>
<span class="s10v53f3-6 geozbR">Qualcomm SM6115 Snapdragon 662</span>
<span class="s10v53f3-6 geozbR">2.0 GHz</span>
<span class="s10v53f3-6 geozbR">Adreno 610</span>
<span class="s10v53f3-6 geozbR">4 GB RAM</span>
<span class="s10v53f3-6 geozbR">Android 11</span>
<span class="s10v53f3-6 geozbR">Android</span>
<span class="s10v53f3-6 geozbR">8 Çekirdek</span>
<span class="s10v53f3-6 geozbR">11 nm</span>
<span class="s10v53f3-6 geozbR">64 bit</span>
<span class="s10v53f3-6 geozbR">950 MHz</span>
<span class="s10v53f3-6 geozbR">LPDDR4x</span>
<span class="s10v53f3-6 geozbR">Çift Kanal</span>
<span class="s10v53f3-6 geozbR">48 MP</span>
<span class="s10v53f3-6 geozbR">F2.4</span>
<span class="s10v53f3-6 geozbR">16 MP</span>
<span class="s10v53f3-6 geozbR">F1.7</span>
<span class="s10v53f3-6 geozbR">F2.4</span>
<span class="s10v53f3-6 geozbR">1080p (Full HD)</span>
<span class="s10v53f3-6 geozbR">30 FPS</span>
<span class="s10v53f3-6 geozbR">2 MP</span>
<span class="s10v53f3-6 geozbR">LED</span>
<span class="s10v53f3-6 geozbR">73.8 mm</span>
<span class="s10v53f3-6 geozbR">160.3 mm</span>
<span class="s10v53f3-6 geozbR">8 mm</span>
<span class="s10v53f3-6 geozbR">175 gr</span>
<span class="s10v53f3-6 geozbR">Siyah</span>
<span class="s10v53f3-6 geozbR">USB Type-C</span>
<span class="s10v53f3-6 geozbR">Li-Po</span>
<span class="s10v53f3-6 geozbR">5000 mAh</span>
<span class="s10v53f3-6 geozbR">128 GB</span>
<span class="s10v53f3-6 geozbR">5.0</span>
<span class="s10v53f3-6 geozbR">3.5 mm</span>
<span class="s10v53f3-6 geozbR">Wi-Fi 5</span>
<span class="s10v53f3-6 geozbR">42.2 Mbps</span>
<span class="s10v53f3-6 geozbR">5.76 Mbps</span>
<span class="s10v53f3-6 geozbR">2021</span>
<span class="s10v53f3-6 geozbR">Ekran İçinde</span>
<span class="s10v53f3-6 geozbR">Nano-SIM (4FF)</span>
<span class="s10v53f3-6 geozbR">30</span>
<span class="s10v53f3-6 geozbR">1080p</span>

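I have already collected all the features as a dict, but when I look across all phone brands and models, each brand and model has a different number of features. To create a DataFrame, every brand and model must have the same columns, so I decided to keep only some of these features in the DataFrame.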

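【Question comments】:

  • Maybe you should group the data inside the for-loops, i.e. collect all values within one iteration of for b or one iteration of for c, and then use an index to pick a value from that group.
  • Better show minimal working code with the real URL, so we can simply copy and run it. And show what you want to get.
  • Or you should write more complex code: get all li and check whether each one has a span. If it has no span, it is a header/title that you can use as a key in a dictionary, keeping the spans of the following li as the values of that group. Or you can use this title to identify the section.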
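
The header/title idea from the last comment can be sketched roughly as follows. This is only an illustration: SAMPLE_HTML is hypothetical markup that mimics the described structure (an li without a span is a title, an li with two spans is a name/value row), and group_specs is a made-up helper name. The real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not copied from cimri.com):
# an <li> without a <span> is a section title, an <li> with two <span>s is a row.
SAMPLE_HTML = """
<ul>
  <li>Ekran Özellikleri</li>
  <li><span>Ekran Boyutu</span><span>6.43 inç</span></li>
  <li><span>Ekran Teknolojisi</span><span>AMOLED</span></li>
  <li>Teknik Özellikler</li>
  <li><span>İşlemci Modeli</span><span>Qualcomm SM6115 Snapdragon 662</span></li>
  <li><span>RAM Kapasitesi</span><span>4 GB RAM</span></li>
</ul>
"""


def group_specs(soup):
    """Group name/value rows under the most recent title <li>."""
    groups = {}
    current = None
    for li in soup.find_all("li"):
        spans = li.find_all("span")
        if not spans:
            # No <span>: treat this <li> as a section title.
            current = li.get_text(strip=True)
            groups[current] = {}
        elif current is not None and len(spans) >= 2:
            # First span is the attribute name, second span is its value.
            name = spans[0].get_text(strip=True)
            value = spans[1].get_text(strip=True)
            groups[current][name] = value
    return groups


grouped = group_specs(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(grouped["Teknik Özellikler"]["RAM Kapasitesi"])  # 4 GB RAM
```

Keeping the title as the outer key means two sections can reuse the same attribute name without overwriting each other.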

Tags: python pandas selenium web-scraping beautifulsoup


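【Solution 1】:

Note: your question needs to be clearer to get a specific answer. So I will just show two options that address your comment and should help you get closer. They are based on the product that was available at the time of the request.

I want specific technical attributes, say, processor model and memory size.

Option #1

Simply select the span that contains your attribute name and take the text of its immediate sibling: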

# soup is the parsed page, e.g. the object returned by getAndParseURL(url)
processorModel = soup.select_one('span:-soup-contains("İşlemci Modeli") + span').text
# --> Apple A13 Bionic

memorySize = soup.select_one('span:-soup-contains("RAM Kapasitesi") + span').text
# --> 3 GB RAM
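
Since not every phone lists every attribute, select_one may return None, and calling .text on it would then raise an AttributeError. The sketch below wraps Option #1 in a small helper that returns None instead. SAMPLE_HTML is hypothetical markup and get_spec is an illustrative name, not part of any library. The :-soup-contains selector needs a reasonably recent soupsieve.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption): each label <span> is immediately
# followed by a value <span>, as in Option #1.
SAMPLE_HTML = """
<ul>
  <li><span>İşlemci Modeli</span><span>Apple A13 Bionic</span></li>
  <li><span>RAM Kapasitesi</span><span>3 GB RAM</span></li>
</ul>
"""


def get_spec(soup, label):
    """Return the text of the span right after the span containing `label`, or None."""
    # Note: `label` is pasted into the selector, so it must not contain double quotes.
    node = soup.select_one(f'span:-soup-contains("{label}") + span')
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(get_spec(soup, "İşlemci Modeli"))  # Apple A13 Bionic
print(get_spec(soup, "RAM Kapasitesi"))  # 3 GB RAM
print(get_spec(soup, "Antutu Puanı"))    # None (attribute missing from this sample)
```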

Option #2

Create a dictionary from the structured information, then iterate over it to pick your attributes:

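specs = {}
for x in soup.select('[name="specs"] ul'):
    # The first <li> of each <ul> is the section title; every <li> with a <span> is a name/value row
    rows = (list(s.stripped_strings) for s in x.select('li:has(span)'))
    specs[x.li.text] = {row[0]: row[1] for row in rows}
specs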

-->

{'Model Bilgisi': {'Iphone Modelleri': 'Iphone SE'},
 'Ekran Özellikleri': {'Ekran Boyutu': '4.7 inç',
  'Ekran Teknolojisi': 'IPS LCD',
  'Çözünürlük Standartı': 'HD+',
  'Ekran Çözünürlüğü': '750x1334 Piksel',
  'Ekran Gövde Oranı': '65.4 %',
  'Piksel Yoğunluğu': '326 PPI',
  'Multi Touch': 'Var',
  'Dokunmatik Türü': 'Kapasitif Ekran',
  'Ekran Parlaklığı (cd/m²)': '625',
  'Çizilmeye Karşı Dayanıklılık': 'Var',
  'Ekran Kontrast Oranı': '1400:1',
  'Sürekli Açık Ekran': 'Yok'},
 'Teknik Özellikler': {'İşlemci Modeli': 'Apple A13 Bionic',
  'İşlemci Frekansı': '2.65 GHz',
  'Grafik İşlemci (GPU)': 'Apple GPU',
  'RAM Kapasitesi': '3 GB RAM',
  'Antutu Puanı': 'Belirtilmemiş',
  'İşletim Sistemi Versiyonu': 'iOS 13',
  'İşletim Sistemi': 'iOS',
  'CPU Üretim Süreci': '7 nm+'},...}
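
To get the same columns for every phone from dictionaries shaped like the output above, one possible approach is to flatten the section level and keep only a fixed list of keys. The sketch below is an assumption about how that could look: phones holds hypothetical, truncated sample data, and WANTED and to_row are illustrative names.

```python
# Hypothetical per-phone dicts shaped like the `specs` output above (truncated).
phones = {
    "Iphone SE": {
        "Teknik Özellikler": {
            "İşlemci Modeli": "Apple A13 Bionic",
            "RAM Kapasitesi": "3 GB RAM",
            "Antutu Puanı": "Belirtilmemiş",
        },
    },
    "Oppo A74": {
        "Teknik Özellikler": {
            "İşlemci Modeli": "Qualcomm SM6115 Snapdragon 662",
            "RAM Kapasitesi": "4 GB RAM",
        },
    },
}

# The columns we decide to keep for every phone.
WANTED = ["İşlemci Modeli", "RAM Kapasitesi", "Antutu Puanı"]


def to_row(model, specs, wanted=WANTED):
    """Flatten the section level and keep only `wanted` keys; missing ones become None."""
    flat = {}
    for section in specs.values():
        # If two sections share a key, the later section wins.
        flat.update(section)
    row = {"Model": model}
    for key in wanted:
        row[key] = flat.get(key)
    return row


rows = [to_row(model, specs) for model, specs in phones.items()]
for row in rows:
    print(row)
```

Because every row has the same keys, the list can be passed directly to pandas.DataFrame(rows), and attributes a phone does not list show up as missing values.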

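【Comments】:

  • First of all, thanks for your solution; Option #1 is exactly what I was looking for. I have already collected all the features as a dict, but when I look across all phone brands and models, each one has a different number of features, and to create a DataFrame every brand and model must have the same columns. So I decided to keep only some of these features in the DataFrame, and Option #1 works fine for me :)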