【Question title】: Is there a better way to use BeautifulSoup?
【Posted】: 2019-08-08 14:20:09
【Question】:

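I'm trying to scrape another French website. My script works well, but it looks ugly, and I think there must be a better way to implement it and scrape what I want.

Currently I treat "item" as a list and pick out each element I need by index. I'd like to know whether it's possible to parse the selected element like this instead: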


for item in soup.select('.search-list-item'):
    link = item.select_one('div.col-right > a')
    if link and '/annonces/' in link.get('href', ''):
        print("Ok, my code isn't beautiful, but it's better :D")

With code like this, I think it would be easier for other developers to understand what I'm trying to do.

Here is my actual script:


import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.pap.fr/annonce/vente-maisons-nantes-44-g43619-jusqu-a-900000-euros'
headers = {
    'User-Agent': '*',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
    }

s = requests.Session()
s.headers.update(headers)

r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
a = []
for item in soup.select('.search-list-item'):
    dict = {}
    try: 
        if '/annonces/' in item.contents[3].contents[3].attrs['href']:
            dict['id'] = int(item.contents[3].contents[3].attrs['name'])
            dict['url'] = "https://www.pap.fr"+item.contents[3].contents[3].attrs['href']
            dict['name'] = item.contents[3].contents[3].contents[1].contents[0]
            dict['pieces'] = int(''.join(filter(str.isdigit, (item.contents[3].contents[3].contents[3].contents[1].contents[0]))))
            dict['chambres'] = int(''.join(filter(str.isdigit, (item.contents[3].contents[3].contents[3].contents[3].contents[0]))))
            dict['superficie'] = int(''.join(filter(str.isdigit, (item.contents[3].contents[3].contents[3].contents[5].contents[0]))))
            dict['price']= int(''.join(filter(str.isdigit, (item.contents[3].contents[3].contents[5].contents[0]))))
            dict['picture']=item.contents[1].contents[1].contents[1].attrs['src']
        if dict:
            a.append(dict)
    except KeyError:
        pass

print(json.dumps(a, indent=4))

Finally, there's a small formatting problem in my JSON: "nbsp;" shows up, which I think is just a space inside a span.
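The stray "nbsp;" is most likely a non-breaking space (U+00A0): the text extracted from the page contains the character '\xa0', and json.dumps escapes it as \u00a0 by default. A minimal sketch of two ways to deal with it, assuming a made-up title string rather than the real page content:

```python
import json

# Hypothetical title text, shaped like what get_text() returns for a listing:
# the space before "m²" is a non-breaking space (U+00A0).
raw = 'Vente maison 172\xa0m² Nantes'

# Option 1: swap the non-breaking space for a regular space.
clean = raw.replace('\xa0', ' ')

# By default json.dumps escapes every non-ASCII character.
escaped = json.dumps({'name': raw})

# Option 2: ensure_ascii=False keeps characters such as "²" readable.
readable = json.dumps({'name': clean}, ensure_ascii=False)

print(escaped)
print(readable)
```

Either option only changes how the text is displayed; the parsed numbers are unaffected.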

Thanks a lot.

【Question comments】:

Tags: json python-3.x web-scraping beautifulsoup


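【Solution 1】:

You can use the built-in zip() function to "tie" the page elements together. If I observed correctly, almost all of the elements are under the <a> tag that has a name= attribute. The exception is the picture, for which I use the .find_previous() method.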

import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.pap.fr/annonce/vente-maisons-nantes-44-g43619-jusqu-a-900000-euros'
headers = {
    'User-Agent': '*',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
    }

s = requests.Session()
s.headers.update(headers)

r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
a = []
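# zip() pairs the n-th match of each selector; this assumes every selector
# yields exactly one element per listing, in the same order.
for _id, name, tags, price in zip(soup.select('a[name][href]'),
                                  soup.select('a[name] .h1'),
                                  soup.select('a[name] .item-tags'),
                                  soup.select('a[name] .item-price')):
    name = name.get_text(strip=True)
    url = _id['href']
    # First run of digits in each <li>: rooms, bedrooms, surface area.
    pieces, chambres, superficie = map(lambda k: int(''.join(re.findall(r'\d+', k)[0])), [li.get_text(strip=True) for li in tags.select('li')])
    # Join every digit run so that separators inside the price are dropped.
    price = int( ''.join( re.findall(r'\d+', price.get_text(strip=True)) ))
    # The thumbnail is not inside the <a> tag, so take the closest preceding <img>.
    picture = _id.find_previous('img')['src']
    _id = _id['name']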

    a.append({'id':_id, 'url':url, 'name':name,
              'pieces':pieces, 'chambres':chambres, 'superficie':superficie,
              'price':price, 'picture':picture})

print(json.dumps(a, indent=4))

Prints:

[
    {
        "id": "427500904",
        "url": "/annonces/maison-nantes-r427500904",
        "name": "Vente maison 172\u00a0m\u00b2 Nantes",
        "pieces": 6,
        "chambres": 4,
        "superficie": 172,
        "price": 650000,
        "picture": "https://static.pap.fr/photos/C75/C75A0904.thumb.jpg"
    },
    {
        "id": "427700568",
        "url": "/annonces/maison-saint-sebastien-sur-loire-r427700568",
        "name": "Vente maison 212\u00a0m\u00b2 Saint-S\u00e9bastien-Sur-Loire",
        "pieces": 6,
        "chambres": 4,
        "superficie": 212,
        "price": 507000,
        "picture": "https://static.pap.fr/photos/C77/C77A0568.thumb.jpg"
    },
    {
        "id": "423000083",
        "url": "/annonces/maison-saint-herblain-44800-r423000083",
        "name": "Vente maison 92\u00a0m\u00b2 Saint-Herblain (44800)",
        "pieces": 4,
        "chambres": 3,
        "superficie": 92,
        "price": 254900,
        "picture": "https://static.pap.fr/photos/C30/C30A0083.thumb.jpg"
    },
    {
        "id": "426801502",
        "url": "/annonces/maison-saint-herblain-r426801502",
        "name": "Vente maison 117\u00a0m\u00b2 Saint-Herblain",
        "pieces": 5,
        "chambres": 4,
        "superficie": 117,
        "price": 359800,
        "picture": "https://static.pap.fr/photos/C68/C68A1502.thumb.jpg"
    },
    {
        "id": "427500274",
        "url": "/annonces/maison-orvault-44700-r427500274",
        "name": "Vente maison 170\u00a0m\u00b2 Orvault (44700)",
        "pieces": 6,
        "chambres": 4,
        "superficie": 170,
        "price": 453000,
        "picture": "https://static.pap.fr/photos/C75/C75A0274.thumb.jpg"
    },
    {
        "id": "427600879",
        "url": "/annonces/maison-orvault-44700-r427600879",
        "name": "Vente maison 155\u00a0m\u00b2 Orvault (44700)",
        "pieces": 9,
        "chambres": 4,
        "superficie": 155,
        "price": 425000,
        "picture": "https://static.pap.fr/photos/C76/C76A0879.thumb.jpg"
    },
    {
        "id": "427800917",
        "url": "/annonces/maison-orvault-44700-r427800917",
        "name": "Vente maison 132\u00a0m\u00b2 Orvault (44700)",
        "pieces": 6,
        "chambres": 4,
        "superficie": 132,
        "price": 445000,
        "picture": "https://static.pap.fr/photos/C78/C78A0917.thumb.jpg"
    },
    {
        "id": "427101281",
        "url": "/annonces/maison-vertou-r427101281",
        "name": "Vente maison 207\u00a0m\u00b2 Vertou",
        "pieces": 7,
        "chambres": 4,
        "superficie": 207,
        "price": 530000,
        "picture": "https://static.pap.fr/photos/C71/C71A1281.thumb.jpg"
    },
    {
        "id": "425701850",
        "url": "/annonces/maison-saint-aignan-grandlieu-44860-r425701850",
        "name": "Vente maison 172\u00a0m\u00b2 Saint-Aignan-Grandlieu (44860)",
        "pieces": 7,
        "chambres": 4,
        "superficie": 172,
        "price": 480000,
        "picture": "https://static.pap.fr/photos/C57/C57A1850.thumb.jpg"
    },
    {
        "id": "427101024",
        "url": "/annonces/maison-suce-sur-erdre-44240-r427101024",
        "name": "Vente maison 198\u00a0m\u00b2 Suce-Sur-Erdre (44240)",
        "pieces": 9,
        "chambres": 4,
        "superficie": 198,
        "price": 450000,
        "picture": "https://static.pap.fr/photos/C71/C71A1024.thumb.jpg"
    }
]
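One caveat with zip(): it pairs matches purely by position, so if one selector returns fewer elements than the others (for example, a listing without a price), every later record can end up misaligned. An alternative is to parse each .search-list-item on its own with select_one(). The following is only a sketch under assumptions: the HTML fragment is invented, the class names are borrowed from the selectors used above, and the helper names digits and parse_item are hypothetical. The real page structure may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: class names follow the selectors used above, but the
# real page structure may differ.
html = '''
<div class="search-list-item">
  <div class="col-left"><img src="https://example.com/a.jpg"></div>
  <div class="col-right">
    <a name="1" href="/annonces/maison-a-r1">
      <span class="h1">Vente maison 100&nbsp;m² Nantes</span>
      <ul class="item-tags"><li>5 pièces</li><li>3 chambres</li><li>100 m²</li></ul>
      <span class="item-price">300.000 €</span>
    </a>
  </div>
</div>
<div class="search-list-item">
  <div class="col-right"><a href="/promo">Advert without listing data</a></div>
</div>
'''

def digits(text):
    """Join every run of digits in text into one int, or return None."""
    found = ''.join(re.findall(r'\d+', text))
    return int(found) if found else None

def parse_item(item):
    """Build one record from a single listing, or return None for non-listings."""
    link = item.select_one('div.col-right > a[name][href]')
    if link is None or '/annonces/' not in link['href']:
        return None
    tags = [digits(li.get_text()) for li in link.select('.item-tags li')]
    tags += [None] * (3 - len(tags))  # pad when a tag is missing
    title = link.select_one('.h1')
    price = link.select_one('.item-price')
    img = item.select_one('img')
    return {
        'id': link['name'],
        'url': link['href'],
        'name': title.get_text(strip=True).replace('\xa0', ' ') if title else None,
        'pieces': tags[0], 'chambres': tags[1], 'superficie': tags[2],
        'price': digits(price.get_text()) if price else None,
        'picture': img['src'] if img else None,
    }

soup = BeautifulSoup(html, 'html.parser')
records = [r for r in map(parse_item, soup.select('.search-list-item')) if r]
print(records)
```

Because every field is looked up inside its own listing, a missing element yields None for that field instead of shifting values between records.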

【Comments】:

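  • Thanks again, Andrej. I had already tried your zip approach :), but I failed... The problem was caused by an empty array, so "picture": "url" ended up in the wrong place. I'll try to understand this black magic: "pieces, chambres, superficie = map(lambda k: int(''.join(re.findall(r'\d+', k)[0])), [li.get_text(strip=True) for li in tags.select('li')])" and "price = int( ''.join( re.findall(r'\d+', price.get_text(strip=True)) ))". Thanks for your work.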
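
The two "black magic" lines can be unpacked into smaller steps. A minimal sketch, assuming made-up tag and price strings shaped like the ones on the page; the helper name first_number is hypothetical:

```python
import re

# Hypothetical tag texts, shaped like the <li> entries under .item-tags.
tag_texts = ['6 pièces', '4 chambres', '172 m²']

def first_number(text):
    """Return the first run of digits in text as an int."""
    return int(re.findall(r'\d+', text)[0])

# Equivalent to the map(lambda ...) call: one int per tag, then unpacked
# into three names in order.
pieces, chambres, superficie = [first_number(t) for t in tag_texts]

# Prices contain separators, so all digit runs are joined before converting.
price_text = '650.000 €'
price = int(''.join(re.findall(r'\d+', price_text)))

print(pieces, chambres, superficie, price)  # prints: 6 4 172 650000
```

Note that the unpacking into three names raises ValueError when a listing has more or fewer than three tags, which is one way the original one-liner can fail on incomplete listings.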