Scraping a list of urls: answers

【Question title】: Scraping a list of urls
【Posted】: 2017-07-07 01:24:35
【Question description】:

I am using Python 3.5 and trying to scrape a list of urls (all from the same website). The code is as follows:

import urllib.request
from bs4 import BeautifulSoup



url_list = ['URL1',
            'URL2', 'URL3']

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

# Scraping
def getPropNames():
    for propName in soup.findAll('div', class_="property-cta"):
        for h1 in propName.findAll('h1'):
            print(h1.text)

def getPrice():
    for price in soup.findAll('p', class_="room-price"):
        print(price.text)

def getRoom():
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)


for soups in soup():
    getPropNames()
    getPrice()
    getRoom()

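So far, if I print soup, getPropNames, getPrice or getRoom on their own, they all seem to work. But I can't get the code to go through each url and print getPropNames, getPrice and getRoom.

I have only been learning Python for a few months, so any help would be greatly appreciated!

【Question comments】:

Tags: python web-scraping urllib bs4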

【Solution 1】:

Think about what this code does:

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

Let me show you an example:

def soup2():
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            return maker

With url_list = ['one', 'two', 'three'], the output is:

one
one a

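Do you see it now? What is actually going on?

Basically, your soup function exits at the first return. It does not return an iterator or a list, only the first BeautifulSoup object, and you are lucky (or unlucky) that this object happens to be iterable :)

So change the code: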

def soup3():
    soups = []
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            soups.append(maker)
    return soups

Then the output is:

one
one a
one b
one c
two
two a
two b
two c
three
three a
three b
three c

But I believe this will not work either :) Check what sauce actually is after sauce = urllib.request.urlopen(url), and what your code is really iterating over in for things in sauce, that is, what things is.
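To make that hint concrete: the object returned by urlopen is a binary stream, and iterating over such a stream yields one line of bytes at a time. The sketch below uses io.BytesIO with made-up HTML as a stand-in for the real response (an assumption, so that it runs without network access), so treat it as an illustration rather than the exact behaviour of any particular site.

```python
import io

# Stand-in for the object returned by urllib.request.urlopen(url).
# Like that response, io.BytesIO is a binary stream, and iterating
# over it yields one line of bytes per step.
fake_response = io.BytesIO(b"<html>\n<h1>Title</h1>\n</html>\n")

lines = [line for line in fake_response]
print(lines)

# Reading the whole body in one call gives the complete document instead.
fake_response.seek(0)
whole_document = fake_response.read()
print(whole_document)
```

So in for things in sauce, each things is a single line of the page, and BeautifulSoup(things, 'html.parser') only ever sees that one line.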

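Happy coding.

【Comments】:

Thanks Sebastian Opałczyński, I'll take this on board, try to understand it, and let you know how it goes!

【Solution 2】:

Each of the get* functions uses a global variable soup that is not set correctly anywhere. Even if it were, this would not be a good approach. Make soup a function parameter instead, e.g.: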

def getRoom(soup):
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)

Secondly, soup() should use yield instead of return, which turns it into a generator. Otherwise you would need to return a list of BeautifulSoup objects.

def soups():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            yield soup_maker
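Note that the inner loop for things in sauce passes the response to BeautifulSoup one line at a time, so each yielded object covers a single line of the page. A variant that yields one soup per url, built from the whole page, might look like the sketch below. The opener parameter is hypothetical and is not part of the original code; it is only there so the function can be tried without network access.

```python
import urllib.request
from bs4 import BeautifulSoup


def soups(url_list, opener=urllib.request.urlopen):
    """Yield one BeautifulSoup object per url, built from the full page."""
    for url in url_list:
        # opener is a hypothetical parameter; by default it is urlopen.
        with opener(url) as response:
            # read() returns the whole body, so the soup covers the
            # entire page rather than a single line of it.
            html = response.read()
        yield BeautifulSoup(html, 'html.parser')
```

With this version, for soup in soups(url_list): hands each get* function a complete page.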

I would also suggest using XPath or CSS selectors to extract the HTML elements: https://stackoverflow.com/a/11466033/2997179.
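As a rough illustration of the CSS selector approach, BeautifulSoup's select() method accepts a CSS selector string. The HTML below is made up and only reuses the class names from the question, so the real pages may need different selectors.

```python
from bs4 import BeautifulSoup

# Made-up HTML that reuses the class names from the question.
html = """
<div class="property-cta"><h1>Example Property</h1></div>
<p class="room-price">100</p>
<div class="featured-item-inner"><h5>Single Room</h5></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags.
prop_names = [h1.text for h1 in soup.select('div.property-cta h1')]
prices = [p.text for p in soup.select('p.room-price')]
rooms = [h5.text for h5 in soup.select('div.featured-item-inner h5')]

print(prop_names, prices, rooms)
```

Each selector replaces one pair of nested findAll loops from the original functions.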

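【Comments】:

Thanks Martin Valgur, this is insightful. I'll look into XPath/CSS. When applying your suggestion I get the following error message: AttributeError: 'function' object has no attribute 'findAll' - any ideas?
Did you add the soup parameter to all of the functions? I would also suggest renaming the soup() function to soups().
Thanks, that was my mistake! However, it only seems to work for getPrice. The other 2 don't return anything? Oddly, when I first wrote these functions I used 1 url and they all worked fine.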