Looping through a list of URLs for web scraping with BeautifulSoup: answers

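【Question title】: Looping through a list of urls for web scraping with BeautifulSoup
【Posted】: 2016-03-01 20:09:18
【Question】:

I want to extract some information from a website whose URLs have the form http://www.pedigreequery.com/american+pharoah, where "american+pharoah" is the extension for one of many horse names. I have a list of the horse names I want to search for; I just need to figure out how to plug each name in after "http://www.pedigreequery.com/".

This is what I have so far: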

import csv
allhorses = csv.reader(open('HORSES.csv') )
rows=list(allhorses)

import requests 
from bs4 import BeautifulSoup
for i in rows:      # Number of pages plus one 
    url = "http://www.pedigreequery.com/".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    letters = soup.find_all("a", class_="horseName")
    print(letters)

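When I print out the URL, it doesn't have the horse's name at the end, just the URL in quotes. The letters/print statement at the end is only there to check that the request is actually reaching the website. This is how I have seen it done for looping over URLs that change by a number at the end; I haven't found any advice on URLs that change by characters.

Thanks!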

【Comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

You are missing the placeholder in your format string, so change the format to:

url = "http://www.pedigreequery.com/{}".format(i)
                                     ^
                                   #add placeholder
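
A quick sketch of why the original URL never changed: `str.format` silently ignores extra positional arguments when the string contains no placeholder. The horse name below is only an illustrative value.

```python
base = "http://www.pedigreequery.com/"
horse = "american+pharoah"  # illustrative value, not read from HORSES.csv

# No "{}" in the string, so the argument is ignored and the URL is unchanged.
no_placeholder = base.format(horse)

# With "{}" appended, the name is substituted at the end of the URL.
with_placeholder = (base + "{}").format(horse)

print(no_placeholder)
print(with_placeholder)
```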

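Also, the most you will get from rows=list(allhorses) is a list of lists, so you would be passing a list rather than a string/horse name. If there is one horse per line, just open the file normally and iterate over the file object, stripping the newline from each line.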
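
To make the list-of-lists point concrete, here is a small sketch that uses an in-memory stand-in for HORSES.csv (the two names are made-up sample data). Formatting a whole row into the URL inserts the list's string representation, not the name.

```python
import csv
import io

# Stand-in for HORSES.csv with one name per line (sample data, not the real file).
sample = io.StringIO("american+pharoah\nsecretariat\n")

rows = list(csv.reader(sample))
print(rows)  # every row is itself a list, even with a single column

# Passing the row (a list) formats its repr into the URL.
bad_url = "http://www.pedigreequery.com/{}".format(rows[0])
# Passing the first field (a string) gives the intended URL.
good_url = "http://www.pedigreequery.com/{}".format(rows[0][0])

print(bad_url)
print(good_url)
```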

Assuming one horse name per line, the whole working code would be:

import requests
from bs4 import BeautifulSoup

with open("HORSES.csv") as f:
    for horse in map(str.strip, f):      # strip the trailing newline from each line
        url = "http://www.pedigreequery.com/{}".format(horse)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
        letters = soup.find_all("a", class_="horseName")
        print(letters)

If you have multiple horses per line, you can use the csv library, but you will need an inner loop:

import csv
import requests
from bs4 import BeautifulSoup

with open("HORSES.csv") as f:
    for row in csv.reader(f):
        # each row is a list of fields, so loop over the names in it
        for horse in row:
            url = "http://www.pedigreequery.com/{}".format(horse)
            r = requests.get(url)
            soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly
            letters = soup.find_all("a", class_="horseName")
            print(letters)

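Finally, if the names are not stored in the right form, you have a few options, the simplest of which is to split each name and build the query manually: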

  url = "http://www.pedigreequery.com/{}".format("+".join(horse.split()))
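
The split-and-join idea can be wrapped in a small helper. This is only a sketch: `build_url` and `build_url_quoted` are hypothetical names, and the second variant assumes the site accepts percent-encoded names, which has not been verified here.

```python
from urllib.parse import quote_plus

BASE_URL = "http://www.pedigreequery.com/"

def build_url(name):
    """Collapse any run of whitespace in a horse name into a single '+'."""
    return BASE_URL + "+".join(name.split())

def build_url_quoted(name):
    """Alternative: normalize whitespace, then let quote_plus encode the name.

    quote_plus would turn a literal '+' already in the name into '%2B',
    so use this only on raw names that contain spaces.
    """
    return BASE_URL + quote_plus(" ".join(name.split()))

print(build_url("  american   pharoah "))
print(build_url_quoted("american pharoah"))
```

Both helpers leave the capitalization of the name untouched.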

【Discussion】:

Yes, you were right about the list of lists, and that fixed the problem! You were right about the URL too; it works perfectly now. Thank you!