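Scraping text using url from list (BeautifulSoup4): answers

【Question title】: Scraping text using url from list (BeautifulSoup4)
【Posted】: 2013-01-08 20:39:36
【Question】:

I already have a working text scraper for a single URL. The problem is that I need to scrape 25 more URLs. They are almost identical; the only difference is the last letter. Here is the code, to make it clearer: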

import urllib2
from bs4 import BeautifulSoup

f = open('(path to file)/names', 'a')
links = ['http://guardsmanbob.com/media/playlist.php?char='+ chr(i) for i in range(97,123)]

response = urllib2.urlopen(links[0]).read()
soup = BeautifulSoup(response)

for tr in soup.findAll('tr'):
    if not tr.find('td'): continue
    for td in tr.find('td').findAll('a'):
        f.write(td.contents[0] + '\n')

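I can't make this script run through all the URLs from the list in one go. All I managed to get is the first song name from each URL. Sorry for my English. I hope you understand me.

【Question comments】:

Tags: python python-2.7 beautifulsoup urllib2

【Answer 1】: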

I can't make this script to run all urls from list in one time.

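Put your code in a function that takes one parameter, *args (or whatever name you like, just don't forget the *). In the function definition, * collects all positional arguments into a tuple. To pass a list as separate arguments, put * in front of it at the call site as well, which unpacks the list. The * has no official name, but some people (including me) like to call it the splat operator.

import urllib2

def start_download(*args):
    # *args holds every positional argument as a tuple
    for value in args:
        ## for debugging purposes
        ## print value

        response = urllib2.urlopen(value).read()
        ## put the rest of your code here

if __name__ == '__main__':
    links = ['http://guardsmanbob.com/media/playlist.php?char=' +
             chr(i) for i in range(97, 123)]

    # unpack the list so that each URL is passed as its own argument
    start_download(*links)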
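
To see why the * matters at the call site, here is a small sketch that runs without network access. `collect` is a hypothetical stand-in for `start_download`: it only returns what it received, which makes the difference between passing the list itself and unpacking it visible. Passing the list as a single argument hands `urllib2.urlopen` a list instead of a URL string, which is the likely cause of the `AttributeError: 'list' object has no attribute 'timeout'` reported in the discussion.

```python
def collect(*args):
    """Hypothetical helper: return the positional arguments as a list."""
    return list(args)

# Same list comprehension as in the answer: one URL per letter a-z.
links = ['http://guardsmanbob.com/media/playlist.php?char=' + chr(i)
         for i in range(97, 123)]

# Without *, the whole list arrives as ONE argument.
as_single_argument = collect(links)

# With *, the list is unpacked and each URL arrives as its own argument.
as_separate_arguments = collect(*links)

print(len(as_single_argument))     # 1
print(len(as_separate_arguments))  # 26
```

In the first call the loop inside the function would see the list itself as `value`; in the second it sees one URL string per iteration.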

Edit: Or you can simply iterate over your list of links and download each one.

links = ['http://guardsmanbob.com/media/playlist.php?char=' +
         chr(i) for i in range(97, 123)]
for link in links:
    response = urllib2.urlopen(link).read()
    ## put the rest of your code here
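
As a sketch of what "the rest of your code" might look like, the example below moves the parsing inside the loop so that every page is processed, not just `links[0]`. To keep it runnable offline, `fetch` is a hypothetical stand-in for `urllib2.urlopen(link).read()`, and `SAMPLE_HTML` is made-up markup; the real page structure is an assumption. The `html.parser` backend is named explicitly here, which the original code does not do.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for one downloaded playlist page (assumption).
SAMPLE_HTML = """
<table>
  <tr><th>Song</th><th>Action</th></tr>
  <tr><td><a href="#">Song One</a></td><td><a href="#">Play</a></td></tr>
  <tr><td><a href="#">Song Two</a></td><td><a href="#">Play</a></td></tr>
</table>
"""

def fetch(link):
    """Hypothetical replacement for urllib2.urlopen(link).read()."""
    return SAMPLE_HTML

def titles_from(html):
    """Return the link texts found in the first <td> of each table row."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for tr in soup.find_all("tr"):
        td = tr.find("td")
        if td is None:  # header rows contain <th>, not <td>
            continue
        for a in td.find_all("a"):
            titles.append(a.get_text())
    return titles

# Only a, b and c for the demonstration.
links = ['http://guardsmanbob.com/media/playlist.php?char=' + chr(i)
         for i in range(97, 100)]

all_titles = []
for link in links:
    # Fetching AND parsing both happen once per link.
    all_titles.extend(titles_from(fetch(link)))

print(all_titles)
```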

Edit 2:

To get all the link texts and then save them in a file, here is the entire code with specific comments:

import urllib2
from bs4 import BeautifulSoup, SoupStrainer

links = ['http://guardsmanbob.com/media/playlist.php?char='+ 
          chr(i) for i in range(97,123)]

for link in links:
    response = urllib2.urlopen(link).read()
    ## gets all <a> tags
    soup = BeautifulSoup(response, parse_only=SoupStrainer('a'))
    ## unnecessary link texts to be removed
    not_included = ['News', 'FAQ', 'Stream', 'Chat', 'Media',
                    'League of Legends', 'Forum', 'Latest', 'Wallpapers',
                    'Links', 'Playlist', 'Sessions', 'BobRadio', 'All',
                    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
                    'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T',
                    'U', 'V', 'W', 'X', 'Y', 'Z', 'Misc', 'Play',
                    'Learn more about me', 'Chat info', 'Boblights',
                    'Music Playlist', 'Official Facebook',
                    'Latest Music Played', 'Muppets - Closing Theme',
                    'Billy Joel - The River Of Dreams',
                    'Manic Street Preachers - If You Tolerate This Your Children Will Be Next',
                    'The Bravery - An Honest Mistake',
                    'The Black Keys - Strange Times',
                    'View whole playlist', 'View latest sessions',
                    'Referral Link', 'Donate to BoB',
                    'Guardsman Bob', 'Website template',
                    'Arcsin']

    ## create a file named "test.txt"
    ## write to file and close afterwards
    with open("test.txt", 'w') as output:
        for hyperlink in soup:
            if hyperlink.text:
                if hyperlink.text not in not_included:
                    ## print hyperlink.text
                    output.write("%s\n" % hyperlink.text.encode('utf-8'))

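This is the output saved in test.txt:

I suggest you change test.txt to a different filename on each pass through the list of links (for example, S song titles), because otherwise each iteration overwrites the file written by the previous one.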
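
One way to follow that advice is to derive the file name from the letter at the end of each URL. The sketch below assumes a hypothetical naming scheme (`<LETTER>_songs.txt`) and hypothetical helpers `filename_for` and `save_titles`; the titles are placeholders, not real scraped data, and the files go into a temporary directory so the example can run anywhere.

```python
import os
import tempfile

def filename_for(link):
    """Build a per-letter file name from the last character of the URL."""
    letter = link[-1]
    return "%s_songs.txt" % letter.upper()

def save_titles(directory, link, titles):
    """Write one title per line to the file that belongs to `link`."""
    path = os.path.join(directory, filename_for(link))
    with open(path, "w") as output:
        for title in titles:
            output.write("%s\n" % title)
    return path

links = ['http://guardsmanbob.com/media/playlist.php?char=' + chr(i)
         for i in range(97, 123)]

# Placeholder titles stand in for whatever the scraper extracted.
out_dir = tempfile.mkdtemp()
written = [save_titles(out_dir, link, ["placeholder title for " + link[-1]])
           for link in links]

print(len(set(written)))  # 26 distinct files, so nothing is overwritten
```

Alternatively, opening a single file in append mode ('a'), as the code in the question does, keeps all titles in one file without overwriting earlier ones.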

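【Discussion】:

With the first method I get this: AttributeError: 'list' object has no attribute 'timeout'. With the second method I only get the first song name from each URL. How can I fix this?
OK, so you can iterate over your list of links with the second method. And then you want to get all the song names from each URL, right?
Then I assume you want to save the links in a file? OK, I'll edit my answer.
I don't need the links. All I need is to get the song names and put them in a file. With the second method I can do that, but I only get the first song, not all of them.
Oops, I meant that you want to save the link texts (the song titles) in a file. :)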