I need help web-scraping

【Question title】: I need help web-scraping
【Posted】: 2012-07-25 18:56:32
【Question】:

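I want to scrape the visualizations from visual.ly, but I don't understand how the "Show more" button works. So far my code gets the image link, the text next to the image, and the link to the page. I'd like to know how the "Show more" button works, because I was planning to loop over the page numbers, but right now I don't know how to iterate over each page individually. Any ideas on how to loop and keep fetching more images than the ones shown initially?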

from BeautifulSoup import BeautifulSoup
import urllib2  
import HTMLParser
import urllib, re

counter = 1
columnno = 1
parser = HTMLParser.HTMLParser()

soup = BeautifulSoup(urllib2.urlopen('http://visual.ly/?view=explore&type=static#v2_filter').read())

image = soup.findAll("div", attrs = {'class': 'view-mode-wrapper'})

if columnno < 4:
    column = image[0].findAll("div", attrs = {'class': 'v2_grid_column'})
    columnno += 1
else:
    column = image[0].findAll("div", attrs = {'class': 'v2_grid_column last'})

visualizations = column[0].findAll("div", attrs = {'class': '0 v2_grid_item viewmode-item'})

getImage = visualizations[0].find("a")

print counter

print getImage['href']

soup1 = BeautifulSoup(urllib2.urlopen(getImage['href']).read())

theImage = soup1.findAll("div", attrs = {'class': 'ig-graphic-wrapper'})

text = soup1.findAll("div", attrs = {'class': 'ig-content-right'})

getText = text[0].findAll("div", attrs = {'class': 'ig-description right-section first'})

imageLink = theImage[0].find("a")

print imageLink['href']

print getText

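for row in image:
    # First link inside the current wrapper div
    theImage = row.find("a")
    # Assumption: the anchor's href is the resource to download
    link = theImage['href']

    actually_download = False
    if actually_download:
        filename = link.split('/')[-1]
        urllib.urlretrieve(link, filename)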

counter += 1

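【Comments】:

Does your browser have the Web Developer toolbar installed? I find it very useful for visualizing (no pun intended) form data, button actions, links, and so on.
If you print the link, does it point to the right resource? That would be the first debugging step.
No, I don't have the Web Developer toolbar, unless you mean Firebug?
Yes, but sometimes it also points to things I don't want. I don't know why, but sometimes the exact same code scrapes two different things.
What if you tried something like wget (gnu.org/software/wget)? Unless you need to do some specific processing of the HTML, you can download the whole site as needed. You can use sed (gnu.org/software/sed) to rewrite the links.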

Tags: python python-2.7 web-scraping

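【Solution 1】:

You can't use the urllib-plus-parser combination here, because the page uses JavaScript to load more content. For that you need a full browser emulator with JavaScript support. I have never used Selenium, but I have heard it can do this, and it has a Python binding.

However, I found that it uses a very predictable form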

http://visual.ly/?page=<page_number>

for its GET requests. Perhaps an easier way is to go under

<div class="view-mode-wrapper">...</div>

and parse the data (using the URL format above). After all, the ajax request has to go somewhere.
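Pulling the links out of that wrapper div could look roughly like the sketch below. It is not the asker's BeautifulSoup code: it assumes Python 3 and uses only the standard-library `html.parser`, and the names `WrapperLinkParser` and `extract_links` are made up for illustration. It also assumes the page's markup is reasonably well nested, since it tracks depth by counting open and close tags.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class WrapperLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <div class="view-mode-wrapper">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the wrapper div (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Outside the wrapper: only react to the wrapper div itself.
            classes = (attrs.get("class") or "").split()
            if tag == "div" and "view-mode-wrapper" in classes:
                self.depth = 1
            return
        if tag not in VOID_TAGS:
            self.depth += 1
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1


def extract_links(html):
    """Return every link found inside the wrapper div, in document order."""
    parser = WrapperLinkParser()
    parser.feed(html)
    return parser.links
```

Links outside the wrapper (navigation, footer and so on) are ignored, which may help with the "sometimes it scrapes things I don't want" problem mentioned in the comments.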

Then you can do

for i in xrange(<whatever>):
    url = r'http://visual.ly/?page={pagenum}'.format(pagenum=i)
    #do whatever you want from here
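To make the loop above a bit more concrete, here is a minimal sketch that walks the `?page=<page_number>` URLs and stops at the first page that yields nothing. It assumes Python 3 (`urllib.request` instead of `urllib2`), that numbering starts at 0 as in the `xrange` loop, and that an empty result means there are no more pages; none of this has been checked against the live site. `scrape_pages`, `fetch` and the `extract` callback are hypothetical names, where `extract` stands for whatever function turns one page of HTML into a list of items.

```python
from urllib.request import urlopen

# URL pattern taken from the answer above.
BASE_URL = "http://visual.ly/?page={pagenum}"


def page_url(pagenum):
    """Build the URL for one page of results."""
    return BASE_URL.format(pagenum=pagenum)


def fetch(url):
    """Download a page and return its body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_pages(extract, max_pages, fetch_page=fetch, first_page=0):
    """Call extract(html) on successive pages until one yields nothing."""
    results = []
    for pagenum in range(first_page, first_page + max_pages):
        items = extract(fetch_page(page_url(pagenum)))
        if not items:  # assumed to mean we ran out of content
            break
        results.extend(items)
    return results
```

Passing `fetch_page` as a parameter keeps the loop testable without network access, and `max_pages` puts an upper bound on the number of requests if the site never returns an empty page.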

【Comments】: