【Title】: PYTHON: Submitting queries to ASPX, and scraping results from aspx pages
【Posted】: 2014-03-06 10:24:02
【Question】:

I want to scrape information about people from "http://www.ratsit.se/BC/SearchPerson.aspx", and I am writing the following code:

import urllib
from bs4 import BeautifulSoup

headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Origin': 'http://www.ratsit.se',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
'Content-Type': 'application/x-www-form-urlencoded',
'Referer': 'http://www.ratsit.se/',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'http://www.ratsit.se/BC/SearchPerson.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup = BeautifulSoup(f)
# parse and retrieve two vital form values
viewstate = soup.select("#__VIEWSTATE")[0]['value']
#eventvalidation = soup.select("#__EVENTVALIDATION")[0]['value']

formData = (

('__LASTFOCUS',''),
('__EVENTTARGET',''),
('__EVENTARGUMENT',''),
#('__EVENTVALIDATION', eventvalidation),
('__VIEWSTATE', viewstate),
('ctl00$cphMain$txtFirstName', 'name'), 
('ctl00$cphMain$txtLastName', ''),  
('ctl00$cphMain$txtBirthDate', ''),                                                          # etc. (not all listed)
('ctl00$cphMain$txtAddress', ''),   
('ctl00$cphMain$txtZipCode', ''),  
('ctl00$cphMain$txtCity', ''),  
('ctl00$cphMain$txtKommun',''),
#('btnSearchAjax','Sök'),
)

encodedFields = urllib.urlencode(formData)
 # second HTTP request with form data
f = myopener.open(url, encodedFields)

try:
    # actually we'd better use BeautifulSoup once again to
    # retrieve results (instead of writing out the whole HTML file)
    # Besides, since the result is split into multiple pages,
    # we need to send more HTTP requests
    fout = open('tmp.html', 'w')
except IOError:
    print('Could not open output file\n')
else:
    fout.writelines(f.readlines())
    fout.close()
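For reference, the hidden ASP.NET fields (`__VIEWSTATE`, `__EVENTVALIDATION`) can also be pulled out with the standard library alone. A minimal Python 3 sketch, run here against a made-up markup snippet (the `value` strings are placeholders, not real tokens):

```python
# Sketch: collecting ASP.NET hidden form fields with only the stdlib.
# Field names match the question's page; the values are fabricated.
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        a = dict(attrs)
        if a.get('type') == 'hidden' and 'name' in a:
            self.fields[a['name']] = a.get('value', '')

sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA3Nj..." />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAg..." />
</form>
'''
parser = HiddenFieldParser()
parser.feed(sample)
print(parser.fields)
```

Self-closing `<input … />` tags are delivered to `handle_starttag` as well, so both fields end up in the dictionary.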

I get a response from the server saying "my IP is blocked", but that cannot be the real reason, because when I use a browser everything works... Any suggestions as to where I went wrong?

Thanks

【Comments】:

  • Does the response message literally say "my ip is blocked"? Why can't people post the actual error message?
  • Searches on Ratsit are now rate-limited per hour, day, week and month. Searches from this IP address have exceeded those limits; to keep searching, a user agreement with Ratsit is required.
  • But that can't be true in my case, because there is no problem when I use a browser.
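Before blaming the IP, it is worth ruling out that the POST is going out without the browser-like headers. A minimal Python 3 sketch (using the modern `urllib.request` API) that builds the request without sending it, reusing the URL and header values from the question; only one form field is shown:

```python
# Sketch: constructing (not sending) the form POST with browser-like
# headers. URL and header values come from the question above.
import urllib.parse
import urllib.request

url = 'http://www.ratsit.se/BC/SearchPerson.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 '
                  '(KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17',
    'Referer': 'http://www.ratsit.se/',
    'Content-Type': 'application/x-www-form-urlencoded',
}
form = {'ctl00$cphMain$txtFirstName': 'name'}
data = urllib.parse.urlencode(form).encode('ascii')
req = urllib.request.Request(url, data=data, headers=headers)

print(req.get_method())           # becomes POST once data is attached
print(req.get_header('Referer'))  # the header really is on the request
```

Note that `urlencode` percent-encodes the `$` in the ASP.NET control names as `%24`, which the server decodes transparently.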

Tags: python asp.net http


【Answer 1】:

Your code does not work as posted:

  File "/Users/florianoswald/git/webscraper/scrape2.py", line 16
  version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)     Chrome/24.0.1312.57 Safari/537.17'
      ^
  IndentationError: expected an indented block

Is this meant to be a class definition? And why do we need the MyOpener class at all? This works too:

myopener = urllib.FancyURLopener()
myopener.open("http://www.google.com")
<addinfourl at 4411860752 whose fp = <socket._fileobject object at 0x106ed1c50>>
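For what it's worth, the only thing the `MyOpener` subclass adds is a browser-like `User-Agent` (the default is `Python-urllib/x.y`, which some servers reject). `FancyURLopener` is deprecated in Python 3; a sketch of the equivalent effect with `build_opener`:

```python
# Sketch: replacing the MyOpener subclass in Python 3, where
# FancyURLopener is deprecated. build_opener lets us override the
# default 'Python-urllib/x.y' User-Agent directly.
import urllib.request

ua = ('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 '
      '(KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17')
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', ua)]
# opener.open(url) would now send the browser-like User-Agent
print(opener.addheaders)
```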

【Comments】:
