Using Python and Mechanize to submit form data and authenticate

【Question title】: Using Python and Mechanize to submit form data and authenticate
【Posted】: 2011-06-10 20:46:45
【Question】:

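I want to log in to Reddit.com, navigate to a specific part of a page, and submit a comment. I can't see anything wrong with this code, but it doesn't work: no change shows up on the Reddit site.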

import mechanize
import cookielib


def main():

    # Browser
    br = mechanize.Browser()

    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    # Follow refresh 0, but do not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    # Open the site to be navigated
    r = br.open('http://www.reddit.com')
    html = r.read()

    # Select the second (index one) form
    br.select_form(nr=1)

    # User credentials
    br.form['user'] = 'DUMMYUSERNAME'
    br.form['passwd'] = 'DUMMYPASSWORD'

    # Login
    br.submit()

    # Open up comment page
    r = br.open('http://www.reddit.com/r/PoopSandwiches/comments/f47f8/testing/')
    html = r.read()

    # Text box is the 8th form on the page (which, I believe, is the text area)
    br.select_form(nr=7)

    # Change 'text' value to a testing string
    br.form['text'] = "this is an automated test"

    # Submit the information
    br.submit()

What is going on here?

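【Comments】:

Try adding a sleep of at least 10 seconds. You should also inspect the form in your browser (not "View Source", but "Inspect Element" in Chrome or the equivalent in Firefox) and compare it with the downloaded HTML. It may have fields that are filled in dynamically by JavaScript.
By the way, Reddit has an API. Wouldn't that be a better fit?
Hmm, let me try adding a sleep. I'm not sure how to use the API, because there is no documentation for submitting comments.
Edit: I tried the sleep. It didn't help.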
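
The suggestion above to compare the downloaded HTML with what the browser shows can be checked offline. The following is a minimal sketch, not part of the original thread: it uses only the standard library with Python 3 module names (the thread's own code is Python 2), and the `FormLister` class, the `list_forms` helper, and the sample markup are all hypothetical. It lists the control names of each form in document order, which should correspond to the index passed to `select_form(nr=...)`, so a form that lacks an expected field is easy to spot.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the control names of every <form> in an HTML document."""

    CONTROL_TAGS = {"input", "textarea", "select", "button"}

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of control names per form
        self._current = None  # names for the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        elif tag in self.CONTROL_TAGS and self._current is not None:
            name = dict(attrs).get("name")
            if name:
                self._current.append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list of control-name lists, one per form, in document order."""
    parser = FormLister()
    parser.feed(html)
    parser.close()
    return parser.forms


# Made-up markup standing in for a downloaded page.
SAMPLE = """
<form id="search"><input name="q"></form>
<form id="login"><input name="user"><input name="passwd" type="password"></form>
"""

for index, names in enumerate(list_forms(SAMPLE)):
    print(index, names)
```

Running the same helper on the string returned by `r.read()` would show which fields the server actually sent, as opposed to fields that JavaScript adds later in the browser.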

Tags: python networking screen-scraping mechanize

【Solution 1】:

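If at all possible, I would definitely recommend using the API, but this worked for me (not on your example post, which has been deleted, but on any active post):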

#!/usr/bin/env python

import mechanize
import cookielib
import urllib
import logging
import sys

def main():

    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)

    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    r= br.open('http://www.reddit.com')

    # Select the second (index one) form
    br.select_form(nr=1)

    # User credentials
    br.form['user'] = 'user'
    br.form['passwd'] = 'passwd'

    # Login
    br.submit()

    # Open up comment page
    posting = 'http://www.reddit.com/r/PoopSandwiches/comments/f47f8/testing/'
    rval = 'PoopSandwiches'
    # you can get the rval in other ways, but this will work for testing

    r = br.open(posting)

    # You need the 'uh' value from the first form
    br.select_form(nr=0)
    uh = br.form['uh']

    br.select_form(nr=7)
    thing_id = br.form['thing_id']
    id = '#' + br.form.attrs['id']
    # The id that gets posted is the form id with a '#' prepended.

    data = {'uh':uh, 'thing_id':thing_id, 'id':id, 'renderstyle':'html', 'r':rval, 'text':"Your text here!"}
    new_data_dict = dict((k, urllib.quote(v).replace('%20', '+')) for k, v in data.iteritems())

    # not sure if the replace needs to happen, I did it anyway
    new_data = 'thing_id=%(thing_id)s&text=%(text)s&id=%(id)s&r=%(r)s&uh=%(uh)s&renderstyle=%(renderstyle)s' %(new_data_dict)

    # not sure which of these headers are really needed, but it works with all
    # of them, so why not just include them.
    req = mechanize.Request('http://www.reddit.com/api/comment', new_data)
    req.add_header('Referer', posting)
    req.add_header('Accept', ' application/json, text/javascript, */*')
    req.add_header('Content-Type', 'application/x-www-form-urlencoded; charset=UTF-8')
    req.add_header('X-Requested-With', 'XMLHttpRequest')
    cj.add_cookie_header(req)
    res = mechanize.urlopen(req)

main()

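It would be interesting to turn off JavaScript and see how Reddit handles comments then. Right now, a lot of magic happens in the onsubmit function that is called when the comment is posted. That is where the uh and id values are added.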
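
To make the role of those extra values concrete, here is a small sketch of how the POST body could be assembled with the standard library's `urlencode` instead of quoting each value by hand. It is an illustration under assumptions, not the answer's code: it uses the Python 3 module path `urllib.parse` (in Python 2 the function is `urllib.urlencode`), and `build_comment_body` and every sample value are hypothetical. The field names are the ones used in the answer above.

```python
from urllib.parse import parse_qs, urlencode


def build_comment_body(uh, thing_id, form_id, subreddit, text):
    """Return a form-urlencoded body with the fields the answer posts."""
    fields = [
        ("thing_id", thing_id),
        ("text", text),
        ("id", "#" + form_id),   # form id with '#' prepended, as in the answer
        ("r", subreddit),
        ("uh", uh),              # hidden value read from the first form
        ("renderstyle", "html"),
    ]
    # urlencode percent-escapes each value and encodes spaces as '+'.
    return urlencode(fields)


# All values below are made up for illustration.
body = build_comment_body(
    "abc123", "t3_f47f8", "form-t3_f47f8", "PoopSandwiches", "Your text here!"
)
print(body)
print(parse_qs(body)["id"])
```

Decoding the body again with `parse_qs` is a quick way to confirm that the `#` in `id` and the spaces in `text` survive the round trip.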

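【Comments】:

Wow. Thank you so much. I would never have figured that out.
Hmm... I get this error on every active thread: ControlNotFoundError: no control matching name 'thing_id'. Any ideas?
Haha, no. You misread that sentence: whichever active thread I run this program on, it still raises the error. The program I'm writing is for my own use. It posts relevant book chapters to a private subreddit that I moderate.
Problem solved: it is form [8] that contains thing_id. Thanks a lot.
Hmmm... it looks like thing_id lives in a different form on different subreddits (an interesting problem!). Also, selecting a form with the wrong thing_id will post a reply to someone instead of a new comment.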
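
Since the form index evidently differs between pages, one option is to pick the form by the controls it contains rather than by position. The sketch below is an illustration under assumptions, not something from the thread: `has_controls` and `first_matching_index` are hypothetical helpers that only rely on a form object exposing a `controls` list whose items have a `name` attribute (as mechanize's form objects do), and `FakeControl` and `FakeForm` are stand-ins so the snippet runs without mechanize or network access. It does not solve the second point above: if several forms carry a `thing_id`, you still have to check which one targets the post rather than an existing comment.

```python
from collections import namedtuple

# Stand-ins for mechanize objects, for illustration only.
FakeControl = namedtuple("FakeControl", "name")
FakeForm = namedtuple("FakeForm", "controls")


def has_controls(form, required):
    """True if the form contains a control for every name in `required`."""
    present = {control.name for control in form.controls}
    return set(required) <= present


def first_matching_index(forms, required):
    """Index of the first form that has all required controls, else None."""
    for index, form in enumerate(forms):
        if has_controls(form, required):
            return index
    return None


# Made-up page layout: the comment form is third here, but could be anywhere.
forms = [
    FakeForm([FakeControl("uh"), FakeControl("q")]),
    FakeForm([FakeControl("user"), FakeControl("passwd")]),
    FakeForm([FakeControl("thing_id"), FakeControl("text")]),
]
print(first_matching_index(forms, ["thing_id", "text"]))
```

With a real `mechanize.Browser`, the same check could presumably be passed as `br.select_form(predicate=lambda f: has_controls(f, ["thing_id", "text"]))`, which avoids hard-coding `nr=7` or `nr=8`.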