Missing contents in HTML when crawling a website with mechanize/selenium: answers

【Question title】: Missing contents in HTML when crawling a website with mechanize/selenium
【Posted】: 2017-08-25 10:28:19
【Question description】:

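I'm trying to scrape information from this page: http://www.repertoireconservatoires.fr/repertoire/?instrument=&region=67%2C68&etablissment_type=

With all the tools I have (BeautifulSoup, mechanize, selenium), and having bought one day of access to the content, I can't get the full HTML page that I see in the browser's source view. This is what it looks like in Chrome: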

<!-- featured news area on homepage template if applied -->
    <div class="latest-news-homepage" role="complementary">

        <div class="section-inner-container">
    <div class="archive-wrapper">

        <!-- si:resultat -->
            <h2 class="nb_resultats">132 établissements trouvés</h2>                        
                <h3 class="archive-title">Liste des établissements 
        correspondant à votre recherche :</h3>
                <ul class="archive-post-list">
        <!-- repeat:repertoire -->

        [...]

        <!-- /repeat:repertoire -->

        <!-- /si:resultat -->

                        </ul>
                    </div>  
            </div>

        </div>
        <!-- /. end of featured-news container -->

And this is the response I get from mechanize or selenium:

<!-- featured news area on homepage template if applied -->
    <div class="latest-news-homepage" role="complementary">

            <div class="section-inner-content">

<div class="archive-wrapper"> 



                </div>  

        </div>

    </div>
    <!-- /. end of featured-news container -->

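So there is nothing inside the element with class "archive-wrapper" (not sure about the terminology here). From the comments in the markup I gather that something seems to be withholding the content, but I really don't know what or why. I'm very new to coding, but this is what I came up with: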

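import mechanize
import cookielib  # Python 2 module (http.cookiejar in Python 3)

url = 'http://www.repertoireconservatoires.fr/repertoire/?instrument=&region=67%2C68&etablissment_type=&page=0'
PASSWORD = 'H2CB-LLL9'  # the 1-day access code

# browser with a cookie jar, so the session survives the form submission
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open(url)
# fill the access code into the first form on the page and submit it
br.select_form(nr=0)
br.form['ticket'] = PASSWORD
br.submit()
print(br.response().read())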
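
Whether a response like the one printed above actually contains the search results can be checked without reading the whole dump. The following is only a minimal sketch, written for Python 3 and its standard `html.parser` module rather than the Python 2 setup used here. The class name `nb_resultats` comes from the Chrome markup shown earlier; `ResultHeadingFinder`, `find_result_heading`, `BROWSER_HTML` and `SCRAPED_HTML` are hypothetical names, and the two sample strings are trimmed-down versions of the snippets above.

```python
from html.parser import HTMLParser


class ResultHeadingFinder(HTMLParser):
    """Collects the text of <h2 class="nb_resultats">, if the page has one."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while we are between <h2 ...> and </h2>
        self._chunks = []      # text pieces found inside the heading
        self.found = False     # True once the heading has been seen

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "nb_resultats" in classes:
            self._inside = True
            self.found = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def text(self):
        return "".join(self._chunks).strip()


def find_result_heading(html):
    """Return the result-count heading text, or None when it is missing."""
    finder = ResultHeadingFinder()
    finder.feed(html)
    finder.close()
    return finder.text if finder.found else None


# Trimmed-down samples of the two responses shown in the question (assumed shapes).
BROWSER_HTML = """
<div class="archive-wrapper">
    <h2 class="nb_resultats">132 établissements trouvés</h2>
    <ul class="archive-post-list"></ul>
</div>
"""

SCRAPED_HTML = """
<div class="archive-wrapper">
</div>
"""

print(find_result_heading(BROWSER_HTML) is not None)  # heading present
print(find_result_heading(SCRAPED_HTML) is not None)  # heading missing
```

A `None` result means the server returned the page skeleton without the result block, which points at the request (what was submitted) rather than at the HTML parsing.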

And this with selenium, hoping that browser emulation would be enough:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

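# PhantomJS: a headless browser driven through selenium
driver = webdriver.PhantomJS()
# load the search page (url as defined in the mechanize snippet above)
driver.get(url)

# submit the access-code form
username = driver.find_element_by_name('ticket')

username.send_keys(PASSWORD)

username.submit()

print(driver.page_source)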

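The 1-day access code (PASSWORD in the code) is H2CB-LLL9. It will expire in a few hours anyway, so in case it helps... I hope you can get me out of this :p I've been using the search function to get this far with the code, but I couldn't find a solution to my problem here.

Thanks a lot!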

【Comments】:

Can you provide the full code, e.g. with selenium?

Tags: python html selenium web-crawler mechanize

【Solution 1】:

Sorry, I was being dumb. The solution is this:

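    import cookielib  # Python 2 module (http.cookiejar in Python 3)
    import mechanize

    url = ('http://www.repertoireconservatoires.fr/repertoire/'
           '?instrument=&region=67%2C68&etablissment_type=')
    cj = cookielib.CookieJar()
    br = mechanize.Browser()
    br.set_cookiejar(cj)
    br.open(url)
    # first form on the landing page: the access-code (ticket) prompt
    br.select_form(nr=0)
    br.form['ticket'] = 'H2CB-LLL9'
    br.submit()
    # after the ticket is accepted, select the region in the search form and submit it
    br.select_form(nr=0)
    br.form['region'] = ['67,68']
    br.submit()
    co = br.response().read()
    print(co)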

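I had only filled in the password and submitted it, without selecting any value in the drop-down menu, so I got an empty result. Thanks for your help, my bad. I'm glad I figured it out, though it took me far too long.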
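
For reference, the value '67,68' used for the region field is just the decoded form of `region=67%2C68` in the URL from the question. A minimal Python 3 sketch of that decoding, using only the standard library, is below. `search_params` is a hypothetical helper name, and the code only parses the URL string; it does not contact the site or say anything about which values the drop-down accepts.

```python
from urllib.parse import parse_qs, urlsplit

url = ("http://www.repertoireconservatoires.fr/repertoire/"
       "?instrument=&region=67%2C68&etablissment_type=")


def search_params(url):
    """Return the decoded query parameters of url, keeping the empty ones."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps 'instrument=' and 'etablissment_type='
    parsed = parse_qs(query, keep_blank_values=True)
    # each parameter appears once here, so take the first value of each list
    return {name: values[0] for name, values in parsed.items()}


params = search_params(url)
print(params)  # {'instrument': '', 'region': '67,68', 'etablissment_type': ''}
```

The other two parameters are empty in the URL, which matches the code above leaving the remaining search fields untouched.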
