【Question title】: Web scraping: not able to get the source code of a page
【Posted】: 2018-11-12 06:04:40
【Question】:

I am trying to scrape https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN. Here is the code.

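import requests
from bs4 import BeautifulSoup as BS

url = 'https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN'

sess = requests.session()
html = sess.get(url, headers={'User-Agent': 'Mozilla/5.0'}, allow_redirects=True)
Soup = BS(html.text, 'lxml')
# Save the parsed page so it can be compared with what Chrome shows
with open('ocswssw.html', 'w', encoding='utf-8') as f:
    f.write(Soup.prettify())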

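If you compare ocswssw.html with the site as shown in Chrome, they do not match.

The source code I receive is incomplete. Please let me know what is going wrong.

I would prefer not to use Selenium, because it pops up a browser window.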

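【Question comments】:

  • What do you mean by "incomplete"? What do you expect to find in the source code?
  • If you open the URL in Chrome and paste the soup into a txt file, the two do not match.
  • Of course they do not match: in Chrome you see the rendered page, with its JavaScript executed. requests returns the page source as sent by the server. So what is your expected output?
  • I want to search for social workers on the site. I cannot do that with the output I receive.
  • Do you mean the "company name", for example "A. Bacchus Social Work Professional Corporation"?

Tags: python web-scraping beautifulsoup request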


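【Solution 1】:

The page content is built with JavaScript, so requests/bs4 alone cannot give you the rendered page source.

Workaround: use headless Chrome to obtain the page source after the JavaScript has run.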
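One way to check this claim without a browser is to look at the placeholder that the scripts fill in later. The sketch below assumes the data ends up inside `<div id="content">`, as the raw HTML quoted in Solution 2 suggests. It uses only the standard library's `html.parser`, and `PlaceholderProbe` and `placeholder_is_empty` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PlaceholderProbe(HTMLParser):
    """Counts child tags and collects text inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0        # 0 while outside the target element
        self.found = False
        self.child_tags = 0
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.child_tags += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.element_id:
            self.found = True
            if tag not in VOID_TAGS:
                self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: count it, depth stays the same.
        if self.depth:
            self.child_tags += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text += data


def placeholder_is_empty(html, element_id="content"):
    """True if the element is missing, or has no child tags and no text."""
    probe = PlaceholderProbe(element_id)
    probe.feed(html)
    probe.close()
    return probe.child_tags == 0 and probe.text.strip() == ""


# Shape of the server response: the placeholder holds only whitespace.
static_html = '<div class="container">\n <div id="content">\n </div>\n</div>'
# Made-up example of what the placeholder might look like after rendering.
rendered_html = ('<div id="content"><form id="MainForm">'
                 '<input name="q"><span>Last name</span></form></div>')

print(placeholder_is_empty(static_html))
print(placeholder_is_empty(rendered_html))
```

If the function reports an empty placeholder for `html.text` from requests, the data you are after is not in the server response at all, and no parser can recover it from there.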

【Comments】:

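    【Solution 2】:

    It is not entirely clear to me what you are ultimately trying to accomplish, but to retrieve the source I:

    1) added the missing apostrophe to the ocswssw.html argument of your open() call, and

    2) ran the code and received source code almost identical to what Google Chrome provides.

    BS result: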

    <!DOCTYPE html>
    <html>
     <head>
      <meta charset="utf-8"/>
      <meta content="width=device-width, initial-scale=1" name="viewport"/>
      <title>
       OCSWSSW | Member Search
      </title>
      <link href="/Thinclient/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
      <link href="/Thinclient/Content/bootstrap.071220161413.css" rel="stylesheet" type="text/css"/>
      <link href="/Thinclient/Content/kendo/kendo.common-bootstrap.min.css" rel="stylesheet"/>
      <link href="/Thinclient/Content/kendo/kendo.bootstrap.min.css" rel="stylesheet"/>
      <link href="/Thinclient/Content/ThinStyle.110820150951.css" rel="stylesheet" title="Blue" type="text/css"/>
      <link href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet"/>
      <link href="/Thinclient/Content/icheck/square/blue.css" rel="stylesheet"/>
      <link href="/Thinclient/Content/GlobalStyleSheet.css" rel="stylesheet"/>
      <script type="text/javascript">
       HomeURL = "#/forms/new/?table=0x800000000000003D&amp;form=0x800000000000004D&amp;command=0x8000000000000C2D";
            AfterLoginData = null
    
            LanguageDictionary = {};
    
            LanguageDictionary.TC_COMMON = {"OkButtonTextOK":"Ok","OkButtonTextContinue":"Continue","OkButtonTextYes":"Yes","OkButtonTextDelete":"Clear","CancelButtonTextCancel":"Cancel","CancelButtonTextNo":"No","CancelButtonTextLogout":"Logout","MiddleButtonTextNo":"No","AjaxRequestError":"The Web server does not respond currently. Please try again later.","UserIdleMessage":"You are innactive, do you want to continue or you disconnect?","ErrorTitle":"Error","ErrorHeaderTitle":"Application error","ErrorHeaderText":"An application error has occurred while processing the current request. The error was recorded and sent to the site administrator. Provide your administrator ID error below.","ErrorMessage":"Message:","ErrorIdentifier":"Identify:","ErrorDate":"Date:"}
    
            LanguageDictionary.TC_SEARCH = {"OperatorNotEqual":"Not =","OperatorIsDefined":"Is Defined","OperatorIsNotDefined":"Is Not Defined","OperatorContains":"Contains","OperatorDoesNotContain":"Does not contain","OperatorBeginsWith":"Begins with","OperatorDoesNotBeginWith":"Does not begin with","OperatorIsEmpty":"Is Empty","OperatorIsNotEmpty":"Is not empty","CustomFiltersNotComplete":"One or more custom filters are not complete. Examine each custom filter and make sure that the valid search criteria are provided.","NavigateAwayFromSearchWithFilterSet":"You are about to leave this page without performing the search filters custom.","NoGlobalSearchPermissions":"Password","SearchDefinitionLostAlert":"The definition of research will be lost if the primary table is changed. Are you sure you want to change the primary table of the research."}
    
            LanguageDictionary.TC_FORM = {"RequiredFieldsNotSet":"Unable to save the form data. Provide a value for all required fields.","NavigateAwayFromUnsavedForm":"You are about to exit the form without saving it","RefreshFormLosesModifiedData":"The data of the form has changed. The changes you made will be lost when you refresh the form. Do you want to continue?","SaveDataBeforeClose":"The data of the form has changed. Do you want to save them before closing?","DeleteWarning":"The form data will be deleted. Are you sure you want to continue?","DeleteSecondaryWarning":"You are about to delete the form data.","RequiredField":"This is a required field","InvalidFormat":"The format for this field is not valid"}
    
            LanguageDictionary.TC_GLOBALSEARCH = {"CollapseAllLabel":"Reduce everything","ExpandAllLabel":"About expand"}
    
            LanguageDictionary.TC_WIDGETS = {"CallListItem":"Appeal","FaxListItem":"Fax","SmsListItem":"SMS"}
      </script>
      <script src="/Thinclient/Scripts/jquery-1.11.1.min.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/jquery-migrate-1.2.1.min.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/icheck.min.js">
      </script>
      <script src="/Thinclient/Scripts/kendo/kendo.all.min.js">
      </script>
      <script src="/Thinclient/Scripts/kendo/kendo.timezones.min.js">
      </script>
      <script src="/Thinclient/Scripts/kendo/kendo.aspnetmvc.min.js">
      </script>
      <script src="/Thinclient/Scripts/kendo/cultures/kendo.culture.en-US.min.js">
      </script>
      <script>
       kendo.culture("en-US");
      </script>
     </head>
     <body class="k-content">
      <div class="k-loading-mask" id="loadingMsg" style="width:100%;height:100%">
       <span class="k-loading-text">
        Loading...
       </span>
       <div class="k-loading-image">
        <div class="k-loading-color">
        </div>
       </div>
      </div>
      <input id="hdPollingFrequency" type="hidden" value="32767"/>
      <input id="hdPrivateComputerTimeout" type="hidden" value="32767"/>
      <input id="hdPublicComputerTimeout" type="hidden" value="32767"/>
      <input id="hdWarningDisplayDuration" type="hidden" value="0"/>
      <input id="hdWindowsAuthentication" type="hidden" value="false"/>
      <div class="container">
       <div id="content">
       </div>
      </div>
      <div id="loading" style="display: none;">
       <h1>
        We are processing your request. Please be patient.
       </h1>
       <input class="abortButton" type="button" value="Abort"/>
      </div>
      <script id="taskpadGroupTmpl" type="text/x-jquery-tmpl">
       <div class="panelBlock">
                <div class="panelTitle"><div class="panelLink"><a class="panelDD-dn" id="${DisplayName}" href="#">${DisplayName}</a></div><div class = "imgPanel">
                <a class="imgPanelDD" href="#">&nbsp;</a></div>
                </div>
                        <div class="panelContent1" id="panelContent1 + ${DisplayName}">
                            <ul>
                                {{tmpl(TaskItemCollection) "#taskpadItemTmpl"}}
                            </ul>                   
                        </div>            
                </div>
      </script>
      <script id="KendoTestTemplate" type="text/x-kendo-template">
       <h2>#= test #</h2>
            <ul>
                            #= kendo.render(kendo.template($("\\#KendoTestLiTemplate").html()), litest) #
            </ul>
      </script>
      <script id="KendoTestLiTemplate" type="text/x-kendo-template">
       <li>#= displayName#</li>
      </script>
      <script id="ErrorTemplate" type="text/x-jquery-tmpl">
       <div class="errorMsg k-widget k-notification k-notification-error " data-role="alert" style="display: block; opacity: 1;">
                <div class="k-notification-wrap">
                    <span class="k-icon k-i-note">
                        error
                    </span>
                    ${errorMsg}
                    <span class="k-icon k-i-close">
                        Hide
                    </span>
                </div>
            </div>
      </script>
      <script id="HelpButtonTemplate" type="text/x-jquery-tmpl">
       <button class="k-button k-primary helpButton" id="${id}" onclick="return false;">?</button>
      </script>
      <script id="IconTemplate" type="text/x-jquery-tmpl">
       <span class="k-icon ${icon}"></span>
      </script>
      <script id="trash" type="text/x-kendo-template">
       <li style="background: url(./Images/#=item.ImageId#.#=item.ImageHash#.#=item.ImageFileExtension#) no-repeat;"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
        <li class="#=GetCssClass(item.ContentType)#"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
      </script>
      <script id="taskpadItemTmpl" type="text/x-jquery-tmpl">
       {{if ImageId}}
                <li style="background: url(./Images/${ImageId}.${ImageHash}.${ImageFileExtension}) no-repeat;"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
          {{else}}
            <li class="${GetCssClass(ContentType)}"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
           {{/if}}
      </script>
      <script id="buttonBarButtonTmpl" type="text/x-jquery-tmpl">
       <button value="submit" class="submitBtn k-button k-primary" data-actionCommand="${Action}" data-Disabled="${Disabled}" data-Visible="${Visible}" data-Name="${Name}">
                                <span>${DisplayName}</span>
            </button>
      </script>
      <script src="/Thinclient/Scripts/jquery.filedownload.150420151637.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/jquery.tmpl.min.150420151637.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/pubsub.150420151641.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/jquery.form.150420151637.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/bootstrap.min.150420151637.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/sameheight.min.150420151641.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/Core.141120161617.js" type="text/javascript">
      </script>
      <script src="/Thinclient/Scripts/PivotalThinClient.150420151641.js" type="text/javascript">
      </script>
     </body>
    </html>
    

    Result from the browser's "View page source":

    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <title>OCSWSSW | Member Search</title>
        <link href="/Thinclient/favicon.ico" type="image/x-icon" rel="shortcut icon" />
        <link href="/Thinclient/Content/bootstrap.071220161413.css" rel="stylesheet" type="text/css" />
        <link rel="stylesheet" href="/Thinclient/Content/kendo/kendo.common-bootstrap.min.css" />
        <link rel="stylesheet" href="/Thinclient/Content/kendo/kendo.bootstrap.min.css" />
        <link href="/Thinclient/Content/ThinStyle.110820150951.css" rel="stylesheet" title="Blue" type="text/css" />
        <link rel="stylesheet" href="//maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css">    
        <link rel="stylesheet" href="/Thinclient/Content/icheck/square/blue.css" />
    
                <link rel="stylesheet" href="/Thinclient/Content/GlobalStyleSheet.css" />
    
    
        <script type="text/javascript" >
            HomeURL = "#/forms/new/?table=0x800000000000003D&amp;form=0x800000000000004D&amp;command=0x8000000000000C2D";
            AfterLoginData = null
    
            LanguageDictionary = {};
    
            LanguageDictionary.TC_COMMON = {"OkButtonTextOK":"Ok","OkButtonTextContinue":"Continue","OkButtonTextYes":"Yes","OkButtonTextDelete":"Clear","CancelButtonTextCancel":"Cancel","CancelButtonTextNo":"No","CancelButtonTextLogout":"Logout","MiddleButtonTextNo":"No","AjaxRequestError":"The Web server does not respond currently. Please try again later.","UserIdleMessage":"You are innactive, do you want to continue or you disconnect?","ErrorTitle":"Error","ErrorHeaderTitle":"Application error","ErrorHeaderText":"An application error has occurred while processing the current request. The error was recorded and sent to the site administrator. Provide your administrator ID error below.","ErrorMessage":"Message:","ErrorIdentifier":"Identify:","ErrorDate":"Date:"}
    
            LanguageDictionary.TC_SEARCH = {"OperatorNotEqual":"Not =","OperatorIsDefined":"Is Defined","OperatorIsNotDefined":"Is Not Defined","OperatorContains":"Contains","OperatorDoesNotContain":"Does not contain","OperatorBeginsWith":"Begins with","OperatorDoesNotBeginWith":"Does not begin with","OperatorIsEmpty":"Is Empty","OperatorIsNotEmpty":"Is not empty","CustomFiltersNotComplete":"One or more custom filters are not complete. Examine each custom filter and make sure that the valid search criteria are provided.","NavigateAwayFromSearchWithFilterSet":"You are about to leave this page without performing the search filters custom.","NoGlobalSearchPermissions":"Password","SearchDefinitionLostAlert":"The definition of research will be lost if the primary table is changed. Are you sure you want to change the primary table of the research."}
    
            LanguageDictionary.TC_FORM = {"RequiredFieldsNotSet":"Unable to save the form data. Provide a value for all required fields.","NavigateAwayFromUnsavedForm":"You are about to exit the form without saving it","RefreshFormLosesModifiedData":"The data of the form has changed. The changes you made will be lost when you refresh the form. Do you want to continue?","SaveDataBeforeClose":"The data of the form has changed. Do you want to save them before closing?","DeleteWarning":"The form data will be deleted. Are you sure you want to continue?","DeleteSecondaryWarning":"You are about to delete the form data.","RequiredField":"This is a required field","InvalidFormat":"The format for this field is not valid"}
    
            LanguageDictionary.TC_GLOBALSEARCH = {"CollapseAllLabel":"Reduce everything","ExpandAllLabel":"About expand"}
    
            LanguageDictionary.TC_WIDGETS = {"CallListItem":"Appeal","FaxListItem":"Fax","SmsListItem":"SMS"}
        </script>
        <script type="text/javascript" src="/Thinclient/Scripts/jquery-1.11.1.min.js"></script>
        <script src="/Thinclient/Scripts/jquery-migrate-1.2.1.min.js" type="text/javascript"></script>
    
        <script src="/Thinclient/Scripts/icheck.min.js"></script>
        <script src="/Thinclient/Scripts/kendo/kendo.all.min.js"></script>
        <script src="/Thinclient/Scripts/kendo/kendo.timezones.min.js"></script>
        <script src="/Thinclient/Scripts/kendo/kendo.aspnetmvc.min.js"></script>
    
        <script src="/Thinclient/Scripts/kendo/cultures/kendo.culture.en-US.min.js"></script>
        <script>
            kendo.culture("en-US");
    </script>
    
    
    
    
    </head>
    <body class="k-content">
        <div id="loadingMsg" class="k-loading-mask" style="width:100%;height:100%">
            <span class="k-loading-text">Loading...</span>
            <div class="k-loading-image">
                <div class="k-loading-color"></div>
            </div>
        </div>
    
        <input type="hidden" ID="hdPollingFrequency" value="32767"/>
        <input type="hidden" ID="hdPrivateComputerTimeout" value= "32767"/>
        <input type="hidden" ID="hdPublicComputerTimeout" value="32767"/>
        <input type="hidden" ID="hdWarningDisplayDuration" value="0"/>
        <input type="hidden" id="hdWindowsAuthentication" value="false"/>
    
    
    
    <div class="container">
        <div id="content">
    
        </div>
    </div>
    
        <div id="loading" style="display: none;">
            <h1>
                We are processing your request. Please be patient.</h1>
            <input type="button" value="Abort" class="abortButton" />
        </div>
        <script id="taskpadGroupTmpl" type="text/x-jquery-tmpl">
            <div class="panelBlock">
                <div class="panelTitle"><div class="panelLink"><a class="panelDD-dn" id="${DisplayName}" href="#">${DisplayName}</a></div><div class = "imgPanel">
                <a class="imgPanelDD" href="#">&nbsp;</a></div>
                </div>
                        <div class="panelContent1" id="panelContent1 + ${DisplayName}">
                            <ul>
                                {{tmpl(TaskItemCollection) "#taskpadItemTmpl"}}
                            </ul>                   
                        </div>            
                </div>      
        </script>
        <script id="KendoTestTemplate" type="text/x-kendo-template">
            <h2>#= test #</h2>
            <ul>
                            #= kendo.render(kendo.template($("\\#KendoTestLiTemplate").html()), litest) #
            </ul>   
        </script>
        <script id="KendoTestLiTemplate" type="text/x-kendo-template">
            <li>#= displayName#</li>    
        </script>
        <script id="ErrorTemplate" type="text/x-jquery-tmpl">
            <div class="errorMsg k-widget k-notification k-notification-error " data-role="alert" style="display: block; opacity: 1;">
                <div class="k-notification-wrap">
                    <span class="k-icon k-i-note">
                        error
                    </span>
                    ${errorMsg}
                    <span class="k-icon k-i-close">
                        Hide
                    </span>
                </div>
            </div>
        </script>
        <script id="HelpButtonTemplate" type="text/x-jquery-tmpl">
            <button class="k-button k-primary helpButton" id="${id}" onclick="return false;">?</button>
        </script>
        <script id="IconTemplate" type="text/x-jquery-tmpl">
            <span class="k-icon ${icon}"></span>
        </script>
        <script id="trash" type="text/x-kendo-template">
        <li style="background: url(./Images/#=item.ImageId#.#=item.ImageHash#.#=item.ImageFileExtension#) no-repeat;"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
        <li class="#=GetCssClass(item.ContentType)#"><a href="#=item.ActionCommand#" #if (item.ShowInNewWindow){# target="_blank" #}# class="#if (!item.ShowInNewWindow){# ajax-links #} if (item.ContentType == 'Email'){# mailto-links #}# linkTaskItem">#=item.DisplayName#</a></li>
    
        </script>
        <script id="taskpadItemTmpl" type="text/x-jquery-tmpl">
         {{if ImageId}}
                <li style="background: url(./Images/${ImageId}.${ImageHash}.${ImageFileExtension}) no-repeat;"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
          {{else}}
            <li class="${GetCssClass(ContentType)}"><a href="${ActionCommand}" {{if ShowInNewWindow}} target="_blank" {{/if}} class="{{if !ShowInNewWindow}} ajax-links {{/if}} {{if (ContentType == 'Email')}} mailto-links {{/if}} linkTaskItem">${DisplayName}</a></li>
           {{/if}}
        </script>
        <script id="buttonBarButtonTmpl" type="text/x-jquery-tmpl">
            <button value="submit" class="submitBtn k-button k-primary" data-actionCommand="${Action}" data-Disabled="${Disabled}" data-Visible="${Visible}" data-Name="${Name}">
                                <span>${DisplayName}</span>
            </button>
        </script>
        <script src="/Thinclient/Scripts/jquery.filedownload.150420151637.js" type="text/javascript"></script>      
        <script src="/Thinclient/Scripts/jquery.tmpl.min.150420151637.js" type="text/javascript"></script>  
        <script src="/Thinclient/Scripts/pubsub.150420151641.js" type="text/javascript"></script>
        <script src="/Thinclient/Scripts/jquery.form.150420151637.js" type="text/javascript"></script>
        <script src="/Thinclient/Scripts/bootstrap.min.150420151637.js" type="text/javascript"></script>
    
        <script src="/Thinclient/Scripts/sameheight.min.150420151641.js" type="text/javascript"></script>
        <script src="/Thinclient/Scripts/Core.141120161617.js" type="text/javascript"></script>
    
    
    
    
    
    
    
    
    
    
        <script src="/Thinclient/Scripts/PivotalThinClient.150420151641.js" type="text/javascript"></script>
    
    
    
    
    </body>
    </html>
    
    
    <script>
        //$( window ).load( 
        //$(".k-state-default").hover(function () {
        //    $(this).toggleClass("k-state-hover");
        //})
        //);
    </script>
    

    Isn't this what you wanted from Beautiful Soup?
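The comparison above was made by eye. As a rough sketch that is not part of the original answer, the two sources could also be compared structurally with the standard library's `html.parser`, ignoring whitespace, attribute order, and the case of attribute names. The helper names `tag_signature` and `same_structure` are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag together with its attributes, order-independent."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        # Valueless attributes arrive as None, so normalise them to "".
        normalised = tuple(sorted((name, value or "") for name, value in attrs))
        self.tags.append((tag, normalised))


def tag_signature(html):
    """Return the list of (tag, attributes) pairs in document order."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def same_structure(html_a, html_b):
    """True if both documents contain the same tags with the same attributes."""
    return tag_signature(html_a) == tag_signature(html_b)


# Two spellings of the same markup, similar to the differences seen above.
from_bs = '<div class="y" id="x">\n <input id="h" type="hidden" value="1"/>\n</div>'
from_browser = '<div id="x" class="y"><input type="hidden" ID="h" value="1"></div>'
print(same_structure(from_bs, from_browser))
```

Note that this only compares what the server sends. It says nothing about content that JavaScript adds afterwards in the browser, which is what the "Elements" panel shows.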

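    【Comments】:

    • Please open the page in a browser and compare it with the code you received.
    • I added Chrome's source code to the answer, below the BS4 result. I do not see anything missing or incomplete here.
    • Hi, do not use "View page source". Instead press Ctrl+Shift+I and go to the "Elements" tab to see the actual source.
    • Are you referring to the drop-down arrows for each parent element's children? What is different or missing there?
    • On the website you can see the first name and last name, but the script does not get them.
    【Solution 3】:

    This is a dynamic (Ajax) page, so bs4 on its own cannot see the content. If you dislike Selenium because of the browser window popping up, you can add the --headless option to hide it. Here is an example: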

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from bs4 import BeautifulSoup
    
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = Options()
    options.add_argument('--headless')  # run Chrome without opening a window
    #options.add_argument('--disable-gpu')  # maybe needed if running on Windows.
    # Newer Selenium releases expect options=; chrome_options= is the deprecated spelling
    driver = webdriver.Chrome(options=options)
    
    print("Loading page...")
    driver.get('https://onlineservices.ocswssw.org/Thinclient/Public/PR/EN/')
    
    # Wait up to 20 seconds for the Ajax-rendered content to appear
    print("Waiting for Ajax to finish...")
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID, 'MainForm')))
    
    html = driver.execute_script("return document.documentElement.outerHTML")
    Soup = BeautifulSoup(html, 'html.parser')
    # prettify() returns str, so write text and let open() handle the encoding
    with open('ocswssw.html', 'w', encoding='utf-8') as f:
        sourceCode = Soup.prettify()
        f.write(sourceCode)
        print(sourceCode)
    
    driver.quit()
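The explicit wait in the code above amounts to polling a condition until it holds or a timeout expires. Below is a minimal pure-Python sketch of that idea. It is not Selenium's implementation, and `wait_until` is a hypothetical helper. The 20-second timeout mirrors the answer, and the 0.5-second interval matches the default poll frequency of `WebDriverWait`.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition becoming truthy.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "has the Ajax content arrived yet?": ready on the third check.
checks = {"count": 0}


def element_present():
    checks["count"] += 1
    return "MainForm" if checks["count"] >= 3 else None


print(wait_until(element_present, timeout=2, poll_interval=0.01))
```

Any element that exists only after the Ajax call has completed works as the condition, which is why the choice of `MainForm` is not critical.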
    

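    【Comments】:

    • What made you choose the MainForm element to wait for?
    • Because it is one of the elements that exist only after the Ajax content has loaded.
    • So is it just a randomly chosen element, or is it specific to what you want to interact with?
    • It can be random.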