【Question title】: Is there a way to collect data or parse pages with BeautifulSoup from dynamically generated web pages?
【Posted】: 2016-08-29 13:31:12
【Question】:

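I have been using BeautifulSoup to parse data from web pages. However, I am not sure how to collect data from pages that are populated by scripts (JS and JSON): when I view the source, the content I want is not in the HTML. Is there a tool that can fetch or render such pages so that I can collect data or links from them?

Below is an example of the source of such a JS/JSON-driven page.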

<!DOCTYPE html>
<html>
<head>
  <link rel="stylesheet" type="text/css" class="__meteor-css__" href="/3688b5ba42be128b061150ae66a2c2f245507d7e.css?meteor_css_resource=true">  <link rel="stylesheet" type="text/css" class="__meteor-css__" href="/4281a8e71152d94a7380f89ab8dd32d9542c9b5c.css?meteor_css_resource=true">
<meta name="fragment" content="!">
<script type="text/inject-data">%7B%22fast-render-data%22%3A%7B%22collectionData%22%3A%7B%22users%22%3A%5B%5B%7B%22emails%22%3A%5B%7B%22address%22%3A%22suhas.servesh%40gmail.com%22%2C%22verified%22%3Afalse%7D%5D%2C%22profile%22%3A%7B%22defaultSiteName%22%3A%22draftkings%22%2C%22defaultSportName%22%3A%22mlb%22%7D%2C%22username%22%3A%22kloudklown%22%2C%22_id%22%3A%22YnZKGMPLrwHCzHRh5%22%7D%5D%5D%2C%22kadira_settings%22%3A%5B%5B%7B%22appId%22%3A%22SiGbMwMEWLf7WK3KB%22%2C%22endpoint%22%3A%22https%3A%2F%2Fenginex.kadira.io%22%2C%22clientEngineSyncDelay%22%3A10000%2C%22enableErrorTracking%22%3Atrue%2C%22_id%22%3A%22SgS4nrWA5a6nDdzaY%22%7D%5D%5D%7D%2C%22subscriptions%22%3A%7B%7D%2C%22loginToken%22%3A%22-cCvsClRaCVlHa24nJLdIjfDp0EOC_flNuR7IR6Qxqj%22%7D%7D</script>
<script type="text/javascript" src="https://js.stripe.com/v2/"></script>
    <script type="text/javascript" src="https://checkout.stripe.com/checkout.js"></script>
<link href="https://d1mua5vq38hnzr.cloudfront.net/favicon.ico" rel="icon" type="image/x-icon" />
    <script type="text/javascript" src="https://static.leaddyno.com/js"></script>
    <!-- Facebook Pixel Code -->
    <script>
    !function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?
    n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;
    n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0;
    t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,
    document,'script','https://connect.facebook.net/en_US/fbevents.js');

    fbq('init', '156814968048022');
    fbq('track', "PageView");</script>
    <noscript><img height="1" width="1" style="display:none"
    src="https://www.facebook.com/tr?id=156814968048022&ev=PageView&noscript=1"
    /></noscript>
    <!-- End Facebook Pixel Code -->

</head>
<body>



<script type="text/javascript">__meteor_runtime_config__ = JSON.parse(decodeURIComponent("%7B%22meteorRelease%22%3A%22METEOR%401.3.4.1%22%2C%22meteorEnv%22%3A%7B%22NODE_ENV%22%3A%22production%22%2C%22TEST_METADATA%22%3A%22%7B%7D%22%7D%2C%22PUBLIC_SETTINGS%22%3A%7B%22ga%22%3A%7B%22account%22%3A%22UA-58886344-1%22%7D%7D%2C%22ROOT_URL%22%3A%22https%3A%2F%2Fdailyfantasynerd.com%22%2C%22ROOT_URL_PATH_PREFIX%22%3A%22%22%2C%22appId%22%3A%228u0umeqb2znyyvsybl%22%2C%22kadira%22%3A%7B%22appId%22%3A%22SiGbMwMEWLf7WK3KB%22%2C%22endpoint%22%3A%22https%3A%2F%2Fenginex.kadira.io%22%2C%22clientEngineSyncDelay%22%3A10000%2C%22enableErrorTracking%22%3Atrue%7D%2C%22autoupdateVersion%22%3A%22cd1f15509aed34ad130a1b1cc1c46cb282abe1dd%22%2C%22autoupdateVersionRefreshable%22%3A%227a8125062727989a665ebc42d995410c7cc05ab7%22%2C%22autoupdateVersionCordova%22%3A%22none%22%7D"));</script>

  <script type="text/javascript" src="/e517e573069a465b017732a35a886ff1c36e2550.js?meteor_js_resource=true"></script>


</body>
</html>

【Comments】:

  • What is the page you are trying to scrape?

Tags: javascript json parsing beautifulsoup


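【Solution 1】:

You can use PyQt with its WebKit bindings: the page is loaded in a WebKit page object, and the resulting HTML is then handed to BeautifulSoup. Here is a sample script, taken from this blog post: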

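import sys

from bs4 import BeautifulSoup
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage


class Render(QWebPage):
    """Load a URL in a WebKit page and block until loading has finished."""

    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        # Run the Qt event loop; _loadFinished() quits it.
        self.app.exec_()

    def _loadFinished(self, result):
        # Keep the frame so its HTML can be read after the event loop exits.
        self.frame = self.mainFrame()
        self.app.quit()


url = 'http://webscraping.com'
r = Render(url)
# HTML of the page as it stands once loading has finished.
html = r.frame.toHtml()
soup = BeautifulSoup(html, 'html.parser')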
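
As a complement to rendering, note that the sample page in the question already carries some of its data as percent-encoded JSON inside the `<script type="text/inject-data">` tag. A minimal sketch of decoding that kind of payload with the standard library alone follows; the function name `decode_inject_data` and the shortened `sample` string are made up for illustration, and this only helps when the data you need is actually embedded this way.

```python
import json
from urllib.parse import unquote


def decode_inject_data(encoded: str) -> dict:
    """Decode the percent-encoded JSON text of an inject-data script tag.

    Hypothetical helper: it assumes the text is a URL-encoded JSON object,
    as in the sample page from the question.
    """
    return json.loads(unquote(encoded.strip()))


# Shortened, made-up payload using the same encoding as the page above.
sample = (
    "%7B%22fast-render-data%22%3A%7B%22collectionData%22%3A%7B"
    "%22users%22%3A%5B%5D%7D%2C%22subscriptions%22%3A%7B%7D%7D%7D"
)

data = decode_inject_data(sample)
print(sorted(data["fast-render-data"].keys()))
```

With BeautifulSoup, the encoded text could be taken from the unrendered source with something like `soup.find('script', type='text/inject-data').string` and passed to the helper. Content that the page fetches later over the network will not be there, so rendering as shown above may still be needed.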

【Comments】:
