【Question title】: BeautifulSoup does not return all data
【Posted】: 2017-08-28 06:24:21
【Question】:

Today I am trying to parse some moon phase data using Python's BeautifulSoup library.

from bs4 import BeautifulSoup
import urllib2

moon_url = "http://www.moongiant.com/phase/today/"


try:
    rqest =  urllib2.urlopen(moon_url)
    moon_Soup = BeautifulSoup(rqest, 'lxml')
    moon_angle = 0
    moon_illumination = 0
    main_data = moon_Soup.find('div', {'id' : 'moonDetails'})
    print main_data

except urllib2.URLError:
    print "Error"

But the output is not this:

<div id="moonDetails">        
      Phase: <span>Waxing Crescent</span><br>Illumination: <span>36%
</span><br>Moon Age: <span>6.00 days</span><br>Moon Angle: <span>0.55</span><br>Moon Distance: <span>364,</span>434.78 km<br>Sun Angle: <span>0.53</span><br>Sun Distance: <span>149,</span>571,918.47 km<br>
</div>

It is only this:

<div id="moonDetails">
</div>

Any ideas?

【Comments】:

  • The data is in var mArray, not in <div id="moonDetails">.
  • Actually it is in var jArray. How can I parse jArray with Python?
  • Thank you very much, that really helps!

Tags: python parsing beautifulsoup html-parsing


【Solution 1】:

As RaminNietzsche said in the comments, you should extract the script text from that particular script tag. You can use regex or built-in string methods (for example split(), strip(), and replace()).

Code:

from bs4 import BeautifulSoup
import requests
import re
import json

moon_url = "http://www.moongiant.com/phase/today/"
html_source =  requests.get(moon_url).text

moon_soup = BeautifulSoup(html_source, 'html.parser')

data = moon_soup.find_all('script', {'type' : 'text/javascript'})

for d in data:
    d = d.text
    if 'var jArray=' in d:
        jArray = re.search(r'\{(.*?)\}', d).group()
        moon_data = json.loads(jArray)
        print(moon_data)

        #if you want mArray data too, you just have to:
        # 1. add `'var mArray=' in d` in the if clause, and
        # 2. uncomment the following lines
        #mArray = re.search(r'\[+(.*?)\];', d).group()
        #print(mArray)
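
The same extraction can also be done with the built-in string methods mentioned above instead of a regular expression. This is only a minimal sketch: `script_text` is a made-up stand-in for the contents of the script tag, `extract_jarray` is a hypothetical helper name, and it assumes the assignment has the form `var jArray={...};` with no `};` inside the object.

```python
import json

# Hypothetical stand-in for the text of the <script> tag; the real page
# content may be laid out differently.
script_text = '''
var mArray=[["a","b"]];
var jArray={"0": ["<b>April 1</b>", "25%"], "1": ["<b>April 2</b>", "36%"]};
'''

def extract_jarray(text):
    """Return the object assigned to jArray, parsed as JSON."""
    # Keep everything after the assignment marker...
    after = text.split('var jArray=', 1)[1]
    # ...up to the first closing brace that is followed by a semicolon.
    raw = after.split('};', 1)[0] + '}'
    return json.loads(raw.strip())

moon_data = extract_jarray(script_text)
print(moon_data['1'])
```

Splitting on the literal markers avoids regex escaping, but it is just as sensitive to changes in how the page writes the assignment (for example, added spaces around `=`).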

Output:

{'3': ['<b>April 4</b>', '58%\n', 'Sun Angle: 0.53291621763825', 'Sun Distance: 149657950.85286', 'Moon Distance: 369697.55153449', 'Moon Age: 8.1316595947356', 'Moon Angle: 0.53870564539409', 'Waxing Gibbous', 'April 4'], '2': ["<span style='color:#c7b699'><b>April 3</b></span>", 'Illumination: <span>47%\n</span>', 'Sun Angle: <span>0.53', 'Sun Distance: <span>149,</span>614,</span>943.28', 'Moon Distance: <span>366,</span>585.35', 'Moon Age: <span>7.08', 'Moon Angle: <span>0.54', 'First Quarter', '<b>Monday, April 3, 2017</b>', 'April', 'Phase: <span>First Quarter</span>', 'April 3'], '1': ['<b>April 2</b>', '36%\n', 'Sun Angle: 0.53322274612254', 'Sun Distance: 149571918.46739', 'Moon Distance: 364434.77975454', 'Moon Age: 6.002888839693', 'Moon Angle: 0.54648504798072', 'Waxing Crescent', 'April 2'], '4': ['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153', 'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795', 'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703', 'Waxing Gibbous', 'April 5'], '0': ['<b>April 1</b>', '25%\n', 'Sun Angle: 0.53337618944887', 'Sun Distance: 149528889.15122', 'Moon Distance: 363387.67496992', 'Moon Age: 4.9078487808877', 'Moon Angle: 0.54805974945761', 'Waxing Crescent', 'April 1']}

Since it is loaded as JSON, you can navigate it like this:

Example code:

print(moon_data['4'])
print('-' * 5)
print(moon_data['4'][2])

Output:

['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153', 'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795', 'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703', 'Waxing Gibbous', 'April 5']
-----
Sun Angle: 0.53276322269153
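
If numeric values are needed rather than strings, the "Label: value" items of one entry can be collected into a dictionary. The following is a rough sketch, not part of the original answer: `labelled_values` is a hypothetical helper, and it assumes an entry in the plain form shown for key '4'. The entry for key '2' in the output above carries extra markup and thousands separators and would need further cleanup before `float()` accepts it.

```python
import re

# One entry copied from the output above (key '4').
entry = ['<b>April 5</b>', '69%\n', 'Sun Angle: 0.53276322269153',
         'Sun Distance: 149700928.5008', 'Moon Distance: 373577.14506795',
         'Moon Age: 9.1657967733025', 'Moon Angle: 0.53311119464703',
         'Waxing Gibbous', 'April 5']

def labelled_values(items):
    """Collect the 'Label: value' items of one entry into a dict of floats."""
    result = {}
    for item in items:
        text = re.sub(r'<[^>]+>', '', item).strip()  # drop any HTML tags
        label, sep, value = text.partition(': ')
        if sep:  # skip items that are not of the form 'Label: value'
            result[label] = float(value)
    return result

details = labelled_values(entry)
print(details['Moon Age'])
```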

【Comments】:

    【Solution 2】:

    Actually, after RaminNietzsche's comment, I used the dryscrape library.

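    from bs4 import BeautifulSoup
    import urllib2
    import dryscrape

    moon_url = "http://www.moongiant.com/phase/today/"

    try:
        # Plain request; its result is not used, it only fails early
        # if the URL is unreachable
        rqest = urllib2.urlopen(moon_url)
        # dryscrape loads the page and runs its JavaScript
        session = dryscrape.Session()
        session.visit(moon_url)
        response = session.body()
        soup = BeautifulSoup(response, 'lxml')

        moon_data = soup.findAll('div', {'id':'moonDetails'})
        print moon_data

    except urllib2.URLError:
        print "Error"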
    

    So now the output is:

    <div id="moonDetails">        
          Phase: <span>Waxing Crescent</span><br>Illumination: <span>36%
    </span><br>Moon Age: <span>6.00 days</span><br>Moon Angle: <span>0.55</span><br>Moon Distance: <span>364,</span>434.78 km<br>Sun Angle: <span>0.53</span><br>Sun Distance: <span>149,</span>571,918.47 km<br>
    </div>
    

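    Thanks everyone for your answers!

    【Comments】:

    • It does not seem to be compatible with Windows? The documentation there does not mention installation at all.
    【Solution 3】:

    Another way, whose essentials I borrowed from root's answer to "access Chrome DOM".

    The idea is that you can use selenium and lxml together to access the DOM of a page after its JavaScript has loaded and processed it.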

    >>> moon_url = "http://www.moongiant.com/phase/today/"
    >>> import selenium.webdriver as webdriver
    >>> import lxml.html as html
    >>> import lxml.html.clean as clean
    >>> 
    >>> browser = webdriver.Chrome()
    >>> browser.get(moon_url)
    >>> content = browser.page_source
    >>> cleaner = clean.Cleaner()
    >>> content = cleaner.clean_html(content)
    >>> doc = html.fromstring(content)
    >>> type(doc)
    <class 'lxml.html.HtmlElement'>
    >>> type(content)
    <class 'str'>
    >>> open('c:/scratch/content.htm','w').write(content)
    27070
    

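    Once you have done this, as the last few statements above show, you can access the DOM either as HTML, as a tree suitable for processing with lxml, or both. In your case you would probably prefer to make soup from the HTML, which means applying BeautifulSoup to content.

    Incidentally, when I saved content, I did find the following structure in the HTML, as one would expect.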

    <div id="moonDetails">
        Phase: <span>First Quarter</span><br>
        Illumination: <span>47%</span><br>
        Moon Age: <span>7.08 days</span><br>
        Moon Angle: <span>0.54</span><br>
        Moon Distance: <span>366,</span>585.35 km<br>
        Sun Angle: <span>0.53</span><br>
        Sun Distance: <span>149,</span>614,943.28 km<br>
    </div>
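
Applying BeautifulSoup to the rendered HTML could look roughly like the sketch below. Here `content` is a shortened, hand-copied stand-in for the string obtained from the browser, and only the first three fields are used: in the distance lines the number is split across the span boundary, so the simple label/value pairing shown here would not apply to them directly.

```python
from bs4 import BeautifulSoup

# Stand-in for the rendered page source; in practice this would be the
# `content` string obtained from the browser session.
content = """
<div id="moonDetails">
    Phase: <span>First Quarter</span><br>
    Illumination: <span>47%</span><br>
    Moon Age: <span>7.08 days</span><br>
</div>
"""

soup = BeautifulSoup(content, 'html.parser')
details = soup.find('div', {'id': 'moonDetails'})

# stripped_strings yields each text fragment with surrounding whitespace
# removed: 'Phase:', 'First Quarter', 'Illumination:', '47%', ...
parts = list(details.stripped_strings)

# Pair every label with the value that follows it.
moon = {label.rstrip(':'): value
        for label, value in zip(parts[::2], parts[1::2])}
print(moon)
```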
    

    【Comments】:
