[Question title]: How to scrape an embedded script on a web page in Python
[Posted]: 2014-12-28 02:56:57
[Question]:

For example, I have the web page http://www.amazon.com/dp/1597805483

I want to use XPath to scrape this sentence: "Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime."

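import requests
from lxml import html
url = "http://www.amazon.com/dp/1597805483"
page = requests.get(url)
tree = html.fromstring(page.text)
# XPath taken from the browser's rendered DOM, not from the raw page source
feature_bullets = tree.xpath('//*[@id="iframeContent"]/div/text()')
print(feature_bullets)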

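The code above returns nothing. The reason is that the XPath as the browser interprets it differs from the page source. But I don't know how to get the XPath from the source.

[Comments on the question]:

    Tags: python html xpath web-scraping html-parsing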


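    [Solution 1]:

    A lot goes into building the page you are scraping.

    As for the description specifically, the underlying HTML is constructed inside a JavaScript function: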

    <script type="text/javascript">
    
        P.when('DynamicIframe').execute(function (DynamicIframe) {
            var BookDescriptionIframe = null,
                    bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Fantastic%20Anthology%20Combining%20the%20Love%20of%20Science%20Fiction%20with%20Our%20National%20Pastime%3C%2FB%3E%3CBR%3E%3CBR%3EOf%20all%20the%20sports%20played%20across%20the%20globe%2C%20none%20has%20more%20curses%20and%20superstitions%20than%20baseball%2C%20America%26%238217%3Bs%20national%20pastime.%3Cbr%3E%3CBR%3E%3CI%3EField%20of%20Fantasies%3C%2FI%3E%20delves%20right%20into%20that%20superstition%20with%20short%20stories%20written%20by%20several%20key%20authors%20about%20baseball%20and%20the%20supernatural.%20%20Here%20you%27ll%20encounter%20ghostly%20apparitions%20in%20the%20stands%2C%20a%20strangely%20charming%20vampire%20double-play%20combination%2C%20one%20fan%20who%20can%20call%20every%20shot%20and%20another%20who%20can%20see%20the%20past%2C%20a%20sad%20alternate-reality%20for%20the%20game%27s%20most%20famous%20player%2C%20unlikely%20appearances%20on%20the%20field%20by%20famous%20personalities%20from%20Stephen%20Crane%20to%20Fidel%20Castro%2C%20a%20hilariously%20humble%20teenage%20phenom%2C%20and%20much%20more.%20In%20this%20wonderful%20anthology%20are%20stories%20from%20such%20award-winning%20writers%20as%3A%3CBR%3E%3CBR%3EStephen%20King%20and%20Stewart%20O%26%238217%3BNan%3Cbr%3EJack%20Kerouac%3CBR%3EKaren%20Joy%20Fowler%3CBR%3ERod%20Serling%3CBR%3EW.%20P.%20Kinsella%3CBR%3EAnd%20many%20more%21%3CBR%3E%3CBR%3ENever%20has%20a%20book%20combined%20the%20incredible%20with%20great%20baseball%20fiction%20like%20%3CI%3EField%20of%20Fantasies%3C%2FI%3E.%20This%20wide-ranging%20collection%20reaches%20from%20some%20of%20the%20earliest%20classics%20from%20the%20pulp%20era%20and%20baseball%27s%20golden%20age%2C%20all%20the%20way%20to%20material%20appearing%20here%20for%20the%20first%20time%20in%20a%20print%20edition.%20Whether%20you%20love%20the%20game%20or%20just%20great%20fiction%2C%20these%20stories%20will%20appeal%20to%20all%2C%20as%20the%20writers%20in%20this%20anthology%20bring%20great%20storytelling%20of%20the%20strange%20and%20supernatural%20to%20the%20plate%2C%20inning%20after%20inning.%3CBR%3E%3C%2Fdiv%3E",
                    bookDescriptionAvailableHeight,
                    minBookDescriptionInitialHeight = 112,
                    options = {};
        ...
    
    </script>
    

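    The idea here is to get the text of the script tag, extract the description value with a regular expression, unquote (URL-decode) the HTML, parse it with lxml.html, and get its .text_content():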

    import re
    from urllib.parse import unquote  # Python 3; in Python 2: from urlparse import unquote
    
    from lxml import html
    import requests
    
    url = "http://rads.stackoverflow.com/amzn/click/1597805483"
    page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
    tree = html.fromstring(page.content)
    
    # Find the <script> element whose text contains the encoded description
    script = tree.xpath('//script[contains(., "bookDescEncodedData")]')[0]
    # Capture the percent-encoded string assigned to bookDescEncodedData
    match = re.search(r'bookDescEncodedData = "(.*?)",', script.text)
    if match:
        # URL-decode to HTML, parse it, and keep only the text
        description_html = html.fromstring(unquote(match.group(1)))
        print(description_html.text_content())
    

    Prints:

    A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime. 
    Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural.  
    Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. 
    In this wonderful anthology are stories from such award-winning writers as:Stephen King and Stewart O’NanJack KerouacKaren Joy FowlerRod SerlingW. P. KinsellaAnd many more!Never has a book combined the incredible with great baseball fiction like Field of Fantasies. 
    This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.
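
    The printed text above runs several lines together (for example "as:Stephen King and Stewart O’NanJack Kerouac"), presumably because the <br> tags are dropped when only the text is kept. Below is a minimal offline sketch of the same regex-plus-unquote idea that turns each <br> into a newline. It uses only the standard library rather than lxml, and SAMPLE_SCRIPT, TextWithBreaks and extract_description are hypothetical names introduced for illustration; the sample string is a shortened stand-in, not the real page content.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# Hypothetical, shortened stand-in for the <script> text on the real page.
SAMPLE_SCRIPT = (
    'var BookDescriptionIframe = null,\n'
    '    bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Title%3C%2FB%3E%3CBR%3E'
    'Stewart%20O%26%238217%3BNan%3Cbr%3EJack%20Kerouac%3C%2Fdiv%3E",\n'
    '    minBookDescriptionInitialHeight = 112;'
)


class TextWithBreaks(HTMLParser):
    """Collect text content, emitting a newline for every <br> tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8217; for us
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <br> both arrive as "br"
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def extract_description(script_text):
    """Return the decoded description text, or None if the variable is absent."""
    match = re.search(r'bookDescEncodedData = "(.*?)",', script_text)
    if match is None:
        return None
    parser = TextWithBreaks()
    parser.feed(unquote(match.group(1)))  # percent-decoding yields HTML
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    print(extract_description(SAMPLE_SCRIPT))
```

    On a live page, the string passed to extract_description would be the text of the script element located as in the code above; whether the page still embeds the description this way is an assumption.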
    

    A similar solution, but using BeautifulSoup:

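    import re
    from urllib.parse import unquote  # Python 3; in Python 2: from urlparse import unquote
    
    from bs4 import BeautifulSoup
    import requests
    
    url = "http://rads.stackoverflow.com/amzn/click/1597805483"
    page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
    # Name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Guard against scripts with no string content (x is None for those)
    script = soup.find('script', string=lambda x: x and 'bookDescEncodedData' in x)
    match = re.search(r'bookDescEncodedData = "(.*?)",', script.string)
    if match:
        # URL-decode to HTML, parse it, and keep only the text
        description_html = BeautifulSoup(unquote(match.group(1)), 'html.parser')
        print(description_html.text)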
    

    Alternatively, you can take a high-level approach and use a real browser with the help of selenium:

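    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    url = "http://rads.stackoverflow.com/amzn/click/1597805483"
    
    driver = webdriver.Firefox()
    driver.get(url)
    
    # The description is rendered inside an iframe, so switch into it first
    iframe = driver.find_element(By.ID, 'bookDesc_iframe')
    driver.switch_to.frame(iframe)
    
    print(driver.find_element(By.ID, 'iframeContent').text)
    
    # quit() ends the whole browser session, not just the current window
    driver.quit()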
    

    This produces better-formatted output:

    A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime
    
    Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.
    
    Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural. Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. In this wonderful anthology are stories from such award-winning writers as:
    
    Stephen King and Stewart O’Nan
    Jack Kerouac
    Karen Joy Fowler
    Rod Serling
    W. P. Kinsella
    And many more!
    
    Never has a book combined the incredible with great baseball fiction like Field of Fantasies. This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.
    

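    [Comments]:

    • Which tool did you use to find the XPath?
    • @so3 Chrome Developer Tools and Brain Developer Tools :) The XPath is quite simple, as you can probably see: I am just checking the text inside the script tag.
    • But Chrome Developer Tools doesn't give you the XPath for the original source.
    • @so3 Well, I did some research, trying to find parts of the description in the page source, and found them hidden in that script tag. That was basically the key to the solution.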
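
    The search described in the last comment can be sketched roughly as follows: look for a known phrase in the raw page source both as plain text and in percent-encoded form, since the description here was stored percent-encoded. find_phrase_in_source and the sample source string are hypothetical and only illustrate the idea; a real page may use other encodings (HTML entities, JSON escapes) that this does not cover.

```python
from urllib.parse import quote


def find_phrase_in_source(source, phrase):
    """Return (form, index) pairs for each form of `phrase` found in `source`."""
    candidates = {
        "plain": phrase,
        # safe="" also encodes "/" so the result matches fully encoded data
        "percent-encoded": quote(phrase, safe=""),
    }
    hits = []
    for form, needle in candidates.items():
        index = source.find(needle)
        if index != -1:
            hits.append((form, index))
    return hits


if __name__ == "__main__":
    # Hypothetical raw source; on a real page this would be page.text
    source = '<script>var d = "none%20has%20more%20curses";</script>'
    print(find_phrase_in_source(source, "more curses"))
```

    A hit in the percent-encoded form, but not in the plain form, suggests the text lives inside a script string rather than in regular markup, which is the situation the answer deals with.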