【Question Title】: Any way to get a JS object using Scrapy
【Posted】: 2015-01-09 23:08:32
【Question】:

I'm using Scrapy to collect schedule information from the USL PRO website. The page I'm scraping is http://uslpro.uslsoccer.com/schedules/index_E.html

The page content is rendered at load time, so I can't get the table data directly from the page source. Looking through the source, I found that the schedule data is stored inside a JavaScript object.

Here is the JavaScript code:

preRender: function(){
var gmsA=diiH2A(DIISnapshot.gamesHolder);
....

This gmsA object contains all of the schedule information. Is there any way to get this JS object using Scrapy? Thanks a lot for your help.

【Comments】:

    Tags: javascript python web-scraping html-parsing scrapy


    【Solution 1】:

    For starters, you have multiple options:

    • parse the javascript file containing the data (described below)
    • use the scrapyjs tool, which plugs javascript rendering into Scrapy
    • automate a real browser with the help of selenium

    Okay, let's go with the first option (arguably the most complicated one).

    The page is loaded via a separate call to a .js file, which holds information about games and teams in two separate objects:

    DIISnapshot.gms = {
        "4428801":{"code":"1","tg":65672522,"fg":"2953156","fac":"22419","facn":"Blackbaud Stadium","tm1":"13380700","tm2":"22310","sc1":"1","sc2":"1","gmapply":"","dt":"22-MAR-2014","tim":"30-DEC-1899 19:30:00.0000","se":"65672455","modst":"","gmlabel":"","golive":0,"gmrpt":"67842863","urlvideo":"http://www.youtube.com/watch?v=JHi6_nnuAsQ","urlaudio":""}
      , "4428803":{"code":"2","tg":65672522,"fg":"2953471","fac":"1078448","facn":"StubHub Center","tm1":"33398866","tm2":"66919078","sc1":"1","sc2":"3","gmapply":"","dt":"22-MAR-2014","tim":"30-DEC-1899 22:30:00.0000","se":"65672455","modst":"","gmlabel":"","golive":0,"gmrpt":"67846731","urlvideo":"http://www.youtube.com/watch?v=nLaRaTi7BgE","urlaudio":""}
        ...   
      , "5004593":{"code":"217","tg":65672522,"fg":"66919058","fac":"66919059","facn":"Bonney Field","tm1":"934394","tm2":"65674034","sc1":"0","sc2":"2","gmapply":"3","dt":"27-SEP-2014","tim":"30-DEC-1899 22:30:00.0000","se":"65672455","modst":"21-SEP-2014 1:48:26.5710","gmlabel":"FINAL","golive":0,"gmrpt":"72827154","urlvideo":"https://www.youtube.com/watch?v=QPhL8Ktkz4M","urlaudio":""}
    };  
    
    DIISnapshot.tms = {
        "13380700":{"name":"Orlando City SC","club":"","nick":"Orlando","primarytg":"65672522"}
        ...
      , "8969532":{"name":"Pittsburgh Riverhounds","club":"","nick":"Pittsburgh","primarytg":"65672522"}
      , "934394":{"name":"Harrisburg City Islanders","club":"","nick":"Harrisburg","primarytg":"65672522"}
    };
    

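    To make the join between the two objects concrete: each game entry in DIISnapshot.gms references team records in DIISnapshot.tms through its tm1/tm2 keys. A minimal sketch with a trimmed sample entry (the team name for id "22310" is my assumption for the demo, not taken from the file):

```python
import json

# One game entry, trimmed from DIISnapshot.gms above
gms = json.loads('{"4428801": {"tm1": "13380700", "tm2": "22310", "sc1": "1", "sc2": "1"}}')

# Team records keyed by id, as in DIISnapshot.tms
# (the name for "22310" is assumed for demonstration only)
tms = {
    "13380700": {"name": "Orlando City SC"},
    "22310": {"name": "Charleston Battery"},
}

# Join each game to its team names via the tm1/tm2 ids
lines = []
for game in gms.values():
    home = tms[game["tm1"]]["name"]
    away = tms[game["tm2"]]["name"]
    lines.append("%s %s : %s %s" % (home, game["sc1"], game["sc2"], away))

print(lines[0])
```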
    Things get a bit harder because the URL of that js file is itself constructed with javascript, in the following script tag:

    <script type="text/javascript">
    var DIISnapshot = {
      goLive: function(gamekey) {
        clickpop1=window.open('http://uslpro.uslsoccer.com/scripts/runisa.dll?M2:gp::72013+Elements/DisplayBlank+E+2187955++'+gamekey+'+65672455','clickpop1','toolbar=0,location=0,status=0,menubar=0,scrollbars=1,resizable=0,top=100,left=100,width=315,height=425');
      }
    };
    var DIISchedule = {
      MISL_lgkey: '36509042',
      sename:'2014',
      sekey: '65672455',
      lgkey: '2792331',
      tg: '65672522',
      ...
    
      fetchInfo:function(){
        var fname = DIISchedule.tg;
        if (fname === '') fname = DIISchedule.sekey;
        new Ajax.Request('/schedules/' + DIISchedule.seSeq + '/' + fname + '.js?'+rand4(),{asynchronous: false});
        DIISnapshot.gamesHolder = DIISnapshot.gms;
        DIISnapshot.teamsHolder = DIISnapshot.tms;
        DIISnapshot.origTeams = [];
        for (var teamkey in DIISnapshot.tms) DIISnapshot.origTeams.push(teamkey);
      },
      ...
    
        DIISchedule.scheduleLoaded = true;
      }
    }
    document.observe('dom:loaded',DIISchedule.init);
    </script>
    

    Okay, let's use the BeautifulSoup HTML parser and the slimit javascript parser to get the dynamic part used to build the URL (the tg value is the name of the js file), then request that URL, parse the javascript, and print out the matches:

    import json
    import random
    import re
    
    from bs4 import BeautifulSoup
    import requests
    from slimit import ast
    from slimit.parser import Parser
    from slimit.visitors import nodevisitor
    
    # start a session
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'}
    session = requests.Session()
    response = session.get('http://uslpro.uslsoccer.com/schedules/index_E.html', headers=headers)
    
    # get the dynamic part of the JS url
    soup = BeautifulSoup(response.text, 'html.parser')
    script = soup.find('script', text=lambda x: x and 'var DIISchedule' in x)
    tg = re.search(r"tg: '(\d+)',", script.text).group(1)
    
    # request to JS url
    js_url = "http://uslpro.uslsoccer.com/schedules/2014/{tg}.js?{rand}".format(tg=tg, rand=random.randint(1000, 9999))
    response = session.get(js_url, headers=headers)
    
    # parse js
    parser = Parser()
    tree = parser.parse(response.text)
    matches, teams = [json.loads(node.right.to_ecma())
                      for node in nodevisitor.visit(tree)
                      if isinstance(node, ast.Assign) and isinstance(node.left, ast.DotAccessor)]
    
    for match in matches.values():
        print(teams[match['tm1']]['name'],
              '%s : %s' % (match['sc1'], match['sc2']),
              teams[match['tm2']]['name'])
    
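    As a side note: if installing slimit is a problem, the same two assignments can be pulled out with a regular expression, since the object literals in this particular file happen to be valid JSON. A stdlib-only sketch (here js_source stands in for the downloaded file body, trimmed to two entries):

```python
import json
import re

# js_source stands for the body of the downloaded .js file; the assumption
# (which holds for the snippets shown above) is that the right-hand sides
# of the assignments are valid JSON
js_source = '''
DIISnapshot.gms = {
    "4428801":{"code":"1","tm1":"13380700","tm2":"22310","sc1":"1","sc2":"1"}
};
DIISnapshot.tms = {
    "13380700":{"name":"Orlando City SC","club":"","nick":"Orlando"}
};
'''

def extract_object(name, source):
    """Return the object literal assigned to DIISnapshot.<name> as a dict."""
    match = re.search(r'DIISnapshot\.%s\s*=\s*(\{.*?\})\s*;' % name, source, re.DOTALL)
    return json.loads(match.group(1))

matches = extract_object('gms', js_source)
teams = extract_object('tms', js_source)
```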

    This prints:

    Arizona United SC 0 : 2 Orange County Blues FC
    LA Galaxy II 1 : 0 Seattle Sounders FC Reserves
    LA Galaxy II 1 : 3 Harrisburg City Islanders
    New York Red Bulls Reserves 0 : 1 OKC Energy FC
    Wilmington Hammerheads FC 2 : 1 Charlotte Eagles
    Richmond Kickers 3 : 2 Harrisburg City Islanders
    Charleston Battery 0 : 2 Orlando City SC
    Charlotte Eagles 0 : 2 Richmond Kickers
    Sacramento Republic FC 2 : 1 Dayton Dutch Lions FC
    OKC Energy FC 0 : 5 LA Galaxy II
    ...
    

    The part that prints the match list is there for demonstration purposes. You can use the matches and teams dictionaries to output the data in whatever format you need.
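    For example, here is a sketch of writing the joined data out as CSV rows. The column names are my own choice, and the toy matches/teams dicts stand in for the real ones produced above:

```python
import csv
import io

# Toy stand-ins for the matches/teams dicts produced by the parsing step
matches = {"4428801": {"tm1": "1", "tm2": "2", "sc1": "1", "sc2": "3", "dt": "22-MAR-2014"}}
teams = {"1": {"name": "Home FC"}, "2": {"name": "Away FC"}}

# Write one CSV row per match, joining team ids to team names
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["date", "home", "home_score", "away_score", "away"])
for m in matches.values():
    writer.writerow([m["dt"],
                     teams[m["tm1"]]["name"], m["sc1"],
                     m["sc2"], teams[m["tm2"]]["name"]])

print(buf.getvalue())
```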

    Since this is not a popular tag, I don't expect any upvotes — most importantly, it was an interesting challenge for me.

    【Discussion】:
