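【Question title】: Extracting data from script tag using scrapy
【Posted】: 2020-02-19 22:13:40
【Question】:

This is a script tag from the page source, and I want to extract the string in the mp4: list using Scrapy. I can't load it with a JSON loader, and I can't find any other way to do it. I can't figure out an XPath for it either.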

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>RikTak Video Player - Version 1</title>
    <script src="https://cdn.radiantmediatechs.com/rmp/5.2.1/js/rmp.min.js"></script>
    <style>
        body {
            margin: 0;
        }
    </style>
</head>
<body>
<div id="rmpPlayer"></div>
<script>
    var bitrates = {
         mp4: ['https://mvd8.ddns.me:443/viewm/52/653/52653.mp4?wmsAuthSign=c2VydmVyX3RpbWU9MTAvMjMvMjAxOSA2OjI2OjAzIFBNJmhhc2hfdmFsdWU9ODlyM3FWTlRONldQWGJOT3JWQWJTUT09JnZhbGlkbWludXRlcz02MA==']
    };

        var schedule = {
        preroll: [
            'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'
            ],
        midroll: [

            [600,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'],

            [1200,'https://pubads.g.doubleclick.net/gampad/ads?iu=/60345044/Pirsom_Ayoub_LTD_TOP/farfeshplus/farfeshplus_Preroll&description_url=https%3A%2F%2Fwww.farfeshplus.com%2F&env=vp&impl=s&correlator=&tfcd=0&npa=0&gdfp_req=1&output=vast&sz=640x480&unviewed_position_start=1'],

            [1800,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar']
            ],
        postroll: [
          'https://pubads.g.doubleclick.net/gampad/ads?iu=/60345044/Pirsom_Ayoub_LTD_TOP/farfeshplus/farfeshplus_Preroll&description_url=https%3A%2F%2Fwww.farfeshplus.com%2F&env=vp&impl=s&correlator=&tfcd=0&npa=0&gdfp_req=1&output=vast&sz=640x480&unviewed_position_start=1'
          ]
    };
        var settings = {
        licenseKey: 'Kl8lNHNrNzkyY3M5dj9yb201ZGFzaXMzMGRiMEElXyo=',
        bitrates: bitrates,
        delayToFade: 3000,
        width: 750,
        height: 440,
        skin: 's4',
        poster: 'https://images.farfeshplus.com/videos/lrg/laila_m_29.jpg',
        ads: true,      
        adSchedule: schedule
    };
    var elementID = 'rmpPlayer';
    var rmp = new RadiantMP(elementID);
    rmp.init(settings);
</script>
</body>
</html>

Please point me to a way to extract this data.

【Comments】:

  • Which line do you want?
  • The one inside the mp4: list

Tags: python web-scraping scrapy


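【Solution 1】:

First, choose the right selector to extract the script tag's contents as text (in the line below, url is the Scrapy selector being parsed).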

text = url.xpath('//body/script/text()').get()

Then you can use a regular expression to find what you want.

import re
mp4 = re.compile(r"(?<=mp4:\s\[')(.*)'\]")
print(mp4.findall(text)[0])

See @CypherX's answer for the same result with BeautifulSoup.

Output

https://mvd8.ddns.me:443/viewm/88/686/88686.mp4?wmsAuthSign=c2VydmVyX3RpbWU9MTAvMjMvMjAxOSAzOjMwOjE3IFBNJmhhc2hfdmFsdWU9UXgrZ1dHTWxhVGdNM0Iyd3dSeHJBdz09JnZhbGlkbWludXRlcz02MA==
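
The pattern above returns the first URL as a single string. If the mp4 array might hold several entries, one possible approach is to capture the whole bracketed literal and parse it with ast.literal_eval. This is only a sketch: it assumes the array contains nothing but plain quoted strings (no variables, comments, or JavaScript-only syntax), and the helper name parse_mp4_list and the example.com URLs are made up for illustration.

```python
import ast
import re

# Captures the bracketed array literal that follows "mp4:".
# [^\]]* stops at the first closing bracket, so this assumes no "]" inside the URLs.
MP4_LIST_RE = re.compile(r"mp4\s*:\s*(\[[^\]]*\])")


def parse_mp4_list(script_text):
    """Return the mp4 array as a Python list of strings (hypothetical helper)."""
    match = MP4_LIST_RE.search(script_text)
    if match is None:
        return []
    # A JS array of plain quoted strings is also a valid Python list literal.
    return ast.literal_eval(match.group(1))


# Illustrative input, not taken from the page in the question.
sample = """
var bitrates = {
     mp4: ['https://example.com/a.mp4?x=1', 'https://example.com/b.mp4']
};
"""
print(parse_mp4_list(sample))
```

This sidesteps the JSON loader problem from the question: the script body as a whole is JavaScript rather than JSON, but the small array literal on its own can be read safely.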

Data

text = """
<script>
    var bitrates = {
         mp4: ['https://mvd8.ddns.me:443/viewm/88/686/88686.mp4?wmsAuthSign=c2VydmVyX3RpbWU9MTAvMjMvMjAxOSAzOjMwOjE3IFBNJmhhc2hfdmFsdWU9UXgrZ1dHTWxhVGdNM0Iyd3dSeHJBdz09JnZhbGlkbWludXRlcz02MA==']
    };

        var schedule = {
        preroll: [
            'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'
            ],
        midroll: [

            [600,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'],

            [1200,'https://pubads.g.doubleclick.net/gampad/ads?iu=/60345044/Pirsom_Ayoub_LTD_TOP/farfeshplus/farfeshplus_Preroll&description_url=https%3A%2F%2Fwww.farfeshplus.com%2F&env=vp&impl=s&correlator=&tfcd=0&npa=0&gdfp_req=1&output=vast&sz=640x480&unviewed_position_start=1'],

            [1800,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar']
            ],
        postroll: [
          'https://pubads.g.doubleclick.net/gampad/ads?iu=/60345044/Pirsom_Ayoub_LTD_TOP/farfeshplus/farfeshplus_Preroll&description_url=https%3A%2F%2Fwww.farfeshplus.com%2F&env=vp&impl=s&correlator=&tfcd=0&npa=0&gdfp_req=1&output=vast&sz=640x480&unviewed_position_start=1'
          ]
    };
        var settings = {
        licenseKey: 'Kl8lNHNrNzkyY3M5dj9yb201ZGFzaXMzMGRiMEElXyo=',
        bitrates: bitrates,
        delayToFade: 3000,
        width: 750,
        height: 440,
        skin: 's4',
        poster: 'https://images.farfeshplus.com/videos/lrg/laila_m_29.jpg',
        ads: true,      
        adSchedule: schedule
    };
    var elementID = 'rmpPlayer';
    var rmp = new RadiantMP(elementID);
    rmp.init(settings);
</script>
"""
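
Inside a Scrapy callback, the selector step and the regex step can be combined, checking every inline script rather than relying on the page having only one. The following is a minimal sketch, not the answer author's code: the function name mp4_urls_from_response is hypothetical, and the only thing it assumes about its argument is that it supports response.xpath(...).getall() the way a Scrapy response does.

```python
import re

# Non-greedy capture of the first quoted string after "mp4: [".
MP4_RE = re.compile(r"mp4\s*:\s*\[\s*['\"](.*?)['\"]")


def mp4_urls_from_response(response):
    """Collect the first mp4 URL from each inline <script> (hypothetical helper).

    `response` is expected to behave like a Scrapy response, i.e. to
    support response.xpath(...).getall().
    """
    urls = []
    # //script/text() yields the body of every inline script on the page;
    # external scripts (src=...) have no text node and contribute nothing.
    for script_text in response.xpath("//script/text()").getall():
        match = MP4_RE.search(script_text)
        if match:
            urls.append(match.group(1))
    return urls
```

In a spider, the callback could loop over the returned list and yield one item per URL.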

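【Comments】:

  • But I am using it in Scrapy like this: def parse_frame(self, response): for url in response.xpath('//html'): URL = url.xpath('//scipt ').extract() mp4 = re.compile(r"(?
  • @IbtsamCh What gives you a tuple?
  • I am using Scrapy to get this data. Can you give me an XPath expression to get this link?
  • Is there only one script tag in the page, or several?
  • Thanks... I got it.

【Solution 2】: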

Another option is to use BeautifulSoup together with regex. The regex part is the same as the one suggested by @FlorianBernard.

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(text, "html.parser")
script = soup.find_all('script')[1]
mp4 = re.compile(r"(?<=mp4:\s\[\')(.*)\'\]")
print(mp4.findall(script.get_text())[0])

Output

https://mvd8.ddns.me:443/viewm/52/653/52653.mp4?wmsAuthSign=c2VydmVyX3RpbWU9MTAvMjMvMjAxOSA2OjI2OjAzIFBNJmhhc2hfdmFsdWU9ODlyM3FWTlRONldQWGJOT3JWQWJTUT09JnZhbGlkbWludXRlcz02MA==

Data

Here, text is the variable containing the entire HTML document.

text = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>RikTak Video Player - Version 1</title>
    <script src="https://cdn.radiantmediatechs.com/rmp/5.2.1/js/rmp.min.js"></script>
    <style>
        body {
            margin: 0;
        }
    </style>
</head>
<body>
<div id="rmpPlayer"></div>
<script>
    var bitrates = {
         mp4: ['https://mvd8.ddns.me:443/viewm/52/653/52653.mp4?wmsAuthSign=c2VydmVyX3RpbWU9MTAvMjMvMjAxOSA2OjI2OjAzIFBNJmhhc2hfdmFsdWU9ODlyM3FWTlRONldQWGJOT3JWQWJTUT09JnZhbGlkbWludXRlcz02MA==']
    };

        var schedule = {
        preroll: [
            'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'
            ],
        midroll: [

            [600,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'],

            [1200,'https://pubads.g.doubleclick.net/gampad/ads?iu=/60345044/Pirsom_Ayoub_LTD_TOP/farfeshplus/farfeshplus_Preroll&description_url=https%3A%2F%2Fwww.farfeshplus.com%2F&env=vp&impl=s&correlator=&tfcd=0&npa=0&gdfp_req=1&output=vast&sz=640x480&unviewed_position_start=1'],

            [1800,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar']
            ],
        postroll: [
          'https://pubads.g.doubleclick.net/gampad/ads?iu=/60345044/Pirsom_Ayoub_LTD_TOP/farfeshplus/farfeshplus_Preroll&description_url=https%3A%2F%2Fwww.farfeshplus.com%2F&env=vp&impl=s&correlator=&tfcd=0&npa=0&gdfp_req=1&output=vast&sz=640x480&unviewed_position_start=1'
          ]
    };
        var settings = {
        licenseKey: 'Kl8lNHNrNzkyY3M5dj9yb201ZGFzaXMzMGRiMEElXyo=',
        bitrates: bitrates,
        delayToFade: 3000,
        width: 750,
        height: 440,
        skin: 's4',
        poster: 'https://images.farfeshplus.com/videos/lrg/laila_m_29.jpg',
        ads: true,      
        adSchedule: schedule
    };
    var elementID = 'rmpPlayer';
    var rmp = new RadiantMP(elementID);
    rmp.init(settings);
</script>
</body>
</html>
"""
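
Indexing find_all('script')[1] depends on the target script being the second one on the page. As an alternative sketch that needs no third-party package, Python's built-in html.parser can collect every inline script body, and the regex is then tried against each of them. This swaps BeautifulSoup for the standard library; the names ScriptCollector and find_mp4 and the example.com URL are illustrative and not part of either answer.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Gather the text of every <script> element (illustrative class)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def find_mp4(html):
    """Return the first mp4 URL found in any script body, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    pattern = re.compile(r"mp4\s*:\s*\[\s*'(.*?)'")
    for body in collector.scripts:
        match = pattern.search(body)
        if match:
            return match.group(1)
    return None


# Illustrative page, much smaller than the one in the question.
page = """<html><head><script src="player.js"></script></head>
<body><script>var bitrates = { mp4: ['https://example.com/clip.mp4'] };</script></body></html>"""
print(find_mp4(page))
```

Searching all scripts for the pattern keeps the code working if the page later gains or loses a script tag ahead of the player setup.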

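【Comments】:

  • @FlorianBernard I used part of your solution to come up with an alternative, so I upvoted you as well.
  • No problem, I also updated my answer based on your formatting and upvoted you.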