【Question title】: Scraping JavaScript var from website using Beautiful Soup in Python
【Posted】: 2022-01-09 19:54:09
【Question description】:

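I'm scraping a website's HTML after a GET request. The site I want to extract data from defines a variable named product1218181, that is, product{id}. I'm using Beautiful Soup because it's what I normally use, but I can't figure out how to get a JavaScript variable out of the HTML. The HTML looks like this: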

<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>

I want to scrape it like this:

name: XIAOMI Poco X3 Pro 256 GB Akıllı Telefon
id: 1218181
price: 5799.00
brand: XIAOMI

Update

Here is the full code. I want to scrape the product information from this website:

import requests
import re, json
from bs4 import BeautifulSoup

URL = "https://www.mediamarkt.com.tr/tr/category/_cep-telefonlar%C4%B1-504171.html"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(id="category")

test = '<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>'

pattern = re.compile('.*?var product1218181 = (.*?);.*?')
match = pattern.match(test)
if match is not None:
    data = json.loads(match.groups()[0])
    for key, value in data.items():
        print(key, ":", value)

【Question comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    You can use a regular expression to select the variable from requests.get().text and load the matched string with json.loads():

    m = re.search(r'var product.+ = ({.*})', page.text)
    json.loads(m.group(1))
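
One caveat with the two lines above: re.search() returns None when nothing matches, so m.group(1) would raise an AttributeError. A minimal sketch of a guarded version follows; the helper name extract_first_product and the shortened sample string are assumptions made for illustration, while the pattern is the one from this answer.

```python
import json
import re

# Same pattern as in the answer; the helper name below is hypothetical.
PRODUCT_RE = re.compile(r'var product.+ = ({.*})')

def extract_first_product(html):
    """Return the first product object found in html as a dict, or None."""
    m = PRODUCT_RE.search(html)
    if m is None:
        # No "var product... = {...}" assignment in this text.
        return None
    return json.loads(m.group(1))

# Shortened copy of the script tag from the question, used instead of a live request.
sample = '<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI"};</script>'

product = extract_first_product(sample)
print(product["name"], product["price"])
print(extract_first_product("<p>no script here</p>"))
```

With a live page you would pass page.text instead of sample.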
    

    Example that gets a list of dictionaries:

    import requests
    import re, json
    from bs4 import BeautifulSoup
    
    URL = "https://www.mediamarkt.com.tr/tr/category/_cep-telefonlar%C4%B1-504171.html"
    page = requests.get(URL)
    
    soup = BeautifulSoup(page.content, "html.parser")
    data = [json.loads(m.group(1)) for m in re.finditer(r'var product.+ = ({.*})', page.text)]
    

    Output

    [{'name': 'XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz', 'id': '1218181', 'price': '5799.00', 'brand': 'XIAOMI', 'ean': '6934177738371', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'APPLE iPhone 12 64GB Akıllı Telefon Yeşil', 'id': '1212811', 'price': '14749.00', 'brand': 'APPLE', 'ean': '0194252030943', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'SAMSUNG Galaxy A22 128 GB Akıllı Telefon Beyaz', 'id': '1217491', 'price': '3499.00', 'brand': 'SAMSUNG', 'ean': '8806092288300', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar', 'dimension11': 'Samsung Telefon'}, {'name': 'XIAOMI Redmi 9T 128 GB Akıllı Telefon Yeşil', 'id': '1216309', 'price': '3399.00', 'brand': 'XIAOMI', 'ean': '6934177746031', 'dimension25': 'OutOfStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'APPLE iPhone 12 128GB Akıllı Telefon Siyah', 'id': '1212812', 'price': '15699.00', 'brand': 'APPLE', 'ean': '0194252031285', 'dimension25': 'InStock', 'dimension26': 9.99, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'APPLE iPhone 11 64GB Akıllı Telefon Sarı', 'id': '1212830', 'price': '10349.00', 'brand': 'APPLE', 'ean': '0194252098264', 'dimension25': 'InStock', 'dimension26': 9.99, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'CASPER VIA F20 128 GB Akıllı Telefon Beyaz', 'id': '1216984', 'price': '2999.00', 'brand': 'CASPER', 'ean': '8699247212134', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 
'Android Telefonlar'}, {'name': 'VIVO Y53S 128 GB Akıllı Telefon Derin Mavi', 'id': '1217949', 'price': '4499.00', 'brand': 'VIVO', 'ean': '6935117836812', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'OPPO A74 128 GB Akıllı Telefon Gece Mavisi', 'id': '1215862', 'price': '4499.00', 'brand': 'OPPO', 'ean': '8683040000227', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'XIAOMI Redmi 9T 128 GB Akıllı Telefon Gri', 'id': '1216310', 'price': '3399.00', 'brand': 'XIAOMI', 'ean': '6934177746086', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'VIVO Y53S 128 GB Akıllı Telefon Gökkuşağı', 'id': '1218011', 'price': '4499.00', 'brand': 'VIVO', 'ean': '6935117836829', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'OPPO A55 64GB Akıllı Telefon Yıldızlı Siyah', 'id': '1218661', 'price': '3499.00', 'brand': 'OPPO', 'ean': '8683040000418', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar', 'dimension11': 'Oppo Telefon'}, {'name': 'OPPO A55 64GB Akıllı Telefon Gökkuşağı Mavisi', 'id': '1218660', 'price': '3499.00', 'brand': 'OPPO', 'ean': '8683040000425', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar', 'dimension11': 'Oppo Telefon'}, {'name': 'TCL 20 E 32 GB Akıllı Telefon Mavi', 'id': '1217712', 'price': '2399.00', 'brand': 'TCL', 'ean': '4894461894812', 'dimension25': 
'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'OPPO A74 128 GB Akıllı Telefon Prizma Siyahı', 'id': '1215856', 'price': '4499.00', 'brand': 'OPPO', 'ean': '8683040000210', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'APPLE iPhone 11 128GB Akıllı Telefon Mor', 'id': '1212837', 'price': '10849.00', 'brand': 'APPLE', 'ean': '0194252100431', 'dimension25': 'InStock', 'dimension26': 9.99, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'XIAOMI Redmi Note 10 S 128 GB Akıllı Telefon Beyaz', 'id': '1217380', 'price': '4999.00', 'brand': 'XIAOMI', 'ean': '6934177748431', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'CASPER VIA E4 32 GB Akıllı Telefon Siyah', 'id': '1216978', 'price': '2299.00', 'brand': 'CASPER', 'ean': '8699247209356', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları', 'dimension10': 'Android Telefonlar'}, {'name': 'APPLE iPhone 13 Mini 128 GB Akıllı Telefon Starlight', 'id': '1217590', 'price': '14799.00', 'brand': 'APPLE', 'ean': '0194252689950', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}, {'name': 'APPLE iPhone 13 Mini 256 GB Akıllı Telefon Starlight', 'id': '1217595', 'price': '16199.00', 'brand': 'APPLE', 'ean': '0194252691304', 'dimension25': 'InStock', 'dimension26': 11.9, 'dimension24': 18.0, 'category': 'Telefon', 'dimension9': 'Cep Telefonları'}]
    

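    【Comments】:

    • It doesn't print any output.
    • Thank you very much. How can I scrape all the pages in this category?
    • To get a list of dictionaries, see the example - build a dataframe from it, or iterate over it to get the specific values you need.
    • How can I get output like this: name: XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz id: 1218181 price: 5799.00 brand: XIAOMI ean: 6934177738371 dimension25: InStock dimension26: 11.9 dimension24: 18.0 category: Telefon dimension9: Cep Telefonları dimension10: Android Telefonlar
    • You mentioned in your first comment that nothing is printed - just iterate over data --> for item in data: for k,v in item.items(): print(k, ":", v). Each question should focus on a single problem, and I think this one has been answered. For anything else, please ask a new question.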
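
The iteration suggested in the last comment can be sketched as follows. The two rows are shortened copies of entries from the output above; the WANTED tuple and the format_product helper are assumptions chosen to match the four fields the question asks for.

```python
# Shortened copies of two rows from the output above.
data = [
    {"name": "XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz", "id": "1218181",
     "price": "5799.00", "brand": "XIAOMI"},
    {"name": "APPLE iPhone 12 64GB Akıllı Telefon Yeşil", "id": "1212811",
     "price": "14749.00", "brand": "APPLE"},
]

# Fields picked for illustration; adjust to whatever you need.
WANTED = ("name", "id", "price", "brand")

def format_product(item, fields=WANTED):
    """Render the chosen fields of one product as 'key: value' lines."""
    return "\n".join(f"{key}: {item[key]}" for key in fields if key in item)

for item in data:
    print(format_product(item))
    print()  # blank line between products
```

Fields missing from a product are skipped rather than raising a KeyError, since not every product in the output carries the same keys.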
    【Solution 2】:

    You can use a regex (the re module) to extract the line, then pass it to json.loads() to parse the JSON value into a dict.

    Here is a sample snippet:

    import re, json
    
    test = '<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>'
    
    pattern = re.compile('.*?var product.+ = (.*?);.*?')
    match = pattern.match(test)
    if match is not None:
        data = json.loads(match.groups()[0])
        for key, value in data.items():
            print(key, ":", value)
    

    Output:

    name : XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz
    id : 1218181
    price : 5799.00
    brand : XIAOMI
    ean : 6934177738371
    dimension25 : InStock
    dimension26 : 11.9
    dimension24 : 18.0
    category : Telefon
    dimension9 : Cep Telefonları
    dimension10 : Android Telefonlar
    

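    【Comments】:

    • Thanks for your answer, but product1218181 is an actual product ID, so the website has a different ID for each product. For example: var product{id}
    • @Sadik170 You can use product.+ in the regex. Please check the update.
    • I updated my question.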
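
Since the id differs per product, one possible variation is to capture it with \d+ and collect every match. This is a sketch under two assumptions: the id is purely numeric, and the sequence }; never occurs inside a string value. The two-product HTML string is a shortened sample, not a live response.

```python
import json
import re

# Group 1 captures the numeric id, group 2 the object literal.
# The lazy {.*?} followed by ";" stops at the end of each statement.
PRODUCT_RE = re.compile(r'var product(\d+) = (\{.*?\});')

html = (
    '<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00"};</script>\n'
    '<script>var product1212811 = {"name":"APPLE iPhone 12 64GB Akıllı Telefon Yeşil","id":"1212811","price":"14749.00"};</script>'
)

# Map each variable's id to its parsed dict.
products = {m.group(1): json.loads(m.group(2)) for m in PRODUCT_RE.finditer(html)}

for product_id, product in products.items():
    print(product_id, ":", product["name"])
```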
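    【Solution 3】:

    I wrote this script to parse the JSON inside that script tag. I used the json library and BeautifulSoup.

    First, I loop over all the script tags on the page (in case there are several and they have no id or class to select on) and pick the one we need, that is, the one containing "name" (you could make this check more precise).

    Then, with some simple string manipulation, I was able to extract the dictionary/JSON data.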

    from bs4 import BeautifulSoup
    import json
    html = '''<script>var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","brand":"XIAOMI","ean":"6934177738371","dimension25":"InStock","dimension26":11.90,"dimension24":18.00,"category":"Telefon","dimension9":"Cep Telefonları","dimension10":"Android Telefonlar"};</script>'''
    
    soup = BeautifulSoup(html, 'html.parser')
    for item in soup.find_all('script'):
        if '= {"name":' in item.text:
            dictionary = item.text.split(' = ', 1)[-1][:-1]
            jsonResponse = json.loads(dictionary)
            print(jsonResponse)
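
The [:-1] slice above relies on the script text ending exactly with ";". As an alternative sketch, json.JSONDecoder.raw_decode() parses one JSON value from a given position and ignores whatever follows it. Here script_text stands in for the text of a script tag, and parse_js_object is a hypothetical helper name; the sample is a shortened copy of the variable from the question.

```python
import json

def parse_js_object(script_text, marker=" = "):
    """Decode the JSON value that follows the first `marker` in script_text.

    Raises ValueError if the marker is missing or the value is not valid JSON.
    """
    start = script_text.index(marker) + len(marker)
    # raw_decode parses a single JSON value starting at `start` and returns
    # it together with the index where parsing stopped; trailing text such
    # as ";" or a newline is left untouched.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj

# Stand-in for the text of one script tag, with a trailing newline.
script_text = 'var product1218181 = {"name":"XIAOMI Poco X3 Pro 256 GB Akıllı Telefon Bronz","id":"1218181","price":"5799.00","dimension26":11.90};\n'

product = parse_js_object(script_text)
print(product["name"])
```

This only works when the assigned value is valid JSON, as it is in the question's HTML; a general JavaScript object literal (unquoted keys, single quotes) would still fail to parse.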
    

    【Comments】:

    • The 'name' filter doesn't work.
    【Solution 4】:

    Try this:

    import requests
    import re, json
    from bs4 import BeautifulSoup
    
    URL = "https://www.mediamarkt.com.tr/tr/category/_cep-telefonlar%C4%B1-504171.html"
    
    page = requests.get(URL)
    
    soup = BeautifulSoup(page.content, "html.parser")
    
    results = soup.find(id="category").find("script").text
    
    data = json.loads(re.findall("(?:{).*(?:})", results)[0])
    
    print(data)
    
    

    【Comments】:
