【Question title】: How can I extract URLs from within JavaScript code? - Python
【Posted】: 2019-03-23 19:43:37
【Question description】:

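One of my websites went offline a while ago and I need to recover its images. I have managed to write some Python that uses Beautiful Soup to extract the code from a script tag. I now need to parse some URLs out of the extracted text. The URLs I need are the ones for the "large" images. I'm not sure how to incorporate a loop that covers all the images rather than just the first one, or how to remove the quotation marks. Any help would be much appreciated.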

Extracted text:

var gallery_items = [{
    "type": "image",
    "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg",
    "medium-height": 267,
    "medium-width": 400,
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
    "large-height": 450,
    "large-width": 675,
    "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg",
    "caption": ""
}, {
    "type": "image",
    "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg",
    "medium-height": 267,
    "medium-width": 400,
    "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
    "large-height": 450,
    "large-width": 675,
    "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg",
    "caption": ""
}];

Python script:

from bs4 import BeautifulSoup
import urllib.request as request
import re

folder = r'./gallery'
URL = 'https://web.archive.org/web/20180324152250/http://www.example.com:80/project/test-museum-visitors-center/'
response = request.urlopen(URL)
soup = BeautifulSoup(response, 'html.parser')

scriptCnt = soup.find('div', {'class': 'posts-wrapper'})
script = scriptCnt.find('script').text

try:
    found = re.search('"large":(.+?)"', script).group(1)
except AttributeError:
    found = 'None Found!'


print(found)

Output:

"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg
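The output above starts with a quotation mark and covers only one image because `re.search` stops at the first match and the capture group begins before the opening quote. A minimal sketch of a regex-only fix follows; the `script` value here is a shortened, hypothetical stand-in for the extracted text, not the real page content.

```python
import re

# Shortened stand-in for the extracted script text (hypothetical sample data).
script = r'''var gallery_items = [{
    "medium": "https:\/\/www.example.com\/a-400x267.jpg",
    "large": "https:\/\/www.example.com\/a-675x450.jpg"
}, {
    "medium": "https:\/\/www.example.com\/b-400x267.jpg",
    "large": "https:\/\/www.example.com\/b-675x450.jpg"
}];'''

# The opening quote sits outside the capture group, so it is not returned;
# findall collects every match instead of stopping at the first one.
raw_urls = re.findall(r'"large":\s*"(.+?)"', script)

# Undo the JSON-style "\/" escaping.
large_urls = [url.replace('\\/', '/') for url in raw_urls]

for url in large_urls:
    print(url)
```

The pattern requires a colon right after `"large"`, so keys such as `"large-height"` are not matched.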

【Question comments】:

Tags: python beautifulsoup html-parsing


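【Solution 1】:

The array assigned to gallery_items is valid JSON, so it is easy to parse with Python's json library. All you need to do is carefully isolate the JSON text and pass it to the JSON parser. The code might look like this: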

import json
# Raw string so that "\/" is not treated as an (invalid) Python escape sequence
script_str = r'''var gallery_items = [{ "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg", "caption": "" }, { "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg", "caption": "" }];'''
# Slice from the first '[' to the last ']' so only the array literal is parsed
start = script_str.index('[')
end = script_str.rindex(']') + 1
gallery_items = json.loads(script_str[start:end])
for item in gallery_items:
    print(item['large'])  # json.loads turns the escaped "\/" into "/"

Hope this helps! Cheers!
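To plug this approach into the Beautiful Soup script from the question, the extraction step could be wrapped in a small helper that takes the script text directly. The sketch below is one possible shape, not the answerer's code: the function name `extract_image_urls` and its parameters are hypothetical, and it assumes the array ends at the first `];` in the text and that no string value contains `];`.

```python
import json
import re


def extract_image_urls(script_text, var_name='gallery_items', size='large'):
    """Return the URL stored under `size` for every item in the JS array."""
    # Capture the array literal in `var <var_name> = [...];`
    pattern = r'var\s+' + re.escape(var_name) + r'\s*=\s*(\[.*?\])\s*;'
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        return []
    items = json.loads(match.group(1))
    # Skip items that lack the requested key instead of raising KeyError
    return [item[size] for item in items if size in item]


# Shortened, hypothetical sample in the same shape as the extracted text
sample = r'''var gallery_items = [
{"type": "image", "large": "https:\/\/www.example.com\/a-675x450.jpg", "large-width": 675},
{"type": "image", "large": "https:\/\/www.example.com\/b-675x450.jpg", "large-width": 675}];'''

for url in extract_image_urls(sample):
    print(url)
```

In the question's script, the call would be `extract_image_urls(script)`, where `script` is the text taken from the script tag. Because `json.loads` does the decoding, the returned URLs carry no quotation marks and no `\/` escapes.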

【Comments】:

  • Thank you for taking the time to answer; this is exactly what I was trying to achieve.