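【Posted】: 2019-03-23 19:43:37
【Question】:
One of my websites went offline a while ago and I need to recover its images. I have managed to write some Python that uses Beautiful Soup to extract the code from a script tag. I now need to parse some URLs out of the extracted text. The URLs I need are the ones for the "large" images. I'm not sure how to add a loop so that it handles all of the images rather than just the first one, or how to remove the quotation marks. Any help would be appreciated.
Extracted text: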
var gallery_items = [{
"type": "image",
"medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg",
"medium-height": 267,
"medium-width": 400,
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
"large-height": 450,
"large-width": 675,
"awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg",
"caption": ""
}, {
"type": "image",
"medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg",
"medium-height": 267,
"medium-width": 400,
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
"large-height": 450,
"large-width": 675,
"awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg",
"caption": ""
}];
Python script:
from bs4 import BeautifulSoup
import urllib.request as request
import re
folder = r'./gallery'
URL = 'https://web.archive.org/web/20180324152250/http://www.example.com:80/project/test-museum-visitors-center/'
response = request.urlopen(URL)
soup = BeautifulSoup(response, 'html.parser')
scriptCnt = soup.find('div', {'class': 'posts-wrapper'})
script = scriptCnt.find('script').text
try:
    found = re.search('"large":(.+?)"', script).group(1)
except AttributeError:
    found = 'None Found!'
print(found)
Output:
"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg
【Comments】:
-
I think using XPath might help you more: stackoverflow.com/a/29890627/438627
-
Is this what you want: found.replace("\\","")?
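Building on the replace suggestion, here is a minimal sketch of two ways to get every "large" URL without the quotation marks. It is not a tested fix for the live page: the `script` string below is a shortened stand-in for the text extracted in the question, and the names `raw_urls`, `large_urls`, `items` and `large_urls_json` are made up for illustration. The second option assumes the array literal is valid JSON, which the sample shown in the question appears to be.

```python
import json
import re

# Stand-in for the `script` string from the question (shortened, two items).
script = r'''var gallery_items = [{
"type": "image",
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
"large-height": 450,
"caption": ""
}, {
"type": "image",
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
"large-height": 450,
"caption": ""
}];'''

# Option 1: regex. re.findall returns every match, not just the first one.
# Both quotes sit outside the capture group, so they are not captured;
# "large-height" is skipped because the pattern needs `":` right after large.
raw_urls = re.findall(r'"large":\s*"([^"]+)"', script)
large_urls = [u.replace('\\/', '/') for u in raw_urls]

# Option 2: parse the array as JSON, which decodes the \/ escapes itself.
# Assumes the literal is valid JSON and that "];" only appears at its end.
match = re.search(r'var\s+gallery_items\s*=\s*(\[.*?\])\s*;', script, re.DOTALL)
items = json.loads(match.group(1)) if match else []
large_urls_json = [item["large"] for item in items if "large" in item]

for url in large_urls:
    print(url)
```

The regex route stays closest to the original script. The JSON route also gives access to the other fields (for example "awp-gallery", which looks like the unscaled file) if those turn out to be needed.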
Tags: python beautifulsoup html-parsing