【Posted】: 2019-06-11 19:41:44
【Question】:
So my code works, but only for one URL (for example, I use http://www.ancient-hebrew.org/m/dictionary/1000.html).
However, I would like to run my code on every HTML file URL. The full list can be found here (https://www.ancient-hebrew.org/m/dictionary/).
from bs4 import BeautifulSoup
import re
import urllib.request


def getImage(_list):
    images = []
    # Replace the relative "../../" prefix with the site URL
    for image in _list:
        images.append(re.sub(
            r"\.\./\.\./", r"http://www.ancient-hebrew.org/", image['src']))
    return images


def getAudioFile(_list):
    audio = []
    # Remove tab characters and prepend the dictionary URL
    for l in _list:
        audio.append("http://www.ancient-hebrew.org/m/dictionary/" +
                     l['href'].replace("\t", ''))
    return ''.join(audio)


def getHebrewWord(_list):
    hebrew = []
    for f in _list:
        hebrew.append(f.string.strip())
    return ''.join(hebrew)


url = 'http://www.ancient-hebrew.org/m/dictionary/1000.html'
# '1000.html' -> '1000.txt'
file_name = re.search(r'(\d+)\.\w+$', url).group(1) + ".txt"
raw_html = urllib.request.urlopen(url).readlines()
_list = []
# An entry equal to this one carries no data and is skipped
_ignore = {'audioURLs': '', 'pronuncation': [],
           'imageURLs': [], 'hebrewWord': ''}

# The page is parsed one line at a time
for line in raw_html:
    html = BeautifulSoup(line, 'lxml')
    # Image file URLs
    images = getImage(html.find_all('img', src=re.compile(r'\.jpg$')))
    # Audio file URLs
    audioFile = getAudioFile(html.find_all('a', href=re.compile(r'\.mp3$')))
    # Hebrew words
    hebrewWords = getHebrewWord(html.find_all('font', face="arial", size="+1"))
    # Pronunciations
    pronunciation = [item.next_sibling.strip()
                     for item in html.select('img + font')]
    # Empty result: {'audioURLs': '', 'pronuncation': [], 'imageURLs': [], 'hebrewWord': ''}
    dictionary = {
        'audioURLs': audioFile,
        'pronuncation': pronunciation,
        'imageURLs': images,
        'hebrewWord': hebrewWords
    }
    if dictionary != _ignore:
        _list.append(dictionary)

# Write one entry per line
with open(file_name, 'w') as f:
    for item in _list:
        f.write("%s\n" % item)
So in the end I would like to write the results to as many files as there are pages. Is there a simple way to do this?
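One possible way to approach this, sketched under assumptions: the index page at https://www.ancient-hebrew.org/m/dictionary/ is assumed to link to each dictionary page with an `<a href="....html">` tag. The names `LinkCollector`, `page_urls`, and `output_name` are hypothetical helpers, not part of the code above. To keep the sketch free of third-party dependencies it uses the standard library's `html.parser` instead of BeautifulSoup.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: this index page lists every dictionary page as a link.
INDEX_URL = "https://www.ancient-hebrew.org/m/dictionary/"


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag that points to an .html file."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.strip().lower().endswith(".html"):
            self.hrefs.append(href.strip())


def page_urls(index_html, base_url=INDEX_URL):
    """Return absolute URLs for all .html links found in the index markup."""
    collector = LinkCollector()
    collector.feed(index_html)
    return [urljoin(base_url, href) for href in collector.hrefs]


def output_name(url):
    """Derive an output file name such as '1000.txt' from '.../1000.html'."""
    match = re.search(r"([^/]+)\.\w+$", url)
    return match.group(1) + ".txt"
```

With these helpers, the existing per-page logic could be run once per URL: download the index page, call `page_urls` on its decoded text, then for each returned URL repeat the parsing steps and write to `output_name(url)` instead of a single hard-coded file name.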
【Comments】:
- Is an awk script an option?
Tags: python regex python-3.x web-scraping beautifulsoup