【Question Title】: How To Get to Every Single HTML File in a URL
【Posted】: 2019-06-11 19:41:44
【Question Description】:

So my code works, but only for one URL (for example, I used http://www.ancient-hebrew.org/m/dictionary/1000.html).

However, I want to apply my code to every HTML file URL, which can be found here (https://www.ancient-hebrew.org/m/dictionary/).

from bs4 import BeautifulSoup
import re
import urllib.request


def getImage(_list):
    images = []
    # adds the url
    for image in _list:
        images.append(re.sub(
            r"..\/..\/", r"http://www.ancient-hebrew.org/", image['src']))
    return images


def getAudioFile(_list):
    audio = []
    # removes a tab character + adds the url
    for l in _list:
        audio.append("http://www.ancient-hebrew.org/m/dictionary/" +
                     l['href'].replace("\t", ''))
    return ''.join(audio)


def getHebrewWord(_list):
    hebrew = []
    for f in _list:
        hebrew.append(f.string.strip())
    return ''.join(hebrew)


url = 'http://www.ancient-hebrew.org/m/dictionary/1000.html'
file_name = str(re.search(r'(\d+).\w+$', url).group(1)) + ".txt"
raw_html = urllib.request.urlopen(url).readlines()
_list = []
_dict = {}
_ignore = {'audioURLs': '', 'pronuncation': [],
           'imageURLs': [], 'hebrewWord': ''}
for line in raw_html:
    html = BeautifulSoup(line, 'lxml')

    # Image Files URLs
    images = getImage(html.find_all('img', src=re.compile('.jpg$')))

    # Audio File URLs
    audioFile = getAudioFile(html.find_all('a', href=re.compile('.mp3$')))

    # Hebrew Words
    hebrewWords = getHebrewWord(html.find_all('font', face="arial", size="+1"))

    # Pronunciations
    pronunciation = [item.next_sibling.strip()
                     for item in html.select('img + font')]

    # Output: {'audioURLs': '', 'pronuncation': [], 'imageURLs': [], 'hebrewWord': ''}
    dictionary = {
        'audioURLs': audioFile,
        'pronuncation': pronunciation,
        'imageURLs': images,
        'hebrewWord': hebrewWords
    }
    if dictionary != _ignore:
        _list.append(dictionary)

with open(file_name, 'w') as f:
    for item in _list:
        f.write("%s\n" % item)

So in the end I want to write the results out to as many files as needed. Is there a simple way to do this?

【Question Comments】:

  • Is an awk script an option?

Tags: python regex python-3.x web-scraping beautifulsoup


【Solution 1】:

It seems to me you are making this somewhat unnecessarily complicated (and - mortal sin! - using regex on HTML :D). I've tried to simplify part of it - getting the links to the images and sounds and inserting them into lists. Note that, for various reasons, I changed some of the variable names you used - but it should be relatively easy to fit everything back into your structure and extend it to get the words themselves:

from bs4 import BeautifulSoup as bs
import requests

url = 'http://www.ancient-hebrew.org/m/dictionary/1000.html'
raw_html = requests.get(url)

soup = bs(raw_html.content, 'lxml')

image_list = []
audio_list = []

images = soup.find_all('img')
audios = soup.find_all('a', href=True)

for image in images:
    if 'jpg' in image['src']:
        image_link = "http://www.ancient-hebrew.org/"+image['src'].replace('../../','')
        image_list.append(image_link)

for audio in audios:
    if 'mp3' in audio['href']:
        audio_link = "http://www.ancient-hebrew.org/m/dictionary/"+audio['href'].replace("\t", '')
        audio_list.append(audio_link)

And so on.
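As for the remaining part of the question - visiting every HTML file listed at https://www.ancient-hebrew.org/m/dictionary/ - you can scrape the index page itself for links and loop over them. A minimal sketch (the helper name and the inline sample HTML are mine, and I haven't run this against the live site):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup as bs


def extract_page_urls(index_html, base_url):
    """Return absolute URLs of every .html file linked from an index page."""
    soup = bs(index_html, 'html.parser')
    return [urljoin(base_url, a['href'])
            for a in soup.find_all('a', href=True)
            if a['href'].endswith('.html')]


# Small inline sample standing in for the real directory listing:
sample = ('<a href="1000.html">1000</a> '
          '<a href="1001.html">1001</a> '
          '<a href="audio.mp3">x</a>')
urls = extract_page_urls(sample, 'https://www.ancient-hebrew.org/m/dictionary/')
print(urls)
```

To finish the task, fetch the index page once (e.g. with `requests.get(...).content`), pass its content to `extract_page_urls`, then run your single-page extraction inside a loop over the returned URLs, deriving each output file name from the page URL the same way the question already does.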

【Discussion】:
