【Question Title】: Is there a specific way to read only the URLs from a JSON file with Python 3?
【Posted】: 2021-03-23 13:33:11
【Question Description】:

I'm trying to collect the links from my Google history and dump each page's content into a .txt file. The code all works (when I created a JSON containing only URLs), but with links like those in the source sample I get the error mentioned below. I suspect it's because of the `"` characters in the source data, but how can I make it read only the URL part?

Source data:

{
    "Browser History": [
        {
            "favicon_url": "https://www.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Google Datenexport",
            "url": "https://takeout.google.com/",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693084782187
        },
        {
            "favicon_url": "https://support.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "So laden Sie Ihre Google-Daten herunter - Google-Konto-Hilfe",
            "url": "https://support.google.com/accounts/answer/3024190?visit_id\u003d637432898341218017-3159218066\u0026hl\u003dde\u0026rd\u003d1",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693036534748
        },
        {
            "favicon_url": "https://www.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Google \u2013 Meine Aktivitäten",
            "url": "https://myactivity.google.com/activitycontrols/webandapp?view\u003ditem",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693013403569
        },
        {
            "favicon_url": "https://www.com-magazin.de/favicon.ico",
            "page_transition": "LINK",
            "title": "Google-Suchverlauf herunterladen und deaktivieren - com! professional",
            "url": "https://www.com-magazin.de/news/google/google-suchverlauf-herunterladen-deaktivieren-928063.html#:~:text\u003dUm%20die%20eigenen%20Suchanfragen%20herunterzuladen,Nutzer%20den%20Eintrag%20%22Herunterladen%22.",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607692994577620
        }
    ]
}
 
  

The code I'm currently using:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
import json


def getHtml(url):
    ua = UserAgent()
    headers = {'user-agent': ua.random}  # send a random user agent
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(e)
        return b''
    return response.content


with open('urls.json', 'r') as history:
    json_data = history.read()
    data = json.loads(json_data)

    for i, block in enumerate(data):
        print("scraping " + block["url"] + "...")
        html = getHtml(block["url"])
        soup = BeautifulSoup(html, "html5lib")
        text = soup.find_all(text=True)

        output = ''

        # parent tag names whose text should be skipped
        blacklist = [
            "style",
            "url",
            "404",
            "ngnix",
        ]

        for t in text:
            if t.parent.name not in blacklist:
                output += '{} '.format(t)

        with open("{}.txt".format(i), "w") as out_fd:
            out_fd.write(output)


【Question Discussion】:

    Tags: python json python-3.x web-scraping beautifulsoup


    【Solution 1】:

    If your source data looks like this,

    [{
                "page_transition": "LINK",
                "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
                "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
                "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
                "time_usec": 1607593733981438
    },{
                "page_transition": "LINK",
                "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
                "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
                "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
                "time_usec": 1607593733981438
    }, {
                "page_transition": "LINK",
                "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
                "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
                "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
                "time_usec": 1607593733981438
    }]
    

    If I'm reading your question correctly, it would be best to parse your source data directly as JSON and then use the "url" key to get the URLs.

    with open('history.json', 'r') as history:
        json_data = history.read()
        data = json.loads(json_data)
    
        for k, v in data.items():  # because your source data is now a dictionary
            for block in v:        # because v is the list of text blocks
                print("scraping " + block["url"] + "...")
    
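    Putting the same idea into a helper, here is a minimal sketch of pulling out just the URLs. The function name `extract_urls` is hypothetical, and it assumes the two shapes seen in this thread: a bare list of records, or a Takeout-style dict such as `{"Browser History": [...]}` wrapping that list.

    ```python
    import json

    def extract_urls(path):
        """Return every 'url' value found in a browser-history JSON export."""
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)

        # Normalize: a dict of lists becomes one flat list of records.
        if isinstance(data, dict):
            records = [block for blocks in data.values() for block in blocks]
        else:
            records = data

        # Skip records that lack a 'url' key instead of raising KeyError.
        return [block["url"] for block in records if "url" in block]
    ```

    Using `.get`-style membership checks here sidesteps the `KeyError: 'url'` from the comments below when a record has no URL.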

    【Discussion】:

    • First of all, thank you very much for your help, but now I get a different error message: File "/home/user/Dokumente/foo Exception gethandled.py", line 22, in print("scraping" + json_data["url"] + "...") KeyError: 'url'. I tried to avoid it with try and except, but that didn't work either. Do you have any idea how to fix this?
    • @Reijarmo What does your data look like now? You're hitting this because there is no "url" key in your source data.
    • Thank you very much for your continued support. The source data is a .json file (the one you get when you back up your Google account for security). It contains hundreds of text blocks like the one above (source data). I tried to solve it with try: json_data=json.load(history) print("scraping " + json_data["url"] + "...") except KeyError: pass html = getHtml (json_data) and then the code starts processing some site, but throws lots of errors because of the many exceptions. I'm very sorry for wasting your time if I misread your question.
    • @Reijarmo OK, the code I showed only reads a single block. Since you mention it contains many blocks, I assume it's an array of blocks? If that's the case, you'll need to iterate over it, which means you need a for loop.
    • Ah sorry for the misunderstanding, my apologies @blessthefrey. Yes, it's an array of blocks. I tried to implement the loop, but it just produces lots of json.decoder.JSONDecodeError: Expecting value. (The "loop" I'm currently trying: with open('urls.json') as history: for i, line in enumerate(history.readlines()): try: json_data=json.load(history) print("scraping " + json_data["url"] + "...") except KeyError: pass)
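    The JSONDecodeError in that last attempt comes from calling json.load on a file handle that readlines() has already exhausted, once per line. Parsing the document exactly once and then looping over the record list avoids it. A minimal sketch, with a two-record inline stand-in for the Takeout export described above so it runs on its own:

    ```python
    import json

    # Inline stand-in for the "Browser History" export from the question.
    raw = '''{"Browser History": [
        {"title": "Google Datenexport", "url": "https://takeout.google.com/"},
        {"title": "record without a url"}
    ]}'''

    data = json.loads(raw)  # parse the whole document exactly once

    urls = []
    for block in data.get("Browser History", []):
        url = block.get("url")
        if url is None:  # tolerate records that lack a 'url' key
            continue
        print("scraping " + url + "...")
        urls.append(url)
    ```

    With a real file, `data = json.load(history)` directly after `open(...)` plays the role of `json.loads(raw)` here; the per-line readlines() loop is not needed at all.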