【Question Title】: Is there a specific way to read only the URLs from a JSON file with Python 3?
【Posted】: 2021-03-23 13:33:11
【Question Description】:

I'm trying to collect the links from my Google history and dump each page's content into a .txt file. The code all works (when I created a JSON containing only URLs), but with links like those in the source sample I get the error mentioned below. I suspect it's because of the `"` characters in the source data, but how can I make it read only the URL part?

Source data:

{
    "Browser History": [
        {
            "favicon_url": "https://www.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Google Datenexport",
            "url": "https://takeout.google.com/",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693084782187
        },
        {
            "favicon_url": "https://support.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "So laden Sie Ihre Google-Daten herunter - Google-Konto-Hilfe",
            "url": "https://support.google.com/accounts/answer/3024190?visit_id\u003d637432898341218017-3159218066\u0026hl\u003dde\u0026rd\u003d1",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693036534748
        },
        {
            "favicon_url": "https://www.google.com/favicon.ico",
            "page_transition": "LINK",
            "title": "Google \u2013 Meine Aktivitäten",
            "url": "https://myactivity.google.com/activitycontrols/webandapp?view\u003ditem",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607693013403569
        },
        {
            "favicon_url": "https://www.com-magazin.de/favicon.ico",
            "page_transition": "LINK",
            "title": "Google-Suchverlauf herunterladen und deaktivieren - com! professional",
            "url": "https://www.com-magazin.de/news/google/google-suchverlauf-herunterladen-deaktivieren-928063.html#:~:text\u003dUm%20die%20eigenen%20Suchanfragen%20herunterzuladen,Nutzer%20den%20Eintrag%20%22Herunterladen%22.",
            "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
            "time_usec": 1607692994577620
        }
    ]
}
 
  

The code I'm currently using:

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed
import json


def getHtml(url):
    ua = UserAgent()
    headers = {'user-agent': ua.random}  # send a random user agent
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except Exception as e:
        print(e)
        return b''
    return response.content


with open('urls.json', 'r') as history:
    json_data = history.read()
    data = json.loads(json_data)

    for i, block in enumerate(data):
        print("scraping " + block["url"] + "...")
        html = getHtml(block["url"])
        soup = BeautifulSoup(html, "html5lib")
        text = soup.find_all(text=True)

        output = ''

        # parent tag names whose text should be skipped
        blacklist = [
            "style",
            "url",
            "404",
            "ngnix",
        ]

        for t in text:
            if t.parent.name not in blacklist:
                output += '{} '.format(t)

        with open("{}.txt".format(i), "w") as out_fd:
            out_fd.write(output)


【Question Discussion】:

    Tags: python json python-3.x web-scraping beautifulsoup


    【Solution 1】:

    If your source data looks like this,

    [{
                "page_transition": "LINK",
                "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
                "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
                "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
                "time_usec": 1607593733981438
    },{
                "page_transition": "LINK",
                "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
                "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
                "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
                "time_usec": 1607593733981438
    }, {
                "page_transition": "LINK",
                "title": "Niedersachsen nimmt geplante Corona-Lockerungen für Silvester zurück",
                "url": "https://www.rnd.de/politik/niedersachsen-nimmt-geplante-corona-lockerungen-fur-silvester-zuruck-IEW2P4XT4M24ZFFZSX7ILE6JGM.html?outputType\u003damp\u0026utm_source\u003dupday\u0026utm_medium\u003dreferral",
                "client_id": "59VD9fg/2RVO1jSDxOwfxw\u003d\u003d",
                "time_usec": 1607593733981438
    }]
    

    If I'm reading your question correctly, it would be best to parse your source data directly as JSON and then use the "url" key to get the URLs.

    with open('history.json', 'r') as history:
        json_data = history.read()
        data = json.loads(json_data)
    
        for k, v in data.items():  # because your source data is now a dictionary
            for block in v:        # because v is the list of text blocks
                print("scraping " + block["url"] + "...")
    
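    Putting the same idea into a helper, here is a minimal sketch of pulling out just the URLs. The function name `extract_urls` is hypothetical, and it assumes the two shapes seen in this thread: a bare list of records, or a Takeout-style dict such as `{"Browser History": [...]}` wrapping that list.

    ```python
    import json

    def extract_urls(path):
        """Return every 'url' value found in a browser-history JSON export."""
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)

        # Normalize: a dict of lists becomes one flat list of records.
        if isinstance(data, dict):
            records = [block for blocks in data.values() for block in blocks]
        else:
            records = data

        # Skip records that lack a 'url' key instead of raising KeyError.
        return [block["url"] for block in records if "url" in block]
    ```

    Using `.get`-style membership checks here sidesteps the `KeyError: 'url'` from the comments below when a record has no URL.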

    【Discussion】:

    • First of all, thank you very much for your help, but now I get a different error message: File "/home/user/Dokumente/foo Exception gethandled.py", line 22, in print("scraping" + json_data["url"] + "...") KeyError: 'url'. I tried to avoid it with try and except, but that didn't work either. Do you have any idea how to fix this?
    • @Reijarmo What does your data look like now? You're hitting this because there is no "url" key in your source data.
    • Thank you very much for your continued support. The source data is a .json file (the one you get when you back up your Google account for security). It contains hundreds of text blocks like the one above (source data). I tried to solve it with try: json_data=json.load(history) print("scraping " + json_data["url"] + "...") except KeyError: pass html = getHtml (json_data) and then the code starts processing some site, but throws lots of errors because of the many exceptions. I'm very sorry for wasting your time if I misread your question.
    • @Reijarmo OK, the code I showed only reads a single block. Since you mention it contains many blocks, I assume it's an array of blocks? If that's the case, you'll need to iterate over it, which means you need a for loop.
    • Ah sorry for the misunderstanding, my apologies @blessthefrey. Yes, it's an array of blocks. I tried to implement the loop, but it just produces lots of json.decoder.JSONDecodeError: Expecting value. (The "loop" I'm currently trying: with open('urls.json') as history: for i, line in enumerate(history.readlines()): try: json_data=json.load(history) print("scraping " + json_data["url"] + "...") except KeyError: pass)
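    The JSONDecodeError in that last attempt comes from calling json.load on a file handle that readlines() has already exhausted, once per line. Parsing the document exactly once and then looping over the record list avoids it. A minimal sketch, with a two-record inline stand-in for the Takeout export described above so it runs on its own:

    ```python
    import json

    # Inline stand-in for the "Browser History" export from the question.
    raw = '''{"Browser History": [
        {"title": "Google Datenexport", "url": "https://takeout.google.com/"},
        {"title": "record without a url"}
    ]}'''

    data = json.loads(raw)  # parse the whole document exactly once

    urls = []
    for block in data.get("Browser History", []):
        url = block.get("url")
        if url is None:  # tolerate records that lack a 'url' key
            continue
        print("scraping " + url + "...")
        urls.append(url)
    ```

    With a real file, `data = json.load(history)` directly after `open(...)` plays the role of `json.loads(raw)` here; the per-line readlines() loop is not needed at all.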