[Posted at]: 2021-03-23 13:33:11
[Question]:
I'm trying to collect the links from my Google history and write each page's content to a .txt file. The code all works when I create a JSON file containing only URLs, but with entries like those in the source sample below I get the error mentioned. I suspect this is because of the `"` characters in the source data, but how can I get it to read only the URL part?
Source data:
```json
{
  "Browser History": [
    {
      "favicon_url": "https://www.google.com/favicon.ico",
      "page_transition": "LINK",
      "title": "Google Datenexport",
      "url": "https://takeout.google.com/",
      "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
      "time_usec": 1607693084782187
    },
    {
      "favicon_url": "https://support.google.com/favicon.ico",
      "page_transition": "LINK",
      "title": "So laden Sie Ihre Google-Daten herunter - Google-Konto-Hilfe",
      "url": "https://support.google.com/accounts/answer/3024190?visit_id\u003d637432898341218017-3159218066\u0026hl\u003dde\u0026rd\u003d1",
      "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
      "time_usec": 1607693036534748
    },
    {
      "favicon_url": "https://www.google.com/favicon.ico",
      "page_transition": "LINK",
      "title": "Google \u2013 Meine Aktivitäten",
      "url": "https://myactivity.google.com/activitycontrols/webandapp?view\u003ditem",
      "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
      "time_usec": 1607693013403569
    },
    {
      "favicon_url": "https://www.com-magazin.de/favicon.ico",
      "page_transition": "LINK",
      "title": "Google-Suchverlauf herunterladen und deaktivieren - com! professional",
      "url": "https://www.com-magazin.de/news/google/google-suchverlauf-herunterladen-deaktivieren-928063.html#:~:text\u003dUm%20die%20eigenen%20Suchanfragen%20herunterzuladen,Nutzer%20den%20Eintrag%20%22Herunterladen%22.",
      "client_id": "cWD5MfDDekj1z9aA5VeCQQ\u003d\u003d",
      "time_usec": 1607692994577620
    }
  ]
}
```
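Note that the file is a single JSON object whose `"Browser History"` key holds the list of history entries, so after parsing you can pull out just the `url` fields by indexing into that list. A minimal sketch (the two entries are trimmed-down versions of the data above):

```python
import json

# Parse a trimmed-down Takeout history sample and keep only the "url" fields.
raw = """
{
  "Browser History": [
    {"title": "Google Datenexport", "url": "https://takeout.google.com/"},
    {"title": "Google-Konto-Hilfe", "url": "https://support.google.com/accounts/answer/3024190"}
  ]
}
"""

data = json.loads(raw)
# Index into the "Browser History" list, not the top-level object
urls = [entry["url"] for entry in data["Browser History"]]
print(urls)
```

The `\u003d`-style escapes in the real file are decoded automatically by `json.loads`, so no extra handling of the `"` characters is needed.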
The code I'm currently using:
```python
import json

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
from dhooks import Webhook, Embed


def get_html(url):
    # Send a random User-Agent header with each request
    ua = UserAgent()
    headers = {"user-agent": ua.random}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.RequestException as e:
        print(e)
        return None
    return response.content


with open("urls.json", "r") as history:
    data = json.load(history)

# The file is one JSON object; the list of entries sits under "Browser History"
for i, block in enumerate(data["Browser History"]):
    print("scraping " + block["url"] + "...")
    html = get_html(block["url"])  # fetch the page, not the raw JSON string
    if html is None:
        continue

    soup = BeautifulSoup(html, "html5lib")
    text = soup.find_all(text=True)

    # Skip text nodes whose parent tag is in the blacklist
    blacklist = ["style", "url", "404", "nginx"]
    output = ""
    for t in text:
        if t.parent.name not in blacklist:
            output += "{} ".format(t)

    with open("{}.txt".format(i), "w") as out_fd:
        out_fd.write(output)
```
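To see what the parent-tag blacklist in the loop above is doing, here is a small self-contained example (the HTML snippet is made up for illustration): text nodes whose parent tag is blacklisted, such as the contents of a `<style>` block, are dropped, while ordinary visible text is kept.

```python
from bs4 import BeautifulSoup

# Hypothetical page: visible text plus a <style> block we want to skip
html = (
    "<html><head><style>p {color:red}</style></head>"
    "<body><p>Hello</p><p>World</p></body></html>"
)

soup = BeautifulSoup(html, "html.parser")
blacklist = ["style"]  # skip text whose parent tag is blacklisted

output = ""
for t in soup.find_all(text=True):
    if t.parent.name not in blacklist:
        output += "{} ".format(t)

print(output)  # the CSS rule is filtered out; "Hello World" remains
```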
[Comments]:
Tags: python json python-3.x web-scraping beautifulsoup