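【Posted】: 2020-03-19 09:11:03
【Question】:
I want to find all the search results and store them in a list or a similar structure. After inspecting the Google results page, I found that every result is an element with the class g.
So, technically, extracting the URLs from the search results page should be simple: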
import urllib.parse
from bs4 import BeautifulSoup
import requests
text = 'cyber security'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
However, I get no output. Why?
Edit: even saving the page to disk and parsing the stored copy does not help:
import webbrowser

# Save the raw response so it can be inspected in a browser
with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')
url = "output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")
#soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)
【Comments】:
-
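I was experimenting with this code at repl.it/repls/ThirdSneakyConfig, and it looks like the HTML Google sends to the script really is different from the HTML you see in the browser.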
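One way to check the observation above is to count which class names actually occur in the HTML the script received. The following is a minimal sketch, not the commenter's code: it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages, and the names ClassCounter, count_classes and the sample markup are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how often each CSS class name appears on any element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value:
                self.counts.update(value.split())


def count_classes(html):
    """Return a Counter mapping class name -> number of occurrences."""
    parser = ClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical stand-in for a response body; not real Google markup
sample = (
    '<div class="g"><a href="https://example.com">x</a></div>'
    '<div class="other box">y</div>'
)
counts = count_classes(sample)
print(counts["g"], counts["other"])
```

With the question's code, calling count_classes(response.text) and looking at counts["g"] would show whether the class g is present at all in what was downloaded. If it is 0, soup.find_all(class_='g') has nothing to match. One thing that is often tried in that case is sending a browser-like User-Agent through the headers argument of requests.get; whether that changes the markup Google returns is an assumption here, not something the thread confirms.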
-
An answer to a similar question: stackoverflow.com/a/60889629/1291371. It includes a code example.
Tags: python web-scraping beautifulsoup