这段代码取自我的另一个answer,它提取了子域链接。
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'site:minecraft.fandom.com'}
html = requests.get(f'https://www.google.com/search?q=',
headers=headers,
params=params).text
soup = BeautifulSoup(html, 'lxml')
for container in soup.findAll('div', class_='tF2Cxc'):
link = container.find('a')['href']
print(link)
输出:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
使用来自 SerpApi 的 Google Search Engine Results API 的替代解决方案。这是一个付费 API,可免费试用 5,000 次搜索。
要集成的代码:
from serpapi import GoogleSearch
import os
params = {
"engine": "google",
"q": "site:minecraft.fandom.com",
"api_key": os.getenv('API_KEY')
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
link = result['link']
print(link)
输出:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
免责声明,我为 SerpApi 工作。