Parsing URLs from a JavaScript-driven page with BeautifulSoup and Selenium

[Question title]: Parsing URLs from a JavaScript-driven page with BeautifulSoup and Selenium
[Posted]: 2021-09-02 20:00:08
[Question description]:

I want to collect the URLs of all files in Git repositories in which any email address appears. I am using https://grep.app for the search.

Code:

from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://grep.app/search?current=100&q=%40gmail.com'
chrome = "/home/dev/chromedriver"
browser = webdriver.Chrome(executable_path=chrome)
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)

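When the code runs, Chrome starts and loads the page with the results, and in Chrome's developer tools I can see many <a> elements with href attributes for the URLs in the page source.

For example: lib/plugins/revert/lang/eu/lang.php

But my code returns only the links from the footer: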

"[<a href="/"><span class="slashes">//</span>grep.app</a>, <a href="mailto:hello@grep.app">Contact</a>]"

As far as I understand, the problem is with the JavaScript-rendered content. Could you advise what I am doing wrong?

Tags: javascript python selenium beautifulsoup webdriver

[Solution 1]:

Code:

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://grep.app/search?current=100&q=%40gmail.com'
chrome = "/home/dev/chromedriver"
browser = webdriver.Chrome(executable_path=chrome)
browser.get(url)

html = browser.page_source
soup = BeautifulSoup(html, 'lxml')

links = []
tags = soup.find_all('a', href=True)
for tag in tags:
    links.append(tag['href'])
    
print(links)

Output:

['/', 'mailto:hello@grep.app']

[Discussion]:

John, thanks, but that is the same output I get from my own code. Nothing resembling the actual URLs of the files.
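
A likely explanation for the identical output, offered as an assumption rather than something confirmed in this thread: `browser.page_source` is read immediately after `browser.get(url)`, before the page's JavaScript has inserted the result links, so only the static footer anchors are present. One common remedy is an explicit wait with Selenium's `WebDriverWait` before handing the HTML to BeautifulSoup, sketched below. The names `fetch_rendered_links` and `keep_result_links`, the `min_links` threshold, and the 15-second timeout are hypothetical choices for illustration. The markup grep.app uses for its results is not shown here, so the wait condition may need adjusting. Calling `webdriver.Chrome()` without a driver path assumes Selenium 4.6 or later.

```python
from urllib.parse import urljoin, urlparse


def keep_result_links(hrefs, base_url):
    """Return absolute http(s) URLs, skipping mailto: links and the bare site root."""
    links = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # drops mailto:, javascript:, and similar schemes
        if parts.path in ("", "/") and not parts.query:
            continue  # drops the link back to the site root
        if absolute not in links:
            links.append(absolute)
    return links


def fetch_rendered_links(url, timeout=15, min_links=3):
    """Load url in Chrome, wait for JS-rendered anchors, and return filtered links."""
    # Imported here so keep_result_links stays usable without a browser installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()  # assumption: Selenium 4.6+ locates the driver itself
    try:
        browser.get(url)
        # Assumption: once results render, the page holds more anchors than
        # the two footer links seen in the output above.
        WebDriverWait(browser, timeout).until(
            lambda d: len(d.find_elements(By.CSS_SELECTOR, "a[href]")) >= min_links
        )
        soup = BeautifulSoup(browser.page_source, "lxml")
        hrefs = [a["href"] for a in soup.find_all("a", href=True)]
        return keep_result_links(hrefs, url)
    finally:
        browser.quit()
```

If the assumption holds, `fetch_rendered_links('https://grep.app/search?current=100&q=%40gmail.com')` would return the rendered links without the footer entries. If the condition is never met within the timeout, Selenium raises `TimeoutException`, which would indicate that the results are rendered some other way than as plain anchors.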