Python: open a URL and extract the location.href value from 'onclick' (answers)

【Question title】: Python open a url and extract location.href value from 'onclick'
【Posted】: 2020-08-11 05:42:00
【Question】:

This is the inspected element from the URL "https://www.example.com":

<button type="button" class="btn btn-click" onclick="location.href='https://d3.net/archive/123.mp4'"></button>

I want to write a script that opens the URL above (https://www.example.com) and then extracts 'https://d3.net/archive/123.mp4' from the 'onclick' attribute.

How can I do this?

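【Comments on the question】:

Is the element you posted what your soup object returned?
@KunduK It comes from the URL, when I right-click and choose Inspect.
If this is rendered by JavaScript, bs4 can't help you in this case. You need a browser tool such as Selenium. If the URL is public, could you post it?
Just print it to check whether you get an element or None: print(soup.select_one("button.btn.btn-click[onclick]"))
@KunduK I tried printing this as you suggested, but got nothing. My URL is not public; it requires a username and password, but I was already logged in when I ran the code... Do I have to include code that handles the username and password for "requests.get(url)" to work?

Tags: python html selenium beautifulsoup onclick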

【Solution 1】:

You can do this easily with Selenium.

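from selenium import webdriver
from selenium.webdriver.common.by import By
import re

driver = webdriver.Firefox()
try:
    # Navigate to the URL
    driver.get("http://www.example.com")
    # Find all buttons matching our XPath
    # (Selenium 4 replaced find_elements_by_xpath with find_elements(By.XPATH, ...))
    elements_list = driver.find_elements(By.XPATH, "//button[@class='btn btn-click']")
    # Iterate over the element list
    for element in elements_list:
        # Extract the onclick attribute value (None if the attribute is missing)
        onclick_attr_value = element.get_attribute("onclick") or ""
        # Match a regex to capture only the URL between the single quotes
        match = re.search(r"'(.*)'", onclick_attr_value)
        if match:
            # If the regex matched, Bingo!
            found_url = match.group(1)
            print(found_url)
finally:
    # Always close the browser, even if something above fails
    driver.quit()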
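
The pattern '(.*)' is greedy: it captures everything from the first single quote to the last one, so it would return too much if the onclick handler contained a second quoted string. A slightly stricter variant is sketched below. The helper name extract_location_href and the optional "window." prefix are my own assumptions for illustration, not part of the answer above; it only needs the onclick string, so it works with the value returned by get_attribute("onclick") as well as with a bs4 attribute.

```python
import re

# Matches location.href='...' or location.href="...", tolerating an optional
# "window." prefix and whitespace around "=".
LOCATION_HREF_RE = re.compile(
    r"""(?:window\.)?location\.href\s*=\s*(['"])(?P<url>.*?)\1"""
)


def extract_location_href(onclick_value):
    """Return the URL assigned to location.href, or None if there is none."""
    if not onclick_value:
        return None
    match = LOCATION_HREF_RE.search(onclick_value)
    return match.group("url") if match else None


# The onclick value from the question
print(extract_location_href("location.href='https://d3.net/archive/123.mp4'"))
```

Because the capture stops at the matching closing quote, a URL that itself contains "=" (for example in a query string) comes back intact.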

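【Discussion】:

【Solution 2】:

If you want to follow links like this one, I recommend a different approach:

Use CSS selectors

Then you don't need the re module.

Update 1, for debugging

If you want to see the soup you get from your request, you can print it here (see comment #DEBUG1) or in the code example further down (#DEBUG3). To see the raw ingredients of your soup, i.e. the HTML that was downloaded, see #DEBUG2 in the code example further down.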


from bs4 import BeautifulSoup

# `soup` is the BeautifulSoup object created in the second code example below.
#DEBUG1 print the soup.
# import sys
# print >> sys.stderr, soup.prettify() # python2.X
print(soup.prettify()) # python3.X

# select buttons whose onclick starts with location.href= (you can add the class as well)
for item in soup.select("button[onclick^=\"location.href=\"]"):
    # ... do stuff here, e.g.
    onclick = item["onclick"]
    # split only on the first "=" so a URL that contains "=" stays intact
    href = onclick.split("=", 1)[1]

    # now href is 'https://d3spcaxyl0it1f.cloudfront.net/archive/123.mp4'
    href = href.strip("'")

    # the leading and trailing ' are gone.
    if href.endswith(".mp4"):
        # do stuff here, or make your CSS selector more precise
        print(href)

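Why use this approach? It makes sure your button really has the requested attribute.

Why is that a good thing? Because you don't have to check with item.get('onclick') whether the attribute exists first and then act on that decision.

How do you get the page into the soup? (quoting @akore128 directly)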


# import sys
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.example.com')
#DEBUG2 print the raw HTML (the soup ingredients).
# print >> sys.stderr, page.text # python2.X
print(page.text) # python3.X

# Create a BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')
#DEBUG3 print the soup.
# print >> sys.stderr, soup.prettify() # python2.X
print(soup.prettify()) # python3.X
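
To try the extraction step without a network request or any third-party package, the same filtering can be sketched with the standard library's html.parser. This is a swapped-in alternative to BeautifulSoup, not what the answer above uses; the class name OnclickHrefParser and the sample HTML (modelled on the button from the question) are assumptions for illustration.

```python
from html.parser import HTMLParser


class OnclickHrefParser(HTMLParser):
    """Collect URLs from <button onclick="location.href='...'"> elements."""

    PREFIX = "location.href="

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "button":
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        onclick = dict(attrs).get("onclick") or ""
        if onclick.startswith(self.PREFIX):
            # Drop the prefix, then an optional trailing ";" and the quotes.
            value = onclick[len(self.PREFIX):].strip().rstrip(";")
            self.urls.append(value.strip("'\""))


sample_html = """
<button type="button" class="btn btn-click"
        onclick="location.href='https://d3.net/archive/123.mp4'"></button>
<button type="button" class="btn">No onclick here</button>
"""

parser = OnclickHrefParser()
parser.feed(sample_html)
print(parser.urls)
```

Unlike the CSS selector, this version has to check for the attribute itself, which is exactly the bookkeeping the selector approach saves you.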

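【Discussion】:

However, when I include "verify=False" in "requests.get('')", the error goes away, but I get nothing back; that is, print(href) prints nothing. What should I do? Do I have to include code that handles the username and password for requests.get('<url>') to work?
In that case, please add this information to your question, since further answers should take it into account. If you need to pass a username and password to the site, then "how to get the soup from an access-restricted page" is a very different matter. Please ask a second question on that topic, because "how to retrieve specific data from the soup" is unrelated to it.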
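
On the login point: a script does not share the login session of your browser, so being logged in there does not help requests.get(). Below is a rough sketch of the usual idea (log in first, then reuse the same cookies for the page request), written with the standard library only. Everything site-specific here is an assumption: that the site uses a plain HTML login form, the form field names "username" and "password", and the function names build_login_request and fetch_after_login. A real site may use different field names, a CSRF token, or another mechanism entirely.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password):
    """Build (but do not send) a form-encoded POST request for a login form.

    The field names "username" and "password" are assumptions; check the
    actual <input name="..."> values in the site's login form.
    """
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"), method="POST")


def fetch_after_login(login_url, page_url, username, password):
    """Log in, then fetch page_url with the same cookies; returns the HTML text."""
    cookies = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookies))
    # Send the login form; any session cookie the site sets lands in `cookies`.
    opener.open(build_login_request(login_url, username, password)).close()
    # Reuse the same opener so the cookie is sent along with this request.
    with opener.open(page_url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned HTML text can then be passed to BeautifulSoup(html, 'html.parser') as in Solution 2. If you stay with requests, a requests.Session object plays the same cookie-keeping role.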