【Question title】: Python3 print specific href links
【Posted】: 2021-10-23 00:10:43
【Question】:

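I'm trying to write a script that scrapes a website and picks out only the hrefs containing .php?id=. I can print all hrefs with bs4, but I can't select just the ones containing .php?id= and print them.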

<li><a href="#">Education & Research </a>
<ul>                         
<li><a href="caseofthe_month.php">Case of the Month</a></li>
<a href="page.php?id=2">
<a href="idontwantthispagetoshowup.php">
<a href="page.php?id=5">Prospectus Fellowship-July-14</a>
<a href="thisoneeither.php">

'''

def gethref(ip):
    url = ("http://" + ip)
    print("[x] ~ SCAN: " + url + " ~ [x]")
    req = requests.get(url)
    tree = html.fromstring(req.text)
    tree_href = tree.xpath('//@href')
    #print(tree_href)
    if '*.php?id=*' in tree_href:
        print (tree_href)
    #soup = BeautifulSoup(req.text, 'html.parser')
    #h = soup.find_all('href=*.php')
    #print(h)
    #sqli = soup.select('a')
    #for link in soup.find_all('a'):
    #   sqli = (link.get('href'))
    #   sqli = str(sqli)
    #   print(sqli)
    #   if 'page' in sqli:
    #       print(sqli.a)

【Comments】:

  • Please post your complete code, including the imports. What is html?

Tags: python python-3.x beautifulsoup lxml href


【Solution 1】:

This code finds every element whose href contains .php?id=:

from bs4 import BeautifulSoup
import requests
import re

def gethref(ip):
    url = ("http://" + ip)
    print("[x] ~ SCAN: " + url + " ~ [x]")
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    # Escape the dot so it matches a literal "." rather than any character.
    h = soup.find_all(href=re.compile(r'\.php\?id='))
    print(h)
    # sqli = soup.select('a')  # I don't know what this does, so I commented it out
    # for link in soup.find_all('a'):
    #   sqli = str(link.get('href'))
    #   print(sqli)
    #   if 'page' in sqli:
    #       print(sqli.a)

I think this is what you need.

If it doesn't work, let me know...

【Comments】:
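
A side note on the pattern handed to `re.compile`: in a regular expression an unescaped `.` matches any character, so it may be worth escaping the dot in `.php`. The sketch below uses only the standard `re` module. The href strings are made up for illustration and do not come from the original page.

```python
import re

# Hypothetical href values, invented for this illustration.
hrefs = ["page.php?id=2", "page_php?id=3", "caseofthe_month.php"]

loose = re.compile(r".php\?id=")    # unescaped dot: any character may precede "php"
strict = re.compile(r"\.php\?id=")  # escaped dot: only a literal "." is accepted

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)   # includes "page_php?id=3" as well
print(strict_hits)  # only "page.php?id=2"
```

Either compiled pattern can be passed as `href=` to `soup.find_all`. The strict one simply avoids accidental matches.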

    【Solution 2】:

    You can use the CSS selector a[href*=".php?id="]:

    from bs4 import BeautifulSoup
    
    html_doc = """
    <li><a href="#">Education & Research</a>
    
    <ul>                         
    <li>
        <a href="caseofthe_month.php">Case of the Month</a>
    </li>
    </ul>
    
    <a href="page.php?id=2"></a>
    <a href="idontwantthispagetoshowup.php">
    <a href="page.php?id=5">Prospectus Fellowship-July-14</a>
    <a href="thisoneeither.php"></a>
    """
    
    soup = BeautifulSoup(html_doc, "html.parser")
    
    for link in soup.select('a[href*=".php?id="]'):
        print(link["href"])
    

    Prints:

    page.php?id=2
    page.php?id=5
    

    Or:

    for link in soup.find_all("a"):
        if ".php?id=" in link.get("href", ""):
            print(link["href"])
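
For comparison, here is a small sketch of why the check in the question, `'*.php?id=*' in tree_href`, never fires. `in` on a list tests for an element equal to that exact string, and `*` is not treated as a wildcard. The list below mimics what `tree.xpath('//@href')` would plausibly return for the sample HTML; treating it as a plain list of strings is an assumption made for illustration.

```python
import fnmatch

# Assumed stand-in for the list of href values extracted from the sample HTML.
hrefs = [
    "#",
    "caseofthe_month.php",
    "page.php?id=2",
    "idontwantthispagetoshowup.php",
    "page.php?id=5",
    "thisoneeither.php",
]

# List membership compares whole elements, so this is False.
exact_match = "*.php?id=*" in hrefs

# Substring test applied to each element individually.
substring_hits = [h for h in hrefs if ".php?id=" in h]

# If shell-style wildcards are preferred, fnmatch supports them;
# "[?]" stands for a literal question mark.
wildcard_hits = [h for h in hrefs if fnmatch.fnmatchcase(h, "*.php[?]id=*")]

print(exact_match)
print(substring_hits)
print(wildcard_hits)
```

The per-element substring test is the same idea as the loop above, only applied to bare strings instead of tags.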
    

    Or:

    for link in soup.find_all(
        lambda t: t.name == "a" and ".php?id=" in t.get("href", "")
    ):
        print(link["href"])
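
If a plain substring match turns out to be too permissive, one possible refinement is to parse each href and inspect the path and query string separately. This is a sketch of a different technique from the ones above, using the standard `urllib.parse` module. The helper name `has_php_id` and the last two test strings are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def has_php_id(href):
    """Return True if the path ends in .php and the query has an id parameter."""
    parts = urlsplit(href)
    query = parse_qs(parts.query)  # e.g. "id=2" -> {"id": ["2"]}
    return parts.path.endswith(".php") and "id" in query


# Hrefs from the sample HTML plus two invented edge cases.
candidates = [
    "page.php?id=2",
    "caseofthe_month.php",
    "#",
    "index.html?next=.php?id=1",  # hypothetical: substring matches, but path is not .php
    "page.php?pid=7",             # hypothetical: .php path, but no "id" parameter
]

selected = [href for href in candidates if has_php_id(href)]
print(selected)
```

With BeautifulSoup the same helper could be applied as `has_php_id(link.get("href", ""))` inside the loop over `soup.find_all("a")`.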
    

    【Comments】:
