【Question title】: Extracting CVE Info with a Python 3 regular expression
【Posted】: 2020-02-11 23:22:16
【Question】:

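I often need the list of CVEs from vendor security bulletin pages. Sometimes they are easy to copy, but usually they are mixed in with a lot of other text.

I haven't touched Python in a long time, so I figured that working out how to extract this information would be good practice, especially since I keep finding myself doing it by hand.

Here is my current code: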

#!/usr/bin/env python3

# REQUIREMENTS
#   python3
#   BeautifulSoup (pip3 install beautifulsoup4)
#   python 3 certificates (Applications/Python 3.x/ Install Certificates.command) <-- this one took me forever to figure out!

import sys
if sys.version_info[0] < 3:
    raise Exception("Use Python 3:  python3 " + sys.argv[0])
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

#specify/get the url to scrape
#url ='https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
#url = 'https://source.android.com/security/bulletin/2020-02-01.html'
url = input("What is the URL?  ") or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
print("Checking URL: " + url)

# CVE regular expression
cve_pattern = r'CVE-\d{4}-\d{4,7}'  # raw string so \d is passed to the regex engine unchanged

# query the website and return the html
page = urlopen(url).read()

# parse the html returned using beautiful soup
soup = BeautifulSoup(page, 'html.parser')

count = 0

############################################################
# ANDROID === search for CVE references within <td> tags ===

# find all <td> tags
all_tds = soup.find_all("td")

#print(all_tds)

for td in all_tds:
    if "cve" in td.text.lower():
        print(td.text)


############################################################
# CHROME === search for CVE reference within <span> tags ===

# find all <span> tags
all_spans = soup.find_all("span")

for span in all_spans:
    # this code returns results in triplicate
    for i in re.finditer(cve_pattern, span.text):
        count += 1
        print(count, i.group())


    # this code works, but only returns the first match
#   match = re.search(cve_pattern,span.text)
#   if match:
#       print(match.group(0))

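What I did for the Android URL works fine; the problem is the Chrome URL. They put the CVE information inside <span> tags, and I am trying to use a regular expression to pull it out.

With the re.finditer approach, I end up with every result in triplicate. With the re.search approach, it misses CVE-2019-19925, because they list two CVEs on the same line.

Can you offer any suggestions on how to get this working?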
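
The re.search behaviour described above can be reproduced without any web page. The following is a minimal sketch, assuming a made-up line of text (the `line` variable and the placeholder ID CVE-2020-0000 are not from the Chrome page): re.search stops at the first match, while re.findall returns every non-overlapping match in the string.

```python
import re

# Same pattern as in the question, written as a raw string.
CVE_PATTERN = r'CVE-\d{4}-\d{4,7}'

# Hypothetical text with two CVE IDs on one line (CVE-2020-0000 is a placeholder).
line = "Hypothetical entry: CVE-2020-0000, CVE-2019-19925"

# re.search returns a single match object for the first hit only.
first_only = re.search(CVE_PATTERN, line)
print(first_only.group(0))

# re.findall scans the whole string and returns every match as a list of strings.
all_matches = re.findall(CVE_PATTERN, line)
print(all_matches)
```

This only covers the missed-second-CVE case. It does not by itself explain the triplicate output from re.finditer, which depends on how the page's markup is structured.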

【Comments】:

    Tags: regex python-3.x cve


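    【Solution 1】:

    I finally solved it myself. BeautifulSoup isn't needed; it's all regular expressions now. To fix the duplicate/triplicate results I was seeing, I convert the list returned by re.findall to a dictionary (which keeps the unique values in their original order) and then back to a list.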

    import sys
    if sys.version_info[0] < 3:
        raise Exception("Use Python 3:  python3 " + sys.argv[0])
    import requests
    import re
    
    # Specify/get the url to scrape (included a default for easier testing)
    ### there is no input validation taking place here ###
    url = input("What is the URL?  ") #or 'https://chromereleases.googleblog.com/2020/02/stable-channel-update-for-desktop.html'
    print()
    
    # CVE regular expression
    cve_pattern = r'CVE-\d{4}-\d{4,7}'
    
    # query the website and return the html
    page = requests.get(url)
    
    # initialize count to 0
    count = 0
    
    #search for CVE references using RegEx
    cves = re.findall(cve_pattern, page.text)
    
    # After several days of fiddling, I was still getting double and sometimes triple results on certain pages.  This next line
    # converts the list of strings returned by re.findall to a dictionary (which retains insertion order in Python 3.7+) to get unique values, then back to a list.
    # (thanks to https://stackoverflow.com/a/48028065/9205677)
    # I found order to be important sometimes, as the most severely rated CVEs are often listed first on the page
    cves = list(dict.fromkeys(cves))
    
    # print the results to the screen
    for cve in cves:
        print(cve)
        count += 1
    
    print()
    print(str(count) + " CVEs found at " + url)
    print()
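
The de-duplication step above can also be tried offline. The following is a minimal sketch, assuming a made-up HTML fragment; the `unique_cves` helper, the `sample_html` string, the example.com link, and the placeholder ID CVE-2020-0000 are hypothetical and not part of the original answer. It shows one way the same ID can appear more than once in raw page text (here, in a link target and again in the visible text) and how `dict.fromkeys` keeps only the first occurrence of each.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
CVE_PATTERN = re.compile(r'CVE-\d{4}-\d{4,7}')


def unique_cves(text):
    """Return the CVE IDs found in text, in first-seen order, without repeats."""
    # dict.fromkeys keeps the first occurrence of each key and preserves
    # insertion order (guaranteed for dict since Python 3.7).
    return list(dict.fromkeys(CVE_PATTERN.findall(text)))


# Made-up fragment: one ID appears in an href, in link text, and in a later span.
sample_html = (
    '<a href="https://example.com/CVE-2019-19925">CVE-2019-19925</a> '
    '<span>CVE-2020-0000</span> <span>CVE-2019-19925</span>'
)

cves = unique_cves(sample_html)
for cve in cves:
    print(cve)
print(len(cves), "CVEs found")
```

Using `len(cves)` here replaces the manual counter; the result is the same because the list is already de-duplicated.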
    
