【Question Title】: Scraping data and putting it in different columns using BeautifulSoup
【Posted】: 2018-12-12 07:01:01
【Description】:

I wrote a script to scrape data from a website. It currently produces 2 columns, but I want to add a third column ("Abstract") to it. How can I do that within the same loop? I need the "Abstract" data to end up in the third column. A picture is attached below.

The code is as follows:

    import requests
    import csv
    from bs4 import BeautifulSoup

    file = "Details181.csv"
    Headers = ["Category", "Vulnerabilities", "Abstract"]
    url = "https://vulncat.fortify.com/en/weakness?po={}"

    with open(file, 'w', newline='') as f:
        csvriter = csv.writer(f, delimiter=',', quotechar='"')
        csvriter.writerow(Headers)

        for page in range(1, 131):
            r = requests.get(url.format(page))
            soup = BeautifulSoup(r.text, 'lxml')

            for title in soup.select('div.title > h1'):
                # split "Category: Vulnerability" into the first two columns
                csvriter.writerow([part.strip() for part in title.text.split(':')])

【Question Discussion】:

Tags: python-3.x web-scraping beautifulsoup


【Solution 1】:

Based on your description, I guessed that abstract and category, vulnerability probably share a common parent div element.

So I tried finding that common parent div on each loop iteration and extracting the data from it, which confirmed my guess. I also added a default value for vulnerability for the cases where the title has no vulnerability part.

The following code runs successfully:

    import requests
    import csv
    from bs4 import BeautifulSoup

    file = "Details181.csv"
    Headers = ["Category", "Vulnerabilities", "Abstract"]
    url = "https://vulncat.fortify.com/en/weakness?po={}"

    with open(file, 'w', newline='') as f:
        csvriter = csv.writer(f, delimiter=',', quotechar='"')
        csvriter.writerow(Headers)

        for page in range(1, 131):
            r = requests.get(url.format(page))
            soup = BeautifulSoup(r.text, 'lxml')

            # find the common parent div that wraps each weakness entry
            all_father_info = soup.find_all("div", class_="detailcell weaknessCell panel")
            for father in all_father_info:

                # the <h1> child holds "Category: Vulnerability"; split it apart
                son_info_12 = father.find('h1').text.split(":")
                if len(son_info_12) == 2:
                    category, vulnerability = son_info_12[0].strip(), son_info_12[1].strip()
                elif len(son_info_12) == 1:
                    # no ":" in the title, so there is no vulnerability part
                    category = son_info_12[0].strip()
                    vulnerability = ""
                else:
                    category, vulnerability = "", ""

                # the child div with class "t" holds the abstract text
                abstract = father.find("div", class_="t").text.strip()

                # write one row per weakness entry
                csvriter.writerow([category, vulnerability, abstract])
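
One caveat: `father.find("div", class_="t")` assumes every entry contains such a child div. If an entry lacks it, `find` returns `None` and the `.text` access raises `AttributeError`. A minimal defensive sketch (the helper name `extract_abstract` is mine, not part of the original answer):

    def extract_abstract(cell):
        # cell: one of the div.detailcell elements found above
        abstract_div = cell.find("div", class_="t")
        # fall back to an empty string when the abstract div is missing
        return abstract_div.get_text(strip=True) if abstract_div is not None else ""

Inside the loop, `abstract = extract_abstract(father)` would then replace the direct `.text.strip()` call.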
    

【Discussion】:

• Tried this with abstract_data, but it doesn't go anywhere: abstract_data = soup.findall("div",{"class":"t"}).get_text() (see the note after these comments)
• That did the job. You're awesome.
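
For the record, why that attempt failed: the method is `find_all` (with an underscore; `findAll` is the legacy alias), and it returns a list-like `ResultSet`, which has no `get_text()` method, so `get_text()` must be called on each element individually. A minimal sketch, assuming the same `soup` object as above:

    # find_all returns a ResultSet (a list of tags), so call get_text() per element
    abstracts = [div.get_text(strip=True) for div in soup.find_all("div", {"class": "t"})]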