【Question Title】: Scrape for absolute URLs with html.parser and remove duplicates
【Posted】: 2019-09-26 09:53:31
【Question】:

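I am trying to make sure that relative links are saved as absolute links in this CSV (hence the URL parsing). I am also trying to remove duplicates, which is why I created the variable "ddupe".

When I open the CSV on my desktop, all of the URLs are still saved as relative URLs. Can someone help me figure this out? I have thought about using a "set", as on this page: How do you remove duplicates from a list whilst preserving order?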

#Importing the request library to make HTTP requests
#Importing the bs4 library to extract / parse html and xml files
#utilize urlparse to change relative URL to absolute URL
#import csv (built in package) to read / write to Microsoft Excel
from bs4 import BeautifulSoup  
import requests
from urllib.parse import urlparse
import csv

#create the page variable
#associate page to request to obtain the information from raw_html
#store the html information in a text
page = requests.get('https://www.census.gov/programs-surveys/popest.html')
parsed = urlparse(page.url)    # urlparse expects a URL string, not the Response object
raw_html = page.text    # declare the raw_html variable

soup = BeautifulSoup(raw_html, 'html.parser')  # parse the html

#remove duplicate htmls
ddupe = open('page.text', 'r').readlines()
ddupe_set = set(ddupe)
out = open('page.text', 'w')
for ddupe in ddupe_set:
    out.write(ddupe)

T = [["US Census Bureau Links"]] #Title

#Finds all the links
links = map(lambda link: link['href'], soup.find_all('a', href=True)) 

with open("US_Census_Bureau_links.csv","w",newline="") as f:    
    cw=csv.writer(f, quoting=csv.QUOTE_ALL) #Create a file handle for csv writer                          
    cw.writerows(T)             #Creates the Title
    for link in links:                                  #Writes each link as a row in the csv
        cw.writerow([link])

f.close()       #redundant: the with block has already closed the file

【Comments】:

    Tags: html python-3.x parsing beautifulsoup duplicates


    【Solution 1】:

    The function you are looking for is urljoin, not urlparse (both come from the same package, urllib.parse). It should be used somewhere after this line:

    links = map(lambda link: link['href'], soup.find_all('a', href=True))
    

    Use a list comprehension or map + lambda, as you did here, to join each relative URL with the base URL.
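
As a minimal sketch of that suggestion, the snippet below resolves each href against a base URL with urljoin and then drops duplicates while keeping first-seen order. The helper name absolute_unique, the BASE_URL constant, and the sample hrefs are hypothetical and only for illustration; it assumes the base URL is the page address requested in the question.

```python
from urllib.parse import urljoin

# Assumption: the base URL is the page requested in the question.
BASE_URL = "https://www.census.gov/programs-surveys/popest.html"


def absolute_unique(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url and drop duplicates, keeping first-seen order."""
    # urljoin leaves already-absolute URLs unchanged and resolves relative ones.
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(absolute))


# Hypothetical sample hrefs, standing in for the values of link['href'].
sample = ["/data.html", "about.html", "/data.html", "https://example.com/x"]
for url in absolute_unique(sample):
    print(url)
```

In the question's script, the hrefs would come from soup.find_all('a', href=True), so the links line could become something like links = absolute_unique(a['href'] for a in soup.find_all('a', href=True)), and the CSV loop would stay as it is. Using dict.fromkeys instead of set keeps the links in page order, which a plain set does not guarantee.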

    【Comments】:
