[Question Title]: BeautifulSoup Scraping Elements Containing Certain Date
[Posted]: 2022-01-23 21:06:50
[Question]:

I am using BeautifulSoup to scrape the following URL: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html

I am able to scrape the pages behind the hyperlinks on the left-hand side, but now I want to add a filter for which pages get scraped. The criterion I want to use is the "Last Out" date shown on the right-hand side: only pages whose Last Out date falls after a certain cutoff, e.g. after January 1, 2020, should be scraped.

I think what is needed is an if statement: if the date is later than 1-1-2020, continue on to scrape the corresponding hyperlink. I am not quite sure how to do that, though, or whether it can be done with BeautifulSoup at all.

Any help, ideas, or suggestions are appreciated.

import csv
import requests
from bs4 import BeautifulSoup as bs

headers = []
datarows = []

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    address_links = [i['href'] for i in soup.select('.table td:nth-child(2) > a')]
    
    for url in address_links:

        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")
        
        if table:
            item = soup.find('h1').text
            newitem = item.replace('Dogecoin','')
            finalitem = newitem.replace('Address','')

            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)

[Comments]:

    Tags: python web-scraping beautifulsoup


    [Solution 1]:

    Using the datetime library is the best approach here, since it makes date/time comparisons easy. I was able to work it into your code. I left some comments to explain what the code does:

    import csv
    import requests
    from bs4 import BeautifulSoup as bs
    from datetime import datetime
    
    headers = []
    datarows = []
    # define 1-1-2020 as a datetime object
    after_date = datetime(2020, 1, 1)
    
    with requests.Session() as s:
        s.headers = {"User-Agent": "Safari/537.36"}
        r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
        soup = bs(r.content, 'lxml')
    
        # select all tr elements (minus the first one, which is the header)
        table_elements = soup.select('tr')[1:]
        address_links = []
        for element in table_elements:
            children = element.contents  # get children of table element
            url = children[1].a['href']
            last_out_str = children[8].text
            # check to make sure the date field isn't empty
            if last_out_str != "":
                # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
                last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
                # if check to see if the date is after 2020/1/1
                if last_out > after_date:
                    address_links.append(url)
    
        for url in address_links:
    
            r = s.get(url)
            soup = bs(r.content, 'lxml')
            table = soup.find(id="table_maina")
    
            if table:
                item = soup.find('h1').text
                newitem = item.replace('Dogecoin', '')
                finalitem = newitem.replace('Address', '')
    
                for row in table.find_all('tr'):
                    heads = row.find_all('th')
                    if heads:
                        headers = [th.text for th in heads]
                    else:
                        datarows.append([td.text for td in row.find_all('td')])
    
                fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
                fcsv.writerow(headers)
                fcsv.writerows(datarows)
    

    If you have any questions about how it works that my comments don't answer, leave a comment and I'd be happy to explain!

    [Discussion]:

    • @alwayshope430 I'm not running into this issue; I didn't change any of the code related to the CSV export, so that shouldn't happen. In DEHNjTW7rtKoRkEZvWuVKq1HXUoGft55dK.csv I can see transactions from 2015 at the bottom of the CSV file. Are you sure you are running the exact code I pasted?
    [Solution 2]:

    You are correct that you need to do a date comparison, but in order to do that you need to convert the date from a string into a datetime object. Look into the datetime module, specifically the strptime() method, which parses a string into a datetime object according to a format specification.
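    As a minimal sketch of that idea (the sample string below is a hypothetical "Last Out" value; verify the exact format against the text you actually scrape from the page):

```python
from datetime import datetime

# Hypothetical "Last Out" string in the layout Solution 1 assumes the site uses
last_out_str = "2021-12-30 14:05:22 UTC"

# strptime() parses a string into a datetime using format codes:
# %Y = 4-digit year, %m = month, %d = day, %H:%M:%S = time, %Z = timezone name
last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")

# Once both sides are datetime objects, comparison operators work directly
cutoff = datetime(2020, 1, 1)
print(last_out > cutoff)  # prints True: 2021-12-30 is after the cutoff
```

    Note that matching "UTC" with %Z still yields a naive datetime (tzinfo is None), so it compares cleanly against the naive cutoff above.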

    [Discussion]:
