[Question Title]: BeautifulSoup Scraping Elements Containing Certain Date
[Posted]: 2022-01-23 21:06:50
[Question]:

I am using BeautifulSoup to scrape the following URL: https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html

I am able to scrape the pages behind the hyperlinks on the left-hand side, but now I want to add a filter for which pages get scraped. The criterion I want to use is the "Last Out" date shown on the right-hand side: only pages whose Last Out date falls after a certain cutoff, e.g. after January 1, 2020, should be scraped.

I think what is needed is an if statement: if the date is later than 1-1-2020, continue on to scrape the corresponding hyperlink. I am not quite sure how to do that, though, or whether it can be done with BeautifulSoup at all.

Any help, ideas, or suggestions are appreciated.

import csv
import requests
from bs4 import BeautifulSoup as bs

headers = []
datarows = []

with requests.Session() as s:
    s.headers = {"User-Agent": "Safari/537.36"}
    r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
    soup = bs(r.content, 'lxml')
    address_links = [i['href'] for i in soup.select('.table td:nth-child(2) > a')]
    
    for url in address_links:

        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = soup.find(id="table_maina")
        
        if table:
            item = soup.find('h1').text
            newitem = item.replace('Dogecoin','')
            finalitem = newitem.replace('Address','')

            for row in table.find_all('tr'):
                heads = row.find_all('th')
                if heads:
                    headers = [th.text for th in heads]
                else:
                    datarows.append([td.text for td in row.find_all('td')])

            fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
            fcsv.writerow(headers)
            fcsv.writerows(datarows)

[Comments]:

    Tags: python web-scraping beautifulsoup


    [Solution 1]:

    Using the datetime library is the best approach here, since it makes date/time comparisons easy. I was able to work it into your code. I left some comments to explain what the code does:

    import csv
    import requests
    from bs4 import BeautifulSoup as bs
    from datetime import datetime
    
    headers = []
    datarows = []
    # define 1-1-2020 as a datetime object
    after_date = datetime(2020, 1, 1)
    
    with requests.Session() as s:
        s.headers = {"User-Agent": "Safari/537.36"}
        r = s.get('https://bitinfocharts.com/top-100-richest-dogecoin-addresses-2.html')
        soup = bs(r.content, 'lxml')
    
        # select all tr elements (minus the first one, which is the header)
        table_elements = soup.select('tr')[1:]
        address_links = []
        for element in table_elements:
            children = element.contents  # get children of table element
            url = children[1].a['href']
            last_out_str = children[8].text
            # check to make sure the date field isn't empty
            if last_out_str != "":
                # load date into datetime object for comparison (second part is defining the layout of the date as years-months-days hour:minute:second timezone)
                last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")
                # if check to see if the date is after 2020/1/1
                if last_out > after_date:
                    address_links.append(url)
    
        for url in address_links:
    
            r = s.get(url)
            soup = bs(r.content, 'lxml')
            table = soup.find(id="table_maina")
    
            if table:
                item = soup.find('h1').text
                newitem = item.replace('Dogecoin', '')
                finalitem = newitem.replace('Address', '')
    
                for row in table.find_all('tr'):
                    heads = row.find_all('th')
                    if heads:
                        headers = [th.text for th in heads]
                    else:
                        datarows.append([td.text for td in row.find_all('td')])
    
                fcsv = csv.writer(open(f'{finalitem}.csv', 'w', newline=''))
                fcsv.writerow(headers)
                fcsv.writerows(datarows)
    

    If you have any questions about how it works that my comments don't answer, leave a comment and I'd be happy to explain!

    [Discussion]:

    • @alwayshope430 I'm not running into this issue; I didn't change any of the code related to the CSV export, so that shouldn't happen. In DEHNjTW7rtKoRkEZvWuVKq1HXUoGft55dK.csv I can see transactions from 2015 at the bottom of the CSV file. Are you sure you are running the exact code I pasted?
    [Solution 2]:

    You are correct that you need to do a date comparison, but in order to do that you need to convert the date from a string into a datetime object. Look into the datetime module, specifically the strptime() method, which parses a string into a datetime object according to a format specification.
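    As a minimal sketch of that idea (the sample string below is a hypothetical "Last Out" value; verify the exact format against the text you actually scrape from the page):

```python
from datetime import datetime

# Hypothetical "Last Out" string in the layout Solution 1 assumes the site uses
last_out_str = "2021-12-30 14:05:22 UTC"

# strptime() parses a string into a datetime using format codes:
# %Y = 4-digit year, %m = month, %d = day, %H:%M:%S = time, %Z = timezone name
last_out = datetime.strptime(last_out_str, "%Y-%m-%d %H:%M:%S %Z")

# Once both sides are datetime objects, comparison operators work directly
cutoff = datetime(2020, 1, 1)
print(last_out > cutoff)  # prints True: 2021-12-30 is after the cutoff
```

    Note that matching "UTC" with %Z still yields a naive datetime (tzinfo is None), so it compares cleanly against the naive cutoff above.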

    [Discussion]:
