【Question title】: How to download all rows of data from a website using BeautifulSoup
【Posted】: 2021-03-09 04:17:04
【Question description】:

I want to get some information from a weather site: https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295

The hour and minutes, in separate elements:

<div class="entry-hour">
        <span><span class="hour">0</span><span class="minutes">00</span></span>
    </div>

The forecast temperature:

<span class="forecast-temp">9°C</span>

And the feels-like temperature (FeelTemp):

<span class="forecast-feeltemp">Odczuwalna 4°C </span>

I'm stuck because I don't know how to get all the rows and the rest of the data ;( Thanks in advance for your help...

Below is my pseudocode ;)

#!/usr/bin/python3
import pymysql.cursors
from time import sleep, gmtime, strftime
import datetime
import pytz
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup


options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Selenium 4 takes the chromedriver path through a Service object
browser = webdriver.Chrome(
        service=Service("/usr/bin/chromedriver"),
        options=options)

browser.get("https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295")
sleep(3)
source = browser.page_source # Get the entire page source from the browser
browser.quit() # No need for the browser any more, so shut it down
soup = BeautifulSoup(source,'html.parser')
try:
    Tags = soup.select('.weather-forecast-hbh-list') # get the elements using css selectors    
    for tag in Tags: # loop through them 
        hour      = tag.find('div').find('span').text
        #minutes = ?
        #temp =?
        #feel_temp = ?
        print (hour + "\n")

except Exception as e:
    print(e)

【Question discussion】:

    Tags: python python-3.x selenium beautifulsoup


    【Solution 1】:

    One way to do this is to loop over all the divs with the class weather-entry, extract the text from each one, and build a table structure along the way.

    For example:

    import requests
    from bs4 import BeautifulSoup
    from tabulate import tabulate
    
    page = requests.get('https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295').content
    weather_entries = BeautifulSoup(page, "html.parser").find_all("div", {"class": "weather-entry"})
    
    
    def extract_text(element, class_name):
        return element.find("div", class_=class_name).getText(strip=True)
    
    
    div_classes = [
        "entry-hour",
        "entry-forecast",
        "entry-wind",
        "entry-precipitation",
        "entry-humidity",
    ]
    
    table = [[extract_text(e, c) for c in div_classes] for e in weather_entries]
    columns = ["Time:", "Forecast", "Wind", "Precipitation", "Humidity"]
    print(tabulate(table, headers=columns, tablefmt="pretty"))
    
    

    This outputs:

    +-------+---------------------------------------+----------------------+---------------+----------+
    | Time: |               Forecast                |         Wind         | Precipitation | Humidity |
    +-------+---------------------------------------+----------------------+---------------+----------+
    |  000  |     -2°COdczuwalna 0°CBezchmurnie     |   S4km/hMax 4 km/h   |               |   97%    |
    |  100  |    -2°COdczuwalna -1°CBezchmurnie     |   S4km/hMax 7 km/h   |   Zachm:10%   |   98%    |
    |  200  |    -2°COdczuwalna -1°CBezchmurnie     |  SSW4km/hMax 8 km/h  |               |   98%    |
    |  300  |    -2°COdczuwalna -1°CBezchmurnie     |   S4km/hMax 7 km/h   |               |   98%    |
    |  400  |     -2°COdczuwalna 1°CBezchmurnie     |   N0km/hMax 7 km/h   |               |   93%    |
    |  500  |     -2°COdczuwalna 1°CBezchmurnie     |   N0km/hMax 6 km/h   |               |   99%    |
    |  600  | -2°COdczuwalna -1°CZachmurzenie duże  |  SSW4km/hMax 6 km/h  |   Zachm:76%   |   92%    |
    |  700  |  -1°COdczuwalna 3°CZachmurzenie duże  |   N0km/hMax 7 km/h   |   Zachm:76%   |   84%    |
    |  800  |     -3°COdczuwalna -1°CPochmurno      |  SSW4km/hMax 8 km/h  |   Zachm:91%   |   99%    |
    |  900  |      3°COdczuwalna 5°CPochmurno       |  SSW4km/hMax 8 km/h  |   Zachm:91%   |   79%    |
    | 1000  |      5°COdczuwalna 4°CPochmurno       |  S11km/hMax 11 km/h  |   Zachm:91%   |   71%    |
    | 1100  |      6°COdczuwalna 5°CPochmurno       | SSW11km/hMax 20 km/h |  Zachm:100%   |   65%    |
    | 1200  |      9°COdczuwalna 7°CPochmurno       |  S15km/hMax 25 km/h  |  Zachm:100%   |   66%    |
    | 1300  |   10°COdczuwalna 8°CPrzelotne opady   |  S15km/hMax 25 km/h  |  Zachm:100%   |   60%    |
    | 1400  |      11°COdczuwalna 8°CPochmurno      |  S18km/hMax 24 km/h  |  Zachm:100%   |   55%    |
    | 1500  |      10°COdczuwalna 6°CPochmurno      |  S22km/hMax 27 km/h  |   Zachm:91%   |   57%    |
    | 1600  |      10°COdczuwalna 6°CPochmurno      |  S22km/hMax 31 km/h  |   Zachm:91%   |   60%    |
    | 1700  |   12°COdczuwalna 8°CPrzelotne opady   |  S18km/hMax 32 km/h  |  Zachm:100%   |   53%    |
    | 1800  | 9°COdczuwalna 4°CCzęściowo słonecznie |  S18km/hMax 33 km/h  |   Zachm:50%   |   66%    |
    | 1900  |      8°COdczuwalna 4°CPochmurno       |  S15km/hMax 31 km/h  |  Zachm:100%   |   82%    |
    | 2000  |      8°COdczuwalna 4°CPochmurno       |  S18km/hMax 22 km/h  |   Zachm:91%   |   82%    |
    | 2100  |   9°COdczuwalna 5°CPrzelotne opady    | SSW18km/hMax 22 km/h |  Zachm:100%   |   78%    |
    | 2200  |      8°COdczuwalna 4°CPochmurno       | SSW15km/hMax 28 km/h |  Zachm:100%   |   80%    |
    | 2300  |   8°COdczuwalna 5°CPrzelotne opady    | SSW11km/hMax 25 km/h |   Zachm:91%   |   81%    |
    +-------+---------------------------------------+----------------------+---------------+----------+
    

    Obviously, you will need to do some parsing of the text values, but this should get you started.
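The parsing step mentioned above could look roughly like the sketch below. It is not part of the original answer: the helper names `parse_time` and `parse_forecast` and the regular expression are assumptions, based only on the cell layout visible in the printed table (for example `-2°COdczuwalna 0°CBezchmurnie` and `1000`). If the site changes its wording or layout, the pattern would need adjusting.

```python
import re

# Pattern for a forecast cell such as "-2°COdczuwalna 0°CBezchmurnie":
# temperature, the literal word "Odczuwalna" (feels like),
# the feels-like temperature, then a free-text description.
FORECAST_RE = re.compile(
    r"^(?P<temp>-?\d+)°C\s*Odczuwalna\s*(?P<feel>-?\d+)°C\s*(?P<description>.*)$"
)


def parse_time(cell):
    """Split a time cell such as '000' or '1000' into (hour, minute)."""
    # The last two characters are the minutes, everything before is the hour.
    return int(cell[:-2]), int(cell[-2:])


def parse_forecast(cell):
    """Split a glued forecast cell into temperature, feels-like value and description."""
    match = FORECAST_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected forecast cell: {cell!r}")
    return {
        "temp_c": int(match["temp"]),
        "feels_like_c": int(match["feel"]),
        "description": match["description"].strip(),
    }


print(parse_time("1000"))
print(parse_forecast("-2°COdczuwalna 0°CBezchmurnie"))
```

In the script above, each helper would be applied to the matching column of `table` before printing. Raising on an unexpected cell is a deliberate choice here, so that a layout change shows up as an error rather than as silently wrong numbers.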

    【Discussion】:

      【Solution 2】:

      Thank you, my friend, I've got it now ;) I had to fetch all the entries first and then go through them in a loop ;)

      #!/usr/bin/python3
      import requests
      from bs4 import BeautifulSoup
      
      page = requests.get('https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295').content
      weather_entries = BeautifulSoup(page, "html.parser").find_all("div", {"class": "weather-entry"})
      for entry in weather_entries:
          # Each weather-entry div holds one hourly row
          hour = entry.find('span', {'class': 'hour'}).text
          minutes = entry.find('span', {'class': 'minutes'}).text
          temp = entry.find('span', {'class': 'forecast-temp'}).text
          temp_feel = entry.find('span', {'class': 'forecast-feeltemp'}).text
          print(f"{hour}:{minutes} \t {temp} \t {temp_feel}")
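If the rows are meant to be stored rather than printed, the strings collected in the loop above could be converted to numbers and written out as CSV. The following is only a sketch under assumptions: `to_int`, `to_row`, the column names, and the sample tuples are hypothetical, with sample values shaped like the HTML snippets in the question (`9°C`, `Odczuwalna 4°C `). It uses only the standard library and writes to an in-memory buffer instead of a file.

```python
import csv
import io
import re


def to_int(text):
    """Pull the first signed integer out of text such as '9°C' or 'Odczuwalna 4°C '."""
    match = re.search(r"-?\d+", text)
    if match is None:
        raise ValueError(f"no number in {text!r}")
    return int(match.group())


def to_row(hour, minutes, temp, temp_feel):
    """Turn the four scraped strings into one typed record."""
    return {
        "time": f"{int(hour):02d}:{int(minutes):02d}",
        "temp_c": to_int(temp),
        "feels_like_c": to_int(temp_feel),
    }


# Hypothetical sample values standing in for what the scraping loop collects.
scraped = [
    ("0", "00", "9°C", "Odczuwalna 4°C "),
    ("1", "00", "-2°C", "Odczuwalna -1°C "),
]
rows = [to_row(*fields) for fields in scraped]

# Write the records as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["time", "temp_c", "feels_like_c"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

In the real loop, `to_row(hour, minutes, temp, temp_feel)` would be called once per entry and the result appended to `rows`; swapping `io.StringIO()` for an opened file gives a CSV on disk.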
          
      

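      【Discussion】:

        【Solution 3】:

        I don't have much experience with BeautifulSoup, but the same result can be achieved with Selenium alone, using XPath for the scraping. The code below can be used to extract the required details.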

        from selenium import webdriver
        from selenium.webdriver.chrome.service import Service
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        from selenium.webdriver.support.ui import WebDriverWait

        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')

        # Selenium 4 takes the chromedriver path through a Service object
        browser = webdriver.Chrome(
                service=Service("/usr/bin/chromedriver"),
                options=options)

        browser.get("https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295")
        # Wait (up to 30 s) until at least one hourly entry is present in the DOM
        WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.XPATH, "//div[@class='entry-hour']")))
        weather_entries = browser.find_elements(By.XPATH, "//div[@class='weather-entry']")
        for w in weather_entries:
            # The leading "." keeps each XPath relative to the current entry
            hour = w.find_element(By.XPATH, ".//div[@class='entry-hour']/span/span[@class='hour']").text
            temp = w.find_element(By.XPATH, ".//div[@class='entry-forecast']/div//span[@class='temp-info']/span[@class='forecast-temp']").text
            feeltemp = w.find_element(By.XPATH, ".//div[@class='entry-forecast']/div//span[@class='temp-info']/span[@class='forecast-feeltemp']").text
            print('hour '+ hour + ' temp ' + temp + ' feeltemp ' + feeltemp)

        browser.quit()  # Shut the browser down once all rows are read
        

        【Discussion】:
