【Question title】: Adding new rows to a pandas df in a loop
【Posted】: 2021-05-20 00:23:06
【Question】:

I'm curious how to append or concatenate new data onto a pandas df across loop iterations. I use selenium to view the web page and BeautifulSoup to read the HTML. From there, I get two data tables per page. I run this over multiple pages, and I'd like to add the data from table 1 on page 2 to table 1 from page 1, and the same for table 2 on both pages.

I think I need some kind of append on the df, but I'm not sure.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as soup
import pandas as pd

urls = ["https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2021/02/06","https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2021/02/10"]
datalist_races = []    # collects the first table from each page
datalist_results = []  # collects the second table from each page


for url in urls:
    driver = webdriver.Chrome()
    driver.get(url)
    WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "f_fs13")))
    htmlStr = driver.page_source

    soup_level1 = soup(htmlStr, 'html.parser')

    race_soup = soup_level1.find('tbody',{'class':'f_fs13'}).find_parent('table')
    results_soup = soup_level1.find('tbody',{'class':'f_fs12'}).find_parent('table')

    df_races = pd.read_html(str(race_soup))[0]
    datalist_races.append(df_races)  # append the whole table, not column 0

    df_results = pd.read_html(str(results_soup))[0]
    datalist_results.append(df_results)  # append the whole table, not column 0

    print(df_results)


    driver.close()

Any insight would be great. Reading the comments and posts here, and watching YT videos, hasn't gotten me any further.
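The usual pattern for this is collect-then-concatenate: append each scraped table (a whole DataFrame) to a plain Python list inside the loop, then call `pd.concat` once after the loop. A minimal sketch, using two hypothetical stand-in tables in place of the `pd.read_html` results:

```python
import pandas as pd

# Hypothetical stand-ins for the tables pd.read_html would return per page
page1_table = pd.DataFrame({"Horse": ["A", "B"], "Pla.": [1, 2]})
page2_table = pd.DataFrame({"Horse": ["C", "D"], "Pla.": [1, 2]})

datalist = []  # grows by one DataFrame per page
for table in (page1_table, page2_table):
    datalist.append(table)

# One concat at the end; ignore_index=True gives a fresh 0..n-1 row index
combined = pd.concat(datalist, ignore_index=True)
print(combined.shape)  # (4, 2)
```

Concatenating once at the end is also faster than growing a DataFrame inside the loop, since each in-loop concat copies all the rows accumulated so far.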

【Comments】:

    Tags: python pandas selenium


    【Solution 1】:

    In your loop, do this for any df you want to append to:

    df.loc[len(df.index)] = data_element
    
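As a quick illustration of that row-append (this assumes the frame has a default RangeIndex, so `len(df.index)` is the next free label, and that the assigned value matches the columns):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
df.loc[len(df.index)] = [3, 4]  # appends a single row in place
print(df)
#    a  b
# 0  1  2
# 1  3  4
```

Note this appends one row at a time; it is not meant for appending an entire DataFrame.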

    So in your case (since `pd.read_html` returns a whole table per page rather than a single row, concatenating each table onto a running DataFrame is the equivalent fix):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup as soup
    import pandas as pd
    
    urls = ["https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2021/02/06","https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2021/02/10"]
    datalist_races = pd.DataFrame()    # accumulates the first table from each page
    datalist_results = pd.DataFrame()  # accumulates the second table from each page
    
    
    for url in urls:
        driver = webdriver.Chrome()
        driver.get(url)
        WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "f_fs13")))
        htmlStr = driver.page_source
    
        soup_level1 = soup(htmlStr, 'html.parser')
    
        race_soup = soup_level1.find('tbody',{'class':'f_fs13'}).find_parent('table')
        results_soup = soup_level1.find('tbody',{'class':'f_fs12'}).find_parent('table')
    
        df_races = pd.read_html(str(race_soup))[0]
        datalist_races = pd.concat([datalist_races, df_races], ignore_index=True)
    
        df_results = pd.read_html(str(results_soup))[0]
        datalist_results = pd.concat([datalist_results, df_results], ignore_index=True)
    
        print(df_results)
    
    
        driver.close()
    
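If you later need to know which page a given row came from, `pd.concat` can also tag each block of rows with a per-source key (the labels here are hypothetical):

```python
import pandas as pd

feb06 = pd.DataFrame({"Pla.": [1, 2]})
feb10 = pd.DataFrame({"Pla.": [1]})

# keys= builds a MultiIndex whose first level is the source label
tagged = pd.concat([feb06, feb10], keys=["2021/02/06", "2021/02/10"])
print(tagged.loc["2021/02/10"])  # just the rows scraped from the second page
```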

    【Discussion】:

    • Traceback (most recent call last): File "C:\Users\Spenc\Desktop\Python Stuff\selenium test 2.py", line 33, in <module> datalist_races.loc[len(datalist_races.index)] = df_races[0] File "C:\Users\Spenc\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__ indexer = self.columns.get_loc(key) File "C:\Users\Spenc\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc raise KeyError(key) from err KeyError: 0 — I'm getting the above error. Curious about datalist_races
    • I think df_races[0] is the problem. I replaced it with df_races.loc[0] (same for df_results)
    • Hmm. Different error this time. Traceback (most recent call last): File "C:\Users\Spenc\Desktop\Python Stuff\selenium test 2.py", line 33, in <module> datalist_races.loc[len(datalist_races.index)] = df_races.loc[0] AttributeError: 'list' object has no attribute 'loc'
    • What is the output of ``` df_races ```?
    • Try: df_races[0]