【Question title】: How can I scrape a row from this table
【Posted】: 2023-03-18 06:49:02
【Question】:

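I have been trying to scrape, from the big table on this page, the electoral votes of every president who won a US presidential election.

Here is the code I have been trying to use: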

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas

# using selenium and chromedriver to load the JavaScript-rendered wiki page

scrape_options = Options()
scrape_options.add_argument('--headless')
driver = webdriver.Chrome(r'web scraping master/chromedriver', options=scrape_options)
page_info = driver.get('https://en.wikipedia.org/wiki/United_States_presidential_election')

# waiting for the javascript to load

try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".wikitable.sortable.jquery-tablesorter"))
    )
finally:
    page = driver.page_source
    soup = BeautifulSoup(page, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable jquery-tablesorter'})
    # print(table)

rows = table.find_all('tr')

The code works up to this point. This is the part that is supposed to get the information I need:

for row in rows:
    need = row.find_all('td')

    for n in need:
        try:
            if len(n.find('b')==0):
                continue
            else:
                if nek.find('b').find('sup'):
                    continue
                electoral_votes = n.find('span', {'style': "position: relative margin: 0 0.3em;"}).get_text()
                print(electoral_votes)
        except:
            continue

After running this part, the code does not return anything I need.

Can someone help me?

I would really appreciate it.

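【Comments】:

  • Can you fix your indentation? See stackoverflow.com/help/formatting
  • Which table do you want? There is no need to use selenium here.
  • @chitown88 Yes, I realised that after writing the code. To answer your question, I am after the largest table on the page. I can attach a screenshot if that is unclear.
  • @JustinEzequiel Sorry about the poor indentation. I am fairly new to asking questions on this site, so it still feels a bit unfamiliar. Thanks.

Tags: python html web-scraping beautifulsoup tags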


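【Answer 1】:

"trying to scrape the electoral votes of all the presidents who won the US presidential election"

Since you want every presidential candidate who went on to become president (Joe Biden is included even though, at the time of writing, 28/11/2020, he is only president-elect; he is easy to remove), I chose an approach that loops over the table rows.

The rows are deliberately restricted with a specific css selector, both to cope with the table's irregular layout and to pick only the winners, who are shown in bold in the Presidential candidate column. Selecting at row level lets me go on to select the child elements I need to populate the output, which has the format {year: [winner, vote], ...}.

I use an attribute selector with the contains (*) operator to find the year of interest through a title attribute containing the string 'United States presidential election'; a further css selector gets the winner (the bold entry); and a regular expression extracts the votes from the text of the tr element.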
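
The regular-expression step can be tried offline before touching the live page. The following is only a minimal sketch: it uses the same "digits / digits" pattern as this answer, but with one capture group per number, and the helper name `parse_votes` and the sample row texts are hypothetical, written to resemble the text of a table row.

```python
import re

# "digits / digits" with optional single spaces around the slash,
# one capture group per number (an adaptation of the answer's pattern).
VOTES = re.compile(r'(\d+)\s?/\s?(\d+)')


def parse_votes(row_text):
    """Return (votes_won, votes_available) found in a row's text, or None."""
    match = VOTES.search(row_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical row texts, shaped like the rows of the table on the page.
samples = [
    "1788 Independent George Washington 43,782 100.0 69 / 138",
    "1792 Independent George Washington 28,579 100.0 132/264",
    "no vote figures here",
]
results = [parse_votes(text) for text in samples]
print(results)
```

Returning `None` for a row without a match avoids the `AttributeError` that a bare `re.search(...).group(...)` would raise on such a row.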


import re

import requests
from bs4 import BeautifulSoup as bs

url = 'https://en.wikipedia.org/wiki/United_States_presidential_election'
soup = bs(requests.get(url).text, 'lxml')

# Bold link in the "Presidential candidate" column, i.e. the winner of that row
winner_cell = 'td[rowspan] ~ td:nth-of-type(3) b a'

presidential_wins_by_year = {
    int(row.select_one('[title*="United States presidential election"]').text): [  # year
        row.select_one(winner_cell).text.strip(),  # winning candidate
        re.search(r'(\d+\s?/\s?\d+)', row.text).group(1),  # electoral votes
    ]
    for row in soup.select(f'.sortable tr:has({winner_cell})')
}
print(presidential_wins_by_year)

Sample output:

【Comments】:

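    【Answer 2】:

    You can simply use pandas to read the html. pd.read_html returns every table on the page as a list of DataFrames, so you only need to pull out the one you are interested in:

    Code: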

    import pandas as pd
    
    url = 'https://en.wikipedia.org/wiki/United_States_presidential_election'
    
    dfs = pd.read_html(url)
    

    Output:

    print(dfs[2].head(20).to_string())
    
        Year                  Party    Presidential candidate Vice presidential candidate Popular vote      % Electoral votes Notes
    0   1788            Independent         George Washington                None[note 3]        43782  100.0        69 / 138   NaN
    1   1788             Federalist        John Adams[note 4]                None[note 3]          NaN    NaN        34 / 138   NaN
    2   1788             Federalist                  John Jay                None[note 3]          NaN    NaN         9 / 138   NaN
    3   1788             Federalist        Robert H. Harrison                None[note 3]          NaN    NaN         6 / 138   NaN
    4   1788             Federalist             John Rutledge                None[note 3]          NaN    NaN         6 / 138   NaN
    5   1788             Federalist              John Hancock                None[note 3]          NaN    NaN         4 / 138   NaN
    6   1788    Anti-Administration            George Clinton                None[note 3]          NaN    NaN         3 / 138   NaN
    7   1788             Federalist         Samuel Huntington                None[note 3]          NaN    NaN         2 / 138   NaN
    8   1788             Federalist               John Milton                None[note 3]          NaN    NaN         2 / 138   NaN
    9   1788             Federalist           James Armstrong                None[note 3]          NaN    NaN         1 / 138   NaN
    10  1788             Federalist          Benjamin Lincoln                None[note 3]          NaN    NaN         1 / 138   NaN
    11  1788    Anti-Administration            Edward Telfair                None[note 3]          NaN    NaN         1 / 138   NaN
    12  1792            Independent         George Washington                None[note 3]        28579  100.0       132 / 264   NaN
    13  1792             Federalist        John Adams[note 4]                None[note 3]          NaN    NaN        77 / 264   NaN
    14  1792  Democratic-Republican            George Clinton                None[note 3]          NaN    NaN        50 / 264   NaN
    15  1792  Democratic-Republican          Thomas Jefferson                None[note 3]          NaN    NaN         4 / 264   NaN
    16  1792  Democratic-Republican                Aaron Burr                None[note 3]          NaN    NaN         1 / 264   NaN
    17  1796             Federalist                John Adams                None[note 3]        35726   53.4        71 / 276   NaN
    18  1796  Democratic-Republican  Thomas Jefferson[note 5]                None[note 3]        31115   46.6        68 / 276   NaN
    19  1796             Federalist           Thomas Pinckney                None[note 3]          NaN    NaN        59 / 276   NaN
    
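The cells in the output above still need some cleaning before the electoral votes can be used as numbers: some candidate names carry footnote markers such as `[note 4]`, and the "Electoral votes" column holds strings like `69 / 138`. The following is a minimal sketch in plain Python, using a few values copied from that output; the helper names `clean_name` and `split_votes` are hypothetical.

```python
import re

# Footnote markers such as "[note 4]" that appear after some names.
NOTE = re.compile(r'\[note \d+\]')


def clean_name(name):
    """Strip footnote markers from a candidate name."""
    return NOTE.sub('', name).strip()


def split_votes(cell):
    """Turn an 'Electoral votes' cell like '69 / 138' into two integers."""
    won, total = (int(part) for part in cell.split('/'))
    return won, total


# (candidate, electoral votes) pairs copied from the output above.
rows = [
    ("George Washington", "69 / 138"),
    ("John Adams[note 4]", "34 / 138"),
    ("Thomas Jefferson[note 5]", "68 / 276"),
]
cleaned = [(clean_name(name), *split_votes(votes)) for name, votes in rows]
print(cleaned)
```

On the DataFrame itself, the same footnote pattern could probably be applied to the whole column with `Series.str.replace(pattern, '', regex=True)` instead of looping over rows.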

    【Comments】:
