【Question Title】: Web-Scraping All Text Between <table>TABLE I NEED</table> in Python
【Posted】: 2021-10-07 06:47:29
【Question】:

I am trying to scrape the URL below to collect CoVid data from WorldOMeter. On that page there is a table with the id main_table_countries_today containing the 15x225 (3,375) data cells I want to collect.

I have tried a few approaches, but let me share the attempt I think came closest:

import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'


# Refreshes the Terminal Emulator window
def clear_screen():

    def bash_input(user_in):
        _ = system(user_in)
    
    bash_input('clear')


# This bot searches for <table> and </table> to start/stop recording data
class Bot:

    def __init__(self,
                 line_added=False,
                 looking_for_start=True,
                 looking_for_end=False):

        self.line_adding = line_added
        self.looking_for_start = looking_for_start
        self.looking_for_end = looking_for_end
    
    def set_line_adding(self, value):
        self.line_adding = value

    def set_start_look(self, value):
        self.looking_for_start = value

    def set_end_look(self, value):
        self.looking_for_end = value


if __name__ == '__main__':

    # Start with a fresh Terminal emulator
    clear_screen()
    
    my_bot = Bot()

    # Lines captured between the start/end markers
    all_lines = []

    r = requests.get(url).text
    all_r = r.split('\n')

    for rs in all_r:

        if my_bot.looking_for_start and table_id in rs:
                
            my_bot.set_line_adding(True)
            my_bot.set_end_look(True)
            my_bot.set_start_look(False)
        
        if my_bot.looking_for_end and table_end in rs:

            my_bot.set_line_adding(False)
            my_bot.set_end_look(False)
        
        if my_bot.line_adding:

            all_lines.append(rs)

    # Report once the whole response has been scanned
    for lines in all_lines:
        print(lines)

    print('\n\n\n\n')
    print(len(all_lines))

This prints 6,551 lines of code, more than twice what I need. That would normally be fine, since the next step is to clean out the lines irrelevant to my data; however, it does not yield the whole table either. An earlier attempt using BeautifulSoup (a very similar process) also failed to start and stop at the table above. It looked like this:
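One way a line scan like the first attempt can miss part of the table (an assumption about this page, but a common failure mode): if any row embeds a nested `<table>`, the first `</table>` encountered closes the inner table, not the one carrying the id. A minimal, self-contained sketch on hypothetical HTML, not the WorldOMeter page:

```python
# Hypothetical HTML: an outer table (with the id we want) that
# contains a nested inner table.
html = """\
<table id="main_table_countries_today">
<tr><td>
<table class="inner"><tr><td>nested</td></tr></table>
</td></tr>
<tr><td>row we still need</td></tr>
</table>"""

captured = []
recording = False
for line in html.split("\n"):
    if not recording and "main_table_countries_today" in line:
        recording = True
    if recording:
        captured.append(line)
    if recording and "</table>" in line:
        break  # stops at the *nested* close tag, too early

# The later row of the outer table was never captured.
assert "row we still need" not in "\n".join(captured)
print(len(captured), "lines captured before the premature stop")
```

An HTML-aware parser sidesteps this, which is why the BeautifulSoup attempts below are the better direction.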

from bs4 import BeautifulSoup
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()


if __name__ == '__main__':

    # Here we go, again...
    _ = system('clear')

    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')  # specify a parser explicitly
    my_table = soup.find_all('table', {'id': table_id})

    for current_line in my_table:

        page_lines = str(current_line).split('\n')

        for line in page_lines:
            all_lines.append(line)

    for line in all_lines:
        print(line)

    print('\n\n')
    print(len(all_lines))

That produces 5,547 lines.

I have also tried Pandas and Selenium, but I have since deleted that code. My hope is that by showing my two "best" attempts, someone may spot something obvious that I am missing.

I would be happy just to get the data on screen. Ultimately I am trying to turn it into a dictionary that looks like this (to be exported as a .json file):

data = {
    "Country": [country for country in countries],
    "Total Cases": [case for case in total_cases],
    "New Cases": [case for case in new_cases],
    "Total Deaths": [death for death in total_deaths],
    "New Deaths": [death for death in new_deaths],
    "Total Recovered": [death for death in total_recovered],
    "New Recovered": [death for death in new_recovered],
    "Active Cases": [case for case in active_cases],
    "Serious/Critical": [case for case in serious_critical],
    "Total Cases/1M pop": [case for case in total_case_per_million],
    "Deaths/1M pop": [death for death in deaths_per_million],
    "Total Tests": [test for test in total_tests],
    "Tests/1M pop": [test for test in tests_per_million],
    "Population": [population for population in populations]
}

Any suggestions?
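As an aside, each `[x for x in xs]` in that dictionary is equivalent to `list(xs)`. Assuming the parallel lists have been filled, the export is then a single `json.dump` call; a minimal sketch with placeholder values (only two of the columns shown):

```python
import json

# Placeholder values standing in for the scraped columns.
countries = ["USA", "India"]
total_cases = ["35,745,024", "31,693,625"]

data = {
    "Country": list(countries),         # same result as [c for c in countries]
    "Total Cases": list(total_cases),
}

# Export the dictionary as a .json file.
with open("covid_data.json", "w") as f:
    json.dump(data, f, indent=4)
```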

【Question Comments】:

  • You can check my answer; if it works for you and solves your problem, you can mark it as the accepted answer. @T.J.

Tags: python python-3.x web-scraping beautifulsoup python-requests


【Solution 1】:

Here is something you can try; the basic explanations are in the code:

from bs4 import BeautifulSoup
import requests
import pandas as pd

page = requests.get('https://www.worldometers.info/coronavirus')
soup = BeautifulSoup(page.content,"lxml")

table = soup.find('table', attrs={'id': 'main_table_countries_today'})
# Finding table using id

trs = table.find_all("tr", attrs={"style": ""})
# Finding tr from table using style attribute

data = []
data.append(trs[0].text.strip().split("\n")[:13])
# Appending first element of trs to data(list)

for tr in trs[1:]:
    data.append(tr.text.strip().split("\n")[:12])
    # Appending all other data from tr in data(list)

df = pd.DataFrame(data[1:], columns=data[0][:12])
# Converting data into pandas DataFrame and specifying header name from first row of data.

print(df)
"""
          #          Country,Other  TotalCases   NewCases TotalDeaths  \
0     World            198,878,345    +370,787  4,238,503      +6,065
1         1                    USA  35,745,024               629,315
2         2                  India  31,693,625    +39,041    424,777
3         3                 Brazil  19,917,855               556,437
4         4                 Russia   6,288,677    +22,804    159,352
..      ...                    ...         ...        ...         ...
208     211  Saint Pierre Miquelon          28
209     213             Montserrat          21                     1
210     215         Western Sahara          10                     1
211     222                  China      93,005        +75      4,636
212  Total:            198,878,345    +370,787  4,238,503      +6,065

       NewDeaths TotalRecovered NewRecovered ActiveCases Serious,Critical  \
0    179,521,450       +271,005   15,118,392      90,326           25,514
1                    29,666,117                5,449,592           11,516
2           +393     30,846,509      +33,636     422,339            8,944
3                    18,619,542                  741,876            8,318
4           +789      5,625,890      +17,271     503,435            2,300
..           ...            ...          ...         ...              ...
208                          26                        2
209                          19                        1
210                           8                        1
211                      87,347          +24       1,022               25
212  179,521,450       +271,005   15,118,392      90,326         25,514.2

    Tot Cases/1M pop Deaths/1M pop
0              543.8
1            107,311         1,889
2             22,725           305
3             92,991         2,598
4             43,073         1,091
..               ...           ...
208            4,859
209            4,204           200
210               16             2
211               65             3
212            543.8

[213 rows x 12 columns]
"""
# If you don't want the default pandas index you can reset it:
df.reset_index(inplace=True)
# And to use the "#" column as the index:
df.set_index("#", inplace=True)

# Now that we have the complete data, we can save it to a file as well
df.to_csv("<>.csv", index=False)
# or, for Excel:
df.to_excel("<>.xlsx", index=False)

You got "5,547" lines because there are many empty lines and some unnecessary rows, which is why the output grew so large. This approach also cuts down the manual work of building your data dictionary, since you no longer have to write out the column names one by one.
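If the end goal is still the JSON-ready dictionary from the question, a DataFrame converts to exactly that shape with `to_dict(orient="list")`. A sketch on a hypothetical two-column frame standing in for the scraped one:

```python
import json
import pandas as pd

# Hypothetical two-column frame standing in for the scraped table.
df = pd.DataFrame(
    [["USA", "35,745,024"], ["India", "31,693,625"]],
    columns=["Country", "Total Cases"],
)

# orient="list" gives {column name: [column values]} -- the same shape
# as the question's `data` dictionary.
data = df.to_dict(orient="list")
print(data["Country"])  # → ['USA', 'India']

with open("covid.json", "w") as f:
    json.dump(data, f, indent=4)
```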

【Discussion】:

    【Solution 2】:

    The table contains a lot of other information. You can take the first 15 `<td>` cells of each row and strip the leading/trailing 8 rows:

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    
    url = "https://www.worldometers.info/coronavirus/"
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
    all_data = []
    for tr in soup.select("#main_table_countries_today tr:has(td)")[8:-8]:
        tds = [td.get_text(strip=True) for td in tr.select("td")][:15]
        all_data.append(tds)
    
    df = pd.DataFrame(
        all_data,
        columns=[
            "#",
            "Country",
            "Total Cases",
            "New Cases",
            "Total Deaths",
            "New Deaths",
            "Total Recovered",
            "New Recovered",
            "Active Cases",
            "Serious, Critical",
            "Tot Cases/1M pop",
            "Deaths/1M pop",
            "Total Tests",
            "Tests/1M pop",
            "Population",
        ],
    )
    print(df)
    

    Prints:

           #                 Country Total Cases New Cases Total Deaths New Deaths Total Recovered New Recovered Active Cases Serious, Critical Tot Cases/1M pop Deaths/1M pop  Total Tests Tests/1M pop     Population
    0      1                     USA  35,745,024                629,315                 29,666,117                  5,449,592            11,516          107,311         1,889  529,679,820    1,590,160    333,098,437
    1      2                   India  31,693,625   +39,041      424,777       +393      30,846,509       +33,636      422,339             8,944           22,725           305  468,216,510      335,725  1,394,642,466
    2      3                  Brazil  19,917,855                556,437                 18,619,542                    741,876             8,318           92,991         2,598   55,034,721      256,943    214,190,490
    3      4                  Russia   6,288,677   +22,804      159,352       +789       5,625,890       +17,271      503,435             2,300           43,073         1,091  165,800,000    1,135,600    146,002,094
    
    ...
    
    218  219                   Samoa           3                                                 3                          0                                 15                                                199,837
    219  220            Saint Helena           2                                                 2                          0                                328                                                  6,097
    220  221              Micronesia           1                                                 1                          0                                  9                                                116,324
    221  222                   China      93,005       +75        4,636                     87,347           +24        1,022                25               65             3  160,000,000      111,163  1,439,323,776
    

    【Discussion】:
