【Posted】: 2021-10-07 06:47:29
【Question】:
I'm trying to scrape CoVid data from WorldOMeter at the URL below. The page has a table with the ID main_table_countries_today, containing the 15x225 (3,375) data cells I want to collect.
I've tried a few approaches, but let me share what I think is my closest attempt:
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()

# Refreshes the Terminal Emulator window
def clear_screen():
    def bash_input(user_in):
        _ = system(user_in)
    bash_input('clear')

# This bot searches for <table> and </table> to start/stop recording data
class Bot:
    def __init__(self,
                 line_added=False,
                 looking_for_start=True,
                 looking_for_end=False):
        self.line_adding = line_added
        self.looking_for_start = looking_for_start
        self.looking_for_end = looking_for_end

    def set_line_adding(self, flag):
        self.line_adding = flag

    def set_start_look(self, flag):
        self.looking_for_start = flag

    def set_end_look(self, flag):
        self.looking_for_end = flag

if __name__ == '__main__':
    # Start with a fresh Terminal emulator
    clear_screen()
    my_bot = Bot()
    r = requests.get(url).text
    all_r = r.split('\n')
    for rs in all_r:
        if my_bot.looking_for_start and table_id in rs:
            my_bot.set_line_adding(True)
            my_bot.set_end_look(True)
            my_bot.set_start_look(False)
        if my_bot.looking_for_end and table_end in rs:
            my_bot.set_line_adding(False)
            my_bot.set_end_look(False)
        if my_bot.line_adding:
            all_lines.append(rs)
    for lines in all_lines:
        print(lines)
    print('\n\n\n\n')
    print(len(all_lines))
This prints 6,551 lines of markup, more than twice what I need. That by itself would be fine, since the next step is to strip the lines that aren't relevant to my data; the problem is that it doesn't capture the whole table. An earlier attempt using BeautifulSoup (a very similar process) also failed to start and stop at the table above (see the sketch after the second attempt's output below). It looked like this:
from bs4 import BeautifulSoup
import requests
from os import system

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'
table_end = '</table>'

# Declare an empty list to fill with lines of text
all_lines = list()

if __name__ == '__main__':
    # Here we go, again...
    _ = system('clear')
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    my_table = soup.find_all('table', {'id': table_id})
    for current_line in my_table:
        page_lines = str(current_line).split('\n')
        for line in page_lines:
            all_lines.append(line)
    for line in all_lines:
        print(line)
    print('\n\n')
    print(len(all_lines))
This produces 5,547 lines.
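For completeness, here is the direction I suspect is right but haven't gotten working: isolate the single table by its id with find() and read the cell text directly, rather than splitting the markup into raw lines. Beyond the URL and the table id, everything here is my assumption about the page's structure (plain tr/td/th rows):

from bs4 import BeautifulSoup
import requests

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'

html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

# find() returns the one table with the matching id (or None), so there
# is no need to scan for '</table>' by hand.
table = soup.find('table', id=table_id)

rows = []
if table is not None:
    for tr in table.find_all('tr'):
        # Assumes each row is made of ordinary <td>/<th> cells
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        if cells:
            rows.append(cells)

print(len(rows), 'rows;', sum(len(row) for row in rows), 'cells')

If that yields the expected rows, trimming the header and any summary rows should be the only cleanup left.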
I've also tried Pandas and Selenium, but I've since deleted that code. I'm hoping that by showing my two "best" attempts, someone might spot something obvious that I've missed.
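In case the Pandas route is the one worth fixing, this is roughly the shape I understand that attempt should take — a sketch, not my deleted code, and it assumes pandas plus an HTML parser such as lxml are installed:

from io import StringIO

import pandas as pd
import requests

url = 'https://www.worldometers.info/coronavirus/'
table_id = 'main_table_countries_today'

html = requests.get(url).text

# read_html() parses every <table> in the document; the attrs filter
# narrows it to the one whose id matches. It returns a list of DataFrames.
tables = pd.read_html(StringIO(html), attrs={'id': table_id})
df = tables[0]

print(df.shape)
print(df.head())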
I'd be happy just to get the data on screen. Ultimately, I'm trying to convert it into a dictionary that looks like this (to be exported as a .json file):
data = {
    "Country": [country for country in countries],
    "Total Cases": [case for case in total_cases],
    "New Cases": [case for case in new_cases],
    "Total Deaths": [death for death in total_deaths],
    "New Deaths": [death for death in new_deaths],
    "Total Recovered": [count for count in total_recovered],
    "New Recovered": [count for count in new_recovered],
    "Active Cases": [case for case in active_cases],
    "Serious/Critical": [case for case in serious_critical],
    "Total Cases/1M pop": [case for case in total_case_per_million],
    "Deaths/1M pop": [death for death in deaths_per_million],
    "Total Tests": [test for test in total_tests],
    "Tests/1M pop": [test for test in tests_per_million],
    "Population": [population for population in populations]
}
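Assuming I can first get each country's row as a list of cell strings (for example, the rows list in the sketch above, trimmed to these 14 fields — the column mapping is my assumption), the packaging and export step would presumably look like this:

import json

# Keys in the same order as the dictionary above; assumes each row has
# already been trimmed to these 14 columns, in this order.
keys = [
    "Country", "Total Cases", "New Cases", "Total Deaths", "New Deaths",
    "Total Recovered", "New Recovered", "Active Cases", "Serious/Critical",
    "Total Cases/1M pop", "Deaths/1M pop", "Total Tests", "Tests/1M pop",
    "Population",
]

rows = [
    # One list per country; made-up values, for illustration only
    ["Placeholderland", "1000", "+10", "50", "+1", "900", "+9",
     "50", "5", "100", "5", "20000", "2000", "10000000"],
]

# Transpose row-per-country into column-per-field
data = {key: [row[i] for row in rows] for i, key in enumerate(keys)}

with open('covid_data.json', 'w') as f:
    json.dump(data, f, indent=2)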
Any suggestions?
【Comments】:
- You can check my answer; if it works for you and solves your problem, you can mark it as the accepted answer. @T.J.
Tags: python python-3.x web-scraping beautifulsoup python-requests