【Posted】:2018-06-09 19:24:44
【Question】:
This is what I have so far:
from bs4 import BeautifulSoup
from requests import get

# Fetch the game page and parse it with the built-in HTML parser
url = 'https://howlongtobeat.com/game.php?id=38050'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
game_name = html_soup.select('div.profile_header')[0].text
game_length = html_soup.select('div.game_times li div')[-1].text
game_developer = html_soup.find_all('strong', string='\nDeveloper:\n')[0].next_sibling
game_publisher = html_soup.find_all('strong', string='\nPublisher:\n')[0].next_sibling
game_console = html_soup.find_all('strong', string='\nPlayable On:\n')[0].next_sibling
game_genres = html_soup.find_all('strong', string='\nGenres:\n')[0].next_sibling
print(game_name)
print(game_length)
print(game_developer)
print(game_publisher)
print(game_console)
print(game_genres)
This outputs:
God of War (2018)
31 Hours
SIE Santa Monica Studio
Sony Interactive Entertainment
PlayStation 4
Third-Person, Action, Adventure
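I plan to use this data to build a spreadsheet (once I figure out how to extract the game name, Main + Extra game length, developer name, publisher, Playable On, and genres fields).
Since it will store this data, I think it should first print the data like this before storing it: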
God of War (2018)
31 Hours
SIE Santa Monica Studio
Sony Interactive Entertainment
PlayStation 4
Third-Person, Action, Adventure
Any help would be greatly appreciated.
Edit ---
I did some research, and I think I need Pandas.
【Discussion】:
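Pandas is a reasonable fit for the spreadsheet step. Below is a minimal sketch of one way to do it, not the only one: collect each game's fields into a dict, build a DataFrame from the list of dicts, and export it as CSV. The column names and the `make_row` helper are illustrative assumptions, and the sample values are copied from the output in the question rather than scraped live.

```python
import pandas as pd

# Column names are illustrative; rename them to suit the spreadsheet.
COLUMNS = ["Name", "Main + Extra", "Developer", "Publisher", "Playable On", "Genres"]


def make_row(*values):
    """Pair scraped values with COLUMNS, stripping stray whitespace."""
    return {col: str(val).strip() for col, val in zip(COLUMNS, values)}


# Sample values copied from the printed output in the question; in the real
# script these would be game_name, game_length, game_developer, and so on.
rows = [
    make_row(
        "God of War (2018)",
        "31 Hours",
        "SIE Santa Monica Studio",
        "Sony Interactive Entertainment",
        "PlayStation 4",
        "Third-Person, Action, Adventure",
    )
]

df = pd.DataFrame(rows, columns=COLUMNS)

# With no path, to_csv returns the CSV text; pass a filename such as
# "games.csv" (hypothetical) to write a file a spreadsheet program can open.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To cover more games, append one `make_row(...)` result per page to `rows` before building the DataFrame. `df.to_excel(...)` is also available if a real .xlsx file is wanted, though it needs an Excel engine such as openpyxl installed.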
Tags: python html pandas web-scraping beautifulsoup