【Question title】: Splitting html from scraped data (Python+BeautifulSoup4)
【Posted】: 2019-05-14 12:26:52
【Question description】:

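I'm having trouble scraping the text inside a tag without also getting the rest of the HTML data. The text I want to scrape is not inside a span with a class; it sits directly inside the <a> tag. Here is an example of where the text is placed.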

<a href="/counterstrike/rankings/team-details/32537">
  <span class="ranking">49</span>
  <span class="flag flag-pl" data-tooltip="" tabindex="1" title="Poland"></span>
  TEXT-I-WANT-TO-SCRAPE
  <span class="elo">1103</span>
</a>

If I use ".text.encode('utf8').lstrip().rstrip()" on the <a> tag, I still get data like this:

print(textt)
'49\n \n\n\n TEXT-I-WANT-TO-SCRAPE \n \n 1103'
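The behaviour can be reproduced offline with the sample markup above. The following is a minimal sketch, assuming the `title` attribute's quote is closed and using the built-in `html.parser`; it shows that `.text` joins every string found anywhere below the tag, which is why the ranking and elo values come along.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (title attribute quote closed).
html = """
<a href="/counterstrike/rankings/team-details/32537">
  <span class="ranking">49</span>
  <span class="flag flag-pl" data-tooltip="" tabindex="1" title="Poland"></span>
  TEXT-I-WANT-TO-SCRAPE
  <span class="elo">1103</span>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

# .text concatenates all descendant strings, including those inside the spans.
all_text = a.text.strip()
print(repr(all_text))

# Listing the non-empty strings one by one shows where each piece comes from.
pieces = [s.strip() for s in a.strings if s.strip()]
print(pieces)
```

The second print makes the three sources visible: the ranking span, the bare text node, and the elo span.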

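My question is: how do I get only the text that sits directly inside the tag?

Scraping the elo and the ranking is no problem, because they are wrapped in spans with specific classes.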

def get_matches():
    matches = get_parsed_page("https://www.gosugamers.net/counterstrike/rankings")
    rankings = matches.find("ul", {"class": "ranking-list"})
    matchdays = rankings.find_all("li")

    for match in matchdays:
        matchDetails = match.find_all("a")

        for getMatch in matchDetails:
            elo = match.find("span", {"class": "elo"}).text.encode('utf8').lstrip().rstrip()
            ranking = match.find("span", {"class": "ranking"}).text.encode('utf8').lstrip().rstrip()
            textt = match.find("a").text.encode('utf8').lstrip().rstrip()

            print(ranking,elo,textt)

Best regards

【Comments】:

    Tags: python web-scraping beautifulsoup


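    【Solution 1】:

    Use next_element to reach the text node that follows a tag. Try the code below. It uses a regular expression to select only the anchors whose href points to a team-details page.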

    from bs4 import BeautifulSoup
    import requests
    import re

    data = requests.get("https://www.gosugamers.net/counterstrike/rankings").text
    soup = BeautifulSoup(data, 'html.parser')
    # Match only the anchors that link to a team-details page.
    for a in soup.find_all('a', href=re.compile("/counterstrike/rankings/team-details")):
        ranking = a.find('span', class_='ranking').text.replace('\n','').strip()
        # Four hops from the ranking span: its own text, whitespace, the flag span, then the team-name text node.
        name = a.find('span', class_='ranking').next_element.next_element.next_element.next_element.replace('\n','').strip()
        elo = a.find('span', class_='elo').text.replace('\n','').strip()
        print(ranking, name, elo)
    

    Output:

    1 Astralis 1505
    2 Team Liquid 1469
    3 ENCE eSports 1402
    4 Vitality 1365
    5 AVANGAR 1326
    6 Natus Vincere 1298
    7 Ninjas in Pyjamas 1294
    8 fnatic 1292
    9 MiBR 1269
    10 FURIA 1264
    11 mousesports 1258
    12 Renegades 1252
    13 NRG eSports 1248
    14 ORDER 1240
    15 Grayhound Gaming 1237
    16 Valiance 1235
    17 Windigo 1228
    18 FaZe Clan 1222
    19 North 1220
    20 G2 Esports 1213
    21 OpTic Gaming 1201
    22 MVP PK 1196
    23 Heroic 1183
    24 Chiefs eSports Club 1177
    25 3DMAX.CS 1173
    26 HellRaisers 1168
    27 Rogue 1167
    28 BIG 1165
    29 forZe 1165
    30 Ghost Gaming 1159
    31 Swole Patrol 1154
    32 TyLoo 1151
    33 Red Reserve 1142
    34 Isurus Gaming 1142
    35 Team Kinguin 1136
    36 Tainted Minds 1135
    37 Movistar Riders 1134
    38 NoChance 1134
    39 DETONA Gaming 1132
    40 Space Soldiers 1120
    41 Bravado Gaming 1117
    42 BPro Gaming 1116
    43 Cloud9 1116
    44 GamerLegion 1113
    45 CyberZen 1111
    46 Epsilon 1111
    47 CLG Red 1107
    48 Luminosity Gaming 1107
    49 devils.one 1103
    50 Sprout 1096
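The chain of four next_element hops depends on the exact order of the nodes inside each anchor, so it may break if the page adds or removes a span. As a possible alternative, the sketch below keeps only the strings that are direct children of the anchor. It runs against the sample markup from the question rather than the live site, and `direct_text` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

# Sample markup from the question (title attribute quote closed).
html = """
<a href="/counterstrike/rankings/team-details/32537">
  <span class="ranking">49</span>
  <span class="flag flag-pl" data-tooltip="" tabindex="1" title="Poland"></span>
  TEXT-I-WANT-TO-SCRAPE
  <span class="elo">1103</span>
</a>
"""


def direct_text(tag):
    """Return the text that sits directly inside `tag`, skipping child elements."""
    pieces = [
        child.strip()
        for child in tag.children
        if isinstance(child, NavigableString)  # text nodes only, not <span> tags
    ]
    # Drop whitespace-only nodes left over from the indentation between tags.
    return " ".join(p for p in pieces if p)


soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

ranking = a.find("span", class_="ranking").get_text(strip=True)
name = direct_text(a)
elo = a.find("span", class_="elo").get_text(strip=True)
print(ranking, name, elo)  # 49 TEXT-I-WANT-TO-SCRAPE 1103
```

Because the helper looks only at `tag.children`, text nested inside the spans is never included, whatever order the spans appear in. HTML comments are also NavigableString instances, so a page containing comments inside the anchor would need an extra filter.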
    

    【Comments】:
