【Question title】: Extract string within script tags from HTML
【Posted】: 2018-12-28 21:43:59
【Question】:

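I am trying to build a web scraper to get data from the following website (later I want to do the same for several airlines on the same site): https://www.flightradar24.com/data/airlines/kl-klm/routes

I currently have the following code: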

from bs4 import BeautifulSoup
import requests

airlines = ['kl-klm']

for a in airlines:
    url = 'https://www.flightradar24.com/data/airlines/' + a + '/routes'
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup)

This gives me the source of the whole page, but I want to extract one specific piece of text inside a script tag, namely:

var arrRoutes=[{"airport1":{"country":"Denmark","iata":"AAL","icao":"EKYT","lat":57.092781,"lon":9.849164,"name":"Aalborg Airport"},"airport2":{"country":"Netherlands","iata":"AMS","icao":"EHAM","lat":52.308609,"lon":4.763889,"name":"Amsterdam Schiphol Airport"}},{"airport1":{"country":"United Kingdom","iata":"ABZ","icao":"EGPD","lat":57.201939,"lon":-2.19777,"name":"Aberdeen International Airport"},"airport2":{"country":"Netherlands","iata":"AMS","icao":"EHAM","lat":52.308609,"lon":4.763889,"name":"Amsterdam Schiphol Airport"}}...

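...and so on, all the way to the end of the list.

How can I extract this information so that I can find the total number of inbound and outbound flights for each airport? For example, the total number of times Amsterdam Schiphol Airport appears as airport1 or airport2?

Is there a way to first extract the string from the HTML and then convert it into a Python list of dictionaries? Or does it make more sense to count the elements directly in the string?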

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup python-requests


    【Solution 1】:

    Use re.compile to find the script tag whose text contains var arrRoutes=.

    For example:

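    import re
    from bs4 import BeautifulSoup

    # `page` is the requests response from the question's code
    soup = BeautifulSoup(page.text, 'html.parser')
    # Find the <script> tag whose text contains the arrRoutes assignment
    jData = soup.find("script", string=re.compile(r"var arrRoutes=")).string
    print(jData.replace("var arrRoutes=", ""))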
    

    Output:

    [{"airport1":{"country":"Denmark","iata":"AAL","icao":"EKYT","lat":57.092781,"lon":9.849164,"name":"Aalborg Airport"},"airport2":{"country":"Netherlands","iata":"AMS","icao":"EHAM","lat":52.308609,"lon":4.763889,"name":"Amsterdam Schiphol Airport"}},{"airport1":{"country":"United Kingdom","iata":"ABZ","icao":"EGPD","lat":57.201939,"lon":-2.19777,"name":"Aberdeen International Airport"},"airport2":{"country":"Netherlands","iata":"AMS","icao":"EHAM","lat":52.308609,"lon":4.763889,"name":"Amsterdam Schiphol Airport"}},......
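The string printed above is JSON, so it can be turned into a Python list of dictionaries with the standard json module. A minimal sketch, assuming the script text has the form var arrRoutes=[...] with no space after the equals sign, possibly followed by other JavaScript. script_text is a shortened, hypothetical stand-in for the real page content, and parse_routes is an illustrative helper name:

```python
import json

# Hypothetical stand-in for the text returned by soup.find(...).string;
# the real script may contain more fields or extra statements.
script_text = (
    'var arrRoutes=[{"airport1":{"iata":"AAL","name":"Aalborg Airport"},'
    '"airport2":{"iata":"AMS","name":"Amsterdam Schiphol Airport"}},'
    '{"airport1":{"iata":"ABZ","name":"Aberdeen International Airport"},'
    '"airport2":{"iata":"AMS","name":"Amsterdam Schiphol Airport"}}];'
)

def parse_routes(text, prefix="var arrRoutes="):
    """Return the route list encoded right after `prefix` in a script's text."""
    start = text.index(prefix) + len(prefix)
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever follows it, so a trailing ';' or more JavaScript is harmless.
    routes, _ = json.JSONDecoder().raw_decode(text, start)
    return routes

routes = parse_routes(script_text)
print(len(routes), routes[0]["airport2"]["name"])
```

Using raw_decode instead of str.replace plus json.loads avoids having to strip a trailing semicolon by hand, should the page include one.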
    

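    【Comments】:

      【Solution 2】:

      You can use ast.literal_eval to turn the data into a Python list. I wrote a simple function, find_airport(), that takes the data and an airport name and returns how many times that airport appears as airport1 and as airport2: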

      from bs4 import BeautifulSoup
      import requests
      import re
      from ast import literal_eval
      from pprint import pprint
      
      airlines = ['kl-klm']
      
      headers = {"Host":"www.flightradar24.com",
      "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
      "Accept-Encoding":"gzip,deflate,br",
      "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}
      
      def find_airport(data, name):
          """Count how often `name` appears as airport1 and as airport2."""
          airport_1, airport_2 = 0, 0
          for d in data:
              if d['airport1']['name'] == name:
                  airport_1 += 1
              if d['airport2']['name'] == name:
                  airport_2 += 1
          return airport_1, airport_2
      
      for a in airlines:
          url = 'https://www.flightradar24.com/data/airlines/' + a + '/routes'
          page = requests.get(url, headers=headers)
          soup = BeautifulSoup(page.text, 'lxml')
      
          # Capture the list literal assigned to arrRoutes in the page text
          m = re.search(r'(?<=arrRoutes=)\[\{(.*?)\}\]', soup.text)
          routes = literal_eval(m[0])
          pprint(routes)
      
          print(find_airport(routes, 'Amsterdam Schiphol Airport'))
      

      Prints:

      [{'airport1': {'country': 'Denmark',
                     'iata': 'AAL',
                     'icao': 'EKYT',
                     'lat': 57.092781,
                     'lon': 9.849164,
                     'name': 'Aalborg Airport'},
        'airport2': {'country': 'Netherlands',
                     'iata': 'AMS',
                     'icao': 'EHAM',
                     'lat': 52.308609,
                     'lon': 4.763889,
                     'name': 'Amsterdam Schiphol Airport'}},
       {'airport1': {'country': 'United Kingdom',
                     'iata': 'ABZ',
                     'icao': 'EGPD',
                     'lat': 57.201939,
                     'lon': -2.19777,
                     'name': 'Aberdeen International Airport'},
        'airport2': {'country': 'Netherlands',
                     'iata': 'AMS',
                     'icao': 'EHAM',
                     'lat': 52.308609,
                     'lon': 4.763889,
                     'name': 'Amsterdam Schiphol Airport'}},
      
      ...and so on
      

      And finally:

      (147, 146)
      

      for 'Amsterdam Schiphol Airport'.
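To tally every airport in one pass rather than one name at a time, collections.Counter is one option. A minimal sketch, assuming routes has the same shape as the list printed above. The sample data and the helper name count_airports are made up for illustration:

```python
from collections import Counter

# Hypothetical sample in the same shape as the parsed arrRoutes list.
routes = [
    {"airport1": {"name": "Aalborg Airport"},
     "airport2": {"name": "Amsterdam Schiphol Airport"}},
    {"airport1": {"name": "Aberdeen International Airport"},
     "airport2": {"name": "Amsterdam Schiphol Airport"}},
    {"airport1": {"name": "Amsterdam Schiphol Airport"},
     "airport2": {"name": "Aalborg Airport"}},
]

def count_airports(routes):
    """Return (as_airport1, as_airport2, combined) Counters keyed by name."""
    as_first = Counter(r["airport1"]["name"] for r in routes)
    as_second = Counter(r["airport2"]["name"] for r in routes)
    # Adding two Counters sums the counts per key.
    return as_first, as_second, as_first + as_second

as_first, as_second, totals = count_airports(routes)
name = "Amsterdam Schiphol Airport"
print(as_first[name], as_second[name], totals[name])
```

With the real list, totals.most_common(5) would list the five airports that appear most often.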

      【Comments】:

      • Great, exactly what I wanted. Thanks!