【Question Title】: Web Scraping Pop-Up
【Posted】: 2020-09-13 16:40:13
【Question】:

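I'm new to web scraping and I'm trying to automate retrieving parcel information from a town website. I have 300+ parcels for which I need the book and page numbers.

Here is the website: https://newmilfordct.mapgeo.io/datasets/properties?abuttersDistance=100&latlng=41.587864%2C-73.425014

When you go there, you can click Search, and then I type in the identifier (e.g. 68/20). I have a list of all of them. A profile then pops up, and from it I can get the book and page number.

This is what I have so far: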

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://newmilfordct.mapgeo.io/datasets/properties?abuttersDistance=100&latlng=41.587864%2C-73.425014"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

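I can connect to the site, but I don't know how to interact with it. If anyone can point me in the right direction, it would be greatly appreciated and would save me many hours of work.

【Comments】:

  • What is the desired output?
  • This looks like a job for Selenium, since you need to click on and interact with the website.

Tags: python python-3.x web-scraping beautifulsoup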


【Solution 1】:

You can get the data for a given identifier by sending a POST request to the API URL.

Here's how:

import requests

search_url = "https://newmilfordct.mapgeo.io/api/datasets/properties/search?format=json"

identifier = "68/20"

payload = {
    "page": 1,
    "quickSearch": identifier
}

# Passing the dict as `data` sends it form-encoded, same as the positional form.
search_results = requests.post(search_url, data=payload).json()
# print(search_results)

for item in search_results:
    name = item['displayName']
    owner = item['ownerName']
    geometry = item['geometry']
    book = item['lastSaleBook']
    page = item['lastSalePage']
    print(f"Name: {name} | Owner: {owner}")
    print(f"Book/Page: {book}/{page}")
    print(geometry)
    print("-" * 80)

Output:

Name: 17 BUCKINGHAM LN | Owner: ROTELLI LOUIS
Book/Page: 0970/230
{"type":"Polygon","coordinates":[[[-73.4909038060549,41.6425898231357],[-73.4909821900848,41.6425591025291],[-73.4907493168393,41.6419510845828],[-73.4911769908149,41.6420353877],[-73.4915429751214,41.6418889484739],[-73.4915515509607,41.6418998161938],[-73.4919447199921,41.6423992451082],[-73.4920405021311,41.6425204818934],[-73.4919930203487,41.6425307775562],[-73.4919273071398,41.6425305146988],[-73.4917614178846,41.642552550643],[-73.491595684262,41.642581803258],[-73.4910018358319,41.6426901884681],[-73.4910019510053,41.6427258656192],[-73.4909038060549,41.6425898231357]]]}
--------------------------------------------------------------------------------
Name: 15 BUCKINGHAM LN | Owner: NEELANDS DOUGLAS S + SALOME S
Book/Page: 0330/394
{"type":"Polygon","coordinates":[[[-73.4904204439222,41.6413365201908],[-73.4908759926496,41.6411167792846],[-73.4909181970441,41.6410961714263],[-73.4915429751214,41.6418889484739],[-73.4911769908149,41.6420353877],[-73.4907493168393,41.6419510845828],[-73.4909821900848,41.6425591025291],[-73.4909038060549,41.6425898231357],[-73.4904204439222,41.6413365201908]]]}
--------------------------------------------------------------------------------

There's more in the JSON. Just uncomment the line # print(search_results) to see the full response.
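Since the question mentions 300+ identifiers, the single lookup can be wrapped in a loop. The following is only a minimal sketch under some assumptions: the helper names (`search_parcel`, `collect_book_pages`, `write_csv`), the `delay` parameter, and the CSV layout are hypothetical and not part of the original answer, and the field names are the ones printed in the example. A quick search may return more than one match per identifier, so each row keeps the identifier it came from.

```python
import csv
import time

SEARCH_URL = "https://newmilfordct.mapgeo.io/api/datasets/properties/search?format=json"


def search_parcel(identifier):
    """Send the same POST request as the answer's example for one identifier."""
    import requests  # imported lazily so the helpers below work without it

    payload = {"page": 1, "quickSearch": identifier}
    response = requests.post(SEARCH_URL, data=payload, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def collect_book_pages(identifiers, search=search_parcel, delay=0.0):
    """Return one row per search result, tagged with the queried identifier."""
    rows = []
    for identifier in identifiers:
        for item in search(identifier):
            rows.append({
                "identifier": identifier,
                "name": item.get("displayName"),
                "book": item.get("lastSaleBook"),
                "page": item.get("lastSalePage"),
            })
        time.sleep(delay)  # optional pause between requests
    return rows


def write_csv(rows, path):
    """Write the collected rows to a CSV file with a header line."""
    fieldnames = ["identifier", "name", "book", "page"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

With a list of identifiers in hand, something like `write_csv(collect_book_pages(identifiers, delay=1.0), "book_pages.csv")` would run the whole batch; the one-second delay is just a conservative guess to avoid hammering the server.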

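Edit: A short note on the API.

You can see what happens when you enter an identifier in the search field by opening your web browser's Developer Tools. Go to the Network tab, select the XHR filter, and then run the search.

Select the first item in the list and choose Headers. There you'll find the Request URL and the Request payload.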
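To compare what the browser shows with what a script would send, the request can be built without sending it. This is a small sketch using only the standard library; it assumes a form-encoded body, which is what requests produces when a dict is passed as `data`. The browser may display the payload as JSON instead, and that the server accepts the form-encoded variant is known only from the working example in this answer.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://newmilfordct.mapgeo.io/api/datasets/properties/search?format=json"

payload = {"page": 1, "quickSearch": "68/20"}

# Build the POST request without sending it; urlencode() form-encodes the
# payload the same way requests does when a dict is passed as `data`.
request = Request(SEARCH_URL, data=urlencode(payload).encode("utf-8"), method="POST")

print(request.get_method())  # HTTP method, to compare with the browser's entry
print(request.full_url)      # URL, to compare with the Request URL
print(request.data)          # encoded body that would be sent
```

If the method, URL, and field names line up with the entry in the Network tab, the script is asking for the same resource the page does.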

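【Comments】:

  • Works great, thank you so much! If you don't mind me asking, where did you find the API? I looked through all the source code on the website and didn't see anything referencing that link.
  • Thanks for the info. This is my first time doing this, and I didn't understand how it worked until now.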