Scraping plain HTML content with a persistent connection (answers)

【Question title】: Scraping plain HTML-Content with durable connection
【Posted】: 2021-08-12 09:18:27
【Question】:

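First of all, I should say that I am only a hobby programmer with strictly amateur skills.

I have a Mobotix camera that raises an alarm when it detects motion. I can query the alarm status through a specific URL. When I open that URL in a browser, I get the live status as plain text. Each time the camera detects motion, an entry is written; if nothing happens for a long time, a few blank lines are emitted instead. The problem is that the request never finishes. As you can see in the picture, the tab shows only Firefox's usual loading dots instead of the favicon.

I tried to fetch the data from that page with Python, but because the request never finishes, it does not work.

I tried a simple "requests.get" and found some examples that use Scrapy or BeautifulSoup, but those are very hard for me to learn. So I would like to ask which approach is the lesser evil, or whether you can give me a nudge in the right direction.

I tried a simple timeout on the request, but that only raises errors and misses the point, because I want a permanent, standing connection to the camera. After that I want to filter out the alarm counter value and trigger further actions based on it.

Sorry for my poor English. I hope you can help me.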

import requests

url='http://192.168.0.242/control/event.jpg?output=alarmupdate&filter=AS'
user='admin'
pwd='pwd'

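# Basic-auth credentials are set once on the session and reused for every request.
with requests.Session() as session:
    session.auth = (user, pwd)

    # Without stream=True this call waits for the complete body,
    # so for this endpoint it never returns.
    r = session.get(url)

    if r.status_code == 200:
        print('Success!')
    else:
        print('Error.')

    print(r)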

(Browser screenshot)

【Comments on the question】:

Tags: python html screen-scraping

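【Solution 1】:

The first thing you need to identify is your base site. For example, https://www.google.com/search?q& ..... is a Google search page, and https://www.google.com/ is the Google search site as well. So start from the base URL. Then run the following code in Python 3.x and check what the page responds with.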

import requests
url = "http://www.yourwebpage.com"
print(requests.get(url))

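If you get HTTP 200, or 403 (in which case you need to pass your credentials), you are good to go. If you get any other HTTP status code, let me know and we will handle it accordingly. Depending on the result, we will then use BeautifulSoup for the web scraping.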
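The status check described in this answer could be sketched roughly as follows. This is only an illustration: next_step and check_page are hypothetical helper names, treating 401 the same way as 403 is an added assumption, and stream=True is used so that only the headers are fetched and the check also returns for a response whose body never ends.

```python
def next_step(status_code):
    """Map an HTTP status code to the follow-up suggested in the answer."""
    if status_code == 200:
        return "ok"
    if status_code in (401, 403):  # 401 is an added assumption, see above
        return "send credentials"
    return "investigate"


def check_page(url, auth=None, timeout=5):
    """Fetch only the response headers and report what to do next."""
    import requests  # third-party; imported here so next_step works without it

    # stream=True makes requests return once the headers have arrived,
    # so the body is never downloaded by this check.
    with requests.get(url, auth=auth, timeout=timeout, stream=True) as response:
        return response.status_code, next_step(response.status_code)
```

For the camera in the question this would be called as check_page(url, auth=(user, pwd)).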

【Comments】:

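【Solution 2】:

Thank you for your answer. The comment box does not allow me enough characters, so I am replying here:

Edit: OK... sorry, I still need to learn how this forum works. I think I should have edited my original post and then left a comment... right? Sorry...

Thank you for your answer. I get HTTP status code 200, so that is not my problem. Later on, the camera should be reachable under the same domain on different ports.

Here is my code: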

import requests
import time

url_alarm='http://192.168.0.246/output_emz.xml?A10A'
domain='http://192.168.0.242'
#url='/control/event.jpg?output=alarmupdate&filter=AS'
url='/control/event.jpg'
ports=['80', '8080']
user='admin'
pwd='pwd'

def Alarm():
    requests.get(url_alarm)
    time.sleep(0.5)
    requests.get(url_alarm)

def connection():
    # Try the event URL on every configured port.
    for x in ports:
        completeurl = domain+':'+x+url
        try:
            r = requests.get(completeurl, auth=(user, pwd), verify=False)

            if r.status_code == 200:
                print ('Reached ' + domain + ' at Port ' + x)
                print('Sending Alarm...')
                Alarm()
                print('Alarm successfully sent \n')
            else:
                print('Error when trying to reach ' + domain + ' at Port ' + x)
                print('Trying just to reach the domain...')
                r = requests.get(domain, auth=(user, pwd), verify=False)
                if r.status_code == 200:
                    print('This would work...')
                else:
                    print('This also did not work. Trying next cam...')

        # Catch only network/HTTP errors raised by requests, not every exception.
        except requests.RequestException:
            print('Connection to camera failed at '+completeurl)

###__________CODE__________###

connection()

This gives me the following output:

Reached http://192.168.0.242 at Port 80
Sending Alarm...
Alarm successfully sent

Error when trying to reach http://192.168.0.242 at Port 8080
Trying just to reach the domain...
This would work...

But as soon as I add "?output=alarmupdate&filter=AS", the request never finishes and the rest of the code never runs...
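A likely reason for the hang is that requests.get waits for the complete body by default, and this URL never finishes sending one. One possible way around it is to stream the response and handle each line as it arrives. The following is a minimal sketch, assuming the camera keeps the connection open and sends one text line per event with blank lines in between, as described in the question. The helper names non_empty_lines and stream_alarm_lines, the 5-second connect timeout, and the UTF-8 decoding are assumptions made for illustration, not part of the original code.

```python
def non_empty_lines(lines):
    """Yield stripped text lines, skipping the blank keep-alive lines."""
    for raw in lines:
        # iter_lines() yields bytes by default; plain str is accepted as well.
        text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else raw
        text = text.strip()
        if text:
            yield text


def stream_alarm_lines(url, user, pwd):
    """Open the never-ending response and yield each non-empty line as it arrives."""
    import requests  # third-party; imported here so non_empty_lines works without it

    # stream=True returns once the headers have arrived instead of waiting
    # for the body to end. timeout=(5, None) means: give up after 5 seconds
    # if no connection can be made, but wait indefinitely between lines.
    with requests.get(url, auth=(user, pwd), stream=True,
                      timeout=(5, None)) as response:
        response.raise_for_status()
        # chunk_size=1 hands a line over as soon as it is complete
        # rather than after a larger read buffer has filled.
        yield from non_empty_lines(response.iter_lines(chunk_size=1))


# Possible usage with the values from the code above (blocks until interrupted):
#
#     full_url = domain + url + '?output=alarmupdate&filter=AS'
#     for line in stream_alarm_lines(full_url, user, pwd):
#         print(line)
#         Alarm()
```

Because the loop only ends when the camera closes the connection or an error occurs, any reaction to an alarm has to happen inside the loop body. How the alarm counter is extracted from a line depends on the exact text the camera sends, which is not shown in the question.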

【Comments】: