【Question title】: dryscrape install on Ubuntu Server 16.04
【Posted】: 2017-08-06 15:46:05
【Question】:

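I can't get dryscrape working on an Ubuntu 16.04 server (a fresh install on DigitalOcean). The goal is to scrape websites whose content is populated by JavaScript.

I am following the dryscrape installation instructions from here: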

apt-get update
apt-get install qt5-default libqt5webkit5-dev build-essential \
                  python-lxml python-pip xvfb

pip install dryscrape

I then run the following Python script, which I found here, against a test HTML page. (The page shows either the static HTML text or the text set by JS.)

Python

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
my_url = 'http://www.example.com/scrape.php'
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")

HTML - scrape.php

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>
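
For comparison, here is a minimal sketch of what a parser that does not execute JavaScript sees in the test page above. It is written for Python 3 and uses only the standard-library html.parser, not BeautifulSoup or dryscrape; the names TextById and static_text are hypothetical helpers. Because the script is never run, the paragraph keeps its original text.

```python
from html.parser import HTMLParser

# The test page from above, exactly as served (before any script runs).
PAGE = """<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>"""


class TextById(HTMLParser):
    """Collect the text of the first element with a given id.

    Hypothetical helper; it does not handle nested elements of the same tag.
    """

    def __init__(self, wanted_id):
        super().__init__()
        self.wanted_id = wanted_id
        self.open_tag = None  # tag name of the matched element while inside it
        self.done = False     # True once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or self.open_tag is not None:
            return
        if dict(attrs).get("id") == self.wanted_id:
            self.open_tag = tag

    def handle_endtag(self, tag):
        if self.open_tag is not None and tag == self.open_tag:
            self.open_tag = None
            self.done = True

    def handle_data(self, data):
        if self.open_tag is not None:
            self.chunks.append(data)


def static_text(html, element_id):
    """Return the element's text as a non-JS parser sees it."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


print(static_text(PAGE, "intro-text"))  # prints: No javascript support
```

Getting 'Yay! Supports javascript' instead requires something that actually runs the script, which is what the dryscrape session is for.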

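When I do this I don't get the expected data back, only an error.

Am I missing something obvious?

Note: I have searched through many installation guides and threads but can't get it to work. I also tried Selenium, but that didn't help either. Many thanks.

Output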

Traceback (most recent call last):
  File "js.py", line 3, in <module>
    session = dryscrape.Session()
  File "/usr/local/lib/python2.7/dist-packages/dryscrape/session.py", line 22, in __init__
    self.driver = driver or DefaultDriver()
  File "/usr/local/lib/python2.7/dist-packages/dryscrape/driver/webkit.py", line 30, in __init__
    super(Driver, self).__init__(**kw)
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 230, in __init__
    self.conn = connection or ServerConnection()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 507, in __init__
    self._sock = (server or get_default_server()).connect()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 450, in get_default_server
    _default_server = Server()
  File "/usr/local/lib/python2.7/dist-packages/webkit_server.py", line 424, in __init__
    raise NoX11Error("Could not connect to X server. "
webkit_server.NoX11Error: Could not connect to X server. Try calling dryscrape.start_xvfb() before creating a session.
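
The error says that no X server could be reached. X clients normally locate the server through the DISPLAY environment variable, which is typically unset on a headless server. Below is a small sketch of a pre-check, assuming that a missing DISPLAY on Linux is a good enough signal; the helper name needs_xvfb is hypothetical and is not part of dryscrape.

```python
import os
import sys


def needs_xvfb(platform=None, environ=None):
    """Return True when a virtual X server should be started first.

    Assumption: on Linux, an empty or missing DISPLAY means no X server
    is reachable. This is a heuristic, not a check dryscrape performs.
    """
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    return "linux" in platform and not environ.get("DISPLAY")


print(needs_xvfb("linux", {}))                 # headless server: True
print(needs_xvfb("linux", {"DISPLAY": ":0"}))  # desktop session: False
```

Passing the platform and environment as arguments keeps the check easy to try out without changing the real environment.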

Working script

import dryscrape
from bs4 import BeautifulSoup

dryscrape.start_xvfb()
session = dryscrape.Session()
my_url = 'https://www.example.com/scrape.php'
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response, "html.parser")
print soup.find(id="intro-text").text

【Question discussion】:

    Tags: javascript python ubuntu web-scraping dryscrape


    【Solution 1】:

    You are not running an X server. The clue is in the error message:

    Try calling dryscrape.start_xvfb() before creating a session.

    http://dryscrape.readthedocs.io/en/latest/usage.html

    import sys
    import dryscrape

    if 'linux' in sys.platform:
        # start xvfb in case no X is running. Make sure xvfb
        # is installed, otherwise this won't work!
        dryscrape.start_xvfb()
    

    http://dryscrape.readthedocs.io/en/latest/installation.html

    xvfb (only needed if no other X server is available)

    So you can add:

    dryscrape.start_xvfb()
    

    before:

    session = dryscrape.Session()
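
Put together, the fix is purely about ordering: start Xvfb first, then create the session. The following is a minimal sketch written for Python 3, assuming dryscrape and the xvfb package are installed. The function name render_page and its scraper and platform parameters are hypothetical conveniences so that the ordering can be exercised with a stand-in object; only start_xvfb(), Session(), visit() and body() come from the scripts above.

```python
import sys


def render_page(url, scraper=None, platform=None):
    """Load url in a dryscrape session and return the rendered HTML.

    scraper:  module-like object exposing start_xvfb() and Session();
              defaults to the real dryscrape module (hypothetical parameter).
    platform: platform string to test against; defaults to sys.platform
              (hypothetical parameter).
    """
    if scraper is None:
        # Imported lazily so this file can be loaded without dryscrape.
        import dryscrape as scraper
    if platform is None:
        platform = sys.platform

    if "linux" in platform:
        # Must run before Session(), otherwise NoX11Error is raised
        # when no X server is available.
        scraper.start_xvfb()

    session = scraper.Session()
    session.visit(url)
    return session.body()
```

Calling render_page('https://www.example.com/scrape.php') and passing the result to BeautifulSoup(html, "html.parser") would then mirror the working script in the question.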
    

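    【Comments】:

    • Thank you. I have added an updated, working Python script at the bottom of the question. The only extra thing I needed was to specify the HTML parser, as in soup = BeautifulSoup(response, "html.parser"). I really appreciate the help, as I spent 4 hours yesterday reading and trying to solve this.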