Are there other options besides Yahoo's YQL for extracting HTML from other websites? (Answers)

【Question title】: Are there other options besides Yahoo's YQL for extracting HTML from other websites?
【Posted】: 2017-06-26 13:03:32
【Question description】:

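In my application I use Yahoo's YQL API to extract HTML from other websites, but Yahoo has discontinued that feature: the YQL html table used for extracting HTML no longer works. The API now responds with: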

{
 "query": {
  "count": 0,
  "created": "2017-06-26T12:57:49Z",
  "lang": "en-US",
  "meta": {
   "message": "html table is no longer supported. See https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm for YQL Terms of Use"
  },
  "results": null
 }
}

It can be read here.
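Judging by the response above, the failure is reported inside the JSON body (results is null and meta.message carries the reason), so client code can check for it explicitly instead of silently finding no tags. The following is a minimal sketch under that assumption; yqlFailureMessage is an illustrative name, not part of any Yahoo API.

```javascript
// Sketch only: yqlFailureMessage is a hypothetical helper, not a Yahoo API.
// Returns a human-readable reason when a YQL JSON response carries no results,
// or null when results are present.
function yqlFailureMessage(response) {
    var query = response && response.query;
    if (!query) {
        return "unexpected response shape";
    }
    if (query.results === null || query.results === undefined) {
        var meta = query.meta || {};
        return meta.message || "no results returned";
    }
    return null;
}

// Example with the response quoted in the question
var sample = {
    query: {
        count: 0,
        created: "2017-06-26T12:57:49Z",
        lang: "en-US",
        meta: {
            message: "html table is no longer supported. See https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm for YQL Terms of Use"
        },
        results: null
    }
};
console.log(yqlFailureMessage(sample));
```

Calling such a check first inside the $.get callback would let the form show the service's own message rather than "no title found".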

So far, this is how I have been doing it:

$(function () {
    var fileFieldId;
    var fileFieldClass;
    var fileFieldVal;
    var query;
    var apiUrl;
    $(".data-from-url").keyup(function () {
        fileFieldId = $(this).attr('id');
        fileFieldClass = $(this).attr('class');
        fileFieldVal = $(this).val();
        // YQL query against the html table (the table Yahoo no longer supports)
        query = 'select * from html where url="' + fileFieldVal + '" and xpath="*"';
        apiUrl = 'https://query.yahooapis.com/v1/public/yql?q=' + encodeURIComponent(query);

        $.get(apiUrl, function (data) {
            var html = $(data).find('html');
            // Copy the Open Graph values into the form fields tied to this input
            $("input.post[data-title='" + fileFieldId + "']").val(html.find("meta[property='og:title']").attr('content') || 'no title found');
            $("textarea.post-description[data-description='" + fileFieldId + "']").val(html.find("meta[property='og:description']").attr('content') || 'no description found');
            $("input.post-remote-image[data-img='" + fileFieldId + "']").val(html.find("meta[property='og:image']").attr('content') || '');
        });
    });
});

Here is a jsfiddle for the call I am making:

$(function () {
    var query;
    var apiUrl;
    $("button.click").click(function () {
        //query = 'select * from htmlstring where url="' + $(this).val() + '" and xpath="//a"&format=json&env=store://datatables.org/alltableswithkeys&callback=';
        apiUrl = "https://query.yahooapis.com/v1/public/yql?q=select * from htmlstring where url='http://stackoverflow.com/'&format=json&diagnostics=true&env=store://datatables.org/alltableswithkeys&callback=";
        $('p.extract').toggle();
        $.get(apiUrl, function (data) {
            $('p.extract').addClass('none');
            var html = $(data).find('html');
            $("input.title").val(html.find("meta[property='og:title']").attr('content') || 'no title found');
            $("textarea.description").val(html.find("meta[property='og:description']").attr('content') || 'no description found');
            $("input.image").val(html.find("meta[property='og:image']").attr('content') || '');
        });
    });
});

input {
    width: 100%;
    margin-bottom: 20px;
    padding: 10px;
}

.none{display:none;}

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<button class="click">Click Me</button>
<br>
<p class="extract" style="display:none;">Extracting html</p>
<input type="text" class="title">
<br>
<textarea name="" id="" cols="30" rows="5" class="description"></textarea>
<br>
<input type="text" class="image">

Is there another way to extract the HTML meta tags from the head of other websites?

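【Question comments】:

Share the query string you used.
Create a server-side crawler.
@CodeIt I have just added the query to the question.
With this query I was able to get the full HTML of stackoverflow. If it works for you, I will post it as an answer.
Thanks @CodeIt, I appreciate your help. But how would it work if the API is down? And yes, as long as I can extract the meta data from the head, I would be very grateful :)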

Tags: jquery ajax api yql yahoo-api

【Solution 1】:

Extracting HTML with YQL

http://developer.yahoo.com/yql/console/?q=select%20*%20from%20htmlstring%20where%20url%3D'YOUR_ENCODED_URL_HERE'&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys

Example

http://developer.yahoo.com/yql/console/?q=select%20*%20from%20htmlstring%20where%20url%3D'http%3A%2F%2Fstackoverflow.com%2F'&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys

REST query

https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20htmlstring%20where%20url%3D'http%3A%2F%2Fstackoverflow.com%2F'&format=json&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys&callback=

Source

htmlstring is part of the community Open Data Tables.
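The REST query above can also be assembled in code rather than by hand. This is a minimal sketch, assuming the endpoint and parameters are exactly those shown in the answer and that the target URL contains no single quote; buildHtmlstringUrl is an illustrative name, not something YQL provides.

```javascript
// Sketch only: buildHtmlstringUrl is a hypothetical helper.
// Builds the htmlstring REST query from the answer for an arbitrary page URL.
// Assumes pageUrl contains no single quote, since it is placed inside '...'.
function buildHtmlstringUrl(pageUrl) {
    var query = "select * from htmlstring where url='" + pageUrl + "'";
    var env = "store://datatables.org/alltableswithkeys";
    return "https://query.yahooapis.com/v1/public/yql" +
        "?q=" + encodeURIComponent(query) +       // encode the query as one value
        "&format=json&diagnostics=true" +
        "&env=" + encodeURIComponent(env) +       // encode the env as one value
        "&callback=";
}

console.log(buildHtmlstringUrl("http://stackoverflow.com/"));
```

Encoding the query and the env value separately keeps the & and = separators out of the encoded parts, which is what goes wrong when a whole string such as 'select ... &format=json&env=...' is passed through encodeURIComponent at once.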

【Comments】:

Am I doing this right: query = 'select * from html where url="' + $(this).val() + '" and xpath="*"';apiUrl = 'http://developer.yahoo.com/yql/console/?q=' + encodeURIComponent(query);?
It is not from html; you need to select from htmlstring. See here.
Thank you, I appreciate your help. I'm doing this, but I get an error: GET query.yahooapis.com/v1/public/yql?q=select%20*%20from%20htmlstring%…e%20url%3D%22https%3A%2F%2Fstackoverflow.com%2F%22%20and%20xpath%3D%22* %22 400 (Bad Request). I don't understand what went wrong. Thanks for your help :)
My query becomes: query.yahooapis.com/v1/public/yql?q=select * from htmlstring where url="stackoverflow.com" and xpath=""* and I get an error.
I just ran the whole query directly and still cannot get the meta from the head. I get no found :/

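【Solution 2】:

Perhaps you could read the meta tags with a query selector? I use fetch to scrape Google Docs, which helpfully include all the document properties in HTML meta tags. I then put the HTML into a temporary document object, which I can query with querySelector whenever I see fit. For example: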

var url = "https://docs.google.com/presentation/d/1blSsU5LHnrjSjb7voHXkRA_NlWo3yNjLiyttmoWfslM/edit#slide=id.gcb9a0b074_1_0"
// After the scheme, the path splits into host, "presentation", "d", then the document id
var id = url.split("://")[1].split("/")[3];
var source = "https://docs.google.com/presentation/d/" + id + "/edit?usp=sharing";
fetch(source).then(function(response) {
        return response.text();
    }).then(function(html) {
        // Load the markup into a detached document so it can be queried
        var doc = document.implementation.createHTMLDocument("foo");
        doc.documentElement.innerHTML = html;
        // Guard against pages that have no og:description tag
        var meta = doc.querySelector("meta[property='og:description']");
        return meta ? meta.getAttribute("content") : null;
    }).then(function(description) {
       console.log("document description", description);
    });
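The same idea can be extended to the three fields the question fills in (og:title, og:description, og:image). The following is a minimal sketch, not the answerer's code: readOpenGraph, parseHtml and pageUrl are illustrative names, the fallback strings mirror those in the question, and the fetch call only works in a browser when the target site permits cross-origin requests.

```javascript
// Sketch only: readOpenGraph, parseHtml and pageUrl are hypothetical names.

// Collect the three Open Graph values from an already parsed document.
// Missing tags fall back to the placeholder strings used in the question.
function readOpenGraph(doc) {
    function content(property, fallback) {
        var el = doc.querySelector("meta[property='" + property + "']");
        var value = el ? el.getAttribute("content") : null;
        return value || fallback;
    }
    return {
        title: content("og:title", "no title found"),
        description: content("og:description", "no description found"),
        image: content("og:image", "")
    };
}

// Browser-only helper: turn an HTML string into a detached document.
function parseHtml(html) {
    return new DOMParser().parseFromString(html, "text/html");
}

// Usage in a browser (the target site must allow cross-origin requests):
// fetch(pageUrl)
//     .then(function (response) { return response.text(); })
//     .then(function (html) { console.log(readOpenGraph(parseHtml(html))); });
```

Keeping the tag lookup separate from the fetching and parsing means the same function can be reused if the HTML arrives from somewhere else, for example from a server-side crawler as suggested in the question comments.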

【Comments】: