Is there a Wikipedia API just for retrieving the content summary? (Answers)

【Question title】: Is there a Wikipedia API just for retrieving the content summary?
【Posted】: 2011-12-18 22:25:10
【Question description】:

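I just need to retrieve the first paragraph of a Wikipedia page.

The content must be HTML-formatted, ready to be displayed on my website (so no BBCode or Wikipedia-specific markup!).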

【Question discussion】:

Wikipedia doesn't use BBCode; it uses its own wiki markup.
It doesn't work for every Wikipedia article. ro.wikipedia.org/w/…

Tags: wikipedia wikipedia-api

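【Solution 1】:

There is a way to get the entire "intro section" without any HTML parsing! Similar to AnthonyS's answer, adding the explaintext parameter returns the intro section as plain text.

Query

Getting Stack Overflow's intro in plain text:

Using the page title: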

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Stack%20Overflow

Or using pageids:

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040

JSON response

(warnings stripped)

{
    "query": {
        "pages": {
            "21721040": {
                "pageid": 21721040,
                "ns": 0,
                "title": "Stack Overflow",
                "extract": "Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky, as a more open alternative to earlier Q&A sites such as Experts Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.\nIt features questions and answers on a wide range of topics in computer programming. The website serves as a platform for users to ask and answer questions, and, through membership and active participation, to vote questions and answers up or down and edit questions and answers in a fashion similar to a wiki or Digg. Users of Stack Overflow can earn reputation points and \"badges\"; for example, a person is awarded 10 reputation points for receiving an \"up\" vote on an answer given to a question, and can receive badges for their valued contributions, which represents a kind of gamification of the traditional Q&A site or forum. All user-generated content is licensed under a Creative Commons Attribute-ShareAlike license. Questions are closed in order to allow low quality questions to improve. Jeff Atwood stated in 2010 that duplicate questions are not seen as a problem but rather they constitute an advantage if such additional questions drive extra traffic to the site by multiplying relevant keyword hits in search engines.\nAs of April 2014, Stack Overflow has over 2,700,000 registered users and more than 7,100,000 questions. Based on the type of tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript, C#, PHP, Android, jQuery, Python and HTML."
            }
        }
    }
}

Documentation: API: query/prop=extracts
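
A minimal JavaScript sketch of the query above, assuming the response has the shape shown in the JSON sample. The helper names buildExtractUrl and firstExtract are made up for illustration and are not part of any Wikipedia library. It also shows one way to read the extract without knowing the page ID, by taking the first value of query.pages:

```javascript
// Build an Action API URL that asks for the plain-text intro of one page.
// buildExtractUrl is a hypothetical helper, not an existing library function.
function buildExtractUrl(title, lang = "en") {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    exintro: "1",      // intro section only
    explaintext: "1",  // plain text instead of HTML
    redirects: "1",    // follow redirects
    titles: title,
  });
  // URLSearchParams encodes the space in the title as "+".
  return `https://${lang}.wikipedia.org/w/api.php?${params}`;
}

// Return the extract of the first page in the response, or null if there is none.
// Works without knowing the numeric key under query.pages.
function firstExtract(response) {
  const pages = (response.query && response.query.pages) || {};
  const first = Object.values(pages)[0];
  return first && typeof first.extract === "string" ? first.extract : null;
}

// Possible usage (needs network access, so it is left commented out):
// const data = await (await fetch(buildExtractUrl("Stack Overflow") + "&origin=*")).json();
// console.log(firstExtract(data));
```

The extra origin=* in the commented usage is only needed for anonymous cross-origin requests from a browser.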

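【Discussion】:

Using &redirects=1 is highly recommended; it automatically redirects to the content of synonyms.
How do I get the information out of this JSON response if I don't know the page ID? I can't access the JSON array containing "extract".
@LaurynasG You can cast the object to an array and then grab it like this: $extract = current((array)$json_query->query->pages)->extract
@LaurynasG, @MarcGuay You can also add indexpageids as a parameter to the URL to get a list of page IDs for easier iteration.
@cglacet Yes. Just use the pageids= query parameter, as in https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=21721040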

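【Solution 2】:

There is actually a very nice prop called extracts that can be used with queries and is designed specifically for this purpose.

Extracts lets you get article extracts (truncated article text). There is a parameter called exintro that retrieves the text of the zeroth section (without extra assets such as images or infoboxes). You can also retrieve extracts at a finer granularity, by a number of characters (exchars) or a number of sentences (exsentences).

Here is a sample query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Note that if you want the first paragraph specifically, you still need to do some additional parsing, as suggested in the chosen answer. The difference here is that the response returned by this query is shorter than that of some of the other suggested API queries, because there are no extra assets such as images in the API response to parse.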
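
If the plain-text form is acceptable, the extra parsing for the first paragraph can be small. A rough sketch, assuming explaintext is added to the query so that paragraphs in the extract are separated by \n (as in the plain-text JSON response shown earlier on this page); firstParagraph is a hypothetical helper name:

```javascript
// Return the first non-empty paragraph of a plain-text extract.
// Assumes paragraphs are separated by "\n", which holds for the explaintext
// sample response on this page but is not guaranteed for every article.
function firstParagraph(plainExtract) {
  const paragraphs = plainExtract
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return paragraphs.length > 0 ? paragraphs[0] : "";
}

// Example with a shortened extract:
console.log(firstParagraph("Stack Overflow is a website.\nIt features questions and answers."));
```

For the HTML form of the extract (without explaintext) this does not apply, since paragraphs are then wrapped in p tags.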

【Discussion】:

What is a "prop"? A property?
The first link is (effectively) broken. There is no "extract" or "extracts" on that page.

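【Solution 3】:

Since 2017, Wikipedia has provided a REST API with better caching. In the documentation you can find the following API, which fits your use case perfectly (it is used by the new Page Previews feature).

https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow returns the following data, which can be used to display a summary with a small thumbnail: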

{
  "type": "standard",
  "title": "Stack Overflow",
  "displaytitle": "Stack Overflow",
  "extract": "Stack Overflow is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of Coding Horror, Atwood's popular programming blog.",
  "extract_html": "<p><b>Stack Overflow</b> is a question and answer site for professional and enthusiast programmers. It is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky. It features questions and answers on a wide range of topics in computer programming. It was created to be a more open alternative to earlier question and answer sites such as Experts-Exchange. The name for the website was chosen by voting in April 2008 by readers of <i>Coding Horror</i>, Atwood's popular programming blog.</p>",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "wikibase_item": "Q549037",
  "titles": {
    "canonical": "Stack_Overflow",
    "normalized": "Stack Overflow",
    "display": "Stack Overflow"
  },
  "pageid": 21721040,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/en/thumb/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png/320px-Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 320,
    "height": 149
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/en/f/fa/Stack_Overflow_homepage%2C_Feb_2017.png",
    "width": 462,
    "height": 215
  },
  "lang": "en",
  "dir": "ltr",
  "revision": "902900099",
  "tid": "1a9cdbc0-949b-11e9-bf92-7cc0de1b4f72",
  "timestamp": "2019-06-22T03:09:01Z",
  "description": "website hosting questions and answers on a wide range of topics in computer programming",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.wikipedia.org/wiki/Stack_Overflow?action=history",
      "edit": "https://en.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.wikipedia.org/wiki/Talk:Stack_Overflow"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.org/wiki/Stack_Overflow",
      "revisions": "https://en.m.wikipedia.org/wiki/Special:History/Stack_Overflow",
      "edit": "https://en.m.wikipedia.org/wiki/Stack_Overflow?action=edit",
      "talk": "https://en.m.wikipedia.org/wiki/Talk:Stack_Overflow"
    }
  },
  "api_urls": {
    "summary": "https://en.wikipedia.org/api/rest_v1/page/summary/Stack_Overflow",
    "metadata": "https://en.wikipedia.org/api/rest_v1/page/metadata/Stack_Overflow",
    "references": "https://en.wikipedia.org/api/rest_v1/page/references/Stack_Overflow",
    "media": "https://en.wikipedia.org/api/rest_v1/page/media/Stack_Overflow",
    "edit_html": "https://en.wikipedia.org/api/rest_v1/page/html/Stack_Overflow",
    "talk_page_html": "https://en.wikipedia.org/api/rest_v1/page/html/Talk:Stack_Overflow"
  }
}

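By default, it follows redirects (so /api/rest_v1/page/summary/StackOverflow also works), but this can be disabled with ?redirect=false.

If you need to access the API from another domain, you can set the CORS header with &origin= (e.g., &origin=*).

As of 2019: the API seems to return more useful information about the page.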
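
A small JavaScript sketch of how the summary response above might be consumed. It relies only on fields visible in the sample JSON (type, title, extract_html, thumbnail.source, content_urls.desktop.page); the helper names summaryUrl and pickSummary are invented for this example, and treating type "disambiguation" as "no usable summary" is my own assumption about what a caller would want:

```javascript
// Build the REST summary URL for a title; spaces become underscores.
function summaryUrl(title, lang = "en") {
  const slug = encodeURIComponent(title.trim().replace(/ /g, "_"));
  return `https://${lang}.wikipedia.org/api/rest_v1/page/summary/${slug}`;
}

// Reduce a summary response to the fields needed for a small preview card.
// Returns null for disambiguation pages (assumed to be unwanted here).
function pickSummary(summary) {
  if (summary.type === "disambiguation") return null;
  return {
    title: summary.title,
    html: summary.extract_html,
    thumbnail: summary.thumbnail?.source ?? null,
    link: summary.content_urls?.desktop?.page ?? null,
  };
}

// Possible usage (needs network access, so it is left commented out):
// const res = await fetch(summaryUrl("Stack Overflow"));
// const card = pickSummary(await res.json());
```

The html field is markup produced by Wikipedia; whether to insert it into a page as-is is a trust decision for the caller.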

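【Discussion】:

This also includes a "type", which is excellent if you need to know whether what you searched for has a "disambiguation".
I am getting a CORS error when trying to access this link from my Angular-based application. Can anyone tell me how to resolve that?
Is it also possible to query by Wikidata ID? I have some JSON data I extracted that looks like "other_tags" : "\"addr:country\"=>\"CW\",\"historic\"=>\"ruins\",\"name:nl\"=>\"Riffort\",\"wikidata\"=>\"Q4563360\",\"wikipedia\"=>\"nl:Riffort\"" Can we get the extract by the QID now?
Can this be used to load the summaries of multiple pages?
What @SouravChatterjee asked for: can this API be used to search by page ID? It seems not.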

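【Solution 4】:

This code lets you retrieve the content of the first paragraph of the page as plain text.

Parts of this answer come from here and thus from here. See the MediaWiki API documentation for more information.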

// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in JSON format
// prop=text: send the text content of the article
// section=0: top content of the page

$url = 'http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Baseball&prop=text&section=0';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);

$json = json_decode($c);

$content = $json->{'parse'}->{'text'}->{'*'}; // Get the main text content of the query (it's parsed HTML)

// Pattern for first match of a paragraph
$pattern = '#<p>(.*)</p>#Us'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if(preg_match($pattern, $content, $matches))
{
    // print $matches[0]; // Content of the first paragraph (including wrapping <p> tag)
    print strip_tags($matches[1]); // Content of the first paragraph without the HTML tags.
}
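
The pattern above only matches a bare <p> tag and takes the first paragraph even if it is empty. As a hedged variation on the same idea (written in JavaScript, not a port of the PHP above), here is a sketch that also accepts p tags with attributes and skips paragraphs that contain no text. The function name firstNonEmptyParagraph is hypothetical, and regex-based HTML handling like this is only a rough approximation:

```javascript
// Return the text of the first <p>...</p> that is not empty after removing
// tags and numeric reference markers such as [1]; null if there is none.
// HTML entities (e.g. &amp;) are left as they are.
function firstNonEmptyParagraph(html) {
  const paragraph = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  let match;
  while ((match = paragraph.exec(html)) !== null) {
    const text = match[1]
      .replace(/<[^>]*>/g, "")  // strip HTML tags
      .replace(/\[\d+\]/g, "")  // strip reference markers like [1]
      .replace(/\s+/g, " ")     // collapse whitespace
      .trim();
    if (text.length > 0) return text;
  }
  return null;
}

// Example with a made-up HTML fragment:
const sampleHtml = '<p class="empty">\n</p><p><b>Baseball</b> is a bat-and-ball sport.<sup>[1]</sup></p>';
console.log(firstNonEmptyParagraph(sampleHtml));
```

The input would be the parsed HTML string that the PHP code stores in $content.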

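【Discussion】:

But if you search for "coral", the result is not what you want. Is there any other way to pick out only the p tags that contain the summary?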

【Solution 5】:

Yes, there is. For example, if you want to get the content of the first section of the article Stack Overflow, use a query like this:

http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Stack%20Overflow&rvprop=content&rvsection=0&rvparse

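The parts mean the following:

format=xml: Return the result formatted as XML. Other options (such as JSON) are available. This does not affect the format of the page content itself, only the enclosing data format.
action=query&prop=revisions: Get information about the revisions of the page. Since we don't specify which revision, the latest one is used.
titles=Stack%20Overflow: Get information about the page Stack Overflow. You can get the text of several pages in one go if you separate their names with |.
rvprop=content: Return the content (or text) of the revision.
rvsection=0: Return only the content of section 0.
rvparse: Return the content parsed as HTML.

Keep in mind that this returns the whole first section, including hatnotes ("For other uses …"), infoboxes, or images.

There are several libraries available for various languages that make working with the API easier; it may be better to use one of them.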
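
For illustration, the parameters listed above can be assembled programmatically. This is a sketch only: buildSectionZeroUrl is a hypothetical helper, it merely mirrors the parameters of the query in this answer, and I have not verified how the server treats rvparse when several titles are requested at once:

```javascript
// Build the revisions query from this answer for one or more page titles.
// Titles are joined with "|" as described in the parameter list.
function buildSectionZeroUrl(titles, format = "xml") {
  const params = new URLSearchParams({
    format,               // enclosing data format: "xml" or "json"
    action: "query",
    prop: "revisions",
    titles: titles.join("|"),
    rvprop: "content",    // return the revision text
    rvsection: "0",       // only section 0
    rvparse: "",          // flag without a value: content parsed as HTML
  });
  // URLSearchParams encodes "|" as %7C and spaces as "+".
  return `https://en.wikipedia.org/w/api.php?${params}`;
}

console.log(buildSectionZeroUrl(["Stack Overflow"]));
```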

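【Discussion】:

I don't want the content parsed as HTML; I just want the "plain text" (without any Wikipedia markup).
The API doesn't offer anything like that. And I can understand why: from the API's perspective, it isn't clear what exactly this "plain text" should contain. For example, how should it represent tables, and should it include "[citation needed]", navigation boxes, or image descriptions?
Adding &redirects=true to the end of the link ensures you reach the destination article, if one exists.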

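【Solution 6】:

This is the code I'm using right now for a website I'm making that needs to get the leading paragraphs / summary / section 0 of Wikipedia articles, and it's all done in the browser (client-side JavaScript) thanks to the magic of JSONP! --> http://jsfiddle.net/gautamadude/HMJJg/1/

It uses the Wikipedia API to get the leading paragraphs (called section 0) as HTML, like so: http://en.wikipedia.org/w/api.php?format=json&action=parse&page=Stack_Overflow&prop=text&section=0&callback=?

It then strips the HTML and other unwanted data, giving you a clean string containing the article summary. If you want, with a little tweaking you can wrap each leading paragraph in a "p" HTML tag, but right now there is just a newline character between them.

Code: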

var url = "http://en.wikipedia.org/wiki/Stack_Overflow";
var title = url.split("/").slice(4).join("/");

// Get leading paragraphs (section 0)
$.getJSON("http://en.wikipedia.org/w/api.php?format=json&action=parse&page=" + title + "&prop=text&section=0&callback=?", function (data) {
    for (text in data.parse.text) {
        var text = data.parse.text[text].split("<p>");
        var pText = "";

        for (p in text) {
            // Remove HTML comment
            text[p] = text[p].split("<!--");
            if (text[p].length > 1) {
                text[p][0] = text[p][0].split(/\r\n|\r|\n/);
                text[p][0] = text[p][0][0];
                text[p][0] += "</p> ";
            }
            text[p] = text[p][0];

            // Construct a string from paragraphs
            if (text[p].indexOf("</p>") == text[p].length - 5) {
                var htmlStrip = text[p].replace(/<(?:.|\n)*?>/gm, '') // Remove HTML
                var splitNewline = htmlStrip.split(/\r\n|\r|\n/); //Split on newlines
                for (newline in splitNewline) {
                    if (splitNewline[newline].substring(0, 11) != "Cite error:") {
                        pText += splitNewline[newline];
                        pText += "\n";
                    }
                }
            }
        }
        pText = pText.substring(0, pText.length - 2); // Remove extra newline
        pText = pText.replace(/\[\d+\]/g, ""); // Remove reference tags (e.x. [1], [4], etc)
        document.getElementById('textarea').value = pText
        document.getElementById('div_text').textContent = pText
    }
});

【Discussion】:

Do you add this to the client-side script? If so, isn't that XSS?
It has a lot of bugs; try this link with your script: en.wikipedia.org/wiki/Modular_Advanced_Armed_Robotic_System

【Solution 7】:

This URL returns the summary in XML format.

http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=Agra&MaxHits=1

I have created a function to fetch the description of a keyword from Wikipedia.

function getDescription($keyword) {
    $url = 'http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=' . urlencode($keyword) . '&MaxHits=1';
    $xml = simplexml_load_file($url);
    return $xml->Result->Description;
}

echo getDescription('agra');

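【Discussion】:

【Solution 8】:

You can also get content such as the first paragraph via DBPedia, which takes Wikipedia content, creates structured information (RDF) from it, and makes it available through an API. The DBPedia API is a SPARQL one (RDF-based), but it outputs JSON and is pretty easy to wrap.

As an example, here is a super simple JavaScript library named WikipediaJS that can extract structured content, including a summary first paragraph.

You can read more about it in this blog post: WikipediaJS - accessing Wikipedia article data through Javascript

The JavaScript library code can be found in wikipedia.js.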

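【Discussion】:

【Solution 9】:

The abstract.xml.gz dump sounds like what you want.

【Discussion】:

【Solution 10】:

If you are just looking for text that you can then split up, but don't want to use the API, take a look at en.wikipedia.org/w/index.php?title=Elephant&action=raw.

【Discussion】:

"Ready to be displayed on my website (so no BBCode or Wikipedia-specific markup!)" And this is exactly the opposite.

【Solution 11】:

My approach was as follows (in PHP):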

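$url = "whatever_you_need"; // the search term, despite the variable name

// urlencode() keeps spaces and special characters in the search term from breaking the request URL
$html = file_get_contents('https://en.wikipedia.org/w/api.php?action=opensearch&search=' . urlencode($url));
// Convert U+XXXX sequences into the corresponding UTF-8 characters
$utf8html = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $html), ENT_NOQUOTES, 'UTF-8');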

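$utf8html might need further cleaning, but that's basically it.

【Discussion】:

It's better to ask the API for UTF-8 with &utf8=

【Solution 12】:

I tried Michael Rapadas' and @Krinkle's solutions, but in my case I had trouble finding some articles depending on the capitalization. Like here: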

https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&exsentences=1&explaintext=&titles=Led%20zeppelin

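Note that I truncated the response with exsentences=1

Apparently "title normalization" was not working correctly:

Title normalization converts page titles to their canonical form. This means capitalizing the first character, replacing underscores with spaces, and changing the namespace to the localized form defined for that wiki. Title normalization is done automatically, regardless of which query modules are used. However, any trailing line breaks in page titles (\n) will cause odd behavior and should be stripped out first.

I know I could have sorted out the capitalization issue easily, but there was also the inconvenience of having to cast the object to an array.

So, because I really wanted the very first paragraph of a well-known and well-defined search (with no risk of fetching information from other articles), I did it like this: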

https://en.wikipedia.org/w/api.php?action=opensearch&search=led%20zeppelin&limit=1&format=json

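Note that in this case I did the truncation with limit=1

This way:

I can access the response data very easily.
The response is quite small.

But we still have to be careful with the capitalization of our search.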
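
To illustrate why the response is easy to access, here is a short JavaScript sketch. It assumes the opensearch JSON response is a four-element array of the form [query, titles, descriptions, urls]; parseOpenSearch is a name made up for this example. An empty description is mapped to null, since the description entry may be an empty string:

```javascript
// Turn an opensearch response [query, titles, descriptions, urls]
// into a list of { title, description, url } objects.
function parseOpenSearch(response) {
  const [, titles = [], descriptions = [], urls = []] = response;
  return titles.map((title, i) => ({
    title,
    description: descriptions[i] || null, // empty string becomes null
    url: urls[i] || null,
  }));
}

// Example with a hand-written response in the assumed shape:
const results = parseOpenSearch([
  "led zeppelin",
  ["Led Zeppelin"],
  [""],
  ["https://en.wikipedia.org/wiki/Led_Zeppelin"],
]);
console.log(results[0].title, results[0].url);
```

With limit=1 the list has at most one entry, so results[0] is the only match to check.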

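【Discussion】:

There is no user named "Krinkle" here. Which answer does it refer to? It is one of "01AutoMonkey", "AnthonyS", and "Alex". Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).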