How can I get Wikipedia content using Wikipedia's API?

【Question title】: How can I get Wikipedia content using Wikipedia's API?
【Posted】: 2011-11-03 08:36:17
【Question description】:

I want to get the first paragraph of a Wikipedia article.

What is the API query to do this?

【Question discussion】:

Tags: wikipedia-api

【Solution 1】:

See this section in the MediaWiki documentation.

These are the key parameters.

prop=revisions&rvprop=content&rvsection=0

rvsection=0 specifies that only the lead section is returned.

See this example.

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=pizza

To get HTML instead, you can similarly use action=parse: http://en.wikipedia.org/w/api.php?action=parse&section=0&prop=text&page=pizza

Note that you will have to strip out any templates or infoboxes yourself.
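
As a rough sketch, the wikitext query above could be built and read from Python with the standard library only. The helper names `build_lead_query`, `lead_wikitext`, and `fetch_json` are hypothetical, the User-Agent string is a placeholder, and the response layout (wikitext under the `*` key of the first revision) is an assumption about the default JSON format that may differ with other format options.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_lead_query(title):
    """Build the URL that asks for the wikitext of section 0 of `title`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvsection": 0,
        "titles": title,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def lead_wikitext(response):
    """Pull the wikitext out of a decoded response (assumed default layout)."""
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))  # results are keyed by page ID
    revisions = page.get("revisions")
    if not revisions:
        return None  # missing page, or no content was returned
    return revisions[0].get("*")


def fetch_json(url):
    """Perform the request; the User-Agent value is only a placeholder."""
    request = Request(url, headers={"User-Agent": "lead-section-example/0.1"})
    with urlopen(request) as reply:
        return json.load(reply)


# Usage (needs network access):
#   print(lead_wikitext(fetch_json(build_lead_query("pizza"))))
```

The result is raw wikitext, so templates and infoboxes are still present and have to be removed separately.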

【Discussion】:

Do I have to send an action=parse query after getting this value?
I want clean text. Should I write my own parser, or is there an API query that can do this? Thanks.

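【Solution 2】:

If you need to do this for a large number of articles, then instead of querying the website directly, consider downloading a Wikipedia database dump and accessing it through an API such as JWPL.

【Discussion】: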

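【Solution 3】:

See Is there a Wikipedia API just for retrieve the content summary? for other proposed solutions. Here is the one I suggested:

There is actually a very nice prop called extracts that can be used with queries and is designed specifically for this purpose. Extracts let you get article extracts (truncated article text). There is a parameter called exintro that can be used to retrieve the text of the zeroth section (without additional assets such as images or infoboxes). You can also retrieve extracts at a finer granularity, such as by a certain number of characters (exchars) or by a certain number of sentences (exsentences).

Here is an example query http://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow and the API sandbox http://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exintro=&titles=Stack%20Overflow to experiment more with this query.

Note that if you specifically want the first paragraph, you still need to get the first <p> tag. However, in this API call there are no additional assets such as images to parse. If you are satisfied with this intro summary, you can retrieve the text by running a function that removes HTML tags, such as PHP's strip_tags.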
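
A minimal Python sketch of that last step, assuming the extract has already been fetched as an HTML string: it keeps only the text of the first non-empty paragraph and drops the tags, roughly what a tag-stripping function would do. The class and function names are hypothetical, and the sample HTML in the usage comment is made up for illustration.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first non-empty <p> element, dropping tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_paragraph = False  # True while inside a <p> we still care about
        self.parts = []            # text pieces of the current paragraph
        self.result = None         # set once a non-empty paragraph is closed

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.result is None:
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self.parts).strip()
            self.parts = []
            if text:  # skip empty placeholder paragraphs
                self.result = text

    def handle_data(self, data):
        if self.in_paragraph:
            self.parts.append(data)


def first_paragraph(extract_html):
    """Return the plain text of the first non-empty paragraph, or None."""
    parser = FirstParagraph()
    parser.feed(extract_html)
    parser.close()
    return parser.result


# Usage with an illustrative string:
#   first_paragraph("<p><b>Stack Overflow</b> is a website.</p><p>More.</p>")
```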

【Discussion】:

Thanks for your answer, the extracts prop is very useful.

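【Solution 4】:

You can download the Wikipedia database directly and parse all pages to XML with Wiki Parser, which is a standalone application. The first paragraph is a separate node in the resulting XML.

Alternatively, you can extract the first paragraph from its plain-text output.

【Discussion】: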

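【Solution 5】:

You can get the introduction of a Wikipedia article by querying a URL such as https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=java. You only need to parse the JSON; the result is plain text that has already been cleaned, with links and references removed.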
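
A small Python sketch of building that URL and reading the intro out of the decoded JSON. The function names are hypothetical, and the response layout (pages keyed by page ID, each with an `extract` field) is assumed from the default JSON format.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_intro_query(title):
    """Build the extracts query for the plain-text intro of `title`."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "exintro": "",       # flag parameter: only the intro section
        "explaintext": "",   # flag parameter: plain text instead of HTML
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)


def intro_text(response):
    """Return the extract of the first page that has one, else None."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():  # keys are page IDs, so iterate the values
        if "extract" in page:
            return page["extract"]
    return None
```

A missing title comes back as a page entry without an `extract` field, which this sketch reports as `None`.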

【Discussion】:

【Solution 6】:

This is how I did it:

https://en.wikipedia.org/w/api.php?action=opensearch&search=bee&limit=1&format=json

The response you get is an array containing the data, which is easy to parse:

[
  "bee",
  [
    "Bee"
  ],
  [
    "Bees are flying insects closely related to wasps and ants, known for their role in pollination and, in the case of the best-known bee species, the European honey bee, for producing honey and beeswax."
  ],
  [
    "https://en.wikipedia.org/wiki/Bee"
  ]
]

To get only the first paragraph, limit=1 is what you need.
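
As an illustration, the four-element array shown above could be unpacked in Python like this. The function name is hypothetical, and the sketch tolerates an empty description list rather than assuming a description is always present.

```python
def first_opensearch_hit(response):
    """Return (title, description, url) for the first hit, or None if no hit."""
    # The array is: [search term, titles, descriptions, URLs]
    _search_term, titles, descriptions, urls = response
    if not titles:
        return None
    description = descriptions[0] if descriptions else ""
    url = urls[0] if urls else ""
    return titles[0], description, url
```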

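【Discussion】:

Strangely, this method no longer works. It doesn't give me any description.
Same here, this method returns no description. Is there a reason for that?
I am also seeing that this endpoint no longer provides any descriptions...

【Solution 7】: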

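<script>
    // Fetch the plain-text intro of a Wikipedia article and show it in #Label11.
    function dowiki(place) {
        var url = 'https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=';

        // Encode the title so spaces and special characters survive in the query string.
        url += "&titles=" + encodeURIComponent(place);
        // "callback=?" makes jQuery issue the request as JSONP.
        url += "&callback=?";
        $.getJSON(url, function (data) {
            var label = document.getElementById('Label11');
            try {
                // Results are keyed by page ID, so take the first (only) page.
                var pages = data.query.pages;
                var pageId = Object.keys(pages)[0];
                console.log(pages[pageId]["extract"]);
                label.textContent = pages[pageId]["extract"];
            }
            catch (err) {
                label.textContent = err.message;
            }
        });
    }
</script>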

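【Discussion】:

Consider adding some textual description to your answer :) (i.e. what it brings compared to the others)
An explanation would be in order. Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar; the answer should appear as if it was written today).

【Solution 8】:

You can do this with jQuery. First create the URL with the appropriate parameters. Check this link to see what the parameters mean. Then use the $.ajax() method to retrieve the articles. Note that Wikipedia does not allow cross-origin requests by default. That is why we use dataType: jsonp in the request.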

// Build the opensearch query string from a plain object.
var wikiURL = "https://en.wikipedia.org/w/api.php";
wikiURL += '?' + $.param({
    'action' : 'opensearch',
    'search' : 'your_search_term',
    'format' : 'json',
    'limit' : 10
});

// JSONP sidesteps the same-origin policy for this request.
$.ajax({
    url: wikiURL,
    dataType: 'jsonp',
    success: function(data) {
        // data is [search term, titles, descriptions, URLs]
        console.log(data);
    }
});

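【Discussion】:

【Solution 9】:

You can use the extract_html field of the summary REST endpoint for this, e.g. https://en.wikipedia.org/api/rest_v1/page/summary/Cat.

Note: this aims to simplify the content a bit by removing most of the pronunciations, which in some cases appear mainly in parentheses.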
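
A hedged Python sketch of using the summary endpoint: one helper builds the URL for a title and another picks the summary fields out of the decoded JSON. The helper names are hypothetical, and reading a plain-text `extract` field alongside `extract_html` is an assumption about the response.

```python
from urllib.parse import quote


def summary_url(title, site="https://en.wikipedia.org"):
    """Build the REST summary URL for `title` on the given wiki."""
    # Spaces become underscores; safe="" also percent-encodes "/" in titles.
    encoded = quote(title.replace(" ", "_"), safe="")
    return "{}/api/rest_v1/page/summary/{}".format(site, encoded)


def summary_fields(response):
    """Return (plain_text, html) from a decoded summary response."""
    return response.get("extract"), response.get("extract_html")


# Usage: fetch summary_url("Cat") with any HTTP client, decode the JSON body,
# and pass the resulting dict to summary_fields().
```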

【Discussion】:

This should be the top answer. Super simple.

【Solution 10】:

Assuming keyword = "Batman" (the term you want to search for), use:

https://en.wikipedia.org/w/api.php?action=parse&page={{keyword}}&format=json&prop=text&section=0

to get the summary/first paragraph from Wikipedia in JSON format.
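
One way to fill in the `{{keyword}}` placeholder safely is to let the standard library encode it, sketched below in Python. The function names are hypothetical, and the assumed response layout (HTML under `parse` → `text` → `*`) reflects the default JSON format and may differ with other format options.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_query(keyword):
    """Build the action=parse URL for section 0 of the page named `keyword`."""
    params = {
        "action": "parse",
        "page": keyword,
        "format": "json",
        "prop": "text",
        "section": 0,
    }
    return API_URL + "?" + urlencode(params)


def lead_html(response):
    """Return the rendered HTML of section 0, or None on an API error."""
    if "error" in response:
        return None
    # Assumed layout: {"parse": {"text": {"*": "<div>...</div>"}}}
    return response["parse"]["text"]["*"]
```

The returned HTML still contains infobox and image markup, so it needs further cleaning before it reads as a first paragraph.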

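【Discussion】:

【Solution 11】:

To get the first paragraph of an article:

https://en.wikipedia.org/w/api.php?action=query&titles=Belgrade&prop=extracts&format=json&exintro=1

I wrote up short Wikipedia API docs for my own needs. They contain working examples of how to get articles, images, and similar content.

【Discussion】:

【Solution 12】:

Here is a program that can dump the French and English Wiktionary and Wikipedia: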

import sys
import asyncio
import urllib.parse
from uuid import uuid4

import httpx
import found
from found import nstore
from found import bstore
from loguru import logger as log

try:
    import ujson as json
except ImportError:
    import json


# XXX: https://github.com/Delgan/loguru
log.debug("That's it, beautiful and simple logging!")


async def get(http, url, params=None):
    response = await http.get(url, params=params)
    if response.status_code == 200:
        return response.content

    log.error("http get failed with url and response: {} {}", url, response)
    return None



def make_timestamper():
    import time
    start_monotonic = time.monotonic()
    start = time.time()
    loop = asyncio.get_event_loop()

    def timestamp():
        # Wanna be faster than datetime.now().timestamp()
        # approximation of current epoch time.
        out = start + loop.time() - start_monotonic
        out = int(out)
        return out

    return timestamp


async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki.rstrip("/"))
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }

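    while True:
        content = await get(http, url, params=params)
        if content is None:
            # Pause before retrying so a failing request is not repeated in a tight loop.
            await asyncio.sleep(1)
            continue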
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]
        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue


async def wikimedia_html(http, wiki="https://en.wikipedia.org/", title="Apple"):
    # e.g. https://en.wikipedia.org/api/rest_v1/page/html/Apple
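    # safe="" also percent-encodes "/" so titles such as "AC/DC" stay a single path segment.
    url = "{}/api/rest_v1/page/html/{}".format(wiki.rstrip("/"), urllib.parse.quote(title, safe=""))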
    out = await get(http, url)
    return wiki, title, out


async def save(tx, data, blob, doc):
    uid = uuid4()
    doc['html'] = await bstore.get_or_create(tx, blob, doc['html'])

    for key, value in doc.items():
        nstore.add(tx, data, uid, key, value)

    return uid


WIKIS = (
    "https://en.wikipedia.org/",
    "https://fr.wikipedia.org/",
    "https://en.wiktionary.org/",
    "https://fr.wiktionary.org/",
)

async def chunks(iterable, size):
    # chunk async generator https://stackoverflow.com/a/22045226
    while True:
        out = list()
        for _ in range(size):
            try:
                item = await iterable.__anext__()
            except StopAsyncIteration:
                yield out
                return
            else:
                out.append(item)
        yield out


async def main():
    # logging
    log.remove()
    log.add(sys.stderr, enqueue=True)

    # singleton
    timestamper = make_timestamper()
    database = await found.open()
    data = nstore.make('data', ('sourcery-data',), 3)
    blob = bstore.make('blob', ('sourcery-blob',))

    async with httpx.AsyncClient() as http:
        for wiki in WIKIS:
            log.info('Getting started with wiki at {}', wiki)
            # Polite limit @ https://en.wikipedia.org/api/rest_v1/
            async for chunk in chunks(wikimedia_titles(http, wiki), 200):
                log.info('iterate')
                coroutines = (wikimedia_html(http, wiki, title) for title in chunk)
                items = await asyncio.gather(*coroutines, return_exceptions=True)
                for item in items:
                    if isinstance(item, Exception):
                        msg = "Failed to fetch html on `{}` with `{}`"
                        log.error(msg, wiki, item)
                        continue
                    wiki, title, html = item
                    if html is None:
                        continue
                    log.debug(
                        "Fetch `{}` at `{}` with length {}",
                        title,
                        wiki,
                        len(html)
                    )

                    doc = dict(
                        wiki=wiki,
                        title=title,
                        html=html,
                        timestamp=timestamper(),
                    )

                    await found.transactional(database, save, data, blob, doc)


if __name__ == "__main__":
    asyncio.run(main())

Another way to get Wikimedia data is to rely on Kiwix ZIM dumps.

【Discussion】: