【Question title】: Can I use pywikipedia to get just the text of a page?
【Posted】: 2009-06-20 15:49:27
【Question】:

Is it possible, using pywikipedia, to get only the text of a page, without any internal links or templates, and without images and so on?

【Comments】:

    Tags: python wiki mediawiki pywikibot


    【Answer 1】:

    If you mean "I just want the wikitext", look at the wikipedia.Page class and its get method.

    import wikipedia
    
    site = wikipedia.getSite('en', 'wikipedia')
    page = wikipedia.Page(site, 'Test')
    
    print page.get() # '''Test''', '''TEST''' or '''Tester''' may refer to:
    #==Science and technology==
    #* [[Concept inventory]] - an assessment to reveal student thinking on a topic.
    # ...
    

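    This gives you the complete raw wikitext of the article.

    If you want to strip out the wiki syntax, for example turning [[Concept inventory]] into Concept inventory and so on, it gets a bit more painful.

    The main reason for the trouble is that MediaWiki wikitext has no formally defined grammar, which makes it very hard to parse and to strip. I currently know of no software that does this accurately. There is of course the MediaWiki Parser class, but it is written in PHP, is a bit hard to grasp, and serves a very different purpose.

    However, if you only want to strip links, or other very simple wiki constructs, use regular expressions: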

    import re

    text = re.sub(r'\[\[([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
    print(text)  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    

    Then, for piped links:

    text = re.sub(r'\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
    print(text)  # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.
    

    And so on.
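The two substitutions above can be folded into one pass. The following is only a sketch using the standard re module; the names LINK_RE and strip_links are made up for illustration and are not part of pywikipedia. It handles plain and piped links only, and leaves nested constructs such as image links with links in their captions untouched.

```python
import re

# Matches [[target]] and [[target|label]]; group 2 is the label when present.
LINK_RE = re.compile(r'\[\[([^\]\|]*)(?:\|([^\]\|]*))?\]\]')


def strip_links(text):
    """Replace simple wiki links with their visible text (hypothetical helper)."""
    def visible(match):
        target, label = match.group(1), match.group(2)
        # A piped link shows its label, a plain link shows its target.
        return label if label is not None else target
    return LINK_RE.sub(visible, text)


print(strip_links('Lorem [[ipsum]] dolor [[sit|SIT]] amet.'))
# Lorem ipsum dolor SIT amet.
```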

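    But there is, for example, no reliable simple way to strip nested templates from a page. The same goes for images that have links in their captions. It is quite hard, and involves recursively removing the innermost link, replacing it with a marker, and starting over. Have a look at the templateWithParams function in wikipedia.py if you want, but it is not pretty.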
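To make the "innermost first, then start over" idea concrete, here is a simplified sketch. It is not what templateWithParams does: it deletes each innermost template outright instead of replacing it with a marker, and the names INNERMOST_TEMPLATE_RE, strip_templates and max_passes are assumptions made for this example. It does not cope with triple-brace parameters such as {{{1}}} or with braces inside nowiki sections.

```python
import re

# A template with no braces inside it, i.e. an innermost template.
INNERMOST_TEMPLATE_RE = re.compile(r'\{\{[^{}]*\}\}')


def strip_templates(text, max_passes=50):
    """Remove templates from the inside out until nothing changes."""
    for _ in range(max_passes):
        stripped = INNERMOST_TEMPLATE_RE.sub('', text)
        if stripped == text:
            # No innermost template was found, so we are done.
            break
        text = stripped
    return text


print(repr(strip_templates('Before {{outer|{{inner|x}}|y}} after')))
# 'Before  after'
```

Each pass peels off one nesting level, so the number of passes grows with the nesting depth rather than with the number of templates.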

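    【Comments】:

    • Apparently I misunderstood the scope of the question. Given that there were no other answers, I did my best. :-)
    【Answer 2】:

    There is a module called mwparserfromhell on GitHub that can get you very close to what you want, depending on your needs. It has a method called strip_code() that strips out a lot of the markup.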

    import pywikibot
    import mwparserfromhell
    
    test_wikipedia = pywikibot.Site('en', 'test')
    text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get()
    
    full = mwparserfromhell.parse(text)
    stripped = full.strip_code()
    
    print(full)
    print('*******************')
    print(stripped)
    

    Comparison snippet:

    {{db-foreign}}
    <!--  Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] -->
    
    [[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']]
    
    [[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']]
    
    '''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person.   
    
    ==Publication history==
    Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 
    
    
    *******************
    
    thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned''
    
    '''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person.   
    
    Publication history
    Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 
    

    【Comments】:

      【Answer 3】:

      You can use wikitextparser. For example:

      import pywikibot
      import wikitextparser
      en_wikipedia = pywikibot.Site('en', 'wikipedia')
      text = pywikibot.Page(en_wikipedia,'Bla Bla Bla').get()
      print(wikitextparser.parse(text).sections[0].plain_text())
      

      This gives you:

      "Bla Bla Bla" is a song written and recorded by Italian DJ Gigi D'Agostino. It heavily samples the vocals of "Why did you do it?" by British band Stretch. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. It was sampled in the song "Jump" from Lupe Fiasco's 2017 album Drogas Light.
      

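      【Comments】:

        【Answer 4】:

        Pywikibot can remove any wikitext or HTML tags. There are two functions in textlib:

        1. removeHTMLParts(text: str, keeptags=['tt', 'nowiki', 'small', 'sup']) -> str:

          Returns the text without the portions where HTML markup is disabled, but keeps the text between the HTML tags. For example: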

           from pywikibot import textlib
           text = 'This is <small>small</small> text'
           print(textlib.removeHTMLParts(text, keeptags=[]))
          

          This prints:

           This is small text
          
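        2. removeDisabledParts(text: str, tags=None, include=[], site=None) -> str: returns the text without the portions where wiki markup is disabled. Unlike the first function, it also removes the text inside the matched markup. For example: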

           from pywikibot import textlib
           text = 'This is <small>small</small> text'
           print(textlib.removeDisabledParts(text, tags=['small']))
          

          This prints:

           This is  text
          

          There are many predefined tags that can be removed or kept, such as 'comment', 'header', 'link' and 'template'.

          The default value of the tags parameter is ['comment', 'includeonly', 'nowiki', 'pre', 'syntaxhighlight'].

          Some other examples:

          removeDisabledParts('See [[this link]]', tags=['link']) gives 'See '
          removeDisabledParts('<!-- no comments -->', tags=['comment']) gives ''
          removeDisabledParts('{{Infobox}}', tags=['template']) gives '', but only with Pywikibot 6.0.0 or later
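The difference between the two functions (keep the enclosed text versus drop it) can be illustrated without Pywikibot installed. The sketch below is a standalone approximation built on the standard re module. It is not Pywikibot's implementation, and the helper names drop_tags_keep_text and drop_tags_and_text are invented for this example.

```python
import re


def drop_tags_keep_text(text, tag):
    """Remove <tag> and </tag> markers but keep what is between them."""
    pattern = r'</?%s\b[^>]*>' % re.escape(tag)
    return re.sub(pattern, '', text, flags=re.IGNORECASE)


def drop_tags_and_text(text, tag):
    """Remove <tag>...</tag> together with the enclosed text."""
    pattern = r'<%s\b[^>]*>.*?</%s\s*>' % (re.escape(tag), re.escape(tag))
    return re.sub(pattern, '', text, flags=re.IGNORECASE | re.DOTALL)


sample = 'This is <small>small</small> text'
print(drop_tags_keep_text(sample, 'small'))  # This is small text
print(drop_tags_and_text(sample, 'small'))   # This is  text
```

For real pages the textlib functions are the safer choice, since a regular expression like this does not handle nested tags of the same name.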

        【Comments】:
