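[Posted]: 2020-09-26 05:44:17
[Question]:
I am trying to fetch public user information from Wikipedia through the API (using the script get_pages_revisions.py). After retrieving the revisions, I use BeautifulSoup to strip all HTML tags. However, the remaining text is still very messy.
For example, when I fetch the text for User:(aeropagitica), the result looks like this (a small excerpt):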
{{administrator}}
{{divbox|gray||Wikipedia is currently working on {{NUMBEROFARTICLES}} articles. The local time at the Wikipedia servers is '''{{CURRENTTIME}}''' on {{CURRENTDAYNAME}} {{CURRENTDAY}} {{CURRENTMONTHNAME}}, {{CURRENTYEAR}}.}}
• '''[[:WP:AIV|AIV]]''' •
'''[[Wikipedia:Articles for deletion/Log/{{CURRENTYEAR}} {{CURRENTMONTHNAME}} {{CURRENTDAY}}|AfD]]''' • '''[[User:(aeropagitica)/RFA summary|RfA]]''' • '''[[:Category:Candidates for speedy deletion|CSD]]''' • '''[[Wikipedia:Template messages|tpl]]''' • '''[[Wikipedia:Template_messages/User_talk_namespace|user talk tpl]]''' • '''[[Special:Newpages|new]]''' • '''[[Wikipedia:Stubs|stubs]]''' • '''[[Wikipedia:Copyright problems|(c)]]''' • '''[[Wikipedia:Manual of Style|MoS]]''' • '''[[User:Interiot/Tool2|edits (interiot)]]''' • '''[[Wikipedia:Proposed_deletion|prod]]''' • '''[[Special:Log/Newusers|newusers]]''' • '''[http://tools.wikimedia.de/~essjay/edit_count/Count.php? PHP interiot's tool]''' • '''[http://tools.wikimedia.de/~interiot/cgi-bin/Tool1/wannabe_kate Interiot's tool 1]''' • '''[[:Wikipedia:Article Creation and Improvement Drive|Article Improvement]]'''
{{purge|Purge server cache}}
I was [[Wikipedia:Requests_for_adminship/%28aeropagitica%29|nominated for adminship]] by [[User:King of Hearts|King of Hearts]] on February 27th 2006. The vote achieved consensus and I was accepted for the role with a score of '''40/10/5''' on March 7th 2006.
When I am not working on Wikipedia pages, I enjoy learning to play acoustic fingerstyle guitar, photography, learning languages (Spanish and French) and travel.
''Userboxes''
{| style="text-align:center; border: 1px solid #000000; background-color:#00cc99; width:100%; -moz-border-radius: 15px;"
|- padding:5em;padding-top:0.5em;"
|{{user en}}
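My questions:
- How can I remove strings such as style="...." and cellpadding="...." here? Can I remove all of these formatting strings in one pass?
- There are many blocks like this:
{{Userbox|#77E0E8|#D0F8FF|{{CURRENTDAY}}|It is currently a [[{{CURRENTDAYNAME}}]]. I don't like {{CURRENTDAYNAME}}s.}}
The information after "It is .." is what we need, but the text before it, Userbox|#77E0E8, is only there for page layout and should be removed. Is there a way to remove the first half of this line?
(Userbox is only one kind; there are many others, such as User: and Category:, so it would be hard to remove them all with custom re rules.)
(I am a beginner with BeautifulSoup and web parsers, so any advice or hints would be valuable. Thanks in advance for your help!)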
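To make the two requests concrete: the excerpt above is wikitext rather than HTML, which is why an HTML parser leaves it largely untouched. The following is a minimal sketch of the clean-up being asked for, using only the standard-library `re` module. The function name `strip_wikitext` and the rule "keep the last template argument" are assumptions made for illustration, not part of any Wikipedia tooling, and the patterns cover only the constructs that appear in the excerpt.

```python
import re

# Table start ({|), row separator (|-) and table end (|}) lines carry only layout.
TABLE_LINE = re.compile(r"^[ \t]*(?:\{\||\|-|\|\}(?!\})).*$", re.MULTILINE)
# [[target|label]] or [[target]]; the captured group is the visible text.
WIKILINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")
# [http://host/path label]; the captured group is the label (may be empty).
EXT_LINK = re.compile(r"\[https?://[^\s\]]+\s*([^\]]*)\]")
# A template that has no other template nested inside it.
INNERMOST_TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")
# Runs of two or more apostrophes mark bold/italic text.
BOLD_ITALIC = re.compile(r"'{2,}")


def _template_text(match):
    parts = match.group(1).split("|")
    # {{name}} has no arguments, so it is dropped. Otherwise keep the last
    # argument: an assumption about where the readable text lives.
    return parts[-1].strip() if len(parts) > 1 else ""


def strip_wikitext(text):
    """Hypothetical helper: reduce a wikitext fragment to rough plain text."""
    text = TABLE_LINE.sub("", text)
    text = WIKILINK.sub(r"\1", text)
    text = EXT_LINK.sub(r"\1", text)
    # Resolve templates from the inside out until none are left.
    while True:
        text, count = INNERMOST_TEMPLATE.subn(_template_text, text)
        if count == 0:
            break
    text = BOLD_ITALIC.sub("", text)
    lines = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|"):  # leftover table-cell marker
            line = line[1:].strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


sample = (
    "{{Userbox|#77E0E8|#D0F8FF|{{CURRENTDAY}}|It is currently a "
    "[[{{CURRENTDAYNAME}}]]. I don't like {{CURRENTDAYNAME}}s.}}"
)
print(strip_wikitext(sample))
```

Because argument-less templates such as {{CURRENTDAYNAME}} are simply dropped, the output has gaps where the magic words were. Regex rules like these are fragile against the full wikitext grammar (named parameters, links with pipes inside templates, cell attributes), so a dedicated wikitext parser such as mwparserfromhell is likely a sturdier choice for real data.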
[Discussion]:
Tags: html parsing beautifulsoup html-parsing mediawiki