【Question title】: Getting encoded text while scraping data from a URL using BeautifulSoup in Python
【Posted】: 2017-11-14 09:00:57
【Question】:

HTML fragment:

[<div class="hidden_elem"><code id="u_0_8"><!-- <div class="_4-u2 _5z71 _18ib _4-u8"><div class="_4-u3 _5z73"><div class="clearfix"><div class="lfloat _ohe"><a class="_5z74" href="/events/dialog/public_guest_list/?acontext%5Bref%5D=51&amp;acontext%5Bsource%5D=1&amp;acontext%5Baction_history%5D=%5B%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22surface%22%2C%22extra_data%22%3A%5B%5D%7D%2C%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22guest_list%22%2C%22extra_data%22%3A%5B%5D%7D%5D&amp;acontext%5Bhas_source%5D=1&amp;event_id=1407771472571452" rel="dialog" role="button">560 \u091c\u093e \u0930\u0939\u0947 \u0939\u0948\u0902&nbsp;\xb7&nbsp;3.1 \u0939\u091c\u093c\u093e\u0930 \u0915\u0940 \u0930\u0941\u091a\u093f \u0939\u0948</a><div class="_5z7d">\u0907\u0938 \u0908\u0935\u0947\u0902\u091f \u0915\u094b \u0905\u092a\u0928\u0947 \u092e\u093f\u0924\u094d\u0930\u094b\u0902 \u0938\u0947 \u0938\u093e\u091d\u093e \u0915\u0930\u0947\u0902</div></div><a class="_42ft _4jy0 _i8v _3-8w rfloat _ohf _4jy4 _517h _51sy" role="button" href="#" ajaxify="#" rel="dialog" data-testid="event_invite_button"><i class="_3-8_ _3-8_ img sp_WYmAGAVQNZh sx_82e44d"></i>\u0906\u092e\u0902\u0924\u094d\u0930\u093f\u0924 \u0915\u0930\u0947\u0902</a></div></div></div> --></code></div>, <div class="hidden_elem"><code id="u_0_i"><!-- <div class="_5vl5 _3a9j"><ul class="uiList _4kg _4ks"><li class="_3slj"><div class="_36hm"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m- _phw"><div class="_6a" aria-hidden="true"><div class="_6a _6b" style="height:18px"></div><div class="_6a _6b"><i class="_ohg img sp_ESbkBsVlxUv sx_c2b8bd"><u>clock</u></i></div></div></td><td class="_51m- _4930 _phw _51mw"><div class="_xkh _phw"><div class="_6a"><div class="_6a _6b" style="height:18px"></div><div class="_6a _6b"><div class="_publicProdFeedInfo__timeRowTitle _5xhk" content="2017-07-28T21:30:00-07:00 to 2017-07-29T05:00:00-07:00"><span><span 
itemprop="startDate">29 \u091c\u0941\u0932\u093e\u0908</span></span> <span title="09:30 &#x905;&#x92a;&#x930;&#x93e;&#x939;&#x94d;&#x928; &#x906;&#x92a;&#x915;&#x947; &#x938;&#x92e;&#x92f; &#x92e;&#x947;&#x902;">10:00 \u092a\u0942\u0930\u094d\u0935\u093e\u0939\u094d\u0928</span> - <span title="05:00 &#x92a;&#x942;&#x930;&#x94d;&#x935;&#x93e;&#x939;&#x94d;&#x928; &#x906;&#x92a;&#x915;&#x947; &#x938;&#x92e;&#x92f; &#x92e;&#x947;&#x902;">05:30 \u0905\u092a\u0930\u093e\u0939\u094d\u0928 UTC+05:30</span></div><div class="_5xhp fsm fwn fcg"></div></div></div></div></td></tr></tbody></table></div></li><li class="_3xd0 _3slj"><div class="_36hm _5cmn" id="u_0_9"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m- _phw"><div class="_6a" aria-hidden="true"><div class="_6a _6b" style="height:32px"></div><div class="_6a _6b"><i class="_ohg img sp_ESbkBsVlxUv sx_f4bee6"><u>pin</u></i></div></div></td><td class="_51m- _51mw"><div class="clearfix _4930"><div class="_xkg _phw rfloat _ohf"><div><div id="u_0_a"><div class="_6a"><div class="_6a _6b" style="height:32px"></div><div class="_6a _6b"><a href="#" role="button">\u092e\u0948\u092a \u0926\u093f\u0916\u093e\u090f\u0901</a></div></div></div><div class="hidden_elem" id="u_0_b"><div class="_6a"><div class="_6a _6b" style="height:32px"></div><div class="_6a _6b"><a href="#" role="button">\u092e\u0948\u092a \u091b\u093f\u092a\u093e\u090f\u0901</a></div></div></div></div></div><div class="_xkh _phw _42ef"><div class="_6a"><div class="_6a _6b" style="height:32px"></div><div class="_6a _6b"><a class="_5xhk" href="https://www.facebook.com/iitd.delhi/" id="u_0_d" data-testid="event-permalink-location">IIT Delhi</a><div class="_5xhp fsm fwn fcg">Hauz Khaz, New Delhi, India 110016</div></div></div></div></div></td></tr></tbody></table></div><div class="_4-u2 hidden_elem _5xhn _4-u8" id="u_0_c"><div class="clearfix _ikh"><div class="_4bl7"><div class="_23mo"><div class="fbPlaceFlyoutWrap 
_5xho" id="u_0_e"><div class="fbPlaceFlyout" style="width:240px;"><div class="fbPlaceFlyoutShell" style="width:46px;bottom:-31px;"><div class="fbPlaceFlyoutBox uiBoxWhite" style="width: 46px"><div><div class="_52i5"><a href="https://www.facebook.com/iitd.delhi/"><img class="_s0 img" src="https://scontent.fdel6-1.fna.fbcdn.net/v/t1.0-1/p40x40/255575_512250575469178_612128240_n.jpg?oh=dc9acf8d4452db344aaba7fde25efa84&amp;oe=59AD9DC7" alt="" itemprop="image" aria-label="IIT Delhi" role="img" style="width:40px;height:40px" /></a></div></div><div class="fbPlaceFlyoutMapArrow"><i class="img sp_ESbkBsVlxUv sx_104d97"></i></div><div class="fbPlaceFlyoutMapArrow"><i class="img sp_ESbkBsVlxUv sx_104d97"></i></div></div></div></div><a href="#" rel="dialog" ajaxify="/places/map/?id=211928345501404" role="button"><div><div class="_4j7v _2vs2"><img class="_a3f img" alt="" aria-label="&#x928;&#x915;&#x94d;&#x936;&#x93e; &#x905;&#x91f;&#x948;&#x91a;&#x92e;&#x947;&#x902;&#x91f;" src="https://external.fdel6-1.fna.fbcdn.net/static_map.php?region=IN&amp;v=29&amp;osm_provider=2&amp;size=240x132&amp;center=28.545188216208%2C77.193069476906&amp;zoom=15&amp;markers=28.54518822%2C77.19306948&amp;language=hi_IN" width="240" height="132" /><span id="u_0_g"></span></div></div></a></div></div></div><div class="_4bl9 _2qsg"><div><span class="_c24">\u0915\u0949\u0932\u0947\u091c \u0914\u0930 \u092f\u0942\u0928\u093f\u0935\u0930\u094d\u0938\u093f\u091f\u0940</span><div><div class="_4iae"><div><div class="_6a _5xoz _5xo-"><i class="img sp_ESbkBsVlxUv sx_ac5297"></i></div><div class="_6a _5xoz"><i class="img sp_ESbkBsVlxUv sx_ac5297"></i></div><div class="_6a _5xoz"><i class="img sp_ESbkBsVlxUv sx_ac5297"></i></div><div class="_6a _5xoz"><i class="img sp_ESbkBsVlxUv sx_ac5297"></i></div><div class="_6a _5xoz _4ial"><i class="img sp_ESbkBsVlxUv sx_ac5297"></i></div></div><div class="_559j" style="clip: rect(0px, 63px, 16px, 0px)"><div class="_6a _5xoz _5xo-"><i class="img sp_ESbkBsVlxUv 
sx_59de11"></i></div><div class="_6a _5xoz"><i class="img sp_ESbkBsVlxUv sx_59de11"></i></div><div class="_6a _5xoz"><i class="img sp_ESbkBsVlxUv sx_59de11"></i></div><div class="_6a _5xoz"><i class="img sp_ESbkBsVlxUv sx_59de11"></i></div><div class="_6a _5xoz _4ial"><i class="img sp_ESbkBsVlxUv sx_59de11"></i></div></div></div></div><hr class="_23mm" /><div><span class="_c24">011 2659 6316</span></div><div><span class="_c24"></span></div><div class="ptm"><a class="_42ft _4jy0 _4jy3 _517h _51sy" role="button" href="http://l.facebook.com/l.php?u=http%3A%2F%2Fshare.here.com%2Fr%2Fmylocation%2Fe-eyJuYW1lIjoiSUlUIERlbGhpIiwiYWRkcmVzcyI6IkhhdXogS2hheiwgTmV3IERlbGhpLCBJbmRpYSAxMTAwMTYiLCJsYXRpdHVkZSI6MjguNTQ1MTg4MjE2MjA4LCJsb25naXR1ZGUiOjc3LjE5MzA2OTQ3NjkwNiwicHJvdmlkZXJOYW1lIjoiZmFjZWJvb2siLCJwcm92aWRlcklkIjoyMTE5MjgzNDU1MDE0MDR9%3Flink%3Dunknown%26fb_locale%3Dhi_IN%26ref%3Dfacebook&amp;h=ATP2RoDOmV19cipyFvxN_S_G4uI7FP1aDGQXs8I8palbouMF9Ut2wIJBE-D0XSb9O2x9_YcBTP1eLGOs-qvz3hHjCMi-5oGqGiE1TJerNdX-KKhRgc6j392SdLAY&amp;s=1" id="u_0_f" target="_blank" rel="nofollow" onmouseover="LinkshimAsyncLink.swap(this, &quot;http:\\\\/\\\\/share.here.com\\\\/r\\\\/mylocation\\\\/e-eyJuYW1lIjoiSUlUIERlbGhpIiwiYWRkcmVzcyI6IkhhdXogS2hheiwgTmV3IERlbGhpLCBJbmRpYSAxMTAwMTYiLCJsYXRpdHVkZSI6MjguNTQ1MTg4MjE2MjA4LCJsb25naXR1ZGUiOjc3LjE5MzA2OTQ3NjkwNiwicHJvdmlkZXJOYW1lIjoiZmFjZWJvb2siLCJwcm92aWRlcklkIjoyMTE5MjgzNDU1MDE0MDR9?link=unknown&amp;fb_locale=hi_IN&amp;ref=facebook&quot;);" onclick="LinkshimAsyncLink.swap(this, 
&quot;http:\\\\/\\\\/l.facebook.com\\\\/l.php?u=http\\\\u00253A\\\\u00252F\\\\u00252Fshare.here.com\\\\u00252Fr\\\\u00252Fmylocation\\\\u00252Fe-eyJuYW1lIjoiSUlUIERlbGhpIiwiYWRkcmVzcyI6IkhhdXogS2hheiwgTmV3IERlbGhpLCBJbmRpYSAxMTAwMTYiLCJsYXRpdHVkZSI6MjguNTQ1MTg4MjE2MjA4LCJsb25naXR1ZGUiOjc3LjE5MzA2OTQ3NjkwNiwicHJvdmlkZXJOYW1lIjoiZmFjZWJvb2siLCJwcm92aWRlcklkIjoyMTE5MjgzNDU1MDE0MDR9\\\\u00253Flink\\\\u00253Dunknown\\\\u002526fb_locale\\\\u00253Dhi_IN\\\\u002526ref\\\\u00253Dfacebook&amp;h=ATP2RoDOmV19cipyFvxN_S_G4uI7FP1aDGQXs8I8palbouMF9Ut2wIJBE-D0XSb9O2x9_YcBTP1eLGOs-qvz3hHjCMi-5oGqGiE1TJerNdX-KKhRgc6j392SdLAY&amp;s=1&quot;);">\u0926\u093f\u0936\u093e\u090f\u0901 \u092a\u094d\u0930\u093e\u092a\u094d\u0924 \u0915\u0930\u0947\u0902</a></div></div></div></div></div></li></ul><div id="event_navigation" class="_4dn9"><div id="u_0_h"></div></div></div> --></code></div>, <div class="hidden_elem"><code id="u_0_m"><!-- <div class="_4z-v"><div class="_4-u2 _3xaf _3-95 _4-u8"><div class="_4-u3 _5dwa _5dwb _57_-"><span class="_38my _5803">\u0935\u093f\u0935\u0930\u0923<span class="_c1c"></span></span><div class="_3s3-"></div></div><div class="_2qgs"><span class="_4n-j _fbReactionComponent__eventDetailsContentTags fsl" data-testid="event-permalink-details">Indian Youth Forum is proud to announce the first-ever Startup Festival 2017 which will bring together the brightest startups of the country all in one place. And these startups are looking to hire you!<br /> For the first time ever, these bright and young startups, will open their ships to technical and non-technical talent, on an adventurous voyage filled with learning to become the next big company. 
The event is open to working professionals and talented freshers looking for a challenging and enriching role.<br /> <br /> For Any Kind of Association Queries Mail us at -<br /> mystory&#064;indiayf.in or Inbox us .</span></div><div class="_1r51"><ul class="uiList uiCollapsedList uiCollapsedListHidden _509- _4ki" id="u_0_j"><li><a href="/events/discovery/?acontext=%7B%22ref%22%3A51%2C%22source%22%3A1%2C%22action_history%22%3A%22%5B%7B%5C%22surface%5C%22%3A%5C%22permalink%5C%22%2C%5C%22mechanism%5C%22%3A%5C%22surface%5C%22%2C%5C%22extra_data%5C%22%3A%5B%5D%7D%2C%7B%5C%22surface%5C%22%3A%5C%22permalink%5C%22%2C%5C%22mechanism%5C%22%3A%5C%22event_information%5C%22%2C%5C%22extra_data%5C%22%3A%7B%5C%22tag%5C%22%3A%5C%22StartUp%5C%22%7D%7D%5D%22%2C%22has_source%22%3Atrue%7D&amp;suggestion_token=%7B%22tags%22%3A%5B181836542181749%5D%7D"><span class="_47od">StartUp</span></a></li><li><a href="/events/discovery/?acontext=%7B%22ref%22%3A51%2C%22source%22%3A1%2C%22action_history%22%3A%22%5B%7B%5C%22surface%5C%22%3A%5C%22permalink%5C%22%2C%5C%22mechanism%5C%22%3A%5C%22surface%5C%22%2C%5C%22extra_data%5C%22%3A%5B%5D%7D%2C%7B%5C%22surface%5C%22%3A%5C%22permalink%5C%22%2C%5C%22mechanism%5C%22%3A%5C%22event_information%5C%22%2C%5C%22extra_data%5C%22%3A%7B%5C%22tag%5C%22%3A%5C%22Job+hunting%5C%22%7D%7D%5D%22%2C%22has_source%22%3Atrue%7D&amp;suggestion_token=%7B%22tags%22%3A%5B111193155571103%5D%7D"><span class="_47od">Job hunting</span></a></li><li><a href="/events/discovery/?acontext=%7B%22ref%22%3A51%2C%22source%22%3A1%2C%22action_history%22%3A%22%5B%7B%5C%22surface%5C%22%3A%5C%22permalink%5C%22%2C%5C%22mechanism%5C%22%3A%5C%22surface%5C%22%2C%5C%22extra_data%5C%22%3A%5B%5D%7D%2C%7B%5C%22surface%5C%22%3A%5C%22permalink%5C%22%2C%5C%22mechanism%5C%22%3A%5C%22event_information%5C%22%2C%5C%22extra_data%5C%22%3A%7B%5C%22tag%5C%22%3A%5C%22Startup.com%5C%22%7D%7D%5D%22%2C%22has_source%22%3Atrue%7D&amp;suggestion_token=%7B%22tags%22%3A%5B109416335743992%5D%7D"><span 
class="_47od">Startup.com</span></a></li></ul></div></div><div class="_4-u2 _3xaf _3-95 _4-u8"><div class="_4-u3 _5dwa _5dwb _57_-"><span class="_38my _5803">Indian Youth Forum \u0915\u0947 \u092c\u093e\u0930\u0947 \u092e\u0947\u0902<span class="_c1c"></span></span><div class="_3s3-"></div></div><div><div><div class="_37p5"><div class="clearfix"><img class="_37p7 _8o _8r lfloat _ohe img" height="100" src="https://scontent.fdel6-1.fna.fbcdn.net/v/t1.0-0/c5.0.100.100/p100x100/16708216_1083815345075324_1809238266151282211_n.jpg?oh=cdc9096728fec80a0147133a6b1599d6&amp;oe=59E5EFDB" alt="" /><div class="_8u _42ef"><div class="_37p8"><div class="_50f4"><span class="fwb"><a class="profileLink" href="https://www.facebook.com/IyfIndianyouthforum/">Indian Youth Forum</a></span></div><div class="_37p9 _50f3">News &amp; Media Website</div><div class="_37pa _50f3">We find and tell stories of people doing good to inspire global action. Because we&#039;re convinced each of us has the power to make the world better .</div></div></div></div></div></div></div></div><div class="_4-u2 _3xaf _3-95 _4-u8"><div class="_4-u3 _5dwa _5dwb _57_-"><span class="_38my _5803">\u0938\u094d\u0925\u093e\u0928 \u0915\u0947 \u092c\u093e\u0930\u0947 \u092e\u0947\u0902<span class="_c1c"></span></span><div class="_3s3-"></div></div><div class="_37p6"><div><div><div><div class="_4sdm _6lh _dcs"><div class="_5hv6"><div class="_6lp"><div class="_6ln fsxxl fwb"><a href="https://www.facebook.com/iitd.delhi/" data-ft="&#123;&quot;tn&quot;:&quot;k&quot;&#125;">IIT Delhi</a></div><div class="_6lo ellipsis fsm fwn fcg">\u0915\u0949\u0932\u0947\u091c \u0914\u0930 \u092f\u0942\u0928\u093f\u0935\u0930\u094d\u0938\u093f\u091f\u0940</div></div></div><div class="uiScaledImageContainer _6li _6l-" style="width:100%"><img class="scaledImageFitWidth img" src="https://scontent.fdel6-1.fna.fbcdn.net/v/t1.0-0/p320x320/1660351_782270428467190_610794429_n.jpg?oh=4b4957698cf37eaa2621307fc3c61b8f&amp;oe=59E14DBB" 
style="top:-60px;" alt="&#039;Picture credit: Arshad Nasser (2013JDS6003) M.Des- Industrial Design&#039;" width="480" height="320" /></div><a class="_8xh" href="https://www.facebook.com/iitd.delhi/" style="width:100%" data-ft="&#123;&quot;tn&quot;:&quot;k&quot;&#125;"></a><a class="_3aml" href="https://www.facebook.com/iitd.delhi/" style="width:100%"></a><div class="clearfix _5kun"><a class="_6ll lfloat _ohe" href="https://www.facebook.com/iitd.delhi/" data-ft="&#123;&quot;tn&quot;:&quot;k&quot;&#125;"><div class="_6lm _4m78"><div class="uiScaledImageContainer profilePic" style="width: 96px; height: 96px"><img class="scaledImageFitWidth img" src="https://scontent.fdel6-1.fna.fbcdn.net/v/t1.0-1/p100x100/255575_512250575469178_612128240_n.jpg?oh=e2bf449617f68eac2b8cd02d7c35a513&amp;oe=59A0C926" alt="IIT Delhi" width="96" height="96" /></div></div></a><div class="_6lk _42ef"><div><div class="_8yb"><div>2,82,390 \u092a\u0938\u0902\u0926</div><div>2,019 \u0932\u094b\u0917 \u0907\u0938 \u092c\u093e\u0930\u0947 \u092e\u0947\u0902 \u092c\u093e\u0924 \u0915\u0930 \u0930\u0939\u0947 \u0939\u0948\u0902</div></div></div></div></div></div></div></div></div></div><div class="_4z-w"><a class="_4b4x" href="https://www.facebook.com/iitd.delhi/" id="u_0_k">\u092a\u0947\u091c \u092a\u0930 \u091c\u093e\u090f\u0901</a></div></div><div class="_4-u2 _3xaf _3-95 _4-u8"><div class="_4x0f"><div class="_4x0g"><div class="_4x0d _4x0e"><div class="_41dr _4x0c"><span><img class="_s0 _41ds _54ru img" src="https://scontent.fdel6-1.fna.fbcdn.net/v/t1.0-1/c4.15.32.32/p40x40/15747342_1195628017184471_1949447432837553984_n.jpg?oh=54f25e123a74d63f279279ee62318a79&amp;oe=59B5B106" alt="" aria-label="Jha Ayush" role="img" /></span></div><div class="_41dr _4x0c"><a href="https://www.facebook.com/IyfIndianyouthforum/"><img class="_s0 _41ds _54ru img" 
src="https://scontent.fdel6-1.fna.fbcdn.net/v/t1.0-1/p32x32/15541314_1041942845929241_1722198877754933119_n.jpg?oh=973e318ede53168d58f6e7be835583c0&amp;oe=59A926CC" alt="" aria-label="Indian Youth Forum" role="img" /></a></div><div class="_41dr _4x0c"><a href="https://www.facebook.com/kumeshyadav"><img class="_s0 _41ds _54ru img" src="https://scontent.fdel6-1.fna.fbcdn.net/v/t1.0-1/p32x32/15337627_10153988267585286_2118657580809154297_n.jpg?oh=182fa980f18ed2d94c6717f8de3af7ad&amp;oe=599BC3CD" alt="" aria-label="Kumesh Yadav" role="img" /></a></div><div class="_41dr _4x0c"><span><img class="_s0 _41ds _54ru img" src="https://scontent.fdel6-1.fna.fbcdn.net/v/t1.0-1/p32x32/15965812_10158191872490352_4833263074795798396_n.jpg?oh=ce18a15878fc5814539a57aed4c0446b&amp;oe=59A47E1F" alt="" aria-label="Kanika Gupta" role="img" /></span></div></div></div><div class="_4x0h">\u091a\u0930\u094d\u091a\u093e \u092e\u0947\u0902 12 \u092a\u094b\u0938\u094d\u091f.</div></div><div class="_4z-w"><a class="_4b4x" href="/events/1407771472571452/?active_tab=discussion" id="u_0_l">\u091a\u0930\u094d\u091a\u093e \u0926\u0947\u0916\u0947\u0902</a></div></div></div> --></code></div>]

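The fragment above is the part of the page from which I need to scrape the text inside the div with class '_publicProdFeedInfo__timeRowTitle _5xhk'. When I scrape it, I get encoded text like this: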

&lt;div class="_publicProdFeedInfo__timeRowTitle _5xhk" content="2017-07-28T21:30:00-07:00 to 2017-07-29T05:00:00-07:00"&gt;&lt;span&gt;&lt;span itemprop="startDate"&gt;29 जुलाई&lt;/span&gt;&lt;/span&gt; &lt;span title="09:30 अपराह्न आपके समय में"&gt;10:00 पूर्वाह्न&lt;/span&gt; - &lt;span title="05:00 पूर्वाह्न आपके समय में"&gt;05:30 अपराह्न UTC+05:30&lt;/span&gt;&lt;/div&gt;

even though the text is present in the page source of the URL: https://www.facebook.com/events/1407771472571452/

Could you tell me how to fix this?

This is the Python code I am using:

import urllib2
from bs4 import BeautifulSoup

facebook = "https://www.facebook.com/events/1407771472571452/"
page = urllib2.urlopen(facebook)
soup = BeautifulSoup(page, 'lxml')
# Each hidden_elem div wraps a <code> element whose only child is an HTML comment
data = soup.findAll("div", {"class": "hidden_elem"})
for item in data:
    # The commented-out markup is the first child of the <code> element
    commentedHTML = item.find('code').contents[0]
    # Parse the comment text as HTML in its own right
    more_soup = BeautifulSoup(commentedHTML, 'lxml')
    wanted_text = more_soup.findAll('div', {'class': '_publicProdFeedInfo__timeRowTitle _5xhk'})
    if wanted_text:
        gotdata2 = wanted_text[0]
        print gotdata2

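【Comments】:

  • The link is dead.
  • Your Facebook language is set to Hindi.
  • @SachinKukreja I have edited the link.
  • But I don't think setting the language to Hindi makes any difference, because it is only this class whose text I cannot scrape; from this class I get encoded text.
  • I doubt I can be of any help, since I cannot see any such text on the Facebook page.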
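
As the comments above hint, the "encoded" text in the fragment appears to be ordinary Hindi written in two escaped forms: Python-style `\uXXXX` escapes and HTML numeric character references such as `&#x905;`. The following is a minimal sketch in Python 3 (the question itself uses Python 2), assuming that reading of the fragment; the two sample strings are copied from the fragment, and the variable names are illustrative only.

```python
import html

# Copied from the fragment: the end time, written with \uXXXX escapes.
# The raw string keeps the backslashes literal, as they appear in the dump.
escaped = r"05:30 \u0905\u092a\u0930\u093e\u0939\u094d\u0928 UTC+05:30"

# Copied from a title attribute in the fragment: HTML numeric character references.
entities = "&#x905;&#x92a;&#x930;&#x93e;&#x939;&#x94d;&#x928;"

# Turn the literal \uXXXX sequences into the characters they name.
from_escapes = escaped.encode("ascii").decode("unicode_escape")

# Turn the &#x...; references into the characters they name.
from_entities = html.unescape(entities)

print(from_escapes)
print(from_entities)
```

Both forms resolve to the same Hindi word, which suggests the scraper is returning localized text rather than corrupted data.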

Tags: python python-2.7 web-scraping beautifulsoup


【Solution 1】:

After many attempts, I finally fixed it by specifying the language in the request headers:

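import requests
from bs4 import BeautifulSoup
url = "https://www.facebook.com/events/1407771472571452/"
# Ask the server for English content instead of the Hindi localization
headers = {"Accept-Language": "en-US,en;q=0.5"}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'lxml')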

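【Comments】:

    【Solution 2】:

    Identify the div element, then the code element inside it. The comment is available as that code element's string and can be passed to BeautifulSoup for parsing. Once you have made another soup from the content of the comment, you can work with it like anything else.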

    >>> import bs4
    >>> import requests
    >>> page = requests.get('https://www.facebook.com/events/1407771472571452/').text
    >>> soup = bs4.BeautifulSoup(page, 'lxml')
    >>> div = soup.find('div', attrs={'class':"hidden_elem"})
    >>> code = div.find('code')
    >>> soup_2 = bs4.BeautifulSoup(code.string, 'lxml')
    >>> soup_2.findAll('a')
    [<a class="_5z74" href="/events/dialog/public_guest_list/?acontext%5Bref%5D=51&amp;acontext%5Bsource%5D=1&amp;acontext%5Baction_history%5D=%5B%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22surface%22%2C%22extra_data%22%3A%5B%5D%7D%2C%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22guest_list%22%2C%22extra_data%22%3A%5B%5D%7D%5D&amp;acontext%5Bhas_source%5D=1&amp;event_id=1407771472571452" rel="dialog" role="button">601 Going · 3.3K Interested</a>, <a ajaxify="#" class="_42ft _4jy0 _i8v _3-8w rfloat _ohf _4jy4 _517h _51sy" data-testid="event_invite_button" href="#" rel="dialog" role="button"><i class="_3-8_ _3-8_ img sp__Uck8Egf9Z1 sx_deb798"></i>Invite</a>]
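
The same two-pass idea can be tried offline. The following is a minimal sketch in Python 3, assuming a shortened, hypothetical stand-in for the page source (`SAMPLE_PAGE` is not real Facebook markup) and using the built-in `html.parser` so that lxml is not required; `find_time_rows` is an illustrative helper name.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical, shortened stand-in for the page source (not real Facebook markup).
SAMPLE_PAGE = """
<div class="hidden_elem"><code id="u_0_8"><!-- <div class="_5z7d">Share this event</div> --></code></div>
<div class="hidden_elem"><code id="u_0_i"><!-- <div class="_publicProdFeedInfo__timeRowTitle _5xhk" content="2017-07-28T21:30:00-07:00 to 2017-07-29T05:00:00-07:00"><span>29 July</span></div> --></code></div>
"""


def find_time_rows(page_html):
    """Return the time-row <div> tags that are hidden inside HTML comments."""
    outer = BeautifulSoup(page_html, "html.parser")
    rows = []
    for div in outer.find_all("div", class_="hidden_elem"):
        code = div.find("code")
        if code is None:
            continue
        for node in code.children:
            # Only comment nodes carry the commented-out markup.
            if isinstance(node, Comment):
                inner = BeautifulSoup(node, "html.parser")
                rows.extend(
                    inner.find_all("div", class_="_publicProdFeedInfo__timeRowTitle")
                )
    return rows


rows = find_time_rows(SAMPLE_PAGE)
print(rows[0]["content"])   # machine-readable start and end timestamps
print(rows[0].get_text())   # localized display text
```

In the fragment posted in the question, the `content` attribute of this div holds ISO-style timestamps, so reading that attribute may be enough when only the event time is needed, whatever language the display text is in.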
    

    Edit: this is what I get if I do what was suggested in the comments.

    >>> divs_2 = soup_2.findAll('div')
    >>> for item in divs_2:
    ...     item.contents
    ...     
    [<div class="_4-u3 _5z73"><div class="clearfix"><div class="lfloat _ohe"><a class="_5z74" href="/events/dialog/public_guest_list/?acontext%5Bref%5D=51&amp;acontext%5Bsource%5D=1&amp;acontext%5Baction_history%5D=%5B%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22surface%22%2C%22extra_data%22%3A%5B%5D%7D%2C%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22guest_list%22%2C%22extra_data%22%3A%5B%5D%7D%5D&amp;acontext%5Bhas_source%5D=1&amp;event_id=1407771472571452" rel="dialog" role="button">602 Going · 3.3K Interested</a><div class="_5z7d">Share this event with your friends</div></div><a ajaxify="#" class="_42ft _4jy0 _i8v _3-8w rfloat _ohf _4jy4 _517h _51sy" data-testid="event_invite_button" href="#" rel="dialog" role="button"><i class="_3-8_ _3-8_ img sp__Uck8Egf9Z1 sx_deb798"></i>Invite</a></div></div>]
    [<div class="clearfix"><div class="lfloat _ohe"><a class="_5z74" href="/events/dialog/public_guest_list/?acontext%5Bref%5D=51&amp;acontext%5Bsource%5D=1&amp;acontext%5Baction_history%5D=%5B%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22surface%22%2C%22extra_data%22%3A%5B%5D%7D%2C%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22guest_list%22%2C%22extra_data%22%3A%5B%5D%7D%5D&amp;acontext%5Bhas_source%5D=1&amp;event_id=1407771472571452" rel="dialog" role="button">602 Going · 3.3K Interested</a><div class="_5z7d">Share this event with your friends</div></div><a ajaxify="#" class="_42ft _4jy0 _i8v _3-8w rfloat _ohf _4jy4 _517h _51sy" data-testid="event_invite_button" href="#" rel="dialog" role="button"><i class="_3-8_ _3-8_ img sp__Uck8Egf9Z1 sx_deb798"></i>Invite</a></div>]
    [<div class="lfloat _ohe"><a class="_5z74" href="/events/dialog/public_guest_list/?acontext%5Bref%5D=51&amp;acontext%5Bsource%5D=1&amp;acontext%5Baction_history%5D=%5B%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22surface%22%2C%22extra_data%22%3A%5B%5D%7D%2C%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22guest_list%22%2C%22extra_data%22%3A%5B%5D%7D%5D&amp;acontext%5Bhas_source%5D=1&amp;event_id=1407771472571452" rel="dialog" role="button">602 Going · 3.3K Interested</a><div class="_5z7d">Share this event with your friends</div></div>, <a ajaxify="#" class="_42ft _4jy0 _i8v _3-8w rfloat _ohf _4jy4 _517h _51sy" data-testid="event_invite_button" href="#" rel="dialog" role="button"><i class="_3-8_ _3-8_ img sp__Uck8Egf9Z1 sx_deb798"></i>Invite</a>]
    [<a class="_5z74" href="/events/dialog/public_guest_list/?acontext%5Bref%5D=51&amp;acontext%5Bsource%5D=1&amp;acontext%5Baction_history%5D=%5B%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22surface%22%2C%22extra_data%22%3A%5B%5D%7D%2C%7B%22surface%22%3A%22permalink%22%2C%22mechanism%22%3A%22guest_list%22%2C%22extra_data%22%3A%5B%5D%7D%5D&amp;acontext%5Bhas_source%5D=1&amp;event_id=1407771472571452" rel="dialog" role="button">602 Going · 3.3K Interested</a>, <div class="_5z7d">Share this event with your friends</div>]
    ['Share this event with your friends']
    

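    To me, the simpler approach might be to request the page in English, which avoids having to translate strings encoded in another language. I have no experience with this, but you could investigate what options requests or urllib2 offer for making such a request.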
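
One such option is the `Accept-Language` request header. The following is a minimal sketch in Python 3, where `urllib.request` takes the place of Python 2's `urllib2`; the header value is the one reported to work in Solution 1, `build_english_request` is an illustrative helper name, and whether a server honors the header is an assumption to verify against the live page. The request is only built here, not sent.

```python
import urllib.request

EVENT_URL = "https://www.facebook.com/events/1407771472571452/"


def build_english_request(url):
    """Build (but do not send) a request that asks for English content."""
    return urllib.request.Request(
        url,
        headers={"Accept-Language": "en-US,en;q=0.5"},
    )


request = build_english_request(EVENT_URL)

# urllib stores header names capitalized, hence "Accept-language" here.
print(request.get_header("Accept-language"))

# To actually fetch the page, pass the request object to urlopen:
#     html_bytes = urllib.request.urlopen(request).read()
```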

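    【Comments】:

    • Sir, if you put 'div' in place of 'a' in the last call, soup_2.findAll('a'), you will also get some encoded text in the output. I need that text without the encoding, by decoding or any other means.
    • Take a look at the edit. I don't know what else I can tell you.
    • Sir, your output shows that this code works fine with no encoded text at all; however, when I use the same code, I get encoded text in my output. Please look at the output at this URL: link
    【Solution 3】:

    After reading the response, decode it from UTF-8: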

    page = urllib2.urlopen(facebook)
    soup = BeautifulSoup(page.read().decode('utf-8', 'ignore'), 'lxml')
    

    Note: 'ignore' is added so that decoding does not fail on invalid UTF-8 bytes; any such bytes are dropped during decoding.
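
The effect of the 'ignore' error handler can be seen without any network access. The following is a minimal sketch in Python 3, assuming a made-up byte string in which one invalid byte (`0xff`) has been spliced into otherwise valid UTF-8; the variable names are illustrative.

```python
# Valid UTF-8 for the Hindi display text, with one invalid byte spliced in.
hindi = "05:30 अपराह्न"
raw = hindi.encode("utf-8") + b"\xff" + b" UTC+05:30"

# Strict decoding (the default) refuses the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With 'ignore', the invalid byte is silently dropped and the rest survives.
lenient = raw.decode("utf-8", "ignore")

print(strict_ok)
print(lenient)
```

Decoding this way changes how bytes become text, not which language the server sends, which fits the report in the comments that the Hindi output stayed the same.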

    【Comments】:

    • I tried this too, and it shows the same output: '
      05:30 अपराह्न UTC+05:30
      '
    • Your answer does not work at all; it also shows encoded text.