【Posted】: 2019-09-09 18:43:26
【Problem Description】:
I'm trying to scrape the "About" pages of certain companies on LinkedIn to get their "Specialties" field. When I first tried scraping LinkedIn with Beautiful Soup, I got an access-denied error, so I set request headers to mimic a real browser. However, instead of the page's HTML, the request now returns this output:
window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trk=") == 0) && (cookies[i].length > 4)) {
      trk = cookies[i].substring(4);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }

  if (window.location.protocol == "http:") {
    // If the "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For international domains such as
  // fr.linkedin.com we convert it to www.linkedin.com
  var domain = "www.linkedin.com";
  if (domain != location.host) {
    var subdomainIndex = location.host.indexOf(".linkedin");
    if (subdomainIndex != -1) {
      domain = "www" + location.host.substring(subdomainIndex);
    }
  }

  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
      "&originalReferer=" + document.referrer.substr(0, 200) +
      "&sessionRedirect=" + encodeURIComponent(window.location.href);
}
import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.linkedin.com/company/biotech/'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
response = requests.get(url, headers=headers)
print(response.content)
What am I doing wrong? I think the script above is checking for cookies before redirecting. Is there a way I can handle the cookies in my code?
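One thing I can at least do is tell the two cases apart programmatically: the JavaScript stub above always redirects to `/authwall`, so checking the response body for that marker distinguishes the login wall from real page HTML. This is a minimal sketch; the `is_authwall` helper name and the chosen markers are my own assumption, not anything from LinkedIn's side:

```python
def is_authwall(body: str) -> bool:
    """Heuristic (hypothetical helper): return True if the response body looks
    like LinkedIn's JS redirect stub rather than the actual page HTML.
    It keys off the /authwall redirect target and the window.onload wrapper
    seen in the output above."""
    return "/authwall" in body and "window.onload" in body

# Example with a fragment of the stub I receive vs. ordinary HTML:
stub = 'window.onload = function() { window.location.href = "https://www.linkedin.com/authwall?trk=bf"; }'
real = "<html><head><title>About</title></head><body>Specialties: biotech</body></html>"
print(is_authwall(stub))  # True
print(is_authwall(real))  # False
```

With a check like this I could at least fail fast (or retry with a `requests.Session` that persists cookies) instead of feeding the stub into Beautiful Soup.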
【Discussion】:
Tags: python html selenium web-scraping beautifulsoup