[Question Title]: Web scraping LinkedIn doesn't give me the HTML... what am I doing wrong?
[Posted]: 2019-09-09 18:43:26
[Question Description]:

So I'm trying to scrape LinkedIn's About page to get the "Specialties" of certain companies. When I try to scrape LinkedIn with Beautiful Soup it gives me an access denied error, so I'm using headers to fake my browser. However, it gives me this output instead of the corresponding HTML:

window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }

  if (window.location.protocol == "http:") {
    // If the "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  var domain = "www.linkedin.com";
  if (domain != location.host) {
    var subdomainIndex = location.host.indexOf(".linkedin");
    if (subdomainIndex != -1) {
      domain = "www" + location.host.substring(subdomainIndex);
    }
  }

  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
    "&originalReferer=" + document.referrer.substr(0, 200) +
    "&sessionRedirect=" + encodeURIComponent(window.location.href);
}

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.linkedin.com/company/biotech/'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get(url, headers=headers)
print(response.content) 

What am I doing wrong? I think it's trying to check for cookies. Is there a way I can add that to my code?

[Question Discussion]:

    Tags: python html selenium web-scraping beautifulsoup


    [Solution 1]:

    You can use Selenium to get pages whose content is rendered dynamically with JavaScript. You also have to log in, because the page you want to retrieve requires authentication. So:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    EMAIL = ''
    PASSWORD = ''
    
    # Open the company page; LinkedIn redirects unauthenticated
    # visitors to its auth wall, so log in first.
    driver = webdriver.Chrome()
    driver.get('https://www.linkedin.com/company/biotech/')
    el = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'form-toggle')))
    driver.execute_script("arguments[0].click();", el)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'login-email'))).send_keys(EMAIL)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'login-password'))).send_keys(PASSWORD)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'login-submit'))).click()
    # Wait for the About section to render, then read the "Specialties" entry.
    text = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="ember71"]/dl/dd[4]'))).text
    print(text)
    

    Output:

    Distributing medical products
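One caveat: ids like `ember71` are generated by Ember.js and can change between page loads, so the XPath above may break. A sketch of a label-based locator instead (it assumes the About section renders "Specialties" as a `dt`/`dd` pair, which is an assumption about LinkedIn's markup, not something confirmed here):

```python
def specialties_xpath(label="Specialties"):
    # Target the <dd> that follows the <dt> carrying the given label,
    # rather than a generated Ember id that differs per session.
    return ('//dt[normalize-space(text())="%s"]'
            '/following-sibling::dd[1]' % label)

print(specialties_xpath())
# → //dt[normalize-space(text())="Specialties"]/following-sibling::dd[1]
```

The returned string can be passed to the same `By.XPATH` wait as above.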
    

    [Discussion]:

      [Solution 2]:

      LinkedIn is actually performing some interesting cookie setting and subsequent redirects, which prevents your code from working as is. This is evident from the JavaScript returned by your initial request: HTTP cookies are set by the web server for tracking information, and those cookies are parsed by the JavaScript you encountered before the final redirect takes place. If you reverse-engineer the JavaScript, you will find that the final redirect looks like this (at least for me, based on my location and tracking information):

      url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
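For illustration, that redirect can be reproduced with the standard library's `urllib.parse.quote`, which percent-encodes the original company URL into the `sessionRedirect` parameter the same way the JavaScript's `encodeURIComponent` does (a sketch; the `trk`/`trkInfo` values are the ones observed above and may differ per session):

```python
from urllib.parse import quote

def authwall_url(target, trk="bf", trk_info="bf", referer=""):
    # Percent-encode the original URL into sessionRedirect, mirroring
    # the on-page JavaScript's encodeURIComponent call.
    return ("https://www.linkedin.com/authwall?trk=" + trk
            + "&trkInfo=" + trk_info
            + "&originalReferer=" + referer
            + "&sessionRedirect=" + quote(target, safe=""))

print(authwall_url("https://www.linkedin.com/company/biotech/"))
# → https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F
```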
      

      Alternatively, you can use Python's requests module to maintain a session for you, which automatically manages HTTP headers such as cookies so you don't have to worry about them. The following should give you the HTML source you are looking for. I'll leave it to you to implement BeautifulSoup and parse for what you want.

      import requests
      from bs4 import BeautifulSoup as BS
      
      url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
      
      
      with requests.Session() as s:
              response = s.get(url)
              print(response.content) 
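As a starting point for the BeautifulSoup part, here is a minimal offline sketch of pulling the "Specialties" value out of a `dt`/`dd` pair; the stand-in markup below is an assumption about the page's shape, not LinkedIn's actual HTML:

```python
from bs4 import BeautifulSoup

# Stand-in markup for the fetched page; a dt/dd pair is an assumed
# shape for how the About section labels its "Specialties" value.
html = '<dl><dt>Specialties</dt><dd>Distributing medical products</dd></dl>'

soup = BeautifulSoup(html, "html.parser")
label = soup.find("dt", string="Specialties")
print(label.find_next("dd").get_text(strip=True))
# → Distributing medical products
```

With the real page, you would feed `response.content` from the session above into `BeautifulSoup` instead of the stand-in string.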
      

      [Discussion]:

        [Solution 3]:

        You need to parse the response with BeautifulSoup first.

        # Parse the response content with the html parser and store it.
        page_content = BeautifulSoup(page_response.content, "html.parser")
        textContent = []
        for i in range(0, 20):
            paragraphs = page_content.find_all("p")[i].text
            textContent.append(paragraphs)
        # Loop through the paragraphs and push their text into a list so
        # the data can be manipulated afterwards.
        

        Not my example, but it can be found here: https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486
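Worth noting: the fixed `range(0, 20)` raises an `IndexError` whenever the page has fewer than 20 paragraphs. A safer sketch is to slice the result of `find_all` instead, since a slice simply returns fewer items when the page is short (the markup below is a stand-in, not real page content):

```python
from bs4 import BeautifulSoup

html = "<p>one</p><p>two</p><p>three</p>"  # stand-in for real page content
soup = BeautifulSoup(html, "html.parser")

# Slicing never over-runs: it returns at most 20 items and fewer when
# the page has fewer than 20 <p> tags.
textContent = [p.text for p in soup.find_all("p")[:20]]
print(textContent)
# → ['one', 'two', 'three']
```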

        [Discussion]:

        • I believe the question is about getting past LinkedIn's redirect so the HTML source can actually be retrieved and parsed by Beautiful Soup, not about how to use Beautiful Soup.
        • I think that is exactly the problem: "However, it gives me this output instead of the corresponding HTML". When he prints the raw request response, he doesn't get HTML.