[Question Title]: Web scraping LinkedIn doesn't give me the HTML... what am I doing wrong?
[Posted]: 2019-09-09 18:43:26
[Question Description]:

So I'm trying to scrape LinkedIn's About page to get the "Specialties" of certain companies. When I try to scrape LinkedIn with Beautiful Soup it gives me an access denied error, so I'm using headers to fake my browser. However, it gives me this output instead of the corresponding HTML:

window.onload = function() {
  // Parse the tracking code from cookies.
  var trk = "bf";
  var trkInfo = "bf";
  var cookies = document.cookie.split("; ");
  for (var i = 0; i < cookies.length; ++i) {
    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {
      trk = cookies[i].substring(8);
    }
    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {
      trkInfo = cookies[i].substring(8);
    }
  }

  if (window.location.protocol == "http:") {
    // If the "sl" cookie is set, redirect to https.
    for (var i = 0; i < cookies.length; ++i) {
      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {
        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);
        return;
      }
    }
  }

  // Get the new domain. For international domains such as
  // fr.linkedin.com, we convert it to www.linkedin.com
  var domain = "www.linkedin.com";
  if (domain != location.host) {
    var subdomainIndex = location.host.indexOf(".linkedin");
    if (subdomainIndex != -1) {
      domain = "www" + location.host.substring(subdomainIndex);
    }
  }

  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +
    "&originalReferer=" + document.referrer.substr(0, 200) +
    "&sessionRedirect=" + encodeURIComponent(window.location.href);
}

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.linkedin.com/company/biotech/'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get(url, headers=headers)
print(response.content) 

What am I doing wrong? I think it's trying to check for cookies. Is there a way I can add that to my code?

[Question Discussion]:

    Tags: python html selenium web-scraping beautifulsoup


    [Solution 1]:

    You can use Selenium to get pages whose content is rendered dynamically with JavaScript. You also have to log in, because the page you want to retrieve requires authentication. So:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    EMAIL = ''
    PASSWORD = ''
    
    # Open the company page; LinkedIn redirects unauthenticated
    # visitors to its auth wall, so log in first.
    driver = webdriver.Chrome()
    driver.get('https://www.linkedin.com/company/biotech/')
    el = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'form-toggle')))
    driver.execute_script("arguments[0].click();", el)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'login-email'))).send_keys(EMAIL)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'login-password'))).send_keys(PASSWORD)
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'login-submit'))).click()
    # Wait for the About section to render, then read the "Specialties" entry.
    text = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="ember71"]/dl/dd[4]'))).text
    print(text)
    

    Output:

    Distributing medical products
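One caveat: ids like `ember71` are generated by Ember.js and can change between page loads, so the XPath above may break. A sketch of a label-based locator instead (it assumes the About section renders "Specialties" as a `dt`/`dd` pair, which is an assumption about LinkedIn's markup, not something confirmed here):

```python
def specialties_xpath(label="Specialties"):
    # Target the <dd> that follows the <dt> carrying the given label,
    # rather than a generated Ember id that differs per session.
    return ('//dt[normalize-space(text())="%s"]'
            '/following-sibling::dd[1]' % label)

print(specialties_xpath())
# → //dt[normalize-space(text())="Specialties"]/following-sibling::dd[1]
```

The returned string can be passed to the same `By.XPATH` wait as above.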
    

    [Discussion]:

      [Solution 2]:

      LinkedIn is actually performing some interesting cookie setting and subsequent redirects, which prevents your code from working as is. This is evident from the JavaScript returned by your initial request: HTTP cookies are set by the web server for tracking information, and those cookies are parsed by the JavaScript you encountered before the final redirect takes place. If you reverse-engineer the JavaScript, you will find that the final redirect looks like this (at least for me, based on my location and tracking information):

      url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
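For illustration, that redirect can be reproduced with the standard library's `urllib.parse.quote`, which percent-encodes the original company URL into the `sessionRedirect` parameter the same way the JavaScript's `encodeURIComponent` does (a sketch; the `trk`/`trkInfo` values are the ones observed above and may differ per session):

```python
from urllib.parse import quote

def authwall_url(target, trk="bf", trk_info="bf", referer=""):
    # Percent-encode the original URL into sessionRedirect, mirroring
    # the on-page JavaScript's encodeURIComponent call.
    return ("https://www.linkedin.com/authwall?trk=" + trk
            + "&trkInfo=" + trk_info
            + "&originalReferer=" + referer
            + "&sessionRedirect=" + quote(target, safe=""))

print(authwall_url("https://www.linkedin.com/company/biotech/"))
# → https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F
```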
      

      Alternatively, you can use Python's requests module to maintain a session for you, which automatically manages HTTP headers such as cookies so you don't have to worry about them. The following should give you the HTML source you are looking for. I'll leave it to you to implement BeautifulSoup and parse for what you want.

      import requests
      from bs4 import BeautifulSoup as BS
      
      url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
      
      
      with requests.Session() as s:
              response = s.get(url)
              print(response.content) 
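As a starting point for the BeautifulSoup part, here is a minimal offline sketch of pulling the "Specialties" value out of a `dt`/`dd` pair; the stand-in markup below is an assumption about the page's shape, not LinkedIn's actual HTML:

```python
from bs4 import BeautifulSoup

# Stand-in markup for the fetched page; a dt/dd pair is an assumed
# shape for how the About section labels its "Specialties" value.
html = '<dl><dt>Specialties</dt><dd>Distributing medical products</dd></dl>'

soup = BeautifulSoup(html, "html.parser")
label = soup.find("dt", string="Specialties")
print(label.find_next("dd").get_text(strip=True))
# → Distributing medical products
```

With the real page, you would feed `response.content` from the session above into `BeautifulSoup` instead of the stand-in string.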
      

      [Discussion]:

        [Solution 3]:

        You need to parse the response with BeautifulSoup first.

        # Parse the response content with the html parser and store it.
        page_content = BeautifulSoup(page_response.content, "html.parser")
        textContent = []
        for i in range(0, 20):
            paragraphs = page_content.find_all("p")[i].text
            textContent.append(paragraphs)
        # Loop through the paragraphs and push their text into a list so
        # the data can be manipulated afterwards.
        

        Not my example, but it can be found here: https://codeburst.io/web-scraping-101-with-python-beautiful-soup-bb617be1f486
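Worth noting: the fixed `range(0, 20)` raises an `IndexError` whenever the page has fewer than 20 paragraphs. A safer sketch is to slice the result of `find_all` instead, since a slice simply returns fewer items when the page is short (the markup below is a stand-in, not real page content):

```python
from bs4 import BeautifulSoup

html = "<p>one</p><p>two</p><p>three</p>"  # stand-in for real page content
soup = BeautifulSoup(html, "html.parser")

# Slicing never over-runs: it returns at most 20 items and fewer when
# the page has fewer than 20 <p> tags.
textContent = [p.text for p in soup.find_all("p")[:20]]
print(textContent)
# → ['one', 'two', 'three']
```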

        [Discussion]:

        • I believe the question is about getting past LinkedIn's redirect so the HTML source can actually be retrieved and parsed by Beautiful Soup, not about how to use Beautiful Soup.
        • I think that is exactly the problem: "However, it gives me this output instead of the corresponding HTML". When he prints the raw request response, he doesn't get HTML.