【Question title】: BeautifulSoup Instagram post html scraping
【Posted】: 2019-05-08 00:42:29
【Question description】:

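I want to scrape the post description from a specific Instagram page (e.g. https://www.instagram.com/p/BoFlrM7gwnK/). I have some code that fetches recent posts from an Instagram profile page, but it outputs a lot of information I don't need, such as some of the page's scripts.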

from random import choice
import json
from pprint import pprint

import requests
from bs4 import BeautifulSoup

_user_agents = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36']


class InstagramScraper:

    def __init__(self, user_agents=None, proxy=None):
        self.user_agents = user_agents
        self.proxy = proxy

    def __random_agent(self):
        # Use the caller's list if one was given, otherwise the module default.
        if self.user_agents and isinstance(self.user_agents, list):
            return choice(self.user_agents)
        return choice(_user_agents)

    def __request_url(self, url):
        try:
            response = requests.get(url, headers={'User-Agent': self.__random_agent()},
                                    proxies={'http': self.proxy, 'https': self.proxy})
            response.raise_for_status()
        except requests.HTTPError:
            raise requests.HTTPError('Received non 200 status code from Instagram')
        except requests.RequestException:
            raise requests.RequestException
        else:
            return response.text

    @staticmethod
    def extract_json_data(html):
        # The page data is expected in the first <script> tag inside <body>,
        # assigned to window._sharedData.
        soup = BeautifulSoup(html, 'html.parser')
        body = soup.find('body')
        script_tag = body.find('script')
        raw_string = script_tag.text.strip().replace('window._sharedData =', '').replace(';', '')
        return json.loads(raw_string)

    def profile_page_metrics(self, profile_url):
        results = {}
        try:
            response = self.__request_url(profile_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']
        except Exception as e:
            raise e
        else:
            for key, value in metrics.items():
                if key != 'edge_owner_to_timeline_media':
                    # Dict values are reduced to their 'count' field.
                    if value and isinstance(value, dict):
                        value = value['count']
                        results[key] = value
                    elif value:
                        results[key] = value
        return results

    def profile_page_recent_posts(self, profile_url):
        results = []
        try:
            response = self.__request_url(profile_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media'][
                "edges"]
        except Exception as e:
            raise e
        else:
            # Each edge wraps one post in a 'node' dict.
            for node in metrics:
                node = node.get('node')
                if node and isinstance(node, dict):
                    results.append(node)
        return results


k = InstagramScraper()

results=k.profile_page_recent_posts('https://www.instagram.com/selenagomez/')
pprint(results)

Is there a way to get the information for a specific post from its URL? Any help would be greatly appreciated.

【Question comments】:

    Tags: python beautifulsoup instagram


    【Solution 1】:

    Just copy the profile_page_recent_posts() method and adapt it to a single post, for example:

    def get_single_posts(self, post_url):
        results = []
        response = self.__request_url(post_url)
        json_data = self.extract_json_data(response)
    
        post_text = json_data['entry_data']['PostPage'][0]['graphql']['shortcode_media']['edge_media_to_caption']['edges'][0]['node']['text']
        post_shortcode = json_data['entry_data']['PostPage'][0]['graphql']['shortcode_media']['shortcode']
    
        results.append({'text' : post_text, 'shortcode' : post_shortcode})
    
        return results
    

    Output:

    [{'text' : 'Mood lol....', 'shortcode' : 'BoFlrM7gwnK'}]
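
The method above indexes `['edges'][0]` directly, so a post without a caption would likely raise an IndexError. Below is a minimal defensive sketch, assuming the same JSON layout the answer relies on. The name `get_caption` and the `sample` dictionary are hypothetical and exist only for illustration.

```python
def get_caption(json_data):
    """Return the caption text of a post, or None if it cannot be found."""
    try:
        media = json_data['entry_data']['PostPage'][0]['graphql']['shortcode_media']
    except (KeyError, IndexError, TypeError):
        # The page does not have the layout assumed in the answer.
        return None
    edges = media.get('edge_media_to_caption', {}).get('edges', [])
    if not edges:
        # The post has no caption.
        return None
    return edges[0].get('node', {}).get('text')


# Hand-written stand-in for the parsed page data (an assumption, not real output).
sample = {
    'entry_data': {
        'PostPage': [{
            'graphql': {
                'shortcode_media': {
                    'shortcode': 'BoFlrM7gwnK',
                    'edge_media_to_caption': {
                        'edges': [{'node': {'text': 'Mood lol....'}}],
                    },
                },
            },
        }],
    },
}

print(get_caption(sample))
```

Returning None instead of raising lets the caller decide how to treat posts without a caption.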
    

    To find the values you want, save json_data to a file and use a JSON viewer to pick the right keys.
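
One way to do this, sketched under the assumption that `json_data` is the dictionary returned by `extract_json_data()`. The helper names `dump_json` and `iter_key_paths`, and the small `sample` dictionary, are hypothetical and not part of the original code.

```python
import json


def dump_json(data, path):
    """Write data to a file as indented JSON so a JSON viewer can open it."""
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(data, fh, indent=2, ensure_ascii=False)


def iter_key_paths(data, prefix=()):
    """Yield (path, value) for every leaf value in nested dicts and lists."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from iter_key_paths(value, prefix + (key,))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from iter_key_paths(value, prefix + (index,))
    else:
        yield prefix, data


# Hand-written stand-in for json_data (an assumption, not real Instagram output).
sample = {'graphql': {'shortcode_media': {'shortcode': 'BoFlrM7gwnK'}}}

for path, value in iter_key_paths(sample):
    print(path, value)
```

Calling `dump_json(json_data, 'post.json')` would write the file for inspection. Printing the key paths is an alternative way to locate a value without leaving Python.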

    【Discussion】:

    • Thank you so much! This helped me :)