【Question title】: BeautifulSoup find method returns an object that is not subscriptable
【Posted】: 2021-02-28 11:36:25
【Question description】:

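I am trying to build a Twitter scraper using BeautifulSoup, requests, and json. When I run the code, it raises the error 'NoneType' object is not subscriptable. I checked the line where the error occurs, but I could not find what causes it. Can anyone help? I have not been able to fix it.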

  File "tweetscraper.py", line 131, in <module>
    start()
  File "tweetscraper.py", line 125, in start
    tweets = get_tweets_data(username, soup)
  File "tweetscraper.py", line 54, in get_tweets_data
    next_pointer = soup.find("div", {"class": "stream-container"})["data-min-position"]
TypeError: 'NoneType' object is not subscriptable

Here is my code:

import json
import sys

import requests
from bs4 import BeautifulSoup


def get_tweet_text(tweet):
    tweet_text_box = tweet.find("p", {"class": "TweetTextSize TweetTextSize--normal js-tweet-text tweet-text"})
    images_in_tweet_tag = tweet_text_box.find_all("a", {"class": "twitter-timeline-link u-hidden"})
    tweet_text = tweet_text_box.text
    for image_in_tweet_tag in images_in_tweet_tag:
        tweet_text = tweet_text.replace(image_in_tweet_tag.text, '')

    return tweet_text

def get_this_page_tweets(soup):
    tweets_list = list()
    tweets = soup.find_all("li", {"data-item-type": "tweet"})
    for tweet in tweets:
        tweet_data = None
        try:
            tweet_data = get_tweet_text(tweet)
        except Exception as e:
            continue
            #ignore if there is any loading or tweet error

        if tweet_data:
            tweets_list.append(tweet_data)
            print(".", end="")
            sys.stdout.flush()

    return tweets_list


def get_tweets_data(username, soup):
    tweets_list = list()
    tweets_list.extend(get_this_page_tweets(soup))

    next_pointer = soup.find("div", {"class": "stream-container"})["data-min-position"]

    while True:
        next_url = "https://twitter.com/i/profiles/show/" + username + \
                   "/timeline/tweets?include_available_features=1&" \
                   "include_entities=1&max_position=" + next_pointer + "&reset_error_state=false"

        next_response = None
        try:
            next_response = requests.get(next_url)
        except Exception as e:
            # in case there is some issue with request. None encountered so far.
            print(e)
            return tweets_list

        tweets_data = next_response.text
        tweets_obj = json.loads(tweets_data)
        if not tweets_obj["has_more_items"] and not tweets_obj["min_position"]:
            # using two checks here bcz in one case has_more_items was false but there were more items
            print("\nNo more tweets returned")
            break
        next_pointer = tweets_obj["min_position"]
        html = tweets_obj["items_html"]
        soup = BeautifulSoup(html, 'lxml')
        tweets_list.extend(get_this_page_tweets(soup))

    return tweets_list


# dump final result in a json file
def dump_data(username, tweets):
    filename = username+"_twitter.json"
    print("\nDumping data in file " + filename)
    data = dict()
    data["tweets"] = tweets
    with open(filename, 'w') as fh:
        fh.write(json.dumps(data))

    return filename


def get_username():
    # if username is not passed
    if len(sys.argv) < 2:
        usage()
    username = sys.argv[1].strip().lower()
    if not username:
        usage()

    return username


def start(username = None):
    username = get_username()
    url = "http://www.twitter.com/" + username
    print("\n\nDownloading tweets for " + username)
    response = None
    try:
        response = requests.get(url)
    except Exception as e:
        print(repr(e))
        sys.exit(1)
    
    if response.status_code != 200:
        print("Non success status code returned "+str(response.status_code))
        sys.exit(1)

    soup = BeautifulSoup(response.text, 'lxml')

    if soup.find("div", {"class": "errorpage-topbar"}):
        print("\n\n Error: Invalid username.")
        sys.exit(1)

    tweets = get_tweets_data(username, soup)
    # dump data in a text file
    dump_data(username, tweets)
    print(str(len(tweets))+" tweets dumped.")


start()

【Question comments】:

  • Can you show exactly what you are trying to scrape as next_pointer? I cannot find a tag with the class name stream-container.

Tags: json object web-scraping beautifulsoup


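【Solution 1】:

The find() method returns only the first element that matches the search criteria, as a single object, or None when nothing matches. The find_all() method returns every element that matches the criteria, so it returns a list, which can be indexed by integer position.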

See the Beautiful Soup documentation for more information.
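
To make the difference concrete, here is a minimal sketch of checking the result of find() before subscripting it. The helper name get_min_position and the two HTML snippets are made up for illustration, and html.parser is used only so the example runs without lxml; the real Twitter page may not contain a stream-container div at all, which is exactly the case the check is for.

```python
from bs4 import BeautifulSoup


def get_min_position(soup):
    """Return the data-min-position value, or None if the container is missing.

    Hypothetical helper for illustration; not part of the original script.
    """
    container = soup.find("div", {"class": "stream-container"})
    if container is None:
        # find() found no match, so subscripting it would raise
        # TypeError: 'NoneType' object is not subscriptable
        return None
    # .get() returns None instead of raising KeyError if the attribute is absent
    return container.get("data-min-position")


# Made-up sample markup: one page has the container, the other does not.
page_with_container = BeautifulSoup(
    '<div class="stream-container" data-min-position="12345"></div>',
    "html.parser",
)
page_without_container = BeautifulSoup("<p>no timeline here</p>", "html.parser")

print(get_min_position(page_with_container))
print(get_min_position(page_without_container))
```

In the script from the question, the caller would still have to decide what to do when the helper returns None, for example stop paginating instead of building next_url.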

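【Comments】:

  • I edited the code as you suggested, but it raises another error. TypeError: list indices must be integers or slices, not str
  • Maybe try iterating over its elements this way: assuming you store the output of find_all() in a variable named list, you can use for item in list:. Also, a side note: your original error mentions a NoneType object, which means find() returned nothing. With find_all(), if nothing is found, it returns an empty list.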
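
As a rough sketch of what the last comment describes: the result of find_all() has to be iterated (or indexed with an integer) before an attribute is read from each element. The sample markup and the data-item-id attribute below are invented for the example, and html.parser is an assumption chosen so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; data-item-id is a hypothetical attribute.
html = """
<ul>
  <li data-item-type="tweet" data-item-id="1">first</li>
  <li data-item-type="tweet" data-item-id="2">second</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li", {"data-item-type": "tweet"})

# Subscripting the whole result with a string fails, because it is a list.
try:
    items["data-item-id"]
except TypeError as exc:
    error_message = str(exc)

# Read the attribute from each element instead.
item_ids = [item["data-item-id"] for item in items]

# When nothing matches, find_all() gives back an empty result, not None.
missing = soup.find_all("div", {"class": "stream-container"})

print(item_ids)
print(len(missing))
```

An empty result means a loop over it simply does nothing, so the script would skip the missing element silently rather than crash; whether that is the desired behavior depends on the page being scraped.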