BeautifulSoup find_all('href') returns only part of the value

【Question title】: BeautifulSoup find_all('href') returns only part of the value
【Posted】: 2020-09-19 06:19:39
【Question】:

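I am trying to scrape actor/actress IDs from an IMDB movie page. I only want the cast (I don't want any of the crew), and this question is specifically about getting each person's internal ID. I already have the people's names, so I don't need help getting those. I'm starting with this page (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded URL to get the code right.

Inspecting the links, I found that the links for cast members look like this.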

<a href="/name/nm0000638/?ref_=ttfc_fc_cl_t1"> William Shatner</a>
<a href="/name/nm0000559/?ref_=ttfc_fc_cl_t2"> Leonard Nimoy</a>
<a href="/name/nm0346415/?ref_=ttfc_fc_cl_t17"> Nicholas Guest</a>

The links for other contributors look like this:

<a href="/name/nm0583292/?ref_=ttfc_fc_dr1"> Nicholas Meyer </a>
<a href="/name/nm0734472/?ref_=ttfc_fc_wr1"> Gene Roddenberry</a>

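This should let me tell actors/actresses apart from crew such as directors or writers by checking whether the href ends in "t[0-9]+$" rather than in the same thing but with "dr" or "wr".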
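The check described above can be sketched on the sample hrefs from the question. This is only an illustration of the intended suffix test; the helper name `is_cast_link` and the constant `CAST_SUFFIX` are made up for this example, and it assumes the hrefs still carry their `ref_` query parameter.

```python
import re

# Pattern from the question: the href must end in "t" followed by digits.
CAST_SUFFIX = re.compile(r't[0-9]+$')

# Sample hrefs copied from the question.
hrefs = [
    '/name/nm0000638/?ref_=ttfc_fc_cl_t1',   # cast
    '/name/nm0346415/?ref_=ttfc_fc_cl_t17',  # cast
    '/name/nm0583292/?ref_=ttfc_fc_dr1',     # director
    '/name/nm0734472/?ref_=ttfc_fc_wr1',     # writer
]


def is_cast_link(href):
    """Hypothetical helper: True when href is a /name/ link ending in t<digits>."""
    return href.startswith('/name/') and CAST_SUFFIX.search(href) is not None


cast = [h for h in hrefs if is_cast_link(h)]
print(cast)
```

Only the two `cl_t` links pass the test; the `dr1` and `wr1` links are rejected because no `t` directly precedes their trailing digits.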

Here is the code I am running.

import urllib.request
from bs4 import BeautifulSoup
import re

movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'

def clearLists(n):
    return [[] for _ in range(n)]

def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)

def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)

    #get all the tags with links in them
    link_tags = soupObject.find_all('a')

    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)

    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs

newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)

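The code above returns an empty list of IDs. If you uncomment the print statements, you can see that this is because the value that ends up in the "link" variable looks like the following (varying by person):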

/name/nm0583292/
/name/nm0000638/

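That won't work. I only want the IDs of the actors and actresses so that I can use those IDs later. I have tried to find other answers on Stack Overflow, but I haven't been able to find this particular problem.

This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it gets its information from the text between the tags rather than from the href in the tag's attributes.

How do I make sure I get only the name IDs I want (only the cast's IDs) from the page? (Also, please feel free to suggest ways to tighten up the code.)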

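【Comments】:

There are a few comments to make about the code, but most importantly, the HTML your code loads doesn't match the HTML rendered in the browser: it doesn't include the query parameters you're trying to match, so /name/nm0000638/?ref_=ttfc_fc_cl_t1 comes through as just /name/nm0000638/. You may want to consider another way of matching the cast, for example only taking the links in the cast section? BS should make that fairly simple.

Tags: python html web-scraping beautifulsoup href

【Solution 1】: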

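The links you're trying to match appear to be modified by JavaScript after the page loads, or perhaps the page is served differently based on variables other than the URL alone (such as cookies or headers).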
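One way to check for this mismatch is to count how many `/name/` links carry a query string at all. The sketch below uses only the standard library; `count_query_strings` is a hypothetical helper, and the two lists are hand-written stand-ins built from the hrefs quoted in the question, not output fetched from IMDB.

```python
from urllib.parse import urlsplit


def count_query_strings(hrefs):
    """Return (with_query, without_query) counts for /name/ links."""
    with_query = without_query = 0
    for href in hrefs:
        if not href.startswith('/name/'):
            continue  # ignore links that do not point at a person
        if urlsplit(href).query:
            with_query += 1
        else:
            without_query += 1
    return with_query, without_query


# Stand-in data: hrefs as seen in the browser vs. as printed by the script.
browser_hrefs = ['/name/nm0000638/?ref_=ttfc_fc_cl_t1', '/name/nm0583292/?ref_=ttfc_fc_dr1']
fetched_hrefs = ['/name/nm0000638/', '/name/nm0583292/']

print(count_query_strings(browser_hrefs))
print(count_query_strings(fetched_hrefs))
```

If none of the fetched links has a query string, any test that relies on the `ref_` suffix cannot succeed, whatever the regular expression looks like.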

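However, since you're only after links to people in the cast, a simpler approach is to match the IDs of people within the cast section. That is actually fairly easy, because they are all inside a single element, <table class="cast_list">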
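As a minimal sketch of that scoping idea, the snippet below runs BeautifulSoup on a small inline document. `SAMPLE_HTML` is a simplified, hand-written stand-in for the real page (the actual IMDB markup is an assumption here), and the `None` guard and `href=True` filter are defensive additions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one crew link outside the table,
# and cast links (including a thumbnail link) inside it.
SAMPLE_HTML = """
<a href="/name/nm0583292/">Nicholas Meyer</a>
<table class="cast_list">
  <tr><td><a href="/name/nm0000638/"><img alt="thumb"></a></td>
      <td><a href="/name/nm0000638/"> William Shatner</a></td></tr>
  <tr><td><a href="/name/nm0000559/"> Leonard Nimoy</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# find() returns None when no such table exists, so guard before searching it.
cast_table = soup.find('table', class_='cast_list')
cast_hrefs = []
if cast_table is not None:
    # href=True skips anchors that have no href attribute.
    cast_hrefs = [a['href'] for a in cast_table.find_all('a', href=True)]

print(cast_hrefs)
```

The director's link sits outside the table and is never visited, so no suffix test on the href is needed.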

So:

import urllib.request
from bs4 import BeautifulSoup
import re

# it's Python, so use Python conventions, no uppercase in function or variable names
movie_number = 'tt0084726'
# The f-string is often more readable than a + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'


# this is overly fancy for something as simple as initialising some variables
# how about:
# a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# in an object-oriented program, assuming something is an object is the norm
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    # return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables, also 'list_of_people_ids' is needlessly verbose
    # since they go together, why not return people as a list of tuples, or a dictionary?
    # I'd prefer a dictionary as it automatically gets rid of duplicates as well
    people = {}

    # (put a space at the start of your comment blocks!)
    # get all the anchors tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    # the whole point of compiling the regex is to only have to do it once, 
    # so outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of people
    for link_tag in link_tags:
        # the href attribute is already a string, so casting with str() is not needed;
        # default to '' so an anchor without an href doesn't break the regex search
        href = link_tag.get('href', '')
        # matching and extracting part of the match can all be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name

    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit, allows you to import stuff without running the main
if __name__ == '__main__':
    main()

Result:

{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', etc.
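Note that the keys above are the digits only, while the question sliced out the full ID including the `nm` prefix. If the prefixed form is what you need later, one possible variant moves `nm` inside the capture group. `FULL_ID` and `full_person_id` are names invented for this sketch, not part of the code above.

```python
import re

# Variant pattern: the capture group includes the "nm" prefix.
FULL_ID = re.compile(r'/name/(nm\d+)/')


def full_person_id(href):
    """Hypothetical helper: return e.g. 'nm0000638', or None if href is not a person link."""
    match = FULL_ID.search(href)
    return match.group(1) if match else None


print(full_person_id('/name/nm0000638/'))
print(full_person_id('/title/tt0084726/'))
```

Because the pattern does not look at the query string, it behaves the same whether or not the `ref_` parameter is present.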

Here is the code with a few more tweaks and without the comments on your code:

import urllib.request
from bs4 import BeautifulSoup
import re


def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}

    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href', ''))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name

    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'

    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()
