【Question title】: How to scrape and extract the same specific information from multiple URLs in a list
【Posted】: 2021-07-19 00:58:15
【Question description】:

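I want to scrape the genre and length (runtime) of each movie in a list of 250 movies. A list named 'links' contains the URLs of these 250 movie pages. So far I have written code that extracts the genre and length from a single URL in that list.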

links=['https://www.imdb.com/title/tt0093603/','https://www.imdb.com/title/tt8176054/','https://www.imdb.com/title/tt0367495/','https://www.imdb.com/title/tt0048473/','https://www.imdb.com/title/tt0079221/','https://www.imdb.com/title/tt7391996/','https://www.imdb.com/title/tt0052572/','https://www.imdb.com/title/tt0237376/','https://www.imdb.com/title/tt0214915/','https://www.imdb.com/title/tt5311546/','https://www.imdb.com/title/tt7019842/','https://www.imdb.com/title/tt0105575/','https://www.imdb.com/title/tt0400234/','https://www.imdb.com/title/tt8413338/','https://www.imdb.com/title/tt12361178/','https://www.imdb.com/title/tt4991384/','https://www.imdb.com/title/tt1187043/','https://www.imdb.com/title/tt8948790/','https://www.imdb.com/title/tt0986264/','https://www.imdb.com/title/tt10189514/','https://www.imdb.com/title/tt0101649/','https://www.imdb.com/title/tt5074352/','https://www.imdb.com/title/tt9477520/','https://www.imdb.com/title/tt7060344/','https://www.imdb.com/title/tt9900782/','https://www.imdb.com/title/tt0291855/','https://www.imdb.com/title/tt0048956/','https://www.imdb.com/title/tt0085743/','https://www.imdb.com/title/tt0050870/','https://www.imdb.com/title/tt7738784/','https://www.imdb.com/title/tt5959980/','https://www.imdb.com/title/tt0059246/','https://www.imdb.com/title/tt4987556/','https://www.imdb.com/title/tt0312859/','https://www.imdb.com/title/tt0072783/','https://www.imdb.com/title/tt0119385/','https://www.imdb.com/title/tt0292246/','https://www.imdb.com/title/tt10214826/','https://www.imdb.com/title/tt7019942/','https://www.imdb.com/title/tt3417422/','https://www.imdb.com/title/tt7465992/','https://www.imdb.com/title/tt5867800/','https://www.imdb.com/title/tt6148156/','https://www.imdb.com/title/tt8239946/',
'https://www.imdb.com/title/tt0466460/','https://www.imdb.com/title/tt0459516/','https://www.imdb.com/title/tt4679210/','https://www.imdb.com/title/tt0376127/','https://www.imdb.com/title/tt0066763/','https://www.imdb.com/title/tt3973410/','https://www.imdb.com/title/tt3668162/','https://www.imdb.com/title/tt0220656/','https://www.imdb.com/title/tt6380520/','https://www.imdb.com/title/tt0195231/','https://www.imdb.com/title/tt8108198/','https://www.imdb.com/title/tt4429128/','https://www.imdb.com/title/tt2877108/','https://www.imdb.com/title/tt2181831/','https://www.imdb.com/title/tt3569782/','https://www.imdb.com/title/tt0376076/','https://www.imdb.com/title/tt1954470/','https://www.imdb.com/title/tt1620933/','https://www.imdb.com/title/tt5312232/','https://www.imdb.com/title/tt2356180/','https://www.imdb.com/title/tt0242519/','https://www.imdb.com/title/tt4934950/','https://www.imdb.com/title/tt0367110/','https://www.imdb.com/title/tt0073707/','https://www.imdb.com/title/tt2218988/','https://www.imdb.com/title/tt0871510/','https://www.imdb.com/title/tt0375611/','https://www.imdb.com/title/tt0104561/','https://www.imdb.com/title/tt0054098/','https://www.imdb.com/title/tt1562872/','https://www.imdb.com/title/tt4430212/','https://www.imdb.com/title/tt4851630/','https://www.imdb.com/title/tt5005684/','https://www.imdb.com/title/tt10324144/','https://www.imdb.com/title/tt1639426/','https://www.imdb.com/title/tt0057935/','https://www.imdb.com/title/tt7060460/','https://www.imdb.com/title/tt1280558/','https://www.imdb.com/title/tt3322420/','https://www.imdb.com/title/tt4635372/','https://www.imdb.com/title/tt0242256/','https://www.imdb.com/title/tt0200087/','https://www.imdb.com/title/tt0374887/','https://www.imdb.com/title/tt0139876/','https://www.imdb.com/title/tt0292490/','https://www.imdb.com/title/tt0105271/','https://www.imdb.com/title/tt9052870/','https://www.imdb.com/title/tt2283748/','https://www.imdb.com/title/tt0405508/','https://www.imdb.com/title/tt0364647/'
,'https://www.imdb.com/title/tt0169102/','https://www.imdb.com/title/tt1821480/','https://www.imdb.com/title/tt0109117/','https://www.imdb.com/title/tt8291224/','https://www.imdb.com/title/tt2338151/','https://www.imdb.com/title/tt2358592/','https://www.imdb.com/title/tt0453729/','https://www.imdb.com/title/tt0319736/','https://www.imdb.com/title/tt0843326/','https://www.imdb.com/title/tt2082197/','https://www.imdb.com/title/tt5571734/','https://www.imdb.com/title/tt0112553/','https://www.imdb.com/title/tt0379370/','https://www.imdb.com/title/tt8144834/','https://www.imdb.com/title/tt0488414/','https://www.imdb.com/title/tt0116630/','https://www.imdb.com/title/tt13299890/','https://www.imdb.com/title/tt0456144/','https://www.imdb.com/title/tt7822438/','https://www.imdb.com/title/tt5824826/','https://www.imdb.com/title/tt4849438/','https://www.imdb.com/title/tt0072860/','https://www.imdb.com/title/tt1695800/','https://www.imdb.com/title/tt2564144/','https://www.imdb.com/title/tt1261047/','https://www.imdb.com/title/tt0063404/','https://www.imdb.com/title/tt0471571/','https://www.imdb.com/title/tt7392212/','https://www.imdb.com/title/tt3390572/','https://www.imdb.com/title/tt0112870/','https://www.imdb.com/title/tt6315524/','https://www.imdb.com/title/tt5906392/','https://www.imdb.com/title/tt0213969/','https://www.imdb.com/title/tt2882328/','https://www.imdb.com/title/tt0050188/','https://www.imdb.com/title/tt1821317/','https://www.imdb.com/title/tt2377938/','https://www.imdb.com/title/tt7838252/','https://www.imdb.com/title/tt10919240/','https://www.imdb.com/title/tt1180583/','https://www.imdb.com/title/tt1773764/','https://www.imdb.com/title/tt3394420/','https://www.imdb.com/title/tt7725596/','https://www.imdb.com/title/tt2395469/','https://www.imdb.com/title/tt1327035/','https://www.imdb.com/title/tt3863552/','https://www.imdb.com/title/tt1649431/','https://www.imdb.com/title/tt0051792/','https://www.imdb.com/title/tt0220832/','https://www.imdb.com/title/tt1857670/',
'https://www.imdb.com/title/tt3614516/','https://www.imdb.com/title/tt7180544/','https://www.imdb.com/title/tt0296574/','https://www.imdb.com/title/tt7294534/','https://www.imdb.com/title/tt3449292/','https://www.imdb.com/title/tt11581174/','https://www.imdb.com/title/tt2585562/','https://www.imdb.com/title/tt1188996/','https://www.imdb.com/title/tt5082014/','https://www.imdb.com/title/tt3124456/',
 'https://www.imdb.com/title/tt8110330/',
 'https://www.imdb.com/title/tt0347304/',
 'https://www.imdb.com/title/tt1093370/',
 'https://www.imdb.com/title/tt2924472/',
 'https://www.imdb.com/title/tt1609168/',
 'https://www.imdb.com/title/tt6167894/',
 'https://www.imdb.com/title/tt0118751/',
 'https://www.imdb.com/title/tt7485048/',
 'https://www.imdb.com/title/tt2325915/',
 'https://www.imdb.com/title/tt0375878/',
 'https://www.imdb.com/title/tt1417299/',
 'https://www.imdb.com/title/tt7218518/',
 'https://www.imdb.com/title/tt0323013/',
 'https://www.imdb.com/title/tt8108200/',
 'https://www.imdb.com/title/tt2631186/',
 'https://www.imdb.com/title/tt0455829/',
 'https://www.imdb.com/title/tt0824316/',
 'https://www.imdb.com/title/tt0222012/',
 'https://www.imdb.com/title/tt11322920/',
 'https://www.imdb.com/title/tt3848892/',
 'https://www.imdb.com/title/tt10717738/',
 'https://www.imdb.com/title/tt4387040/',
 'https://www.imdb.com/title/tt5764096/',
 'https://www.imdb.com/title/tt0366840/',
 'https://www.imdb.com/title/tt2181931/',
 'https://www.imdb.com/title/tt1517561/',
 'https://www.imdb.com/title/tt0373856/',
 'https://www.imdb.com/title/tt2926068/',
 'https://www.imdb.com/title/tt2350496/',
 'https://www.imdb.com/title/tt1077248/',
 'https://www.imdb.com/title/tt0402014/',
 'https://www.imdb.com/title/tt13206926/',
 'https://www.imdb.com/title/tt8130968/',
 'https://www.imdb.com/title/tt0816258/',
 'https://www.imdb.com/title/tt6108090/',
 'https://www.imdb.com/title/tt4169250/',
 'https://www.imdb.com/title/tt0291376/',
 'https://www.imdb.com/title/tt2317337/',
 'https://www.imdb.com/title/tt0093578/',
 'https://www.imdb.com/title/tt7098658/',
 'https://www.imdb.com/title/tt4434004/',
 'https://www.imdb.com/title/tt1907761/',
 'https://www.imdb.com/title/tt7758160/',
 'https://www.imdb.com/title/tt0077451/',
 'https://www.imdb.com/title/tt4432480/',
 'https://www.imdb.com/title/tt1230165/',
 'https://www.imdb.com/title/tt0420332/',
 'https://www.imdb.com/title/tt3822396/',
 'https://www.imdb.com/title/tt1851988/',
 'https://www.imdb.com/title/tt5121000/',
 'https://www.imdb.com/title/tt1288638/',
 'https://www.imdb.com/title/tt0499375/',
 'https://www.imdb.com/title/tt0431619/',
 'https://www.imdb.com/title/tt2187153/',
 'https://www.imdb.com/title/tt0196069/',
 'https://www.imdb.com/title/tt2213054/',
 'https://www.imdb.com/title/tt3801314/',
 'https://www.imdb.com/title/tt1292703/',
 'https://www.imdb.com/title/tt4981966/',
 'https://www.imdb.com/title/tt1266583/',
 'https://www.imdb.com/title/tt1839596/',
 'https://www.imdb.com/title/tt0422320/',
 'https://www.imdb.com/title/tt7998242/',
 'https://www.imdb.com/title/tt2258337/',
 'https://www.imdb.com/title/tt0110222/',
 'https://www.imdb.com/title/tt0109555/',
 'https://www.imdb.com/title/tt6484982/',
 'https://www.imdb.com/title/tt4900716/',
 'https://www.imdb.com/title/tt3320542/',
 'https://www.imdb.com/title/tt7142506/',
 'https://www.imdb.com/title/tt1241195/',
 'https://www.imdb.com/title/tt8108268/',
 'https://www.imdb.com/title/tt0150433/',
 'https://www.imdb.com/title/tt2855648/',
 'https://www.imdb.com/title/tt0098999/',
 'https://www.imdb.com/title/tt0432047/',
 'https://www.imdb.com/title/tt3447364/',
 'https://www.imdb.com/title/tt1014672/',
 'https://www.imdb.com/title/tt1926313/',
 'https://www.imdb.com/title/tt5286444/',
 'https://www.imdb.com/title/tt2980794/',
 'https://www.imdb.com/title/tt8042292/',
 'https://www.imdb.com/title/tt1447500/',
 'https://www.imdb.com/title/tt0106333/',
 'https://www.imdb.com/title/tt2140465/',
 'https://www.imdb.com/title/tt0920464/',
 'https://www.imdb.com/title/tt5310090/',
 'https://www.imdb.com/title/tt7212754/',
 'https://www.imdb.com/title/tt1324059/',
 'https://www.imdb.com/title/tt3767372/',
 'https://www.imdb.com/title/tt2375559/',
 'https://www.imdb.com/title/tt6027478/',
 'https://www.imdb.com/title/tt8590896/',
 'https://www.imdb.com/title/tt0172684/',
 'https://www.imdb.com/title/tt6206564/',
 'https://www.imdb.com/title/tt0449994/']

Now I have to do this for all 250 URLs in the list. When I put the process in a loop, I only got the information for the last URL.
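The looping code is not shown, so the cause can only be guessed at. One common way to end up with only the last result is to re-create the result dictionary inside the loop, so each pass discards the previous one. The sketch below contrasts the two patterns; `fake_scrape` and the short `urls` list are hypothetical stand-ins for the real request and parsing step.

```python
def fake_scrape(url):
    # Hypothetical stand-in for the real request + parsing step.
    return "genre-of-" + url, "length-of-" + url


urls = ["u1", "u2", "u3"]

# Pattern A: the dictionary is rebuilt on every pass,
# so only the values from the last URL survive the loop.
for url in urls:
    overwritten = {"genre1": [], "length_of_movie": []}
    genre, length = fake_scrape(url)
    overwritten["genre1"].append(genre)
    overwritten["length_of_movie"].append(length)

# Pattern B: the dictionary is created once, before the loop,
# and grows by one entry per URL.
accumulated = {"genre1": [], "length_of_movie": []}
for url in urls:
    genre, length = fake_scrape(url)
    accumulated["genre1"].append(genre)
    accumulated["length_of_movie"].append(length)

print(len(overwritten["genre1"]))  # 1
print(len(accumulated["genre1"]))  # 3
```

If the real loop resembles pattern A, moving the dictionary creation above the `for` statement is probably enough.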

This is the code I wrote for one URL:

import requests
from bs4 import BeautifulSoup

def get_movie_info(a_tag, div_tag):
  # returns all the required info about a movie
  span_tags1 = a_tag.find_all('span')
  genre = span_tags1[0].text.strip()
  li_tags = div_tag.find_all('li')
  length_of_film = li_tags[1].text.strip()
  return genre, length_of_film

movie_page_url = links[0]       # 1st url in the list
response = requests.get(movie_page_url)
# parse the downloaded HTML
movie_doc = BeautifulSoup(response.content, 'html.parser')

# get a tags
a_tags = movie_doc.find_all('a', attrs={'class':"GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt"})

# get div tags
div_tags = movie_doc.find_all('div', attrs={'class':"TitleBlock__TitleMetaDataContainer-sc-1nlhx7j-2 hWHMKr"})

movie_dict = {
  'genre1' : [],
  'length_of_movie' : []}

a_tag = a_tags[0]
div_tag = div_tags[0]

movie_info = get_movie_info(a_tag, div_tag)
movie_dict['genre1'].append(movie_info[0])
movie_dict['length_of_movie'].append(movie_info[1])

The output is

movie_dict = {'genre1': ['Crime'], 'length_of_movie': ['2h 25min']}

The expected output is a DataFrame with the columns 'genre1' and 'length_of_movie' and 250 rows, one per movie, holding its genre and length.
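Getting exactly one row per URL also means the two lists have to stay the same length when a page fails to load or lacks the expected tags. The following is a minimal sketch of one way to do that, not part of the original post: `scrape_all`, `demo_scrape_one`, the extra `url` column, and the example.com addresses are all hypothetical, and `demo_scrape_one` stands in for the real requests and BeautifulSoup step.

```python
def scrape_all(urls, scrape_one):
    """Call scrape_one(url) for each URL; store None values when it raises."""
    rows = {"url": [], "genre1": [], "length_of_movie": []}
    for url in urls:
        try:
            genre, length = scrape_one(url)
        except Exception as exc:
            # Broad on purpose: covers network errors as well as an
            # IndexError raised when an expected tag is missing.
            print(f"skipping {url}: {exc!r}")
            genre, length = None, None
        rows["url"].append(url)
        rows["genre1"].append(genre)
        rows["length_of_movie"].append(length)
    return rows


def demo_scrape_one(url):
    # Hypothetical stand-in for the real scraper; fails for one URL.
    if url.endswith("bad/"):
        raise IndexError("list index out of range")
    return "Crime", "2h 25min"


result = scrape_all(
    ["https://example.com/ok/", "https://example.com/bad/"],
    demo_scrape_one,
)
print(result["genre1"])  # ['Crime', None]
```

Because every list receives one entry per URL, the resulting dictionary can be handed to `pd.DataFrame` and the failed pages show up as rows with missing values instead of stopping the run.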

【Question comments】:

    Tags: python list loops web-scraping imdb


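    【Solution 1】:

    Iterate over your list of movie URLs and append each result to the lists in the dictionary. As the last step, create the DataFrame: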

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    
    links = [
        "https://www.imdb.com/title/tt0093603/",
        "https://www.imdb.com/title/tt8176054/",
        "https://www.imdb.com/title/tt0367495/",
        # ... rest of your URLs
    ]
    
    
    def get_movie_info(a_tag, div_tag):
        span_tags1 = a_tag.find_all("span")
        genre = span_tags1[0].text.strip()
        li_tag = div_tag.find(lambda tag: tag.name == "li" and "min" in tag.text)
        length_of_film = li_tag.text.strip()
        return genre, length_of_film
    
    
    movie_dict = {"genre1": [], "length_of_movie": []}
    for movie_page_url in links:
        response = requests.get(movie_page_url)
        movie_doc = BeautifulSoup(response.content, "html.parser")
    
        # get a tags
        a_tags = movie_doc.find_all(
            "a",
            attrs={
                "class": "GenresAndPlot__GenreChip-cum89p-3 fzmeux ipc-chip ipc-chip--on-baseAlt"
            },
        )
    
        # get div tags
        div_tags = movie_doc.find_all(
            "div",
            attrs={
                "class": "TitleBlock__TitleMetaDataContainer-sc-1nlhx7j-2 hWHMKr"
            },
        )
    
        a_tag = a_tags[0]
        div_tag = div_tags[0]
    
        movie_info = get_movie_info(a_tag, div_tag)
        movie_dict["genre1"].append(movie_info[0])
        movie_dict["length_of_movie"].append(movie_info[1])
    
    df = pd.DataFrame(movie_dict)
    print(df)
    

    Prints:

          genre1 length_of_movie
    0      Crime        2h 25min
    1      Drama        2h 34min
    2  Adventure        2h 40min
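The lengths above are strings such as "2h 25min", which are awkward to sort or average. As an optional follow-up, here is a small sketch that converts them to minutes. The helper name `runtime_to_minutes` is an assumption for illustration, and the pattern only covers the "Nh Nmin" style shown in this output; other formats return None.

```python
import re


def runtime_to_minutes(text):
    """Convert strings such as '2h 25min', '2h' or '45min' to total minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text)
    if match is None or not any(match.groups()):
        return None  # unrecognised format
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


for sample in ["2h 25min", "2h 34min", "2h 40min"]:
    print(sample, "->", runtime_to_minutes(sample))
# 2h 25min -> 145
# 2h 34min -> 154
# 2h 40min -> 160
```

With the DataFrame from the answer, something like `df["length_of_movie"].map(runtime_to_minutes)` would apply the conversion to the whole column.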
    

    【Comments】:
