【Question Title】: Scraping two pages at the same time: pandas error
【Posted】: 2019-10-24 03:03:13
【Question】:

I want to save the movie reviews and movie titles from these two pages:

https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=~
https://movie.naver.com/movie/bi/mi/basic.nhn?code=~

When I run this code and open the csv file, I get:

ValueError: Shape of passed values is (2, 6), indices imply (2, 10)
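For reference, this error means the DataFrame constructor received a 2×6 block of values while the requested labels imply a 2×10 shape. A minimal sketch (hypothetical numbers, unrelated to the scraped data) that triggers the same class of error:

```python
import numpy as np
import pandas as pd

values = np.zeros((2, 6))      # 2 rows x 6 columns of data
cols = list("abcdefghij")      # but 10 column labels are requested

try:
    pd.DataFrame(values, columns=cols)
except ValueError as e:
    print(e)  # shape of passed values does not match the implied index
```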

from bs4 import BeautifulSoup
from urllib.request import urlopen
from selenium import webdriver
from urllib.request import urljoin
import pandas as pd
import requests

#url_base = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=25917&type=after&page=1'
base_url = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=' #review page
base_url2 = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code=' #movie title
pages =['177374','164102']

#print(soup.find_all('div', 'score_reple'))
#div = soup.find('h3', 'h_movie')

df = pd.DataFrame()
for n in pages:
    # Create url
    url = base_url + n
    url2 = base_url2 + n

    # Parse data using BS
    print('Downloading page %s...' % url)
    print('Downloading page %s...' % url2)

    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    reple = soup.find_all('div', 'score_reple')
    res2 = requests.get(url2)
    soup = BeautifulSoup(res2.text, "html.parser")
    title = soup.find('h3', 'h_movie')
    #ratesc = soup.find('','')
    #story=rname.getText()
    #data = [title,reple]
    data = {'title':[title], 'reviewn':[reple]}
    df = df.append(pd.DataFrame(data), sort=True).reset_index(drop=True)

df.to_csv('./title.csv', sep=',', encoding='utf-8-sig')

How can I fix this code?

【Comments】:

  • Why are you using BeautifulSoup if you have Selenium?

Tags: python pandas web-scraping beautifulsoup web-crawler


【Solution 1】:

One thing you can try, to clean it up, is to convert to a string first and then slice it between markers taken from the HTML, like this:

title = str(soup.find('h3', 'h_movie'))
start = '" title="'
end = '                                     ,                   2018">'
newTitle = title[title.find(start)+len(start):title.rfind(end)]

Then try the same approach on the review section. You will need to narrow down your result set, then convert to a string and slice it where the review text sits.

After that you will have clean data, ready to add to the DataFrame.

Hope this helps get you on the right path!
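The slicing idea above can be sketched on a made-up HTML snippet (the `start`/`end` markers here are illustrative, not the exact ones from the Naver page):

```python
# Made-up snippet standing in for the scraped page; the real markers differ.
html = '<h3 class="h_movie"><a href="#" title="Example Movie">Example Movie</a></h3>'

start = 'title="'                      # marker just before the wanted text
end = '">'                             # marker just after it
s = html.find(start) + len(start)      # index of the first title character
new_title = html[s:html.find(end, s)]  # slice up to the closing marker
print(new_title)  # Example Movie
```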

【Discussion】:

【Solution 2】:

Now it's clean... just strip the tags using the following:

from bs4 import BeautifulSoup
from urllib.request import urlopen
#from selenium import webdriver
from urllib.request import urljoin
import pandas as pd
import requests
import re

#url_base = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=25917&type=after&page=1'
base_url = 'https://movie.naver.com/movie/bi/mi/pointWriteFormList.nhn?code=' #review page
base_url2 = 'https://movie.naver.com/movie/bi/mi/basic.nhn?code=' #movie title
pages =['177374','164102']

df = pd.DataFrame()
for n in pages:
    # Create url
    url = base_url + n
    url2 = base_url2 + n

    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    reple = soup.find("span", {"id":re.compile("^_filtered_ment")}).getText()
    res2 = requests.get(url2)
    soup = BeautifulSoup(res2.text, "html.parser")
    title = soup.find('h3', 'h_movie')
    for a in title.find_all('a'):
        #print(a.text)
        title=a.text

    data = {'title':[title], 'reviewn':[reple]}
    df = df.append(pd.DataFrame(data))

df.to_csv('./title.csv', sep=',', encoding='utf-8-sig')
    

I added import re for the regex matching the _filtered_ment_* ids.
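The re.compile("^_filtered_ment") pattern matches the auto-numbered ids on the review page (_filtered_ment_0, _filtered_ment_1, ...). A standalone sketch of that matching, with hypothetical id values and independent of BeautifulSoup:

```python
import re

# Anchored pattern: the id must start with "_filtered_ment".
pattern = re.compile("^_filtered_ment")

# Hypothetical id values, for illustration only.
ids = ["_filtered_ment_0", "_filtered_ment_7", "score_reple", "filtered_ment"]
matched = [i for i in ids if pattern.match(i)]
print(matched)  # ['_filtered_ment_0', '_filtered_ment_7']
```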

【Discussion】:
