【Question Title】: Can't write in a CSV file with Python
【Posted】: 2019-10-23 21:43:52
【Question】:

After scraping, I'm trying to write the data to a CSV using a pandas DataFrame, but the CSV is empty even after the program finishes. The headers are written first, but they get overwritten once the DataFrame kicks in. Here is the code:

from bs4 import BeautifulSoup
import requests
import re as resju
import csv
import pandas as pd
re = requests.get('https://www.farfeshplus.com/Video.asp?ZoneID=297')

soup = BeautifulSoup(re.content, 'html.parser')

links = soup.findAll('a', {'class': 'opacityit'})
links_with_text = [a['href'] for a in links]

headers = ['Name', 'LINK']
# this is the output file; you can change the path as desired (defaults to the working directory)
file = open('data123.csv', 'w', encoding="utf-8")
writer = csv.writer(file)
writer.writerow(headers)

for i in links_with_text:
    new_re = requests.get(i)
    new_soup = BeautifulSoup(new_re.content, 'html.parser')
    m = new_soup.select_one('h1 div')
    Name = m.text

    print(Name)

    n = new_soup.select_one('iframe')
    ni = n['src']

    iframe = requests.get(ni)
    i_soup = BeautifulSoup(iframe.content, 'html.parser')

    d_script = i_soup.select_one('body > script')
    d_link = d_script.text

    mp4 = resju.compile(r"(?<=mp4:\s\[\')(.*)\'\]")
    final_link = mp4.findall(d_link)[0]
    print(final_link)

    df = pd.DataFrame(zip(Name, final_link))

    df.to_csv(file, header=None, index=False)

file.close()

df.head() returns:

 0  1
0  ل  h
1  ي  t
2  ل  t
3  ى  p
4     s
   0  1
0  ل  h
1  ي  t
2  ل  t
3  ى  p
4     s

Any suggestions?

【Comments】:

  • Can you do a print(df.head()) before writing to the csv file? I don't think it's a csv-writing problem.
  • It looks like you're overwriting the csv inside the for loop; try appending each iteration's values to a global variable, then use it outside the loop.
  • @Ram, edited. Please check again.
  • @Datanovice, could you give an example? I can't figure it out.

Tags: python pandas beautifulsoup


【Solution 1】:

You seem to be mixing libraries to write the csv; pandas handles all of this fine on its own, so there's no need for Python's built-in csv module.

I've modified your code below - this should build your data as one complete DataFrame and write it out as a csv.

Also, with header=None you set the columns to empty, so they would be referenced by index number.
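To see why the original code produced single-character rows: `zip()` over two strings pairs them character by character, so `pd.DataFrame(zip(Name, final_link))` builds one row per letter. A minimal sketch (the sample values are made up stand-ins for the scraped title and link):

```python
import pandas as pd

# zip() over two strings iterates character by character --
# that's why every row of the original DataFrame held single letters.
name = "abc"                         # stand-in for a scraped title
link = "https://example.com/v.mp4"   # stand-in for the extracted mp4 link

bad = pd.DataFrame(zip(name, link))
print(bad.shape)   # one row per character of the shorter string: (3, 2)

# Wrapping the scalars in lists gives the intended single row:
good = pd.DataFrame(zip([name], [link]), columns=['Name', 'LINK'])
print(good.shape)  # (1, 2)
```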

from bs4 import BeautifulSoup
import requests
import re as resju
#import csv
import pandas as pd
re = requests.get('https://www.farfeshplus.com/Video.asp?ZoneID=297')

soup = BeautifulSoup(re.content, 'html.parser')

links = soup.findAll('a', {'class': 'opacityit'})
links_with_text = [a['href'] for a in links]

names_ = [] # global list to hold all iterable variables from your loops
final_links_ = []

for i in links_with_text:
    new_re = requests.get(i)
    new_soup = BeautifulSoup(new_re.content, 'html.parser')
    m = new_soup.select_one('h1 div')
    Name = m.text
    names_.append(Name) # append to global list.


    print(Name)

    n = new_soup.select_one('iframe')
    ni = n['src']

    iframe = requests.get(ni)
    i_soup = BeautifulSoup(iframe.content, 'html.parser')

    d_script = i_soup.select_one('body > script')
    d_link = d_script.text

    mp4 = resju.compile(r"(?<=mp4:\s\[\')(.*)\'\]")
    final_link = mp4.findall(d_link)[0]
    print(final_link)
    final_links_.append(final_link) # append to global list.


df = pd.DataFrame(zip(names_, final_links_)) # use global lists.
df.columns = ['Name', 'LINK']

df.to_csv('data123.csv', index=False) # write straight to the path; no open file handle needed
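If you'd rather keep writing inside the loop, another option (a sketch, not part of the answer above; the file name is assumed to match the question's `data123.csv`) is `to_csv` in append mode, emitting the header only when the file doesn't exist yet:

```python
import os
import pandas as pd

CSV_PATH = 'data123.csv'  # same output file as in the question

def append_row(name, link, path=CSV_PATH):
    """Append a single (name, link) row; write the header only if the file is new."""
    row = pd.DataFrame([[name, link]], columns=['Name', 'LINK'])
    row.to_csv(path, mode='a', index=False, header=not os.path.exists(path))

# hypothetical usage, one call per scraped page:
append_row('title one', 'https://example.com/1.mp4')
append_row('title two', 'https://example.com/2.mp4')
```

This avoids holding everything in memory, at the cost of one file open per row.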

【Discussion】:
