Extra lines getting added in Pandas `read_csv`

【Question title】: Extra lines getting added in Pandas `read_csv`
【Posted】: 2021-06-24 00:51:03
【Question description】:

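I am loading two different files with pandas `read_csv`. One contains English sentences and the other contains Hindi sentences. In the txt files, both have the same number of sentences. But when I load the files in Google Colab, the number of rows changes, which causes errors later on.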

This is the English sentence file as loaded by pandas.

                                                       eng
0           Give your application an accessibility workout
1                        Accerciser Accessibility Explorer
2           The default plugin layout for the bottom panel
3              The default plugin layout for the top panel
4           A list of plugins that are disabled by default
...                                                    ...
1565202           The programme will be streamed live via:
1565203  Ministry of Education Facebook Page: https://w...
1565204                                UGC YouTube Channel
1565205  UGC Twitter Handle (@ugc_india) : https://twit...
1565206             It would also be broadcast on DD News.

[1565207 rows x 1 columns]

This is the Hindi sentence file as loaded by pandas.

                                                        hi
0          अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें
1                          एक्सेर्साइसर पहुंचनीयता अन्वेषक
2                    निचले पटल के लिए डिफोल्ट प्लग-इन खाका
3                     ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका
4        उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से नि...
...                                                    ...
1573683    कार्यक्रम को इनके जरिए लाइव स्ट्रीम किया जाएगा:
1573684  मानव संसाधन विकास मंत्रालय का फेसबुक पेज: http...
1573685                                यूजीसी यूट्यूब चैनल
1573686  यूजीसी ट्विटर हैंडल (@ugc_india) : https://twi...
1573687  कार्यक्रम को डीडी न्यूज पर भी प्रसारित किया जा...

[1573688 rows x 1 columns]

As we can see, the two DataFrames have different numbers of rows, whereas in the original txt files each file has 1609682 sentences.

【Comments】:

Do the two files contain the same number of lines as sentences? And since you are using `read_csv`, what is the delimiter?

Tags: python python-3.x pandas keras google-colaboratory

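【Solution 1】:

I ran into something similar in my own work a while ago. The problem was inconsistent use of newline and carriage-return characters as line delimiters. It is impossible to know exactly what you are dealing with, because you have not shared any sample data that would show the actual problem. You can open the file in Notepad++ and "view" all the newlines and carriage returns, along with any other non-printable characters.

Keep in mind that "\n" is the newline (line feed) character and "\r" is the carriage return.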
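If Notepad++ is not at hand, the same check can be approximated in Python. The following is only a minimal sketch: the helper name `count_line_endings` is hypothetical, and it assumes the file contents have been read as raw bytes (for example with `open(path, "rb").read()`), so that Python does not translate the line endings first.

```python
def count_line_endings(data: bytes) -> dict:
    """Count each kind of line terminator in raw bytes (hypothetical helper)."""
    crlf = data.count(b"\r\n")
    # Lone CR / LF: all occurrences minus those that belong to a CRLF pair.
    cr = data.count(b"\r") - crlf
    lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "cr": cr, "lf": lf}


# Small in-memory sample instead of a real file, so the sketch runs anywhere.
sample = b"one\r\ntwo\nthree\rfour\n"
print(count_line_endings(sample))  # {'crlf': 1, 'cr': 1, 'lf': 2}
```

If more than one of the three counts is non-zero, the file mixes line-ending styles, which is the situation described above.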

Try something like this.

To remove line breaks with Python, see this example:

# Removes all 3 types of line breaks
string = string.replace("\r","")
string = string.replace("\n","")

https://texthandler.com/info/remove-line-breaks-python/
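Note that removing every line break from the whole file content would also join neighbouring sentences together. A possible variation, sketched here under assumptions rather than taken from the answer, is to normalize all three line-break styles to "\n" and then split into one sentence per line. The function name `split_sentences` is hypothetical, and the sketch assumes the text was read with `open(path, newline="", encoding="utf-8")` so the original line endings are preserved.

```python
def split_sentences(text: str) -> list:
    """Normalize CRLF and lone CR to LF, then split into one sentence per line."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = normalized.split("\n")
    # A trailing newline leaves one empty string at the end; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


# Small in-memory sample mixing all three line-break styles.
sample = "first\r\nsecond\rthird\n"
print(split_sentences(sample))  # ['first', 'second', 'third']
```

The resulting list can be compared against the expected sentence count with `len(...)`, or wrapped in a DataFrame with `pd.DataFrame({"eng": lines})` without going through CSV parsing at all.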

【Discussion】:

Yes, I tried that as well, but it did not work. I also tried utf-8 encoding, and that did not work either.