Python3, Pandas - New Column Value based on Column To Left Data (Dynamic) - Answers

【Question title】: Python3, Pandas - New Column Value based on Column To Left Data (Dynamic)
【Posted】: 2017-07-17 21:04:25
【Question description】:

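I have a spreadsheet with several columns containing survey responses. This spreadsheet will be merged with other spreadsheets, after which I will have repeated rows similar to the ones below. I then need to take all questions that share the same text and calculate the answer percentages across the entire merged document.

Excel data sample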

**Poll Question**                                                     **Poll Responses**
The content was clear and effectively delivered                         37 Total Votes
Strongly Agree                                                          24.30%
Agree                                                                   70.30%
Neutral                                                                 2.70%
Disagree                                                                2.70%
Strongly Disagree                                                       0.00%
The Instructor(s) were engaging and motivating                          37 Total Votes
Strongly Agree                                                          21.60%
Agree                                                                   73.00%
Neutral                                                                 2.70%
Disagree                                                                2.70%
Strongly Disagree                                                       0.00%
I would attend another training session delivered by this Instructor(s) 37 Total Votes
Strongly Agree                                                          21.60%
Agree                                                                   73.00%
Neutral                                                                 5.40%
Disagree                                                                0.00%
Strongly Disagree                                                       0.00%
This was a good format for my training                                  37 Total Votes
Strongly Agree                                                          24.30%
Agree                                                                   62.20%
Neutral                                                                 8.10%
Disagree                                                                2.70%
Strongly Disagree                                                       2.70%
Any comments/suggestions about this training course?                    5 Total Votes

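My method for getting the non-percent vote counts is to convert each percentage back into a number. For example, find and extract 37 from 37 Total Votes, then use the following formula to get the number of users who voted for that particular answer: percent * total / 100.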

So 24.30 * 37 / 100 = 8.99, which rounds to 9, meaning that 9 of the 37 people voted for "Strongly Agree".
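
The conversion above can be sketched in a few lines of Python. The helper name `votes_from_percent` is made up for illustration. It rounds to the nearest whole vote, which is an assumption on my part, but it is the rule that reproduces the counts in the example (70.30% of 37 is 26.011, i.e. 26 votes).

```python
def votes_from_percent(percent: float, total: int) -> int:
    """Convert a percentage of `total` votes back into a whole vote count."""
    # percent * total / 100, rounded to the nearest integer
    return round(percent * total / 100)


# Percentages of the first question in the sample data, 37 total votes.
counts = [votes_from_percent(p, 37) for p in (24.30, 70.30, 2.70, 2.70, 0.00)]
print(counts)  # [9, 26, 1, 1, 0]
```

The five counts add up to 37, which is a quick sanity check that the rounding did not lose or invent a vote for this question.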

Here is an example of the spreadsheet I would like to be able to produce:

**Poll Question**  **Poll Responses**  **non-percent**  **subtotal**
  ...                 37 Total Votes     0               37
  ...                 24.30%             9               37
  ...                 70.30%             26              37
  ...                 2.70%              1               37
  ...                 2.70%              1               37
  ...                 0.00%              0               37

(Note: non-percent and subtotal would be newly created columns.)
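
One possible way to build these two columns without an explicit row loop is sketched below. It is not the method from the accepted answer. It assumes the column names from the sample (`Poll Question`, `Poll Responses`), that question rows look exactly like `<n> Total Votes`, that answer rows look like `<x>%`, and that each sheet starts with a question row. The small DataFrame is stand-in data.

```python
import pandas as pd

# Stand-in for one sheet; values are taken from the sample above.
df = pd.DataFrame({
    "Poll Question": [
        "The content was clear and effectively delivered",
        "Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly Disagree",
        "Any comments/suggestions about this training course?",
    ],
    "Poll Responses": [
        "37 Total Votes", "24.30%", "70.30%", "2.70%", "2.70%", "0.00%",
        "5 Total Votes",
    ],
})

responses = df["Poll Responses"].astype(str).str.strip()

# Question rows carry "<n> Total Votes"; answer rows give NaN here.
totals = responses.str.extract(r"^(\d+)\s+Total\s+Votes$", expand=False)

# Carry each question's total down until the next question row appears.
df["subtotal"] = pd.to_numeric(totals).ffill().astype(int)

# Answer rows carry "<x>%"; question rows give NaN and end up as 0.
percent = pd.to_numeric(responses.str.extract(r"^([\d.]+)%$", expand=False))
df["non-percent"] = (percent * df["subtotal"] / 100).round().fillna(0).astype(int)

print(df[["Poll Responses", "non-percent", "subtotal"]])
```

The forward fill is what makes this "dynamic": it does not care whether a question is followed by five answer rows or none, which matters for comment-box questions.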

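Currently, I take a folder full of .xls files, loop through that folder, and save each file to another folder in .xlsx format. Inside that loop I have added a block marked # NEW test CODE, which is where I am trying to put the logic that does this.

As you can see, I am trying to target the cell and get its value, then use a regex to extract the number from it and add that number to the subtotal column in that row. I then want to keep adding it to every following row until I see a new instance of a row containing x Total Votes.

Here is my current code: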

import numpy as np
import pandas as pd

files = get_files('/excels/', '.xls')
df_array = []

for i, f in enumerate(files, start=1):
    sheet = pd.read_html(f, attrs={'class' : 'reportData'}, flavor='bs4')
    event_id = get_event_id(pd.read_html(f, attrs={'id' : 'eventSummary'}))
    event_title= get_event_title(pd.read_html(f, attrs={'id' : 'eventSummary'}))
    filename = event_id + '.xlsx'
    rel_path = 'xlsx/' + filename
    writer = pd.ExcelWriter(rel_path)

    for df in sheet:
        # NEW test CODE
        q_total = 0
        df.columns = df.columns.str.strip()
        if df[df['Poll Responses'].str.contains("Total Votes")]:
        # if df['Poll Responses'].str.contains("Total Votes"):
            q_total = re.findall(r'.+?(?=\sTotal\sVotes)', df['Poll Responses'].str.contains("Total Votes"))[0]
            print(q_total)
        # df['Question Total'] = np.where(df['Poll Responses'].str.contains("Total Votes"), 'yes', 'no')
        # END NEW test Code
        df.insert(0, 'Event ID', event_id)
        df.insert(1, 'Event Title', event_title)
        df.to_excel(writer,'sheet')
        writer.save()

    # progress of entire list
    if i <= len(files):
        print('\r{:*^10}{:.0f}%'.format('Converting: ', i/len(files)*100), end='')

print('\n')
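
A note on the # NEW test CODE block above: `df['Poll Responses'].str.contains(...)` returns a boolean Series with one value per row. Neither that Series nor the DataFrame filtered by it can be used directly as an `if` condition (pandas raises a ValueError about an ambiguous truth value), and `re.findall` expects a single string rather than a Series. A minimal sketch of the distinction, using made-up sample rows:

```python
import re

import pandas as pd

responses = pd.Series(["37 Total Votes", "24.30%", "70.30%"])

# One boolean per row, not a single True/False.
mask = responses.str.contains("Total Votes")

error_message = ""
try:
    if mask:  # ambiguous: which row should decide?
        pass
except ValueError as exc:
    error_message = str(exc)

# Reduce the mask explicitly, then apply the regex to individual strings.
has_question_row = bool(mask.any())
totals = [int(re.search(r"\d+", text).group()) for text in responses[mask]]
print(has_question_row, totals)
```

Note that the posted script also uses `re` without importing it, so `import re` would be needed at the top.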

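TL;DR: This may look complicated, but if I can get two new columns, one containing the total number of votes for a question and one containing the number of votes (not the percentage) for an answer, then I can do some VLOOKUP magic on the merged document. Any help or suggestions on approach would be greatly appreciated. Thanks!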

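【Question discussion】:

Are the answers to each question always the same? You could read each sheet into a DataFrame and then add them together. Pandas takes care of the rest.
Unfortunately, no. There can be questions such as a "comment box", which will not sit 5 rows apart from the other questions. Or the user may choose not to use Likert-style questions.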

Tags: python excel python-3.x pandas

【Solution 1】:

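I solved this problem and will post the pseudocode below:

I loop through each sheet. Within that loop, I iterate over each row using for n, row in enumerate(df.itertuples(), 1):.
I get the value of the field that may contain the poll response: poll_response = str(row[3])
Using an if / else, I check whether poll_response contains the text "Total Votes". If it does, the row must be a question; otherwise it must be a row with an answer.
In the if branch for a question, I get the cells containing the data I need. I then have a function that compares the question text with the question text of every object in the array. If there is a match, I simply update that object's fields; otherwise I create a new question object.
In the else branch the row is an answer row, so I use the question text to find the object in the array and update/add the answer or data.
This process loops over all the rows in every spreadsheet, and at the end my array is full of unique question objects.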
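
The steps above might look roughly like the sketch below. The answer does not show its actual code, so the `Question` class, the `collect_questions` function, the use of a dict instead of an array, and the choice to "update" by adding vote counts together are all assumptions made for illustration. Columns are selected by name rather than by position (`row[3]`), since the position depends on which columns have been inserted.

```python
import re
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class Question:
    """Hypothetical container for one unique question across all sheets."""
    text: str
    total_votes: int = 0
    answers: dict = field(default_factory=dict)  # answer text -> vote count


def collect_questions(sheets):
    """Merge the rows of several sheets into one Question per question text."""
    questions = {}
    for df in sheets:
        current, sheet_total = None, 0
        rows = df[["Poll Question", "Poll Responses"]].itertuples(index=False, name=None)
        for label, response in rows:
            response = str(response).strip()
            match = re.search(r"(\d+)\s+Total\s+Votes", response)
            if match:
                # Question row: reuse the existing object or create a new one.
                current = questions.setdefault(label, Question(label))
                sheet_total = int(match.group(1))
                current.total_votes += sheet_total
            elif current is not None and response.endswith("%"):
                # Answer row: convert the percentage back to a vote count.
                votes = round(float(response.rstrip("%")) * sheet_total / 100)
                current.answers[label] = current.answers.get(label, 0) + votes
    return questions


# Two identical stand-in sheets, as if the same poll ran at two events.
sheet = pd.DataFrame({
    "Poll Question": ["The content was clear and effectively delivered",
                      "Strongly Agree", "Agree", "Neutral", "Disagree",
                      "Strongly Disagree"],
    "Poll Responses": ["37 Total Votes", "24.30%", "70.30%", "2.70%",
                       "2.70%", "0.00%"],
})
questions = collect_questions([sheet, sheet])
q = questions["The content was clear and effectively delivered"]
print(q.total_votes, q.answers)
```

With the counts merged per question, the percentage for the whole merged document is just `votes / q.total_votes * 100` for each answer.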
