【Question title】: How to randomly create a preference dataframe from a dataframe of choices?
【Posted】: 2020-04-03 15:16:20
【Question】:

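I have a dataframe of votes and I want to create preferences from it. For example, here is the number of votes for each party P1, P2, P3 in each municipality comm1, comm2, ...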

    Comm    Votes   P1      P2      P3
0   comm1   1315.0  2.0     424.0   572.0
1   comm2   4682.0  117.0   2053.0  1584.0
2   comm3   2397.0  2.0     40.0    192.0
3   comm4   931.0   2.0     12.0    345.0
4   comm5   842.0   47.0    209.0   76.0
... ... ... ... ... ...
1524    comm1525    10477.0 13.0    673.0   333.0
1525    comm1526    2674.0  1.0 55.0    194.0
1526    comm1527    1691.0  331.0   29.0    78.0

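These election results are enough for the first-past-the-post voting system, but I want to test the alternative election model. So I need to know the preferences for each party.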

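Since I don't know the preferences, I want to generate them with random numbers. I assume voters are honest. For example, for party "P1" in town "comm1", we know that 2 people voted for it and that there are 1315 voters. I need to create preferences to see whether people would put it as their first, second or third choice. That is, for each party: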

     Comm      Votes    P1_1        P1_2    P1_3    P2_1    P2_2    P2_3    P3_1     P3_2   P3_3
0    comm1      1315.0  2.0         1011.0  303.0   424.0   881.0   10.0    570.0    1.0    1.0
... ... ... ... ... ...
1526 comm1527   1691.0  331.0   1300.0  60.0    299.0   22.0    10.0    ...

So I have to do this:

# for each column in parties I create (parties -1) other columns
# I rename them all Party_i. The former 1 becomes Party_1.
# In the other columns I put a random number. 
# For a given line, the sum of all Party_i for i in [1, parties] must be equal to Votes
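The two constraints in the steps above (Party_1 equals the known vote count, and the Party_i columns of one party sum to Votes) can be checked mechanically. Here is a minimal sketch of such a check; the function name `check_preferences` and the tiny frames are made up for illustration, and the column naming `Party_i` follows the description above.

```python
import pandas as pd

def check_preferences(original, prefs, parties):
    """Return True if every party's rank columns satisfy both constraints."""
    n_ranks = len(parties)
    for party in parties:
        rank_cols = [f"{party}_{i}" for i in range(1, n_ranks + 1)]
        # Constraint 1: the first-choice column equals the known vote count.
        if not (prefs[f"{party}_1"] == original[party]).all():
            return False
        # Constraint 2: all ranks of one party sum to the number of voters.
        if not (prefs[rank_cols].sum(axis=1) == original["Votes"]).all():
            return False
    return True

# Hypothetical one-row example with two parties.
original = pd.DataFrame({"Comm": ["c1"], "Votes": [10], "P1": [2], "P2": [3]})
good = pd.DataFrame({"P1_1": [2], "P1_2": [8], "P2_1": [3], "P2_2": [7]})
bad = pd.DataFrame({"P1_1": [2], "P1_2": [5], "P2_1": [3], "P2_2": [7]})
good_ok = check_preferences(original, good, ["P1", "P2"])
bad_ok = check_preferences(original, bad, ["P1", "P2"])
```

Running a check like this on any generated result makes it easy to see which of the two constraints a given attempt breaks.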

Here is what I have tried so far:

parties = [item for item in df.columns if item not in ['Comm','Votes']]

for index, row in df_test.iterrows():
    # In the other columns I put a random number. 
    for party in parties:
        # for each column in parties I create (parties -1) other columns
        for i in range(0,len(parties) -1):
            print(random.randrange(0, row['Votes']))
            # I rename them all Party_i. The former 1 becomes Party_1. 
            row["{party}_{preference}".format(party = party,preference = i)] = random.randrange(0, row['Votes']) if (row[party] < row['Votes']) else 0 # false because the sum of the votes isn't = to df['Votes']

The result is:

     Comm      Votes    ... P1_1    P1_2   P1_3    P2_1    P2_2    P2_3    P3_1     P3_2   P3_3
0    comm1      1315.0  ... 1003    460    1588    1284    1482    1613    1429   345
1    comm2      1691.0  ... 1003    460    1588    1284    1482    1613    ...  
...

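But:

The numbers are the same in every row.
The values in column Pi_1 are not equal to the values in column Pi (Pi being a given party).
The sum of Pi_j over all j in [0, parties] is not equal to the number in the Votes column.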
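One thing that may contribute to these symptoms: `iterrows()` yields each row as a separate Series, so assigning a new key to `row` does not add a column to the dataframe. The sketch below, on a made-up two-row frame, contrasts that with writing through `df.at`; the column name `P1_2` and the seeded `random.Random` are illustrative assumptions.

```python
import random
import pandas as pd

df = pd.DataFrame({"Comm": ["comm1", "comm2"],
                   "Votes": [1315, 4682],
                   "P1": [2, 117]})

# Writing into `row` only changes the temporary Series, not `df`.
for index, row in df.iterrows():
    row["P1_2"] = 0
columns_after_loop = list(df.columns)  # still Comm, Votes, P1

# Creating the column first and writing through df.at does update `df`.
rng = random.Random(0)  # seeded only to make the sketch reproducible
df["P1_2"] = 0
for index in df.index:
    remaining = int(df.at[index, "Votes"] - df.at[index, "P1"])
    df.at[index, "P1_2"] = rng.randint(0, remaining)
```

Each row now gets its own draw, bounded by the votes that did not go to P1 as a first choice.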

Update

I tried Antihead's answer with his own data and it works well. But when applied to my own data, it doesn't. It leaves me with an empty dataframe:

import collections

def fill_cells(cell):
    v_max = cell['Votes']
    all_dict = {}
    #iterate over parties.copy()
    for p in parties:
        tmp_l = parties.copy()
        tmp_l.remove(p)
        # sample new data with equal choices
        sampled = np.random.choice(tmp_l, int(v_max-cell[p]))
        # transform into dictionary
        c_sampled = dict(collections.Counter(sampled))
        c_sampled.update({p:cell[p]})
        # batch update of the dictionary keys
        all_dict.update(
            dict(zip([p+'_%s' %k[1] for k in c_sampled.keys()], c_sampled.values()))
            )
    return pd.Series(all_dict)

Indeed, with the following dataframe:

    Comm    Votes   LPC     CPC     BQ
0   comm1   1315.0  2.0     424.0   572.0
1   comm2   4682.0  117.0   2053.0  1584.0
2   comm3   2397.0  2.0     40.0    192.0
3   comm4   931.0   2.0     12.0    345.0
4   comm5   842.0   47.0    209.0   76.0
...     ...     ...     ...     ...     ...
1522    comm1523    23808.0     1588.0  4458.0  13147.0
1523    comm1524    639.0   40.0    126.0   40.0
1524    comm1525    10477.0     13.0    673.0   333.0
1525    comm1526    2674.0  1.0     55.0    194.0
1526    comm1527    1691.0  331.0   29.0    78.0

I get an empty dataframe:

【Comments】:

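Could you rephrase the question so it is easier to understand? Do you want to compute the chance that a person in an unknown community votes for P1, P2, P3?
@Antihead Sure. For example, if we take comm1 (a given row) and a party P1, I just want a random number in each cell P1_i for i in [1, number of parties], such that they sum to the number of votes and P1_1 equals P1. Does that make sense?
Do I understand correctly: for each cell, you want to divide the remaining votes, |Votes| - P_i, among P_{j}, j an element of [1,2,3] where i != j? (randomly)
@Antihead, yes! P_{j} is random and denotes the number of votes for party P.
tmp_l = parties does not copy your list, it only references it. You need to copy the list with tmp_l = parties.copy() instead of tmp_l = parties.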
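The last comment is about list aliasing in Python. A short sketch of the difference, using the party names from the question's second dataframe (the variable names `alias` and `others` are illustrative):

```python
parties = ["LPC", "CPC", "BQ"]

# Plain assignment binds a second name to the same list object.
alias = parties
alias.remove("LPC")
after_alias = list(parties)  # the original list lost "LPC" too

# .copy() creates an independent list, so the original stays intact.
parties = ["LPC", "CPC", "BQ"]
others = parties.copy()
others.remove("LPC")
```

Inside a loop over `parties`, the aliased version shrinks the very list being iterated, which can make the loop skip parties.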

Tags: python python-3.x dataframe random

【Solution 1】:

Does this work:

import numpy as np
import pandas as pd

# data
columns = ['Comm', 'Votes', 'P1', 'P2', 'P3']
data =[['comm1', 1315.0, 2.0, 424.0, 572.0],
['comm2', 4682.0, 117.0, 2053.0, 1584.0],
['comm3', 2397.0, 2.0, 40.0, 192.0],
['comm4', 931.0, 2.0, 12.0, 345.0],
['comm5', 842.0, 47.0, 209.0, 76.0],
['comm1525', 10477.0, 13.0, 673.0, 333.0],
['comm1526', 2674.0, 1.0, 55.0, 194.0],
['comm1527', 1691.0, 331.0, 29.0, 78.0]]


df =pd.DataFrame(data=data, columns=columns)

import collections

def fill_cells(cell):
    v_max = cell['Votes']
    all_dict = {}
    #iterate over parties
    for p in ['P1', 'P2', 'P3']:
        tmp_l = ['P1', 'P2', 'P3']
        tmp_l.remove(p)
        # sample new data with equal choices
        sampled = np.random.choice(tmp_l, int(v_max-cell[p]))
        # transform into dictionary
        c_sampled = dict(collections.Counter(sampled))
        c_sampled.update({p:cell[p]})
        # batch update of the dictionary keys
        all_dict.update(
            dict(zip([p+'_%s' %k[1] for k in c_sampled.keys()], c_sampled.values()))
            )
    return pd.Series(all_dict)
# get back a data frame
df.apply(fill_cells, axis=1)

If you need to merge the result back into the dataframe, do the following:


new_df = df.apply(fill_cells, axis=1)
pd.concat([df, new_df], axis=1)

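【Comments】:

Unfortunately not :( I get ValueError: ('negative dimensions are not allowed', 'occurred at index 0') and KeyError: ('PI_3', 'occurred at index 0')
Are you using python3? Try using the dataframe I defined in my answer first and see if the code runs. (I ran it without problems...)
Yes, I'm using python3 and I tried your dataframe. I have updated my question with my attempt at your answer.
Thanks a lot, your code works now! But the column names are P1_P1, P1_P2, P1_P3, P2_P1, ... and I would like to create P1_1, P1_2, ... to express preferences. For example, P1_2 would be the column with the number of people who picked party P1 as their second choice. I'm currently adapting your code for this purpose, but I haven't managed it yet.
I adjusted the code accordingly: with p+'_%s' %k[1] the labels are now Pi_j instead of Pi_Pj.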
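Note that `k[1]` takes the second character of the party name, so it produces a digit only for names shaped like P1, P2, P3. The sketch below shows what it yields for the names LPC, CPC, BQ from the question's second dataframe, next to a position-based label; both helper functions are hypothetical and exist only to illustrate the string handling.

```python
parties = ["LPC", "CPC", "BQ"]

def label_by_second_char(party, other):
    # Mirrors p+'_%s' %k[1]: uses the second character of the other name.
    return party + "_%s" % other[1]

def label_by_position(party, other, parties):
    # Uses the 1-based position of the other name in the party list.
    return "%s_%d" % (party, parties.index(other) + 1)

second_char_labels = [label_by_second_char("LPC", k) for k in parties]
position_labels = [label_by_position("LPC", k, parties) for k in parties]
p_style_labels = [label_by_second_char("P1", k) for k in ["P1", "P2", "P3"]]
```

With LPC and CPC the second character is the same letter, so two labels collide and one count would overwrite the other when used as dictionary keys. The position-based variant avoids the collision, although the number then reflects the order of the party list rather than a preference rank.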

【Solution 2】:

Based on Antihead's answer and for the following dataset:

    Comm    Votes   LPC     CPC     BQ
0   comm1   1315.0  2.0     424.0   572.0
1   comm2   4682.0  117.0   2053.0  1584.0
2   comm3   2397.0  2.0     40.0    192.0
3   comm4   931.0   2.0     12.0    345.0
4   comm5   842.0   47.0    209.0   76.0
...     ...     ...     ...     ...     ...
1522    comm1523    23808.0     1588.0  4458.0  13147.0
1523    comm1524    639.0   40.0    126.0   40.0
1524    comm1525    10477.0     13.0    673.0   333.0
1525    comm1526    2674.0  1.0     55.0    194.0
1526    comm1527    1691.0  331.0   29.0    78.0

I tried:

def fill_cells(cell):
    votes_max = cell['Votes']
    all_dict = {}
    #iterate over parties
    parties_temp = parties.copy()
    for p in parties_temp:
        preferences = ['1','2','3']
        for preference in preferences:
            preferences.remove(preference)
            # sample new data with equal choices
            sampled = np.random.choice(preferences, int(votes_max-cell[p])) 
            # transform into dictionary
            c_sampled = dict(collections.Counter(sampled))
            c_sampled.update({p:cell[p]})
            c_sampled['1'] = c_sampled.pop(p)
            # batch update of the dictionary keys
            all_dict.update(
                dict(zip([p+'_%s' %k for k in c_sampled.keys()],c_sampled.values()))
            )
    return pd.Series(all_dict)

    LPC_2   LPC_3   LPC_1   CPC_2   CPC_3   CPC_1   BQ_2    BQ_3    BQ_1
    0   891.0   487.0   424.0   743.0   373.0   572.0   1313.0  683.0   2.0
    1   2629.0  1342.0  2053.0  3098.0  1603.0  1584.0  4565.0  2301.0  117.0
    2   2357.0  1186.0  40.0    2205.0  1047.0  192.0   2395.0  1171.0  2.0
    3   919.0   451.0   12.0    586.0   288.0   345.0   929.0   455.0   2.0
    4   633.0   309.0   209.0   766.0   399.0   76.0    795.0   396.0   47.0
    ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
    1520    1088.0  536.0   42.0    970.0   462.0   160.0   1117.0  540.0   13.0
    1521    4742.0  2341.0  219.0   3655.0  1865.0  1306.0  4705.0  2375.0  256.0
    1522    19350.0     9733.0  4458.0  10661.0     5352.0  13147.0     22220.0     11100.0     1588.0
    1523    513.0   264.0   126.0   599.0   267.0   40.0    599.0   306.0   40.0
    1524    9804.0  4885.0  673.0   10144.0     5012.0  333.0   10464.0     5162.0  13.0

Almost there. I would prefer to generate this dynamically rather than hard-code ['1','2','3'].
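One possible way to avoid the hard-coded list is to derive the ranks from `len(parties)` and split the non-first-choice votes with a multinomial draw. This is a sketch under stated assumptions, not the method used in either answer: the function name `random_preferences`, the uniform split over ranks 2..n, and the three-row sample frame are all illustrative, and it assumes at least two parties and Votes >= each party's count. It only enforces the two constraints from the question (Party_1 equals the known count, ranks sum to Votes) and does not try to keep rankings consistent across parties.

```python
import numpy as np
import pandas as pd

def random_preferences(row, parties, rng):
    """Build Party_1..Party_n counts for one row of the votes dataframe."""
    n_ranks = len(parties)
    out = {}
    for party in parties:
        first = int(row[party])
        remaining = int(row["Votes"]) - first
        # Rank 1 is the known first-choice count.
        out[f"{party}_1"] = first
        # Split the remaining voters uniformly at random over ranks 2..n.
        pvals = [1.0 / (n_ranks - 1)] * (n_ranks - 1)
        split = rng.multinomial(remaining, pvals)
        for rank, count in zip(range(2, n_ranks + 1), split):
            out[f"{party}_{rank}"] = int(count)
    return pd.Series(out)

df = pd.DataFrame({"Comm": ["comm1", "comm2", "comm3"],
                   "Votes": [1315.0, 4682.0, 2397.0],
                   "LPC": [2.0, 117.0, 2.0],
                   "CPC": [424.0, 2053.0, 40.0],
                   "BQ": [572.0, 1584.0, 192.0]})
parties = [c for c in df.columns if c not in ("Comm", "Votes")]

rng = np.random.default_rng(0)  # seeded only for reproducibility
prefs = df.apply(random_preferences, axis=1, parties=parties, rng=rng)
result = pd.concat([df, prefs], axis=1)
```

Because the rank labels come from `range(1, n_ranks + 1)`, adding or removing a party column changes the number of generated columns without touching the function.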

【Comments】: