【Title】: PySpark ChiSqSelector p-values and test statistics
【Posted】: 2018-06-22 09:04:55
【Question】:

I'm using PySpark's pyspark.ml.feature.ChiSqSelector to perform feature selection. apps is a column of sparse vectors encoding whether a particular name (a machine) has a particular application installed; in total there are 21,615 possible applications.

After fitting the ChiSqSelector and transforming new data, I'm confused about what selected_apps now represents. The documentation is not helpful here. I have a few questions:

1) How can I get the chi-squared test statistic and p-value associated with each of the 21,615 input apps? They don't appear to be directly accessible via dir(selector).

2) Why does selected_apps show different apps from row to row? My hunch is that the machine in the second row below does not have apps 0, 1, 2, etc., so what shows up in selected_apps for that row are the top 50 apps it does have, ranked by p-value. This API seems to work quite differently from scikit-learn's SelectKBest(chi2), which simply returns the top k most relevant features regardless of whether a particular machine has a '1' for that feature.

3) How can I override the default numTopFeatures=50 setting? This is mostly related to question 1), and to selecting features using only p-values. There doesn't seem to be a numTopFeatures=-1-type option to essentially "forget" this parameter.

>>> selector = ChiSqSelector(
...     featuresCol='apps',
...     outputCol='selected_apps',
...     labelCol='multiple_event',
...     fpr=0.05
... )
>>> result = selector.fit(df).transform(df)                                                                
>>> print(result.show())
+---------------+-----------+--------------+--------------------+--------------------+
|           name|total_event|multiple_event|                apps|       selected_apps|
+---------------+-----------+--------------+--------------------+--------------------+
|000000000000021|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000022|          0|             0|(21615,[3,6,7,8,9...|(50,[3,6,7,8,9,11...|
|000000000000023|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000024|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000025|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000026|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000027|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000028|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000029|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000030|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000031|          0|             0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
|000000000000032|          0|             0|(21615,[6,7,8,9,1...|(50,[6,7,8,9,13,1...|
|000000000000033|          0|             0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
|000000000000034|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000035|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000036|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000037|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000038|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000039|          0|             0|(21615,[0,1,2,3,6...|(50,[0,1,2,3,6,7,...|
|000000000000040|          0|             0|(21615,[0,1,2,3,4...|(50,[0,1,2,3,4,6,...|
+---------------+-----------+--------------+--------------------+--------------------+

【Comments】:

    Tags: python pyspark chi-squared


    【Solution 1】:

    I figured it out. Here's the solution:

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.stat import Statistics
    
    # Convert everything to a LabeledPoint object, the main consumption
    # data structure for most of mllib
    to_labeled_point = lambda x: LabeledPoint(x[0], Vectors.dense(x[1].toArray()))
    
    obs = (
        df
        .select('multiple_event', 'apps')
        .rdd
        .map(to_labeled_point)
    )
    
    # The contingency table is constructed from an RDD of LabeledPoint and used to conduct
    # the independence test. Returns an array containing the ChiSquaredTestResult for every feature
    # against the label.
    feature_test_results = Statistics.chiSqTest(obs)
    
    data = []
    
    for idx, result in enumerate(feature_test_results):
        row = {
            'feature_index': idx,
            'p_value': result.pValue,
            'statistic': result.statistic,
            'degrees_of_freedom': result.degreesOfFreedom
        }
        data.append(row)
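
The `data` list built above can then be thresholded and ranked by p-value directly, sidestepping numTopFeatures entirely. A minimal sketch with a toy stand-in for `data`, since the real values depend on the dataset:

```python
# Toy stand-in for the `data` list built above; values are illustrative.
data = [
    {'feature_index': 0, 'p_value': 0.0012, 'statistic': 10.5, 'degrees_of_freedom': 1},
    {'feature_index': 1, 'p_value': 0.4400, 'statistic': 0.6,  'degrees_of_freedom': 1},
    {'feature_index': 2, 'p_value': 0.0300, 'statistic': 4.7,  'degrees_of_freedom': 1},
]

# Keep only features significant at alpha = 0.05, most significant first.
alpha = 0.05
significant = sorted(
    (row for row in data if row['p_value'] < alpha),
    key=lambda row: row['p_value'],
)
selected_indices = [row['feature_index'] for row in significant]
print(selected_indices)  # [0, 2]
```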
    

    【Discussion】:
