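【Question title】: Decision Tree using data mining technique to determine whether a tumor is benign or malignant
【Posted】: 2022-12-25 17:01:53
【Question description】:

I have to read patient data from a .csv file and, based on the data read for each patient, use a decision tree to determine whether the tumor is benign or malignant.

I am really struggling with how to get started. So far I have written the code that reads the data from the .csv file and stores it in a vector, shown below and spread across several header and cpp files.

From what I have gathered, I can create a parent decision class, and each attribute I need to process would then be a child class. I am not sure whether that makes sense; please let me know.

Below you will find the attributes I have to process, along with a graphical tree showing how to determine whether a tumor is benign or malignant; my code needs to be based on it. I will also include a small sample of the .csv file.

Please give me some guidance on how to do this. Pointer notation is what I struggle with most. Any guidance would be greatly appreciated.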

CSVLine.h

#ifndef CSVLINE_H
#define CSVLINE_H

#include <string>
#include <sstream>
#include <vector>

using namespace std;

class CSVLine
{
private:
    vector<string> data;

public:
    CSVLine() {}
    CSVLine(const CSVLine& other)
    {
        data = other.data;
    }

    CSVLine& operator = (const CSVLine& other)
    {
        data = other.data;
        return *this;  // the original fell off the end of a non-void function
    }
    ~CSVLine() {}

    void parse(string line, char delimiter = ',');
    string getString(int columnNumber);
    int getInt(int columnNumber);
};

#endif


CSVLine.cpp

#include "CSVLine.h"

#include <cstdlib>  // atoi

void CSVLine::parse(string line, char delimiter)
{
    stringstream inLine(line);

    string tempColumn = "";

    while (getline(inLine, tempColumn, delimiter))
    {
        data.push_back(tempColumn);
    }
}

string CSVLine::getString(int columnNumber)
{
    return data[columnNumber];
}

int CSVLine::getInt(int columnNumber)
{
    return atoi(data[columnNumber].c_str());
}


CSVReader.h

#ifndef CSVREADER_H
#define CSVREADER_H

#include <vector>
#include <fstream>
#include <iostream>

#include "CSVLine.h"

using namespace std;

class CSVReader
{
public:
    CSVReader() {}

    vector<CSVLine> read(string fileName);
};

#endif


CSVReader.cpp

#include "CSVReader.h"

vector<CSVLine> CSVReader::read(string fileName)
{
    ifstream inputFile;
    vector<CSVLine> lines;
    inputFile.open(fileName.c_str());
    string line = "";

    while (getline(inputFile, line))
    {
        CSVLine csvLine;
        csvLine.parse(line);
        lines.push_back(csvLine);
    }

    return lines;
}

【Question comments】:

Tags: c++ pointers nodes data-mining decision-tree

【Solution 1】:

Here is what I would do.

First, I would translate the feature table into a higher-order macro:

#define FOREACH_FEATURE(OP) \
  OP(1, SampleCodeNumber, int, -1) \
  OP(2, ClumpThickness, int, -1) \
  OP(3, UniformityOfCellSize, int, -1)
// Fill in the rest of the table of features here yourself
// (every line of the macro except the last must end with a backslash)

Then I would use this macro to generate a struct that holds all of a patient's features, like this:

struct PatientData {
#define DECL_FEATURE(index, name, type, init) type name = init;
  FOREACH_FEATURE(DECL_FEATURE)
#undef DECL_FEATURE

  PatientData() {}
  
  PatientData(CSVLine& src) {
#define READ_FEATURE(index, name, type, init) name = src.getInt(index-1);
    FOREACH_FEATURE(READ_FEATURE)
#undef READ_FEATURE
  }
};

Then I would construct a PatientData object from a CSVLine:

CSVLine line = ...;
PatientData patientData(line);

Then I would implement the decision tree as nested if statements on the patientData object:

if (patientData.UniformityOfCellSize <= 2) {
  // ...
} else {
  // ...
}
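
To make the nested-if idea concrete, here is a minimal sketch of a complete classification function. The small PatientData struct is a hand-written stand-in for the macro-generated one so that the block compiles on its own, and BareNuclei, Diagnosis, and classifyNested are hypothetical names. Only the root split (UniformityOfCellSize <= 2) comes from this answer; every other threshold and leaf label is a placeholder to be replaced with the values from your own tree.

```cpp
// Hand-written stand-in for the macro-generated PatientData struct.
struct PatientData {
  int ClumpThickness = -1;
  int UniformityOfCellSize = -1;
  int BareNuclei = -1;  // hypothetical field; not in the part of the feature table shown above
};

enum class Diagnosis { Benign, Malignant };

// Each level of nesting corresponds to one level of the decision tree.
// Only the root test is taken from the answer; the rest are placeholders.
Diagnosis classifyNested(const PatientData& p) {
  if (p.UniformityOfCellSize <= 2) {
    // Left subtree: placeholder test on BareNuclei.
    if (p.BareNuclei <= 3) {
      return Diagnosis::Benign;
    }
    return Diagnosis::Malignant;
  }
  // Right subtree: placeholder test on ClumpThickness.
  if (p.ClumpThickness <= 6) {
    return Diagnosis::Benign;
  }
  return Diagnosis::Malignant;
}
```

Returning from each leaf directly keeps the function free of mutable state, and the compiler can warn if a path falls through without a result.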

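This will get you started, but you will need to complete and possibly extend the FOREACH_FEATURE macro and implement the decision tree...

Node and pointer approach

If you would rather not implement your tree as above, discard the code above and do the following instead. First, include some headers we need and implement a Feature class: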

#include <memory>
#include <functional>

struct Feature {
  int index1;
  int apply(CSVLine& line) const {return line.getInt(index1-1);}
};

Then translate the feature table into Feature objects like this:

Feature SampleCodeNumber{1};
Feature ClumpThickness{2};
Feature UniformityOfCellSize{3};
// Fill in the rest yourself

We will use std::function<bool(CSVLine&)> to decide which branch of the tree to take:

typedef std::function<bool(CSVLine&)> BranchCondition;

Overloading the comparison operators for a Feature and a double so that they return a BranchCondition lets us express BranchConditions neatly:

#define DEF_FEATURE_OP(op) BranchCondition operator op (Feature f, double x) {return [f, x](CSVLine& line) {return f.apply(line) op x;};}
DEF_FEATURE_OP(<)
DEF_FEATURE_OP(<=)
DEF_FEATURE_OP(>)
DEF_FEATURE_OP(>=)
#undef DEF_FEATURE_OP

We also need to declare the return value of the classification:

enum class Severity {
  Benign, Malign
};

As the base class for the decision tree, we declare

class PatientClassifier {
public:
  virtual Severity classify(CSVLine& p) const = 0;
  virtual ~PatientClassifier() {}
};

and implement it for the simple case of a constant value, together with a function severity that constructs it:

class ConstantClassifier : public PatientClassifier {
public:
  ConstantClassifier(Severity v) : _value(v) {}
  Severity classify(CSVLine&) const override {return _value;}
private:
  Severity _value;
};

std::shared_ptr<PatientClassifier> severity(Severity v) {
  return std::make_shared<ConstantClassifier>(v);
}

and for the branching case, together with a function branch:

class BranchingClassifier : public PatientClassifier {
public:
  BranchingClassifier(
    BranchCondition f,
    const std::shared_ptr<PatientClassifier>& onTrue,
    const std::shared_ptr<PatientClassifier>& onFalse)
    : _f(f), _onTrue(onTrue), _onFalse(onFalse) {}
  
  Severity classify(CSVLine& p) const override {
    return (_f(p)? _onTrue : _onFalse)->classify(p);
  }
private:
  BranchCondition _f;
  std::shared_ptr<PatientClassifier> _onTrue;
  std::shared_ptr<PatientClassifier> _onFalse;
};

std::shared_ptr<PatientClassifier> branch(
  BranchCondition f,
  const std::shared_ptr<PatientClassifier>& onTrue,
  const std::shared_ptr<PatientClassifier>& onFalse) {
  return std::make_shared<BranchingClassifier>(f, onTrue, onFalse);
}

Then we build the tree like this:

  auto decisionTree = branch(
    UniformityOfCellSize <= 2.0,
    severity(Severity::Benign),
    severity(Severity::Malign));

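  CSVLine line;
  // Illustrative row (SampleCodeNumber, ClumpThickness, UniformityOfCellSize);
  // classifying an empty CSVLine would index past the end of its data.
  line.parse("1,5,1");
  auto result = decisionTree->classify(line);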
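
Deeper trees are built by passing another branch(...) call as a child. The sketch below is a compact, self-contained restatement of the classes above that shows a two-level tree. To keep it short it assumes a Row alias (a vector of already-parsed ints) in place of CSVLine and only defines operator<=. makeTwoLevelTree is a hypothetical helper, and the ClumpThickness threshold and the leaf labels are placeholders rather than values from the question's tree.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using Row = std::vector<int>;  // stand-in for CSVLine: one row of parsed columns

struct Feature {
  int index1;  // 1-based column number, as in the answer
  int apply(const Row& row) const { return row.at(index1 - 1); }
};

using BranchCondition = std::function<bool(const Row&)>;

BranchCondition operator<=(Feature f, double x) {
  return [f, x](const Row& row) { return f.apply(row) <= x; };
}

enum class Severity { Benign, Malign };

class PatientClassifier {
public:
  virtual Severity classify(const Row& row) const = 0;
  virtual ~PatientClassifier() {}
};

using ClassifierPtr = std::shared_ptr<PatientClassifier>;

class ConstantClassifier : public PatientClassifier {
public:
  explicit ConstantClassifier(Severity v) : _value(v) {}
  Severity classify(const Row&) const override { return _value; }
private:
  Severity _value;
};

class BranchingClassifier : public PatientClassifier {
public:
  BranchingClassifier(BranchCondition f, ClassifierPtr onTrue, ClassifierPtr onFalse)
    : _f(std::move(f)), _onTrue(std::move(onTrue)), _onFalse(std::move(onFalse)) {}
  Severity classify(const Row& row) const override {
    // Evaluate the condition, then delegate to the matching child.
    return (_f(row) ? _onTrue : _onFalse)->classify(row);
  }
private:
  BranchCondition _f;
  ClassifierPtr _onTrue;
  ClassifierPtr _onFalse;
};

ClassifierPtr severity(Severity v) { return std::make_shared<ConstantClassifier>(v); }

ClassifierPtr branch(BranchCondition f, ClassifierPtr onTrue, ClassifierPtr onFalse) {
  return std::make_shared<BranchingClassifier>(std::move(f), std::move(onTrue), std::move(onFalse));
}

// Two-level tree: the false side of the root is itself a branch.
ClassifierPtr makeTwoLevelTree() {
  const Feature ClumpThickness{2};
  const Feature UniformityOfCellSize{3};
  return branch(
    UniformityOfCellSize <= 2.0,      // root split from the answer
    severity(Severity::Benign),       // placeholder leaf
    branch(
      ClumpThickness <= 6.0,          // placeholder threshold
      severity(Severity::Benign),     // placeholder leaf
      severity(Severity::Malign)));   // placeholder leaf
}
```

Because every child is just a pointer to a PatientClassifier, a subtree can be built once, stored in a variable, and reused in several places of a larger tree.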

Notes: CSVLine does not need a custom copy constructor or assignment operator. The getInt method can be marked const.
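
As a sketch of what those notes amount to, CSVLine could look like the version below. It goes slightly beyond the notes, and these parts are my assumptions rather than the asker's requirements: columns are indexed with std::size_t, at() is used so a bad column number throws instead of reading out of bounds, parse() clears earlier columns, and std::stoi replaces atoi (so a non-numeric field throws std::invalid_argument instead of silently becoming 0).

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

class CSVLine {
public:
  // No user-declared copy constructor, assignment operator, or destructor:
  // the compiler-generated ones already copy and destroy the vector correctly.

  void parse(const std::string& line, char delimiter = ',') {
    data.clear();  // assumption: re-parsing replaces earlier columns
    std::stringstream inLine(line);
    std::string column;
    while (std::getline(inLine, column, delimiter)) {
      data.push_back(column);
    }
  }

  // const: reading a column does not modify the line.
  // at() throws std::out_of_range for an invalid column number.
  const std::string& getString(std::size_t columnNumber) const {
    return data.at(columnNumber);
  }

  // std::stoi throws std::invalid_argument if the field is not a number.
  int getInt(std::size_t columnNumber) const {
    return std::stoi(data.at(columnNumber));
  }

private:
  std::vector<std::string> data;
};
```

With const getters, functions such as classify can take a const CSVLine& instead of a mutable reference.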

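【Comments】:

Thanks for the reply and the tips. I was hoping for an explanation of how to implement node pointers rather than just a long list of if-else statements. I was thinking I should make a decision node that declares a true node and a false node, and run my data through them to determine which branch to go to next. I hope that makes sense. I am just not sure how to actually code these nodes so that they point to the next branch of the tree. If you could help me with that, I would really appreciate it!
For example, using the data I provided above: I have created a node class for each attribute used to determine whether the tumor is benign or malignant. class Uniformity_of_Cell_Size_Node : public DecisionNode<CancerDecisionTree::patient_data> { public: Uniformity_of_Cell_Size_Node(DecisionNode* left_node, DecisionNode* right_node) : DecisionNode<CancerDecisionTree::patient_data>(t_node, f_node) {} bool process(CancerDecisionTree::patient_data& data) { if (data.cellSize <= 2) { return left_node_->process(data); } etc...
My problem is that, according to the tree provided above, Uniformity of Cell Size has multiple conditions: 1. If cell size <= 2, then look at bare nuclei. 2. If cell size > 2, then look at uniformity of cell shape. 3. If cell shape > 2, then look at cell size, and if <= 4, look at bare nuclei. 4. If bare nuclei > 2, then look at clump thickness; if <= 6, look at cell size; if <= 3 it is malignant, but if cell size > 4 it is malignant.
In my opinion, nested if statements are still the simplest solution for a small decision tree like this: less code, fewer bugs, and faster code. Unless the specification of the decision tree is unknown at the time you write the program, I think you are over-engineering your code if you build the decision tree with nodes and pointers. You will still need a lot of code to construct the decision tree; it just will not be nested ifs. Why do you prefer a solution based on nodes and pointers?
For this problem, I am required to use nodes and pointers. I have to declare a root node and then, based on the data read from the .csv file, point to the node for either case (the left or right side of the tree). For example, if Cell Size <= 2, I should traverse to the left side of the tree. If Cell Size > 2, I should traverse to the right side of the tree.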
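
The traversal described in the last comment (start at a root node, follow the left pointer when the test passes and the right pointer otherwise, stop at a leaf) can be sketched with a single node type rather than one class per attribute. Everything here is an illustrative assumption, not the required design: the names DecisionNode, makeLeaf, makeBranch, classify, and makeNodeTree, the Row alias standing in for a parsed CSV row, and all thresholds and leaf labels other than the root split on cell size <= 2. The same attribute can appear in several nodes with different thresholds, which covers the case where Uniformity of Cell Size is tested more than once along a path. The sketch uses std::make_unique and therefore needs C++14 or later.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Row = std::vector<int>;  // stand-in for one parsed CSV row

enum class Diagnosis { Benign, Malignant };

struct DecisionNode {
  Diagnosis label = Diagnosis::Benign;     // used only by leaf nodes
  std::size_t column = 0;                  // 0-based column tested by internal nodes
  int threshold = 0;                       // internal nodes test row[column] <= threshold
  std::unique_ptr<DecisionNode> onTrue;    // "left": followed when the test passes
  std::unique_ptr<DecisionNode> onFalse;   // "right": followed otherwise

  bool isLeaf() const { return !onTrue && !onFalse; }
};

std::unique_ptr<DecisionNode> makeLeaf(Diagnosis label) {
  auto node = std::make_unique<DecisionNode>();
  node->label = label;
  return node;
}

std::unique_ptr<DecisionNode> makeBranch(std::size_t column, int threshold,
                                         std::unique_ptr<DecisionNode> onTrue,
                                         std::unique_ptr<DecisionNode> onFalse) {
  assert(onTrue && onFalse);  // an internal node needs both children
  auto node = std::make_unique<DecisionNode>();
  node->column = column;
  node->threshold = threshold;
  node->onTrue = std::move(onTrue);   // the parent now owns its children
  node->onFalse = std::move(onFalse);
  return node;
}

// Walk from the root to a leaf, following one child pointer per test.
Diagnosis classify(const DecisionNode& root, const Row& row) {
  const DecisionNode* current = &root;
  while (!current->isLeaf()) {
    const bool passed = row.at(current->column) <= current->threshold;
    current = passed ? current->onTrue.get() : current->onFalse.get();
  }
  return current->label;
}

// Example tree; column 2 is UniformityOfCellSize and column 1 is ClumpThickness.
std::unique_ptr<DecisionNode> makeNodeTree() {
  return makeBranch(2, 2,                   // cell size <= 2 (from the thread)
                    makeLeaf(Diagnosis::Benign),            // placeholder leaf
                    makeBranch(1, 6,                        // placeholder threshold
                               makeLeaf(Diagnosis::Benign),       // placeholder leaf
                               makeLeaf(Diagnosis::Malignant)));  // placeholder leaf
}
```

Using std::unique_ptr for the children means deleting the root releases the whole tree, while the traversal itself works on a plain non-owning pointer, which is the same pointer-chasing logic you would write with raw DecisionNode* members.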