
Tree Based Methods



We discuss tree-based methods here (this now also covers ensemble methods).

Decision Tree

What is a decision tree?

(In plain terms, each level of the tree further splits on your features.)

  • A decision tree is a supervised prediction model.

  • A decision tree is usually a k-ary tree. Each leaf denotes one class or a continuous value of the dependent variable.

  • Two types of decision trees: classification trees and regression trees.

Key terminology: root node, splitting, decision node, leaf/terminal node, pruning, branch/sub-tree, parent and child node.

Reference notes:

1) A decision tree belongs to the area of decision tree learning. For a binary problem, it classifies based on each feature, performing classification and regression on a tree structure (a divide-and-conquer strategy).

2) Tree structure: in keeping with the tree analogy, the regions $R_1$, $R_2$, and $R_3$ are known as terminal nodes or leaves of the tree.

3) Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (such as a constant) in each one.

How to Build a Decision Tree?

Two steps:

  1. Feature prioritization (which feature to use at each level)

  2. Feature splitting (how many branches at each level)

In effect, each level of the tree cuts on one feature (you can picture it as a scenario tree).

How to build (ID3 algorithm)?

  • At each iteration step, we split on the feature that gives us the largest entropy (information) gain

  • Definition of entropy:

    $I_e(i)=-\log_2 p_i$

    Entropy of the entire information system:

    $H= \sum_{i=1}^n p_i I_e(i)= - \sum_{i=1}^n p_i \log_2 p_i$

    example:

Entropy of the current system (note that this first step computes the entropy of the dependent variable):

$H_Y= \sum_{i=1}^n p_i I_e(i)= - \sum_{i=1}^n p_i \log_2 p_i = (\text{Y term}) + (\text{N term}) = -0.7\log_2 0.7 - 0.3 \log_2 0.3 = 0.879$

Conditional entropy (now, given the entropy above, we look at how it changes):

Here we consider the conditional entropy of $Y$ given feature C (blog):

$H_{Y\vert C}= P(C=L)\, H(Y \vert C=L)+P(C=M)\, H(Y \vert C=M)+P(C=S)\, H(Y \vert C=S)$

where $P(C=L)= 0.3$, $P(C=M) = 0.4$, $P(C=S) = 0.3$, and the class proportions within each branch are

$P(Y\vert C=L)=3/3,\ P(N\vert C=L)=0$;

$P(Y\vert C=M)=3/4,\ P(N\vert C=M)=1/4$;

$P(Y\vert C=S)=1/3,\ P(N\vert C=S)=2/3$

Information gain = entropy - conditional entropy

The entropy gain (i.e., the change) from splitting on feature C = 0.879 - 0.603 = 0.276

The entropy gain (i.e., the change) from splitting on feature A = 0.033

The entropy gain (i.e., the change) from splitting on feature B = 0.553

Therefore, we choose feature B as the first splitting feature (this effectively uses entropy for feature selection: pick the most influential feature first).
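To make the gain calculation concrete, here is a minimal Python sketch. The toy arrays `y` and `C` are hypothetical, constructed only to match the proportions quoted above (7 Y / 3 N labels; feature C takes value L for 3 samples, M for 4, S for 3); the unrounded numbers differ slightly from the rounded ones in the notes.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature):
    """H(labels) minus the conditional entropy of labels given `feature`."""
    cond = sum((feature == v).mean() * entropy(labels[feature == v])
               for v in np.unique(feature))
    return entropy(labels) - cond

# Hypothetical toy data matching the example's proportions:
# C=L for 3 samples (all Y), C=M for 4 (3 Y, 1 N), C=S for 3 (1 Y, 2 N).
y = np.array(["Y"] * 7 + ["N"] * 3)
C = np.array(["L", "L", "L", "M", "M", "M", "S", "M", "S", "S"])

print(round(entropy(y), 3))              # ~0.881: entropy of the labels
print(round(information_gain(y, C), 3))  # ~0.281: information gain of feature C
```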

How to build (geometric view)?

How to split: it essentially amounts to partitioning the plane, and can be viewed much like a GAM model.

Prediction via stratification of the feature space (i.e., predict within strata of the feature space; the basic difference from a linear model is that a linear model fits a line, while here the boundaries are vertical and horizontal cuts):

  1. We divide the predictor space, that is, the set of possible values for $X_1,X_2,\cdots, X_p$, into $J$ distinct and non-overlapping regions $R_1, R_2, \cdots, R_J$.

  2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.

For step 1, how do we construct the regions $R_1, R_2, \cdots, R_J$?

  1. A top-down, greedy approach. "Top-down" means the tree is split from the top into two new branches: the approach begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree.

  2. This is known as recursive binary splitting.

  3. Greedy: it is greedy because, at each step of the tree-building process, the split is chosen for that step only.

  4. **Best:** the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

  5. Find the best predictor and the best cut point to keep splitting the data (a minimal sketch of one such greedy step follows this list).
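As a rough illustration of a single greedy step of recursive binary splitting for regression (a sketch, not the full tree-growing algorithm), the function below scans every feature and cut point and keeps the split with the smallest RSS. The data and the helper name `best_binary_split` are made up for illustration; a full tree would apply this step recursively inside each resulting region.

```python
import numpy as np

def best_binary_split(X, y):
    """One greedy step: the (feature, threshold) minimizing the total RSS
    of the two half-planes {X_j < s} and {X_j >= s}."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.where(X[:, 0] > 0.5, 3.0, 1.0) + rng.normal(scale=0.1, size=50)
print(best_binary_split(X, y))  # expect a split on feature 0, near 0.5
```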

Compare with GAM & linear regression

  • Generalized additive model (GAM): $E[Y\vert X_1,\cdots,X_p]=\alpha+f_1(X_1)+\cdots+f_p(X_p)$

  • It is not linear, because the components are not polynomials.

  • We can formulate the decision tree model with the following formula:

$f(X)= \sum_{k=1}^K \mathrm{contribution}(x_k) = \sum_{k=1}^K c_k\, I_{(X \in R_k)}$

Here $\mathrm{contribution}(x_k)$ denotes the effect that the $k$-th input feature has on the output; it can be measured as the reduction in entropy / Gini impurity (see the next subsection).

  • Compared with linear regression:

$f(X)=\beta_0+\sum_{j=1}^K X_j\beta_j$

In linear regression, the contribution of $x_k$ to $f(X)$ is fixed: it depends only on its own value and the corresponding coefficient.

  • In a decision tree, the value of $\mathrm{contribution}(x_k)$ depends not only on $x_k$ but also on the whole decision path determined by the other features in $X$ (in the earlier example, a feature affects both the features before and after it on the path; more specifically, the same feature contributes differently depending on its position in the path).

  • Clearly, this is a nonlinear relationship.

Characteristics of decision trees

  • Advantage: it can describe more complex relationships between $x$ and $y$, so its modeling capacity is stronger (some people use it directly for feature selection)

  • Disadvantage: it overfits badly

How to decide which feature to split on first?

6) How to judge whether a split is good: compare the misclassification proportion before and after the split, using three different criteria (classification error, Gini, entropy); a small sketch of the three measures follows this list.

Tree regression: RSS $\sum_{j=1}^J\sum_{i \in R_j}(y_i-\hat{y}_{R_j})^2$, where $J$ is the number of regions.

Tree classification:

  • Classification error rate: $E=1-\max_{k}(\hat{p}_{mk})$

  • Gini index: $G=\sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk})$

  • Cross-entropy: $D=-\sum_{k=1}^K\hat{p}_{mk}\log \hat{p}_{mk}$
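A quick sketch comparing the three node measures on a single node's class proportions (the proportions below are hypothetical, and base-2 log is used for the cross-entropy to match the entropy section above):

```python
import numpy as np

def node_impurities(p_mk):
    """Classification error, Gini index, and cross-entropy of one node,
    given its vector of class proportions p_mk."""
    p = np.asarray(p_mk, dtype=float)
    error = 1.0 - p.max()
    gini = np.sum(p * (1.0 - p))
    cross_entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return error, gini, cross_entropy

print(node_impurities([0.7, 0.3]))   # (0.3, 0.42, ~0.881)
print(node_impurities([0.5, 0.5]))   # most impure node: (0.5, 0.5, 1.0)
print(node_impurities([1.0, 0.0]))   # pure node: (0.0, 0.0, 0.0)
```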

Tree Pruning

  • Pruning addresses overfitting; we cannot simply use regularization here (because there is no explicit loss function to penalize).

  • Two approaches: post-pruning and pre-pruning.

  • The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance.

  • **How do we determine the best way to prune the tree?** Intuitively, our goal is to select a subtree that leads to the lowest test error rate. Use $K$-fold cross-validation to choose $\alpha$: divide the training observations into $K$ folds, and for each $k = 1, \cdots, K$ grow the tree on all but the $k$-th fold, evaluate the pruned subtrees (as a function of $\alpha$) on the held-out fold, then average the results and pick the $\alpha$ that minimizes the average error (a hedged code sketch follows).
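In practice this can be done with scikit-learn's cost-complexity pruning API. The sketch below (on a placeholder dataset) computes the candidate $\alpha$ values from the pruning path of a fully grown tree and keeps the one with the best cross-validated score; treat it as one reasonable recipe rather than the only way.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Candidate alphas come from the cost-complexity pruning path of a full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(path.ccp_alphas)

# K-fold CV accuracy of the tree pruned at each alpha; keep the best alpha.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print(best_alpha, max(scores))
```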

Tree Models with Ensemble Methods

Introduction:

  • In ensemble learning, we train multiple learners to solve the same problem.

  • These methods originally appeared to help decision trees: to transform a model that has high variance but low bias. Because a decision tree has high variance and low bias, ensemble methods are a natural fit.

  • Ensemble learning includes boosting methods (e.g., AdaBoost) and bagging methods (e.g., random forest).

Bagging --> focuses on reducing variance (bootstrap: resample the existing data many times and average, which naturally reduces variance)

  • Learners are trained independently of each other, in parallel

  • Combined through a deterministic averaging process

  • Bagging includes random forest

Boosting --> focuses on reducing bias (each round reweights the samples, adjusting the underlying distribution, which naturally reduces bias)

  • Learners are trained sequentially in an adaptive way (each base model depends on the previous ones)

  • Combined through a deterministic strategy

  • Boosting includes AdaBoost, GBDT, and XGBoost

  • Boosting mainly works by increasing the weight of your mistakes (AdaBoost)

How bagging works:

What is bagging (a typical interview answer)? Sampling with replacement:

  • bagging = bootstrap aggregating

  • bagging is an ensemble learning method

  • the bagging method samples with replacement

  • advantage: it decreases variance by introducing randomness into the model framework

  • random forest leverages bagging to solve the overfitting problem of fully grown decision trees.

  1. It is just the bootstrap: you resample from the existing sample. Recall from advanced statistics that this reduces variance, and the resulting averaged estimates are approximately normally distributed.

  2. The approach is to use $\hat{f}_{avg}(x)=\frac{1}{B}\sum_{b=1}^B \hat{f}_b(x)$ to reduce variance, but in practice we do not have $B$ separate training sets. So instead we generate $B$ different bootstrapped training data sets, train our method on the $b$-th bootstrapped training set to get $\hat{f}^{\ast b}(x)$, and finally average all the predictions to obtain (see the sketch after this formula)

    $\hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^B \hat{f}^{\ast b}(x)$
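A minimal numpy sketch of $\hat{f}_{bag}$ above, using deep regression trees as the base learner on made-up data (scikit-learn's `BaggingRegressor` packages the same idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

def bagged_predict(X_train, y_train, X_new, B=100):
    """Average the predictions of B trees, each fit on one bootstrap resample."""
    preds = []
    for _ in range(B):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # sample with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_new))
    return np.mean(preds, axis=0)  # \hat{f}_bag(x)

print(bagged_predict(X, y, np.array([[0.0], [1.5]])))  # roughly sin(0), sin(1.5)
```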

Main algorithm: random forest

Suppose the data contains N samples, each with m features:

  • Each time, randomly draw N observations from the original data with replacement (note that some samples may be selected multiple times while others may not appear among the N drawn observations at all)

  • From these samples, randomly select k (k < m) features and use them to build a decision tree with the best splits

  • Repeat the steps above many times to obtain a set of different decision trees; combine them into a random forest model to classify or predict on new data

  1. Random forests provide an improvement over bagged trees by way of a small tweak that decorrelates the trees: a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors. The split is allowed to use only one of those $m$ predictors.

  2. A fresh sample of $m$ predictors is taken at each split, and typically we choose $m \approx \sqrt{p}$; that is, the number of predictors considered at each split is approximately equal to the square root of the total number of predictors.

  3. The main difference between bagging and random forests is the choice of predictor subset size $m$.

  4. We can think of this process as decorrelating the trees, thereby making the average of the resulting trees less variable and hence more reliable.

  5. Averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities. (Thought: in plain terms, by considering fewer predictors per split you naturally reduce variance? RF essentially controls $m$ and the number of repetitions.)

  6. Out-of-bag (OOB) error estimate?

Summary of random forest (interview)

  • Sampling ideas in the random forest algorithm:

    • bagging for data

    • sampling for features (which does not change the features' distribution)

  • In the random forest algorithm, we do NOT need to do pruning for each 'weak' decision tree.

  • Advantages of random forest

    • less overfitting (after randomization, the unimportant features matter even less; essentially the CLT at work)

    • parallel implementation

  • Random forest vs. bagging: the difference is at the feature-sampling step (random forest also samples the features).

(The majority decision here can also be an average; and of course the individual trees do not need to be merged into one.)
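To see the feature-sampling difference in code: in scikit-learn, bagged trees and a random forest differ only in `max_features` (`None` means all $p$ predictors are candidates at every split, `"sqrt"` means $m \approx \sqrt{p}$). The dataset below is just a placeholder.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Bagged trees: all p features are split candidates at every node.
bagged = RandomForestClassifier(n_estimators=200, max_features=None, random_state=0)
# Random forest proper: only m ~ sqrt(p) randomly chosen candidates per split.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

for name, model in [("bagging", bagged), ("random forest", forest)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```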

Feature importance in random forest

$\mathrm{importance}(i)= \mathrm{performance}(\mathrm{RF})- \mathrm{performance}(\mathrm{RF}^{\,\mathrm{random}\ i})$

  • It does not signify positive or negative impact on the class label. It only judges how much class discriminatory information each variable contains.

(RF is very good at capturing nonlinear relationships; the painful part of nonlinear models is that they do not provide interpretability. So what do we do? We look for an analogue of a coefficient / MSE-based measure of importance.)

(The idea is to introduce a random perturbation so that the feature is effectively neutralized, which works better than dropping the feature outright: for example, randomly permute $X_4$. After permuting, the random forest's performance changes, and from that change you naturally learn how important the feature is. This is again similar to feature selection; see the sketch below.)

  • Can a decision tree give this kind of feature importance? No (because there is no feature sampling).

  • Performance here can be measured with a confusion matrix / F1.
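The "permute one feature and see how performance drops" idea above is exactly what permutation importance computes. A hedged sketch with scikit-learn on a placeholder dataset (the F1 scorer is one possible choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# importance(i) = performance(RF) - performance(RF with feature i randomly permuted)
result = permutation_importance(rf, X_te, y_te, n_repeats=10,
                                scoring="f1", random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
print(top, result.importances_mean[top])  # indices and score drops of the top 5 features
```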

notes:

  1. If the features include both categorical and continuous variables, you may want to handle the categorical ones first (this can of course also be decided from the data itself)

(Interview questions)

Pay attention to three questions: 1) tree depth in the forest: there is an optimum, and going too deep overfits; 2) the number of trees: the more the better; 3) the number of features used in each tree: more features of course makes it more robust.

Boosting

  1. Bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model.

  2. Each tree is built on one bootstrap sample, independently of the other trees. Boosting is similar, but here the trees are grown sequentially:

    • each tree is grown using information from previously grown trees.

    • Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

    Thought 1: boosting modifies the data each tree is fit on, rather than resampling the trees themselves; this is how weak learners are turned into a strong learner.

    Thought 2: given a training set (all samples start with equal weights, which change later), run M iterations; after each iteration, increase the weights of misclassified samples and decrease the weights of correctly classified ones, so that the next iteration pays more attention to the misclassified samples.

  3. Boosting has three tuning parameters:

    • The number of trees $B$: unlike bagging and random forests, boosting can overfit if $B$ is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select $B$.

    • The shrinkage parameter $\lambda$: typical values are $0.01$ or $0.001$, and the right choice can depend on the problem. Very small $\lambda$ can require a very large value of $B$ in order to achieve good performance.

    • The number $d$ of splits in each tree, which controls the complexity of the boosted ensemble. Often $d = 1$ works well, in which case each tree is a stump. (A sketch mapping these three parameters onto library arguments follows this list.)
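These three tuning parameters map naturally onto the arguments of a boosted-tree library. The sketch below assumes scikit-learn's `GradientBoostingRegressor` as a stand-in ($B \to$ `n_estimators`, $\lambda \to$ `learning_rate`, $d \to$ `max_depth`, with `max_depth=1` giving stumps) and picks them by cross-validation on a placeholder dataset.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)  # placeholder dataset

param_grid = {
    "n_estimators": [100, 300, 1000],     # B: number of trees, chosen by CV
    "learning_rate": [0.1, 0.01, 0.001],  # lambda: shrinkage
    "max_depth": [1, 2, 3],               # d: tree size (1 = stumps)
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```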

Main algorithms: AdaBoost, XGBoost, and gradient boosting

Boosting: the idea of combining multiple weak classifiers into a strong classifier. The key to its implementation is that in each training round, the sample distribution is changed by changing the sample weights, so that the next round approximately performs a local fit on the misclassified or high-error samples (similar to the weighting in weighted regression, which makes it easier to understand); finally the rounds are combined to achieve the boost. This raises several questions (a minimal AdaBoost-style sketch follows this list):

  1. What is the criterion for how large the error is in each round? (It is tied to the loss function.)

  2. How are the weak classifiers combined? (Take the partial derivative of the loss function with respect to the model weights.)

  3. How are the sample weights adjusted?
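A minimal AdaBoost-style sketch that answers the three questions concretely: the weighted error of each round (1), the classifier weight $\alpha_m = \tfrac{1}{2}\ln\frac{1-\mathrm{err}_m}{\mathrm{err}_m}$ derived from the exponential loss (2), and the exponential up-weighting of misclassified samples (3). The data and the choice of stumps are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels in {-1, +1}

n, M = len(y), 20
w = np.full(n, 1.0 / n)                      # start with equal sample weights
stumps, alphas = [], []

for _ in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)         # (1) weighted error this round
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # (2) model weight from the loss
    w = w * np.exp(-alpha * y * pred)                 # (3) up-weight the mistakes
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```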

Supplement: Decision Tree

So the core of a decision tree algorithm is to solve two problems:

1) How to find the best split in the data set?

2) How to stop the tree from growing at the right time, to prevent overfitting?

Almost all tuning methods for decision-tree models come down to these two core problems.

Regarding the first problem: a data set inevitably comes with many features, and deciding which feature to split on first, so as to partition the data efficiently, is the biggest problem a decision tree must solve. Of course, we could simply split on the features one by one in their given order, but that is not a good idea: if an important feature happens to be placed near the end, the tree will be very inefficient.

So machine learning researchers borrowed a term from information theory: information entropy. The more certain (uniform) the information, the smaller the entropy; the more variable (chaotic) the information, the larger the entropy.

Finding the more important feature by comparing the difference in entropy before and after a split gives rise to the ID3 and C4.5 decision tree algorithms.

Later, researchers found an even better measure, the Gini index, which led to the CART decision tree.

Reference

  1. Zhou Zhihua, Machine Learning (the "watermelon book")

  2. An Introduction to Statistical Learning: https://zhuanlan.zhihu.com/p/57324157

  3. Andrew Ng: https://www.youtube.com/watch?v=wr9gUr-eWdA&list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU&index=10

  4. https://liangyaorong.github.io/blog/2017/%E6%B7%B1%E5%85%A5%E7%90%86%E8%A7%A3Boosting/ and https://www.zybuluo.com/frank-shaw/note/127048

  5. A good reference on hyperparameter tuning: https://zhuanlan.zhihu.com/p/103136609

  6. An explanation of tree-depth issues: https://blog.csdn.net/u012328159/article/details/70184415

Ensemble Learning (Ensemble Methods)

Ensemble methods:

These are almost always used together with tree models.

Decision trees: 1) boosting, with AdaBoost as the representative; 2) bagging, with random forest as a variant of bagging.

1) ==boosting==: each round adds samples and adjusts their weights, changing the underlying distribution, which naturally reduces bias

Boosting includes AdaBoost, GBDT, and XGBoost

2) ==bagging==: essentially the bootstrap: reuse the existing sample many times, which naturally reduces variance

Random forest: random feature (attribute) selection?

Combined approaches: meta algorithms & stacking algorithms

Further steps: class notes on decision trees (from Andrew Ng's YouTube course)