XG, Cat, Light Boosting
XGBoost is essentially still a GBDT, but it pushes speed and efficiency to the extreme, hence the X (Extreme) in its name.

It is formulated from the perspective of the loss function: it adds a regularization penalty to the loss, and it argues that taking first derivatives alone is not enough. Using only the first derivative amounts to a first-order Taylor expansion, i.e., first-order optimization, whose convergence is not fast enough. XGBoost therefore expands the loss function with a second-order Taylor expansion.
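To make this concrete, here is a minimal sketch of training such a model with the xgboost Python package (assuming it is installed; the data and parameter values are purely illustrative). The parameters map directly onto the math derived below: `lambda` is the L2 penalty on leaf weights, `gamma` the minimum loss reduction required to make a split, and `eta` the shrinkage applied to each new tree.

```python
import numpy as np
import xgboost as xgb

# Illustrative synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",  # squared-error loss l(y, yhat)
    "eta": 0.1,        # shrinkage applied to each new tree f_t
    "max_depth": 3,    # limits the complexity of each tree
    "lambda": 1.0,     # L2 penalty on leaf weights (lambda below)
    "gamma": 0.0,      # minimum loss reduction to split (gamma below)
}
booster = xgb.train(params, dtrain, num_boost_round=100)
pred = booster.predict(xgb.DMatrix(X))
```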
Here’s a simple example of a CART that classifies whether someone will like a hypothetical computer game X: each input is routed down the tree to a leaf, and each leaf carries a real-valued prediction score.
The prediction scores of the individual decision trees then sum up to give the final score; in a good ensemble the trees complement each other. Mathematically, we can write our model in the form:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

where $K$ is the number of trees, $f_k$ is a function in the functional space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs.
The objective function for the above model is given by:

$$\text{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where the first term is the loss function and the second is the regularization term.
Now, instead of learning all the trees at once, which makes the optimization harder, we apply an additive strategy: keep what has been learned so far, and add one new tree at a time, which can be summarised as:

$$\hat{y}_i^{(0)} = 0$$
$$\hat{y}_i^{(1)} = f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)$$
$$\hat{y}_i^{(2)} = f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)$$
$$\dots$$
$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)$$
The objective function of the above model can be defined as:

$$\text{obj}^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{k=1}^{t} \Omega(f_k) = \sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}$$
Now, let’s apply the Taylor series expansion up to the second order:

$$\text{obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \text{constant}$$
where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction:

$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$$
Simplifying and removing the constant terms:

$$\sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
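For instance, with the squared-error loss $l(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$, these derivatives take a particularly simple form. A small Python sketch (the function name is illustrative, not a library API):

```python
import numpy as np

def grad_hess_squared_error(y, y_pred):
    """g_i and h_i for l(y, yhat) = 0.5 * (y - yhat)^2.

    g_i = d l / d yhat     = yhat - y   (the residual, negated)
    h_i = d^2 l / d yhat^2 = 1
    """
    g = y_pred - y
    h = np.ones_like(y, dtype=float)
    return g, h
```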
Regularization
Now we define the regularization term, but first we need to define the model:

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q: \mathbb{R}^d \to \{1, 2, \dots, T\}$$

Here, $w$ is the vector of scores on the leaves of the tree, $q$ is the function assigning each data point to the corresponding leaf, and $T$ is the number of leaves. The regularization term is then defined by:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
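As a small worked instance of this penalty (a sketch with illustrative numbers, not part of any library):

```python
import numpy as np

def omega(w, gamma, lam):
    # Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2
    w = np.asarray(w, dtype=float)
    return gamma * len(w) + 0.5 * lam * np.sum(w ** 2)

# A tree with 3 leaves scoring (2.0, 0.1, -1.0), gamma = lambda = 1:
# Omega = 1 * 3 + 0.5 * 1 * (4 + 0.01 + 1) = 5.505
print(omega([2.0, 0.1, -1.0], gamma=1.0, lam=1.0))
```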
Now our objective function becomes:

$$\text{obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2 \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
Now we simplify the above expression by grouping the samples that fall into the same leaf:

$$\text{obj}^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$$
where

$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$

and $I_j = \{ i \mid q(x_i) = j \}$ is the set of indices of the data points assigned to leaf $j$. In this equation the $w_j$ are independent of each other, so for a given structure $q(x)$ the best $w_j$ and the best objective reduction we can get are:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{obj}^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
Now we try to measure how good a tree is. Since we can’t directly optimize over all possible tree structures, we optimize one level of the tree at a time: we try to split a leaf into two leaves, and the score it gains is:

$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where $\gamma$ is the pruning parameter, i.e., the least information gain required to perform a split.
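These closed-form expressions translate directly into code. A sketch using the $G$, $H$, $\lambda$, $\gamma$ defined above (function names are illustrative):

```python
def leaf_weight(G, H, lam):
    # Optimal leaf weight: w* = -G / (H + lambda)
    return -G / (H + lam)

def leaf_score(G, H, lam):
    # One leaf's contribution to obj*: -0.5 * G^2 / (H + lambda)
    return -0.5 * G * G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain from splitting a leaf into left (G_L, H_L) and right (G_R, H_R).

    Gain = 0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
                  - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    A split is kept only if Gain > 0, which is how gamma prunes.
    """
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```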
The core algorithm of XGBoost
Keep adding trees, growing each tree through successive feature splits, one tree at a time. Each new tree is in effect a new function learned to predict the residual of the previous predictions.
After training we obtain $K$ trees. To predict the score of a sample, the sample’s features route it to one leaf in each tree, and each leaf corresponds to a score.
Finally, summing up the scores of the corresponding leaves across all trees gives the sample’s predicted value.
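A minimal sketch of this additive scheme for the squared-error case, where fitting the negative gradient reduces to fitting residuals. It uses scikit-learn’s DecisionTreeRegressor as the base learner purely for illustration and omits XGBoost’s second-order statistics and regularization:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, lr=0.1, max_depth=3):
    """Additive training: each tree fits the residual of the running prediction."""
    trees = []
    y_hat = np.zeros_like(y, dtype=float)
    for _ in range(n_trees):
        residual = y - y_hat                  # what is still unexplained
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                 # learn f_t on the residual
        y_hat += lr * tree.predict(X)         # y^(t) = y^(t-1) + eta * f_t(x)
        trees.append(tree)
    return trees

def predict(trees, X, lr=0.1):
    # The final score is just the sum of every tree's (shrunken) leaf scores.
    return lr * sum(tree.predict(X) for tree in trees)
```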