General Linear Model 2

Continuing from the previous notes on linear regression, here we introduce logistic regression.

Review:

  1. What is linear regression?

  2. How do we derive the loss function? Two methods: least squares (which is the MLE when the residuals are normally distributed) or MLE directly.

Quote: when you are learning a new model, it is not about the model itself, but about the idea behind it.

notes:

  1. Any machine learning model is ultimately about finding the relationship between $X$ and $Y$.

  2. Logistic regression cannot be derived with least squares (nonlinear least squares is possible, but not the ordinary linear one).

Logistic Regression

1. Background

When we have a categorical variable, for example a tumor problem (has the disease or not), or whether a credit card / loan application gets approved,

how do we handle this kind of problem?

2. Model

2.1 logistic function

2.1.1 A graphical view (compared with linear regression)

So we need a new function to approximate this relationship, like the curve in the figure above (note that the figure plots $x$ against the probability $p$ associated with $y$).

So there is a problem: $x \in (0, \infty)$ but $y = \pm 1$, so how do we connect $x \rightarrow y$? (a mapping from continuous to discrete)

We adopt the approach $x \rightarrow p(x) \rightarrow y$, i.e.

  1. first use $x$ to fit the probability $p$,

  2. then use $p$ to determine $y$ (via a threshold).

Naturally we want to choose a good $p$ to model this; here we use the logistic function, a member of the sigmoid family (there are other smooth curves as well: https://en.wikipedia.org/wiki/Sigmoid_function):

$$p(x)=\frac{e^{f(x)}}{1+e^{f(x)}}=\frac{1}{1+e^{-f(x)}}$$

where $f(x)$ can be $x$ itself or another function (another way of bending the curve), such as $ax+b$.

2.1.2 Understanding the range (compared with linear regression)

  1. $Y$ is whether the patient has the disease or not, $p(x)$ acts as a tumor score, and $X$ is the tumor-related input (size, location, etc.).

  2. This naturally suggests the Bernoulli distribution $P(Y=y_i)=p^{y_i}(1-p)^{1-y_i},\ 0<p<1,\ y_i=0,1$.

$$\log\frac{p(X)}{1-p(X)} = \beta_0+\beta_1X$$
  1. How to derive it: use the odds from gambling, $odds=\frac{p}{1-p}$, with range $[0,\infty)$; then $\log(odds)=\log\frac{p}{1-p}$ has range $(-\infty,\infty)$, which matches the range of a linear function (checked numerically in the sketch below):

    $$\begin{split} \log\frac{p}{1-p}=f(x)=\beta_0+\beta_1x \\ p=\frac{1}{1+e^{-(\beta_0+\beta_1x)}} \end{split}$$

2.2 How does this model perform? (Estimating the regression coefficients)

2.2.1 Using MLE to get the loss function (a must-know interview question)

Step 1: Choose model

In (binary) logistic regression, $Y$ follows a Bernoulli distribution (the Bernoulli relationship between $y$ and $p$):

$$P(Y=y_i\mid X) = p^{y_i}(1-p)^{1-y_i}, \ 0<p<1,\ y_i=0,1$$

and $p$ is related to $x$ through the logistic function:

$$p=h_{\beta}(x_i)=\frac{1}{1+e^{-(\beta_0+\beta_1x_i)}}$$

Step 2: Calculate the loss function by MLE:

$$L(\hat\beta_0,\hat\beta_1)=P(Y_1,\cdots,Y_n \mid X_1,\cdots,X_n)$$

Since the observations $(x_1,y_1),\cdots,(x_n,y_n)$ are independent (each observation is independent of the others, e.g. $x_2$ has no relationship with $y_1$):

a. $P(Y_1, \cdots, Y_n\mid X_1,\cdots, X_n)=\prod_{i=1}^{n}P(Y_i\mid X_1,\cdots,X_n)$

b. $P(Y_i \mid X_1,\cdots,X_n)=P(Y_i \mid X_i)$

Therefore

$$L(\hat\beta_0,\hat\beta_1)=\prod_{i=1}^{n}P(Y_i \mid X_i)= \prod_{i=1}^n p_i^{\,y_i}(1-p_i)^{1-y_i}=\prod_{i=1}^n h_{\beta}(x_i)^{y_i}\left(1-h_{\beta}(x_i)\right)^{1-y_i}$$

Taking the log, this becomes

$$\log L(\hat\beta_0,\hat\beta_1)= \sum_{i=1}^{n}\left[y_i \log\left(h_{\beta}(x_i)\right)+(1-y_i)\log\left(1-h_{\beta}(x_i)\right)\right]$$

Based on MLE, we are looking for the $\beta$ that solves the following optimization problem (maximizing the log-likelihood, or equivalently minimizing the negative log-likelihood, i.e. the cross-entropy loss):

$$\begin{split} \hat\beta &= \arg\max_{\beta}\ \log L(\hat\beta_0,\hat\beta_1)= \arg\max_{\beta}\ \sum_{i=1}^{n}\left[y_i \log\left(h_{\beta}(x_i)\right)+(1-y_i)\log\left(1-h_{\beta}(x_i)\right)\right] \\ &= \arg\min_{\beta}\ \sum_{i=1}^{n}\left[-y_i \log\left(h_{\beta}(x_i)\right)-(1-y_i)\log\left(1-h_{\beta}(x_i)\right)\right] \end{split}$$
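To make the derivation concrete, here is a minimal NumPy sketch that minimizes this negative log-likelihood with plain gradient descent (a sketch only: the simulated data, learning rate, and iteration count are arbitrary illustrative choices, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with true coefficients beta0 = -1, beta1 = 2 (illustrative only)
n = 500
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 2 * x))))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neg_log_likelihood(beta, x, y):
    """Negative log-likelihood (cross-entropy) from the MLE derivation above."""
    p = sigmoid(beta[0] + beta[1] * x)
    eps = 1e-12                                   # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Plain gradient descent on the (averaged) negative log-likelihood
beta = np.zeros(2)
lr = 0.1
for _ in range(5000):
    p = sigmoid(beta[0] + beta[1] * x)
    grad = np.array([np.mean(p - y), np.mean((p - y) * x)])  # gradient of NLL / n
    beta -= lr * grad

print("estimated beta:", beta)                     # should be close to (-1, 2)
print("final NLL:", neg_log_likelihood(beta, x, y))
```

In practice, libraries such as scikit-learn minimize this same loss (plus a regularization term) with more sophisticated optimizers.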

2.2.2 Linear regression vs. logistic regression

When the input $x$ is given, what do we assume about the distribution of $y$? (this deserves some rethinking)

  1. Linear regression's assumption: $p(y \mid x)$ is a normal distribution with $mean=f(x)=ax+b$ and $variance= \sigma^2$ (a value that does not depend on $x$).

  2. Logistic regression's assumption: $y$ follows a Bernoulli distribution with $p=\frac{1}{1+e^{-(ax+b)}}$.

2.2.3 The 'scope' of a machine learning model

  1. Step 1: 'Given $X$, obtain the distribution of $Y$': this is the part a machine learning model can solve using data.

  2. Step 2: 'Given the distribution of $Y$, decide on a value of $Y$': how to use the distribution is up to you.

Example 1:

  1) Prediction model: $y=2x+5000$ (e.g. predicting an experienced driver's value from a new driver's $x$). If a new driver has $x=1000$, the model outputs $y=7000$, but that does not mean $y$ must be 7000: what you actually obtain is a Gaussian distribution for $y$ with mean 7000, so 7000 is only the most likely value, not a certainty.

  2) So we usually use the mean (the expected value) to represent the model's prediction.

Example 2:

  1) Say you predict rain from how many times the dog barks. This is again a two-step process: suppose the probability of rain given 5 barks is $0.6$; so will it rain tomorrow or not?

  2) Turning this into a yes/no answer adds the extra step of setting a threshold, e.g. $0.5$.

Example 3:

  1) An internet company wants to raise its virus-detection rate: this is again a threshold question. The data alone sometimes cannot tell you where to set the threshold, so there may be a trade-off to make (see the sketch below).
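To illustrate the threshold trade-off in these examples, here is a minimal sketch (the predicted probabilities and true labels are made up purely for illustration): lowering the threshold catches more positives but also raises false alarms.

```python
import numpy as np

# Hypothetical predicted probabilities and true labels (made up for illustration)
p_hat = np.array([0.10, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90])
y_true = np.array([0,    0,    1,    0,    1,    1,    1])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (p_hat >= threshold).astype(int)        # step 2: probability -> class label
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false alarms
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed positives
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```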

2.2.4 Extensions of logistic regression

$$p(x)=h_{\beta}(x_i)=\frac{1}{1+e^{-g(x)}}, \ \textit{where }\ g(x)=\beta_0+\beta_{1,1}x_1+\beta_{1,2}x_1^2+\beta_2x_2$$

The polynomial $g(x)$ can involve several variables.

But you will still face the overfitting/underfitting problem:

  1. Model complexity affects classification performance.

  2. A more complex model can fit the training data better, but that does not necessarily carry over to the testing data.

  3. Fitting too much is overfitting; fitting too little is underfitting (see the sketch below).
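A minimal sketch of this behavior, assuming scikit-learn is available (the simulated data and the degrees tried are arbitrary illustrative choices): as the polynomial degree of $g(x)$ grows, training accuracy tends to rise while test accuracy can stall or drop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Simulated two-feature binary classification data (illustrative only)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 2, 10):
    # g(x) is a polynomial in several variables, as described above
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(max_iter=5000))
    model.fit(X_train, y_train)
    print(f"degree={degree}: "
          f"train acc={model.score(X_train, y_train):.2f}, "
          f"test acc={model.score(X_test, y_test):.2f}")
```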

notes:

  1. (The threshold effectively adjusts the constant behind the hyperplane in the figure: $\log \frac{p(X)}{1-p(X)}=C=\beta_0+\beta_1X_1+\cdots$)

  2. In many cases, a multi-class problem can be decomposed into several binary classification problems, solved pairwise (one-vs-one).

  3. Logistic regression does not carry a linear-regression-style additive error term $\varepsilon$.

Reference:

  1. These are notes from the Laioffer course plus my own understanding.

  2. The book ISL (An Introduction to Statistical Learning).

2.3 Multiple Logistic Regression

$$\begin{split} \log\frac{p(X)}{1-p(X)} &= \beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_p X_p \\ p(X) &=\frac{1}{1+e^{-(\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_p X_p)}} \end{split}$$

(This can be derived with the same odds argument from the range discussion above; it is simply the relationship between the probability function $p(X)$ and $X$. A small fitting sketch follows.)
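A minimal sketch of fitting a multiple logistic regression with scikit-learn (the simulated data and the large `C` value, used to approximate an unpenalized MLE fit, are illustrative assumptions): the fitted intercept and coefficients live on the log-odds scale of the equation above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Three predictors X1, X2, X3; only the first two matter (illustrative data)
X = rng.normal(size=(400, 3))
logit = -0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Very large C means almost no regularization, i.e. close to the plain MLE fit
model = LogisticRegression(C=1e6).fit(X, y)
print("beta_0 (intercept, log-odds scale):", model.intercept_)
print("beta_1..beta_p (log-odds scale):   ", model.coef_)
```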

  1. Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method from statistics, pattern recognition, and machine learning that looks for a linear combination of features that characterizes or separates two classes of objects or events. The resulting combination can be used as a linear classifier or, more commonly, for dimensionality reduction before later classification.

  2. LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also try to express a dependent variable as a linear combination of features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis uses continuous independent variables and a categorical dependent variable (the class label). Logistic regression and probit regression are more similar to LDA than ANOVA is, since they also explain a categorical dependent variable using continuous independent variables. LDA's basic assumption is that the independent variables are normally distributed; when this assumption cannot be met, the other methods above are often preferred in practice (a small sketch follows).
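For reference, a minimal scikit-learn sketch contrasting LDA and logistic regression on the same simulated data (the two-Gaussian data, which happens to satisfy LDA's normality assumption, is an illustrative choice):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two Gaussian classes, matching LDA's normality assumption (illustrative data)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(2.0, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
logreg = LogisticRegression().fit(X, y)

print("LDA training accuracy:   ", lda.score(X, y))
print("LogReg training accuracy:", logreg.score(X, y))
```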

Regularization: Ridge regression / Lasso regression

  1. Model interpretability: by removing irrelevant features, that is, by setting the corresponding coefficient estimates to zero, we can obtain a model that is more easily interpreted. We will present some approaches for automatically performing feature selection.

  2. Predictive performance: especially when $p>n$, to control the variance.

  3. Three methods:

Subset selection

We identify a subset of the $p$ predictors that we believe to be related to the response, and then fit a model using least squares on the reduced set of variables (essentially, pick a criterion and use it to choose among subsets):

  1. Add features one at a time and decide how many you actually need, based on the smallest RSS / $C_p$ / AIC / BIC / adjusted $R^2$.

  2. Best-subset search can overfit, which motivates stepwise methods.

  3. Forward/backward stepwise selection (see the sketch after this list).

  4. Estimating the test error: two approaches:

    1. indirectly, via the smallest RSS / $C_p$ / AIC / BIC / adjusted $R^2$;

    2. directly, via a validation set or cross-validation.
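A minimal sketch of forward stepwise selection, assuming scikit-learn (>= 0.24) is available; the simulated data and the choice to keep two features are illustrative assumptions:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# p = 5 predictors, but only the first two are related to the response (illustrative)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",   # add one predictor at a time
    cv=5,                  # test error estimated directly by cross-validation
)
selector.fit(X, y)
print("selected predictors:", np.where(selector.get_support())[0])  # expect [0 1]
```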

Shrinkage

We fit a model involving all $p$ predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (regularization) has the effect of reducing variance and can also perform variable selection. (Shrinkage means gradually pushing coefficients toward zero, possibly all the way to zero.)

  1. Ridge regression: $\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij})^2+\lambda\sum_{j=1}^p \beta_j^2$

  2. Before running ridge regression, it is best to standardize the predictors first, because substantially different scales lead to substantially different coefficients.

  3. Why does ridge regression improve over least squares? Because of the bias-variance trade-off: as $\lambda$ increases, flexibility decreases, so variance drops at the cost of a small increase in bias.

  4. Lasso regression: $\sum_{i=1}^n(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij})^2+\lambda\sum_{j=1}^p \vert \beta_j \vert$

  5. Lasso regression overcomes ridge regression's disadvantage of keeping all the inputs/predictors in the final model; it uses the $\ell_1$ penalty.

  6. Lasso regression yields sparse models, i.e. models that involve only a subset of the variables.

  7. Lasso regression performs variable selection; selecting a good value of $\lambda$ for the lasso is critical; cross-validation is again the method of choice (pick the $\lambda$ with the smallest MSE).

  8. Tuning parameter: for a given sample, use cross-validation (see the sketch below).
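A minimal scikit-learn sketch of these points (the simulated data and the penalty strengths are illustrative assumptions): standardize first, compare how ridge shrinks while lasso zeroes out coefficients, and choose the lasso's $\lambda$ (called `alpha` in scikit-learn) by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Ten predictors, only the first three truly matter (illustrative data)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 2 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

# Standardize the predictors first so the penalty treats all scales equally
X_std = StandardScaler().fit_transform(X)

ridge = Ridge(alpha=10.0).fit(X_std, y)   # l2 penalty: shrinks but keeps all predictors
lasso = Lasso(alpha=0.1).fit(X_std, y)    # l1 penalty: can set coefficients exactly to 0

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))   # sparse: many exact zeros

# Choose the lasso's penalty strength by cross-validation (smallest MSE)
lasso_cv = LassoCV(cv=5).fit(X_std, y)
print("cross-validated lasso alpha:", lasso_cv.alpha_)
```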

Dimension reduction

We project the predictors into an $M$-dimensional subspace, where $M<p$. This is achieved by computing $M$ different linear combinations, or projections, of the variables. These $M$ projections are then used as predictors to fit a linear regression model by least squares (compressing $p$ predictors down to $M$; see the sketch after the list below).

  1. PCA;

  2. Transform;

  3. Partial least squares (PLS)
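A minimal sketch of principal components regression (PCA followed by least squares on the $M$ components), assuming scikit-learn; $M=2$ and the simulated correlated predictors are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# p = 6 highly correlated predictors built from 2 latent factors (illustrative data)
Z = rng.normal(size=(200, 2))
X = np.hstack([Z + 0.1 * rng.normal(size=(200, 2)) for _ in range(3)])
y = Z[:, 0] - Z[:, 1] + rng.normal(scale=0.3, size=200)

# Project the p predictors onto M = 2 components, then fit least squares on them
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("R^2 with M = 2 components:", round(pcr.score(X, y), 3))
```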