
General Linear Model 1

Here are the notes for General Linear Model 1: linear regression.

General Linear Model 1——Linear Regression

1. What is Linear Regression?

  1. It is one of the most well-known and well-understood algorithms in statistics and machine learning

  2. Linear regression is a linear model: it assumes a linear relationship between the input $X$ and the output $Y$. More technically, we can consider $Y$ to be a linear combination of $X$

Quote: Our goal is to find a line such that the total distance from all of the input points to this line is as small as possible.

  3. The simplest linear regression can be written as

$$Y = \beta_0 + \beta_1 X + \epsilon$$
  4. In statistics, this is a parametric model, i.e. it has the parameters $\beta_0$ and $\beta_1$

(Ideally we would simply have $Y = \beta_0 + \beta_1 X$, but in practice we also have $\epsilon$; fortunately we know it is normally distributed)

So, what is our target?

To find the relationship between $X$ and $Y$: find the best $\beta_0$ and $\beta_1$ → minimize the total error (the loss function) → least squares

2. The keys

2.1 How to determine this model? (parameter estimation)

2.1.1 Loss function (we can link this to the more general treatment of loss functions)

  1. How do we know whether the fit is good? We need to look at the error

$$Error = \vert y_i - y_i' \vert$$

  2. Can the error be defined differently? Yes, but squaring makes the computation easier

  3. The loss function is the total error

2.1.2 Deriving the loss function via least squares

$$Loss = \sum_{i=1}^{n} error^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 X_i)^2$$
  1. The square is not the only possible choice, but the sum of squares gives the best value in a certain sense

    1. The $y$ that minimizes the total squared error is taken as the true value; this assumption is optimal when the errors are random fluctuations

    2. So we look for the minimum loss $\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 X_i)^2$

  2. That is, we look for the $\beta_0$ and $\beta_1$ that satisfy $\arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 X_i)^2$

  3. So we are looking for the $\beta_0$, $\beta_1$ that make $\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 X_i)^2$ as small as possible, i.e. that minimize the loss

  4. Least squares is not always optimal
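To make the least-squares idea concrete, here is a minimal numeric sketch (numpy only; the data and variable names are made up for illustration) of the closed-form estimates that minimize the loss above.

```python
# A minimal sketch of simple linear regression by least squares,
# using the closed-form solution (the data below are made up).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)   # true beta0 = 2, beta1 = 3

# beta1_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

loss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)   # the minimized RSS
print(beta0_hat, beta1_hat, loss)
```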

A few reminders:

  1. Least squares builds the objective function from the cost-function point of view, using a notion of distance (note that least squares is just one method; the whole loss-function framework can be linked to statistical learning theory)

  2. Classical parameter estimation instead builds the objective from a probabilistic point of view, e.g. maximum likelihood estimation (MLE)

2.1.3 Deriving the loss function via maximum likelihood estimation (MLE)

  1. What is MLE?

    1. MLE is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making those observations. In other words, once we have some observations, we estimate the parameters so that those observations are as probable as possible

  2. For linear regression, this amounts to finding the line that passes through the points of highest likelihood (highest density, i.e. as much probability mass as possible)

    1. For each $x$, this line is also where the conditional distribution of $y$ is most likely (CLT)

    2. See the figure

  3. Don't forget the model assumption: $p(y \vert x)$ is a normal distribution with $mean = \mu = f(x)$ (depends on $x$) and $variance = \sigma^2$ (independent of $x$), or equivalently $\varepsilon \sim N(0, \sigma^2)$

  4. Derivation **(a must-know interview question)**

    1. We know that $Y \vert X \sim N(\bar{\beta}_0 + \bar{\beta}_1 X, \sigma^2)$

    2. $p(Y_i \vert X_i) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(Y_i - \bar{\beta}_0 - \bar{\beta}_1 X_i)^2}$
  3. $L(\bar{\beta}_0, \bar{\beta}_1, \sigma^2) = p(Y_1, \cdots, Y_n \vert X_1, \cdots, X_n) = \frac{1}{\sigma^{n}(2\pi)^{n/2}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \bar{\beta}_0 - \bar{\beta}_1 X_i)^2}$

    under the assumption that $(X_1, Y_1), \cdots, (X_n, Y_n)$ are independent,

    since $L(\bar{\beta}_0, \bar{\beta}_1, \sigma^2) = p(Y_1, \cdots, Y_n \vert X_1, \cdots, X_n) = p(Y_1 \vert X_1, \cdots, X_n) \cdots p(Y_n \vert X_1, \cdots, X_n) = p(Y_1 \vert X_1) p(Y_2 \vert X_2) \cdots p(Y_n \vert X_n)$

  4. The corresponding log-likelihood (we only care about the parameters, and taking the log does not change monotonicity or the other relevant mathematical properties):

    $$\log L(\bar{\beta}_0, \bar{\beta}_1, \sigma^2) = -n\log(\sqrt{2\pi}\,\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \bar{\beta}_0 - \bar{\beta}_1 X_i)^2$$

    So what we are looking for is

    $$\arg\max -\sum_{i=1}^n (Y_i - \bar{\beta}_0 - \bar{\beta}_1 X_i)^2$$

    i.e.

    $$\arg\min \sum_{i=1}^n (Y_i - \bar{\beta}_0 - \bar{\beta}_1 X_i)^2$$

What is the relationship between the two?

  1. Under the linear regression assumptions, the two methods lead to the same result

  2. One comes from statistics, the other from optimization (a small numerical check is sketched below)
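A small numerical check of this equivalence under the Gaussian-noise assumption (numpy and scipy assumed available; the data and starting point are made up): maximizing the log-likelihood numerically should recover the same $\beta_0$, $\beta_1$ as the closed-form least-squares fit.

```python
# Sketch: compare closed-form OLS estimates with a numerical MLE
# under y = beta0 + beta1*x + N(0, sigma^2) noise (made-up data).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, size=200)

# OLS closed form
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

def nll(params):
    """Negative Gaussian log-likelihood; log_sigma keeps sigma positive."""
    beta0, beta1, log_sigma = params
    sigma = np.exp(log_sigma)
    resid = y - beta0 - beta1 * x
    return len(y) * np.log(sigma) + np.sum(resid ** 2) / (2 * sigma ** 2)

mle = minimize(nll, x0=[0.0, 0.0, 0.0]).x
print((b0, b1), tuple(mle[:2]))   # the two pairs should agree closely
```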

One more reminder:

  1. Noise comes from the data and is inherent; error comes from the model and is introduced by us. They are two different concepts

2.2 How good is the performance of this model? (this evaluates the model on its own, i.e. assesses its parameter estimates)

We can only guarantee, in a systematic sense, that it is unbiased ($E(Y)=\mu$)

To consider this question, we need to link it to statistical hypothesis testing

Null hypothesis: $\beta_1 = 0$

  1. The goal is to judge, statistically, how far this sample deviates from the population, i.e. assessing the accuracy of the coefficient estimates; we can use the $p$-value or the confidence interval $\hat{\beta}_1 \pm 2\,SE(\hat{\beta}_1)$

  2. The test statistic is $t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$, which follows a $t$ distribution with $n-2$ degrees of freedom assuming $\beta_1 = 0$
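As a hedged sketch of how this test looks in practice (assuming statsmodels is installed; the toy data are made up), the fitted results expose $SE(\hat{\beta}_1)$, the $t$ statistic, the $p$-value, and the confidence interval directly:

```python
# Sketch: t-test for beta1 = 0 with statsmodels (made-up data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 4.0 + 0.3 * x + rng.normal(0, 2, size=50)

X = sm.add_constant(x)          # add the intercept column
fit = sm.OLS(y, X).fit()

print(fit.bse[1])               # SE(beta1_hat)
print(fit.tvalues[1])           # t = (beta1_hat - 0) / SE(beta1_hat)
print(fit.pvalues[1])           # p-value for H0: beta1 = 0
print(fit.conf_int()[1])        # ~ beta1_hat +/- 2 * SE(beta1_hat)
```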

2.3 How can we compare this model with other models?

(this judges the model from the outside, i.e. the extent to which the model fits the data)

  1. Assessing the overall accuracy

  2. $RSE = \sqrt{\frac{1}{n-2} RSS} = \sqrt{\frac{1}{n-2}\sum_i^{n}(y_i - \hat{y}_i)^2}$

  3. $R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$,

  4. where $TSS = \sum_i^{n}(y_i - \bar{y})^2$ is the total sum of squares and $RSS = \sum_i^{n}(y_i - \hat{y}_i)^2$ is the residual sum of squares (how much error remains)

  5. For simple regression, $R^2$ equals the squared correlation between $X$ and $Y$

  6. $R^2$ is the proportion of variability in $Y$ that can be explained using $X$
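A minimal numpy sketch of these two quantities (the helper name and the toy values are made up for illustration):

```python
# Sketch: residual standard error and R^2 for a fitted regression.
import numpy as np

def rse_and_r2(y, y_hat, n_params=2):
    """RSE and R^2; n_params=2 corresponds to simple linear regression."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)         # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)    # total sum of squares
    return np.sqrt(rss / (n - n_params)), 1.0 - rss / tss

# toy usage with made-up values
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])
print(rse_and_r2(y, y_hat))
```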

2.4 Extra study on GLM assumptions

https://zhuanlan.zhihu.com/p/22876460

Multiple Linear Regression

We also keep the notes for this part in GoodNotes.

  1. Interpreting regression coefficients: ideally the inputs are uncorrelated; correlation among inputs affects the interpretation; each input can also be compared with the output individually

  2. Use $RSS$ to judge goodness of fit

  3. Is at least one of the predictors $X_1, \cdots, X_p$ useful in predicting the response? Use the $F$ statistic

  4. Do all the predictors help to explain $Y$, or is only a subset of the predictors useful? (We cannot go through every combination of inputs, so we pick one $X_i$ by minimizing $RSS$, then a second $X_j$ by minimizing $RSS$, and continue until the selected $p$-values are acceptable, i.e. forward selection.) (Alternatively, put all predictors in and remove them one by one based on the $p$-values, i.e. backward selection.)

  5. How well does the model fit the data?

    • systematic criteria for choosing an 'optimal' member in the path of models produced by forward or backward stepwise selection;

    • Other metrics: Mallow's $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted $R^2$, cross-validation (CV)

  6. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

  7. Be careful with qualitative data; it can be encoded as a binary variable $x_1 = 0$ or $1$ for the different cases, and of course there can also be an $x_2$

  8. Removing the additive assumption: interactions and nonlinearity

    • Interaction: mutual influence, e.g. in a market, increasing $x_1$ also affects $x_2$; in that case add an $x_1 x_2$ term

    • Hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if the $p$-values associated with their coefficients are not significant.

  9. Outliers, non-constant variance of the error terms, high leverage points, and collinearity: see Section 3.3 of ISLR
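As an illustration of the interaction and hierarchy points above, here is a hedged sketch using the statsmodels formula API (assuming pandas and statsmodels are available; the column names and data are made up): the term `x1 * x2` expands to the main effects plus the interaction, which respects the hierarchy principle.

```python
# Sketch: fitting a model with an interaction term (made-up data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"x1": rng.uniform(0, 1, 200), "x2": rng.uniform(0, 1, 200)})
df["y"] = (1 + 2 * df["x1"] + 3 * df["x2"] + 4 * df["x1"] * df["x2"]
           + rng.normal(0, 0.1, 200))

# "x1 * x2" expands to x1 + x2 + x1:x2 (main effects + interaction)
fit = smf.ols("y ~ x1 * x2", data=df).fit()
print(fit.params)
print(fit.pvalues)
```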

A Brief Explanation of Gradient Descent

Gradient descent is a commonly used optimization technique for other models as well, like neural networks, which we'll explore later in this track. Here's an overview of the gradient descent algorithm for a single-parameter linear regression model:

  • select an initial value for the parameter: $a_1$

  • repeat until convergence (usually implemented with a max number of iterations):

    • calculate the error (MSE) of the model that uses the current parameter value: $MSE(a_1) = \frac{1}{n}\sum_{i=1}^n (\hat{y}^{(i)} - y^{(i)})^2$

    • calculate the derivative of the error (MSE) at the current parameter value: $\frac{d}{da_1}MSE(a_1)$

    • update the parameter value by subtracting the derivative times a constant ($\alpha$, called the learning rate): $a_1 = a_1 - \alpha \frac{d}{da_1} MSE(a_1)$
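A short sketch of this loop for a single-parameter model $\hat{y} = a_1 x$ (the data, learning rate, and iteration count are made up for illustration):

```python
# Sketch: gradient descent for a one-parameter linear model y_hat = a1 * x.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + rng.normal(0, 0.05, size=100)

a1, alpha = 0.0, 0.1                 # initial value and learning rate
for _ in range(1000):                # "repeat until convergence" via max iterations
    y_hat = a1 * x
    grad = (2 / len(x)) * np.sum((y_hat - y) * x)   # d/da1 MSE(a1)
    a1 = a1 - alpha * grad
print(a1)                            # should end up close to 3.0
```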

Reference:

  1. Book: An Introduction to Statistical Learning

  2. Notes in GoodNotes

  3. Lai notes
