Linear Model Cheat Sheet


Linear Regression

Definition

  • Linear regression is a linear model: it assumes a linear relationship between the input and the output. More technically, $y$ can be written as a linear combination of $X$.

Representation

  • $Y = \beta_0 + \beta_1 X + \epsilon$

How to determine this model? Loss function?

  • Deriving the loss function via least squares

    • $\text{Loss} = \min_{\beta_0,\beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 X_i)^2$

  • Deriving the loss function via maximum likelihood estimation (MLE)

    • What is MLE? A method for estimating the parameters of a statistical model: given the observations, find the parameter values that maximize the likelihood, i.e., that make the observed data most probable under the model.

    • Derivation (key: what distribution does $Y$ follow?)

      • $p(Y_i \mid X_i) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(Y_i - \bar{\beta}_0 - \bar{\beta}_1 X_i)^2}$

      • $L(\bar{\beta}_0, \bar{\beta}_1, \sigma^2) = p(Y_1,\cdots,Y_n \mid X_1,\cdots,X_n) = \frac{1}{\sigma^n (2\pi)^{n/2}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (Y_i - \bar{\beta}_0 - \bar{\beta}_1 X_i)^2}$

  • Relationship between the two

    • Under the linear-regression assumptions (i.i.d. Gaussian noise), the two methods give the same result; see the sketch after this list.

    • One comes from statistics, the other from optimization.

  • A reminder

    • Noise is caused by the data and is inherent; error is caused by the model and is man-made. They are two different concepts.
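
Why the two agree: taking the negative log of the likelihood above turns the product into a sum, and for a fixed $\sigma$ the only term that depends on the coefficients is the sum of squared errors.

```latex
% Negative log-likelihood of the Gaussian model above
-\log L(\bar{\beta}_0,\bar{\beta}_1,\sigma^2)
  = n\log\sigma + \tfrac{n}{2}\log(2\pi)
  + \tfrac{1}{2\sigma^2}\sum_{i=1}^n (Y_i-\bar{\beta}_0-\bar{\beta}_1 X_i)^2
% For fixed sigma, maximizing the likelihood over the coefficients
% is exactly minimizing the least-squares loss:
\arg\max_{\beta_0,\beta_1} L
  = \arg\min_{\beta_0,\beta_1} \sum_{i=1}^n (Y_i-\beta_0-\beta_1 X_i)^2
```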

How is the performance of this model?

  • (This looks at the quality of the model itself, evaluating its own parameters.)

  • We can only guarantee, systematically, that it is unbiased ($E(Y) = \mu$).

  • Null hypothesis: $\beta_1 = 0$

    • The goal is to judge statistically how far this sample deviates from the population, i.e., assessing the accuracy of the coefficient estimation; one can use a $p$-value or a confidence interval.

    • The chosen statistic $t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$ follows a $t$-distribution with $n-2$ degrees of freedom, assuming $\beta_1 = 0$.
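
A minimal sketch, on synthetic data, of computing this $t$-statistic by hand with NumPy/SciPy (a library such as statsmodels reports the same numbers in its OLS summary):

```python
import numpy as np
from scipy import stats

# Synthetic data (illustrative only): true beta1 = 2
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Least-squares estimates for Y = b0 + b1*X + eps
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()

# Residual standard error with n-2 degrees of freedom
resid = y - (b0 + b1 * x)
rse = np.sqrt(resid @ resid / (n - 2))

# Standard error of beta1 and the t statistic under H0: beta1 = 0
se_b1 = rse / np.sqrt(((x - x.mean()) ** 2).sum())
t_stat = (b1 - 0) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(f"beta1={b1:.3f}, t={t_stat:.2f}, p={p_value:.3g}")
```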

How can we compare this model with other models?

  • (This judges the model from the outside, i.e., the extent to which the model fits the data.)

  • assessing the overall accuracy

  • $RSE = \sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}\sum_i^{n}(y_i - \hat{y}_i)^2}$

  • $R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$

    • where $TSS = \sum_i^{n}(y_i - \bar{y})^2$ is the total sum of squares and $RSS = \sum_i^{n}(y_i - \hat{y}_i)^2$ is the residual sum of squares (how much error remains)

    • In simple regression, $R^2$ equals the squared correlation between $X$ and $Y$

    • It is the proportion of variability in $Y$ that can be explained using $X$
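
A small helper, assuming $\hat{y}$ comes from a two-parameter (simple regression) fit so that RSE uses $n-2$ degrees of freedom, implementing the two formulas above:

```python
import numpy as np

def rse_and_r2(y, y_hat):
    """Residual standard error and R^2 for a simple (one-feature) fit."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    n = len(y)
    rss = ((y - y_hat) ** 2).sum()      # residual sum of squares
    tss = ((y - y.mean()) ** 2).sum()   # total sum of squares
    rse = np.sqrt(rss / (n - 2))        # n-2: two estimated parameters
    r2 = 1 - rss / tss
    return rse, r2
```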

Multiple Linear Regression

Logistic Regression

Understanding the shape / the range of values

  • The approach: $x \rightarrow p(x) \rightarrow y$

    • First use $x$ to fit the probability $p(x)$

    • Then use $p$ to fit $y$ (by applying a threshold)

    • $p(x) = \frac{e^{f(x)}}{1+e^{f(x)}} = \frac{1}{1+e^{-f(x)}}$

  • For example, $Y$ is whether or not the patient has the disease, $p$ acts like a tumor score, and $X$ is the tumor-related input (size, location, etc.)

    • $p(Y = y_i) = p^{y_i}(1-p)^{1-y_i},\ 0 < p < 1,\ y_i = 0, 1$

    • $\log\frac{p(X)}{1-p(X)} = \beta_0 + \beta_1 X$
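
A tiny sketch of the $x \rightarrow p(x) \rightarrow y$ pipeline, with purely illustrative numbers:

```python
import numpy as np

def sigmoid(z):
    """Map a linear score f(x) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear score, its probability, and the thresholded class label
# (0.5 is the usual default cutoff).
f_x = np.array([-2.0, -0.3, 0.0, 1.5])
p_x = sigmoid(f_x)            # x -> p(x)
y = (p_x >= 0.5).astype(int)  # p(x) -> y via a threshold
print(p_x.round(3), y)        # [0.119 0.426 0.5 0.818] [0 0 1 1]
```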

How to determine this model? Loss function?

  • Using MLE to get the loss function (a classic interview question)

    • Key: what distribution does $Y$ follow?

      • $P(Y = y_i \mid X) = p^{y_i}(1-p)^{1-y_i},\ 0 < p < 1,\ y_i = 0, 1$

    • $p = h_{\beta}(x_i) = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}$

    • $L(\hat{\beta}_0, \hat{\beta}_1) = \prod_{i=1}^{n} P(Y_i \mid X_i) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i} = \prod_{i=1}^{n} h_{\beta}(x_i)^{y_i}(1-h_{\beta}(x_i))^{1-y_i}$

    • Taking the negative log gives the cross-entropy loss $-\sum_{i=1}^{n}\big[y_i \log h_{\beta}(x_i) + (1-y_i)\log(1-h_{\beta}(x_i))\big]$

  • How do we fit the parameters? There is no closed form here; see the gradient-descent sketch below. Is there something like linear regression's $t$-distribution? (Next section.)
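
A minimal sketch, on synthetic data, of fitting the parameters by gradient descent on the cross-entropy loss derived above:

```python
import numpy as np

# Synthetic data (illustrative only): true beta = (0.5, 2.0)
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))
y = rng.binomial(1, p_true)

beta = np.zeros(2)                       # [beta0, beta1]
X = np.column_stack([np.ones(n), x])     # add intercept column
lr = 0.1
for _ in range(2000):
    h = 1 / (1 + np.exp(-X @ beta))      # h_beta(x_i)
    grad = X.T @ (h - y) / n             # gradient of mean cross-entropy
    beta -= lr * grad
print(beta)                              # should be near (0.5, 2.0)
```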

How is the performance of this model?

  • There is an analogue of the $t$-test: in standard practice each coefficient is assessed with a Wald $z$-statistic, $z = \hat{\beta}_1 / SE(\hat{\beta}_1)$, compared against a standard normal.

How can we compare this model with other models?

  • link: Confusion matrix/AUC

Compare with linear regression

  • The assumption about $y$ is different

  • Step 1: "given $X$, obtain the distribution of $Y$": this is the problem a machine learning model can solve from data

  • Step 2: "from the distribution of $Y$, decide the value of $y$": how to use the distribution is your own decision

Multiple Logistic Regression

  • $p(x) = h_{\beta}(x_i) = \frac{1}{1+e^{-g(x)}},\ \text{where } g(x) = \beta_0 + \beta_{1,1}x_1 + \beta_{1,2}x_1^2 + \beta_2 x_2$

  • You will face an overfitting/underfitting problem:

    • Model complexity affects classification performance

    • A more complex model can fit the training data better, but is not necessarily better on the testing data

    • Fitting too much is overfitting; fitting too little is underfitting

  • Often, a multi-class problem can be reduced to several binary classification problems and solved pairwise (one-vs-one); see the sketch below
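
A sketch of the pairwise (one-vs-one) decomposition using scikit-learn's OneVsOneClassifier; the dataset and settings are just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

# A 3-class problem becomes 3*(3-1)/2 = 3 binary logistic regressions,
# one per pair of classes; prediction is by pairwise voting.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = OneVsOneClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print(len(clf.estimators_), "pairwise classifiers")
print("test accuracy:", clf.score(X_te, y_te))
```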

Regularization——Ridge regression/Lasso regression

SVM

  • $E\big(\max(w^Tx + b,\ 0)\big)$

  • Maximum margin classifier: $\max \frac{1}{\|w\|},\quad \text{s.t. } y_i(w^Tx_i+b) \geq 1,\quad i = 1,\cdots,n$

  • $\min \frac{1}{2}\|w\|^2,\quad \text{s.t. } y_i(w^Tx_i+b) \geq 1,\quad i = 1,\cdots,n$

    • Lagrangian for the dual: $\mathcal{L}(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i\big(y_i(w^Tx_i+b) - 1\big)$

    • $\mathcal{L}(w,b,\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j$ and $w = \sum_{i=1}^n \alpha_i y_i x_i$

  • Switching to a kernel

    • $f(x) = \sum_{i=1}^N w_i \phi_i(x) + b$ becomes $f(x) = \sum_{i=1}^l \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b$

    • $\alpha$ can be obtained from the dual (see the sketch below)

      • $\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j \langle \phi(x_i), \phi(x_j) \rangle \quad \text{s.t. } \alpha_i \geq 0,\ i = 1,\cdots,n;\ \sum_{i=1}^n \alpha_i y_i = 0$

      • $\max_{\alpha} \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j y_i y_j x_i^T x_j \quad \text{s.t. } \alpha_i \geq 0,\ i = 1,\cdots,n;\ \sum_{i=1}^n \alpha_i y_i = 0$
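
In practice, the soft-margin version of this dual (with the extra constraint $\alpha_i \leq C$) is what scikit-learn's SVC solves; its dual_coef_ attribute exposes the learned $\alpha_i y_i$ at the support vectors. A sketch on illustrative data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative data; SVC with an RBF kernel solves the dual above
# with <phi(x_i), phi(x_j)> replaced by k(x_i, x_j).
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print("support vectors:", clf.support_vectors_.shape[0])
print("alpha_i * y_i:", clf.dual_coef_[0][:5])  # nonzero only at support vectors
```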