
Model Performance

1 Why do model evaluation?

  • This step comes after data cleaning, feature engineering, and modeling. Put simply, it asks: how good is your model, really?

  • It also answers the question: how can we compare this model with other models?

2 How to evaluate a model?

What data to evaluate

  • select and split data

What method to use (experimental evaluation method / Model Selection)

Cross Validation

  • What is Cross Validation?

    • Cross validation assesses how well your model's results will generalize to another, independent data set.

    • Fitting a model and testing it on the same data is a methodological mistake.

    • There are several cross validation techniques; the most popular is k-fold cross validation.

K-fold Cross Validation
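
As a rough illustration of k-fold cross validation, the sketch below splits the sample indices into k folds, holds out each fold once for validation, and averages the scores. The names `k_fold_indices`, `cross_val_score`, `fit_mean`, and `neg_mse` are hypothetical helpers written for this example, not part of any library, and the toy data is made up.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_score(fit, score, X, y, k=5, seed=0):
    """Average validation score over k folds.

    fit(X_train, y_train) returns a model;
    score(model, X_val, y_val) returns a float (higher is better).
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return sum(scores) / len(scores)

# Toy "model": always predicts the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def neg_mse(model, X_val, y_val):
    return -sum((t - model) ** 2 for t in y_val) / len(y_val)

X = [[i] for i in range(10)]
y = [2.0 * i + 1.0 for i in range(10)]
print(cross_val_score(fit_mean, neg_mse, X, y, k=5))
```

Every sample is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out set would.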

What metrics to compare (performance measure)

Classification-Confusion matrix

  • TP/FN/FP/TN; True/False; Positive/Negative

    • TP: true positive; FN: false negative (Type II error)

    • FP: false positive (Type I error); TN: true negative

    • True/False says whether your prediction was correct or wrong

    • Positive/Negative says which class you predicted

  • Accuracy

    • $$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}$$

    • The fraction of all predictions that are correct

  • Precision

    • $$P=\frac{TP}{TP+FP}$$

    • Precision: how much can your positive predictions be trusted?

    • Of all the samples you predicted as positive, how many are truly positive?

    • Example: a spam filter needs high precision (whatever it flags must really be spam)

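  • Recall (Sensitivity)

    • $$R=\frac{TP}{TP+FN}$$

    • Recall: were all the actual positives found?

    • Of all the truly positive samples, how many did you correctly identify as positive?

    • Disease screening / cybersecurity: these need high recall (better a false alarm than a missed case)

    • If false negatives are the costly error, focus on recall; if false positives are the costly error, focus on precision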

  • F1

    • $$F1=\frac{2TP}{2TP+FP+FN}=\frac{2}{\frac{1}{recall}+\frac{1}{precision}}$$

    • Combines recall and precision into a single score (their harmonic mean)

    • Higher is better
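
The four formulas above can be checked on a small example. The following is a minimal sketch, assuming binary labels with 1 as the positive class; `confusion_counts` and `classification_metrics` are hypothetical helper names, and the labels are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1          # predicted positive, actually negative (Type I)
        elif t == positive:
            fn += 1          # predicted negative, actually positive (Type II)
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Guard against empty denominators (no predicted or no actual positives).
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0,
    }

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))        # (3, 1, 1, 5)
print(classification_metrics(y_true, y_pred))  # accuracy 0.8; precision, recall, f1 all 0.75
```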

Classification-ROC

  • Receiver operating characteristic (ROC) curve: it is traced out by varying the decision threshold, which determines whether a sample is classified as positive or negative. In logistic regression, for example, you might label scores above 0.8 as positive, but you could just as well choose 0.6.

  • Define False Positive Rate (FPR) as the X axis and True Positive Rate (TPR) as the Y axis

  • The ROC curve is the plot of TPR versus FPR obtained by varying the threshold.

  • Special Points in ROC space

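    • Best case: (0,1); worst case: (1,0)

    • Points on the diagonal:

      • When the threshold is set to its highest value, every sample is predicted negative, which gives the point (0,0).

      • When the threshold is set to its lowest value, every sample is predicted positive, which gives the point (1,1).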

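    • Why does the equal error rate (EER) mean FPR = FNR?

      • Known identity: FNR = 1 - TP / (number of actual positives) = 1 - TPR

      • At the point where the ROC curve intersects the anti-diagonal (the line from (0,1) to (1,0)): FPR (x) = 1 - TPR (y)

      • Therefore FPR = FNR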
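
To make the threshold sweep concrete, here is a small sketch that computes the ROC points for a handful of scored samples and then picks the swept point where FPR and FNR = 1 - TPR are closest, as an approximation of the equal error rate. The function names and the sample scores are assumptions made for this example, and both classes must be present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr, threshold) triples; predict positive when score >= threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos  # assumes both classes are present
    # Start above every score so that everything is predicted negative: point (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos, thr))
    return points

def approx_equal_error_point(points):
    """Swept point where FPR and FNR = 1 - TPR are closest to each other."""
    return min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr, thr in points:
    print(f"threshold={thr}: FPR={fpr:.2f}, TPR={tpr:.2f}")

fpr, tpr, thr = approx_equal_error_point(points)
print("approx. EER threshold:", thr, "FPR:", fpr, "FNR:", 1.0 - tpr)
```

The first point is (0,0) (highest threshold) and the last is (1,1) (lowest threshold), matching the two special points on the diagonal described above.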

Classification-AUC

  • Want a single number that summarizes the whole ROC curve?

    • Area under the ROC curve (AUC)

    • AUC values lie in [0,1]

    • The larger the value is, the better classification performance your classifier has.

    • AUC value is a probability value.

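  • Interview question: in machine learning, a value between 0 and 1 usually invites a probabilistic reading. How can AUC be interpreted as a probability?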

    • ROC AUC is the probability that a randomly-chosen positive example is ranked more highly than a randomly-chosen negative example.
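
The probabilistic reading can be checked numerically. The sketch below computes AUC two ways: as the fraction of (positive, negative) pairs in which the positive sample gets the higher score, and as the trapezoidal area under the swept ROC curve. Counting ties as half a win is an assumption of this sketch (it is what makes the two numbers agree when scores tie); the function names and data are hypothetical.

```python
def auc_pairwise(y_true, scores):
    """P(score of a random positive > score of a random negative); ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auc_trapezoid(y_true, scores):
    """Area under the ROC curve via the trapezoidal rule over a threshold sweep."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    curve = []
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        curve.append((fp / n_neg, tp / n_pos))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_pairwise(y_true, scores))   # 8 of 9 pairs are ranked correctly
print(auc_trapezoid(y_true, scores))  # same value, up to floating-point rounding
```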

Regression

  • Empirical risk (the quantity minimized in ERM, empirical risk minimization): the average loss over the samples

  • Coefficient of determination ($$R^2$$)
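
The notes do not give formulas for the regression metrics, so the sketch below relies on two standard definitions as assumptions: empirical risk under squared loss (mean squared error), and $$R^2 = 1 - SS_{res}/SS_{tot}$$. The helper names `mse` and `r_squared` and the sample values are made up for illustration.

```python
def mse(y_true, y_pred):
    """Empirical risk under squared loss: the average of (y - y_hat)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # assumes y_true is not constant
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
print(mse(y_true, y_pred))        # 0.125
print(r_squared(y_true, y_pred))  # 0.975
```

Unlike the squared error, $$R^2$$ is scale-free: 1 means a perfect fit, and 0 means no better than always predicting the mean of the targets.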

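3 Failure Analysis

  • Categorize your failures, then apply a different strategy to each category:

    • Fixable by changing the model

    • Fixable by tuning hyperparameters

    • Fixable with data (you may be missing part of the data)

    • Not fixable

  • The goal of failure analysis is to further improve model performance through iterative development and retraining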

  • Summary

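    • Cross validation: find the skeleton (model selection / model structure)

    • Merge the validation data into the training data and retrain: find the flesh, i.e. the parameters, such as a and b in y = ax + b

    • Model evaluation: ROC, AUC, Precision, Recall

    • Failure analysis: pinpoint where the problem is, then revisit steps 1-3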

Machine Learning End-to-End Pipeline

Business Design

Data acquisition

  • collect data from outside system

  • pre-process it into the required format

Data preparation

  • Clean, transform, validate and select data

  • annotation

  • feature extraction

  • prepare data into: training, validation and evaluation datasets
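
One simple way to produce the three datasets is a shuffled split by fraction. This is only a sketch: `train_val_test_split` is a hypothetical helper, the 60/20/20 proportions are arbitrary, and a plain random split assumes the samples are independent (time series or grouped data would need a different scheme).

```python
import random

def train_val_test_split(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into training, validation and evaluation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100), val_frac=0.2, test_frac=0.2)
print(len(train), len(val), len(test))  # 60 20 20
```

The evaluation set is set aside first and is not touched during model selection or tuning, so the final metrics are measured on data the model has never seen.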

Training & Validation

  • Select the model

  • Train the model

  • Tune the model

  • Validate the training process

Testing Evaluation

  • evaluate the performance metrics on the evaluation dataset

  • analyze failures

Deployment & Inference

  • integrate the model into production apps

  • configure models based on system needs

  • predict on real-life data

  • log failures/errors
