6 Video Recommendation System

Clarifying requirements
- suggest posts (videos) from content creators
- business goal: improve DAU, sessions, or engagement for the viewer
- the ML objective should align with the business goal: DAU / sessions
- engagement with content: viewing, liking, commenting, and so on
- the ML model should improve individual engagement: views, posts, likes, comments
- non-functional
  - scalable
  - available
  - tooling
    debuggability, monitoring, MLOps, alerts and warnings
  - analytics
- estimation
  - DAU 500 million
Framing as ML
- Defining the ML objective
  - Maximize # of user clicks
  - Maximize # of completed videos
  - Maximize watch time
  - Maximize # of relevant videos
- Specifying input and output
  - input: user
  - output: a ranked list of videos sorted by their relevance scores
- ML category
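  - Non-personalized: rule-based filtering
  - Personalized
    Content-based filtering
    user A liked videos X and Y, so recommend a similar video Z to user A
    Collaborative filtering
    user-based: find users similar to the target user and recommend the items they engaged with
    item-based: recommend items similar to those the user already engaged with
    Hybrid filtering
    CF and content-based in parallel, or in sequence (CF, then content-based)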
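User-based collaborative filtering can be illustrated with a minimal sketch. The `likes` table and the function names are hypothetical toy values, and cosine similarity over binary like-sets is just one possible similarity choice.

```python
import math

# Hypothetical toy data: user -> set of liked video IDs.
likes = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"W"},
}

def cosine(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def user_based_recommend(target, likes, k=1):
    """Recommend unseen items liked by the k users most similar to the target."""
    neighbors = [
        (cosine(likes[target], items), user)
        for user, items in likes.items()
        if user != target
    ]
    neighbors.sort(reverse=True)
    scores = {}
    for sim, user in neighbors[:k]:
        for item in likes[user] - likes[target]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest accumulated similarity first; ties broken by item ID.
    return sorted(scores, key=lambda item: (-scores[item], item))

recommended = user_based_recommend("A", likes)
```

Here user B is the closest neighbor of user A, so the video that B liked and A has not seen is the one recommended.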
- Pipeline
  - Candidate generation -> Ranking -> Post-processing & re-ranking (fairness & diversity)
Data preparation
- User Feature
  - Demographics
  - Behavior history
    Search history
    Liked videos
    Watched videos
    Impressions
  - Interests
    Topic/tag distribution (user-topic embedding via clustering)
    Long-term vs short-term interests (long-term embedding / short-term sequence embedding, e.g. GRU/Transformer)
  - Engagement level: average daily watch time, average daily login frequency, 7-day/30-day activity
- Video features
  - Basic
    Video ID (embedding), duration, language (embedding)
    Titles (pre-trained BERT)
    Tags, e.g. dance & music (CBOW)
    Likes, views, length
  - Popularity:
    View count, like count, comment count, share count
    Trending features (time-series smoothing: click growth rate over the last hour and the last day)
  - Content embedding:
    Text (title/subtitles -> BERT embedding)
    Image (cover image -> CNN embedding)
    Video frame sequence (CLIP / Video Transformer embedding)
    Audio features (music/speech embedding)
- User-video interactions
  - user id, video id
  - interaction type (like, impression, watch, click, search, comment)
  - interaction value (e.g. 8 seconds, 46 minutes)
  - location (lat, long)
  - timestamp
  - search history
  - liked videos
  - watched videos and impression
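- Event features
  - Category, organizer, text embedding (description/title)
  - Time: event start time, time remaining
  - Location: latitude/longitude, venue type
  - Has a time window (start/end time); the event expires once the window passes.
  - Has a physical location (venue, city, GPS); user reachability is a hard constraint.
  - Has limited capacity (ticket count, registration slots); supply is not unlimited.
  - High consumption cost: requires travel, time, and sometimes money.
  - Content features are mostly text (title, description), category, and organizer, so semantic features are limited; the model relies more on temporal, geographic, and social factors.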
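- Event user features
  - Demographics: gender, age, occupation, region
  - Interests:
    Categories of events previously registered for or attended (concert, sports, tech talk, meetup)
    Interest embedding (clustered from event tags)
  - Behavior history:
    Registration / attendance / no-show history
    Event dwell time
    Event feedback (rating, review, like, share)
  - Engagement:
    Registration frequency (times per week/month)
    Whether the user is active in specific categories (e.g. concerts, lectures, outdoor events)
- Event
  - Basic metadata: title, description, organizer, type (concert, seminar, sports)
  - Time: start time, end time, whether it falls on a weekend or holiday
  - Location: city, venue, geographic coordinates (lat/long)
  - Capacity: total seats, remaining seats, whether sold out
  - Price: ticket price, discount, whether free
  - Popularity: number of registrations, views, saves, social media mentions
  - Organizer reputation: rating, feedback on past events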
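- Context features
  - User's current time: how long until the event starts
  - User's location: distance to the event venue, ease of transport
  - Device: App / Web / Mobile
  - Session context:
    Event category the user is currently browsing
    Whether the user is searching for a specific keyword ("rock concert near me")
- User-event interaction features
  - Geographic distance:
    Distance between the user's current location and the event venue
    Estimated travel time (driving / public transport)
  - Temporal availability:
    Whether the event time conflicts with events the user has already registered for
    How well the user's free time matches the event time
  - Interest match:
    Similarity between the user's historical event categories and the candidate event's category
    Cosine similarity between the user embedding and the event embedding
  - Price sensitivity:
    The user's historical registration price range vs the candidate event's price
  - Social influence:
    Whether friends have also registered for the event
    Participation in the event within the user's shared social circles
  - Freshness & novelty:
    Whether the candidate event is of a type the user has never attended
    Whether this is the user's first contact with the organizer/venue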
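Several of the user-event interaction features (geographic distance, time conflict, embedding similarity, friends attending, novelty) can be computed with a few lines of code. This is a minimal sketch; the dictionary keys and sample values are hypothetical and not a schema from these notes.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, long) points."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def has_time_conflict(start, end, booked):
    """True if [start, end) overlaps any already-registered (start, end) interval."""
    return any(start < b_end and b_start < end for b_start, b_end in booked)

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def interaction_features(user, event):
    """Build one row of user-event cross features (hypothetical field names)."""
    return {
        "distance_km": haversine_km(*user["location"], *event["location"]),
        "time_conflict": int(has_time_conflict(event["start"], event["end"], user["booked"])),
        "interest_cosine": cosine_similarity(user["embedding"], event["embedding"]),
        "friends_attending": len(user["friends"] & event["attendees"]),
        "new_category": int(event["category"] not in user["past_categories"]),
    }

user = {
    "location": (40.0, -74.0),
    "booked": [(datetime(2024, 6, 1, 18), datetime(2024, 6, 1, 20))],
    "embedding": [1.0, 0.0],
    "friends": {"u2", "u3"},
    "past_categories": {"concert"},
}
event = {
    "location": (40.0, -74.0),
    "start": datetime(2024, 6, 1, 19),
    "end": datetime(2024, 6, 1, 21),
    "embedding": [1.0, 0.0],
    "attendees": {"u3", "u9"},
    "category": "tech talk",
}
features = interaction_features(user, event)
```

Estimated travel time would normally come from a routing service; straight-line distance is used here only as a stand-in.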
Model development
- Pipeline
  - Candidate generation (recall) -> Ranking -> Post-processing & re-ranking (fairness & diversity)
- Method 1: Matrix factorization (collaborative filtering)
  - Feedback matrix
    Built entirely from observed user feedback; every entry is 0 or 1
    Factorize $A \approx U V^\top$ into two lower-dimensional matrices (user embeddings U, video embeddings V)
    Multiply the two lower-dimensional matrices to get the predicted score matrix
  - Training loss
    Squared distance over observed pairs
    Squared distance over all pairs
    Weighted combination of observed and unobserved pairs
    $loss = \sum_{(i,j) \in obs} (A_{ij} - U_i \cdot V_j)^2 + w \sum_{(i,j) \notin obs} (A_{ij} - U_i \cdot V_j)^2$
  - Optimization algorithm
    SGD
    WALS
  - Pros & Cons
    Pros: fast training and serving
    Cons: relies only on user-video interactions
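The weighted matrix-factorization loss can be minimized with plain SGD, as in the sketch below. The toy matrix `A`, the embedding dimension, the weight `w`, and the learning rate are illustrative assumptions; production systems more often use WALS or a library implementation.

```python
import random

def train_mf(A, dim=2, w=0.1, lr=0.05, epochs=300, seed=0):
    """Factorize the binary feedback matrix A as A ~ U V^T with SGD.

    Observed pairs (A[i][j] == 1) get weight 1; unobserved pairs get weight w.
    """
    rng = random.Random(seed)
    n_users, n_items = len(A), len(A[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_items)]
    pairs = [(i, j) for i in range(n_users) for j in range(n_items)]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            weight = 1.0 if A[i][j] == 1 else w
            err = A[i][j] - predict(U, V, i, j)
            for d in range(dim):
                u, v = U[i][d], V[j][d]
                U[i][d] += lr * weight * err * v  # gradient step on the user factor
                V[j][d] += lr * weight * err * u  # gradient step on the item factor
    return U, V

def predict(U, V, i, j):
    """Predicted score for user i and item j: dot product of their embeddings."""
    return sum(a * b for a, b in zip(U[i], V[j]))

def weighted_loss(A, U, V, w=0.1):
    """Squared error over observed pairs plus w times squared error over unobserved pairs."""
    total = 0.0
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            weight = 1.0 if a == 1 else w
            total += weight * (a - predict(U, V, i, j)) ** 2
    return total

# Hypothetical feedback matrix: rows are users, columns are videos.
A = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
U, V = train_mf(A)
```

After training, `predict(U, V, 0, 0)` (an observed pair) should score higher than `predict(U, V, 0, 2)` (an unobserved pair).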
- Method 2: Two-tower model (binary classification problem)
  - user tower + video tower
  - user features -> user encoder (DNN) -> user embedding
  - video features -> video encoder (DNN) -> video embedding
  - similarity -> dot product (e.g. 0.7) -> sigmoid & cross-entropy loss against the label
    Either store only available events when building the index (pre-filter), or filter the retrieved results afterwards (post-filter).
    SGD
    AUC, ROC curve 80%
  - Pros & Cons
    Pros: utilizes user features & handles new users
    Cons: slower serving, training is more expensive
  - Input
    Two-Tower = User Tower + Item Tower
    User tower input
    User features: demographics, interest tags, history-sequence embedding, etc.
    Output: user vector $u \in \mathbb{R}^d$
    Item tower input
    Item/event features: category, text embedding, location, price, etc.
    Output: item vector $v \in \mathbb{R}^d$
    Finally, compute the similarity between $u$ and $v$:
    $s(u,v) = u^\top v$ or $s(u,v) = \cos(u,v)$
  - Output
    Task: binary classification, i.e. predict whether user $u$ will give positive feedback (click / sign-up / purchase) on item $v$.
    Output probability:
    $\hat{y} = \sigma(s(u,v)) \in (0,1)$
    where $\sigma$ is the sigmoid function.
  - Loss function (training loss)
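A forward pass through a two-tower model, followed by the binary cross-entropy loss, might look like the sketch below. Each tower is reduced to a single random linear layer with a tanh activation purely for brevity; real encoders are deeper DNNs, and all names and feature values here are hypothetical.

```python
import math
import random

def make_tower(in_dim, out_dim, rng):
    """Weights of one linear layer, standing in for a DNN encoder (assumption)."""
    return [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def encode(tower, features):
    """Map raw features to an embedding: linear layer followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in tower]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(user_tower, video_tower, user_features, video_features):
    """Probability of positive feedback: sigmoid of the dot-product similarity."""
    u = encode(user_tower, user_features)
    v = encode(video_tower, video_features)
    similarity = sum(a * b for a, b in zip(u, v))
    return sigmoid(similarity)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Loss for one (label, prediction) pair; eps guards against log(0)."""
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

rng = random.Random(0)
user_tower = make_tower(in_dim=3, out_dim=4, rng=rng)   # 3 user features -> 4-d embedding
video_tower = make_tower(in_dim=2, out_dim=4, rng=rng)  # 2 video features -> 4-d embedding

y_hat = predict(user_tower, video_tower, [0.3, 1.0, 0.5], [0.8, 0.1])
loss = binary_cross_entropy(1, y_hat)  # label 1 = the user engaged with the video
```

Because the two towers only meet at the dot product, video embeddings can be precomputed offline and indexed, which is what makes nearest-neighbor retrieval possible at serving time.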
- ANN
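  - The two-tower model embeds users and items into the same space; ANN is the accelerator that retrieves nearest-neighbor items in that space efficiently. Without ANN, a two-tower model is not practical for large-scale recommendation.
  - ANN (Approximate Nearest Neighbor): instead of comparing against every vector exactly, use an index structure to narrow the search quickly, then run exact comparisons only on a small candidate subset.
    e.g. k-means clustering into buckets
    First find the bucket the user embedding belongs to, then search only within that bucket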
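The bucket idea can be sketched as a tiny inverted-file index. The centroids are assumed to have been produced by k-means beforehand, and Euclidean distance is used for both bucket assignment and in-bucket ranking; the vectors and IDs are hypothetical.

```python
import math

def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))

def build_index(item_vectors, centroids):
    """Assign every item to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for item_id, vec in item_vectors.items():
        buckets[nearest_centroid(vec, centroids)].append(item_id)
    return buckets

def ann_search(query, item_vectors, centroids, buckets, top_k=2):
    """Search only the query's bucket, then rank its items exactly."""
    bucket = buckets[nearest_centroid(query, centroids)]
    return sorted(bucket, key=lambda i: math.dist(query, item_vectors[i]))[:top_k]

# Hypothetical 2-d embeddings and precomputed k-means centroids.
item_vectors = {
    "v1": (0.9, 0.1),
    "v2": (0.8, 0.2),
    "v3": (0.1, 0.9),
    "v4": (0.2, 0.8),
}
centroids = [(1.0, 0.0), (0.0, 1.0)]
buckets = build_index(item_vectors, centroids)
results = ann_search((1.0, 0.0), item_vectors, centroids, buckets)
```

The search is approximate because a true neighbor that landed in a different bucket is never examined; probing several nearby buckets trades speed for recall.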
- Reranking
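  - Option A: GBDT (XGBoost / LightGBM)
    Input: all features concatenated into a tabular row
    User features
    Age = 25
    Average spend = 30
    Event features
    Ticket price = 20
    Event category = concert
    Interaction features (user-event interaction)
    Distance = 5 km
    Travel time = 15 min
    Time conflict = 0
    Number of friends attending = 3
    Concatenating them into a single row gives one model input.
    Output: a predicted probability P(y=1 | user, event).
    Pros: highly interpretable, quick to put into production.
    Cons: weak at modeling sequential behavior.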
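  - Option B: Deep CTR model (common in industry)
    Embedding layer: embed discrete features such as user, event, category, time, and location.
    Behavior sequence modeling: DIN/DIEN/Transformer, applying attention between the user's historical behavior and the current candidate event.
    Interaction layer: Wide&Deep, DeepFM, DCN, modeling higher-order interactions (user interest × event attributes × context).
    Multi-objective output:
    Click probability (CTR)
    Sign-up probability (Signup)
    Attendance probability (Attendance)
  - Option C: Multi-task learning
    Use a shared bottom + task-specific towers.
    Main task: attendance prediction (Attendance)
    Auxiliary tasks: click and sign-up (help mitigate label sparsity).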
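  - Training objective
    Single-task:
    $L = -\sum \left[ y \log \hat{y} + (1-y) \log(1-\hat{y}) \right]$
    where $y=1$ means the user actually signed up / attended.
    Multi-task:
    $L = w_1 \cdot L_{CTR} + w_2 \cdot L_{Signup} + w_3 \cdot L_{Attendance}$
    Training samples:
    Positive samples = events the user actually attended / signed up for
    Negative samples = events shown to the user that they did not sign up for / attend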
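The multi-task objective is a weighted sum of per-task binary cross-entropy losses. The sketch below computes it for a small batch; the labels, predictions, and task weights are made-up illustrative values.

```python
import math

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one example; eps guards against log(0)."""
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))

def task_loss(labels, preds):
    """Single-task loss: BCE summed over the batch."""
    return sum(bce(y, p) for y, p in zip(labels, preds))

def multi_task_loss(labels, preds, weights):
    """Weighted sum of the per-task losses."""
    return sum(w * task_loss(labels[task], preds[task]) for task, w in weights.items())

# Hypothetical batch of two impressions with one label per task.
labels = {"ctr": [1, 0], "signup": [1, 0], "attendance": [0, 0]}
preds = {"ctr": [0.9, 0.2], "signup": [0.6, 0.1], "attendance": [0.3, 0.05]}
weights = {"ctr": 0.2, "signup": 0.3, "attendance": 0.5}

loss = multi_task_loss(labels, preds, weights)
```

Setting a task's weight to zero removes it from the objective, which is a quick way to compare single-task and multi-task training.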
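  - Online serving flow:
    Input user features + candidate event features.
    The ranking model outputs scores (CTR / CVR / attendance probability).
    Combine the scores according to business goals, for example:
    $\text{Final Score} = \alpha \cdot \text{CTR} + \beta \cdot \text{Signup} + \gamma \cdot \text{Attendance}$
    Sort by score and return the top N events.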
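The last two serving steps, combining the predicted probabilities into a final score and returning the top N, can be sketched as follows. The weights and the candidate predictions are hypothetical; in practice the weights are tuned against business metrics.

```python
def final_score(pred, alpha=0.2, beta=0.3, gamma=0.5):
    """Blend the per-task probabilities into one ranking score."""
    return alpha * pred["ctr"] + beta * pred["signup"] + gamma * pred["attendance"]

def top_n(candidates, n, **weights):
    """Sort candidates by final score (descending) and keep the first n."""
    ranked = sorted(candidates, key=lambda c: final_score(c, **weights), reverse=True)
    return ranked[:n]

# Hypothetical model outputs for three candidate events.
candidates = [
    {"id": "e1", "ctr": 0.9, "signup": 0.1, "attendance": 0.05},
    {"id": "e2", "ctr": 0.3, "signup": 0.4, "attendance": 0.5},
    {"id": "e3", "ctr": 0.5, "signup": 0.2, "attendance": 0.1},
]
top_ids = [c["id"] for c in top_n(candidates, 2)]
```

With these weights, event e2 outranks the more clickable e1 because attendance carries the largest weight.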
Evaluation
- Offline metrics
  - Precision@k
    Measures the proportion of relevant videos among the top k recommended videos.
    Multiple k values (1, 5, 10) can be used
  - Recall@k
    Fraction of all relevant items retrieved in the top-k.
  - mAP
    Measures the ranking quality of the recommended videos
  - Diversity
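The ranking metrics above can be computed directly from a ranked list and a set of relevant items. This is a minimal sketch with hypothetical item IDs; mAP is the mean of the per-user average precision.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevants):
    """Average precision averaged over users."""
    pairs = list(zip(rankings, relevants))
    return sum(average_precision(r, rel) for r, rel in pairs) / len(pairs)

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p_at_2 = precision_at_k(ranked, relevant, 2)
r_at_2 = recall_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
```

For this list, relevant items sit at ranks 1 and 3, so average precision is (1/1 + 2/3) / 2.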
- Online metrics
  - CTR
    number of clicked videos / total number of recommended videos
  - # of completed videos
  - Total watch time
  - Explicit user feedback
  - CVR (conversion rate)
    Definition: Ratio of conversions to clicks.
  - Retention
    Definition: Fraction of users who return after d days.
  - DAU/MAU (stickiness)
    Definition: Daily active users over monthly active users.
  - Revenue lift
    Definition: Percentage increase in revenue compared to control.
  - Latency
    Definition: Time taken to serve a request, usually measured at P95/P99.
  - Throughput
    Definition: Number of requests processed per second.
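Several of these online metrics are simple ratios or percentiles over logged events. The sketch below uses made-up counts and a nearest-rank percentile, which is one of several common percentile definitions.

```python
import math

def ratio(numerator, denominator):
    """Safe ratio used for CTR, CVR, and DAU/MAU."""
    return numerator / denominator if denominator else 0.0

def revenue_lift(treatment, control):
    """Percentage increase in revenue relative to the control group."""
    return (treatment - control) / control * 100.0

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical logged numbers.
ctr = ratio(50, 1000)          # clicks / impressions
cvr = ratio(5, 50)             # conversions / clicks
stickiness = ratio(200, 800)   # DAU / MAU
lift = revenue_lift(110.0, 100.0)
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```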
Serving
- Candidate generation
  - two-tower neural network
  - retrieve the most similar videos from the approximate nearest neighbor (ANN) service
    these videos are ranked by similarity in the embedding space and returned as the output
  - prefer efficiency over accuracy; not concerned about false positives
  - can apply k candidate generators to diversify recommended videos (relevant, popular & trending)
- Scoring
  - prioritize accuracy over efficiency
  - query user + thousands of candidate videos -> scoring (two-tower neural network model) -> dozens of videos
  - could choose content-based filtering and pick a model that relies on video features
- Re-ranking
  - adding additional criteria or constraints
  - may use standalone ML models to determine if a video is clickbait
  - important things to consider
    region-restricted videos
    video freshness
    videos spreading misinformation
    duplicate or near-duplicate videos
    fairness and bias
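A re-ranking pass over the scored list can apply these constraints as filters. In the sketch below the clickbait and misinformation flags are assumed to come from standalone classifiers, `fingerprint` is a hypothetical content hash for near-duplicate detection, and a per-creator cap serves as a crude diversity rule.

```python
def make_video(video_id, creator, fingerprint, blocked=(), clickbait=False, misinfo=False):
    """Build a scored-video record (hypothetical schema)."""
    return {
        "id": video_id,
        "creator": creator,
        "fingerprint": fingerprint,
        "blocked_regions": set(blocked),
        "is_clickbait": clickbait,
        "is_misinformation": misinfo,
    }

def rerank(scored_videos, user_region, max_per_creator=2):
    """Filter a list already sorted by score, preserving its order."""
    seen_fingerprints = set()
    per_creator = {}
    result = []
    for video in scored_videos:
        if user_region in video["blocked_regions"]:
            continue  # region-restricted
        if video["is_clickbait"] or video["is_misinformation"]:
            continue  # flagged by a standalone model
        if video["fingerprint"] in seen_fingerprints:
            continue  # duplicate or near-duplicate
        count = per_creator.get(video["creator"], 0)
        if count >= max_per_creator:
            continue  # diversity cap per creator
        seen_fingerprints.add(video["fingerprint"])
        per_creator[video["creator"]] = count + 1
        result.append(video["id"])
    return result

scored = [
    make_video("v1", "c1", "f1"),
    make_video("v2", "c1", "f1"),                  # duplicate of v1
    make_video("v3", "c2", "f3", blocked=["DE"]),  # restricted in the user's region
    make_video("v4", "c2", "f4", clickbait=True),
    make_video("v5", "c1", "f5"),
    make_video("v6", "c1", "f6"),                  # exceeds the per-creator cap
    make_video("v7", "c3", "f7"),
]
final_ids = rerank(scored, user_region="DE")
```

Freshness and fairness adjustments are usually score boosts or quotas rather than hard filters, so they are left out of this sketch.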
- Challenges in recommendation system
  - Serving speed
  - Precision
  - Diversity
  - Cold-start problem
  - Training scalability
Other talking points
