[advertising system] incremental training & feature access / feature elimination
2022-07-05 10:57:00 【CC‘s World】
1. Incremental training
Training sets can be very large; tens of millions of records is common. Tens of millions of rows may not sound like much, but with hundreds of features the dataset becomes enormous, and storing it as a NumPy float array will certainly blow up memory (for example, 10 million rows × 300 float64 features is about 24 GB). This is the situation I ran into, which led me to consider training the model incrementally.
For very large datasets there are usually a few options: (1) reduce the dimensionality of the data; (2) train incrementally, using streaming or stream-like processing; (3) use a bigger machine with more memory, or a Spark cluster.
Incremental training means essentially the same thing as online learning. The typical example of online learning is logistic regression optimized with SGD: the parameters are first initialized from existing data, then updated with each new sample that arrives online, so the model keeps improving as time passes. This avoids having to update the model offline.
Incremental training serves two main purposes: one is to find a way to use all of the data, the other is to find a way to use new data promptly. It improves the timeliness of the model, increases the usable sample size, and saves cluster resources.
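As a minimal sketch of this idea (not the code of any particular framework), the snippet below implements logistic regression with one SGD step per incoming sample. The class name `OnlineLogisticRegression`, the learning rate, and the toy stream are assumptions made for illustration.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one sample at a time with SGD."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        z = max(-30.0, min(30.0, z))  # clamp to keep math.exp from overflowing
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, features, label):
        # Gradient of the log loss for a single sample is (p - y) * x.
        error = self.predict_proba(features) - label
        for i, x in enumerate(features):
            self.weights[i] -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


# Toy stream (illustrative only): positive feature -> label 1, negative -> label 0.
model = OnlineLogisticRegression(num_features=1)
stream = [([1.0], 1), ([-1.0], 0)] * 100
for features, label in stream:
    model.partial_fit(features, label)  # only one sample is in memory at a time

p_pos = model.predict_proba([1.0])
p_neg = model.predict_proba([-1.0])
```

Because each update touches a single sample, the full dataset never has to be loaded at once, and the same `partial_fit` call can keep consuming new data after the model is deployed.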
Recommendation scenarios usually involve a large number of sparse parameters because they introduce many ID-type features. In the classic YouTube DNN model, for example, the videos a user has watched and the user's historical search tokens serve as the main embedding features. According to the paper, the candidate videos and the search tokens in YouTube DNN are each on the order of millions. Adding cross features on top of this further aggravates the parameter explosion.
In recommendation scenarios, low-frequency ID-type features also bring a risk of overfitting. To address this, we designed feature access / exit strategies, which make it easy to adjust the influence of low-frequency sparse parameters on the model according to the model's preset expressive capacity.
2. Feature access
In a business scenario, new samples are produced all the time, and new samples bring new features. Some features appear only rarely; adding all of them to the model strains memory on the one hand and, on the other, lets low-frequency features cause overfitting. Feature access mechanisms are therefore defined, including probability-based filtering, Bloom filters, and so on.
The training framework sets an access "threshold" for new features to keep low-frequency features from being admitted. We provide two mechanisms to limit access for new features:
- Probabilistic access: every time a new feature is encountered, a probability is generated from a preset distribution and used to decide whether the feature is admitted;
- Counting Bloom Filter: a Counting Bloom Filter (CBF) counts the occurrences of each new feature, and the feature is admitted once the count exceeds the threshold.
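The first mechanism can be sketched as follows. The text only says that a probability is generated from a preset distribution; this sketch assumes the simplest case, a fixed admission probability applied to every occurrence of a not-yet-admitted feature. The name `ProbabilisticAdmission` and its parameters are hypothetical.

```python
import random


class ProbabilisticAdmission:
    """Admit a new feature with a fixed probability each time it occurs."""

    def __init__(self, admit_prob, seed=None):
        if not 0.0 <= admit_prob <= 1.0:
            raise ValueError("admit_prob must be in [0, 1]")
        self.admit_prob = admit_prob
        self.rng = random.Random(seed)
        self.admitted = set()

    def observe(self, feature_id):
        """Return True if the feature is (or just became) admitted."""
        if feature_id in self.admitted:
            return True
        # Assumption: one Bernoulli draw per occurrence of an unadmitted feature.
        if self.rng.random() < self.admit_prob:
            self.admitted.add(feature_id)
            return True
        return False


# Each of 10,000 distinct features is seen exactly once with admit_prob=0.1.
gate = ProbabilisticAdmission(admit_prob=0.1, seed=0)
admitted_count = sum(gate.observe(f"feature_{i}") for i in range(10_000))
```

Under this assumption a feature needs on average `1 / admit_prob` occurrences before it is admitted, so frequent features get in quickly while features seen only once or twice are mostly filtered out, without keeping any per-feature counter.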

The figure above briefly describes the principle of the CBF. Suppose the capacity is 16 and two hash functions map a Feature ID to slot indices. When querying the feature frequency, Feature1 is mapped by Hash Function1 and Hash Function2 to Slot 3 and Slot 6 respectively; both slots hold the value 1, so Feature1 is considered to have occurred once. Feature2 is mapped by Hash Function1 and Hash Function2 to Slot 6 and Slot 15, which hold 1 and 0 respectively, so Feature2 is considered to have occurred 0 times. In other words, the estimated count is the minimum value over all slots the feature maps to.
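The counting scheme described here can be sketched in a few lines. This is an illustrative implementation under assumptions, not the framework's code: the class `CountingBloomFilter`, the helper `should_admit`, and the choice of salted SHA-256 as the hash functions are all hypothetical.

```python
import hashlib


class CountingBloomFilter:
    """Approximate occurrence counter: may overestimate, never underestimates."""

    def __init__(self, capacity=16, num_hashes=2):
        self.capacity = capacity
        self.num_hashes = num_hashes
        self.counters = [0] * capacity

    def _slots(self, feature_id):
        # Assumption: derive the hash functions by salting SHA-256 with a seed.
        slots = set()
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{feature_id}".encode("utf-8")).digest()
            slots.add(int.from_bytes(digest[:8], "big") % self.capacity)
        return slots

    def add(self, feature_id):
        # Each distinct slot is incremented once per occurrence.
        for slot in self._slots(feature_id):
            self.counters[slot] += 1

    def count(self, feature_id):
        # The estimate is the minimum over all slots the feature maps to.
        return min(self.counters[slot] for slot in self._slots(feature_id))


def should_admit(cbf, feature_id, threshold):
    """Record one occurrence, then report whether the count reached the threshold."""
    cbf.add(feature_id)
    return cbf.count(feature_id) >= threshold


cbf = CountingBloomFilter(capacity=1024, num_hashes=2)
decisions = [should_admit(cbf, "ad_id=42", threshold=3) for _ in range(3)]
# decisions == [False, False, True]: the feature is admitted on its third occurrence.
```

Taking the minimum limits the damage from hash collisions: a slot shared with another feature can only inflate the estimate, so a feature is never counted as rarer than it really is, and memory stays fixed at `capacity` counters regardless of how many distinct features appear.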

3. Feature elimination
Some features become stale if they are not updated for a long time. To relieve memory pressure and improve the timeliness of the model, obsolete features need to be eliminated, which requires elimination rules.
For features that have already been admitted, there are three ways to judge whether a feature is in a low-frequency state:
- Update time. If a feature has not been updated for a long time, it is considered to be in a low-frequency state;
- L2 norm. If the L2 norm computed for a feature is too small, it is considered to be in a low-frequency state;
- Composite score of statistics. User-defined functions are supported: a composite score is computed from the feature's statistics (impressions, clicks, likes, comments, etc.), and if the score is below the threshold the feature is considered to be in a low-frequency state.
Features judged to be in a low-frequency state are eliminated and masked; the next time such a feature reappears, it is treated as a new feature.
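The three criteria can be sketched as a single eviction pass over a feature table. Everything named here is an assumption for illustration: the `FeatureState` record, the score weights in `weighted_score`, the thresholds, and the choice to evict when any one criterion fires (the text does not say how the criteria are combined).

```python
import math
from dataclasses import dataclass, field


@dataclass
class FeatureState:
    embedding: list        # learned parameters of the feature
    last_update_step: int  # training step of the most recent update
    stats: dict = field(default_factory=dict)  # e.g. impressions, clicks


def l2_norm(vector):
    return math.sqrt(sum(v * v for v in vector))


def weighted_score(stats):
    # Hypothetical user-defined score; the weights are illustrative only.
    return (0.01 * stats.get("impressions", 0) + 1.0 * stats.get("clicks", 0)
            + 2.0 * stats.get("likes", 0) + 3.0 * stats.get("comments", 0))


def is_low_frequency(state, current_step, max_idle_steps, min_l2, min_score,
                     score_fn=weighted_score):
    idle_too_long = current_step - state.last_update_step > max_idle_steps
    norm_too_small = l2_norm(state.embedding) < min_l2
    score_too_low = score_fn(state.stats) < min_score
    # Assumption: any single criterion is enough to mark the feature.
    return idle_too_long or norm_too_small or score_too_low


def eliminate(table, **criteria):
    """Remove low-frequency features from `table` and return their ids."""
    evicted = [fid for fid, state in table.items()
               if is_low_frequency(state, **criteria)]
    for fid in evicted:
        del table[fid]  # a later occurrence goes through access again
    return evicted


busy = {"impressions": 500, "clicks": 20}
table = {
    "healthy": FeatureState([0.5, -0.4], last_update_step=990, stats=dict(busy)),
    "stale": FeatureState([0.5, -0.4], last_update_step=100, stats=dict(busy)),
    "tiny_norm": FeatureState([1e-4, 1e-4], last_update_step=990, stats=dict(busy)),
    "low_score": FeatureState([0.5, -0.4], last_update_step=990,
                              stats={"impressions": 10}),
}
evicted = eliminate(table, current_step=1000, max_idle_steps=500,
                    min_l2=1e-3, min_score=5.0)
# evicted == ["stale", "tiny_norm", "low_score"]; only "healthy" remains.
```

Deleting the entry outright is what makes a reappearing feature look new: it has no state left in the table, so it has to pass the access threshold again before it gets parameters.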

After applying feature access & elimination, the recommendation model can generally be reduced to about a quarter of its size without them, while the online prediction AUC stays flat to the thousandths place.