Introduction to Tianchi News Recommendation: 4. Feature Engineering
2022-07-04 01:20:00 【Programmer base camp】
Preface
Feature engineering means building features and labels so that the problem becomes a supervised learning task.
- Features that can be used directly:
- The article's own attributes: category_id is the article's category; created_at_ts is the time the article was created, which relates to its timeliness; words_count is the article's word count. Users generally tend not to click on very long articles, although some people do like long reads.
- The article's content embedding features. These were used in the recall stage. Here they are optional, and you can also try other kinds of embedding features, such as W2V.
- The user's device information.
- How to construct the supervised dataset: the recall stage gives us a dictionary of the form {user_id: [list of articles the user may click]}. For each user, every recalled article becomes one candidate sample. For example, if the recall list for user1 is {user1: [item1, item2, item3]}, we get three rows of data: (user1, item1), (user1, item2), (user1, item3). These are the first two columns of the supervised test set.
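The flattening step described above can be sketched in a few lines of plain Python. The function names `recall_to_rows` and `add_labels` are hypothetical. The labeling rule (1 if the candidate is the user's true last click, otherwise 0) is an assumption based on the stated goal of predicting the last click, not something the original spells out.

```python
def recall_to_rows(recall_dict):
    """Flatten {user_id: [item_id, ...]} into a list of (user_id, item_id) rows."""
    rows = []
    for user_id, item_ids in recall_dict.items():
        for item_id in item_ids:
            rows.append((user_id, item_id))
    return rows


def add_labels(rows, last_click):
    """Append a label column: 1 if the candidate equals the user's last click, else 0.

    `last_click` maps user_id -> item_id (assumed to be known for training users only).
    """
    return [(user, item, int(last_click.get(user) == item)) for user, item in rows]


recall = {"user1": ["item1", "item2", "item3"]}
rows = recall_to_rows(recall)
labeled = add_labels(rows, {"user1": "item2"})
```

In practice the same rows would usually be held in a pandas DataFrame so that feature columns can be merged in later; plain tuples are used here only to keep the sketch free of dependencies.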
The idea behind feature construction is as follows. Whether a user clicks an article is closely related to the articles they clicked before, for example articles on the same topic or with similar content. An important family of features therefore combines each candidate with the user's click history. We already have a two-column dataset of users and candidate articles, and the goal is to predict each user's last click. A natural approach is to relate each candidate to the user's most recent clicks: this uses the click history while staying close to the last click, which matters because timeliness is one of the defining properties of news. A user's last click is often strongly related to the few clicks just before it. So for each candidate article we can build the following features relative to the last few clicks:
- Similarity between the candidate item and each of the last few clicked articles (embedding inner product). This is directly tied to the user's historical behavior.
- Summary statistics of those similarity features. Statistics smooth out some fluctuations and outliers.
- Word-count difference between the candidate item and each of the last few clicked articles. This reflects the user's preference for article length.
- Creation-time difference between the candidate item and each of the last few clicked articles. This reflects the user's preference for fresh articles.
- One more thing to consider: if the YouTube recall was used, we can also build a similarity feature between the user and the candidate item.
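A minimal sketch of the last-few-clicks features listed above, using only the standard library. The function name `history_features`, the feature names, the use of absolute differences, and the layout of `item_info` as `(words_count, created_at_ts)` tuples are all assumptions made for illustration. The sketch also assumes each user has at least one historical click.

```python
from statistics import mean


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def history_features(candidate, history, item_emb, item_info, last_n=2):
    """Features of one candidate item relative to the user's last_n clicks.

    history:   item ids in click order, oldest first (must be non-empty)
    item_emb:  item id -> embedding vector
    item_info: item id -> (words_count, created_at_ts)
    """
    recent = history[-last_n:][::-1]  # most recent click first
    cand_words, cand_ts = item_info[candidate]
    feats, sims = {}, []
    for k, clicked in enumerate(recent, start=1):
        words, ts = item_info[clicked]
        sim = dot(item_emb[candidate], item_emb[clicked])
        sims.append(sim)
        feats[f"sim_{k}"] = sim                            # embedding similarity
        feats[f"words_diff_{k}"] = abs(cand_words - words) # length preference
        feats[f"created_diff_{k}"] = abs(cand_ts - ts)     # freshness preference
    # Summary statistics over the similarities
    feats["sim_max"] = max(sims)
    feats["sim_min"] = min(sims)
    feats["sim_mean"] = mean(sims)
    return feats


item_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.2, 0.8]}
item_info = {"a": (100, 1000), "b": (300, 2000), "c": (200, 2500)}
feats = history_features("c", ["a", "b"], item_emb, item_info)
```

Here `sim_1` compares the candidate with the most recent click and `sim_2` with the one before it, so the column order itself encodes recency.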
- Word2Vec: the main idea is that the context of a word expresses its meaning well. It is a way of generating word vectors through unsupervised learning. word2vec has two classic models: skip-gram and cbow.
skip-gram: given the center word, predict the surrounding words.
cbow: given the surrounding words, predict the center word.
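To make the difference concrete, the sketch below generates the training examples each model would see for a short token sequence. It only illustrates how inputs and targets are paired; it does not train anything, and the helper names are hypothetical.

```python
def context_of(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def skipgram_pairs(tokens, window=1):
    """skip-gram: one (center, context_word) example per surrounding word."""
    return [
        (center, ctx)
        for i, center in enumerate(tokens)
        for ctx in context_of(tokens, i, window)
    ]


def cbow_pairs(tokens, window=1):
    """cbow: one (context_words, center) example per position."""
    return [(context_of(tokens, i, window), center) for i, center in enumerate(tokens)]


tokens = ["a", "b", "c"]
sg_examples = skipgram_pairs(tokens)
cbow_examples = cbow_pairs(tokens)
```

For a news click sequence, the tokens would be article ids rather than words, so articles clicked close together end up with similar vectors.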
When training word2vec with gensim, several parameters are important:
- size: the dimension of the word vectors (named vector_size in gensim 4.0 and later).
- window: the maximum distance between the target word and the context words taken into account.
- sg: 0 selects the CBOW model, 1 selects the Skip-Gram model.
- workers: the number of threads used during training.
- min_count: the minimum word frequency; words that occur fewer times are ignored.
- iter: the number of passes over the entire dataset during training (named epochs in gensim 4.0 and later).
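The parameters above might be used as in the following sketch, which treats each user's time-ordered click sequence as a "sentence" of article ids. Treating article ids as words is an assumption about how W2V would be applied here. The helper names `build_sentences` and `train_item_w2v` and the parameter values are hypothetical, and the call uses the gensim 4.x argument names (`vector_size`, `epochs`); on gensim 3.x these would be `size` and `iter`. gensim is imported inside the function so the rest of the block runs without it.

```python
def build_sentences(click_log):
    """Group (user_id, item_id, timestamp) rows into per-user item sequences.

    Items are ordered by click time and converted to strings, since
    Word2Vec vocabularies are usually keyed by string tokens.
    """
    by_user = {}
    for user_id, item_id, _ts in sorted(click_log, key=lambda r: (r[0], r[2])):
        by_user.setdefault(user_id, []).append(str(item_id))
    return list(by_user.values())


def train_item_w2v(sentences, dim=16):
    """Train Word2Vec on click sequences and return {item_id: vector}."""
    from gensim.models import Word2Vec  # third-party, assumed gensim >= 4.0

    model = Word2Vec(
        sentences=sentences,
        vector_size=dim,  # dimension of the vectors ("size" before 4.0)
        window=5,         # context distance
        sg=1,             # 1 = Skip-Gram, 0 = CBOW
        workers=4,        # training threads
        min_count=1,      # keep every article, even if clicked once
        epochs=5,         # passes over the data ("iter" before 4.0)
    )
    return {item: model.wv[item] for item in model.wv.index_to_key}


click_log = [(2, 30, 5), (1, 10, 2), (1, 20, 1)]
sentences = build_sentences(click_log)
```

With gensim installed, `train_item_w2v(sentences)` would return one vector per article id, which can then feed the similarity features described earlier.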
Detailed tutorial and code