Introduction to Tianchi News Recommendation: 4. Feature Engineering
2022-07-04 01:20:00 【Programmer base camp】
Preface
Feature engineering here means building features and labels so that the task becomes a supervised learning problem.
- Features that can be used directly:
- The article's own features: category_id is the article's category; created_at_ts is the time the article was created, which relates to its timeliness; words_count is the article's word count. Articles that are too long generally attract fewer clicks, although some users do prefer long reads.
- The article content embedding features. These were already used in the recall stage. Using them here is optional, and you can also try other kinds of embedding features, such as W2V.
- The user's device features.
- How the supervised dataset is constructed: the recall stage gives us a dictionary of the form {user_id: [list of articles the user may click]}. For each user, every candidate article then becomes one row of the test set. For example, if the recall list for user1 is {user1: [item1, item2, item3]}, we get three rows: (user1, item1), (user1, item2), (user1, item3). These are the first two columns of the supervised test set.
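The expansion of the recall dictionary into (user, candidate) rows can be sketched as follows. This is a minimal illustration rather than the tutorial's actual code: the names recall_to_rows, recall_dict and last_click are assumptions, and the 0/1 label (1 when the candidate is the user's held-out last click) is one common way to turn the rows into training samples.

```python
def recall_to_rows(recall_dict, last_click=None):
    """Expand {user_id: [candidate articles]} into one row per (user, candidate).

    last_click is a hypothetical {user_id: article_id} map of held-out last
    clicks; when it is given, each row also carries a 0/1 label.
    """
    rows = []
    for user_id, candidates in recall_dict.items():
        for item_id in candidates:
            if last_click is None:
                rows.append((user_id, item_id))
            else:
                label = int(last_click.get(user_id) == item_id)
                rows.append((user_id, item_id, label))
    return rows


recall_dict = {"user1": ["item1", "item2", "item3"]}

# Test-set style rows: only the two id columns.
rows = recall_to_rows(recall_dict)

# Training style rows: the same pairs plus a label column.
labeled_rows = recall_to_rows(recall_dict, last_click={"user1": "item2"})
```

Feature columns are then attached to each row by looking up user_id and article_id.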
The idea behind the constructed features is as follows. Whether a user clicks an article is closely related to the articles that user clicked in the past, for example articles on the same topic or with similar content. An important family of features is therefore built by combining each candidate with the user's click history. At this point we have a two-column dataset of users and candidate articles, and the goal is to predict the user's last click. A natural approach is to relate each candidate to the user's most recent clicks: this uses the click history while staying close in time to the click being predicted, which matters because timeliness is one of the most important properties of news. A user's last click is often strongly related to the few clicks just before it. For each candidate article we can therefore build the following features from the last few clicks:
- Similarity between the candidate item and each of the last few clicked articles (embedding inner product): this ties the candidate directly to the user's historical behavior.
- Statistics of those similarities (such as max, min, and mean): statistical features smooth out fluctuations and outliers.
- Word-count difference between the candidate item and each of the last few clicked articles: this reflects the user's preference for article length.
- Difference in creation time between the candidate item and each of the last few clicked articles: this reflects the user's preference for fresh articles.
- Also worth considering: if YouTube recall was used in the recall stage, we can additionally build a similarity feature between the user and the candidate item.
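The last-click features listed above can be sketched for a single (user, candidate) pair as follows. This is a simplified illustration under stated assumptions: the function history_features, the dictionaries item_info and item_emb, and the feature names are all hypothetical, and plain Python lists stand in for the real embedding matrix.

```python
from statistics import mean


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def history_features(candidate, history, item_info, item_emb, last_n=2):
    """Features relating one candidate article to the user's last_n clicks.

    history:   list of (article_id, click_ts), oldest first, at least one click
    item_info: {article_id: {"created_at_ts": int, "words_count": int}}
    item_emb:  {article_id: list of floats}
    """
    # Most recent click first, so index 1 is the latest click.
    recent = [article_id for article_id, _ in history[-last_n:]][::-1]
    sims = [dot(item_emb[candidate], item_emb[a]) for a in recent]

    feats = {}
    cand = item_info[candidate]
    for i, (a, sim) in enumerate(zip(recent, sims), start=1):
        feats[f"sim_{i}"] = sim
        feats[f"words_diff_{i}"] = abs(cand["words_count"] - item_info[a]["words_count"])
        feats[f"created_diff_{i}"] = abs(cand["created_at_ts"] - item_info[a]["created_at_ts"])

    # Statistics over the similarities smooth out single-click noise.
    feats.update(sim_max=max(sims), sim_min=min(sims), sim_mean=mean(sims))
    return feats


item_emb = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [1.0, 0.0]}
item_info = {
    "a": {"created_at_ts": 1000, "words_count": 200},
    "b": {"created_at_ts": 2000, "words_count": 150},
    "c": {"created_at_ts": 2500, "words_count": 180},
}
history = [("a", 100), ("b", 200)]  # the user clicked a, then b

feats = history_features("c", history, item_info, item_emb)
```

In a full pipeline this function would be applied to every row of the (user, candidate) table, and the resulting dictionaries would become feature columns.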
- The main idea of Word2Vec: the context of a word expresses its meaning well. Word2Vec is a way of learning word vectors without supervision. It has two classic models, skip-gram and CBOW.
skip-gram: given the center word, predict the surrounding words.
CBOW: given the surrounding words, predict the center word.
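The difference between the two models is easiest to see in the training examples each one derives from a sentence. The sketch below is a simplified illustration (the function training_pairs is a made-up name, and real implementations add details such as subsampling and negative sampling that are omitted here).

```python
def training_pairs(tokens, window=1):
    """Return (skip_gram_pairs, cbow_pairs) for one tokenized sentence."""
    skip_gram, cbow = [], []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        # skip-gram: the center word is the input, each context word is a target.
        skip_gram.extend((center, c) for c in context)
        # CBOW: the context words together are the input, the center word is the target.
        cbow.append((tuple(context), center))
    return skip_gram, cbow


sg_pairs, cbow_pairs = training_pairs(["news", "about", "sports"], window=1)
```

With a window of 1, skip-gram produces one (input, target) pair per neighboring word, while CBOW produces one example per position.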
When training word2vec with gensim, several parameters are important:
- size: the dimension of the word vectors (renamed vector_size in gensim 4.0).
- window: how far from the target word the context extends.
- sg: 0 selects the CBOW model, 1 selects the Skip-Gram model.
- workers: the number of threads used during training.
- min_count: the minimum word frequency; words that occur fewer times than this are ignored.
- iter: the number of passes over the entire dataset during training (renamed epochs in gensim 4.0).
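As a rough sketch of how these parameters fit together, the code below treats each user's click sequence as a "sentence" and each article id as a "word". That framing, the helper names build_click_sentences and train_item_w2v, and the parameter values are assumptions made for illustration. The keyword names follow gensim 4.0 and later; on older versions use size and iter instead of vector_size and epochs.

```python
from collections import defaultdict

try:
    from gensim.models import Word2Vec  # third-party: pip install gensim
except ImportError:  # the sentence-building part still runs without gensim
    Word2Vec = None


def build_click_sentences(click_log):
    """Group (user_id, article_id, click_ts) rows into per-user click sequences."""
    by_user = defaultdict(list)
    for user_id, article_id, _ in sorted(click_log, key=lambda row: row[2]):
        by_user[user_id].append(str(article_id))  # use string tokens as "words"
    return list(by_user.values())


def train_item_w2v(sentences, dim=16):
    """Train skip-gram article vectors (keyword names for gensim >= 4.0)."""
    if Word2Vec is None:
        raise ImportError("gensim is required to train Word2Vec")
    return Word2Vec(
        sentences=sentences,
        vector_size=dim,  # called `size` before gensim 4.0
        window=5,         # context reaches 5 clicks to each side
        sg=1,             # 1 = skip-gram, 0 = CBOW
        workers=1,        # number of training threads
        min_count=1,      # keep articles that were clicked only once
        epochs=5,         # called `iter` before gensim 4.0
    )


click_log = [(1, 30, 100), (2, 31, 105), (1, 32, 110), (2, 30, 120)]
sentences = build_click_sentences(click_log)
```

Calling train_item_w2v(sentences) returns a model whose model.wv["30"] is the learned vector for article 30; such vectors can replace or complement the content embeddings in the similarity features.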
Detailed tutorial and code