Introduction to Tianchi News Recommendation, Part 4: Feature Engineering

2022-07-04 01:20:00 【Programmer base camp】

Feature engineering means constructing features and labels so that the problem becomes a supervised learning task.

Features of the article itself: category_id is the article's category; created_at_ts is the time the article was created, which relates to its timeliness; words_count is the article's word count. Readers are generally less inclined to click on very long articles, although some people do prefer long reads.
Content embedding features of the article: these were already used in the recall stage. Using them here is optional, and other kinds of embedding features, such as Word2Vec (W2V), can also be tried.
Device information features of the user.

How to construct the supervised dataset: from the recall results we get a dictionary of the form {user_id: [list of articles the user may click]}. For each user and each of that user's candidate articles we can therefore build one row of data. For example, if the recall list for user1 is {user1: [item1, item2, item3]}, we get three rows: (user1, item1), (user1, item2), (user1, item3). These are the first two columns when building the supervised dataset.
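The flattening step can be sketched in plain Python as follows. This is a minimal illustration, not the original author's code: the function names and the `last_click` dictionary (each user's held-out final click, used here as the label source) are assumptions introduced for the example.

```python
def build_candidate_rows(recall_dict):
    """Flatten {user_id: [item_id, ...]} into a list of (user_id, item_id) rows."""
    rows = []
    for user_id, item_ids in recall_dict.items():
        for item_id in item_ids:
            rows.append((user_id, item_id))
    return rows


def attach_labels(rows, last_click):
    """Add a label column to each (user_id, item_id) row.

    `last_click` is a hypothetical {user_id: item_id} dict holding each user's
    held-out final click. A row gets label 1 if the candidate is that click,
    0 otherwise. Users missing from `last_click` (for example test users)
    get label None.
    """
    labeled = []
    for user_id, item_id in rows:
        if user_id in last_click:
            label = 1 if last_click[user_id] == item_id else 0
        else:
            label = None
        labeled.append((user_id, item_id, label))
    return labeled


# The recall example from the text.
recall_dict = {"user1": ["item1", "item2", "item3"]}
rows = build_candidate_rows(recall_dict)

# Assumed for illustration: user1's real last click was item2.
labeled = attach_labels(rows, {"user1": "item2"})
```

In practice the rows would usually be loaded into a DataFrame so that feature columns can be merged onto the (user_id, item_id) pairs.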
How to construct the features: whether a user clicks an article depends heavily on the articles that user clicked before, for example articles on the same topic or with similar content. An important family of features therefore combines each candidate with the user's click history. We already have the two-column dataset of users and candidate articles, and the goal is to predict each user's last clicked article. A natural idea is to relate each candidate to the user's most recent clicks: this uses the click history while staying close to the last click, which matters because timeliness is one of the most important properties of news. A user's last click is often strongly related to the few clicks before it. For each candidate article we can therefore build the following features relative to the last few clicks:

Similarity between the candidate item and each of the last few clicked articles (inner product of embeddings): this is directly tied to the user's historical behavior
Statistics of those similarity values for the candidate item: statistical features dampen fluctuations and outliers
Word-count difference between the candidate item and each of the last few clicked articles: this reflects the user's preference for article length
Creation-time difference between the candidate item and each of the last few clicked articles: this reflects the user's preference for fresh articles
Something else to consider: if the YouTubeDNN recall was used, we can also build a similarity feature between the user and the candidate item.
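The four history-based features listed above could be computed for a single (user, candidate) pair roughly as below. This is a sketch under stated assumptions: the dictionary layouts for `item_info` and `item_emb`, the feature names, the use of absolute differences, and the default of three recent clicks are all choices made for this example, not details from the original.

```python
from statistics import mean


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def history_features(candidate, history, item_info, item_emb, last_n=3):
    """Features relating one candidate article to the user's last `last_n` clicks.

    history   : list of clicked item ids, oldest first (assumed layout)
    item_info : {item_id: {"created_at_ts": ..., "words_count": ...}}
    item_emb  : {item_id: embedding vector as a list of floats}
    """
    recent = history[-last_n:][::-1]  # most recent click first
    cand = item_info[candidate]
    feats = {}
    sims = []
    for k, clicked in enumerate(recent, start=1):
        info = item_info[clicked]
        sim = dot(item_emb[candidate], item_emb[clicked])
        sims.append(sim)
        feats[f"sim_{k}"] = sim
        feats[f"word_diff_{k}"] = abs(cand["words_count"] - info["words_count"])
        feats[f"time_diff_{k}"] = abs(cand["created_at_ts"] - info["created_at_ts"])
    if sims:  # statistics are only defined when the user has at least one click
        feats["sim_max"] = max(sims)
        feats["sim_min"] = min(sims)
        feats["sim_sum"] = sum(sims)
        feats["sim_mean"] = mean(sims)
    return feats


# Toy data, made up for illustration only.
item_info = {
    "a": {"created_at_ts": 100, "words_count": 120},
    "b": {"created_at_ts": 400, "words_count": 300},
    "c": {"created_at_ts": 900, "words_count": 150},
    "d": {"created_at_ts": 1000, "words_count": 200},
}
item_emb = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [2.0, 1.0],
}

feats = history_features("d", ["a", "b", "c"], item_info, item_emb, last_n=2)
```

Here the candidate "d" is compared with the two most recent clicks "c" and "b", so index 1 always refers to the latest click.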

The main idea of Word2Vec: the context of a word expresses its meaning well. It is a way of generating word vectors through unsupervised learning. word2vec has two classic models: skip-gram and CBOW.
skip-gram: given the center word, predict the surrounding words.
CBOW: given the surrounding words, predict the center word.
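The difference between the two models shows up in how training examples are formed from a sequence. The sketch below only generates the (input, target) pairs and does not train any vectors. Treating a user's click sequence as the "sentence" and article ids as the "words" is an assumption made for this example, as are the function names and the window size.

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: the center word is the input, each neighbor a target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """(context, center) pairs: the neighbors are the input, the center word the target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip positions with no neighbors (single-token sequences)
            pairs.append((context, center))
    return pairs


# Assumed for illustration: one user's click sequence used as a "sentence".
clicks = ["item1", "item2", "item3"]
sg = skipgram_pairs(clicks, window=1)
cb = cbow_pairs(clicks, window=1)
```

skip-gram yields one training example per (center, neighbor) pair, while CBOW yields one example per position with all neighbors grouped as the input.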