Introduction to Tianchi News Recommendation: 4. Feature Engineering
2022-07-04 01:20:00 【Programmer base camp】
Preface
Feature engineering here means constructing features and labels so that the recommendation problem can be treated as a supervised learning task.
- Features that can be used directly:
  - The article's own features: category_id is the article's category; created_at_ts is the time the article was created, which relates to its timeliness; words_count is the article's word count. Readers are generally less inclined to click on very long articles, although some people do prefer long reads.
  - The article's content embedding features. These were used in the recall stage. They are optional here, and you can also try other kinds of embedding features, such as W2V.
  - The user's device features.
- Constructing the supervised dataset: the recall stage gives us a dictionary of the form {user_id: [list of articles the user may click]}. For each user, we pair the user with every recalled article to form candidate samples. For example, if the recall list for user1 is {user1: [item1, item2, item3]}, we get three rows of data: (user1, item1), (user1, item2), (user1, item3). These pairs form the first two columns of the supervised test set.
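The expansion from a recall dictionary to (user, item) rows can be sketched as follows. This is a minimal illustration, not the tutorial's own code: the function name `recall_to_samples` and the `last_click` argument are hypothetical, and the labeling rule (1 if the candidate is the user's true last click, otherwise 0) is an assumption based on the stated goal of predicting the last click.

```python
from typing import Dict, List, Optional, Tuple


def recall_to_samples(
    recall: Dict[str, List[str]],
    last_click: Optional[Dict[str, str]] = None,
) -> List[Tuple]:
    """Expand {user: [candidates]} into one row per (user, candidate) pair.

    If last_click is given (assumed to be known for training users), a 0/1
    label is appended: 1 when the candidate equals the user's last click.
    """
    rows = []
    for user, items in recall.items():
        for item in items:
            if last_click is None:
                rows.append((user, item))  # test rows: no label available
            else:
                label = int(last_click.get(user) == item)
                rows.append((user, item, label))
    return rows


recall = {"user1": ["item1", "item2", "item3"]}
test_rows = recall_to_samples(recall)
train_rows = recall_to_samples(recall, last_click={"user1": "item2"})
```

Here `test_rows` holds the three unlabeled pairs from the example above, and `train_rows` holds the same pairs with a label column added.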
The idea behind feature construction is as follows. Whether a user clicks an article is closely related to the articles the user has clicked before, for example articles on the same topic or with similar content. An important family of features therefore combines the candidate with the user's click history. We now have a two-column dataset of users and candidate articles, and the goal is to predict each user's last click. A natural approach is to relate each candidate to the user's last few clicked articles. This uses the click history while staying close in time to the click being predicted, which matters because timeliness is one of the defining characteristics of news: a user's last click often has a lot to do with the few clicks just before it. So for each candidate article we can build the following features from the last few clicks:
- Similarity between the candidate item and each of the last few clicked articles (inner product of their embeddings). This ties the candidate directly to the user's historical behavior.
- Summary statistics of those similarity features. Statistics smooth out some fluctuations and outliers.
- Word-count difference between the candidate item and each of the last few clicked articles. This reflects the user's preference for article length.
- Difference in creation time between the candidate item and each of the last few clicked articles. This reflects the user's preference for fresh content.
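The four feature groups listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `history_features`, the feature names, the use of absolute differences, and the choice of max/min/mean as the summary statistics are all hypothetical choices, and the embeddings, word counts, and timestamps are toy values.

```python
from statistics import mean
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def history_features(
    cand: str,
    hist: List[str],                 # user's clicked articles, oldest first
    emb: Dict[str, List[float]],     # article -> embedding
    words: Dict[str, int],           # article -> words_count
    created: Dict[str, int],         # article -> created_at_ts
    n_last: int = 2,
) -> Dict[str, float]:
    """Features relating one candidate to the user's last n_last clicks."""
    last = hist[-n_last:]
    if not last:
        return {}  # no history: nothing to compare against
    feats: Dict[str, float] = {}
    sims = []
    # k = 1 is the most recent click, k = 2 the one before it, and so on.
    for k, h in enumerate(reversed(last), start=1):
        sim = dot(emb[cand], emb[h])
        sims.append(sim)
        feats[f"sim_{k}"] = sim
        feats[f"word_diff_{k}"] = abs(words[cand] - words[h])
        feats[f"time_diff_{k}"] = abs(created[cand] - created[h])
    # Summary statistics over the per-click similarities.
    feats["sim_max"] = max(sims)
    feats["sim_min"] = min(sims)
    feats["sim_mean"] = mean(sims)
    return feats


emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 2.0]}
words = {"a": 100, "b": 250, "c": 180}
created = {"a": 1000, "b": 1600, "c": 2000}
feats = history_features("c", ["a", "b"], emb, words, created)
```

Users with fewer than `n_last` clicks simply get fewer per-click columns in this sketch; a real pipeline would need to pad or fill those values consistently.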
Things to think about:
- If YouTube DNN recall was used, we can also build a similarity feature between the user and the candidate item.
- Word2Vec: the main idea is that the context of a word expresses the word's meaning well. It generates word vectors through unsupervised learning. Word2Vec has two classic models: skip-gram and CBOW.
skip-gram: given the center word, predict the surrounding words.
CBOW: given the surrounding words, predict the center word.
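To make the difference between the two models concrete, the sketch below builds the training examples each one would see for a short token sequence. It only shows how inputs and targets are paired, not the neural network itself; the function names and the toy sentence are hypothetical.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 1) -> List[Tuple[List[str], str]]:
    """(context_words, center) examples: the neighbors jointly predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["news", "about", "sports"]
sg = skipgram_pairs(tokens)
cb = cbow_examples(tokens)
```

With a window of 1, skip-gram yields one pair per (center, neighbor) combination, while CBOW yields one example per position with all neighbors grouped as the input.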
When training word2vec with gensim, several parameters are important:
- size: the dimension of the word vectors (named vector_size in gensim 4.x).
- window: the maximum distance between the target word and the context words that are considered.
- sg: 0 selects the CBOW model, 1 selects the Skip-Gram model.
- workers: the number of threads used during training.
- min_count: the minimum word frequency; words that occur fewer times are ignored.
- iter: the number of passes over the entire dataset during training (named epochs in gensim 4.x).
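A hedged sketch of how these parameters fit together is shown below. It assumes gensim 4.x, where `size` is called `vector_size` and `iter` is called `epochs`. Treating each user's click sequence as a "sentence" of article ids is an assumption made here for illustration, and the function names and parameter values are hypothetical, not tuned settings.

```python
from typing import Dict, List


def clicks_to_sentences(user_clicks: Dict[str, List[int]]) -> List[List[str]]:
    """Turn each user's click sequence into a 'sentence' of string tokens."""
    return [[str(a) for a in clicks] for clicks in user_clicks.values() if clicks]


def train_article_w2v(sentences: List[List[str]], dim: int = 16) -> dict:
    """Train Word2Vec and return {token: vector}. Assumes gensim 4.x is installed."""
    from gensim.models import Word2Vec  # third-party; imported lazily

    model = Word2Vec(
        sentences=sentences,
        vector_size=dim,  # `size` in gensim 3.x
        window=5,         # context distance around the target token
        sg=1,             # 1 = Skip-Gram, 0 = CBOW
        workers=1,        # a single thread keeps runs more reproducible
        min_count=1,      # keep every token; raise to drop rare ones
        epochs=5,         # `iter` in gensim 3.x
        seed=42,
    )
    return {tok: model.wv[tok] for tok in model.wv.index_to_key}


user_clicks = {"u1": [101, 102, 103], "u2": [102, 104], "u3": []}
sentences = clicks_to_sentences(user_clicks)
```

Calling `train_article_w2v(sentences)` would then return one vector per article id, which could be used in place of the content embeddings when computing similarity features.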
Specific tutorials and code