[Recommendation algorithms] Interview questions from small company C

2022-07-01 04:06:00 【Mountain peak evening view】

Table of contents

Zero. Project questions
One. Machine learning algorithms
Two. Recommendation algorithms
Three. Python basics
- 3.1 Python memory management mechanism
Four. Redis
- 4.1 Source code implementation of ×× in Redis
Reference

Zero. Project questions

0.1 User portrait

The first part is offline processing: obtain the data by crawling it, then build the portraits from it.
- The user portraits in MongoDB come from the MySQL user registration table and from user log data (such as read counts, like counts, and favorite counts).
- User portraits and item portraits (the material) are stored in the SinaNews database in MongoDB. MongoDB is used here because its documents are similar to JSON objects, so adding and deleting fields is very convenient.
- The processed material is stored in Redis (pulling it directly from MongoDB would be more cumbersome). The recommendation list and the hot list are computed offline and saved to Redis for the front end to display.
- PS: To keep the build simple, data is not fetched online in real time; instead it is crawled at a fixed time every night.

0.2 Modules I was responsible for

Data cleaning, the algorithm models, and so on.

0.3 Cold start problem

Analyze the problem according to the following mind map:
[Figure: mind map of the cold start problem]

One. Machine learning algorithms

1.0 Where does the randomness in random forests come from

The randomness in a random forest mainly comes from three sources:

First, the randomness of the training set caused by bootstrap sampling.
Second, the randomness of selecting a random feature subset at each node for the impurity calculation.
Third, the randomness of choosing split points at random, when that option is used (in this case the random forest is also called Extremely Randomized Trees).
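The three sources can be sketched with the standard library alone. This is a minimal illustration rather than how any particular library implements it; the helper names (`bootstrap_sample`, `random_feature_subset`, `random_split_point`) and the toy sizes are assumptions made for the example.

```python
import random

def bootstrap_sample(n_samples, rng):
    """Source 1: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, max_features, rng):
    """Source 2: pick the candidate features for one node, without replacement."""
    return sorted(rng.sample(range(n_features), max_features))

def random_split_point(values, rng):
    """Source 3 (Extra-Trees style): draw a threshold uniformly between min and max."""
    return rng.uniform(min(values), max(values))

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
rows = bootstrap_sample(n_samples=8, rng=rng)
features = random_feature_subset(n_features=6, max_features=2, rng=rng)
threshold = random_split_point([0.5, 1.5, 3.0, 4.2], rng)
print(rows, features, threshold)
```

Because rows are drawn with replacement, some indices usually repeat and others are left out of a given tree's training set.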

1.1 Differences between GBDT and random forests

(One) GBDT = decision trees + boosting ensemble learning (an additive, stage-wise ensemble in the same family as AdaBoost). GBDT trains each new tree on the residuals of the current model (more precisely, it fits the negative gradient of the loss, which for squared loss equals the residual). At prediction time, the predictions of all the trees are added up to give the final result.
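The residual-fitting loop can be sketched as follows. This is a toy version under stated assumptions: one numeric feature, depth-1 trees (stumps), squared loss, and hypothetical helper names `fit_stump` and `fit_gbdt`. Real GBDT libraries differ in many details.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one feature by minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_gbdt(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)  # initial prediction: the mean of the targets
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        # For the loss 0.5 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    # Prediction adds up the (shrunken) outputs of all trees on top of the base value.
    return lambda x: base + learning_rate * sum(t(x) for t in trees)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1]
model = fit_gbdt(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((mean_y - y) ** 2 for y in ys) / len(ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(baseline_mse, mse)
```

Each round only has to explain what the previous rounds left over, which is why the training error keeps shrinking as trees are added.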

(Two) Random forest is a bagging algorithm that uses decision trees (usually CART trees) as base learners.
(1) When handling regression problems, the output of a random forest is the average of the outputs of the individual learners.
(2) When handling classification problems, a random forest has two strategies:

The first is the voting strategy used in the original paper: each learner outputs a class, and the class predicted most often is returned.
The second is the probabilistic aggregation strategy used in sklearn: the class probability distributions output by the learners are averaged to give the mean probability that the sample belongs to each class, and the $\arg\max$ of the averaged distribution is returned as the most likely class.
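The two classification strategies can disagree. The sketch below uses made-up per-tree probabilities and two hypothetical helpers, `hard_vote` and `soft_vote`, to show one such case.

```python
from collections import Counter

def hard_vote(probas):
    """Each tree votes for its own argmax class; return the most frequent class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the per-class probabilities over the trees, then take the argmax."""
    n_classes = len(probas[0])
    mean = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Hypothetical outputs of three trees for one sample, as [P(class 0), P(class 1)].
tree_probas = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]
print(hard_vote(tree_probas))  # 0: two of the three trees prefer class 0
print(soft_vote(tree_probas))  # 1: the mean probabilities are [0.40, 0.60]
```

Two lukewarm votes outnumber one confident tree under hard voting, while probability averaging lets the confident tree carry more weight.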

1.2 Differences between bagging and boosting

Base classifier error = bias + variance

Boosting reduces the bias of the ensemble classifier by having the base classifiers focus, step by step, on the misclassified samples: after a weak classifier is trained, its error or residual is computed and used as input to the next classifier. This process keeps reducing the loss function, moving the model ever closer to the "bull's eye".
Bagging uses a divide-and-conquer strategy: the training set is resampled many times, and the resulting models make a joint decision, which reduces the variance of the ensemble classifier. Loosely speaking, averaging the predictions of n independent, uncorrelated models gives a variance that is 1/n of the variance of a single model.
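The 1/n claim follows from $\mathrm{Var}\left(\frac{1}{n}\sum_i X_i\right) = \sigma^2 / n$ for independent $X_i$ with variance $\sigma^2$. A small simulation can check it numerically; it assumes idealised models whose errors are independent standard normal draws, which bagged trees trained on overlapping samples only approximate.

```python
import random
import statistics

def simulate_variance_ratio(n_models=10, n_trials=20000, seed=0):
    """Return Var(average of n_models errors) / Var(one model's error)."""
    rng = random.Random(seed)
    single, averaged = [], []
    for _ in range(n_trials):
        # Each model's prediction error: independent, zero mean, unit variance.
        errors = [rng.gauss(0.0, 1.0) for _ in range(n_models)]
        single.append(errors[0])
        averaged.append(sum(errors) / n_models)
    return statistics.pvariance(averaged) / statistics.pvariance(single)

ratio = simulate_variance_ratio()
print(round(ratio, 3))  # close to 1/10
```

When the models are correlated, as real bagged trees are, the reduction is smaller than 1/n, which is one motivation for the extra feature randomness in random forests.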

Two. Recommendation algorithms

(1) The training process and sampling process of NeuralCF.
(2) Why does DIN introduce an attention mechanism?

| Model | Basic principle | Characteristics | Limitations |
| --- | --- | --- | --- |
| NeuralCF | Replaces the dot product of the user vector and item vector in traditional matrix factorization with a neural network that models their interaction | A matrix factorization model with enhanced expressive power | Uses only the user and item `id` features; no other features are added |
| Wide&Deep | Uses the wide part to strengthen the model's memorization ability and the deep part to strengthen its generalization ability | Pioneered the construction of combined models | The wide part requires manually crafted feature combinations |
| Deep&Cross | Replaces the wide part of Wide&Deep with a Cross network | Solves the manual feature combination problem of Wide&Deep | The Cross network has high complexity |
| DeepFM | Replaces the wide part of Wide&Deep with FM | Strengthens the feature crossing ability of the wide part | Structurally not very different from the classic Wide&Deep |
| DIN | Introduces an attention mechanism and computes attention scores from the correlation between the user's historical behavior items and the target ad item | Makes more targeted recommendations depending on the target ad item | Does not fully exploit features other than historical behavior |
| DIEN | Uses a sequence model to simulate the evolution of user interests | The sequence model strengthens the system's ability to express changes in user interest, so it starts to use the valuable information in time-ordered behavior sequences | Sequence models are complex to train and add online serving latency, so engineering optimization is required |
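The DIN row can be made concrete with a small sketch of attention-weighted pooling. It is a simplification under loud assumptions: a plain dot product stands in for DIN's learned scoring network, the embeddings are made up, the weights are left unnormalized, and `attention_pooling` is a hypothetical helper name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pooling(behavior_embs, target_emb):
    """Weight each historical item by its relevance to the candidate, then sum."""
    # Assumption: a dot product replaces the small learned network used for scoring.
    weights = [dot(b, target_emb) for b in behavior_embs]
    dim = len(target_emb)
    pooled = [sum(w * b[d] for w, b in zip(weights, behavior_embs)) for d in range(dim)]
    return weights, pooled

# Hypothetical 2-d embeddings of three items the user interacted with.
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
sports_ad = [1.0, 0.0]
book_ad = [0.0, 1.0]

w_sports, user_for_sports = attention_pooling(history, sports_ad)
w_books, user_for_books = attention_pooling(history, book_ad)
print(w_sports, user_for_sports)
print(w_books, user_for_books)
```

The same behavior history yields a different user representation for each candidate ad, which is the motivation for attention: plain average pooling would give one fixed vector regardless of the candidate.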

The DeepFM model:
[Figure: DeepFM model architecture]
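The feature crossing that FM contributes is its second-order term, $\sum_{i<j} \langle v_i, v_j \rangle x_i x_j$. The sketch below computes it both directly and with the standard linear-time rewrite; the feature vector `x` and latent vectors `v` are made-up toy values, and the function names are chosen only for this example.

```python
def fm_pairwise_naive(x, v):
    """Sum of <v_i, v_j> * x_i * x_j over all pairs i < j."""
    n = len(x)
    return sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i in range(n)
        for j in range(i + 1, n)
    )

def fm_pairwise_fast(x, v):
    """Same value in O(n*k): 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    k = len(v[0])
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total

x = [1.0, 0.0, 1.0, 1.0]                               # toy feature vector
v = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]   # one latent vector per feature
print(fm_pairwise_naive(x, v), fm_pairwise_fast(x, v))
```

Because every pair of features interacts through learned latent vectors, no manual feature combination is needed, in contrast to the wide part of Wide&Deep.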

Three. Python basics

3.1 Python memory management mechanism

Immutable objects: numbers, strings, tuples. Mutable objects: dictionaries, lists, byte arrays (bytearray).
Immutable types include int, float, str, tuple, and so on (Python 2 also had long).

For a variable of an immutable type, "changing" the variable actually creates a new value and binds the variable to it; if the old value is no longer referenced, it waits for garbage collection.
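The difference between rebinding and in-place modification can be observed with `id()`. This is a small sketch with arbitrary example values.

```python
a = 1000
b = a                    # a and b are bound to the same int object
old_id = id(a)
a += 1                   # ints are immutable: a new object is created and a is rebound
print(a, b)              # 1001 1000
print(id(a) == old_id)   # False

items = [1, 2]
alias = items
old_list_id = id(items)
items.append(3)          # lists are mutable: the same object is modified in place
print(alias)             # [1, 2, 3]
print(id(items) == old_list_id)  # True
```

The name `b` still sees the old value after `a += 1`, while `alias` sees the appended element, because only the list was changed in place.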

Python garbage collection is mainly based on reference counting. Its shortcomings: it cannot handle objects with circular references, and it needs extra space to maintain the reference counts.
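The circular reference problem can be reproduced in a few lines. The sketch assumes CPython, where the `gc` module's cycle detector supplements reference counting; the `Node` class is a hypothetical stand-in for any object that can hold a reference to another.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.other = None

gc.disable()              # make the demo deterministic: no automatic cycle collection
a, b = Node(), Node()
a.other, b.other = b, a   # reference cycle: each object keeps the other alive
probe = weakref.ref(a)    # a weak reference does not increase the reference count

del a, b
alive_after_del = probe() is not None      # True: the counts never reach zero

gc.collect()              # the cycle detector finds and frees the unreachable pair
alive_after_collect = probe() is not None  # False
gc.enable()
print(alive_after_del, alive_after_collect)
```

Reference counting alone would leak the two nodes forever, since each still has one incoming reference after the names are deleted.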
In the following four situations, an object's reference count is incremented by 1:
The object is created (a=11); the object is referenced (b=a); the object is passed to a function as an argument (func(a)); the object is stored as an element in a container (such as lst1=[a,a]).
In the following four situations, an object's reference count is decremented by 1:
An alias of the object is explicitly destroyed (del a); an alias of the object is assigned a new object (a=66); the object leaves its scope (for example, the local variables of function fun when fun finishes executing; global variables do not); the container holding the object is destroyed, or the object is removed from the container.

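#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys

def func(c):
    # Inside the call the count is higher by 2: the parameter c is one reference,
    # and the function call stack holds another reference to the argument.
    print('in func function', sys.getrefcount(c) - 1)

# getrefcount() itself holds a temporary reference to its argument, hence the -1.
print('init', sys.getrefcount(11) - 1)
a = 11                      # object bound to a name: +1
print('after a=11----', sys.getrefcount(11) - 1)
b = a                       # object referenced by another name: +1
print('after b=a----', sys.getrefcount(11) - 1)
func(11)                    # object passed as an argument: +2 while the call is running
print('after func(11)----', sys.getrefcount(11) - 1)
lst1 = [a, 12, 14]          # object stored in a container: +1
print('after lst1=[a,12,14]----', sys.getrefcount(11) - 1)
a = 666                     # name rebound to a new object: -1
print('after a=666----', sys.getrefcount(11) - 1)
del a                       # a now refers to 666, so the count of 11 is unchanged
print('after del a----', sys.getrefcount(11) - 1)
del b                       # alias destroyed: -1
print('after del b----', sys.getrefcount(11) - 1)
del lst1                    # container destroyed: -1
print('after del lst1----', sys.getrefcount(11) - 1)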

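The result is shown below. The exact numbers depend on the interpreter: the small integer 11 is cached and shared, so other code already holds references to it, and on CPython 3.12 and later small integers are immortal, so the reported count no longer changes.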

init 50
after a=11---- 51
after b=a---- 52
in func function 54
after func(11)---- 52
after lst1=[a,12,14]---- 53
after a=666---- 52
after del a---- 52
after del b---- 51
after del lst1---- 50

Four. Redis

4.1 Source code implementation of ×× in Redis

The first stage: read the data structure part of Redis, which is basically located in the following files:
- Memory allocation: zmalloc.c and zmalloc.h
- Dynamic strings: sds.h and sds.c
- Doubly linked list: adlist.c and adlist.h
- Dictionary: dict.h and dict.c
- Skip list: the zskiplist and zskiplistNode structures in server.h, plus all functions in t_zset.c whose names start with zsl, such as zslCreate, zslInsert, and zslDeleteNode
- Cardinality estimation: the hllhdr structure in hyperloglog.c, plus all functions whose names start with hll
The second stage: get familiar with the memory-encoded structures of Redis
- Integer set: intset.h and intset.c
- Compressed list (ziplist): ziplist.h and ziplist.c
The third stage: get familiar with the implementation of the Redis data types
- Object system: object.c
- String keys: t_string.c
- List keys: t_list.c
- Hash keys: t_hash.c
- Set keys: t_set.c
- Sorted set keys: all functions in t_zset.c except those whose names start with zsl
- HyperLogLog keys: the functions in hyperloglog.c whose names start with pf
The fourth stage: get familiar with the implementation of the Redis database
- Database implementation: the redisDb structure in redis.h (named server.h in newer Redis versions), plus db.c
- Database notifications: notify.c
- RDB persistence: rdb.c
- AOF persistence: aof.c
And the implementation of some independent functional modules:
- Publish and subscribe: the pubsubPattern structure in redis.h, plus pubsub.c
- Transactions: the multiState and multiCmd structures in redis.h, plus multi.c
The fifth stage: get familiar with the client and server code
- Event handling: ae.c, ae_epoll.c, ae_evport.c, ae_kqueue.c, ae_select.c
- Networking: anet.c and networking.c
- Server: redis.c
- Client: redis-cli.c
- At this point you can also read the following independent functional modules:
- Lua scripting: scripting.c
- Slow query log: slowlog.c
- Monitor: monitor.c
The sixth stage: get familiar with the multi-machine part of the Redis code
- Replication: replication.c
- Redis Sentinel: sentinel.c
- Cluster: cluster.c