What is the key to fast multi tag user profile analysis?
2022-07-29 08:55:00 【Big data dreamer】
User profile analysis requires many tags to describe user attributes, and these tags usually fall into two types. One type can take several values: for example, the user's education level is middle school, university, graduate or doctorate, and the age group is children, juvenile, youth, middle age or old age. Such tags are called enumeration tags. The other type has only two values: for example, whether the user is registered, whether the user is active, whether the user is a white-collar worker, or whether the user is a target of a certain promotion. Such tags are called binary tags.
In user profile analysis, it is often necessary to filter and aggregate on combined conditions over both kinds of tags. For example: find users who are middle-aged, have a college degree, are registered and active, and were target users of last year's Black Friday promotion.
When the total amount of data is huge, the performance bottleneck is usually this conditional filtering. The conditions are ad hoc, so they cannot be precomputed or served by indexes, and the system must be able to traverse the data efficiently by brute force. This makes the storage and calculation methods used for enumeration tags and binary tags critical.
In relational databases and data warehouses, enumeration tags are just ordinary fields, and the corresponding filter is written with IN in the WHERE clause, usually in the form d IN (d1,…,dn). The condition is true when the value of field d is contained in the value set {d1,…,dn}. IN performs poorly mainly because it involves too many comparisons. To decide whether field d is in the value set, a sequential search needs between 1 and n comparisons between d and the members of the set. Even if the value set is sorted and binary search is used, about log2(n) comparisons are still needed. When the amount of data is large, the total number of comparisons is large and evaluating IN becomes slow, and the larger the value set, the slower it gets.
The key to optimizing the filtering of enumeration tags is to eliminate the comparison operations. First, determine the list of possible values of the IN field (the field written before IN in the condition). The number of possible values is usually small, so this list will not be long. Then convert the original data: replace each IN field value with the serial number (position) of the corresponding entry in the list, and save the result as new data.
To evaluate IN on the converted data, first generate a set of Boolean values with the same length as the list. Its i-th value is determined by whether the i-th member of the list is in the value set of the IN condition: true if it is, false if it is not. During traversal, use the IN field value (the serial number in the list) to fetch the corresponding member of the Boolean set. If it is true, the record meets the filter condition; otherwise it does not.
This method essentially converts "comparison against a set of values" into "reference by serial number". The comparisons are eliminated, so performance improves greatly. The calculation time is also independent of the size of the value set and does not grow as more enumeration values are added to the IN condition.
SQL generally does not support fetching a member of a set directly by serial number (position). Going through an association table instead leads to a more complex JOIN operation, so this optimization cannot be implemented directly in SQL.
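Outside SQL, the serial-number reference described above can be illustrated with a minimal Python sketch. This is not the article's SPL code: the list DIM, the helper names encode, build_flags and filter_rows, and the sample values are all hypothetical, and positions are 0-based here as is usual in Python.

```python
# Hypothetical list of possible values; an entry's position is its serial number.
DIM = ["children", "juvenile", "youth", "middle age", "old age"]

def encode(values, dim=DIM):
    """Replace each enumeration value with its 0-based position in dim.
    This is the one-time data conversion step."""
    index = {name: i for i, name in enumerate(dim)}
    return [index[v] for v in values]

def build_flags(wanted, dim=DIM):
    """Boolean list aligned with dim: flags[i] is True when dim[i]
    belongs to the value set of the IN condition."""
    wanted = set(wanted)
    return [name in wanted for name in dim]

def filter_rows(codes, flags):
    """Return the row positions whose code points at a True flag.
    Each row costs one list lookup, with no value comparison."""
    return [row for row, code in enumerate(codes) if flags[code]]

# Assumed sample data: the age group of five users.
codes = encode(["youth", "old age", "children", "middle age", "old age"])
flags = build_flags(["middle age", "old age"])
matches = filter_rows(codes, flags)
```

The per-row work in filter_rows stays the same no matter how many values the IN condition lists, because only the one-time construction of flags depends on the value set.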
Binary tags are generally stored as Boolean fields in the database. If there are only a handful or a few dozen, it is enough to write the filter conditions in the WHERE clause. But the total number of tags may reach hundreds or even thousands. Many databases do not support that many fields in one table, so the tags have to be split across several tables and then joined. When the data volume is large, joining large tables performs very poorly.
To avoid large table joins, the Boolean fields can instead be turned into rows, with a single "tag number" field, and the data grouped before filtering and aggregation. But the grouped result set is very large and has to be buffered in external storage, so performance is still very poor.
If the binary bits of integers are used to store binary tags (0 and 1 each representing one value), a 16-bit short integer can hold 16 tags, and 100 such integer fields can hold 1600 tags. This effectively reduces the number of fields and avoids large table joins.
However, many databases do not support this kind of bitwise calculation, so this optimization cannot be implemented in them.
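The bit storage idea can be sketched in a few lines of Python. The helper names pack_bits and mask_for are hypothetical, and the bit ordering (the first flag goes into the highest bit of a 16-bit value) is an assumption made for this illustration, not a statement about how any particular engine lays out the bits.

```python
def pack_bits(flags):
    """Pack a list of 0/1 flags into one integer.
    Assumed ordering: flags[0] ends up in the highest bit."""
    value = 0
    for f in flags:
        value = (value << 1) | (1 if f else 0)
    return value

def mask_for(positions, width=16):
    """Build a mask with the given 1-based flag positions set,
    using the same ordering as pack_bits."""
    mask = 0
    for p in positions:
        mask |= 1 << (width - p)
    return mask

# One user's 16 binary tags; only the 4th and 8th are set.
row_flags = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = pack_bits(row_flags)

# A single bitwise AND tests several tags at once.
mask = mask_for([4, 8])
has_both = (b & mask) == mask
```

Testing any combination of the 16 tags costs one AND and one comparison per row, instead of one Boolean field access per tag.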
The open-source data computing engine SPL supports both serial-number reference and bitwise operations, so the optimizations above are easy to implement, and the corresponding SPL code is simple. Suppose the original table T_ordinary contains these fields: the user id; the enumeration tag field dName (for example, age group: children, juvenile, youth, middle age, old age); 16 Boolean tag fields flag1 to flag16; and the amount field amt. The value range of dName is stored in the option table dim. The following code performs the conversion to serial numbers and bit storage:
| | A |
| 1 | =file("T_ordinary.ctx").open().cursor(id,dName,flag1,flag2,…,flag16,amt) |
| 2 | =T("dim.btx") |
| 3 | =A1.new(id,A2.pos@b(dName):d,bits(flag1,flag2,…,flag16):b,amt) |
| 4 | =file("T.ctx").create(id,d,b,amt) |
| 5 | =A4.append(A3) |
A3 uses the pos function to replace each dName value with its serial number in dim and saves it as the new field d. Because dim is sorted by dName in advance, binary search is faster here. At the same time, the bits function converts the 16 tag fields into a single 16-bit integer field b.
The converted table T supports high-performance tag filtering and statistics. For example, suppose the filter condition is that dName is in the set ["middle age","old age"] passed in from the front end, and that the tags flag4 and flag8 are both 1. After filtering, the amount is summed and the records are counted, grouped by d. The code looks something like this:
| | A |
| 1 | =bits(0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0) |
| 2 | =T("dim.btx").(["middle age","old age"][email protected](~)) |
| 3 | =file("T.ctx").open().cursor(amt;A2(d) && and(b,A1)==A1) |
| 4 | =A3.groups(d;sum(amt),count(~)) |
A1 uses the bits function to generate a 16-bit small integer whose 4th and 8th bits are 1, corresponding to the tags flag4 and flag8. A2 generates the set of Boolean values. A3 uses the Boolean set and the small integer to perform the filtering.
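For readers unfamiliar with SPL, the same filter-then-group logic can be approximated in plain Python. This is a sketch under stated assumptions rather than a translation of the grid above: the rows, the DIM list, and the helper flag_mask are invented for illustration, positions in DIM are 0-based, and the first flag is assumed to occupy the highest of 16 bits.

```python
from collections import defaultdict

# Hypothetical list of possible dName values (0-based positions).
DIM = ["children", "juvenile", "youth", "middle age", "old age"]

def flag_mask(positions, width=16):
    """Integer with the given 1-based flag positions set (assumed bit ordering)."""
    mask = 0
    for p in positions:
        mask |= 1 << (width - p)
    return mask

# Assumed converted rows: (d, b, amt), where d is the position in DIM
# and b holds the 16 packed binary tags.
rows = [
    (3, flag_mask([4, 8]), 100.0),
    (4, flag_mask([4, 8, 10]), 50.0),
    (4, flag_mask([4]), 70.0),        # flag8 not set: filtered out
    (1, flag_mask([4, 8]), 30.0),     # juvenile: filtered out
    (3, flag_mask([1, 4, 8]), 25.0),
]

# Boolean set aligned with DIM, plus the mask for flag4 and flag8.
wanted = {"middle age", "old age"}
in_set = [name in wanted for name in DIM]
mask = flag_mask([4, 8])

# Single pass: one list lookup and one bitwise AND per row, then group by d.
totals = defaultdict(lambda: [0.0, 0])
for d, b, amt in rows:
    if in_set[d] and (b & mask) == mask:
        totals[d][0] += amt
        totals[d][1] += 1

# Map serial numbers back to names for display.
summary = {DIM[d]: (s, n) for d, (s, n) in sorted(totals.items())}
```

Grouping on the integer d and translating to names only at the end keeps string handling out of the traversal loop.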
With SPL's virtual tables, these converted fields can also be made transparent and used directly like ordinary fields. For example, after defining the virtual table T_pseudo on top of table T, the code above becomes roughly:
| | A |
| 1 | =T_pseudo.select(flag4 && flag8 && ["middle age","old age"][email protected](dName)) |
| 2 | =A1.groups(dName;sum(amt),count(~)) |
flag4 and flag8 are bit dimension fields defined in the virtual table, corresponding to the 4th and 8th bits of field b in table T. dName is an enumeration dimension field in the virtual table; its value is the name corresponding to the serial number in field d of table T.
With the virtual table, the actual storage and calculation methods remain unchanged, and SPL applies the algorithm above automatically. Moreover, ordinary Boolean values can be used in filter conditions, and the grouping values in the result set become readable strings, so there is no need to convert between serial numbers and names by hand. See SPL Virtual table data type optimization.