Interview Blitz 63: How do you remove duplicates in MySQL?

In MySQL, the two most common ways to remove duplicates are distinct and group by. What is the difference between them? Let's take a look.

1. Create test data

-- Create the test table
drop table if exists pageview;
create table pageview(
    id bigint primary key auto_increment comment 'auto-increment primary key',
    aid bigint not null comment 'article ID',
    uid bigint not null comment '(visiting) user ID',
    createtime datetime default now() comment 'creation time'
) default charset='utf8mb4';
-- Add test data
insert into pageview(aid,uid) values(1,1);
insert into pageview(aid,uid) values(1,1);
insert into pageview(aid,uid) values(2,1);
insert into pageview(aid,uid) values(2,2);

The table now contains four rows, with (aid, uid) values (1,1), (1,1), (2,1) and (2,2).

2. Using distinct

The basic syntax of distinct is as follows:

SELECT DISTINCT column_name,column_name FROM table_name;

2.1 Single-column deduplication

First we use distinct to deduplicate on a single column. Selecting the distinct aid (article ID) values from the test table returns two rows, 1 and 2.
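
The query image from the original post is not available here, so the following is a minimal sketch of single-column deduplication against the pageview test table created in section 1:

```sql
-- Deduplicate on aid only; with the test data this returns the values 1 and 2
select distinct aid from pageview;
```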

2.2 Multi-column deduplication

Besides single-column deduplication, distinct also supports deduplication across multiple columns (two or more). Deduplicating on the combination of aid (article ID) and uid (user ID) leaves three rows in the test data: (1,1), (2,1) and (2,2).
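
A minimal sketch of the multi-column case, again assuming the pageview test table from section 1. distinct applies to the whole select list, so a row is dropped only when every listed column matches another row:

```sql
-- Deduplicate on the (aid, uid) combination;
-- the two identical (1,1) rows collapse into one
select distinct aid, uid from pageview;
```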

2.3 Aggregate function + deduplication

distinct can also be used inside an aggregate function. Counting the distinct aid values gives the number of articles left after deduplication, which is 2 for the test data.
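
As a sketch of this combination on the pageview test table, count(distinct ...) counts each value once (the alias aid_count is only an illustrative name):

```sql
-- Number of distinct articles that have at least one page view;
-- with the test data the result is 2
select count(distinct aid) as aid_count from pageview;
```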

3. Using group by

The basic syntax of group by is as follows:

SELECT column_name,column_name FROM table_name 
WHERE column_name operator value 
GROUP BY column_name

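3.1 Single-column deduplication

Grouping by aid (article ID) returns one row per article. Compared with distinct, group by can return more columns than the ones being deduplicated, whereas with distinct every selected column takes part in the deduplication. Note that with the ONLY_FULL_GROUP_BY SQL mode enabled (the default since MySQL 5.7.5), the extra columns must be wrapped in an aggregate function or any_value().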
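
A minimal sketch on the pageview test table. The second query uses any_value(), available from MySQL 5.7.5, to show an extra column without violating ONLY_FULL_GROUP_BY; which uid it returns for each group is not defined, and the alias some_uid is only illustrative:

```sql
-- One row per aid, the same result as "select distinct aid"
select aid from pageview group by aid;

-- One row per aid, plus an arbitrary uid taken from each group
select aid, any_value(uid) as some_uid from pageview group by aid;
```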

3.2 Multi-column deduplication

Grouping by the combination of aid (article ID) and uid (user ID) returns one row per distinct pair, which is three rows for the test data.
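
A sketch of the multi-column grouping on the pageview test table:

```sql
-- One row per distinct (aid, uid) pair: (1,1), (2,1) and (2,2)
select aid, uid from pageview group by aid, uid;
```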

3.3 Aggregate function + group by

group by combined with count gives the number of rows for each aid. The query semantics of group by + count and distinct + count are therefore completely different: distinct + count returns the total number of values left after deduplication, while group by + count returns the number of rows in each group.
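
A minimal sketch of the per-group count on the pageview test table (the alias total is only illustrative):

```sql
-- Number of page-view rows for each article;
-- with the test data, aid 1 and aid 2 each have 2 rows
select aid, count(*) as total from pageview group by aid;
```

Compare this with count(distinct aid), which returns a single number for the whole table rather than one row per article.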

4. Differences between distinct and group by

When describing distinct, the official documentation says that in most cases distinct can be considered a special case of group by. Official documentation: dev.mysql.com/doc/refman/… However, there are still some subtle differences between the two, such as the following.

Difference 1: The query result set is different

When you deduplicate with distinct, the result set contains only the deduplicated columns. Trying to query an additional column that is not part of the deduplication, for example by writing distinct after another column in the select list, causes an SQL error. With group by, on the other hand, one or more additional fields can be queried alongside the grouping column.
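
The screenshots from the original post are not available here, so the following sketch on the pageview test table illustrates the contrast. The failing statement is left commented out, and the aliases visitors and first_view are hypothetical names:

```sql
-- distinct covers the whole select list: uid also takes part in deduplication
select distinct aid, uid from pageview;

-- distinct must come right after select; this form is a syntax error
-- select aid, distinct uid from pageview;

-- group by keeps one row per aid and still reports other fields via aggregates
select aid,
       count(distinct uid) as visitors,
       min(createtime)     as first_view
from pageview
group by aid;
```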

Difference 2: Different business scenarios

To count the total number of values after deduplication, use distinct. To report per-group details, or to filter on those details, you have to use group by. For example, counting the distinct values of a column is a job for distinct, while finding the articles whose row count after grouping is greater than 2 requires group by.
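
A sketch of the group by scenario, assuming the condition is expressed with a having clause on the pageview test table. Each article in the test data has exactly two rows, so the threshold of 2 from the text returns an empty result until more rows are inserted:

```sql
-- Articles viewed more than 2 times; having filters on the grouped counts
select aid, count(*) as total
from pageview
group by aid
having count(*) > 2;
```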

Difference 3: Different performance

If the deduplicated field has an index, both group by and distinct can use it, and in that case their performance is the same. When the deduplicated field has no index, distinct performs better than group by, because before MySQL 8.0 group by implicitly sorts its result by default, which triggers a filesort and reduces query performance.
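
One way to check this on your own server is to compare execution plans with explain. The sketch below assumes the pageview test table; idx_pageview_aid is a hypothetical index name, and the exact plan output depends on the MySQL version and data. On versions before 8.0, the Extra column for the unindexed group by query typically includes "Using filesort", and adding order by null is the documented way to suppress the implicit sort:

```sql
-- Without an index on aid: compare the Extra column of the two plans
explain select distinct aid from pageview;
explain select aid from pageview group by aid;

-- Before MySQL 8.0: ask group by not to sort its result
explain select aid from pageview group by aid order by null;

-- With an index on aid (hypothetical name), both queries can use it
create index idx_pageview_aid on pageview(aid);
explain select distinct aid from pageview;
explain select aid from pageview group by aid;
```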

Summary

In most scenarios distinct is a special case of group by, but there are subtle differences between the two: they differ in the query result set, in the business scenarios they suit, and in performance.

References & Acknowledgements

zhuanlan.zhihu.com/p/384840662
