当前位置:网站首页>N methods of data De duplication using SQL
N methods of data De duplication using SQL
2022-06-28 01:02:00 【Ink Sky Wheel】
Remember many years ago , A test girl found me :
Brother Qiang , The data in my table is repeated , How to delete duplicate data ?
Similar scenarios that require data De duplication , It is quite common in practical work .
Let's talk about , Use SQL Statement come and go , What are the common methods .
If we have one student surface :
create table student(id int,name varchar(50),age int,address varchar(100));
The data in the table are as follows :

Method 1 : Use DISTINCT Keyword de duplication .
DISTINCT keyword , When use , It will be followed by the duplicate fields . It can ensure that the data of these de duplication fields are not repeated .
such as , Take out student In the table , Not repeated address What are they? , You can use the following SQL sentence :
select distinct addressfrom student;
The results are as follows :

This method , The biggest advantage is that it is easy to use .
But there is also a big drawback , That is, the de duplicated fields and the fields in the final returned result set , It's consistent . in other words , Above SQL In the sentence , Use address The fields are de duplicated , Final result , Only return address A field .
If you want to use address Field de duplication , And return other fields at the same time ,DISTINCT It can't be done .
Method 2 : Use GROUP BY Keyword de duplication
And DISTINCT Same keyword ,GROUP BY keyword , It's also the standard SQL Common de duplication methods supported . It can remove the weight at the same time , Synchronously return information of other fields .
Or to address Take the field de duplication as an example , Other fields can be obtained as needed using aggregate functions :
select min(id),max(name),max(age),addressfrom studentgroup by address;
The results are as follows :

In the above sentence , Not only for address The field has been de duplicated , It also returns id、name、age Field information .
At this point , Than DISTINCT It's easy to use .
however , Take a closer look. , It seems that something is wrong .
id=1 Of the students , It should be called zhoujunting , But in the above return result, it is yangxiaoyu , Back to age Field , The same problem applies .
in other words , In the returned results , In the same line id、name、age, May not belong to the same student , This makes the data look a bit confusing .
If there are requirements for data consistency , You can use the third method below .
Method 3 : Use the window function to remove duplicates .
There are several window functions , It's similar in use , This is just an introduction ROW_NUMBER() over(partition by ... order by ...).
selectid,name,age,addressfrom (select id,name,age,address,row_number() over(partition by addressorder by id asc) as rnfrom student)awhere a.rn = 1;
ROW_NUMBER() The meaning of window function is , First, the data is processed according to partition by The fields of , And then to order by To sort the fields of , The serial number from 1 Began to increase .
above SQL The result returned is :

This returns the result , It's much more perfect .
however , It should be noted that , Some databases do not support window functions . image MySQL In the database .
Method four : Use IN duplicate removal
The key to this approach is , Find the characteristics of a set of non repeating data , Then take the data with this feature .
such as : Press address To and fro , If the data is duplicated , take id The biggest one .
select *from studentwhere id in (select max(id)from studentgroup by address);
SQL The results are as follows :

Of course , Can also take id The smallest one , Put the... In the above statement max Change to min That's all right. .
This method is suitable for a field in the table where the data is not repeated ( above SQL Medium id Field ) The situation of .
If such a field does not exist in the table , This method is no longer applicable . But some databases , A similar field is built-in and can be used .
such as , stay ORACLE In the database , have access to ROWID Instead of the above SQL Medium id Field . Of course, only limited to ORACLE database :
select *from studentwhere rowid in (select max(rowid)from studentgroup by address);
Method five : Use NOT EXISTS duplicate removal
It is similar to the idea of method 4 , Use NOT EXISTS The same effect can be achieved .
select *from student awhere not exists(select 1from student bwhere a.address = b.addressand a.id > b.id);
SQL The results are as follows :

Methods six : Use ALL keyword
stay MySQL In the database , There is a special operator ALL, This is a set operator .
select *from student awhere a.id <= ALL(select b.idfrom student bwhere a.address = b.address);
SQL The results are as follows :

Above SQL in ,ALL The operator means to say ,a.id Field to <=ALL Operator all the values found in parentheses .
therefore , The core idea of this method is similar to method 4 .
Methods seven : Use INNER JOIN + GROUP BY keyword
The core idea of this method , It is also similar to method 4 .
selecta.*from student ainner join student bon a.address = b.addressand a.id >= b.idgroup by a.id,a.name,a.age,a.addresshaving count(*)=1;
SQL The results are as follows :

Use the above skillfully 7 A method of data De duplication , Basically, all data De duplication problems can be solved .
Of course , If you have a better way , Welcome to leave a message below .
ps. Click below to read the article , You can download the materials and SQL Example statement . You can also add me wechat 201855204 obtain .
Recommended reading :
边栏推荐
- Taro--- day1--- construction project
- HCIP/HCIE Routing&Switching / Datacom备考宝典系列(十九)PKI知识点全面总结(公钥基础架构)
- Internship: business process introduction
- Why are cloud vendors targeting this KPI?
- The development of the Internet provides new solutions for industrial transformation
- Taro--- day2--- compile and run
- 大尺寸导电滑环市场应用强度如何
- 796 div.2 C. managing history thinking
- 手机股票开户安全吗,买股票在哪开户?
- SDF学习之扭曲模型
猜你喜欢

Arduino uno realizes simple touch switch through direct detection of capacitance

Leetcode 720. 词典中最长的单词(为啥感觉这道题很难?)

GFS 分布式文件系统概述与部署

深入解析kubernetes controller-runtime

LabVIEW continuous sampling and limited sampling mode

Alchemy (1): identify prototype, demo and MVP in project development
最新MySQL高级SQL语句大全

剑指 Offer 65. 不用加减乘除做加法

electron窗口背景透明无边框(可用于启动页面)

去哪儿网(Qunar) DevOps 实践分享
随机推荐
JVM的内存模型简介
Redis主从复制、哨兵模式、集群的概述与搭建
What is promise
免费、好用、强大的开源笔记软件综合评测
快速掌握grep命令及正则表达式
Alchemy (3): how to do a good job in interfacing a business process
HCIP/HCIE Routing&Switching / Datacom备考宝典系列(十九)PKI知识点全面总结(公钥基础架构)
How many securities companies can a person open an account? Is it safe to open an account
Latest MySQL advanced SQL statement Encyclopedia
【无标题】
Download, configuration and installation of MySQL
Squid代理服务器(缓存加速之Web缓存层)
Is it safe to open an account for mobile stocks? Where can I open an account for buying stocks?
炼金术(3): 怎样做好1个业务流程的接口对接
攻击队攻击方式复盘总结
Introduction to memory model of JVM
Oracle数据库的启停
Overview and deployment of GFS distributed file system
互联网业衍生出来了新的技术,新的模式,新的产业类型
Software engineering job design (1): [personal project] implements a log view page