[Data analysis and visualization] Key points of data plotting 6: too many data groups
2022-06-13 02:34:00 【The winter holiday of falling marks】
Key points of data plotting 6: too many data groups
Comparing the distributions of several numerical variables is a common task in data presentation. A variable's distribution can be shown with a histogram or a density plot, and overlaying a moderate number of groups on the same axes works well. With too many groups, however, the chart becomes hard to read.
Example: plotting distributions
Here is an example about how people perceive probability words: what probability do people assign to a phrase such as "Highly likely"? The distribution of the probability scores is shown below.
# Load the libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(patchwork)
# Load the data
data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",")
# Reshape to long format: one row per (phrase, score) pair
data <- data %>%
  gather(key="text", value="value") %>%
  # Column names use dots in place of spaces; restore the spaces
  mutate(text = gsub("\\.", " ", text)) %>%
  mutate(value = round(as.numeric(value), 0))
head(data)
nrow(data)
 | text | value |
---|---|---|
 | <chr> | <dbl> |
1 | Almost Certainly | 95 |
2 | Almost Certainly | 95 |
3 | Almost Certainly | 95 |
4 | Almost Certainly | 95 |
5 | Almost Certainly | 98 |
6 | Almost Certainly | 95 |

782
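`gather()` still works, but tidyr documents it as superseded by `pivot_longer()`. Below is a minimal sketch of the same wide-to-long reshape on a small made-up table. The `toy` data frame and its values are invented for illustration and are not rows from probly.csv.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per phrase, one row per respondent.
# read.table() turns spaces in column names into dots, as in the real file.
toy <- data.frame(
  Almost.Certainly = c(95, 98, 90),
  About.Even       = c(50, 50, 45)
)

long <- toy %>%
  # Stack every column into (text, value) pairs
  pivot_longer(everything(), names_to = "text", values_to = "value") %>%
  # Restore the spaces in the phrase names and round the scores
  mutate(text  = gsub("\\.", " ", text),
         value = round(as.numeric(value), 0))

print(long)
nrow(long)  # one row per (respondent, phrase) pair
```

`pivot_longer()` emits rows respondent by respondent while `gather()` stacks whole columns, so the row order differs but the set of rows is the same.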
# Positions and labels for the text annotations
annot <- data.frame(
  text = c("Almost No Chance", "About Even", "Probable", "Almost Certainly"),
  x = c(5, 53, 65, 79),
  y = c(0.15, 0.4, 0.06, 0.1)
)
# Keep only four phrases for the first plot
data1 <- filter(data, text %in% c("Almost No Chance", "About Even", "Probable", "Almost Certainly"))
# Order the phrases by their median score
data1 <- mutate(data1, text = fct_reorder(text, value))
head(data1)
nrow(data1)
 | text | value |
---|---|---|
 | <fct> | <dbl> |
1 | Almost Certainly | 95 |
2 | Almost Certainly | 95 |
3 | Almost Certainly | 95 |
4 | Almost Certainly | 95 |
5 | Almost Certainly | 98 |
6 | Almost Certainly | 95 |

184
# Plot
ggplot(data1, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  # Label each curve directly instead of using a legend
  geom_text(data=annot, aes(x=x, y=y, label=text, color=text), hjust=0, size=4.5) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis; the y axis is the estimated density
  xlab("Assigned Probability (%)") +
  ylab("Density")
In this case the graphic is quite clean. People assign the phrase "Almost No Chance" a probability between 0% and 20%, and the phrase "Almost Certainly" a probability between 75% and 100%. But let's look at what happens when we plot more groups.
# Plot all groups, ordered by median score
data2 <- mutate(data, text = fct_reorder(text, value))
ggplot(data2, aes(x=value, color=text, fill=text)) +
  # Draw one density curve per group
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis; the y axis is the estimated density
  xlab("Assigned Probability (%)") +
  ylab("Density")
This figure is now too cluttered to tell the groups apart: too many groups are shown on the same chart. How can this be avoided? The following sections present several solutions.
Solutions
Boxplot
The most common way to represent this kind of dataset is the boxplot. It summarizes the main features of each group, giving an efficient overview of the distributions. Watch out for a few pitfalls, though. Sorting the groups usually makes the chart easier to read. If the group labels are long, consider a horizontal version so the labels stay readable. A boxplot also hides the sample size and the underlying distribution of each group; drawing the individual data points unobtrusively on top of the boxes helps.
ggplot(data2, aes(x=text, y=value, fill=text)) +
  # Draw the boxplots
  geom_boxplot() +
  # Add the individual data points
  geom_jitter(color="grey", alpha=0.3, size=0.9) +
  scale_fill_viridis(discrete=TRUE) +
  theme(
    legend.position="none"
  ) +
  # Flip the x and y axes so the long labels read horizontally
  coord_flip() +
  xlab("") +
  ylab("Assigned Probability (%)")
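Since a boxplot does not show how many observations sit behind each box, it can help to print the numbers next to the chart. The following is a minimal sketch on simulated data; the `sim` data frame, its phrases, and its values are invented stand-ins for `data2`.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in for the long-format data; phrases and values are made up
sim <- data.frame(
  text  = rep(c("Unlikely", "About Even", "Likely"), times = c(10, 40, 25)),
  value = c(runif(10, 0, 30), runif(40, 40, 60), runif(25, 60, 95))
)

# Per group: sample size plus the three quartiles that a boxplot draws
group_summary <- sim %>%
  group_by(text) %>%
  summarise(
    n      = n(),
    q1     = quantile(value, 0.25, names = FALSE),
    median = median(value),
    q3     = quantile(value, 0.75, names = FALSE),
    .groups = "drop"
  ) %>%
  # Same ordering idea as fct_reorder(text, value): sort groups by median
  arrange(median)

print(group_summary)
```

The `n` column makes it obvious when one box rests on far fewer observations than its neighbours.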
Violin plot
As long as the sample size is large enough, violin plots are usually a good alternative to boxplots. They are very similar to boxplots but describe each group's distribution more precisely. With many groups, however, a violin plot may not be the best choice: each violin tends to be very thin, which makes its distribution hard to see. In that case a good alternative is the ridgeline plot, described later in this article.
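One way to fight thin violins is to raise `width`, which the violin code in this section does with `width=2.1`. The `position_dodge` warning printed after that code appears to come from violins wider than their slot overlapping their neighbours. Here is a small sketch on simulated data that collects the warnings ggplot2 raises for two widths. The `sim` data frame and the `build_warnings()` helper are invented for illustration.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in data; group names and values are invented
sim <- data.frame(
  text  = rep(c("A", "B", "C"), each = 50),
  value = c(rnorm(50, 20, 5), rnorm(50, 50, 8), rnorm(50, 80, 5))
)

# Build a violin plot and collect any warnings raised while ggplot2 computes it
build_warnings <- function(violin_width) {
  warns <- character(0)
  p <- ggplot(sim, aes(x = text, y = value, fill = text)) +
    geom_violin(width = violin_width) +
    coord_flip() +
    theme(legend.position = "none")
  withCallingHandlers(
    ggplot_build(p),
    warning = function(w) {
      # Record the message, then silence the warning
      warns <<- c(warns, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  warns
}

warns_narrow <- build_warnings(0.9)  # violins stay inside their own slot
warns_wide   <- build_warnings(2.1)  # violins spill into neighbouring slots
print(warns_narrow)
print(warns_wide)
```

Keeping `width` at or below 1 avoids the overlap, at the cost of thinner violins; that trade-off is one reason the ridgeline plot is attractive when there are many groups.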
ggplot(data2, aes(x=text, y=value, fill=text, color=text)) +
  # Draw the violins
  geom_violin(width=2.1, size=0.2) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none"
  ) +
  # Flip the x and y axes so the long labels read horizontally
  coord_flip() +
  xlab("") +
  ylab("Assigned Probability (%)")
Warning message:
"position_dodge requires non-overlapping x intervals"
Density plot
If there are only a few groups, they can be compared on the same density plot. Here only four groups were selected to illustrate the idea. This is the example shown earlier; it only works when there are few groups, because with more the graphic becomes cluttered and hard to read.
# Plot
ggplot(data1, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  # Label each curve directly instead of using a legend
  geom_text(data=annot, aes(x=x, y=y, label=text, color=text), hjust=0, size=4.5) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis; the y axis is the estimated density
  xlab("Assigned Probability (%)") +
  ylab("Density")
With more than four groups, however, the graphic becomes too cluttered. If you still want to use density plots, it is better to draw one small panel per group (small multiples). This is a good way to study each group's distribution separately. However, because the groups no longer sit on a shared axis (and, with `free_y`, each panel has its own Y scale), comparing groups with one another is harder. It all depends on the question you are asking.
ggplot(data2, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis; the y axis is the estimated density
  xlab("Assigned Probability (%)") +
  ylab("Density") +
  # One panel per group, each with its own y scale
  facet_wrap(~text, scales="free_y")
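Why free the Y scale for density panels? A tightly concentrated group has a much taller density peak than a spread-out one, so on a shared Y scale the spread-out groups look nearly flat. The sketch below uses two simulated groups (the names `tight` and `wide` and their parameters are invented) and base R's `density()` to compare peak heights.

```r
set.seed(7)
# Two simulated groups with the same centre but very different spread
tight <- rnorm(200, mean = 50, sd = 2)
wide  <- rnorm(200, mean = 50, sd = 15)

# Height of the tallest point of a kernel density estimate
peak <- function(x) max(density(x)$y)

peaks <- c(tight = peak(tight), wide = peak(wide))
print(round(peaks, 3))

# How many times taller the concentrated group's peak is
print(round(peaks[["tight"]] / peaks[["wide"]], 1))
```

Freeing the Y scale makes every panel readable, but peak heights can then no longer be compared across panels.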
Histogram
The histogram is very close to the density plot and is handled the same way, with small multiples. Note that in this example the Y scale is the same for every group, unlike the previous density example.
ggplot(data2, aes(x=value, color=text, fill=text)) +
  # Bins of 5 percentage points
  geom_histogram(alpha=0.6, binwidth = 5) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis; the y axis is the number of answers per bin
  xlab("Assigned Probability (%)") +
  ylab("Count") +
  # One panel per group, all sharing the same scales
  facet_wrap(~text)
Ridgeline plot
In this case the best choice is probably a ridgeline plot. It has all the advantages of the violin plot but avoids wasted space, because the groups are allowed to overlap. It shows both the individual distributions and the comparison between groups effectively.
# Load the ridgeline plotting library
library(ggridges)
ggplot(data2, aes(y=text, x=value, fill=text)) +
  # One density ridge per group
  geom_density_ridges(alpha=0.6, bandwidth=4) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis; the y axis holds the group labels
  xlab("Assigned Probability (%)") +
  ylab("")
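The idea behind a ridgeline plot can be sketched in base R: estimate one density curve per group with a common bandwidth, then give each curve its own baseline so that neighbouring curves may overlap. This is a conceptual sketch on simulated data, not the ggridges implementation; the group names, the values, and the overlap factor of 1.5 are all assumptions made for illustration.

```r
set.seed(3)
# Simulated stand-in groups; names and values are invented
groups <- list(
  Unlikely = rnorm(100, 20, 8),
  Even     = rnorm(100, 50, 5),
  Likely   = rnorm(100, 80, 8)
)

# One density curve per group, all with the same bandwidth and x grid
curves <- lapply(groups, function(v) {
  density(v, bw = 4, from = 0, to = 100, n = 256)
})

# Rescale so the tallest curve is 1.5 slots high, then stack the baselines
tallest <- max(sapply(curves, function(d) max(d$y)))
ridges <- do.call(rbind, lapply(seq_along(curves), function(i) {
  d <- curves[[i]]
  data.frame(
    text     = names(curves)[i],
    x        = d$x,
    baseline = i,                        # group i sits on the line y = i
    height   = i + 1.5 * d$y / tallest   # curve drawn above its baseline
  )
}))

str(ridges)
```

Because the tallest curve rises 1.5 slots above its baseline, it reaches into the slot of the group above it, which is how the wasted space of separate panels is reclaimed.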