[Data analysis and visualization] Key points of data plotting 6: too many data groups
2022-06-13 02:34:00 【The winter holiday of falling marks】
Key points of data plotting 6: too many data groups
Comparing the distributions of several numerical variables is a common task in data presentation. A variable's distribution can be shown with a histogram or a density plot, and putting a moderate number of groups on the same axis works well. Too many groups, however, seriously hurts how well the chart communicates.
Example: plotting data distributions
Here is an example showing how people perceive probability vocabulary: what probability do they assign to a phrase such as "Highly likely"? The plots below show the distribution of the assigned probability scores.
# Load the library
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(patchwork)
# Load data
data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",")
# Reshape to long format: one row per (phrase, answer) pair
data <- data %>%
  gather(key="text", value="value") %>%
  # Column names use dots where the phrases have spaces
  mutate(text = gsub("\\.", " ", text)) %>%
  mutate(value = round(as.numeric(value), 0))
head(data)
nrow(data)
| | text | value |
|---|---|---|
| | <chr> | <dbl> |
| 1 | Almost Certainly | 95 |
| 2 | Almost Certainly | 95 |
| 3 | Almost Certainly | 95 |
| 4 | Almost Certainly | 95 |
| 5 | Almost Certainly | 98 |
| 6 | Almost Certainly | 95 |
782
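The reshaping step above turns a wide table (one column per phrase) into a long one (one row per answer). A minimal sketch of the same idea on a made-up table follows. The `wide` data frame and its values are invented for illustration, and the sketch uses `pivot_longer()`, which tidyr offers as the newer replacement for the superseded `gather()`.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: one column per phrase, one row per respondent
wide <- data.frame(
  Almost.Certainly = c(95, 98, 90),
  About.Even       = c(50, 50, 45)
)

long <- wide %>%
  # One row per (phrase, answer) pair
  pivot_longer(cols = everything(), names_to = "text", values_to = "value") %>%
  # Column names use dots where the phrases have spaces
  mutate(text = gsub("\\.", " ", text)) %>%
  # pivot_longer() interleaves the columns row by row; sort to group by phrase
  arrange(text)

print(long)
print(nrow(long))  # number of rows times number of columns of the wide table
```

The long table has as many rows as the wide table has cells, which is why the row count grows quickly when many phrases are measured.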
# Labels drawn directly on the plot, with their x/y positions
annot <- data.frame(
text = c("Almost No Chance", "About Even", "Probable", "Almost Certainly"),
x = c(5, 53, 65, 79),
y = c(0.15, 0.4, 0.06, 0.1)
)
# Keep only four phrases for display
data1 <- filter(data, text %in% c("Almost No Chance", "About Even", "Probable", "Almost Certainly"))
# Order the phrases by their median assigned value
data1 <- mutate(data1, text = fct_reorder(text, value))
head(data1)
nrow(data1)
| | text | value |
|---|---|---|
| | <fct> | <dbl> |
| 1 | Almost Certainly | 95 |
| 2 | Almost Certainly | 95 |
| 3 | Almost Certainly | 95 |
| 4 | Almost Certainly | 95 |
| 5 | Almost Certainly | 98 |
| 6 | Almost Certainly | 95 |
184
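`fct_reorder()` reorders the levels of `text` by a summary of `value` (the median by default), which is what later makes the groups appear in increasing order on the axis. A small sketch with invented values (the `text` and `value` vectors below are not from the dataset):

```r
library(forcats)

# Hypothetical answers for three phrases
text  <- c("Probable", "Probable", "About Even", "About Even",
           "Almost No Chance", "Almost No Chance")
value <- c(70, 75, 50, 45, 2, 5)

# Default factor levels are alphabetical
print(levels(factor(text)))

# Levels sorted by the median of value within each group
ordered_text <- fct_reorder(text, value)
print(levels(ordered_text))
```

Only the level order changes; the data values themselves are untouched, so the same rows simply plot in a different order.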
# Plot
ggplot(data1, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  # Label each curve directly instead of using a legend
  geom_text(data=annot, aes(x=x, y=y, label=text, color=text), hjust=0, size=4.5) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # The x axis carries the assigned probability; the y axis is the density
  xlab("Assigned Probability (%)") +
  ylab("Density")

In this case the graphic is very clean. People assign the phrase "Almost No Chance" a probability between 0% and 20%, and the phrase "Almost Certainly" a probability between 75% and 100%. But let's see what happens when we plot more groups.
# Plot all groups, ordered by their median assigned value
data2 <- mutate(data, text = fct_reorder(text, value))
ggplot(data2, aes(x=value, color=text, fill=text)) +
  # Draw one density curve per group
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("Assigned Probability (%)") +
  ylab("Density")

Now the picture is too cluttered and the groups can no longer be told apart: too many groups are drawn on the same plot. How can this be avoided? The next section presents several solutions.
Solutions
Boxplot
The most common way to represent this kind of dataset is the boxplot. It summarizes the main features of each group and so allows an efficient comparison of distributions. Watch out for a few pitfalls. Sorting the groups usually makes the chart easier to read. If the group labels are long, consider a horizontal version so the labels stay readable. A boxplot also hides the underlying distribution and the sample size of each group; you can overlay the individual data points as unobtrusive dots to show them.
ggplot(data2, aes(x=text, y=value, fill=text)) +
  # Draw the boxplots
  geom_boxplot() +
  # Overlay the individual data points
  geom_jitter(color="grey", alpha=0.3, size=0.9) +
  scale_fill_viridis(discrete=TRUE) +
  theme(
    legend.position="none"
  ) +
  # Flip the x and y axes so the long labels read horizontally
  coord_flip() +
  xlab("") +
  ylab("Assigned Probability (%)")

Violin plot
Provided the sample size is large enough, violin plots are usually a good alternative to boxplots. They are very close to boxplots but, by construction, describe each group's distribution more precisely. With many groups a violin plot may not be the best choice, because each violin tends to become very thin, which makes its distribution hard to see. In that case a good alternative is the ridgeline plot, described later in this article.
ggplot(data2, aes(x=text, y=value, fill=text, color=text)) +
  geom_violin(width=2.1, size=0.2) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none"
  ) +
  coord_flip() +
  xlab("") +
  ylab("Assigned Probability (%)")
Warning message:
"position_dodge requires non-overlapping x intervals"

Density plot
If there are only a few groups, they can be compared on the same density plot. Here only four groups were selected to illustrate the idea. This is the example shown earlier; it only works when there are few groups. With more groups, the plot becomes cluttered and hard to read.
# Plot
ggplot(data1, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  # Label each curve directly instead of using a legend
  geom_text(data=annot, aes(x=x, y=y, label=text, color=text), hjust=0, size=4.5) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # The x axis carries the assigned probability; the y axis is the density
  xlab("Assigned Probability (%)") +
  ylab("Density")

With more than four groups, however, the plot becomes too cluttered. If you still want to use density plots, it is better to draw one small panel per group (small multiples). This is a good way to study each group's distribution separately. However, because the groups no longer sit on a single set of axes, it is harder to compare them with each other. Which trade-off is right depends on the question being asked.
ggplot(data2, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("Assigned Probability (%)") +
  ylab("Density") +
  # One panel per group, each with its own y scale
  facet_wrap(~text, scales="free_y")

Histogram
A histogram is very close to a density plot and can be handled the same way, using one panel per group. Note that in this example the Y scale is the same for every group, unlike the previous density example.
ggplot(data2, aes(x=value, color=text, fill=text)) +
  # Bins are 5 percentage points wide
  geom_histogram(alpha=0.6, binwidth = 5) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # The x axis carries the assigned probability; the y axis is the bin count
  xlab("Assigned Probability (%)") +
  ylab("Count") +
  # One panel per group, all sharing the same scales
  facet_wrap(~text)

Ridgeline plot
In this case the best choice is probably a ridgeline plot. It has all the advantages of the violin plot but avoids wasted space, because the groups are allowed to overlap. It describes both the individual distributions and the comparison between groups effectively.
# Load the ridgeline plotting library
library(ggridges)
ggplot(data2, aes(y=text, x=value, fill=text)) +
  # One partially overlapping density curve per group
  geom_density_ridges(alpha=0.6, bandwidth=4) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # The x axis carries the assigned probability; the y axis lists the phrases
  xlab("Assigned Probability (%)") +
  ylab("")
