[Data analysis and visualization] Key points of data plotting 6: too many data groups
2022-06-13 02:34:00 【The winter holiday of falling marks】
Key points of data plotting 6: too many data groups
Comparing the distributions of several numerical variables is a common task in data visualization. A variable's distribution can be shown with a histogram or a density plot, and overlaying a moderate number of groups on the same axes works well. Too many groups, however, make the chart hard to read.
Example: plotting distributions
Here is an example that shows how people perceive probability vocabulary: what probability do they assign to a phrase such as "Highly likely"? The data below are the probability scores that respondents assigned to each phrase.
# Load the library
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(patchwork)
# Load data
data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",")
# Processing data
data <- data %>%
  gather(key="text", value="value") %>%
  mutate(text = gsub("\\.", " ", text)) %>%
  mutate(value = round(as.numeric(value), 0))
head(data)
nrow(data)
| | text | value |
|---|---|---|
| | <chr> | <dbl> |
| 1 | Almost Certainly | 95 |
| 2 | Almost Certainly | 95 |
| 3 | Almost Certainly | 95 |
| 4 | Almost Certainly | 95 |
| 5 | Almost Certainly | 98 |
| 6 | Almost Certainly | 95 |
782
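The reshaping step above turns one column per phrase into one row per answer. Below is a minimal sketch of the same idea on a tiny made-up table. It uses `pivot_longer()`, which tidyr documents as the newer replacement for `gather()`. The data frame `wide`, its column names and its values are invented for illustration and are not taken from the survey.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: one column per phrase, one row per respondent
wide <- data.frame(
  Almost.Certainly = c(95, 98, 90),
  About.Even       = c(50, 50, 45)
)

# Stack every column into (text, value) pairs, then tidy the labels
long <- wide %>%
  pivot_longer(cols = everything(), names_to = "text", values_to = "value") %>%
  mutate(text = gsub("\\.", " ", text))

print(long)
nrow(long)  # 3 respondents x 2 phrases
```

Note that `pivot_longer()` emits rows respondent by respondent, whereas `gather()` stacks the data column by column. The row order differs, but the plots do not depend on it.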
# Create the annotation labels drawn on the plot
annot <- data.frame(
text = c("Almost No Chance", "About Even", "Probable", "Almost Certainly"),
x = c(5, 53, 65, 79),
y = c(0.15, 0.4, 0.06, 0.1)
)
# Keep four phrases for display
data1 <- filter(data, text %in% c("Almost No Chance", "About Even", "Probable", "Almost Certainly"))
# Order the phrases by their median value
data1 <- mutate(data1, text = fct_reorder(text, value))
head(data1)
nrow(data1)
| | text | value |
|---|---|---|
| | <fct> | <dbl> |
| 1 | Almost Certainly | 95 |
| 2 | Almost Certainly | 95 |
| 3 | Almost Certainly | 95 |
| 4 | Almost Certainly | 95 |
| 5 | Almost Certainly | 98 |
| 6 | Almost Certainly | 95 |
184
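The `fct_reorder(text, value)` call above controls the order in which the groups are drawn: by default it sorts the factor levels by the median of `value` within each level. A small sketch on made-up values (the vectors below are hypothetical, not the survey data):

```r
library(forcats)

# Made-up scores for three phrases
text  <- c("Likely", "Likely", "Unlikely", "Unlikely", "About Even", "About Even")
value <- c(80, 90, 10, 20, 50, 55)

# Default alphabetical levels versus levels ordered by median value
alphabetical <- levels(factor(text))
by_median    <- levels(fct_reorder(text, value))

print(alphabetical)
print(by_median)
```

Without the reordering, ggplot2 would lay the groups out alphabetically, which makes the shift from low to high probabilities harder to see.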
# Plot
ggplot(data1, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  geom_text(data=annot, aes(x=x, y=y, label=text, color=text), hjust=0, size=4.5) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis, so the probability label belongs there
  xlab("Assigned Probability (%)") +
  ylab("Density")

In this case the chart is clean. Respondents assigned the phrase "Almost No Chance" a probability between 0% and 20%, and the phrase "Almost Certainly" a probability between 75% and 100%. Now let's see what happens when we plot more groups.
# Plot
data2 <- mutate(data, text = fct_reorder(text, value))
ggplot(data2, aes(x=value, color=text, fill=text)) +
  # Draw one density curve per group
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis, so the probability label belongs there
  xlab("Assigned Probability (%)") +
  ylab("Density")

Now the chart is too cluttered to tell the groups apart: too many groups are drawn on the same plot. How can we avoid this? The next section presents several solutions.
Solutions
Boxplot
The most common way to represent such a dataset is the boxplot. It summarizes the main features of each group and so gives a compact view of each distribution. Watch out for a few pitfalls. It usually makes sense to sort the groups so the chart is easier to read. If the group labels are long, consider a horizontal version that keeps them readable. A boxplot also hides the underlying distribution and the sample size, so you can overlay the individual data points as unobtrusive dots.
ggplot(data2, aes(x=text, y=value, fill=text)) +
  # Draw the boxplots
  geom_boxplot() +
  # Overlay the individual data points
  geom_jitter(color="grey", alpha=0.3, size=0.9) +
  scale_fill_viridis(discrete=TRUE) +
  theme(
    legend.position="none"
  ) +
  # Flip the x and y axes so the labels read horizontally
  coord_flip() +
  xlab("") +
  ylab("Assigned Probability (%)")

Violin plot
As long as the sample size is large enough, a violin plot is usually a good substitute for a boxplot. It is very close to the boxplot, but by construction it describes each group's distribution more precisely. If you have many groups, a violin plot may not be the best choice, because each violin tends to become very thin, which makes its distribution hard to see. In that case a good alternative is the ridgeline plot, described later in this article.
ggplot(data2, aes(x=text, y=value, fill=text, color=text)) +
  # width > 1 lets neighboring violins overlap, which triggers the warning below
  geom_violin(width=2.1, size=0.2) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none"
  ) +
  coord_flip() +
  xlab("") +
  ylab("Assigned Probability (%)")
Warning message:
"position_dodge requires non-overlapping x intervals"

Density plot
If there are only a few groups, they can be compared on the same density plot. Here only four groups were selected to illustrate the idea; with more groups the chart becomes cluttered and hard to read. This is the example shown earlier, and it only works when there are few groups.
# Plot
ggplot(data1, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  geom_text(data=annot, aes(x=x, y=y, label=text, color=text), hjust=0, size=4.5) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis, so the probability label belongs there
  xlab("Assigned Probability (%)") +
  ylab("Density")

With more than four groups, however, the chart becomes too cluttered. If you still want density plots, drawing each group in its own panel (small multiples) is more appropriate. This is a good way to study each group's distribution separately. The trade-off is that the groups no longer sit on one shared set of axes (and here each panel gets its own y scale), so comparing them is harder. Which matters more depends on the question being asked.
ggplot(data2, aes(x=value, color=text, fill=text)) +
  geom_density(alpha=0.6) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("Assigned Probability (%)") +
  ylab("Density") +
  # One panel per group, each with its own y scale
  facet_wrap(~text, scales="free_y")

Histogram
A histogram is very close to a density plot and can be handled the same way, with one panel per group. Note that in this example the y scale is the same for every panel, unlike the faceted density plot above.
ggplot(data2, aes(x=value, color=text, fill=text)) +
  geom_histogram(alpha=0.6, binwidth = 5) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("Assigned Probability (%)") +
  ylab("Count") +
  facet_wrap(~text)

Ridgeline plot
In this case, the best choice is probably a ridgeline plot. It has the advantages of the violin plot but avoids wasted space, because the groups are allowed to overlap. It shows both the individual distributions and the comparison between groups effectively.
# Load the ridgeline plotting package
library(ggridges)
ggplot(data2, aes(y=text, x=value, fill=text)) +
  geom_density_ridges(alpha=0.6, bandwidth=4) +
  scale_fill_viridis(discrete=TRUE) +
  scale_color_viridis(discrete=TRUE) +
  theme(
    legend.position="none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  # value is on the x axis; the y axis carries the phrase labels
  xlab("Assigned Probability (%)") +
  ylab("")
