[Data analysis and visualization] Key points of data plotting 6: too many data groups

2022-06-13 02:34:00 【The winter holiday of falling marks】

Key points of data plotting 6: too many data groups

Comparing the distributions of several numerical variables is a common task in data presentation. A variable's distribution can be shown with a histogram or a density plot, and drawing a moderate number of groups on the same axis is an appealing way to compare them. With too many groups, however, the chart becomes hard to read and stops conveying its information.

Example: plotting data distributions

Here is an example that shows how people perceive probability vocabulary: when someone says "Highly likely", what probability do they have in mind? The plots below show the distribution of the probability scores assigned to several such phrases.

# Load the libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(patchwork)

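# Load the data
data <- read.table("https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv", header=TRUE, sep=",")
# Reshape to long format: one row per (phrase, score) pair
data <- data %>% 
  gather(key="text", value="value") %>%
  # Column names use dots in place of spaces; restore the spaces
  mutate(text = gsub("\\.", " ",text)) %>%
  # Round the scores to whole numbers
  mutate(value = round(as.numeric(value),0))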
head(data)
nrow(data)

A data.frame: 6 × 2
	text	value
	<chr>	<dbl>
1	Almost Certainly	95
2	Almost Certainly	95
3	Almost Certainly	95
4	Almost Certainly	95
5	Almost Certainly	98
6	Almost Certainly	95

782
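
As a rough illustration of the reshaping step, the sketch below applies the same wide-to-long idea to a tiny made-up table, using base R's stack() in place of gather(). The table `wide` and its values are assumptions for illustration, not the survey data.

```r
# Hypothetical wide table: one column per phrase, one row per respondent
wide <- data.frame(
  Almost.Certainly = c(95, 98, 90),
  About.Even       = c(50, 50, 45)
)

# stack() returns a `values` column and an `ind` factor that holds
# the original column names
long <- stack(wide)
names(long) <- c("value", "text")

# Restore the spaces that were replaced by dots in the column names
long$text <- gsub("\\.", " ", as.character(long$text))

# Put the columns in the same order as in the article
long <- long[, c("text", "value")]
print(long)
print(nrow(long))
```

The long table has one row per score, so its row count is the number of respondents times the number of phrases.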

# Label positions used to annotate the density curves
annot <- data.frame(
  text = c("Almost No Chance", "About Even", "Probable", "Almost Certainly"),
  x = c(5, 53, 65, 79),
  y = c(0.15, 0.4, 0.06, 0.1)
)

# Keep a few phrases for the first plot
data1 <- filter(data, text %in% c("Almost No Chance", "About Even", "Probable", "Almost Certainly"))
# Order the phrases by their median score
data1 <- mutate(data1, text = fct_reorder(text, value))
head(data1)
nrow(data1)

A data.frame: 6 × 2
	text	value
	<fct>	<dbl>
1	Almost Certainly	95
2	Almost Certainly	95
3	Almost Certainly	95
4	Almost Certainly	95
5	Almost Certainly	98
6	Almost Certainly	95

184

# Plot
ggplot(data1, aes(x=value, color=text, fill=text)) +
geom_density(alpha=0.6) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
# Label each curve directly instead of using a legend
geom_text( data=annot, aes(x=x, y=y, label=text, color=text), hjust=0, size=4.5) +
theme(
  legend.position="none",
  panel.spacing = unit(0.1, "lines"),
  strip.text.x = element_text(size = 8)
) +
# The x axis carries the scores; the y axis is the estimated density
xlab("Assigned Probability (%)") +
ylab("Density")

Here the chart is very neat. People assign a probability between 0% and 20% to the phrase "Almost No Chance", and between 75% and 100% to the phrase "Almost Certainly". But let's look at what happens when more groups are drawn.

# Order the phrases by their median score, then plot all groups together
data2 <- mutate(data, text = fct_reorder(text, value))
ggplot(data2,aes(x=value, color=text, fill=text)) +
# Draw one density curve per group
geom_density(alpha=0.6) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
theme(
  panel.spacing = unit(0.1, "lines"),
  strip.text.x = element_text(size = 8)
) +
# The x axis carries the scores; the y axis is the estimated density
xlab("Assigned Probability (%)") +
ylab("Density")

Now the plot is too cluttered to tell the groups apart: too many groups are drawn on the same chart. How can this be avoided? The next section presents several workarounds.

Solutions

Boxplot

The most common way to represent this kind of dataset is the boxplot. It summarizes the main characteristics of each group and so allows an efficient comparison. A few pitfalls are worth noting. It usually makes sense to sort the groups so the chart is easier to read. If the group labels are long, consider a horizontal version that keeps them readable. Finally, the boxes hide the underlying distribution and the sample size, so it helps to overlay the individual data points as unobtrusive jittered dots.

ggplot(data2, aes(x=text, y=value, fill=text)) +
# Draw one box per group
geom_boxplot() +
# Overlay the individual data points
geom_jitter(color="grey", alpha=0.3, size=0.9) +
scale_fill_viridis(discrete=TRUE) +
theme(
  legend.position="none"
) +
# Flip the axes so the long group labels read horizontally
coord_flip() +
xlab("") +
ylab("Assigned Probability (%)")
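
To make the two pitfalls concrete, here is a minimal base R sketch of what a box summarizes and what it leaves out. The data frame `scores` holds made-up values (an assumption, not the survey data), reorder() stands in for fct_reorder(), and fivenum() gives a five-number summary of the kind a boxplot draws; ggplot2 may compute its hinges slightly differently.

```r
# Hypothetical scores for three phrases (illustrative values only)
scores <- data.frame(
  text  = rep(c("Probable", "About Even", "Almost Certainly"), each = 5),
  value = c(70, 75, 80, 60, 90,
            50, 50, 45, 55, 50,
            95, 98, 90, 99, 95)
)

# Order the groups by their median so the chart reads from low to high
scores$text <- reorder(scores$text, scores$value, FUN = median)

# Five-number summary (min, lower hinge, median, upper hinge, max) per group
five_num <- t(sapply(split(scores$value, scores$text), fivenum))
colnames(five_num) <- c("min", "lower", "median", "upper", "max")

# Sample size per group: information a bare box does not show
group_n <- table(scores$text)

print(levels(scores$text))
print(five_num)
print(group_n)
```

The summary table contains everything the boxes encode, while `group_n` is the part a reader can only recover from the jittered points.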

Violin plot

As long as the sample size is large enough, a violin plot is usually a good alternative to the boxplot. It is very close to a boxplot but, by construction, describes each group's distribution more accurately. With many groups, as here, the violin plot is probably not the best choice, because each violin tends to be very thin, which makes the distribution hard to see. In that case a good alternative is the ridgeline plot, described later in this article.

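ggplot(data2,  aes(x=text, y=value, fill=text, color=text)) +
# A width above 1 lets neighbouring violins overlap, which is the likely
# source of the position_dodge warning printed after this block
geom_violin(width=2.1, size=0.2) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
theme(
  legend.position="none"
) +
# Flip the axes so the long group labels read horizontally
coord_flip() +
xlab("") +
ylab("Assigned Probability (%)")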

Warning message:
"position_dodge requires non-overlapping x intervals"

Density plot

If there are only a few groups, they can be compared on the same density plot. Only four groups are selected here to illustrate the idea; with more, the chart becomes cluttered and hard to read. This is the same example shown earlier, and it only works when there are few groups.

# Plot
ggplot(data1, aes(x=value, color=text, fill=text)) +
geom_density(alpha=0.6) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
# Label each curve directly instead of using a legend
geom_text( data=annot, aes(x=x, y=y, label=text, color=text), hjust=0, size=4.5) +
theme(
  legend.position="none",
  panel.spacing = unit(0.1, "lines"),
  strip.text.x = element_text(size = 8)
) +
# The x axis carries the scores; the y axis is the estimated density
xlab("Assigned Probability (%)") +
ylab("Density")

However, with more than four groups the chart becomes too cluttered. If you still want to use density plots, it is better to draw one small panel per group. This is a good way to study each distribution separately. On the other hand, the groups no longer sit on one shared axis, and in this example each panel has its own y scale, so comparing groups with each other is harder. It all depends on the question being asked.

ggplot(data2,aes(x=value, color=text, fill=text)) +
geom_density(alpha=0.6) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
theme(
  legend.position="none",
  panel.spacing = unit(0.1, "lines"),
  strip.text.x = element_text(size = 8)
) +
# The x axis carries the scores; the y axis is the estimated density
xlab("Assigned Probability (%)") +
ylab("Density") +
# One panel per group, each with its own y scale
facet_wrap(~text, scales="free_y")
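
The free y scale matters because the height of a density curve depends on how concentrated a group is. The sketch below uses two simulated groups (hypothetical data, not the survey) and base R's density() to compare their peak heights; the helper `peak` is introduced here only for illustration.

```r
set.seed(1)

# Two simulated groups: one tightly concentrated, one widely spread
tight  <- rnorm(200, mean = 50, sd = 2)
spread <- rnorm(200, mean = 70, sd = 15)

# Height of the tallest point of a kernel density estimate
peak <- function(x) max(density(x)$y)

peaks <- c(tight = peak(tight), spread = peak(spread))
print(round(peaks, 3))
```

On a shared y scale the widely spread group would be squashed flat next to the concentrated one. A free scale keeps each panel readable, at the cost that curve heights can no longer be compared across panels.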

Histogram

The histogram is very close to the density plot and is handled the same way, with one panel per group. Note that in this example the y scale is the same for every group, unlike the previous density example.

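ggplot(data2, aes(x=value, color=text, fill=text)) +
# Count the scores in bins that are 5 percentage points wide
geom_histogram(alpha=0.6, binwidth = 5) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
theme(
  legend.position="none",
  panel.spacing = unit(0.1, "lines"),
  strip.text.x = element_text(size = 8)
) +
# The x axis carries the scores; the y axis is the number of answers per bin
xlab("Assigned Probability (%)") +
ylab("Count") +
# One panel per group, all sharing the same scales
facet_wrap(~text)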
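
To make the binwidth = 5 setting concrete, the sketch below bins a handful of made-up scores into 5-point intervals with base R's cut(). ggplot2 chooses its own bin edges by default, so the edges used here (multiples of 5 starting at 70) are an assumption; only the fixed bin width matches the plot above.

```r
# Hypothetical scores for one phrase (illustrative values only)
values <- c(95, 95, 98, 90, 80, 85, 99, 100, 72)

# Bin edges every 5 points, from 70 to 100
breaks <- seq(70, 100, by = 5)

# Assign each score to its bin; the first bin also includes its lower edge
bins <- cut(values, breaks = breaks, include.lowest = TRUE, right = TRUE)

# Number of scores per bin, i.e. the bar heights of a histogram
counts <- table(bins)
print(counts)
```

Because the bars are raw counts and every panel shares the same y scale, bar heights can be compared directly between groups.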

Ridgeline plot

In this case, the best choice is probably a ridgeline plot. It has all the advantages of the violin plot but avoids wasted space, because the groups overlap. It describes both the individual distributions and the comparison between groups efficiently.

# Load the ggridges package, which provides ridgeline plots
library(ggridges)

ggplot(data2, aes(y=text, x=value,  fill=text)) +
# One density ridge per group, all drawn with the same bandwidth
geom_density_ridges(alpha=0.6, bandwidth=4) +
scale_fill_viridis(discrete=TRUE) +
scale_color_viridis(discrete=TRUE) +
theme(
  legend.position="none",
  panel.spacing = unit(0.1, "lines"),
  strip.text.x = element_text(size = 8)
) +
# The x axis carries the scores; the y axis lists the phrases
xlab("Assigned Probability (%)") +
ylab("")
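
The bandwidth controls how much each ridge is smoothed. The sketch below illustrates the effect with base R's density() on a few made-up scores, assuming that the bandwidth argument of geom_density_ridges() plays the same role as bw in density().

```r
# Hypothetical scores with many ties, as produced by rounding
scores <- c(95, 95, 95, 95, 95, 98, 90, 80)

# Gaussian kernel density estimates with a narrow and a wide bandwidth
narrow <- density(scores, bw = 1)
smooth <- density(scores, bw = 4)

# A narrow bandwidth gives a tall spike at the tied value;
# a wider one spreads the same mass into a lower, smoother curve
print(round(max(narrow$y), 3))
print(round(max(smooth$y), 3))
```

Using one bandwidth for every group keeps the ridges comparable, since no group is smoothed more than another.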
