[Data analysis and visualization] Key points of data plotting 7: overplotting

2022-06-13 02:34:00 The winter holiday of falling marks

Key points of data plotting 7: overplotting

Overplotting is a common problem in data plotting. When a dataset is large, the points of a scatter plot tend to overlap, which makes the graphic unreadable. This article presents several ways to avoid overplotting.

An example of overplotting

The following scatter plot illustrates the problem. At first glance one might conclude that there is no obvious relationship between X and Y. The rest of the article shows how wrong that conclusion is.

# Load the libraries
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(patchwork)

# Dataset: three groups of 20,000 points, each drawn from normal distributions
a <- data.frame(x = rnorm(20000, 10, 1.2), y = rnorm(20000, 10, 1.2), group = rep("A", 20000))
b <- data.frame(x = rnorm(20000, 14.5, 1.2), y = rnorm(20000, 14.5, 1.2), group = rep("B", 20000))
c <- data.frame(x = rnorm(20000, 9.5, 1.5), y = rnorm(20000, 15.5, 1.5), group = rep("C", 20000))
# Stack the three groups into one data frame
data <- do.call(rbind, list(a, b, c))

# Plot all 60,000 points
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "#69b3a2", size = 2) +
  theme(
    legend.position = "none"
  )

Solutions

Reduce the size of the points

The simplest solution is probably to reduce the point size, which can already give very satisfactory results. Here you can clearly see three clusters that were hidden in the previous plot.

ggplot(data, aes(x = x, y = y)) +
  # Reduce the size of the points
  geom_point(color = "#69b3a2", size = 0.02) +
  theme(
    legend.position = "none"
  )

Transparency

Combined with a smaller point size, transparency can further reduce overplotting.

ggplot(data, aes(x = x, y = y)) +
  # Set transparency
  geom_point(color = "#69b3a2", size = 2, alpha = 0.01) +
  theme(
    legend.position = "none"
  )
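
The transparency example keeps size=2 and changes only alpha. The following is a minimal sketch of combining both settings, as the text suggests. The values size = 0.5 and alpha = 0.1 and the fixed seed are illustrative assumptions, not taken from the original article, and the simulated data is rebuilt so the snippet runs on its own.

```r
library(ggplot2)

# Rebuild the article's simulated data so this snippet is self-contained.
set.seed(1)  # assumption: arbitrary seed, only for reproducibility
n <- 20000
data <- rbind(
  data.frame(x = rnorm(n, 10, 1.2),   y = rnorm(n, 10, 1.2),   group = "A"),
  data.frame(x = rnorm(n, 14.5, 1.2), y = rnorm(n, 14.5, 1.2), group = "B"),
  data.frame(x = rnorm(n, 9.5, 1.5),  y = rnorm(n, 15.5, 1.5), group = "C")
)

# Smaller points and partial transparency at the same time.
# size = 0.5 and alpha = 0.1 are illustrative values to tune by eye.
p_combined <- ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "#69b3a2", size = 0.5, alpha = 0.1)

p_combined
```

In practice the two parameters trade off against each other: the smaller the points, the less transparency is needed before dense regions stand out.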

2D density plot

A two-dimensional density plot basically counts the number of observations within each area of the 2D space and represents that count as color, so the distribution of the points becomes clearly visible.

# Draw a 2D density plot
# after_stat(density) replaces the deprecated ..density.. notation
ggplot(data, aes(x = x, y = y)) +
  stat_density_2d(aes(fill = after_stat(density)), geom = "raster", contour = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  scale_fill_viridis() +
  theme(
    legend.position = "none"
  )
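
Note that stat_density_2d() draws a smoothed kernel density estimate rather than raw counts. If literal counts per area are wanted, one possible sketch uses ggplot2's geom_bin_2d(), which splits the plane into rectangular bins and colors each bin by the number of points it contains. The choice of 70 bins and the fixed seed are assumptions made for illustration, and the simulated data is rebuilt so the snippet runs on its own.

```r
library(ggplot2)

# Rebuild the article's simulated data so this snippet is self-contained.
set.seed(1)  # assumption: arbitrary seed, only for reproducibility
n <- 20000
data <- rbind(
  data.frame(x = rnorm(n, 10, 1.2),   y = rnorm(n, 10, 1.2),   group = "A"),
  data.frame(x = rnorm(n, 14.5, 1.2), y = rnorm(n, 14.5, 1.2), group = "B"),
  data.frame(x = rnorm(n, 9.5, 1.5),  y = rnorm(n, 15.5, 1.5), group = "C")
)

# Count the observations in each rectangular bin and map the count to fill.
# bins = 70 is an illustrative value; fewer bins give a coarser picture.
p_bins <- ggplot(data, aes(x = x, y = y)) +
  geom_bin_2d(bins = 70) +
  scale_fill_viridis_c() +
  theme(legend.position = "none")

p_bins
```

Binned counts are cheaper to compute than a kernel estimate, while the kernel estimate gives a smoother picture that depends less on where the bin edges fall.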

Data sampling

Sometimes less is more. Plotting only a small fraction of the data (here 5%) can greatly reduce computation time and helps avoid overplotting:

# Keep a random 5% of the rows
sample_data <- sample_frac(data, 0.05)
ggplot(sample_data, aes(x = x, y = y)) +
  geom_point(color = "#69b3a2", size = 2) +
  theme(
    legend.position = "none"
  )
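
Because the sample is random, the plot changes on every run. A minimal sketch of a reproducible variant follows. It fixes the random seed and uses dplyr's slice_sample(), which the dplyr documentation lists as the successor of the superseded sample_frac(). The seed values are arbitrary assumptions, and the simulated data is rebuilt so the snippet runs on its own.

```r
library(dplyr)
library(ggplot2)

# Rebuild the article's simulated data so this snippet is self-contained.
set.seed(1)  # assumption: arbitrary seed, only for reproducibility
n <- 20000
data <- rbind(
  data.frame(x = rnorm(n, 10, 1.2),   y = rnorm(n, 10, 1.2),   group = "A"),
  data.frame(x = rnorm(n, 14.5, 1.2), y = rnorm(n, 14.5, 1.2), group = "B"),
  data.frame(x = rnorm(n, 9.5, 1.5),  y = rnorm(n, 15.5, 1.5), group = "C")
)

# Fix the seed right before sampling so the same 5% is drawn every time.
set.seed(2022)  # assumption: arbitrary seed
sample_data <- slice_sample(data, prop = 0.05)

p_sample <- ggplot(sample_data, aes(x = x, y = y)) +
  geom_point(color = "#69b3a2", size = 2) +
  theme(legend.position = "none")

p_sample
```

With 60,000 rows, a 5% sample keeps 3,000 points, which is still enough here for the three clusters to show.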

Highlight a specific group

Another way to reduce the complexity of the graphic is to highlight a specific group.

ggplot(data, aes(x = x, y = y)) +
  # Draw all points in grey as the background
  geom_point(color = "grey", size = 2) +
  # Highlight group B
  geom_point(data = data %>% filter(group == "B"), color = "#69b3a2", size = 2) +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 12)
  ) +
  ggtitle("Behavior of the group B")

Grouping

If the data is grouped, you can use a different color for each group of points.

ggplot(data, aes(x = x, y = y, color = group)) +
  geom_point(size = 2, alpha = 0.1) +
  scale_color_viridis(discrete = TRUE)

Faceting

Once you have multiple groups in your plot, another approach is faceting: each panel highlights one group at a time.

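ggplot(data, aes(x = x, y = y)) +
  # Background: all points in grey. Dropping the group column makes them appear in every panel.
  # dplyr::select is called by namespace because MASS, if attached, masks select().
  geom_point(data = data %>% dplyr::select(-group), size = 1, alpha = 0.05, color = "grey") +
  # Foreground: the points of the highlighted group, drawn on top of the background
  geom_point(aes(color = group), size = 2, alpha = 0.1) +
  scale_color_viridis(discrete = TRUE) +
  theme(
    legend.position = "none"
  ) +
  # One panel per group
  facet_wrap(~group)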

3D plot

A three-dimensional surface can also be used to display the density. In this case, the position of each group becomes obvious.

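library(plotly)

# 2D kernel density estimate on a 50 x 50 grid.
# MASS is called by namespace instead of attached, because library(MASS) masks dplyr::select().
kd <- with(data, MASS::kde2d(x, y, n = 50))

# kde2d() returns z with rows indexed by x, while a plotly surface expects rows indexed by y, hence t().
plot_ly(x = kd$x, y = kd$y, z = t(kd$z)) %>% add_surface()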

Marginal distributions

Adding marginal distributions lets you detect distributions hidden in the overplotted part of the graphic. You can add a boxplot, a histogram, or a density plot in the margins.

library(ggExtra)

# Create the scatter plot
p <- ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "#69b3a2", size = 2, alpha = 0.01) +
  theme(
    legend.position = "none"
  )

# Add marginal histograms
ggExtra::ggMarginal(p, type = "histogram")
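
The histogram is only one of the marginal types mentioned in the text. A short sketch of the density and boxplot variants follows, using the type argument of ggExtra::ggMarginal(). The object names m_density and m_boxplot and the fixed seed are hypothetical choices for this illustration, and the simulated data is rebuilt so the snippet runs on its own.

```r
library(ggplot2)
library(ggExtra)

# Rebuild the article's simulated data so this snippet is self-contained.
set.seed(1)  # assumption: arbitrary seed, only for reproducibility
n <- 20000
data <- rbind(
  data.frame(x = rnorm(n, 10, 1.2),   y = rnorm(n, 10, 1.2),   group = "A"),
  data.frame(x = rnorm(n, 14.5, 1.2), y = rnorm(n, 14.5, 1.2), group = "B"),
  data.frame(x = rnorm(n, 9.5, 1.5),  y = rnorm(n, 15.5, 1.5), group = "C")
)

# Same base scatter plot as in the article.
p <- ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "#69b3a2", size = 2, alpha = 0.01) +
  theme(legend.position = "none")

# Marginal density curves instead of histograms.
m_density <- ggMarginal(p, type = "density")

# Marginal boxplots instead of histograms.
m_boxplot <- ggMarginal(p, type = "boxplot")

m_density
```

For this data, the density and histogram margins are likely the more informative choices: they can show that x and y each have two modes, which a boxplot summary hides.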

Copyright notice
This article was written by [The winter holiday of falling marks]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202280540284318.html