Visualizations in R with too many data points?

Written by wordpress January 18, 2025

I have this simulated data in R with multiple variables:

set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- x1 * 0.7 + rnorm(n, mean = 0, sd = 5)
x3 <- -x1 * 0.5 + rnorm(n, mean = 70, sd = 8)
x4 <- rnorm(n, mean = 40, sd = 15)
x5 <- x4 * 0.6 + rnorm(n, mean = 30, sd = 7)

data <- data.frame(
    Variable1 = x1,
    Variable2 = x2,
    Variable3 = x3,
    Variable4 = x4,
    Variable5 = x5
)

data$Group <- cut(data$Variable1, 
                  breaks = 3, 
                  labels = c("Low", "Medium", "High"))

I then made this parallel coordinates plot, since (to my knowledge) it is one of the few visualization types that can show many variables at once:

library(GGally)   # provides ggparcoord()
library(ggplot2)  # provides theme_minimal(), theme(), scale_color_brewer()

ggparcoord(data = data,
           columns = 1:5,
           groupColumn = "Group",
           scale = "uniminmax",
           showPoints = TRUE,
           alphaLines = 0.3,
           title = "Parallel Coordinates Plot of Simulated Data") +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1)
    ) +
    scale_color_brewer(palette = "Set2")

My question: suppose I have many rows of data. Is there a way to still make this plot without it becoming too crowded? The only idea that comes to mind is random sampling, i.e. drawing a smaller dataset from the original one, but I am looking for ideas other than this.
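
For illustration, one possible alternative to sampling is to stop drawing one line per row and instead draw a per-group summary: a median line with an interquartile band for each variable. The sketch below is only a minimal example under stated assumptions. The row count of 5000 and the names big, vars, scaled, summ and p are hypothetical and not part of the code above, and the rescaling to [0, 1] is meant to mimic scale = "uniminmax" rather than reproduce ggparcoord() exactly.

```r
library(ggplot2)

# Hypothetical larger version of the simulated data (5000 rows is an assumption)
set.seed(123)
n <- 5000
x1 <- rnorm(n, mean = 50, sd = 10)
x4 <- rnorm(n, mean = 40, sd = 15)
big <- data.frame(
    Variable1 = x1,
    Variable2 = x1 * 0.7 + rnorm(n, mean = 0, sd = 5),
    Variable3 = -x1 * 0.5 + rnorm(n, mean = 70, sd = 8),
    Variable4 = x4,
    Variable5 = x4 * 0.6 + rnorm(n, mean = 30, sd = 7)
)
big$Group <- cut(big$Variable1, breaks = 3,
                 labels = c("Low", "Medium", "High"))

vars <- paste0("Variable", 1:5)

# Rescale every variable to [0, 1] so all axes share one scale
scaled <- big
scaled[vars] <- lapply(big[vars], function(v) (v - min(v)) / (max(v) - min(v)))

# One row per Group x variable holding the quartiles of the scaled values
summ <- do.call(rbind, lapply(vars, function(v) {
    q <- aggregate(scaled[[v]], by = list(Group = scaled$Group),
                   FUN = quantile, probs = c(0.25, 0.5, 0.75))
    data.frame(Group = q$Group,
               variable = factor(v, levels = vars),
               lower = q$x[, 1],
               median = q$x[, 2],
               upper = q$x[, 3])
}))

# Median line plus interquartile band per group instead of one line per row
p <- ggplot(summ, aes(x = variable, group = Group)) +
    geom_ribbon(aes(ymin = lower, ymax = upper, fill = Group), alpha = 0.25) +
    geom_line(aes(y = median, colour = Group)) +
    scale_fill_brewer(palette = "Set2") +
    scale_color_brewer(palette = "Set2") +
    labs(y = "Scaled value",
         title = "Per-group median and interquartile range") +
    theme_minimal()

print(p)
```

The number of drawn elements then depends only on the number of groups and variables, not on the number of rows. The trade-off is that individual observations and outliers are no longer visible, so this may work best alongside a small sampled plot rather than as a replacement for it.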
