I have this simulated data in R with multiple variables:
set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- x1 * 0.7 + rnorm(n, mean = 0, sd = 5)
x3 <- -x1 * 0.5 + rnorm(n, mean = 70, sd = 8)
x4 <- rnorm(n, mean = 40, sd = 15)
x5 <- x4 * 0.6 + rnorm(n, mean = 30, sd = 7)
data <- data.frame(
  Variable1 = x1,
  Variable2 = x2,
  Variable3 = x3,
  Variable4 = x4,
  Variable5 = x5
)
data$Group <- cut(data$Variable1,
                  breaks = 3,
                  labels = c("Low", "Medium", "High"))
I then made a parallel coordinates plot, since (to my knowledge) it is one of the few plot types that can show many variables at once:
library(GGally)
library(ggplot2)  # theme_minimal(), theme(), element_text(), scale_color_brewer()
ggparcoord(data = data,
           columns = 1:5,
           groupColumn = "Group",
           scale = "uniminmax",
           showPoints = TRUE,
           alphaLines = 0.3,
           title = "Parallel Coordinates Plot of Simulated Data") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_color_brewer(palette = "Set2")
I have the following question: suppose my data has many rows. Is there a way to still draw this plot without it becoming too crowded? The only idea that comes to mind is to randomly sample a smaller dataset from the original, but I am looking for ideas other than this.
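For reference, the sampling idea I have in mind would look roughly like the sketch below. It is only a minimal sketch under some assumptions: `big`, `per_group`, and `small` are hypothetical names, the toy data frame merely stands in for a much larger dataset, and sampling within each Group (rather than across all rows) is my own choice so that sparse groups do not disappear.

```r
# Toy stand-in for a large dataset (hypothetical; swap in the real data frame).
set.seed(123)
big <- data.frame(
  Variable1 = rnorm(5000, mean = 50, sd = 10),
  Variable2 = rnorm(5000, mean = 35, sd = 9)
)
big$Group <- cut(big$Variable1,
                 breaks = 3,
                 labels = c("Low", "Medium", "High"))

# Assumed cap on rows kept per group; tune to taste.
per_group <- 100

# Row indices split by group, then sampled without replacement within each group.
# sample.int() on the group size avoids the sample(x) pitfall for length-1 vectors.
rows_by_group <- split(seq_len(nrow(big)), big$Group)
idx <- unlist(lapply(rows_by_group, function(rows) {
  rows[sample.int(length(rows), min(per_group, length(rows)))]
}))

small <- big[idx, ]
table(small$Group)
```

The reduced data frame `small` could then be passed to ggparcoord() in place of the full data, with `columns` adjusted to the variables present.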