Visualizations in R with too many data points?


I have this simulated data in R with multiple variables:

set.seed(123)
n <- 100
x1 <- rnorm(n, mean = 50, sd = 10)
x2 <- x1 * 0.7 + rnorm(n, mean = 0, sd = 5)
x3 <- -x1 * 0.5 + rnorm(n, mean = 70, sd = 8)
x4 <- rnorm(n, mean = 40, sd = 15)
x5 <- x4 * 0.6 + rnorm(n, mean = 30, sd = 7)

data <- data.frame(
    Variable1 = x1,
    Variable2 = x2,
    Variable3 = x3,
    Variable4 = x4,
    Variable5 = x5
)

# Split Variable1 into three equal-width bins to use as a grouping factor
data$Group <- cut(data$Variable1, 
                  breaks = 3, 
                  labels = c("Low", "Medium", "High"))

I then made this parallel coordinates plot, since it is one of the few types of visualization (to my knowledge) that can show many variables at once:

library(ggplot2)  # theme_minimal(), theme(), scale_color_brewer()
library(GGally)   # ggparcoord()

ggparcoord(data = data,
           columns = 1:5,
           groupColumn = "Group",
           scale = "uniminmax",
           showPoints = TRUE,
           alphaLines = 0.3,
           title = "Parallel Coordinates Plot of Simulated Data") +
    theme_minimal() +
    theme(
        plot.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1)
    ) +
    scale_color_brewer(palette = "Set2")

My question: suppose I have many more rows of data. Is there anything that can be done so that the plot does not become too crowded? The only idea that comes to mind is to use random sampling to make a smaller dataset from the original one, but I am looking for ideas other than this.
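For reference, here is a minimal sketch of the random-sampling idea, so it is clear what the baseline is. The names `big_data`, `n_big`, `n_keep` and `sampled_data` are made up for this illustration, and the row counts are arbitrary assumptions; the plot is only drawn if GGally is installed.

```r
# Hypothetical larger dataset, simulated the same way as the data above
# (n_big is an arbitrary assumption)
set.seed(123)
n_big <- 10000
x1 <- rnorm(n_big, mean = 50, sd = 10)
x4 <- rnorm(n_big, mean = 40, sd = 15)
big_data <- data.frame(
    Variable1 = x1,
    Variable2 = x1 * 0.7 + rnorm(n_big, mean = 0, sd = 5),
    Variable3 = -x1 * 0.5 + rnorm(n_big, mean = 70, sd = 8),
    Variable4 = x4,
    Variable5 = x4 * 0.6 + rnorm(n_big, mean = 30, sd = 7)
)
big_data$Group <- cut(big_data$Variable1,
                      breaks = 3,
                      labels = c("Low", "Medium", "High"))

# Simple random sample of rows, without replacement
# (n_keep is an arbitrary assumption)
n_keep <- 300
sampled_data <- big_data[sample(nrow(big_data), n_keep), ]

# Draw the same kind of plot on the reduced data, if GGally is available
if (requireNamespace("GGally", quietly = TRUE)) {
    p <- GGally::ggparcoord(data = sampled_data,
                            columns = 1:5,
                            groupColumn = "Group",
                            scale = "uniminmax",
                            alphaLines = 0.3)
    print(p)
}
```

This keeps the plot readable, but it throws away most of the rows and the picture changes with the seed, which is why I would like to hear about alternatives.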


