DAVI Visualizing Multivariate Data
Visualizing Multivariate Data - Detailed Notes (Fall 2024)
Key Concepts in Multivariate Visualization
1. Why Talk About Multivariate Visualization?
Real-world data often involves many attributes (dimensions).
- Mapping two attributes:
- Typically done using scatterplots, which allow for the representation of two numerical variables along the x and y axes. Position is the most effective channel for this.
- Mapping three attributes:
- A common method is using bubble charts, where the third variable is encoded using size (i.e., area of the bubble). This is a natural extension of the scatterplot.
- Mapping four attributes:
- Color is often introduced to encode a fourth attribute, leveraging either color gradients or distinct hues. For example, temperature might be encoded using a gradient from blue to red.
- Mapping five attributes:
- Shape can be added to distinguish categorical variables or classes, such as using circles, triangles, or squares to represent different groups.
- Key limitation:
- While it's possible to map many attributes to a single visualization, each added attribute makes the visualization more cluttered and harder to interpret. The cognitive load on the viewer increases, making it difficult to extract meaningful insights from highly complex visualizations.
- Source: Professor's Lecture and Slide 5 from PDF (Page 5).
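The progression from two to five attributes can be sketched as a function that turns one record into the channels of a single mark. This is only an illustrative sketch: the attribute names (`x_attr`, `size_attr`, `group`, and so on), the `SHAPES` table, and `encode_mark` itself are hypothetical and not from the lecture. Note that the third attribute is mapped to bubble area, so the radius scales with the square root of the value.

```python
import math

# Hypothetical shape assignment for a categorical attribute.
SHAPES = {"A": "circle", "B": "triangle", "C": "square"}

def encode_mark(record, size_range, max_radius=20.0):
    """Map one five-attribute record to the visual channels of a single mark."""
    lo, hi = size_range
    # Normalise the size attribute to [0, 1].
    t = (record["size_attr"] - lo) / (hi - lo)
    return {
        "x": record["x_attr"],              # attribute 1 -> horizontal position
        "y": record["y_attr"],              # attribute 2 -> vertical position
        # attribute 3 -> bubble AREA, so the radius grows with the square root
        "radius": max_radius * math.sqrt(t),
        # attribute 4 -> position on a blue-to-red gradient, in [0, 1]
        "color_t": record["color_attr"],
        "shape": SHAPES[record["group"]],   # attribute 5 -> shape
    }

mark = encode_mark(
    {"x_attr": 3.0, "y_attr": 7.5, "size_attr": 25.0, "color_attr": 0.8, "group": "B"},
    size_range=(0.0, 100.0),
)
```

A value at a quarter of the size range gets half the maximum radius, which keeps the bubble's area (not its radius) proportional to the data.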
Techniques for Visualizing Multivariate Data
1. Using Points
- Ternary Plots:
- Definition: A scatterplot that is adapted for three variables, where the three variables are constrained to sum to a fixed total (e.g., 1 or 100%). Ternary plots are particularly useful in fields like chemistry and geoscience where compositions (e.g., soil types or alloy compositions) are examined.
- Example: In geology, ternary plots are used to plot soil composition, which typically involves sand, silt, and clay percentages that must add up to 100%.
- Strengths: Provides an efficient way to display three variables, especially in situations where the three components represent parts of a whole.
- Weaknesses: Limited to specific use cases where variables sum to a constant value.
- Source: Professor's Lecture.
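As a rough sketch of how a ternary plot positions a point, the three components can be normalised to sum to 1 and projected onto an equilateral triangle. The corner layout and the function name `ternary_to_xy` are assumptions chosen for illustration; the sample composition is made up.

```python
import math

def ternary_to_xy(a, b, c):
    """Project a three-part composition onto an equilateral triangle.

    Assumed corner layout: a -> (0, 0), b -> (1, 0), c -> (0.5, sqrt(3)/2).
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("components must sum to a positive total")
    # Normalise so the three parts sum to 1, whatever the original total was.
    a, b, c = a / total, b / total, c / total
    x = b + 0.5 * c
    y = (math.sqrt(3) / 2) * c
    return x, y

# Hypothetical soil sample: 40% sand, 40% silt, 20% clay.
x, y = ternary_to_xy(40, 40, 20)
```

Because the third component is implied by the other two, the composition has only two degrees of freedom, which is why it fits in a 2D triangle.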
- Scatterplot Matrices (SPLOMs):
- Definition: A matrix of scatterplots that displays pairwise relationships between multiple variables. Each cell in the matrix contains a scatterplot of two variables, and the entire matrix provides an overview of all possible combinations of variables.
- With Histograms:
- The diagonal cells of the matrix are often filled with histograms showing the distribution of individual variables.
- Advantage: Helps in quickly identifying correlations and clusters across a dataset.
- With Density Contours:
- Interaction: Linking and Brushing:
- SPLOMs are often interactive, allowing users to select a subset of points in one scatterplot, which are then highlighted across all other scatterplots.
- Example: In a dataset showing car attributes, selecting cars with a high horsepower in one scatterplot will highlight their position across other variables like weight and fuel efficiency.
- Source: Professor's Lecture and Slide 8 from PDF (Page 8).
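Linking and brushing can be sketched without any plotting library: a brush in one panel yields a set of record indices, and every other panel highlights the same indices. The helper names `splom_panels` and `brush` and the car records are hypothetical, made up for illustration.

```python
from itertools import permutations

def splom_panels(attributes):
    """List the (x, y) attribute pair shown in each off-diagonal cell."""
    return list(permutations(attributes, 2))

def brush(records, x_attr, y_attr, x_range, y_range):
    """Return the indices of records inside a rectangular brush in one panel."""
    (x0, x1), (y0, y1) = x_range, y_range
    return {
        i for i, r in enumerate(records)
        if x0 <= r[x_attr] <= x1 and y0 <= r[y_attr] <= y1
    }

# Hypothetical car records (values are invented).
cars = [
    {"horsepower": 220, "weight": 1800, "mpg": 18},
    {"horsepower": 95,  "weight": 1100, "mpg": 38},
    {"horsepower": 250, "weight": 1950, "mpg": 15},
]

# Brush the high-horsepower cars in the horsepower/weight panel.
selected = brush(cars, "horsepower", "weight", (200, 300), (0, 3000))

# Linking: every panel highlights the same record indices.
highlight = {
    panel: [(cars[i][panel[0]], cars[i][panel[1]]) for i in sorted(selected)]
    for panel in splom_panels(["horsepower", "weight", "mpg"])
}
```

The key design point is that the selection is stored as record identities rather than screen coordinates, so it carries over to panels that show entirely different attributes.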
- RadViz (Radial Visualization):
- Definition: A technique where each data attribute is assigned to an anchor point on the circumference of a circle. Data points are "pulled" towards these anchors based on their values for each attribute, creating a projection of multivariate data into two dimensions. (The strength of the pull depends on the value: the larger the value, the smaller the distance to that anchor.)
- RadViz Deluxe:
- A refined version of RadViz, where correlated attributes are grouped closer together to prevent distortions that can occur when independent variables are placed opposite one another.
- Fixing the Issue: In the original RadViz, correlated attributes can distort the visualization, leading to overlapping or condensed points. RadViz Deluxe adjusts the positions of correlated attributes, ensuring more meaningful spatial distributions.
- RadViz with Histograms/Densities:
- Adding histograms or density plots to the visualization allows for a clearer understanding of the data's distribution along each axis.
- Source: Professor's Lecture.
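A minimal sketch of the basic RadViz projection, assuming attribute values are already normalised to [0, 1] and anchors are spaced evenly around the unit circle. Each point is placed at the weighted average of the anchor positions, with the attribute values as weights. The function name `radviz_point` is illustrative only, and the attribute reordering of RadViz Deluxe is not covered here.

```python
import math

def radviz_point(values):
    """Project one record (values normalised to [0, 1]) into the unit circle."""
    n = len(values)
    # Anchors are spaced evenly on the circumference, one per attribute.
    anchors = [
        (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no pull from any anchor: stay at the centre
    # Weighted average of anchor positions; larger values pull harder.
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)

near_first_anchor = radviz_point([1.0, 0.0, 0.0, 0.0])  # only attribute 0 pulls
balanced = radviz_point([0.5, 0.5, 0.5, 0.5])           # opposing pulls cancel
```

The second call illustrates a known ambiguity of the projection: records with equal values on all attributes land at the centre regardless of how large those values are.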
2. Using Lines
- Parallel Coordinates (PCs):
- Definition: A multivariate visualization method where each variable is represented as a parallel axis. Data points are plotted as polylines connecting the values across these parallel axes. PCs are particularly useful for understanding high-dimensional data.
- Parallel Coordinate Patterns:
- Strengths: Useful for spotting correlations, trends, and clusters in high-dimensional data.
- Weaknesses: The method can become cluttered when visualizing large datasets, leading to overplotting.
- Source: Professor's Lecture and Slide 10 from PDF (Page 10).
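To make the polyline construction concrete, here is a small sketch that scales each axis independently with min-max normalisation and returns the vertices of one record's polyline. The axis names, ranges, and the helper `polyline` are assumptions for illustration.

```python
def polyline(record, axes, ranges, axis_spacing=1.0):
    """Return the (x, y) vertices of one record's polyline.

    Each axis is scaled independently to [0, 1] with min-max normalisation.
    """
    vertices = []
    for i, name in enumerate(axes):
        lo, hi = ranges[name]
        # A constant attribute has no spread, so place it mid-axis.
        y = (record[name] - lo) / (hi - lo) if hi > lo else 0.5
        vertices.append((i * axis_spacing, y))
    return vertices

# Hypothetical axes and value ranges.
axes = ["horsepower", "weight", "mpg"]
ranges = {"horsepower": (50, 250), "weight": (1000, 2000), "mpg": (10, 40)}
line = polyline({"horsepower": 150, "weight": 1250, "mpg": 40}, axes, ranges)
```

Since only adjacent axes are connected by line segments, the order of `axes` determines which pairwise relationships are directly visible.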
- Polylines:
- PCs (Parallel Coordinates) with Histograms:
- Reducing Visual Clutter in PCs:
- Alpha Blending: A technique that makes overlapping lines partially transparent, reducing the visual clutter of overplotted data.
- Bundling: Bundling lines that follow similar trajectories helps to emphasize major trends while minimizing noise from individual data points.
- Source: Professor's Lecture and Slide 11 from PDF (Page 11).
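The effect of alpha blending on overplotting can be estimated with a short calculation. Assuming lines of the same colour are composited with the standard "over" operator, n overlapping lines of opacity alpha produce a combined opacity of 1 - (1 - alpha)^n. The function name is illustrative.

```python
def stacked_opacity(alpha, n_lines):
    """Combined opacity where n same-coloured lines of opacity `alpha` overlap."""
    return 1.0 - (1.0 - alpha) ** n_lines

single = stacked_opacity(0.1, 1)    # one faint line
dense = stacked_opacity(0.1, 10)    # ten lines crossing the same pixel
crowded = stacked_opacity(0.1, 50)  # fifty lines: nearly opaque
```

Dense regions therefore appear darker than sparse ones, which is what lets trends emerge from the clutter; very dense regions still saturate, so the alpha value usually needs tuning to the dataset size.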
- Parallel Sets:
- A technique similar to PCs but designed for categorical data. Instead of polylines, the visualization uses ribbons to show the flow of categories across multiple dimensions.
- Use Case: Ideal for understanding how categorical data is distributed across various categories, such as showing how education level is distributed across different income brackets.
- Source: Professor's Lecture.
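The ribbons in a parallel-sets display correspond to counts of category combinations between adjacent dimensions. A minimal sketch, with made-up survey rows and a hypothetical helper `ribbon_counts`:

```python
from collections import Counter

def ribbon_counts(records, dim_a, dim_b):
    """Count records for each (category in dim_a, category in dim_b) pair.

    In a parallel-sets layout each count would set the width of one ribbon.
    """
    return Counter((r[dim_a], r[dim_b]) for r in records)

# Hypothetical survey rows.
people = [
    {"education": "college", "income": "high"},
    {"education": "college", "income": "high"},
    {"education": "college", "income": "low"},
    {"education": "school", "income": "low"},
]

ribbons = ribbon_counts(people, "education", "income")
total = sum(ribbons.values())
# Ribbon width as a fraction of the full axis length.
widths = {pair: n / total for pair, n in ribbons.items()}
```

Unlike parallel coordinates, the display size depends on the number of category combinations rather than the number of records, so it scales well to large tables.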
- Parallel Hierarchies:
- Extends parallel sets to visualize hierarchical data structures, allowing for more complex relationships between data attributes to be explored.
- Use Case: Particularly effective when working with census data, where hierarchical relationships such as geographical locations or job classifications are common.
- Source: Professor's Lecture.
3. Using Nesting and Hierarchies
- Mosaic Plots:
- Definition: A type of plot for categorical data, where the area of each tile is proportional to the number of observations within that category. Each subdivision within a tile represents a further categorical breakdown.
- Strengths: Helps in understanding the distribution of categories and their subcategories.
- Weaknesses: Can become hard to read if too many categories are plotted.
- Source: Professor's Lecture and Slide 13 from PDF (Page 13).
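One common way to lay out a two-variable mosaic plot is to give each column a width equal to the marginal share of the first variable, then split each column vertically by the conditional share of the second. The sketch below assumes that layout; `mosaic_tiles` and the sample rows are hypothetical.

```python
from collections import Counter

def mosaic_tiles(records, var_x, var_y):
    """Return {(x_cat, y_cat): (left, bottom, width, height)} in the unit square."""
    n = len(records)
    x_counts = Counter(r[var_x] for r in records)
    xy_counts = Counter((r[var_x], r[var_y]) for r in records)
    tiles, left = {}, 0.0
    for x_cat in sorted(x_counts):
        width = x_counts[x_cat] / n             # column width = marginal share
        bottom = 0.0
        for (xc, y_cat), count in sorted(xy_counts.items()):
            if xc != x_cat:
                continue
            height = count / x_counts[x_cat]    # tile height = conditional share
            tiles[(x_cat, y_cat)] = (left, bottom, width, height)
            bottom += height
        left += width
    return tiles

# Hypothetical rows.
rows = [
    {"education": "college", "income": "high"},
    {"education": "college", "income": "high"},
    {"education": "college", "income": "low"},
    {"education": "school", "income": "low"},
]
tiles = mosaic_tiles(rows, "education", "income")
```

Width times height for each tile equals that combination's share of all observations, which is the area-proportionality property described in the definition.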
- Trellis Displays:
- Definition: A grid-based visualization where a dataset is broken down into subsets, each of which is plotted in its own panel. This allows for comparisons across multiple variables or dimensions.
- Example: Used in geoscience to explore earthquake data, where the dataset is divided by depth, and each depth category is visualized individually.
- Strengths: Enables detailed comparisons across different categories or variable ranges.
- Source: Professor's Lecture.
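The partitioning step behind a trellis display can be sketched as binning a conditioning attribute and collecting the records for each panel. The earthquake records, bin edges, and the helper `trellis_panels` are invented for illustration; each resulting subset would then be drawn with the same plot type and shared axes.

```python
from collections import defaultdict

def trellis_panels(records, by, bin_edges):
    """Split records into panels by binning the conditioning attribute `by`.

    Bins are half-open intervals [lo, hi); values outside all bins are dropped.
    """
    panels = defaultdict(list)
    for r in records:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= r[by] < hi:
                panels[(lo, hi)].append(r)
                break
    return dict(panels)

# Hypothetical earthquake records (depth in km).
quakes = [
    {"lat": -17.6, "lon": 181.0, "depth": 62},
    {"lat": -20.4, "lon": 181.6, "depth": 650},
    {"lat": -26.0, "lon": 184.1, "depth": 42},
    {"lat": -17.9, "lon": 181.7, "depth": 310},
]
panels = trellis_panels(quakes, by="depth", bin_edges=[0, 100, 400, 700])
```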
- Radar Charts:
- Definition: A chart where each variable is represented by an axis that radiates from a central point. Data points are plotted along these axes and connected to form a polygon, showing the relative strength of each variable.
- Criticism: Radar charts are often criticized for their difficulty in scaling to larger datasets and the challenges they present in accurately comparing data points.
- Small Multiples: A method for improving radar charts by displaying each data point as a separate radar chart, allowing for better comparison between items.
- Source: Professor's Lecture and Slide 15 from PDF (Page 15).
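A radar polygon is a conversion from (axis index, value) to Cartesian coordinates. The sketch below, which assumes values normalised to [0, 1] and evenly spaced axes, also illustrates one reason comparisons are hard: the polygon's area changes when the same values are assigned to the axes in a different order. Function names are illustrative.

```python
import math

def radar_polygon(values):
    """Vertices of a radar polygon for values normalised to [0, 1]."""
    n = len(values)
    return [
        (v * math.cos(2 * math.pi * i / n), v * math.sin(2 * math.pi * i / n))
        for i, v in enumerate(values)
    ]

def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        area += x0 * y1 - x1 * y0
    return abs(area) / 2.0

# The same four values, assigned to the axes in two different orders.
area_alternating = polygon_area(radar_polygon([1.0, 0.2, 1.0, 0.2]))
area_grouped = polygon_area(radar_polygon([1.0, 1.0, 0.2, 0.2]))
```

The two areas differ even though the underlying data are identical, so the filled shape should not be read as a measure of overall magnitude.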
4. Glyph-based Techniques
- Chernoff Faces:
- Definition: A method that encodes data attributes into facial features (e.g., size of eyes, shape of mouth), based on the assumption that humans are good at perceiving subtle differences in facial expressions.
- Criticism: Widely debunked as ineffective; the visual representation is not intuitive, and the encoding can lead to misleading interpretations.
- Source: Professor's Lecture.
- Other Glyph Variants:
- Examples include:
- Star Glyphs: Each attribute is represented as a spoke on a wheel, with the length of each spoke proportional to the attribute's value.
- Wind Barbs: Commonly used in meteorology, where a line's direction shows wind direction, and barbs along the line show wind speed.
- Metro Glyphs: Simplified human figures where data attributes are mapped to features of the figure (e.g., arm length, leg width).
- Strengths: Glyphs can be embedded within larger data visualizations (e.g., in a grid or over a map) to represent multivariate data at specific locations.
- Weaknesses: Interpretation requires practice, and glyphs like Chernoff Faces are not perceptually intuitive.
- Source: Professor's Lecture.
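As an example of how a glyph encodes a value, the sketch below decomposes a wind speed into the parts of a wind barb. It assumes the common meteorological convention (not stated in the lecture notes) that a pennant stands for 50 knots, a full barb for 10 knots, and a half barb for 5 knots, with the speed rounded to the nearest 5 knots. The function name is hypothetical.

```python
def wind_barb_parts(speed_knots):
    """Split a wind speed into pennants (50 kt), full barbs (10 kt), half barbs (5 kt).

    Assumed convention; the speed is first rounded to the nearest 5 knots.
    Python's round() sends exact ties to the even multiple.
    """
    rounded = int(round(speed_knots / 5.0)) * 5
    pennants, rest = divmod(rounded, 50)
    full_barbs, rest = divmod(rest, 10)
    half_barbs = rest // 5
    return {"pennants": pennants, "full_barbs": full_barbs, "half_barbs": half_barbs}

gale = wind_barb_parts(65)    # one pennant, one full barb, one half barb
breeze = wind_barb_parts(15)  # one full barb, one half barb
```

Together with the shaft's orientation, this lets a single small mark carry two attributes (direction and speed) at a map location.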
5. Pixel-Based Techniques
- Overview of Pixel-Based Techniques:
- These techniques aim to represent high-dimensional data in a space-efficient way by mapping individual data points to pixels, allowing for dense visualizations.
- Use Case: Often used in time series data or stock market data, where millions of data points need to be represented in a single chart.
- Strengths: Extremely compact, allowing for a large amount of data to be displayed in a small space.
- Weaknesses: The interpretation can be challenging without additional contextual information.
- Source: Professor's Lecture.
- Calendar and Spiral Arrangements:
- Calendar Arrangement: Data is mapped to days, weeks, and months to identify trends and outliers over time.
- Spiral Plots: Often used to visualize periodic data such as seasonality, where the spiral layout helps in seeing repeating patterns.
- Example: Spiral plots are useful for displaying stock market data, where day-to-day trends might follow weekly or monthly patterns.
- Source: Professor's Lecture and Slide 20 from PDF (Page 20).
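Both arrangements come down to a mapping from a time index to a position. The sketch below shows one possible version of each: a calendar grid keyed by ISO week and weekday, and a spiral with one revolution per period, so that samples one period apart share an angle. The function names and the `ring_gap` parameter are assumptions for illustration.

```python
import math
from datetime import date

def calendar_cell(d):
    """Place a date in a calendar grid: (ISO week number, weekday with Monday = 0)."""
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return (iso_week, iso_weekday - 1)

def spiral_position(index, period, ring_gap=1.0):
    """Place the index-th sample on a spiral with one revolution per `period` samples."""
    angle = 2 * math.pi * (index % period) / period
    radius = ring_gap * (1 + index / period)
    return (radius * math.cos(angle), radius * math.sin(angle))

cell = calendar_cell(date(2024, 10, 1))
first_monday = spiral_position(0, period=7)
next_monday = spiral_position(7, period=7)   # same angle, one ring further out
```

If the chosen period matches the data's true cycle, repeating patterns line up along rays of the spiral; with a wrong period they smear into arcs, so the period is usually worth exposing as an interactive control.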
Vis Critique #7 - Wireless Signal Map
- Chart Type: A 3D visualization of Wi-Fi signal strength, represented as a heatmap.
- Color Encoding Issue: The use of a rainbow color scale can distort the perception of data, as different colors are not perceptually equidistant. This can lead to an incorrect interpretation of small variations in signal strength as large ones and vice versa.
- Fix: Replace the rainbow color scale with a perceptually uniform color scale such as Viridis or Cubehelix. In addition, increase transparency or use alpha blending for areas with poor signal strength to reduce occlusion.
- Occlusion: The 3D nature of the visualization leads to occlusion, where parts of the map are hidden behind others. This could be improved by interactive features or increased transparency.
- Source: Professor's Lecture
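The rainbow problem can be illustrated numerically with the standard library alone. The sketch below builds a naive rainbow by sweeping hue from blue to red and estimates the brightness of each colour with Rec. 709 weights. This is an approximation (gamma is ignored) and `rainbow_rgb` is a stand-in, not the colour scale used in the critiqued chart.

```python
import colorsys

def rainbow_rgb(t):
    """Naive rainbow: sweep hue from blue (t=0) to red (t=1) at full saturation."""
    hue = (1.0 - t) * (240.0 / 360.0)
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

def approx_luminance(rgb):
    """Rec. 709 weights applied directly to RGB values (gamma ignored)."""
    r, g, b = rgb
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

# Five evenly spaced data values: roughly blue, cyan, green, yellow, red.
steps = [i / 4 for i in range(5)]
lums = [round(approx_luminance(rainbow_rgb(t)), 3) for t in steps]
is_monotonic = lums == sorted(lums) or lums == sorted(lums, reverse=True)
```

Brightness rises and falls as the data value increases steadily, so equal steps in the data do not look like equal steps on screen. Scales such as Viridis are designed so that lightness changes monotonically.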
Vis Critique #8 - Educational Investment Map
- Chart Type: A glyph-based US map showing educational investment.
- Key Visual Encoding:
- Size of glyph represents total investment.
- Color represents the ratio of private vs public investment.
- Number of polygon sides represents average investment per student.
- Critique:
- Polygon Sides: The number of polygon sides is not a perceptually equidistant encoding method, meaning small changes in investment levels may not be visually apparent.
- Fix: Instead of showing total investment (which correlates strongly with population), the glyphs should use normalized data such as investment per student, and another encoding like color hue can replace polygon sides for clarity.
- Source: Professor's Lecture
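The normalization suggested in the fix is a simple division, but it can change the ranking of states. The figures below are invented purely to illustrate the effect and do not come from the critiqued map.

```python
# Hypothetical figures, made up purely for illustration.
states = [
    {"state": "A", "investment": 9_000_000_000, "students": 1_500_000},
    {"state": "B", "investment": 1_200_000_000, "students": 100_000},
]

# Normalise total investment by the number of students.
for s in states:
    s["per_student"] = s["investment"] / s["students"]

top_by_total = max(states, key=lambda s: s["investment"])["state"]
top_by_student = max(states, key=lambda s: s["per_student"])["state"]
```

The populous state leads on total investment while the smaller state leads per student, which is why encoding the raw total as glyph size mostly reproduces a population map.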
This post is licensed under CC BY 4.0 by the author.