DAVI Visualizing Textual Data

Posted Nov 22, 2024 Updated Dec 24, 2024

By Wei Xiong

13 min read

Lecture on Text Visualization and Visualization Critiques

Challenges with Text Data

Doesn’t Fit Traditional Data Types
- Neither purely qualitative nor quantitative.
Visual Channel Mismatch
- Standard channels (position, color, size) don’t directly apply.
No Inherent Structure
- Often unstructured or semi-structured, lacking a universal format.
Difficult Task Abstraction
- Traditional task frameworks don’t easily accommodate text data.

Implication: Text data requires specialized visualization techniques.

Topics Covered

Preprocessing Text Data
Visualizing Individual Documents
Visualizing Entire Text Corpora
Embedding Visualizations into Text (Spark Lines, etc.)

1. Preprocessing Text Data

Objective: Normalize and prepare text for analysis.

Steps:

Remove Formatting
- Strip away HTML tags, LaTeX commands, and other markup.
Noise Removal
- Remove punctuation, emoticons, and non-text elements.
Lowercasing
- Convert all text to lowercase to unify word representations.
Social Normalization
- Standardize colloquial terms and slang.
- Example: “u r so gud” ➔ “you are so good”
Stopword Removal
- Eliminate common words with little semantic value (e.g., “the”, “is”, “and”).
- Benefit: Reduces noise and focuses on meaningful words.
Stemming
- Reduce words to their base or root form.
- Example: “trouble”, “troubled”, “troubles” ➔ “trouble”
- Algorithm: Porter Stemming Algorithm is widely used.
Lemmatization
- Convert words to their canonical form, accounting for irregularities.
- Example: “good”, “better”, “best” ➔ “good”
- Benefit: Groups words with similar meanings that standard stemming might miss.
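
The first few steps above can be sketched in plain Python. This is a minimal illustration, not the lecture's pipeline: the stopword list and the slang dictionary (`STOPWORDS`, `SLANG`) are tiny hypothetical stand-ins, the regular expressions assume English text, and stemming and lemmatization are left out because they normally rely on a dedicated library.

```python
import re

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"the", "is", "and", "a", "of", "to"}
# Hypothetical slang dictionary used for social normalization.
SLANG = {"u": "you", "r": "are", "gud": "good"}

def preprocess(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML-like tags
    text = text.lower()                          # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)        # drop punctuation, digits, emoticons
    tokens = text.split()
    tokens = [SLANG.get(t, t) for t in tokens]   # social normalization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("<p>U r so gud, and the lecture is GREAT :)</p>"))
# ['you', 'are', 'so', 'good', 'lecture', 'great']
```

The order of the steps matters: markup is stripped before punctuation removal so that tag names do not survive as tokens.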

2. Representing Text Data

Bag-of-Words Model

Concept: Text represented as a multiset of words.
Components:
- Each word (term) and its frequency in the document.
Limitation: Loses word order and context.
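
A bag of words maps naturally onto Python's `collections.Counter`. The sketch below assumes whitespace tokenization on lowercased text, which is a simplification; it also shows the stated limitation, since two texts with the same words in a different order produce the same bag.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Multiset of terms: each word mapped to its frequency.
    return Counter(text.lower().split())

bag = bag_of_words("To be or not to be")
print(bag["to"], bag["be"], bag["or"], bag["not"])  # 2 2 1 1

# Word order is lost: a shuffled sentence yields the same bag.
print(bag == bag_of_words("be to not or be to"))  # True
```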

n-grams

Definition: Sequences of n consecutive elements (n-grams) taken from a text.
Types:
- Unigrams: Single words (same as bag-of-words).
- Bigrams: Pairs of consecutive words.
- Trigrams: Sequences of three words.
Purpose: Capture context and word order.

Example with “To be or not to be”:

Unigrams: “to”, “be”, “or”, “not”, “to”, “be”
Bigrams: “to be”, “be or”, “or not”, “not to”, “to be”
Trigrams: “to be or”, “be or not”, “or not to”, “not to be”
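
The example above can be reproduced with a short sliding-window function. This is a sketch that assumes the text has already been lowercased and split into word tokens.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    # Slide a window of length n over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))  # ['to be', 'be or', 'or not', 'not to', 'to be']
print(ngrams(tokens, 3))  # ['to be or', 'be or not', 'or not to', 'not to be']
```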

Character n-grams

Use letters instead of words.
Application: Helps in language detection and handling misspellings.

Vector Space Representation

Method: Represent documents as high-dimensional vectors.
Example: Letter trigrams yield a vector with 26^3 = 17,576 dimensions (one per possible three-letter sequence).
Benefit: Enables mathematical operations like calculating cosine similarity between documents.
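
As a rough sketch of the idea, the code below builds letter-trigram count vectors and compares them with cosine similarity. It stores the vectors sparsely in a `Counter` instead of allocating all 17,576 dimensions, and it keeps only the letters a to z; both are simplifying assumptions. The example words are made up to show that a misspelling stays close to the original.

```python
import math
from collections import Counter

def char_trigrams(text: str) -> Counter:
    # Keep only a-z so every trigram is one of the 26^3 possible dimensions.
    letters = "".join(c for c in text.lower() if "a" <= c <= "z")
    return Counter(letters[i:i + 3] for i in range(len(letters) - 2))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

v1 = char_trigrams("visualization")
v2 = char_trigrams("visualisation")  # spelling variant
v3 = char_trigrams("quadtree")       # unrelated word
print(cosine_similarity(v1, v2) > cosine_similarity(v1, v3))  # True
```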

tf-idf (Term Frequency-Inverse Document Frequency)

Term Frequency (TF): Number of times a term appears in a document.
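Inverse Document Frequency (IDF): Measures how rare a term is across the corpus, commonly computed as log(N / df), where N is the number of documents and df is the number of documents containing the term.
tf-idf Score: TF multiplied by IDF.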
- High tf-idf: Term is important in a document but rare across the corpus.
Purpose: Identify the most significant terms in a document while ignoring commonly used terms that lack discriminative power.
- If a term is frequent in a specific document but not common across other documents, it receives a higher weight.
- Conversely, terms that are frequent across many documents receive a lower weight because they do not help distinguish one document from another (e.g., “and”, “or”).
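
A small worked example makes the weighting concrete. The sketch assumes raw counts for TF and IDF = log(N / df) with the natural logarithm; libraries often use smoothed or normalized variants, so absolute numbers will differ. The three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(docs)
print(scores[0]["the"])                      # 0.0, because "the" occurs in every document
print(scores[0]["mat"] > scores[0]["cat"])   # True, "mat" is rarer than "cat"
```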

3. Visualizing Individual Documents

Wordle Algorithm

Goal: Create visually appealing word clouds with efficient space utilization.
Process:
- Randomized Greedy Algorithm:
  - Place the largest word first at a random position.
  - Use a spiral search to find a position for each subsequent word without overlap.
- Collision Detection:
  - Uses hierarchical bounding boxes (quadtrees) for efficient overlap checking.
Features:
- Words can be placed at various orientations and within specific shapes.

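Core logic:

First, check whether the current word w intersects any word that has already been placed.
If it does, move w.
The movement follows a spiral path, so the word searches for a free position within a gradually expanding area.
The word keeps moving only while:
- No part of w extends beyond the playing field.
- The current spiral radius remains small.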
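
The placement loop can be sketched as follows. This is a simplified illustration under several assumptions, not the Wordle implementation: each word is reduced to a single axis-aligned bounding box `(x, y, width, height)`, overlap is tested pairwise instead of with hierarchical bounding boxes, positions outside the playing field are skipped, and a word is dropped once the spiral radius exceeds `max_radius`. The function names and default sizes are hypothetical.

```python
import math
import random

def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def inside(rect, field_w, field_h):
    x, y, w, h = rect
    return x >= 0 and y >= 0 and x + w <= field_w and y + h <= field_h

def place_words(sizes, field_w=400, field_h=300, seed=0,
                spiral_step=2.0, max_radius=600.0):
    """sizes: one (width, height) box per word, each smaller than the field."""
    rng = random.Random(seed)
    placed = []
    result = [None] * len(sizes)
    # Greedy: handle the largest boxes first.
    order = sorted(range(len(sizes)),
                   key=lambda i: sizes[i][0] * sizes[i][1], reverse=True)
    for i in order:
        w, h = sizes[i]
        x0 = rng.uniform(0, field_w - w)  # random start position
        y0 = rng.uniform(0, field_h - h)
        t = 0.0
        while spiral_step * t <= max_radius:
            r = spiral_step * t  # Archimedean spiral: radius grows with the angle
            rect = (x0 + r * math.cos(t), y0 + r * math.sin(t), w, h)
            if inside(rect, field_w, field_h) and not any(
                    overlaps(rect, p) for p in placed):
                placed.append(rect)
                result[i] = rect
                break
            t += 0.1
    return result  # None marks a word that could not be placed

rects = place_words([(120, 40), (80, 30), (60, 20), (60, 20), (40, 15)])
```

Because the start positions are random, a different `seed` gives a different but equally valid layout.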

Video Example: Time-Varying Word Clouds

Demonstration of advanced word cloud techniques.
Key Points:
- Incorporates shape animations and dynamic layouts.
- Uses rigid body dynamics for arranging words.
- Supports various constraints like boundary shapes and word orientations.

Word Trees (Visual Concordance)

Purpose: Visualize all occurrences of phrases starting or ending with a specific word.
Structure:
- Tree-like diagram showing how phrases branch from a common root word.
Example:
- Exploring phrases starting with “Love the…” in the Bible.
- Reveals all continuations and frequencies of each phrase.
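
The data structure behind a word tree is a prefix tree of continuations with counts. The sketch below is a minimal version that only handles phrases starting with the root word; the token sequence is a made-up toy example rather than actual Bible text, and `depth` is a hypothetical parameter limiting how many following words are recorded.

```python
def build_word_tree(tokens: list[str], root: str, depth: int = 3) -> dict:
    tree: dict = {}
    for i, tok in enumerate(tokens):
        if tok != root:
            continue
        node = tree
        # Walk the next `depth` tokens, counting each branch along the way.
        for nxt in tokens[i + 1:i + 1 + depth]:
            child = node.setdefault(nxt, {"count": 0, "children": {}})
            child["count"] += 1
            node = child["children"]
    return tree

tokens = "love the lord love the world love thy neighbour".split()
tree = build_word_tree(tokens, "love", depth=2)
print(tree["the"]["count"])                # 2
print(sorted(tree["the"]["children"]))     # ['lord', 'world']
```

In a rendered word tree, the count of each branch would typically drive the font size of the corresponding word.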

4. Visualizing Entire Text Corpora

Literature Fingerprinting

Method: Divide texts into equal-sized chunks and analyze statistical properties.
Encoding:
- Average Sentence Length: Mapped to color intensity.
- Vocabulary Measures: The share of words that occur only once (hapax legomena) indicates vocabulary richness.
Application: Detect stylistic differences, authorship attribution, or anomalies.
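
The per-chunk statistics can be computed in a few lines. This sketch makes some assumptions: chunks contain a fixed number of sentences (`sentences_per_chunk`, a hypothetical parameter), sentences are split naively on `.`, `!` and `?`, and the hapax measure is the share of words in a chunk that occur exactly once. Each resulting value would then be mapped to the color of one cell in the fingerprint.

```python
import re
from collections import Counter

def fingerprint(text: str, sentences_per_chunk: int = 2) -> list[dict[str, float]]:
    # Naive sentence split; each sentence becomes a list of lowercased words.
    sentences = [s.lower().split() for s in re.split(r"[.!?]+", text) if s.strip()]
    rows = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = sentences[i:i + sentences_per_chunk]
        words = [w for sentence in chunk for w in sentence]
        counts = Counter(words)
        hapax = sum(1 for c in counts.values() if c == 1)
        rows.append({
            "avg_sentence_length": len(words) / len(chunk),
            "hapax_ratio": hapax / len(words),
        })
    return rows

text = ("Short one. Tiny. This sentence is clearly a much longer one than those. "
        "And this one is long as well.")
rows = fingerprint(text)
print([row["avg_sentence_length"] for row in rows])  # [1.5, 8.5]
```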

Examples:

Jack London and Mark Twain:
- Variations in sentence length or vocabulary can indicate ghostwriting or stylistic shifts.
- Notable Observations:
  - “Jerry of the Islands” (Jack London) differs in sentence length from his other works.
  - “Tom Sawyer” (Mark Twain) has shorter sentences compared to his typical style.

Bible Visualization

Approach: Each verse represented as a pixel.
Encoding: Length of verses mapped to color.
Insights:
- Patterns reveal structural elements like repetitive lists or anomalies.
- Example: Repeating patterns in “Numbers 7” due to similar offerings described for each tribe.

Reconstructing Original Texts from Witnesses

Context: The original text is lost; multiple copies (witnesses) exist with variations, and their relationships are described by a family tree (stemma).
Goal: Infer the most probable original text by comparing witnesses.
Visualization:
- Aligns different versions to highlight commonalities and differences.
- Helps scholars decide on the most authentic content.

5. Embedding Visualizations into Text

Spark Lines

Definition: Small, word-sized line charts embedded within text.
Purpose: Show trends or patterns in a compact form without axes or labels.
Applications:
- Stock prices over time.
- Temperature changes.
- Any time-series data.

Creating Spark Lines:

Challenges: Limited space requires data simplification.
Techniques:
1. Sampling:
  - Select data points at regular intervals.
  - Limitation: May miss important features.
2. Averaging (Piecewise Aggregate Approximation):
  - Compute average values within intervals.
  - Benefit: Captures overall trends but may smooth out peaks.
3. Perceptually Important Points:
  - Algorithm selects points that significantly affect the visual shape.
  - Process:
    - Start with endpoints.
    - Add points that deviate most from the current simplified line.
    - Iteratively refine until the desired level of detail is reached.
  - Advantage: Preserves critical features like peaks and troughs.
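
The three techniques can be compared on a toy series with a single spike. This is a sketch under stated assumptions: `sample` expects at least two output points, `paa` expects no more segments than data points, and `pip` measures deviation as the vertical distance to the current simplified line (other distance measures are possible). The data is invented.

```python
def sample(values, k):
    # Regular sampling: k points at evenly spaced indices (k >= 2).
    step = (len(values) - 1) / (k - 1)
    return [values[round(i * step)] for i in range(k)]

def paa(values, k):
    # Piecewise Aggregate Approximation: mean of each of k segments (k <= len).
    n = len(values)
    out = []
    for i in range(k):
        segment = values[i * n // k:(i + 1) * n // k]
        out.append(sum(segment) / len(segment))
    return out

def pip(values, k):
    # Perceptually important points: returns the indices of the kept points.
    keep = [0, len(values) - 1]  # start with the endpoints
    while len(keep) < k:
        best_i, best_d = None, -1.0
        for a, b in zip(keep, keep[1:]):
            for i in range(a + 1, b):
                # Vertical distance from point i to the line through a and b.
                y_line = values[a] + (values[b] - values[a]) * (i - a) / (b - a)
                d = abs(values[i] - y_line)
                if d > best_d:
                    best_i, best_d = i, d
        if best_i is None:  # no candidates left
            break
        keep.append(best_i)
        keep.sort()
    return keep

values = [0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0]
print(sample(values, 4))  # [0, 0, 0, 0], the spike is missed
print(paa(values, 4))     # [0.0, 3.0, 0.0, 0.0], the spike is flattened
print(pip(values, 4))     # [0, 5, 6, 11], the spike at index 5 is kept
```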

Generalized Word-Scale Visualizations

Micro Charts: Include bar charts, box plots, and other small visuals.
Applications:
- In Text: Embed within sentences to support or extend content.
- In Code: Augment source code with visualizations showing variable states or performance metrics.

When to use word-sized visualization:

Support Content:
- Quick visual comparisons or summaries.
Summarize Content:
- Highlight key data points or trends.
Emphasize Content:
- Reinforce important information visually.
Extend Content:
- Provide additional data not fully described in the text.
Display Contradictory Data:
- Offer alternative perspectives (less common).
High-density information display:
- Readers do not have to shift their gaze between the text and a separate figure, so the data stays in context while they read.

Examples:

Source Code Visualization:
- Embedding performance metrics or variable states directly in code.
- Helps in debugging and understanding program behavior.
User Study Data:
- Compact visualizations of eye-tracking or interaction data.

Additional Resources

Text Visualization Browser
- A comprehensive collection of text visualization techniques.

Next Lecture Preview

Topic: Visual Analytics
Reading Assignment: Chapter 2 from the specified textbook.
Focus: Integration of automated analysis with interactive visualization.

Visualization Critiques

General Approach

Identify Issues: Examine the visualization for inaccuracies or misleading elements.
Evaluate Expressiveness and Effectiveness: Does it represent the data accurately and clearly?
Suggest Improvements: Propose ways to enhance the visualization.

Example Critique: Bubble Chart of Movie Budgets and Grosses

Visualization Overview:

Data Represented:
- Movie budgets (vertical axis).
- Gross earnings (bubble size).
- Release year (horizontal axis).
Purpose: Compare movie budgets and grosses over time.

Issues Identified:

Incorrect Size Encoding
- Problem: Gross earnings are mapped to bubble radius, not area.
- Impact: Misrepresents the relative gross earnings; viewers may misinterpret the data.
Color Usage
- Problem: Colors assigned to bubbles have no meaningful categories.
- Impact: Adds unnecessary complexity; may confuse viewers.
Labeling
- Problem: Uses a legend instead of directly labeling bubbles.
- Impact: Makes it difficult to identify movies; requires constant referencing.
Data Interpretation
- Problem: Hard to compare movies accurately due to size encoding errors and overlapping bubbles.
- Impact: Reduces the effectiveness of the visualization for comparison tasks.

Suggestions for Improvement:

Correct Size Encoding
- Map gross earnings to bubble area, not radius.
Simplify Colors
- Use a single color or meaningful color encoding (e.g., genre).
Direct Labeling
- Place labels next to or within bubbles where possible.
Alternative Charts
- Consider a scatter plot with gross earnings on one axis and budgets on another for clearer comparison.
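
The size-encoding problem is easy to quantify: a circle's area grows with the square of its radius, so scaling the radius linearly with the value exaggerates differences. The sketch below uses made-up gross figures with a 4:1 ratio and a hypothetical maximum radius of 40 pixels.

```python
import math

def radius_for_area(value, max_value, max_radius=40.0):
    # Scale the radius with the square root so that AREA is proportional to value.
    return max_radius * math.sqrt(value / max_value)

def area(radius):
    return math.pi * radius ** 2

big, small = 400.0, 100.0                 # hypothetical grosses, ratio 4:1
wrong_small = 40.0 * small / big          # radius proportional to value: 10.0
right_small = radius_for_area(small, big) # area proportional to value: 20.0

print(area(40.0) / area(wrong_small))  # about 16, the 4:1 difference looks like 16:1
print(area(40.0) / area(right_small))  # about 4, matches the data
```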

Example Critique 2: Students’ Academic Stressors

Expressiveness:
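- Stacking does not make sense, because the stacked values are not parts of a meaningful whole:
  - Overall values are stacked with subfield values.
  - Subfields are stacked with each other.
- Better: a grouped bar chart.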
Effectiveness:
- Unclear why Social Sciences is “grayed out”
- No whitespace between bars/columns makes it look like a histogram
- Differences between fields are hard to see (e.g., Sciences and Arts/Humanities)

Example Critique 3: House Prices in Canada

Expressiveness:
- The line graph connects categories that have no inherent order, suggesting a “trend” that is not in the data.

Tips for Effective Visualizations

Expressiveness
- Ensure the visualization accurately represents the underlying data.
- Avoid adding elements that suggest relationships not present in the data.
Effectiveness
- Design for the target audience’s understanding.
- Use visual channels appropriately (e.g., position is more precise than color).
Clarity
- Labels and legends should be clear and easy to reference.
- Avoid clutter and unnecessary embellishments.
Accessibility
- Use color palettes that are colorblind-friendly.
- Ensure text and visuals are legible.

Additional Resources for Visualization Critiques

Visualization Blogs and Forums:
- WTF Visualizations
- Data is Ugly Subreddit
Best Practices Guides:
- Books and articles on data visualization principles by experts like Edward Tufte and Stephen Few.

Conclusion

Text Visualization: Requires specialized techniques due to the unique nature of text data.
Visualization Critiques: Essential for improving data representations and avoiding misleading visuals.
Next Steps: Explore visual analytics to combine automated analysis with interactive visualization.

Menti Quiz Review

Question 1: Criteria for a Good Node-Link Layout

Question: Which one of these is not a criterion for a good node-link layout?

Minimizing edge crossings
Adhering to the Robinson criterion
Maximizing edge crossing angles
Uniform edge length

Answer: Adhering to the Robinson criterion

Explanation:

Minimizing Edge Crossings
- Essential to prevent misreading the graph.
- Each crossing increases the chance of following the wrong edge.
Maximizing Edge Crossing Angles
- Crossing angles closer to 90° reduce confusion at crossings.
- Acute angles can mislead the viewer to the wrong path.
Uniform Edge Length
- Avoids very long or very short edges.
- Helps in maintaining a clear and proportional layout.
Robinson Criterion
- Not related to node-link layouts.
- Used in matrix visualizations of graphs.
- Involves arranging the matrix so that values decrease as you move away from the main diagonal.
- Helps in revealing clusters by minimizing Robinson violations.

Robinson Criterion Details:

Definition: Values should decrease when moving horizontally or vertically away from the main diagonal in a matrix.
Purpose: To highlight clusters around the main diagonal that may be otherwise unnoticed.
Implementation: Optimize the order of rows and columns to minimize Robinson violations.
Violations: Instances where the criterion is not met; can be counted or weighted differently during optimization.
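
Counting violations is straightforward for a small matrix. The sketch below assumes a symmetric similarity matrix, so checking each row to the left and right of the diagonal also covers the vertical direction, and it counts every violating neighbor pair once without weighting. The example matrix with two hidden clusters, {0, 2} and {1, 3}, is invented.

```python
def robinson_violations(m):
    n = len(m)
    violations = 0
    for i in range(n):
        for j in range(i, n - 1):   # moving right, away from the diagonal
            if m[i][j + 1] > m[i][j]:
                violations += 1
        for j in range(i, 0, -1):   # moving left, away from the diagonal
            if m[i][j - 1] > m[i][j]:
                violations += 1
    return violations

def reorder(m, order):
    # Apply the same permutation to rows and columns.
    return [[m[a][b] for b in order] for a in order]

m = [
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, 0.9],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, 0.9, 0.1, 1.0],
]
print(robinson_violations(m))                         # 4
print(robinson_violations(reorder(m, [0, 2, 1, 3])))  # 0, clusters sit on the diagonal
```

For larger matrices, trying every ordering is infeasible, which is why the ordering step is treated as an optimization problem.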

Question 2: Artificial Reduction in Velocity in Force-Directed Layouts

Question: What do we call the artificial reduction in velocity in the force-directed layout?

The Barnes-Hut optimization
Simulated annealing
The barycenter algorithm
Coulomb’s law

Answer: Simulated annealing

Explanation:

Simulated Annealing
- Inspired by metallurgy: heated metal is cooled slowly so that it settles into a stable structure.
- In layout algorithms, it involves gradually reducing the “temperature” to settle the nodes into a stable configuration.
- Velocity of nodes decreases over iterations, preventing oscillations and helping reach an equilibrium.
Barnes-Hut Optimization
- Uses a quadtree data structure.
- Reduces computational complexity from O(n^2) to O(n log n) by approximating distant node interactions.
- Not related to reducing velocity but to optimizing force calculations.
Barycenter Algorithm
- Part of the Sugiyama framework for layered graph layouts.
- Calculates the average (barycenter) of connected nodes to minimize edge crossings during node ordering.
- Not related to velocity reduction.
Coulomb’s Law
- Describes the repulsive force between electrically charged particles.
- Used in force-directed layouts to simulate repulsion between nodes.
- Does not address velocity reduction.
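
The role of the cooling step can be seen in a small force-directed loop. This is a sketch under assumptions, not the lecture's algorithm: it uses Fruchterman-Reingold-style forces (repulsion k^2/d, attraction d^2/k) instead of an exact Coulomb term, computes all pairs directly without Barnes-Hut, and caps each node's movement by a `temperature` that shrinks by a hypothetical factor `cooling` every iteration.

```python
import math
import random

def force_layout(nodes, edges, iterations=200, k=1.0, cooling=0.95, seed=0):
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for v in nodes}
    temperature = 1.0
    max_steps = []  # largest movement per iteration, for inspection
    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        for i, u in enumerate(nodes):          # repulsion between all pairs
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force
        for u, v in edges:                     # attraction along edges
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force
        biggest = 0.0
        for v in nodes:
            length = math.hypot(disp[v][0], disp[v][1])
            if length > 0:
                step = min(length, temperature)  # movement capped by temperature
                pos[v][0] += disp[v][0] / length * step
                pos[v][1] += disp[v][1] / length * step
                biggest = max(biggest, step)
        max_steps.append(biggest)
        temperature *= cooling                   # cool down: nodes move less and less
    return pos, max_steps

pos, steps = force_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(steps[0], steps[-1])  # movement in the last iteration is far smaller
```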

Question 3: Matrix Displays are Good for…

Question: Matrix displays are a good choice for the following:

For sparse graphs
For trees
For showing edge attributes
For dense graphs

Answers: For showing edge attributes and For dense graphs

Explanation:

Dense Graphs
- Node-Link Issues: In dense graphs, node-link diagrams become cluttered with overlapping edges.
- Matrix Advantages: Matrix displays handle density well by representing edges as cells, avoiding visual clutter.
Showing Edge Attributes
- Enhanced Encoding: Each matrix cell can represent multiple edge attributes.
- Use of Glyphs: Allows embedding small glyphs or visualizations within cells to convey complex information.
Sparse Graphs and Trees
- Sparse Graphs: Matrix displays are inefficient due to many empty cells, wasting space.
- Trees: Being sparse by nature, trees are better visualized using hierarchical or node-link diagrams.

Question 4: Squarified Treemap Layout

Question: The squarified treemap layout is:

A slice-and-dice algorithm
A randomized algorithm
A greedy algorithm
An optimization algorithm

Answer: A greedy algorithm

Explanation:

Squarified Treemap Layout
- Aims to create rectangles (nodes) with aspect ratios close to 1 (squares).
- Greedy Approach: Places the largest items first, filling space efficiently.
- Improves readability by making area comparisons easier between nodes.
Slice-and-Dice Algorithm
- Divides space alternately in horizontal and vertical directions.
- Can result in elongated rectangles, making area comparison difficult.
Randomized Algorithm
- Not applicable here; squarified treemaps follow a deterministic process.
Optimization Algorithm
- While it aims for better aspect ratios, it doesn’t solve an optimization problem with a cost function.
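
The greedy decision can be sketched as follows: items are processed in descending size, and an item joins the current row only if the row's worst aspect ratio does not get worse; otherwise the row is fixed and a new one starts in the remaining space. The sketch assumes the areas are already sorted in descending order and sum to the area of the target rectangle; the input values are a toy example.

```python
def worst_ratio(row, side):
    # Worst aspect ratio in a row of areas laid out along a side of this length.
    total = sum(row)
    return max(max(side * side * a / (total * total),
                   total * total / (side * side * a)) for a in row)

def squarify(areas, x, y, w, h):
    """Returns one (x, y, width, height) rectangle per area, in input order."""
    areas = list(areas)
    rects = []
    while areas:
        side = min(w, h)  # rows are laid out along the shorter side
        row = [areas.pop(0)]
        # Greedy step: extend the row while the worst aspect ratio does not degrade.
        while areas and worst_ratio(row + [areas[0]], side) <= worst_ratio(row, side):
            row.append(areas.pop(0))
        thickness = sum(row) / side
        offset = 0.0
        for a in row:
            length = a / thickness
            if w >= h:   # vertical strip on the left of the remaining space
                rects.append((x, y + offset, thickness, length))
            else:        # horizontal strip at the top of the remaining space
                rects.append((x + offset, y, length, thickness))
            offset += length
        if w >= h:
            x, w = x + thickness, w - thickness
        else:
            y, h = y + thickness, h - thickness
    return rects

areas = [6, 6, 4, 3, 2, 2, 1]
rects = squarify(areas, 0, 0, 6, 4)
print(rects[0], rects[1])  # (0, 0.0, 3.0, 2.0) (0, 2.0, 3.0, 2.0)
```

A decision is never revisited once a row is fixed, which is what makes the procedure greedy and deterministic.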

Question 5: Steps of the Sugiyama Algorithm

Question: Which one is not a step of the Sugiyama algorithm?

The upward expansion
The crossing reduction
The layer assignment
The horizontal assignment

Answer: The upward expansion

Explanation:

Sugiyama Algorithm Steps:
1. Layer Assignment
  - Assign nodes to discrete layers.
  - Source nodes placed on the first layer; edges point downwards.
2. Crossing Reduction
  - Minimize edge crossings by ordering nodes within layers.
  - Uses methods like the barycentric algorithm to reorder nodes based on connections.
3. Horizontal Assignment
  - Assign exact horizontal positions to nodes.
  - Aims to reduce edge bends and improve readability.
Upward Expansion
- Not a recognized step.
- Possibly a made-up term for the question.
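
The first two steps can be illustrated on a tiny graph. This sketch assumes the input is already acyclic, uses longest-path layering as one possible layer assignment, and runs a single barycenter pass over one pair of layers; horizontal assignment is not shown beyond the resulting order. Function names and the example graph are hypothetical.

```python
def assign_layers(nodes, edges):
    # Longest-path layering: sources get layer 0, every edge points to a deeper layer.
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    layer = {}

    def depth(v):
        if v not in layer:
            layer[v] = 1 + max((depth(u) for u in preds[v]), default=-1)
        return layer[v]

    for v in nodes:
        depth(v)
    return layer

def count_crossings(upper, lower, edges):
    up = {v: i for i, v in enumerate(upper)}
    lo = {v: i for i, v in enumerate(lower)}
    es = [(up[u], lo[v]) for u, v in edges if u in up and v in lo]
    # Two edges cross when their endpoints are ordered oppositely on the two layers.
    return sum(1 for i, (a, b) in enumerate(es)
               for c, d in es[i + 1:] if (a - c) * (b - d) < 0)

def barycenter_order(lower, upper, edges):
    position = {v: i for i, v in enumerate(upper)}

    def barycenter(v):
        ups = [position[u] for u, w in edges if w == v and u in position]
        return sum(ups) / len(ups) if ups else 0.0

    # Sort each node by the average position of its neighbors in the layer above.
    return sorted(lower, key=barycenter)

nodes = ["a", "b", "c", "d"]
edges = [("a", "d"), ("b", "c")]
layers = assign_layers(nodes, edges)
upper, lower = ["a", "b"], ["c", "d"]
print(count_crossings(upper, lower, edges))      # 1
lower = barycenter_order(lower, upper, edges)
print(lower, count_crossings(upper, lower, edges))  # ['d', 'c'] 0
```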

Data Visualization, Lecture

This post is licensed under CC BY 4.0 by the author.
