DAVI Visualizing Textual Data

Lecture on Text Visualization and Visualization Critiques

Challenges with Text Data

  • Doesn’t Fit Traditional Data Types
    • Neither purely qualitative nor quantitative.
  • Visual Channel Mismatch
    • Standard channels (position, color, size) don’t directly apply.
  • No Inherent Structure
    • Often unstructured or semi-structured, lacking a universal format.
  • Difficult Task Abstraction
    • Traditional task frameworks don’t easily accommodate text data.

Implication: Text data requires specialized visualization techniques.


Topics Covered

  1. Preprocessing Text Data
  2. Representing Text Data
  3. Visualizing Individual Documents
  4. Visualizing Entire Text Corpora
  5. Embedding Visualizations into Text (Spark Lines, etc.)

1. Preprocessing Text Data

Objective: Normalize and prepare text for analysis.

Steps:

  1. Remove Formatting
    • Strip away HTML tags, LaTeX commands, and other markup.
  2. Noise Removal
    • Remove punctuation, emoticons, and non-text elements.
  3. Lowercasing
    • Convert all text to lowercase to unify word representations.
  4. Social Normalization
    • Standardize colloquial terms and slang.
    • Example: “u r so gud” ➔ “you are so good”
  5. Stopword Removal
    • Eliminate common words with little semantic value (e.g., “the”, “is”, “and”).
    • Benefit: Reduces noise and focuses on meaningful words.
  6. Stemming
    • Strip suffixes with heuristic rules to reduce words to a common stem, which need not be a dictionary word.
    • Example: “trouble”, “troubled”, “troubles” ➔ “troubl” (Porter stemmer output)
    • Algorithm: The Porter stemming algorithm is widely used.
  7. Lemmatization
    • Convert words to their canonical form, accounting for irregularities.
    • Example: “good”, “better”, “best” ➔ “good”
    • Benefit: Groups words with similar meanings that standard stemming might miss.
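The first five steps can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the stopword list and the slang dictionary are tiny hypothetical stand-ins, and stemming and lemmatization are left out because they need a dedicated rule set or library.

```python
import html
import re

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to", "in"}

# Hypothetical slang dictionary for the social normalization step.
SLANG = {"u": "you", "r": "are", "gud": "good"}

def preprocess(raw):
    """Return a list of normalized tokens for one raw document."""
    text = re.sub(r"<[^>]+>", " ", raw)          # remove HTML-like tags
    text = html.unescape(text)                   # decode entities such as &amp;
    text = text.lower()                          # lowercasing
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # drop punctuation and symbols
    tokens = [SLANG.get(t, t) for t in text.split()]   # social normalization
    return [t for t in tokens if t not in STOPWORDS]   # stopword removal

print(preprocess("<p>U r so gud, and the film is GREAT!</p>"))
# ['you', 'are', 'so', 'good', 'film', 'great']
```

The order matters in this sketch: lowercasing runs before the slang lookup so that “U” and “u” map to the same entry.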

2. Representing Text Data

Bag-of-Words Model

  • Concept: Text represented as a multiset of words.
  • Components:
    • Each word (term) and its frequency in the document.
  • Limitation: Loses word order and context.

n-grams

  • Definition: Sequences of n consecutive elements (words or characters) from a text.
  • Types:
    • Unigrams: Single words (same as bag-of-words).
    • Bigrams: Pairs of consecutive words.
    • Trigrams: Sequences of three words.
  • Purpose: Capture context and word order.

Example with “To be or not to be”:

  • Unigrams: “to”, “be”, “or”, “not”, “to”, “be” (four distinct words; “to” and “be” each occur twice)
  • Bigrams: “to be”, “be or”, “or not”, “not to”, “to be”
  • Trigrams: “to be or”, “be or not”, “or not to”, “not to be”
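The example above can be reproduced with a few lines of Python. The helper name `ngrams` is chosen here for illustration; counting the resulting tuples with `collections.Counter` gives the bag-of-n-grams view.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)                        # 5 bigrams, ("to", "be") appears twice
print(Counter(bigrams)[("to", "be")]) # 2

# The same function yields character n-grams when given a list of letters.
print(ngrams(list("text"), 3))        # [('t', 'e', 'x'), ('e', 'x', 't')]
```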

Character n-grams

  • Use letters instead of words.
  • Application: Helps in language detection and handling misspellings.

Vector Space Representation

  • Method: Represent documents as high-dimensional vectors.
  • Example: Trigrams of letters yield a vector with 26^3 = 17,576 dimensions.
  • Benefit: Enables mathematical operations like calculating cosine similarity between documents.
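As a sketch of the vector space idea, the snippet below builds sparse character-trigram count vectors and compares them with cosine similarity. Storing only the non-zero dimensions in a `Counter` is a choice made here for brevity; the function names are illustrative.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Sparse count vector over letter trigrams (only a-z is kept)."""
    letters = "".join(ch for ch in text.lower() if "a" <= ch <= "z")
    return Counter(letters[i:i + 3] for i in range(len(letters) - 2))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = char_trigrams("Visualizing text data")
doc2 = char_trigrams("Visualising textual data")
doc3 = char_trigrams("Quick brown fox")
print(cosine_similarity(doc1, doc2) > cosine_similarity(doc1, doc3))
```

Because the vectors are built from character trigrams, the British and American spellings still share most of their dimensions.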

tf-idf (Term Frequency-Inverse Document Frequency)

  • Term Frequency (TF): Number of times a term appears in a document.
  • Inverse Document Frequency (IDF): Measures how rare a term is across the corpus as a whole; terms that occur in few documents get a high IDF.
  • tf-idf Score: TF multiplied by IDF.
    • High tf-idf: Term is frequent in a document but rare across the corpus.
  • Purpose: Identify the most significant terms in a document while ignoring commonly used terms that lack discriminative power.
    • If a term is frequent in a specific document but not common across other documents, it receives a higher weight.
    • Conversely, terms that are frequent across many documents receive a lower weight because they are not useful for distinguishing one document from another (for example, “and” or “or”).
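A minimal sketch of the computation follows. The IDF formula used here, log(N / df) with N the number of documents and df the number of documents containing the term, is one common variant and an assumption on my part; the exact formula shown in the lecture is not preserved in these notes, and libraries often add smoothing terms.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # count each term once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)              # raw term frequency
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
scores = tf_idf(docs)
print(scores[0]["the"])   # 0.0, because "the" occurs in every document
print(scores[2]["ran"])   # log(3), because "ran" occurs in one of three documents
```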

3. Visualizing Individual Documents

Tag Clouds (Word Clouds)

  • Representation: Word frequencies or tf-idf values are encoded through font size and/or color saturation.
  • Arrangement: Often alphabetical; position is typically meaningless.
  • Usage:
    • Visual summary of main themes or topics.
  • Issues:
    • Positional Meaning: Viewers may infer relationships based on proximity, which can be misleading.
    • Underutilized Channels: Position and color often don’t encode additional data.

Examples:

  • State of the Union Addresses:
    • President Bush (2002): Emphasis on “security”, “terror”, “weapons”.
    • President Obama (2011): Focus on “business”, “jobs”, “future”.

Wordle Algorithm

  • Goal: Create visually appealing word clouds with efficient space utilization.
  • Process:
    • Randomized Greedy Algorithm:
      • Place the largest word first at a random position.
      • Use a spiral search to find a position for each subsequent word without overlap.
    • Collision Detection:
      • Uses hierarchical bounding boxes (quadtrees) for efficient overlap checking.
  • Features:
    • Words can be placed at various orientations and within specific shapes.
Core logic:
  1. First, check whether the current word w intersects any word that has already been placed.
  2. If it does, move w.
  3. The movement follows a spiral path, so the word searches for a free position over a gradually widening area.
  4. The move continues only while:
    • No part of w lies outside the “playing field”.
    • The current spiral radius stays “small”.
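The spiral search can be sketched as follows. This is a simplified illustration, not the Wordle implementation: words are plain axis-aligned rectangles, overlap is tested against every placed word instead of using hierarchical bounding boxes, and the parameter names (`field`, `growth`, `step`, `max_radius`) are assumptions introduced here.

```python
import math
import random

def overlaps(a, b):
    """Rectangles are (x, y, w, h) with (x, y) the lower-left corner."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place_words(sizes, field=(400.0, 300.0), seed=0,
                growth=1.0, step=0.1, max_radius=500.0):
    """Greedy placement: largest word first, spiral outwards on collision."""
    rng = random.Random(seed)
    fw, fh = field
    placed = []
    for w, h in sorted(sizes, key=lambda s: s[0] * s[1], reverse=True):
        x0 = rng.uniform(0, fw - w)        # random starting position
        y0 = rng.uniform(0, fh - h)
        t = 0.0
        while growth * t <= max_radius:    # give up once the radius is too large
            r = growth * t                 # Archimedean spiral: radius grows with angle
            box = (x0 + r * math.cos(t), y0 + r * math.sin(t), w, h)
            inside = (0 <= box[0] and box[0] + w <= fw and
                      0 <= box[1] and box[1] + h <= fh)
            if inside and not any(overlaps(box, p) for p in placed):
                placed.append(box)
                break
            t += step
    return placed

sizes = [(100, 30), (60, 20), (60, 20), (40, 15)]   # (width, height) per word
boxes = place_words(sizes)
```

A word that finds no free spot before the radius limit is silently skipped in this sketch; a fixed `seed` keeps the layout reproducible.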

Video Example: Time-Varying Word Clouds

  • Demonstration of advanced word cloud techniques.
  • Key Points:
    • Incorporates shape animations and dynamic layouts.
    • Uses rigid body dynamics for arranging words.
    • Supports various constraints like boundary shapes and word orientations.

Word Trees (Visual Concordance)

  • Purpose: Visualize all occurrences of phrases starting or ending with a specific word.
  • Structure:
    • Tree-like diagram showing how phrases branch from a common root word.
  • Example:
    • Exploring phrases starting with “Love the…” in the Bible.
    • Reveals all continuations and frequencies of each phrase.
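The data structure behind a word tree is a prefix tree of the words that follow the root word. The sketch below builds one as nested dictionaries; the function name, the `depth` parameter, and the toy token list are illustrative assumptions, not taken from the lecture.

```python
def build_word_tree(tokens, root, depth=3):
    """Prefix tree of the `depth` tokens following each occurrence of `root`."""
    tree = {"count": 0, "children": {}}
    for i, tok in enumerate(tokens):
        if tok != root:
            continue
        tree["count"] += 1
        node = tree
        for nxt in tokens[i + 1:i + 1 + depth]:
            child = node["children"].setdefault(nxt, {"count": 0, "children": {}})
            child["count"] += 1            # how often this continuation occurs
            node = child
    return tree

tokens = "love the lord love the truth love thy neighbour".split()
tree = build_word_tree(tokens, "love", depth=2)
print(tree["children"]["the"]["count"])            # 2
print(sorted(tree["children"]["the"]["children"])) # ['lord', 'truth']
```

In a rendered word tree, the counts would typically drive the font size of each branch.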

4. Visualizing Entire Text Corpora

Literature Fingerprinting

  • Method: Divide texts into equal-sized chunks and analyze statistical properties.
  • Encoding:
    • Average Sentence Length: Mapped to color intensity.
    • Vocabulary Measures: Share of words that occur only once (hapax legomena) as an indicator of vocabulary richness.
  • Application: Detect stylistic differences, authorship attribution, or anomalies.
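The per-chunk statistics can be sketched as below. The chunking by a fixed number of sentences, the naive sentence splitter, and the field names are simplifying assumptions made here; the resulting numbers would then be mapped to the color of one cell per chunk.

```python
import re
from collections import Counter

def fingerprint(text, sentences_per_chunk=2):
    """Return one dict of style statistics per chunk of the text."""
    # Naive sentence split on ., ! and ? (assumption; real tools are more careful).
    sentences = [s.split() for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    rows = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = sentences[i:i + sentences_per_chunk]
        words = [w for sentence in chunk for w in sentence]
        counts = Counter(words)
        hapax = sum(1 for c in counts.values() if c == 1)   # words used exactly once
        rows.append({
            "avg_sentence_length": len(words) / len(chunk),
            "hapax_ratio": hapax / len(words),
        })
    return rows

rows = fingerprint("A b c. A b. D e f g. H.")
print(rows)
```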

Examples:

  • Jack London and Mark Twain:
    • Variations in sentence length or vocabulary can indicate ghostwriting or stylistic shifts.
    • Notable Observations:
      • “Jerry of the Islands” (Jack London) differs in sentence length from his other works.
      • “Tom Sawyer” (Mark Twain) has shorter sentences compared to his typical style.

Bible Visualization

  • Approach: Each verse represented as a pixel.
  • Encoding: Length of verses mapped to color.
  • Insights:
    • Patterns reveal structural elements like repetitive lists or anomalies.
    • Example: Repeating patterns in “Numbers 7” due to similar offerings described for each tribe.

Reconstructing Original Texts from Witnesses

  • Context: The original text is lost; multiple copies (witnesses) survive with variations. The family tree that relates the witnesses is called a stemma.
  • Goal: Infer the most probable original text by comparing witnesses.
  • Visualization:
    • Aligns different versions to highlight commonalities and differences.
    • Helps scholars decide on the most authentic content.

5. Embedding Visualizations into Text

Spark Lines

  • Definition: Small, word-sized line charts embedded within text.
  • Purpose: Show trends or patterns in a compact form without axes or labels.
  • Applications:
    • Stock prices over time.
    • Temperature changes.
    • Any time-series data.

Creating Spark Lines:

  • Challenges: Limited space requires data simplification.
  • Techniques:
    1. Sampling:
      • Select data points at regular intervals.
      • Limitation: May miss important features.
    2. Averaging (Piecewise Aggregate Approximation):
      • Compute average values within intervals.
      • Benefit: Captures overall trends but may smooth out peaks.
    3. Perceptually Important Points:
      • Algorithm selects points that significantly affect the visual shape.
      • Process:
        • Start with endpoints.
        • Add points that deviate most from the current simplified line.
        • Iteratively refine until the desired level of detail is reached.
      • Advantage: Preserves critical features like peaks and troughs.
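The three reduction techniques can be compared on a small series with a single peak. This is a minimal sketch: the function names are chosen here, `k` is the number of output points, and the PIP variant measures vertical distance to the current simplified line, which is one of several distance measures in use.

```python
def sample(values, k):
    """Keep k points at regular index intervals (requires k >= 2)."""
    n = len(values)
    return [values[round(i * (n - 1) / (k - 1))] for i in range(k)]

def paa(values, k):
    """Piecewise aggregate approximation: mean of k equal segments (k <= len)."""
    n = len(values)
    out = []
    for i in range(k):
        segment = values[i * n // k:(i + 1) * n // k]
        out.append(sum(segment) / len(segment))
    return out

def pip(values, k):
    """Indices of k perceptually important points, starting from the endpoints."""
    chosen = [0, len(values) - 1]
    while len(chosen) < k:
        best_i, best_d = None, -1.0
        for a, b in zip(chosen, chosen[1:]):
            for i in range(a + 1, b):
                # Vertical distance from point i to the line through a and b.
                line = values[a] + (values[b] - values[a]) * (i - a) / (b - a)
                d = abs(values[i] - line)
                if d > best_d:
                    best_i, best_d = i, d
        if best_i is None:                 # no interior points left
            break
        chosen = sorted(chosen + [best_i])
    return chosen

data = [0, 0, 0, 10, 0, 0, 0, 0, 0]
print(sample(data, 3))                  # [0, 0, 0]: the peak is missed
print(paa(data, 3))                     # the peak is flattened to about 3.33
print([data[i] for i in pip(data, 3)])  # [0, 10, 0]: the peak survives
```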

Generalized Word-Scale Visualizations

  • Micro Charts: Include bar charts, box plots, and other small visuals.
  • Applications:
    • In Text: Embed within sentences to support or extend content.
    • In Code: Augment source code with visualizations showing variable states or performance metrics.

When to use word-sized visualization:

  1. Support Content:
    • Quick visual comparisons or summaries.
  2. Summarize Content:
    • Highlight key data points or trends.
  3. Emphasize Content:
    • Reinforce important information visually.
  4. Extend Content:
    • Provide additional data not fully described in the text.
  5. Display Contradictory Data:
    • Offer alternative perspectives (less common).
  6. High-Density Information Display:
    • Readers see the data in context as they read, without moving their eyes to a separate figure.

Examples:

  • Source Code Visualization:
    • Embedding performance metrics or variable states directly in code.
    • Helps in debugging and understanding program behavior.
  • User Study Data:
    • Compact visualizations of eye-tracking or interaction data.

Additional Resources

  • Text Visualization Browser
    • A comprehensive collection of text visualization techniques.
    • Link to Text Visualization Browser

Next Lecture Preview

  • Topic: Visual Analytics
  • Reading Assignment: Chapter 2 from the specified textbook.
  • Focus: Integration of automated analysis with interactive visualization.

Visualization Critiques

General Approach

  • Identify Issues: Examine the visualization for inaccuracies or misleading elements.
  • Evaluate Expressiveness and Effectiveness: Does it represent the data accurately and clearly?
  • Suggest Improvements: Propose ways to enhance the visualization.

Example Critique: Bubble Chart of Movie Budgets and Grosses

Visualization Overview:

  • Data Represented:
    • Movie budgets (vertical axis).
    • Gross earnings (bubble size).
    • Release year (horizontal axis).
  • Purpose: Compare movie budgets and grosses over time.

Issues Identified:

  1. Incorrect Size Encoding
    • Problem: Gross earnings are mapped to bubble radius, not area.
    • Impact: Misrepresents the relative gross earnings; viewers may misinterpret the data.
  2. Color Usage
    • Problem: Colors assigned to bubbles have no meaningful categories.
    • Impact: Adds unnecessary complexity; may confuse viewers.
  3. Labeling
    • Problem: Uses a legend instead of directly labeling bubbles.
    • Impact: Makes it difficult to identify movies; requires constant referencing.
  4. Data Interpretation
    • Problem: Hard to compare movies accurately due to size encoding errors and overlapping bubbles.
    • Impact: Reduces the effectiveness of the visualization for comparison tasks.

Suggestions for Improvement:

  • Correct Size Encoding
    • Map gross earnings to bubble area, not radius.
  • Simplify Colors
    • Use a single color or meaningful color encoding (e.g., genre).
  • Direct Labeling
    • Place labels next to or within bubbles where possible.
  • Alternative Charts
    • Consider a scatter plot with gross earnings on one axis and budgets on another for clearer comparison.
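The size-encoding fix comes down to a square root: if area should be proportional to the value, the radius must grow with the square root of the value. A small sketch, with hypothetical function names and made-up numbers:

```python
import math

def radius_linear(value, max_value, max_radius):
    """The flawed encoding: radius proportional to the value."""
    return max_radius * value / max_value

def radius_area(value, max_value, max_radius):
    """The corrected encoding: area proportional to the value."""
    return max_radius * math.sqrt(value / max_value)

# A movie that grossed a quarter of the top-grossing one (illustrative numbers).
r_bad = radius_linear(100, 400, 20)    # 5.0
r_good = radius_area(100, 400, 20)     # 10.0

# Ratio of the drawn areas relative to the largest bubble.
print((r_bad / 20) ** 2)    # 0.0625: looks 16 times smaller instead of 4 times
print((r_good / 20) ** 2)   # 0.25: matches the data
```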

Example Critique 2: Students’ Academic Stressors

  • Expressiveness:
    • Stacking does not make sense
      • Overall vs. Subfields
      • Subfields with each other
    • Better: Grouped Bar Chart
  • Effectiveness:
    • Unclear why Social Sciences is “grayed out”
    • No whitespace between bars/columns makes it look like a histogram
    • Differences between fields are hard to see (e.g., Sciences and Arts/Humanities)

Example Critique 3: House Prices in Canada

  • Expressiveness:
    • Line Graph connects across arbitrarily orderable categories (shows a “trend” that is not in the data)

Tips for Effective Visualizations

  • Expressiveness
    • Ensure the visualization accurately represents the underlying data.
    • Avoid adding elements that suggest relationships not present in the data.
  • Effectiveness
    • Design for the target audience’s understanding.
    • Use visual channels appropriately (e.g., position is more precise than color).
  • Clarity
    • Labels and legends should be clear and easy to reference.
    • Avoid clutter and unnecessary embellishments.
  • Accessibility
    • Use color palettes that are colorblind-friendly.
    • Ensure text and visuals are legible.


Conclusion

  • Text Visualization: Requires specialized techniques due to the unique nature of text data.
  • Visualization Critiques: Essential for improving data representations and avoiding misleading visuals.
  • Next Steps: Explore visual analytics to combine automated analysis with interactive visualization.

Menti Quiz Review

Question: Which one of these is not a criterion for a good node-link layout?

  • Minimizing edge crossings
  • Adhering to the Robinson criterion
  • Maximizing edge crossing angles
  • Uniform edge length

Answer: Adhering to the Robinson criterion

Explanation:

  • Minimizing Edge Crossings
    • Essential to prevent misreading the graph.
    • Each crossing increases the chance of following the wrong edge.
  • Maximizing Edge Crossing Angles
    • Crossing angles close to 90° make it easier to tell the two edges apart.
    • Acute angles can mislead the viewer to the wrong path.
  • Uniform Edge Length
    • Avoids very long or very short edges.
    • Helps in maintaining a clear and proportional layout.
  • Robinson Criterion
    • Not related to node-link layouts.
    • Used in matrix visualizations of graphs.
    • Involves arranging the matrix so that values decrease as you move away from the main diagonal.
    • Helps in revealing clusters by minimizing Robinson violations.

Robinson Criterion Details:

  • Definition: Values should decrease when moving horizontally or vertically away from the main diagonal in a matrix.
  • Purpose: To highlight clusters around the main diagonal that may be otherwise unnoticed.
  • Implementation: Optimize the order of rows and columns to minimize Robinson violations.
  • Violations: Instances where the criterion is not met; can be counted or weighted differently during optimization.
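Counting violations is straightforward once the definition is written as a loop. The sketch below counts every step away from the diagonal at which a value increases, in rows and in columns; this unweighted count is one possible choice, and the small matrices are made-up examples.

```python
def robinson_violations(m):
    """Count increases when moving away from the diagonal in rows and columns."""
    n = len(m)
    count = 0
    for d in range(n):                 # start at the diagonal cell (d, d)
        for step in (1, -1):           # right/down first, then left/up
            k = d
            while 0 <= k + step < n:
                if m[d][k + step] > m[d][k]:     # horizontal move in row d
                    count += 1
                if m[k + step][d] > m[k][d]:     # vertical move in column d
                    count += 1
                k += step
    return count

def reorder(m, order):
    """Apply the same permutation to rows and columns."""
    return [[m[i][j] for j in order] for i in order]

bad = [[3, 1, 2],
       [1, 3, 2],
       [2, 2, 3]]
print(robinson_violations(bad))                       # 2
print(robinson_violations(reorder(bad, [0, 2, 1])))   # 0
```

An optimizer would search over `order` to minimize this count, which is what reveals the clusters along the diagonal.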

Question 2: Artificial Reduction in Velocity in Force-Directed Layouts

Question: What do we call the artificial reduction in velocity in the force-directed layout?

  • The Barnes-Hut optimization
  • Simulated annealing
  • The barycenter algorithm
  • Coulomb’s law

Answer: Simulated annealing

Explanation:

  • Simulated Annealing
    • Inspired by annealing in metallurgy, where a heated metal is cooled slowly so that it settles into a stable structure.
    • In layout algorithms, it involves gradually reducing the “temperature” to settle the nodes into a stable configuration.
    • Velocity of nodes decreases over iterations, preventing oscillations and helping reach an equilibrium.
  • Barnes-Hut Optimization
    • Uses a quadtree data structure.
    • Reduces computational complexity from O(n^2) to O(n log n) by approximating distant node interactions.
    • Not related to reducing velocity but to optimizing force calculations.
  • Barycenter Algorithm
    • Part of the Sugiyama framework for layered graph layouts.
    • Calculates the average (barycenter) of connected nodes to minimize edge crossings during node ordering.
    • Not related to velocity reduction.
  • Coulomb’s Law
    • Describes the repulsive force between electrically charged particles.
    • Used in force-directed layouts to simulate repulsion between nodes.
    • Does not address velocity reduction.
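The cooling idea can be sketched in a small force-directed loop. Note the assumptions: the force formulas follow the style of Fruchterman and Reingold (repulsion k^2/d, attraction d^2/k) rather than a literal Coulomb law, repulsion is computed for all pairs without Barnes-Hut, and the parameter names and the cooling factor are chosen here for illustration.

```python
import math
import random

def force_layout(nodes, edges, iterations=100, k=1.0, cooling=0.95, seed=0):
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for v in nodes}
    temperature = 1.0                  # maximum move per node and iteration
    for _ in range(iterations):
        disp = {v: [0.0, 0.0] for v in nodes}
        # Repulsion between every pair of nodes (the O(n^2) part).
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = k * k / dist
                disp[u][0] += dx / dist * push
                disp[u][1] += dy / dist * push
                disp[v][0] -= dx / dist * push
                disp[v][1] -= dy / dist * push
        # Spring-like attraction along edges.
        for u, v in edges:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            pull = dist * dist / k
            disp[u][0] -= dx / dist * pull
            disp[u][1] -= dy / dist * pull
            disp[v][0] += dx / dist * pull
            disp[v][1] += dy / dist * pull
        # Move each node, but never farther than the current temperature.
        for v in nodes:
            length = math.hypot(disp[v][0], disp[v][1])
            if length > 0:
                move = min(length, temperature)
                pos[v][0] += disp[v][0] / length * move
                pos[v][1] += disp[v][1] / length * move
        temperature *= cooling         # cool down: moves shrink over time
    return pos

pos = force_layout(["a", "b"], [("a", "b")])
print(math.dist(pos["a"], pos["b"]))   # close to k, the balance point of the two forces
```

Without the temperature cap, two connected nodes in this sketch would overshoot their balance distance on every step; the shrinking cap is what lets the layout settle.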

Question 3: Matrix Displays are Good for…

Question: Matrix displays are a good choice for the following:

  • For sparse graphs
  • For trees
  • For showing edge attributes
  • For dense graphs

Answers: For showing edge attributes and For dense graphs

Explanation:

  • Dense Graphs
    • Node-Link Issues: In dense graphs, node-link diagrams become cluttered with overlapping edges.
    • Matrix Advantages: Matrix displays handle density well by representing edges as cells, avoiding visual clutter.
  • Showing Edge Attributes
    • Enhanced Encoding: Each matrix cell can represent multiple edge attributes.
    • Use of Glyphs: Allows embedding small glyphs or visualizations within cells to convey complex information.
  • Sparse Graphs and Trees
    • Sparse Graphs: Matrix displays are inefficient due to many empty cells, wasting space.
    • Trees: Being sparse by nature, trees are better visualized using hierarchical or node-link diagrams.

Question 4: Squarified Treemap Layout

Question: The squarified treemap layout is:

  • A slice-and-dice algorithm
  • A randomized algorithm
  • A greedy algorithm
  • An optimization algorithm

Answer: A greedy algorithm

Explanation:

  • Squarified Treemap Layout
    • Aims to create rectangles (nodes) with aspect ratios close to 1 (squares).
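    • Greedy Approach: Processes items in decreasing size order and adds each one to the current row only while that does not worsen the row’s worst aspect ratio; earlier placements are never revisited.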
    • Improves readability by making area comparisons easier between nodes.
  • Slice-and-Dice Algorithm
    • Divides space alternately in horizontal and vertical directions.
    • Can result in elongated rectangles, making area comparison difficult.
  • Randomized Algorithm
    • Not applicable here; squarified treemaps follow a deterministic process.
  • Optimization Algorithm
    • While it aims for better aspect ratios, it doesn’t solve an optimization problem with a cost function.
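The greedy row-building step can be sketched as follows. This is a compact illustration of the commonly described squarified heuristic, assuming the areas are sorted in decreasing order and sum to the area of the target rectangle; function and variable names are chosen here.

```python
def worst_ratio(row, side):
    """Worst aspect ratio when the areas in `row` share a strip along `side`."""
    total = sum(row)
    s2, t2 = side * side, total * total
    return max(max(s2 * a / t2, t2 / (s2 * a)) for a in row)

def squarify(areas, x, y, w, h):
    """Return one rectangle (x, y, w, h) per area, in input order."""
    rects = []
    remaining = list(areas)
    while remaining:
        side = min(w, h)               # rows are laid along the shorter side
        row = [remaining.pop(0)]
        # Greedy step: extend the row while the worst ratio does not get worse.
        while remaining and (worst_ratio(row + [remaining[0]], side)
                             <= worst_ratio(row, side)):
            row.append(remaining.pop(0))
        thickness = sum(row) / side
        offset = 0.0
        for a in row:
            length = a / thickness
            if w >= h:                 # vertical strip on the left
                rects.append((x, y + offset, thickness, length))
            else:                      # horizontal strip at the top
                rects.append((x + offset, y, length, thickness))
            offset += length
        if w >= h:                     # shrink the free rectangle
            x, w = x + thickness, w - thickness
        else:
            y, h = y + thickness, h - thickness
    return rects

areas = [6, 6, 4, 3, 2, 2, 1]
rects = squarify(areas, 0.0, 0.0, 6.0, 4.0)
print(rects[0])   # (0.0, 0.0, 3.0, 2.0)
```

Once a row is closed, it is never reopened, which is why the method counts as greedy and not as an optimization over all layouts.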

Question 5: Steps of the Sugiyama Algorithm

Question: Which one is not a step of the Sugiyama algorithm?

  • The upward expansion
  • The crossing reduction
  • The layer assignment
  • The horizontal assignment

Answer: The upward expansion

Explanation:

  • Sugiyama Algorithm Steps:
    1. Layer Assignment
      • Assign nodes to discrete layers.
      • Source nodes placed on the first layer; edges point downwards.
    2. Crossing Reduction
      • Minimize edge crossings by ordering nodes within layers.
      • Uses methods like the barycentric algorithm to reorder nodes based on connections.
    3. Horizontal Assignment
      • Assign exact horizontal positions to nodes.
      • Aims to reduce edge bends and improve readability.
  • Upward Expansion
    • Not a recognized step.
    • Possibly a made-up term for the question.
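The crossing reduction step can be illustrated with a single barycenter pass between two layers. This sketch reorders only the lower layer and keeps nodes without neighbours at their current index; the function names and the tiny example graph are assumptions for illustration.

```python
def barycenter_order(upper, lower, edges):
    """Sort `lower` by the mean position of each node's neighbours in `upper`."""
    position = {v: i for i, v in enumerate(upper)}

    def key(item):
        index, v = item
        neighbours = [position[u] for u, w in edges if w == v]
        # Nodes without neighbours keep their current index as sort key.
        return sum(neighbours) / len(neighbours) if neighbours else float(index)

    return [v for _, v in sorted(enumerate(lower), key=key)]

def count_crossings(upper, lower, edges):
    """Two edges cross when their endpoints are ordered oppositely."""
    pu = {v: i for i, v in enumerate(upper)}
    pl = {v: i for i, v in enumerate(lower)}
    count = 0
    for i, (a, b) in enumerate(edges):
        for c, d in edges[i + 1:]:
            if (pu[a] - pu[c]) * (pl[b] - pl[d]) < 0:
                count += 1
    return count

upper = ["a", "b", "c"]
lower = ["x", "y", "z"]
edges = [("a", "z"), ("b", "y"), ("c", "x")]   # (upper node, lower node)

print(count_crossings(upper, lower, edges))                  # 3
new_lower = barycenter_order(upper, lower, edges)
print(new_lower, count_crossings(upper, new_lower, edges))   # ['z', 'y', 'x'] 0
```

In a full layout, such passes are typically repeated, sweeping down and up through the layers until the crossing count stops improving.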

This post is licensed under CC BY 4.0 by the author.