DAVI Visualizing Textual Data
Lecture on Text Visualization and Visualization Critiques
Challenges with Text Data
- Doesn’t Fit Traditional Data Types
- Neither purely qualitative nor quantitative.
- Visual Channel Mismatch
- Standard channels (position, color, size) don’t directly apply.
- No Inherent Structure
- Often unstructured or semi-structured, lacking a universal format.
- Difficult Task Abstraction
- Traditional task frameworks don’t easily accommodate text data.
Implication: Text data requires specialized visualization techniques.
Topics Covered
- Preprocessing Text Data
- Visualizing Individual Documents
- Visualizing Entire Text Corpora
- Embedding Visualizations into Text (Spark Lines, etc.)
1. Preprocessing Text Data
Objective: Normalize and prepare text for analysis.
Steps:
- Remove Formatting
- Strip away HTML tags, LaTeX commands, and other markup.
- Noise Removal
- Remove punctuation, emoticons, and non-text elements.
- Lowercasing
- Convert all text to lowercase to unify word representations.
- Social Normalization
- Standardize colloquial terms and slang.
- Example: “u r so gud” ➔ “you are so good”
- Stopword Removal
- Eliminate common words with little semantic value (e.g., “the”, “is”, “and”).
- Benefit: Reduces noise and focuses on meaningful words.
- Stemming
- Reduce words to their base or root form.
- Example: “trouble”, “troubled”, “troubles” ➔ “trouble”
- Algorithm: Porter Stemming Algorithm is widely used.
- Lemmatization
- Convert words to their canonical form, accounting for irregularities.
- Example: “good”, “better”, “best” ➔ “good”
- Benefit: Groups words with similar meanings that standard stemming might miss.
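A minimal sketch of the first preprocessing steps, using only the Python standard library. The `SLANG` and `STOPWORDS` tables are tiny illustrative assumptions, not lists from the lecture; stemming and lemmatization are left out because they normally rely on a dedicated library (for example a Porter stemmer implementation).

```python
import re

# Tiny illustrative lookup tables (assumptions made for this sketch only).
SLANG = {"u": "you", "r": "are", "gud": "good"}
STOPWORDS = {"the", "is", "and", "a", "an", "of", "to"}

def preprocess(text):
    """Return a list of normalized tokens for `text`."""
    text = re.sub(r"<[^>]+>", " ", text)    # remove formatting: strip HTML tags
    text = text.lower()                     # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)   # noise removal: keep letters and whitespace
    tokens = text.split()
    tokens = [SLANG.get(t, t) for t in tokens]          # social normalization
    return [t for t in tokens if t not in STOPWORDS]    # stopword removal

tokens = preprocess("<p>U r so gud, and the film is GREAT!</p>")
print(tokens)
```

The order of the steps matters: lowercasing comes before the slang and stopword lookups so that "U" and "The" match their table entries.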
2. Representing Text Data
Bag-of-Words Model
- Concept: Text represented as a multiset of words.
- Components:
- Each word (term) and its frequency in the document.
- Limitation: Loses word order and context.
n-grams
- Definition: Sequences of n consecutive elements from a text; each such sequence is an n-gram.
- Types:
- Unigrams: Single words (same as bag-of-words).
- Bigrams: Pairs of consecutive words.
- Trigrams: Sequences of three words.
- Purpose: Capture context and word order.
Example with “To be or not to be”:
- Unigrams: “to”, “be”, “or”, “not” (“to” and “be” each occur twice)
- Bigrams: “to be”, “be or”, “or not”, “not to”, “to be”
- Trigrams: “to be or”, “be or not”, “or not to”, “not to be”
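The example above can be reproduced with a short helper. This is a sketch; the function name `ngrams` is chosen here for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Passing a string instead of a token list yields character n-grams, since slicing works the same way on characters.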
Character n-grams
- Use letters instead of words.
- Application: Helps in language detection and handling misspellings.
Vector Space Representation
- Method: Represent documents as high-dimensional vectors.
- Example: Letter trigrams yield a vector with 26^3 = 17,576 dimensions.
- Benefit: Enables mathematical operations like calculating cosine similarity between documents.
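As a rough illustration of the vector space idea, the sketch below builds letter-trigram count vectors and compares them with cosine similarity. It stores only non-zero dimensions in a `Counter` instead of a dense 17,576-entry vector, and it drops spaces so trigrams run across word boundaries; both are simplifying assumptions of this sketch.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Sparse letter-trigram count vector (non-letters are dropped)."""
    letters = "".join(c for c in text.lower() if c.isalpha())
    return Counter(letters[i:i + 3] for i in range(len(letters) - 2))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

v1 = char_trigrams("visualization of text")
v2 = char_trigrams("text visualization")
v3 = char_trigrams("xyz")
print(cosine_similarity(v1, v2), cosine_similarity(v1, v3))
```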
tf-idf (Term Frequency-Inverse Document Frequency)
- Term Frequency (TF): Number of times a term appears in a document.
- Inverse Document Frequency (IDF): Measures how rare a term is across the corpus as a whole; terms that occur in fewer documents get a higher IDF.
- tf-idf Score: TF multiplied by IDF.
- Purpose: Identify the most significant terms in a document while ignoring commonly used terms that lack discriminative power.
- If a term is frequent in a specific document but not common across other documents, it receives a higher weight.
- Conversely, terms that are frequent across many documents have a lower weight because they are not useful for distinguishing one document from another (for example, “and” or “or”).
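A small sketch of the weighting described above. The notes do not fix a particular IDF formula, so this assumes the common variant idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)            # raw term frequency
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "purred", "purred"],
]
scores = tf_idf(docs)
print(scores[2])
```

With this variant a term that appears in every document, such as "the" here, gets a score of exactly zero.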
3. Visualizing Individual Documents
Tag Clouds (Word Clouds)
- Representation: Word frequencies or tf-idf values are encoded through font size and/or color saturation.
- Arrangement: Often alphabetical; position is typically meaningless.
- Usage:
- Visual summary of main themes or topics.
- Issues:
- Positional Meaning: Viewers may infer relationships based on proximity, which can be misleading.
- Underutilized Channels: Position and color often don’t encode additional data.
Examples:
- State of the Union Addresses:
- President Bush (2002): Emphasis on “security”, “terror”, “weapons”.
- President Obama (2011): Focus on “business”, “jobs”, “future”.
Wordle Algorithm
- Goal: Create visually appealing word clouds with efficient space utilization.
- Process:
- Randomized Greedy Algorithm:
- Place the largest word first at a random position.
- Use a spiral search to find a position for each subsequent word without overlap.
- Collision Detection:
- Uses hierarchical bounding boxes (quadtrees) for efficient overlap checking.
- Features:
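Core logic:
- First, check whether the current word w intersects any word that has already been placed.
- If it does, move w.
- The move follows a spiral path, so the word searches for a free position within a gradually expanding area.
- The conditions for moving are:
- No part of w extends beyond the “playing field”.
- The current spiral radius stays “small”.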
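The placement loop can be sketched as follows. This is a simplified illustration, not the actual Wordle implementation: words are reduced to plain axis-aligned rectangles (no glyph shapes, no hierarchical bounding boxes), and the spiral parameters, field size, and function names are assumptions of this sketch.

```python
import math
import random

def overlaps(a, b):
    """Axis-aligned rectangles given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place_words(sizes, field=(400.0, 300.0), seed=0, step=0.1, max_steps=20000):
    """Greedy placement; `sizes` lists (width, height) boxes, largest first."""
    rng = random.Random(seed)
    placed = []
    for w, h in sizes:
        # Random start position that keeps the box inside the playing field.
        cx = rng.uniform(0, field[0] - w)
        cy = rng.uniform(0, field[1] - h)
        for i in range(max_steps):
            t = i * step
            r = 2.0 * t  # Archimedean spiral: radius grows with the angle
            box = (cx + r * math.cos(t), cy + r * math.sin(t), w, h)
            inside = (0 <= box[0] and box[0] + w <= field[0]
                      and 0 <= box[1] and box[1] + h <= field[1])
            if inside and not any(overlaps(box, p) for p in placed):
                placed.append(box)
                break
        else:
            raise RuntimeError("no free position found")
    return placed

boxes = place_words([(120, 40), (80, 30), (60, 20), (40, 15)])
print(boxes)
```

Testing every placed box for overlap is quadratic in the number of words; the hierarchical bounding boxes mentioned above exist to make that check cheaper and shape-aware.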
Video Example: Time-Varying Word Clouds
- Demonstration of advanced word cloud techniques.
- Key Points:
- Incorporates shape animations and dynamic layouts.
- Uses rigid body dynamics for arranging words.
- Supports various constraints like boundary shapes and word orientations.
Word Trees (Visual Concordance)
- Purpose: Visualize all occurrences of phrases starting or ending with a specific word.
- Structure:
- Tree-like diagram showing how phrases branch from a common root word.
- Example:
- Exploring phrases starting with “Love the…” in the Bible.
- Reveals all continuations and frequencies of each phrase.
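The data structure behind a word tree can be sketched as a nested dictionary of continuations with counts. The token sequence below is a made-up toy input, not a quotation, and the function name and `depth` parameter are assumptions of this sketch.

```python
def build_word_tree(tokens, root, depth=3):
    """Tree of phrases that start with `root`, up to `depth` following words.

    Each node is {"count": int, "children": {word: node}}.
    """
    tree = {"count": 0, "children": {}}
    for i, tok in enumerate(tokens):
        if tok != root:
            continue
        tree["count"] += 1
        node = tree
        for nxt in tokens[i + 1:i + 1 + depth]:
            child = node["children"].setdefault(nxt, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return tree

tokens = "love the lord love the truth love thy neighbour".split()
tree = build_word_tree(tokens, "love", depth=2)
print(tree)
```

In a rendered word tree, each node's count would typically drive the font size of that branch.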
4. Visualizing Entire Text Corpora
Literature Fingerprinting
- Method: Divide texts into equal-sized chunks and analyze statistical properties.
- Encoding:
- Average Sentence Length: Mapped to color intensity.
- Vocabulary Measures: Share of words that occur only once (hapax legomena) as an indicator of vocabulary richness.
- Application: Detect stylistic differences, authorship attribution, or anomalies.
Examples:
- Jack London and Mark Twain:
- Variations in sentence length or vocabulary can indicate ghostwriting or stylistic shifts.
- Notable Observations:
- “Jerry of the Islands” (Jack London) differs in sentence length from his other works.
- “Tom Sawyer” (Mark Twain) has shorter sentences compared to his typical style.
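A minimal sketch of the fingerprinting computation: split a text into chunks and compute the average sentence length per chunk. Chunking by a fixed number of sentences and splitting sentences on ".", "!" and "?" are assumptions of this sketch; each resulting value would then be mapped to the color of one cell.

```python
import re

def fingerprint(text, sentences_per_chunk=2):
    """Average sentence length (in words) for each chunk of the text."""
    sentences = [s.split() for s in re.split(r"[.!?]+", text) if s.strip()]
    values = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = sentences[i:i + sentences_per_chunk]
        values.append(sum(len(s) for s in chunk) / len(chunk))
    return values

values = fingerprint("A b c. D e. F g h i j. K.")
print(values)
```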
Bible Visualization
- Approach: Each verse represented as a pixel.
- Encoding: Length of verses mapped to color.
- Insights:
- Patterns reveal structural elements like repetitive lists or anomalies.
- Example: Repeating patterns in “Numbers 7” due to similar offerings described for each tribe.
Reconstructing Original Texts from Witnesses
- Context: The original text is lost; multiple copies (witnesses) exist with variations. The stemma is the family tree that relates these witnesses to each other.
- Goal: Infer the most probable original text by comparing witnesses.
- Visualization:
- Aligns different versions to highlight commonalities and differences.
- Helps scholars decide on the most authentic content.
5. Embedding Visualizations into Text
Spark Lines
- Definition: Small, word-sized line charts embedded within text.
- Purpose: Show trends or patterns in a compact form without axes or labels.
- Applications:
- Stock prices over time.
- Temperature changes.
- Any time-series data.
Creating Spark Lines:
- Challenges: Limited space requires data simplification.
- Techniques:
- Sampling:
- Select data points at regular intervals.
- Limitation: May miss important features.
- Averaging (Piecewise Aggregate Approximation):
- Compute average values within intervals.
- Benefit: Captures overall trends but may smooth out peaks.
- Perceptually Important Points:
- Algorithm selects points that significantly affect the visual shape.
- Process:
- Start with endpoints.
- Add points that deviate most from the current simplified line.
- Iteratively refine until the desired level of detail is reached.
- Advantage: Preserves critical features like peaks and troughs.
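The averaging and perceptually-important-points techniques can be sketched as below. The deviation measure used here is the vertical distance to the current simplified line, which is one common choice and an assumption of this sketch; the function names are illustrative, and the series is assumed to have at least two values.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each (nearly) equal segment."""
    n = len(series)
    out = []
    for k in range(n_segments):
        segment = series[k * n // n_segments:(k + 1) * n // n_segments]
        out.append(sum(segment) / len(segment))
    return out

def important_points(series, n_points):
    """Indices of points kept after iteratively adding the largest deviation."""
    chosen = [0, len(series) - 1]                 # start with the endpoints
    while len(chosen) < min(n_points, len(series)):
        best_idx, best_dist = None, -1.0
        for left, right in zip(chosen, chosen[1:]):
            for i in range(left + 1, right):
                # Value of the current simplified line at position i.
                frac = (i - left) / (right - left)
                interp = series[left] + (series[right] - series[left]) * frac
                dist = abs(series[i] - interp)
                if dist > best_dist:
                    best_idx, best_dist = i, dist
        chosen.append(best_idx)
        chosen.sort()
    return chosen

series = [0, 1, 0, 0, 5, 0, 0, 1, 0]
averaged = paa(series, 3)
kept = important_points(series, 3)
print(averaged, kept)
```

On this series the averaged version flattens the peak of 5 to about 1.67, while the important-points version keeps index 4, which illustrates the trade-off described above.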
Generalized Word-Scale Visualizations
- Micro Charts: Include bar charts, box plots, and other small visuals.
- Applications:
- In Text: Embed within sentences to support or extend content.
- In Code: Augment source code with visualizations showing variable states or performance metrics.
When to use word-sized visualization:
- Support Content:
- Quick visual comparisons or summaries.
- Summarize Content:
- Highlight key data points or trends.
- Emphasize Content:
- Reinforce important information visually.
- Extend Content:
- Provide additional data not fully described in the text.
- Display Contradictory Data:
- Offer alternative perspectives (less common).
- High-density information display:
- The reader does not need to shift their gaze away from the text and sees the data in context while reading.
Examples:
- Source Code Visualization:
- Embedding performance metrics or variable states directly in code.
- Helps in debugging and understanding program behavior.
- User Study Data:
- Compact visualizations of eye-tracking or interaction data.
Additional Resources
- Text Visualization Browser
- A comprehensive collection of text visualization techniques.
- Link to Text Visualization Browser
Next Lecture Preview
- Topic: Visual Analytics
- Reading Assignment: Chapter 2 from the specified textbook.
- Focus: Integration of automated analysis with interactive visualization.
Visualization Critiques
General Approach
- Identify Issues: Examine the visualization for inaccuracies or misleading elements.
- Evaluate Expressiveness and Effectiveness: Does it represent the data accurately and clearly?
- Suggest Improvements: Propose ways to enhance the visualization.
Example Critique: Bubble Chart of Movie Budgets and Grosses
- Data Represented:
- Movie budgets (vertical axis).
- Gross earnings (bubble size).
- Release year (horizontal axis).
- Purpose: Compare movie budgets and grosses over time.
Issues Identified:
- Incorrect Size Encoding
- Problem: Gross earnings are mapped to bubble radius, not area.
- Impact: Misrepresents the relative gross earnings; viewers may misinterpret the data.
- Color Usage
- Problem: Colors assigned to bubbles have no meaningful categories.
- Impact: Adds unnecessary complexity; may confuse viewers.
- Labeling
- Problem: Uses a legend instead of directly labeling bubbles.
- Impact: Makes it difficult to identify movies; requires constant referencing.
- Data Interpretation
- Problem: Hard to compare movies accurately due to size encoding errors and overlapping bubbles.
- Impact: Reduces the effectiveness of the visualization for comparison tasks.
Suggestions for Improvement:
- Correct Size Encoding
- Map gross earnings to bubble area, not radius.
- Simplify Colors
- Use a single color or meaningful color encoding (e.g., genre).
- Direct Labeling
- Place labels next to or within bubbles where possible.
- Alternative Charts
- Consider a scatter plot with gross earnings on one axis and budgets on another for clearer comparison.
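The size-encoding fix comes down to a square root: since a circle's area grows with the square of its radius, the radius must scale with the square root of the value. A short sketch, with illustrative function and parameter names:

```python
import math

def radius_for_value(value, max_value, max_radius):
    """Radius such that the bubble AREA is proportional to `value`."""
    return max_radius * math.sqrt(value / max_value)

# A movie with a quarter of the maximum gross gets half the maximum radius,
# which corresponds to a quarter of the maximum area.
r = radius_for_value(25, 100, 10)
area_ratio = (math.pi * r ** 2) / (math.pi * 10 ** 2)
print(r, area_ratio)
```

If the radius were scaled linearly instead, that same movie would be drawn with only one sixteenth of the maximum area, understating its gross by a factor of four.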
Example Critique 2: Students’ Academic Stressors
- Expressiveness:
- Stacking does not make sense
- Overall vs. Subfields
- Subfields w/ each other
- Better: Grouped Bar Chart
- Effectiveness:
- Unclear why Social Sciences is “grayed out”
- No whitespace between bars/columns makes it look like a histogram
- Differences between fields are hard to see (e.g., Sciences and Arts/Humanities)
Example Critique 3: House Prices in Canada
- Expressiveness:
- The line graph connects categories whose order is arbitrary, suggesting a “trend” that is not in the data.
Tips for Effective Visualizations
- Expressiveness
- Ensure the visualization accurately represents the underlying data.
- Avoid adding elements that suggest relationships not present in the data.
- Effectiveness
- Design for the target audience’s understanding.
- Use visual channels appropriately (e.g., position is more precise than color).
- Clarity
- Labels and legends should be clear and easy to reference.
- Avoid clutter and unnecessary embellishments.
- Accessibility
- Use color palettes that are colorblind-friendly.
- Ensure text and visuals are legible.
Additional Resources for Visualization Critiques
- Visualization Blogs and Forums:
- Best Practices Guides:
- Books and articles on data visualization principles by experts like Edward Tufte and Stephen Few.
Conclusion
- Text Visualization: Requires specialized techniques due to the unique nature of text data.
- Visualization Critiques: Essential for improving data representations and avoiding misleading visuals.
- Next Steps: Explore visual analytics to combine automated analysis with interactive visualization.
Menti Quiz Review
Question 1: Criteria for a Good Node-Link Layout
Question: Which one of these is not a criterion for a good node-link layout?
- Minimizing edge crossings
- Adhering to the Robinson criterion
- Maximizing edge crossing angles
- Uniform edge length
Answer: Adhering to the Robinson criterion
Explanation:
- Minimizing Edge Crossings
- Essential to prevent misreading the graph.
- Each crossing increases the chance of following the wrong edge.
- Maximizing Edge Crossing Angles
- Crossing angles closer to 90° reduce confusion at crossings.
- Acute angles can mislead the viewer to the wrong path.
- Uniform Edge Length
- Avoids very long or very short edges.
- Helps in maintaining a clear and proportional layout.
- Robinson Criterion
- Not related to node-link layouts.
- Used in matrix visualizations of graphs.
- Involves arranging the matrix so that values decrease as you move away from the main diagonal.
- Helps in revealing clusters by minimizing Robinson violations.
Robinson Criterion Details:
- Definition: Values should decrease when moving horizontally or vertically away from the main diagonal in a matrix.
- Purpose: To highlight clusters around the main diagonal that may be otherwise unnoticed.
- Implementation: Optimize the order of rows and columns to minimize Robinson violations.
- Violations: Instances where the criterion is not met; can be counted or weighted differently during optimization.
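Counting violations can be sketched as follows. This version counts every adjacent pair of cells in a row where the value increases while moving away from the diagonal; it assumes a symmetric similarity matrix, so the column direction mirrors the row direction and is not counted separately. Other counting or weighting schemes are possible.

```python
def robinson_violations(m):
    """Number of row-wise increases when moving away from the main diagonal."""
    n = len(m)
    violations = 0
    for i in range(n):
        for j in range(i, n - 1):      # moving right, away from the diagonal
            if m[i][j + 1] > m[i][j]:
                violations += 1
        for j in range(i, 0, -1):      # moving left, away from the diagonal
            if m[i][j - 1] > m[i][j]:
                violations += 1
    return violations

good = [[3, 2, 1],
        [2, 3, 2],
        [1, 2, 3]]
# The same matrix with the second and third row/column swapped.
bad = [[3, 1, 2],
       [1, 3, 2],
       [2, 2, 3]]
print(robinson_violations(good), robinson_violations(bad))
```

A reordering algorithm would search over row/column permutations for one that minimizes this count.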
Question 2: Artificial Reduction in Velocity in Force-Directed Layouts
Question: What do we call the artificial reduction in velocity in the force-directed layout?
- The Barnes-Hut optimization
- Simulated annealing
- The barycenter algorithm
- Coulomb’s law
Answer: Simulated annealing
Explanation:
- Simulated Annealing
- Inspired by annealing in metallurgy (heated metal is cooled slowly so that it settles into a stable structure).
- In layout algorithms, it involves gradually reducing the “temperature” to settle the nodes into a stable configuration.
- Velocity of nodes decreases over iterations, preventing oscillations and helping reach an equilibrium.
- Barnes-Hut Optimization
- Uses a quadtree data structure.
- Reduces computational complexity from $O(n^2)$ to $O(n \log n)$ by approximating distant node interactions.
- Not related to reducing velocity but to optimizing force calculations.
- Barycenter Algorithm
- Part of the Sugiyama framework for layered graph layouts.
- Calculates the average (barycenter) of connected nodes to minimize edge crossings during node ordering.
- Not related to velocity reduction.
- Coulomb’s Law
- Describes the repulsive force between electrically charged particles.
- Used in force-directed layouts to simulate repulsion between nodes.
- Does not address velocity reduction.
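To show where the cooling step sits in a force-directed layout, here is a minimal sketch. The specific force formulas (inverse-square repulsion, linear spring attraction), the cooling factor, and all parameter names are assumptions of this sketch; it computes all pairwise repulsions directly and does not include the Barnes-Hut approximation.

```python
import math
import random

def force_layout(nodes, edges, iterations=200, rest_length=1.0, cooling=0.95, seed=0):
    """Return {node: (x, y)} after a simple force-directed simulation."""
    rng = random.Random(seed)
    pos = {v: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for v in nodes}
    temperature = 1.0  # upper bound on how far a node may move per iteration

    for _ in range(iterations):
        force = {v: [0.0, 0.0] for v in nodes}

        # Coulomb-like repulsion between every pair of nodes.
        for u in nodes:
            for v in nodes:
                if u == v:
                    continue
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = max(math.hypot(dx, dy), 1e-6)
                strength = 1.0 / (dist * dist)
                force[u][0] += dx / dist * strength
                force[u][1] += dy / dist * strength

        # Spring-like attraction along edges, around rest_length.
        for u, v in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = max(math.hypot(dx, dy), 1e-6)
            strength = dist - rest_length
            force[u][0] += dx / dist * strength
            force[u][1] += dy / dist * strength
            force[v][0] -= dx / dist * strength
            force[v][1] -= dy / dist * strength

        # Move each node, but never further than the current temperature.
        for v in nodes:
            length = math.hypot(force[v][0], force[v][1])
            if length > 0.0:
                step = min(length, temperature)
                pos[v][0] += force[v][0] / length * step
                pos[v][1] += force[v][1] / length * step

        temperature *= cooling  # cool down: allowed movement shrinks over time

    return {v: (p[0], p[1]) for v, p in pos.items()}

layout = force_layout(["a", "b", "c"], [("a", "b"), ("b", "c")])
print(layout)
```

Without the `temperature` cap, nodes that start very close together would receive huge repulsive forces and could oscillate instead of settling.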
Question 3: Matrix Displays are Good for…
Question: Matrix displays are a good choice for the following:
- For sparse graphs
- For trees
- For showing edge attributes
- For dense graphs
Answers: For showing edge attributes and For dense graphs
Explanation:
- Dense Graphs
- Node-Link Issues: In dense graphs, node-link diagrams become cluttered with overlapping edges.
- Matrix Advantages: Matrix displays handle density well by representing edges as cells, avoiding visual clutter.
- Showing Edge Attributes
- Enhanced Encoding: Each matrix cell can represent multiple edge attributes.
- Use of Glyphs: Allows embedding small glyphs or visualizations within cells to convey complex information.
- Sparse Graphs and Trees
- Sparse Graphs: Matrix displays are inefficient due to many empty cells, wasting space.
- Trees: Being sparse by nature, trees are better visualized using hierarchical or node-link diagrams.
Question 4: Squarified Treemap Layout
Question: The squarified treemap layout is:
- A slice-and-dice algorithm
- A randomized algorithm
- A greedy algorithm
- An optimization algorithm
Answer: A greedy algorithm
Explanation:
- Squarified Treemap Layout
- Aims to create rectangles (nodes) with aspect ratios close to 1 (squares).
- Greedy Approach: Places the largest items first, filling space efficiently.
- Improves readability by making area comparisons easier between nodes.
- Slice-and-Dice Algorithm
- Divides space alternately in horizontal and vertical directions.
- Can result in elongated rectangles, making area comparison difficult.
- Randomized Algorithm
- Not applicable here; squarified treemaps follow a deterministic process.
- Optimization Algorithm
- While it aims for better aspect ratios, it doesn’t solve an optimization problem with a cost function.
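The greedy decision can be sketched as follows: keep adding items to the current row as long as the worst aspect ratio in that row does not get worse, then fix the row and continue in the remaining space. This is a compact sketch that assumes the areas are sorted in descending order and sum to the area of the rectangle; the function names are illustrative.

```python
def worst_ratio(row, side):
    """Worst aspect ratio when `row` is laid out along a strip of length `side`."""
    total = sum(row)
    return max(max(side * side * a / (total * total),
                   total * total / (side * side * a)) for a in row)

def squarify(areas, x, y, w, h):
    """Return one (x, y, width, height) rectangle per area."""
    rects = []
    areas = list(areas)
    while areas:
        side = min(w, h)                 # rows are laid along the shorter side
        row = [areas.pop(0)]
        # Greedy step: extend the row while the worst ratio does not get worse.
        while areas and worst_ratio(row + [areas[0]], side) <= worst_ratio(row, side):
            row.append(areas.pop(0))
        thickness = sum(row) / side
        offset = 0.0
        for a in row:
            length = a / thickness
            if w >= h:                   # vertical strip at the left edge
                rects.append((x, y + offset, thickness, length))
            else:                        # horizontal strip at the top edge
                rects.append((x + offset, y, length, thickness))
            offset += length
        if w >= h:
            x, w = x + thickness, w - thickness
        else:
            y, h = y + thickness, h - thickness
    return rects

areas = [6, 6, 4, 3, 2, 2, 1]
rects = squarify(areas, 0, 0, 6, 4)
print(rects)
```

Once a row is fixed it is never revisited, which is why the layout is greedy rather than a global optimization.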
Question 5: Steps of the Sugiyama Algorithm
Question: Which one is not a step of the Sugiyama algorithm?
- The upward expansion
- The crossing reduction
- The layer assignment
- The horizontal assignment
Answer: The upward expansion
Explanation:
- Sugiyama Algorithm Steps:
- Layer Assignment
- Assign nodes to discrete layers.
- Source nodes placed on the first layer; edges point downwards.
- Crossing Reduction
- Minimize edge crossings by ordering nodes within layers.
- Uses methods like the barycentric algorithm to reorder nodes based on connections.
- Horizontal Assignment
- Assign exact horizontal positions to nodes.
- Aims to reduce edge bends and improve readability.
- Upward Expansion
- Not a recognized step.
- Possibly a made-up term for the question.
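One pass of the barycenter heuristic used in the crossing reduction step can be sketched as below: the nodes of one layer are reordered by the average position of their neighbors in an adjacent layer whose order is held fixed. The function and parameter names are assumptions of this sketch; nodes without neighbors keep their current index as sort key.

```python
def barycenter_order(free_layer, fixed_positions, neighbors):
    """Reorder `free_layer` by the mean position of each node's fixed neighbors.

    fixed_positions: {node: index} for the adjacent, already ordered layer.
    neighbors: {node: [adjacent nodes in the fixed layer]}.
    """
    def key(item):
        index, v = item
        ns = neighbors.get(v, [])
        if not ns:
            return float(index)          # no neighbors: stay roughly in place
        return sum(fixed_positions[n] for n in ns) / len(ns)

    return [v for _, v in sorted(enumerate(free_layer), key=key)]

fixed_positions = {"a": 0, "b": 1, "c": 2}
neighbors = {"x": ["c"], "y": ["a"], "z": ["a", "b"]}
order = barycenter_order(["x", "y", "z"], fixed_positions, neighbors)
print(order)
```

In practice this pass is repeated while sweeping down and up through the layers until the ordering stops improving.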