HCI Track-A Evaluating User Interfaces 2 (Empirical), Week 41

HCI Week 41 Lecture Notes: Empirical Evaluation

Introduction

Welcome to Week 41 of the HCI course. We’re focusing on the second of two classes on evaluation. Last time, we discussed analytical evaluation—methods you can perform without participants. Today, we’ll explore empirical evaluation, where we involve real users in our studies.

Four Main Topics:

  1. Introduction to Empirical Evaluation
  2. Think-Aloud Studies
  3. Experiments
  4. Field Studies

Recap of Analytical Evaluation

In the first half of last week’s lecture, we emphasized why evaluation is crucial—not just for scientific research but also for engineering projects and industry applications. Evaluation helps demonstrate that your design choices are sound and that the tool meets certification requirements.

Key Points from Last Lecture:

  • Analytical Evaluations are conducted without users.
  • Methods include:
    • Human Error Identification
    • Heuristic Evaluation
    • Cognitive Walkthrough
    • Keystroke Level Modeling (KLM) and GOMS

These methods rely on guidelines, principles, and theories to evaluate interfaces in a “clean room” setting—without involving actual users (referred to humorously as “dirty users”).

Important Takeaway:

  • No single evaluation method can solve everything. We need to use complementary evaluation techniques to fully understand and improve user interfaces.

Empirical Evaluation

Empirical evaluation involves experiments with real users. This approach is essential because humans are complex, and their interactions with systems can’t be fully predicted by theoretical models alone.

Why Empirical Evaluation?

  • Humans can’t be reduced to simple rules.
  • There’s no strong theoretical model for human behavior in HCI.
  • Empirical studies allow us to observe actual user interactions and gather valuable data.

Think-Aloud Studies

Introduction

Jakob Nielsen describes thinking aloud as “the single most important usability engineering method.” In these studies, participants verbalize their thoughts while using an interface, providing insights into their decision-making processes.

Key Concepts

  • Verbalizing Thoughts: Participants articulate what they’re thinking as they interact with the interface.
  • Access to Mental Processes: This method reveals users’ intentions, expectations, and misunderstandings.
  • Identifying Usability Problems: By listening to users’ thought processes, designers can pinpoint issues that may not be evident through observation alone.

Importance in Academia and Industry

  • Academic Research: Think-aloud studies help researchers understand how users interact with new technologies or prototypes.
  • Industry Practice: Companies use these studies to improve product usability and user satisfaction.

Example: Data Transfer (“The One Ring”) Study

A recent project involved a device called Data Transfer, initially named The One Ring, inspired by Tolkien. This ring allows users to control multiple screens with gestures:

  • Objective: Evaluate the usability of the ring in controlling multiple devices.
  • Method: Conducted think-aloud studies with 12 participants.
  • Challenge: Lacked a direct comparison baseline, making think-aloud insights crucial.

How to Conduct Think-Aloud Studies

  1. Provide Instructions to participants:
    • Explain the task neutrally without influencing the participant.
    • Encourage the participant to verbalize their thoughts.
  2. Participants verbalize thoughts during interaction:
    • Observe without interfering.
    • If the participant falls silent, gently prompt them:
      • “Please keep talking.”
      • “What are you thinking right now?”
  3. Capture Data:
    • Record sessions (audio/video).
    • Transcribe for analysis.
  4. Analyze data for insights:
    • Look for patterns in users’ thought processes.
    • Identify common obstacles or misunderstandings (see the coding sketch after this list).
  • Mechanism: Record verbalizations; prompt the participant if needed (“keep talking”); derive findings from the transcript.
  • Not an interview! Focus on concrete behavior and interactions.
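
As a small illustration of the analysis step, here is a minimal sketch of tallying coded usability problems from transcript annotations. The segments, quotes, and codes are hypothetical examples, not data from an actual study.

```python
from collections import Counter

# Hypothetical annotated transcript segments: each verbalization has
# been tagged with a usability-problem code during analysis.
segments = [
    {"time": "00:42", "quote": "I don't know what this icon means", "code": "unclear-icon"},
    {"time": "01:15", "quote": "Where is the About page?",          "code": "navigation"},
    {"time": "02:03", "quote": "Oh, I thought that was a button",   "code": "affordance"},
    {"time": "03:27", "quote": "Still can't find the About page",   "code": "navigation"},
]

# Tally how often each problem code occurs to surface common obstacles.
problem_counts = Counter(seg["code"] for seg in segments)
for code, count in problem_counts.most_common():
    print(f"{code}: {count}")
```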

Levels of Verbalization (Ericsson and Simon)

  1. Level 1: Direct reporting of thoughts
    • Participants verbalize thoughts in real-time.
    • Focus on immediate actions and decisions.
    • Example Prompt: “What are you thinking right now as you are doing this task?”
  2. Level 2: Description of mental imagery
    • Participants describe mental models or images.
    • Provides deeper insight into user expectations.
    • Example Prompt: “Put what you are seeing and picturing into words.”
  3. Level 3: Explanations and filtering (less valid for understanding thought)
    • Participants reflect after task completion.
    • Less intrusive but may miss immediate reactions.
    • Example Prompt: “Retrospectively, explain what you just did and why.”

Example Video Summary

A participant navigates a website to understand an organization’s purpose:

  • Observations:
    • Unclear initial understanding of the site’s purpose.
    • Confusion about navigation options.
    • Found relevant information only after exploring the “About Us” section.
  • Insights Gained:
    • Navigation labels were not intuitive.
    • Important information was buried, affecting user experience.

Strengths of Think-Aloud Studies

  • Cost-Effective: Requires minimal resources.
  • Rich Data: Provides detailed insights into user thought processes.
  • Actionable Results: Identifies specific areas for improvement.

Weaknesses of Think-Aloud Studies

  • Observer Effect: Participants may alter behavior due to being observed.
  • Artificial Setting: Verbalizing thoughts isn’t a natural behavior, potentially affecting authenticity.
  • Cultural Differences: Some participants may be uncomfortable sharing thoughts openly.
  • Analysis Challenges: Data can be subjective, making consistent analysis difficult.

Best Practices

  • Warm-Up Sessions:
    • Begin with simple tasks to acclimate participants to verbalizing thoughts.
  • Neutral Prompts:
    • Avoid leading questions.
    • Use open-ended prompts to encourage detailed responses.
  • Minimize Interruptions:
    • Allow participants to speak freely.
    • Only interject when necessary to keep them talking.
  • Modern Variations:
    • Collaborative Think-Aloud: Participants work in pairs, discussing tasks naturally.
    • Remote Think-Aloud: Conduct sessions via video call to reduce observer impact.

Experiments

Introduction

  • Definition:
    • Experiments in HCI are structured studies that measure the impact of different variables on user performance and experience; put differently, “a scientifically rigorous method for measuring the performance of a user interface.”
  • Basic idea:
    • Vary one condition of a situation and observe the outcome.
  • Example:
    • Comparing search times with a new versus an old search feature.
  • The experimental design governs the independent variables (the intervention) and the dependent variables (the measured performance).

Key Concepts

  • Independent Variables (IVs): Factors you manipulate (e.g., interface type).
  • Dependent Variables (DVs): Outcomes you measure (e.g., task completion time).
  • Hypotheses: Predictions about the relationship between IVs and DVs.

Designing an Experiment

Example Scenario: Evaluating a new email search interface.

  • IV: Type of search interface (new vs. old).
  • DVs:
    • Completion Time: Time taken to find specific emails.
    • Accuracy: Correctness of search results.
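
To make the IV/DV distinction concrete, here is a minimal sketch of how trial data from this scenario might be recorded and summarized with pandas; the column names and values are illustrative assumptions.

```python
import pandas as pd

# One row per search task: "interface" is the independent variable;
# "completion_time_s" and "correct" are the dependent variables.
trials = pd.DataFrame([
    {"participant": "P01", "interface": "old", "completion_time_s": 41.2, "correct": True},
    {"participant": "P01", "interface": "new", "completion_time_s": 28.7, "correct": True},
    {"participant": "P02", "interface": "old", "completion_time_s": 55.9, "correct": False},
    {"participant": "P02", "interface": "new", "completion_time_s": 31.4, "correct": True},
])

# Mean completion time and accuracy for each level of the IV.
print(trials.groupby("interface")[["completion_time_s", "correct"]].mean())
```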

Experiment Fundamentals

  • Main mechanism:
    • Independent variables drive the dependent ones.
  • Nuisance factors:
    • Humans are different.
    • Participants know different things.
    • Participants have unequal skill.
    • Experimental design matters.
  • Handling nuisance factors:
    • Control them in the design.
    • Hold them constant.
    • Use random assignment.
  • Ensuring the experiment is valid:
    • Tasks are representative.
    • Hypotheses are meaningful.

Experimental Design Steps

  1. Define Research Questions and Hypotheses:
    • Research Question: Does the new search interface improve search efficiency?
    • Hypothesis: Users will complete searches faster and more accurately with the new interface.
  2. Control Nuisance Factors:
    • Random Assignment: Participants are randomly assigned to conditions.
    • Counterbalancing: Vary the order of conditions to control for learning effects (see the Latin square sketch after this list).
  3. Select Participants:
    • Ensure they represent the target user population.
    • Decide on sample size (use power analysis for statistical validity).
  4. Ethical Considerations:
    • Obtain informed consent.
    • Ensure participant confidentiality.
    • Debrief participants after the study.
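
To illustrate counterbalancing, here is a minimal sketch of a balanced Latin square generator in Python; the condition names are invented for the example, and this is one common construction rather than the only one.

```python
# Each row is the presentation order for one participant: every condition
# appears in each position equally often, and each condition precedes
# every other condition equally often.
def balanced_latin_square(conditions):
    n = len(conditions)
    orders = []
    for p in range(n):
        j, h = 0, 0
        row = []
        for i in range(n):
            # Build the base sequence 0, 1, n-1, 2, n-2, ... then offset by p.
            if i < 2 or i % 2 == 1:
                val = j
                j += 1
            else:
                val = n - h - 1
                h += 1
            row.append(conditions[(val + p) % n])
        orders.append(row)
    if n % 2 == 1:
        # Odd n: add the mirror-image orders to balance carryover effects.
        orders += [list(reversed(r)) for r in orders]
    return orders

for order in balanced_latin_square(["menu UI", "toolbar UI", "voice UI", "gesture UI"]):
    print(order)
```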

Avoiding the Egocentric Fallacy

  • Recognize personal biases.
  • Avoid assuming others share your knowledge or experiences.
  • Design studies that account for diverse user perspectives.

Research questions & Research hypotheses

A research question is a statement of a clear knowledge gap about a phenomenon, pointing out the part of it we do not yet fully understand.
Types of research questions:

  • Empirical: Study observable phenomena; answers are sought through observation and data collection.
    • Example: “How does users’ task-completion efficiency change in a particular environment?”
  • Constructive: Create new systems or methods; propose concrete solutions.
    • Example: “Can we design a new type of user interface that improves search efficiency by 50%?”
  • Conceptual: Theoretical work that develops or refines existing theoretical frameworks, or proposes new ones.
    • Example: “Does user decision-making behavior fit a particular psychological theory?”

Avoid “so what?” and “we already knew that” questions.

A hypothesis is a statement about the relationship between the independent variables and the dependent variables.

Types of Experiments

  • Confirmatory Experiments:
    • Test specific, predefined hypotheses.
    • Require rigorous controls and statistical analysis.
  • Exploratory Experiments:
    • Aim to discover new insights.
    • Generate hypotheses for future testing.

Independent variables

Definition: Factors manipulated by the experimenter to observe their effect.

  • Examples: User interface type, user expertise, form of instruction, type of feedback.

Best Practices for Using Independent Variables in HCI Experiments:

  • Eliminating Confounds:
    • Control non-essential aspects: Ensure that all other variables unrelated to the hypothesis (e.g., screen size, task difficulty, lighting) remain consistent across experimental conditions.
      • Example: If you’re testing two interface types, make sure both are presented on the same hardware under the same environmental conditions.
    • Use comparable hardware, training, and success criteria: Participants should have equal access to equipment and receive similar instructions or training to eliminate bias.
      • Example: In a study comparing novice and expert users, both groups should use the same device and receive the same baseline instructions.
  • Selecting Meaningful Baselines:
    • Use “strong baselines”: Compare your test condition against state-of-the-art alternatives or the best available approach, rather than an irrelevant or unrealistic option.
      • Example: Avoid comparing your design against an outdated or non-functional keyboard, as it would not provide a meaningful benchmark.
    • A “straw man” comparison involves setting up a weak or outdated alternative to make your experimental condition look better. This can undermine the credibility of the experiment.

Participants in HCI experiments

  • Importance of representativeness
    • Avoid convenience sampling when possible. (Convenience sampling means recruiting participants from whoever is easiest to reach, such as colleagues, friends, or students on campus.)
    • Consider the characteristics of the target user population.
  • Sample size considerations
    • Typical HCI studies use around 12–20 participants.
    • Power analysis: a method that helps determine an appropriate sample size.
  • Power analysis example:
    • To detect medium-sized effects with 80% probability,
    • you need 64 participants per condition in a between-subjects design (calculated with a power formula; see the sketch below).
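
As a sanity check on those numbers, here is a minimal power-analysis sketch using statsmodels (the tool choice is an assumption; the lecture only states the result). With a medium effect size of Cohen’s d = 0.5, α = 0.05, and 80% power, it reproduces the figure of 64 participants per group.

```python
import math
from statsmodels.stats.power import TTestIndPower

# Participants needed per group for an independent-samples t-test:
# medium effect (Cohen's d = 0.5), alpha = 0.05, power = 0.80.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(math.ceil(n_per_group))  # 63.77 -> 64 participants per condition
```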

Research ethics in HCI experiments

  • Key principles:
    • Respect participants: Value their time and opinions
    • Ensure safety: Avoid physical, mental, and emotional harm
    • Obtain informed consent: Use clear, comprehensive consent forms
    • Provide adequate compensation: Fair but not coercive
    • Debrief participants: Explain the study’s purpose and answer questions

Guidelines: Helsinki Declaration, APA Ethical Principles, ACM Code of Ethics

Experimental Designs

  • Within-Participants Design:
    • Features: Each participant experiences all conditions.
    • Pros:
      • Controls for individual differences (e.g., age, intelligence, or mood).
      • Fewer participants are needed to detect a significant effect.
    • Cons: Risk of learning effects (participants grow familiar with the procedure) or fatigue effects.
    • Counterbalancing needed: To minimize learning or carryover effects, researchers must randomize or counterbalance the order of conditions. This adds complexity to the experimental setup.
  • Between-Participants Design (also known as independent-groups design):
    • Features: Participants are divided into groups, and each group experiences only one level of the independent variable. Comparisons are then made between groups rather than within individuals.
    • Pros: Because each participant experiences only one condition, there’s no risk of learning, fatigue, or contamination from previous conditions.
    • Cons:
      • Requires more participants.
      • Individual differences between groups can still confound the results.
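
A minimal sketch of random assignment for a between-participants design, assuming twelve hypothetical participants and the two interface conditions from the earlier example:

```python
import random

participants = [f"P{i:02d}" for i in range(1, 13)]  # hypothetical IDs
conditions = ["old UI", "new UI"]

# Shuffle, then deal participants round-robin into equal-sized groups.
random.shuffle(participants)
groups = {cond: participants[i::len(conditions)] for i, cond in enumerate(conditions)}
print(groups)
```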

Dependent Variables

What is being measured

  • Definition: Measures reflecting the influence of independent variables
  • Conceptualization: Making clear the meaning of the concepts in the research question.
  • Operationalization: Turning abstract concepts into concrete, measurable variables.
  • Common measures:
    • Task completion time
    • Accuracy or error rates
    • Questionnaire responses

Using multiple measures increases reliability and validity.

Experimental situation

Task selection & experimental setting

  • Task selection approaches:
    • Representative tasks: Based on real-world user activities
    • Essential tasks: Capture core aspects of what’s being investigated
  • Lab vs. field experiments:
    • Lab: Controlled setting, minimized external influences
    • Field: Real-world setting, experimental manipulations in situ

Hypothesis testing

Using inferential statistics

Hypothesis testing assesses whether experimental findings are due to random chance or a real effect. You start with a null hypothesis (no effect), calculate a p-value, and compare it to the chosen significance level (α) to determine if the results are meaningful. Smaller p-values mean stronger evidence to reject the null hypothesis.

p-value:

  • Definition: The probability of obtaining the observed results (or more extreme results) if the null hypothesis (H₀) is true.
  • Interpretation:
    • Small p-value (e.g., < 0.05): Strong evidence against H₀ → reject H₀.
    • Large p-value (e.g., > 0.05): Insufficient evidence to reject H₀ → fail to reject H₀.

Significance Levels (α):

  • 0.05 (5%): Accept a 5% risk of rejecting H₀ when it is in fact true (a Type I error).
  • 0.01 (1%): Stricter threshold for stronger evidence.
  • 0.001 (0.1%): Extremely strong evidence required to reject H₀.
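
To make this concrete, here is a minimal two-sample t-test sketch using scipy; the completion times are invented illustration data for the email-search example, not results from the lecture.

```python
from scipy import stats

# Hypothetical completion times (seconds) for the two interfaces.
old_ui = [41.2, 55.9, 47.3, 60.1, 38.8, 52.4, 49.0, 44.6]
new_ui = [28.7, 31.4, 35.2, 29.9, 40.1, 27.5, 33.8, 30.6]

# H0: both interfaces have the same mean completion time.
t_stat, p_value = stats.ttest_ind(old_ui, new_ui)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < alpha (e.g., 0.05), reject H0.
```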

Explaining the results

Quantitative results alone are insufficient. Explanations help us understand:

  • Distributions of dependent variables
  • Mechanisms linking independent and dependent variables

Sources of explanations:

  • Qualitative data: Verbal protocols, video recordings, interviews
  • Theories: e.g., cognitive theories, communication theories
    • Example: Using decision-making theories to explain user behavior in intelligent text entry systems (Cockburn et al.)

Data Analysis in HCI Experiments

Descriptive Statistics

  • Focus: Summarize and describe the data.
  • Purpose: To explore and visually represent relationships within the dataset.
  • Methods:
    • Summary statistics: Simple numerical values like:
      • Mean: The average of all data points.
      • Median: The middle value in a sorted dataset.
      • Variance: Measures how much the data varies from the mean.
    • Visualization:
      • Histograms: Show data distribution.
      • Scatter plots: Visualize relationships between two variables.
      • Line plots: Track changes or trends over time.
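
A minimal sketch of these summary statistics with numpy, using invented completion-time data:

```python
import numpy as np

times = np.array([28.7, 31.4, 35.2, 29.9, 40.1, 27.5, 33.8, 30.6])

print("mean:    ", np.mean(times))         # average of all data points
print("median:  ", np.median(times))       # middle value of the sorted data
print("variance:", np.var(times, ddof=1))  # sample variance around the mean
```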

Inferential Statistics

  • Focus: Generalize findings from a sample to the population.
  • Purpose: To draw conclusions and test hypotheses about how variables interact.
  • Methods:
    • Confidence Intervals: Estimate the range within which the true population parameter lies with a certain level of confidence (e.g., 95%).
    • Hypothesis Testing: Assess whether observed patterns in the data are statistically significant or due to chance (e.g., p-values, t-tests).
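
For the confidence-interval method, a minimal sketch with scipy (an assumed tool choice, reusing the invented data from above; 95% is the conventional level):

```python
import numpy as np
from scipy import stats

times = np.array([28.7, 31.4, 35.2, 29.9, 40.1, 27.5, 33.8, 30.6])

# 95% confidence interval for the mean, based on the t-distribution.
ci_low, ci_high = stats.t.interval(
    0.95, df=len(times) - 1, loc=np.mean(times), scale=stats.sem(times)
)
print(f"95% CI for the mean: [{ci_low:.1f}, {ci_high:.1f}] seconds")
```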

Explaining Results

  • Link Findings to Theories:
    • Use existing frameworks to interpret data.
  • Qualitative Data:
    • Incorporate user feedback and observations.
  • Practical Implications:
    • Discuss how results can inform design improvements.

Field Studies

Using the world as the laboratory

Introduction

Field studies evaluate systems in real-world settings, providing insights into how they function in natural environments.

Types of Field Studies

  1. Field evaluations of prototypes
  2. Pilot Studies:
    • Deploy a prototype or unfinished system.
    • Gather preliminary data to refine the system.
  3. Deployment Studies:
    • Release a final or near-final system to users.
    • Collect data over an extended period.
| Dimension | Pilot Studies | Deployment Studies |
|---|---|---|
| System state | Unfinished system, still being tested | Finished, fully deployed system |
| Setting | Real environment (possibly within a limited test scope) | Full deployment in a real environment |
| Goal | Validate the feasibility of the design and engineering; find problems and iterate | Ensure the system achieves its intended effect, reduce support costs, and inform future versions |
| Time frame | Shorter (weeks to months) | Longer (weeks to years) |
| Data sources | Technology probes, user interviews, preliminary log analysis | User feedback, log analysis, long-term observation |

Advantages

Key aspect: Real contexts!

  • Ecological Validity: Capture collaborative, communicative, and material practices
  • Real Users: Understand how systems work with users’ real tasks and motivations.
  • Contextual Insights: Reveals social and organizational impacts.

Challenges

  • Control Loss: Harder to manage variables and external factors.
  • Participant Recruitment: Finding willing participants in the target demographic.
  • Data Collection: May require unobtrusive methods to avoid influencing behavior.

Example: AAC Application Deployment

  • Context: Application for non-speaking individuals with motor disabilities.
  • Method:
    • Deployed the app through an online marketplace.
    • Recruited users from the target demographic.
    • Collected usage data and user feedback.
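
As an illustration of what collecting usage data in a deployment can look like, here is a minimal, hypothetical event-logging sketch; the event names and file format are assumptions, not details of the actual AAC study.

```python
import json
import time

LOG_PATH = "usage_log.jsonl"  # hypothetical log file: one JSON event per line

def log_event(user_id: str, event: str, **details):
    """Append a timestamped usage event to the local log."""
    record = {"ts": time.time(), "user": user_id, "event": event, **details}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical events an AAC app might record during a deployment study.
log_event("u123", "app_opened")
log_event("u123", "phrase_selected", phrase_id=42, selection_time_s=3.8)
log_event("u123", "phrase_spoken", phrase_id=42)
```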

Lab vs. Field Studies Comparison

Lab Studies:

  • Pros:
    • High control over variables.
    • Easier to replicate.
    • Efficient data collection.
  • Cons:
    • May lack real-world relevance.
    • Participants may behave differently in artificial settings.

Field Studies:

  • Pros:
    • High ecological validity.
    • Observes genuine user interactions.
    • Captures environmental influences.
  • Cons:
    • Less control over external factors.
    • Data can be messy and harder to analyze.
    • Potential ethical concerns with unobtrusive observation.

Conclusion

Key Takeaways:

  • Complementary Methods: No single evaluation method is sufficient; use a combination of analytical and empirical techniques.
  • Think-Aloud Studies: Offer deep insights but require careful execution to minimize observer effects.
  • Experiments: Provide rigorous testing of hypotheses but need careful design to ensure validity.
  • Field Studies: Deliver real-world insights but come with logistical and ethical challenges.
  • Ethics Matter: Always prioritize participants’ well-being and obtain informed consent.


By understanding and applying these evaluation methods, we can design more effective, user-centered interfaces that meet the needs and expectations of our users.

This post is licensed under CC BY 4.0 by the author.