
DAVI Data Preprocessing for DataVis


Data Processing

Why talk about preprocessing

  1. Data characteristics are unknown (measurement standards not known)
    • image-20241001001754396
  2. Poor data quality
  3. Too much data
    • image-20241001001529410

Data Profiling

There are things you need to figure out before you can even work with the data.

  1. Data profiling is the process of diagnosing a new or otherwise unknown data set for its access modalities and data space characteristics.
  2. How do you get the data? How do you read it once you have it? How much data should you expect? How do you interpret it, and can you actually trust it? All of these questions are part of data profiling.

image-20241001003358882
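A first profiling pass can be sketched in plain Python: guess each column's type, record its value range, and count missing entries. This is a minimal illustration with stdlib tools only; the column names and the set of "missing" markers (`""`, `"NA"`, `"null"`) are assumptions for the example.

```python
import csv
import io

def profile(csv_text):
    """Minimal data profile: per-column type guess, value range, missing count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        missing = sum(1 for v in values if v in ("", "NA", "null"))
        numeric = []
        for v in values:
            try:
                numeric.append(float(v))
            except ValueError:
                pass  # non-numeric entry
        if numeric and len(numeric) == len(values) - missing:
            report[col] = {"type": "numeric", "min": min(numeric),
                           "max": max(numeric), "missing": missing}
        else:
            report[col] = {"type": "text",
                           "distinct": len(set(values)), "missing": missing}
    return report

data = "temp,city\n21.5,Berlin\n19.0,Paris\n,Rome\n"
print(profile(data))
```

Even a crude report like this answers the profiling questions above: what the values are, what range they lie in, and how much of the data can be trusted to be present.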

Characteristics of data space

First of all, you need to determine what the numbers you are looking at actually are.

  • Is that a time? A temperature? A weight? And are these numbers plausible — do they lie within sensible value ranges?

image-20241001011552875

Benford’s law

Distribution of numbers (Benford’s law) and of words (Zipf’s law)

image-20241001012431564

Take the first digit of every number in a set of observations. For data that has been generated by a natural process, the distribution of these leading digits follows Benford’s law.

Benford’s law does not apply to voting numbers, or in fact to any numbers that are gathered over geospatial areas.
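A Benford check can be sketched directly from the law's definition: the expected relative frequency of leading digit d is log10(1 + 1/d), so you compare that against the observed leading-digit frequencies. The powers-of-two test data is only an illustration of a Benford-conforming sequence.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected relative frequency of leading digit d under Benford's law."""
    return math.log10(1 + 1 / d)

def leading_digit_distribution(numbers):
    """Observed relative frequency of leading digits 1-9."""
    digits = [int(str(abs(n)).lstrip("0.")[0]) for n in numbers if n != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Powers of 2 are a classic Benford-conforming sequence: digit 1 leads ~30% of the time.
dist = leading_digit_distribution([2 ** k for k in range(1, 101)])
print(dist[1], benford_expected(1))
```

Large deviations between the observed and expected distributions are a profiling red flag — but, as noted above, only for data where Benford's law is expected to hold in the first place.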

MAKE USE OF KNOWN VALUE RANGES!

Sensible value ranges as simple guards against “bad” data

image-20241001014119171
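Known value ranges make for cheap but effective guards. A minimal sketch, where the field names and the ranges themselves are illustrative assumptions rather than any standard:

```python
# Plausible value ranges as simple guards against "bad" data.
# These ranges are illustrative assumptions for the example.
PLAUSIBLE_RANGES = {
    "air_temperature_celsius": (-90.0, 60.0),   # roughly Earth's recorded extremes
    "relative_humidity_percent": (0.0, 100.0),
    "human_age_years": (0, 125),
}

def flag_out_of_range(record):
    """Return the fields of a record whose values fall outside their plausible range."""
    flagged = []
    for field, value in record.items():
        if field in PLAUSIBLE_RANGES:
            lo, hi = PLAUSIBLE_RANGES[field]
            if not (lo <= value <= hi):
                flagged.append(field)
    return flagged

reading = {"air_temperature_celsius": 132.0, "relative_humidity_percent": 45.0}
print(flag_out_of_range(reading))  # the temperature reading is implausible
```

A flagged value does not tell you what the correct value is — only that this record needs a closer look before it enters a visualization.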

THE PROBLEM(S) WITH “STANDARDS”

Competing standards: there is never just one standard; each competes with others.

Different versions of the same standard: standards have different versions because they need to evolve with the world around them.

Flexibility in interpreting the standard: software and sensors get updated, and different implementers read the standard differently.

Incomplete implementations of the standard: small companies or small software projects often implement only part of it.

• Standardized data still requires validation

• Not all required information may be part of the standard

STANDARD ISSUES

Example 1

image-20241001014858465

What’s wrong with this visualization?

What’s wrong? The shape is not Greenland, and Iceland sits where there is no land at all. Why the wrong coloring? Snow cover can only exist on landmasses, and this is exactly what the software did not understand.

Example 2

image-20241001015003032

The software did not apply the rotated-pole translation to the data: it still maps the results at (0, 0), where the simulation was run, instead of mapping them onto Europe where they belong.

Example 3

image-20241001015132102

Look at the axes: the rotation is again incorrect, but you can only tell from the axis labels. Africa is not actually supposed to be lying in the background here.

The correct version

image-20241001015208305

Data Wrangling

What is data wrangling? The process of making any raw dataset useful by:

  • identifying and treating missing values,
  • handling duplicate and possibly contradicting entries,
  • fixing formatting issues,
  • addressing other data quality problems.
  • For example:
    • image-20241001114256608

MISSING DATA VALUES

image-20241001114339607

DIFFERENT FORMS OF IMPUTATION (filling in missing values)

image-20241001114930777

  1. Last observation carried forward: just use the last valid value until a new valid value comes in.
  2. Mean imputation: makes sense for continuous data, because it does not bias your descriptive statistics — the average stays the same, since the mean is exactly what you put in for the imputed values.
  3. Median imputation: fills missing entries with the middle value of a column. The median is not affected by extreme values, so it suits ordered data that may contain outliers (such as income brackets or rating scores).
  4. Mode imputation: for categorical data, impute the most frequently observed value. Say eye color is missing — if the most observed eye color was brown, impute brown.
  5. Regression imputation: if you have two quantitative variables, you can compute the regression line. Whenever one entry is missing — say a data item has only the x value but not the y value — you take the x value, go up to the regression line, and read off the y value.
    • image-20241001115745911
  6. image-20241001115831128
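The first four strategies above can be sketched with the stdlib `statistics` module; `None` marks a missing entry. A minimal illustration, not a production imputer:

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Fill None entries in a list using a simple imputation strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "locf":  # last observation carried forward
        filled, last = [], None
        for v in values:
            last = v if v is not None else last
            filled.append(last)
        return filled
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

data = [1.0, None, 3.0, 100.0]
print(impute(data, "mean"))    # the missing entry becomes the mean of observed values
print(impute(data, "median"))  # the median is robust to the outlier 100.0
```

Note how the choice of strategy matters: with the outlier 100.0 present, mean imputation fills in a value pulled far upward, while median imputation does not.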

DIFFERENT FORMS OF AMPUTATION (deleting data)

empirical (rule of thumb)

image-20241001115953253

pair-wise deletion:

image-20241001120508960

Depending on which variables an analysis touches, different rows will be dropped.
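The contrast between the two deletion schemes can be sketched over a small table of records, with `None` as the missing marker (the column names are illustrative):

```python
rows = [
    {"age": 34, "income": 52000, "score": 7},
    {"age": None, "income": 48000, "score": 5},
    {"age": 29, "income": None, "score": 9},
]

def listwise(rows):
    """Listwise deletion: drop any row with at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, cols):
    """Pairwise deletion: drop a row only if a column used by THIS analysis is missing."""
    return [r for r in rows if all(r[c] is not None for c in cols)]

print(len(listwise(rows)))                    # only 1 fully complete row survives
print(len(pairwise(rows, ["age", "score"])))  # 2 rows are usable for an age-score analysis
```

Pairwise deletion keeps more data per analysis, but different analyses then run on different subsets of rows, which can make their results hard to compare.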

WHY RULES OF THUMB?

image-20241001120955551

VIS SUPPORT: MISSINGNESS MAPS

image-20241001121418075

VIS SUPPORT: EXPLICIT ENCODING

image-20241001131340802

Rather than simply omitting missing values, explicit encoding shows them directly in the visualization — for example with a special symbol or color marking the missing data — preserving the completeness and transparency of the dataset.

DE-DUPLICATION

image-20241001131353276

This slide covers de-duplication: removing duplicate records from a dataset. Common duplication problems include identical social security numbers (SSNs) or similar names. De-duplication can be done via unique identifiers or quasi-identifiers, improving data quality.

STRING MATCHING ALGORITHMS

image-20241001131410524image-20241001131432053

These slides introduce string matching algorithms, which are used for de-duplication and entity resolution. Two common families of matching algorithms are mentioned:

  1. Fuzzy matching — e.g. the Bitap algorithm, suited to finding typos or similar text.
  2. Phonetic matching — e.g. the Metaphone algorithm, suited to matching words that sound alike.

These techniques are typically used to make sure that text fields can be correctly resolved and de-duplicated, especially for unstructured data such as names or addresses.
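As a stand-in for the Bitap algorithm named above, fuzzy matching can be illustrated with plain Levenshtein edit distance — a different but related technique that counts the insertions, deletions, and substitutions needed to turn one string into another. The duplicate threshold of 2 is an arbitrary choice for the example:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution (free on match)
        prev = cur
    return prev[-1]

def is_probable_duplicate(a, b, max_dist=2):
    """Treat two name fields as duplicates if their edit distance is small."""
    return levenshtein(a.lower(), b.lower()) <= max_dist

print(is_probable_duplicate("Jon Smith", "John Smith"))  # one insertion apart
```

Phonetic matching like Metaphone works differently: it maps words to codes representing their pronunciation, so "Smith" and "Smyth" collide even though their spelling differs.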

DATA QUALITY: UNCERTAINTY

image-20241001131852596

This slide covers uncertainty as a dimension of data quality. Uncertainty can stem from several sources:

  • Technical or environmental factors: e.g. measurement errors or bias.
  • Human factors: e.g. mistakes in manual data collection.
  • Inherent factors: e.g. stochastic processes or simulations.

From the lecture:

  1. Trustworthiness of observed data: when data is based on descriptions or observations rather than direct measurement, how can we trust it? Its accuracy and reliability are open to question.
  2. Uncertainty in simulations: some simulation processes are stochastic, meaning their results carry inherent fuzziness. When working with such results, this randomness has to be taken into account.
  3. Data wrangling: finally, cleaning, transforming, and processing such uncertain data is what makes it useful for later analysis.

Uncertainty is unavoidable in data preprocessing, but with the right handling, uncertain data can still be of real value in development and analysis.

Data Transformation

DATA REDUCTION – SAMPLING

image-20241001132115973

SAMPLING FOR DOT MAPS

image-20241001132750439

THE SAMPLING LENS

image-20241001132800890

You only see the sampled data items, not all of them, so you lose the breadth of the data.

AGGREGATING DATA

Aggregation keeps the breadth of the data.

image-20241001132930475

The whole data set is still covered, but you no longer have individual data items — you only look at aggregates, for example clusters.
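The trade-off between the two reductions can be sketched on a synthetic 2-D point set: sampling preserves individual items but loses breadth, while grid aggregation covers the full extent but loses the individual items. The grid-binning here is a simple stand-in for the clustering mentioned above.

```python
import random

points = [(i % 50, i // 50) for i in range(1000)]  # a dense 2-D point set

# Sampling: keep a random subset; individual items survive, breadth is lost.
random.seed(0)
sample = random.sample(points, 100)

# Aggregation: bin points into a coarse grid; breadth survives, items are lost.
def aggregate(points, cell=10):
    """Count points per grid cell of the given size."""
    counts = {}
    for x, y in points:
        key = (x // cell, y // cell)
        counts[key] = counts.get(key, 0) + 1
    return counts

bins = aggregate(points)
print(len(sample), len(bins), sum(bins.values()))
```

Every original point is accounted for in some bin (the counts sum to 1000), but only as part of an aggregate — exactly the breadth-versus-detail trade-off described above.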

This post is licensed under CC BY 4.0 by the author.