
DAVI Data Preprocessing for DataVis


Data Processing

Why talk about preprocessing

  1. Data characteristics are unknown (measurement standards not known)
    • image-20241001001754396
  2. Poor data quality
  3. Too much data
    • image-20241001001529410

Data Profiling

There are things you need to figure out before you can even work with the data.

  1. Data profiling is the process of diagnosing a new or otherwise unknown data set for its access modalities and data space characteristics.
  2. How do you get the data? How do you read it once you have it? How much data should you expect? How do you interpret it, and can you actually trust it? All of these questions are part of data profiling.

image-20241001003358882
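A first profiling pass can be sketched in plain Python: guess each column's type, record its value range, and count missing entries. This is a minimal illustration with stdlib tools only; the column names and the set of "missing" markers (`""`, `"NA"`, `"null"`) are assumptions for the example.

```python
import csv
import io

def profile(csv_text):
    """Minimal data profile: per-column type guess, value range, missing count."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        missing = sum(1 for v in values if v in ("", "NA", "null"))
        numeric = []
        for v in values:
            try:
                numeric.append(float(v))
            except ValueError:
                pass  # non-numeric entry
        if numeric and len(numeric) == len(values) - missing:
            report[col] = {"type": "numeric", "min": min(numeric),
                           "max": max(numeric), "missing": missing}
        else:
            report[col] = {"type": "text",
                           "distinct": len(set(values)), "missing": missing}
    return report

data = "temp,city\n21.5,Berlin\n19.0,Paris\n,Rome\n"
print(profile(data))
```

Even a crude report like this answers the profiling questions above: what the values are, what range they lie in, and how much of the data can be trusted to be present.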

Characteristics of data space

First of all, you need to determine what the numbers you are looking at actually are.

  • Is that a time? A temperature? A weight? And are these numbers plausible — do they lie within sensible value ranges?

image-20241001011552875

Benford’s law

Distribution of numbers (Benford’s law) and of words (Zipf’s law)

image-20241001012431564

Take the first digit of every number in a set of observations. For data that has been generated by a natural process, the distribution of these leading digits follows Benford’s law.

Benford’s law does not apply to voting numbers, or in fact to any numbers that are gathered over geospatial areas.
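A Benford check can be sketched directly from the law's definition: the expected relative frequency of leading digit d is log10(1 + 1/d), so you compare that against the observed leading-digit frequencies. The powers-of-two test data is only an illustration of a Benford-conforming sequence.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected relative frequency of leading digit d under Benford's law."""
    return math.log10(1 + 1 / d)

def leading_digit_distribution(numbers):
    """Observed relative frequency of leading digits 1-9."""
    digits = [int(str(abs(n)).lstrip("0.")[0]) for n in numbers if n != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Powers of 2 are a classic Benford-conforming sequence: digit 1 leads ~30% of the time.
dist = leading_digit_distribution([2 ** k for k in range(1, 101)])
print(dist[1], benford_expected(1))
```

Large deviations between the observed and expected distributions are a profiling red flag — but, as noted above, only for data where Benford's law is expected to hold in the first place.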

MAKE USE OF KNOWN VALUE RANGES!

Sensible value ranges as simple guards against “bad” data

image-20241001014119171
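Known value ranges make for cheap but effective guards. A minimal sketch, where the field names and the ranges themselves are illustrative assumptions rather than any standard:

```python
# Plausible value ranges as simple guards against "bad" data.
# These ranges are illustrative assumptions for the example.
PLAUSIBLE_RANGES = {
    "air_temperature_celsius": (-90.0, 60.0),   # roughly Earth's recorded extremes
    "relative_humidity_percent": (0.0, 100.0),
    "human_age_years": (0, 125),
}

def flag_out_of_range(record):
    """Return the fields of a record whose values fall outside their plausible range."""
    flagged = []
    for field, value in record.items():
        if field in PLAUSIBLE_RANGES:
            lo, hi = PLAUSIBLE_RANGES[field]
            if not (lo <= value <= hi):
                flagged.append(field)
    return flagged

reading = {"air_temperature_celsius": 132.0, "relative_humidity_percent": 45.0}
print(flag_out_of_range(reading))  # the temperature reading is implausible
```

A flagged value does not tell you what the correct value is — only that this record needs a closer look before it enters a visualization.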

THE PROBLEM(S) WITH “STANDARDS”

Competing standards: there is never just one standard; each competes with others.

Different versions of the same standard: standards have different versions because they need to evolve with the world around them.

Flexibility in interpreting the standard: software and sensors get updated, and different implementers read the standard differently.

Incomplete implementations of the standard: small companies or small software projects often implement only part of it.

• Standardized data still requires validation

• Not all required information may be part of the standard

STANDARD ISSUES

Example 1

image-20241001014858465

What’s wrong with this visualization?

What’s wrong? The shape is not Greenland, and Iceland sits where there is no land at all. Why the wrong coloring? Snow cover can only exist on landmasses, and this is exactly what the software did not understand.

Example 2

image-20241001015003032

The software did not apply the rotated-pole translation to the data: it still maps the results at (0, 0), where the simulation was run, instead of mapping them onto Europe where they belong.

Example 3

image-20241001015132102

Look at the axes: the rotation is again incorrect, but you can only tell from the axis labels. Africa is not actually supposed to be lying in the background here.

The correct version

image-20241001015208305

Data Wrangling

What is data wrangling? The process of making any raw dataset useful by:

  • identifying and treating missing values,
  • handling duplicate and possibly contradicting entries,
  • fixing formatting issues,
  • addressing other data quality problems.
  • For example:
    • image-20241001114256608

MISSING DATA VALUES

image-20241001114339607

DIFFERENT FORMS OF IMPUTATION (filling in missing values)

image-20241001114930777

  1. Last observation carried forward: just use the last valid value until a new valid value comes in.
  2. Mean imputation: makes sense for continuous data, because it does not bias your descriptive statistics — the average stays the same, since the mean is exactly what you put in for the imputed values.
  3. Median imputation: fills missing entries with the middle value of a column. The median is not affected by extreme values, so it suits ordered data that may contain outliers (such as income brackets or rating scores).
  4. Mode imputation: for categorical data, impute the most frequently observed value. Say eye color is missing — if the most observed eye color was brown, impute brown.
  5. Regression imputation: if you have two quantitative variables, you can compute the regression line. Whenever one entry is missing — say a data item has only the x value but not the y value — you take the x value, go up to the regression line, and read off the y value.
    • image-20241001115745911
  6. image-20241001115831128
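The first four strategies above can be sketched with the stdlib `statistics` module; `None` marks a missing entry. A minimal illustration, not a production imputer:

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Fill None entries in a list using a simple imputation strategy."""
    observed = [v for v in values if v is not None]
    if strategy == "locf":  # last observation carried forward
        filled, last = [], None
        for v in values:
            last = v if v is not None else last
            filled.append(last)
        return filled
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

data = [1.0, None, 3.0, 100.0]
print(impute(data, "mean"))    # the missing entry becomes the mean of observed values
print(impute(data, "median"))  # the median is robust to the outlier 100.0
```

Note how the choice of strategy matters: with the outlier 100.0 present, mean imputation fills in a value pulled far upward, while median imputation does not.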

DIFFERENT FORMS OF AMPUTATION (deleting data)

empirical (rule of thumb)

image-20241001115953253

pair-wise deletion:

image-20241001120508960

Depending on which variables an analysis touches, different rows will be dropped.
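The contrast between the two deletion schemes can be sketched over a small table of records, with `None` as the missing marker (the column names are illustrative):

```python
rows = [
    {"age": 34, "income": 52000, "score": 7},
    {"age": None, "income": 48000, "score": 5},
    {"age": 29, "income": None, "score": 9},
]

def listwise(rows):
    """Listwise deletion: drop any row with at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, cols):
    """Pairwise deletion: drop a row only if a column used by THIS analysis is missing."""
    return [r for r in rows if all(r[c] is not None for c in cols)]

print(len(listwise(rows)))                    # only 1 fully complete row survives
print(len(pairwise(rows, ["age", "score"])))  # 2 rows are usable for an age-score analysis
```

Pairwise deletion keeps more data per analysis, but different analyses then run on different subsets of rows, which can make their results hard to compare.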

WHY RULES OF THUMB?

image-20241001120955551

VIS SUPPORT: MISSINGNESS MAPS

image-20241001121418075

VIS SUPPORT: EXPLICIT ENCODING

image-20241001131340802

Rather than simply omitting missing values, explicit encoding shows them directly in the visualization — for example with a special symbol or color marking the missing data — preserving the completeness and transparency of the dataset.

DE-DUPLICATION

image-20241001131353276

This slide covers de-duplication: removing duplicate records from a dataset. Common duplication problems include identical social security numbers (SSNs) or similar names. De-duplication can be done via unique identifiers or quasi-identifiers, improving data quality.

STRING MATCHING ALGORITHMS

image-20241001131410524image-20241001131432053

These slides introduce string matching algorithms, which are used for de-duplication and entity resolution. Two common families of matching algorithms are mentioned:

  1. Fuzzy matching — e.g. the Bitap algorithm, suited to finding typos or similar text.
  2. Phonetic matching — e.g. the Metaphone algorithm, suited to matching words that sound alike.

These techniques are typically used to make sure that text fields can be correctly resolved and de-duplicated, especially for unstructured data such as names or addresses.
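As a stand-in for the Bitap algorithm named above, fuzzy matching can be illustrated with plain Levenshtein edit distance — a different but related technique that counts the insertions, deletions, and substitutions needed to turn one string into another. The duplicate threshold of 2 is an arbitrary choice for the example:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution (free on match)
        prev = cur
    return prev[-1]

def is_probable_duplicate(a, b, max_dist=2):
    """Treat two name fields as duplicates if their edit distance is small."""
    return levenshtein(a.lower(), b.lower()) <= max_dist

print(is_probable_duplicate("Jon Smith", "John Smith"))  # one insertion apart
```

Phonetic matching like Metaphone works differently: it maps words to codes representing their pronunciation, so "Smith" and "Smyth" collide even though their spelling differs.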

DATA QUALITY: UNCERTAINTY

image-20241001131852596

This slide covers uncertainty as a dimension of data quality. Uncertainty can stem from several sources:

  • Technical or environmental factors: e.g. measurement errors or bias.
  • Human factors: e.g. mistakes in manual data collection.
  • Inherent factors: e.g. stochastic processes or simulations.

From the lecture:

  1. Trustworthiness of observed data: when data is based on descriptions or observations rather than direct measurement, how can we trust it? Its accuracy and reliability are open to question.
  2. Uncertainty in simulations: some simulation processes are stochastic, meaning their results carry inherent fuzziness. When working with such results, this randomness has to be taken into account.
  3. Data wrangling: finally, cleaning, transforming, and processing such uncertain data is what makes it useful for later analysis.

Uncertainty is unavoidable in data preprocessing, but with the right handling, uncertain data can still be of real value in development and analysis.

Data Transformation

DATA REDUCTION – SAMPLING

image-20241001132115973

SAMPLING FOR DOT MAPS

image-20241001132750439

THE SAMPLING LENS

image-20241001132800890

You only see the sampled data items, not all of them, so you lose the breadth of the data.

AGGREGATING DATA

Aggregation keeps the breadth of the data.

image-20241001132930475

The whole data set is still covered, but you no longer have individual data items — you only look at aggregates, for example clusters.
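The trade-off between the two reductions can be sketched on a synthetic 2-D point set: sampling preserves individual items but loses breadth, while grid aggregation covers the full extent but loses the individual items. The grid-binning here is a simple stand-in for the clustering mentioned above.

```python
import random

points = [(i % 50, i // 50) for i in range(1000)]  # a dense 2-D point set

# Sampling: keep a random subset; individual items survive, breadth is lost.
random.seed(0)
sample = random.sample(points, 100)

# Aggregation: bin points into a coarse grid; breadth survives, items are lost.
def aggregate(points, cell=10):
    """Count points per grid cell of the given size."""
    counts = {}
    for x, y in points:
        key = (x // cell, y // cell)
        counts[key] = counts.get(key, 0) + 1
    return counts

bins = aggregate(points)
print(len(sample), len(bins), sum(bins.values()))
```

Every original point is accounted for in some bin (the counts sum to 1000), but only as part of an aggregate — exactly the breadth-versus-detail trade-off described above.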

This post is licensed under CC BY 4.0 by the author.