February 16, 2014

TSK - Chapter 2 - Data

Data is a collection of objects and their attributes. An attribute is a property or characteristic of an object. A collection of attributes describes an object. 
  • 2. Data
    • 2.1. Types of data
      • 2.1.1. attributes & measurement
        • operations used to describe attributes - distinctness, order, addition, multiplication
        • 4 types of attributes - nominal, ordinal, interval, ratio
        • Nominal and Ordinal are Categorical (Qualitative) 
        • Interval and Ratio are Numeric (Quantitative)
          • Nominal (ids, zip codes) -> distinctness
          • Ordinal (rankings, grades) -> distinctness + order
          • Interval (dates, temp in C) -> distinctness + order + addition
          • Ratio (temp in K, length, time) -> distinctness + order + addition + multiplication
        •  attribute categorization by values -> discrete, continuous 
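To make the operation hierarchy concrete, here is a tiny Python sketch (example values are made up, not from the book) of why differences are meaningful on an interval scale like Celsius, while ratios are only meaningful on a ratio scale like Kelvin:

```python
# Permissible operations per attribute type (hypothetical temperature values).
c1, c2 = 10.0, 20.0                 # interval scale (Celsius): zero point is arbitrary
k1, k2 = c1 + 273.15, c2 + 273.15   # ratio scale (Kelvin): true zero

diff_c = c2 - c1    # 10.0 -- differences are meaningful on an interval scale
ratio_c = c2 / c1   # 2.0  -- but "twice as hot" is NOT meaningful in Celsius
ratio_k = k2 / k1   # ~1.035 -- ratios ARE meaningful in Kelvin
```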
      • 2.1.2. Types of data sets
        • Characteristics of data sets -> dimensionality, sparsity, resolution 
        • Types -> record, graph, ordered
          • Record data : a. record data, b. transaction data, c. data matrix, d. document-term matrix
          • Graph based data : a. data with relationship among objects (www), b. data with objects that are graphs (molecular structure) 
          • Ordered data : a. sequential data (timestamps), b. sequence data (genetic code), c. time series data, d. spatial data
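A document-term matrix (one of the record-data types above) is easy to build by hand; a rough sketch with a made-up three-document corpus:

```python
from collections import Counter

# Hypothetical toy corpus; each row of the matrix counts term occurrences.
docs = ["data mining mining", "graph data", "time series data"]
vocab = sorted({t for d in docs for t in d.split()})
dtm = [[Counter(d.split())[t] for t in vocab] for d in docs]
# vocab: ['data', 'graph', 'mining', 'series', 'time']
```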
    • 2.2. Data Quality
      • 2.2.1. Measurement and data quality issues
        • Noise & artifacts -> precision (std deviation), bias (mean-actual), accuracy (closeness to true value) 
        • Outliers
        • Missing values -> how to solve : eliminate data objects, eliminate missing value attributes, ignore missing value during analysis
        • Inconsistent values
        • Duplicate data
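The three ways of handling missing values listed above can be sketched in a few lines of Python (toy table, `None` marks a missing value):

```python
rows = [[1.0, 2.0], [None, 4.0], [5.0, 6.0]]

# a. eliminate data objects (drop rows containing a missing value)
drop_rows = [r for r in rows if None not in r]

# b. eliminate attributes (drop columns containing a missing value)
keep_cols = [j for j in range(len(rows[0])) if all(r[j] is not None for r in rows)]
drop_cols = [[r[j] for j in keep_cols] for r in rows]

# c. ignore missing values during analysis (e.g. a column mean over present values)
col0 = [r[0] for r in rows if r[0] is not None]
mean_col0 = sum(col0) / len(col0)   # (1.0 + 5.0) / 2 = 3.0
```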
      • 2.2.2. Issues related to applications
        • timeliness, relevance, knowledge about the data
    • 2.3. Data Preprocessing
      • 2.3.1. Aggregation 
        • Purpose : data reduction (a smaller data set needs less memory), change of scale, more 'stable' data
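As a rough illustration of aggregation (records are made up), rolling daily transactions up to monthly totals both reduces the data and changes its scale:

```python
from collections import defaultdict

# Hypothetical (month, sales) records, aggregated to per-month totals.
records = [("Jan", 10), ("Jan", 15), ("Feb", 7), ("Feb", 3)]
totals = defaultdict(int)
for month, sales in records:
    totals[month] += sales
# 4 records reduced to 2: {'Jan': 25, 'Feb': 10}
```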
      • 2.3.2. Sampling
        • Processing all records is very expensive, but a sample must be representative of the full data set
        • Sampling approaches : 
          • Simple random sampling
          • Sampling without replacement
          • Sampling with replacement
          • Stratified sampling
          • Progressive sampling
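Three of the sampling approaches above, sketched with Python's random module (data and sample sizes are arbitrary):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
data = list(range(100))

# simple random sampling without replacement: each object picked at most once
without_repl = random.sample(data, 10)

# sampling with replacement: the same object can be picked again
with_repl = [random.choice(data) for _ in range(10)]

# stratified sampling: sample each group (stratum) separately, here 5 per parity class
strata = {0: [x for x in data if x % 2 == 0], 1: [x for x in data if x % 2 == 1]}
stratified = [x for s in strata.values() for x in random.sample(s, 5)]
```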
      • 2.3.3. Dimensionality reduction 
        • Curse of dimensionality (COD)
        • Purpose : avoid COD, reduce processing time, better visualization, reduce noise
        • Techniques : PCA (Principal Components Analysis), SVD (Singular Value Decomposition) 
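PCA is normally done with a library, but for 2-D data the idea fits in a short sketch (toy values, my own): compute the covariance matrix, find the direction of maximum variance, and project onto it:

```python
import math

# Minimal 2-D PCA sketch (not a library implementation).
xs = [2.5, 0.5, 2.2, 1.9, 3.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# principal-axis angle of the 2x2 covariance matrix [[cxx, cxy], [cxy, cyy]]
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
pc1 = (math.cos(theta), math.sin(theta))

# 1-D representation: projection of each centred point onto the first component
projected = [(x - mx) * pc1[0] + (y - my) * pc1[1] for x, y in zip(xs, ys)]
var_pc1 = sum(p * p for p in projected) / (n - 1)  # variance captured by PC1
```

The variance along the first component is at least as large as the variance of either original attribute, which is exactly why the 1-D representation loses as little as possible.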
      • 2.3.4. Feature subset selection
        • Remove redundant features, irrelevant features
        • Techniques : brute force, embedded, filter, wrapper
      • 2.3.5. Feature Creation
        • Create new attributes from existing
        • Methods : feature extraction, mapping data to new space (Fourier transformation, wavelet trans'n), feature construction 
      • 2.3.6. Discretization & Binarization 
        • binarization, discretization of continuous attributes
        • Unsupervised discretization (no class labels) : equal width, equal frequency, K-means clustering
        • Supervised discretization (using class labels) : entropy
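A sketch of the two simplest unsupervised schemes on made-up values (equal width splits the value range; equal frequency splits the sorted data):

```python
values = [1, 2, 3, 7, 8, 9, 20, 21]
k = 2  # number of bins

# equal width: split the range [min, max] into k intervals of equal length
lo, hi = min(values), max(values)
width = (hi - lo) / k
eq_width = [min(int((v - lo) / width), k - 1) for v in values]

# equal frequency: each bin receives (roughly) the same number of values
sv = sorted(values)
per_bin = len(values) // k
eq_freq = [min(sv.index(v) // per_bin, k - 1) for v in values]
```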
      • 2.3.7. Variable Transformation
        • Simple functions, normalization
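Two common variable transformations, sketched on a made-up attribute:

```python
import math

x = [2.0, 4.0, 6.0, 8.0]

# min-max normalization to [0, 1]
lo, hi = min(x), max(x)
minmax = [(v - lo) / (hi - lo) for v in x]

# standardization (z-score): subtract the mean, divide by the standard deviation
mean = sum(x) / len(x)
sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1))
zscores = [(v - mean) / sd for v in x]
```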
    • 2.4. Measures of Similarity & Dissimilarity
      • 2.4.1. Basics 
        • definitions - similarity, dissimilarity
        • transformations -> e.g. s' = (s - min_s) / (max_s - min_s)
      • 2.4.2. S & D b/w simple attributes
        • nominal : d = 0 if x = y, 1 if x != y ; s = 1 if x = y, 0 if x != y
        • ordinal : $ d = |x - y| / (n-1) ; s = 1 - d $  
        • interval or ratio : $ d = |x - y| ; s = -d, s = 1/(1+d) $
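These per-attribute measures translate directly into code; a sketch with a hypothetical 5-level grade scale:

```python
# nominal: only distinctness matters
def d_nominal(x, y):
    return 0 if x == y else 1

# ordinal: map values to ranks 0..n-1, then compare
grades = ["poor", "fair", "ok", "good", "great"]  # n = 5 ordered values
def d_ordinal(x, y, scale=grades):
    n = len(scale)
    return abs(scale.index(x) - scale.index(y)) / (n - 1)

# interval/ratio: plain difference, with one of the similarity transforms above
def d_numeric(x, y):
    return abs(x - y)

def s_numeric(x, y):
    return 1 / (1 + d_numeric(x, y))
```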
      • 2.4.3. Dissimilarities b/w data objects
        • Distances
          • Euclidean distance : $ d(x, y) = \sqrt{\sum_{k=1}^n (x_k - y_k)^2} $
          • Minkowski distance (generalization) : $ d(x, y) = (\sum_{k=1}^n |x_k - y_k|^r)^{1/r} $
            • City block (Manhattan) distance : r = 1 (Hamming distance for binary vectors)
            • Euclidean distance : r = 2
            • Supremum distance : r = $ \infty $
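The whole Minkowski family can be written as one function of r; a minimal sketch:

```python
import math

def minkowski(x, y, r):
    """Minkowski distance with parameter r; r = inf gives the supremum distance."""
    if r == math.inf:
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0, 0), (3, 4)
manhattan = minkowski(x, y, 1)        # 7.0
euclidean = minkowski(x, y, 2)        # 5.0
supremum = minkowski(x, y, math.inf)  # 4
```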
        • Common properties : a. Positivity, b. Symmetry, c. Triangle inequality
        • 'metric' satisfies all common properties
      • 2.4.4 Similarities b/w data objects
        • typical properties
        • s(x, y) = 1 only if x = y [max similarity]
      • 2.4.5 Examples of proximity measures
        • Similarity coefficients
          • Simple Matching : $ SMC = (f_{11} + f_{00}) / (f_{01} + f_{10} + f_{11} + f_{00}) $
          • Jaccard : $ JC = f_{11} / (f_{01} + f_{10} + f_{11}) $ (ignores 0-0 matches)
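A sketch contrasting SMC and Jaccard on two made-up binary vectors; SMC counts the 0-0 matches, Jaccard ignores them:

```python
x = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# f_ab = number of positions where x has value a and y has value b
f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))

smc = (f11 + f00) / (f11 + f00 + f10 + f01)   # 9/10 -- dominated by 0-0 matches
jaccard = f11 / (f11 + f10 + f01)             # 2/3  -- 0-0 matches ignored
```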
          • Cosine : $ \cos(x, y) = (x \cdot y) / (\|x\| \, \|y\|) $
            • $ x \cdot y $ is the vector dot product $ = \sum_{k=1}^n x_k y_k $
            • $ \|x\| = \sqrt{x \cdot x} $
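Cosine similarity follows directly from the dot product and norm; a minimal sketch:

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def cosine(x, y):
    return dot(x, y) / (norm(x) * norm(y))
```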
          • Extended Jaccard : $ EJ = (x \cdot y) / (\|x\|^2 + \|y\|^2 - x \cdot y) $ 
          • Correlation -> Pearson's correlation
          • $ corr(x, y) = covariance(x, y) \, / \, (SD(x) \, SD(y)) $
          • $ covariance(x, y) = \frac{1}{n-1} \sum_{k=1}^n (x_k - \bar{x})(y_k - \bar{y}) $
          • $ \bar{x} = $ mean of x $ = \frac{1}{n} \sum_{k=1}^n x_k $ ; similarly $ \bar{y} = $ mean of y
          • $ SD(x) = $ standard deviation of x $ = \sqrt{\frac{1}{n-1} \sum_{k=1}^n (x_k - \bar{x})^2} $
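Putting the covariance and standard-deviation definitions together gives Pearson's correlation; a sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sdy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sdx * sdy)
```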
          • Bregman divergence
            • $ D(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), (x - y) \rangle $
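A quick sanity check of the Bregman formula: with $ \phi(x) = \|x\|^2 $ it reduces to squared Euclidean distance (a standard example, sketched below):

```python
# phi(x) = ||x||^2, so grad phi(y) = 2y and
# D(x, y) = ||x||^2 - ||y||^2 - <2y, x - y> = ||x - y||^2
def phi(x):
    return sum(v * v for v in x)

def grad_phi(y):
    return [2 * v for v in y]

def bregman(x, y):
    inner = sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner
```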
      • 2.4.6 Issues in Proximity calculation
        • a. different scales, b. different attributes, c. different weights
        • Standardization 
          • $ mahalanobis(x, y) = (x - y) \, \Sigma^{-1} \, (x - y)^T $, where $ \Sigma $ is the covariance matrix of the data
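For 2-D data the Mahalanobis distance can be computed by hand-inverting the 2x2 covariance matrix; a sketch with made-up values:

```python
# Hypothetical 2x2 covariance matrix Sigma and its inverse.
cov = [[3.0, 1.0], [1.0, 2.0]]
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
inv = [[cov[1][1] / det, -cov[0][1] / det],
       [-cov[1][0] / det, cov[0][0] / det]]

def mahalanobis(x, y):
    """(x - y) Sigma^{-1} (x - y)^T for 2-D points."""
    d = [x[0] - y[0], x[1] - y[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
```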
        • Combining Similarities
          • $ Similarity(x, y) = \sum_{k=1}^n w_k \delta_k s_k(x, y)  / \sum_{k=1}^n \delta_k $
      • 2.4.7 Selecting the right proximity measure
        • Type of proximity should fit type of data
        • Use Euclidean distance for dense, continuous data; proximity is usually expressed as a difference (distance)
        • Sparse data -> ignore 0-0 matches -> use Jaccard, cosine, Extended Jaccard.
        • Time series -> use Euclidean distance
        • Time series with different magnitudes -> use Correlation
That was a pretty long chapter. Lots learnt!
The best thing I learnt outside the chapter was 'how to write mathematical equations in a blog'! I used a Javascript library named MathJax, which renders LaTeX notation into HTML at runtime. 

Ref : http://docs.mathjax.org/en/latest/start.html

Putting this snippet in the <head> section worked:

<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript">
    MathJax.Hub.Config({
        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js"],
        jax: ["input/TeX","output/HTML-CSS"],
        tex2jax: {
            inlineMath: [ ['$','$'], ["\\(","\\)"] ],
            displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
            processEscapes: true,
        },
        "HTML-CSS": { availableFonts: ["TeX"] }
    });
</script>

Ref : http://narnia.cs.ttu.edu/drupal/node/129

References for this chapter : 
Text : Introduction to Data Mining - Chapter 2


