February 16, 2014

TSK - Chapter 2 - Data

Data is a collection of objects and their attributes. An attribute is a property or characteristic of an object. A collection of attributes describes an object. 
  • 2. Data
    • 2.1. Types of data
      • 2.1.1. attributes & measurement
        • operations used to describe attributes - distinctness, order, addition, multiplication
        • 4 types of attributes - nominal, ordinal, interval, ratio
        • Nominal and Ordinal are Categorical (Qualitative) 
        • Interval and Ratio are Numeric (Quantitative)
          • Nominal (ids, zip codes) -> distinctness
          • Ordinal (rankings, grades) -> distinctness + order
          • Interval (dates, temp in C) -> distinctness + order + addition
          • Ratio (temp in K, length, time) -> distinctness + order + addition + multiplication
        •  attribute categorization by values -> discrete, continuous 
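To make the operation hierarchy concrete, here is a tiny Python sketch (example values are made up, not from the book) of why differences are meaningful on an interval scale like Celsius, while ratios are only meaningful on a ratio scale like Kelvin:

```python
# Permissible operations per attribute type (hypothetical temperature values).
c1, c2 = 10.0, 20.0                 # interval scale (Celsius): zero point is arbitrary
k1, k2 = c1 + 273.15, c2 + 273.15   # ratio scale (Kelvin): true zero

diff_c = c2 - c1    # 10.0 -- differences are meaningful on an interval scale
ratio_c = c2 / c1   # 2.0  -- but "twice as hot" is NOT meaningful in Celsius
ratio_k = k2 / k1   # ~1.035 -- ratios ARE meaningful in Kelvin
```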
      • 2.1.2. Types of data sets
        • Characteristics of data sets -> dimensionality, sparsity, resolution 
        • Types -> record, graph, ordered
          • Record data : a. record data, b. transaction data, c. data matrix, d. document-term matrix
          • Graph based data : a. data with relationship among objects (www), b. data with objects that are graphs (molecular structure) 
          • Ordered data : a. sequential data (timestamps), b. sequence data (genetic code), c. time series data, d. spatial data
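A document-term matrix (one of the record-data types above) is easy to build by hand; a rough sketch with a made-up three-document corpus:

```python
from collections import Counter

# Hypothetical toy corpus; each row of the matrix counts term occurrences.
docs = ["data mining mining", "graph data", "time series data"]
vocab = sorted({t for d in docs for t in d.split()})
dtm = [[Counter(d.split())[t] for t in vocab] for d in docs]
# vocab: ['data', 'graph', 'mining', 'series', 'time']
```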
    • 2.2. Data Quality
      • 2.2.1. Measurement and data quality issues
        • Noise & artifacts -> precision (std deviation), bias (mean-actual), accuracy (closeness to true value) 
        • Outliers
        • Missing values -> how to solve : eliminate data objects, eliminate missing value attributes, ignore missing value during analysis
        • Inconsistent values
        • Duplicate data
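The three ways of handling missing values listed above can be sketched in a few lines of Python (toy table, `None` marks a missing value):

```python
rows = [[1.0, 2.0], [None, 4.0], [5.0, 6.0]]

# a. eliminate data objects (drop rows containing a missing value)
drop_rows = [r for r in rows if None not in r]

# b. eliminate attributes (drop columns containing a missing value)
keep_cols = [j for j in range(len(rows[0])) if all(r[j] is not None for r in rows)]
drop_cols = [[r[j] for j in keep_cols] for r in rows]

# c. ignore missing values during analysis (e.g. a column mean over present values)
col0 = [r[0] for r in rows if r[0] is not None]
mean_col0 = sum(col0) / len(col0)   # (1.0 + 5.0) / 2 = 3.0
```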
      • 2.2.2. Issues related to applications
        • timeliness, relevance, knowledge about the data
    • 2.3. Data Preprocessing
      • 2.3.1. Aggregation 
        • Purpose : data reduction (a smaller data set needs less memory), change of scale, more 'stable' data
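As a rough illustration of aggregation (records are made up), rolling daily transactions up to monthly totals both reduces the data and changes its scale:

```python
from collections import defaultdict

# Hypothetical (month, sales) records, aggregated to per-month totals.
records = [("Jan", 10), ("Jan", 15), ("Feb", 7), ("Feb", 3)]
totals = defaultdict(int)
for month, sales in records:
    totals[month] += sales
# 4 records reduced to 2: {'Jan': 25, 'Feb': 10}
```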
      • 2.3.2. Sampling
        • Processing all records is very expensive, but a sample must be representative of the full data set
        • Sampling approaches : 
          • Simple random sampling
          • Sampling without replacement
          • Sampling with replacement
          • Stratified sampling
          • Progressive sampling
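Three of the sampling approaches above, sketched with Python's random module (data and sample sizes are arbitrary):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
data = list(range(100))

# simple random sampling without replacement: each object picked at most once
without_repl = random.sample(data, 10)

# sampling with replacement: the same object can be picked again
with_repl = [random.choice(data) for _ in range(10)]

# stratified sampling: sample each group (stratum) separately, here 5 per parity class
strata = {0: [x for x in data if x % 2 == 0], 1: [x for x in data if x % 2 == 1]}
stratified = [x for s in strata.values() for x in random.sample(s, 5)]
```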
      • 2.3.3. Dimensionality reduction 
        • Curse of dimensionality (COD)
        • Purpose : avoid COD, reduce processing time, better visualization, reduce noise
        • Techniques : PCA (Principal Components Analysis), SVD (Singular Value Decomposition) 
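PCA is normally done with a library, but for 2-D data the idea fits in a short sketch (toy values, my own): compute the covariance matrix, find the direction of maximum variance, and project onto it:

```python
import math

# Minimal 2-D PCA sketch (not a library implementation).
xs = [2.5, 0.5, 2.2, 1.9, 3.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# principal-axis angle of the 2x2 covariance matrix [[cxx, cxy], [cxy, cyy]]
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
pc1 = (math.cos(theta), math.sin(theta))

# 1-D representation: projection of each centred point onto the first component
projected = [(x - mx) * pc1[0] + (y - my) * pc1[1] for x, y in zip(xs, ys)]
var_pc1 = sum(p * p for p in projected) / (n - 1)  # variance captured by PC1
```

The variance along the first component is at least as large as the variance of either original attribute, which is exactly why the 1-D representation loses as little as possible.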
      • 2.3.4. Feature subset selection
        • Remove redundant features, irrelevant features
        • Techniques : brute force, embedded, filter, wrapper
      • 2.3.5. Feature Creation
        • Create new attributes from existing
        • Methods : feature extraction, mapping data to new space (Fourier transformation, wavelet trans'n), feature construction 
      • 2.3.6. Discretization & Binarization 
        • binarization, discretization of continuous attributes
        • Unsupervised discretization (no class labels) : equal width, equal frequency, K-means clustering
        • Supervised discretization (using class labels) : entropy
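A sketch of the two simplest unsupervised schemes on made-up values (equal width splits the value range; equal frequency splits the sorted data):

```python
values = [1, 2, 3, 7, 8, 9, 20, 21]
k = 2  # number of bins

# equal width: split the range [min, max] into k intervals of equal length
lo, hi = min(values), max(values)
width = (hi - lo) / k
eq_width = [min(int((v - lo) / width), k - 1) for v in values]

# equal frequency: each bin receives (roughly) the same number of values
sv = sorted(values)
per_bin = len(values) // k
eq_freq = [min(sv.index(v) // per_bin, k - 1) for v in values]
```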
      • 2.3.7. Variable Transformation
        • Simple functions, normalization
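Two common variable transformations, sketched on a made-up attribute:

```python
import math

x = [2.0, 4.0, 6.0, 8.0]

# min-max normalization to [0, 1]
lo, hi = min(x), max(x)
minmax = [(v - lo) / (hi - lo) for v in x]

# standardization (z-score): subtract the mean, divide by the standard deviation
mean = sum(x) / len(x)
sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (len(x) - 1))
zscores = [(v - mean) / sd for v in x]
```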
    • 2.4. Measures of Similarity & Dissimilarity
      • 2.4.1. Basics 
        • definitions - similarity, dissimilarity
        • transformations -> e.g. s' = (s - min_s) / (max_s - min_s)
      • 2.4.2. S & D b/w simple attributes
        • nominal : d = 0 if x = y, 1 if x != y ; s = 1 if x = y, 0 if x != y
        • ordinal : $ d = |x - y| / (n-1) ; s = 1 - d $  
        • interval or ratio : $ d = |x - y| ; s = -d, s = 1/(1+d) $
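These per-attribute measures translate directly into code; a sketch with a hypothetical 5-level grade scale:

```python
# nominal: only distinctness matters
def d_nominal(x, y):
    return 0 if x == y else 1

# ordinal: map values to ranks 0..n-1, then compare
grades = ["poor", "fair", "ok", "good", "great"]  # n = 5 ordered values
def d_ordinal(x, y, scale=grades):
    n = len(scale)
    return abs(scale.index(x) - scale.index(y)) / (n - 1)

# interval/ratio: plain difference, with one of the similarity transforms above
def d_numeric(x, y):
    return abs(x - y)

def s_numeric(x, y):
    return 1 / (1 + d_numeric(x, y))
```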
      • 2.4.3. Dissimilarities b/w data objects
        • Distances
          • Euclidean distance : $ d(x, y) = \sqrt{\sum_{k=1}^n (x_k - y_k)^2} $
          • Minkowski distance (generalization) : $ d(x, y) = (\sum_{k=1}^n |x_k - y_k|^r)^{1/r} $
            • City block (Manhattan) distance : r = 1 (Hamming distance for binary vectors)
            • Euclidean distance : r = 2
            • Supremum distance : r = $ \infty $
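The whole Minkowski family can be written as one function of r; a minimal sketch:

```python
import math

def minkowski(x, y, r):
    """Minkowski distance with parameter r; r = inf gives the supremum distance."""
    if r == math.inf:
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0, 0), (3, 4)
manhattan = minkowski(x, y, 1)        # 7.0
euclidean = minkowski(x, y, 2)        # 5.0
supremum = minkowski(x, y, math.inf)  # 4
```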
        • Common properties : a. Positivity, b. Symmetry, c. Triangle inequality
        • 'metric' satisfies all common properties
      • 2.4.4 Similarities b/w data objects
        • typical properties
        • s(x, y) = 1 only if x = y [max similarity]
      • 2.4.5 Examples of proximity measures
        • Similarity coefficients
          • Simple Matching : $ SMC = (f_{11} + f_{00}) / (f_{01} + f_{10} + f_{11} + f_{00}) $
          • Jaccard : $ JC = f_{11} / (f_{01} + f_{10} + f_{11}) $ (ignores 0-0 matches)
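A sketch contrasting SMC and Jaccard on two made-up binary vectors; SMC counts the 0-0 matches, Jaccard ignores them:

```python
x = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# f_ab = number of positions where x has value a and y has value b
f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))

smc = (f11 + f00) / (f11 + f00 + f10 + f01)   # 9/10 -- dominated by 0-0 matches
jaccard = f11 / (f11 + f10 + f01)             # 2/3  -- 0-0 matches ignored
```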
          • Cosine : $ \cos(x, y) = (x \cdot y) / (\|x\| \, \|y\|) $
            • $ x \cdot y $ is the vector dot product $ = \sum_{k=1}^n x_k y_k $
            • $ \|x\| = \sqrt{x \cdot x} $
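Cosine similarity follows directly from the dot product and norm; a minimal sketch:

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(dot(x, x))

def cosine(x, y):
    return dot(x, y) / (norm(x) * norm(y))
```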
          • Extended Jaccard : $ EJ = (x \cdot y) / (\|x\|^2 + \|y\|^2 - x \cdot y) $ 
          • Correlation -> Pearson's correlation
          • $ corr(x, y) = covariance(x, y) \, / \, (SD(x) \, SD(y)) $
          • $ covariance(x, y) = \frac{1}{n-1} \sum_{k=1}^n (x_k - \bar{x})(y_k - \bar{y}) $
          • $ \bar{x} = $ mean of x $ = \frac{1}{n} \sum_{k=1}^n x_k $ ; similarly $ \bar{y} = $ mean of y
          • $ SD(x) = $ standard deviation of x $ = \sqrt{\frac{1}{n-1} \sum_{k=1}^n (x_k - \bar{x})^2} $
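Putting the covariance and standard-deviation definitions together gives Pearson's correlation; a sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sdy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sdx * sdy)
```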
          • Bregman divergence
            • $ D(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), (x - y) \rangle $
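A quick sanity check of the Bregman formula: with $ \phi(x) = \|x\|^2 $ it reduces to squared Euclidean distance (a standard example, sketched below):

```python
# phi(x) = ||x||^2, so grad phi(y) = 2y and
# D(x, y) = ||x||^2 - ||y||^2 - <2y, x - y> = ||x - y||^2
def phi(x):
    return sum(v * v for v in x)

def grad_phi(y):
    return [2 * v for v in y]

def bregman(x, y):
    inner = sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner
```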
      • 2.4.6 Issues in Proximity calculation
        • a. different scales, b. different attributes, c. different weights
        • Standardization 
          • $ mahalanobis(x, y) = (x - y) \, \Sigma^{-1} \, (x - y)^T $, where $ \Sigma $ is the covariance matrix of the data
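For 2-D data the Mahalanobis distance can be computed by hand-inverting the 2x2 covariance matrix; a sketch with made-up values:

```python
# Hypothetical 2x2 covariance matrix Sigma and its inverse.
cov = [[3.0, 1.0], [1.0, 2.0]]
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
inv = [[cov[1][1] / det, -cov[0][1] / det],
       [-cov[1][0] / det, cov[0][0] / det]]

def mahalanobis(x, y):
    """(x - y) Sigma^{-1} (x - y)^T for 2-D points."""
    d = [x[0] - y[0], x[1] - y[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
```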
        • Combining Similarities
          • $ Similarity(x, y) = \sum_{k=1}^n w_k \delta_k s_k(x, y)  / \sum_{k=1}^n \delta_k $
      • 2.4.7 Selecting the right proximity measure
        • Type of proximity should fit type of data
        • Use Euclidean distance for dense, continuous data; proximity is usually expressed as a difference (distance)
        • Sparse data -> ignore 0-0 matches -> use Jaccard, cosine, Extended Jaccard.
        • Time series -> use Euclidean distance
        • Time series with different magnitudes -> use Correlation
That was a pretty long chapter. Lots learnt!
The best thing I learnt outside the chapter was 'how to write mathematical equations in a blog'! I used a Javascript library named MathJax, which renders LaTeX notation into HTML at runtime. 

Ref : http://docs.mathjax.org/en/latest/start.html

Putting this snippet in the <head> section worked:

<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript">
    MathJax.Hub.Config({
        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js"],
        jax: ["input/TeX","output/HTML-CSS"],
        tex2jax: {
            inlineMath: [ ['$','$'], ["\\(","\\)"] ],
            displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
            processEscapes: true,
        },
        "HTML-CSS": { availableFonts: ["TeX"] }
    });
</script>

Ref : http://narnia.cs.ttu.edu/drupal/node/129

References for this chapter : 
Text : Introduction to Data Mining - Chapter 2


