
February 16, 2014

TSK - Chapter 2 - Data

Data is a collection of objects and their attributes. An attribute is a property or characteristic of an object; a collection of attributes describes an object.
  • 2. Data
    • 2.1. Types of data
      • 2.1.1. attributes & measurement
        • operations used to describe attributes - distinctness, order, addition, multiplication
        • 4 types of attributes - nominal, ordinal, interval, ratio
        • Nominal and Ordinal are Categorical (Qualitative) 
        • Interval and Ratio are Numeric (Quantitative)
          • Nominal (ids, zip codes) -> distinctness
          • Ordinal (rankings, grades) -> distinctness + order
          • Interval (dates, temp in C) -> distinctness + order + addition
          • Ratio (temp in K, length, time) -> distinctness + order + addition + multiplication
        •  attribute categorization by values -> discrete, continuous 
      • 2.1.2. Types of data sets
        • Characteristics of data sets -> dimensionality, sparsity, resolution 
        • Types -> record, graph, ordered
          • Record data : a. transaction data, b. data matrix, c. sparse data matrix (e.g. document-term matrix)
          • Graph based data : a. data with relationship among objects (www), b. data with objects that are graphs (molecular structure) 
          • Ordered data : a. sequential data (timestamps), b. sequence data (genetic code), c. time series data, d. spatial data
    • 2.2. Data Quality
      • 2.2.1. Measurement and data quality issues
        • Noise & artifacts -> precision (closeness of repeated measurements; measured by std deviation), bias (systematic error; mean of measurements - actual value), accuracy (closeness to the true value)
        • Outliers
        • Missing values -> how to solve : eliminate data objects, eliminate missing value attributes, ignore missing value during analysis
        • Inconsistent values
        • Duplicate data
      • 2.2.2. Issues related to applications
        • timeliness, relevance, knowledge about the data
    • 2.3. Data Preprocessing
      • 2.3.1. Aggregation 
        • Purpose : data reduction (a smaller dataset means less memory and processing time), change of scale, more 'stable' (less variable) data; a sketch follows.
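
A minimal sketch of aggregation, assuming numpy and a made-up daily-sales attribute: 365 daily records collapse into 12 monthly totals, which both shrinks the dataset and smooths out day-to-day variability.

import numpy as np

# Hypothetical example: 365 daily sales figures aggregated into 12 monthly totals.
rng = np.random.default_rng(0)
daily_sales = rng.poisson(lam=100, size=365)   # one value per day (synthetic)
month_of_day = np.repeat(np.arange(12),
                         [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])

# Aggregate: sum the daily values that fall in each month
# (data reduction + change of scale).
monthly_sales = np.bincount(month_of_day, weights=daily_sales)
print(monthly_sales.shape)   # (12,) -- 365 objects reduced to 12
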
      • 2.3.2. Sampling
        • Processing all records is expensive, so we sample; a sample is useful only if it is representative of the full dataset.
        • Sampling approaches (sketched after this list) :
          • Simple random sampling
          • Sampling without replacement
          • Sampling with replacement
          • Stratified sampling
          • Progressive sampling
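
A sketch of several of these approaches, assuming numpy; the data, class labels, and sample sizes are made up:

import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1000)                      # hypothetical record ids
labels = rng.integers(0, 3, size=1000)      # hypothetical class labels for stratification

# Simple random sampling, without and with replacement.
without_repl = rng.choice(data, size=100, replace=False)
with_repl = rng.choice(data, size=100, replace=True)

# Stratified sampling: draw the same fraction from every class so that
# rare classes stay represented in the sample.
frac = 0.1
strata = [rng.choice(data[labels == c],
                     size=int(frac * (labels == c).sum()),
                     replace=False)
          for c in np.unique(labels)]
stratified = np.concatenate(strata)
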
      • 2.3.3. Dimensionality reduction 
        • Curse of dimensionality (COD)
        • Purpose : avoid COD, reduce processing time, better visualization, reduce noise
        • Techniques : PCA (Principal Component Analysis), SVD (Singular Value Decomposition)
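
A minimal PCA-via-SVD sketch with numpy, on synthetic data: center the matrix, decompose it, and project onto the top-k right singular vectors (the principal components).

import numpy as np

def pca(X, k):
    """Project X (n objects x d attributes) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)            # PCA requires zero-mean data
    # SVD of the centered matrix; rows of Vt are the principal directions.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T               # n x k reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # synthetic data
X_reduced = pca(X, k=2)                        # 10 dimensions down to 2
print(X_reduced.shape)                         # (200, 2)
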
      • 2.3.4. Feature subset selection
        • Remove redundant features, irrelevant features
        • Techniques : brute force, embedded, filter, wrapper
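
A sketch of the filter approach (the thresholds are arbitrary assumptions): score features independently of any learning algorithm, dropping near-constant ones as irrelevant and one of each highly correlated pair as redundant.

import numpy as np

def filter_features(X, var_threshold=1e-3, corr_threshold=0.95):
    """Filter approach: select features before (and independently of) any model."""
    keep = np.var(X, axis=0) > var_threshold        # drop near-constant features
    X = X[:, keep]
    corr = np.abs(np.corrcoef(X, rowvar=False))     # pairwise feature correlation
    drop = set()
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):
            if corr[i, j] > corr_threshold:         # redundant pair: keep one
                drop.add(j)
    cols = [c for c in range(X.shape[1]) if c not in drop]
    return X[:, cols]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 0.999 * X[:, 0] + rng.normal(scale=1e-3, size=100)  # redundant feature
X[:, 4] = 0.0                                                 # irrelevant (constant)
print(filter_features(X).shape)                               # (100, 3)
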
      • 2.3.5. Feature Creation
        • Create new attributes from existing
        • Methods : feature extraction, mapping data to new space (Fourier transformation, wavelet trans'n), feature construction 
      • 2.3.6. Discretization & Binarization 
        • binarization (categorical -> binary attributes), discretization of continuous attributes
        • Unsupervised discretization (no class labels) : equal width, equal frequency, K-means (first two sketched below)
        • Supervised discretization (using class labels) : entropy
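
A sketch of the two simplest unsupervised schemes with numpy (the bin count k and the data are arbitrary): equal width splits the attribute's range into intervals of identical length; equal frequency places bin edges at quantiles so each bin holds roughly the same number of objects.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=1000)                     # skewed continuous attribute

k = 4                                              # number of bins (arbitrary)

# Equal width: k intervals of identical length over [min, max].
width_edges = np.linspace(x.min(), x.max(), k + 1)
equal_width = np.digitize(x, width_edges[1:-1])    # bin index 0..k-1

# Equal frequency: edges at the quantiles, so each bin gets ~n/k objects.
freq_edges = np.quantile(x, np.linspace(0, 1, k + 1))
equal_freq = np.digitize(x, freq_edges[1:-1])

print(np.bincount(equal_width, minlength=k))       # very uneven for skewed data
print(np.bincount(equal_freq, minlength=k))        # ~250 objects per bin
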
      • 2.3.7. Variable Transformation
        • Simple functions, normalization
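
A sketch of two common normalizations: min-max scaling to [0, 1] (the same mapping used to transform similarity scores in 2.4.1) and z-score standardization.

import numpy as np

def min_max(x):
    """Scale to [0, 1]: x' = (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Standardize to zero mean, unit standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(min_max(x))   # values now in [0, 1]
print(z_score(x))   # values now centered at 0
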
    • 2.4. Measures of Similarity & Dissimilarity
      • 2.4.1. Basics 
        • definitions - similarity, dissimilarity
        • transformations -> e.g. s' = (s - min_s) / (max_s - min_s)
      • 2.4.2. S & D b/w simple attributes (sketched below)
        • nominal : d = 0 if x = y, 1 if x != y ; s = 1 - d
        • ordinal : d = |x - y| / (n-1) ; s = 1 - d  
        • interval or ratio : d = |x - y| ; s = -d, s = 1/(1+d)
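
The per-attribute measures above as a tiny sketch; n is the number of ordinal values, and x, y are assumed already mapped to ranks 0..n-1:

def nominal_d(x, y):
    return 0 if x == y else 1                 # s = 1 - d

def ordinal_d(x, y, n):
    return abs(x - y) / (n - 1)               # ranks 0..n-1; s = 1 - d

def interval_d(x, y):
    return abs(x - y)                         # s = -d, or s = 1 / (1 + d)

# e.g. grades {poor, fair, good, excellent}: d(poor, good) = 2/3
print(ordinal_d(0, 2, n=4))
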
      • 2.4.3. Dissimilarities b/w data objects
        • Distances
          • Euclidean distance : d(x, y) = \sqrt{\sum_{k=1}^n (x_k - y_k)^2}
          • Minkowski distance (generalization) : d(x, y) = (\sum_{k=1}^n |x_k - y_k|^r)^{1/r}
            • City block (Manhattan, L_1) distance : r = 1 ; Hamming distance is the r = 1 case for binary vectors
            • Euclidean distance : r = 2
            • Supremum distance : r = \infty
        • Common properties : a. Positivity, b. Symmetry, c. Triangle inequality
        • 'metric' satisfies all common properties
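
A sketch of the Minkowski family above; r = 1, 2, and infinity recover the city block, Euclidean, and supremum distances (the example vectors are made up):

import numpy as np

def minkowski(x, y, r):
    """Minkowski distance; r=1 city block, r=2 Euclidean, r=inf supremum."""
    if np.isinf(r):
        return np.max(np.abs(x - y))
    return np.sum(np.abs(x - y) ** r) ** (1 / r)

x = np.array([1.0, 0.0, 2.0])
y = np.array([3.0, 4.0, 2.0])
print(minkowski(x, y, 1))        # 6.0   (city block)
print(minkowski(x, y, 2))        # ~4.47 (Euclidean)
print(minkowski(x, y, np.inf))   # 4.0   (supremum)
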
      • 2.4.4 Similarities b/w data objects
        • typical properties
        • s(x, y) = 1 only if x = y [max similarity]
      • 2.4.5 Examples of proximity measures
        • Similarity co-efficients
          • Simple Matching : SMC =  (f_11 + f_00) / (f_01 + f_10 + f_11 + f_00) 
          • Jaccard : JC = f_11 / (f_01 + f_10 + f_11)  (f_00 is excluded; 0-0 matches carry no information for asymmetric binary attributes)
          • Cosine : cos(x, y) = (x \cdot y) / (||x|| \, ||y||)
            • x \cdot y is the vector dot product = \sum_{k=1}^n x_k y_k
            • ||x|| = \sqrt{x \cdot x}
          • Extended Jaccard : EJ =  x.y / (||x||^2 + ||y||^2 - x.y)  
          • Correlation : Pearson's correlation
          • corr(x, y) = covariance(x, y) / (SD(x) \cdot SD(y))
          • covariance(x, y) = 1/(n-1) \sum_{k=1}^n (x_k - \bar{x})(y_k - \bar{y})
            • \bar{x} = mean of x = 1/n \sum_{k=1}^n x_k ; \bar{y} = mean of y, defined analogously
          • SD(x) = standard deviation = \sqrt{1/(n-1) \sum_{k=1}^n (x_k - \bar{x})^2}
          • Bregman divergence
            • D(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle
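
A sketch of the similarity coefficients above on small made-up vectors: SMC and Jaccard on binary data; cosine, extended Jaccard, and Pearson correlation on continuous data.

import numpy as np

x = np.array([1, 0, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0, 0])

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f00 + f01 + f10 + f11)        # counts 0-0 matches
jaccard = f11 / (f01 + f10 + f11)                  # ignores 0-0 matches

u = np.array([3.0, 2.0, 0.0, 5.0])
v = np.array([1.0, 0.0, 0.0, 0.0])
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
ext_jaccard = u @ v / (u @ u + v @ v - u @ v)      # Tanimoto coefficient

# Pearson correlation = covariance / (SD(x) * SD(y)); numpy gives the matrix.
pearson = np.corrcoef(u, v)[0, 1]
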
      • 2.4.6 Issues in Proximity calculation
        • a. different scales, b. different attributes, c. different weights
        • Standardization 
          • mahalanobis(x, y) = (x - y) \Sigma^{-1} (x - y)^T, where \Sigma is the covariance matrix of the data
        • Combining Similarities
          • Similarity(x, y) = \sum_{k=1}^n w_k \delta_k s_k(x, y) / \sum_{k=1}^n \delta_k, where \delta_k = 0 if the kth attribute is asymmetric with two 0 values or has a missing value, and 1 otherwise
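
A sketch of both formulas above: \Sigma is estimated from a synthetic sample, and the per-attribute similarities, indicators, and weights in the combination example are made up.

import numpy as np

def mahalanobis(x, y, data):
    """mahalanobis(x, y) = (x - y) Sigma^-1 (x - y)^T, Sigma = data covariance."""
    sigma_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = x - y
    return diff @ sigma_inv @ diff

def combined_similarity(s, delta, w):
    """Weighted combination: sum(w_k * delta_k * s_k) / sum(delta_k)."""
    s, delta, w = map(np.asarray, (s, delta, w))
    return np.sum(w * delta * s) / np.sum(delta)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))           # sample used to estimate Sigma
print(mahalanobis(data[0], data[1], data))

# Hypothetical per-attribute similarities, indicators, and weights:
s_k = [1.0, 0.4, 0.7]
delta = [1, 1, 0]                          # third attribute missing -> excluded
w = [0.5, 0.3, 0.2]
print(combined_similarity(s_k, delta, w))  # 0.31
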
      • 2.4.7 Selecting the right proximity measure
        • Type of proximity should fit type of data
        • Dense continuous data -> use Euclidean (or another metric) distance; proximity is usually expressed as differences
        • Sparse data -> ignore 0-0 matches -> use Jaccard, cosine, Extended Jaccard.
        • Time series -> use Euclidean distance
        • Time series with different magnitudes -> use Correlation
That was a pretty long chapter. Lots learnt!
Best thing I learnt outside the chapter was how to write mathematical equations in a blog: I used a JavaScript library named MathJax, which renders LaTeX notation into HTML at runtime.

Ref : http://docs.mathjax.org/en/latest/start.html

Putting this snippet in the <head> section worked:

<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript">
    MathJax.Hub.Config({
        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js"],
        jax: ["input/TeX","output/HTML-CSS"],
        tex2jax: {
            inlineMath: [ ['$','$'], ["\\(","\\)"] ],
            displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
            processEscapes: true
        },
        "HTML-CSS": { availableFonts: ["TeX"] }
    });
</script>

Ref : http://narnia.cs.ttu.edu/drupal/node/129

References for this chapter : 
Text : Introduction to Data Mining (Tan, Steinbach & Kumar) - Chapter 2


