Data is a collection of objects and their attributes. An attribute is a property or characteristic of an object. A collection of attributes describe an object.
- 2. Data
- 2.1. Types of data
- 2.1.1. attributes & measurement
- operations used to describe attributes - distinctness, order, addition, multiplication
- 4 types of attributes - nominal, ordinal, interval, ratio
- Nominal and Ordinal are Categorical (Qualitative)
- Interval and Ratio are Numeric (Quantitative)
- Nominal (ids, zip codes) -> distinctness
- Ordinal (rankings, grades) -> distinctness + order
- Interval (dates, temp in C) -> distinctness + order + addition
- Ratio (temp in K, length, time) -> distinctness + order + addition + multiplication
- attribute categorization by values -> discrete, continuous
- 2.1.2. Types of data sets
- Characteristics of data sets -> dimensionality, sparsity, resolution
- Types -> record, graph, ordered
- Record data : a. record data, b. transaction data, c. data matrix, d. document-term matrix
- Graph based data : a. data with relationship among objects (www), b. data with objects that are graphs (molecular structure)
- Ordered data : a. sequential data (timestamps), b. sequence data (genetic code), c. time series data, d. spatial data
- 2.2. Data Quality
- 2.2.1. Measurement and data quality issues
- Noise & artifacts -> precision (std deviation), bias (mean-actual), accuracy (closeness to true value)
- Outliers
- Missing values -> how to solve : eliminate data objects, eliminate missing value attributes, ignore missing value during analysis
- Inconsistent values
- Duplicate data
- 2.2.2. Issues related to applications
- timeliness, relevance, knowledge about the data
- 2.3. Data Preprocessing
- 2.3.1. Aggregation
- Purpose : data reduction (smaller dataset means lesser memory), change of scale, more 'stable' data
- 2.3.2. Sampling
- Processing all records is very expensive. but sampling needs to be representative
- Sampling approaches :
- Simple random sampling
- Sampling without replacement
- Sampling with replacement
- Stratified sampling
- Progressive sampling
- 2.3.3. Dimensionality reduction
- Curse of dimensionality (COD)
- Purpose : avoid COD, reduce processing time, better visualization, reduce noise
- Techniques : PCA (Principal Components Analysis), SVD (Singular Value Decomposition)
- 2.3.4. Feature subset reduction
- Remove redundant features, irrelevant features
- Techniques : brute force, embedded, filter, wrapper
- 2.3.5. Feature Creation
- Create new attributes from existing
- Methods : feature extraction, mapping data to new space (Fourier transformation, wavelet trans'n), feature construction
- 2.3.6. Discretization & Binarization
- discretization, binarization, discretization of continuous attributes
- Unsupervised discretization (no class labels) : equal width, equal frequency -> K-means
- Supervised discretization (using class labels) : entropy
- 2.3.7. Variable Transformation
- Simple functions, normalization
- 2.4. Measures of Similarity & Dissimilarity
- 2.4.1. Basics
- definitions - similarity, dissimilarity
- transformations -> e.g s' = (s - min_s) / (max_s - min_s)
- 2.4.2. S & D b/w simple attributes
- nominal : d = 0 or 1 if x = y or x != y ; s = 1 or 0
- ordinal : $ d = |x - y| / (n-1) ; s = 1 - d $
- interval or ratio : $ d = |x - y| ; s = -d, s = 1/(1+d) $
- 2.4.3. Dissimilarities b/w data objects
- Distances
- Euclidean distance : $f(x) = \sqrt{\sum_{k=1}^n (x_k - y_k)^2}$
- Minkowski distance (Generalization) : $f(x) = (\sum_{k=1}^n (x_k - y_k)^r) ^ {1/r} $
- Hamming distance (city blocks) : r = 1
- Euclidean distance : r = 2
- Supremum distance : r = $ \infty $
- Common properties : a. Positivity, b. Symmetry, c. Triangle inequality
- 'metric' satisfies all common properties
- 2.4.4 Similarities b/w data objects
- typical properties
- s(x, y) = 1 only if x = y [max similarity]
- 2.4.5 Examples of proximity measures
- Similarity co-efficients
- Simple Matching : $ SMC = (f_11 + f_00) / (f_01 + f_10 + f_11 + f_00) $
- Jaccard : $ JC = f_11 / (f_01 + f_10 + f_11 + f_00) $
- Cosine : $ cos(x, y) = $ x.y / ||x||.||y|| $
- x.y is the vector dot product = $ \sum_{k=1}^n (x_k y_k) $
- x.y is the vector dot product = $ \sum_{k=1}^n (x_k y_k)
$ - ||x|| = $ \sqrt{x.x}$
- Extended Jaccard : $ EJ = x.y / (||x||^2 + ||y||^2 - x.y) $
- Correlation : $ P_k $ -> Pearson's Correlation
- $ P_k (x, y) = covariance (x, y) / SD(x) . SD(y) $
- $ covariance(x, y) = 1 / (n-1) \sum_{k=1}^n(x_k - \bar{x}) (y_k - \bar{x}) $
- $ = 1/n \sum_{k=1}^n(x_k)$
- $ \bar{x} = $ mean of x $ = 1/n \sum_{k=1}^n(x_k)
$ - $ \bar{y} = $ mean of y
- $ SD(x) = Standard deviation = \sqrt{1 / (n-1) \sum_{k=1}^n(x_k - \bar{x})^2 } $
- Bregman divergence
- $ D(x, y) = \phi{(x)} - \phi{(y)} - \langle \nabla \phi{(y)}, (x - y) \rangle $
- $ D(x, y) = \phi{(x)} - \phi{(y)} - \langle \nabla \phi{(y)}, (x - y) \rangle
$ - 2.4.6 Issues in Proximity calculation
- a. different scales, b. different attributes, c. different weights
- Standardization
- $ mahalanobis(x, y) = (x - y) \sum_{}^{-1}(x-y)^T $
- Combining Similarities
- $ Similarity(x, y) = \sum_{k=1}^n w_k \delta_k s_k(x, y) / \sum_{k=1}^n \delta_k $
- 2.4.7 Selecting the right proximity measure
- Type of proximity should fit type of data
- Use Euclidean distance for dense continuous data
- Usually expressed as differences
- Sparse data -> ignore 0-0 matches -> use Jaccard, cosine, Extended Jaccard.
- Time series -> use Euclidean distance
- Time series with different magnitudes -> use Correlation
Best thing that I learnt outside the chapter was 'how to write mathematical equations in a blog'! Used a Javascript library named MathJax which renders LaTeX notation at runtime into HTML.
Ref : http://docs.mathjax.org/en/latest/start.html
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript"> MathJax.Hub.Config({ extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js"], jax: ["input/TeX","output/HTML-CSS"], tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ], displayMath: [ ['$$','$$'], ["\\[","\\]"] ], processEscapes: true, }, "HTML-CSS": { availableFonts: ["TeX"] } }); </script>
Ref : http://narnia.cs.ttu.edu/drupal/node/129
References for this chapter :
Text : Introduction to Data Mining - Chapter 2
No comments:
Post a Comment