CS725: Data Mining

Course Overview

Course Synopsis

This course discusses techniques for preprocessing data before mining and presents the concepts related to data warehousing, online analytical processing (OLAP), and data generalization. It presents methods for mining frequent patterns, associations, and correlations. It also presents methods for data classification and prediction, data-clustering approaches, and outlier analysis.

Course Learning Outcomes

Students will be able to:

  • 1. Understand data warehouse fundamentals and data mining principles.
  • 2. Design a data warehouse with dimensional modelling and apply OLAP operations.
  • 3. Identify appropriate data mining algorithms to solve real-world problems.
  • 4. Compare and evaluate different data mining techniques such as classification, prediction, clustering, and association rule mining.
  • 5. Describe complex data types with respect to spatial and web mining.
  • 6. Use the experience gained to support research and innovation.


Course Calendar

1 Introduction: Why Data Mining?
2 Introduction: What Is Data Mining?
3 Introduction: A Multi-Dimensional View of Data Mining
4 Introduction: What Kind of Data Can Be Mined?
5 Introduction: Are All Patterns Interesting?
6 Introduction: What Technologies Are Used?
7 Introduction: What Kind of Applications Are Targeted?
8 Introduction: Major Issues in Data Mining
9 Data Objects and Attribute Types: Types of Data Sets
10 Data Objects and Attribute Types: Important Characteristics of Structured Data
11 Data Objects and Attribute Types: Data Objects
12 Data Objects and Attribute Types: Attributes
13 Data Objects and Attribute Types: Attribute Types
14 Data Objects and Attribute Types: Discrete vs. Continuous Attributes

15 Data Visualization: Introduction
16 Data Visualization: Pixel-Oriented Visualization Techniques
17 Basic Statistical Descriptions of Data: Introduction
18 Basic Statistical Descriptions of Data: Measuring the Central Tendency
19 Basic Statistical Descriptions of Data: Symmetric vs. Skewed Data
20 Basic Statistical Descriptions of Data: Measuring the Dispersion of Data
21 Basic Statistical Descriptions of Data: Box plot Analysis
22 Basic Statistical Descriptions of Data: Graphic Displays of Basic Statistical Descriptions using Histogram
23 Basic Statistical Descriptions of Data: Graphic Displays of Basic Statistical Descriptions using Quantile Plot
24 Basic Statistical Descriptions of Data: Graphic Displays of Basic Statistical Descriptions using Scatter plot

25 Data Visualization: Geometric Projection Visualization Techniques
26 Data Visualization: Icon-Based Visualization Techniques
27 Data Visualization: Hierarchical Visualization Techniques
28 Data Visualization: Hierarchical Visualization examples
29 Measuring Data Similarity and Dissimilarity: Introduction
30 Measuring Data Similarity and Dissimilarity: Data Matrix and Dissimilarity Matrix
31 Measuring Data Similarity and Dissimilarity: Proximity Measure
32 Measuring Data Similarity and Dissimilarity: Standardizing Numeric Data
33 Measuring Data Similarity and Dissimilarity: Distance on Numeric Data
34 Measuring Data Similarity and Dissimilarity: Attributes of Mixed Type
35 Measuring Data Similarity and Dissimilarity: Cosine Similarity
36 Why Preprocess the Data: Introduction
37 Why Preprocess the Data: Why Is Data Dirty?
38 Why Preprocess the Data: Multi-Dimensional Measure of Data Quality
39 Why Preprocess the Data: Major Tasks in Data Preprocessing
40 Data Cleaning: Introduction
41 Data Cleaning: Missing Data
42 Data Cleaning: Noisy Data
43 Data Cleaning: How to Handle Noisy data using Binning
44 Data Cleaning: How to Handle Noisy data using Regression and Cluster Analysis

45 Data integration and transformation: Introduction
46 Data integration and transformation: Handling Redundancy in Data Integration
47 Data integration and transformation: Detect Redundancy in Data Integration using Correlation Analysis
48 Data integration and transformation: Data Transformation methods
49 Data integration and transformation: Normalization Example
50 Data reduction: Introduction
51 Data reduction: Data cube aggregation
52 Data reduction: Data Compression
53 Data reduction: Dimensionality Reduction using Wavelet Transformation
54 Data reduction: Dimensionality Reduction using PCA
55 Data reduction: Numerosity Reduction
56 Data reduction: Numerosity Reduction using Regression and Log-Linear Models
57 Data reduction: Numerosity Reduction using Histogram
58 Data reduction: Numerosity Reduction using Clustering
59 Data reduction: Numerosity Reduction using Sampling

60 What is a data warehouse?: Introduction
61 What is a data warehouse?: Subject-Oriented
62 Data warehouse architecture
63 What is a data warehouse?: Data Warehouse vs. Operational DBMS
64 Data warehouse architecture: Data Warehouse Models
65 Data Warehouse-Metadata
66 A multi-dimensional data model
67 A multi-dimensional data model: Example of Star Schema
68 A multi-dimensional data model: Example of Snowflake Schema
69 A multi-dimensional data model: Example of Fact Constellation
70 Concept Hierarchy
71 Multi-Dimensional Data Models
72 A multi-dimensional data model: Typical OLAP Operations

73 NLP
74 Stages of NLP
75 Syntax processing
76 Other stages of NLP
77 Regular expressions
78 Errors
79 Tokenization & Issues in Tokenization
80 Word normalization
81 Language model
82 Chain rule
83 Markov assumption
84 N gram probabilities
85 LM example
86 Spell correction
87 Noisy Channel
88 Candidate Generation
89 Noisy channel probability
90 Bigram Based Correction
91 Text Classification
92 Text Classification Examples
93 BOW Model
94 Formalizing Text Classification
95 Bayes Classification Methods: Why?
96 Naïve Bayes Independence
97 Naïve Bayes Parameters Learning
98 Naïve Bayes Smoothing
99 Naïve Bayes and LM
100 NB Example
101 NB Advantage
102 F Measure

103 Basic Concepts of Mining Frequent patterns: Introduction
104 Basic Concepts of Mining Frequent Patterns, Association and Correlation: Why Is Frequent Pattern Mining Important?
105 Market Basket Analysis
106 Frequent Itemset Mining Methods: Apriori: Example
107 Frequent Itemset Mining Methods: Apriori: Pseudo Code
108 Frequent Itemset Mining Methods: Mining Closed Frequent Patterns and Max-Patterns
109 Frequent Itemset Mining Methods: How to Count Supports of Candidates?
110 Frequent Itemset Mining Methods: Improving the Efficiency of Apriori
111 Basic Concepts of Mining Frequent Patterns, Association and Correlation: Computational Complexity of Frequent Itemset Mining
112 Frequent Itemset Mining Methods: ECLAT: Frequent Pattern Mining with Vertical Data Format
113 Which Patterns Are Interesting? Pattern Evaluation Methods: Interest Measure
114 Which Patterns Are Interesting? Pattern Evaluation Methods: Interest Measure-2

115 Basic Concepts of Classification: Introduction
116 Basic Concepts of Classification: Supervised vs. Unsupervised Learning
117 Basic Concepts of Classification: A Two-Step Process
118 Classification Issues
119 Classification Methods
120 Decision Tree Induction

121 Decision Trees: Introduction
122 Decision Tree Induction Algorithm
123 Pruning
124 Entropy
125 Attribute Selection Introduction
126 Information Gain
127 Gain Ratio
128 Gini Index
129 Attribute Selection Comparison
130 Attributes Measure
131 Enhancement to DTI
132 Introduction to RainForest

133 Example of RainForest
134 BOAT
135 Rule-Based Classification: Using IF-THEN Rules for Classification
136 Rule-Based Classification: Rule Extraction from a Decision Tree
137 Rule-Based Classification: Rule Induction: Sequential Covering Method
138 Model Evaluation and Selection: Introduction
139 Model Evaluation and Selection: Confusion Matrix
140 Model Evaluation and Selection: Accuracy, Error Rate, Sensitivity and Specificity using the Evaluation Matrix
141 Model Evaluation and Selection: Holdout & Cross-Validation Methods
142 Model Evaluation and Selection: Bootstrap for evaluation of classifier
143 Model Evaluation and Selection: ROC Curves
144 Model Evaluation and Selection: Issues Affecting Model Selection

145 Techniques to Improve Classification Accuracy: Ensemble Methods: Introduction
146 Techniques to Improve Classification Accuracy: Ensemble Methods: Bagging: Bootstrap Aggregation
147 Techniques to Improve Classification Accuracy: Ensemble Methods: Boosting
148 Techniques to Improve Classification Accuracy: Ensemble Methods: Random Forest
149 Classification of Imbalanced Data
150 Classification by Backpropagation: Introduction
151 Classification by Backpropagation: Neural Network as a Classifier
152 Classification by Backpropagation: A Multi-Layer Feed-Forward Neural Network
153 Classification by Backpropagation: Defining a Network Topology
154 Classification by Backpropagation: Backpropagation
155 Neural Networks Evaluation
156 Support Vector Machines: Introduction
157 Support Vector Machines: History and Applications
158 Support Vector Machines: General Philosophy

159 Basic Concepts of Cluster Analysis: What is Cluster Analysis?
160 Basic Concepts of Cluster Analysis: Clustering for Data Understanding and Applications
161 Basic Concepts of Cluster Analysis: Clustering as a Preprocessing Tool
162 Basic Concepts of Cluster Analysis: Quality: What Is Good Clustering?
163 Basic Concepts of Cluster Analysis: Measure the Quality of Clustering
164 Clustering Criteria
165 Basic Concepts of Cluster Analysis: Requirements and Challenges
166 Clustering Matrices
167 Types of Data in Cluster Analysis: Interval-Valued Variables
168 Types of Data in Cluster Analysis: Binary Variables
169 Types of Data in Cluster Analysis: Ratio-Scaled Variables
170 Partitioning Methods: Basic Concept
171 Partitioning Methods: The K-Means Clustering Method
172 Partitioning Methods: Comments on the K-Means Method
173 Partitioning Methods: Variations of the K-Means Method
174 Partitioning Methods: The K-Medoids Clustering Method
175 Hierarchical Methods: Introduction
176 Hierarchical Methods: AGNES (Agglomerative Nesting)
177 Hierarchical Methods: DIANA (Divisive Analysis)
178 Density-Based Methods: Introduction

179 DBM Parameters
180 Density-Based Methods: DBSCAN
181 Grid-Based Methods: Introduction
182 Grid-Based Methods: STING: A Statistical Information Grid Approach
183 Grid-Based Methods: Clustering by Wavelet Analysis
184 Model-Based Methods: Introduction
185 Model-Based Methods: EM — Expectation Maximization
186 Model-Based Methods: Conceptual Clustering
187 Model-Based Methods: COBWEB Clustering Method
188 Model-Based Methods: Neural Network Approach
189 Model-Based Methods: Self-Organizing Feature Map (SOM)
190 Outlier Analysis: What Is Outlier Discovery?
191 Outlier Analysis: Statistical Approaches for outlier discovery
192 Outlier Analysis: Distance-Based Approach for outlier discovery
193 Outlier Analysis: Deviation-Based Approach for outlier discovery
194 Clustering High-Dimensional Data: Introduction
195 Clustering High-Dimensional Data: The Curse of Dimensionality
196 Clustering High-Dimensional Data: Why Subspace Clustering?
197 Clustering High-Dimensional Data: CLIQUE

198 Introduction to WEKA
199 Features of WEKA
200 Attributes of WEKA
201 Preprocessing in WEKA
202 WEKA Classifier-Example
203 Demo of WEKA Classifier
204 WEKA Classification-Results
205 Result Visualization
206 WEKA Clustering
207 Association Findings in WEKA

208 Web Mining
209 Web Mining Introduction
210 Introduction to Text Mining
211 IR
212 Information Extraction
213 Web Structure Mining
214 Web Search
215 Cyber Community
216 Types of Cyber Community
217 Web Mining: Usage
218 Web Mining Details
219 Strategies for Web search
220 Web Architecture