Data Organization and Data Mining

General

Educational goals

The course aims to give students knowledge of, and practical skills in, modern topics and practices for organizing and processing data in order to extract knowledge from it. On data organization, the topics covered include Online Analytical Processing (OLAP) and data warehousing. On data mining techniques, emphasis is placed on classification, clustering, and association rule mining. Recommender systems are also covered. The practical part of the course uses the scikit-learn library for the Python programming language and the WEKA software. Upon successful completion of the course, the student is able to:

  • Understand the applications of Knowledge Discovery in Databases and the stages of the Knowledge Discovery process
  • Apply appropriate data preprocessing techniques to prepare data for Knowledge Discovery
  • Understand how various classification, clustering, and association rule mining algorithms work and be able to apply them to data
  • Develop and implement data mining scenarios using the WEKA software and the scikit-learn library in Python
  • Evaluate the performance of data mining algorithms using appropriate validation techniques and assess the generated results for decision making
  • Design and implement data warehouses, apply ETL processes, and run OLAP queries and data mining algorithms on them
  • Understand how recommender systems work
General Skills
  • Search, analysis, and synthesis of data and information using the necessary technologies.
  • Decision making.
  • Independent work or teamwork.
  • Promotion of free, creative, and inductive thinking.

Course Contents

  • Introduction to data organization and mining
  • Data preparation (data cleaning, missing-value imputation, feature selection and extraction, discretization, handling class imbalance in classification problems, etc.)
  • Introduction to classification, categories of classification problems, categories of classification algorithms, probability-based algorithms (e.g., naive Bayes), space-partitioning algorithms (e.g., decision trees), similarity/distance-based algorithms (e.g., nearest neighbors), efficient nearest neighbor search through data indexing (e.g., k-d tree), data reduction techniques, multi-label classification
  • Classification performance metrics and techniques for validating the performance of classification algorithms
  • Introduction to data clustering, types of clusters, categories of clustering algorithms, clustering algorithms: k-means algorithm and its variations (k-medians, k-modes, and k-prototypes), hierarchical clustering, density-based clustering, DBSCAN algorithm, techniques for determining the parameters of clustering algorithms (Elbow, Silhouette, dendrogram, k-dist graph), interpretation of clustering results and estimation of clustering performance
  • Association rules, Apriori algorithm for discovering association rules, evaluation measures of association rules, FP-growth and Eclat algorithms
  • OLTP and OLAP, Data Warehouse design and implementation, star and snowflake schemas, Extract-Transform-Load (ETL) processes, multidimensional data cubes, OLAP queries and data mining in data warehouses
  • Introduction to Recommender Systems
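
As an illustration of the kind of scenario the laboratory part covers, the following is a minimal sketch that combines three of the topics listed above: data preparation, nearest-neighbor classification with a k-d tree index, and validation of classifier performance by cross-validation. It is not taken from the course material. The Iris dataset, the function name `evaluate_knn`, and all parameter values are assumptions chosen only to keep the example small and self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_knn(n_neighbors=5, n_splits=5, seed=0):
    """Return the per-fold accuracy of a k-NN pipeline on the Iris data.

    evaluate_knn and its defaults are hypothetical, chosen for illustration.
    """
    X, y = load_iris(return_X_y=True)

    # Preprocessing is placed inside the pipeline so that it is re-fit on
    # each training fold and never sees the held-out fold.
    model = make_pipeline(
        # Iris has no missing values; the imputer only marks where
        # missing-value imputation would go on real data.
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        # algorithm="kd_tree" asks for a k-d tree index for neighbor search.
        KNeighborsClassifier(n_neighbors=n_neighbors, algorithm="kd_tree"),
    )

    # Stratified folds keep the class proportions similar in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy")


scores = evaluate_knn()
print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```

Keeping the imputer and scaler inside the pipeline is a deliberate choice: fitting them on the full dataset before splitting would leak information from the test folds into training and make the accuracy estimate optimistic.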

Teaching Methods - Evaluation

Teaching Method
  • Face-to-Face Teaching
  • Case Studies: Data Preparation, Data Transformation, and Data Processing
  • Hands-On Laboratory/Computer Practice
Use of ICT
  • ICT-based teaching
  • Virtual machine with preinstalled course software
  • Video recordings of current and past course lectures, available on the Internet
  • Educational content available through the CMS (Moodle)
Teaching Organization
Activity                                             Semester workload
Lectures                                             52
Preparation for laboratory exercises and projects    20
Projects                                             48
Individual study and analysis of literature          60
Total                                                180
Student evaluation

Languages: Greek, English
Class project and hands-on laboratory practice
Written final exam involving multiple-choice questions and problem solving

Recommended Bibliography

Recommended Bibliography through "Eudoxus"
  1. P. Tan, M. Steinbach, A. Karpatne, V. Kumar, "Εισαγωγή στην Εξόρυξη Δεδομένων" [Introduction to Data Mining], Εκδόσεις Α. Τζιόλα & Υιοί Α.Ε., 2nd ed., 2018, ISBN: 978-960-418-813-0, Eudoxus code: 77107675
  2. M.J. Zaki, W. Meira Jr., "Εξόρυξη και Ανάλυση Δεδομένων: Βασικές Έννοιες και Αλγόριθμοι" [Data Mining and Analysis: Basic Concepts and Algorithms], Εκδόσεις Κλειδάριθμος ΕΠΕ, 1st ed., 2017, ISBN: 978-960-461-770-8, Eudoxus code: 68386089
  3. Αλ. Νανόπουλος, Γ. Μανωλόπουλος, "Εισαγωγή στην Εξόρυξη Δεδομένων και τις Αποθήκες Δεδομένων" [Introduction to Data Mining and Data Warehouses], Εκδόσεις Νέων Τεχνολογιών, 1st ed., 2008, ISBN: 978-960-6759-17-8, Eudoxus code: 9457
Complementary Greek bibliography
  1. A. Rajaraman, J.D. Ullman, "Εξόρυξη από Μεγάλα Σύνολα Δεδομένων" [Mining of Massive Datasets], Εκδόσεις Νέων Τεχνολογιών, 1st ed., 2014, ISBN: 978-960-6759-83-3
  2. Μ. Βαζιργιάννης, Μ. Χαλκίδη, "Εξόρυξη Γνώσης από Βάσεις Δεδομένων και τον Παγκόσμιο Ιστό" [Knowledge Mining from Databases and the World Wide Web], Γ. Δαρδανός - Κ. Δαρδανός Ο.Ε., 2nd ed., 2005, ISBN: 978-960-402-116-8
  3. R.J. Roiger, M.W. Geatz, "Data Mining: A Tutorial-Based Primer", Εκδόσεις Κλειδάριθμος Ε.Π.Ε., 2nd ed., 2008, ISBN: 978-960-461-206-2
  4. M. H. Dunham, "Data Mining: Introductory and Advanced Topics", Εκδόσεις Νέων Τεχνολογιών, ISBN: 9789608105720
Complementary international bibliography
  1. Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791