(DATA H1030) Data Analysis and Visualisation

The purpose of this module is to give the student a broad but solid grounding in data analysis, with coverage of both statistical and computational (machine learning) concepts, methods and techniques. The module is organised around the stages of the data handling cycle and provides a comparative and integrative view of the statistical and machine learning aspects of the field.

*Curricular information is subject to change

What will I learn?

Data Collection, Cleaning and Preparation

This part of the module outlines the concepts and tasks involved in collecting, cleaning and preparing data for analysis, particularly structured data. These include collection methods, the identification and removal of errors, the combination of data sets from different sources, transformation between numerical and categorical data types, and the re-scaling, derivation and removal of variables.
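Two of the preparation tasks named above, variable re-scaling and transformation from categorical to numerical data, can be sketched as follows. This is only an illustration: the function names `min_max_rescale` and `one_hot`, the choice of min-max scaling and one-hot encoding, and the sample values are assumptions made here, not part of the module specification.

```python
def min_max_rescale(values):
    """Rescale numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(categories):
    """Turn a categorical variable into one 0/1 indicator column per level."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return levels, rows


# Hypothetical sample data, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 200.0]
colours = ["red", "blue", "red"]

scaled = min_max_rescale(heights_cm)
levels, encoded = one_hot(colours)
print(scaled)
print(levels, encoded)
```

Min-max scaling is only one of several re-scaling options; standardising to zero mean and unit variance is a common alternative.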

Single Variable Data Characterisation

This part of the module covers the statistical characterisation of single-variable data through measures of centre and variation and their visualisation, as well as the probabilistic evaluation of those measures using confidence intervals and hypothesis tests.
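As a rough illustration of these ideas, the sketch below computes a measure of centre (the mean), a measure of variation (the sample standard deviation) and an approximate confidence interval for the mean. The function name `summarise`, the sample values and the use of a normal approximation are assumptions made for this example; for small samples a t-based interval would usually be preferred.

```python
import math
import statistics


def summarise(sample, confidence=0.95):
    """Return the mean, sample standard deviation and an approximate CI for the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    # Normal-approximation critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / math.sqrt(n)
    return {
        "mean": mean,
        "sd": sd,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


# Hypothetical measurements, for illustration only.
sample = [4.1, 5.0, 5.6, 4.8, 5.2, 4.9, 5.3, 5.1]
result = summarise(sample)
print(result)
```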

Variable Relationship Characterisation

This part of the module deals with relationships between two or more variables, typically belonging to the same data set. It covers statistical measures of relatedness, such as correlation coefficients and chi-square tests, and of similarity between groups, such as t-tests and ANOVA, as well as attribute-relational concepts used in machine learning, such as criteria for splitting data sets, in particular measures of impurity. It also looks at the use of visualisation for discovering relationships, rather than simply for presenting them.
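Two of the measures mentioned above can be sketched in a few lines: a correlation coefficient as a statistical measure of relatedness, and an impurity measure of the kind used as a splitting criterion in machine learning. The choice of the Pearson coefficient and Gini impurity, the function names and the data are assumptions made for illustration; the module may well cover other variants.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Hypothetical data, for illustration only.
r = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
mixed = gini_impurity(["a", "a", "b", "b"])
pure = gini_impurity(["a", "a"])
print(r, mixed, pure)
```

A candidate split is typically judged by how much it lowers the impurity of the resulting subsets relative to the parent set.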

Group Identification

This part of the module covers unsupervised machine learning methods used for identifying various types of groups in data sets, including groups of instances, identified using hierarchical and non-hierarchical clustering methods, and groups of attribute values, described using association rules.
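To make the idea of non-hierarchical clustering concrete, here is a minimal sketch of k-means, one widely used method of that kind. The function name `k_means`, the fixed starting centroids and the sample points are assumptions made for this example; the module description does not say which clustering algorithms are taught.

```python
def k_means(points, centroids, iterations=10):
    """Group points around centroids by alternating assignment and update steps."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-dimensional instances, for illustration only.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

In practice the starting centroids are usually chosen at random, and the result can depend on that choice.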

Modelling and Prediction

This part of the module examines how models can be built from data sets and how those models can be used to predict the values of variables for new instances of the data. Both statistical and machine learning methods are covered, as well as the contexts in which they apply, particularly with respect to the number and types of variables involved. The methods covered include statistical regression, linear classifiers such as logistic regression and support vector machines, classification trees, probability estimation trees and neural networks.
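The build-then-predict workflow described above can be sketched with the simplest case of statistical regression: a straight line fitted by ordinary least squares to one predictor. The function names `fit_line` and `predict` and the data are assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) model to a new instance."""
    intercept, slope = model
    return intercept + slope * x


# Hypothetical training data, for illustration only.
model = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = predict(model, 5.0)
print(model, prediction)
```

The same two-step pattern, fitting on known instances and then applying the model to new ones, carries over to the classifiers listed above, although their fitting procedures differ.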

Special Topics in Data Analysis

This part of the module looks at domains whose data analysis paradigms do not presuppose structured data. One is the analysis of text using text mining methods. Another is social network analysis, which amounts to the study of network topology while posing its own specific questions. A final example is web mining, in which the structure, content and usage of the web are all analysed.

Visualisation

Visualisation is a recurring theme throughout the module and forms part of the content taught under each of the other headings. A large number of visual data representation types are covered, as is the process of choosing the most suitable representation for particular data. Both traditional and more modern visualisation techniques are covered, including interactive ones.

How will I be assessed?

Module Content & Assessment
Assessment Breakdown	%
Other Assessment(s)	50
Formal Examination	50
