This module provides an introduction to programming with frameworks designed for the distributed processing of large data sets across clusters of computers. The module describes big data in terms of its volume, velocity, and variety, and how software is used to handle such large volumes of unstructured data.
It demonstrates how to scale software from a single server to multiple servers, and the implications this has for computation and storage. Students will also learn to use software to detect and handle application failure, thereby delivering high availability across a cluster of computers.
The module will utilise a state-of-the-art software framework designed for handling large data sets (e.g., Hadoop, HPCC Systems, Spark). The teaching and learning will be based on practical implementations and problem solving related to the challenges described above.
Software framework architecture/ecosystem and common utilities for large data
Distributed file systems – clusters, nodes, read/writes, data integrity/replication, fault tolerance
MapReduce – processing/generating large data sets, map APIs, failover (see the illustrative sketch after this list)
Job scheduling and cluster management – fair scheduler, user queues
Data warehousing – data summarisation, data types/schemas, query language
Parallel processing – parallel evaluation, execution modes
Structured data storage – schema design, optimise read/write
Multi-master databases – data replication, eventual consistency
Data mining – clustering, classification
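To make the MapReduce topic above concrete, the following is a minimal, illustrative sketch (not part of the module specification) of the map, shuffle, and reduce phases of a word count. It is written in plain Python so it runs on a single machine without any framework; in practice students would implement the equivalent against the chosen framework's own API (e.g., Hadoop or Spark), which distributes the work and handles failover.

```python
# Illustrative only: a single-process sketch of the MapReduce model (word count).
# On a real cluster the framework distributes map tasks, shuffles intermediate
# (key, value) pairs by key, and runs reduce tasks, with failover handled for you.
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs: Iterable[Tuple[str, int]]) -> dict:
    """Shuffle: group intermediate values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

if __name__ == "__main__":
    documents = ["big data needs big clusters",
                 "clusters of computers process big data"]
    intermediate = [pair for line in documents for pair in map_phase(line)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(counts)  # e.g. {'big': 3, 'data': 2, 'clusters': 2, ...}
```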
Lectures/labs, discussion, practical examples, problem-solving exercises, project work, self-directed learning.
Note: computer labs must have the relevant software installed and available to students.
|Module Content & Assessment