Hadoop Decoded




Hadoop is more popular than ever and is generating data-driven business value across every industry. This course gives attendees the essential skills to build Big Data applications using Hadoop technologies such as HDFS, YARN, Apache Kafka, Apache Hive and Apache Spark, in an analytical ecosystem with Teradata components such as Teradata Database, Teradata Viewpoint, and Teradata QueryGrid.

In this course, students will have access to their own cluster to gain hands-on experience. Students will use Hadoop's distributed file system and process distributed datasets with Hive. In addition, students will develop Spark applications in Scala and Python using RDDs and DataFrames.

Students will write applications using Hive and Spark and learn about common issues encountered when processing vast datasets in distributed systems.

Discussion of additional tools and Hadoop distributions, together with the opportunity to put questions to Hadoop experts, makes this popular course an essential grounding for companies looking to implement Hadoop effectively in their enterprise.


Hive Developers, Spark Developers, Hadoop Developers, Data Scientists, Business Analysts/Data Analysts, and Data Engineers


After successfully completing this course, you will be able to:

  • Describe the issues of 'Big Data' and how they are remedied using Hadoop.
  • Describe the Hadoop architecture and its core components (HDFS, YARN).
  • Load data into Hadoop from various sources (Flume, Sqoop, Kafka).
  • Use Hive to analyze structured and unstructured data at large scale.
  • Explain the importance of the Hive Metastore.
  • Write applications with Spark using RDDs and Spark SQL using DataFrames.
  • Use Spark SQL to analyze datasets from Hive via the Hive Metastore.
  • Use Spark Streaming and Structured Streaming for near-real-time analysis.
  • Integrate Hadoop with Teradata (Teradata Unified Data Architecture, Teradata Viewpoint, Teradata QueryGrid).


Module 0. Introduction and Setup:

  • Introduction to the course; set up and connect to the cloud lab environment
  • Fire up Hadoop
  • Open a PuTTY terminal, Firefox, and WinSCP

Module 1. Hadoop Basics:

  • Why Hadoop was developed and the problems it solves
  • Architecture (HDFS and YARN)
  • Introduction to common Hadoop components

Module 2. Ingesting Data into Hadoop:
How to load data into Hadoop using several popular ingest utilities

  • Flume
  • Sqoop
  • WebHDFS
  • Kafka
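To give a flavour of what these ingest utilities look like in practice, here is a minimal Flume agent configuration sketch that watches a spooling directory and lands files in HDFS. The agent name, paths, and capacities are illustrative assumptions, not lab values:

```properties
# Illustrative Flume agent: spooldir source -> memory channel -> HDFS sink.
# All names and paths are assumptions for this sketch.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/incoming
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/raw/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```

An agent like this would be launched with `flume-ng agent --conf-file <file> --name a1`; Sqoop, WebHDFS, and Kafka follow the same pattern of declaring a source and a Hadoop-side destination.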

Module 3. Hive Basics:

  • Introduction to how Hive works
  • What Hive is (and isn’t)
  • The Hive Metastore
  • Creating tables, loading data, schema-on-read
  • Storage formats and SerDes
  • Hive with unstructured data
  • Logically partitioning tables
  • Complex Data Types (arrays, maps, structs)
  • UDFs, Joins, Explain
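Schema-on-read, a central idea in this module, means the raw bytes stay untouched in HDFS and a schema is applied (via a SerDe) only at read time. A pure-Python sketch of the concept, where the field names, types, and sample rows are made up for illustration (this is not a real Hive SerDe):

```python
# Schema-on-read sketch: the "file" is raw delimited text; types are
# applied only when a row is read, never when it is stored.
raw_lines = [
    "1,alice,2021-03-01,19.99",
    "2,bob,2021-03-02,5.00",
]

# The "schema" lives separately from the data, like a Hive table definition.
schema = [("id", int), ("name", str), ("order_date", str), ("amount", float)]

def read_with_schema(line, schema):
    """Apply the schema while reading, the way a SerDe deserializes a row."""
    fields = line.split(",")  # delimited "storage format"
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = [read_with_schema(line, schema) for line in raw_lines]
print([r["name"] for r in rows])  # → ['alice', 'bob']
```

Changing the table definition changes how the same bytes are interpreted on the next read, which is why Hive can project different schemas over identical raw files.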

Module 4. Spark Architecture and Concepts:

  • Architecture (Spark versus MapReduce, Spark Building Blocks, Component Location, Execution Speed)
  • Why use Spark
  • Deployment options (Spark on Hadoop versus Spark standalone)
  • Terms and nomenclature

Module 5. Spark Core (Scala and Python):

  • How to use Spark from the Scala and Python languages
  • About Spark (Spark session, Spark shell and Zeppelin, Spark logs)
  • Setting up the lab environment
  • Scala/Python for Spark (immutability, anonymous functions)
  • Resilient Distributed Datasets (RDDs, RDD Creation, RDD Operations, RDD Persistence)
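The RDD programming model chains lazy transformations (map, filter, flatMap) and finishes with an action (reduce, collect) that triggers execution. As a taste of the style covered here, a toy single-machine stand-in for an RDD — only the method names mirror Spark's API; there is no cluster, laziness, or partitioning in this sketch:

```python
from functools import reduce

class LocalRDD:
    """A toy, single-machine stand-in for a Spark RDD (illustration only)."""
    def __init__(self, data):
        self.data = list(data)

    # Transformations return a new "RDD", mirroring Spark's immutability.
    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def filter(self, f):
        return LocalRDD(x for x in self.data if f(x))

    def flatMap(self, f):
        return LocalRDD(y for x in self.data for y in f(x))

    # Actions materialize a result, like collect()/reduce() in Spark.
    def collect(self):
        return self.data

    def reduce(self, f):
        return reduce(f, self.data)

lines = LocalRDD(["big data on hadoop", "spark on hadoop"])
counts = (lines.flatMap(str.split)           # one record per word
               .filter(lambda w: w != "on")  # drop a stop word
               .map(lambda w: 1)             # one 1 per surviving word
               .reduce(lambda a, b: a + b))  # action: total the 1s
print(counts)  # → 5
```

In PySpark the same chain reads almost identically, but each step describes distributed work that only runs when the action is called.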

Module 6. Spark SQL and DataFrames:

  • About Spark SQL (SQLContext and Hive Context)
  • Spark DataFrames (DFs, DF creation, the DF API)
  • Spark SQL (querying Hive, querying DF, spark-sql shell)
  • Speed versus ease of use
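The core idea of this module — the same question can be asked through an API call or through a SQL string against a registered table — can be sketched with nothing but the standard library. Here sqlite3 stands in for the Spark SQL engine and Metastore, and the table, columns, and sample rows are invented for the example:

```python
import sqlite3

orders = [(1, "alice", 19.99), (2, "bob", 5.00), (3, "alice", 7.50)]

# "API style": express the query as ordinary method-like operations.
api_total = sum(amount for _, name, amount in orders if name == "alice")

# "SQL style": register the data as a table and ask the same question in
# SQL, the way Spark SQL queries a DataFrame registered as a temp view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
(sql_total,) = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE name = 'alice'"
).fetchone()

print(api_total == sql_total)  # both paths give the same answer
```

In Spark both forms compile to the same optimized plan, which is why the choice between the DataFrame API and SQL is largely one of ease of use.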

Module 7. Spark Streaming:

  • About Spark Streaming (Introduction, fundamental concepts, components, streaming sources)
  • Unstructured Streaming (DStream, Streaming Program)
  • Structured Streaming (Datasets/DataFrames, Streaming Program)
  • Structured versus Unstructured
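The DStream model covered here chops a continuous source into small batches and runs the same computation on each one. A generator-based Python sketch of that micro-batch idea — the batch size and toy "source" are assumptions, and real DStreams also handle time, state, and fault tolerance:

```python
from itertools import islice

def micro_batches(source, batch_size):
    """Slice a (potentially unbounded) iterator into fixed-size batches,
    loosely mimicking how a DStream is a sequence of small RDDs."""
    it = iter(source)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Toy "stream" of readings; in Spark this would be Kafka, sockets, etc.
readings = iter([3, 7, 2, 9, 4, 1, 6])

# The same computation (here, a per-batch sum) runs on every batch.
per_batch_sums = [sum(batch) for batch in micro_batches(readings, 3)]
print(per_batch_sums)  # → [12, 14, 6]
```

Structured Streaming keeps the micro-batch execution but exposes the stream as an unbounded DataFrame instead of a sequence of batches.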

Module 8. Hackathon:

  • Hands-on labs developing Hadoop applications from scratch

Module 9. Integrating Teradata with Hadoop Applications (optional):

  • Introduction to Teradata Unified Data Architecture
  • Introduction to Teradata Viewpoint
  • Introduction to Teradata QueryGrid
  • How to set up Teradata QueryGrid links for:

    • Teradata-Hive
    • Teradata-Spark
  • How to query from:

    • Teradata-to-Hive
    • Hive-to-Teradata
    • Teradata-to-Spark
    • Spark-to-Teradata


To get the most out of this training, you should have the following knowledge or experience:

  • Students are expected to have some prior programming experience and to be able to use basic Linux commands.
  • Experience in SQL, Scala and Python will be a distinct advantage.
  • Prior Hadoop experience is a bonus.

