Course overview:
The workshop covers the basic concepts of Hadoop, with a focus on the Cloudera stack: querying data with HBase and Impala, and streaming data with Spark. We then launch a Cloudera QuickStart container and work with a dataset of top-rated movies, analyzing and querying the data with Hadoop while explaining and demonstrating MapReduce concepts and RDD partitioning in Spark.
Objectives
The main goal is to understand what big data is, how to ingest data, the main concepts behind a Hadoop data warehouse, and how to use Spark for processing and streaming big data.
Target audience
Entry-level big data practitioners, DBAs, and BI engineers with some familiarity with open-source systems.
Technical requirements
- Installations:
- Install Docker on Linux (Debian/Ubuntu): sudo apt-get install docker.io
- Download the Cloudera QuickStart image: docker pull cloudera/quickstart:latest
- Start the Cloudera QuickStart container (publishing Hue on 8888 and Cloudera Manager on 7180):
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -p 80:80 -p 7180:7180 -d cloudera/quickstart:latest /usr/bin/docker-quickstart
Duration: 1 day
Agenda:
- Part 1: Introduction to Hadoop and MapReduce (a word-count sketch follows this part's list):
- Hadoop distributions
- Hadoop vs. traditional data storage
- Working with HDFS
- Basic commands
- Architecture
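To make the MapReduce concept concrete, here is a minimal word-count sketch in the Hadoop Streaming style: two small Python scripts that read lines on stdin and emit tab-separated key/value pairs on stdout. The file names mapper.py and reducer.py are illustrative; in the workshop they would be submitted with the hadoop-streaming jar that ships with CDH.

# mapper.py: emit "word<TAB>1" for every word in the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py: sum the counts per word (Hadoop sorts the mapper output by key before the reducer sees it)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))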
- Part 2: Hive and HBase (an HBase client sketch follows this part's list):
- HiveQL
- Hive Data types
- HBase data model
- HBase vs RDBMS
- Client API and REST
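As a taste of the HBase client API, the sketch below uses happybase, a Python Thrift client for HBase. It assumes the HBase Thrift server is running on the QuickStart host (default port 9090); the 'movies' table and the 'info' column family are illustrative names, not part of the workshop dataset.

import happybase

# Connect to the HBase Thrift server on the QuickStart host (assumes it is running)
connection = happybase.Connection('quickstart.cloudera')

# The 'movies' table with column family 'info' is assumed to already exist
table = connection.table('movies')

# Write one row: row key plus column-family:qualifier -> value (HBase stores raw bytes)
table.put(b'movie_001', {b'info:title': b'The Shawshank Redemption',
                         b'info:rating': b'9.3'})

# Read the row back as a dict of {column: value}
row = table.row(b'movie_001')
print(row[b'info:title'])

connection.close()

The REST interface covered in this part exposes the same operations over HTTP through the HBase REST server, so no Thrift client is needed on the caller's side.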
- Part 3: Apache Spark with PySpark (a PySpark sketch follows this part's list):
- Spark basics and RDDs
- Caching & Modules
- Spark Streaming
- Spark SQL
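To preview Part 3, here is a minimal PySpark sketch that builds an RDD from a ratings file, caches it, computes per-movie average ratings, and then answers the same question through Spark SQL. The HDFS path and the userId,movieId,rating column layout are assumptions about the workshop's movie dataset, and the sketch uses the Spark 2.x SparkSession API; on the older Spark that ships with the QuickStart image, SQLContext plays the same role.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("workshop-demo").getOrCreate()
sc = spark.sparkContext

# RDD basics: parse "userId,movieId,rating" lines into (movieId, rating) pairs
lines = sc.textFile("hdfs:///user/cloudera/ratings.csv")   # path is an assumption
pairs = lines.map(lambda l: l.split(",")) \
             .map(lambda f: (f[1], float(f[2]))) \
             .cache()                                       # keep the parsed RDD in memory for reuse

# Average rating per movie with a reduceByKey aggregation
avgs = pairs.mapValues(lambda r: (r, 1)) \
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
            .mapValues(lambda s: s[0] / s[1])
print(avgs.take(5))

# Spark SQL: the same question expressed as a query over a DataFrame built from the RDD
df = spark.createDataFrame(pairs, ["movieId", "rating"])
df.createOrReplaceTempView("ratings")
spark.sql("SELECT movieId, AVG(rating) AS avg_rating "
          "FROM ratings GROUP BY movieId "
          "ORDER BY avg_rating DESC LIMIT 5").show()

spark.stop()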