Course overview:
The workshop covers the basic concepts of Hadoop, with a focus on the Cloudera stack: querying data with HBase and Impala, and streaming data with Spark. We then launch a Cloudera QuickStart container and work with a dataset of top-rated movies, analyzing and querying the data with Hadoop while explaining and demonstrating MapReduce concepts and RDD partitioning in Spark.
Objectives
The main goal is to understand what big data is, how to ingest data, the main concepts behind a Hadoop data warehouse, and how to use Spark for processing and streaming big data.
Target audience
Entry-level big data practitioners, DBAs, and BI engineers with some familiarity with open-source systems.
Technical requirements
- Installations:
- Install Docker on Linux (Debian/Ubuntu): sudo apt-get install docker.io
- Download the Cloudera QuickStart image: docker pull cloudera/quickstart:latest
- Start the Cloudera QuickStart container (publishing Hue on 8888 and Cloudera Manager on 7180):
docker run --hostname=quickstart.cloudera --privileged=true -t -i -p 8888:8888 -p 80:80 -p 7180:7180 -d cloudera/quickstart:latest /usr/bin/docker-quickstart
Duration: 1 day
Agenda:
- Part 1: Introduction to Hadoop and MapReduce (a word-count sketch follows this part's list):
- Hadoop distributions
- Hadoop vs. traditional data storage
- Working with HDFS
- Basic commands
- Architecture
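To make the MapReduce concept concrete, here is a minimal word-count sketch in the Hadoop Streaming style: two small Python scripts that read lines on stdin and emit tab-separated key/value pairs on stdout. The file names mapper.py and reducer.py are illustrative; in the workshop they would be submitted with the hadoop-streaming jar that ships with CDH.

# mapper.py: emit "word<TAB>1" for every word in the input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py: sum the counts per word (Hadoop sorts the mapper output by key before the reducer sees it)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))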
- Part 2: Hive and HBase (an HBase client sketch follows this part's list):
- HiveQL
- Hive Data types
- HBase data model
- HBase vs RDBMS
- Client API and REST
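As a taste of the HBase client API, the sketch below uses happybase, a Python Thrift client for HBase. It assumes the HBase Thrift server is running on the QuickStart host (default port 9090); the 'movies' table and the 'info' column family are illustrative names, not part of the workshop dataset.

import happybase

# Connect to the HBase Thrift server on the QuickStart host (assumes it is running)
connection = happybase.Connection('quickstart.cloudera')

# The 'movies' table with column family 'info' is assumed to already exist
table = connection.table('movies')

# Write one row: row key plus column-family:qualifier -> value (HBase stores raw bytes)
table.put(b'movie_001', {b'info:title': b'The Shawshank Redemption',
                         b'info:rating': b'9.3'})

# Read the row back as a dict of {column: value}
row = table.row(b'movie_001')
print(row[b'info:title'])

connection.close()

The REST interface covered in this part exposes the same operations over HTTP through the HBase REST server, so no Thrift client is needed on the caller's side.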
- Part 3: Apache Spark with PySpark (a PySpark sketch follows this part's list):
- Spark basics and RDDs
- Caching & Modules
- Spark Streaming
- Spark SQL
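To preview Part 3, here is a minimal PySpark sketch that builds an RDD from a ratings file, caches it, computes per-movie average ratings, and then answers the same question through Spark SQL. The HDFS path and the userId,movieId,rating column layout are assumptions about the workshop's movie dataset, and the sketch uses the Spark 2.x SparkSession API; on the older Spark that ships with the QuickStart image, SQLContext plays the same role.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("workshop-demo").getOrCreate()
sc = spark.sparkContext

# RDD basics: parse "userId,movieId,rating" lines into (movieId, rating) pairs
lines = sc.textFile("hdfs:///user/cloudera/ratings.csv")   # path is an assumption
pairs = lines.map(lambda l: l.split(",")) \
             .map(lambda f: (f[1], float(f[2]))) \
             .cache()                                       # keep the parsed RDD in memory for reuse

# Average rating per movie with a reduceByKey aggregation
avgs = pairs.mapValues(lambda r: (r, 1)) \
            .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
            .mapValues(lambda s: s[0] / s[1])
print(avgs.take(5))

# Spark SQL: the same question expressed as a query over a DataFrame built from the RDD
df = spark.createDataFrame(pairs, ["movieId", "rating"])
df.createOrReplaceTempView("ratings")
spark.sql("SELECT movieId, AVG(rating) AS avg_rating "
          "FROM ratings GROUP BY movieId "
          "ORDER BY avg_rating DESC LIMIT 5").show()

spark.stop()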