Big Data Analysis: Hive, Spark SQL, DataFrames and GraphFrames

Approx. 37 hours to complete

Course Summary

Learn how to analyze big data and make data-driven decisions with this course. Gain hands-on experience with Hadoop, Spark, and other big data tools.

Key Learning Points

Learn to work with Hadoop and Spark
Gain hands-on experience with real-world big data projects
Learn to make data-driven decisions

Learning Outcomes

Learn to work with Hadoop and Spark to analyze big data
Gain hands-on experience with real-world big data projects
Develop skills to make data-driven decisions

Prerequisites and Recommended Background

Basic knowledge of programming concepts
Familiarity with SQL

Course Difficulty Level

Intermediate

Course Format

Online
Self-paced
Video lectures
Hands-on projects

Similar Courses

Big Data Essentials: HDFS, MapReduce and Spark RDD
Big Data and Hadoop Essentials

Notable People in This Field

Creator of Hadoop
Creator of Apache Spark

Description

Working with huge data volumes is undoubtedly hard, but to move a mountain you have to deal with a lot of small stones. Why strain yourself? MapReduce and Spark solve only part of the problem, which leaves room for higher-level tools. Stop struggling to make your big data workflow productive and efficient, and make use of the tools we are offering you.

This course will teach you how to:
- Warehouse your data efficiently using Hive, Spark SQL and Spark DataFrames.
- Work with large graphs, such as social graphs or networks.
- Optimize your Spark applications for maximum performance.

Specifically, you will deepen your knowledge of:
- Writing and executing Hive and Spark SQL queries;
- Reasoning about how queries are translated into actual execution primitives (be it MapReduce jobs or Spark transformations);
- Organizing your data in Hive to optimize disk space usage and execution times;
- Constructing Spark DataFrames and using them to write ad-hoc analytical jobs easily;
- Processing large graphs with Spark GraphFrames;
- Debugging, profiling and optimizing Spark application performance.

Still in doubt? Check this out. Become a data ninja by taking this course!

Special thanks to:
- Prof. Mikhail Roytberg, APT dept., MIPT, who was the initial reviewer of the project and the supervisor and mentor of half of the BigData team. He was the one who helped to get this show on the road.
- Oleg Sukhoroslov (PhD, Senior Researcher at IITP RAS), who has been teaching MapReduce, Hadoop and friends since 2008. He now leads the infrastructure team.
- Oleg Ivchenko (PhD student, APT dept., MIPT), Pavel Akhtyamov (MSc student, APT dept., MIPT) and Vladimir Kuznetsov (Assistant at P.G. Demidov Yaroslavl State University), superbrains who developed and now maintain the infrastructure used for practical assignments in this course.
- Asya Roitberg, Eugene Baulin, Marina Sudarikova. These people never sleep: they babysit this course day and night to make your learning experience productive, smooth and exciting.

Outline

Welcome to the Second Course: Big Data Analysis
Computations Optimization
What is BigData Analysis?
Tools For BigData Analysis
Graph Data Analysis
Meet Alexey Dral
Meet Pavel Mezentsev
Meet Natalia Pritykovskaya
Meet Pavel Klemenkov
Slack Channel is the quickest way to get answers to your questions

Big Data SQL: Hive
Analytics: Business Use Cases
HTTP Web Service: Access Log Format
Business Use Cases: Solution with Hive
(optional) SQL: likbez
Hive Data Definition Language (DDL)
Hive Data Manipulation Language (DML)
Hive Analytics: RegexSerDe, Views
(optional) Regular Expressions, Likbez
Hive Analytics: UDF, UDAF, UDTF
Hive Streaming
Hive PTF (Window Functions)
Hive Optimization: Partitioning, Bucketing and Sampling
Hive Map-Side Joins: Plain, Bucket, Sort-Merge
Hive Optimization: Data Skew
Hive Optimization: Row-Columnar File Formats, Compression
Hive: SQL over Hadoop MapReduce
Hive Analytics with UDF and Streaming
Hive final

Big Data SQL: Hive (practice week)
How to submit your first assignment
How to Install Docker on Windows 7, 8, 10
How to submit your first Hive assignment
Grading System: Instructions and Common Problems
Docker Installation Guide
Assignments. General requirements
Hive assignment. Intro and instructions

Spark SQL and Spark DataFrame
Advantages of Spark SQL
What is Pandas DataFrame and how to create it
How to process a DataFrame as SQL
Working with Hive
Reading and Writing Files
RDD vs. DF vs. SQL
Projection and Filtering
Functions
Aggregates
Join
User Defined Functions
Time Processing
Window Functions
Two-Dimensional Distributions
Introducing DataFrame and SQL
Spark SQL and Spark DataFrame

Graph Analysis from Big Data Perspective
Graph examples
Graph representation
Counting common friends. Part I
Counting common friends. Part II
Counting common friends. Part III
GraphFrames: Introduction
Motif Finding: DSL
Motif Finding: Counting Mutual Friends
Motif Finding: Under The Hood. Part 1
Motif Finding: Under The Hood. Part 2
Triangles Count: Introduction
Triangles Count: Edge Lists
Triangles Count: GraphFrame
Graph Representations
Motif Finding
Triangles Count
Graph Analysis from Big Data Perspective

PageRank and Recent Advances
Introduction
Algorithm
GraphFrames
Random Walk
Page Rank Algorithm
RDD Implementation
GraphFrames API
Taste Graph. Part I
Taste Graph. Part II
Taste Graph. Part III
Graph based Music Recommender
Connected Components
PageRank
Label Propagation Algorithm (LPA)
PageRank and Recent Advances

Spark Internals and Optimization
Welcome
Spark Execution Model
Shuffle. Where to send data?
Shuffle. How to send data?
Optimizing Functions
PageRank Optimization
Spark SQL. Motivation
Catalyst
Catalyst Optimization Example
Joins
Optimizing Joins
UDF Optimization
Persistence and Checkpointing
Memory Management
Resource Allocation
Dynamic Allocation
Speculative Execution
Deployment of the environment
Spark Execution Model & RDD Internals
Spark SQL and Catalyst
Memory management and resource allocation
Final Quiz

Summary of User Reviews

Learn Big Data Analysis online with Coursera. This course has received positive reviews from users who found it to be very informative and helpful. Many users appreciated the practical examples and real-world applications of the course material.