Distributed Computing with Spark SQL

  • 4.5 rating
Approx. 14 hours to complete

Course Summary

Learn how to use Spark SQL to analyze big data in this hands-on course. Gain skills in data cleaning, manipulation, and querying with Spark SQL and become proficient in working with large datasets.

Key Learning Points

  • Learn how to use Spark SQL to analyze big data
  • Gain skills in data cleaning, manipulation, and querying with Spark SQL
  • Become proficient in working with large datasets

Learning Outcomes

  • Understand the basics of Spark SQL and its use in big data analysis
  • Gain skills in data cleaning, manipulation, and querying with Spark SQL
  • Become proficient in working with large datasets

Prerequisites and recommended background

  • Basic programming knowledge in Python or Scala
  • Familiarity with SQL

Course Difficulty Level

Intermediate

Course Format

  • Online
  • Self-paced
  • Hands-on

Similar Courses

  • Big Data Analysis with Apache Spark
  • Data Warehousing for Business Intelligence
  • Data Engineering, Big Data, and Machine Learning on GCP

Notable People in This Field

  • Matei Zaharia
  • Michael Armbrust

Description

This course is about big data. It is for students with SQL experience who want to take the next step on their data journey by learning distributed computing with Apache Spark. Students will gain a thorough understanding of this open-source standard for working with large datasets and of the fundamentals of data analysis using SQL on Spark, laying the foundation for combining data with advanced analytics at scale and in production environments. The four modules build on one another; by the end of the course you will understand the Spark architecture, queries within Spark, common ways to optimize Spark SQL, and how to build reliable data pipelines.

Knowledge

  • Use the collaborative Databricks workspace to write scalable Spark SQL code that executes against a cluster of machines
  • Inspect the Spark UI to analyze query performance and identify bottlenecks
  • Create an end-to-end pipeline that reads data, transforms it, and saves the result
  • Build a medallion (bronze, silver, gold) lakehouse architecture with Delta Lake to ensure the reliability, scalability, and performance of your data

Outline

  • Introduction to Spark
      • Course Introduction
      • Why Distributed Computing?
      • Spark DataFrames
      • The Databricks Environment
      • SQL in Notebooks
      • Import Data
      • A Note From UC Davis
      • Readings and Resources
      • Assignment #1 - Queries in Spark SQL
      • Assignment #1 Quiz - Queries in Spark SQL
      • Module 1 Quiz
  • Spark Core Concepts
      • Module Introduction
      • Spark Terminology
      • Caching
      • Shuffle Partitions
      • Spark UI
      • Adaptive Query Execution (AQE)
      • Readings
      • Assignment #2 - Spark Internals
      • Assignment #2 Quiz - Spark Internals
      • Module 2 Quiz
  • Engineering Data Pipelines
      • Module Introduction
      • Spark as a Connector
      • Accessing Data
      • File Formats
      • JSON, Schemas and Types
      • Writing Data
      • Tables and Views
      • Readings
      • Assignment #3 - Engineering Data Pipelines
      • Assignment #3 Quiz - Engineering Data Pipelines
      • Module 3 Quiz
  • Data Lakes, Warehouses and Lakehouses
      • Module Introduction
      • Data Lakes vs. Data Warehouses
      • What is a Lakehouse?
      • Delta Lake
      • Delta Lake (Demo)
      • Delta Advanced Features (Demo)
      • Continuing with Spark and Data Science
      • Course Summary
      • Readings
      • Assignment #4 - Lakehouse
      • Assignment #4 Quiz - Lakehouse
      • Module 4 Quiz

Summary of User Reviews

Learn about Spark SQL with this comprehensive course on Coursera. Students praise the course's depth and clear explanations, and many reviewers single out the practical exercises, which let them apply what they've learned.

Pros from User Reviews

  • The course is very comprehensive and covers a lot of ground
  • The explanations are clear and easy to follow
  • The practical exercises allow students to apply what they've learned
  • The course is well-structured and easy to navigate
  • The instructors are knowledgeable and engaging

Cons from User Reviews

  • Some users found the pace to be too slow
  • The course assumes some prior knowledge of SQL and programming
  • The lectures can be a bit dry and repetitive at times
  • The quizzes and assignments can be challenging for beginners
  • The course could benefit from more interactive elements
English
Available now
Brooke Wenig, Conor Murphy
University of California, Davis
Coursera

Instructor

Brooke Wenig

  • 4.5 Rating