Fundamentals of Scalable Data Science

  • 4.3 rating
Approx. 20 hours to complete

Course Summary

This course is an introduction to Data Science, covering topics such as data preprocessing, visualization, modeling, and evaluation. It is designed for beginners with no prior knowledge of Data Science.

Key Learning Points

  • Learn how to preprocess data and visualize it using Python libraries
  • Understand the basics of machine learning and how to apply it to real-world problems
  • Learn how to evaluate the performance of your models and make data-driven decisions

Learning Outcomes

  • Ability to preprocess and visualize data using Python libraries
  • Understanding of machine learning and its application to real-world problems
  • Ability to evaluate the performance of models and make data-driven decisions

Prerequisites or helpful background before taking this course

  • Basic knowledge of Python programming
  • Familiarity with statistics and linear algebra

Course Difficulty Level

Beginner

Course Format

  • Video lectures
  • Hands-on exercises
  • Real-world examples

Similar Courses

  • Applied Data Science with Python
  • Machine Learning
  • Statistics and Data Science

Description

Apache Spark is the de facto standard for large-scale data processing. This is the first course in a series of courses towards the IBM Advanced Data Science Specialization. We strongly believe that it is crucial for success to start learning a scalable data science platform, since memory and CPU constraints are the most limiting factors when it comes to building advanced machine learning models.

Outline

  • Introduction to the course and grading environment
  • Course Overview and a warm welcome
  • Overview of technology used within the course
  • Intro to Apache Spark
  • Assignment and Exercise Environment Setup
  • IMPORTANT: How to submit your programming assignments
  • Challenges, terminology, methods and technology
  • Tools that support BigData solutions
  • Data storage solutions
  • Parallel data processing strategies of Apache Spark
  • Programming language options on ApacheSpark
  • Functional programming basics
  • Introduction of Cloudant
  • Resilient Distributed Dataset and DataFrames - ApacheSparkSQL
  • OPTIONAL: Test Data Generator (data is provided for you already)
  • Apache Parquet (optional)
  • Create the data on your own (optional)
  • Data storage solutions, and ApacheSpark
  • Programming language options and functional programming
  • ApacheSparkSQL and Cloudant
  • Scaling Math for Statistics on Apache Spark
  • Overview of the week...
  • Averages
  • Standard deviation
  • Skewness
  • Kurtosis
  • Covariance, covariance matrices, correlation
  • Multidimensional vector spaces
  • Exercise 2
  • Averages and standard deviation
  • Skewness and kurtosis
  • Covariance, correlation and multidimensional vector spaces
  • Data Visualization of Big Data
  • Overview of the week
  • Plotting with ApacheSpark and Python's matplotlib
  • Dimensionality reduction
  • PCA
  • Exercise on Plotting
  • Exercise on PCA
  • Visualization and dimension reduction

Summary of User Reviews

Learn data science online with Coursera's comprehensive course. Students love the interactive approach to learning and the practical applications of the material.

Key Aspect Users Liked About This Course

  • Practical applications

Pros from User Reviews

  • Interactive approach to learning
  • Comprehensive course content
  • Practical applications of material
  • Great for beginners and professionals alike

Cons from User Reviews

  • Some material may be too basic for advanced learners
  • Assignments can be time-consuming
  • Limited interaction with instructors
English
Available now
Romeo Kienzler
IBM
Coursera

Instructor

Romeo Kienzler

  • 4.3 Rating