Data Manipulation at Scale: Systems and Algorithms

  • 4.3
Approx. 20 hours to complete

Course Summary

Learn how to manipulate data using the popular programming language R with this comprehensive Coursera course. Gain hands-on experience with data wrangling, data cleaning, and data visualization techniques to prepare you for a career in data science.

Key Learning Points

  • Gain hands-on experience using R for data manipulation
  • Learn how to clean and transform data for analysis
  • Create visualizations to better understand your data

Learning Outcomes

  • Ability to manipulate and clean data using R
  • Understanding of data visualization techniques
  • Preparedness for a career in data science

Prerequisites and recommended background

  • Basic understanding of programming concepts
  • Familiarity with R programming language

Course Difficulty Level

Intermediate

Course Format

  • Online
  • Self-paced

Similar Courses

  • Data Analysis and Visualization Foundations
  • Data Science Essentials
  • Data Mining

Description

Data analysis has replaced data acquisition as the bottleneck to evidence-based decision making: we are drowning in data. Extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but also the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.

Outline

  • Data Science Context and Concepts
  • Appetite Whetting: Politics
  • Appetite Whetting: Extreme Weather
  • Appetite Whetting: Digital Humanities
  • Appetite Whetting: Bibliometrics
  • Appetite Whetting: Food, Music, Public Health
  • Appetite Whetting: Public Health cont'd, Earthquakes, Legal
  • Characterizing Data Science
  • Characterizing Data Science, cont'd
  • Distinguishing Data Science from Related Topics
  • Four Dimensions of Data Science
  • Tools vs. Abstractions
  • Desktop Scale vs. Cloud Scale
  • Hackers vs. Analysts
  • Structs vs. Stats
  • Structs vs. Stats cont'd
  • A Fourth Paradigm of Science
  • Data-Intensive Science Examples
  • Big Data and the 3 Vs
  • Big Data Definitions
  • Big Data Sources
  • Course Logistics
  • Twitter Assignment: Getting Started
  • Supplementary: Three-Course Reading List
  • Supplementary: Resources for Learning Python
  • Supplementary: Class Virtual Machine
  • Supplementary: Github Instructions
  • Relational Databases and the Relational Algebra
  • Data Models, Terminology
  • From Data Models to Databases
  • Pre-Relational Databases
  • Motivating Relational Databases
  • Relational Databases: Key Ideas
  • Algebraic Optimization Overview
  • Relational Algebra Overview
  • Relational Algebra Operators: Union, Difference, Selection
  • Relational Algebra Operators: Projection, Cross Product
  • Relational Algebra Operators: Cross Product cont'd, Join
  • Relational Algebra Operators: Outer Join
  • Relational Algebra Operators: Theta-Join
  • From SQL to RA
  • Thinking in RA: Logical Query Plans
  • Practical SQL: Binning Timeseries
  • Practical SQL: Genomic Intervals
  • User-Defined Functions
  • Support for User-Defined Functions
  • Optimization: Physical Query Plans
  • Optimization: Choosing Physical Plans
  • Declarative Languages
  • Declarative Languages: More Examples
  • Views: Logical Data Independence
  • Indexes
  • MapReduce and Parallel Dataflow Programming
  • What Does Scalable Mean?
  • A Sketch of Algorithmic Complexity
  • A Sketch of Data-Parallel Algorithms
  • "Pleasingly Parallel" Algorithms
  • More General Distributed Algorithms
  • MapReduce Abstraction
  • MapReduce Data Model
  • Map and Reduce Functions
  • MapReduce Simple Example
  • MapReduce Simple Example cont'd
  • MapReduce Example: Word Length Histogram
  • MapReduce Examples: Inverted Index, Join
  • Relational Join: Map Phase
  • Relational Join: Reduce Phase
  • Simple Social Network Analysis: Counting Friends
  • Matrix Multiply Overview
  • Matrix Multiply Illustrated
  • Shared Nothing Computing
  • MapReduce Implementation
  • MapReduce Phases
  • A Design Space for Large-Scale Data Systems
  • Parallel and Distributed Query Processing
  • Teradata Example, MR Extensions
  • RDBMS vs. MapReduce: Features
  • RDBMS vs. Hadoop: Grep
  • RDBMS vs. Hadoop: Select, Aggregate, Join
  • NoSQL: Systems and Concepts
  • NoSQL Context and Roadmap
  • NoSQL Roundup
  • Relaxing Consistency Guarantees
  • Two-Phase Commit and Consensus Protocols
  • Eventual Consistency
  • CAP Theorem
  • Types of NoSQL Systems
  • ACID, Major Impact Systems
  • Memcached: Consistent Hashing
  • Consistent Hashing, cont'd
  • DynamoDB: Vector Clocks
  • Vector Clocks, cont'd
  • CouchDB Overview
  • CouchDB Views
  • BigTable Overview
  • BigTable Implementation
  • HBase, Megastore
  • Spanner
  • Spanner cont'd, Google Systems
  • MapReduce-based Systems
  • Bringing Back Joins
  • NoSQL Rebuttal
  • Almost SQL: Pig
  • Pig Architecture and Performance
  • Data Model
  • Load, Filter, Group
  • Group, Distinct, Foreach, Flatten
  • CoGroup, Join
  • Join Algorithms
  • Skew
  • Other Commands
  • Evaluation Walkthrough
  • Review
  • Context
  • Spark Examples
  • RDDs, Benefits
  • Graph Analytics
  • Graph Overview
  • Structural Analysis
  • Degree Histograms, Structure of the Web
  • Connectivity and Centrality
  • PageRank
  • PageRank in more Detail
  • Traversal Tasks: Spanning Trees and Circuits
  • Traversal Tasks: Maximum Flow
  • Pattern Matching
  • Querying Edge Tables
  • Relational Algebra and Datalog for Graphs
  • Querying Hybrid Graph/Relational Data
  • Graph Query Example: NSA
  • Graph Query Example: Recursion
  • Evaluation of Recursive Programs
  • Recursive Queries in MapReduce
  • The End-Game Problem
  • Representation: Edge Table, Adjacency List
  • Representation: Adjacency Matrix
  • PageRank in MapReduce
  • PageRank in Pregel

Summary of User Reviews

Discover the art of Data Manipulation with this comprehensive course on Coursera. Students praise the course for its clear and concise explanations and hands-on exercises. The overall rating is excellent.

Key Aspect Users Liked About This Course

The course provides a great foundation for understanding data manipulation and its practical applications.

Pros from User Reviews

  • The course provides practical examples that are relevant to real-world scenarios.
  • The instructors are knowledgeable and engaging, which makes the course content easy to follow.
  • The course is suitable for both beginners and intermediate learners, providing a strong foundation for those new to data manipulation and challenging enough for those with some experience.
  • The course is well-structured and easy to follow, with engaging videos and interactive exercises.
  • The course is available on Coursera, making it accessible to anyone with an internet connection.

Cons from User Reviews

  • Some users found the course to be too basic, particularly those with more advanced knowledge of data manipulation.
  • The course can be time-consuming, particularly if you want to complete all of the exercises and assignments.
  • Some users found the course content to be too theoretical, with not enough practical applications.
  • The course requires some prior knowledge of statistics and programming, which may be challenging for some learners.
  • The course may not be suitable for those looking for a comprehensive overview of all aspects of data manipulation.
English
Available now
Bill Howe
University of Washington
Coursera

Instructor

Bill Howe

  • 4.3 Rating