Spark SQL · 3 min read
Apache Spark - Complete guide
Getting Started
Welcome to our complete guide to Apache Spark! In this blog post, we will introduce you to Apache Spark, a powerful open-source data processing engine that is designed to be fast and flexible. We will start by discussing the origins of Spark and why it has become so popular in recent years. From there, we will dive into the key features and capabilities of Spark, including its support for real-time stream processing and machine learning. We will also provide a detailed walkthrough of how to get started with Spark, including how to install and set up a development environment. By the end of this guide, you will have a solid understanding of what Apache Spark is and how you can use it to build powerful data-driven applications.
What is Apache Spark?
If you’re interested in working with big data, you’ve probably heard of Spark, but you might not be entirely sure what it is or how it works. In this post, we’ll give you a complete introduction to Apache Spark: exactly what it is, and how it differs from other big data technologies.
How does Apache Spark work?
A high-level guide on how Apache Spark works.
Setup Apache Spark Locally (PySpark)
We will make setting up Apache Spark a breeze. Check out our article to begin writing Spark code locally.
Fetching data with Apache Spark (PySpark)
The first step to using Apache Spark is, of course, to fetch some data! We’re going to look at the most common methods.
Read / write CSV files with Apache Spark (PySpark)
We’ll be working with CSV files throughout the course; the sample files are included in this post on how to read them.
Reading and writing CSV Files with PySpark
Data Manipulation
We’re now going to dive into all the common ways to manipulate data within a PySpark DataFrame.
Renaming Columns
In this post, you will learn how to rename the columns of a DataFrame with PySpark.
Sorting and filtering data
Learn how to sort and filter data using Spark SQL/PySpark
Aggregating and Grouping Data
This post will cover exactly how to group your data and apply the most common aggregation functions.
Joining and Merging Data
Pivot and Unpivot Data
In this post, we go through how to pivot and unpivot DataFrames, a really useful technique for reshaping data to suit your analysis or business-intelligence needs.
File Types
Use JSON files with PySpark
Mastering JSON files with PySpark