Spark SQL · 3 min read
Apache Spark - Complete guide
Getting Started
Welcome to our complete guide to Apache Spark! In this blog post, we will introduce you to Apache Spark, a powerful open-source data processing engine that is designed to be fast and flexible. We will start by discussing the origins of Spark and why it has become so popular in recent years. From there, we will dive into the key features and capabilities of Spark, including its support for real-time stream processing and machine learning. We will also provide a detailed walkthrough of how to get started with Spark, including how to install and set up a development environment. By the end of this guide, you will have a solid understanding of what Apache Spark is and how you can use it to build powerful data-driven applications.
What is Apache Spark?
If you’re interested in working with big data, you’ve probably heard of Spark, but you might not be entirely sure what it is or how it works. In this post, we’ll give you a complete introduction to Apache Spark: exactly what it is, and how it differs from other big data technologies.
How does Apache Spark work?
A high-level guide on how Apache Spark works.
Setup Apache Spark Locally (PySpark)
We will make setting up Apache Spark a breeze. Check out our article to begin writing Spark code locally.
Fetching data with Apache Spark (PySpark)
The first step to using Apache Spark is, of course, to fetch some data! We’re going to look at the most common methods.
Read / write CSV files with Apache Spark (PySpark)
We’ll be working with CSV files throughout the course; the sample files are included in this post on how to read them.
Reading and writing CSV Files with PySpark
Data Manipulation
We’re now going to dive into all the common ways to manipulate data within a PySpark DataFrame.
Renaming Columns
In this post, you will learn how to rename the columns of a DataFrame with PySpark.
Sorting and filtering data
Learn how to sort and filter data using Spark SQL/PySpark
Aggregating and Grouping Data
This post will cover exactly how to group your data and apply the most common aggregation functions.
Joining and Merging Data
Pivot and Unpivot Data
In this post, we go through how to pivot and unpivot DataFrames, a really useful technique for reshaping data to suit your analysis or business-intelligence needs.
File Types
Use JSON files with PySpark
Mastering JSON files with PySpark