Spark SQL Column / Data Types explained

Introduction

Types can be a little confusing across SQL, Spark SQL, Pandas etc. Here we breakdown all of the Spark SQL data types so you know exactly which type you should be using.

Bookmark this and use it as a reference!

All Spark Data types

Column Type	Technical Description	Example Use Case
StringType	Represents character string values	Storing names, addresses, or other textual information
IntegerType	Represents 32-bit integer values	Storing age, quantity, or other whole numbers
LongType	Represents 64-bit integer values	Storing large whole numbers like population, or timestamps
FloatType	Represents single-precision 32-bit floating point values	Storing decimal numbers like price, weight, or temperature
DoubleType	Represents double-precision 64-bit floating point values	Storing high-precision decimal numbers like scientific data
BooleanType	Represents true or false values	Storing binary conditions like isActive, isApproved, or isValid
TimestampType	Represents values comprising of date and time with nanosecond precision	Storing event timestamps or datetime values for analysis
DateType	Represents date values without time information	Storing birthdates, order dates, or other date-only values
BinaryType	Represents binary data (byte arrays)	Storing file contents like images, documents, or other files
ArrayType	Represents arrays of elements with the same data type	Storing multiple values of the same type like tags, or scores
MapType	Represents key-value pairs with specified key and value data types	Storing dictionaries, or lookup tables with key-value pairs
StructType	Represents a structured data type containing a list of StructFields with various types	Storing complex data structures like address or person details
DecimalType	Represents fixed-point decimal numbers with user-defined precision and scale	Storing exact decimal values like currency amounts or percentages

How to set the types with Schema

The best practice when working with Spark SQL that we recommend is to set the types by using a Schema object, when creating your dataframe.

Here is an example of how to set all of the types.

# Sample data for each column type
data = [
    ("John Doe", 30, 16000000000, 72.5, 0.987654321, True, "2023-04-08 14:30:00", "2023-04-08", bytearray(b"\x01\x02\x03"), ["apple", "banana"], {"key": "value"}, Row(a=1, b="text"), Decimal("3.14"))
]

# Define schema for the sample data
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Population", LongType(), True),
    StructField("Weight", FloatType(), True),
    StructField("Accuracy", DoubleType(), True),
    StructField("IsValid", BooleanType(), True),
    StructField("Timestamp", TimestampType(), True),
    StructField("Date", DateType(), True),
    StructField("BinaryData", BinaryType(), True),
    StructField("Fruits", ArrayType(StringType()), True),
    StructField("KeyValue", MapType(StringType(), StringType()), True),
    StructField("PersonDetails", StructType([
        StructField("a", IntegerType(), True),
        StructField("b", StringType(), True),
    ]), True),
    StructField("Amount", DecimalType(10, 2), True),
])

# Create DataFrame using the sample data and schema
df = spark.createDataFrame(data, schema)

How to set types when adding columns

It’s not always possible to set all types in the schema, as you may be adding new columns to your dataframe. Here is an example of how to create a new column with all of the types.

# Sample data for name column
data = [
    ("John Doe",)
]

# Define schema for the name column
schema = StructType([
    StructField("Name", StringType(), True),
])

# Create DataFrame using the sample data and schema
df = spark.createDataFrame(data, schema)

# Add new columns for each data type
df = df.withColumn("Age", lit(30).cast(IntegerType()))
df = df.withColumn("Population", lit(16000000000).cast(LongType()))
df = df.withColumn("Weight", lit(72.5).cast(FloatType()))
df = df.withColumn("Accuracy", lit(0.987654321).cast(DoubleType()))
df = df.withColumn("IsValid", lit(True).cast(BooleanType()))
df = df.withColumn("Timestamp", lit("2023-04-08 14:30:00").cast(TimestampType()))
df = df.withColumn("Date", lit("2023-04-08").cast(DateType()))
df = df.withColumn("BinaryData", lit(bytearray(b"\x01\x02\x03")).cast(BinaryType()))
df = df.withColumn("Fruits", lit(["apple", "banana"]).cast(ArrayType(StringType())))
df = df.withColumn("KeyValue", lit({"key": "value"}).cast(MapType(StringType(), StringType())))
df = df.withColumn("PersonDetails", lit(Row(a=1, b="text")).cast(StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
])))
df = df.withColumn("Amount", lit(Decimal("3.14")).cast(DecimalType(10, 2)))