~ 3 min read

Spark SQL Column / Data Types explained

By: Adam Richardson

Introduction

Types can be a little confusing across SQL, Spark SQL, pandas, etc. Here we break down all of the Spark SQL data types so you know exactly which type you should be using.

Bookmark this and use it as a reference!

All Spark SQL data types

Column Type | Technical Description | Example Use Case
StringType | Represents character string values | Storing names, addresses, or other textual information
IntegerType | Represents 32-bit integer values | Storing age, quantity, or other whole numbers
LongType | Represents 64-bit integer values | Storing large whole numbers like population, or timestamps
FloatType | Represents single-precision 32-bit floating point values | Storing decimal numbers like price, weight, or temperature
DoubleType | Represents double-precision 64-bit floating point values | Storing high-precision decimal numbers like scientific data
BooleanType | Represents true or false values | Storing binary conditions like isActive, isApproved, or isValid
TimestampType | Represents values comprising date and time with microsecond precision | Storing event timestamps or datetime values for analysis
DateType | Represents date values without time information | Storing birthdates, order dates, or other date-only values
BinaryType | Represents binary data (byte arrays) | Storing file contents like images, documents, or other files
ArrayType | Represents arrays of elements with the same data type | Storing multiple values of the same type like tags or scores
MapType | Represents key-value pairs with specified key and value data types | Storing dictionaries or lookup tables with key-value pairs
StructType | Represents a structured data type containing a list of StructFields with various types | Storing complex data structures like address or person details
DecimalType | Represents fixed-point decimal numbers with user-defined precision and scale | Storing exact decimal values like currency amounts or percentages
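
Tip: if you ever need the SQL name Spark uses for one of these types (for example inside a CAST expression), every type object exposes it through simpleString(). A minimal sketch, assuming a standard PySpark install:

# Each DataType object knows the SQL name Spark uses for it
from pyspark.sql.types import StringType, ArrayType, MapType, DecimalType

print(StringType().simpleString())                          # string
print(ArrayType(StringType()).simpleString())               # array<string>
print(MapType(StringType(), StringType()).simpleString())   # map<string,string>
print(DecimalType(10, 2).simpleString())                    # decimal(10,2)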

How to set the types with Schema

The best practice we recommend when working with Spark SQL is to set the types explicitly with a schema object when creating your DataFrame.

Here is an example of how to set all of the types.

# Imports needed for this example
from datetime import datetime, date
from decimal import Decimal

from pyspark.sql import Row
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, LongType, FloatType,
    DoubleType, BooleanType, TimestampType, DateType, BinaryType, ArrayType,
    MapType, DecimalType,
)

# Sample data for each column type. Note that TimestampType and DateType
# columns need datetime/date objects, not strings.
data = [
    (
        "John Doe",                    # StringType
        30,                            # IntegerType
        16000000000,                   # LongType
        72.5,                          # FloatType
        0.987654321,                   # DoubleType
        True,                          # BooleanType
        datetime(2023, 4, 8, 14, 30),  # TimestampType
        date(2023, 4, 8),              # DateType
        bytearray(b"\x01\x02\x03"),    # BinaryType
        ["apple", "banana"],           # ArrayType
        {"key": "value"},              # MapType
        Row(a=1, b="text"),            # StructType
        Decimal("3.14"),               # DecimalType
    )
]

# Define schema for the sample data
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Population", LongType(), True),
    StructField("Weight", FloatType(), True),
    StructField("Accuracy", DoubleType(), True),
    StructField("IsValid", BooleanType(), True),
    StructField("Timestamp", TimestampType(), True),
    StructField("Date", DateType(), True),
    StructField("BinaryData", BinaryType(), True),
    StructField("Fruits", ArrayType(StringType()), True),
    StructField("KeyValue", MapType(StringType(), StringType()), True),
    StructField("PersonDetails", StructType([
        StructField("a", IntegerType(), True),
        StructField("b", StringType(), True),
    ]), True),
    StructField("Amount", DecimalType(10, 2), True),
])

# Create DataFrame using the sample data and schema
df = spark.createDataFrame(data, schema)
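
You can confirm the schema landed as intended with printSchema(). The output should look roughly like this (abbreviated):

df.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Age: integer (nullable = true)
#  |-- Population: long (nullable = true)
#  |-- ...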

How to set types when adding columns

It’s not always possible to set every type in the schema, since you may be adding new columns to an existing DataFrame. Here is an example of how to create a new column of each type.

# Imports for this example (in addition to the type imports shown above)
from decimal import Decimal
from pyspark.sql.functions import lit, array, create_map, struct

# Sample data for name column
data = [
    ("John Doe",)
]

# Define schema for the name column
schema = StructType([
    StructField("Name", StringType(), True),
])

# Create DataFrame using the sample data and schema
df = spark.createDataFrame(data, schema)

# Add new columns for each data type
df = df.withColumn("Age", lit(30).cast(IntegerType()))
df = df.withColumn("Population", lit(16000000000).cast(LongType()))
df = df.withColumn("Weight", lit(72.5).cast(FloatType()))
df = df.withColumn("Accuracy", lit(0.987654321).cast(DoubleType()))
df = df.withColumn("IsValid", lit(True).cast(BooleanType()))
df = df.withColumn("Timestamp", lit("2023-04-08 14:30:00").cast(TimestampType()))
df = df.withColumn("Date", lit("2023-04-08").cast(DateType()))
df = df.withColumn("BinaryData", lit(bytearray(b"\x01\x02\x03")).cast(BinaryType()))

# Complex types can't be passed straight to lit() on older Spark versions,
# so build them with array(), create_map(), and struct() instead
df = df.withColumn("Fruits", array(lit("apple"), lit("banana")))
df = df.withColumn("KeyValue", create_map(lit("key"), lit("value")))
df = df.withColumn("PersonDetails", struct(
    lit(1).cast(IntegerType()).alias("a"),
    lit("text").alias("b"),
))
df = df.withColumn("Amount", lit(Decimal("3.14")).cast(DecimalType(10, 2)))