How to Use dropna() Function in PySpark | Remove Null Values Easily | PySpark Tutorial

How to Use dropna() Function in PySpark | Step-by-Step Guide

How to Use dropna() Function in PySpark

The dropna() function in PySpark is used to remove rows that contain NULL (missing) values from a DataFrame. This is a common step in data preprocessing and cleansing, ensuring your data is complete and ready for analysis.

Basic Syntax

DataFrame.dropna(how='any', thresh=None, subset=None)
  • how: 'any' (default) drops a row if any nulls are present. 'all' drops a row only if all columns are null.
  • thresh: Minimum number of non-null values required to retain the row (see Example 4 below).
  • subset: List of column names to consider for null checking.

Sample Data

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize SparkSession
spark = SparkSession.builder.appName("DropNAExample").getOrCreate()

# Sample data
data = [
    Row(name='John', age=30, salary=None),
    Row(name='Alice', age=None, salary=5000),
    Row(name='Bob', age=40, salary=6000),
    Row(name=None, age=None, salary=None)
]

# Create DataFrame
df = spark.createDataFrame(data)

# Show original DataFrame
df.show()

Expected Output

+-----+----+------+
| name| age|salary|
+-----+----+------+
| John|  30|  null|
|Alice|null|  5000|
|  Bob|  40|  6000|
| null|null|  null|
+-----+----+------+

Example 1: Remove Rows with Any NULL Values

# Remove rows with any NULL values
df_clean_any = df.dropna()
df_clean_any.show()

Expected Output

+----+---+------+
|name|age|salary|
+----+---+------+
| Bob| 40|  6000|
+----+---+------+

Example 2: Remove Rows Where All Columns Are NULL

# Remove rows where all columns are NULL
df_clean_all = df.dropna(how='all')
df_clean_all.show()

Expected Output

+-----+----+------+
| name| age|salary|
+-----+----+------+
| John|  30|  null|
|Alice|null|  5000|
|  Bob|  40|  6000|
+-----+----+------+

Example 3: Remove Rows with NULLs in a Specific Subset of Columns

# Remove rows with NULLs in 'name' or 'salary'
df_clean_subset = df.dropna(subset=['name', 'salary'])
df_clean_subset.show()

Expected Output

+-----+----+------+
| name| age|salary|
+-----+----+------+
|Alice|null|  5000|
|  Bob|  40|  6000|
+-----+----+------+
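
Example 4: Keep Rows with a Minimum Number of Non-NULL Values

The thresh option offers a middle ground between 'any' and 'all'. The snippet below is a minimal sketch on the same sample data; the threshold of 2 is an arbitrary value for illustration and should be tuned to your own columns.

# Keep only rows that have at least 2 non-null values
df_clean_thresh = df.dropna(thresh=2)
df_clean_thresh.show()

Expected Output

+-----+----+------+
| name| age|salary|
+-----+----+------+
| John|  30|  null|
|Alice|null|  5000|
|  Bob|  40|  6000|
+-----+----+------+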

Conclusion

The dropna() function in PySpark is a powerful tool for handling missing data efficiently. It allows you to clean your datasets by removing incomplete rows based on flexible criteria, ensuring the quality of your data pipelines.

Watch the Video Tutorial

How to use dropDuplicates Function in PySpark | PySpark Tutorial

How to Use dropDuplicates() Function in PySpark | Step-by-Step Guide

How to Use dropDuplicates() Function in PySpark

The dropDuplicates() function in PySpark is used to remove duplicate rows from a DataFrame. It returns a new DataFrame containing only unique rows; when duplicates exist, one occurrence of each is retained (exactly which physical row survives is not guaranteed, since Spark processes the data in parallel).

Sample Data

data = [
    (1, "Aamir Shahzad", "IT", 6000),
    (2, "Ali", "HR", 7000),
    (3, "Raza", "Finance", 8000),
    (4, "Aamir Shahzad", "IT", 6000),  # Duplicate Entry
    (5, "Alice", "IT", 5000),
    (6, "Bob", "HR", 7000),
    (7, "Ali", "HR", 7000),            # Duplicate Entry
    (8, "Charlie", "Finance", 9000),
    (9, "David", "IT", 6000),
    (10, "Eve", "HR", 7500)
]

df = spark.createDataFrame(data, ["id", "name", "department", "salary"])

Show the Full DataFrame

df.show()

Expected Output

+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  4|Aamir Shahzad|        IT|  6000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  7|          Ali|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+

Example 1: Removing All Duplicate Rows

# Removing duplicate rows (entire row must match)
df_no_duplicates = df.dropDuplicates()
df_no_duplicates.show()

Expected Output

+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+

Example 2: Removing Duplicates Based on Specific Columns

# Removing duplicates based on 'name' and 'department'
df_no_duplicates_specific = df.dropDuplicates(["name", "department"])
df_no_duplicates_specific.show()

Expected Output

+---+-------------+----------+------+
| id|         name|department|salary|
+---+-------------+----------+------+
|  1|Aamir Shahzad|        IT|  6000|
|  2|          Ali|        HR|  7000|
|  3|         Raza|   Finance|  8000|
|  5|        Alice|        IT|  5000|
|  6|          Bob|        HR|  7000|
|  8|      Charlie|   Finance|  9000|
|  9|        David|        IT|  6000|
| 10|          Eve|        HR|  7500|
+---+-------------+----------+------+

Summary

  • dropDuplicates() removes duplicate rows from a DataFrame.
  • When columns are specified, it keeps only unique rows based on those columns.
  • If no columns are specified, it compares all columns in the DataFrame.
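
A quick way to see the effect is to compare row counts before and after deduplication. This is a minimal sketch on the sample data above; the counts in the comments follow directly from that data.

# 10 rows in the original DataFrame
print(df.count())

# 8 rows once fully duplicated rows are removed
print(df.dropDuplicates().count())

# 3 rows when deduplicating on 'department' alone (IT, HR, Finance)
print(df.dropDuplicates(["department"]).count())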

Watch the Video Tutorial

How to Use drop() to Remove Columns from DataFrame | PySpark Tutorial

How to Use drop() Function in PySpark | Step-by-Step Guide

How to Use drop() Function in PySpark

The drop() function in PySpark is used to remove one or multiple columns from a DataFrame. It returns a new DataFrame without modifying the original one, making it useful for data cleaning and transformation tasks.

Sample Data

data = [
    (1, "Alice", 5000, "IT", 25),
    (2, "Bob", 6000, "HR", 30),
    (3, "Charlie", 7000, "Finance", 35),
    (4, "David", 8000, "IT", 40),
    (5, "Eve", 9000, "HR", 45)
]

df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])

df.show()

Expected Output

+---+-------+------+----------+---+
| id|   name|salary|department|age|
+---+-------+------+----------+---+
|  1|  Alice|  5000|        IT| 25|
|  2|    Bob|  6000|        HR| 30|
|  3|Charlie|  7000|   Finance| 35|
|  4|  David|  8000|        IT| 40|
|  5|    Eve|  9000|        HR| 45|
+---+-------+------+----------+---+

Example 1: Dropping a Single Column

# Dropping the 'age' column
df_dropped = df.drop("age")
df_dropped.show()

Expected Output

+---+-------+------+----------+
| id|   name|salary|department|
+---+-------+------+----------+
|  1|  Alice|  5000|        IT|
|  2|    Bob|  6000|        HR|
|  3|Charlie|  7000|   Finance|
|  4|  David|  8000|        IT|
|  5|    Eve|  9000|        HR|
+---+-------+------+----------+

Example 2: Dropping Multiple Columns

# Dropping 'id' and 'department' columns
df_multiple_dropped = df.drop("id", "department")
df_multiple_dropped.show()

Expected Output

+-------+------+---+
|   name|salary|age|
+-------+------+---+
|  Alice|  5000| 25|
|    Bob|  6000| 30|
|Charlie|  7000| 35|
|  David|  8000| 40|
|    Eve|  9000| 45|
+-------+------+---+

Example 3: Trying to Drop a Non-Existent Column

# Dropping a column that does not exist ('gender')
df_non_existent = df.drop("gender")
df_non_existent.show()

Expected Output

+---+-------+------+----------+---+
| id|   name|salary|department|age|
+---+-------+------+----------+---+
|  1|  Alice|  5000|        IT| 25|
|  2|    Bob|  6000|        HR| 30|
|  3|Charlie|  7000|   Finance| 35|
|  4|  David|  8000|        IT| 40|
|  5|    Eve|  9000|        HR| 45|
+---+-------+------+----------+---+

Since the column 'gender' does not exist, the DataFrame remains unchanged.
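
Because drop() silently ignores columns that do not exist, it can hide typos. If you want the missing-column case to be explicit, a simple membership test on df.columns works; the snippet below is a minimal sketch of that pattern.

# Make the missing-column case explicit instead of relying on the silent default
if "gender" in df.columns:
    df = df.drop("gender")
else:
    print("Column 'gender' not found; nothing to drop")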

Watch the Video Tutorial

Use distinct() to Remove Duplicates from DataFrames | Get Unique Rows with distinct() in PySpark

How to Use distinct() Function in PySpark | Step-by-Step Guide

How to Use distinct() Function in PySpark

The distinct() function in PySpark is used to remove duplicate rows from a DataFrame. It returns a new DataFrame containing only unique rows, making it a valuable tool for data cleaning and analysis in big data workflows.

Example Dataset

id  name    department
1   Alice   IT
2   Bob     HR
9   Aamir   Finance
4   Alice   IT
5   Eve     HR
6   Frank   Finance
7   Bob     HR
8   Grace   IT
9   Aamir   Finance

Create DataFrame

data = [
    (1, "Alice", "IT"),
    (2, "Bob", "HR"),
    (9, "Aamir", "Finance"),
    (4, "Alice", "IT"),
    (5, "Eve", "HR"),
    (6, "Frank", "Finance"),
    (7, "Bob", "HR"),
    (8, "Grace", "IT"),
    (9, "Aamir", "Finance")
]

df = spark.createDataFrame(data, ["id", "name", "department"])

df.show()

Removing Duplicate Rows using distinct()

df_distinct = df.distinct()

df_distinct.show()

Getting Unique Values from a Single Column

df.select("department").distinct().show()

Summary

The distinct() function in PySpark is very useful when you need to remove duplicate rows from your DataFrame or get unique values from a specific column. It is commonly used during data preprocessing and cleaning tasks in data engineering projects.
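
If you only need the number of unique values rather than the values themselves, counting is often all that is required. The sketch below uses countDistinct() from pyspark.sql.functions on the same sample data.

from pyspark.sql.functions import countDistinct

# Count unique departments (returns a one-row DataFrame)
df.select(countDistinct("department").alias("unique_departments")).show()

# Count fully distinct rows
print(df.distinct().count())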

Watch the Video Tutorial

PySpark Tutorial: How to Use describe() for DataFrame Statistics | PySpark Tutorial for Data Engineers

How to Use describe() Function in PySpark | Step-by-Step Guide

How to Use describe() Function in PySpark

The describe() function in PySpark provides summary statistics for numerical columns in a DataFrame. It returns important metrics like count, mean, standard deviation, min, and max for each selected column.

Sample Data

data = [
    (1, "Alice", 5000, 25),
    (2, "Bob", 6000, 30),
    (3, "Charlie", 7000, 35),
    (4, "David", 8000, 40),
    (5, "Eve", 9000, 45),
    (6, "Frank", 10000, 50),
    (7, "Grace", 11000, 55),
    (8, "Hannah", 12000, 60),
    (9, "Ian", 13000, 65),
    (10, "Jack", 14000, 70)
]

Create DataFrame

df = spark.createDataFrame(data, ["id", "name", "salary", "age"])

Show the Full DataFrame

df.show()

Example 1: Basic Usage of describe()

print("Summary statistics for numerical columns:")
df.describe().show()

Example 2: describe() for Specific Columns

print("Summary statistics for 'salary' and 'age':")
df.describe("salary", "age").show()

Sample Output

+-------+------------------+------------------+
|summary|            salary|               age|
+-------+------------------+------------------+
|  count|                10|                10|
|   mean|            9500.0|              47.5|
| stddev|3027.6503540974913|15.138251770487457|
|    min|              5000|                25|
|    max|             14000|                70|
+-------+------------------+------------------+
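
A closely related method is summary(), which reports the same statistics as describe() plus approximate percentiles by default. The sketch below is a minimal illustration; the list of requested statistics is just an example.

# Default summary: count, mean, stddev, min, 25%, 50%, 75%, max
df.summary().show()

# Request only specific statistics
df.summary("count", "mean", "min", "25%", "75%", "max").show()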

Watch the Video Tutorial

PySpark Tutorial: limit() Function to Display Limited Rows | PySpark tutorial for Data Engineers

PySpark limit() Function Explained with Examples | Step-by-Step Guide

PySpark limit() Function Explained with Examples

The limit() function in PySpark is used to return a specified number of rows from a DataFrame. It helps in sampling data or fetching a small subset for quick analysis, especially useful for data engineers working with large datasets.

Sample Data

data = [
    (1, "Alice", 5000),
    (2, "Bob", 6000),
    (3, "Charlie", 7000),
    (4, "David", 8000),
    (5, "Eve", 9000),
    (6, "Frank", 10000),
    (7, "Grace", 11000),
    (8, "Hannah", 12000),
    (9, "Ian", 13000),
    (10, "Jack", 14000)
]

Create a DataFrame

df = spark.createDataFrame(data, ["id", "name", "salary"])

Show the Full DataFrame

df.show()

Example 1: Get the First 5 Rows

df.limit(5).show()

Example 2: Get the First 3 Rows

df.limit(3).show()

Example 3: Store the Limited DataFrame

df_limited = df.limit(4)
df_limited.show()
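
Keep in mind that limit() returns a new DataFrame, so further transformations can continue on the reduced data. To pull the limited rows into the driver as Python objects, collect() can be combined with it, as in this minimal sketch.

# Bring a small sample to the driver as a list of Row objects
rows = df.limit(3).collect()
for row in rows:
    print(row["name"], row["salary"])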

Watch the Video Tutorial

How to Use withColumn() Function in PySpark to Add & Update Columns | PySpark Tutorial

How to Use withColumn() Function in PySpark | Add & Update Columns

How to Use withColumn() Function in PySpark | Add & Update Columns

The withColumn() function in PySpark allows you to add new columns or update existing ones within a DataFrame. This guide provides simple, step-by-step examples to help you understand how to use it effectively.

📌 Syntax

DataFrame.withColumn(colName, col)

📦 Import Libraries

from pyspark.sql.functions import col, lit, expr

📋 Create Sample DataFrame

data = [
    (1, "Alice", 5000, "IT"),
    (2, "Bob", 6000, "HR"),
    (3, "Charlie", 7000, "Finance")
]

df = spark.createDataFrame(data, ["id", "name", "salary", "department"])
df.show()

✅ Expected Output

+---+-------+------+----------+
| id|   name|salary|department|
+---+-------+------+----------+
|  1|  Alice|  5000|        IT|
|  2|    Bob|  6000|        HR|
|  3|Charlie|  7000|   Finance|
+---+-------+------+----------+

1️⃣ Add New Column

df_new = df.withColumn("bonus", lit(1000))
df_new.show()

✅ Expected Output

+---+-------+------+----------+-----+
| id|   name|salary|department|bonus|
+---+-------+------+----------+-----+
|  1|  Alice|  5000|        IT| 1000|
|  2|    Bob|  6000|        HR| 1000|
|  3|Charlie|  7000|   Finance| 1000|
+---+-------+------+----------+-----+

2️⃣ Update Existing Column

df_updated = df_new.withColumn("salary", col("salary") * 1.10)
df_updated.show()

✅ Expected Output

+---+-------+-------+----------+-----+
| id|   name| salary|department|bonus|
+---+-------+-------+----------+-----+
|  1|  Alice| 5500.0|        IT| 1000|
|  2|    Bob| 6600.0|        HR| 1000|
|  3|Charlie| 7700.0|   Finance| 1000|
+---+-------+-------+----------+-----+

3️⃣ Use Expressions with expr()

df_expr = df_updated.withColumn("salary_with_bonus", expr("salary + bonus"))
df_expr.show()

✅ Expected Output

+---+-------+-------+----------+-----+-----------------+
| id|   name| salary|department|bonus|salary_with_bonus|
+---+-------+-------+----------+-----+-----------------+
|  1|  Alice| 5500.0|        IT| 1000|           6500.0|
|  2|    Bob| 6600.0|        HR| 1000|           7600.0|
|  3|Charlie| 7700.0|   Finance| 1000|           8700.0|
+---+-------+-------+----------+-----+-----------------+

4️⃣ Change Column Data Type

df_type_changed = df_expr.withColumn("salary", col("salary").cast("Integer"))
df_type_changed.show()

✅ Expected Output

+---+-------+------+----------+-----+-----------------+
| id|   name|salary|department|bonus|salary_with_bonus|
+---+-------+------+----------+-----+-----------------+
|  1|  Alice|  5500|        IT| 1000|           6500.0|
|  2|    Bob|  6600|        HR| 1000|           7600.0|
|  3|Charlie|  7700|   Finance| 1000|           8700.0|
+---+-------+------+----------+-----+-----------------+

5️⃣ Rename a Column

df_renamed = df_type_changed.withColumn("emp_name", col("name")).drop("name")
df_renamed.show()

✅ Expected Output

+---+------+----------+-----+-----------------+--------+
| id|salary|department|bonus|salary_with_bonus|emp_name|
+---+------+----------+-----+-----------------+--------+
|  1|  5500|        IT| 1000|           6500.0|   Alice|
|  2|  6600|        HR| 1000|           7600.0|     Bob|
|  3|  7700|   Finance| 1000|           8700.0| Charlie|
+---+------+----------+-----+-----------------+--------+
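
Note that renaming through withColumn() plus drop() moves the new column to the end. When you only need a rename, withColumnRenamed() is the more direct option and keeps the column in place; a minimal sketch:

df_renamed_alt = df_type_changed.withColumnRenamed("name", "emp_name")
df_renamed_alt.show()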

🎥 Watch the Tutorial

Click here to watch on YouTube

Author: Aamir Shahzad

For more PySpark tutorials, subscribe to TechBrothersIT

How to Sort DataFrames Using orderBy() in PySpark | PySpark Tutorial

How to Use orderBy() Function in PySpark | Step-by-Step Guide

How to Use orderBy() Function in PySpark

The orderBy() function in PySpark allows you to sort data in a DataFrame by one or more columns. By default, it sorts in ascending order, but you can specify descending order as well.

Syntax

DataFrame.orderBy(*cols, ascending=True)

Arguments

  • cols: Column(s) to sort by.
  • ascending: Boolean or list. True for ascending (default), False for descending.

Example 1: Order by Single Column (Ascending - Default)

df.orderBy("salary").show()

Example 2: Order by Single Column (Descending)

from pyspark.sql.functions import col

df.orderBy(col("salary").desc()).show()

Example 3: Order by Multiple Columns (Ascending)

df.orderBy("department", "salary").show()

Example 4: Order by Multiple Columns with Mixed Order

df.orderBy(col("department").asc(), col("salary").desc()).show()

Example 5: Using orderBy() with Explicit Boolean for Ascending/Descending

df.orderBy(col("salary"), ascending=True).show()

df.orderBy(col("salary"), ascending=False).show()
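
When the sort column contains NULL values, you can also control where they appear using the column-level helpers asc_nulls_first(), asc_nulls_last(), desc_nulls_first(), and desc_nulls_last(). A minimal sketch:

# Highest salaries first, with NULL salaries pushed to the end
df.orderBy(col("salary").desc_nulls_last()).show()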

Conclusion

The orderBy() function in PySpark is a powerful tool to sort data efficiently. It can handle single or multiple columns and allows flexibility with ascending or descending order settings.

Watch the Full Tutorial

How to Use select(), selectExpr(), col(), expr(), when(), and lit() in PySpark | PySpark Tutorial

How to Use select(), selectExpr(), col(), expr(), when(), and lit() in PySpark | Step-by-Step Guide

How to Use select(), selectExpr(), col(), expr(), when(), and lit() in PySpark

In this guide, you will learn how to work with various functions in PySpark to select, manipulate, and transform data efficiently in your data engineering projects.

Topics Covered

  • select() - Retrieve specific columns.
  • selectExpr() - Use SQL expressions.
  • col() - Reference columns.
  • expr() - Perform expressions.
  • when() - Conditional logic.
  • lit() - Add constant columns.

1. Sample DataFrame Creation

from pyspark.sql.functions import col, expr, when, lit

# Sample Data
data = [
    (1, "Alice", 5000, "IT", 25),
    (2, "Bob", 6000, "HR", 30),
    (3, "Charlie", 7000, "Finance", 35),
    (4, "David", 8000, "IT", 40),
    (5, "Eve", 9000, "HR", 45)
]

# Creating DataFrame
df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])

# Show DataFrame
df.show()

2. Selecting Specific Columns

df.select("name", "salary").show()

3. Using col() Function

df.select(col("name"), col("department")).show()

4. Renaming Columns Using alias()

df.select(col("name").alias("Employee_Name"), col("salary").alias("Employee_Salary")).show()

5. Using Expressions in select()

df.select("name", "salary", expr("salary * 1.10 AS increased_salary")).show()

6. Using Conditional Expressions with when()

df.select(
    "name",
    "salary",
    when(col("salary") > 7000, "High").otherwise("Low").alias("Salary_Category")
).show()

7. Using selectExpr() for SQL-like Expressions

df.selectExpr("name", "salary * 2 as double_salary").show()

8. Adding Constant Columns Using lit()

df.select("name", "department", lit("Active").alias("status")).show()

9. Selecting Columns Dynamically

columns_to_select = ["name", "salary", "department"]
df.select(*columns_to_select).show()

10. Selecting All Columns Except One

df.select([column for column in df.columns if column != "age"]).show()
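
These functions are frequently combined in a single select(). The sketch below is illustrative only; the 1.05 adjustment factor and the age cutoff of 35 are arbitrary values chosen for the example.

df.select(
    "name",
    "department",
    expr("salary * 1.05 AS adjusted_salary"),
    when(col("age") >= 35, lit("Senior")).otherwise(lit("Junior")).alias("level")
).show()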

Watch the Video Tutorial

Watch on YouTube

Author: Aamir Shahzad

How to Use the display() Function in Databricks | PySpark Tutorial for Beginners

How to Use display() Function in Databricks | PySpark Tutorial for Beginners

How to Use display() Function in Databricks | PySpark Tutorial for Beginners

The display() function in Databricks provides an interactive way to visualize DataFrames directly within your Databricks notebook. Although not part of standard PySpark, it's a powerful tool designed specifically for Databricks users.

Advantages of display()

  • Auto-formats the table output: Displays DataFrame results in a well-formatted table automatically.
  • Interactive sorting and filtering: Allows you to sort and filter columns directly in the notebook interface.
  • Built-in charts and visualizations: Provides options to visualize data using bar charts, pie charts, line graphs, etc., without writing extra code.

How to Use display() in Databricks

Here’s a simple example of how to use the display() function inside a Databricks notebook:

# Create a sample DataFrame
data = [("James", "Smith", "USA", "CA"), 
        ("Michael", "Rose", "USA", "NY"), 
        ("Robert", "Williams", "USA", "CA"), 
        ("Maria", "Jones", "USA", "FL")]

columns = ["firstname", "lastname", "country", "state"]

df = spark.createDataFrame(data, columns)

# Display the DataFrame in Databricks notebook
display(df)

This will open an interactive table display of your DataFrame in the Databricks notebook. You can sort, filter, and switch to chart views easily.
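
Aggregated results are usually the best candidates for the built-in chart views. The sketch below groups the sample DataFrame by state before displaying it; like display() itself, it only works inside a Databricks notebook.

# Group first, then display to enable bar/pie chart views on the counts
display(df.groupBy("state").count())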

When Should You Use display()?

Use display() when working in Databricks notebooks and you need a quick, interactive way to explore and visualize your data.

Note: The display() function works only within Databricks and isn’t available in standalone PySpark environments.

Watch the Video Tutorial

Watch on YouTube

Author: Aamir Shahzad

How to Read JSON File into DataFrame from Azure Blob Storage | PySpark Tutorial

How to Read JSON File in DataFrame from Azure Blob Storage | Step-by-Step Guide

How to Read JSON File in DataFrame from Azure Blob Storage

In this PySpark tutorial, you'll learn how to read a JSON file stored in Azure Blob Storage and load it into a PySpark DataFrame.

Step 1: Prerequisites

  • Azure Storage Account
  • Container with a JSON file
  • Access key or SAS token
  • PySpark environment (Databricks or local setup)

Step 2: Configure Spark to Access Azure Blob Storage

Configure the Spark session to access your Azure Blob Storage account using the access key or SAS token.

# Set the Spark configuration for Azure Blob Storage
spark.conf.set(
    "fs.azure.account.key.<storage_account_name>.blob.core.windows.net",
    "<access_key_or_sas_token>"
)

Step 3: Read JSON File into DataFrame

Now read the JSON file using the spark.read.json() function and load it into a PySpark DataFrame.

# Define the file path
file_path = "wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<file_name.json>"

# Read JSON file into DataFrame
df = spark.read.json(file_path)

# Display the DataFrame
df.show(truncate=False)

# Print the schema of the DataFrame
df.printSchema()

Step 4: Example Output

Output of df.show():

+-------------+-----------+---+
|         Name|Nationality|Age|
+-------------+-----------+---+
|Aamir Shahzad|   Pakistan| 25|
|     Ali Raza|        USA| 30|
|          Bob|         UK| 45|
|         Lisa|     Canada| 35|
+-------------+-----------+---+

Output of df.printSchema():

root
 |-- Name: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Age: long (nullable = true)

Summary

  • Set Spark configuration to access Azure Blob Storage using your account key or SAS token.
  • Use spark.read.json() to load JSON files into a PySpark DataFrame.
  • Verify your data using df.show() and df.printSchema().
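
If the JSON file stores records across multiple lines (for example a single pretty-printed JSON array rather than one object per line), the multiLine reader option is usually required. A minimal sketch, assuming the same file_path as above:

# Read multi-line (pretty-printed or array-style) JSON
df_multiline = spark.read.option("multiLine", "true").json(file_path)
df_multiline.show(truncate=False)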

Watch the Video Tutorial

Author: Aamir Shahzad

How to Read CSV File into DataFrame from Azure Blob Storage | PySpark Tutorial

How to Read CSV File into DataFrame from Azure Blob Storage | PySpark Tutorial

How to Read CSV File into DataFrame from Azure Blob Storage | PySpark Tutorial

In this PySpark tutorial, you'll learn how to read a CSV file from Azure Blob Storage into a Spark DataFrame. Follow this step-by-step guide to integrate Azure storage with PySpark for efficient data processing.

Step 1: Configure Spark to Use SAS Token for Authentication

In Azure Blob Storage, a SAS (Shared Access Signature) token provides secure, delegated access to your storage resources. Below is an example SAS token (for illustration only); the following steps show how to configure Spark to use it.

# SAS token example (for illustration only)
sas_token = "sp=r&st=2025-03-06T17:28:38Z&se=2026-03-07T01:28:38Z&spr=https&sv=2022-11-02&sr=c&sig=VAI..."
        

Step 2: Define the File Path Using WASBS (Azure Blob Storage)

# Define file path
file_path = "wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path_to_your_file>.csv"
        

Step 3: Configure Spark with SAS Token

# Spark configuration for accessing the blob
spark.conf.set(
    "fs.azure.sas.<container_name>.<storage_account_name>.blob.core.windows.net",
    sas_token
)
        

Step 4: Read the CSV File into a DataFrame

# Read CSV file into DataFrame
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(file_path)
        

Step 5: Show the Data and Print Schema

# Display the DataFrame contents
df.show()

# Print the DataFrame schema
df.printSchema()
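
Relying on inferSchema means Spark scans the file an extra time to guess data types. If the layout is known, supplying an explicit schema is faster and safer. The sketch below is a hedged example; the column names and types are assumptions and should be adjusted to match your actual CSV file.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumed columns for illustration only
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("country", StringType(), True)
])

df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(schema) \
    .load(file_path)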
        

Conclusion

Using the above steps, you can securely connect to Azure Blob Storage using SAS tokens and read CSV files directly into PySpark DataFrames. This method is essential for data processing workflows in big data and cloud environments.

📺 Watch the Full Tutorial Video

For a complete step-by-step video guide, watch the tutorial below:

▶️ Watch on YouTube

Author: Aamir Shahzad


How to Display Data in PySpark Using show() Function | PySpark Tutorial for Beginners

How to Use show() Function in PySpark | Step-by-Step Guide

How to Use show() Function in PySpark | Step-by-Step Guide

The show() function in PySpark allows you to display DataFrame contents in a readable tabular format. It’s ideal for quickly checking your data or debugging your transformations.

What is show() in PySpark?

The show() function is a simple way to view rows from a DataFrame. By default, it displays up to 20 rows and limits long strings to 20 characters.

Common Use Cases

1. Show Default Rows (First 20 Rows)

df.show()

Displays the first 20 rows and truncates long strings.

2. Show a Specific Number of Rows

df.show(5)

Displays only the first 5 rows of the DataFrame.

3. Show Full Column Content (No Truncation)

df.show(truncate=False)

Displays full content in each column, without cutting off long strings.

4. Truncate Column Content After N Characters

df.show(truncate=10)

Limits column text to 10 characters, useful for large text fields.

5. Show Rows in Vertical Format

df.show(vertical=True)

Displays rows in a vertical layout, which is helpful for wide DataFrames or debugging.

Summary of Options

  • df.show(): Shows 20 rows with default truncation.
  • df.show(n): Shows the first n rows.
  • df.show(truncate=False): Shows full column content.
  • df.show(truncate=n): Truncates text after n characters.
  • df.show(vertical=True): Displays data vertically.
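
These options can also be combined in a single call; the values below are arbitrary and only illustrate the idea.

# First 5 rows, truncating each value after 15 characters
df.show(n=5, truncate=15)

# Two rows in vertical layout with full column content
df.show(n=2, truncate=False, vertical=True)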

🎥 Watch the Video Tutorial

Prefer watching a step-by-step guide? Watch my video tutorial explaining show() in PySpark:

▶ Watch on YouTube

Author: Aamir Shahzad | PySpark Tutorial for Beginners

How to Extract Date and Time from Blob File Names and Load Sequentially into SQL Table -ADF Tutorial

How to Extract Date and Time from Blob File Names and Load Sequentially into SQL Table - ADF Tutorial

How to Extract Date and Time from Blob File Names and Load Sequentially into SQL Table - ADF Tutorial

Introduction

In modern data pipelines, organizing and processing data efficiently is essential for maintaining data integrity and ensuring optimal performance. Azure Data Factory (ADF) offers powerful capabilities for managing data workflows, including sequential data loading based on file metadata. This tutorial explains how to extract date and time from blob file names stored in Azure Blob Storage and load them sequentially into an Azure SQL Table using Azure Data Factory.

Understanding the Scenario

The objective is to process and load files sequentially from Azure Blob Storage based on the date and time embedded in their file names. This approach ensures the data is loaded either from the oldest to the newest files or vice versa, depending on your business needs.

For example, consider files named in the following format:

  • customer_20250310_110000_US.csv
  • customer_20250311_120000_US.csv
  • customer_20250313_170000_US.csv

These files contain timestamps in their names, which serve as the basis for sequencing them during data loading operations.

Step-by-Step Process

1. Explore the Blob Storage

  • Access the Azure Blob Storage container holding your data files.
  • Verify that the file names include date and time details that follow a consistent naming convention.

2. Create Required SQL Tables

FileList Table: Stores metadata about each file, including the file name and load status.

CREATE TABLE FileList (
    ID INT IDENTITY(1,1),
    FileName NVARCHAR(255),
    LoadStatus NVARCHAR(10)
);
    

Customer Table: Stores the actual data loaded from each file.

CREATE TABLE Customer (
    CustomerID INT,
    FirstName NVARCHAR(100),
    LastName NVARCHAR(100),
    Salary FLOAT,
    FileName NVARCHAR(255),
    LoadDateTime DATETIME
);
    

3. Set Up Azure Data Factory Pipeline

a. Get Metadata Activity

  • Use this activity to fetch the list of files from the Blob Storage container.
  • Configure the dataset and linked service pointing to the Blob Storage.

b. ForEach Loop Activity (Sequential Execution)

  • Loop through each file sequentially to ensure the correct order.
  • Use a script activity inside the loop to insert each file name into the FileList table.

c. Lookup Activity

  • Retrieve file names from the FileList table ordered by date and time.
  • Use a SQL query that extracts and orders files based on the date and time information parsed from the file names (see the example query below).
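
The exact query depends on your naming convention. The T-SQL below is a minimal sketch that assumes the customer_yyyyMMdd_HHmmss_US.csv pattern from the example files, where the timestamp fragment sorts chronologically even as plain text.

-- Minimal sketch: order files by the timestamp embedded in the name
-- ('customer_' prefix is 9 characters, so the yyyyMMdd_HHmmss part starts at position 10)
SELECT FileName
FROM FileList
ORDER BY SUBSTRING(FileName, 10, 15) ASC;  -- e.g. '20250310_110000'; use DESC to load newest first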

d. ForEach Loop for Data Loading

  • Sequentially process each file retrieved by the Lookup activity.
  • Use a script activity to update the load status of each file after processing.
  • Use a Copy activity to transfer data from each file in Blob Storage into the Customer table in Azure SQL Database.

4. Parameterization and Dynamic Content

  • Use parameters to dynamically handle file names and paths within datasets.
  • Create additional columns during the Copy activity to include the file name and load timestamp for traceability.

Handling Different Loading Scenarios

Loading Oldest to Newest Files

  • Order the files in ascending order of date and time.
  • This ensures data consistency when older data must be processed first.

Loading Newest to Oldest Files

  • Change the order clause in your SQL query to descending.
  • This is useful when prioritizing the most recent data.

Validation and Testing

  • Verify records in the Customer table to ensure they are loaded in the correct sequence.
  • Check the FileList table for accurate load status updates.
  • Use load timestamps to confirm data processing order.

Conclusion

This methodical approach using Azure Data Factory allows you to automate the sequential loading of data files based on embedded metadata like date and time. It enhances data pipeline reliability and ensures the correct sequencing of data ingestion processes.

By following this tutorial, data engineers and analysts can establish a robust data processing workflow in Azure that scales with their data growth and organizational needs.

Watch the Tutorial Video

For a step-by-step walkthrough, watch the video below:

How to Use createDataFrame Function with Schema in PySpark to create DataFrame | PySpark Tutorial

How to use createDataFrame() with Schema in PySpark

How to use createDataFrame() with Schema in PySpark

In PySpark, when creating a DataFrame using createDataFrame(), you can specify a schema to define column names and data types explicitly. This is useful when you want to control the structure and data types of your DataFrame instead of relying on PySpark's automatic inference.

Why define a Schema?

  • Ensures consistent column names and data types
  • Improves data quality and validation
  • Provides better control over data transformations

Example Usage

Below is a sample example of how to create a DataFrame using a schema in PySpark:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Sample data
data = [
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 35),
    (4, "Amir", 40)
]

# Create DataFrame using schema
df = spark.createDataFrame(data, schema=schema)

# Show the DataFrame
df.show()

# Check the schema of the DataFrame
df.printSchema()

Output

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
|  4|   Amir| 40|
+---+-------+---+

Check the Schema

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Watch the Video Tutorial

If you prefer a video explanation, check out the tutorial below:

How to Add Columns and Check Schema in PySpark DataFrame | PySpark Tutorial

How to Add Columns to DataFrame and Check Schema in PySpark

How to Add Columns to DataFrame and Check Schema in PySpark

In this tutorial, we’ll cover how to add columns to a DataFrame and also how to check the schema of a DataFrame using PySpark.

1. Creating a DataFrame

data = [
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 35),
    (4, "David", 40)
]

df = spark.createDataFrame(data, ["id", "name", "age"])
df.show()

2. Adding New Columns

We can add new columns using the withColumn() function.

from pyspark.sql.functions import lit

df_new = df.withColumn("country", lit("USA"))
df_new.show()

3. Adding Columns Using Expressions

from pyspark.sql.functions import col

df_exp = df.withColumn("age_double", col("age") * 2)
df_exp.show()

4. Adding Multiple Columns

df_multi = df \
    .withColumn("country", lit("USA")) \
    .withColumn("age_plus_ten", col("age") + 10)

df_multi.show()

5. Checking the Schema of DataFrame

df.printSchema()

This command prints the schema of the DataFrame, showing column names and data types.

Conclusion

Adding columns in PySpark is simple and flexible. The withColumn() method is the most common way to add or modify columns, and the printSchema() method provides a quick view of the DataFrame’s structure.

Watch the Tutorial Video

Watch on YouTube

What is a DataFrame in PySpark? | How to create DataFrame from Static Values | PySpark Tutorial

What is DataFrame? - PySpark Tutorial

What is DataFrame in PySpark?

A DataFrame in PySpark is a distributed collection of data organized into named columns. It is similar to a table in a relational database or an Excel spreadsheet. DataFrames allow you to process large amounts of data efficiently by using multiple computers at the same time.

Key Features

  • Structured Data: Organizes data into rows and columns.
  • Fast and Scalable: Handles large datasets effectively.
  • Data Source Flexibility: Works with CSV, JSON, Parquet, databases, etc.
  • SQL Queries: Supports SQL-like queries for filtering and grouping data.

Example: Creating a DataFrame

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Create data as a list of Row objects
data = [
    Row(id=1, name="Alice", age=25),
    Row(id=2, name="Bob", age=30),
    Row(id=3, name="Charlie", age=35)
]

# Create DataFrame
df = spark.createDataFrame(data)

# Show DataFrame content
df.show()

Output

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
+---+-------+---+

Conclusion

PySpark DataFrames are an essential tool for working with structured and semi-structured data in big data processing. They provide an easy-to-use API for data manipulation and analysis.

Watch on YouTube

How to Use Comments in PySpark | How to Write Comments Like a Pro | PySpark Tutorial

Using Comments in Notebook - Databricks

Using Comments in Databricks Notebook

In this post, you'll learn how to add comments in Databricks notebooks using different techniques.

🔹 What is a Comment?

A comment in PySpark (Databricks notebooks) is a line in the code that is ignored by the interpreter. It is used to:

  • Explain the code for better readability.
  • Temporarily disable code during debugging.
  • Document the logic of the program.

🔹 Single-Line Comments

Use # to create a single-line comment.

# This is a single-line comment in PySpark
print("Hello, World!")  # This is an inline comment

🔹 Multi-Line Comments

You can use triple quotes ''' or """ to write multi-line comments or docstrings.

'''
This is a multi-line comment.
It spans multiple lines.
'''
print("Hello, World!")

🔹 Shortcut in Databricks Notebook

For commenting multiple lines in a Databricks notebook:

  • On Windows: Ctrl + /
  • On Mac: Cmd + /

💡 Pro Tip: Use comments to explain complex code and improve collaboration with your team!

🎬 Watch the Video Tutorial

📌 Additional Video on Comments in Databricks

How to Format Notebooks | Top Markdown Formatting Tips & Tricks | PySpark Tutorial

Markdown Formatting Guide Tips

Markdown Formatting Guide Tips

🔹 Headings

Use # for headings:

# H1
## H2
### H3
#### H4
    

🔹 Bold and Italic

  • Bold: **text**
  • Italic: *text*
  • Bold & Italic: ***text***

🔹 Line Break

Use --- to create a horizontal line.

🔹 Lists

✅ Unordered List

- Item 1
- Item 2
  - Sub-item 1
  - Sub-item 2
    
  • Item 1
  • Item 2
    • Sub-item 1
    • Sub-item 2

✅ Ordered List

1. First item
2. Second item
   1. Sub-item 1
   2. Sub-item 2
    
  1. First item
  2. Second item
    1. Sub-item 1
    2. Sub-item 2

🔹 Links

[Click here for Spark Documentation](url)

Click here for Spark Documentation

🔹 Tables

| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| Data 1   | Data 2   | Data 3   |
| Data 4   | Data 5   | Data 6   |
    
Column 1   Column 2   Column 3
Data 1     Data 2     Data 3
Data 4     Data 5     Data 6

✅ Inline Code

Use print("Hello") to display text.

Use `print("Hello")` to display text.
    

🔹 Blockquotes

> This is a blockquote.
> It is useful for highlighting notes.
    
This is a blockquote. It is useful for highlighting notes.

🔹 Task List

- [x] Install PySpark
- [ ] Load data
- [ ] Process data
    
  • ☑️ Install PySpark
  • ⬜ Load data
  • ⬜ Process data

🔹 Images

![](path)
![PySpark Logo](https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg)
    
PySpark Logo

🎬 Watch the Video Tutorial

How to Create Azure Databricks Service | PySpark Tutorial

How to Create Azure Databricks Service

How to Create Azure Databricks Service

In this video, you will learn the following:

  • How to create Azure Resource Group
  • How to create Azure Databricks Service
  • How to create cluster/compute in Databricks
  • How to create Notebook
  • How to run/execute Code in Notebook

What is PySpark ? What is Apache Spark | Apache Spark vs PySpark | PySpark Tutorial

Apache Spark vs PySpark

What is Apache Spark?

"Apache Spark is an open-source, distributed computing framework designed for big data processing. It was developed by UC Berkeley in 2009 and is now one of the most powerful tools for handling massive datasets."

🔥 Why is Spark So Popular?

  • ✔️ Up to 100x faster than Hadoop MapReduce – uses in-memory computing.
  • ✔️ Supports multiple workloads – Batch, streaming, machine learning, and graph processing.
  • ✔️ Scales easily – Runs on clusters with thousands of nodes.

What is PySpark?

"Now that we understand Apache Spark, let's talk about PySpark. PySpark is simply the Python API for Apache Spark, allowing us to use Spark with Python instead of Scala or Java."

💎 Why Use PySpark?

  • ✔️ Python is easy to learn – Great for data engineers & scientists.
  • ✔️ Leverages Spark’s speed – Handles big data in a scalable way.
  • ✔️ Integrates with Pandas, NumPy, and Machine Learning libraries.

Apache Spark vs PySpark – Key Differences

Feature           | Apache Spark                 | PySpark
------------------|------------------------------|----------------------------------
Language          | Scala, Java                  | Python
Ease of Use       | More complex                 | Easier for beginners
Performance       | Faster (native)              | Slightly slower (Python overhead)
Community Support | Strong (since 2009)          | Growing rapidly
Best For          | Large-scale data engineering | Python-based big data & ML

Watch the Video Explanation!