How to Display Data in PySpark Using show() Function | PySpark Tutorial for Beginners

How to Use show() Function in PySpark | Step-by-Step Guide

The show() function in PySpark allows you to display DataFrame contents in a readable tabular format. It’s ideal for quickly checking your data or debugging your transformations.

What is show() in PySpark?

The show() function is a simple way to view rows from a DataFrame. By default, it displays up to 20 rows and limits long strings to 20 characters.

Common Use Cases

1. Show Default Rows (First 20 Rows)

df.show()

Displays the first 20 rows and truncates long strings.

2. Show a Specific Number of Rows

df.show(5)

Displays only the first 5 rows of the DataFrame.

3. Show Full Column Content (No Truncation)

df.show(truncate=False)

Displays full content in each column, without cutting off long strings.

4. Truncate Column Content After N Characters

df.show(truncate=10)

Limits column text to 10 characters, useful for large text fields.

5. Show Rows in Vertical Format

df.show(vertical=True)

Displays rows in a vertical layout, which is helpful for wide DataFrames or debugging.

Summary of Options

  • df.show(): Shows 20 rows with default truncation.
  • df.show(n): Shows the first n rows.
  • df.show(truncate=False): Shows full column content.
  • df.show(truncate=n): Truncates text after n characters.
  • df.show(vertical=True): Displays data vertically.
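
Putting these options together, here is a minimal runnable sketch (the column names and sample data are illustrative, not from a specific dataset):

from pyspark.sql import SparkSession

# Get or create a SparkSession (already available as `spark` in Databricks notebooks)
spark = SparkSession.builder.getOrCreate()

# Small sample DataFrame with one long string to demonstrate truncation
data = [
    (1, "Alice", "Enjoys long-distance running and weekend hiking trips"),
    (2, "Bob", "Enjoys photography"),
    (3, "Charlie", "Collects vintage vinyl records")
]
df = spark.createDataFrame(data, ["id", "name", "bio"])

df.show()                # default: up to 20 rows, strings truncated to 20 characters
df.show(2)               # only the first 2 rows
df.show(truncate=False)  # full column content
df.show(truncate=10)     # truncate strings after 10 characters
df.show(vertical=True)   # one block per row, handy for wide DataFrames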

🎥 Watch the Video Tutorial

Prefer watching a step-by-step guide? Watch my video tutorial explaining show() in PySpark:

▶ Watch on YouTube

Author: Aamir Shahzad | PySpark Tutorial for Beginners

How to Extract Date and Time from Blob File Names and Load Sequentially into SQL Table - ADF Tutorial

Introduction

In modern data pipelines, organizing and processing data efficiently is essential for maintaining data integrity and ensuring optimal performance. Azure Data Factory (ADF) offers powerful capabilities for managing data workflows, including sequential data loading based on file metadata. This tutorial explains how to extract date and time from blob file names stored in Azure Blob Storage and load them sequentially into an Azure SQL Table using Azure Data Factory.

Understanding the Scenario

The objective is to process and load files sequentially from Azure Blob Storage based on the date and time embedded in their file names. This approach ensures the data is loaded either from the oldest to the newest files or vice versa, depending on your business needs.

For example, consider files named in the following format:

  • customer_20250310_110000_US.csv
  • customer_20250311_120000_US.csv
  • customer_20250313_170000_US.csv

These files contain timestamps in their names, which serve as the basis for sequencing them during data loading operations.

Step-by-Step Process

1. Explore the Blob Storage

  • Access the Azure Blob Storage container holding your data files.
  • Verify that the file names include date and time details that follow a consistent naming convention.

2. Create Required SQL Tables

FileList Table: Stores metadata about each file, including the file name and load status.

CREATE TABLE FileList (
    ID INT IDENTITY(1,1),
    FileName NVARCHAR(255),
    LoadStatus NVARCHAR(10)
);
    

Customer Table: Stores the actual data loaded from each file.

CREATE TABLE Customer (
    CustomerID INT,
    FirstName NVARCHAR(100),
    LastName NVARCHAR(100),
    Salary FLOAT,
    FileName NVARCHAR(255),
    LoadDateTime DATETIME
);
    

3. Set Up Azure Data Factory Pipeline

a. Get Metadata Activity

  • Use this activity to fetch the list of files from the Blob Storage container.
  • Configure the dataset and linked service pointing to the Blob Storage.

b. ForEach Loop Activity (Sequential Execution)

  • Loop through each file sequentially to ensure the correct order.
  • Use a script activity inside the loop to insert each file name into the FileList table.

c. Lookup Activity

  • Retrieve file names from the FileList table ordered by date and time.
  • Use a SQL query that extracts and orders files based on the date and time information parsed from the file names.

d. ForEach Loop for Data Loading

  • Sequentially process each file retrieved by the Lookup activity.
  • Use a script activity to update the load status of each file after processing.
  • Use a Copy activity to transfer data from each file in Blob Storage into the Customer table in Azure SQL Database.

4. Parameterization and Dynamic Content

  • Use parameters to dynamically handle file names and paths within datasets.
  • Create additional columns during the Copy activity to include the file name and load timestamp for traceability.

Handling Different Loading Scenarios

Loading Oldest to Newest Files

  • Order the files in ascending order of date and time.
  • This ensures data consistency when older data must be processed first.

Loading Newest to Oldest Files

  • Change the order clause in your SQL query to descending.
  • This is useful when prioritizing the most recent data.

Validation and Testing

  • Verify records in the Customer table to ensure they are loaded in the correct sequence.
  • Check the FileList table for accurate load status updates.
  • Use load timestamps to confirm data processing order.

Conclusion

This methodical approach using Azure Data Factory allows you to automate the sequential loading of data files based on embedded metadata like date and time. It enhances data pipeline reliability and ensures the correct sequencing of data ingestion processes.

By following this tutorial, data engineers and analysts can establish a robust data processing workflow in Azure that scales with their data growth and organizational needs.

Watch the Tutorial Video

For a step-by-step walkthrough, watch the video below:

How to Use createDataFrame Function with Schema in PySpark to create DataFrame | PySpark Tutorial

How to use createDataFrame() with Schema in PySpark

In PySpark, when creating a DataFrame using createDataFrame(), you can specify a schema to define column names and data types explicitly. This is useful when you want to control the structure and data types of your DataFrame instead of relying on PySpark's automatic inference.

Why define a Schema?

  • Ensures consistent column names and data types
  • Improves data quality and validation
  • Provides better control over data transformations

Example Usage

Below is a sample example of how to create a DataFrame using a schema in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Get or create a SparkSession (already available as `spark` in Databricks notebooks)
spark = SparkSession.builder.getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Sample data
data = [
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 35),
    (4, "Amir", 40)  # None represents a NULL value in PySpark
]

# Create DataFrame using schema
df = spark.createDataFrame(data, schema=schema)

# Show the DataFrame
df.show()

# Check the schema of the DataFrame
df.printSchema()

Output

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
|  4|   Amir| 40|
+---+-------+---+

Check the Schema

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
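
Since the schema marks name and age as nullable, a Python None in those positions is stored as a NULL. A small follow-up sketch reusing the schema defined above (the extra row is illustrative):

# None maps to NULL for nullable fields; the non-nullable "id" field must always have a value
extra_data = [(5, "Dana", None)]
df_null = spark.createDataFrame(extra_data, schema=schema)
df_null.show()  # the age column for this row is displayed as a null value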

Watch the Video Tutorial

If you prefer a video explanation, check out the tutorial below:

How to Add Columns and Check Schema in PySpark DataFrame | PySpark Tutorial

How to Add Columns to DataFrame and Check Schema in PySpark

In this tutorial, we’ll cover how to add columns to a DataFrame and also how to check the schema of a DataFrame using PySpark.

1. Creating a DataFrame

data = [
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 35),
    (4, "David", 40)
]

df = spark.createDataFrame(data, ["id", "name", "age"])
df.show()

2. Adding New Columns

We can add new columns using the withColumn() function.

from pyspark.sql.functions import lit

df_new = df.withColumn("country", lit("USA"))
df_new.show()

3. Adding Columns Using Expressions

from pyspark.sql.functions import col

df_exp = df.withColumn("age_double", col("age") * 2)
df_exp.show()

4. Adding Multiple Columns

df_multi = df \
    .withColumn("country", lit("USA")) \
    .withColumn("age_plus_ten", col("age") + 10)

df_multi.show()

5. Checking the Schema of DataFrame

df.printSchema()

This command prints the schema of the DataFrame, showing column names and data types.
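
For the DataFrame created above (plain Python integers and strings, no explicit schema), the printed schema typically looks like the following, since PySpark infers long for Python integers:

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)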

Conclusion

Adding columns in PySpark is simple and flexible. The withColumn() method is the most common way to add or modify columns, and the printSchema() method provides a quick view of the DataFrame’s structure.

Watch the Tutorial Video

Watch on YouTube

What is a DataFrame in PySpark? | How to create DataFrame from Static Values | PySpark Tutorial

What is DataFrame in PySpark?

A DataFrame in PySpark is a distributed collection of data organized into named columns. It is similar to a table in a relational database or an Excel spreadsheet. DataFrames allow you to process large amounts of data efficiently by using multiple computers at the same time.

Key Features

  • Structured Data: Organizes data into rows and columns.
  • Fast and Scalable: Handles large datasets effectively.
  • Data Source Flexibility: Works with CSV, JSON, Parquet, databases, etc.
  • SQL Queries: Supports SQL-like queries for filtering and grouping data.

Example: Creating a DataFrame

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Create data as a list of Row objects
data = [
    Row(id=1, name="Alice", age=25),
    Row(id=2, name="Bob", age=30),
    Row(id=3, name="Charlie", age=35)
]

# Create DataFrame
df = spark.createDataFrame(data)

# Show DataFrame content
df.show()

Output

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
+---+-------+---+
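
To illustrate the SQL-style querying mentioned above, here is a short sketch that filters the same DataFrame with the DataFrame API and with a SQL query against a temporary view (the view name is illustrative):

# Filter with the DataFrame API
df.filter(df.age > 26).show()

# Or register a temporary view and use plain SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 26 ORDER BY age").show()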

Conclusion

PySpark DataFrames are an essential tool for working with structured and semi-structured data in big data processing. They provide an easy-to-use API for data manipulation and analysis.

Watch on YouTube

How to Use Comments in PySpark | How to Write Comments Like a Pro | PySpark Tutorial

Using Comments in Databricks Notebook

In this post, you'll learn how to add comments in Databricks notebooks using different techniques.

🔹 What is a Comment?

A comment in PySpark (Databricks notebooks) is a line in the code that is ignored by the interpreter. It is used to:

  • Explain the code for better readability.
  • Temporarily disable code during debugging.
  • Document the logic of the program.

🔹 Single-Line Comments

Use # to create a single-line comment.

# This is a single-line comment in PySpark
print("Hello, World!")  # This is an inline comment

🔹 Multi-Line Comments

You can use triple quotes ''' or """ to write multi-line comments or docstrings.

'''
This is a multi-line comment.
It spans multiple lines.
'''
print("Hello, World!")

🔹 Shortcut in Databricks Notebook

For commenting multiple lines in a Databricks notebook:

  • On Windows: Ctrl + /
  • On Mac: Cmd + /

💡 Pro Tip: Use comments to explain complex code and improve collaboration with your team!

🎬 Watch the Video Tutorial

📌 Additional Video on Comments in Databricks

How to Format Notebooks | Top Markdown Formatting Tips & Tricks | PySpark Tutorial

Markdown Formatting Guide Tips

🔹 Headings

Use # for headings:

# H1
## H2
### H3
#### H4
    

🔹 Bold and Italic

  • Bold: **text**
  • Italic: *text*
  • Bold & Italic: ***text***

🔹 Horizontal Line

Use --- on its own line to create a horizontal rule (divider).

🔹 Lists

✅ Unordered List

- Item 1
- Item 2
  - Sub-item 1
  - Sub-item 2
    
  • Item 1
  • Item 2
    • Sub-item 1
    • Sub-item 2

✅ Ordered List

1. First item
2. Second item
   1. Sub-item 1
   2. Sub-item 2
    
  1. First item
  2. Second item
    1. Sub-item 1
    2. Sub-item 2

🔹 Links

[Link text](url)

Click here for Spark Documentation

🔹 Tables

| Column 1 | Column 2 | Column 3 |
|----------|----------|----------|
| Data 1   | Data 2   | Data 3   |
| Data 4   | Data 5   | Data 6   |
    
Column 1   Column 2   Column 3
Data 1     Data 2     Data 3
Data 4     Data 5     Data 6

🔹 Inline Code

Wrap text in backticks to display it as inline code:

Use `print("Hello")` to display text.

Use print("Hello") to display text.

🔹 Blockquotes

> This is a blockquote.
> It is useful for highlighting notes.
    
This is a blockquote. It is useful for highlighting notes.

🔹 Task List

- [x] Install PySpark
- [ ] Load data
- [ ] Process data
    
  • ☑️ Install PySpark
  • ⬜ Load data
  • ⬜ Process data

🔹 Images

![](path)
![PySpark Logo](https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg)
    
PySpark Logo

🎬 Watch the Video Tutorial

How to Create Azure Databricks Service | PySpark Tutorial

How to Create Azure Databricks Service

In this video, you will learn the following:

  • How to create Azure Resource Group
  • How to create Azure Databricks Service
  • How to create cluster/compute in Databricks
  • How to create Notebook
  • How to run/execute Code in Notebook

What is PySpark? What is Apache Spark? | Apache Spark vs PySpark | PySpark Tutorial

Apache Spark vs PySpark

What is Apache Spark?

"Apache Spark is an open-source, distributed computing framework designed for big data processing. It was developed by UC Berkeley in 2009 and is now one of the most powerful tools for handling massive datasets."

🔥 Why is Spark So Popular?

  • ✔️ Up to 100x faster than Hadoop MapReduce for some workloads – uses in-memory computing.
  • ✔️ Supports multiple workloads – Batch, streaming, machine learning, and graph processing.
  • ✔️ Scales easily – Runs on clusters with thousands of nodes.

What is PySpark?

"Now that we understand Apache Spark, let's talk about PySpark. PySpark is simply the Python API for Apache Spark, allowing us to use Spark with Python instead of Scala or Java."

💎 Why Use PySpark?

  • ✔️ Python is easy to learn – Great for data engineers & scientists.
  • ✔️ Leverages Spark’s speed – Handles big data in a scalable way.
  • ✔️ Integrates with Pandas, NumPy, and Machine Learning libraries.

Apache Spark vs PySpark – Key Differences

Feature             Apache Spark                    PySpark
Language            Scala, Java                     Python
Ease of Use         More complex                    Easier for beginners
Performance         Faster (native)                 Slightly slower (Python overhead)
Community Support   Strong (since 2009)             Growing rapidly
Best For            Large-scale data engineering    Python-based big data & ML
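
As a quick illustration of the Python API, here is a minimal sketch (the sample data is illustrative) showing the same Spark engine driven from Python:

from pyspark.sql import SparkSession

# Entry point to Spark from Python
spark = SparkSession.builder.getOrCreate()

# Create a DataFrame and run a distributed transformation
df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])
df.filter(df.age > 26).show()

# Convert to pandas for local analysis (requires pandas to be installed)
pdf = df.toPandas()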

Watch the Video Explanation!