How to Use show() Function in PySpark | Step-by-Step Guide
The show() function in PySpark allows you to display DataFrame contents in a readable tabular format. It’s ideal for quickly checking your data or debugging your transformations.
What is show() in PySpark?
The show() function is a simple way to view rows from a DataFrame. By default, it displays up to 20 rows and limits long strings to 20 characters.
Common Use Cases
1. Show Default Rows (First 20 Rows)
df.show()
Displays the first 20 rows and truncates long strings.
2. Show a Specific Number of Rows
df.show(5)
Displays only the first 5 rows of the DataFrame.
3. Show Full Column Content (No Truncation)
df.show(truncate=False)
Displays full content in each column, without cutting off long strings.
4. Truncate Column Content After N Characters
df.show(truncate=10)
Limits column text to 10 characters, useful for large text fields.
5. Show Rows in Vertical Format
df.show(vertical=True)
Displays rows in a vertical layout, which is helpful for wide DataFrames or debugging.
Summary of Options
df.show(): Shows 20 rows with default truncation.
df.show(n): Shows the first n rows.
df.show(truncate=False): Shows full column content.
df.show(truncate=n): Truncates text after n characters.
df.show(vertical=True): Displays data vertically.
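To see how these options fit together, here is a minimal, self-contained sketch; the app name, column names, and sample values are illustrative assumptions and not part of the original tutorial.
from pyspark.sql import SparkSession
# Create or reuse a SparkSession
spark = SparkSession.builder.appName("ShowExample").getOrCreate()
# Small sample DataFrame with one deliberately long string
df = spark.createDataFrame(
    [(1, "Alice", "Enjoys very long walks and very long strings"),
     (2, "Bob", "Short bio")],
    ["id", "name", "bio"]
)
df.show()                # first 20 rows, strings truncated to 20 characters
df.show(1)               # only the first row
df.show(truncate=False)  # full column content
df.show(truncate=10)     # cut text after 10 characters
df.show(vertical=True)   # one field per line, handy for wide DataFrames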
🎥 Watch the Video Tutorial
Prefer watching a step-by-step guide? Watch my video tutorial explaining show() in PySpark:
How to Extract Date and Time from Blob File Names and Load Sequentially into SQL Table - ADF Tutorial
Introduction
In modern data pipelines, organizing and processing data efficiently is essential for maintaining data integrity and ensuring optimal performance. Azure Data Factory (ADF) offers powerful capabilities for managing data workflows, including sequential data loading based on file metadata. This tutorial explains how to extract date and time from blob file names stored in Azure Blob Storage and load them sequentially into an Azure SQL Table using Azure Data Factory.
Understanding the Scenario
The objective is to process and load files sequentially from Azure Blob Storage based on the date and time embedded in their file names. This approach ensures the data is loaded either from the oldest to the newest files or vice versa, depending on your business needs.
For example, consider files named in the following format:
customer_20250310_110000_US.csv
customer_20250311_120000_US.csv
customer_20250313_170000_US.csv
These files contain timestamps in their names, which serve as the basis for sequencing them during data loading operations.
Step-by-Step Process
1. Explore the Blob Storage
Access the Azure Blob Storage container holding your data files.
Verify that the file names include date and time details that follow a consistent naming convention.
2. Create Required SQL Tables
FileList Table: Stores metadata about each file, including the file name and load status.
CREATE TABLE FileList (
ID INT IDENTITY(1,1),
FileName NVARCHAR(255),
LoadStatus NVARCHAR(10)
);
Customer Table: Stores the actual data loaded from each file.
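The tutorial does not show the Customer table definition, so the following is a minimal sketch; the column names and types are assumptions based on the sample file names and the traceability columns described later in this post.
-- Hypothetical Customer table; adjust columns to match your actual files
CREATE TABLE Customer (
    CustomerID INT,
    CustomerName NVARCHAR(100),
    Country NVARCHAR(50),
    FileName NVARCHAR(255),   -- source file name added during the Copy activity
    LoadDateTime DATETIME2    -- load timestamp added during the Copy activity
);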
3. Build the ADF Pipeline
a. Get Metadata Activity
Use this activity to fetch the list of files from the Blob Storage container.
Configure the dataset and linked service pointing to the Blob Storage.
b. ForEach Loop Activity (Sequential Execution)
Loop through each file sequentially to ensure the correct order.
Use a script activity inside the loop to insert each file name into the FileList table.
c. Lookup Activity
Retrieve file names from the FileList table ordered by date and time.
Use a SQL query that extracts and orders files based on the date and time information parsed from the file names.
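As an illustration, a Lookup query along these lines works for the sample naming convention customer_yyyyMMdd_HHmmss_US.csv; the exact character positions and the status filter are assumptions.
-- Hypothetical Lookup query: characters 10-24 of the file name hold 'yyyyMMdd_HHmmss',
-- which sorts chronologically as plain text
SELECT FileName
FROM FileList
WHERE LoadStatus IS NULL OR LoadStatus <> 'Loaded'
ORDER BY SUBSTRING(FileName, 10, 15) ASC;   -- switch to DESC to process newest files first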
d. ForEach Loop for Data Loading
Sequentially process each file retrieved by the Lookup activity.
Use a script activity to update the load status of each file after processing (a sample statement is sketched after this list).
Use a Copy activity to transfer data from each file in Blob Storage into the Customer table in Azure SQL Database.
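A status update along these lines could be used inside the script activity; the 'Loaded' value and the way the file name is passed in are assumptions.
-- Hypothetical status update run after each successful Copy
UPDATE FileList
SET LoadStatus = 'Loaded'
WHERE FileName = @FileName;   -- @FileName supplied from the ForEach item via a script parameter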
4. Parameterization and Dynamic Content
Use parameters to dynamically handle file names and paths within datasets.
Create additional columns during the Copy activity to include the file name and load timestamp for traceability.
Handling Different Loading Scenarios
Loading Oldest to Newest Files
Order the files in ascending order of date and time.
This ensures data consistency when older data must be processed first.
Loading Newest to Oldest Files
Change the order clause in your SQL query to descending.
This is useful when prioritizing the most recent data.
Validation and Testing
Verify records in the Customer table to ensure they are loaded in the correct sequence.
Check the FileList table for accurate load status updates.
Use load timestamps to confirm data processing order.
Conclusion
This methodical approach using Azure Data Factory allows you to automate the sequential loading of data files based on embedded metadata like date and time. It enhances data pipeline reliability and ensures the correct sequencing of data ingestion processes.
By following this tutorial, data engineers and analysts can establish a robust data processing workflow in Azure that scales with their data growth and organizational needs.
Watch the Tutorial Video
For a step-by-step walkthrough, watch the video below:
How to use createDataFrame() with Schema in PySpark
In PySpark, when creating a DataFrame using createDataFrame(), you can specify a schema to define column names and data types explicitly. This is useful when you want to control the structure and data types of your DataFrame instead of relying on PySpark's automatic inference.
Why define a Schema?
Ensures consistent column names and data types
Improves data quality and validation
Provides better control over data transformations
Example Usage
Below is an example of how to create a DataFrame with an explicit schema in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Create a SparkSession (skip this if one already exists, e.g., in a Databricks notebook)
spark = SparkSession.builder.appName("SchemaExample").getOrCreate()
# Define schema
schema = StructType([
StructField("id", IntegerType(), False),
StructField("name", StringType(), True),
StructField("age", IntegerType(), True)
])
# Sample data
data = [
(1, "Alice", 25),
(2, "Bob", 30),
(3, "Charlie", 35),
(4, "Amir", 40) # None represents a NULL value in PySpark
]
# Create DataFrame using schema
df = spark.createDataFrame(data, schema=schema)
# Show the DataFrame
df.show()
# Check the schema of the DataFrame
df.printSchema()
This command prints the schema of the DataFrame, showing column names and data types.
Conclusion
Defining a schema with createDataFrame() gives you explicit control over column names, data types, and nullability, while printSchema() provides a quick way to verify the resulting structure.
What is a DataFrame in PySpark?
A DataFrame in PySpark is a distributed collection of data organized into named columns. It is similar to a table in a relational database or an Excel spreadsheet. DataFrames let you process large amounts of data efficiently by distributing the work across multiple machines.
Key Features
Structured Data: Organizes data into rows and columns.
Fast and Scalable: Handles large datasets effectively.
Data Source Flexibility: Works with CSV, JSON, Parquet, databases, etc.
SQL Queries: Supports SQL-like queries for filtering and grouping data.
Example: Creating a DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()
# Create data as a list of Row objects
data = [
Row(id=1, name="Alice", age=25),
Row(id=2, name="Bob", age=30),
Row(id=3, name="Charlie", age=35)
]
# Create DataFrame
df = spark.createDataFrame(data)
# Show DataFrame content
df.show()
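Building on the DataFrame just created, here is a minimal sketch of the filtering, grouping, and SQL-style querying mentioned under Key Features; the threshold value and view name are illustrative assumptions.
# Filter rows on a condition
df.filter(df.age > 26).show()
# Simple aggregation: average age across all rows
df.groupBy().avg("age").show()
# SQL-style query via a temporary view
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 26").show()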
PySpark DataFrames are an essential tool for working with structured and semi-structured data in big data processing. They provide an easy-to-use API for data manipulation and analysis.
"Apache Spark is an open-source, distributed computing framework designed for big data processing.
It was developed by UC Berkeley in 2009 and is now one of the most powerful tools for handling massive datasets."
🔥 Why is Spark So Popular?
✔️ Up to 100x faster than Hadoop MapReduce for in-memory workloads – keeps intermediate data in memory instead of writing it to disk.
✔️ Scales easily – Runs on clusters with thousands of nodes.
What is PySpark?
"Now that we understand Apache Spark, let's talk about PySpark.
PySpark is simply the Python API for Apache Spark, allowing us to use Spark with Python instead of Scala or Java."
💎 Why Use PySpark?
✔️ Python is easy to learn – Great for data engineers & scientists.
✔️ Leverages Spark’s speed – Handles big data in a scalable way.
✔️ Integrates with Pandas, NumPy, and Machine Learning libraries.
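As a small illustration of that last point, here is a minimal sketch of handing a Spark DataFrame to pandas; it assumes pandas is installed alongside PySpark and that the data is small enough to collect to the driver.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PandasInterop").getOrCreate()
sdf = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
# toPandas() collects the distributed data into a local pandas DataFrame
pdf = sdf.toPandas()
print(pdf.head())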