TechBrothersIT is a blog and YouTube channel sharing real-world tutorials, interview questions, and examples on SQL Server (T-SQL, DBA), SSIS, SSRS, Azure Data Factory, GCP Cloud SQL, PySpark, ChatGPT, Microsoft Dynamics AX, Lifecycle Services, Windows Server, TFS, and KQL. Ideal for data engineers, DBAs, and developers seeking hands-on, step-by-step learning across Microsoft and cloud platforms.
PySpark Tutorial: How to Use toJSON() – Convert DataFrame Rows to JSON Strings
This tutorial demonstrates how to use PySpark's toJSON() function to convert each row of a DataFrame into a JSON string. This is especially useful for exporting data, streaming to APIs, or sending JSON records to systems like Kafka or NoSQL databases.
1. Import SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySpark toJSON Example").getOrCreate()
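With the Spark session ready, the snippet below is a minimal sketch of toJSON() in action. The sample data and column names are just placeholders for illustration; toJSON() returns an RDD of JSON strings, which collect() brings back to the driver.
# Sample data (illustrative)
data = [("Aamir Shahzad", "Engineering", 5000),
        ("Lisa", "Sales", 4200)]
df = spark.createDataFrame(data, ["Name", "Department", "Salary"])

# Convert each row to a JSON string; toJSON() returns an RDD of strings
json_rdd = df.toJSON()

# Bring the JSON records back to the driver and print them
for record in json_rdd.collect():
    print(record)
Each printed record looks like {"Name":"Aamir Shahzad","Department":"Engineering","Salary":5000}, ready to be sent to Kafka, an API, or a NoSQL store.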
PySpark Tutorial: How to Use crosstab() to Analyze Relationships Between Columns
This tutorial will show you how to use the crosstab() function in PySpark to create frequency tables and understand the relationship between two categorical columns.
1. What is crosstab() in PySpark?
The crosstab() function in PySpark generates a contingency table (cross-tabulation) between two columns. It counts the occurrences of combinations between two categorical variables.
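As a quick sketch (with an assumed two-column dataset), crosstab() pairs every value of the first column with every value of the second and counts how often each combination occurs:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark crosstab Example").getOrCreate()

# Assumed sample data with two categorical columns
data = [("Aamir Shahzad", "Engineering", "Pakistan"),
        ("Ali", "Sales", "USA"),
        ("Raza", "Sales", "Pakistan"),
        ("Lisa", "Engineering", "Canada")]
df = spark.createDataFrame(data, ["Name", "Department", "Country"])

# Frequency table of Department vs Country
df.crosstab("Department", "Country").show()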
PySpark Tutorial: How to Use rollup() to Aggregate Data by Groups and Subtotals
In this PySpark tutorial, we’ll explore how to use the rollup() function to perform multi-level aggregations (group subtotals and grand totals).
It’s very useful when you want to analyze hierarchical data grouped by multiple columns.
1. What is rollup() in PySpark?
The rollup() function in PySpark allows you to create subtotals and a grand total in grouped aggregations. It’s similar to SQL’s ROLLUP.
2. Import Required Libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum
df_rollup_country = df.rollup("Country") \
    .agg(_sum("Salary").alias("Total_Salary")) \
    .orderBy("Country")
print("Rollup Aggregation by Country with Grand Total:")
df_rollup_country.show(truncate=False)
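A rollup over two columns works the same way. The sketch below assumes the same df also includes a Department column; it produces a total per (Country, Department) pair, a subtotal per Country (where Department is NULL), and a grand total row (where both grouping columns are NULL):
df_rollup_multi = df.rollup("Country", "Department") \
    .agg(_sum("Salary").alias("Total_Salary")) \
    .orderBy("Country", "Department")

print("Rollup Aggregation by Country and Department:")
df_rollup_multi.show(truncate=False)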
The rollup() function is a powerful tool for hierarchical aggregation in PySpark. It allows you to create subtotals at multiple levels as well as an overall total.
It’s great for building summarized reports or dashboards directly from Spark DataFrames.
PySpark freqItems() Function | Identify Frequent Items in Columns Fast
In this tutorial, you'll learn how to use PySpark's freqItems() function to identify frequent items in one or multiple DataFrame columns. This method is helpful when performing data analysis or finding patterns in datasets.
What is freqItems() in PySpark?
freqItems() is a PySpark DataFrame function that returns frequent items (values) in one or more columns, helping you identify commonly occurring values and spot patterns or skew in your data.
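As a minimal sketch (the sample data here is assumed for illustration), a single-column call looks like this; the same df is reused in the steps below:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark freqItems Example").getOrCreate()

# Assumed sample data
data = [("Aamir Shahzad", "Pakistan"),
        ("Ali Raza", "Pakistan"),
        ("Bob", "USA"),
        ("Lisa", "Canada")]
df = spark.createDataFrame(data, ["Name", "Country"])

# Frequent items in a single column
df.freqItems(["Country"]).show(truncate=False)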
Step 4: Identify Frequent Items in Multiple Columns
freq_name_country = df.freqItems(["Name", "Country"])
print("Frequent Items in 'Name' and 'Country' Columns:")
freq_name_country.show(truncate=False)
+------------------------------------+-----------------------+
|Name_freqItems                      |Country_freqItems      |
+------------------------------------+-----------------------+
|[Lisa, Bob, Aamir Shahzad, Ali Raza]|[Pakistan, USA, Canada]|
+------------------------------------+-----------------------+
Step 5: Adjust Support Threshold (Optional)
By default, the support threshold is 1%. You can increase it to find only more frequent items:
freq_with_support = df.freqItems(["Name"], support=0.4)
print("Frequent Items in 'Name' Column with Support = 0.4 (40%):")
freq_with_support.show(truncate=False)
PySpark replace() Function Tutorial | Replace Values in DataFrames Easily
Learn how to use the replace() function in PySpark to replace values in one or more columns of a DataFrame. This guide explains everything step-by-step with examples and expected outputs.
What is replace() in PySpark?
The replace() function in PySpark allows you to replace specific values in one or more columns of a DataFrame. It is helpful for cleaning data and standardizing values, such as swapping placeholder entries for more meaningful ones.
# Replace 'Unknown' with 'Not Provided' in the Name column
df_replaced_name = df.replace("Unknown", "Not Provided", subset=["Name"])
df_replaced_name.show()
Step 5: Replace a Single Value Across Multiple Columns
# Replace 'Unknown' with 'Not Provided' in Name and Country columns
df_replaced_multiple = df.replace("Unknown", "Not Provided", subset=["Name", "Country"])
df_replaced_multiple.show()
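replace() also accepts a dictionary, which lets you map several old values to new ones in a single call. This is a sketch against the same assumed DataFrame; the 'N/A' placeholder is just an illustrative extra value:
# Replace several values at once using a dictionary ('N/A' is illustrative)
df_replaced_dict = df.replace({"Unknown": "Not Provided", "N/A": "Not Provided"})
df_replaced_dict.show()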
PySpark Tutorial: How to Use transform() for Custom DataFrame Transformations
In this tutorial, you will learn how to use the transform() function in PySpark to apply custom reusable transformations on DataFrames. This is a great way to simplify complex logic and make your code cleaner!
What is transform() in PySpark?
The transform() function allows you to apply a custom function to a DataFrame. It’s a cleaner way to chain multiple operations, especially when applying reusable logic.
Helps make your code more reusable.
Applies custom transformations on DataFrames.
Keeps your DataFrame pipelines clean and modular.
Step 1: Create Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper
spark = SparkSession.builder \
    .appName("PySpark transform() Example") \
    .getOrCreate()
def add_bonus(input_df):
    # 1. Uppercase the Name column
    # 2. Add a new column "Bonus" which is 10% of Salary
    return input_df.withColumn("Name_Upper", upper(col("Name"))) \
                   .withColumn("Bonus", col("Salary") * 0.10)
PySpark Tutorial: How to Use subtract() to Compare and Filter DataFrames
In this tutorial, you'll learn how to use the subtract() function in PySpark to find differences between two DataFrames. A simple way to compare and filter rows in big data!
What is subtract() in PySpark?
The subtract() function returns rows that exist in the first DataFrame but not in the second. It works like the EXCEPT operator in SQL.
Both DataFrames must have the same schema.
Commonly used to compare datasets or filter out rows.
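Here is a minimal sketch with two small assumed DataFrames that share the same schema:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark subtract Example").getOrCreate()

# Assumed sample DataFrames with identical schemas
df1 = spark.createDataFrame([("Aamir Shahzad", 5000), ("Ali", 4000), ("Lisa", 6000)],
                            ["Name", "Salary"])
df2 = spark.createDataFrame([("Ali", 4000)], ["Name", "Salary"])

# Rows present in df1 but not in df2 (like SQL EXCEPT)
df1.subtract(df2).show()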
PySpark Tutorial: DataFrame.summary() for Statistical Summary in One Command
In this tutorial, you will learn how to use summary() in PySpark to quickly generate statistical summaries of your data. Perfect for data exploration and quick analysis!
What is summary() in PySpark?
The summary() function provides descriptive statistics for numeric and string columns in a PySpark DataFrame.
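A quick sketch, assuming a small DataFrame with a numeric Salary column:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark summary Example").getOrCreate()

# Assumed sample data
df = spark.createDataFrame(
    [("Aamir Shahzad", 5000), ("Ali", 4000), ("Lisa", 6000)],
    ["Name", "Salary"]
)

# Default statistics: count, mean, stddev, min, 25%, 50%, 75%, max
df.summary().show()

# Or request only specific statistics
df.summary("count", "min", "max").show()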
PySpark take() Function | Get First N Rows from DataFrame Fast
In this tutorial, you'll learn how to use the take() function in PySpark to quickly retrieve the first N rows from a DataFrame. It's a handy method for previewing and debugging data.
Introduction
When working with PySpark DataFrames, it's often useful to retrieve a small sample of rows for inspection or validation. The take() function allows you to pull the first N rows efficiently, returning them as a list of Row objects.
PySpark Code Example
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("PySpark take() Example").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", 85, "Math"),
("Ali Raza", 78, "Science"),
("Bob", 92, "History"),
("Lisa", 80, "Math"),
("John", 88, "Science")
]
# Define columns
columns = ["Name", "Score", "Subject"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show complete DataFrame
print("Complete DataFrame:")
df.show()
# Example 1: Take the first 2 rows
first_two_rows = df.take(2)
print("First 2 rows using take():")
for row in first_two_rows:
    print(row)
# Example 2: Take more rows than exist (request 10 rows, only 5 in DataFrame)
more_rows = df.take(10)
print("Taking more rows than available (requested 10 rows):")
for row in more_rows:
    print(row)
PySpark DataFrame.to() Function | Schema Reconciliation and Column Reordering Made Easy
Learn how to use the DataFrame.to() function introduced in PySpark 3.4.0 to reorder columns and reconcile schema effortlessly.
Introduction
PySpark’s DataFrame.to() function helps with schema alignment, column reordering, and type casting — all in one step. This is especially useful when you're writing to tables, preparing data for joins, or ensuring schema compliance.
PySpark Code Example
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType
# Create SparkSession
spark = SparkSession.builder.appName("DataFrame.to Example").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", "Math", 85, True),
("Ali Raza", "Science", 78, False),
("Bob", "History", 92, True),
("Lisa", "Math", 80, False)
]
columns = ["Name", "Subject", "Score", "Passed"]
df = spark.createDataFrame(data, columns)
# Print original schema and data
df.show(truncate=False)
df.printSchema()
# Define new schema with reordered and altered types
schema = StructType([
StructField("Passed", BooleanType(), True),
StructField("Score", StringType(), True), # Cast from int to string
StructField("Name", StringType(), True),
StructField("Subject", StringType(), True)
])
# Apply .to() transformation
df2 = df.to(schema)
# Show transformed DataFrame
df2.show(truncate=False)
df2.printSchema()
How to Use randomSplit() in PySpark | Split Your DataFrame into Train and Test Sets
Learn how to efficiently split your DataFrame into training and testing datasets using PySpark's randomSplit() method. A must-have step when preparing data for machine learning workflows.
Introduction
In machine learning and data science, splitting data into training and testing sets is a fundamental step. PySpark’s randomSplit() function allows you to do this easily and reproducibly across distributed data.
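Here is a minimal sketch; the 80/20 ratio, column names, and seed are assumptions for illustration, and the fixed seed is what makes the split reproducible:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark randomSplit Example").getOrCreate()

# Assumed sample data: 100 simple rows
df = spark.createDataFrame([(i, f"user_{i}") for i in range(100)], ["id", "name"])

# 80/20 split; the seed makes the result reproducible
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

print("Training rows:", train_df.count())
print("Test rows:", test_df.count())
Note that the resulting row counts are approximate, since the weights are proportions rather than exact sizes.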
PySpark Tutorial: How to Use dtypes in PySpark | Get DataFrame Column Names and Types | Step-by-Step Guide
Learn how to use dtypes in PySpark to quickly inspect column names and their corresponding data types — a must-know feature for any data engineer or analyst.
Introduction
Understanding the structure of your DataFrame is essential for building reliable data pipelines. The dtypes attribute helps you list out each column with its corresponding data type in a quick and readable format.
PySpark Code Example
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("PySpark dtypes Example").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", 85, 90.5, True),
("Ali Raza", 78, 83.0, False),
("Bob", 92, 95.2, True),
("Lisa", 80, 87.8, False)
]
# Define columns
columns = ["Name", "Math_Score", "Science_Score", "Passed"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
# Get column names and data types
print("Column Names and Data Types (dtypes):")
print(df.dtypes)
# Formatted output
print("\nFormatted Output of Column Data Types:")
for col_name, data_type in df.dtypes:
print(f"Column: {col_name}, Type: {data_type}")
How to Use melt() Function in PySpark | Reshape & Unpivot DataFrames | Step-by-Step Guide
Learn how to reshape and unpivot PySpark DataFrames using a custom melt function. Ideal for data engineering and analytics workflows.
Introduction
PySpark doesn’t have a native melt() function like pandas. However, we can mimic this behavior by combining explode(), struct(), and array(). This transformation is useful when converting a wide table (with many columns) into a long format (more rows, fewer columns).
PySpark Code Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create SparkSession
spark = SparkSession.builder.appName("PySpark Melt Function Example").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", 85, 90, 88),
("Ali Raza", 78, 83, 89),
("Bob", 92, 95, 91),
("Lisa", 80, 87, 85)
]
columns = ["Name", "Math", "Science", "History"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()
# Custom melt function
def melt(df, id_vars, value_vars, var_name="Subject", value_name="Score"):
    from pyspark.sql.functions import array, struct, explode, lit
    # Build an array of structs: one struct per value column (column name + value)
    _vars_and_vals = array(*[
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars
    ])
    # Explode the array so each struct becomes its own row
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    cols = id_vars + [
        col("_vars_and_vals")[var_name].alias(var_name),
        col("_vars_and_vals")[value_name].alias(value_name)
    ]
    return _tmp.select(*cols)
# Apply melt
melted_df = melt(df, ["Name"], ["Math", "Science", "History"])
melted_df.show()
How to Use createGlobalTempView() in PySpark | Step-by-Step Guide
Author: Aamir Shahzad
Published: March 2025
Introduction
The createGlobalTempView() function in PySpark allows you to register a DataFrame as a global temporary view that can be accessed across multiple Spark sessions. Unlike createTempView(), it persists for the lifetime of the Spark application and is stored in the global_temp database.
Sample Dataset
Name Department Salary
Aamir Shahzad Engineering 5000
Ali Sales 4000
Raza Marketing 3500
Bob Sales 4200
Lisa Engineering 6000
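Using the sample dataset above, this is a minimal sketch (the view name employees is just an example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("createGlobalTempView Example").getOrCreate()

data = [("Aamir Shahzad", "Engineering", 5000),
        ("Ali", "Sales", 4000),
        ("Raza", "Marketing", 3500),
        ("Bob", "Sales", 4200),
        ("Lisa", "Engineering", 6000)]
df = spark.createDataFrame(data, ["Name", "Department", "Salary"])

# Register the DataFrame as a global temporary view
df.createGlobalTempView("employees")

# Global temp views are stored in the global_temp database
spark.sql("SELECT Department, SUM(Salary) AS Total_Salary "
          "FROM global_temp.employees GROUP BY Department").show()

# The view is also visible from another session in the same application
spark.newSession().sql("SELECT * FROM global_temp.employees").show()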
How to Use createTempView() in PySpark | Step-by-Step Guide
Author: Aamir Shahzad
Published: March 2025
Introduction
The createTempView() function registers a PySpark DataFrame as a temporary SQL view. This allows you to execute SQL queries using Spark SQL, combining the power of SQL with the flexibility of DataFrames.
Sample Dataset
Name Department Salary
Aamir Shahzad Engineering 5000
Ali Sales 4000
Raza Marketing 3500
Bob Sales 4200
Lisa Engineering 6000
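Using the sample dataset above, a minimal sketch looks like this (the view name employees is illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("createTempView Example").getOrCreate()

data = [("Aamir Shahzad", "Engineering", 5000),
        ("Ali", "Sales", 4000),
        ("Raza", "Marketing", 3500),
        ("Bob", "Sales", 4200),
        ("Lisa", "Engineering", 6000)]
df = spark.createDataFrame(data, ["Name", "Department", "Salary"])

# Register the DataFrame as a session-scoped temporary view
df.createTempView("employees")

# Query the view with Spark SQL
spark.sql("SELECT Department, AVG(Salary) AS Avg_Salary "
          "FROM employees GROUP BY Department").show()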
How to Perform Unpivot in PySpark | Convert Columns to Rows | Step-by-Step Guide
Author: Aamir Shahzad
Published: March 2025
Introduction
In PySpark, unpivoting means converting columns into rows — the opposite of a pivot operation. While PySpark doesn't have a built-in unpivot() function, you can easily achieve this using selectExpr() and stack().
Sample Dataset
We’ll work with a DataFrame that contains sales data across three years:
Name Sales_2023 Sales_2024 Sales_2025
Aamir Shahzad 800 900 1000
Ali 500 600 1001
Raza 450 550 102
Bob 700 750 103
Lisa 620 None 109
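Using the sales data above, here is a sketch of the selectExpr() + stack() approach (the output column names Year and Sales are just a choice for this example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark Unpivot Example").getOrCreate()

data = [("Aamir Shahzad", 800, 900, 1000),
        ("Ali", 500, 600, 1001),
        ("Raza", 450, 550, 102),
        ("Bob", 700, 750, 103),
        ("Lisa", 620, None, 109)]
df = spark.createDataFrame(data, ["Name", "Sales_2023", "Sales_2024", "Sales_2025"])

# stack(3, ...) turns the three year columns into (Year, Sales) rows
unpivoted_df = df.selectExpr(
    "Name",
    "stack(3, '2023', Sales_2023, '2024', Sales_2024, '2025', Sales_2025) AS (Year, Sales)"
)
unpivoted_df.show()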
How to Use pivot() Function in PySpark | Step-by-Step Guide
Author: Aamir Shahzad
Published: March 2025
Introduction
The pivot() function in PySpark lets you rotate data in a DataFrame. It's used to transform rows into columns—helpful when summarizing data for reporting and analytics.
Sample Dataset
Name Year Product Revenue
Aamir Shahzad 2023 Product A 500
Aamir Shahzad 2023 Product B 300
Ali 2023 Product A 400
Raza 2023 Product B 200
Bob 2024 Product A 700
Lisa 2024 Product B 600
Ali 2024 Product A 300
Raza 2024 Product B 500
Aamir Shahzad 2024 Product A 800
Lisa 2023 Product A 650
Lisa 2025 Product A 650
Lisa 2026 Product C 1100
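Using the dataset above, here is a sketch of a pivot that rotates the Product values into columns and sums Revenue for each Name and Year:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

spark = SparkSession.builder.appName("PySpark Pivot Example").getOrCreate()

data = [("Aamir Shahzad", 2023, "Product A", 500),
        ("Aamir Shahzad", 2023, "Product B", 300),
        ("Ali", 2023, "Product A", 400),
        ("Raza", 2023, "Product B", 200),
        ("Bob", 2024, "Product A", 700),
        ("Lisa", 2024, "Product B", 600),
        ("Ali", 2024, "Product A", 300),
        ("Raza", 2024, "Product B", 500),
        ("Aamir Shahzad", 2024, "Product A", 800),
        ("Lisa", 2023, "Product A", 650),
        ("Lisa", 2025, "Product A", 650),
        ("Lisa", 2026, "Product C", 1100)]
df = spark.createDataFrame(data, ["Name", "Year", "Product", "Revenue"])

# Rotate Product values into columns, summing Revenue per Name and Year
pivot_df = df.groupBy("Name", "Year").pivot("Product").agg(_sum("Revenue"))
pivot_df.orderBy("Name", "Year").show()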