Apache Spark Join DataFrames Java Example
In modern data engineering pipelines, applications often need to combine multiple datasets that share the same schema. Apache Spark provides a powerful and scalable way to work with structured data through its Dataset and DataFrame APIs. In Java-based big-data ecosystems, concatenating two DataFrames with the same column structure is typically done using union() or unionByName(). Let us delve into how to concatenate DataFrames in Spark using Java.
1. Introduction to Spark
Apache Spark is a fast, distributed computing engine designed to process large-scale data efficiently across clusters. It provides high-level APIs in Java, Python, Scala, and R, enabling developers to work with structured and unstructured data using resilient distributed datasets (RDDs), DataFrames, and SQL queries. Spark is especially suited for data engineering workflows where operations such as filtering, aggregations, joins, and dataset transformations must scale beyond the limits of a single machine. By distributing computation across multiple nodes, Spark achieves significant performance gains, making it ideal for ETL pipelines, analytics platforms, and machine learning workloads. In modern architectures, Spark often runs within containerized environments like Docker or Kubernetes to ensure portability and repeatable deployments. With its rich ecosystem—including Spark SQL, Spark Streaming, and MLlib—Spark has become the backbone of many enterprise data platforms.
1.1 Problem Statement
You are given two DataFrames in Java—both containing the same columns (for example, id and name). Your task is to:
- Load or build both DataFrames
- Concatenate them row-wise so the final output contains all rows from both DataFrames
- Ensure that schemas match
Spark provides multiple ways to do this, and we will use union() for identical schemas or unionByName() for name-based concatenation.
1.1.1 Code Comparison
| Aspect | union() | unionByName() |
|---|---|---|
| Column Matching | By position/order of columns | By column names |
| Schema Requirement | Schemas must be exactly the same and in the same order | Columns can be in different orders; missing columns can be handled with options |
| Use Case | When both DataFrames have identical schema and column order | When DataFrames have the same columns but order differs or some columns are missing |
| Error Handling | Fails or produces incorrect results if schemas don’t match exactly | Handles mismatched column order gracefully by matching column names |
| Performance | Generally faster since no column name matching needed | Slightly slower due to matching columns by name |
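As a quick illustration of the differences summarized above, here is a minimal sketch, assuming df1 and df2 are existing Dataset<Row> instances with the same columns (the full, runnable program appears in Section 2.3):

Dataset<Row> byPosition = df1.union(df2);       // matches columns strictly by position
Dataset<Row> byName = df1.unionByName(df2);     // matches columns by name

// Since Spark 3.1, unionByName() can also tolerate missing columns,
// filling them with nulls, via the allowMissingColumns flag:
Dataset<Row> tolerant = df1.unionByName(df2, true);

Note that both operations behave like SQL UNION ALL: duplicate rows are kept unless you call distinct() on the result.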
2. Code Example
2.1 Setting up Spark Using Docker
To run the Java Spark example in a clean and reproducible environment, Docker provides an easy and consistent setup. Instead of manually installing Java, Spark binaries, and managing system-level environment variables, you can use Docker to package everything into a single container. This ensures that your Java-based Spark program runs the same way on any machine without requiring complex local installation steps. A straightforward approach is to use a custom Dockerfile that includes Java 11 and Spark 3.5.0, which matches the version used in the Java code example. Once the container is built, you can compile and run your Maven-based Spark project directly inside Docker. This also keeps your host machine clean from dependencies, while still allowing you to share code through mounted volumes.
Below is a minimal setup that prepares a Spark-ready environment for running the example:
version: "3.9"
services:
spark-master:
image: bitnami/spark:3.5.0
container_name: spark-master
environment:
- SPARK_MODE=master
- SPARK_RPC_AUTHENTICATION_ENABLED=no
- SPARK_RPC_ENCRYPTION_ENABLED=no
- SPARK_LOCAL_DIRS=/tmp
ports:
- "7077:7077"
- "8080:8080"
spark-worker:
image: bitnami/spark:3.5.0
container_name: spark-worker
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
depends_on:
- spark-master
pyspark:
image: jupyter/pyspark-notebook:spark-3.5.0
container_name: pyspark
depends_on:
- spark-master
ports:
- "8888:8888"
environment:
- SPARK_MASTER=spark://spark-master:7077
2.1.1 Code Explanation
This Docker Compose configuration sets up a complete Spark environment consisting of a Spark master, a Spark worker, and an optional PySpark Jupyter Notebook interface. The spark-master service uses the Bitnami Spark 3.5.0 image and exposes ports 7077 for cluster communication and 8080 for the Spark Master UI; environment variables configure it to run in master mode with authentication disabled for local development. The spark-worker service also uses the Bitnami Spark image and runs as a worker node that automatically registers with the master via the SPARK_MASTER_URL=spark://spark-master:7077 setting, ensuring that Spark jobs submitted by Java or PySpark applications are executed on this worker. The pyspark service provides a Jupyter Notebook environment with PySpark support, exposing port 8888 so users can run PySpark code from a browser while connecting to the same Spark master through the SPARK_MASTER=spark://spark-master:7077 environment variable. Overall, this configuration creates a functional single-node Spark cluster where Java Spark applications and PySpark notebooks can both connect to the master and execute distributed operations across the worker node.
2.1.2 Code Run
To run this Spark setup, first save the provided Docker Compose configuration into a file named docker-compose.yml at the root of your project directory. Once the file is in place, open a terminal and navigate to that directory, then start the entire Spark environment by executing the command docker compose up -d, which launches the Spark master, Spark worker, and the PySpark Jupyter Notebook container in detached mode. After the containers start, you can verify that they are running correctly by executing docker ps, which should list all three services. The Spark Master web UI becomes available at http://localhost:8080, allowing you to inspect the cluster status, while the PySpark Jupyter Notebook can be accessed through http://localhost:8888 using the token displayed in the container logs. With all services running, both Java Spark applications and PySpark notebooks can connect to the Spark cluster via spark://localhost:7077, enabling you to immediately begin executing distributed Spark jobs within this Docker-based environment.
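For quick reference, the commands described above are listed below (container names follow the Compose file; the Jupyter token printed in the logs will differ on each run):

docker compose up -d     # start spark-master, spark-worker, and pyspark in detached mode
docker ps                # confirm all three containers are running
docker logs pyspark      # find the Jupyter Notebook access token in the startup output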
2.2 Setting up the Maven Project for Spark
Create a Maven project and add Spark dependencies in pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>spark-dataframe-concat</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.5.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.5.0</version>
        </dependency>
    </dependencies>
</project>
These dependencies allow your Java application to run Spark SQL and DataFrame operations.
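Optionally, if you prefer not to pass the main class on the command line for every run, one approach is to add the exec-maven-plugin to the same pom.xml. The snippet below is a sketch of that optional addition (plugin version and configuration are illustrative), not part of the minimal pom shown above:

<build>
    <plugins>
        <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>exec-maven-plugin</artifactId>
            <version>3.1.0</version>
            <configuration>
                <!-- Assumes the example class from Section 2.3, in the default package -->
                <mainClass>DataFrameUnionExample</mainClass>
            </configuration>
        </plugin>
    </plugins>
</build>

With this in place, mvn compile exec:java runs the example without the -Dexec.mainClass flag.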
2.3 Code Example
Below is the complete Java code that creates two DataFrames with the same schema and concatenates them:
// DataFrameUnionExample.java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

public class DataFrameUnionExample {

    public static void main(String[] args) {
        // Initialize Spark
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame Union Example")
                .master("spark://localhost:7077")
                .getOrCreate();

        // Schema with columns in order: id, name
        StructType schema1 = new StructType(new StructField[]{
                DataTypes.createStructField("id", DataTypes.IntegerType, false),
                DataTypes.createStructField("name", DataTypes.StringType, false)
        });

        // Schema with columns in different order: name, id
        StructType schema2 = new StructType(new StructField[]{
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("id", DataTypes.IntegerType, false)
        });

        // Rows for first DataFrame
        List<Row> rows1 = Arrays.asList(
                RowFactory.create(1, "Alice"),
                RowFactory.create(2, "Bob")
        );

        // Rows for second DataFrame (column order swapped)
        List<Row> rows2 = Arrays.asList(
                RowFactory.create("Charlie", 3),
                RowFactory.create("David", 4)
        );

        // Create DataFrames with different schema orders
        Dataset<Row> df1 = spark.createDataFrame(rows1, schema1);
        Dataset<Row> df2 = spark.createDataFrame(rows2, schema2);

        System.out.println("=== DataFrame 1 ===");
        df1.show();
        System.out.println("=== DataFrame 2 ===");
        df2.show();

        // Concatenate using union() - this assumes same schema order, so this will fail or give wrong result
        try {
            Dataset<Row> unionResult = df1.union(df2);
            System.out.println("=== Combined DataFrame using union() ===");
            unionResult.show();
        } catch (Exception e) {
            System.out.println("union() failed due to schema mismatch: " + e.getMessage());
        }

        // Concatenate using unionByName() - matches columns by name correctly
        Dataset<Row> unionByNameResult = df1.unionByName(df2);
        System.out.println("=== Combined DataFrame using unionByName() ===");
        unionByNameResult.show();

        spark.stop();
    }
}
2.3.1 Code Explanation
This Java code demonstrates the difference between union() and unionByName() in Apache Spark. It starts by initializing a SparkSession and defining two schemas with the same columns—id and name—but in different orders. Two DataFrames are created from lists of rows: the first with schema order (id, name), and the second with (name, id). When attempting to combine these DataFrames using union(), the code catches and reports an error or incorrect result due to the schema mismatch caused by differing column orders. However, using unionByName() successfully concatenates the two DataFrames by matching columns based on their names regardless of order. The output shows the combined DataFrame with rows from both sources properly aligned, illustrating why unionByName() is preferable when column order varies between DataFrames. Finally, the Spark session is stopped to release resources.
2.3.2 Code Output
To compile the project, run mvn clean install, and to execute the program, use mvn exec:java -Dexec.mainClass="DataFrameUnionExample".
=== DataFrame 1 ===
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+

=== DataFrame 2 ===
+-------+---+
|   name| id|
+-------+---+
|Charlie|  3|
|  David|  4|
+-------+---+

union() failed due to schema mismatch: union can only be performed on tables with the compatible column types

=== Combined DataFrame using unionByName() ===
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|Charlie|
|  4|  David|
+---+-------+
This Java program shows how to combine two Apache Spark DataFrames using both union() and unionByName(). First, it initializes a SparkSession and defines two schemas with the same columns (id and name) but in different orders. Then, it creates two DataFrames from sample data matching these schemas. The program attempts to concatenate the DataFrames using union(), which expects identical schemas with columns in the same order; this causes an error or incorrect results due to the column order mismatch. Next, it uses unionByName() to concatenate the DataFrames by matching columns based on their names, which works correctly even when column orders differ. Finally, the combined DataFrame is displayed, showing all rows from both inputs properly aligned, and the Spark session is stopped to release resources.
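If you specifically want the positional union() to succeed with these two DataFrames, a common workaround (not part of the program above) is to reorder the second DataFrame's columns with select() before the union, as sketched below:

// Reorder df2's columns to (id, name) so positional matching lines up with df1
Dataset<Row> aligned = df2.select("id", "name");
Dataset<Row> unionResult = df1.union(aligned);
unionResult.show();   // produces the same rows as the unionByName() result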
3. Conclusion
Concatenating DataFrames in Java using Apache Spark is straightforward when schemas match. The union() and unionByName() operations efficiently combine datasets in a scalable way suitable for production data pipelines. This approach helps unify data from multiple sources, making it essential in ETL, analytics, and machine-learning feature-engineering workflows.




