Introducing DataFrames

Under the covers, DataFrames are derived from data structures known as Resilient Distributed Datasets (RDDs). RDDs and DataFrames are immutable distributed collections of data. Let's take a closer look at what some of these terms mean before we understand how they relate to DataFrames:

Resilient: They are fault tolerant, so if part of your operation fails, Spark quickly recovers the lost computation.
Distributed: RDDs are distributed across networked machines known as a cluster.
DataFrame: A data structure where data is organized into named columns, like a table in a relational database, but with richer optimizations under the hood.

Without the named columns and declared types provided by a schema, Spark wouldn't know how to optimize the execution of any computation. Since DataFrames have a schema, they use the Catalyst Optimizer to determine the optimal way to execute your code.

DataFrames were introduced because much of the business community already works with tabular data: tables in a relational database, pandas or R DataFrames, or Excel worksheets. A Spark DataFrame is conceptually equivalent to these, with richer optimizations under the hood and the benefit of being distributed across a cluster.

DataFrames

The Apache Spark DataFrame API provides a rich set of functions (select columns, filter, join, aggregate, and so on) that allow you to solve common data analysis problems efficiently. DataFrames also allow you to intermix operations seamlessly with custom Python, R, Scala, and SQL code. In this tutorial module, you will learn how to load sample data, view a DataFrame, run SQL queries, and visualize the DataFrame.

We also provide a sample notebook that you can import to access and run all of the code examples included in the module.

Load sample data

The easiest way to start working with DataFrames is to use an example Azure Databricks dataset available in the /databricks-datasets folder accessible within the Azure Databricks workspace. To access the file that compares city population versus median sale prices of homes, load the file /databricks-datasets/samples/population-vs-price/data_geo.csv.

%python
# Use the Spark CSV datasource with options specifying:
# - First line of file is a header
# - Automatically infer the schema of the data
data = spark.read.csv("/databricks-datasets/samples/population-vs-price/data_geo.csv", header="true", inferSchema="true")
data.cache() # Cache data for faster reuse
data = data.dropna() # drop rows with missing values
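
Because the load step asks Spark to infer the schema, it can help to check what was actually inferred and how a query over the DataFrame is planned. The following is a minimal sketch, not part of the original notebook: the helper name show_schema_and_plan is hypothetical, while printSchema() and explain() are standard PySpark DataFrame methods.

```python
def show_schema_and_plan(df, extended=False):
    """Print the schema of a Spark DataFrame, then its query plan.

    `df` is assumed to be a PySpark DataFrame, such as the `data`
    DataFrame loaded above. The function name is hypothetical.
    """
    # Column names and the types Spark inferred for them
    df.printSchema()
    # Physical plan; with extended=True the logical plans are printed as well
    df.explain(extended)
```

In a notebook cell, calling show_schema_and_plan(data, extended=True) should print the inferred column types followed by the logical and physical plans.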

View the DataFrame

Now that you have created the data DataFrame, you can quickly access the data using standard Spark commands such as take(). For example, you can use the command data.take(10) to view the first ten rows of the data DataFrame. Because this is a SQL notebook, the next few commands use the %python magic command.

%python
data.take(10)

[Figure: output of the DataFrame take() command]
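
The take() command returns a list of Row objects collected to the driver, which can be hard to read in raw form. As a rough sketch (the helper name preview_rows is hypothetical and not part of the original notebook), each Row can be converted to a plain dictionary with the standard Row.asDict() method:

```python
def preview_rows(df, n=10):
    """Return the first n rows of a Spark DataFrame as dictionaries.

    `df` is assumed to be a PySpark DataFrame such as `data`.
    Each dictionary is keyed by column name.
    """
    # take(n) brings at most n rows back to the driver as Row objects
    return [row.asDict() for row in df.take(n)]
```

For example, preview_rows(data, 3) would return a list of up to three dictionaries, one per row, which is easier to scan than the raw Row representation.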

To view this data in a tabular format, you can use the Azure Databricks display() command instead of exporting the data to a third-party tool.

%python
display(data)

[Figure: the data DataFrame rendered by the display() command]

Run SQL queries

Before you can issue SQL queries, you must register your data DataFrame as a temporary view:

%python
# Register a temporary view so the DataFrame is accessible from SQL
data.createOrReplaceTempView("data_geo")

Then, in a new cell, specify a SQL query to list the 2015 median sales price by state:

select `State Code`, `2015 median sales price` from data_geo

[Figure: query results shown in a table]
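
The same query can also be issued from a Python cell. A minimal sketch, assuming `spark` is the SparkSession that Databricks notebooks predefine and that the data_geo view has already been registered; the helper name query_median_price is hypothetical:

```python
def query_median_price(spark, view_name="data_geo"):
    """Run the median sales price query through spark.sql().

    `spark` is assumed to be an active SparkSession, and `view_name`
    a temporary view registered with createOrReplaceTempView().
    """
    query = f"select `State Code`, `2015 median sales price` from {view_name}"
    # spark.sql() returns the result as a new DataFrame
    return spark.sql(query)
```

The returned DataFrame can be passed to display() or take() like any other, for example display(query_median_price(spark)).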

You can also query the 2014 population estimate for cities in the state of Washington:

select City, `2014 Population estimate` from data_geo where `State Code` = 'WA';
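
Both SQL queries above can also be written with the DataFrame API directly, without registering a view. The sketch below is illustrative only: the function names are hypothetical, and `data` is assumed to be the DataFrame loaded earlier with the same column names used in the SQL queries.

```python
def price_by_state_code(data):
    """DataFrame API form of the median sales price query."""
    # Column names containing spaces can be passed to select() as plain strings
    return data.select("State Code", "2015 median sales price")


def washington_population(data):
    """DataFrame API form of the Washington population query."""
    # where() accepts a SQL expression string; backticks quote the column name
    return data.where("`State Code` = 'WA'").select("City", "2014 Population estimate")
```

Each function returns a new DataFrame, so the results can be chained with further operations or passed to display(), for example display(washington_population(data)).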

Visualize the DataFrame

An additional benefit of using the Azure Databricks display() command is that you can quickly view this data with a number of embedded visualizations. Click the down arrow next to the