Monday, 29 July 2019

Databricks - Query a CSV file using SQL - Create Spark Table

The CREATE TABLE statement below registers the CSV file as a SQL Table. The CSV file can then be queried directly using SQL.
In order to allow this example to run quickly on a small cluster, we'll use the file small.csv instead.


%sql

CREATE DATABASE IF NOT EXISTS Databricks;
USE Databricks;

CREATE TABLE IF NOT EXISTS AirlineFlight
USING CSV
OPTIONS (
  header="true",
  delimiter=",",
  inferSchema="true",
  path="dbfs:/mnt/training/asa/flights/small.csv"
);

CACHE TABLE AirlineFlight;

SELECT * FROM AirlineFlight;


------------------ or simply---------------------
Option 1: Create a Spark table from the CSV data
Use this option if you want to get going quickly, and you only need standard levels of performance. Copy and paste this code snippet into a notebook cell:
DROP TABLE IF EXISTS diamonds;


CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")


Option 2: 
Write the CSV data to Databricks Delta format and create a Delta table
Databricks Delta offers a powerful transactional storage layer that enables fast reads and other benefits. Delta format consists of Parquet files plus a transaction log. Use this option to get the best performance on future operations on the table.
  1. Read the CSV data into a DataFrame and write out in Delta format. This command uses a Python language magic command, which allows you to interleave commands in languages other than the notebook primary language (SQL). Copy and paste this code snippet into a notebook cell:
%python

diamonds = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header="true", inferSchema="true")
diamonds.write.format("delta").save("/delta/diamonds")


2.  Create a Delta table at the stored location. Copy and paste this code snippet into a notebook cell:
Copy to clipboardCopy
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING DELTA LOCATION '/delta/diamonds/'

Query the table

Run a SQL statement to query the table for the average diamond price by color.
  1. To add a cell to the notebook, mouse over the cell bottom and click the icon.
    ../_images/quick-start-new-cell.png
  2. Copy this snippet and paste it in the cell.
    Copy to clipboardCopy
    SELECT color, avg(price) AS price FROM diamonds GROUP BY color ORDER BY COLOR
    
  3. Press SHIFT + ENTER. The notebook displays a table of diamond color and average price.
    ../_images/diamonds-table.png


No comments:

Post a Comment