Saturday 27 July 2019

Databricks Notebook Fundamentals


A notebook is a collection of cells. These cells are run to execute code, render formatted text, or display graphical visualizations.

Understanding Code Cells and Markdown Cells


The following cell (with the gray text area) is a code cell.
Cmd 5
1 + 1
Out[3]: 2
Command took 0.04 seconds -- by girdhar.ramit@gmail.com at 7/28/2019, 2:35:30 PM on Test Cluster
Cmd 6
In this case, the code is written in Python. The default language for a cell is provided by the notebook, as can be seen in the title of the notebook, near the top left of the window. In this case it reads 01 Notebook Fundamentals (Python). The notebook language is always shown in parentheses following the notebook title.
In order to run a notebook cell, your notebook must be attached to a cluster. If your notebook is not attached to a cluster, you will be prompted to do so before the cell can run.
To attach a notebook to a cluster:
  1. In the notebook toolbar, click the cluster drop-down, which reads Detached.
  2. From the drop-down, select a cluster.
The code cell above has not run yet, so the expression 1 + 1 has not been evaluated. To run the code cell, select the cell by placing your cursor within the cell text area and do any of the following:
  • Press Shift + Enter (to run the current cell and advance to the next cell)
  • Press Ctrl + Enter (to run the current cell, but keep the current cell selected)
  • Use the cell actions menu that is found at the far right of the cell to select the run cell option: Run Cell
Cmd 7
The following cell is another example of a code cell. Run it to see its output.
Cmd 8
# This is also a code cell
print("Welcome to the fabulous world of Databricks!")
Welcome to the fabulous world of Databricks!
Command took 0.03 seconds -- by girdhar.ramit@gmail.com at 7/28/2019, 2:35:49 PM on Test Cluster
Cmd 9
The following cell, which displays its output as formatted text, is a markdown cell.
Cmd 10
This is a markdown cell.
To create a markdown cell, start the cell with a "magic", which is the short name for a Databricks magic command.
Magic commands start with %.
The magic for markdown is %md.
The magic command must always be the first text within the cell.
The following provides the list of supported magics:
  • %python - Allows you to execute Python code in the cell.
  • %r - Allows you to execute R code in the cell.
  • %scala - Allows you to execute Scala code in the cell.
  • %sql - Allows you to execute SQL statements in the cell.
  • %sh - Allows you to execute Bash Shell commands and code in the cell.
  • %fs - Allows you to execute Databricks Filesystem commands in the cell.
  • %md - Allows you to render Markdown syntax as formatted content in the cell.
  • %run - Allows you to run another notebook from a cell in the current notebook.
To read more about magics see here.
Cmd 11
Double-click the above cell (or select the cell and press ENTER) and notice how the cell changes to an editable code cell. Observe that the first line starts with %md. This magic instructs the notebook not to run the cell contents as Python, but instead to render the contents from the markdown syntax.
To render the markdown, either run the cell, select another cell or press the Esc key.
Cmd 12

Supported Markdown content

Cmd 13
Markdown notebook cells in Azure Databricks support a wide variety of content that helps your notebook convey more than just code.
Cmd 14
Try running the cell below to see how to display HTML that loads the Bing website in an iframe.
Cmd 15
displayHTML("<iframe src='https://bing.com' width='100%' height='350px'/>")
Cmd 16

Understanding cell output

Cmd 17
By default, a notebook cell will output the value of evaluating the last line of the cell.
Cmd 18
Run the following cell. Observe that the entire cell is echoed in the output because the cell contains only one line.
Cmd 19
"Hello Databricks world!"
Cmd 20
Next, examine the following cell. What do you expect the output to be? Run the cell and confirm your understanding.
Cmd 21
"Hello Databricks world!"
"And, hello Microsoft Ignite!"
Cmd 22
If you want to ensure that every line displays its output, use the print function.
Cmd 23
print("Hello Databricks world!")
print("And, hello Microsoft Ignite!")
Cmd 24
Not all code lines return a value to be output. Run the following cell to see one such example.
Cmd 25
text_variable = "Hello, hello!"
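The echo behaviour described in this section can be imitated in plain Python. The sketch below is only an illustration and makes no claim about how Databricks implements it: run_cell is a hypothetical helper that executes every statement of a cell and returns the value of the final line only when that line is an expression.

```python
import ast


def run_cell(source, namespace):
    """Run a cell's source and return the value of its last line,
    or None when the last line is not an expression.

    Hypothetical helper for illustration only; not a Databricks API.
    """
    tree = ast.parse(source)
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        # Split off the final expression so its value can be captured.
        last_expr = ast.Expression(tree.body.pop().value)
        exec(compile(tree, "<cell>", "exec"), namespace)
        return eval(compile(last_expr, "<cell>", "eval"), namespace)
    # The last line is a statement (for example an assignment): nothing to echo.
    exec(compile(tree, "<cell>", "exec"), namespace)
    return None


state = {}
one_line = run_cell('"Hello Databricks world!"', state)
two_lines = run_cell('"Hello Databricks world!"\n"And, hello Microsoft Ignite!"', state)
assignment = run_cell('text_variable = "Hello, hello!"', state)
```

In this sketch only the last expression of the two-line cell is returned, and the assignment returns nothing even though it does store text_variable in the shared state.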
Cmd 26

Running multiple notebook cells

Cmd 27
It is not uncommon to need to run (or re-run) a subset of all notebook cells in top to bottom order.
There are a few ways you can accomplish this. You can:
  • Run all cells in the notebook. Select Run All in the notebook toolbar to run all cells starting from the first.
  • Run cells above or below your current cell. To run all cells above or below a cell, go to the cell actions menu at the far right, select Run Menu, and then select Run All Above or Run All Below. 
Cmd 28

Navigating Cells

You can quickly navigate through the cells in the notebook by using the up or down arrows. If your cursor is within a code cell, keep pressing up or down until you reach the top or bottom edge of the cell respectively, and the focus will automatically jump to the next cell.
When you navigate cells, observe the selected cell is highlighted with a darker black border.
Cmd 29
Select this cell by single-clicking. Then try navigating up 3 cells, and return to this cell once you have tried out the navigation.
Cmd 30
When you navigated between cells without entering into the mode where you could edit the cell, you were in command mode.
When the editable text area has the cursor within it, you are in edit mode.
There are different functions and keyboard shortcuts available depending on which mode you are in.
Cmd 31

Managing notebook Cells

Cmd 32
You can perform many tasks within a notebook by using the UI or by using the keyboard shortcuts.
To learn the shortcuts (or refresh your memory later), select Shortcuts at the top right of the notebook to display a dialog that lists all of the keyboard shortcuts.
Cmd 33

Keyboard shortcuts for adding and removing cells

Cmd 34
When a cell is in command mode, press the following keys:
  • A (insert a cell above)
  • B (insert a cell below)
  • D,D (delete current cell)
Try the following in your notebook:
  1. Insert a cell below this cell (B)
  2. Insert a cell above that new cell (Esc, A)
  3. Delete both new cells (Esc, D, D)
Cmd 35

Adding and removing cells using the UI

Cmd 36
You can also add and remove cells using the UI.
To add a cell, mouse over a cell at the top or bottom and click the Add Cell icon.
Alternatively, access the notebook cell menu at the far right, select the down caret, and select Add Cell Above or Add Cell Below.
Cmd 37

Adjusting cell order

Cmd 38
You can move cells up or down within the notebook to fit your needs.
You can do so using the UI or via keyboard shortcuts.
When in command mode, you can cut and paste entire cells using keyboard shortcuts. To do so, select the cell and then press X to cut it. Use the up or down arrow keys to find the cell around which it should be pasted. Press V to paste the cell below the selected cell or press SHIFT + V to paste the cell above the selected cell.
You can also move cells using the UI, by accessing the notebook cell menu at the far right, selecting the down caret and then selecting either Move Up or Move Down or the Cut Cell and Paste Cell options.
Cmd 39
Experiment with the commands for re-ordering cells by adjusting the order of the following cells so that they appear in the order 1, 2, 3.
Cmd 40
3 - This cell should appear third.
Cmd 41
1 - This cell should appear first.
Cmd 42
2 - This cell should appear second.
Cmd 43

Understanding notebook state

Cmd 44
When you execute notebook cells, their execution is backed by a process running on a cluster. The state of your notebook, such as the values of variables, is maintained in the process. All variables default to a global scope (unless you author your code so it has nested scopes) and this global state can be a little confusing at first when you re-run cells.
Cmd 45
Run the following two cells in order and take note of the value output for the variable y:
Cmd 46
x = 10
Cmd 47
y = x + 1
y
Cmd 48
Next, run the following cell.
Cmd 49
x = 100
Cmd 50
Now select the cell that has the lines y = x + 1 and y, and re-run that cell. Did the value of y meet your expectation?
The value of y should now be 101. This is because it is not the actual order of the cells that determines the value, but the order in which they are run and how that affects the underlying state itself. To understand this, realize that when the code x = 100 was run, this changed the value of x, and then when you re-ran the cell containing y = x + 1 this evaluation used the current value of x which is 100. This resulted in y having a value of 101 and not 11.
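The same effect can be reproduced outside a notebook. The following sketch is an analogy under stated assumptions, not a description of Databricks internals: a single dictionary (notebook_state, a made-up name) plays the role of the shared global state, and each "cell" is a string executed against it in the order you choose.

```python
# One shared dictionary stands in for the notebook's global state.
notebook_state = {}

# The three cells from this section, keyed by illustrative names.
cells = {
    "set_x_to_10": "x = 10",
    "compute_y": "y = x + 1",
    "set_x_to_100": "x = 100",
}


def run(cell_name):
    """Execute one cell's source against the shared state (illustrative helper)."""
    exec(cells[cell_name], notebook_state)


run("set_x_to_10")
run("compute_y")
first_y = notebook_state["y"]   # x was 10 when the cell ran

run("set_x_to_100")
run("compute_y")                # re-run the same cell
second_y = notebook_state["y"]  # x was 100 this time
```

Here first_y is 11 and second_y is 101: the result depends on the order in which the cells were run, not on where they sit in the notebook.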
Cmd 51

Clearing state and output

You can use the Clear dropdown on the notebook toolbar to remove output (results), or to remove output and clear the underlying state.
  • Clear -> Clear Results (removes the displayed output for all cells)
  • Clear -> Clear State (removes all cell states)
You typically do this when you want to cleanly re-run a notebook you have been working on and eliminate any accidental changes to the state that may have occurred while you were authoring the notebook.
Read more about execution context here.
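To get a feel for what clearing state means, here is a small plain-Python analogy. It assumes that the state behaves like a dictionary of variables, which is a simplification for illustration only: once the dictionary is emptied, a cell that depends on an earlier variable fails until that earlier cell is run again.

```python
# A dictionary stands in for the notebook state (illustrative assumption).
state = {}
exec("x = 10", state)
exec("y = x + 1", state)
y_before_clear = state["y"]

# Comparable in spirit to Clear -> Clear State: every variable is forgotten.
state.clear()

try:
    exec("y = x + 1", state)  # x no longer exists
    failed_after_clear = False
except NameError:
    failed_after_clear = True
```

After the clear, the cell that computes y raises a NameError, which is why a cleared notebook has to be re-run from the top.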
Cmd 52

Introducing Spark DataFrames

Cmd 53
Let's create a simple DataFrame containing one column.
Cmd 54
df = spark.range(1000).toDF("number")
display(df)
Cmd 55
Let's take a look at its content
Cmd 56
df.describe().show()
Cmd 57
Get access to a specific column
Cmd 58
df["number"]
Cmd 59
And to all columns
Cmd 60
df.columns
Cmd 61
Take a look at the schema of the DataFrame
Cmd 62
df.schema
Cmd 63
Apply a simple projection to the DataFrame and show the first 15 rows
Cmd 64
df.select("number").show(15)
Cmd 65
Apply a more explicit filter
Cmd 66
df.where("number > 10 and number < 14").show()
Cmd 67
Notice the difference between df.where("number > 10 and number < 14") and the same expression followed by .show() or .count(). This is one core property of DataFrames called lazy evaluation (more about it later in the day). Try it yourself in the next cell.
Cmd 68
# Your code goes here
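If lazy evaluation is new to you, a plain-Python generator gives a rough feel for it. This is only an analogy and does not show how Spark works internally: building the pipeline does no work, and the numbers are inspected only when something consumes the result. The names numbers, evaluated and pipeline are made up for this sketch.

```python
evaluated = []  # records which numbers were actually inspected


def numbers(limit):
    """Yield 0..limit-1, noting each value at the moment it is produced."""
    for n in range(limit):
        evaluated.append(n)
        yield n


# Building the pipeline does no work yet, much like df.where(...) on its own.
pipeline = (n for n in numbers(1000) if 10 < n < 14)
work_before_action = len(evaluated)

# Consuming the pipeline plays the role of an action such as .show() or .count().
result = list(pipeline)
work_after_action = len(evaluated)
```

Before the pipeline is consumed, nothing has been evaluated; afterwards all 1000 numbers have been inspected and result holds 11, 12 and 13.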
Cmd 69
And finally, let's have some fun with sampling
Cmd 70
seed = 10
withReplacement = False
fraction = 0.02
df.sample(withReplacement, fraction, seed).show(100)
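Two things are worth knowing about the sampling call above: the same seed gives the same sample, and fraction is not guaranteed to return exactly that share of the rows. The sketch below imitates this in plain Python, assuming that sampling without replacement behaves like an independent coin flip per row. The helper bernoulli_sample is hypothetical and is not the Spark implementation.

```python
import random


def bernoulli_sample(values, fraction, seed):
    """Keep each value independently with probability `fraction`.

    Illustrative stand-in for sampling without replacement; not Spark code.
    """
    rng = random.Random(seed)
    return [v for v in values if rng.random() < fraction]


values = range(1000)
sample_a = bernoulli_sample(values, 0.02, seed=10)
sample_b = bernoulli_sample(values, 0.02, seed=10)  # same seed, same sample
sample_c = bernoulli_sample(values, 0.02, seed=11)  # different seed

# The sample size hovers around 1000 * 0.02 = 20 but is not fixed.
sizes = (len(sample_a), len(sample_c))
```

Because each row is kept or dropped independently, the size of the sample varies from seed to seed, while repeating a seed reproduces the sample exactly.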
Cmd 71

Introducing Pandas DataFrames

Cmd 72
So far we've used Spark DataFrames, which are the de facto choice in Databricks. If you are used to working with Pandas, however, the next lines will make you feel at home:
Cmd 73
pdf = df.toPandas()
pdf.dtypes
Cmd 74
Let's create a new Pandas DataFrame
Cmd 75
import pandas as pd
import numpy as np
# np.random.randint(11) returns a single integer, so every row of ColumnC gets the same value
pdf = pd.DataFrame(data={'ColumnA': np.linspace(0, 1, 11),
                         'ColumnB': ['red', 'yellow', 'blue', 'green', 'red',
                                     'green', 'green', 'red', 'yellow', 'blue', 'green'],
                         'ColumnC': np.random.randint(11)})
pdf
Cmd 76
# One column
pdf['ColumnA']
Cmd 77
# Single value
pdf.loc[0, 'ColumnA']
Cmd 78
# Slicing by index
pdf.loc[3:7:2]
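One detail of the slice above is easy to miss: .loc slices by label and includes the end label, whereas .iloc slices by position and excludes the end, like ordinary Python slicing. The short sketch below rebuilds a one-column frame with the same index so it runs on its own; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A stand-in frame with the same default index 0..10 as the one above.
pdf = pd.DataFrame({"ColumnA": np.linspace(0, 1, 11)})

by_label = pdf.loc[3:7:2]      # label-based: the end label 7 is included
by_position = pdf.iloc[3:7:2]  # position-based: the end position 7 is excluded

label_rows = list(by_label.index)
position_rows = list(by_position.index)
```

With the default integer index, the label-based slice returns rows 3, 5 and 7, while the position-based slice returns only rows 3 and 5.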
Cmd 79
pdf['ColumnB']=='red'
Cmd 80
# Boolean mask
pdf[pdf['ColumnB']=='red']
Cmd 81
# Boolean mask multiple conditions
pdf[(pdf['ColumnB']=='red') & (pdf['ColumnA']==0.4)]
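A word of caution on the second condition: comparing floating-point values with == depends on the exact binary representation, so values produced by np.linspace can miss a match that looks obvious on screen. A tolerance-based comparison with np.isclose is a more forgiving alternative. The sketch below uses a stand-in frame and the value 0.3 purely for illustration.

```python
import numpy as np
import pandas as pd

# A stand-in frame with the same ColumnA values as the one above.
pdf = pd.DataFrame({"ColumnA": np.linspace(0, 1, 11)})

# Exact comparison: may find nothing if the stored value is off by a rounding error.
exact_matches = pdf[pdf["ColumnA"] == 0.3]

# Tolerance-based comparison: matches values that are equal up to rounding.
close_matches = pdf[np.isclose(pdf["ColumnA"], 0.3)]
close_rows = list(close_matches.index)
```

The tolerance-based mask reliably selects row 3, whatever the last few bits of the stored value happen to be.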
Cmd 82
This concludes the Notebook Fundamentals lab.
Through this lab you've been introduced to the core concepts related to Databricks notebook structure and execution (cluster connections, cells, execution flow, and state).
You have also tested some simple features of the core Databricks data modeling element, the Spark DataFrame, and of its traditional Python counterpart, the Pandas DataFrame.
