Write your first Apache Spark job
To write your first Apache Spark job, you add code to the cells of an Azure Databricks notebook. This example uses Python. For more information, see the Apache Spark Quick Start Guide.
This first command lists the contents of a folder in the Databricks File System:
# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))
The next command uses spark, the SparkSession available in every notebook, to read the README.md text file and create a DataFrame named textFile:
textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")
To count the lines of the text file, apply the count action to the DataFrame:
textFile.count()
You may notice that the second command, reading the text file, does not generate any output, while the third command, performing the count, does. The reason is that reading the file is a transformation while count is an action. Transformations are lazy and run only when an action is invoked, which lets Spark optimize for performance (for example, running a filter before a join) instead of executing commands serially. For a complete list of transformations and actions, refer to the Apache Spark Programming Guide: Transformations and Actions.