Write your first Apache Spark job
To write your first Apache Spark job, you add code to the cells of an Azure Databricks notebook. This example uses Python. For more information, see the Apache Spark Quick Start Guide.
This first command lists the contents of a folder in the Databricks File System:
# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))

The next command uses spark, the SparkSession available in every notebook, to read the README.md text file and create a DataFrame named textFile:

textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")
To count the lines of the text file, apply the count action to the DataFrame:

textFile.count()

You may notice that the second command, which reads the text file, does not generate any output, while the third command, which performs the count, does. This is because reading the text file is a transformation, while count is an action. Transformations are lazy and run only when an action is triggered, which lets Spark optimize execution (for example, pushing a filter ahead of a join) instead of running commands serially. For a complete list of transformations and actions, refer to the Apache Spark Programming Guide: Transformations and Actions.
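You can see this lazy behavior directly in a notebook cell by chaining a transformation onto textFile and then triggering it with an action. The sketch below relies on the value column that spark.read.text produces and uses "Spark" purely as an illustrative search term:

# A transformation: defines the filter but computes nothing yet
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))

# An action: triggers execution of the filter and returns the number of matching lines
linesWithSpark.count()

The first cell returns immediately because it only records the filter in the query plan; the count is what actually runs the job on the cluster.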