Wednesday, 31 July 2019

Write your first Apache Spark job





To write your first Apache Spark job, you add code to the cells of an Azure Databricks notebook. This example uses Python. For more information, see the Apache Spark Quick Start Guide.
This first command lists the contents of a folder in the Databricks File System:
# Take a look at the file system
display(dbutils.fs.ls("/databricks-datasets/samples/docs/"))
The next command uses spark, the SparkSession available in every notebook, to read the README.md text file and create a DataFrame named textFile:
textFile = spark.read.text("/databricks-datasets/samples/docs/README.md")
To count the lines of the text file, apply the count action to the DataFrame:
textFile.count()
You may notice that the second command, which reads the text file, produces no output, while the third command, which performs the count, does. The reason is that the second command is a transformation while the third is an action. Transformations are lazy and run only when an action is run. This allows Spark to optimize the whole sequence for performance (for example, run a filter prior to a join) instead of running commands serially. For a complete list of transformations and actions, refer to the Apache Spark Programming Guide: Transformations and Actions.
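The lazy behavior described above can be imitated in plain Python with generators. This is only an analogy, not Spark itself: the helper names read_lines, keep_matching and count, and the three sample lines, are made up for illustration. It shows that building the pipeline does no work, and that work happens only when the count "action" consumes it.

```python
# Plain-Python analogy for lazy transformations and eager actions.
# This is NOT Spark code; all names and sample data here are hypothetical.

log = []  # records each moment a line is actually read


def read_lines(lines):
    """Lazy 'read': a generator that yields lines only when consumed."""
    for line in lines:
        log.append(f"read: {line}")
        yield line


def keep_matching(lines, word):
    """Lazy 'transformation': describes a filter but does no work yet."""
    return (line for line in lines if word in line)


def count(lines):
    """'Action': consumes the pipeline, which triggers all pending work."""
    return sum(1 for _ in lines)


sample = ["Welcome to Spark", "A plain line", "Spark is lazy"]

# Building the pipeline reads nothing, like defining a DataFrame.
pipeline = keep_matching(read_lines(sample), "Spark")
work_before_action = len(log)

# Running the action pulls every line through the pipeline.
matching = count(pipeline)
work_after_action = len(log)

print(work_before_action, matching, work_after_action)
```

In the notebook, the same pattern applies to textFile: defining it returns immediately, and the file is processed only when count() runs.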

