DataFrames and SQL
DataFrame syntax is more flexible than SQL syntax. Here we illustrate general usage patterns of SQL and DataFrames.
Suppose we have a data set loaded as a table called `myTable` and an equivalent DataFrame called `df`. We have three fields/columns: `col_1` (numeric type), `col_2` (string type), and `col_3` (timestamp type). Here are basic SQL operations and their DataFrame equivalents.
Notice that columns in DataFrames are referenced with `col("column_name")`.

| SQL | DataFrame (Python) |
|---|---|
| `SELECT col_1 FROM myTable` | `df.select(col("col_1"))` |
| `DESCRIBE myTable` | `df.printSchema()` |
| `SELECT * FROM myTable WHERE col_1 > 0` | `df.filter(col("col_1") > 0)` |
| `..GROUP BY col_2` | `..groupBy(col("col_2"))` |
| `..ORDER BY col_2` | `..orderBy(col("col_2"))` |
| `..WHERE year(col_3) > 1990` | `..filter(year(col("col_3")) > 1990)` |
| `SELECT * FROM myTable LIMIT 10` | `df.limit(10)` |
| `display(myTable)` (text format) | `df.show()` |
| `display(myTable)` (HTML format) | `display(df)` |
Hint: You can also run SQL queries with the special syntax `spark.sql("SELECT * FROM myTable")`.
Built-In Functions
Spark provides a number of built-in functions, many of which can be used directly with DataFrames. Use these functions in `filter` expressions to filter data and in `select` expressions to create derived columns.
The following DataFrame statement finds women born after 1990; it uses the `year` function and creates a `birthYear` column on the fly.
```python
from pyspark.sql.functions import col, year

display(peopleDF
        .select("firstName", "middleName", "lastName",
                year("birthDate").alias("birthYear"), "salary")
        .filter(year("birthDate") > 1990)
        .filter(col("gender") == "F"))
```