Monday 29 July 2019

DataFrames and SQL

DataFrame syntax is more flexible than SQL syntax: operations compose as chained method calls rather than fixed clauses. Here we illustrate common usage patterns of SQL and DataFrames side by side.
Suppose we have a data set loaded as a table called myTable and an equivalent DataFrame called df, with three fields/columns: col_1 (numeric type), col_2 (string type), and col_3 (timestamp type). Here are basic SQL operations and their DataFrame equivalents.
Notice that DataFrame columns are referenced with col("column_name").
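As a concrete setup sketch before the table below (the sample rows are hypothetical, and spark is assumed to be an active SparkSession, as in a Databricks notebook):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data matching the three columns described above
df = (spark.createDataFrame(
        [(1, "a", "2000-05-01"), (-2, "b", "1985-06-15")],
        ["col_1", "col_2", "col_3"])
      .withColumn("col_3", to_timestamp("col_3")))

# Register the DataFrame as a temporary view so SQL can query it as myTable
df.createOrReplaceTempView("myTable")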
SQL                                     | DataFrame (Python)
SELECT col_1 FROM myTable               | df.select(col("col_1"))
DESCRIBE myTable                        | df.printSchema()
SELECT * FROM myTable WHERE col_1 > 0   | df.filter(col("col_1") > 0)
.. GROUP BY col_2                       | ..groupBy(col("col_2"))
.. ORDER BY col_2                       | ..orderBy(col("col_2"))
.. WHERE year(col_3) > 1990             | ..filter(year(col("col_3")) > 1990)
SELECT * FROM myTable LIMIT 10          | df.limit(10)
display(myTable) (text format)          | df.show()
display(myTable) (HTML format)          | display(df)
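The ".." rows above are clause fragments that combine into complete queries. For example, a grouped count over the positive rows might look like this (a sketch reusing df from the setup above):

from pyspark.sql.functions import col

# SQL: SELECT col_2, COUNT(*) AS count FROM myTable
#      WHERE col_1 > 0 GROUP BY col_2 ORDER BY col_2
(df.filter(col("col_1") > 0)
   .groupBy(col("col_2"))
   .count()
   .orderBy(col("col_2"))
   .show())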
Hint: You can also run SQL queries directly with the syntax spark.sql("SELECT * FROM myTable").
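Because spark.sql returns a DataFrame, the two styles mix freely. A short sketch, reusing the myTable view registered earlier:

# Start in SQL, then continue with DataFrame operations on the result
sqlDF = spark.sql("SELECT * FROM myTable WHERE col_1 > 0")
sqlDF.orderBy(col("col_2")).show()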


Built-In Functions

Spark provides a number of built-in functions, many of which can be used directly with DataFrames. Use them in filter expressions to filter rows and in select expressions to create derived columns.
The following DataFrame statement finds women born after 1990; it uses the year function and creates a birthYear column on the fly.
from pyspark.sql.functions import col, year  # imports needed for year() and col()

display(peopleDF
  .select("firstName", "middleName", "lastName",
          year("birthDate").alias("birthYear"), "salary")
  .filter(year("birthDate") > 1990)      # numeric comparison, not the string "1990"
  .filter(col("gender") == "F"))
