DataFrames and SQL
DataFrame syntax is more flexible than SQL syntax. Here we illustrate general usage patterns of SQL and DataFrames.
Suppose we have a data set loaded as a table called `myTable` and an equivalent DataFrame called `df`. We have three fields/columns: `col_1` (numeric type), `col_2` (string type), and `col_3` (timestamp type). Here are basic SQL operations and their DataFrame equivalents.
Notice that columns in DataFrames are referenced with `col("column_name")`.

| SQL | DataFrame (Python) |
|---|---|
| `SELECT col_1 FROM myTable` | `df.select(col("col_1"))` |
| `DESCRIBE myTable` | `df.printSchema()` |
| `SELECT * FROM myTable WHERE col_1 > 0` | `df.filter(col("col_1") > 0)` |
| `..GROUP BY col_2` | `..groupBy(col("col_2"))` |
| `..ORDER BY col_2` | `..orderBy(col("col_2"))` |
| `..WHERE year(col_3) > 1990` | `..filter(year(col("col_3")) > 1990)` |
| `SELECT * FROM myTable LIMIT 10` | `df.limit(10)` |
| `display(myTable)` (text format) | `df.show()` |
| `display(myTable)` (HTML format) | `display(df)` |
Hint: You can also run SQL queries with the special syntax `spark.sql("SELECT * FROM myTable")`.
Built-In Functions
Spark provides a number of built-in functions, many of which can be used directly with DataFrames. Use these functions in `filter` expressions to filter data and in `select` expressions to create derived columns.
The following DataFrame statement finds women born after 1990; it uses the `year` function and creates a `birthYear` column on the fly.
```python
from pyspark.sql.functions import col, year

display(peopleDF
        .select("firstName", "middleName", "lastName",
                year("birthDate").alias("birthYear"), "salary")
        .filter(year("birthDate") > 1990)
        .filter(col("gender") == "F"))
```