Monday 29 July 2019

Databricks - Read and write functions


Let's look in detail at the reading and writing capabilities of Spark notebooks in Azure Databricks. Databricks provides a range of built-in functions that let you query, process, and analyze large volumes of data.
  • Querying files: You can use DataFrames to query large data files. DataFrames are derived from data structures known as resilient distributed datasets (RDDs). RDDs and DataFrames are immutable distributed collections of data.
    • Resilient: RDDs are fault tolerant, so if part of your operation fails, Spark quickly rebuilds any lost data.
    • Distributed: RDDs are distributed across networked machines known as a cluster.
    • DataFrame: A data structure where data is organized into named columns, like a table in a relational database, but with richer optimizations available.
    A DataFrame object comes with methods attached to it; these are operations you can run on the DataFrame, such as filtering, counting, and aggregating. Spark also provides a number of built-in functions that you can use directly with DataFrames to filter the data and derive specific results (a short querying sketch follows this list).
  • Joins and aggregation: Databricks provides several built-in functions that let you aggregate your data in different ways. You can also use DataFrames to join different sets of data and find correlations between datasets (a join-and-aggregate sketch follows this list).
  • Accessing data: With Databricks, you can access your data in the following ways:
    • Access data stored in an existing file.
    • Upload a data file from your local system.
    • Mount an Azure blob to the Databricks file system.
    Databricks accepts data stored in a variety of file formats, such as Parquet and CSV (a mount-and-read sketch follows this list).
  • Working with hierarchical data: DataFrames make it easier to query hierarchical data such as nested JSON records. You can query nested structured data or data that contains array columns (a nested-JSON sketch follows this list).
  • Data lakes: Apache Spark and Databricks make it easy to access and work with files stored in data lakes, such as Azure Data Lake Storage. Data lakes store vast amounts of data in native formats such as XML, JSON, CSV, and Parquet. To gain insights from a data lake, you can use Databricks to do exploratory data analysis (EDA). You can use Spark DataFrames to read directly from raw files contained in a data lake and then execute queries to join and aggregate the data (an exploratory-analysis sketch follows this list).
  • Azure Data Lake Storage Gen2: Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big-data analytics, built on Azure Blob storage. Data Lake Storage Gen2 combines Microsoft's two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1, to produce a data lake platform with the following features:
    • File-system semantics.
    • Directory-level and file-level security.
    • Scalability.
    • Low-cost, tiered storage.
    • High availability and disaster-recovery capabilities.
    After you set up an Azure Data Lake Storage Gen2 account, Databricks can use the Azure Blob File System (ABFS) driver that's built into the Databricks Runtime to query the data stored there (an ABFS sketch follows this list).
  • Azure Key Vault-backed secret scopes: Azure Databricks has two types of secret scopes: Key Vault-backed and Databricks-backed. These secret scopes allow you to securely store secrets, such as database connection strings. If someone tries to output a secret to a notebook, the secret is replaced with [REDACTED]. This behavior helps prevent unauthorized users from viewing the secret and guards against accidentally leaking it when displaying or sharing the notebook (a secret-scope sketch follows this list).
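
The short sketches below illustrate these capabilities as PySpark notebook cells. They are illustrations only: every path, account name, secret scope, and column name in them is hypothetical, and they assume a Databricks notebook where the spark session and dbutils helpers are already defined.

Querying files: a minimal sketch that reads a Parquet file into a DataFrame and runs a few of its built-in operations.

    from pyspark.sql.functions import col

    # Hypothetical path; `spark` is the SparkSession that Databricks provides.
    events = spark.read.parquet("/mnt/training/events.parquet")

    events.printSchema()                          # inspect the named columns
    events.filter(col("count") > 100).count()     # built-in filter and count
    events.select("ip_address").distinct().show(5)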
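
Joins and aggregation: a sketch that joins two hypothetical DataFrames and aggregates the result.

    from pyspark.sql import functions as F

    sales = spark.read.parquet("/mnt/training/sales.parquet")         # hypothetical paths
    customers = spark.read.parquet("/mnt/training/customers.parquet")

    # Join the two datasets, then aggregate per country.
    joined = sales.join(customers, on="customer_id", how="inner")

    (joined.groupBy("country")
           .agg(F.sum("amount").alias("total_sales"),
                F.count("*").alias("order_count"))
           .orderBy(F.desc("total_sales"))
           .show())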
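
Accessing data: a sketch that mounts a hypothetical Azure Blob storage container to the Databricks file system and then reads CSV and Parquet files from the mount point.

    storage_account = "mystorageaccount"   # hypothetical account and container
    container = "mycontainer"

    dbutils.fs.mount(
        source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
        mount_point="/mnt/blobdata",
        extra_configs={
            f"fs.azure.account.key.{storage_account}.blob.core.windows.net":
                dbutils.secrets.get(scope="demo-scope", key="storage-key")
        }
    )

    # Once mounted, the files behave like any other DBFS path.
    df_csv = spark.read.option("header", "true").csv("/mnt/blobdata/data.csv")
    df_parquet = spark.read.parquet("/mnt/blobdata/data.parquet")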
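
Working with hierarchical data: a sketch that reads hypothetical nested JSON, reaches into a struct with dot notation, and flattens an array column with explode().

    from pyspark.sql.functions import col, explode

    orders = spark.read.json("/mnt/training/orders.json")   # hypothetical nested JSON

    # Dot notation reaches into nested structs.
    orders.select(col("customer.id").alias("customer_id"), col("customer.city")).show(5)

    # explode() produces one row per element of the items array column.
    (orders.select("order_id", explode("items").alias("item"))
           .select("order_id", "item.sku", "item.price")
           .show(5))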
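
Data lakes: a sketch of simple exploratory data analysis over raw CSV files sitting in a data lake folder (path and columns hypothetical).

    raw = spark.read.option("header", "true").csv("/mnt/datalake/raw/visits/*.csv")

    raw.printSchema()
    raw.describe().show()                 # quick summary statistics
    raw.groupBy("region").count().orderBy("count", ascending=False).show(10)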
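
Azure Data Lake Storage Gen2: a sketch that configures OAuth access for a hypothetical service principal and reads data through the ABFS driver. The fs.azure.* settings are the standard ABFS configuration keys; the account, container, secret scope, and tenant values are placeholders.

    storage_account = "mydatalake"   # hypothetical ADLS Gen2 account
    endpoint = f"{storage_account}.dfs.core.windows.net"

    spark.conf.set(f"fs.azure.account.auth.type.{endpoint}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{endpoint}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{endpoint}",
                   dbutils.secrets.get(scope="demo-scope", key="sp-client-id"))
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{endpoint}",
                   dbutils.secrets.get(scope="demo-scope", key="sp-client-secret"))
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{endpoint}",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

    df = spark.read.parquet(f"abfss://mycontainer@{endpoint}/raw/events/")
    df.show(5)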
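
Azure Key Vault-backed secret scopes: a sketch that retrieves a hypothetical secret and uses it in a connection string. Printing the value in a notebook shows [REDACTED] rather than the secret itself.

    sql_password = dbutils.secrets.get(scope="keyvault-scope", key="sql-password")

    print(sql_password)   # notebook output shows [REDACTED]

    # The value is still usable, for example in a JDBC connection string.
    jdbc_url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
                f"database=mydb;user=sqladmin;password={sql_password}")
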
In this module, we'll explore how to:
  • Create the Key Vault, Azure SQL Database, and Azure Cosmos DB instances.
  • Create and configure a Key Vault-backed secret scope.
