Last updated Jul 10, 2023 Edit Source

# DP-900

#Azure #cloud #dataanalytics #datawarehousing

Azure exam prep

Pasted image 20230217231056

Pasted image 20230217231305

Azure Storage Explorer (To explore Azure Storage accounts) Azure Synapse Analytics Azure Data Lake Storage Azure Data Analytics Azure Data Box Azure Databricks (Apache spark 3rd party) Microsoft Power BI (Dashboards) HDInsight (fully managed Hadoop) Azure Data Studio (IDE like VSCode for data related tasks similar to SSIS) Azure Data Factory (ETL/ELT pipeline builder) SQL Server Integration Services (SSIS) Pasted image 20230217231411

# Azure Data Roles

Pasted image 20230217231752

# Data Engineer Tools

Pasted image 20230217232108

# Data Overview

Pasted image 20230217232341

# Data

Basically Information

# Data Documents

Collective Data (Datasets, Databases, Datastores, Data warehouses, notebooks)

# Datasets

Logical Grouping of units of data (mnist, Stackoverflow developer dataset, etc) Pasted image 20230217232822

# Data Structure

a specific storage format.

unstructured (Sharepoint, blob, azure files, Azure Data Lakes)
semi-structured (slight relations XML, JSON, AVRO, PARQUET) (Azure Tables, Cosmos DB, Mongo, Cassandra)

Pasted image 20230218001106

structured usually just tables

# Data Type

Single unit of data to tell a computer how a data is intended to be used (Int, Float, Characters, String, Array, Hash, Binary, bool, enums etc) some of these overlap with data structures

# Schema

Blueprint of database Pasted image 20230217233245

# Schemaless

A cell can accept many types related to Nosql Databases

# Batch Processing

Usually a scheduled action to process a collection of data, are cost effective and is not real-time (ie, analysing daily logs)

# Stream Processing

For realtime data processing Producers send to stream and Consumer pull from stream

# Relational Data

Tables: Logical grouping of rows and columns Views: Result of query which is stored in memory (Temp table) Materialized View: similar to Views but stored on disk Indexes: copy of a column to accelerate stuff (By storing partial redundant data) Constraints: Rules for writes Triggers: A function triggered on a certain database event PK/FK: A data to create relationship between tables

Row Store: generally used in relational database Column Store: NoSQL Faster for analytics, good for large amounts of data (OLTP), for limited columns use

Pivot Tables: A table of statistics easy to create in Excel to draw attention to useful information and finding figures and facts quickly.

Data Consistency: when data is kept in many places and weather they match or not. (Strongly consistent vs Eventually Consistent). Syncronous data transmissions are used in Strongly consistent data.

# Non Relational Data

Key/Value
Document
Columnar
Graph

# Data Sources

Where Data originates from, can be: Data lake Data warehouse Datastore Database Data requested from API Flat Files (spreadsheet)

# Data Store

A repo for storing data, Database is a subset of data store

# Database

Stores semi-structured or structured data A more complex data store with a formal design

# Data Warehouse

A relational data store for analytic workloads, utilizing usually column-oriented data-store to aggrigate huge amounts of data. they are desined to be HOT (fast). They are run in a schedule. It needs to consume data from a relational database.

# Data Mart

Subset of Datawarehouse which stores a smaller sets of data and are designed to be read only.

# Data Lake

A centralized place to store vast amount of raw data. Used for: Visualization Real-time analytics Machine Learning Pasted image 20230218111944 Pasted image 20230218112119

# Data Lakehouse

Combines the best elements of Data Lakes and Data warehouse. Supports video, audio and text files, used for data science and ML. support for both streaming and ELT

# Azure Synapse

Basically Data Lake or LakeHouse service by Azure Pasted image 20230218111352 Synapse SQL

Extends TSQL for ML
streaming capabilities to load data
Integrate AI with SQL
Offers both serverless and dedicated resource models

Features:

Integrate with Apache Spark (SparkML, )
Simplified resource model
Fast spark load
Built in support for dotnet
makes it easy to use SQL and spark

# PolyBase

Data virtualization feature for Sql Server Query directly TSQL from

SQL server
Oracle
Teradata
MongoDB
Hadoop clusters
Cosmos DB can bu used to perform ETL transformation

# Azure Data Lake Analytics

On demand analytics job service Pasted image 20230218112644

# Data Mining

Pasted image 20230218001406 find valid patterns and relationships (classification, clustering, regression, sequential, association rules, Outer Detection, Prediction).

# Data Wrangling

Matching data from raw to other format.

# Data Modelling

Organizing elements of data and standardises how they relate to one another. could be conceptial, logical or physical. example a ER diagram.

# ETL vs ELT

used when you want to move data from one location to another. Ie moving data from relational to non relational database

Pasted image 20230218002020

# Data Analytics

Examining, transforming and arranging data to view useful information. Pasted image 20230218002122

Techniques Descriptive Diagnostic Predictive Prescriptive Cognitive

# Azure HDInsights

Pasted image 20230218103328 Can use Apache Ambari an opensource Hadoop implementation

Apache spark is a unified analytics engine for BD and ML.

Super fast

DataBricks Fully managed Spark Clusters. Available on all major cloud platform. Azure Data Bricks is a partnership with Databricks to offer the services within Azure.

Azure Data-bricks Workspace (for data pipelines: Batching, Streaming, Storage)
Azure Database SQL Analytics (run SQL on Data Lakes, visualizations, dashboards)

Azure certification for data foundations DP-900 done after DP 900 Power BI Pasted image 20230218113522 Pasted image 20230218114405 Pasted image 20230218114833

Data Cheetsheets Pasted image 20230218110331 Pasted image 20230218110557 Pasted image 20230218110835 Pasted image 20230218111000

Relational Databases Pasted image 20230218120115 Pasted image 20230218002710

Account Storage Pasted image 20230218113200

CosmosDB Pasted image 20230218121343 Pasted image 20230218121706

Azure Synapse and Data Lake Pasted image 20230218112816

Hadoop Pasted image 20230218122232

Spark and DataBricks Pasted image 20230218122602

SQL Management Studio (SSMS) Pasted image 20230218122457

Pasted image 20230218105604 Pasted image 20230218105647 Pasted image 20230218122909 Pasted image 20230218105830 Pasted image 20230218105929 Pasted image 20230218105936 Pasted image 20230218110125

Database Security Pasted image 20230218120454

Shrestha Rajat

# DP-900