AWS Crash Course - Redshift

Redshift is a data warehouse from Amazon. It’s like a virtual place where you store a huge amount of data.
  • Redshift is fully managed petabyte-scale system
  • Amazon Redshift is based on PostgreSQL 8.0.2
  • It is optimized for data warehousing
  • Supports integrations and connections with various applications, including, Business Intelligence tools
  • Redshift provides custom JDBC and ODBC drivers.
  • Redshift can be integrated with CloudTrail for auditing purpose.
  • You can monitor Redshift performance from CloudWatch.
Features of Amazon Redshift
Supports VPC − The users can launch Redshift within VPC and control access to the cluster through the virtual networking environment.
Encryption − Data stored in Redshift can be encrypted and configured while creating tables in Redshift.
SSL − SSL encryption is used to encrypt connections between clients and Redshift.
Scalable − With a few simple clicks, you can choose vertical scaling(increasing instance size) or horizontal scaling(increasing compute nodes).
Cost-effective − Amazon Redshift is a cost-effective alternative to traditional data warehousing practices. There are no up-front costs, no long-term commitments and on-demand pricing structure.
MPP(Massive Parallel Processing) – Redshift Leverages parallel processing which improves query performance. Massively parallel refers to the use of a large number of processors (or separate computers) to perform coordinated computations in parallel. This reduces computing time and improves query performance.
Columnar Storage – Redshift uses columnar storage. So it stores data tables by column rather than by row. The goal of a columnar database is to efficiently write and read data to and from hard disk storage in order to speed up the time it takes to return a query.
Advanced Compression – Compression is a column-level operation that reduces the size of data when it is stored thus help in saving space.
Type of nodes in Redshift
Leader Node
Compute Node
What does these nodes do?
Leader Node:- A leader node receives queries from client applications, parse the queries and develops execution plans, which are an ordered set of steps to process these queries. The leader node then coordinates the parallel execution of these plans with the compute nodes.  Good part is that you will not be charged for leader node hours; only compute nodes will incur charges. If you run single node Redshift cluster you don’t need leader node.
Compute Node:- It execute the steps specified in the execution plans and transmit data among other nodes to serve queries. The intermediate results are sent back to the leader node for aggregation before being sent back to the client applications. You can have 1 to 128 Compute Nodes.
From which sources you can load data in Redshift?
You can do it from multiple sources like :-
  • Amazon S3
  • Amazon DynamoDB
  • Amazon EMR
  • AWS Data Pipeline
  • Any SSH-enabled host on Amazon EC2 or on-premises
Redshift Backup and Restores
  • Redshift can take automatic snapshots of cluster.
  • You can also take manual snapshot of cluster.
  • Redshift continuously backs up its data to S3
  • Redshift attempt to keep at least 3 copies of the data.
Hope the above snapshot give you a decent understanding of Redshift. If you want to try some handson check this tutorial .
This series is created to give you a quick snapshot of AWS technologies.  You can check about other AWS services in this series over here .

No comments:

Post a Comment