AWS Crash Course - Redshift

Redshift is Amazon's data warehouse service: a managed store where you keep and analyze huge amounts of data.
  • Redshift is a fully managed, petabyte-scale data warehouse system
  • Amazon Redshift is based on PostgreSQL 8.0.2
  • It is optimized for data warehousing
  • Supports integrations and connections with various applications, including Business Intelligence tools
  • Redshift provides custom JDBC and ODBC drivers.
  • Redshift can be integrated with CloudTrail for auditing purposes.
  • You can monitor Redshift performance from CloudWatch.
Features of Amazon Redshift
Supports VPC − You can launch Redshift within a VPC and control access to the cluster through the virtual networking environment.
Encryption − Data stored in Redshift can be encrypted; encryption is configured when you launch the cluster.
SSL − SSL encryption is used to encrypt connections between clients and Redshift.
Scalable − With a few simple clicks, you can choose vertical scaling (increasing node size) or horizontal scaling (adding more compute nodes).
Cost-effective − Amazon Redshift is a cost-effective alternative to traditional data warehousing. There are no up-front costs or long-term commitments, and pricing is on-demand.
MPP (Massively Parallel Processing) – Redshift leverages massively parallel processing: a large number of processors (or separate compute nodes) perform coordinated computations in parallel, which reduces computing time and improves query performance.
Columnar Storage – Redshift stores table data by column rather than by row. A columnar layout lets the database read from and write to disk more efficiently for analytic queries, speeding up the time it takes to return a query.
Advanced Compression – Compression is a column-level operation that reduces the size of data when it is stored, thus helping save space. An encoding can be set per column, as sketched below.
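As an illustration, here is a minimal sketch of how a compression encoding can be declared per column (the table and column names are hypothetical; AZ64, ZSTD and LZO are encodings Redshift supports):
-- Hypothetical table: each column carries its own compression encoding.
create table encoding_demo(
 saleid integer encode az64,
 region varchar(20) encode zstd,
 notes varchar(200) encode lzo);
If you omit ENCODE, Redshift assigns a default encoding to the column.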
Types of nodes in Redshift
Leader Node
Compute Node
What do these nodes do?
Leader Node:- The leader node receives queries from client applications, parses the queries and develops execution plans, which are ordered sets of steps to process those queries. The leader node then coordinates the parallel execution of these plans with the compute nodes (you can inspect the plan it builds with EXPLAIN, as sketched below). The good part is that you are not charged for leader node hours; only compute nodes incur charges. In a single-node Redshift cluster there is no separate leader node; the one node performs both roles.
Compute Node:- Compute nodes execute the steps specified in the execution plans and transmit data among themselves to serve queries. The intermediate results are sent back to the leader node for aggregation before being sent back to the client applications. You can have 1 to 128 compute nodes.
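For example, you can ask the leader node to show the plan it develops for a query with EXPLAIN. A minimal sketch (the "event" table used here is the one created later in this guide; any table works):
-- Show the ordered steps the leader node plans for this query.
explain
select eventname, count(*)
from event
group by eventname
order by count(*) desc;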
From which sources can you load data into Redshift?
You can load data from multiple sources, such as the ones below (a sample COPY from DynamoDB is sketched after this list):
  • Amazon S3
  • Amazon DynamoDB
  • Amazon EMR
  • AWS Data Pipeline
  • Any SSH-enabled host on Amazon EC2 or on-premises
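For instance, loading from DynamoDB uses the same COPY command as loading from S3; only the source changes. A minimal sketch (the DynamoDB table "ProductCatalog", the target table "product_catalog" and the role ARN are placeholders you would substitute):
-- Hypothetical example: load a Redshift table from a DynamoDB table.
-- readratio caps the share of the DynamoDB table's provisioned throughput the load may use.
copy product_catalog
from 'dynamodb://ProductCatalog'
credentials 'aws_iam_role=<iam-role-arn>'
readratio 50;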
Redshift Backup and Restores
  • Redshift can take automatic snapshots of the cluster.
  • You can also take manual snapshots of the cluster.
  • Redshift continuously backs up its data to S3.
  • Redshift attempts to keep at least three copies of the data (the original and a replica on the compute nodes, plus a backup in S3).
Hope the above snapshot gives you a decent understanding of Redshift. If you want to try some hands-on practice, check the tutorial that follows.
This series is created to give you a quick snapshot of AWS technologies. You can check out other AWS services in this series over here.

How to load data into an Amazon Redshift cluster and query it?

In the last post we checked how to build a Redshift cluster. In this post we will see how to load data into that cluster and query it.
Pre-requisites:-
Download SQL Workbench
Download Redshift Driver
Once you have downloaded the above-mentioned pre-requisites, let's move ahead.
First we will obtain the JDBC URL.
  • Log in to your AWS Redshift Console.
  • Click on the cluster you have created. If you have followed the last post it will be "testdw".
  • In the "Configuration" tab look for the JDBC URL.
  • Copy the JDBC URL and save it in notepad.
Now open SQL Workbench. You just have to run the 32-bit or 64-bit exe as per your OS version. In my case I am using Windows 10 64-bit, so the exe name is SQLWorkbench64.
  • Click on File > Connect window.
  • In the bottom left of the "Select Connection Profile" window click on "Manage Drivers"
  • In the "Manage Drivers" window click on the folder icon, browse to the location of the Redshift driver you downloaded earlier and select it.
Fill in the other details in the "Manage Drivers" window as below.
Name:- Redshiftdriver JDBC 4.2
Classname:- com.amazon.redshift.jdbc.Driver
  • Click OK
  • Now, in the "Select Connection Profile" window, fill in the details as below.
Driver:- Select the Redshift driver you added.
URL:- Mention the JDBC URL you saved earlier.
Username:- DB username you mentioned during cluster creation.
Password:- Enter the password of the DB user.

Check the Autocommit box.
  • Finally click on OK.
  • If everything is configured correctly, you will get connected to the DB.
  • Try executing the query
select * from information_schema.tables;
  • If your connection is successful you will see results in the window.
  • Now we will load some sample data which is provided by AWS and kept on S3. In SQL Workbench, copy/paste the queries below and execute them to create the tables.
create table users(
 userid integer not null distkey sortkey,
 username char(8),
 firstname varchar(30),
 lastname varchar(30),
 city varchar(30),
 state char(2),
 email varchar(100),
 phone char(14),
 likesports boolean,
 liketheatre boolean,
 likeconcerts boolean,
 likejazz boolean,
 likeclassical boolean,
 likeopera boolean,
 likerock boolean,
 likevegas boolean,
 likebroadway boolean,
 likemusicals boolean);

create table venue(
 venueid smallint not null distkey sortkey,
 venuename varchar(100),
 venuecity varchar(30),
 venuestate char(2),
 venueseats integer);

create table category(
 catid smallint not null distkey sortkey,
 catgroup varchar(10),
 catname varchar(10),
 catdesc varchar(50));

create table date(
 dateid smallint not null distkey sortkey,
 caldate date not null,
 day character(3) not null,
 week smallint not null,
 month character(5) not null,
 qtr character(5) not null,
 year smallint not null,
 holiday boolean default('N'));

create table event(
 eventid integer not null distkey,
 venueid smallint not null,
 catid smallint not null,
 dateid smallint not null sortkey,
 eventname varchar(200),
 starttime timestamp);

create table listing(
 listid integer not null distkey,
 sellerid integer not null,
 eventid integer not null,
 dateid smallint not null  sortkey,
 numtickets smallint not null,
 priceperticket decimal(8,2),
 totalprice decimal(8,2),
 listtime timestamp);

create table sales(
 salesid integer not null,
 listid integer not null distkey,
 sellerid integer not null,
 buyerid integer not null,
 eventid integer not null,
 dateid smallint not null sortkey,
 qtysold smallint not null,
 pricepaid decimal(8,2),
 commission decimal(8,2),
 saletime timestamp);
Now load the sample data. In each of the COPY commands below, replace "<iam-role-arn>" with the ARN of the IAM role you created earlier.
For example, with the ARN filled in, the first command would look like this:
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=arn:aws:iam::123456789123:role/redshiftrole' 
delimiter '|' region 'us-west-2';
Now run the following COPY commands one by one (with your ARN substituted):
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy venue from 's3://awssampledbuswest2/tickit/venue_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy category from 's3://awssampledbuswest2/tickit/category_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy date from 's3://awssampledbuswest2/tickit/date2008_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy event from 's3://awssampledbuswest2/tickit/allevents_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';

copy listing from 's3://awssampledbuswest2/tickit/listings_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt'
credentials 'aws_iam_role=<iam-role-arn>'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';
Once you have loaded the data, you can run sample queries like the ones below in SQL Workbench.
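For example, the queries below (based on the AWS TICKIT sample tutorial) run against the tables created and loaded above:
-- Get the definition of the SALES table.
select *
from pg_table_def
where tablename = 'sales';

-- Total quantity of tickets sold on a given calendar date.
select sum(qtysold)
from sales, date
where sales.dateid = date.dateid
and caldate = '2008-01-05';

-- Top 10 buyers by quantity of tickets bought.
select firstname, lastname, total_quantity
from (select buyerid, sum(qtysold) total_quantity
      from sales
      group by buyerid
      order by total_quantity desc limit 10) q, users
where q.buyerid = userid
order by q.total_quantity desc;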
Congrats! You have finally created a Redshift cluster, loaded data into it, and run queries against it.
Refer to this post if you want to reset the master user password.
Don't forget to clean up the cluster, or you will keep getting billed.
  • To delete the cluster, just click on the cluster (in our case it's testdw) in the AWS console.
  • Click on the "Cluster" drop-down and select Delete.
That will clean up everything.
Hope this guide was helpful to you! Do let me know in the comment section if you have any queries or suggestions.

How to create a multi-node Redshift cluster?

In this tutorial we will see how to create a multi-node Amazon Redshift cluster. If you are new to Redshift, refer to our AWS Crash Course – Redshift.
Pre-requisites:-
You will have to create an IAM role and a Security Group for Redshift.
Refer to How to create a Security Group and How to create an IAM role.
Once you have created the role and security group, follow the steps below:
  • Log in to the Amazon Redshift Console. For this tutorial we will be launching the cluster in the US East region.
  • In the Dashboard click on Launch Cluster.
In the cluster details section fill in the details as below:-
Cluster identifier:- testdw
Database name:- testdb
Master user name:- testuser
Master user password:- <your password>
  • In the node configuration section, choose the node type and the number of nodes.
Note:- We will be creating a 2-node cluster, so select multi-node and set the number of nodes to 2.
  • Once done, click Continue.
  • In the additional configuration screen leave the top section at its defaults. Just change "VPC security groups" to "redshiftsg" and "Available roles" to "redshiftrole", which we created earlier.
  • Click on Continue and review everything. It will show you the approximate cost of running this cluster, but if you are using the "free tier" you may run it for free, provided you have not exhausted your "free tier" limits.
  • Finally click on Launch Cluster.
If you have followed everything as in the above tutorial, your cluster will be ready in around 15 minutes.

Hope this tutorial was helpful to you. In the next post we will discuss how to load data in this cluster and query the data.

How to create an IAM role in AWS?

In this tutorial we will be creating an IAM role which we will use to launch an Amazon Redshift cluster.
  • Log in to your IAM Console.
  • In the left Navigation Pane click on "Roles".
  • Click on "Create New Role".
  • In the AWS Service Role select "Amazon Redshift".
  • In the Policy Type select "AmazonS3FullAccess" and click on "Next Step".
  • Enter details as below:-
Role Name:- redshiftrole
Role description:- Amazon Redshift Role
  • Finally click on Create Role.

How to create a security group in AWS?

In this tutorial we will create a security group which we will use later while creating our Redshift cluster.
  • Log in to your AWS VPC Console.
  • In the left Navigation Pane select Security Groups.
  • Once in the Security Groups section, click on Create Security Group.
  • Enter details as below:
Name tag: redshiftsg
Group name: redshiftsg
Description: redshift security group
VPC: <Select your VPC or, if you are not sure, leave it as default>
  • Once the details are filled in, click on Yes, Create.
  • Now select the security group you just created.
  • In the bottom pane click on inbound rules.
Click Edit.
Select Redshift in Type and 0.0.0.0/0 in Source. This basically allows inbound traffic to the Redshift port from any IP address. You can also mention a specific IP range in the source.

Leave the outbound rule as “All traffic”
  • Finally click save.
Congrats!  You have successfully created a security group.