
Solved: How to change the IAM role in a Redshift cluster

In our last post we saw how to create a Redshift cluster.
In this post we will see how to change the IAM role assigned to a Redshift cluster. (Check the earlier post to learn how to create an IAM role.)
Note: Always test the new role in dev first. Don’t try this directly in prod.
  • First, create a role as per your requirement.
  • Next, log in to the Redshift console. Double-check that you are in the correct region.
  • Select the cluster by clicking its check box.
  • At the top you will see a “Manage IAM Roles” option. Click on it.
  • Select the role you created from the drop-down and click “Apply Changes”.
  • It will take 3-5 minutes for the change to take effect.
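The console steps above can also be done from the AWS CLI. A minimal sketch, assuming the CLI is installed and configured, with placeholder role ARNs:

```shell
# Sketch: swap the cluster's IAM role from the AWS CLI (role ARNs are placeholders).
CLUSTER_ID="testdw"
NEW_ROLE="arn:aws:iam::123456789012:role/redshiftrole"   # hypothetical new role
OLD_ROLE="arn:aws:iam::123456789012:role/oldrole"        # hypothetical role being replaced

# Run only when the AWS CLI is installed and has working credentials.
if aws sts get-caller-identity >/dev/null 2>&1; then
  aws redshift modify-cluster-iam-roles \
    --cluster-identifier "$CLUSTER_ID" \
    --add-iam-roles "$NEW_ROLE" \
    --remove-iam-roles "$OLD_ROLE"
fi
```

As in the console, the change takes a few minutes to apply; `aws redshift describe-clusters` shows the roles currently attached to the cluster.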

Solved: How to reset the Master User Password of a Redshift Cluster

In our last post we saw how to create a Redshift cluster.
At times you may want to reset the password of the master user in the Redshift database.
In this post we will see how to reset the master user password of a Redshift cluster.
Note:- Ensure that you really want to change the password. Changing the password in production can break applications that still have the old password stored anywhere.
  • Go to the Redshift console and click on the cluster whose password you want to reset. In our case it is “testdw”.
  • Once you are inside the cluster configuration, click on the “Cluster” drop-down and select “Modify”.
  • You will see the “Master user password” field. Enter the new password in it.
  • Finally, click “Modify”.
This will change the master user password. Try connecting to the cluster with the new password.
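The same reset can be sketched with the AWS CLI; the cluster name comes from the post above, and the password shown is only a placeholder:

```shell
# Sketch: reset the master user password via the AWS CLI.
CLUSTER_ID="testdw"
NEW_PASSWORD="Ex4mplePassword"   # placeholder; 8-64 chars with upper, lower, and a digit

# Run only when the AWS CLI is installed and has working credentials.
if aws sts get-caller-identity >/dev/null 2>&1; then
  aws redshift modify-cluster \
    --cluster-identifier "$CLUSTER_ID" \
    --master-user-password "$NEW_PASSWORD"
fi
```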

AWS Crash Course - Redshift

Redshift is a data warehouse service from Amazon: a managed place where you can store and analyze huge amounts of data.
  • Redshift is a fully managed, petabyte-scale system.
  • Amazon Redshift is based on PostgreSQL 8.0.2.
  • It is optimized for data warehousing.
  • It supports integrations and connections with various applications, including Business Intelligence tools.
  • Redshift provides custom JDBC and ODBC drivers.
  • Redshift can be integrated with CloudTrail for auditing purposes.
  • You can monitor Redshift performance from CloudWatch.
Features of Amazon Redshift
Supports VPC − You can launch Redshift within a VPC and control access to the cluster through the virtual networking environment.
Encryption − Data stored in Redshift can be encrypted at rest; encryption is configured when the cluster is created.
SSL − SSL encryption is used to encrypt connections between clients and Redshift.
Scalable − With a few simple clicks, you can choose vertical scaling (increasing node size) or horizontal scaling (adding compute nodes).
Cost-effective − Amazon Redshift is a cost-effective alternative to traditional data warehousing practices: there are no up-front costs or long-term commitments, and pricing is on-demand.
MPP (Massively Parallel Processing) − Redshift leverages parallel processing, meaning a large number of processors (or separate computers) perform coordinated computations in parallel. This reduces computing time and improves query performance.
Columnar Storage − Redshift stores data tables by column rather than by row. The goal of a columnar database is to efficiently write and read data to and from disk storage, speeding up the time it takes to return a query.
Advanced Compression − Compression is a column-level operation that reduces the size of data when it is stored, thus saving storage space.
Types of nodes in Redshift
Leader Node
Compute Node
What do these nodes do?
Leader Node:- The leader node receives queries from client applications, parses them, and develops execution plans: ordered sets of steps to process the queries. It then coordinates the parallel execution of these plans with the compute nodes. The good part is that you are not charged for leader node hours; only compute nodes incur charges. In a single-node Redshift cluster, the one node performs both leader and compute functions.
Compute Node:- Compute nodes execute the steps specified in the execution plans and transmit data among themselves to serve the queries. Intermediate results are sent back to the leader node for aggregation before being returned to the client applications. A cluster can have 1 to 128 compute nodes.
From which sources can you load data into Redshift?
You can load data from multiple sources, such as:
  • Amazon S3
  • Amazon DynamoDB
  • Amazon EMR
  • AWS Data Pipeline
  • Any SSH-enabled host on Amazon EC2 or on-premises
Redshift Backup and Restores
  • Redshift can take automatic snapshots of the cluster.
  • You can also take manual snapshots of the cluster.
  • Redshift continuously backs up its data to S3.
  • Redshift attempts to keep at least 3 copies of the data.
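A manual snapshot and a restore can be sketched from the AWS CLI; the snapshot name and cluster names below are placeholders:

```shell
# Sketch: manual snapshot and restore via the AWS CLI.
CLUSTER_ID="testdw"
SNAPSHOT_ID="testdw-manual-$(date +%Y%m%d)"   # placeholder snapshot name

# Run only when the AWS CLI is installed and has working credentials.
if aws sts get-caller-identity >/dev/null 2>&1; then
  # Take a manual snapshot of the running cluster.
  aws redshift create-cluster-snapshot \
    --cluster-identifier "$CLUSTER_ID" \
    --snapshot-identifier "$SNAPSHOT_ID"

  # Restore the snapshot into a brand new cluster.
  aws redshift restore-from-cluster-snapshot \
    --cluster-identifier "${CLUSTER_ID}-restored" \
    --snapshot-identifier "$SNAPSHOT_ID"
fi
```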
Hope the above snapshot gives you a decent understanding of Redshift. If you want to try some hands-on practice, check this tutorial.
This series is created to give you a quick snapshot of AWS technologies. You can check out other AWS services in this series over here.

How to load data in Amazon Redshift Cluster and query it?

In the last post we saw how to build a Redshift cluster. In this post we will see how to load data into that cluster and query it.
Pre-requisites:-
Download SQL Workbench
Download the Redshift JDBC driver
Once you have downloaded the above mentioned pre-requisites, let's move ahead.
First we will obtain the JDBC URL.
  • Log in to your AWS Redshift console.
  • Click on the cluster you have created. If you have followed the last post, it will be "testdw".
  • In the "Configuration" tab, look for the JDBC URL.
  • Copy the JDBC URL and save it in Notepad.
Now open SQL Workbench. You just have to run the 32-bit or 64-bit exe as per your OS version. In my case I am using Windows 10 64-bit, so the exe name is SQLWorkbench64.
  • Click on File > Connect window.
  • In the bottom left of the "Select Connection Profile" window, click on "Manage Drivers".
  • In the "Manage Drivers" window, click on the folder icon, browse to the location of the Redshift driver you downloaded earlier, and select it.
Fill in the other details in the "Manage Drivers" window as below.
Name:- Redshift driver JDBC 4.2
  • Click OK
  • Now in the "Select Connection Profile" window, fill in the details as below. You can also refer to the image below.
Driver:- Select the Redshift driver you added.
URL:- Enter the JDBC URL you saved earlier.
Username:- The DB username you specified during cluster creation.
Password:- The password of that DB user.

Check the Autocommit box.

  • Finally, click OK.
  • If everything is configured correctly, you will get connected to the DB.
  • Try executing the query:
select * from information_schema.tables;
  • If your connection is successful, you will see results in the window.
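As an aside, Redshift speaks the PostgreSQL wire protocol, so you can also run the same test query from a command line with psql instead of SQL Workbench. The host below is a placeholder; copy the real one from your JDBC URL:

```shell
# Hypothetical psql equivalent of the SQL Workbench connection.
RS_HOST="testdw.abc123xyz.us-east-1.redshift.amazonaws.com"   # placeholder endpoint
RS_PORT=5439                                                  # Redshift's default port

# Run only when psql is installed and a password has been exported as RS_PASSWORD.
if command -v psql >/dev/null 2>&1 && [ -n "${RS_PASSWORD:-}" ]; then
  PGPASSWORD="$RS_PASSWORD" psql -h "$RS_HOST" -p "$RS_PORT" -U testuser -d testdb \
    -c "select * from information_schema.tables;"
fi
```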
  • Now we will load some sample data provided by AWS and kept on S3. In SQL Workbench, copy/paste the query below and execute it to create the tables.
create table users(
 userid integer not null distkey sortkey,
 username char(8),
 firstname varchar(30),
 lastname varchar(30),
 city varchar(30),
 state char(2),
 email varchar(100),
 phone char(14),
 likesports boolean,
 liketheatre boolean,
 likeconcerts boolean,
 likejazz boolean,
 likeclassical boolean,
 likeopera boolean,
 likerock boolean,
 likevegas boolean,
 likebroadway boolean,
 likemusicals boolean);

create table venue(
 venueid smallint not null distkey sortkey,
 venuename varchar(100),
 venuecity varchar(30),
 venuestate char(2),
 venueseats integer);

create table category(
 catid smallint not null distkey sortkey,
 catgroup varchar(10),
 catname varchar(10),
 catdesc varchar(50));

create table date(
 dateid smallint not null distkey sortkey,
 caldate date not null,
 day character(3) not null,
 week smallint not null,
 month character(5) not null,
 qtr character(5) not null,
 year smallint not null,
 holiday boolean default('N'));

create table event(
 eventid integer not null distkey,
 venueid smallint not null,
 catid smallint not null,
 dateid smallint not null sortkey,
 eventname varchar(200),
 starttime timestamp);

create table listing(
 listid integer not null distkey,
 sellerid integer not null,
 eventid integer not null,
 dateid smallint not null  sortkey,
 numtickets smallint not null,
 priceperticket decimal(8,2),
 totalprice decimal(8,2),
 listtime timestamp);

create table sales(
 salesid integer not null,
 listid integer not null distkey,
 sellerid integer not null,
 buyerid integer not null,
 eventid integer not null,
 dateid smallint not null sortkey,
 qtysold smallint not null,
 pricepaid decimal(8,2),
 commission decimal(8,2),
 saletime timestamp);
Now load the sample data. Ensure that in each query below you replace "<iam-role-arn>" with your role's ARN.
So your query should look like this:
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=arn:aws:iam::123456789123:role/redshiftrole' 
delimiter '|' region 'us-west-2';
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy venue from 's3://awssampledbuswest2/tickit/venue_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy category from 's3://awssampledbuswest2/tickit/category_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy date from 's3://awssampledbuswest2/tickit/date2008_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy event from 's3://awssampledbuswest2/tickit/allevents_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';

copy listing from 's3://awssampledbuswest2/tickit/listings_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt'
credentials 'aws_iam_role=<iam-role-arn>'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';
Once you have loaded the data, you can run sample queries in SQL Workbench.
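For illustration, here is one hypothetical sample query over the tickit tables loaded above (top 10 buyers by tickets purchased), run through psql with placeholder connection details:

```shell
# Hypothetical sample query: top 10 buyers by tickets purchased.
RS_HOST="testdw.abc123xyz.us-east-1.redshift.amazonaws.com"   # placeholder endpoint
SQL="select firstname, lastname, sum(qtysold) as total_sold
     from sales, users
     where sales.buyerid = users.userid
     group by firstname, lastname
     order by total_sold desc
     limit 10;"

# Run only when psql is installed and a password has been exported as RS_PASSWORD.
if command -v psql >/dev/null 2>&1 && [ -n "${RS_PASSWORD:-}" ]; then
  PGPASSWORD="$RS_PASSWORD" psql -h "$RS_HOST" -p 5439 -U testuser -d testdb -c "$SQL"
fi
```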

Congrats! You have finally created the Redshift cluster and run queries on it after loading data.
Refer to this post if you want to reset the master user password.
Don't forget to clean up the cluster or you will keep getting billed.
  • For deleting the cluster, just click on the cluster (in our case it's testdw) in the AWS console.
  • Click on the "Cluster" drop-down and select "Delete".
That will clean up everything.
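If you prefer the AWS CLI, the cleanup can be sketched like this; `--skip-final-cluster-snapshot` discards the data, so omit it and pass `--final-cluster-snapshot-identifier` instead if you want one last backup:

```shell
# Sketch: delete the cluster via the AWS CLI (data is discarded).
CLUSTER_ID="testdw"

# Run only when the AWS CLI is installed and has working credentials.
if aws sts get-caller-identity >/dev/null 2>&1; then
  aws redshift delete-cluster \
    --cluster-identifier "$CLUSTER_ID" \
    --skip-final-cluster-snapshot
fi
```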
Hope this guide was helpful to you! Do let me know in the comments section if you have any queries or suggestions.

How to create a multi-node Redshift cluster?

In this tutorial we will see how to create a multi-node Amazon Redshift cluster. If you are new to Redshift, refer to our AWS Crash Course – Redshift.
Pre-requisites:-
You will have to create an IAM role and a Security Group for Redshift.
Refer to How to create a Security Group and How to create an IAM role.
Once you have created the role and security group, follow the steps below:
  • Log in to the Amazon Redshift console. For this tutorial we will be launching the cluster in the US East region.
  • In the Dashboard, click on Launch Cluster.
In the cluster details section fill in the details as below:-
Cluster identifier:- testdw
Database name:- testdb
Master user name:- testuser
Master user password:- <your password>
  • In the node configuration section, select details as in the screenshot below.
Note:- We will be creating a 2-node cluster, so select multi-node.

  • Once done click continue.
  • In the additional configuration screen, leave the top section as default (refer to the image below). Just change “VPC security groups” to “redshiftsg” and “Available roles” to “redshiftrole”, both of which we created earlier.

  • Click on Continue and review everything. It will show you the approximate cost of running this cluster, but if you are using the “free tier” you may run it for free, provided you have not exhausted your “free tier” limits.

  • Finally, click on Launch Cluster.
If you have followed everything in the tutorial above, your cluster will be ready in around 15 minutes.
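The console walkthrough above can also be sketched as a single AWS CLI call. The node type, security group ID, role ARN, and password below are placeholders; substitute the values you chose in the prerequisites and the node-configuration screen:

```shell
# Sketch: launch the same 2-node cluster from the AWS CLI (all IDs are placeholders).
NODES=2   # multi-node cluster, as chosen in the node configuration section

# Run only when the AWS CLI is installed and has working credentials.
if aws sts get-caller-identity >/dev/null 2>&1; then
  aws redshift create-cluster \
    --cluster-identifier testdw \
    --db-name testdb \
    --master-username testuser \
    --master-user-password 'YourPassw0rd' \
    --node-type dc2.large \
    --cluster-type multi-node \
    --number-of-nodes "$NODES" \
    --vpc-security-group-ids sg-0123456789abcdef0 \
    --iam-roles arn:aws:iam::123456789012:role/redshiftrole
fi
```

`aws redshift describe-clusters --cluster-identifier testdw` reports the cluster status while it is being created.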

Hope this tutorial was helpful to you. In the next post we will discuss how to load data into this cluster and query it.