How to load data into an Amazon Redshift cluster and query it?

In the last post we covered How to build a Redshift Cluster. In this post we will see how to load data into that cluster and query it.
Pre-requisites:-
  • Download SQL Workbench
  • Download Redshift Driver
Once you have downloaded the above-mentioned pre-requisites, let's move ahead.
First we will obtain the JDBC URL.
  • Log in to your AWS Redshift console.
  • Click on the cluster you have created. If you have followed the last post it will be "testdw".
  • In the "Configuration" tab, look for the JDBC URL.
  • Copy the JDBC URL and save it in Notepad.
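For reference, a Redshift JDBC URL looks like the one below (the endpoint here is a made-up example; use the exact URL from your console):

jdbc:redshift://testdw.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev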
Now open SQL Workbench. You just have to run the 32-bit or 64-bit exe as per your OS version. In my case I am using Windows 10 64-bit, so the exe name is SQLWorkbench64.
  • Click on File > Connect window.
  • In the bottom left of the "Select Connection Profile" window, click on "Manage Drivers".
  • In the "Manage Drivers" window click on the folder icon, browse to the location of the Redshift driver you downloaded earlier and select it.
Fill in the other details in the "Manage Drivers" window as below.
Name:- Redshift JDBC 4.2 Driver
Classname:- com.amazon.redshift.jdbc.Driver
  • Click OK
  • Now in the "Select Connection Profile" window, fill in the details as below.
Driver:- Select the Redshift driver you added.
URL:- The JDBC URL you saved earlier.
Username:- The DB username you specified during cluster creation.
Password:- The password of that DB user.

Check the Autocommit box.
  • Finally click on OK.
  • If everything is configured correctly, you will get connected to the DB.
  • Try executing the query:
select * from information_schema.tables;
  • If your connection is successful, you will see results in the window.
  • Now we will load some sample data provided by AWS and hosted on S3. In SQL Workbench, copy/paste the below queries and execute them to create the tables.
create table users(
 userid integer not null distkey sortkey,
 username char(8),
 firstname varchar(30),
 lastname varchar(30),
 city varchar(30),
 state char(2),
 email varchar(100),
 phone char(14),
 likesports boolean,
 liketheatre boolean,
 likeconcerts boolean,
 likejazz boolean,
 likeclassical boolean,
 likeopera boolean,
 likerock boolean,
 likevegas boolean,
 likebroadway boolean,
 likemusicals boolean);

create table venue(
 venueid smallint not null distkey sortkey,
 venuename varchar(100),
 venuecity varchar(30),
 venuestate char(2),
 venueseats integer);

create table category(
 catid smallint not null distkey sortkey,
 catgroup varchar(10),
 catname varchar(10),
 catdesc varchar(50));

create table date(
 dateid smallint not null distkey sortkey,
 caldate date not null,
 day character(3) not null,
 week smallint not null,
 month character(5) not null,
 qtr character(5) not null,
 year smallint not null,
 holiday boolean default('N'));

create table event(
 eventid integer not null distkey,
 venueid smallint not null,
 catid smallint not null,
 dateid smallint not null sortkey,
 eventname varchar(200),
 starttime timestamp);

create table listing(
 listid integer not null distkey,
 sellerid integer not null,
 eventid integer not null,
 dateid smallint not null sortkey,
 numtickets smallint not null,
 priceperticket decimal(8,2),
 totalprice decimal(8,2),
 listtime timestamp);

create table sales(
 salesid integer not null,
 listid integer not null distkey,
 sellerid integer not null,
 buyerid integer not null,
 eventid integer not null,
 dateid smallint not null sortkey,
 qtysold smallint not null,
 pricepaid decimal(8,2),
 commission decimal(8,2),
 saletime timestamp);
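
To verify that all seven tables were created, you can list the user tables with the query below (pg_table_def is a Redshift system view):

select distinct(tablename) from pg_table_def where schemaname = 'public';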
Now load the sample data. Ensure that in the below queries you replace "<iam-role-arn>" with the ARN of the IAM role attached to your cluster. For example, with a dummy ARN filled in, the first COPY command would look like this:

copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=arn:aws:iam::123456789123:role/redshiftrole' 
delimiter '|' region 'us-west-2';

Run the below COPY commands for all seven tables.

copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy venue from 's3://awssampledbuswest2/tickit/venue_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy category from 's3://awssampledbuswest2/tickit/category_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy date from 's3://awssampledbuswest2/tickit/date2008_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy event from 's3://awssampledbuswest2/tickit/allevents_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' timeformat 'YYYY-MM-DD HH:MI:SS' region 'us-west-2';

copy listing from 's3://awssampledbuswest2/tickit/listings_pipe.txt' 
credentials 'aws_iam_role=<iam-role-arn>' 
delimiter '|' region 'us-west-2';

copy sales from 's3://awssampledbuswest2/tickit/sales_tab.txt'
credentials 'aws_iam_role=<iam-role-arn>'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';
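
To sanity-check the loads, you can run a quick row count against each table, for example:

select count(*) from users;
select count(*) from sales;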
Once you have loaded the data, you can run sample queries like the ones below in SQL Workbench.
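For example, the two queries below (adapted from AWS's sample queries for the TICKIT dataset) fetch the definition of the sales table and the total sales on a given calendar date:

-- Get definition for the sales table.
select * from pg_table_def where tablename = 'sales';

-- Find total sales on a given calendar date.
select sum(qtysold)
from sales, date
where sales.dateid = date.dateid
and caldate = '2008-01-05';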
Congrats! You have finally created the Redshift cluster, loaded data into it, and run queries on it.
Refer to this post if you want to reset the master user password.
Don't forget to clean up the cluster or you will be billed.
  • For deleting the cluster, just click on the cluster (in our case it's testdw) in the AWS console.
  • Click on the "Cluster" drop-down and select "Delete".
That will clean up everything.
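
If you prefer the command line, the same cleanup can be done with the AWS CLI (a quick sketch, assuming your cluster identifier is testdw and you don't want a final snapshot):

aws redshift delete-cluster --cluster-identifier testdw --skip-final-cluster-snapshot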
Hope this guide was helpful to you! Do let me know in the comment section if you have any queries or suggestions.
