AWS Crash Course - EMR

What is EMR?
  • AWS EMR (Elastic MapReduce) is a managed Hadoop framework.
  • It provides you an easy, cost-effective and highly scalable way to process large amounts of data.
  • It can be used for multiple things like indexing, log analysis, financial analysis, scientific simulation, machine learning, etc.
Cluster and Nodes
  • The centerpiece of EMR is the cluster.
  • A cluster is a collection of EC2 instances, also called nodes.
  • All nodes of an EMR cluster are launched in the same Availability Zone.
  • Each node has a role in the cluster.
Types of EMR Cluster Nodes
Master Node:- This is the main node, which manages the cluster by running software components and distributing tasks to the other nodes. The master node also monitors task status and the health of the cluster.
Core Node:- This is a slave node which runs tasks and stores data in HDFS (Hadoop Distributed File System).
Task Node:- This is also a slave node, but it only runs tasks and doesn't store any data. It's an optional node type.
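As a rough illustration, here is a minimal sketch of launching a cluster with all three node roles using the AWS CLI. The cluster name, release label, and instance types/counts are assumptions for the example, not requirements.
aws emr create-cluster --name "MyCluster" \
  --release-label emr-5.30.0 \
  --applications Name=Hadoop \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
                    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
                    InstanceGroupType=TASK,InstanceCount=2,InstanceType=m4.large \
  --use-default-roles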
Cluster Types
EMR has two types of clusters:
1) Transient :- These are clusters which are shut down once the job is done. They are useful when you don't need the cluster running all day long and can save money by shutting it down (see the sketch after this list).
2) Persistent :- Persistent clusters are those which need to be always available, either to process a continuous stream of jobs or because you want the data to be always available on HDFS.
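The difference maps to a single AWS CLI flag on create-cluster: --auto-terminate requests a transient cluster that shuts down after its steps finish, while --no-auto-terminate (the default) keeps it running as a persistent cluster. A hedged sketch, where the step JAR and bucket are placeholders:
aws emr create-cluster --name "TransientCluster" \
  --release-label emr-5.30.0 \
  --instance-type m4.large --instance-count 3 \
  --steps Type=CUSTOM_JAR,Name=MyJob,Jar=s3://mybucket/myjob.jar \
  --use-default-roles \
  --auto-terminate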
Different Cluster States
An EMR cluster goes through multiple states, as described below:-
STARTING – The cluster provisions, starts, and configures EC2 instances.
BOOTSTRAPPING – Bootstrap actions are being executed on the cluster.
RUNNING – A step for the cluster is currently being run.
WAITING – The cluster is currently active, but has no steps to run.
TERMINATING – The cluster is in the process of shutting down.
TERMINATED – The cluster was shut down without error.
TERMINATED_WITH_ERRORS – The cluster was shut down with errors.
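You can observe these states from the AWS CLI; a small sketch, where the cluster ID j-XXXXXXXXXXXXX is a placeholder:
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State'
aws emr list-clusters --cluster-states WAITING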


Types of file systems in EMR
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is a distributed, scalable file system for Hadoop. HDFS distributes the data it stores across instances in the cluster, storing multiple copies of data on different instances to ensure that no data is lost if an individual instance fails. HDFS is ephemeral storage that is reclaimed when you terminate a cluster.
EMR File System (EMRFS)
Using the EMR File System (EMRFS), Amazon EMR extends Hadoop to add the ability to directly access data stored in Amazon S3 as if it were a file system like HDFS. You can use either HDFS or Amazon S3 as the file system in your cluster. Most often, Amazon S3 is used to store input and output data and intermediate results are stored in HDFS.
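Because of EMRFS, a step can address S3 directly with s3:// paths just as it would HDFS paths. A hedged sketch of adding a streaming step whose input and output live in S3 (the bucket, script, and cluster ID are placeholders):
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=STREAMING,Name=WordCount,Args=[-files,s3://mybucket/wordSplitter.py,-mapper,wordSplitter.py,-reducer,aggregate,-input,s3://mybucket/input,-output,s3://mybucket/output]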
Local File System
The local file system refers to a locally connected disk. When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of preattached disk storage called an instance store. Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.
Programming languages supported by EMR
  • Perl
  • Python
  • Ruby
  • C++
  • PHP
  • R
EMR Security
  • EMR integrates with IAM to manage permissions.
  • EMR has master and slave security groups for the nodes to control traffic access.
  • EMR supports S3 server-side and client-side encryption with EMRFS.
  • You can launch EMR clusters in your VPC to make it more secure.
  • EMR integrates with CloudTrail, so you will have a log of all activities performed on the cluster.
  • You can log in via SSH to EMR cluster nodes using EC2 key pairs.
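For instance, once a key pair is associated with the cluster you can reach the master node over SSH (the default login user on EMR nodes is hadoop). A minimal sketch, where the key file and public DNS name are placeholders:
ssh -i ~/mykeypair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
The AWS CLI also offers a convenience wrapper, e.g. aws emr ssh --cluster-id j-XXXXXXXXXXXXX --key-pair-file ~/mykeypair.pem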
EMR Management Interfaces
  • Console :- You can manage your EMR clusters from the AWS EMR Console.
  • AWS CLI :- The command line provides you a rich way of controlling EMR. Refer to the EMR CLI reference here.
  • Software Development Kits (SDKs) :- SDKs provide functions that call Amazon EMR to create and manage clusters. They are currently available only for the supported languages mentioned above. You can check sample code and libraries here.
  • Web Service API :- You can use this interface to call the web service directly using JSON. You can get more information from the API Reference Guide.
EMR Billing
  • You pay for the EC2 instances used in the cluster, plus a separate charge for EMR itself.
  • You are charged per instance hour.
  • EMR supports On-Demand, Spot, and Reserved Instances.
  • As a cost-saving measure, it is recommended to use Spot Instances for task nodes (see the sketch after this list).
  • It's not a good idea to use Spot Instances for the master or core nodes, as they store data on them and you would lose that data once the node is terminated.
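As a sketch of that cost-saving pattern, task nodes can be added to a running cluster as Spot Instances from the AWS CLI. The cluster ID, instance type, count, and bid price below are assumptions for illustration:
aws emr add-instance-groups --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups InstanceCount=2,InstanceGroupType=TASK,InstanceType=m4.large,BidPrice=0.10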
If you want to try some EMR hands-on, refer to this tutorial.

  • This AWS Crash Course series is created to give you a quick snapshot of AWS technologies. You can check out other AWS services in this series over here.

Solved: How to create a soft link in Linux or Solaris

In this post we will see how to create a softlink.
Execute the below command to create a softlink.
[root@cloudvedas ~]# ln -s /usr/interface/HB0 CLV
Now when you list using "ls -l", the softlink thus created will look like this:
[root@cloudvedas ~]# ls -l
lrwxrwxrwx. 1 root root 18 Aug 8 23:16 CLV -> /usr/interface/HB0
[root@cloudvedas ~]#
Try going inside the link and listing the contents.
[root@cloudvedas ~]# cd CLV
[root@cloudvedas CLV]# ls
cloud1 cloud2 cloud3
[root@cloudvedas CLV]#
You can see the contents of /usr/interface/HB0 directory.
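If you later need to check where a link points, or repoint it to a new location, the standard readlink and ln commands can be used. A small sketch; the new target /usr/interface/HB1 is hypothetical:
[root@cloudvedas ~]# readlink -f CLV
/usr/interface/HB0
[root@cloudvedas ~]# ln -sfn /usr/interface/HB1 CLV
Here -f forces replacement of the existing link and -n treats the link itself as the target instead of descending into the directory it points to.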

Solved: How to create a flar image in Solaris and restore it for recovery

A flar image is a good way to recover your system from crashes. In this post we will see how to create a flar image and use it for recovery of the system.
Flar Creation
  • It is recommended that you create the flar image in single-user mode. Shut down the server and boot it in single user.
# init 0
ok> boot -s
  • In this example, the FLAR image will be stored under /flash and named recovery_image.flar. In the command below, -n sets the archive name, -c compresses the archive, -S skips the disk-space check, -R / archives from the root file system, and -x /flash excludes the /flash directory itself.
flarcreate -n my_bkp_image1 -c -S -R / -x /flash /flash/recovery_image.flar
  • Once the flar image is created, copy it to your repository system. Here we are using NFS.
cp -p /flash/recovery_image.flar /net/FLAR_recovery/recovery_image.flar
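Before relying on the archive for recovery, you can sanity-check its metadata with the flar utility; a minimal sketch:
flar info /flash/recovery_image.flar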
Flar Restoration
  • To restore a flar image, start the boot process.
  • You can boot the server either with a Solaris CD/DVD or over the network.
  • Go to the ok prompt and run one of the below commands:-
To boot from the installation media (CD/DVD):
ok> boot cdrom
If you want to boot from the network:
ok> boot net
  • Provide the network, date/time, and password information for the system.
  • Once you reach the “Solaris Interactive Installation” part, select “Flash”.
  • Provide the path to the system holding the FLAR image:
    /net/FLAR_recovery/recovery_image.flar
  • Select the correct Retrieval Method (HTTP, FTP, NFS) to locate the FLAR image.
  • At the Disk Selection screen, select the disk where the FLAR image is to be installed.
  • Choose not to preserve existing data. (Be sure you want to restore on the selected disk.)
  • At the File System and Disk Layout screen, select Customize to edit the disk slices and enter the values of the disk partition table from the original disk.
  • Once the system is rebooted, the recovery is complete.

What is the maximum number of usable partitions on a disk in Linux

Linux generally has two types of disks: IDE and SCSI.
IDE
By convention, IDE drives will be given device names /dev/hda to /dev/hdd. Hard Drive A (/dev/hda) is the first drive and Hard Drive C (/dev/hdc) is the third.
A typical PC has two IDE controllers, each of which can have two drives connected to it. For example, /dev/hda is the first drive (master) on the first IDE controller and /dev/hdd is the second (slave) drive on the second controller (the fourth IDE drive in the computer).
The maximum number of usable partitions on an IDE disk is 63.
SCSI
SCSI drives follow a similar pattern; they are represented by 'sd' instead of 'hd'. The first partition of the second SCSI drive would therefore be /dev/sdb1.
The maximum number of usable partitions on a SCSI disk is 15.
A partition is labeled to host a certain kind of file system (not to be confused with a volume label). Such a file system could be the Linux standard ext2 file system or Linux swap space, or even foreign file systems like (Microsoft) NTFS or (Sun) UFS. There is a numerical code associated with each partition type. For example, the code for ext2 is 0x83 and Linux swap is 0x82.
To see a list of partition types and their codes, execute /sbin/sfdisk -T
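To check how many partitions a disk currently has, the standard tools below can be used; a small sketch, where /dev/sda is a placeholder device:
/sbin/sfdisk -l /dev/sda
cat /proc/partitions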

Solved: How to cap memory on a Solaris 10 zone.

If you want to cap the memory usage of a zone, follow the steps below:-

Here we will ensure that the zone (zcldvdas) doesn't use more than 3072 MB of memory.

# zonecfg -z zcldvdas
zonecfg:zcldvdas> add capped-memory
zonecfg:zcldvdas:capped-memory> set physical=3072m
zonecfg:zcldvdas:capped-memory> end
zonecfg:zcldvdas> verify
zonecfg:zcldvdas> commit
zonecfg:zcldvdas> exit
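You can confirm what was committed with zonecfg's info subcommand; a minimal sketch run from the global zone:
# zonecfg -z zcldvdas info capped-memory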

Now, if you want to dedicate 3072 MB of memory to a zone so that it's always available only to this zone, follow the steps below:-

# zonecfg -z zcldvdas
zonecfg:zcldvdas> add capped-memory
zonecfg:zcldvdas:capped-memory> set locked=3072m
zonecfg:zcldvdas:capped-memory> end
zonecfg:zcldvdas> verify
zonecfg:zcldvdas> commit
zonecfg:zcldvdas> exit

You can also use a combination of physical and locked to assign max and min memory to a zone.

In the next example, we set the maximum memory the zone can use to 3072 MB, while locking 1024 MB as the minimum that should always be available to the zone.

# zonecfg -z zcldvdas
zonecfg:zcldvdas> add capped-memory
zonecfg:zcldvdas:capped-memory> set physical=3072m
zonecfg:zcldvdas:capped-memory> set locked=1024m
zonecfg:zcldvdas:capped-memory> end
zonecfg:zcldvdas> verify
zonecfg:zcldvdas> commit
zonecfg:zcldvdas> exit

This change will take effect after a reboot of the local zone.

zoneadm -z zcldvdas reboot

From Solaris 10 u4 onwards, you can also cap the memory online using rcapadm.

rcapadm -z zcldvdas -m 3G

But remember, the changes made by rcapadm are not persistent across reboots, so you will still have to make the entry in zonecfg as discussed above.

You can view the set memory cap using rcapstat from the global zone. In the example below, statistics are reported every 2 seconds, 5 times.

rcapstat -z 2 5

From the local zone, you can check this with prtconf.

prtconf -vp | grep Mem