Friday, March 15, 2019

Kubernetes Reference

Minikube

minikube start

minikube stop

minikube delete

minikube docker-env

minikube ip



---
Kubectl

kubectl get all
Pods, ReplicaSets, Deployments and Services

kubectl apply -f <yaml file>

kubectl apply -f .

kubectl describe pod <name of pod>

kubectl exec -it <pod name> <command>

kubectl get <pod | po | service | svc | rs | replicaset | deployment | deploy>

kubectl get po --show-labels

kubectl get po --show-labels -l {name}={value}

kubectl delete po <pod name>

kubectl delete po --all



---
Deployment Management

kubectl rollout status deploy <name of deployment>

kubectl rollout history deploy <name of deployment>

kubectl rollout undo deploy <name of deployment>
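
Example rollout cycle (a sketch; the deployment name web-deployment and the file web-deployment.yaml are placeholders):

kubectl apply -f web-deployment.yaml
kubectl rollout status deploy web-deployment
kubectl rollout history deploy web-deployment
kubectl rollout undo deploy web-deployment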

Docker Reference

Manage images

docker image pull <image name>

docker image ls

docker image build -t <image name> .

docker image push <image name>

docker image tag <image id> <tag name>
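
Example build-tag-push cycle (a sketch; the image name webapp and the registry account myaccount are placeholders):

docker image build -t webapp .
docker image tag webapp myaccount/webapp:1.0
docker image push myaccount/webapp:1.0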



---
Manage Containers

docker container run -p <public port>:<container port> <image name>

docker container ls -a

docker container stop <container id>

docker container start <container id>

docker container rm <container id>

docker container prune

docker container run -it <image name>

docker container run -d <image name>

docker container exec -it <container id> <command>

docker container exec -it <container id> bash

docker container logs -f <container id>

docker container commit -a "author" <container id> <image name>
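
Example container lifecycle (a sketch; runs the public nginx image, the container name web is a placeholder):

docker container run -d -p 8080:80 --name web nginx
docker container logs -f web
docker container exec -it web bash
docker container stop web
docker container rm web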



---
Manage your (local) Virtual Machine

docker-machine ip



---
Manage Networks

docker network ls

docker network create <network name>



---
Manage Volumes

docker volume ls

docker volume prune

docker volume inspect <volume name>

docker volume rm <volume name>



---
Docker Compose

docker-compose up

docker-compose up -d

docker-compose logs -f <service name>

docker-compose down
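
Example Compose workflow (a sketch; assumes a docker-compose.yml in the current directory defining a service named web):

docker-compose up -d
docker-compose logs -f web
docker-compose down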



---
Manage a Swarm

docker swarm init (--advertise-addr <ip address>)

docker service create <args>

docker network create --driver overlay <name>

docker service ls

docker node ls

docker service logs -f <service name>

docker service ps <service name>

docker swarm join-token <worker|manager>
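
Example swarm bring-up (a sketch; the advertise address, the network name webnet and the service name web are placeholders):

docker swarm init --advertise-addr 10.0.0.5
docker swarm join-token worker
docker network create --driver overlay webnet
docker service create --name web --network webnet -p 8080:80 nginx
docker service ps web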



---
Manage Stacks

docker stack ls

docker stack deploy -c <compose file> <stack name>

docker stack rm <stack name>
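
Example stack deployment (a sketch; the compose file and the stack name webstack are placeholders):

docker stack deploy -c docker-compose.yml webstack
docker stack ls
docker stack rm webstack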



Thursday, March 14, 2019

Elasticsearch PUT and GET data

--Create an index in Elasticsearch

PUT http://host-1:9200/my_index

{
"settings" : {
"number_of_shards" : 3,
"number_of_replicas" : 1
}
}

--To get information about an index

GET http://host-1:9200/my_index


--Add user to index with id 1

POST http://host-1:9200/my_index/user/1

{
"name": "Deepak",
"age": 36,
"department": "IT",
"address": {
"street": "No.123, XYZ street",
"city": "Singapore",
"country": "Singapore"
}
}

--To fetch document with id 1

GET http://host-1:9200/my_index/user/1

--Add user to index with id 2

POST http://host-1:9200/my_index/user/2

{
"name": "McGiven",
"age": 30,
"department": "Finance"
}

--Add user to index with id 3

POST http://host-1:9200/my_index/user/3

{
"name": "Watson",
"age": 30,
"department": "HR",
"address": {
"street": "No.123, XYZ United street",
"city": "Singapore",
"country": "Singapore"
}
}

--Search documents by name

GET http://host-1:9200/my_index/user/_search?q=name:watson

--Delete an index

DELETE http://host-1:9200/my_index
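
--The same calls can be made from the shell with curl; a sketch for the index above (same host, explicit Content-Type header for request bodies)

curl -X PUT http://host-1:9200/my_index -H 'Content-Type: application/json' -d '{"settings":{"number_of_shards":3,"number_of_replicas":1}}'

curl -X GET http://host-1:9200/my_index/user/1

curl -X DELETE http://host-1:9200/my_index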

Tuesday, March 5, 2019

Hadoop Submitting a MapReduce Job

HDFS Input Location:

/user/deepakdubey/input/stocks

HDFS Output Location (relative to user):

output/mapreduce/stocks



Delete Output Directory:

hadoop fs -rm -r output/mapreduce/stocks



Submit Job:

hadoop jar /deepakdubey/mapreduce/stocks/MaxClosePriceByStock-1.0.jar com.deepakdubey.MaxClosePriceByStock /user/deepakdubey/input/stocks output/mapreduce/stocks



View Result:

hadoop fs -cat output/mapreduce/stocks/part-r-00000



=========================


HDFS Input Location:

/user/deepakdubey/input/stocks

HDFS Output Location (relative to user):

output/mapreduce/stocks



Delete Output Directory:

hadoop fs -rm -r output/mapreduce/stocks



Submit Job:

hadoop jar /deepakdubey-starterkit/mapreduce/stocks/MaxClosePriceByStock-1.0.jar com.deepakdubey.MaxClosePriceByStock /user/deepakdubey/input/stocks output/mapreduce/stocks



View Result:

hadoop fs -cat output/mapreduce/stocks/part-r-00000

Caching Data in Spark

symvol.cache()

symvol.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
symvol.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
symvol.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)

symvol.unpersist()

--caching symvol

val stocks = sc.textFile("hdfs://ip-10-10-10-10.ec2.internal:8020/user/deepakdubey/input/stocks")
val splits = stocks.map(record => record.split(","))
val symvol = splits.map(arr => (arr(1), arr(7).toInt))
symvol.cache()
val maxvol = symvol.reduceByKey((vol1, vol2) => Math.max(vol1, vol2))
maxvol.collect().foreach(println)

Spark Get Maximum Volume by Stock

val stocks = sc.textFile("hdfs://ip-10-10-10-10.ec2.internal:8020/user/deepakdubey/input/stocks")
val splits = stocks.map(record => record.split(","))
val symvol = splits.map(arr => (arr(1), arr(7).toInt))
val maxvol = symvol.reduceByKey((vol1, vol2) => Math.max(vol1, vol2))
maxvol.collect().foreach(println)

--Spark shell
spark-shell --master yarn

Sunday, March 3, 2019

Big Data Ingestion using Kafka


Hive Loading Tables

### CREATE A TABLE FOR STOCKS ###

hive> CREATE TABLE IF NOT EXISTS stocks (
exch STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('creator'='deepakdubey', 'created_on' = '2019-02-16', 'description'='This table holds stocks data!!!');

### DESCRIBE TABLE TO GET DETAILS ABOUT TABLE ###

hive> DESCRIBE FORMATTED stocks;

### COPY THE STOCKS DATASET TO HDFS ###

hadoop fs -copyFromLocal /deepakdubey/input/stocks-dataset/stocks/* input/hive/stocks_db

hadoop fs -ls input/hive/stocks_db

hive> !hadoop fs -ls input/hive/stocks_db;

### LOAD DATASET USING LOAD COMMAND ###

hive> LOAD DATA INPATH 'input/hive/stocks_db'
INTO TABLE stocks;

hive> !hadoop fs -ls input/hive/stocks_db;

hive> DESCRIBE FORMATTED stocks;

hive> !hadoop fs -ls /user/hive/warehouse/stocks_db.db/stocks;

hive> SELECT * FROM stocks;

### LOAD DATASET USING CTAS ###

hive> CREATE TABLE stocks_ctas
AS
SELECT * FROM stocks;

hive> DESCRIBE FORMATTED stocks_ctas;

hive> !hadoop fs -ls /user/hive/warehouse/stocks_db.db/stocks_ctas;

### LOAD DATASET USING INSERT..SELECT ###

hive> INSERT INTO TABLE stocks_ctas
SELECT s.* FROM stocks s;

hive> !hadoop fs -ls /user/hive/warehouse/stocks_db.db/stocks_ctas;

hive> SELECT * FROM stocks_ctas;

### LOAD DATASET USING INSERT OVERWRITE ###

hive> INSERT OVERWRITE TABLE stocks_ctas
SELECT s.* FROM stocks s;

hive> !hadoop fs -ls /user/hive/warehouse/stocks_db.db/stocks_ctas;

hadoop fs -copyFromLocal /home/cloudera/deepakdubey/input/stocks_db/stocks/* input/stocks_db/stocks

hadoop fs -ls input/stocks_db/stocks

### LOCATION ATTRIBUTE & LOADING DATA ### 

hadoop fs -copyFromLocal /deepakdubey/input/stocks-dataset/stocks/* input/hive/stocks_db

hadoop fs -ls input/hive/stocks_db

hive> CREATE TABLE IF NOT EXISTS stocks_loc (
exch STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/deepakdubey/input/hive/stocks_db'
TBLPROPERTIES ('creator'='deepakdubey', 'created_on' = '2019-02-16', 'description'='This table holds stocks data!!!');

hive> DESCRIBE FORMATTED stocks_loc;

hive> SELECT * FROM stocks_loc;

Hive Create MANAGED Table

hive> CREATE DATABASE stocks_db;

hive> SHOW DATABASES;

hive> USE stocks_db;

hive> CREATE TABLE IF NOT EXISTS stocks (
exch string,
symbol string,
ymd string,
price_open float,
price_high float,
price_low float,
price_close float,
volume int,
price_adj_close float)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

hive> DESCRIBE FORMATTED stocks;

hive> DROP DATABASE stocks_db;

hive> DROP TABLE stocks;

hive> DROP DATABASE stocks_db CASCADE;

Hive Create EXTERNAL Table

hive> CREATE DATABASE stocks_db;

hive> USE stocks_db;

hive> CREATE EXTERNAL TABLE IF NOT EXISTS stocks_tb (
exch STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/deepakdubey/input/stocks';

hive> SELECT * FROM stocks_tb LIMIT 100;
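
--Dropping an EXTERNAL table removes only the metadata; the files under LOCATION stay in HDFS. A quick check (a sketch reusing the paths above):

hive> DROP TABLE stocks_tb;

hive> !hadoop fs -ls /user/deepakdubey/input/stocks;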

Hadoop Cluster - Stop Start Restart

--Stop, Start and Restart Datanode



sudo service hadoop-hdfs-datanode stop



sudo service hadoop-hdfs-datanode start



sudo service hadoop-hdfs-datanode restart



--Stop, Start and Restart Node Manager



sudo service hadoop-yarn-nodemanager stop



sudo service hadoop-yarn-nodemanager start



sudo service hadoop-yarn-nodemanager restart



--Stop, Start and Restart Resource Manager



sudo service hadoop-yarn-resourcemanager stop



sudo service hadoop-yarn-resourcemanager start



sudo service hadoop-yarn-resourcemanager restart



--Restart all HDFS services on a node



for x in `cd /etc/init.d ; ls hadoop-hdfs*` ; do sudo service $x restart ; done



--Restart all YARN services on a node



for x in `cd /etc/init.d ; ls hadoop-yarn*` ; do sudo service $x restart ; done



--Restart all Hadoop services on a node



for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x restart ; done



--Start checkpointing operation



hdfs dfsadmin -safemode enter

hdfs dfsadmin -saveNamespace



--Save backup of FSIMAGE



hdfs dfsadmin -fetchImage /tmp/fsimage-bkup





--Start/Stop commands for HDP (Hortonworks Data Platform)



/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start namenode

/usr/hdp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh start datanode



/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh start resourcemanager

/usr/hdp/current/hadoop-yarn-nodemanager/sbin/yarn-daemon.sh start nodemanager
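
--A quick way to confirm which Hadoop daemons are actually running on a node (a sketch; assumes the JDK's jps tool is on the PATH)

sudo jps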

Saturday, March 2, 2019

Pig - Pig Latin Solving A Problem

grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);



### FILTERING ONLY RECORDS FROM YEAR 2003 ###



filter_by_yr = FILTER stocks by GetYear(date) == 2003;



### GROUPING RECORDS BY SYMBOL ###



grunt> grp_by_sym = GROUP filter_by_yr BY symbol;



grp_by_sym: {

 group: chararray,

 filter_by_yr: {

  (exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)

 }

}



### SAMPLE OUTPUT OF GROUP ###



(CASC, { (NYSE,CASC,2003-12-22T00:00:00.000Z,22.02,22.2,21.94,22.09,36700,20.29), (NYSE,CASC,2003-12-23T00:00:00.000Z,22.15,22.15,21.9,22.05,23600,20.26), ....... })

(CATO, { (NYSE,CATO,2003-10-08T00:00:00.000Z,22.48,22.5,22.01,22.06,92000,12.0), (NYSE,CATO,2003-10-09T00:00:00.000Z,21.3,21.59,21.16,21.45,373500,11.67), ....... })



### CALCULATE AVERAGE VOLUME ON THE GROUPED RECORDS ###



avg_volume = FOREACH grp_by_sym GENERATE group, ROUND(AVG(filter_by_yr.volume)) as avgvolume;



### ORDER THE RESULT IN DESCENDING ORDER ###



avg_vol_ordered = ORDER avg_volume BY avgvolume DESC;



### STORE TOP 10 RECORDS ###



top10 = LIMIT avg_vol_ordered 10;

STORE top10 INTO 'output/pig/avg-volume' USING PigStorage(',');



### EXECUTE PIG INSTRUCTIONS AS SCRIPT ###



pig /deepakdubey-workshop/pig/scripts/average-volume.pig



### PASSING PARAMETERS TO SCRIPT ###



pig -param input=/user/deepakdubey/input/stocks -param output=output/pig/avg-volume-params /deepakdubey-workshop/pig/scripts/average-volume-parameters.pig



### RUNNING A PIG SCRIPT LOCALLY. INPUT AND OUTPUT LOCATION ARE POINTING TO LOCAL FILE SYSTEM ###



pig -x local -param input=/deepakdubey-workshop/input/stocks-dataset/stocks -param output=output/stocks /deepakdubey-workshop/pig/scripts/average-volume-parameters.pig
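
### FOR REFERENCE - WHAT A PARAMETERIZED SCRIPT MIGHT LOOK LIKE (a sketch; assumes the parameters are named input and output to match the -param flags above) ###

stocks = LOAD '$input' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
filter_by_yr = FILTER stocks BY GetYear(date) == 2003;
grp_by_sym = GROUP filter_by_yr BY symbol;
avg_volume = FOREACH grp_by_sym GENERATE group, ROUND(AVG(filter_by_yr.volume)) as avgvolume;
avg_vol_ordered = ORDER avg_volume BY avgvolume DESC;
top10 = LIMIT avg_vol_ordered 10;
STORE top10 INTO '$output' USING PigStorage(',');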

Kafka Architecture








Hadoop HDFS - Getting to know your cluster

--To find the hadoop version

hadoop version

--HDFS report

sudo su - hdfs

hdfs dfsadmin -report

hdfs dfsadmin -report -live
hdfs dfsadmin -report -dead

--Configuration details

hdfs getconf -namenodes

hdfs getconf -confKey dfs.replication

hdfs getconf -confKey dfs.blocksize

hdfs getconf -confKey dfs.namenode.http-address

hdfs getconf -confKey yarn.resourcemanager.webapp.address

--YARN application details

yarn application -list -appStates ALL

yarn application -list -appStates FAILED

--Other YARN application states

NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
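
--To dig deeper into one application, its status and aggregated logs can usually be fetched like this (a sketch; the application id is a placeholder and yarn logs requires log aggregation to be enabled)

yarn application -status application_1234567890123_0001

yarn logs -applicationId application_1234567890123_0001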

Pig - Pig Latin Loading Projecting

### LOADING A DATASET ###



grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);



### STRUCTURE ###



grunt> DESCRIBE stocks;



### PROJECT AND MANIPULATE FEW COLUMNS FROM DATASET ###



grunt> projection = FOREACH stocks GENERATE symbol, SUBSTRING($0, 0, 1) as sub_exch, close - open as up_or_down;



### PRINT RESULT ON SCREEN ###



grunt> DUMP projection;



### STORE RESULT IN HDFS ###



grunt> STORE projection INTO 'output/pig/simple-projection';



### LOAD 1 - WITH NO COLUMN NAMES AND DATATYPES ###



grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',');



### LOAD 2 - WITH COLUMN NAMES BUT NO DATATYPES ###



grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',') as (exchange, symbol, date, open, high, low, close, volume, adj_close);



### LOAD 3 - WITH COLUMN NAMES AND DATATYPES ###



grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);



### TO LOOK UP STRUCTURE OF THE RELATION ###



grunt> DESCRIBE stocks;



### WHEN COLUMN NAMES ARE NOT AVAILABLE ###



grunt> projection = FOREACH stocks GENERATE $1 as symbol, SUBSTRING($0, 0, 1) as sub_exch, $6 - $3 as up_or_down;

Hadoop HDFS - Disk Usage



--Disk space reserved for non-HDFS use - hdfs-site.xml



dfs.datanode.du.reserved



--HDFS Balancer



sudo -su hdfs



hdfs balancer



hdfs balancer -threshold 5



--Creating user directories



sudo -u hdfs hadoop fs -mkdir /user/deepak.dubey

sudo -u hdfs hadoop fs -chown deepak.dubey:deepak.dubey /user/deepak.dubey



--Disk Usage



hadoop fs -du /user



hadoop fs -du -h /user



hadoop fs -du -h -s /user




Hadoop HDFS - Disk Quotas

--Name Quota



hadoop fs -mkdir /user/ubuntu/quotatest



hdfs dfsadmin -setQuota 5 /user/ubuntu/quotatest



hadoop fs -count -q /user/ubuntu/quotatest



hadoop fs -touchz /user/ubuntu/quotatest/test1



hadoop fs -touchz /user/ubuntu/quotatest/test2

hadoop fs -touchz /user/ubuntu/quotatest/test3

hadoop fs -touchz /user/ubuntu/quotatest/test4



hadoop fs -touchz /user/ubuntu/quotatest/test5



hadoop fs -rm /user/ubuntu/quotatest/test4



hdfs dfsadmin -clrQuota /user/ubuntu/quotatest



hadoop fs -count -q /user/ubuntu/quotatest



--Space Quota



hdfs dfsadmin -setSpaceQuota 300k /user/ubuntu/quotatest



hadoop fs -count -q /user/ubuntu/quotatest



hadoop fs -copyFromLocal hdfs-site.xml /user/ubuntu/quotatest



hdfs dfsadmin -clrSpaceQuota /user/ubuntu/quotatest

Hadoop HDFS - Recovering From Accidental Data Loss

--Trash



sudo vi /etc/hadoop/conf/core-site.xml



<property>

 <name>fs.trash.interval</name>

 <value>60</value>

</property>



<property>

 <name>fs.trash.checkpoint.interval</name>

 <value>45</value>

</property>



sudo service hadoop-hdfs-namenode restart



--Skip Trash



hadoop fs -rm -skipTrash delete-file2



--Snapshots

hadoop fs -mkdir important-files



hadoop fs -copyFromLocal file1 file2 important-files



hdfs dfs -createSnapshot /user/ubuntu/important-files



--Require admin rights

hdfs dfsadmin -allowSnapshot /user/ubuntu/important-files



--Snapshot creation - demonstrate file add & deletes

hdfs dfs -createSnapshot /user/ubuntu/important-files snapshot1



hadoop fs -ls /user/ubuntu/important-files/.snapshot



hadoop fs -ls /user/ubuntu/important-files/.snapshot/snapshot1



hadoop fs -rm /user/ubuntu/important-files/.snapshot/snapshot1/file1



hadoop fs -rm /user/ubuntu/important-files/file2



hadoop fs -copyFromLocal file3 /user/ubuntu/important-files



hdfs dfs -createSnapshot /user/ubuntu/important-files snapshot2



hadoop fs -ls /user/ubuntu/important-files/.snapshot/snapshot2



hadoop fs -ls /user/ubuntu/important-files



hadoop fs -ls /user/ubuntu/important-files/.snapshot/snapshot1



hadoop fs -cat /user/ubuntu/important-files/.snapshot/snapshot1/file2



hdfs snapshotDiff /user/ubuntu/important-files snapshot2 snapshot1



--Demonstrate file modifications

cat append-file



hadoop fs -appendToFile append-file /user/ubuntu/important-files/file1



hadoop fs -cat /user/ubuntu/important-files/file1



hdfs dfs -createSnapshot /user/ubuntu/important-files snapshot3



hadoop fs -ls /user/ubuntu/important-files/.snapshot/snapshot3



hdfs snapshotDiff /user/ubuntu/important-files snapshot3 snapshot2



hadoop fs -cat /user/ubuntu/important-files/.snapshot/snapshot2/file1



hadoop fs -cat /user/ubuntu/important-files/.snapshot/snapshot3/file1



hadoop fs -cat /user/ubuntu/important-files/file1



--Delete snapshots

hdfs dfs -deleteSnapshot /user/ubuntu/important-files snapshot1



hadoop fs -ls /user/ubuntu/important-files/.snapshot



--Require admin rights

hdfs dfsadmin -disallowSnapshot /user/ubuntu/important-files

Hadoop HDFS Admin - Network Topology Of Cluster


sudo vi /etc/hadoop/conf/core-site.xml


<property>

    <name>net.topology.node.switch.mapping.impl</name>

    <value>org.apache.hadoop.net.ScriptBasedMapping</value>

</property>



<property>

    <name>net.topology.script.file.name</name>

    <value>/etc/hadoop/conf/topology.sh</value>

</property>



sudo vi /etc/hadoop/conf/topology.data



10-10-10-10 /rack1

20-20-20-20 /rack1

30-30-30-30 /rack1

40-40-40-40 /rack2



--Topology script will be called when Namenode starts


/etc/hadoop/conf/topology.sh 10-10-10-10 20-20-20-20 30-30-30-30 40-40-40-40


sudo vi /etc/hadoop/conf/topology.sh



#!/bin/bash

HADOOP_CONF=/etc/hadoop/conf

while [ $# -gt 0 ] ; do

  nodeArg=$1

  exec< ${HADOOP_CONF}/topology.data

  result=""

  while read line ; do

    ar=( $line )

    if [ "${ar[0]}" = "$nodeArg" ] ; then

      result="${ar[1]}"

    fi

  done

  shift

  if [ -z "$result" ] ; then

     echo -n "/rack0 "

  else

    echo -n "$result "

  fi

done



--IP Address encoded with rack information


xx.xx.rackno.nodeno

50-50-50-50

Day to Day Essentials - Adding and Removing Nodes

--Marking Datanode Dead

2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval

dfs.heartbeat.interval - defaults to 3 seconds

dfs.namenode.heartbeat.recheck-interval - defaults to 300,000 milliseconds = 300 seconds

2 * 300 seconds + 10 * 3 seconds = 630 seconds (10 minutes and 30 seconds)

--To commission or decommission datanodes

sudo vi /etc/hadoop/conf/hdfs-site.xml

<property>

  <name>dfs.hosts</name>

  <value>/etc/hadoop/conf/include</value>

</property>


<property>

  <name>dfs.hosts.exclude</name>

  <value>/etc/hadoop/conf/exclude</value>

</property>


sudo vi /etc/hadoop/conf/include

sudo vi /etc/hadoop/conf/exclude

--After include/exclude changes, refresh nodes for changes to take effect

hdfs dfsadmin -refreshNodes



http://ec2-10-10-10-10.compute-1.amazonaws.com:50070/dfshealth.html

--To commission or decommission nodemanagers

sudo vi /etc/hadoop/conf/yarn-site.xml

<property>

  <name>yarn.resourcemanager.nodes.include-path</name>

  <value>/etc/hadoop/conf/include</value>

</property>


<property>

  <name>yarn.resourcemanager.nodes.exclude-path</name>

  <value>/etc/hadoop/conf/exclude</value>

</property>



--After include/exclude changes, refresh nodes for changes to take effect



yarn rmadmin -refreshNodes

--Support graceful decommission of nodemanager


https://issues.apache.org/jira/browse/YARN-914

Puppet Installation

--Public DNS
ec2-10-10-10-10.compute-1.amazonaws.com
ec2-20-20-20-20.compute-1.amazonaws.com
ec2-30-30-30-30.compute-1.amazonaws.com
ec2-40-40-40-40.compute-1.amazonaws.com

--Private DNS
ip-50-50-50-50.ec2.internal
ip-60-60-60-60.ec2.internal
ip-70-70-70-70.ec2.internal
ip-80-80-80-80.ec2.internal

--Download Puppet on all nodes
wget https://apt.puppetlabs.com/puppet5-release-xenial.deb
sudo dpkg -i puppet5-release-xenial.deb
sudo apt update

--Install puppetserver on Puppet Master

sudo apt-get install puppetserver

--Install agents on 3 nodes

sudo apt-get install puppet-agent

--Update Puppet configuration

sudo vi /etc/puppetlabs/puppet/puppet.conf

server=ip-50-50-50-50.ec2.internal

runinterval = 1800

--Start Puppet server
sudo service puppetserver start

--Start Puppet agent
sudo /opt/puppetlabs/bin/puppet resource service puppet ensure=running enable=true

--Sign Certificates

sudo /opt/puppetlabs/bin/puppet cert list

sudo /opt/puppetlabs/bin/puppet cert sign ip-60-60-60-60.ec2.internal
sudo /opt/puppetlabs/bin/puppet cert sign ip-70-70-70-70.ec2.internal
sudo /opt/puppetlabs/bin/puppet cert sign ip-80-80-80-80.ec2.internal

Puppet Concepts with Tomcat Installation

--Private DNS
ip-10-10-10-10.ec2.internal (Master)
ip-20-20-20-20.ec2.internal (Agent 1)
ip-30-30-30-30.ec2.internal (Agent 2)
ip-40-40-40-40.ec2.internal (Agent 3)

--Create module tomcat
cd /etc/puppetlabs/code/environments/production/modules

mkdir tomcat

--Create manifest folder

cd tomcat

mkdir manifests

--Create init.pp (class name should match the name of the module)

vi init.pp

class tomcat {
}

--Create install.pp

vi install.pp

class tomcat::install {

 package{'tomcat8':
  ensure => installed
 } 
 
 package{'tomcat8-admin':
  ensure => installed
 }
 
}

--Create start.pp

vi start.pp

class tomcat::start{

 service{'tomcat8' :
  ensure => running
 }
 
}

--Node assignments

cd /etc/puppetlabs/code/environments/production/manifests

vi tomcat.pp

node 'ip-20-20-20-20.ec2.internal' {
 include tomcat::install
 include tomcat::start
}

--To run an agent "pull" on an as-needed basis

sudo /opt/puppetlabs/bin/puppet agent --test

--Create config.pp

cd /etc/puppetlabs/code/environments/production/modules/tomcat/manifests

vi config.pp

class tomcat::config {
 
 file { '/etc/tomcat8/tomcat-users.xml':
  source   => 'puppet:///modules/tomcat/tomcat-users.xml',
  owner    => 'tomcat8', 
  group    => 'tomcat8', 
  mode     => '0600',
  notify   => Service['tomcat8'] 
 }
 
}


--Copy tomcat-users.xml into the module's files directory (/etc/puppetlabs/code/environments/production/modules/tomcat/files) so that the puppet:///modules/tomcat/tomcat-users.xml source resolves


--Add config to tomcat.pp

cd /etc/puppetlabs/code/environments/production/manifests

vi tomcat.pp

node 'ip-20-20-20-20.ec2.internal' {
 include tomcat::install
 include tomcat::config
 include tomcat::start
}

Hadoop Working with HDFS

### LOCAL FILE SYSTEM ###

ls
mkdir
cp
mv
rm

### LISTING ROOT DIRECTORY ###

hadoop fs -ls /

### LISTING DEFAULTS TO HOME DIRECTORY ###

hadoop fs -ls

hadoop fs -ls /user/deepakdubey

### CREATE A DIRECTORY IN HDFS ###

hadoop fs -mkdir hadoop-file-system-test1

### COPY FROM LOCAL FS TO HDFS ###

hadoop fs -copyFromLocal  /deepakdubey-starterkit/hdfs/commands/stocks-exchange.cvs hadoop-file-system-test1


### COPY FROM HDFS TO LOCAL FS ###

hadoop fs -copyToLocal hadoop-file-system-test1/stocks-exchange.cvs .

hadoop fs -ls hadoop-file-system-test1

### CREATE 2 MORE DIRECTORIES ###

hadoop fs -mkdir hadoop-file-system-test2
hadoop fs -mkdir hadoop-file-system-test3

### COPY A FILE FROM ONE FOLDER TO ANOTHER ###

hadoop fs -cp hadoop-file-system-test1/stocks-exchange.cvs hadoop-file-system-test2

### MOVE A FILE FROM ONE FOLDER TO ANOTHER ###

hadoop fs -mv hadoop-file-system-test1/stocks-exchange.cvs hadoop-file-system-test3

### CHECK REPLICATION ###

hadoop fs -ls hadoop-file-system-test3

### CHANGE OR SET REPLICATION FACTOR ###

hadoop fs -Ddfs.replication=2 -cp hadoop-file-system-test2/stocks-exchange.cvs hadoop-file-system-test2/test-with-replication-factor-2.csv

hadoop fs -ls hadoop-file-system-test2

hadoop fs -ls hadoop-file-system-test2/test-with-replication-factor-2.csv
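
### CHANGE REPLICATION OF AN EXISTING FILE (a sketch; -w waits for the target replication to be reached) ###

hadoop fs -setrep -w 2 hadoop-file-system-test2/test-with-replication-factor-2.csv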

### CHANGING PERMISSIONS ###

hadoop fs -chmod 777 hadoop-file-system-test2/test-with-replication-factor-2.csv

### FILE SYSTEM CHECK - REQUIRES ADMIN PRIVILEGES ###

sudo -u hdfs hdfs fsck /user/deepakdubey/hadoop-file-system-test2 -files -blocks -locations

sudo -u hdfs hdfs fsck /user/deepakdubey/hadoop-file-system-test3 -files -blocks -locations

sudo -u hdfs hdfs fsck /user/ubuntu/input/yelp/academic_dataset_review.json -files -blocks -locations

vi /etc/hadoop/conf/hdfs-site.xml

/data/1/dfs/dn/current/BP-2125152513-172.31.45.216-1410037307133/current/finalized


### DELETE DIR/FILES IN HDFS ###


hadoop fs -rm hadoop-file-system-test2/test-with-replication-factor-2.csv


hadoop fs -rm -r hadoop-file-system-test1

hadoop fs -rm -r hadoop-file-system-test2

hadoop fs -rm -r hadoop-file-system-test3

Hadoop Kernel Level Tuning

--Disk Swapping

--Look up existing values

cd /proc/sys/vm

cat swappiness

sysctl vm.swappiness=0

--To make swappiness permanent

vi /etc/sysctl.conf
vm.swappiness = 0

--Memory Allocation (Over Commit)

--Heuristic overcommit (default) - obvious overcommits of address space are refused
vm.overcommit_memory=0

--Always approve all memory requests
vm.overcommit_memory=1

--Strict accounting - deny memory requests over SWAP + (vm.overcommit_ratio percent of RAM)
vm.overcommit_memory=2

--With 1 GB RAM and vm.overcommit_ratio=50, the commit limit is SWAP + 0.5 GB
vm.overcommit_ratio=50

--Look up existing values

cd /proc/sys/vm

cat overcommit_memory 
cat overcommit_ratio

--To change value online

sysctl vm.overcommit_memory=1

--To make the change permanent

vi /etc/sysctl.conf
vm.overcommit_memory=1 
vm.overcommit_ratio=50
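
--To load the values from /etc/sysctl.conf without a reboot and verify the result (a sketch)

sudo sysctl -p

sysctl vm.swappiness vm.overcommit_memory vm.overcommit_ratio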

Cluster Installation with Apache Ambari

--Ambari system requirements

https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.1/bk_Installing_HDP_AMB/content/_operating_systems_requirements.html

Ambari 1

ec2-10-10-10-10.compute-1.amazonaws.com
ip-20-20-20-20.ec2.internal

Ambari 2

ec2-30-30-30-30.compute-1.amazonaws.com
ip-40-40-40-40.ec2.internal

Ambari 3

ec2-50-50-50-50.compute-1.amazonaws.com
ip-60-60-60-60.ec2.internal

Ambari 4

ec2-70-70-70-70.compute-1.amazonaws.com
ip-80-80-80-80.ec2.internal


--Setup id_rsa on all nodes
cd .ssh
vi id_rsa
sudo chown ubuntu:ubuntu id_rsa
chmod 600 id_rsa


--Copy the downloaded JDK to other instances
scp jdk-8u131-linux-x64.tar.gz ip-40-40-40-40.ec2.internal:/home/ubuntu/
scp jdk-8u131-linux-x64.tar.gz ip-60-60-60-60.ec2.internal:/home/ubuntu/
scp jdk-8u131-linux-x64.tar.gz ip-80-80-80-80.ec2.internal:/home/ubuntu/

--Untar JDK
tar -xvf jdk-8u131-linux-x64.tar.gz

sudo mkdir -p /usr/lib/jvm
sudo mv ./jdk1.8.0_131 /usr/lib/jvm/

sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.8.0_131/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.8.0_131/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.8.0_131/bin/javaws" 1

--Set permissions
sudo chmod a+x /usr/bin/java 
sudo chmod a+x /usr/bin/javac 
sudo chmod a+x /usr/bin/javaws
sudo chown -R root:root /usr/lib/jvm/jdk1.8.0_131

--Set JAVA_HOME (on all nodes) in /etc/environment
sudo vi /etc/environment
JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131


--Ambari repository
sudo wget -nv http://public-repo-1.hortonworks.com/ambari/ubuntu14/2.x/updates/2.2.2.0/ambari.list -O /etc/apt/sources.list.d/ambari.list

sudo apt-get update

--Install ambari-server on one node only
sudo apt-get install ambari-server

--Setup Ambari server
sudo ambari-server setup

sudo ambari-server start
sudo ambari-server status


http://ec2-10-10-10-10.compute-1.amazonaws.com:8080
admin/admin

--Verification

sudo -u hdfs hadoop fs -mkdir /user/ubuntu
sudo -u hdfs hadoop fs -chown ubuntu:ubuntu /user/ubuntu


ubuntu@ip-172-31-44-14:~$ hadoop fs -mkdir input
ubuntu@ip-172-31-44-14:~$ hadoop fs -copyFromLocal stocks input

ubuntu@ip-172-31-44-14:~$ hadoop jar MaxClosePrice-1.0.jar com.hirw.maxcloseprice.MaxClosePrice input output

Troubleshooting - Namenode Stuck In Safe Mode

--Name node in safe mode
hadoop fs -copyFromLocal test /tmp/
copyFromLocal: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file /tmp/test. Name node is in safe mode.

--From name node web UI or logs
Safe mode is ON. The reported blocks 8900092 needs additional 6476 blocks to reach the threshold 1.0000 of total blocks 8906567

--hdfs-site.xml 
dfs.namenode.safemode.threshold-pct 

####Reason 1 - Loss of datanodes or the cluster running low on resources####

sudo -u hdfs hdfs dfsadmin -report

Safe mode is ON
Configured Capacity: 6807953326080 (6.19 TB)
Present Capacity: 5076746797056 (4.62 TB)
DFS Remaining: 5076745936896 (4.62 TB)
DFS Used: 860160 (840 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (5):
Name: 10.15.230.42:50010 (node2)
Hostname: node2
Rack: /default
Decommission Status : Normal
Configured Capacity: 1361590665216 (1.24 TB)
DFS Used: 172032 (168 KB)
Non DFS Used: 425847939072 (396.60 GB)
DFS Remaining: 935742554112 (871.48 GB)
DFS Used%: 0.00%
DFS Remaining%: 68.72%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 0 (0 B)
Cache Remaining: 4294967296 (4 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%
Xceivers: 2
Last contact: Thu May 07 21:52:10 EDT 2018

Name: 10.15.230.44:50010 (node4)
Hostname: node4
Rack: /default
Decommission Status : Normal
Configured Capacity: 1361590665216 (1.24 TB)
DFS Used: 172032 (168 KB)
Non DFS Used: 219371347968 (204.31 GB)
DFS Remaining: 1142219145216 (1.04 TB)
DFS Used%: 0.00%
DFS Remaining%: 83.89%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 0 (0 B)
Cache Remaining: 4294967296 (4 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%
Xceivers: 2
Last contact: Thu May 07 21:52:10 EDT 2018


sudo -u hdfs hdfs dfsadmin -report

Safe mode is ON
    Configured Capacity: 52710469632 (49.09 GB)
    Present Capacity: 213811200 (203.91 MB)
    DFS Remaining: 0 (0 B)
    DFS Used: 213811200 (203.91 MB)
    DFS Used%: 100.00%
    Under replicated blocks: 39
    Blocks with corrupt replicas: 0
    Missing blocks: 0

####Reason 2 - Block corruption####

hdfs fsck / 
 
Connecting to namenode via http://master:50070
FSCK started by hdfs (auth:SIMPLE) from /10.15.230.22 for path / at Thu May 07 21:56:17 EDT 2018
..
/accumulo/tables/!0/table_info/A00009pl.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073806971
/accumulo/tables/!0/table_info/A00009pl.rf: MISSING 1 blocks of total size 891 B..
/accumulo/tables/!0/table_info/A00009pm.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073806989
/accumulo/tables/!0/table_info/A00009pm.rf: MISSING 1 blocks of total size 891 B..
/accumulo/tables/!0/table_info/F00009pn.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073807006
............................................
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jaspic_1.0_spec-1.0.jar: MISSING 1 blocks of total size 30548 B..
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jta_1.1_spec-1.1.1.jar: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073743585
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jta_1.1_spec-1.1.1.jar: MISSING 1 blocks of total size 16030 B..
/user/oozie/share/lib/lib_20180408141046/sqoop/groovy-all-2.1.6.jar: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073743592
............................................
/tmp/logs/fincalc/logs/application_1430405794825_0003/DAST-node5_8041: MISSING 9 blocks of total size 1117083218 B..
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800217
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800222
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800230

Total size:    154126314807 B (Total open files size: 186 B)
 Total dirs:    3350
 Total files:   1790
 Total symlinks:                0 (Files currently being written: 2)
 Total blocks (validated):      2776 (avg. block size 55521006 B) (Total open file blocks (not validated): 2)
  ********************************
  CORRUPT FILES:        1764
  MISSING BLOCKS:       2776
  MISSING SIZE:         154126314807 B
  CORRUPT BLOCKS:       2776
  ********************************
 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     0.0
 Corrupt blocks:                2776
 Missing replicas:              0
 Number of data-nodes:          5
 Number of racks:               1
FSCK ended at Thu May 07 21:56:18 EDT 2018 in 516 milliseconds

The filesystem under path '/' is CORRUPT 


hdfs fsck / -list-corruptfileblocks

Connecting to namenode via http://ec2-80-80-80-80.compute-1.amazonaws.com:50070
The filesystem under path '/' has 0 CORRUPT files

hdfs@ip-172-31-45-216:~$ hdfs fsck /user/hirw/input/stocks -files -blocks -locations

Connecting to namenode via http://ec2-80-80-80-80.compute-1.amazonaws.com:50070
FSCK started by hdfs (auth:SIMPLE) from /172.31.45.216 for path /user/hirw/input/stocks at Mon Sep 18 11:22:00 UTC 2017
/user/hirw/input/stocks <dir>
/user/hirw/input/stocks/stocks 428223209 bytes, 4 block(s):  OK
0. BP-2125152513-172.31.45.216-1410037307133:blk_1074178780_437980 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
1. BP-2125152513-172.31.45.216-1410037307133:blk_1074178781_437981 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
2. BP-2125152513-172.31.45.216-1410037307133:blk_1074178782_437982 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
3. BP-2125152513-172.31.45.216-1410037307133:blk_1074178783_437983 len=25570025 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]

Status: HEALTHY


$ hdfs fsck /user/hadooptest/test1 -locations -blocks -files
FSCK started by hadoop (auth:SIMPLE) from /10-10-10-10 for path /user/hadooptest/test1 at Thu Dec 15 23:32:30 PST 2016
/user/hadooptest/test1 339281920 bytes, 3 block(s): 
/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741830

/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741831

/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741832
 MISSING 3 blocks of total size 339281920 B
0. BP-762523015-10-10-10-10-1480061879099:blk_1073741830_1006 len=134217728 MISSING!
1. BP-762523015-10-10-10-10-1480061879099:blk_1073741831_1007 len=134217728 MISSING!
2. BP-762523015-10-10-10-10-1480061879099:blk_1073741832_1008 len=70846464 MISSING!


--Leave safe mode and delete the file
hdfs dfsadmin -safemode leave

hadoop fs -rm /user/hadooptest/test1
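
--If many files are affected and the data is not recoverable, corrupt files can also be removed in bulk with fsck (a sketch - use with care, this permanently deletes the listed files)

hdfs fsck / -delete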