Kubernetes Reference
Minikube
minikube start
minikube stop
minikube delete
minikube docker-env
minikube ip
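A typical local workflow looks like the following sketch; pointing the shell's Docker client at Minikube's daemon with docker-env is optional and assumes a Docker-based setup:
minikube start
minikube ip
eval $(minikube docker-env)
minikube stop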
---
Kubectl
kubectl get all
Pods, ReplicaSets, Deployments and Services
kubectl apply -f <yaml file>
kubectl apply -f .
kubectl describe pod <name of pod>
kubectl exec -it <pod name> <command>
kubectl get <pod | po | service | svc | rs | replicaset | deployment | deploy>
kubectl get po --show-labels
kubectl get po --show-labels -l {name}={value}
kubectl delete po <pod name>
kubectl delete po --all
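As a concrete sketch, apply a manifest and then filter the pods it creates by label; the file name my-deployment.yaml and the app=web label are illustrative placeholders:
kubectl apply -f my-deployment.yaml
kubectl get deploy
kubectl get po --show-labels -l app=web
kubectl describe pod <name of pod>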
---
Deployment Management
kubectl rollout status deploy <name of deployment>
kubectl rollout history deploy <name of deployment>
kubectl rollout undo deploy <name of deployment>
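For example, rolling a hypothetical deployment named web back to a specific revision (the revision number comes from the history output):
kubectl rollout history deploy web
kubectl rollout undo deploy web --to-revision=2
kubectl rollout status deploy web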
Docker Reference
Manage images
docker image pull <image name>
docker image ls
docker image build -t <image name> .
docker image push <image name>
docker image tag <image id> <tag name>
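A typical build-and-push sequence might look like this; the repository and tag myrepo/myapp:1.0 are placeholders for your own registry:
docker image build -t myrepo/myapp:1.0 .
docker image ls
docker image push myrepo/myapp:1.0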
---
Manage Containers
docker container run -p <public port>:<container port> <image name>
docker container ls -a
docker container stop <container id>
docker container start <container id>
docker container rm <container id>
docker container prune
docker container run -it <image name>
docker container run -d <image name>
docker container exec -it <container id> <command>
docker container exec -it <container id> bash
docker container logs -f <container id>
docker container commit -a "author" <container id> <image name>
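For instance, running a throwaway nginx container with a published port, following its logs and opening a shell (the nginx image is only an example):
docker container run -d -p 8080:80 nginx
docker container ls -a
docker container logs -f <container id>
docker container exec -it <container id> bash
docker container stop <container id>
docker container rm <container id>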
---
Manage your (local) Virtual Machine
docker-machine ip
---
Manage Networks
docker network ls
docker network create <network name>
---
Manage Volumes
docker volume ls
docker volume prune
docker volume inspect <volume name>
docker volume rm <volume name>
---
Docker Compose
docker-compose up
docker-compose up -d
docker-compose logs -f <service name>
docker-compose down
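A minimal sketch of a compose file written from the shell and brought up in the background; the single nginx service and port 8080 are assumptions for illustration:
cat > docker-compose.yml <<'EOF'
version: "3"
services:
  web:
    image: nginx
    ports:
      - "8080:80"
EOF
docker-compose up -d
docker-compose logs -f web
docker-compose down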
---
Manage a Swarm
docker swarm init (--advertise-addr <ip address>)
docker service create <args>
docker network create --driver overlay <name>
docker service ls
docker node ls
docker service logs -f <service name>
docker service ps <service name>
docker swarm join-token <worker|manager>
---
Manage Stacks
docker stack ls
docker stack deploy -c <compose file> <stack name>
docker stack rm <stack name>
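The same version 3 compose file can also be deployed as a stack on a swarm; the stack name mystack is illustrative (add --advertise-addr to swarm init if the host has several network interfaces):
docker swarm init
docker stack deploy -c docker-compose.yml mystack
docker stack ls
docker service ls
docker stack rm mystack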
ElasticSearch PUT and GET data
--Create an index in Elasticsearch
PUT http://host-1:9200/my_index
{
"settings" : {
"number_of_shards" : 3,
"number_of_replicas" : 1
}
}
--To get information about an index
GET http://host-1:9200/my_index
--Add user to index with id 1
POST http://host-1:9200/my_index/user/1
{
"name": "Deepak",
"age": 36,
"department": "IT",
"address": {
"street": "No.123, XYZ street",
"city": "Singapore",
"country": "Singapore"
}
}
--To fetch document with id 1
GET http://host-1:9200/my_index/user/1
--Add user to index with id 2
POST http://host-1:9200/my_index/user/2
{
"name": "McGiven",
"age": 30,
"department": "Finance"
}
--Add user to index with id 3
POST http://host-1:9200/my_index/user/3
{
"name": "Watson",
"age": 30,
"department": "HR",
"address": {
"street": "No.123, XYZ United street",
"city": "Singapore",
"country": "Singapore"
}
}
--Search documents by name
GET http://host-1:9200/my_index/user/_search?q=name:watson
--Delete an index
DELETE http://host-1:9200/my_index
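The same requests can be sent from a shell with curl, for example (request bodies as above):
curl -X PUT "http://host-1:9200/my_index" -H 'Content-Type: application/json' -d '{"settings":{"number_of_shards":3,"number_of_replicas":1}}'
curl -X GET "http://host-1:9200/my_index/user/_search?q=name:watson"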
Hadoop Submitting a MapReduce Job
HDFS Input Location:
/user/deepakdubey/input/stocks
HDFS Output Location (relative to user):
output/mapreduce/stocks
Delete Output Directory:
hadoop fs -rm -r output/mapreduce/stocks
Submit Job:
hadoop jar /deepakdubey/mapreduce/stocks/MaxClosePriceByStock-1.0.jar com.deepakdubey.MaxClosePriceByStock /user/deepakdubey/input/stocks output/mapreduce/stocks
View Result:
hadoop fs -cat output/mapreduce/stocks/part-r-00000
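Once submitted, the job can be tracked from the shell; the application id placeholder below comes from the yarn application -list output:
yarn application -list
yarn logs -applicationId <application id>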
Caching Data in Spark
symvol.cache()
symvol.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)
symvol.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
symvol.persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK)
symvol.unpersist()
--caching symvol
val stocks = sc.textFile("hdfs://ip-10-10-10-10.ec2.internal:8020/user/deepakdubey/input/stocks")
val splits = stocks.map(record => record.split(","))
val symvol = splits.map(arr => (arr(1), arr(7).toInt))
symvol.cache()
val maxvol = symvol.reduceByKey((vol1, vol2) => Math.max(vol1, vol2))
maxvol.collect().foreach(println)
Spark Get Maximum Volume by Stock
val stocks = sc.textFile("hdfs://ip-10-10-10-10.ec2.internal:8020/user/deepakdubey/input/stocks")
val splits = stocks.map(record => record.split(","))
val symvol = splits.map(arr => (arr(1), arr(7).toInt))
val maxvol = symvol.reduceByKey((vol1, vol2) => Math.max(vol1, vol2))
maxvol.collect().foreach(println)
--Spark shell
spark-shell --master yarn
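The shell can also be launched with explicit resources when the cluster allows it; the executor count and memory below are illustrative values:
spark-shell --master yarn --num-executors 4 --executor-memory 2G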
Hive Loading Tables
### CREATE A TABLE FOR STOCKS ###
hive> CREATE TABLE IF NOT EXISTS stocks (
exch STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ('creator'='deepakdubey', 'created_on' = '2019-02-16', 'description'='This table holds stocks data!!!');
### DESCRIBE TABLE TO GET DETAILS ABOUT TABLE ###
hive> DESCRIBE FORMATTED stocks;
### COPY THE STOCKS DATASET TO HDFS ###
hadoop fs -copyFromLocal /deepakdubey/input/stocks-dataset/stocks/* input/hive/stocks_db
hadoop fs -ls input/hive/stocks_db
hive> !hadoop fs -ls input/hive/stocks_db;
### LOAD DATASET USING LOAD COMMAND ###
hive> LOAD DATA INPATH 'input/hive/stocks_db'
INTO TABLE stocks;
hive> !hadoop fs -ls input/hive/stocks_db;
hive> DESCRIBE FORMATTED stocks;
hive> !hadoop fs -ls /user/hive/warehouse/stocks_db.db/stocks;
hive> SELECT * FROM stocks;
### LOAD DATASET USING CTAS ###
hive> CREATE TABLE stocks_ctas
AS
SELECT * FROM stocks;
hive> DESCRIBE FORMATTED stocks_ctas;
hive> !hadoop fs -ls /user/hive/warehouse/stocks_db.db/stocks_ctas;
### LOAD DATASET USING INSERT..SELECT ###
hive> INSERT INTO TABLE stocks_ctas
SELECT s.* FROM stocks s;
hive> !hadoop fs -ls /user/hive/warehouse/stocks_db.db/stocks_ctas;
hive> SELECT * FROM stocks_ctas;
### LOAD DATASET USING INSERT OVERWRITE ###
hive> INSERT OVERWRITE TABLE stocks_ctas
SELECT s.* FROM stocks s;
hive> !hadoop fs -ls /user/hive/warehouse/stocks_db.db/stocks_ctas;
hadoop fs -copyFromLocal /home/cloudera/deepakdubey/input/stocks_db/stocks/* input/stocks_db/stocks
hadoop fs -ls input/stocks_db/stocks
### LOCATION ATTRIBUTE & LOADING DATA ###
hadoop fs -copyFromLocal /deepakdubey/input/stocks-dataset/stocks/* input/hive/stocks_db
hadoop fs -ls input/hive/stocks_db
hive> CREATE TABLE IF NOT EXISTS stocks_loc (
exch STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/deepakdubey/input/hive/stocks_db'
TBLPROPERTIES ('creator'='deepakdubey', 'created_on' = '2019-02-16', 'description'='This table holds stocks data!!!');
hive> DESCRIBE FORMATTED stocks_loc;
hive> SELECT * FROM stocks_loc;
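The statements above can also be kept in a script and run non-interactively; the file name load_stocks.hql is a placeholder:
hive -f load_stocks.hql
hive -e "SELECT COUNT(*) FROM stocks_db.stocks;"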
Hive Create MANAGED Table
hive> CREATE DATABASE stocks_db;
hive> SHOW DATABASES;
hive> USE stocks_db;
hive> CREATE TABLE IF NOT EXISTS stocks (
exch string,
symbol string,
ymd string,
price_open float,
price_high float,
price_low float,
price_close float,
volume int,
price_adj_close float)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> DESCRIBE FORMATTED stocks;
hive> DROP DATABASE stocks_db;
hive> DROP TABLE stocks;
hive> DROP DATABASE stocks_db CASCADE;
Hive Create EXTERNAL Table
hive> CREATE DATABASE stocks_db;
hive> USE stocks_db;
hive> CREATE EXTERNAL TABLE IF NOT EXISTS stocks_tb (
exch STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/deepakdubey/input/stocks';
hive> SELECT * FROM stocks_tb LIMIT 100;
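Dropping an EXTERNAL table removes only the table metadata; a quick sketch to confirm the underlying HDFS files survive:
hive> DROP TABLE stocks_tb;
hadoop fs -ls /user/deepakdubey/input/stocks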
Hadoop Cluster - Stop Start Restart
--Stop, Start and Restart Datanode
sudo service hadoop-hdfs-datanode stop
sudo service hadoop-hdfs-datanode start
sudo service hadoop-hdfs-datanode restart
--Stop, Start and Restart Node Manager
sudo service hadoop-yarn-nodemanager stop
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-yarn-nodemanager restart
--Stop, Start and Restart Resource Manager
sudo service hadoop-yarn-resourcemanager stop
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-resourcemanager restart
--Restart all HDFS services on a node
for x in `cd /etc/init.d ; ls hadoop-hdfs*` ; do sudo service $x restart ; done
--Restart all YARN services on a node
for x in `cd /etc/init.d ; ls hadoop-yarn*` ; do sudo service $x restart ; done
--Restart all Hadoop services on a node
for x in `cd /etc/init.d ; ls hadoop-*` ; do sudo service $x restart ; done
--Start checkpointing operation
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
--Save backup of FSIMAGE
hdfs dfsadmin -fetchImage /tmp/fsimage-bkup
--Start/Stop commands for HDP (Hortonworks Data Platform)
/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start namenode
/usr/hdp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh start datanode
/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh start resourcemanager
/usr/hdp/current/hadoop-yarn-resourcemanager/sbin/yarn-daemon.sh start nodemanager
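Related to the checkpointing steps above: after saveNamespace the NameNode is still in safe mode, so a typical follow-up is to leave it and confirm a service is back up (status is a standard service action; sketch only):
hdfs dfsadmin -safemode leave
sudo service hadoop-hdfs-datanode status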
Pig - Pig Latin Solving A Problem
grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float,
volume:int, adj_close:float);
### FILTERING ONLY RECORDS FROM YEAR 2003 ###
filter_by_yr = FILTER stocks by GetYear(date) == 2003;
### GROUPING RECORDS BY SYMBOL ###
grunt> grp_by_sym = GROUP filter_by_yr BY symbol;
grp_by_sym: {
group: chararray,
filter_by_yr: {
(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)
}
}
### SAMPLE OUTPUT OF GROUP ###
(CASC, { (NYSE,CASC,2003-12-22T00:00:00.000Z,22.02,22.2,21.94,22.09,36700,20.29), (NYSE,CASC,2003-12-23T00:00:00.000Z,22.15,22.15,21.9,22.05,23600,20.26), ....... })
(CATO, { (NYSE,CATO,2003-10-08T00:00:00.000Z,22.48,22.5,22.01,22.06,92000,12.0), (NYSE,CATO,2003-10-09T00:00:00.000Z,21.3,21.59,21.16,21.45,373500,11.67), ....... })
### CALCULATE AVERAGE VOLUME ON THE GROUPED RECORDS ###
avg_volume = FOREACH grp_by_sym GENERATE group, ROUND(AVG(filter_by_yr.volume)) as avgvolume;
### ORDER THE RESULT IN DESCENDING ORDER ###
avg_vol_ordered = ORDER avg_volume BY avgvolume DESC;
### STORE TOP 10 RECORDS ###
top10 = LIMIT avg_vol_ordered 10;
STORE top10 INTO 'output/pig/avg-volume' USING PigStorage(',');
### EXECUTE PIG INSTRUCTIONS AS SCRIPT ###
pig /deepakdubey-workshop/pig/scripts/average-volume.pig
### PASSING PARAMETERS TO SCRIPT ###
pig -param input=/user/deepakdubey/input/stocks -param output=output/pig/avg-volume-params /deepakdubey-workshop/pig/scripts/average-volume-parameters.pig
### RUNNING A PIG SCRIPT LOCALLY. INPUT AND OUTPUT LOCATION ARE POINTING TO LOCAL FILE SYSTEM ###
pig -x local -param input=/deepakdubey-workshop/input/stocks-dataset/stocks -param output=output/stocks /deepakdubey-workshop/pig/scripts/average-volume-parameters.pig
Hadoop HDFS - Getting to know your cluster
--To find the hadoop version
hadoop version
--HDFS report
sudo su - hdfs
hdfs dfsadmin -report
hdfs dfsadmin -report -live
hdfs dfsadmin -report -dead
--Configuration details
hdfs getconf -namenodes
hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.namenode.http-address
hdfs getconf -confKey yarn.resourcemanager.webapp.address
--YARN application details
yarn application -list -appStates ALL
yarn application -list -appStates FAILED
--Other YARN application status
NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
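A couple of related checks that are often useful when sizing up a cluster (sketch):
yarn node -list -all
hdfs getconf -confKey fs.defaultFS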
Pig - Pig Latin Loading Projecting
### LOADING A DATASET ###
grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float,
volume:int, adj_close:float);
### STRUCTURE ###
grunt> DESC stocks;
### PROJECT AND MANIPULATE FEW COLUMNS FROM DATASET ###
grunt> projection = FOREACH stocks GENERATE symbol, SUBSTRING($0, 0, 1) as sub_exch, close - open as up_or_down;
### PRINT RESULT ON SCREEN ###
grunt> DUMP projection;
### STORE RESULT IN HDFS ###
grunt> STORE projection INTO 'output/pig/simple-projection';
### LOAD 1 - WITH NO COLUMN NAMES AND DATATYPES ###
grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',');
### LOAD 2 - WITH COLUMN NAMES BUT NO DATATYPES ###
grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',') as (exchange, symbol, date, open, high, low, close, volume, adj_close);
### LOAD 3 - WITH COLUMN NAMES AND DATATYPES ###
grunt> stocks = LOAD '/user/deepakdubey/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float,
volume:int, adj_close:float);
### TO LOOK UP STRUCTURE OF THE RELATION ###
grunt> DESCRIBE stocks;
### WHEN COLUMN NAMES ARE NOT AVAILABLE ###
grunt> projection = FOREACH stocks GENERATE $1 as symbol, SUBSTRING($0, 0, 1) as sub_exch, $6 - $3 as up_or_down;
Hadoop HDFS - Disk Usage
--Disk space reserved for non HDFS - hdfs-site.xml
dfs.datanode.du.reserved
--HDFS Balancer
sudo -su hdfs
hdfs balancer
hdfs balancer -threshold 5
--Creating user directories
sudo -u hdfs hadoop fs -mkdir /user/deepak.dubey
sudo -u hdfs hadoop fs -chown deepak.dubey:deepak.dubey /user/deepak.dubey
--Disk Usage
hadoop fs -du /user
hadoop fs -du -h /user
hadoop fs -du -h -s /user
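To see overall filesystem capacity alongside per-directory usage:
hadoop fs -df -h /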
Hadoop HDFS - Disk Quotas
--Name Quota
hadoop fs -mkdir /user/ubuntu/quotatest
hdfs dfsadmin -setQuota 5 /user/ubuntu/quotatest
hadoop fs -count -q /user/ubuntu/quotatest
hadoop fs -touchz /user/ubuntu/quotatest/test1
hadoop fs -touchz /user/ubuntu/quotatest/test2
hadoop fs -touchz /user/ubuntu/quotatest/test3
hadoop fs -touchz /user/ubuntu/quotatest/test4
hadoop fs -touchz /user/ubuntu/quotatest/test5
hadoop fs -rm /user/ubuntu/quotatest/test4
hdfs dfsadmin -clrQuota /user/ubuntu/quotatest
hadoop fs -count -q /user/ubuntu/quotatest
--Space Quota
hdfs dfsadmin -setSpaceQuota 300k /user/ubuntu/quotatest
hadoop fs -count -q /user/ubuntu/quotatest
hadoop fs -copyFromLocal hdfs-site.xml /user/ubuntu/quotatest
hdfs dfsadmin -clrSpaceQuota /user/ubuntu/quotatest
Hadoop HDFS - Recovering From Accidental Data Loss
--Trash
sudo vi /etc/hadoop/conf/core-site.xml
<property>
<name>fs.trash.interval</name>
<value>60</value>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>45</value>
</property>
sudo service hadoop-hdfs-namenode restart
--Skip Trash
hadoop fs -rm -skipTrash delete-file2
--Snapshots
hadoop fs -mkdir important-files
hadoop fs -copyFromLocal file1 file2 important-files
hdfs dfs -createSnapshot /user/ubuntu/important-files
--Require admin rights
hdfs dfsadmin -allowSnapshot /user/ubuntu/important-files
--Snapshot creation - demonstrate file add & deletes
hdfs dfs -createSnapshot /user/ubuntu/important-files snapshot1
hadoop fs -ls /user/ubuntu/important-files/.snapshot
hadoop fs -ls /user/ubuntu/important-files/.snapshot/snapshot1
hadoop fs -rm /user/ubuntu/important-files/.snapshot/snapshot1/file1
hadoop fs -rm /user/ubuntu/important-files/file2
hadoop fs -copyFromLocal file3 /user/ubuntu/important-files
hdfs dfs -createSnapshot /user/ubuntu/important-files snapshot2
hadoop fs -ls /user/ubuntu/important-files/.snapshot/snapshot2
hadoop fs -ls /user/ubuntu/important-files
hadoop fs -ls /user/ubuntu/important-files/.snapshot/snapshot1
hadoop fs -cat /user/ubuntu/important-files/.snapshot/snapshot1/file2
hdfs snapshotDiff /user/ubuntu/important-files snapshot2 snapshot1
--Demonstrate file modifications
cat append-file
hadoop fs -appendToFile append-file /user/ubuntu/important-files/file1
hadoop fs -cat /user/ubuntu/important-files/file1
hdfs dfs -createSnapshot /user/ubuntu/important-files snapshot3
hadoop fs -ls /user/ubuntu/important-files/.snapshot/snapshot3
hdfs snapshotDiff /user/ubuntu/important-files snapshot3 snapshot2
hadoop fs -cat /user/ubuntu/important-files/.snapshot/snapshot2/file1
hadoop fs -cat /user/ubuntu/important-files/.snapshot/snapshot3/file1
hadoop fs -cat /user/ubuntu/important-files/file1
--Delete snapshots
hdfs dfs -deleteSnapshot /user/ubuntu/important-files snapshot1
hadoop fs -ls /user/ubuntu/important-files/.snapshot
--Require admin rights
hdfs dfsadmin -disallowSnapshot /user/ubuntu/important-files
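For completeness: before the snapshots were deleted above, an accidentally removed file could be restored simply by copying it back out of a snapshot, e.g.:
hadoop fs -cp /user/ubuntu/important-files/.snapshot/snapshot1/file2 /user/ubuntu/important-files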
Hadoop HDFS Admin - Network Topology Of Cluster
sudo vi /etc/hadoop/conf/core-site.xml
<property>
<name>net.topology.node.switch.mapping.impl</name>
<value>org.apache.hadoop.net.ScriptBasedMapping</value>
</property>
<property>
<name>net.topology.script.file.name</name>
<value>/etc/hadoop/conf/topology.sh</value>
</property>
sudo vi /etc/hadoop/conf/topology.data
10-10-10-10 /rack1
20-20-20-20 /rack1
30-30-30-30 /rack1
40-40-40-40 /rack2
--Topology script will be called when Namenode starts
./etc/hadoop/conf/topology.sh 10-10-10-10 20-20-20-20 30-30-30-30 40-40-40-40
sudo vi /etc/hadoop/conf/topology.sh
#!/bin/bash
HADOOP_CONF=/etc/hadoop/conf
while [ $# -gt 0 ] ; do
nodeArg=$1
exec< ${HADOOP_CONF}/topology.data
result=""
while read line ; do
ar=( $line )
if [ "${ar[0]}" = "$nodeArg" ] ; then
result="${ar[1]}"
fi
done
shift
if [ -z "$result" ] ; then
echo -n "/rack0"
else
echo -n "$result "
fi
done
--IP Address encoded with rack information
xx.xx.rackno.nodeno
50-50-50-50
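Once the script and data file are in place, restarting the NameNode and printing the topology is a quick sanity check (sketch):
sudo service hadoop-hdfs-namenode restart
hdfs dfsadmin -printTopology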
Day to Day Essentials - Adding and Removing Nodes
--Marking Datanode Dead
2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval
dfs.heartbeat.interval - defaults to 3 seconds
dfs.namenode.heartbeat.recheck-interval - defaults to 300,000 milliseconds = 300 seconds
2 * 300 seconds + 10 * 3 seconds = 630 seconds (10 minutes and 30 seconds)
--To commission or decommission datanodes
sudo vi /etc/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.hosts</name>
<value>/etc/hadoop/conf/include</value>
</property>
<property>
<name>dfs.hosts.exclude</name>
<value>/etc/hadoop/conf/exclude</value>
</property>
sudo vi /etc/hadoop/conf/include
sudo vi /etc/hadoop/conf/exclude
--After include/exclude changes, refresh nodes for changes to take effect
hdfs dfsadmin -refreshNodes
http://ec2-10-10-10-10.compute-1.amazonaws.com:50070/dfshealth.html
--To commission or decommission nodemanagers
sudo vi /etc/hadoop/conf/yarn-site.xml
<property>
<name>yarn.resourcemanager.nodes.include-path</name>
<value>/etc/hadoop/conf/include</value>
</property>
<property>
<name>yarn.resourcemanager.nodes.exclude-path</name>
<value>/etc/hadoop/conf/exclude</value>
</property>
--After include/exclude changes, refresh nodes for changes to take effect
yarn rmadmin -refreshNodes
--Support graceful decommission of nodemanager
https://issues.apache.org/jira/browse/YARN-914
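While a datanode listed in the exclude file drains, its decommissioning progress can be watched with:
hdfs dfsadmin -report -decommissioning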
Puppet Installation
--Public DNS
ec2-10-10-10-10.compute-1.amazonaws.com
ec2-20-20-20-20.compute-1.amazonaws.com
ec2-30-30-30-30.compute-1.amazonaws.com
ec2-40-40-40-40.compute-1.amazonaws.com
--Private DNS
ip-50-50-50-50.ec2.internal
ip-60-60-60-60.ec2.internal
ip-70-70-70-70.ec2.internal
ip-80-80-80-80.ec2.internal
--Download Puppet on all nodes
wget https://apt.puppetlabs.com/puppet5-release-xenial.deb
sudo dpkg -i puppet5-release-xenial.deb
sudo apt update
--Install puppetserver on Puppet Master
sudo apt-get install puppetserver
--Install agents on 3 nodes
sudo apt-get install puppet-agent
--Update Puppet configuration
sudo vi /etc/puppetlabs/puppet/puppet.conf
server=ip-50-50-50-50.ec2.internal
runinterval = 1800
--Start Puppet server
sudo service puppetserver start
--Start Puppet agent
sudo /opt/puppetlabs/bin/puppet resource service puppet ensure=running enable=true
--Sign Certificates
sudo /opt/puppetlabs/bin/puppet cert list
sudo /opt/puppetlabs/bin/puppet cert sign ip-60-60-60-60.ec2.internal
sudo /opt/puppetlabs/bin/puppet cert sign ip-70-70-70-70.ec2.internal
sudo /opt/puppetlabs/bin/puppet cert sign ip-80-80-80-80.ec2.internal
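Once the certificates are signed, a first manual run from each agent confirms it can reach the master (same binary path as above):
sudo /opt/puppetlabs/bin/puppet agent --test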
Puppet Concepts with Tomcat Installation
--Private DNS
ip-10-10-10-10.ec2.internal (Master)
ip-20-20-20-20.ec2.internal (Agent 1)
ip-30-30-30-30.ec2.internal (Agent 2)
ip-40-40-40-40.ec2.internal (Agent 3)
--Create module tomcat
cd /etc/puppetlabs/code/environments/production/modules
mkdir tomcat
--Create manifest folder
cd tomcat
mkdir manifests
--Create init.pp (class name should match the name of the module)
vi init.pp
class tomcat {
}
--Create install.pp
vi install.pp
class tomcat::install {
package{'tomcat8':
ensure => installed
}
package{'tomcat8-admin':
ensure => installed
}
}
--Create start.pp
vi start.pp
class tomcat::start{
service{'tomcat8' :
ensure => running
}
}
--Node assignments
cd /etc/puppetlabs/code/environments/production/manifests
vi tomcat.pp
node 'ip-20-20-20-20.ec2.internal' {
include tomcat::install
include tomcat::start
}
--To run an agent "pull" on an as-needed basis
sudo /opt/puppetlabs/bin/puppet agent --test
--Create config.pp
cd /etc/puppetlabs/code/environments/production/modules/tomcat/manifests
vi config.pp
class tomcat::config {
file { '/etc/tomcat8/tomcat-users.xml':
source => 'puppet:///modules/tomcat/tomcat-users.xml',
owner => 'tomcat8',
group => 'tomcat8',
mode => '0600',
notify => Service['tomcat8']
}
}
--Copy tomcat-users.xml file
--Add config to tomcat.pp
cd /etc/puppetlabs/code/environments/production/manifests
vi tomcat.pp
node 'ip-20-20-20-20.ec2.internal' {
include tomcat::install
include tomcat::config
include tomcat::start
}
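Before triggering an agent run, the manifests can be syntax-checked; note that puppet parser validate only checks syntax, not behaviour:
--Validate manifest syntax (on the master)
sudo /opt/puppetlabs/bin/puppet parser validate init.pp install.pp start.pp config.pp
--Trigger a run on the agent node
sudo /opt/puppetlabs/bin/puppet agent --test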
Hadoop Working with HDFS
### LOCAL FILE SYSTEM ###
ls
mkdir
cp
mv
rm
### LISTING ROOT DIRECTORY ###
hadoop fs -ls /
### LISTING DEFAULT TO HOME DIRECTORY ###
hadoop fs -ls
hadoop fs -ls /user/deepakdubey
### CREATE A DIRECTORY IN HDFS ###
hadoop fs -mkdir hadoop-file-system-test1
### COPY FROM LOCAL FS TO HDFS ###
hadoop fs -copyFromLocal /deepakdubey-starterkit/hdfs/commands/stocks-exchange.cvs hadoop-file-system-test1
### COPY FROM HDFS TO LOCAL FS ###
hadoop fs -copyToLocal hadoop-file-system-test1/stocks-exchange.cvs .
hadoop fs -ls hadoop-file-system-test1
### CREATE 2 MORE DIRECTORIES ###
hadoop fs -mkdir hadoop-file-system-test2
hadoop fs -mkdir hadoop-file-system-test3
### COPY A FILE FROM ONE FOLDER TO ANOTHER ###
hadoop fs -cp hadoop-file-system-test1/stocks-exchange.cvs hadoop-file-system-test2
### MOVE A FILE FROM ONE FOLDER TO ANOTHER ###
hadoop fs -mv hadoop-file-system-test1/stocks-exchange.cvs hadoop-file-system-test3
### CHECK REPLICATION ###
hadoop fs -ls hadoop-file-system-test3
### CHANGE OR SET REPLICATION FACTOR ###
hadoop fs -Ddfs.replication=2 -cp hadoop-file-system-test2/stocks-exchange.cvs hadoop-file-system-test2/test-with-replication-factor-2.csv
hadoop fs -ls hadoop-file-system-test2
hadoop fs -ls hadoop-file-system-test2/test-with-replication-factor-2.csv
### CHANGING PERMISSIONS ###
hadoop fs -chmod 777 hadoop-file-system-test2/test-with-replication-factor-2.csv
### FILE SYSTEM CHECK - REQUIRES ADMIN PRIVILEGES ###
sudo -u hdfs hdfs fsck /user/deepakdubey/hadoop-file-system-test2 -files -blocks -locations
sudo -u hdfs hdfs fsck /user/deepakdubey/hadoop-file-system-test3 -files -blocks -locations
sudo -u hdfs hdfs fsck /user/ubuntu/input/yelp/academic_dataset_review.json -files -blocks -locations
vi /etc/hadoop/conf/hdfs-site.xml
/data/1/dfs/dn/current/BP-2125152513-172.31.45.216-1410037307133/current/finalized
### DELETE DIR/FILES IN HDFS ###
hadoop fs -rm hadoop-file-system-test2/test-with-replication-factor-2.csv
hadoop fs -rm -r hadoop-file-system-test1
hadoop fs -rm -r hadoop-file-system-test2
hadoop fs -rm -r hadoop-file-system-test3
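For reference, replication of an existing file can also be changed in place with setrep; -w waits until the new factor is reached (the path is illustrative):
hadoop fs -setrep -w 2 hadoop-file-system-test2/test-with-replication-factor-2.csv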
Hadoop Kernel Level Tuning
--Disk Swapping
--Look up existing value
cd /proc/sys/vm
cat swappiness
--To change the value online
sysctl vm.swappiness=0
--To make swappiness permanent
vi /etc/sysctl.conf
vm.swappiness = 0
--Memory Allocation (Over Commit)
--Heuristic overcommit (default)
vm.overcommit_memory=0
--Always approve memory requests
vm.overcommit_memory=1
--Deny memory requests beyond SWAP + vm.overcommit_ratio percent of RAM
vm.overcommit_memory=2
--With vm.overcommit_ratio=50, the commit limit is SWAP + half of RAM
vm.overcommit_ratio=50
--Look up existing values
cd /proc/sys/vm
cat overcommit_memory
cat overcommit_ratio
--To change value online
sysctl vm.overcommit_memory=1
--To make the change permanent
vi /etc/sysctl.conf
vm.overcommit_memory=1
vm.overcommit_ratio=50
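After editing /etc/sysctl.conf, the values can be reloaded and verified without a reboot (sketch):
sudo sysctl -p
sysctl vm.swappiness vm.overcommit_memory vm.overcommit_ratio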
Cluster Installation with Apache Ambari
--Ambari system requirements
https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.1/bk_Installing_HDP_AMB/content/_operating_systems_requirements.html
Ambari 1
ec2-10-10-10-10.compute-1.amazonaws.com
ip-20-20-20-20.ec2.internal
Ambari 2
ec2-30-30-30-30.compute-1.amazonaws.com
ip-40-40-40-40.ec2.internal
Ambari 3
ec2-50-50-50-50.compute-1.amazonaws.com
ip-60-60-60-60.ec2.internal
Ambari 4
ec2-70-70-70-70.compute-1.amazonaws.com
ip-80-80-80-80.ec2.internal
--Setup id_rsa on all nodes
cd .ssh
vi id_rsa
sudo chown ubuntu:ubuntu id_rsa
chmod 600 id_rsa
--Copy the downloaded JDK to other instances
scp jdk-8u131-linux-x64.tar.gz ip-40-40-40-40.ec2.internal:/home/ubuntu/
scp jdk-8u131-linux-x64.tar.gz ip-60-60-60-60.ec2.internal:/home/ubuntu/
scp jdk-8u131-linux-x64.tar.gz ip-80-80-80-80.ec2.internal:/home/ubuntu/
--Untar JDK
tar -xvf jdk-8u131-linux-x64.tar.gz
sudo mkdir -p /usr/lib/jvm
sudo mv ./jdk1.8.0_131 /usr/lib/jvm/
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.8.0_131/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.8.0_131/bin/javac" 1
sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.8.0_131/bin/javaws" 1
--Set permissions
sudo chmod a+x /usr/bin/java
sudo chmod a+x /usr/bin/javac
sudo chmod a+x /usr/bin/javaws
sudo chown -R root:root /usr/lib/jvm/jdk1.8.0_131
--Set JAVA_HOME (on all nodes) in /etc/environment
sudo vi /etc/environment
JAVA_HOME=/usr/lib/jvm/jdk1.8.0_131
--Ambari repository
sudo wget -nv http://public-repo-1.hortonworks.com/ambari/ubuntu14/2.x/updates/2.2.2.0/ambari.list -O /etc/apt/sources.list.d/ambari.list
sudo apt-get update
--Install ambari server on only one node
sudo apt-get install ambari-server
--Setup Ambari server
sudo ambari-server setup
sudo ambari-server start
sudo ambari-server status
http://ec2-10-10-10-10.compute-1.amazonaws.com:8080
admin/admin
--Verification
sudo -u hdfs hadoop fs -mkdir /user/ubuntu
sudo -u hdfs hadoop fs -chown ubuntu:ubuntu /user/ubuntu
ubuntu@ip-172-31-44-14:~$ hadoop fs -mkdir input
ubuntu@ip-172-31-44-14:~$ hadoop fs -copyFromLocal stocks input
ubuntu@ip-172-31-44-14:~$ hadoop jar MaxClosePrice-1.0.jar com.hirw.maxcloseprice.MaxClosePrice input output
Troubleshooting - Namenode Stuck In Safe Mode
--Name node in safe mode
hadoop fs -copyFromLocal test /tmp/
copyFromLocal: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file /tmp/test. Name node is in safe mode.
--From name node web UI or logs
Safe mode is ON. The reported blocks 8900092 needs additional 6476 blocks to reach the threshold 1.0000 of total blocks 8906567
--hdfs-site.xml
dfs.namenode.safemode.threshold-pct
####Reason 1 - loss of datanodes or the cluster is running low on resources.####
sudo -u hdfs hdfs dfsadmin -report
Safe mode is ON
Configured Capacity: 6807953326080 (6.19 TB)
Present Capacity: 5076746797056 (4.62 TB)
DFS Remaining: 5076745936896 (4.62 TB)
DFS Used: 860160 (840 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (5):
Name: 10.15.230.42:50010 (node2)
Hostname: node2
Rack: /default
Decommission Status : Normal
Configured Capacity: 1361590665216 (1.24 TB)
DFS Used: 172032 (168 KB)
Non DFS Used: 425847939072 (396.60 GB)
DFS Remaining: 935742554112 (871.48 GB)
DFS Used%: 0.00%
DFS Remaining%: 68.72%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 0 (0 B)
Cache Remaining: 4294967296 (4 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%
Xceivers: 2
Last contact: Thu May 07 21:52:10 EDT 2018
Name: 10.15.230.44:50010 (node4)
Hostname: node4
Rack: /default
Decommission Status : Normal
Configured Capacity: 1361590665216 (1.24 TB)
DFS Used: 172032 (168 KB)
Non DFS Used: 219371347968 (204.31 GB)
DFS Remaining: 1142219145216 (1.04 TB)
DFS Used%: 0.00%
DFS Remaining%: 83.89%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 0 (0 B)
Cache Remaining: 4294967296 (4 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%
Xceivers: 2
Last contact: Thu May 07 21:52:10 EDT 2018
sudo -u hdfs hdfs dfsadmin -report
Safe mode is ON
Configured Capacity: 52710469632 (49.09 GB)
Present Capacity: 213811200 (203.91 MB)
DFS Remaining: 0 (0 B)
DFS Used: 213811200 (203.91 MB)
DFS Used%: 100.00%
Under replicated blocks: 39
Blocks with corrupt replicas: 0
Missing blocks: 0
####Reason 2 - Block corruption####
hdfs fsck /
Connecting to namenode via http://master:50070
FSCK started by hdfs (auth:SIMPLE) from /10.15.230.22 for path / at Thu May 07 21:56:17 EDT 2018
..
/accumulo/tables/!0/table_info/A00009pl.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073806971
/accumulo/tables/!0/table_info/A00009pl.rf: MISSING 1 blocks of total size 891 B..
/accumulo/tables/!0/table_info/A00009pm.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073806989
/accumulo/tables/!0/table_info/A00009pm.rf: MISSING 1 blocks of total size 891 B..
/accumulo/tables/!0/table_info/F00009pn.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073807006
............................................
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jaspic_1.0_spec-1.0.jar: MISSING 1 blocks of total size 30548 B..
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jta_1.1_spec-1.1.1.jar: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073743585
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jta_1.1_spec-1.1.1.jar: MISSING 1 blocks of total size 16030 B..
/user/oozie/share/lib/lib_20180408141046/sqoop/groovy-all-2.1.6.jar: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073743592
............................................
/tmp/logs/fincalc/logs/application_1430405794825_0003/DAST-node5_8041: MISSING 9 blocks of total size 1117083218 B..
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800217
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800222
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800230
Total size: 154126314807 B (Total open files size: 186 B)
Total dirs: 3350
Total files: 1790
Total symlinks: 0 (Files currently being written: 2)
Total blocks (validated): 2776 (avg. block size 55521006 B) (Total open file blocks (not validated): 2)
********************************
CORRUPT FILES: 1764
MISSING BLOCKS: 2776
MISSING SIZE: 154126314807 B
CORRUPT BLOCKS: 2776
********************************
Minimally replicated blocks: 0 (0.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 0.0
Corrupt blocks: 2776
Missing replicas: 0
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu May 07 21:56:18 EDT 2018 in 516 milliseconds
The filesystem under path '/' is CORRUPT
hdfs fsck / -list-corruptfileblocks
Connecting to namenode via http://ec2-80-80-80-80.compute-1.amazonaws.com:50070
The filesystem under path '/' has 0 CORRUPT files
hdfs@ip-172-31-45-216:~$ hdfs fsck /user/hirw/input/stocks -files -blocks -locations
Connecting to namenode via http://ec2-80-80-80-80.compute-1.amazonaws.com:50070
FSCK started by hdfs (auth:SIMPLE) from /172.31.45.216 for path /user/hirw/input/stocks at Mon Sep 18 11:22:00 UTC 2017
/user/hirw/input/stocks <dir>
/user/hirw/input/stocks/stocks 428223209 bytes, 4 block(s): OK
0. BP-2125152513-172.31.45.216-1410037307133:blk_1074178780_437980 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
1. BP-2125152513-172.31.45.216-1410037307133:blk_1074178781_437981 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
2. BP-2125152513-172.31.45.216-1410037307133:blk_1074178782_437982 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
3. BP-2125152513-172.31.45.216-1410037307133:blk_1074178783_437983 len=25570025 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
Status: HEALTHY
$ hdfs fsck /user/hadooptest/test1 -locations -blocks -files
FSCK started by hadoop (auth:SIMPLE) from /10-10-10-10 for path /user/hadooptest/test1 at Thu Dec 15 23:32:30 PST 2016
/user/hadooptest/test1 339281920 bytes, 3 block(s):
/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741830
/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741831
/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741832
MISSING 3 blocks of total size 339281920 B
0. BP-762523015-10-10-10-10-1480061879099:blk_1073741830_1006 len=134217728 MISSING!
1. BP-762523015-10-10-10-10-1480061879099:blk_1073741831_1007 len=134217728 MISSING!
2. BP-762523015-10-10-10-10-1480061879099:blk_1073741832_1008 len=70846464 MISSING!
--Leave safe mode and delete the file
hdfs dfsadmin -safemode leave
hadoop fs -rm /user/hadooptest/test1
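At any point during these steps the current safe mode state can be confirmed with:
hdfs dfsadmin -safemode get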