ASSIGNMENT-02 (BIG DATA)
Name: Shaik Mohammad Yunus
Reg No: 12206551
Section: K22GT
Download files to local:
wget https://github.com/logpai/loghub/raw/master/Hadoop/Hadoop_2k.log
wget https://github.com/logpai/loghub/raw/master/Hadoop/Hadoop_2k.log_structured.csv
Create the raw_logs directory
hdfs dfs -mkdir -p /user/ubuntu/raw_logs/
Upload to HDFS:
hdfs dfs -put Hadoop_2k.log /user/ubuntu/raw_logs/
hdfs dfs -put Hadoop_2k.log_structured.csv /user/ubuntu/raw_logs/
Verify
hdfs dfs -ls /user/ubuntu/raw_logs/
DATA SETUP
hdfs dfs -mkdir -p /user/ubuntu/raw_logs
hdfs dfs -mkdir -p /user/ubuntu/processed_logs
hdfs dfs -mkdir -p /user/ubuntu/logs_archive
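To confirm that all three directories exist, the tree can be listed recursively (assuming the same /user/ubuntu home directory):
hdfs dfs -ls -R /user/ubuntu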
File Upload Verification
hdfs dfs -ls /user/ubuntu/raw_logs/
List Files with Details
hdfs dfs -ls -h /user/ubuntu/raw_logs/
Both files show a replication factor of 1; the -h flag prints human-readable sizes.
Sizes: Hadoop_2k.log is 375.9 KB and Hadoop_2k.log_structured.csv is 522.3 KB.
OPERATIONS:
File Copy:
hdfs dfs -cp /user/ubuntu/raw_logs/Hadoop_2k.log /user/ubuntu/processed_logs/
File Rename
hdfs dfs -mv /user/ubuntu/raw_logs/Hadoop_2k.log_structured.csv /user/ubuntu/raw_logs/structured_hadoop_logs.csv
Move File
hdfs dfs -mv /user/ubuntu/raw_logs/structured_hadoop_logs.csv /user/ubuntu/processed_logs/
Delete a File
hdfs dfs -rm /user/ubuntu/raw_logs/Hadoop_2k.log
Preview File Content
hdfs dfs -head /user/ubuntu/processed_logs/Hadoop_2k.log | head -n 20
or
hadoop fs -cat /user/ubuntu/processed_logs/Hadoop_2k.log | head -20
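To inspect the end of the file instead, -tail prints its last kilobyte:
hdfs dfs -tail /user/ubuntu/processed_logs/Hadoop_2k.log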
View File Metadata
hdfs dfs -stat %s,%r,%b /user/ubuntu/processed_logs/structured_hadoop_logs.csv
or
hadoop fsck /user/ubuntu/processed_logs/structured_hadoop_logs.csv -files -blocks -locations
Find Number of Lines in File
hdfs dfs -cat /user/ubuntu/processed_logs/Hadoop_2k.log | wc -l
Search for a String in Logs
hdfs dfs -cat /user/ubuntu/processed_logs/Hadoop_2k.log | grep "ERROR"
Count String Occurrences
hdfs dfs -cat /user/ubuntu/processed_logs/Hadoop_2k.log | grep -o "WARN" | wc -l
Set Replication Factor
hdfs dfs -setrep 2 /user/ubuntu/processed_logs/structured_hadoop_logs.csv
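To make the command block until re-replication actually completes, -setrep also accepts a -w flag (this may take a while on large files):
hdfs dfs -setrep -w 2 /user/ubuntu/processed_logs/structured_hadoop_logs.csv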
Verify Replication Factor
hdfs dfs -ls /user/ubuntu/processed_logs/structured_hadoop_logs.csv
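The replication factor appears in the second column of the -ls output; it can also be printed directly with the %r format specifier:
hdfs dfs -stat %r /user/ubuntu/processed_logs/structured_hadoop_logs.csv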
Check File Blocks
hdfs fsck /user/ubuntu/processed_logs/Hadoop_2k.log -files -blocks
Directory Size
hdfs dfs -du -h /user/ubuntu/processed_logs/
Disk Space Usage
hdfs dfs -du -h /user/ubuntu
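For free and used space across the whole filesystem rather than one directory tree, -df can be used as well:
hdfs dfs -df -h /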
Clean Up Empty Directories
hadoop fs -ls -R /user/ubuntu | awk '$1 ~ /^d/ {print $8}' | while read dir; do
  if [ -z "$(hadoop fs -ls "$dir" 2>/dev/null | tail -n +2)" ]; then
    hadoop fs -rm -r "$dir"
    echo "Deleted: $dir"
  fi
done
Filter Large Files
hdfs dfs -ls -R /user/ubuntu | awk '$5 > 1048576 {print $NF}'
Log Filtering
hadoop fs -cat /user/ubuntu/processed_logs/Hadoop_2k.log | grep "INFO" | hadoop fs -put -f - /user/ubuntu/processed_logs/info_logs.txt
Error Logs Count
hdfs dfs -cat /user/ubuntu/processed_logs/Hadoop_2k.log | grep ERROR | wc -l
Generate Checksums
hadoop fs -checksum /user/ubuntu/processed_logs/structured_hadoop_logs.csv
Set Permissions
hdfs dfs -chmod 755 /user/ubuntu/processed_logs
Set ACLs
Check ACLs
hadoop fs -chmod 770 /user/ubuntu/raw_logs
(Note: chmod only adjusts the basic permission bits; the ACL-specific set/check commands are sketched below.)
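A minimal sketch of the ACL commands themselves, assuming ACLs are enabled on the cluster (dfs.namenode.acls.enabled=true) and using a hypothetical user name hadoopuser:
hdfs dfs -setfacl -m user:hadoopuser:r-x /user/ubuntu/raw_logs
hdfs dfs -getfacl /user/ubuntu/raw_logs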
Append to File
hdfs dfs -cat /user/ubuntu/processed_logs/Hadoop_2k.log | head -50 | tee temp_50_lines.txt
(This step only writes the first 50 lines to a local file; the append itself is sketched below.)
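A minimal sketch of the append, assuming the cluster permits appends to existing files:
hdfs dfs -appendToFile temp_50_lines.txt /user/ubuntu/processed_logs/Hadoop_2k.log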
Merge Logs
hdfs dfs -cat /user/ubuntu/processed_logs/Hadoop_2k.log /user/ubuntu/processed_logs/structured_hadoop_logs.csv | hdfs dfs -put -f - /user/ubuntu/processed_logs/merged_logs.txt
Verify the merged file
hdfs dfs -ls /user/ubuntu/processed_logs/merged_logs.txt
hdfs dfs -cat /user/ubuntu/processed_logs/merged_logs.txt | head -10
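An alternative sketch using -getmerge, which first concatenates the directory's files to local disk (merged.txt is an illustrative local file name; run it before merged_logs.txt exists, or that file will be included as well):
hdfs dfs -getmerge /user/ubuntu/processed_logs/ merged.txt
hdfs dfs -put -f merged.txt /user/ubuntu/processed_logs/merged_logs.txt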
Archive Old Logs
hdfs dfs -ls /user/ubuntu/raw_logs | awk -v date="$(date -d '7 days ago' '+%Y-%m-%d')" '$6 < date {print $8}' | while read file; do
  hdfs dfs -mv "$file" /user/ubuntu/logs_archive/
done
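For long-term storage, the archived logs could additionally be packed into a Hadoop archive (HAR). A minimal sketch, assuming YARN is available to run the archiving MapReduce job and using an illustrative archive name logs.har:
hadoop archive -archiveName logs.har -p /user/ubuntu logs_archive /user/ubuntu
hdfs dfs -ls har:///user/ubuntu/logs.har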