Anvesh
Sr. Data Engineer
Professional Summary
Dynamic and motivated IT professional with over 9 years of experience, seeking a challenging career with a growing organization that provides an opportunity to apply my technical skills as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, big data analytics, cloud data engineering, data warehouse/data mart, data visualization, reporting, and data quality solutions.
Hands-on experience with Amazon Web Services (AWS), including Elastic MapReduce (EMR), S3 storage, EC2 instances, and data warehousing.
Expertise in ingesting, processing, exporting, and analyzing terabytes of structured and unstructured data on Hadoop clusters in the information security and technology domains.
Relevant experience working with various SDLC methodologies such as Agile Scrum for developing and delivering applications.
Experience in the design, development, and implementation of big data applications using Hadoop ecosystem frameworks and tools such as HDFS, MapReduce, Sqoop, Spark, Scala, Storm, HBase, Kafka, and Flume.
In-depth knowledge of Hadoop Architecture and working with Hadoop components such as HDFS,
JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
Demonstrated experience delivering data and analytics solutions leveraging AWS, Azure, or similar cloud data lake platforms.
Spearheaded the development of scalable, real-time data pipelines and architectures using Apache Spark
(PySpark, Spark SQL, Spark Streaming), Databricks, Hadoop, and Hive, with over 5 years of hands-on
experience in big data technologies.
Designed and deployed cloud-native data solutions across AWS, Azure, and GCP, with 3+ years of experience
in building distributed, high-performance, cloud-based systems.
Engineered robust ETL/ELT workflows for structured and unstructured data sources, optimizing performance,
scalability, and reliability in large-scale data environments.
Hands-on experience with HDP and GCP, including BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, the gsutil and bq command-line utilities, Dataproc, and Stackdriver.
Worked with various file formats such as CSV, JSON, and XML.
Experience in creating a data lake using Spark that serves downstream applications.
Expertise in writing DDL and DML scripts in SQL and HQL for analytics applications in RDBMS.
Expertise in developing streaming applications in Scala using Kafka and Spark Structured Streaming.
Experience importing and exporting data between HDFS and relational systems like Teradata (Sales Data Warehouse) and SQL Server, and non-relational systems like HBase, using Sqoop with efficient column mappings while maintaining data uniformity.
Experience in working with Flume and NiFi for loading log files into Hadoop.
Experience developing Scala applications for loading/streaming data into NoSQL databases (MongoDB) and
HDFS.
Experience in working with NoSQL databases like HBase and Cassandra.
Experienced in creating shell scripts to push data loads from various sources from the edge nodes onto the
HDFS.
Experience working with various build and automation tools like Maven, Git, SVN, and Jenkins.
Experience in understanding specifications for data warehouse ETL processes and interacting with designers and end users to gather informational requirements.
Worked with Cloudera and Hortonworks distributions.
Experienced in performing code reviews and closely involved in smoke testing and retrospective sessions.
Experienced in Microsoft Business Intelligence tools, developing SSIS (Integration Services), SSAS (Analysis Services), and SSRS (Reporting Services), and building Key Performance Indicators and OLAP cubes.
Good exposure to star and snowflake schemas and data modeling across different data warehouse projects.
Strong analytical and problem-solving skills and the ability to follow through with projects from inception to
completion.
Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal
skills.
Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
Good working knowledge of AWS cloud services like EMR, S3, Redshift, and CloudWatch for big data development.
Professional Experience
Sr. Data Engineer
UBS, Weehawken, NJ July 2023 to Present
Responsibilities:
Performed data profiling to learn about behavior across various features such as traffic pattern, location, and date and time.
Implemented PySpark jobs utilizing DataFrames and the Spark SQL API for faster processing of data.
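A minimal PySpark sketch of the DataFrame and Spark SQL usage described above; the input path, columns, and aggregations are illustrative assumptions, not the actual datasets.

from pyspark.sql import SparkSession

# Entry point for both the DataFrame and Spark SQL APIs
spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()

# Hypothetical input: traffic events with location and duration columns
events = spark.read.csv("hdfs:///data/traffic_events.csv", header=True, inferSchema=True)

# DataFrame API: event counts per location
by_location = events.groupBy("location").count()

# Spark SQL API: the same data exposed as a temporary view for SQL-style profiling
events.createOrReplaceTempView("events")
summary = spark.sql(
    "SELECT location, COUNT(*) AS visits, AVG(duration) AS avg_duration "
    "FROM events GROUP BY location"
)

by_location.show()
summary.show()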
Involved in ingestion, transformation, manipulation, and computation of data using StreamSets, NiFi, MySQL, and PySpark.
Delivered mission-critical data products and analytics platforms by combining strong programming
proficiency in Python and Java with deep understanding of data engineering principles.
Led cross-functional teams through end-to-end project lifecycles, including technical architecture, roadmap
planning, and delivery of large-scale enterprise data solutions.
Developed automated monitoring frameworks and operational dashboards to track pipeline health, enforce data
quality standards, and visualize key business KPIs.
Conducted technical evaluations and implemented proof-of-concepts (POCs) for emerging technologies,
ensuring alignment with enterprise data strategies and performance goals.
Involved in data ingestion into MySQL using a NiFi-to-MySQL pipeline for full and incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
Worked on Spark data sources, DataFrames, Spark SQL, and streaming using PySpark and Scala.
Worked extensively with Azure components such as Databricks, Virtual Machines, and Blob Storage.
Experience in developing Spark applications using Scala and SBT.
Experience integrating the Spark JDBC connector with MySQL to save data processed in Spark to MySQL.
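A hedged sketch of saving a processed Spark DataFrame to MySQL over JDBC, as described above; the sample data, connection URL, table name, and credentials are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-sketch").getOrCreate()

# Stand-in for the DataFrame produced by the upstream Spark transformations
processed_df = spark.createDataFrame(
    [("2024-01-01", 1250), ("2024-01-02", 1318)],
    ["load_date", "row_count"],
)

# Connection details below are placeholders, not real endpoints or credentials
(processed_df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/analytics")
    .option("dbtable", "daily_metrics")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("user", "etl_user")
    .option("password", "****")
    .mode("append")   # append each incremental batch
    .save())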
Responsible for creating tables and automated MySQL pipelines to load data into the tables from NiFi data flows.
Performed a POC to measure the time taken for Change Data Capture (CDC) of Oracle data across Stream, StreamSets, and DB Visit.
Strong experience implementing data warehouse solutions in Amazon Web Services (AWS) Redshift; worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS, and S3.
Built the logical and physical data models for Snowflake as per the required changes.
Implemented AWS services to provide a variety of computing and networking capabilities to meet the needs of applications.
Used Amazon web services (AWS) like EC2 and S3 for small data sets.
Expertise in using different file formats such as text files, CSV, Parquet, and JSON.
Responsible for designing and deploying new ELK clusters (Elasticsearch, Logstash, Kibana, Beats, Kafka, ZooKeeper, etc.).
Migrated an existing on-premises application to AWS; used AWS services like EC2 and S3 for small dataset processing and storage, and experienced in maintaining the Hadoop cluster on AWS EMR.
Experienced in setting up, configuring, and maintaining the ELK stack (Elasticsearch, Logstash, and Kibana) and OpenGrok source code search (SCM).
Automated AWS infrastructure through infrastructure as code by writing various Terraform modules and scripts to create AWS IAM users, groups, roles, policies, custom policies, AWS Glue jobs and crawlers, Redshift clusters, cluster snapshots, EC2 instances, and S3 buckets.
Sr. Data Engineer
Macy's, New York, NY October 2020 to June 2023
Responsibilities:
Worked on DB2 SQL connections from Spark Scala code to select, insert, and update data in the database.
Used broadcast joins in Spark to join smaller datasets to large datasets without shuffling data across nodes.
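A minimal PySpark illustration of the broadcast-join pattern mentioned above; the datasets and join key are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Hypothetical large fact data and small lookup data
sales = spark.createDataFrame(
    [(1, "S01", 120.0), (2, "S02", 75.5), (3, "S01", 33.0)],
    ["sale_id", "store_id", "amount"],
)
stores = spark.createDataFrame(
    [("S01", "Manhattan"), ("S02", "Brooklyn")],
    ["store_id", "region"],
)

# Broadcasting the small side ships it to every executor, so the large side
# is joined in place without shuffling data across nodes.
joined = sales.join(broadcast(stores), on="store_id", how="left")
joined.show()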
Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and loaded it into partitioned Hive tables.
Performed data integration, extraction, transformation, and load (ETL) processes; migrated an existing on-premises application to AWS.
Established data governance and retention policies, ensuring compliance with regulatory requirements and
enhancing data lifecycle management practices.
Collaborated with business, product, and data science stakeholders to translate strategic goals into actionable,
data-driven solutions that support decision-making and customer insights.
Demonstrated strong foundation in computer science fundamentals, including data structures, algorithms,
system design, and stream processing, ensuring scalable and efficient software solutions.
Used AWS services like EC2 and S3 for small datasets processing and storage.
Worked on building an Enterprise DataLake using Data Factory and Blob storage, enabling other teams to
work with more complex scenarios and ML solutions.
Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication and
schema design.
Designed and implemented Sqoop for the incremental job to read data from DB2 and load to Hive tables and
connected to Tableau for generating interactive reports using Hive server2.
Designed and deployed data pipelines using DataLake, DataBricks, and Apache Airflow.
Developed a Spark application to load CSV file data and apply business validation on the DataFrame to separate valid and invalid records.
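A sketch of the load-and-validate pattern described above under assumed column names; the validation rule (non-null key and positive amount) is illustrative only.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("csv-validation-sketch").getOrCreate()

# Load the raw CSV file into a DataFrame (path and columns are placeholders)
raw = spark.read.csv("hdfs:///landing/orders.csv", header=True, inferSchema=True)

# Business validation expressed as a reusable boolean condition
is_valid = col("order_id").isNotNull() & (col("amount") > 0)

valid_df = raw.filter(is_valid)        # rows that pass the business rules
invalid_df = raw.filter(~is_valid)     # rows routed to an error/reject path

valid_df.write.mode("overwrite").parquet("hdfs:///curated/orders")
invalid_df.write.mode("overwrite").parquet("hdfs:///rejects/orders")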
Developed NoSQL database solutions using CRUD operations, indexing, replication, and sharding in MongoDB.
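A small pymongo sketch of the CRUD and indexing work noted above (replication and sharding are cluster-side configuration, so only client-side operations are shown); the connection string, database, and fields are placeholders.

from pymongo import MongoClient, ASCENDING

# Placeholder connection string; in a replica set this would list multiple hosts
client = MongoClient("mongodb://localhost:27017")
orders = client["retail"]["orders"]

# Create: insert a document
orders.insert_one({"order_id": 1001, "customer": "C42", "amount": 250.0})

# Index the lookup field, then Read by it
orders.create_index([("order_id", ASCENDING)], unique=True)
doc = orders.find_one({"order_id": 1001})

# Update and Delete complete the CRUD cycle
orders.update_one({"order_id": 1001}, {"$set": {"amount": 275.0}})
orders.delete_one({"order_id": 1001})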
Implemented Spark scripts using SparkSession, Python, and Spark SQL to access Hive tables in Spark for faster processing of data.
Developed shell scripts for adding dynamic partitions to Hive staging tables. Involved in developing ETL jobs to extract data and load it into the data lake.
Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
Developed Spark programs to parse the raw data, populate staging tables and store the refined data in
partitioned tables in the EDW.
Implemented Spark using Python and Spark SQL for faster testing and processing of data.
Used Spark Streaming to receive real-time data from Kafka and store the streamed data to HDFS using Python, as well as to NoSQL databases such as HBase and Cassandra.
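A hedged PySpark Structured Streaming sketch of the Kafka-to-HDFS flow described above, assuming the Spark-Kafka connector package is available on the cluster; brokers, topic, and paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Read a Kafka topic as a streaming DataFrame (brokers and topic are placeholders)
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "events")
    .option("startingOffsets", "latest")
    .load())

# Kafka keys and values arrive as bytes; cast to string before persisting
messages = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Continuously append the stream to HDFS as Parquet, with a checkpoint for recovery
query = (messages.writeStream
    .format("parquet")
    .option("path", "hdfs:///streams/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start())

query.awaitTermination()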
Developed data processing applications in Scala using Spark RDDs as well as DataFrames via the Spark SQL APIs.
Worked with the SparkSession object, Spark SQL, and DataFrames for faster execution of Hive queries.
Imported data from different sources like SQL Server into Spark RDDs and developed a data pipeline using Kafka and Spark to store data in HDFS.
Used Spark SQL to load JSON data, create a schema RDD, and load it into Hive tables, and handled structured data using Spark SQL.
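A brief sketch of the JSON-to-Hive step described above, assuming Hive support is enabled on the SparkSession; the file path, columns, and table name are illustrative.

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL read and write Hive-managed tables
spark = (SparkSession.builder
    .appName("json-to-hive-sketch")
    .enableHiveSupport()
    .getOrCreate())

# Spark infers the schema from the JSON records (path is a placeholder)
events = spark.read.json("hdfs:///landing/events.json")

# Query the semi-structured data with Spark SQL ...
events.createOrReplaceTempView("events_raw")
cleaned = spark.sql("SELECT id, type, ts FROM events_raw WHERE id IS NOT NULL")

# ... and persist the result as a Hive table for downstream consumers
cleaned.write.mode("overwrite").saveAsTable("edw.events_cleaned")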
Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems/mainframes.
Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
Implemented Nifi flow topologies to perform cleansing operations before moving data into HDFS.
Used Apache NiFi to copy data from local file system to HDP.
Worked on big data integration and analytics based on Hadoop, Spark, Kafka, and webMethods technologies.
Worked with Tidal Enterprise Scheduler in scheduling daily batch jobs with ease.
Environment: Scala, Spark Core, Spark SQL, Apache Hadoop 2.7.6, Spark 2.3, Hive SQL, HDFS, Cassandra, Zookeeper, Kafka, Oracle 19c, MySQL, MongoDB, Shell Script, AWS, EC2, Hive.
Data Engineer
Molina Healthcare, Bothell, WA November 2018 to September 2020
Responsibilities:
Involved in requirements gathering, analysis, design, development, change management, deployment.
Involved in the development of real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications, and executed machine learning use cases using Spark ML and MLlib.
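An illustrative Spark ML pipeline in the spirit of the use cases above; the toy data, feature columns, and model choice are assumptions, not the actual models.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("spark-ml-sketch").getOrCreate()

# Tiny stand-in dataset with placeholder feature columns and a binary label
df = spark.createDataFrame(
    [(34.0, 2.0, 180.0, 0.0), (61.0, 9.0, 2400.0, 1.0),
     (45.0, 4.0, 620.0, 0.0), (58.0, 7.0, 1900.0, 1.0)],
    ["age", "visit_count", "total_cost", "label"],
)

# Assemble numeric columns into the single vector column Spark ML expects
assembler = VectorAssembler(inputCols=["age", "visit_count", "total_cost"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Pipeline bundles feature preparation and model training into one fit/transform unit
model = Pipeline(stages=[assembler, lr]).fit(df)
model.transform(df).select("label", "prediction", "probability").show()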
Configured high availability using geographically distributed MongoDB replica sets across multiple data centers.
Extracted data from heterogeneous sources and performed complex business logic on network data to
normalize raw data which can be utilized by BI teams to detect anomalies.
Responsible for designing and building a DataLake using Hadoop and its ecosystem components.
Designed and developed Flink pipelines to consume streaming data from Kafka and applied business logic to massage, transform, and serialize raw data.
Developed Json Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the
Cosmos Activity.
Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
Implementing and orchestrating data pipelines using Oozie and Airflow.
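A minimal Airflow DAG sketch (written against the Airflow 2.x import style) of the kind of orchestration mentioned above; the DAG name, task commands, and schedule are placeholders.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Placeholder daily pipeline: ingest, then transform, then publish
with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="echo ingest step")
    transform = BashOperator(task_id="transform", bash_command="echo spark-submit transform job")
    publish = BashOperator(task_id="publish", bash_command="echo publish step")

    # Dependencies define the run order
    ingest >> transform >> publish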
Developed common Flink module for serializing and de-serializing AVRO data by applying schema.
Used the data lake to store data and perform all types of processing and analytics.
Developed a Spark Streaming pipeline to batch real-time data, detect anomalies by applying business logic, and write the anomalies to an HBase table.
Implemented layered architecture for Hadoop to modularize design. Developed framework scripts to enable
quick development.
Designed reusable shell scripts for Hive, Sqoop, Flink, and Pig jobs. Standardized error handling, logging, and metadata management processes.
Indexed processed data and created dashboards and alerts in Splunk to be utilized and actioned by support teams.
Responsible for operations and support of Big data Analytics platform, Splunk and Tableau visualization.
Overcame challenges like data migration from MySQL to MongoDB.
Designed and Developed applications using Apache Spark, Scala, Python, NIFI, S3, AWS EMR on AWS
cloud to format, cleanse, validate, create schema and build data stores on S3.
Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
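A hedged sketch of an AWS Glue PySpark job performing this kind of aggregation; the catalog database, table, columns, and S3 output location are placeholders, not the actual Adobe datasets.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue job boilerplate: wrap the SparkContext in a GlueContext
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a Glue Data Catalog table as a DynamicFrame, then switch to the DataFrame API
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="adobe_analytics",     # placeholder catalog database
    table_name="clickstream")       # placeholder table
df = dyf.toDF()

# Aggregate: daily hit counts per site section (column names are assumptions)
daily = df.groupBy("site_section", F.to_date("hit_time").alias("day")).count()
daily.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_counts/")

job.commit()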
Developed CI-CD pipeline to automate build and deploy to Dev, QA, and production environments.
Supported production jobs and developed several automated processes to handle errors and notifications. Also,
tuned performance of slow jobs by improving design and configuration changes of PySpark jobs.
Created standard report Subscriptions and Data Driven Report Subscriptions.
Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, Pig, Hive, Ambari, AWS, PostgreSQL, Azure, Cosmos, Python, PySpark, Flink, Kafka.
Data Engineer
Ceequence Technologies, Hyderabad, India March 2016 to August 2018
Responsibilities:
Worked on data pre-processing and cleansing to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
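A small pandas/scikit-learn sketch of the kind of missing-value imputation described above; the columns, values, and imputation strategies are illustrative only.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny stand-in frame with missing values in numeric and categorical columns
df = pd.DataFrame({
    "age": [34, np.nan, 61, 45],
    "claim_amount": [180.0, 2400.0, np.nan, 620.0],
    "plan_type": ["HMO", np.nan, "PPO", "HMO"],
})

# Numeric columns: impute with the median; categorical: with the most frequent value
num_cols, cat_cols = ["age", "claim_amount"], ["plan_type"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# No nulls should remain in the imputed columns
assert df[num_cols + cat_cols].isna().sum().sum() == 0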
Developed Python scripts to automate the data sampling process. Ensured data integrity by checking for completeness, duplication, accuracy, and consistency.
Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
Performed Data Integration, Extraction, Transformation, and Load (ETL) Processes.
Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and write-back tools.
Used Python to identify trends and relationships between different pieces of data and drew appropriate
conclusions.
Developed a REST API to serve the data generated by the prediction model to other customers and teams.
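A minimal Flask sketch of a prediction-serving REST endpoint like the one described above; the toy model and feature layout are stand-ins for the real prediction model.

import numpy as np
from flask import Flask, request, jsonify
from sklearn.linear_model import LinearRegression

app = Flask(__name__)

# Placeholder model fit on toy data; in practice the serialized prediction
# model would be loaded here at startup instead.
model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0]))

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"features": [2.5]}
    payload = request.get_json(force=True)
    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)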
Generated complex reports in various formats such as list reports, summary reports etc. using advanced data
manipulation techniques.
Created Visual Charts, Graphs, Maps, Area Maps, Dashboards and Storytelling using Tableau.
Developed, maintained, and supported a continuous integration (CI/CD) framework based on Jenkins.
Implemented, tuned and tested the model on AWS EC2 with the best performing algorithm and parameters.
Environment: Hadoop, HDFS, MapReduce, Sqoop, Flume, Hive, SQL Server, Oracle, PL/SQL, Eclipse, Java, Shell scripting, Vertica, Unix, Cassandra.
Data Modeler
Brio Technologies Private Limited, Hyderabad, India September 2014 to February 2016
Responsibilities:
Worked with business users to gather requirements and create a data flow, process flows, and functional
specification documents and created Conceptual, Logical and Physical data models using Erwin.
Designed both 3NF data models for ODS, OLTP systems and dimensional data models using star and
snowflake Schemas.
Used forward engineering in Erwin to create database scripts for the OLAP model.
Worked with business to identify the distinct data elements in each report to determine the number of reports
needed to satisfy all reporting requirements.
Worked on integrating data from heterogeneous sources like Oracle, flat files, and XML files.
Developed the design & process flow to ensure that the process is repeatable.
Successfully created and managed a conversion testing effort which included a data quality review, two system
test cycles, and user acceptance testing.
Applied expert-level understanding of different databases in combination for data extraction and loading, joining data extracted from different databases and loading it into a specific database.
Designed the staging and target schema models and deployed the DDL in an Oracle database.
Created and reviewed the conceptual model for the EDW (Enterprise Data Warehouse) with business users.
Environment: Erwin 7.0, Oracle 11g, MS-Office, SQL Loader, PL/SQL, SQL Server 2008/2012.