
Course Syllabus

Course Code / Course Name: DAS 839 / NoSQL Systems
Course Instructor Name(s): Vinu E. Venugopal

Credits (L:T:P) (Lecture : Tutorial : Practical)
    Hours   Component
    4       Lecture (1hr = 1 credit)
    0       Tutorial (1hr = 1 credit)
    0       Practical (2hrs = 1 credit)
    L:T:P = 4:0:0    Total Credits = 4

Grading Scheme (Choose by placing X against appropriate box)
    X   4-point scale (A,A-,B+,B,B-,C+,C,D,F)
        Satisfactory/Unsatisfactory (S / X)

Area of Specialization (if applicable)
(Choose by placing X in box against not more than two areas from the list)
    X   Theory and Systems for Computing and Data
        Networking and Communication
        Artificial Intelligence and Machine Learning
        Digital Society
        VLSI Systems
        Cyber Security
        General Elective
Programme / Branch Course is restricted to the following programmes / branch(es):
(Place X appropriately. More than one is okay)
Programme: Branch:
X iMTech X CSE
X M.Tech ECE
X M.Sc. Digital Society
Course Category Select one from the following:
(Place X appropriately)
Basic Sciences
CSE Core
ECE Core
X CSE Branch Elective
ECE Branch Elective
Engineering Science and Skills
HSS/M
General

Course Pre-Requisites: Database Management Systems; basics of Computer Architecture and
Organization; Networking; basic capabilities in a scripting and object-oriented programming
language (Java/Scala); usage of a Unix-like command-line shell.

Template Version 1.1


Template Date April 4, 2021
Additional Focus Areas
Select zero or more from the following and write one sentence explaining how the focus areas are covered as part of
the course. [NAAC criteria 1.1.3, 1.3.2].

Focus Area: Direct focus on employability
Yes / No: Yes
Details: The course emphasizes data management and engineering skills that contribute to
various phases of technology development, evaluation, and design, which enhance employability.

Focus Area: Focus on skill development
Yes / No: Yes
Details: The course provides skills in Apache Hadoop, Hive, Pig, HBase, MongoDB and Apache Spark.

Focus Area: Focus on entrepreneurship
Yes / No: Yes
Details: Experience in designing the system architecture for applications related to
large-scale data processing.

Focus Area: Provides value added / life skills (language, writing, communication, etc.)
Yes / No: Yes
Details: Programming skills, data modeling, technical report writing, presentation.

Course Context and Overview


This course introduces the fundamentals, architecture, and practical use cases of NoSQL
systems. The word NoSQL denotes “non-SQL” (non-relational) or “not only SQL”. A NoSQL system
belongs to a class of data management systems that handle the storage and retrieval of not
just tabular/structured data but also unstructured and semi-structured data. This course
provides an entry point to large-scale data management and distributed computing principles
in recent NoSQL architectures.

The course is designed to:


(1) Understand the evolution of data management systems, covering centralized systems,
distributed systems, big data systems, cloud-based systems and streaming data processing
systems.
(2) Study the fundamental differences between NoSQL systems and SQL-based systems.
(3) Understand the different types of NoSQL systems (key-value stores, columnar databases,
graph-based systems and document-based systems) and the related data models and data
structures.
(4) Focus, from a practical point of view, on current tools and APIs in the Apache Hadoop
and Spark ecosystem.

The course starts by reviewing the functionality of a classical SQL database system (PostgreSQL)
and then moves on to distributed file systems, including the Google File System (GFS) and the
Hadoop Distributed File System (HDFS). This is followed by a detailed discussion of MapReduce
and distributed computing principles. Further, we look into several recent NoSQL engines and
key-value stores, including Apache Pig, HBase, Hive and MongoDB, which provide a variety of
options for processing different data formats such as text, CSV, XML and JSON.

Course Outcomes and Competencies
CO1: Understand the data modelling and querying on SQL systems. Familiarize with the
PostgreSQL features.
    PO/PSO: PO1, PO3; CL: U, Ap; KC: C, P, F; Class (Hrs): 3; Tut (Hrs): 1.5

CO2: Understand the fundamental concepts in distributed systems.
    PO/PSO: PO1; CL: U, R; KC: C, F; Class (Hrs): 3; Tut (Hrs): 0

CO3: Understand distributed file system in detail. Examine the architecture of the GFS and
HDFS.
    PO/PSO: PO1, PO5; CL: U, R; KC: C, P; Class (Hrs): 1.5; Tut (Hrs): 0

CO4: Understand Apache Hadoop & MapReduce. Learn to interpret a given problem in terms of
Map and Reduce.
    PO/PSO: PO1, PO3, PO4, PO5; CL: U, Ap, An, C; KC: C, P, F; Class (Hrs): 4.5; Tut (Hrs): 1.5

CO5: Understand nested relations in Pig and the standard operators in Pig Latin (a
dataflow-oriented query language). Design queries using Pig Latin.
    PO/PSO: PO1, PO3, PO4, PO5; CL: U, Ap, An, C; KC: C, P, F; Class (Hrs): 1.5; Tut (Hrs): 1.5

CO6: Understand the data model in Hive and learn to design queries in HiveQL. Design use
cases for Partitions & Buckets, Map & Reduce side join, Outer & semi-join, Views and UDFs.
    PO/PSO: PO1, PO3, PO4, PO5; CL: U, Ap, An, C; KC: C, P, F; Class (Hrs): 3; Tut (Hrs): 1.5

CO7: Understand the column-oriented NoSQL database (HBase) and its architecture.
    PO/PSO: PO1, PO3, PO4; CL: U, An, C; KC: C, F; Class (Hrs): 3; Tut (Hrs): 0

CO8: Understand document-oriented NoSQL database (MongoDB) and its architecture.
    PO/PSO: PO1, PO3; CL: U, R; KC: F, C; Class (Hrs): 1.5; Tut (Hrs): 0

CO9: Understand the embedded data model & normalized data model in MongoDB. Learn how CRUD
operations are performed in MongoDB. Develop application pipelines using MongoDB.
    PO/PSO: PO1, PO3, PO4, PO5; CL: U, Ap, An, C; KC: C, P; Class (Hrs): 3; Tut (Hrs): 1.5

CO10: Understand the fundamental concepts in Spark Structured API.
    PO/PSO: PO1, PO3, PO5; CL: U, Ap, An; KC: C, P, F; Class (Hrs): 1.5; Tut (Hrs): 1.5

CO11: Read and understand the design principles of other NoSQL systems from recent research
papers.
    PO/PSO: PO2, PO9, PO10; CL: U, Ap, An; KC: C, F, M; Class (Hrs): 6; Tut (Hrs): 0

CO12: Learn to develop a full-fledged application using the tools introduced in the lecture.
Write a detailed project report containing motivation, literature survey, proposed system,
experimental study, conclusion and future scope.
    PO/PSO: PO2, PO9, PO10, PO11, PO13; CL: U, Ap, An, C; KC: C, F, M; Class (Hrs): 4.5; Tut (Hrs): 0

Legend: PO/PSO: Programme Outcomes / Programme Specific Outcomes; CL: Cognitive Level (from Revised
Bloom’s Taxonomy); KC: Knowledge Category (from Revised Bloom’s Taxonomy); Class (Hrs): Number of hours
of instruction; Tut (Hrs): Number of hours of tutorial session (where applicable)

Course Content
- Usage of classical data-modeling languages such as E/R diagrams
- Data management in SQL using the PostgreSQL open-source DBMS
- Distributed file systems (GFS & HDFS), session semantics vs. transaction
semantics, CAP theorem
- Apache Hadoop: distributed computing principles (MapReduce), replication, fault
tolerance, backup tasks, custom combiners and partitioners, local aggregation,
linear scalability
- Apache Pig: first dataflow language (Pig Latin), translation into MapReduce and
optimizations
- Apache HBase: distributed key-value store for very large tabular data, columns
and column families, indexing and lookups
- Apache Hive: SQL-like query language on top of Hadoop, translation into
MapReduce
- MongoDB: API overview, JSON processing, user-defined functions
- Apache Spark: resilient distributed datasets (RDDs) and dataframes, basic
overview of streaming and machine-learning extensions

Instruction Schedule
Session 1 & 2 – Introduction to Information Management: Types of data and how they relate
to the evolution of information management tools. Revisiting topics in database management
systems (ER diagrams, relational model, anomalies, decomposition, normal forms, relational
algebra, functional dependencies, SQL).

Session 3 & 4 – Principles of Parallel and Distributed Computing: Parallel vs. distributed
computing, Fundamentals and Common properties of Distributed Computing, History of Parallel
and Distributed data processing, Technologies for distributed computing.

Session 5 – Distributed File System Principles: File Access Models, File Access Types,
Sharding and Replication, Replicated data consistency, Strong and weak consistency, Types of
weak consistencies, CAP theorem, Distributed File Sharing Semantics - Session semantics vs.
Transaction semantics. Caching, Overview of Distributed file systems (GFS & HDFS), Conflict-
free replicated data types, state-based objects, Linearizability.
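
As an illustration of the state-based conflict-free replicated data types listed for Session 5, the following is a minimal sketch of a grow-only counter (G-Counter). The class name, method names and replica ids are hypothetical and chosen only for this example; it is not taken from the course materials or from any particular system.

```python
class GCounter:
    """State-based grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id, replica_ids):
        self.replica_id = replica_id
        self.counts = {rid: 0 for rid in replica_ids}

    def increment(self, amount=1):
        # A replica only ever increases its own slot.
        self.counts[self.replica_id] += amount

    def value(self):
        # The counter value is the sum over all replica slots.
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so replicas converge regardless of message order or duplication.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)


replicas = ["a", "b"]  # hypothetical replica ids
a = GCounter("a", replicas)
b = GCounter("b", replicas)
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
b.merge(a)  # a duplicate delivery leaves the state unchanged
```

After the merges both replicas hold the same state, which is the convergence property that distinguishes this design from weaker forms of replicated data consistency.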

Session 6 – Introduction to MapReduce Programming Model: Apache Hadoop: distributed
computing principles (MapReduce), replication, fault tolerance, backup tasks, custom combiners
and partitioners, local aggregation, linear scalability.
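
To make the roles of local aggregation and the partitioner concrete, here is a minimal single-process sketch of a word count in the MapReduce style. It simulates the phases in plain Python and does not use the Hadoop API; the function names and the choice of CRC32 as the partitioning hash are assumptions made for illustration.

```python
from collections import Counter, defaultdict
import zlib


def map_phase(split):
    # Local aggregation: count words within one input split before emitting,
    # which is the role a combiner plays in Hadoop.
    return Counter(split.lower().split())


def partition(word, num_reducers):
    # Deterministic hash partitioner: a given word always goes to the same reducer.
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def run_wordcount(splits, num_reducers=2):
    # Shuffle: route each (word, partial_count) pair to its reducer.
    shuffled = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for word, partial in map_phase(split).items():
            shuffled[partition(word, num_reducers)][word].append(partial)
    # Reduce: sum the partial counts for each word.
    result = {}
    for bucket in shuffled:
        for word, partials in bucket.items():
            result[word] = sum(partials)
    return result


counts = run_wordcount(["to be or not to be", "to see or not to see"])
```

Because each split is pre-aggregated, the shuffle carries one pair per distinct word per split rather than one pair per word occurrence.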

Session 7 – Apache Pig: Advantages and disadvantages of using Pig over Hadoop MapReduce directly,
execution modes, data model, dataflow processing, Wordcount example using Pig Latin, detailed
introduction to Pig Latin – operators, Multi-Query Execution, UDF implementation and parallelism
details.

Session 8 – Apache Hive: Introduction to Hive and HiveQL, access modes, architecture, data
model, operators and built-in functions, types of tables, schema management, partitions and
buckets, map-side and reduce-side join, multi-table insertion, sorting, joins, subqueries, views,
types of UDF.
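
The difference between a map-side and a reduce-side join can be sketched in plain Python as follows. This is an in-memory illustration of the two strategies, not Hive code; the function names and sample tables are hypothetical, and the map-side variant assumes the small table fits in memory and has unique join keys.

```python
from collections import defaultdict


def map_side_join(small, large):
    # Build an in-memory hash table from the small table once, then stream
    # the large table past it; the large table is never shuffled.
    lookup = dict(small)  # assumes unique keys in the small table
    return [(key, lookup[key], value) for key, value in large if key in lookup]


def reduce_side_join(left, right):
    # Group records from both inputs by join key (the "shuffle"),
    # then pair up left and right records inside each group.
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    joined = []
    for key, (lefts, rights) in groups.items():
        joined.extend((key, lv, rv) for lv in lefts for rv in rights)
    return joined


departments = [(10, "Sales"), (20, "Research")]  # small table
employees = [(10, "Asha"), (20, "Ravi"), (10, "Meera"), (30, "Kiran")]

map_joined = map_side_join(departments, employees)
reduce_joined = reduce_side_join(departments, employees)
```

Both strategies produce the same inner-join rows; they differ in where the matching happens and therefore in how much data has to be moved between tasks.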

Session 9 – Apache HBase: Access modes, Data model, Storage mechanism, Architecture,
Built-in operators, features of HBase tables, HBase API, Nested Loop Join in HBase, Bulk loading
of data.
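
The HBase data model (row key, column family and qualifier, timestamped cell versions) can be pictured as a nested sorted map. The toy class below is a sketch of that idea only; `TinyTable` and its methods are hypothetical names and do not correspond to the HBase client API.

```python
from collections import defaultdict


class TinyTable:
    """Toy wide-column table: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row][column][ts] = value

    def get(self, row, column):
        # Return the most recent version of the cell, or None if absent.
        versions = self.rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        # Rows are ordered by key, so a range scan is a slice of the sorted keys.
        return [r for r in sorted(self.rows) if start_row <= r < stop_row]


table = TinyTable(["info"])
table.put("user#001", "info:email", "old@example.org", ts=1)
table.put("user#001", "info:email", "new@example.org", ts=2)
table.put("user#002", "info:name", "Ravi", ts=1)
```

Reading a cell returns its latest version, and a scan over a key range touches only the rows whose keys fall inside it, which is why row-key design matters for lookups.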

Session 10 & 11 – MongoDB: Access modes, commands and scripting in Mongo shell, data
model, data storage & replication, architecture, sharded and non-sharded collections, range and
hash partitioning, CRUD operations, mapping of SQL to MongoDB, Indexing, Aggregation, Map-
side and Reduce-side join.
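
The contrast between range and hash partitioning of a sharded collection can be sketched as below. The helper names, split points and the use of MD5 are assumptions for illustration; this is not how MongoDB itself computes chunk placement.

```python
import bisect
import hashlib


def range_shard(key, split_points):
    # split_points are sorted boundaries; keys below the first boundary go to
    # shard 0, and so on. Neighbouring keys land on the same shard.
    return bisect.bisect_right(split_points, key)


def hash_shard(key, num_shards):
    # Hashing scatters neighbouring keys, trading range-query locality
    # for a more even distribution of writes.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


user_ids = list(range(100, 110))
by_range = [range_shard(uid, split_points=[1000, 2000]) for uid in user_ids]
by_hash = [hash_shard(uid, num_shards=3) for uid in user_ids]
```

With these split points every id in the sample falls on shard 0 under range partitioning, which keeps range queries local but concentrates monotonically increasing keys on one shard.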

Session 12 – Introduction to Apache Spark: Architecture, RDDs, Structured API – Dataframe
API, Dataset API, Wordcount example, an introduction to other abstract APIs in the Spark
ecosystem.

Session 13 – Paper presentations

Session 14 – Paper presentations

Session 15 – Paper presentations

Learning Resources
1. Class slides and supplementary materials (code snippets, open datasets, etc.)
2. Hadoop: The Definitive Guide. Tom White. O’Reilly Media, 3rd edition, 2012. ISBN: 978-
1491901632
3. Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. Synthesis
Lectures on Human Language Technologies, Morgan and Claypool, 2010. ISBN:
9781608453429
4. Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL
Movement. Eric Redmond and Jim R. Wilson. Pragmatic Bookshelf, 2012. ISBN-13: 978-
1934356920
5. Learning Spark: Lightning-Fast Big-Data Analysis. Matei Zaharia, Patrick Wendell, Andy
Konwinski, Holden Karau, 1st Edition, 2015. ISBN: 978-1449358624
6. Advanced Analytics with Spark: Patterns for Learning from Data at Scale (2nd Ed). Sandy
Ryza, Uri Laserson, Sean Owen and Josh Wills. O'Reilly Media, July 2017. ISBN: 978-
1491972953
7. The Google File System, ACM Symposium on Operating Systems Principles (SOSP), Oct. 2003. (Research Paper)
8. MapReduce: Simplified Data Processing on Large Clusters, OSDI Symposium, Dec. 2004.
(Research Paper)
9. Bigtable: A Distributed Storage System for Structured Data, OSDI Symposium, Nov. 2006.
(Research Paper)

Assessment Plan

• 25%: Mid-term exam and Quiz
• 30%: Take-home Programming Assignments (3 to 4)
• 15%: Paper Presentation
• 5%: Attendance, punctuality and overall interaction
• 25%: End-term Project (individual)

Assignments / Projects
[List exact number of assignments or projects included (provide generic description)]
S. No.: 1
CO Mapping: CO1-CO10
Focus of Assignment / Project: Programming Assignments: (1) Learn how to formulate a
problem/task/application in terms of Map and Reduce/Dataflow queries/Pig Latin/HiveQL, and
gain experience in solving it by using the algorithms, system designs and techniques taught
during the lectures; (2) Get hands-on experience with various NoSQL systems; (3) Understand
the data model and programming model of each system.

Evaluation Procedures
The course uses one or more of the following evaluation procedures as part of the course:
• Automatic evaluation of MCQ quizzes on Moodle or other online platforms
• Manual evaluation of essay type / descriptive questions
• Demonstration of programming questions before the TAs (either in person or online)

Where possible, students will be given the opportunity to view their evaluations, either in
person or online.

Late Assignment Submission Policy


All submissions are due on the date and time indicated in the LMS. The penalties for late
submission are as follows:

• 4 and <= 24 hours late submission: 25% penalty
• > 24 and < 48 hours late submission: 50% penalty
• > 48 hours late submission: 75% penalty

Make-up Exam/Submission Policy


As per institute policy.

Citation Policy for Papers (if applicable)
[If course includes reading papers and citing them as part of activities, state the citation policy. Mention
“Not applicable” if section is not applicable to the course]

Not applicable.

Academic Dishonesty/Plagiarism
As per institute policy.

Accommodation of Divyangs
As per institute policy.
