NPTEL Big Data Computing Assignment

In today’s fast-paced digital world, the amount of data generated every minute has grown tremendously. It comes from sensors that gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and GPS signals from cell phones, to name a few. Such large volumes of data, arriving at different velocities and in different varieties, are termed big data, and big data analytics enables professionals to convert extensive data, through statistical and quantitative analysis, into powerful insights that can drive efficient decisions. This course provides an in-depth understanding of the terminology and core concepts behind big data problems, applications, and systems, and of the techniques that underlie today’s big data computing technologies. It introduces some of the most common frameworks, such as Apache Spark, Hadoop, and MapReduce; large-scale data storage technologies, such as in-memory key/value storage systems, NoSQL distributed databases, Apache Cassandra, and HBase; and big data streaming platforms, such as Apache Spark Streaming and Apache Kafka Streams, that have made big data analysis easier and more accessible. While discussing the concepts and techniques, we will also look at various applications of big data analytics using machine learning, deep learning, graph processing, and many others. The course is suitable for all UG/PG students and for practicing engineers and scientists from diverse fields who are interested in learning about novel, cutting-edge techniques and applications of big data computing.
PREREQUISITES: Data Structures & Algorithms, Computer Architecture, Operating Systems, Database Management Systems
INDUSTRY SUPPORT: Companies like Amazon, Microsoft, Google, IBM, Facebook

This course will have an unproctored programming exam in addition to the proctored exam; please check the announcement section for the date and time. The programming exam will have a weightage of 25% towards the final score.

Final score = Assignment score + Unproctored programming exam score + Proctored Exam score
  • Assignment score = 25% of the average of the best 8 assignments out of the total 12 assignments given in the course.
  • (All assignments in a particular week, both quizzes and programming assignments, will be counted towards final scoring.)
  • Unproctored programming exam score = 25% of the average score obtained in the unproctored programming exam, out of 100
  • Proctored Exam score = 50% of the proctored certification exam score out of 100
If any one of the 3 criteria is not met, you will not be eligible for the certificate even if the Final score >= 40/100. 
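The weighting above can be sketched as a small calculation. This is only an illustration, assuming every assignment and exam is scored out of 100 and that the best-8 average is always divided by 8 (so missing assignments count as zero); the function name and the sample scores are hypothetical, and the eligibility criteria are not modelled.

```python
def final_score(assignment_scores, unproctored_score, proctored_score):
    """Combine the three components with the 25% / 25% / 50% weights.

    assignment_scores: per-assignment scores out of 100 (up to 12 values).
    unproctored_score, proctored_score: exam scores out of 100.
    """
    # Average of the best 8 assignment scores (assumed divisor: always 8).
    best_eight = sorted(assignment_scores, reverse=True)[:8]
    assignment_avg = sum(best_eight) / 8

    assignment_part = 0.25 * assignment_avg
    unproctored_part = 0.25 * unproctored_score
    proctored_part = 0.50 * proctored_score
    return assignment_part + unproctored_part + proctored_part


# Hypothetical learner: these scores are made up for illustration.
scores = [100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 0, 0]
total = final_score(scores, unproctored_score=60, proctored_score=70)
print(total)
```

With these made-up scores, the best-8 average is 65, so the components are 16.25, 15, and 35.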


1 point
Which of the following tasks can be best solved using clustering?
1 point
Identify the correct statement in the context of the regression model of machine learning.
1 point
___________ refers to a model that can neither model the training data nor generalize to new data.
1 point
Which of the following is required by K-means clustering?
1 point
Imagine you are working on a project that is a binary classification problem. You trained a model on the training dataset and obtained the confusion matrix below on the validation dataset.

Based on the above confusion matrix, choose which of the option(s) below are correct.

1. Accuracy is ~0.91
2. Misclassification rate is ~0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95
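
The four quantities named in the options are all derived from the cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts passed in are hypothetical and are not the matrix from this question, and the function name is chosen only for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return common rates computed from binary confusion-matrix counts.

    tp: true positives, fp: false positives,
    fn: false negatives, tn: true negatives.
    """
    total = tp + fp + fn + tn
    return {
        # Fraction of all predictions that are correct.
        "accuracy": (tp + tn) / total,
        # Fraction of all predictions that are wrong (1 - accuracy).
        "misclassification_rate": (fp + fn) / total,
        # Fraction of actual negatives predicted as positive.
        "false_positive_rate": fp / (fp + tn),
        # Fraction of actual positives predicted as positive (recall).
        "true_positive_rate": tp / (tp + fn),
    }


# Hypothetical counts, made up for illustration only.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and misclassification rate always sum to 1, which is a quick sanity check on any pair of claimed values.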

1 point
Identify the correct method for choosing the value of ‘k’ in the k-means algorithm.
1 point
True or False?

If your model has very low training error but high generalization error, then it is overfitting.

1 point
Identify the correct statement(s) in the context of overfitting in decision trees:

Statement I: The idea of Post-pruning is to grow a tree to its maximum size and then remove the nodes using a top-bottom approach.

Statement II: The idea of pre-pruning is to stop tree induction before a fully grown tree that perfectly fits the training data is built.

1 point
Which of the following options is/are true for K-fold cross-validation?

1. Increasing K will increase the time required to cross-validate the result.
2. Higher values of K will result in higher confidence in the cross-validation result compared to lower values of K.
3. If K=N, where N is the number of observations, it is called leave-one-out cross-validation.
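
To make the mechanics of K-fold splitting concrete, here is a minimal sketch in plain Python. It assumes contiguous folds without shuffling, and the helper name `k_fold_indices` is hypothetical; libraries such as scikit-learn provide fuller implementations.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k folds over n observations."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    # Spread any remainder over the first `extra` folds.
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# 6 observations, 3 folds: each model is validated on 2 held-out points.
folds_3 = list(k_fold_indices(6, 3))

# K = N: every fold holds out exactly one observation (leave-one-out).
folds_loo = list(k_fold_indices(6, 6))

print(len(folds_3), len(folds_loo))
```

One model is trained per fold, so the number of training runs grows with K, and at K = N it equals the number of observations.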

1 point
Identify the correct statement(s) in the context of machine learning approaches:

Statement I: In supervised approaches, the target that the model is predicting is unknown or unavailable. This means that you have unlabeled data.

Statement II: In unsupervised approaches the target, which is what the model is predicting, is provided. This is referred to as having labeled data because the target is labeled for every sample that you have in your data set.

