


Data types and data structures
Atomic types
Collection types
Operations on data structures
Conditional statements (if-else blocks)
Switch statements
Loops
User defined functions
Functions with default arguments
Functions with variable number of arguments
Functions returning values
Reading data from a CSV file
Reading data from an XML file
Saving the contents of a data frame into a file
Introduction
Data and information
Classification of data
Measures of central tendency
Measures of dispersion
Visual representation of data
Introduction
Sample space
Events and their types
Types of probability
Dependence and Independence
Bayes’ Theorem
Introduction
Various kinds of sampling
Central Limit Theorem
Standard error
Confidence intervals
Mann-Whitney-Wilcoxon test
Kruskal-Wallis test
Non-parametric tests
Chi-square test
Introduction
Correlation versus Causation
Basics of EDA
Importance of EDA in Data Science domain
Key EDA techniques
Define EDA
Explain why EDA is needed in the field of Data Science
Describe the important techniques in EDA
Variables & data types
Sampling techniques & samples
Frequency distribution and central tendency
Variability & shape
Relationship among variables
Describe central tendency of data and key statistics
Describe variability and shape of data in a dataset
Identify association among variables
Develop insight into missing values and outliers
Data for EDA
Data types and formats
Data quality
Data analysis types and purpose
Handling missing values in data
Data transformation for EDA
Describe different types of data in enterprises
Describe the importance of data quality & cleansing
Explain the purpose of data analysis & types
Describe methods to treat missing values
Handling categorical & numerical variables
Plotting EDA graphs
Dimension reduction
Association analysis
Clustering
Factor analysis & PCA
Visualize the central tendency, shape, and distribution of a dataset
Identify outliers and missing values
Explain the need for data transformation and types
Implement the algorithms of association & clustering
Describe factor analysis and Principal Component Analysis (PCA) for dimensionality reduction
Visual Communication Design
Datasets and Graphs
Text, Pictures, Icons, Animation
Layout and Formatting, Context
Getting Right Data for Visualization
Visualizing Numeric, Categorical Data, Multiple Variables, Multiple Dimensions in 2D
Visualizing Time and Space Dimension and Relationship Between Measures and Categories
Visualizing structured data
Mediums of visualization - dashboards, scorecards, infographics and others
Visualizing big data types
Data story design process
Edward Tufte's and other principles
Choosing the right graphs and interaction abilities
Story development and delivery
Introduction
Challenges of Processing Big Data
Distributed Systems
History of Hadoop
Hadoop Overview
Ecosystem of Hadoop
HDFS and MapReduce Paradigm
Processing Pipeline
Big Data Technologies
Use Cases
Features of Hadoop
Summary
Introduction
Hadoop Installation and Configuration
Hive Installation and Configuration
Pig Installation and Configuration
Sqoop Installation and Configuration
Oozie Installation and Configuration
Flume Installation and Configuration
HBase Installation and Configuration
Hue Installation and Configuration
Introduction
HDFS commands
Basic HDFS commands demo
Read anatomy in HDFS
Write anatomy in HDFS
Additional HDFS commands demo
HDFS permission management
HDFS permission management demo - Part 1
HDFS permission management demo - Part 2
Introduction
Traditional approach
Overview of MapReduce (MR1)
System architecture of MapReduce (MR1)
Introduction to YARN
MapReduce job execution in YARN (MR2/Hadoop 2.x)
Introduction
Job flow
Job submission
Job initialization
Job scheduling
Map task execution
Sort and shuffle
Reduce task execution
Job clean-up
Scheduler
Introduction
MapReduce and Pig
Modes of execution in Pig
Pig client
Data types in Pig
Operators in Pig
Pig usage
Loading data into Pig demo
Pig dialects
Transformations in Pig demo
Debugging in Pig demo
Other capabilities in Pig demo
Real time deployment introduction
System architecture
Logical deployment overview
Physical deployment overview
Real time deployment summary
Big data software and tools introduction
Streaming tools
NoSQL tools
Administration tools
Other ecosystem tools
Big data software and tools summary
Introduction to Machine Learning
Broad classification – Supervised vs Unsupervised Learning
Use cases, Opportunities and Challenges
Machine Learning: definition, supervised and unsupervised learning approaches, use cases, challenges and opportunities
Decision Trees
KNN - ‘K’ Nearest Neighbors
Applications
Build classification-based machine learning models using the Decision Tree and k-NN algorithms
Introduction
Goodness Measures
ROC Curves
Derive the accuracy of a machine learning model using Goodness Measures & ROC curves
Introduction
K-means
Practical Issues in Clustering
Applications
Develop models from unstructured data using clustering techniques and understand their limitations
Introduction
Customer Life Cycle and Customer Behavior
Understand the concepts of customer lifecycle & behaviour, which aid in building better customer-analytics-based ML models
SVM
Neural Networks