CIT-423 Machine Learning

Professor Md Abdul Masud

References

Introduction

  • Concepts on Machine Learning, Deep Learning, and Reinforcement Learning
  • Concepts on Supervised Learning, Unsupervised Learning, and Semi-Supervised Learning

  • Supervised Learning

    • Discrete \(\to\) Classification \(\to\) Accuracy
    • Continuous \(\to\) Regression \(\to\) Error
  • Parameter Estimation \(\to\) Linear Regression

Chapter 2 : Supervised Learning (Book-1)

Slide: i2ml2e-chap2-v1-0

2.1 Learning a Class from Examples (Data)

Classification:

  • Hypothesis \(\to\) Equation/Mathematical Model
\[ \text{Error, } E(h|X) = \sum_{t=1}^{N} 1\big(h(x^t) \neq r^t\big) \]
  • Most specific hypothesis \(\to\) tightest hypothesis + all positive
  • Most general hypothesis \(\to\) largest hypothesis
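The empirical error above just counts misclassifications; a minimal sketch on a made-up one-dimensional sample (the threshold hypothesis is a toy choice, not from the book):

```python
# Empirical classification error E(h|X): the number of instances where the
# hypothesis disagrees with the label. Data and hypothesis are made up.

def empirical_error(h, X, r):
    """Count instances t with h(x^t) != r^t."""
    return sum(1 for x_t, r_t in zip(X, r) if h(x_t) != r_t)

# Toy hypothesis: predict positive (1) when x >= 2.
h = lambda x: 1 if x >= 2 else 0

X = [0.5, 1.5, 2.5, 3.0]
r = [0, 1, 1, 1]

print(empirical_error(h, X, r))  # 1 (only x = 1.5 is misclassified)
```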

Figure 2.3

Assignment

Multi-Class Classification

  • Confusion Matrix
  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • ROC Curve
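All of these metrics (except the ROC curve) derive from the confusion-matrix counts; a from-scratch sketch on made-up binary predictions (multi-class versions average the per-class scores):

```python
# Confusion-matrix counts and the derived metrics, computed from scratch
# on made-up binary predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)                   # 3 3 1 1
print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```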

2.2 Vapnik-Chervonenkis (VC) Dimension

  • VC dimension \(\to\) the largest number of points that the hypothesis class can shatter
  • \(N\) points \(\to\) \(2^N\) possible labelings (dichotomies)
  • \(\tau = \{f(x)\}\)

2.3 Probably Approximately Correct (PAC) Learning

Given a class, \(C\), and examples drawn from some unknown but fixed probability distribution, \(p(x)\), we want to find the number of examples, \(N\), such that with probability at least \(1 - \delta\), the hypothesis \(h\) has error at most \(\epsilon\), for arbitrary \(\delta \le 1/2\) and \(\epsilon > 0\):

\[ P\{ C \Delta h \le \epsilon \} \ge 1 - \delta \]

where \(C \Delta h\) is the region of difference between \(C\) and \(h\).
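For the tightest axis-aligned rectangle, the book's worked example, the bound comes out to \(N \ge (4/\epsilon)\ln(4/\delta)\); a quick calculator assuming that derivation:

```python
import math

# Sample-complexity bound for PAC-learning the tightest axis-aligned
# rectangle: N >= (4/eps) * ln(4/delta) examples suffice so that, with
# probability at least 1 - delta, the error is at most eps.
def pac_samples(eps, delta):
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_samples(0.1, 0.05))  # 176
```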

2.4 Noise

2.5 Learning Multiple Classes

In the general case, we have \(K\) classes denoted as \(C_i, i = 1, \dots, K\), and an input instance belongs to one and exactly one of them. The training set is now of the form \(X = \{x^t, r^t\}_{t=1}^{N}\), where \(r\) has \(K\) dimensions and

\[r^{t}_{i}= \begin{cases} 1, & {\text{if } x^t ∈ C_i}.\\ 0, & {\text{if } x^t ∈ C_j , j \neq i}. \end{cases} \]

We have \(K\) hypotheses to learn such that,

\[h_i (x^t)= \begin{cases} 1, & {\text{if } x^t ∈ C_i}.\\ 0, & {\text{if } x^t ∈ C_j , j \neq i}. \end{cases} \]

The total empirical error takes a sum over the predictions for all classes over all instances:

\[ E (\{h_i \}_{i=1}^{K} | X) = \sum_{t=1}^{N} \sum_{i=1}^{K} 1 (h_i(x^t) \neq r_{i}^{t} ) \]
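The double sum above can be evaluated directly; a toy sketch with three hypothetical threshold hypotheses on scalar inputs and one-hot labels:

```python
# Total empirical error over K one-vs-all hypotheses with one-hot labels,
# i.e. the double sum of 1(h_i(x^t) != r_i^t). Hypotheses and data are toys.
def total_error(hs, X, R):
    return sum(
        1
        for x_t, r_t in zip(X, R)        # over instances t
        for h_i, r_ti in zip(hs, r_t)    # over classes i
        if h_i(x_t) != r_ti
    )

hs = [
    lambda x: 1 if x < 1 else 0,         # hypothesis for class 1
    lambda x: 1 if 1 <= x < 2 else 0,    # hypothesis for class 2
    lambda x: 1 if x >= 2 else 0,        # hypothesis for class 3
]
X = [0.5, 1.5, 2.5, 1.2]
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]  # last label disagrees

print(total_error(hs, X, R))  # 2 (x = 1.2 is wrong for classes 1 and 2)
```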

2.6 Regression

  • The function is not known but we have a training set of examples drawn from it, where \(r^t ∈ \mathbb{R}\). If there is no noise, the task is interpolation.
\[X = \{x^t, r^t \}^{N}_{t=1}\]
  • We would like to find the function \(f(x)\) that passes through these points such that we have \(r^t = f(x^t)\)

  • In polynomial interpolation, given \(N\) points, we find the \((N-1)\)st degree polynomial that we can use to predict the output for any \(x\). This is called extrapolation if \(x\) is outside of the range of \(x^t\) in the training set.

  • In regression, there is noise added to the output of the unknown function, where \(f(x) ∈ \mathbb{R}\) is the unknown function and \(\epsilon\) is random noise.

\[r^t = f(x^t) + \epsilon\]
  • The explanation for noise is that there are extra hidden variables that we cannot observe, where \(z^t\) denotes those hidden variables.
\[r^t = f^*(x^t, z^t)\]
  • The empirical error on the training set \(X\) is
\[E(g|X) = \frac{1}{N} \sum_{t=1}^{N} [r^t − g(x^t)]^2\]
  • Our aim is to find \(g(·)\) that minimizes the empirical error.
\[g(x) = w_1 x_1 + · · · + w_d x_d + w_0 = \sum_{j=1}^{d} w_j x_j + w_0\]
  • There we used a single input linear model where \(w_1\) and \(w_0\) are the parameters to learn from data.
\[g(x) = w_1 x + w_0\]
  • The \(w_1\) and \(w_0\) values should minimize the error below, where \(\bar{x} = \sum_{t} x^t /N\) and \(\bar{r} = \sum_{t} r^t/N\)
\[E(w_1, w_0 |X) = \frac{1}{N} \sum_{t=1}^{N} [r^t − (w_1 x^t + w_0)]^2\]
\[w_1 = \frac{ \sum_{t} x^t r^t − \bar{x}\bar{r} N} { \sum_{t} (x^t)^2 − N\bar{x}^2}\]
\[w_0 = \bar{r} − w_1 \bar{x}\]
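The closed-form solution can be checked on a made-up noiseless sample; with no noise the fit recovers the generating line exactly:

```python
# Closed-form least-squares fit of g(x) = w1*x + w0 using the formulas
# above, on a made-up noiseless sample drawn from r = 2x + 1.
def fit_line(xs, rs):
    N = len(xs)
    x_bar = sum(xs) / N
    r_bar = sum(rs) / N
    w1 = (sum(x * r for x, r in zip(xs, rs)) - N * x_bar * r_bar) / \
         (sum(x * x for x in xs) - N * x_bar ** 2)
    w0 = r_bar - w1 * x_bar
    return w1, w0

xs = [0.0, 1.0, 2.0, 3.0]
rs = [1.0, 3.0, 5.0, 7.0]
w1, w0 = fit_line(xs, rs)
print(w1, w0)  # 2.0 1.0
```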

Table 2.1

Assignment

Errors and Evaluation Metrics for Regression

2.7 Model Selection and Generalization

Underfitting and Overfitting

2.8 Dimensions of a Supervised Machine Learning Algorithm

  • Loss function, \(L(·)\), to compute the difference between the desired output, \(r^t\), and our approximation to it, \(g(x^t |θ)\), given the current value of the parameters, \(θ\). The approximation error, or loss, is the sum of losses over the individual instances
\[E(θ|X) = \sum_{t} L (r^t, g(x^t |θ))\]
  • Optimization procedure to find \(θ^*\) that minimizes the total error
\[θ^* = \arg\min_{θ} E(θ|X)\]
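One common optimization procedure is batch gradient descent on the squared-error loss from Section 2.6; a sketch with a made-up learning rate and dataset (not the course's prescribed procedure):

```python
# Batch gradient descent for theta = (w1, w0) minimizing the mean squared
# error; learning rate, step count, and data are made-up illustration values.
def gradient_descent(xs, rs, lr=0.05, steps=2000):
    w1, w0 = 0.0, 0.0
    N = len(xs)
    for _ in range(steps):
        # Gradients of (1/N) * sum_t (r^t - (w1*x^t + w0))^2
        g1 = (-2 / N) * sum((r - (w1 * x + w0)) * x for x, r in zip(xs, rs))
        g0 = (-2 / N) * sum(r - (w1 * x + w0) for x, r in zip(xs, rs))
        w1 -= lr * g1
        w0 -= lr * g0
    return w1, w0

w1, w0 = gradient_descent([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w1, 3), round(w0, 3))  # converges toward 2.0 and 1.0
```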

Chapter 3 : Bayesian Decision Theory (Book-1)

unobservable variable

3.2 Classification

Eq. 3.5

3.3 Losses and Risks

--

Chapter 4 : Parametric Methods (Book-1)

Slide: i2ml2e-chap4-v1-0 Parametric Methods

  • How to estimate probabilities from a given training set?

  • Probability density function

  • probability mass function
  • likelihood vs posterior

4.2 Maximum Likelihood Estimation

4.2.2 Multinomial Density

(self study)

4.3 Evaluating an Estimator: Bias and Variance

4.5 Parametric Classification

Fig 4.2, 4.3

4.6 Regression

4.7 Tuning Model Complexity: Bias/Variance Dilemma

Fig 4.5, 4.6

4.8 Model Selection Procedure

fig 4.7

Assignment on Model Selection Criterion

  • AIC & BIC information criteria
  • Minimum description length
  • Bayesian model selection process
  • Bayesian Estimator \(\to\) from Bayes' rule
  • Maximum a Posteriori (MAP) Estimator

Chapter 5 : Multivariate Methods (Book-1)

5.1 Multivariate Data

5.2 Parameter Estimation

5.3 Estimation of Missing Values

5.4 Multivariate Normal Distribution

fig 5.1

All Equations \(\star \star \star\)

5.11

fig 5.2

Chapter 4 : Artificial Neural Networks (Book-2)

Introduction

Biological Motivation

4.4 Perceptron

4.4.1 Representational Power of Perceptrons
4.4.2 The Perceptron Training Rule
4.4.3 Gradient Descent and the Delta Rule
4.4.3.1 Visualization of Hypothesis Space
4.4.3.2 Derivation of the Gradient Descent Rule

Table 4.1

  • Stochastic Approach to Gradient Descent
  • Limitations of the Gradient Descent Algorithm
  • Solutions for the Gradient Descent Algorithm
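The perceptron training rule of Section 4.4.2, \(w_i \leftarrow w_i + \eta(t - o)x_i\), can be sketched on the linearly separable AND function; the dataset and \(\eta = 1\) are our own choices (integer weights keep the run exact):

```python
# Perceptron training rule: w_i <- w_i + eta*(t - o)*x_i, applied over
# repeated epochs. x[0] = 1 is a constant bias input; eta = 1 keeps all
# arithmetic in exact integers for this toy AND dataset.
def train_perceptron(samples, eta=1, epochs=20):
    w = [0, 0, 0]  # bias weight + two input weights
    for _ in range(epochs):
        for x, t in samples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

AND = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train_perceptron(AND)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
print([predict(x) for x, _ in AND])  # [0, 0, 0, 1]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees the rule stops changing the weights after finitely many updates.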

T 4.1
T 4.2

4.5 Multilayer Networks and the Backpropagation Algorithm

fig 4.5

4.5.1 Differentiable Threshold Unit

Why is the perceptron rule unsuitable for the gradient descent algorithm?

Solution: Sigmoid Unit / Logistic Function
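The sigmoid unit is differentiable where the thresholded perceptron output is not, which is what gradient descent requires; a minimal sketch, including the identity \(\sigma'(y) = \sigma(y)(1 - \sigma(y))\) used in the backpropagation derivation:

```python
import math

# Sigmoid (logistic) unit: a smooth, differentiable squashing function,
# unlike the perceptron's hard threshold.
def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    s = sigmoid(y)
    return s * (1.0 - s)  # identity used when deriving backpropagation

print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5 0.25
```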

4.5.2 The Backpropagation Algorithm

Table 4.2

4.5.3 Derivation of the Backpropagation Rule

figures from 4.6.4 and 4.6.5

Chapter 9 : Convolutional Networks (Book-3)

Neural Network \(\to\) DL

  • Forward Propagation
  • Backward Propagation

9.1 The Convolution Operation

fig 9.2

fig 9.11

  • Input image \(\to\) Kernel \(\to\) Convolution \(\to\) ReLU
  • Feature Mapping
  • Feature Extraction
  • Classification
  • Probabilistic Distribution \(\to\) Class/Output (Softmax)

Definitions:

  • Stride
  • Pooling
  • Padding \(\to\) solves the border problem

Example (valid convolution, input size \(i\), kernel size \(k\)):
\[ o = (i - k) + 1 \]
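The o = (i - k) + 1 rule above is the no-padding, stride-1 case; with padding \(p\) and stride \(s\) the standard generalization is \(o = \lfloor (i - k + 2p)/s \rfloor + 1\). A small helper (our own naming):

```python
# Convolution output size along one axis: valid convolution when p=0, s=1,
# reducing to o = (i - k) + 1; the general form adds padding p and stride s.
def conv_output_size(i, k, p=0, s=1):
    return (i - k + 2 * p) // s + 1

print(conv_output_size(5, 3))            # 3 (valid convolution)
print(conv_output_size(5, 3, p=1))       # 5 ("same" padding keeps the size)
print(conv_output_size(7, 3, p=1, s=2))  # 4
```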

Assignment

Different CNN Variants, Different Activation Functions

Deep Learning for Sequence Data

Recurrent Neural Network (RNN)

Problem:

  • long-term dependency problem
  • An unrolled Recurrent Neural Network \(\to\) fig
    • small gap: "The clouds are in the ___"
    • more context: _, _, _

Understanding LSTM Networks (Sheet-1)

Figures

LSTM Networks (Book + Website)

The repeating module in a standard RNN contains a single layer.

Core idea behind LSTM

Gates \(\to\) let information through or remove it

Step-by-step LSTM walk-through

Variants of LSTM \(\to\) Assignment

Muhtasim

References

Topics of Lecture - 01

  • Overfitting-Underfitting
  • Training and Testing Data
  • Model Building
  • Graphs of polynomial function
  • Example: Linear Regression
  • Overfit-Underfit
  • Example: Logistic Regression
  • Training vs Testing Performance
  • Bias-Variance Tradeoff
    • Error derivation won't be in exam.
  • Figure-1
  • Figure-2

Topics of Lecture - 02

Model Selection

  • Validation

Lecture8.pdf

  • Estimating generalization error
  • till: Page-31 (Practical Issues in Cross-validation)

Cluster Analysis (Slide)

  • till: Slide-66 (Hierarchical Clustering: Problems and Limitations)

Clustering.pdf

  • till: Hierarchical Clustering (relevant page)