21CS743

 DL Model

Deep learning can be explained in three points:

  1. Learning from Experience: Deep learning enables computers to learn and understand the world by gathering knowledge from experience, rather than requiring humans to manually specify all necessary information.

  2. Hierarchical Concept Building: It organizes concepts in a hierarchical structure, where complex concepts are built from simpler ones. Each layer in this hierarchy represents concepts in terms of the simpler ones below it.

  3. Layered Representation: The approach involves deep layers of concepts, allowing the computer to represent complex ideas (like recognizing an image of a person) by combining simpler features, such as edges, corners, and contours.

A deep learning (DL) model can be explained as follows:

  1. Layered Structure: A deep learning model maps raw sensory input data, like pixel values from an image, to an object identity by breaking the task into a series of nested simple mappings.

    • Each mapping is represented by a layer in the model.
    • The input is presented at the visible layer, which contains the observable variables, and multiple hidden layers extract increasingly abstract features from the input.

  2. Feature Extraction Process: The first hidden layer identifies simple features like edges by comparing pixel brightness.

    • The second hidden layer recognizes corners and extended contours based on the first layer's description of edges.
    • The third hidden layer detects parts of objects by finding specific collections of contours and corners.

  3. Object Recognition: Finally, the deep learning model uses the abstracted features from the hidden layers (corners, contours, and object parts) to recognize the objects present in the image.

    • The values of each hidden layer are determined by the model, and the deeper the layer, the more abstract the features it extracts.

---------------------------------------------------

Deep feedforward networks, also known as feedforward neural networks or multilayer perceptrons (MLPs), are one of the core models in deep learning. The primary goal of these networks is to approximate some function $f^*$. For example, in the case of classification, the network maps an input $x$ to an output category $y$, i.e., $y = f^*(x)$.

Key Concepts of Deep Feedforward Networks:

  1. Feedforward Structure:

    • The term "feedforward" refers to the fact that information flows only in one direction—from the input layer, through the intermediate layers (hidden layers), and finally to the output layer.
    • There are no feedback connections, meaning the output is not fed back into the network. When feedback connections are present, the network is known as a recurrent neural network (RNN).
  2. Function Approximation:

    • A deep feedforward network defines a mapping $y = f(x; \theta)$, where $\theta$ represents the model parameters. The learning process involves finding the set of parameters $\theta$ that gives the best approximation of $f^*$.
  3. Layer Structure:

    • Feedforward networks are composed of multiple layers of functions. These functions are often organized in a chain structure, meaning one function's output becomes the input to the next function, for example: $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$
    • Here $f^{(1)}$ is called the first layer of the network and $f^{(2)}$ the second layer; the final function in the chain, $f^{(3)}$, is the output layer, and the layers between the input and the output are called hidden layers. (A minimal code sketch of this chain appears after this list.)
  4. Depth and Width:

    • The number of layers in the network defines its depth, which is why these models are referred to as deep learning models.
    • Each hidden layer is typically vector-valued, and the number of units (or neurons) in each layer determines the width of the network.
  5. Hidden Layers:

    • During training, the training data specifies what the output layer should do (for example, mapping input $x$ to output $y$). However, the behavior of the hidden layers is not explicitly specified.
    • The learning algorithm decides how to use the hidden layers to produce the desired output, making these layers "hidden" in the sense that their behavior is not directly observed from the data.
  6. Neural Inspiration:

    • Feedforward networks are loosely inspired by neuroscience. Each unit (or neuron) in the hidden layer computes an activation based on inputs from the previous layer, similar to how biological neurons function.
    • However, modern neural networks are more driven by mathematical and engineering principles rather than aiming to model the brain perfectly.
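
To make the chain structure and hidden layers concrete, here is a minimal sketch in Python with NumPy. The layer sizes, ReLU hidden activations, and softmax output are illustrative assumptions, not choices prescribed above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# Illustrative sizes: 4 input features, two hidden layers of width 5, 3 classes.
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 5)), np.zeros(5)
W3, b3 = rng.normal(size=(3, 5)), np.zeros(3)

def f1(x): return relu(W1 @ x + b1)      # first layer (hidden)
def f2(h): return relu(W2 @ h + b2)      # second layer (hidden)
def f3(h): return softmax(W3 @ h + b3)   # output layer

x = rng.normal(size=4)               # one input example
y_hat = f3(f2(f1(x)))                # f(x) = f^(3)(f^(2)(f^(1)(x)))
print(y_hat, y_hat.sum())            # class probabilities, summing to 1
```

In this sketch the depth is 3 (three functions in the chain) and the width of each hidden layer is 5.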

Overcoming the Limitations of Linear Models:

  • Linear models (like logistic regression) are efficient and reliable, but they are limited to representing linear functions.
  • To handle more complex (nonlinear) interactions, a nonlinear transformation of the input $x$ is applied, often denoted $\phi(x)$. This results in a richer, nonlinear feature representation.
  • Deep learning models learn this nonlinear feature mapping $\phi(x; \theta)$ from data, as opposed to manually designing or applying predefined transformations.

Design Decisions:

Several important decisions must be made when designing a deep feedforward network, including:

  1. Optimizer: Choosing the optimization algorithm to adjust the network parameters.
  2. Cost Function: Selecting the loss function that measures how far the network's output is from the target output.
  3. Activation Functions: Deciding on the activation functions that will be used in the hidden layers to introduce non-linearity into the model.
  4. Network Architecture: Determining how many layers the network should have, how these layers are connected, and how many units (neurons) should be in each layer.
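
As a hedged illustration, the sketch below (assuming PyTorch) makes one example choice for each of the four decisions; SGD, cross-entropy, ReLU, and the two-hidden-layer architecture are stand-ins, not the only valid options.

```python
import torch
import torch.nn as nn

# 4. Network architecture: two hidden layers of 32 units each (illustrative).
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),   # 3. activation function: ReLU
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 3),               # 3 output classes (assumed)
)

# 2. Cost function: cross-entropy for classification.
loss_fn = nn.CrossEntropyLoss()

# 1. Optimizer: stochastic gradient descent.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One illustrative training step on random data.
x = torch.randn(8, 10)              # batch of 8 examples
y = torch.randint(0, 3, (8,))       # random class labels
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()                     # backpropagation computes the gradients
optimizer.step()                    # the optimizer updates the parameters
```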

Backpropagation:

  • The backpropagation algorithm is used to efficiently compute the gradients of the network's parameters. This is essential for optimizing the model during training.
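
As a minimal sketch of what backpropagation computes, consider a one-hidden-layer network with a sigmoid hidden layer and squared error (NumPy; all sizes are illustrative assumptions). The chain rule is applied layer by layer, and one gradient entry is verified against a numerical estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=3)                 # input
y = 1.0                                # target
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=4)

# Forward pass.
h = sigmoid(W1 @ x)                    # hidden layer
y_hat = W2 @ h                         # scalar output
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: apply the chain rule layer by layer.
dy = y_hat - y                         # dL/dy_hat
dW2 = dy * h                           # dL/dW2
dh = dy * W2                           # dL/dh
dW1 = np.outer(dh * h * (1 - h), x)    # dL/dW1, using sigmoid'(z) = h(1-h)

# Check one entry against a numerical gradient.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
loss_p = 0.5 * (W2 @ sigmoid(W1p @ x) - y) ** 2
print(dW1[0, 0], (loss_p - loss) / eps)   # the two values should closely agree
```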

Historical Perspective:

  • While feedforward networks introduced the concept of hidden layers, modern neural networks have evolved to apply similar principles to stochastic mappings, feedback-based functions, and probability distributions. Deep feedforward networks are just one of the many models in the broader field of deep learning.

In summary, deep feedforward networks are powerful models for approximating complex functions, consisting of multiple layers of transformations. They represent a major advancement over traditional linear models by learning feature representations automatically, rather than relying on manual feature engineering.

----------------------------------

Regularization: Definition and Purpose

  • Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the loss function.
  • It ensures that models do not become overly complex by constraining the magnitude of feature weights.
  • This leads to better generalization, meaning the model performs well on unseen data.

Why is Regularization Used?

  • Overfitting Issue: A model that is too complex captures noise instead of meaningful patterns. This results in high accuracy on training data but poor performance on test data.
  • Prevents Model Complexity: It discourages large weight values, forcing the model to find simpler relationships in data.
  • Improves Generalization: A well-regularized model balances bias and variance, ensuring it performs well on both training and test datasets.

How Does Regularization Prevent Overfitting?

  • Regularization modifies the loss function by adding a term that penalizes large coefficients.
  • This penalty shrinks feature weights, reducing sensitivity to noise and preventing the model from memorizing the dataset.
  • It ensures that the model does not rely too heavily on specific features, leading to more robust predictions.

Types of Regularization: L1 and L2

1. L1 Regularization (Lasso Regression)

✅ Adds the absolute value of feature weights as a penalty term in the loss function.
✅ Encourages sparse models by forcing some coefficients to become exactly zero.
✅ Helps in feature selection, as irrelevant features are removed automatically.
✅ Works well when only a few features are truly important for predictions.

🔹 Mathematical Formula (Lasso Regularization Term):

$\lambda \sum_i |w_i|$

where $w_i$ are the model weights and $\lambda$ is the regularization strength.


2. L2 Regularization (Ridge Regression)

✅ Adds the squared value of feature weights as a penalty term.
✅ Prevents large coefficients but does not eliminate any features (no sparsity).
✅ Distributes weight more evenly among features, making the model more stable.
✅ Works well when many features contribute to the prediction.

🔹 Mathematical Formula (Ridge Regularization Term):

$\lambda \sum_i w_i^2$

where $w_i$ are the model weights and $\lambda$ is the regularization strength.
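
The contrast between the two penalties can be seen in a short hedged sketch (assuming scikit-learn; the synthetic data and the alpha values, which play the role of $\lambda$, are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data: 10 features, but only the first 2 actually matter.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# L1 typically drives the 8 irrelevant coefficients to exactly zero,
# while L2 only shrinks them toward zero.
print("Lasso:", np.round(lasso.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
```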


Key Differences Between L1 and L2 Regularization

| Feature | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty term | $\lambda \sum_i \lvert w_i \rvert$ | $\lambda \sum_i w_i^2$ |
| Effect on weights | Can shrink some weights to exactly zero | Shrinks weights toward zero, but not exactly zero |
| Feature selection | Yes (removes irrelevant features) | No (keeps all features) |
| Best use case | When only a few features are important | When many features contribute |

Conclusion

  • Regularization is essential to prevent overfitting and improve model generalization.
  • L1 Regularization (Lasso) is useful for feature selection, while L2 Regularization (Ridge) stabilizes the model without eliminating features.
  • Choosing the right technique depends on whether you need sparse feature selection (L1) or better distribution of feature importance (L2).
-----------------------------------------

Here’s a structured breakdown of historical trends in deep learning across four categories:

1. Philosophical Viewpoints and Historical Waves

Deep learning has evolved through different philosophical paradigms:

  • Cybernetics (1940s–1960s): Early neural network models, such as the McCulloch-Pitts Neuron and Perceptron, were inspired by neuroscience and sought to model biological learning.
  • Connectionism (1980s–1990s): This wave introduced concepts like distributed representation and parallel distributed processing, emphasizing how neurons work together in learning.
  • Modern Deep Learning (2006–present): The resurgence of deep learning was fueled by improved algorithms, larger datasets, and better hardware, leading to breakthroughs in AI applications.

2. Amount of Training Data

  • Early models struggled due to limited data, which restricted their generalization ability.
  • The digital revolution and internet era drastically increased the availability of labeled data, making large-scale training possible.
  • The rise of “Big Data” allowed deep learning models to achieve human-level performance in various tasks, requiring millions of labeled examples for optimal accuracy.

3. Model Size and Computational Power

  • Initial models, such as the Perceptron, had simple architectures and few parameters.
  • The 1980s and 1990s saw the introduction of multi-layer perceptrons and recurrent networks, including LSTMs (1997), though training them was difficult.
  • Modern deep learning leverages massive architectures, such as deep convolutional and transformer networks, made possible by GPU acceleration and distributed computing.
  • Model size continues to grow exponentially, with architectures now reaching billions of parameters, enabling sophisticated tasks like self-programming neural networks and reinforcement learning systems.

4. Complexity and Real-World Applications

  • Early neural networks could only solve simple classification tasks but struggled with complex patterns (e.g., XOR problem).
  • The 1990s and 2000s saw improvements in training deep networks, allowing models to handle sequence-based tasks like speech recognition and NLP.
  • Today, deep learning is applied across industries, including image recognition, robotics, gaming (e.g., DeepMind’s Atari AI), and autonomous systems.
  • Advanced architectures, such as Neural Turing Machines, demonstrate self-programming abilities, pointing toward future general AI applications.


----------------------

Machine learning algorithms can be broadly categorized as unsupervised or supervised, according to the kind of experience they are allowed to have during the learning process.


 Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. 

In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly as in density estimation or implicitly for tasks like synthesis or denoising.

 Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.


Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.

For example, the Iris dataset is annotated with the species of each iris plant.

A supervised learning algorithm can study the Iris dataset and learn to classify iris plants into three different species based on their measurements, as sketched below.
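
A minimal sketch of this supervised setup, assuming scikit-learn (the choice of logistic regression and of the train/test split are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)    # 150 plants, 4 measurements, 3 species
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Learn to predict the species label from the measurements.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```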


Roughly speaking, unsupervised learning involves observing several examples of a random vector $x$ and attempting to implicitly or explicitly learn the probability distribution $p(x)$, or some interesting properties of that distribution,

while supervised learning involves observing several examples of a random vector $x$ and an associated value or vector $y$, and learning to predict $y$ from $x$, usually by estimating $p(y \mid x)$.


The term supervised learning originates from the view of the target $y$ being provided by an instructor or teacher who shows the machine learning system what to do.

In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.
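
As an unsupervised counterpart, the clustering role mentioned above can be sketched on the same Iris measurements, with the species labels deliberately ignored (assuming scikit-learn; k = 3 clusters is an illustrative choice):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)    # labels loaded but deliberately ignored

# Divide the dataset into clusters of similar examples, with no teacher.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])           # cluster assignment of the first 10 plants
```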

----------------------------------

Bidirectional RNNs (BRNNs)

  1. Causal Structure of RNNs:

    • Traditional RNNs process sequences in a causal manner, meaning the state $h^{(t)}$ depends only on the past inputs $x^{(1)}, \dots, x^{(t-1)}$ and the present input $x^{(t)}$.
  2. Need for Bidirectional Processing:

    • Many applications require context from both past and future inputs to make accurate predictions (e.g., speech recognition, handwriting recognition).
    • In speech recognition, the correct phoneme interpretation may depend on both past and future phonemes.
    • In handwriting recognition, correct word recognition may require analyzing surrounding words.
  3. Introduction of Bidirectional RNNs:

    • Bidirectional RNNs (BRNNs) were introduced by Schuster and Paliwal (1997) to address this need.
    • They have been highly successful in sequence-related tasks like handwriting recognition (Graves et al., 2008, 2009), speech recognition (Graves and Schmidhuber, 2005; Graves et al., 2013), and bioinformatics (Baldi et al., 1999).
  4. Structure of BRNNs:

    • BRNNs consist of two RNNs:
      • Forward RNN: moves forward through time from the beginning of the sequence, producing hidden states $h^{(t)}$.
      • Backward RNN: moves backward through time from the end of the sequence, producing hidden states $g^{(t)}$.
    • The output at each time step depends on both past and future inputs. (A minimal code sketch follows this list.)
  5. Advantage Over Traditional RNNs:

    • Unlike conventional RNNs, BRNNs allow capturing dependencies from both directions without needing a fixed-size window of future context.
    • This avoids manually setting a look-ahead buffer, as required in regular RNNs.
  6. Extension to 2D Inputs (Images):

    • The concept can be extended to images by using four directional RNNs (up, down, left, right).
    • At each point $(i, j)$ in a 2-D grid, an output $O_{i,j}$ captures both local and long-range dependencies.
    • While computationally more expensive than convolutional networks, RNNs enable long-range lateral interactions in feature maps (Visin et al., 2015; Kalchbrenner et al., 2015).
  7. Mathematical Formulation:

    • For the 2-D case, the BRNN equations can be rewritten to show that the operation applies a convolution, computing the bottom-up input to each layer, before the recurrent propagation across the feature map.
    • This preprocessing step extracts bottom-up input before the lateral interactions occur within the recurrent network.
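
A minimal sketch of the two-RNN structure, assuming PyTorch (whose `nn.RNN` runs both the forward and backward passes when `bidirectional=True`; all sizes are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative sizes: 10-dimensional inputs, 16 hidden units per direction.
brnn = nn.RNN(input_size=10, hidden_size=16, bidirectional=True,
              batch_first=True)

x = torch.randn(1, 25, 10)           # batch of 1 sequence, 25 time steps
out, h_n = brnn(x)

# At each time step the output concatenates the forward state h(t), which
# summarizes the past, and the backward state g(t), which summarizes the
# future: hence 2 * 16 = 32 features per step.
print(out.shape)                     # torch.Size([1, 25, 32])
h_t, g_t = out[..., :16], out[..., 16:]
```
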
----------------------------------

Linear regression is one of the simplest machine learning algorithms, designed to solve regression problems. The goal is to predict a scalar value $y \in \mathbb{R}$ from input features represented as a vector $x \in \mathbb{R}^n$; the output is a linear function of the input. We aim to predict $\hat{y} = w^T x$, where $w \in \mathbb{R}^n$ is a vector of parameters, also known as weights, which control how the model makes predictions.

Components of Linear Regression:

  1. Weights ($w$): These are parameters that determine how much each feature of the input $x$ contributes to the prediction. If the weight $w_i$ for feature $x_i$ is positive, increasing $x_i$ increases the prediction $\hat{y}$; if the weight is negative, increasing $x_i$ decreases the prediction. A zero weight means the feature has no effect.

  2. Prediction Function: The linear regression model predicts $\hat{y}$ as:

    $\hat{y} = w^T x = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

    where each $w_i$ represents the contribution of feature $x_i$ to the final prediction.

Performance Measure:

The performance of the model is typically evaluated using the mean squared error (MSE). Given a set of test inputs $X_{\text{test}}$ with true outputs $y_{\text{test}}$, the MSE between the predictions $\hat{y}_{\text{test}}$ and the actual values is:

$\text{MSE}_{\text{test}} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$

where $m$ is the number of test examples. This error decreases as the predictions get closer to the actual values.

Training the Model:

To train the linear regression model, the goal is to minimize the MSE on the training set. This means finding the weights $w$ that minimize the error between the predictions $\hat{y}_{\text{train}}$ and the actual training values $y_{\text{train}}$. Mathematically, this optimization problem is solved by setting the gradient of the MSE with respect to $w$ to zero:

$\nabla_w \text{MSE}_{\text{train}} = 0$

Solving this gives the normal equations:

$w = (X_{\text{train}}^T X_{\text{train}})^{-1} X_{\text{train}}^T y_{\text{train}}$

This equation provides the optimal weights for the model that minimize the MSE.

Intercept Term:

Linear regression can be extended by adding an intercept term (or bias) $b$, so that the prediction becomes:

$\hat{y} = w^T x + b$

This allows the fitted line to not necessarily pass through the origin: the intercept $b$ is the value the model predicts when all features are zero.
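
A hedged numerical sketch of the closed-form fit in Python with NumPy (the synthetic one-feature dataset and the appended column of ones that absorbs the intercept are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: y = 2x + 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 2.0 * x[:, 0] + 1.0 + 0.1 * rng.normal(size=100)

# Append a column of ones so the last weight acts as the intercept b.
X_train = np.hstack([x, np.ones((100, 1))])

# Normal equations: w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y)
print("weight, intercept:", w)            # approximately [2.0, 1.0]

# Mean squared error of the fit on the training set.
mse_train = np.mean((X_train @ w - y) ** 2)
print("training MSE:", mse_train)
```

Using `np.linalg.solve` rather than explicitly inverting $X_{\text{train}}^T X_{\text{train}}$ evaluates the same normal equations in a numerically safer way.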

Conclusion:

Linear regression is a fundamental machine learning algorithm that provides a simple yet powerful way to model relationships between input features and output values. Though limited, it offers a foundation for understanding more complex learning algorithms and their behavior.

--------------------------------

Principal Component Analysis (PCA) can be explained as follows:

  1. PCA as Compression & Representation Learning: PCA provides a way to compress data and can be viewed as an unsupervised learning algorithm that learns a lower-dimensional representation of the data.

  2. Dimensionality Reduction: PCA reduces the dimensionality of the original input while preserving as much information as possible.

  3. Uncorrelated Features: The new representation learned by PCA ensures that its elements are not linearly correlated with each other, making it a step toward statistical independence.

  4. Orthogonal Transformation: PCA applies an orthogonal linear transformation to project the data onto a new space.

  5. First Principal Component: The best one-dimensional representation that minimizes reconstruction error corresponds to the first principal component.

  6. Covariance Matrix: The unbiased sample covariance matrix of the dataset $X$ is given by:

    $\text{Var}[x] = \frac{1}{m-1} X^T X$

    where $X$ is the mean-centered design matrix.

  7. Linear Transformation Representation: PCA transforms each data point as:

    $z = x^T W$

    ensuring that $\text{Var}[z]$ is diagonal.

  8. Eigenvector Decomposition: The principal components are found by solving:

    $X^T X W = W \Lambda$

    where $W$ consists of the eigenvectors of $X^T X$.

  9. Singular Value Decomposition (SVD): An alternative way to derive PCA is via the SVD:

    $X = U \Sigma W^T$

    where the right singular vectors $W$ correspond to the principal components (a numerical sketch follows this list).

  10. Diagonal Covariance: By transforming $X$ with $W$, the new representation has a diagonal covariance matrix:

    $\text{Var}[z] = \frac{1}{m-1} \Sigma^2$

    ensuring that the elements of $z$ are mutually uncorrelated.

  11. Disentangling Variations: PCA finds a rotation (described by $W$) that aligns the principal axes of variance with the basis of the new representation space.

  12. Limitation of PCA: While PCA removes linear correlations, it cannot capture more complex dependencies between features. More advanced methods are needed for fully disentangling underlying factors in the data.
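
A numerical sketch of steps 9 and 10, assuming NumPy (the random correlated data and the choice of keeping a single component are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D data, then mean-center to form the design matrix X.
A = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.2], [0.0, 0.4]])
X = A - A.mean(axis=0)

# SVD: X = U Sigma W^T; the columns of W are the principal components.
U, s, Wt = np.linalg.svd(X, full_matrices=False)
W = Wt.T

Z = X @ W                           # rotated representation
cov_z = Z.T @ Z / (len(X) - 1)      # Var[z] = Sigma^2 / (m - 1): diagonal
print(np.round(cov_z, 4))           # off-diagonal entries are ~ 0 (uncorrelated)

# Compression: keep only the first principal component.
Z1 = X @ W[:, :1]
X_reconstructed = Z1 @ W[:, :1].T   # best rank-1 reconstruction in the MSE sense
print("reconstruction MSE:", np.mean((X - X_reconstructed) ** 2))
```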


------------------------------
