
LSTM and GRU

 






Long Short-Term Memory (LSTM) Networks

LSTMs are a type of Recurrent Neural Network (RNN) designed to handle sequential data with long-term dependencies.

Key Features:

Cell State: Preserves information over long periods.

Gates: Control information flow (input, output, and forget gates).

Hidden State: Temporary memory for short-term information.

Related Technologies:

Recurrent Neural Networks (RNNs): Basic architecture for sequential data.

Gated Recurrent Units (GRUs): Simplified version of LSTMs.

Bidirectional RNNs/LSTMs: Process input sequences in both directions.

Encoder-Decoder Architecture: Used for sequence-to-sequence tasks.

Real-World Applications:

Language Translation

Speech Recognition

Text Generation

Time Series Forecasting


Gated Recurrent Unit (GRU) Networks

GRUs are an alternative to LSTMs, designed to be faster and more efficient while still capturing long-term dependencies.

Key Differences from LSTMs:

Simplified Architecture: Fewer gates (update and reset) and fewer state vectors.

Faster Computation: Reduced number of parameters.

Technical Details for LSTMs and GRUs:


LSTM Mathematical Formulation:

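Let x_t be the input at time t, h_t be the hidden state, and c_t be the cell state. In the equations below, W * x_t and U * h_(t-1) are matrix-vector products; every other * is an element-wise product.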

Input Gate: i_t = sigmoid(W_i * x_t + U_i * h_(t-1) + b_i)

Forget Gate: f_t = sigmoid(W_f * x_t + U_f * h_(t-1) + b_f)

Cell State Update: c_t = f_t * c_(t-1) + i_t * tanh(W_c * x_t + U_c * h_(t-1) + b_c)

Output Gate: o_t = sigmoid(W_o * x_t + U_o * h_(t-1) + b_o)

Hidden State Update: h_t = o_t * tanh(c_t)


Parameters:

W_i, W_f, W_c, W_o: Input weight matrices (applied to x_t) for the input gate, forget gate, candidate cell state, and output gate.

U_i, U_f, U_c, U_o: Recurrent weight matrices (applied to the previous hidden state h_(t-1)).

b_i, b_f, b_c, b_o: Bias vectors.
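The equations above describe a single time step; a full forward pass applies them repeatedly, feeding each h_t and c_t into the next step. Here is a minimal NumPy sketch of that loop. The helper names (`sigmoid`, `lstm_step`, `init_lstm_params`), the random initialisation, and the chosen sizes are assumptions made for illustration, not part of any library.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names like 'W_i' to NumPy arrays."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    g_t = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate cell state
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate
    c_t = f_t * c_prev + i_t * g_t        # element-wise cell state update
    h_t = o_t * np.tanh(c_t)              # element-wise hidden state update
    return h_t, c_t

def init_lstm_params(input_dim, hidden_dim, rng):
    """Hypothetical initialiser: small random weights, zero biases."""
    p = {}
    for gate in "ifco":
        p[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        p[f"U_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        p[f"b_{gate}"] = np.zeros(hidden_dim)
    return p

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5
params = init_lstm_params(input_dim, hidden_dim, rng)
xs = rng.normal(size=(seq_len, input_dim))   # one input vector per time step

h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
hidden_states = []
for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)      # state is carried to the next step
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)
print(hidden_states.shape)  # (5, 4): one hidden vector per time step
```

Because h_t is an output gate value in (0, 1) times a tanh, every entry of the hidden state stays strictly between -1 and 1, whatever the weights are.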


GRU Mathematical Formulation:

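Let x_t be the input at time t and h_t be the hidden state. As in the LSTM equations, a weight matrix times a vector is a matrix-vector product, while products between gate vectors and state vectors (such as r_t * h_(t-1)) are element-wise.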

Update Gate: z_t = sigmoid(W_z * x_t + U_z * h_(t-1) + b_z)

Reset Gate: r_t = sigmoid(W_r * x_t + U_r * h_(t-1) + b_r)

Hidden State Update: h_t = (1 - z_t) * h_(t-1) + z_t * tanh(W_h * x_t + U_h * (r_t * h_(t-1)) + b_h)

Parameters:

W_z, W_r, W_h: Input weight matrices (applied to x_t) for the update gate, reset gate, and candidate hidden state.

U_z, U_r, U_h: Recurrent weight matrices (applied to the previous hidden state h_(t-1)).

b_z, b_r, b_h: Bias vectors.
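The hidden state update is an interpolation between the old state and a candidate: where z_t is close to 0 the old value is copied through, and where z_t is close to 1 it is overwritten. The NumPy sketch below illustrates this under stated assumptions: `gru_step` and `sigmoid` are helper names chosen here, the weights are random, and the large negative bias is an artificial way of forcing the update gate shut.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step, following the formulation above."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    return (1.0 - z_t) * h_prev + z_t * h_cand

rng = np.random.default_rng(1)
input_dim, hidden_dim = 3, 4
p = {}
for gate in "zrh":
    p[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
    p[f"U_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
    p[f"b_{gate}"] = np.zeros(hidden_dim)

x_t = rng.normal(size=input_dim)
h_prev = rng.uniform(-1.0, 1.0, size=hidden_dim)

h_open = gru_step(x_t, h_prev, p)

# Force the update gate towards 0: the previous state passes through unchanged.
p_closed = dict(p, b_z=np.full(hidden_dim, -20.0))
h_closed = gru_step(x_t, h_prev, p_closed)

print(np.allclose(h_closed, h_prev, atol=1e-6))
print(np.round(h_open - h_prev, 3))  # with the gate partly open, the state moves
```

This copy-through behaviour is what lets a GRU carry information across many time steps without a separate cell state.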


Here's a small mathematical example for an LSTM network:

Example:

Suppose we have an LSTM network with:

Input dimension: 1

Hidden dimension: 2

Output: the hidden state h_t is used directly as the output (dimension 2), with no separate output layer.

Because the input is a scalar, each W is a vector with one entry per hidden unit, and each U is a 2 x 2 matrix. U * h_(t-1) is a matrix-vector product; products between gate vectors and state vectors are element-wise. Results are shown rounded to two decimals, but each step carries the unrounded values forward, so the last digit can differ slightly from what the rounded figures give.

Input at time t (x_t)

x_t = 0.5

Previous Hidden State (h_(t-1)) and Cell State (c_(t-1))

h_(t-1) = [0.2, 0.3]

c_(t-1) = [0.4, 0.5]

Weight Matrices and Bias Vectors

W_i = [0.1, 0.3]

W_f = [0.5, 0.7]

W_c = [0.9, 1.1]

W_o = [1.3, 1.5]

U_i = [[1.7, 1.8], [1.9, 2.0]]

U_f = [[2.1, 2.2], [2.3, 2.4]]

U_c = [[2.5, 2.6], [2.7, 2.8]]

U_o = [[2.9, 3.0], [3.1, 3.2]]

b_i = [0.1, 0.2]

b_f = [0.3, 0.4]

b_c = [0.5, 0.6]

b_o = [0.7, 0.8]


Calculations


Input Gate

i_t = sigmoid(W_i * x_t + U_i * h_(t-1) + b_i)

= sigmoid([0.1, 0.3] * 0.5 + [[1.7, 1.8], [1.9, 2.0]] * [0.2, 0.3] + [0.1, 0.2])

= sigmoid([0.05 + 0.88 + 0.1, 0.15 + 0.98 + 0.2])

= sigmoid([1.03, 1.33])

= [0.74, 0.79]


Forget Gate

f_t = sigmoid(W_f * x_t + U_f * h_(t-1) + b_f)

= sigmoid([0.5, 0.7] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * [0.2, 0.3] + [0.3, 0.4])

= sigmoid([0.25 + 1.08 + 0.3, 0.35 + 1.18 + 0.4])

= sigmoid([1.63, 1.93])

= [0.84, 0.87]


Cell State Update

c_t = f_t * c_(t-1) + i_t * tanh(W_c * x_t + U_c * h_(t-1) + b_c)

= [0.84, 0.87] * [0.4, 0.5] + [0.74, 0.79] * tanh([0.9, 1.1] * 0.5 + [[2.5, 2.6], [2.7, 2.8]] * [0.2, 0.3] + [0.5, 0.6])

= [0.33, 0.44] + [0.74, 0.79] * tanh([0.45 + 1.28 + 0.5, 0.55 + 1.38 + 0.6])

= [0.33, 0.44] + [0.74, 0.79] * tanh([2.23, 2.53])

= [0.33, 0.44] + [0.74, 0.79] * [0.98, 0.99]

= [0.33, 0.44] + [0.72, 0.78]

= [1.05, 1.22]


Output Gate

o_t = sigmoid(W_o * x_t + U_o * h_(t-1) + b_o)

= sigmoid([1.3, 1.5] * 0.5 + [[2.9, 3.0], [3.1, 3.2]] * [0.2, 0.3] + [0.7, 0.8])

= sigmoid([0.65 + 1.48 + 0.7, 0.75 + 1.58 + 0.8])

= sigmoid([2.83, 3.13])

= [0.94, 0.96]

Hidden State Update

h_t = o_t * tanh(c_t)

= [0.94, 0.96] * tanh([1.05, 1.22])

= [0.94, 0.96] * [0.78, 0.84]

= [0.74, 0.80]

Output

y_t = h_t

= [0.74, 0.80]

This completes the LSTM calculation for one time step.
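Hand arithmetic like this is easy to get wrong, so it helps to check a single LSTM step numerically. The following NumPy sketch does that for x_t = 0.5, h_(t-1) = [0.2, 0.3], c_(t-1) = [0.4, 0.5] and the U matrices and biases of the example. It assumes the input weights are length-2 vectors with one entry per hidden unit, which is what a scalar input requires; `sigmoid` and `pre` are helper names introduced here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x_t = 0.5
h_prev = np.array([0.2, 0.3])
c_prev = np.array([0.4, 0.5])

# Assumption: input weights are length-2 vectors because x_t is a scalar.
W = {"i": np.array([0.1, 0.3]), "f": np.array([0.5, 0.7]),
     "c": np.array([0.9, 1.1]), "o": np.array([1.3, 1.5])}
U = {"i": np.array([[1.7, 1.8], [1.9, 2.0]]),
     "f": np.array([[2.1, 2.2], [2.3, 2.4]]),
     "c": np.array([[2.5, 2.6], [2.7, 2.8]]),
     "o": np.array([[2.9, 3.0], [3.1, 3.2]])}
b = {"i": np.array([0.1, 0.2]), "f": np.array([0.3, 0.4]),
     "c": np.array([0.5, 0.6]), "o": np.array([0.7, 0.8])}

def pre(gate):
    """Pre-activation W * x_t + U @ h_(t-1) + b for one gate."""
    return W[gate] * x_t + U[gate] @ h_prev + b[gate]

i_t = sigmoid(pre("i"))
f_t = sigmoid(pre("f"))
g_t = np.tanh(pre("c"))            # candidate cell state
o_t = sigmoid(pre("o"))
c_t = f_t * c_prev + i_t * g_t
h_t = o_t * np.tanh(c_t)

for name, value in [("i_t", i_t), ("f_t", f_t), ("g_t", g_t),
                    ("o_t", o_t), ("c_t", c_t), ("h_t", h_t)]:
    print(name, np.round(value, 2))
```

With weights this large the gates sit close to 1 and the tanh terms close to saturation, so the example mostly shows the mechanics rather than typical trained values.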


Here's a small mathematical example for a GRU (Gated Recurrent Unit) network:

Example:

Suppose we have a GRU network with:

Input dimension: 1

Hidden dimension: 2

Because the input is a scalar, each W is a vector with one entry per hidden unit, and each U is a 2 x 2 matrix. A U matrix times a vector is a matrix-vector product; products between gate vectors and state vectors are element-wise. Results are shown rounded to two decimals, but each step carries the unrounded values forward, so the last digit can differ slightly from what the rounded figures give.

Input at time t (x_t)

x_t = 0.5

Previous Hidden State (h_(t-1))

h_(t-1) = [0.2, 0.3]

Weight Matrices and Bias Vectors

W_z = [0.1, 0.3]

W_r = [0.5, 0.7]

W_h = [0.9, 1.1]

U_z = [[1.3, 1.4], [1.5, 1.6]]

U_r = [[1.7, 1.8], [1.9, 2.0]]

U_h = [[2.1, 2.2], [2.3, 2.4]]

b_z = [0.1, 0.2]

b_r = [0.3, 0.4]

b_h = [0.5, 0.6]


Calculations


Update Gate

z_t = sigmoid(W_z * x_t + U_z * h_(t-1) + b_z)

= sigmoid([0.1, 0.3] * 0.5 + [[1.3, 1.4], [1.5, 1.6]] * [0.2, 0.3] + [0.1, 0.2])

= sigmoid([0.05 + 0.68 + 0.1, 0.15 + 0.78 + 0.2])

= sigmoid([0.83, 1.13])

= [0.70, 0.76]


Reset Gate

r_t = sigmoid(W_r * x_t + U_r * h_(t-1) + b_r)

= sigmoid([0.5, 0.7] * 0.5 + [[1.7, 1.8], [1.9, 2.0]] * [0.2, 0.3] + [0.3, 0.4])

= sigmoid([0.25 + 0.88 + 0.3, 0.35 + 0.98 + 0.4])

= sigmoid([1.43, 1.73])

= [0.81, 0.85]


Candidate Hidden State

h~_t = tanh(W_h * x_t + U_h * (r_t * h_(t-1)) + b_h)

= tanh([0.9, 1.1] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * ([0.81, 0.85] * [0.2, 0.3]) + [0.5, 0.6])

= tanh([0.9, 1.1] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * [0.16, 0.25] + [0.5, 0.6])

= tanh([0.45 + 0.90 + 0.5, 0.55 + 0.98 + 0.6])

= tanh([1.85, 2.13])

= [0.95, 0.97]

Hidden State

h_t = (1 - z_t) * h_(t-1) + z_t * h~_t

= (1 - [0.70, 0.76]) * [0.2, 0.3] + [0.70, 0.76] * [0.95, 0.97]

= [0.06, 0.07] + [0.66, 0.73]

= [0.72, 0.81]

This completes the GRU calculation for one time step.
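A single GRU step can be checked numerically in the same way. The NumPy sketch below uses x_t = 0.5, h_(t-1) = [0.2, 0.3] and the U matrices and biases of the example. It assumes the input weights are length-2 vectors with one entry per hidden unit, since the input is a scalar; `sigmoid` is a helper defined here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x_t = 0.5
h_prev = np.array([0.2, 0.3])

# Assumption: input weights are length-2 vectors because x_t is a scalar.
W_z = np.array([0.1, 0.3])
W_r = np.array([0.5, 0.7])
W_h = np.array([0.9, 1.1])
U_z = np.array([[1.3, 1.4], [1.5, 1.6]])
U_r = np.array([[1.7, 1.8], [1.9, 2.0]])
U_h = np.array([[2.1, 2.2], [2.3, 2.4]])
b_z = np.array([0.1, 0.2])
b_r = np.array([0.3, 0.4])
b_h = np.array([0.5, 0.6])

z_t = sigmoid(W_z * x_t + U_z @ h_prev + b_z)               # update gate
r_t = sigmoid(W_r * x_t + U_r @ h_prev + b_r)               # reset gate
h_cand = np.tanh(W_h * x_t + U_h @ (r_t * h_prev) + b_h)    # candidate state
h_t = (1.0 - z_t) * h_prev + z_t * h_cand                   # interpolation

for name, value in [("z_t", z_t), ("r_t", r_t),
                    ("h_cand", h_cand), ("h_t", h_t)]:
    print(name, np.round(value, 2))
```

Note that the reset gate is applied to h_(t-1) before the multiplication by U_h, matching the formulation given earlier; some libraries apply it after the matrix product instead, which gives slightly different numbers.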


Here are examples of Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks:

LSTM Example

Python

# Import necessary libraries
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

# Generate sample dataset: a noisy sine wave.
# 500 points are used so that the 20% test split (100 points) is longer
# than the 30-step window; a shorter test split would produce no samples.
np.random.seed(0)
time_steps = 500
window_size = 30  # number of past values used to predict the next value
data = np.sin(np.linspace(0, 10 * np.pi, time_steps)) + 0.2 * np.random.normal(0, 1, time_steps)

# Plot original data
plt.figure(figsize=(10, 6))
plt.plot(data)
plt.title('Original Data')
plt.show()

# Split data into training and testing sets (chronological, no shuffling)
train_size = int(0.8 * len(data))
train_raw, test_raw = data[:train_size], data[train_size:]

# Scale data to [0, 1]; the scaler is fitted on the training split only
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_raw.reshape(-1, 1))
test_data = scaler.transform(test_raw.reshape(-1, 1))

# Split data into X (a window of past values) and y (the next value)
def split_data(data, window_size):
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i + window_size])
        y.append(data[i + window_size])
    return np.array(X), np.array(y)

X_train, y_train = split_data(train_data, window_size)
X_test, y_test = split_data(test_data, window_size)

# Reshape data for LSTM input: (samples, time steps, features)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Build LSTM model
# Note: activation='relu' replaces the tanh used in the equations above;
# the Keras default for LSTM layers is tanh.
model = Sequential()
model.add(Input(shape=(window_size, 1)))
model.add(LSTM(50, activation='relu', return_sequences=True))
model.add(LSTM(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1))

# Compile model
model.compile(optimizer='adam', loss='mean_squared_error')

# Early stopping callback (monitors val_loss by default)
early_stopping = EarlyStopping(patience=5, min_delta=0.001)

# Train model
# The test split doubles as validation data here, so the plot below is not
# a clean held-out evaluation.
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])

# Make predictions (still in the scaled range)
predictions = model.predict(X_test)

# Plot predictions
plt.figure(figsize=(10, 6))
plt.plot(y_test, label='Actual')
plt.plot(predictions, label='Predicted')
plt.legend()
plt.title('Predictions')
plt.show()
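The windowing step is the part of this pipeline that most often goes wrong, so it is worth looking at on a tiny series. The sketch below uses a hypothetical helper, `make_windows`, that applies the same sliding-window logic as the example, and shows the resulting shapes. It also shows that a series shorter than the window yields no samples at all, which is why the length of each split matters.

```python
import numpy as np

def make_windows(series, window_size):
    """Pair each window of `window_size` values with the value that follows it."""
    X, y = [], []
    for i in range(len(series) - window_size):
        X.append(series[i:i + window_size])
        y.append(series[i + window_size])
    return np.array(X), np.array(y)

# A toy series 0, 1, ..., 5 with one feature per time step.
series = np.arange(6, dtype=float).reshape(-1, 1)
X, y = make_windows(series, window_size=3)
print(X.shape, y.shape)      # (3, 3, 1) (3, 1)
print(X[0].ravel(), y[0])    # [0. 1. 2.] [3.]

# A 20-point series with a 30-step window produces an empty array.
short = np.arange(20, dtype=float).reshape(-1, 1)
X_short, y_short = make_windows(short, window_size=30)
print(X_short.shape)         # (0,)
```

An empty array like `X_short` has no second axis, so any later reshape that reads `X.shape[1]` raises an IndexError.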


GRU Example

Python

# Import necessary libraries
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, GRU, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

# Generate sample dataset: a noisy sine wave.
# 500 points are used so that the 20% test split (100 points) is longer
# than the 30-step window; a shorter test split would produce no samples.
np.random.seed(0)
time_steps = 500
window_size = 30  # number of past values used to predict the next value
data = np.sin(np.linspace(0, 10 * np.pi, time_steps)) + 0.2 * np.random.normal(0, 1, time_steps)

# Plot original data
plt.figure(figsize=(10, 6))
plt.plot(data)
plt.title('Original Data')
plt.show()

# Split data into training and testing sets (chronological, no shuffling)
train_size = int(0.8 * len(data))
train_raw, test_raw = data[:train_size], data[train_size:]

# Scale data to [0, 1]; the scaler is fitted on the training split only
scaler = MinMaxScaler()
train_data = scaler.fit_transform(train_raw.reshape(-1, 1))
test_data = scaler.transform(test_raw.reshape(-1, 1))

# Split data into X (a window of past values) and y (the next value)
def split_data(data, window_size):
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i + window_size])
        y.append(data[i + window_size])
    return np.array(X), np.array(y)

X_train, y_train = split_data(train_data, window_size)
X_test, y_test = split_data(test_data, window_size)

# Reshape data for GRU input: (samples, time steps, features)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

# Build GRU model
# Note: activation='relu' replaces the tanh used in the equations above;
# the Keras default for GRU layers is tanh.
model = Sequential()
model.add(Input(shape=(window_size, 1)))
model.add(GRU(50, activation='relu', return_sequences=True))
model.add(GRU(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1))

# Compile model
model.compile(optimizer='adam', loss='mean_squared_error')

# Early stopping callback (monitors val_loss by default)
early_stopping = EarlyStopping(patience=5, min_delta=0.001)

# Train model
# The test split doubles as validation data here, so the plot below is not
# a clean held-out evaluation.
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])

# Make predictions (still in the scaled range)
predictions = model.predict(X_test)

# Plot predictions
plt.figure(figsize=(10, 6))
plt.plot(y_test, label='Actual')
plt.plot(predictions, label='Predicted')
plt.legend()
plt.title('Predictions')
plt.show()


Key Differences:


Architecture:

LSTM has three gates (input, output, and forget) and two state vectors (cell state and hidden state).

GRU has two gates (update and reset) and one state vector (hidden state).


Computational Complexity:

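LSTM is computationally more expensive: it has four sets of weights (three gates plus the candidate cell state) and maintains a separate cell state.

GRU has three sets of weights (two gates plus the candidate hidden state), so for the same input and hidden sizes it has roughly 25% fewer parameters and is typically faster per step.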
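The size difference follows directly from the formulations given earlier: each gate or candidate has one W, one U, and one b. A small sketch that counts them, assuming exactly those formulations; the function names are made up for this example, and library implementations can differ slightly (for instance, some GRU variants keep a second bias vector).

```python
def lstm_param_count(input_dim, hidden_dim):
    """Four blocks (input, forget, cell candidate, output), each with W, U, b."""
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return 4 * per_block

def gru_param_count(input_dim, hidden_dim):
    """Three blocks (update, reset, hidden candidate), each with W, U, b."""
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return 3 * per_block

# Sizes matching the first recurrent layer of the examples: 1 feature, 50 units.
n_lstm = lstm_param_count(input_dim=1, hidden_dim=50)
n_gru = gru_param_count(input_dim=1, hidden_dim=50)
print(n_lstm, n_gru, n_gru / n_lstm)  # 10400 7800 0.75
```

The ratio is always 3/4 under these formulas, whatever the layer sizes are.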


Performance:

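LSTM is often preferred for tasks requiring longer-term dependencies, where the separate cell state can help.

GRU is often competitive on tasks with shorter-term dependencies or smaller datasets. The difference is task-dependent, so it is worth trying both.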


Use Cases:


LSTM:

Language modeling

Text generation

Speech recognition


GRU:

Time series forecasting

Speech recognition

Machine translation


These examples demonstrate basic LSTM and GRU architectures. Depending on your specific task, you may need to adjust parameters, add layers, or experiment with different optimizers and loss functions.

