Long Short-Term Memory (LSTM) Networks
LSTMs are a type of Recurrent Neural Network (RNN) designed to handle sequential data with long-term dependencies.
Key Features:
Cell State: Preserves information over long periods.
Gates: Control information flow (input, output, and forget gates).
Hidden State: The output at each time step, which also acts as short-term memory passed on to the next step.
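One way to see why the cell state helps with long-term dependencies is to compare how much gradient survives across many time steps. The sketch below is a scalar toy model, not a full LSTM. It assumes a constant forget gate of 0.98 and follows only the direct cell-state path, where each step multiplies the gradient by the forget gate. It compares this with a scalar tanh RNN whose weight (0.9) and starting state (0.5) are arbitrary choices. The function names are illustrative.

```python
import math

def cell_path_gradient(forget_gates):
    """Gradient of c_T with respect to c_0 along the direct cell-state path.

    Each step contributes a factor f_t, so the result is the product of the gates.
    """
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

def vanilla_rnn_gradient(w, h0, steps):
    """Gradient of h_T with respect to h_0 for the scalar RNN h_t = tanh(w * h_(t-1))."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule: d tanh(w*h)/dh = w * (1 - tanh^2)
    return grad

steps = 50
lstm_grad = cell_path_gradient([0.98] * steps)             # about 0.36
rnn_grad = vanilla_rnn_gradient(w=0.9, h0=0.5, steps=steps)
print(f"cell-state path: {lstm_grad:.4f}, plain RNN: {rnn_grad:.2e}")
```

With a forget gate close to 1, a useful fraction of the signal is still present after 50 steps, while the plain RNN's gradient shrinks by orders of magnitude.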
Related Technologies:
Recurrent Neural Networks (RNNs): Basic architecture for sequential data.
Gated Recurrent Units (GRUs): Simplified version of LSTMs.
Bidirectional RNNs/LSTMs: Process input sequences in both directions.
Encoder-Decoder Architecture: Used for sequence-to-sequence tasks.
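As a rough illustration of bidirectional processing, the sketch below runs one recurrent pass left to right and a second pass right to left, then concatenates the two states at each time step. To keep it short it uses a plain tanh RNN cell rather than an LSTM. The helper names (rnn_pass, bidirectional_pass, random_params) and the random weights are assumptions made for this example.

```python
import numpy as np

def rnn_pass(xs, W, U, b):
    """Run a plain tanh RNN over xs of shape (T, input_dim); return states of shape (T, hidden_dim)."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_pass(xs, fwd_params, bwd_params):
    """Concatenate a left-to-right pass and a right-to-left pass at each time step."""
    h_fwd = rnn_pass(xs, *fwd_params)
    h_bwd = rnn_pass(xs[::-1], *bwd_params)[::-1]  # process reversed input, then restore time order
    return np.concatenate([h_fwd, h_bwd], axis=1)

def random_params(rng, input_dim, hidden_dim):
    """Small random (W, U, b) triple, used here only for illustration."""
    return (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 4, 3, 2
xs = rng.normal(size=(T, input_dim))
fwd = random_params(rng, input_dim, hidden_dim)
bwd = random_params(rng, input_dim, hidden_dim)
out = bidirectional_pass(xs, fwd, bwd)
print(out.shape)  # (4, 4): T steps, forward and backward states side by side
```

The first half of each output row depends only on inputs up to that step, while the second half depends on that step and everything after it.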
Real-World Applications:
Language Translation
Speech Recognition
Text Generation
Time Series Forecasting
Gated Recurrent Unit (GRU) Networks
GRUs are an alternative to LSTMs, designed to use fewer parameters and less computation per step while still capturing long-term dependencies.
Key Differences from LSTMs:
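Simplified Architecture: Two gates (update and reset) instead of three, and a single hidden state with no separate cell state.
Faster Computation: Three sets of weights instead of four, so roughly 25% fewer parameters for the same hidden size.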
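The parameter difference can be checked with a quick count. The sketch below assumes each gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector. The function name rnn_layer_params and the sizes (input dimension 1, hidden dimension 50) are choices made for this example.

```python
def rnn_layer_params(input_dim, hidden_dim, n_blocks):
    """Parameters in one recurrent layer made of n_blocks (W, U, b) triples."""
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return n_blocks * per_block

input_dim, hidden_dim = 1, 50
lstm_params = rnn_layer_params(input_dim, hidden_dim, n_blocks=4)  # input, forget, candidate, output
gru_params = rnn_layer_params(input_dim, hidden_dim, n_blocks=3)   # update, reset, candidate
print(lstm_params, gru_params)  # 10400 7800
```

Library implementations can differ slightly from this count, for example by keeping a second bias vector per block.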
Technical Details for LSTMs and GRUs:
LSTM Mathematical Formulation:
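Let x_t be the input at time t, h_t be the hidden state, and c_t be the cell state. W * x_t and U * h_(t-1) denote matrix-vector products; * between two vectors (as in f_t * c_(t-1)) denotes elementwise multiplication.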
Input Gate: i_t = sigmoid(W_i * x_t + U_i * h_(t-1) + b_i)
Forget Gate: f_t = sigmoid(W_f * x_t + U_f * h_(t-1) + b_f)
Cell State Update: c_t = f_t * c_(t-1) + i_t * tanh(W_c * x_t + U_c * h_(t-1) + b_c)
Output Gate: o_t = sigmoid(W_o * x_t + U_o * h_(t-1) + b_o)
Hidden State Update: h_t = o_t * tanh(c_t)
Parameters:
W_i, W_f, W_c, W_o: Input weight matrices (applied to x_t) for the input gate, forget gate, cell candidate, and output gate.
U_i, U_f, U_c, U_o: Recurrent weight matrices (applied to h_(t-1)) for the same four components.
b_i, b_f, b_c, b_o: Bias vectors.
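The LSTM equations above translate almost line for line into NumPy. The following is a minimal sketch rather than a production implementation. Storing the weights in dictionaries keyed by "i", "f", "c", "o" is a layout chosen for this example, and the random weights are placeholders.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W[k]: (hidden, input), U[k]: (hidden, hidden), b[k]: (hidden,) for k in "i", "f", "c", "o".
    """
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    g_t = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell values
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    c_t = f_t * c_prev + i_t * g_t                          # elementwise cell state update
    h_t = o_t * np.tanh(c_t)                                # new hidden state
    return h_t, c_t

# Usage with small random weights (input size 3, hidden size 2).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 2
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in "ifco"}
U = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in "ifco"}
b = {k: np.zeros(n_hid) for k in "ifco"}
h_t, c_t = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)
print(h_t.shape, c_t.shape)  # (2,) (2,)
```

Because h_t is an output gate value in (0, 1) multiplied by a tanh, every entry of the hidden state lies strictly between -1 and 1.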
GRU Mathematical Formulation:
Let x_t be the input at time t, h_t be the hidden state.
Update Gate: z_t = sigmoid(W_z * x_t + U_z * h_(t-1) + b_z)
Reset Gate: r_t = sigmoid(W_r * x_t + U_r * h_(t-1) + b_r)
Hidden State Update: h_t = (1 - z_t) * h_(t-1) + z_t * tanh(W_h * x_t + U_h * (r_t * h_(t-1)) + b_h)
Parameters:
W_z, W_r, W_h: Input weight matrices (applied to x_t) for the update gate, reset gate, and candidate hidden state.
U_z, U_r, U_h: Recurrent weight matrices (applied to h_(t-1)) for the same three components.
b_z, b_r, b_h: Bias vectors.
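The GRU equations can be sketched the same way. The example below also probes the two extremes of the update gate, using large bias values that are chosen purely to saturate the sigmoid. The dictionary layout keyed by "z", "r", "h" and the zero weights are assumptions for this illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step; W, U, b are dicts keyed by "z", "r", "h"."""
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])             # update gate
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])             # reset gate
    h_cand = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev) + b["h"])  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand                         # blend old state and candidate

n_in, n_hid = 3, 2
W = {k: np.zeros((n_hid, n_in)) for k in "zrh"}
U = {k: np.zeros((n_hid, n_hid)) for k in "zrh"}
x_t = np.ones(n_in)
h_prev = np.array([0.2, 0.3])

def biases(z_bias):
    """Bias dict with a chosen update-gate bias and zeros elsewhere."""
    return {"z": np.full(n_hid, z_bias), "r": np.zeros(n_hid), "h": np.zeros(n_hid)}

kept = gru_step(x_t, h_prev, W, U, biases(-20.0))     # z_t near 0: previous state is kept
replaced = gru_step(x_t, h_prev, W, U, biases(20.0))  # z_t near 1: state becomes the candidate, tanh(0) = 0
print(kept, replaced)
```

With z_t near 0 the step returns (almost exactly) h_(t-1), and with z_t near 1 it returns the candidate. This is how a GRU decides per unit whether to remember or overwrite.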
Here's a small mathematical example for an LSTM network:
Example:
Suppose we have an LSTM network with:
Input dimension: 1
Hidden dimension: 2
Output: the hidden state h_t is used directly as the output, so the output also has dimension 2.
Because the input is a scalar, each input weight W is a vector with one entry per hidden unit, and each recurrent weight U is a 2x2 matrix. Results are rounded to two decimals; each step is computed from unrounded values, so recomputing from the rounded figures can differ in the last digit.
Input at time t (x_t)
x_t = 0.5
Previous Hidden State (h_(t-1)) and Cell State (c_(t-1))
h_(t-1) = [0.2, 0.3]
c_(t-1) = [0.4, 0.5]
Weight Matrices and Bias Vectors
W_i = [0.1, 0.3]
W_f = [0.5, 0.7]
W_c = [0.9, 1.1]
W_o = [1.3, 1.5]
U_i = [[1.7, 1.8], [1.9, 2.0]]
U_f = [[2.1, 2.2], [2.3, 2.4]]
U_c = [[2.5, 2.6], [2.7, 2.8]]
U_o = [[2.9, 3.0], [3.1, 3.2]]
b_i = [0.1, 0.2]
b_f = [0.3, 0.4]
b_c = [0.5, 0.6]
b_o = [0.7, 0.8]
Calculations
Input Gate
i_t = sigmoid(W_i * x_t + U_i * h_(t-1) + b_i)
= sigmoid([0.1, 0.3] * 0.5 + [[1.7, 1.8], [1.9, 2.0]] * [0.2, 0.3] + [0.1, 0.2])
= sigmoid([0.05 + 0.88 + 0.1, 0.15 + 0.98 + 0.2])
= sigmoid([1.03, 1.33])
= [0.74, 0.79]
Forget Gate
f_t = sigmoid(W_f * x_t + U_f * h_(t-1) + b_f)
= sigmoid([0.5, 0.7] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * [0.2, 0.3] + [0.3, 0.4])
= sigmoid([0.25 + 1.08 + 0.3, 0.35 + 1.18 + 0.4])
= sigmoid([1.63, 1.93])
= [0.84, 0.87]
Cell State Update
c_t = f_t * c_(t-1) + i_t * tanh(W_c * x_t + U_c * h_(t-1) + b_c)
= [0.84, 0.87] * [0.4, 0.5] + [0.74, 0.79] * tanh([0.9, 1.1] * 0.5 + [[2.5, 2.6], [2.7, 2.8]] * [0.2, 0.3] + [0.5, 0.6])
= [0.33, 0.44] + [0.74, 0.79] * tanh([0.45 + 1.28 + 0.5, 0.55 + 1.38 + 0.6])
= [0.33, 0.44] + [0.74, 0.79] * tanh([2.23, 2.53])
= [0.33, 0.44] + [0.74, 0.79] * [0.98, 0.99]
= [0.33, 0.44] + [0.72, 0.78]
= [1.05, 1.22]
Output Gate
o_t = sigmoid(W_o * x_t + U_o * h_(t-1) + b_o)
= sigmoid([1.3, 1.5] * 0.5 + [[2.9, 3.0], [3.1, 3.2]] * [0.2, 0.3] + [0.7, 0.8])
= sigmoid([0.65 + 1.48 + 0.7, 0.75 + 1.58 + 0.8])
= sigmoid([2.83, 3.13])
= [0.94, 0.96]
Hidden State Update
h_t = o_t * tanh(c_t)
= [0.94, 0.96] * tanh([1.05, 1.22])
= [0.94, 0.96] * [0.78, 0.84]
= [0.74, 0.80]
Output
y_t = h_t
= [0.74, 0.80]
This completes the LSTM calculation for one time step.
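For a whole sequence, the same step is repeated, with h_t and c_t fed back in as the previous states for the next input. The sketch below shows that loop in NumPy. It assumes zero initial states and packs the four weight blocks into single stacked arrays, which is a convenience chosen for this example. The random weights and the sequence are placeholders.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def lstm_forward(xs, W, U, b, hidden_dim):
    """Run an LSTM over xs of shape (T, input_dim), starting from zero states.

    W: (4*hidden_dim, input_dim), U: (4*hidden_dim, hidden_dim), b: (4*hidden_dim,)
    hold the input, forget, candidate, and output blocks stacked in that order.
    Returns all hidden states (T, hidden_dim) and the final cell state.
    """
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    hidden_states = []
    for x_t in xs:
        pre = W @ x_t + U @ h + b            # all four pre-activations at once
        i, f, g, o = np.split(pre, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                    # cell state carried to the next step
        h = o * np.tanh(c)                   # hidden state carried to the next step
        hidden_states.append(h)
    return np.stack(hidden_states), c

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 1, 2
xs = rng.normal(size=(T, input_dim))
W = rng.normal(scale=0.5, size=(4 * hidden_dim, input_dim))
U = rng.normal(scale=0.5, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)
hs, c_last = lstm_forward(xs, W, U, b, hidden_dim)
print(hs.shape)  # (5, 2)
```

Returning every hidden state corresponds to what Keras calls return_sequences=True; keeping only the last row of hs corresponds to the default behavior.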
Here's a small mathematical example for a GRU (Gated Recurrent Unit) network:
Example:
Suppose we have a GRU network with:
Input dimension: 1
Hidden dimension: 2
Because the input is a scalar, each input weight W is a vector with one entry per hidden unit, and each recurrent weight U is a 2x2 matrix. Results are rounded to two decimals; each step is computed from unrounded values, so recomputing from the rounded figures can differ in the last digit.
Input at time t (x_t)
x_t = 0.5
Previous Hidden State (h_(t-1))
h_(t-1) = [0.2, 0.3]
Weight Matrices and Bias Vectors
W_z = [0.1, 0.3]
W_r = [0.5, 0.7]
W_h = [0.9, 1.1]
U_z = [[1.3, 1.4], [1.5, 1.6]]
U_r = [[1.7, 1.8], [1.9, 2.0]]
U_h = [[2.1, 2.2], [2.3, 2.4]]
b_z = [0.1, 0.2]
b_r = [0.3, 0.4]
b_h = [0.5, 0.6]
Calculations
Update Gate
z_t = sigmoid(W_z * x_t + U_z * h_(t-1) + b_z)
= sigmoid([0.1, 0.3] * 0.5 + [[1.3, 1.4], [1.5, 1.6]] * [0.2, 0.3] + [0.1, 0.2])
= sigmoid([0.05 + 0.68 + 0.1, 0.15 + 0.78 + 0.2])
= sigmoid([0.83, 1.13])
= [0.70, 0.76]
Reset Gate
r_t = sigmoid(W_r * x_t + U_r * h_(t-1) + b_r)
= sigmoid([0.5, 0.7] * 0.5 + [[1.7, 1.8], [1.9, 2.0]] * [0.2, 0.3] + [0.3, 0.4])
= sigmoid([0.25 + 0.88 + 0.3, 0.35 + 0.98 + 0.4])
= sigmoid([1.43, 1.73])
= [0.81, 0.85]
Candidate Hidden State
h~_t = tanh(W_h * x_t + U_h * (r_t * h_(t-1)) + b_h)
= tanh([0.9, 1.1] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * ([0.81, 0.85] * [0.2, 0.3]) + [0.5, 0.6])
= tanh([0.9, 1.1] * 0.5 + [[2.1, 2.2], [2.3, 2.4]] * [0.16, 0.25] + [0.5, 0.6])
= tanh([0.45 + 0.90 + 0.5, 0.55 + 0.98 + 0.6])
= tanh([1.85, 2.13])
= [0.95, 0.97]
Hidden State
h_t = (1 - z_t) * h_(t-1) + z_t * h~_t
= (1 - [0.70, 0.76]) * [0.2, 0.3] + [0.70, 0.76] * [0.95, 0.97]
= [0.30, 0.24] * [0.2, 0.3] + [0.66, 0.73]
= [0.06, 0.07] + [0.66, 0.73]
= [0.72, 0.81]
This completes the GRU calculation for one time step.
The following Keras examples train an LSTM and a GRU to predict the next value of a noisy sine wave from a sliding window of past values:
LSTM Example
Python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
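# Generate sample dataset (noisy sine wave)
np.random.seed(0)
time_steps = 500  # long enough that the 20% test split (100 points) is larger than the window
future_pred = 30  # window length: number of past values used to predict the next one
data = np.sin(np.linspace(0, 10 * np.pi, time_steps)) + 0.2 * np.random.normal(0, 1, time_steps)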
# Plot original data
plt.figure(figsize=(10, 6))
plt.plot(data)
plt.title('Original Data')
plt.show()
# Split into training and testing sets, then scale using statistics from the training set only
train_size = int(0.8 * len(data))
scaler = MinMaxScaler()
train_data = scaler.fit_transform(data[:train_size].reshape(-1, 1))
test_data = scaler.transform(data[train_size:].reshape(-1, 1))
# Split data into X (input windows) and y (the value right after each window)
def split_data(data, future_pred):
    X, y = [], []
    for i in range(len(data) - future_pred):
        X.append(data[i:i + future_pred])
        y.append(data[i + future_pred])
    return np.array(X), np.array(y)
X_train, y_train = split_data(train_data, future_pred)
X_test, y_test = split_data(test_data, future_pred)
# Reshape data for LSTM input
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
# Build LSTM model
model = Sequential()
model.add(LSTM(50, activation='relu', return_sequences=True, input_shape=(future_pred, 1)))
model.add(LSTM(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1))
# Compile model
model.compile(optimizer='adam', loss='mean_squared_error')
# Early stopping callback
early_stopping = EarlyStopping(patience=5, min_delta=0.001)
# Train model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])
# Make predictions
predictions = model.predict(X_test)
# Plot predictions
plt.figure(figsize=(10, 6))
plt.plot(y_test, label='Actual')
plt.plot(predictions, label='Predicted')
plt.legend()
plt.title('Predictions')
plt.show()
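The windowing step is the part of this example most likely to cause shape problems, so it can help to run it on a tiny array first. The sketch below uses a hypothetical helper, make_windows, that mirrors the logic of split_data on a six-element series with a window of 3. The toy values are chosen only to make the output easy to read.

```python
import numpy as np

def make_windows(series, window):
    """Return (X, y): each row of X holds `window` consecutive values, y the value right after."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X), np.array(y)

series = np.arange(6)                    # values 0..5
X, y = make_windows(series, window=3)
print(X.tolist())  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
print(y.tolist())  # [3, 4, 5]

# A series that is no longer than the window produces no samples at all.
X_short, y_short = make_windows(np.arange(3), window=3)
print(len(X_short))  # 0
```

The number of samples is len(series) - window, so both the training and the test split need to be longer than the window for the model to have anything to train or evaluate on.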
GRU Example
Python
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
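# Generate sample dataset (noisy sine wave)
np.random.seed(0)
time_steps = 500  # long enough that the 20% test split (100 points) is larger than the window
future_pred = 30  # window length: number of past values used to predict the next one
data = np.sin(np.linspace(0, 10 * np.pi, time_steps)) + 0.2 * np.random.normal(0, 1, time_steps)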
# Plot original data
plt.figure(figsize=(10, 6))
plt.plot(data)
plt.title('Original Data')
plt.show()
# Split into training and testing sets, then scale using statistics from the training set only
train_size = int(0.8 * len(data))
scaler = MinMaxScaler()
train_data = scaler.fit_transform(data[:train_size].reshape(-1, 1))
test_data = scaler.transform(data[train_size:].reshape(-1, 1))
# Split data into X (input windows) and y (the value right after each window)
def split_data(data, future_pred):
    X, y = [], []
    for i in range(len(data) - future_pred):
        X.append(data[i:i + future_pred])
        y.append(data[i + future_pred])
    return np.array(X), np.array(y)
X_train, y_train = split_data(train_data, future_pred)
X_test, y_test = split_data(test_data, future_pred)
# Reshape data for GRU input
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
# Build GRU model
model = Sequential()
model.add(GRU(50, activation='relu', return_sequences=True, input_shape=(future_pred, 1)))
model.add(GRU(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1))
# Compile model
model.compile(optimizer='adam', loss='mean_squared_error')
# Early stopping callback
early_stopping = EarlyStopping(patience=5, min_delta=0.001)
# Train model
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), callbacks=[early_stopping])
# Make predictions
predictions = model.predict(X_test)
# Plot predictions
plt.figure(figsize=(10, 6))
plt.plot(y_test, label='Actual')
plt.plot(predictions, label='Predicted')
plt.legend()
plt.title('Predictions')
plt.show()
Key Differences:
Architecture:
LSTM has three gates (input, forget, and output) and two state vectors (cell state and hidden state).
GRU has two gates (update and reset) and one state vector (hidden state).
Computational Complexity:
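LSTM is computationally more expensive: it has four sets of weights (three gates plus the cell candidate) and maintains a separate cell state.
GRU has three sets of weights (two gates plus the hidden candidate), so it needs roughly 25% fewer parameters and less computation per step for the same hidden size.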
Performance:
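LSTM is often preferred for tasks with very long-range dependencies, where the separate cell state can help.
GRU often performs comparably and can be a good choice for smaller datasets or tighter compute budgets. Neither consistently outperforms the other, so it is worth evaluating both on the task at hand.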
Use Cases:
LSTM:
Language modeling
Text generation
Speech recognition
GRU:
Time series forecasting
Speech recognition
Machine translation
These examples demonstrate basic LSTM and GRU architectures. Depending on your specific task, you may need to adjust parameters, add layers, or experiment with different optimizers and loss functions.