Neural Network Modules
The nn module provides neural network layers and utilities.
Base Classes
Base classes for neural network modules.
- class fit.nn.modules.base.Layer
Bases: ABC
Base class for all neural network layers.
All layers should inherit from this class and implement the forward method.
- add_parameter(param: Tensor)
Add a parameter to this layer.
- Parameters:
param – Parameter tensor to add
- parameters() -> List[Tensor]
Return all parameters of this layer.
- Returns:
List of parameter tensors
- named_parameters(prefix: str = '') -> Iterator[tuple]
Return iterator over module parameters with names.
- Parameters:
prefix – Prefix to add to parameter names
- Yields:
(name, parameter) tuples
- abstract forward(*args, **kwargs)
Forward pass of the layer.
This method must be implemented by all subclasses.
- extra_repr() -> str
Return extra representation string for this layer.
Subclasses can override this to provide additional information.
- state_dict() -> Dict[str, Any]
Return a state dictionary containing the layer’s state.
- Returns:
Dictionary mapping parameter names to their values
- load_state_dict(state_dict: Dict[str, Any])
Load state from state dictionary.
- Parameters:
state_dict – Dictionary containing state to load
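The contract described above (subclasses implement forward, register tensors with add_parameter, and expose them through parameters() and state_dict()) can be sketched with a stand-in base class. SketchLayer and Scale are hypothetical names used only for illustration; the real Layer works on Tensor objects, and its parameter naming scheme may differ from the "param0" keys assumed here.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchLayer(ABC):
    """Hypothetical stand-in that mimics the documented Layer interface."""

    def __init__(self) -> None:
        self._parameters: List[Any] = []

    def add_parameter(self, param: Any) -> None:
        # Register a parameter so parameters() and state_dict() can find it.
        self._parameters.append(param)

    def parameters(self) -> List[Any]:
        return list(self._parameters)

    def state_dict(self) -> Dict[str, Any]:
        # Assumed naming scheme: "param0", "param1", ...
        return {f"param{i}": p for i, p in enumerate(self._parameters)}

    @abstractmethod
    def forward(self, *args, **kwargs):
        """Subclasses must implement the forward pass."""


class Scale(SketchLayer):
    """Toy layer that multiplies every input element by one factor."""

    def __init__(self, factor: float) -> None:
        super().__init__()
        self.factor = [factor]  # a one-element list plays the role of a tensor
        self.add_parameter(self.factor)

    def forward(self, x: List[float]) -> List[float]:
        return [self.factor[0] * v for v in x]


layer = Scale(2.0)
output = layer.forward([1.0, 2.0, 3.0])
```

Because forward is abstract, instantiating the base class directly raises TypeError, which is how the "must be implemented by all subclasses" rule is enforced in this sketch.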
- class fit.nn.modules.base.Module
Bases: Layer
Alias for Layer class to match PyTorch naming convention.
- class fit.nn.modules.base.Identity
Bases: Layer
Identity layer that returns input unchanged.
- class fit.nn.modules.base.Lambda(func)
Bases: Layer
Layer that applies a function to its input.
- class fit.nn.modules.base.MultiInputLayer
Bases: Layer
Base class for layers that take multiple inputs.
- class fit.nn.modules.base.ParameterList(parameters=None)
Bases: Layer
Container for a list of parameters.
Linear Layers
Implementation of linear (fully connected) layers.
- class fit.nn.modules.linear.Linear(in_features: int, out_features: int, bias: bool = True)
Bases: Layer
Linear (fully connected) layer.
Applies a linear transformation: y = xW^T + b
- __init__(in_features: int, out_features: int, bias: bool = True)
Initialize linear layer.
- Parameters:
in_features – Number of input features
out_features – Number of output features
bias – Whether to include bias term
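The transformation y = xW^T + b can be sketched in plain Python. The function name linear_forward and the weight layout (out_features, in_features) are assumptions chosen to match the W^T in the formula; they are not taken from the library's source.

```python
from typing import List

Matrix = List[List[float]]


def linear_forward(x: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Compute y = x W^T + b for x of shape (batch, in_features).

    weight is assumed to have shape (out_features, in_features) and
    bias to have length out_features.
    """
    return [
        [sum(xi * wi for xi, wi in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in x
    ]


x = [[1.0, 2.0, 3.0]]                    # batch of 1, in_features = 3
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]   # out_features = 2
b = [0.5, -1.0]
y = linear_forward(x, W, b)              # [[1.5, 4.0]]
```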
- class fit.nn.modules.linear.Bilinear(in1_features: int, in2_features: int, out_features: int, bias: bool = True)
Bases: Layer
Bilinear layer: y = x1^T @ W @ x2 + b
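A minimal sketch of the bilinear form for a single pair of input vectors follows. It assumes one weight matrix per output feature, so the weight has shape (out_features, in1_features, in2_features); that layout and the name bilinear_forward are illustrative assumptions.

```python
from typing import List


def bilinear_forward(
    x1: List[float],
    x2: List[float],
    weight: List[List[List[float]]],
    bias: List[float],
) -> List[float]:
    """Compute y_k = x1^T W_k x2 + b_k for every output feature k."""
    return [
        sum(
            x1[i] * w_k[i][j] * x2[j]
            for i in range(len(x1))
            for j in range(len(x2))
        ) + b_k
        for w_k, b_k in zip(weight, bias)
    ]


x1 = [1.0, 2.0]
x2 = [3.0, 4.0]
W = [[[1.0, 0.0], [0.0, 1.0]]]   # one output feature, identity matrix
b = [0.5]
y = bilinear_forward(x1, x2, W, b)   # [1*3 + 2*4 + 0.5] = [11.5]
```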
- class fit.nn.modules.linear.Identity
Bases: Layer
Identity layer - returns input unchanged. Useful for skip connections and as placeholder.
- class fit.nn.modules.linear.Flatten(start_dim: int = 1)
Bases: Layer
Flatten layer - reshapes input to 2D tensor.
- class fit.nn.modules.linear.Embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None)
Bases: Layer
Embedding layer for discrete tokens.
Activation Functions
Implementation of activation functions for neural networks.
Normalization Layers
- class fit.nn.modules.normalization.BatchNorm(num_features, eps=1e-05, momentum=0.1)
Bases: Layer
- class fit.nn.modules.normalization.LayerNorm(normalized_shape, eps=1e-05)
Bases: Layer
Layer Normalization: normalizes inputs across the feature dimension.
Unlike BatchNorm, LayerNorm normalizes across features for each sample independently.
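The per-sample behaviour can be sketched as follows. This version omits any learnable scale and shift the library's LayerNorm may apply, and the function name layer_norm is an assumption for illustration.

```python
import math
from typing import List


def layer_norm(sample: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one sample across its features to zero mean, unit variance."""
    mean = sum(sample) / len(sample)
    var = sum((v - mean) ** 2 for v in sample) / len(sample)
    return [(v - mean) / math.sqrt(var + eps) for v in sample]


# Each sample is normalized with its own statistics, independent of the batch.
batch = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
normalized = [layer_norm(sample) for sample in batch]
```

Both rows end up with nearly identical normalized values even though their scales differ by a factor of ten, because no statistics are shared across the batch.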
Attention Mechanisms
Core Attention Mechanisms for FIT Framework
This module implements the fundamental attention mechanisms that power modern deep learning: scaled dot-product attention, multi-head attention, and various attention variants.
The implementation is educational (showing how attention really works) while being efficient and production-ready.
- class fit.nn.modules.attention.ScaledDotProductAttention(dropout: float = 0.1, temperature: float = 1.0)
Bases: Layer
Scaled Dot-Product Attention: the core of all attention mechanisms.
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
This is the fundamental building block that makes Transformers work.
- __init__(dropout: float = 0.1, temperature: float = 1.0)
Initialize scaled dot-product attention.
- Parameters:
dropout – Dropout probability for attention weights
temperature – Temperature scaling factor (higher = more uniform attention)
- forward(query: Tensor, key: Tensor, value: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | Tuple[Tensor, Tensor]
Apply scaled dot-product attention.
- Parameters:
query – Query tensor (batch_size, seq_len_q, d_k)
key – Key tensor (batch_size, seq_len_k, d_k)
value – Value tensor (batch_size, seq_len_v, d_v); seq_len_v must equal seq_len_k
mask – Optional attention mask to prevent attention to certain positions
return_attention – Whether to return attention weights
- Returns:
Output tensor (batch_size, seq_len_q, d_v) and optionally attention weights
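The formula softmax(QK^T / sqrt(d_k))V can be sketched for a single sequence without a batch axis. Dropout and temperature are left out, and the mask convention (False blocks a position) is an assumption, not taken from the library.

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(
    q: Matrix,
    k: Matrix,
    v: Matrix,
    mask: Optional[List[List[bool]]] = None,
) -> Tuple[Matrix, Matrix]:
    """Return (output, weights) for q: (len_q, d_k), k: (len_k, d_k), v: (len_k, d_v).

    mask[i][j] == False blocks query i from attending to key j
    (assumed convention).
    """
    d_k = len(q[0])
    weights = []
    for i, q_row in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        if mask is not None:
            scores = [s if keep else -1e9 for s, keep in zip(scores, mask[i])]
        weights.append(softmax(scores))
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, v)) for c in range(len(v[0]))]
        for w_row in weights
    ]
    return output, weights


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(q, k, v)
out_masked, w_masked = attention(q, k, v, mask=[[True, False], [True, True]])
```

With the mask applied, the first query can only see the first key, so its output row equals the first value row.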
- class fit.nn.modules.attention.MultiHeadAttention(d_model: int, num_heads: int, dropout: float = 0.1, bias: bool = True)
Bases: Layer
Multi-Head Attention: the key innovation that makes Transformers so powerful.
Instead of using a single attention function, we use multiple “heads” that can focus on different types of relationships in the data.
- __init__(d_model: int, num_heads: int, dropout: float = 0.1, bias: bool = True)
Initialize multi-head attention.
- Parameters:
d_model – Model dimension (must be divisible by num_heads)
num_heads – Number of attention heads
dropout – Dropout probability
bias – Whether to use bias in linear projections
- forward(query: Tensor, key: Tensor, value: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | Tuple[Tensor, Tensor]
Apply multi-head attention.
- Parameters:
query – Query tensor (batch_size, seq_len, d_model)
key – Key tensor (batch_size, seq_len, d_model)
value – Value tensor (batch_size, seq_len, d_model)
mask – Optional attention mask
return_attention – Whether to return attention weights
- Returns:
Output tensor and optionally attention weights
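The requirement that d_model be divisible by num_heads follows from how the feature axis is divided among heads. The sketch below shows that split and its inverse for one sequence; split_heads and merge_heads are hypothetical helper names, and the learned projections are omitted.

```python
from typing import List

Matrix = List[List[float]]


def split_heads(x: Matrix, num_heads: int) -> List[Matrix]:
    """Reshape (seq_len, d_model) into (num_heads, seq_len, d_head)."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [
        [row[h * d_head:(h + 1) * d_head] for row in x]
        for h in range(num_heads)
    ]


def merge_heads(heads: List[Matrix]) -> Matrix:
    """Concatenate per-head outputs back into (seq_len, d_model)."""
    seq_len = len(heads[0])
    return [[value for head in heads for value in head[t]] for t in range(seq_len)]


x = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]   # seq_len = 2, d_model = 4
heads = split_heads(x, num_heads=2)                # two heads with d_head = 2
restored = merge_heads(heads)
```

In a full multi-head layer, attention runs independently on each head's slice before the slices are merged and passed through an output projection.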
- class fit.nn.modules.attention.SelfAttention(d_model: int, num_heads: int = 8, dropout: float = 0.1)
Bases: Layer
Self-Attention: a special case where query, key, and value are the same.
This allows the model to relate different positions in a single sequence, which is crucial for understanding context and long-range dependencies.
- __init__(d_model: int, num_heads: int = 8, dropout: float = 0.1)
Initialize self-attention layer.
- Parameters:
d_model – Model dimension
num_heads – Number of attention heads
dropout – Dropout probability
- forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | Tuple[Tensor, Tensor]
Apply self-attention to input sequence.
- Parameters:
x – Input tensor (batch_size, seq_len, d_model)
mask – Optional attention mask
return_attention – Whether to return attention weights
- Returns:
Output tensor and optionally attention weights
- class fit.nn.modules.attention.CrossAttention(d_model: int, num_heads: int = 8, dropout: float = 0.1)
Bases: Layer
Cross-Attention: attention between two different sequences.
Used in encoder-decoder architectures where the decoder attends to the encoder’s output. Query comes from decoder, Key and Value from encoder.
- __init__(d_model: int, num_heads: int = 8, dropout: float = 0.1)
Initialize cross-attention layer.
- Parameters:
d_model – Model dimension
num_heads – Number of attention heads
dropout – Dropout probability
- forward(query: Tensor, key_value: Tensor, mask: Tensor | None = None, return_attention: bool = False) -> Tensor | Tuple[Tensor, Tensor]
Apply cross-attention between query and key_value sequences.
- Parameters:
query – Query tensor from decoder (batch_size, seq_len_q, d_model)
key_value – Key and value tensor from encoder (batch_size, seq_len_kv, d_model)
mask – Optional attention mask
return_attention – Whether to return attention weights
- Returns:
Output tensor and optionally attention weights
- class fit.nn.modules.attention.CausalSelfAttention(d_model: int, num_heads: int = 8, dropout: float = 0.1)
Bases: Layer
Causal (Masked) Self-Attention: prevents positions from attending to future positions.
Essential for autoregressive models like GPT, where we want to predict the next token without “cheating” by looking at future tokens.
- __init__(d_model: int, num_heads: int = 8, dropout: float = 0.1)
Initialize causal self-attention.
- Parameters:
d_model – Model dimension
num_heads – Number of attention heads
dropout – Dropout probability
- forward(x: Tensor, return_attention: bool = False) -> Tensor | Tuple[Tensor, Tensor]
Apply causal self-attention to input sequence.
- Parameters:
x – Input tensor (batch_size, seq_len, d_model)
return_attention – Whether to return attention weights
- Returns:
Output tensor and optionally attention weights
- fit.nn.modules.attention.create_padding_mask(sequences: Tensor, pad_token_id: int = 0) -> Tensor
Create padding mask to ignore padded positions in variable-length sequences.
- Parameters:
sequences – Input sequences (batch_size, seq_len)
pad_token_id – Token ID used for padding
- Returns:
Padding mask (batch_size, 1, seq_len)
- fit.nn.modules.attention.create_look_ahead_mask(seq_len: int) -> Tensor
Create look-ahead mask for causal attention.
- Parameters:
seq_len – Sequence length
- Returns:
Look-ahead mask (1, seq_len, seq_len)
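The shapes of the padding and look-ahead masks can be illustrated with nested lists. This sketch assumes True marks a position that may be attended to; the library may use the opposite convention or numeric values, and the function names here are hypothetical.

```python
from typing import List


def padding_mask(sequences: List[List[int]], pad_token_id: int = 0) -> List[List[List[bool]]]:
    """Map (batch_size, seq_len) token IDs to a (batch_size, 1, seq_len) mask."""
    return [[[token != pad_token_id for token in seq]] for seq in sequences]


def look_ahead_mask(seq_len: int) -> List[List[List[bool]]]:
    """Build a (1, seq_len, seq_len) lower-triangular mask.

    Position i may attend to position j only when j <= i.
    """
    return [[[j <= i for j in range(seq_len)] for i in range(seq_len)]]


pad = padding_mask([[5, 7, 0], [3, 0, 0]])
causal = look_ahead_mask(3)
```

The singleton axes let each mask broadcast against attention scores of shape (batch_size, seq_len, seq_len).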
- fit.nn.modules.attention.attention_visualization_helper(attention_weights: Tensor, tokens: list | None = None)
Helper function to visualize attention weights.
- Parameters:
attention_weights – Attention weights (batch_size, num_heads, seq_len, seq_len)
tokens – Optional list of tokens for labeling
- Returns:
Dictionary with visualization data
Transformer Components
Transformer Blocks & Complete Transformer Architecture
This module implements the complete Transformer architecture, including:
- Transformer Encoder/Decoder Blocks
- Positional Encoding
- Layer Normalization
- Feed-Forward Networks
- Complete Transformer models
Built on top of the attention mechanisms, this creates the full power of modern Transformer architectures.
- class fit.nn.modules.transformer.PositionalEncoding(d_model: int, max_len: int = 5000, dropout: float = 0.1)
Bases: Layer
Positional Encoding: adds position information to embeddings.
Since Transformers have no inherent notion of sequence order, we add sinusoidal position encodings to give the model information about token positions.
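A sketch of sinusoidal encodings follows, using the formulation from "Attention Is All You Need": sine on even feature indices and cosine on odd ones, with wavelengths growing geometrically. Whether the library uses exactly this base (10000) and layout is an assumption, and dropout is omitted.

```python
import math
from typing import List


def positional_encoding(max_len: int, d_model: int) -> List[List[float]]:
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i is the even feature index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_len=3, d_model=4)
# pe[0] is [0.0, 1.0, 0.0, 1.0] because sin(0) = 0 and cos(0) = 1.
```

The table is added element-wise to the token embeddings, so every position receives a distinct, deterministic offset.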
- class fit.nn.modules.transformer.GELU
Bases: Layer
Gaussian Error Linear Unit: smooth activation function used in Transformers.
GELU(x) = x * Φ(x) where Φ is the cumulative distribution function of the standard normal distribution.
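The definition can be evaluated exactly with the error function, since Φ(x) = 0.5 * (1 + erf(x / sqrt(2))). This is a scalar sketch; the library may instead use a tanh-based approximation.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x multiplied by the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


values = [gelu(x) for x in (-2.0, 0.0, 1.0)]
```

Unlike ReLU, small negative inputs produce small negative outputs rather than exactly zero.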
- class fit.nn.modules.transformer.FeedForward(d_model: int, d_ff: int, activation: str = 'gelu', dropout: float = 0.1)
Bases: Layer
Position-wise Feed-Forward Network: applies the same FFN to each position.
With a ReLU activation, FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. The default activation here is 'gelu', which takes the place of the max(0, ·) term.
This adds non-linearity and allows the model to process information within each position independently.
- class fit.nn.modules.transformer.TransformerEncoderBlock(d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1, activation: str = 'gelu')
Bases: Layer
Transformer Encoder Block: the core building block of the Transformer encoder.
Structure:
1. Multi-Head Self-Attention
2. Residual connection + Layer Norm
3. Feed-Forward Network
4. Residual connection + Layer Norm
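The four-step structure can be sketched as two "add and norm" stages wrapped around stand-in sublayers. The ordering assumed here (normalize after adding the residual) is read off the listing above and may not match the library exactly; stand_in_attention and stand_in_ffn are hypothetical placeholders for the real attention and feed-forward layers.

```python
import math
from typing import Callable, List

Sequence = List[List[float]]          # (seq_len, d_model)
Sublayer = Callable[[Sequence], Sequence]


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x: Sequence, sublayer_out: Sequence) -> Sequence:
    """Residual connection followed by layer normalization, per position."""
    return [
        layer_norm([a + b for a, b in zip(x_row, s_row)])
        for x_row, s_row in zip(x, sublayer_out)
    ]


def encoder_block(x: Sequence, self_attention: Sublayer, feed_forward: Sublayer) -> Sequence:
    x = add_and_norm(x, self_attention(x))   # steps 1 and 2
    x = add_and_norm(x, feed_forward(x))     # steps 3 and 4
    return x


def stand_in_attention(seq: Sequence) -> Sequence:
    # Placeholder: every position receives the average over all positions.
    avg = [sum(col) / len(seq) for col in zip(*seq)]
    return [list(avg) for _ in seq]


def stand_in_ffn(seq: Sequence) -> Sequence:
    # Placeholder: element-wise ReLU with no learned weights.
    return [[max(0.0, v) for v in row] for row in seq]


x = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
y = encoder_block(x, stand_in_attention, stand_in_ffn)
```

The output keeps the input shape, which is what allows blocks like this to be stacked.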
- __init__(d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1, activation: str = 'gelu')
Initialize transformer encoder block.
- Parameters:
d_model – Model dimension
num_heads – Number of attention heads
d_ff – Feed-forward dimension
dropout – Dropout probability
activation – Activation function in FFN
- class fit.nn.modules.transformer.TransformerDecoderBlock(d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1, activation: str = 'gelu')
Bases: Layer
Transformer Decoder Block: core building block of the Transformer decoder.
Structure:
1. Masked Multi-Head Self-Attention
2. Residual connection + Layer Norm
3. Multi-Head Cross-Attention (encoder-decoder attention)
4. Residual connection + Layer Norm
5. Feed-Forward Network
6. Residual connection + Layer Norm
- __init__(d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1, activation: str = 'gelu')
Initialize transformer decoder block.
- Parameters:
d_model – Model dimension
num_heads – Number of attention heads
d_ff – Feed-forward dimension
dropout – Dropout probability
activation – Activation function in FFN
- forward(x: Tensor, encoder_output: Tensor, self_attn_mask: Tensor | None = None, cross_attn_mask: Tensor | None = None) -> Tensor
Forward pass through decoder block.
- Parameters:
x – Decoder input (batch_size, target_seq_len, d_model)
encoder_output – Encoder output (batch_size, source_seq_len, d_model)
self_attn_mask – Mask for self-attention
cross_attn_mask – Mask for cross-attention
- Returns:
Output tensor (batch_size, target_seq_len, d_model)
- class fit.nn.modules.transformer.TransformerEncoder(vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')
Bases: Layer
Complete Transformer Encoder: stack of encoder blocks with embeddings.
- __init__(vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')
Initialize transformer encoder.
- Parameters:
vocab_size – Size of vocabulary
d_model – Model dimension
num_heads – Number of attention heads
num_layers – Number of encoder layers
d_ff – Feed-forward dimension
max_len – Maximum sequence length
dropout – Dropout probability
activation – Activation function
- class fit.nn.modules.transformer.TransformerDecoder(vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')
Bases: Layer
Complete Transformer Decoder: stack of decoder blocks with embeddings.
- __init__(vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')
Initialize transformer decoder.
- Parameters:
vocab_size – Size of vocabulary
d_model – Model dimension
num_heads – Number of attention heads
num_layers – Number of decoder layers
d_ff – Feed-forward dimension
max_len – Maximum sequence length
dropout – Dropout probability
activation – Activation function
- forward(x: Tensor, encoder_output: Tensor, self_attn_mask: Tensor | None = None, cross_attn_mask: Tensor | None = None) -> Tensor
Forward pass through transformer decoder.
- Parameters:
x – Target token indices (batch_size, target_seq_len)
encoder_output – Encoder output (batch_size, source_seq_len, d_model)
self_attn_mask – Mask for self-attention
cross_attn_mask – Mask for cross-attention
- Returns:
Decoded representations (batch_size, target_seq_len, d_model)
- class fit.nn.modules.transformer.Embedding(vocab_size: int, d_model: int)
Bases: Layer
Token embedding layer that converts token indices to dense vectors.
- class fit.nn.modules.transformer.Transformer(src_vocab_size: int, tgt_vocab_size: int, d_model: int = 512, num_heads: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, d_ff: int = 2048, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')
Bases: Layer
Complete Transformer model for sequence-to-sequence tasks.
This is the full Transformer as described in “Attention Is All You Need”.
- __init__(src_vocab_size: int, tgt_vocab_size: int, d_model: int = 512, num_heads: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, d_ff: int = 2048, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')
Initialize complete Transformer model.
- Parameters:
src_vocab_size – Source vocabulary size
tgt_vocab_size – Target vocabulary size
d_model – Model dimension
num_heads – Number of attention heads
num_encoder_layers – Number of encoder layers
num_decoder_layers – Number of decoder layers
d_ff – Feed-forward dimension
max_len – Maximum sequence length
dropout – Dropout probability
activation – Activation function
- forward(src: Tensor, tgt: Tensor, src_mask: Tensor | None = None, tgt_mask: Tensor | None = None) -> Tensor
Forward pass through complete Transformer.
- Parameters:
src – Source sequences (batch_size, src_seq_len)
tgt – Target sequences (batch_size, tgt_seq_len)
src_mask – Source attention mask
tgt_mask – Target attention mask
- Returns:
Output logits (batch_size, tgt_seq_len, tgt_vocab_size)
Container Modules
Container modules for composing neural networks.
- class fit.nn.modules.container.Sequential(*layers)
Bases: Layer
Sequential container that chains layers together.
Layers are executed in the order they are added.
- __init__(*layers)
Initialize sequential container.
- Parameters:
*layers – Variable number of layers to chain
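The chaining behaviour can be sketched with plain callables standing in for layers. SketchSequential is a hypothetical name; the real container holds Layer instances and also aggregates their parameters.

```python
from typing import Any, Callable


class SketchSequential:
    """Hypothetical stand-in: feeds each layer's output into the next layer."""

    def __init__(self, *layers: Callable[[Any], Any]) -> None:
        self.layers = list(layers)

    def forward(self, x: Any) -> Any:
        for layer in self.layers:   # executed in the order they were added
            x = layer(x)
        return x


def add_one(v: float) -> float:
    return v + 1


def double(v: float) -> float:
    return v * 2


result = SketchSequential(add_one, double).forward(3)      # (3 + 1) * 2 = 8
reversed_result = SketchSequential(double, add_one).forward(3)  # 3 * 2 + 1 = 7
```

Swapping the order changes the result, which is why the order of the arguments matters.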
- class fit.nn.modules.container.ModuleList(modules: List[Layer] | None = None)
Bases: Layer
List container for modules.
Unlike Sequential, ModuleList does not define a forward pass; you need to implement it yourself.
- class fit.nn.modules.container.ModuleDict(modules: Dict[str, Layer] | None = None)
Bases: Layer
Dictionary container for modules.
- class fit.nn.modules.container.Parallel(*layers)
Bases: Layer
Parallel container that applies multiple layers to the same input.
- __init__(*layers)
Initialize parallel container.
- Parameters:
*layers – Layers to apply in parallel
- class fit.nn.modules.container.Residual(layer: Layer)
Bases: Layer
Residual connection: output = input + layer(input)
- __init__(layer: Layer)
Initialize residual connection.
- Parameters:
layer – Layer to wrap with residual connection
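The rule output = input + layer(input) can be sketched on plain lists. The function name residual is an assumption for illustration, and the wrapped layer must return an output with the same shape as its input for the addition to be defined.

```python
from typing import Callable, List

Vector = List[float]


def residual(layer: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Add the wrapped layer's output back onto its input, element-wise."""
    return [xi + yi for xi, yi in zip(x, layer(x))]


def halve(v: Vector) -> Vector:
    return [0.5 * a for a in v]


def zeros(v: Vector) -> Vector:
    return [0.0 for _ in v]


y = residual(halve, [2.0, 4.0])          # [2 + 1, 4 + 2] = [3.0, 6.0]
passthrough = residual(zeros, [2.0, 4.0])  # a zero layer leaves the input unchanged
```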
- class fit.nn.modules.container.Highway(layer: Layer, gate_layer: Layer | None = None)
Bases: Layer
Highway connection: output = gate * layer(input) + (1 - gate) * input
- __init__(layer: Layer, gate_layer: Layer | None = None)
Initialize highway connection.
- Parameters:
layer – Transform layer
gate_layer – Gate layer (if None, creates a linear layer)
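The highway formula can be sketched as below. It assumes the gate is obtained by applying a sigmoid to the gate layer's output, which is the usual formulation but is not confirmed by this reference. The default linear gate is not reproduced, so this sketch requires an explicit gate_layer.

```python
import math
from typing import Callable, List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def highway(x: Vector, layer: Callable[[Vector], Vector],
            gate_layer: Callable[[Vector], Vector]) -> Vector:
    """output = gate * layer(x) + (1 - gate) * x, element-wise."""
    transformed = layer(x)
    gate = [sigmoid(z) for z in gate_layer(x)]   # assumed: sigmoid gating
    return [g * t + (1.0 - g) * xi for g, t, xi in zip(gate, transformed, x)]


def times_ten(v: Vector) -> Vector:
    return [10.0 * a for a in v]


def open_gate(v: Vector) -> Vector:
    return [50.0 for _ in v]     # sigmoid(50) is effectively 1


def closed_gate(v: Vector) -> Vector:
    return [-50.0 for _ in v]    # sigmoid(-50) is effectively 0


x = [1.0, 2.0]
fully_transformed = highway(x, times_ten, open_gate)
carried_through = highway(x, times_ten, closed_gate)
```

An open gate reproduces the transform layer's output, while a closed gate carries the input through unchanged; intermediate gate values blend the two.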