Neural Network Modules

The nn module provides neural network layers and utilities.

Base Classes

Base classes for neural network modules.

class fit.nn.modules.base.Layer[source]

Bases: ABC

Base class for all neural network layers.

All layers should inherit from this class and implement the forward method.

__init__()[source]

Initialize the layer.

add_parameter(param: Tensor)[source]

Add a parameter to this layer.

Parameters:

param – Parameter tensor to add

parameters() List[Tensor][source]

Return all parameters of this layer.

Returns:

List of parameter tensors

named_parameters(prefix: str = '') Iterator[tuple][source]

Return iterator over module parameters with names.

Parameters:

prefix – Prefix to add to parameter names

Yields:

(name, parameter) tuples

zero_grad()[source]

Zero all parameter gradients.

train()[source]

Set the layer to training mode.

eval()[source]

Set the layer to evaluation mode.

abstract forward(*args, **kwargs)[source]

Forward pass of the layer.

This method must be implemented by all subclasses.

__call__(*args, **kwargs)[source]

Make the layer callable.

This calls the forward method.

__repr__()[source]

Return string representation of the layer.

extra_repr() str[source]

Return extra representation string for this layer.

Subclasses can override this to provide additional information.

state_dict() Dict[str, Any][source]

Return a state dictionary containing the layer’s state.

Returns:

Dictionary mapping parameter names to their values

load_state_dict(state_dict: Dict[str, Any])[source]

Load state from state dictionary.

Parameters:

state_dict – Dictionary containing state to load

apply(fn)[source]

Apply function to all parameters.

Parameters:

fn – Function to apply to each parameter

cuda()[source]

Move layer to CUDA (placeholder - not implemented).

cpu()[source]

Move layer to CPU (already on CPU).

to(device)[source]

Move layer to specified device (placeholder).

add_child(module: Layer)[source]

Add a child module to this layer.

Parameters:

module – Child module to add
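The subclassing pattern described for Layer can be sketched without the library. This is a minimal stand-in, not fit's actual code: the class names SketchLayer and Scale, and the training attribute, are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class SketchLayer(ABC):
    """Stand-in for the documented base class (hypothetical, not fit's code)."""

    def __init__(self):
        self._parameters = []  # filled by add_parameter
        self.training = True   # assumed flag toggled by train() / eval()

    def add_parameter(self, param):
        self._parameters.append(param)

    def parameters(self):
        return list(self._parameters)

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    @abstractmethod
    def forward(self, *args, **kwargs):
        """Subclasses must implement the computation here."""

    def __call__(self, *args, **kwargs):
        # Calling the layer delegates to forward.
        return self.forward(*args, **kwargs)


class Scale(SketchLayer):
    """Hypothetical subclass: multiplies its input by one stored factor."""

    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.add_parameter(factor)

    def forward(self, x):
        return x * self.factor


layer = Scale(3.0)
result = layer(2.0)  # same as layer.forward(2.0)
```

Because forward is abstract, instantiating SketchLayer directly raises TypeError; only subclasses that implement forward can be constructed.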

class fit.nn.modules.base.Module[source]

Bases: Layer

Alias for Layer class to match PyTorch naming convention.

class fit.nn.modules.base.Identity[source]

Bases: Layer

Identity layer that returns input unchanged.

forward(x)[source]

Return the input x unchanged.

class fit.nn.modules.base.Lambda(func)[source]

Bases: Layer

Layer that applies a function to its input.

__init__(func)[source]

Initialize lambda layer.

Parameters:

func – Function to apply

forward(x)[source]

Apply func to the input x and return the result.

class fit.nn.modules.base.MultiInputLayer[source]

Bases: Layer

Base class for layers that take multiple inputs.

forward(*inputs)[source]

Forward pass with multiple inputs.

Parameters:

*inputs – Variable number of input tensors

class fit.nn.modules.base.ParameterList(parameters=None)[source]

Bases: Layer

Container for a list of parameters.

__init__(parameters=None)[source]

Initialize parameter list.

Parameters:

parameters – Initial list of parameters

append(parameter: Tensor)[source]

Add a parameter to the list.

extend(parameters: List[Tensor])[source]

Extend the list with multiple parameters.

forward(x)[source]

Parameter lists don’t have a forward pass.

class fit.nn.modules.base.ParameterDict(parameters=None)[source]

Bases: Layer

Container for a dictionary of parameters.

__init__(parameters=None)[source]

Initialize parameter dictionary.

Parameters:

parameters – Initial dictionary of parameters

__setitem__(key: str, parameter: Tensor)[source]

Set a parameter in the dictionary.

__getitem__(key: str) Tensor[source]

Get a parameter from the dictionary.

__delitem__(key: str)[source]

Delete a parameter from the dictionary.

keys()[source]

values()[source]

items()[source]

forward(x)[source]

Parameter dicts don’t have a forward pass.

Linear Layers

Implementation of linear (fully connected) layers.

class fit.nn.modules.linear.Linear(in_features: int, out_features: int, bias: bool = True)[source]

Bases: Layer

Linear (fully connected) layer.

Applies a linear transformation: y = xW^T + b

__init__(in_features: int, out_features: int, bias: bool = True)[source]

Initialize linear layer.

Parameters:
  • in_features – Number of input features

  • out_features – Number of output features

  • bias – Whether to include bias term

forward(x: Tensor) Tensor[source]

Forward pass of linear layer.

Parameters:

x – Input tensor of shape (batch_size, in_features)

Returns:

Output tensor of shape (batch_size, out_features)

extra_repr() str[source]

Return extra representation string.

get_config()[source]

Get layer configuration for serialization.
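The transformation y = xW^T + b can be worked through on plain Python lists. This is a sketch only: the function name linear_forward is hypothetical, and the weight layout (out_features, in_features) is an assumption inferred from the W^T in the formula.

```python
def linear_forward(x, weight, bias=None):
    """Compute y = x W^T + b for x of shape (batch_size, in_features).

    weight is assumed to have shape (out_features, in_features);
    bias, if given, has length out_features.
    """
    out = []
    for row in x:
        # One dot product per output feature (one row of W).
        y = [sum(xi * wi for xi, wi in zip(row, w_row)) for w_row in weight]
        if bias is not None:
            y = [yi + bi for yi, bi in zip(y, bias)]
        out.append(y)
    return out


x = [[1.0, 2.0], [0.0, 1.0]]                   # (batch_size=2, in_features=2)
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # (out_features=3, in_features=2)
bias = [0.5, 0.5, 0.5]
y = linear_forward(x, weight, bias)            # shape (2, 3)
```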

class fit.nn.modules.linear.Bilinear(in1_features: int, in2_features: int, out_features: int, bias: bool = True)[source]

Bases: Layer

Bilinear layer: y = x1^T @ W @ x2 + b

__init__(in1_features: int, in2_features: int, out_features: int, bias: bool = True)[source]

Initialize bilinear layer.

Parameters:
  • in1_features – Size of first input

  • in2_features – Size of second input

  • out_features – Size of output

  • bias – Whether to include bias

forward(x1: Tensor, x2: Tensor) Tensor[source]

Forward pass of bilinear layer.

Parameters:
  • x1 – First input tensor (batch_size, in1_features)

  • x2 – Second input tensor (batch_size, in2_features)

Returns:

Output tensor (batch_size, out_features)
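As a rough illustration of y = x1^T @ W @ x2 + b, the sketch below evaluates one bilinear form per output feature. The function name bilinear_forward and the weight layout (out_features, in1_features, in2_features) are assumptions, not taken from the library.

```python
def bilinear_forward(x1, x2, weight, bias=None):
    """Compute y[k] = x1^T W[k] x2 + b[k] for each sample in the batch.

    x1: (batch_size, in1_features), x2: (batch_size, in2_features),
    weight: assumed (out_features, in1_features, in2_features).
    """
    out = []
    for a, b in zip(x1, x2):
        row = []
        for k, w_k in enumerate(weight):
            val = sum(
                a[i] * w_k[i][j] * b[j]
                for i in range(len(a))
                for j in range(len(b))
            )
            if bias is not None:
                val += bias[k]
            row.append(val)
        out.append(row)
    return out


x1 = [[1.0, 2.0]]                               # (1, in1_features=2)
x2 = [[3.0, 4.0, 5.0]]                          # (1, in2_features=3)
weight = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]   # (out_features=1, 2, 3)
y = bilinear_forward(x1, x2, weight, bias=[1.0])
```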

class fit.nn.modules.linear.Identity[source]

Bases: Layer

Identity layer - returns input unchanged. Useful for skip connections and as a placeholder.

forward(x: Tensor) Tensor[source]

Return the input x unchanged.

class fit.nn.modules.linear.Flatten(start_dim: int = 1)[source]

Bases: Layer

Flatten layer - flattens the input from start_dim onward. With the default start_dim=1, a batched input becomes a 2D tensor.

__init__(start_dim: int = 1)[source]

Initialize flatten layer.

Parameters:

start_dim – Dimension to start flattening from

forward(x: Tensor) Tensor[source]

Flatten input tensor.

Parameters:

x – Input tensor

Returns:

Flattened tensor
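The effect of start_dim can be sketched on nested lists. The helper flatten_from is hypothetical, and the semantics (keep the first start_dim dimensions, merge the rest) are an assumption based on the parameter description.

```python
def flatten_from(x, start_dim=1):
    """Flatten nested lists from start_dim onward, keeping earlier dims."""
    if start_dim > 0:
        # Keep this dimension and recurse into each element.
        return [flatten_from(item, start_dim - 1) for item in x]
    # start_dim == 0: collapse every remaining level into one flat list.
    flat = []
    for item in x:
        if isinstance(item, list):
            flat.extend(flatten_from(item, 0))
        else:
            flat.append(item)
    return flat


x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # shape (2, 2, 2)
batched = flatten_from(x, start_dim=1)    # shape (2, 4)
fully_flat = flatten_from(x, start_dim=0) # shape (8,)
```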

class fit.nn.modules.linear.Embedding(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None)[source]

Bases: Layer

Embedding layer for discrete tokens.

__init__(num_embeddings: int, embedding_dim: int, padding_idx: int | None = None)[source]

Initialize embedding layer.

Parameters:
  • num_embeddings – Size of dictionary of embeddings

  • embedding_dim – Size of each embedding vector

  • padding_idx – If given, pads output with embedding vector at padding_idx

forward(x: Tensor) Tensor[source]

Look up embeddings.

Parameters:

x – Tensor containing indices

Returns:

Embedded tensor

Activation Functions

Implementation of activation functions for neural networks.

class fit.nn.modules.activation.ReLU[source]

Bases: Layer

forward(x)[source]

Apply the rectified linear unit, max(0, x), element-wise.

class fit.nn.modules.activation.Softmax[source]

Bases: Layer

forward(x: Tensor, axis=-1)[source]

Apply softmax function along specified axis.

Parameters:
  • x – Input tensor

  • axis – Axis along which to apply softmax

Returns:

Softmax output
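For reference, softmax along one axis can be sketched as below for a single 1D row. Subtracting the row maximum before exponentiating is a common stability trick; whether the library does this is not stated, so treat it as an assumption.

```python
import math


def softmax(values):
    """Softmax of a 1D list, shifted by the maximum to avoid overflow."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])        # sums to 1, largest input wins
stable = softmax([1000.0, 1000.0])      # no overflow thanks to the shift
```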

class fit.nn.modules.activation.Tanh[source]

Bases: Layer

forward(x)[source]

Apply the hyperbolic tangent element-wise.

class fit.nn.modules.activation.Sigmoid[source]

Bases: Layer

forward(x)[source]

Apply the logistic sigmoid, 1 / (1 + exp(-x)), element-wise.

class fit.nn.modules.activation.LeakyReLU(negative_slope=0.01)[source]

Bases: Layer

__init__(negative_slope=0.01)[source]

Initialize the layer.

Parameters:

negative_slope – Slope applied to negative inputs

forward(x)[source]

Apply leaky ReLU element-wise: x where x is positive, negative_slope * x otherwise.

class fit.nn.modules.activation.ELU(alpha=1.0)[source]

Bases: Layer

__init__(alpha=1.0)[source]

Initialize the layer.

Parameters:

alpha – Scale of the negative saturation value

forward(x)[source]

Apply ELU element-wise: x where x is positive, alpha * (exp(x) - 1) otherwise.

class fit.nn.modules.activation.GELU[source]

Bases: Layer

forward(x)[source]

Apply the GELU activation element-wise.

class fit.nn.modules.activation.Swish[source]

Bases: Layer

forward(x)[source]

Apply the Swish activation, x * sigmoid(x), element-wise.

class fit.nn.modules.activation.Dropout(p=0.5)[source]

Bases: Layer

__init__(p=0.5)[source]

Initialize the layer.

Parameters:

p – Probability of dropping each element

forward(x)[source]

Apply dropout to x.

train()[source]

Set the layer to training mode.

eval()[source]

Set the layer to evaluation mode.

class fit.nn.modules.activation.LogSoftmax[source]

Bases: Layer

forward(x: Tensor, axis=-1)[source]

Apply log-softmax function along specified axis.

Parameters:
  • x – Input tensor

  • axis – Axis along which to apply log-softmax

Returns:

Log-softmax output
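Log-softmax is usually computed with the log-sum-exp trick rather than as log(softmax(x)). The following is a sketch for a single 1D row; the function name and the choice of trick are assumptions, not a description of the library's internals.

```python
import math


def log_softmax(values):
    """Log-softmax of a 1D list via a max-shifted log-sum-exp."""
    m = max(values)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_sum_exp for v in values]


log_probs = log_softmax([1.0, 2.0, 3.0])
# Exponentiating the result recovers probabilities that sum to 1.
recovered_total = sum(math.exp(lp) for lp in log_probs)
```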

Normalization Layers

class fit.nn.modules.normalization.BatchNorm(num_features, eps=1e-05, momentum=0.1)[source]

Bases: Layer

__init__(num_features, eps=1e-05, momentum=0.1)[source]

Initialize the layer.

forward(x: Tensor)[source]

Apply batch normalization to x.

train()[source]

Set the layer to training mode.

eval()[source]

Set the layer to evaluation mode.

get_config()[source]

class fit.nn.modules.normalization.LayerNorm(normalized_shape, eps=1e-05)[source]

Bases: Layer

Layer Normalization: normalizes inputs across the feature dimension.

Unlike BatchNorm, LayerNorm normalizes across features for each sample independently.

__init__(normalized_shape, eps=1e-05)[source]

Initialize layer normalization.

Parameters:
  • normalized_shape – Shape of the feature dimensions to normalize over

  • eps – Small constant for numerical stability

forward(x: Tensor) Tensor[source]

Apply layer normalization.

Parameters:

x – Input tensor

Returns:

Normalized tensor
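Per-sample normalization across features can be illustrated on one feature vector. This sketch leaves out any learnable scale and shift, since the reference above does not say whether the layer has them; the function name layer_norm is hypothetical.

```python
import math


def layer_norm(features, eps=1e-5):
    """Normalize one sample's features to zero mean and unit variance."""
    n = len(features)
    mean = sum(features) / n
    var = sum((v - mean) ** 2 for v in features) / n
    # eps keeps the division safe when the variance is close to zero.
    return [(v - mean) / math.sqrt(var + eps) for v in features]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

Each sample is normalized using only its own statistics, which is the contrast with BatchNorm drawn above.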

Attention Mechanisms

Core Attention Mechanisms for FIT Framework

This module implements the fundamental attention mechanisms that power modern deep learning: scaled dot-product attention, multi-head attention, and various attention variants.

The implementation is educational (showing how attention really works) while being efficient and production-ready.

class fit.nn.modules.attention.ScaledDotProductAttention(dropout: float = 0.1, temperature: float = 1.0)[source]

Bases: Layer

Scaled Dot-Product Attention: the core of all attention mechanisms.

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V

This is the fundamental building block that makes Transformers work.

__init__(dropout: float = 0.1, temperature: float = 1.0)[source]

Initialize scaled dot-product attention.

Parameters:
  • dropout – Dropout probability for attention weights

  • temperature – Temperature scaling factor (higher = more uniform attention)

forward(query: Tensor, key: Tensor, value: Tensor, mask: Tensor | None = None, return_attention: bool = False) Tensor | Tuple[Tensor, Tensor][source]

Apply scaled dot-product attention.

Parameters:
  • query – Query tensor (batch_size, seq_len_q, d_k)

  • key – Key tensor (batch_size, seq_len_k, d_k)

  • value – Value tensor (batch_size, seq_len_v, d_v), where seq_len_v must equal seq_len_k

  • mask – Optional attention mask to prevent attention to certain positions

  • return_attention – Whether to return attention weights

Returns:

Output tensor (batch_size, seq_len_q, d_v) and optionally attention weights
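The formula Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V can be traced by hand for a single sequence. This is a sketch under stated assumptions: it omits the batch dimension, dropout, and temperature, and it assumes a mask value of 0 means "blocked", which the reference above does not specify.

```python
import math


def _softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (seq_len_q, d_k), k: (seq_len_k, d_k), v: (seq_len_k, d_v)."""
    d_k = len(q[0])
    outputs, weights = [], []
    for i, q_i in enumerate(q):
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k
        ]
        if mask is not None:
            # Assumed convention: mask[i][j] == 0 blocks key j for query i.
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        w = _softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors.
        outputs.append(
            [sum(w_j * v_j[c] for w_j, v_j in zip(w, v)) for c in range(len(v[0]))]
        )
    return outputs, weights


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(q, k, v)
out_masked, attn_masked = scaled_dot_product_attention(q, k, v, mask=[[1, 0]])
```

With two identical keys the query attends equally to both values; masking the second key moves all weight to the first.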

class fit.nn.modules.attention.MultiHeadAttention(d_model: int, num_heads: int, dropout: float = 0.1, bias: bool = True)[source]

Bases: Layer

Multi-Head Attention: the key innovation that makes Transformers so powerful.

Instead of using a single attention function, we use multiple “heads” that can focus on different types of relationships in the data.

__init__(d_model: int, num_heads: int, dropout: float = 0.1, bias: bool = True)[source]

Initialize multi-head attention.

Parameters:
  • d_model – Model dimension (must be divisible by num_heads)

  • num_heads – Number of attention heads

  • dropout – Dropout probability

  • bias – Whether to use bias in linear projections

forward(query: Tensor, key: Tensor, value: Tensor, mask: Tensor | None = None, return_attention: bool = False) Tensor | Tuple[Tensor, Tensor][source]

Apply multi-head attention.

Parameters:
  • query – Query tensor (batch_size, seq_len, d_model)

  • key – Key tensor (batch_size, seq_len, d_model)

  • value – Value tensor (batch_size, seq_len, d_model)

  • mask – Optional attention mask

  • return_attention – Whether to return attention weights

Returns:

Output tensor and optionally attention weights
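The requirement that d_model be divisible by num_heads follows from how heads are usually formed: each position's vector is cut into num_heads equal slices. The helpers below are a hypothetical sketch of that split and its inverse for one sequence; the library's internal layout may differ.

```python
def split_heads(x, num_heads):
    """x: (seq_len, d_model) -> (num_heads, seq_len, d_model // num_heads)."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [
        [row[h * d_head:(h + 1) * d_head] for row in x]
        for h in range(num_heads)
    ]


def merge_heads(heads):
    """Inverse of split_heads: concatenate head slices per position."""
    seq_len = len(heads[0])
    return [sum((head[t] for head in heads), []) for t in range(seq_len)]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]     # (seq_len=2, d_model=4)
heads = split_heads(x, num_heads=2)  # (2, 2, 2)
restored = merge_heads(heads)
```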

class fit.nn.modules.attention.SelfAttention(d_model: int, num_heads: int = 8, dropout: float = 0.1)[source]

Bases: Layer

Self-Attention: a special case where query, key, and value are the same.

This allows the model to relate different positions in a single sequence, which is crucial for understanding context and long-range dependencies.

__init__(d_model: int, num_heads: int = 8, dropout: float = 0.1)[source]

Initialize self-attention layer.

Parameters:
  • d_model – Model dimension

  • num_heads – Number of attention heads

  • dropout – Dropout probability

forward(x: Tensor, mask: Tensor | None = None, return_attention: bool = False) Tensor | Tuple[Tensor, Tensor][source]

Apply self-attention to input sequence.

Parameters:
  • x – Input tensor (batch_size, seq_len, d_model)

  • mask – Optional attention mask

  • return_attention – Whether to return attention weights

Returns:

Output tensor and optionally attention weights

class fit.nn.modules.attention.CrossAttention(d_model: int, num_heads: int = 8, dropout: float = 0.1)[source]

Bases: Layer

Cross-Attention: attention between two different sequences.

Used in encoder-decoder architectures where the decoder attends to the encoder’s output. Query comes from decoder, Key and Value from encoder.

__init__(d_model: int, num_heads: int = 8, dropout: float = 0.1)[source]

Initialize cross-attention layer.

Parameters:
  • d_model – Model dimension

  • num_heads – Number of attention heads

  • dropout – Dropout probability

forward(query: Tensor, key_value: Tensor, mask: Tensor | None = None, return_attention: bool = False) Tensor | Tuple[Tensor, Tensor][source]

Apply cross-attention between query and key_value sequences.

Parameters:
  • query – Query tensor from decoder (batch_size, seq_len_q, d_model)

  • key_value – Key and value tensor from encoder (batch_size, seq_len_kv, d_model)

  • mask – Optional attention mask

  • return_attention – Whether to return attention weights

Returns:

Output tensor and optionally attention weights

class fit.nn.modules.attention.CausalSelfAttention(d_model: int, num_heads: int = 8, dropout: float = 0.1)[source]

Bases: Layer

Causal (Masked) Self-Attention: prevents positions from attending to future positions.

Essential for autoregressive models like GPT, where we want to predict the next token without “cheating” by looking at future tokens.

__init__(d_model: int, num_heads: int = 8, dropout: float = 0.1)[source]

Initialize causal self-attention.

Parameters:
  • d_model – Model dimension

  • num_heads – Number of attention heads

  • dropout – Dropout probability

forward(x: Tensor, return_attention: bool = False) Tensor | Tuple[Tensor, Tensor][source]

Apply causal self-attention to input sequence.

Parameters:
  • x – Input tensor (batch_size, seq_len, d_model)

  • return_attention – Whether to return attention weights

Returns:

Output tensor and optionally attention weights

fit.nn.modules.attention.create_padding_mask(sequences: Tensor, pad_token_id: int = 0) Tensor[source]

Create padding mask to ignore padded positions in variable-length sequences.

Parameters:
  • sequences – Input sequences (batch_size, seq_len)

  • pad_token_id – Token ID used for padding

Returns:

Padding mask (batch_size, 1, seq_len)

fit.nn.modules.attention.create_look_ahead_mask(seq_len: int) Tensor[source]

Create look-ahead mask for causal attention.

Parameters:

seq_len – Sequence length

Returns:

Look-ahead mask (1, seq_len, seq_len)
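The two mask helpers above can be approximated with nested lists to show the documented shapes. The function names below are stand-ins, and the convention that 1 marks a position that may be attended to (0 marks a blocked one) is an assumption; the reference does not state which value the library uses.

```python
def padding_mask_sketch(sequences, pad_token_id=0):
    """Return a (batch_size, 1, seq_len) mask; assumed 1 = real token."""
    return [[[0 if tok == pad_token_id else 1 for tok in seq]] for seq in sequences]


def look_ahead_mask_sketch(seq_len):
    """Return a (1, seq_len, seq_len) mask; row i allows columns 0..i."""
    return [[[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]]


pad_mask = padding_mask_sketch([[5, 7, 0], [3, 0, 0]])
causal_mask = look_ahead_mask_sketch(3)
```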

fit.nn.modules.attention.attention_visualization_helper(attention_weights: Tensor, tokens: list | None = None)[source]

Helper function to visualize attention weights.

Parameters:
  • attention_weights – Attention weights (batch_size, num_heads, seq_len, seq_len)

  • tokens – Optional list of tokens for labeling

Returns:

Dictionary with visualization data

fit.nn.modules.attention.demonstrate_attention()[source]

Demonstrate how attention mechanisms work with simple examples.

Transformer Components

Transformer Blocks & Complete Transformer Architecture

This module implements the complete Transformer architecture, including:

  • Transformer Encoder/Decoder Blocks

  • Positional Encoding

  • Layer Normalization

  • Feed-Forward Networks

  • Complete Transformer models

Built on top of the attention mechanisms, this creates the full power of modern Transformer architectures.

class fit.nn.modules.transformer.PositionalEncoding(d_model: int, max_len: int = 5000, dropout: float = 0.1)[source]

Bases: Layer

Positional Encoding: adds position information to embeddings.

Since Transformers have no inherent notion of sequence order, we add sinusoidal position encodings to give the model information about token positions.

__init__(d_model: int, max_len: int = 5000, dropout: float = 0.1)[source]

Initialize positional encoding.

Parameters:
  • d_model – Model dimension

  • max_len – Maximum sequence length to precompute

  • dropout – Dropout probability

forward(x: Tensor) Tensor[source]

Add positional encoding to input embeddings.

Parameters:

x – Input embeddings (batch_size, seq_len, d_model)

Returns:

Embeddings with positional encoding added
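A sinusoidal table of the kind described can be precomputed as follows. The formula is the one from "Attention Is All You Need" (sine on even indices, cosine on odd ones); that the library uses exactly this variant is an assumption, and the function name is hypothetical.

```python
import math


def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Wavelength grows geometrically with the feature index.
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = sinusoidal_encoding(max_len=3, d_model=4)
```

The forward pass would then add row pos of this table to the embedding at position pos.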

class fit.nn.modules.transformer.GELU[source]

Bases: Layer

Gaussian Error Linear Unit: smooth activation function used in Transformers.

GELU(x) = x * Φ(x) where Φ is the cumulative distribution function of the standard normal distribution.

forward(x: Tensor) Tensor[source]

Apply GELU activation.
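The definition GELU(x) = x * Φ(x) can be evaluated exactly with the error function, since Φ(x) = 0.5 * (1 + erf(x / sqrt(2))). This is a scalar sketch; whether the library uses the exact form or a tanh approximation is not stated above.

```python
import math


def gelu(x):
    """Exact GELU for a scalar: x times the standard normal CDF at x."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


at_zero = gelu(0.0)
at_one = gelu(1.0)
```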

class fit.nn.modules.transformer.FeedForward(d_model: int, d_ff: int, activation: str = 'gelu', dropout: float = 0.1)[source]

Bases: Layer

Position-wise Feed-Forward Network: applies same FFN to each position.

With ReLU activation: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. With the default activation='gelu', GELU takes the place of max(0, ·).

This adds non-linearity and allows the model to process information within each position independently.

__init__(d_model: int, d_ff: int, activation: str = 'gelu', dropout: float = 0.1)[source]

Initialize feed-forward network.

Parameters:
  • d_model – Model dimension

  • d_ff – Feed-forward dimension (usually 4 * d_model)

  • activation – Activation function (‘relu’, ‘gelu’)

  • dropout – Dropout probability

forward(x: Tensor) Tensor[source]

Apply feed-forward network.

Parameters:

x – Input tensor (batch_size, seq_len, d_model)

Returns:

Output tensor (batch_size, seq_len, d_model)

class fit.nn.modules.transformer.TransformerEncoderBlock(d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1, activation: str = 'gelu')[source]

Bases: Layer

Transformer Encoder Block: the core building block of the Transformer encoder.

Structure:

  1. Multi-Head Self-Attention

  2. Residual connection + Layer Norm

  3. Feed-Forward Network

  4. Residual connection + Layer Norm

__init__(d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1, activation: str = 'gelu')[source]

Initialize transformer encoder block.

Parameters:
  • d_model – Model dimension

  • num_heads – Number of attention heads

  • d_ff – Feed-forward dimension

  • dropout – Dropout probability

  • activation – Activation function in FFN

forward(x: Tensor, mask: Tensor | None = None) Tensor[source]

Forward pass through encoder block.

Parameters:
  • x – Input tensor (batch_size, seq_len, d_model)

  • mask – Optional attention mask

Returns:

Output tensor (batch_size, seq_len, d_model)
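The four-step structure of the encoder block can be sketched with stand-in sublayers. The ordering used here (add the residual, then normalize) is read off the listed structure and is an assumption about the implementation; dropout is omitted, and all function names are hypothetical.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


def encoder_block(x, self_attention, feed_forward, norm1, norm2):
    # Steps 1-2: self-attention, then residual connection + layer norm.
    x = norm1(add(x, self_attention(x)))
    # Steps 3-4: feed-forward network, then residual connection + layer norm.
    x = norm2(add(x, feed_forward(x)))
    return x


def double(v):
    """Stand-in sublayer so the data flow is easy to follow."""
    return [2.0 * vi for vi in v]


def identity(v):
    """Stand-in for layer norm that leaves values unchanged."""
    return v


out = encoder_block([1.0, 2.0], double, double, identity, identity)
```

With these stand-ins each sublayer triples its input (x + 2x), so two sublayers scale the input by nine.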

class fit.nn.modules.transformer.TransformerDecoderBlock(d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1, activation: str = 'gelu')[source]

Bases: Layer

Transformer Decoder Block: core building block of the Transformer decoder.

Structure:

  1. Masked Multi-Head Self-Attention

  2. Residual connection + Layer Norm

  3. Multi-Head Cross-Attention (encoder-decoder attention)

  4. Residual connection + Layer Norm

  5. Feed-Forward Network

  6. Residual connection + Layer Norm

__init__(d_model: int, num_heads: int, d_ff: int, dropout: float = 0.1, activation: str = 'gelu')[source]

Initialize transformer decoder block.

Parameters:
  • d_model – Model dimension

  • num_heads – Number of attention heads

  • d_ff – Feed-forward dimension

  • dropout – Dropout probability

  • activation – Activation function in FFN

forward(x: Tensor, encoder_output: Tensor, self_attn_mask: Tensor | None = None, cross_attn_mask: Tensor | None = None) Tensor[source]

Forward pass through decoder block.

Parameters:
  • x – Decoder input (batch_size, target_seq_len, d_model)

  • encoder_output – Encoder output (batch_size, source_seq_len, d_model)

  • self_attn_mask – Mask for self-attention

  • cross_attn_mask – Mask for cross-attention

Returns:

Output tensor (batch_size, target_seq_len, d_model)

class fit.nn.modules.transformer.TransformerEncoder(vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')[source]

Bases: Layer

Complete Transformer Encoder: stack of encoder blocks with embeddings.

__init__(vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')[source]

Initialize transformer encoder.

Parameters:
  • vocab_size – Size of vocabulary

  • d_model – Model dimension

  • num_heads – Number of attention heads

  • num_layers – Number of encoder layers

  • d_ff – Feed-forward dimension

  • max_len – Maximum sequence length

  • dropout – Dropout probability

  • activation – Activation function

forward(x: Tensor, mask: Tensor | None = None) Tensor[source]

Forward pass through transformer encoder.

Parameters:
  • x – Input token indices (batch_size, seq_len)

  • mask – Optional attention mask

Returns:

Encoded representations (batch_size, seq_len, d_model)

class fit.nn.modules.transformer.TransformerDecoder(vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')[source]

Bases: Layer

Complete Transformer Decoder: stack of decoder blocks with embeddings.

__init__(vocab_size: int, d_model: int, num_heads: int, num_layers: int, d_ff: int, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')[source]

Initialize transformer decoder.

Parameters:
  • vocab_size – Size of vocabulary

  • d_model – Model dimension

  • num_heads – Number of attention heads

  • num_layers – Number of decoder layers

  • d_ff – Feed-forward dimension

  • max_len – Maximum sequence length

  • dropout – Dropout probability

  • activation – Activation function

forward(x: Tensor, encoder_output: Tensor, self_attn_mask: Tensor | None = None, cross_attn_mask: Tensor | None = None) Tensor[source]

Forward pass through transformer decoder.

Parameters:
  • x – Target token indices (batch_size, target_seq_len)

  • encoder_output – Encoder output (batch_size, source_seq_len, d_model)

  • self_attn_mask – Mask for self-attention

  • cross_attn_mask – Mask for cross-attention

Returns:

Decoded representations (batch_size, target_seq_len, d_model)

class fit.nn.modules.transformer.Embedding(vocab_size: int, d_model: int)[source]

Bases: Layer

Token embedding layer that converts token indices to dense vectors.

__init__(vocab_size: int, d_model: int)[source]

Initialize embedding layer.

Parameters:
  • vocab_size – Size of vocabulary

  • d_model – Embedding dimension

forward(x: Tensor) Tensor[source]

Look up embeddings for input tokens.

Parameters:

x – Token indices (batch_size, seq_len)

Returns:

Embeddings (batch_size, seq_len, d_model)

class fit.nn.modules.transformer.Transformer(src_vocab_size: int, tgt_vocab_size: int, d_model: int = 512, num_heads: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, d_ff: int = 2048, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')[source]

Bases: Layer

Complete Transformer model for sequence-to-sequence tasks.

This is the full Transformer as described in “Attention Is All You Need”.

__init__(src_vocab_size: int, tgt_vocab_size: int, d_model: int = 512, num_heads: int = 8, num_encoder_layers: int = 6, num_decoder_layers: int = 6, d_ff: int = 2048, max_len: int = 5000, dropout: float = 0.1, activation: str = 'gelu')[source]

Initialize complete Transformer model.

Parameters:
  • src_vocab_size – Source vocabulary size

  • tgt_vocab_size – Target vocabulary size

  • d_model – Model dimension

  • num_heads – Number of attention heads

  • num_encoder_layers – Number of encoder layers

  • num_decoder_layers – Number of decoder layers

  • d_ff – Feed-forward dimension

  • max_len – Maximum sequence length

  • dropout – Dropout probability

  • activation – Activation function

forward(src: Tensor, tgt: Tensor, src_mask: Tensor | None = None, tgt_mask: Tensor | None = None) Tensor[source]

Forward pass through complete Transformer.

Parameters:
  • src – Source sequences (batch_size, src_seq_len)

  • tgt – Target sequences (batch_size, tgt_seq_len)

  • src_mask – Source attention mask

  • tgt_mask – Target attention mask

Returns:

Output logits (batch_size, tgt_seq_len, tgt_vocab_size)

fit.nn.modules.transformer.demonstrate_transformer()[source]

Demonstrate transformer components with simple examples.

Container Modules

Container modules for composing neural networks.

class fit.nn.modules.container.Sequential(*layers)[source]

Bases: Layer

Sequential container that chains layers together.

Layers are executed in the order they are added.

__init__(*layers)[source]

Initialize sequential container.

Parameters:

*layers – Variable number of layers to chain

add_layer(layer: Layer)[source]

Add a layer to the sequence.

forward(x: Tensor) Tensor[source]

Forward pass through all layers in sequence.

Parameters:

x – Input tensor

Returns:

Output tensor after passing through all layers

train()[source]

Set all layers to training mode.

eval()[source]

Set all layers to evaluation mode.

get_config()[source]

Get configuration for serialization.
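Chaining layers in insertion order amounts to a simple loop. The class below is a hypothetical stand-in that accepts any callables; it is meant to show the control flow, not the library's implementation.

```python
class SketchSequential:
    """Stand-in container: feeds each layer's output into the next layer."""

    def __init__(self, *layers):
        self.layers = list(layers)

    def add_layer(self, layer):
        self.layers.append(layer)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def __call__(self, x):
        return self.forward(x)


model = SketchSequential(lambda x: x + 1, lambda x: x * 2)
first = model(3)                   # (3 + 1) * 2
model.add_layer(lambda x: x - 1)
second = model(3)                  # (3 + 1) * 2 - 1
```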

class fit.nn.modules.container.ModuleList(modules: List[Layer] | None = None)[source]

Bases: Layer

List container for modules.

Unlike Sequential, ModuleList doesn’t define a forward pass; you need to implement it yourself.

__init__(modules: List[Layer] | None = None)[source]

Initialize module list.

Parameters:

modules – List of modules to store

append(module: Layer)[source]

Add a module to the end of the list.

extend(modules: List[Layer])[source]

Extend the list with multiple modules.

insert(index: int, module: Layer)[source]

Insert a module at the given index.

train()[source]

Set all modules to training mode.

eval()[source]

Set all modules to evaluation mode.

class fit.nn.modules.container.ModuleDict(modules: Dict[str, Layer] | None = None)[source]

Bases: Layer

Dictionary container for modules.

__init__(modules: Dict[str, Layer] | None = None)[source]

Initialize module dictionary.

Parameters:

modules – Dictionary of modules

keys()[source]

values()[source]

items()[source]

update(modules: Dict[str, Layer])[source]

Update with multiple modules.

train()[source]

Set all modules to training mode.

eval()[source]

Set all modules to evaluation mode.

class fit.nn.modules.container.Parallel(*layers)[source]

Bases: Layer

Parallel container that applies multiple layers to the same input.

__init__(*layers)[source]

Initialize parallel container.

Parameters:

*layers – Layers to apply in parallel

add_layer(layer: Layer)[source]

Add a layer to parallel execution.

forward(x: Tensor) List[Tensor][source]

Apply all layers to input in parallel.

Parameters:

x – Input tensor

Returns:

List of outputs from each layer

train()[source]

Set all layers to training mode.

eval()[source]

Set all layers to evaluation mode.

class fit.nn.modules.container.Residual(layer: Layer)[source]

Bases: Layer

Residual connection: output = input + layer(input)

__init__(layer: Layer)[source]

Initialize residual connection.

Parameters:

layer – Layer to wrap with residual connection

forward(x: Tensor) Tensor[source]

Forward pass with residual connection.

Parameters:

x – Input tensor

Returns:

x + layer(x)

train()[source]

Set layer to training mode.

eval()[source]

Set layer to evaluation mode.

class fit.nn.modules.container.Highway(layer: Layer, gate_layer: Layer | None = None)[source]

Bases: Layer

Highway connection: output = gate * layer(input) + (1 - gate) * input

__init__(layer: Layer, gate_layer: Layer | None = None)[source]

Initialize highway connection.

Parameters:
  • layer – Transform layer

  • gate_layer – Gate layer (if None, creates a linear layer)

forward(x: Tensor) Tensor[source]

Forward pass with highway connection.

Parameters:

x – Input tensor

Returns:

gate * layer(x) + (1 - gate) * x

train()[source]

Set layers to training mode.

eval()[source]

Set layers to evaluation mode.
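The highway formula blends the transformed input with the original according to a gate. In the sketch below the gate layer's output is squashed with a sigmoid so that it lies between 0 and 1; that squashing step and the function names are assumptions, since the reference only gives the blend formula.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway(x, layer, gate_layer):
    """Return gate * layer(x) + (1 - gate) * x, element-wise."""
    transformed = layer(x)
    gate = [sigmoid(z) for z in gate_layer(x)]  # assumed sigmoid gating
    return [
        g * h + (1.0 - g) * xi for g, h, xi in zip(gate, transformed, x)
    ]


def double(v):
    return [2.0 * vi for vi in v]


x = [1.0, 3.0]
half_open = highway(x, double, lambda v: [0.0 for _ in v])    # gate = 0.5
fully_open = highway(x, double, lambda v: [50.0 for _ in v])  # gate near 1
closed = highway(x, double, lambda v: [-50.0 for _ in v])     # gate near 0
```

A gate near 1 passes the transformed value, a gate near 0 passes the input through, and a constant gate of 1 would reduce this to the plain layer output.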

Functional Interface

Utilities

fit.nn.utils.model_io.save_model(model, path)[source]

Save model to file.

Parameters:
  • model – Model to save

  • path – Path to save to

fit.nn.utils.model_io.load_model(path, model_class=None)[source]

Load model from file.

Parameters:
  • path – Path to load from

  • model_class – Model class to instantiate (optional)

Returns:

Loaded model