Data Utilities

The data module provides utilities for data loading and preprocessing.

Dataset and DataLoader

Dataset classes for handling data.

class fit.data.dataset.Dataset(X: ndarray | Tensor, y: ndarray | Tensor | None = None, transform: Callable | None = None, target_transform: Callable | None = None)

Bases: object

Base dataset class for handling data.

Wraps arrays and provides indexing functionality.

__init__(X: ndarray | Tensor, y: ndarray | Tensor | None = None, transform: Callable | None = None, target_transform: Callable | None = None)

Initialize dataset.

Parameters:
  • X – Input features

  • y – Target labels (optional for unsupervised tasks)

  • transform – Optional transform to apply to features

  • target_transform – Optional transform to apply to targets

__len__() -> int

Return the size of the dataset.

__getitem__(idx: int) -> Tensor | Tuple[Tensor, Tensor]

Get item by index.

Parameters:

idx – Index of the item

Returns:

(X, y) tuple if y is provided, else just X

split(test_size: float = 0.2, random_state: int | None = None) -> Tuple[Dataset, Dataset]

Split dataset into train and test sets.

Parameters:
  • test_size – Fraction of data to use for testing

  • random_state – Random seed for reproducibility

Returns:

(train_dataset, test_dataset) tuple
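
A split of this kind is usually done by shuffling the sample indices once and slicing off the test portion. The sketch below illustrates that idea in plain Python. It is not fit's implementation: split_indices is a hypothetical helper, and rounding the test size down is an assumption.

```python
import random

def split_indices(n, test_size=0.2, random_state=None):
    """Return (train_idx, test_idx) for n samples (hypothetical helper)."""
    rng = random.Random(random_state)   # local RNG, global random state is untouched
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(n * test_size)         # assumption: round down
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10, test_size=0.2, random_state=0)
print(len(train_idx), len(test_idx))  # 8 2
```

Passing the same random_state reproduces the same partition, which is the purpose of the seed parameter.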

shuffle(random_state: int | None = None)

Shuffle the dataset in place.

Parameters:

random_state – Random seed for reproducibility

get_subset(indices: List[int] | ndarray) -> Dataset

Get a subset of the dataset.

Parameters:

indices – Indices to include in subset

Returns:

New dataset with selected indices

class fit.data.dataset.TensorDataset(*tensors: Tensor)

Bases: Dataset

Dataset from tensors.

Specialized dataset for when data is already in tensor format.

__init__(*tensors: Tensor)

Initialize tensor dataset.

Parameters:

*tensors – Variable number of tensors (features, targets, etc.)

__len__() -> int

Return the size of the dataset.

__getitem__(idx: int) -> Tensor | Tuple[Tensor, ...]

Get item by index.

Parameters:

idx – Index of the item

Returns:

Tensor or tuple of tensors

class fit.data.dataset.ConcatDataset(datasets: List[Dataset])

Bases: Dataset

Dataset for concatenating multiple datasets.

__init__(datasets: List[Dataset])

Initialize concatenated dataset.

Parameters:

datasets – List of datasets to concatenate

__len__() -> int

Return total size of concatenated datasets.

__getitem__(idx: int)

Get item by global index.

Parameters:

idx – Global index across all datasets

Returns:

Item from appropriate dataset
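
Resolving a global index means finding which dataset it falls into and what its local position is. One common way to do this uses cumulative sizes and a binary search. The sketch below shows that technique only. locate is a hypothetical helper, and whether ConcatDataset works this way internally is an assumption.

```python
from bisect import bisect_right
from itertools import accumulate

def locate(global_idx, sizes):
    """Map a global index to (dataset_number, local_index)."""
    cumulative = list(accumulate(sizes))        # e.g. [3, 5, 9] for sizes [3, 2, 4]
    if not 0 <= global_idx < cumulative[-1]:
        raise IndexError("index out of range")
    dataset_no = bisect_right(cumulative, global_idx)
    start = cumulative[dataset_no - 1] if dataset_no > 0 else 0
    return dataset_no, global_idx - start

print(locate(0, [3, 2, 4]))  # (0, 0)
print(locate(3, [3, 2, 4]))  # (1, 0)
print(locate(8, [3, 2, 4]))  # (2, 3)
```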

class fit.data.dataset.Subset(dataset: Dataset, indices: List[int] | ndarray)

Bases: Dataset

Subset of a dataset at specified indices.

__init__(dataset: Dataset, indices: List[int] | ndarray)

Initialize subset.

Parameters:
  • dataset – Original dataset

  • indices – Indices to include in subset

__len__() -> int

Return size of subset.

__getitem__(idx: int)

Get item by subset index.

Parameters:

idx – Index within subset

Returns:

Item from original dataset

class fit.data.dataset.RandomSampler(dataset: Dataset, replacement: bool = False, num_samples: int | None = None, random_state: int | None = None)

Bases: object

Random sampler for datasets.

__init__(dataset: Dataset, replacement: bool = False, num_samples: int | None = None, random_state: int | None = None)

Initialize random sampler.

Parameters:
  • dataset – Dataset to sample from

  • replacement – Whether to sample with replacement

  • num_samples – Number of samples to draw (default: len(dataset))

  • random_state – Random seed

__iter__()

Iterator over random indices.

__len__() -> int

Return number of samples.

class fit.data.dataset.SequentialSampler(dataset: Dataset)

Bases: object

Sequential sampler for datasets.

__init__(dataset: Dataset)

Initialize sequential sampler.

Parameters:

dataset – Dataset to sample from

__iter__()

Iterator over sequential indices.

__len__() -> int

Return number of samples.

DataLoader for batching and iterating over datasets.

class fit.data.dataloader.DataLoader(dataset: Dataset, batch_size: int = 1, shuffle: bool = False, sampler: Any | None = None, drop_last: bool = False, collate_fn: Callable | None = None, random_state: int | None = None)

Bases: object

DataLoader for batching and iterating over datasets.

Provides batching, shuffling, and parallel data loading functionality.

__init__(dataset: Dataset, batch_size: int = 1, shuffle: bool = False, sampler: Any | None = None, drop_last: bool = False, collate_fn: Callable | None = None, random_state: int | None = None)

Initialize DataLoader.

Parameters:
  • dataset – Dataset to load from

  • batch_size – Number of samples per batch

  • shuffle – Whether to shuffle data each epoch

  • sampler – Custom sampler (overrides shuffle)

  • drop_last – Whether to drop the last incomplete batch

  • collate_fn – Function to collate samples into batches

  • random_state – Random seed for reproducibility

__iter__() -> Iterator

Return iterator over batches.

__len__() -> int

Return number of batches.

class fit.data.dataloader.BatchSampler(sampler, batch_size: int, drop_last: bool = False)

Bases: object

Sampler that groups indices into batches.

__init__(sampler, batch_size: int, drop_last: bool = False)

Initialize batch sampler.

Parameters:
  • sampler – Base sampler to use

  • batch_size – Size of each batch

  • drop_last – Whether to drop the last incomplete batch

__iter__()

Iterator over batches of indices.

__len__() -> int

Return number of batches.
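
The effect of drop_last on both the batches produced and the batch count can be illustrated without the library. The following is a minimal sketch in plain Python. iter_batches and num_batches are hypothetical helpers, not part of fit.

```python
import math

def iter_batches(indices, batch_size, drop_last=False):
    """Yield lists of at most batch_size indices."""
    batch = []
    for idx in indices:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch  # final, shorter batch

def num_batches(n, batch_size, drop_last=False):
    """Batch count: floor division when dropping, ceiling otherwise."""
    return n // batch_size if drop_last else math.ceil(n / batch_size)

print(list(iter_batches(range(7), 3)))                  # [[0, 1, 2], [3, 4, 5], [6]]
print(list(iter_batches(range(7), 3, drop_last=True)))  # [[0, 1, 2], [3, 4, 5]]
print(num_batches(7, 3), num_batches(7, 3, drop_last=True))  # 3 2
```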

class fit.data.dataloader.WeightedRandomSampler(weights: list | ndarray, num_samples: int, replacement: bool = True, random_state: int | None = None)

Bases: object

Weighted random sampler for handling class imbalance.

__init__(weights: list | ndarray, num_samples: int, replacement: bool = True, random_state: int | None = None)

Initialize weighted random sampler.

Parameters:
  • weights – Weights for each sample

  • num_samples – Number of samples to draw

  • replacement – Whether to sample with replacement

  • random_state – Random seed

__iter__()

Iterator over weighted random indices.

__len__() -> int

Return number of samples.
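
For class imbalance, a typical choice of per-sample weight is the inverse frequency of the sample's class, so that each class is drawn about equally often. The sketch below shows that weighting with the standard library only. The toy labels are made up for illustration, and how fit normalizes or validates weights is not covered here.

```python
import random
from collections import Counter

labels = [0] * 90 + [1] * 10                  # imbalanced toy labels
counts = Counter(labels)
weights = [1.0 / counts[y] for y in labels]   # rare class gets the larger weight

rng = random.Random(0)
# random.choices draws with replacement, proportionally to the weights
sampled = rng.choices(range(len(labels)), weights=weights, k=1000)
share_minority = sum(labels[i] == 1 for i in sampled) / len(sampled)
print(round(share_minority, 2))  # close to 0.5 rather than 0.1
```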

class fit.data.dataloader.SubsetRandomSampler(indices: list | ndarray, random_state: int | None = None)

Bases: object

Random sampler for a subset of indices.

__init__(indices: list | ndarray, random_state: int | None = None)

Initialize subset random sampler.

Parameters:
  • indices – Subset of indices to sample from

  • random_state – Random seed

__iter__()

Iterator over shuffled subset indices.

__len__() -> int

Return number of indices.

class fit.data.dataloader.DistributedSampler(dataset: Dataset, num_replicas: int = 1, rank: int = 0, shuffle: bool = True, random_state: int | None = None)

Bases: object

Sampler for distributed training (placeholder implementation).

__init__(dataset: Dataset, num_replicas: int = 1, rank: int = 0, shuffle: bool = True, random_state: int | None = None)

Initialize distributed sampler.

Parameters:
  • dataset – Dataset to sample from

  • num_replicas – Number of processes participating in distributed training

  • rank – Rank of current process

  • shuffle – Whether to shuffle data

  • random_state – Random seed

__iter__()

Iterator over distributed indices.

__len__() -> int

Return number of samples for this rank.

set_epoch(epoch: int)

Set epoch for shuffling.
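
Since this class is documented as a placeholder, the following is only a sketch of the scheme such samplers commonly use: every rank shuffles with the same epoch-dependent seed, the index list is padded to a multiple of num_replicas, and each rank takes a strided slice. rank_indices is a hypothetical helper, and combining the seed as seed + epoch is an assumption.

```python
import math
import random

def rank_indices(n, num_replicas, rank, shuffle=True, seed=0, epoch=0):
    """Indices assigned to one rank (assumes n >= num_replicas)."""
    indices = list(range(n))
    if shuffle:
        # Same seed on every rank gives the same order within an epoch
        random.Random(seed + epoch).shuffle(indices)
    per_rank = math.ceil(n / num_replicas)
    total = per_rank * num_replicas
    indices += indices[: total - n]          # pad by repeating leading indices
    return indices[rank:total:num_replicas]  # strided slice for this rank

parts = [rank_indices(10, 3, r, shuffle=False) for r in range(3)]
print(parts)  # [[0, 3, 6, 9], [1, 4, 7, 0], [2, 5, 8, 1]]
```

Changing the epoch changes the shuffle order while keeping all ranks consistent with each other, which is the usual reason for calling set_epoch before each epoch.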

fit.data.dataloader.collate_tensors(batch)

Collate function for tensor data.

Parameters:

batch – List of tensor samples

Returns:

Batched tensor

fit.data.dataloader.collate_sequences(batch, pad_value=0)

Collate function for variable-length sequences.

Parameters:
  • batch – List of sequence samples

  • pad_value – Value to use for padding

Returns:

Padded batch tensor
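
Padding variable-length sequences amounts to extending each one to the length of the longest. The sketch below shows the idea on plain lists. pad_batch is a hypothetical helper, and padding on the right is an assumption. The library function returns a tensor rather than nested lists.

```python
def pad_batch(batch, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in batch]

print(pad_batch([[1, 2, 3], [4], [5, 6]]))
# [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```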

fit.data.dataloader.pin_memory(tensor)

Pin tensor memory (placeholder for GPU acceleration).

Parameters:

tensor – Tensor to pin

Returns:

Tensor (unchanged in CPU-only implementation)

class fit.data.dataloader.DataLoaderIter(loader: DataLoader)

Bases: object

Iterator for DataLoader with additional functionality.

__init__(loader: DataLoader)

Initialize DataLoader iterator.

Parameters:

loader – DataLoader to iterate over

__iter__()

Return self as iterator.

__next__()

Get next batch.

__len__()

Return number of batches.

property batch_index

Current batch index.

Built-in Datasets

Data Transformations

Feature Selection

Feature selection utilities for machine learning.

This module provides tools for selecting the most relevant features from datasets, including univariate and model-based selection methods.

class fit.data.feature_selection.SelectKBest(score_func: Callable | None = None, k: int = 10)

Bases: object

Select features according to the k highest scores.

Examples

>>> from fit.data.feature_selection import SelectKBest, f_classif
>>> selector = SelectKBest(score_func=f_classif, k=5)
>>> X_new = selector.fit_transform(X, y)

__init__(score_func: Callable | None = None, k: int = 10)

Initialize SelectKBest.

Parameters:
  • score_func – Function taking two arrays X and y, and returning scores and p-values

  • k – Number of top features to select

fit(X: ndarray, y: ndarray) -> SelectKBest

Run the score function on (X, y) and determine which features to keep.

Parameters:
  • X – Training data

  • y – Target values

Returns:

Self for method chaining

transform(X: ndarray) -> ndarray

Reduce X to the selected features.

Parameters:

X – Input data

Returns:

Data with selected features only

fit_transform(X: ndarray, y: ndarray) -> ndarray

Fit to data, then transform it.

Parameters:
  • X – Training data

  • y – Target values

Returns:

Data with selected features

get_support(indices: bool = False) -> ndarray | List[int]

Get a mask, or integer index, of the selected features.

Parameters:

indices – If True, return feature indices instead of mask

Returns:

Boolean mask or feature indices
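
The relationship between the scores, the boolean mask, and the integer indices can be shown with a small sketch. top_k_mask is a hypothetical helper in plain Python, and the scores are made up. It illustrates top-k selection in general, not fit's internals.

```python
def top_k_mask(scores, k):
    """Boolean mask marking the k highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]

scores = [0.3, 2.5, 0.1, 1.7]
mask = top_k_mask(scores, k=2)
print(mask)                                   # [False, True, False, True]
print([i for i, m in enumerate(mask) if m])   # [1, 3]
```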

class fit.data.feature_selection.SelectPercentile(score_func: Callable | None = None, percentile: int = 10)

Bases: object

Select features according to a percentile of the highest scores.

Examples

>>> selector = SelectPercentile(score_func=f_classif, percentile=20)
>>> X_new = selector.fit_transform(X, y)

__init__(score_func: Callable | None = None, percentile: int = 10)

Initialize SelectPercentile.

Parameters:
  • score_func – Function taking two arrays X and y, and returning scores

  • percentile – Percent of features to keep

fit(X: ndarray, y: ndarray) -> SelectPercentile

Run the score function on (X, y) and determine which features to keep.

Parameters:
  • X – Training data

  • y – Target values

Returns:

Self for method chaining

transform(X: ndarray) -> ndarray

Reduce X to the selected features.

Parameters:

X – Input data

Returns:

Data with selected features only

fit_transform(X: ndarray, y: ndarray) -> ndarray

Fit to data, then transform it.

Parameters:
  • X – Training data

  • y – Target values

Returns:

Data with selected features

class fit.data.feature_selection.VarianceThreshold(threshold: float = 0.0)

Bases: object

Feature selector that removes all low-variance features.

Examples

>>> selector = VarianceThreshold(threshold=0.1)
>>> X_new = selector.fit_transform(X)

__init__(threshold: float = 0.0)

Initialize VarianceThreshold.

Parameters:

threshold – Features with variance below this threshold will be removed

fit(X: ndarray, y: ndarray | None = None) -> VarianceThreshold

Learn which features have variance above the threshold.

Parameters:
  • X – Training data

  • y – Not used, present for API consistency

Returns:

Self for method chaining

transform(X: ndarray) -> ndarray

Reduce X to the selected features.

Parameters:

X – Input data

Returns:

Data with selected features only

fit_transform(X: ndarray, y: ndarray | None = None) -> ndarray

Fit to data, then transform it.

Parameters:
  • X – Training data

  • y – Not used, present for API consistency

Returns:

Data with selected features
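
What variance thresholding does to a small matrix can be worked through by hand. The sketch below uses only the standard library. Using the population variance and keeping columns strictly above the threshold are both assumptions about the exact rule.

```python
from statistics import pvariance

X = [[0.0, 1.0, 5.0],
     [0.0, 2.0, 5.1],
     [0.0, 3.0, 4.9]]

columns = list(zip(*X))                       # column-wise view of X
variances = [pvariance(col) for col in columns]
threshold = 0.1
keep = [j for j, v in enumerate(variances) if v > threshold]
X_new = [[row[j] for j in keep] for row in X]

print(keep)    # [1]
print(X_new)   # [[1.0], [2.0], [3.0]]
```

The constant first column and the nearly constant third column are dropped. Only the second column varies enough to be kept.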

class fit.data.feature_selection.RFE(estimator, n_features_to_select: int | None = None, step: int = 1)

Bases: object

Recursive Feature Elimination.

Given an external estimator that assigns weights to features, RFE repeatedly removes the lowest-weighted features and refits the estimator on those that remain, until the requested number of features is left.

Examples

>>> from fit.nn.modules.linear import Linear
>>> estimator = Linear(10, 1)  # Simple linear model
>>> selector = RFE(estimator, n_features_to_select=5)
>>> X_new = selector.fit_transform(X, y)

__init__(estimator, n_features_to_select: int | None = None, step: int = 1)

Initialize RFE.

Parameters:
  • estimator – Supervised learning estimator with a fit method

  • n_features_to_select – Number of features to select

  • step – Number of features to remove at each iteration

fit(X: ndarray, y: ndarray) -> RFE

Fit the RFE model.

Parameters:
  • X – Training data

  • y – Target values

Returns:

Self for method chaining

transform(X: ndarray) -> ndarray

Reduce X to the selected features.

Parameters:

X – Input data

Returns:

Data with selected features only

fit_transform(X: ndarray, y: ndarray) -> ndarray

Fit to data, then transform it.

Parameters:
  • X – Training data

  • y – Target values

Returns:

Data with selected features
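
The elimination loop behind recursive feature elimination can be sketched independently of any estimator. In the sketch below, rfe is a hypothetical helper, and importance_fn stands in for "fit the estimator on the remaining features and read off its weights". Fixed importances are used so the result is easy to follow.

```python
def rfe(feature_ids, importance_fn, n_features_to_select, step=1):
    """Drop the least important features, step at a time."""
    remaining = list(feature_ids)
    while len(remaining) > n_features_to_select:
        importances = importance_fn(remaining)   # stands in for a refit
        n_drop = min(step, len(remaining) - n_features_to_select)
        ranked = sorted(range(len(remaining)), key=lambda i: importances[i])
        drop = set(ranked[:n_drop])              # positions of the weakest features
        remaining = [f for i, f in enumerate(remaining) if i not in drop]
    return remaining

fixed = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.05, 4: 0.7}
selected = rfe(range(5), lambda feats: [fixed[f] for f in feats],
               n_features_to_select=2, step=2)
print(selected)  # [0, 4]
```

With step=2, the first pass removes features 3 and 1. The second pass removes only feature 2, so that exactly two features remain.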

fit.data.feature_selection.f_classif(X: ndarray, y: ndarray) -> Tuple[ndarray, ndarray]

Compute the ANOVA F-value for the provided sample.

Parameters:
  • X – Sample data

  • y – Target values

Returns:

Tuple of (F-statistics, p-values)
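
For a single feature, the one-way ANOVA F-statistic is the between-class mean square divided by the within-class mean square. The sketch below computes it for one column with the standard library. anova_f is a hypothetical helper, the data is made up, and the p-value is omitted.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F-statistic for one feature."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    # Between-class and within-class sums of squares
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f = anova_f([1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1])
print(f)  # 54.0
```

A large value means the class means differ strongly relative to the spread within each class, which is why such features rank highly in univariate selection.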

fit.data.feature_selection.f_regression(X: ndarray, y: ndarray) -> Tuple[ndarray, ndarray]

Univariate linear regression tests.

Parameters:
  • X – Sample data

  • y – Target values

Returns:

Tuple of (F-statistics, p-values)

fit.data.feature_selection.mutual_info_classif(X: ndarray, y: ndarray) -> Tuple[ndarray, ndarray]

Estimate mutual information for a discrete target variable.

Parameters:
  • X – Sample data

  • y – Target values

Returns:

Tuple of (mutual information scores, dummy p-values)

Preprocessing

Data preprocessing utilities for machine learning.

This module provides tools for encoding categorical variables, scaling features, and other preprocessing tasks.

class fit.data.preprocessing.LabelEncoder

Bases: object

Encode target labels with values between 0 and n_classes-1.

Examples

>>> encoder = LabelEncoder()
>>> labels = ['cat', 'dog', 'cat', 'bird']
>>> encoded = encoder.fit_transform(labels)
>>> print(encoded)  # [0, 1, 0, 2]
>>> decoded = encoder.inverse_transform(encoded)
>>> print(decoded)  # ['cat', 'dog', 'cat', 'bird']

__init__()

fit(y: List | ndarray) -> LabelEncoder

Fit label encoder.

Parameters:

y – Target values

Returns:

Self for method chaining

transform(y: List | ndarray) -> ndarray

Transform labels to normalized encoding.

Parameters:

y – Target values

Returns:

Encoded labels

fit_transform(y: List | ndarray) -> ndarray

Fit label encoder and return encoded labels.

Parameters:

y – Target values

Returns:

Encoded labels

inverse_transform(y: ndarray) -> ndarray

Transform labels back to original encoding.

Parameters:

y – Encoded target values

Returns:

Original labels
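
The round trip between labels and integer codes comes down to a lookup table and its inverse. The sketch below shows this with plain Python containers. It assumes codes are assigned in order of first appearance, which is what the output in the class example suggests. An encoder that sorts the classes first would produce different codes.

```python
labels = ["cat", "dog", "cat", "bird"]

# Assumption: codes follow first appearance of each label
classes = list(dict.fromkeys(labels))            # ['cat', 'dog', 'bird']
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]   # forward mapping
decoded = [classes[code] for code in encoded]    # inverse mapping
print(encoded)  # [0, 1, 0, 2]
print(decoded)  # ['cat', 'dog', 'cat', 'bird']
```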

class fit.data.preprocessing.OneHotEncoder(sparse: bool = False, drop: str | None = None)

Bases: object

Encode categorical features as a one-hot numeric array.

Examples

>>> encoder = OneHotEncoder()
>>> data = [['cat'], ['dog'], ['cat'], ['bird']]
>>> encoded = encoder.fit_transform(data)
>>> print(encoded.shape)  # (4, 3)

__init__(sparse: bool = False, drop: str | None = None)

Initialize OneHotEncoder.

Parameters:
  • sparse – Return sparse matrix if True (not implemented yet)

  • drop – Strategy to use to drop one category per feature (not implemented yet)

fit(X: List | ndarray) -> OneHotEncoder

Fit OneHotEncoder to X.

Parameters:

X – Input samples

Returns:

Self for method chaining

transform(X: List | ndarray) -> ndarray

Transform X using one-hot encoding.

Parameters:

X – Input samples

Returns:

One-hot encoded array

fit_transform(X: List | ndarray) -> ndarray

Fit OneHotEncoder and transform X.

Parameters:

X – Input samples

Returns:

One-hot encoded array

get_feature_names_out(input_features: List[str] | None = None) -> List[str]

Get output feature names for transformation.

Parameters:

input_features – Input feature names

Returns:

Output feature names
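
For a single categorical column, one-hot encoding assigns each category its own output column. The sketch below reproduces the (4, 3) shape from the class example with plain lists. Sorting the categories to fix the column order is an assumption, and the real encoder returns an array.

```python
data = [["cat"], ["dog"], ["cat"], ["bird"]]
values = [row[0] for row in data]

categories = sorted(set(values))              # assumption: sorted column order
column_of = {c: j for j, c in enumerate(categories)}

# One row per sample, one column per category, a single 1 per row
encoded = [[1 if column_of[v] == j else 0 for j in range(len(categories))]
           for v in values]
print(categories)  # ['bird', 'cat', 'dog']
print(encoded)     # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```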

class fit.data.preprocessing.StandardScaler(with_mean: bool = True, with_std: bool = True)

Bases: object

Standardize features by removing the mean and scaling to unit variance.

Examples

>>> scaler = StandardScaler()
>>> X = [[1, 2], [3, 4], [5, 6]]
>>> X_scaled = scaler.fit_transform(X)

__init__(with_mean: bool = True, with_std: bool = True)

Initialize StandardScaler.

Parameters:
  • with_mean – Center the data before scaling

  • with_std – Scale the data to unit variance

fit(X: ndarray) -> StandardScaler

Compute the mean and std to be used for later scaling.

Parameters:

X – Training data

Returns:

Self for method chaining

transform(X: ndarray) -> ndarray

Perform standardization by centering and scaling.

Parameters:

X – Data to transform

Returns:

Transformed data

fit_transform(X: ndarray) -> ndarray

Fit to data, then transform it.

Parameters:

X – Training data

Returns:

Transformed data

inverse_transform(X: ndarray) -> ndarray

Scale back the data to the original representation.

Parameters:

X – Transformed data

Returns:

Original scale data
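
Standardization and its inverse are the per-feature formulas z = (x - mean) / std and x = z * std + mean. The sketch below applies them to one column with the standard library. Using the population standard deviation is an assumption about how the scaler computes std.

```python
from statistics import mean, pstdev

column = [1.0, 3.0, 5.0]
mu, sigma = mean(column), pstdev(column)     # assumption: population std

scaled = [(x - mu) / sigma for x in column]      # transform
restored = [z * sigma + mu for z in scaled]      # inverse_transform
print(scaled)     # roughly [-1.22, 0.0, 1.22]
print(restored)   # back to the original values, up to rounding
```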

class fit.data.preprocessing.MinMaxScaler(feature_range: tuple = (0, 1))

Bases: object

Transform features by scaling each feature to a given range.

Examples

>>> scaler = MinMaxScaler()
>>> X = [[1, 2], [3, 4], [5, 6]]
>>> X_scaled = scaler.fit_transform(X)  # Scale to [0, 1]

__init__(feature_range: tuple = (0, 1))

Initialize MinMaxScaler.

Parameters:

feature_range – Desired range of transformed data

fit(X: ndarray) -> MinMaxScaler

Compute the minimum and maximum to be used for later scaling.

Parameters:

X – Training data

Returns:

Self for method chaining

transform(X: ndarray) -> ndarray

Scale features according to feature_range.

Parameters:

X – Data to transform

Returns:

Transformed data

fit_transform(X: ndarray) -> ndarray

Fit to data, then transform it.

Parameters:

X – Training data

Returns:

Transformed data

inverse_transform(X: ndarray) -> ndarray

Undo the scaling of X according to feature_range.

Parameters:

X – Transformed data

Returns:

Original scale data
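
Min-max scaling maps each feature linearly so that its minimum lands on the lower bound of feature_range and its maximum on the upper bound. The sketch below shows the forward and inverse formulas for one column. minmax_scale and minmax_inverse are hypothetical helpers, and the handling of constant columns is an assumption.

```python
def minmax_scale(column, feature_range=(0, 1)):
    """Scale one column linearly into feature_range."""
    lo, hi = feature_range
    c_min, c_max = min(column), max(column)
    span = (c_max - c_min) or 1.0      # assumption: constant columns map to lo
    return [lo + (x - c_min) / span * (hi - lo) for x in column]

def minmax_inverse(scaled, c_min, c_max, feature_range=(0, 1)):
    """Undo minmax_scale given the original column minimum and maximum."""
    lo, hi = feature_range
    return [c_min + (z - lo) / (hi - lo) * (c_max - c_min) for z in scaled]

col = [1, 3, 5]
scaled = minmax_scale(col)
print(scaled)                          # [0.0, 0.5, 1.0]
print(minmax_inverse(scaled, 1, 5))    # [1.0, 3.0, 5.0]
```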