Data Utilities
The data module provides utilities for data loading and preprocessing.
Dataset and DataLoader
Dataset classes for handling data.
- class fit.data.dataset.Dataset(X: ndarray | Tensor, y: ndarray | Tensor | None = None, transform: Callable | None = None, target_transform: Callable | None = None)
Bases: object
Base dataset class for handling data.
Wraps arrays and provides indexing functionality.
- __init__(X: ndarray | Tensor, y: ndarray | Tensor | None = None, transform: Callable | None = None, target_transform: Callable | None = None)
Initialize dataset.
- Parameters:
X – Input features
y – Target labels (optional for unsupervised tasks)
transform – Optional transform to apply to features
target_transform – Optional transform to apply to targets
- __getitem__(idx: int) -> Tensor | Tuple[Tensor, Tensor]
Get item by index.
- Parameters:
idx – Index of the item
- Returns:
(X, y) tuple if y is provided, else just X
- split(test_size: float = 0.2, random_state: int | None = None) -> Tuple[Dataset, Dataset]
Split dataset into train and test sets.
- Parameters:
test_size – Fraction of data to use for testing
random_state – Random seed for reproducibility
- Returns:
(train_dataset, test_dataset) tuple
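To illustrate what a seeded train/test split does, the sketch below shuffles indices with the standard library and sets aside the first test_size fraction. The helper name split_indices and the rounding rule are assumptions made for illustration; this is not the fit implementation.

```python
import random


def split_indices(n, test_size=0.2, random_state=None):
    """Return (train_indices, test_indices) for a dataset of length n."""
    rng = random.Random(random_state)   # seeded generator gives repeatable splits
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_size))  # assumed rounding rule
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.2, random_state=0)
print(len(train_idx), len(test_idx))
```

Calling the helper twice with the same random_state yields the same partition, which is the point of exposing a seed.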
- class fit.data.dataset.TensorDataset(*tensors: Tensor)
Bases: Dataset
Dataset from tensors.
Specialized dataset for when data is already in tensor format.
- class fit.data.dataset.ConcatDataset(datasets: List[Dataset])
Bases: Dataset
Dataset for concatenating multiple datasets.
- class fit.data.dataset.Subset(dataset: Dataset, indices: List[int] | ndarray)
Bases: Dataset
Subset of a dataset at specified indices.
- class fit.data.dataset.RandomSampler(dataset: Dataset, replacement: bool = False, num_samples: int | None = None, random_state: int | None = None)
Bases: object
Random sampler for datasets.
- __init__(dataset: Dataset, replacement: bool = False, num_samples: int | None = None, random_state: int | None = None)
Initialize random sampler.
- Parameters:
dataset – Dataset to sample from
replacement – Whether to sample with replacement
num_samples – Number of samples to draw (default: len(dataset))
random_state – Random seed
- class fit.data.dataset.SequentialSampler(dataset: Dataset)
Bases: object
Sequential sampler for datasets.
DataLoader for batching and iterating over datasets.
- class fit.data.dataloader.DataLoader(dataset: Dataset, batch_size: int = 1, shuffle: bool = False, sampler: Any | None = None, drop_last: bool = False, collate_fn: Callable | None = None, random_state: int | None = None)
Bases: object
DataLoader for batching and iterating over datasets.
Provides batching, shuffling, and parallel data loading functionality.
- __init__(dataset: Dataset, batch_size: int = 1, shuffle: bool = False, sampler: Any | None = None, drop_last: bool = False, collate_fn: Callable | None = None, random_state: int | None = None)
Initialize DataLoader.
- Parameters:
dataset – Dataset to load from
batch_size – Number of samples per batch
shuffle – Whether to shuffle data each epoch
sampler – Custom sampler (overrides shuffle)
drop_last – Whether to drop the last incomplete batch
collate_fn – Function to collate samples into batches
random_state – Random seed for reproducibility
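The interaction between batch_size and drop_last can be shown with a small pure-Python sketch. The helper batch_indices is hypothetical and only mimics the grouping of indices; it does not reproduce how DataLoader is implemented.

```python
def batch_indices(n, batch_size, drop_last=False):
    """Group the indices 0..n-1 into consecutive batches."""
    batches = [
        list(range(start, min(start + batch_size, n)))
        for start in range(0, n, batch_size)
    ]
    # Discard a trailing batch that is shorter than batch_size.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


print(batch_indices(10, 4))                  # last batch holds the 2 leftover indices
print(batch_indices(10, 4, drop_last=True))  # leftover indices are discarded
```

Dropping the last batch keeps every batch the same size, at the cost of skipping a few samples each epoch.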
- class fit.data.dataloader.BatchSampler(sampler, batch_size: int, drop_last: bool = False)
Bases: object
Sampler that groups indices into batches.
- class fit.data.dataloader.WeightedRandomSampler(weights: list | ndarray, num_samples: int, replacement: bool = True, random_state: int | None = None)
Bases: object
Weighted random sampler for handling class imbalance.
- __init__(weights: list | ndarray, num_samples: int, replacement: bool = True, random_state: int | None = None)
Initialize weighted random sampler.
- Parameters:
weights – Weights for each sample
num_samples – Number of samples to draw
replacement – Whether to sample with replacement
random_state – Random seed
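A common way to choose the weights is to give each sample the inverse of its class frequency, so that every class is drawn about equally often. The sketch below demonstrates this with the standard library's random.Random.choices. It is an illustration of the idea under that assumption, not the code behind WeightedRandomSampler.

```python
import random
from collections import Counter

labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # 8 majority samples, 2 minority samples

# Inverse class frequency: each class receives the same total weight.
counts = Counter(labels)
weights = [1.0 / counts[label] for label in labels]

rng = random.Random(0)                     # seeded for reproducibility
sampled = rng.choices(range(len(labels)), weights=weights, k=1000)  # with replacement

minority_share = sum(labels[i] == 1 for i in sampled) / len(sampled)
print(round(minority_share, 2))            # close to 0.5 rather than the raw 0.2
```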
- class fit.data.dataloader.SubsetRandomSampler(indices: list | ndarray, random_state: int | None = None)
Bases: object
Random sampler for a subset of indices.
- class fit.data.dataloader.DistributedSampler(dataset: Dataset, num_replicas: int = 1, rank: int = 0, shuffle: bool = True, random_state: int | None = None)
Bases: object
Sampler for distributed training (placeholder implementation).
- __init__(dataset: Dataset, num_replicas: int = 1, rank: int = 0, shuffle: bool = True, random_state: int | None = None)
Initialize distributed sampler.
- Parameters:
dataset – Dataset to sample from
num_replicas – Number of processes participating in distributed training
rank – Rank of current process
shuffle – Whether to shuffle data
random_state – Random seed
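Since the class is documented as a placeholder, its actual partitioning scheme is not specified here. One widely used scheme, shown below as an assumption, gives each process every num_replicas-th index starting at its rank; the helper shard_indices is hypothetical.

```python
def shard_indices(n, num_replicas, rank):
    """Return the indices assigned to one process under strided sharding."""
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must be in [0, num_replicas)")
    return list(range(n))[rank::num_replicas]


shards = [shard_indices(10, num_replicas=3, rank=r) for r in range(3)]
print(shards)
```

The shards are disjoint and together cover the whole dataset, although their sizes can differ by one when n is not divisible by num_replicas.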
- fit.data.dataloader.collate_tensors(batch)
Collate function for tensor data.
- Parameters:
batch – List of tensor samples
- Returns:
Batched tensor
- fit.data.dataloader.collate_sequences(batch, pad_value=0)
Collate function for variable-length sequences.
- Parameters:
batch – List of sequence samples
pad_value – Value to use for padding
- Returns:
Padded batch tensor
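Padding variable-length sequences amounts to extending each one to the length of the longest sequence in the batch. The sketch below works on plain lists and pads on the right; the padding side and the helper name pad_sequences are assumptions, and the real function returns a tensor rather than nested lists.

```python
def pad_sequences(batch, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max((len(seq) for seq in batch), default=0)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in batch]


padded = pad_sequences([[1, 2, 3], [4], [5, 6]])
print(padded)
```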
- fit.data.dataloader.pin_memory(tensor)
Pin tensor memory (placeholder for GPU acceleration).
- Parameters:
tensor – Tensor to pin
- Returns:
Tensor (unchanged in CPU-only implementation)
- class fit.data.dataloader.DataLoaderIter(loader: DataLoader)
Bases: object
Iterator for DataLoader with additional functionality.
- __init__(loader: DataLoader)
Initialize DataLoader iterator.
- Parameters:
loader – DataLoader to iterate over
- property batch_index
Current batch index.
Built-in Datasets
Data Transformations
Feature Selection
Feature selection utilities for machine learning.
This module provides tools for selecting the most relevant features from datasets, including univariate and model-based selection methods.
- class fit.data.feature_selection.SelectKBest(score_func: Callable | None = None, k: int = 10)
Bases: object
Select features according to the k highest scores.
Examples
>>> from fit.data.feature_selection import SelectKBest, f_classif
>>> selector = SelectKBest(score_func=f_classif, k=5)
>>> X_new = selector.fit_transform(X, y)
- __init__(score_func: Callable | None = None, k: int = 10)
Initialize SelectKBest.
- Parameters:
score_func – Function taking two arrays X and y, and returning scores and p-values
k – Number of top features to select
- fit(X: ndarray, y: ndarray) -> SelectKBest
Run score function on (X, y) and get the appropriate features.
- Parameters:
X – Training data
y – Target values
- Returns:
Self for method chaining
- transform(X: ndarray) -> ndarray
Reduce X to the selected features.
- Parameters:
X – Input data
- Returns:
Data with selected features only
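Conceptually, the selection step ranks features by score and keeps the columns of the k best. The sketch below does this on nested lists with made-up scores; top_k_indices and select_columns are hypothetical helpers, and details such as tie-breaking are assumptions rather than documented behavior.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, returned in column order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def select_columns(X, columns):
    """Keep only the given columns of a row-major matrix."""
    return [[row[j] for j in columns] for row in X]


scores = [0.1, 5.0, 2.5, 0.3]          # one illustrative score per feature
X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]

keep = top_k_indices(scores, k=2)
X_new = select_columns(X, keep)
print(keep, X_new)
```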
- class fit.data.feature_selection.SelectPercentile(score_func: Callable | None = None, percentile: int = 10)
Bases: object
Select features according to a percentile of the highest scores.
Examples
>>> selector = SelectPercentile(score_func=f_classif, percentile=20)
>>> X_new = selector.fit_transform(X, y)
- __init__(score_func: Callable | None = None, percentile: int = 10)
Initialize SelectPercentile.
- Parameters:
score_func – Function taking two arrays X and y, and returning scores
percentile – Percent of features to keep
- fit(X: ndarray, y: ndarray) -> SelectPercentile
Run score function on (X, y) and get the appropriate features.
- Parameters:
X – Training data
y – Target values
- Returns:
Self for method chaining
- class fit.data.feature_selection.VarianceThreshold(threshold: float = 0.0)
Bases: object
Feature selector that removes all low-variance features.
Examples
>>> selector = VarianceThreshold(threshold=0.1)
>>> X_new = selector.fit_transform(X)
- __init__(threshold: float = 0.0)
Initialize VarianceThreshold.
- Parameters:
threshold – Features with variance below this threshold will be removed
- fit(X: ndarray, y: ndarray | None = None) -> VarianceThreshold
Learn which features have variance above the threshold.
- Parameters:
X – Training data
y – Not used, present for API consistency
- Returns:
Self for method chaining
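The fitting step boils down to computing a per-column variance and comparing it with the threshold. The sketch below uses the population variance (dividing by n) and a strict "greater than" comparison; both choices, and the helper names, are assumptions made for illustration.

```python
def column_variances(X):
    """Population variance of each column of a row-major matrix."""
    n = len(X)
    variances = []
    for col in zip(*X):
        mean = sum(col) / n
        variances.append(sum((v - mean) ** 2 for v in col) / n)
    return variances


def variance_mask(X, threshold=0.0):
    """True for every column whose variance is above the threshold."""
    return [var > threshold for var in column_variances(X)]


X = [[0, 1.0, 10],
     [0, 2.0, 10],
     [0, 3.0, 11]]

print(variance_mask(X))                 # the constant first column is dropped
print(variance_mask(X, threshold=0.5))  # the nearly constant third column goes too
```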
- class fit.data.feature_selection.RFE(estimator, n_features_to_select: int | None = None, step: int = 1)
Bases: object
Recursive Feature Elimination.
Given an external estimator that assigns weights to features, RFE recursively eliminates features and builds the model with the remaining attributes.
Examples
>>> from fit.nn.modules.linear import Linear
>>> estimator = Linear(10, 1)  # Simple linear model
>>> selector = RFE(estimator, n_features_to_select=5)
>>> X_new = selector.fit_transform(X, y)
- __init__(estimator, n_features_to_select: int | None = None, step: int = 1)
Initialize RFE.
- Parameters:
estimator – Supervised learning estimator with a fit method
n_features_to_select – Number of features to select
step – Number of features to remove at each iteration
- fit(X: ndarray, y: ndarray) -> RFE
Fit the RFE model.
- Parameters:
X – Training data
y – Target values
- Returns:
Self for method chaining
- fit.data.feature_selection.f_classif(X: ndarray, y: ndarray) -> Tuple[ndarray, ndarray]
Compute the ANOVA F-value for the provided sample.
- Parameters:
X – Sample data
y – Target values
- Returns:
Tuple of (F-statistics, p-values)
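For a single feature, the one-way ANOVA F-statistic is the ratio of between-class to within-class mean squares. The sketch below computes it in pure Python for one column. The helper anova_f is hypothetical, and p-values are left out because they require the F-distribution CDF, which the standard library does not provide.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic for one feature grouped by class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    # Between-group and within-group sums of squares.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups.values()
    )
    # Ratio of mean squares with (k - 1) and (n - k) degrees of freedom.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


f_value = anova_f([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])
print(f_value)
```

In this toy data the class means are 2 and 8 around a grand mean of 5, so the between-group sum of squares is 54, the within-group sum is 4, and the statistic is (54 / 1) / (4 / 4) = 54.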
Preprocessing
Data preprocessing utilities for machine learning.
This module provides tools for encoding categorical variables, scaling features, and other preprocessing tasks.
- class fit.data.preprocessing.LabelEncoder
Bases: object
Encode target labels with values between 0 and n_classes - 1.
Examples
>>> encoder = LabelEncoder()
>>> labels = ['cat', 'dog', 'cat', 'bird']
>>> encoded = encoder.fit_transform(labels)
>>> print(encoded)  # [0, 1, 0, 2]
>>> decoded = encoder.inverse_transform(encoded)
>>> print(decoded)  # ['cat', 'dog', 'cat', 'bird']
- fit(y: List | ndarray) -> LabelEncoder
Fit label encoder.
- Parameters:
y – Target values
- Returns:
Self for method chaining
- transform(y: List | ndarray) -> ndarray
Transform labels to normalized encoding.
- Parameters:
y – Target values
- Returns:
Encoded labels
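The mapping between labels and integers can be sketched with a list of classes and a lookup dictionary. The sketch assigns codes in order of first appearance, which is an assumption chosen to match the documented example output; an encoder that sorts its classes would produce different integers. The helper names are hypothetical.

```python
def fit_labels(y):
    """Collect the distinct labels in order of first appearance."""
    classes = []
    for label in y:
        if label not in classes:
            classes.append(label)
    return classes


def encode(y, classes):
    """Map each label to its position in classes."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in y]


def decode(codes, classes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]


labels = ['cat', 'dog', 'cat', 'bird']
classes = fit_labels(labels)
encoded = encode(labels, classes)
decoded = decode(encoded, classes)
print(encoded, decoded)
```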
- class fit.data.preprocessing.OneHotEncoder(sparse: bool = False, drop: str | None = None)
Bases: object
Encode categorical features as a one-hot numeric array.
Examples
>>> encoder = OneHotEncoder()
>>> data = [['cat'], ['dog'], ['cat'], ['bird']]
>>> encoded = encoder.fit_transform(data)
>>> print(encoded.shape)  # (4, 3)
- __init__(sparse: bool = False, drop: str | None = None)
Initialize OneHotEncoder.
- Parameters:
sparse – Return sparse matrix if True (not implemented yet)
drop – Strategy to use to drop one category per feature (not implemented yet)
- fit(X: List | ndarray) -> OneHotEncoder
Fit OneHotEncoder to X.
- Parameters:
X – Input samples
- Returns:
Self for method chaining
- transform(X: List | ndarray) -> ndarray
Transform X using one-hot encoding.
- Parameters:
X – Input samples
- Returns:
One-hot encoded array
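One-hot encoding replaces each category with a row that is all zeros except for a single 1 in that category's column. The sketch below handles one categorical column held in a plain list. The sorted column order and the helper name one_hot are assumptions, and the real encoder returns an array.

```python
def one_hot(values):
    """Return (categories, rows) for a single categorical column."""
    categories = sorted(set(values))        # assumed column order
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1               # mark the matching category
        rows.append(row)
    return categories, rows


categories, rows = one_hot(['cat', 'dog', 'cat', 'bird'])
print(categories)
print(rows)
```

Four samples over three distinct categories give a 4 x 3 result, the same shape as in the documented example.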
- class fit.data.preprocessing.StandardScaler(with_mean: bool = True, with_std: bool = True)
Bases: object
Standardize features by removing the mean and scaling to unit variance.
Examples
>>> scaler = StandardScaler()
>>> X = [[1, 2], [3, 4], [5, 6]]
>>> X_scaled = scaler.fit_transform(X)
- __init__(with_mean: bool = True, with_std: bool = True)
Initialize StandardScaler.
- Parameters:
with_mean – Center the data before scaling
with_std – Scale the data to unit variance
- fit(X: ndarray) -> StandardScaler
Compute the mean and std to be used for later scaling.
- Parameters:
X – Training data
- Returns:
Self for method chaining
- transform(X: ndarray) -> ndarray
Perform standardization by centering and scaling.
- Parameters:
X – Data to transform
- Returns:
Transformed data
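Standardization maps each value x in a column to (x - mean) / std. The sketch below applies this per column in pure Python, using the population standard deviation and leaving constant columns unscaled. Both choices, and the helper name standardize, are assumptions rather than documented behavior.

```python
import math


def standardize(X, with_mean=True, with_std=True):
    """Center and scale each column of a row-major matrix."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / n)   # population std
        for c, m in zip(cols, means)
    ]
    result = []
    for row in X:
        new_row = []
        for v, m, s in zip(row, means, stds):
            if with_mean:
                v = v - m
            if with_std and s > 0:                    # skip constant columns
                v = v / s
            new_row.append(v)
        result.append(new_row)
    return result


X_scaled = standardize([[1, 2], [3, 4], [5, 6]])
print(X_scaled)
```

After scaling, each column has mean 0 and unit variance, so the middle row of this example becomes all zeros.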
- class fit.data.preprocessing.MinMaxScaler(feature_range: tuple = (0, 1))
Bases: object
Transform features by scaling each feature to a given range.
Examples
>>> scaler = MinMaxScaler()
>>> X = [[1, 2], [3, 4], [5, 6]]
>>> X_scaled = scaler.fit_transform(X)  # Scale to [0, 1]
- __init__(feature_range: tuple = (0, 1))
Initialize MinMaxScaler.
- Parameters:
feature_range – Desired range of transformed data
- fit(X: ndarray) -> MinMaxScaler
Compute the minimum and maximum to be used for later scaling.
- Parameters:
X – Training data
- Returns:
Self for method chaining
- transform(X: ndarray) -> ndarray
Scale features according to feature_range.
- Parameters:
X – Data to transform
- Returns:
Transformed data
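Min-max scaling maps each value x in a column to lo + (x - min) / (max - min) * (hi - lo), where (lo, hi) is the feature_range. The sketch below is a pure-Python illustration. The helper name min_max_scale and the handling of constant columns (mapped to lo) are assumptions.

```python
def min_max_scale(X, feature_range=(0, 1)):
    """Scale each column of a row-major matrix into feature_range."""
    lo, hi = feature_range
    cols = list(zip(*X))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    result = []
    for row in X:
        new_row = []
        for v, mn, mx in zip(row, mins, maxs):
            span = mx - mn
            unit = (v - mn) / span if span else 0.0   # constant column maps to lo
            new_row.append(lo + unit * (hi - lo))
        result.append(new_row)
    return result


X_scaled = min_max_scale([[1, 2], [3, 4], [5, 6]])
print(X_scaled)
```

Passing feature_range=(-1, 1) to the same helper sends the column minimum to -1 and the maximum to 1.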