conditional_kde package

Top-level package for Conditional KDE.

class conditional_kde.ConditionalGaussian(bandwidth=1.0)

Bases: object

Conditional Gaussian. Makes a simple Gaussian fit to the data, allowing for conditioning.

Parameters:

bandwidth (float) – allows for the additional smoothing/shrinking of the covariance. In most cases, it should be left as 1.

static _covariance_decomposition(cov, cond_mask, cond_only=False)

Decompose the covariance matrix into its conditional, unconditional and cross terms.

Parameters:
  • cov (array) – covariance matrix.

  • cond_mask (array) – boolean array defining conditional dimensions.

  • cond_only (bool) – whether to return only the conditional block or all three blocks.

Returns:

If cond_only is True, only the conditional part of the covariance; otherwise the conditional, unconditional and cross parts, in that order.
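As an illustration of the decomposition described above, here is a minimal NumPy sketch. It is not the library's implementation: the function name covariance_decomposition and the orientation of the cross block (rows are free dimensions, columns are conditioned ones) are assumptions made for this example.

```python
import numpy as np


def covariance_decomposition(cov, cond_mask, cond_only=False):
    """Split a covariance matrix into conditional, unconditional and cross blocks."""
    cov = np.asarray(cov, dtype=float)
    cond_mask = np.asarray(cond_mask, dtype=bool)
    # Block over the conditioned dimensions.
    cov_cond = cov[np.ix_(cond_mask, cond_mask)]
    if cond_only:
        return cov_cond
    # Block over the remaining (free) dimensions.
    cov_uncond = cov[np.ix_(~cond_mask, ~cond_mask)]
    # Cross block (assumed orientation: free rows, conditioned columns).
    cov_cross = cov[np.ix_(~cond_mask, cond_mask)]
    return cov_cond, cov_uncond, cov_cross


cov = np.array([[2.0, 0.5, 0.1],
                [0.5, 1.0, 0.3],
                [0.1, 0.3, 1.5]])
mask = np.array([True, False, True])  # condition on dimensions 0 and 2
cov_cond, cov_uncond, cov_cross = covariance_decomposition(cov, mask)
```

With this mask, cov_cond is the 2x2 block over dimensions 0 and 2, cov_uncond is the 1x1 block for dimension 1, and cov_cross holds the covariances between them.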

static _log_prob(X, mean, cov, add_norm=True)

Log probability of a Gaussian distribution.

Parameters:
  • X (array) – array of samples for which probability is calculated. Of shape (n, n_features).

  • mean (array) – mean of a gaussian distribution.

  • cov (float, array) – covariance matrix of a Gaussian distribution. If float, it is a variance shared by all features. If 1D array, it is a separate variance for every feature. If 2D array, it is a full covariance matrix.

  • add_norm (bool) – whether to add the normalization factor to the calculation.

Returns:

Log probabilities.
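The three accepted forms of cov (scalar, 1D, 2D) can all be promoted to a full covariance matrix before evaluating the density. The sketch below shows one way to do that with NumPy; gaussian_log_prob is a hypothetical stand-in written for this example, not the library's private method.

```python
import numpy as np


def gaussian_log_prob(X, mean, cov, add_norm=True):
    """Log density of a multivariate Gaussian for each row of X."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n_features = X.shape[1]
    cov = np.asarray(cov, dtype=float)
    # Promote a shared variance or per-feature variances to a full matrix.
    if cov.ndim == 0:
        cov = np.eye(n_features) * cov
    elif cov.ndim == 1:
        cov = np.diag(cov)
    diff = X - np.asarray(mean, dtype=float)
    # Mahalanobis term, using a linear solve instead of an explicit inverse.
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    log_p = -0.5 * maha
    if add_norm:
        _, logdet = np.linalg.slogdet(cov)
        log_p -= 0.5 * (n_features * np.log(2.0 * np.pi) + logdet)
    return log_p


X = np.array([[0.0, 0.0], [1.0, -1.0]])
mean = np.zeros(2)
log_p = gaussian_log_prob(X, mean, 2.0)
```

Skipping the normalization (add_norm=False) is useful when only ratios of densities matter, because the constant then cancels.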

fit(X, weights=None, features=None)

Fit the Conditional Gaussian.

Parameters:
  • X (array) – data of shape (n_samples, n_features).

  • weights (array) – weights of every sample, of shape (n_samples).

  • features (list) – optional, list defining names for every feature. It’s used for referencing conditional dimensions. Defaults to [0, 1, …, n_features - 1].

Returns:

An instance of itself.

sample(conditionals=None, n_samples=1, random_state=None, keep_dims=False)

Generate random samples from the conditional model. There are two modes of sampling: (1) specify conditionals as scalar values and draw n_samples from the distribution; (2) specify conditionals as arrays, in which case the number of samples equals the length of the arrays.

Parameters:
  • conditionals (dict) – desired variables (features) to condition upon. Dictionary keys should be only feature names from features. For example, if self.features == [“a”, “b”, “c”] and one would like to condition on “a” and “c”, then conditionals = {“a”: cond_val_a, “c”: cond_val_c}. Conditioned values can be either float or array, where in the case of the latter, all conditioned arrays have to be of the same size. Defaults to None, i.e. normal KDE.

  • n_samples (int) – number of samples to generate. Ignored in the case conditional arrays have been passed in conditionals. Defaults to 1.

  • random_state (np.random.RandomState, int) – optional seed or RandomState instance that determines the random number generation used to draw the samples. See the random_state entry in the scikit-learn glossary.

  • keep_dims (bool) – whether to return non-conditioned dimensions only or keep given conditional values. Defaults to False.

Returns:

Array of samples, of shape (n_samples, n_features) if conditionals is None, or (n_samples, n_features - len(conditionals)) otherwise.
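For a single Gaussian, conditioning on some dimensions yields another Gaussian whose mean and covariance follow from the standard block formulas. The sketch below shows that textbook construction with NumPy; sample_conditional_gaussian and its arguments are hypothetical names for this example, and it accepts only an integer seed (or None), unlike the random_state parameter documented above.

```python
import numpy as np


def sample_conditional_gaussian(mean, cov, cond_mask, cond_values,
                                n_samples=1, seed=None):
    """Draw samples of the free dimensions given fixed conditioned ones."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    cond_mask = np.asarray(cond_mask, dtype=bool)
    free = ~cond_mask
    cov_ff = cov[np.ix_(free, free)]
    cov_fc = cov[np.ix_(free, cond_mask)]
    cov_cc = cov[np.ix_(cond_mask, cond_mask)]
    # gain = cov_fc @ inv(cov_cc), computed with a solve (cov_cc is symmetric).
    gain = np.linalg.solve(cov_cc, cov_fc.T).T
    shift = np.asarray(cond_values, dtype=float) - mean[cond_mask]
    cond_mean = mean[free] + gain @ shift
    cond_cov = cov_ff - gain @ cov_fc.T
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(cond_mean, cond_cov, size=n_samples)


# Two correlated features; condition on the second one being 1.0.
samples = sample_conditional_gaussian(
    mean=[0.0, 0.0],
    cov=[[1.0, 0.8], [0.8, 1.0]],
    cond_mask=[False, True],
    cond_values=[1.0],
    n_samples=5000,
    seed=0,
)
```

In this example the conditional distribution of the first feature has mean 0.8 and variance 0.36, and the returned array contains only the non-conditioned dimension, which mirrors the keep_dims=False behaviour described above.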

score_samples(X, conditional_features=None)

Compute the (un)conditional log-probability of each sample under the model.

Parameters:
  • X (array) – data of shape (n, n_features). Last dimension should match dimension of training data (n_features).

  • conditional_features (list) – subset of self.features specifying which dimensions of the data to condition upon. Defaults to None, meaning unconditional log-probability.

Returns:

Conditional log probability for each sample in X.
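One standard way to obtain a conditional log-probability is to subtract the marginal log-density of the conditioned dimensions from the joint log-density. The sketch below applies that identity to a single Gaussian; whether the library computes it this way is not stated here, and the function names are hypothetical.

```python
import numpy as np


def mvn_log_pdf(X, mean, cov):
    """Log density of a multivariate Gaussian for each row of X."""
    diff = X - mean
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (maha + X.shape[1] * np.log(2.0 * np.pi) + logdet)


def conditional_log_prob(X, mean, cov, cond_mask):
    """log p(free | conditioned) = log p(all) - log p(conditioned)."""
    X = np.asarray(X, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    cond_mask = np.asarray(cond_mask, dtype=bool)
    joint = mvn_log_pdf(X, mean, cov)
    marginal = mvn_log_pdf(
        X[:, cond_mask], mean[cond_mask], cov[np.ix_(cond_mask, cond_mask)]
    )
    return joint - marginal


# Independent features: conditioning on the second leaves the first unchanged.
X = np.array([[0.5, 1.0]])
log_p = conditional_log_prob(X, [0.0, 0.0], np.diag([1.0, 4.0]), [False, True])
```

Because the two features in this example are independent, the result equals the one-dimensional log-density of the first feature evaluated at 0.5.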

class conditional_kde.ConditionalGaussianKernelDensity(whitening_algorithm='rescale', bandwidth='scott', **kwargs)

Bases: object

Conditional Kernel Density estimator.

Parameters:
  • whitening_algorithm (str) – data whitening algorithm, either None, “rescale” or “ZCA”. See util.DataWhitener for more details. “rescale” by default.

  • bandwidth (str, float) –

    the width of the Gaussian centered around every point.

    It can be either:

    1. “scott”, using Scott’s parameter,

    2. “optimized”, which minimizes cross entropy to find the optimal bandwidth, or

    3. float, specifying the actual value.

    By default, it uses Scott’s parameter.

  • **kwargs

    additional kwargs used in the case of “optimized” bandwidth.

    steps (int): how many steps to use in optimization, 10 by default.

    cv_fold (int): cross validation fold, 5 by default.

    n_jobs (int): number of jobs to run cross validation in parallel, -1 by default, i.e. using all available processors.

    verbose (int): verbosity of the cross validation run, for more details see sklearn.model_selection.GridSearchCV.

static _conditional_weights(conditional_values, conditional_data, cov, optimize_memory=False)

Weights for the sampling from the conditional distribution.

They amount to the conditioned part of the gaussian for every data point.

Parameters:
  • conditional_values (array) – of length n_conditionals.

  • conditional_data (array) – of shape (n_samples, n_conditionals). Non-conditional dimensions are already removed here.

  • cov (float, array) – covariance matrix. If float, it is a variance shared by all features. If 1D array, it is a separate variance for every feature. If 2D array, it is a full covariance matrix.

  • optimize_memory (bool) – only for vectorized conditionals; reduces the memory footprint at the cost of longer computation time.

Returns:

Normalized weights.
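The idea behind these weights can be sketched as follows: each data point receives a weight proportional to the Gaussian kernel, restricted to the conditioned dimensions, evaluated at the requested conditional values. This is an illustrative NumPy version for a single set of conditional values, with hypothetical names; it does not reproduce the optimize_memory option.

```python
import numpy as np


def conditional_weights(conditional_values, conditional_data, cov):
    """Normalized kernel weights of every data point for given conditionals."""
    conditional_values = np.atleast_1d(np.asarray(conditional_values, dtype=float))
    conditional_data = np.asarray(conditional_data, dtype=float)
    n_cond = conditional_data.shape[1]
    cov = np.asarray(cov, dtype=float)
    # Promote a shared variance or per-feature variances to a full matrix.
    if cov.ndim == 0:
        cov = np.eye(n_cond) * cov
    elif cov.ndim == 1:
        cov = np.diag(cov)
    diff = conditional_data - conditional_values
    log_w = -0.5 * np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    # The Gaussian normalization is the same for every kernel and cancels;
    # subtracting the maximum keeps the exponentials numerically stable.
    weights = np.exp(log_w - log_w.max())
    return weights / weights.sum()


data = np.array([[0.0], [1.0], [5.0]])
weights = conditional_weights([0.0], data, 1.0)
```

Data points close to the conditional value dominate: here the point at 5.0 receives a weight that is negligible compared with the point at 0.0.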

static _covariance_decomposition(cov, cond_mask, cond_only=False)

Decompose the covariance matrix into its conditional, unconditional and cross terms.

Parameters:
  • cov (array) – covariance matrix.

  • cond_mask (array) – boolean array defining conditional dimensions.

  • cond_only (bool) – whether to return only the conditional block or all three blocks.

Returns:

If cond_only is True, only the conditional part of the covariance; otherwise the conditional, unconditional and cross parts, in that order.

static _log_prob(X, data, cov, add_norm=True)

Log probability of a gaussian KDE distribution.

Parameters:
  • X (array) – array of samples for which probability is calculated. Of shape (n, n_features).

  • data (array) – KDE data, of shape (n_samples, n_features).

  • cov (float, array) – covariance matrix of a Gaussian distribution. If float, it is a variance shared by all features. If 1D array, it is a separate variance for every feature. If 2D array, it is a full covariance matrix.

  • add_norm (bool) – whether to add the normalization factor to the calculation.

Returns:

Log probabilities.

_sample(conditionals=None, n_samples=1, random_state=None, keep_dims=False)

Generate random samples from the conditional model.

This sampler assumes that no dimension has been distorted, only rescaled. In other words, it works for the None and “rescale” whitening algorithms, but not for “ZCA”.
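When the kernel covariance is diagonal, conditional sampling from a Gaussian KDE reduces to two steps: pick a data point with probability given by its kernel weight in the conditioned dimensions, then add kernel noise in the free dimensions. The sketch below illustrates this under the assumption of one shared scalar bandwidth; it is not the library's code, and all names are hypothetical.

```python
import numpy as np


def sample_conditional_kde(data, cond_mask, cond_values, bandwidth,
                           n_samples=1, seed=None):
    """Sample free dimensions of an isotropic Gaussian KDE given conditionals."""
    data = np.asarray(data, dtype=float)
    cond_mask = np.asarray(cond_mask, dtype=bool)
    rng = np.random.default_rng(seed)
    # Kernel weight of every data point in the conditioned dimensions.
    diff = (data[:, cond_mask] - np.asarray(cond_values, dtype=float)) / bandwidth
    log_w = -0.5 * np.sum(diff**2, axis=1)
    weights = np.exp(log_w - log_w.max())
    weights /= weights.sum()
    # Step 1: choose kernels according to their weights.
    idx = rng.choice(len(data), size=n_samples, p=weights)
    centers = data[idx][:, ~cond_mask]
    # Step 2: add Gaussian kernel noise in the free dimensions only.
    return centers + bandwidth * rng.standard_normal(centers.shape)


# Two well separated clusters: feature 0 near 0 goes with feature 1 near -5,
# feature 0 near 10 goes with feature 1 near +5.
data = np.array([[0.0, -5.0]] * 50 + [[10.0, 5.0]] * 50)
samples = sample_conditional_kde(
    data, [True, False], [10.0], bandwidth=0.5, n_samples=200, seed=0
)
```

Conditioning on the first feature being 10 selects almost exclusively the second cluster, so the samples scatter around 5. With a full (non-diagonal) kernel covariance, the second step would also need to shift each kernel's mean, which this shortcut ignores.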

_sample_general(conditionals=None, n_samples=1, random_state=None, keep_dims=False)

Generate random samples from the conditional model.

This function is the most general sampler, without any assumptions. It should be used for ZCA.
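For context, ZCA whitening multiplies the centered data by the inverse square root of its covariance matrix, which mixes the original dimensions rather than only rescaling them. The sketch below is a generic NumPy version of that transform; it is not util.DataWhitener, and the name zca_whiten and the eps regularizer are assumptions made for this example.

```python
import numpy as np


def zca_whiten(X, eps=1e-12):
    """Return ZCA-whitened data, the whitening matrix and the data mean."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    # Symmetric inverse square root of the covariance via eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return (X - mean) @ W, W, mean


rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=2000)
Z, W, mean = zca_whiten(X)
```

The whitening matrix W is generally not diagonal, so a conditional value given in the original feature space no longer corresponds to fixing a single whitened coordinate.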

fit(X, features=None)

Fitting the Conditional Kernel Density.

Parameters:
  • X (array) – data of shape (n_samples, n_features).

  • features (list) – optional, list defining names for every feature. It’s used for referencing conditional dimensions. Defaults to [0, 1, …, n_features - 1].

Returns:

An instance of itself.

static log_scott(n_samples, n_features)

Scott’s parameter.
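Scott's rule of thumb sets the bandwidth factor to n_samples ** (-1 / (n_features + 4)). The sketch below returns the natural logarithm of that factor; the base of the logarithm used by log_scott is not stated here, so the natural log is an assumption, as is the function name.

```python
import math


def log_scott_factor(n_samples, n_features):
    """Natural log of Scott's factor n_samples ** (-1 / (n_features + 4))."""
    return -math.log(n_samples) / (n_features + 4)


# Recover the bandwidth factor itself for 1000 samples in 2 dimensions.
factor = math.exp(log_scott_factor(1000, 2))
```

The factor shrinks as the number of samples grows and shrinks more slowly in higher dimensions.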

sample(conditionals=None, n_samples=1, random_state=None, keep_dims=False)

Generate random samples from the conditional model. There are two modes of sampling: (1) specify conditionals as scalar values and draw n_samples from the distribution; (2) specify conditionals as arrays, in which case the number of samples equals the length of the arrays.

Parameters:
  • conditionals (dict) – desired variables (features) to condition upon. Dictionary keys should be only feature names from features. For example, if self.features == [“a”, “b”, “c”] and one would like to condition on “a” and “c”, then conditionals = {“a”: cond_val_a, “c”: cond_val_c}. Conditioned values can be either float or array, where in the case of the latter, all conditioned arrays have to be of the same size. Defaults to None, i.e. normal KDE.

  • n_samples (int) – number of samples to generate. Ignored in the case conditional arrays have been passed in conditionals. Defaults to 1.

  • random_state (np.random.RandomState, int) – optional seed or RandomState instance that determines the random number generation used to draw the samples. See the random_state entry in the scikit-learn glossary.

  • keep_dims (bool) – whether to return non-conditioned dimensions only or keep given conditional values. Defaults to False.

Returns:

Array of samples, of shape (n_samples, n_features) if conditionals is None, or (n_samples, n_features - len(conditionals)) otherwise.

score_samples(X, conditional_features=None)

Compute the (un)conditional log-probability of each sample under the model.

Parameters:
  • X (array) – data of shape (n, n_features). Last dimension should match dimension of training data (n_features).

  • conditional_features (list) – subset of self.features specifying which dimensions of the data to condition upon. Defaults to None, meaning unconditional log-probability.

Returns:

Conditional log probability for each sample in X.

class conditional_kde.InterpolatedConditionalGaussian(bandwidth=1.0)

Bases: object

Interpolated Conditional Gaussian estimator.

In contrast to ConditionalGaussian, which fits the full distribution and slices through it to obtain the conditional distribution, here some dimensions of the data are allowed to be inherently conditional. For such dimensions, data should be available for every point on a grid.

To compute the final conditional density, one then interpolates for the inherently conditional dimensions, and slices through others as before.

Parameters:

bandwidth (float) – allows for the additional smoothing/shrinking of the covariance. In most cases, it should be left as 1.

fit(data, inherent_features=None, features=None, interpolation_points=None, interpolation_method='linear')

Fitting the Interpolated Conditional Gaussian.

Let Y = (y1, y2, …, yN) denote the inherently conditional random variables of the dataset, and X = (x1, x2, …, xM) the other variables, for which one has a sample of points. This function fits P(X | Y) at every point of a grid in Y space. To make this possible, one needs to pass a set of X samples for every grid point. Later, one can use interpolation in Y and slicing in X to compute P(x1, x2 | x3, …, xM, y1, …, yN), or similar. Note that all Y values need to be conditioned on.

Parameters:
  • data (list of arrays, array) – data to fit. Of shape (n_interp_1, n_interp_2, …, n_samples, n_features). For every point on a grid (n_interp_1, n_interp_2, …, n_interp_N) one needs to pass (n_samples, n_features) dataset, for which a separate n_features-dim Gaussian KDE is fitted. All points on a grid have to have the same number of features (n_features). In the case n_samples is not the same for every point, one needs to pass a nested list of arrays.

  • inherent_features (list) – optional, list defining name of every inherently conditional feature. It is used for referencing conditional dimensions. Defaults to [-1, -2, …, -N], where N is the number of inherently conditional features.

  • features (list) – optional, list defining name for every other feature. It’s used for referencing conditional dimensions. Defaults to [0, 1, …, n_features - 1].

  • interpolation_points (dict) – optional, a dictionary of feature: list_of_values pairs. This defines the grid points for every inherently conditional feature. Every list of values should be strictly ascending. By default it amounts to: {-1: np.linspace(0, 1, n_interp_1), …, -N: np.linspace(0, 1, n_interp_N)}.

  • interpolation_method (str) – either “linear” or “nearest”, making linear interpolation between distributions or picking the closest one, respectively.

Returns:

An instance of itself.
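To make the expected data layout and the role of the grid concrete, the sketch below builds a dataset with one inherently conditional dimension and computes linear interpolation weights for a query value on that grid. It uses only NumPy; linear_interp_weights is a hypothetical helper, and how the library combines the neighbouring distributions internally is not specified here.

```python
import numpy as np


def linear_interp_weights(grid, value):
    """Return ((lo, w_lo), (hi, w_hi)) for linear interpolation on a 1D grid."""
    grid = np.asarray(grid, dtype=float)
    if np.any(np.diff(grid) <= 0):
        raise ValueError("grid must be strictly ascending")
    if not grid[0] <= value <= grid[-1]:
        raise ValueError("value lies outside the grid")
    # Index of the right-hand neighbour, clipped so a left neighbour exists.
    hi = int(np.clip(np.searchsorted(grid, value), 1, len(grid) - 1))
    lo = hi - 1
    t = (value - grid[lo]) / (grid[hi] - grid[lo])
    return (lo, 1.0 - t), (hi, t)


rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 3)  # default-style grid for one inherent feature
# One (n_samples, n_features) dataset per grid point, stacked along axis 0.
data = np.stack([rng.normal(loc=y, size=(100, 2)) for y in grid])
(lo, w_lo), (hi, w_hi) = linear_interp_weights(grid, 0.25)
```

Here data has shape (n_interp_1, n_samples, n_features) = (3, 100, 2), and a query at 0.25 falls halfway between the first two grid points, so both receive a weight of 0.5.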

sample(inherent_conditionals, conditionals=None, n_samples=1, random_state=None, keep_dims=False)

Generate random samples from the conditional model. For inherent_conditionals, there is only one mode of sampling, in which only scalar values are accepted. For conditionals there are two modes: (1) specify conditionals as scalar values and draw n_samples from the distribution; (2) specify conditionals as arrays, in which case the number of samples equals the length of the arrays.

Parameters:
  • inherent_conditionals (dict) – values of inherent (grid) features. These values are used to interpolate on the grid. All inherently conditional dimensions must be defined.

  • conditionals (dict) – desired variables (features) to condition upon. Dictionary keys should be only feature names from features. For example, if self.features == [“a”, “b”, “c”] and one would like to condition on “a” and “c”, then conditionals = {“a”: cond_val_a, “c”: cond_val_c}. Conditioned values can be either float or array, where in the case of the latter, all conditioned arrays have to be of the same size. Defaults to None, i.e. normal KDE.

  • n_samples (int) – number of samples to generate. Defaults to 1.

  • random_state (np.random.RandomState, int) – optional seed or RandomState instance that determines the random number generation used to draw the samples. See the random_state entry in the scikit-learn glossary.

  • keep_dims (bool) – whether to return non-conditioned dimensions only or keep given conditional values. Defaults to False.

Returns:

Array of samples of shape (n_samples, N + n_features) if conditionals is None, or (n_samples, n_features - len(conditionals)) otherwise.

score_samples(X, inherent_conditionals, conditional_features=None)

Compute the conditional log-probability of each sample under the model.

For the simplicity of calculation, here the grid point is fixed by defining a point in inherently conditional dimensions. X is then an array of shape (n, n_features), including all other dimensions of the data.

Parameters:
  • X (array) – data of shape (n, n_features). Last dimension should match dimension of training data (n_features).

  • inherent_conditionals (dict) – values of inherent (grid) features. These values are used to interpolate on the grid. All inherently conditional dimensions must be defined.

  • conditional_features (list) – subset of self.features specifying which dimensions of the data to additionally condition upon. Defaults to None, meaning no additionally conditioned dimensions.

Returns:

Conditional log probability for each sample in X, conditioned on inherently conditional dimensions by inherent_conditionals and other dimensions by conditional_features.
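One plausible reading of linear interpolation between distributions is a two-component mixture of the densities at the neighbouring grid points, weighted by the distance of the query to each. The sketch below evaluates such a mixture in log space. This is an assumption about the method rather than a description of the library's code, and interpolate_log_density is a hypothetical name.

```python
import numpy as np


def interpolate_log_density(log_p_lo, log_p_hi, t):
    """log((1 - t) * p_lo + t * p_hi), evaluated stably in log space."""
    log_p_lo = np.asarray(log_p_lo, dtype=float)
    log_p_hi = np.asarray(log_p_hi, dtype=float)
    # At the grid points themselves only one neighbour contributes.
    if t <= 0.0:
        return log_p_lo
    if t >= 1.0:
        return log_p_hi
    return np.logaddexp(np.log1p(-t) + log_p_lo, np.log(t) + log_p_hi)


# Densities 1.0 and 3.0 at the two neighbours, query a quarter of the way up.
log_p = interpolate_log_density(np.log([1.0]), np.log([3.0]), t=0.25)
```

In this example the mixture density is 0.75 * 1.0 + 0.25 * 3.0 = 1.5, so log_p equals log(1.5). Working with np.logaddexp avoids underflow when both log-densities are very negative.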

class conditional_kde.InterpolatedConditionalKernelDensity(whitening_algorithm='rescale', bandwidth='scott', **kwargs)

Bases: object

Interpolated Conditional Kernel Density estimator.

In contrast to ConditionalGaussianKernelDensity, which fits the full distribution and slices through it to obtain the conditional distribution, here some dimensions of the data are allowed to be inherently conditional. For such dimensions, data should be available for every point on a grid.

To compute the final conditional density, one then interpolates for the inherently conditional dimensions, and slices through others as before.

Parameters:
  • whitening_algorithm (str) – data whitening algorithm, either None, “rescale” or “ZCA”. See util.DataWhitener for more details. “rescale” by default.

  • bandwidth (str, float) –

    the width of the Gaussian centered around every point.

    It can be either:

    1. “scott”, using Scott’s parameter,

    2. “optimized”, which minimizes cross entropy to find the optimal bandwidth, or

    3. float, specifying the actual value.

    By default, it uses Scott’s parameter.

  • **kwargs

    additional kwargs used in the case of “optimized” bandwidth.

    steps (int): how many steps to use in optimization, 10 by default.

    cv_fold (int): cross validation fold, 5 by default.

    n_jobs (int): number of jobs to run cross validation in parallel, -1 by default, i.e. using all available processors.

    verbose (int): verbosity of the cross validation run, for more details see sklearn.model_selection.GridSearchCV.

fit(data, inherent_features=None, features=None, interpolation_points=None, interpolation_method='linear')

Fitting the Interpolated Conditional Kernel Density.

Let Y = (y1, y2, …, yN) denote the inherently conditional random variables of the dataset, and X = (x1, x2, …, xM) the other variables, for which one has a sample of points. This function fits P(X | Y) at every point of a grid in Y space. To make this possible, one needs to pass a set of X samples for every grid point. Later, one can use interpolation in Y and slicing in X to compute P(x1, x2 | x3, …, xM, y1, …, yN), or similar. Note that all Y values need to be conditioned on.

Parameters:
  • data (list of arrays, array) – data to fit. Of shape (n_interp_1, n_interp_2, …, n_samples, n_features). For every point on a grid (n_interp_1, n_interp_2, …, n_interp_N) one needs to pass (n_samples, n_features) dataset, for which a separate n_features-dim Gaussian KDE is fitted. All points on a grid have to have the same number of features (n_features). In the case n_samples is not the same for every point, one needs to pass a nested list of arrays.

  • inherent_features (list) – optional, list defining name of every inherently conditional feature. It is used for referencing conditional dimensions. Defaults to [-1, -2, …, -N], where N is the number of inherently conditional features.

  • features (list) – optional, list defining name for every other feature. It’s used for referencing conditional dimensions. Defaults to [0, 1, …, n_features - 1].

  • interpolation_points (dict) – optional, a dictionary of feature: list_of_values pairs. This defines the grid points for every inherently conditional feature. Every list of values should be strictly ascending. By default it amounts to: {-1: np.linspace(0, 1, n_interp_1), …, -N: np.linspace(0, 1, n_interp_N)}.

  • interpolation_method (str) – either “linear” or “nearest”, making linear interpolation between distributions or picking the closest one, respectively.

Returns:

An instance of itself.

sample(inherent_conditionals, conditionals=None, n_samples=1, random_state=None, keep_dims=False)

Generate random samples from the conditional model. For inherent_conditionals, there is only one mode of sampling, in which only scalar values are accepted. For conditionals there are two modes: (1) specify conditionals as scalar values and draw n_samples from the distribution; (2) specify conditionals as arrays, in which case the number of samples equals the length of the arrays.

Parameters:
  • inherent_conditionals (dict) – values of inherent (grid) features. These values are used to interpolate on the grid. All inherently conditional dimensions must be defined.

  • conditionals (dict) – desired variables (features) to condition upon. Dictionary keys should be only feature names from features. For example, if self.features == [“a”, “b”, “c”] and one would like to condition on “a” and “c”, then conditionals = {“a”: cond_val_a, “c”: cond_val_c}. Conditioned values can be either float or array, where in the case of the latter, all conditioned arrays have to be of the same size. Defaults to None, i.e. normal KDE.

  • n_samples (int) – number of samples to generate. Defaults to 1.

  • random_state (np.random.RandomState, int) – optional seed or RandomState instance that determines the random number generation used to draw the samples. See the random_state entry in the scikit-learn glossary.

  • keep_dims (bool) – whether to return non-conditioned dimensions only or keep given conditional values. Defaults to False.

Returns:

Array of samples of shape (n_samples, N + n_features) if conditionals is None, or (n_samples, n_features - len(conditionals)) otherwise.

score_samples(X, inherent_conditionals, conditional_features=None)

Compute the conditional log-probability of each sample under the model.

For the simplicity of calculation, here the grid point is fixed by defining a point in inherently conditional dimensions. X is then an array of shape (n, n_features), including all other dimensions of the data.

Parameters:
  • X (array) – data of shape (n, n_features). Last dimension should match dimension of training data (n_features).

  • inherent_conditionals (dict) – values of inherent (grid) features. These values are used to interpolate on the grid. All inherently conditional dimensions must be defined.

  • conditional_features (list) – subset of self.features specifying which dimensions of the data to additionally condition upon. Defaults to None, meaning no additionally conditioned dimensions.

Returns:

Conditional log probability for each sample in X, conditioned on inherently conditional dimensions by inherent_conditionals and other dimensions by conditional_features.

Submodules