artan.mixture package

Submodules

artan.mixture.mixture_params module

class artan.mixture.mixture_params.HasBatchTrainEnabled[source]

Bases: pyspark.ml.param.Params

Mixin for enabling batch EM train mode

batchTrainEnabled = Param(parent='undefined', name='batchTrainEnabled', doc='Flag to enable batch EM. Unless enabled, the transformer will do online EM. Online EM can be done with both streaming and batch dataframes, whereas batch EM can only be done with batch dataframes. Default is false')
getBatchTrainEnabled()[source]

Gets the value of batch train flag or its default value

class artan.mixture.mixture_params.HasBatchTrainMaxIter[source]

Bases: pyspark.ml.param.Params

Mixin for batch train max iterations

batchTrainMaxIter = Param(parent='undefined', name='batchTrainMaxIter', doc='Maximum iterations in batch train mode, default is 30')
getBatchTrainMaxIter()[source]

Gets the value of batchTrainMaxIter or its default value

class artan.mixture.mixture_params.HasBatchTrainTol[source]

Bases: pyspark.ml.param.Params

Mixin for batch train iteration stop tolerance

batchTrainTol = Param(parent='undefined', name='batchTrainTol', doc='Min change in loglikelihood to stop iterations in batch EM mode. Default is 0.1')
getBatchTrainTol()[source]

Gets the value of batchTrainTol or its default value

class artan.mixture.mixture_params.HasDecayRate[source]

Bases: pyspark.ml.param.Params

Mixin for decaying step size parameter

decayRate = Param(parent='undefined', name='decayRate', doc='Step size as a decaying function rather than a constant, which might be preferred in batch training. If set, the step size will be replaced with the output of the function stepSize = (2 + kIter)**(-decayRate)')
getDecayingStepSizeEnabled()[source]

Gets the value of decaying step size flag

class artan.mixture.mixture_params.HasInitialMixtureModelCol[source]

Bases: pyspark.ml.param.Params

Mixin for initial mixture model parameter.

getInitialMixtureModelCol()[source]

Gets the value of initial mixture model col or its default value.

initialMixtureModelCol = Param(parent='undefined', name='initialMixtureModelCol', doc='Sets the initial mixture model from struct column conforming to mixture distribution')
class artan.mixture.mixture_params.HasInitialWeights[source]

Bases: pyspark.ml.param.Params

Mixin for initial mixture weights parameter.

getInitialWeights()[source]

Gets the value of initial weights or its default value.

initialWeights = Param(parent='undefined', name='initialWeights', doc='Initial weights of the mixtures. The weights should sum up to 1.0 .')
class artan.mixture.mixture_params.HasInitialWeightsCol[source]

Bases: pyspark.ml.param.Params

Mixin for initial mixture weights parameter.

getInitialWeightsCol()[source]

Gets the value of initial weights column or its default value.

initialWeightsCol = Param(parent='undefined', name='initialWeightsCol', doc='Initial weights of mixtures from dataframe column')
class artan.mixture.mixture_params.HasMinibatchSize[source]

Bases: pyspark.ml.param.Params

Mixin for mini-batch size parameter

getMinibatchSize()[source]

Gets the value of minibatch size or its default value

minibatchSize = Param(parent='undefined', name='minibatchSize', doc='Size for batching samples together in online EM algorithm. Estimate will be produced once per each batch. Having larger batches increases stability with increased memory footprint. Each minibatch is stored in mixture transformer state independently from spark minibatches.')
class artan.mixture.mixture_params.HasMinibatchSizeCol[source]

Bases: pyspark.ml.param.Params

Mixin for mini-batch size parameter

getMinibatchSizeCol()[source]

Gets the value of minibatch size column or its default value

minibatchSizeCol = Param(parent='undefined', name='minibatchSizeCol', doc='Set minibatch size from dataframe column')
class artan.mixture.mixture_params.HasMixtureCount[source]

Bases: pyspark.ml.param.Params

Mixin for number of components in the mixture

getMixtureCount()[source]

Gets the value of mixtureCount or its default value

mixtureCount = Param(parent='undefined', name='mixtureCount', doc='Number of finite mixture components, must be > 0')
class artan.mixture.mixture_params.HasSampleCol[source]

Bases: pyspark.ml.param.Params

Mixin for sample column parameter.

getSampleCol()[source]

Gets the value of sample column or its default value.

sampleCol = Param(parent='undefined', name='sampleCol', doc='Column name for input to mixture models')
class artan.mixture.mixture_params.HasStepSize[source]

Bases: pyspark.ml.param.Params

Mixin for controlling the inertia of the current parameter.

getStepSize()[source]

Gets the value of step size or its default value

stepSize = Param(parent='undefined', name='stepSize', doc='Weights the current parameter of the model against the old parameter. A step size of 1.0 means ignore the old parameter, whereas a step size of 0 means ignore the current parameter. Values closer to 1.0 will increase speed of convergence, but might have adverse effects on stability. In online setting, it is advised to set small values close to 0.0. Default is 0.01')
class artan.mixture.mixture_params.HasStepSizeCol[source]

Bases: pyspark.ml.param.Params

Mixin for step size parameter

getStepSizeCol()[source]

Gets the value of step size column or its default value

stepSizeCol = Param(parent='undefined', name='stepSizeCol', doc='stepSize parameter from dataframe column instead of a constant value across all samples')
class artan.mixture.mixture_params.HasUpdateHoldout[source]

Bases: pyspark.ml.param.Params

Mixin for update holdout parameter

getUpdateHoldout()[source]

Gets the value of update holdout or its default value

updateHoldout = Param(parent='undefined', name='updateHoldout', doc='Controls after how many samples the mixture will start calculating estimates. Preventing update in first few samples might be preferred for stability.')
class artan.mixture.mixture_params.HasUpdateHoldoutCol[source]

Bases: pyspark.ml.param.Params

Mixin for update holdout parameter

getUpdateHoldoutCol()[source]

Gets the value of update holdout col or its default value

updateHoldoutCol = Param(parent='undefined', name='updateHoldoutCol', doc='updateHoldout from dataframe column rather than a constant value across all states')
class artan.mixture.mixture_params.MixtureParams[source]

Bases: artan.mixture.mixture_params.HasSampleCol, artan.mixture.mixture_params.HasStepSize, artan.mixture.mixture_params.HasStepSizeCol, artan.mixture.mixture_params.HasInitialWeights, artan.mixture.mixture_params.HasInitialWeightsCol, artan.mixture.mixture_params.HasMinibatchSize, artan.mixture.mixture_params.HasUpdateHoldout, artan.mixture.mixture_params.HasDecayRate, artan.mixture.mixture_params.HasInitialMixtureModelCol, artan.mixture.mixture_params.HasMinibatchSizeCol, artan.mixture.mixture_params.HasUpdateHoldoutCol, artan.mixture.mixture_params.HasBatchTrainEnabled, artan.mixture.mixture_params.HasBatchTrainMaxIter, artan.mixture.mixture_params.HasBatchTrainTol, artan.mixture.mixture_params.HasMixtureCount

setBatchTrainMaxIter(value)[source]

Sets the max number of iterations in batch train mode

Default is 30

Parameters:value – Int
Returns:MixtureTransformer
setBatchTrainTol(value)[source]

Sets the minimum loglikelihood improvement for stopping iterations in batch EM train mode

Default is 0.1

Parameters:value – Float
Returns:MixtureTransformer
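
The two batch-mode settings, batchTrainMaxIter and batchTrainTol, together bound the number of EM passes. The sketch below illustrates that kind of stopping rule in plain Python; it is not the artan implementation. The function name is hypothetical, and comparing the absolute difference of consecutive loglikelihood values against the tolerance is an assumption.

```python
def iterations_until_stop(loglikelihoods, tol=0.1, max_iter=30):
    """Count how many EM iterations run before a stop condition is met.

    `loglikelihoods` stands in for the value computed after each pass.
    Assumption: iteration stops once the change between two consecutive
    values drops below `tol`, or once `max_iter` passes have run.
    """
    previous = None
    count = 0
    for value in loglikelihoods:
        if count >= max_iter:
            break
        count += 1
        if previous is not None and abs(value - previous) < tol:
            break
        previous = value
    return count


# The third pass changes the loglikelihood by only 0.05, so iteration stops there.
passes = iterations_until_stop([-100.0, -90.0, -89.95, -89.94], tol=0.1, max_iter=30)
```

A smaller tolerance or a larger iteration cap trades run time for a tighter fit.
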
setDecayRate(value)[source]

Sets the step size as a decaying function rather than a constant step size, which might be preferred for batch training. If set, the step size will be replaced with the output of following function:

stepSize = (2 + kIter)**(-decayRate)

Where kIter is incremented by 1 on each minibatch.

Returns:MixtureTransformer
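
The decay formula above can be evaluated directly to see how quickly the step size shrinks. This is a plain-Python sketch of the documented expression only; the function name is hypothetical, and starting kIter at 0 is an assumption.

```python
def decayed_step_size(k_iter, decay_rate):
    """Evaluate stepSize = (2 + kIter)**(-decayRate) for one minibatch index."""
    return (2 + k_iter) ** (-decay_rate)


# Step sizes for the first few minibatches with decayRate = 1.0.
schedule = [decayed_step_size(k, 1.0) for k in range(4)]
```

With decayRate = 1.0 the schedule starts at 1/2 and falls to 1/5 by the fourth minibatch; smaller decay rates shrink the step more slowly.
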
setEnableBatchTrain()[source]

Enables batch train mode.

Returns:MixtureTransformer
setInitialMixtureModelCol(value)[source]

Sets the initial mixture model directly from dataframe column

Parameters:value – String
Returns:MixtureTransformer
setInitialWeights(value)[source]

Sets the initial weights of the mixtures. The weights should sum up to 1.0.

Parameters:value – List[Float]
Returns:MixtureTransformer
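
Because the weights should sum to 1.0, it can help to normalize raw values before passing them in. A minimal sketch in plain Python; the helper name is hypothetical and is not part of artan.

```python
def normalize_weights(raw):
    """Scale non-negative raw weights so that they sum to 1.0."""
    if not raw or any(w < 0 for w in raw):
        raise ValueError("weights must be a non-empty list of non-negative numbers")
    total = sum(raw)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in raw]


# Three components, the first twice as likely as each of the others.
weights = normalize_weights([2.0, 1.0, 1.0])
```

The resulting list has one entry per mixture component, so its length should match the value given to setMixtureCount.
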
setInitialWeightsCol(value)[source]

Sets the initial mixture weights parameter from dataframe column

Parameters:value – String
Returns:MixtureTransformer
setMinibatchSize(value)[source]

Sets the minibatch size for batching samples together in online EM algorithm. Estimate will be produced once per each batch. Having larger batches increases stability with increased memory footprint.

Default is 1

Parameters:value – Int
Returns:MixtureTransformer
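
The grouping described above can be pictured as follows. This is a plain-Python sketch, not artan code: the function name is hypothetical, and keeping an incomplete trailing batch pending until it fills up is an assumption about the buffering behaviour.

```python
def split_minibatches(samples, minibatch_size):
    """Group samples into full minibatches and return any incomplete remainder.

    One estimate would be produced per full batch; the remainder models
    samples that are still buffered and waiting for the batch to fill.
    """
    if minibatch_size < 1:
        raise ValueError("minibatch_size must be at least 1")
    full = len(samples) // minibatch_size * minibatch_size
    batches = [samples[i:i + minibatch_size] for i in range(0, full, minibatch_size)]
    pending = samples[full:]
    return batches, pending


batches, pending = split_minibatches([1, 2, 3, 4, 5], 2)
```

With the default size of 1, every sample forms its own batch and nothing is left pending.
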
setMinibatchSizeCol(value)[source]

Sets the minibatch size from dataframe column rather than a constant minibatch size across all states. Overrides setMinibatchSize setting.

Parameters:value – String
Returns:MixtureTransformer
setMixtureCount(value)[source]

Sets the number of components in the finite mixture

Parameters:value – Int
Returns:MixtureTransformer
setSampleCol(value)[source]

Sets the sample column for the mixture model inputs. Depending on the mixture distribution, sample type should be different.

Bernoulli => Boolean
Poisson => Long
MultivariateGaussian => Vector

Parameters:value – String
Returns:MixtureTransformer
setStepSize(value)[source]

Sets the step size parameter, which weights the current parameter of the model against the old parameter. A step size of 1.0 means ignore the old parameter, whereas a step size of 0 means ignore the current parameter. Values closer to 1.0 will increase speed of convergence, but might have adverse effects on stability. For online EM, it is advised to set it close to 0.0.

Default is 0.1

Parameters:value – Float
Returns:MixtureTransformer
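
The effect of the step size can be illustrated with a simple convex combination of the old and the current value. This is a sketch under the assumption that the weighting is linear, which matches the two documented extremes (1.0 ignores the old parameter, 0 ignores the current one); the function name is hypothetical.

```python
def blend(old, current, step_size):
    """Weight the current value against the old one with the given step size."""
    if not 0.0 <= step_size <= 1.0:
        raise ValueError("step_size must be between 0 and 1")
    return (1.0 - step_size) * old + step_size * current


# A quarter of the way from the old value 2.0 towards the current value 4.0.
updated = blend(2.0, 4.0, 0.25)
```

Small step sizes move the estimate slowly, which is why values close to 0.0 are advised for online EM.
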
setStepSizeCol(value)[source]

Sets the step size from dataframe column, which would allow setting different step sizes across measurements. Overrides the value set by setStepSize.

Parameters:value – String
Returns:MixtureTransformer
setUpdateHoldout(value)[source]

Sets the update holdout parameter which controls after how many samples the mixture will start calculating estimates. Preventing update in first few samples might be preferred for stability.

Parameters:value – Int
Returns:MixtureTransformer
setUpdateHoldoutCol(value)[source]

Sets the update holdout parameter from dataframe column rather than a constant value across all states. Overrides the value set by setUpdateHoldout

Parameters:value – String
Returns:MixtureTransformer

artan.mixture.bernoulli_mixture module

class artan.mixture.bernoulli_mixture.BernoulliMixture[source]

Bases: artan.state.stateful_transformer.StatefulTransformer, artan.mixture.mixture_params.MixtureParams, artan.mixture.bernoulli_mixture._HasInitialProbabilities, artan.mixture.bernoulli_mixture._HasInitialProbabilitiesCol, artan.mixture.bernoulli_mixture._HasBernoulliMixtureModelCol, artan.utils.ArtanJavaMLReadable, pyspark.ml.util.JavaMLWritable

Online bernoulli mixture estimator with a stateful transformer, based on Cappe (2011) Online Expectation-Maximisation paper.

Outputs an estimate for each input sample in a single pass, by replacing the E-step in EM with a recursively averaged stochastic E-step.

setInitialProbabilities(value)[source]

Sets the initial bernoulli probabilities of the mixtures. The length of the array should be equal to mixture count, each element in the array should be between 0 and 1.

Default is equally spaced probabilities between 0 and 1

Parameters:value – List[Float]
Returns:BernoulliMixture
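
One way to produce equally spaced probabilities strictly between 0 and 1 is shown below. The exact spacing used by the library default is not stated here, so the formula i / (mixtureCount + 1) is an assumption, and the function name is hypothetical.

```python
def equally_spaced_probabilities(mixture_count):
    """Return mixture_count probabilities evenly spaced inside (0, 1)."""
    if mixture_count < 1:
        raise ValueError("mixture_count must be at least 1")
    return [i / (mixture_count + 1) for i in range(1, mixture_count + 1)]


initial_probabilities = equally_spaced_probabilities(3)
```

A list like this has one entry per component and keeps every entry between 0 and 1, as setInitialProbabilities requires.
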
setInitialProbabilitiesCol(value)[source]

Sets the initial probabilities from dataframe column to set different probabilities across different models. Overrides the parameter set by setInitialProbabilities.

Parameters:value – String
Returns:BernoulliMixture

artan.mixture.multivariate_gaussian_mixture module

class artan.mixture.multivariate_gaussian_mixture.MultivariateGaussianMixture[source]

Bases: artan.state.stateful_transformer.StatefulTransformer, artan.mixture.mixture_params.MixtureParams, artan.mixture.multivariate_gaussian_mixture._HasInitialMeans, artan.mixture.multivariate_gaussian_mixture._HasInitialMeansCol, artan.mixture.multivariate_gaussian_mixture._HasInitialCovariances, artan.mixture.multivariate_gaussian_mixture._HasInitialCovariancesCol, artan.utils.ArtanJavaMLReadable, pyspark.ml.util.JavaMLWritable

Online gaussian mixture estimator with a stateful transformer, based on Cappe (2011) Online Expectation-Maximisation paper.

Outputs an estimate for each input sample in a single pass, by replacing the E-step in EM with a recursively averaged stochastic E-step.

setInitialCovariances(value)[source]

Sets the initial covariance matrices of the mixtures as a nested array of doubles. The dimensions of the array should be mixtureCount x sampleSize**2

Parameters:value – List[List[Float]]
Returns:MultivariateGaussianMixture
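
The mixtureCount x sampleSize**2 shape means each covariance matrix is passed as one flat list. A plain-Python sketch of building that nested list from square matrices follows; the helper name is hypothetical. It flattens row by row, and since covariance matrices are symmetric, row-wise and column-wise flattening give the same result.

```python
def flatten_covariances(matrices):
    """Flatten each square covariance matrix into a list of sampleSize**2 values."""
    flat = []
    for matrix in matrices:
        size = len(matrix)
        if any(len(row) != size for row in matrix):
            raise ValueError("each covariance matrix must be square")
        flat.append([value for row in matrix for value in row])
    return flat


# Two components with two-dimensional samples: 2 lists of 4 values each.
covariances = flatten_covariances([
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.5], [0.5, 3.0]],
])
```
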
setInitialCovariancesCol(value)[source]

Sets the initial covariance matrices of the mixtures from dataframe column. Overrides the value set by setInitialCovariances

Parameters:value – String
Returns:MultivariateGaussianMixture
setInitialMeans(value)[source]

Sets the initial mean vectors of the mixtures as a nested array of doubles. The dimensions of the array should be mixtureCount x sample vector size

Parameters:value – List[List[Float]]
Returns:MultivariateGaussianMixture
setInitialMeansCol(value)[source]

Sets the initial means from dataframe column. Overrides the value set by setInitialMeans.

Parameters:value – String
Returns:MultivariateGaussianMixture

artan.mixture.poisson_mixture module

class artan.mixture.poisson_mixture.PoissonMixture[source]

Bases: artan.state.stateful_transformer.StatefulTransformer, artan.mixture.mixture_params.MixtureParams, artan.mixture.poisson_mixture._HasInitialRates, artan.mixture.poisson_mixture._HasInitialRatesCol, artan.utils.ArtanJavaMLReadable, pyspark.ml.util.JavaMLWritable

Online poisson mixture estimator with a stateful transformer, based on Cappe (2011) Online Expectation-Maximisation paper.

Outputs an estimate for each input sample in a single pass, by replacing the E-step in EM with a recursively averaged stochastic E-step.
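
The recursively averaged E-step can be sketched for a Poisson mixture in plain Python. This is a textbook-style illustration of online EM, not the artan implementation: the function names, the layout of the sufficient statistics, and the way they are initialised are all assumptions.

```python
import math


def poisson_pmf(x, rate):
    """Probability of observing count x under a Poisson distribution."""
    return math.exp(-rate) * rate ** x / math.factorial(x)


def online_em_step(x, weights, rates, stats, step_size):
    """Process one sample and return updated (weights, rates, stats).

    stats holds one pair per component: the running averages of the
    responsibility and of responsibility * x.
    """
    # E-step: responsibility of each component for the sample x.
    joint = [w * poisson_pmf(x, r) for w, r in zip(weights, rates)]
    total = sum(joint)
    resp = [j / total for j in joint]

    # Stochastic averaging: blend old statistics with those of the new sample.
    new_stats = [
        ((1 - step_size) * s_w + step_size * r,
         (1 - step_size) * s_x + step_size * r * x)
        for (s_w, s_x), r in zip(stats, resp)
    ]

    # M-step: read the parameters back from the averaged statistics.
    norm = sum(s_w for s_w, _ in new_stats)
    new_weights = [s_w / norm for s_w, _ in new_stats]
    new_rates = [s_x / s_w for s_w, s_x in new_stats]
    return new_weights, new_rates, new_stats


weights, rates = [0.5, 0.5], [1.0, 10.0]
# Initialise the statistics so that they reproduce the starting parameters.
stats = [(w, w * r) for w, r in zip(weights, rates)]
new_weights, new_rates, new_stats = online_em_step(12, weights, rates, stats, 0.1)
```

An observation of 12 is far more likely under the component with rate 10.0, so that component gains weight and its rate moves slightly towards 12, while a single pass over the data is enough to produce an estimate per sample.
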

setInitialRates(value)[source]

Sets the initial poisson rates of the mixtures. The length of the array should be equal to mixtureCount.

Parameters:value – List[Float]
Returns:PoissonMixture
setInitialRatesCol(value)[source]

Sets the initial rates from dataframe column. Overrides the parameter set from setInitialRates.

Parameters:value – String
Returns:PoissonMixture