Title: | Arbitrary Dependency Mixed Multivariate Bayesian Models |
---|---|
Description: | Supports Bayesian models with full and partial (hence arbitrary) dependencies between random variables. Discrete and continuous variables are supported, and conditional joint probabilities and probability densities are estimated using Kernel Density Estimation (KDE). The full general form, which implements an extension to Bayes' theorem, as well as the simple form, which is just a Bayesian network, both support regression through segmentation and KDE and estimation of probability or relative likelihood of discrete or continuous target random variables. This package also provides true statistical distance measures based on Bayesian models. Furthermore, these measures can be facilitated on neighborhood searches, and to estimate the similarity and distance between data points. Related work is by Bayes (1763) <doi:10.1098/rstl.1763.0053> and by Scutari (2010) <doi:10.18637/jss.v035.i03>. |
Authors: | Sebastian Hönel |
Maintainer: | Sebastian Hönel <[email protected]> |
License: | GPL-3 |
Version: | 0.13.3 |
Built: | 2025-02-24 03:48:56 UTC |
Source: | https://github.com/mrshoenel/r-mmb |
A wrapper to be used with the package/function caret::train(). Supports regression and classification and comes with an extensive default grid.
bayesCaret
An object of class list of length 7.
Sebastian Hönel [email protected]
## Not run:
trainIndex <- caret::createDataPartition(
  iris$Species, p = .8, list = FALSE, times = 1)
train <- iris[ trainIndex, ]
test <- iris[-trainIndex, ]
fitControl <- caret::trainControl(
  method = "repeatedcv", number = 2, repeats = 2)
fit <- caret::train(
  Species ~ ., data = train, method = mmb::bayesCaret,
  trControl = fitControl)
## End(Not run)
Computes the probability (discrete feature) or relative likelihood (continuous feature) of one given feature and a concrete value for it.
bayesComputeMarginalFactor(df, feature, doEcdf = FALSE)
df | data.frame that contains all the feature's data |
feature | data.frame containing the designated feature as created by @seealso |
doEcdf | default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferring a continuous feature. If false, uses the empirical PDF to return the relative likelihood. This parameter does not have any effect when inferring discrete values. Using the ECDF, the probability of finding a value less than or equal to the given value is returned. |
numeric the probability or likelihood of the given feature assuming its given value.
Sebastian Hönel [email protected]
feat <- mmb::createFeatureForBayes(
  name = "Petal.Length", value = mean(iris$Petal.Length))
mmb::bayesComputeMarginalFactor(df = iris, feature = feat)
mmb::bayesComputeMarginalFactor(df = iris, feature = feat, doEcdf = TRUE)
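The two modes selected by doEcdf can be sketched in base R. This is a minimal illustration and not the package's internal code: it assumes a kernel density estimate from stats::density() with its default settings as the empirical PDF and stats::ecdf() as the empirical CDF, and all variable names are hypothetical.

```r
# Sketch only: base R stand-ins for the empirical PDF and CDF (an assumption,
# not mmb's internal estimators).
x <- iris$Petal.Length
v <- mean(x)

# Relative likelihood: kernel density estimate, linearly interpolated at v.
d <- stats::density(x)
rel_lik <- stats::approx(d$x, d$y, xout = v)$y

# Probability of observing a value less than or equal to v.
p_le <- stats::ecdf(x)(v)

print(c(rel_lik = rel_lik, p_le = p_le))
```

The density value is only meaningful relative to other density values, whereas the ECDF value is a probability between 0 and 1.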
Converts all columns in a data.frame that are factors to character, except for the target column.
bayesConvertData(df)
df | data.frame to be used for Bayesian inferencing. |
the same data.frame with all factors converted to character.
Sebastian Hönel [email protected]
df <- mmb::bayesConvertData(df = iris)
Counter operation to mmb::sampleToBayesFeatures(). Takes a Bayes-feature data.frame and transforms it back to a row.
bayesFeaturesToSample(dfOrg, features)
dfOrg | data.frame containing at least one row of the original format, so that we can rebuild the sample matching exactly the original column names. |
features | data.frame of Bayes-features, as for example previously created using |
data.frame the sample as 1-row data.frame.
Sebastian Hönel [email protected]
samp <- mmb::sampleToBayesFeatures(dfRow = iris[15,], targetCol = "Species")
# Convert the sample (as features) back to a sample that can be, e.g.,
# appended to the data again:
row <- mmb::bayesFeaturesToSample(dfOrg = iris, features = samp)
Uses simple Bayesian inference to determine the probability or relative
likelihood of a given value. This function can also regress to the most
likely value instead. Simple means that segmented data is used in a way
that is equal to how a Bayesian network works. For a finite set of labels,
this function needs to be called once per label to obtain the probability of
each label (or for n-1 labels, or until a label with a probability >.5 is
found). For a continuous value, the result is mainly useful for choosing
among a finite set of candidate values. The empirical CDF may be used to
obtain an actual probability for a given continuous value; otherwise, the
empirical PDF is estimated and a relative likelihood is returned. For
regression, set doRegress = TRUE to obtain the most likely value of the
target feature instead of its relative likelihood.
bayesInferSimple( df, features, targetCol, selectedFeatureNames = c(), retainMinValues = 1, doRegress = FALSE, doEcdf = FALSE, regressor = NULL )
df | data.frame |
features | data.frame with bayes-features. One of the features needs to be the label-column. |
targetCol | string with the name of the feature that represents the label. |
selectedFeatureNames | vector default |
retainMinValues | integer to require a minimum amount of data points when segmenting the data feature by feature. |
doRegress | default FALSE a boolean to indicate whether to do a regression instead of returning the relative likelihood of a continuous feature. If the target feature is discrete and regression is requested, a warning is issued. |
doEcdf | default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferring a continuous feature. If false, uses the empirical PDF to return the relative likelihood. This parameter does not have any effect when inferring discrete values or when doing a regression. |
regressor | Function that is given the collected values for regression and thus finally used to select a most likely value. Defaults to the built-in estimator for the empirical PDF and returns its argmax. However, any other function can be used, too, such as min, max, median, average etc. You may also use this function to obtain the raw values for further processing. This function is ignored if not doing regression. |
numeric probability (inferring discrete labels) or relative likelihood (regression, inferring likelihood of continuous value) or most likely value given the conditional features.
Sebastian Hönel [email protected]
Scutari M (2010). “Learning Bayesian Networks with the bnlearn R Package.” Journal of Statistical Software, 35(3), 1–22. doi:10.18637/jss.v035.i03.
feat1 <- mmb::createFeatureForBayes(
  name = "Petal.Length", value = mean(iris$Petal.Length))
feat2 <- mmb::createFeatureForBayes(
  name = "Petal.Width", value = mean(iris$Petal.Width))
featT <- mmb::createFeatureForBayes(
  name = "Species", iris[1,]$Species, isLabel = TRUE)
# Infer likelihood of featT's label:
feats <- rbind(feat1, feat2, featT)
mmb::bayesInferSimple(df = iris, features = feats, targetCol = featT$name)
# Infer likelihood of feat1's value:
featT$isLabel = FALSE
feat1$isLabel = TRUE
# We do not bind featT this time:
feats <- rbind(feat1, feat2)
mmb::bayesInferSimple(df = iris, features = feats, targetCol = feat1$name)
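The regressor argument is described as defaulting to the argmax of the estimated empirical PDF. A rough base R equivalent might look like the following sketch. The helper kde_mode() is hypothetical and uses stats::density() with default settings, so it is not necessarily identical to the package's built-in estimator.

```r
# Sketch only: a hypothetical stand-in for the default regressor, which the
# documentation describes as the argmax of the estimated empirical PDF.
kde_mode <- function(values) {
  d <- stats::density(values)
  d$x[which.max(d$y)]
}

# Values that might remain after segmenting the data on Species == "setosa":
vals <- iris$Sepal.Length[iris$Species == "setosa"]

mode_est <- kde_mode(vals)           # mode of the kernel density estimate
median_est <- stats::median(vals)    # an alternative regressor
print(c(mode = mode_est, median = median_est))
```

Any function that reduces a numeric vector to one value could take the same place, which is what the parameter description suggests with min, max, or median.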
Uses the full extended theorem of Bayes, taking all selected features into account. Expands Bayes' theorem to accommodate all dependent features, then calculates each conditional probability (or relative likelihood) and returns a single result reflecting the probability or relative likelihood of the target feature assuming its given value, given that all the other dependent features assume their given value. The target feature (designated by 'targetCol') may be discrete or continuous. If at least one of the depending features or the target feature is continuous and the PDF ('doEcdf' = FALSE) is built, the result of this function is a relative likelihood of the target feature's value. If all of the features are discrete or the empirical CDF is used instead of the PDF, the result of this function is a probability.
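For a target A with two dependent features B and C, one way to write such an expansion follows directly from the chain rule of probability. The identity below is generic and given only as an illustration; the ordering of the factors actually used by the package is an assumption and may differ.

```latex
P(A \mid B, C) = \frac{P(A, B, C)}{P(B, C)}
               = \frac{P(B \mid A, C)\, P(C \mid A)\, P(A)}{P(C \mid B)\, P(B)}
```

Each factor on the right is estimated from the data. When a factor involves a continuous variable and the PDF is used, that factor is a density rather than a probability, which is why the overall result is then a relative likelihood.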
bayesProbability( df, features, targetCol, selectedFeatureNames = c(), shiftAmount = 0.1, retainMinValues = 1, doEcdf = FALSE, useParallel = NULL )
df | data.frame that contains all the feature's data |
features | data.frame with bayes-features. One of the features needs to be the label-column. |
targetCol | string with the name of the feature that represents the label. |
selectedFeatureNames | vector default |
shiftAmount | numeric an offset value used to increase any one probability (factor) in the full built equation. In scenarios with many dependencies, it is more likely that a single conditional probability becomes zero, which would result in the entire probability being zero. Since this is often useless, the 'shiftAmount' can be added to each factor, resulting in a non-zero probability that can at least be used to order samples by likelihood. Note that, with a positive 'shiftAmount', the result of this function cannot be said to be a probability any longer, but rather results in a comparable likelihood (a 'probability score'). |
retainMinValues | integer to require a minimum amount of data points when segmenting the data feature by feature. |
doEcdf | default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferring a continuous feature. If false, uses the empirical PDF to return the relative likelihood. This parameter does not have any effect if all of the variables are discrete or when doing a regression. Otherwise, for each continuous variable, the probability to find a value less than or equal - given the conditions - is returned. Note that the interpretation of probability using the ECDF deviates considerably and must be used with care, especially since it affects each factor in Bayes' equation that is continuous. This is especially true for the case where |
useParallel | default NULL a boolean to indicate whether to use a previously registered parallel backend. If no explicit value was given, calls |
numeric probability (inferring discrete labels) or relative likelihood (regression, inferring likelihood of continuous value) or most likely value given the conditional features. If using a positive shiftAmount, the result is a 'probability score'.
Sebastian Hönel [email protected]
Bayes T (1763). “LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR S.” Philosophical transactions of the Royal Society of London, 370–418.
test-case "a zero denominator can happen"
feat1 <- mmb::createFeatureForBayes(
  name = "Petal.Length", value = mean(iris$Petal.Length))
feat2 <- mmb::createFeatureForBayes(
  name = "Petal.Width", value = mean(iris$Petal.Width))
featT <- mmb::createFeatureForBayes(
  name = "Species", iris[1,]$Species, isLabel = TRUE)
# Check the probability of Species=setosa, given the other 2 features:
mmb::bayesProbability(
  df = iris, features = rbind(feat1, feat2, featT), targetCol = "Species")
# Now check the probability of Species=versicolor:
featT$valueChar <- "versicolor"
mmb::bayesProbability(
  df = iris, features = rbind(feat1, feat2, featT), targetCol = "Species")
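The effect of shiftAmount can be illustrated with plain arithmetic. The sketch below uses three made-up factors and is not the package's code; in particular, how the shift is applied across numerator and denominator of the full equation is not shown here and is left as an assumption.

```r
# Sketch only: three hypothetical conditional factors, one of which is zero.
factors <- c(0.4, 0, 0.25)

# Without a shift, a single zero factor wipes out the whole product.
unshifted <- prod(factors)

# With a shift added to every factor, the product stays positive. It is no
# longer a probability, only a score that can be used to order samples.
shiftAmount <- 0.1
shifted <- prod(factors + shiftAmount)

print(c(unshifted = unshifted, shifted = shifted))
```

Two samples that would both score zero without the shift can still be ranked against each other by their shifted scores.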
This method uses full-dependency (simple=F) Bayesian inferencing to assign a probability to the target feature in all of the samples given in dfValid. Tests each sample using mmb::bayesProbability() or mmb::bayesProbabilitySimple(). It mostly forwards the given arguments to these functions, and you will find good documentation there.
bayesProbabilityAssign( dfTrain, dfValid, targetCol, selectedFeatureNames = c(), shiftAmount = 0.1, retainMinValues = 1, doEcdf = FALSE, online = 0, simple = FALSE, naive = FALSE, useParallel = NULL, returnProbabilityTable = FALSE )
dfTrain | data.frame that holds the training data. |
dfValid | data.frame that holds the validation samples, for each of which a probability is sought. The convention is that if you attempt to assign a probability to a numeric value, it ought to be found in the target column of this data frame (otherwise, the target column is not required in it). |
targetCol | character the name of targeted feature, i.e., the feature to assign a probability to. |
selectedFeatureNames | character defaults to empty vector which defaults to using all available features. Use this to select subsets of features and to order features. |
shiftAmount | numeric an offset value used to increase any one probability (factor) in the full built equation. |
retainMinValues | integer to require a minimum amount of data points when segmenting the data feature by feature. |
doEcdf | default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferring a continuous feature. |
online | default 0 integer to indicate how many rows should be used to do inferencing. If zero, then only the initially given data.frame dfTrain is used. If > 0, then each inferenced sample will be attached to it and the resulting data.frame is truncated to this number. Use an integer large enough (i.e., sum of training and validation rows) to keep all samples during inferencing. A smaller amount, e.g., the number of rows in dfTrain, keeps the amount of data restricted, discarding older rows. A larger amount than, e.g., the number of rows in dfTrain is also fine; dfTrain will grow to it and then discard rows. |
simple | default FALSE boolean to indicate whether or not to use simple Bayesian inferencing instead of full. This is faster but the results are less good. If true, uses |
naive | default FALSE boolean to indicate whether or not to use naive Bayesian inferencing instead of full or simple. |
useParallel | boolean DEFAULT NULL this is forwarded to the underlying function |
returnProbabilityTable | default FALSE boolean to indicate whether to return only the probabilities for each validation sample or whether a table with a probability for each tested label should be returned. This has no effect when inferencing probabilities for numeric values, as the table then only has one column "probability". The first column of this table is always called "rowname" and corresponds to the rownames of dfValid. |
Sebastian Hönel [email protected]
Bayes T (1763). “LII. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR S.” Philosophical transactions of the Royal Society of London, 370–418.
w <- mmb::getWarnings()
mmb::setWarnings(FALSE)
set.seed(84735)
rn <- base::sample(rownames(iris), 150)
dfTrain <- iris[rn[1:120], ]
dfValid <- iris[rn[121:150], !(colnames(iris) %in% "Species") ]
mmb::bayesProbabilityAssign(dfTrain, dfValid, "Species")
mmb::setWarnings(w)
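The behaviour described for the online argument resembles a sliding window over the training data. The following base R sketch shows that idea under stated assumptions: each processed sample is appended and the oldest rows are dropped once the limit is exceeded. The variable names are hypothetical and the package's exact truncation logic may differ.

```r
# Sketch only: a sliding window of at most `online` rows (assumed behaviour).
online <- 5
buffer <- utils::head(iris, 4)   # stands in for the initial dfTrain

for (i in 5:8) {
  # Append the sample that was just processed ...
  buffer <- rbind(buffer, iris[i, ])
  # ... and keep only the most recent `online` rows.
  if (nrow(buffer) > online) {
    buffer <- utils::tail(buffer, online)
  }
}

print(rownames(buffer))
```

After four appended samples the buffer holds the original rows 4 to 8; rows 1 to 3 were discarded as the oldest ones.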
A complementary implementation using methods common in mmb, such as computing factors or segmenting data. Supports Laplacian smoothing and early-stopping segmenting, as well as PDF and CDF and selecting any subset of features for dependency.
bayesProbabilityNaive( df, features, targetCol, selectedFeatureNames = c(), shiftAmount = 0.1, retainMinValues = 1, doEcdf = FALSE, useParallel = NULL )
df | data.frame that contains all the feature's data |
features | data.frame with bayes-features. One of the features needs to be the label-column. |
targetCol | string with the name of the feature that represents the label. |
selectedFeatureNames | vector default |
shiftAmount | numeric an offset value used to increase any one probability (factor) in the full built equation. In scenarios with many dependencies, it is more likely that a single conditional probability becomes zero, which would result in the entire probability being zero. Since this is often useless, the 'shiftAmount' can be added to each factor, resulting in a non-zero probability that can at least be used to order samples by likelihood. Note that, with a positive 'shiftAmount', the result of this function cannot be said to be a probability any longer, but rather results in a comparable likelihood (a 'probability score'). |
retainMinValues | integer to require a minimum amount of data points when segmenting the data feature by feature. |
doEcdf | default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferring a continuous feature. If false, uses the empirical PDF to return the relative likelihood. This parameter does not have any effect if all of the variables are discrete or when doing a regression. Otherwise, for each continuous variable, the probability to find a value less than or equal - given the conditions - is returned. Note that the interpretation of probability using the ECDF deviates considerably and must be used with care, especially since it affects each factor in Bayes' equation that is continuous. This is especially true for the case where |
useParallel | default NULL a boolean to indicate whether to use a previously registered parallel backend. If no explicit value was given, calls |
numeric probability (inferring discrete labels) or relative likelihood (regression, inferring likelihood of continuous value) or most likely value given the conditional features. If using a positive shiftAmount, the result is a 'probability score'.
Sebastian Hönel [email protected]
feat1 <- mmb::createFeatureForBayes(
  name = "Petal.Length", value = mean(iris$Petal.Length))
feat2 <- mmb::createFeatureForBayes(
  name = "Petal.Width", value = mean(iris$Petal.Width))
featT <- mmb::createFeatureForBayes(
  name = "Species", iris[1,]$Species, isLabel = TRUE)
# Check the probability of Species=setosa, given the other 2 features:
mmb::bayesProbabilityNaive(
  df = iris, features = rbind(feat1, feat2, featT), targetCol = "Species")
# Now check the probability of Species=versicolor:
featT$valueChar <- "versicolor"
mmb::bayesProbabilityNaive(
  df = iris, features = rbind(feat1, feat2, featT), targetCol = "Species")
Uses simple Bayesian inference to return the probability or relative likelihood of a discrete label or continuous value.
bayesProbabilitySimple( df, features, targetCol, selectedFeatureNames = c(), retainMinValues = 1, doEcdf = FALSE )
df | data.frame |
features | data.frame with bayes-features. One of the features needs to be the label-column. |
targetCol | string with the name of the feature that represents the label. |
selectedFeatureNames | vector default |
retainMinValues | integer to require a minimum amount of data points when segmenting the data feature by feature. |
doEcdf | default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferring a continuous feature. If false, uses the empirical PDF to return the relative likelihood. |
double the probability of the target-label, using the maximum a posteriori estimate.
Sebastian Hönel [email protected]
Scutari M (2010). “Learning Bayesian Networks with the bnlearn R Package.” Journal of Statistical Software, 35(3), 1–22. doi:10.18637/jss.v035.i03.
mmb::bayesInferSimple()
feat1 <- mmb::createFeatureForBayes(
  name = "Sepal.Length", value = mean(iris$Sepal.Length))
feat2 <- mmb::createFeatureForBayes(
  name = "Sepal.Width", value = mean(iris$Sepal.Width), isLabel = TRUE)
# Assign a probability to a continuous variable (also works with nominal):
mmb::bayesProbabilitySimple(df = iris, features = rbind(feat1, feat2),
  targetCol = feat2$name, retainMinValues = 5, doEcdf = TRUE)
This method performs full-dependency regression by discretizing the continuous target variable into ranges (buckets), then finding the most probable ranges. It can either regress on the values in the most likely range or sample from all ranges, according to their likelihood.
bayesRegress( df, features, targetCol, selectedFeatureNames = c(), shiftAmount = 0.1, retainMinValues = 2, doEcdf = FALSE, useParallel = NULL, numBuckets = ceiling(log2(nrow(df))), sampleFromAllBuckets = TRUE, regressor = NULL )
df | data.frame that contains all the feature's data |
features | data.frame with bayes-features. One of the features needs to be the label-column. |
targetCol | string with the name of the feature that represents the label. |
selectedFeatureNames | vector default |
shiftAmount | numeric an offset value used to increase any one probability (factor) in the full built equation. In scenarios with many dependencies, it is more likely that a single conditional probability becomes zero, which would result in the entire probability being zero. Since this is often useless, the 'shiftAmount' can be added to each factor, resulting in a non-zero probability that can at least be used to order samples by likelihood. Note that, with a positive 'shiftAmount', the result of this function cannot be said to be a probability any longer, but rather results in a comparable likelihood (a 'probability score'). |
retainMinValues | integer to require a minimum amount of data points when segmenting the data feature by feature. |
doEcdf | default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferring a continuous feature. If false, uses the empirical PDF to return the relative likelihood. This parameter does not have any effect if all of the variables are discrete or when doing a regression. Otherwise, for each continuous variable, the probability to find a value less than or equal - given the conditions - is returned. Note that the interpretation of probability using the ECDF deviates considerably and must be used with care, especially since it affects each factor in Bayes' equation that is continuous. This is especially true for the case where |
useParallel | default NULL a boolean to indicate whether to use a previously registered parallel backend. If no explicit value was given, calls |
numBuckets | integer the amount of buckets to use for discretization. Buckets are built in an equidistant manner, not as quantiles (i.e., one bucket likely holds a different number of values than another). |
sampleFromAllBuckets | default TRUE boolean to indicate how to obtain values for regression from the buckets. If true, then takes values from those buckets with a non-zero probability, and according to their probability. If false, selects all values from the bucket with the highest probability. |
regressor | Function that is given the collected values for regression and thus finally used to select a most likely value. Defaults to the built-in estimator for the empirical PDF and returns its argmax. However, any other function can be used, too, such as min, max, median, average etc. You may also use this function to obtain the raw values for further processing. |
Sebastian Hönel [email protected]
w <- mmb::getWarnings()
mmb::setWarnings(FALSE)
df <- iris[, ]
set.seed(84735)
rn <- base::sample(rownames(df), 150)
dfTrain <- df[1:120, ]
dfValid <- df[121:150, ]
tf <- mmb::sampleToBayesFeatures(dfValid[1,], "Sepal.Length")
mmb::bayesRegress(dfTrain, tf, "Sepal.Length")
mmb::setWarnings(w)
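The equidistant bucketing described for numBuckets can be approximated in base R with cut(). This is a sketch under stated assumptions and not the package's own discretization: it applies the default formula ceiling(log2(n)) to the 150 rows of iris and lets cut() choose equal-width intervals.

```r
# Sketch only: equal-width buckets via base::cut() (an assumption about how
# equidistant buckets could be built, not mmb's internal code).
y <- iris$Sepal.Length
numBuckets <- ceiling(log2(length(y)))

# Passing a single number as `breaks` yields that many equal-width intervals.
buckets <- cut(y, breaks = numBuckets)
counts <- table(buckets)
print(counts)
```

Because the intervals have equal width rather than equal mass, the bucket counts differ, which is the property the parameter description points out.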
This method uses full-dependency (simple=F) Bayesian inferencing to do a regression for the target feature for all of the samples given in dfValid. Assigns a regression value using either mmb::bayesRegress() (full) or mmb::bayesRegressSimple() (if simple=T).
bayesRegressAssign( dfTrain, dfValid, targetCol, selectedFeatureNames = c(), shiftAmount = 0.1, retainMinValues = 2, doEcdf = FALSE, online = 0, simple = FALSE, useParallel = NULL, numBuckets = ceiling(log2(nrow(df))), sampleFromAllBuckets = TRUE, regressor = NULL )
dfTrain | data.frame that holds the training data. |
dfValid | data.frame that holds the validation samples, for each of which a probability is sought. The convention is that if you attempt to assign a probability to a numeric value, it ought to be found in the target column of this data frame (otherwise, the target column is not required in it). |
targetCol | character the name of targeted feature, i.e., the feature to assign a probability to. |
selectedFeatureNames | character defaults to empty vector which defaults to using all available features. Use this to select subsets of features and to order features. |
shiftAmount | numeric an offset value used to increase any one probability (factor) in the full built equation. |
retainMinValues | integer to require a minimum amount of data points when segmenting the data feature by feature. |
doEcdf | default FALSE a boolean to indicate whether to use the empirical CDF to return a probability when inferring a continuous feature. |
online | default 0 integer to indicate how many rows should be used to do inferencing. If zero, then only the initially given data.frame dfTrain is used. If > 0, then each inferenced sample will be attached to it and the resulting data.frame is truncated to this number. Use an integer large enough (i.e., sum of training and validation rows) to keep all samples during inferencing. A smaller amount, e.g., the number of rows in dfTrain, keeps the amount of data restricted, discarding older rows. A larger amount than, e.g., the number of rows in dfTrain is also fine; dfTrain will grow to it and then discard rows. |
simple | default FALSE boolean to indicate whether or not to use simple Bayesian inferencing instead of full. This is faster but the results are less good. If true, uses |
useParallel | boolean DEFAULT NULL this is forwarded to the underlying function |
numBuckets | integer the amount of buckets to use for discretization. Buckets are built in an equidistant manner, not as quantiles (i.e., one bucket likely holds a different number of values than another). |
sampleFromAllBuckets | default TRUE boolean to indicate how to obtain values for regression from the buckets. If true, then takes values from those buckets with a non-zero probability, and according to their probability. If false, selects all values from the bucket with the highest probability. |
regressor | Function that is given the collected values for regression and thus finally used to select a most likely value. Defaults to the built-in estimator for the empirical PDF and returns its argmax. However, any other function can be used, too, such as min, max, median, average etc. You may also use this function to obtain the raw values for further processing. |
Sebastian Hönel [email protected]
mmb::bayesRegress() (full) or mmb::bayesRegressSimple() if simple=T. It mostly forwards the given arguments to these functions, and you will find good documentation there.
df <- iris[, ]
set.seed(84735)
rn <- base::sample(rownames(df), 150)
dfTrain <- df[1:120, ]
dfValid <- df[121:150, ]
res <- mmb::bayesRegressAssign(
  dfTrain, dfValid[, !(colnames(dfValid) %in% "Sepal.Length")],
  "Sepal.Length", sampleFromAllBuckets = TRUE, doEcdf = TRUE)
cov(res, iris[121:150,]$Sepal.Length)^2
Uses simple Bayesian inferencing to segment the data given the conditional features. Then estimates a density over the remaining values of the target feature and returns the most likely value using a maximum a posteriori estimate of the kernel (returning its mode).
bayesRegressSimple( df, features, targetCol, selectedFeatureNames = c(), retainMinValues = 2, regressor = NULL )
df |
data.frame |
features |
data.frame with Bayes-features. For regression, the label column does not need to be among the features; if it is, it does not require a value. |
targetCol |
string with the name of the feature that represents the label (here the target variable for regression). |
selectedFeatureNames |
vector default |
retainMinValues |
integer to require a minimum amount of data points when segmenting the data feature by feature. |
regressor |
Function that is given the collected values for regression and thus finally used to select a most likely value. Defaults to the built-in estimator for the empirical PDF and returns its argmax. However, any other function can be used, too, such as min, max, median, average etc. You may also use this function to obtain the raw values for further processing. |
Sebastian Hönel [email protected]
Scutari M (2010). “Learning Bayesian Networks with the bnlearn R Package.” Journal of Statistical Software, 35(3), 1–22. doi:10.18637/jss.v035.i03.
mmb::bayesInferSimple()
feat1 <- mmb::createFeatureForBayes(
  name = "Sepal.Length", value = mean(iris$Sepal.Length))
feat2 <- mmb::createFeatureForBayes(
  name = "Sepal.Width", value = mean(iris$Sepal.Width))
# Note how we do not require "Petal.Length" among the features when regressing:
mmb::bayesRegressSimple(df = iris, features = rbind(feat1, feat2),
  targetCol = "Petal.Length")
This function can be used to generate LaTeX markup that models the full dependency between covariates and a target variable.
bayesToLatex(conditionalFeatures, targetFeature, includeValues = FALSE)
conditionalFeatures |
data.frame of Bayesian features, the target feature depends on. |
targetFeature |
data.frame that holds exactly one Bayesian feature, which is supposed to be the target feature for Bayesian inferencing. |
includeValues |
default FALSE boolean to indicate whether to include the features' values or not, i.e. "A" vs. "A = setosa". |
a string that can be used in LaTeX documents.
Use cat() to print a string that can be copy-pasted.
Sebastian Hönel [email protected]
feat1 <- mmb::createFeatureForBayes(
  name = "Petal.Length", value = mean(iris$Petal.Length))
feat2 <- mmb::createFeatureForBayes(
  name = "Petal.Width", value = mean(iris$Petal.Width))
featT <- mmb::createFeatureForBayes(
  name = "Species", iris[1,]$Species, isLabel = TRUE)
cat(mmb::bayesToLatex(conditionalFeatures = rbind(feat1, feat2),
  targetFeature = featT, includeValues = TRUE))
Takes a data.frame of samples, then builds a PDF/PMF or ECDF for each of the selected features. Then, for each sample, computes the product of probabilities. The result is a vector that holds a probability for each sample. That probability (or relative likelihood) then represents the vicinity (or similarity) of the sample to the given neighborhood.
centralities( dfNeighborhood, selectedFeatureNames = c(), shiftAmount = 0.1, doEcdf = FALSE, ecdfMinusOne = FALSE )
dfNeighborhood |
data.frame that holds all rows that make up the neighborhood. |
selectedFeatureNames |
vector of names of features to use. The centrality of each row in the neighborhood is calculated based on the selected features. |
shiftAmount |
numeric DEFAULT 0.1 optional amount to shift each feature's probability by. This is useful when the centrality need not be an actual probability and too many features are selected. To obtain actual probabilities, this needs to be 0, and you must use the ECDF. |
doEcdf |
boolean DEFAULT FALSE whether to use the ECDF instead of the EPDF to find the likelihood of continuous values. |
ecdfMinusOne |
boolean DEFAULT FALSE only has an effect if the ECDF is used. If true, uses 1 minus the ECDF to find the probability of a continuous value. Depending on the interpretation of what you try to do, this may be of use. |
a named vector, where the names correspond to the rownames of the rows in the given neighborhood, and the value is the centrality of that row.
Sebastian Hönel [email protected]
# Create a neighborhood:
nbh <- mmb::neighborhood(df = iris, features = mmb::createFeatureForBayes(
  name = "Sepal.Width", value = mean(iris$Sepal.Width)))
cent <- mmb::centralities(dfNeighborhood = nbh, shiftAmount = 0.1,
  doEcdf = TRUE, ecdfMinusOne = TRUE)
# Plot the ordered samples to get an idea of the centralities in the neighborhood:
plot(x = names(cent), y=cent)
Takes a data.frame and segments it, according to the selected variables. Only rows satisfying all conditions are kept. Supports discrete and continuous variables. Supports NA, NaN and NULL by using is.na, is.nan and is.null as comparator.
conditionalDataMin( df, features, selectedFeatureNames = c(), retainMinValues = 1 )
df |
data.frame with data to segment. If it contains less than or
equally many rows as specified by |
features |
data.frame of bayes-features that are used to segment.
Each feature's value is used to segment the data, and the features are
used in the order as given by |
selectedFeatureNames |
default |
retainMinValues |
default 1. The minimum number of rows to retain. Filtering the data by the selected features may reduce the number of remaining rows quickly, and this can be used as an early stopping criterion. Note that filtering is done variable by variable, and the number of remaining rows is evaluated after each segmenting step. If the threshold is undercut, then the result from the previous round is returned. |
data.frame that is segmented according to the selected variables and the minimum amount of rows to retain.
Sebastian Hönel [email protected]
getValueKeyOfBayesFeatures()
feat1 <- mmb::createFeatureForBayes(
  name = "Petal.Length", value = mean(iris$Petal.Length))
feat2 <- mmb::createFeatureForBayes(
  name = "Petal.Width", value = mean(iris$Petal.Width))
feats <- rbind(feat1, feat2)
data <- mmb::conditionalDataMin(df = iris, features = feats,
  selectedFeatureNames = feats$name, retainMinValues = 1)
Transforms a sample's feature's value into a data.frame that holds its name, type and value. Currently supports numeric, factor, character and boolean values. Note that factor is internally converted to character.
createFeatureForBayes(name, value, isLabel = FALSE, isDiscrete = FALSE)
name |
the name of the feature or variable. |
value |
the value of the feature or variable. |
isLabel |
default FALSE. Indicates whether this feature or variable is the target variable (the label or value to predict). |
isDiscrete |
default FALSE. Used to indicate whether the feature or variable given is discrete. This will also be set to true if the value given is a character, factor or logical. |
A data.frame with one row holding all the feature's value's properties.
Sebastian Hönel [email protected]
sampleToBayesFeatures, which uses this function
feat <- mmb::createFeatureForBayes(
  name = "Petal.Width", value = mean(iris$Petal.Width))
featTarget <- mmb::createFeatureForBayes(
  name = "Species", iris[1,]$Species, isLabel = TRUE)
Discretizes a continuous random variable into buckets (ranges). Each range is delimited by an exclusive minimum value and an inclusive maximum value.
discretizeVariableToRanges( data, openEndRanges = TRUE, numRanges = NA, exclMinVal = NULL, inclMaxVal = NULL )
data |
a vector with numeric data |
openEndRanges |
boolean default True. If true, then the minimum value
of the first range will be set to @seealso |
numRanges |
integer default NA. If NA, then the number of ranges (buckets) depends on the amount of data given: a minimum of two buckets is used, and a maximum of ceiling(log2(length(data))). |
exclMinVal |
numeric default NULL. Used to delimit the lower bound of the given data. If not given, then no value is excluded, as the exclusive lower bound becomes the minimum of the given data minus an epsilon of 1e-15. |
inclMaxVal |
numeric default NULL. Used to delimit the upper bound of the given data. If not given, then the upper inclusive bound is the max of the given data. |
list of vectors, where each vector has two values, the first being the exclusive minimum value of the range, and the second being the inclusive maximum value of the range. The list will be as long as the number of buckets requested.
Sebastian Hönel [email protected]
buckets <- mmb::discretizeVariableToRanges(
  data = iris$Sepal.Length, openEndRanges = TRUE)
length(buckets)
buckets[[5]]
The distance of two samples x, y from each other within a given neighborhood is defined as the absolute difference between the two samples' centralities with respect to that neighborhood.
distance( dfNeighborhood, rowNrOfSample1, rowNrOfSample2, selectedFeatureNames = c(), shiftAmount = 0.1, doEcdf = FALSE, ecdfMinusOne = FALSE )
dfNeighborhood |
data.frame that holds all rows that make up the neighborhood. |
rowNrOfSample1 |
character the name of the row that constitutes the first sample from the given neighborhood. |
rowNrOfSample2 |
character the name of the row that constitutes the second sample from the given neighborhood. |
selectedFeatureNames |
vector of names of features to use. The centrality of each row in the neighborhood is calculated based on the selected features. |
shiftAmount |
numeric DEFAULT 0.1 optional amount to shift each feature's probability by. This is useful when the centrality need not be an actual probability and too many features are selected. To obtain actual probabilities, this needs to be 0, and you must use the ECDF. |
doEcdf |
boolean DEFAULT FALSE whether to use the ECDF instead of the EPDF to find the likelihood of continuous values. |
ecdfMinusOne |
boolean DEFAULT FALSE only has an effect if the ECDF is used. If true, uses 1 minus the ECDF to find the probability of a continuous value. Depending on the interpretation of what you try to do, this may be of use. |
numeric the distance as a positive number.
Sebastian Hönel [email protected]
# Show the distance between two samples using all their features:
mmb::distance(dfNeighborhood = iris, rowNrOfSample1 = 10, rowNrOfSample2 = 99)
# Let's use an actual neighborhood:
nbh <- mmb::neighborhood(df = iris, features = mmb::createFeatureForBayes(
  name = "Sepal.Length", value = mean(iris$Sepal.Length)))
mmb::distance(dfNeighborhood = nbh, rowNrOfSample1 = 1, rowNrOfSample2 = 30,
  selectedFeatureNames = colnames(iris)[1:3])
# Let's compare this to the distances as they are in iris (should be smaller):
mmb::distance(dfNeighborhood = iris, rowNrOfSample1 = 1, rowNrOfSample2 = 30,
  selectedFeatureNames = colnames(iris)[1:3])
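The centrality and distance definitions can be sketched in base R without the package. This is a minimal illustration, not the package's implementation: centrality_sketch and distance_sketch are hypothetical names, only the ECDF variant for continuous features is shown, and adding the shift to each per-feature probability is an assumption about how shiftAmount is applied.

```r
# Hypothetical sketch: centrality as the product of per-feature probabilities.
centrality_sketch <- function(df, shift = 0.1) {
  cent <- rep(1, nrow(df))
  for (name in colnames(df)) {
    col <- df[[name]]
    if (is.numeric(col)) {
      # Continuous feature: probability taken from the empirical CDF.
      p <- stats::ecdf(col)(col)
    } else {
      # Discrete feature: relative frequency of each observed value.
      freq <- table(col) / length(col)
      p <- as.numeric(freq[as.character(col)])
    }
    # Assumption: the shift is added to every factor of the product.
    cent <- cent * (p + shift)
  }
  stats::setNames(cent, rownames(df))
}

# Distance of two rows: absolute difference of their centralities.
distance_sketch <- function(cent, row1, row2) {
  abs(cent[[row1]] - cent[[row2]])
}

cent <- centrality_sketch(iris)
d <- distance_sketch(cent, "10", "99")
```

Because the distance only compares two scalar centralities, it is symmetric and zero for identical rows; it does not compare the rows feature by feature.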
Given a few observations of a random variable, this function returns an approximation of the PDF as a function. It also returns the PDF's support and argmax, and it works even when only zero or one value is given. Depending on the density function used, two values are often enough to estimate a PDF.
estimatePdf( data = c(), densFun = function(vec) { stats::density(vec, bw = "SJ") } )
data |
vector of numeric data. Used to compute the empirical density of the data. |
densFun |
function default |
list with a function that is the empirical PDF using KDE. The list also has two properties 'min' and 'max', which represent the integrable range of that function. 'min' and 'max' are both zero if no data (an empty vector) was given. If one data point was given, then they correspond to its value -/+ .Machine$double.eps. The list further contains two numeric vectors 'x' and 'y', and a property 'argmax'. If no data was given, 'x' and 'y' are zero, and 'argmax' is NA. If one data point was given, then 'x' and 'argmax' equal it, and 'y' is set to 1. If two or more data points were given, then the empirical density is estimated and 'x' and 'y' are filled from its estimate. 'argmax' is then set to that 'x' where 'y' becomes maximal.
If the given vector is empty, warns and returns a constant function that always returns zero for all values.
If the given vector contains only one observation, then a function is returned that returns 1 iff the value supplied is the same as the observation. Otherwise, that function will return zero.
Sebastian Hönel [email protected]
epdf <- mmb::estimatePdf(data = iris$Petal.Width)
print(epdf$argmax)
plot(epdf)
# Get relative likelihood of some values:
epdf$fun(0.5)
epdf$fun(1.7)
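For two or more observations, the returned list can be approximated with base R alone. The following is a sketch under assumptions rather than the package's code: estimate_pdf_sketch is a hypothetical name, and turning the density estimate into a function with stats::approxfun (returning zero outside the support) is one plausible choice. The empty and single-value cases described above are not handled.

```r
# Hypothetical sketch of an empirical PDF via kernel density estimation.
estimate_pdf_sketch <- function(data) {
  stopifnot(is.numeric(data), length(data) >= 2)
  # Same default density function as in the usage above.
  d <- stats::density(data, bw = "SJ")
  list(
    # Assumption: linear interpolation between grid points, zero outside.
    fun = stats::approxfun(d$x, d$y, yleft = 0, yright = 0),
    min = min(d$x),
    max = max(d$x),
    x = d$x,
    y = d$y,
    # The mode of the estimate: the 'x' where 'y' is largest.
    argmax = d$x[which.max(d$y)]
  )
}

epdf <- estimate_pdf_sketch(iris$Petal.Width)
epdf$argmax
epdf$fun(0.5)
```

The values returned by the function are relative likelihoods, not probabilities, so they can exceed 1 for narrow distributions.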
Getting and setting the default regressor affects all functions that have an overridable regressor. If no regressor is given to such a function, the default as defined here is used.
getDefaultRegressor()
Function the function used as the regressor. Defaults to function(data) mmb::estimatePdf(data)$argmax.
Sebastian Hönel [email protected]
Getter for the state of messages. Returns true if enabled.
getMessages()
Boolean to indicate whether messages are enabled or not.
Sebastian Hönel [email protected]
Similar to estimatePdf, this function returns the probability for a discrete value, given some observations.
getProbForDiscrete(data, value)
data |
vector of observations that have the same type as the given value. |
value |
a single observation of the same type as the data vector. |
the probability of value given data.
If no observations are given, then this function will warn and return a probability of zero for the value given. While we could technically return positive infinity, 0 is more suitable in the context of Bayesian inferencing.
Sebastian Hönel [email protected]
mmb::getProbForDiscrete(data = c(), value = iris[1,]$Species)
mmb::getProbForDiscrete(data = iris$Species, value = iris[1,]$Species)
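The probability of a discrete value is its relative frequency among the observations. A minimal base R sketch, assuming values are compared after conversion to character (prob_for_discrete_sketch is a hypothetical name; the warning for empty input is omitted):

```r
# Hypothetical sketch: relative frequency of 'value' among the observations.
prob_for_discrete_sketch <- function(data, value) {
  if (length(data) == 0) {
    # No observations: return zero, as described above.
    return(0)
  }
  mean(as.character(data) == as.character(value))
}

prob_for_discrete_sketch(c(), iris[1, ]$Species)
prob_for_discrete_sketch(iris$Species, iris[1, ]$Species)
```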
Given a list of previously computed ranges for a random variable, this function returns the index of the range the given value belongs to (i.e., in which bucket it belongs). Indexes start at 1, as is typical for R. By definition, a value is within a range if it is larger than the range's minimum and less than or equal to its maximum.
getRangeForDiscretizedValue(ranges, value)
ranges |
list of ranges, as obtained by @seealso |
value |
numeric a value drawn from the previously discretized random variable. |
integer the index of the range the given value falls into.
Sebastian Hönel [email protected]
buckets <- mmb::discretizeVariableToRanges(
  data = iris$Sepal.Length, openEndRanges = TRUE)
mmb::getRangeForDiscretizedValue(
  ranges = buckets, value = mean(iris$Sepal.Length))
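The interplay of equidistant ranges and the lookup rule (exclusive minimum, inclusive maximum) can be sketched in base R. This is an illustration under assumptions, not the package's code: discretize_sketch and range_index_sketch are hypothetical names, the number of ranges is always passed explicitly, and open-ended first and last ranges are left out.

```r
# Hypothetical sketch: equidistant ranges with exclusive min, inclusive max.
discretize_sketch <- function(data, numRanges) {
  # Lower bound slightly below the minimum so that no value is excluded.
  lo <- min(data) - 1e-15
  hi <- max(data)
  breaks <- seq(lo, hi, length.out = numRanges + 1)
  # Pin the outer breaks to the exact bounds.
  breaks[1] <- lo
  breaks[numRanges + 1] <- hi
  lapply(seq_len(numRanges), function(i) c(breaks[i], breaks[i + 1]))
}

# Returns the 1-based index of the range that contains 'value', or NA.
range_index_sketch <- function(ranges, value) {
  for (i in seq_along(ranges)) {
    if (value > ranges[[i]][1] && value <= ranges[[i]][2]) {
      return(i)
    }
  }
  NA_integer_
}

buckets <- discretize_sketch(iris$Sepal.Length, numRanges = 5)
idx <- vapply(iris$Sepal.Length,
  function(v) range_index_sketch(buckets, v), integer(1))
table(idx)
```

Because the ranges are equidistant rather than quantile-based, the counts per bucket shown by table(idx) generally differ.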
Given a data.frame with one or multiple features as constructed by createFeatureForBayes and a name, extracts the type of the feature specified by name. Note that this is only used internally.
getValueKeyOfBayesFeatures(dfFeature, featName)
dfFeature |
a data.frame for a single feature or variable
as constructed by @seealso |
featName |
the name of the feature or variable of which to obtain the type. |
the (internal) type of the feature.
Sebastian Hönel [email protected]
feats <- rbind(
  mmb::createFeatureForBayes(
    "Petal.Width", value = mean(iris$Petal.Width)),
  mmb::createFeatureForBayes(
    name = "Species", iris[1,]$Species, isLabel = TRUE)
)
print(mmb::getValueKeyOfBayesFeatures(feats, "Species"))
print(mmb::getValueKeyOfBayesFeatures(feats, "Petal.Width"))
Given a data.frame with one or multiple features as constructed by createFeatureForBayes and a name, extracts the value of the feature specified by name.
getValueOfBayesFeatures(dfFeature, featName)
dfFeature |
a data.frame for a single feature or variable
as constructed by @seealso |
featName |
the name of the feature or variable of which to obtain the value. |
the value of the feature.
Sebastian Hönel [email protected]
feats <- rbind(
  mmb::createFeatureForBayes(
    "Petal.Width", value = mean(iris$Petal.Width)),
  mmb::createFeatureForBayes(
    name = "Species", iris[1,]$Species, isLabel = TRUE)
)
print(mmb::getValueOfBayesFeatures(feats, "Species"))
print(mmb::getValueOfBayesFeatures(feats, "Petal.Width"))
Getter for the state of warnings. Returns true if enabled.
getWarnings()
Boolean to indicate whether warnings are enabled or not.
Sebastian Hönel [email protected]
The neighborhood is defined as the set of samples that have a similarity greater than zero to the given sample. Segmentation is done using equality (==) for discrete features and less than or equal (<=) for continuous features. Note that feature values NA and NaN are also supported using is.na() and is.nan().
neighborhood(df, features, selectedFeatureNames = c(), retainMinValues = 0)
df |
data.frame to select the neighborhood from |
features |
data.frame of Bayes-features, used to segment/select the rows that should make up the neighborhood. |
selectedFeatureNames |
vector of names of features to use to demarcate the neighborhood. If empty, uses all features' names. |
retainMinValues |
DEFAULT 0 the amount of samples to retain during segmentation. For separating a neighborhood, this value typically should be 0, so that no samples are included that are not within it. However, for very sparse data or a great amount of variables, it might still make sense to retain samples. |
data.frame with rows that were selected as neighborhood. It is guaranteed that the rownames are maintained.
Sebastian Hönel [email protected]
nbh <- mmb::neighborhood(df = iris, features = mmb::createFeatureForBayes(
  name = "Sepal.Width", value = mean(iris$Sepal.Width)))
print(nrow(nbh))
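The segmentation rule (== for discrete features, <= for continuous ones, applied feature by feature with a minimum number of rows to retain) can be sketched in base R. This is a simplified illustration, not the package's implementation: neighborhood_sketch is a hypothetical name, feature values are passed as a plain named list instead of Bayes-features, and rows with missing values are simply dropped rather than matched with is.na() or is.nan().

```r
# Hypothetical sketch: segment a data.frame feature by feature.
neighborhood_sketch <- function(df, values, retainMinValues = 0) {
  current <- df
  for (name in names(values)) {
    col <- current[[name]]
    v <- values[[name]]
    keep <- if (is.numeric(col)) {
      col <= v                               # continuous: less than or equal
    } else {
      as.character(col) == as.character(v)   # discrete: equality
    }
    keep[is.na(keep)] <- FALSE
    candidate <- current[keep, , drop = FALSE]
    # Early stop: if too few rows remain, return the previous round's result.
    if (nrow(candidate) < retainMinValues) {
      return(current)
    }
    current <- candidate
  }
  current
}

nbh <- neighborhood_sketch(
  iris, list(Sepal.Width = mean(iris$Sepal.Width), Species = "setosa"))
nrow(nbh)
```

Logical subsetting keeps the original rownames, which matches the guarantee stated for the returned data.frame.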
Helper function that takes one sample (e.g., a row of a data.frame with validation data) and transforms it into a data.frame where each row corresponds to one feature (and its value) of the sample. This is done using createFeatureForBayes. This operation can be thought of as transposing a matrix.
sampleToBayesFeatures(dfRow, targetCol)
dfRow |
a row of a data.frame with a value for each feature. |
targetCol |
the name of the feature (column in the data.frame) that is the target variable for classification or regression. |
a data.frame where the first row is the feature that represents the label.
Sebastian Hönel [email protected]
# Converts all features of iris; the result is a data.frame of length
# equal to the amount of features in iris (5). The first feature is
# targetCol (has isLabel=TRUE).
samp <- mmb::sampleToBayesFeatures(dfRow = iris[15,], targetCol = "Species")
Getting and setting the default regressor affects all functions that have an overridable regressor. If no regressor is given to such a function, the default as defined here is used.
setDefaultRegressor(func)
func |
a function to use as the regressor. It should accept one argument, which is a numeric vector, and return one value, the regression result. |
void
Sebastian Hönel [email protected]
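A custom regressor only has to map a numeric vector to a single value. The sketch below defines one in base R; median_regressor is a hypothetical name, and the call that registers it is left as a comment because it requires the mmb package to be installed.

```r
# Hypothetical custom regressor: the median of the collected values.
median_regressor <- function(data) {
  stats::median(data)
}

# The regressor contract: one numeric vector in, one value out.
median_regressor(c(1, 2, 10))

# With mmb installed, it could be registered as the default like this:
# mmb::setDefaultRegressor(median_regressor)
```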
Setter for enabling or disabling messages. Messages are disabled by default. Use these to enable high verbosity.
setMessages(enable = TRUE)
enable |
a boolean to indicate whether to enable messages or not. |
Boolean the state of enabled
Sebastian Hönel [email protected]
Setter for enabling or disabling warnings. Warnings are enabled by default.
setWarnings(enable = TRUE)
enable |
a boolean to indicate whether to enable warnings or not. |
Boolean the state of enabled
Sebastian Hönel [email protected]
Given an entire dataset, uses each instance in it to demarcate a neighborhood using the selected features. Then, for each neighborhood, the vicinity of all samples to it is computed. The result of this is an N x N matrix, where the entry (i, j) corresponds to the vicinity of sample j in neighborhood i.
vicinities( df, selectedFeatureNames = c(), shiftAmount = 0.1, doEcdf = FALSE, ecdfMinusOne = FALSE, retainMinValues = 0, useParallel = NULL )
df |
data.frame to compute the matrix of vicinities for. |
selectedFeatureNames |
vector of names of features to use for computing the vicinity/centrality of each sample to each neighborhood. |
shiftAmount |
numeric DEFAULT 0.1 optional amount to shift each feature's probability by. This is useful when the centrality need not be an actual probability and too many features are selected. To obtain actual probabilities, this needs to be 0, and you must use the ECDF. |
doEcdf |
boolean DEFAULT FALSE whether to use the ECDF instead of the EPDF to find the likelihood of continuous values. |
ecdfMinusOne |
boolean DEFAULT FALSE only has an effect if the ECDF is used. If true, uses 1 minus the ECDF to find the probability of a continuous value. Depending on the interpretation of what you try to do, this may be of use. |
retainMinValues |
DEFAULT 0 the amount of samples to retain during segmentation. For separating a neighborhood, this value typically should be 0, so that no samples are included that are not within it. However, for very sparse data or a great amount of variables, it might still make sense to retain samples. |
useParallel |
boolean DEFAULT NULL whether to use parallelism or not. Setting this to true requires also having previously registered a parallel backend. If parallel computing is enabled, then each neighborhood is computed separately. |
matrix of size N x N (N being the number of rows of the data.frame). Each row i demarcates the neighborhood as selected by sample i, and each column j then is the vicinity of sample j to that neighborhood. No value on the diagonal is zero, because each neighborhood always contains the sample it was demarcated by, and that sample has a similarity greater than zero to it.
Sebastian Hönel [email protected]
vicinitiesForSample()
w <- mmb::getWarnings()
mmb::setWarnings(FALSE)
mmb::vicinities(df = iris[1:10,])
# Run the same, but use the ECDF and retain more values:
mmb::vicinities(df = iris[1:10,], doEcdf = TRUE, retainMinValues = 10)
mmb::setWarnings(w)
Given some data and one sample from it, constructs the neighborhood of that sample and computes the centralities of all other samples in that neighborhood with respect to it. Samples that lie outside the neighborhood are assigned a vicinity of zero. Uses mmb::neighborhood() and mmb::centralities().
vicinitiesForSample( df, sampleFromDf, selectedFeatureNames = c(), shiftAmount = 0.1, doEcdf = FALSE, ecdfMinusOne = FALSE, retainMinValues = 0 )
df |
data.frame that holds the data (and also the sample to use to define the neighborhood). Each sample in this data.frame is assigned a vicinity. |
sampleFromDf |
data.frame a single row from the given data.frame. This is used to select a neighborhood from the given data. |
selectedFeatureNames |
vector of names of features to use to compute the
vicinity/centrality. This is passed to |
shiftAmount |
numeric DEFAULT 0.1 optional amount to shift each feature's probability by. This is useful when the centrality need not be an actual probability and too many features are selected. To obtain actual probabilities, this needs to be 0, and you must use the ECDF. |
doEcdf |
boolean DEFAULT FALSE whether to use the ECDF instead of the EPDF to find the likelihood of continuous values. |
ecdfMinusOne |
boolean DEFAULT FALSE only has an effect if the ECDF is used. If true, uses 1 minus the ECDF to find the probability of a continuous value. Depending on the interpretation of what you try to do, this may be of use. |
retainMinValues |
DEFAULT 0 the amount of samples to retain during segmentation. For separating a neighborhood, this value typically should be 0, so that no samples are included that are not within it. However, for very sparse data or a great amount of variables, it might still make sense to retain samples. |
data.frame with a single column 'vicinity' and the same rownames as the given data.frame. Each row then holds the vicinity for the corresponding row.
Sebastian Hönel [email protected]
vic <- mmb::vicinitiesForSample(
  df = iris, sampleFromDf = iris[1,], shiftAmount = 0.1)
vic$vicinity
# Plot the ordered samples to get an idea which ones have a vicinity > 0
plot(x=rownames(vic), y=vic$vicinity)