The fussclust package provides methods for
distance-based fuzzy clustering, including both unsupervised and
semi-supervised approaches.
The original motivation for developing the package was the lack of
semi-supervised fuzzy clustering implementations in R. However,
unsupervised and semi-supervised fuzzy clustering share a common
optimization framework and estimation procedure. Consequently,
fussclust implements a unified estimation framework that covers both
unsupervised and semi-supervised fuzzy clustering models.
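The current release of fussclust includes four classical
models: Fuzzy c-Means (FCM), Possibilistic c-Means (PCM),
Semi-Supervised Fuzzy c-Means (SSFCM), and Semi-Supervised
Possibilistic c-Means (SSPCM).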
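A typical workflow consists of the following steps:
1. Prepare the input data matrix X.
2. For semi-supervised models, prepare the supervision matrix superF (see Section 4.1).
3. Fit the model and inspect the estimated memberships U and prototypes V.
The fuzzy clustering methods implemented in fussclust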
are distance-based. Membership values for \(N\) observations and \(C\) clusters are estimated under the
assumption that similarity can be represented by distances in a metric
space: objects located closer to each other are considered more
similar.
The optimization problem is formulated as
\[ \min Q(U, V; X) \]
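where \(Q\) is the model-specific objective function, \(X\) is the data matrix, \(U\) is the membership matrix, and \(V\) is the matrix of cluster prototypes.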
The optimization problem does not admit a closed-form solution. Therefore, an iterative approximation procedure known as the Alternating Optimization (AO) algorithm is used.
The AO procedure minimizes the corresponding partial objective functions, yielding the matrix-valued update functions \(\hat{U}\) for estimating memberships and \(\hat{V}\) for estimating cluster prototypes.
At each iteration, the algorithm alternates between computing the estimated membership matrix \(\tilde{U}\) using the update function \(\hat{U}\) and computing the estimated prototype matrix \(\tilde{V}\) using the update function \(\hat{V}\).
For additional methodological details, see Kmita et al. (2024).
The AO algorithm can be summarized as follows:
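The alternation between the membership update and the prototype update can be sketched in base R. This is a minimal illustration, assuming the classical FCM update rules with fuzzifier m = 2 and squared Euclidean distances. The function name ao_fcm_sketch, the random initialization, and the stopping rule are hypothetical choices made for this sketch; it is not the fussclust implementation.

```r
# Minimal FCM-style alternating optimization.
# Assumptions: fuzzifier m = 2, squared Euclidean distances,
# prototypes initialized from randomly chosen observations.
ao_fcm_sketch <- function(X, C, max_iter = 100, tol = 1e-6, seed = 1) {
  set.seed(seed)
  N <- nrow(X)
  V <- X[sample.int(N, C), , drop = FALSE]
  U <- matrix(1 / C, nrow = N, ncol = C)
  for (iter in seq_len(max_iter)) {
    # Squared distances: observations in rows, prototypes in columns
    D2 <- sapply(seq_len(C), function(k) rowSums(sweep(X, 2, V[k, ])^2))
    D2 <- pmax(D2, .Machine$double.eps)  # guard against division by zero
    # Membership update for m = 2: u_jk proportional to 1 / d_jk^2
    U_new <- (1 / D2) / rowSums(1 / D2)
    # Prototype update: weighted means with weights u_jk^2
    W <- U_new^2
    V_new <- (t(W) %*% X) / colSums(W)
    converged <- max(abs(U_new - U)) < tol
    U <- U_new
    V <- V_new
    if (converged) break
  }
  list(U = U, V = V, counter = iter)
}

X_demo <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
fit_demo <- ao_fcm_sketch(X_demo, C = 3)
```

In this sketch each row of fit_demo$U sums to one, which is the FCM constraint on memberships; possibilistic models drop that constraint.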
The following examples use the well-known iris dataset
(R.A. Fisher, 1936). We
consider two features: sepal length and sepal width.
fussclust requires the input to be of class
matrix.
X <- iris[, c("Sepal.Length", "Sepal.Width")] |> as.matrix()
cat(
  paste0(
    "Class of `iris`: ", class(iris),
    "; class of `X`: ",
    paste(class(X), collapse = " & ")
  )
)
#> Class of `iris`: data.frame; class of `X`: matrix & array
The iris dataset contains three species of iris flowers.
Therefore, we set the number of clusters to C = 3. In
practice, however, the user may specify any number of clusters such that
\(C > 1\).
The estimated cluster prototypes obtained from the FCM model are:
model_fcm$V
#>      Sepal.Length Sepal.Width
#> [1,]     6.814646    3.070350
#> [2,]     4.979951    3.355452
#> [3,]     5.830402    2.762639
The estimated memberships for the first ten observations obtained from the PCM model are:
model_pcm$U[1:10, ]
#>            [,1]      [,2]      [,3]
#>  [1,] 0.5153643 0.5153737 0.5153653
#>  [2,] 0.4969738 0.4969839 0.4969748
#>  [3,] 0.3983422 0.3983501 0.3983430
#>  [4,] 0.3671885 0.3671957 0.3671893
#>  [5,] 0.4485335 0.4485415 0.4485344
#>  [6,] 0.4674081 0.4674135 0.4674087
#>  [7,] 0.3451158 0.3451224 0.3451165
#>  [8,] 0.4966804 0.4966900 0.4966814
#>  [9,] 0.3058682 0.3058740 0.3058688
#> [10,] 0.4925528 0.4925630 0.4925539
As described in Section 1, semi-supervised models in
fussclust require partial supervision encoded as a binary
matrix superF.
The package assumes that the number of clusters \(C\) equals the number of distinct classes represented in the supervision information.
In many applications, class labels are stored in a response variable.
This is also the case for the iris dataset:
iris[1:3, ]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
The class labels are stored in the Species column:
To construct the supervision matrix, the user must first fix an ordering of the classes. The ordering is arbitrary, but it determines which column of the matrix corresponds to which class. Let
\[ \mathcal{Y} = (y^1 = \texttt{setosa}, y^2 = \texttt{versicolor}, y^3 = \texttt{virginica}) \]
denote the ordered set of class labels.
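The supervision matrix superF is defined as follows. Let
\(F = [f_{jk}]\) denote the binary
supervision matrix, where \(f_{jk} = 1\) if observation \(j\) is supervised and labeled with class \(y^k\), and \(f_{jk} = 0\) otherwise.
The matrix has nrow(X) rows and, because the number of clusters equals the number of classes, \(C\) columns.
The matrix can be constructed as follows: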
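A minimal base R sketch of such a construction is given here. It assumes that \(f_{jk} = 1\) marks a supervised observation \(j\) with class \(y^k\), that the class ordering is taken from the factor levels of Species, and, purely for illustration, that the first five observations of each species are supervised. Whether fussclust expects a numeric, integer, or logical matrix is not stated in this section, so the numeric 0/1 encoding is also an assumption.

```r
# Class labels and the chosen ordering of classes (factor levels)
y <- iris$Species
classes <- levels(y)
N <- length(y)
C <- length(classes)

# Illustrative assumption: the first 5 observations of each class are supervised
supervised <- unlist(lapply(classes, function(cl) head(which(y == cl), 5)))

# Start from an all-zero N x C matrix (unsupervised rows stay all zero)
superF <- matrix(0, nrow = N, ncol = C, dimnames = list(NULL, classes))

# Set f_jk = 1 for each supervised observation j whose label is class k
superF[cbind(supervised, as.integer(y[supervised]))] <- 1
```

Each row of the resulting matrix contains at most a single 1, and rows of unsupervised observations contain only zeros.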
When fitting a semi-supervised model, the user must provide the
scaling parameter \(\alpha\)
(alpha), which regulates the impact of partial
supervision.
There is no universally optimal choice of \(\alpha\), as the appropriate value depends on the dataset and application. For further discussion, see Kmita et al. (2024).
For this example, we set \(\alpha = 1\).
Semi-Supervised Fuzzy c-Means:
Semi-Supervised Possibilistic c-Means:
The estimated prototypes are:
The vector of cluster-specific hyperparameters
\[ \Gamma = (\gamma_1, \ldots, \gamma_C) \]
is specific to possibilistic clustering methods.
This is an optional argument in both fussclust::PCM()
and fussclust::SSPCM(). If not provided, it defaults to
a vector of ones:
\[ \Gamma = (1, \ldots, 1) \]
Alternatively, the user may apply the initialization strategy proposed by Krishnapuram and Keller (1993), which computes
\[ \gamma_k = \frac{ \sum_{j=1}^N u_{jk}^2 d_{jk}^2 }{ \sum_{j=1}^N u_{jk}^2 } \]
using values obtained from an initial FCM fit.
This strategy can be enabled by setting
initFCM = TRUE.
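The formula for \(\gamma_k\) can be checked by hand with a short base R sketch. It assumes that \(d_{jk}\) is the Euclidean distance between observation \(j\) and prototype \(k\); the helper name kk_gamma and the toy inputs are hypothetical and are not part of fussclust.

```r
# gamma_k = sum_j u_jk^2 d_jk^2 / sum_j u_jk^2
# Assumption: d_jk is the Euclidean distance between row j of X and row k of V.
kk_gamma <- function(X, U, V) {
  C <- nrow(V)
  # Squared distances: observations in rows, prototypes in columns
  D2 <- sapply(seq_len(C), function(k) rowSums(sweep(X, 2, V[k, ])^2))
  colSums(U^2 * D2) / colSums(U^2)
}

# Toy example: two observations, two prototypes, equal memberships
X_toy <- rbind(c(0, 0), c(2, 0))
V_toy <- rbind(c(0, 0), c(2, 0))
U_toy <- matrix(0.5, nrow = 2, ncol = 2)
gamma_toy <- kk_gamma(X_toy, U_toy, V_toy)
```

In the toy example each cluster has one observation at squared distance 0 and one at squared distance 4, both weighted by 0.25, so each \(\gamma_k\) equals 2. With an actual FCM fit, the memberships and prototypes would come from model$U and model$V.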
The user may also choose to specify the values of \(\Gamma\) manually, for example:
In addition to the final estimated memberships and prototypes
(accessible via model$U and model$V), users
may inspect intermediate results produced during the AO optimization
procedure.
The following histories can be stored: model$U_history, model$V_history, and model$Phi_history.
By default, these objects are set to NULL.
To store the optimization history, set
store_history = TRUE.
model_fcm_history <- fussclust::FCM(
  X = X,
  C = 3,
  store_history = TRUE
)
cat(
  paste0(
    "Class of `model_fcm_history$U_history`: ",
    class(model_fcm_history$U_history),
    "; length: ",
    length(model_fcm_history$U_history),
    "; number of AO iterations: ",
    model_fcm_history$counter
  )
)
#> Class of `model_fcm_history$U_history`: list; length: 32; number of AO iterations: 32
To retrieve a specific intermediate result, for example the estimated prototype matrix \(\tilde{V}^{(3)}\) obtained at the third AO iteration, index the corresponding history object directly:
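A minimal sketch of this indexing is given here, assuming that V_history is a list with one prototype matrix per AO iteration, as the class and length reported for U_history suggest. A mock list stands in for the fitted object so that the snippet runs without fussclust; its values are illustrative only.

```r
# Mock history: 5 iterations, each a 3 x 2 prototype matrix (illustrative values)
V_history_mock <- lapply(1:5, function(i) matrix(i, nrow = 3, ncol = 2))

# Prototype matrix stored at the third iteration
V_3 <- V_history_mock[[3]]

# With a fitted model, the analogous (assumed) call would be:
# model_fcm_history$V_history[[3]]
```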