1 Introduction

The fussclust package provides methods for distance-based fuzzy clustering, including both unsupervised and semi-supervised approaches.

The original motivation for developing the package was the lack of implementation of semi-supervised fuzzy clustering methods in R. However, both unsupervised and semi-supervised fuzzy clustering share a common optimization framework and estimation procedure. Consequently, fussclust implements a unified estimation framework for fuzzy clustering models: both unsupervised and semi-supervised.

The current release of fussclust includes four classical models:

Fuzzy c-Means (FCM),
Semi-Supervised Fuzzy c-Means (SSFCM),
Possibilistic c-Means (PCM),
Semi-Supervised Possibilistic c-Means (SSPCM).

A typical workflow consists of the following steps:

Prepare a numerical feature matrix X.
For semi-supervised models, prepare a binary supervision matrix superF (see Section 4.1).
Fit the selected clustering model.
Inspect the estimated memberships U and prototypes V.

2 Methodology

The fuzzy clustering methods implemented in fussclust are distance-based. Membership values for $N$ observations and $C$ clusters are estimated under the assumption that similarity can be represented by distances in a metric space: objects located closer to each other are considered more similar.

The optimization problem is formulated as

\[ \min Q(U, V; X) \]

where:

$Q$ denotes the objective function,
$U$ is the membership matrix,
$V$ contains the cluster prototypes,
$X$ is the feature matrix.

The optimization problem does not admit a closed-form solution. Therefore, an iterative approximation procedure known as the Alternating Optimization (AO) algorithm is used.

The AO procedure minimizes the corresponding partial objective functions, yielding the matrix-valued update functions $\hat{U}$ for estimating memberships and $\hat{V}$ for estimating cluster prototypes.

At each iteration, the algorithm alternates between computing the estimated membership matrix $\tilde{U}$ using the update function $\hat{U}$ and computing the estimated prototype matrix $\tilde{V}$ using the update function $\hat{V}$.

For additional methodological details, see Kmita et al. (2024).

The AO algorithm can be summarized as follows:

Initialize the algorithm with an initial membership matrix $U^{(0)}$.
Set the iteration counter $l = 1$.
Estimate prototypes $\tilde{V}^{(l)}$ using the formula for optimal prototypes $\hat{V}$ and current values of memberships.
Estimate memberships $\tilde{U}^{(l)}$ using the formula for optimal memberships $\hat{U}$ and current values of prototypes.
Check the convergence criterion. If the difference between $\tilde{U}^{(l)}$ and $\tilde{U}^{(l-1)}$ is smaller than a predefined threshold, stop the algorithm. Otherwise, increase $l$ by one and repeat Steps 3–5.

3 Unsupervised models

The following examples use the well-known iris dataset (R.A. Fisher, 1936). We consider two features: sepal length and sepal width.

3.1 Preparing the data

fussclust requires the input to be of class matrix.

X <- iris[, c("Sepal.Length", "Sepal.Width")] |> as.matrix()

cat(
  paste0(
    "Class of `iris`: ", class(iris),
    "; class of `X`: ",
    paste(class(X), collapse = " & ")
  )
)
#> Class of `iris`: data.frame; class of `X`: matrix & array

The iris dataset contains three species of iris flowers. Therefore, we set the number of clusters to C = 3. In practice, however, the user may specify any number of clusters such that $C > 1$.

model_fcm <- fussclust::FCM(X = X, C = 3)
model_pcm <- fussclust::PCM(X = X, C = 3)

The estimated cluster prototypes obtained from the FCM model are:

model_fcm$V
#>      Sepal.Length Sepal.Width
#> [1,]     6.814646    3.070350
#> [2,]     4.979951    3.355452
#> [3,]     5.830402    2.762639

The estimated memberships for the first ten observations obtained from the PCM model are:

model_pcm$U[1:10, ]
#>            [,1]      [,2]      [,3]
#>  [1,] 0.5153643 0.5153737 0.5153653
#>  [2,] 0.4969738 0.4969839 0.4969748
#>  [3,] 0.3983422 0.3983501 0.3983430
#>  [4,] 0.3671885 0.3671957 0.3671893
#>  [5,] 0.4485335 0.4485415 0.4485344
#>  [6,] 0.4674081 0.4674135 0.4674087
#>  [7,] 0.3451158 0.3451224 0.3451165
#>  [8,] 0.4966804 0.4966900 0.4966814
#>  [9,] 0.3058682 0.3058740 0.3058688
#> [10,] 0.4925528 0.4925630 0.4925539

4 Semi-supervised models

4.1 Encoding partial supervision

As described in Section 1, semi-supervised models in fussclust require partial supervision encoded as a binary matrix superF.

The package assumes that the number of clusters $C$ equals the number of distinct classes represented in the supervision information.

In many applications, class labels are stored in a response variable. This is also the case for the iris dataset:

iris[1:3, ]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa

The class labels are stored in the Species column:

unique(iris$Species)
#> [1] setosa     versicolor virginica 
#> Levels: setosa versicolor virginica

To construct the supervision matrix, the user must first define the arbitrary ordering of the classes. Let

\[ \mathcal{Y} = (y^1 = \texttt{setosa}, y^2 = \texttt{versicolor}, y^3 = \texttt{virginica}) \]

denote the ordered set of class labels.

The supervision matrix superF is defined as follows. Let $F = [f_{jk}]$ denote the binary supervision matrix, where:

$j = 1, \ldots, N$ indexes observations,
$k = 1, \ldots, C$ indexes clusters.

The matrix satisfies the following properties:

it has nrow(X) rows,
it has $C$ columns,
the $k$-th column corresponds to the $k$-th class in $\mathcal{Y}$,
$f_{jk} = 1$ if the $j$-th observation is assumed to belong to the $k$-th class, and $0$ otherwise.

The matrix can be constructed as follows:

superF <- matrix(0, nrow = nrow(iris), ncol = 3)

superF[iris$Species == "setosa", 1] <- 1
superF[iris$Species == "versicolor", 2] <- 1
superF[iris$Species == "virginica", 3] <- 1

4.2 Fitting the models

When fitting a semi-supervised model, the user must provide the scaling parameter $\alpha$ (alpha), which regulates the impact of partial supervision.

There is no universally optimal choice of $\alpha$, as the appropriate value depends on the dataset and application. For further discussion, see Kmita et al. (2024).

For this example, we set $\alpha = 1$.

Semi-Supervised Fuzzy c-Means:

model_ssfcm <- fussclust::SSFCM(
  X = X,
  C = 3,
  alpha = 1,
  superF = superF
)

Semi-Supervised Possibilistic c-Means:

model_sspcm <- fussclust::SSPCM(
  X = X,
  C = 3,
  alpha = 1,
  superF = superF
)

The estimated prototypes are:

model_ssfcm$V
#>      Sepal.Length Sepal.Width
#> [1,]     4.999393    3.402151
#> [2,]     5.898059    2.766335
#> [3,]     6.646567    3.004991

model_sspcm$V
#>      Sepal.Length Sepal.Width
#> [1,]     5.212768    3.274814
#> [2,]     5.964629    2.865122
#> [3,]     6.335417    2.957180

5 Additional details

5.1 The $\Gamma$ parameter in possibilistic models

The vector of cluster-specific hyperparameters

\[ \Gamma = (\gamma_1, \ldots, \gamma_C) \]

is specific to possibilistic clustering methods.

This is an optional argument in both fussclust::PCM() and fussclust::SSPCM(). If not provided, the default value is a unit vector:

\[ \Gamma = (1, \ldots, 1) \]

Alternatively, the user may apply the initialization strategy proposed by Krishnapuram and Keller (1993), which computes

\[ \gamma_k = \frac{ \sum_{j=1}^N u_{jk}^2 d_{jk}^2 }{ \sum_{j=1}^N u_{jk}^2 } \]

using values obtained from an initial FCM fit.

This strategy can be enabled by setting initFCM = TRUE.

model_pcm_init <- fussclust::PCM(
  X = X,
  C = 3,
  initFCM = TRUE
)

The user may also choose to specify the values of $\Gamma$ manually, for example:

model_pcm_custom_gammas <- fussclust::PCM(
  X = X,
  C = 3,
  gammas = c(1, 2, 3)
)

5.2 Retrieving intermediate results

In addition to the final estimated memberships and prototypes (accessible via model$U and model$V), users may inspect intermediate results produced during the AO optimization procedure.

The following histories can be stored:

membership matrices, accessible via model$U_history,
prototype matrices, accessible via model$V_history,
weight matrices, accessible via model$Phi_history.

By default, these objects are set to NULL.

model_fcm$U_history
#> NULL

To store the optimization history, set store_history = TRUE.

model_fcm_history <- fussclust::FCM(
  X = X,
  C = 3,
  store_history = TRUE
)

cat(
  paste0(
    "Class of `model_fcm_history$U_history`: ",
    class(model_fcm_history$U_history),
    "; length: ",
    length(model_fcm_history$U_history),
    "; number of AO iterations: ",
    model_fcm_history$counter
  )
)
#> Class of `model_fcm_history$U_history`: list; length: 32; number of AO iterations: 32

To retrieve a specific intermediate result, for example the prototype matrix $\hat{V}^{(3)}$ obtained at the third AO iteration, index the corresponding history object directly:

model_fcm_history$V_history[[3]]
#>      Sepal.Length Sepal.Width
#> [1,]     5.521657    3.103216
#> [2,]     5.967252    3.095491
#> [3,]     6.038396    2.977654

fuzzy-clustering