data_structure

`ClusterModel`

Data class which represent a model containing a list of microclusters and a list of labels which it was trained on

Attributes:

Name	Type	Description
`microclusters`	`list of MicroCluster`	List of MicroClusters representing the model
`timestamp`	`list of int`	List of labels on which the model was trained on

Source code in streamndr/utils/data_structure.py

class ClusterModel:
    """Data class which represent a model containing a list of microclusters and a list of labels which it was trained on

    Attributes
    ----------
    microclusters : list of MicroCluster
        List of MicroClusters representing the model
    timestamp : list of int
        List of labels on which the model was trained on
    """
    def __init__(self, microclusters, labels):
        self.microclusters = microclusters
        self.labels = labels

`ImpurityBasedCluster`

Bases: MicroCluster

Cluster which implements the concept of entropy and dissimilarity if samples of a same class albel are not in the same cluster [1].

[1] Masud, Mohammad M., et al. "A practical approach to classify evolving data streams: Training with limited amount of labeled data." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.

Parameters:

Name	Type	Description	Default
`label`	`int`	Label of the cluster	required
`centroid`	`ndarray`	Current centroid of the cluster	required

Attributes:

Name	Type	Description
`entropy`	`int`	Entropy of the cluster as defined in [1]
`number_of_labeled_samples`	`int`	Number of labeled samples currently in the cluster

Source code in streamndr/utils/data_structure.py

class ImpurityBasedCluster(MicroCluster):
    """Cluster which implements the concept of entropy and dissimilarity if samples of a same class albel are not in the same cluster [1].

    [1] Masud, Mohammad M., et al. "A practical approach to classify evolving data streams: Training with limited amount of labeled data." 
    2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.

    Parameters
    ----------
    label : int
        Label of the cluster
    centroid : numpy.ndarray
        Current centroid of the cluster

    Attributes
    ----------
    entropy : int
        Entropy of the cluster as defined in [1]
    number_of_labeled_samples : int
        Number of labeled samples currently in the cluster
    """
    def __init__(self,
                 label,
                 centroid):

        super().__init__(label)

        self.centroid = centroid
        self.samples_by_label = {}

        self.entropy = 0
        self.number_of_labeled_samples = 0

    def add_sample(self, sample, update_summary=False):
        """Add a sample to the cluster, the sample can be labeled or not. Expects -1 as the label for an unlabeled sample.

        Parameters
        ----------
        sample : ShortMemInstance
            Instance to add to the cluster
        update_summary : bool
            Whether or not to update the microcluster supplementary properties (mean distance & squared sum) with this new point
        """

        if sample.y_true not in self.samples_by_label:
            self.samples_by_label[sample.y_true] = 0


        self.samples_by_label[sample.y_true] += 1


        if sample.y_true != -1:
            self.number_of_labeled_samples += 1

        X = sample.point
        if self.instances is not None:
            self.instances.append(sample.point)
            self.linear_sum = np.sum([self.linear_sum, X], axis=0)
        else:
            self.instances = [sample.point]
            self.linear_sum = X

        self.n += 1

        if update_summary:
            self.mean_distance = (self.n * self.mean_distance + self.distance_to_centroid(X)) / (self.n)
            self.squared_sum = np.sum([self.squared_sum, np.square(X).sum()], axis=0)

    def remove_sample(self, sample, update_summary=False):
        """Remove a sample from the cluster, the sample can be labeled or not. Expects -1 as the label for an unlabeled sample.

        Parameters
        ----------
        sample : ShortMemInstance
            Instance to remove from the cluster
        update_summary : bool
            Whether or not to update the microcluster supplementary properties (mean distance & squared sum) with this new point
        """
        self.samples_by_label[sample.y_true] -= 1

        if sample.y_true != -1:
            self.number_of_labeled_samples -= 1

        self.instances.remove(sample.point)
        self.n -= 1
        X = sample.point
        self.linear_sum = np.sum([self.linear_sum, -1*X], axis=0)

        if update_summary:        
            self.mean_distance = (self.n * self.mean_distance - self.distance_to_centroid(X)) / (self.n)
            self.squared_sum = np.sum([self.squared_sum, -1*np.square(X).sum()], axis=0)

    def update_entropy(self):
        label_probabilities = [self.calculate_label_probability(label) for label in self.samples_by_label if label != -1]
        self.entropy = -sum(p * math.log(p) for p in label_probabilities if p > 0)

    def calculate_label_probability(self, label):
        return self.samples_by_label[label] / self.number_of_labeled_samples

    def dissimilarity_count(self, labeled_sample):
        if labeled_sample.y_true not in self.samples_by_label:
            return self.number_of_labeled_samples
        if labeled_sample.y_true == -1:
            return 0

        return self.number_of_labeled_samples - self.samples_by_label[labeled_sample.y_true]

`add_sample(sample, update_summary=False)`

Add a sample to the cluster, the sample can be labeled or not. Expects -1 as the label for an unlabeled sample.

Parameters:

Name	Type	Description	Default
`sample`	`ShortMemInstance`	Instance to add to the cluster	required
`update_summary`	`bool`	Whether or not to update the microcluster supplementary properties (mean distance & squared sum) with this new point	`False`

Source code in streamndr/utils/data_structure.py

def add_sample(self, sample, update_summary=False):
    """Add a sample to the cluster, the sample can be labeled or not. Expects -1 as the label for an unlabeled sample.

    Parameters
    ----------
    sample : ShortMemInstance
        Instance to add to the cluster
    update_summary : bool
        Whether or not to update the microcluster supplementary properties (mean distance & squared sum) with this new point
    """

    if sample.y_true not in self.samples_by_label:
        self.samples_by_label[sample.y_true] = 0


    self.samples_by_label[sample.y_true] += 1


    if sample.y_true != -1:
        self.number_of_labeled_samples += 1

    X = sample.point
    if self.instances is not None:
        self.instances.append(sample.point)
        self.linear_sum = np.sum([self.linear_sum, X], axis=0)
    else:
        self.instances = [sample.point]
        self.linear_sum = X

    self.n += 1

    if update_summary:
        self.mean_distance = (self.n * self.mean_distance + self.distance_to_centroid(X)) / (self.n)
        self.squared_sum = np.sum([self.squared_sum, np.square(X).sum()], axis=0)

`remove_sample(sample, update_summary=False)`

Remove a sample from the cluster, the sample can be labeled or not. Expects -1 as the label for an unlabeled sample.

Parameters:

Name	Type	Description	Default
`sample`	`ShortMemInstance`	Instance to remove from the cluster	required
`update_summary`	`bool`	Whether or not to update the microcluster supplementary properties (mean distance & squared sum) with this new point	`False`

Source code in streamndr/utils/data_structure.py

def remove_sample(self, sample, update_summary=False):
    """Remove a sample from the cluster, the sample can be labeled or not. Expects -1 as the label for an unlabeled sample.

    Parameters
    ----------
    sample : ShortMemInstance
        Instance to remove from the cluster
    update_summary : bool
        Whether or not to update the microcluster supplementary properties (mean distance & squared sum) with this new point
    """
    self.samples_by_label[sample.y_true] -= 1

    if sample.y_true != -1:
        self.number_of_labeled_samples -= 1

    self.instances.remove(sample.point)
    self.n -= 1
    X = sample.point
    self.linear_sum = np.sum([self.linear_sum, -1*X], axis=0)

    if update_summary:        
        self.mean_distance = (self.n * self.mean_distance - self.distance_to_centroid(X)) / (self.n)
        self.squared_sum = np.sum([self.squared_sum, -1*np.square(X).sum()], axis=0)

`MicroCluster`

A representation of a cluster with compressed information.

Parameters:

Name	Type	Description	Default
`label`	`int`	Label associated with this microcluster	required
`instances`	`ndarray`	Instances in this microcluster, preferably these would not be stored if not needed using keep_instances=False. Will be converted to Python list for append performance.	`None`
`timestamp`	`int`	Timestamp this microcluster was last updated, used for forgetting mechanisms	`0`
`keep_instances`	`bool`	Whether or not to store the instances within the microcluster. Should preferably set to false, but some implementations require access to the instances	`True`
`n_label_instance`	`int`	Number of instances of the same label associated with the microcluster	required

Attributes:

Name	Type	Description
`n`	`int`	Number of instances stored in this microcluster
`linear_sum`	`ndarray`	Linear sum of the points belonging to this microcluster
`squared_sum`	`ndarray`	Sum of the squared l2 norms of all samples belonging to this microcluster
`centroid`	`ndarray`	Centroid coordinates of the microcluster
`max_distance`	`ndarray`	Maximum distance between a point belonging to the microcluster and its centroid
`mean_distance`	`ndarray`	Mean distance of the distances between the cluster's points and its centroid

Source code in streamndr/utils/data_structure.py

class MicroCluster:
    """A representation of a cluster with compressed information.

    Parameters
    ----------
    label : int
        Label associated with this microcluster
    instances : numpy.ndarray
        Instances in this microcluster, preferably these would not be stored if not needed using keep_instances=False. Will be converted to Python list for append performance.
    timestamp : int
        Timestamp this microcluster was last updated, used for forgetting mechanisms  
    keep_instances : bool
        Whether or not to store the instances within the microcluster. Should preferably set to false, but some implementations require
        access to the instances
    n_label_instance : int
        Number of instances of the same label associated with the microcluster

    Attributes
    ----------
    n : int
        Number of instances stored in this microcluster
    linear_sum : numpy.ndarray
        Linear sum of the points belonging to this microcluster
    squared_sum : numpy.ndarray
        Sum of the squared l2 norms of all samples belonging to this microcluster
    centroid : numpy.ndarray
        Centroid coordinates of the microcluster
    max_distance : numpy.ndarray
        Maximum distance between a point belonging to the microcluster and its centroid
    mean_distance : numpy.ndarray
        Mean distance of the distances between the cluster's points and its centroid
    """

    def __init__(self,
                 label,  # the class the microcluster belongs to
                 instances=None,
                 timestamp=0, 
                 keep_instances=True, #Required True for MINAS
                 n_label_instances=None
                 ):

        # TODO: remove instances entirely so it doesn't need to be stored in memory; Might not be possible because of _best_threshold used by MINAS which needs instances
        super().__init__()
        self.label = label

        if instances is not None:
            if not isinstance(instances, list):
                instances_list = instances.tolist()
            else:
                instances_list = instances
                instances = np.array(instances)
            self.instances = instances_list
            self.n = len(instances)
            self.linear_sum = instances.sum(axis=0)

            # Sum of the squared l2 norms of all samples belonging to a microcluster:
            self.squared_sum = np.square(np.linalg.norm(self.instances, axis=1)).sum()
            # self.squared_sum = np.square(instances).sum(axis=0)  # From CluSTREAM paper
            self.centroid = self.linear_sum / self.n
            self.max_distance = np.max(self.distance_to_centroid(instances))
            self.mean_distance = np.mean(self.distance_to_centroid(instances))
            self.update_properties()

        else:
            self.instances = None
            self.n = 0
            self.linear_sum = 0
            self.squared_sum = 0
            self.max_distance = 0
            self.mean_distance = 0

        self.timestamp = timestamp
        self.n_label_instances = n_label_instances

        if not keep_instances:
            self.instances = None

    def __str__(self):
        """Returns string representation of a microcluster.

        Returns
        -------
        str
            String representation of microcluster
        """

        return f"""Target class {self.label}
                # of instances: {self.n}
                Linear sum: {self.linear_sum}
                Squared sum: {self.squared_sum}
                Centroid: {self.centroid}
                Radius: {self.radius}
                Timestamp of last change: {self.timestamp}"""

    def small_str(self):
        """Returns string representation of a microcluster.

        Returns
        -------
        str
            Small string representation of microcluster
        """

        return f"""Target class {self.label}
                # of instances: {self.n}
                Timestamp of last change: {self.timestamp}"""

    def get_radius(self):
        """Returns radius of the microcluster.

        Returns
        -------
        float
            Radius of the microcluster
        """

        factor = 1.5
        # from BIRCH Wikipedia
        diff = (self.squared_sum / self.n) - np.dot(self.centroid, self.centroid)
        if diff > 1e-15:
            return factor * np.sqrt(diff)
        else:  # in this case diff should be zero, but sometimes it's an infinitesimal difference
            return 0
        # from MINAS paper:
        #return factor*np.std(self.distance_to_centroid(self.instances))

    def distance_to_centroid(self, X):
        """Returns distance from X to centroid of this cluster.

        Parameters
        ----------
        X : numpy.ndarray or list
            Point or multiple points

        Returns
        -------
        numpy.ndarray
            Distance from X to the microcluster's centroid
        """
        if not isinstance(X, np.ndarray):
            X = np.array(X)

        if len(X.shape) == 1:  # X is only one point
            return np.linalg.norm(X - self.centroid)
        else:  # X contains several points
            return np.linalg.norm(X - self.centroid, axis=1)

    def encompasses(self, X):
        """Checks if point X is inside this microcluster. The point X is considered within the microcluster if the distance 
        between the point and the microcluster's centroid is less than the radius of the microcluster.

        Parameters
        ----------
        X : numpy.ndarray
            One point

        Returns
        -------
        bool
            If the point distance to centroid is contained within the microcluster or not
        """

        return np.less(self.distance_to_centroid(X), self.radius)

    def find_closest_cluster(self, clusters):
        """Finds closest microcluster to this one among passed microclusters.

        Parameters
        ----------
        clusters : list of MicroCluster

        Returns
        -------
        MicroCluster
            Closest microcluster
        """

        return min(clusters, key=lambda cl: cl.distance_to_centroid(self.centroid))

    def update_cluster(self, X, timestamp, update_summary):
        """Adds point received in parameter to the cluster and update cluster's centroid if wanted.

        Parameters
        ----------
        X : numpy.ndarray
            One point
        timestamp : int
            Timestamp when this point was added to this microcluster
        update_summary : bool
            Whether or not to update the microcluster properties with this new point
        """

        assert len(X.shape) == 1  # it's just one point
        self.timestamp = timestamp

        if self.instances is not None:
            self.instances.append(X)

        if update_summary:
            self.mean_distance = (self.n * self.mean_distance + self.distance_to_centroid(X)) / (self.n + 1)
            self.n += 1
            self.linear_sum = np.sum([self.linear_sum, X], axis=0)
            self.squared_sum = np.sum([self.squared_sum, np.square(X).sum()], axis=0)
            self.update_properties()

    def update_properties(self):
        """Updates centroid and radius based on current cluster properties."""
        self.centroid = self.linear_sum / self.n

        if self.instances is not None:
            self.radius = self.get_radius()
            if np.max(self.distance_to_centroid(self.instances)) > self.max_distance:
                self.max_distance = np.max(self.distance_to_centroid(self.instances))

    def is_cohesive(self, clusters):
        """Verifies if this cluster is cohesive for novelty detection purposes.
        A new micro-cluster is cohesive if its silhouette coefficient is larger than 0.
        'b' represents the Euclidean distance between the centroid of the new micro-cluster and the centroid of its
        closest micro-cluster, and 'a' represents the standard deviation of the distances between the examples of the
        new micro-cluster and the centroid of the new micro-cluster.

        Parameters
        ----------
        clusters : List of MicroCluster
            Existing known micro-clusters

        Returns
        -------
        bool
            If the cluster is cohesive (silhouette>0) or not
        """
        b = self.distance_to_centroid(self.find_closest_cluster(clusters).centroid)
        a = np.std(self.distance_to_centroid(self.instances))
        silhouette = (b - a) / max(a, b)  # hm, this is always positive if b > a
        return silhouette > 0

    def is_representative(self, min_examples):
        """Verifies if this cluster is representative for novelty detection purposes.
        A new micro-cluster is representative if it contains a minimal number of examples,
        where this number is a user-defined parameter.

        Parameters
        ----------
        min_examples : int
            The number of samples the microcluster needs to have to be considered representative.

        Returns
        -------
        bool
            If the cluster is representative or not
        """
        return self.n >= min_examples

`str()`

Returns string representation of a microcluster.

Returns:

Type	Description
`str`	String representation of microcluster

Source code in streamndr/utils/data_structure.py

def __str__(self):
    """Returns string representation of a microcluster.

    Returns
    -------
    str
        String representation of microcluster
    """

    return f"""Target class {self.label}
            # of instances: {self.n}
            Linear sum: {self.linear_sum}
            Squared sum: {self.squared_sum}
            Centroid: {self.centroid}
            Radius: {self.radius}
            Timestamp of last change: {self.timestamp}"""

`distance_to_centroid(X)`

Returns distance from X to centroid of this cluster.

Parameters:

Name	Type	Description	Default
`X`	`ndarray or list`	Point or multiple points	required

Returns:

Type	Description
`ndarray`	Distance from X to the microcluster's centroid

Source code in streamndr/utils/data_structure.py

def distance_to_centroid(self, X):
    """Returns distance from X to centroid of this cluster.

    Parameters
    ----------
    X : numpy.ndarray or list
        Point or multiple points

    Returns
    -------
    numpy.ndarray
        Distance from X to the microcluster's centroid
    """
    if not isinstance(X, np.ndarray):
        X = np.array(X)

    if len(X.shape) == 1:  # X is only one point
        return np.linalg.norm(X - self.centroid)
    else:  # X contains several points
        return np.linalg.norm(X - self.centroid, axis=1)

`encompasses(X)`

Checks if point X is inside this microcluster. The point X is considered within the microcluster if the distance between the point and the microcluster's centroid is less than the radius of the microcluster.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	One point	required

Returns:

Type	Description
`bool`	If the point distance to centroid is contained within the microcluster or not

Source code in streamndr/utils/data_structure.py

def encompasses(self, X):
    """Checks if point X is inside this microcluster. The point X is considered within the microcluster if the distance 
    between the point and the microcluster's centroid is less than the radius of the microcluster.

    Parameters
    ----------
    X : numpy.ndarray
        One point

    Returns
    -------
    bool
        If the point distance to centroid is contained within the microcluster or not
    """

    return np.less(self.distance_to_centroid(X), self.radius)

`find_closest_cluster(clusters)`

Finds closest microcluster to this one among passed microclusters.

Parameters:

Name	Type	Description	Default
`clusters`	`list of MicroCluster`		required

Returns:

Type	Description
`MicroCluster`	Closest microcluster

Source code in streamndr/utils/data_structure.py

def find_closest_cluster(self, clusters):
    """Finds closest microcluster to this one among passed microclusters.

    Parameters
    ----------
    clusters : list of MicroCluster

    Returns
    -------
    MicroCluster
        Closest microcluster
    """

    return min(clusters, key=lambda cl: cl.distance_to_centroid(self.centroid))

`get_radius()`

Returns radius of the microcluster.

Returns:

Type	Description
`float`	Radius of the microcluster

Source code in streamndr/utils/data_structure.py

def get_radius(self):
    """Returns radius of the microcluster.

    Returns
    -------
    float
        Radius of the microcluster
    """

    factor = 1.5
    # from BIRCH Wikipedia
    diff = (self.squared_sum / self.n) - np.dot(self.centroid, self.centroid)
    if diff > 1e-15:
        return factor * np.sqrt(diff)
    else:  # in this case diff should be zero, but sometimes it's an infinitesimal difference
        return 0

`is_cohesive(clusters)`

Verifies if this cluster is cohesive for novelty detection purposes. A new micro-cluster is cohesive if its silhouette coefficient is larger than 0. 'b' represents the Euclidean distance between the centroid of the new micro-cluster and the centroid of its closest micro-cluster, and 'a' represents the standard deviation of the distances between the examples of the new micro-cluster and the centroid of the new micro-cluster.

Parameters:

Name	Type	Description	Default
`clusters`	`List of MicroCluster`	Existing known micro-clusters	required

Returns:

Type	Description
`bool`	If the cluster is cohesive (silhouette>0) or not

Source code in streamndr/utils/data_structure.py

def is_cohesive(self, clusters):
    """Verifies if this cluster is cohesive for novelty detection purposes.
    A new micro-cluster is cohesive if its silhouette coefficient is larger than 0.
    'b' represents the Euclidean distance between the centroid of the new micro-cluster and the centroid of its
    closest micro-cluster, and 'a' represents the standard deviation of the distances between the examples of the
    new micro-cluster and the centroid of the new micro-cluster.

    Parameters
    ----------
    clusters : List of MicroCluster
        Existing known micro-clusters

    Returns
    -------
    bool
        If the cluster is cohesive (silhouette>0) or not
    """
    b = self.distance_to_centroid(self.find_closest_cluster(clusters).centroid)
    a = np.std(self.distance_to_centroid(self.instances))
    silhouette = (b - a) / max(a, b)  # hm, this is always positive if b > a
    return silhouette > 0

`is_representative(min_examples)`

Verifies if this cluster is representative for novelty detection purposes. A new micro-cluster is representative if it contains a minimal number of examples, where this number is a user-defined parameter.

Parameters:

Name	Type	Description	Default
`min_examples`	`int`	The number of samples the microcluster needs to have to be considered representative.	required

Returns:

Type	Description
`bool`	If the cluster is representative or not

Source code in streamndr/utils/data_structure.py

def is_representative(self, min_examples):
    """Verifies if this cluster is representative for novelty detection purposes.
    A new micro-cluster is representative if it contains a minimal number of examples,
    where this number is a user-defined parameter.

    Parameters
    ----------
    min_examples : int
        The number of samples the microcluster needs to have to be considered representative.

    Returns
    -------
    bool
        If the cluster is representative or not
    """
    return self.n >= min_examples

`small_str()`

Returns string representation of a microcluster.

Returns:

Type	Description
`str`	Small string representation of microcluster

Source code in streamndr/utils/data_structure.py

def small_str(self):
    """Returns string representation of a microcluster.

    Returns
    -------
    str
        Small string representation of microcluster
    """

    return f"""Target class {self.label}
            # of instances: {self.n}
            Timestamp of last change: {self.timestamp}"""

`update_cluster(X, timestamp, update_summary)`

Adds point received in parameter to the cluster and update cluster's centroid if wanted.

Parameters:

Name	Type	Description	Default
`X`	`ndarray`	One point	required
`timestamp`	`int`	Timestamp when this point was added to this microcluster	required
`update_summary`	`bool`	Whether or not to update the microcluster properties with this new point	required

Source code in streamndr/utils/data_structure.py

def update_cluster(self, X, timestamp, update_summary):
    """Adds point received in parameter to the cluster and update cluster's centroid if wanted.

    Parameters
    ----------
    X : numpy.ndarray
        One point
    timestamp : int
        Timestamp when this point was added to this microcluster
    update_summary : bool
        Whether or not to update the microcluster properties with this new point
    """

    assert len(X.shape) == 1  # it's just one point
    self.timestamp = timestamp

    if self.instances is not None:
        self.instances.append(X)

    if update_summary:
        self.mean_distance = (self.n * self.mean_distance + self.distance_to_centroid(X)) / (self.n + 1)
        self.n += 1
        self.linear_sum = np.sum([self.linear_sum, X], axis=0)
        self.squared_sum = np.sum([self.squared_sum, np.square(X).sum()], axis=0)
        self.update_properties()

`update_properties()`

Updates centroid and radius based on current cluster properties.

Source code in streamndr/utils/data_structure.py

def update_properties(self):
    """Updates centroid and radius based on current cluster properties."""
    self.centroid = self.linear_sum / self.n

    if self.instances is not None:
        self.radius = self.get_radius()
        if np.max(self.distance_to_centroid(self.instances)) > self.max_distance:
            self.max_distance = np.max(self.distance_to_centroid(self.instances))

`ShortMem`

Data structure for efficient addition and search of ShortMemInstances.

Attributes:

Name	Type	Description
`list`	`list of tuples (hash, [ShortMemInstance1, ShortMemInstance2, ...])`	List containing the instances and their corresponding hash compiled from their point
`dictionary`	`dictionary`	Contains for each hash its index in the list

Source code in streamndr/utils/data_structure.py

class ShortMem:
    """Data structure for efficient addition and search of ShortMemInstances.

    Attributes
    ----------
    list : list of tuples (hash, [ShortMemInstance1, ShortMemInstance2, ...])
        List containing the instances and their corresponding hash compiled from their point
    dictionary : dictionary
        Contains for each hash its index in the list
    """
    def __init__(self):
        self.list = []
        self.dictionary = {}

    def append(self, instance):
        """Adds an element to the data structure

        Parameters
        ----------
        instance : ShortMemInstance
            Element to add
        """
        h = hashlib.sha256(instance.point.tobytes()).hexdigest()

        # Check if the hash already exists
        if h in self.dictionary:
            # Add the instance to the existing list of instances for that hash
            self.list[self.dictionary[h]][1].append(instance)
        else:
            index = len(self.list)
            self.list.append((h, [instance]))
            self.dictionary[h] = index

    def remove(self, index):
        """Remove the element at the given index from the data structure.

        Parameters
        ----------
        index : int
            Index of the element to remove
        """
        if 0 <= index < len(self.list):
            instance_list = self.list[index][1]

            # If there's only one instance for the hash, remove the entire entry
            if len(instance_list) == 1:
                del self.dictionary[self.list[index][0]]
                self.list.pop(index)

                # Update indices for remaining entries
                for i in range(index, len(self.list)):
                    self.dictionary[self.list[i][0]] = i
            else:
                # If there are multiple instances, remove just one instance
                instance_list.pop(-1)

    def index(self, instance):
        """Get the index of the given element.

        Parameters
        ----------
        instance : np.ndarray or ShortMemInstance
            Element to find

        Returns
        -------
        int
            Index of the element, -1 if not found
        """
        if isinstance(instance, np.ndarray):
            return self.dictionary.get(hashlib.sha256(instance.tobytes()).hexdigest(), -1)
        elif isinstance(instance, ShortMemInstance):
            return self.dictionary.get(hashlib.sha256(instance.point.tobytes()).hexdigest(), -1)
        return -1

    def get_all_instances(self):
        """Returns all ShortMemInstances instances within the data structure

        Returns
        -------
        list of ShortMemInstance
            All ShortMemInstances instances within the data structure
        """
        return [instance for _, instances_list in self.list for instance in instances_list]

    def get_instance(self, index):
        """Return specific ShortMemInstance at given index

        Parameters
        ----------
        index : int
            Index

        Returns
        -------
        ShortMemInstance
            The instance at the given index, None if index not found
        """
        if 0 <= index < len(self.list):
            return self.list[index][1][0]

    def get_all_points(self):
        """Returns all points within the data structure

        Returns
        -------
        np.ndarray
            All points contained in the data structure
        """
        return np.array([instance.point for _, instances_list in self.list for instance in instances_list])

    def __len__(self):
        return sum(len(instances_list) for _, instances_list in self.list)

`append(instance)`

Adds an element to the data structure

Parameters:

Name	Type	Description	Default
`instance`	`ShortMemInstance`	Element to add	required

Source code in streamndr/utils/data_structure.py

def append(self, instance):
    """Adds an element to the data structure

    Parameters
    ----------
    instance : ShortMemInstance
        Element to add
    """
    h = hashlib.sha256(instance.point.tobytes()).hexdigest()

    # Check if the hash already exists
    if h in self.dictionary:
        # Add the instance to the existing list of instances for that hash
        self.list[self.dictionary[h]][1].append(instance)
    else:
        index = len(self.list)
        self.list.append((h, [instance]))
        self.dictionary[h] = index

`get_all_instances()`

Returns all ShortMemInstances instances within the data structure

Returns:

Type	Description
`list of ShortMemInstance`	All ShortMemInstances instances within the data structure

Source code in streamndr/utils/data_structure.py

def get_all_instances(self):
    """Returns all ShortMemInstances instances within the data structure

    Returns
    -------
    list of ShortMemInstance
        All ShortMemInstances instances within the data structure
    """
    return [instance for _, instances_list in self.list for instance in instances_list]

`get_all_points()`

Returns all points within the data structure

Returns:

Type	Description
`ndarray`	All points contained in the data structure

Source code in streamndr/utils/data_structure.py

def get_all_points(self):
    """Returns all points within the data structure

    Returns
    -------
    np.ndarray
        All points contained in the data structure
    """
    return np.array([instance.point for _, instances_list in self.list for instance in instances_list])

`get_instance(index)`

Return specific ShortMemInstance at given index

Parameters:

Name	Type	Description	Default
`index`	`int`	Index	required

Returns:

Type	Description
`ShortMemInstance`	The instance at the given index, None if index not found

Source code in streamndr/utils/data_structure.py

def get_instance(self, index):
    """Return specific ShortMemInstance at given index

    Parameters
    ----------
    index : int
        Index

    Returns
    -------
    ShortMemInstance
        The instance at the given index, None if index not found
    """
    if 0 <= index < len(self.list):
        return self.list[index][1][0]

`index(instance)`

Get the index of the given element.

Parameters:

Name	Type	Description	Default
`instance`	`ndarray or ShortMemInstance`	Element to find	required

Returns:

Type	Description
`int`	Index of the element, -1 if not found

Source code in streamndr/utils/data_structure.py

def index(self, instance):
    """Get the index of the given element.

    Parameters
    ----------
    instance : np.ndarray or ShortMemInstance
        Element to find

    Returns
    -------
    int
        Index of the element, -1 if not found
    """
    if isinstance(instance, np.ndarray):
        return self.dictionary.get(hashlib.sha256(instance.tobytes()).hexdigest(), -1)
    elif isinstance(instance, ShortMemInstance):
        return self.dictionary.get(hashlib.sha256(instance.point.tobytes()).hexdigest(), -1)
    return -1

`remove(index)`

Remove the element at the given index from the data structure.

Parameters:

Name	Type	Description	Default
`index`	`int`	Index of the element to remove	required

Source code in streamndr/utils/data_structure.py

def remove(self, index):
    """Remove the element at the given index from the data structure.

    Parameters
    ----------
    index : int
        Index of the element to remove
    """
    if 0 <= index < len(self.list):
        instance_list = self.list[index][1]

        # If there's only one instance for the hash, remove the entire entry
        if len(instance_list) == 1:
            del self.dictionary[self.list[index][0]]
            self.list.pop(index)

            # Update indices for remaining entries
            for i in range(index, len(self.list)):
                self.dictionary[self.list[i][0]] = i
        else:
            # If there are multiple instances, remove just one instance
            instance_list.pop(-1)

`ShortMemInstance`

Instance of a point associated with a timestamp. Used for the buffer memory which stores the unknown samples.

Attributes:

Name	Type	Description
`point`	`ndarray`	The coordinates of the point
`timestamp`	`int`	The timestamp the point was added/treated
`y_true`	`int`	The true value of the class

Source code in streamndr/utils/data_structure.py

class ShortMemInstance:
    """Instance of a point associated with a timestamp. Used for the buffer memory which stores the unknown samples.

    Attributes
    ----------
    point : numpy.ndarray
        The coordinates of the point
    timestamp : int
        The timestamp the point was added/treated
    y_true : int
        The true value of the class
    """
    def __init__(self, point, timestamp, y_true=None, y_pred=None):
        self.point = point
        self.timestamp = timestamp
        self.y_true = y_true
        self.y_pred = y_pred

    def __eq__(self, other):
        """Elements are equal if they have the same values for all variables.
        This currently does not consider the timestamp.

        Parameters
        ----------
        other : ShortMemInstance or numpy.ndarray
            Other instance to compared to

        Returns
        -------
        bool
            If the instances are equals or not
        """
        if isinstance(other, np.ndarray):
            return np.all(self.point == other)
        if isinstance(other, ShortMemInstance):
            return np.all(self.point == other.point)
        return NotImplemented

`eq(other)`

Elements are equal if they have the same values for all variables. This currently does not consider the timestamp.

Parameters:

Name	Type	Description	Default
`other`	`ShortMemInstance or ndarray`	Other instance to compared to	required

Returns:

Type	Description
`bool`	If the instances are equals or not

Source code in streamndr/utils/data_structure.py

def __eq__(self, other):
    """Elements are equal if they have the same values for all variables.
    This currently does not consider the timestamp.

    Parameters
    ----------
    other : ShortMemInstance or numpy.ndarray
        Other instance to compared to

    Returns
    -------
    bool
        If the instances are equals or not
    """
    if isinstance(other, np.ndarray):
        return np.all(self.point == other)
    if isinstance(other, ShortMemInstance):
        return np.all(self.point == other.point)
    return NotImplemented

data_structure

ClusterModel

ImpurityBasedCluster

add_sample(sample, update_summary=False)

remove_sample(sample, update_summary=False)

MicroCluster

__str__()

distance_to_centroid(X)

encompasses(X)

find_closest_cluster(clusters)

get_radius()

is_cohesive(clusters)

is_representative(min_examples)

small_str()

update_cluster(X, timestamp, update_summary)

update_properties()

ShortMem

append(instance)

get_all_instances()

get_all_points()

get_instance(index)

index(instance)

remove(index)

ShortMemInstance

__eq__(other)

`ClusterModel`

`ImpurityBasedCluster`

`add_sample(sample, update_summary=False)`

`remove_sample(sample, update_summary=False)`

`MicroCluster`

`str()`

`distance_to_centroid(X)`

`encompasses(X)`

`find_closest_cluster(clusters)`

`get_radius()`

`is_cohesive(clusters)`

`is_representative(min_examples)`

`small_str()`

`update_cluster(X, timestamp, update_summary)`

`update_properties()`

`ShortMem`

`append(instance)`

`get_all_instances()`

`get_all_points()`

`get_instance(index)`

`index(instance)`

`remove(index)`

`ShortMemInstance`

`eq(other)`