Instantiation

The Pyrite class takes a single argument, a pandas DataFrame, when it is initialized. The resulting object provides the methods needed to compute anomaly scores for either a single instance or the entire dataset.

  • __init__(self, dataframe)
    • Arguments:
      • dataframe: pandas.DataFrame
    • Return: Pyrite object on which anomalies are detected

Anomaly Scoring

Once the dataframe is loaded, Pyrite repeatedly draws random samples of rows from it. The number of samples drawn is controlled by sample_num, and the number of instances in each sample is controlled by sample_size. Pyrite can compute the anomaly score for every instance in the dataset (via score_dataset) or for a single instance (via score_instance).

  • score_dataset(sample_num, sample_size)

    • Arguments:
      • sample_num: int (number of random samples to draw)
      • sample_size: int (number of instances to include in each sample)
    • Return: numpy.ndarray (an array in which the i-th element is the anomaly score for the i-th instance in the dataset)
  • score_instance(single_instance, sample_num, sample_size)

    • Arguments:
      • single_instance: pandas.Series (a single instance for which the anomaly score is to be computed)
      • sample_num: int (number of random samples chosen without replacement from the dataset)
      • sample_size: int (number of instances in each random sample)
    • Return: float (anomaly score for the single instance)
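
The sampling scheme described above can be sketched in plain Python. This is a minimal illustration only, not Pyrite's implementation: the function name score_instance_sketch, the seed parameter, and the scoring rule (a smoothed inverse relative frequency averaged over samples and features) are all assumptions made for the example.

```python
import random

def score_instance_sketch(rows, instance, sample_num, sample_size, seed=0):
    """Illustrative stand-in for sample-based scoring; not Pyrite's code."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(sample_num):
        # Draw sample_size rows without replacement.
        sample = rng.sample(rows, sample_size)
        for j, value in enumerate(instance):
            matches = sum(1 for row in sample if row[j] == value)
            # Assumed scoring rule: smoothed inverse relative frequency,
            # so a value absent from the sample does not divide by zero.
            total += (sample_size + 1) / (matches + 1)
    # Average over all samples and all features.
    return total / (sample_num * len(instance))

# Toy categorical dataset: nine identical rows and one unusual row.
rows = [("a", "x")] * 9 + [("b", "y")]
common = score_instance_sketch(rows, ("a", "x"), sample_num=3, sample_size=10)
rare = score_instance_sketch(rows, ("b", "y"), sample_num=3, sample_size=10)
print(round(common, 3), round(rare, 3))  # 1.1 5.5
```

With sample_size equal to the dataset size, every sample is the full dataset, so the result is deterministic. Smaller samples trade accuracy for speed, and a larger sample_num reduces the variance of the estimate.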

Feature Importance

Pyrite provides a way to explore the anomalous instances in a dataset further. It reports the features and feature pairs that contribute most to an instance's anomaly score, giving insight into why the instance was classified as an anomaly.

  • get_feature_importance(single_instance)

    • Arguments:
      • single_instance: pandas.Series (a single instance for which feature importance is to be computed)
    • Return: dict (the single feature that contributes most to the anomaly score, together with the single most important pair of features)
  • instance_inspect(single_instance, plot)

    • Arguments:
      • single_instance: pandas.Series (a single instance to inspect)
      • plot: bool (whether to plot the freq_1d and freq_2d values)
    • Return:
      • freq_1d: ndarray containing the inverse relative frequency for each single feature.
      • freq_2d: d×d ndarray (d being the number of features), where the entry in the i-th column and j-th row is the anomaly score due to features i and j.
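
To make the two return values concrete, here is a small sketch of how per-feature and per-pair scores could be computed. It is an assumption-laden illustration, not Pyrite's code: the name instance_inspect_sketch is hypothetical, the pair score is assumed to be the same smoothed inverse relative frequency as the single-feature score, and plain lists stand in for ndarrays.

```python
def instance_inspect_sketch(rows, instance):
    """Illustrative per-feature and per-pair scores; not Pyrite's code."""
    n, d = len(rows), len(instance)

    def inv_freq(i, j):
        # Count rows that agree with the instance on features i and j.
        matches = sum(
            1 for r in rows if r[i] == instance[i] and r[j] == instance[j]
        )
        # Assumed score: smoothed inverse relative frequency.
        return (n + 1) / (matches + 1)

    freq_2d = [[inv_freq(i, j) for j in range(d)] for i in range(d)]
    # The diagonal (i == j) reduces to the single-feature score.
    freq_1d = [freq_2d[i][i] for i in range(d)]
    return freq_1d, freq_2d

rows = [("red", "small")] * 5 + [("blue", "large")] * 4 + [("red", "large")]
freq_1d, freq_2d = instance_inspect_sketch(rows, ("red", "large"))

d = len(freq_1d)
top_feature = max(range(d), key=lambda i: freq_1d[i])
top_pair = max(
    ((i, j) for i in range(d) for j in range(i + 1, d)),
    key=lambda p: freq_2d[p[0]][p[1]],
)
print(top_feature, top_pair, freq_2d[0][1])  # 1 (0, 1) 5.5
```

In this toy data, "red" and "large" are each common on their own, but the combination appears only once. That is the kind of case where a pair score is more informative than either single-feature score.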

Discretization of Numerical Features

Discretization converts numerical features into discrete categories before anomaly detection. To call the discretize function, a list of the column numbers to discretize must be specified. The data is assumed to contain no NaN values.

The discretize function calls auto_discretize on each of the specified columns individually. To apply a different discretization, call auto_discretize directly, one column at a time.

  • discretize(columns, method='blocks')

    • Arguments:
      • columns: list (columns of the self.df dataframe to discretize)
      • method: int or str (either 'blocks' (default), 'scott', or an integer number of bins)
    • Return: None (modifies the columns of the self.df dataframe in place)
  • auto_discretize(num_data, method, range_min_max)

    • Arguments:
      • num_data: numpy.ndarray (numerical feature data to be discretized)
      • method: int or str (either 'blocks', 'scott', or an integer number of bins)
      • range_min_max: tuple (range of num_data over which the chosen method is applied)
    • Return: pandas.Series cast as str (categorical labels after discretization)
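
As a rough illustration of the integer case for method, the sketch below bins one numerical column into a fixed number of equal-width bins and returns string labels. The function name equal_width_labels is hypothetical, equal-width binning is an assumption about what a specific bin number means, and the 'blocks' and 'scott' methods are not reproduced here.

```python
def equal_width_labels(values, bins, range_min_max=None):
    """Illustrative fixed-bin discretization; not Pyrite's auto_discretize."""
    if range_min_max is None:
        lo, hi = min(values), max(values)
    else:
        lo, hi = range_min_max
    if hi <= lo:
        # Degenerate range: everything falls into a single bin.
        return ["0" for _ in values]
    width = (hi - lo) / bins
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        # Clamp so the maximum (and out-of-range values) land in a valid bin.
        idx = min(max(idx, 0), bins - 1)
        # Labels are strings so the column is treated as categorical.
        labels.append(str(idx))
    return labels

print(equal_width_labels([0.0, 1.0, 2.5, 5.0, 9.9, 10.0], bins=5))
# ['0', '0', '1', '2', '4', '4']
```

Returning strings rather than integers mirrors the documented return type (labels cast as str), which keeps the discretized column from being mistaken for numerical data later.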