Details on Data Managers

Specification

class IManageData

Specification of data-manager interface. Implement this to provide a custom data-manager.

get_patterned_data(ci: CI_Identifier) CIT_DataPatterned

Get CIT-data with attached pattern-information.

See also

Details on patterns are provided at Patterns. Details on cache-IDs are given at Cache IDs.

Parameters:

ci (CI_Identifier) – The CI identified by its variable indices.

Returns:

The CIT-data with attached pattern-provider.

Return type:

CIT_DataPatterned

number_of_variables() int

Get the number of variables (as used e.g. by PCMCI) in the current data-set.

Returns:

Number of (contemporaneous) variables.

Return type:

int

total_sample_size() int

Get the total sample-size.

Returns:

sample-size

Return type:

int

reproject_blocks(value_per_block: ndarray, block_configuration: BlockView) ndarray

Project function-values given on blocks back to original data-layout for plotting.

Parameters:
  • value_per_block (np.ndarray) – function-values taken on blocks

  • block_configuration (BlockView) – the block-layout (e.g. block-size)

Returns:

the function-values taken in the original index-space.

Return type:

np.ndarray
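As a rough illustration of what reprojecting amounts to, the sketch below assumes contiguous blocks of equal size along the sample axis that together cover all samples, with a possibly shorter last block. The helper name `reproject_blocks_sketch` and its `block_size` and `total_sample_size` parameters are hypothetical stand-ins for the information a BlockView carries; this is not the library's implementation.

```python
import numpy as np


def reproject_blocks_sketch(value_per_block, block_size, total_sample_size):
    """Repeat each per-block value over the samples that block covers.

    Assumption: blocks are contiguous, equally sized, and cover all samples;
    only the last block may be shorter than block_size.
    """
    value_per_block = np.asarray(value_per_block)
    # One copy of each block value per sample position in that block.
    repeated = np.repeat(value_per_block, block_size, axis=0)
    # Trim the overhang in case the last block is shorter than block_size.
    return repeated[:total_sample_size]


values = np.array([0.1, 0.7, 0.3])  # one function-value per block
per_sample = reproject_blocks_sketch(values, block_size=4, total_sample_size=10)
```

The result has one entry per original sample, so it can be plotted against the original sample index.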

Cache IDs

For good runtime performance, it is often helpful to cache test-results at different stages. The frontend provides simple ways to inject cache-layers at different points of the framework, and the sample-configurations provided in the frontend make use of them.

As the input data to the framework can typically be assumed immutable (and usually is), results can be cached relative to test-indices. It is the responsibility of the data-manager (and of any custom pattern-provider) to provide unique cache-ids for queries: given two CIT_Data objects provided by the same data-manager, they may have the same cache-id only if they contain the same data. In practice, it is usually possible to use the test-index (plus the requested block-size for BlockView objects). The current built-in implementation additionally prefixes the test-index with the data-manager object’s memory address to prevent potential issues when multiple data-managers share the same cache-layer. If the cache is written to files or execution is parallelized across multiple processes, it may be reasonable to use a hash of the initial data (computed once at program initialization) instead of a memory address.

  • When implementing a custom data-manager (exposing IManageData), the implementation of IManageData.get_patterned_data() has to write a cache-id to the output that uniquely identifies the produced CIT_Data. This cache-id will typically be based on the data-manager object’s memory address (in Python, the object itself can be passed) or a data-hash, combined with the CI_Identifier argument.

  • When implementing a custom pattern (extending CIT_DataPatterned), the implementation of CIT_DataPatterned.view_blocks() has to write a cache-id to the output that uniquely identifies the produced BlockView. This cache-id will typically be based on self.cache_id and the requested (or actual) block-size.

  • When implementing a cacheable test, you can expose a method _extract_cache_id() returning a cache-id for a given query (this should not typically be necessary when deriving from ITestCI or IProvideIndependenceAtoms). It is called with the query-name fname (a string, the name of the cached method, e.g. ‘run_many’) as first argument and the run-time arguments of that method’s invocation as further arguments. See for example ITestCI or IProvideIndependenceAtoms, which provide fallbacks for CITs and full backends.

The cache-id has to be hashable and equality-comparable. Note that tuples of hashable and equality-comparable types are again hashable and equality-comparable. Furthermore, CI_Identifier[var_index] is hashable and equality-comparable if var_index is.
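The rules above can be sketched in plain Python. Everything named here is hypothetical (`initial_data_hash`, `make_data_cache_id`, `make_block_cache_id`), a plain tuple stands in for a CI_Identifier, and the built-in implementation may compose its IDs differently. The sketch uses a data hash as prefix, which is the variant suggested above for file-backed or multi-process caches.

```python
import hashlib

import numpy as np


def initial_data_hash(data):
    """Hash the (immutable) input data once, at program initialization."""
    data = np.ascontiguousarray(data)
    digest = hashlib.sha256()
    # Include shape and dtype so equal bytes with a different layout differ.
    digest.update(repr((data.shape, data.dtype.str)).encode())
    digest.update(data.tobytes())
    return digest.hexdigest()


def make_data_cache_id(data_hash, ci_key):
    """Cache-id for CIT-data: data-manager prefix plus the test-index."""
    return (data_hash, ci_key)


def make_block_cache_id(data_cache_id, block_size):
    """Cache-id for a block view: parent cache-id plus the block-size."""
    return (data_cache_id, block_size)


rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
prefix = initial_data_hash(data)

# Hypothetical test-index: "variable 0 vs. variable 1 given variable 2".
ci_key = (0, 1, (2,))
cid = make_data_cache_id(prefix, ci_key)
block_cid = make_block_cache_id(cid, block_size=10)

# Tuples of hashable parts are hashable, so they can serve as dict keys.
cache = {cid: "result on full data", block_cid: "result on blocks"}
```

Because the prefix depends only on the data content, two processes that load the same data produce the same cache-ids, which a memory address would not guarantee.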

Baseline Implementations

class DataManager_NumpyArray_IID

Bases: IManageData

Data-manager designed for use with IID data.

__init__(data_indexed_by_sampleidx_variableidx: ndarray, copy_data: bool = True, pattern=<class 'GLDF.data_management.CIT_DataPatterned_PersistentInTime'>, reproject_pattern_for_plotting=None)

get_patterned_data(ci: CI_Identifier[int]) CIT_DataPatterned

Implements functionality of interface IManageData.

Get CIT-data with attached pattern-information.

See also

Details on patterns are provided at Patterns. Details on cache-IDs are given at Cache IDs.

Parameters:

ci (CI_Identifier) – The CI identified by its variable indices.

Returns:

The CIT-data with attached pattern-provider.

Return type:

CIT_DataPatterned

number_of_variables() int

Implements functionality of interface IManageData.

Get the number of variables (as used e.g. by PCMCI) in the current data-set.

Returns:

Number of (contemporaneous) variables.

Return type:

int

total_sample_size() int

Implements functionality of interface IManageData.

Get the total sample-size.

Returns:

sample-size

Return type:

int

class DataManager_NumpyArray_Timeseries

Bases: IManageData

Data-manager designed for use with time-series data.

__init__(data_indexed_by_sampleidx_variableidx: ndarray, copy_data: bool = True, pattern=<class 'GLDF.data_management.CIT_DataPatterned_PersistentInTime'>, reproject_pattern_for_plotting=None)

get_patterned_data(ci: CI_Identifier_TimeSeries) CIT_DataPatterned

Implements functionality of interface IManageData.

Get CIT-data with attached pattern-information.

See also

Details on patterns are provided at Patterns. Details on cache-IDs are given at Cache IDs.

Parameters:

ci (CI_Identifier) – The CI identified by its variable indices.

Returns:

The CIT-data with attached pattern-provider.

Return type:

CIT_DataPatterned

number_of_variables() int

Implements functionality of interface IManageData.

Get the number of variables (as used e.g. by PCMCI) in the current data-set.

Returns:

Number of (contemporaneous) variables.

Return type:

int

total_sample_size() int

Implements functionality of interface IManageData.

Get the total sample-size.

Returns:

sample-size

Return type:

int