Details on Data Managers
Specification
- class IManageData
Specification of data-manager interface. Implement this to provide a custom data-manager.
- get_patterned_data(ci: CI_Identifier) CIT_DataPatterned
Get CIT-data with attached pattern-information.
- Parameters:
ci (CI_Identifier) – The CI identified by its variable indices.
- Returns:
The CIT-data with attached pattern-provider.
- Return type:
CIT_DataPatterned
- number_of_variables() int
Get the number of variables (as used e.g. by PCMCI) in the current data-set.
- Returns:
Number of (contemporaneous) variables.
- Return type:
int
- total_sample_size() int
Get the total sample-size.
- Returns:
sample-size
- Return type:
int
- reproject_blocks(value_per_block: ndarray, block_configuration: BlockView) ndarray
Project function-values given on blocks back to original data-layout for plotting.
- Parameters:
value_per_block (np.ndarray) – function-values taken on blocks
block_configuration (BlockView) – the block-layout (e.g. block-size)
- Returns:
the function-values taken in the original index-space.
- Return type:
np.ndarray
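A minimal sketch of the four interface methods, for orientation only. To keep it runnable on its own it does not import the library: StubCI and StubPatternedData are hypothetical stand-ins for CI_Identifier and CIT_DataPatterned, their field names are assumptions, and reproject_blocks takes a plain block-size where the real method takes a BlockView.

```python
from dataclasses import dataclass
from typing import Hashable, List, Tuple


@dataclass(frozen=True)
class StubCI:
    """Hypothetical stand-in for CI_Identifier (field names are assumed)."""
    idx_x: int
    idx_y: int
    idx_list_z: Tuple[int, ...] = ()


@dataclass
class StubPatternedData:
    """Hypothetical stand-in for CIT_DataPatterned."""
    x: List[float]
    y: List[float]
    z: List[List[float]]
    cache_id: Hashable


class ListBackedDataManager:
    """Duck-typed sketch of the IManageData methods on a list of rows."""

    def __init__(self, rows):
        # rows[sample_index][variable_index]; copied so the data stays immutable
        self._rows = [list(row) for row in rows]

    def number_of_variables(self) -> int:
        return len(self._rows[0]) if self._rows else 0

    def total_sample_size(self) -> int:
        return len(self._rows)

    def _column(self, j: int) -> List[float]:
        return [row[j] for row in self._rows]

    def get_patterned_data(self, ci: StubCI) -> StubPatternedData:
        return StubPatternedData(
            x=self._column(ci.idx_x),
            y=self._column(ci.idx_y),
            z=[self._column(j) for j in ci.idx_list_z],
            # cache-id: this object's identity plus the test-index
            cache_id=(id(self), ci),
        )

    def reproject_blocks(self, value_per_block, block_size: int):
        # Assumes contiguous blocks of equal size starting at sample 0.
        out = []
        for value in value_per_block:
            out.extend([value] * block_size)
        return out[: self.total_sample_size()]
```

A real implementation would derive from IManageData and return the library's own types; the sketch only shows which information each method is expected to provide.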
Cache IDs
Caching test-results at different stages often improves runtime performance considerably.
The frontend provides simple ways to inject cache-layers at different points of
the framework, and the sample-configurations provided in the frontend do so as well.
Because the input data to the framework can typically be assumed immutable (and typically is),
results can be cached relative to test-indices. It is the responsibility of the
data-manager (and of any custom pattern-provider) to provide unique cache-ids
for queries: two CIT_Data objects provided by the same
data-manager may share a cache-id only if they contain the same data.
In practice, the test-index (plus the requested block-size
for BlockView objects) usually suffices. The current built-in implementation additionally
prefixes the test-index with the data-manager object’s memory address to prevent
collisions when multiple data-managers share the same cache-layer.
If the cache is written to files or execution is parallelized across multiple
processes, it may be reasonable to use a hash of the initial data (computed once at program
initialization) instead of the memory address.
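The hash-based variant could look roughly like the sketch below. The helper names data_fingerprint and make_cache_id are hypothetical and not part of the library; the sketch assumes the raw data is available as bytes (for a numpy array, for example via its tobytes() method).

```python
import hashlib


def data_fingerprint(raw: bytes) -> str:
    """Content hash of the input data, meant to be computed once at start-up."""
    return hashlib.sha256(raw).hexdigest()


def make_cache_id(fingerprint: str, test_index, block_size=None):
    """Combine the data fingerprint with the test-index and, for block
    views, the requested block-size."""
    if block_size is None:
        return (fingerprint, test_index)
    return (fingerprint, test_index, block_size)
```

Unlike a memory address, such a fingerprint is the same in every process and across program runs, so cache entries written by one worker can be reused by another. If arrays of different shape or dtype can occur, those should be hashed along with the raw bytes.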
- When implementing a custom data-manager (exposing IManageData), the implementation of IManageData.get_patterned_data() has to write a cache-id to the output that uniquely identifies the produced CIT_Data. This cache-id will typically be based on the data-manager object’s memory address (in Python, the object itself can be passed) or a data-hash, together with the CI_Identifier argument.
- When implementing a custom pattern (extending CIT_DataPatterned), the implementation of CIT_DataPatterned.view_blocks() has to write a cache-id to the output that uniquely identifies the produced BlockView. This cache-id will typically be based on self.cache_id and the requested (or actual) block-size.
- When implementing a cacheable test, you can expose a method _extract_cache_id() returning a cache-id for a given query (this should not typically be necessary if deriving from ITestCI or IProvideIndependenceAtoms). It is called with the query-name fname (a string, the name of the cached method, e.g. ‘run_many’) as first argument and the run-time arguments of that method’s invocation as further arguments. See for example ITestCI or IProvideIndependenceAtoms, which provide fallbacks for CITs and full backends.
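The calling convention for _extract_cache_id() can be illustrated with a toy cache-layer. Both classes below are hypothetical and only mimic the convention described above (query-name first, then the run-time arguments); they are not the library's cache implementation, and run_single together with the dictionary-shaped data is an assumption made for the example.

```python
class CountingTest:
    """Hypothetical test object that counts how often it is actually run."""

    def __init__(self):
        self.calls = 0

    def run_single(self, data):
        self.calls += 1
        return sum(data["values"]) > 0

    def _extract_cache_id(self, fname, data):
        # Query-name plus the cache-id attached to the data by its provider.
        return (fname, data["cache_id"])


class SimpleCache:
    """Hypothetical cache-layer keyed by the wrapped object's cache-ids."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self._store = {}

    def call(self, fname, *args):
        key = self._wrapped._extract_cache_id(fname, *args)
        if key not in self._store:
            # Cache miss: forward to the wrapped method and remember the result.
            self._store[key] = getattr(self._wrapped, fname)(*args)
        return self._store[key]
```

Including fname in the key keeps results of different methods apart even when they are invoked with the same data.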
The cache-id has to be hashable and equality-comparable. Note that tuples of
hashable and equality-comparable types are again hashable and equality-comparable.
Furthermore, CI_Identifier[var_index] is hashable and equality-comparable
if var_index is.
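Lists and sets are not hashable in Python, so a cache-id assembled from them has to be converted first. The helper below, as_cache_key, is a hypothetical convenience and not part of the library; it shows one way to build a tuple-based key from mixed parts.

```python
def as_cache_key(*parts):
    """Build a hashable key: lists become tuples, sets become frozensets."""

    def freeze(part):
        if isinstance(part, (list, tuple)):
            return tuple(freeze(p) for p in part)
        if isinstance(part, (set, frozenset)):
            return frozenset(freeze(p) for p in part)
        hash(part)  # raises TypeError early for unhashable parts
        return part

    return tuple(freeze(part) for part in parts)
```

Two keys built from equal parts compare equal and hash equally, so either can be used to look up a result stored under the other.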
Baseline Implementations
- class DataManager_NumpyArray_IID
Bases:
IManageData
Data-manager designed for use with IID data.
- __init__(data_indexed_by_sampleidx_variableidx: ndarray, copy_data: bool = True, pattern=<class 'GLDF.data_management.CIT_DataPatterned_PersistentInTime'>, reproject_pattern_for_plotting=None)
- get_patterned_data(ci: CI_Identifier[int]) CIT_DataPatterned
Implements functionality of interface
IManageData.
Get CIT-data with attached pattern-information.
- Parameters:
ci (CI_Identifier) – The CI identified by its variable indices.
- Returns:
The CIT-data with attached pattern-provider.
- Return type:
CIT_DataPatterned
- number_of_variables() int
Implements functionality of interface
IManageData.
Get the number of variables (as used e.g. by PCMCI) in the current data-set.
- Returns:
Number of (contemporaneous) variables.
- Return type:
int
- total_sample_size() int
Implements functionality of interface
IManageData.
Get the total sample-size.
- Returns:
sample-size
- Return type:
int
- class DataManager_NumpyArray_Timeseries
Bases:
IManageData
Data-manager designed for use with time-series data.
- __init__(data_indexed_by_sampleidx_variableidx: ndarray, copy_data: bool = True, pattern=<class 'GLDF.data_management.CIT_DataPatterned_PersistentInTime'>, reproject_pattern_for_plotting=None)
- get_patterned_data(ci: CI_Identifier_TimeSeries) CIT_DataPatterned
Implements functionality of interface
IManageData.
Get CIT-data with attached pattern-information.
- Parameters:
ci (CI_Identifier) – The CI identified by its variable indices.
- Returns:
The CIT-data with attached pattern-provider.
- Return type:
CIT_DataPatterned
- number_of_variables() int
Implements functionality of interface
IManageData.
Get the number of variables (as used e.g. by PCMCI) in the current data-set.
- Returns:
Number of (contemporaneous) variables.
- Return type:
int
- total_sample_size() int
Implements functionality of interface
IManageData.
Get the total sample-size.
- Returns:
sample-size
- Return type:
int