Data sources
The data source components are responsible for the actual reading of data (the “how”). The design uses an abstract interface, IDataSource, to define a standard contract for any data source, making it easy to swap and compose implementations.
pems_data.sources.IDataSource
Bases: ABC
An abstract interface for a generic data source.
Source code in pems_data/sources/__init__.py
read(identifier, **kwargs)
abstractmethod
Reads data identified by a generic identifier from the source.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
identifier | str | The unique identifier for the data, e.g., an S3 key, a database table name, etc. | required |
**kwargs | dict[str, Any] | Additional arguments for the underlying read operation, such as ‘columns’ or ‘filters’. | {} |
Returns:
Name | Type | Description |
---|---|---|
value | DataFrame | A DataFrame of data from the source for the given identifier. |
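The contract above can be illustrated with a minimal sketch. The names `DataSource` and `InMemorySource` are hypothetical stand-ins rather than classes from `pems_data`, and rows are plain lists of dicts instead of a pandas DataFrame so the example has no dependencies.

```python
from abc import ABC, abstractmethod
from typing import Any


class DataSource(ABC):
    """Hypothetical stand-in for IDataSource; not the pems_data class."""

    @abstractmethod
    def read(self, identifier: str, **kwargs: Any) -> list[dict]:
        """Return the rows stored under ``identifier``."""


class InMemorySource(DataSource):
    """Toy implementation backed by a dict, for illustration only."""

    def __init__(self, tables: dict[str, list[dict]]):
        self._tables = tables

    def read(self, identifier: str, **kwargs: Any) -> list[dict]:
        rows = self._tables[identifier]
        columns = kwargs.get("columns")
        if columns is None:
            return [dict(row) for row in rows]
        # Keep only the requested columns, mirroring a 'columns' kwarg.
        return [{name: row[name] for name in columns} for row in rows]


source = InMemorySource({"stations": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]})
ids_only = source.read("stations", columns=["id"])
```

Because callers depend only on `read(identifier, **kwargs)`, any implementation of the interface can be swapped in without changing the calling code.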
pems_data.sources.s3.S3DataSource
Bases: IDataSource
A data source for fetching data from an S3 bucket.
Source code in pems_data/sources/s3.py
default_bucket
property
Returns:
Name | Type | Description |
---|---|---|
value | str | The value from the |
name
property
Returns:
Name | Type | Description |
---|---|---|
value | str | The name of this bucket instance. |
__init__(name=None)
Initialize a new S3DataSource.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | (Optional) The name of the S3 bucket to source from. | None |
get_prefixes(filter_pattern=re.compile('.+'), initial_prefix='', match_func=None)
Lists available object prefixes, optionally filtered by an initial prefix.
When a prefix matches filter_pattern, the result of match_func is added to the output list if match_func is provided; otherwise the entire match is added.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filter_pattern | Pattern | A regular expression used to match object prefixes | re.compile('.+') |
initial_prefix | str | The initial prefix to start the search from | '' |
match_func | Callable[[Match], str] | A callable used to extract data from prefix matches | None |
Returns:
Name | Type | Description |
---|---|---|
value | list | A sorted list of unique prefixes that matched the pattern. |
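The matching behavior described above can be sketched without touching S3. The helper `collect_prefixes` and the sample prefixes are hypothetical, and the use of `Pattern.search` (rather than `match` or `fullmatch`) is an assumption, not something the documentation states.

```python
import re
from typing import Callable, Iterable, Optional


def collect_prefixes(
    prefixes: Iterable[str],
    filter_pattern: re.Pattern = re.compile(".+"),
    match_func: Optional[Callable[[re.Match], str]] = None,
) -> list:
    """Hypothetical helper: filter prefixes and return sorted unique results."""
    found = set()
    for prefix in prefixes:
        match = filter_pattern.search(prefix)
        if match is None:
            continue
        # Use match_func's result when given, otherwise the entire match.
        found.add(match_func(match) if match_func else match.group(0))
    return sorted(found)


# Made-up prefixes, standing in for what a bucket listing might return.
listing = ["data/year=2024/", "data/year=2023/", "data/year=2024/", "logs/"]
pattern = re.compile(r"year=(\d{4})")
years = collect_prefixes(listing, pattern, lambda m: m.group(1))
whole = collect_prefixes(listing, pattern)
```

Passing a `match_func` that returns a capture group is one way to turn raw prefixes into a clean list of values, such as the years in this sample.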
read(*args, path=None, columns=None, filters=None, **kwargs)
Reads data from the S3 path into a pandas DataFrame. Extra kwargs are passed along to pandas.read_parquet().
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*args | tuple[str] | One or more relative path components for the data file | () |
path | str | The absolute S3 URL path to a data file; using | None |
columns | list[str] | If not None, only these columns will be read from the file | None |
filters | list[tuple] \| list[list[tuple]] | To filter out data. Filter syntax: | None |
**kwargs | dict[str, Any] | Extra kwargs to pass to | {} |
Returns:
Name | Type | Description |
---|---|---|
value | DataFrame | A DataFrame of data read from the source path. |
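The two accepted shapes for filters, list[tuple] and list[list[tuple]], can be illustrated with a small pure-Python sketch. It assumes the filters follow the convention documented for pandas.read_parquet with the pyarrow engine: tuples of (column, operator, value), where a flat list is combined with AND and an outer list of lists is combined with OR. The helper `row_matches` and the sample rows are hypothetical, and the real filtering happens inside the parquet reader, not in code like this.

```python
import operator
from typing import Any

# Comparison operators supported by this sketch (a subset, by assumption).
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def row_matches(row: dict[str, Any], filters: list) -> bool:
    """Hypothetical helper: apply list[tuple] or list[list[tuple]] filters to one row."""
    if not filters:
        return True
    # A flat list of tuples is a single AND group; wrap it for uniform handling.
    groups = [filters] if isinstance(filters[0], tuple) else filters
    return any(
        all(OPS[op](row[column], value) for column, op, value in group)
        for group in groups
    )


rows = [
    {"station": 1, "speed": 60},
    {"station": 2, "speed": 30},
    {"station": 3, "speed": 45},
]
and_filter = [("station", ">", 1), ("speed", ">=", 40)]
or_filter = [[("station", "==", 1)], [("speed", "<", 40)]]

and_result = [r["station"] for r in rows if row_matches(r, and_filter)]
or_result = [r["station"] for r in rows if row_matches(r, or_filter)]
```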
url(*args)
Build an absolute S3 URL to this bucket, with optional path segments.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
*args | tuple[str] | The components of the S3 path. | () |
Returns:
Name | Type | Description |
---|---|---|
value | str | An absolute |
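As a rough sketch of how a bucket name and optional path segments might combine into one URL, consider the helper below. The function `build_s3_url`, the `s3://` scheme, the bucket name, and the slash handling are all assumptions for illustration; the actual method may normalize segments differently.

```python
def build_s3_url(bucket: str, *parts: str) -> str:
    """Hypothetical helper: join a bucket name and path segments into one URL."""
    # Strip stray slashes so segments join with exactly one "/" between them.
    cleaned = [part.strip("/") for part in parts if part.strip("/")]
    return "/".join([f"s3://{bucket}", *cleaned])


bucket_url = build_s3_url("example-bucket")
file_url = build_s3_url("example-bucket", "stations", "/2024/", "data.parquet")
```

With no segments the result is the bucket root, which matches the description of path segments as optional.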
pems_data.sources.cache.CachingDataSource
Bases: IDataSource
A data source decorator that adds a caching layer to another data source.
Source code in pems_data/sources/cache.py
cache
property
Returns:
Name | Type | Description |
---|---|---|
value | Cache | This data source’s underlying Cache instance. |
data_source
property
Returns:
Name | Type | Description |
---|---|---|
value | IDataSource | This data source’s underlying data source instance. |
__init__(data_source, cache)
Initialize a new CachingDataSource.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_source | IDataSource | The underlying data source to use for cache misses | required |
cache | Cache | The underlying cache to use for get/set operations | required |
read(identifier, cache_opts={}, **kwargs)
Reads data identified by a generic identifier from the source. Tries the cache first; on a miss, reads from the underlying data source and stores the result in the cache.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
identifier | str | The unique identifier for the data, e.g., an S3 key, a database table name, etc. | required |
cache_opts | dict[str, Any] | A dictionary of options for configuring caching of the data | {} |
**kwargs | dict[str, Any] | Additional arguments for the underlying read operation, such as ‘columns’ or ‘filters’ | {} |
Returns:
Name | Type | Description |
---|---|---|
value | DataFrame | A DataFrame of data read from the cache (or the source), for the given identifier. |
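The cache-first read can be sketched as a small wrapper. `CachingReader` and `slow_read` are hypothetical names, a plain dict stands in for the Cache type, and the sketch keys the cache on the identifier alone while leaving out cache_opts; the real class may key and configure its cache differently.

```python
from typing import Any, Callable


class CachingReader:
    """Hypothetical cache-first wrapper around any read(identifier, **kwargs) callable."""

    def __init__(self, read_func: Callable[..., Any], cache: dict):
        self._read = read_func
        self._cache = cache

    def read(self, identifier: str, **kwargs: Any) -> Any:
        if identifier in self._cache:
            # Cache hit: the wrapped source is not consulted.
            return self._cache[identifier]
        # Cache miss: read from the source, then store the result.
        value = self._read(identifier, **kwargs)
        self._cache[identifier] = value
        return value


calls = []


def slow_read(identifier: str, **kwargs: Any) -> list:
    """Stand-in source that records each call it receives."""
    calls.append(identifier)
    return [identifier.upper()]


reader = CachingReader(slow_read, cache={})
first = reader.read("stations")
second = reader.read("stations")
```

After both reads, `calls` holds a single entry, since the second read is served from the cache and never reaches the wrapped source.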