-
Notifications
You must be signed in to change notification settings - Fork 357
rfc: Making Storage a Trait #1885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
| // Deletion operations | ||
| async fn delete(&self, path: &str) -> Result<()>; | ||
| async fn remove_dir_all(&self, path: &str) -> Result<()>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Migrating comments from @Fokko
This name feels very much file-system like, while Iceberg is designed to work against object stores. How about delete_prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| #[async_trait] | ||
| pub trait Storage: Debug + Send + Sync { | ||
| // File existence and metadata | ||
| async fn exists(&self, path: &str) -> Result<bool>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Migrating comments from @c-thiel
I know that we use these results everywhere, but I think introducing more specific error types that we can match on for storage operations makes sense. They can implement Into of course.
For example, a RateLimited error that we got from the storage service should be treated differently from NotFound or CredentialsExpired.
With Lakekeeper we are currently using our own trait based IO due to many limitations in iceberg-rust, mainly due to unsupported signing mechanisms, missing refresh mechanisms, intransparent errors and missing extendability.
I would gladly switch to iceberg-rust if we get these solved.
Maybe this can serve as some inspiration: https://github.com/lakekeeper/lakekeeper/blob/b8fcf54c627d48a547ef0baf6863949b68579388/crates/io/src/error.rs#L291
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address @c-thiel 's comments, we have several approaches:
- Introduce another set of errors for storage.
- Extend current
ErrorKindfor storage errors. - Extend current
ErrorKind, but with another enum, for example
pub enum IoErrorKind {
FileNotFound,
CredentialExpired,
}
pub enum ErrorKind {
// Existing variants
...
Io(IoErrorKind)
}| // File object creation | ||
| fn new_input(&self, path: &str) -> Result<InputFile>; | ||
| fn new_output(&self, path: &str) -> Result<OutputFile>; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Migrating comments from @c-thiel
Many object stores have a good way of running batch deletions, for example the DeleteObjects API in AWS S3. How would you feel about including a delete_batch method too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can provides a function like delete_iter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I think this should not be included in the initial cut tho. We can add the new function when stabilizing the Storage trait. I'll update the implementation plan to mention it
liurenjie1024
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @CTTY for this pr, generally LGTM! One missing point is, I want the StorageBuilderRegistry to have some built in StorageBuilder registered when user creating a new catalog instance. I currenlty don't have a good solution, one approach would be to have a standalone crate, which loads built in StorageBuilders when StorageBuilderRegistry is initiated. And then we could have catalog crates to depend on it.
| #[async_trait] | ||
| pub trait Storage: Debug + Send + Sync { | ||
| // File existence and metadata | ||
| async fn exists(&self, path: &str) -> Result<bool>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address @c-thiel 's comments, we have several approaches:
- Introduce another set of errors for storage.
- Extend current
ErrorKindfor storage errors. - Extend current
ErrorKind, but with another enum, for example
pub enum IoErrorKind {
FileNotFound,
CredentialExpired,
}
pub enum ErrorKind {
// Existing variants
...
Io(IoErrorKind)
}|
|
||
| // Deletion operations | ||
| async fn delete(&self, path: &str) -> Result<()>; | ||
| async fn remove_dir_all(&self, path: &str) -> Result<()>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
| The `StorageBuilder` trait defines how storage backends are constructed: | ||
|
|
||
| ```rust | ||
| pub trait StorageBuilder: Debug + Send + Sync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pub trait StorageBuilder: Debug + Send + Sync { | |
| pub trait StorageFactory: Debug + Send + Sync { |
nit: Factory sounds a litte better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should also be Serializable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I haven't figured out how to have both Serializable and dyn Trait yet, and I'm planning to explore more with typetag. Serializability is a huge pain and I'll add more details once I have more clarity
| pub fn new() -> Self { /* ... */ } | ||
| pub fn register(&mut self, scheme: impl Into<String>, builder: Arc<dyn StorageBuilder>); | ||
| pub fn get_builder(&self, scheme: &str) -> Result<Arc<dyn StorageBuilder>>; | ||
| pub fn supported_types(&self) -> Vec<String>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pub fn supported_types(&self) -> Vec<String>; | |
| pub fn supported_types(&self) -> impl Iterator<Item=&str>>; |
| use iceberg::io::FileIOBuilder; | ||
|
|
||
| // Basic usage (same as the existing code) | ||
| let file_io = FileIOBuilder::new("s3") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We no longer need FileIOBuilder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that FileIOBuilder seems excessive. I'm keeping FileIOBuilder here for now mainly because I'm uncertain where to keep Extensions right now mainly due to serde-related concerns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After StorageBuilder becomes serializable, we should be able to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've explored a bit more on the serialization issue, and I realized that the existing FileIOBuilder is not serializable due to pub struct Extensions(HashMap<TypeId, Arc<dyn Any + Send + Sync>>);. And due to the unserializable nature of Any, it won't make sense to move Extensions to either Storage or StorageBuilder
We should be able to remove Extensions and the FileIOBuilder once the Storage trait is ready to use, so users can register a storage with credential loader directly. But I think it's better to address this problem later when we are stabilizing the Storage trait
| ↓ | ||
| FileIOBuilder::build() | ||
| ↓ | ||
| StorageBuilderRegistry::new() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I think we should remove
FileIOBuildersince then? - The
StorageBuilderRegisteryshould be an instance in catalog instance?
Co-authored-by: Renjie Liu <liurenjie2008@gmail.com>
This reverts commit e7d67d8.
|
|
||
| ```rust | ||
| #[derive(Debug, Clone)] | ||
| pub struct StorageBuilderRegistry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking maybe we also need a trait for StorageBuilderRegistry, and this trait lives in iceberg crate.
pub trait StorageBuilderRegistry {
}The motivation is that we may also need a default concrete registry implementation, which by default loads builtin builders. And the default registry implementaion should have a standalone crate, and it needs to depend on different storages.
Xuanwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this and really happy that RFC methods been adopted.
| ┌───────────────┼───────────────┬───────────────┐ | ||
| │ │ │ │ | ||
| ┌───────┴────────────┐ │ ┌────────────┴──────┐ ┌────┴──────────┐ | ||
| │ crates/storage/ │ │ │ crates/storage/ │ │ Third-Party │ | ||
| │ opendal/ │ │ │ object_store/ │ │ Crates │ | ||
| │ ┌────────────────┐ │ │ │ ┌───────────────┐ │ │ ┌───────────┐ │ | ||
| │ │ opendal-s3 │ │ │ │ │ objstore-s3 │ │ │ │ custom │ │ | ||
| │ │ impl Storage │ │ │ │ │ impl Storage │ │ │ │ storage │ │ | ||
| │ │ impl Builder │ │ │ │ │ impl Builder │ │ │ │impl traits│ │ | ||
| │ └────────────────┘ │ │ │ └───────────────┘ │ │ └───────────┘ │ | ||
| │ ┌────────────────┐ │ │ │ ┌───────────────┐ │ └───────────────┘ | ||
| │ │ opendal-fs │ │ │ │ │ objstore-gcs │ │ | ||
| │ │ impl Storage │ │ │ │ │ impl Storage │ │ | ||
| │ │ impl Builder │ │ │ │ │ impl Builder │ │ | ||
| │ └────────────────┘ │ │ │ └───────────────┘ │ | ||
| │ ┌────────────────┐ │ │ │ ┌───────────────┐ │ | ||
| │ │ opendal-gcs │ │ │ │ │ objstore-azure│ │ | ||
| │ │ impl Storage │ │ │ │ │ impl Storage │ │ | ||
| │ │ impl Builder │ │ │ │ │ impl Builder │ │ | ||
| │ └────────────────┘ │ │ │ └───────────────┘ │ | ||
| │ ... (oss, azure, │ │ └───────────────────┘ | ||
| │ memory) │ │ | ||
| └────────────────────┘ │ | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My current feeling is that we don't need to add impls for every service, instead we add impls for every provider. One provider can provide multiple impls, and that can simplify a lot of our work.
For example, opendal can provide S3, GCS, OSS, Hf, and more, while object_store supports S3 and GCS. We can also have a fs implementation that only provides FS support. And maybe in the future, we can have hdfs which can provide hdfs-s3, hdfs-gcs and more.
This design matches the actual status better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this thread on the old google design doc is related: https://docs.google.com/document/d/1-CEvRvb52vPTDLnzwJRBx5KLpej7oSlTu_rg0qKEGZ8/edit?disco=AAABrRO9Prk
This design decision definitely deserves another round of discussion. Whether we use a single, general Storage type (e.g., OpenDALStorage) or multiple scheme-specific types (OpenDALS3Storage, OpenDALOSSStorage, etc.) depends on how we define Storage. Is it essentially an Iceberg wrapper around something like object_store::ObjectStore, or is it intended to serve as a higher-level Iceberg abstraction that operates across different storage backends?
If it’s more of a thin wrapper, then having multiple scheme-specific types makes more sense. If it’s meant to be a higher-level abstraction, a unified multi-scheme implementation feels more appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also have a fs implementation that only provides FS support
I'm feeling that having a "default" Storage will be a must, just like we are having MemoryCatalog in the iceberg crate, we will need to at least allow users to do basic fs operations in places like iceberg crate where concrete Storage implementations like ObjectStoreStorage are unavailable
| │ │ - pub trait Storage │ │ | ||
| │ │ - pub trait StorageBuilder │ │ | ||
| │ │ - pub struct StorageBuilderRegistry │ │ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in the previous comment, we should only have one trait that provides similar functionality from Storage and StorageBuilderRegistry. The new storage abstraction should handle s3://path/to/file directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for my understanding: do you mean that we don't need a registry. Instead, we can just configure to use a general Storage implementation that should be able to handle multiple storage schemes like s3/gcs/...?
|
|
||
| // Deletion operations | ||
| async fn delete(&self, path: &str) -> Result<()>; | ||
| async fn remove_dir_all(&self, path: &str) -> Result<()>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| // File object creation | ||
| fn new_input(&self, path: &str) -> Result<InputFile>; | ||
| fn new_output(&self, path: &str) -> Result<OutputFile>; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can provides a function like delete_iter.
Which issue does this PR close?
What changes are included in this PR?
Are these changes tested?