4 changes: 2 additions & 2 deletions CHANGELOG.md
@@ -8,20 +8,20 @@ All notable changes to this project will be documented in this file.

- Support objectOverrides using `.spec.objectOverrides`.
See [objectOverrides concepts page](https://docs.stackable.tech/home/nightly/concepts/overrides/#object-overrides) for details ([#741]).
- Enable the [restart-controller](https://docs.stackable.tech/home/nightly/commons-operator/restarter/), so that the Pods are automatically restarted on config changes ([#743]).

### Changed

- Gracefully shutdown all concurrent tasks by forwarding the SIGTERM signal ([#747]).
- Added a warning and an exit condition to the format-namenodes container script to check for corrupted data after formatting ([#751]).

### Fixed

- Previously, some shell output of init-containers was not logged properly and therefore not aggregated, which is fixed now ([#746]).

[#741]: https://github.com/stackabletech/hdfs-operator/pull/741
[#743]: https://github.com/stackabletech/hdfs-operator/pull/743
[#746]: https://github.com/stackabletech/hdfs-operator/pull/746
[#747]: https://github.com/stackabletech/hdfs-operator/pull/747
[#751]: https://github.com/stackabletech/hdfs-operator/pull/751

## [25.11.0] - 2025-11-07

27 changes: 27 additions & 0 deletions docs/modules/hdfs/pages/reference/troubleshooting.adoc
@@ -0,0 +1,27 @@
= Troubleshooting

[#init-container-format-namenode-fails]
== Init container format-namenodes fails

When creating fresh HDFS clusters, unexpected Pod restarts might corrupt the initial namenode formatting.
This leaves the namenode data PVC in a dangling state where, for example, the `../current/VERSION` file has been created, but the `../current/fsimage_xxx` files are missing.

Once a restart has corrupted the namenode formatting, formatting again fails because the directories and files already exist.
We do not force (override) the formatting process, in order to avoid data loss and other implications.

[source]
----
Running in non-interactive mode, and data appears to exist in Storage Directory root= /stackable/data/namenode; location= null. Not formatting.
----

During startup, the namenode main container logs another error message that indicates a corrupted formatting state.

[source]
----
java.io.FileNotFoundException: No valid image files found
----

WARNING: The following fix should only be applied to fresh clusters. For existing clusters, please reach out to support instead.

1. Remove the PVC called `data-<cluster-name>-namenode-<rolegroup>-0` of the failed namenode (here, namenode `0`).
2. Restart the namenode afterwards, for example by deleting its Pod (see the sketch below).
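
The following is a minimal sketch of both steps, assuming the default StatefulSet and PVC naming shown above; replace `<cluster-name>` and `<rolegroup>` with the actual names of your cluster and role group.

[source,shell]
----
# Step 1: remove the dangling namenode data PVC of the failed namenode 0.
# The PVC may stay in Terminating until the Pod that mounts it is gone.
kubectl delete pvc data-<cluster-name>-namenode-<rolegroup>-0

# Step 2: restart the namenode by deleting its Pod.
# The StatefulSet recreates the Pod, and the format-namenodes init container
# can then format the namenode data directory from scratch.
kubectl delete pod <cluster-name>-namenode-<rolegroup>-0
----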
1 change: 1 addition & 0 deletions docs/modules/hdfs/partials/nav.adoc
@@ -23,3 +23,4 @@
** xref:hdfs:reference/discovery.adoc[]
** xref:hdfs:reference/commandline-parameters.adoc[]
** xref:hdfs:reference/environment-variables.adoc[]
* xref:hdfs:reference/troubleshooting.adoc[]
10 changes: 10 additions & 0 deletions rust/operator-binary/src/container.rs
@@ -718,6 +718,16 @@ impl ContainerConfig {
exclude_from_capture {hadoop_home}/bin/hdfs namenode -bootstrapStandby -nonInteractive
fi
else
# Sanity check for initial format data corruption: VERSION file exists but no fsimage files were created.
FSIMAGE_COUNT=$(find "{NAMENODE_ROOT_DATA_DIR}/current" -maxdepth 1 -regextype posix-egrep -regex ".*/fsimage_[0-9]+" | wc -l)

if [ "${{FSIMAGE_COUNT}}" -eq 0 ]
then
echo "WARNING: {NAMENODE_ROOT_DATA_DIR}/current/VERSION file exists but no fsimage files were found."
echo "This indicates an incomplete and corrupted namenode formatting. Please check the troubleshooting guide."
exit 1
fi

cat "{NAMENODE_ROOT_DATA_DIR}/current/VERSION"
echo "Pod $POD_NAME already formatted. Skipping..."
fi
8 changes: 1 addition & 7 deletions rust/operator-binary/src/hdfs_controller.rs
@@ -22,7 +22,6 @@ use stackable_operator::{
product_image_selection::{self, ResolvedProductImage},
rbac::build_rbac_resources,
},
constants::RESTART_CONTROLLER_ENABLED_LABEL,
iter::reverse_if,
k8s_openapi::{
DeepMerge,
@@ -901,13 +900,8 @@ fn rolegroup_statefulset(
..StatefulSetSpec::default()
};

let sts_metadata = metadata
    .clone()
    .with_label(RESTART_CONTROLLER_ENABLED_LABEL.to_owned())
    .build();

Member: We could leave this code in (commented). At the very least IMHO we should add a TODO pointing to the issue.

Member (Author): I thought about it, but it's not much of a change/revert. I removed the enable part from the changelog as well. I think we should rather untick HDFS in the restarter epic?

Member: That sounds like a good idea.

Ok(StatefulSet {
metadata: sts_metadata,
metadata: metadata.build(),
spec: Some(statefulset_spec),
status: None,
})
9 changes: 0 additions & 9 deletions tests/templates/kuttl/smoke/30-assert.yaml.j2
@@ -7,9 +7,6 @@ apiVersion: apps/v1
kind: StatefulSet
metadata:
name: hdfs-namenode-default
generation: 1 # There should be no unneeded Pod restarts
labels:
restarter.stackable.tech/enabled: "true"
spec:
template:
spec:
@@ -35,9 +32,6 @@
kind: StatefulSet
metadata:
name: hdfs-journalnode-default
generation: 1 # There should be no unneeded Pod restarts
labels:
restarter.stackable.tech/enabled: "true"
spec:
template:
spec:
@@ -62,9 +56,6 @@
kind: StatefulSet
metadata:
name: hdfs-datanode-default
generation: 1 # There should be no unneeded Pod restarts
labels:
restarter.stackable.tech/enabled: "true"
spec:
template:
spec: