|
| 1 | +.. _pm_s2idle_psci: |
| 2 | + |
| 3 | +############################################# |
| 4 | +Suspend-to-Idle (S2Idle) and PSCI Integration |
| 5 | +############################################# |
| 6 | + |
| 7 | +********************************** |
| 8 | +Suspend-to-Idle (S2Idle) Overview |
| 9 | +********************************** |
| 10 | + |
| 11 | +Suspend-to-Idle (s2idle), also known as "freeze," is a generic, pure software, light-weight variant of system suspend. |
| 12 | +In this state, the Linux kernel freezes user space tasks, suspends devices, and then puts all CPUs into their deepest available idle state. |
| 13 | + |
| 14 | +********************************** |
| 15 | +S2Idle vs Deep Sleep (mem) |
| 16 | +********************************** |
| 17 | + |
| 18 | +The Linux kernel has sleep states that are global low-power states of the entire system in which user space |
| 19 | +code cannot be executed and the overall system activity is significantly reduced. |
| 20 | +There are different types of sleep states, as described in the kernel's |
| 21 | +`documentation <https://docs.kernel.org/admin-guide/pm/sleep-states.html>`__. |
| 22 | +System sleep states can be selected using the sysfs entry :file:`/sys/power/mem_sleep`. |
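
As a rough illustration of how that sysfs entry is read, the sketch below parses its contents. It assumes the upstream kernel layout, in which the file lists the available states separated by spaces and marks the currently selected one with square brackets (for example ``s2idle [deep]``). The helper names ``parse_mem_sleep`` and ``select_mem_sleep`` are hypothetical and not part of any TI tool.

```python
from pathlib import Path

# Assumed location, taken from the upstream kernel documentation.
MEM_SLEEP = Path("/sys/power/mem_sleep")


def parse_mem_sleep(text):
    """Return (available_states, selected_state) from mem_sleep contents."""
    available = []
    selected = None
    for token in text.split():
        # The currently selected state is wrapped in square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        available.append(token)
    return available, selected


def select_mem_sleep(state, path=MEM_SLEEP):
    """Select a sleep state (hypothetical helper; needs root on a target)."""
    available, _ = parse_mem_sleep(path.read_text())
    if state not in available:
        raise ValueError(f"{state!r} not offered by this kernel: {available}")
    path.write_text(state)
```

On a target, ``select_mem_sleep("s2idle")`` followed by writing ``mem`` to the kernel's ``/sys/power/state`` entry would then be expected to enter suspend-to-idle.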
| 23 | + |
| 24 | +On the TI K3 AM62L platform, we currently support the ``s2idle`` and ``deep`` states. |
| 25 | +Both of them can achieve similar power savings (e.g., by suspending to RAM / putting DDR into Self-Refresh). |
| 26 | +The primary differences lie in the software execution flow, specifically how CPUs are managed and which |
| 27 | +PSCI APIs are invoked. |
| 28 | + |
| 29 | +.. list-table:: S2Idle vs Deep Sleep |
| 30 | + :widths: 20 40 40 |
| 31 | + :header-rows: 1 |
| 32 | + |
| 33 | + * - Feature |
| 34 | + - s2idle (Suspend-to-Idle) |
| 35 | + - deep (Suspend-to-RAM) |
| 36 | + |
| 37 | + * - **Kernel String** |
| 38 | + - ``s2idle`` or ``freeze`` |
| 39 | + - ``deep`` or ``mem`` |
| 40 | + |
| 41 | + * - **Non-boot CPUs** |
| 42 | + - **Online**: Non-boot CPUs are put into a deep idle state but remain logically online. |
| 43 | + - **Offline**: Non-boot CPUs are hot-unplugged (removed) from the system via ``CPU_OFF``. |
| 44 | + |
| 45 | + * - **Entry Path** |
| 46 | + - **cpuidle**: Uses the standard CPU idle framework and its governors. Each driver is runtime suspended to make sure it is idle. |
| 47 | + - **suspend_ops**: Uses platform-specific suspend operations: each driver's suspend callbacks run, and finally PSCI ``SYSTEM_SUSPEND`` is called. |
| 48 | + No governor is involved in the decision. |
| 49 | + |
| 50 | + * - **PSCI Call** |
| 51 | + - ``CPU_SUSPEND``: Invoked for every core (Last Man Standing logic coordinates the cluster/system depth). |
| 52 | + - ``SYSTEM_SUSPEND``: Typically invoked by the last active CPU after others are offlined. |
| 53 | + |
| 54 | + * - **Resume Flow** |
| 55 | + - **Fast**: CPUs exit the idle loop immediately upon interrupt. Context is preserved. |
| 56 | + - **Slow**: Kernel must serially bring secondary CPUs back online (Hotplug). Kernel must recreate |
| 57 | + threads, re-enable interrupts, resume each driver and restore per-CPU state for every non-boot core. |
| 58 | + |
| 59 | + * - **Latency** |
| 60 | + - Lower |
| 61 | + - Higher, primarily due to the overhead of **CPU Hotplug** for non-boot CPUs |
| 62 | + |
| 63 | +******************* |
| 64 | +PSCI as the Enabler |
| 65 | +******************* |
| 66 | + |
| 67 | +The Power State Coordination Interface (PSCI) is an ARM-defined standard that acts as the fundamental |
| 68 | +enabler for s2idle on all ARM platforms that support it. PSCI defines a standardized firmware interface that allows the |
| 69 | +Operating System (OS) to request power states without needing intimate knowledge of the underlying |
| 70 | +SoC. |
| 71 | + |
| 72 | +**s2idle Call Flow:** |
| 73 | + |
| 74 | +.. code-block:: text |
| 75 | +
|
| 76 | + Linux Kernel PSCI Firmware (TF-A) |
| 77 | + ============ ==================== |
| 78 | +
|
| 79 | + 1. Freeze tasks |
| 80 | + | |
| 81 | + v |
| 82 | + 2. Suspend devices |
| 83 | + | |
| 84 | + v |
| 85 | + 3. cpuidle driver -----------> CPU_SUSPEND (SMC/HVC) |
| 86 | + (per CPU) | |
| 87 | + | v |
| 88 | + | Coordinate power |
| 89 | + | state requests |
| 90 | + | | |
| 91 | + | v |
| 92 | + | CPU enters low-power |
| 93 | + | hardware state |
| 94 | + | |
| 95 | + |<--------- Resume --------- |
| 96 | + | |
| 97 | + v |
| 98 | + 4. Resume devices |
| 99 | + | |
| 100 | + v |
| 101 | + 5. Thaw tasks |
| 102 | +
|
| 103 | +The ``cpuidle`` driver calls the PSCI ``CPU_SUSPEND`` API to transition the CPUs into a low-power state. |
| 104 | +The effectiveness of s2idle depends heavily on the PSCI implementation's ability to coordinate these |
| 105 | +requests and enter the deepest possible hardware state. |
| 106 | + |
| 107 | +************************ |
| 108 | +OS Initiated (OSI) Mode |
| 109 | +************************ |
| 110 | + |
| 111 | +PSCI 1.0 introduced **OS Initiated (OSI)** mode, which shifts the responsibility of power state coordination from the platform firmware to the Operating System. |
| 112 | + |
| 113 | +In the default **Platform Coordinated (PC)** mode, the OS independently requests a state for each core. The firmware then aggregates these requests (voting) to |
| 114 | +determine if a cluster or the system can be powered down. |
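
The voting described for Platform Coordinated mode can be sketched as a simple aggregation. This is a toy model, not TF-A code: the level constants and the function ``aggregate_shared_level`` are hypothetical, and it assumes each idle core votes for the deepest level it is willing to see powered down, while a running core casts no vote.

```python
# Illustrative power levels (hypothetical constants for this sketch).
CORE, CLUSTER, SYSTEM = 0, 1, 2


def aggregate_shared_level(votes):
    """Return the deepest level the firmware may power down.

    votes maps a core id to the deepest level that core allows, or to
    None while the core is still running.
    """
    if not votes:
        raise ValueError("at least one core is required")
    # A running core keeps every shared domain (cluster, system) on.
    if any(level is None for level in votes.values()):
        return CORE
    # Otherwise the shallowest vote wins: no core gets a deeper shared
    # state than it asked for.
    return min(votes.values())
```

For example, if one core allows ``SYSTEM`` and the other only ``CLUSTER``, this model powers down the cluster but keeps the system level on.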
| 115 | + |
| 116 | +In **OS Initiated (OSI)** mode, the OS explicitly manages the hierarchy. The OS determines when the last core in a power domain (e.g., a cluster) is going idle |
| 117 | +and explicitly requests the power-down of that domain. |
| 118 | + |
| 119 | +Why OSI? |
| 120 | +======== |
| 121 | + |
| 122 | +OSI mode allows the OS to make better power decisions because it has visibility into: |
| | + |
| 123 | +* **Task Scheduling:** The OS knows when other cores will wake up. |
| 124 | +* **Wakeup Latencies:** The OS can respect Quality of Service (QoS) latency constraints more accurately. |
| 125 | +* **Usage Patterns:** The OS can predict idle duration better than firmware. |
| 126 | + |
| 127 | +OSI Sequence |
| 128 | +============ |
| 129 | + |
| 130 | +The coordination in OSI mode follows a specific "Last Man Standing" sequence. The OS tracks the state of all cores in a topology node (e.g., a cluster). |
| 131 | + |
| 132 | +.. code-block:: text |
| 133 | +
|
| 134 | + OSI "Last Man Standing" Flow |
| 135 | +
|
| 136 | + Cluster with 2 Cores OS Action PSCI Request |
| 137 | + ==================== ========= ============= |
| 138 | +
|
| 139 | + 1. Core 0,1: ACTIVE |
| 140 | + | |
| 141 | + | Core 0 becomes idle |
| 142 | + v |
| 143 | + 2. Core 0: IDLE --> OS requests local --> CPU_SUSPEND |
| 144 | + Core 1: ACTIVE Core Power Down (Core PD only) |
| 145 | + Cluster stays ON |
| 146 | + | |
| 147 | + | Core 1 (LAST) becomes idle |
| 148 | + v |
| 149 | + 3. Core 0,1: IDLE --> OS recognizes --> CPU_SUSPEND |
| 150 | + "Last Man" scenario (Composite State) |
| 151 | + Requests Composite: |
| 152 | + - Core 1: PD Core: PD |
| 153 | + - Cluster: PD Cluster: PD |
| 154 | + - System: PD System: PD |
| 155 | + | |
| 156 | + v |
| 157 | + 4. Firmware Verification --> PSCI firmware checks |
| 158 | + & System Power Down all cores/clusters idle |
| 159 | + If verified: Power down |
| 160 | + entire system |
| 161 | + If not: Deny request |
| 162 | + (race condition) |
| 163 | +
|
| 164 | +**Detailed Steps:** |
| 165 | + |
| 166 | +1. **First Core Idle:** When the first core in a cluster goes idle, the OS requests a local idle state |
| 167 | + for that core (e.g., Core Power Down) but keeps the cluster running. |
| 168 | + |
| 169 | +2. **Last Core Idle:** When the *last* active core in the cluster is ready to go idle, the OS recognizes |
| 170 | + that the entire cluster, and potentially the system, can now be powered down. |
| 171 | + |
| 172 | +3. **Composite Request:** The last core issues a ``CPU_SUSPEND`` call that requests a **composite state**: |
| 173 | + |
| 174 | + * **Core State:** Power Down |
| 175 | + * **Cluster State:** Power Down |
| 176 | + * **System State:** Power Down (as demonstrated in the diagram) |
| 177 | + |
| 178 | +4. **Firmware Enforcement:** The PSCI firmware verifies that all other cores and clusters in the requested node are indeed idle. |
| 179 | + If they are not, the request is denied (to prevent race conditions). |
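
The four steps above can be modelled in a few lines. This is a minimal sketch under simplifying assumptions: a single cluster, no locking, and hypothetical names (``Cluster``, ``core_idle``, ``firmware_accepts``) that do not correspond to kernel or TF-A symbols.

```python
class Cluster:
    """OS-side bookkeeping of which cores in one cluster are idle."""

    def __init__(self, num_cores):
        self.idle = [False] * num_cores

    def core_idle(self, core):
        """Mark a core idle and return the levels it should request."""
        self.idle[core] = True
        if all(self.idle):
            # Last man standing: request the composite state.
            return ("core", "cluster", "system")
        # Other cores are still active: local core power-down only.
        return ("core",)

    def core_wake(self, core):
        self.idle[core] = False


def firmware_accepts(request, other_cores_idle):
    """Firmware-side check: deny a composite request if the OS view raced."""
    wants_shared_domain = "cluster" in request or "system" in request
    if wants_shared_domain and not all(other_cores_idle):
        return False
    return True
```

With two cores, ``core_idle(0)`` yields a core-only request and ``core_idle(1)`` yields the composite request, which the firmware model accepts only if core 0 really is idle.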
| 180 | + |
| 181 | +*********************************** |
| 182 | +Understanding the Suspend Parameter |
| 183 | +*********************************** |
| 184 | + |
| 185 | +The ``power_state`` parameter passed to ``CPU_SUSPEND`` is the key to requesting these states. |
| 186 | +In OSI mode, this parameter must encode the intent for the entire hierarchy. |
| 187 | + |
| 188 | +Power State Parameter Encoding |
| 189 | +================================ |
| 190 | + |
| 191 | +The ``power_state`` is a 32-bit parameter defined by the ARM PSCI specification (ARM DEN0022C). |
| 192 | +It has two encoding formats, controlled by the platform's build configuration. |
| 193 | + |
| 194 | +Standard Format |
| 195 | +=============== |
| 196 | + |
| 197 | +This is the default format used by most platforms: |
| 198 | + |
| 199 | +.. code-block:: text |
| 200 | +
|
| 201 | + 31 26 25 24 23 17 16 15 0 |
| 202 | + +---------------+------+----------------+----+----------------------+ |
| 203 | + | Reserved | Pwr | Reserved | ST | State ID | |
| 204 | + | (must be 0) | Level| (must be 0) | | (platform-defined) | |
| 205 | + +---------------+------+----------------+----+----------------------+ |
| 206 | +
|
| 207 | +.. list-table:: Standard Format Bit Fields |
| 208 | + :widths: 20 80 |
| 209 | + :header-rows: 1 |
| 210 | + |
| 211 | + * - Bit Field |
| 212 | + - Description |
| 213 | + |
| 214 | + * - **[31:26]** |
| 215 | + - **Reserved**: Must be zero. |
| 216 | + |
| 217 | + * - **[25:24]** |
| 218 | + - **Power Level**: Indicates the deepest power domain level that can be powered down. |
| 219 | + |
| 220 | + * ``0``: CPU/Core level |
| 221 | + * ``1``: Cluster level |
| 222 | + * ``2``: System level |
| 223 | + * ``3``: Higher levels (platform-specific) |
| 224 | + |
| 225 | + * - **[23:17]** |
| 226 | + - **Reserved**: Must be zero. |
| 227 | + |
| 228 | + * - **[16]** |
| 229 | + - **State Type (ST)**: Type of power state. |
| 230 | + |
| 231 | + * ``0``: Standby or Retention (low latency, context preserved) |
| 232 | + * ``1``: Power Down (higher latency, may lose context) |
| 233 | + |
| 234 | + * - **[15:0]** |
| 235 | + - **State ID**: Platform-specific identifier for the requested power state. The OS and |
| 236 | + platform firmware must agree on the meaning of these values, typically defined through |
| 237 | + device tree bindings. |
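
The bit fields in the table translate directly into shifts and masks. The following sketch packs a standard-format value. The function name ``encode_power_state`` is hypothetical, and the State ID used in the usage note is only an example value, not a TI-defined one.

```python
def encode_power_state(power_level, power_down, state_id):
    """Pack a standard-format power_state value.

    power_level: 0..3, placed in bits [25:24]
    power_down:  True for Power Down, False for Standby/Retention (bit 16)
    state_id:    platform-defined, placed in bits [15:0]
    Reserved bits [31:26] and [23:17] are left at zero.
    """
    if not 0 <= power_level <= 3:
        raise ValueError("power_level must fit in two bits")
    if not 0 <= state_id <= 0xFFFF:
        raise ValueError("state_id must fit in 16 bits")
    return (power_level << 24) | (int(bool(power_down)) << 16) | state_id
```

For instance, ``hex(encode_power_state(1, True, 0))`` gives ``0x1010000``, a cluster-level power-down request with State ID 0.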
| 238 | + |
| 239 | +**OSI Mode Consideration:** |
| 240 | + |
| 241 | +In OSI mode, the OS is responsible for tracking which cores are idle. When the last core |
| 242 | +in a cluster issues this ``CPU_SUSPEND`` call with Power Level = 1, the PSCI firmware: |
| 243 | + |
| 244 | +1. Verifies that all other cores in the cluster are already in a low-power state |
| 245 | +2. If verified, powers down the entire cluster |
| 246 | +3. If not verified (race condition), denies the request with an error code |
| 247 | + |
| 248 | +The State ID field is platform-defined and typically documented in the device tree |
| 249 | +``idle-state`` nodes using the ``arm,psci-suspend-param`` property. This mechanism, |
| 250 | +leveraging ``cpuidle`` and ``s2idle``, allows the kernel to abstract complex platform-specific |
| 251 | +low-power modes into a generic framework. The ``idle-state`` nodes in the Device Tree define these power states, |
| 252 | +including their entry/exit latencies and target power consumption, enabling the ``cpuidle`` governor to make informed |
| 253 | +decisions about which idle state to enter based on system load and predicted idle duration. |
| 254 | + |
| 255 | +The ``arm,psci-suspend-param`` property then directly maps these idle states to the corresponding PSCI ``power_state`` parameter values that the firmware understands. |
| 256 | + |
| 257 | +Example: System Suspend (Standard Format) |
| 258 | +========================================= |
| 259 | + |
| 260 | +When the OS targets a system-wide suspend state (e.g., Suspend-to-RAM), the ``power_state`` parameter is constructed to target the highest power level. |
| 261 | +Consider the example value **0x02012234**: |
| 262 | + |
| 263 | +.. list-table:: Power State Parameter Breakdown (0x02012234) |
| 264 | + :widths: 20 20 20 40 |
| 265 | + :header-rows: 1 |
| 266 | + |
| 267 | + * - Field |
| 268 | + - Bits |
| 269 | + - Value |
| 270 | + - Meaning |
| 271 | + |
| 272 | + * - Reserved |
| 273 | + - [31:26] |
| 274 | + - 0 |
| 275 | + - Must be zero |
| 276 | + |
| 277 | + * - Power Level |
| 278 | + - [25:24] |
| 279 | + - 2 |
| 280 | + - System level |
| 281 | + |
| 282 | + * - Reserved |
| 283 | + - [23:17] |
| 284 | + - 0 |
| 285 | + - Must be zero |
| 286 | + |
| 287 | + * - State Type |
| 288 | + - [16] |
| 289 | + - 1 |
| 290 | + - Power Down |
| 291 | + |
| 292 | + * - State ID |
| 293 | + - [15:0] |
| 294 | + - 0x2234 |
| 295 | + - Platform-specific (e.g., "S2RAM") |
| 296 | + |
| 297 | +**Interpretation:** |
| 298 | + |
| 299 | +* **Power Level = 2** tells the firmware that a system-level transition is requested. |
| 300 | +* **State Type = 1** indicates a power-down state. |
| 301 | +* **State ID = 0x2234** is the platform-specific identifier for this system state. |
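
The breakdown of ``0x02012234`` can be checked mechanically. The sketch below unpacks a standard-format value and rejects any value with reserved bits set. ``decode_power_state`` is a hypothetical helper written for this illustration, not a kernel or TF-A function.

```python
RESERVED_MASK = 0xFC000000 | 0x00FE0000  # bits [31:26] and [23:17]


def decode_power_state(value):
    """Split a standard-format power_state value into its fields."""
    if value & RESERVED_MASK:
        raise ValueError("reserved bits must be zero")
    return {
        "power_level": (value >> 24) & 0x3,  # bits [25:24]
        "state_type": (value >> 16) & 0x1,   # bit [16]: 1 = Power Down
        "state_id": value & 0xFFFF,          # bits [15:0]
    }
```

Decoding ``0x02012234`` this way yields power level 2, state type 1, and State ID ``0x2234``, matching the table.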
| 302 | + |
| 303 | +In the context of **s2idle**, if the OS determines that all constraints are met for system suspension, |
| 304 | +the last active CPU (Last Man) will invoke ``CPU_SUSPEND`` with this parameter. The PSCI firmware then |
| 305 | +coordinates the final steps to suspend the system (e.g., placing DDR in self-refresh and powering down the SoC). |