Doc 13539 data disk storage sizing info #4022
Open
ggray-cb wants to merge 6 commits into release/8.0 from DOC-13539_data_disk_storage_sizing_info
+208
−34
6 commits
059f528 In-progress checkin to rey to resolve some build issue. (ggray-cb)
ebba554 Completed draft (ggray-cb)
4e4a0f1 Minor adjustments (ggray-cb)
6a2ba77 Fixes for broken links (ggray-cb)
4ec1a7e Fixed typo (ggray-cb)
c4eecf5 Correcting math/formula errors spotted by Hyun-Ju (ggray-cb)
@@ -1,5 +1,6 @@
= Sizing Guidelines
:description: Evaluate the overall performance and capacity goals that you have for Couchbase, and use that information to determine the necessary resources that you'll need in your deployment.
:stem: latexmath

[abstract]
{description}

@@ -110,13 +111,17 @@ Most deployments can achieve optimal performance with 1 Gbps interconnects, but | |
|
|
||
| == Sizing Data Service Nodes | ||
|
|
||
| Data Service nodes handle data service operations, such as create/read/update/delete (CRUD). | ||
| The sizing information provided below applies both to the _Couchstore_ and _Magma_ storage engines: however, the _differences_ between these storage engines should also be reviewed, before sizing is attempted. | ||
| For information, see xref:learn:buckets-memory-and-storage/storage-engines.adoc[Storage Engines]. | ||
| Data Service nodes store and perform data operations such as create/read/update/delete (CRUD). | ||
| The sizing information provided in this section applies to data stored in either Couchstore or Magma storage engines. | ||
| However, you should also consider the differences between these storage engines. | ||
| For more information, see xref:learn:buckets-memory-and-storage/storage-engines.adoc[]. | ||
|
|
||
| It's important to keep use-cases and application workloads in mind since different application workloads have different resource requirements. | ||
| For example, if your working set needs to be fully in memory, you might need large RAM size. | ||
| On the other hand, if your application requires only 10% of data in memory, you will need disks with enough space to store all of the data, and that are fast enough for your read/write operations. | ||
| For example, if your working data set needs to be fully in memory, your cluster may need more RAM. | ||
| On the other hand, if your application requires only 10% of data in memory, you need disks with enough space to store all of the data. | ||
| Their read/write rate must also be fast enough to meet your performance goals. | ||
|
|
||
| === RAM Sizing for Data Service Nodes | ||
|
|
||
| You can start sizing the Data Service nodes by answering the following questions: | ||
|
|
||
|
|
@@ -126,25 +131,25 @@ You can start sizing the Data Service nodes by answering the following questions | |
|
|
||
| Answers to the above questions can help you better understand the capacity requirement of your cluster and provide a better estimation for sizing. | ||
|
|
||
| *The following is an example use-case for sizing RAM:* | ||
| The following tables show an example use-case for sizing RAM: | ||
|
|
||
| .Input Variables for Sizing RAM | ||
| |=== | ||
| | Input Variable | Value | ||
|
|
||
| | [.var]`documents_num` | ||
| | `documents_num` | ||
| | 1,000,000 | ||
|
|
||
| | [.var]`ID_size` | ||
| | `ID_size` | ||
| | 100 bytes | ||
|
|
||
| | [.var]`value_size` | ||
| | `value_size` | ||
| | 10,000 bytes | ||
|
|
||
| | [.var]`number_of_replicas` | ||
| | `number_of_replicas` | ||
| | 1 | ||
|
|
||
| | [.var]`working_set_percentage` | ||
| | `working_set_percentage` | ||
| | 20% | ||
| |=== | ||
|
|
||
|
|
@@ -172,16 +177,16 @@ Based on the provided data, a rough sizing guideline formula would be: | |
| |=== | ||
| | Variable | Calculation | ||
|
|
||
| | [.var]`no_of_copies` | ||
| | `no_of_copies` | ||
| | `1 + number_of_replicas` | ||
|
|
||
| | [.var]`total_metadata` | ||
| | `total_metadata` | ||
| | `(documents_num) * (metadata_per_document + ID_size) * (no_of_copies)` | ||
|
|
||
| | [.var]`total_dataset` | ||
| | `total_dataset` | ||
| | `(documents_num) * (value_size) * (no_of_copies)` | ||
|
|
||
| | [.var]`working_set` | ||
| | `working_set` | ||
| | `total_dataset * (working_set_percentage)` | ||
|
|
||
| | Cluster RAM quota required | ||
|
|
@@ -198,16 +203,16 @@ Based on the above formula, these are the suggested sizing guidelines: | |
| |=== | ||
| | Variable | Calculation | ||
|
|
||
| | [.var]`no_of_copies` | ||
| | `no_of_copies` | ||
| | = 1 for original and 1 for replica | ||
|
|
||
| | [.var]`total_metadata` | ||
| | `total_metadata` | ||
| | = 1,000,000 * (100 + 56) * (2) = 312,000,000 bytes | ||
|
|
||
| | [.var]`total_dataset` | ||
| | `total_dataset` | ||
| | = 1,000,000 * (10,000) * (2) = 20,000,000,000 bytes | ||
|
|
||
| | [.var]`working_set` | ||
| | `working_set` | ||
| | = 20,000,000,000 * (0.2) = 4,000,000,000 bytes | ||
|
|
||
| | Cluster RAM quota required | ||
|
|
@@ -218,6 +223,175 @@ This tells you that the RAM requirement for the whole cluster is 7 GB. | |
|
|
||
| NOTE: This amount is in addition to the RAM requirements for the operating system and any other software that runs on the cluster nodes. | ||
|
|
||
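As a quick sanity check on the two tables above, here is a minimal Python sketch (an editorial illustration, not part of the documentation change itself) that reproduces the example numbers. The variable names mirror the table rows; the cluster RAM quota row is cut off by the hunk boundary, so the sketch stops at the working set.

[source,python]
----
# Reproduces the example RAM-sizing variables from the tables above.
documents_num = 1_000_000
ID_size = 100                  # bytes per document ID
value_size = 10_000            # bytes per document value
number_of_replicas = 1
working_set_percentage = 0.20
metadata_per_document = 56     # bytes of metadata per document

no_of_copies = 1 + number_of_replicas
total_metadata = documents_num * (metadata_per_document + ID_size) * no_of_copies
total_dataset = documents_num * value_size * no_of_copies
working_set = total_dataset * working_set_percentage

print(no_of_copies)    # 2
print(total_metadata)  # 312000000 (312,000,000 bytes)
print(total_dataset)   # 20000000000 (20,000,000,000 bytes)
print(working_set)     # 4000000000.0 (4,000,000,000 bytes)
----
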
=== Disk Sizing for Data Service Nodes

A key concept to remember about Couchbase Server's data storage is that it's an append-only system.
When an application mutates or deletes a document, the old version of the document is not immediately removed from disk.
Instead, Couchbase Server marks it as stale.
Stale versions remain on disk until a compaction process runs and reclaims the disk space.
When sizing disk space for your cluster, take this behavior into account by applying an append-only multiplier to your data size.

When sizing disk space for the Data Service nodes, you first must determine the following information:

* The total number of documents that you plan to store in the cluster.
If this value constantly grows, consider the growth rate into the future when sizing.
* The average size of each document.
* Whether the documents can be compressed, and if they can, what compression ratio Couchbase Server can achieve.
Couchbase Server always compresses documents when storing them on disk.
See xref:learn:buckets-memory-and-storage/compression.adoc[] for more information about compression in Couchbase Server.
Documents containing JSON data or binaries can be compressed.
Binary data that's already compressed (such as compressed images or videos) cannot be compressed further.

+
Couchbase Server uses the https://en.wikipedia.org/wiki/Snappy_(compression)[Snappy^] compression algorithm, which prioritizes speed while still providing reasonable compression.
You can estimate the compression ratio Couchbase Server can achieve for your data by compressing a sample set of documents using a snappy-based command line tool such as `snzip`.
Otherwise, you can choose to use an estimated compression ratio of 0.7 for JSON documents.
A sketch of this sampling approach appears after this list.

* The number of replicas for your buckets.
See xref:learn:clusters-and-availability/intra-cluster-replication.adoc[] for more information about replicas.
* The number of documents that you plan to delete each day.
This number includes both the number of documents directly deleted by your applications and those that expire due to TTL (time to live) settings.
See xref:learn:data/expiration.adoc[] for more information about document expiration.

+
This value is important because in the short term, deletions actually consume a bit more disk space, not less.
Because of Couchbase Server's append-only system, the deleted documents remain on disk until a compaction process runs.
Also, Couchbase Server creates a tombstone record for each deleted document, which consumes a small amount of additional disk space.

* The metadata purge interval you'll use.
This purge process removes tombstones that record the deletion of documents.
The default purge interval is 3 days.
For more information about the purge interval, see xref:manage:manage-settings/configure-compact-settings.adoc#tombstone-purge-interval[Metadata Purge Interval].

* Which storage engine your cluster will use.
The storage engine affects the append-only multiplier that you use when sizing disk space.
See xref:learn:buckets-memory-and-storage/storage-engines.adoc[] for more information.

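To make the sampling approach described in the compression item above concrete, here is a hedged sketch. It assumes the third-party `python-snappy` package (imported as `snappy`) is installed and that a hypothetical `sample_docs.jsonl` file holds one representative JSON document per line; neither the file name nor the helper function comes from the documented procedure.

[source,python]
----
# Illustrative only: estimate a Snappy compression ratio from sample documents.
# Assumes the python-snappy package and a sample_docs.jsonl file with one
# representative JSON document per line.
import snappy

def estimate_compression_ratio(path):
    raw_bytes = 0
    compressed_bytes = 0
    with open(path, "rb") as f:
        for line in f:
            doc = line.strip()
            if not doc:
                continue
            raw_bytes += len(doc)
            compressed_bytes += len(snappy.compress(doc))
    return compressed_bytes / raw_bytes

# Typical JSON often lands somewhere around 0.7; your data may differ.
print(estimate_compression_ratio("sample_docs.jsonl"))
----
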
To determine the amount of storage you need in your cluster:

. Calculate the size of the dataset by multiplying the total number of documents by the average document size.
If the documents are compressible, also multiply by the estimated compression ratio:

+
[stem]
++++
S_{\mathrm{dataset}} = \text{# of documents} \times \text{avg. document size} \times \text{compression ratio}
++++

. Calculate the total metadata size by multiplying the total number of documents by 56 bytes (the average metadata size per document):

+
[stem]
++++
S_{\mathrm{metadata}} = \text{# of documents} \times 56
++++

. Calculate the key storage overhead by multiplying the total number of documents by the average key size:

+
[stem]
++++
S_{\mathrm{keys}} = \text{# of documents} \times \text{avg. key size}
++++

. Calculate the tombstone space in bytes using the following formula:

+
[latexmath]
++++
\begin{equation}
\begin{split}
S_{\mathrm{tombstones}} = & ( \text{avg. key size} + 60 ) \times \text{purge frequency in days} \\
& \times ( \text{# of replicas} + 1 ) \times \text{# documents deleted per day}
\end{split}
\end{equation}
++++

. Calculate the total disk space required using the following formula:

+
[latexmath]
++++
\begin{equation}
\begin{split}
\text{total disk space} = & ( ( S_{\mathrm{dataset}} \times (\text{# replicas} + 1) \\
& + S_{\mathrm{metadata}} + S_{\mathrm{keys}} ) \times F_{\text{append-multiplier}} ) + S_{\mathrm{tombstones}}
\end{split}
\end{equation}
++++

+
Where stem:[F_{\text{append-multiplier}}] is the append-only multiplier.
This value depends on the storage engine you use:

+
* For the Couchstore storage engine, use an append-only multiplier of 3.
* For the Magma storage engine, use an append-only multiplier of 2.2.

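The five steps above can be folded into a single calculation. The following is an illustrative sketch only; the function and parameter names are this sketch's own, not Couchbase terminology.

[source,python]
----
# Illustrative implementation of the five disk-sizing steps above.
def estimate_data_disk_space(num_docs, avg_doc_size, avg_key_size,
                             num_replicas, docs_deleted_per_day,
                             purge_frequency_days, append_multiplier,
                             compression_ratio=1.0):
    s_dataset = num_docs * avg_doc_size * compression_ratio
    s_metadata = num_docs * 56                  # 56 bytes of metadata per document
    s_keys = num_docs * avg_key_size
    s_tombstones = ((avg_key_size + 60) * purge_frequency_days
                    * (num_replicas + 1) * docs_deleted_per_day)
    # Tombstone space is added after the append-only multiplier is applied.
    return ((s_dataset * (num_replicas + 1) + s_metadata + s_keys)
            * append_multiplier) + s_tombstones
----
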
For example, suppose you're planning a cluster with the following characteristics:

* Total number of documents: 1,000,000
* Average document size: 10,000 bytes
* The documents contain JSON data with an estimated compression ratio of 0.7
* Average key size: 32 bytes
* Number of replicas: 1
* Number of documents deleted per day: 5,000
* Purge frequency in days: 3
* Storage engine: Magma

Using the formulas above, you can calculate the total disk space required as follows:

. Calculate the dataset:

+
[stem]
++++
S_{\mathrm{dataset}} = 1,000,000 \times 10,000 \times 0.7 = 7,000,000,000 \text{ bytes}
++++

. Calculate the total metadata size:

+
[stem]
++++
S_{\mathrm{metadata}} = 1,000,000 \times 56 = 56,000,000 \text{ bytes}
++++

. Calculate the total key size:

+
[stem]
++++
S_{\mathrm{keys}} = 1,000,000 \times 32 = 32,000,000 \text{ bytes}
++++

. Calculate the tombstone space:

Contributor (review comment on this step): My calculator doesn't agree with the result in the document for the tombstone space. This is what's in the document:

+
[stem]
++++
S_{\mathrm{tombstones}} = (32 + 60) \times 3 \times (1 + 1) \times 5,000 = 2,760,000 \text{ bytes}
++++

. Calculate the total disk space:

+
[latexmath]
++++
\begin{equation}
\begin{split}
\text{total disk space} = & ( 7,000,000,000 \times (1 + 1) \\
& + 56,000,000 + 32,000,000 ) \\
& \times 2.2 \\
& + 2,760,000 \\
& = 30,996,360,000 \text{ bytes}
\end{split}
\end{equation}
++++

Therefore, for the cluster in this example, you need at least 31{nbsp}GB of disk space to store your data.

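Assuming the `estimate_data_disk_space` sketch shown earlier, plugging in this example's values reproduces the documented total:

[source,python]
----
# Values from the worked example above (Magma, so an append-only multiplier of 2.2).
total = estimate_data_disk_space(
    num_docs=1_000_000,
    avg_doc_size=10_000,
    avg_key_size=32,
    num_replicas=1,
    docs_deleted_per_day=5_000,
    purge_frequency_days=3,
    append_multiplier=2.2,
    compression_ratio=0.7,
)
print(total)  # ~30,996,360,000 bytes, that is, roughly 31 GB
----
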
[#cpu-overhead]
== CPU Overhead

@@ -253,16 +427,16 @@ The following sizing guide can be used to compute the memory requirement for eac
|===
| Input Variable | Value

| [.var]`num_entries` (Number of index entries)
| `num_entries` (Number of index entries)
| 10,000,000

| [.var]`ID_size` (Size of DocumentID)
| `ID_size` (Size of DocumentID)
| 30 bytes

| [.var]`index_entry_size` (Size of secondary key)
| `index_entry_size` (Size of secondary key)
| 50 bytes

| [.var]`working_set_percentage` (Nitro, Plasma, ForestDB)
| `working_set_percentage` (Nitro, Plasma, ForestDB)
| 100%, 20%, 20%
|===

@@ -290,19 +464,19 @@ Based on the provided data, a rough sizing guideline formula would be: | |
| |=== | ||
| | Variable | Calculation | ||
|
|
||
| | [.var]`total_index_data(secondary index)` (Nitro) | ||
| | `total_index_data(secondary index)` (Nitro) | ||
| | `(num_entries) * (metadata_per_entry + ID_size + index_entry_size)` | ||
|
|
||
| | [.var]`total_index_data(secondary index)` (Plasma, ForestDB) | ||
| | `total_index_data(secondary index)` (Plasma, ForestDB) | ||
| | `(num_entries) * (metadata_per_entry + ID_size + index_entry_size) * 2` | ||
|
|
||
| | [.var]`total_index_data(primary index)` (Nitro, Plasma, ForestDB) | ||
| | `total_index_data(primary index)` (Nitro, Plasma, ForestDB) | ||
| | `(num_entries) * (metadata_main_index + ID_size + index_entry_size)` | ||
|
|
||
| | [.var]`index_memory_required(100% resident)` (memdb) | ||
| | `index_memory_required(100% resident)` (memdb) | ||
| | `total_index_data * (1 + overhead_percentage)` | ||
|
|
||
| | [.var]`index_memory_required(20% resident)` (Plasma, ForestDB) | ||
| | `index_memory_required(20% resident)` (Plasma, ForestDB) | ||
| | `total_index_data * (1 + overhead_percentage) * working_set` | ||
| |=== | ||
|
|
||
|
|
@@ -313,22 +487,22 @@ Based on the above formula, these are the suggested sizing guidelines: | |
| |=== | ||
| | Variable | Calculation | ||
|
|
||
| | [.var]`total_index_data(secondary index)` (Nitro) | ||
| | `total_index_data(secondary index)` (Nitro) | ||
| | (10000000) * (120 + 30 + 50) = 2000000000 bytes | ||
|
|
||
| | [.var]`total_index_data(secondary index)` (Plasma) | ||
| | `total_index_data(secondary index)` (Plasma) | ||
| | (10000000) * (120 + 30 + 50) * 2 = 4000000000 bytes | ||
|
|
||
| | [.var]`total_index_data(secondary index)` (ForestDB) | ||
| | `total_index_data(secondary index)` (ForestDB) | ||
| | (10000000) * (80 + 30 + 50) * 2 = 3200000000 bytes | ||
|
|
||
| | [.var]`index_memory_required(100% resident)` (Nitro) | ||
| | `index_memory_required(100% resident)` (Nitro) | ||
| | (2000000000) * (1 + 0.25) = 2500000000 bytes | ||
|
|
||
| | [.var]`index_memory_required(20% resident)` (Plasma) | ||
| | `index_memory_required(20% resident)` (Plasma) | ||
| | (2000000000) * (1 + 0.25) * 0.2 = 1000000000 bytes | ||
|
|
||
| | [.var]`index_memory_required(20% resident)` (ForestDB) | ||
| | `index_memory_required(20% resident)` (ForestDB) | ||
| | (3200000000) * (1 + 0.25) * 0.2 = 800000000 bytes | ||
| |=== | ||
|
|
||
|
|
||
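For completeness, here is a small sketch (again, an editorial illustration) that reproduces the index-memory example above. The per-entry metadata constants (120 bytes for Nitro and Plasma, 80 bytes for ForestDB) and the 25% overhead are read directly from the worked table, not derived here.

[source,python]
----
# Reproduces the index-memory example figures from the table above.
num_entries = 10_000_000
ID_size = 30               # bytes per DocumentID
index_entry_size = 50      # bytes per secondary key
overhead = 0.25

nitro_data = num_entries * (120 + ID_size + index_entry_size)        # 2,000,000,000
plasma_data = num_entries * (120 + ID_size + index_entry_size) * 2   # 4,000,000,000
forestdb_data = num_entries * (80 + ID_size + index_entry_size) * 2  # 3,200,000,000

nitro_mem = nitro_data * (1 + overhead)              # 100% resident: 2,500,000,000
plasma_mem = plasma_data * (1 + overhead) * 0.2      # 20% resident:  1,000,000,000
forestdb_mem = forestdb_data * (1 + overhead) * 0.2  # 20% resident:    800,000,000
----
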
Review comment:

The formula is not correct.
The tombstone size (S_tombstones) is not multiplied by the append-only multiplier. The formula is:

Data storage needed = Tombstone space + ((A + B + C + D) * multiplier)

(Apologies that it wasn't clear -- I've clarified in my document.)

So, the formula in the document should be as follows.

Instead of:

total disk space = ((S_dataset × (# replicas + 1) + S_metadata + S_keys + S_tombstones) × F_append-multiplier)

It should be:

total disk space = ((S_dataset × (# replicas + 1) + S_metadata + S_keys) × F_append-multiplier) + S_tombstones
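For reference, a quick numeric check of the two orderings, using the figures from the worked example in this diff (an editorial sketch, not part of the PR):

```python
# Figures from the worked example: S_dataset, S_metadata, S_keys, S_tombstones.
s_dataset, s_metadata, s_keys, s_tombstones = 7_000_000_000, 56_000_000, 32_000_000, 2_760_000
replicas, multiplier = 1, 2.2

# Tombstones inside the multiplier (the ordering the comment flags as incorrect):
inside = (s_dataset * (replicas + 1) + s_metadata + s_keys + s_tombstones) * multiplier
# Tombstones added afterwards (the corrected ordering, as now used in the diff):
outside = (s_dataset * (replicas + 1) + s_metadata + s_keys) * multiplier + s_tombstones

print(inside)   # ~30,999,672,000 bytes
print(outside)  # ~30,996,360,000 bytes -- matches the worked example
```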