README.rst (32 additions & 9 deletions)
@@ -5,9 +5,13 @@ pyFileFixity
 
 |Build-Status| |Coverage|
 
-This project aims to provide a set of open source, cross-platform, easy
+pyFileFixity provides a suite of open source, cross-platform, easy
 to use and easy to maintain (readable code) to protect and manage data
-for long term storage. The project is done in pure-Python to meet those criteria.
+for long term storage/archival, and also test the performance of any data protection algorithms.
+
+The project is done in pure-Python to meet those criteria,
+although cythonized extensions are available for core routines to speed up encoding/decoding,
+but always with a pure python specification available so as to allow long term replication.
 
 Here is an example of what pyFileFixity can do:
 
@@ -104,20 +108,20 @@ Note: this also works for a single file, just replace "your_folder" by "your_fil
 
 - DEPRECATED (because Gooey is not maintained anymore it seems): To use the GUI with any tool, use ``--gui`` and do not supply any other argument, eg: ``python rfigc.py --gui``.
 
-- You can also use `PyPy <http://pypy.org/>`_ to hugely speedup the processing time of any tool here.
+- You can also use `PyPy <http://pypy.org/>`_ or Cython to hugely speedup the processing time of any tool here.
 
 The problem of long term storage
 --------------------------------
 
-Why are data corrupted with time? Entropy, my friend, entropy.
+Why are data corrupted with time? One sole reason: entropy.
 Entropy refers to the universal tendency for systems to become
-less ordered over time. Corruption is exactly that: a disorder
+less ordered over time. Data corruption is exactly that: a disorder
 in bits order. In other words: *the Universe hates your data*.
 
 Long term storage is thus a very difficult topic: it's like fighting with
 death (in this case, the death of data). Indeed, because of entropy,
 data will eventually fade away because of various silent errors such as
-bit rot. pyFileFixity aims to provide tools to detect any data
+bit rot or cosmic rays. pyFileFixity aims to provide tools to detect any data
 corruption, but also fight data corruption by providing repairing tools.
 
 The only solution is to use a principle of engineering that is long
@@ -178,6 +182,15 @@ corruption, so that you can process it by your own means if you want to,
 without having to study for hours how the code works (contrary to PAR2
 format).
 
+In practice, both approaches are not exclusive, and the best is to
+combine them: protect the most precious data with error correction codes,
+then duplicate them across multiple storage mediums. Hence, this suite of
+data protection tools, just like any other such suite, is not sufficient to
+guarantee your data is protected, you must have an active data curation strategy
+which includes regularly checking your data and replacing copies that are damaged.
+
+For a primer on storage mediums and data protection strategies, see `this post I wrote <https://web.archive.org/web/20220529125543/https://superuser.com/questions/374609/what-medium-should-be-used-for-long-term-high-volume-data-storage-archival/873260>`_.
+
 Why not just use RAID ?
 -----------------------
 
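The curation routine described in the paragraph added by this hunk (regularly check each copy, replace the damaged ones) can be sketched with the standard library alone. This is a minimal illustration and not pyFileFixity's own code: the helper names ``sha256_of`` and ``check_and_repair`` are hypothetical, and the sketch assumes a known-good SHA-256 digest was recorded when the file was archived. It covers only the duplication half of the strategy; error correction codes would be needed to repair a file when every copy is damaged.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_and_repair(copies, expected_digest):
    """Overwrite damaged or missing copies with an intact one.

    `copies` lists the paths of the duplicates (hypothetical layout: one per
    storage medium). Returns the list of paths that had to be replaced.
    Raises RuntimeError when no copy matches the recorded digest.
    """
    copies = [Path(p) for p in copies]
    intact = [p for p in copies if p.is_file() and sha256_of(p) == expected_digest]
    if not intact:
        raise RuntimeError("no intact copy left; duplication alone cannot repair this")
    damaged = [p for p in copies if p not in intact]
    for path in damaged:
        # Replace the damaged copy with the first verified-good one.
        shutil.copyfile(intact[0], path)
    return damaged
```

Running such a check on a schedule is what turns passive duplication into the active curation the paragraph asks for: a silent error in one copy is caught and fixed while another intact copy still exists.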
@@ -645,10 +658,20 @@ Cython implementation
 ---------------------
 
 This section describes how to use the Cython implementation. However,
-you should first try PyPy, as it did give 10x to 100x speedup over
-Cython in our case.
+you should first try PyPy, as it may give great performances too.
+
+Simply follow the instruction to install the `reedsolo <https://github.com/tomerfiliba/reedsolomon/releases/tag/v2.0.5>`_ module with