HashFS
|version| |travis| |coveralls| |license|
HashFS is a content-addressable file management system. What does that mean? Simply, that HashFS manages a directory where files are saved based on the file's hash.
Typical use cases for this kind of system are ones where:
n
number of characters.hashlib.new
.Install using pip:
::
pip install hashfs
.. code-block:: python
from hashfs import HashFS
Designate a root folder for HashFS
. If the folder doesn't already exist, it will be created.
.. code-block:: python
# Set the `depth` to the number of subfolders the file's hash should be split when saving.
# Set the `width` to the desired width of each subfolder.
fs = HashFS('temp_hashfs', depth=4, width=1, algorithm='sha256')
# With depth=4 and width=1, files will be saved in the following pattern:
# temp_hashfs/a/b/c/d/efghijklmnopqrstuvwxyz
# With depth=3 and width=2, files will be saved in the following pattern:
# temp_hashfs/ab/cd/ef/ghijklmnopqrstuvwxyz
NOTE: The algorithm
value should be a valid string argument to hashlib.new()
.
HashFS
supports basic file storage, retrieval, and removal as well as some more advanced features like file repair.
Add content to the folder using either readable objects (e.g. StringIO
) or file paths (e.g. 'a/path/to/some/file'
).
.. code-block:: python
from io import StringIO
some_content = StringIO('some content')
address = fs.put(some_content)
# Or if you'd like to save the file with an extension...
address = fs.put(some_content, '.txt')
# The id of the file (i.e. the hexdigest of its contents).
address.id
# The absolute path where the file was saved.
address.abspath
# The path relative to fs.root.
address.relpath
# Whether the file previously existed.
address.is_duplicate
Get a file's HashAddress
by address ID or path. This address would be identical to the address returned by put()
.
.. code-block:: python
assert fs.get(address.id) == address
assert fs.get(address.relpath) == address
assert fs.get(address.abspath) == address
assert fs.get('invalid') is None
Get a BufferedReader
handler for an existing file by address ID or path.
.. code-block:: python
fileio = fs.open(address.id)
# Or using the full path...
fileio = fs.open(address.abspath)
# Or using a path relative to fs.root
fileio = fs.open(address.relpath)
NOTE: When getting a file that was saved with an extension, it's not necessary to supply the extension. Extensions are ignored when looking for a file based on the ID or path.
Delete a file by address ID or path.
.. code-block:: python
fs.delete(address.id)
fs.delete(address.abspath)
fs.delete(address.relpath)
NOTE: When a file is deleted, any parent directories above the file will also be deleted if they are empty directories.
Below are some of the more advanced features of HashFS
.
The HashFS
files may not always be in sync with it's depth
, width
, or algorithm
settings (e.g. if HashFS
takes ownership of a directory that wasn't previously stored using content hashes or if the HashFS
settings change). These files can be easily reindexed using repair()
.
.. code-block:: python
repaired = fs.repair()
# Or if you want to drop file extensions...
repaired = fs.repair(extensions=False)
WARNING: It's recommended that a backup of the directory be made before repairing just in case something goes wrong.
Instead of actually repairing the files, you can iterate over them for custom processing.
.. code-block:: python
for corrupted_path, expected_address in fs.corrupted():
# do something
WARNING: HashFS.corrupted()
is a generator so be aware that modifying the file system while iterating could have unexpected results.
Iterate over files.
.. code-block:: python
for file in fs.files():
# do something
# Or using the class' iter method...
for file in fs:
# do something
Iterate over folders that contain files (i.e. ignore the nested subfolders that only contain folders).
.. code-block:: python
for folder in fs.folders():
# do something
Compute the size in bytes of all files in the root
directory.
.. code-block:: python
total_bytes = fs.size()
Count the total number of files.
.. code-block:: python
total_files = fs.count()
# Or via len()...
total_files = len(fs)
For more details, please see the full documentation at http://hashfs.readthedocs.org.
.. |version| image:: http://img.shields.io/pypi/v/hashfs.svg?style=flat-square :target: https://pypi.python.org/pypi/hashfs/
.. |travis| image:: http://img.shields.io/travis/dgilland/hashfs/master.svg?style=flat-square :target: https://travis-ci.org/dgilland/hashfs
.. |coveralls| image:: http://img.shields.io/coveralls/dgilland/hashfs/master.svg?style=flat-square :target: https://coveralls.io/r/dgilland/hashfs
.. |license| image:: http://img.shields.io/pypi/l/hashfs.svg?style=flat-square :target: https://pypi.python.org/pypi/hashfs/
distutils.dir_util.mkpath
with os.path.makedirs
.shutil.move
instead of shutil.copy
to move temporary file created during put
operation to HashFS
directory.scandir
package for iterating over files/folders when platform is Python < 3.5. Scandir implementation was added to os
module starting with Python 3.5.HashFS.copy
to HashFS._copy
.is_duplicate
attribute to HashAddress
.HashFS.put()
return HashAddress
with is_duplicate=True
when file with same hash already exists on disk.HashFS.size()
method that returns the size of all files in bytes.HashFS.count()
/HashFS.__len__()
methods that return the count of all files.HashFS.__iter__()
method to support iteration. Proxies to HashFS.files()
.HashFS.__contains__()
method to support in
operator. Proxies to HashFS.exists()
.HashFS.repair()
not using extensions
argument properly.HashFS.length
parameter/property to width
. (breaking change)HashFS.get
to HashFS.open
. (breaking change)HashFS.get()
method that returns a HashAddress
or None
given a file ID or path.HashFS.get()
method that retrieves a reader object given a file ID or path.HashFS.delete()
method that deletes a file ID or path.HashFS.folders()
method that returns the folder paths that directly contain files (i.e. subpaths that only contain folders are ignored).HashFS.detokenize()
method that returns the file ID contained in a file path.HashFS.repair()
method that reindexes any files under root directory whose file path doesn't not match its tokenized file ID.Address
classs to HashAddress
. (breaking change)HashAddress.digest
to HashAddress.id
. (breaking change)HashAddress.path
to HashAddress.abspath
. (breaking change)HashAddress.relpath
which represents path relative to HashFS.root
.HashFS
class.HashFS.put()
method that saves a file path or file-like object by content hash.HashFS.files()
method that returns all files under root directory.HashFS.exists()
which checks either a file hash or file path for existence.Version | Tag | Published |
---|---|---|
0.7.2 | 3yrs ago | |
0.7.1 | 4yrs ago | |
0.7.0 | 6yrs ago | |
0.6.0 | 7yrs ago |