TLSH is a fuzzy matching library. Given a byte stream with a minimum length of 50 bytes, TLSH generates a hash value that can be used for similarity comparisons. Similar objects have similar hash values, so similar objects can be detected by comparing their hash values. Note that the byte stream should have a sufficient amount of complexity. For example, a byte stream of identical bytes will not generate a hash value.
Release version 4.7.1. Updated Python release with additional functions:
Release version 4.6.0. Issue 99 raised the question of how to evaluate the TLSH of files over 4GB. We decided to define the TLSH of such a file as the TLSH of its first 4GB.
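The 4GB rule can be pictured as truncating the input before it is hashed. The sketch below is only an illustration of that idea, not the library's implementation: the helper name iter_capped_chunks, the chunk size, and the reading of "4GB" as 4 * 1024^3 bytes are all assumptions made for the example.

```python
FOUR_GB = 4 * 1024 ** 3  # assumption: "4GB" read as 4 GiB


def iter_capped_chunks(path, limit=FOUR_GB, chunk_size=64 * 1024):
    """Yield successive chunks of a file, stopping once `limit` bytes were read."""
    remaining = limit
    with open(path, "rb") as f:
        while remaining > 0:
            # Never read past the cap, even on the last chunk.
            buf = f.read(min(chunk_size, remaining))
            if not buf:  # end of file reached before the cap
                break
            remaining -= len(buf)
            yield buf
```

Each chunk could then be fed to the update method of a tlsh.Tlsh() object, in the same way as the streaming example in the Python section of this README.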
We have written technical material that focuses on 2 topics at https://tlsh.org
TLSH has gained some traction. It has been included in STIX 2.1 and has been ported to a number of languages.
We are adding a version identifier ("T1") to the start of the digest so that we can clearly distinguish between different variants of the digest (such as non-standard choices of 3 byte checksum). This means that we do not rely on the length of the hex string to determine whether a hex string is a TLSH digest (this is a brittle method for identifying TLSH digests). We are doing this to enable compatibility, especially backwards compatibility, of the TLSH approach.
This release will add "T1" to the start of TLSH digests. The code is backwards compatible: it can still read and interpret 70 hex character strings as TLSH digests, and data sets can include a mix of old and new digests. If you need old style TLSH digests to be output, then use the command line option '-old'.
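When a data set mixes old and new digests, it can help to bring them into one form before storing or comparing them as strings. The following is a minimal sketch under stated assumptions: the helper names classify_digest and to_t1 are hypothetical and not part of TLSH, and the lengths (70 and 72 characters) apply only to the default digest described in this README.

```python
import string

HEX_CHARS = set(string.hexdigits)


def classify_digest(digest):
    """Return 'T1' for a prefixed digest, 'old' for a bare 70-hex digest, else None."""
    if len(digest) == 72 and digest.startswith("T1") and set(digest[2:]) <= HEX_CHARS:
        return "T1"
    # Legacy digests carry no marker, so length plus hex content is all we can check.
    if len(digest) == 70 and set(digest) <= HEX_CHARS:
        return "old"
    return None


def to_t1(digest):
    """Return the digest in 'T1' form, adding the prefix to an old-style digest."""
    kind = classify_digest(digest)
    if kind is None:
        raise ValueError("not a recognised TLSH digest")
    return digest if kind == "T1" else "T1" + digest
```

Because 'T' is not a hexadecimal character, a prefixed digest can never be mistaken for an old-style one in this check.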
Thanks to Chun Cheng, who was a humble and talented engineer.
The program in default mode requires an input byte stream with a minimum length of 50 bytes (and a minimum amount of randomness - see note in Python extension below).
For consistency with older versions, there is a -conservative option which enforces a 256 byte limit. See notes for version 3.17.0 of TLSH
The computed hash is 35 bytes of data (output as 'T1' followed by 70 hexadecimal characters, for a total length of 72 characters). The 'T1' has been added as a version number for the hash, so that we can adapt the algorithm and still maintain backwards compatibility. To get the old style 70 hex hashes, use the -old command line option.
Bytes 3,4,5 are used to capture the information about the file as a whole (length, ...), while the last 32 bytes are used to capture information about incremental parts of the file. (Note that the length of the hash can be increased by changing build parameters described below in CMakeLists.txt, which will increase the information stored in the hash. For some applications this might increase the accuracy in predicting similarities between files.)
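The layout described above can be made concrete by decoding a digest into its two parts. This is a sketch, not library code: split_digest is a hypothetical helper, it assumes the default 35 byte digest, and it reads the description as "three whole-file bytes directly after the 'T1' marker, followed by 32 body bytes", which is an interpretation of the text rather than a documented API.

```python
def split_digest(digest):
    """Split a default-build digest into (header, body) byte strings."""
    # Accept both the 'T1' form and the old 70 hex character form.
    hex_part = digest[2:] if digest.startswith("T1") else digest
    raw = bytes.fromhex(hex_part)
    if len(raw) != 35:
        raise ValueError("expected 35 bytes of digest data, got %d" % len(raw))
    header, body = raw[:3], raw[3:]  # 3 whole-file bytes, then 32 body bytes
    return header, body
```

A build with a longer hash (see CMakeLists.txt) would need a different length check.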
Building TLSH (see below) will create a static library in the
tlsh executable (a symbolic link to
'tlsh' links to the static library, in the
The library has functionality to generate the hash value from a given
file, and to compute the similarity between two hash values.
tlsh is a utility for generating TLSH hash values and comparing TLSH
hash values to determine similarity. Run it with no parameters for detailed usage.
We list these ports just for reference. We have not checked the code in these repositories, and we have not checked that the results are identical to TLSH here. We also request that any ports include the files LICENSE and NOTICE.txt exactly as they appear in this repository.
Download TLSH as follows:
    wget https://github.com/trendmicro/tlsh/archive/master.zip -O master.zip
    unzip master.zip
    cd tlsh-master

or

    git clone git://github.com/trendmicro/tlsh.git
    cd tlsh
    git checkout master
Edit CMakeLists.txt to build TLSH with different options.
Note: Building TLSH on Linux depends upon cmake to create the Makefile and then make the project, so the build will fail if cmake is not installed.
Added in March 2020. See the instructions in README.mingw
If you need to build your own Python package, then there is a README.python with notes about the python version
    (1) compile the C++ code
        $ ./make.sh
    (2) build the python version
        $ cd py_ext/
        $ python ./setup.py build
    (3) install - possibly - sudo, run as root or administrator
        $ python ./setup.py install
    (4) test it
        $ cd ../Testing
        $ ./python_test.sh
    import tlsh
    tlsh.hash(data)
Note data needs to be bytes - not a string. This is because TLSH is for binary data and binary data can contain a NULL (zero) byte.
In default mode the data must contain at least 50 bytes and must have a certain amount of randomness to generate a hash value. To get the hash value of a file, try
Note: the open statement has opened the file in binary mode.
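A minimal sketch of hashing a whole file is shown here. The helper hash_file and its hasher parameter are hypothetical names introduced for illustration; only tlsh.hash comes from the module described above. The hasher argument exists so the helper can be tried with a stand-in function when the tlsh extension is not installed.

```python
def hash_file(path, hasher=None):
    """Read a file in binary mode and return its hash as computed by `hasher`."""
    if hasher is None:
        import tlsh  # imported lazily so a stand-in hasher can be used instead
        hasher = tlsh.hash
    with open(path, "rb") as f:  # 'rb' yields bytes, which is what TLSH expects
        return hasher(f.read())
```

This reads the entire file into memory, so for large files the incremental tlsh.Tlsh() interface with update is likely the better fit.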
    import tlsh

    h1 = tlsh.hash(data)
    h2 = tlsh.hash(similar_data)
    score = tlsh.diff(h1, h2)

    h3 = tlsh.Tlsh()
    with open('file', 'rb') as f:
        for buf in iter(lambda: f.read(512), b''):
            h3.update(buf)
        h3.final()
    # this assertion is stating that the distance between a TLSH and itself must be zero
    assert h3.diff(h3) == 0
    score = h3.diff(h1)
The diffxlen function removes the file length component of the TLSH header from the comparison. If a file with a repeating pattern is compared to a file with only a single instance of the pattern, then the difference will be increased if the file length is included. Using the diffxlen function removes the file length from consideration.
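To see how much of a score comes from the length component, both scores can be computed side by side. This is a sketch under assumptions: compare_with_and_without_length is a hypothetical helper, and calling diffxlen with two digest strings is assumed by analogy with tlsh.diff(h1, h2); check the installed module before relying on it. The tlsh_module parameter only exists so the helper can be exercised without the extension installed.

```python
def compare_with_and_without_length(h1, h2, tlsh_module=None):
    """Return (score including length, score excluding length) for two digests."""
    if tlsh_module is None:
        import tlsh as tlsh_module  # real module, imported only when needed
    with_length = tlsh_module.diff(h1, h2)
    without_length = tlsh_module.diffxlen(h1, h2)  # assumed two-argument form
    return with_length, without_length
```

A large gap between the two values suggests that the files differ mainly in size rather than in content.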
If you use the "conservative" option, then the data must contain at least 256 bytes. For example,
    import os
    tlsh.conservativehash(os.urandom(256))
should generate a hash, but
will generate TNULL as it is less than 256 bytes.
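The two length limits can be checked before calling the hash functions, which avoids handling TNULL for inputs that are simply too short. The helper below is a hypothetical sketch, not part of the tlsh module; the limits of 50 and 256 bytes are the ones stated in this README.

```python
DEFAULT_MIN_BYTES = 50        # default mode
CONSERVATIVE_MIN_BYTES = 256  # "conservative" option


def long_enough(data, conservative=False):
    """Return True if `data` meets the documented minimum length for hashing."""
    minimum = CONSERVATIVE_MIN_BYTES if conservative else DEFAULT_MIN_BYTES
    return len(data) >= minimum
```

Passing this check does not guarantee a hash: the data must also have a certain amount of randomness, which this sketch does not test.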
If you need to generate old style hashes (without the "T1" prefix) then use
The old and conservative options may be combined:
TLSH similarity is expressed as a difference score:
13/09/2021
    added options -thread and -private
    -thread     the TLSH is evaluated with 2 threads (faster calculation)
                Only done for files / bytestreams >= 10000 bytes
                But this means that it is impossible to calculate the checksum
                So the checksum is set to zero
    -private    Does not evaluate the checksum
                Useful if you do not want to leak information
                Slightly faster than default TLSH (code was written to optimize this)

    Timing (using the utility provided "timing_unittest") : (On Mac 2.3 GHz)
    Byte size: 1 million bytes

    eval TLSH DEFAULT (4.9.3 compact hash 1 byte checksum sliding_window=5) 500 times...
    TLSH(buffer) = T1A12500088C838B0A0F0EC3C0ACAB82F3B8228B0308CFA302338C0F0AE2C24F28000008
    BEFORE  ms=1631512230350
    AFTER   ms=1631512234041
    TIME    ms=3691
    TIME    ms=      7.38   per iteration

    eval TLSH THREADED (4.9.3 compact hash 1 byte checksum sliding_window=5) 500 times...
    TLSH(buffer) = T1002500088C838B0A0F0EC3C0ACAB82F3B8228B0308CFA302338C0F0AE2C24F28000008
    BEFORE  ms=1631512234041
    AFTER   ms=1631512236464
    TIME    ms=2423
    TIME    ms=      4.85   per iteration

    eval TLSH PRIVATE (4.9.3 compact hash 1 byte checksum sliding_window=5) 500 times...
    TLSH(buffer) = T1002500088C838B0A0F0EC3C0ACAB82F3B8228B0308CFA302338C0F0AE2C24F28000008
    BEFORE  ms=1631512236464
    AFTER   ms=1631512239485
    TIME    ms=3021
    TIME    ms=      6.04   per iteration

    eval TLSH distance 50 million times...
    Test 2: Calc distance TLSH digest
    dist=138
    BEFORE  ms=1631512239500
    AFTER   ms=1631512240550
    TIME    ms=1050
    TIME    ms=     21.00   per million iterations
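The per-iteration figures follow directly from the BEFORE and AFTER timestamps. The short sketch below recomputes them from the numbers in the timing output; per_iteration_ms is a hypothetical helper written for this illustration and is not part of timing_unittest.

```python
def per_iteration_ms(before_ms, after_ms, iterations):
    """Average time per iteration, in milliseconds."""
    return (after_ms - before_ms) / iterations


# Timestamps copied from the timing output for the three hashing modes.
default_ms = per_iteration_ms(1631512230350, 1631512234041, 500)
threaded_ms = per_iteration_ms(1631512234041, 1631512236464, 500)
private_ms = per_iteration_ms(1631512236464, 1631512239485, 500)

for name, value in [("default", default_ms),
                    ("threaded", threaded_ms),
                    ("private", private_ms)]:
    print("%-8s %.2f ms per iteration" % (name, value))

# Ratio of default time to each variant's time on this one run.
print("threaded speedup: %.2fx" % (default_ms / threaded_ms))
print("private speedup:  %.2fx" % (default_ms / private_ms))
```

On this single run the threaded variant is roughly 1.5 times faster than the default and the private variant roughly 1.2 times faster. These are measurements from one machine, not guarantees.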