Pytorch Correlation module

This is a custom C++/CUDA implementation of the Correlation module, used e.g. in FlowNetC.

This tutorial was used as a basis for the implementation, as well as NVIDIA's CUDA code.

  • Build and install the C++ and CUDA extensions by executing python setup.py install,
  • Benchmark C++ vs. CUDA by running python benchmark.py {cpu, cuda},
  • Run gradient checks on the code by running python grad_check.py --backend {cpu, cuda}.

Requirements

This module is expected to compile for Pytorch 1.6.

Installation

This module is available via pip:

pip install spatial-correlation-sampler

For a CPU-only version, you can install from source with

python setup_cpu.py install

Known Problems

This module needs a compatible gcc version and CUDA toolkit to compile. Namely, CUDA 9.1 and below will need gcc 5, while CUDA 9.2 and 10.0 will need gcc 7. See this issue for more information.
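
If your default compiler is incompatible, one possible workaround (an illustration, not from this README) is to point the build at a matching host compiler through the CC environment variable, which the setuptools build honors on Unix; the compiler name below depends on your distribution:

CC=gcc-7 python setup.py install  # gcc-7 is illustrative; use whatever matching compiler your system provides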

Usage

The API has a few differences with NVIDIA's module:

  • The output is now a 5D tensor, which reflects the horizontal and vertical shifts.
    input (B x C x H x W) -> output (B x PatchH x PatchW x oH x oW)

  • Output sizes oH and oW are no longer dependent on patch size, but only on kernel size and padding.
  • patch_size is now the whole patch, and not only the radius.
  • stride1 is now stride and stride2 is dilation_patch, which behave like dilated convolutions.
  • The equivalent max_displacement is then dilation_patch * (patch_size - 1) / 2.
  • dilation is a new parameter; it acts the same way as a dilated convolution regarding the correlation kernel.
  • To get the right parameters for FlowNetC, you would have (see the shape check after this list):
    kernel_size=1,
    patch_size=21,
    stride=1,
    padding=0,
    dilation=1,
    dilation_patch=2
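
A minimal shape check with these FlowNetC parameters, assuming only the shape behavior described above: kernel_size=1, stride=1 and padding=0 keep oH = H and oW = W, so the output is (B, patch_size, patch_size, H, W).

import torch
from spatial_correlation_sampler import SpatialCorrelationSampler

# FlowNetC-style sampler; the equivalent max_displacement is
# dilation_patch * (patch_size - 1) / 2 = 2 * 20 / 2 = 20
sampler = SpatialCorrelationSampler(
    kernel_size=1, patch_size=21, stride=1,
    padding=0, dilation=1, dilation_patch=2)

input1 = torch.randn(4, 256, 48, 64)
input2 = torch.randn(4, 256, 48, 64)
out = sampler(input1, input2)
print(out.shape)  # torch.Size([4, 21, 21, 48, 64])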
    

Example

import torch
from spatial_correlation_sampler import SpatialCorrelationSampler, spatial_correlation_sample

device = "cuda"
batch_size = 1
channel = 1
H = 10
W = 10
dtype = torch.float32

input1 = torch.randint(1, 4, (batch_size, channel, H, W), dtype=dtype, device=device, requires_grad=True)
input2 = torch.randint_like(input1, 1, 4).requires_grad_(True)

# You can either use the function or the module. Note that the module doesn't contain any parameter tensor.

# function

out = spatial_correlation_sample(input1,
                                 input2,
                                 kernel_size=3,
                                 patch_size=1,
                                 stride=2,
                                 padding=0,
                                 dilation=2,
                                 dilation_patch=1)

# module

correlation_sampler = SpatialCorrelationSampler(
    kernel_size=3,
    patch_size=1,
    stride=2,
    padding=0,
    dilation=2,
    dilation_patch=1)
out = correlation_sampler(input1, input2)
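
In FlowNetC-style networks, the 5D output is typically flattened into a 4D cost volume before being fed to further convolution layers. A minimal sketch of that post-processing step (an assumption about downstream usage, not part of this package's API):

# Collapse the two displacement dimensions into a single channel axis:
# (B, PatchH, PatchW, oH, oW) -> (B, PatchH * PatchW, oH, oW)
b, ph, pw, oh, ow = out.shape
cost_volume = out.reshape(b, ph * pw, oh, ow)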

Benchmark

  • Default parameters are from benchmark.py; FlowNetC parameters are the same as used in FlowNetC with a batch size of 4, described in this paper, implemented here and here.
  • Feel free to file an issue to add entries to this with your hardware!

CUDA Benchmark

  • See here for a benchmark script working with NVIDIA's code and Pytorch.

  • Benchmarks are launched with the environment variable CUDA_LAUNCH_BLOCKING set to 1.

  • Only float32 is benchmarked.

  • FlowNetC correlation parameters were launched with the following commands:

    CUDA_LAUNCH_BLOCKING=1 python benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256 cuda -d float
    
    CUDA_LAUNCH_BLOCKING=1 python NV_correlation_benchmark.py --scale ms -k1 --patch 21 -s1 -p0 --patch_dilation 2 -b4 --height 48 --width 64 -c256
    
| implementation | Correlation parameters | device  | pass     | min time   | avg time   |
|----------------|------------------------|---------|----------|------------|------------|
| ours           | default                | 980 GTX | forward  | 5.745 ms   | 5.851 ms   |
| ours           | default                | 980 GTX | backward | 77.694 ms  | 77.957 ms  |
| NVIDIA         | default                | 980 GTX | forward  | 13.779 ms  | 13.853 ms  |
| NVIDIA         | default                | 980 GTX | backward | 73.383 ms  | 73.708 ms  |
| ours           | FlowNetC               | 980 GTX | forward  | 26.102 ms  | 26.179 ms  |
| ours           | FlowNetC               | 980 GTX | backward | 208.091 ms | 208.510 ms |
| NVIDIA         | FlowNetC               | 980 GTX | forward  | 35.363 ms  | 35.550 ms  |
| NVIDIA         | FlowNetC               | 980 GTX | backward | 283.748 ms | 284.346 ms |

Notes

  • The overhead of our implementation regarding kernel_size > 1 during backward needs some investigation; feel free to dive into the code to improve it!
  • NVIDIA's backward pass is not entirely correct when stride1 > 1 and kernel_size > 1, because not everything is computed; see here.

CPU Benchmark

  • No other implementation is available on CPU.

  • It is obviously not recommended to run it on CPU if you have a GPU.

| Correlation parameters | device               | pass     | min time   | avg time   |
|------------------------|----------------------|----------|------------|------------|
| default                | E5-2630 v3 @ 2.40GHz | forward  | 159.616 ms | 188.727 ms |
| default                | E5-2630 v3 @ 2.40GHz | backward | 282.641 ms | 294.194 ms |
| FlowNetC               | E5-2630 v3 @ 2.40GHz | forward  | 2.138 s    | 2.144 s    |
| FlowNetC               | E5-2630 v3 @ 2.40GHz | backward | 7.006 s    | 7.075 s    |
