Improving time-of-flight measurements with the shading constraint

This post discusses the paper by Martin Böhme in order to filter range maps (a.k.a. depth map or a depth image) received by a time-of-flight camera by their intensity image (a.k.a. confidence map or near infrared image). There are actually 2 versions of the paper. One from 2008 (link to paper) [1] and one from 2010 (link to paper) [2]. Be sure to always take the 2010 [2] version because there are small differences between the papers.

My main interest was to gain more knowledge of how to solve such problems as stated in this paper. This is the reason that I only will use synthetic data here instead of real recorded data. In this way it is easier to concentrate on learning the tools in order how to solve such a problem without dealing with the problems of real data.

At the bottom of this post you can find a download link to an executable demo with the accompanying source code and data sets.

As a last note: this blog post is not about giving an elaborate scientific explanation about the used methods. But more to give an idea how it is working and a guide of where to find good resources that will give you an in-depth explanation.

Goal

A consumer time-of-flight camera, such as the Creative Senz3D or DepthSense 325,  has on its depth map on average 1,0 cm noise per pixel. Actually the amount can vary from as small as 0,5 cm to as large as 1,5 cm per pixel. So it speaks for itself that a lot of detail is lost.

Therefore in order to reduce the noise and recover small details in the range map, the paper makes use of the observation that the range map and the intensity image are linked to each other by the shading constraint. Meaning, if the reflectance properties of the measured surface are known that a certain range map implies a corresponding intensity image. Thus if we modify the range map so that it corresponds with the measured intensity image we can recover lost details that are observed in the intensity image but were lost in the range map due to noise.

The images below were taken with a Creative Senz3D camera. If you look at the curved surface in the center of the image you can clearly see where the paper aims at. The curved surface has the same reflectance properties everywhere because it is a homogeneous non-specular material without any texture on it. It is clearly noticeable that the intensities of the intensity image (or confidence map) give us some clue about the shape of the surface. On the other hand, at the left side you can observe a book with a texture on it. This surface has variable reflectance properties and thus is not our focus at the moment.

The used technique in order to filter with the shading constraint is a variational method. This means we will construct a cost function (a.k.a. energy or objective function) and try to find the minimum of this function. The big difference with traditional optimization is that we want to minimize a whole range of variables, a function you can say. So an image, such as the range map, is viewed as a function whose sampling corresponds to the discrete matrix form of a given image [3]. If for example the range map has dimensions of 320 x 240 pixels then we need to find 76800 optimal values (a.k.a. unknowns) that together minimize the cost function. This number is big!

Objective Function

$latex E^{G}(\textbf{R,A}) = \displaystyle \sum_{j} \frac{(X_j^I – I_j(\textbf{R,A}))^2}{2\sigma_I^2}&fg=404040$

With $latex j &fg=404040$ going through all the pixels.  $latex X_j^I&fg=404040$ is the pixel value of our original measured intensity image. And $latex I_j(\textbf{R,A})&fg=404040$ is our calculated intensity based on the albedo and depth. So this constraint ensures that our depth corresponds with the intensity image. $latex \sigma_I &fg=404040$ is the standard deviation of the noise on the intensity image. Actually you just have to interpret the whole $latex 2\sigma_I^2&fg=404040$ term as a weight for this constraint.

In order to calculate the intensity the Lambertian model of diffuse reflection is applied:

$latex I_j = \displaystyle a \frac{\textbf{n} \cdot \textbf{l}}{r^2}&fg=404040$

Where $latex \textbf{n} &fg=404040$ is the surface normal, $latex \textbf{l} &fg=404040$ is the unit vector from the surface point towards the light source, $latex r&fg=404040$ is the distance of the surface point to the light source, and $latex a&fg=404040$ is a constant that depends on the albedo of the surface, the intensity of the light source and properties of the camera such as aperture and exposure time.  For brevity, we will refer to $latex a&fg=404040$ as the albedo, exactly like what the paper does [2].

This shading gradient constraint is ill-posed and the following constraints will introduce some form of regularization.

Shape prior or smoothness constraint

$latex E^{S}(\textbf{R})=w_S\displaystyle\sum_{\substack{pixels\, j,k \\ adjacent}}\|n_j-n_k\| &fg=404040$

Meaning that we want that our neighboring normals are similar to the normal of our current pixel. It should ensure a smooth varying range map. The constant $latex w_S&fg=404040$ is a weight that we can use of how hard this constraint should influence on the total result.

Depth constraint

$latex E^{D}(\textbf{R}) = \displaystyle \sum_{j} \frac{(X_j^R – R_j)^2}{2\sigma_R^2}&fg=404040$

With $latex X_j^R&fg=404040$ as our original measured range value and $latex R_j&fg=404040$ as our current range value of this pixel.

The constraint ensures that we don’t deviate too much from the original range map. $latex \sigma_R &fg=404040$ is the standard deviation of the noise on the range map and can just be interpreted as a weight.

Albedo constraint

$latex E^A(\textbf{A})=w_A \displaystyle \sum_{\substack{pixels\, j,k \\ adjacent}}|a_j-a_k| &fg=404040$

The paper also mentions a constraint if a varying albedo is used. In our case we will work on synthetic data and we know our albedo values. Therefore we can ignore this constraint.

Full cost function

If we combine these constraints it will give us:

$latex E(\textbf{R,A})=E^{G}+E^{S}+E^{D} \\ \\ E(\textbf{R,A})=\displaystyle\sum_{j} \frac{(X_j^I – I_j(\textbf{R,A}))^2}{2\sigma_I^2}+w_S\sum_{\substack{pixels\, j,k \\ adjacent}}\|n_j-n_k\|+\sum_{j}\frac{(X_j^R – R_j)^2}{2\sigma_R^2}&fg=404040$

If we can find the minimum of this function we will have our optimal filtered range map.

Minimizing the objective function

Now we have our objective function and the goal is to minimize this function. How do we do this?

The first step is to understand how to solve a simple linear variational problem. For this I want to refer to the “Variational Methods for Computer Vision” YouTube lectures of Prof. Danial Cremer from the University of München. In particular Lecture 5 and Lecture 6. In Lecture 5 he gives a very basic example of a linear variational problem for image denoising with some iterative methods for solving such problems. The video course is very interesting but it is more focused on demonstrating all the possible applications of variational methods. And not so much on explaining how to solve all those particular problems in practice. Therefore you need to search for other resources such as numerical computing courses. One big tip: there is a way in YouTube for increasing the playback speed because the professor talks a little bit slow…

After these lectures you will understand how to solve a linear problem. But is our problem also linear?

If we would take the first derivative of our objective function we would see that our function is linear in the normal vector variables. But not in the depth variables. This because for example the calculation of the normals in our case is not a linear process due to the normalization process of the normal vector. And thus we are dealing with a non-linear objective function.

The main idea for solving non-linear functions is to simplify the function. Find the minimum of this simplified function and repeat this process until we find a local minimum of our non-linear function. Generally such methods can be split up in 2 categories: first-order and second-order methods. A first-order method approximates the non-linear function with a first-order Taylor expansion (so with the first derivative) and the latter one with a second-order Taylor expansion (so up to the second derivative, meaning a quadratic function).
For such methods you always need a good initial guess that is relatively close to the actual solution. In our case this guess will be the noisy measured depth map.

The paper suggests using the “Polak-Ribière variant of the non-linear conjugate gradient algorithm” and this is what I’ve also used. It’s a first order method and the advantage is that it is very easy to implement. The disadvantage is that it doesn’t run so fast. The reason for this is that we need to do a lot of cost function evaluations due to the non-linear property of our problem. Later you will also see that the evaluation of the gradient (a.k.a. the vector of the partial derivatives) of the cost function is not optimal. More recent papers such as [5] use second-order methods and can get a real-time experience while our method will take seconds or even minutes.

The best theoretical resource for understanding the conjugate gradient is “An introduction to the conjugate gradient method without the agonizing pain.” of Shewchuk J.R. [6]. A more practical explanation and implementation can be found in the book “Numerical Recipes” [4].

In order to understand the conjugate gradient method one must understand the gradient descent method (a.k.a. method of steepest descent). As I quote from Wikipedia [7]: “Gradient descent is based on the observation that if the multivariable function $latex F(x)&fg=404040$ is defined and differentiable in a neighborhood of a point $latex a &fg=404040$, then $latex F(x)&fg=404040$ decreases fastest if one goes from a in the direction  of the negative gradient of $latex F&fg=404040$ at $latex a&fg=404040$, $latex -\nabla F(a)&fg=404040$”. So basically the gradient descent is nothing more then calculating the gradient at the current estimate. Then trying to find the minimum of the cost function along the direction of the negative gradient. This minimum will be our new estimate and we repeat the process until the algorithm converges.

The problem with the gradient descent method is that in certain circumstances it will take very inefficient search directions. Also known as zig-zagging. This is what the conjugate gradient method tries to solve. Imagine you want to solve the linear system Ax=b. And if you look at image below of the gradient descent process on this linear system you can notice that all the succeeding search directions are orthogonal to each other. The conjugate gradient method redefines this. And aims at making the search directions A-orthogonal instead of just orthogonal. Meaning that the search directions are orthogonal in the field of A. Actually it defines that a new search direction is A-orthogonal with all the preceding search directions. Mathematically two vectors $latex d_{(i)}&fg=404040$ and $latex d_{(j)}&fg=404040$ are A-orthogonal, or conjugate, if

$latex d_{(i)}^TAd_{(j)}=0&fg=404040$

[6] gives a good tip on how to imagine the meaning of A-orthogonality. Imagine that the right picture below of the conjugate gradient was printed on bubble gum and you stretch it until the ellipses appear circular. The vectors would then appear orthogonal just exactly as the left picture of the gradient descent.
An interesting property of the conjugate gradient method is that it ensures that it will converge after n iterations. With n as the number of dimensions of our linear system. If our problem was linear and if our range map resolution would be 320 x 240 then the conjugate gradient method ensures that it will guarantee after 76800 iterations. In practice the algorithm will converge much sooner.

For non-linear problems, such as ours, this A-orthogonal property for the search directions cannot be defined as exactly as for linear problems. In order to compute the next search direction researchers have come up with several choices. For example there is the Flecther-Reeves method, the Polak-Ribière method etc. It is up to the user in order to select the most appropriate choice for his or her problem. For this problem I’ve followed [2] which uses the “Polak-Ribière method” and is well explained in the resources mentioned before [4][6].

Because we are dealing with a non-linear function we cannot just calculate where the minimum is on the search direction. Therefore we have to evaluate the cost function at different points on the search direction in order to find the minimum. This frequent evaluation is the main factor that makes our algorithm slow.

For a more in depth explanation about these methods see [4] and [6].

Implementation

We will use the conjugate gradient method from the book “Numerical Recipes 2nd Edition”. The function declaration from chapter 10.6 looks like this:

We can see that the function needs an initial starting point p[]. This will be our measured depth map. For the tolerance ftol stopping criteria we will just take a small value. And we need to provide pointers to the energy evaluation function func() and the gradient function dfunc().

The implementation of the function evaluation speaks for itself. It is just literally evaluating the cost function.

The dfunc() function should calculate the gradient. The gradient is a vector of the partial derivatives in the unknowns.  We can calculate the gradient in 2 ways: analytically or numerically. With the analytical method you have to take pen and paper and derive the partial derivatives an implement those. It’s a little bit of work but it is the most efficient and correct one. For our implementation we will use a numerical approximation. It’s easier to implement for complicated functions but is less efficient.

If for example our depth image contains 76800 values, then our gradient vector will also have a size of 76800. For each element of the vector we will evaluate the cost function with and without a small depth change of that element. And subtract these two evaluations with each other.

$latex grad_i=\displaystyle \frac{E(\textbf{R + h}) – E(\textbf{R})}{h}&fg=404040$

But the gradient represents the slope of the tangent of the function when we change a depth value with 1 unit. 1 unit in our case is 1 meter. This scale has an influence on line search algorithm. It’s initial step will be very big and it’s possible that it will jump over our solution. To prevent this we can scale this gradient with a certain factor. If the factor is chosen to be the same as the differentiation delta then we get the following formula.

$latex grad_i=E(\textbf{R + h}) – E(\textbf{R})&fg=404040$

In practice we don’t need to evaluate the complete energy but only the neighborhood where the change of a depth value has influence on. In our case it’s a 3 by 3 neighborhood around the pixel.

Computation of Surface Normals

As you can see in the previous section, both the Lambertian reflectance model and the shape prior require the normals of the depth surface. The author of the paper reserved a full section about how to compute the surface normals. And remarkable is that the explanation of the 2008 paper differs a little bit with the 2010 paper. This means to me that this was a real struggle to get right for the author. Initially I followed the 2008 paper, unaware of the existence of the 2010 paper, and really struggled to get things right. Luckily I found the 2010 paper with a more elaborate explanation of the normal calculation. But in the end I use my own simpler normal approximation method.

I quote from [2]:

An obvious way is to compute the cross product of two tangent vectors p(i + 1, j) – p(i – 1, j) and p(i, j +1) – p(i, j – 1) (where p(i,j) are the three dimensional coordinates of the point corresponding to pixel (i,j)), but surface normals calculated in this way can lead to the minimize astray: Because the normals of pixels with even indices depend only on the positions of pixels with odd indices, and vice versa, neighboring pixels are not constrained to have similar range values, and the minimizer may happily compute a surface with a “checkerboard” pattern, where neighboring pixels are alternately displaced upwards.

And indeed! I took a noisy surface and only applied the shape prior constraint on it and you get a result like this:

The author tries to solve it with triangulating the depth map in an asymmetric way. Actually he uses 2 triangulation topologies at the same time. Please see [2] for more details. And indeed, the proposed method works. And an asymmetric kernel is the way how you can solve this problem. But in my opinion, the suggested implementation by the author is cumbersome.Therefore I went for a 1st approximation of the normal based on the image gradient of the depth image. It only takes the right and bottom neighbor into account. It’s far from perfect, but does an ok job on the synthetic data and keeps the example code short an elegant.

$latex n=(\displaystyle \frac{dg}{dx},-\displaystyle \frac{dg}{dy},-sg)^t&fg=404040$

Where sg is the pixel width at distance g.

In code it looks something likes this:

Because this is a 1st approximation it is heavily influenced by pixel noise. But pixel noise can be removed by applying a smoothing filter on the depth map.

This normal calculation method is maybe far from ideal but does an ok job on the used data sets.

Results

Below you will find some 3d visualizations on synthetic datasets that I’ve generated.

On the first frame of the movies you will see the original noisy range map. Then you will see the range map with a median filter applied on top of it. After that the iterative filtering process begins.

Example 1: Carved figures

One can clearly observe that the carved figures are not visible in the initial noisy range map. But after the filtering process they are fully recovered.

Example 2: Wave

The original surface is well reconstructed and the noise is almost completely gone.

Example 3: Corner data set

A big benefit  over traditional simple smoothing methods is shown here. The result are nice flat walls and a sharp corner!

Conclusions

• The used optimization method is very easy to implement. Perfect for prototyping. But runs relatively slow. The next step would be to use a 2nd order method as explained in [5].
• The big weakness of this paper is that you need a good guess for the albedo. Even then, in practice the albedo differs in the image itself. Think about textured objects. If you have a flat wall with a painting on it, it will probably carve the texture edges into the depth image. The actual paper also implements a varying albedo method but this does not sound ideal to me. Another solution is that you first segment the intensity image in consistent patches. For each patch you make an albedo estimate etc. In my opinion this is the main factor of why it is commercially unusable. This is the same for [5]. In that paper you can clearly see that textures on a flat t-shirt are carved into the depth image.
• The influence of the shape prior is very subtle.
• An alternative for the shape prior constraint can be a geometric constraint. Instead of using the neighboring normals one uses the neighboring 3d positions.

Demo Project

Source code and a demo project can be found on my GitHub.

The original code used the non-linear conjugate gradient solver from the book “Numerical Recipes 2nd Edition”. In order to prevent any possible copyright infringements I modified this version so that it uses the non-linear conjugate gradient method of OpenCV. It seems to be a lot slower than the original code…

References

[1] Bohme, M., et al. “Shading constraint improves accuracy of time-of-flight measurements.” Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08. IEEE Computer Society Conference on. IEEE, 2008.

[2] Böhme, Martin, et al. “Shading constraint improves accuracy of time-of-flight measurements.” Computer vision and image understanding 114.12 (2010): 1329-1335.

[3] Chen, Ke. “Introduction to variational image-processing models and applications.” International Journal of Computer Mathematics 90.1 (2013): 1-8.

[4] Press, William H., et al. “Numerical recipes, 2nd.” Cambridge University Pn,’ss, Cambridge (1992).

[5] Wu, Chenglei, et al. “Real-time shading-based refinement for consumer depth cameras.” Proc. SIGGRAPH Asia (2014).

[6] Shewchuk, Jonathan Richard. “An introduction to the conjugate gradient method without the agonizing pain.” (1994).

[7] Wikipedia contributors. “Gradient descent.” Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 20 Apr. 2015. Web. 7 May. 2015.