Tuesday, October 12, 2010

[CV][Descriptor] Local Self-Similarity & Global Self-Similarity

1. Introduction
This descriptor is based on an observation about the self-similarity of objects. In Figure 1, the hearts look very different from image to image, yet in each image the "heart" shape is formed by a single texture repeated over the shape. The only thing that differs between the images is which texture forms the heart. This idea led Eli Shechtman and Michal Irani to the Local Self-Similarity Descriptor [1].

Figure 1: The same object with variations in color, texture and edges.
In [1], this descriptor is used in the context of template matching (Figure 2).
Input:
+ Object image (the template; usually a small image, about 150-200 pixels in [1])
+ Large image (containing the object shown in the template image)
Output:
Location of the template object in the large image.

Figure 2: An example of template matching

2. Applied Area
- Template matching (where the template is either a real image or a hand-drawn sketch)

3. Details

a) Descriptor
To extract the local self-similarity descriptor $d_q$ at a point q, follow these steps:
- Convert the image to CIE L*a*b*
- Correlate the 5x5 image patch centered at q with the surrounding 40x40 region (also centered at q), using the sum of squared differences (SSD) between patch colors. This results in a "distance surface" $SSD_q(x,y)$
- Normalize the "distance surface" into a "correlation surface" $S_q(x, y)$ (Figure 2a):

\[
S_q(x, y) = \exp \left( - \frac{SSD_q(x,y)}{\max(var_{noise}, var_{auto}(q))} \right)
\]

where:
+ $var_{noise}$ is a constant corresponding to acceptable photometric variations (in illumination, color or noise)
+ $var_{auto}(q)$ accounts for the contrast of the patch, in the sense that sharp patches tolerate more variation than smooth patches. In their implementation, $var_{auto}(q)$ is the maximal variance of the difference between the patch centered at q and the patches within a neighborhood of radius 1 around it.

- The correlation surface is transformed into a binned log-polar representation with 80 bins (4 in log radius and 20 in angle)

- The maximum value in each bin is kept, which gives tolerance to local affine deformations. The result is a descriptor vector of size 80.

- Normalize this vector into the range [0..1] by linear stretching, which makes it invariant to differences in the color and pattern distribution of the surrounding region.
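
The steps above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: the function names are made up for this post, the image is assumed to be in CIE L*a*b* already, the default `var_noise` is an arbitrary placeholder (its real value is not known here), the 40x40 region is approximated by offsets from -20 to 20, `var_auto(q)` is read as the largest SSD within radius 1 of q, and the radial bin edges are assumed to be log-spaced between 1 pixel and the region radius.

```python
import numpy as np

def correlation_surface(img, q, patch=5, region=40, var_noise=25.0):
    """Correlation surface S_q around pixel q = (row, col).

    img is a float array of shape (H, W, C), assumed to be in CIE L*a*b* already.
    q must be at least region // 2 + patch // 2 pixels away from every border.
    """
    r, c = q
    hp, hr = patch // 2, region // 2
    center = img[r - hp:r + hp + 1, c - hp:c + hp + 1]

    def ssd_at(dr, dc):
        other = img[r + dr - hp:r + dr + hp + 1, c + dc - hp:c + dc + hp + 1]
        return np.sum((center - other) ** 2)

    offsets = range(-hr, hr + 1)
    ssd = np.array([[ssd_at(dr, dc) for dc in offsets] for dr in offsets])

    # Assumption: var_auto(q) is taken as the largest SSD among the patches
    # within radius 1 of q.
    var_auto = ssd[hr - 1:hr + 2, hr - 1:hr + 2].max()
    return np.exp(-ssd / max(var_noise, var_auto))

def log_polar_max(surface, n_radii=4, n_angles=20):
    """Maximum of the surface in each log-polar bin (un-normalized descriptor)."""
    hr = surface.shape[0] // 2
    ys, xs = np.mgrid[-hr:hr + 1, -hr:hr + 1]
    rho = np.hypot(xs, ys)
    theta = np.arctan2(ys, xs) % (2 * np.pi)

    # Assumption: radial bin edges are log-spaced between 1 pixel and hr.
    edges = np.logspace(0, np.log10(hr), n_radii + 1)
    r_bin = np.digitize(rho, edges[1:-1])
    a_bin = np.minimum((theta / (2 * np.pi) * n_angles).astype(int), n_angles - 1)

    inside = (rho >= 1) & (rho <= hr)
    desc = np.zeros(n_radii * n_angles)  # bins that receive no pixel stay at 0
    np.maximum.at(desc, (r_bin * n_angles + a_bin)[inside], surface[inside])
    return desc

def stretch01(desc):
    """Linearly stretch the descriptor values to the range [0, 1]."""
    lo, hi = desc.min(), desc.max()
    return (desc - lo) / (hi - lo) if hi > lo else np.zeros_like(desc)

def local_self_similarity(img, q):
    return stretch01(log_polar_max(correlation_surface(img, q)))
```

Calling `local_self_similarity(img, (row, col))` returns a vector of 80 values. With these bin edges the innermost ring covers only a handful of pixels, so some of its angular bins stay empty; a real implementation would need to handle that, for example by interpolating the surface.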

b) Descriptor for Video (Figure 2b)
- Correlate a 5x5x1 patch (no extent along the time axis) with a 60x60x5 space-time region (2 previous and 2 next frames).
- The log-polar space is replaced by a log-log-polar space (log in time and space, polar in space), giving a descriptor vector of size 182.
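
As a rough illustration of the first step, the space-time "distance volume" can be computed as below. This is a sketch under assumptions: the video is a grayscale float array of shape (T, H, W), the function name `spacetime_ssd` is made up, and the 60x60 region is approximated by offsets from -30 to 30. The log-log-polar binning is not shown, because its exact bin layout is not given here.

```python
import numpy as np

def spacetime_ssd(video, q, patch=5, region=60, frames=5):
    """SSD between the 5x5x1 patch at q = (t, row, col) and every 5x5 patch
    in the surrounding region of the `frames` neighboring frames.

    video is a float array of shape (T, H, W); q must be far enough from the
    borders (in time and space) for the whole region to fit.
    """
    t, r, c = q
    hp, hr, ht = patch // 2, region // 2, frames // 2
    center = video[t, r - hp:r + hp + 1, c - hp:c + hp + 1]

    # Sub-volume holding every pixel that any compared patch can touch.
    vol = video[t - ht:t + ht + 1,
                r - hr - hp:r + hr + hp + 1,
                c - hr - hp:c + hr + hp + 1]

    # All patch x patch windows in each frame: shape (frames, 61, 61, 5, 5).
    windows = np.lib.stride_tricks.sliding_window_view(
        vol, (patch, patch), axis=(1, 2))
    return ((windows - center) ** 2).sum(axis=(-2, -1))
```

The result has one SSD value per offset (dt, dy, dx), and the entry at the center of the volume is the patch compared with itself, so it is 0.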

c) Matching:
F: template image
G: input image
- Compute $d_q$ densely in both F and G (at multiple scales, using a Gaussian pyramid).
- Remove meaningless descriptors (in both F and G), including:
+ Descriptors whose center patch is salient but has no similar patch in the region (all elements of the un-normalized vector are below a threshold $t_{non\_info}$)
+ Descriptors with high self-similarity everywhere in the region (large homogeneous regions). There is presumably a threshold $t_{high}$ here, but it is not reported.
- Ensemble matching: find the subset of descriptors in G that is close to all the descriptors in F under geometric constraints. To do that, the authors modified the ensemble matching of [2]. The location with the highest likelihood value is taken as the detected location.
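
The descriptor filtering step could look roughly like this. It is a sketch only: the function name `informative_mask` and both default threshold values are invented for illustration (the real values are unknown), and using the median for the "similar everywhere" test is my own choice, so that a few empty or noisy bins do not decide the outcome. The ensemble matching itself is not sketched here.

```python
import numpy as np

def informative_mask(raw_desc, t_non_info=0.1, t_high=0.9):
    """Boolean mask of descriptors to keep.

    raw_desc has shape (N, 80) and holds the un-normalized descriptors,
    i.e. the per-bin maxima of the correlation surface before stretching.
    """
    raw_desc = np.asarray(raw_desc, dtype=float)
    # Salient center patch with nothing similar around it: every bin is low.
    no_match = raw_desc.max(axis=1) < t_non_info
    # Large homogeneous region: the typical bin is highly self-similar.
    homogeneous = np.median(raw_desc, axis=1) > t_high
    return ~(no_match | homogeneous)
```

The filter has to run on the un-normalized values, because after stretching to [0..1] every descriptor has the same minimum and maximum.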

d) Parameter:

Name | Meaning | Values
$var_{noise}$ | Acceptable photometric variations (in illumination, color or noise) | ?
$t_{non\_info}$ | Threshold to remove non-informative descriptors | ?
$t_{high}$ | Homogeneous region threshold | ?

Figure 3: Extracting $d_q$

e) Result:
- Outperforms GLOH and Shape Context on template matching (Figure 4).

Figure 4: Result comparison with other descriptors.

f) Global Self-Similarity [3]
For each patch $t_q$ centered at q, correlate it with the whole H x W image I, resulting in the correlation surface $C_q$ (that is, the region used by Local Self-Similarity becomes the entire image). In $C_q$, $C_q(q^{\prime})$ is the correlation between $t_q$ and $t_{q^{\prime}}$.
The Global Self-Similarity descriptor of the image is given by:
\[
S_I(q, q^{\prime}) = C_q(q^{\prime})
\]

where $S_I$ is quadratic in the size of image I: H x W x H x W
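
To make the size of $S_I$ concrete, here is a small sketch that builds the full H x W x H x W tensor for a tiny grayscale image. It assumes the same exp(-SSD / var) correlation as in the local descriptor with a constant normalizer; the function name, the 3x3 patch size and the `var_noise` value are all illustrative choices, not taken from [3].

```python
import numpy as np

def global_self_similarity(img, patch=3, var_noise=0.05):
    """S_I[q, q'] for every pair of pixels of a small (H, W) float image."""
    h, w = img.shape
    hp = patch // 2
    padded = np.pad(img, hp, mode="edge")

    # One flattened patch per pixel: shape (H*W, patch*patch).
    windows = np.lib.stride_tricks.sliding_window_view(padded, (patch, patch))
    flat = windows.reshape(h * w, -1)

    # Pairwise SSD via ||a||^2 + ||b||^2 - 2 a.b, clipped against round-off.
    sq = (flat ** 2).sum(axis=1)
    ssd = np.maximum(sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T, 0.0)
    return np.exp(-ssd / var_noise).reshape(h, w, h, w)
```

For an 8x10 image the result has 8 * 10 * 8 * 10 = 6400 entries. For a 100x100 image it would already have 10^8 entries (about 800 MB in float64), which is why the full tensor is only practical for very small images.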

4. Properties

Invariance capability:


Inner change | Transformation | Clutter Background
Shape | Texture | Illumination | Translation | Rotation | Scale | Affine
Local Self-Similarity
+++x

x
+++

x: has the ability (+: low, ++: medium, +++: high)
- Local affine transformation invariance: comes from taking the maximum in each log-polar bin
- Texture invariance: the descriptor finds positions similar to the local texture (local texture similarity), whatever that texture is
- Illumination invariance: due to the 2 normalization steps: normalizing the "distance surface" and normalizing the vector into [0..1]
- Scale invariance: the descriptor is computed at multiple scales
- Clutter background invariance: if the center patch belongs to the object, Local Self-Similarity usually finds neighboring patches like it; in other words, it only takes the object's texture into account. Therefore, any background that is not very similar to the object region does not influence the descriptor.

5. Source code
- Source code by Varun Gulshan in Visual Geometry Group
- See also: Global Self-Similarity Descriptor by Thomas Deselaers, for his paper
- Local Self-Similarity by Rainer Lienhart in OpenCV

6. My conclusion
- A very exciting descriptor with a novel idea.

- Good for objects with these characteristics:
+ Only influenced by scale
+ Texture changes against different backgrounds (humans, human-made objects in real images)
+ May change but keep their main form

- I haven't seen anything interesting in the Global Self-Similarity descriptor (at least for now)

References:
[1] Eli Shechtman and Michal Irani. "Matching Local Self-Similarities across Images and Videos". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[2] O. Boiman and M. Irani. "Detecting irregularities in images and videos". ICCV, 2005.
[3] Thomas Deselaers and Vittorio Ferrari. "Global and Efficient Self-Similarity for Object Classification and Detection". CVPR, 2010.
