[Review] BoT-SORT: Robust Associations Multi-Pedestrian Tracking

This post reads and reviews BoT-SORT: Robust Associations Multi-Pedestrian Tracking, a paper published in 2022.

Index
1. Background
1.1. Extrapolation
1.2. Linear Kalman Filter
1.3. RANdom SAmple Consensus
1.4. Rigid Motion & Non-Rigid Motion
2. Abstract
3. Introduction
4. Related Work
5. Method
5.1. Kalman Filter
5.2. Camera Motion Compensation
5.3. IoU - Re-ID Fusion
5.4. Whole Architecture
6. Experiment
7. Conclusion

1. Background

1.1. Extrapolation

Interpolation and Extrapolation

1. Interpolation: predicting a value between two known points 2. Extrapolation: predicting a value outside the range of two known points 3. Comparison figure, reference link https://x-engineer.org/linear-interpolation-extrapolation-

alstn59v.tistory.com
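As a small illustration of the two definitions above, the sketch below evaluates the straight line through two known points, once between them and once beyond them. The function name `linear_estimate` and the sample points are made up for this example.

```python
def linear_estimate(p0, p1, x):
    """Evaluate the straight line through p0 and p1 at x."""
    (x0, y0), (x1, y1) = p0, p1
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


# Two known points (hypothetical values).
p0, p1 = (1.0, 2.0), (3.0, 6.0)

inside = linear_estimate(p0, p1, 2.0)   # interpolation: x lies between the two points
outside = linear_estimate(p0, p1, 5.0)  # extrapolation: x lies beyond the two points
```

The same formula is used in both cases; only the position of x relative to the known points changes, which is why extrapolated values tend to be less trustworthy the further x moves away.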

1.2. Linear Kalman Filter

Kalman Filter

0. Before we begin: this post is written as briefly as possible so that the Kalman Filter is easy to understand. For more detail, please see the links in the reference section below. 1.

alstn59v.tistory.com

1.3. RANdom SAmple Consensus

RANdom SAmple Consensus

0. Before we begin: this post leaves out the math and takes a light, conceptual look at what RANSAC is. For more detail, see the reference links section below ...

alstn59v.tistory.com

1.4. Rigid Motion & Non-Rigid Motion

Rigid motion and non-rigid motion

1. Rigid Motion: a change such as translation or rotation in which the distance between any two points inside the object stays the same 2. Non-Rigid Motion: the opposite of rigid motion, any two points inside the object ...

alstn59v.tistory.com

2. Abstract

The goal of MOT is to detect and track every object in a scene while maintaining an ID for each object
This paper proposes a tracker that combines motion and appearance information with camera motion compensation and a more accurate Kalman filter state vector

3. Introduction

Recent MOT work is mostly based on SORT, DeepSORT, and approaches that jointly learn the detector and the embedding
Tracking-by-detection is currently the most effective paradigm, and it works in the following steps
- A Kalman filter serves as the motion model and state estimator: it predicts each tracklet's bounding box in the next frame for association with the detection boxes, and it predicts the tracklet state when the object is occluded or a detection is missed
- IoU and Re-ID are used to associate the objects detected in the new frame with existing tracklets; IoU generally achieves better MOTA, while Re-ID achieves higher IDF1
The authors identify several limitations of SORT-like algorithms
- They adopt a Kalman filter whose motion model assumes constant velocity
  - Using the Kalman filter's state estimate as the tracker output yields a bounding box shape that is sub-optimal compared with the box produced by the object detector
  - The Kalman filter state proposed in DeepSORT estimates the box's aspect ratio instead of its width, which leads to inaccurate width estimates
- IoU-based approaches depend on the quality of the bounding boxes
  - Predicting the correct bounding box location can fail under camera motion, which lowers the IoU and degrades tracker performance
Bounding box localization is improved by adding a camera-motion-compensation-based tracker and a more suitable Kalman filter state vector
For robust association between detections and tracklets, the paper proposes a method that fuses IoU with the Re-ID cosine distance

4. Related Work

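motion model
- A Kalman filter with a constant-velocity assumption, its NSA-Kalman variant, and camera motion compensation that aligns frames using enhanced correlation coefficient (ECC) maximization or features such as ORB
appearance models and re-identification
- Occlusion leaves too little appearance information
- Separate trackers have a high computational cost, while joint trackers have a low one
Recent work abandons appearance information and relies only on motion information in exchange for high running speed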

5. Method

5.1. Kalman Filter

A Kalman filter with a constant-velocity model is commonly used to model object motion in the image
SORT uses the Kalman filter state vector \( \mathbf{x}=[x_{c}, y_{c}, s, a, \dot{x}_{c}, \dot{y}_{c}, \dot{s}]^{T} \), while more recent trackers use \( \mathbf{x}=[x_{c}, y_{c}, a, h, \dot{x}_{c}, \dot{y}_{c}, \dot{a}, \dot{h}]^{T} \)
- \( s \) and \( a \) are the area (scale) and aspect ratio of the bounding box, and \( \dot{*} \) is the velocity of component \( * \), i.e. its rate of change between frames
The authors found that directly estimating the width and height of the bounding box gives better performance, so they define the Kalman filter state vector \( \mathbf{x} \) and measurement vector \( \mathbf{z} \) as follows
- \( \mathbf{x}_{k}=[x_{c}(k), y_{c}(k), w(k), h(k), \dot{x}_{c}(k), \dot{y}_{c}(k), \dot{w}(k), \dot{h}(k)]^{T} \), \( \mathbf{z}_{k}=[z_{x_{c}}(k), z_{y_{c}}(k), z_{w}(k), z_{h}(k)]^{T} \)
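A minimal sketch of what the 8-dimensional state and 4-dimensional measurement above imply for a constant-velocity Kalman filter. The function names, the time step `dt = 1` frame, and all numeric values are assumptions made for illustration, not taken from the paper or its code.

```python
import numpy as np


def make_cv_model(dt=1.0):
    """Constant-velocity transition F (8x8) and measurement H (4x8)
    for the state [xc, yc, w, h, vxc, vyc, vw, vh]."""
    F = np.eye(8)
    F[:4, 4:] = dt * np.eye(4)  # each of xc, yc, w, h advances by its velocity * dt
    H = np.eye(4, 8)            # only [xc, yc, w, h] are observed
    return F, H


def predict(x, P, F, Q):
    """Kalman prediction step: propagate the state and its covariance."""
    return F @ x, F @ P @ F.T + Q


F, H = make_cv_model()
x = np.array([100.0, 50.0, 20.0, 40.0, 2.0, -1.0, 0.5, 0.0])  # hypothetical track
P = np.eye(8)          # placeholder covariance
Q = 0.01 * np.eye(8)   # placeholder process noise
x_pred, P_pred = predict(x, P, F, Q)
z_pred = H @ x_pred    # predicted box [xc, yc, w, h] to compare with detections
```

Because width and height are state components in their own right, the predicted box comes straight out of `H @ x_pred` with no conversion from aspect ratio.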
In SORT the noise covariance matrices \( Q \) and \( R \) were time-independent, whereas DeepSORT proposed a time-dependent choice in which they are functions of some estimated and measured elements
- Accordingly, \( Q \) and \( R \) are defined to change with the state vector, as follows
  - \( R_{k} = \operatorname{diag} \left( (\sigma_{m} \hat{w}_{k|k-1})^{2}, (\sigma_{m} \hat{h}_{k|k-1})^{2}, (\sigma_{m} \hat{w}_{k|k-1})^{2}, (\sigma_{m} \hat{h}_{k|k-1})^{2} \right) \)
  - \( \begin{aligned}
    Q_{k} = \operatorname{diag} \big( & (\sigma_{p} \hat{w}_{k-1|k-1})^{2}, (\sigma_{p} \hat{h}_{k-1|k-1})^{2}, (\sigma_{p} \hat{w}_{k-1|k-1})^{2}, (\sigma_{p} \hat{h}_{k-1|k-1})^{2}, \\
    & (\sigma_{v} \hat{w}_{k-1|k-1})^{2}, (\sigma_{v} \hat{h}_{k-1|k-1})^{2}, (\sigma_{v} \hat{w}_{k-1|k-1})^{2}, (\sigma_{v} \hat{h}_{k-1|k-1})^{2} \big)
    \end{aligned} \)
    - The noise factors are set to \( \sigma_{p}=0.05 \), \( \sigma_{v}=0.00625 \), \( \sigma_{m}=0.05 \)
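The two diagonal matrices above can be built directly from the current width and height estimates. The sketch below uses the noise factors quoted in the text; the function names and the example box size are assumptions for illustration.

```python
import numpy as np

SIGMA_P, SIGMA_V, SIGMA_M = 0.05, 0.00625, 0.05  # noise factors quoted above


def process_noise(w_prev, h_prev):
    """Q_k (8x8) from the previous posterior width and height."""
    pos = [SIGMA_P * w_prev, SIGMA_P * h_prev] * 2  # std for xc, yc, w, h
    vel = [SIGMA_V * w_prev, SIGMA_V * h_prev] * 2  # std for the four velocities
    return np.diag(np.square(pos + vel))


def measurement_noise(w_pred, h_pred):
    """R_k (4x4) from the predicted width and height."""
    std = [SIGMA_M * w_pred, SIGMA_M * h_pred] * 2
    return np.diag(np.square(std))


# Hypothetical 20x40 pixel box.
Q = process_noise(w_prev=20.0, h_prev=40.0)
R = measurement_noise(w_pred=20.0, h_pred=40.0)
```

Scaling the standard deviations by the box size means a large, nearby pedestrian is allowed more pixel-level uncertainty than a small, distant one.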
When a track is lost, long predictions can deform the bounding box shape, so the handling logic follows ByteTrack
The authors claim that this Kalman filter modification makes the bounding box width fit the object better

5.2. Camera Motion Compensation

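Tracking-by-detection trackers rely on the overlap between the tracklets' predicted bounding boxes and the detected bounding boxes
When the camera moves, the location of a bounding box in the image can change abruptly
- This can increase IDSW or FN
- Even with a fixed camera, wind or vibration can shift the image abruptly
Motion patterns in the video can be summarized as rigid motion, caused by changes in camera pose, and non-rigid motion of objects such as pedestrians
Information about the camera (intrinsic matrix, motion) is usually unavailable
- An approximation of the camera's rigid motion can be obtained from two adjacent frames
The affine matrix \( A^{k}_{k-1} \) is used to transform the predicted bounding box from the coordinates of frame \( k-1 \) to those of frame \( k \)
- The translation part of the transformation matrix affects only the center coordinates of the bounding box, while the remaining part affects the whole state vector and the noise matrix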
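To capture the background motion in the image, the paper uses global motion compensation based on the affine transform implemented in OpenCV's Video Stabilization module, which proceeds as follows
- Extract keypoints from the image
- Track the keypoints with sparse optical flow, rejecting local outliers based on their translation
- Solve for the affine matrix \( A^{k}_{k-1} \in \mathbb{R}^{2 \times 3} \) using RANSAC
- Using a sparse registration technique makes it possible to ignore moving objects in the image and estimate the background motion more accurately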
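The RANSAC step in the list above can be sketched with NumPy alone. This is a simplified stand-in, not the OpenCV Video Stabilization implementation: it assumes keypoint extraction and optical-flow tracking have already produced matched point pairs `src` (frame k-1) and `dst` (frame k), and the function names, iteration count, and tolerance are all illustrative choices.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares 2x3 affine A such that dst ~= [src | 1] @ A.T."""
    X = np.hstack([src, np.ones((len(src), 1))])
    sol, *_ = np.linalg.lstsq(X, dst, rcond=None)  # shape (3, 2)
    return sol.T                                   # shape (2, 3)


def ransac_affine(src, dst, iters=200, tol=1.0, seed=0):
    """Keep the 3-point affine model with the most inliers, then refit on them."""
    rng = np.random.default_rng(seed)
    X = np.hstack([src, np.ones((len(src), 1))])
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)
        A = fit_affine(src[idx], dst[idx])
        err = np.linalg.norm(X @ A.T - dst, axis=1)  # reprojection error in pixels
        inliers = err < tol
        if inliers.sum() > best.sum():
            best = inliers
    return fit_affine(src[best], dst[best]), best


# Synthetic data: 80 background points follow the camera motion,
# 20 points sit on independently moving objects.
rng = np.random.default_rng(1)
src = rng.uniform(0, 500, size=(100, 2))
A_true = np.array([[1.0, 0.01, 5.0], [-0.01, 1.0, -3.0]])
dst = np.hstack([src, np.ones((100, 1))]) @ A_true.T
dst[80:] += rng.uniform(20, 40, size=(20, 2))
A_est, inlier_mask = ransac_affine(src, dst)
```

The points on moving objects end up outside the consensus set, which is the property the text relies on: the estimated matrix describes the background, not the pedestrians.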
Camera motion is compensated with the following equations
- \( A^{k}_{k-1}=[M_{2 \times 2}|T_{2 \times 1}]=\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \), \( \tilde{M}^{k}_{k-1}=\begin{bmatrix} M & 0 & 0 & 0 \\ 0 & M & 0 & 0 \\ 0 & 0 & M & 0 \\ 0 & 0 & 0 & M \end{bmatrix} \), \( \tilde{T}^{k}_{k-1}=\begin{bmatrix} a_{13} \\ a_{23} \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \)
  - \( M \in \mathbb{R}^{2 \times 2} \) is the scale-and-rotation part of the affine matrix and \( T \) is the translation part; a mathematical trick expands them to \( \tilde{M}^{k}_{k-1} \in \mathbb{R}^{8 \times 8} \) and \( \tilde{T}^{k}_{k-1} \in \mathbb{R}^{8} \)
- \( \hat{x}\prime_{k|k-1}=\tilde{M}^{k}_{k-1}\hat{x}_{k|k-1}+\tilde{T}^{k}_{k-1} \), \( P\prime_{k|k-1}=\tilde{M}^{k}_{k-1}P_{k|k-1}{\tilde{M}^{k}_{k-1}}^{T} \)
  - \( \hat{x}_{k|k-1} \) and \( P_{k|k-1} \) are the Kalman filter's predicted state vector and error covariance at time \( k \), and \( \hat{x}\prime_{k|k-1} \) and \( P\prime_{k|k-1} \) are the corresponding results after camera motion compensation
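A minimal sketch of the correction equations above, assuming the state ordering [xc, yc, w, h, vxc, vyc, vw, vh] from Section 5.1. The function name `apply_cmc` and the example values are made up for illustration.

```python
import numpy as np


def apply_cmc(x_pred, P_pred, A):
    """Warp a predicted 8-dim state and its covariance with a 2x3 affine A."""
    M, T = A[:, :2], A[:, 2]
    M_tilde = np.kron(np.eye(4), M)  # 8x8 block-diagonal matrix diag(M, M, M, M)
    T_tilde = np.zeros(8)
    T_tilde[:2] = T                  # translation only moves the box center
    x_new = M_tilde @ x_pred + T_tilde
    P_new = M_tilde @ P_pred @ M_tilde.T
    return x_new, P_new


# Hypothetical example: the camera shift moves the image by (+5, -3) pixels.
A = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, -3.0]])
x_pred = np.array([100.0, 50.0, 20.0, 40.0, 2.0, -1.0, 0.0, 0.0])
P_pred = np.diag(np.arange(1.0, 9.0))
x_cmc, P_cmc = apply_cmc(x_pred, P_pred, A)
```

With a pure translation, M is the identity, so only the center moves and the covariance is unchanged; rotation or scale in M would also change the size, velocity, and covariance terms.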
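The compensated \( \hat{x}\prime_{k|k-1} \) and \( P\prime_{k|k-1} \) are then used in the Kalman filter's update (correction) step
In high-velocity scenarios, correcting the full state vector including the velocity terms is essential, but when the camera moves slowly the correction of \( P_{k|k-1} \) can be omitted
- This makes the tracker robust to camera motion
After compensating for rigid camera motion, the object's position is assumed to change only slightly between adjacent frames
In high-FPS video, when a detection is missing, the track can be extrapolated with the Kalman filter's prediction step
- This achieves a slightly higher MOTA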

5.3. IoU - Re-ID Fusion

To exploit deep visual representations, appearance features are integrated into the tracker
- Re-ID features are extracted with the FastReID library, using ResNeSt50 as the backbone
An EMA (exponential moving average) is used to update the appearance state \( e^{k}_{i} \) of the \( i^{th} \) tracklet at the \( k^{th} \) frame
- \( e^{k}_{i} = \alpha e^{k-1}_{i} + (1-\alpha)f^{k}_{i} \)
  - \( f^{k}_{i} \) is the appearance embedding of the currently matched detection, and \( \alpha=0.9 \) is the momentum term
- Cosine similarity is measured to match the averaged appearance state \( e^{k}_{i} \) against a new detection embedding \( f^{k}_{j} \)
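The EMA update and the cosine similarity above can be sketched as follows. The 2-dimensional vectors are toy values chosen so the arithmetic is easy to follow; real Re-ID embeddings are much longer, and the function names are assumptions for this example.

```python
import numpy as np

ALPHA = 0.9  # momentum term quoted above


def update_appearance(e_prev, f_curr, alpha=ALPHA):
    """EMA update of a tracklet's appearance state."""
    return alpha * e_prev + (1.0 - alpha) * f_curr


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


e_prev = np.array([1.0, 0.0])  # tracklet appearance state so far
f_curr = np.array([0.0, 1.0])  # embedding of the newly matched detection
e_new = update_appearance(e_prev, f_curr)

sim_to_history = cosine_similarity(e_new, e_prev)
sim_to_latest = cosine_similarity(e_new, f_curr)
```

With a momentum of 0.9, a single very different embedding moves the state only slightly, so one bad crop (for example a partially occluded pedestrian) does not overwrite the tracklet's appearance.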
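Appearance features are vulnerable to crowds and to blurred or occluded objects, so only detections with a high confidence score are used to keep the features reliable
The cost matrix \( C \) is conventionally computed as the weighted sum of the appearance cost and the motion cost shown below; the paper drops this in favor of a new fusion
- \( C = \lambda A_{a} + (1-\lambda) A_{m} \)
  - \( \lambda=0.98 \) is the weight factor, \( A_{a} \) is the appearance cost, and \( A_{m} \) is the motion cost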
The paper develops the following new method to combine motion and appearance (that is, the IoU distance and the cosine distance)
- Candidates with low cosine similarity, or that are far apart in terms of IoU, are rejected
- \( \hat{d}^{cos}_{i,j} =
  \begin{cases}
  0.5 \cdot d^{cos}_{i,j}, & (d^{cos}_{i,j} < \theta_{emb}) \wedge (d^{iou}_{i,j} < \theta_{iou}) \\
  1, & \text{otherwise}
  \end{cases} \), \( C_{i,j}=\min\{d^{iou}_{i,j},\hat{d}^{cos}_{i,j}\} \)
  - \( C_{i,j} \) is the \( (i,j) \) element of \( C \); \( d^{iou}_{i,j} \) is the IoU distance between the \( i^{th} \) predicted bounding box and the \( j^{th} \) detection bounding box, i.e. the motion cost; \( d^{cos}_{i,j} \) is the cosine distance between the averaged descriptor \( e^{k}_{i} \) and the descriptor \( f^{k}_{j} \); \( \hat{d}^{cos}_{i,j} \) is the new appearance cost; \( \theta_{iou} \) is a proximity threshold, set to 0.5, that rejects unlikely tracklet-detection pairs; \( \theta_{emb} \) is the appearance threshold that separates positive associations of \( e^{k}_{i} \) and \( f^{k}_{j} \) from negative ones
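The gated fusion above is short enough to write out directly. In this sketch `theta_iou = 0.5` follows the description in the text, while `theta_emb = 0.25` is an assumed value for illustration only; the function name and the toy distance matrices are also made up.

```python
import numpy as np


def fuse_costs(d_iou, d_cos, theta_iou=0.5, theta_emb=0.25):
    """Combine IoU distance and cosine distance into one cost matrix C.

    theta_emb=0.25 is an assumed value, not taken from the text above.
    """
    # Appearance is trusted only if the pair is close in both embedding and IoU.
    gate = (d_cos < theta_emb) & (d_iou < theta_iou)
    d_cos_hat = np.where(gate, 0.5 * d_cos, 1.0)
    return np.minimum(d_iou, d_cos_hat)


# Rows are tracklets, columns are detections (toy values).
d_iou = np.array([[0.2, 0.9],
                  [0.6, 0.3]])
d_cos = np.array([[0.1, 0.1],
                  [0.2, 0.6]])
C = fuse_costs(d_iou, d_cos)
```

Pair (0, 1) has a small cosine distance but almost no overlap, so the gate rejects the appearance cue and the cost falls back to the IoU distance of 0.9. This is how the method avoids matching look-alike pedestrians that are far apart.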
The assignment problem for the high-confidence detections is solved with the Hungarian algorithm on \( C \)

5.4. Whole Architecture

Overview

Algorithm
- The parts marked with light-green boxes are what this paper adds to the ByteTrack algorithm

6. Experiment

Performance change as each component is applied

Performance change by type of similarity

Performance comparison on the MOT datasets

Visualization of cMOTA, the MOTA at the current frame

7. Conclusion

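BoT-SORT is proposed, using several techniques for robust association
The proposed methods and components can easily be applied to other trackers
cMOTA, a new MOT investigation tool, is introduced
The limitations are as follows
- In scenes with many moving objects, the lack of background keypoints can make camera motion estimation fail, which can cause the tracker to malfunction
- Computing the camera motion takes a long time on large images
- A separate appearance tracker is slower than joint trackers or appearance-free trackers, so appearance features are extracted only from detections with a high score to reduce the computational cost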

Paper link

https://arxiv.org/abs/2206.14651

https://github.com/NirAharon/BoT-SORT

License: Attribution, NonCommercial, ShareAlike
