[AR Lab] mulberryAR: Extracting ORB Features in Parallel


Please credit the source when reposting: polobymulberry (博客园/cnblogs)

0x00 - Preface


At the end of 【AR实验室】mulberryAR : ORBSLAM2+VVSION I gave real-device test results on an iPhone 5s, which showed that the ExtractORB function, i.e. the block that extracts ORB features from the image, takes considerable time. It is therefore the top optimization target right now. As input I use the recorded image sequence introduced in 【AR实验室】mulberryAR: 添加连续图像作为输入 (adding continuous images as input). This has two benefits: first, the input is identical across runs, so the comparison between single-threaded and parallel feature extraction is credible; second, the tests can run on the iOS Simulator, since no camera needs to be opened, which is very convenient and offers plenty of device models to choose from.

For optimizing the feature extraction stage I currently have only two ideas:

  • Parallelize the feature extraction process.
  • Reduce the number of extracted feature points.

The second approach is trivial: just change the number of feature points to extract in the configuration file (a sketch of the relevant settings-file entries follows), so I won't dwell on it. This article focuses on the first approach, a first attempt at parallelizing feature extraction.
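For reference, a sketch of what this looks like in a settings file. The key names below come from the stock ORB-SLAM2 example configs (e.g. TUM1.yaml); mulberryAR's own file may differ:

%YAML:1.0
# ORB extractor parameters (stock ORB-SLAM2 settings keys)
ORBextractor.nFeatures: 1000   # lower this value to extract fewer feature points
ORBextractor.scaleFactor: 1.2  # scale between pyramid levels
ORBextractor.nLevels: 8        # number of pyramid levels
ORBextractor.iniThFAST: 20     # initial FAST threshold
ORBextractor.minThFAST: 7      # fallback FAST threshold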

0x01 - Timing Analysis of ORB-SLAM2 Feature Extraction


In ORB-SLAM2 the feature extraction function is called ExtractORB, a member function of the Frame class that extracts the ORB feature points of the current Frame.

// flag is used by stereo cameras; for a monocular camera, flag defaults to 0
// Extracts the ORB feature points from im
void Frame::ExtractORB(int flag, const cv::Mat &im)
{
    if(flag==0)
        // mpORBextractorLeft is an ORBextractor object. ORBextractor
        // overloads operator(), which is why the call below works.
        (*mpORBextractorLeft)(im,cv::Mat(),mvKeys,mDescriptors);
    else
        (*mpORBextractorRight)(im,cv::Mat(),mvKeysRight,mDescriptorsRight);
}

As the code above shows, ORB-SLAM2's feature extraction boils down to ORBextractor's overloaded operator(). I instrumented the important parts of that function to measure how long each takes.

An important note on timing code execution:

There are many ways to time a stretch of code, for example:

clock_t begin = clock();
//...
clock_t end = clock();
cout << "execute time = " << double(end - begin) / CLOCKS_PER_SEC << "s" << endl;

However, when I previously timed the multithreaded array summation in 【原】C++11并行计算 — 数组求和 with this method, I found it misleading for multithreaded code: clock() returns CPU time accumulated across all threads rather than wall-clock time, so it overstates the elapsed time of parallel code. Since I am currently targeting iOS, I use iOS's own timing facilities here instead. And because Foundation cannot be used directly from a C++ file, I use the corresponding CoreFoundation APIs.
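To make the pitfall concrete, here is a minimal, self-contained sketch (my addition, not from the original post): clock() accumulates CPU time across all threads, so for the four-thread workload below it reports roughly four times the wall-clock time that std::chrono::steady_clock measures.

#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    using namespace std::chrono;
    clock_t cpuBegin = clock();
    steady_clock::time_point wallBegin = steady_clock::now();

    // Burn CPU on four threads at once.
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([]{
            volatile double x = 0;
            for (long k = 0; k < 100000000L; ++k) x += k;
        });
    for (auto& t : workers) t.join();

    double cpuMs  = 1000.0 * double(clock() - cpuBegin) / CLOCKS_PER_SEC;
    double wallMs = duration_cast<milliseconds>(steady_clock::now() - wallBegin).count();
    // cpuMs is CPU time summed over all threads; wallMs is real elapsed time.
    std::cout << "clock(): " << cpuMs << "ms, steady_clock: " << wallMs << "ms\n";
}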

CFAbsoluteTime beginTime = CFAbsoluteTimeGetCurrent();
CFDateRef beginDate = CFDateCreate(kCFAllocatorDefault, beginTime);
// ...
CFAbsoluteTime endTime = CFAbsoluteTimeGetCurrent();
CFDateRef endDate = CFDateCreate(kCFAllocatorDefault, endTime);
CFTimeInterval timeInterval = CFDateGetTimeIntervalSinceDate(endDate, beginDate);
cout << "execute time = " << (double)(timeInterval) * 1000.0 << "ms" << endl;

After inserting the timing code into operator(), the function currently looks as follows. Three parts are timed: ComputePyramid, ComputeKeyPointsOctTree, and ComputeDescriptors:

void ORBextractor::operator()( InputArray _image, InputArray _mask, vector<KeyPoint>& _keypoints,
                      OutputArray _descriptors)
{ 
    if(_image.empty())
        return;

    Mat image = _image.getMat();
    assert(image.type() == CV_8UC1 );

    // 1. Time the image pyramid computation
    CFAbsoluteTime beginComputePyramidTime = CFAbsoluteTimeGetCurrent();
    CFDateRef computePyramidBeginDate = CFDateCreate(kCFAllocatorDefault, beginComputePyramidTime);
    // Pre-compute the scale pyramid
    ComputePyramid(image);
    CFAbsoluteTime endComputePyramidTime = CFAbsoluteTimeGetCurrent();
    CFDateRef computePyramidEndDate = CFDateCreate(kCFAllocatorDefault, endComputePyramidTime);
    CFTimeInterval computePyramidTimeInterval = CFDateGetTimeIntervalSinceDate(computePyramidEndDate, computePyramidBeginDate);
    cout << "ComputePyramid time = " << (double)(computePyramidTimeInterval) * 1000.0 << endl;

    vector < vector<KeyPoint> > allKeypoints;
    
    // 2. Time the keypoint (KeyPoint) computation
    CFAbsoluteTime beginComputeKeyPointsTime = CFAbsoluteTimeGetCurrent();
    CFDateRef computeKeyPointsBeginDate = CFDateCreate(kCFAllocatorDefault, beginComputeKeyPointsTime);
    
    ComputeKeyPointsOctTree(allKeypoints);
    //ComputeKeyPointsOld(allKeypoints);
    CFAbsoluteTime endComputeKeyPointsTime = CFAbsoluteTimeGetCurrent();
    CFDateRef computeKeyPointsEndDate = CFDateCreate(kCFAllocatorDefault, endComputeKeyPointsTime);
    CFTimeInterval computeKeyPointsTimeInterval = CFDateGetTimeIntervalSinceDate(computeKeyPointsEndDate, computeKeyPointsBeginDate);
    cout << "ComputeKeyPointsOctTree time = " << (double)(computeKeyPointsTimeInterval) * 1000.0 << endl;

    Mat descriptors;

    int nkeypoints = 0;
    for (int level = 0; level < nlevels; ++level)
        nkeypoints += (int)allKeypoints[level].size();
    if( nkeypoints == 0 )
        _descriptors.release();
    else
    {
        _descriptors.create(nkeypoints, 32, CV_8U);
        descriptors = _descriptors.getMat();
    }

    _keypoints.clear();
    _keypoints.reserve(nkeypoints);

    int offset = 0;
    
    // 3. Time the descriptor computation
    CFAbsoluteTime beginComputeDescriptorsTime = CFAbsoluteTimeGetCurrent();
    CFDateRef computeDescriptorsBeginDate = CFDateCreate(kCFAllocatorDefault, beginComputeDescriptorsTime);
    for (int level = 0; level < nlevels; ++level)
    {
        vector<KeyPoint>& keypoints = allKeypoints[level];
        int nkeypointsLevel = (int)keypoints.size();

        if(nkeypointsLevel==0)
            continue;

        // preprocess the resized image
        Mat workingMat = mvImagePyramid[level].clone();
        GaussianBlur(workingMat, workingMat, cv::Size(7, 7), 2, 2, BORDER_REFLECT_101);

        // Compute the descriptors
        Mat desc = descriptors.rowRange(offset, offset + nkeypointsLevel);
        computeDescriptors(workingMat, keypoints, desc, pattern);

        offset += nkeypointsLevel;

        // Scale keypoint coordinates
        if (level != 0)
        {
            float scale = mvScaleFactor[level]; //getScale(level, firstLevel, scaleFactor);
            for (vector<KeyPoint>::iterator keypoint = keypoints.begin(),
                 keypointEnd = keypoints.end(); keypoint != keypointEnd; ++keypoint)
                keypoint->pt *= scale;
        }
        // And add the keypoints to the output
        _keypoints.insert(_keypoints.end(), keypoints.begin(), keypoints.end());
    }
    CFAbsoluteTime endComputeDescriptorsTime = CFAbsoluteTimeGetCurrent();
    CFDateRef computeDescriptorsEndDate = CFDateCreate(kCFAllocatorDefault, endComputeDescriptorsTime);
    CFTimeInterval computeDescriptorsTimeInterval = CFDateGetTimeIntervalSinceDate(computeDescriptorsEndDate, computeDescriptorsBeginDate);
    cout << "ComputeDescriptors time = " << (double)(computeDescriptorsTimeInterval) * 1000.0 << endl;
}

Now, running mulberryAR in the iPhone 7 simulator on a previously recorded sequence of frames produces the following results (only the first three frames shown):

Clearly the optimization effort should focus on ComputeKeyPointsOctTree and ComputeDescriptors.

0x02 - Ideas for Optimizing ORB-SLAM2 Feature Extraction


ComputePyramid, ComputeKeyPointsOctTree, and ComputeDescriptors all repeat the same operation for each level of the image pyramid, so the per-level work is a natural candidate for parallelization. Following this idea, I modified the code of all three parts.

1. Parallelizing ComputePyramid

This function cannot be parallelized for now, because computing level n of the image pyramid depends on the image at level n-1. In any case it accounts for only a small share of the total extraction time, so parallelizing it would not buy much.
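To see why, here is a simplified sketch of the loop, modeled on ORB-SLAM2's ComputePyramid (image is the input frame; border padding and other details omitted): each level is produced by resizing the previous one, a true data dependency that forces sequential execution.

for (int level = 1; level < nlevels; ++level)
{
    const float scale = mvInvScaleFactor[level];
    cv::Size sz(cvRound((float)image.cols*scale), cvRound((float)image.rows*scale));
    // Level n is resized from level n-1, so the loop cannot be split
    // across threads the way the other two stages can.
    cv::resize(mvImagePyramid[level-1], mvImagePyramid[level], sz, 0, 0, cv::INTER_LINEAR);
}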

2. Parallelizing ComputeKeyPointsOctTree

Parallelizing this function is easy: pull the body of its for(int i = 0; i < nlevels; ++i) loop out into a separate function and run each level on its own thread. Without further ado, the code:

void ORBextractor::ComputeKeyPointsOctTree(vector<vector<KeyPoint> >& allKeypoints)
{
    allKeypoints.resize(nlevels);

    vector<thread> computeKeyPointsThreads;
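    // One worker thread per pyramid level: each thread writes only its own
    // allKeypoints[i] element (the vector was resized above), so no locking
    // is needed beyond the join() calls below.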
    
    for (int i = 0; i < nlevels; ++i) {
        computeKeyPointsThreads.push_back(thread(&ORBextractor::ComputeKeyPointsOctTreeEveryLevel, this, i, std::ref(allKeypoints)));
    }
    
    for (int i = 0; i < nlevels; ++i) {
        computeKeyPointsThreads[i].join();
    }
    
    // compute orientations
    vector<thread> computeOriThreads;
    for (int level = 0; level < nlevels; ++level) {
        computeOriThreads.push_back(thread(computeOrientation, mvImagePyramid[level], std::ref(allKeypoints[level]), umax));
    }
    
    for (int level = 0; level < nlevels; ++level) {
        computeOriThreads[level].join();
    }
}

The ComputeKeyPointsOctTreeEveryLevel helper looks like this:

void ORBextractor::ComputeKeyPointsOctTreeEveryLevel(int level, vector<vector<KeyPoint> >& allKeypoints)
{
    const float W = 30;
    
    const int minBorderX = EDGE_THRESHOLD-3;
    const int minBorderY = minBorderX;
    const int maxBorderX = mvImagePyramid[level].cols-EDGE_THRESHOLD+3;
    const int maxBorderY = mvImagePyramid[level].rows-EDGE_THRESHOLD+3;
    
    vector<cv::KeyPoint> vToDistributeKeys;
    vToDistributeKeys.reserve(nfeatures*10);
    
    const float width = (maxBorderX-minBorderX);
    const float height = (maxBorderY-minBorderY);
    
    const int nCols = width/W;
    const int nRows = height/W;
    const int wCell = ceil(width/nCols);
    const int hCell = ceil(height/nRows);
    
    for(int i=0; i<nRows; i++)
    {
        const float iniY =minBorderY+i*hCell;
        float maxY = iniY+hCell+6;
        
        if(iniY>=maxBorderY-3)
            continue;
        if(maxY>maxBorderY)
            maxY = maxBorderY;
        
        for(int j=0; j<nCols; j++)
        {
            const float iniX =minBorderX+j*wCell;
            float maxX = iniX+wCell+6;
            if(iniX>=maxBorderX-6)
                continue;
            if(maxX>maxBorderX)
                maxX = maxBorderX;
            
            vector<cv::KeyPoint> vKeysCell;
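            // Try FAST with the stricter initial threshold first; if the
            // cell yields no corners, retry with the lower minimum threshold.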
            FAST(mvImagePyramid[level].rowRange(iniY,maxY).colRange(iniX,maxX),
                 vKeysCell,iniThFAST,true);
            
            if(vKeysCell.empty())
            {
                FAST(mvImagePyramid[level].rowRange(iniY,maxY).colRange(iniX,maxX),
                     vKeysCell,minThFAST,true);
            }
            
            if(!vKeysCell.empty())
            {
                for(vector<cv::KeyPoint>::iterator vit=vKeysCell.begin(); vit!=vKeysCell.end();vit++)
                {
                    (*vit).pt.x+=j*wCell;
                    (*vit).pt.y+=i*hCell;
                    vToDistributeKeys.push_back(*vit);
                }
            }
            
        }
    }
    
    vector<KeyPoint> & keypoints = allKeypoints[level];
    keypoints.reserve(nfeatures);
    
    keypoints = DistributeOctTree(vToDistributeKeys, minBorderX, maxBorderX,
                                  minBorderY, maxBorderY,mnFeaturesPerLevel[level], level);
    
    const int scaledPatchSize = PATCH_SIZE*mvScaleFactor[level];
    
    // Add border to coordinates and scale information
    const int nkps = keypoints.size();
    for(int i=0; i<nkps ; i++)
    {
        keypoints[i].pt.x+=minBorderX;
        keypoints[i].pt.y+=minBorderY;
        keypoints[i].octave=level;
        keypoints[i].size = scaledPatchSize;
    }
}
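A caveat of my own (not from the original post): this spawns one thread per pyramid level, eight with the default configuration, which can oversubscribe a phone's cores. A sketch of one way to bound the fan-out, reusing the same helper, is below; whether it helps in practice would have to be measured.

// Sketch: run the levels in batches no larger than the core count.
const int nWorkers = (int)std::max(1u, std::thread::hardware_concurrency());
for (int base = 0; base < nlevels; base += nWorkers) {
    vector<thread> batch;
    const int batchEnd = std::min(nlevels, base + nWorkers);
    for (int i = base; i < batchEnd; ++i)
        batch.push_back(thread(&ORBextractor::ComputeKeyPointsOctTreeEveryLevel, this, i, std::ref(allKeypoints)));
    for (auto& t : batch)
        t.join();
}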

Testing on the iPhone 7 simulator gives the following results (first 5 frames):

With the parallelized version, ComputeKeyPointsOctTree gets a 2-3x speedup.

3. Parallelizing the ComputeDescriptors part

I call this a "part" rather than a "function" because, compared with ComputeKeyPointsOctTree, the code involved is more tangled and touches more variables. Those relationships have to be sorted out before it can be parallelized safely.

I won't belabor it; here is the modified parallel code:

vector<thread> computeDescThreads;
vector<vector<KeyPoint> > keypointsEveryLevel;
keypointsEveryLevel.resize(nlevels);
// Each level's offset depends on the offsets of all preceding levels,
// so it cannot be computed inside ComputeDescriptorsEveryLevel itself.
for (int level = 0; level < nlevels; ++level) {
    computeDescThreads.push_back(thread(&ORBextractor::ComputeDescriptorsEveryLevel, this, level, std::ref(allKeypoints), descriptors, offset, std::ref(keypointsEveryLevel[level])));
    int keypointsNum = (int)allKeypoints[level].size();
    offset += keypointsNum;
}

for (int level = 0; level < nlevels; ++level) {
    computeDescThreads[level].join();
}
// _keypoints must be filled in pyramid-level order, and the threads finish
// in no particular order, so the insertion cannot happen inside
// ComputeDescriptorsEveryLevel either.
for (int level = 0; level < nlevels; ++level) {
    _keypoints.insert(_keypoints.end(), keypointsEveryLevel[level].begin(), keypointsEveryLevel[level].end());
}

// The ComputeDescriptorsEveryLevel function is defined as follows
void ORBextractor::ComputeDescriptorsEveryLevel(int level, std::vector<std::vector<KeyPoint> > &allKeypoints, const Mat& descriptors, int offset, vector<KeyPoint>& _keypoints)
{
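    // Note: offset is passed by value. The caller fixes it per level before
    // spawning the thread, so each thread writes a disjoint rowRange of the
    // shared descriptors Mat and no synchronization is needed.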
    vector<KeyPoint>& keypoints = allKeypoints[level];
    int nkeypointsLevel = (int)keypoints.size();
    
    if(nkeypointsLevel==0)
        return;
    
    // preprocess the resized image
    Mat workingMat = mvImagePyramid[level].clone();
    GaussianBlur(workingMat, workingMat, cv::Size(7, 7), 2, 2, BORDER_REFLECT_101);
    
    // Compute the descriptors
    Mat desc = descriptors.rowRange(offset, offset + nkeypointsLevel);
    computeDescriptors(workingMat, keypoints, desc, pattern);
    
//    offset += nkeypointsLevel;
    
    // Scale keypoint coordinates
    if (level != 0)
    {
        float scale = mvScaleFactor[level]; //getScale(level, firstLevel, scaleFactor);
        for (vector<KeyPoint>::iterator keypoint = keypoints.begin(),
             keypointEnd = keypoints.end(); keypoint != keypointEnd; ++keypoint)
            keypoint->pt *= scale;
    }
    // And add the keypoints to the output
//    _keypoints.insert(_keypoints.end(), keypoints.begin(), keypoints.end());
    _keypoints = keypoints;
}

Testing on the iPhone 7 simulator gives the following results (first 5 frames):

With the parallelized version, ComputeDescriptors also gets a 2-3x speedup.

0x03 - Analysis of the Parallelization Results


Section 0x02 already compared the results of each optimization step. Here I briefly look at the overall speed. The comparison over the first 5 frames, again on the iPhone 7 simulator:

The results show that ORB feature extraction is now 2-3x faster, and its share of TrackMonocular time has dropped considerably, so for the moment feature extraction no longer needs to be the focus of performance work. Next I may optimize ORB-SLAM2 from other angles.