TensorRT INT8 Inference

Principle

Why use INT8 inference: higher processing throughput / more frames per second and lower memory consumption (8-bit vs. 32-bit).

The challenge of converting an FP32 model into an INT8 model: a much smaller dynamic range and lower precision.

Consider that 32-bit floating point can represent roughly 4 billion numbers in the interval [-3.4e38, 3.4e38]. This interval of representable numbers is also known as the dynamic range. The distance between two neighboring representable numbers is the precision of the representation. (from "Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT")

How is FP32 quantized into INT8? The simplest approach is symmetric linear quantization: each tensor's FP32 values are represented as the quantized INT8 values multiplied by a per-tensor scale factor. The question then becomes how to determine this scale factor.
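
As a minimal sketch (an illustration of symmetric linear quantization, not TensorRT's internal code): given a threshold T, the scale factor is T / 127; quantization divides by the scale, rounds and clips to [-127, 127], and dequantization multiplies back:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Symmetric linear quantization with threshold T: scale = T / 127.
// Values outside [-T, T] saturate to the INT8 limits.
int8_t quantizeSym(float x, float scale) {
    int q = static_cast<int>(std::round(x / scale));
    return static_cast<int8_t>(std::min(127, std::max(-127, q)));
}

float dequantizeSym(int8_t q, float scale) {
    return q * scale;
}

// Example: with T = 6.0, scale = 6.0f / 127 ≈ 0.0472;
// quantizeSym(1.0f, 6.0f / 127) == 21 and dequantizeSym(21, 6.0f / 127) ≈ 0.992.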

For weights, TensorRT uses the non-saturating (max-abs) symmetric mapping, which introduces no accuracy loss. For activations, TensorRT quantizes to INT8 with a saturating mapping, which raises a new problem: how to choose the best threshold |T| for each activation tensor (this is what the calibration process actually does).

Choosing different thresholds amounts to choosing different encodings. From an information-theoretic point of view, we want the encoding that minimizes the information lost between the original and the quantized distributions, and this information loss can be measured with the KL divergence.
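
As a simplified sketch of this idea (an illustration of entropy calibration, not TensorRT's exact implementation): collect a histogram of each activation tensor, and for every candidate threshold compare the reference distribution P with the saturated, re-quantized distribution Q, keeping the threshold with the smallest KL divergence:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) for two normalized histograms.
// Bins with P(i) == 0 contribute nothing; a small epsilon guards against Q(i) == 0.
double klDivergence(const std::vector<double>& P, const std::vector<double>& Q) {
    const double eps = 1e-12;
    double kl = 0.0;
    for (std::size_t i = 0; i < P.size(); ++i) {
        if (P[i] > 0.0) {
            kl += P[i] * std::log(P[i] / std::max(Q[i], eps));
        }
    }
    return kl;
}

// Calibration then amounts to: for each candidate threshold T (a histogram bin edge),
// saturate the tail of P into the last kept bin, re-quantize it into 128 bins to get Q,
// and pick the T that minimizes klDivergence(P, Q).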

Activation calibration

Practice

To use TensorRT's INT8 inference, we need to write our own calibrator class and then tell the builder to calibrate on data with this calibrator (via setInt8Calibrator), thereby reducing the quantization error.

As for how the builder performs calibration, it works as follows:

The builder first calls the calibrator's getBatchSize() to get the size of an input batch.

The builder then calls getBatch() repeatedly to obtain input data for calibration. Each batch read in must have exactly the size returned by getBatchSize(). When there are no more input batches, getBatch() returns false.

The builder first builds a 32-bit engine, runs forward inference over the calibration set, and records a histogram of each layer's activations.

A calibration table is built from these histograms.

An 8-bit engine is then created from the calibration table and the network definition.

The calibration process is, however, quite time-consuming. By caching the calibration table, the same network can be built efficiently multiple times. To enable calibration-table caching, the calibrator class needs to implement the two functions writeCalibrationCache() and readCalibrationCache().

In summary, to build an INT8 engine, the developer needs to implement a calibrator class that overrides the following functions:

getBatchSize

getBatch

writeCalibrationCache (optional)

readCalibrationCache (optional)

The calibrator class derives from IInt8Calibrator, and TensorRT provides four derived calibrator interfaces (IInt8EntropyCalibrator, IInt8EntropyCalibrator2, IInt8MinMaxCalibrator, IInt8LegacyCalibrator). The calibrator in our example inherits from IInt8EntropyCalibrator.


#include <algorithm>
#include <assert.h>
#include <chrono>
#include <cmath>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <sys/stat.h>
#include <time.h>
#include <opencv2/opencv.hpp>
#include "NvInfer.h"
#include "NvOnnxParser.h"
#include "argsParser.h"
#include "logger.h"
#include "common.h"
#include "image.hpp"
#define DebugP(x) std::cout << "Line" << __LINE__ << "  " << #x << "=" << x << std::endl
using namespace nvinfer1;
Logger gLogger;
// LogStreamConsumer gLogError;
static const int INPUT_H = 224;
static const int INPUT_W = 224;
static const int INPUT_C = 3;
static const int OUTPUT_SIZE = 1000;
const char* INPUT_BLOB_NAME = "input";
const char* OUTPUT_BLOB_NAME = "output";
const std::string gSampleName = "TensorRT.sample_onnx_image";
const std::string onnxFile = "resnet50.onnx";
const std::string engineFile = "../data/resnet50_int8.trt";
const std::string calibFile = "../data/calibration_img.txt";
samplesCommon::Args gArgs;
std::vector<float> prepareImage(cv::Mat img) {
    int c = 3;
    int h = INPUT_H;
    int w = INPUT_W;
    // 1 Resize the source image to a specific size (keeping the original aspect ratio)
    float scale = std::min(float(w) / img.cols, float(h) / img.rows);
    auto scaleSize = cv::Size(img.cols * scale, img.rows * scale);
    // Convert BGR to RGB
    cv::Mat rgb;
    cv::cvtColor(img, rgb, cv::COLOR_BGR2RGB);
    cv::Mat resized;
    cv::resize(rgb, resized, scaleSize, 0, 0, cv::INTER_CUBIC);
    // 2 Crop image (place the resized image at the center of (H, W, C), pad the border with 127)
    cv::Mat cropped(h, w, CV_8UC3, cv::Scalar(127, 127, 127));
    // Rect(left_top_x, left_top_y, width, height)
    cv::Rect rect((w - scaleSize.width) / 2, (h - scaleSize.height) / 2, scaleSize.width, scaleSize.height);
    resized.copyTo(cropped(rect));
    // 3 Type conversion, convert unsigned int 8 to float 32
    cv::Mat img_float;
    cropped.convertTo(img_float, CV_32FC3, 1.f / 255.0);
    // HWC to CHW, and convert cv::Mat to std::vector<float> (split the float image so each channel holds CV_32F data)
    std::vector<cv::Mat> input_channels(c);
    cv::split(img_float, input_channels);
    std::vector<float> result(h * w * c);
    auto data = result.data();
    int channelLength = h * w;
    for (int i = 0; i < c; ++i) {
        memcpy(data, input_channels[i].data, channelLength * sizeof(float));
        data += channelLength;
    }
    return result;
}
// Implement our own calibrator class
namespace nvinfer1 {
    class int8EntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator {
        public:
            int8EntropyCalibrator(const int batchSize,
                const std::string imgPath,
                const std::string calibTablePath);
            virtual ~int8EntropyCalibrator();
            int getBatchSize() const override { return batchSize; }
            bool getBatch(void *bindings[], const char *names[], int nbBindings) override;
            const void *readCalibrationCache(std::size_t &length) override;
            void writeCalibrationCache(const void *ptr, std::size_t length) override;

        private:
            int batchSize;
            size_t inputCount;
            size_t imageIndex;
            std::string calibTablePath;
            std::vector<std::string> imgPaths;
            float *batchData { nullptr };
            void *deviceInput { nullptr };
            bool readCache;
            std::vector<char> calibrationCache;
    };
    int8EntropyCalibrator::int8EntropyCalibrator(const int batchSize, const std::string imgPath,
        const std::string calibTablePath) : batchSize(batchSize), calibTablePath(calibTablePath), imageIndex(0), readCache(true) {
            int inputChannel = 3;
            int inputH = INPUT_H;   // must match the size produced by prepareImage()
            int inputW = INPUT_W;
            inputCount = batchSize * inputChannel * inputH * inputW;
            std::fstream f(imgPath);
            if (f.is_open()) {
                std::string temp;
                while (std::getline(f, temp)) imgPaths.push_back(temp);
            }
            int len = imgPaths.size();
            for (int i = 0; i < len; i++) {
                std::cout << imgPaths[i] << std::endl;
            }
            // allocate memory for a batch of data: batchData is on the CPU, deviceInput is on the GPU
            batchData = new float[inputCount];
            CHECK(cudaMalloc(&deviceInput, inputCount * sizeof(float)));
        }
        int8EntropyCalibrator::~int8EntropyCalibrator() {
            CHECK(cudaFree(deviceInput));
            if (batchData) {
                delete[] batchData;
            }
        }
        bool int8EntropyCalibrator::getBatch(void **bindings, const char **names, int nbBindings) {
            std::cout << imageIndex << " " << batchSize << std::endl;
            std::cout << imgPaths.size() << std::endl;
            if (imageIndex + batchSize > imgPaths.size())
                return false;
            // load a batch of images
            float *ptr = batchData;
            for (size_t j = imageIndex; j < imageIndex + batchSize; ++j) {
                cv::Mat img = cv::imread(imgPaths[j]);
                std::vector<float> inputData = prepareImage(img);
                if (inputData.size() != inputCount) {
                    std::cout << "InputSize Error" << std::endl;
                    return false;
                }
                assert(inputData.size() == inputCount);
                memcpy(ptr, inputData.data(), (int)(inputData.size()) * sizeof(float));
                ptr += inputData.size();
                std::cout << "load image " << imgPaths[j] << " " << (j + 1) * 100. / imgPaths.size() << "%" << std::endl;
            }
            imageIndex += batchSize;
            // copy the batch from host to device
            CHECK(cudaMemcpy(deviceInput, batchData, inputCount * sizeof(float), cudaMemcpyHostToDevice));
            bindings[0] = deviceInput;
            return true;
        }
        const void* int8EntropyCalibrator::readCalibrationCache(std::size_t &length) {
            calibrationCache.clear();
            std::ifstream input(calibTablePath, std::ios::binary);
            input >> std::noskipws;
            if (readCache && input.good()) {
                std::copy(std::istream_iterator<char>(input), std::istream_iterator<char>(),
                    std::back_inserter(calibrationCache));
            }
            length = calibrationCache.size();
            return length ? &calibrationCache[0] : nullptr;
        }
        void int8EntropyCalibrator::writeCalibrationCache(const void *cache, std::size_t length) {
            std::ofstream output(calibTablePath, std::ios::binary);
            output.write(reinterpret_cast<const char*>(cache), length);
        }
}
bool onnxToTRTModel(const std::string modelFile,  // name of the onnx model
                    unsigned int maxBatchSize,    // batch size - NB must be at least as large as the batch we want to run with
                    IHostMemory*& trtModelStream, // output buffer for the TensorRT model
                    const std::string engineFile)
{
    // create the builder
    IBuilder* builder = createInferBuilder(gLogger.getTRTLogger());
    assert(builder != nullptr);
    // create the config
    auto config = builder->createBuilderConfig();
    assert(config != nullptr);
    if (!builder->platformHasFastInt8()) {
        std::cout << "builder platform does not support Int8" << std::endl;
        return false;
    }
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    std::cout << "explicitBatch is: " << explicitBatch << std::endl;
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
    auto parser = nvonnxparser::createParser(*network, gLogger.getTRTLogger());
    // Optional - uncomment below lines to view network layer information
    // config->setPrintLayerInfo(true);
    // parser->reportParsingInfo();
    if (!parser->parseFromFile(locateFile(modelFile, gArgs.dataDirs).c_str(), static_cast<int>(gLogger.getReportableSeverity())))
    {
        gLogError << "Failure while parsing ONNX file" << std::endl;
        return false;
    }

    // config
    config->setAvgTimingIterations(1);
    config->setMinTimingIterations(1);
    config->setMaxWorkspaceSize(1_GiB);
    // Build the engine
    builder->setMaxBatchSize(maxBatchSize);
    // builder->setMaxWorkspaceSize(1 << 20);
    builder->setMaxWorkspaceSize(10 << 20);
    nvinfer1::int8EntropyCalibrator *calibrator = nullptr;
    if (calibFile.size() > 0) calibrator = new nvinfer1::int8EntropyCalibrator(maxBatchSize, calibFile, "");
    // builder->setFp16Mode(gArgs.runInFp16);
    // builder->setInt8Mode(gArgs.runInInt8);
    // Configure the builder: enable Int8 mode and calibrate with the calibrator class written above
    builder->setInt8Mode(true);
    builder->setInt8Calibrator(calibrator);
    // if (gArgs.runInInt8)
    // {
    //     samplesCommon::setAllTensorScales(network, 127.0f, 127.0f);
    // }
    config->setFlag(BuilderFlag::kINT8);
    config->setInt8Calibrator(calibrator);
    // If a calibrator is used, comment out the line below and let calibration on the dataset produce the dynamic ranges
    // (see https://github.com/enazoe/yolo-tensorrt/blob/dd4cb522625947bfe6bfbdfbb6890c3f7558864a/modules/yolo.cpp);
    // otherwise keep it to set the dynamic ranges manually.
    // setAllTensorScales is provided in the official TensorRT open-source samples.
    samplesCommon::setAllTensorScales(network, 127.0f, 127.0f);
    // samplesCommon::enableDLA(builder, gArgs.useDLACore);

    // Build with the config so that the INT8 flag and calibrator set on it take effect
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    assert(engine);
    if (calibrator) {
        delete calibrator;
        calibrator = nullptr;
    }
    // we can destroy the parser
    parser->destroy();
    // serialize the engine, then close everything down
    trtModelStream = engine->serialize();
    std::ofstream file;
    file.open(engineFile, std::ios::binary | std::ios::out);
    file.write((const char*)trtModelStream->data(), trtModelStream->size());
    file.close();
    engine->destroy();
    config->destroy();
    network->destroy();
    builder->destroy();
    return true;
}
void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
    const ICudaEngine& engine = context.getEngine();
    // input and output buffer pointers that we pass to the engine - the engine requires exactly IEngine::getNbBindings()
    // of these, but in this case we know that there is exactly one input and one output.
    assert(engine.getNbBindings() == 2);
    void* buffers[2];
    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()

    const int inputIndex = engine.getBindingIndex(INPUT_BLOB_NAME);
    const int outputIndex = engine.getBindingIndex(OUTPUT_BLOB_NAME);

    DebugP(inputIndex); DebugP(outputIndex);
    // create GPU buffers and a stream
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float)));
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    // DMA the input to the GPU, execute the batch asynchronously, and DMA it back:
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * INPUT_C * INPUT_H * INPUT_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * OUTPUT_SIZE * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);
    // release the stream and the buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}
//!
//! \brief This function prints the help information for running this sample
//!
void printHelpInfo()
{
    std::cout  "Usage: ./sample_onnx_mnist [-h or --help] [-d or --datadir=path to data directory] [--useDLACore=int]\n";
    std::cout  "--help          Display help information\n";
    std::cout  "--datadir       Specify path to a data directory, overriding the default. This option can be used multiple times to add multiple directories. If no data directories are given, the default is to use (data/samples/mnist/, data/mnist/)"  std::endl;
    std::cout  "--useDLACore=N  Specify a DLA engine for layers that support DLA. Value can range from 0 to n-1, where n is the number of DLA engines on the platform."  std::endl;
    std::cout  "--int8          Run in Int8 mode.\n";
    std::cout  "--fp16          Run in FP16 mode."  std::endl;
}
int main(int argc, char** argv)
{
    bool argsOK = samplesCommon::parseArgs(gArgs, argc, argv);
    if (gArgs.help)
    {
        printHelpInfo();
        return EXIT_SUCCESS;
    }
    if (!argsOK)
    {
        std::cout << "Invalid arguments" << std::endl;
        // gLogError << "Invalid arguments" << std::endl;
        printHelpInfo();
        return EXIT_FAILURE;
    }
    if (gArgs.dataDirs.empty())
    {
        gArgs.dataDirs = std::vector<std::string>{"data/"};
    }
    auto sampleTest = gLogger.defineTest(gSampleName, argc, const_cast<const char**>(argv));
    gLogger.reportTestStart(sampleTest);
    // create a TensorRT model from the onnx model and serialize it to a stream
    nvinfer1::IHostMemory* trtModelStream{nullptr};
    if (!onnxToTRTModel(onnxFile, 1, trtModelStream, engineFile))
        gLogger.reportFail(sampleTest);
    assert(trtModelStream != nullptr);
    std::cout << "Successfully parsed ONNX file!!!!" << std::endl;

    std::cout << "Start reading the input image!!!!" << std::endl;

    cv::Mat image = cv::imread(locateFile("test.jpg", gArgs.dataDirs), cv::IMREAD_COLOR);
    if (image.empty()) {
        std::cout << "The input image is empty!!! Please check....." << std::endl;
    }
    DebugP(image.size());
    cv::cvtColor(image, image, cv::COLOR_BGR2RGB);
    cv::Mat dst = cv::Mat::zeros(INPUT_H, INPUT_W, CV_32FC3);
    cv::resize(image, dst, dst.size());
    DebugP(dst.size());
    // normal() is assumed to come from image.hpp: it normalizes the image and converts HWC to CHW
    float* data = normal(dst);
    // deserialize the engine
    IRuntime* runtime = createInferRuntime(gLogger);
    assert(runtime != nullptr);
    if (gArgs.useDLACore >= 0)
    {
        runtime->setDLACore(gArgs.useDLACore);
    }
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream->data(), trtModelStream->size(), nullptr);
    assert(engine != nullptr);
    trtModelStream->destroy();
    IExecutionContext* context = engine->createExecutionContext();
    assert(context != nullptr);

    float prob[OUTPUT_SIZE];
    typedef std::chrono::high_resolution_clock Time;
    typedef std::chrono::duration<double, std::ratio<1, 1000>> ms;
    typedef std::chrono::duration<float> fsec;
    double total = 0.0;
    // run inference and report the elapsed time
    auto t0 = Time::now();
    doInference(*context, data, prob, 1);
    auto t1 = Time::now();
    fsec fs = t1 - t0;
    ms d = std::chrono::duration_cast<ms>(fs);
    total += d.count();
    // destroy the engine
    context->destroy();
    engine->destroy();
    runtime->destroy();

    std::cout << std::endl << "Running time of one image is: " << total << " ms" << std::endl;

    std::cout << "Output:\n";
    for (int i = 0; i < OUTPUT_SIZE; i++)
    {
        gLogInfo << prob[i] << " ";
    }
    std::cout << std::endl;
    return gLogger.reportTest(sampleTest, true);
}

Besides the implementation above, the official sampleINT8.cpp is also well worth studying.

References:

  1. Slides: 8-bit Inference with TensorRT
  2. Video: 8-bit Inference with TensorRT
