When clipping a video, mind the GOP and hope the I-frame is an IDR-frame

A picture is worth a thousand words, but if it is a B-frame it may be worth two hundred words.

Mind the GOP
Mind the GOP

One of our products ,ViDeus Auditor, lets you clip and join videos, showing a preview before doing the actual clipping. For doing it, we have to understand how an encoded video is composed. We usually work with H264.

When encoding each video picture you can get a I-frame, a P-frame or a B-frame.

  • The I-frame is the easy one, all the information for decoding the picture is within the I-frame.
  • The P-frame is a frame which needs previous decoded pictures for being decoded. So it uses information from old pictures.
  • And the B-frame needs decoded pictures from the past and from the future. So it uses information from old and future pictures.
    For example, you can get something like this:
I B B P B B P B B P

The I-frame can be decoded instantaneously, then the second frame (B-frame) needs information from previous frames (the I-frame for example) and from future frames (like the P-frame).

As the B-frames may need information from the following frames, the stream is rearranged for decoding, in a way such as when a B-frame is being decoded, everything needed is there. So usually the frames are transmitted like this:

I P B B P B B P B B

This results in having a decoding time-stamp (DTS) less than the presentation time-stamp (PTS) in the rearranged frames.

That series of frames can be a GOP, a Group of Picture, a video is composed by a series of GOPs, each GOP starting with an I-frame, this would be three GOPs:

I B B P B B P B B P I B B P B B P B B P I B B P B B P B B P

As the I-frame doesn’t need any other information for decoding, that’s a good point for fast-clipping a video because all the information for decoding is within it; clipping a video in the middle of a GOP (when it’s not an I-frame) will most likely result in a corrupt output for a while until a new full GOP is decoded.

IDR-frame

But, clipping a video at a GOP start will not always result in a clean output.

The I-frame at the beginning will certainly be decoded fine, it doesn’t need anything special. However, the following B-frames and P-frames will probably need previous frames for being decoded correctly. Sometimes those needed frames are within the GOP which it is usefull, but sometimes they are outside the GOP which is bad for clipping, because it means they reference pictures which are before the I-frame where we cut the video, resulting in a corrupt output.

When frames from a GOP reference frames from another GOP it’s called Open GOP. If not, it is called a Closed GOP.

Hopefully, the video was encoded with IDR-frames. Those are a special case of I-frames. Apart from being an I-frame the IDR-frame ensures the following frames will not reference any frame before the IDR.
In a GOP the IDR-frame replace the I-frame, all IDR-frames are I-frames but not all I-frames are IDR-frames.

So, if an IDR is found that’s a good place for clipping, because that frame will be decoded without any other information and all the following frames will not require information from before the IDR-frame.

Next time you want to clip a video, mind the GOP and find an IDR-frame.

Using GCC Intrinsics (MMX, SSEx, AVX) to look for max value in array

To begin with, you shouldn’t start your new codes focusing on performance; functionality should be the key factor and consider leaving room for future improvements. But well.. after you did your job and everything is working as it should be, you might need to tweak your code a little bit to increase its performance.

efficiency
First functionality then efficiency
The problem

We want to look for the maximum value in an array, this array is composed of int16_t mono audio samples. The maximum value of the array will be the peak value during the audio interval being analysed, this peak is known as sample peak and should not be interpreted as the real peak of the audio which is the true peak (there is a very good explanation about the differences here).

A basic(?) solution

Ok, we have to look for the maximum value in an array and we focus on functionality, this is quite simple actually…

int16_t max = buff [0];
for(i = 1; i < size; i++) {
>....if(max < buff[i]) {
>....>....max = buff[i];
>....}
}

 The intrinsics

These are a series of functions which implement many MMX, SSE and AVX instructions, they are mapped directly to C functions and are also further optimized with gcc. Most of the instructions use vector operations, and you can work with 128, 256 or 512 bit vectors depending on the architecture and the compiler. There is a very detailed guide here and you can see a full list of the funcions here.

You will need to include the headers depending on what functions you want to call, or just include x86intrin.h, which will include all the available ones. Then, you will need to add the appropiate flag to the gcc compile line, in this specific case I’m using -maxv2. If you want to check the supported functions you may use the following command to list you the corresponding includes that gcc will use.

$ gcc -mavx2 -dM -E - < /dev/null | egrep "SSE|AVX"
#define __AVX__ 1
#define __AVX2__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1

The code

What I’m going to do is compare two buffers of 128 bits, one of them has the max value (initialized to 0, if there are all negative values the result will be wrong)  and the other will be the input buffer, this is done using: _mm_max_epi16() which will compare 8 values (int16_t) at a time. This is one of the reasons why the intrinsics increase the performance.

After going through the whole input buffer, I will have in maxval array the maximum value, but I don’t know the position. For the sake of the example I am doing two redundant things here. First, using _mm_shufflelo_epi16()/_mm_shufflehi_epi16() and the _mm_max_epi16()  I will compare values inside the vector and rotate them so the whole buffer has the maximum value, being 16 bit values I can’t shuffle the whole buffer so the shuffle is done in the high bits and the low bits separately.

Finally I will store the final vector in a int16_t array with _mm_store_si128() and I’ll look for the maximum inside it (I could have done this before, but I wanted to show the shuffle which might be useful if the samples were not 16 bit, and the shuffles were not partial).

int16_t find_max(int16_t* buff, int size)
{
    int16_t maxmax[8];
    int i;
    int16_t max = buff[0];

    __m128i *f8 = (__m128i*)buff;
    __m128i maxval = _mm_setzero_si128();
    __m128i maxval2 = _mm_setzero_si128();
    for (i = 0; i < size / 16; i++) {
        maxval = _mm_max_epi16(maxval, f8[i]);
    }
    maxval2 = maxval;
    for (i = 0; i < 3; i++) {
        maxval = _mm_max_epi16(maxval, _mm_shufflehi_epi16(maxval, 0x3));
        _mm_store_si128(&maxmax, maxval);
        maxval2 = _mm_max_epi16(maxval2, _mm_shufflelo_epi16(maxval2, 0x3));
        _mm_store_si128(&maxmax, maxval2);
    }
    _mm_store_si128(&maxmax, maxval);
    for(i = 0; i < 8; i++)
        if(max < maxmax[i])
            max = maxmax[i];
    return max;
}
Some numbers

I’m going to compare 3 different cases, and find the maximum value in random (pseudo random) arrays of 10000 and 1000000 samples.

  • Using an intuitive for loop as the one shown before
  • Same loop as before but with compiler optimizations
  • Using SSE instructions via intrinsics (sample code).

This table shows the total delay in us (micro seconds) that the different functions take.

Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Elements Mode Avg. Delay [us]
10000 Default 650
10000 -O2 250
10000 Intrinsics 200
10000 Intrinsics w/-O2 30
1000000 Default 31000
1000000 -O2 13000
1000000 Intrinsics 8500
1000000 Intrinsics w/-O2 3500

As you can see, the code gets much faster with the Intrinsics, and even faster with the optimizations.

Afterwords
  • You will see an improvement in most cases, and consider that it gets better when the arrays are larger
  • It gets even better with optimizations (-O2)
  • There is still a difference with smaller arrays (10000 element ones)
  • The key factor is to look for repetitive operations in large arrays
  • You may lose some portability, as some functions may not be available in every microprocessor
  • There are some transition penalties when switching between AVX and SSE, so when mixing both this should be considered

Hope you guys liked the post, please feel free to ask any questions and if I can/know, I will answer you.