Wednesday, April 11, 2012

Seeking in FFmpeg: Know Your Timestamp!

History

I've been working on a plugin lately that reads videos with FFmpeg. It's coming along pretty well, but by far the biggest problem I've had so far is seeking accurately. In particular, I would try to seek to the very beginning of a file, but for some reason would end up at the second keyframe, not the first.

So I tested it further, talked on IRC with some devs about it, and decided it must be a bug. So I filed a bug report.

The Lightbulb

Thankfully, my bug report got a quick response. It turned out that, yet again, the "bug" I discovered really wasn't a bug. It was me misunderstanding FFmpeg.

Here's the catch: the docs for av_seek_frame and avformat_seek_file both talk about seeking by a timestamp. This whole time, I thought FFmpeg was seeking by the PTS (presentation timestamp). As Reimar points out in the bug report, this is not the case. FFmpeg seeks by the DTS (decoding timestamp), not the PTS.*

Why This is Important

The video file I was using was an mpeg2video MOV file. The first packet had a DTS of -1 and a PTS of 0, and the second packet had a DTS of 0 and a PTS of 1. This means that when I tried to seek to a timestamp of 0 (which I thought was the beginning), FFmpeg was really seeking to a DTS of 0, which is the packet with a PTS of 1. That is why it would skip over the first keyframe (at PTS=0 and DTS=-1) and stop at the second keyframe.
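To make the mismatch concrete, here is a toy model in Python (it does not call FFmpeg). The first two packets mirror the ones described above; the third is an assumed continuation of the same pattern. `seek` is a hypothetical helper that returns the last keyframe whose chosen timestamp is at or before the target, which is only a rough stand-in for how a demuxer might resolve a seek.

```python
from collections import namedtuple

# Toy packet: every packet is treated as a keyframe, as in the file described above.
Packet = namedtuple("Packet", ["dts", "pts"])

# The first two entries come from the post; the third is an assumed continuation.
packets = [Packet(dts=-1, pts=0), Packet(dts=0, pts=1), Packet(dts=1, pts=2)]


def seek(packets, target, key):
    """Return the last packet whose `key` timestamp ('dts' or 'pts') is <= target."""
    candidates = [p for p in packets if getattr(p, key) <= target]
    if not candidates:
        raise ValueError("target is before the first packet")
    return max(candidates, key=lambda p: getattr(p, key))


by_dts = seek(packets, 0, "dts")  # models what the demuxer did
by_pts = seek(packets, 0, "pts")  # models what I expected

print(by_dts)  # Packet(dts=0, pts=1)
print(by_pts)  # Packet(dts=-1, pts=0)
```

Seeking to 0 by DTS lands on the packet that is presented at PTS=1, so the real first frame is skipped.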

This also screwed me up when trying to step through the file one frame at a time. I kept track of my current position by the PTS, and when I wanted to seek forward/backward by one frame, I would just add/subtract 1 from my PTS. This meant that if I was at PTS=12 (there was a keyframe here), and I tried to go back one frame, I tried to seek to a timestamp of 12 - 1 = 11. The problem was that I was thinking in PTS and FFmpeg was thinking in DTS, so when I asked FFmpeg to seek to a timestamp of 11, it sought to a DTS of 11 (which is PTS=12, right where I already was!).
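The single-step problem can be shown with the same kind of toy model, along with one possible workaround: translate the PTS you want into DTS before seeking. This is only a sketch. It assumes the PTS minus DTS offset is constant (1 in this file), which will not hold for streams with reordered B-frames, and `seek_by_dts` and `PTS_DTS_OFFSET` are illustrative names, not FFmpeg API.

```python
# Toy stream: every frame is a keyframe and PTS = DTS + 1, as in the file above.
PTS_DTS_OFFSET = 1  # assumption: constant for this particular file
stream = [{"dts": d, "pts": d + PTS_DTS_OFFSET} for d in range(-1, 20)]


def seek_by_dts(stream, target_dts):
    """Model a DTS-based seek: return the last frame with dts <= target_dts."""
    hits = [f for f in stream if f["dts"] <= target_dts]
    return hits[-1] if hits else None


current_pts = 12

# Naive step back: pass the desired PTS (11) straight to the seek call.
naive = seek_by_dts(stream, current_pts - 1)
print(naive["pts"])  # 12, we never moved

# Compensated step back: convert the desired PTS into a DTS first.
fixed = seek_by_dts(stream, (current_pts - 1) - PTS_DTS_OFFSET)
print(fixed["pts"])  # 11
```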

In the end, it's important to know which timestamp you're using in seeking. At the time of writing, the FFmpeg documentation doesn't say that the DTS is the timestamp used, as it just mentions "timestamp." The docs will probably have this note added (soonish), though.

*Update! So I tried submitting a patch that changed the documentation to say that seeking was done by DTS and not by PTS. Michael Niedermayer then informed me that this isn't true for all demuxers. Apparently some use DTS and some use PTS. Just be sure to be very aware of this. I believe most demuxers, however, seek by DTS.

Double Update! It should be noted that Michael Niedermayer recently added the flag AVFMT_SEEK_TO_PTS, to be used in AVInputFormat.flags, to specify that the demuxer seeks by PTS and not by DTS. Otherwise, you can expect the demuxer to seek by DTS. Note that this is a recent change (at the time of writing), and most relevant demuxers haven't been updated with it yet. Until the flag is added to more demuxers, it's possible that a demuxer seeks by PTS without having AVFMT_SEEK_TO_PTS set.

2 comments:

  1. Hi Michael,

    I have to say, FFMPEG, although it obviously saves you a ton of work, also creates a ton of work in trying to understand what the hell is going on under the hood. If the API was complete, documented, consistent and reliable, this wouldn't be a problem. But of course, both you and I know that it isn't.

    I've been trying to find a way of reliably seeking to a time point in a media file for a while now, and nothing seems to work well across different file formats. I just want to say to FFMPEG "seek to time xx:xx:xx:xxxx" and it just take me there, in audio and video, and stop faffing around.

    What code did you end up using? Was it robust and set for general use? For instance, how do you determine how far apart the DTS and PTS are in time in different scenarios?

    In my own experiments I've discovered that the part of the file sought to is often way before the part I actually want. No big deal I thought, just read and discard the packets until the point I actually require. But, of course some packets don't have a valid PTS so I have to DECODE them to try and figure one out. If you're decoding several seconds-worth of data to determine PTS to get to the sodding part of the file you really want, this can make the seek a *very* slow operation.

    Any experience here, or tips to improve this?

    *ANY* advice would be welcome. Pulling my hair out here (well, what's left of it).

    Replies
    1. Dang, I wish I saw this comment 3 months ago. Sorry I didn't!

      Ultimately, you can't just seek to a specific time stamp. That's impossible for certain codecs. For example, H.264 video streams have P and B frames that depend on other frames around them. Let's say you have two I frames (keyframes), and they're 15 frames apart. That means you've got 14 P and/or B frames between these two keyframes. If you're trying to seek to a time that's in the middle of these two keyframes, you are required to do some decoding (starting with the first keyframe and going up to the desired time). If you don't, you will get junk data out when you decode the P/B frame.

      So you can't really escape the fact that you have to do some "fast forwarding" (decoding the frames without actually using what you decode). That's inevitable.

      The second important detail is that even if you do have the DTS/PTS values, they will sometimes lie to you. Especially for audio. Don't trust the DTS/PTS values any more than you have to for audio. For audio, it's more accurate to track the number of samples you've decoded and use the number of samples as a timestamp value than trusting in a DTS/PTS decoded from a frame.

      When I was writing a decoding plugin that used FFmpeg, I ignored the audio sample count and instead used the reported DTS/PTS values to keep track of when each audio frame should be played. The result was horrible, stuttering audio. This was because the PTS values of the audio frames were always slightly off (a little ahead for one frame, a little behind for another). So I had to change to using decoded sample counts as my timestamp, which resulted in nice audio.

      When seeking (audio), use the DTS/PTS for the first frame, because that's all you have. But as you start decoding and playing, use the sample count as your timestamp and don't keep using the DTS/PTS.

      In the end, seeking was my biggest headache for my FFmpeg decoding plugin. Most of my code dealt with seeking and timestamps. Here's what I did, more or less:

      Try seeking to a specific time T. Decode the first frame. Check the time we're at. If we're past time T, seek again, but further back. Repeat a couple times until we get to a time that is not past time T (or report failure if that couldn't be done). Then decode and fast forward up to time T.

      I can't remember what I did if after seeking the decoded frame had no (valid) DTS/PTS. I think I might have reported failure in that situation, but I'm not sure.

      When decoding, I used a number of methods to track the PTS as accurately as I could. For audio, I used the sample count. For video, I used the DTS, PTS, best_effort_timestamp, frame rate (so frame duration is approximated by 1/fps), and frame duplication checking (I recall a flag being set that specified if a frame was duplicated so its duration was doubled) to track the timestamp (with some priority among those things which I've forgotten). It got kind of messy.

      We tested every night with hundreds of videos and we'd always find some problematic videos that would require me to make some kind of new hack (or else we'd just say the video was junk and not our fault). It seemed to work pretty well, overall.
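The seek procedure and the audio sample-count clock from the reply above can be sketched as a toy Python model. Nothing here calls FFmpeg: `ToyDemuxer`, its `seek` and `read_frame` methods, the keyframe spacing, and the back-off step are all made up for illustration, and this is not the plugin's actual code. The toy demuxer deliberately overshoots (it lands on the first keyframe at or after the requested time) so that the retry path gets exercised.

```python
class ToyDemuxer:
    """Frames at integer timestamps 0..n_frames-1, with a keyframe every `gop` frames."""

    def __init__(self, n_frames=60, gop=15):
        self.n_frames = n_frames
        self.gop = gop
        self.pos = 0

    def seek(self, t):
        # Deliberately imprecise: jump to the first keyframe at or after t,
        # clamped to the last keyframe in the stream.
        t = max(t, 0)
        last_key = (self.n_frames - 1) // self.gop * self.gop
        self.pos = min(-(-t // self.gop) * self.gop, last_key)

    def read_frame(self):
        """Return the timestamp of the next frame, or None at end of stream."""
        if self.pos >= self.n_frames:
            return None
        ts = self.pos
        self.pos += 1
        return ts


def seek_accurate(demuxer, target, max_tries=4, step=10):
    """Seek, back off while we land past `target`, then fast-forward to it."""
    request = target
    for _ in range(max_tries):
        demuxer.seek(request)
        ts = demuxer.read_frame()
        if ts is not None and ts <= target:
            break
        request -= step  # landed too late: ask for an earlier point
    else:
        return None  # could not get at or before the target: report failure
    # Fast-forward: "decode" and discard frames until we reach the target.
    while ts is not None and ts < target:
        ts = demuxer.read_frame()
    return ts


def audio_time(first_pts_seconds, samples_decoded, sample_rate):
    """Audio position from the sample count, anchored at the first PTS after a seek."""
    return first_pts_seconds + samples_decoded / sample_rate


demuxer = ToyDemuxer()
print(seek_accurate(demuxer, 20))     # 20
print(audio_time(2.0, 48000, 48000))  # 3.0
```

In the first call, the initial seek lands on the keyframe at 30, the retry lands on the keyframe at 15, and frames 16 through 19 are discarded on the way to 20. A real implementation would also have to handle frames with no valid timestamp, which this model ignores.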
