ISO/IEC 14496-2:1999/PDAM 4

INTERNATIONAL ORGANISATION FOR STANDARDISATION
ORGANISATION INTERNATIONALE DE NORMALISATION
ISO/IEC JTC 1/SC 29/WG 11
CODING OF MOVING PICTURES AND AUDIO

ISO/IEC JTC1/SC 29/WG 11 N3315

Noordwijkerhout, March, 2000

Information technology— Coding of audio-visual objects— Part 2: Visual
AMENDMENT 4: Streaming video profile


Proposed Draft Amendment (PDAM 4)

Draft of 11 April, 2000

©ISO/IEC 2000 – All rights reserved

Copyright notice

This ISO document is a Proposed Draft Amendment and is copyright-protected by ISO. Except as permitted under the applicable laws of the user’s country, neither this ISO draft nor any extract from it may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, photocopying, recording or otherwise, without prior written permission being secured.

Requests for permission to reproduce should be addressed to ISO at the address below or ISO’s member body in the country of the requester.

Copyright Manager

ISO Central Secretariat

1 rue de Varembé

1211 Geneva 20 Switzerland

tel. + 41 22 749 0111

fax + 41 22 734 1079

internet:

Reproduction may be subject to royalty payments or a licensing agreement.

Violators may be prosecuted.


Contents

1 Introduction

1.1 Purpose

1.2 FGS syntax

1.2.1 FGS Structure

1.2.2 Overview of the FGS syntax

2 Normative references

3 Definitions

4 Abbreviations and symbols

5 Conventions

5.1 Method of describing bitstream syntax

5.2 Definition of Functions

5.2.1 Definition of next_bits() function

5.2.2 Definition of bytealigned() function

5.2.3 Definition of nextbits_bytealigned() function

5.2.4 Definition of next_start_code() function

5.2.5 Definition of start_of_bit_plane() function

5.3 Reserved, forbidden and marker_bit

5.4 Arithmetic precision

6 Bitstream syntax and semantics

6.1 Bitstream syntax

6.1.1 Start Codes

6.1.2 Visual Object Sequence and Visual Object

6.1.3 Video Object Layer

6.1.4 FGS Video Object Plane

6.1.5 FGS Motion Macroblock

6.1.6 FGS Motion Interlaced Information

6.1.7 FGS Motion Vector

6.1.8 FGS Macroblock

6.1.9 FGS Block

6.2 Bitstream semantics

6.2.1 Semantic rules for higher syntactic structures

6.2.2 Visual Object Sequence and Visual Object

6.2.3 Video Object Layer

6.2.4 FGS Video Object Plane

6.2.5 FGS Motion Macroblock

6.2.6 FGS Motion Interlaced Information

6.2.7 FGS Motion Vector

6.2.8 FGS Macroblock

6.2.9 FGS Block

7 The FGS decoding process

7.1 Bit-plane decoding of the absolute values of the residue signal

7.2 Decoding of the signs of the residue signal

7.3 Reconstruction of the enhancement DCT residue

7.4 Reconstruction of the enhanced VOP

8 Streaming Video Profiles and Levels

Annex A (normative) Forming (RUN, EOP) Symbols

Annex B (xxxxtive) Variable length codes

B.1 table_bpc

B.2 table_bpc4fw_m

B.3 table_bpc4fw_h

Annex C (xxxtive) Profile and level indication and restrictions

Annex D (informative) A method of decoding a truncated bitstream

Foreword

Foreword to be provided by ISO

Information technology— Coding of audio-visual objects— Part 2: Visual
AMENDMENT 4: Streaming video profile

1  Introduction

1.1  Purpose

This amendment to ISO/IEC 14496-2 has been developed in response to the growing need for a video coding method for streaming video over the Internet. It defines and describes the Simple Streaming Video Profile (SSVP) and the Streaming Video Profile (SVP). SSVP provides the capability to distribute single-layer, frame-based video over the wide range of bit rates available for video distribution on the Internet. SVP uses Simple Streaming Video in the base layer and describes two enhancement layer types: Fine Granularity Scalability (FGS) and FGS Temporal Scalability (FGST). By using multiple layers, SVP can cover a wide range of bit rates where the available bandwidth varies widely.

Whether two profiles or only one need to be defined is still under discussion. It is clear, however, that a conformance point needs to be created at the base layer in addition to the conformance point that includes the enhancement layers.

1.2  FGS syntax

1.2.1  FGS Structure

FGS provides quality scalability for each VOP. Figure 11 shows a basic FGS decoder structure.

Figure 11— A Basic FGS Decoder Structure

To reconstruct the enhanced VOP, the enhancement bitstream is first decoded using bit-plane VLD. The decoded bit-planes in the DCT domain are then shifted according to the frequency weighting and selective enhancement shifting factors. The output of the bit-plane shift is the set of DCT coefficient residues. The IDCT then reconstructs the image-domain residues, which are added to the reconstructed, clipped base-layer pixels to form the enhanced VOP. The clipping unit in the FGS enhancement layer limits the pixels of the reconstructed enhancement VOP to the range 0 to 255 to generate the final enhancement video. The reconstructed base-layer video is available as an optional output, since each reconstructed base-layer VOP has to be stored in the frame buffer for motion compensation.
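As a rough illustration of the last two steps of this process (adding the image-domain residues to the clipped base-layer pixels, then clipping the sums to the range 0 to 255), a minimal Python sketch follows. The function names clip_pixel and reconstruct_enhanced_block are hypothetical and do not appear in this Amendment. The residues are assumed to be already available from the IDCT; bit-plane VLD, bit-plane shifting and the IDCT itself are not shown.

```python
def clip_pixel(value, low=0, high=255):
    """Limit a sample value to the range [low, high]."""
    return max(low, min(high, value))


def reconstruct_enhanced_block(base_pixels, residues):
    """Add image-domain residues to base-layer pixels and clip the sums.

    Both arguments are flat lists of equal length (64 for an 8x8 block).
    The base-layer pixels are clipped before the residues are added, and
    the sums are clipped again to 0..255 (hypothetical helper, sketch only).
    """
    if len(base_pixels) != len(residues):
        raise ValueError("base_pixels and residues must have the same length")
    return [clip_pixel(clip_pixel(b) + r) for b, r in zip(base_pixels, residues)]
```

Calling reconstruct_enhanced_block with base pixels and signed residues of the same length returns the enhanced samples, each limited to the 8-bit range.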

The basic FGS enhancement layer consists of FGS VOPs that enhance the quality of the base-layer VOPs as shown in Figure 12.

Figure 12— Basic FGS Enhancement Structure

When FGS temporal scalability is used, two enhancement structures are possible. One has two separate enhancement layers for FGS and FGST, as shown in Figure 13; the other has one combined enhancement layer for FGS and FGST, as shown in Figure 14.

Figure 13— Two Separate Enhancement Layers for FGS and FGST

Figure 14— One Combined Enhancement Layer for FGS and FGST

In both structures, the temporal scalable VOPs can be predicted only from the base layer. Each temporal scalable VOP has two separate parts: the first contains motion vector data and the second contains the DCT texture data. The syntax of the first part is similar to that of the temporal scalability described in ISO/IEC 14496-2, modified so that the DCT texture data are not coded (only motion vector data are coded) and prediction from the same layer is prohibited. The DCT texture data in the second part are coded using bit-plane coding in the same way as in an FGS VOP. To distinguish the temporal scalability in this Amendment from that in ISO/IEC 14496-2, the FGS temporal scalability layer in Figure 13 is called the “FGST layer” and the combined FGS and FGST layer in Figure 14 is called the “FGS-FGST layer”.

1.2.2  Overview of the FGS syntax

The high-level syntax in VisualObjectSequence() and VisualObject() is identical to that in ISO/IEC 14496-2, except that the code values of profile_and_level_indication in VisualObjectSequence() have been extended to include the profile and level indications for the Simple Streaming Video Profile and the Streaming Video Profile. An FGS bitstream is identified by the syntax element video_object_type_indication in VideoObjectLayer(). One unique code is defined for the Streaming Video Object Type to indicate that the VOL contains FGS enhancement VOPs; another is defined for the Simple Streaming Video Object Type to indicate that the VOL is the base layer for FGS. The syntax element fgs_layer_type in VideoObjectLayer() indicates whether the VOL is an FGS-only layer as shown in Figure 12, an FGST layer as shown in Figure 13, or an FGS-FGST layer as shown in Figure 14.

As in the syntax structure of ISO/IEC 14496-2, each VOL for FGS contains a hierarchy of FGS Video Object Plane, FGS Macroblock and FGS Block. An FGS enhancement VOP starts with a unique fgs_vop_start_code. Each FGS VOP contains multiple bit-planes, and each bit-plane starts with an fgs_bp_start_code whose last 5 bits indicate the ID of the bit-plane. For the 4:2:0 chrominance format, each FGS Macroblock contains 4 FGS Blocks for the luminance component and 2 FGS Blocks for the two chrominance components. Each FGS Block contains one bit-plane of 64 bits coded by VLC.
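The two numeric facts in this overview (the bit-plane ID carried in the last 5 bits of fgs_bp_start_code, and the block count per FGS Macroblock in 4:2:0) can be sketched in Python as shown below. The function and constant names are hypothetical, and the start code value used in the usage note is an illustrative assumption; the normative start code values are given in the start code clause, not here.

```python
BP_ID_MASK = 0x1F  # the last 5 bits of fgs_bp_start_code carry the bit-plane ID

LUMA_BLOCKS_420 = 4     # FGS Blocks for the luminance component
CHROMA_BLOCKS_420 = 2   # FGS Blocks for the two chrominance components
BITS_PER_BLOCK_PLANE = 64


def bit_plane_id(fgs_bp_start_code):
    """Return the bit-plane ID held in the low 5 bits of the start code."""
    return fgs_bp_start_code & BP_ID_MASK


def uncoded_bits_per_macroblock_plane():
    """Bits of one bit-plane in one 4:2:0 FGS Macroblock, before VLC."""
    return (LUMA_BLOCKS_420 + CHROMA_BLOCKS_420) * BITS_PER_BLOCK_PLANE
```

For an assumed start code value of 0x00000143, bit_plane_id returns 3. One bit-plane of one 4:2:0 macroblock spans 6 blocks of 64 bits, that is 384 bits before variable length coding.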

2  Normative references

The following ITU-T Recommendations and International Standards contain provisions that, through reference in this text, constitute provisions of this Amendment. At the time of publication, the editions indicated were valid. All Recommendations and Standards are subject to revision, and parties to agreements based on this Amendment are encouraged to investigate the possibility of applying the most recent editions of the Standards indicated below. Members of IEC and ISO maintain registers of currently valid International Standards. The Telecommunication Standardisation Bureau maintains a list of currently valid ITU-T Recommendations.

·  ITU-T Rec. T.81 (1992)|ISO/IEC 10918-1:1994, Information technology — Digital compression and coding of continuous-tone still images: Requirements and guidelines.

·  ISO/IEC 11172-1:1993, Information technology — Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s — Part 1: Systems.

·  ISO/IEC 11172-2:1993, Information technology — Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s — Part 2: Video.

·  ISO/IEC 11172-3:1993, Information technology — Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s — Part 3: Audio.

·  ITU-T Rec. H.222.0 (2000)|ISO/IEC 13818-1:2000, Information technology — Generic coding of moving pictures and associated audio information: Systems.

·  ITU-T Rec. H.262 (2000)|ISO/IEC 13818-2:2000, Information technology — Generic coding of moving pictures and associated audio information: Video.

·  ISO/IEC 13818-3:1998, Information technology — Generic coding of moving pictures and associated audio information — Part 3: Audio.

·  ISO/IEC 14496-1:1999, Information technology — Coding of audio-visual objects — Part 1: Systems.

·  ISO/IEC 14496-3:1999, Information technology — Coding of audio-visual objects — Part 3: Audio.

·  IEC Publication 461:1986, Time and control code for video tape recorders.

·  IEC Publication 908:1987, CD Digital Audio System.

·  ITU-T Recommendation H.261 (formerly CCITT Recommendation H.261), Codec for audiovisual services at px64 kbit/s, Geneva, 1990.

·  ITU-T Recommendation H.263, Video Coding for Low Bitrate Communication, Geneva, 1996.

·  Recommendations and reports of the CCIR, 1990, XVIIth Plenary Assembly, Dusseldorf, 1990, Volume XI - Part 1, Broadcasting Service (Television), Recommendation ITU-R BT.601-3 “Encoding parameters of digital television for studios”.

·  CCIR Volume X and XI Part 3, Recommendation ITU-R BR.648 “Recording of audio signals”.

·  CCIR Volume X and XI Part 3, Report ITU-R 955-2 “Satellite sound broadcasting to vehicular, portable and fixed receivers in the range 500 - 3000 MHz”.

·  IEEE Standard Specifications for the Implementations of 8 by 8 Inverse Discrete Cosine Transform, IEEE Std 1180-1990, December 6, 1990.

3  Definitions

For the purposes of this Amendment, the following definitions apply.

3.1  AC coefficient: Any DCT coefficient for which the frequency in one or both dimensions is non-zero.

3.2  B-VOP; bidirectionally predictive-coded video object plane (VOP): A VOP that is coded using motion compensated prediction from past and/or future reference VOPs.

3.3  backward compatibility: A newer coding standard is backward compatible with an older coding standard if decoders designed to operate with the older coding standard are able to continue to operate by decoding all or part of a bitstream produced according to the newer coding standard.

3.4  backward motion vector: A motion vector that is used for motion compensation from a reference VOP at a later time in display order.

3.5  backward prediction: Prediction from the future reference VOP.

3.6  bit-plane: An array of 64 bits, one from each DCT residue or DCT coefficient, in a zigzag scan order. In the context of an FGS VOP, it also refers to the collection of the raster scanned arrays of 64 bits in an FGS VOP, one array per DCT block.
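The first sense of this definition can be illustrated with a short Python sketch that takes one bit from each of the 64 values of a block. The function names are hypothetical, the input is assumed to hold non-negative (absolute) values already arranged in zigzag scan order, and plane 0 is taken to mean the least significant bit; these conventions are assumptions made for the illustration only.

```python
def extract_bit_plane(abs_values, plane):
    """Return bit number `plane` (0 = least significant) of each value.

    abs_values: 64 non-negative integers, assumed to be in zigzag order.
    The result is a list of 64 bits, one per value.
    """
    if len(abs_values) != 64:
        raise ValueError("a block bit-plane is formed from exactly 64 values")
    if any(v < 0 for v in abs_values):
        raise ValueError("absolute values are expected")
    return [(v >> plane) & 1 for v in abs_values]


def num_bit_planes(abs_values):
    """Number of bit-planes needed to represent the largest value."""
    return max(abs_values).bit_length()
```

For a block whose first four values are 5, 2, 0 and 1 and whose remaining values are zero, three bit-planes are needed, and the most significant one (plane 2) has a single 1 in its first position.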

3.7  bitrate: The rate at which the coded bitstream is delivered from the storage medium or network to the input of a decoder. In the context of FGS, this may be different from the rate at which an FGS encoder generates a bitstream.

3.8  bitstream; stream: An ordered series of bits that forms the coded representation of the data.

3.9  block: An 8-row by 8-column matrix of samples, or 64 DCT coefficients (source, quantised or dequantised).

3.10  byte aligned: A bit in a coded bitstream is byte-aligned if its position is a multiple of 8 bits from the first bit in the stream.
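A minimal Python sketch of this test follows. The function names are hypothetical, and bit positions are assumed to be counted from 0 at the first bit of the stream.

```python
def is_byte_aligned(bit_position):
    """True if bit_position is a multiple of 8 bits from the first bit (position 0)."""
    return bit_position % 8 == 0


def bits_to_next_byte_boundary(bit_position):
    """Number of bits between bit_position and the next byte-aligned position.

    Returns 0 when bit_position is already byte-aligned.
    """
    return (-bit_position) % 8
```

Under this counting, positions 0 and 16 are byte-aligned, while position 13 is 3 bits short of the next byte boundary.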

3.11  byte: A sequence of 8 bits.

3.12  channel: A digital medium or a network that stores or transports a bitstream constructed according to ISO/IEC 14496 and this Amendment.

3.13  chrominance format: Defines the number of chrominance blocks in a macroblock.

3.14  chrominance component: A matrix, block or single sample representing one of the two colour difference signals related to the primary colours in the manner defined in the bitstream. The symbols used for the chrominance signals are Cr and Cb.

3.15  coded B-VOP: A B-VOP that is coded.

3.16  coded VOP: A coded VOP is a coded I-VOP, a coded P-VOP or a coded B-VOP.

3.17  coded I-VOP: An I-VOP that is coded.

3.18  coded P-VOP: A P-VOP that is coded.

3.19  coded video bitstream: A coded representation of a series of one or more VOPs as defined in ISO/IEC 14496-2 and this Amendment.

3.20  coded representation: A data element as represented in its encoded form.

3.21  coding parameters: The set of user-definable parameters that characterise a coded video bitstream. Bitstreams are characterised by coding parameters. Decoders are characterised by the bitstreams that they are capable of decoding.

3.22  component: A matrix, block or single sample from one of the three matrices (luminance and two chrominance) that make up a picture.

3.23  compression: Reduction in the number of bits used to represent an item of data.

3.24  constant bitrate coded video: A coded video bitstream with a constant bitrate.

3.25  constant bitrate: Operation where the bitrate is constant from start to finish of the coded bitstream.

3.26  data element: An item of data as represented before encoding and after decoding.

3.27  DC coefficient: The DCT coefficient for which the frequency is zero in both dimensions.
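Definitions 3.1 and 3.27 together partition the 64 coefficients of a block into one DC coefficient and 63 AC coefficients. The short Python sketch below expresses that rule; the function name and the (u, v) indexing of horizontal and vertical frequency are assumptions made for the illustration.

```python
def coefficient_kind(u, v):
    """Classify the DCT coefficient at frequency index (u, v) of an 8x8 block.

    The coefficient with zero frequency in both dimensions is the DC
    coefficient; every other coefficient is an AC coefficient.
    """
    if not (0 <= u < 8 and 0 <= v < 8):
        raise ValueError("indices must lie in the range 0..7")
    return "DC" if (u == 0 and v == 0) else "AC"
```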