Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-4 are rejected under 35 U.S.C. 103 as being unpatentable over Groeschel (US 2013/0170672), hereinafter Gro, in view of Wang (US 2021/0151082), and further in view of Piel (US 2024/0338114).
Regarding claim 1
Gro teaches:
A computer-implemented method comprising: under control of a computing system comprising one or more computer processors configured to execute specific instructions, obtaining an input audio comprising one or more audio channels, the input audio being an audio track of a video content item (Gro: ¶ 15, 64, etc.: system receives a main input audio signal and associated audio signal wherein the audio signals are associated with a movie, DVD video, or other media bearing same);
obtaining an audio description (AD) narration (Gro: ¶ 64: such as an associated description track, such as a director's commentary which is overlaid upon the main audio based on metadata associated therewith), wherein the AD narration comprises a plurality of AD sections of narration of the video content item (Gro: ¶ 10, 11, 64; Fig 1A, 1B: such as commentary narrating aspects of a scene, the production thereof, etc. overlaid upon or among scenes of a movie),
wherein a first AD section and a second AD section of the plurality of AD sections respectively correspond to a first scene and a second scene of the input audio (Gro: ¶ 10, 11, 64, 99, 100: a first, second, etc. director commentary or other descriptive tracks respectively correspond to one or more particular first, second, etc. scenes in the movie);
normalizing (Gro: ¶ 11, 62, 96, etc.; Fig 1A, etc.: system performs leveling, normalization, etc. on audio tracks based on scene-wise metadata; the figure “shows an example of different programs without such leveling or dialog level normalization,” to which leveling or normalization is to be applied),
using a first loudness level associated with the one or more audio channels during the first scene, a second loudness level of the first AD section to generate a first normalized AD section with a first normalized loudness level (Gro: ¶ 3, 11, 13, 16, 62, 63, 69, 96, etc.; Fig 1A, 1B, 3, etc.: such as by scaling a director's commentary track with respect to a main audio scene level of one or more scenes of a production wherein the main audio scene level is also scaled; that is, consider the exemplary very loud scene which would have audio at a first loudness level greater than that of the loudness level of the commentary, in a manner analogous to the louder and lower audio S1 and S2 respectively in Fig 1A; Examiner is aware that the utility of Fig 1A in the specification does not address the scaling in this manner, but the statement and figure serve to illustrate the manner in which such processing is accomplished, that is, to arrive at loudness normalization in the form of a uniform audio by provision of producer, director, user, etc. desired scaling values to alter the general sound level of audio portions to generate consistent audio, such that the program audio of the scene at the higher level, the S1 audio, etc. is scaled down, attenuated, etc. with respect to the commentary audio in the scene and/or the commentary audio of the scene at the lower signal level, the S2 audio, is scaled up, normalized with respect to the louder signal, etc., as exemplified by Fig 3, etc.; in this way dissimilar scaling factors may be used to normalize first, second, etc. audio within a scene to uniform levels such that “the input signals are normalized,” wherein “the normalization can be applied either before or after the determination of the dominant signal, as the results will be the same,” such as to ensure that dialog values “of the input signals is/are set correctly and for both the main and associated signals to be at dialog level 31 before mixing,”);
normalizing, using a third loudness level associated with the one or more audio channels during the second scene, a fourth loudness level of the second AD section to generate a second normalized AD section with a second normalized loudness level (Gro: ¶ 11, 62, 63, 69, 96, etc.; Fig 1A, 1B, 3, 5A-C, etc.: such as by repeating the process to generate dissimilar scaling factors such as in a scene with audio similar to that of S3 and S4 wherein a second scene requires scaling factors different from that of a first scene and by which the dissimilar audio within a scene is normalized to uniform levels such that “the input signals are normalized,” wherein “the normalization can be applied either before or after the determination of the dominant signal, as the results will be the same” such as to ensure that dialog values “of the input signals is/are set correctly and for both the main and associated signals to be at dialog level 31 before mixing,”);
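By way of an illustrative, non-limiting sketch of the scene-wise normalization discussed above (all function names, target levels, and the -3 dB offset below are hypothetical and are not drawn from Gro, Wang, or Piel):

```python
import math

def rms_db(samples):
    """Approximate loudness of a block of samples as RMS level in dBFS."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-9))

def normalize_ad_section(ad_samples, scene_loudness_db, offset_db=-3.0):
    """Scale an AD section so its loudness sits offset_db relative to the
    loudness of its corresponding scene."""
    gain_db = (scene_loudness_db + offset_db) - rms_db(ad_samples)
    gain = 10 ** (gain_db / 20)
    return [s * gain for s in ad_samples]
```

Each AD section thus receives its own gain derived from the loudness of its corresponding scene, analogous to the dissimilar per-scene scaling factors discussed with respect to Fig 1A of Gro.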
compressing, by a first computer processor of the one or more computer processors, a first audio channel of the one or more audio channels during the first scene based at least in part on the first normalized loudness level of the first normalized AD section to generate a first portion of a first compressed audio channel (Gro: ¶ 6, 59-62, etc.: such as by using a compression scheme on one or more audio channels, such as a first audio channel normalized as described supra, such as for communication of the data to additional devices, such as on a scene requiring the scaling generative of the first normalized AD section);
compressing, by a second computer processor of the one or more computer processors, the first audio channel of the one or more audio channels during the second scene based at least in part on the second normalized loudness level of the second normalized AD section to generate a second portion of the first compressed audio channel (Gro: ¶ 6, 59-62, etc.: such as by using a compression scheme on one or more audio channels, such as a first audio channel normalized as described supra, such as for communication of the data to additional devices, such as on a scene requiring the scaling generative of the second normalized AD section); and
mixing the first normalized AD section to the first compressed audio channel during the first scene and the second normalized AD section to the first compressed audio channel during the second scene to generate a first sound channel of an AD content, wherein the AD content comprises the video content item and provides narration of the video content item between the sections of dialogue (Gro: ¶ 59-64, 69, 70; Fig 3: such as by providing the individual audio channel comprising the director's commentary to a channel of a multi-channel audio codec, standard, etc. as discussed, normalized in the manner discussed with respect to the main audio, such as for multi-channel output).
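As a further illustrative sketch of the compress-then-mix stages recited above (the 6 dB headroom figure and all function names are hypothetical and are not taken from any cited reference):

```python
def duck_gain_db(scene_loudness_db, ad_loudness_db, headroom_db=6.0):
    """Gain change (in dB, never positive) applied to the program channel
    during a scene so the normalized AD narration keeps headroom_db above it."""
    return min(0.0, (ad_loudness_db - headroom_db) - scene_loudness_db)

def apply_gain(samples, gain_db):
    """Apply a flat dB gain to a block of samples."""
    gain = 10 ** (gain_db / 20)
    return [s * gain for s in samples]

def mix(program, ad):
    """Sum the compressed program channel with the normalized AD section,
    zero-padding the shorter of the two."""
    n = max(len(program), len(ad))
    program = program + [0.0] * (n - len(program))
    ad = ad + [0.0] * (n - len(ad))
    return [p + a for p, a in zip(program, ad)]
```

A single scene-wide gain reduction stands in here for a true dynamics compressor; the point illustrated is only that the program channel is attenuated per scene based on the normalized AD level before the two are summed.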
Gro describes mixing the main audio and commentary audio together, such as for output, and teaches and/or suggests the claimed subject matter in the manner discussed supra, which teaches that the mixing occurs in a simultaneous, overlapping manner over successive time periods. Gro does not explicitly teach the commentary occurring specifically between sections of dialogue of the main audio, wherein the first AD section and second AD section of the plurality of AD sections respectively correspond to a first scene and a second scene of the input audio, such that the Gro taught or suggested loudness levels associated with the one or more audio channels, interspersed sequentially within the first scene, are used to generate a first normalized AD section with a first normalized loudness level. Additionally, Gro strongly suggests operating a plurality of processor modules to perform the claimed subject matter, as represented in figure 3, but does not explicitly teach the Gro taught or suggested insertion, normalization, compression, and mixing pipeline stages, module operations, etc. being assigned to specific processors.
In a related field of endeavor Wang teaches a computer-implemented method comprising: under control of a computing system comprising one or more computer processors configured to execute specific instructions obtaining an input audio comprising one or more audio channels, the input audio being an audio track of a video content item (Wang: Abstract; ¶ 10, 52, 61, 62, etc.; Fig 1, 2, etc.: such as accessing audio tracks by operating memory modules upon one or more physical processors wherein the audio track is associated with a video recording, content item, etc.);
obtaining an audio description (AD), wherein the AD comprises a plurality of AD sections of the video content item (Wang: ¶ 86, 102-104; Fig 7: a dialog list comprising one or more translations of film dialog are input for creating an additional audio track wherein the in and out times for lines of dialog in one or more scenes are determined) between sections of dialogue of the input audio (Wang: ¶ 86, 103, 104, 107: system determines dialog in and out times for a plurality of scenes for the purpose of inserting a “synthesized voice description … to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog”), a first AD section and a second AD section of the plurality of AD sections respectively corresponding to a first scene and a second scene of the input audio (Wang: ¶ 11, 48, 59; Fig 2: system replaces and/or inserts audio into one or more scenes of a program, movie, etc.);
compressing, by a computer processor of the one or more computer processors, a first audio channel of the one or more audio channels during the first scene based at least in part on a loudness level of the first AD section to generate a first portion of a first compressed audio channel (Wang: ¶ 80, 94, 95; Fig 1, 5: system operates to compress an audio track to a determined loudness level; such as by processor(s) 130); compressing, by a computer processor of the one or more computer processors, the first audio channel of the one or more audio channels during the second scene based at least in part on a loudness level of the second AD section to generate a second portion of the first compressed audio channel (Wang: ¶ 80, 94, 95; Fig 1, 5, 6A, 6B: system operates to compress an audio track to a determined loudness levels such as upon subsequent audio portions of a media; such as by processor(s) 130); and
mixing the first and second AD sections to the first compressed audio channel during the first scenes and second scenes to generate a first sound channel of an AD content, wherein the AD content comprises the video content item and provides narration of the video content item between the sections of dialogue (Wang: Abstract; ¶ 86, 103, 104, 107: system mixes in or otherwise inserts a “synthesized voice description,” at dialog in and out times for a plurality of scenes to provide information without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to adapt the Gro system and method to compress the various dialog audio segments and to mix or otherwise insert compressed audio description segments between dialog segments in the manner taught or suggested by Wang, and for at least the purpose of providing description information without interrupting or distracting from the original dialog audio, audio track, etc.; one of ordinary skill in the art would have expected only predictable results therefrom.
Gro in view of Wang strongly suggests but does not explicitly teach operating the modules, portions of the pipeline, etc. of the disclosed steps upon a plurality of processors, such as in parallel, upon plural cores, etc.
In a related field of endeavor Piel teaches a system and method for determining a processing path for audio processing (Piel: Abstract) comprising utilizing multiple processors for selective processing of particular audio (Piel: ¶ 11, 12, 25-27, 60-65; Fig 7; Claim 8: system comprises user specified processing paths to perform signal processing upon multiple processors, cores, DSPs, etc.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize a multiprocessor system such as that taught or suggested by Piel to conduct operations of the Gro in view of Wang processing pipeline upon a plurality of processors, and for at least the purpose of providing timewise benefit to the processing and mixing of audio, such as with decreased latency, in substantially real-time, etc., and in the manner claimed; one of ordinary skill in the art would have expected only predictable results therefrom.
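To illustrate the kind of multiprocessor distribution relied upon here, the per-scene stages may be fanned out across workers; in the following hypothetical sketch a thread pool merely stands in for the multiple processors, cores, DSPs, etc. of the cited art, and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def process_scene(scene_id, samples, gain):
    """Stand-in for one scene's normalize/compress/mix stage:
    applies a per-scene gain to that scene's block of samples."""
    return scene_id, [s * gain for s in samples]

def run_pipeline(scenes, workers=2):
    """Fan per-scene processing out across a pool of workers and
    collect the results keyed by scene."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_scene, sid, samples, gain)
                   for sid, (samples, gain) in scenes.items()]
        return dict(f.result() for f in futures)
```

Each scene's work is independent of the others, which is what makes assigning different scenes (or different pipeline stages) to different processors straightforward.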
Regarding claim 2
Gro in view of Wang in view of Piel teaches or suggests:
The computer-implemented method of claim 1, further comprising: adjusting a dynamic range of the first audio channel (Gro: ¶ 16-19, 60: such as by utilizing a dialnorm or other compression and/or dynamics-adjusting processes such as spectral band replication, etc.); (Wang: ¶ 80, 94, 95; Fig 5: such as by compression of an audio track to a determined loudness level, which adjusts the dynamic range over volume or gain; by re-sampling, which necessarily adjusts a dynamic frequency range, etc.) prior to normalizing the second loudness level of the first AD section and the fourth loudness level of the second AD section (Gro: ¶ 3, 11, 13, 16, 62, 63, 69, 96, etc.; Fig 1A, 1B, 3, etc.: such as by performing normalization of the loudness levels of a particular section to be included in the output media); (Wang: ¶ 86, 103, 104, 107: system determines dialog in and out times for a plurality of scenes for the purpose of interleaving audio in a scene at particular volume levels utilizing compression values to manage overall gain continuity); (Piel: such as by adapting a processing pipeline in a specific manner). Examiner considers the claim obvious to try to an average skilled practitioner in possession of Gro, Wang, and Piel.
Piel addresses the relevant problem of the sequencing of processing steps for audio signal processing and of distributing same across a plurality of processors at a plurality of timings; there exist a finite number of potential and predictable solutions in this regard, as signal processing steps, including the normalizing of portions, may occur prior to, concurrently with, or after the normalizing of other distinct portions; and one of ordinary skill in the art could have pursued the known potential solutions with a reasonable expectation of success, such as by issuing relevant instructions to the multi-core processing pipeline of the system. As such, the claim would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application.
Regarding claim 3
Gro in view of Wang in view of Piel teaches or suggests:
The computer-implemented method of claim 1, wherein compressing the first audio channel during the first scene is based on a difference between the first normalized loudness level of the first normalized AD section and the first loudness level associated with the one or more audio channels during the first scene (Gro: Figs 2A, 2B: such as the generation of a normalized scene level of figure 2B based on processing the scene portions of figure 2A based on the differences therebetween), and wherein compressing the first audio channel during the second scene is based on a difference between the second normalized loudness level of the second normalized AD section (Gro: Figs 2A, 2B: such as by iteratively generating normalized scene levels of figure 2B based on processing the scene portions of figure 2A based on the differences therebetween over the available or necessary-to-process scenes of a multi-scene media) and the third loudness level associated with the one or more audio channels during the second scene (Gro: Abstract; ¶ 3, 11, 13, 16, 20, etc.; Figs 1A, 1B, 5A-5C: the system normalizes the volume of sections, segments, portions, etc. of a media such that the media overall maintains consistent sound levels using scale factors based on differences between segment volumes); (Wang: Abstract; ¶ 57, 74-77, etc.: system adjusts, compresses, etc. the level of a portion of audio to be consistent with prior and subsequent portions and to allow highlighted volume to a voiceover); (Piel: ¶ 11, 12, 25-27, 60-65, etc.; Fig 7; Claim 8: system operates to adapt an audio signal in keeping with a user specification thereof). The claim is considered obvious over Gro as modified by Wang and Piel as addressed in the base claim, as it would have been obvious to apply the further teaching of Gro, Wang, and/or Piel to the modified device of Gro, Wang, and Piel; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 4
Gro in view of Wang in view of Piel teaches or suggests:
The computer-implemented method of claim 3, further comprising:
determining the difference between the first normalized loudness level of the first normalized AD section and the first loudness level associated with the one or more audio channels during the first scene (Gro: Figs 2A, 2B: such as by iteratively generating normalized scene levels of figure 2B based on processing the scene portions of figure 2A based on the differences therebetween over the available or necessary-to-process scenes of a multi-scene media);
determining the difference between the second normalized loudness level of the second normalized AD section and the third loudness level associated with the one or more audio channels during the second scene (Gro: Figs 2A, 2B: such as by iteratively generating normalized scene levels of figure 2B based on processing the scene portions of figure 2A based on the differences therebetween over the available or necessary-to-process scenes of a multi-scene media);
comparing the difference between the first normalized loudness level of the first normalized AD section and the first loudness level associated with the one or more audio channels during the first scene to a first range of a plurality of ranges (Gro: ¶ 11, 16, 25, 26, 31, 59-62: system includes compression and frequency metadata characteristics for scaling, compressing, normalizing, etc. particular frequency ranges of one or more audio tracks based on differences therebetween); (Wang: ¶ 34, 42, 65-67, 95, 101; Fig 6A, 6B: such as for determining properties of particular tracks, or portions thereof, such as for generating labels, metadata, or other specific characteristics thereof; such as for labelling specific ranges of frequencies, etc.);
comparing the difference between the second normalized loudness level of the second normalized AD section and the third loudness level associated with the one or more audio channels during the second scene to a second range of the plurality of ranges (Gro: ¶ 11, 16, 25, 26, 31, 59-62: system includes compression and frequency metadata characteristics for scaling, compressing, normalizing, etc. particular frequency ranges of one or more audio tracks based on differences therebetween); (Wang: ¶ 34, 42, 65-67, 95, 101; Fig 6A, 6B: such as for determining properties of particular tracks, or portions thereof, such as for generating labels, metadata, or other specific characteristics thereof; such as for labelling specific ranges of frequencies, etc.);
generating, based on the first range, a first parameter set for compressing the first audio channel during the first scene (Gro: ¶ 11, 16, 25, 26, 31, 59-62: parameters determined by a user of the system or generated by a codec to which tracks are to be compressed, such as for transmitting audio upon a selected codec); (Wang: ¶ 80, 94, 95; Fig 1, 5: system operates to compress an audio track to a determined loudness level, such as by processor(s) 130, such as for determining an appropriate compression and data structure thereof for reducing saved track data); and
generating, based on the second range, a second parameter set for compressing the first audio channel during the second scene (Gro: ¶ 11, 16, 25, 26, 31, 59-62: parameters determined by a user of the system or generated by a codec to which tracks are to be compressed, such as for transmitting audio upon a selected codec); (Wang: ¶ 80, 94, 95; Fig 1, 5: system operates to compress an audio track to a determined loudness level, such as by processor(s) 130, such as for determining an appropriate compression and data structure thereof for reducing saved track data).
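For illustration of selecting a compression parameter set according to the range into which a loudness difference falls (all ranges, parameter names, and parameter values below are hypothetical and are not drawn from the cited references):

```python
# Hypothetical compressor parameter sets keyed to ranges (in dB) of the
# difference between the normalized AD level and the program-channel level.
PARAM_SETS = [
    ((0.0, 6.0), {"ratio": 2.0, "attack_ms": 10.0}),
    ((6.0, 12.0), {"ratio": 4.0, "attack_ms": 5.0}),
]

def params_for_difference(diff_db):
    """Return the parameter set whose range contains the loudness
    difference; fall back to pass-through outside the known ranges."""
    for (lo, hi), params in PARAM_SETS:
        if lo <= diff_db < hi:
            return params
    return {"ratio": 1.0, "attack_ms": 10.0}  # pass-through: no compression
```

A larger difference maps to heavier compression here, but the mapping is purely illustrative of per-range parameter selection.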
Gro in view of Wang in view of Piel does not explicitly teach classifying the data to resolve the recited parameters, merely processing the data in such a way, nor the utility of classifying the audio tracks to determine particular characteristics thereof. However, it would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the base method and the known technique of classifying audio tracks to determine parameters thereof, as taught by Wang, to improve the Gro in view of Wang in view of Piel system and method to select among the taught compression schemes based on classified frequency ranges, volume ranges, etc., such as for the purpose of managing headroom, file size for transmission, etc. Further, one of ordinary skill in the art could have implemented such an improvement to the system without undue experimentation and in full expectation of predictable results. The claim is thus considered obvious over Gro as modified by Wang and Piel as addressed in the base claim, as it would have been obvious to apply the further teaching of Gro, Wang, and/or Piel to the modified device of Gro, Wang, and Piel; one of ordinary skill in the art would have expected only predictable results therefrom.
Claims 5-20 are rejected under 35 U.S.C. 103 as being unpatentable over Groeschel (US 2013/0170672), hereinafter Gro, in view of Wang (US 2021/0151082).
Regarding claim 5
Gro teaches:
A system for generating an audio description (AD) content, the system comprising: memory that stores computer-executable instructions; and one or more processors in communication with the memory, wherein the computer-executable instructions, when executed by the one or more processors (Gro: ¶ 15, 38, 64, etc.; Fig 2: memory bearing coded instructions for operation in concert with a processor such as that of the figure for receiving a main input audio signal and associated audio signal wherein the audio signals are associated with a movie, DVD video, or other media bearing same), cause the one or more processors to:
obtain an input audio comprising audio for a video content item (Gro: ¶ 15, 64, etc.: system receives a main input audio signal and associated audio signal wherein the audio signals are associated with a movie, DVD video, or other media bearing same);
obtain an AD narration, wherein the AD narration comprises a plurality of AD sections, a first AD section of the plurality of AD sections corresponding to a first scene of the input audio (Gro: ¶ 10, 11, 64, 99, 100: such as an associated description track, such as a director's commentary which is overlaid upon the main audio based on metadata associated therewith and comprising a first, second, etc. director commentary narrating aspects of a scene, the production thereof, etc. overlaid upon or among scenes of a movie, or other descriptive tracks respectively corresponding to one or more particular first, second, etc. scenes in the movie);
modify (Gro: ¶ 11, 62, 96, etc.; Fig 1A, etc.: system performs leveling, normalization, etc. on audio tracks based on scene-wise metadata; the figure “shows an example of different programs without such leveling or dialog level normalization,” to which leveling or normalization is to be applied), using a loudness level associated with the first scene, a loudness level of the first AD section to generate a first modified AD section (Gro: ¶ 3, 11, 13, 16, 62, 63, 69, 96, etc.; Fig 1A, 1B, 3, etc.: such as by scaling a director's commentary track with respect to a main audio scene level of one or more scenes of a production wherein the main audio scene level is also scaled; that is, consider the exemplary very loud scene which would have audio at a first loudness level greater than that of the loudness level of the commentary, in a manner analogous to the louder and lower audio S1 and S2 respectively in Fig 1A; Examiner is aware that the utility of Fig 1A in the specification does not address the scaling in this manner, but the statement and figure serve to illustrate the manner in which such processing is accomplished, that is, to arrive at loudness normalization in the form of a uniform audio by provision of producer, director, user, etc. desired scaling values to alter the general sound level of audio portions to generate consistent audio, such that the program audio of the scene at the higher level, the S1 audio, etc. is scaled down, attenuated, etc. with respect to the commentary audio in the scene and/or the commentary audio of the scene at the lower signal level, the S2 audio, is scaled up, normalized with respect to the louder signal, etc., as exemplified by Fig 3, etc.; in this way dissimilar scaling factors may be used to normalize first, second, etc. audio within a scene to uniform levels such that “the input signals are normalized,” wherein “the normalization can be applied either before or after the determination of the dominant signal, as the results will be the same,” such as to ensure that dialog values “of the input signals is/are set correctly and for both the main and associated signals to be at dialog level 31 before mixing,”);
modify, based at least in part on a loudness level of the first modified AD section, the first scene of the input audio to generate a first modified scene (Gro: ¶ 63, 71; Figs 1B, 5A-5C: volume of a particular track adjusted with respect to other tracks within a scene to composite a scene for output, such as by mixing); and mix the first modified AD section and the first modified scene to generate a first AD content scene (Gro: ¶ 59-64, 69, 70; Fig 3: such as by providing the individual audio channel comprising the director's commentary to a channel of a multi-channel audio codec, standard, etc. as discussed, normalized in the manner discussed with respect to the main audio, such as for multi-channel output).
Gro describes mixing the main audio and commentary audio together, such as for output, and teaches and/or suggests the claimed subject matter in the manner discussed supra, which teaches that the mixing occurs in a simultaneous, overlapping manner over successive time periods. Gro does not explicitly teach the commentary occurring specifically between sections of dialogue of the main audio, wherein the first AD section and second AD section of the plurality of AD sections respectively correspond to a first scene and a second scene of the input audio, such that the Gro taught or suggested loudness levels associated with the one or more audio channels, interspersed sequentially within the first scene, are used to generate a first normalized AD section with a first normalized loudness level.
In a related field of endeavor Wang teaches a computer-implemented method comprising: under control of a computing system comprising one or more computer processors configured to execute specific instructions obtaining an input audio comprising one or more audio channels, the input audio being an audio track of a video content item (Wang: Abstract; ¶ 10, 52, 61, 62, etc.; Fig 1, 2, etc.: such as accessing audio tracks by operating memory modules upon one or more physical processors wherein the audio track is associated with a video recording, content item, etc.);
obtaining an audio description (AD), wherein the AD comprises a plurality of AD sections of the video content item (Wang: ¶ 86, 102-104; Fig 7: a dialog list comprising one or more translations of film dialog are input for creating an additional audio track wherein the in and out times for lines of dialog in one or more scenes are determined) between sections of dialogue of the input audio (Wang: ¶ 86, 103, 104, 107: system determines dialog in and out times for a plurality of scenes for the purpose of inserting a “synthesized voice description … to provide additional information to a visually impaired viewer without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog”), a first AD section and a second AD section of the plurality of AD sections respectively corresponding to a first scene and a second scene of the input audio (Wang: ¶ 11, 48, 59; Fig 2: system replaces and/or inserts audio into one or more scenes of a program, movie, etc.);
compressing, by a computer processor of the one or more computer processors, a first audio channel of the one or more audio channels during the first scene based at least in part on a loudness level of the first AD section to generate a first portion of a first compressed audio channel (Wang: ¶ 80, 94, 95; Fig 1, 5: system operates to compress an audio track to a determined loudness level; such as by processor(s) 130); compressing, by a computer processor of the one or more computer processors, the first audio channel of the one or more audio channels during the second scene based at least in part on a loudness level of the second AD section to generate a second portion of the first compressed audio channel (Wang: ¶ 80, 94, 95; Fig 1, 5, 6A, 6B: system operates to compress an audio track to a determined loudness levels such as upon subsequent audio portions of a media; such as by processor(s) 130); and
mixing the first and second AD sections to the first compressed audio channel during the first scenes and second scenes to generate a first sound channel of an AD content, wherein the AD content comprises the video content item and provides narration of the video content item between the sections of dialogue (Wang: Abstract; ¶ 86, 103, 104, 107: system mixes in or otherwise inserts a “synthesized voice description,” at dialog in and out times for a plurality of scenes to provide information without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to adapt the Gro system and method to compress the various dialog audio segments and mix or otherwise insert compressed audio description segments between dialog segments in the manner taught or suggested by Wang, and for at least the purpose of providing description information without interrupting or distracting from the original dialog audio, audio track, etc.; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 6
Gro in view of Wang teaches or suggests:
The system of claim 5, wherein a second AD section of the plurality of AD sections corresponds to a second scene of the input audio, and wherein the computer-executable instructions, when executed, further cause the one or more processors to:
modify, using a loudness level associated with the second scene, a loudness level of the second AD section to generate a second modified AD section; modify, based at least in part on a loudness level of the second modified AD section, the second scene of the input audio to generate a second modified scene (Gro: ¶ 11, 62, 63, 69, 96, etc.; Fig 1A, 1B, 3, 5A-C, etc.: such as by repeating the process to generate dissimilar scaling factors, such as in a scene with audio similar to that of S3 and S4 wherein a second scene requires scaling factors different from those of a first scene, and by which the dissimilar audio within a scene is normalized to uniform levels such that “the input signals are normalized,” wherein “the normalization can be applied either before or after the determination of the dominant signal, as the results will be the same,” such as to ensure that dialog values “of the input signals is/are set correctly and for both the main and associated signals to be at dialog level 31 before mixing”); (Wang: ¶ 11, 48, 59, 80, 94, 95; Fig 1, 2, 5: system replaces and/or inserts audio into one or more scenes of a program, movie, etc., such as to control output volume of an audio track(s) with respect to a determined loudness level; such as by processor(s) 130); and mix the second modified AD section and the second modified scene to generate a second AD content scene (Gro: ¶ 59-64, 69, 70; Fig 3: such as by providing the individual audio channel comprising the director's commentary to a channel of a multi-channel audio codec, standard, etc., normalized in the manner discussed with respect to the main audio, such as for multi-channel output); (Wang: Abstract; ¶ 86, 103, 104, 107: system mixes in or otherwise inserts a “synthesized voice description,” at dialog in and out times for a plurality of scenes to provide information without interrupting the audio track that is associated with the video recording, typically by inserting the synthesized voice description into a segment of the audio track in which there is no dialog). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 7
Gro in view of Wang teaches or suggests:
The system of claim 6, wherein the input audio comprises a third scene between the first scene and the second scene, and wherein the third scene of the input audio is unmodified. The claimed subject matter is considered obvious as a matter of design choice on the part of a director, producer, or other creator of content; Gro in view of Wang discusses the relevant scope and content of the prior art necessary to accomplish the claim (please see figure 1B of Gro which illustrates compositing of scenes); design incentives, such as those that abound in the video production industry, would have prompted such decisions on the part of a director, producer, or other creator of content; such a modification as claimed is encompassed by Gro in view of Wang and amounts to no more than a known and predictable variation which could have been implemented without undue experimentation in full expectation of predictable results. The claim is thus considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 8
Gro in view of Wang teaches or suggests:
The system of claim 7, wherein the computer-executable instructions, when executed, further cause the one or more processors to: concatenate the first scene of the input audio and the third scene of the input audio (Wang: Fig 1A, 1B: scenes are concatenated for output as part of generating a media). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 9
Gro in view of Wang teaches or suggests:
The system of claim 5, wherein the input audio comprises one or more audio channels of the audio for the video content item (Gro: ¶ 59-64, 69, 70; Fig 2A, 2B, 3: a plurality of scenes concatenated for output comprises at least an audio channel, AD channel, etc.); (Wang: ¶ 86, 103, 104, 107: system determines in and out times of audio for a plurality of scenes in concert with description audio corresponding to the scenes; said scenes concatenated for output as a media). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 10
Gro in view of Wang teaches or suggests:
The system of claim 9, wherein the computer-executable instructions, when executed, further cause the one or more processors to: boost the first AD content scene (Gro: ¶ 3, 11, 13, 16, 62, 63, 69, 96, etc.; Fig 1A, 1B, 3, etc.: system operates to scale audio tracks under direction of a user thereof); (Wang: ¶ 80, 94, 95; Fig 1, 5: system operates to compress or otherwise scale audio tracks to determined loudness levels). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 11
Gro in view of Wang teaches or suggests:
The system of claim 5, wherein the computer-executable instructions, when executed, further cause the one or more processors to: insert the first modified AD section, according to a start time or an end time of the first scene, to a silent audio file to generate a normalized narration file (Gro: ¶ 3, 11, 13, 16, 62, 63, 69, 96, etc.; Fig 1A, 1B, 3, etc.: system operable to normalize an audio file, narration file, etc.); (Wang: ¶ 86, 98, 102-104; Fig 7: system determines in and out times of audio and operates to add periods of silence therebetween such as for insertion of voice descriptions between dialog). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 12
Gro in view of Wang teaches or suggests:
The system of claim 11, wherein a duration of the normalized narration file equals a duration of the input audio (Wang: ¶ 86, 98, 102-104; Fig 7: such as for insertion of description therein). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 13
Gro in view of Wang teaches or suggests:
The system of claim 5, wherein the computer-executable instructions, when executed, further cause the one or more processors to: generate, based on an AD script, the AD narration (Wang: ¶ 39, 103-105: lektoring tracks the script dialog of a movie, such as by generating a text therefrom to provide to a text-to-speech engine for generating a synthesized narration voice based on a parsed script of a media). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 14
Gro in view of Wang teaches or suggests:
The system of claim 13, wherein the AD script is generated by a machine learning (ML) model or a human operator (Wang: ¶ 39, 103-105: input dialog is parsed and converted to markup language; the markup language is considered to be converted by a model or a human). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 15
Gro in view of Wang teaches or suggests:
The system of claim 13, wherein the AD narration is generated using a computer synthesized speech voice (Wang: ¶ 39, 103-105: lektoring tracks the script dialog of a movie, generates a text therefrom to provide to a text-to-speech engine for generating a synthesized narration voice based on a parsed script of a media). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 16
Gro in view of Wang teaches or suggests:
The system of claim 5, wherein the computer-executable instructions, when executed, further cause the one or more processors to: determine one or more audio properties of the input audio; and split, based at least in part on the one or more audio properties of the input audio, the input audio into one or more audio channels (Gro: ¶ 23-26: multi-channel audio comprises metadata scale factors for a plurality of output channels; such as based on determined output channel functionality); (Wang: ¶ 37, 86, 103, 104, 107: multimedia stream split into dialog and non-dialog sections, tracks, etc. and synthetic voice over tracks inserted therein). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 17
Gro in view of Wang teaches or suggests:
The system of claim 5, wherein modifying the loudness level of the first AD section comprises increasing or decreasing the loudness level of the first AD section based on the loudness level associated with the first scene (Gro: ¶ 3, 11, 13, 16, 62, 63, 69, 96, etc.; Fig 1A, 1B, 3, etc.: such as by scaling audio track(s) level(s) of one or more scenes of a production); (Wang: ¶ 80, 94, 95; Fig 1, 5: system operates to compress audio track(s) to determined loudness levels; such as by processor(s) 130). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 18
Gro in view of Wang teaches or suggests:
The system of claim 5, wherein the computer-executable instructions, when executed, further cause the one or more processors to: adjust a dynamic range of the first scene prior to modifying the first scene, wherein adjusting the dynamic range of the first scene comprises limiting the loudness level associated with the first scene to a predetermined loudness level (Gro: ¶ 6, 59-62, etc.: such as by using a compression scheme on one or more audio channels, such as a first audio channel normalized as described supra, such as for communication of the data to additional devices). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Regarding claim 19
The claim is considered to recite substantially similar subject matter to that of claim 5 and is similarly rejected.
Regarding claim 20
Gro in view of Wang teaches or suggests:
The computer-implemented method of claim 19, wherein modifying the first scene of the input audio is based on a difference between the loudness level of the first modified AD section and the loudness level associated with the first scene (Gro: Figs 2A, 2B: such as the generation of a normalized scene level of figure 2B based on processing the scene portions of figure 2A based on the differences therebetween); (Wang: ¶ 80, 94, 95; Fig 1, 5: system operates to compress audio tracks to determined loudness levels to maintain listenability of scenes in the media; such as by processor(s) 130). The claim is considered obvious over Gro as modified by Wang as addressed in the base claim as it would have been obvious to apply the further teaching of Gro and/or Wang to the modified device of Gro and Wang; one of ordinary skill in the art would have expected only predictable results therefrom.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL C MCCORD whose telephone number is (571)270-3701. The examiner can normally be reached 7:30-6:30 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, CAROLYN EDWARDS can be reached at (571) 270-7136. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/PAUL C MCCORD/Primary Examiner, Art Unit 2692
/CAROLYN R EDWARDS/Supervisory Patent Examiner, Art Unit 2692