DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the Advisory Action mailed 11/28/2025, Applicant filed a Request for Continued Examination (RCE) on 1/2/2026. With the filing of the RCE, the amendments that were not entered after final have now been entered (see the checked box for the previously submitted amendment on the 1/2/2026 PTO/SB/30EFS). These amendments added limitations to independent claims 1, 10, and 12 regarding the dividing of the input audio stream from a mono input including a streaming video content being carried out automatically; determining a metadata associated with the defined stereo location and using the metadata to inform the channel; and performing panning "based on the metadata that was informed to the channel."
Applicant has also argued that the prior art of record fails to teach the "ability to automatically divide the input audio stream into stereophonic channels allowing the generation of stereo sound from a mono input" where "the mechanism for doing this is based on the accompanying video input from the stream" to determine a location of a sound source and then "use this as meta data informing the channel to which the audio should be panned to" (resubmitted Remarks from 11/20/2025, Page 10). These arguments have been fully considered but are not found to be persuasive for the reasons noted in the Response to Arguments section below.
Applicant also argues that the amendments to independent claims 1, 10, and 12 overcome the rejections under 35 U.S.C. 112(b) (Remarks, Pages 9-10).
In response, the amendment to the independent claims adding "amongst the plurality of the stereophonic channels" does not overcome the rejection under 35 U.S.C. 112(b) because the limitation "the channel associated with an appropriate channel" still does not appear earlier in the claims, and the antecedent basis issue for "the audio" in the last line of the claims (the same step) still has not been resolved. In the specific case of claim 10, the duplicate claim language issue and the separate antecedent basis issue have been resolved, and those rejections are withdrawn.
Response to Arguments
With respect to independent claims 1, 10, and 12, Applicant argues that the prior art of record fails to teach the "ability to automatically divide the input audio stream into stereophonic channels allowing the generation of stereo sound from a mono input" where "the mechanism for doing this is based on the accompanying video input from the stream" to determine a location of a sound source and then "use this as meta data informing the channel to which the audio should be panned to" (resubmitted Remarks from 11/20/2025, Page 10). In particular, Applicant contends that Apsingekar, et al. (U.S. PG Publication: 2024/0056761 A1) instead "concentrates on classifying each of the audio sources into one of the object classes based on the position or the motion trajectory of that object" (Remarks, Page 10).
In response, the specification describes the use of "metadata" in only three places: twice in Paragraph 0076 and once in Paragraph 0078. The specification notes only that the metadata is generated "describing the spatial location associated with the sound of interest" (Paragraph 0076), consistent with the interpretation provided in the arguments on page 10. The specification does not provide a clear and unmistakable definition of "metadata," so the ordinary meaning of the term, namely data describing other data, controls the claim term interpretation under the broadest reasonable interpretation (BRI).
Turning then to Apsingekar, the applied prior art describes deriving "information" (i.e., data) about audio associated with video objects that is used to determine "how to modify...the substantially mono audio content in different audio samples" (Paragraph 0053). In the same paragraph, Apsingekar discusses metadata for a moving object that is used to render audio content "so that sounds are played from different speakers to make it audibly appear as if the object is moving within a 3D space around a listener." Apsingekar also offers another example in which trajectory information/metadata is used to pan the audio of a vehicle in accordance with "the vehicle's motion" (see Paragraph 0066).
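For illustration only, the following is a minimal Python sketch of how a video-derived, normalized horizontal position can act as metadata informing the stereo channels to which mono audio is panned. It is not drawn from Apsingekar or from the instant specification; the function name, parameter names, and the constant-power panning law are assumptions of this sketch.

import math

def pan_mono_to_stereo(mono_samples, position):
    """Constant-power pan of mono samples into left/right channels.
    'position' is the normalized horizontal location of the matched visual
    object (0.0 = far left of the frame, 1.0 = far right) and plays the role
    of the video-derived metadata informing the channels."""
    theta = position * math.pi / 2.0              # map [0, 1] onto [0, pi/2]
    left_gain, right_gain = math.cos(theta), math.sin(theta)
    left = [s * left_gain for s in mono_samples]
    right = [s * right_gain for s in mono_samples]
    return left, right

# Example: an object detected near the left edge of the frame (position 0.2)
# produces output weighted toward the left channel.
left, right = pan_mono_to_stereo([0.5, 0.25, -0.25], 0.2)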
Lastly, in regard to Applicant's position that Apsingekar does not teach automatically dividing the input audio stream into a plurality of stereophonic channels based on the video data that also generates the metadata, it is first noted that these arguments relate to unclaimed features. While the claim does note that the mono input is "including a streaming video content," there is no mention that this streaming video content is relied upon to divide the audio signal, only that the mono audio data includes such information. Moreover, Apsingekar teaches the applicable claim language in describing a mono audio input being automatically divided into multi-channel stereo (Paragraphs 0053, 0059-0062, and 0065), where the input audio may take the form of streaming audio (e.g., Paragraph 0005 explicitly noting "separating audio streams within the audio content based on the video content," "streaming content" in Paragraph 0069, and "audio streams" in Fig. 8, Element 808), and in illustrating a computer-based "sound source classification and separation" block (306) that takes in the mono audio clip (206b) to produce a multi-channel audio clip (212) after subsequent processing. In any event, even though the use of video content for automatic audio channel division is not claimed, Apsingekar nevertheless teaches object detection "associated with the video" to modify the mono audio content in "different audio channels" (see Paragraphs 0005 ("separating audio streams within the audio content based on the video content"), 0053, and 0059, the latter describing the results of image classification being applied to the sound source separation functionality).
Thus, Applicant's arguments directed towards the independent claims have been fully considered, but are not found to be persuasive.
The prior art rejections of the remaining dependent claims have been traversed for reasons similar to those presented for the independent claims (Remarks, Page 12). In regard to such arguments, see the response directed towards the independent claims above.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-8 and 10-19 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
In the "informing" step of independent Claims 1, 10, and 12, "the channel associated with an appropriate channel" lacks antecedent basis since the preceding channels are only recited in a plurality and as “the channel” in the determination of metadata step instead of the channel “associated with an appropriate channel.” This limitation will be construed as -- an appropriate channel-- in the interest of claim interpretation for compact prosecution. Also, in the “determining a metadata” step/function added via the instant amendment, “the channel” lacks antecedent basis since only multiple channels were introduced earlier in the claim and not a singular channel where it is unclear whether a particular channel of those multiple channels is being referenced or if a new limitation is being introduced. For claim interpretation in the interest of compact prosecution, "the channel" will be construed as "a channel of the plurality of stereophonic channels"
Lastly, for "the audio" in the same step," it is unclear what audio in the claim is being referenced because the claims feature multiple preceding audio instances (e.g., "stereoscopic audio" and "audio fingerprint"). For claim interpretation, "the audio" will be interpreted as --audio corresponding to the sound of interest--
Claim 8 recites "generating metadata" where it is unclear whether this metadata is the same as the metadata already determined in parent claim 1 or is attempting to reference a separate instance of metadata. For the purposes of claim interpretation, claim 8 will be construed as further associating --the determined metadata-- of parent claim 1 with the sound of interest, with the generating step considered removed since the metadata is already determined in parent claim 1. Claim 19 contains similar indefinite claim language, and thus, has also been rejected under 35 U.S.C. 112(b) under similar rationale.
The dependent claims inherit, and fail to resolve, the indefiniteness issues of the independent claims, and thus, have also been rejected under 35 U.S.C. 112(b) by virtue of their dependency.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claims 1-8 and 10-19 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Apsingekar, et al. (U.S. PG Publication: 2024/0056761 A1).
With respect to Claim 1, Apsingekar discloses:
A computer implemented method for generating stereophonic audio for a video having a monophonic audio track (see title of the invention regarding 3D sound rendering based upon mono audio input), the method comprising:
processing an input audio stream and an accompanying video using a feature recognition algorithm to identify a visual feature of interest (video clip 206a is fed into an object detection, tracking, and classification analyzer 302 that uses video feature frames to classify regions of interest, Paragraphs 0052-0053, 0057, 0063, 0072, and 0075; Fig. 3; See also various regions of interest detected in a video frame enclosed by bounding boxes, Fig. 4; the video accompanies audio that is also processed, Paragraphs 0056-0058);
for each identified visual feature of interest, determining a spatial location of the visual feature of interest in the video ("identify the position of each detected object" and "may also identify the position of each object within at least some image frames of the video clip(s)," Paragraph 0057; see also Paragraphs 0072 and 0079- "position or trajectory of the object 402");
identifying, in a monophonic audio track of the video, a sound of interest (a "corresponding" mono audio clip 206b is fed into a sound source classification and separation analyzer 306 where "the sound source classification and separation function 306 may operate to identify audio data within the audio clip(s) 206b that appears to come from each of one or more unique audio sources" and "can operate to classify that audio source into one of the same object classes used to classify the detected objects, such as when the sound source classification and separation function 306 attempts to classify each audio source as a person, animal, vehicle, aircraft, or other type of object," Paragraphs 0059-0060);
automatically dividing the input audio stream into a plurality of stereophonic channels allowing the generation of stereo sound from a mono input including a streaming video content (see Fig. 3 showing a mono audio stream being input into a sound source separation component (306) for separation into the multiple stereo channels (310) based upon video stream content; see also Paragraphs 0053, 0059-0062, and 0065 describing such processing where the "audio channel can represent a separate stream of audio data to be provided to and played by a separate speaker" based upon processing the associated video stream; where the input audio may take the form of streaming audio, Paragraph 0005 explicitly noting "separating audio streams within the audio content based on the video content," "streaming content" in Paragraph 0069, and "audio streams" in Fig. 8, Element 808; Paragraph 0050- audio-video input 202 includes video content and substantially mono audio content; computer-based/automatic processing described in Paragraphs 0088 and 0090);
determining an audio fingerprint for the sound of interest (sound source classification and separation function 306 here can generate various outputs 310, such as separate audio streams extracted from the audio clip(s) 206b and the object class associated with each audio stream, Paragraphs 0061 and 0076, wherein the extracted individual sounds belonging to specific objects in a video frame correspond to an audio fingerprint of that object- e.g., person, vehicle, animal, etc.);
analyzing the video based on the sound of interest and the determined audio fingerprint to identify if the sound of interest is linked to any of one or more visual features of interest (the results of the video and mono audio clip classification algorithms are fed into a sound matching and rendering analyzer 312 where the "sound matching and rendering function 312 may further receive a separate audio stream for each potential audio source, an object classification for each potential audio source, and an indication whether each potential audio source is an on-screen or off-screen source," Paragraph 0062, and where "sound matching and rendering function 312 can also determine whether the object classification of each detected object matches the object classification of any of the audio sources associated with the separated audio streams. For example, the sound matching and rendering function 312 may determine if there is an object classified as a person and a separated audio stream associated with an audio source classified as a person, and the sound matching and rendering function 312 may determine if there is an object classified as a vehicle and a separated audio stream associated with an audio source classified as a vehicle," Paragraph 0063; see also Paragraphs 0065 and 0092);
responsive to identifying the sound of interest is linked to a visual feature of interest, associating the sound of interest with the determined spatial location of the visual feature of interest in the video (once a match/link between the sound object and the video object is identified, the spatial location of the visual object in the video is associated with the sound object for stereo rendering, see Paragraph 0065- "For each object identified as being a sounding object, the sound matching and rendering function 312 can use the position and/or trajectory of that object to distribute the audio stream associated with that object among multiple audio channels;" see also Paragraphs 0066, 0079, and 0092);
defining a stereo location of the sound of interest within stereoscopic audio for the video based on the associated spatial location in the video (once a match/link between the sound object and the video object is identified, the spatial location of the visual object in the video is associated with the sound object for stereo rendering, see Paragraph 0065- "For each object identified as being a sounding object, the sound matching and rendering function 312 can use the position and/or trajectory of that object to distribute the audio stream associated with that object among multiple audio channels;" see also Paragraphs 0066, 0079, and 0092; Fig. 8, Element 812, wherein it is noted that the particular audio channel associated with the sound of interest/sound object corresponds to the claimed "stereo location");
determining a metadata associated with defined stereo location and use the metadata to inform the channel (Apsingekar describes deriving "information" (i.e., data) about audio associated with video objects that is used to determine "how to modify...the substantially mono audio content in different audio samples" (Paragraph 0053), wherein Apsingekar discusses metadata for a moving object used to render audio content "so that sounds are played from different speakers to make it audibly appear as if the object is moving within a 3D space around a listener." Apsingekar also offers another example where trajectory information/metadata is used to pan audio of a vehicle in accordance with "the vehicle's motion" (see Paragraph 0066));
panning the sound of interest within the stereoscopic audio for the video based on the defined stereo location of the sound of interest based on the metadata that was informed to the channel (panning as a result of visual object location/trajectory information/metadata gleaned from a video- see examples of the mono audio of an explosion panned to a left speaker in Paragraph 0053 or a vehicle sound that starts in the left speaker and moves to a right speaker as the vehicle moves in the video, Paragraph 0066); and
informing the channel associated with an appropriate channel amongst the plurality of the stereophonic channels to which the audio should be panned (Paragraph 0065- "sound matching and rendering function 312 can use the position and/or trajectory of that object to distribute the audio stream associated with that object among multiple audio channels. Each audio channel can represent a separate stream of audio data to be provided to and played by a separate speaker." See also Paragraphs 0063 and 0066 describing the use of object "trajectory" or "motion" in a video to inform a specific/appropriate audio channel to play an associated audio stream (e.g., a car sound played in a left speaker and then in a right speaker responsive to the movement/trajectory of that car)).
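For illustration of the matching and channel-routing operations mapped above, the following is a minimal Python sketch offered only as an interpretive aid; it is not Apsingekar's implementation or the claimed method, and the class names, labels, and 0.5 threshold are assumptions of this sketch. It shows separated audio sources being matched to detected visual objects by class, with the matched object's position then informing the channel toward which each source's audio is panned.

from dataclasses import dataclass

@dataclass
class VisualObject:
    label: str    # class from video object detection, e.g. "person", "vehicle"
    x: float      # normalized horizontal position, 0.0 (left) to 1.0 (right)

@dataclass
class AudioSource:
    label: str    # class assigned by mono sound source separation
    samples: list # separated mono samples for this source

def match_and_route(objects, sources):
    """Match each separated audio source to a detected visual object by class;
    the matched object's position then informs the channel toward which that
    source's audio should be panned. Unmatched sources stay centered."""
    routing = {}
    for src in sources:
        match = next((o for o in objects if o.label == src.label), None)
        if match is None:
            routing[src.label] = "center"
        elif match.x < 0.5:
            routing[src.label] = "left"
        else:
            routing[src.label] = "right"
    return routing

# Example: a vehicle detected at the right of the frame routes the separated
# vehicle audio toward the right channel; the unmatched dog sound stays centered.
objects = [VisualObject("vehicle", 0.9), VisualObject("person", 0.3)]
sources = [AudioSource("vehicle", [0.1, 0.2]), AudioSource("dog", [0.0, 0.1])]
print(match_and_route(objects, sources))  # {'vehicle': 'right', 'dog': 'center'}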
With respect to Claim 2, Apsingekar further discloses:
The method of claim 1, wherein the visual feature of interest comprises a representation of at least part of a living body, and wherein the feature recognition algorithm comprises a body part detection algorithm configured to identify a presence of one or more parts of the living body within the video ("detection, classification, and tracking function 302 may operate to identify various objects contained within the video clip(s) 206a and classify each object as a person," Paragraph 0057; video analysis models/algorithms discussed in Paragraph 0075; and see Fig. 4, Elements 402 and 404 depicting body parts included in a detected bounding box and note also in Fig. 4 that the body part analysis of an object can identify the parts as corresponding to an animal such as a dog or horse).
With respect to Claim 3, Apsingekar further discloses:
The method of claim 1, wherein the spatial location of the visual feature of interest describes a lateral position of the visual feature in a lateral axis of a field of view of the video, and wherein determining the spatial location of the visual feature of interest in the video comprises: analyzing the video to determine a position of the visual feature of interest in the field of view of the video ("identify the position of each detected object" and "may also identify the position of each object within at least some image frames of the video clip(s)," Paragraph 0057; see also Paragraphs 0072 and 0079- "position or trajectory of the object 402"); and based on the determined position of the visual feature of interest, categorizing the position of the visual feature of interest into one of a set of lateral position categories, wherein the set of lateral position categories comprises a left category; a center category; and a right category (the sound rendering based upon visual object locations includes lateral positions such as left. See the example in Paragraph 0053- "One example of this was noted above, where an explosion occurs on the left side of a scene. The object detection and sound rendering operation 208 can operate to detect the explosion, determine amplitudes and phases of the substantially mono audio content as reproduced in different audio channels, and generate multi-channel audio in which the substantially mono audio content is modified to play the explosion sound only on left speakers of a speaker setup." See also the left-to-right location of a vehicle example in Paragraph 0066).
With respect to Claim 4, Apsingekar further discloses:
The method of claim 1, wherein the spatial location of the visual feature of interest further describes a distance of the visual feature of interest from a viewpoint of the video, and wherein determining the spatial location of the visual feature of interest in the video comprises: analyzing the video to determine said distance of the visual feature of interest from the viewpoint of the video; and based on the determined distance of the visual feature of interest, categorizing the distance of the visual feature of interest into one of a set of distance categories, wherein the set of distance categories comprises a near category; a middle category; and a far category (See Paragraph 0066 that includes near/towards categories and far/away categories to classify a video object of interest from the viewpoint of the video by the viewer- "Similar types of effects may be created for objects moving in other directions, such as… in a direction perpendicular to the plane of a viewing screen (like when vehicles or other objects appear to move towards or away from a viewer"; see also prior citations regarding the use of sound object position to affect sound object rendering- "identify the position of each detected object" and "may also identify the position of each object within at least some image frames of the video clip(s)," Paragraph 0057; see also Paragraphs 0072 and 0079).
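Purely as an illustrative aid for the categorization recited in claims 3 and 4, the following is a small Python sketch of bucketing a detected object's horizontal position into left/center/right and a crude distance proxy into near/middle/far. It is hypothetical; the thresholds and the bounding-box-height heuristic are assumptions of this sketch and are not taken from Apsingekar or the instant application.

def lateral_category(x_pixel, frame_width):
    """Bucket a detected object's horizontal position into left/center/right."""
    ratio = x_pixel / frame_width
    if ratio < 1 / 3:
        return "left"
    if ratio < 2 / 3:
        return "center"
    return "right"

def distance_category(box_height, frame_height):
    """Crude distance proxy: a taller bounding box suggests a nearer object.
    The 0.5 and 0.2 thresholds are arbitrary illustration values."""
    ratio = box_height / frame_height
    if ratio > 0.5:
        return "near"
    if ratio > 0.2:
        return "middle"
    return "far"

# Example: an object centered at pixel column 120 of a 1920-pixel-wide frame,
# with a 300-pixel-tall bounding box in a 1080-pixel-tall frame.
print(lateral_category(120, 1920), distance_category(300, 1080))  # left middle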
With respect to Claim 5, Apsingekar further discloses:
The method of claim 1, wherein identifying the sound of interest in the monophonic audio track of the video comprises: processing the monophonic audio track with a voice recognition algorithm to detect one or more spoken words of interest; and identifying the detected one or more spoken words of interest as the sound of interest (the mono audio clip 206b is fed into the sound source classification and separation module 306 as depicted in Fig. 3 where a machine learning model/algorithm is used to classify audio associated with "one person speaking," Paragraph 0076; note that the detection of active speaking that would convey a language comprising words would constitute a detection of “one or more spoken words of interest” as claimed).
With respect to Claim 6, Apsingekar further discloses:
The method of claim 1, wherein the audio fingerprint for the sound of interest describes a variation in an audio parameter value of the sound of interest, and wherein the audio parameter comprises at least one of a frequency, an amplitude, a wave form or a duration (classification of the mono audio clip 206b "may include determining an amplitude and a phase of the audio stream associated with that audio source," Paragraph 0032; amplitude is also discussed as volume in Paragraph 0066; See also Paragraphs 0053 and 0065; tracking of variation in the parameter is detailed as the object moves in a video sequence (Paragraph 0065)- "the amplitudes and/or phases depend on the position and/or trajectory of the sounding object").
With respect to Claim 7, Apsingekar further discloses:
The method of claim 1, wherein analyzing the video based on the sound of interest and the determined audio fingerprint comprises: identifying a portion of the video associated with the monophonic audio track comprising the sound of interest (the results of the video and mono audio clip classification algorithms are fed into a sound matching and rendering analyzer 312 where the "sound matching and rendering function 312 may further receive a separate audio stream for each potential audio source, an object classification for each potential audio source, and an indication whether each potential audio source is an on-screen or off-screen source," Paragraph 0062, and where "sound matching and rendering function 312 can also determine whether the object classification of each detected object matches the object classification of any of the audio sources associated with the separated audio streams. For example, the sound matching and rendering function 312 may determine if there is an object classified as a person and a separated audio stream associated with an audio source classified as a person, and the sound matching and rendering function 312 may determine if there is an object classified as a vehicle and a separated audio stream associated with an audio source classified as a vehicle," Paragraph 0063; see also Paragraphs 0065 and 0092);
analyzing the identified portion of the video associated with the mono audio track to detect a causal relationship between the audio fingerprint of the sound of interest and a variation in any of the one or more visual features of interest (trajectory/movement tracking of visual objects related to sound objects (i.e., different movements/trajectories of a visual object map to/cause different audio fingerprints with respect to varying phase and amplitude), Paragraphs 0053 and 0065-0066); and
responsive to detecting a causal relationship between the audio fingerprint of the sound of interest and a variation in a first visual feature of interest, identifying that the sound of interest is linked to the first visual feature of interest (matching of audio fingerprint (e.g., sound stream of the object, amplitude, phase, location, etc.) to the corresponding visual object, Paragraphs 0053, 0063, and 0065-0066; see also paragraph 0092 where the audio fingerprint is “based on at least one of the position or the motion trajectory of that object”).
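As an interpretive aid only for the causal-relationship analysis mapped above, the following minimal Python sketch correlates a sound's frame-by-frame amplitude envelope with a tracked object's motion and treats the two as linked when the correlation is high. The function names, the Pearson-correlation test, and the 0.7 threshold are assumptions of this sketch rather than anything disclosed by Apsingekar or the instant specification.

def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
    sd_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
    return cov / (sd_a * sd_b) if sd_a and sd_b else 0.0

def sound_linked_to_feature(amplitude_envelope, object_motion, threshold=0.7):
    """Treat the sound of interest and the visual feature as linked when the
    frame-by-frame amplitude envelope tracks the object's motion closely."""
    return pearson(amplitude_envelope, object_motion) >= threshold

# Example: a sound whose loudness rises as the tracked object speeds up.
envelope = [0.1, 0.2, 0.4, 0.8, 0.9]
motion = [0.0, 0.1, 0.5, 0.7, 1.0]
print(sound_linked_to_feature(envelope, motion))  # True for this toy data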
With respect to Claim 8, Apsingekar further discloses:
The method of claim 1, wherein defining the stereo location of the sound of interest within the stereoscopic audio for the video comprises: generating metadata describing the spatial location associated with the sound of interest; and associating the generated metadata with the sound of interest (generating and assigning amplitude and phase data for an extracted sound object (e.g., person or vehicle sounds) associated with position and trajectory, Paragraphs 0032, 0053, 0065-0066, and 0092).
Claim 10 contains subject matter similar to Claim 1, and thus, is rejected under similar rationale. Furthermore, Apsingekar teaches method implementation as computer program instructions stored on one or more computer readable storage media (Paragraph 0010).
Claim 11 contains subject matter similar to claim 2, and thus, is rejected under similar rationale.
Claim 12 contains subject matter similar to Claim 1, and thus, is rejected under similar rationale. Furthermore, Apsingekar teaches method implementation as a system comprising one or more processors and a memory comprising program code (Paragraph 0036).
Claims 13-19 contain subject matter respectively similar to claims 2-8, and thus, are rejected under the same rationale.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Sheaffer, et al. (U.S. Patent: 11,997,463)- teaches generating spatial audio parameters for a mono audio signal, including direction of arrival information for a particular object sound (Abstract and Col. 1, Line 49- Col. 2, Line 17).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAMES S WOZNIAK whose telephone number is (571)272-7632. The examiner can normally be reached 7-3, off alternate Fridays.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders can be reached at (571)272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
JAMES S. WOZNIAK
Primary Examiner
Art Unit 2655
/JAMES S WOZNIAK/ Primary Examiner, Art Unit 2655