Prosecution Insights
Last updated: April 19, 2026
Application No. 18/738,243

KNOWLEDGE-BASED AUDIO SCENE GRAPH

Status: Non-Final Office Action (§103)
Filed: Jun 10, 2024
Examiner: PATEL, HEMANT SHANTILAL
Art Unit: 2694
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 1 (Non-Final)

Grant Probability: 81% (Favorable)
Expected OA Rounds: 1-2
Median Time to Grant: 2y 10m
Grant Probability With Interview: 95%

Examiner Intelligence

Career Allow Rate: 81% (761 granted / 939 resolved; +19.0% above the Tech Center average)
Interview Lift: +13.6% (a moderate lift; allow rate among resolved cases with an interview vs. without)
Typical Timeline: 2y 10m average prosecution
Currently Pending: 25 applications
Career History: 964 total applications across all art units

Statute-Specific Performance

§101: 4.5% (-35.5% vs TC avg)
§102: 15.4% (-24.6% vs TC avg)
§103: 44.9% (+4.9% vs TC avg)
§112: 22.9% (-17.1% vs TC avg)

Tech Center averages are estimates; figures are based on career data from 939 resolved cases.
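Each delta reads as the examiner's rate minus the Tech Center average, and all four rows back out the same implied TC-average estimate of 40.0%, consistent with the "estimate" caveat above. A quick check (figures copied from the table; the code itself is illustrative):

```python
# delta = examiner_rate - tc_avg, so tc_avg = rate - delta.
# All four rows imply the same TC-average estimate of 40.0%.
rates  = {"101": 4.5, "102": 15.4, "103": 44.9, "112": 22.9}
deltas = {"101": -35.5, "102": -24.6, "103": 4.9, "112": -17.1}

for statute, rate in rates.items():
    print(f"§{statute}: implied TC avg = {rate - deltas[statute]:.1f}%")  # 40.0%
```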

Office Action (Non-Final, §103)
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-2, 11-13, 16-18, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ferri (US Patent No. 12,374,353), and further in view of Maas (US Patent No. 10,854,192).

Regarding claim 1, Ferri teaches a device (Figs. 11-12) comprising: a memory (Fig. 11 items 1106, 1108, Fig. 12 items 1206, 1208) configured to store knowledge data (Fig. 1A items 155, 157, 172); and one or more processors coupled to the memory (Fig. 11 item 1104, Fig. 12 item 1204) (col. 52 ll. 24-col. 54 ll. 7) and configured to: obtain a first audio embedding (Fig. 1A item 162a) of a first audio segment of audio data (Fig. 1A item 113a), the first audio segment corresponding to a first audio event of audio events (Fig. 1A item 112a) (col. 4 ll. 64-col. 5 ll. 26); obtain a first text embedding (text of natural language processing identifying event) of a first tag (event profile) assigned to the first audio segment (col. 10 ll. 1-14, col. 13 ll. 2-12); obtain a first event representation of the first audio event, the first event representation based on a combination of the first audio embedding and the first text embedding (col. 5 ll. 26-col. 7 ll. 67, col. 12 ll. 48-56 different types of sound events); obtain a second event representation of a second audio event of the audio events (event detection for any of Fig. 1A items 112b, 112c); determine, based on the knowledge data (Fig. 1A items 155, 157, 172), relations between the audio events (col. 5 ll. 26-col. 7 ll. 67 relations in time); and construct an audio scene graph based on the audio events (Fig. 6 item 620, col. 11 ll. 19-51, col. 32 ll. 49-59, col. 36 ll. 52-64) (col. 4 ll. 48-col. 44 ll. 45 for complete details).

Ferri does not specifically teach to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event. However, in the similar field, Maas teaches to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event (col. 14 ll. 18-col. 16 ll. 20, col. 22 ll. 29-col. 23 ll. 48). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event as taught by Maas so that “The highest recognition score path, where the recognition score is a combination of the acoustic model score, the language model score, and/or other factors, may be returned by the speech recognition engine 258 as the ASR result for the associated feature vectors” (Maas, col. 15 ll. 28-32).

Regarding claim 2, Ferri teaches to obtain audio segments of audio data identified as corresponding to the audio events, the audio segments including the first audio segment (Fig. 1A item 113a) and a second audio segment (Fig. 1A item 113b), wherein the second audio segment corresponds to the second audio event (Fig. 1A item 112b); and obtain tags (event profiles) assigned to the audio segments, a tag of a particular audio segment describing a corresponding audio event, wherein the tags include the first tag (col. 4 ll. 64-col. 5 ll. 26, col. 10 ll. 1-14).

Regarding claim 11, Ferri teaches to encode the audio scene graph to generate an encoded graph, and use the encoded graph to perform one or more downstream tasks (col. 32 ll. 48-59, col. 34 ll. 1-51 encoded graph used for mapping).

Regarding claim 12, Ferri teaches to update the audio scene graph based on user input, video data, or both (col. 36 ll. 40-64).

Regarding claim 13, Ferri teaches to generate a graphical user interface (GUI) including a representation of the audio scene graph; provide the GUI to a display device; receive a user input; and update the audio scene graph based on the user input (Fig. 6, col. 32 ll. 48-59, col. 33 ll. 55-67, col. 36 ll. 16-64).

Regarding claim 16, Ferri teaches to update the knowledge data responsive to an update of the audio scene graph (col. 10 ll. 22-33, col. 36 ll. 40-64).

Regarding claim 17, Ferri teaches a microphone configured to generate the audio data (Fig. 11 item 1120, col. 5 ll. 7-26, col. 21 ll. 51-55, col. 29 ll. 46-49, col. 30 ll. 37-40, col. 41 ll. 54-56, col. 54 ll. 34-40).

Regarding claim 18, Ferri teaches a method comprising: obtaining, at a first device, a first audio embedding (Fig. 1A item 162a) of a first audio segment of audio data (Fig. 1A item 113a), the first audio segment corresponding to a first audio event of audio events (Fig. 1A item 112a) (col. 4 ll. 64-col. 5 ll. 26); obtaining, at the first device, a first text embedding (text of natural language processing identifying event) of a first tag (event profile) assigned to the first audio segment (col. 10 ll. 1-14, col. 13 ll. 2-12); obtaining, at the first device, a first event representation of the first audio event, the first event representation based on a combination of the first audio embedding and the first text embedding (col. 5 ll. 26-col. 7 ll. 67, col. 12 ll. 48-56 different types of sound events); obtaining, at the first device, a second event representation of a second audio event of the audio events (event detection for any of Fig. 1A items 112b, 112c); determining, based on knowledge data (Fig. 1A items 155, 157, 172), relations between the audio events (col. 5 ll. 26-col. 7 ll. 67 relations in time); constructing, at the first device, an audio scene graph based on the audio events (Fig. 6 item 620, col. 11 ll. 19-51, col. 32 ll. 49-59, col. 36 ll. 52-64); and providing a representation of the audio scene graph to a second device (col. 36 ll. 40-64) (col. 4 ll. 48-col. 44 ll. 45 for complete details).

Ferri does not specifically teach constructing an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event. However, in the similar field, Maas teaches to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event (col. 14 ll. 18-col. 16 ll. 20, col. 22 ll. 29-col. 23 ll. 48). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event as taught by Maas so that “The highest recognition score path, where the recognition score is a combination of the acoustic model score, the language model score, and/or other factors, may be returned by the speech recognition engine 258 as the ASR result for the associated feature vectors” (Maas, col. 15 ll. 28-32).

Regarding claim 20, Ferri teaches a non-transitory computer-readable medium (Fig. 11 items 1106, 1108, Fig. 12 items 1206, 1208) storing instructions that, when executed by one or more processors (Fig. 11 item 1104, Fig. 12 item 1204), cause the one or more processors to (col. 52 ll. 24-col. 54 ll. 7): obtain a first audio embedding (Fig. 1A item 162a) of a first audio segment of audio data (Fig. 1A item 113a), the first audio segment corresponding to a first audio event of audio events (Fig. 1A item 112a) (col. 4 ll. 64-col. 5 ll. 26); obtain a first text embedding (text of natural language processing identifying event) of a first tag (event profile) assigned to the first audio segment (col. 10 ll. 1-14, col. 13 ll. 2-12); obtain a first event representation of the first audio event, the first event representation based on a combination of the first audio embedding and the first text embedding (col. 5 ll. 26-col. 7 ll. 67, col. 12 ll. 48-56 different types of sound events); obtain a second event representation of a second audio event of the audio events (event detection for any of Fig. 1A items 112b, 112c); determine, based on knowledge data (Fig. 1A items 155, 157, 172), relations between the audio events (col. 5 ll. 26-col. 7 ll. 67 relations in time); and construct an audio scene graph based on the audio events (Fig. 6 item 620, col. 11 ll. 19-51, col. 32 ll. 49-59, col. 36 ll. 52-64) (col. 4 ll. 48-col. 44 ll. 45 for complete details).

Ferri does not specifically teach to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event. However, in the similar field, Maas teaches to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event (col. 14 ll. 18-col. 16 ll. 20, col. 22 ll. 29-col. 23 ll. 48). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event as taught by Maas so that “The highest recognition score path, where the recognition score is a combination of the acoustic model score, the language model score, and/or other factors, may be returned by the speech recognition engine 258 as the ASR result for the associated feature vectors” (Maas, col. 15 ll. 28-32).

Claims 3-6, 10, 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ferri and Maas as applied to claims 1, 18 above, and further in view of Wang (US Patent Application Publication No. 2021/0397980).

Regarding claim 3, Ferri and Maas do not teach to obtain edge weights assigned to the audio scene graph based on a similarity metric and the relations between the audio events. However, in the similar field, Wang teaches to obtain edge weights assigned to the scene graph based on a similarity metric and the relations between the events (occurrence of distinct word) (Paragraphs 0009, 0087-0098, 0101-0103, 0131). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri and Maas to obtain edge weights assigned to the scene graph based on a similarity metric and the relations between the events as taught by Wang in order “to improve the accuracy of information recommendation” (Wang, Paragraph 0094).

Regarding claim 4, Ferri and Maas do not teach that the processors are configured to, based on a determination that the knowledge data indicates at least a first relation between the first audio event and the second audio event, assign a first edge weight to a first edge between the first node and the second node, wherein the first edge weight is based on a first similarity metric associated with the first event representation and the second event representation.
However, in the similar field, Wang teaches to, based on a determination that the knowledge data indicates at least a first relation between the first event (first word “apple” detected) and the second event (second word “pear” detected), assign a first edge weight to a first edge between the first node and the second node, wherein the first edge weight is based on a first similarity metric associated with the first event representation and the second event representation (first edge weight based on the relationship of being produced in the similar region) (Paragraphs 0099-0103). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri and Maas to, based on a determination that the knowledge data indicates at least a first relation between the first event and the second event, assign a first edge weight to a first edge between the first node and the second node, wherein the first edge weight is based on a first similarity metric associated with the first event representation and the second event representation as taught by Wang so that “the magnitude of the weight assigned to the edge between two nodes may be proportional to the magnitude of the degree of association between the two entity words being targets of the two nodes” (Wang, Paragraph 0102).

Regarding claim 5, Ferri teaches to determine the first similarity metric based on a cosine similarity between the first event representation and the second event representation (col. 5 ll. 27-52, col. 14 ll. 14-32, col. 33 ll. 10-25, col. 34 ll. 1-34, col. 39 ll. 29-51).

Regarding claim 6, Wang teaches to, based on determining that the knowledge data indicates multiple relations between the first event (first word “apple” detected) and the second event (second word “pear” detected), determine the first edge weight further based on relation similarity metrics of the multiple relations (both being related as fruit and also as produced in the similar region) (Paragraphs 0099-0103).

Regarding claim 10, Ferri teaches to update the first similarity metric responsive to an update of the audio scene graph (col. 35 ll. 34-col. 36 ll. 64).

Regarding claim 19, Ferri and Maas do not teach, based on determining that the knowledge data indicates at least a first relation between the first audio event and the second audio event, determining a first edge weight based on a first similarity metric associated with the first event representation and the second event representation, wherein the first edge weight is assigned to a first edge between the first node and the second node. However, in the similar field, Wang teaches, based on determining that the knowledge data indicates at least a first relation between the first event (first word “apple” detected) and the second event (second word “pear” detected), determining a first edge weight based on a first similarity metric associated with the first event representation and the second event representation, wherein the first edge weight is assigned to a first edge between the first node and the second node (first edge weight based on the relationship of being produced in the similar region) (Paragraphs 0099-0103). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri and Maas to include, based on determining that the knowledge data indicates at least a first relation between the first event and the second event, determining a first edge weight based on a first similarity metric associated with the first event representation and the second event representation, wherein the first edge weight is assigned to a first edge between the first node and the second node as taught by Wang so that “the magnitude of the weight assigned to the edge between two nodes may be proportional to the magnitude of the degree of association between the two entity words being targets of the two nodes” (Wang, Paragraph 0102).

Claims 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Ferri and Maas as applied to claim 1 above, and further in view of Shi (US Patent No. 11,928,145).

Regarding claim 14, Ferri and Maas do not teach to detect visual relations in video data, the video data associated with the audio data; and update the audio scene graph based on the visual relations. However, in the similar field, Shi teaches to detect visual relations in video data, the video data associated with the audio data, and update the audio scene graph based on the visual relations (col. 7 ll. 28-col. 9 ll. 15, col. 10 ll. 40-col. 11 ll. 40). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri and Maas to detect visual relations in video data, the video data associated with the audio data, and update the audio scene graph based on the visual relations as taught by Shi in order to store “the extracted audio information and video information with a timestamp corresponding to its occurrence in the video” (Shi, col. 7 ll. 41-43).

Regarding claim 15, Ferri teaches a camera configured to generate the video data (Fig. 11 item 1118, col. 53 ll. 42).

Claims 1-2, 11-13, 16-18, 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ferri (US Patent No. 12,374,353), and further in view of Allam (US Patent Application Publication No. 2015/0348569).

Regarding claim 1, Ferri teaches a device (Figs. 11-12) comprising: a memory (Fig. 11 items 1106, 1108, Fig. 12 items 1206, 1208) configured to store knowledge data (Fig. 1A items 155, 157, 172); and one or more processors coupled to the memory (Fig. 11 item 1104, Fig. 12 item 1204) (col. 52 ll. 24-col. 54 ll. 7) and configured to: obtain a first audio embedding (Fig. 1A item 162a) of a first audio segment of audio data (Fig. 1A item 113a), the first audio segment corresponding to a first audio event of audio events (Fig. 1A item 112a) (col. 4 ll. 64-col. 5 ll. 26); obtain a first text embedding (text of natural language processing identifying event) of a first tag (event profile) assigned to the first audio segment (col. 10 ll. 1-14, col. 13 ll. 2-12); obtain a first event representation of the first audio event, the first event representation based on a combination of the first audio embedding and the first text embedding (col. 5 ll. 26-col. 7 ll. 67, col. 12 ll. 48-56 different types of sound events); obtain a second event representation of a second audio event of the audio events (event detection for any of Fig. 1A items 112b, 112c); determine, based on the knowledge data (Fig. 1A items 155, 157, 172), relations between the audio events (col. 5 ll. 26-col. 7 ll. 67 relations in time); and construct an audio scene graph based on the audio events (Fig. 6 item 620, col. 11 ll. 19-51, col. 32 ll. 49-59, col. 36 ll. 52-64) (col. 4 ll. 48-col. 44 ll. 45 for complete details).

Ferri does not specifically teach to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event. However, in the similar field, Allam teaches to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event (Paragraphs 0038-0044, 0055-0067). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event as taught by Allam so that “the contents (semantics, meaning) of the nodes in the speech graph are used to further augment the speech graph, in order to form a hybrid graph of both semantic and non-semantic information” (Allam, Paragraph 0055).

Regarding claim 2, Ferri teaches to obtain audio segments of audio data identified as corresponding to the audio events, the audio segments including the first audio segment (Fig. 1A item 113a) and a second audio segment (Fig. 1A item 113b), wherein the second audio segment corresponds to the second audio event (Fig. 1A item 112b); and obtain tags (event profiles) assigned to the audio segments, a tag of a particular audio segment describing a corresponding audio event, wherein the tags include the first tag (col. 4 ll. 64-col. 5 ll. 26, col. 10 ll. 1-14).

Regarding claim 11, Ferri teaches to encode the audio scene graph to generate an encoded graph, and use the encoded graph to perform one or more downstream tasks (col. 32 ll. 48-59, col. 34 ll. 1-51 encoded graph used for mapping).

Regarding claim 12, Ferri teaches to update the audio scene graph based on user input, video data, or both (col. 36 ll. 40-64).

Regarding claim 13, Ferri teaches to generate a graphical user interface (GUI) including a representation of the audio scene graph; provide the GUI to a display device; receive a user input; and update the audio scene graph based on the user input (Fig. 6, col. 32 ll. 48-59, col. 33 ll. 55-67, col. 36 ll. 16-64).

Regarding claim 16, Ferri teaches to update the knowledge data responsive to an update of the audio scene graph (col. 10 ll. 22-33, col. 36 ll. 40-64).

Regarding claim 17, Ferri teaches a microphone configured to generate the audio data (Fig. 11 item 1120, col. 5 ll. 7-26, col. 21 ll. 51-55, col. 29 ll. 46-49, col. 30 ll. 37-40, col. 41 ll. 54-56, col. 54 ll. 34-40).

Regarding claim 18, Ferri teaches a method comprising: obtaining, at a first device, a first audio embedding (Fig. 1A item 162a) of a first audio segment of audio data (Fig. 1A item 113a), the first audio segment corresponding to a first audio event of audio events (Fig. 1A item 112a) (col. 4 ll. 64-col. 5 ll. 26); obtaining, at the first device, a first text embedding (text of natural language processing identifying event) of a first tag (event profile) assigned to the first audio segment (col. 10 ll. 1-14, col. 13 ll. 2-12); obtaining, at the first device, a first event representation of the first audio event, the first event representation based on a combination of the first audio embedding and the first text embedding (col. 5 ll. 26-col. 7 ll. 67, col. 12 ll. 48-56 different types of sound events); obtaining, at the first device, a second event representation of a second audio event of the audio events (event detection for any of Fig. 1A items 112b, 112c); determining, based on knowledge data (Fig. 1A items 155, 157, 172), relations between the audio events (col. 5 ll. 26-col. 7 ll. 67 relations in time); constructing, at the first device, an audio scene graph based on the audio events (Fig. 6 item 620, col. 11 ll. 19-51, col. 32 ll. 49-59, col. 36 ll. 52-64); and providing a representation of the audio scene graph to a second device (col. 36 ll. 40-64) (col. 4 ll. 48-col. 44 ll. 45 for complete details).

Ferri does not specifically teach constructing an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event. However, in the similar field, Allam teaches to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event (Paragraphs 0038-0044, 0055-0067). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event as taught by Allam so that “the contents (semantics, meaning) of the nodes in the speech graph are used to further augment the speech graph, in order to form a hybrid graph of both semantic and non-semantic information” (Allam, Paragraph 0055).

Regarding claim 20, Ferri teaches a non-transitory computer-readable medium (Fig. 11 items 1106, 1108, Fig. 12 items 1206, 1208) storing instructions that, when executed by one or more processors (Fig. 11 item 1104, Fig. 12 item 1204), cause the one or more processors to (col. 52 ll. 24-col. 54 ll. 7): obtain a first audio embedding (Fig. 1A item 162a) of a first audio segment of audio data (Fig. 1A item 113a), the first audio segment corresponding to a first audio event of audio events (Fig. 1A item 112a) (col. 4 ll. 64-col. 5 ll. 26); obtain a first text embedding (text of natural language processing identifying event) of a first tag (event profile) assigned to the first audio segment (col. 10 ll. 1-14, col. 13 ll. 2-12); obtain a first event representation of the first audio event, the first event representation based on a combination of the first audio embedding and the first text embedding (col. 5 ll. 26-col. 7 ll. 67, col. 12 ll. 48-56 different types of sound events); obtain a second event representation of a second audio event of the audio events (event detection for any of Fig. 1A items 112b, 112c); determine, based on knowledge data (Fig. 1A items 155, 157, 172), relations between the audio events (col. 5 ll. 26-col. 7 ll. 67 relations in time); and construct an audio scene graph based on the audio events (Fig. 6 item 620, col. 11 ll. 19-51, col. 32 ll. 49-59, col. 36 ll. 52-64) (col. 4 ll. 48-col. 44 ll. 45 for complete details).
Ferri does not specifically teach to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event. However, in the similar field, Allam teaches to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event (Paragraphs 0038-0044, 0055-0067). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri to construct an audio scene graph based on a temporal order of the audio events, the audio scene graph constructed to include a first node corresponding to the first audio event and a second node corresponding to the second audio event as taught by Allam so that “the contents (semantics, meaning) of the nodes in the speech graph are used to further augment the speech graph, in order to form a hybrid graph of both semantic and non-semantic information” (Allam, Paragraph 0055).

Claims 3-6, 10, 19 are rejected under 35 U.S.C. 103 as being unpatentable over Ferri and Allam as applied to claims 1, 18 above, and further in view of Wang (US Patent Application Publication No. 2021/0397980).

Regarding claim 3, Ferri and Allam do not teach to obtain edge weights assigned to the audio scene graph based on a similarity metric and the relations between the audio events. However, in the similar field, Wang teaches to obtain edge weights assigned to the scene graph based on a similarity metric and the relations between the events (occurrence of distinct word) (Paragraphs 0009, 0087-0098, 0101-0103, 0131). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri and Allam to obtain edge weights assigned to the scene graph based on a similarity metric and the relations between the events as taught by Wang in order “to improve the accuracy of information recommendation” (Wang, Paragraph 0094).

Regarding claim 4, Ferri and Allam do not teach that the processors are configured to, based on a determination that the knowledge data indicates at least a first relation between the first audio event and the second audio event, assign a first edge weight to a first edge between the first node and the second node, wherein the first edge weight is based on a first similarity metric associated with the first event representation and the second event representation. However, in the similar field, Wang teaches to, based on a determination that the knowledge data indicates at least a first relation between the first event (first word “apple” detected) and the second event (second word “pear” detected), assign a first edge weight to a first edge between the first node and the second node, wherein the first edge weight is based on a first similarity metric associated with the first event representation and the second event representation (first edge weight based on the relationship of being produced in the similar region) (Paragraphs 0099-0103). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri and Allam to, based on a determination that the knowledge data indicates at least a first relation between the first event and the second event, assign a first edge weight to a first edge between the first node and the second node, wherein the first edge weight is based on a first similarity metric associated with the first event representation and the second event representation as taught by Wang so that “the magnitude of the weight assigned to the edge between two nodes may be proportional to the magnitude of the degree of association between the two entity words being targets of the two nodes” (Wang, Paragraph 0102).

Regarding claim 5, Ferri teaches to determine the first similarity metric based on a cosine similarity between the first event representation and the second event representation (col. 5 ll. 27-52, col. 14 ll. 14-32, col. 33 ll. 10-25, col. 34 ll. 1-34, col. 39 ll. 29-51).

Regarding claim 6, Wang teaches to, based on determining that the knowledge data indicates multiple relations between the first event (first word “apple” detected) and the second event (second word “pear” detected), determine the first edge weight further based on relation similarity metrics of the multiple relations (both being related as fruit and also as produced in the similar region) (Paragraphs 0099-0103).

Regarding claim 10, Ferri teaches to update the first similarity metric responsive to an update of the audio scene graph (col. 35 ll. 34-col. 36 ll. 64).

Regarding claim 19, Ferri and Allam do not teach, based on determining that the knowledge data indicates at least a first relation between the first audio event and the second audio event, determining a first edge weight based on a first similarity metric associated with the first event representation and the second event representation, wherein the first edge weight is assigned to a first edge between the first node and the second node. However, in the similar field, Wang teaches, based on determining that the knowledge data indicates at least a first relation between the first event (first word “apple” detected) and the second event (second word “pear” detected), determining a first edge weight based on a first similarity metric associated with the first event representation and the second event representation, wherein the first edge weight is assigned to a first edge between the first node and the second node (first edge weight based on the relationship of being produced in the similar region) (Paragraphs 0099-0103). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri and Allam to include, based on determining that the knowledge data indicates at least a first relation between the first event and the second event, determining a first edge weight based on a first similarity metric associated with the first event representation and the second event representation, wherein the first edge weight is assigned to a first edge between the first node and the second node as taught by Wang so that “the magnitude of the weight assigned to the edge between two nodes may be proportional to the magnitude of the degree of association between the two entity words being targets of the two nodes” (Wang, Paragraph 0102).

Claims 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over Ferri and Allam as applied to claim 1 above, and further in view of Shi (US Patent No. 11,928,145).

Regarding claim 14, Ferri and Allam do not teach to detect visual relations in video data, the video data associated with the audio data; and update the audio scene graph based on the visual relations. However, in the similar field, Shi teaches to detect visual relations in video data, the video data associated with the audio data, and update the audio scene graph based on the visual relations (col. 7 ll. 28-col. 9 ll. 15, col. 10 ll. 40-col. 11 ll. 40). It would have been obvious to a person of ordinary skill in the art before the effective filing date of the present invention to modify Ferri and Allam to detect visual relations in video data, the video data associated with the audio data, and update the audio scene graph based on the visual relations as taught by Shi in order to store “the extracted audio information and video information with a timestamp corresponding to its occurrence in the video” (Shi, col. 7 ll. 41-43).

Regarding claim 15, Ferri teaches a camera configured to generate the video data (Fig. 11 item 1118, col. 53 ll. 42).

Allowable Subject Matter

Claims 7-9 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims. The above objection(s) is (are) based on the claim(s) as presently set forth in its (their) totality. It should not be interpreted as indicating that amended claim(s) broadly reciting certain limitations would be allowable. A more detailed reason(s) for allowance may be set forth in a subsequent Notice of Allowance if and when all claims in the application are put into a condition for allowance.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HEMANT PATEL, whose telephone number is (571) 272-8620. The examiner can normally be reached M-F 8:00 AM - 4:30 PM EST.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Fan Tsang, can be reached at 571-272-7547. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HEMANT S PATEL/
Primary Examiner, Art Unit 2694
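For context on the claimed technique itself, below is a minimal, hypothetical sketch of the pipeline recited in claims 1 and 3-5: one node per audio event, an event representation formed by combining an audio embedding with a text embedding of the event's tag, relations looked up in knowledge data, and edges between temporally ordered events weighted by a similarity metric (claim 5 recites cosine similarity). All names, the concatenation combiner, and the toy knowledge table are illustrative assumptions, not taken from the application or the cited references.

```python
# Hypothetical sketch (not from the application or the cited art) of the
# claimed pipeline: temporally ordered audio events become graph nodes,
# and edges carry a knowledge-data relation plus a similarity weight.
from dataclasses import dataclass
import math

@dataclass
class AudioEvent:
    tag: str                 # tag assigned to the audio segment, e.g. "dog_bark"
    start: float             # segment start time in seconds
    audio_emb: list[float]   # embedding of the audio segment
    text_emb: list[float]    # embedding of the tag text

    @property
    def representation(self) -> list[float]:
        # One simple way to "combine" the two embeddings: concatenation.
        # (The application may combine them differently; this is illustrative.)
        return self.audio_emb + self.text_emb

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "knowledge data": known relations between tagged event types.
KNOWLEDGE = {("dog_bark", "footsteps"): "co-occurs-with"}

def build_scene_graph(events: list[AudioEvent]):
    events = sorted(events, key=lambda e: e.start)   # temporal order (claim 1)
    nodes = [e.tag for e in events]                  # one node per audio event
    edges = []
    for a, b in zip(events, events[1:]):             # consecutive events in time
        rel = KNOWLEDGE.get((a.tag, b.tag)) or KNOWLEDGE.get((b.tag, a.tag))
        if rel:
            # Edge weight from a similarity metric over the two event
            # representations (claims 4-5 recite cosine similarity).
            w = cosine(a.representation, b.representation)
            edges.append((a.tag, b.tag, rel, round(w, 3)))
    return nodes, edges

events = [
    AudioEvent("footsteps", start=2.0, audio_emb=[0.1, 0.9], text_emb=[0.2, 0.8]),
    AudioEvent("dog_bark", start=0.5, audio_emb=[0.9, 0.2], text_emb=[0.7, 0.1]),
]
print(build_scene_graph(events))
# (['dog_bark', 'footsteps'], [('dog_bark', 'footsteps', 'co-occurs-with', 0.344)])
```

The sketch sorts the two toy events by start time and emits a single weighted edge, which is the shape of graph a builder like this would hand to downstream tasks (claim 11).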

Prosecution Timeline

Jun 10, 2024: Application Filed
Feb 17, 2026: Non-Final Rejection under §103 (current)

Precedent Cases

Applications granted by this examiner involving similar technology

Patent 12598254: SYSTEMS AND METHODS RELATING TO GENERATING SIMULATED INTERACTIONS FOR TRAINING CONTACT CENTER AGENTS (granted Apr 07, 2026; 2y 5m to grant)
Patent 12592843: INFORMATION PROCESSING DEVICE AND INFORMATION PROCESSING METHOD (granted Mar 31, 2026; 2y 5m to grant)
Patent 12578920: AUDIO SYSTEM CONTROL DEVICE (granted Mar 17, 2026; 2y 5m to grant)
Patent 12573409: AUDIO ENCODER, METHOD FOR PROVIDING AN ENCODED REPRESENTATION OF AN AUDIO INFORMATION, COMPUTER PROGRAM AND ENCODED AUDIO REPRESENTATION USING IMMEDIATE PLAYOUT FRAMES (granted Mar 10, 2026; 2y 5m to grant)
Patent 12563160: MULTIUSER TELECONFERENCING WITH SPOTLIGHT FEATURE (granted Feb 24, 2026; 2y 5m to grant)

Study what changed in these cases to get past this examiner; based on the 5 most recent grants.


Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 81%
With Interview: 95% (+13.6%)
Median Time to Grant: 2y 10m
PTA Risk: Low

Based on 939 resolved cases by this examiner; grant probability is derived from the career allow rate.
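A quick check of how these headline figures appear to fit together, assuming the interview lift is simply added to the base rate (the page does not spell out its model, so treat this as a plausible reading rather than the tool's actual method):

```python
# Base grant probability from the career allow rate, plus the interview lift.
granted, resolved = 761, 939
base = 100 * granted / resolved            # 81.0% career allow rate
interview_lift = 13.6                      # percentage points

print(f"base grant probability: {base:.1f}%")           # 81.0%
print(f"with interview: {base + interview_lift:.1f}%")  # 94.6%, shown as 95%
```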
