Prosecution Insights
Last updated: April 19, 2026
Application No. 18/467,869

RENDERING INTERFACE FOR AUDIO DATA IN EXTENDED REALITY SYSTEMS

Status: Non-Final OA (§102)
Filed: Sep 15, 2023
Examiner: KRZYSTAN, ALEXANDER J
Art Unit: 2694
Tech Center: 2600 — Communications
Assignee: Qualcomm Incorporated
OA Round: 3 (Non-Final)
Grant Probability: 81% (Favorable)
Expected OA Rounds: 3-4
Time to Grant: 3y 1m
With Interview: 88%

Examiner Intelligence

Career Allow Rate: 81%, above average (913 granted / 1121 resolved; +19.4% vs TC avg)
Interview Lift: +6.9% among resolved cases with interview (a moderate, roughly +7% lift)
Typical Timeline: 3y 1m average prosecution; 38 applications currently pending
Career History: 1159 total applications across all art units
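
The headline projections are simple arithmetic on these statistics. Here is a minimal sketch in Python of how the 88% with-interview figure appears to be derived, assuming the interview lift is additive in percentage points (an assumption; the tool does not state its formula):

    # Career allow rate: 913 grants out of 1121 resolved cases.
    granted, resolved = 913, 1121
    career_allow_rate = granted / resolved        # 0.8145... -> the 81% shown
    interview_lift = 0.069                        # +6.9 points (assumed additive)
    with_interview = career_allow_rate + interview_lift
    print(f"base {career_allow_rate:.0%}, with interview {with_interview:.0%}")
    # base 81%, with interview 88%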

Statute-Specific Performance

§101: 2.7% (-37.3% vs TC avg)
§103: 37.1% (-2.9% vs TC avg)
§102: 24.3% (-15.7% vs TC avg)
§112: 21.0% (-19.0% vs TC avg)
Tech Center averages are estimates. Based on career data from 1121 resolved cases.

Office Action

§102
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner's Comments

Based upon the most recently submitted amendment and remarks, the examiner notes that "audio element" as recited in the claims is not recited as a single signal at a particular point, but rather as a general description of the data at multiple points in the processing chain used to create the signals that render the element to the user. The same conventions are used when describing the prior art in mapping to applicant's claims. Based on applicant's remarks and amendment, the §112 rejection of claim 28 has been withdrawn.

Claim Rejections - 35 U.S.C. § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless – (a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-30 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Lai et al. (US 20230316594 A1).

As per claim 1, Lai discloses a device configured to process a bitstream, the device comprising: a memory configured to store the bitstream representative of at least one audio element in an extended reality scene (memory is required in the device of Figs. 2a, 2b, and 3a to store and process the bitstream from 543 in Fig. 5a per the elements/objects of para. 107, in an XR system per para. 7), and audio descriptive information (attributes and relationships between objects per para. 107, and the movement information, contextual awareness, and/or user commands per para. 56) associated with the at least one audio element, wherein the audio descriptive information identifies a pose of the at least one audio element in terms of one or more of a position and orientation of the at least one audio element within the extended reality scene (rendered virtual objects 240 and content per para. 62 are rendered within the extended reality scene per para. 58, where any information associated with the object/content and used to render it is audio descriptive information at a particular point in the field of view of the extended reality scene; as additionally noted in para. 62, the extended reality content may include video, audio, haptic feedback, or some combination thereof, any of which may be presented in a single channel or in multiple channels, such as stereo video that produces a three-dimensional (3D) effect to the viewer, where a 3D effect is based on a designated position within a 3D coordinate system; further noting the content may comprise audio per para. 62); and processing circuitry (the devices in Figs. 2, 3, and 5 require at least one processor in order to implement the cited functions) coupled to the memory and configured to execute a scene manager (para. 109: the rule-based systems 562, algorithms 565, and/or models 567 of the artificial intelligence platform 560 may be configured for any known scene graph generation (SGG) techniques) and an audio unit (the portions that perform the rendering per para. 113: rendering of virtual content 543 by the client system, including virtual assistant application 505 and I/O interfaces 545), wherein the scene manager is configured to: construct, based on the at least one audio element, a scene graph (para. 108) that includes at least one node that represents the at least one audio element/object (the nodes per para. 108); and modify, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information (para. 56: sensed data is received from extended reality system 205 and sensors 215, such as movement information, contextual awareness, and/or user commands, and, in some examples, data from any external sensors, such as third-party information or devices, to capture information within the real-world, physical environment, such as motion by user 220 and/or feature-tracking information with respect to user 220; based on the sensed data, the extended reality application determines interaction information to be presented for the frame of reference of extended reality system 205 and, in accordance with the current context of user 220, renders the extended reality content 225), and wherein the audio unit is configured to: render, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds (speakers per para. 67, and rendering per para. 113).

As per claim 2, the device of claim 1, wherein the scene manager is further configured to obtain at least one visual element (the objects with the video response per para. 46), and wherein the scene manager is configured to construct, based on the at least one audio element and the at least one visual element, the scene graph that includes a parent node representative of the at least one visual element (the node for an object with a video presence), and a child node that depends from the parent node and that represents the at least one audio element (the system is an XR system, where an XR system has objects with video, audio, haptic feedback, or some combination thereof per para. 4, and where the combination of audio and video for one object are interdependent nodes in the scene graph).

As per claim 3, the device of claim 2, wherein the scene manager is configured to align the at least one visual element and the at least one audio element when constructing the scene graph (para. 108: the context graph 570 is a structured representation of the data (e.g., an image); since it is structured, all the audio and video portions of the objects in the nodes must be aligned).

As per claim 4, the device of claim 1, wherein the scene manager is further configured to update the scene graph to add, remove, or edit the at least one node that represents the at least one audio element (the audio and video objects are implemented via nodes and change per the updating described in the claim 1 rejection, where the nodes must be added, removed, or edited in order to change the objects as such).

As per claim 5, the device of claim 2, wherein the scene manager is further configured to map, based on visual descriptive information associated with the at least one visual element and the audio descriptive information associated with the at least one audio element, the at least one visual element to the at least one audio element (para. 118: mapping the interactions, where the interactions include audio and video and their respective descriptive information per the example at the bottom of para. 46).
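
To make the arrangement mapped in the claim 1 and claim 2 rejections concrete, here is a minimal sketch of a scene graph whose parent node is a visual element and whose child node is the dependent audio element carrying pose metadata. All class and field names are hypothetical and come from neither the application nor Lai:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Pose:
        position: Tuple[float, float, float]      # (x, y, z) within the XR scene
        orientation: Tuple[float, float, float]   # (yaw, pitch, roll), degrees

    @dataclass
    class Node:
        element_id: str                           # unique identifier (cf. claims 11-12)
        pose: Pose                                # pose carried as audio descriptive information
        children: List["Node"] = field(default_factory=list)

    # Parent node: visual element; child node: the audio element that depends from it.
    visual = Node("chair_visual", Pose((1.0, 0.0, 2.0), (0.0, 0.0, 0.0)))
    audio = Node("chair_creak_audio", Pose((1.0, 0.0, 2.0), (0.0, 0.0, 0.0)))
    visual.children.append(audio)                 # cf. the parent/child structure of claim 2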
As per claim 6, the device of claim 5, wherein the visual descriptive information includes a position of the at least one visual element in the extended reality scene, wherein the audio descriptive information includes a position of the at least one audio element in the extended reality scene, and wherein the scene manager is configured to modify, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene (para. 46: e.g., rendering virtual content/audio element overlaid on a real-world object/visual element within the display; the presented responses may be based on different modalities such as audio/audio element, text, image, and video; noting that when the user moves, the overlaid audio will move because the real-world object will have moved as well).

As per claim 7, the device of claim 6, wherein the modified position of the at least one audio element differs from the position of the at least one audio element (para. 56: during this process, the extended reality application uses sensed data received from extended reality system 205 and sensors 215, such as movement information, contextual awareness, and/or user commands, and, in some examples, data from any external sensors, such as third-party information or devices, to capture information within the real-world, physical environment, such as motion by user 220 and/or feature-tracking information with respect to user 220; based on the sensed data, the extended reality application determines interaction information to be presented for the frame of reference of extended reality system 205 and, in accordance with the current context of user 220, renders the extended reality content 225) (where movement changes the frame of reference, which changes the virtual position of the element).

As per claim 8, the device of claim 6, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle (per the claim 5 rejection, any movement of an object will by definition produce a new position different from its previous position in space by an amount which is representable in the form of a rotational angle).

As per claim 9, the device of claim 6, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance (per the claim 5 rejection, any movement of an object will by definition produce a new position different from its previous position in space by an amount which is a translational distance).
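
Claims 6-9 concern updating the audio element's position from the mapped visual element's movement, expressed as a translational distance and a rotational angle. Continuing the sketch above (the inherit-the-parent's-motion rule is an illustrative assumption, not a teaching of the record):

    def modify_audio_pose(audio: Node,
                          translation: Tuple[float, float, float],
                          yaw_delta: float) -> None:
        """Apply the mapped visual element's movement to the audio element:
        a translational distance (claim 9) and a rotational angle (claim 8)."""
        x, y, z = audio.pose.position
        dx, dy, dz = translation
        yaw, pitch, roll = audio.pose.orientation
        audio.pose.position = (x + dx, y + dy, z + dz)
        audio.pose.orientation = (yaw + yaw_delta, pitch, roll)

    # The chair moves 0.5 m along x and turns 15 degrees; its creak sound follows.
    modify_audio_pose(audio, (0.5, 0.0, 0.0), 15.0)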
As per claim 10, the device of claim 5, wherein the at least one audio element includes a first audio element and a second audio element, wherein the scene manager is further configured to: map, based on visual descriptive information associated with the at least one visual element and audio descriptive information associated with the first audio element, the at least one visual element to the first audio element (the audio and video that are mapped together per the claim 5 rejection); determine that none of the at least one visual element maps to the second audio element (an object with just an audio interaction being detected by the processor); and render, based on the audio descriptive information associated with the second audio element, the second audio element to the one or more speaker feeds (the rendering described in the claim 1 rejection, applied to the audio-only object/interaction).

As per claim 11, the device of claim 5, wherein the visual descriptive information includes an identifier that uniquely identifies the at least one visual element, wherein the audio descriptive information includes an identifier that uniquely identifies the at least one audio element, and wherein the scene manager is configured to map, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element (the elements mapped per the example at the bottom of para. 46, where each element requires a respective unique identifier in order to be recognized by the processor and processed as described).

As per claim 12, the device of claim 11, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name (same rationale as the claim 11 rejection).

As per claim 13, the device of claim 5, further comprising one or more speakers (para. 67) configured to reproduce, based on the one or more speaker feeds, a soundfield (para. 62: such as stereo video that produces a three-dimensional (3D) effect to the viewer).

As per claim 14, the device of claim 5, wherein the scene manager is further configured to output the modified audio descriptive information to the audio unit (para. 56, as quoted in the claim 1 rejection: the portion of the system that reads in and responds to the updated movements and renders accordingly).
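
Claims 10-12 turn on mapping by unique identifier, with any audio element that matches no visual element rendered from its own metadata. Continuing the sketch, with an invented _visual/_audio suffix convention standing in for the identifiers:

    def map_elements(visuals: List[Node], audios: List[Node]):
        """Pair each audio element with the visual element sharing its base
        identifier; unmatched audio elements are rendered standalone."""
        by_base = {v.element_id.removesuffix("_visual"): v for v in visuals}
        mapped, standalone = [], []
        for a in audios:
            base = a.element_id.removesuffix("_audio")
            if base in by_base:
                mapped.append((by_base[base], a))  # mapped pair (cf. claims 5, 11)
            else:
                standalone.append(a)               # claim 10's second audio element
        return mapped, standalone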
As per claim 15, the device of claim 5, wherein the scene manager is further configured to output, via an application programming interface exposed by the audio unit, the modified audio descriptive information to the audio unit (the system comprises an API per para. 44, which is used between software functions such as those processing the modified audio descriptive information and the rendering function).

As per claim 16, the device of claim 1, wherein the bitstream is transmitted according to one or more of a wireless network protocol (para. 41, the wireless WAN), a personal area network protocol, and a cellular network protocol.

As per claim 17, the claim 1 rejection discloses a method comprising: obtaining a bitstream representative of at least one audio element in an extended reality scene (claim 1 rejection, bitstream) and audio descriptive information associated with the at least one audio element (claim 1 rejection), wherein the audio descriptive information identifies a pose of the at least one audio element in terms of one or more of a position and orientation of the at least one audio element within the extended reality scene (per the same citations to paras. 58 and 62 given in the claim 1 rejection); constructing, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element (claim 1 rejection); modifying, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information (claim 1 rejection); rendering, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds (claim 1 rejection); and outputting the one or more speaker feeds (claim 1 rejection).

As per claim 18, the method of claim 17, further comprising obtaining at least one visual element, wherein constructing the scene graph includes constructing, based on the at least one audio element and the at least one visual element, the scene graph that includes a parent node representative of the at least one visual element, and a child node that depends from the parent node and that represents the at least one audio element (per the claim 2 rejection).

As per claim 19, the method of claim 18, wherein constructing the scene graph includes aligning the at least one visual element and the at least one audio element (per the claim 3 rejection).

As per claim 20, the method of claim 18, further comprising mapping, based on visual descriptive information associated with the at least one visual element and the audio descriptive information associated with the at least one audio element, the at least one visual element to the at least one audio element (per the claim 5 rejection).
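
Claim 15 divides the roles across an interface: the scene manager hands modified audio descriptive information to the audio unit through an API the audio unit exposes, and the audio unit renders and outputs speaker feeds. A sketch of that division, again with hypothetical names:

    class AudioUnit:
        def submit(self, audio: Node) -> None:
            # The API the audio unit exposes to the scene manager (cf. claim 15).
            self.output(self.render(audio))

        def render(self, audio: Node) -> List[float]:
            # Placeholder: a real renderer spatializes per the speaker layout.
            return [0.0, 0.0]                      # stereo speaker feeds

        def output(self, feeds: List[float]) -> None:
            print("speaker feeds:", feeds)

    class SceneManager:
        def __init__(self, audio_unit: AudioUnit):
            self.audio_unit = audio_unit

        def on_graph_update(self, audio: Node) -> None:
            modify_audio_pose(audio, (0.0, 0.0, 0.0), 0.0)  # modified descriptive info
            self.audio_unit.submit(audio)                   # handed off via the API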
As per claim 21, the method of claim 20, wherein the visual descriptive information includes a position of the at least one visual element in the extended reality scene, wherein the audio descriptive information includes a position of the at least one audio element in the extended reality scene, and wherein modifying the audio descriptive information comprises modifying, based on the position of the at least one visual element, the position of the at least one audio element to obtain a modified position of the at least one audio element in the extended reality scene (per the claim 6 rejection).

As per claim 22, the method of claim 21, wherein the modified position of the at least one audio element differs from the position of the at least one audio element (per the claim 7 rejection).

As per claim 23, the method of claim 21, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a rotational angle (per the claim 8 rejection).

As per claim 24, the method of claim 21, wherein the modified position of the at least one audio element differs from the position of the at least one audio element in terms of a translational distance (per the claim 9 rejection).

As per claim 25, the method of claim 20, wherein the at least one audio element includes a first audio element and a second audio element, and wherein the method further comprises: mapping, based on visual descriptive information associated with the at least one visual element and audio descriptive information associated with the first audio element, the at least one visual element to the first audio element; determining that none of the at least one visual element maps to the second audio element; and rendering, based on the audio descriptive information associated with the second audio element, the second audio element to the one or more speaker feeds (per the claim 10 rejection).

As per claim 26, the method of claim 20, wherein the visual descriptive information includes an identifier that uniquely identifies the at least one visual element, wherein the audio descriptive information includes an identifier that uniquely identifies the at least one audio element, and wherein mapping the at least one visual element to the at least one audio element comprises mapping, based on the identifier that uniquely identifies the at least one visual element and the identifier that uniquely identifies the at least one audio element, the at least one visual element to the at least one audio element (per the claim 11 rejection).

As per claim 27, the method of claim 26, wherein the identifier that uniquely identifies the at least one visual element includes one or more of a visual element identifier and a visual element name, and wherein the identifier that uniquely identifies the at least one audio element includes one or more of an audio element identifier and an audio element name (per the claim 12 rejection).

As per claim 28, the method of claim 20, further comprising outputting, via an application programming interface exposed by an audio unit, the modified audio metadata to the audio unit (per the claim 15 rejection).

As per claim 29, the method of claim 20, wherein the bitstream is transmitted according to one or more of a wireless network protocol, a personal area network protocol, and a cellular network protocol (per the claim 16 rejection).
As per claim 30, the system of the claim 1 rejection requires a non-transitory computer-readable medium having stored thereon instructions (in order to implement the cited functions) that, when executed, cause processing circuitry to: obtain a bitstream representative of at least one audio element in an extended reality scene, and audio descriptive information associated with the at least one audio element, wherein the audio descriptive information identifies a pose of the at least one audio element in terms of one or more of a position and orientation of the at least one audio element within the extended reality scene (per the same citations to paras. 58 and 62 given in the claim 1 rejection); construct, based on the at least one audio element, a scene graph that includes at least one node that represents the at least one audio element; modify, based on the scene graph, the audio descriptive information to obtain modified audio descriptive information; render, based on the modified audio descriptive information, the at least one audio element to one or more speaker feeds; and output the one or more speaker feeds (per the claim 1 and 17 rejections).

Response to Arguments

Applicant's arguments have been fully considered, but they are moot in view of the new grounds of rejection.

Previous responses to previous arguments: As per applicant's arguments that the prior art does not disclose audio elements/objects or associated audio descriptive information, the examiner notes the cited objects/elements are in the context of an XR system per para. 7 as cited, which is by definition comprised of virtual audio/video-based objects. Additionally, per para. 4: the extended reality content may include digital images or animation, video, audio, haptic feedback, or some combination thereof, any of which may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). Additionally, per para. 46: virtual content overlaid on the physical environment (such as a real-world object) or audio; since the virtual content/object is associated with audio as such, it is an audio element. Further description appears in paras. 62 and 101. Accordingly, since the cited audio elements are audio elements as noted above, the cited associated audio descriptive information is in fact associated audio descriptive information.

As per applicant's argument that the cited prior art does not disclose a bitstream with an audio element, the examiner notes per para. 101: moreover, the one or more I/O interfaces 545 may include one or more wired or wireless NICs for communicating with a network, such as network 120 described with respect to FIG. 1.
Obtaining data in a passive manner means that the virtual assistant application 505 obtains data via the image capture devices, sensors, remote systems, or the like; as such, 545 is not limited to merely user interface devices. Additionally, if the data is received from remote systems, then messaging platform 550, ASR module 552, processing system 555, artificial intelligence platform 560, or a combination thereof may be used to process the remote system data and determine the objects, attributes, and/or relationships received from the remote system per para. 107. Additionally, note the communication with other users' applications noted in para. 43, where the user interface at one device sends virtual audio objects to another device to be rendered in the communication, and where any data sent across a network, from a remote system, or digitally is sent via a bitstream.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEXANDER KRZYSTAN, whose telephone number is 571-272-7498 and whose email address is alexander.krzystan@uspto.gov. The examiner can usually be reached M-F, 7:30-4:00 EST. If attempts to reach the examiner by telephone or email are unsuccessful, the examiner's supervisor, Fan Tsang, can be reached at (571) 272-7547. The fax phone numbers for the organization where this application or proceeding is assigned are 571-273-8300 for regular communications and 571-273-8300 for After Final communications.

/ALEXANDER KRZYSTAN/
Primary Examiner, Art Unit 2653

Examiner Alexander Krzystan
February 20, 2026

Prosecution Timeline

Sep 15, 2023: Application Filed
Aug 05, 2025: Non-Final Rejection — §102
Nov 03, 2025: Response Filed
Nov 17, 2025: Final Rejection — §102
Jan 22, 2026: Response after Non-Final Action
Feb 11, 2026: Request for Continued Examination
Feb 18, 2026: Response after Non-Final Action
Feb 20, 2026: Non-Final Rejection — §102 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12598440: RENDERING OF OCCLUDED AUDIO ELEMENTS (granted Apr 07, 2026; 2y 5m to grant)
Patent 12593170: SWITCHING METHOD FOR AUDIO OUTPUT CHANNEL, AND DISPLAY DEVICE (granted Mar 31, 2026; 2y 5m to grant)
Patent 12573410: DECODER, ENCODER, AND METHOD FOR INFORMED LOUDNESS ESTIMATION IN OBJECT-BASED AUDIO CODING SYSTEMS (granted Mar 10, 2026; 2y 5m to grant)
Patent 12574675: Acoustic Device and Method (granted Mar 10, 2026; 2y 5m to grant)
Patent 12541554: TRANSCRIPT AGGREGATON FOR NON-LINEAR EDITORS (granted Feb 03, 2026; 2y 5m to grant)
Study what changed in these cases to get past this examiner. Based on the 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 81%
With Interview: 88% (+6.9%)
Median Time to Grant: 3y 1m
PTA Risk: High
Based on 1121 resolved cases by this examiner. Grant probability derived from career allow rate.
