DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 8 and 17 are objected to because of the following informalities: “first in-room user of the second in-room user” appears to be a misspelling of “first in-room user or the second in-room user”. Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 as follows:
Claims 1, 10 and 19 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. At a high level, the application is about how a “far-end” user could remotely monitor a “conferencing session” comprising a plurality of “user[s]”. The “far-end” user would be able to observe the position (i.e., the “distance”) of each “user” in the “conferencing session” from an “in-room” “camera” associated with an “in-room device”, as received remotely on the far-end user’s own computing device, and could select, by “touch” (claim 8) on that device’s “display”, a specific “user” and have that “user’s” “voice” “enhanc[ed]” (claim 9).
Claims 1, 10 and 19 mainly concern determination of the “distance” of each “user” of the “conferencing session” from the “in-room” “device”, as well as the distance of each “user” from the other “users” in the “conferencing session”, as they are filmed. These distances are obtained by using “metadata” (Spec. ¶ 0054 S2: “consists of face IDs of the in-room users tagged with their locations”) and a “first number of pixels across the face of the” “in-room user” (Spec. ¶ 0083 lines 7+: “The pixel count may be determined by running a pixel count algorithm that provides distances”).
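For illustration only: a pixel-count-to-distance mapping of this kind is conventionally derived from a pinhole camera model. The following sketch is not drawn from the specification; the average face width, field of view, resolution, and names are assumptions made solely to show that the determination reduces to a short arithmetic calculation.

```python
import math

# Illustrative assumptions (not taken from the specification):
AVG_FACE_WIDTH_M = 0.15   # assumed average human face width, meters
FOV_H_DEG = 90.0          # assumed horizontal field of view, degrees
RES_H_PX = 1920           # assumed horizontal resolution, pixels

def focal_length_px(fov_deg: float, res_px: int) -> float:
    """Effective focal length in pixels, from field of view and resolution."""
    return (res_px / 2.0) / math.tan(math.radians(fov_deg) / 2.0)

def distance_from_pixel_count(face_width_px: float) -> float:
    """Estimate camera-to-user distance from the pixel width of a face."""
    f_px = focal_length_px(FOV_H_DEG, RES_H_PX)
    # Pinhole model: face_width_px / f_px = AVG_FACE_WIDTH_M / distance
    return AVG_FACE_WIDTH_M * f_px / face_width_px

# A face spanning 144 pixels under these assumptions is 1.0 m away.
print(round(distance_from_pixel_count(144.0), 2))  # -> 1.0
```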
These limitations, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of the “processing circuitry” (claim 1) and “one or more processors” (claim 10). That is, other than reciting “the processing circuitry configured to” (claim 1) or “one or more processors” (claim 10), nothing in the claim elements precludes the steps from practically being performed in the mind. For example, the step of “identify” merely relates to looking up depth information and camera information that has already been recorded. The step of “perform” relates to the mental activity of determining who is sitting where; a person makes this observation automatically in any in-person meeting. The next two steps are only calculations, without any tie to how their results are used. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application. In particular, claims 1 and 10 each recite only one additional element (“processing circuitry” (claim 1) and “one or more processors” (claim 10)). The “processing circuitry” and/or “one or more processors” are recited at a high level of generality; see Spec. ¶ 00119 S2: “One example of a processor is a state machine or an application specific integrated circuit (ASIC) that includes at least one input and at least one output”. Furthermore, the “processing circuitry” and/or “one or more processors” are taught as a generic processor for performing all the steps of “identify …”, “perform …”, and “calculate …”, such that they amount to no more than mere instructions to apply the exception using a generic computer component. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limit on practicing the abstract idea. The claims are thus directed to an abstract idea.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using “processing circuitry” and/or “one or more processors” to carry out all the claim steps amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. The claims are therefore not patent eligible.
Regarding claims 2 (11 and 20) and 3 (12), the step of “identify” merely relates to looking up depth information (e.g. distance) and camera information that has already been recorded.
Regarding claims 4 (13) and 5 (14), looking up camera information (metadata) does not require any particular machine or method, including a computer-implemented one. Likewise, using a time interval associated with a meeting is a mental process that does not require any particular machine.
Regarding claims 6 (15), looking up camera information (metadata) does inherently depend on the camera’s field of view (e.g., how much it was able to zoom into a scene) and its resolution.
Regarding claims 7 (16), looking up and analyzing a video stream is a mental process that anyone with a cell phone can perform, and engaging in this activity does not enhance the performance of that cell phone.
Regarding claims 8 (17), the “select[ion]” of an “in-room user” via “touch” and the “steer[ing]” of “a beamformer” in their direction do not connect back to the independent claim. Furthermore, they can relate to a human touching another user to select which user may speak, after which the steering of the beamformer is a mathematical operation based on the selected signal.
Regarding claims 9 (18), the claim relates to gaze-based monitoring, which can be performed mentally, and to “enhancing”, which can amount to asking the other users to stop speaking. Furthermore, it is not connected back to its independent parent claim 1 (10) as to how the steps of the parent claim (in particular the “calculat[e]” steps) relate to the monitoring here.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.
Claim(s) 1-7, 10-16, 19-20 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Bhatt et al. (US 2024/0185449).
Regarding claim 1, Bhatt et al. do teach a device, the device comprising processing circuitry coupled to storage (Abstract S1: “A video conference call system” (a device) “is provided with a camera to generate an input frame image of a conference room”; ¶ 0081 lines 11+: “The information handling system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic”),
the processing circuitry configured to:
identify metadata comprising depth sensing information and camera information received from an in-room device located at a first location having a first camera (Abstract S1: “A video” (or “camera field of view” (metadata or camera information, ¶ 0035 2nd column line 9)) “conference call system is provided with a camera” (received in an in-room device) “to generate an input frame image of a conference room, where the video conference call system detects a human head” (detecting) “for each meeting participant” (one or more in-room users); ¶ 0002 S3: “detecting the location of meeting participant in a room location” “e.g., the distance” (including depth sensing information) “and direction between the camera and meeting participant” (associated with each in-room person); ¶ 0003 column 2 lines 7+: “identify” (identifying) “for each detected human head, a head bounding box with specified image plane coordinate and dimension information in a head box data structure” (a subset of the “video” (metadata)));
perform face recognition on one or more in-room users (Abstract lines 6-8: “generates a head bounding box which surround each detected human head” (for each face) “and identifies” (perform identification) “a corresponding meeting participant” (for each user of the one or more in-room users));
calculate a distance of a first in-room user based on the metadata and a first number of pixels across the face of the first in-room user (¶ 0048 S1: “determining the position and distance” (calculating a distance) “of a meeting participant” (of a first in-room user) “in a room based on the pixel count” (based on a first number of pixels) “for the height and width” (based on the “video” pictures (metadata)) “of each human head” (across his face));
calculate a distance between the first in-room user and a second in-room user based on the metadata and the first number of pixels across the face of the first in-room user and a number of pixels across the face of the second in-room user (¶ 0078 first column last 3 lines+: “The disclosed methodology also applies the pixel width measure” (using the first number of pixels across each participant face) “and pixel height measure” “extracted from each head bounding box” (and metadata) “to one or more reverse lookup tables to extract meeting room coordinates” (determine distances of each in-room user from the in-room device because according to ¶ 0028 line 21 “coordinates [are] in relation to the camera location”) “for each meeting participant” (for, e.g., a first in-room user as well as a second in-room user, which enables calculating the distance between them; e.g., see Fig. 4 and ¶ 0031 lines 13+: “For example, a meeting participant 42” (e.g., a first in-room user) “located on the center line of sight of camera focal point” “at a distance of d=0.5 meters” and “a meeting participant 43” (a second in-room user) “located on the center line of sight of camera” “at a larger distance of d=1.0”, which readily gives the distance between “participant 42” (the first in-room user) and “participant 43” (the second in-room user) as 0.5 meters))).
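For illustration only (not a characterization of Bhatt et al.’s implementation): once camera-relative coordinates have been extracted for two participants, the distance between them follows by elementary geometry. The sketch below reproduces the Fig. 4 arithmetic under the assumption that each participant’s position is a (lateral offset, depth) pair relative to the camera; the names are hypothetical.

```python
import math

def distance_between(p1: tuple[float, float], p2: tuple[float, float]) -> float:
    """Euclidean distance between two camera-relative (x, z) positions (meters)."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

# Bhatt's Fig. 4 example: both participants on the camera's center line
# of sight, at depths of 0.5 m and 1.0 m respectively.
participant_42 = (0.0, 0.5)   # (lateral offset, depth), meters
participant_43 = (0.0, 1.0)
print(distance_between(participant_42, participant_43))  # -> 0.5
```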
Regarding claim 2, Bhatt et al. do teach the device of claim 1, wherein the depth sensing information is associated with the one or more in-room users located in a field of view of the first camera (¶ 0002 S3: “detecting the location of meeting participant in a room location” “e.g., the distance” (the depth sensing information) “and direction between the camera and meeting participant” (associated with each in-room user being located in the field of view of the “camera” (the first camera))).
Regarding claim 3, Bhatt et al. do teach the device of claim 1, wherein the depth sensing information provides distances of the one or more in-room users (¶ 0002 S3: “detecting the location of meeting participant in a room location” “e.g., the distance” (the depth sensing information provides “distance” (distances)) “and direction between the camera and meeting participant” (for each in-room user from the in-room device)).
Regarding claim 4, Bhatt et al. do teach the device of claim 1, wherein the metadata information comprises reference frame information associated with each of the one or more in-room users captured at one or more time intervals (¶ 0078 first column last 3 lines+: “The disclosed methodology also applies the pixel width measure” “and pixel height measure” “extracted from each head bounding box” (the metadata) “to one or more reverse lookup tables to extract meeting room coordinates” (comprises “coordinates” (reference frame information) associated with each in-room user from the in-room device because according to ¶ 0028 line 21 “coordinates [are] in relation to the camera location”) “for each meeting participant”; ¶ 0057 S1: “At step 181, the method starts when a video/web conference call meeting is started” (the “video” (metadata) is captured at an interval which begins with the start of the conference) “in a conference room or area in which a video conference system is located”).
Regarding claim 5, Bhatt et al. do teach the device of claim 4, wherein the one or more time intervals comprises a start of a conferencing session (¶ 0057 S1: “At step 181, the method starts when a video/web conference call meeting is started” (the “video” (metadata) is captured at an interval which begins with the start of the conference) “in a conference room or area in which a video conference system is located”).
Regarding claim 6, Bhatt et al. do teach the device of claim 1, wherein the metadata information comprises a field of view (FOV) of the first camera and a resolution of the first camera (Abstract S1: “A video” (or “camera field of view” (metadata or camera information, ¶ 0035 2nd column line 9)) “conference call system is provided with a camera” (received in an in-room device) “to generate an input frame image of a conference room, where the video conference call system detects a human head” (detecting) “for each meeting participant” (one or more in-room users); ¶ 0035 S3: “And by knowing the camera field of view” (metadata) “resolution” (associated with the “camera” “resolution”) “in both horizontal and vertical directions with the respective horizontal and vertical pixel counts”).
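For illustration only, the role of the camera’s FOV and resolution in such a distance determination can be expressed by the following assumed pinhole-style relation (not a quotation from Bhatt et al.), where FOV_h is the horizontal field of view, N_h the horizontal pixel count, W an assumed physical head width, and w the measured pixel width:

```latex
\theta_{\mathrm{px}} = \frac{\mathrm{FOV}_h}{N_h},
\qquad
d \;\approx\; \frac{W}{2\,\tan\!\left(\frac{w\,\theta_{\mathrm{px}}}{2}\right)}
```

Both the FOV and the resolution are thus needed to fix the angle subtended per pixel, and hence the distance estimate; this is the same relation used, in focal-length form, in the illustrative sketch under the § 101 analysis above.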
Regarding claim 7, Bhatt et al. do teach the device of claim 1, wherein the processing circuitry is further configured to analyze a video stream coming from the in-room device (¶ 0059, second-to-last S: “In addition, one or more image” (video stream) “post-processing” (analysis) “steps may be applied to the results output from the head detection model”; note, according to ¶ 0050 lines 4+: “head detector system 170” (using processing circuitry) “which processes” (analyzing) “incoming room-view video frame images” (video stream) “171 of a meeting room scene” (coming from the “camera” (the in-room device))).
Regarding claim 10, Bhatt et al. do teach a non-transitory computer-readable medium storing computer-executable instructions which when executed by one or more processors (¶ 0079 S2: “The disclosed system includes one or more first processors, a first data bus coupled to the one or more first processors, and a non-transitory, computer-readable storage medium embodying computer program code and being coupled to the first data bus, where the computer program code interacts with a plurality of computer operations and includes first instructions executable by the one or more first processors”)
result in performing operations comprising:
identifying metadata comprising depth sensing information and camera information received from an in-room device located at a first location having a first camera (Abstract S1: “A video” (or “camera field of view” (metadata or camera information, ¶ 0035 2nd column line 9)) “conference call system is provided with a camera” (received in an in-room device) “to generate an input frame image of a conference room, where the video conference call system detects a human head” (detecting) “for each meeting participant” (one or more in-room users); ¶ 0002 S3: “detecting the location of meeting participant in a room location” “e.g., the distance” (including depth sensing information) “and direction between the camera and meeting participant” (associated with each in-room person); ¶ 0003 column 2 lines 7+: “identify” (identifying) “for each detected human head, a head bounding box with specified image plane coordinate and dimension information in a head box data structure” (a subset of the “video” (metadata)));
performing face recognition on one or more in-room users (Abstract lines 6-8: “generates a head bounding box which surround each detected human head” (for each face) “and identifies” (perform identification) “a corresponding meeting participant” (for each user of the one or more in-room users));
calculating a distance of a first in-room user based on the metadata and a first number of pixels across the face of the first in-room user (¶ 0048 S1: “determining the position and distance” (calculating a distance) “of a meeting participant” (of a first in-room user) “in a room based on the pixel count” (based on a first number of pixels) “for the height and width” (based on the “video” pictures (metadata)) “of each human head” (across his face));
calculating a distance between the first in-room user and a second in-room user based on the metadata and the first number of pixels across the face of the first in-room user and a number of pixels across the face of the second in-room user (¶ 0078 first column last 3 lines+: “The disclosed methodology also applies the pixel width measure” (using the first number of pixels across each participant face) “and pixel height measure” “extracted from each head bounding box” (and metadata) “to one or more reverse lookup tables to extract meeting room coordinates” (determine distances of each in-room user from the in-room device because according to ¶ 0028 line 21 “coordinates [are] in relation to the camera location”) “for each meeting participant” (for, e.g., a first in-room user as well as a second in-room user, which enables calculating the distance between them; e.g., see Fig. 4 and ¶ 0031 lines 13+: “For example, a meeting participant 42” (e.g., a first in-room user) “located on the center line of sight of camera focal point” “at a distance of d=0.5 meters” and “a meeting participant 43” (a second in-room user) “located on the center line of sight of camera” “at a larger distance of d=1.0”, which readily gives the distance between “participant 42” (the first in-room user) and “participant 43” (the second in-room user) as 0.5 meters))).
Regarding claim 11, Bhatt et al. do teach the non-transitory computer-readable medium of claim 10, wherein the depth sensing information is associated with the one or more in-room users located in a field of view of the first camera (¶ 0002 S3: “detecting the location of meeting participant in a room location” “e.g., the distance” (the depth sensing information) “and direction between the camera and meeting participant” (associated with each in-room user being located in the field of view of the “camera” (the first camera))).
Regarding claim 12, Bhatt et al. do teach the non-transitory computer-readable medium of claim 10, wherein the depth sensing information provides distances of the one or more in-room users (¶ 0002 S3: “detecting the location of meeting participant in a room location” “e.g., the distance” (the depth sensing information provides “distance” (distances)) “and direction between the camera and meeting participant” (for each in-room user from the in-room device)).
Regarding claim 13, Bhatt et al. do teach the non-transitory computer-readable medium of claim 10, wherein the metadata information comprises reference frame information associated with each of the one or more in-room users captured at one or more time intervals (¶ 0078 first column last 3 lines+: “The disclosed methodology also applies the pixel width measure” “and pixel height measure” “extracted from each head bounding box” (the metadata) “to one or more reverse lookup tables to extract meeting room coordinates” (comprises “coordinates” (reference frame information) associated with each in-room user from the in-room device because according to ¶ 0028 line 21 “coordinates [are] in relation to the camera location”) “for each meeting participant”; ¶ 0057 S1: “At step 181, the method starts when a video/web conference call meeting is started” (the “video” (metadata) is captured at an interval which begins with the start of the conference) “in a conference room or area in which a video conference system is located”).
Regarding claim 14, Bhatt et al. do teach the non-transitory computer-readable medium of claim 13, wherein the one or more time intervals comprises a start of a conferencing session (¶ 0057 S1: “At step 181, the method starts when a video/web conference call meeting is started” (the “video” (metadata) is captured at an interval which begins with the start of the conference) “in a conference room or area in which a video conference system is located”).
Regarding claim 15, Bhatt et al. do teach the non-transitory computer-readable medium of claim 10, wherein the metadata information comprises a field of view (FOV) of the first camera and a resolution of the first camera (Abstract S1: “A video” (or “camera field of view” (metadata or camera information, ¶ 0035 2nd column line 9)) “conference call system is provided with a camera” (received in an in-room device) “to generate an input frame image of a conference room, where the video conference call system detects a human head” (detecting) “for each meeting participant” (one or more in-room users); ¶ 0035 S3: “And by knowing the camera field of view” (metadata) “resolution” (associated with the “camera” “resolution”) “in both horizontal and vertical directions with the respective horizontal and vertical pixel counts”).
Regarding claim 16, Bhatt et al. do teach the non-transitory computer-readable medium of claim 10, wherein the operations further comprise analyze a video stream coming from the in-room device (¶ 0059, second-to-last S: “In addition, one or more image” (video stream) “post-processing” (analysis) “steps may be applied to the results output from the head detection model”; note, according to ¶ 0050 lines 4+: “head detector system 170” (using processing circuitry) “which processes” (analyzing) “incoming room-view video frame images” (video stream) “171 of a meeting room scene” (coming from the “camera” (the in-room device))).
Regarding claim 19, Bhatt et al. do teach a method (Abstract S1: “A video conference call system” (a device and associated method) “is provided with a camera to generate an input frame image of a conference room”; ¶ 0081 lines 11+: “The information handling system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic”),
comprising:
identifying metadata comprising depth sensing information and camera information received from an in-room device located at a first location having a first camera (Abstract S1: “A video” (or “camera field of view” (metadata or camera information, ¶ 0035 2nd column line 9)) “conference call system is provided with a camera” (received in an in-room device) “to generate an input frame image of a conference room, where the video conference call system detects a human head” (detecting) “for each meeting participant” (one or more in-room users); ¶ 0002 S3: “detecting the location of meeting participant in a room location” “e.g., the distance” (including depth sensing information) “and direction between the camera and meeting participant” (associated with each in-room person); ¶ 0003 column 2 lines 7+: “identify” (identifying) “for each detected human head, a head bounding box with specified image plane coordinate and dimension information in a head box data structure” (a subset of the “video” (metadata)));
performing face recognition on one or more in-room users (Abstract lines 6-8: “generates a head bounding box which surround each detected human head” (for each face) “and identifies” (perform identification) “a corresponding meeting participant” (for each user of the one or more in-room users));
calculating a distance of a first in-room user based on the metadata and a first number of pixels across the face of the first in-room user (¶ 0048 S1: “determining the position and distance” (calculating a distance) “of a meeting participant” (of a first in-room user) “in a room based on the pixel count” (based on a first number of pixels) “for the height and width” (based on the “video” pictures (metadata)) “of each human head” (across his face));
calculating a distance between the first in-room user and a second in-room user based on the metadata and the first number of pixels across the face of the first in-room user and a number of pixels across the face of the second in-room user (¶ 0078 first column last 3 lines+: “The disclosed methodology also applies the pixel width measure” (using the first number of pixels across each participant face) “and pixel height measure” “extracted from each head bounding box” (and metadata) “to one or more reverse lookup tables to extract meeting room coordinates” (determine distances of each in-room user from the in-room device because according to ¶ 0028 line 21 “coordinates [are] in relation to the camera location”) “for each meeting participant” (for, e.g., a first in-room user as well as a second in-room user, which enables calculating the distance between them; e.g., see Fig. 4 and ¶ 0031 lines 13+: “For example, a meeting participant 42” (e.g., a first in-room user) “located on the center line of sight of camera focal point” “at a distance of d=0.5 meters” and “a meeting participant 43” (a second in-room user) “located on the center line of sight of camera” “at a larger distance of d=1.0”, which readily gives the distance between “participant 42” (the first in-room user) and “participant 43” (the second in-room user) as 0.5 meters))).
Regarding claim 20, Bhatt et al. do teach the method of claim 19, wherein the depth sensing information is associated with the one or more in-room users located in a field of view of the first camera (¶ 0002 S3: “detecting the location of meeting participant in a room location” “e.g., the distance” (the depth sensing information) “and direction between the camera and meeting participant” (associated with each in-room user being located in the field of view of the “camera” (the first camera))).
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 8, 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Bhatt et al., and further in view of Tangeland et al. (US 2023/0300295).
Regarding claim 8, Bhatt et al. do not specifically disclose the device of claim 1, wherein the processing circuitry is further configured to:
select the first in-room user of the second in-room user by utilizing touch; and
steer a beamformer in a direction of the first in-room user or the second in-room user.
Tangeland et al. do teach the device of claim 1 (Title; Abstract),
wherein the processing circuitry is further configured to:
select the first in-room user or the second in-room user by utilizing touch (¶ 0028 S4: “In one embodiment, video endpoint device 120 may display the self-view of the conference room” (for in-room users) “on display 122 and, when display 122 is a touch screen, a participant may select” (select) “one of participants 202-210” (either a first or a second in-room user) “as the presenter by touching” (by utilizing touch) “an image of the presenter on the touch screen”); and
steer a beamformer in a direction of the first in-room user or the second in-room user (¶ 0032 last S: “Microphones 320-1 to 320-N may detect audio from participants 302-310 and a position of a speaker” (towards a first and/or a second in-room user) “may be determined using the speaker tracking” (steering) “microphone array” (a beamformer); “tracking” a “speaker” requires “steering” a “microphone array” (“beamformer”) towards, or in the direction of, the “speaker”; see, e.g., Ashoori et al. (US 2018/0315094) ¶ 0027 S1: “In some embodiments, multi-microphone arrays can dynamically steer ‘listening beams,’ which, with the aid of video cameras, can track the location of the individual”).
It would therefore have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the “speaker” “tracking” methods of Tangeland et al. into the determination of “meeting room coordinates” for “each” “meeting participant” of Bhatt et al., as doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further provide an alternative determination of “participant” “positions” to help assess the validity of Bhatt et al.’s techniques.
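For illustration only: neither reference sets out the beamformer mathematics, but the “steer[ing]” mapped above conventionally reduces to computing per-microphone delays from the selected participant’s direction. The following sketch assumes a uniform linear microphone array and far-field geometry; all names, spacings, and values are hypothetical, not taken from Tangeland et al. or Bhatt et al.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second, at room temperature

def steering_delays(mic_positions_m: list[float], angle_deg: float) -> list[float]:
    """Per-microphone delays (seconds) steering a uniform linear array toward
    a far-field talker at angle_deg from broadside."""
    angle = math.radians(angle_deg)
    # A talker off broadside reaches each microphone with an extra path length
    # of x * sin(angle); delaying each channel accordingly aligns the signals.
    delays = [x * math.sin(angle) / SPEED_OF_SOUND for x in mic_positions_m]
    offset = min(delays)  # normalize so all delays are non-negative
    return [d - offset for d in delays]

# Four microphones spaced 5 cm apart; selected participant at 30 degrees.
mics = [0.00, 0.05, 0.10, 0.15]
print([round(d * 1e6, 1) for d in steering_delays(mics, 30.0)])
# -> [0.0, 72.9, 145.8, 218.7] microseconds
```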
Regarding claim 17, Bhatt et al. do not specifically disclose the non-transitory computer-readable medium of claim 10, wherein the operations further comprise:
selecting the first in-room user or the second in-room user by utilizing touch; and
steering a beamformer in a direction of the first in-room user or the second in-room user.
Tangeland et al. do teach:
selecting the first in-room user or the second in-room user by utilizing touch (¶ 0028 S4: “In one embodiment, video endpoint device 120 may display the self-view of the conference room” (for in-room users) “on display 122 and, when display 122 is a touch screen, a participant may select” (selecting) “one of participants 202-210” (either a first or a second in-room user) “as the presenter by touching” (by utilizing touch) “an image of the presenter on the touch screen”); and
steering a beamformer in a direction of the first in-room user or the second in-room user (¶ 0032 last S: “Microphones 320-1 to 320-N may detect audio from participants 302-310 and a position of a speaker” (towards a first and/or a second in-room user) “may be determined using the speaker tracking” (steering) “microphone array” (a beamformer); “tracking” a “speaker” requires “steering” a “microphone array” (“beamformer”) towards, or in the direction of, the “speaker”; see, e.g., Ashoori et al. (US 2018/0315094) ¶ 0027 S1: “In some embodiments, multi-microphone arrays can dynamically steer ‘listening beams,’ which, with the aid of video cameras, can track the location of the individual”).
It would therefore have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the “speaker” “tracking” methods of Tangeland et al. into the determination of “meeting room coordinates” for “each” “meeting participant” of Bhatt et al., as doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further provide an alternative determination of “participant” “positions” to help assess the validity of Bhatt et al.’s techniques.
Claim(s) 9, 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Bhatt et al., and further in view of Foote et al. (US 2005/0007445).
Regarding claim 9, Bhatt et al. do not specifically disclose the device of claim 1, wherein the processing circuitry is further configured to:
monitor at least one of the one or more in-room users using gaze; and
enhance voice data of the at least one of the one or more in-room users by direction-based tuning.
Foote et al. do teach the device of claim 1 (Title, Abstract),
wherein the processing circuitry is further configured to:
monitor at least one of the one or more in-room users using gaze (Abstract last S: “A camera can be mounted adjacent to the screen, and can allow the subject to view a selected conference participant or a desired location such that when the camera is trained on the selected participant” (monitor at least one or more of “conference participant[s]” (in-room users)) “or desired location a gaze” (using gaze) “of the remote participant displayed by the screen appears substantially directed at the selected participant or desired location”); and
enhance voice data of the at least one of the one or more in-room users by direction-based tuning (¶ 0033 S3: “A microphone array 114 can provide directional speech pickup” (a direction-based voice tuning) “and enhancement over a range of participant” (on e.g. at least one or more in-room users) “positions”).
It would therefore have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the overall “VIDEO TELECONFERENCING” methods of Foote et al. into the “video conference call system” of Bhatt et al., as doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Bhatt et al. to “provide directional speech” “enhancement over a range of [a] participant positions” as disclosed in Foote et al. (¶ 0033 S3), helping the “sound processing” of Bhatt et al. “to amplify the speech of a speaker” as disclosed in Bhatt et al. (¶ 0027, 2nd column, lines 6-7).
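For illustration only: the “direction-based tuning” mapped above can be modeled, under the same assumptions as the claim-8 sketch, as a delay-and-sum combination in which sound arriving from the selected direction adds coherently and is thereby enhanced. The sketch below assumes sampled multichannel audio; it is hypothetical and not taken from Foote et al.

```python
import numpy as np

def delay_and_sum(signals: np.ndarray, delays_s: list[float], fs: int) -> np.ndarray:
    """Delay each microphone channel by its steering delay and average;
    sound arriving from the steered direction adds coherently and is
    enhanced relative to sound from other directions."""
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        shift = int(round(delays_s[m] * fs))        # delay in whole samples
        out[shift:] += signals[m, :n_samples - shift]
    return out / n_mics

# Synthetic check: four copies of a tone, pre-advanced by the assumed delays,
# re-align under delay_and_sum and reinforce one another.
fs = 16_000
t = np.arange(fs) / fs
delays = [0.0, 72.9e-6, 145.8e-6, 218.7e-6]         # from the claim-8 sketch
signals = np.stack([np.sin(2 * np.pi * 440 * (t + d)) for d in delays])
enhanced = delay_and_sum(signals, delays, fs)
```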
Regarding claim 18, Bhatt et al. do not specifically disclose the non-transitory computer-readable medium of claim 10, wherein the operations further comprise:
monitor at least one of the one or more in-room users using gaze; and
enhance voice data of the at least one of the one or more in-room users by direction-based tuning.
Foote et al. do teach:
monitor at least one of the one or more in-room users using gaze (Abstract last S: “A camera can be mounted adjacent to the screen, and can allow the subject to view a selected conference participant or a desired location such that when the camera is trained on the selected participant” (monitor at least one or more of “conference participant[s]” (in-room users)) “or desired location a gaze” (using gaze) “of the remote participant displayed by the screen appears substantially directed at the selected participant or desired location”); and
enhance voice data of the at least one of the one or more in-room users by direction-based tuning (¶ 0033 S3: “A microphone array 114 can provide directional speech pickup” (a direction-based voice tuning) “and enhancement over a range of participant” (on e.g. at least one or more in-room users) “positions”).
It would therefore have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to incorporate the overall “VIDEO TELECONFERENCING” methods of Foote et al. into the “video conference call system” of Bhatt et al., as doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Bhatt et al. to “provide directional speech” “enhancement over a range of [a] participant positions” as disclosed in Foote et al. (¶ 0033 S3), helping the “sound processing” of Bhatt et al. “to amplify the speech of a speaker” as disclosed in Bhatt et al. (¶ 0027, 2nd column, lines 6-7).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARZAD KAZEMINEZHAD whose telephone number is (571)270-5860. The examiner can normally be reached 10:30 am to 11:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Paras D. Shah can be reached at (571) 270-1650. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Farzad Kazeminezhad/
Art Unit 2653
February 20, 2026.