Prosecution Insights
Last updated: April 19, 2026
Application No. 18/565,041

IMAGE ANALYSIS TO SWITCH AUDIO DEVICES

Non-Final OA: §101, §103, §112, §DP

Filed: Nov 28, 2023
Examiner: DUFFY, CAROLINE TABANCAY
Art Unit: 2662
Tech Center: 2600 — Communications
Assignee: Hewlett-Packard Development Company, L.P.
OA Round: 1 (Non-Final)

Grant Probability: 80% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 3y 1m
Grant Probability With Interview: 99%

Examiner Intelligence

Career Allow Rate: 80% (62 granted / 78 resolved; +17.5% vs TC average; above average)
Interview Lift: +26.9% among resolved cases with an interview
Typical Timeline: 3y 1m average prosecution; 18 applications currently pending
Career History: 96 total applications across all art units
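
The headline figures reconcile: 62 of 78 resolved cases is 79.5%, displayed as 80%. A minimal sketch of the arithmetic follows; the lift formula is an assumption, since the page does not define it:

```python
# Reconstructing the examiner metrics shown above from the page's figures.
granted, resolved = 62, 78
career_allow_rate = granted / resolved          # 0.7949 -> shown as 80%

rate_with_interview = 0.99                      # "99% With Interview"
interview_lift = 0.269                          # "+26.9% Interview Lift"
# Assumed relationship (not stated on the page): the lift is the difference
# in allow rate between resolved cases with and without an interview.
rate_without_interview = rate_with_interview - interview_lift   # ~0.721

print(f"career allow rate: {career_allow_rate:.1%}")                   # 79.5%
print(f"implied rate without interview: {rate_without_interview:.1%}")  # 72.1%
```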

Statute-Specific Performance

§101: 13.8% (-26.2% vs TC avg)
§103: 58.2% (+18.2% vs TC avg)
§102: 7.7% (-32.3% vs TC avg)
§112: 18.2% (-21.8% vs TC avg)

Tech Center averages are estimates. Based on career data from 78 resolved cases.
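
A quick consistency check on these figures: subtracting each statute's delta from its rate recovers the Tech Center baseline, and every statute implies the same ~40% estimate, consistent with a single TC-wide average line. A short sketch of that arithmetic:

```python
# Per-statute rates and "vs TC avg" deltas from the chart above.
rates = {
    "§101": (0.138, -0.262),
    "§103": (0.582, +0.182),
    "§102": (0.077, -0.323),
    "§112": (0.182, -0.218),
}

for statute, (rate, delta) in rates.items():
    tc_avg = rate - delta          # implied Tech Center average
    print(f"{statute}: implied TC avg = {tc_avg:.1%}")   # 40.0% in every row
```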

Office Action

Rejections: §101, §103, §112, §DP
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Applicant’s claim for the benefit of prior-filed application PCT/US2021/038090, filed 06/18/2021, under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged.

Information Disclosure Statement

The information disclosure statement (IDS) submitted on 12/08/2023 is being considered by the examiner.

Claim Objections

Claim 6 is objected to because of the following informalities: Claim 6, lines 5-6, recites “a engagement.” The claim should be amended to “an engagement.” Appropriate correction is required.

Double Patenting

Claim 1 of this application is patentably indistinct from claim 6 of Application No. 18/560793. Pursuant to 37 CFR 1.78(f), when two or more applications filed by the same applicant or assignee contain patentably indistinct claims, elimination of such claims from all but one application may be required in the absence of good and sufficient reason for their retention during pendency in more than one application. Applicant is required to either cancel the patentably indistinct claims from all but one application or maintain a clear line of demarcation between the applications. See MPEP § 822.

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).

A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).

The filing of a terminal disclaimer by itself is not a complete reply to a nonstatutory double patenting (NSDP) rejection. A complete reply requires that the terminal disclaimer be accompanied by a reply requesting reconsideration of the prior Office action. Even where the NSDP rejection is provisional, the reply must be complete. See MPEP § 804, subsection I.B.1. For a reply to a non-final Office action, see 37 CFR 1.111(a). For a reply to a final Office action, see 37 CFR 1.113(c).
A request for reconsideration, while not provided for in 37 CFR 1.113(c), may be filed after final for consideration. See MPEP §§ 706.07(e) and 714.13. The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The actual filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/apply/applying-online/eterminal-disclaimer.

Claim 1 is provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over Claim 6 of copending Application No. 18/560793 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because the “gesture” of the reference application is considered a “sequence of motion” as recited in the instant application. That is, “a gesture of the user to indicate the feature,” recited in Claim 6 of 18/560793, is “a sequence of motion in the video.”

Instant Application (18/565041), Claim 1: “A non-transitory machine-readable medium comprising instructions that, when executed by a processor, cause the processor to: analyze a video captured by a computing device to detect a sequence of motion in the video; and select an audio endpoint of the computing device based on the sequence of motion.”

Reference Application (18/560793), Claim 6: “A non-transitory machine-readable medium storing instructions, which, when executed by a processor of an electronic device, cause the processor to: detect a feature of a user's face in images captured by an image sensor coupled to the electronic device; detect, in the images, a gesture of the user to indicate the feature; and select an audio endpoint coupled to the electronic device for use in response to detecting the gesture.”

This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.

Claim Rejections - 35 USC § 112

The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 6 and 10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant) regards as the invention. Claims 6 and 10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being incomplete for omitting essential elements, such omission amounting to a gap between the elements. See MPEP § 2172.01. The omitted elements are: additional audio endpoint(s).
Claim 6 recites “an audio endpoint.” In a subsequent limitation, Claim 6 recites “automatically switch audio endpoints.” As best interpreted in light of the specification, multiple audio endpoints are required for a switching action. Claim 10 also recites “switch audio endpoints” with no indication of an additional or alternative audio endpoint.

Claim Rejections - 35 USC § 101

35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-5 and 11-15 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Claim 1

Step 1 – YES: Claim 1 discloses a machine, and thus falls in one of the statutory categories.

Step 2A, Prong One – YES: Claim 1 recites an abstract idea. Claim 1 recites “analyze a video captured by a computing device to detect a sequence of motion in the video” and “select an audio endpoint of the computing device based on the sequence of motion.” As claimed, “analyze a video” may be performed in the mind by judgment and observation. Selecting an audio endpoint is recited at a high level of generality and merely uses a computer as a tool to perform the mental process of selection.

Step 2A, Prong Two – NO: Claim 1 does not recite additional elements that integrate the judicial exception into a practical application. Claim 1 recites “A non-transitory machine-readable medium comprising instructions that, when executed by a processor, cause the processor to…” A non-transitory machine-readable medium and a processor are both generic computer components and thus do not integrate the abstract idea into a practical application.

Step 2B – NO: Claim 1 does not recite additional elements that amount to significantly more than the judicial exception. Claim 1 recites “A non-transitory machine-readable medium comprising instructions that, when executed by a processor” and “a video captured by a computing device.” As stated above, the non-transitory machine-readable medium and processor are generic computer components. A video captured by a computing device is mere data gathering and is thus an extra-solution activity that does not amount to significantly more than the judicial exception. Thus, Claim 1 is not eligible subject matter.

Claims 2-5 do not recite additional elements that integrate the abstract idea into a practical application or amount to significantly more than the abstract idea. Claim 2 recites the abstract idea of “deselecting” a second audio endpoint. Deselecting an audio endpoint is recited at a high level of generality and merely uses a computer as a tool to perform the mental process of deselection. Claim 3 recites the abstract idea “analyze the video to detect a representation of the audio endpoint in the video.” Analyzing a video to detect a visual feature may be performed in the mind by judgment and observation. As stated, selecting an audio endpoint is recited at a high level of generality and merely uses a computer as a tool to perform the mental process of selection. Claim 4 recites the abstract idea of “classify the sequence of motion as a donning gesture,” which may be performed in the mind by judgment and observation. Claim 5 recites the abstract idea of “classify the sequence of motion as a doffing gesture,” which may be performed in the mind by judgment and observation.
Claims 3-5 recite steps to “select the audio endpoint,” which, as stated, are recited at a high level of generality and merely use a computer as a tool to perform the mental process of selection. Thus, Claims 2-5 are not eligible subject matter.

Claim 11

Step 1 – YES: Claim 11 discloses a machine, and thus falls in one of the statutory categories.

Step 2A, Prong One – YES: Claim 11 recites an abstract idea. Claim 11 recites “perform image analysis on images captured by the camera” and “switch between the speaker and the wearable audio endpoint based on the image analysis.” As claimed, “image analysis” may be performed in the mind by judgment and observation. Switching between a speaker and a wearable audio endpoint is recited at a high level of generality and merely uses a computer as a tool to perform the mental process of switching.

Step 2A, Prong Two – NO: Claim 11 does not recite additional elements that integrate the judicial exception into a practical application. Claim 11 recites “A computing device comprising: a camera; a speaker; a machine-learning system; and a processor connected to the camera and the speaker, the processor further connectable to a wearable audio endpoint, the processor to: apply the machine-learning system.” A camera merely performs the insignificant extra-solution activity of capturing images. A speaker and a wearable audio endpoint are used merely as tools to perform an existing process; use of a computer or other machinery in its ordinary capacity, or simply adding computer components to an abstract idea, does not integrate a judicial exception into a practical application or provide significantly more. A camera, speaker, and wearable audio endpoint are merely machinery used in their ordinary capacities. A processor is a generic computer component and thus does not integrate the judicial exception into a practical application. Finally, applying a machine-learning system is recited at a high level of generality and amounts to mere instructions to “apply it” on a computer.

Step 2B – NO: Claim 11 does not recite additional elements that amount to significantly more than the judicial exception. Claim 11 recites “A computing device comprising: a camera; a speaker; a machine-learning system; and a processor connected to the camera and the speaker, the processor further connectable to a wearable audio endpoint, the processor to: apply the machine-learning system.” As stated above, the computing device comprising a camera, speaker, machine-learning system, and processor amounts to a computer and other machinery used in their ordinary capacity; simply adding computer components to the abstract idea of switching between outputs does not provide significantly more. The machine-learning system amounts to mere instructions to apply the abstract idea on a computer or machine. Thus, Claim 11 is not eligible subject matter.

Claims 12-14 also recite mere instructions to “apply it.” Claims 12, 13, and 14 recite the abstract ideas “detect hand motion indicative of a user putting on or removing the wearable audio endpoint,” “detect representations of headsets and earbuds in the images,” and “detect a change in location of the computing device,” respectively.
Claims 12, 13, and 14 recite the additional elements “a machine-learning model to detect hand motion indicative of a user putting on or removing the wearable audio endpoint,” “a machine-learning model to detect representations of headsets and earbuds in the images,” and “a machine-learning model to detect a change in location of the computing device,” respectively. However, these additional elements integrate neither the abstract ideas of “perform image analysis on images captured by the camera” and “switch between the speaker and the wearable audio endpoint based on the image analysis” nor the respective abstract ideas of Claims 12, 13, and 14 into a practical application. Claims 12-14 omit any details as to how the machine-learning model solves a technical problem and instead recite only the idea of a solution or outcome. Claims 12-14 contain mere instructions to apply the abstract ideas on a computer. Thus, Claims 12-14 are not eligible subject matter.

Claim 15 recites the abstract ideas “detect a hand motion,” “detect the wearable audio endpoint,” and “switch between the speaker and the wearable audio endpoint in response to detection of the wearable audio endpoint.” As with Claims 12-14, applying “the machine-learning system” to detect a hand motion and to detect a wearable audio endpoint amounts to mere instructions to apply the abstract ideas. Switching between the speaker and the wearable audio endpoint is recited at a high level of generality and merely uses a computer as a tool to perform the mental process of switching. The limitation “in response to detection of the wearable audio endpoint” does not contain details as to how the machine-learning model detection solves a technical problem and instead recites only the idea of a solution or outcome. Claim 15 recites mere instructions to apply the abstract ideas on the computer. Thus, Claim 15 is not eligible subject matter.

Applicant is recommended to incorporate the limitation “automatically switch,” as recited in Claim 6, as the term “automatically” integrates the judicial exception into a practical application because selecting or switching outputs may not be done automatically in the mind or with generic computer components.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors.
In considering patentability of the claims, the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-7 and 9-12 are rejected under 35 U.S.C. 103 as being unpatentable over Xiang et al. (US 2020/0077193 A1) in view of Smith et al. (US 2019/0384406 A1).

Regarding Claim 1, Xiang teaches “A non-transitory machine-readable medium comprising instructions that, when executed by a processor, cause the processor to” (Xiang, [0170] discloses “A processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors”) and “analyze a video captured by a computing device to detect a sequence of motion in the video” (Xiang, [0147] discloses “FIG. 28C shows a block diagram of an implementation A120 of apparatus A110 that includes a capture device CD10 which captures the scene that includes the gesture”; where an apparatus A110 is a computing device; where a scene that includes a gesture is a video captured by a computing device; where a gesture is a sequence of motion in the video).

Xiang does not explicitly teach “select an audio endpoint of the computing device based on the sequence of motion.” However, in an analogous field of endeavor, Smith teaches “select an audio endpoint of the computing device based on the sequence of motion” (Smith, [0109] discloses “In response to detecting gesture sequence 900A, a previously muted left audio channel (e.g., reproduced by a left speaker in the HMD) is unmuted; without changes to the operation of the right audio channel”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Xiang to incorporate the teachings of Smith by muting a particular audio channel in response to a detected gesture. The prior art contained a ‘base’ device upon which the claimed invention can be seen as an ‘improvement.’ That is, Xiang teaches a device that detects gestures from a captured scene. The prior art contained a known technique that is applicable to the base device. Smith teaches a technique of adjusting audio channels based on detected gestures. One of ordinary skill in the art would have recognized that applying the known technique would have yielded predictable results and resulted in an improved system. Additionally, Xiang broadly teaches a “muting gesture” that may be interpreted by the “gesture interpreter” (see Xiang, [0143]). Smith teaches a particular audio control of audio channels. Thus, one of ordinary skill in the art would be motivated to combine the Xiang and Smith references by applying a known technique to a known device ready for improvement to yield predictable results. Accordingly, the combination of Xiang and Smith discloses the invention of Claim 1.
Regarding Claim 2, the combination of Xiang and Smith teaches “The non-transitory machine-readable medium of claim 1, wherein: the audio endpoint is a first audio endpoint” (Smith, [0110] discloses “In response to detecting gesture sequence 900B, the left audio channel may be muted, or its volume may be reduced, without changes to the operation of the right audio channel”; where a right audio channel is a first audio endpoint) “and based on the sequence of motion, the instructions are further to deselect a second audio endpoint” (Smith, [0110] discloses “In response to detecting gesture sequence 900B, the left audio channel may be muted, or its volume may be reduced, without changes to the operation of the right audio channel”; where a left audio channel is a second audio endpoint; where muting a left audio channel is deselecting a second audio endpoint). The proposed combination as well as the motivation for combining the Xiang and Smith references presented in the rejection of Claim 1, apply to Claim 2 and are incorporated herein by reference. Thus, the apparatus recited in Claim 2 is met by Xiang and Smith.

[Figure: Smith, Figs. 9A and 9B]

Regarding Claim 3, the combination of Xiang and Smith teaches “The non-transitory machine-readable medium of claim 1, wherein the instructions are further to: analyze the video to detect a representation of the audio endpoint in the video” (Smith, [0110] discloses “In gesture sequence 900B, starting position 903 shows a user with her left hand 909 out and away by a selected distance, followed by motion 904 where the user covers her left ear 910 with the palm of her left hand 909. In response to detecting gesture sequence 900B, the left audio channel may be muted, or its volume may be reduced, without changes to the operation of the right audio channel”; where a gesture covering a left ear is a representation of the audio endpoint in the video) “and select the audio endpoint further based on detection of the representation of the audio endpoint” (Smith, [0110] discloses “In response to detecting gesture sequence 900B, the left audio channel may be muted, or its volume may be reduced, without changes to the operation of the right audio channel”; where muting a left audio channel is selecting an audio endpoint). The proposed combination as well as the motivation for combining the Xiang and Smith references presented in the rejection of Claim 1, apply to Claim 3 and are incorporated herein by reference. Thus, the apparatus recited in Claim 3 is met by Xiang and Smith.

Regarding Claim 4, the combination of Xiang and Smith teaches “The non-transitory machine-readable medium of claim 1, wherein the instructions are further to: classify the sequence of motion as a donning gesture” (Smith, Fig. 9A sequence 900A shows motions corresponding to unmuting: Smith, [0109] discloses “In gesture sequence 900A, starting position 901 shows a user covering her left ear 910 with the palm of her left hand 909, followed by motion 902, where the left hand 909 moves out and away from the user, uncovering her left ear 910”; where gesture sequence 900A is a donning gesture) “and select the audio endpoint based on the donning gesture” (Smith, [0109] discloses “In response to detecting gesture sequence 900A, a previously muted left audio channel (e.g., reproduced by a left speaker in the HMD) is unmuted”).
The proposed combination as well as the motivation for combining the Xiang and Smith references presented in the rejection of Claim 1, apply to Claim 4 and are incorporated herein by reference. Thus, the apparatus recited in Claim 4 is met by Xiang and Smith.

Regarding Claim 5, the combination of Xiang and Smith discloses “The non-transitory machine-readable medium of claim 1, wherein the instructions are further to: classify the sequence of motion as a doffing gesture” (Smith, Fig. 9B sequence 900B shows motions corresponding to muting: Smith, [0110] discloses “In gesture sequence 900B, starting position 903 shows a user with her left hand 909 out and away by a selected distance, followed by motion 904 where the user covers her left ear 910 with the palm of her left hand 909”; where gesture sequence 900B is a doffing gesture) “and select the audio endpoint based on the doffing gesture” (Smith, [0110] discloses “In response to detecting gesture sequence 900B, the left audio channel may be muted”). The proposed combination as well as the motivation for combining the Xiang and Smith references presented in the rejection of Claim 1, apply to Claim 5 and are incorporated herein by reference. Thus, the apparatus recited in Claim 5 is met by Xiang and Smith.

Regarding Claim 6, Smith teaches “A non-transitory machine-readable medium comprising instructions that, when executed by a processor” (Smith, [0071] discloses “In various embodiments, modules 301-303 may be stored in memory 203 of host IHS 103, in the form of program instructions, that are executable by processor 201”), “cause the processor to: apply images captured by a computing device to a trained machine-learning system” (Smith, [0073] discloses “Generally, at least a portion of user 101 may be identified in the video frame data obtained using camera 108 using gesture sequence recognition component 301. For example, through image processing, a given locus of a video frame or depth map may be recognized as belonging to user 101.” Smith, [0075] discloses “Such a machine-learning method may analyze a user with reference to information learned from a previously trained collection of known gestures and/or poses stored in calibration component 302. During a supervised training phase, for example, a variety of gesture sequences may be observed, and trainers may provide label various classifiers in the observed data. The observed data and annotations may then be used to generate one or more machine-learned algorithms that map inputs (e.g., observation data from a depth camera) to desired outputs (e.g., body-part indices for relevant pixels)”; where a camera is a computing device; where a machine-learning method is a trained machine-learning system; where video frame data is images captured by a computing device); “in response to detection of the visual indication, detect with the trained machine-learning system a representation of the audio endpoint” (Smith, [0110] discloses “In gesture sequence 900B, starting position 903 shows a user with her left hand 909 out and away by a selected distance, followed by motion 904 where the user covers her left ear 910 with the palm of her left hand 909. In response to detecting gesture sequence 900B, the left audio channel may be muted, or its volume may be reduced, without changes to the operation of the right audio channel”; where a gesture covering a left ear is a representation of the audio endpoint in the video.
Under the broadest reasonable interpretation, a “representation of the audio endpoint” is any motion, image, or other indication of an audio endpoint. Under the broadest reasonable interpretation, an audio endpoint may be a speaker, wearable audio endpoint, or even the ear itself, as Claim 6 does not require an “audio endpoint” be connected to any hardware or computing device. Thus, Smith teaches a representation of the audio endpoint) “and in response to detection of the representation of the audio endpoint, automatically switch audio endpoints of the computing device” (Smith, [0109] discloses “In response to detecting gesture sequence 900A, a previously muted left audio channel (e.g., reproduced by a left speaker in the HMD) is unmuted; without changes to the operation of the right audio channel.” Smith, Claim 18 also recites “wherein the set further comprises an audio mute or unmute gesture sequence comprising at least one hand covering or uncovering at least one ear, and wherein executing the command further comprises muting or unmuting only one audio channel”; where muting an audio channel is switching audio endpoints. As stated above, under the broadest reasonable interpretation, switching in the context of a computing device is shifting from one electrical circuit to another. However, Claim 6 recites only “an audio endpoint” and then recites “switch audio endpoints” without disclosing a second audio endpoint to switch to. Thus, as best interpreted in light of the 112(b) rejection of Claim 6, the disclosure of Smith of muting one audio channel is switching audio endpoints, as muting an audio channel switches from two output audio channels to one output audio channel).

Smith does not explicitly teach “detect with the trained machine-learning system a visual indication of a engagement of an audio endpoint.” However, in an analogous field of endeavor, Xiang teaches “detect with the trained machine-learning system a visual indication of a engagement of an audio endpoint” (Xiang, [0099] discloses “Task TA10 may include feature classification, such as classifying a feature as the closest among a set of gesture feature candidates (e.g., according to the largest similarity measure), if a measure of the match (e.g., similarity measure) is above a threshold, which may be candidate-dependent. The one or more aspects of a feature may include, for example, one or more of shape, position (e.g., spatial relation of a user's hands to each other, and/or spatial relation of a user's hand to the user's face and/or eyes), distance (e.g., as detected by ranging and/or by a size of the detected feature), orientation (e.g., a tilt of the hand or head, a direction of pointing), and translation (e.g., movement left, right, up, and/or down). FIG. 15 shows three examples of gesture feature candidates.” Xiang, [0101] discloses “Such classification may be based on a hidden Markov model or other pattern recognition algorithm to recognize a gesture element from individual features within a scene or frame and/or to recognize a sequence of gesture elements over time”; where a hidden Markov model is a machine-learning system; where recognizing gesture feature candidates is a visual indication of an engagement of an audio endpoint).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Smith to incorporate the teachings of Xiang by detecting features of an image such as hands, face, and eyes.
The prior art contained a ‘base’ device upon which the claimed invention can be seen as an ‘improvement.’ That is, Smith teaches a device that detects gestures to control audio channels. The prior art contained a known technique that is applicable to the base device. Xiang teaches a technique of detecting gestures as well as particular features, or visual indications of engagement of an audio endpoint. One of ordinary skill in the art would have recognized that applying the known technique would have yielded predictable results and resulted in an improved system. Additionally, as claimed, “a visual indication of an engagement of an audio endpoint” and “a representation of the audio endpoint” are not necessarily distinct elements. That is, a visual indication of an engagement and a representation may be the same detection. However, an additional reference of Xiang is provided for clarity. It would be obvious to one of ordinary skill in the art that detecting a gesture involving the head and hand, as in Smith, requires first a detection of the head and hand in the images, as in Xiang. Thus, one of ordinary skill in the art would be motivated to combine the Xiang and Smith references by applying a known technique to a known device ready for improvement to yield predictable results. Accordingly, the combination of Smith and Xiang discloses the invention of Claim 6.

Regarding Claim 7, the combination of Smith and Xiang teaches “The non-transitory machine-readable medium of claim 6, wherein: the visual indication includes a donning gesture” (Smith, Fig. 9A sequence 900A shows motions corresponding to unmuting: Smith, [0109] discloses “In gesture sequence 900A, starting position 901 shows a user covering her left ear 910 with the palm of her left hand 909, followed by motion 902, where the left hand 909 moves out and away from the user, uncovering her left ear 910”; where gesture sequence 900A is a donning gesture); “the representation of the audio endpoint includes a wearable audio endpoint” (Smith, [0109] discloses “In response to detecting gesture sequence 900A, a previously muted left audio channel (e.g., reproduced by a left speaker in the HMD) is unmuted; without changes to the operation of the right audio channel”; where a left speaker in the HMD is a wearable audio endpoint); “and the instructions are further to, in response to detection of the representation of the audio endpoint, switch from a speaker to the wearable audio endpoint” (Smith, [0109] discloses “In response to detecting gesture sequence 900A, a previously muted left audio channel (e.g., reproduced by a left speaker in the HMD) is unmuted; without changes to the operation of the right audio channel.” Smith, Claim 18 also recites “wherein the set further comprises an audio mute or unmute gesture sequence comprising at least one hand covering or uncovering at least one ear, and wherein executing the command further comprises muting or unmuting only one audio channel”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Smith to incorporate the teachings of Xiang by muting a particular audio channel in response to a detected gesture. Under the broadest reasonable interpretation, muting only one audio channel is switching between audio channels. Claim 18 of Smith recites “muting or unmuting only one audio channel,” with no indication of a left or right audio channel.
Thus, Smith discloses a gesture indicating muting or unmuting a single audio channel, or audio endpoint, that is not particularly a left or right audio channel of an HMD. Xiang teaches control of speakers using gestures (Xiang, [0143] recites “gesture interpreter GI10 may be implemented to receive an associated value for one or more parameters of the corresponding command: sound-blocking gesture—direction to block and/or degree of attenuation; muting gesture—degree of attenuation”). The speakers of Xiang are “a speaker array,” but Xiang also discloses “the various methods disclosed herein may be performed by a portable communications device such as a handset, headset, or portable digital assistant (PDA)” (Xiang, [0175]). Thus, it would be obvious to one of ordinary skill in the art to combine the Xiang and Smith references to mute, or switch to, only one audio channel, an audio channel being either on a headset, head-mounted device, or other speaker array. Thus, Smith and Xiang teach the invention of Claim 7.

Regarding Claim 9, the combination of Smith and Xiang teaches “The non-transitory machine-readable medium of claim 6, wherein the trained machine-learning system includes: a motion-detecting model trained to detect visual indications of user gestures” (Xiang, [0091] discloses “Task TC10 may be implemented to use a camera-based imaging system to capture a sequence of images, and task TA10 may be implemented to use image processing techniques to recognize objects and movements within that sequence”) “and an object-detecting model trained to detect visual representations of audio endpoints” (Xiang, [0097] discloses “Task TA10 may be implemented to perform one or more subtasks on the analyzed scene, such as feature extraction and feature classification. Feature extraction may include analyzing the captured scene to detect and locate regions of interest, such as the user's hand, fingers, head, face, eyes, body, and/or shoulders.”) The proposed combination as well as the motivation for combining the Smith and Xiang references presented in the rejection of Claim 6, apply to Claim 9 and are incorporated herein by reference. Thus, the apparatus recited in Claim 9 is met by Smith and Xiang.

Regarding Claim 10, the combination of Smith and Xiang teaches “The non-transitory machine-readable medium of claim 6, wherein the instructions are to apply the images, detect the visual indication of the engagement of the audio endpoint, detect the representation of the audio endpoint, and automatically switch audio endpoints in real time or near real time during a videoconference” (Smith, [0048] discloses “For example, each user may wear their own HMD tethered to a different host IHS, such as in the form of a video game or a productivity application (e.g., a virtual meeting).”) The proposed combination as well as the motivation for combining the Smith and Xiang references presented in the rejection of Claim 6, apply to Claim 10 and are incorporated herein by reference. Thus, the apparatus recited in Claim 10 is met by Smith and Xiang.

Regarding Claim 11, Xiang teaches “A computing device comprising: a camera” (Xiang, [0147] discloses “FIG. 28C shows a block diagram of an implementation A120 of apparatus A110 that includes a capture device CD10 which captures the scene that includes the gesture”); “a speaker” (Xiang, [0148] discloses “FIG.
28D shows a block diagram of an implementation A105 of apparatus A100 that includes a loudspeaker array R10”); and “a processor connected to the camera and the speaker” (Xiang, [0170] discloses “A processor or other means for processing as disclosed herein may also be embodied as one or more computers (e.g., machines including one or more arrays programmed to execute one or more sets or sequences of instructions) or other processors”).

Xiang does not explicitly teach “a machine-learning system” and “the processor further connectable to a wearable audio endpoint, the processor to: apply the machine-learning system to perform image analysis on images captured by the camera; and switch between the speaker and the wearable audio endpoint based on the image analysis.” However, in an analogous field of endeavor, Smith teaches “a machine-learning system” (Smith, [0075] discloses “Such a machine-learning method may analyze a user with reference to information learned from a previously trained collection of known gestures and/or poses stored in calibration component 302. During a supervised training phase, for example, a variety of gesture sequences may be observed, and trainers may provide label various classifiers in the observed data. The observed data and annotations may then be used to generate one or more machine-learned algorithms that map inputs (e.g., observation data from a depth camera) to desired outputs (e.g., body-part indices for relevant pixels)”); “the processor further connectable to a wearable audio endpoint” (Smith, [0067] discloses “For example, HMD 102 may include a monaural, binaural, or surround audio reproduction system with one or more internal loudspeakers”; where an HMD is a head-mounted device); “the processor to: apply the machine-learning system to perform image analysis on images captured by the camera” (Smith, [0075], reproduced above); “and switch between the speaker and the wearable audio endpoint based on the image analysis” (Smith, [0109] discloses “In response to detecting gesture sequence 900A, a previously muted left audio channel (e.g., reproduced by a left speaker in the HMD) is unmuted; without changes to the operation of the right audio channel.” Smith, Claim 18 also recites “wherein the set further comprises an audio mute or unmute gesture sequence comprising at least one hand covering or uncovering at least one ear, and wherein executing the command further comprises muting or unmuting only one audio channel”; where muting only one audio channel is switching between audio channels; Xiang teaches a loudspeaker array that is simply substitutable for an audio channel; Smith, Claim 18 recites a generic audio channel, and thus an audio channel of Smith is simply substitutable for a loudspeaker of Xiang).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Xiang to incorporate the teachings of Smith by muting a particular audio channel in response to a detected gesture. Under the broadest reasonable interpretation, muting only one audio channel is switching between audio channels. Claim 18 of Smith recites “muting or unmuting only one audio channel,” with no indication of a left or right audio channel. Thus, Smith discloses a gesture indicating muting or unmuting a single audio channel, or audio endpoint, that is not particularly a left or right audio channel of an HMD. Xiang teaches control of speakers using gestures (Xiang, [0143] recites “gesture interpreter GI10 may be implemented to receive an associated value for one or more parameters of the corresponding command: sound-blocking gesture—direction to block and/or degree of attenuation; muting gesture—degree of attenuation”). The speakers of Xiang are “a speaker array,” but Xiang also discloses “the various methods disclosed herein may be performed by a portable communications device such as a handset, headset, or portable digital assistant (PDA)” (Xiang, [0175]). Thus, it would be obvious to one of ordinary skill in the art to combine the Xiang and Smith references to mute, or switch to, only one audio channel, an audio channel being either on a headset, head-mounted device, or speaker array.

Additionally, the prior art contained a ‘base’ device upon which the claimed invention can be seen as an ‘improvement.’ That is, Xiang teaches a device that detects gestures from a captured scene. The prior art contained a known technique that is applicable to the base device. Smith teaches a technique of adjusting audio channels based on detected gestures. One of ordinary skill in the art would have recognized that applying the known technique would have yielded predictable results and resulted in an improved system. Additionally, Xiang broadly teaches a “muting gesture” that may be interpreted by the “gesture interpreter” (see Xiang, [0143]). Smith teaches a particular audio control of audio channels. Thus, one of ordinary skill in the art would be motivated to combine the Xiang and Smith references by applying a known technique to a known device ready for improvement to yield predictable results. Accordingly, the combination of Xiang and Smith discloses the invention of Claim 11.

Regarding Claim 12, the combination of Xiang and Smith teaches “The computing device of claim 11, wherein the machine-learning system includes a machine-learning model to detect hand motion indicative of a user putting on or removing the wearable audio endpoint” (Smith, [0110] discloses “In response to detecting gesture sequence 900B, the left audio channel may be muted, or its volume may be reduced, without changes to the operation of the right audio channel”; where the gestures of Figs. 9A and 9B are hand motions indicative of putting on and removing, respectively, a wearable audio endpoint). The proposed combination as well as the motivation for combining the Xiang and Smith references presented in the rejection of Claim 11, apply to Claim 12 and are incorporated herein by reference. Thus, the apparatus recited in Claim 12 is met by Xiang and Smith.

Claims 8, 13, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Xiang et al. (US 2020/0077193 A1) in view of Smith et al. (US 2019/0384406 A1), further in view of Su et al. (CN 112883814 A).
Regarding Claim 8, the combination of Smith and Xiang teaches “The non-transitory machine-readable medium of claim 6, wherein: the visual indication includes a doffing gesture” (Smith, Fig. 9B sequence 900B shows motions corresponding to muting: Smith, [0110] discloses “In gesture sequence 900B, starting position 903 shows a user with her left hand 909 out and away by a selected distance, followed by motion 904 where the user covers her left ear 910 with the palm of her left hand 909”; where gesture sequence 900B is a doffing gesture) and “the instructions are further to, in response to detection of the representation of the audio endpoint, switch from the wearable audio endpoint to a speaker” (Smith, [0110] discloses “In response to detecting gesture sequence 900B, the left audio channel may be muted”).

The combination of Smith and Xiang does not explicitly teach “the representation of the audio endpoint includes an absence of a wearable audio endpoint.” However, in an analogous field of endeavor, Su teaches “the representation of the audio endpoint includes an absence of a wearable audio endpoint” (Su, paragraph 40 discloses “In this embodiment, the examinee binaural state is defined as normal, abnormal and not detected to the target 3 condition. Under the normal condition, the examinee exposes the whole appearance of the ears; the ear state is naked without foreign body; the ear is not worn earphone, earring or other examination prohibited articles; opposite to the condition, the definition is abnormal condition; when the picture does not detect the ear of the examinee, the definition is not detected to the target condition.”)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Smith and Xiang to incorporate the teachings of Su by determining an ear in a naked state. Smith and Xiang teach a ‘base’ device upon which the claimed invention can be seen as an improvement. The combination of Smith and Xiang teaches a device for muting speakers in response to a detected gesture involving the head and ear. The prior art contained a known technique that is applicable to the base device. Su teaches a method of detecting an ear and determining if the ear state is naked or obstructed (abnormal). One of ordinary skill in the art would have recognized that applying the known technique would have yielded predictable results and resulted in an improved system. That is, it would have been obvious to one of ordinary skill in the art that the method of Su is applicable to the method of Smith and Xiang to at least locate the ear in the image, and further to determine if the ear is obstructed. Accordingly, the combination of Smith, Xiang, and Su discloses the invention of Claim 8.

Regarding Claim 13, the combination of Xiang and Smith does not explicitly teach “The computing device of claim 11, wherein the machine-learning system includes a machine-learning model to detect representations of headsets and earbuds in the images.” However, in an analogous field of endeavor, Su teaches “The computing device of claim 11, wherein the machine-learning system includes a machine-learning model to detect representations of headsets and earbuds in the images” (Su, paragraph 40 discloses “In this embodiment, the examinee binaural state is defined as normal, abnormal and not detected to the target 3 condition.
Under the normal condition, the examinee exposes the whole appearance of the ears; the ear state is naked without foreign body; the ear is not worn earphone, earring or other examination prohibited articles; opposite to the condition, the definition is abnormal condition; when the picture does not detect the ear of the examinee, the definition is not detected to the target condition.”)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Smith and Xiang to incorporate the teachings of Su by determining an ear in a naked state and detecting an “earphone.” Smith and Xiang teach a ‘base’ device upon which the claimed invention can be seen as an improvement. The combination of Smith and Xiang teaches a device for muting speakers in response to a detected gesture involving the head and ear. The prior art contained a known technique that is applicable to the base device. Su teaches a method of detecting an ear and determining if the ear state is naked or obstructed (abnormal). One of ordinary skill in the art would have recognized that applying the known technique would have yielded predictable results and resulted in an improved system. That is, it would have been obvious to one of ordinary skill in the art that the method of Su is applicable to the method of Smith and Xiang to at least locate the ear in the image, and further to determine if the ear is obstructed. Accordingly, the combination of Xiang, Smith, and Su discloses the invention of Claim 13.

Regarding Claim 15, the combination of Xiang, Smith, and Su teaches “The computing device of claim 11, wherein the processor is to: apply the machine-learning system to perform image analysis to detect a hand motion” (Smith, [0110] discloses “In gesture sequence 900B, starting position 903 shows a user with her left hand 909 out and away by a selected distance, followed by motion 904 where the user covers her left ear 910 with the palm of her left hand 909”); “and in response to detection of the hand motion, apply the machine-learning system to perform additional image analysis to detect the wearable audio endpoint” (Su, paragraph 40 discloses “In this embodiment, the examinee binaural state is defined as normal, abnormal and not detected to the target 3 condition. Under the normal condition, the examinee exposes the whole appearance of the ears; the ear state is naked without foreign body; the ear is not worn earphone, earring or other examination prohibited articles; opposite to the condition, the definition is abnormal condition; when the picture does not detect the ear of the examinee, the definition is not detected to the target condition”); “and switch between the speaker and the wearable audio endpoint in response to detection of the wearable audio endpoint” (Smith, [0109] discloses “In response to detecting gesture sequence 900A, a previously muted left audio channel (e.g., reproduced by a left speaker in the HMD) is unmuted; without changes to the operation of the right audio channel.” Smith, Claim 18 also recites “wherein the set further comprises an audio mute or unmute gesture sequence comprising at least one hand covering or uncovering at least one ear, and wherein executing the command further comprises muting or unmuting only one audio channel”).
The proposed combination as well as the motivation for combining the Smith, Xiang, and Su references presented in the rejection of Claim 13, apply to Claim 15 and are incorporated herein by reference. Thus, the apparatus recited in Claim 15 is met by Smith, Xiang, and Su.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Xiang et al. (US 2020/0077193 A1) in view of Smith et al. (US 2019/0384406 A1), further in view of Yang (CN 107453962 A).

Regarding Claim 14, the combination of Xiang and Smith does not explicitly teach “The computing device of claim 11, wherein the machine-learning system includes a machine-learning model to detect a change in location of the computing device.” However, in an analogous field of endeavor, Yang teaches “The computing device of claim 11, wherein the machine-learning system includes a machine-learning model to detect a change in location of the computing device” (Yang, paragraph 19 discloses “The present invention further provides a management and control system of an intelligent device, comprising a wireless access device, the wireless access device comprising: a monitoring module, configured to monitor network signal strength information of a mobile terminal; an identifying module, configured to identify, according to the network signal strength information, location change information of a location of the mobile terminal; and a control module, configured to control the working status of a preset smart device corresponding to the mobile terminal according to the location change information”).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the combination of Xiang and Smith to incorporate the teachings of Yang by identifying location change information of a mobile terminal. One of ordinary skill in the art would be motivated to combine the Xiang, Smith, and Yang references in order to save electricity: Yang, paragraph 21 discloses “The present invention identifies the position and state changes of the mobile terminal according to the change of the network signal strength of the wireless access device connected to the mobile terminal so as to identify whether the user carrying the mobile terminal is going home or away from home so as to automatically turn on or turn off the controllable smart device; to meet the needs of users of life and save electricity.” Accordingly, the combination of Xiang, Smith, and Yang discloses the invention of Claim 14.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to CAROLINE TABANCAY DUFFY, whose telephone number is (703) 756-1859. The examiner can normally be reached Monday - Friday, 8:00 am - 5:30 pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Amandeep Saini, can be reached at 571-272-3382. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users.
To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/CAROLINE TABANCAY DUFFY/
Examiner, Art Unit 2662

/AMANDEEP SAINI/
Supervisory Patent Examiner, Art Unit 2662
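
To make the disputed claim language concrete: the claims describe detecting a donning or doffing motion sequence in camera frames and switching audio endpoints in response. The sketch below is purely illustrative of that pipeline as recited in Claims 1, 4, 5, and 11; every name in it is hypothetical, and none of it comes from the application or from the Xiang, Smith, Su, or Yang references:

```python
# Hypothetical illustration of the claimed pipeline: classify a motion
# sequence in captured frames as a donning or doffing gesture, then
# switch the active audio endpoint. Not code from the application.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence

class Gesture(Enum):
    DONNING = "donning"    # user putting on the wearable endpoint (Claim 4)
    DOFFING = "doffing"    # user removing it (Claim 5)
    NONE = "none"

@dataclass
class AudioEndpoints:
    active: str = "speaker"

    def switch(self, target: str) -> None:
        # Selecting the target endpoint deselects the current one
        # (cf. Claim 2's first/second endpoints and Claim 6's
        # "automatically switch audio endpoints").
        if target != self.active:
            self.active = target

def on_frames(frames: Sequence[bytes],
              classify: Callable[[Sequence[bytes]], Gesture],
              endpoints: AudioEndpoints) -> None:
    # `classify` stands in for the trained machine-learning system that
    # performs image analysis on camera frames (cf. Claim 11).
    gesture = classify(frames)
    if gesture is Gesture.DONNING:
        endpoints.switch("wearable")   # move output to the wearable endpoint
    elif gesture is Gesture.DOFFING:
        endpoints.switch("speaker")    # fall back to the device speaker

# Toy run with a stub classifier that always reports a donning gesture.
eps = AudioEndpoints()
on_frames([b"frame"], lambda fs: Gesture.DONNING, eps)
print(eps.active)  # -> wearable
```

On this framing, the §101 dispute is whether the classify-and-switch loop amounts to more than performing a mental process on generic hardware; notably, the examiner suggests that Claim 6's "automatically switch" limitation would integrate the exception into a practical application.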

Prosecution Timeline

Nov 28, 2023: Application Filed
Jan 16, 2026: Non-Final Rejection (§101, §103, §112, §DP)
Mar 25, 2026: Interview Requested
Apr 02, 2026: Examiner Interview Summary
Apr 02, 2026: Applicant Interview (Telephonic)

Precedent Cases

Applications granted by the same examiner in similar technology

Patent 12602753
ULTRASOUND IMAGE PROCESSING APPARATUS
Granted Apr 14, 2026 · 2y 5m to grant

Patent 12602788
METHOD AND SYSTEM FOR FULLY AUTOMATICALLY SEGMENTING CEREBRAL CORTEX SURFACE BASED ON GRAPH NETWORK
Granted Apr 14, 2026 · 2y 5m to grant

Patent 12597130
IMAGE PROCESSING APPARATUS, OPERATION METHOD OF IMAGE PROCESSING APPARATUS, AND OPERATION PROGRAM OF IMAGE PROCESSING APPARATUS
Granted Apr 07, 2026 · 2y 5m to grant

Patent 12580081
SYSTEMS AND METHODS FOR DIRECTLY PREDICTING CANCER PATIENT SURVIVAL BASED ON HISTOPATHOLOGY IMAGES
Granted Mar 17, 2026 · 2y 5m to grant

Patent 12567130
REAL-TIME BLIND REGISTRATION OF DISPARATE VIDEO IMAGE STREAMS
Granted Mar 03, 2026 · 2y 5m to grant
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 80%
With Interview: 99% (+26.9%)
Median Time to Grant: 3y 1m
PTA Risk: Low
Based on 78 resolved cases by this examiner. Grant probability derived from career allow rate.
