Prosecution Insights
Last updated: April 19, 2026
Application No. 18/218,552

METHODS AND SYSTEMS FOR SPEECH DETECTION

Status: Final Rejection (§103)
Filed: Jul 05, 2023
Examiner: NEWAY, SAMUEL G
Art Unit: 2657
Tech Center: 2600 — Communications
Assignee: Soapbox Labs Ltd.
OA Round: 4 (Final)
Grant Probability: 75% (Favorable)
OA Rounds: 5-6
To Grant: 3y 0m
With Interview: 83%

Examiner Intelligence

Career Allow Rate: 75% (517 granted / 686 resolved; +13.4% vs TC avg), above average
Interview Lift: +7.6% among resolved cases with interview (moderate, roughly +8%)
Avg Prosecution: 3y 0m typical timeline; 29 applications currently pending
Total Applications: 715 across all art units (career history)
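The headline figures above are simple ratios over the examiner's resolved cases. Below is a minimal Python sketch of how they could be derived; the record structure and field names are assumptions for illustration, not the analytics provider's actual schema.

```python
def allow_rate(cases: list[dict]) -> float:
    """Fraction of resolved cases that ended in a grant."""
    return sum(c["granted"] for c in cases) / len(cases)

def interview_lift(cases: list[dict]) -> float:
    """Allow rate among interviewed cases minus the rate among the rest."""
    with_iv = [c for c in cases if c["had_interview"]]
    without_iv = [c for c in cases if not c["had_interview"]]
    return allow_rate(with_iv) - allow_rate(without_iv)

# Career allow rate as shown above: 517 granted of 686 resolved.
print(f"career allow rate: {517 / 686:.1%}")  # 75.4%, displayed as 75%

# Toy data only; the dashboard reports a +7.6 point lift (75% -> 83%).
demo = [
    {"granted": True,  "had_interview": True},
    {"granted": True,  "had_interview": True},
    {"granted": True,  "had_interview": False},
    {"granted": False, "had_interview": False},
]
print(f"interview lift on toy data: {interview_lift(demo):+.1%}")
```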

Statute-Specific Performance

Statute   Allow rate   vs TC avg
§101      16.6%        -23.4%
§103      34.5%        -5.5%
§102      17.1%        -22.9%
§112      20.1%        -19.9%

Deltas are relative to the Tech Center average estimate • Based on career data from 686 resolved cases
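Note that each row's rate plus its delta works out to 40.0%, so the table implies a flat 40% Tech Center estimate for every statute. A sketch of the same per-statute computation, with hypothetical case records (the data and field names are invented for illustration):

```python
from collections import defaultdict

# Hypothetical resolved cases tagged with the statute of their last rejection;
# the records below and the 40% Tech Center estimate are assumptions
# (implied by the table, where rate + delta = 40.0% in every row).
cases = [
    {"statute": "§103", "granted": True},
    {"statute": "§103", "granted": False},
    {"statute": "§101", "granted": False},
    {"statute": "§112", "granted": True},
]
TC_AVG = 0.40

# Group grant outcomes by statute, then report the rate and its delta.
buckets: dict[str, list[bool]] = defaultdict(list)
for case in cases:
    buckets[case["statute"]].append(case["granted"])

for statute, outcomes in sorted(buckets.items()):
    rate = sum(outcomes) / len(outcomes)
    print(f"{statute}: {rate:.1%} ({rate - TC_AVG:+.1%} vs TC avg)")
```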

Office Action

§103
DETAILED ACTION

This is responsive to the amendment filed 09 February 2026. Claims 21-40 remain pending and are considered below.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant’s arguments with respect to claims 21-40 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 21-23, 25, 34-35 and 38-40 are rejected under 35 U.S.C. 103 as being unpatentable over Rosenberg (US 2007/0024579) in view of Parker et al. (US 2018/0059781).

Claim 21: Rosenberg discloses a method of processing user input to a computing system having an audio input and a visual input (Abstract), the method comprising:

receiving, at the computing system, an audio input signal from said audio input; processing the audio input signal to identify an action executable by the computing system in relation to one or more items in an environment of a user of the computing system (“an aural recognition unit coupled to the controller unit functional to receive the aural signals from at least one functionally coupled aural sensor indicative of verbal commands appropriate for controlling one of a plurality of electronic devices”, [0018], see also “Generally, aural recognition units 50 employs circuitry that transforms input signals representing aural utterances, as captured by one or more aural sensors 55, into discrete digital representations that are compared to stored digital representations in a library of digital words used as commands to control the various electronic devices”, [0073]);

determining whether the user has demonstrated a reliable intent to interact with an item of the one or more items in the environment of the user to cause the action to be executed in relation to the item (“a user issues a command to select a desired electronic device by looking in the direction of the desired one of the plurality of electronic devices while issuing a verbally uttered command”, [0009], see also “determine if the gaze was intentional rather than spurious and/or to determine if the gaze met any specified timing requirements with respect to an uttered verbal command. In this way, the various embodiments may be configured to determine if user 190 is gazing upon electronic device 175, if that gaze was intentional, and if that gaze was contemporaneous with an uttered verbal command from user 190”, [0067]), wherein determining whether the user has demonstrated the reliable intent comprises: performing gaze direction detection on an image received from the visual input to determine a direction of gaze of the user; and determining whether an item in a field of view of the user is consistent with the determined direction of gaze (“the user aims his or her visual gaze substantially in the direction of the one of the separately located electronic devices or a portion thereof. The appropriately directed gaze is referred to herein as a device-targeting gaze and generally must fall within certain close physical proximity of one of the separately located electronic devices, or a portion thereof, for more than a certain threshold amount of time”, [0038], see also “configuration of the image processing routines is performed upon the image data collected by the CCD image sensor to determine if a user is gazing substantially in the direction of the sensor/light source pair”, [0068]); and

responsive to determining that the user has demonstrated the reliable intent to interact with the item in the environment of the user, triggering an event to execute the action in relation to the item in the field of view of the user (“determine if the gaze was intentional rather than spurious and/or to determine if the gaze met any specified timing requirements with respect to an uttered verbal command. In this way, the various embodiments may be configured to determine if user 190 is gazing upon electronic device 175, if that gaze was intentional, and if that gaze was contemporaneous with an uttered verbal command from user 190”, [0067], see also “determine a gaze selected target in dependence on the optical signals received from the gaze detection unit 201. The processor is further programmed to determine an appropriate control action in dependence on the aural signals received from the aural recognition unit 203. If the optical and aural control signals are sufficiently contemporaneous 205, an appropriate command is sent to the interface unit associated with the device to be controlled 209”, [0089]).

Rosenberg does not explicitly disclose that the visual input comprises a camera. In an analogous system similarly detecting a user’s gaze direction using a visual input, Parker discloses wherein the visual input comprises a camera (“Gaze detection may be accomplished, for example, by processing visual, infrared, or sonar images of the user. The images may be captured using, for example, depth cameras, digital or analog video cameras, infrared sensors, and/or ultrasound sensors”, [0033]). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to substitute Rosenberg’s visual input with Parker’s to yield the predictable result of detecting a user’s gaze using images captured by a camera because cameras are standard image capturing devices (see Parker, [0033]).
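For orientation, the Rosenberg logic the examiner relies on for claim 21 (a gaze-selected target plus a "sufficiently contemporaneous" verbal command, [0089]) reduces to a timing check between two events. The sketch below is an editorial illustration with invented names and an invented one-second window; it is not code from any cited reference.

```python
from dataclasses import dataclass

@dataclass
class GazeEvent:
    target: str       # item the gaze resolves to, e.g. "lamp"
    timestamp: float  # seconds

@dataclass
class VoiceCommand:
    action: str       # recognized verbal command, e.g. "turn_on"
    timestamp: float

def reliable_intent(gaze: GazeEvent, cmd: VoiceCommand,
                    max_skew_s: float = 1.0) -> bool:
    """Intent is treated as reliable only when the gaze-selected target and
    the uttered command are sufficiently contemporaneous (cf. [0089])."""
    return abs(gaze.timestamp - cmd.timestamp) <= max_skew_s

def dispatch(gaze: GazeEvent, cmd: VoiceCommand) -> str | None:
    # Trigger the action on the gazed-at item only on reliable intent;
    # otherwise treat the gaze as spurious and do nothing.
    if reliable_intent(gaze, cmd):
        return f"execute {cmd.action} on {gaze.target}"
    return None

print(dispatch(GazeEvent("lamp", 10.2), VoiceCommand("turn_on", 10.6)))
```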
Claim 22: Rosenberg in view of Parker discloses the method of claim 21. Parker discloses wherein the one or more items in the environment of the user represent a plurality of items displayed on a screen of a computing system and available for interaction by the user (“a user 200 who is looking at display 201 on system 202 which is presenting several options including media 203, phone 204, and navigation 205. These options may be presented as button, tiles, or simply words on display 201. Gaze detector 206 monitors user 200 and identifies the user's point of focus. As illustrated in FIG. 2A, the user's gaze 207 is focused on media option 203. Accordingly, system 202 knows that the user is currently interested in the media subsystem and not the phone or navigation subsystems. System 202 selects a media interaction set, which may be a grammar for use with speech recognition system 208 that includes media control terms, such as the words “start,” “stop,” “rewind,” “fast forward,” and lists of artist and song names and terms. When user 200 says “Chicago,” the media system lists songs from the musical Chicago”, [0052], see also [0053]). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result in which Rosenberg’s one or more items represent a plurality of items displayed on a screen of a computing system and available for interaction by the user in order to interact, using the user’s gaze and speech, with a particular displayed item from amongst many (see Parker, [0052]-[0053]).

Claim 23: Rosenberg in view of Parker discloses the method of claim 21, wherein the one or more items in the environment of the user represent a plurality of devices available for interaction by the user, the plurality of devices comprise one or more of home devices, smart devices, smart toys or robots (Rosenberg, [0039], see also [0073]).

Claim 25: Rosenberg in view of Parker discloses the method of claim 21, further comprising performing one or more additional verification operations, wherein confirming that the user has demonstrated the reliable intent to interact with the identified item in the environment of the user is dependent on the outcome of the one or more further verification operations in addition to confirming that the user has demonstrated the reliable intent to interact with the identified item in the environment of the user (Rosenberg, [0050]-[0051] and [0074], see also [0076]).

Claim 34: Rosenberg in view of Parker discloses the method of claim 21, wherein triggering the event to execute the action in relation to the item in the field of view of the user further comprises: recording the audio input signal; and performing a speech processing function on the recorded audio input signal (Rosenberg, [0089], note that the audio signal must be stored, at least temporarily, in order to be processed).

Claim 35: Rosenberg in view of Parker discloses the method of claim 21, wherein triggering the event to execute the action in relation to the item in the field of view of the user further comprises: recording the audio input signal; and sending the recorded audio input signal to a remote computing device for speech processing (Rosenberg, [0077], note that the audio signal must be stored, at least temporarily, in order to be processed).

Claims 38-39: Rosenberg in view of Parker discloses a computing system for processing user input having an audio input and a visual input, the system comprising: a memory; and a processor, coupled to the memory (Rosenberg, [0041]-[0042]), to perform operations comprising performing the steps of process claims 21 and 25 as shown above.

Claim 40: Rosenberg in view of Parker discloses a non-transitory computer readable medium comprising instructions, which when executed by a processor (Rosenberg, [0041]-[0042], see also [0030]), cause the processor to perform the steps of process claim 21 as shown above.

Claim 24 is rejected under 35 U.S.C. 103 as being unpatentable over Rosenberg (US 2007/0024579) in view of Parker et al. (US 2018/0059781) and Hernandez-Abrego et al. (US 2012/0295708).

Claim 24: Rosenberg in view of Parker discloses the method of claim 21, but does not explicitly disclose wherein the one or more items in the environment of the user represent a plurality of items in at least one of a virtual reality, mixed reality or augmented reality overlay seen by the user and available for interaction by the user. In an analogous system similarly interacting with one or more items in an environment of a user, Hernandez-Abrego discloses wherein the one or more items in the environment of the user represent a plurality of items in at least one of a virtual reality, mixed reality or augmented reality overlay seen by the user and available for interaction by the user (“User 406 plays a game that includes virtual reality 402, and avatars 408 and 410. Avatar 410 is being controlled by user 406. When user 406 directs his gaze towards avatar 408, the game moves avatar 410 near avatar 408. Once avatar 410 is situated near avatar 408, user 406 speaks 404 (e.g., "good morning"), and user's 406 speech is reproduced 412 by avatar 410”, [0056]). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result in which Rosenberg’s one or more items represent a plurality of items in at least one of a virtual reality, mixed reality or augmented reality overlay seen by the user and available for interaction by the user in order to interact, using the user’s gaze and speech, with a particular virtual reality, mixed reality or augmented reality item from amongst many (see Hernandez-Abrego, [0056]).

Claims 26, 27 and 32-33 are rejected under 35 U.S.C. 103 as being unpatentable over Rosenberg (US 2007/0024579) in view of Parker et al. (US 2018/0059781) and Basu et al. (US 2003/0018475).

Claim 26: Rosenberg in view of Parker discloses the method of claim 25, but does not explicitly disclose wherein the one or more additional verification operations comprise a mouth movement detection operation to verify that the user's mouth is moving. In a system with similar verification operations to record audio input ([0009]), Basu discloses wherein the verification operations comprise a mouth movement detection operation to verify that the user's mouth is moving (“a microphone associated with the speech recognition system is turned on such that an audio signal from the microphone may be initially buffered if at least one mouth opening is detected from the video signal. Then, the buffered audio signal is decoded in accordance with the speech recognition system if mouth opening pattern recognition indicates that subsequent portions of the video signal are representative of speech”, [0012]).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result of Rosenberg’s verification operations comprising a mouth movement detection operation to verify that the user's mouth is moving in order to use not only audio information but also visual information related to mouth movement in determining whether received audio is speech (see Basu, [0012]).

Claim 27: Rosenberg in view of Parker and Basu discloses the method of claim 26, wherein the mouth movement detection operation further verifies that the mouth movement of the user corresponds to a movement pattern indicative of speech (Basu, “the buffered audio signal is decoded in accordance with the speech recognition system if mouth opening pattern recognition indicates that subsequent portions of the video signal are representative of speech”, [0012]).

Claim 32: Rosenberg in view of Parker and Basu discloses the method of claim 26, wherein at least two of the additional verification operations are performed (Rosenberg, [0050]-[0051]).

Claim 33: Rosenberg in view of Parker and Basu discloses the method of claim 26, wherein the determination of whether the item in the field of view of the user is consistent with the determined direction of gaze and/or the additional verification operations is a weighted determination, and wherein a positive determination is made when the weighted determination is above a threshold (Rosenberg, [0050]-[0051]).

Claims 28-29, 31 and 36-37 are rejected under 35 U.S.C. 103 as being unpatentable over Rosenberg (US 2007/0024579) in view of Parker et al. (US 2018/0059781) and Salvador et al. (US 2015/0066494).

Claim 28: Rosenberg in view of Parker discloses the method of claim 25, but does not explicitly disclose wherein the one or more additional verification operations comprise an audio detection operation to verify that the audio input is receiving sound from the environment of the user. In an analogous art similarly performing one or more additional verification operations to confirm user intent to interact with an item, Salvador discloses wherein the one or more additional verification operations comprise an audio detection operation to verify that the audio input is receiving sound from the environment of the user (“monitors activity 320 for indicia that a user command to record or process audio may be forthcoming”, [0040], see also “The detection of ambient sound (495) may be used as indicia itself, in combination with other indicia, and/or may activate other audio analysis processes that consume greater power, such as processing the audio to detect speech”, [0046]). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result of including an audio detection operation to verify that the audio input is receiving sound from the environment of the user as part of Rosenberg’s one or more additional verification operations in order to save energy by activating audio processes that consume considerable power only when sound is detected (see Salvador, [0046]).

Claim 29: Rosenberg in view of Parker and Salvador discloses the method of claim 28, wherein the audio detection operation further verifies that the characteristics of detected sound are consistent with speech (Salvador, [0047]).
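Claim 33's weighted determination (several verification signals combined, with a positive result when the weighted sum clears a threshold) can be pictured with a short sketch. The signal names, weights, and threshold below are invented for illustration; neither the claims nor the cited references specify values.

```python
# Each verification signal is scored in [0.0, 1.0]; the weights and the 0.6
# threshold are illustrative assumptions only.
def weighted_intent(signals: dict[str, float],
                    weights: dict[str, float],
                    threshold: float = 0.6) -> bool:
    score = sum(weights[name] * value for name, value in signals.items())
    return score > threshold

signals = {
    "gaze_on_item": 1.0,     # gaze consistent with an item in the field of view
    "mouth_moving": 1.0,     # mouth movement detected (cf. claim 26)
    "sound_is_speech": 0.8,  # detected sound consistent with speech (cf. claim 29)
}
weights = {"gaze_on_item": 0.5, "mouth_moving": 0.25, "sound_is_speech": 0.25}

# 0.5*1.0 + 0.25*1.0 + 0.25*0.8 = 0.95 > 0.6, so intent is confirmed.
print(weighted_intent(signals, weights))
```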
Claim 31: Rosenberg in view of Parker and Salvador discloses the method of claim 28, wherein the audio detection operation further verifies that the characteristics of detected sound are consistent with a speech profile stored for a given user (Salvador, [0048], see also Rosenberg, [0074]).

Claim 36: Rosenberg in view of Parker discloses the method of claim 21, but does not explicitly disclose wherein triggering the event to execute the action in relation to the item in the field of view of the user further comprises: storing the audio input signal in a buffer; and performing one of: responsive to confirming that the user has not demonstrated the reliable intent to interact with the identified item in the environment of the user, overwriting or discarding the audio input signal stored in the buffer; or responsive to determining that the user has demonstrated the reliable intent to interact with the identified item in the environment of the user, retrieving the audio input signal from the buffer. In an analogous art similarly triggering an event to execute an action in relation to an item in a field of view of a user, Salvador discloses wherein triggering the event to execute the action in relation to the item in the field of view of the user further comprises: storing the audio input signal in a buffer; and performing one of: responsive to confirming that the user has not demonstrated the reliable intent to interact with the identified item in the environment of the user, overwriting or discarding the audio input signal stored in the buffer; or responsive to determining that the user has demonstrated the reliable intent to interact with the identified item in the environment of the user, retrieving the audio input signal from the buffer (“At least a portion of the buffered audio is then retrieved from the buffer (350) and prepended (360) onto audio received after the user command. The combined audio stream is then recorded and/or processed (370), such as processing the speech into text”, [0053], see [0020] for overwriting). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result of performing one of: responsive to confirming that the user has not demonstrated the reliable intent to interact with the identified item in the environment of the user, overwriting or discarding the audio input signal stored in the buffer; or responsive to determining that the user has demonstrated the reliable intent to interact with the identified item in the environment of the user, retrieving the audio input signal from the buffer in order to control a device using a buffered user command only after the user’s intention to control the device has been confirmed (see Rosenberg, [0067] and Salvador, [0053]).
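The buffering behavior recited in claims 36-37 (hold incoming audio while intent is verified, then either discard it or retrieve it for speech processing) maps naturally onto a bounded pre-roll buffer. This sketch, including the frame rate and the two-second capacity, is an illustrative assumption in the spirit of Salvador's "two or three seconds" remark ([0020]), not an implementation from the record.

```python
import collections

class AudioPrebuffer:
    """Bounded pre-roll buffer: when full, the oldest frame is overwritten,
    and buffered audio is either retrieved or discarded once the intent
    determination completes (cf. claims 36-37)."""

    def __init__(self, capacity_frames: int):
        self.frames: collections.deque[bytes] = collections.deque(maxlen=capacity_frames)

    def push(self, frame: bytes) -> None:
        self.frames.append(frame)  # silently overwrites the oldest frame when full

    def on_intent_result(self, intent_confirmed: bool) -> bytes | None:
        if intent_confirmed:
            return b"".join(self.frames)  # retrieve for speech processing
        self.frames.clear()               # no reliable intent: discard
        return None

# Assuming 100 frames per second, two seconds of pre-roll needs 200 frames.
buf = AudioPrebuffer(capacity_frames=200)
buf.push(b"\x00\x01")
print(buf.on_intent_result(intent_confirmed=True))
```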
Claim 37: Rosenberg in view of Parker and Salvador discloses the method of claim 36, but does not explicitly disclose wherein the buffer is of sufficient capacity to store an audio signal of a duration at least as long as the time required to confirm that the user has demonstrated the reliable intent to interact with the identified item in the environment of the user and optionally the additional verification operations. However, Salvador does disclose that the buffer may be of any size (“The buffer may be of any size, such as for example two or three seconds”, [0020]). Therefore, it would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention that the buffer may be of sufficient capacity to store an audio input signal of a duration at least as long as the time required to perform the face detection and optionally the additional verification operations because face detection can be performed swiftly (for example, in less than three seconds) in the types of devices disclosed in Salvador (see [0065]).

Claim 30 is rejected under 35 U.S.C. 103 as being unpatentable over Rosenberg (US 2007/0024579) in view of Parker et al. (US 2018/0059781), Salvador et al. (US 2015/0066494) and Dey et al. (US 2014/0010418).

Claim 30: Rosenberg in view of Parker and Salvador discloses the method of claim 28, but does not explicitly disclose wherein the audio detection operation further verifies that the direction from which sound is detected is consistent with the direction of gaze of the user. In a system with a similar face detection operation, Dey discloses verifying that the direction from which sound is detected is consistent with the direction of a detected face (“Detection of lip activity of a user along with audio activity from the same direction can be used to infer a "genuine" speech command”, [0010], see also [0017]). It would have been obvious to one with ordinary skill in the art before the effective filing date of the claimed invention to combine the references to yield the predictable result of verifying that the direction from which sound is detected is consistent with the direction of Salvador’s detected face in order to ascertain that speech came from the user whose face was detected and not from other users in the same environment (see Dey, [0010]).

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SAMUEL G NEWAY whose telephone number is (571)270-1058. The examiner can normally be reached Monday-Friday 9:00am-5:00pm EST. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SAMUEL G NEWAY/
Primary Examiner, Art Unit 2657

Prosecution Timeline

Jul 05, 2023: Application Filed
Dec 11, 2024: Non-Final Rejection (§103)
Mar 05, 2025: Interview Requested
Mar 12, 2025: Examiner Interview Summary
Mar 12, 2025: Applicant Interview (Telephonic)
Mar 17, 2025: Response Filed
May 24, 2025: Final Rejection (§103)
Jul 28, 2025: Interview Requested
Aug 25, 2025: Examiner Interview Summary
Aug 25, 2025: Applicant Interview (Telephonic)
Aug 29, 2025: Request for Continued Examination
Sep 03, 2025: Response after Non-Final Action
Nov 05, 2025: Non-Final Rejection (§103)
Jan 15, 2026: Interview Requested
Jan 26, 2026: Examiner Interview Summary
Jan 26, 2026: Applicant Interview (Telephonic)
Feb 09, 2026: Response Filed
Mar 18, 2026: Final Rejection (§103, current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602538: METHOD AND SYSTEM FOR EXEMPLAR LEARNING FOR TEMPLATIZING DOCUMENTS ACROSS DATA SOURCES (2y 5m to grant; granted Apr 14, 2026)
Patent 12603177: INTERACTIVE CONVERSATIONAL SYMPTOM CHECKER (2y 5m to grant; granted Apr 14, 2026)
Patent 12603092: AUTOMATED ASSISTANT CONTROL OF NON-ASSISTANT APPLICATIONS VIA IDENTIFICATION OF SYNONYMOUS TERM AND/OR SPEECH PROCESSING BIASING (2y 5m to grant; granted Apr 14, 2026)
Patent 12596734: PARSE ARBITRATOR FOR ARBITRATING BETWEEN CANDIDATE DESCRIPTIVE PARSES GENERATED FROM DESCRIPTIVE QUERIES (2y 5m to grant; granted Apr 07, 2026)
Patent 12596892: MACHINE TRANSLATION SYSTEM FOR ENTERTAINMENT AND MEDIA (2y 5m to grant; granted Apr 07, 2026)
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

Expected OA Rounds: 5-6
Grant Probability: 75%
With Interview: 83% (+7.6%)
Median Time to Grant: 3y 0m
PTA Risk: High

Based on 686 resolved cases by this examiner. Grant probability derived from career allow rate.
