Last updated: May 29, 2026

Application No. 18/735,672

SPEECH RECOGNITION METHOD AND APPARATUS

Final Rejection §103

Filed

Jun 06, 2024

Priority

Dec 30, 2021 — CN 202111659241.3 +1 more

Examiner

LELAND III, EDWIN S

Art Unit

2654

Tech Center

2600 — Communications

Assignee

Huawei Technologies Co., Ltd.

OA Round

2 (Final)

Interview Optional

Interview lift of -0.6% is below the 15.0% threshold. A written response is recommended.

Based on 456 resolved cases, 2023–2026

Examiner Intelligence

LELAND III, EDWIN S

Grants 75% — above average

Career Allowance Rate

342 granted / 456 resolved

+13.0% vs TC avg

-0.6% (minimal)

Interview Lift

Typical timeline

2y 5m

Avg Prosecution

14 currently pending

Career history

471

Total Applications

across all art units

Statute-Specific Performance

§101

11.1%

-28.9% vs TC avg

§103

67.9%

+27.9% vs TC avg

§102

10.9%

-29.1% vs TC avg

§112

5.6%

-34.4% vs TC avg

Based on career data from 456 resolved cases. Comparisons are against an estimated Tech Center average.

Office Action

§103

DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Receipt is acknowledged of papers submitted under 35 U.S.C. 119(a)-(d), which papers have been placed of record in the file. 

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 9/26/2024, 12/19/2024, 1/19/2025, 7/14/2025 and 12/15/2025 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.

Status of Claims
Claims 1-20 are pending in this application.  

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-3, 10-12 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Du et al. (Chinese Patent Application Publication CN111063162, listed in IDS dated 12/19/2024) in view of Anand et al. (U.S. Patent 12,217,749).
As per claims 1, 10 and 20, Du et al. discloses:
An apparatus (Figure 3 and Paragraph [0073] - multi-modal recognition is performed on the image data and audio data), comprising: 
at least one processor (Paragraphs [0038] & [0059] - includes a memory and a processor, the memory stores a computer program); and 
at least one computer-readable storage medium storing a program that is executable by the at least one processor (Paragraph [0038] - includes a memory and a processor, the memory stores a computer program), the program comprising instructions for: 
obtaining a first speech text (Paragraph [0073] - When the user recognizes the preset key sentence in the code word data table through voice recognition,); and 
obtaining, based on the first speech text, first modal information that matches the first speech text (Paragraph [0073] - Specifically, audio detection may include sound event detection and voice information detection. Further, it may include performing feature extraction on the audio data, outputting text information according to the extracted feature result,), wherein a modality indicated by the first modal information is a first modality in a plurality of preset modalities (Paragraphs [0073-0074] - Different users have different settings… identify whether there are target data that triggers the generation of warning information…recognition is performed on the image data and audio data of the regulatory environment… In one of the embodiments, as shown in FIG. 4, the preset multimodal recognition model includes an image classification unit, an action classification unit, and an item detection and positioning unit); and 
determining, based on the first speech text and the first modal information, a first intention (Paragraph [0073] - If there are multiple preset keywords such as "Hand over the money", "Robbery" and "Help" in the text information, or perform sound event detection on the audio data, identify the dangerous sounds in the audio data such as gunshots, violent impact sound, etc., or sentence intent recognition of audio data, when the sentence intent is recognized as a distress or threat to the safety of others' personal property, it is judged that there is a dangerous situation in the car and an early warning message is generated.) and a first slot (Paragraphs [0070] & [0079] - intercepting the monitoring video frames in the car at a certain time interval… it can also send real-time image data and audio data in the car to the platform) that are indicated by the first speech text when the first speech text matches the first modal information (Paragraphs [0064] & [0073] - performing feature extraction on the audio data, outputting text information according to the extracted feature result… extract the fusion features of multiple data modalities.).
Du et al. fails to disclose, but Anand et al., in the same field of endeavor, teaches:
the first intention is an executable intention supported by the apparatus, and wherein the first slot is a slot holding data needed as input for executing the first intention (Column 18, lines 5-37).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to modify the method, apparatus and computer readable medium of Du et al. with the slot-based intention system of Anand et al. because doing so combines prior art elements according to known methods to yield predictable results.
Claim 1 is directed to the method of using the apparatus of claim 10, and so is rejected for similar reasons.
Claim 20 is directed to a non-transitory computer readable medium storing instructions to cause a processor to act as the apparatus of claim 10, and so is rejected for similar reasons.

As per claims 2 and 11, the combination of Du et al. and Anand et al. discloses all of the limitations of claims 1 and 10 above. Du et al. in the combination further discloses:
obtaining, based on the first speech text, the first modal information that matches the first speech text comprises (Paragraphs [0073] & [0093] - when the sentence intent is recognized as a distress or threat to the safety of others' personal property, it is judged that there is a dangerous situation in the car and an early warning message is generated… when it recognizes that the audio data has preset keywords and generates warning information):
obtaining a multimodal selection vector based on the first speech text, wherein the multimodal selection vector indicates a probability of relevance between the first speech text and each of the plurality of preset modalities (Paragraph [0073] - example of a modality linked to the detected speech words: If there are multiple preset keywords such as "Hand over the money", "Robbery" and "Help" in the text information, or perform sound event detection on the audio data, identify the dangerous sounds in the audio data such as gunshots, violent impact sound, etc.); and
obtaining the first modal information based on the multimodal selection vector (Paragraph [0098] - The preset multi-modal recognition model is based on the training of scene data constructed by multi-modal data under the regulatory environment. Detection methods are activated in certain modalities based on detected speech words, which is the functional equivalent of using a two-bit multimodal selection vector).
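
The distinction at issue in this mapping, between a vector of relevance probabilities and keyword-triggered activation of detection methods, can be illustrated with a minimal Python sketch. It is not taken from the application, Du et al., or Anand et al.: the modality names, the raw scores, and both function names are hypothetical, and softmax is only one assumed way to turn scores into probabilities.

```python
import math

# Hypothetical preset modalities; the claims only require "a plurality".
PRESET_MODALITIES = ["image", "audio_event", "gesture"]


def selection_vector(scores):
    """Softmax: map raw relevance scores to probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def pick_modality(probabilities):
    """Return the preset modality with the highest relevance probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return PRESET_MODALITIES[best]


# Assumed scores for one speech text, e.g. produced by some text encoder.
raw_scores = [0.2, 2.5, 1.0]
vector = selection_vector(raw_scores)
print([round(p, 3) for p in vector])
print(pick_modality(vector))  # "audio_event", the largest score
```

On the examiner's reading, keyword-triggered activation corresponds to a vector whose entries are only 0 or 1, whereas the sketch shows graded probabilities across all preset modalities.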

As per claims 3 and 12, the combination of Du et al. and Anand et al. discloses all of the limitations of claims 2 and 11 above. Du et al. in the combination further discloses:
obtaining the multimodal selection vector based on the first speech text comprises (Paragraph [0073] - sentence intent is recognized):
determining a first context category to which the first speech text belongs (Paragraph [0073] - If there are multiple preset keywords such as "Hand over the money", "Robbery" and "Help" in the text information); and
obtaining the multimodal selection vector based on the first context category, wherein the multimodal selection vector indicates a probability of relevance between the first context category and each of the plurality of preset modalities (Paragraphs [0073], [0093] & [0068] - three examples of context relevance - when the sentence intent is recognized as a distress or threat to the safety of others' personal property, it is judged that there is a dangerous situation in the car and an early warning message is generated; when it recognizes that the audio data has preset keywords and generates warning information; when the voiceprint recognizes that the user's voice is an adult voice and the user's age group recognition result is a child, the corresponding warning information is filtered.).

As per claim 19, the combination of Du et al. and Anand et al. discloses all of the limitations of claim 10 above. Du et al. in the combination further discloses:
performing an operation related to the first intention (Paragraph [0073] - sentence intent recognition of audio data, when the sentence intent is recognized as a distress or threat to the safety of others' personal property, it is judged that there is a dangerous situation in the car and an early warning message is generated.).

Allowable Subject Matter
Claims 4-9 and 13-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Examiner Notes
The Examiner cites particular columns and line numbers in the references as applied to the claims above for the convenience of the Applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the Applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or as disclosed by the Examiner.
Communications via Internet e-mail are at the discretion of the applicant and require written authorization. Should the Applicant wish to communicate via e-mail, including the following paragraph in their response will allow the Examiner to do so:
“Recognizing that Internet communications are not secure, I hereby authorize the USPTO to communicate with me concerning any subject matter of this application by electronic mail. I understand that a copy of these communications will be made of record in the application file.”
Should e-mail communication be desired, the Examiner can be reached at Edwin.Leland@USPTO.gov

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
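
The reply windows described in this paragraph can be worked through with a short Python sketch. It assumes the mailing date of Mar 26, 2026 shown for the Final Rejection in the Prosecution Timeline on this page; the helper `add_months` is a hypothetical name, and the sketch does not handle weekends, federal holidays, or the advisory-action exception.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the same day-of-month `months` later, clamped to month end."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


# Assumption: mailing date taken from the timeline entry for the Final Rejection.
mailing_date = date(2026, 3, 26)

two_month_mark = add_months(mailing_date, 2)     # first-reply window
shortened_period = add_months(mailing_date, 3)   # THREE MONTHS
statutory_maximum = add_months(mailing_date, 6)  # SIX MONTHS outer limit

print(two_month_mark)     # 2026-05-26
print(shortened_period)   # 2026-06-26
print(statutory_maximum)  # 2026-09-26
```

Under these assumptions, the response dated May 22, 2026 in the timeline falls inside the two-month window.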
Any inquiry concerning this communication or earlier communications from the examiner should be directed to EDWIN S LELAND III whose telephone number is (571)270-5678. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan, can be reached at 571-272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/EDWIN S LELAND III/Primary Examiner, Art Unit 2654


Prosecution Timeline

Jun 06, 2024

Application Filed

Jan 05, 2026

Non-Final Rejection mailed — §103

Mar 16, 2026

Response Filed

Mar 26, 2026

Final Rejection mailed — §103

May 22, 2026

Response after Non-Final Action

Precedent Cases

Applications granted by this same examiner with similar technology

18/315,789

Patent 12608548

METHODS AND SYSTEMS FOR PARSING A MIX OF FEATURES AND INSTRUCTIONS INTO A PROMPT

2y 11m to grant Granted Apr 21, 2026

18/654,795

Patent 12596869

DETECTING ARTIFICIAL INTELLIGENCE GENERATED TEXT

1y 11m to grant Granted Apr 07, 2026

17/936,873

Patent 12591602

TRAINING MACHINE LEARNING BASED NATURAL LANGUAGE PROCESSING FOR SPECIALTY JARGON

3y 6m to grant Granted Mar 31, 2026

17/993,063

Patent 12579370

MULTILINGUAL CHATBOT

3y 3m to grant Granted Mar 17, 2026

18/602,835

Patent 12579986

Systems and Methods for Distinguishing Between Human Speech and Machine Generated Speech

2y 0m to grant Granted Mar 17, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

3-4

Expected OA Rounds

75%

Grant Probability

74%

With Interview (-0.6%)

2y 5m (~5m remaining)

Median Time to Grant

Moderate

PTA Risk

Based on 456 resolved cases by this examiner. Grant probability derived from career allowance rate.
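
The grant probability figures follow from simple arithmetic on the examiner's career counts, sketched here in Python. Treating the interview lift as an additive change in percentage points is an assumption; the page does not state how the lift is applied.

```python
# Figures copied from this page.
granted = 342
resolved = 456
interview_lift = -0.006  # -0.6 percentage points (assumed additive)

allowance_rate = granted / resolved
with_interview = allowance_rate + interview_lift

print(f"{allowance_rate:.1%}")  # 75.0%
print(f"{with_interview:.1%}")  # 74.4%
```

Rounded to whole percentages, these match the 75% and 74% shown in the projections.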