DETAILED ACTION
Notice of Pre-AIA or AIA Status.
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
2. Claims 1-20, filed with a preliminary amendment on 11/25/2024, are pending and being examined. Claims 1, 13, and 20 are in independent form.
Claim Rejections - 35 USC § 101
3. 35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
4. Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed inventions are directed to non-statutory subject matter (an abstract idea without significantly more).
4-1. Regarding independent claim 1, the claim recites a method, comprising:
[1] receiving, by one or more processors, multimodal input comprising at least two of text, images, video, or audio;
[2] determining, by the one or more processors and based on one or more criteria, whether to process the multimodal input with character recognition (CR) data through a multimodal model trained to receive the multimodal input and generate a multimodal model output; and
[3] generating, by the one or more processors, the multimodal model output using the multimodal input and the CR data.
Step 1:
With regard to Step 1, claim 1 is directed to a method. Claim 1 therefore falls within one of the statutory categories of invention, i.e., a process.
Step 2A-1:
With regard to Step 2A-1, the elements recited in claim 1, as drafted and under their broadest reasonable interpretation, encompass processes directed to organizing human activity, capable of practical performance in the human mind, or falling within mathematical concepts. For example, “determining, [...], whether to process the multimodal input with character recognition (CR) data [...] to receive the multimodal input and generate a multimodal model output” in step [2], in the context of this claim, encompasses mental observations, evaluations, judgments, and/or opinions that can be performed in the human mind or by a human using pen and paper; the limitation therefore falls within the “mental processes” grouping of abstract ideas. Similarly, “generating, [...], the multimodal model output using the multimodal input and the CR data” in step [3], in the context of this claim, encompasses mental evaluations, judgments, and/or opinions that can be performed in the human mind or by a human using pen and paper; this limitation likewise falls within the “mental processes” grouping of abstract ideas.
Claim 1 therefore recites an abstract idea. If a claim limitation is directed to organizing human activity, can be practically performed in the human mind, or falls within mathematical concepts, then the claim recites an abstract idea. See MPEP 2106.04(a)(2).
Step 2A-2:
The 2019 PEG defines the phrase "integration into a practical application" to require an additional element or a combination of additional elements in the claim to apply, rely on, or use the judicial exception. In the instant case, the additional element of “receiving, [...], multimodal input comprising at least two of text, images, video, or audio” in step [1], under its broadest reasonable interpretation, is mere data gathering recited at a high level of generality and is thus insignificant extra-solution activity. Similarly, “by one or more processors” is recited at a high level of generality and amounts to no more than a mere instruction to apply the exception using generic processors. Similarly, “a multimodal model trained” is used to generally apply the abstract idea without limiting how the trained multimodal model functions; the trained multimodal model is described at such a high level that it amounts to using a computer with a multimodal model to apply the abstract idea. Therefore, the claim as a whole does not integrate the judicial exception into a practical application.
Step 2B:
As explained above, performing the method “by one or more processors” is at best the equivalent of merely adding the words “apply it” to the judicial exception, and the “receiving” in step [1] was considered insignificant extra-solution activity. These conclusions are reevaluated in Step 2B. The limitations are mere data gathering and/or output recited at a high level of generality and amount to receiving (i.e., acquiring), accessing, or transmitting data over a network, which is well-understood, routine, and conventional activity. See MPEP 2106.05(d), subsection II. The limitations remain insignificant extra-solution activity even upon reconsideration. Even when considered in combination, the additional elements present mere instructions to apply an exception and insignificant extra-solution activity, which cannot provide an inventive concept. The claim therefore is ineligible.
4-2. Regarding dependent claims 2-12, viewed individually, the additional elements of these claims, under their broadest reasonable interpretation, either cover performance of the limitations in the mind, perform a mathematical algorithm, or constitute extra-solution activity for data gathering, and they do not provide meaningful limitations that transform the abstract idea into a patent-eligible application such that the claims amount to significantly more than the abstract idea itself. When the claims are viewed as a whole, they do not improve a technology by allowing the technology to perform a function it previously could not perform, and they do not provide any limitations beyond generally linking the use of the abstract idea to a broad technological environment (i.e., computer-based analysis of generic data). Hence, the claimed inventions do not constitute significantly more than the abstract idea, and the claims are rejected under 35 U.S.C. § 101 as being directed to non-statutory subject matter.
4-3. Regarding independent claims 13 and 20, the claims recite a system comprising one or more processors (claim 13) and a non-transitory storage medium (claim 20), each of which is analogous to method claim 1; grounds of rejection analogous to those applied to claim 1 therefore apply to claims 13 and 20. Furthermore, beyond the generic processors and storage medium noted above, the claims do not recite any additional elements, and under Step 2A-2 they do not integrate the abstract idea into a practical application because they recite no additional elements that impose any meaningful limits on practicing the abstract idea. The claims recite an abstract idea.
Because the claims fail under Step 2A, they are further evaluated under Step 2B. The claims do not include any additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are not patent eligible.
4-4. Regarding dependent claims 14-19, they depend from claim 13 and, viewed individually, their additional elements, under their broadest reasonable interpretation, either cover performance of the limitations in the mind, perform a mathematical algorithm, or constitute extra-solution activity for data gathering, and they do not provide meaningful limitations that transform the abstract idea into a patent-eligible application such that the claims amount to significantly more than the abstract idea itself. When the claims are viewed as a whole, they do not improve a technology by allowing the technology to perform a function it previously could not perform, and they do not provide any limitations beyond generally linking the use of the abstract idea to a broad technological environment (i.e., computer-based analysis of generic data). Hence, the claimed inventions do not constitute significantly more than the abstract idea, and the claims are rejected under 35 U.S.C. § 101 as being directed to non-statutory subject matter.
Claim Rejections - 35 USC § 103
5. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
6. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
7. Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (“Convolutional Neural Networks and Multimodal Fusion for Text Aided Image Classification”, 2017) in view of Gupta et al. (“Automatic Assessment of OCR Quality in Historical Documents”, 2015).
Regarding claim 1, Wang discloses a method (the Convolutional Neural Networks (the CNNs) based method/system for text-aided image classification; see fig.1), comprising:
receiving, by one or more processors, multimodal input comprising at least two of text, images (the CNNs may receive a “test image” including text data; see the “text image” of fig.1 and Sec. II-A and II-B), video, or audio;
determining, by the one or more processors and based on one or more criteria, the reliability of character recognition (CR) data through a multimodal model trained to receive the multimodal input and generate a multimodal model output (the CNNs may assign the weight $w_t^j$ defined by Eq. (8) to the textual feature-based classification result $d_2(j)$ to determine the reliability of the textual feature-based classification result and generate the final classification result $q_i^j$ defined by Eq. (11); it should be noted that the textual feature-based classification includes word (character) recognition, see ‘airplanes’ ... and ‘jet’ in fig. 2); and
generating, by the one or more processors, the multimodal model output using the multimodal input and the CR data (the CNNs may output the final classification decision score D for the text image on the basis of the final classification result $q_i^j$, which in turn reflects the reliability of the textual feature-based classification result; see the right column of fig. 1, Eq. (11), Eq. (14), and Sec. III).
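By way of illustration only, the reliability-weighted fusion mapped above can be sketched as follows. This is a minimal Python sketch with hypothetical names; the simple convex combination below is a stand-in for, not a reproduction of, the actual combination defined by Wang's Eqs. (8) and (11).

```python
# Illustrative sketch (hypothetical names, simplified fusion rule):
# a reliability weight w_text is applied to the textual classification
# scores d2 before combining them with the visual scores d1 to produce
# final per-class scores.
import numpy as np

def fuse_scores(d1: np.ndarray, d2: np.ndarray, w_text: float) -> np.ndarray:
    """Combine visual scores d1 and textual scores d2 (one entry per class),
    weighting the textual result by its estimated reliability w_text."""
    assert d1.shape == d2.shape
    return (1.0 - w_text) * d1 + w_text * d2

# Example: unreliable OCR text (low w_text) lets the visual scores dominate.
visual = np.array([0.7, 0.2, 0.1])
textual = np.array([0.1, 0.8, 0.1])
print(fuse_scores(visual, textual, w_text=0.2).argmax())  # -> 0 (visual wins)
```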
As explained above, Wang does not explicitly disclose the feature of “determining [...] whether to process the multimodal input with character recognition (CR) data” recited in the claim. However, Wang discloses assigning the reliability (i.e., the weight) $w_t^j$ defined by Eq. (8) to the textual feature-based classification result $d_2(j)$; see Eqs. (6), (8), (11), and (12). The final classification result $q_i^j$ output by the CNNs is then determined on the basis of the weight $w_t^j$, namely, the reliability of the textual feature-based classification result. In other words, Wang appreciated that the OCR quality issue needs to be considered in multimodal document classification. In fact, in the same field of endeavor, Gupta clearly points out that “when a document has poor quality, the OCR engine generally produces a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document.” See Gupta, Sec. “Introduction”, paragraph 3. To resolve this issue, Gupta, see Sec. “Method”, teaches a “pre-filtering” process prior to OCR, wherein the pre-filtering process uses three criteria (rules), namely OCR word confidence, height-to-width ratio, and area, to determine whether to process the input with character recognition (CR) data. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Gupta into the teachings of Wang and use the OCR word confidence criterion to determine whether to pre-filter input text images, as taught by Gupta. The suggestion or motivation for doing so would have been to provide robust OCR for document images. See Gupta, Sec. “Introduction”, paragraph 3; Wang, Sec. III-B, paragraph 2. Therefore, the claim is unpatentable over Wang in view of Gupta.
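By way of illustration only, the pre-filtering gate attributed to Gupta can be sketched as follows. Rule I paraphrases the passage quoted in the discussion of claims 3 and 14 below; the exact forms of Rules II and III and all threshold values are assumptions made for illustration, not Gupta's reported values.

```python
# Illustrative sketch (hypothetical thresholds): a bounding box is passed
# to OCR only if it survives the word-confidence, height-to-width ratio,
# and area rules.
from dataclasses import dataclass

@dataclass
class BoundingBox:
    confidence: float  # OCR word confidence in [0, 1]
    width: int         # pixels
    height: int        # pixels

def keep_for_ocr(bb: BoundingBox) -> bool:
    """Rule I: very low or very high confidence is predominantly noise.
    Rule II (assumed form): implausible height-to-width ratios are noise.
    Rule III (assumed form): very small or very large areas are noise."""
    if not (0.05 < bb.confidence < 0.95):    # Rule I
        return False
    if not (0.1 < bb.height / bb.width < 5.0):  # Rule II
        return False
    return 50 < bb.width * bb.height < 100_000  # Rule III

boxes = [BoundingBox(0.8, 120, 30), BoundingBox(0.99, 3, 300)]
print([keep_for_ocr(b) for b in boxes])  # -> [True, False]
```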
Regarding claim 2, the combination of Wang and Gupta discloses the method of claim 1, wherein the CR data identifies or characterizes text in the multimodal input (Wang, see the word/text classification decision result D2 output by the CNN in fig.2/fig.1 and Sec. II-B).
Regarding claims 3 and 14, the combination of Wang and Gupta discloses wherein determining whether to process the multimodal input with the CR data comprises: generating the CR data; and determining whether the generated CR data meets the one or more predetermined criteria (Gupta, see “Rule I: OCR word confidence. BBs with very low or very high confidence predominantly consist of noise, and are flagged accordingly during pre-filtering.”), and, in response, generating, by the one or more processors, the multimodal model output using the multimodal input without the CR data (Gupta, performing a “pre-filtering” process on the text image prior to OCR; see “Method”, “Pre-Filtering”).
Regarding claims 4 and 15, the combination of Wang and Gupta discloses wherein determining whether to process the multimodal input with CR data comprises determining whether the multimodal input or the CR data satisfies the one or more predetermined criteria, comprising one or more of whether:
Regarding claim 5, the combination of Wang and Gupta discloses the method of claim 1, wherein determining whether to process the multimodal input with the CR data comprises: determining, based on the multimodal input and the one or more criteria, whether to generate the CR data from the multimodal input; and generating the CR data from the multimodal input (Gupta, determining whether a BB (a bounding box extracted from the input text image) is a noise BB or a text BB and performing OCR only on text BBs; see fig. 2 and “Method”, “Pre-Filtering”).
Regarding claims 6, 16, 17, and 18, the combination of Wang and Gupta discloses wherein the one or more criteria are based on at least one of: the length of the multimodal input, the quantity of images or videos in the multimodal input, the size of the images or video in the multimodal input, or the resolution or quality of components of the multimodal input (Gupta, see “Rule 1”, “Rule 2”, and “Rule 3”).
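By way of illustration only, input-level criteria of the kind recited in these claims can be sketched as follows; the function name and all thresholds are hypothetical.

```python
# Illustrative sketch (hypothetical thresholds): decide whether to run
# character recognition based on properties of the multimodal input itself.
def should_run_cr(num_images: int, min_resolution_px: int, text_len: int) -> bool:
    """Run CR only when the input contains images of usable resolution
    and the accompanying text does not already carry the content."""
    if num_images == 0 or min_resolution_px < 64:
        return False          # nothing to recognize, or quality too low
    return text_len < 1000    # long text input may already suffice

print(should_run_cr(num_images=2, min_resolution_px=300, text_len=120))  # True
```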
Regarding claim 7, the combination of Wang and Gupta discloses the method of claim 1, wherein the method further comprises: generating, by the one or more processors, the CR data; and formatting, by the one or more processors, the multimodal input and the CR data according to one of one or more predetermined formats (Wang, the CNNs may assign the weight $w_t^j$ defined by Eq. (8) to the textual feature-based classification result $d_2(j)$ to determine the reliability of the textual feature-based classification result and then generate the final classification result $q_i^j$ defined by Eq. (11)).
Regarding claim 8, the combination of Wang and Gupta discloses the method of claim 1, wherein determining whether to process the multimodal input with the CR data comprises: training the multimodal model to: receive the multimodal input, and determine, based on the multimodal input, whether to generate a model output with the multimodal input or the multimodal input with the CR data (Gupta, wherein the labelling of noise and text BBs is trained on a dataset to optimize the threshold values; see fig. 5 and Sec. “Results”, “Pre-filtering”; Wang, see the CNN training shown in fig. 1 and described in Sec. II).
Regarding claim 9, the combination of Wang and Gupta discloses the method of claim 8, wherein the method further comprises: training, by the one or more processors, the multimodal model on training data comprising: examples of model outputs generated with multimodal inputs, and examples of model outputs generated with the multimodal inputs and respective CR data identifying or characterizing text in each of the multimodal inputs (ibid.).
Regarding claim 10, the combination of Wang and Gupta discloses the method of claim 9, wherein determining whether to generate the CR data comprises: executing the multimodal model with the multimodal input to generate a first output; executing the multimodal model with the multimodal input and the CR data to generate a second output; and outputting one of the first output and the second output based on a comparison of the first output and the second output (ibid.).
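By way of illustration only, the output comparison recited in claim 10 can be sketched as follows; the model interface and confidence scores are hypothetical.

```python
# Illustrative sketch (hypothetical model interface): generate one output
# without CR data and one with it, then keep whichever the model scores
# higher, mirroring the comparison recited in claim 10.
def select_output(model, multimodal_input, cr_data):
    """model(input, cr_data=None) must return (output, confidence score)."""
    out_plain, score_plain = model(multimodal_input, cr_data=None)
    out_cr, score_cr = model(multimodal_input, cr_data=cr_data)
    return out_cr if score_cr >= score_plain else out_plain

# Toy stand-in model for demonstration.
def toy_model(inp, cr_data=None):
    return ("with OCR", 0.9) if cr_data else ("no OCR", 0.6)

print(select_output(toy_model, {"image": "..."}, cr_data="recognized text"))
```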
Regarding claim 11, the combination of Wang and Gupta discloses the method of claim 1, further comprising: processing, by the one or more processors, the response through a machine learning model trained to generate output at least from the multimodal input (Wang, see the CNN training shown in fig. 1 and described in Sec. II; Gupta, see the training of the noise/text BB labelling shown in fig. 5 and described in Sec. “Results”).
Regarding claims 12 and 19, the combination of Wang and Gupta discloses wherein the CR data is optical character recognition (OCR) data generated by performing an OCR process on at least a portion of the multimodal input (Wang, see text recognition in fig. 2 and Sec. II-B; Gupta, see word recognition in “Rule I: OCR word confidence”).
Regarding claims 13 and 20, each is a variation of claim 1 in a different statutory category and is therefore interpreted and rejected for the reasons set forth in the rejection of claim 1.
Conclusion
8. Any inquiry concerning this communication or earlier communications from the examiner should be directed to RUIPING LI, whose telephone number is (571) 270-3376. The examiner can normally be reached from 8:30 am to 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, HENOK SHIFERAW can be reached on (571)272-4637. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RUIPING LI/Primary Examiner, Ph.D., Art Unit 2676