Prosecution Insights
Last updated: April 19, 2026
Application No. 18/041,370

METHOD OF TRAINING TEXT DETECTION MODEL, METHOD OF DETECTING TEXT, AND DEVICE

Status: Non-Final OA (§102, §103)
Filed: Feb 10, 2023
Examiner: ANSARI, TAHMINA N
Art Unit: 2674
Tech Center: 2600 — Communications
Assignee: BEIJING BAIDU NETCOM SCIENCE TECHNOLOGY CO., LTD.
OA Round: 1 (Non-Final)
Grant Probability: 86% (Favorable)
OA Rounds: 1-2
To Grant: 2y 8m
With Interview: 99%

Examiner Intelligence

Career Allow Rate: 86% (above average); 743 granted / 868 resolved; +23.6% vs TC avg
Interview Lift: +17.9% (strong), comparing resolved cases with vs. without interview
Typical Timeline: 2y 8m avg prosecution; 33 applications currently pending
Career History: 901 total applications across all art units

Statute-Specific Performance

§101: 12.2% (-27.8% vs TC avg)
§102: 22.6% (-17.4% vs TC avg)
§103: 40.4% (+0.4% vs TC avg)
§112: 10.5% (-29.5% vs TC avg)

Based on career data from 868 resolved cases; Tech Center averages are estimates.
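As a sanity check, the headline rates above follow directly from the raw counts the dashboard reports. A quick illustrative computation (the TC-average baseline below is back-derived from the stated delta, since the dashboard does not report it directly):

```python
# Recompute the dashboard's headline rates from its raw counts.
granted = 743    # career grants (from the dashboard)
resolved = 868   # career resolved cases

allow_rate = granted / resolved
print(f"Career allow rate: {allow_rate:.1%}")  # 85.6%, displayed as 86%

# Each per-statute delta is the examiner's figure minus the Tech Center
# average, so the §101 row (12.2%, -27.8% vs TC avg) implies the TC average:
examiner_101 = 0.122
delta_101 = -0.278
tc_avg_101 = examiner_101 - delta_101
print(f"Implied §101 TC average: {tc_avg_101:.1%}")  # 40.0%
```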

Office Action

§102 §103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

Applicant has filed a preliminary amendment dated February 10, 2023 to address concerns regarding multiple dependencies, to cancel claims 9-16 and 19, and to add new claims 20-29. Claims 1-8, 17-18, and 20-29 are pending in this application.

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Specification

The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-3, 5-6, 8, 17-18, 20-21, 23-24, and 26-29 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Aggarwal et al. (US PGPub 2020/0302016 A1, hereinafter referred to as “Aggarwal”).

Consider Claim 1. Aggarwal teaches:

1.
(Original) A method of training a text detection model, wherein the text detection model comprises a text feature extraction sub-model, a text encoding sub-model, a decoding sub- model and an output sub-model, the method comprising: (Aggarwal: abstract; Classifying structural features of a digital document by feature type using machine learning is leveraged in a digital medium environment. A document analysis system is leveraged to extract structural features from digital documents, and to classifying the structural features by respective feature types. To do this, the document analysis system employs a character analysis model and a classification model. The character analysis model takes text content from a digital document and generates text vectors that represent the text content. A vector sequence is generated based on the text vectors and position information for structural features of the digital document, and the classification model processes the vector sequence to classify the structural features into different feature types. The document analysis system can generate a modifiable version of the digital document that enables its structural features to be modified based on their respective feature types. [0039]-[0048], Figure 1; [0043] The analysis manager module 108 further includes a feature extraction module 124, a character analysis model 126, and a classification model 128. The feature extraction module 124 is representative of functionality to analyze and extract different features of the digital documents 114, such as the structural features 120. In at least one implementation, the feature extraction module 124 utilizes computer vision processes to analyze and extract the structural features 120 from the digital documents 114. 
The character analysis model 126 and the classification model 128 represent different machine learning models that take the structural features 120 as input, and generate feature categorizations 130 that classify individual structural features 120 into different pre-defined categories of features. Implementations of the character analysis model 126 and the classification model 128 are detailed below.) 1. inputting a sample image containing a text into the text feature extraction sub- model to obtain a first text feature of the text contained in the sample image, (Aggarwal: [0043] The analysis manager module 108 further includes a feature extraction module 124, a character analysis model 126, and a classification model 128. The feature extraction module 124 is representative of functionality to analyze and extract different features of the digital documents 114, such as the structural features 120. In at least one implementation, the feature extraction module 124 utilizes computer vision processes to analyze and extract the structural features 120 from the digital documents 114. The character analysis model 126 and the classification model 128 represent different machine learning models that take the structural features 120 as input, and generate feature categorizations 130 that classify individual structural features 120 into different pre-defined categories of features. Implementations of the character analysis model 126 and the classification model 128 are detailed below. [0058]-[0060], Figure 3; [0058] FIG. 3 depicts an example system 300 that describes a way for extracting features from a digital document to identify structural features that can then be categorized. In the system 300, a digital document 302 is input into the feature extraction module 124. Accordingly, the feature extraction module 124 extracts structural features 304 from the digital document 302 including position information 306 and for at least some of the structural features, text 308.) 1. 
wherein the sample image has a label indicating an actual position information of the text contained in the sample image and an actual category for the actual position information; (Aggarwal: [0051] In this example system, extracting the structural features 204 generates position information 206 and text 208. The position information 206 identifies locations of individual structural features on a page of the digital document 202. For instance, when the feature extraction module 124 identifies a particular structural feature 204 on a page of the digital document 202, the feature extraction module 124 utilizes a bounding box to enclose the particular structural feature 204 and separate it from other structural features 204. Thus, the position information 206 describes attributes of the bounding box, such as spatial coordinates of the bounding box. In at least one implementation, the spatial coordinates are described with reference to the geometry of a page of the digital document 202 from which the particular structural feature 204 is extracted. For instance, for the bounding box of the particular structural feature 204, the position information 206 includes an x-coordinate and a y-coordinate for an upper left corner of the bounding box and with reference to the upper left corner of the page of the digital document 202. Further, the position information 206 includes a width and a height of the bounding box, such as in pixels and/or distance measurement, e.g., dots per inch (dpi), millimeters (mm), and so forth. Thus, in such implementations, the position information 206 includes these coordinates for each of the structural features 204.) 1. inputting a predetermined text vector into the text encoding sub-model to obtain a first text reference feature; (Aggarwal: [0052]-[0053] To enable the structural features 204 to be categorized, vector representations of the structural features 204 can be generated. 
Accordingly, the character analysis model 126 takes the text 208 for each of the structural features 204 as input, and generates text vectors 210 from the text 208 for each of the features 204. Generally, the text vectors 210 are implemented as numerical representations of the text 208. Example ways for generating the text vectors 210 are detailed below, such as with reference to FIG. 7.) 1. inputting the first text feature and the first text reference feature into the decoding sub-model to obtain a first text sequence vector; (Aggarwal: [0054]The text vectors 210 and the position information 206 are then passed to a sequence generator module 212, which generates feature vectors 214 using the text vectors 210 and the position information 206. In at least some implementations, the position information 206 is generated by the feature extraction module 124 as numerical vectors (e.g., the spatial information described above), and thus is combinable with the text vectors 210 to generate the feature vectors 214. For instance, consider that the text vectors 210 are each represented as a vector vt for each of the structural features 204, and the position information 206 is represented as a vector vs for each of the structural features 204.) 1. inputting the first text sequence vector into the output sub-model to obtain a predicted position information of the text contained in the sample image and a predicted category for the predicted position information; (Aggarwal: [0055] After generating a feature vector 214 for each of the structural features 204, the sequence generator module 212 generates a vector sequence 216 using the feature vectors 214. In at least one implementation, to generate the vector sequence 216, the sequence generator module 212 concatenates the feature vectors 214 based on the position of their respective structural features 204 in the digital document 202. 
For instance, after obtaining a feature vector 214 corresponding to each structural feature 204 in a page of the digital document 202, the sequence generator module 212 geographically sorts the structural features 204, first vertically from top-down in the page. The sequence generator module 212 then picks a first structural feature 204, and considers all the elements (e.g., pixels) which lie vertically within the height range of its bounding box and sorts them horizontally leaving the remaining sequence of elements for other structural features 204 undisturbed. The sequence generator module 212 repeats this process for the elements in the remaining set structural features 204. In this way, the sequence generator module 212 sorts the structural features 204 and their corresponding elements vertically top-bottom and then horizontally in left-right manner in reference to a page of the digital document 202. This arranges the elements in natural reading order, e.g., left-to-right and top-to-bottom according to some written language reading orders. The sequence generator module 212 thus obtains the vector sequence 216 S=vc1, vc2, vc3, . . . vcn as a result of the sorting operation with n being number of structural features in a page.) 1. and training the text detection model based on the predicted category, the actual category, the predicted position information and the actual position information. (Aggarwal: [0044] To enable the character analysis model 126 and the classification model 128 to generate the feature categorizations 130, the document analysis system 102 maintains training data 132 stored on the storage 112. Generally, the training data 132 can be utilized by the analysis manager module 108 to train the character analysis model 126 and the classification model 128 prior to processing the structural features 120. 
[0075] Generally, this architecture is trained to predict the next character in a text sequence based on the sequence of characters received as input. Accordingly, the parameters of the LSTM cell 702 are trained so that they understand the input text sequence at a character level since the LSTM unit maintains the context and generates a hidden representation which is used by the neural architecture that follows it (e.g., the intermediate embedding layer 706 and the output layer 708) for predicting the next character in a text sequence 710. According to various implementations, text block data from tagged training documents 134 is used to train the character analysis model 126 by splitting text in a text block arbitrarily such that the character analysis model 126 is trained to predict a next character given the sequence of characters before the split as input. [0076] Accordingly, to generate vector representation of a text sequence 710 (e.g., a text block) of a digital document, the text is fed as a sequence of text characters into the character analysis model 126 described above and the output of the LSTM is extracted and used as the embedding representing the input text. In certain implementations where text content is obtained via OCR on a document image, the text may have mistakes such as character mutation. For example, an OCR process might read ‘p’ as ‘a’, ‘l’ as ‘i’, and so forth. To take this into account, the training data 132 is mutated probabilistically so that the embeddings obtained from the trained character analysis model 126 are robust to such alterations at classification time. ) Consider Claim 2. Aggarwal teaches: 2. 
(Original) The method according to claim 1, wherein the text feature extraction sub-model comprises an image feature extraction network and a sequence encoding network, and the text detection model further comprises a first position encoding sub-model, and wherein obtaining the first text feature of the text contained in the sample image comprises: inputting the sample image into the image feature extraction network to obtain an image feature of the sample image; inputting a predetermined position vector into the first position encoding sub- model to obtain a position encoding feature; and adding the position encoding feature and the image feature, and inputting the added position encoding feature and image feature into the sequence encoding network to obtain the first text feature. (Aggarwal: [0041] The document analysis system 102 includes an analysis manager module 108 that is representative of functionality to analyze and categorize structural features of digital documents further to techniques for classifying structural features of a digital document by feature type using machine learning described herein. As part of enabling the analysis manager module 108 to perform such analyses and categorization, the document analysis system 102 maintains document data 110 in a storage 112. The document data 110 generally represents various attributes of digital documents and includes digital documents 114 and modified digital documents (“modified documents”) 116. The digital documents 114 generally represent different instances of electronic digital content that can be output in various ways and in various forms, such as via display on a display device 118 of the client device 104. Examples of the digital documents 114 include digital forms, digital publications, digital text documents, web content (e.g., web pages), and so forth. In at least some implementations, the digital documents 114 include image-based digital documents, such as PDF documents. 
An image-based digital document, for example, represents a digital document with content encoded as images, in contrast with other types of digital documents that may include machine-encoded text and other types of machine-encoded content. In at least one implementation, a digital document 114 represents an electronic document consisting of images only without any machine-encoded text or other editable graphics. [0042] The digital documents 114 include structural features 120, with some of the structural features including text 122. The structural features 120 represent visual elements of digital documents 114, such as visual structures that make up a digital document 114. Generally, a particular digital document 114 can be characterized as a set of structural features 120 that are arranged in a particular way to generate the visual appearance of the particular digital document 114. Examples of the structural features 120 include text blocks, fillable form fields, selectable options, lists, list items, bullets and bulleted items, and so forth. The text 122 includes representations of text characters, such as words, phrases, sections of text, and so forth. In an implementation where a digital document 114 is an image-based document, the text 122 is implemented as an image of text characters, i.e., the text 122 is not machine-encoded text. [0043] The analysis manager module 108 further includes a feature extraction module 124, a character analysis model 126, and a classification model 128. The feature extraction module 124 is representative of functionality to analyze and extract different features of the digital documents 114, such as the structural features 120. In at least one implementation, the feature extraction module 124 utilizes computer vision processes to analyze and extract the structural features 120 from the digital documents 114. 
The character analysis model 126 and the classification model 128 represent different machine learning models that take the structural features 120 as input, and generate feature categorizations 130 that classify individual structural features 120 into different pre-defined categories of features. Implementations of the character analysis model 126 and the classification model 128 are detailed below.) Consider Claim 3. Aggarwal teaches: 3. (Original) The method according to claim 2, wherein the image feature extraction network comprises a plurality of feature processing units connected in sequence and a feature conversion unit, and wherein obtaining the image feature of the sample image comprises: obtaining, by using the feature conversion unit, a one-dimensional vector representing the sample image based on the sample image; and inputting the one-dimensional vector into a first feature processing unit among the plurality of feature processing units, so that the one-dimensional vector is sequentially processed by the plurality of feature processing units to obtain the image feature of the sample image, wherein resolutions of feature maps output by the plurality of feature processing units are sequentially reduced according to a connection sequence. (Aggarwal: [0043] The analysis manager module 108 further includes a feature extraction module 124, a character analysis model 126, and a classification model 128. The feature extraction module 124 is representative of functionality to analyze and extract different features of the digital documents 114, such as the structural features 120. In at least one implementation, the feature extraction module 124 utilizes computer vision processes to analyze and extract the structural features 120 from the digital documents 114. 
The character analysis model 126 and the classification model 128 represent different machine learning models that take the structural features 120 as input, and generate feature categorizations 130 that classify individual structural features 120 into different pre-defined categories of features. Implementations of the character analysis model 126 and the classification model 128 are detailed below. [0044] To enable the character analysis model 126 and the classification model 128 to generate the feature categorizations 130, the document analysis system 102 maintains training data 132 stored on the storage 112. Generally, the training data 132 can be utilized by the analysis manager module 108 to train the character analysis model 126 and the classification model 128 prior to processing the structural features 120. The training data 132, for instance, includes training digital documents (“training documents”) 134, which include tagged structural features (“tagged features”) 136. The tagged features 136, for instance, are generated by processing (e.g., manually) the digital documents 114 and applying tags to the tagged features 136 that identify which category each tagged feature 136 belongs to. The tagged features 136 can then be used to train the character analysis model 126 and the classification model 128 to categorize the structural features 120. [0045]-[0046] In at least one implementation, a reformatted digital document 114 is generated to be adapted for display on a display device 118 of the client device 104. For instance, consider that a particular digital document 114 is originally generated for display on a large form factor display, such as a desktop computer display. Consider further that the display device 118 is a small form factor display, such as a mobile phone. Accordingly, the document editor module 138 can receive device attributes 142 from the client device 104 that indicate attributes of the display device 118. 
The device attributes 142, for instance, represent data that describes different attributes of the display device 118, such as display size, aspect ratio, resolution, display technology, make and model, and so forth. The document editor module 138 can then utilize the device attributes 142 to generate a reformatted document 114 that is formatted for display on the display device 118. Generating the reformatted document 114, for instance, involves manipulating various attributes of a set of structural features of a particular digital document 114 to generate the reflowed features 140 for the reformatted document 114. Generally, this enables the reformatted document 114 to be properly displayed on the display device 118. (Aggarwal: [0055] After generating a feature vector 214 for each of the structural features 204, the sequence generator module 212 generates a vector sequence 216 using the feature vectors 214. In at least one implementation, to generate the vector sequence 216, the sequence generator module 212 concatenates the feature vectors 214 based on the position of their respective structural features 204 in the digital document 202. For instance, after obtaining a feature vector 214 corresponding to each structural feature 204 in a page of the digital document 202, the sequence generator module 212 geographically sorts the structural features 204, first vertically from top-down in the page. The sequence generator module 212 then picks a first structural feature 204, and considers all the elements (e.g., pixels) which lie vertically within the height range of its bounding box and sorts them horizontally leaving the remaining sequence of elements for other structural features 204 undisturbed. The sequence generator module 212 repeats this process for the elements in the remaining set structural features 204. 
In this way, the sequence generator module 212 sorts the structural features 204 and their corresponding elements vertically top-bottom and then horizontally in left-right manner in reference to a page of the digital document 202. This arranges the elements in natural reading order, e.g., left-to-right and top-to-bottom according to some written language reading orders. The sequence generator module 212 thus obtains the vector sequence 216 S=vc1, vc2, vc3, . . . vcn as a result of the sorting operation with n being number of structural features in a page.) Consider Claim 5. Aggarwal teaches: 5. (Original) The method according to claim 3, wherein the text detection model further comprises a second position encoding sub-model, and wherein the obtaining, by using the feature conversion unit, a one-dimensional vector representing the sample image comprises: obtaining, by using the second position encoding sub-model, a position map of the sample image based on the sample image; and pixel-wise adding the sample image and the position map, and inputting the added sample image and position map into the feature conversion unit to obtain the one-dimensional vector representing the sample image. (Aggarwal: [0075] FIG. 7 depicts an example system 700 for obtaining text vectors using the character analysis model 126. In this particular example, the character analysis model 126 includes an LSTM cell 702, an LSTM output embedding layer 704, intermediate embedding layer 706, and an output layer 708. In order to generate vector embeddings from the text 308, the character analysis model 126 sequentially processes the characters present in the input text. For instance, for the text 308 from each of the structural features 304, the character analysis model 126 processes text sequences 710 (e.g., sequences of text inputs of arbitrary length) to obtain a representation which captures and incorporates long term dependencies within the text sequences 710. 
Once the LSTM cell 702 processes the sequences of characters present in each of the input text sequences 710, its output from the output embedding layer 704 is fed as input to the intermediate embedding layer 706 (e.g., a fully connected layer) followed by the output layer 708 (e.g., another fully connected layer), which outputs text vectors 712. Generally, the text vectors 712 include individual text vectors for text 308 from each of the structure features 304. In at least some implementations, the output layer 708 utilizes softmax activation to normalize the text vectors 712.) Consider Claim 6. Aggarwal teaches: 6. (Original) The method according to claim 1, wherein the training the text detection model comprises: determining a classification loss of the text detection model based on the predicted category and the actual category; determining a positioning loss of the text detection model based on the predicted position information and the actual position information; and training the text detection model based on the classification loss and the positioning loss. (Aggarwal: [0096] In configuring the classification model 128, the size of the forward and backward LSTMs in the context determination model 902 can be set to 500 resulting in hj having a dimension of 1000. Further, the size of the decoder model 904 Decθ is set to 1000 and the size of attention layer 906 is set to 500. In one or more implementations, a batch size of 8 is used while training. Further, for optimizing model parameters, an Adam Optimizer can be used with a learning rate of 10−4. Generally, the model parameters are optimized to maximize the log likelihood of feature types in the pages of the training documents 134. 
In at least one implementation, this can be achieved by minimizing the mean (taken over multiple pages of the training documents 134) of cross entropy loss between the predicted softmax probability distribution of each structural feature in a page and one-hot vectors corresponding to their actual output class. Hence, the objective loss function becomes:

loss = -(1/N) Σ_{i=1..N} Σ_{j=1..n} l_j^i ⋅ log(p_j^i)

where “⋅” is the dot product operation, N is the number of pages in a training document 134, n is the maximum number of structural features in a page of a training document, and the summation over j is performed to account for all structural features in a page. p_j^i is a softmax probability vector (as predicted by the models) over the different possible output categories, and l_j^i is the one-hot vector corresponding to the actual class of the j-th structural feature in the i-th training document 134, with the ordering of the structural features done spatially as discussed in the ‘preprocessing’ section, above.)

Consider Claim 8. Aggarwal teaches:

8. (Currently Amended) A method of detecting a text by using a text detection model, wherein the text detection model comprises a text feature extraction sub-model, a text encoding sub-model, a decoding sub-model and an output sub-model, the method comprising: inputting an image to be detected containing a text into the text feature extraction sub-model to obtain a second text feature of the text contained in the image to be detected; inputting a predetermined text vector into the text encoding sub-model to obtain a second text reference feature; inputting the second text feature and the second text reference feature into the decoding sub-model to obtain a second text sequence vector; and inputting the second text sequence vector into the output sub-model to obtain a position of the text contained in the image to be detected, wherein the text detection model is trained using the method of claim 1.
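For orientation, the four sub-model flow recited in claim 8 can be sketched as a minimal data-flow stub in Python. This is an illustrative sketch only: the function names and stub return values are hypothetical and are not drawn from the application or from Aggarwal.

```python
# Hypothetical stubs showing only the order of data flow in claim 8;
# real sub-models would be trained neural networks, not string tags.

def text_feature_extraction(image):
    return f"text_feature({image})"             # second text feature

def text_encoding(text_vector):
    return f"reference_feature({text_vector})"  # second text reference feature

def decoding(text_feature, reference_feature):
    return (text_feature, reference_feature)    # second text sequence vector

def output_sub_model(sequence_vector):
    return {"position": sequence_vector}        # position of the detected text

def detect_text(image, predetermined_text_vector):
    feat = text_feature_extraction(image)               # step 1
    ref = text_encoding(predetermined_text_vector)      # step 2
    seq = decoding(feat, ref)                           # step 3
    return output_sub_model(seq)                        # step 4

result = detect_text("image_to_detect", "predetermined_vector")
print(result["position"])
```

Per claim 1, the training method follows the same flow, with the output sub-model additionally producing a predicted category that is compared against the labeled actual category and actual position to train the model.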
(Aggarwal: [0039]-[0048], Figure 1; [0043] The analysis manager module 108 further includes a feature extraction module 124, a character analysis model 126, and a classification model 128. The feature extraction module 124 is representative of functionality to analyze and extract different features of the digital documents 114, such as the structural features 120. In at least one implementation, the feature extraction module 124 utilizes computer vision processes to analyze and extract the structural features 120 from the digital documents 114. The character analysis model 126 and the classification model 128 represent different machine learning models that take the structural features 120 as input, and generate feature categorizations 130 that classify individual structural features 120 into different pre-defined categories of features. Implementations of the character analysis model 126 and the classification model 128 are detailed below. [0058]-[0060], Figure 3; [0058] FIG. 3 depicts an example system 300 that describes a way for extracting features from a digital document to identify structural features that can then be categorized. In the system 300, a digital document 302 is input into the feature extraction module 124. Accordingly, the feature extraction module 124 extracts structural features 304 from the digital document 302 including position information 306 and for at least some of the structural features, text 308.) Consider Claims 9-16. (Canceled) Consider Claim 17. Aggarwal teaches: 17. (Currently Amended) An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to implement the method of claim 1. 
(Aggarwal: [0039]-[0048], Figure 1; [0043] The analysis manager module 108 further includes a feature extraction module 124, a character analysis model 126, and a classification model 128. The feature extraction module 124 is representative of functionality to analyze and extract different features of the digital documents 114, such as the structural features 120. In at least one implementation, the feature extraction module 124 utilizes computer vision processes to analyze and extract the structural features 120 from the digital documents 114. The character analysis model 126 and the classification model 128 represent different machine learning models that take the structural features 120 as input, and generate feature categorizations 130 that classify individual structural features 120 into different pre-defined categories of features. Implementations of the character analysis model 126 and the classification model 128 are detailed below. [0058]-[0060], Figure 3; [0058] FIG. 3 depicts an example system 300 that describes a way for extracting features from a digital document to identify structural features that can then be categorized. In the system 300, a digital document 302 is input into the feature extraction module 124. Accordingly, the feature extraction module 124 extracts structural features 304 from the digital document 302 including position information 306 and for at least some of the structural features, text 308.)

Consider Claim 18. Aggarwal teaches:

18. (Currently Amended) A non-transitory computer-readable storage medium having computer instructions therein, wherein the computer instructions are configured to cause a computer to implement the method of claim 1. (Aggarwal: [0102] FIG. 12 depicts an example procedure 1200 for modifying a digital document. Step 1202 receives an instruction to modify a modifiable version of a digital document.
In at least one implementation, the instruction is generated by an automated process, such as automatically by the document editor module 138. For instance, the document editor module 138 determines that the digital document is to be displayed on a particular device, such as the client device 104. Accordingly, the document editor module 138 can generate an instruction to modify the digital document to be displayable on the client device 104. The instruction, for example, can specify that the digital document is to be modified for display based on various attributes of the display device 118 of the client device 104, such as display size, resolution, and so forth. [0107]-[0120] Figure 13, [0108] The example computing device 1302 as illustrated includes a processing system 1304, one or more computer-readable media 1306, and one or more I/O interfaces 1308 that are communicatively coupled, one to another. Although not shown, the computing device 1302 may further include a system bus or other data and command transfer system that couples the various components, one to another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. A variety of other examples are also contemplated, such as control and data lines.) Consider Claim 19. (Canceled) Consider Claim 20. Aggarwal teaches: 20. 
(New) The electronic device according to claim 17, wherein the text feature extraction sub-model comprises an image feature extraction network and a sequence encoding network, and the text detection model further comprises a first position encoding sub-model, and wherein the instructions are further configured to cause the at least one processor to at least: input the sample image into the image feature extraction network to obtain an image feature of the sample image; input a predetermined position vector into the first position encoding sub-model to obtain a position encoding feature; and add the position encoding feature and the image feature, and input the added position encoding feature and image feature into the sequence encoding network to obtain the first text feature. (Aggarwal: [0041] The document analysis system 102 includes an analysis manager module 108 that is representative of functionality to analyze and categorize structural features of digital documents further to techniques for classifying structural features of a digital document by feature type using machine learning described herein. As part of enabling the analysis manager module 108 to perform such analyses and categorization, the document analysis system 102 maintains document data 110 in a storage 112. The document data 110 generally represents various attributes of digital documents and includes digital documents 114 and modified digital documents (“modified documents”) 116. The digital documents 114 generally represent different instances of electronic digital content that can be output in various ways and in various forms, such as via display on a display device 118 of the client device 104. Examples of the digital documents 114 include digital forms, digital publications, digital text documents, web content (e.g., web pages), and so forth. In at least some implementations, the digital documents 114 include image-based digital documents, such as PDF documents. 
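The claim 20 pipeline recited above — an image feature produced by the extraction network, summed with a position encoding feature, with the sum fed to a sequence encoding network — can be sketched roughly as follows. This is a minimal illustration only: the sinusoidal encoding, the shapes, and every name below are assumptions for illustration, not taken from the application or from Aggarwal.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal encoding -- one plausible stand-in for the
    'first position encoding sub-model'; the application may use a
    learned encoding instead."""
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

# Hypothetical image feature: 64 spatial positions x 128 channels, as if
# produced by the image feature extraction network from a sample image.
image_feature = np.random.rand(64, 128)

# Claim 20's step: add the position encoding feature and the image feature;
# the sum would then be input to the sequence encoding network.
encoded = image_feature + sinusoidal_position_encoding(64, 128)
assert encoded.shape == (64, 128)
```

The element-wise addition requires the encoding and the image feature to share a shape, which is why the claim treats the position encoding as a feature in the same space as the image feature.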
An image-based digital document, for example, represents a digital document with content encoded as images, in contrast with other types of digital documents that may include machine-encoded text and other types of machine-encoded content. In at least one implementation, a digital document 114 represents an electronic document consisting of images only without any machine-encoded text or other editable graphics. [0042] The digital documents 114 include structural features 120, with some of the structural features including text 122. The structural features 120 represent visual elements of digital documents 114, such as visual structures that make up a digital document 114. Generally, a particular digital document 114 can be characterized as a set of structural features 120 that are arranged in a particular way to generate the visual appearance of the particular digital document 114. Examples of the structural features 120 include text blocks, fillable form fields, selectable options, lists, list items, bullets and bulleted items, and so forth. The text 122 includes representations of text characters, such as words, phrases, sections of text, and so forth. In an implementation where a digital document 114 is an image-based document, the text 122 is implemented as an image of text characters, i.e., the text 122 is not machine-encoded text. [0043] The analysis manager module 108 further includes a feature extraction module 124, a character analysis model 126, and a classification model 128. The feature extraction module 124 is representative of functionality to analyze and extract different features of the digital documents 114, such as the structural features 120. In at least one implementation, the feature extraction module 124 utilizes computer vision processes to analyze and extract the structural features 120 from the digital documents 114. 
The character analysis model 126 and the classification model 128 represent different machine learning models that take the structural features 120 as input, and generate feature categorizations 130 that classify individual structural features 120 into different pre-defined categories of features. Implementations of the character analysis model 126 and the classification model 128 are detailed below.) Consider Claim 21. Aggarwal teaches: 21. (New) The electronic device according to claim 20, wherein the image feature extraction network comprises a plurality of feature processing units connected in sequence and a feature conversion unit, and wherein the instructions are further configured to cause the at least one processor to at least: obtain, by using the feature conversion unit, a one-dimensional vector representing the sample image based on the sample image; and input the one-dimensional vector into a first feature processing unit among the plurality of feature processing units, so that the one-dimensional vector is sequentially processed by the plurality of feature processing units to obtain the image feature of the sample image, wherein resolutions of feature maps output by the plurality of feature processing units are sequentially reduced according to a connection sequence. (Aggarwal: [0043] The analysis manager module 108 further includes a feature extraction module 124, a character analysis model 126, and a classification model 128. The feature extraction module 124 is representative of functionality to analyze and extract different features of the digital documents 114, such as the structural features 120. In at least one implementation, the feature extraction module 124 utilizes computer vision processes to analyze and extract the structural features 120 from the digital documents 114.
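Claim 21's structure — a feature conversion unit that turns the sample image into a one-dimensional vector, followed by feature processing units connected in sequence whose output resolutions drop in connection order — can be sketched as below. The flattening and pair-averaging are toy stand-ins chosen only to make the shapes concrete; the excerpt does not specify what the real units compute.

```python
import numpy as np

def feature_conversion_unit(image: np.ndarray) -> np.ndarray:
    """Turn the sample image into a one-dimensional vector. Flattening is
    a toy stand-in; a real conversion unit might embed patches instead."""
    return image.reshape(-1)

def feature_processing_unit(vec: np.ndarray) -> np.ndarray:
    """Toy processing unit: average adjacent pairs, halving the resolution."""
    return vec.reshape(-1, 2).mean(axis=1)

image = np.random.rand(16, 16)                 # hypothetical sample image
feature = feature_conversion_unit(image)       # 1-D vector of length 256
resolutions = [feature.size]
for _ in range(3):                             # three units connected in sequence
    feature = feature_processing_unit(feature)
    resolutions.append(feature.size)

# Resolutions are sequentially reduced according to the connection sequence.
assert resolutions == [256, 128, 64, 32]
```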
The character analysis model 126 and the classification model 128 represent different machine learning models that take the structural features 120 as input, and generate feature categorizations 130 that classify individual structural features 120 into different pre-defined categories of features. Implementations of the character analysis model 126 and the classification model 128 are detailed below. [0044] To enable the character analysis model 126 and the classification model 128 to generate the feature categorizations 130, the document analysis system 102 maintains training data 132 stored on the storage 112. Generally, the training data 132 can be utilized by the analysis manager module 108 to train the character analysis model 126 and the classification model 128 prior to processing the structural features 120. The training data 132, for instance, includes training digital documents (“training documents”) 134, which include tagged structural features (“tagged features”) 136. The tagged features 136, for instance, are generated by processing (e.g., manually) the digital documents 114 and applying tags to the tagged features 136 that identify which category each tagged feature 136 belongs to. The tagged features 136 can then be used to train the character analysis model 126 and the classification model 128 to categorize the structural features 120. [0045]-[0046] In at least one implementation, a reformatted digital document 114 is generated to be adapted for display on a display device 118 of the client device 104. For instance, consider that a particular digital document 114 is originally generated for display on a large form factor display, such as a desktop computer display. Consider further that the display device 118 is a small form factor display, such as a mobile phone. Accordingly, the document editor module 138 can receive device attributes 142 from the client device 104 that indicate attributes of the display device 118. 
The device attributes 142, for instance, represent data that describes different attributes of the display device 118, such as display size, aspect ratio, resolution, display technology, make and model, and so forth. The document editor module 138 can then utilize the device attributes 142 to generate a reformatted document 114 that is formatted for display on the display device 118. Generating the reformatted document 114, for instance, involves manipulating various attributes of a set of structural features of a particular digital document 114 to generate the reflowed features 140 for the reformatted document 114. Generally, this enables the reformatted document 114 to be properly displayed on the display device 118.) (Aggarwal: [0055] After generating a feature vector 214 for each of the structural features 204, the sequence generator module 212 generates a vector sequence 216 using the feature vectors 214. In at least one implementation, to generate the vector sequence 216, the sequence generator module 212 concatenates the feature vectors 214 based on the position of their respective structural features 204 in the digital document 202. For instance, after obtaining a feature vector 214 corresponding to each structural feature 204 in a page of the digital document 202, the sequence generator module 212 geographically sorts the structural features 204, first vertically from top-down in the page. The sequence generator module 212 then picks a first structural feature 204, and considers all the elements (e.g., pixels) which lie vertically within the height range of its bounding box and sorts them horizontally leaving the remaining sequence of elements for other structural features 204 undisturbed. The sequence generator module 212 repeats this process for the elements in the remaining set of structural features 204.
In this way, the sequence generator module 212 sorts the structural features 204 and their corresponding elements vertically top-bottom and then horizontally in left-right manner in reference to a page of the digital document 202. This arranges the elements in natural reading order, e.g., left-to-right and top-to-bottom according to some written language reading orders. The sequence generator module 212 thus obtains the vector sequence 216 S = v_{c1}, v_{c2}, v_{c3}, …, v_{cn} as a result of the sorting operation with n being number of structural features in a page.) Consider Claim 23. Aggarwal teaches: 23. (New) The electronic device according to claim 21, wherein the text detection model further comprises a second position encoding sub-model, and wherein the instructions are further configured to cause the at least one processor to at least: obtain, by using the second position encoding sub-model, a position map of the sample image based on the sample image; and pixel-wise add the sample image and the position map, and input the added sample image and position map into the feature conversion unit to obtain the one-dimensional vector representing the sample image. (Aggarwal: [0075] FIG. 7 depicts an example system 700 for obtaining text vectors using the character analysis model 126. In this particular example, the character analysis model 126 includes an LSTM cell 702, an LSTM output embedding layer 704, intermediate embedding layer 706, and an output layer 708. In order to generate vector embeddings from the text 308, the character analysis model 126 sequentially processes the characters present in the input text. For instance, for the text 308 from each of the structural features 304, the character analysis model 126 processes text sequences 710 (e.g., sequences of text inputs of arbitrary length) to obtain a representation which captures and incorporates long term dependencies within the text sequences 710.
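The geometric sort described in the quoted [0055] passage — order structural features vertically top-down, then horizontally left-to-right within each vertical band — can be approximated as follows. The band-grouping rule below (boxes whose top edge falls within the anchor's height range) is a simplification and an assumption, not Aggarwal's exact procedure.

```python
def reading_order(boxes):
    """Sort (x, y, w, h) bounding boxes top-to-bottom, then left-to-right
    within each vertical band, to approximate natural reading order."""
    remaining = sorted(boxes, key=lambda b: b[1])   # vertical, top-down
    ordered = []
    while remaining:
        anchor = remaining[0]
        top, bottom = anchor[1], anchor[1] + anchor[3]
        # All boxes lying vertically within the anchor's height range
        band = [b for b in remaining if top <= b[1] < bottom]
        band.sort(key=lambda b: b[0])               # horizontal, left-right
        ordered.extend(band)
        remaining = [b for b in remaining if b not in band]
    return ordered

boxes = [(50, 10, 30, 20), (5, 12, 30, 20), (5, 60, 30, 20)]
order = reading_order(boxes)
assert order[0] == (5, 12, 30, 20)   # top band: leftmost box comes first
assert order[-1] == (5, 60, 30, 20)  # the lower box comes last
```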
Once the LSTM cell 702 processes the sequences of characters present in each of the input text sequences 710, its output from the output embedding layer 704 is fed as input to the intermediate embedding layer 706 (e.g., a fully connected layer) followed by the output layer 708 (e.g., another fully connected layer), which outputs text vectors 712. Generally, the text vectors 712 include individual text vectors for text 308 from each of the structural features 304. In at least some implementations, the output layer 708 utilizes softmax activation to normalize the text vectors 712.) Consider Claim 24. Aggarwal teaches: 24. (New) The electronic device according to claim 17, wherein the instructions are further configured to cause the at least one processor to at least: determine a classification loss of the text detection model based on the predicted category and the actual category; determine a positioning loss of the text detection model based on the predicted position information and the actual position information; and train the text detection model based on the classification loss and the positioning loss. (Aggarwal: [0096] In configuring the classification model 128, the size of the forward and backward LSTMs in the context determination model 902 can be set to 500 resulting in h_j having a dimension of 1000. Further, the size of the decoder model 904 Dec_θ is set to 1000 and the size of attention layer 906 is set to 500. In one or more implementations, a batch size of 8 is used while training. Further, for optimizing model parameters, an Adam Optimizer can be used with a learning rate of 10^-4. Generally, the model parameters are optimized to maximize the log likelihood of feature types in the pages of the training documents 134.
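Claim 24's training objective — a classification loss from predicted versus actual categories plus a positioning loss from predicted versus actual position information — might be combined as sketched below. Cross entropy and smooth-L1 are assumed stand-ins; the application and the reference may define either loss differently.

```python
import numpy as np

def classification_loss(pred_probs, onehot):
    """Cross entropy between predicted category probabilities and one-hot
    actual categories (an assumed form of the classification loss)."""
    return -np.sum(onehot * np.log(pred_probs + 1e-12)) / len(pred_probs)

def positioning_loss(pred_boxes, true_boxes):
    """Smooth-L1 distance between predicted and actual position information
    (also an assumption; a real model might use IoU-based losses)."""
    d = np.abs(pred_boxes - true_boxes)
    return np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5))

pred_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
onehot     = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pred_boxes = np.array([[10.0, 20.0, 50.0, 30.0]])
true_boxes = np.array([[12.0, 20.0, 48.0, 30.0]])

# Train on the sum of the two losses, per claim 24's final step.
total = classification_loss(pred_probs, onehot) + positioning_loss(pred_boxes, true_boxes)
assert total > 0
```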
In at least one implementation, this can be achieved by minimizing the mean (taken over multiple pages of the training documents 134) of cross entropy loss between predicted softmax probability distribution of each structural feature in a page and one-hot vectors corresponding to their actual output class. Hence, the objective loss function becomes: loss = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{n} l_j^i · log(p_j^i), where "⋅" is the dot product operation, N is a number of pages in a training document 134, n is a maximum number of structural features in a page of a training document, and the summation of j is performed to account for all structural features in a page. p_j^i is a softmax probability vector (as predicted by the models) over different possible output categories and l_j^i is the one-hot vector corresponding to actual class of the jth structural feature in the ith training document 134, with the ordering of the structural features done spatially as discussed in 'preprocessing' section, above.) Consider Claim 26. Aggarwal teaches: 26. (New) An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to implement the method of claim 8. (Aggarwal: [0102] FIG. 12 depicts an example procedure 1200 for modifying a digital document. Step 1202 receives an instruction to modify a modifiable version of a digital document. In at least one implementation, the instruction is generated by an automated process, such as automatically by the document editor module 138. For instance, the document editor module 138 determines that the digital document is to be displayed on a particular device, such as the client device 104.
Accordingly, the document editor module 138 can generate an instruction to modify the digital document to be displayable on the client device 104. The instruction, for example, can specify that the digital document is to be modified for display based on various attributes of the display device 118 of the client device 104, such as display size, resolution, and so forth. [0107]-[0120] Figure 13, [0108] The example computing device 1302 as illustrated includes a processing system 1304, one or more computer-readable media 1306, and one or more I/O interfaces 1308 that are communicatively coupled, one to another. Although not shown, the computing device 1302 may further include a system bus or other data and command transfer system that couples the various components, one to another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. A variety of other examples are also contemplated, such as control and data lines.) Consider Claim 27. Aggarwal teaches: 27. (New) The non-transitory computer-readable storage medium according to claim 18, wherein the text feature extraction sub-model comprises an image feature extraction network and a sequence encoding network, and the text detection model further compri

Prosecution Timeline

Feb 10, 2023
Application Filed
Oct 28, 2025
Non-Final Rejection — §102, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12586249
PROCESSING APPARATUS, PROCESSING METHOD, AND STORAGE MEDIUM FOR CALIBRATING AN IMAGE CAPTURE APPARATUS
2y 5m to grant Granted Mar 24, 2026
Patent 12586354
TRAINING METHOD, APPARATUS AND NON-TRANSITORY COMPUTER READABLE MEDIUM FOR A MACHINE LEARNING MODEL
2y 5m to grant Granted Mar 24, 2026
Patent 12573083
COMPUTER-READABLE RECORDING MEDIUM STORING OBJECT DETECTION PROGRAM, DEVICE, AND MACHINE LEARNING MODEL GENERATION METHOD OF TRAINING OBJECT DETECTION MODEL TO DETECT CATEGORY AND POSITION OF OBJECT
2y 5m to grant Granted Mar 10, 2026
Patent 12548297
IMAGE PROCESSING METHOD AND APPARATUS, COMPUTER DEVICE, STORAGE MEDIUM, AND PROGRAM PRODUCT BASED ON FEATURE AND DISTRIBUTION CORRELATION
2y 5m to grant Granted Feb 10, 2026
Patent 12524504
METHOD AND DATA PROCESSING SYSTEM FOR PROVIDING EXPLANATORY RADIOMICS-RELATED INFORMATION
2y 5m to grant Granted Jan 13, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.


Prosecution Projections

1-2
Expected OA Rounds
86%
Grant Probability
99%
With Interview (+17.9%)
2y 8m
Median Time to Grant
Low
PTA Risk
Based on 868 resolved cases by this examiner. Grant probability derived from career allow rate.
