Last updated: April 19, 2026
Application No. 18/686,233
READING ORDER WITH POINTER TRANSFORMER NETWORKS

Non-Final OA §101§103§DP
Filed
Feb 23, 2024
Examiner
LIEW, ALEX KOK SOON
Art Unit
2674
Tech Center
2600 — Communications
Assignee
Google LLC
OA Round
1 (Non-Final)
Interview Optional

— +7.2% interview lift. This examiner has a relatively high allow rate; a written response may suffice.
Based on 1094 resolved cases, 2023–2026
Examiner Intelligence

LIEW, ALEX KOK SOON View full profile →
Grants 88% — above average
Career Allow Rate
957 granted / 1094 resolved
+25.5% vs TC avg
Moderate +7% lift
Without
With
+7.2%
Interview Lift
resolved cases with interview
Typical timeline
2y 7m
Avg Prosecution
18 currently pending
Career history
1112
Total Applications
across all art units
Statute-Specific Performance

§101
8.6%
-31.4% vs TC avg
§103
44.7%
+4.7% vs TC avg
§102
13.5%
-26.5% vs TC avg
§112
3.0%
-37.0% vs TC avg
Black line = Tech Center average estimate • Based on career data from 1094 resolved cases
Office Action

§101 §103 §DP
DETAILED ACTION
[1]	Remarks
I.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .
II.	Claims 1-16 and 18-21 are pending and have been examined, where claims 1-16 and 18-21 is/are rejected. Explanations will be provided below.
III.	Inventor and/or assignee search were performed and determined no double patenting rejection(s) is/are necessary. 
IV.	Patent eligibility (updated in 2019) shown by the following: Claims 1-9, 11-16 and 18-21 do not pass patent eligibility test because there is/are limitation or a combination of limitations amounting to an abstract idea(s). The following is an analysis on whether each limitation in all the independent claims can be perform using a mental process.
Limitation: receiving an image representing a document including a plurality of layout components; - does not amount to significant more.
Limitation: identifying textual information associated with the plurality of layout components; - can be perform using mental process.
Limitation: identifying visual information associated with the plurality of layout components; - can be perform using mental process.
combining the textual information with the visual information; and – can be perform using mental process. 
predicting a reading order of the plurality of layout components based on the combined
textual information and visual information using a self-attention encoder/decoder. – does not amount to significantly more. 
However, the following limitation or the combinations of the limitations (from claim 10): “a self-attention encoder configured to generate an embedding based on a first sequence associated with the plurality of layout components, the first sequence having a first order, and a self-attention decoder configured to generate a second sequence based on the embedding, the second sequence having a second order different from the first order” effects a transformation or a reduction of a particular article to a different state or thing / adds a specific limitation(s) other than what is well-understood, routine and conventional in the field, or adding unconventional steps that confine the claim to a particular useful application and providing improvements to the technical field of deep learning, which recite additional elements that integrate the judicial exception into a practical application and amounting significant more. 
V.	The PCT application, PCT/US2022/075454, is considered and the examiner determined no reference prior art are relevant to the claims of the current application.
[2]	Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):                                                                                                          
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

Use of the word “means” (or “step for”) in a claim with functional language creates a rebuttable presumption that the claim element is to be treated in accordance with 35 U.S.C. 112(f) (pre-AIA  35 U.S.C. 112, sixth paragraph).  The presumption that 35 U.S.C. 112(f) (pre-AIA  35 U.S.C. 112, sixth paragraph) is invoked is rebutted when the function is recited with sufficient structure, material, or acts within the claim itself to entirely perform the recited function.  Absence of the word “means” (or “step for”) in a claim creates a rebuttable presumption that the claim element is not to be treated in accordance with 35 U.S.C. 112(f) (pre-AIA  35 U.S.C. 112, sixth paragraph).  The presumption that 35 U.S.C. 112(f) (pre-AIA  35 U.S.C. 112, sixth paragraph) is not invoked is rebutted when the claim element recites function but fails to recite sufficiently definite structure, material or acts to perform that function. 
Claim elements in this application that use the word “means” (or “step for”) are presumed to invoke 35 U.S.C. 112(f) except as otherwise indicated in an Office action.  Similarly, claim elements that do not use the word “means” (or “step for”) are presumed not to invoke 35 U.S.C. 112(f) except as otherwise indicated in an Office action.
Claim(s) 18 is not interpreted under 35 U.S.C. 112(f) or pre-AIA  U.S.C. 112 6th paragraph because of the following reason(s): limitations are modified by sufficient structure or material for performing the claimed function. 
Claim(s) 1-16 and 19-21 does not require 35 U.S.C. 112(f) or pre-AIA  U.S.C. 112 6th paragraph interpretation because they are method claims and / or they are CRM claims.
Upon examination of the specification and claims, the examiner has determined, under the best understanding of the scope of the claim(s), rejection(s) under 35 U.S.C. 112(a)/(b) is not necessitated because of the following reasons: sufficient support are provided in the written description / drawings of the invention.
[3]	Grounds of Rejection
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows: 
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

35 U.S.C. 101 requires that a claimed invention must fall within one of the four eligible categories of invention (i.e. process, machine, manufacture, or composition of matter) and must not be directed to subject matter encompassing a judicially recognized exception as interpreted by the courts.  MPEP 2106.  Three categories of subject matter are found to be judicially recognized exceptions to 35 U.S.C. § 101 (i.e. patent ineligible) (1) laws of nature, (2) physical phenomena, and (3) abstract ideas.  MPEP 2106(II).  To be patent-eligible, a claim directed to a judicial exception must as whole be directed to significantly more than the exception itself.  See 2014 Interim Guidance on Patent Subject Matter Eligibility, 79 Fed. Reg. 74618, 74624 (Dec. 16, 2014).  Hence, the claim must describe a process or product that applies the exception in a meaningful way, such that it is more than a drafting effort designed to monopolize the exception.  Id
 Claim(s) 1-9, 11-16 and 18-21 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., an abstract idea) without significantly more.  
See reasoning in Remarks section above.
NOTE:  to qualify as ‘‘significantly more’’ a claim with a judicial exception must include either: Improvements to another technology or technical field; Improvements to the functioning of the computer itself; Applying the judicial exception with, or by use of, a particular machine; Effecting a transformation or reduction of a particular article to a different state or thing; Adding a specific limitation other than what is well-understood, routine and conventional in the field, or adding unconventional steps that confine the claim to a particular useful application; or Other meaningful limitations beyond generally linking the use of the judicial exception to a particular technological environment.
Claim Rejections - 35 USC § 103
1.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

2.	Claims 1-6, 10-12, 15-16, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Boliek (US 20040114813) in view of Wang et al. (Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei LayoutReader: Pre-training of Text and Layout for Reading Order Detection, arXiv:2108.11591v2 [cs.CL] 27 Aug 2021). 

Regarding claim 1, Boliek discloses a method comprising: 
receiving an image representing a document including a plurality of layout components (see figure 1B, shows the layout of a document); 
identifying textual information associated with the plurality of layout components (see figure 2A, the JPM file is divided into plurality of sections, JPEG and JBIG’s, where JBIG include identified text data, see illustration below); 
identifying visual information associated with the plurality of layout components (see figure 2A, the JPM file is divided into plurality of sections, JPEG and JBIG’s, where JPEG includes identified visual data, see illustration below); 
combining the textual information with the visual information (see figure 2A, the section in figure 2A is where the text and visual data are combined); and 
Boliek is silent in disclosing predicting a reading order of the plurality of layout components based on the combined textual information and visual information using a self-attention encoder/decoder.
Wang discloses predicting a reading order of the plurality of layout components based on the combined textual information and visual information using a self-attention encoder/decoder (see Section Layout Reader, Layout Reader solves the reading order detection task, which include encoder and decoder and see figure 1 illustration below):

    PNG
    media_image1.png
    351
    700
    media_image1.png
    Greyscale
.
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to include predicting a reading order of the plurality of layout components based on the combined textual information and visual information using a self-attention encoder/decoder in order to provides context-aware understanding of complex document layouts, which rule-based or cannot achieve. The visual information provides spatial context for the text, helping with logical flow, where the model sees and reads the document structure the content simultaneously. Also the self-attention mechanism allows the model to weigh the importance of all other components, such as texts, in the document when processing a single component, regardless of their distance in the initial sequence.

Regarding claim 2, Boliek discloses the method of claim 1, wherein the identifying of the textual information includes extracting text-based data from the image (see figure 2A, the JPM file is divided into plurality of sections, JPEG and JBIG’s, where JBIG include identified or extracted text data).

Regarding claim 3, Wang discloses the method of claim 2, wherein the extracting of the text- based data includes using a neural network configured to generate an embedding including representing the textual information (see figure 3, embeddings, where the transformer blocks are read as neural network). See the motivation (self attention portion) for claim 1. 

    PNG
    media_image2.png
    155
    257
    media_image2.png
    Greyscale
.

Regarding claim 4, Wang discloses the method of claim 3, wherein the neural network is a pretrained neural network that maps textual data to an embedding, and an array includes the embedding represents an element including the text-based data associated with each layout component of the plurality of layout components (see Section Experiment Layout Reader):

    PNG
    media_image3.png
    88
    773
    media_image3.png
    Greyscale
.
See the motivation for claim 1, Also, training a deep learning model from scratch is computationally expensive, time-consuming, and requires massive, high-quality labeled datasets and significant expertise. Using a pretrained BERT model allows you to save time and cost.

Regarding claim 5, Boliek discloses the method of claim 1, wherein the identifying of the visual information includes extracting visual-based data from the image (see figure 2A, the JPM file is divided into plurality of sections, JPEG and JBIG’s, where JPEG includes identified visual data):

    PNG
    media_image4.png
    536
    993
    media_image4.png
    Greyscale
.

Regarding claim 6, Wang discloses the method of claim 5, wherein the extracting of the visual-based data includes using a neural network configured to generate an embedding including the visual information (see figure 3 Token, Position, Segment, Layout embedding, and the each transformers include at least one Multi-Layer Perceptron, MLP):

    PNG
    media_image2.png
    155
    257
    media_image2.png
    Greyscale
.
See the motivations (self-attention) for claim 1.

Regarding claim 10, Wang discloses the method of claim 1, wherein the self- attention encoder/decoder includes: a self-attention encoder configured to generate an embedding based on a first sequence associated with the plurality of layout components, the first sequence having a first order, and a self-attention decoder configured to generate a second sequence based on the embedding, the second sequence having a second order different from the first order (see figure 1, the sequence is generated using embeddings, and the results shown in figure 1, also see citation above on Encoder and Decoder):

    PNG
    media_image5.png
    373
    809
    media_image5.png
    Greyscale
.
See the motivation for claim 1. In addition, to handle tasks where the input and output sequences have different structures, lengths, and element orderings, such as document layout analysis or machine translation.

Regarding claim 11, Wang discloses the method of claim 1, wherein the self- attention encoder/decoder includes a self-attention encoder configured to: weight relationships between pairs of elements in a set, and generate an embedding for the elements (see Decoder section below, where hk and bk are read as weights and bias respectively): 

    PNG
    media_image6.png
    247
    690
    media_image6.png
    Greyscale
.
See the motivation for claim 1. In addition, by weighing the relationship of a given element, a word in a sentence, to all other elements, the model can adjust its understanding of that element based on its surroundings pixels. 

Regarding claim 12, Wang discloses the method of claim 1, wherein the self- attention encoder/decoder includes a self-attention encoder configured to determine an influence of each an element in an embedding based on the combined textual information and visual information (see Section Layout Reader, Layout Reader solves the reading order detection task, which include encoder and decoder and see figure 1 illustration below, the image document is the visual information and text are in the image):

    PNG
    media_image7.png
    346
    691
    media_image7.png
    Greyscale
.
See the motivation for claims 1 and 11. 

Regarding claim 15, Wang discloses the method of claim 1, wherein the self-attention encoder/decoder includes a self-attention encoder and a self-attention decoder, and the self-attention decoder is configured to perform a QKV outer product between elements of the self-attention encoder and inputs to the self-attention decoder (see equation in Decoder, where the softmax equation include a product):

    PNG
    media_image8.png
    71
    538
    media_image8.png
    Greyscale
.
See the motivation for claims 1 and 11. Also performing a QKV outer product allows the decoder to retrieve relevant information from the entire encoder output to generate accurate and contextually appropriate outputs. 

Regarding claims 16 and 18 see the rationale and rejection for claim 1. In addition Boliek includes Processor, 1512, main memory (1504) and static memory (1506). 

3.	Claims 7-8 and 19-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Boliek (US 20040114813) in view of Wang et al. (Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei LayoutReader: Pre-training of Text and Layout for Reading Order Detection, arXiv:2108.11591v2 [cs.CL] 27 Aug 2021) and Rhanoui (Maryem Rhanoui, Mounia Mikram, Siham Yousfi, and Soukaina Barzali, A CNN-BiLSTM Model for Document-Level Sentiment Analysis, Machine Learning Knowledge. Extraction, MDPI, 2019).

Regarding claim 7, Boliek and Wang disclose all the limitations of claim 6 but is silent in disclosing the method of claim 6, wherein the neural network includes a two-dimensional convolution operation, the embedding includes an array, and the array includes an element including the visual-based data associated with each of the plurality of layout components. Rhanoui discloses the method of claim 6, wherein the neural network includes a two-dimensional convolution operation, the embedding includes an array, and the array includes an element including the visual-based data associated with each of the plurality of layout components (see figure 2 illustration below, the embedding matrix which reads on an array, the two convolution operation is performed after the embedding matrix generation):

    PNG
    media_image9.png
    429
    1430
    media_image9.png
    Greyscale
.
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to include convolutional layers because the visual data associated with layout components is arranged in an array that has a natural 2D spatial structure, where the operation is specifically designed to exploit and learn from the spatial relationships and local patterns within that 2D grid, which improves image recognition. 

Regarding claim 8, Rhanoui discloses the method of claim 6, wherein the neural network includes a plurality of two-dimensional convolution operations, and the embedding includes an array including an element including the visual-based data associated with an associated layout component and the visual-based data associated with at least one additional layout component (see figure 2 illustration below, the embedding matrix which reads on an array, the two convolution operation is performed after the embedding matrix generation, where the embedding matrix includes visual information). See the motivation for claims 1 and 7. 

Regarding claim 19, Rhanoui discloses the non-transitory computer-readable storage medium of claim 16, wherein the identifying of the visual information includes extracting visual-based data from the image, the extracting of the visual-based data includes using a neural network configured to generate an embedding including the visual information, the neural network includes a two-dimensional convolution operation, the embedding includes an array (see figure 2, convolutional 2D layer takes the embedding matrix an input, where the document is read as the visual information), and the array includes an element including the visual-based data associated with the plurality of layout components (see figure 2 illustration below, the embedding matrix which reads on an array, the two convolution operation is performed after the embedding matrix generation, which reads on the array):

    PNG
    media_image9.png
    429
    1430
    media_image9.png
    Greyscale
.
See the motivation in claims 1 and 7.

Regarding claim 20, Rhanoui discloses the non-transitory computer-readable storage medium of claim 16, wherein the identifying of the visual information includes extracting visual-based data from the image, the extracting of the visual-based data includes using a neural network configured to generate an embedding including the visual information, the neural network includes a plurality of two-dimensional convolution operations (see figure 2, convolutional 2D layer takes the embedding matrix an input, where the document is read as the visual information), and the embedding includes an array including an element including the visual-based data associated with an associated layout component and the visual-based data associated with at least one additional layout component (see figure 2 illustration below, the embedding matrix which reads on an array, the two convolution operation is performed after the embedding matrix generation, which reads on the array, see illustration above).

4.	Claims 9, 13 and 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Boliek (US 20040114813) in view of Wang et al. (Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei LayoutReader: Pre-training of Text and Layout for Reading Order Detection, arXiv:2108.11591v2 [cs.CL] 27 Aug 2021) and WANG (US 20210232773).

Regarding claim 9, Boliek and Wang disclose all the limitations of claim 6 but is silent in disclosing the method of claim 1, wherein the textual information is associated with a first embedding, the visual information is associated with a second embedding, and the combining of the textual information with the visual information includes concatenating the first embedding with the second embedding. WANG discloses the method of claim 1, wherein the textual information is associated with a first embedding, the visual information is associated with a second embedding, and the combining of the textual information with the visual information includes concatenating the first embedding with the second embedding (see paragraph 29, the encoder 332 may receive from the input sequence module 336 an output, e.g., an encoded vector representation of a concatenation of the image 350, the image caption 352 and the dialogue history 354).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to include combining of the textual information with the visual information includes concatenating because it creates a longer vector, providing more space for the model to learn complex relationships between the visual and textual features.

Regarding claim 13, Wang discloses the method of claim 1, wherein the self-attention encoder/decoder includes a self-attention decoder configured to operate as an auto-regressive inference (see paragraph 52, to preserve the autoregressive property of the answer generation, the unified transformer encoder 450 can employ the sequence-to-sequence (seq2seq) self-attention mask for a generative setting 482). It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to include the self-attention encoder/decoder includes a self-attention decoder configured to operate as an auto-regressive inference to ensure each generated token only depends on previously generated tokens, not future ones, which is crucial for sequential prediction tasks like language modeling, mimicking how humans write or speak one word at a time. This masking prevents the model from seeing the ground truth ahead, forcing it to learn the actual generation process, making it consistent between training and inference.

Regarding claim 21 see the rationale and rejection for claims 9 and 13.

4.	Claims 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Boliek (US 20040114813) in view of Wang et al. (Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, Furu Wei LayoutReader: Pre-training of Text and Layout for Reading Order Detection, arXiv:2108.11591v2 [cs.CL] 27 Aug 2021) and Olabiyi (US 20210027770). 

Regarding claim 14, Boliek and Wang disclose all the limitations of claim 6 but is silent in disclosing the method of claim 1, wherein the self- attention encoder/decoder includes a self-attention decoder configured to auto-regressively predict a next layout component in the reading order associated with the plurality of layout components. Olabiyi discloses the method of claim 1, wherein the self-attention encoder/decoder includes a self-attention decoder configured to auto-regressively predict a next layout component in the reading order associated with the plurality of layout components (see paragraph 76, as machine classifiers may include a word-token sequence generation model which generates reading order, also see paragraph 50, Machine classifiers in accordance with aspects of the application may utilize an autoregressive transformer architecture using only a decoder without the need for a separate encoder, autoregressive transformer models may use multiple layers of masked multi-head self-attention to map a sequence of input tokens to a sequence of output tokens, the initial word order is read as the layout).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to include self-attention encoder/decoder includes a self-attention decoder configured to auto-regressively predict a next layout component in the reading order to ensure the generated components follow a logical sequential order and to prevent information leakage from future components during the prediction process.

CONTACT INFORMATION
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ALEX LIEW (duty station is located in New York City) whose telephone number is (571)272-8623 (FAX 571-273-8623), cell (917)763-1192 or email alexa.liew@uspto.gov. Please note the examiner cannot reply through email unless an internet communication authorization is provided by the applicant. The examiner can be reached anytime. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, MISTRY ONEAL R, can be reached on (313)446-4912.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ALEX KOK S LIEW/Primary Examiner, Art Unit 2674                                                                                                                                                                                                        Telephone: 571-272-8623
Date: 1/8/26
Read full office action
Prosecution Timeline

Feb 23, 2024
Application Filed
Jan 09, 2026
Non-Final Rejection — §101, §103, §DP (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/279,353
Patent 12597112
INSPECTION DEVICE, INSPECTION METHOD, AND RECORDING MEDIUM
2y 5m to grant Granted Apr 07, 2026
18/423,609
Patent 12597144
ANTERIOR SEGMENT ANALYSIS APPARATUS, ANTERIOR SEGMENT ANALYSIS METHOD, AND NON-TRANSITORY COMPUTER-READABLE STORAGE MEDIUM
2y 5m to grant Granted Apr 07, 2026
18/531,192
Patent 12597150
OBTAINING A DEPTH MAP
2y 5m to grant Granted Apr 07, 2026
18/076,663
Patent 12579795
DIAGNOSIS SUPPORT SYSTEM, DIAGNOSIS SUPPORT METHOD, AND STORAGE MEDIUM
2y 5m to grant Granted Mar 17, 2026
18/323,233
Patent 12572999
INCREASING RESOLUTION OF DIGITAL IMAGES USING SELF-SUPERVISED BURST SUPER-RESOLUTION
2y 5m to grant Granted Mar 10, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.
Prosecution Projections

1-2
Expected OA Rounds
88%
Grant Probability
95%
With Interview (+7.2%)
2y 7m
Median Time to Grant
Low
PTA Risk
Based on 1094 resolved cases by this examiner. Grant probability derived from career allow rate.
READING ORDER WITH POINTER TRANSFORMER NETWORKS

Interview Optional

Examiner Intelligence

Statute-Specific Performance

Office Action

Prosecution Timeline

Precedent Cases

Applications granted by this same examiner with similar technology

AI Strategy Recommendation

Prosecution Projections

Ready to respond to this office action?

Sign in with your work email