DETAILED ACTION
This communication is in response to the Amendments/Arguments filed on 01/26/2026.
Applicant has elected Invention I, including claims 1-15 and 39-43.
Claims 16-38 and 42 have been canceled by the Applicant.
Claim(s) 1-15, 39-41, and 43-45 are pending and have been examined.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/26/2026 has been entered.
Response to Arguments and Amendments
Amendments to the claims by the Applicant have been considered and addressed below.
With respect to the 35 U.S.C. § 101 and § 103 rejections, Applicant presents several arguments, to which the Examiner responds below.
35 USC § 101 rejection(s)
Arguments on pages 12-17 of the Remarks filed on 01/26/2026.
Examiner’s Response to Arguments:
The arguments have been considered but are not persuasive.
The Examiner respectfully disagrees with the following arguments:
“Applicant respectfully submits that a "pretrained language model" is not a mental process. Even under a broadest reasonable interpretation, a "pretrained language model" would be understood by an artisan to require much more than merely "a predetermined set of rules." An artisan would appreciate that a pretrained language model necessarily refers to a neural language model implemented by a processor that has been pretrained, and such a model could not be implemented by a human using pen and paper”
“…the present independent claims are analogous to eligible Claim 2 in Example 48 of the July 2024 Subject Matter Eligibility Examples. Claim 2 of Example 48, directed to speech separation methods… the presently recited pretrained neural language model cannot be practically performed in the human mind. The claims as a whole including the pretrained neural language model provide a technical improvement to existing computer technology and to information retrieval (IR) by providing a representation of an input sequence that can provide performance benefits comparable to dense embedding while allowing more efficient matching for IR ”
“A pretrained language model, at least, does not fall under a mental process under Step 2A, Prong One of the Guidelines for Determining Patent Eligibility. Furthermore, regarding Step 2B, Prong Two, the integration of a pretrained neural language model is fundamental to the method and architecture in claims 1, 29, and 40, … claims are eligible if they reflect an improvement in the functioning of a computer or to another technology or technical field. Id., at Page 4. The present application recognizes, and illustrates via experiments, that example neural ranker models can combine rich term embeddings from pretrained language models with sparsity that allows efficient matching for information retrieval (IR) based on inverted indexes, and can perform IR with improved speed (efficiency) compared to dense embedding with comparable performance (e.g., paragraphs [0027] - [0028]; [0078] - [0094]). This demonstrates improvements to both the functioning of a computer and to another technology (information retrieval).”
“… claims 1, 39, and 40 are also analogous to the claims at issue in Ex Parte Desjardins, Appeal No. 2024-000567 (PTAB September 26, 2025, Appeals Review Panel Decision). In Ex Parte Desjardins, the claimed method of training a machine learning model… independent claims 1, 39, and 40 improve efficiency and/or performance of neural information retrieval by providing representations of sequences (e.g., documents) by integrating embedding by a pretrained language model with determining importance of the sequences over a vocabulary using one or more trained neural layers and activation, where the representations are used in an information retrieval task which provides benefits of streamlined IR search with preservation of performance from searches using only dense embeddings…”
“…the independent claims have been amended to recite that the embedding is by a pretrained neural language model implemented by the processor having a transformer architecture. In rejecting claim 13, the Office Action asserts that a pretrained language model comprising a transformer architecture can be met by merely "Performing steps using a predetermined set of rules." Even under a broadest reasonable interpretation, a neural language model and a transformer architecture clearly cannot be provided merely by a human following "a predetermined set of rules" using a pen and paper…”
The Examiner respectfully disagrees with the arguments above and notes:
The limitation of "pretrained language model," in the claims as drafted, is not further limited in a way that would preclude a person from performing the recited actions mentally and/or using pen and paper.
The eligible Claim 2 in Example 48 of the July 2024 Subject Matter Eligibility Examples is directed to a method of receiving spoken audio from different sources, deriving a temporal feature representation and a spectrogram of the audio, and using a DNN to calculate embedding vectors based on the temporal feature representation and the spectrogram using a mathematical formula, while the instant application is directed to document information retrieval using a pretrained neural language model as well as one or more trained neural layers.
Regarding the Prong One and Prong Two analyses, the Examiner refers the Applicant to the analyses below.
In Ex Parte Desjardins, Appeal No. 2024-000567 (PTAB September 26, 2025, Appeals Review Panel Decision) (precedential), the claimed invention was a method of training a machine learning model on a series of tasks, and under Step 2A Prong Two, the Appeals Review Panel determined that the specification identified improvements as to how the machine learning model itself operates, including training a machine learning model to learn new tasks while protecting knowledge about previous tasks to overcome the problem of “catastrophic forgetting” encountered in continual learning systems. However, the instant application, which is directed to document information retrieval using a pretrained neural language model as well as one or more trained neural layers and which does not provide any information regarding the training of the “pretrained neural language model,” is not considered analogous.
Regarding the pretrained neural language model implemented by the processor and having a transformer architecture, the Examiner notes that the claim as drafted is not limiting with respect to the transformer architecture; hence, it reads on a human following "a predetermined set of rules" (e.g., transforming input data/text into another format, such as a sentence into a bullet-point list or table), which could be performed by the human mentally and/or using pen and paper.
Hence, the Examiner still considers the claims to be directed to an abstract idea, as discussed below.
Also, the Examiner refers the Applicant to MPEP 2106.05(a):
“It is important to note that in order for a method claim to improve computer functionality, the broadest reasonable interpretation of the claim must be limited to computer implementation. That is, a claim whose entire scope can be performed mentally, cannot be said to improve computer technology. Synopsys, Inc. v. Mentor Graphics Corp., 839 F.3d 1138, 120 USPQ2d 1473 (Fed. Cir. 2016) (a method of translating a logic circuit into a hardware component description of a logic circuit was found to be ineligible because the method did not employ a computer and a skilled artisan could perform all the steps mentally). Similarly, a claimed process covering embodiments that can be performed on a computer, as well as embodiments that can be practiced verbally or with a telephone, cannot improve computer technology. See RecogniCorp, LLC v. Nintendo Co., 855 F.3d 1322, 1328, 122 USPQ2d 1377, 1381 (Fed. Cir. 2017) (process for encoding/decoding facial data using image codes assigned to particular facial features held ineligible because the process did not require a computer).” (Emphasis added)
Please see the detailed analysis below for more details on how the Examiner understands that the independent claims do not recite additional elements that integrate the judicial exception into a practical application, and hence do not qualify as patent-eligible subject matter under 35 U.S.C. § 101.
Please refer to MPEP 2106.04(1): Eligibility Step 2A: Whether a Claim is Directed to a Judicial Exception: Prong One.
“Prong One asks does the claim recite an abstract idea, law of nature, or natural phenomenon? In Prong One examiners evaluate whether the claim recites a judicial exception, i.e. whether a law of nature, natural phenomenon, or abstract idea is set forth or described in the claim. While the terms "set forth" and "described" are thus both equated with "recite", their different language is intended to indicate that there are two ways in which an exception can be recited in a claim. For instance, the claims in Diehr, 450 U.S. at 178 n. 2, 179 n.5, 191-92, 209 USPQ at 4-5 (1981), clearly stated a mathematical equation in the repetitively calculating step, and the claims in Mayo, 566 U.S. 66, 75-77, 101 USPQ2d 1961, 1967-68 (2012), clearly stated laws of nature in the wherein clause, such that the claims "set forth" an identifiable judicial exception. Alternatively, the claims in Alice Corp., 573 U.S. at 218, 110 USPQ2d at 1982, described the concept of intermediated settlement without ever explicitly using the words "intermediated" or "settlement."”
“An example of a claim that recites a judicial exception is "A machine comprising elements that operate in accordance with F=ma." This claim sets forth the principle that force equals mass times acceleration (F=ma) and therefore recites a law of nature exception. Because F=ma represents a mathematical formula, the claim could alternatively be considered as reciting an abstract idea. Because this claim recites a judicial exception, it requires further analysis in Prong Two in order to answer the Step 2A inquiry. An example of a claim that merely involves, or is based on, an exception is a claim to "A teeter-totter comprising an elongated member pivotably attached to a base member, having seats and handles attached at opposing sides of the elongated member." This claim is based on the concept of a lever pivoting on a fulcrum, which involves the natural principles of mechanical advantage and the law of the lever. However, this claim does not recite these natural principles and therefore is not directed to a judicial exception (Step 2A: NO). Thus, the claim is eligible at Pathway B without further analysis.”
From this analysis, in Step 2A, Prong One, the Examiner has evaluated the independent claims accordingly and determined that the amended independent claims as drafted indeed describe a judicial exception (i.e., an abstract idea), which represents a mental process (i.e., one that can be performed by a human mentally or with pen and paper).
More specifically, similar to what was discussed in the Final Rejection mailed on 11/26/2025:
The limitations of independent claims as drafted cover a mental process and/or mathematical concepts.
More specifically, the independent claim(s) 1 and 39-40 recite(s):
1. (Currently Amended) A method implemented by a computer having a processor and memory for providing a representation of an input sequence over a vocabulary for a neural information retrieval model for identifying a set of documents, the method comprising:
embedding, by a pretrained neural language model implemented by the processor and having a transformer architecture, each token of an input sequence to provide an embedded input sequence, the input sequence having been tokenized using the vocabulary, the vocabulary having a vocabulary space including a plurality of tokens;
for each token of the embedded input sequence, determining, using one or more trained neural layers implemented by the processor, a prediction of an importance of each of the plurality of tokens of the vocabulary in the vocabulary space;
computing, using the processor, a predicted term importance of the input sequence over the vocabulary space by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space, wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space; and
outputting the predicted term importance of the input sequence over the vocabulary space as the representation of the input sequence over the vocabulary, wherein the output predicted term importance is used by the neural information retrieval model for an information retrieval task in response to a query for identifying the set of documents representative of [[a]]the query.
39. A non-transitory computer-readable medium having executable instructions stored thereon for causing a processor and a memory to implement a method for providing a representation of an input sequence over a vocabulary in a first-stage ranker of a neural information retrieval model, the method comprising:
[the limitations as in claim 1, above.]
This is directed to the abstract idea grouping of mental process and reads on a human (i.e., mentally or using pen and paper):
Providing a representation of a sequence (e.g., sentence) by ranking or assigning scores using a predetermined set of rules by:
Rewriting the sequence in tokens (e.g., segments) (i.e., writing down on a piece of paper (i.e., following a predetermined set of rules: pretrained language model with a predetermined architecture) a list of tokens (e.g., words));
Determining the importance of each token (i.e., determining or assigning a predetermined value (e.g., 0 through 5) to each token (e.g., word) in the list of tokens (e.g., words)) using predetermined steps;
Obtaining the importance by further performing a predetermined set of rules (i.e., calculating a predicted term/value for the entire list of tokens (e.g., words) (i.e., following a predetermined set of rules: mathematical equation/concept to calculate said predetermined term)), wherein the importance is determined with respect to each of the tokens of the sequence/vocabulary;
Writing down the importance of the sequence as the representation of a sequence (e.g., sentence) by ranking or assigning scores.
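Purely for illustrative context (and not as a characterization of the claims of record), the sequence of steps paraphrased above can be sketched in a SPLADE-like form. All sizes, parameter names, and the specific activation below are assumptions, as the claims do not recite any particular formulation:

```python
# Illustrative sketch only: a SPLADE-like term-importance pipeline.
# Sizes, parameters, and the specific activation are assumptions; the
# claims of record do not recite this particular formulation.
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, vocab = 4, 8, 16          # hypothetical dimensions

# "embedding ... each token": stand-in for contextual embeddings that a
# pretrained transformer language model would produce for the input tokens
embedded = rng.normal(size=(seq_len, hidden))

# "one or more trained neural layers": a linear projection from the
# embedding space to the vocabulary space (one logit per vocabulary token)
W = rng.normal(size=(hidden, vocab))
b = np.zeros(vocab)
logits = embedded @ W + b                  # per-token importance logits

# "performing an activation over the embedded input sequence":
# log-saturated ReLU followed by max-pooling over the sequence positions
per_token = np.log1p(np.maximum(logits, 0.0))   # log(1 + ReLU(x))
representation = per_token.max(axis=0)          # one weight per vocab term

# The output is a nonnegative vector over the vocabulary space, the kind
# of sparse-friendly representation usable with an inverted index.
assert representation.shape == (vocab,)
assert (representation >= 0).all()
```

Terms the activation drives to zero drop out of the representation entirely, which is what permits inverted-index matching in first-stage retrieval.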
40. A computer-implemented method for processing an input sequence, the method comprising:
embedding, by a pretrained neural language model implemented by the processor and having a transformer architecture, each token of an input sequence to provide an embedded input sequence, the input sequence having been tokenized using a predetermined vocabulary having a vocabulary space including a plurality of tokens;
predicting term importance of the embedded input sequence of tokens with respect to each of the plurality of tokens of the predetermined vocabulary; and
providing the predicted term importance of the embedded input sequence of tokens for use by an information retrieval model in response to a query for identifying the set of documents;
wherein the predicted term importance of the input sequence of tokens with respect to each of the plurality of tokens of the predetermined vocabulary provides a representation of the input sequence over a predetermined vocabulary for a neural information retrieval model for identifying a set of documents;
wherein said predicting term importance comprises:
for each token of the embedded input sequence, predicting, using one or more trained neural layers implemented by the processor, an importance of each of the plurality of tokens of the vocabulary in the vocabulary space; and
computing, using the processor, a predicted term importance of the input sequence over the vocabulary by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space, wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space.
This is directed to the abstract idea grouping of mental process and reads on a human (i.e., mentally or using pen and paper):
Analyzing a sequence (e.g., sentence) by:
Rewriting the sequence in tokens (e.g., segments) (i.e., writing down on a piece of paper (i.e., following a predetermined set of rules: pretrained language model with a predetermined architecture) a list of tokens (e.g., words));
Determining the importance of each token (i.e., determining or assigning a predetermined value (e.g., 0 through 5) to each token (e.g., word) in the list of tokens (e.g., words));
Obtaining the importance by further performing a predetermined set of rules (i.e., calculating a predicted term/value for the entire list of tokens (e.g., words) (i.e., following a predetermined set of rules: mathematical equation/concept to calculate said predetermined term)) using predetermined steps;
Writing down the importance of the sequence;
Wherein the importance of the sequence is the representation of a sequence (e.g., sentence) by ranking or assigning scores, performed using a predetermined set of rules and wherein the importance is determined with respect to each of the tokens of the sequence/vocabulary.
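For context only, and as an assumption rather than a characterization of the claims (which recite no particular equation), one concrete formulation consistent with the recited "activation over the embedded input sequence" would be a SPLADE-style log-saturated ReLU with pooling over the sequence positions:

```latex
w_j \;=\; \max_{1 \le i \le n} \, \log\!\left(1 + \mathrm{ReLU}\!\left(h_i^{\top} W_{:,j} + b_j\right)\right),
\qquad j = 1, \dots, |V|,
```

where \(h_i\) is the embedding of the \(i\)-th token of the input sequence, \(W\) and \(b\) are parameters of the trained neural layer(s), and \(|V|\) is the size of the vocabulary space. The \(\log(1+x)\) saturation is a concave function, underscoring the mathematical character of the recited computing step.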
Please also refer to MPEP 2106.05(f)(2): Whether the claim invokes computers or other machinery merely as a tool to perform an existing process, and MPEP 2106.06(b): Clear Improvement to a Technology or to Computer Functionality.
Please refer to MPEP 2106.04(2): Eligibility Step 2A: Whether a Claim is Directed to a Judicial Exception: Prong Two.
“Prong Two asks does the claim recite additional elements that integrate the judicial exception into a practical application? In Prong Two, examiners evaluate whether the claim as a whole integrates the exception into a practical application of that exception. If the additional elements in the claim integrate the recited exception into a practical application of the exception, then the claim is not directed to the judicial exception (Step 2A: NO) and thus is eligible at Pathway B. This concludes the eligibility analysis. If, however, the additional elements do not integrate the exception into a practical application, then the claim is directed to the recited judicial exception (Step 2A: YES), and requires further analysis under Step 2B (where it may still be eligible if it amounts to an ‘‘inventive concept’’). For more information on how to evaluate whether a judicial exception is integrated into a practical application, see MPEP § 2106.04(d)(2).”
From this analysis, in Step 2A, Prong Two, the Examiner has evaluated the independent claims accordingly and determined that the amended independent claims as drafted, viewed as a whole, do not include additional elements that integrate the exception (i.e., an abstract idea) into a practical application of that exception. As discussed in the Final Rejection mailed on 11/26/2025:
This judicial exception is not integrated into a practical application because, for example, claims 1, 39 and/or 40 recite a “computer”, “memory”, “processor”, “non-transitory computer-readable medium”, “ranker”, “pretrained neural language model having a transformer architecture” and “trained neural layers”. As an example, in [0104] of the as-filed specification, it is disclosed: “Embodiments described herein may be implemented in hardware or in software. The implementation can be performed using a non-transitory storage medium such as a computer-readable storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, and EPROM, an EEPROM or a FLASH memory. Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system.” Therefore, a general-purpose computer or computing device is described and is merely used as a tool to apply the abstract idea. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
Please also refer to MPEP 2106.05(f)(2): Whether the claim invokes computers or other machinery merely as a tool to perform an existing process.
Finally, please refer to MPEP 2106.05: Relevant Considerations For Evaluating Whether Additional Elements Amount To An Inventive Concept
“Limitations that the courts have found not to be enough to qualify as "significantly more" when recited in a claim with a judicial exception include:
i. Adding the words "apply it" (or an equivalent) with the judicial exception, or mere instructions to implement an abstract idea on a computer, e.g., a limitation indicating that a particular function such as creating and maintaining electronic records is performed by a computer, as discussed in Alice Corp., 573 U.S. at 225-26, 110 USPQ2d at 1984 (see MPEP § 2106.05(f));
ii. Simply appending well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception, e.g., a claim to an abstract idea requiring no more than a generic computer to perform generic computer functions that are well-understood, routine and conventional activities previously known to the industry, as discussed in Alice Corp., 573 U.S. at 225, 110 USPQ2d at 1984 (see MPEP § 2106.05(d));”
From this analysis, in Step 2B, the Examiner has evaluated the independent claims accordingly and determined that the independent claims as drafted have limitations that the courts have found not to be enough to qualify as "significantly more" when recited in a claim with a judicial exception. Similar to what was discussed in the Final Rejection mailed on 11/26/2025:
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer amounts to a general-purpose computing device, as noted. The claims are not patent eligible.
In summary, the Examiner respectfully disagrees with the arguments above. Please refer to analysis above.
35 USC § 103 rejection(s)
Arguments on pages 17-22 of the Remarks filed on 01/26/2026.
Examiner’s Response to Arguments:
Applicant’s arguments with respect to claim(s) 1, 39, and 40 under 35 U.S.C. § 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Harley et al. (US 20210319907 A1) and further in view of Bai et al. (Bai, Yang, et al. "Sparterm: Learning term-based sparse representation for fast text retrieval." arXiv preprint arXiv:2010.00768 (2020). https://arxiv.org/pdf/2010.00768) and Hall et al. (US 9536522 B1).
For more details, please refer to updated 35 U.S.C. § 103 rejections for claims 1, 39, and 40, below.
Claim Objections
Claims 1 and 39-40 are objected to because of the following informalities: the limitation “an input sequence” in line 6 of claim 1, line 7 of claim 39, and line 4 of claim 40 should read: “the input sequence”.
Appropriate correction is required.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-15, 39-41, and 43-45 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more; more specifically, to the abstract idea groupings of mathematical concepts and/or mental processes (as specified below).
The independent claim(s) 1 and 39-40 recite(s):
1. (Currently Amended) A method implemented by a computer having a processor and memory for providing a representation of an input sequence over a vocabulary for a neural information retrieval model for identifying a set of documents, the method comprising:
embedding, by a pretrained neural language model implemented by the processor and having a transformer architecture, each token of an input sequence to provide an embedded input sequence, the input sequence having been tokenized using the vocabulary, the vocabulary having a vocabulary space including a plurality of tokens;
for each token of the embedded input sequence, determining, using one or more trained neural layers implemented by the processor, a prediction of an importance of each of the plurality of tokens of the vocabulary in the vocabulary space;
computing, using the processor, a predicted term importance of the input sequence over the vocabulary space by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space, wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space; and
outputting the predicted term importance of the input sequence over the vocabulary space as the representation of the input sequence over the vocabulary, wherein the output predicted term importance is used by the (first-stage ranker) neural information retrieval model for an information retrieval task in response to a query for identifying the set of documents representative of [[a]]the query.
39. (Currently Amended) A non-transitory computer-readable medium having executable instructions stored thereon for causing a processor and a memory to implement a method for providing a representation of an input sequence over a vocabulary for a neural information retrieval model for identifying a set of documents, the method comprising:
embedding, by a pretrained neural language model implemented by the processor and having a transformer architecture, each token of an input sequence to provide an embedded input sequence, the input sequence having been tokenized using the vocabulary, the vocabulary having a vocabulary space including a plurality of tokens;
for each token of the embedded input sequence, determining, using one or more trained neural layers implemented by the processor, a prediction of an importance of each of the plurality of tokens of the vocabulary in the vocabulary space; and
computing, using the processor, a predicted term importance of the input sequence over the vocabulary space by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space, wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space; and
outputting the predicted term importance of the input sequence over the vocabulary space as the representation of the input sequence over the vocabulary, wherein the output predicted term importance is used by the first-stage ranker of the neural information retrieval model for an information retrieval task in response to a query for identifying the set of documents representative of the query.
This is directed to the abstract idea grouping of mental process and reads on a human (i.e., mentally or using pen and paper):
Providing a representation of a sequence (e.g., sentence) by ranking or assigning scores using a predetermined set of rules by:
Rewriting the sequence in tokens (e.g., segments) (i.e., writing down on a piece of paper (i.e., following a predetermined set of rules: pretrained language model with a predetermined architecture/structure) a list of tokens (e.g., words));
Determining the importance of each token (i.e., determining or assigning a predetermined value (e.g., 0 through 5) to each token (e.g., word) in the list of tokens (e.g., words)) using predetermined steps;
Obtaining the importance by further performing a predetermined set of rules (i.e., calculating a predicted term/value for the entire list of tokens (e.g., words) (i.e., following a predetermined set of rules: mathematical equation/concept to calculate said predetermined term)), wherein the importance is determined with respect to each of the tokens of the sequence/vocabulary;
Writing down the importance of the sequence as the representation of a sequence (e.g., sentence) by ranking or assigning scores.
40. A computer-implemented method for processing an input sequence, the method comprising:
embedding, by a pretrained neural language model implemented by the processor and having a transformer architecture, each token of an input sequence to provide an embedded input sequence, the input sequence having been tokenized using a predetermined vocabulary having a vocabulary space including a plurality of tokens;
predicting term importance of the embedded input sequence of tokens with respect to each of the plurality of tokens of the predetermined vocabulary; and
providing the predicted term importance of the embedded input sequence of tokens for use by an information retrieval model in response to a query for identifying the set of documents;
wherein the predicted term importance of the input sequence of tokens with respect to each of the plurality of tokens of the predetermined vocabulary provides a representation of the input sequence over a predetermined vocabulary for a neural information retrieval model for identifying a set of documents;
wherein said predicting term importance comprises:
for each token of the embedded input sequence, predicting, using one or more trained neural layers implemented by the processor, an importance of each of the plurality of tokens of the vocabulary in the vocabulary space; and
computing, using the processor, a predicted term importance of the input sequence over the vocabulary by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space, wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space.
This is directed to the abstract idea grouping of mental process and reads on a human (i.e., mentally or using pen and paper):
Analyzing a sequence (e.g., sentence) by:
Rewriting the sequence in tokens (e.g., segments) (i.e., writing down on a piece of paper (i.e., following a predetermined set of rules: pretrained language model with a predetermined architecture/structure) a list of tokens (e.g., words));
Determining the importance of each token (i.e., determining or assigning a predetermined value (e.g., 0 through 5) to each token (e.g., word) in the list of tokens (e.g., words));
Obtaining the importance by further performing a predetermined set of rules (i.e., calculating a predicted term/value for the entire list of tokens (e.g., words) (i.e., following a predetermined set of rules: mathematical equation/concept to calculate said predetermined term)) using predetermined steps;
Writing down the importance of the sequence;
Wherein the importance of the sequence is the representation of a sequence (e.g., sentence) by ranking or assigning scores, performed using a predetermined set of rules and wherein the importance is determined with respect to each of the tokens of the sequence/vocabulary.
This judicial exception is not integrated into a practical application because, for example, claims 1, 39 and/or 40 recite a “computer”, “memory”, “processor”, “non-transitory computer-readable medium”, “ranker”, “pretrained neural language model having a transformer architecture” and “trained neural layers”. As an example, in [0104] of the as-filed specification, it is disclosed: “Embodiments described herein may be implemented in hardware or in software. The implementation can be performed using a non-transitory storage medium such as a computer-readable storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, and EPROM, an EEPROM or a FLASH memory. Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system.”. Therefore, a general-purpose computer or computing device is described and is merely used as a tool to apply the abstract idea. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional elements amount to no more than the use of a general-purpose computing device as a tool, as noted. The claim is not patent eligible.
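For illustration only, the computation recited in claim 1 (token-level importance prediction over the vocabulary, followed by an activation aggregated over the embedded input sequence) can be sketched as follows. The function name, array shapes, and the specific log-saturation activation are assumptions of the sketch, not a characterization of Applicant's implementation:

```python
import numpy as np

def predicted_term_importance(embeddings, W, b):
    """Sketch: map contextual token embeddings to one importance
    vector over the whole vocabulary space.

    embeddings: (seq_len, d) embedded input sequence of tokens.
    W: (d, vocab_size) trained neural (linear) layer.
    b: (vocab_size,) token-level bias.
    """
    # For each token of the embedded input sequence, predict an
    # importance of each of the vocabulary tokens.
    logits = embeddings @ W + b                 # (seq_len, vocab_size)
    # Activation over the embedded input sequence: keep the positive
    # part, log-saturate, then aggregate across sequence positions.
    weights = np.log1p(np.maximum(logits, 0.0))
    return weights.sum(axis=0)                  # (vocab_size,)
```

The resulting vector is defined with respect to every token of the vocabulary, which is what allows it to serve as a sparse, bag-of-words-style representation for an information retrieval model.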
With respect to claim 2, the claim(s) recite:
2. The method of claim 1, wherein the activation comprises a concave activation function.
This is directed to the abstract idea grouping of a mathematical concept and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules (i.e., mathematical concept(s)).
No additional limitations are present.
With respect to claim 3, the claim(s) recite:
3. The method of claim 2, wherein the concave activation function comprises a logarithmic activation function or a radical function.
This is directed to the abstract idea grouping of a mathematical concept and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules (i.e., mathematical concept(s)).
No additional limitations are present.
With respect to claim 4, the claim(s) recite:
4. The method of claim 2, wherein the activation comprises a logarithmic activation function, wherein said logarithmic activation comprises:
for each token in the vocabulary, determining a maximum of a log-saturation of the determined importance of the token in the vocabulary over the embedded input sequence, wherein the log-saturation prevents some terms in the vocabulary from dominating and ensures sparsity in the representation.
This is directed to the abstract idea grouping of a mathematical concept and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules (i.e., mathematical concept(s)).
No additional limitations are present.
With respect to claim 5, the claim(s) recite:
5. The method of claim 1, wherein the activation comprises a logarithmic activation function, wherein said logarithmic activation comprises:
for each token in the vocabulary, combining a log-saturation of the determined importance of the token in the vocabulary over the embedded input sequence, wherein the log-saturation prevents some terms in the vocabulary from dominating and ensures sparsity in the representation.
This is directed to the abstract idea grouping of a mathematical concept and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules (i.e., mathematical concept(s)).
No additional limitations are present.
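For illustration, the two logarithmic-activation variants recited in claim 4 (maximum of the log-saturation over the sequence) and claim 5 (combining, e.g., summing, the log-saturation over the sequence) can be sketched as follows; the function and array names are illustrative assumptions:

```python
import numpy as np

def log_saturation_max(logits):
    # Claim 4 variant: for each vocabulary token, take the maximum of
    # the log-saturated importance over the embedded input sequence.
    return np.log1p(np.maximum(logits, 0.0)).max(axis=0)

def log_saturation_sum(logits):
    # Claim 5 variant: combine (sum) the log-saturated importance of
    # each vocabulary token over the embedded input sequence.
    return np.log1p(np.maximum(logits, 0.0)).sum(axis=0)
```

In both variants the logarithm compresses large values, preventing a few terms from dominating, while the rectification zeroes out negative predictions, encouraging sparsity in the representation.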
With respect to claim 6, the claim(s) recite:
6. The method of claim 1, further comprising:
tokenizing a received query using the vocabulary;
determining a ranking score for each of a plurality of candidate sequences, the candidate sequences being respectively associated with candidate documents, wherein said determining a ranking score comprises:
comparing the output predicted term importance for the candidate sequence for each vocabulary token in the tokenized query;
ranking the plurality of candidate sequences based on said determined ranking score; and
retrieving a subset of the candidate documents having a highest ranking.
This is directed to the abstract idea grouping of mental process and reads on a human (i.e., mentally or using pen and paper):
Rewriting the sequence in tokens (e.g., segments);
Determining the importance of each token and associating it with candidates (e.g., other text in the same category(ies));
Writing down the importance of the sequence;
Ranking or ordering the candidate sequence based on said importances;
Selecting candidate document (e.g., text in a particular category(ies)) having the highest rank or order.
No additional limitations are present.
With respect to claim 7, the claim(s) recite:
7. The method of claim 1, wherein the information retrieval model includes a first stage that is a ranker stage and a second stage that is a re-ranker stage.
This is directed to the abstract idea grouping of mental process and reads on a human (i.e., mentally or using pen and paper):
Ranking a sequence (e.g., sentence) in a first stage and then ranking it a second time (i.e., re-ranking).
No additional limitations are present.
With respect to claim 8, the claim(s) recite:
8. The method of claim 1, further comprising:
comparing the output predicted term importance for the input sequence to a previously determined predicted term importance for each of a plurality of candidate sequences, the candidate sequences being respectively associated with candidate documents;
ranking the plurality of candidate sequences based on said comparing;
retrieving a subset of the candidate documents having a highest ranking.
This is directed to the abstract idea grouping of mental process and reads on a human (i.e., mentally or using pen and paper):
Comparing importance of the sequence/candidates;
Ranking candidates based on comparison;
Selecting candidate document (e.g., text in a particular category(ies)) having the highest rank or order.
No additional limitations are present.
With respect to claim 9, the claim(s) recite:
9. The method of claim 8, wherein said comparing comprises calculating a dot product between the output predicted term importance of the input sequence and the predicted term importance for each of the plurality of candidate sequences.
This is directed to the abstract idea grouping of a mathematical concept and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules (i.e., mathematical concept(s)).
No additional limitations are present.
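For illustration, the comparing, ranking, and retrieving steps of claims 8-9 (a dot product between predicted term importance vectors, followed by selection of the highest-ranking candidates) can be sketched as follows; the function and variable names are assumptions of the sketch:

```python
import numpy as np

def rank_candidates(query_repr, candidate_reprs, k):
    """Sketch: score candidates by the dot product of predicted term
    importance vectors, then return the top-k candidate indices.

    query_repr: (vocab_size,) representation of the input sequence.
    candidate_reprs: (n_candidates, vocab_size) previously determined
        representations of the candidate sequences/documents.
    """
    scores = candidate_reprs @ query_repr   # one dot product per candidate
    top = np.argsort(-scores)[:k]           # subset with highest ranking
    return top, scores[top]
```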
With respect to claim 10, the claim(s) recite:
10. The method of claim 1, wherein said embedding each token of the tokenized input sequence is based at least on the vocabulary and the token's position within the input sequence to provide context embedded tokens.
This is directed to the abstract idea grouping of mental process and reads on a human (i.e., mentally or using pen and paper):
Wherein the rewriting the sequence in tokens (e.g., segments) is based on the token’s position in the sequence to provide context.
No additional limitations are present.
With respect to claim 11, the claim(s) recite:
11. The method of claim 10, wherein said determining a prediction comprises:
transforming the context embedded tokens using at least one logit function to predict an importance of each token in the vocabulary with respect to each token of the embedded input sequence.
This is directed to the abstract idea grouping of a mathematical concept and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules (i.e., mathematical concept(s)).
No additional limitations are present.
With respect to claim 12, the claim(s) recite:
12. The method of claim 11, wherein the one or more trained neural layers comprise one or more linear layers with an activation, and wherein the at least one logit function further employs a normalization layer;
the one or more linear layers combining the transformation with the respective vocabulary token of the embedded input sequence and a token-level bias.
This is directed to the abstract idea grouping of a mathematical concept and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules (i.e., mathematical concept(s)), wherein the predetermined set of rules includes transforming data using a series of computations.
No additional limitations are present.
With respect to claim 13, the claim(s) recite:
13. The method of claim 1, wherein the one or more trained linear layers are trained using in-batch negative sampling and further trained using regularization to sparsify vector representations
This is directed to the abstract idea grouping of a mental process and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules, wherein the predetermined set of rules includes transforming data using a predefined/known series of computations (i.e., mathematical concept(s)).
No additional limitations are present.
With respect to claim 14, the claim(s) recite:
14. The method of claim 1, wherein the pretrained language model is pretrained using a masked language modeling method.
This is directed to the abstract idea grouping of a mental process and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules.
No additional limitations are present.
With respect to claim 15, the claim(s) recite:
15. The method of claim 1, wherein said performing an activation comprises, for each token in the embedded input sequence, applying an activation function to the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space over the embedded input sequence to ensure the positivity of the determined term weights, and performing a concave function on a result of the applied activation function.
This is directed to the abstract idea grouping of a mathematical concept and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules (i.e., mathematical concept(s)), wherein the importance is determined with respect to each of the tokens of the sequence/vocabulary.
No additional limitations are present.
With respect to claims 41 and 44-45, the claim(s) recite:
41/44/45. The method of claims 40/1/39, wherein the predicted term importance over the vocabulary is sparse.
This is directed to the abstract idea grouping of a mental process and reads on a human (i.e., mentally or using pen and paper):
Performing steps using a predetermined set of rules.
No additional limitations are present.
With respect to claim 43, the claim(s) recite:
43. The method of claim 40, wherein the input sequence is one of a query and a document sequence.
This is directed to the abstract idea grouping of a mental process and reads on a human (i.e., mentally or using pen and paper):
Wherein the sequence is one of a query and a document (e.g., received from another human).
No additional limitations are present.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 6-7, 10-11, 13, 39-41, and 43-45 are rejected under 35 U.S.C. 103 as being unpatentable over Harley et al. (US 20210319907 A1) in view of Bai et al. (Bai, Yang, et al., "SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval," arXiv preprint arXiv:2010.00768 (2020), https://arxiv.org/pdf/2010.00768) and further in view of Hall et al. (US 9536522 B1).
As to independent claim 1, Harley et al. teaches:
1. A method implemented by a computer having a processor and memory for providing a representation of an input sequence over a vocabulary for a neural information retrieval model for identifying a set of documents (see ¶ [0009]: “The method can comprise storing a plurality of multi-omic data indices, wherein each of the plurality of multi-omic data indices comprises cancer-specific tokenized data… The method can further comprise ranking the selected one or more multi-omic data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency. The method can further comprise returning the ranked one or more multi-omic data indices to the user.” and ¶ [0178]: “As discussed above, the systems described herein, in accordance with various embodiments, can facilitate the integrating of neural information retrieval models aim to provide better semantic understanding capabilities for ranking literature, images, and annotations. In various embodiments, distributed representations of words (like those generated by word2vec) can be combined to generate embeddings for queries and documents, and averaged embeddings can be used to generate effective document similarity retrieval.”), the method comprising:
embedding, by a pretrained neural language model implemented by the processor and having a transformer architecture, each token of an input sequence to provide an embedded input sequence (see ¶ [0009 and 0178] citation as in limitation above: “…distributed representations of words (like those generated by word2vec) can be combined to generate embeddings for queries and documents, and averaged embeddings can be used to generate effective document similarity retrieval” and further Fig. 5a: “Query parser and tokenizer” and ¶ [0176] : “FIGS. 5a and 5b, which illustrate a query engine workflow that functions to (1) produce synonym and abbreviation expansion, (2) generate alternative (similar) queries, (3) produce content-based suggestions and provide query autocompletion and autocorrection functionality, (4) classify user query intent (e.g., does the user want variants, genes, pathways, samples, single sample data, cohort sample data, sample vs cohort comparison, cohort vs cohort comparison, publications, images?), (5) perform neural information retrieval (e.g., based on a joint embedding of query and indexed documents) and (6) provide summarization of documentation (e.g., multiple sources text summarization), which can be delivered back to user via the system UI. In accordance with various embodiments, topic-specific term embeddings can be used for query expansion, particularly in (2) above…” and further ¶ [0157]: “In accordance with various embodiments, ranking for variants and genes can be learned separately, or as part of deep-and-wide modes together with ranking for other document types. In some embodiments, ranking for text documents utilizes deep learning language modelling (LM) ranks items by probability of document given a query. In accordance with various embodiments, the deep learning language model can be a transformer model (e.g., BERT, RoBERTa, Xlnet, Albert) fine-tuned on relevant data. 
Such models can be large scale, pre-trained language model embeddings. In accordance with various embodiments, document relevance can be generated using textual and temporal parts of documents, for example, by deriving multiple classes of features including, for example, entity features and time features both derived from a set of annotations, named entity recognition (NER), and temporal tagging.”),
the input sequence having been tokenized using the vocabulary (see Fig. 5a and ¶ [0176 and 0178] citations as in limitations above. “topic-specific term embeddings”);
the vocabulary having a vocabulary space including a plurality of tokens (see Fig. 5a and ¶ [0176 and 0178] citations as in limitations above and further ¶ [0177]: “As discussed above, the systems and methods described herein, in accordance with various embodiments, can facilitate the integrating of query term expansion using deep learning models trained on bio-medical literature and medical ontologies available (e.g., GO, UMLS, DO, MeSH, eVOC, HPO, MPO).”);
However, Harley et al. does not explicitly teach, but Bai et al. does teach:
for each token of the embedded input sequence, determining, using one or more trained neural layers implemented by the processor, a prediction of an importance of each of the plurality of tokens of the vocabulary in the vocabulary space (see ¶ 3.1 Overview: “Figure 2(a) depicts the general architecture of SparTerm which comprises an importance predictor and a gating controller. Given the original textual passage p, we aim to map it into a deep and contextualized sparse representation p' in the vocabulary space. The mapping process can be formulated as: p' = F(p) ⊙ G(p), where F is the term importance predictor and G the gating controller. The importance predictor F generates a dense vector representing the semantic importance of each term in the vocabulary…”,
¶ 3.2 The Importance Predictor: “…As shown in Figure 2(b), prior to importance prediction, BERT-based encoder is employed to help get the deep contextualized embedding h_i for each term w_i in the passage p. Each h_i models the surrounding context from a certain position i, thus providing a different view of which terms are semantically related to the topic of the current passage. With a token-wise importance predictor, we obtain a dense importance distribution I_i of dimension v for each h_i: I_i = Transform(h_i)·E^T + b (2), where Transform denotes a linear transformation with GELU activation and layer normalization, E is the shared word embedding matrix and b the bias term...”,
and ¶ 4.2 Implementation: “The Importance Predictor and Gating Controller of our model have the same architecture and hyper-parameters of BERT (12-layer, 768-hidden, 12-heads, 110M parameters) and do not share weights. We initialize the Importance Predictor with Google's official pre-trained BERT-base model while the parameters of Token-wise Importance Predictor are initialized with the Masked Language Prediction layer of BERT…”);
computing, using the processor, a predicted term importance of the input sequence over the vocabulary space by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in limitation above. More specifically and further: ¶ 3.2 The Importance Predictor: “Given the input passage p, the importance predictor outputs semantic importance of all the terms in the vocabulary, which unify term weighting and expansion into the framework. As shown in Figure 2(b), prior to importance prediction, BERT-based encoder is employed to help get the deep contextualized embedding h_i for each term w_i in the passage p. Each h_i models the surrounding context from a certain position i, thus providing a different view of which terms are semantically related to the topic of the current passage. With a token-wise importance predictor, we obtain a dense importance distribution I_i of dimension v for each h_i: I_i = Transform(h_i)·E^T + b (2), where Transform denotes a linear transformation with GELU activation and layer normalization, E is the shared word embedding matrix and b the bias term. Note that the token-wise importance prediction module is similar to the masked language prediction layer in BERT, thus we can initialize this part of parameters directly from pre-trained BERT. The final passage-wise importance distribution can be fetched simply by the summation of all token-wise importance distributions: I = Σ_{i=0}^{L} ReLU(I_i) (3), where L is the sequence length of passage p and the ReLU activation function is leveraged to ensure the nonnegativity of importance logits.”),
wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in limitation above. More specifically and further: ¶ 3.1 Overview: “… Given the original textual passage p, we aim to map it into a deep and contextualized sparse representation p' in the vocabulary space. The mapping process can be formulated as: p' = F(p) ⊙ G(p), where F is the term importance predictor and G the gating controller. The importance predictor F generates a dense vector representing the semantic importance of each term in the vocabulary…”); and
outputting the predicted term importance of the input sequence over the vocabulary space as the representation of the input sequence over the vocabulary (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in limitation above and further ¶ 5.4 Analysis of Term Weighting. “…Figure 3 shows three different queries (the first column) and the most relevant passages. The depth of the color represents the weights of terms, deeper is higher. We find that both DeepCT and SparTerm can figure out the most important terms and give them higher weights. However, DeepCT obtains sparser and sharper distributions and only activates very few terms in a passage, missing some important terms, such as "allergic reaction" in the first case. SparTerm can yield a smoother importance distribution by activating more terms though not appearing in the query. This distribution allows the passage to be retrieved by more queries. This also demonstrates that our model has a better ability on pointing out important terms in a passage.”),
Harley et al. and Bai et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis (e.g., tokenization). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. to incorporate the teachings of Bai et al. of for each token of the embedded input sequence, determining, using one or more trained neural layers implemented by the processor, a prediction of an importance of each of the plurality of tokens of the vocabulary in the vocabulary space; computing, using the processor, a predicted term importance of the input sequence over the vocabulary space by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space, wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space; and outputting the predicted term importance of the input sequence over the vocabulary space as the representation of the input sequence over the vocabulary which provides the benefit of demonstrating that the model has a better ability on pointing out important terms in a passage (¶ 5.4 Analysis of Term Weighting of Bai et al.).
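For reference, equations (2) and (3) of Bai et al. as quoted above can be sketched numerically as follows. The tanh-based GELU approximation and the omission of layer normalization are simplifying assumptions of the sketch:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def sparterm_importance(H, W_t, E, b):
    """Sketch of SparTerm's importance predictor.

    H:   (L, d) contextualized embeddings h_i from the BERT encoder.
    W_t: (d, d) linear transformation (layer normalization omitted).
    E:   (v, d) shared word embedding matrix.
    b:   (v,)   bias term.
    """
    transformed = gelu(H @ W_t)               # Transform(h_i), Eq. (2)
    I = transformed @ E.T + b                 # token-wise I_i of dimension v
    return np.maximum(I, 0.0).sum(axis=0)     # Eq. (3): sum of ReLU(I_i)
```

The summation over token positions in the final line is what turns the per-token distributions I_i into a single passage-wise importance distribution over the vocabulary.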
However, Harley et al. in combination with Bai et al. do not explicitly teach, but Hall et al. does teach:
wherein the output predicted term importance is used by the neural information retrieval model (as already taught by Harley et al. in claim 1, above: ¶ [0178]) for an information retrieval task in response to a query for identifying the set of documents representative of the query (see ¶ at Col. 1, line 16: “(2) …When a natural language processing model that has been trained on training examples from a well-behaved domain is given input such as search queries and potential search results, which may be, for example, web documents, the results may be much worse than when the natural language processing model is given input similar to the training examples…”
¶ at Col. 2, line 26: “(7) A search query or potential search result may be received. An information retrieval model annotation may be added to the search query or potential search result. The trained natural language processing model may be applied to the search query or potential search result to obtain a prediction and a confidence score. The prediction may be, for example, one or more of a part-of-speech prediction, a parse-tree prediction, a mention chunking prediction, a beginning, inside, and outside label prediction, and a named entity recognition prediction. The confidence score may be a confidence level of the prediction.”
¶ at Col. 4, line 9: “(11) FIG. 1 shows an example system for training a natural language processing model according to an implementation of the disclosed subject matter. A computer 100 may include a machine learning system 110. The machine learning system 110 may include a natural language processing model 120, an information retrieval model 130, and a database 140. The computer 100 may be any suitable device, such as, for example, a computer 20 as described in FIG. 5, for implementing the machine learning system 110. The computer 100 may be a single computing device, or may include multiple connected computing devices. The natural language processing model 120 may be a part of the machine learning system 110 for making predictions about the linguistic structure of text inputs. The natural language processing model 120 may be implemented as, for example, a Bayesian network, artificial neural network, support vector machine, or any other suitable statistical or heuristic machine learning system type. The information retrieval model 130 may be any suitable model for retrieving information, such as, for example, web documents, related to a search query. The information retrieval model may be implemented in any suitable manner, such as, for example, with statistical or probabilistic models or machine learning systems. The database 140 may store a training data set 141 and an annotated training data set 142.”
and further Fig. 3 and ¶ at Col. 6, line 34: “(20) FIG. 3 shows an example arrangement for using a trained natural language processing model according to an implementation of the disclosed subject matter. Once the natural language processing model 120 has been trained, the trained natural language processing model 310 may be used to make predictions for novel input. The input to the natural language processing model 120 may be, for example, a search query as depicted in FIG. 3, or a potential search result. Potential search results may be, for example, web documents or web pages, or other suitable documents or file types that include text and may be subject to searching. The input, which may be a target document, may be sent to the information retrieval model 130 from an input source 320, which may be, for example, a computing device on which a user has entered a search query into a web page, or a database containing potential search results for a search query. The information retrieval model 130 may add information retrieval model annotations to the input, and then pass the annotated input to the trained natural language processing model 310. The trained natural language processing model 310 may make predictions about the input, which may be sent back to the information retrieval model 130 for use in information retrieval…”);
Harley et al. and Bai et al. and Hall et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in text/query analysis and/or predictions. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. and Bai et al. to incorporate the teachings of Hall et al. of wherein the output predicted term importance is used by the neural information retrieval model for an information retrieval task in response to a query for identifying the set of documents representative of the query, which provides the benefit of providing better predictions, for example more accurate part-of-speech predictions, with better disambiguation, for use by the information retrieval model and assisting the information retrieval model in finding more relevant search results for a search query, or making more accurate determinations as to the relevance of a potential search result to a search query (¶ Col. 6, lines 57-63 of Hall et al.).
As to independent claim 39, Harley et al. teaches:
39. A non-transitory computer-readable medium having executable instructions stored thereon for causing a processor and a memory to implement a method for providing a representation of an input sequence over a vocabulary in a first-stage ranker of a neural information retrieval model (see ¶ [0009 and 0178] citations as in claim 1, above and further: ¶ [0009]: “In accordance with various embodiments, a non-transitory computer-readable medium is provided in which a program is stored for causing a computer to perform a method for utilizing multi-omic data indices for tumor profiling. …”, ¶ [0074]: “… The results retrieved from the indexer 115 can be ranked by a ranking engine 165 (e.g., learning-to-rank engine), which can be configured to derive a ranking model for, for example, variants, genes, pathways, phenotypes, text data, and images…”, and ¶ [0125]: “In accordance with various embodiments, a computer-implemented system is provided for utilizing multi-omic data indices for tumor profiling. The system can comprise a computer storage, a digital processing device comprising at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to create a multi-omic cancer search engine application…”), the method comprising:
[the limitations as in claim 1, above and taught by Harley et al. in combination with Bai et al. and Hall et al. and further:]
Harley et al. further teaches:
first-stage ranker of the neural information retrieval model (as already taught by Harley et al. in claim 1, above: ¶ [0178] and further: ¶ [0074] “… The results retrieved from the indexer 115 can be ranked by a ranking engine 165 (e.g., learning-to-rank engine), which can be configured to derive a ranking model for, for example, variants, genes, pathways, phenotypes, text data, and images…”)
Bai et al. further teaches:
outputting the predicted term importance of the input sequence over the vocabulary space as the representation of the input sequence over the vocabulary (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in limitation above and further ¶ 5.4 Analysis of Term Weighting. “…Figure 3 shows three different queries (the first column) and the most relevant passages. The depth of the color represents the weights of terms, deeper is higher. We find that both DeepCT and SparTerm can figure out the most important terms and give them higher weights. However, DeepCT obtains sparser and sharper distributions and only activates very few terms in a passage, missing some important terms, such as "allergic reaction" in the first case. SparTerm can yield a smoother importance distribution by activating more terms though not appearing in the query. This distribution allows the passage to be retrieved by more queries. This also demonstrates that our model has a better ability on pointing out important terms in a passage.”).
Harley et al. and Bai et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis (e.g., tokenization). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. to incorporate the teachings of Bai et al. of outputting the predicted term importance of the input sequence over the vocabulary space as the representation of the input sequence over the vocabulary which provides the benefit of demonstrating that the model has a better ability on pointing out important terms in a passage (¶ 5.4 Analysis of Term Weighting of Bai et al.).
Hall et al. further teaches:
wherein the output predicted term importance is used by the information retrieval model for an information retrieval task in response to a query for identifying the set of documents representative of the query (see ¶ at Col. 1, line 16: “(2) …When a natural language processing model that has been trained on training examples from a well-behaved domain is given input such as search queries and potential search results, which may be, for example, web documents, the results may be much worse than when the natural language processing model is given input similar to the training examples…”
¶ at Col. 2, line 26: “(7) A search query or potential search result may be received. An information retrieval model annotation may be added to the search query or potential search result. The trained natural language processing model may be applied to the search query or potential search result to obtain a prediction and a confidence score. The prediction may be, for example, one or more of a part-of-speech prediction, a parse-tree prediction, a mention chunking prediction, a beginning, inside, and outside label prediction, and a named entity recognition prediction. The confidence score may be a confidence level of the prediction.”
¶ at Col. 4, line 9: “(11) FIG. 1 shows an example system for training a natural language processing model according to an implementation of the disclosed subject matter. A computer 100 may include a machine learning system 110. The machine learning system 110 may include a natural language processing model 120, an information retrieval model 130, and a database 140. The computer 100 may be any suitable device, such as, for example, a computer 20 as described in FIG. 5, for implementing the machine learning system 110. The computer 100 may be a single computing device, or may include multiple connected computing devices. The natural language processing model 120 may be a part of the machine learning system 110 for making predictions about the linguistic structure of text inputs. The natural language processing model 120 may be implemented as, for example, a Bayesian network, artificial neural network, support vector machine, or any other suitable statistical or heuristic machine learning system type. The information retrieval model 130 may be any suitable model for retrieving information, such as, for example, web documents, related to a search query. The information retrieval model may be implemented in any suitable manner, such as, for example, with statistical or probabilistic models or machine learning systems. The database 140 may store a training data set 141 and an annotated training data set 142.”
and further Fig. 3 and ¶ at Col. 6, line 34: “(20) FIG. 3 shows an example arrangement for using a trained natural language processing model according to an implementation of the disclosed subject matter. Once the natural language processing model 120 has been trained, the trained natural language processing model 310 may be used to make predictions for novel input. The input to the natural language processing model 120 may be, for example, a search query as depicted in FIG. 3, or a potential search result. Potential search results may be, for example, web documents or web pages, or other suitable documents or file types that include text and may be subject to searching. The input, which may be a target document, may be sent to the information retrieval model 130 from an input source 320, which may be, for example, a computing device on which a user has entered a search query into a web page, or a database containing potential search results for a search query. The information retrieval model 130 may add information retrieval model annotations to the input, and then pass the annotated input to the trained natural language processing model 310. The trained natural language processing model 310 may make predictions about the input, which may be sent back to the information retrieval model 130 for use in information retrieval…”);
Harley et al. and Bai et al. and Hall et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in text/query analysis and/or predictions. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. and Bai et al. to incorporate the teachings of Hall et al. of wherein the output predicted term importance is used by the information retrieval model for an information retrieval task in response to a query for identifying the set of documents representative of a query which provides the benefit of providing better predictions, for example more accurate part-of-speech predictions, with better disambiguation, for use by the information retrieval model and assisting the information retrieval model in finding more relevant search results for a search query, or making more accurate determinations as to the relevance of a potential search to a search query (¶ Col. 6, lines 57-63 of Hall et al.).
As to independent claim 40, Harley et al. teaches:
40. A computer-implemented method for processing an input sequence (see ¶ [0178] citations as in claim 1, above and further: ¶ [0009 and 0125] citations as in claim 39, above.), the method comprising:
embedding, by a pretrained neural language model implemented by the processor and having a transformer architecture, each token of an input sequence to provide an embedded input sequence (see ¶ [0009 and 0178] citation as in limitation above: “…distributed representations of words (like those generated by word2vec) can be combined to generate embeddings for queries and documents, and averaged embeddings can be used to generate effective document similarity retrieval” and further Fig. 5a: “Query parser and tokenizer” and ¶ [0176] : “FIGS. 5a and 5b, which illustrate a query engine workflow that functions to (1) produce synonym and abbreviation expansion, (2) generate alternative (similar) queries, (3) produce content-based suggestions and provide query autocompletion and autocorrection functionality, (4) classify user query intent (e.g., does the user want variants, genes, pathways, samples, single sample data, cohort sample data, sample vs cohort comparison, cohort vs cohort comparison, publications, images?), (5) perform neural information retrieval (e.g., based on a joint embedding of query and indexed documents) and (6) provide summarization of documentation (e.g., multiple sources text summarization), which can be delivered back to user via the system UI. In accordance with various embodiments, topic-specific term embeddings can be used for query expansion, particularly in (2) above…” and further ¶ [0157]: “In accordance with various embodiments, ranking for variants and genes can be learned separately, or as part of deep-and-wide modes together with ranking for other document types. In some embodiments, ranking for text documents utilizes deep learning language modelling (LM) ranks items by probability of document given a query. In accordance with various embodiments, the deep learning language model can be a transformer model (e.g., BERT, RoBERTa, Xlnet, Albert) fine-tuned on relevant data. 
Such models can be large scale, pre-trained language model embeddings. In accordance with various embodiments, document relevance can be generated using textual and temporal parts of documents, for example, by deriving multiple classes of features including, for example, entity features and time features both derived from a set of annotations, named entity recognition (NER), and temporal tagging.”),
the input sequence having been tokenized using a predetermined vocabulary having a vocabulary space including a plurality of tokens (see Fig. 5a and ¶ [0009, 0157, 0176 and 0178] citations as in claim 1, above “topic-specific term embeddings” and further ¶ [0177]: “As discussed above, the systems and methods described herein, in accordance with various embodiments, can facilitate the integrating of query term expansion using deep learning models trained on bio-medical literature and medical ontologies available (e.g., GO, UMLS, DO, MeSH, eVOC, HPO, MPO).”);
However, Harley et al. does not explicitly teach, but Bai et al. does teach:
predicting term importance of the embedded input sequence of tokens with respect to each of the plurality of tokens of the predetermined vocabulary (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in claim 1, above. More specifically, ¶ 3.1 Overview: “Figure 2(a) depicts the general architecture of SparTerm which comprises an importance predictor and a gating controller. Given the original textual passage p, we aim to map it into a deep and contextualized sparse representation p′ in the vocabulary space. The mapping process can be formulated as: p′ = F(p) ⊙ G(p), where F is the term importance predictor and G the gating controller. The importance predictor F generates a dense vector representing the semantic importance of each term in the vocabulary…”);
wherein said predicting term importance (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in claim 1 and in limitation, above.) comprises:
for each token of the embedded input sequence, predicting, using one or more trained neural layers implemented by the processor, an importance of each of the plurality of tokens of the vocabulary in the vocabulary space (see ¶ 3.1 Overview: “Figure 2(a) depicts the general architecture of SparTerm which comprises an importance predictor and a gating controller. Given the original textual passage p, we aim to map it into a deep and contextualized sparse representation p′ in the vocabulary space. The mapping process can be formulated as: p′ = F(p) ⊙ G(p), where F is the term importance predictor and G the gating controller. The importance predictor F generates a dense vector representing the semantic importance of each term in the vocabulary…”,
¶ 3.2 The Importance Predictor: “…As shown in Figure 2(b), prior to importance prediction, BERT-based encoder is employed to help get the deep contextualized embedding h_i for each term w_i in the passage p. Each h_i models the surrounding context from a certain position i, thus providing a different view of which terms are semantically related to the topic of the current passage. With a token-wise importance predictor, we obtain a dense importance distribution I_i of dimension v for each h_i: I_i = Transform(h_i)E^T + b (2), where Transform denotes a linear transformation with GELU activation and layer normalization, E is the shared word embedding matrix and b the bias term...”,
and ¶ 4.2 Implementation: “The Importance Predictor and Gating Controller of our model have the same architecture and hyper-parameters of BERT (12-layer, 768-hidden, 12-heads, 110M parameters) and do not share weights. We initialize the Importance Predictor with Google's official pre-trained BERTbase model while the parameters of Token-wise Importance Predictor are initialized with the Masked Language Prediction layer of BERT…”); and
computing, using the processor, a predicted term importance of the input sequence over the vocabulary by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in limitation above.),
wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in limitation above. More specifically and further: ¶ 3.1 Overview: “… Given the original textual passage p, we aim to map it into a deep and contextualized sparse representation p′ in the vocabulary space. The mapping process can be formulated as: p′ = F(p) ⊙ G(p), where F is the term importance predictor and G the gating controller. The importance predictor F generates a dense vector representing the semantic importance of each term in the vocabulary…”).
Harley et al. and Bai et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis (e.g., tokenization). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. to incorporate the teachings of Bai et al. of predicting term importance of the embedded input sequence of tokens with respect to each of the plurality of tokens of the predetermined vocabulary; wherein said predicting term importance comprises: for each token of the embedded input sequence, predicting, using one or more trained neural layers implemented by the processor, an importance of each of the plurality of tokens of the vocabulary in the vocabulary space; and computing, using the processor, a predicted term importance of the input sequence over the vocabulary by performing an activation over the embedded input sequence of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space, wherein the predicted term importance of the input sequence over the vocabulary space is with respect to each of the plurality of tokens of the vocabulary in the vocabulary space which provides the benefit of demonstrating that the model has a better ability on pointing out important terms in a passage (¶ 5.4 Analysis of Term Weighting of Bai et al.).
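For clarity of the record, the token-wise importance prediction and activation described in the SparTerm excerpts quoted above (Equations (2) and (3)) can be sketched as follows. This is a minimal NumPy illustration only; the dimensions, random values, and the identity stand-in for the Transform layer are the Examiner's illustrative assumptions, not material from Harley et al., Bai et al., or Hall et al.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (hypothetical) dimensions:
L, d, V = 4, 8, 20   # sequence length, hidden size, vocabulary size

H = rng.normal(size=(L, d))   # contextualized token embeddings h_i (e.g., from a BERT encoder)
E = rng.normal(size=(V, d))   # shared word-embedding matrix E
b = np.zeros(V)               # bias term b

# Token-wise importance over the vocabulary, per Equation (2):
# I_i = Transform(h_i) E^T + b  (Transform taken as identity here for brevity)
I_token = H @ E.T + b         # shape (L, V): one importance row per input token

# Passage-wise importance, per Equation (3): ReLU then sum over token positions,
# yielding one non-negative importance value per vocabulary token
I_passage = np.maximum(I_token, 0.0).sum(axis=0)   # shape (V,)
```

The ReLU ensures the aggregated importance logits are non-negative, which is what allows the resulting vocabulary-space vector to be interpreted as a sparse term-weight representation.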
However, Harley et al. in combination with Bai et al. do not explicitly teach, but Hall et al. does teach:
providing the predicted term importance of the embedded input sequence of tokens for use by an information retrieval model in response to a query for identifying the set of documents; (see ¶ at Col. 1, line 16: “(2) …When a natural language processing model that has been trained on training examples from a well-behaved domain is given input such as search queries and potential search results, which may be, for example, web documents, the results may be much worse than when the natural language processing model is given input similar to the training examples…”
¶ at Col. 2, line 26: “(7) A search query or potential search result may be received. An information retrieval model annotation may be added to the search query or potential search result. The trained natural language processing model may be applied to the search query or potential search result to obtain a prediction and a confidence score. The prediction may be, for example, one or more of a part-of-speech prediction, a parse-tree prediction, a mention chunking prediction, a beginning, inside, and outside label prediction, and a named entity recognition prediction. The confidence score may be a confidence level of the prediction.”
¶ at Col. 4, line 9: “(11) FIG. 1 shows an example system for training a natural language processing model according to an implementation of the disclosed subject matter. A computer 100 may include a machine learning system 110. The machine learning system 110 may include a natural language processing model 120, an information retrieval model 130, and a database 140. The computer 100 may be any suitable device, such as, for example, a computer 20 as described in FIG. 5, for implementing the machine learning system 110. The computer 100 may be a single computing device, or may include multiple connected computing devices. The natural language processing model 120 may be a part of the machine learning system 110 for making predictions about the linguistic structure of text inputs. The natural language processing model 120 may be implemented as, for example, a Bayesian network, artificial neural network, support vector machine, or any other suitable statistical or heuristic machine learning system type. The information retrieval model 130 may be any suitable model for retrieving information, such as, for example, web documents, related to a search query. The information retrieval model may be implemented in any suitable manner, such as, for example, with statistical or probabilistic models or machine learning systems. The database 140 may store a training data set 141 and an annotated training data set 142.”
and further Fig. 3 and ¶ at Col. 6, line 34: “(20) FIG. 3 shows an example arrangement for using a trained natural language processing model according to an implementation of the disclosed subject matter. Once the natural language processing model 120 has been trained, the trained natural language processing model 310 may be used to make predictions for novel input. The input to the natural language processing model 120 may be, for example, a search query as depicted in FIG. 3, or a potential search result. Potential search results may be, for example, web documents or web pages, or other suitable documents or file types that include text and may be subject to searching. The input, which may be a target document, may be sent to the information retrieval model 130 from an input source 320, which may be, for example, a computing device on which a user has entered a search query into a web page, or a database containing potential search results for a search query. The information retrieval model 130 may add information retrieval model annotations to the input, and then pass the annotated input to the trained natural language processing model 310. The trained natural language processing model 310 may make predictions about the input, which may be sent back to the information retrieval model 130 for use in information retrieval…”);
wherein the predicted term importance of the input sequence of tokens with respect to each of the plurality of tokens of the predetermined vocabulary provides a representation of the input sequence over a predetermined vocabulary for a neural information retrieval model (as already taught by Harley et al. in claim 1, above: ¶ [0178]) for identifying a set of documents (see ¶ at Col. 1, line 16, ¶ at Col. 2, line 26, ¶ at Col. 4, line 9, and further Fig. 3 and ¶ at Col. 6, line 34 citations as in claim 1, above.);
Harley et al. and Bai et al. and Hall et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in text/query analysis and/or predictions. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. and Bai et al. to incorporate the teachings of Hall et al. of providing the predicted term importance of the embedded input sequence of tokens for use by an information retrieval model in response to a query for identifying the set of documents; and wherein the predicted term importance of the input sequence of tokens with respect to each of the plurality of tokens of the predetermined vocabulary provides a representation of the input sequence over a predetermined vocabulary for a neural information retrieval model for identifying a set of documents which provides the benefit of providing better predictions, for example more accurate part-of-speech predictions, with better disambiguation, for use by the information retrieval model and assisting the information retrieval model in finding more relevant search results for a search query, or making more accurate determinations as to the relevance of a potential search to a search query (¶ Col. 6, lines 57-63 of Hall et al.)
Regarding claim 6, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 1, above.
Harley et al. further teaches:
6. The method of claim 1, further comprising:
tokenizing a received query using the vocabulary (see Fig. 5a: “Query parser and tokenizer” and ¶ [0176] : “FIGS. 5a and 5b, which illustrate a query engine workflow that functions to (1) produce synonym and abbreviation expansion, (2) generate alternative (similar) queries, (3) produce content-based suggestions and provide query autocompletion and autocorrection functionality, (4) classify user query intent (e.g., does the user want variants, genes, pathways, samples, single sample data, cohort sample data, sample vs cohort comparison, cohort vs cohort comparison, publications, images?), (5) perform neural information retrieval (e.g., based on a joint embedding of query and indexed documents) and (6) provide summarization of documentation (e.g., multiple sources text summarization), which can be delivered back to user via the system UI. In accordance with various embodiments, topic-specific term embeddings can be used for query expansion, particularly in (2) above…”);
determining a ranking score for each of a plurality of candidate sequences (see ¶ [0009]: “…The method can further comprise indexing the ingested additional multi-omic data and annotations while preserving gene names, gene variant names and multi-omic mapping between different data streams for the same patient in the specific index, to produce tokenized ingested additional multi-omic data. The method can further comprise receiving a user query. The method can further comprise selecting one or more relevant multi-omic data indices based on the user query. The method can further comprise ranking the selected one or more multi-omic data indices based on at least one of clinical actionability, pathogenicity, feature weight, or frequency. The method can further comprise returning the ranked one or more multi-omic data indices to the user.”), the candidate sequences being respectively associated with candidate documents (see ¶ [0178]: “As discussed above, the systems described herein, in accordance with various embodiments, can facilitate the integrating of neural information retrieval models aim to provide better semantic understanding capabilities for ranking literature, images, and annotations. In various embodiments, distributed representations of words (like those generated by word2vec) can be combined to generate embeddings for queries and documents, and averaged embeddings can be used to generate effective document similarity retrieval.”),
wherein said determining a ranking score (see ¶ [0009] citation as in limitation above.) comprises:
comparing the output predicted term importance for the candidate sequence to the tokenized query (see Fig. 5a: “Query parser and tokenizer” and ¶ [0009, 0176 and 0178] citations as in limitations above and further: “[0074] ... The user interface 125 can allow a user to enter queries and receive results provided by a query engine 150. Query engine 150 can be configured to accept the user query; select, pre-join, aggregate, and summarize relevant multi-omic indices; and return ranked multi-omic data or features. In accordance with various embodiments, the system architecture can further include a load balancer 155 to accommodate the bi-directional transfer of data between UI 125 and query engine 150 for a large number of users…”);
ranking the plurality of candidate sequences based on said determined ranking score (see Fig. 5a: “Query parser and tokenizer” and ¶ [0009, 0176 and 0178] citations as in limitations above and further: [0074] “… The results retrieved from the indexer 115 can be ranked by a ranking engine 165 (e.g., learning-to-rank engine), which can be configured to derive a ranking model for, for example, variants, genes, pathways, phenotypes, text data, and images…”);
retrieving a subset of the candidate documents having a highest ranking (see Fig. 5a: “Query parser and tokenizer” and ¶ [0009, 0176 and 0178] citations as in limitations above and further: [0074]: “… The results retrieved from the indices can be ranked by the ranking engine and presented to the user in a ranked order…”).
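The ranking steps recited in claim 6 (comparing predicted term importance against the tokenized query, scoring, ranking, and retrieving a top subset) can be illustrated with a toy inner-product ranking over vocabulary-space vectors. All names, vector values, and the choice of inner product as the comparison are the Examiner's illustrative assumptions, not material drawn from the cited references.

```python
import numpy as np

V = 10  # hypothetical toy vocabulary size

# Hypothetical sparse term-importance vectors over the vocabulary,
# one for the tokenized query and one per candidate document sequence
query_importance = np.array([0, 2.0, 0, 1.5, 0, 0, 0, 0.5, 0, 0])
candidates = {
    "doc_a": np.array([0, 1.0, 0, 2.0, 0.0, 0, 0, 0.0, 0, 0]),
    "doc_b": np.array([0, 0.0, 0, 0.0, 3.0, 0, 0, 0.0, 0, 0]),
    "doc_c": np.array([0, 3.0, 0, 0.0, 0.0, 0, 0, 1.0, 0, 0]),
}

# Ranking score: inner product between query and candidate importance vectors;
# only vocabulary tokens active in both contribute, enabling efficient matching
scores = {doc: float(query_importance @ vec) for doc, vec in candidates.items()}

# Rank the candidates by score and retrieve the highest-ranking subset
k = 2
ranked = sorted(scores, key=scores.get, reverse=True)
top_k = ranked[:k]
```

Because both vectors live in the same vocabulary space, this comparison can be served by an inverted index rather than a dense nearest-neighbor search, which is the efficiency point argued in the Remarks.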
Regarding claim 7, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 1, above.
Harley et al. further teaches:
7. The method of claim 1,
wherein the information retrieval model includes a first stage that is a ranker stage (¶ [0009, 0176 and 0178] citations as in claims 1 and 9, above: [0074] “… The results retrieved from the indexer 115 can be ranked by a ranking engine 165 (e.g., learning-to-rank engine), which can be configured to derive a ranking model for, for example, variants, genes, pathways, phenotypes, text data, and images…”) and a second stage that is a re-ranker stage (see ¶ [0163]: “In accordance with various embodiments, global ranking can be optimized for clinical actionability (or pathogenicity when clinical utility is unknown) and preloaded into the indices, whereby results (subjected to, for example, a top-K algorithm) can be re-ranked to further satisfy a particular information need. In accordance with various embodiments, re-ranking can involve the use of language modeling or weighted transformed features from standard information retrieval models (e.g. PageRank, BM25, RM3).”).
Regarding claim 10, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 1, above.
Harley et al. further teaches:
10. The method of claim 1,
wherein said embedding each token of the tokenized input sequence is based at least on the vocabulary and the token's position within the input sequence to provide context embedded tokens (see ¶ [0178] citation as in claim 1, above: “…distributed representations of words (like those generated by word2vec) can be combined to generate embeddings for queries and documents, and averaged embeddings can be used to generate effective document similarity retrieval” and further Fig. 5a: “Query parser and tokenizer” and ¶ [0176]: “FIGS. 5a and 5b, which illustrate a query engine workflow that functions to (1) produce synonym and abbreviation expansion, (2) generate alternative (similar) queries, (3) produce content-based suggestions and provide query autocompletion and autocorrection functionality, (4) classify user query intent (e.g., does the user want variants, genes, pathways, samples, single sample data, cohort sample data, sample vs cohort comparison, cohort vs cohort comparison, publications, images?), (5) perform neural information retrieval (e.g., based on a joint embedding of query and indexed documents) and (6) provide summarization of documentation (e.g., multiple sources text summarization), which can be delivered back to user via the system UI. In accordance with various embodiments, topic-specific term embeddings can be used for query expansion, particularly in (2) above…” and further ¶ [0094]: “… A model for learning to rank can also include other machine-learning models or deep neural networks. The ranking can further comprise deep learning ranking. The ranking can further comprise a similarity between embeddings of a query and indexed documents in a joint embedding space learned via deep learning methods.
The deep learning ranking can be derived from a deep learning model selected from the group consisting of a deep semantic similarity model, a deep and wide model, a deep language model, a learned deep learning text embedding, a learned named entity recognition, Siamese neural network, and combinations thereof.”).
Regarding claim 11, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 10, above.
11. The method of claim 10,
Bai et al. further teaches:
wherein said determining a prediction (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in claim 1 above. More specifically: ¶ 3.1 Overview: “Figure 2(a) depicts the general architecture of SparTerm which comprises an importance predictor and a gating controller. Given the original textual passage p, we aim to map it into a deep and contextualized sparse representation p′ in the vocabulary space. The mapping process can be formulated as: p′ = F(p) ⊙ G(p), where F is the term importance predictor and G the gating controller. The importance predictor F generates a dense vector representing the semantic importance of each term in the vocabulary…”) comprises:
transforming the context embedded tokens using at least one logit function to predict an importance of each token in the vocabulary with respect to each token of the embedded input sequence (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in claim 1 above and further ¶¶ 1-2 of 3.2 The Importance Predictor: “…With a token-wise importance predictor, we obtain a dense importance distribution I_i of dimension v for each h_i: I_i = Transform(h_i)E^T + b (2), where Transform denotes a linear transformation with GELU activation and layer normalization, E is the shared word embedding matrix and b the bias term. Note that the token-wise importance prediction module is similar to the masked language prediction layer in BERT, thus we can initialize this part of parameters directly from pre-trained BERT. The final passage-wise importance distribution can be fetched simply by the summation of all token-wise importance distributions: I = Σ_{i=0}^{L} ReLU(I_i) (3), where L is the sequence length of passage p and the ReLU activation function is leveraged to ensure the nonnegativity of importance logits.”).
Harley et al. and Bai et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis (e.g., tokenization). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. to incorporate the teachings of Bai et al. of wherein said determining a prediction comprises: transforming the context embedded tokens using at least one logit function to predict an importance of each token in the vocabulary with respect to each token of the embedded input sequence which provides the benefit of demonstrating that the model has a better ability on pointing out important terms in a passage (¶ 5.4 Analysis of Term Weighting of Bai et al.).
Regarding claim 13, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 1, above.
Bai et al. further teaches:
13. The method of claim 1, wherein the one or more trained linear layers are trained using in-batch negative sampling and further trained using regularization to sparsify vector representations (see ¶ 4.2 Implementation: “The Importance Predictor and Gating Controller of our model have the same architecture and hyper-parameters of BERT (12-layer, 768-hidden, 12-heads, 110M parameters) and do not share weights. We initialize the Importance Predictor with Google's official pre-trained BERTbase model while the parameters of Token-wise Importance Predictor are initialized with the Masked Language Prediction layer of BERT. When using expansion-enhanced gating, the Gating Controller is also initialized with BERTbase. We fine-tune our model on the training set of MSMARCO passage retrieval dataset on 4 NVIDIA-V100 GPUs with a batch size of 128… To ensure the sparsity, the threshold in the Binarizer in Equation (4) is set to 0.7. We do not fine-tune our model on the training set of document retrieval dataset but just use the model trained on the passage retrieval dataset for the document ranking.”
and ¶ 6. Conclusion: “In this work, we propose SparTerm to directly learn term-based sparse representation in the full vocabulary space…”).
Harley et al. and Bai et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis (e.g., tokenization). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. to incorporate the teachings of Bai et al. wherein the one or more trained linear layers are trained using in-batch negative sampling and further trained using regularization to sparsify vector representations, which provides the benefit of demonstrating that the model has a better ability to point out important terms in a passage (¶ 5.4 Analysis of Term Weighting of Bai et al.).
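By way of illustration only, the combination of in-batch negative sampling with a regularizer that sparsifies vector representations, as discussed above, might be sketched as follows (hypothetical function and variable names; this is not code from Bai et al. or from the claims):

```python
import numpy as np

def in_batch_negative_loss(q, d, reg_weight=0.01):
    """Contrastive loss with in-batch negatives plus an L1 sparsity penalty.

    q, d: (batch, vocab) arrays of query/document term-importance vectors.
    The diagonal of the batch score matrix holds each query's positive
    document; every other document in the batch serves as a negative.
    The L1 term pushes most vocabulary weights toward zero (sparsity).
    Illustrative sketch only.
    """
    scores = q @ d.T                                    # (batch, batch) similarities
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -np.mean(np.diag(log_probs))                   # in-batch cross-entropy
    sparsity = reg_weight * (np.abs(q).mean() + np.abs(d).mean())
    return nll + sparsity
```

The diagonal-as-positive convention makes every other row of the score matrix a free negative, which is the usual reason in-batch sampling is preferred over mining explicit negatives.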
Regarding claim 41, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 40, above.
Harley et al. further teaches:
41. The method of claim 40,
wherein the predicted term importance over the vocabulary is sparse (see ¶ [0157]: “In accordance with various embodiments, ranking for variants and genes can be learned separately, or as part of deep-and-wide modes together with ranking for other document types. In some embodiments, ranking for text documents utilizes deep learning language modelling (LM) ranks items by probability of document given a query. In accordance with various embodiments, the deep learning language model can be a transformer model (e.g., BERT, RoBERTa, Xlnet, Albert) fine-tuned on relevant data. Such models can be large scale, pre-trained language model embeddings...”).
Regarding claim 43, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 40, above.
Harley et al. further teaches:
43. The method of claim 40,
wherein the input sequence is one of a query and a document sequence (see ¶ [0094]: “…The ranking can further comprise a similarity between embeddings of a query and indexed documents in a joint embedding space learned via deep learning methods…” and further ¶ [0178] citation as in limitation above: “…distributed representations of words (like those generated by word2vec) can be combined to generate embeddings for queries and documents, and averaged embeddings can be used to generate effective document similarity retrieval”).
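The averaged-embedding similarity described in Harley et al.'s ¶ [0178] can be sketched, for illustration only (hypothetical names; not the reference's code), as:

```python
import numpy as np

def average_embedding(tokens, embeddings):
    """Average per-token vectors (as with word2vec) into one vector
    for a query or document; tokens without vectors are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    """Cosine similarity between two averaged embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A query and a document sequence are handled identically here, consistent with the limitation that the input sequence may be either.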
Regarding claim 44, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 1, above.
Harley et al. further teaches:
44. (New) The method of claim 1, wherein the predicted term importance over the vocabulary is sparse (see ¶ [0157] citation as in claim 41, above).
Regarding claim 45, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 39, above.
Harley et al. further teaches:
45. (New) The method of claim 39, wherein the predicted term importance over the vocabulary is sparse (see ¶ [0157] citation as in claim 41, above).
Claims 2-3 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Harley et al. (US 20210319907 A1) and further in view of Bai et al. (Bai, Yang, et al. "Sparterm: Learning term-based sparse representation for fast text retrieval." arXiv preprint arXiv:2010.00768 (2020). https://arxiv.org/pdf/2010.00768) and Hall et al. (US 9536522 B1) as applied to claim 1 above, and further in view of Roffo (Roffo, Giorgio. "Ranking to learn and learning to rank: On the role of ranking in pattern recognition applications." arXiv preprint arXiv:1706.05933 (2017).).
Regarding claim 2, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 1, above.
However, Harley et al. in combination with Bai et al. and Hall et al. do not explicitly teach, but Roffo does teach:
2. The method of claim 1, wherein the activation comprises a concave activation function (see ¶ 2 of 3.1 A Biologically-Inspired Model: “The main activation functions (or rectifiers) used in the context of ANNs are [image: media_image1.png, Greyscale]” (i.e., Softplus)).
Harley et al., Bai et al., Hall et al. and Roffo are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Roffo of wherein the activation comprises a concave activation function which provides the benefit of producing effective ranks, contributing to push forward the state of the art (abstract of Roffo).
Regarding claim 3, Harley et al. in combination with Bai et al. and Hall et al. and Roffo teach all of the limitations as in claim 2, above.
Roffo further teaches:
3. The method of claim 2,
wherein the concave activation function comprises a logarithmic activation function or [only one needed] a radical function (see ¶ 2 of 3.1 A Biologically-Inspired Model citation as in claim 2, above. (i.e., activation functions – Softplus comprising natural logarithm function)).
Harley et al., Bai et al., Hall et al., and Roffo are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Roffo wherein the concave activation function comprises a logarithmic activation function, which provides the benefit of producing effective ranks, contributing to push forward the state of the art (abstract of Roffo).
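For reference, the Softplus rectifier discussed above is ln(1 + e^x), a smooth rectifier built from the natural logarithm. A minimal sketch (illustrative only):

```python
import math

def softplus(x):
    """Softplus rectifier: f(x) = ln(1 + e^x).
    Smoothly approximates ReLU; always positive; built on the
    natural logarithm, as noted in the mapping above."""
    return math.log1p(math.exp(x))
```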
Regarding claim 15, Harley et al. in combination with Bai et al. and Hall et al. and Roffo teach all of the limitations as in claim 1, above.
Bai et al. further teaches:
15. The method of claim 1,
the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space (see ¶ 3.1 Overview, ¶ 3.2 The Importance Predictor, and ¶ 4.2 Implementation citations as in claim 1 above. More specifically and further, ¶ 3.1 Overview: “… Given the original textual passage p, we aim to map it into a deep and contextualized sparse representation p′ in the vocabulary space. The mapping process can be formulated as: p′ = ℱ(p) ⊙ 𝒢(p), where ℱ is the term importance predictor and 𝒢 the gating controller. The importance predictor ℱ generates a dense vector representing the semantic importance of each term in the vocabulary…”)
Harley et al. and Bai et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis (e.g., tokenization). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. to incorporate the teachings of Bai et al. of the determined prediction of the importance of each of the plurality of tokens of the vocabulary in the vocabulary space, which provides the benefit of demonstrating that the model has a better ability to point out important terms in a passage (¶ 5.4 Analysis of Term Weighting of Bai et al.).
Roffo further teaches:
wherein said performing an activation (see ¶ 2 of 3.1 A Biologically-Inspired Model citation as in claim 2, above. (i.e., Softplus comprising natural logarithm function) and further: “…Therefore, an MLP can be viewed as a logistic regression classifier where the input is first transformed using a learnt non-linear transformation -. This transformation projects the input data into a space where it becomes linearly separable…”) comprises,
for each token in the embedded input sequence (see ¶ 2 of 3.1 A Biologically-Inspired Model citation as in claim 2, above. (i.e., Softplus comprising natural logarithm function)),
applying an activation function to the determined prediction of the importance of each of the plurality of tokens over the embedded input sequence to ensure the positivity of the determined term weights (see ¶ 2 of 3.1 A Biologically-Inspired Model citation as in claim 2, above and further: ¶ 3 of 1.1: Thesis Statement and Contributions: “Feature Ranking and Selection: This thesis contributed in the feature selection context, by introducing a graph-based algorithm for feature ranking, called Infinite Feature Selection (Inf-FS), that permits the investigation of the importance (relevance and redundancy) of a feature when injected into an arbitrary set of cues.” and ¶ 3 of 3.1 A Biologically-Inspired Model: “… Each hidden unit is constituted of activation functions that control the propagation of neuron signal to the next layer (e.g. positive weights simulate the excitatory stimulus whereas negative weights simulate the inhibitory ones as in its biological counterpart). A hidden unit is composed of a regression equation that processes the input information into a non-linear output data…”), and
performing a concave function on a result of the applied activation function (see ¶ 2-3 of 3.1 A Biologically-Inspired Model and ¶ 3 of 1.1: Thesis Statement and Contributions citations as in claim 2 and limitation above).
Harley et al., Bai et al., Hall et al., and Roffo are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Roffo wherein said performing an activation comprises, for each token in the embedded input sequence, applying an activation function to the determined prediction of the importance of each of the plurality of tokens over the embedded input sequence to ensure the positivity of the determined term weights, and performing a concave function on a result of the applied activation function, which provides the benefit of producing effective ranks, contributing to push forward the state of the art (abstract of Roffo).
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Harley et al. (US 20210319907 A1) and further in view of Bai et al. (Bai, Yang, et al. "Sparterm: Learning term-based sparse representation for fast text retrieval." arXiv preprint arXiv:2010.00768 (2020). https://arxiv.org/pdf/2010.00768) and further in view of Roffo (Roffo, Giorgio. "Ranking to learn and learning to rank: On the role of ranking in pattern recognition applications." arXiv preprint arXiv:1706.05933 (2017).) as applied to claim 2 above, and further in view of Nelakanti et al. (US 20160070697 A1).
Regarding claim 4, Harley et al. in combination with Bai et al. and Hall et al. and Roffo teach all of the limitations as in claim 2, above.
Roffo further teaches:
4. The method of claim 2,
wherein the activation comprises a logarithmic activation function (see ¶ 2 of 3.1 A Biologically-Inspired Model citation as in claim 2, above. (i.e., Softplus comprising natural logarithm function)),
Harley et al., Bai et al., Hall et al., and Roffo are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Roffo wherein the concave activation function comprises a logarithmic activation function, which provides the benefit of producing effective ranks, contributing to push forward the state of the art (abstract of Roffo).
However, Harley et al. in combination with Bai et al. and Hall et al. and Roffo do not explicitly teach, but Nelakanti et al. does teach:
wherein said logarithmic activation (see ¶ [0011]: “Disclosed herein are language models learned using penalized maximum likelihood estimation in log-linear models. A convex objective function is optimized using approaches for which polynomial-time algorithms are derived. In contrast to language models based on unstructured norms such as l.sub.2 (quadratic penalties) or l.sub.1 (absolute discounting), the disclosed approaches employ tree-structured norms that mimic the nested nature of word contexts and provide an efficient framework for the disclosed language models. In some embodiments, structured l.sub.∞ tree norms are employed with a complexity nearly linear in the number of nodes. This leads to an memory-efficient and time-efficient learning algorithm for generalized linear language models In a further aspect, the optimization is performed using a proximal algorithm that is adapted to be able to scale with the large dimensionality of the feature space commonly employed in language model training (e.g., vocabularies of order thousands to tens of thousands of words, and context sample sizes on the order of 10.sup.5 or larger for some typical natural language model learning tasks).”) comprises:
for each token in the vocabulary (see ¶ [0011] citation as in limitation above.: “…This leads to an memory-efficient and time-efficient learning algorithm for generalized linear language models In a further aspect, the optimization is performed using a proximal algorithm that is adapted to be able to scale with the large dimensionality of the feature space commonly employed in language model training (e.g., vocabularies of order thousands to tens of thousands of words, and context sample sizes on the order of 10.sup.5 or larger for some typical natural language model learning tasks).”), determining a maximum of a log-saturation of the determined importance of the token in the vocabulary over the embedded input sequence (see ¶ [0011] citation as in limitation above.: “…language models learned using penalized maximum likelihood estimation in log-linear models…”), wherein the log-saturation prevents some terms in the vocabulary from dominating and ensures sparsity in the representation (see ¶ [0011] citation as in limitation above and further ¶ [0022]: “On the other hand, by constraining the parameters to be positive (i.e., the set of feasible solutions custom-character is the positive orthant), the projection step 2 in Algorithm 1 can be done with the same complexity, while maintaining sparse parameters across multiple categories. More precisely, the weights for the category k associated to a given context x, is always zeros if the category k never occurred after context x. A significant gain in memory (nearly |V|-fold for large context lengths) was obtained without loss of accuracy in experiments reported herein.”).
Harley et al., Bai et al., Roffo, and Nelakanti et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. and Roffo to incorporate the teachings of Nelakanti et al. wherein said logarithmic activation comprises: for each token in the vocabulary, determining a maximum of a log-saturation of the determined importance of the token in the vocabulary over the embedded input sequence, wherein the log-saturation prevents some terms in the vocabulary from dominating and ensures sparsity in the representation, which provides the benefit of maintaining sparse parameters across multiple categories ([0022] of Nelakanti et al.).
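For illustration only (hypothetical names; not code from the cited references), a maximum of a log-saturation of per-token importances, as recited in claim 4, might be sketched as:

```python
import numpy as np

def log_saturated_weights(importance):
    """importance: (seq_len, vocab_size) matrix of predicted token
    importances. ReLU keeps values positive, log(1 + x) saturates
    large values so no single term dominates, and the max over the
    sequence yields one weight per vocabulary term; terms never
    predicted stay exactly zero, giving a sparse representation."""
    return np.log1p(np.maximum(importance, 0.0)).max(axis=0)
```

Note how the log damps a raw importance of 3.0 to about 1.39 while leaving negatives clipped at zero, which is the "prevents dominating / ensures sparsity" behavior addressed by the limitation.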
Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Harley et al. (US 20210319907 A1) and further in view of Bai et al. (Bai, Yang, et al. "Sparterm: Learning term-based sparse representation for fast text retrieval." arXiv preprint arXiv:2010.00768 (2020). https://arxiv.org/pdf/2010.00768) and Hall et al. (US 9536522 B1) as applied to claim 1 above, and further in view of Roffo (Roffo, Giorgio. "Ranking to learn and learning to rank: On the role of ranking in pattern recognition applications." arXiv preprint arXiv:1706.05933 (2017).), and Nelakanti et al. (US 20160070697 A1).
Regarding claim 5, Harley et al. in combination with Bai et al. and Hall et al. teach all of the limitations as in claim 1, above.
However, Harley et al. in combination with Bai et al. and Hall et al. do not explicitly teach, but Roffo does teach:
5. The method of claim 1,
wherein the activation comprises a logarithmic activation function (see ¶ 2 of 3.1 A Biologically-Inspired Model citation as in claim 2, above. (i.e., Softplus comprising natural logarithm function)),
Harley et al., Bai et al., Hall et al., and Roffo are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Roffo wherein the concave activation function comprises a logarithmic activation function, which provides the benefit of producing effective ranks, contributing to push forward the state of the art (abstract of Roffo).
However, Harley et al. in combination with Bai et al. and Hall et al. and Roffo do not explicitly teach, but Nelakanti et al. does teach:
wherein said logarithmic activation (see ¶ [0011]: “Disclosed herein are language models learned using penalized maximum likelihood estimation in log-linear models. A convex objective function is optimized using approaches for which polynomial-time algorithms are derived. In contrast to language models based on unstructured norms such as l.sub.2 (quadratic penalties) or l.sub.1 (absolute discounting), the disclosed approaches employ tree-structured norms that mimic the nested nature of word contexts and provide an efficient framework for the disclosed language models. In some embodiments, structured l.sub.∞ tree norms are employed with a complexity nearly linear in the number of nodes. This leads to an memory-efficient and time-efficient learning algorithm for generalized linear language models In a further aspect, the optimization is performed using a proximal algorithm that is adapted to be able to scale with the large dimensionality of the feature space commonly employed in language model training (e.g., vocabularies of order thousands to tens of thousands of words, and context sample sizes on the order of 10.sup.5 or larger for some typical natural language model learning tasks).”) comprises:
for each token in the vocabulary (see ¶ [0011] citation as in limitation above.: “…This leads to an memory-efficient and time-efficient learning algorithm for generalized linear language models In a further aspect, the optimization is performed using a proximal algorithm that is adapted to be able to scale with the large dimensionality of the feature space commonly employed in language model training (e.g., vocabularies of order thousands to tens of thousands of words, and context sample sizes on the order of 10.sup.5 or larger for some typical natural language model learning tasks).”),
combining a log-saturation of the determined importance of the token in the vocabulary over the embedded input sequence (see ¶ [0011] citation as in limitation above and further ¶ [0017]: “The feature vector φ.sub.m(x) corresponds to one path of length m starting at the root of the suffix trie S. The entries in W correspond to weights for each suffix. This provides a trie structure S on W (see FIG. 2) constraining the number of free parameters. In other words, there is one weight parameter per node in the trie S and the matrix of parameters W is of size |S|.”, ¶ [0022]: “On the other hand, by constraining the parameters to be positive (i.e., the set of feasible solutions custom-character is the positive orthant), the projection step 2 in Algorithm 1 can be done with the same complexity, while maintaining sparse parameters across multiple categories. More precisely, the weights for the category k associated to a given context x, is always zeros if the category k never occurred after context x. A significant gain in memory (nearly |V|-fold for large context lengths) was obtained without loss of accuracy in experiments reported herein.”),
wherein the log-saturation prevents some terms in the vocabulary from dominating and ensures sparsity in the representation (see ¶ [0011, 0017, and 0022] citation as in limitation above, More specifically, ¶ [0017]: “The feature vector φ.sub.m(x) corresponds to one path of length m starting at the root of the suffix trie S. The entries in W correspond to weights for each suffix. This provides a trie structure S on W (see FIG. 2) constraining the number of free parameters. In other words, there is one weight parameter per node in the trie S and the matrix of parameters W is of size |S|.””).
Harley et al., Bai et al., Roffo, and Nelakanti et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. and Roffo to incorporate the teachings of Nelakanti et al. wherein said logarithmic activation comprises: for each token in the vocabulary, combining a log-saturation of the determined importance of the token in the vocabulary over the embedded input sequence, wherein the log-saturation prevents some terms in the vocabulary from dominating and ensures sparsity in the representation, which provides the benefit of maintaining sparse parameters across multiple categories ([0022] of Nelakanti et al.).
Claims 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Harley et al. (US 20210319907 A1) and further in view of Bai et al. (Bai, Yang, et al. "Sparterm: Learning term-based sparse representation for fast text retrieval." arXiv preprint arXiv:2010.00768 (2020). https://arxiv.org/pdf/2010.00768) and Hall et al. (US 9536522 B1) as applied to claim 1 above, and further in view of Orr et al. (US 20180150143 A1).
Regarding claim 8, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 1, above.
Harley et al. further teaches:
8. The method of claim 1, further comprising:
ranking the plurality of candidate sequences based on said comparing (see Fig. 5a: “Query parser and tokenizer” and ¶ [0009, 0176 and 0178] citations as in limitations above and further: [0074] “… The results retrieved from the indexer 115 can be ranked by a ranking engine 165 (e.g., learning-to-rank engine), which can be configured to derive a ranking model for, for example, variants, genes, pathways, phenotypes, text data, and images…”);
retrieving a subset of the candidate documents having a highest ranking (see Fig. 5a: “Query parser and tokenizer” and ¶ [0009, 0176 and 0178] citations as in limitations above and further: [0074]: “… The results retrieved from the indices can be ranked by the ranking engine and presented to the user in a ranked order…”).
However, Harley et al. in combination with Bai et al. and Hall et al. do not explicitly teach, but Orr et al. does teach:
comparing the output predicted term importance for the input sequence to a previously determined predicted term importance for each of a plurality of candidate sequences (see ¶ [0006]: “… An online training module is configured to update the vocabulary by using either a direction associated with the predicted next item, or, by comparing the new text item and the predicted next text item and propagating results of the comparison to a final layer of the neural network.” and ¶ [0033 and 0048] citations as in claim 1, above), the candidate sequences being respectively associated with candidate documents (see ¶ [0006, 0033 and 0048] citations as in limitation above or claim 1, above: “[0033] At the neural network output stage, an output layer of the neural network produces numerical values which are activation levels of units in the output layer of the network. These numerical values form a predicted embedding. In order to convert the predicted embedding into scores for individual candidate items (such as candidate words, phrases, morphemes, emoji or other items) a measure of similarity is computed between the predicted embedding and individual ones of a plurality of embeddings available to the scoring process. In some examples a dot product is computed as the measure of similarity but this is not essential as other measures of similarity may be used. The similarity measures give a plurality of scores, one for each of the embeddings, which when normalized express the likelihood that the next item in the sequence is each of the items corresponding to the embeddings…” and “[0048] … The activations of the output units are converted to scores of items in a set of available item embeddings. This is done by taking a dot product (or other measure of similarity) between the predicted item embedding given by the activations of the output units and each of the available item embeddings and then, in the case that scalar bias values are available, adding a scalar bias value which has been stored for that item…”);
Harley et al., Bai et al., Hall et al., and Orr et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis (e.g., tokenization). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Orr et al. of comparing the output predicted term importance for the input sequence to a previously determined predicted term importance for each of a plurality of candidate sequences, the candidate sequences being respectively associated with candidate documents, which provides the benefit of facilitating input and output stages (¶ [0034] of Orr et al.).
Regarding claim 9, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 8, above.
However, Harley et al. in combination with Bai et al. and Hall et al. do not explicitly teach, but Orr et al. does teach:
9. The method of claim 8,
wherein said comparing comprises calculating a dot product between the output predicted term importance of the input sequence and the predicted term importance for each of the plurality of candidate sequences (see ¶ [0006, 0033 and 0048] citations as in claims 1 and/or 8, above: “[0033] At the neural network output stage, an output layer of the neural network produces numerical values which are activation levels of units in the output layer of the network. These numerical values form a predicted embedding. In order to convert the predicted embedding into scores for individual candidate items (such as candidate words, phrases, morphemes, emoji or other items) a measure of similarity is computed between the predicted embedding and individual ones of a plurality of embeddings available to the scoring process. In some examples a dot product is computed as the measure of similarity but this is not essential as other measures of similarity may be used. The similarity measures give a plurality of scores, one for each of the embeddings, which when normalized express the likelihood that the next item in the sequence is each of the items corresponding to the embeddings…” and “[0048] … The activations of the output units are converted to scores of items in a set of available item embeddings. This is done by taking a dot product (or other measure of similarity) between the predicted item embedding given by the activations of the output units and each of the available item embeddings and then, in the case that scalar bias values are available, adding a scalar bias value which has been stored for that item…”).
Harley et al., Bai et al., Hall et al., and Orr et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in lexical analysis (e.g., tokenization). Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Orr et al. wherein said comparing comprises calculating a dot product between the output predicted term importance of the input sequence and the predicted term importance for each of the plurality of candidate sequences, which provides the benefit of facilitating input and output stages (¶ [0034] of Orr et al.).
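The dot-product comparison and ranking addressed in claims 8-9 can be sketched as follows (hypothetical names; illustrative only, not code from Orr et al.):

```python
import numpy as np

def rank_by_dot_product(query_weights, candidate_weights, k=2):
    """Score each candidate sequence's precomputed term-importance
    vector against the query's by dot product, then return the
    indices of the k highest-scoring candidates with all scores."""
    scores = candidate_weights @ query_weights  # one dot product per candidate
    top = np.argsort(-scores)[:k]               # highest scores first
    return top, scores
```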
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Harley et al. (US 20210319907 A1) and further in view of Bai et al. (Bai, Yang, et al. "Sparterm: Learning term-based sparse representation for fast text retrieval." arXiv preprint arXiv:2010.00768 (2020). https://arxiv.org/pdf/2010.00768) and Hall et al. (US 9536522 B1) as applied to claim 11 above, and further in view of Burkhart et al. (US 20200320382 A1).
Regarding claim 12, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 11, above.
However, Harley et al. in combination with Bai et al. and Hall et al. do not explicitly teach, but Burkhart et al. does teach:
12. The method of claim 11,
wherein the one or more trained neural layers comprise one or more linear layers with an activation (see ¶ [0118 and 0121]: “[0118] The VAE estimator takes X.sub.i ∈ R.sup.b to be user i's ratings for each item (or the residual ratings after subtracting off half of user i's and item j's mean values), where b refers to the number of items included in the matrix M... [0121] In one or more implementations, the neural network 204 is implemented as a 3-layer neural network. FIG. 4 illustrates an example 400 of a neural network 402. The neural network 402 can be, for example, the neural network 204 of FIG. 2 or FIG. 3. The neural network 402 includes an input layer 404, a hidden layer 406, and an output layer 408. The estimator output values 210 are fed into the input layer 404, the hidden layer 406 implements leaky rectified linear activation, and the output layer 408 outputs vectors of predicted logits L…”), and
wherein the at least one logit function further employs a normalization layer (see ¶ [0118 and 0121]: “[0121] … The example 400 further includes a mapping and normalization layer 410, which maps the logits L to a mapped value using the function [image: media_image2.png, Greyscale]. These mapped values are normalized to produce probabilities for one of multiple (e.g., 5) potential item values.”);
Harley et al., Bai et al., Hall et al., and Burkhart et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in query processing and ranking. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Burkhart et al. wherein the one or more trained neural layers comprise one or more linear layers with an activation, and wherein the at least one logit function further employs a normalization layer, which provides the benefit of improving the operation of a computing device by generating better recommendations on how to enhance the digital experience for a user ([0026] of Burkhart et al.).
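The logit-to-probability mapping and normalization quoted from Burkhart et al. is, in substance, a softmax. A minimal sketch (illustrative only; not the reference's code):

```python
import math

def normalized_logits(logits):
    """Numerically stable softmax: shift by the maximum logit,
    exponentiate, then normalize so the mapped values sum to one,
    yielding probabilities over the potential item values."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```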
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Harley et al. (US 20210319907 A1) in view of Bai et al. (Bai, Yang, et al. "Sparterm: Learning term-based sparse representation for fast text retrieval." arXiv preprint arXiv:2010.00768 (2020). https://arxiv.org/pdf/2010.00768) and Hall et al. (US 9536522 B1) as applied to claim 1 above, and further in view of Dernoncourt et al. (US 20220245179 A1).
Regarding claim 14, Harley et al. in combination with Bai et al. and Hall et al. teaches all of the limitations as in claim 1, above.
However, Harley et al. in combination with Bai et al. and Hall et al. do not explicitly teach, but Dernoncourt et al. does teach:
14. The method of claim 13, wherein the pretrained language model is pretrained using a masked language modeling method (see ¶ [0062]: “In some examples, BERT uses a masked language model (MLM or Masked LM) pre-training objective to alleviate the unidirectionality constraint. The masked language model randomly masks some of the tokens from the input, and the language model is used to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which pretrains a deep bidirectional transformer. In addition to the masked language model, BERT includes a next sentence prediction (NSP) task that jointly pretrains text-pair representations.”).
Harley et al., Bai et al., Hall et al., and Dernoncourt et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor of query processing and/or ranking/scoring. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Harley et al. in combination with Bai et al. and Hall et al. to incorporate the teachings of Dernoncourt et al. of wherein the pretrained language model is pretrained using a masked language modeling method, which provides the benefit of an improved phrasal similarity apparatus ([0018] of Dernoncourt et al.).
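As context for the passage cited above (not part of the record), the masked language modeling objective described in ¶ [0062] of Dernoncourt et al. can be sketched as a toy Python fragment. The token list and masking rate are illustrative; BERT's actual 80/10/10 replacement scheme, subword vocabulary, and next sentence prediction task are omitted:

```python
import random

random.seed(0)  # deterministic masking for the example

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly mask input tokens, MLM style: return the masked sequence and
    a map from each masked position to the original token, which the model
    must predict from the surrounding (left and right) context."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok      # label the model is trained to recover
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the query is scored against each document term".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
```

Because the label at each masked position depends on tokens on both sides, the pretraining objective fuses left and right context, which is the bidirectionality point the quoted paragraph makes.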
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 9:00 am - 4:00 pm (EST).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir can be reached on (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
Keisha Y. Castillo-Torres
Examiner
Art Unit 2659
/Keisha Y. Castillo-Torres/Examiner, Art Unit 2659