DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
1. Regarding the objections to claims 6, 8, and 12, Applicant has amended each claim to address the minor informalities. Accordingly, the objections have been withdrawn.
2. Regarding the rejection of claims 1-20 under 35 U.S.C. § 101, Applicant's arguments filed 11/25/2025 have been fully considered but they are not persuasive.
Applicant argues on pgs. 6-8 that the claims are directed to patent-eligible subject matter. Specifically, Applicant argues that claims 1, 13, and 19 recite claimed features which are not abstract ideas but are a specific technological improvement in text-to-speech synthesis (see pg. 7, 1st para.), that the claimed technical solution employs a specific technical joint architecture which is not a generic computer implementation (see pg. 7, 3rd para.), and that the joint technical features of claim 19 integrate any alleged abstract ideas into a practical application by further providing a technological improvement (see pg. 8, 2nd para.). The Examiner respectfully disagrees. The claims as currently written are directed to abstract ideas without significantly more.
The claims as currently amended recite abstract ideas under Step 2A Prong 1 analysis. Specifically, the claims recite steps which can be performed in the human mind as mental processes with the aid of pen and paper, as well as mathematical concepts, both of which fall under the category of abstract idea. Specifically, a person can write down text and document positional encodings representing content and location of words in a document, and can use this to generate audio data comprising a reordered text sequence (a more natural reading order). Additionally, generating a spectrogram corresponding to this reordered text sequence amounts to a mathematical calculation. Furthermore, the amended limitations added to claims 1, 13, and 19 further recite abstract ideas. Specifically, the step of generating “as part of multi-task training by jointly modeling text reading order detection and digital audio generation” in claim 1, the step of decoding the text encoding into digital audio “jointly having a reordered text sequence as part of multi-task training” in claim 13, and the step of generating digital audio “jointly through simultaneous optimization” in claim 19, each recite mathematical calculations. Therefore, the claims recite abstract ideas under Step 2A Prong 1.
Furthermore, the claims as amended do not integrate the judicial exception into a practical application under Step 2A Prong 2 analysis. Under Step 2A Prong 2 analysis, additional elements are viewed alone and in combination to make this determination. The additional elements in claims 1, 13, and 19 as currently amended do not recite improvements to the functioning of a computer, or technology or technical field, and instead implement the judicial exception using generic computer components. Claim 1 does not contain any additional elements which have not been grouped under mental processes or mathematical concepts. Claim 13 recites “a text-to-phoneme converter module implemented by a processing device”, “a text-to-speech model implemented by the processing device…using machine learning”, “a text layout encoder to generate…”, and “a reading sequence decoder to decode…”, which each amount to implementing a mental process or mathematical calculation via a generic computer component. Claim 19 similarly contains “a text layout encoder and a reading order sequence decoder of a text-to-speech model” to jointly decode the text encodings, which further amounts to implementing a process which can be performed mentally using a generic computer component. These limitations do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract ideas, and instead are merely using a computer as a tool to perform the abstract ideas identified under Step 2A Prong 1. Therefore, the claims are directed to abstract ideas.
Hence, Applicant’s arguments are not persuasive.
3. Regarding the following rejections:
of claims 1-6, 8-11, 13-17, and 19 under 35 U.S.C. § 103 as being unpatentable over Cui in view of Abbas,
of claims 7 and 20 under 35 U.S.C. § 103 as being unpatentable over Cui in view of Abbas, and further in view of Hwang, and
of claims 12 and 18 under 35 U.S.C. § 103 as being unpatentable over Cui in view of Abbas, and further in view of Klimkov,
Applicant’s arguments have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
4. Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1, “A method” is recited, which is directed to one of the four statutory categories of invention (process) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts which fall into the category of abstract idea (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts:
receiving, by a processing device, a digital document having text arranged in an initial text sequence: a person obtains a document having text arranged in an initial sequence
generating, by the processing device, a text encoding and a document positional encoding from the digital document, the document positional encoding is based on a location of the text encoding within the digital document: a person reads the document, and writes down a text encoding and a positional encoding based on a location of the text within the document, using pen and paper
generating, by the processing device, digital audio as part of multi-task training by jointly modeling text reading order detection and digital audio generation as a spectrogram having a reordered text sequence, which is different from the initial text sequence, by decoding the text encoding and the document positional encoding: a person can write down a reordered text sequence different from the initial sequence by decoding the text encoding and positional encoding, using pen and paper. Generating a spectrogram corresponding to the reordered text sequence using multi-task training by jointly modeling text reading order detection and digital audio generation amounts to a mathematical calculation which falls under the abstract idea grouping of mathematical concepts.
Claim 1 does not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). There are no additional limitations in the claim. Therefore, claim 1 is directed to an abstract idea (Step 2A: YES).
Claim 1 does not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). There are no additional limitations in the claim. Therefore, claim 1 is not patent eligible.
Regarding dependent claims 2-12, “The method” is recited, which is directed to one of the four statutory categories of invention (process) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts which fall into the category of abstract idea (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts:
Claim 2:
wherein the document positional encoding is based on coordinates defined in relation to a page of the digital document: a person writes down a positional encoding representing coordinates in relation to the page (e.g. (x,y)) using pen and paper
Claim 2 contains no additional limitations.
Claim 3:
wherein the document positional encoding is based on a bounding box defined for the text: a person writes down a positional encoding representing a bounding box (e.g. writes down coordinates of each corner) using pen and paper
Claim 3 contains no additional limitations.
Claim 4:
wherein the document positional encoding includes four two-dimensional positional encoding defining a relative spatial position of the text within the digital document: a person writes down a positional encoding representing four two-dimensional coordinates in relation to the page (e.g. (x,y)) using pen and paper
Claim 4 contains no additional limitations.
Claim 5:
wherein the generating includes embedding the document positional encoding as part of the text encoding: a person writes down a combined embedding including the positional encoding with the text encoding using pen and paper.
Claim 5 contains no additional limitations.
Claim 6:
generating the text encoding and the document positional encoding…generating the digital audio including the spectrogram having the reordered text sequence…: a person generates text encoding and position encoding, and the reordered text sequence using pen and paper; generating a spectrogram amounts to a mathematical concept.
Claim 6 contains the limitations “performed by a text layout encoder of a text-to-speech model using machine learning” and “performed using a reading sequence decoder of the text-to-speech model using machine learning”. These limitations are recited at a high level of generality and amount to mere instructions to implement the judicial exception using a generic computer.
Claim 7:
Claim 7 contains the additional limitation “wherein the text-to-speech model is trained using curriculum learning”, which amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 8:
generating the text encoding and the document positional encoding: a person writes down the text and position encodings using pen and paper
Claim 8 contains the additional limitation “performed jointly by the text layout encoder”, which amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 9:
generating the digital audio including the spectrogram having the reordered text sequence: a person writes down a reordered text sequence, and generating a spectrogram amounts to a mathematical concept.
Claim 9 contains the additional limitation “performed jointly using the reading sequence decoder”, which amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 10:
wherein the generating the text encoding further comprises generating a text sequence positional encoding as part of the text encoding, the text sequence positional encoding defining a position of the text encoding within the text sequence of the digital document: a person writes down a text encoding which comprises a sequence positional encoding defining a position of the text encoding within the text sequence, using pen and paper.
Claim 10 contains no additional limitations.
Claim 11:
wherein the generating includes converting the text from the digital document into a phoneme and wherein the text encoding is generated based on the phoneme: a person reads the text, and converts each word into its corresponding phonemes, and uses the phonemes to generate an encoding using pen and paper.
Claim 11 contains no additional limitations.
Claim 12:
wherein the generating the digital audio includes classifying whether the document positional encoding indicates a break in the digital document: a person determines based on a position encoding if there is a break in the document.
Claim 12 contains no additional limitations.
Claims 2-12 do not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). As discussed above, the only additional limitations are mere instructions to implement the judicial exception using a generic computer which, even when viewed in combination, do not integrate the judicial exception into a practical application because they do not impose any meaningful limits on practicing the abstract idea. Therefore, claims 2-12 are directed to an abstract idea (Step 2A: YES).
Claims 2-12 do not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitations are mere instructions to implement the judicial exception using a generic computer, which do not amount to significantly more than the judicial exception as they cannot provide an inventive concept. Therefore, claims 2-12 are not patent eligible.
Regarding claim 13, “A system” is recited, which is directed to one of the four statutory categories of invention (machine) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts which fall into the category of abstract idea (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts:
convert text in a digital document into a plurality of phonemes: a person reads text in a document and writes down phonemes corresponding to the words using pen and paper.
convert the plurality of phonemes into digital audio: a person can use the phonemes to write down audio data using pen and paper
generate a plurality of text encodings based on the plurality of phonemes…the plurality of text encoding having embedded, respectively, a document positional encoding based on a location of a respective said text encoding within the digital document: a person writes down a text encoding for the phonemes, as well as document positional encoding based on a location of the text within the document, using pen and paper
decode the plurality of text encodings into the digital audio jointly having a reordered text sequence as part of multi-task training: a person decodes the text encodings and writes down digital audio corresponding to the encodings using pen and paper. Performing this operation using multi-task training amounts to a mathematical concept.
Claim 13 does not contain any additional limitations which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). The only limitations are “a text-to-phoneme converter module implemented by a processing device to…”, “a text-to-speech model implemented by the processing device…using machine learning, the text-to-speech model including: a text layout encoder to…using machine learning” and “a reading sequence decoder to…”. These limitations are recited at a high level of generality and amount to mere instructions to implement the judicial exception using a generic computer which, even when viewed in combination, do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, claim 13 is directed to an abstract idea (Step 2A: YES).
Claim 13 does not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitations amount to mere instructions to implement the judicial exception using a generic computer, which do not amount to significantly more than the judicial exception as they cannot provide an inventive concept. Therefore, claim 13 is not patent eligible.
Regarding dependent claims 14-18, “The system” is recited, which is directed to one of the four statutory categories of invention (machine) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts which fall into the category of abstract idea (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts:
Claim 14:
generate reordered text sequence in the digital audio which is different from an initial text sequence of the plurality of phonemes: a person writes down a reordered text sequence different from an initial text sequence using pen and paper.
Claim 14 contains the additional limitation “wherein the reading sequence decoder is configured to…”. This limitation amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 15:
generate the digital audio as including a spectrogram having the reordered text sequence: generating a spectrogram amounts to a mathematical concept.
Claim 15 contains the additional limitation “wherein the reading sequence decoder is configured to…” which amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 16:
wherein the document positional encoding is based on coordinates defined in relation to a page of the digital document: a person writes down a positional encoding representing coordinates in relation to the page (e.g. (x,y)) using pen and paper
Claim 16 contains no additional limitations.
Claim 17:
generate a text sequence positional encoding as part of the text encoding, the text sequence positional encoding defining a position of the text encoding within a text sequence of the digital document: a person writes down a positional encoding as part of the text encoding representing a position of the text within the text sequence, using pen and paper.
Claim 17 contains the additional limitation “wherein the text layout encoder is further configured to…”, which amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 18:
determine whether a respective said document positional encoding associated with a respective said text encoding indicates a break in the digital document…: a person determines whether an encoding corresponds to a break in the digital document.
Claim 18 contains the limitations “wherein the reading sequence decoder is further configured as a classifier to…”. These limitations are recited at a high level of generality and amount to mere instructions to implement the judicial exception using a generic computer.
Claims 14-18 do not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). As discussed above, the only additional limitations are mere instructions to implement the judicial exception using a generic computer which, even when viewed in combination, do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, claims 14-18 are directed to an abstract idea (Step 2A: YES).
Claims 14-18 do not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitations are mere instructions to implement the judicial exception using a generic computer, which do not amount to significantly more than the judicial exception as they cannot provide an inventive concept. Therefore, claims 14-18 are not patent eligible.
Regarding claim 19, “One or more computer readable storage media” is recited, which is directed to one of the four statutory categories of invention (article of manufacture) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts which fall into the category of abstract idea (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts:
receive a digital document having text: a person obtains and reads a document having text
generating digital audio based on the digital document, the digital audio including a spectrogram having a reading order … through simultaneous optimization: a person writes down audio data based on the document using pen and paper; generating a spectrogram amounts to a mathematical concept. Performing the above using simultaneous optimization amounts to a mathematical calculation.
Claim 19 does not contain any additional limitations which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). The only limitations are “One or more computer readable storage media storing instructions that, responsive to execution by a processing device, causes the processing device to perform operations comprising…” and “…generated jointly by a text layout encoder and a reading order sequence decoder of a text-to-speech model”. These limitations are recited at a high level of generality and amount to mere instructions to implement the judicial exception using a generic computer which, even when viewed in combination, do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, claim 19 is directed to an abstract idea (Step 2A: YES).
Claim 19 does not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitations amount to mere instructions to implement the judicial exception using a generic computer, which do not amount to significantly more than the judicial exception as they cannot provide an inventive concept. Therefore, claim 19 is not patent eligible.
Regarding dependent claim 20, “The one or more computer readable storage media” is recited, which is directed to one of the four statutory categories of invention (article of manufacture) (Step 1: YES). However, the claim limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts which fall into the category of abstract idea (Step 2A Prong 1: YES).
The following limitations, under their broadest reasonable interpretation, recite mental processes or mathematical concepts:
Claim 20:
Claim 20 contains the additional limitation “wherein the text-to-speech model is trained using curriculum learning”. This limitation amounts to mere instructions to implement the judicial exception using a generic computer.
Claim 20 does not contain any additional elements which integrate the judicial exception into a practical application (Step 2A Prong 2: NO). As discussed above, the only additional limitations are mere instructions to implement the judicial exception using a generic computer which, even when viewed in combination, do not integrate the judicial exception into a practical application as they do not impose any meaningful limits on practicing the abstract idea. Therefore, claim 20 is directed to an abstract idea (Step 2A: YES).
Claim 20 does not contain any additional elements which amount to significantly more than the judicial exception (Step 2B: NO). As discussed above, the only additional limitations are mere instructions to implement the judicial exception using a generic computer, which do not amount to significantly more than the judicial exception as they cannot provide an inventive concept. Therefore, claim 20 is not patent eligible.
Claim Rejections - 35 USC § 103
5. Claims 1-6, 8-11, 13-17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Cui et al. (US 2024/0265206 A1, hereinafter Cui) in view of Abbas et al. (US 11,694,674 B1, hereinafter Abbas) and further in view of Dong et al. (US 2025/0061888 A1, hereinafter Dong).
Regarding claim 1, Cui discloses A method (para. 0003) comprising: receiving, by a processing device (Fig. 12, para. 0122), a digital document having text arranged in an initial text sequence (Fig. 3, 162; para. 0043 “As shown in FIG. 3, the reading order of a plurality of text elements in the text sequence presented by the document 162 is to be detected.”; para. 0044 “Depending on the type of the document 162, a corresponding text recognition technique may be used to extract the text sequence 312. For example, if the document 162 is an image-format file, the OCR technique may be used to extract the text sequence 312. If the document 162 is a PDF file, a PDF parsing tool may be used to extract the text sequence 312.”); generating, by the processing device, a text encoding (para. 0032 “For example, the text element may include a word, a phrase, a symbol, a combination of the foregoing, or any other elements that appear in a natural language expression.”; para. 0046 “In some implementations, an embedding representation 330 of the text element may be determined. The embedding representation may characterize the text element in the form of a vector.”) and a document positional encoding from the digital document (para. 0049 “The layout information may also be converted into an embedding representation 331 in the form of a vector for input to the feature extraction model 120.”), the document positional encoding is based on a location of the text encoding within the digital document (para. 0047 “The layout information indicates a spatial layout of the text elements in the text sequence in the document 162.”); and … a reordered text sequence (para. 0068 “The semantic feature representations of text elements (for example, h.sub.1 to h.sub.7 shown in FIG. 3) may be provided to the order determination model 150 for determining the reading order of the plurality of text elements in the document 162.”), which is different from the initial text sequence (para. 0039 “Scenarios in which the default reading order from top to bottom and from left to right might lead to errors further include other documents such as forms, receipts and invoices having a multi-column arrangement, a flyer with text elements being arranged freely, and so on. The incorrect reading order might result in reduced accuracy or increased complexity of subsequent processing tasks.”; para. 0040 “According to this solution, machine learning technology is used to determine the reading order of text elements from the text elements themselves in the text sequence of the document and the layout information of these text elements in the document. As compared with simply determining the reading order from the text itself, the introduction of layout information can better characterize a spatial layout manner of text elements in a specific document, thereby determining the reading order more effectively and accurately.”; para. 0069 “The order determination model 150 may sequentially determine the reading order index of each text element one by one starting from an initial text element of the text sequence 310.”), by decoding the text encoding and the document positional encoding (para. 0061 “The feature extraction model 120 is configured to determine the semantic feature representations corresponding to respective text elements in the text sequence 312 based on the input embedding representations.”; para. 0068 “The semantic feature representations of text elements (for example, h.sub.1 to h.sub.7 shown in FIG. 3) may be provided to the order determination model 150 for determining the reading order of the plurality of text elements in the document 162.”).
Cui does not specifically disclose generating, by the processing device, digital audio…as a spectrogram [having a reordered text sequence….]
Abbas teaches for a corresponding text sequence generating, by the processing device, digital audio … as a spectrogram (Col. 6 Lines 21-31 “FIG. 4 illustrates embodiments of a text-to-spectrogram (acoustic) model. In some embodiments, this is the acoustic model 122 of FIG. 1. In general, the acoustic model predicts spectrograms at a first level (e.g., sentence-level) and uses that predicted spectrogram at a second level (e.g., word-level). Subsequent levels (e.g., a third level) will use at least the preceding level's predicted spectrogram, but also use all other levels' spectrograms in some embodiments. The predicted spectrograms and upsampled frames of the frame-level are then used to predict an “actual” spectrogram.”; Fig. 4, final spectrogram predicted for input text to 403).
Cui and Abbas are considered to be analogous to the claimed invention as they both are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui to incorporate the teachings of Abbas in order to specifically generate digital audio including a spectrogram for the reordered text sequence. Doing so would be beneficial, as generated spectrograms can be used to produce audio signals (Col. 2 Lines 49-57) for the reordered text sequence disclosed in Cui, which would lead to more understandable synthesized speech for documents with varieties of digital content in different spatial orientations, improving user experience for those with visual disabilities or who are otherwise busy with other tasks (Tran et al. (US 2021/0020159), Abstract, para. 0001-0002).
Cui in view of Abbas does not specifically disclose [generating, by the processing device, digital audio] as part of multi-task training by jointly modeling text reading order detection and digital audio generation.
Dong teaches generating digital audio as part of multi-task training by jointly modeling text reading order detection and digital audio generation (Dong teaches performing multi-task training to predict an output text sequence (secondary task of speech-to-text translation task (para. 0080, Fig. 6 “Secondary Representation”)) and a corresponding digital audio speech feature (primary task of speech-to-speech translation (para. 0080, Fig. 6 “Target-language speech feature”)); para. 0080 “The first decoder module 530 is mainly configured to predict and synthesize a target-language speech feature.”; para. 0080 “…and the speech-to-text translation task is used to transform a source-language speech feature into source-language text and transform the source-language text into target-language text.”; para. 0080 “During training, the two secondary tasks accept an input of the encoder module 510, and predicted loss values are added to the primary task in the form of a weighted sum. During testing, the second decoder modules 550 are not used.”).
Cui, Abbas, and Dong are considered to be analogous to the claimed invention as they are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui in view of Abbas to incorporate the teachings of Dong in order to specifically generate digital audio as part of multi-task training by jointly modeling text reading order detection and digital audio generation. Doing so would be beneficial, as multi-task learning obtains useful information included in a plurality of different tasks in order to obtain more accurate learning for each individual task, allowing for the different tasks to improve each other and preventing a single task from easily falling into a local optimum (para. 0047).
Regarding claim 2, Cui in view of Abbas and Dong discloses wherein the document positional encoding is based on coordinates defined in relation to a page of the digital document (Cui, see Eq. 4; para. 0057 “…the embedding representation 331 of the layout information may be represented as follows… where W and H represent a total width and a total height of the document 162, and L represents a length of the input sequence S corresponding to the text element. In the above Equation (4), the x-axis coordinate information (x0, x1) and width w are used as a triple to construct an embedding representation, and the y-axis coordinate information (y0, y1) and height h are used as a triple to construct another embedding representation, and then the two embedding representations are concatenated into the embedding representation of the layout information of the i.sup.th text element.”).
Regarding claim 3, Cui in view of Abbas and Dong discloses wherein the document positional encoding is based on a bounding box defined for the text (Cui, para. 0049 “Certainly, in other examples, the layout information of the text element may also be characterized in other ways, for example, the relative spatial position may be represented with a center point of the bounding box, and the size may be represented with an area of the boundary box, and so on. The layout information is not limited in the text herein, as long as it can be ensured that any alternative or additional information of different text elements in the two-dimensional space of the document 162 all may be used. The layout information may also be converted into an embedding representation 331 in the form of a vector for input to the feature extraction model 120.”).
Regarding claim 4, Cui in view of Abbas and Dong discloses wherein the document positional encoding includes four two-dimensional encodings defining a relative spatial position of the text within the digital document (Cui, para. 0048 “As an example, the layout information of the i.sup.th text element may be represented as (x.sub.0,x.sub.1,custom-character.sub.0,custom-character.sub.1,w,h), where (x0, y0) represents x-axis and y-axis coordinates of the upper left (right) corner of the bounding box of the text element, (x1, y1) represents x-axis and y-axis coordinates of the lower right (left) corner of the bounding box…”).
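For illustration only, the layout encoding of Cui's Equation (4), as quoted in the claim 2 mapping above, builds one representation from the x-axis triple (x0, x1, w) and another from the y-axis triple (y0, y1, h), then concatenates the two. The sketch below uses page-dimension normalization as a stand-in for the learned embedding lookups of the actual model; the function name and all values are hypothetical:

```python
# Illustrative sketch (not from any cited reference) of a layout encoding
# built from a bounding box, per Cui Eq. (4): an x-triple (x0, x1, w) and
# a y-triple (y0, y1, h) are each encoded and then concatenated.
# Normalization by page width/height stands in for learned embeddings.

def layout_encoding(x0, y0, x1, y1, page_w, page_h):
    """Concatenate normalized x-axis and y-axis triples for one text element."""
    w, h = x1 - x0, y1 - y0
    x_triple = [x0 / page_w, x1 / page_w, w / page_w]
    y_triple = [y0 / page_h, y1 / page_h, h / page_h]
    return x_triple + y_triple

# Hypothetical bounding box (100, 50)-(300, 90) on a 1000 x 800 page.
enc = layout_encoding(100, 50, 300, 90, page_w=1000, page_h=800)
```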
Regarding claim 5, Cui in view of Abbas and Dong discloses wherein the generating includes embedding the document positional encoding as part of the text encoding (Cui, see Eq 3; para. 0055 “Correspondingly, for the i.sup.th visual embedding representation 334, the constructed embedding representation includes the following content…where VisTokEmb(I).sub.i) represents the i.sup.th embedding representation 334 of the visual information obtained from the change of the feature map. PosEmb1D(i) represents the embedding representation corresponding to the sequence index information of the i.sup.th visual embedding representation.”; Fig. 4).
Regarding claim 6, Cui in view of Abbas and Dong discloses wherein the generating the text encoding and the document position encoding is performed by a text layout encoder (Cui, para. 0077 “The feature extraction model 120 may extract a text sequence from the sample documents according to the processing procedure described above, determine the layout information, and determine the semantic feature representations of the text elements in the sample documents based on the text sequence and the layout information.”) of a text-to-speech model (Abbas, TTS Model: Fig. 1, “Text-to-Speech (TTS) Service/Component 110”) using machine learning (Cui, para. 0021 “As used herein, the term “model” may refer to an association between corresponding input and output learnable from training data, and thus a corresponding output may be generated for a given input after the training. The generation of the model may be based on machine learning techniques.”); and the generating the digital audio including the spectrogram having the reordered text sequence is performed using a reading sequence decoder (Cui, reordered text sequence: para. 0068 “The semantic feature representations of text elements (for example, h.sub.1 to h.sub.7 shown in FIG. 3) may be provided to the order determination model 150 for determining the reading order of the plurality of text elements in the document 162. In some implementations, a reading order index of each text element may be determined. For example, if there are a total of N text elements, the reading order index performs indexing from 1 to N to indicate the reading order of these text elements.”; Abbas, spectrogram from text sequence: Col. 6 Lines 21-31 “FIG. 4 illustrates embodiments of a text-to-spectrogram (acoustic) model. In some embodiments, this is the acoustic model 122 of FIG. 1. In general, the acoustic model predicts spectrograms at a first level (e.g., sentence-level) and uses that predicted spectrogram at a second level (e.g., word-level). Subsequent levels (e.g., a third level) will use at least the preceding level's predicted spectrogram, but also use all other levels' spectrograms in some embodiments. The predicted spectrograms and upsampled frames of the frame-level are then used to predict an “actual” spectrogram.”; Fig. 4, final spectrogram predicted for input text to 403) of the text-to-speech model (Abbas, TTS Model: Fig. 1, “Text-to-Speech (TTS) Service/Component 110”) using machine learning (Cui, para. 0021 “As used herein, the term “model” may refer to an association between corresponding input and output learnable from training data, and thus a corresponding output may be generated for a given input after the training. The generation of the model may be based on machine learning techniques.”).
Cui, Abbas, and Dong are considered to be analogous to the claimed invention as they are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Abbas in order to have the text layout encoder disclosed in Cui be a part of the text-to-speech model taught in Abbas, and to further use the reading sequence decoder disclosed in Abbas to generate digital audio including a spectrogram using machine learning. Doing so would be beneficial, as generated spectrograms can be used to produce audio signals (Col. 2 Lines 49-57) for the reordered text sequence disclosed in Cui, which would lead to more understandable synthesized speech for documents with varieties of digital content in different spatial orientations, improving user experience for those with visual disabilities or who are otherwise busy with other tasks (Tran et al. (US 2021/0020159), Abstract, para. 0001-0002).
Regarding claim 8, Cui in view of Abbas and Dong discloses wherein the generating the text encoding and the document position encoding is performed jointly by the text layout encoder (Cui, para. 0077 “The feature extraction model 120 may extract a text sequence from the sample documents according to the processing procedure described above, determine the layout information, and determine the semantic feature representations of the text elements in the sample documents based on the text sequence and the layout information.”).
Regarding claim 9, Cui in view of Abbas and Dong discloses wherein the generating the digital audio including the spectrogram having the reordered text sequence is performed jointly using the reading sequence decoder (Abbas, Col. 5 Lines 19-30 ”FIG. 2 illustrates embodiments of a system for generating mel spectrograms in a hierarchical manner. As shown, a text encoder 201 encodes (or embeds) text. This encoded text is fed into a plurality of neural networks (203, 205, 207), each of which generates a spectrogram at a time scale (Ti) and a frequency scale (Fj). As shown, a first neural network at T1, F1 generates a first spectrogram. This spectrogram is fed to two different neural networks, one 213 with the same time, but a different frequency and one 215 with a different time, but the same frequency. This hierarchy continues in both domains as shown (e.g., from 215 to 225 and 217, ′INVF13 to.”).
Cui, Abbas, and Dong are considered to be analogous to the claimed invention as they are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Abbas in order to have the generating the digital audio including the spectrogram having the reordered text sequence be performed jointly using the reading sequence decoder. Doing so would be beneficial, as generated spectrograms can be used to produce audio signals (Col. 2 Lines 49-57) for the reordered text sequence disclosed in Cui, which would lead to more understandable synthesized speech for documents with varieties of digital content in different spatial orientations, improving user experience for those with visual disabilities or who are otherwise busy with other tasks (Tran et al. (US 2021/0020159), Abstract, para. 0001-0002).
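For illustration only, the coarse-to-fine hierarchy quoted from Abbas refines a spectrogram along one axis at a time, with each finer level conditioned on the preceding level's output. The sketch below substitutes nearest-neighbor upsampling for the neural networks at each level; the function names, grid sizes, and values are hypothetical and not drawn from any cited reference:

```python
# Illustrative sketch (not from any cited reference) of hierarchical
# coarse-to-fine spectrogram refinement, per Abbas: each level refines
# the preceding level's grid along either the time or frequency axis.
# Nearest-neighbor repetition stands in for the per-level neural networks.

def upsample_time(spec, factor):
    """Repeat each time frame `factor` times (coarser -> finer in time)."""
    return [row for row in spec for _ in range(factor)]

def upsample_freq(spec, factor):
    """Repeat each frequency bin `factor` times (coarser -> finer in frequency)."""
    return [[v for v in row for _ in range(factor)] for row in spec]

# A coarse 2-frame x 2-bin "spectrogram" refined to 4 frames x 4 bins.
coarse = [[1.0, 2.0], [3.0, 4.0]]
fine = upsample_freq(upsample_time(coarse, 2), 2)
```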
Regarding claim 10, Cui in view of Abbas and Dong discloses wherein the generating the text encoding further comprises generating a text sequence positional encoding as part of the text encoding, the text sequence positional encoding defining a position of the text encoding within a text sequence of the digital document (Cui, para. 0050 “In some implementations, the text elements in the text sequence 312 may also have respective sequence index information, which is to indicate sequential positions in the text sequence 312. Different from the two-dimensional relative spatial positions of the document 162 indicated by the layout information, the sequence index information is used to indicate a relative position of the text element in a one-dimensional text sequence 312, and therefore may also be regarded as one-dimensional position information. It is possible to assign corresponding sequence index information to each text element in order from a starting text element of the text sequence 312.”).
Regarding claim 11, Cui in view of Abbas and Dong discloses wherein the generating includes converting the text from the digital document into a phoneme and wherein the text encoding is generated based on the phoneme (Abbas, Col. 7 Lines 9-21 “FIG. 5 illustrates embodiments of an exemplary phoneme embedding encoder. In some embodiments, the phoneme embedding encoder 403 includes one or more of… a tokenizer 503 to tokenize the word(s) of the input text; a grapheme-to-phoneme transcriber 505 to convert graphemes into phonemes (in some embodiments, a set of rules is used to perform this conversion, in other embodiments a neural network is used); an embedding layer INVE07 to embed phonemes into one or more trainable vectors; and a neural network (e.g., BI-LSTM) 509 to generate the phoneme embeddings.”).
Cui, Abbas, and Dong are considered to be analogous to the claimed invention as they are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Abbas in order to convert the text from the digital document into a phoneme and generate the text encoding based on the phoneme. Doing so would be beneficial, as this would allow for phoneme-level spectrograms to be generated which, when combined with spectrograms generated for different linguistic units such as sentence and word-level spectrograms, leads to improved naturalness for TTS (Col. 2, Lines 21-32 and 40-48; Fig. 4).
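For illustration only, the tokenize/grapheme-to-phoneme/embed pipeline quoted from Abbas (Col. 7 Lines 9-21) can be sketched as follows. The pronunciation table, fallback behavior, and all names are toy assumptions; Abbas notes the grapheme-to-phoneme conversion may be rule-based or performed by a neural network:

```python
# Illustrative sketch (not from any cited reference) of a tokenizer
# followed by rule-based grapheme-to-phoneme conversion, per the phoneme
# embedding encoder quoted from Abbas. The pronunciation table is a toy
# assumption; unknown tokens fall back to being spelled out letter by letter.

PRONUNCIATIONS = {"text": ["T", "EH", "K", "S", "T"], "to": ["T", "UW"]}

def tokenize(text):
    """Split input text into lowercase word tokens."""
    return text.lower().split()

def graphemes_to_phonemes(tokens):
    """Map each token to phonemes via the rule table (unknowns spelled out)."""
    phonemes = []
    for tok in tokens:
        phonemes.extend(PRONUNCIATIONS.get(tok, list(tok.upper())))
    return phonemes

phones = graphemes_to_phonemes(tokenize("Text to"))
```

In the quoted encoder, the resulting phoneme sequence would then be embedded into trainable vectors and passed through a neural network to produce the phoneme embeddings.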
Regarding claim 13, Cui discloses A system (Fig. 12) comprising:…a processing device (para. 0123 “As shown in FIG. 12, the computing device 100 includes a computing device 1200 in form of a general-purpose computing device. Components of the computing device 1200 may include, but are not limited to, one or more processors or processing units 1210, a memory 1220, a storage device 1230, one or more communication units 1240, one or more input devices 1250, and one or more output devices 1260.”) …a text layout encoder to generate a plurality of text encodings…using machine learning (para. 0077 “The feature extraction model 120 may extract a text sequence from the sample documents according to the processing procedure described above, determine the layout information, and determine the semantic feature representations of the text elements in the sample documents based on the text sequence and the layout information.”; para. 0021 “As used herein, the term “model” may refer to an association between corresponding input and output learnable from training data, and thus a corresponding output may be generated for a given input after the training. The generation of the model may be based on machine learning techniques.”), the plurality of text encodings having embedded, respectively (para. 0059 “In addition, for each text element and each visual embedding representation, there exist a corresponding embedding representation 331 of the layout information and a corresponding embedding representation 332 of the index sequence information.”), a document positional encoding based on a location of a respective said text encoding within the digital document (para. 0049 “The layout information may also be converted into an embedding representation 331 in the form of a vector for input to the feature extraction model 120.”; para. 0047 “The layout information indicates a spatial layout of the text elements in the text sequence in the document 162.”); and a reading sequence decoder to decode the plurality of text encodings… (para. 0068 “The semantic feature representations of text elements (for example, h.sub.1 to h.sub.7 shown in FIG. 3) may be provided to the order determination model 150 for determining the reading order of the plurality of text elements in the document 162. In some implementations, a reading order index of each text element may be determined. For example, if there are a total of N text elements, the reading order index performs indexing from 1 to N to indicate the reading order of these text elements.”).
Cui does not specifically disclose:
a text-to-phoneme converter module implemented by [a processing device] to convert text in a digital document into a plurality of phonemes; and a text-to-speech model implemented by the processing device to convert the plurality of phonemes into digital audio using machine learning, the text-to-speech model including…[a reading sequence decoder to decode the plurality of text encodings] into the digital audio….
Abbas teaches a text-to-phoneme converter module implemented by [a processing device] to convert text in a digital document into a plurality of phonemes (Col. 7 Lines 9-21 “FIG. 5 illustrates embodiments of an exemplary phoneme embedding encoder. In some embodiments, the phoneme embedding encoder 403 includes one or more of… a tokenizer 503 to tokenize the word(s) of the input text; a grapheme-to-phoneme transcriber 505 to convert graphemes into phonemes (in some embodiments, a set of rules is used to perform this conversion, in other embodiments a neural network is used); an embedding layer INVE07 to embed phonemes into one or more trainable vectors; and a neural network (e.g., BI-LSTM) 509 to generate the phoneme embeddings.”); and a text-to-speech model implemented by the processing device to convert the plurality of phonemes into digital audio using machine learning (Col. 7 Lines 22-24 “The “p” phonemes embeddings are provided to one or more spectrogram generation “levels” within the encoder 401.”; Col. 8 Lines 6-9 “The spectrograms generated in the various levels are concatenated (or otherwise combined) with the one or more frames of the frame-level and fed to the decoder 421 which generates a Mel spectrogram having “t” frames.”), the text-to-speech model including…[a reading sequence decoder to decode the plurality of text encodings] into the digital audio… (Col. 8 Lines 6-9 “The spectrograms generated in the various levels are concatenated (or otherwise combined) with the one or more frames of the frame-level and fed to the decoder 421 which generates a Mel spectrogram having “t” frames.”).
Cui and Abbas are considered to be analogous to the claimed invention as they both are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui to incorporate the teachings of Abbas in order to include a text-to-phoneme converter to convert text to a plurality of phonemes, a text-to-speech model to convert the plurality of phonemes into digital audio using machine learning, and to have the reading sequence decoder decode the plurality of text encodings into the digital audio. Using a text-to-speech model to convert phonemes into digital audio would be beneficial, as generated spectrograms can be used to produce audio signals (Col. 2 Lines 49-57) for the reordered text sequence disclosed in Cui, which would lead to more understandable synthesized speech for documents with varieties of digital content in different spatial orientations, improving user experience for those with visual disabilities or who are otherwise busy with other tasks (Tran et al. (US 2021/0020159), Abstract, para. 0001-0002). Furthermore, using a text-to-phoneme converter would be beneficial, as this would allow for phoneme-level spectrograms to be generated which when combined with spectrograms generated for different linguistic units such as sentence and word-level spectrograms, leads to improved naturalness for TTS (Col. 2, Lines 21-32 and 40-48; Fig. 4).
Cui in view of Abbas does not specifically disclose to decode into the digital audio jointly having a reordered text sequence as part of multi-task training.
Dong teaches to decode into the digital audio jointly having a reordered text sequence as part of multi-task training (Dong teaches performing multi-task training to predict an output text sequence (secondary task of speech-to-text translation task (para. 0080, Fig. 6 “Secondary Representation”)) and a corresponding digital audio speech feature (primary task of speech-to-speech translation (para. 0080, Fig. 6 “Target-language speech feature”)); para. 0080 “The first decoder module 530 is mainly configured to predict and synthesize a target-language speech feature.”; para. 0080 “…and the speech-to-text translation task is used to transform a source-language speech feature into source-language text and transform the source-language text into target-language text.”; para. 0080 “During training, the two secondary tasks accept an input of the encoder module 510, and predicted loss values are added to the primary task in the form of a weighted sum. During testing, the second decoder modules 550 are not used.”).
Cui, Abbas, and Dong are considered to be analogous to the claimed invention as they are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui in view of Abbas to incorporate the teachings of Dong in order to specifically generate digital audio as part of multi-task training by jointly modeling text reading order detection and digital audio generation. Doing so would be beneficial, as multi-task learning obtains useful information included in a plurality of different tasks in order to obtain more accurate learning for each individual task, allowing for the different tasks to improve each other and prevent a single task from easily falling into a local optimum (para. 0047).
Regarding claim 14, Cui in view of Abbas and Dong discloses wherein the reading sequence decoder is configured to generate a reordered text sequence (Cui, para. 0068 “The semantic feature representations of text elements (for example, h.sub.1 to h.sub.7 shown in FIG. 3) may be provided to the order determination model 150 for determining the reading order of the plurality of text elements in the document 162.”) in the digital audio (Abbas, Fig. 4: digital audio (Final Spectrogram) generated for text sequence) which is different from an initial text sequence (Cui, para. 0039 “Scenarios in which the default reading order from top to bottom and from left to right might lead to errors further include other documents such as forms, receipts and invoices having a multi-column arrangement, a flyer with text elements being arranged freely, and so on. The incorrect reading order might result in reduced accuracy or increased complexity of subsequent processing tasks.”; para. 0040 “According to this solution, machine learning technology is used to determine the reading order of text elements from the text elements themselves in the text sequence of the document and the layout information of these text elements in the document. As compared with simply determining the reading order from the text itself, the introduction of layout information can better characterize a spatial layout manner of text elements in a specific document, thereby determining the reading order more effectively and accurately.”; para. 0069 “The order determination model 150 may sequentially determine the reading order index of each text element one by one starting from an initial text element of the text sequence 310.”) of the plurality of phonemes (Abbas, Col. 7 Lines 22-24 “The “p” phonemes embeddings are provided to one or more spectrogram generation “levels” within the encoder 401.”).
Cui, Abbas, and Dong are considered to be analogous to the claimed invention as they are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Abbas in order to generate digital audio using the plurality of phonemes. Generating the digital audio would be beneficial, as generated spectrograms can be used to produce audio signals (Col. 2 Lines 49-57) for the reordered text sequence disclosed in Cui, which would lead to more understandable synthesized speech for documents with varieties of digital content in different spatial orientations, improving user experience for those with visual disabilities or who are otherwise busy with other tasks (Tran et al. (US 2021/0020159), Abstract, para. 0001-0002). Furthermore, using the plurality of phonemes would be beneficial, as this would allow for phoneme-level spectrograms to be generated which when combined with spectrograms generated for different linguistic units such as sentence and word-level spectrograms, leads to improved naturalness for TTS (Col. 2, Lines 21-32 and 40-48; Fig. 4).
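For illustration only, the reading-order indexing quoted from Cui (para. 0068), in which each of N text elements receives an index from 1 to N, yields a reordered sequence when the elements are sorted by their predicted indices. The function name, the two-column example, and the index values below are hypothetical:

```python
# Illustrative sketch (not from any cited reference) of applying predicted
# reading-order indices (1..N), per Cui para. 0068, to reorder an initial
# top-to-bottom, left-to-right extraction of text elements.

def reorder(elements, reading_order_index):
    """Return elements sorted by their predicted reading-order index."""
    if sorted(reading_order_index) != list(range(1, len(elements) + 1)):
        raise ValueError("indices must be a permutation of 1..N")
    return [el for _, el in sorted(zip(reading_order_index, elements))]

# A two-column page extracted row by row interleaves the columns; the
# predicted indices restore a natural column-by-column reading order.
initial = ["Col1-line1", "Col2-line1", "Col1-line2", "Col2-line2"]
natural = reorder(initial, [1, 3, 2, 4])
```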
Regarding claim 15, Cui in view of Abbas and Dong discloses wherein the reading sequence decoder is configured to generate the digital audio as including a spectrogram having the reordered text sequence (Cui, reordered text sequence: para. 0068 “The semantic feature representations of text elements (for example, h.sub.1 to h.sub.7 shown in FIG. 3) may be provided to the order determination model 150 for determining the reading order of the plurality of text elements in the document 162. In some implementations, a reading order index of each text element may be determined. For example, if there are a total of N text elements, the reading order index performs indexing from 1 to N to indicate the reading order of these text elements.”; Abbas, spectrogram from text sequence: Col. 6 Lines 21-31 “FIG. 4 illustrates embodiments of a text-to-spectrogram (acoustic) model. In some embodiments, this is the acoustic model 122 of FIG. 1. In general, the acoustic model predicts spectrograms at a first level (e.g., sentence-level) and uses that predicted spectrogram at a second level (e.g., word-level). Subsequent levels (e.g., a third level) will use at least the preceding level's predicted spectrogram, but also use all other levels' spectrograms in some embodiments. The predicted spectrograms and upsampled frames of the frame-level are then used to predict an “actual” spectrogram.”; Fig. 4, final spectrogram predicted for input text to 403).
Cui, Abbas, and Dong are considered to be analogous to the claimed invention as they are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the teachings of Abbas in order to use the reading sequence decoder disclosed in Abbas to generate digital audio including a spectrogram having the reordered text sequence disclosed in Cui. Doing so would be beneficial, as generated spectrograms can be used to produce audio signals (Col. 2 Lines 49-57) for the reordered text sequence disclosed in Cui, which would lead to more understandable synthesized speech for documents with varieties of digital content in different spatial orientations, improving user experience for those with visual disabilities or who are otherwise busy with other tasks (Tran et al. (US 2021/0020159), Abstract, para. 0001-0002).
Regarding claim 16, Cui in view of Abbas and Dong discloses wherein the document positional encoding is based on coordinates defined in relation to a page of the digital document (Cui, see Eq. 4; para. 0057 “…the embedding representation 331 of the layout information may be represented as follows… where W and H represent a total width and a total height of the document 162, and L represents a length of the input sequence S corresponding to the text element. In the above Equation (4), the x-axis coordinate information (x0, x1) and width w are used as a triple to construct an embedding representation, and the y-axis coordinate information (y0, y1) and height h are used as a triple to construct another embedding representation, and then the two embedding representations are concatenated into the embedding representation of the layout information of the i.sup.th text element.”).
Regarding claim 17, Cui in view of Abbas and Dong discloses wherein the text layout encoder is further configured to generate a text sequence positional encoding as part of the text encoding, the text sequence positional encoding defining a position of the text encoding within a text sequence of the digital document (Cui, para. 0050 “In some implementations, the text elements in the text sequence 312 may also have respective sequence index information, which is to indicate sequential positions in the text sequence 312. Different from the two-dimensional relative spatial positions of the document 162 indicated by the layout information, the sequence index information is used to indicate a relative position of the text element in a one-dimensional text sequence 312, and therefore may also be regarded as one-dimensional position information. It is possible to assign corresponding sequence index information to each text element in order from a starting text element of the text sequence 312.”).
Regarding claim 19, Cui discloses One or more computer readable storage media storing instructions (para. 0126 “The computing device 1200 usually includes various computer storage medium. The computer storage medium may be any available medium accessible by the computing device 1200, including but not limited to, volatile and non-volatile medium, or detachable and non-detachable medium.”) that, responsive to execution by a processing device (para. 0126 “The memory 1220 can be a volatile memory (for example, a register, cache, Random Access Memory (RAM)), non-volatile memory (for example, a Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or any combination thereof. The memory 1220 may include a processing module 1222. This program module is configured to perform the functionalities of various implementations described herein. The processing module 1222 may be accessed and run by the processing unit 1210 to implement the corresponding functions.”), causes the processing device to perform operations including: receiving a digital document having text (Fig. 3, 162; para. 0043 “As shown in FIG. 3, the reading order of a plurality of text elements in the text sequence presented by the document 162 is to be detected.”; para. 0044 “Depending on the type of the document 162, a corresponding text recognition technique may be used to extract the text sequence 312. For example, if the document 162 is an image-format file, the OCR technique may be used to extract the text sequence 312. If the document 162 is a PDF file, a PDF parsing tool may be used to extract the text sequence 312.”); and …a reading order generated jointly …by a text layout encoder (para. 0077 “The feature extraction model 120 may extract a text sequence from the sample documents according to the processing procedure described above, determine the layout information, and determine the semantic feature representations of the text elements in the sample documents based on the text sequence and the layout information.”) and a reading order sequence decoder…(para. 0068 “The semantic feature representations of text elements (for example, h.sub.1 to h.sub.7 shown in FIG. 3) may be provided to the order determination model 150 for determining the reading order of the plurality of text elements in the document 162. In some implementations, a reading order index of each text element may be determined. For example, if there are a total of N text elements, the reading order index performs indexing from 1 to N to indicate the reading order of these text elements.”).
Cui does not specifically disclose generating digital audio based on the digital document, the digital audio including a spectrogram [having a reading order…and a reading order sequence decoder] of a text-to-speech model.
Abbas teaches generating digital audio based on the digital document, the digital audio including a spectrogram (Col. 6 Lines 21-31 “FIG. 4 illustrates embodiments of a text-to-spectrogram (acoustic) model. In some embodiments, this is the acoustic model 122 of FIG. 1. In general, the acoustic model predicts spectrograms at a first level (e.g., sentence-level) and uses that predicted spectrogram at a second level (e.g., word-level). Subsequent levels (e.g., a third level) will use at least the preceding level's predicted spectrogram, but also use all other levels' spectrograms in some embodiments. The predicted spectrograms and upsampled frames of the frame-level are then used to predict an “actual” spectrogram.”; Fig. 4, final spectrogram predicted for input text to 403) [having a reading order…and a reading order sequence decoder] of a text-to-speech model (Fig. 1, 110; Col. 5 Lines 13-16 “The acoustic model 122 generates one or more mel spectrograms at circle 3 and feds the vocoder 124 (as directed by the orchestrator 120). The vocoder, at circle 4, generates audio from the mel spectrogram(s).”).
Cui and Abbas are considered to be analogous to the claimed invention as they both are in the same field of natural language processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui to incorporate the teachings of Abbas in order to generate digital audio based on the digital document, the digital audio including a spectrogram. Doing so would be beneficial, as generated spectrograms can be used to produce audio signals (Col. 2 Lines 49-57) for the reordered text sequence disclosed in Cui, which would lead to more understandable synthesized speech for documents with varieties of digital content in different spatial orientations, improving user experience for those with visual disabilities or who are otherwise busy with other tasks (Tran et al. (US 2021/0020159), Abstract, para. 0001-0002).
Cui in view of Abbas does not specifically disclose that the generating digital audio including a spectrogram has a reading order generated jointly through simultaneous optimization.
Dong teaches that the generating digital audio including a spectrogram has a reading order generated jointly through simultaneous optimization (Dong teaches performing multi-task training to predict an output text sequence (secondary task of speech-to-text translation task (para. 0080, Fig. 6 “Secondary Representation”)) and a corresponding digital audio speech feature (primary task of speech-to-speech translation (para. 0080, Fig. 6 “Target-language speech feature”)) through simultaneously optimizing for both tasks (during training, weighting the losses for the respective tasks (para. 0080)); para. 0080 “The first decoder module 530 is mainly configured to predict and synthesize a target-language speech feature.”; para. 0080 “…and the speech-to-text translation task is used to transform a source-language speech feature into source-language text and transform the source-language text into target-language text.”; para. 0080 “During training, the two secondary tasks accept an input of the encoder module 510, and predicted loss values are added to the primary task in the form of a weighted sum. During testing, the second decoder modules 550 are not used.”).
Cui, Abbas, and Dong are considered to be analogous to the claimed invention as they are in the same field of natural language processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui in view of Abbas to incorporate the teachings of Dong in order to specifically generate digital audio as part of multi-task training by jointly modeling text reading order detection and digital audio generation. Doing so would be beneficial, as multi-task learning obtains useful information included in a plurality of different tasks in order to obtain more accurate learning for each individual task, allowing the different tasks to improve each other and preventing a single task from easily falling into a local optimum (Dong, para. 0047).
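For clarity of the record, the weighted-sum multi-task training objective described by Dong at para. 0080 (secondary-task losses added to the primary-task loss as a weighted sum during training, with the secondary decoders unused at test time) may be illustrated by the following sketch. The function and variable names below are the Examiner's hypothetical illustration and do not appear in Dong:

```python
def multitask_loss(primary_loss, secondary_losses, weights):
    """Combine a primary-task loss with weighted secondary-task losses.

    Mirrors the weighted-sum scheme of Dong para. 0080: during
    training, predicted loss values of the secondary tasks are added
    to the primary task in the form of a weighted sum; during testing,
    the secondary decoders (and thus these terms) are not used.
    """
    total = primary_loss
    for loss, weight in zip(secondary_losses, weights):
        total += weight * loss
    return total

# Hypothetical example: a speech-to-speech primary task combined with
# two secondary tasks (e.g., text-sequence / reading-order prediction).
total = multitask_loss(2.0, [1.0, 0.5], [0.3, 0.2])  # 2.0 + 0.3 + 0.1 = 2.4
```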
6. Claims 7 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Cui in view of Abbas and Dong, and further in view of Hwang & Chang (NPL "Document-Level Neural TTS Using Curriculum Learning and Attention Masking," hereinafter Hwang).
Regarding claim 7, Cui in view of Abbas and Dong does not specifically disclose wherein the text-to-speech model is trained using curriculum learning.
Hwang teaches wherein the text-to-speech model is trained using curriculum learning (Figure 3; pg. 3 2nd para. “Because our purpose is to synthesize document-level text into speech, we begin training with short sentences and gradually adopt long sentences based on curriculum learning…”).
Cui, Abbas, Dong, and Hwang are considered to be analogous to the claimed invention as they are all in the same field of natural language processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui in view of Abbas and Dong to incorporate the teachings of Hwang in order to train the text-to-speech model using curriculum learning. Doing so would be beneficial, as it would allow the text-to-speech model to be trained on long sentences with limited GPU capacity (Hwang, Abstract).
Regarding claim 20, Cui in view of Abbas and Dong does not specifically disclose wherein the text-to-speech model is trained using curriculum learning.
Hwang teaches wherein the text-to-speech model is trained using curriculum learning (Figure 3; pg. 3 2nd para. “Because our purpose is to synthesize document-level text into speech, we begin training with short sentences and gradually adopt long sentences based on curriculum learning…”).
Cui, Abbas, Dong, and Hwang are considered to be analogous to the claimed invention as they are all in the same field of natural language processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui in view of Abbas and Dong to incorporate the teachings of Hwang in order to train the text-to-speech model using curriculum learning. Doing so would be beneficial, as it would allow the text-to-speech model to be trained on long sentences with limited GPU capacity (Hwang, Abstract).
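For clarity of the record, the curriculum-learning schedule taught by Hwang at pg. 3 (beginning training with short sentences and gradually adopting long sentences) may be illustrated by the following sketch. The stage thresholds and names below are the Examiner's hypothetical illustration and do not appear in Hwang:

```python
def curriculum_stages(sentences, stage_max_lengths):
    """Yield successive training subsets of increasing maximum length.

    Sketch of the schedule described in Hwang, pg. 3, 2nd para.:
    training begins with short sentences, and longer sentences are
    gradually admitted as training progresses.
    """
    for max_len in stage_max_lengths:
        # Admit every sentence whose word count fits the current stage.
        yield [s for s in sentences if len(s.split()) <= max_len]

# Hypothetical corpus and stage thresholds (word counts).
corpus = [
    "short one",
    "a slightly longer sentence here",
    "a much much longer document level sentence for late stages",
]
stages = list(curriculum_stages(corpus, [2, 5, 10]))
```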
7. Claims 12 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Cui in view of Abbas and Dong, and further in view of Klimkov et al. (NPL "Phrase break prediction for long-form reading TTS: exploiting text structure information," hereinafter Klimkov).
Regarding claim 12, Cui in view of Abbas and Dong does not specifically disclose wherein the generating the digital audio includes classifying whether the document position encoding indicates a break in the digital document.
Klimkov teaches wherein the generating the digital audio includes classifying whether the document position encoding indicates a break in the digital document (pg. 2 section 3 “Input Features”: “Distance: CART models do not take context into account. Additional distance features are therefore needed. The number of syllables from the current word to the previous and next punctuation mark was used. This results in 2 additional features…”; pg. 3 section 4 “Modelling”: 2nd para. “The last layer is a softmax which estimates the posterior probabilities of no break and of a respiratory break with a pause…”; pg. 3 section 6 “Subjective evaluations”: 2nd para. “For the listening tests, text with breaks inserted by various phrasing models, was synthesized using a hybrid TTS system…”).
Cui, Abbas, Dong, and Klimkov are considered to be analogous to the claimed invention as they are all in the same field of natural language processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui in view of Abbas and Dong to incorporate the teachings of Klimkov in order to have the generation of the digital audio include classifying whether the document position encoding indicates a break in the digital document. Doing so would be beneficial, as it would improve the prediction of phrase breaks for long sentences, increasing the naturalness of the synthesized speech (Klimkov, Abstract).
Regarding claim 18, Cui in view of Abbas and Dong does not specifically disclose wherein the reading sequence decoder is further configured as a classifier to determine whether a respective said document positional encoding associated with a respective said text encoding indicates a break in the digital document.
Klimkov teaches wherein the reading sequence decoder is further configured as a classifier to determine whether a respective said document positional encoding associated with a respective said text encoding indicates a break in the digital document (pg. 2 section 3 “Input Features”: “Distance: CART models do not take context into account. Additional distance features are therefore needed. The number of syllables from the current word to the previous and next punctuation mark was used. This results in 2 additional features…”; pg. 3 section 4 “Modelling”: 2nd para. “The last layer is a softmax which estimates the posterior probabilities of no break and of a respiratory break with a pause…”; pg. 3 section 6 “Subjective evaluations”: 2nd para. “For the listening tests, text with breaks inserted by various phrasing models, was synthesized using a hybrid TTS system…”).
Cui, Abbas, Dong, and Klimkov are considered to be analogous to the claimed invention as they are all in the same field of natural language processing. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Cui in view of Abbas and Dong to incorporate the teachings of Klimkov in order to further configure the reading sequence decoder as a classifier to determine whether a respective said document positional encoding associated with a respective said text encoding indicates a break in the digital document. Doing so would be beneficial, as it would improve the prediction of phrase breaks for long sentences, increasing the naturalness of the synthesized speech (Klimkov, Abstract).
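For clarity of the record, the softmax output layer described by Klimkov at pg. 3, section 4 (estimating posterior probabilities of "no break" versus "a respiratory break with a pause") may be illustrated by the following sketch. The two-class indexing, threshold value, and names below are the Examiner's hypothetical illustration and do not appear in Klimkov:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_break(logits, threshold=0.5):
    """Return True if the posterior probability of a break exceeds
    the threshold.

    Two classes mirror the final softmax layer described in Klimkov,
    pg. 3, section 4: index 0 = no break, index 1 = respiratory break.
    """
    probs = softmax(logits)
    return probs[1] > threshold

# Hypothetical logits from a phrasing model at one word position.
is_break = classify_break([0.2, 1.5])
```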
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Schnell et al. (US 12,548,551 B1): jointly predicting linguistic representations and acoustic representations (Fig. 1, 165 and 175)
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CODY DOUGLAS HUTCHESON whose telephone number is (703)756-1601. The examiner can normally be reached M-F 8:00AM-5:00PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/CODY DOUGLAS HUTCHESON/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659