DETAILED ACTION
This Office action is in response to the amendment filed on 02/11/2026 in Application No. 18/907,466. Claims 1-20 are presented for examination and are currently pending. Applicant's arguments have been carefully and respectfully considered.
Response to Arguments
The Applicant's arguments have been considered but are moot in view of the new ground(s) of rejection necessitated by the claim amendments, which rely on a newly added secondary reference. Guo in view of Harikumar and further in view of Bourdev discloses the limitations of independent claims 1 and 11.
In response to Applicant's request for withdrawal of the 35 U.S.C. 103 rejection and assertion that the claims are patentable, it is noted that Guo in view of Harikumar and further in view of Bourdev now teaches the limitations of independent claims 1 and 11. As a result, the independent claims are not allowable.
Furthermore, dependent claims 2-10 and 12-20, which depend directly or indirectly from claims 1 and 11, are not allowable for the same reasons discussed above regarding the independent claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
3. Claims 1, 5, 6, 8, 11, 15, 16 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. ("MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 1811-1824, date of publication 2 May 2023) in view of Harikumar (US 2023/0419551, filed 06/22/2022) and further in view of Bourdev et al. (US 2018/0174275).
Regarding claim 1, Guo teaches a system for multimodal data processing and generation using a vector quantized variational autoencoder (VQ-VAE) (We propose a Vector-Quantized Variational AutoEncoder (VQ-VAE) based feature analyzer to encode acoustic features into sequences with different time resolutions, and quantize them with multiple VQ codebooks to form the Multi-Stage Multi-Codebook Representation (MSMCR), abstract) and
a latent transformer (Feature Analyzer: MSMC-VQ-VAE is implemented based on Feed-Forward Transformer in FastSpeech, pg. 1816, right col., section B. Implement Details), comprising:
receive multimodal data comprising different data types (In prediction, the multi-stage predictor is trained to map the input text sequence to MSMCRs (Multi-Stage Multi-Codebook Representation) in stages, abstract; The model first encodes the input speech sequence x to encoding sequences {e(1),..., e(S)} with encoders progressively:, pg. 1814, left col., last para. The Examiner notes this indicates that text and speech are provided as inputs, which constitute a plurality of different data types);
convert the combined representation into a discrete latent representation using vector quantization with a learned codebook (Then, the vector-quantized latent representations z are obtained as follows: z = Q(~z; c) (2) where c denotes the codebook containing M codewords with the dimension of N (pg. 1813, left col., first para.); ~z refer to latent sequences before … vector quantization "Q" with the codebook c composed of M codewords with the dimension of N, pg. 1813, Fig. 2),
wherein vector quantization comprises mapping continuous latent vectors to nearest discrete codebook vectors via nearest-neighbor lookup (For each latent vector z˜i, the quantizer Q compares it with all codewords, and chooses the one nearest to it according to the Euclidean distance, pg. 1813, left col., first para.);
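For clarity, the nearest-neighbor codebook lookup described in the cited passage can be illustrated with a minimal sketch; the function and variable names below are hypothetical and are not taken from Guo.

import numpy as np

def quantize(z_tilde, codebook):
    """Map each continuous latent vector to its nearest codeword (Euclidean distance).

    z_tilde:  (T, N) continuous latent vectors.
    codebook: (M, N) codebook of M codewords with dimension N.
    Returns the quantized sequence z of shape (T, N) and the chosen codeword indices.
    """
    # Squared Euclidean distance from every latent vector to every codeword: (T, M).
    dists = ((z_tilde[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)      # index of the nearest codeword per vector
    return codebook[indices], indices   # quantized sequence z and the indices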
process the discrete latent representation using a transformer to learn relationships (In each encoder, the input sequence is processed by a projection layer, and added with position encodings, then fed to 4 FeedForward Transformer blocks, pg. 1816, right col., section B. Implement Details) and
generate new discrete representations (The output sequence is also processed by another neural network based module X for prediction, pg. 1814, right col., last para.);
decode the new discrete representations into output data (The decoder Di first transforms the quantized sequence z(i) with a projection, and adds it with h(i+1) when i<S, pg. 1814, right col., last para.);
restore information lost during the vector quantization using a neural upsampler …, trained to leverage cross-modal correlations between the different data types (A residual convolutional layer further processes it, which is then up-sampled by repetition to h(i) (pg. 1814, right col., last para.); Besides, since latent sequences in MSMCR are extracted in order of higher stage to lower stage, the generation process should also consider this pattern to maintain the correlation between sequences on the timescale, pg. 1815, right col., first para.), and
jointly train (VQ-VAE aims to learn a discrete latent representation from target data with an encoder-decoder model. As shown in Fig. 2, VQ-VAE comprises three parts, encoder, decoder, and the VQ operation in between, pg. 1812, right col., last para.) the encoding, vector quantization, processing, decoding, and neural upsampling using a combined loss function that includes a vector quantization loss component (As shown in Fig. 4, this model aims to encode the input sequence x into the multi-stage vector-quantized representation Z = {z(1),..., z(S) } using the codebook group C = {c(1),..., c(S) }, where S denotes the number of stages (pg. 1814, left col., last para.); … The quantized sequence z(i) is processed by the decoder Di to obtain h(i) for the following quantization and decoding, and to predict the next-stage quantized sequence z(i−1) or speech sequence x: … The decoder Di first transforms the quantized sequence z(i) with a projection, and adds it with h(i+1) when i … A residual convolutional layer further processes it, which is then up-sampled by repetition to h(i)... The loss function of this model is written as follows:
[Equation image (media_image1.png): loss function of the multi-stage MSMC-VQ-VAE model]
pg. 1814, right col., last para.).
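For the reader's convenience, a plausible reconstruction of the cited multi-stage loss, inferred only from the surrounding description (a speech reconstruction term, per-stage prediction of the next-stage quantized sequence, and a VQ term balanced by a coefficient), is shown below; the exact form and weighting in Guo may differ:

\mathcal{L} = \|\hat{x} - x\|_2^2 + \sum_{i=2}^{S}\big\|\hat{z}^{(i-1)} - z^{(i-1)}\big\|_2^2 + \alpha\sum_{i=1}^{S}\big\|\tilde{z}^{(i)} - z^{(i)}\big\|_2^2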
Guo does not explicitly teach a computing device comprising at least a memory and a processor; a plurality of programming instructions stored in the memory and operable on the processor, wherein the plurality of programming instructions, when operating on the processor, cause the computing device to: encode each data type into a modality-specific representation using specialized encoders for each data type; fuse the modality-specific representations into a combined representation by applying cross-modal attention mechanisms to capture relationships between the different data types; wherein the cross-modal correlations comprise learned relationships between features in a first data type and features in a second data type different from the first data type.
Harikumar teaches a computing device comprising at least a memory and a processor; a plurality of programming instructions stored in the memory and operable on the processor, wherein the plurality of programming instructions, when operating on the processor, cause the computing device to (a computer-readable apparatus including a storage medium stores computer-readable and computer-executable instructions that are configured to, when executed by at least one processor apparatus, cause the at least one processor apparatus or another apparatus (e.g., the computerized apparatus) to perform the operations of the method 1200. Example components of the computerized apparatus are illustrated in FIG. 15 [0102]):
encode each data type (the input image 712, a sketch image 716 [0087]; A sketch image 714 is a sketch version of the input image 712 [0088]) into a modality-specific representation using specialized encoders for each data type (In some embodiments, the image encoder model 104 includes a corresponding second encoder 105. The second encoder 105 is configured to receive an image 112. In some implementations, the image 112 is not an image of the type that sketch image 110 [0037]; a second deep learning model of a sketch encoder (e.g., 520) is trained. In some embodiments, second autoencoder (e.g., VQVAE) model is trained on sketches of input images [0067]. The Examiner notes that image encoder models 104 and 105 encode an image as a first data type, and the sketch encoder, a second autoencoder, encodes sketch image 110 as a second data type);
fuse the modality-specific representations into a combined representation (Tokenized representations 406 of FIG. 4 may be examples of the unique integer values obtained by the image encoder 510, and similarly, unique integer values obtained by the sketch encoder 520 [0070]. The Examiner notes that tokenized representations 406 constitute a combined representation.) by applying cross-modal attention mechanisms to capture relationships between the different data types (In some embodiments, the tokenized representations 406 are outputted from multiple ones of the encoder layers … attention layers are present between the encoder 404 and the decoder 408 [0056]. The Examiner notes that the attention layers are the cross-modal attention mechanisms);
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Guo to incorporate the teachings of Harikumar for the benefit of increasing the resolution of the image, which results in the intricate patterns being created without loss of fidelity and at high efficiency (Harikumar [0099]).
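For illustration only, the cross-modal attention fusion mapped above can be sketched as follows; the function and variable names are hypothetical and are not drawn from Harikumar.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fuse(text_repr, image_repr):
    """Fuse two modality-specific representations with single-head cross-attention.

    text_repr:  (T, d) text encoding; image_repr: (I, d) image encoding.
    Text positions attend over the image features; the attended features are
    concatenated with the text features to form a combined representation.
    """
    d = text_repr.shape[-1]
    attn = softmax(text_repr @ image_repr.T / np.sqrt(d))   # (T, I) attention weights
    attended = attn @ image_repr                            # (T, d) attended image features
    return np.concatenate([text_repr, attended], axis=-1)   # (T, 2d) combined representation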
Bourdev teaches decode the new discrete representations into output data (The standard decoder module 115 decodes the encoded LQ image 180 to produce the decoded LQ image 185 [0078], Fig. 4; encoder model 350 performs additional processing steps on the tensor 375 [0054]);
restore information lost during the vector quantization using a neural upsampler (The decoded LQ image 185 is upsampled by the upsampling module 210 [0078]; the upsampling module 210 (including the trained upsampling model) [0065]; For example, the upsampling model can be a convolutional neural network that is trained to predict an image of a higher resolution given an image of a lower resolution [0106]) comprising a trained neural network with learnable parameters (Here, … the upsampling model, … can include an input layer of nodes, an output layer of nodes, and one or more hidden layers of nodes between the input and output layers. Each layer can be associated with learned parameters that are adjusted during training due to the loss function. Examples of learned parameters include learned weights and learned biases [0067]), trained to leverage cross-modal correlations between the different data types, wherein the cross-modal correlations comprise learned relationships (Here, the encoder/decoder block applies an autoencoder model (e.g., encoder model and decoder model) that attempts to learn a representation of the residual 320. In various embodiments, the representation of the residual 320 is a tensor 375 including structural features of the residual 320 [0052]) between features in a first data type and features in a second data type different from the first data type (The residual generation module 220 determines a residual 320 between the labeled HQ image 130 and the HQ′ image 310 [0076]);
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Guo and Harikumar to incorporate the teachings of Bourdev for the benefit of improving the image quality of an image using a machine-learned autoencoder (Bourdev [0002]) and further improving higher-resolution images (Bourdev [0086]).
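For illustration only, repetition-based upsampling followed by a small learned refinement, standing in for the upsampling described in the cited passages of Guo and Bourdev, can be sketched as follows; the names and the single-layer refinement are hypothetical.

import numpy as np

def upsample_by_repetition(z, factor):
    """z: (T, d) quantized latent sequence -> (T * factor, d) by repeating frames."""
    return np.repeat(z, factor, axis=0)

def learned_refinement(h, weights, bias):
    """Toy refinement layer with learnable parameters (weights: (d, d), bias: (d,)).
    Applies an affine transform followed by a ReLU nonlinearity."""
    return np.maximum(h @ weights + bias, 0.0)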
Regarding claim 5, Modified Guo teaches the system of claim 1, Guo teaches wherein the combined loss function incorporates reconstruction quality across all modalities and latent space consistency (Hence, another loss term is introduced to train the encoder by minimizing the Euclidean distance between z and ˜z. The complete loss function is written as follows:
[Equation image (media_image2.png): the complete VQ-VAE loss function]
… and α is a coefficient to balance these two loss terms, pg. 1813, left col., last para.).
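For the reader's convenience, the cited loss is plausibly of the standard VQ-VAE form suggested by the surrounding description (a reconstruction term plus the Euclidean distance between the pre-quantization latent z̃ and the quantized latent z, balanced by α); the exact form in Guo may differ:

\mathcal{L} = \|\hat{x} - x\|_2^2 + \alpha\,\|\tilde{z} - z\|_2^2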
Regarding claim 6, Modified Guo teaches the system of claim 1, Guo teaches wherein the computing device is further caused to explore and manipulate the discrete latent representation (Then, the vector-quantized latent representations z are obtained as follows: z = Q(˜z; c) (2) where c denotes the codebook containing M codewords with the dimension of N, pg. 1813, left col., first para.) to generate new or modified multimodal data (For each latent vector z˜i, the quantizer Q compares it with all codewords, and chooses the one nearest to it according to the Euclidean distance as the quantized output zi, which is written as follows:, pg. 1813, left col., first para. The Examiner notes that the output zi constitutes the modified multimodal data).
Regarding claim 8, Modified Guo teaches the system of claim 1, Guo teaches wherein the multimodal data comprises at least two of: time-series data, textual data, image data, audio data, and structured tabular data (the multi-stage predictor is trained to map the input text sequence to MSMCRs in stages (abstract); The model first encodes the input speech sequence x to encoding sequences {e(1),..., e(S)} with encoders progressively:, pg. 1814, left col., last para.).
Regarding claim 11, claim 11 is similar to claim 1 and is rejected in the same manner and for the same reasons.
Regarding claim 15, claim 15 is similar to claim 5 and is rejected in the same manner and for the same reasons.
Regarding claim 16, claim 16 is similar to claim 6 and is rejected in the same manner and for the same reasons.
Regarding claim 18, claim 18 is similar to claim 8 and is rejected in the same manner and for the same reasons.
4. Claims 2-4 and 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. ("MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 1811-1824, date of publication 2 May 2023) in view of Harikumar (US 2023/0419551, filed 06/22/2022), in view of Bourdev et al. (US 2018/0174275), and further in view of Yu et al. (US 2024/0013504, filed 10/31/2022).
Regarding claim 2, Modified Guo teaches the system of claim 1, Guo teaches wherein encoding the multimodal data (The model first encodes the input speech sequence x to encoding sequences {e(1),..., e(S)} with encoders progressively:, pg. 1814, left col., last para.) comprises:
fusing the modality-specific representations into a combined representation (Then, the vector-quantized latent representations z are obtained as follows: z = Q(~z; c) (2) where c denotes the codebook containing M codewords with the dimension of N (pg. 1813, left col., first para.); ~z refer to latent sequences before … vector quantization “Q” with the codebook c composed of M codewords with the dimension of N, pg. 1813, Fig. 2); and
converting the combined representation into discrete codes using vector quantization (Then, the vector-quantized latent representations z are obtained as follows: z = Q(~z; c) (2) where c denotes the codebook containing M codewords with the dimension of N (pg. 1813, left col., first para.); z refer to latent sequences after … vector quantization "Q" with the codebook c composed of M codewords with the dimension of N, pg. 1813, Fig. 2).
Guo does not explicitly teach encoding each data type into a modality-specific representation using specialized encoders;
Yu teaches wherein encoding the multimodal data comprises: encoding each data type into a modality-specific representation using specialized encoders (In operation, an image I∈RH×W×3, 304, is input into the image encoder 310 [0039]; a text expression 302 is input into the text encoder 306 [0040]);
fusing the modality-specific representations into a combined representation (a concatenation module that concatenates the refined text embedding(s) output by the text adaptor and the feature tokens output by the image encoder; (5) a convolution module that applies a convolution layer to fuse the concatenated refined text embedding(s) and feature tokens to generate flattened feature tokens [0019]; The convolution module 314 applies a convolution layer to fuse the concatenated refined text embedding(s) and feature tokens to generate flattened feature tokens 316 [0042]); and
converting the combined representation into discrete codes (In some embodiments, the feature maps 305, 307, and 309 generated by the image encoder 310 are projected to the dimension of 256 using a linear layer and then flattened into the feature tokens 316, denoted herein by C3′, C4′, and C5 [0042]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Guo, Harikumar and Bourdev to incorporate the teachings of Yu for the benefit of using a transformer encoder, which reduces computation costs, enables faster convergence, and promotes good attention representations (Yu [0043]).
Regarding claim 3, Modified Guo teaches the system of claim 1, Guo teaches wherein the transformer operates without embedding or positional encoding layers (then fed to 4 FeedForward Transformer blocks. Specifically, the number of heads in multi-head attention is 2, and the feedforward module is composed of two convolutional layers with a kernel size of 3 and a ReLU activation function between them.).
Yu also teaches wherein the transformer operates without embedding or positional encoding layers (In some embodiments, the transformer encoder 318 is the transformer encoder of the Deformable DETR model [0043]. The Examiner notes the prior art does not teach embedding or positional encoding layers in the transformer).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Guo, Harikumar and Bourdev to incorporate the teachings of Yu for the benefit of using a transformer encoder, which reduces computation costs, enables faster convergence, and promotes good attention representations (Yu [0043]).
Regarding claim 4, Modified Guo teaches the system of claim 1, Guo teaches wherein decoding the new discrete representations comprises: converting the new discrete representations into a continuous latent representation (Finally, the output speech sequence xˆ is generated by the decoder D for reconstruction: xˆ = D(z) (pg. 1813, left col., first para.); Specifically, in the quantizer block Qi, e(i) is concatenated with the hidden sequence h(i+1) (except when i = S) from the higher-stage decoder, and then transformed by a projection layer to obtain ˜z(i), pg. 1814, right col., first para.; ˜z refer to latent sequences … after vector quantization “Q” with the codebook c composed of M codewords with the dimension of N, pg. 1813, left col., Fig. 2); and
Guo does not explicitly teach generating output data for each modality from portions of the continuous latent representation using modality-specific decoders.
Yu teaches wherein decoding the new discrete representations comprises: converting the new discrete representations into a continuous latent representation; and generating output data for each modality from portions of the continuous latent representation using modality-specific decoders (The location decoder 324 takes the refined feature tokens 320 and the N randomly initialized queries 322 as inputs and outputs location-aware queries 326 [0044]; The mask decoder 328 predicts object masks using self-attention. In particular, the mask decoder 328 uses the location-aware queries 326 to attend the refined feature tokens 320, denoted herein by C3″, C4″, and C5″, and to generate dense self-attention maps [0045]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Guo, Harikumar and Bourdev to incorporate the teachings of Yu for the benefit of using a transformer encoder, which reduces computation costs, enables faster convergence, and promotes good attention representations (Yu [0043]).
Regarding claim 12, claim 12 is similar to claim 2 and is rejected in the same manner and for the same reasons.
Regarding claim 13, claim 13 is similar to claim 3 and is rejected in the same manner and for the same reasons.
Regarding claim 14, claim 14 is similar to claim 4 and is rejected in the same manner and for the same reasons.
5. Claims 7 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. ("MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 1811-1824, date of publication 2 May 2023) in view of Harikumar (US 2023/0419551, filed 06/22/2022), in view of Bourdev et al. (US 2018/0174275), and further in view of Shih et al. (US 2021/0064925).
Regarding claim 7, Modified Guo teaches the system of claim 6; Modified Guo does not explicitly teach wherein exploring and manipulating the discrete latent representation comprises using techniques including interpolation, extrapolation, and vector arithmetic.
Shih teaches wherein exploring and manipulating the discrete latent representation comprises using techniques including interpolation, extrapolation (achieving high-quality long-range video interpolation and extrapolation through operating on a landmark representation space [0050]), and
vector arithmetic (In at least one embodiment, arithmetic operations on texture data and input geometry data compute pixel color data for each geometric fragment [0329]).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Guo to incorporate the teachings of Shih for the benefit of performing inferencing of information, such as speech recognition, or other artificial intelligence services (Shih [0124]).
Regarding claim 17, claim 17 is similar to claim 7 and is rejected in the same manner and for the same reasons.
6. Claims 9, 10, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Guo et al. ("MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS." IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 1811-1824, date of publication 2 May 2023) in view of Harikumar (US 2023/0419551, filed 06/22/2022), in view of Bourdev et al. (US 2018/0174275), and further in view of Yu et al. ("Vector-quantized image modeling with improved VQGAN." arXiv preprint arXiv:2110.04627 (2021), hereinafter "Yu 2021").
Regarding claim 9, Modified Guo teaches the system of claim 1; Modified Guo does not explicitly teach wherein the computing device is further caused to perform conditional generation by adding a condition vector to the input of the transformer.
Yu 2021 teaches wherein the computing device is further caused to perform conditional generation by adding a condition vector to the input of the transformer (we train stage 2 transformer models for … class-conditioned image synthesis … All models are trained with an input image resolution 256 × 256 on CloudTPUv4, pg. 7, last para.; For class-conditioned image synthesis, a class-id token is prepended before the image tokens, pg. 2, Stage 2: Vector-quantized Image Modeling).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Guo to incorporate the teachings of Yu 2021 for the benefit of improved codebook learning, wherein factorized codes with low-dimensional latent variables consistently achieve better reconstruction quality when the latent dimension is reduced (Yu 2021, pg. 7, first para.).
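For illustration only, the class-id token prepending described in the cited Yu 2021 passage can be sketched as follows; the names and the vocabulary-offset convention are hypothetical.

import numpy as np

def prepend_condition(class_id, image_tokens, num_image_codes):
    """class_id: integer class label; image_tokens: (L,) discrete codebook indices.
    The class token is offset past the image-token vocabulary so the two token
    types do not collide in a shared embedding table."""
    class_token = num_image_codes + class_id
    return np.concatenate([[class_token], image_tokens]).astype(np.int64)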
Regarding claim 10, Modified Guo teaches the system of claim 1; Modified Guo does not explicitly teach wherein the computing device is further caused to quantify uncertainty in the generated output data by using multiple samplings from the discrete latent representation.
Yu 2021 teaches wherein the computing device is further caused to quantify uncertainty in the generated output data by using multiple samplings from the discrete latent representation (With a pretrained generative Transformer model, unconditional image generation is achieved by simply sampling token-by-token from the output softmax distribution. All samples used for both qualitative and quantitative results are obtained without temperature reduction. The sampled tokens are then fed into the decoder of ViT-VQGAN to decode output images. Our default Stage 1 ViT-VQGAN encodes input images of resolution 256 × 256 into 32 × 32 latent codes with a codebook size 8192, pg. 5, section 4.1).
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Guo to incorporate the teachings of Yu 2021 for the benefit of improved codebook learning, wherein factorized codes with low-dimensional latent variables consistently achieve better reconstruction quality when the latent dimension is reduced (Yu 2021, pg. 7, first para.).
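For illustration only, quantifying uncertainty by drawing multiple samples of the discrete latent representation and measuring the spread of the decoded outputs can be sketched as follows; the names are hypothetical, and for brevity the positions are sampled independently rather than autoregressively.

import numpy as np

def sample_token_sequences(token_probs, num_samples, rng=None):
    """token_probs: (L, V) per-position softmax distributions over V codewords.
    Returns (num_samples, L) sampled discrete token sequences.
    Note: a real transformer would re-run autoregressively after each sampled
    token; independent per-position sampling keeps this sketch short."""
    rng = rng or np.random.default_rng()
    L, V = token_probs.shape
    return np.stack([
        np.array([rng.choice(V, p=token_probs[i]) for i in range(L)])
        for _ in range(num_samples)
    ])

def output_uncertainty(decoded_outputs):
    """decoded_outputs: (num_samples, ...) decoded data from each sampled sequence.
    The per-element variance across samples serves as an uncertainty estimate."""
    return np.asarray(decoded_outputs).var(axis=0)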
Regarding claim 19, claim 19 is similar to claim 9 and is rejected in the same manner and for the same reasons.
Regarding claim 20, claim 20 is similar to claim 10 and is rejected in the same manner and for the same reasons.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 8am-5pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T Bechtold can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.G./Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148