DETAILED ACTION
This action is responsive to the application filed on 11/10/2025. Claims 1-5, 7-19, and 21-24 are pending and have been examined. This action is Non-final (RCE).
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/10/2025 has been entered.
Response to Arguments
Argument 1: The applicant argues that the claims recite subject matter that is in line with the 8/6/2025 guidelines.
Examiner Response to Argument 1: The examiner has carefully considered the arguments set forth above; however, they are not persuasive. As set forth in the current 101 analysis, even as amended, independent claim 1 (and claims depending therefrom) is still directed, under Step 2A Prong 1, to generating feature vectors, interpreting those feature vectors to produce different “interpretations” (including semantic segmentations and text), and using those interpretations for functions such as searching images, captioning images, providing navigation or instructions, or communicating with a human operator, all of which, as drafted, amount to evaluation, interpretation, and use of information, which are mental processes. The amendments (e.g., expressly reciting that at least one of the different tasks comprises semantic segmentation and generating text data, and that a natural language processing decoder generates text data from a final feature vector) merely refine the type of data interpretation and text generation already recited and do not add any additional element that improves the functioning of the computer itself or otherwise integrates the abstract idea into a practical application under Step 2A Prong 2. Instead, the claim continues to implement these mental processes on generic computer components, such as processors, memories, a camera, a “machine,” and neural-network encoders/decoders performing their conventional roles, which, as explained in the current mapping, are well-understood, routine, and conventional and do not provide “significantly more” under Step 2B. Accordingly, notwithstanding Applicant’s citation to more recent guidance generally addressing AI-related subject matter, the claims as amended remain directed to a judicial exception (a mental process and, in some instances, mathematical concepts) without additional elements that integrate that exception into a practical application or add an inventive concept, and the 101 rejection is therefore maintained.
Argument 2: Applicant contends that the OA acknowledges Wang does not describe “wherein the text data is used to perform at least one of searching the images captured by the camera, captioning the images, providing navigation or instructions to the machine, or communicating with a human operator of the machine” but nevertheless relies on Wang for that feature. Applicant further asserts that Wang’s model “generates an answer prediction from the encoded visual dialogue input based on dialogue input from a human user” and therefore does not describe “the natural language processing decoder generates text data by operating on a single input that is a final one of the feature vectors received from the unified encoder” as recited in claim 1. Applicant argues that Shazeer and Teichmann likewise fail to describe this “single input / final feature vector” configuration and therefore independent claims 1, 5, and 19 (and their dependents) are allowable over the cited references.
Examiner Response to Argument 2: The examiner has considered the argument set forth above; however, it is not persuasive because it mischaracterizes how the rejection relies on the references and because, under a 35 U.S.C. 103 rejection, no single reference is required to disclose all limitations. As set forth in the current mapping, the rejection does not rely on Wang to teach the limitation “wherein the natural language processing decoder generates the text data by operating on a single input that is a final one of the feature vectors received from the unified encoder.” Rather, Shazeer is relied on for a computer-implemented system with processors, memories, and executable instructions implementing a machine-learning model and text-generating tasks; Teichmann is relied on for a unified encoder that produces shared feature vectors used by task-specific decoders (including semantic segmentation) in a multi-task architecture; Wang is relied on for using text data to perform searching images, captioning images, and communicating with a human operator; and Vinyals is relied on specifically for the natural language processing decoder that operates on a single final feature vector output by the encoder to generate text data. As explained in the mapping for claim 1 (and similarly for claims 5 and 19), Vinyals discloses a CNN image encoder whose last hidden layer is used “as an input to the RNN decoder that generates sentences” and states that “the image I is only input once,” i.e., a single fixed-length vector produced by the encoder is supplied as the sole image input to the decoder RNN that generates sentences.
The examiner interprets this last hidden-layer feature vector, provided once as the only image input to the RNN decoder that “generates sentences,” to be the same as “a single input that is a final one of the feature vectors received from the unified encoder” being supplied to a natural language processing decoder that generates text data, because both describe a natural-language decoder that takes the encoder’s final feature representation (a single feature vector produced by the encoder) as its only input for text generation. For independent claims 1, 5, and 19 specifically, the overall combination remains obvious. Shazeer supplies the general multi-task, multi-modal ML system with text-generating tasks; Teichmann supplies the unified image encoder and semantic-segmentation decoder in a multi-task encoder/decoder architecture; Wang supplies the use of generated text data to search images, caption images, and communicate with a human operator; and Vinyals supplies the particular natural-language processing (NLP) decoder that operates on a single final feature vector from an image encoder to generate sentences. It would have been obvious to a person of ordinary skill in the art, before the effective filing date, to incorporate the well-known Neural Image Caption (NIC)-style encoder/decoder configuration from Vinyals into the multi-task image and text system of Shazeer/Teichmann so that the text-generating portion is implemented as a dedicated NLP decoder that consumes the encoder’s final image feature vector and outputs sentences, and to further use those sentences for image search, captioning, and human to machine communication as taught by Wang. 
Doing so represents a predictable use of known encoder/decoder image captioning techniques (Vinyals) in the context of multi-task visual perception and dialogue systems (Shazeer, Teichmann, Wang) and thus would have been a routine design choice for a skilled practitioner seeking to generate text interpretations from shared image feature vectors. Accordingly, the combination of Shazeer, Teichmann, Wang, and Vinyals still renders independent claims 1, 5, and 19 obvious, in spite of Applicant’s observation that Wang alone does not describe the “single input / final feature vector” NLP decoder.
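For illustration only, the “single input / final feature vector” arrangement discussed above can be sketched as follows. This is a toy sketch under stated assumptions and is neither the Vinyals model nor the claimed system: the names `encode` and `decode_text`, the arithmetic, and the vocabulary are all hypothetical stand-ins. It shows only the data flow at issue, namely that the text decoder receives the encoder's final feature vector once and that later steps use only the decoder's own state and previous word.

```python
from typing import List

Vector = List[float]

def encode(image: List[float]) -> List[Vector]:
    """Hypothetical stand-in encoder: returns [intermediate, final] feature vectors."""
    intermediate = [x * 0.5 for x in image]
    final = [sum(intermediate) / len(intermediate), max(intermediate)]
    return [intermediate, final]

def decode_text(final_vector: Vector, vocab: List[str], max_words: int = 3) -> List[str]:
    """Hypothetical stand-in text decoder.

    The image feature is consumed exactly once, to set the initial state.
    Every later step sees only the running state and the previous word index.
    """
    state = sum(final_vector)  # image information enters here, once
    words: List[str] = []
    prev_index = 0
    for _ in range(max_words):
        index = int(state + prev_index) % len(vocab)
        words.append(vocab[index])
        prev_index = index
        state = state * 0.5 + 1.0  # state update uses no image input
    return words

features = encode([2.0, 4.0, 6.0])
# Only the final feature vector is passed to the text decoder.
caption = decode_text(features[-1], ["rock", "sand", "slope", "crater"])
```

The intermediate feature vector is never passed to `decode_text` in this sketch, which is the sense in which the decoder operates on a single input.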
Argument 3: Applicant further asserts that “independent claims 1, 5, and 19 are allowable over the cited references” and that “the dependent claims are submitted to be allowable over the cited references in the same manner, at least because they are dependent on the independent claims and thus contain all the limitations of the independent claims.”
Examiner Response to Argument 3: The examiner has considered the argument set forth above; however, it is not persuasive because the dependent claims have been individually addressed and rejected under 35 U.S.C. 103 based on additional specific teachings in the cited art, beyond those applied to the independent claims. Accordingly, Applicant’s general assertion that the dependent claims are allowable “in the same manner” as the independent claims does not overcome the specific 103 rejections of those dependent claims.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition
of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the
conditions and requirements of this title.
Claims 1-5, 7-19, and 21-24 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1,
Step 1 – Is the claim to a process, machine, manufacture, or composition of matter?
The claim is directed to a system, which falls under the category of machine. Step 1 is satisfied.
Step 2A Prong 1:
“generate the one or more feature vectors useful for performing, based on the one or more feature vectors, a plurality of different tasks each comprising different interpretations of the data; and at the different tasks comprise semantic segmentation and generating text data; interpreting the one or more feature vectors so as to decode one or more of the feature vectors… or communicating with a human operator of the machine” -- The limitation is directed to analyzing data, producing interpretations (semantic segmentations and text) from internal representations, and communicating with a human operator. As recited, this is evaluation and interpretation of information, which can be performed mentally using observation, reasoning, and judgment, and the limitation therefore recites a mental process.
Step 2A Prong 2 and Step 2B:
“A computer implemented system for interpreting data using machine learning, comprising: one or more processors; one or more memories; and one or more computer executable instructions embedded on the one or more memories, wherein the computer executable instructions are configured to execute…a unified encoder comprising a neural network encoding data of the images into one or more feature vectors, wherein the unified encoder is trained using machine learning to…a plurality of decoders connected to the unified encoder, each of the decoders comprising a neural network…a plurality of decoders, including a semantic segmentation decoder and a natural language processing decoder connected to the unified encoder, each of the decoders comprising a neural network” -- This limitation recites a machine learning system whose instructions merely apply the limitations of the claim on a computer (the processors, the memories, and the computer system itself), which cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“to output one of the interpretations and wherein the natural language processing decoder generates the text data by operating on a single input that is a final one of the feature vectors received from the unified encoder; and an output to the machine comprising the semantic segmentation and the text data,” -- The limitation recites outputting interpretations, with the natural language processing decoder generating the text data from a single input that is the final feature vector received from the encoder, and an output to the machine that comprises the semantic segmentation and the text data. The limitation is directed to an insignificant extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, transmitting data over a network is a well-understood, routine, and conventional (WURC) activity, and does not provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“wherein the text data is used to perform at least one of searching the images captured by the camera, captioning the images, providing navigation or instructions to the machine,” -- The limitation recites that the text data is used to search the images captured by the camera, caption the images, or provide navigation or instructions to the machine. The limitation amounts to mere instructions to apply the exception on a computer, and thus it does not integrate to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
Therefore, claim 1 is non-patent eligible.
Regarding claim 2,
Step 1: The claim depends on claim 1, which is directed to a system, and thus falls under the category of machine. The claim satisfies Step 1.
Step 2A Prong 1:
“The computer implemented system of claim 1, wherein the different interpretations comprise at least one of a different classification or a conversion of the data into a different data format.” – The limitation recites that the interpretations of the data include a classification or a conversion of the data into a different format, which is directed to a mental process.
There are no elements to be evaluated under Step 2A Prong 2 and Step 2B.
Therefore, claim 2 is non-patent eligible.
Regarding claim 3,
Step 1: The claim depends on claim 1, which is directed to a system, and thus falls under the category of machine. The claim satisfies Step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, wherein the data comprises first image data, and the different interpretations comprise at least one of text data, second image data, or semantic segmentation” – This limitation recites that the data comprises first image data and that the different interpretations comprise text data, second image data, or semantic segmentation. This merely limits the claim to a particular field of use or environment and cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 3 is non-patent eligible.
Regarding claim 4,
Step 1: The claim depends on claim 1, which is directed to a system, and thus falls under the category of machine. The claim satisfies Step 1.
There are no elements to be evaluated under step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, wherein the different tasks comprise image captioning or natural language processing, semantic segmentation, and image reconstruction.” – This limitation recites that the tasks first introduced in claim 1 include image captioning or natural language processing, semantic segmentation, and image reconstruction, which merely applies the mental process of claim 1 to a particular field of use, and cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 4 is non-patent eligible.
Regarding claim 5,
Step 1: The claim is directed to a system, which falls under the category of machine. The claim satisfies Step 1.
Step 2A Prong 1:
“to generate the one or more feature vectors useful for performing a plurality of different tasks each comprising different interpretations of data;…interpreting the one or more feature vectors so as to decode one or more feature vectors to output one of the interpretations;” -- This limitation is directed to generating feature vectors for performing tasks based on interpretation of data, which is directed to a process that can be performed mentally using observation, evaluation, and judgement, and thus is considered a mental process.
Step 2A Prong 2 and Step 2B:
“A computer implemented system for interpreting data using machine learning, comprising: an application-specific integrated circuit (ASIC) for an artificial neural network, the ASIC comprising one or more processors and one or more memories configured to execute: a unified encoder comprising a neural network encoding data into one or more feature vectors, wherein the unified encoder is trained using machine learning… a plurality of decoders including a semantic segmentation decoder and a natural language processing decoder connected to the unified encoder, each of the decoders comprising a neural network…wherein the natural language processing decoder generates text data by operating on a single input that is a final one of the feature vectors from the unified encoder:” -- The limitation recites a system for data interpretation using machine learning that comprises an integrated circuit for an ANN, whose processors and memories are configured to execute the recited encoder and decoders. This amounts to instructions to apply the exception on a computer/system/machine, which cannot be integrated to a practical application, nor can it provide significantly more than the judicial exception (see MPEP 2106.05(f)).
Therefore, claim 5 is non-patent eligible.
Regarding claim 7,
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
There are no elements to be evaluated under step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, wherein: the system comprises a distributed network of the processors, the unified encoder is modular so that the unified encoder can be transmitted between different ones of the processors and executed or trained on each of the different ones of the processors, and the decoders can be executed on different ones of the processors.” – The limitation recites a system comprising a distributed network of processors between which the encoder can be transmitted, which falls under insignificant extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, a distributed network transmitting the encoder between processors is a well-understood, routine, and conventional (WURC) activity, and cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
Therefore, claim 7 is non-patent eligible.
Regarding claim 8,
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
Step 2A Prong 1:
“concatenates the first output and the second output to form a combined output;” – This limitation recites concatenating outputs to form a combined output, which, in simplified terms, means linking the outputs together to produce a combined output. This can be performed mentally and is therefore a mental process.
Step 2A Prong 2 and Step 2B:
“the data comprises an image; one of the decoders comprises a semantic segmentation decoder executing a decoder convolution layer and deconvolution layers, the semantic segmentation decoder:” – This limitation recites that the data comprises an image and that a decoder executes convolution and deconvolution layers (i.e., applies them on a computer), which cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
“the unified encoder executes a plurality of encoder convolution layers so as to output a first one of the feature vectors comprising an intermediate feature vector after a first plurality of the convolution layers and a final feature vector after all the convolution layers; receives the intermediate feature vector and passing the intermediate feature vector through the decoder convolution layer to form a first output; receives the final feature vector and passing the final feature vector through one a first one of the deconvolution layers to form a second output; passes the combined output through at least a second one of the deconvolution layers to form the one of the interpretations.” -- This limitation recites an encoder that executes convolution layers to output an intermediate feature vector after a first group of the convolution layers and a final feature vector after all of them, and a decoder that receives those feature vectors and passes them through convolution and deconvolution layers. This is directed to mere data gathering and data manipulation, and thus is an insignificant extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the act of transmitting data over a network is a well-understood, routine, and conventional (WURC) activity and thus cannot be considered significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
Therefore, claim 8 is non-patent eligible.
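For reference, the order of operations recited in claim 8 (decoder convolution on the intermediate feature vector, deconvolution on the final feature vector, concatenation, then a second deconvolution) can be sketched as below. This is a minimal sketch assuming one-dimensional lists in place of image tensors; the functions `decoder_conv`, `deconv`, and `segmentation_decoder` are hypothetical placeholders that only mimic how the pieces connect, not how real convolution or deconvolution layers compute.

```python
from typing import List

Vector = List[float]

def decoder_conv(v: Vector) -> Vector:
    """Placeholder for the decoder convolution layer (length-preserving)."""
    return [x + 1.0 for x in v]

def deconv(v: Vector) -> Vector:
    """Placeholder for a deconvolution (upsampling) layer: doubles the length."""
    out: Vector = []
    for x in v:
        out.extend([x, x])
    return out

def segmentation_decoder(intermediate: Vector, final: Vector) -> Vector:
    first_output = decoder_conv(intermediate)  # intermediate vector -> first output
    second_output = deconv(final)              # final vector -> second output
    combined = first_output + second_output    # concatenation into a combined output
    return deconv(combined)                    # second deconvolution -> interpretation

result = segmentation_decoder([1.0, 2.0, 3.0, 4.0], [5.0, 6.0])
```

In this sketch the final feature vector is upsampled to the length of the intermediate one before concatenation, which is an assumption made so that the two outputs are comparable in size.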
Regarding claim 9,
Step 1: The claim depends on claim 8, and claim 8 depends on claim 1, which is directed to a system, and thus the claim falls under the category of machine. The claim satisfies Step 1.
Step 2A Prong 1:
“concatenates the reduced dimension feature vector with a previous hidden state and previous hidden word, if necessary; to form a concatenated layer” – This limitation is directed to concatenating the feature vector with a previous hidden state and a previous word; the act of concatenating vectors is considered to be a mental process.
Step 2A Prong 2 and Step 2B:
“receives only the final feature vector; passes the flattened feature vector through a fully connected layer to reduce a number of dimensions and form a reduced dimension feature vector; inputs the concatenated layer to a bidirectional GRU layer to form a GRU output passes the GRU output through at least one fully connected layer so as to reduce a dimensions and form another of the interpretations comprising a word output.” – This limitation recites receiving a vector, passing the vector through layers in a network, and passing the GRU output through layers to form a word output, and thus is an insignificant extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the act of transmitting data over a network is a well-understood, routine, and conventional (WURC) activity and thus cannot be considered significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
Therefore, claim 9 is non-patent eligible.
Regarding claim 10,
Step 1: The claim is directed to a system, which falls under the category of machine. The claim satisfies Step 1.
Step 2A Prong 1:
“successively deconvolutes the final feature vector through a plurality of deconvolution layers so as to reconstruct the data comprising an image” – The limitation is directed to deconvolving a feature vector through layers of the neural network for image reconstruction, which involves mathematical calculations, and thus is directed to a mathematical concept.
Step 2A Prong 2 and Step 2B:
“The system of claim 9, wherein another one of the decoders comprises an image reconstruction decoder:” – This limitation merely recites furthering the limitation to a particular environment or field of use, and thus cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 10 is non-patent eligible.
Regarding claim 11,
Step 1: The claim depends on claim 10, and claim 10 depends on claim 9, which is directed to a system, and thus the claim falls under the category of machine. The claim satisfies Step 1.
There are no elements to be evaluated under Step 2A prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 10, wherein hidden layers in the image reconstruction decoder and the semantic segmentation decoder are equipped with RELU activation.” – This limitation merely recites furthering the limitation to a particular environment or field of use, and thus cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 11 is non-patent eligible.
Regarding claim 12,
Step 1: The claim depends on claim 8, and claim 8 depends on claim 1, which is directed to a system, and thus the claim falls under the category of machine. The claim satisfies Step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 8, wherein the unified encoder comprises a spatial pyramid pooling layer after the convolution layers.” – The limitation recites a unified encoder that further comprises a spatial pyramid pooling layer after the convolution layers, which merely further limits the unified encoder to a particular field of use or environment, and thus cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 12 is non-patent eligible.
Regarding claim 13,
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
There are no elements to be evaluated under step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, wherein the different tasks comprise terrain classification and image captioning.” – This limitation merely recites furthering the limitation to a particular environment or field of use, and thus cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 13 is non-patent eligible.
Regarding claim 14,
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
There are no elements to be evaluated under step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, wherein the unified encoder comprises a RES NET neural network, an Xception neural network, or a MobileNet neural network.” -- This limitation merely recites furthering the limitation to a particular environment or field of use, and thus cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 14 is non-patent eligible.
Regarding claim 15,
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
Step 2A Prong 1:
“wherein the machine utilizes one or more of the interpretations for operation of the machine.” – The limitation is directed to utilizing one or more of the interpretations for operation of a machine, which can be performed in the human mind with or without the aid of a machine, and thus is directed to a mental process.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, further comprising a machine coupled to or including one of the processors, wherein the machine comprises a vehicle, a spacecraft, a weapon, an aircraft, a robot, a medical device, an imaging device or camera, a rover, a sensor, an actuator, an intelligent agent, or a smart device in one or more smart buildings” -- This limitation merely recites furthering the limitation to a particular environment or field of use, and thus cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 15 is non-patent eligible.
Regarding claim 16,
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
Step 2A Prong 1:
“utilizing the interpretations for operation of the apparatus” – The limitation is directed to utilizing interpretations, which is directed to a mental process.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, further comprising an apparatus coupled to or including one of the processors… wherein the apparatus comprises at least one machine selected from a machine performing automated manufacturing, devices controlled by a control system, one or more devices used in banking, one or more devices supplying power or controlling power distribution, or one or more devices in an automotive or aerospace system.” – The limitation recites, in a generic manner, instructions to apply the exception on a system that comprises an apparatus with processors, and thus cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
Therefore, claim 16 is non-patent eligible.
Regarding claim 17,
Step 1: The claim depends on claim 15, and claim 15 depends on claim 1, which is directed to a system, and thus the claim falls under the category of machine. The claim satisfies Step 1.
There are no elements to be evaluated under step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 15, comprising a control system actuating motion of the machine in response to the interpretations” – This limitation recites further generic instructions to apply the exception, in which the control system actuates the motion of the machine based on the interpretations, which cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(f)).
Therefore, claim 17 is non-patent eligible.
Regarding claim 18,
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
There are no elements to be evaluated under step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, further comprising a display displaying the interpretations and a camera for capturing the data” -- This limitation merely recites furthering the limitation to a particular environment or field of use, and thus cannot be integrated to a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 18 is non-patent eligible.
Regarding claim 19,
Step 1: The claim is directed to a method, which falls under the category of process. The claim satisfies Step 1.
Step 2A Prong 1:
“to generate one or more training feature vectors useful for performing, based on the one or more feature vectors, a plurality of different tasks each comprising different interpretations of training data;” -- This limitation is directed to generating feature vectors for performing tasks based on interpretation of data, which is directed to a process that can be performed mentally using observation, evaluation, and judgement, and thus is considered a mental process.
“mutual transfer learning comprising propagating and gradient across orthogonal task specific parameter spaces” -- The limitation is directed to using a mutual transfer learning method that involves gradient propagation across parameter spaces. The limitation is directed to the use of a known mathematical concept, gradient propagation, and thus it is directed to a mathematical concept.
“wherein the text data is used to perform at least one of searching the images captured by the camera, captioning the images … or communicating with the human operator of machine;” -- The limitation recites that the text data is used to search images captured by a camera, caption the images, or communicate with a human operator of the machine. The limitation is directed to a process that can be performed in the human mind using evaluation, observation, and judgement, and thus it is directed to a mental process.
Step 2A Prong 2 and Step 2B:
“outputting the interpretations to a machine coupled to a camera, wherein the interpretations comprise an identification of an environment of the machine” -- The limitation recites outputting interpretations to a machine that is coupled to a camera, where the interpretations comprise an identification of the machine's environment. The limitation is directed to outputting interpretations (gathered data) over a network, and it is considered an insignificant extra-solution activity that cannot be integrated to a practical application (see MPEP 2106.05(g)). Furthermore, under Step 2B, the act of inputting/outputting gathered data is a well-understood, routine, and conventional (WURC) activity that cannot provide significantly more than the judicial exception (see MPEP 2106.05(d)(II)).
“A method for interpreting data using machine learning, comprising: training a unified encoder, comprising neural network, using one or more machine learning models,… encoding new data, using the unified encoder, into one or more feature vectors,… using a plurality of decoders, including a semantic segmentation decoder and a natural language processing decoder connected to the unified encoder, each of the decoders comprising a neural network…wherein the natural language processing decoder generates one of the interpretations comprising text data by operating on a single input that is a final one of the feature vectors from the unified encoder;…and the text data is used to perform at least one of searching the images captured by the camera, captioning the images, providing navigation or instructions to the machine…and wherein the unified encoder is trained using at least one of:…or the machine learning comprising a first model for performing at first one of the different tasks and a second model for performing a second one of the different tasks, and the training of the unified encoder alternates between the first model and the second model after an epoch, or trains both methods each epoch.” -- The limitation recites a method that trains an encoder using machine learning models, encodes new data into feature vectors using the encoder, and performs further steps that amount to mere instructions to apply the judicial exception on a computer. The limitation does not integrate the judicial exception into a practical application, nor does it provide significantly more than the judicial exception (see MPEP 2106.05(f)).
Therefore, claim 19 is not patent eligible.
Regarding claim 21:
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, wherein the machine comprises a vehicle, a spacecraft, a weapon, an aircraft, a robot, a medical device, a machine performing automated manufacturing, one or more devices used in banking, one or more devices supplying power or controlling power distribution, or a smart device in one or more smart buildings.” -- The limitation recites that the machine introduced in claim 1 comprises one of several listed types of machine, such as a vehicle, a spacecraft, a machine performing automated manufacturing, or one or more devices used in banking or power distribution. The limitation amounts to no more than limiting the machine to a particular field of use or technological environment, and thus it does not integrate the judicial exception into a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 21 is not patent eligible.
Regarding claim 22:
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, wherein the image captioning comprises connections of words relating different parts in the image according to Scientific Captioning of Terrain Images (SCOTI).” -- The limitation recites that the image captioning comprises connections of words that relate different parts of the image according to SCOTI. The limitation amounts to no more than a further limitation to a field of use or technological environment, and thus does not integrate the judicial exception into a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 22 is not patent eligible.
Regarding claim 23:
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
There are no elements to be evaluated under Step 2A Prong 1.
Step 2A Prong 2 and Step 2B:
“The system of claim 1, wherein the unified encoder is trained using the machine learning comprising a first model for performing a first one of the different tasks and a second model for performing a second one of the different tasks, and the training of the unified encoder alternates between the first model and the second model after an epoch or trains both methods each epoch.” -- The limitation recites that the unified encoder is trained using a first model that performs a first task and a second model that performs a second task, with the training either alternating between the two models after an epoch or training both each epoch; the remaining limitations were recited before. The limitation amounts to no more than a further limitation to a field of use or technological environment, and thus does not integrate the judicial exception into a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 23 is not patent eligible.
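For illustration only, and not as a construction of the claim or a characterization of any cited reference, the two training schedules recited in claim 23 (alternating between the first model and the second model after an epoch, or training both each epoch) can be sketched as follows. The function name `training_schedule`, the mode strings, and the labels `first_model` and `second_model` are hypothetical and do not appear in the application.

```python
from typing import List


def training_schedule(num_epochs: int, mode: str) -> List[List[str]]:
    """Return, for each epoch, which task models update the shared encoder.

    mode="alternate": switch between the first and second model after each epoch.
    mode="both": train both models in every epoch.
    All names are hypothetical and chosen only for this illustration.
    """
    tasks = ["first_model", "second_model"]
    if mode not in ("alternate", "both"):
        raise ValueError(f"unknown mode: {mode}")

    schedule: List[List[str]] = []
    for epoch in range(num_epochs):
        if mode == "alternate":
            # Even epochs use the first model, odd epochs the second.
            schedule.append([tasks[epoch % 2]])
        else:
            # Both models contribute to the encoder update in every epoch.
            schedule.append(list(tasks))
    return schedule
```

Under this sketch, the "alternate" mode assigns one model per epoch in turn, while the "both" mode lists both models in every epoch.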
Regarding claim 24:
Step 1: The claim is directed to a system, which is directed to a machine. The claim satisfies Step 1.
Step 2A Prong 1:
“The system of claim 5, wherein the unified encoder is trained using mutual transfer learning comprising gradient propagation across orthogonal task-specific parameter spaces” -- The limitation is directed to a unified encoder that is trained using mutual transfer learning comprising gradient propagation across parameter spaces. The limitation is directed to the use of mathematical concepts/calculations, and thus the limitation recites a mathematical concept.
Step 2A Prong 2 and Step 2B:
“and the different tasks comprise commonalities or utilize shared information” -- The limitation recites that the different tasks comprise commonalities or use shared information. The limitation amounts to no more than a further limitation to a field of use or technological environment, and thus does not integrate the judicial exception into a practical application, nor provide significantly more than the judicial exception (see MPEP 2106.05(h)).
Therefore, claim 24 is not patent eligible.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-4, 7-8, 13-18, 21, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over US-10789427-B2 by Shazeer et al. (referred to herein as Shazeer) in view of NPL reference “Multinet: Real-time joint semantic reasoning for autonomous driving.” by Teichmann et al. (referred to herein as Teichmann) in view of US-11562147-B2 by Wang (referred to herein as Wang) further in view of NPL reference “Show and tell: A neural image caption generator.” by Vinyals et al. (referred to herein as Vinyals).
Regarding claim 1, Shazeer teaches:
A computer implemented system for interpreting data using machine learning, comprising a camera for capturing images, a machine, and one or more processors, one or more memories, and one or more computer executable instructions; ([Shazeer, col. 1, lines 40-44] “one innovative aspect … can be embodied in a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement a machine learning model”, AND [Shazeer, col. 8, line 66 to col. 9, line 1] “An image input modality network is a neural network that is configured to deepen a received input image feature depth using one or more convolutional layers”, wherein the examiner interprets one or more computers and one or more storage devices storing instructions to be the same as one or more processors, one or more memories, and computer-executable instructions, and interprets a received input image to be the same as an image captured by a camera).
embedded on the one or more memories, wherein the computer executable instructions are configured to execute: ([Shazeer, col 14 lines 16-20] “The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data.” wherein the examiner interprets “central processing unit for performing or executing instructions and one or more memory devices” to be the same as “memories, wherein the computer executable instructions are configured to execute”.)
each comprising different interpretations of the data, and the different tasks comprise semantic segmentation and generating text data; ([Shazeer, col. 6] “input text segment … to be parsed … an output modality neural network … may be configured to generate a text output” AND [Shazeer, col 5 lines 20-26] “machine learning tasks include speech recognition, image classification, machine translation, or parsing. For example, the multi task multi modal machine learning model 100 may receive text inputs corresponding to a machine translation task, e.g., an input text segment in an input natural language to be translated into a target natural language, or text inputs corresponding to a parsing task, e.g., an input text segment to be parsed.”, wherein the examiner interprets parsing the input segment and generating a text output to be the same as semantic segmentation and generating text data. The examiner interprets “machine learning tasks include speech recognition, image classification, machine translation, or parsing” to be the same as “different interpretations of the data, and the different tasks” because they are both related to different segmentation tasks. The examiner further interprets “… an input text segment in an input natural language to be translated into a target natural language … an input text segment to be parsed” and “an output modality neural network … may be configured to generate a text output” to be the same as “comprise semantic segmentation and generating text data” because they are both directed to multiple “machine learning tasks” that provide different “image classification,” “machine translation,” and “parsing” interpretations of the data and that use an “output modality neural network … configured to generate a text output” as text data corresponding to those different interpretations of the data.)
Shazeer does not teach a unified encoder comprising a neural network encoding data of the images into one or more feature vectors, wherein the unified encoder is trained using machine learning to generate the one or more feature vectors useful for performing, based on the one or more feature vectors, a plurality of different tasks; a plurality of decoders, including a semantic segmentation decoder …, connected to the unified encoder, each of the decoders comprising a neural network interpreting the one or more feature vectors so as to decode one or more of the feature vectors to output one of the interpretations; and wherein the natural language processing decoder generates the text data by operating on a single input that is a final one of the feature vectors received from the unified encoder; and an output to the machine comprising the semantic segmentation and the text data, wherein the text data is used to perform at least one of searching the images captured by the camera, captioning the images, providing navigation or instructions to the machine, or communicating with a human operator of the machine.
Teichmann teaches:
a unified encoder comprising a neural network encoding data of the images into one or more feature vectors, wherein the unified encoder is trained using machine learning to generate the one or more feature vectors useful for performing, based on the one or more feature vectors, a plurality of different tasks ([Teichmann, page 1] “we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks”, AND [Teichmann, page 1] “The encoder is a deep CNN, producing rich features that are shared among all task. Those features are then utilized by task-specific decoders”, wherein the examiner interprets “unified architecture where the encoder is shared” and “rich features … shared among all task” to be the same as a unified encoder that produces feature vectors reused across multiple different tasks).
a plurality of decoders, including a semantic segmentation decoder …, connected to the unified encoder, each of the decoders comprising a neural network interpreting the one or more feature vectors so as to decode one or more of the feature vectors to output one of the interpretations … ([Teichmann, page 1] “This is done by incorporating all three task into a unified encoder-decoder architecture. We name our approach MultiNet … The encoder is a deep CNN, producing rich features that are shared among all task. Those features are then utilized by task-specific decoders”, wherein the examiner interprets the “unified encoder-decoder architecture” with “rich features that are shared among all task” and “task-specific decoders” to be the same as “a plurality of decoders, including a semantic segmentation decoder, … connected to the unified encoder, each of the decoders comprising a neural network interpreting the one or more feature vectors so as to decode one or more of the feature vectors to output one of the interpretations” because they are both directed to a deep CNN encoder that produces shared feature representations (feature vectors) which are fed into multiple task-specific decoder neural networks (for classification, detection, and semantic segmentation), each decoder taking the shared features as input and producing its own task-specific interpretation as output.)
an output to the machine comprising the semantic segmentation … ([Teichmann, page 1, Introduction] “Fig. 1: Our goal: Solving street classification, vehicle detection and road segmentation in one forward pass.” AND [Teichmann, page 1] “The encoder is a deep CNN … which produce their outputs in real-time.”, wherein the examiner interprets “segmentation” done by the “CNN” to “produce their outputs” to be the same as “output to the machine comprising the semantic segmentation”, because both are providing segmentation of the scene into different categories using a machine).
… providing navigation or instructions to the machine; ([Teichmann, page 1] “we argue that computational times are very important in order to enable real-time applications such as autonomous driving.”, wherein the examiner interprets real-time applications such as autonomous driving to be the same as providing navigation or instructions to the machine).
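For illustration only, the arrangement quoted from Teichmann above (one shared encoder whose features are “utilized by task-specific decoders”) can be sketched in a few lines. This is a toy under stated assumptions, not Teichmann’s MultiNet or the claimed system: the encoder is a simple normalization rather than a deep CNN, and the names `shared_encoder`, `classification_decoder`, `segmentation_decoder`, and `interpret` are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def shared_encoder(pixels: List[float]) -> Vector:
    """Toy stand-in for a CNN encoder: scale the inputs by their peak value."""
    peak = max(pixels) or 1.0  # avoid division by zero for an all-zero input
    return [p / peak for p in pixels]


def classification_decoder(features: Vector) -> str:
    """Task-specific decoder 1: one label for the whole input."""
    return "bright" if sum(features) / len(features) > 0.5 else "dark"


def segmentation_decoder(features: Vector) -> List[int]:
    """Task-specific decoder 2: one label per feature (per-pixel mask)."""
    return [1 if f > 0.5 else 0 for f in features]


DECODERS: Dict[str, Callable[[Vector], object]] = {
    "classification": classification_decoder,
    "segmentation": segmentation_decoder,
}


def interpret(pixels: List[float]) -> Dict[str, object]:
    """Encode once, then let every decoder read the same shared features."""
    features = shared_encoder(pixels)
    return {name: decode(features) for name, decode in DECODERS.items()}
```

The point of the sketch is only that the encoder runs once per input and each decoder produces its own interpretation from the same features.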
Shazeer and Teichmann do not teach and the text data, wherein the text data is used to perform at least one of searching the images captured by the camera, captioning the images, … or communicating with a human operator of the machine.
Wang teaches:
and the text data, wherein the text data is used to perform at least one of searching the images captured by the camera, ([Wang, col 2, lines 35-37] “yielding mixed results in tasks, such as VQA, visual reasoning, and image retrieval.”, wherein the examiner interprets image retrieval to be the same as searching the images captured by the camera).
captioning the images; ([Wang, col 16, lines 54-57] “The text data may also include one or more captions 352 relating or corresponding to the image data 350.”, wherein the examiner interprets captions 352 relating … to the image data 350 to be the same as captioning the images).
communicating with a human operator of the machine; ([Wang, col. 3, lines 46-48] “The visual dialogue model 140 can operate with an AI-based machine agent to hold a meaningful dialogue with humans in natural, conversational language about visual content.”, wherein the examiner interprets meaningful dialogue with humans in natural, conversational language about visual content to be the same as communicating with a human operator of the machine).
Shazeer, Teichmann, and Wang do not teach …and a natural language processing decoder … and wherein the natural language processing decoder generates the text data by operating on a single input that is a final one of the feature vectors received from the unified encoder.
Vinyals teaches …and a natural language processing decoder … and wherein the natural language processing decoder generates the text data by operating on a single input that is a final one of the feature vectors received from the unified encoder, ([Vinyals, page 3156] “An ‘encoder’ RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a ‘decoder’ RNN that generates the target sentence.”, [Vinyals, page 3157] “Hence, it is natural to use a CNN as an image ‘encoder’, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences (see Fig. 1). We call this model the Neural Image Caption, or NIC.”, [Vinyals, page 3159] “The image I is only input once” AND [Vinyals, page 3163] “the CNN to extract features that are relevant to horse-looking animals.”, wherein the examiner interprets the “last hidden layer” of the CNN image encoder, which carries the features that the CNN extracts and is provided to the RNN decoder as its only image input (“The image I is only input once”), together with the RNN decoder that “generates sentences,” to be the same as a single final feature vector produced by a unified encoder that is supplied as the sole input to a natural language processing decoder that generates text data.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to computer-implemented, machine-learning systems that interpret image data with neural networks.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system disclosed by Shazeer to include the unified architecture for classification, detection, and segmentation disclosed by Teichmann. One would be motivated to do so to efficiently reuse a single set of learned visual features for multiple perception tasks in real time, as suggested by Teichmann ([Teichmann, page 1] “Our approach is also very efficient, allowing us to perform inference at more then 23 frames per second.”). It would have been further obvious to a person of ordinary skill in the art before the effective filing date of the invention to include the natural language text processing disclosed by Wang. One would be motivated to do so to effectively enable natural-language interaction and text-based search/captioning over camera images produced by the system, as suggested by Wang ([Wang, col. 3, lines 46-48, col. 16, lines 54-57] “The visual dialogue model 140 can operate with an AI-based machine agent to hold a meaningful dialogue with humans in natural, conversational language about visual content. … The text data may also include one or more captions 352 relating or corresponding to the image data 350.”). It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to include the RNN for sentence generation disclosed by Vinyals. One would be motivated to do so to efficiently generate natural-language descriptions directly from the shared image feature vectors produced by the encoder, using a dedicated text decoder, as suggested by Vinyals ([Vinyals, pages 3157 and 3161] “Hence, it is natural to use a CNN as an image ‘encoder’, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences (see Fig. 1). We call this model the Neural Image Caption, or NIC. … We have presented NIC, an end-to-end neural network system that can automatically view an image and generate a reasonable description in plain English.”).
Regarding claim 2, Shazeer, Teichmann, Wang, and Vinyals teach the computer implemented system of claim 1 (see rejection of claim 1).
Wang further teaches wherein the different interpretations comprise at least one of a different classification or a conversion of the data into a different data format. ([Wang, Abstract] “The model generates an encoded visual dialogue input from the unified contextualized representation using visual dialogue encoding layers...”, [Wang, col 15, lines 49-54] “the module 330 generates, from the unified contextualized representation and using a first self-attention mask associated with discriminative settings of the transformer encoder network or a second self-attention mask associated with generative settings of the transformer encoder network, an answer prediction”, wherein the examiner interprets “generative settings of the transformer encoder network, an answer prediction” to be the same as “different interpretations comprising at least one of a different classification or a conversion of the data into a different data format,” as both terms refer to generating outputs that transform the input data into different forms, such as classifications or predictions, to provide an appropriate response or interpretation of the input.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art, because they are all directed to interpreting data using machine learning in a unified encoder-decoder architecture to perform multiple tasks.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “visual dialogue encoding layers” used to generate answer predictions in response to dialogue inputs disclosed by Wang. One would be motivated to do so to effectively improve task-specific processing for interpreting complex multi-modal inputs, as suggested by Wang ([Wang, col 15, lines 49-54] “the module 330 generates, from the unified contextualized representation… generative settings of the transformer encoder network, an answer prediction”).
Regarding claim 3, Shazeer, Teichmann, Wang, and Vinyals teach the computer implemented system of claim 1 (see rejection of claim 1).
Teichmann further teaches wherein the data comprises first image data, and the different interpretations comprise at least one of text data, second image data, or semantic segmentation. ([Teichmann, Abstract] “Towards this goal, we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks” AND [Teichmann, Page 1, Introduction] “In this paper we take an alternative approach and design a network architecture that can very efficiently perform classification, detection and semantic segmentation simultaneously. This is done by incorporating all three tasks into a unified encoder-decoder architecture. We name our approach MultiNet”, wherein the examiner interprets “joint classification, detection and semantic segmentation” to be the same as “different interpretations comprising at least one of text data, second image data, or semantic segmentation,” as both terms describe processing image data and producing multiple types of outputs, including semantic segmentation, which aligns with the concept of generating different interpretations of the data in the claim.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to systems for interpreting data to generate multiple outputs using machine learning techniques.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “unified architecture where the encoder is shared amongst the three tasks” disclosed by Teichmann. One would be motivated to do so to effectively process image data to produce multiple types of outputs, as suggested by Teichmann ([Teichmann, Abstract] “an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks” AND [Teichmann, Page 1, Introduction] “This is done by incorporating all three tasks into a unified encoder-decoder architecture. We name our approach MultiNet”).
Regarding claim 4, Shazeer, Teichmann, Wang, and Vinyals teach the computer implemented system of claim 1 (see rejection of claim 1).
Shazeer further teaches wherein the different tasks comprise image captioning or natural language processing, semantic segmentation, and image reconstruction. ([Shazeer, col 1, line 32] “This specification describes methods and systems, including computer programs encoded on computer storage media, for training a single machine learning model to perform multiple machine learning tasks from different machine learning domains. Example machine learning domains include image recognition, speech recognition, machine translation, image captioning, or parsing.”, wherein the examiner interprets “image recognition, speech recognition, machine translation, image captioning, or parsing” to be the same as “image captioning or natural language processing, semantic segmentation, and image reconstruction,” as both refer to multiple tasks within different machine learning domains such as vision tasks (e.g., image captioning) and text processing tasks (e.g., parsing, which is related to natural language processing), while parsing an image into objects or components would be used to guide a generative process that reconstructs the image.)
Regarding claim 7, Shazeer, Teichmann, Wang, and Vinyals teach the computer implemented system of claim 1 (see rejection of claim 1).
Wang further teaches wherein the unified encoder is modular so that the unified encoder can be transmitted between different ones of the processors and executed or trained on each of the different ones of the processors, and the decoders can be executed on different ones of the processors. ([Wang, col 6, lines 2-15] “In some embodiments, processor 310 and/or memory 320 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 310 and/or memory 320 may be located in one or more data centers and/or cloud computing facilities... machine-readable media that includes executable code that when run by one or more processors (e.g., processor 310) may cause the one or more processors to perform the methods described in further detail herein.”, wherein the examiner interprets “distributed, virtualized, and/or containerized computing resources” and “processor 310 and/or memory 320... located in one or more data centers and/or cloud computing facilities” to be the same as “the system comprises a distributed network of the processors,” as both are directed to systems enabling processing across multiple physical or virtual computing environments. Additionally, the examiner interprets “machine-readable media that includes executable code that when run by one or more processors... may cause the one or more processors to perform the methods described in further detail herein” to be the same as “the unified encoder is modular so that the unified encoder can be transmitted between different ones of the processors and executed or trained on each of the different ones of the processors,” as both describe mechanisms allowing functionality to be deployed across different processors.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to systems for deploying and executing machine learning models across distributed computing environments.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the use of “distributed, virtualized, and/or containerized computing resources” disclosed by Wang. One would be motivated to do so to efficiently enable scalable training and execution of the unified encoder across distributed processors, as suggested by Wang ([Wang, col 6, lines 2-15] “processor 310 and/or memory 320 may include distributed, virtualized, and/or containerized computing resources”).
Regarding claim 13, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 1 (see rejection of claim 1).
Teichmann further teaches wherein the different tasks comprise terrain classification ([Teichmann, Abstract] “Towards this goal, we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks” AND [Teichmann, Page 1, Introduction] “In this paper we take an alternative approach and design a network architecture that can very efficiently perform classification, detection and semantic segmentation simultaneously. This is done by incorporating all three tasks into a unified encoder-decoder architecture. We name our approach MultiNet”, wherein the examiner interprets “semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks” to be the same as “terrain classification,” as semantic segmentation can divide an image into different terrains, such as grass, forest, crop, or body of water, and “joint classification, detection and semantic segmentation” to be the same as “different interpretations comprising at least one of text data, second image data, or semantic segmentation,” as both terms describe processing image data and producing multiple types of outputs, including semantic segmentation, which aligns with the concept of generating different interpretations of the data in the claim.)
Wang further teaches and image captioning. ([Wang, col 2, lines 56-62] “the subject technology can first encode the image input into a series of detected objects and feed them into a unified transformer encoder together with a corresponding image caption and multi-turn dialogue history.”, wherein the examiner interprets “encode the image input into a series of detected objects and feed them into a unified transformer encoder together with a corresponding image caption and multi-turn dialogue history” to be the same as “image captioning,” as both are directed to systems that process multi-task functionalities using a shared encoding approach.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to systems that use machine learning models to perform multiple tasks such as terrain classification and image captioning using shared encoder architectures.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1, which uses the “approach to joint classification, detection and semantic segmentation using a unified architecture” disclosed by Teichmann, to include the process of encoding “the image input into a series of detected objects and … a corresponding image caption” disclosed by Wang. One would be motivated to do so to effectively feed the objects detected in the image, together with the corresponding captions, into a unified encoder, as suggested by Wang ([Wang, col 2, lines 56-62] “the subject technology can first encode the image input into a series of detected objects and feed them into a unified transformer encoder together with a corresponding image caption and multi-turn dialogue history.”).
Regarding claim 14, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 1 (see rejection of claim 1).
Teichmann further teaches wherein the unified encoder comprises a RES NET neural network, an Xception neural network, or a MobileNet neural network. ([Teichmann, page 3, sec 3.1] “We perform experiments using versions of VGG16 [58] and ResNet [22] architectures. Our first VGG encoder uses all convolutional and pooling layers of VGG16”, wherein the examiner interprets “ResNet architectures” to be the same as “RES NET neural network,” as both are directed to unified architectures/encoders comprising a RES NET neural network.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to systems using neural network architectures to encode data for machine learning tasks.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “ResNet architectures” disclosed by Teichmann. One would be motivated to do so to implement the unified encoder with a RES NET neural network architecture that Teichmann reports testing experimentally, as suggested by Teichmann ([Teichmann, page 3, sec 3.1] “We perform experiments using versions of VGG16 [58] and ResNet [22] architectures.”).
Regarding claim 15, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 1 (see rejection of claim 1).
Teichmann further teaches further comprising a machine coupled to or including one of the processors, wherein the machine comprises a vehicle, a spacecraft, a weapon, an aircraft, a robot, a medical device, an imaging device or camera, a rover, a sensor, an actuator, an intelligent agent, or a smart device in one or more smart buildings, wherein the machine utilizes one or more of the interpretations for operation of the machine. ([Teichmann, page 1, Abstract] “applications such as autonomous driving. Towards this goal, we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks … Fig. 1: Our goal: Solving street classification, vehicle detection and road segmentation in one forward pass”, wherein the examiner interprets “applications such as autonomous driving” and “solving street classification, vehicle detection and road segmentation in one forward pass” to be the same as “wherein the machine comprises a vehicle” and “utilizes one or more of the interpretations for operation of the machine,” as both are directed to using interpretations, such as street classification and vehicle detection, for the operation of a vehicle.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to systems using machine learning models to perform tasks enabling the operation of various machines.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “applications such as autonomous driving” disclosed by Teichmann. One would be motivated to do so to effectively enable machine learning systems to support real-time operation and decision-making for machines, such as vehicles, as suggested by Teichmann ([Teichmann, page 1, Abstract] “Towards this goal, we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks…street classification, vehicle detection and road segmentation”).
Regarding claim 16, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 1 (see rejection of claim 1).
Teichmann further teaches further comprising an apparatus coupled to or including one of the processors and utilizing the interpretations for operation of the apparatus, wherein the apparatus comprises at least one machine selected from a machine performing automated manufacturing, devices controlled by a control system, one or more devices used in banking, one or more devices supplying power or controlling power distribution, or one or more devices in an automotive or aerospace system. ([Teichmann, page 1, Abstract] “While most approaches to semantic reasoning have focused on improving performance, in this paper we argue that computational times are very important in order to enable real time applications such as autonomous driving.”, wherein the examiner interprets “real time applications such as autonomous driving” to be the same as “one or more devices in an automotive ... system,” as both are directed to apparatuses that utilize interpretations for automotive operation.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to systems that use machine learning models to perform real-time tasks for controlling apparatuses, such as automotive or aerospace systems.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “real time applications such as autonomous driving” disclosed by Teichmann. One would be motivated to do so to effectively enable automatic, real-time control and operation of apparatuses using machine learning interpretations, as suggested by Teichmann ([Teichmann, page 1, Abstract] “computational times are very important in order to enable real time applications such as autonomous driving.”).
Regarding claim 17, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 15 (see rejection of claim 15).
Teichmann further teaches comprising a control system actuating motion of the machine in response to the interpretations. ([Teichmann, page 1, Abstract] “While most approaches to semantic reasoning have focused on improving performance, in this paper we argue that computational times are very important in order to enable real time applications such as autonomous driving.”, wherein the examiner interprets “real time applications such as autonomous driving” to be the same as “a control system actuating motion of the machine in response to the interpretations,” as both are directed to systems where the interpretations are used to control and actuate the motion of a machine, such as a vehicle.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to systems that process data to control the motion of machines based on machine learning interpretations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “real time applications such as autonomous driving” disclosed by Teichmann. One would be motivated to do so to effectively enable control systems to actuate machine motion in real-time based on data-driven interpretations, as suggested by Teichmann ([Teichmann, page 1, Abstract] “computational times are very important in order to enable real time applications such as autonomous driving.”).
Regarding claim 18, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 1 (see rejection of claim 1).
Wang further teaches further comprising a display displaying the interpretations and a camera for capturing the data. ([Wang, col 2, lines 40-44] “Specifically, each image in a visual dialogue dataset is associated with up to 10 dialogue turns, which contains much longer contexts than either VQA or image captioning.”, wherein the examiner interprets “each image in a visual dialogue dataset” to be the same as “a camera for capturing the data” and “associated with up to 10 dialogue turns” to be the same as “a display displaying the interpretations,” as both are directed to systems that involve capturing visual data and presenting interpretations or interactions based on that data.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to systems that involve capturing visual data and displaying interpretations to enable user interaction or decision-making.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “each image in a visual dialogue dataset is associated with up to 10 dialogue turns” disclosed by Wang. One would be motivated to do so to effectively capture and present visual data alongside contextual interpretations for enhanced user interaction, as suggested by Wang ([Wang, col 2, lines 40-44] “Specifically, each image in a visual dialogue dataset is associated with up to 10 dialogue turns, which contains much longer contexts than either VQA or image captioning.”).
Regarding claim 21, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 1 (see rejection of claim 1).
Teichmann further teaches wherein the machine comprises a vehicle, a spacecraft, a weapon, an aircraft, a robot, a medical device, a machine performing automated manufacturing, one or more devices used in banking, one or more devices supplying power or controlling power distribution, or a smart device in one or more smart buildings. ([Teichmann, page 1] “computational times are very important in order to enable real time applications such as autonomous driving.”, AND [Teichmann, page 1] “Current advances in the field of computer vision have made clear that visual perception is going to play a key role in the development of self-driving cars … Figure 1: Our goal: Solving street classification, vehicle detection and road segmentation in one forward pass.”, wherein the examiner interprets “applications such as autonomous driving” and “self-driving cars … solving street classification, vehicle detection and road segmentation in one forward pass” to be the same as “wherein the machine comprises a vehicle …” because they are both directed to a machine that is specifically a vehicle (a self-driving car) that uses the interpreted outputs (street classification, vehicle detection, road segmentation) for its operation. The examiner further notes that the claim recites a group (“a vehicle, a spacecraft, a weapon, an aircraft, a robot, a medical device, a machine performing automated manufacturing, one or more devices used in banking, one or more devices supplying power or controlling power distribution, or a smart device in one or more smart buildings”), and that teaching one member of this group (a vehicle) is sufficient to meet the limitation as written, since the prior art system implemented on a vehicle falls within the claimed set of machine types.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to computer-implemented neural-network systems that interpret image or sensor data to control or assist operation of machines (including vehicles and other automated devices).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “applications such as autonomous driving” disclosed by Teichmann. One would be motivated to do so to effectively adapt the general perception and interpretation system for deployment on concrete autonomous machines such as vehicles that require real-time visual perception for safe operation, as suggested by Teichmann ([Teichmann, page 1] “computational times are very important in order to enable real time applications such as autonomous driving.”).
Regarding claim 23, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 1 (see rejection of claim 1).
Teichmann further teaches:
wherein the unified encoder is trained using the machine learning ([Teichmann, page 1] “Our approach is very simple, can be trained end-to-end…” AND [Teichmann, page 4] “MultiNet training follows a classic fine-tuning pipeline.” wherein the examiner interprets “can be trained end-to-end” and “MultiNet training follows a classic fine-tuning pipeline” to be the same as “the unified encoder is trained using the machine learning” because they are both directed to training the shared encoder of the network using machine-learning optimization procedures.)
comprising a first model for performing a first one of the different tasks and a second model for performing a second one of the different tasks, ([Teichmann, page 3] “Our approach shares a common encoder over the three tasks and has three branches, each implementing a decoder for a given task.”, wherein the examiner interprets “three branches, each implementing a decoder for a given task” to be the same as “a first model for performing a first one of the different tasks and a second model for performing a second one of the different tasks” because they are both directed to multiple task-specific neural-network models (decoders) that share a unified encoder but each performs a different task.)
and the training of the unified encoder alternates between the first model and the second model after an epoch ([Teichmann, page 5] “we fine-tune the encoder using just one of the three losses segmentation, detection and classification and compare their performance…”, wherein the examiner interprets “fine-tune the encoder using just one of the three losses” to be the same as “the training of the unified encoder alternates between the first model and the second model after an epoch” because both describe training the shared encoder with one task’s loss (one model) at a time in separate phases, which corresponds to alternating task-specific training over epochs.)
or trains both methods each epoch. ([Teichmann, page 4] “Our joint training implementation computes the forward passes for examples corresponding to each of the three tasks independently. The gradients are only added during the back-propagation steps.” AND [Teichmann, page 5] “In the second part we compare joint training of all three decoders with individual inference…”, wherein the examiner interprets “joint training implementation [that] computes the forward passes for examples corresponding to each of the three tasks independently” with gradients added during back-propagation and “joint training of all three decoders” to be the same as “trains both methods each epoch” because both describe a process in which examples from multiple tasks (multiple models) are processed within the same training schedule and their gradients are combined to update the shared encoder during each training epoch.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to multi-task neural-network systems that use a shared encoder trained with machine learning to support multiple different task-specific models or decoders.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “our approach shares a common encoder over the three tasks and has three branches, each implementing a decoder for a given task” disclosed by Teichmann. One would be motivated to do so to efficiently improve the shared encoder’s performance by alternately or jointly training it with losses from multiple task-specific models, as suggested by Teichmann ([Teichmann, page 3] “our approach shares a common encoder over the three tasks and has three branches, each implementing a decoder for a given task”).
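For illustration only, the two training regimes recited in claim 23 (alternating between models after an epoch, or training both each epoch) can be sketched as epoch-level schedules. This is a minimal sketch and is not code from Teichmann or from the instant application; the function names and the task labels are hypothetical.

```python
from typing import List


def alternating_schedule(tasks: List[str], num_epochs: int) -> List[List[str]]:
    """One task's loss updates the shared encoder per epoch, rotating each epoch."""
    return [[tasks[epoch % len(tasks)]] for epoch in range(num_epochs)]


def joint_schedule(tasks: List[str], num_epochs: int) -> List[List[str]]:
    """Every task's loss contributes to the shared encoder in every epoch."""
    return [list(tasks) for _ in range(num_epochs)]


# Hypothetical task labels, used only to show the shape of each schedule.
tasks = ["segmentation", "text_generation"]
print(alternating_schedule(tasks, 4))
print(joint_schedule(tasks, 2))
```

Under the alternating schedule each epoch lists exactly one task, while under the joint schedule each epoch lists all of them.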
Claims 5 and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Wang in view of Teichmann further in view of Vinyals.
Regarding claim 5, Wang teaches:
A computer implemented system for interpreting data using machine learning, comprising: an application-specific integrated circuit (ASIC) for an artificial neural network, the ASIC comprising one or more processors and one or more memories configured to execute: ([Wang, page 6] “processor 310 may additionally or alternately include … an application specific integrated circuit (ASIC) … that provides accelerated performance when evaluating neural-network models.”, wherein the examiner interprets an ASIC that accelerates evaluation of neural-network models to be the same as the claimed ASIC that houses processors and memories configured to execute the artificial neural network).
Wang does not teach a unified encoder comprising a neural network encoding data into one or more feature vectors, wherein the unified encoder is trained using machine learning to generate the one or more feature vectors useful for performing a plurality of different tasks each comprising different interpretations of the data; and a plurality of decoders including a semantic segmentation decoder … connected to the unified encoder, each of the decoders comprising a neural network interpreting the one or more feature vectors so as to decode one or more of the feature vectors to output one of the interpretations, … and a natural language processing decoder … wherein the natural language processing decoder generates text data by operating on a single input that is a final one of the feature vectors from the unified encoder.
Teichmann teaches:
a unified encoder comprising a neural network encoding data into one or more feature vectors, wherein the unified encoder is trained using machine learning to generate the one or more feature vectors useful for performing a plurality of different tasks each comprising different interpretations of the data; ([Teichmann, page 1] “we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks”, AND [Teichmann, page 1] “The encoder is a deep CNN, producing rich features that are shared among all task. Those features are then utilized by task-specific decoders”, wherein the examiner interprets the “unified architecture where the encoder is shared amongst the three tasks” and “encoder … deep CNN, producing rich features that are shared among all task” to be the same as “a unified encoder comprising a neural network encoding data into one or more feature vectors, wherein the unified encoder is trained using machine learning to generate the one or more feature vectors,” because a “deep CNN” is a machine-learned neural network that processes input data into “rich features” (feature vectors) that are shared. The examiner further interprets the “joint classification, detection and semantic segmentation” performed by “task-specific decoders” operating on those shared rich features to be the same as “a plurality of different tasks each comprising different interpretations of the data,” because classification, detection, and semantic segmentation are three different interpretation tasks applied to the same encoded data using the shared feature vectors.)
and a plurality of decoders including a semantic segmentation decoder … connected to the unified encoder, each of the decoders comprising a neural network interpreting the one or more feature vectors so as to decode one or more of the feature vectors to output one of the interpretations, ([Teichmann, page 1] “This is done by incorporating all three task into a unified encoder-decoder architecture. We name our approach MultiNet … The encoder is a deep CNN, producing rich features that are shared among all task. Those features are then utilized by task-specific decoders”, wherein the examiner interprets the “unified encoder-decoder architecture” with “rich features that are shared among all task” and “task-specific decoders” to be the same as “a plurality of decoders, including a semantic segmentation decoder, … connected to the unified encoder, each of the decoders comprising a neural network interpreting the one or more feature vectors so as to decode one or more of the feature vectors to output one of the interpretations” because they are both directed to a deep CNN encoder that produces shared feature representations (feature vectors) which are fed into multiple task-specific decoder neural networks (for classification, detection, and semantic segmentation), each decoder taking the shared features as input and producing its own task-specific interpretation as output.)
Wang and Teichmann do not teach … and a natural language processing decoder … wherein the natural language processing decoder generates text data by operating on a single input that is a final one of the feature vectors from the unified encoder.
Vinyals teaches … and a natural language processing decoder … wherein the natural language processing decoder generates text data by operating on a single input that is a final one of the feature vectors from the unified encoder. ([Vinyals, page 3156] “An ‘encoder’ RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a ‘decoder’ RNN that generates the target sentence.”, [Vinyals, page 3157] “Hence, it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences (see Fig. 1). We call this model the Neural Image Caption, or NIC.”, [Vinyals, page 3159] “The image I is only input once” AND [Vinyals, page 3163] “the CNN to extract features that are relevant to horse-looking animals.”, wherein the examiner interprets “last hidden layer” of the CNN image encoder, which is provided as the only image input (“image I is only input once”) to “the CNN to extract features”, and RNN decoder that “generates sentences,” to be the same as a single final feature vector produced by a unified encoder that is supplied as the sole input to a natural language processing decoder that generates text data.)
Wang, Teichmann, Vinyals, and the instant application are analogous art because they are all directed to neural-network systems that use shared encoders and task-specific decoders to interpret image data.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the ASIC disclosed by Wang to include the unified architecture for classification, detection, and segmentation disclosed by Teichmann. One would be motivated to do so to efficiently reuse a single set of encoder features for multiple perception tasks while achieving real-time performance, as suggested by Teichmann ([Teichmann, page 1] “Our approach is also very efficient, allowing us to perform inference at more then 23 frames per second.”). It would have been further obvious to a person of ordinary skill in the art before the effective filing date of the invention to include the RNN for sentence generation disclosed by Vinyals. One would be motivated to do so to effectively generate natural-language descriptions of camera images directly from the encoder’s feature vectors, as suggested by Vinyals ([Vinyals, page 3161] “We have presented NIC, an end-to-end neural network system that can automatically view an image and generate a reasonable description in plain English.”).
Regarding claim 24, Wang, Teichmann, and Vinyals teach the system of claim 5 (see rejection of claim 5).
Teichmann further teaches wherein the unified encoder ([Teichmann, page 3] “Our approach shares a common encoder over the three tasks and has three branches, each implementing a decoder for a given task.”, wherein the examiner interprets “common encoder over the three tasks” to be the same as “the unified encoder” because both describe a single shared encoder network whose parameters are reused across multiple tasks.)
is trained using mutual transfer learning ([Teichmann, page 4] “Our joint training implementation computes the forward passes for examples corresponding to each of the three tasks independently. The gradients are only added during the back-propagation steps. This has the practical advantage that we are able to use different training parameters for each decoder.”, wherein the examiner interprets “joint training implementation” with gradients from multiple tasks being added during back-propagation to be the same as “mutual transfer learning” because both describe a regime where learning signals from different tasks are shared through a common encoder so that training on one task benefits the others via the shared parameters.)
comprising gradient propagation across orthogonal task-specific parameter spaces ([Teichmann, page 4] “computes the forward passes for examples corresponding to each of the three tasks independently. The gradients are only added during the back-propagation steps. This has the practical advantage that we are able to use different training parameters for each decoder.”, wherein the examiner interprets performing forward passes “for each of the three tasks independently” with “different training parameters for each decoder” and then adding gradients during back-propagation to be the same as “gradient propagation across orthogonal task-specific parameter spaces” because each decoder has its own distinct (task-specific) parameter space while the shared encoder receives gradient contributions from all tasks, so gradients propagate from separate task-specific parameter areas into the common encoder, transferring information across tasks.)
and the different tasks comprise commonalities or utilize shared information ([Teichmann, page 1] “we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks” AND [Teichmann, page 1] “The encoder is a deep CNN, producing rich features that are shared among all task. Those features are then utilized by task-specific decoders.”, wherein the examiner interprets “encoder … shared amongst the three tasks” and “rich features that are shared among all task … utilized by task-specific decoders” to be the same as “the different tasks comprise commonalities or utilize shared information” because both describe multiple tasks (classification, detection, semantic segmentation) that rely on common, shared feature representations produced by the unified encoder, i.e., the tasks share information via the encoder’s features and thus comprise commonalities in their learned representations.)
Wang, Teichmann, Vinyals, and the instant application are analogous art because they are all directed to multi-task neural-network systems in which a shared (unified) encoder is trained using learning signals from multiple related tasks so that the tasks share information via common learned feature representations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 5 disclosed by Wang, Teichmann, and Vinyals to include the “our joint training implementation computes the forward passes for examples corresponding to each of the three tasks independently. The gradients are only added during the back-propagation steps. This has the practical advantage that we are able to use different training parameters for each decoder” disclosed by Teichmann. One would be motivated to do so to efficiently implement mutual transfer learning with gradient propagation across task-specific parameter spaces so that related tasks can exploit shared information in the unified encoder, as suggested by Teichmann ([Teichmann, page 4] “our joint training implementation computes the forward passes for examples corresponding to each of the three tasks independently. The gradients are only added during the back-propagation steps.”).
Claims 8-11 are rejected under 35 U.S.C. 103 as being unpatentable over Shazeer in view of Teichmann in view of Wang in view of Vinyals further in view of US11010902B2 by Bagci et al. (referred to herein as Bagci).
Regarding claim 8, Shazeer, Teichmann, Wang, and Vinyals teach the system of claim 1 (see rejection of claim 1).
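For illustration only, the joint-training behavior quoted from Teichmann (independent forward passes per task, gradients added at the shared encoder, separate training parameters per decoder) can be sketched with a toy scalar model. This is a minimal sketch under stated assumptions and is not Teichmann's implementation: the scalar encoder weight `w`, the scalar decoder weights, the squared-error loss, and all function names are hypothetical.

```python
from typing import List, Tuple


def task_gradients(w: float, a: float, x: float, y: float) -> Tuple[float, float]:
    """Gradients of 0.5 * (a * w * x - y)**2 with respect to the shared
    encoder weight w and the task-specific decoder weight a."""
    err = a * w * x - y
    return err * a * x, err * w * x


def joint_step(w: float, decoders: List[float], x: float, targets: List[float],
               lr_encoder: float, lr_decoders: List[float]) -> Tuple[float, List[float]]:
    """One joint update: each task runs its own forward pass, the encoder
    gradients are summed, and each decoder uses its own learning rate."""
    grad_w = 0.0
    new_decoders = []
    for a, y, lr in zip(decoders, targets, lr_decoders):
        g_w, g_a = task_gradients(w, a, x, y)
        grad_w += g_w                      # contributions from all tasks are added
        new_decoders.append(a - lr * g_a)  # decoder sees only its own task's gradient
    return w - lr_encoder * grad_w, new_decoders


new_w, new_decoders = joint_step(1.0, [1.0, 2.0], 1.0, [0.0, 0.0], 0.1, [0.1, 0.5])
print(new_w, new_decoders)
```

In this toy setting each decoder weight is updated only by its own task's gradient, while the shared encoder weight receives the sum of both tasks' gradients.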
Teichmann further teaches:
one of the decoders comprises a semantic segmentation decoder executing a decoder convolution layer and deconvolution layers, the semantic segmentation decoder: receives the intermediate feature vector and passing the intermediate feature vector through the decoder convolution layer to form a first output; receives the final feature vector and passing the final feature vector through one a first one of the deconvolution layers to form a second output; ([Teichmann, page 3-4, sec 3.2] “By increasing the input size of our image to 1248 × 384, we effectively apply our feature generator to each spatial location of the image [51] [35]. The result is a grid of 39 × 12 features, each corresponding to a spatial region … The segmentation decoder follows the main ideas of the FCN architecture [35]. Given the features produced by the encoder, we produce a low-resolution segmentation of size 39 × 12 using a 1 × 1 convolutional layer. This output is then upsampled using three transposed convolution layers.”, wherein the examiner interprets “the segmentation decoder follows the main ideas of the FCN architecture... given the features produced by the encoder, we produce a low-resolution segmentation of size 39 × 12 using a 1 × 1 convolutional layer” to be the same as “receives the intermediate feature vector and passing the intermediate feature vector through the decoder convolution layer to form a first output,” as both describe initial processing of encoder-generated features by a convolution layer of a segmentation decoder. Additionally, “this output is then upsampled using three transposed convolution layers” is interpreted to be the same as “receives the final feature vector and passing the final feature vector through one a first one of the deconvolution layers to form a second output,” as both describe refining outputs through deconvolution (a.k.a. transposed convolution) layers.)
concatenates the first output and the second output to form a combined output; and passes the combined output through at least a second one of the deconvolution layers to form the one of the interpretations. ([Teichmann, page 4, sec 3.3] “Thus it can be implemented inside the CNN pooling. The result is an end-to-end trainable system which is faster. The features pooled by the RoI align are concatenated with the initial prediction and used to produce a more accurate prediction. The second prediction is modeled as offset, its output is added to the initial prediction.”, wherein the examiner interprets “the features pooled by the RoI align are concatenated with the initial prediction” to be the same as “concatenates the first output and the second output to form a combined output,” as both describe combining intermediate and subsequent results into a unified representation using concatenation. Additionally, “used to produce a more accurate prediction” and “its output is added to the initial prediction” is interpreted to be the same as “passes the combined output through at least a second one of the deconvolution layers to form the one of the interpretations,” as both describe refining predictions through further processing to generate a final output.)
Shazeer, Teichmann, Wang, Vinyals, and the instant application are analogous art because they are all directed to NN based systems that perform semantic segmentation using encoder-generated feature maps and decoder architectures with convolution and deconvolution (upsampling) layers to produce refined segmentation outputs.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “segmentation decoder follows the main ideas of the FCN architecture…Given the features produced by the encoder, we produce a low-resolution segmentation… This output is then upsampled using three transposed convolution layers” disclosed by Teichmann. One would be motivated to do so to effectively improve semantic segmentation accuracy and spatial resolution by combining intermediate and upsampled feature outputs in a decoder with convolution and deconvolution layers, as suggested by Teichmann ([Teichmann, page 3-4] “Given the features produced by the encoder, we produce a low-resolution segmentation… This output is then upsampled using three transposed convolution layers.”).
Shazeer, Teichmann, Wang, and Vinyals do not teach wherein: the data comprises an image; the unified encoder executes a plurality of encoder convolution layers so as to output a first one of the feature vectors comprising an intermediate feature vector after a first plurality of the convolution layers and a final feature vector after all the convolution layers.
Bagci teaches wherein: the data comprises an image; the unified encoder executes a plurality of encoder convolution layers so as to output a first one of the feature vectors comprising an intermediate feature vector after a first plurality of the convolution layers and a final feature vector after all the convolution layers; ([Bagci, col 6, line 31-38] “As illustrated in FIG. 2, the input to the SegCaps network is a 512x512-pixel image, which is depicted in FIG. 2 as a slice of a CT Scan. This image is passed through a 2D convolutional layer which produces 16 feature maps of the same spatial dimensions. This output forms the first set of capsules, with a single capsule type represented by a grid of 512x512 capsules, each of which is a 16-dimensional vector. This is then followed by the first convolutional capsule layer.”, wherein the examiner interprets “the input to the SegCaps network is a 512x512-pixel image” to be the same as “the data comprises an image,” as both describe image data being processed by the encoder. Additionally, “the 2D convolutional layer which produces 16 feature maps of the same spatial dimensions” is interpreted to be the same as “the unified encoder executes a plurality of encoder convolution layers to output a first one of the feature vectors comprising an intermediate feature vector,” as both involve convolutional processing resulting in an intermediate representation. Finally, “this is then followed by the first convolutional capsule layer” is interpreted to be the same as “a final feature vector after all the convolution layers,” as both describe subsequent processing leading to a final output representation.)
Shazeer, Teichmann, Wang, Vinyals, Bagci, and the instant application are analogous art because they are all directed to systems for processing image data through encoders and decoders to generate task-specific outputs.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “2D convolutional layer which produces 16 feature maps of the same spatial dimensions” disclosed by Bagci, and the “low-resolution segmentation of size 39 × 12 using a 1 × 1 convolutional layer” disclosed by Teichmann. One would be motivated to do so to efficiently generate intermediate feature vectors for processing by downstream decoders, as suggested by Bagci ([Bagci, col 6, lines 31-38] “This image is passed through a 2D convolutional layer which produces 16 feature maps of the same spatial dimensions.”).
Regarding claim 9, Shazeer, Teichmann, Wang, Vinyals, and Bagci teach The system of claim 8, (see rejection of claim 8).
Wang further teaches:
wherein another one of the decoders comprises a natural language processing decoder: receives only the final feature vector; flattens the final feature vector to form a flattened feature vector; passes the flattened feature vector through a fully connected layer to reduce a number of dimensions and form a reduced dimension feature vector; concatenates the reduced dimension feature vector with a previous hidden state and previous hidden word, if necessary; to form a concatenated layer. ([Wang, col 5, lines 33-38] “For example, the encoder 332 may receive from the input sequence module 336 an output, e.g., an encoded vector representation of a concatenation of the image 350, the image caption 352 and the dialogue history 354.”, wherein the examiner interprets “an encoded vector representation of a concatenation of the image 350, the image caption 352 and the dialogue history 354” to be the same as “concatenates the reduced dimension feature vector with a previous hidden state and previous hidden word, if necessary; to form a concatenated layer,” as both describe combining multiple sources of data, including contextual or historical elements, into a unified, concatenated, representation for further processing.)
inputs the concatenated layer to a bidirectional GRU layer to form a GRU output; and passes the GRU output through at least one fully connected layer so as to reduce a dimensions and form another of the interpretations comprising a word output. ([Wang, col 8, lines 7-27] “the structure 400 can adopt visually-grounded MLM and NSP learning objectives to train the unified transformer encoder 450 for effective vision and dialogue fusion using two types of self-attention masks (e.g., 334). The unified transformer encoder 450 may employ bidirectional and sequence-to-sequence (or referred to as “seq2seq”) self-attention masks for the discriminative and generative settings, respectively. For example, in the discriminative settings, all of the utilities (e.g., image input 410 (depicted as “I”), dialog history 420 (depicted as “H,”), user question 430 (depicted as “Q,”) and answer option (depicted as “A,”)) are not masked (denoted by non-patterned shape), and thus, all are available for attention processing. In the generative settings, all with the exclusion of the answer option are not masked and available for attention processing. In this regard, the answer option is masked using seq2seq self-attention masks. The outputs are further optimized with a ranking optimization module to further fine-tune on the dense annotations.”, wherein the examiner interprets “the unified transformer encoder 450 may employ bidirectional and sequence-to-sequence (or referred to as “seq2seq”) self-attention masks” to be the same as “inputs the concatenated layer to a bidirectional GRU layer to form a GRU output,” as both describe bidirectional processing of input data to generate an output (GRU is a form of bidirectional processing). 
Additionally, “the outputs are further optimized with a ranking optimization module to further fine-tune on the dense annotations” is interpreted to be the same as “passes the GRU output through at least one fully connected layer so as to reduce a dimensions and form another of the interpretations comprising a word output,” as both describe further processing to refine the final output representation (either word output or annotations).)
Shazeer, Teichmann, Wang, Vinyals, Bagci, and the instant application are analogous art because they are all directed to systems for processing data through encoders and decoders to generate outputs for tasks such as natural language processing or image processing.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 8 disclosed by Shazeer, Teichmann, Wang, Vinyals, and Bagci to include the “encoded vector representation of a concatenation of the image 350, the image caption 352 and the dialogue history 354” disclosed by Wang. One would be motivated to do so to effectively combine image and contextual data to improve downstream processing and generate richer outputs, as suggested by Wang ([Wang, col 5, lines 33-38] “the encoder 332 may receive from the input sequence module 336 an output, e.g., an encoded vector representation of a concatenation of the image 350, the image caption 352 and the dialogue history 354.”).
Regarding claim 10, Shazeer, Teichmann, Wang, Vinyals, and Bagci teach The system of claim 9, (see rejection of claim 9).
Teichmann further teaches wherein another one of the decoders comprises an image reconstruction decoder: successively deconvolutes the final feature vector through a plurality of deconvolution layers so as to reconstruct the data comprising an image. ([Teichmann, page 4, sec 3.4] “The segmentation decoder follows the main ideas of the FCN architecture [35]. Given the features produced by the encoder, we produce a low resolution segmentation of size 39 × 12 using a 1 × 1 convolutional layer. This output is then upsampled using three transposed convolution layers [9]. Skip connections are utilized to extract high resolution features from the lower layers. Those features are first processed by a 1 × 1 convolution layer and then added to the partially upsampled results.”, wherein the examiner interprets “this output is then upsampled using three transposed convolution layers” to be the same as “successively deconvolutes the final feature vector through a plurality of deconvolution layers,” as both describe using transposed convolutions (also known as deconvolutions) to process outputs and achieve reconstruction. Additionally, the examiner interprets “the segmentation decoder follows the main ideas of the FCN architecture” and “added to the partially upsampled results” to be the same as “so as to reconstruct the data comprising an image,” as both describe using segmentation techniques and upsampling to generate a final reconstructed output. )
Shazeer, Teichmann, Wang, Vinyals, Bagci, and the instant application are analogous art because they are all directed to systems for processing data through encoders and decoders to reconstruct outputs such as images.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 9 disclosed by Shazeer, Teichmann, Wang, Vinyals, and Bagci to include the process where “this output is then upsampled using three transposed convolution layers” disclosed by Teichmann. One would be motivated to do so to efficiently reconstruct high-resolution images from encoded features, as suggested by Teichmann ([Teichmann, page 4, sec 3.4] “Skip connections are utilized to extract high resolution features from the lower layers. Those features are first processed by a 1 × 1 convolution layer and then added to the partially upsampled results.”).
Regarding claim 11, Shazeer, Teichmann, Wang, Vinyals, and Bagci teach The system of claim 10, (see rejection of claim 10).
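For illustration of the upsampling operation discussed above, the following is a minimal sketch of stacked transposed convolutions. It is not taken from Teichmann: the 1-D signal, the kernel values, the stride of 2, and the function names transposed_conv1d and upsample are all assumptions chosen only to show how three such layers successively enlarge a coarse feature representation.

```python
def transposed_conv1d(signal, kernel, stride=2):
    """Upsample a 1-D signal with a transposed convolution (no padding).

    Each input value is multiplied by the kernel and written into the
    output at an offset of i * stride; overlapping writes are summed.
    """
    out_len = (len(signal) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


def upsample(features, kernels, stride=2):
    """Apply one transposed convolution per kernel, in sequence."""
    for kernel in kernels:
        features = transposed_conv1d(features, kernel, stride)
    return features


# Hypothetical coarse features and three identical (assumed) kernels.
coarse = [1.0, 2.0, 3.0]
kernels = [[1.0, 1.0]] * 3
restored = upsample(coarse, kernels)
print(len(coarse), "->", len(restored))
```

With a length-2 kernel and a stride of 2, each layer doubles the length of its input, so three layers enlarge the assumed 3-element input eightfold.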
Wang further teaches wherein hidden layers in the image reconstruction decoder and the semantic segmentation decoder are equipped with RELU activation. ([Wang, col 9, lines 14-18] “Lastly, visual features with its position features and segment identifier are mapped to an embedding with the same dimension separately via a two-layer liner layer with ReLU activation and further combined with layer normalization.”, wherein the examiner interprets “a two-layer liner layer with ReLU activation” to be the same as “hidden layers in the image reconstruction decoder and the semantic segmentation decoder are equipped with RELU activation,” as both describe layers within the network that incorporate ReLU activation functions to process and transform input data. )
Shazeer, Teichmann, Wang, Vinyals, Bagci, and the instant application are analogous art because they are all directed to systems for processing data in neural networks with activation functions to enhance performance.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 10 disclosed by Shazeer, Teichmann, Wang, Vinyals, and Bagci to include the “two-layer liner layer with ReLU activation” disclosed by Wang. One would be motivated to do so to effectively enhance the image segmentation and reconstruction, as suggested by Wang ([Wang, col 9, lines 14-18] “visual features with its position features and segment identifier are mapped to an embedding with the same dimension separately via a two-layer liner layer with ReLU activation and further combined with layer normalization.”).
Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Shazeer in view of Teichmann in view of Wang in view of Vinyals in view of Bagci further in view of NPL reference “Pyramid Scene Parsing Network” by Zhao et al. (referred herein as Zhao).
Regarding claim 12, Shazeer, Teichmann, Wang, Vinyals, and Bagci teach The system of claim 8, (see rejection of claim 8).
Shazeer, Teichmann, Wang, Vinyals, and Bagci do not teach wherein the unified encoder comprises a spatial pyramid pooling layer after the convolution layers.
Zhao teaches wherein the unified encoder comprises a spatial pyramid pooling layer after the convolution layers. ([Zhao, page 2884, sec 3.2] “The pyramid pooling module fuses features under four different pyramid scales. The coarsest level highlighted in red is global pooling to generate a single bin output. The following pyramid level separates the feature map into different sub-regions and forms pooled representation for different locations. The output of different levels in the pyramid pooling module contains the feature map with varied sizes. … Finally, different levels of features are concatenated as the final pyramid pooling global feature.”, wherein the examiner interprets “the pyramid pooling module fuses features under four different pyramid scales” to be the same as “the unified encoder comprises a spatial pyramid pooling layer” and “the coarsest level highlighted in red is global pooling to generate a single bin output” and “the following pyramid level separates the feature map into different sub-regions and forms pooled representation for different locations” to be the same as “a spatial pyramid pooling layer after the convolution layers,” as both terms are directed to pooling features at multiple spatial scales to generate a unified representation.)
Shazeer, Teichmann, Wang, Vinyals, Bagci, Zhao, and the instant application are analogous art because they are all directed to systems that enhance feature extraction and representation in convolutional neural networks by pyramid pooling features at multiple spatial scales.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 8 disclosed by Shazeer, Teichmann, Wang, Vinyals, and Bagci to include the pyramid pooling module that “fuses features under four different pyramid scales” disclosed by Zhao. One would be motivated to do so to effectively capture multi-scale spatial features for improved downstream processing, as suggested by Zhao ([Zhao, page 2884, sec 3.2] “Finally, different levels of features are concatenated as the final pyramid pooling global feature.”).
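For illustration of pooling at multiple pyramid scales, the following is a minimal sketch under stated assumptions. It is not Zhao's implementation: the 6 × 6 single-channel feature map, the bin counts (1, 2, 3, 6), and the function names are hypothetical, and the pooled levels are simply flattened and concatenated, whereas Zhao describes concatenating the different levels of features into a final pyramid pooling global feature.

```python
def average_pool_to_bins(feature_map, bins):
    """Average-pool a square 2-D map into a bins x bins grid.

    Assumes the map size is evenly divisible by the number of bins.
    """
    size = len(feature_map)
    step = size // bins
    pooled = []
    for bi in range(bins):
        row = []
        for bj in range(bins):
            cells = [
                feature_map[i][j]
                for i in range(bi * step, (bi + 1) * step)
                for j in range(bj * step, (bj + 1) * step)
            ]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(feature_map, scales=(1, 2, 3, 6)):
    """Pool at every scale and concatenate the results into one vector."""
    descriptor = []
    for bins in scales:
        for row in average_pool_to_bins(feature_map, bins):
            descriptor.extend(row)
    return descriptor


# Hypothetical 6 x 6 feature map holding the values 0..35.
feature_map = [[float(6 * i + j) for j in range(6)] for i in range(6)]
descriptor = pyramid_pool(feature_map)
print(len(descriptor))  # 1 + 4 + 9 + 36 pooled values
```

The coarsest level (one bin) is a global average over the whole map, and each finer level pools separate sub-regions, which mirrors the single-bin and sub-region levels quoted from Zhao.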
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Teichmann in view of Vinyals further in view of Wang.
Regarding claim 19, Teichmann teaches:
A method for interpreting data using machine learning, comprising: training a unified encoder, comprising neural network, using one or more machine learning models, to generate one or more training feature vectors useful for performing, based on the one or more feature vectors, a plurality of different tasks each comprising different interpretations of training data; ([Teichmann, page 1, Abstract] “we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks.”, wherein the examiner interprets “encoder … shared amongst the three tasks” to be the same as a unified encoder trained once and reused across multiple tasks in the manner claimed)
encoding new data, using the unified encoder, into one or more feature vectors, to generate the one or more feature vectors useful for performing the plurality of different tasks each comprising the different ones of the interpretations of the new data, and ([Teichmann, page 1] “we present an approach to joint classification, detection and semantic segmentation using a unified architecture where the encoder is shared amongst the three tasks”, AND [Teichmann, page 1] “The encoder is a deep CNN, producing rich features that are shared among all task. Those features are then utilized by task-specific decoders”, wherein the examiner interprets “a unified architecture where the encoder is shared amongst the three tasks” and “rich features that are shared among all task … utilized by task-specific decoders” to be the same as “encoding new data, using the unified encoder, into one or more feature vectors, to generate the one or more feature vectors useful for performing the plurality of different tasks each comprising the different ones of the interpretations” because they are both directed to a single shared encoder (the unified encoder/deep CNN) that produces feature representations (rich features/feature vectors) reused across multiple different tasks (joint classification, detection, and semantic segmentation), each task having its own decoder that outputs a different interpretation of the same encoded data.)
interpreting the one or more feature vectors, using a plurality of decoders including a semantic segmentation decoder … connected to the unified encoder, each of the decoders comprising a neural network outputting a different one of the interpretations of the new data, ([Teichmann, page 1], “This is done by incorporating all three task into a unified encoder-decoder architecture. We name our approach MultiNet … The encoder is a deep CNN, producing rich features that are shared among all task. Those features are then utilized by task-specific decoders”, wherein the examiner interprets “a unified encoder-decoder architecture” with “rich features that are shared among all task” and “task-specific decoders” to be the same as “interpreting the one or more feature vectors, using a plurality of decoders including a semantic segmentation decoder … connected to the unified encoder, each of the decoders comprising a neural network outputting a different one of the interpretations of the new data” because they are both directed to a deep CNN encoder (the unified encoder) that produces shared feature representations (the one or more feature vectors) which are fed into multiple task-specific decoder neural networks (the plurality of decoders, including a semantic segmentation decoder), each decoder taking the shared features as input and producing its own task-specific output (a different one of the interpretations of the new data)).
outputting the interpretations to a machine coupled to a camera, wherein the interpretations comprise an identification of an environment of the machine… ([Teichmann, page 1, Introduction] “visual perception is going to play a key role in the development of self-driving cars.” AND [Teichmann, Fig. 1 caption] “Figure 1: Our goal: Solving street classification, vehicle detection and road segmentation in one forward pass.” AND [Teichmann, Abstract] “Those features are then utilized by task-specific decoders, which produce their outputs in real-time.” wherein the examiner interprets a self-driving car that relies on visual perception with “task-specific decoders … [that] produce their outputs in real-time” to be the same as “outputting the interpretations to a machine coupled to a camera” because they are both directed to a vision-based machine (a vehicle equipped with cameras) receiving decoder outputs that interpret the captured images. The examiner further interprets “street classification, vehicle detection and road segmentation” to be the same as “identification of an environment of the machine” because they are all semantic interpretations that describe the surrounding road environment in which the machine (self-driving vehicle) operates.)
… providing navigation or instructions to the machine …([Teichmann, page 1] “computational times are very important in order to enable real time applications such as autonomous driving.” wherein the examiner interprets “real time applications such as autonomous driving” to be the same as “providing navigation or instructions to the machine” because both involve using model outputs (e.g., classification, detection, and segmentation of the scene) to control or guide the vehicle’s motion in real time.)
wherein the unified encoder is trained using at least one of: mutual transfer learning comprising propagating a gradient across orthogonal task specific parameter spaces, ([Teichmann, page 4] “Our joint training implementation computes the forward passes for examples corresponding to each of the three tasks independently. The gradients are only added during the back-propagation steps. This has the practical advantage that we are able to use different training parameters for each decoder.”, wherein the examiner interprets computing separate forward passes for each task while “the gradients are only added during the back-propagation steps” and “different training parameters for each decoder” to be the same as “mutual transfer learning comprising propagating a gradient across orthogonal task specific parameter spaces” because they are both directed to a multi-task training process in which each task-specific decoder has its own parameter space (task-specific parameters) while a shared gradient signal is propagated back through the common encoder parameters across those distinct task-specific spaces.)
or the machine learning comprising a first model for performing a first one of the different tasks and a second model for performing a second one of the different tasks, ([Teichmann, page 3] “Our approach shares a common encoder over the three tasks and has three branches, each implementing a decoder for a given task.”, wherein the examiner interprets “three branches, each implementing a decoder for a given task” to be the same as “a first model for performing a first one of the different tasks and a second model for performing a second one of the different tasks” because they are both directed to multiple task-specific neural-network models (decoders) that share a unified encoder but perform different tasks on the encoded features.)
and the training of the unified encoder alternates between the first model and the second model after an epoch or trains both methods each epoch. ([Teichmann, page 4] “Our joint training implementation computes the forward passes for examples corresponding to each of the three tasks independently. The gradients are only added during the back-propagation steps.”, AND [Teichmann, page 6] “we fine-tune the encoder using just one of the three losses segmentation, detection and classification and compare their performance … In the second part we compare joint training of all three decoders with individual inference…”, wherein the examiner interprets “fine-tune the encoder using just one of the three losses” to be the same as “alternates between the first model and the second model after an epoch” because both describe training the shared encoder using one task’s loss at a time. The examiner further interprets “joint training implementation computes the forward passes for examples corresponding to each of the three tasks independently” with gradients added during back-propagation to be the same as “trains both methods each epoch” because both describe a process in which examples from multiple tasks are processed within the same training schedule so that the unified encoder is updated using gradients from more than one task during each epoch.)
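For illustration of the two training schedules discussed above, the following is a minimal sketch under stated assumptions. It is not Teichmann's code: the scalar encoder weight, the two scalar task heads, the toy data, the learning rate, and the names task_gradients, train, and total_loss are hypothetical. In "joint" mode the gradients from both tasks are added on the shared encoder weight in every epoch; in "alternate" mode only one task's gradient updates the encoder per epoch. Each head is updated only by its own task's gradient.

```python
def task_gradients(w, head, x, target):
    """Gradients of (head * w * x - target) ** 2 w.r.t. w and head."""
    error = head * w * x - target
    return 2 * error * head * x, 2 * error * w * x


# Hypothetical toy data: one (input, target) pair per task.
DATA = [(1.0, 2.0), (1.0, 3.0)]


def total_loss(w, heads):
    """Sum of squared errors over both tasks."""
    return sum((h * w * x - y) ** 2 for h, (x, y) in zip(heads, DATA))


def train(mode, epochs=500, lr=0.01):
    """Train a shared encoder weight with two task-specific heads."""
    w, heads = 1.0, [1.0, 1.0]
    for epoch in range(epochs):
        # Joint: both tasks every epoch. Alternate: one task per epoch.
        active = [0, 1] if mode == "joint" else [epoch % 2]
        encoder_grad = 0.0
        head_grads = {}
        for t in active:
            x, y = DATA[t]
            grad_w, grad_head = task_gradients(w, heads[t], x, y)
            encoder_grad += grad_w       # gradients are added on the encoder
            head_grads[t] = grad_head    # each head keeps its own gradient
        w -= lr * encoder_grad
        for t, grad_head in head_grads.items():
            heads[t] -= lr * grad_head
    return w, heads


for mode in ("joint", "alternate"):
    w, heads = train(mode)
    print(mode, total_loss(w, heads))
```

Because the heads are separate parameters, the only point at which the tasks interact in this sketch is the shared encoder weight, where the per-task gradients are summed or applied in turn.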
Teichmann does not teach …and a natural language processing decoder… wherein the natural language processing decoder generates one of the interpretations comprising text data by operating on a single input that is a final one of the feature vectors from the unified encoder, … and the text data is used to perform at least one of searching the images captured by the camera, captioning the images, … or communicating with a human operator of the machine, …
Vinyals teaches, …and a natural language processing decoder…wherein the natural language processing decoder generates one of the interpretations comprising text data by operating on a single input that is a final one of the feature vectors from the unified encoder, and ([Vinyals, page 3156], “An ‘encoder’ RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a ‘decoder’ RNN that generates the target sentence.”, [Vinyals, page 3157], “Hence, it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences (see Fig. 1). We call this model the Neural Image Caption, or NIC.”, [Vinyals, page 3159], “The image I is only input once” AND [Vinyals, page 3163], “the CNN to extract features that are relevant to horse-looking animals.”, wherein the examiner interprets “last hidden layer” of the CNN image encoder, which is provided as the only image input (“image I is only input once”) to “the CNN to extract features”, and RNN decoder that “generates sentences,” to be the same as a single final feature vector produced by a unified encoder that is supplied as the sole input to a natural language processing decoder that generates text data because they are both directed to architectures in which an encoder network processes an image into a single fixed-length feature representation (the last hidden layer/feature vector). In both cases, that vector is then provided as the only input to a decoder neural network which outputs a natural-language sentence as the textual interpretation of that image.)
Vinyals does not teach … and the text data is used to perform at least one of searching the images captured by the camera, captioning the images, … or communicating with a human operator of the machine, …
Wang teaches … and the text data is used to perform at least one of searching the images captured by the camera, captioning the images, … or communicating with a human operator of the machine, …
Specifically, Wang teaches: and the text data is used to perform at least one of searching the images captured by the camera, ([Wang, col. 2, lines 35–37] “yielding mixed results in tasks, such as VQA, visual reasoning, and image retrieval.”, wherein the examiner interprets “image retrieval” to be the same as “searching the images captured by the camera” because they are both directed to using learned text/visual representations to search for and retrieve images from a collection of captured images.)
captioning the images, ([Wang, col. 16, lines 54–57] “The text data may also include one or more captions 352 relating or corresponding to the image data 350.”, wherein the examiner interprets “captions 352 relating or corresponding to the image data 350” to be the same as “captioning the images” because they are both directed to generating text captions that describe associated images.)
… or communicating with a human operator of the machine, ([Wang, col. 3, lines 46–48] “The visual dialogue model 140 can operate with an AI-based machine agent to hold a meaningful dialogue with humans in natural, conversational language about visual content.”, wherein the examiner interprets “hold a meaningful dialogue with humans in natural, conversational language about visual content” to be the same as “communicating with a human operator of the machine” because they are both directed to a machine agent using generated text to converse with a human about visual (image) data.)
Teichmann, Vinyals, Wang, and the instant application are analogous art because they are all directed to methods for interpreting image data using a unified encoder and multiple neural-network decoders to produce both visual and textual interpretations.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 19 disclosed by Teichmann to include the RNN decoder disclosed by Vinyals. One would be motivated to do so to effectively generate natural-language descriptions of images directly from the shared feature vectors produced by the unified encoder, as suggested by Vinyals ([Vinyals, page 3157, page 3161] “using the last hidden layer as an input to the RNN decoder that generates sentences … an end-to-end neural network system that can automatically view an image and generate a reasonable description in plain English.”).
It would have also been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the method of claim 19 disclosed by Teichmann to include the AI-based machine agent disclosed by Wang. One would be motivated to do so to effectively use the generated text data for image search, captioning, and natural-language interaction with a human operator about the machine’s visual environment, as suggested by Wang ([Wang, col. 3, lines 46-48] “operate with an AI-based machine agent to hold a meaningful dialogue with humans in natural, conversational language about visual content.”).
Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Shazeer in view of Teichmann in view of Wang in view of Vinyals further in view of NPL reference “SCOTI: Science Captioning of Terrain Images for data prioritization and local image search” by Qiu et al. (referred herein as Qiu).
Regarding claim 22, Shazeer, Teichmann, Wang, and Vinyals teach The system of claim 1 (see rejection of claim 1).
Shazeer, Teichmann, Wang, and Vinyals do not teach wherein the image captioning comprises connections of words relating different parts in the image according to Scientific Captioning of Terrain Images (SCOTI).
Qiu teaches wherein the image captioning comprises connections of words relating different parts in the image ([Qiu, page 1] “In this paper, we propose an approach to tackle these problems by leveraging an LSTM-based image captioning neural network architecture with visual attention mechanism, which pays attention on different parts of an image at different times to encode them into embeddings and then translates these embeddings into a meaningful sequence of words (a caption) for the image. Relations of different parts in the image are captured by the ordering of attentions, and are recovered into relation/ connection words in its caption.” wherein the examiner interprets “image captioning neural network architecture” to be the same as “the image captioning” because they are both directed to a neural-network-based system that performs image captioning. The examiner further interprets “a meaningful sequence of words (a caption)” and “relation/connection words in its caption” to be the same as “connections of words” because both describe words in a caption being linked together in a sequence that encodes relations among image content. Finally, the examiner interprets “relations of different parts in the image … recovered into relation/connection words in its caption” and “repeatedly pays attention to different parts of the image … and generates a caption … word by word” to be the same as “relating different parts in the image” because they are all directed to using the caption’s words to express relationships among multiple regions/parts of the image.)
according to Scientific Captioning of Terrain Images (SCOTI). ([Qiu, page 2] “This section describes the Science Captioning of Terrain Images (SCOTI) network for tackling the problem of “understanding” planetary images (primarily terrain images) for a machine. SCOTI extracts visual features from a raw image input into a feature map (a multidimensional vector), repeatedly pays attention to different parts of the image”, wherein the examiner interprets “SCOTI: Science Captioning of Terrain Images…” and “Science Captioning of Terrain Images (SCOTI) network” to be the same as “according to Scientific Captioning of Terrain Images (SCOTI)” because they explicitly name the SCOTI method and describe it as the specific image-captioning network used to generate captions for terrain images.)
Shazeer, Teichmann, Wang, Vinyals, Qiu, and the instant application are analogous art because they are all directed to neural-network–based image understanding and captioning systems that generate textual descriptions of images by relating different parts or regions of the image (including terrain images) to one another.
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the invention to modify the system of claim 1 disclosed by Shazeer, Teichmann, Wang, and Vinyals to include the “Science Captioning of Terrain Images (SCOTI) network … [that] pays attention on different parts of an image at different times … and then translates these embeddings into a meaningful sequence of words (a caption) for the image” disclosed by Qiu. One would be motivated to do so to effectively improve the system’s ability to generate image captions that capture relationships among different parts of an image using an attention-based captioning mechanism, as suggested by Qiu ([Qiu, pages 1-2] “tackle these problems by leveraging an LSTM-based image captioning neural network architecture with visual attention mechanism, which pays attention on different parts of an image at different times … and then translates these embeddings into a meaningful sequence of words (a caption) for the image.”).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DEVAN KAPOOR whose telephone number is (703) 756-1434. The examiner can normally be reached Monday - Friday: 9:00 AM - 5:00 PM EST (times may vary).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, David Yi, can be reached at (571) 270-7519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DEVAN KAPOOR/Examiner, Art Unit 2126
/DAVID YI/Supervisory Patent Examiner, Art Unit 2126