DETAILED ACTION
1. This Office action is in response to the amendment filed on 12/08/2025 in Application No. 17/016,534. Claims 1-21 are presented for examination and are currently pending.
Response to Arguments
2. The Examiner is withdrawing the rejections set forth in the previous Office action because Applicant’s amendment necessitated the new grounds of rejection presented in this Office action. As a result, the Applicant’s arguments are moot.
Applicant’s arguments regarding the 112(f) claim interpretation are persuasive. As a result, the 112(f) claim interpretation has been withdrawn.
The Examiner notes that independent claims 16 and 21 are similar to claim 1. The same basis of rejection applies to claims 16 and 21. As a result, these claims are not patentable.
The Examiner notes that dependent claims 2-15 and 17-20, which depend directly or indirectly from claims 1 and 16, are not allowable because the Applicant’s arguments have been considered but are moot for reasons similar to those discussed above regarding claim 1.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
3. Claims 1-21 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Claims 1, 16 and 21 recite “the second layer is subsequent to the fifth layer along the first dimension of data propagation in the subsequent DNN”.
It is unclear what the Applicant means by “the second layer is subsequent to the fifth layer along the first dimension of data propagation in the subsequent DNN”. This limitation indicates that the second layer of the subsequent DNN comes after the fifth layer of the subsequent DNN along the first dimension, but the Applicant also claims that “the fourth layer is prior to the fifth layer along the first dimension of data propagation in the subsequent DNN”. If the fourth layer of the subsequent DNN comes before the fifth layer of the subsequent DNN, it is unclear how the “second layer” can come after the fifth layer.
Claims 2-15 and 17-20, which are not specifically addressed, are rejected due to their dependency from a rejected base claim.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
4. Claims 1-3, 16 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Rabinowitz et al. (US20170337464) in view of Yang et al. (US20210019602 filed 07/18/2019).
Regarding claim 1, Rabinowitz teaches a computer-based artificial intelligence (AI) system (the neural network system comprising a sequence of deep neural networks (DNNs) [0005]), comprising: an input interface (a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer [0075]) configured to accept input data (The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output [0072]);
a memory (a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data [0073]) configured to store a multi-dimensional neural network having a sequence of deep neural networks (DNN) (sequence of DNNs 102 [0045]) including an inner DNN (First DNN 104, Fig. 1) and an outer DNN (Sequence DNNs 108, Fig. 1),
wherein each DNN includes a sequence of layers (Similarly, subsequent DNN 108 includes input layer 108 a, hidden layers 108 b and 108 c, and output layer 108 d [0047]; The first DNN 104 includes multiple neural network layers, e.g., input layer 104 a, hidden layers 104 b and 104 c, and output layer 104 d [0045]), and
corresponding layers of different DNNs have identical parameters … of one DNN in the sequence of DNNs to another DNN in the sequence of DNNs (In some implementations training each subsequent DNN may include setting preceding DNN parameters of preceding DNNs to constant values, e.g., to previously trained values [0062]),
each DNN is configured to process the input data sequentially by the sequence of layers along a first dimension of data propagation (The Examiner notes that the first dimension of data propagation is the upward arrow of input 1, input 2, input 3, Fig.1),
the DNNs in the sequence of DNNs are arranged along a second dimension of data propagation starting from the inner DNN till the outer DNN (Hidden layer 104b → Hidden layer 106c and Hidden layer 104b → Hidden layer 108c is the second dimension of data propagation in Fig. 1),
the DNNs in the sequence of DNNs are connected such that an output of a first layer of a first DNN (preceding DNNs, e.g., the first DNN [0060]; For example, generally, the first DNN 104 may include L indexed layers e.g., layers i=1, . . . , L [0045]) is combined with an input to a second layer of a subsequent DNN in the sequence of DNNs (The last subsequent DNN includes a number of indexed layers, and (ii) each layer in the number of indexed layers with index greater than one receives input from … (ii) one or more preceding layers of each preceding DNN in the sequence of DNNs [0065]. The Examiner notes this indicates layer i =1 of the First DNN 104 as preceding DNN is connected to each layer of subsequent DNN 108 as last subsequent DNN including layer i = 2) and an output of a third layer of the first DNN is combined with the input to a fourth layer of the subsequent DNN in the sequence of DNNs (The last subsequent DNN includes a number of indexed layers, and (ii) each layer in the number of indexed layers with index greater than one receives input from … (ii) one or more preceding layers of each preceding DNN in the sequence of DNNs [0065]. The Examiner notes this indicates layer i =3 of the First DNN 104 as preceding DNN is connected to each layer of subsequent DNN 108 as last subsequent DNN including layer i = 4), the third layer (layer i =3 of the First DNN 104) is subsequent to the first layer (layer i =1 of the First DNN 104) along the first dimension of data propagation in the first DNN (The Examiner notes that the first dimension of data propagation is the upward arrow of input1 → input layer 104a → hidden layer 104b → hidden layer 104 c, Fig. 1),
the subsequent DNN comprises a fifth layer (subsequent DNN 108 as last subsequent DNN including layer i = 5) corresponding to the first layer of the first DNN (layer i =1 of the First DNN 104) such that the fifth layer (subsequent DNN 108 as last subsequent DNN including layer i = 5) …, the second layer (subsequent DNN 108 as last subsequent DNN including layer i = 2) is subsequent to the fifth layer (subsequent DNN 108 as last subsequent DNN including layer i = 5) along the first dimension of data propagation in the subsequent DNN (The Examiner notes that the first dimension of data propagation is the upward arrow of input3 → input layer 108a → hidden layer 108b → hidden layer 108 c, Fig. 1), and the fourth layer (subsequent DNN 108 as last subsequent DNN including layer i = 4, Fig. 1) is prior to the fifth layer (subsequent DNN 108 as last subsequent DNN including layer i = 5, Fig. 1) along the first dimension of data propagation in the subsequent DNN (The Examiner notes that the first dimension of data propagation is the upward arrow of input3 → input layer 108a → hidden layer 108b → hidden layer 108 c, Fig. 1);
a processor (The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers [0069]) configured to submit the input data to the multi-dimensional neural network to produce an output of the outer DNN (output3 of the subsequent DNN 108, Fig. 1); and
an output interface configured to render at least a function of the output of the outer DNN (this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user [0075]).
Rabinowitz does not explicitly teach identical parameters formed by duplicating parameters of one DNN in the sequence of DNNs to another DNN in the sequence of DNNs, such that each of the DNNs in the sequence of DNNs includes an identical combination of the parameters, …such that the fifth layer has identical weights as the first layer,
Yang teaches a memory configured to store a multi-dimensional neural network (The controller 300 may further include register files 320 for storing the specific configuration (e.g., the number of CNN processing engines) [0049]; Each CNN processing engine includes multiple convolution layers, abstract),
identical parameters (In some examples, all of the CNN processing engines are identical. For example, CNN processing engines 252(1)-252(N), 262(1)-262(N) are identical [0046]; Each CNN processing engine includes multiple convolution layers, abstract) formed by duplicating parameters of one DNN in the sequence of DNNs to another DNN in the sequence of DNNs, such that each of the DNNs in the sequence of DNNs includes an identical combination of the parameters (In some examples, the CNN operations in an AI chip, e.g., 1600, may be performed with one or more convolution layers in one or more CNN processing engines having duplicate weights from other convolution layers [0079]),
… such that the fifth layer has identical weights as the first layer (In some examples, the CNN operations in an AI chip, e.g., 1600, may be performed with one or more convolution layers in one or more CNN processing engines having duplicate weights from other convolution layers [0079]),
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Rabinowitz to incorporate the teachings of Yang for the benefit of systems and methods that reduce the memory space required for the weights of an AI chip by allowing storing and accessing of duplicates among various convolution layers (Yang [0134]).
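To illustrate the arrangement mapped above, the Examiner provides the following minimal sketch in PyTorch-style Python. It is illustrative only: the class name, layer count, and cross-connection indices are assumptions for explanation and are not taken from the claims, Rabinowitz, or Yang. The sketch shows a first DNN and a subsequent DNN with duplicated (identical) parameters, where outputs of layers of the first DNN are combined with inputs of later layers of the subsequent DNN along a second dimension of data propagation.

import copy
import torch
import torch.nn as nn

class SimpleDNN(nn.Module):
    # First dimension of data propagation: input -> layer 1 -> ... -> layer L.
    def __init__(self, dim, num_layers=5):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x, cross_inputs=None):
        outputs = []
        for i, layer in enumerate(self.layers):
            if cross_inputs is not None and cross_inputs[i] is not None:
                # Combine the output of a layer of the preceding DNN with this layer's input.
                x = x + cross_inputs[i]
            x = torch.relu(layer(x))
            outputs.append(x)
        return x, outputs

dim = 8
first_dnn = SimpleDNN(dim)
subsequent_dnn = copy.deepcopy(first_dnn)  # duplicated parameters: identical weights in both DNNs

x = torch.randn(2, dim)
_, first_outs = first_dnn(x)
# Second dimension of data propagation: e.g., the output of layer 1 of the first DNN
# feeds the input of layer 2 of the subsequent DNN, and the output of layer 3 feeds layer 4.
cross = [None, first_outs[0], None, first_outs[2], None]
y, _ = subsequent_dnn(x, cross_inputs=cross)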
Regarding claim 2, Modified Rabinowitz teaches the AI system of claim 1, Rabinowitz teaches wherein the multi-dimensional neural network (Sequence of DNNs 102 [0045]) has at least one hidden DNN (Sequence DNN 106, Fig. 1) arranged between the inner DNN (First DNN 104, Fig. 1) and the outer DNN (Sequence DNNs 108, Fig. 1) along the second dimension of data propagation (Hidden layer 104b → Hidden layer 106c and Hidden layer 104b → Hidden layer 108c is the second dimension of data propagation in Fig. 1).
Regarding claim 3, Modified Rabinowitz teaches the AI system of claim 1, Rabinowitz teaches wherein one or more layers of the inner DNN (For example, generally, the first DNN 104 may include L indexed layers e.g., layers i=1, . . . , L [0045]) are connected to multiple layers of the outer DNN (Similarly, subsequent DNN 108 includes input layer 108 a, hidden layers 108 b and 108 c, and output layer 108 d [0047]),
multiple layers of the inner DNN are connected to a layer of the outer DNN, or combination thereof (Input layer 104a is connected to Hidden layer 108b, Input layer 104b is connected to Hidden Layer 108c, Fig. 1).
Regarding claim 16, claim 16 is similar to claim 1. It is rejected in the same manner, with the same reasoning applying. Further, Rabinowitz teaches a method for generating an output of an outer deep neural network (DNN) (output3 of the subsequent DNN 108, Fig. 1) of a multi-dimensional neural network (sequence of DNNs 102 [0045]),
wherein the method uses a processor coupled with stored instructions implementing the method, wherein the instructions, when executed by the processor carry out at least some steps of the method, comprising (The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit) [0072]):
Regarding claim 21, claim 21 is similar to claim 1. It is rejected in the same manner, with the same reasoning applying. Further, Rabinowitz teaches a non-transitory computer-readable storage medium embodied thereon a program executable by a processor for performing a method, the method comprising (Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus [0068]):
5. Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Rabinowitz et al. (US20170337464) in view of Yang et al. (US20210019602 filed 07/18/2019) and further in view of Dehghani et al. (US20190354567).
Regarding claim 4, Modified Rabinowitz teaches the AI system of claim 1,
Rabinowitz teaches wherein multiple layers of the inner DNN (First DNN 104, Fig. 1) are connected to a sixth layer of the subsequent DNN (subsequent DNN 108 as last subsequent DNN including layer i = 6, Fig. 1) and
a seventh layer of the outer DNN (Sequence DNNs 108, Fig. 1; The last subsequent DNN includes a number of indexed layers, and (ii) each layer in the number of indexed layers with index greater than one receives input from … (ii) one or more preceding layers of each preceding DNN in the sequence of DNNs [0065]. The Examiner notes layer of subsequent DNN 108 as last subsequent DNN including layer i = 7).
But Modified Rabinowitz does not explicitly teach connected via a plurality of soft connections to scale outputs of the multiple layers of the inner DNN based on weights of the soft connections before adding the scaled outputs to an input of a … layer of the outer DNN.
Dehghani teaches wherein multiple layers of the inner DNN are connected to a … layer of the subsequent DNN via a plurality of soft connections to scale outputs of the multiple layers of the inner DNN (h1 representation 105a is connected to Self-attention Process 112b … and Self-attention Process 112m; h2 representation 105b is connected to Self-attention Process 112a … and Self-attention Process 112m, Fig. 1)
based on weights of the soft connections (In addition, number of computational steps of the Universal Transformer can be varied dynamically after training because the model shares weights across its sequential computational steps [0006])
before adding the scaled outputs to an input of a ... layer of the outer DNN (output of recurrent encoder block 410 after T steps is sent to recurrent encoder block 420, Fig. 4)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the method of Dehghani for the benefit of computer systems with GPUs and other accelerator hardware to exploit the parallel computational structure of the Universal Transformer (Dehghani [0012]).
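For illustration of the soft connections relied upon from Dehghani and the scaling parameters of Keskar, the following sketch (Python; all names are illustrative assumptions and are not taken from the references) scales the outputs of several inner-DNN layers by learned weights before adding the scaled result to the input of a layer of the outer DNN. Because the scaling weights are ordinary trainable parameters, they would be trained simultaneously with the remaining parameters of the network, as recited in claim 5.

import torch
import torch.nn as nn

class SoftConnections(nn.Module):
    def __init__(self, num_sources):
        super().__init__()
        # One trainable weight per source layer, learned jointly with the model.
        self.logits = nn.Parameter(torch.zeros(num_sources))

    def forward(self, target_input, source_outputs):
        # Weights between zero and one that sum to one (cf. Keskar [0040]).
        w = torch.softmax(self.logits, dim=0)
        scaled = sum(wi * s for wi, s in zip(w, source_outputs))
        return target_input + scaled  # add the scaled outputs to the target layer's input

soft = SoftConnections(num_sources=3)
source_outputs = [torch.randn(2, 8) for _ in range(3)]
combined = soft(torch.randn(2, 8), source_outputs)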
6. Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Rabinowitz et al. (US20170337464) in view of Keskar et al. (US20190130273) in view of Yang et al. (US20210019602 filed 07/18/2019) and further in view of Dehghani et al. (US20190354567).
Regarding claim 5, Modified Rabinowitz teaches the AI system of claim 4, Modified Rabinowitz does not explicitly teach wherein the weights of the soft connections are trained simultaneously with parameters of the multi-dimensional neural network.
Keskar teaches wherein the weights of the soft connections are trained simultaneously with parameters of the multi-dimensional neural network (the learned scaling parameters may correspond to weighting parameters that have values between zero and one and add up to one [0040] Fig. 3B)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of using learned scaling parameters which may reduce and/or prevent co-adaptation among branches 360 a-m during training, thereby improving the performance of branched transformer model 300 (Keskar [0038])
7. Claims 6, 7, 12-14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rabinowitz et al. (US20170337464) in view of Keskar et al. (US20190130273) and further in view of Yang et al. (US20210019602 filed 07/18/2019).
Regarding claim 6, Modified Rabinowitz teaches the AI system of claim 1. Modified Rabinowitz does not explicitly teach wherein the multi-dimensional neural network forms an encoder in an encoder-decoder architecture of a multi-pass transformer (MPT), such that the output of the outer DNN includes encodings of the input data processed by a decoder to produce an output of the AI system via the output interface, wherein each layer of each of the DNN in the multi-dimensional neural network includes an attention module and each attention module includes a self-attention subnetwork followed by a feed-forward subnetwork.
Keskar teaches wherein the multi-dimensional neural network forms an encoder in an encoder-decoder architecture of a multi-pass transformer (MPT), (As depicted in FIG. 3A, branched transformer model 300 includes an input stage 310, and encoder stage, 320, and a decoder stage 330 [0028])
such that the output of the outer DNN includes encodings of the input data processed by a decoder to produce an output of the AI system (Similarly, each of branched attention decoder layers 330 a-(n-1) generates a respective layer decoded representation 335 a-(n-1) that is received by a subsequent layer among decoder layers 330 b-n. An output layer 340 receives decoded representation 335 n from the decoder layer 330 n and generates output sequence 304 [0029])
via the output interface, (output interface of output sequence 104, Fig. 1)
wherein each layer of each of the DNN in the multi-dimensional neural network includes an attention module and each attention module includes a self-attention subnetwork (each of branches 360 a-m may include one or more sub-layers arranged sequentially. As depicted in FIG. 3B, the sub-layers may include, but are not limited to, a parameterized attention network (e.g., parameterized attention networks 361 a-m) [0033])
followed by a feed-forward subnetwork (where the parameterized transformation network 363 f includes a two-layer feed-forward neural network (dff) [0062])
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of using learned scaling parameters which may reduce and/or prevent co-adaptation among branches 360 a-m during training, thereby improving the performance of branched transformer model 300 (Keskar [0038])
Regarding claim 7, Modified Rabinowitz teaches the AI system of claim 6, Modified Rabinowitz does not explicitly teach wherein each of the DNNs in the multi-dimensional neural network includes a residual connection before each attention module and between the self-attention subnetwork and the feed-forward subnetwork.
Keskar teaches wherein each of the DNNs in the multi-dimensional neural network includes a residual connection (Although not depicted in FIGS. 3A- 3C, branched transformer model 300 may include any number of residual connections … In general, the use of residual connections may accelerate the training of branched transformer model 300 by reducing the effective path length between a given layer and/or sub-layer and output layer 340 [0043])
before each attention module and between the self-attention subnetwork (As depicted in FIG. 3B, the sub-layers may include, but are not limited to, a parameterized attention network (e.g., parameterized attention networks 361 a-m) [0033]) and
the feed-forward subnetwork. (where the parameterized transformation network 363 f includes a two-layer feed-forward neural network (dff) [0062])
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of using learned scaling parameters which may reduce and/or prevent co-adaptation among branches 360 a-m during training, thereby improving the performance of branched transformer model 300 (Keskar [0038]).
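The following sketch (Python; illustrative only, and not code from Keskar) shows one conventional realization of an attention module as mapped above for claims 6 and 7: a self-attention subnetwork followed by a feed-forward subnetwork, with a residual connection around each subnetwork. The residual placement shown is one common arrangement, offered solely for explanation.

import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    def __init__(self, dim, heads=2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feed_forward = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)
        x = x + attn_out              # residual connection around the self-attention subnetwork
        x = x + self.feed_forward(x)  # residual connection around the feed-forward subnetwork
        return x

module = AttentionModule(dim=8)
out = module(torch.randn(2, 5, 8))  # (batch, sequence length, features)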
Regarding claim 12, Modified Rabinowitz teaches an audio processing system including the AI system of claim 1, Modified Rabinowitz does not explicitly teach wherein the input data include an audio signal, and the function of the output include transcription of the audio signal.
Keskar teaches wherein the input data include an audio signal, (In some embodiments the first and second sequence may correspond to audio sequences [0054]) and
the function of the output include transcription of the audio signal (In some embodiments the first and second sequence may correspond to text sequences [0054])
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of using learned scaling parameters which may reduce and/or prevent co-adaptation among branches 360 a-m during training, thereby improving the performance of branched transformer model 300 (Keskar [0038])
Regarding claim 13, Modified Rabinowitz teaches the audio processing system of claim 12, Keskar teaches wherein the audio signal includes speech utterance, such that the audio processing system is an automatic speech recognition (ASR) system (it is to be understood that the sequence-to-sequence models may operate on a wide variety of types of input sequences, including but not limited to text sequences, audio sequences [0015])
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of using learned scaling parameters which may reduce and/or prevent co-adaptation among branches 360 a-m during training, thereby improving the performance of branched transformer model 300 (Keskar [0038])
Regarding claim 14, Modified Rabinowitz teaches the AI system of claim 1 but does not explicitly teach a machine translation device including the AI system of claim 1, trained to convert the input data representing a speech utterance in a first language into the output data representing the speech utterance in a second language.
Keskar teaches a machine translation device including the AI system (a system for sequence-to-sequence prediction according to some embodiments [0005]; Sequence-to-sequence prediction using a neural network model (title); the neural network model may include a plurality of model parameters learned according to a machine learning process [0053])
trained to convert the input data representing a speech utterance in a first language (In machine translation applications, the first sequence may correspond to a text sequence (e.g., a word, phrase, sentence, document, and/or the like) in a first language [0054])
into the output data representing the speech utterance in a second language (n machine translation applications, the output sequence may correspond to a translated version of the first sequence in a second language [0055])
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of using learned scaling parameters which may reduce and/or prevent co-adaptation among branches 360 a-m during training, thereby improving the performance of branched transformer model 300 (Keskar [0038])
Regarding claim 20, Modified Rabinowitz teaches the method of claim 16,
Modified Rabinowitz does not explicitly teach further comprising: forming an encoder in an encoder-decoder architecture of a neural network based on the multi-dimensional neural network, such that the output of the outer DNN includes the encodings of the input data processed by a decoder to produce an output of an AI system, wherein each layer of each of the DNN in the multi-dimensional neural network includes an attention module and each attention module includes a self-attention subnetwork followed by a feed-forward subnetwork; and analyzing a residual connection in each of the DNNs in the multi-dimensional neural network before each attention module and between a self-attention subnetwork and a feed-forward subnetwork of the attention module.
Keskar teaches further comprising: forming an encoder in an encoder-decoder architecture of a neural network based on the multi-dimensional neural network, (As depicted in FIG. 3A, branched transformer model 300 includes an input stage 310, and encoder stage, 320, and a decoder stage 330 [0028]; A method for sequence-to-sequence prediction using a neural network model, abstract)
such that the output of the outer DNN includes the encodings of the input data processed by a decoder to produce an output of an AI system, (Similarly, each of branched attention decoder layers 330 a-(n-1) generates a respective layer decoded representation 335 a-(n-1) that is received by a subsequent layer among decoder layers 330 b-n. An output layer 340 receives decoded representation 335 n from the decoder layer 330 n and generates output sequence 304 [0029])
wherein each layer of each of the DNN in the multi-dimensional neural network includes an attention module and each attention module includes a self-attention subnetwork (each of branches 360 a-m may include one or more sub-layers arranged sequentially. As depicted in FIG. 3B, the sub-layers may include, but are not limited to, a parameterized attention network (e.g., parameterized attention networks 361 a-m) [0033])
followed by a feed-forward subnetwork; (where the parameterized transformation network 363 f includes a two-layer feed-forward neural network (dff) [0062]) and
analyzing a residual connection in each of the DNNs in the multi-dimensional neural network (Although not depicted in FIGS. 3A-3C, branched transformer model 300 may include any number of residual connections … In general, the use of residual connections may accelerate the training of branched transformer model 300 by reducing the effective path length between a given layer and/or sub-layer and output layer 340 [0043])
before each attention module and between a self-attention subnetwork (As depicted in FIG. 3B, the sub-layers may include, but are not limited to, a parameterized attention network (e.g., parameterized attention networks 361 a-m) [0033]) and
a feed-forward subnetwork of the attention module. (where the parameterized transformation network 363 f includes a two-layer feed-forward neural network (dff) [0062])
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of using learned scaling parameters which may reduce and/or prevent co-adaptation among branches 360 a-m during training, thereby improving the performance of branched transformer model 300 (Keskar [0038])
8. Claims 8-11 are rejected under 35 U.S.C. 103 as being unpatentable over Rabinowitz et al. (US20170337464) in view of Keskar et al. (US20190130273) in view of Yang et al. (US20210019602 filed 07/18/2019) and further in view of Kitaev et al. (US10909461 filed 05/08/2020).
Regarding claim 8, Modified Rabinowitz teaches the AI system of claim 7, Rabinowitz teaches wherein a connection between two layers of two DNNs of the sequence of DNNs including the first DNN (preceding DNNs, e.g., the first DNN [0060]; For example, generally, the first DNN 104 may include L indexed layers e.g., layers i=1, . . . , L [0045]) and the subsequent DNN (subsequent DNN 106 as subsequent DNN) combines an output of a sixth layer of the first DNN (layer i =6 of the First DNN 104) with an input to a seventh layer of the subsequent DNN (subsequent DNN 106 as subsequent DNN including layer i = 7), wherein the output is added to the input of the seventh layer of the subsequent DNN (subsequent DNN 108 as the last subsequent DNN including layer i = 7)
Modified Rabinowitz does not explicitly teach wherein the output is added to the input of the ... layer of the subsequent DNN prior to a residual connection of the self-attention subnetwork of the attention module of the ... layer of the subsequent DNN.
Kitaev teaches wherein a connection between two layers of two DNNs of the sequence of DNNs including the first DNN and the subsequent DNN combines an output of a ... layer of the first DNN with an input to a ... layer of the subsequent DNN, (a first standard residual connection layer combines the output 412 of an attention sub-layer with the input 402 to the attention sub-layer to generate an attention residual output 422, Fig. 4A, col 7, lines 8-11)
wherein the output (output 412, Fig. 4A)
is added to the input of the ... layer of the subsequent DNN (input 402, Fig. 4a)
prior to a residual connection of the self-attention subnetwork of the attention module of the ... layer of the subsequent DNN (attention sub-layer to generate an attention residual output 422, Fig. 4A)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the method of Kitaev for the benefit of a machine learning task for neural machine translation (col 2, lines 22-23) and an audio processing task (col 2, lines 30-31) wherein the transformations applied by the layer 130 will generally be the same for each input position (but different feed-forward layers in the attention neural network 100 will apply different transformations) (Kitaev, col 6, lines 52-55)
Regarding claim 9, Modified Rabinowitz teaches the AI system of claim 7,
Rabinowitz teaches wherein a connection between two layers of two DNNs of the sequence of DNNs (preceding DNNs, e.g., the first DNN [0060]; For example, generally, the first DNN 104 may include L indexed layers e.g., layers i=1, . . . , L [0045]) combines an output of a sixth layer of the first DNN of the sequence of DNNs (layer i =6 of the First DNN 104) with an input to a seventh layer of the subsequent DNN (subsequent DNN 106 as subsequent DNN including layer i = 7), wherein the output is added to the input of the seventh layer of the subsequent DNN (subsequent DNN 108 as the last subsequent DNN including layer i = 7)
Modified Rabinowitz does not explicitly teach wherein the output is added to the input of the ... layer of the subsequent DNN after a residual connection of the self-attention subnetwork of the attention module of the layer of the subsequent DNN.
Kitaev teaches wherein a connection between two layers of two DNNs of the sequence of DNNs combines an output of a ... layer of the first DNN of the sequence of DNNs with an input to a ... layer of the subsequent DNN, (A second residual connection layer combines the output 432 of the position-wise feed-forward layer with the input 422 to the position-wise feed-forward layer to generate a feed-forward residual output 442, Fig. 4A, col 7, lines 11-15)
wherein the output (output 432, Fig. 4A)
is added to the input of the ... layer of the subsequent DNN (input 422, Fig.4A)
after a residual connection of the self-attention subnetwork of the attention module of the ... layer of the subsequent DNN (position-wise feed-forward layer to generate a feed-forward residual output 442, Fig. 4A, col 7, lines 14-15)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the method of Kitaev for the benefit of a machine learning task for neural machine translation (col 2, lines 22-23) and an audio processing task (col 2, lines 30-31) wherein the transformations applied by the layer 130 will generally be the same for each input position (but different feed-forward layers in the attention neural network 100 will apply different transformations) (Kitaev, col 6, lines 52-55).
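The distinction between claims 8 and 9 turns on where the cross-DNN output is added relative to the residual connection of the self-attention subnetwork. The following sketch (Python; the names and shapes are illustrative assumptions, not code from Kitaev) contrasts the two injection points.

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(8, 2, batch_first=True)
x = torch.randn(2, 5, 8)      # input to the layer of the subsequent DNN
cross = torch.randn(2, 5, 8)  # output of a layer of the first DNN

def inject_prior(x, cross):
    h = x + cross             # cross-DNN output added prior to the residual connection (claim 8)
    attn_out, _ = attn(h, h, h)
    return h + attn_out       # residual connection of the self-attention subnetwork

def inject_after(x, cross):
    attn_out, _ = attn(x, x, x)
    h = x + attn_out          # residual connection of the self-attention subnetwork
    return h + cross          # cross-DNN output added after the residual connection (claim 9)

y_claim8 = inject_prior(x, cross)
y_claim9 = inject_after(x, cross)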
Regarding claim 10, Modified Rabinowitz teaches the AI system of claim 7, Rabinowitz teaches wherein a connection between two layers of two DNNs of the sequence of DNNs (sequence of DNNs 102 [0045]) combines an intermediate output of a sixth layer of the first DNN (layer i =6 of the First DNN 104) with an input to a seventh layer of the subsequent DNN (subsequent DNN 106 as subsequent DNN including layer i = 7, Fig. 1),
wherein the output is added to the input of the seventh layer of the subsequent DNN (subsequent DNN 108 as the last subsequent DNN including layer i = 7, Fig. 1)
Modified Rabinowitz does not explicitly teach wherein the output is added to the input of the ... layer of the subsequent DNN prior to a residual connection of the self-attention subnetwork of the attention module of the ... layer of the subsequent DNN.
Kitaev teaches wherein a connection between two layers of two DNNs of the sequence of DNNs combines an intermediate output of a ... layer of the first DNN with an input to a ... layer of the subsequent DNN, (outputs 448 and 450, and efficiently recover intermediate layer activations, e.g., outputs 428 and 430, from the final layer activations, Fig. 4C, col 8, lines 33-35)
wherein the output (output 412, Fig. 4A)
is added to the input of the ... layer of the subsequent DNN prior to a residual connection of the self-attention subnetwork of the attention module of the ... layer of the subsequent DNN (attention sub-layer to generate an attention residual output 422, Fig. 4A; a first standard residual connection layer combines the output 412 of an attention sub-layer with the input 402 to the attention sub-layer to generate an attention residual output 422, Fig. 4A, col 7, lines 8-11)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the method of Kitaev for the benefit of a machine learning task for neural machine translation (col 2, lines 22-23) and an audio processing task (col 2, lines 30-31) wherein the transformations applied by the layer 130 will generally be the same for each input position (but different feed-forward layers in the attention neural network 100 will apply different transformations) (Kitaev, col 6, lines 52-55)
Regarding claim 11, Modified Rabinowitz teaches the AI system of claim 7, Rabinowitz teaches wherein a connection between two layers of two DNNs of the sequence of DNNs (sequence of DNNs 102 [0045]) combines an intermediate output of a sixth layer of the first DNN (layer i =6 of the First DNN 104) with an input to a seventh layer of the subsequent DNN (subsequent DNN 106 as subsequent DNN including layer i = 7),
wherein the output is added to the input of the seventh layer of the subsequent DNN (subsequent DNN 108 as the last subsequent DNN including layer i = 7)
Modified Rabinowitz does not explicitly teach wherein the output is added to the input of the ... layer of the subsequent DNN after a residual connection of the self-attention subnetwork of the attention module of the ... layer of the subsequent DNN.
Kitaev teaches wherein a connection between two layers of two DNNs of the sequence of DNNs combines an intermediate output of a ... layer of the first DNN with an input to a ... layer of the subsequent DNN, (output 448 and 450, and efficiently recover intermediate layer activation, e.g., output 430, from the final layer activations, Fig. 4C, col 8, lines 33-35)
wherein the output is added to the input of the ... layer of the subsequent DNN after a residual connection of the self-attention subnetwork of the attention module of the ... layer of the subsequent DNN. (position-wise feed-forward layer to generate a feed-forward residual output 442, Fig. 4A, col 7, lines 14-15)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the method of Kitaev for the benefit of a machine learning task for neural machine translation (col 2, lines 22-23) and an audio processing task (col 2, lines 30-31) wherein the transformations applied by the layer 130 will generally be the same for each input position (but different feed-forward layers in the attention neural network 100 will apply different transformations) (Kitaev, col 6, lines 52-55)
9. Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Rabinowitz et al. (US20170337464) in view of Keskar et al. (US20190130273) in view of Yang et al. (US20210019602 filed 07/18/2019) and further in view of Adato et al. (US20190236531)
Regarding claim 15, Modified Rabinowitz teaches the AI system of claim 1, Modified Rabinowitz does not explicitly teach a cooperative operation system for maintaining process data of products on assembly lines including the AI system of claim 1, wherein the AI system is trained to convert speech to text, comprising: an input device configured to acquire instructions of an operator; a network interface controller (NIC) configured to communicate with the operator and a robot, wherein the NIC is connected to a manipulator state detector and an object detector, wherein the NIC acquires a manipulator state of the robot from the manipulator state detector, and a workpiece state representing a state between a workpiece and the manipulator from the object detector with respect to the assemble lines, wherein the NIC receives process flows representing process steps for assembling products via a network; wherein the AI system stores a speech-to-text program, the AI system converts the instructions from the input device into translated data of a predetermined language, and converts the translated data into text data of the predetermined language using the speech-to-data program; and a display device configured to indicate the text data, the process data including the manipulator state and workpiece state according to a predetermined process information format for recording qualities of the products.
Keskar teaches wherein the AI system is trained to convert speech to text, (it is to be understood that the sequence-to-sequence models may operate on a wide variety of types of input sequences, including but not limited to audio sequences [0015]) comprising:
an input device (input interface of system 100, Fig. 1)
the AI system converts the instructions from the input device into translated data of a predetermined language, and converts the translated data into text data of the predetermined language using the speech-to-data program; (it is to be understood that the sequence-to-sequence models may operate on a wide variety of types of input sequences, including but not limited to audio sequences [0015]; output sequence 104 may correspond to a text sequence in a second language [0016]; the output sequence may correspond to a translated version of the first sequence [0055])
a display device (output interface of output sequence 104, Fig. 1)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of develop machine translation models that achieve higher accuracy than current state of art machine translation models (Keskar [0015])
Modified Rabinowitz does not explicitly teach a cooperative operation system for maintaining process data of products on assembly lines including the AI system of claim 1, wherein the AI system is trained to convert speech to text, comprising: an input device configured to acquire instructions of an operator; a network interface controller (NIC) configured to communicate with the operator and a robot, wherein the NIC is connected to a manipulator state detector and an object detector, wherein the NIC acquires a manipulator state of the robot from the manipulator state detector, and a workpiece state representing a state between a workpiece and the manipulator from the object detector with respect to the assemble lines, wherein the NIC receives process flows representing process steps for assembling products via a network; wherein the AI system stores a speech-to-text program, and a display device configured to indicate the text data, the process data including the manipulator state and workpiece state according to a predetermined process information format for recording qualities of the products.
Adato teaches a cooperative operation system (System 100 may include or be connected to various network computing resources (e.g., servers, routers, switches, network connections, storage devices, etc.) necessary to support the services provided by system 100 [0120])
for maintaining process data of products on assembly lines (by processing data received from sensors positioned on retail shelves (for example, from detection elements described in relation to FIGS. 8A, 8B and 9), or processing data relating to a retail shelf, the data obtained by any other means [0587])
including the AI system of claim 1, wherein the AI system (an artificial neural network configured to recognize product types may be used to analyze the signals [0207])
is trained to convert speech to text, (For example, server 135 may access a speech recognition module to convert the received audio file to text format. Server 135 may also access a text recognition module to recognize the indication of a type of products in the text [0263]) comprising:
an input device (In some embodiments, server 135 may receive the image from an input device [0333]) configured to
acquire instructions of an operator; (server 135 includes processing device 202, Fig. 2; Processing device 202 may implement virtual machine technologies or other technologies to provide the ability to execute, control, run, manipulate, store, etc., multiple software processes, applications, programs [0130] Fig. 2)
a network interface controller (NIC) (server 135 may include network interface 206 [0029] Fig. 2; data may be transmitted wirelessly to the detection elements (e.g., to wireless network interface controllers forming part of the detection elements) [0200]) configured to
communicate with the operator and a robot, (The robotic devices may be controlled by server 135 and may be operated remotely or autonomously [0149]; Bus 200 may interconnect processing device 202, and network interface 206 [0237] Fig. 2)
wherein the NIC is connected to a manipulator state detector and an object detector, (For example, the robotic capturing devices may use input from sensors (e.g., image sensors, depth sensors, proximity sensors, etc.) [0149])
wherein the NIC acquires a manipulator state of the robot (In some embodiments, the robotic devices may include a robot on a track (e.g., a Cartesian robot configured to move along an edge of a shelf or in parallel to a shelf [0149])
from the manipulator state detector, (such as capturing device 125E [0149]) and
a workpiece state representing a state between a workpiece and the manipulator from the object detector (In another example, system 100 may prove a visualization of both the present state of a shelf, as depicted in image 3050, and the desired arrangement, as depicted in image 3060 [0631])
with respect to the assemble lines, (FIG. 6B illustrates a perspective view assembly diagram depicting a portion of a retail shelving unit 620 with multiple systems 500 [0163])
wherein the NIC receives process flows representing process steps for assembling products via a network; (The visualization may additionally include detailed instructions as to how to arrange the products to achieve the desired arrangement [0631])
wherein the AI system stores a speech-to-text program, (server 135 may be configured to access database 140 directly … Accessing database 140 may comprise storing/retrieving data stored in database 140 [0332])
the AI system converts the instructions from the input device into translated data of a predetermined language, (For example, an input may comprise a word corresponding with a type of product [0456]) and
converts the translated data into text data of the predetermined language using the speech-to-data program; (For example, server 135 may access a speech recognition module to convert the received audio file to text format [0263]) and
a display device (I/O system 210 may include a display screen (e.g., CRT, LCD, etc.) [0133])
configured to indicate the text data, (I/O system 210 configured to receive signals or input from devices and provide signals or output to one or more devices that allow data to be received [0133])
the process data including the manipulator state and workpiece state according to a predetermined process information format (Such alerts or recommendations may include any suitable format for conveying the non-compliance information to a particular entity [0769])
for recording qualities of the products (server may be configured to generate a record indicative of changes in product placement [0124])
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the method of Adato for the benefit of a speech recognition module to convert the received audio file to text format (Adato [0263])
10. Claims 17 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Rabinowitz et al. (US20170337464) in view of Keskar et al. (US20190130273) in view of Yang et al. (US20210019602 filed 07/18/2019) in view of Adato et al. (US20190236531) and further in view of Dehghani et al. (US20190354567)
Regarding claim 17, Modified Rabinowitz teaches the method of claim 16, Rabinowitz teaches wherein one or more layers of the inner DNN are connected to multiple layers of the subsequent DNN (The last subsequent DNN includes a number of indexed layers, and (ii) each layer in the number of indexed layers with index greater than one receives input from … (ii) one or more preceding layers of each preceding DNN in the sequence of DNNs [0065])
Modified Rabinowitz does not explicitly teach via a plurality of soft connections that scale outputs of the one or more layers of the inner DNN before adding the scaled outputs to the multiple layers of the outer DNN.
Dehghani teaches wherein one or more layers of the inner DNN are connected to multiple layers of the subsequent DNN via a plurality of soft connections that scale outputs of the one or more layers of the inner DNN (h1 representation 105a is connected to Self-attention Process 112b … and Self-attention Process 112m; h2 representation 105b is connected to Self-attention Process 112a … and Self-attention Process 112m, Fig. 1)
before adding the scaled outputs to the multiple layers of the outer DNN (output of recurrent encoder block 410 after T steps is sent to recurrent encoder block 420, Fig. 4)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the method of Dehghani for the benefit of computer systems with GPUs and other accelerator hardware to exploit the parallel computational structure of the Universal Transformer (Dehghani [0012])
Regarding claim 18, Modified Rabinowitz teaches the method of claim 17, Modified Rabinowitz does not explicitly teach wherein the soft connections scale the outputs based on weights trained simultaneously with parameters of the multi-dimensional neural network.
Dehghani teaches wherein the soft connections scale the outputs based on weights trained simultaneously with parameters of the multi-dimensional neural network (In addition, number of computational steps of the Universal Transformer can be varied dynamically after training because the model shares weights across its sequential computational steps [0006]; The system can apply the same series of operations iteratively .... In some implementations, the system also uses the same learned parameter values for each step [0034])
The same motivation to combine set forth for dependent claim 17 applies here.
11. Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Rabinowitz et al. (US20170337464) in view of Keskar et al. (US20190130273) in view of Yang et al. (US20210019602 filed 07/18/2019) in view of Adato et al. (US20190236531) in view of Dehghani et al. (US20190354567) and further in view of Ward et al. (US10210860 filed 08/22/2018).
Regarding claim 19, Modified Rabinowitz teaches the method of claim 17, Modified Rabinowitz does not explicitly teach wherein the sequence of deep neural networks is fully connected with the soft connections, such that all layers of the inner DNN are connected to all layers of the subsequent DNN with different weights determined by training simultaneously with the parameters of the multi-dimensional neural network.
Keskar teaches wherein the sequence of deep neural networks (In some embodiments, memory 130 may store a model 140 that is evaluated by processor 120 during sequence-to-sequence prediction. Model 140 may include a plurality of neural network layers [0020])
DNN are connected to layers of the subsequent DNN with different weights determined by training simultaneously with the parameters of the multi-dimensional neural network (the number of learned scaling parameters in branched attention encoder layer 320 f is O(M), where M denotes the number of branches 360 a-m. This may represent a small subset of the total number of learnable model parameters associated with branched attention encoder layer 320 f (e.g., the total number of weights and/or biases associated with parameterized attention layers 361 a-m and/or parameterized transformation layers 363 a-m) [0038] Fig. 3B)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the teachings of Keskar for the benefit of using learned scaling parameters which may reduce and/or prevent co-adaptation among branches 360 a-m during training, thereby improving the performance of branched transformer model 300 (Keskar [0038])
Modified Rabinowitz does not explicitly teach wherein the sequence of deep neural networks is fully connected with the soft connections, such that all layers of the inner DNN are connected to all layers of the subsequent DNN with different weights determined by training simultaneously with the parameters of the multi-dimensional neural network.
Ward teaches wherein the sequence of deep neural networks is fully connected with the soft connections, such that all layers of the inner DNN are connected to all layers of the subsequent DNN (Example neural network 1100 is a fully-connected neural network with multiple layers of hidden states, col 20, lines 7-8, Fig. 11. The Examiner notes Fig. 11 is a fully connected network with soft connections)
with different weights determined by training simultaneously with the parameters of the multi-dimensional neural network. (The weights in the linear combination may be referred to as the weights of the node, and each node may have different weights, col 4, lines 55-57; Each expert neural network layer may comprise the weights for the inbound edges to the nodes of the expert neural network layer and the activation function of the nodes, col 20, lines 23-26)
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Rabinowitz to incorporate the method of Ward for the benefit of the neural network architecture having the advantage over traditional ASR systems of being able to be repurposed to other classification-type tasks without hand tuning (Ward, col 14, lines 19-22).
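To illustrate the fully connected arrangement of claim 19 as mapped to Keskar and Ward, the following sketch (Python; the names and sizes are illustrative assumptions, not taken from the references) connects every layer of the inner DNN to every layer of the subsequent DNN through its own learned soft-connection weight; because the scalars are model parameters, they are trained together with all other weights of the network.

import torch
import torch.nn as nn

num_inner, num_outer, dim = 4, 4, 8
# One trainable scalar per (outer layer, inner layer) pair: a different weight
# for each soft connection, learned jointly with all other parameters.
mix = nn.Parameter(torch.zeros(num_outer, num_inner))

inner_outputs = [torch.randn(2, dim) for _ in range(num_inner)]
outer_layer_inputs = []
for j in range(num_outer):
    w = torch.softmax(mix[j], dim=0)  # weights for all inner layers feeding outer layer j
    outer_layer_inputs.append(sum(wi * h for wi, h in zip(w, inner_outputs)))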
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 8am-5pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michelle T Bechtold can be reached on (571) 431-0762. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/M.G./Examiner, Art Unit 2148
/MICHELLE T BECHTOLD/Supervisory Patent Examiner, Art Unit 2148