DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of the Claims
Claims 1-20 are pending.
Claim 17 is amended.
Claims 1-10 and 12-20 are rejected.
Claim 11 is objected to.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), and Buddemeiser et al. (US 20030043153).
Regarding claim 1:
Parbhavalkar teaches:
A computer-implemented method (Parbhavalkar [0095] The computer program product contains instructions that, when executed, perform one or more methods, such as those described above.), comprising:
receiving audio data corresponding to an utterance of speech (Parbhavalkar [0010] One aspect of the disclosure provides a method for biasing speech recognition that includes receiving, at data processing hardware, audio data encoding an utterance, and obtaining, by the data processing hardware, a set of one or more biasing phrases corresponding to a context of the utterance, each biasing phrase in the set of one or more biasing phrases includes one or more words.);
computing, using a transformer-based audio encoder and a decoder, a weighted vector indicative of a plurality of features associated with the audio data (Parbhavalkar [0042] Accordingly, concatenating the contextual biasing vector 138 with the weighted audio encoding vector 136 into a weighted vector “injects” contextual biasing into the speech recognition model 300. The weighted vector 140 collectively represents the audio, grapheme, and phoneme information. The weighted vector 140 is input to the decoder 142.);
Parbhavalkar does not teach:
computing, using the weighted vector and one or more component vectors, an animation vector corresponding to one or more positions for one or more feature points associated with a digital character representation;
rendering the digital character representation based, at least, on the animation vector.
Yu teaches:
computing, using the weighted vector and one or more component vectors, an animation vector corresponding to one or more positions for one or more feature points associated with a digital character representation (Yu [Abstract] The invention relates to the field of two-dimensional animation and provides an automatic match method of two-dimensional animation characters in the two-dimensional animation production environments. The method includes: respectively extracting feature points in characters according to character information in two key frames; allocating the dimension and the direction for each feature point in each character by means of a feature description algorithm and generating a high-dimensional feature vector; constructing a Markov random field satisfying the adjacency relation based on the obtained feature points; and computing the maximum posterior probability and seeking a minimum point of an energy function according to the obtained Markov random field and by combining the obtained high-dimensional feature vectors so as to establish match relations of the animation characters.); and
Buddemeiser teaches:
rendering the digital character representation based, at least, on the animation vector (Buddemeiser [0005] The present invention provides a technique for translating facial animation values to head mesh positions for rendering facial features of an animated avatar.):
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Yu and Buddemeiser. Having vectors correspond to positions and features of an animation and then rendering the character, as in Yu and Buddemeiser, would benefit the Parbhavalkar teachings by allowing the audio to be paired with a visual representation. Additionally, this is the application of a known technique, combining audio and character positions, to yield predictable results.
Claim(s) 2 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), Liu (US 20190130562), and Weston et al. (EP 3937170).
Regarding claim 2:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 1,
Parbhavalkar, Yu, and Buddemeiser do not teach:
wherein the transformer-based audio encoder is a pre-trained audio encoder, further comprising:
training the decoder based, at least, on the transformer-based audio encoder,
wherein parameters for the transformer-based audio encoder are locked while the decoder is trained.
Weston teaches:
wherein the transformer-based audio encoder is a pre-trained audio encoder, further comprising (Weston [pg12 par2] Firstly, the trained Transformer encoder stack 412 could simply be frozen, i.e., not part of the ongoing task-specific training but simply left in its state following pre-training, in which it translates a sequence of audio-linguistic tokens to a combined audio-linguistic representation sequence which can be classified to provide the diagnosis.):
training the decoder based, at least, on the transformer-based audio encoder (Weston [pg4 par8] Preferably, in these examples, training comprises training the encoder and decoder together to map the input sequence of audio and linguistic representations to the target output.),
Liu teaches:
wherein parameters for the transformer-based audio encoder are locked while the decoder is trained (Liu [0024] The network training is performed in two stages: the encoder is learned; then the 3D decoder is added and fine-tuned with the encoder parameters locked.).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Weston and Liu. Pre-training the transformer encoder and training the encoder and decoder together, as in Weston and Liu, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by allowing the encoder to be trained beforehand. Additionally, this is the application of a known technique, using a pre-trained encoder, to yield predictable results.
Claim(s) 3 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Zhou et al. (US 20180137857).
Regarding claim 3:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 1,
Parbhavalkar, Yu, and Buddemeiser do not teach:
wherein the plurality of features are selected during training.
Zhou teaches:
wherein the plurality of features are selected during training (Zhou [0006] The method further includes performing, with the processor, a training process for a neural network ranker using the plurality of feature vectors corresponding to the plurality of training speech recognition results as inputs to the neural network ranker).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Zhou. Selecting features during training, as in Zhou, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by allowing the training to be customized. Additionally, this is the application of a known technique, choosing which features to train on, to yield predictable results.
Claim(s) 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Yang et al. (CN 112822068).
Regarding claim 4:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 1, further comprising:
Parbhavalkar, Yu, and Buddemeiser do not teach:
receiving a respective layer vector associated with the plurality of features for layers of the transformer-based audio encoder; determining, for individual layers, a layer weight; applying the layer weight to the respective individual layer; and determining the weighted vector.
Yang teaches:
receiving a respective layer vector associated with the plurality of features for layers of the transformer-based audio encoder; determining, for individual layers, a layer weight; applying the layer weight to the respective individual layer; and determining the weighted vector (Yang [pg15 par2] obtaining an initial model comprising a feature extraction layer, a weight calculation layer, and a prediction layer. The structure of the initial model is shown in FIG. 8: the initial model is a three-layer model comprising a feature extraction layer, a weight calculation layer, and a prediction layer. The feature extraction layer is used to obtain a sentence vector for each sentence in the communication data; as shown in FIG. 8, Si represents one item of communication data, and the feature extraction layer can receive a plurality of communication data items at the same time. The sentence vector of each item of communication data is extracted by the feature extraction layer, which can be implemented through, but not limited to, BERT (Bidirectional Encoder Representations from Transformers). The weight calculation layer takes the sentence vectors as input and calculates the weight of each sentence vector; the weight calculation layer can be implemented through, but not limited to, BiLSTM (Bi-directional Long Short-Term Memory).).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Yang. Computing layer weights and a weighted vector, as in Yang, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by allowing features from individual layers to be weighted and combined. Additionally, this is the application of a known technique, applying different weights to vectors, to yield predictable results.
Claim(s) 5 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Liu et al. (US 20210266274).
Regarding claim 5:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 1,
Parbhavalkar, Yu, and Buddemeiser do not teach:
wherein the audio data has a duration less than a threshold duration.
Liu teaches:
wherein the audio data has a duration less than a threshold duration (Liu [0067] When the audio duration corresponding to the audio data is less than or equal to the duration threshold).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Liu. Requiring the audio duration to be less than a threshold, as in Liu, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by limiting the amount of audio to be processed. Additionally, this is the application of a known technique, applying an audio duration threshold, to yield predictable results.
Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Deng et al. (CN 110837738).
Regarding claim 6:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 5,
Parbhavalkar, Yu, and Buddemeiser do not teach:
wherein the plurality of features are determined using transformer layers.
Deng teaches:
wherein the plurality of features are determined using transformer layers (Deng [pg3 par11] performing feature extraction on the vector matrix using a first transformer layer to obtain a first feature matrix).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Deng. Extracting features using transformer layers, as in Deng, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by providing a way to determine the plurality of features. Additionally, this is the application of a known technique, extracting features using transformer layers, to yield predictable results.
Claim(s) 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Gareth et al. (GB 2568475).
Regarding claim 7:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 1,
Parbhavalkar, Yu, and Buddemeiser do not teach:
wherein the one or more feature points corresponds to at least one of facial features, a tongue position, an eye position, or an extremity position.
Gareth teaches:
wherein the one or more feature points corresponds to at least one of facial features, a tongue position, an eye position, or an extremity position (Gareth [0034] The feature points correspond to facial features such as key points on eyes, nose, lips etc).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Gareth. Having feature points correspond to facial features, as in Gareth, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by anchoring the feature points to concrete facial features. Additionally, this is the application of a known technique, having feature points correspond to facial features, to yield predictable results.
Claim(s) 8 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Davis et al. (US 9705904).
Regarding claim 8:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 1,
Parbhavalkar, Yu, and Buddemeiser do not teach:
wherein the plurality of features are extracted from the audio data via processing using a convolutional neural network (CNN).
Davis teaches:
wherein the plurality of features are extracted from the audio data via processing using a convolutional neural network (CNN) (Davis [0016] Approaches such as convolutional neural networks can yield classifiers that can learn to extract features that are at least as effective as human-engineered features. While such models are currently applied to image and audio data).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Davis. Using a CNN to extract features, as in Davis, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by having a way to extract features using a neural network. Additionally, this is the application of a known technique, extracting features using a CNN, to yield predictable results.
Claim(s) 9 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Hewage et al. (WO 2019092459).
Regarding claim 9:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 1,
Parbhavalkar, Yu, and Buddemeiser do not teach:
wherein the component vector includes at least one of an emotion vector or a style vector.
Hewage teaches:
wherein the component vector includes at least one of an emotion vector or a style vector (Hewage [0020] The present disclosure provides methods and apparatus for a machine learning technique that uses a latent vector from a latent vector space in classifying input data, where the latent vector includes a label vector y and a style vector z).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Hewage. Having a style vector, as in Hewage, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by having a vector associated with the style. Additionally, this is the application of a known technique, using a style vector, to yield predictable results.
Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Yu et al. (CN 102708583), Buddemeiser et al. (US 20030043153), and Ludwig et al. (CN 104349074).
Regarding claim 10:
Parbhavalkar, Yu, and Buddemeiser teach:
The computer-implemented method of claim 1,
Parbhavalkar, Yu, and Buddemeiser do not teach:
wherein the decoder disregards information from one or more previous frames.
Ludwig teaches:
wherein the decoder disregards information from one or more previous frames (Ludwig [pg6 par11] In addition, in the inter-frame mode, the ignore (skip) mode can be used to ignore blocks for coding, without sending a residual or motion vector; the encoder only records that the block is ignored. The decoder derives the image information of the ignored block from other blocks that have already been decoded; preferably, the ignored block is derived from blocks of the same frame of the digital video data or from image information of the previous frame.).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Yu, and Buddemeiser with Ludwig. Disregarding information from previous frames, as in Ludwig, would benefit the Parbhavalkar, Yu, and Buddemeiser teachings by avoiding processing of redundant data. Additionally, this is the application of a known technique, ignoring previous frames during coding, to yield predictable results.
Claim(s) 12 and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), and Kudo et al. (JP 2003346181).
Regarding claim 12:
Parbhavalkar teaches:
A processor comprising: one or more processing units to (Parbhavalkar [0088] The method 500 may be executed on data processing hardware 610 (FIG. 6) residing on a user device 106 associated with a user 102 that spoke the utterance 104.):
compute, using a transformer-based audio encoder and based, at least, on audio data corresponding to speech (Parbhavalkar [0010] One aspect of the disclosure provides a method for biasing speech recognition that includes receiving, at data processing hardware, audio data encoding an utterance, and obtaining, by the data processing hardware, a set of one or more biasing phrases corresponding to a context of the utterance, each biasing phrase in the set of one or more biasing phrases includes one or more words.),
a weighted feature vector associated with the audio data (Parbhavalkar [0042] Accordingly, concatenating the contextual biasing vector 138 with the weighted audio encoding vector 136 into a weighted vector “injects” contextual biasing into the speech recognition model 300. The weighted vector 140 collectively represents the audio, grapheme, and phoneme information. The weighted vector 140 is input to the decoder 142.);
Parbhavalkar does not teach:
compute, using the weighted feature vector and a component vector indicative of one or more properties associated with the speech, position data for one or more feature points of one or more deformable bodily components of a virtual character; and
render, for one or more time points in a sequence of time points of the audio data, image data representative of the virtual character based, at least, on the position data to generate an animation of the character appearing to utter the speech.
Seo and Park teach:
compute, using the weighted feature vector and a component vector indicative of one or more properties associated with the speech, position data for one or more feature points of one or more deformable bodily components of a virtual character (Seo [pg 7 par11] As described above, the costume image of the wearer's clothing 60 is a stored standard image, and the feature points are composed of feature vectors for identifying the positions of the feature points. The detail area is a subdivision of the whole area of the costume image according to the body part reference. Park [0032] One or more components, each also vector representations, make up the overall vector representation. One embodiment uses, as a component, a vector representation of the unit of narrative text. Another embodiment uses, as a component, a vector representation of a part of speech (e.g. a noun, verb, adjective, adverb, or another part of speech) corresponding to the unit of narrative text.); and
Kudo teaches:
render, for one or more time points in a sequence of time points of the audio data, image data representative of the virtual character based, at least, on the position data to generate an animation of the character appearing to utter the speech (Kudo [0002] In recent years, a three-dimensional (3D) animation image is created and displayed using computer graphic (CG) technology. [0003] Here, as one form of animation, a motion image of the state of the mouth of the face and the state of the facial expression corresponding to the utterance and emotion of the character is displayed. [0004] When creating a facial animation of utterances and facial expressions in CG in this way, a facial model animation pattern image is created for each frame of the video. They were arranged along the time axis and played continuously.).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park and Kudo. Having vectors correspond with speech, position, and body features of an animation and then rendering the character, as in Seo, Park and Kudo, would benefit the Parbhavalkar teachings by allowing a way to create a visual representation. Additionally, this is the application of a known technique, combining speech, position, and body parts to create an animation, to yield predictable results.
Regarding claim 16:
Parbhavalkar, Seo, Park, and Kudo teach:
The processor of claim 12,
wherein the processor is comprised in at least one of: a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for performing digital twin operations; a system for performing light transport simulation; a system for rendering graphical output; a system for performing deep learning operations; a system implemented using an edge device; a system for generating or presenting virtual reality (VR) content; a system for generating or presenting augmented reality (AR) content; a system for generating or presenting mixed reality (MR) content; a system incorporating one or more Virtual Machines (VMs); a system for performing operations for a conversational AI application; a system for performing operations for a generative AI application; a system for performing operations using a language model; a system for performing one or more generative content operations using a large language model (LLM); a system implemented at least partially in a data center; a system for performing hardware testing using simulation; a system for performing one or more generative content operations using a language model; a system for synthetic data generation; a collaborative content creation platform for 3D assets; or a system implemented at least partially using cloud computing resources (Parbhavalkar [0088] Optionally, the data processing hardware 610 may reside on a remote device (e.g., server of a cloud-based computing environment) in communication with the user device 106, e.g., over a network.).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park and Kudo. Using a cloud computing environment, as in Parbhavalkar, would benefit the Seo, Park and Kudo teachings by allowing the processing load to be reallocated. Additionally, this is the application of a known technique, using cloud computing, to yield predictable results.
Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Kudo et al. (JP 2003346181), and Yang (CN 112822068).
Regarding claim 13:
Parbhavalkar, Seo, Park, and Kudo teach:
The processor of claim 12,
Parbhavalkar, Seo, Park, and Kudo do not teach:
wherein the weighted feature vector is based, at least, on respective layer vectors for individual layers of the transformer-based audio encoder, wherein individual layer vectors are associated with a plurality of features extracted from the audio data.
Yang teaches:
wherein the weighted feature vector is based, at least, on respective layer vectors for individual layers of the transformer-based audio encoder, wherein individual layer vectors are associated with a plurality of features extracted from the audio data (Yang [pg15 par2] obtaining an initial model comprising a feature extraction layer, a weight calculation layer, and a prediction layer. The structure of the initial model is shown in FIG. 8: the initial model is a three-layer model comprising a feature extraction layer, a weight calculation layer, and a prediction layer. The feature extraction layer is used to obtain a sentence vector for each sentence in the communication data; as shown in FIG. 8, Si represents one item of communication data, and the feature extraction layer can receive a plurality of communication data items at the same time. The sentence vector of each item of communication data is extracted by the feature extraction layer, which can be implemented through, but not limited to, BERT (Bidirectional Encoder Representations from Transformers). The weight calculation layer takes the sentence vectors as input and calculates the weight of each sentence vector; the weight calculation layer can be implemented through, but not limited to, BiLSTM (Bi-directional Long Short-Term Memory).).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Seo, Park, and Kudo with Yang. Computing layer weights and a weighted feature vector, as in Yang, would benefit the Parbhavalkar, Seo, Park, and Kudo teachings by allowing features from individual layers to be weighted and combined. Additionally, this is the application of a known technique, applying different weights to vectors, to yield predictable results.
Claim(s) 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Kudo et al. (JP 2003346181), and Liu (US 20190130562).
Regarding claim 14:
Parbhavalkar, Seo, Park, and Kudo teach:
The processor of claim 12,
Parbhavalkar, Seo, Park, and Kudo do not teach:
wherein parameters of the transformer-based audio encoder are locked during a training process for an associated decoder.
Liu teaches:
wherein parameters of the transformer-based audio encoder are locked during a training process for an associated decoder (Liu [0024] The network training is performed in two stages: the encoder is learned; then the 3D decoder is added and fine-tuned with the encoder parameters locked.).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Seo, Park, and Kudo with Liu. Locking the encoder parameters while the decoder is trained, as in Liu, would benefit the Parbhavalkar, Seo, Park, and Kudo teachings by preserving the trained encoder during decoder training. Additionally, this is the application of a known technique, freezing encoder parameters during fine-tuning, to yield predictable results.
Claim(s) 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Kudo et al. (JP 2003346181), and Hewage (WO 2019092459).
Regarding claim 15:
Parbhavalkar, Seo, Park, and Kudo teach:
The processor of claim 12,
Parbhavalkar, Seo, Park, and Kudo do not teach:
wherein the component vector includes at least one of an emotion vector or a style vector.
Hewage teaches:
wherein the component vector includes at least one of an emotion vector or a style vector (Hewage [0020] The present disclosure provides methods and apparatus for a machine learning technique that uses a latent vector from a latent vector space in classifying input data, where the latent vector includes a label vector y and a style vector z).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Seo, Park, and Kudo with Hewage. Having a style vector, as in Hewage, would benefit the Parbhavalkar, Seo, Park, and Kudo teachings by having a vector associated with the style. Additionally, this is the application of a known technique, using a style vector, to yield predictable results.
Claim(s) 17, 18, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), and Hewage et al. (WO 2019092459).
Regarding claim 17:
Parbhavalkar, Seo, and Park teach:
A system, comprising: one or more processing units to generate an animation of a character using position data representative of one or more positions of one or more feature points of the character (Parbhavalkar [0088] The method 500 may be executed on data processing hardware 610 (FIG. 6) residing on a user device 106 associated with a user 102 that spoke the utterance 104. [0010] One aspect of the disclosure provides a method for biasing speech recognition that includes receiving, at data processing hardware, audio data encoding an utterance, and obtaining, by the data processing hardware, a set of one or more biasing phrases corresponding to a context of the utterance, each biasing phrase in the set of one or more biasing phrases includes one or more words.) (Seo [pg7 par3] As described above, the costume image of the wearer's clothing 60 is a stored standard image, and the feature points are composed of feature vectors for identifying the positions of the feature points. The detail area is a subdivision of the whole area of the costume image according to the body part reference.) (Park [0032] One or more components, each also vector representations, make up the overall vector representation. One embodiment uses, as a component, a vector representation of the unit of narrative text. Another embodiment uses, as a component, a vector representation of a part of speech (e.g. a noun, verb, adjective, adverb, or another part of speech) corresponding to the unit of narrative text.),
Parbhavalkar, Seo, and Park do not teach:
the position data computed based at least in part on a transformer-based audio encoder processing audio data representative of speech and component data indicative of one or more values corresponding to at least one of a style parameter or an emotion parameter associated with the speech.
Hewage teaches:
the position data computed based at least in part on a transformer-based audio encoder processing audio data representative of the speech and component data indicative of one or more values corresponding to at least one of a style parameter or an emotion parameter associated with the speech (Hewage [0020] The present disclosure provides methods and apparatus for a machine learning technique that uses a latent vector from a latent vector space in classifying input data, where the latent vector includes a label vector y and a style vector z).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park, and Hewage. Having vectors that correspond to speech, position, and body features of an animation, and having a style vector, as in Seo, Park, and Hewage, would benefit the Parbhavalkar teachings by providing a way to create a visual representation of the recognized speech. Additionally, this is the application of a known technique, combining speech, position, and body parts to create an animation, to yield predictable results.
Regarding claim 18:
Parbhavalkar, Seo, Park, and Hewage teach:
The system of claim 17;
wherein the transformer-based audio encoder computes a weighted feature vector based, at least, on respective layer vectors for individual layers of the transformer-based audio encoder (Seo [pg 7 par11] As described above, the costume image of the wearer's clothing 60 is a stored standard image, and the feature points are composed of feature vectors for identifying the positions of the feature points. The detail area is a subdivision of the whole area of the costume image according to the body part reference. Park [0032] One or more components, each also vector representations, make up the overall vector representation. One embodiment uses, as a component, a vector representation of the unit of narrative text. Another embodiment uses, as a component, a vector representation of a part of speech (e.g. a noun, verb, adjective, adverb, or another part of speech) corresponding to the unit of narrative text.).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park, and Hewage. Having a feature vector, as in Seo, Park, and Hewage, would benefit the Parbhavalkar teachings by providing a vector that stores feature information. Additionally, this is the application of a known technique, having a feature vector in an audio encoder, to yield predictable results.
Regarding claim 20:
Parbhavalkar, Seo, Park, and Hewage teach:
The system of claim 17;
wherein the system comprises at least one of: a system for performing simulation operations; a system for performing simulation operations to test or validate autonomous machine applications; a system for performing digital twin operations; a system for performing light transport simulation; a system for rendering graphical output; a system for performing deep learning operations; a system implemented using an edge device; a system for generating or presenting virtual reality (VR) content; a system for generating or presenting augmented reality (AR) content; a system for generating or presenting mixed reality (MR) content; a system incorporating one or more Virtual Machines (VMs); a system for performing operations for a conversational AI application; a system for performing operations for a generative AI application; a system for performing operations using a language model; a system for performing one or more generative content operations using a large language model (LLM); a system implemented at least partially in a data center; a system for performing hardware testing using simulation; a system for performing one or more generative content operations using a language model; a system for synthetic data generation; a collaborative content creation platform for 3D assets; or a system implemented at least partially using cloud computing resources (Parbhavalkar [0088] Optionally, the data processing hardware 610 may reside on a remote device (e.g., server of a cloud-based computing environment) in communication with the user device 106, e.g., over a network.).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar with Seo, Park, and Hewage. Using a cloud computing network, as in Parbhavalkar, would benefit the Seo, Park, and Hewage teachings by allowing the processing load to be reallocated. Additionally, this is the application of a known technique, using cloud computing, to yield predictable results.
Claim(s) 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Parbhavalkar et al. (US 20200402501) in view of Seo et al. (KR 101720016), Park et al. (US 20200320171), Hewage et al. (WO 2019092459), and Liu et al (US 20190130562).
Regarding claim 19:
Parbhavalkar, Seo, Park, and Hewage teach:
The system of claim 17;
Parbhavalkar, Seo, Park, and Hewage do not teach:
wherein parameters of the transformer-based audio encoder are locked during a training process for an associated decoder.
Liu teaches:
wherein parameters of the transformer-based audio encoder are locked during a training process for an associated decoder (Liu [0024] The network training is performed in two stages: the encoder is learned; then the 3D decoder is added and fine-tuned with the encoder parameters locked.).
Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to combine the teachings of Parbhavalkar, Seo, Park, and Hewage with Liu. Locking the parameters, as in Liu, would benefit the Parbhavalkar, Seo, Park, and Hewage teachings by ensuring the encoder parameters do not change during decoder training. Additionally, this is the application of a known technique, locking parameters during training, to yield predictable results.
Allowable Subject Matter
In regards to claim 11, the cited prior art fails to teach the following limitations in that claim: “…penalizing motion between neighboring frames when a volume of the audio data is less than a volume threshold.” Therefore, claim 11 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Response to Arguments
Applicant's arguments filed 10/23/2025 have been fully considered but they are not persuasive.
Applicant argues that there is no motivation to modify Parbhavalkar in view of Yu and Buddemeiser. Parbhavalkar is directed toward recognizing input speech for transcription purposes, Yu is directed toward automatic matching of two-dimensional animation to an animation field, and Buddemeiser is directed toward translating facial animation values to head mesh positions for rendering facial features of an animated avatar.
The specification of the claimed invention at [0002] states: It may be desirable for various operations to animate a character to appear as if that character is uttering speech represented by audio data. Due in part to the time and complexity of creating such animation, it can be beneficial to automate such a process, particularly for real-time or near real-time operations. Machine-learning based approaches have been used to generate animation of characters based on input audio, but these prior approaches are generally limited in their capabilities, producing animation that is not sufficiently realistic in many instances. For example, a prior approach can attempt to animate various facial features of a character, including the mouth or eyes, in order to correspond to speech represented by corresponding audio data - but these models often fail to provide realistic animations when used on languages that the model is not explicitly trained for. This issue may be exacerbated for operations where the character is a virtual human that is intended to appear as an actual human that is uttering the speech in a realistic manner with realistic behavior.
While Parbhavalkar itself does not address the need for animation, the specification of the invention states that speech recognition has been used previously to create these animations. Therefore, it would be reasonable to assume that one skilled in the art would combine speech recognition software with animation software.
This same reasoning is applied to all independent and dependent claims.
Therefore, the rejection has not been withdrawn.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DENIS VASILIY MINKO whose telephone number is (571)270-5226. The examiner can normally be reached Monday-Thursday 8:30-6:00 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Said Broome, can be reached at 571-272-2931. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/DENIS VASILIY MINKO/Examiner, Art Unit 2612
/Said Broome/Supervisory Patent Examiner, Art Unit 2612