DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 02/20/2026 has been entered.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-4, 6-7, 9-14, 16-17 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Zhuang et al. (Music2Dance: DanceNet for Music-Driven Dance Generation, ACM Trans. Multimedia Comput. Commun. Appl., Vol. 18, No. 2, Article 65, Publication date: February 2022, hereinafter “Zhuang”) in view of Joe (KR20070025384A, hereinafter “Joe”), and further in view of Carlbom et al. (US 20070294061 A1, hereinafter “Carlbom”). A machine-translated English version of Joe is attached.
Regarding claim 1, Zhuang teaches A method comprising:
receiving, by a processor, a real-time acoustic signal comprising a plurality of acoustic segments; (page 65:5, Fig. 2: “our framework” can be regarded as a processor to perform all operations, including receiving “Music (wave)”, extracting “Musical features”, classifying “Musical style”, generating musical context-aware code by “Musical context-aware encoder”, and exporting dance motion by “Motion decoder”; page 65:3, section 1, “modern dance for about 26.15 minutes (94,155 frames, frames per second (FPS) = 60), and curtilage dance for about 31.72 minutes (114,192 frames, FPS = 60)”; page 65:7, Fig. 4: “Music(wave) and different music features. Music is represented by wave, with the mel spectrum as its basic feature. Chroma, beat (beat, downbeat), and onset are its high-level features”; page 65:8, section 6.1, “the music features mo are fed to the music encoder via the sliding window. To improve the training speed, the window size is fixed during the training phase, and we set it to 8 seconds. In the inference phase, the music-context code of each clip is extracted via the sliding window (overlapping 2 seconds) and then stitched to form a complete music-context code”). Note that: (1) the framework as a processor receives the music waves as acoustic signals; (2) it is obvious to one having ordinary skill in the art that the music waves from video datasets with 60 FPS are processed for dance motion synthesis, indicating that acoustic signals can be obtained and processed in real-time mode based on the music features (e.g., beat and onset) that have to be extracted in real time to reflect the music itself in inferring the dance motion synthesis of the application; and (3) by using the sliding window (2-8 seconds), the music waves are processed as acoustic or wave segments with the corresponding sliding window length.
generating using a danceability neural network a danceability score for each of the acoustic segments; (page 65:5, Fig. 2: “Musical style classifier” classifying “Musical style”, “Musical context-aware encoder” as CNN layers plus a “Bi-LSTM” layer outputting the music-context code, “Motion encoder” outputting the motion code, and “DanceNet takes the musical style, rhythm, and melody as control signals to generate dance motion”;
[media_image1.png (greyscale)]
page 65:8, section 5.3, “we adopt three temporal convolution layers and one Bi-LSTM (Bi-directional LSTM) layer as a high-dimensional feature extractor, and finally it is classified by a fully connected layer”; page 65:8, section 6.1, “We combine temporal convolution and Bi-LSTM as the music encoder to represent the musical rhythm and melody at every moment, while considering the contextual information of the music … since dance motion is smooth, the Bi-LSTM is added after time convolution to fuse the context information and ensure that the music-context code is smooth (when we used madmom to extract music features, we found that the chroma, beat, downbeat, and onset of a music clip would change drastically—jitter problem, as shown in Figure 4)”; page 65:8, section 6.1, “(2) Motion encoder. We stack two “Conv1D+Relu” modules as the motion encoder to encode the past k frames. The convolution kernel is set to 1, which ensures that each frame motion code (512 channels) is independent”). Note that: (1) the classified “Musical style” (2-channel), the music-context code (30-channel) indicating the smoothing degree of the music as the output of “Musical context-aware encoder”, and the motion code (512-channel) indicating motion features as the output of “Motion encoder”, can be regarded as a combined danceability score (e.g., a score vector) at every moment of the music to reflect the local and global music features and to control the dance motion generation as control signals; and (2) the music feature extractor, the music style classifier, the “Motion encoder”, and the “Musical context-aware encoder” are all neural networks. based on a plurality of test quantification scores of test acoustic segments (page 65:5, Fig.
2: “Music wave” comprising “Musical features” is received by the neural networks; page 65:8, section 5.3, “we adopt three temporal convolution layers and one Bi-LSTM (Bi-directional LSTM) layer as a high-dimensional feature extractor, and finally it is classified by a fully connected layer”; page 65:8, section 6.1, “We combine temporal convolution and Bi-LSTM as the music encoder to represent the musical rhythm and melody at every moment, while considering the contextual information of the music … since dance motion is smooth, the Bi-LSTM is added after time convolution to fuse the context information and ensure that the music-context code is smooth (when we used madmom to extract music features, we found that the chroma, beat, downbeat, and onset of a music clip would change drastically—jitter problem, as shown in Figure 4)”; page 65:5, section 4, “The musical style contains two types: smoothing music and fast-rhythm music”; page 65:8, section 6.1, “since dance motion is smooth, the Bi-LSTM is added after time convolution to fuse the context information and ensure that the music-context code is smooth”). Note that: it is obvious to one having ordinary skill in the art that: (1) consistent with the specification (see para.
[0067] of this application), the input “Music (wave)” is regarded as the test acoustic signal for neural network training and inference; (2) during the training and inference of the neural networks, the plurality of test acoustic signals, as the plurality of test acoustic segments from the test dataset obtained by applying sliding windows, are received by the neural networks; (3) the sliding window as specified for claim 1 above is used to make the plurality of test acoustic segments from the test dataset; (4) the test acoustic segments are encoded into the “Musical style” and the music-context code values as two parts of the danceability scores based on “Musical features”; and (5) “Musical style” (2-channel) and the music-context code values (30-channel) are test quantification scores of test acoustic segments during the test. and a plurality of momentum scores associated with a plurality of body parts of a dancer from a plurality of test video segments; (page 65:3, section 1, “we captured several high-quality music-dance motion pair datasets to boost the performance of music dance generation”; page 65:5, Fig. 2: “Music (wave)” and “Previous k frames” from paired video are received by the neural networks of the framework; page 65:5, Fig. 2: “Motion encoder” encodes skeletal approximation of the dancer with a skeleton representation of the dancer; page 65:6, section 5.1, “The motion information at tth frame in the motion data contains 55 joints: one joint is the root joint, whose motion is represented by translation and rotation related to the world coordinate (hx,t ,hy,t ,hz,t , rx,t , ry,t , rz,t , x-y-z are the three axes in the Cartesian coordinate system), and the remaining joints are represented by the rotation related to their parent joints (rbx,t , rby,t , rbz,t, b is the joint index) … our method can describe the next frame motion, which indicates the invariant of our data representation.
The joint rotation motion at tth frame xrott can be described as follows: xrot t = [Δhx,t ,hy,t , Δhz,t , Δry,t , rx,t , rz,t ,r2x,t , r2y,t , r2z,t , . . . , rbx,t , rby,t , rbz,t ]. (1) The new representation would also generate a large accumulation of errors”; page 65:8, “We stack two “Conv1D+Relu” modules as the motion encoder to encode the past k frames. The convolution kernel is set to 1, which ensures that each frame motion code (512 channels) is independent”; page 65:5, Fig. 2: the frame motion code as the output of “Motion encoder” can be regarded as a test danceability score to affect “Residual motion control stacked module”). Note that: (1) the test dataset includes the music-dance motion pair datasets; (2) it is obvious to one having ordinary skill in the art that the frames are from test videos and corresponding video segments paired to the test music wave (acoustic signals) and acoustic segments while the sliding window is applied to the input test dataset; (3) the motion information at the tth frame can be regarded as the body poses, with the corresponding translation and rotation of joints and body parts, for each of the test video frames in the test dataset of the dancers.
The new skeleton representation of the dancer to reflect the body or skeleton poses can be encoded and determined by “Motion encoder” as the motion code values; (4) the frame motion code values of the frames of the test dataset as the output of “Motion encoder” can be regarded as the plurality of momentum scores associated with a plurality of body parts and joints of the skeleton representation of the dancer; (5) the frame motion code as one part of the test danceability score corresponds to each of the test video segments of the test dataset and is generated as the output of “Motion encoder”; and (6) the danceability score, including the “Musical style”, the music-context code and the frame motion code, is based on the plurality of test quantification scores and the plurality of momentum scores. , wherein each of the plurality of momentum scores represent energy or effort exerted by the dancer for each of the body parts in each of the test video segments, wherein the danceability score is associated with a whole body of the dancer; (page 65:5, Fig. 2: “Motion encoder” encodes skeletal approximation of the dancer with a skeleton representation of the dancer; page 65:6, section 5.1, “The motion information at tth frame in the motion data contains 55 joints: one joint is the root joint, whose motion is represented by translation and rotation related to the world coordinate (hx,t ,hy,t ,hz,t , rx,t , ry,t , rz,t , x-y-z are the three axes in the Cartesian coordinate system), and the remaining joints are represented by the rotation related to their parent joints (rbx,t , rby,t , rbz,t, b is the joint index) … our method can describe the next frame motion, which indicates the invariant of our data representation. The joint rotation motion at tth frame xrott can be described as follows: xrot t = [Δhx,t ,hy,t , Δhz,t , Δry,t , rx,t , rz,t ,r2x,t , r2y,t , r2z,t , . . . , rbx,t , rby,t , rbz,t ].
(1) The new representation would also generate a large accumulation of errors”; page 65:8, “We stack two “Conv1D+Relu” modules as the motion encoder to encode the past k frames. The convolution kernel is set to 1, which ensures that each frame motion code (512 channels) is independent”; page 65:5, Fig. 2: the frame motion code as the output of “Motion encoder” can be regarded as a test danceability score to affect “Residual motion control stacked module”). Note that: (1) each of the plurality of momentum scores (frame motion codes) describes the translation and rotation (movement and motion) of each of the joints and body parts, and represents the kinetic energy and angular kinetic energy of the dancing poses and efforts of the dancer; and (2) the combined danceability score, including the “Musical style”, the music-context code and the frame motion code (momentum score), is based on the plurality of test quantification scores and the plurality of momentum scores associated with the body parts and joints (a whole body) of the dancer.
generating a real-time animation of a first avatar and a second avatar based on the danceability score and avatar characteristics associated with the first avatar and the second avatar; and (page 65:3, Fig. 1: “Our method can directly generate realistic dance motion sequences from the input music (wave)”; page 65:5, Fig. 2: “DanceNet takes the musical style, rhythm, and melody as control signals to generate dance motion” with “Previous k frames” and “Current frame”). Note that: DanceNet can generate a real-time animation of the animated dancer or the line-linked skeleton figure based on the music features (e.g., beat and onset) in real time to reflect and match the music features in inferring the dance motion synthesis of the application, while the danceability score as a control signal affects the generation of the real-time animation.
However, Zhuang fails to disclose, but in the same art of computer graphics, Joe discloses a first avatar and a second avatar … and avatar characteristics associated with the first avatar and the second avatar (Joe, page 17, lines 6-8, “The method of claim 1, Between step (a) and step (b), (a1) selecting one or more avatars to perform a dance movement from among a plurality of avatar characters provided by the avatar editor”; page 7, lines 23-26, “The avatar and prop selection unit 212 provides an interface that allows the client terminal 100 to select an avatar to produce a dance or a character prop to decorate the avatar, and selects an avatar and a character prop to produce a dance by the client terminal”). Note that: (1) the method allows the selection or generation of one or more dancing avatars to perform a dance movement (dance animation) as a first avatar and a second avatar; (2) an interface enables the client terminal to select a character prop to decorate the avatar or configure the characteristics of the avatar; (3) it is obvious to one having ordinary skill in the art that the avatar’s characteristics (e.g., various props) can be used as an additional control signal to the neural networks of the dance motion generator (e.g., DanceNet of Zhuang) so that the avatar animations are based on the danceability score and also on the avatar characteristics; and (4) the avatars can substitute for the animated dancer or the line-linked skeleton figure in Zhuang.
Zhuang and Joe are in the same field of endeavor, namely computer graphics. Before the effective filing date of the claimed invention, it would have been obvious to apply the selection or generation of one or more avatars with props as avatar characteristics, as taught by Joe, into Zhuang. The motivation would have been that “users can create an avatar dancing simply and quickly using their terminal” (Joe, page 1, lines 25-26). The suggestion for doing so would allow users to generate dancing avatars on their devices simply and quickly. Therefore, it would have been obvious to combine Zhuang and Joe.
However, Zhuang in view of Joe fails to disclose, but in the same art of computer graphics, Carlbom discloses causing to be displayed on a first client device the real-time animation of the first avatar and the second avatar (Carlbom, para. [004], “Multi-user virtual environment systems incorporate computer graphics, sound, and optionally networking to simulate the experience of real-time interaction between multiple users who are represented by avatars in a shared three-dimensional (3D) virtual world”; para. [0088], “The system uses a client-server design, whereby each client provides an immersive audio/visual interface to the shared virtual environment from the perspective of one avatar. As the avatar "moves" through the environment, possibly under interactive user control, images and sounds representing the virtual environment from the avatar's simulated viewpoint are updated on the client's computer in real-time … and it sends appropriate messages with updates and spatialized audio streams back to the clients so that they may update their audio/visual displays”). Note that: (1) multi-users represented by avatars as clients can perform real-time interaction between them; (2) clients or users have their audio/visual displays (devices for displaying the virtual environment changes); and (3) it is obvious to one having ordinary skill in the art that one of the clients’ displays can be regarded as a first client device to display the real-time animation or interactions of avatars, including the first avatar and the second avatar, for real-time interaction.
Zhuang in view of Joe, and Carlbom, are in the same field of endeavor, namely computer graphics. Before the effective filing date of the claimed invention, it would have been obvious to apply displaying the real-time interactions of avatars for real-time interaction, as taught by Carlbom, into Zhuang in view of Joe. The motivation would have been that “Multi-user virtual environment systems incorporate computer graphics, sound, and optionally networking to simulate the experience of real-time interaction between multiple users who are represented by avatars in a shared three-dimensional (3D) virtual world” (Carlbom, para. [004]). The suggestion for doing so would allow the display of the real-time animation or interactions of avatars, including the first avatar and the second avatar, for real-time interaction on a client display device or other clients’ display devices. Therefore, it would have been obvious to combine Zhuang, Joe, and Carlbom.
Regarding claim 2, the combination of Zhuang, Joe, and Carlbom discloses The method of claim 1, further comprising:
training the danceability neural network, wherein training the danceability neural network comprises: (Zhuang, page 65:3, section 1, “we captured several high-quality music-dance motion pair datasets to boost the performance of music dance generation”). Note that: (1) the paired music-dance motion datasets were acquired for training and testing; and (2) it is obvious to one having ordinary skill in the art that training the neural networks for the danceability score (the “Musical style classifier” and the “Musical context-aware encoder”, whose combined output of the “Musical style”, the music-context code, and the motion code forms the danceability score) involves a training process that updates the parameters of the corresponding neural networks by comparing the inputs and outputs of the neural networks based on the training datasets from the acquired datasets.
receiving a plurality of test acoustic signals including a plurality of test acoustic segments; (Zhuang, page 65:5, Fig. 2: “Music wave” comprising “Musical features” is received by the neural networks; page 65:8, section 5.3, “we adopt three temporal convolution layers and one Bi-LSTM (Bi-directional LSTM) layer as a high-dimensional feature extractor, and finally it is classified by a fully connected layer”; page 65:8, section 6.1, “We combine temporal convolution and Bi-LSTM as the music encoder to represent the musical rhythm and melody at every moment, while considering the contextual information of the music … since dance motion is smooth, the Bi-LSTM is added after time convolution to fuse the context information and ensure that the music-context code is smooth (when we used madmom to extract music features, we found that the chroma, beat, downbeat, and onset of a music clip would change drastically—jitter problem, as shown in Figure 4)”). Note that: it is obvious to one having ordinary skill in the art that: (1) the datasets for network training are usually divided into three parts, i.e., a training dataset, a validation dataset, and a test dataset; (2) the test dataset is used to test the performance of the trained neural networks before their deployment for application inference; (3) during the test of the trained neural networks, the plurality of test acoustic signals as the plurality of test acoustic segments from the test dataset are received by the neural networks; and (4) the sliding window as specified for claim 1 above is used to make the plurality of test acoustic segments from the test dataset.
encoding the plurality of test acoustic segments; and (Zhuang, Fig. 2: “Musical features” are encoded into “Musical style” and the music-context code by “Musical style classifier” and “Musical context-aware encoder”).
generating the test quantification scores for each of the test acoustic segments, wherein the test quantification scores are based on music features (Zhuang 65:5, section 4, “The musical style contains two types: smoothing music and fast-rhythm music”; page 65:8, section 6.1, “since dance motion is smooth, the Bi-LSTM is added after time convolution to fuse the context information and ensure that the music-context code is smooth”). Note that: (1) during the test of the trained neural networks the test acoustic segments are encoded into “Musical style” and the music-context code values as danceability scores based on “Musical features”; and (2) “Musical style” and the music-context code values are test quantification scores of test acoustic segments during the test.
Regarding claim 3, the combination of Zhuang, Joe, and Carlbom discloses The method of claim 2, wherein the music features comprise frequency response, chromagram, tempogram, or any combination thereof. (Zhuang, page 65:4, section 2, “the researchers used the mel spectrum, mel-frequency ceptral coefficient (MFCC), or short-time Fourier transform (STFT) spectrum as a music feature to represent music … The most basic features in music are beat, rhythm, and melody. More critically, this is the most important dependency in dance generation. In music information retrieval, most of the work is about how to extract the music features. Onset can express the beginning of music notes, and it is the most basic expression form of music rhythm [2, 13, 19]. Beat is another form of rhythm, and relevant work includes beat detection [4, 27, 28]. Melody, one of the most important music features, can be expressed via the chroma feature [17, 25, 40]. Most importantly, the chroma feature is highly adaptable to changes in timbre and instruments. Therefore, we adopt onset, beat, and chroma as the music features to represent music”). Note that: Zhuang specifies all music features corresponding to frequency response, chromagram, tempogram, or any combination thereof.
Regarding claim 4, the combination of Zhuang, Joe, and Carlbom discloses The method of claim 2, wherein training the danceability neural network further comprises:
receiving a plurality of test videos including a dancer performing dance movements and the test acoustic signals, the test videos comprising the plurality of test video segments, wherein each of the test video segments comprises a plurality of test video frames; (Zhuang, page 65:3, section 1, “we captured several high-quality music-dance motion pair datasets to boost the performance of music dance generation”; page 65:5, Fig. 2: “Music (wave)” and “Previous k frames” from paired video are received by the neural networks of the framework). Note that: (1) the test dataset includes the test part of the music-dance motion pair datasets; and (2) it is obvious to one having ordinary skill in the art that the frames are from test videos and corresponding video segments paired to the test music wave (acoustic signals) and acoustic segments while the sliding window is applied to the input test dataset.
determining body poses for each of the test video frames using skeletal approximation of the dancer; (Zhuang, page 65:5, Fig. 2: “Motion encoder” encodes skeletal approximation of the dancer with a skeleton representation of the dancer; page 65:6, section 5.1, “The motion information at tth frame in the motion data contains 55 joints: one joint is the root joint, whose motion is represented by translation and rotation related to the world coordinate (hx,t ,hy,t ,hz,t , rx,t , ry,t , rz,t , x-y-z are the three axes in the Cartesian coordinate system), and the remaining joints are represented by the rotation related to their parent joints (rbx,t , rby,t , rbz,t, b is the joint index) … our method can describe the next frame motion, which indicates the invariant of our data representation. The joint rotation motion at tth frame xrott can be described as follows: xrot t = [Δhx,t ,hy,t , Δhz,t , Δry,t , rx,t , rz,t ,r2x,t , r2y,t , r2z,t , . . . , rbx,t , rby,t , rbz,t ]. (1) The new representation would also generate a large accumulation of errors”). Note that: (1) the motion information at the tth frame can be regarded as the body poses for each of the test video frames in the test dataset; and (2) the new skeleton representation of the dancer can be encoded and determined by “Motion encoder”.
generating, for each of the plurality of test video segments, a plurality of momentum scores associated with the plurality of body parts of the dancer; and (Zhuang, page 65:8, “We stack two “Conv1D+Relu” modules as the motion encoder to encode the past k frames. The convolution kernel is set to 1, which ensures that each frame motion code (512 channels) is independent”). Note that: the frame motion code values of the frames of the test dataset as the output of “Motion encoder” can be regarded as the plurality of momentum scores associated with a plurality of body parts (skeleton representation) of the dancer.
generating a test danceability score for each of the test video segments based on the momentum scores. (Zhuang, page 65:5, Fig. 2: the frame motion code as the output of “Motion encoder” can be regarded as a test danceability score to affect “Residual motion control stacked module”). Note that: the frame motion code, as one part of the test danceability score (or as the test danceability score itself), corresponds to each of the test video segments of the test dataset and is generated as the output of “Motion encoder”.
Regarding claim 6, the combination of Zhuang, Joe, and Carlbom discloses The method of claim 4, further comprising:
for each of the test video segments,
associating the test danceability score with the test quantification score of the test acoustic segment, (Zhuang, page 65:3, section 1, “we captured several high-quality music-dance motion pair datasets to boost the performance of music dance generation”). Note that: since the test dataset of the music-dance motion pair datasets is used with the sliding window applied to obtain the paired test video segments and test acoustic segments, it is obvious to one having ordinary skill in the art that the test danceability score is associated with the test quantification score (the combined value of the Musical style and the music-context code) of the test acoustic segment.
wherein the test video segments correspond in time to the test acoustic segments in the test videos. Note that: in the same way as above, the test video segments are synchronized in time to the test acoustic segments in the test videos.
Regarding claim 7, the combination of Zhuang, Joe, and Carlbom discloses The method of claim 6, wherein generating using the danceability neural network the danceability score for each of the acoustic segments further comprises:
generating the danceability score for each of the acoustic segments based on associated test danceability scores and the test quantification scores. (Zhuang, page 65:3, section 1, “we captured several high-quality music-dance motion pair datasets to boost the performance of music dance generation”). Note that: (1) the test dataset of the music-dance motion pair datasets is used with the sliding window applied to obtain the paired test video segments and test acoustic segments; (2) the test danceability scores are based on the momentum scores or are equivalent to weighted averages of the corresponding momentum scores of the test video segments, while the test quantification scores are equivalent to the combined values of the Musical style values and the music-context code values of the test acoustic segments of the test dataset; and (3) it is obvious to one having ordinary skill in the art that the danceability score for each of the acoustic segments can be generated or formulated by combining each test danceability score and each test quantification score through the music-dance motion pairs synchronized in time with the same sliding window.
Regarding claim 9, the combination of Zhuang, Joe and Carlbom discloses The method of claim 1, further comprising:
causing to be displayed on a second client device the real-time animation of the first avatar and the second avatar. (Carlbom, para. [004], “Multi-user virtual environment systems incorporate computer graphics, sound, and optionally networking to simulate the experience of real-time interaction between multiple users who are represented by avatars in a shared three-dimensional (3D) virtual world”; para. [0088], “The system uses a client-server design, whereby each client provides an immersive audio/visual interface to the shared virtual environment from the perspective of one avatar. As the avatar "moves" through the environment, possibly under interactive user control, images and sounds representing the virtual environment from the avatar's simulated viewpoint are updated on the client's computer in real-time … and it sends appropriate messages with updates and spatialized audio streams back to the clients so that they may update their audio/visual displays”). Note that: (1) multi-users represented by avatars as clients can perform real-time interaction between them; (2) clients or users have their audio/visual displays (devices for displaying the virtual environment changes); and (3) it is obvious to one having ordinary skill in the art that another one of the clients’ displays, besides the first user’s or client’s display, can be regarded as a second client device to display the real-time animation or interactions of avatars, including the first avatar and the second avatar, for real-time interaction.
The motivation to combine Zhuang, Joe, and Carlbom given in claim 1 is incorporated here.
Regarding claim 10, the combination of Zhuang, Joe, and Carlbom discloses The method of claim 9,
… wherein the first user is associated with the first avatar, and the second user is associated with the second avatar. (Joe, page 3, lines 31-33, “users need a means of generating 3D avatars capable of expressing various movements or expressions, such as dance movements, which are desired from the passiveness”; page 4, lines 7-12, “the present invention provides a method of generating an avatar dancing to music, the method comprising: (a) a client terminal accessing an avatar generation server dancing using a wired or wireless communication network and loading the avatar editor; (b) receiving a section setting for matching dance motion picture data from the client terminal; (c) setting the dance motion of the avatar in units of sections by matching the dance motion video data for each section set in step (b)”). Note that: (1) the first user or client uses the first client device with display to generate the first avatar by loading the avatar editor and matching dance motion picture data from the first client terminal so that the first user is associated with the first avatar; and (2) the second user or client uses the second client device with display to generate the second avatar by loading the avatar editor and matching dance motion picture data from the second client terminal so that the second user is associated with the second avatar.
… wherein the first client device is associated with a first user and the second client device is associated with a second user, … (Carlbom, para. [0004], “Multi-user virtual environment systems incorporate computer graphics, sound, and optionally networking to simulate the experience of real-time interaction between multiple users who are represented by avatars in a shared three-dimensional (3D) virtual world”; para. [0088], “The system uses a client-server design, whereby each client provides an immersive audio/visual interface to the shared virtual environment from the perspective of one avatar. As the avatar "moves" through the environment, possibly under interactive user control, images and sounds representing the virtual environment from the avatar's simulated viewpoint are updated on the client's computer in real-time … and it sends appropriate messages with updates and spatialized audio streams back to the clients so that they may update their audio/visual displays”). Note that: (1) multiple users represented by avatars as clients can perform real-time interaction between them; (2) clients or users have their own audio/visual displays (devices for displaying the virtual environment changes); and (3) it is obvious to one having ordinary skill in the art that: a) the first client or user device is used by, or associated with, the first user; and b) the second client or user device is used by, or associated with, a second user.
The motivation to combine Zhuang, Joe, and Carlbom given in claim 1 is incorporated here.
Claim 11, reciting “A system comprising: a processor; and a memory storing instructions that, when executed by the processor, cause the system to perform operations comprising:”, corresponds to the method of claim 1. Therefore, claim 11 is rejected under the same rationale as claim 1.
In addition, the combination of Zhuang, Joe, and Carlbom discloses A system comprising: a processor; and a memory storing instructions that, when executed by the processor, cause the system to perform operations comprising: (Carlbom, Fig. 3: a system has a “COMPUTER” with “MEMORY”; para. [0088], “a series of experiments was conducted with a single server spatializing sounds on an SGI Onyx2 with four 195 MHz R10000 processors”). Note that: it is obvious to one having ordinary skill in the art that a computer with memory has processor(s) to execute the instructions that are stored in memory and to perform operations.
The motivation to combine Zhuang, Joe, and Carlbom given in claim 1 is incorporated here.
Claims 12-14 and 16-17 correspond to the methods of claims 2-4 and 6-7, respectively. Therefore, claims 12-14 and 16-17 are rejected under the same rationale as claims 2-4 and 6-7, respectively.
Claim 19 corresponds to the methods of both claims 9-10. Therefore, claim 19 is rejected under the same rationale as claims 9-10.
Claim 20, reciting “A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a processor, cause the processor to perform operations comprising:”, corresponds to the method of claim 1. Therefore, claim 20 is rejected under the same rationale as claim 1.
In addition, the combination of Zhuang, Joe, and Carlbom discloses A non-transitory computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a processor, cause the processor to perform operations comprising: (Carlbom, para. [0086], “A computer-readable medium, such as the disc 180 in FIG. 17 may be used to load computer-readable code into the mass storage device 120, which may then be transferred to the computer 110”; Fig. 16: a non-transitory computer-readable storage medium, “DISK”, and a “COMPUTER”). Note that: (1) a computer-readable medium, such as the disc, is a non-transitory computer-readable storage medium; and (2) it is obvious to one having ordinary skill in the art that a computer with memory usually has processor(s) to execute the instructions.
The motivation to combine Zhuang, Joe, and Carlbom given in claim 1 is incorporated here.
Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Zhuang in view of Joe and Carlbom, and further in view of Zhang et al. (Body part level attention model for skeleton-based action recognition, 2019 Chinese Automation Congress (CAC) (2019, Page(s): 4297-4302), 978-1-7281-4094-0/19, 2019 IEEE, hereinafter “Zhang”).
Regarding claim 5, the combination of Zhuang, Joe, and Carlbom fails to disclose, but in the same art of computer graphics, Zhang discloses generating a weighted average of the momentum scores. (Zhang, Fig. 3: “Illustration of body part level attention module” with N joints to export body part level attention “αt” at frame t or time t; page 4298, section 3.1, “At each moment t, given N joints xt = (xt,1, ..., xt,N )T , with xt,n ∈ R3. xt can be divided into K subsets xt = {Xt,1, ..., Xt,K}, with Xt,k is the joints of the kth part. Therefore, the input of the main network is x˜t = {Xt,1, ..., Xt,K}, which determine action scores zt = (zt,1, ..., zt,K)T , with zt,k is the action score obtained with the input data Xt,k. In the paper, K equals to 5.”; page 4299, section 3.2, “… the scores st = (st,1, ..., st,K)T for denoting the significance of the K body part calculated as st = WsRelu(Whshst + bhs ) + bs, (1)
[image: media_image2.png, greyscale, 492 × 508]
”). Note that: (1) at the body part level, at each frame t or time t, for the kth body part action score, the body part action score weight αt,k is calculated; and (2) the final score o at the final sequence level is a weighted average of the K body part action scores, with weights αt,k (k = 1, …, K), and it can be regarded as a test danceability score for each of the test video segments based on the body part action scores (zk, k = 1, …, K; momentum scores).
Zhang and the combination of Zhuang, Joe, and Carlbom are in the same field of endeavor, namely computer graphics. Before the effective filing date of the claimed invention, it would have been obvious to apply calculating the final sequence-level action score from body part level action scores in a weighted average formula, as taught by Zhang, to the combination of Zhuang, Joe, and Carlbom. The motivation would have been that “The final representative of the action video is produced in a weighted average way and it will be fed into DNNs for classification” (Zhang, page 4297, Abstract). Doing so would allow generating a final test danceability score as a weighted average of the body part action scores (momentum scores). Therefore, it would have been obvious to combine Zhuang, Joe, Carlbom, and Zhang.
Claim 15 corresponds to the method of claim 5. Therefore, claim 15 is rejected under the same rationale as claim 5.
Claims 8 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Zhuang in view of Joe and Carlbom, and further in view of Sakr et al. (A Curvilinear Avatar with Avatar Collision Detection Scheme in Collaborative Virtual Environments, IEEE International Instrumentation and Measurement Technology Conference, Victoria, Vancouver Island, Canada, May 12-15, 2008, hereinafter “Sakr”).
Regarding claim 8, the combination of Zhuang, Joe, and Carlbom discloses The method of claim 1, wherein generating the real-time animation of the first avatar and the second avatar further comprises:
generating the real-time animation (Zhuang, page 65:3, Fig. 1: “Our method can directly generate realistic dance motion sequences from the input music (wave)”; page 65:5, Fig. 2: “DanceNet takes the musical style, rhythm, and melody as control signals to generate dance motion” with “Previous k frames” and “Current frame”). Note that: DanceNet can generate a real-time animation of the animated dancer or line-linked skeleton figure based on the music features (e.g., beat and onset) in real time, to reflect and match the music features when inferring dance motion synthesis in the application.
… based on a position of the first avatar displayed on the first client device and a position of the second avatar displayed on the first client device (Carlbom, para. [0004], “Multi-user virtual environment systems incorporate computer graphics, sound, and optionally networking to simulate the experience of real-time interaction between multiple users who are represented by avatars in a shared three-dimensional (3D) virtual world”; para. [0088], “The system uses a client-server design, whereby each client provides an immersive audio/visual interface to the shared virtual environment from the perspective of one avatar. As the avatar "moves" through the environment, possibly under interactive user control, images and sounds representing the virtual environment from the avatar's simulated viewpoint are updated on the client's computer in real-time … and it sends appropriate messages with updates and spatialized audio streams back to the clients so that they may update their audio/visual displays”).
Note that: (1) multiple users represented by avatars as clients can perform real-time interaction between them; (2) clients or users have their own audio/visual displays (devices for displaying the virtual environment changes); and (3) it is obvious to one having ordinary skill in the art that one of the clients' displays can be regarded as a first client device to display the real-time animation or interactions of avatars, including the first avatar and the second avatar, for real-time interaction.
The motivation to combine Zhuang, Joe, and Carlbom given in claim 1 is incorporated here.
However, the combination of Zhuang, Joe, and Carlbom fails to disclose, but in the same art of computer graphics, Sakr discloses … a position of … a position of …
to prevent an overlapping display of the first avatar and the second avatar. (Sakr, Abstract, “This paper presents a novel Curvilinear Avatar with Avatar Collision Detection Scheme (CAACD) that provides a collision-free environment for avatars going in a linear as well as curvilinear motion”; section I, “With CAACD, all static entities in the CVE embody a single vector representing their respective position in the virtual scene, while mobile avatars embody a combination of vectors representing their respective position, velocity, acceleration and displacement in the virtual scene”; section III, “Collision avoidance amends the avatar's navigation speed and direction to avoid a potential collision and generates a path amendment event”). Note that: the positions of the first avatar and the second avatar can be determined by the cited method so as to avoid collision or overlap with each other.
Sakr and the combination of Zhuang, Joe, and Carlbom are in the same field of endeavor, namely computer graphics. Before the effective filing date of the claimed invention, it would have been obvious to apply the method of determining the positions and other parameters of avatars without collision or overlap, as taught by Sakr, to the combination of Zhuang, Joe, and Carlbom. The motivation would have been to provide “a novel Curvilinear Avatar with Avatar Collision Detection Scheme (CAACD) that provides a collision-free environment for avatars going in a linear as well as curvilinear motion” (Sakr, Abstract). Doing so would allow generating positions of avatars without collision or overlap with each other. Therefore, it would have been obvious to combine Zhuang, Joe, Carlbom, and Sakr.
Claim 18 corresponds to the method of claim 8. Therefore, claim 18 is rejected under the same rationale as claim 8.
Response to Arguments
Applicant's arguments with respect to the rejections under 35 U.S.C. § 103 have been fully considered, but they are not persuasive.
Applicant alleges, “For example, contrary to that alleged by the Examiner, ‘the frame motion code values of the frames of the test dataset as the output of ‘Motion encoder’’ cannot correspond to ‘the plurality of momentum scores’ because the frame motion code values of each frame does not ‘represent energy or effort exerted by the dancer for each of the body parts in each of the test video segments,’ respectively. Moreover, contrary to that alleged by the Examiner, there is no teaching of "Musical style", the music-context code, and the frame motion code being combined to produce the ‘danceability score [that] is associated with a whole body of the dancer,’ as recited in the claims.” (page 9, lines 11-17). However, Examiner respectfully disagrees with the respective allegations as a whole because: (1) each of the plurality of momentum scores (frame motion codes) describes the translation and rotation (movement and motion) of each of the joints and body parts, and represents the kinetic energy and angular kinetic energy of the dancing poses and efforts of the dancer; (2) the danceability score, including the “Musical style”, the music-context code, and the frame motion code (momentum score), is based on the plurality of test quantification scores and the plurality of momentum scores associated with the body parts and joints (a whole body) of the dancer; (3) since Applicant describes “momentum scores associated with the body parts of the dancer” (Specification, para. [0079]), the frame motion code values reflecting the translation/rotation of body parts and joints with the corresponding energy of the dancer are the specified momentum scores; and (4) the classified “Musical style” (2-channel), the music-context code (30-channel) indicating the smoothing degree of the music as the output of the “Musical context-aware encoder”, and the motion code (512-channel) associated with body parts and joints indicating motion features as the output of the “Motion encoder”, can be regarded as a combined danceability score (e.g., a score vector) associated with a whole body (body parts and joints) of the dancer at every moment of the music. The arguments are not persuasive.
Applicant alleges, “Applicant further submits that a prima facie case of obviousness has not been established for dependent claims 2-10 and 12-19. However, based on the dependency of claims 2-10 and 12-19 on independent claims 1 and 11, respectively, which are believed to be in condition for allowance, Applicant respectfully submits that claims 2-10 and 12-19 are believed to be allowable for at least the reasons set forth above.” (page 9, lines 17-22). However, Examiner respectfully disagrees with the respective allegations as a whole because: (1) claims 1 and 11 are rejected under the respective rationales above; and (2) claims 2-10 and 12-19, which depend from independent claims 1 and 11, respectively, are rejected under the respective rationales above. The arguments are not persuasive.
Applicant alleges, “Therefore, Applicant believes that claims 1, 11 and 20 and claims dependent thereon are distinguishable over the cited references. Accordingly, Applicant respectfully requests the rejection under 35 U.S.C. §103 be withdrawn.” (page 9, lines 23-25). However, Examiner respectfully disagrees with the respective allegations as a whole because claims 1, 11, and 20 are rejected under the respective rationales above. The arguments are not persuasive.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BIAO CHEN whose telephone number is (703)756-1199. The examiner can normally be reached M-F 8am-5pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee M Tung can be reached at (571)272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/Biao Chen/
Patent Examiner, Art Unit 2611
/KEE M TUNG/Supervisory Patent Examiner, Art Unit 2611