DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
1. The information disclosure statement (IDS) submitted on 3/7/2024 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement has been considered by the examiner.
Priority
2. Acknowledgment is made of applicant's claim for foreign priority based on an application filed in China on 8/4/2022. It is noted, however, that applicant has not filed a certified copy of the CN202210934101.0 application as required by 37 CFR 1.55.
Drawings
3. The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: 5208 in Fig. 25. Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
4. The disclosure is objected to because of the following informalities:
In ¶ [0148], line 2, "not necessarily performed" should read "not necessarily be performed".
In ¶ [0057], lines 9-11; ¶ [00114], lines 2, 4, and 6; ¶ [00121], lines 1 and 3; ¶ [00128], line 6; ¶ [00129], lines 2-3; and ¶ [00163], lines 2 and 5: The meaning of the word "adjoint" is unclear. "Adjoint" is commonly used as a mathematical term in linear algebra, but its usage in this context is not clear. Based on the citation from ¶ [00114], "In this way, by determining the final target action parameter of the action unit by jointly using the adjoint intensity information and the action intensity parameter, the accuracy of the target action parameter can be further improved," the examiner will interpret "adjoint" as "joint." It is recommended that the specification be revised so that the intended meaning is clear.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
5. Claims 12-13 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
6. Where applicant acts as his or her own lexicographer to specifically define a term of a claim contrary to its ordinary meaning, the written description must clearly redefine the claim term and set forth the uncommon definition so as to put one reasonably skilled in the art on notice that the applicant intended to so redefine that claim term. Process Control Corp. v. HydReclaim Corp., 190 F.3d 1350, 1357, 52 USPQ2d 1029, 1033 (Fed. Cir. 1999). As interpreted by the examiner, the term “adjoint” in claims 12-13 is used in the claims to mean “joint” (based on the citation from ¶ [00114]), whereas its accepted meaning is the mathematical one used in linear algebra. The term is indefinite because the specification does not clearly redefine it. Appropriate correction is required.
Claim Rejections - 35 USC § 103
7. The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
8. Claims 1-4 and 15-20 are rejected under 35 U.S.C. 103 as being unpatentable over Edwards et al. (US-2018/0253881-A1, hereinafter "Edwards") in view of Merrill et al. (US-6181351-B1, hereinafter "Merrill").
9. As per claim 1, Edwards discloses: A mouth shape animation generation method, executed by a computing device having a processor, the method comprising: (Edwards, [0005], “In an aspect, there is provided a method for animated lip synchronization executed on a processing unit …”)
performing feature analysis based on a target audio, to generate viseme feature flow data, the viseme feature flow data comprising a plurality of sets of ordered viseme feature data, each set of viseme feature data corresponding to [[one audio frame]] in the target audio, respectively; (Edwards, [0057], “Turning to FIG. 3, a flowchart for a method for animated lip synchronization 300 is shown. In some cases, at block 302, a segment of speech is captured as input by the input module 202 from the input device 222.” and [0006], “In a particular case, the method further comprising capturing speech input; parsing the speech input into the phonemes; and aligning the phonemes to the corresponding portions of the speech input.” and [0005], “In an aspect, there is provided a method for animated lip synchronization executed on a processing unit, the method comprising: mapping phonemes to visemes; synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and outputting the viseme action units.” and [0011], “In yet another case, mapping the phonemes to visemes comprises at least one of mapping a start time of at least one of the visemes to be prior to an end time of a previous respective viseme and mapping an end time of at least one of the visemes to be after a start time of a subsequent respective viseme.”; Examiner’s note: Speech audio is first parsed into phonemes, aligned with the speech input, and then mapped to visemes. The end result is ordered viseme data based on the speech audio.)
separately parsing each set of viseme feature data, to obtain viseme information and intensity information corresponding to the respective set of viseme feature data, the intensity characterizing a change intensity of a viseme corresponding to the viseme information; and (Edwards, [0064], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape. In this case, the viseme action units are a linear combination of the modulated viseme shape. In other words, the jaw contributions and the lip contributions can be respectively synchronized to independent visemes, and the viseme action units can be a linear combination of the independent visemes.” and [0066], “In some cases, viseme action units can be determined by manually setting jaw and lip values over time by a user via the input device 222. … In other cases, the lip contributions and the jaw contributions can be automatically determined by the system 300 from input speech audio and/or input speech transcript.”; Examiner’s note: Lip and jaw contribution values (including that of intensity) can be determined independently by a user or system. For instance, the intensity value is used to modulate the weight of the blend-shape. These contribution values must be parsed in order to influence the viseme.)
controlling, according to the viseme information and the intensity information corresponding to the sets of viseme feature data, a virtual face to change, so as to generate a mouth shape animation corresponding to the target audio. (Edwards, [0079]-[0080], “Applicant recognized the substantial advantage of using a JALI viseme field to provide a controllable abstraction over expressive speech animation of the same phonetic content. As described herein, the JALI viseme field setting over time, for a given performance, can be extracted plausibly through analysis of the audio signal. … In an embodiment, as shown in FIG. 7, an animatable facial rig can be constructed that is compatible with the JALI viseme field. The “Valley Girl” rig 702 is a fairly realistic facial model rigged in MAYA™. Her face is controlled through a typical combination of blend-shapes (to animate her facial action units) and skeletal skinning (to animate her jaw and tongue).” and [0064], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape.” and [0081], “As such, the face geometry is a composition of a neutral face nface, overlaid with skeletal jaw and tongue deformation jd; td, displaced by a linear blend of weighted blend-shape action unit displacements au; thus, face=nface+jd+td+au.”)
10. Edwards doesn't explicitly disclose but Merrill discloses: [[performing feature analysis based on a target audio, to generate viseme feature flow data, the viseme feature flow data comprising a plurality of sets of ordered viseme feature data, each set of viseme feature data corresponding to]] one audio frame [[in the target audio, respectively;]] (Merrill, col. 12, lines 34-64, "At step 266, the speech sound data 206 is annotated with the word break information 216 and the phoneme information 218 to create a linguistically enhanced sound file 232. In the illustrated embodiment, the linguistic information and sound editing tool 208 combines the speech sound data 206, the word break information 216, and the phoneme information 218 into a single file 232 containing an audio chunk 234, a word marking list 236, and a phoneme marking list 238. The audio chunk is a part of the file 232 (e.g., a set of bytes) containing audio data. Typically, the audio chunk 234 is of the same format (e.g., WAV) as the speech sound data 206, but can be of some other format. ... For example, the file could be divided into frames, and the phoneme and word break data scattered throughout the file in the frames."; Examiner’s note: The input audio, as disclosed by Merrill, is "annotated" with additional data, such as phoneme and word breaks, to create an enhanced sound file. The audio can still be broken up into individual frames even after it has been annotated with this additional information.)
11. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of Edwards to include the disclosure of generating one set of viseme feature data per audio frame, of Merrill. The motivation for this modification could have been to divide the audio into small segments so that many visemes are generated, more accurately representing what is happening at each moment in the audio.
12. As per claim 2, Edwards in view of Merrill discloses: The method according to claim 1, wherein the performing feature analysis based on a target audio, to generate viseme feature flow data, comprises:
performing the feature analysis based on the target audio, to obtain phoneme flow data, the phoneme flow data comprising a plurality of sets of ordered phoneme data, each set of phoneme data corresponding to one audio frame in the target audio respectively; (Edwards, [0057], “Turning to FIG. 3, a flowchart for a method for animated lip synchronization 300 is shown. In some cases, at block 302, a segment of speech is captured as input by the input module 202 from the input device 222.” and Edwards, [0006], “In a particular case, the method further comprising capturing speech input; parsing the speech input into the phonemes; and aligning the phonemes to the corresponding portions of the speech input.” and Edwards, [0072], “Computer facial animation can be broadly classified as procedural, data-driven, or performance-capture. Procedural speech animation segments speech into a string of phonemes …” and Merrill, col. 12, line 34-64, "At step 266, the speech sound data 206 is annotated with the word break information 216 and the phoneme information 218 to create a linguistically enhanced sound file 232. In the illustrated embodiment, the linguistic information and sound editing tool 208 combines the speech sound data 206, the word break information 216, and the phoneme information 218 into a single file 232 containing an audio chunk 234, a word marking list 236, and a phoneme marking list 238. The audio chunk is a part of the file 232 (e.g., a set of bytes) containing audio data. Typically, the audio chunk 234 is of the same format (e.g., WAV) as the speech sound data 206, but can be of some other format. ... For example, the file could be divided into frames, and the phoneme and word break data scattered throughout the file in the frames.")
for each set of phoneme data, performing analysis processing on the respective set of phoneme data according to a preset mapping relationship between a phoneme and a viseme, to obtain the viseme feature data corresponding to the phoneme data; and (Edwards, [0005], “In an aspect, there is provided a method for animated lip synchronization executed on a processing unit, the method comprising: mapping phonemes to visemes; synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and outputting the viseme action units.” and Edwards, [0011], “In yet another case, mapping the phonemes to visemes comprises at least one of mapping a start time of at least one of the visemes to be prior to an end time of a previous respective viseme and mapping an end time of at least one of the visemes to be after a start time of a subsequent respective viseme.” and Edwards, [0030], “FIG. 4 illustrates an example of phoneme-to-viseme mapping”)
generating the viseme feature flow data according to the viseme feature data respectively corresponding to the sets of phoneme data. (Edwards, [0006], “In a particular case, the method further comprising capturing speech input; parsing the speech input into the phonemes; and aligning the phonemes to the corresponding portions of the speech input.” and Edwards, [0005], “In an aspect, there is provided a method for animated lip synchronization executed on a processing unit, the method comprising: mapping phonemes to visemes; synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and outputting the viseme action units.” and Edwards, [0011], “In yet another case, mapping the phonemes to visemes comprises at least one of mapping a start time of at least one of the visemes to be prior to an end time of a previous respective viseme and mapping an end time of at least one of the visemes to be after a start time of a subsequent respective viseme.” and Edwards, [0088], “In the animation phase 906, the aligned phonemes are mapped to visemes by the correspondence module 206. Viseme amplitudes are set (for articulation) 914. Then the visemes are re-processed 916, by the synchronization module 208, for co-articulation to produce viseme timings and resulting animation curves for the visemes (in an example, a Maya MEL script of sparsely keyframed visemes). These phonetic animation curves can be outputted by the output module 210 to demonstrate how the phonemes are changing over time.” and Edwards, [0095], “In the animation phase 904, a facial rig is animated by producing sparse animation keyframes for visemes by the correspondence module 206. The viseme to be keyframed is determined by the co-articulation model described herein. The timing of the viseme is determined by forced alignment after it has been processed through the co-articulation model.”)
13. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 1 of Edwards to include the disclosure of generating one set of phoneme data per audio frame, of Merrill. The motivation for this modification could have been to divide the audio into small segments so that many phonemes are generated, more accurately representing what is happening at each moment in the audio.
14. As per claim 3, Edwards in view of Merrill discloses: The method according to claim 2, wherein the performing the feature analysis based on the target audio, to obtain phoneme flow data, further comprises: (See rejection for claim 2.)
determining a text matching the target audio; and (Edwards, [0093], “For the input phase 902, accurate speech transcript is preferable in order to produce procedural lip synchronization, as extra, missing, or mispronounced words and punctuation can result in poor alignment and cause cascading errors in the animated speech. In some cases, automatic transcription tools may be used for, for example, real-time speech animation.”)
performing alignment processing on the target audio and the text, and generating the phoneme flow data by parsing according to an alignment processing result. (Edwards, [0094], “Forced alignment 912 is then used by the alignment module 204 to align the speech audio to its phonemic transcript. Unlike the creation of speech text transcript, this task requires automation, and, in some cases, is done by training a Hidden Markov Model (HMM) on speech data annotated with the beginning, middle, and end of each phoneme, and then aligning phonemes to the speech features.” and [0058]-[0059], “In some cases, at block 304, the alignment module 204 employs forced alignment to align utterances in the soundtrack to the text, giving an output time series containing a sequence of phonemes. At block 306, the correspondence module 206 combines audio, text and alignment information to produce text-to-phoneme and phoneme-to-audio correspondences.”)
15. As per claim 4, Edwards in view of Merrill discloses: The method according to claim 3, wherein the performing alignment processing on the target audio and the text, and generating the phoneme flow data by parsing according to an alignment processing result, comprises: (See rejection for claim 3.)
obtaining reference phoneme flow data corresponding to the text; (Edwards, [0093], “For the input phase 902, accurate speech transcript is preferable in order to produce procedural lip synchronization, as extra, missing, or mispronounced words and punctuation can result in poor alignment and cause cascading errors in the animated speech. In some cases, automatic transcription tools may be used for, for example, real-time speech animation. In further cases, manual transcription from the speech recording may be used for ease and suitability. Any suitable transcript text-to-phoneme conversion, for various languages, can be used …” and Edwards, [0062], “In some cases, the speech input can include at least one of a speech audio and a speech transcript.”)
performing speech recognition on the target audio, to obtain initial phoneme flow data; and (Merrill, col. 11, lines 19-23, “When a phoneme-based speech recognition engine successfully matches speech sound data with a potential text, it generates a hypothetical phonetic alignment between the phonemes that the original speaker probably spoke and the sounds in the speech sound data.” and Edwards, [0093], “For the input phase 902, accurate speech transcript is preferable in order to produce procedural lip synchronization, as extra, missing, or mispronounced words and punctuation can result in poor alignment and cause cascading errors in the animated speech. In some cases, automatic transcription tools may be used for, for example, real-time speech animation. Any suitable transcript text-to-phoneme conversion, for various languages, can be used …”)
performing alignment processing on the initial phoneme flow data and the reference phoneme flow data, and adjusting a phoneme in the initial phoneme flow data by using the alignment processing result, to obtain the phoneme flow data corresponding to the target audio. (Merrill, col. 11, lines 34-55, “In the process of generating this alignment, the engine also generates an alignment between the utterance and the words in its text. ... As a result, these interfaces typically return a list of nodes for both phonemes and words corresponding to each recognition hypothesis. An application can iterate through the nodes in a recognition hypothesis using the GetPhonemeNode and GetWordNode methods on each node in a path in turn to generate the alignments for both words and phonemes.” and Edwards, [0087], “In another embodiment of a method for animated lip synchronization 900 shown in FIG. 9, there is provided an input phase 902, an animation phase 904, and an output phase 906. In the input phase 902, the input module 202, produces an alignment of the input audio recording of speech 910, and in some cases its transcript 908, by parsing the speech into phonemes. Then, the alignment module 204, aligns the phonemes with the audio 910 using a forced-alignment tool 912.”)
16. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 3 of Edwards to include the disclosure of performing speech recognition on the target audio, to obtain initial phoneme flow data and performing alignment processing on the initial phoneme flow data and the reference phoneme flow data, and adjusting a phoneme in the initial phoneme flow data by using the alignment processing result, to obtain the phoneme flow data corresponding to the target audio, of Merrill. The motivation for this modification could have been to have two different sources of alignment information, one from the reference phoneme flow data and one from speech recognition, to obtain a finer and more accurate alignment.
17. Claim 15 is similar in scope to claim 1 except for additional limitations that Edwards discloses: A mouth shape animation generation apparatus, comprising: a generation module … a parsing module … a control module … (Edwards, [0019], "In another aspect, there is provided a system for animated lip synchronization, the system having one or more processors and a data storage device, the one or more processors in communication with the data storage device, the one or more processors configured to execute ..." and [0047], “Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.”)
18. Claim 16, which is similar in scope to dependent claim 2 and independent claim 15, is thus rejected under the same rationale as described above.
19. Claim 17 is similar in scope to claim 1 except for additional limitations that Edwards discloses: A computer device comprising:
a memory;
and one or more processors, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, causes the computer device to perform: (Edwards, [0024], “In another aspect, there is provided a facial model for animation on a computing device, the computing device having one or more processors …” and [0047], “Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology … Any method, application or module herein described may be implemented using computer readable/ executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.”)
20. Claim 18, which is similar in scope to dependent claim 2 and independent claim 17, is thus rejected under the same rationale as described above.
21. Claim 19 is similar in scope to claim 1 except for additional limitations that Edwards discloses:
One or more computer-readable storage media, storing computer-readable instructions, the computer-readable instructions, when executed by one or more processors, causes a computing apparatus to: (Edwards, [0019], "In another aspect, there is provided a system for animated lip synchronization, the system having one or more processors and a data storage device, the one or more processors in communication with the data storage device, the one or more processors configured to execute ..." and [0047], “Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology … Any method, application or module herein described may be implemented using computer readable/ executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.”)
22. Claim 20, which is similar in scope to dependent claim 2 and independent claim 19, is thus rejected under the same rationale as described above.
23. Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Edwards et al. (US-2018/0253881-A1, hereinafter "Edwards") in view of Merrill et al. (US-6181351-B1, hereinafter "Merrill") and further in view of Lande et al. (US-6665643-B1, hereinafter "Lande").
24. As per claim 5, Edwards in view of Merrill discloses: The method according to claim 1, wherein the viseme feature data comprises at least one viseme field and at least one intensity field; and
the separately parsing each set of viseme feature data, to obtain viseme information and intensity information corresponding to the viseme feature data comprises: (Edwards, [0064], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape. In this case, the viseme action units are a linear combination of the modulated viseme shape. In other words, the jaw contributions and the lip contributions can be respectively synchronized to independent visemes, and the viseme action units can be a linear combination of the independent visemes.” and [0066], “In some cases, viseme action units can be determined by manually setting jaw and lip values over time by a user via the input device 222. … In other cases, the lip contributions and the jaw contributions can be automatically determined by the system 300 from input speech audio and/or input speech transcript.”)
[[separately mapping, for each set of viseme feature data, viseme fields in the viseme feature data with visemes in a preset viseme list according to a preset mapping relationship between a viseme field and a viseme, to obtain the viseme information corresponding to the viseme feature data; and]]
parsing the intensity field in the viseme feature data, to obtain the intensity information corresponding to the viseme feature data. (Edwards, [0064], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape. In this case, the viseme action units are a linear combination of the modulated viseme shape. In other words, the jaw contributions and the lip contributions can be respectively synchronized to independent visemes, and the viseme action units can be a linear combination of the independent visemes.” and [0066], “In some cases, viseme action units can be determined by manually setting jaw and lip values over time by a user via the input device 222. … In other cases, the lip contributions and the jaw contributions can be automatically determined by the system 300 from input speech audio and/or input speech transcript.”)
25. Edwards in view of Merrill doesn't explicitly disclose but Lande discloses: separately mapping, for each set of viseme feature data, viseme fields in the viseme feature data with visemes in a preset viseme list according to a preset mapping relationship between a viseme field and a viseme, to obtain the viseme information corresponding to the viseme feature data; and (Lande, col. 5, lines 49-52, “The visemes defined by such a standard are listed in Table 1, where the first column shows the viseme index according to the standard…” and col. 6, lines 30-34, “The next step is the characterization of such visemes through macroparameters. The invention makes use of the four macroparameters LOW, LOH, JY, LP mentioned before, and for each viseme it associates a value or intensity to each macroparameter.”)
26. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 1 of Edwards in view of Merrill to include the disclosure of separately mapping, for each set of viseme feature data, viseme fields in the viseme feature data with visemes in a preset viseme list according to a preset mapping relationship between a viseme field and a viseme, to obtain the viseme information corresponding to the viseme feature data, of Lande. The motivation for this modification could have been to generate many different sets of visemes whose parameter values provide different intensities, so that each viseme can produce a distinct or unique mouth shape.
27. As per claim 6, Edwards in view of Merrill, and further in view of Lande discloses: The method according to claim 5, wherein the viseme field comprises at least one single-pronunciation viseme field and at least one co-pronunciation viseme field, the visemes in the viseme list comprise at least one single-pronunciation viseme and at least one co-pronunciation viseme; and (Edwards, [0077], “Conventional animation of human speech is based on a mapping from phonemes to visemes, such as the two labiodental phonemes /f v/ mapping to a single FFF viseme, shown in FIG. 5, where the lower lip is pressed against the upper teeth. Typically, animators create linearly superposed blend-shapes to represent these visemes and animate speech by keyframing these blend-shapes over time.” and [0098], “In the field of linguistics, “co-articulation” is the movement of articulators to anticipate the next sound or preserving movement from the last sound. In some cases, the representation of speech can have a few simplifying aspects.” and [0097], “For co-articulation 916, timing can be based on the alignment returned by the forced alignment and the results of the co-articulation model. Given the amplitude, the phoneme-to-viseme conversion is processed through a co-articulation model, or else the lips, tongue and jaw can distinctly pronounce each phoneme, which is neither realistic nor expressive. Severe mumbling or ventriloquism makes it clear that coherent audible speech can often be produced with very little visible facial motion, making co-articulation essential for realism.”)
the separately mapping, for each set of viseme feature data, viseme fields in the viseme feature data with visemes in a preset viseme list according to a preset mapping relationship between a viseme field and a viseme, to obtain the viseme information corresponding to the viseme feature data comprises: (Edwards, [0072], “Computer facial animation can be broadly classified as procedural, data-driven, or performance-capture. Procedural speech animation segments speech into a string of phonemes, which are then mapped by rules or look-up tables to visemes; typically many-to-one. As an example, / m b p / all map to the viseme MMM in FIG. 4.”)
separately mapping, for each set of viseme feature data, single-pronunciation viseme fields in the viseme feature data with single-pronunciation visemes in the viseme list according to a preset mapping relationship between a single-pronunciation viseme field and a single-pronunciation viseme; and (Edwards, [0098], “First, many phonemes map to a single viseme; for example, the phonemes: /AO/ (caught), /AX/ (about), AY/ (bite), and /AA/ (father) all map to the viseme AHH (see, for example, FIG. 4). Second, most motion of the tongue is typically hidden, as only glimpses of motion of the tongue are necessary to convince the viewer the tongue is participating in speech.”)
separately mapping co-pronunciation viseme fields in the viseme feature data with co-pronunciation visemes in the viseme list according to a preset mapping relationship between a co-pronunciation viseme field and a co-pronunciation viseme, to obtain the viseme information corresponding to the viseme feature data. (Edwards, [0098], “In the field of linguistics, “co-articulation” is the movement of articulators to anticipate the next sound or preserving movement from the last sound. In some cases, the representation of speech can have a few simplifying aspects.” and [0097], “For co-articulation 916, timing can be based on the alignment returned by the forced alignment and the results of the co-articulation model. Given the amplitude, the phoneme-to-viseme conversion is processed through a co-articulation model, or else the lips, tongue and jaw can distinctly pronounce each phoneme, which is neither realistic nor expressive. Severe mumbling or ventriloquism makes it clear that coherent audible speech can often be produced with very little visible facial motion, making co-articulation essential for realism.”)
28. Claims 7-14 are rejected under 35 U.S.C. 103 as being unpatentable over Edwards et al. (US-2018/0253881-A1, hereinafter "Edwards") in view of Merrill et al. (US-6181351-B1, hereinafter "Merrill"), and further in view of Steptoe et al. (US-11468616-B1, hereinafter "Steptoe").
29. As per claim 7, Edwards in view of Merrill discloses: The method according to claim 1, wherein the controlling, according to the viseme information and the intensity information corresponding to the sets of viseme feature data, a virtual face to change, so as to generate a mouth shape animation corresponding to the target audio, comprises: (See rejection for claim 1.)
[[assigning, for each set of viseme feature data, values to mouth shape controls in an animation production interface by using the viseme information corresponding to the viseme feature data, and assigning values to intensity controls in the animation production interface by using the intensity information corresponding to the viseme feature data;]]
controlling, by using the value-assigned mouth shape controls and the value-assigned intensity controls, a virtual face to change, so as to generate a mouth shape key frame corresponding to the viseme feature data; and (Edwards, [0064]-[0066], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape. In this case, the viseme action units are a linear combination of the modulated viseme shape. In other words, the jaw contributions and the lip contributions can be respectively synchronized to independent visemes, and the viseme action units can be a linear combination of the independent visemes. … In some cases, viseme action units can be determined by manually setting jaw and lip values over time by a user via the input device 222. In other cases, the viseme action units can be determined by receiving lip contributions via the input device 22, and having the jaw contributions be determined by determining the modulation of volume of input speech audio. In other cases, the lip contributions and the jaw contributions can be automatically determined by the system 300 from input speech audio and/or input speech transcript.”)
generating a mouth shape animation corresponding to the target audio according to the mouth shape key frames respectively corresponding to the sets of viseme feature data. (Edwards, [0079]-[0080], “Applicant recognized the substantial advantage of using a JALI viseme field to provide a controllable abstraction over expressive speech animation of the same phonetic content. As described herein, the JALI viseme field setting over time, for a given performance, can be extracted plausibly through analysis of the audio signal. … In an embodiment, as shown in FIG. 7, an animatable facial rig can be constructed that is compatible with the JALI viseme field. The “Valley Girl” rig 702 is a fairly realistic facial model rigged in MAYA™. Her face is controlled through a typical combination of blend-shapes (to animate her facial action units) and skeletal skinning (to animate her jaw and tongue).” and [0064], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape.” and [0081], “As such, the face geometry is a composition of a neutral face nface, overlaid with skeletal jaw and tongue deformation jd; td, displaced by a linear blend of weighted blend-shape action unit displacements au; thus, face=nface+jd+td+au.”)
30. Edwards in view of Merrill doesn't explicitly disclose but Steptoe discloses: assigning, for each set of viseme feature data, values to mouth shape controls in an animation production interface by using the viseme information corresponding to the viseme feature data, and assigning values to intensity controls in the animation production interface by using the intensity information corresponding to the viseme feature data; (Steptoe, col. 9, lines 1-3, “In some examples, an “intensity” of an AU may describe a degree of transition from a neutral position of an AU to a maximum position of the AU.” and col.9, lines 26-34, “In further examples, an intensity of an AU may be noted and/or described, without limitation, in terms of a portion of a whole, a rate, an absolute value, a relative value, and so forth. For example, an intensity of an AU may be noted and/or described as a percentage, where a higher percentage may indicate a higher intensity of the AU, with 0% representing an unobservable and/or neutral intensity of the AU and 100% representing a maximum intensity of the AU.”)
31. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 1 of Edwards in view of Merrill to include the disclosure of assigning, for each set of viseme feature data, values to mouth shape controls in an animation production interface by using the viseme information corresponding to the viseme feature data, and assigning values to intensity controls in the animation production interface by using the intensity information corresponding to the viseme feature data, of Steptoe. The motivation for this modification could have been to assign values to each viseme control so that there are default values for the mouth shape. This provides a baseline from which these values can be changed from the existing mouth shape to a new one.
32. As per claim 8, Edwards in view of Merrill, and further in view of Steptoe discloses: The method according to claim 7, wherein the viseme information comprises at least one single-pronunciation viseme parameter and at least one co-pronunciation viseme parameter, the mouth shape controls comprising at least one single-pronunciation mouth shape control and at least one co-pronunciation mouth shape control; and (Edwards, [0077], “Conventional animation of human speech is based on a mapping from phonemes to visemes, such as the two labiodental phonemes /f v/ mapping to a single FFF viseme, shown in FIG. 5, where the lower lip is pressed against the upper teeth. Typically, animators create linearly superposed blend-shapes to represent these visemes and animate speech by keyframing these blend-shapes over time.” and [0098], “In the field of linguistics, “co-articulation” is the movement of articulators to anticipate the next sound or preserving movement from the last sound. In some cases, the representation of speech can have a few simplifying aspects.” and [0097], “For co-articulation 916, timing can be based on the alignment returned by the forced alignment and the results of the co-articulation model. Given the amplitude, the phoneme-to-viseme conversion is processed through a co-articulation model, or else the lips, tongue and jaw can distinctly pronounce each phoneme, which is neither realistic nor expressive. Severe mumbling or ventriloquism makes it clear that coherent audible speech can often be produced with very little visible facial motion, making co-articulation essential for realism.”)
the assigning, for each set of viseme feature data, values to mouth shape controls in an animation production interface by using the viseme information corresponding to the viseme feature data comprises: (Edwards, [0072], “Computer facial animation can be broadly classified as procedural, data-driven, or performance-capture. Procedural speech animation segments speech into a string of phonemes, which are then mapped by rules or look-up tables to visemes; typically many-to-one. As an example, / m b p / all map to the viseme MMM in FIG. 4.”)
separately assigning, for each set of viseme feature data, values to single-pronunciation mouth shape controls in the animation production interface by using the single-pronunciation viseme parameters corresponding to the respective set viseme feature data; and (Edwards, [0098], “First, many phonemes map to a single viseme; for example, the phonemes: /AO/ (caught), /AX/ (about), AY/ (bite), and /AA/ (father) all map to the viseme AHH (see, for example, FIG. 4). Second, most motion of the tongue is typically hidden, as only glimpses of motion of the tongue are necessary to convince the viewer the tongue is participating in speech.”)
separately assigning values to co-pronunciation mouth shape controls in the animation production interface by using the co-pronunciation viseme parameters corresponding to the viseme feature data. (Edwards, [0098], “In the field of linguistics, “co-articulation” is the movement of articulators to anticipate the next sound or preserving movement from the last sound. In some cases, the representation of speech can have a few simplifying aspects.” and [0097], “For co-articulation 916, timing can be based on the alignment returned by the forced alignment and the results of the co-articulation model. Given the amplitude, the phoneme-to-viseme conversion is processed through a co-articulation model, or else the lips, tongue and jaw can distinctly pronounce each phoneme, which is neither realistic nor expressive. Severe mumbling or ventriloquism makes it clear that coherent audible speech can often be produced with very little visible facial motion, making co-articulation essential for realism.”)
33. As per claim 9, Edwards in view of Merrill, and further in view of Steptoe discloses: The method according to claim 7, wherein the intensity information comprises a horizontal intensity parameter and a vertical intensity parameter, the intensity control comprising a horizontal intensity control and a vertical intensity control; and (Merrill, col. 16, lines 15-26, “The mouth animation module 432 typically provides a choice of seven different mouth shapes that can be displayed for a character. Typically, the mouth shapes are loaded from a mouth data file containing a set of bitmap images that can be customized for the particular character being presented. In addition, other sets of mouth shapes can be provided to reflect changes in the character's size, perspective, posture, or position on the display. To create the illusion of speech, the mouth animation module 432 draws a bitmap representing the mouth on top of the constructed animation frame representing the character at the (x, y) coordinates of the character's mouth location.”)
the assigning values to intensity controls in the animation production interface by using the intensity information corresponding to the viseme feature data comprises: (Edwards, [0064]-[0066], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape. In this case, the viseme action units are a linear combination of the modulated viseme shape. In other words, the jaw contributions and the lip contributions can be respectively synchronized to independent visemes, and the viseme action units can be a linear combination of the independent visemes. … In some cases, viseme action units can be determined by manually setting jaw and lip values over time by a user via the input device 222. In other cases, the viseme action units can be determined by receiving lip contributions via the input device 22, and having the jaw contributions be determined by determining the modulation of volume of input speech audio. In other cases, the lip contributions and the jaw contributions can be automatically determined by the system 300 from input speech audio and/or input speech transcript.”)
assigning a value to the horizontal intensity control in the animation production interface by using the horizontal intensity parameter corresponding to the viseme feature data; and
assigning a value to the vertical intensity control in the animation production interface by using the vertical intensity parameter corresponding to the viseme feature data. (Merrill, col. 16, lines 15-26, “The mouth animation module 432 typically provides a choice of seven different mouth shapes that can be displayed for a character. Typically, the mouth shapes are loaded from a mouth data file containing a set of bitmap images that can be customized for the particular character being presented. In addition, other sets of mouth shapes can be provided to reflect changes in the character's size, perspective, posture, or position on the display. To create the illusion of speech, the mouth animation module 432 draws a bitmap representing the mouth on top of the constructed animation frame representing the character at the (x, y) coordinates of the character's mouth location.”)
34. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 7 of Edwards in view of Steptoe to include the disclosure of parameters for horizontal and vertical intensity controls, where a value is assigned to each by using the viseme feature data, of Merrill. The motivation for this modification could have been to fine-tune the position of the visemes so that they better match the audio, particularly in the horizontal and vertical directions.
35. As per claim 10, Edwards in view of Merrill, and further in view of Steptoe discloses: The method according to claim 7, wherein after generating a mouth shape animation corresponding to the target audio according to the mouth shape key frames respectively corresponding to the sets of viseme feature data, the method further comprises: (See rejection for claim 7.)
performing control parameter updating for at least one of the value-assigned mouth shape controls and the value-assigned intensity controls in response to a trigger operation for the mouth shape controls; and (Merrill, col. 3, lines 4-11, “The speech recognition engine analyzes the speech sound data and the text to determine linguistic event values and time values. A linguistic event value indicates a linguistic event in the speech sound data, such as a spoken phoneme, a spoken word, or some other event. A time value indicates when the linguistic event occurs within the speech sound data.”)
controlling, by using an updated control parameter, the virtual face to change. (Merrill, col. 3, lines 12-22, “When the character is animated, the file player plays the linguistically enhanced sound file to produce sound output and send information to the animation. The information includes events specifying that the animation perform some action to indicate the linguistic event at a time indicated by the time value. For example, a particular mouth shape associated with a spoken phoneme could be presented in a frame of the character animation or the text of a spoken word could be presented in the character's word balloon. The result is a synchronized animation of a quality superior to that produced by amplitude-based mouth shape selection.”)
36. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 7 of Edwards in view of Steptoe to include the disclosure of performing control parameter updating for at least one of the value-assigned mouth shape controls and the value-assigned intensity controls in response to a trigger operation for the mouth shape controls and controlling, by using an updated control parameter, the virtual face to change, of Merrill. The motivation for this modification could have been that, when an event, or “trigger,” occurs, the control parameters can be dynamically updated in response to the event. This helps the mouth shape stay in sync with the audio.
37. As per claim 11, Edwards in view of Merrill, and further in view of Steptoe discloses: The method according to claim 7, wherein each mouth shape control in the animation production interface has a mapping relationship with a corresponding action unit, each action unit is used for controlling a corresponding region of the virtual face to produce a change; and (Edwards, [0019], “In another aspect, there is provided a system for animated lip synchronization, the system having one or more processors and a data storage device, the one or more processors in communication with the data storage device, the one or more processors configured to execute: a correspondence module for mapping phonemes to visemes; a synchronization module for synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and an output module for outputting the viseme action units to an output device.”)
the controlling, by using the value-assigned mouth shape controls and the value-assigned intensity controls, a virtual face to change, so as to generate a mouth shape key frame corresponding to the viseme feature data, comprises: (Edwards, [0066], “In some cases, viseme action units can be determined by manually setting jaw and lip values over time by a user via the input device 222. In other cases, the viseme action units can be determined by receiving lip contributions via the input device 22, and having the jaw contributions be determined by determining the modulation of volume of input speech audio. In other cases, the lip contributions and the jaw contributions can be automatically determined by the system 300 from input speech audio and/or input speech transcript.” and [0086], “For animated speech, the JALI model provides a layer of speech abstraction over the phonetic structure. The JALI model can be phonetically controlled by traditional keyframing or automatic procedurally generated animation (as described herein).”)
determining, for an action unit mapped by each value-assigned mouth shape control, a target action parameter of the action unit according to an action intensity parameter of a matched intensity control, the matched intensity control being a value-assigned intensity control corresponding to the value-assigned mouth shape control; and (Edwards, [0081], “A conventional facial rig often has individual blend-shapes for each viseme; usually with a many-to-one mapping from phonemes to visemes, or many-to-many using dynamic visemes. In contrast, a JALI-rigged character, according to the system and method described herein, may require that such visemes be separated to capture sound production and shaping as mixed contribution of the jaw, tongue and facial muscles that control the lips.”)
controlling, according to the action unit having the target action parameter, the corresponding region of the virtual face to produce a change, so as to generate the mouth shape key frame corresponding to the viseme feature data. (Edwards, [0079]-[0080], “Applicant recognized the substantial advantage of using a JALI viseme field to provide a controllable abstraction over expressive speech animation of the same phonetic content. As described herein, the JALI viseme field setting over time, for a given performance, can be extracted plausibly through analysis of the audio signal. … In an embodiment, as shown in FIG. 7, an animatable facial rig can be constructed that is compatible with the JALI viseme field. The “Valley Girl” rig 702 is a fairly realistic facial model rigged in MAYA™. Her face is controlled through a typical combination of blend-shapes (to animate her facial action units) and skeletal skinning (to animate her jaw and tongue).” and [0064], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape.” and [0081], “As such, the face geometry is a composition of a neutral face nface, overlaid with skeletal jaw and tongue deformation jd; td, displaced by a linear blend of weighted blend-shape action unit displacements au; thus, face=nface+jd+td+au.”)
38. As per claim 12, Edwards in view of Merrill, and further in view of Steptoe discloses: The method according to claim 11, wherein the viseme information corresponding to each set of viseme feature data further comprises adjoint intensity information that affects the viseme corresponding to the viseme information; and (Steptoe, col. 10, lines 35-50, “In some examples, to enable a rich set of behaviors to accompany articulation of audio for virtual avatars, one or more of modules 102 (e.g., identifying module 104, determining module 106, etc.) may break down composite viseme shapes into one or more codified sets of AUs at variable levels of intensity. By way of illustration, FIG. 6 includes a table 600 that illustrates an identified set of AUs and/or AU parameters associated with a face that is producing a viseme named “nn” that may correspond to phonemes /n/ and/or /l/. As shown, one or more of modules 102 (e.g., identifying module 104, determining module 106, etc.) may determine that viseme nn may be produced by AU 14 (“Dimpler”) at an intensity of 0.200, AU 16 (“Lower Lip Depressor”) at an intensity of 0.600, and AU 22 (“Lip Funneler”) at an intensity of 0.180.”)
the determining, for an action unit mapped by each value-assigned mouth shape control, a target action parameter of the action unit according to an action intensity parameter of a matched intensity control comprises: (Steptoe, col. 10, lines 58-65, “The user may engage the respective muscle groups when producing the nn viseme. As will be described in greater detail below, one or more of modules 102 (e.g., determining module 106) may, for each AU in the set of AUs, determine a set of AU parameters associated with the AU and the viseme, such as a target intensity level for the AU of 0.200, 0.600, and 0.180, respectively.” and col. 6, lines 2-7, “Moreover, directing module 110 may cause user device 202, server 206, and/or target device 208 to direct a computer-generated avatar (e.g., computer-generated avatar 238) that represents the user to produce the viseme in accordance with the set of AU parameters associated with each AU in response to detecting that the user has produced the sound.”)
determining, for the action unit mapped by each value-assigned mouth shape control, the target action parameter of the action unit according to the adjoint intensity information and the action intensity parameter of the matched intensity control. (Steptoe, col. 5, lines 34-45, “Furthermore, determining module 106 may cause user device 202, server 206, and/or target device 208 to, for each AU in the set of AUs, determine a set of AU parameters (e.g., set of AU parameters 222) associated with the AU and the viseme. In some examples, the set of AU parameters may include (1) an onset curve (e.g., onset curve 224) associated with the viseme, and (2) a falloff curve (e.g., falloff curve 226) associated with the viseme. In some examples, the set of AU parameters may further include a target intensity level (e.g., target intensity 228) associated with the viseme and/or a source intensity level (e.g., source intensity 230) associated with a previous viseme.”)
39. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 11 of Edwards in view of Merrill to include the disclosure that the viseme information corresponding to each set of viseme feature data further comprises adjoint intensity information that affects the viseme corresponding to the viseme information, and of determining the target action parameter of the action unit according to the adjoint intensity information and the action intensity parameter of the matched intensity control, of Steptoe. The motivation for this modification could have been to join multiple visemes together so that their combined motions produce a new, unique movement or shift of the mouth shape or face. Typically, the joined visemes would provide a larger overall intensity to a facial region beyond what an individual viseme could control.
40. As per claim 13, Edwards in view of Merrill, and further in view of Steptoe discloses: The method according to claim 12, wherein the adjoint intensity information comprises an initial animation parameter of the action unit; and (Steptoe, col. 18, lines 48-55, “Transition period 912 may indicate an intensity of the AU over time as directing module 110 directs computer-generated avatar 238 to transition from the initial viseme to the subsequent viseme. As shown by a portion of curve 906 in transition period 912, directing module 110 may cause computer-generated avatar 238 to smoothly transition from the target intensity level of the initial viseme to the target intensity level of the subsequent viseme.”)
wherein the determining, for the action unit mapped by each value-assigned mouth shape control, the target action parameter of the action unit according to the adjoint intensity information and the action intensity parameter of the matched intensity control comprises: (Edwards, [0081], “A conventional facial rig often has individual blend-shapes for each viseme; usually with a many-to-one mapping from phonemes to visemes, or many-to-many using dynamic visemes. In contrast, a JALI-rigged character, according to the system and method described herein, may require that such visemes be separated to capture sound production and shaping as mixed contribution of the jaw, tongue and facial muscles that control the lips.”)
weighting, for the action unit mapped by each value-assigned mouth shape control, the action intensity parameter of the matched intensity control with the initial animation parameter of the action unit, to obtain the target action parameter of the action unit. (Edwards, [0024], “In another aspect, there is provided a facial model for animation on a computing device, the computing device having one or more processors, the facial model comprising: a neutral face position; an overlay of skeletal jaw deformation, lip deformation and tongue deformation; and a displacement of the skeletal jaw deformation, the lip deformation and the tongue deformation by a linear blend of weighted blend-shape action units.” and [0064], “In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape. In this case, the viseme action units are a linear combination of the modulated viseme shape. In other words, the jaw contributions and the lip contributions can be respectively synchronized to independent visemes, and the viseme action units can be a linear combination of the independent visemes.”)
41. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 12 of Edwards in view of Merrill to include the disclosure of the adjoint intensity information comprising an initial animation parameter of the action unit, of Steptoe. The motivation for this modification could have been to join multiple visemes together so that their joint motions encompass a new, unique movement or shift of the mouth shape or face. Typically, the joint visemes would provide a larger overall intensity to a face region beyond what an individual viseme could control.
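By way of illustration only, the following sketch shows the examiner's reading of the weighting step recited in claim 13, as read on the Edwards passages quoted in paragraph 40 above: the action intensity parameter of the matched intensity control modulates the initial animation parameter (blend-shape weight) of each action unit, and the face displacement is a linear blend of the weighted blend-shape action units. All names are hypothetical and illustrative; this is not the claimed method or either reference's implementation.

    # Hypothetical sketch: weight a control's intensity with an AU's initial
    # animation parameter, then linearly blend the weighted AU shapes.
    def target_action_parameter(initial_param: float, control_intensity: float) -> float:
        # Weight the matched intensity control's parameter with the action
        # unit's initial animation parameter to obtain the target parameter.
        return initial_param * control_intensity

    def blend_displacement(au_shapes: dict[int, list[float]],
                           au_weights: dict[int, float]) -> list[float]:
        # Linear blend of weighted blend-shape action units over a neutral face.
        if not au_shapes:
            return []
        n = len(next(iter(au_shapes.values())))
        out = [0.0] * n
        for au, shape in au_shapes.items():
            w = au_weights.get(au, 0.0)
            out = [acc + w * d for acc, d in zip(out, shape)]
        return out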
42. As per claim 14, Edwards in view of Merrill, and further in view of Steptoe discloses: The method according to claim 7, wherein the generating a mouth shape animation corresponding to the target audio according to the mouth shape key frames respectively corresponding to the sets of viseme feature data comprises: (See the rejection of claim 7.)
bonding and recording, for the mouth shape key frame corresponding to each set of viseme feature data, the mouth shape key frame corresponding to the viseme feature data and a timestamp corresponding to the viseme feature data, to obtain a record result corresponding to the mouth shape key frame; (Merrill, col. 2, line 66-col. 3, line 1, “In one implementation, a system synchronizes the animation of a character with recorded speech in the form of speech sound data.” and col. 3, line 17-22, “For example, a particular mouth shape associated with a spoken phoneme could be presented in a frame of the character animation or the text of a spoken word could be presented in the character's word balloon. The result is a synchronized animation of a quality superior to that produced by amplitude-based mouth shape selection.” and col. 13, line 63-col. 14, line 7, “The user can employ the interface to manipulate the linguistic information via the markers 384 and 390. For example, the user can change the size and position of a marker by dragging (i.e., manipulating with a pointing device such as a mouse) a marker edge 386 or 388. The change in the size and position of the marker is reflected in the linguistic information stored in the tool and is used during subsequent previews in the mouth animation pane 362. The user can select a save option to generate a linguistically enhanced sound file based on the graphical representation of sound data 382 and the phoneme and word break information represented by the markers.”)
obtaining an animation playing curve corresponding to the target audio according to the record results respectively corresponding to the mouth shape key frames; and (Edwards, [0018], “In yet another case, the method further comprising outputting a phonetic animation curve based on the change of viseme action units over time.” and [0088], “In the animation phase 906, the aligned phonemes are mapped to visemes by the correspondence module 206. Viseme amplitudes are set (for articulation) 914. Then the visemes are re-processed 916, by the synchronization module 208, for co-articulation to produce viseme timings and resulting animation curves for the visemes (in an example, a Maya MEL script of sparsely keyframed visemes). These phonetic animation curves can be outputted by the output module 210 to demonstrate how the phonemes are changing over time.” and Edwards, [0095], “In the animation phase 904, a facial rig is animated by producing sparse animation keyframes for visemes by the correspondence module 206. The viseme to be keyframed is determined by the co-articulation model described herein. The timing of the viseme is determined by forced alignment after it has been processed through the co-articulation model.”)
sequentially playing the mouth shape key frames according to the animation playing curve, to obtain a mouth shape animation corresponding to the target audio. (Merrill, col. 3, lines 12-22, “When the character is animated, the file player plays the linguistically enhanced sound file to produce sound output and send information to the animation. The information includes events specifying that the animation perform some action to indicate the linguistic event at a time indicated by the time value. For example, a particular mouth shape associated with a spoken phoneme could be presented in a frame of the character animation or the text of a spoken word could be presented in the character's word balloon. The result is a synchronized animation of a quality superior to that produced by amplitude-based mouth shape selection.”)
43. Before the effective filing date of the claimed invention, it would have been obvious to a person having ordinary skill in the art to modify the method of claim 7 of Edwards in view of Steptoe to include the disclosure of bonding and recording, for the mouth shape key frame corresponding to each set of viseme feature data, the mouth shape key frame corresponding to the viseme feature data and a timestamp corresponding to the viseme feature data, to obtain a record result corresponding to the mouth shape key frame and sequentially playing the mouth shape key frames according to the animation playing curve, to obtain a mouth shape animation corresponding to the target audio, of Merrill. The motivation for this modification could have been to allow for a final rendered animation to be fully synchronized and recorded, producing a final result. This would then allow for the animation to be played back at any time without need of the facial animation control rigging.
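By way of illustration only, the following sketch shows the examiner's reading of the claim 14 steps on the Merrill and Edwards passages quoted in paragraph 42 above: each mouth shape key frame is bound to the timestamp of its viseme feature data, the resulting records form an animation playing curve, and the key frames are played back in order. The record structure and function names are hypothetical and illustrative only.

    # Hypothetical sketch: bind key frames to timestamps, build a playing
    # curve, and sequentially play the key frames according to that curve.
    from dataclasses import dataclass

    @dataclass
    class KeyFrameRecord:
        timestamp: float              # time of the viseme feature data, in seconds
        key_frame: dict[int, float]   # AU -> intensity for the mouth shape key frame

    def build_playing_curve(records: list[KeyFrameRecord]) -> list[KeyFrameRecord]:
        # The animation playing curve is the set of records ordered by time.
        return sorted(records, key=lambda r: r.timestamp)

    def play(curve: list[KeyFrameRecord]) -> None:
        # Sequentially "play" each mouth shape key frame at its timestamp.
        for record in curve:
            print(f"t={record.timestamp:.2f}s -> {record.key_frame}")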
Conclusion
44. Any inquiry concerning this communication or earlier communications from the examiner should be directed to MATTHEW CLOTHIER whose telephone number is (571)272-4667. The examiner can normally be reached Mon-Fri 8:00am-4:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Kent Chang, can be reached at (571)272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MATTHEW CLOTHIER/Examiner, Art Unit 2614
/KENT W CHANG/Supervisory Patent Examiner, Art Unit 2614