DETAILED ACTION
This action is in response to the Amendment filed October 27, 2025.
Claims 1-15 are pending in this case. No claims have been amended, added, or cancelled. This action is made Non-Final.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant's arguments, see pages 7 and 8, filed October 27, 2025, with respect to the rejection(s) of claim(s) 1-8 and 11-15 under 35 U.S.C. 103 have been fully considered and are persuasive. Applicant argues that the prior art of record, BEITH, has a filing date after the earlier effective filing date of the present application, and thus BEITH is disqualified from being used as prior art. The Examiner notes that at the time of the prior examination, the bibliographic (BIB) data sheet did not contain accurate information; e.g., the present application is a 371 of PCT/US2021/041822 filed on July 15, 2021. The Examiner has contacted the appropriate handling department to update the BIB data sheet with this information. Accordingly, the previous rejection relying on BEITH as prior art has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Steptoe et al. (US 11,468,616). Please see the rejection and notes regarding the claims below.
Allowable Subject Matter
The indicated allowability of claims 9 and 10 is withdrawn in view of the newly discovered reference(s) to Xiao et al. (US 11,113,859). Rejections based on the newly cited reference(s) follow.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-5, 7, 8, and 12-15 are rejected under 35 U.S.C. 103 as being unpatentable over Steptoe et al. (US 11,468,616).
As to claim 1, Steptoe et al. disclose a non-transitory computer-readable data storage medium (Figure 1, system 100, further illustrated as system 200 in Figure 2, with memory 120) storing program code (e.g. storing modules 102 as well as data and/or computer-readable instructions, column 3, lines 48-59) executable by a processor (e.g. physical processor 130, column 3, line 60 through column 4, line 10) to perform processing (process of Figure 3) comprising: detecting speech using a microphone (step 330, detecting that a user produced a sound, where column 16, lines 31-56 notes user device 202 (illustrated as computing device 202) may include a microphone or other audio sensor, where detecting module 108 may receive sound 220, which may include one or more phonemes articulated by the user, via the microphone or other audio sensor, and detecting module 108 may then identify the one or more phonemes included in sound 220 using any suitable speech recognition technique) of a head-mountable display (HMD) (column 6, lines 12-18 notes user device 202 may be a wearable device, with lines 19-28 further noting user device 202 including one or more sensors, e.g. a microphone and other audio sensors, where column 5, lines 10-19 notes system 100 may encompass all of system 200, e.g. each of user device 202, server 206, and target device 208, implementing one or more modules 102, where column 28, line 64 through column 29, line 22 notes the system may be implemented in conjunction with an artificial reality system, which may include virtual reality (VR), augmented reality (AR), mixed reality (MR), hybrid reality, or some combination and/or derivative thereof, where the artificial reality system provides artificial reality content that may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content to one or more viewers, and where column 17, lines 14-21 further notes the computer-generated avatar may include an augmented reality avatar presented within an augmented reality environment, a virtual reality avatar presented within a virtual reality environment, and/or a video conferencing avatar presented within a video conferencing application; thus, user device 202 may be considered a head-mounted display (HMD), the user of user device 202 being a wearer of the HMD), the speech including a phoneme (e.g. column 16, lines 49-51 notes sound 220 received may include one or more phonemes articulated by the user); determining whether a wearer of the HMD uttered the speech (step 330, detecting that a user produced a sound, where column 16, lines 31-56 further notes that, based on identifying the one or more phonemes, detecting module 108 may detect that the user has produced sound 220, where column 16, line 57 through column 17, line 2 gives an illustration of multiple users of different devices, where the detecting module may pick up sound 220 specifically from user device 202 and detect that sound 220 came from the user of user device 202 (NOTE: determining whether a wearer uttered the speech may additionally include steps 310 and 320 of Figure 3, see claims below)); and in response to determining that the wearer uttered the speech (e.g. in response to each of steps 310, 320, and 330, detecting that the user has produced sound 220), rendering an avatar representing the wearer to have a viseme corresponding to the phoneme (step 340, directing a computer-generated avatar that represents the user to produce the viseme in accordance with the set of action unit parameters associated with each AU in response to detecting that the user has produced the sound, where column 17, line 3 through column 18, line 22 notes computer-generated avatar 238 may be configured to produce one or more visemes via one or more action units (AUs) and/or AU parameters, where Figure 4 illustrates a table 400 that shows various examples of visemes, associated phonemes, examples of words that, when pronounced by a human, may cause the human's face to produce the visemes, and visual depictions of a human face producing associated visemes, where Figure 5 further illustrates a table 500 that shows various examples of AUs, AU names, and visual depictions of a human face when the AU is at a maximum intensity, and where column 8, lines 42-52 notes an "action unit" may include any information that describes actions of individual muscles and/or groups of muscles, e.g. an AU may be associated with one or more facial muscles that may be engaged by a user to produce a viseme).
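For purposes of illustration only, the following minimal Python sketch shows one way a phoneme-to-viseme mapping of the kind shown in Steptoe et al.'s table 400 (Figure 4) could gate avatar rendering on the determination that the wearer uttered the detected speech. The table entries, function names, and rendering stub are hypothetical and are not taken from the reference or the claims.

    # Hypothetical sketch: map detected phonemes to visemes and render only when
    # the speech is attributed to the wearer (cf. steps 330/340 of Figure 3).

    PHONEME_TO_VISEME = {
        # Illustrative subset of a phoneme-to-viseme table (cf. Steptoe Figure 4).
        "p": "PP", "b": "PP", "m": "PP",
        "f": "FF", "v": "FF",
        "ch": "CH", "jh": "CH", "sh": "CH",
        "aa": "AA", "ae": "AA",
    }

    def render_avatar_viseme(viseme: str) -> None:
        # Placeholder for directing the avatar to produce the viseme (step 340).
        print(f"avatar -> viseme {viseme}")

    def process_detected_speech(phonemes: list[str], wearer_uttered: bool) -> None:
        """Render a viseme for each phoneme only if the wearer uttered the speech."""
        if not wearer_uttered:
            return  # no viseme is rendered for speech not attributed to the wearer
        for phoneme in phonemes:
            viseme = PHONEME_TO_VISEME.get(phoneme)
            if viseme is not None:
                render_avatar_viseme(viseme)

    # Example: speech "map" detected via the wearer's microphone.
    process_detected_speech(["m", "aa", "p"], wearer_uttered=True)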
As noted above, Steptoe et al. describe system 100/200 as performing the operations described, where user device 202 (computing device 202), as part of system 200, may include a wearable device further comprising various sensors, including imaging devices and a microphone. Steptoe et al. further disclose that the system may be implemented in conjunction with an artificial reality system, which may include virtual reality (VR), augmented reality (AR), mixed reality (MR), hybrid reality, or some combination and/or derivative thereof, where the artificial reality system provides artificial reality content that may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device or computing system, or any other hardware platform capable of providing artificial reality content, e.g. the computer-generated avatar, to one or more viewers. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to implement the system 100/200 of Steptoe et al., e.g. user device 202, as a wearable device comprising a head-mounted display (HMD), such that the computer-generated avatar may be rendered and displayed as described, thus yielding predictable results without changing the scope of the invention.
As to claim 2, Steptoe et al. disclose the processing comprises: displaying the rendered avatar representing the wearer of the HMD (e.g. displaying the computer-generated avatar representing the user that produced the sound, e.g. the wearer of user device 202 as the HMD) (column 2, line 56 through column 3, line 3 notes the computer-generated avatar 238 may represent a user within an artificial environment, such as a VR and/or AR environment, to accurately and realistically reproduce, in real-time, facial expressions and/or facial motions associated with a series of phonemes produced by the user (e.g., words spoken by the user) and/or body actions executed by the user).
As to claim 3, Steptoe et al. disclose the processing comprises: in response to determining that the wearer did not utter the speech, not rendering the avatar to have the viseme corresponding to the phoneme (as noted in claim 1, step 340 is performed "in response to" each of steps 310, 320, and 330 detecting that the user produced the sound; thus, if the outcome of any of steps 310, 320, and 330, including step 330, does not indicate that the user uttered the speech, it is not detected that the user produced the sound, and it would be obvious that step 340 is not performed and the computer-generated avatar is not rendered to have the viseme).
As to claim 4, Steptoe et al. disclose determining whether the wearer of the HMD uttered the speech (Figure 3) comprises: detecting, using a camera of the HMD (e.g. via one or more sensors and/or imaging devices, where column 6, lines 19-28 notes user device 202 (e.g. HMD) may include one or more sensors that may gather data associated with the environment that includes the user device 202, e.g. an imaging device configured to capture one or more portions of an electromagnetic spectrum (e.g. a visible-light camera, an infrared camera, an ultraviolet camera, etc.)), whether mouth movement of the wearer occurred while the speech was detected (column 11, lines 52-67 further notes identifying module 104 may include one or more sensors, e.g. an imaging device and/or an audio recording device, for capturing a set of images 232, e.g. a series of images, a video file, a set of frames of a video file, and recording audio 236, e.g. a recording of the user producing phonemes 234 and/or responding to prompts included in phonemes 234, as the user speaks sets of phonemes 234) (step 310, column 7, lines 33-57 notes identifying a set of AUs associated with a face of a user, each AU associated with at least one muscle group engaged by the user to produce a viseme associated with a sound produced by the user, where a "viseme" may include a visual representation of a configuration (shape) of a face of a person as the person produces an associated set of phonemes, each viseme associated with a mouth shape for a specific set of phonemes (see Figures 4 and 5), where column 12, lines 1-23 further notes identifying module 104 may identify viseme 218 from the set of images 232 via image recognition and may employ any suitable face tracking and/or identification system to identify a set of AUs 210 associated with face 212, e.g. from the set of captured images, and associate a feature of the face of the user, e.g. a portion of the user's face that may generally correspond to a set of muscle groups associated with the set of AUs, with the set of AUs based on the identification of the viseme, the set of images, and recorded audio, and step 320, column 12, line 31 through column 13, line 14 further notes, for each AU in the set of AUs, determining a set of AU parameters associated with the AU and the viseme, where the set of AU parameters may include (1) an onset curve associated with the viseme, and (2) a falloff curve associated with the viseme, which may specify a maximum velocity and/or activation/deactivation velocity of a given muscle group; thus, as a person strains his or her lips, such as with rapid queuing or chaining of mouth shapes, e.g. visemes, the person may distort a shape of his or her mouth in varying ways to make way for a next mouth shape, and then, at step 330, column 16, line 31 through column 17, line 2 notes detecting that the user has produced the sound); in response to detecting that the mouth movement of the wearer occurred while the speech was detected, determining that the wearer uttered the speech (e.g. determining that the user of user device 202, e.g. the wearer of the HMD, uttered the speech based on the determinations at each of steps 310, 320, and 330 as noted above, which include determining mouth (and/or lip) movements (steps 310, 320) while detecting sound (step 330)); and in response to not detecting that the mouth movement of the wearer occurred while the speech was detected, determining that the wearer did not utter the speech (e.g. determining that the user of user device 202, e.g. the wearer of the HMD, did not utter the speech based on the determinations at each of steps 310, 320, and 330 as noted, which include determining mouth (and/or lip) movements (steps 310, 320) and/or detecting sound (step 330)).
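As a minimal sketch of the audio/visual gating described above for steps 310-330, the following Python fragment assumes camera-derived mouth-openness values per frame and a detected-speech interval; the Frame structure, threshold, and timing values are illustrative assumptions rather than the reference's implementation.

    # Hypothetical sketch: attribute detected speech to the HMD wearer only when
    # camera-detected mouth movement overlaps in time with the detected speech.

    from dataclasses import dataclass

    @dataclass
    class Frame:
        timestamp: float        # seconds
        mouth_openness: float   # 0.0 (closed) .. 1.0 (fully open), from face tracking

    def mouth_moved_during(frames: list[Frame], speech_start: float, speech_end: float,
                           threshold: float = 0.15) -> bool:
        """Return True if mouth movement exceeds a threshold while speech was detected."""
        window = [f for f in frames if speech_start <= f.timestamp <= speech_end]
        if len(window) < 2:
            return False
        deltas = [abs(b.mouth_openness - a.mouth_openness)
                  for a, b in zip(window, window[1:])]
        return max(deltas) >= threshold

    def wearer_uttered_speech(frames: list[Frame], speech_start: float, speech_end: float) -> bool:
        # Speech is attributed to the wearer only if mouth movement co-occurred with it.
        return mouth_moved_during(frames, speech_start, speech_end)

    frames = [Frame(0.0, 0.05), Frame(0.1, 0.40), Frame(0.2, 0.10), Frame(0.3, 0.35)]
    print(wearer_uttered_speech(frames, speech_start=0.05, speech_end=0.25))  # True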
As to claim 5, Steptoe et al. disclose determining whether the wearer of the HMD uttered the speech (Figure 3) comprises: detecting, using a sensor of the HMD other than the microphone (e.g. via one or more sensors and/or imaging devices, where column 6, lines 19-28 notes user device 202 (e.g. HMD) may include one or more sensors that may gather data associated with the environment that includes the user device 202, e.g. an imaging device configured to capture one or more portions of an electromagnetic spectrum (e.g. a visible-light camera, an infrared camera, an ultraviolet camera, etc.)), whether mouth movement of the wearer occurred while the speech was detected (column 11, lines 52-67 further notes identifying module 104 may include one or more sensors, e.g. an imaging device and/or an audio recording device, for capturing a set of images 232, e.g. a series of images, a video file, a set of frames of a video file, and recording audio 236, e.g. a recording of the user producing phonemes 234 and/or responding to prompts included in phonemes 234, as the user speaks sets of phonemes 234); in response to detecting that the mouth movement of the wearer occurred while the speech was detected, determining that the wearer uttered the speech; and in response to not detecting that the mouth movement of the wearer occurred while the speech was detected, determining that the wearer did not utter the speech (see details of claim 4 above).
As to claim 7, Steptoe et al. disclose the wearer is determined as having uttered the speech (e.g. based on the determinations of steps 310, 320, and 330 of Figure 3), wherein the processing further comprises: capturing facial images of the wearer while the speech is detected, using a camera of the HMD (e.g. via one or more sensors and/or imaging devices, where column 6, lines 19-28 notes user device 202 (e.g. HMD) may include one or more sensors that may gather data associated with the environment that includes the user device 202, e.g. an imaging device configured to capture one or more portions of an electromagnetic spectrum (e.g. a visible-light camera, an infrared camera, an ultraviolet camera, etc.)), the facial images comprising the viseme corresponding to the phoneme (column 11, lines 52-67 further notes identifying module 104 may include one or more sensors, e.g. an imaging device and/or an audio recording device, for capturing a set of images 232, e.g. a series of images, a video file, a set of frames of a video file, and recording audio 236, e.g. a recording of the user producing phonemes 234 and/or responding to prompts included in phonemes 234, as the user speaks sets of phonemes 234) (see details of claim 4 regarding steps 310 and 320 for identifying visemes corresponding to the phoneme, e.g. from the set of captured images 232), and wherein the avatar is rendered to have the viseme corresponding to the phoneme based on both the phoneme within the detected speech and the captured facial images including the viseme (e.g. as noted in claim 1, step 340, directing a computer-generated avatar that represents the user to produce the viseme in accordance with the set of action unit parameters associated with each AU in response to detecting that the user has produced the sound, where column 17, line 3 through column 18, line 22 notes computer-generated avatar 238 may be configured to produce one or more visemes via one or more action units (AUs) and/or AU parameters).
As to claim 8, Steptoe et al. disclose the processing further comprises: capturing sensor data while the speech is detected, using one or multiple sensors other than the camera and the microphone of the HMD (column 6, lines 19-28 notes user device 202 (e.g. HMD) may include one or more sensors that may gather data associated with the environment that includes the user device 202, e.g. an imaging device configured to capture one or more portions of an electromagnetic spectrum (e.g. a visible-light camera, an infrared camera, an ultraviolet camera, etc.), an inertial measurement unit (IMU), an accelerometer, a global positioning system device, a thermometer, a barometer, an altimeter, and other audio sensors, where one type of camera may be considered different from another, and other audio sensors may be considered different from the microphone, thus "other than the camera and the microphone"), and wherein the avatar is rendered to have the viseme corresponding to the phoneme further based on the captured sensor data (e.g. as noted in claim 4, at steps 310 and 320 the set of captured images 232 may be from any one of the imaging devices, and step 330 notes detecting that the user produced the sound with the microphone or other audio sensors; then, at step 340, directing a computer-generated avatar that represents the user to produce the viseme in accordance with the set of action unit parameters associated with each AU in response to detecting that the user has produced the sound (e.g. using the other audio sensors), where column 17, line 3 through column 18, line 22 notes computer-generated avatar 238 may be configured to produce one or more visemes via one or more action units (AUs) and/or AU parameters (identified via other capturing devices)).
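The following minimal sketch, under the assumption that the additional sensor is an IMU and that the evidence sources are fused by a simple weighted average, illustrates how sensor data other than the camera and microphone could further inform the viseme rendering decision; neither the weighting rule nor the names are taken from the reference.

    # Hypothetical sketch: fold auxiliary sensor data (e.g., an IMU) into the decision
    # of whether to render the viseme, alongside audio- and image-based evidence.

    def viseme_confidence(audio_score: float, image_score: float, imu_motion: float) -> float:
        """Combine evidence sources into a single confidence in [0, 1].

        audio_score / image_score: confidences from the microphone and camera paths.
        imu_motion: head-motion magnitude; strong motion discounts the image path.
        """
        image_weight = 0.5 * (1.0 - min(imu_motion, 1.0))  # trust images less under motion
        audio_weight = 1.0 - image_weight
        return audio_weight * audio_score + image_weight * image_score

    score = viseme_confidence(audio_score=0.9, image_score=0.7, imu_motion=0.2)
    print(f"render viseme: {score >= 0.5} (confidence {score:.2f})")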
As to claim 12, Steptoe et al. disclose a method comprising: detecting, by a processor using a microphone, speech including a phoneme; determining, by the processor, whether a user uttered the speech; in response to determining that the user uttered the speech, rendering, by the processor, an avatar representing the user to have a viseme corresponding to the phoneme (see details of claim 1 above); and displaying, by the processor, the avatar representing the user (see details of claim 2 above). Claim 12 is similar in scope to claims 1 and 2 combined, and is therefore rejected under similar rationale. Please see the rejection and rationale of claims 1 and 2 above.
As to claim 13, Steptoe et al. disclose the user is determined as having uttered the speech, wherein the method further comprises: capturing facial images of the user while the speech is detected, by a processor using a camera, the facial images comprising the viseme corresponding to the phoneme, and wherein the avatar is rendered to have the viseme corresponding to the phoneme based on both the phoneme within the detected speech and the captured facial images including the viseme (see details of claim 7 above).
As to claim 14, Steptoe et al. disclose a head-mountable display (HMD) (e.g. Figure 2, user device, where the details of claim 1 note user device 202 may be a head-mounted display (HMD)) comprising: a microphone to detect speech including a phoneme (e.g. column 6, lines 19-28 notes user device 202 (e.g. HMD) may include one or more sensors that may gather data associated with the environment that includes the user device 202, including a microphone, where, at step 330, column 16, lines 47-56 notes the microphone receiving sound which may include one or more phonemes articulated by the user); a camera to capture facial images of a wearer of the HMD while the speech is detected (e.g. one or more sensors such as imaging devices, where column 6, lines 19-28 notes user device 202 (e.g. HMD) may include one or more sensors that may gather data associated with the environment that includes the user device 202, e.g. an imaging device configured to capture one or more portions of an electromagnetic spectrum (e.g. a visible-light camera, an infrared camera, an ultraviolet camera, etc.), where column 11, lines 52-67 further notes identifying module 104 may include one or more sensors, e.g. an imaging device and/or an audio recording device, for capturing a set of images 232, e.g. a series of images, a video file, a set of frames of a video file, and recording audio 236, e.g. a recording of the user producing phonemes 234 and/or responding to prompts included in phonemes 234, as the user speaks sets of phonemes 234); and circuitry to: detect whether mouth movement of the wearer occurred while the speech was detected, from the captured facial images; and in response to detecting that the mouth movement of the wearer occurred while the speech was detected, render an avatar representing the wearer to have a viseme corresponding to the phoneme (see details of claims 4 and 5). Claim 14 is similar in scope to claims 1, 4, and 5 combined, and is therefore rejected under similar rationale. Please see the rejections and rationale regarding the claims above.
As to claim 15, Steptoe et al. disclose the captured facial images comprise the viseme corresponding to the phoneme, and wherein the avatar is rendered to have the viseme corresponding to the phoneme based on both the phoneme within the detected speech and the captured facial images including the viseme (see details of claim 7 above).
Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Steptoe et al. (US 11,468,616) as applied to claim 1 above, and further in view of Karakotsios (US 9,094,576).
As to claim 6, Steptoe et al. disclose the microphone (e.g. as noted in claim 1, the microphone), and determining whether the wearer of the HMD uttered the speech comprises: detecting, using the microphone, whether the speech was uttered from a direction of a mouth of the wearer; in response to detecting that the speech was uttered from the direction of the mouth of the wearer, determining that the wearer uttered the speech; and in response to detecting that the speech was not uttered from the direction of the mouth of the wearer, determining that the wearer did not utter the speech (see details of claims 4 and 5). However, Steptoe et al. do not disclose, but Karakotsios discloses, the microphone comprises a microphone array…and detecting, using the microphone array, whether the speech was uttered from a direction of a mouth of the wearer (Figure 1a, column 3, lines 3-17 notes computing device 100 also includes one or more microphones 110 or other audio capture devices capable of capturing audio data, such as words spoken by the user 102 of the device, where the microphone 110 is placed on the same side of the device 100 as the display screen 108 such that the microphone 110 will typically be better able to capture words spoken by a user of the device, where the microphone can be a directional microphone that captures sound information from substantially directly in front of the device and picks up only a limited amount of sound from other directions, which can help to better capture words spoken by a primary user of the device, and column 5, lines 10-14 further notes an array of microphones for capturing sound from multiple directions). The modification of Steptoe et al. with Karakotsios may thus further render each of the above detecting and determining steps with respect to "the direction of the mouth of the wearer."
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify Steptoe et al.'s method of determining whether a user uttered speech via a microphone with Karakotsios's microphone array, e.g. directional microphones, such that the system may be better able to capture words spoken by the primary user, e.g. the speaker, of the device (column 3, lines 3-17 of Karakotsios).
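As an illustrative sketch of the directional capture for which Karakotsios is relied upon, the following Python fragment estimates a coarse direction of arrival from the time delay between two microphone channels and compares it with an assumed direction of the wearer's mouth; the microphone spacing, sample rate, expected angle, and tolerance are assumptions for this sketch only and are not drawn from either reference.

    # Hypothetical sketch: two-microphone time-difference-of-arrival (TDOA) check
    # that detected speech arrives from the expected direction of the wearer's mouth.

    import numpy as np

    SPEED_OF_SOUND = 343.0   # m/s
    MIC_SPACING = 0.08       # m, assumed spacing between the two microphones
    SAMPLE_RATE = 16000      # Hz, assumed

    def direction_of_arrival(mic_a: np.ndarray, mic_b: np.ndarray) -> float:
        """Return the estimated arrival angle in degrees (0 = broadside)."""
        corr = np.correlate(mic_a, mic_b, mode="full")
        lag = np.argmax(corr) - (len(mic_b) - 1)           # inter-microphone delay in samples
        delay = lag / SAMPLE_RATE                          # seconds
        sin_theta = np.clip(delay * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
        return float(np.degrees(np.arcsin(sin_theta)))

    def from_mouth_direction(angle_deg: float, expected_deg: float = -60.0,
                             tolerance_deg: float = 20.0) -> bool:
        # The expected angle toward the wearer's mouth is an assumption for this sketch.
        return abs(angle_deg - expected_deg) <= tolerance_deg

    # Example with a synthetic 3-sample delay between the two channels.
    signal = np.random.default_rng(0).standard_normal(1024)
    mic_a = signal
    mic_b = np.concatenate([np.zeros(3), signal[:-3]])
    angle = direction_of_arrival(mic_a, mic_b)
    print(angle, from_mouth_direction(angle))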
Claims 9-11 are rejected under 35 U.S.C. 103 as being unpatentable over Steptoe et al. (US 11,468,616) as applied to claim 7 above, and further in view of Xiao et al. (US 11,113,859).
As to claim 9, Steptoe et al. disclose rendering the avatar (e.g. Figure 3, step 340), but do not disclose the following. However, Xiao et al. disclose rendering the avatar (Figure 4) comprises: applying a model to the captured facial images to generate blendshape weights corresponding to a facial expression of the wearer while the wearer uttered the speech (step 410, column 10, lines 56-64 notes obtaining an audio stream and image data, where the audio stream may include or capture a vocal output of a person and the image data may include or capture a face image of the person, and step 440, column 11, lines 16-45 notes determining blendshapes with corresponding weights to form a 3D model of an avatar, where a blendshape can include, correspond to, or be indicative of a structure, shape, and/or profile of a face part (e.g. eye, nose, eyebrow, lips, etc.) of an avatar, where a corresponding weight may indicate an amount of emphasis (e.g. on the structure, shape, and/or profile of the face part), where the system may obtain, detect, determine, or extract landmarks of a face of the person in the image data, landmarks can include, correspond to, or be indicative of data indicating locations or shapes of different body parts, and, according to the landmarks, the system may determine, identify, and/or select a number of blendshapes); identifying the phoneme within the detected speech (step 420, column 10, line 65 through column 11, line 7 notes the system predicts phonemes of the vocal output from the audio stream); modifying the generated blendshape weights based on the identified phoneme (step 430, column 11, lines 8-15 notes the system translates the predicted phonemes into visemes, where a viseme can include, correspond to, or be indicative of a model or data indicating a mouth shape associated with a particular sound or a phoneme, where, as noted above, step 440 notes determining blendshapes with corresponding weights to form a 3D model of an avatar, and where column 11, lines 43-45 notes this step may be performed after one or both of steps 420 and 430, and thus may render blendshapes determined or generated with respect to the predicted phonemes translated into visemes); and rendering the avatar from the modified blendshape weights (step 450, column 11, line 46 through column 12, line 4 notes the system combining the visemes with the 3D model of the avatar, where the system generates or constructs the 3D model of the avatar such that face parts of the 3D model are formed or shaped as indicated by the blendshapes, and the system also combines the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar in time).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify Steptoe et al.'s system and method of rendering an avatar with Xiao et al.'s method of using blendshapes and corresponding weights, to allow generating the 3D model or representation of the avatar with realistic expressions and/or facial movements in sync with the vocal output of the person, thereby providing improved artificial reality (e.g. including augmented and virtual reality) experiences (column 5, lines 51-55 of Xiao et al.).
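To illustrate the flow recited in claim 9 (image-derived blendshape weights modified by an identified phoneme), the following Python sketch uses a trivial stand-in for the model and a small phoneme-to-weight adjustment table; both are hypothetical and do not reproduce Xiao et al.'s landmark-based determination.

    # Hypothetical sketch: generate blendshape weights from facial images, then
    # modify mouth-related weights according to an identified phoneme before rendering.

    def image_model(facial_images: list) -> dict[str, float]:
        # Stand-in for a learned model; a real system would regress weights from landmarks.
        return {"jaw_open": 0.2, "lips_pucker": 0.1, "smile": 0.4}

    PHONEME_ADJUSTMENTS = {
        # Illustrative adjustments pushing mouth-related weights toward the phoneme's viseme.
        "aa": {"jaw_open": 0.8},
        "uw": {"lips_pucker": 0.9},
        "m":  {"jaw_open": 0.0},
    }

    def modify_weights(weights: dict[str, float], phoneme: str) -> dict[str, float]:
        adjusted = dict(weights)
        adjusted.update(PHONEME_ADJUSTMENTS.get(phoneme, {}))
        return adjusted

    def render_avatar(weights: dict[str, float]) -> None:
        print("render avatar with blendshape weights:", weights)

    weights = image_model(facial_images=[])
    render_avatar(modify_weights(weights, phoneme="aa"))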
As to claim 10, Steptoe et al. disclose rendering the avatar (e.g. Figure 3, step 340), but do not disclose the following. However, Xiao et al. disclose rendering the avatar (Figure 4) comprises: applying a first model to the captured facial image to generate first blendshape weights corresponding to a facial expression of the wearer while the wearer uttered the speech (step 410, column 10, lines 56-64 notes obtaining an audio stream and image data, where the audio stream may include or capture a vocal output of a person and the image data may include or capture a face image of the person, and step 440 (for the image data), column 11, lines 16-45 notes determining blendshapes with corresponding weights to form a 3D model of an avatar, where a blendshape can include, correspond to, or be indicative of a structure, shape, and/or profile of a face part (e.g. eye, nose, eyebrow, lips, etc.) of an avatar, where a corresponding weight may indicate an amount of emphasis (e.g. on the structure, shape, and/or profile of the face part), where the system may obtain, detect, determine, or extract landmarks of a face of the person in the image data, landmarks can include, correspond to, or be indicative of data indicating locations or shapes of different body parts, and, according to the landmarks, the system may determine, identify, and/or select a number of blendshapes); applying a second model to the detected speech to generate second blendshape weights corresponding to the facial expression of the wearer while the wearer uttered the speech (e.g. repeating step 440 above for the audio stream, where column 11, lines 43-45 notes this step may be performed after one or both of steps 420 and 430, and thus may render blendshapes determined or generated with respect to the predicted phonemes translated into visemes); combining the first blendshape weights and the second blendshape weights to yield combined blendshape weights corresponding to the facial expression of the wearer while the wearer uttered the speech; and rendering the avatar from the combined blendshape weights (step 450, column 11, line 46 through column 12, line 4 notes the system combining the visemes with the 3D model of the avatar, where the system generates or constructs the 3D model of the avatar such that face parts of the 3D model are formed or shaped as indicated by the blendshapes, and the system also combines the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar in time).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify Steptoe et al.'s system and method of rendering an avatar with Xiao et al.'s method of using blendshapes and corresponding weights, to allow generating the 3D model or representation of the avatar with realistic expressions and/or facial movements in sync with the vocal output of the person, thereby providing improved artificial reality (e.g. including augmented and virtual reality) experiences (column 5, lines 51-55 of Xiao et al.).
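For the two-model combination recited in claim 10, the following sketch blends an image-derived weight set with an audio-derived weight set using a fixed mixing factor; the stand-in models and the mixing rule are assumptions for illustration, whereas Xiao et al. combine visemes with the 3D avatar model as cited above.

    # Hypothetical sketch: combine blendshape weights produced from facial images
    # with weights produced from detected speech, then render from the combination.

    def image_weights(facial_images: list) -> dict[str, float]:
        return {"jaw_open": 0.3, "lips_pucker": 0.1}   # stand-in for a vision model

    def audio_weights(detected_speech: list[str]) -> dict[str, float]:
        return {"jaw_open": 0.7, "lips_pucker": 0.0}   # stand-in for an audio model

    def combine(first: dict[str, float], second: dict[str, float],
                alpha: float = 0.5) -> dict[str, float]:
        """Per-blendshape convex combination; alpha is an illustrative mixing factor."""
        keys = set(first) | set(second)
        return {k: alpha * first.get(k, 0.0) + (1 - alpha) * second.get(k, 0.0) for k in keys}

    combined = combine(image_weights([]), audio_weights(["aa"]))
    print("render avatar with blendshape weights:", combined)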
As to claim 11, Steptoe et al. disclose rendering the avatar (e.g. Figure 3, step 340), but do not disclose the following. However, Xiao et al. disclose rendering the avatar (Figure 4) comprises: applying a model to the captured facial image and to the detected speech to generate blendshape weights corresponding to a facial expression of the wearer while the wearer uttered the speech (step 410, column 10, lines 56-64 notes obtaining an audio stream and image data, where the audio stream may include or capture a vocal output of a person and the image data may include or capture a face image of the person, and step 440, column 11, lines 16-45 notes determining blendshapes with corresponding weights to form a 3D model of an avatar, where a blendshape can include, correspond to, or be indicative of a structure, shape, and/or profile of a face part (e.g. eye, nose, eyebrow, lips, etc.) of an avatar, where a corresponding weight may indicate an amount of emphasis (e.g. on the structure, shape, and/or profile of the face part), where the system may obtain, detect, determine, or extract landmarks of a face of the person in the image data, landmarks can include, correspond to, or be indicative of data indicating locations or shapes of different body parts, and, according to the landmarks, the system may determine, identify, and/or select a number of blendshapes, see additional details of claim 9); and rendering the avatar from the blendshape weights (step 450, column 11, line 46 through column 12, line 4 notes the system combining the visemes with the 3D model of the avatar, where the system generates or constructs the 3D model of the avatar such that face parts of the 3D model are formed or shaped as indicated by the blendshapes, and the system also combines the visemes with the 3D model of the avatar to form a 3D representation of the avatar, by synchronizing the visemes with the 3D model of the avatar in time).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify Steptoe et al.'s system and method of rendering an avatar with Xiao et al.'s method of using blendshapes and corresponding weights, to allow generating the 3D model or representation of the avatar with realistic expressions and/or facial movements in sync with the vocal output of the person, thereby providing improved artificial reality (e.g. including augmented and virtual reality) experiences (column 5, lines 51-55 of Xiao et al.).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JACINTA M CRAWFORD whose telephone number is (571)270-1539. The examiner can normally be reached 8:30 a.m. to 4:30 p.m.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, King Y. Poon, can be reached at (571)272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JACINTA M CRAWFORD/Primary Examiner, Art Unit 2617