DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Applicant's amendments and remarks filed 10/31/2025 have been fully considered but they are not persuasive. Claims 1-2, 11 and 16 have been amended. Claims 8-9, 14-15 and 19-20, which are directed to non-elected inventions, have been withdrawn. Applicant’s amendments have necessitated the new grounds of rejection set forth herein; accordingly, this action is made final.
Response to Arguments
Claim Rejections - 35 USC § 103:
Applicant’s arguments with respect to the independent claims have been considered but are moot because the rejection has been modified to address the newly added limitations. The Examiner now relies on the new reference KELLY.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
1. Claims 1-4, 11 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Smith et al., U.S. Patent No. 11,854,146 (“Smith”) in view of KELLY et al., U.S. Patent Application Publication No. 20210303897 (“KELLY”), further in view of JU et al., KR20200134623 (English translation) (“JU”), further in view of ZHE et al., CN110139115 (English translation) (“ZHE”), and further in view of CHEN YAN, WO2020063009 (English translation) (“YAN”).
Regarding independent claim 1, Smith teaches a method of generating a virtual character for an online streamer (abstract “Described are systems and methods directed to generation of a dimensionally accurate three-dimensional (“3D”) model of a body, such as a human body, based on two-dimensional (“2D”) images of at least a portion of that body. A user may use a 2D camera, such as a digital camera typically included in many of today's portable devices (e.g., cell phones, tablets, laptops, etc.) and obtain a series of 2D body images of at least a portion of their body from different views with respect to the camera. The 2D body images may then be used to generate a plurality of predicted body parameters corresponding to the body represented in the 2D body images. Those predicted body parameters may then be further processed to generate a dimensionally accurate 3D model or avatar of the body of the user.”; col.24, lines 61-63 “As discussed above, the 2D body images may be generated using any 2D imaging element, such as a camera on a portable device, a webcam, etc.”), comprising:
obtaining a human body image of a target online streamer captured by an image collection device, wherein the human body image of the target online streamer comprises a face and an upper body part of the target online streamer (see at least col.2, lines 61-64 “As discussed above, the 2D body images may be generated using any 2D imaging element, such as a camera on a portable device, a webcam, etc.”; col.3, lines 8-20 as shown in Fig.1A “3D modeling of a body from 2D body images begins with the receipt or creation of a 2D body image 102 that includes a representation of at least a portion of the body 103 of the user to be modeled. 2D body images 102 for use with the disclosed implementations may be generated using any conventional imaging element, such as a standard 2D Red, Green, Blue (“RGB”) digital camera that is included on many current portable devices (e.g., tablets, cellular phones, laptops, etc.). The 2D body image may be a still image generated by the imaging element or an image extracted from video generated by the imaging element. Likewise, any number of images may be used with the disclosed implementations.”);
performing face recognition and upper-body limb recognition on the human body image which operates on different regions of the same human body image to obtain features and limb features (see at least col.3, lines 44-67-col.4, lines 1-12 “The 2D body image 102 that includes a representation of at least a portion of the body 103 of the user may then be processed to produce a segmented silhouette 104 of the body 103 of the user represented in the image 102. A variety of techniques may be used to generate the segmented silhouette 104. For example, background subtraction may be used to subtract or black out pixels of the image that correspond to a background of the image while pixels corresponding to the body 103 of user (i.e., foreground) may be assigned a white or other color values. In another example, a semantic segmentation algorithm may be utilized to label background and body (foreground) pixels. For example, a convolutional neural network (“CNN”) may be trained with a semantic segmentation algorithm to determine bodies, such as human bodies, in images. In addition or as an alternative thereto, the segmented silhouette may be segmented into one or more body segments, such as hair segment 104-1, head segment 104-2, neck segment 104-3, upper clothing segment 104-4, upper left arm 104-5, lower left arm 104-6, left hand 104-7, torso 104-8, upper right arm 104-9, lower right arm 104-10, right hand 104-11, lower clothing 104-12, upper left leg 104-13, upper right leg 104-16, etc. For example, the CNN may be trained with a semantic segmentation algorithm to predict for each pixel of an image the likelihood that the pixel corresponds to a segment label (e.g., hair, upper clothing, lower clothing, head, upper right arm, etc.). 
For example, the CNN may be trained to process each 2D body image and output, for each pixel of each image, a vector that indicates a probability for each label that the pixel corresponds to that label”; col.4, lines 61-67-col.5, lines 1-7 “In some implementations, the CNN(s) 106 may be trained to predict hundreds of body parameters of the body represented in the image 102. For example, any number of 2D landmarks, such as top of head, ears, left shoulder, right shoulder, right elbow, left elbow, right wrist, left wrist, left hip, right hip, left knee, right knee, left ankle, right ankle, may be determined in accordance with the disclosed implementations. In such implementations, a visibility indicator may also be determined for each 2D landmark. The visibility indicator for each 2D landmark may be determined based on, for example, the position of the body segment(s) that connect to the 2D landmark, the position of other body segments and/or other 2D landmarks, etc.”);
determining parameters associated with a virtual character corresponding to the target online streamer based on the features and the limb features (see at least col.4, lines 61-67-col.5, lines 1-7 “In some implementations, the CNN(s) 106 may be trained to predict hundreds of body parameters of the body represented in the image 102. For example, any number of 2D landmarks, such as top of head, ears, left shoulder, right shoulder, right elbow, left elbow, right wrist, left wrist, left hip, right hip, left knee, right knee, left ankle, right ankle, may be determined in accordance with the disclosed implementations. In such implementations, a visibility indicator may also be determined for each 2D landmark. The visibility indicator for each 2D landmark may be determined based on, for example, the position of the body segment(s) that connect to the 2D landmark, the position of other body segments and/or other 2D landmarks, etc.”); and generating the virtual character corresponding to the target online streamer based on the parameters, wherein the generated virtual character has a motion and an expression corresponding to that of the target streamer (see at least col.5, lines 43-63 “Utilizing the predicted body parameters and visibility indicators, a 3D model of the body is generated. For example, the body parameters may be provided to a body model, such as the Shape Completion and Animation of People (“SCAPE”) body model, a Skinned Multi-Person Linear (“SMPL”) body model, etc., and the body model may generate the 3D model of the body of the user based on those predicted body parameters. To improve accuracy of the 3D model, in some implementations, data corresponding to any 2D landmark that is determined to not be visible, as indicated by the respective visibility indicator, may be ignored or omitted by the body model in generation of the 3D model as the data for those landmark body parameters may be unreliable or inaccurate. 
Instead, the model may determine 2D landmarks for those non-visible body joints based on the position of other body joints of the body of the user that are visible. In other implementations, the inferred position for one or more 2D landmarks that are determined to not be visible, such as those that are within the field of view of the image but occluded, may be considered in determining the 3D model”). Smith is understood to be silent on the remaining limitations of claim 1.
In the same field of endeavor, KELLY teaches performing face recognition and upper-body limb recognition on the human body image by distinct and independently trained models each of which operates on different regions of the same human body image to obtain face features and limb features (see at least [0050] For instance, to train a neural network to identify the most relevant guesses for identifying a human body part, for example, a predetermined dataset of raw image data including image data of human body parts, and with known outputs, is provided to the neural network. As each node function is applied to the raw input of a known output, an error correction analysis is performed such that node functions that result in outputs near or matching the known output may be given an increased weight while node functions having a significant error may be given a decreased weight. In the example of identifying a human face, node functions that consistently recognize image patterns of facial features (e.g., nose, eyes, mouth, etc.) may be given additional weight. Similarly, in the example of identifying a human hand, node functions that consistently recognize image patterns of hand features (e.g., wrist, fingers, palm, etc.) may be given additional weight. The outputs of the node functions (including the respective weights) are then evaluated in combination to provide an output such as a data structure representing a human face. Training may be repeated to further refine the pattern-recognition of the model, and the model may still be refined during deployment (i.e., raw input without a known data output). [0051] At least some of the neural network models applied by the tracking controller 204 may be deep neural network (DNN) models. DNN models include at least three layers of node functions linked together to break the complexity of image analysis into a series of steps of increasing abstraction from the original image data. 
For example, for a DNN model trained to detect human faces from an image, a first layer may be trained to identify groups of pixels that represent the boundary of facial features, a second layer may be trained to identify the facial features as a whole based on the identified boundaries, and a third layer may be trained to determine whether or not the identified facial features form a face and distinguish the face from other faces. The multi-layered nature of the DNN models may facilitate more targeted weights, a reduced number of node functions, and/or pipeline processing of the image data (e.g., for a three-layered DNN model, each stage of the model may process three frames of image data in parallel).[0052] In at least some embodiments, each model applied by the tracking controller 204 may be configured to identify a particular aspect of the image data and provide different outputs such that the tracking controller 204 may aggregate the outputs of the neural network models together to identify physical objects as described herein. For example, one model may be trained to identify human faces, while another model may be trained to identify the bodies of players. In such an example, the tracking controller 204 may link together a face of a player to a body of the player by analyzing the outputs of the two models. In other embodiments, a single DNN model may be applied to perform the functionality of several models; [0059] In some embodiments, one or more image neural network models are implemented to analyze captured images. In some examples, several neural network models can be implemented together by tracking controller 204 to extract different features from the image data. That is, the neural network models may be trained to identify particular characteristics of physical objects. 
For example, one neural network model may be trained to identify human faces, while another neural network model may be trained to identify human torsos, while yet another neural network model may be trained to identify human hands. Although the output of the image neural network models may vary depending upon the specific functionality of each model, the outputs generally include one or more data elements that represent a physical feature or characteristic of a person or object in the image data in a format that can be recognized and processed by a tracking controller and/or other computing devices. For example, one example neural network model may be used to detect the hands of players in the image data and output a map of data elements representing “key” physical features of the detected hands, such as the position of joints or knuckles in relation to each other, palm lines, sizes of fingers, etc. The map may indicate a relative position of each hand feature within the space defined by the image data (in the case of a singular, two-dimensional image, the space may be a corresponding two-dimensional plane) and cluster several hand features together to distinguish between detected hands. The output map is a data abstraction of the underlying raw image data that has a known structure and format, which may be advantageous for use in other devices and/or software modules. In the example embodiment, applying the image neural network models to the image data causes the tracking controller to generate one or more key data elements as the outputs of the image processing (including the models). The key data elements may include any suitable amount and/or type of data based at least partially on the corresponding neural network model. At least some of the key data elements include position data indicating a relative position of the represented physical characteristics within a space at least partially defined by the scope of the image data. 
Key data elements may include, but are not limited to, boundary boxes, key feature points, vectors, wireframes, outlines, pose models, and the like. Boundary boxes are visual boundaries that encapsulate an object in the image and classify the encapsulated object according to a plurality of predefined classes (e.g., classes may include “human,” “hand,” “token,” “token stack,” etc.). A boundary box may be associated with a single class or several classes (e.g., a player may be classified as both a “human” and a “male”). The key feature points, similar to the boundary boxes, classify features of objects in the image data, but instead assign a singular position to the classified features.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Smith's method of generating a 3D model of the body of the user based on body parameters corresponding to the body image by applying distinct and independently trained models to identify different parts of the human body image, as seen in KELLY, because this modification would achieve the expected benefit of providing more accurate and robust results. Both Smith and KELLY are understood to be silent on the remaining limitations of claim 1.
In the same field of endeavor, JU teaches performing face recognition on the human body image to obtain face features ([0052] “Here, the face recognition unit (110) can first detect a face area from a face image, and extract landmarks representing each facial expression, etc. within the extracted face area. That is, the face recognition unit (110) can distinguish between the face area included in the face image and the remaining background image, and then extract each feature point in the face area to distinguish the face from other faces. For example, eyebrows, eyes, nose, mouth, jawline, etc. included in the face can be extracted as feature points.”);
determining parameters associated with a virtual character corresponding to the target online streamer based on the face features ([0060] “Here, it can be seen that there is a nonlinear transformation relationship between the 3D facial feature points representing the user's face and the target character feature points representing the 3D character, and the nonlinear transformation relationship between them can be expressed using RBF transformation parameters. [0061] In some embodiments, it is also possible to pre-set RBF parameters for RBF transformation by machine learning a plurality of 3D facial feature points and corresponding target character feature points. [0062] That is, the transformation relationship between each feature point included in the user's face and the feature point of a three-dimensional virtual character can be extracted in advance, and the RBF parameters can be set to perform RBF transformation according to the transformation relationship. Accordingly, RBF transformation can be performed so that the user's facial expressions, etc. included in the 3D facial feature points are reflected in the target character feature points of the 3D virtual character.”); and
generating the virtual character corresponding to the target online streamer based on the parameters, wherein the generated virtual character has an expression corresponding to that of the target online streamer ([0063] “Once the target character features are created, a synthetic expression can be created using the blend shape technique. That is, sample expressions of a 3D virtual character can be extracted to correspond to the target character's feature points, and weights can be set for the extracted sample expressions. The retargeting unit (120) can extract and synthesize only some of the multiple sample expressions included in the expression set, and can set weights for each sample expression during synthesis to create a synthetic expression that is as similar as possible to the target character's features”). JU is understood to be silent on the remaining limitations of claim 1.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Smith, which generates a 3D model of the body of the user based on body parameters corresponding to the body image, as modified by KELLY to apply distinct and independently trained models to identify different parts of the human body image, by using the nonlinear transformation relationship between the 3D facial feature points representing the user's face and the target character feature points, as seen in JU, because this modification would create a synthetic expression that is as similar as possible to the target character's features ([0063] of JU). Smith, KELLY and JU are understood to be silent on the remaining limitations of claim 1.
In the same field of endeavor, ZHE teaches obtaining a human body image of a target online streamer captured by an image collection device, wherein the human body image of the target online streamer comprises a face of the target online streamer ([0032] “For example, the live broadcast providing terminal 100 can send the live broadcast video stream to the live broadcast server 200, and the audience can access the live broadcast server 200 through the live broadcast receiving terminal 300 to watch the live broadcast video. The live video stream pushed by the live broadcast server 200 may be a video stream currently being broadcast live on the live broadcast platform or a complete video stream formed after the live broadcast is completed.”; [0040] “In this embodiment, in step 210, the anchor video frame of the anchor during the live broadcast process can be captured by the video capture device 400, and the anchor video frame includes the anchor image. [0041] First, a sample image containing the anchor is obtained, and the sample image may be an image set containing multiple images. The anchor image included in the sample image is not limited to a specific object, and can be a sample image including any anchor image. For example, anchor images containing different postures and movements can be collected in advance as sample images, such as crossed-legs movement, hand-raising movement, and heart-shaped movement with both hands. Each sample image is marked with the location information of each key point of the anchor's limbs. Limb key points may include, but are not limited to, head key points, neck key points, left/right shoulder key points, left/right elbow key points, left/right wrist key points, left/right hip key points, left/right knee key points, left/right ankle key points, etc., and the list here is not exhaustive”);
performing upper-body limb recognition on the human body image to obtain limb features ([0050] “Through the above training of the neural network model, a key point recognition model that can be used to recognize the key points of the limbs is obtained. On this basis, the host video frame captured by the video capture device 400 can be imported into the trained key point recognition model for recognition, so as to output a heat map corresponding to each limb key point of the host in the host video frame. In this embodiment, the key point recognition model obtained by training the neural network model also includes an input layer, a multi-layer convolution layer, and a multi-layer deconvolution layer. The key point recognition model can obtain a heat map corresponding to each limb key point of the host's video frame in the following manner”; [0058] “The posture information of the limb part can be obtained according to the position information of the limb key points belonging to the same limb part. For example, among the key points of the limbs, the elbow, wrist, and shoulder all belong to the limb part arm, and the posture information of the arm can be obtained according to the position information of the elbow, wrist, and shoulder.”);
determining parameters associated with a virtual character corresponding to the target online streamer based on the limb features ([0055] “In this embodiment, the position information of each key point of the anchor's body is calculated, and its purpose is to control the virtual image on the live broadcast interface to achieve action linkage between the virtual image and the anchor and achieve action following. Considering that the image size of the captured host video frame is not necessarily the same as the size of the interface where the virtual image is located on the live broadcast interface, and the position of the host in the host video frame may also be different from the position of the virtual image on the live broadcast interface, directly controlling the virtual image with the position information of the key points of the host's limbs may result in control difficulties or inaccurate control.”; [0056] “Based on the above content, in step 220 and step 230, the posture information of each limb part of the host can be obtained according to the position information of each limb key point of the host, and then the posture of the virtual image can be controlled according to the posture information of each limb part. Based on the calculation of the position of the anchor's body key points, the anchor's movements can be accurately identified, and then the virtual image can be controlled based on the posture information. In this way, the problem that the position information of the key points of the host's body and the position information of the key points of the virtual image's body are difficult to unify is solved, and the purpose of the virtual image's posture and movement accurately following the host's posture is achieved”); and
generating the virtual character corresponding to the target online streamer based on the parameters, wherein the generated virtual character has a motion corresponding to that of the target online streamer ([0083] “The control module 133 is used to control the virtual image displayed in the live broadcast screen to perform corresponding posture actions according to the posture information of each body part of the anchor. It can be understood that the control module 133 can be used to execute the above step 230. The detailed implementation of the control module 133 can refer to the above content related to step 230”; [0085] “In summary, the key point-based virtual image posture control method, device and electronic device provided in the embodiments of the present application perform image analysis on the host video frame captured by the video capture device 400 to obtain multiple limb key points of the host in the host video frame and the position information of each limb key point. Then, based on the position information of each limb key point, the posture information of each limb part of the host is calculated. Finally, the virtual image displayed in the live broadcast screen is controlled to perform corresponding posture movements according to the posture information of each body part of the anchor. In this way, by analyzing the position information of the key points of the host's limbs, the accurate posture of each limb part can be obtained, and then the virtual image can be controlled to perform corresponding posture movements according to the host's posture, so that the virtual image can follow the host's posture movements and improve the interactive effect during the live broadcast.”).
Therefore, in view of the combination of Smith, KELLY and JU, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Smith, which generates a 3D model of the body of the user based on body parameters corresponding to the body image, as modified by KELLY to apply distinct and independently trained models to identify different parts of the human body image, by obtaining a plurality of limb key points of the online streamer, as seen in ZHE, because this modification would control the virtual image to perform posture movements corresponding to those of the host and improve the interactive effect during the live broadcast ([0085] of ZHE). Smith, KELLY, JU and ZHE are understood to be silent on the remaining limitations of claim 1.
In the same field of endeavor, YAN teaches determining parameters associated with a virtual character corresponding to the target online streamer based on the face features and the limb features ([0092] “As can be seen from the above, an image processing method provided in this embodiment obtains a human body image; analyzes the human body image to identify facial feature point information and limb feature point information corresponding to the human body image; combines the facial feature point information and the limb feature point information to generate human body model information; and controls the virtual object accordingly based on the human body model information. In this way, the facial feature point information and limb feature point information in the human body image can be identified in real time, and human body model information with facial expressions and finger movements can be generated based on the facial feature point information and limb feature point information. The virtual object can be controlled accordingly according to the action information in the generated human body model information, thereby improving the convenience and accuracy of human-computer interaction.”); and
generating the virtual character corresponding to the target online streamer based on the parameters, wherein the generated virtual character has a motion and an expression corresponding to that of the target online streamer ([0192] “An embodiment of the present application also provides a storage medium storing a computer program. When the computer program runs on a computer, the computer executes the image processing method in any of the above embodiments, such as: obtaining a human body image; analyzing the human body image to identify facial feature point information and limb feature point information corresponding to the human body image; combining the facial feature point information and the limb feature point information to generate human body model information; and controlling the virtual object accordingly based on the human body model information”).
Therefore, in view of the combination of Smith, KELLY, JU and ZHE, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of Smith and KELLY, which generates a 3D model of the body of the user based on body parameters corresponding to the body image, by combining the facial feature point information and the limb feature point information, as seen in YAN, because this modification would generate human body model information on which control of the virtual object is based ([0192] of YAN).
Thus, the combination of Smith, KELLY, JU, ZHE and YAN teaches a method of generating a virtual character for an online streamer, comprising: obtaining a human body image of a target online streamer captured by an image collection device, wherein the human body image of the target online streamer comprises a face and an upper body part of the target online streamer; performing face recognition and upper-body limb recognition on the human body image by distinct and independently trained models each of which operates on different regions of the same human body image to obtain face features and limb features; determining parameters associated with a virtual character corresponding to the target online streamer based on the face features and the limb features; and generating the virtual character corresponding to the target online streamer based on the parameters, wherein the generated virtual character has a motion and an expression corresponding to that of the target online streamer.
Regarding claim 2, Smith, KELLY, JU, ZHE and YAN teach the method according to claim 1, wherein the performing face recognition and upper-body limb recognition on the human body image to obtain face features and limb features comprises:
recognizing a face region from the human body image, and determining the face features based on the face region (see at least col.3, lines 44-67-col.4, lines 1-12 of Smith; [0050-0052,0059] of KELLY; [0012] of JU “Here, the face recognition step may include: a step of detecting a face region from the face image and extracting feature points from the face region; a step of generating 3D motion information including position information indicating the positions of the feature points in 3D space and pose information indicating the amount of rotation and movement of the face region in 3D space; and a step of generating 3D facial feature points by restoring the feature points in 3D using a pinhole camera model and the 3D motion information”; [0079] “Specifically, as illustrated in FIG. 7, in the face recognition step (S100), a face region can be detected from a face image and feature points can be extracted from the face region (S110). That is, the face region included in the face image and the remaining background image can be distinguished, and then each feature point can be extracted to distinguish the face from other faces in the face region.”; [0039] of YAN “In one embodiment, the step of analyzing facial image elements in a human body image through a preset learning model and identifying facial feature point information in the human body image may include: scanning the lines and contours of the facial area in the human body image through a preset learning model to obtain facial image elements; performing facial feature analysis on the facial image elements to identify a preset number of facial feature point information in the facial image elements.”); and recognizing an upper-body limb region from the human body image, and determining the limb features based on the upper-body limb region (see at least col.3, lines 44-67-col.4, lines 1-12 of Smith; [0050-0052,0059] of KELLY; see at least [0040] of ZHE “In this embodiment, in step 210, the anchor video frame of the 
anchor during the live broadcast process can be captured by the video capture device 400, and the anchor video frame includes the anchor image. The collected anchor video frames can be imported into a pre-trained key point recognition model for image analysis to identify the anchor's body key points. The key point recognition model is obtained by training a neural network model based on pre-acquired sample images of the anchor. Specifically, the key point recognition model can be obtained in the following way “; [0070] “ For each limb key point of each limb part, that is, each limb key point belonging to the same limb part, the position information of the limb key point can be obtained, and then the position information of the origin of the preset direction axis set based on the host video frame can be obtained. Among them, the preset direction axis can be a direction axis of any direction established based on the anchor video frame, such as a horizontal direction axis, a vertical direction axis, or a rotation direction axis rotated by a certain angle on the basis of the horizontal direction axis, etc.”; see at least [0073] of YAN “Among them, the Openpose model can be used to scan the lines and contours of the limb area on the human body image, and identify the upper body limb image elements and lower body limb image elements in the limb area. The upper body limb image elements are feature point information on the upper body limbs that can be identified by electronic devices. Feature intelligent analysis is performed on each upper body limb image element to identify important human feature point information of the upper body limbs. The important human feature point information of the upper body limbs may include corresponding feature point information of the neck, shoulders, elbows, wrists and fingers, constituting upper body limb feature information. 
The upper body limb feature information can reflect the user's upper body limb movements, especially the precise movements of the fingers”; [0166] “Performing upper body limb feature analysis on the upper body limb image element to identify a preset number of upper body limb feature information in the upper body limb image element”). In addition, the same motivation is used as in the rejection of claim 1.
Regarding claim 3, Smith, KELLY, JU, ZHE and YAN teach the method according to claim 2, wherein the recognizing a face region from the human body image, and determining the face features based on the face region comprises:
inputting the human body image into a pre-trained face recognition model to obtain the face region; and determining location information of face feature points in the face region, and determining the face features based on the location information of the face feature points in the face region (see at least col.3, lines 44-67-col.4, lines 1-12 of Smith “... For example, a convolutional neural network (“CNN”) may be trained with a semantic segmentation algorithm to determine bodies, such as human bodies, in images. In addition or as an alternative thereto, the segmented silhouette may be segmented into one or more body segments, such as hair segment 104-1, head segment 104-2, neck segment 104-3, upper clothing segment 104-4, upper left arm 104-5, lower left arm 104-6, left hand 104-7, torso 104-8, upper right arm 104-9, lower right arm 104-10, right hand 104-11, lower clothing 104-12, upper left leg 104-13, upper right leg 104-16, etc. For example, the CNN may be trained with a semantic segmentation algorithm to predict for each pixel of an image the likelihood that the pixel corresponds to a segment label (e.g., hair, upper clothing, lower clothing, head, upper right arm, etc.). For example, the CNN may be trained to process each 2D body image and output, for each pixel of each image, a vector that indicates a probability for each label that the pixel corresponds to that label”; [0050-0052,0059] of KELLY; [0040] of ZHE “In this embodiment, in step 210, the anchor video frame of the anchor during the live broadcast process can be captured by the video capture device 400, and the anchor video frame includes the anchor image. The collected anchor video frames can be imported into a pre-trained key point recognition model for image analysis to identify the anchor's body key points. The key point recognition model is obtained by training a neural network model based on pre-acquired sample images of the anchor. 
Specifically, the key point recognition model can be obtained in the following way.”; [0012] of JU “Here, the face recognition step may include: a step of detecting a face region from the face image and extracting feature points from the face region; a step of generating 3D motion information including position information indicating the positions of the feature points in 3D space and pose information indicating the amount of rotation and movement of the face region in 3D space; and a step of generating 3D facial feature points by restoring the feature points in 3D using a pinhole camera model and the 3D motion information”; [0039] of YAN “In one embodiment, the step of analyzing facial image elements in a human body image through a preset learning model and identifying facial feature point information in the human body image may include: scanning the lines and contours of the facial area in the human body image through a preset learning model to obtain facial image elements; performing facial feature analysis on the facial image elements to identify a preset number of facial feature point information in the facial image elements.”; [0062] “Among them, the preset learning model can be an Openpose model, which can recognize the captured human body image and identify the human body feature point information from the human body image. The human body feature point information can be understood as the body feature parts of the human body, so that the human body structure model can be constructed according to the human body feature point information in the later stage. 
Therefore, the facial feature point information of the facial image elements corresponding to the facial area in the human body image can be identified according to the Openpose model, and the limb feature point information of the limb image elements corresponding to the limb area in the human body image can be identified according to the Openpose model.”). In addition, the same motivation is used as in the rejection of claim 1.
Regarding claim 4, Smith, KELLY, JU, ZHE and YAN teach the method according to claim 2, wherein the recognizing an upper-body limb region from the human body image, and determining the limb features based on the upper-body limb region comprises:
inputting the human body image into a pre-trained limb recognition model to obtain the upper-body limb region; and determining location information of limb feature points in the upper-body limb region, and determining the limb features based on the location information of the limb feature points in the upper-body limb region (see at least col.3, lines 44-67-col.4, lines 1-12 of Smith “... For example, a convolutional neural network (“CNN”) may be trained with a semantic segmentation algorithm to determine bodies, such as human bodies, in images. In addition or as an alternative thereto, the segmented silhouette may be segmented into one or more body segments, such as hair segment 104-1, head segment 104-2, neck segment 104-3, upper clothing segment 104-4, upper left arm 104-5, lower left arm 104-6, left hand 104-7, torso 104-8, upper right arm 104-9, lower right arm 104-10, right hand 104-11, lower clothing 104-12, upper left leg 104-13, upper right leg 104-16, etc. For example, the CNN may be trained with a semantic segmentation algorithm to predict for each pixel of an image the likelihood that the pixel corresponds to a segment label (e.g., hair, upper clothing, lower clothing, head, upper right arm, etc.). For example, the CNN may be trained to process each 2D body image and output, for each pixel of each image, a vector that indicates a probability for each label that the pixel corresponds to that label”; [0050-0052,0059] of KELLY; [0040] of ZHE “In this embodiment, in step 210, the anchor video frame of the anchor during the live broadcast process can be captured by the video capture device 400, and the anchor video frame includes the anchor image. The collected anchor video frames can be imported into a pre-trained key point recognition model for image analysis to identify the anchor's body key points. The key point recognition model is obtained by training a neural network model based on pre-acquired sample images of the anchor. 
Specifically, the key point recognition model can be obtained in the following way.”; [0106] of YAN “In step S205 , the lines and contours of the limbs region in the human body image are scanned using a preset learning model to obtain upper body limbs image elements and lower body limbs image elements.[0107] “Among them, the lines and contours of the limb area in the human body image can be scanned through the Openpose model to obtain upper body limb image elements and lower body limb image elements, as shown in Figure 3. The upper body limb image elements can be feature point information of the upper body limbs in the human body image corresponding to the user 12, such as the corresponding feature point information of the neck, shoulders, elbows, wrists and fingers of the upper body limbs in the human body image corresponding to the user 12..” [0062] “Among them, the preset learning model can be an Openpose model, which can recognize the captured human body image and identify the human body feature point information from the human body image. The human body feature point information can be understood as the body feature parts of the human body, so that the human body structure model can be constructed according to the human body feature point information in the later stage. Therefore, the facial feature point information of the facial image elements corresponding to the facial area in the human body image can be identified according to the Openpose model, and the limb feature point information of the limb image elements corresponding to the limb area in the human body image can be identified according to the Openpose model.”) In addition, the same motivation is used as the rejection for claim 1.
Regarding independent claim 11, Smith teaches a system of generating a virtual character for an online streamer(abstract “Described are systems and methods directed to generation of a dimensionally accurate three-dimensional (“3D”) model of a body, such as a human body, based on two-dimensional (“2D”) images of at least a portion of that body. A user may use a 2D camera, such as a digital camera typically included in many of today's portable devices (e.g., cell phones, tablets, laptops, etc.) and obtain a series of 2D body images of at least a portion of their body from different views with respect to the camera. The 2D body images may then be used to generate a plurality of predicted body parameters corresponding to the body represented in the 2D body images. Those predicted body parameters may then be further processed to generate a dimensionally accurate 3D model or avatar of the body of the user.”; col.24, lines 61-63 “As discussed above, the 2D body images may be generated using any 2D imaging element, such as a camera on a portable device, a webcam, etc.),, comprising: at least one processor; and at least one memory communicatively coupled to the at least one processor and comprising computer-readable instructions that upon execution by the at least one processor cause the at least one processor to perform operations ( see Fig. 3, “The data and/or computer executable instructions, programs, firmware, software and the like (also referred to herein as “computer executable” components) described herein may be stored on a computer-readable medium that is within or accessible by computers or computer components such as the servers 312-1, 312-2 . . . 312-M, the processor 326, the servers 372-1, 372-2 . . . 
372-J, or any other computers or control systems utilized by the body composition system 310, the portable device 330, applications 323, or the external media storage facility 370, and having sequences of instructions which, when executed by a processor (e.g., a central processing unit, or “CPU”), cause the processor to perform all or a portion of the functions, services and/or methods described herein. Such computer executable instructions, programs, software and the like may be loaded into the memory of one or more computers using a drive mechanism associated with the computer readable medium, such as a floppy drive, CD-ROM drive, DVD-ROM drive, network interface, or the like, or via external connections.”) comprising: The remaining limitations of claim 11 are similar in scope to those of claim 1 and are therefore rejected under the same rationale.
Regarding independent claim 16, Smith teaches a non-transitory computer-readable storage medium, storing computer-readable instructions that upon execution by a processor cause the processor to implement operations (“Some implementations of the systems and methods of the present disclosure may also be provided as a computer-executable program product including a non-transitory machine-readable storage medium having stored thereon instructions (in compressed or uncompressed form) that may be used to program a computer (or other electronic device) to perform processes or methods described herein. The machine-readable storage media of the present disclosure may include, but is not limited to, hard drives, floppy diskettes, optical disks, CD-ROMs, DVDs, ROMs, RAMs, erasable programmable ROMs (“EPROM”), electrically erasable programmable ROMs (“EEPROM”), flash memory, magnetic or optical cards, solid-state memory devices, or other types of media/machine-readable medium that may be suitable for storing electronic instructions. Further, implementations may also be provided as a computer executable program product that includes a transitory machine-readable signal (in compressed or uncompressed form). Examples of machine-readable signals, whether modulated using a carrier or not, may include, but are not limited to, signals that a computer system or machine hosting or running a computer program can be configured to access, or including signals that may be downloaded through the Internet or other networks.”) comprising: The remaining limitations of claim 16 are similar in scope to those of claim 1 and are therefore rejected under the same rationale.
2. Claims 5-7, 10, 12-13, and 17-18 are rejected under 35 U.S.C. 103 as being unpatentable over Smith et al., U.S. Patent No. 11,854,146 (“Smith”) in view of KELLY et al., U.S. Patent Application Publication No. 20210303897 (“KELLY”), further in view of JU et al., KR20200134623, English translation (“JU”), further in view of ZHE et al., CN110139115, English translation (“ZHE”), further in view of CHEN YAN, WO2020063009, English translation (“YAN”), further in view of Onoe et al., U.S. Patent Application Publication No. 20110052081 (“Onoe”).
Regarding claim 5, Smith, KELLY, JU, ZHE and YAN teach the method according to claim 1, wherein the parameters comprises a first type of parameter indicating a head pose of the target online streamer (see at least col.4, lines 61-67-col.5, lines 1-7 “In some implementations, the CNN(s) 106 may be trained to predict hundreds of body parameters of the body represented in the image 102. For example, any number of 2D landmarks, such as top of head, ears, left shoulder, right shoulder, right elbow, left elbow, right wrist, left wrist, left hip, right hip, left knee, right knee, left ankle, right ankle, may be determined in accordance with the disclosed implementations. In such implementations, a visibility indicator may also be determined for each 2D landmark. The visibility indicator for each 2D landmark may be determined based on, for example, the position of the body segment(s) that connect to the 2D landmark, the position of other body segments and/or other 2D landmarks, etc.”; see at least [0080] of JU “After extracting feature points, 3D motion information including position information indicating the location of the feature points in 3D space and pose information indicating the amount of rotation and translation of the facial area in 3D space can be generated (S120). That is, the two-dimensional face region can be restored back to a three-dimensional space, and at this time, the three-dimensional position information matching the two-dimensional feature points in the three-dimensional space and the pose information such as the rotation amount and translation amount of the face region in the three-dimensional space can be calculated, respectively. 
Specifically, the Solve-PnP technique can be used to convert a two-dimensional face region into a three-dimensional space, and a pinhole camera model can be utilized at this time”), a second type of parameter indicating a facial expression of the target online streamer (see at least [0050-0052,0059] of KELLY; [0060] of JU “Here, it can be seen that there is a nonlinear transformation relationship between the 3D facial feature points representing the user's face and the target character feature points representing the 3D character, and the nonlinear transformation relationship between them can be expressed using RBF transformation parameters. [0061] In some embodiments, it is also possible to pre-set RBF parameters for RBF transformation by machine learning a plurality of 3D facial feature points and corresponding target character feature points. [0062] That is, the transformation relationship between each feature point included in the user's face and the feature point of a three-dimensional virtual character can be extracted in advance, and the RBF parameters can be set to perform RBF transformation according to the transformation relationship. Accordingly, RBF transformation can be performed so that the user's facial expressions, etc. 
included in the 3D facial feature points are reflected in the target character feature points of the 3D virtual character.” [0087] “Afterwards, by transforming the 3D facial feature points into RBF (Radial Basis Function Transform), target character feature points representing the facial expression of the 3D virtual character can be generated (S212).;” [0121] of YAN “As shown in Figure 3, the electronic device 100 determines the movement posture of the face, body, hands, fingers and legs of the 3D stickman model information 13 based on the position information of the face, body, hands, fingers and legs of the 3D stickman model information 13, and controls the face, body, hands, fingers and legs of the virtual object 14 to make the same movement posture according to the movement posture, thereby realizing somatosensory control, and can also realize precise control of expressions and fingers”; [0142] Among them, the combining unit 33 can connect the facial feature points in sequence according to the rules of facial expressions to establish the facial model information of the user's real-time facial expression movements, and connect the limb feature point information in sequence according to the rules of limb movements to establish the limb model information of the user's real-time limb operation movements. The combining unit 33 combines the facial model information with the limb model information to generate human body model information, that is, human body skeleton information. The human body model information represents the performance of the body posture of the user being photographed. 
In one embodiment, the human body model information can be a 3D stickman model, and the 3D stickman model can express the user's body posture, especially the real-time expression and finger operation movement), and a third type of parameter indicating a limb pose of the target online streamer (see at least [0050-0052,0059] of KELLY; [0013] of ZHE “A calculation module, used to calculate the posture information of each limb part of the anchor according to the position information of each limb key point”; [0017] “Based on any of the above aspects, the embodiment of the present application obtains multiple limb key points of the host in the host video frame and the position information of each limb key point by performing image analysis on the host video frame captured by the video capture device. Then, based on the position information of each limb key point, the posture information of each limb part of the host is calculated.”); and
wherein the generating parameters associated with a virtual character corresponding to the target online streamer based on the face features and the limb features further comprises:
generating the first type of parameter by performing head pose ([0012] of JU “Here, the face recognition step may include: a step of detecting a face region from the face image and extracting feature points from the face region; a step of generating 3D motion information including position information indicating the positions of the feature points in 3D space and pose information indicating the amount of rotation and movement of the face region in 3D space; and a step of generating 3D facial feature points by restoring the feature points in 3D using a pinhole camera model and the 3D motion information,”; [0080] of JU “After extracting feature points, 3D motion information including position information indicating the location of the feature points in 3D space and pose information indicating the amount of rotation and translation of the facial area in 3D space can be generated (S120). That is, the two-dimensional face region can be restored back to a three-dimensional space, and at this time, the three-dimensional position information matching the two-dimensional feature points in the three-dimensional space and the pose information such as the rotation amount and translation amount of the face region in the three-dimensional space can be calculated, respectively. Specifically, the Solve-PnP technique can be used to convert a two-dimensional face region into a three-dimensional space, and a pinhole camera model can be utilized at this time”; [0066] of YAN “Among them, the Openpose model can be used to scan the lines and contours of the facial area on the human body image, and identify the facial image elements on the facial area. The facial image elements are feature point information on the face that can be recognized by the electronic device. Feature intelligent analysis is performed on each facial image element to identify important human feature point information on the face. 
The important human feature point information on the face can include feature point information corresponding to the chin, mouth, nose, eyes and eyebrows, forming facial feature point information. The facial feature point information can establish the user's real-time expression)
generating the second type of parameter by performing facial expression parsing on the face features (see at least [0050-0052,0059] of KELLY ; [0060] of JU “Here, it can be seen that there is a nonlinear transformation relationship between the 3D facial feature points representing the user's face and the target character feature points representing the 3D character, and the nonlinear transformation relationship between them can be expressed using RBF transformation parameters. [0061] In some embodiments, it is also possible to pre-set RBF parameters for RBF transformation by machine learning a plurality of 3D facial feature points and corresponding target character feature points. [0062] That is, the transformation relationship between each feature point included in the user's face and the feature point of a three-dimensional virtual character can be extracted in advance, and the RBF parameters can be set to perform RBF transformation according to the transformation relationship. Accordingly, RBF transformation can be performed so that the user's facial expressions, etc. included in the 3D facial feature points are reflected in the target character feature points of the 3D virtual character.” [0087] “Afterwards, by transforming the 3D facial feature points into RBF (Radial Basis Function Transform), target character feature points representing the facial expression of the 3D virtual character can be generated (S212);. [0092] of YAN “As can be seen from the above, an image processing method provided in this embodiment obtains a human body image; analyzes the human body image to identify facial feature point information and limb feature point information corresponding to the human body image; combines the facial feature point information and the limb feature point information to generate human body model information; and controls the virtual object accordingly based on the human body model information. 
In this way, the facial feature point information and limb feature point information in the human body image can be identified in real time, and human body model information with facial expressions and finger movements can be generated based on the facial feature point information and limb feature point information. The virtual object can be controlled accordingly according to the action information in the generated human body model information, thereby improving the convenience and accuracy of human-computer interaction.”)., and
generating the third type of parameter based on parsing the limb features (see at least col.4, lines 61-67-col.5, lines 1-7 “In some implementations, the CNN(s) 106 may be trained to predict hundreds of body parameters of the body represented in the image 102. For example, any number of 2D landmarks, such as top of head, ears, left shoulder, right shoulder, right elbow, left elbow, right wrist, left wrist, left hip, right hip, left knee, right knee, left ankle, right ankle, may be determined in accordance with the disclosed implementations. In such implementations, a visibility indicator may also be determined for each 2D landmark. The visibility indicator for each 2D landmark may be determined based on, for example, the position of the body segment(s) that connect to the 2D landmark, the position of other body segments and/or other 2D landmarks, etc.”; see at least[0050-0052,0059] of KELLY ;see at least [0013] of ZHE “A calculation module, used to calculate the posture information of each limb part of the anchor according to the position information of each limb key point”; [0017] “Based on any of the above aspects, the embodiment of the present application obtains multiple limb key points of the host in the host video frame and the position information of each limb key point by performing image analysis on the host video frame captured by the video capture device. 
Then, based on the position information of each limb key point, the posture information of each limb part of the host is calculated.”; [0092] of YAN As can be seen from the above, an image processing method provided in this embodiment obtains a human body image; analyzes the human body image to identify facial feature point information and limb feature point information corresponding to the human body image; combines the facial feature point information and the limb feature point information to generate human body model information; and controls the virtual object accordingly based on the human body model information. In this way, the facial feature point information and limb feature point information in the human body image can be identified in real time, and human body model information with facial expressions and finger movements can be generated based on the facial feature point information and limb feature point information. The virtual object can be controlled accordingly according to the action information in the generated human body model information, thereby improving the convenience and accuracy of human-computer interaction.”) In addition, the same motivation is used as the rejection for claim 1. Smith, KELLY, JU, ZHE and YAN are understood to be silent on the remaining limitations of claim 5.
In the same field of endeavor, Onoe teaches wherein the parameters comprises a first type of parameter indicating a head pose of the target online streamer, a second type of parameter indicating a facial expression of the target online streamer(see at least[0074] The filtering processor 42 (FIG. 1) performs a filtering process on parameters output from the parameter calculators in the parameter generator 17 (the face orientation parameter calculator 43, the face position parameter calculator 44, the eye close ratio parameter calculator 45, the mouth open ratio parameter calculator 46, and the smiling face parameter calculator 47) in order to stabilize the parameters. [0069] The face orientation correction processor 41 receives an rotation angle of the face region (faceRoll, faceYaw) included in the face detection result information output by the face detection processor 32, and a posture (PoseRoll, PosePitch, PoseYaw) of the face region included in the face position information output by the part detector 33.”; [0080] The face orientation parameter calculator 43 (FIG. 1) calculates a parameter controlling the face angle of the avatar generated by the image generator 18, in accordance with the face angle of the user corrected by the face orientation correction processor 41. [0081] The face orientation correction processor 41 performs the correction process in the right-hand coordinate system while the image generator 18 performs the process thereof in the left-hand coordinate system. Referring to FIG. 8A, the right-hand coordinate system and the left-hand coordinate system are opposite each other in the rolling motion and the yawing motion. The face orientation parameter calculator 43 thus inverts the signs of the directions of the rolling motion and the yawing motion of the face angle of the user corrected by the face orientation correction processor 41”) [0096] “The mouth open ratio parameter calculator 46 (FIG. 
1) calculates a parameter controlling an open ratio of the mouth of the avatar to be generated by the image generator 18, based on 14 points identifying the outline of the mouth of the user detected by the part detail detector 35. [0097] Before calculating the parameter, the mouth open ratio parameter calculator 46 performs two evaluation processes to determine on the basis of each point identifying the outline of the mouth of the user whether detection results of the outline of the mouth are correct”), wherein the generating parameters associated with a virtual character corresponding to the target online streamer based on the face features further comprises:
generating the first type of parameter by performing head pose parsing on the face features([0059] The face position information is described below with reference to FIG. 4. As illustrated in FIG. 4, the coordinates of the four corners of the face region (regionX[0]-[3], and regionY[0]-[3]) are represented by the values that are normalized with (0,0) representing the top left corner of the entire image and (1,1) representing the bottom right corner of the entire image. The posture of the face region (PoseRoll, PosePitch, PoseYaw) is represented in the left-hand coordinate system as illustrated in FIG. 6.”; [0069] The face orientation correction processor 41 receives an rotation angle of the face region (faceRoll, faceYaw) included in the face detection result information output by the face detection processor 32, and a posture (PoseRoll, PosePitch, PoseYaw) of the face region included in the face position information output by the part detector 33.”; [0080] The face orientation parameter calculator 43 (FIG. 1) calculates a parameter controlling the face angle of the avatar generated by the image generator 18, in accordance with the face angle of the user corrected by the face orientation correction processor 41. [0081] The face orientation correction processor 41 performs the correction process in the right-hand coordinate system while the image generator 18 performs the process thereof in the left-hand coordinate system. Referring to FIG. 8A, the right-hand coordinate system and the left-hand coordinate system are opposite each other in the rolling motion and the yawing motion. The face orientation parameter calculator 43 thus inverts the signs of the directions of the rolling motion and the yawing motion of the face angle of the user corrected by the face orientation correction processor 41”),
generating the second type of parameter by performing facial expression parsing on the face feature ([0096] “The mouth open ratio parameter calculator 46 (FIG. 1) calculates a parameter controlling an open ratio of the mouth of the avatar to be generated by the image generator 18, based on 14 points identifying the outline of the mouth of the user detected by the part detail detector 35. [0097] Before calculating the parameter, the mouth open ratio parameter calculator 46 performs two evaluation processes to determine on the basis of each point identifying the outline of the mouth of the user whether detection results of the outline of the mouth are correct… [0100] If it is determined in the first and second evaluation processes that the detection results of the outline of the mouth are correct, the mouth open ratio parameter calculator 46 calculates the parameter of the open ratio of the avatar to be generated by the image generator 18, based on the 14 points identifying the outline of the mouth of the user detected by the part detail detector 35. If the detection results of the outline of the mouth are correct, the openness of the mouth of the user is reflected in the mouth of the avatar.”; [0112] The renderer 53 performs a rendering process by morphing the 3D data supplied by the 3D data manager 51, in accordance with expression parameters output by the parameter calculators in the parameter generator 17 (the parameters indicating the eye open ratio of each of the right eye and the left eye of the avatar, the parameter indicating the smile ratio of the avatar, and the parameter indicating the mouth open ratio of the avatar”)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the method of generating a 3D model of the body of the user based on body parameters corresponding to the body image of Smith, KELLY, JU, ZHE and YAN with generating a variety of parameters, including a face orientation correction, a face orientation parameter, a face position parameter, an eye close ratio parameter, a mouth open ratio parameter, and a smiling face parameter, as seen in Onoe, because this modification would allow the variety of parameters to serve as control data when the image generator generates an avatar ([0069] of Onoe).
Thus, the combination of Smith, KELLY, JU, ZHE, YAN and Onoe teaches wherein the parameters comprises a first type of parameter indicating a head pose of the target online streamer, a second type of parameter indicating a facial expression of the target online streamer, and a third type of parameter indicating a limb pose of the target online streamer; and wherein the generating parameters associated with a virtual character corresponding to the target online streamer based on the face features and the limb features further comprises: generating the first type of parameter by performing head pose parsing on the face features, generating the second type of parameter by performing facial expression parsing on the face features, and generating the third type of parameter based on parsing the limb features.
Regarding claim 6, Smith, KELLY, JU, ZHE, YAN and Onoe teach the method according to claim 5, wherein the face features comprise location information of face feature points in a face region recognized from the human body image by a pre-trained face recognition model(see at least col.3, lines 44-67-col.4, lines 1-12 of Smith “... For example, a convolutional neural network (“CNN”) may be trained with a semantic segmentation algorithm to determine bodies, such as human bodies, in images. In addition or as an alternative thereto, the segmented silhouette may be segmented into one or more body segments, such as hair segment 104-1, head segment 104-2, neck segment 104-3, upper clothing segment 104-4, upper left arm 104-5, lower left arm 104-6, left hand 104-7, torso 104-8, upper right arm 104-9, lower right arm 104-10, right hand 104-11, lower clothing 104-12, upper left leg 104-13, upper right leg 104-16, etc. For example, the CNN may be trained with a semantic segmentation algorithm to predict for each pixel of an image the likelihood that the pixel corresponds to a segment label (e.g., hair, upper clothing, lower clothing, head, upper right arm, etc.). For example, the CNN may be trained to process each 2D body image and output, for each pixel of each image, a vector that indicates a probability for each label that the pixel corresponds to that label”; [0050-0052,0059] of KELLY; [0040] of JU “In this embodiment, in step 210, the anchor video frame of the anchor during the live broadcast process can be captured by the video capture device 400, and the anchor video frame includes the anchor image. The collected anchor video frames can be imported into a pre-trained key point recognition model for image analysis to identify the anchor's body key points. The key point recognition model is obtained by training a neural network model based on pre-acquired sample images of the anchor. 
Specifically, the key point recognition model can be obtained in the following way.”; [0012] of ZHE “Here, the face recognition step may include: a step of detecting a face region from the face image and extracting feature points from the face region; a step of generating 3D motion information including position information indicating the positions of the feature points in 3D space and pose information indicating the amount of rotation and movement of the face region in 3D space; and a step of generating 3D facial feature points by restoring the feature points in 3D using a pinhole camera model and the 3D motion information”; [0039] of YAN “In one embodiment, the step of analyzing facial image elements in a human body image through a preset learning model and identifying facial feature point information in the human body image may include: scanning the lines and contours of the facial area in the human body image through a preset learning model to obtain facial image elements; performing facial feature analysis on the facial image elements to identify a preset number of facial feature point information in the facial image elements.” [0062] “Among them, the preset learning model can be an Openpose model, which can recognize the captured human body image and identify the human body feature point information from the human body image. The human body feature point information can be understood as the body feature parts of the human body, so that the human body structure model can be constructed according to the human body feature point information in the later stage. 
Therefore, the facial feature point information of the facial image elements corresponding to the facial area in the human body image can be identified according to the Openpose model, and the limb feature point information of the limb image elements corresponding to the limb area in the human body image can be identified according to the Openpose model.”; [0052] of Onoe “The face recognizer 15 includes a face recognition controller 31, a face detection processor 32, a part detector 33, an attribute determiner 34, and a part detail detector 35. The face recognizer 15 performs a face recognition process on an image supplied from the decoder 22 in the camera input unit 14. [0053] The face recognition controller 31 performs a control process on each element in the face recognizer 15. For example, the face recognition controller 31 performs the control process to cause the output of the attribute determiner 34 to be supplied to the model selector 16 throughout a predetermined number of frames (15 frames, for example) starting with a frame from which a new face is detected, and then to be supplied to the parameter generator 17 after the predetermined number of frames.); and wherein the generating the first type of parameter by performing head pose parsing on the face features comprises:
separately determining location information of a plurality of specified face feature points in the face region (see at least [0050-0052,0059] of KELLY; [0051] of JU “Specifically, as illustrated in FIG. 4, a user (u) can capture a facial image using a user terminal (10) (s11), and a facial recognition unit (110) can extract feature points from the facial image (s12). [0052] Here, the face recognition unit (110) can first detect a face area from a face image, and extract landmarks representing each facial expression, etc. within the extracted face area. That is, the face recognition unit (110) can distinguish between the face area included in the face image and the remaining background image, and then extract each feature point in the face area to distinguish the face from other faces. For example, eyebrows, eyes, nose, mouth, jawline, etc. included in the face can be extracted as feature points”; see at least [0063] of Onoe “The part result information is described with reference to FIGS. 5A and 5B. The part detail detector 35 may perform a standard process to detect the points of each part from the entire face, and a light-workload process to detect the points as an outline of the mouth.[0064] In the standard process, the part detail detector 35 detects 55 points identifying the outline of the face, the shape of the eyebrows, the outlines of the eyes, the shape of the nose, and the outline of the mouth as illustrated in FIG. 5A. In the light-workload process, the part detail detector 35 detects 14 points identifying the outline of the mouth as illustrated in FIG. 5B. The XY coordinates of each point are represented by values that are normalized with (0,0) as the coordinates of the top left corner of the face region and (1,1) as the coordinates of the bottom right corner of the face region. In the part result information, parts ID (partsID[0]-[55]) identifying each point is mapped to the XY coordinates (partsX, partsY) of the corresponding point.”);
determining a roll angle, a yaw angle, and a pitch angle of a head of the target online streamer based on the location information of the plurality of specified face feature points and a spatial location relationship between the plurality of specified face feature points on the head of the target online streamer ( see at least [0080] of JU “After extracting feature points, 3D motion information including position information indicating the location of the feature points in 3D space and pose information indicating the amount of rotation and translation of the facial area in 3D space can be generated (S120). That is, the two-dimensional face region can be restored back to a three-dimensional space, and at this time, the three-dimensional position information matching the two-dimensional feature points in the three-dimensional space and the pose information such as the rotation amount and translation amount of the face region in the three-dimensional space can be calculated, respectively. Specifically, the Solve-PnP technique can be used to convert a two-dimensional face region into a three-dimensional space, and a pinhole camera model can be utilized at this time”; see at least [0058] of Onoe “The part detector 33 determines a position and a posture of the face based on a geometry relationship of the detected parts, and outputs face position information indicating the position and the posture of the face. The face position information includes coordinates of the four corners of the face region, and the posture of the face (regionX[0]-[3], regionY[0]-[3], PoseRoll, PosePitch, PoseYaw). [0059] The face position information is described below with reference to FIG. 4. As illustrated in FIG. 4, the coordinates of the four corners of the face region (regionX[0]-[3], and regionY[0]-[3]) are represented by the values that are normalized with (0,0) representing the top left corner of the entire image and (1,1) representing the bottom right corner of the entire image. 
The posture of the face region (PoseRoll, PosePitch, PoseYaw) is represented in the left-hand coordinate system as illustrated in FIG. 6.”; [0069] “The face orientation correction processor 41 receives an rotation angle of the face region (faceRoll, faceYaw) included in the face detection result information output by the face detection processor 32, and a posture (PoseRoll, PosePitch, PoseYaw) of the face region included in the face position information output by the part detector 33. [0070] Referring to FIG. 6, a detection range of a rotation angle of the face region included in the face detection result information is set, and a detection range of the posture of the face region included in the face position information is set. The face detection result information is represented in the right-hand coordinate system, and the face position information is represented in the left-hand coordinate system. In the face detection result information, the face detection processor 32 is set to have about ±20 degrees as a detection range of a rolling motion of the face. No detection is performed in a pitching motion of the face. The face detection processor 32 is set to have about ±35 degrees as a detection range of a yawing motion of the face. In the face position information, the part detail detector 35 is set to have about ±35 degrees as a detection range of a rolling motion of the face, about ±20 degrees as a detection range of a pitching motion of the face, and about ±40 degrees as a detection range of a yawing motion of the face.”);
and determining the first type of parameter based on the roll angle, the yaw angle, and the pitch angle ([0012] of JU “Here, the face recognition step may include: a step of detecting a face region from the face image and extracting feature points from the face region; a step of generating 3D motion information including position information indicating the positions of the feature points in 3D space and pose information indicating the amount of rotation and movement of the face region in 3D space; and a step of generating 3D facial feature points by restoring the feature points in 3D using a pinhole camera model and the 3D motion information,”; [0080] of JU “After extracting feature points, 3D motion information including position information indicating the location of the feature points in 3D space and pose information indicating the amount of rotation and translation of the facial area in 3D space can be generated (S120). That is, the two-dimensional face region can be restored back to a three-dimensional space, and at this time, the three-dimensional position information matching the two-dimensional feature points in the three-dimensional space and the pose information such as the rotation amount and translation amount of the face region in the three-dimensional space can be calculated, respectively. Specifically, the Solve-PnP technique can be used to convert a two-dimensional face region into a three-dimensional space, and a pinhole camera model can be utilized at this time” where rotation and movement is considered as roll, the yaw, and the pitch; see at least[0074] of Onoe “The filtering processor 42 (FIG. 
1) performs a filtering process on parameters output from the parameter calculators in the parameter generator 17 (the face orientation parameter calculator 43, the face position parameter calculator 44, the eye close ratio parameter calculator 45, the mouth open ratio parameter calculator 46, and the smiling face parameter calculator 47) in order to stabilize the parameters. [0069] The face orientation correction processor 41 receives an rotation angle of the face region (faceRoll, faceYaw) included in the face detection result information output by the face detection processor 32, and a posture (PoseRoll, PosePitch, PoseYaw) of the face region included in the face position information output by the part detector 33.”; [0080] The face orientation parameter calculator 43 (FIG. 1) calculates a parameter controlling the face angle of the avatar generated by the image generator 18, in accordance with the face angle of the user corrected by the face orientation correction processor 41. [0081] The face orientation correction processor 41 performs the correction process in the right-hand coordinate system while the image generator 18 performs the process thereof in the left-hand coordinate system. Referring to FIG. 8A, the right-hand coordinate system and the left-hand coordinate system are opposite each other in the rolling motion and the yawing motion. The face orientation parameter calculator 43 thus inverts the signs of the directions of the rolling motion and the yawing motion of the face angle of the user corrected by the face orientation correction processor 41”). In addition, the same motivation is used as the rejection for claim 5.
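For illustration only, and not as a characterization of any cited reference, the following is a minimal sketch of how head angles can be derived from the locations of face feature points and their spatial relationship. It assumes normalized image coordinates with (0,0) at the top left corner, and it uses only three hypothetical landmarks (two eye centers and a nose tip). The function names `estimate_roll_deg` and `estimate_yaw_proxy` are illustrative assumptions; the yaw value is a crude unitless proxy rather than a calibrated angle, and a Solve-PnP style solver of the kind quoted above would be needed for a full roll, yaw, and pitch estimate.

```python
import math


def estimate_roll_deg(left_eye, right_eye):
    """Roll estimate: angle of the line through the two eye centers.

    Points are (x, y) tuples in image coordinates (y grows downward).
    Returns degrees; 0.0 means the eyes are level.
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def estimate_yaw_proxy(left_eye, right_eye, nose_tip):
    """Crude yaw proxy in [-1, 1] (an assumption, not a calibrated angle).

    Measures how far the nose tip sits horizontally from the midpoint
    between the eyes, normalized by half the inter-eye distance.
    """
    mid_x = (left_eye[0] + right_eye[0]) / 2.0
    half_span = math.hypot(right_eye[0] - left_eye[0],
                           right_eye[1] - left_eye[1]) / 2.0
    if half_span == 0:
        raise ValueError("eye landmarks coincide; cannot normalize")
    offset = (nose_tip[0] - mid_x) / half_span
    return max(-1.0, min(1.0, offset))
```

In this sketch a level face with a centered nose yields a roll of zero and a yaw proxy of zero; shifting the nose tip toward one eye moves the proxy toward -1 or 1.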
Regarding claim 7, Smith, KELLY, JU, ZHE, YAN and Onoe teach the method according to claim 6, wherein the determining the first type of parameter based on the roll angle, the yaw angle, and the pitch angle comprises:
separately converting the roll angle, the yaw angle, and the pitch angle into coordinates in a two-dimensional coordinate system to obtain a coordinate conversion result ([0058] of Onoe “The part detector 33 determines a position and a posture of the face based on a geometry relationship of the detected parts, and outputs face position information indicating the position and the posture of the face. The face position information includes coordinates of the four corners of the face region, and the posture of the face (regionX[0]-[3], regionY[0]-[3], PoseRoll, PosePitch, PoseYaw)”; [0065] “The user may incline his or her face, causing the points of each part of the face to be rotated (shifted) with respect to the entire image. The XY coordinates of each point are represented by values that are normalized in the face region. The positions of each point remain unchanged relative to the top left corner (origin) of the face region. The coordinate axis of each point is rotated in response to the inclination of the face region. If the position of each point with respect to the entire image is determined, a correction operation to incline the face in an opposite direction to the inclination of the face region is to be performed.”; [0070] “Referring to FIG. 6, a detection range of a rotation angle of the face region included in the face detection result information is set, and a detection range of the posture of the face region included in the face position information is set. The face detection result information is represented in the right-hand coordinate system, and the face position information is represented in the left-hand coordinate system. In the face detection result information, the face detection processor 32 is set to have about ±20 degrees as a detection range of a rolling motion of the face. No detection is performed in a pitching motion of the face. 
The face detection processor 32 is set to have about ±35 degrees as a detection range of a yawing motion of the face. In the face position information, the part detail detector 35 is set to have about ±35 degrees as a detection range of a rolling motion of the face, about ±20 degrees as a detection range of a pitching motion of the face, and about ±40 degrees as a detection range of a yawing motion of the face”); and performing angle value correction and interpolation smoothing processing on the coordinate conversion result to obtain the first type of parameter (see at least [0069] “The face orientation correction processor 41 receives an rotation angle of the face region (faceRoll, faceYaw) included in the face detection result information output by the face detection processor 32, and a posture (PoseRoll, PosePitch, PoseYaw) of the face region included in the face position information output by the part detector 33.”; [0074] “The filtering processor 42 (FIG. 1) performs a filtering process on parameters output from the parameter calculators in the parameter generator 17 (the face orientation parameter calculator 43, the face position parameter calculator 44, the eye close ratio parameter calculator 45, the mouth open ratio parameter calculator 46, and the smiling face parameter calculator 47) in order to stabilize the parameters.” [0080] “The face orientation parameter calculator 43 (FIG. 1) calculates a parameter controlling the face angle of the avatar generated by the image generator 18, in accordance with the face angle of the user corrected by the face orientation correction processor 41. [0081] The face orientation correction processor 41 performs the correction process in the right-hand coordinate system while the image generator 18 performs the process thereof in the left-hand coordinate system. Referring to FIG. 
8A, the right-hand coordinate system and the left-hand coordinate system are opposite each other in the rolling motion and the yawing motion. The face orientation parameter calculator 43 thus inverts the signs of the directions of the rolling motion and the yawing motion of the face angle of the user corrected by the face orientation correction processor 41”; [0194] The parameter generator 17 of FIG. 1 outputs a parameter that is obtained when the filtering processor 42 performs the filtering processing the parameter calculated by each parameter calculator. Even if an unstable variation is contained in the parameter calculated by each parameter calculator, the filtering processor 42 removes the unstable variation from the parameter. A stable parameter is thus output. The avatar generated by the image generator 18 smoothly moves.) In addition, the same motivation is used as the rejection for claim 5.
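For illustration only, the angle correction and stabilization discussed above can be sketched as follows. The detection ranges are the values quoted from [0070] of Onoe for the face position information, and the sign inversion of roll and yaw follows the handedness change quoted from [0081]. The exponential smoothing step is an assumption standing in for the filtering process, whose exact form is not given in the quoted passages; the names `correct_angles` and `Smoother` and the `alpha` parameter are hypothetical.

```python
# Detection ranges in degrees, as quoted from Onoe [0070]
# for the face position information (roll, pitch, yaw).
RANGES = {"roll": 35.0, "pitch": 20.0, "yaw": 40.0}


def clamp(value, limit):
    """Restrict value to the closed interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def correct_angles(roll, pitch, yaw):
    """Clamp each angle to its range, then invert roll and yaw signs.

    The inversion mirrors the change between right-hand and left-hand
    coordinate systems; pitch keeps its sign.
    """
    r = clamp(roll, RANGES["roll"])
    p = clamp(pitch, RANGES["pitch"])
    y = clamp(yaw, RANGES["yaw"])
    return (-r, p, -y)


class Smoother:
    """Exponential smoothing (assumed filter, not taken from a reference).

    Each update linearly interpolates between the previous output and
    the new sample: out = prev + alpha * (new - prev).
    """

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.state = None

    def update(self, angles):
        if self.state is None:
            # First sample initializes the filter state unchanged.
            self.state = tuple(angles)
        else:
            self.state = tuple(
                s + self.alpha * (a - s)
                for s, a in zip(self.state, angles)
            )
        return self.state
```

A smaller `alpha` in this sketch suppresses frame-to-frame jitter more strongly at the cost of added lag in the avatar's head motion.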
Regarding claim 10, Smith, KELLY, JU, ZHE and YAN teach the method according to claim 1, further comprising: driving a head motion of the virtual character using a first type of parameter indicating poses of the target online streamer; driving a facial expression of the virtual character using a second type of parameter indicating a facial expression of the target online streamer; and driving an upper-body limb motion of the virtual character based on a third type of parameter indicating limb poses of the target online streamer (see at least [0050-0052,0059] of KELLY; [0083] of ZHE “The control module 133 is used to control the virtual image displayed in the live broadcast screen to perform corresponding posture actions according to the posture information of each body part of the anchor. It can be understood that the control module 133 can be used to execute the above step 230. The detailed implementation of the control module 133 can refer to the above content related to step 230” [0085] “In summary, the key point-based virtual image posture control method, device and electronic device provided in the embodiments of the present application perform image analysis on the host video frame captured by the video capture device 400 to obtain multiple limb key points of the host in the host video frame and the position information of each limb key point. Then, based on the position information of each limb key point, the posture information of each limb part of the host is calculated. Finally, the virtual image displayed in the live broadcast screen is controlled to perform corresponding posture movements according to the posture information of each body part of the anchor. 
In this way, by analyzing the position information of the key points of the host's limbs, the accurate posture of each limb part can be obtained, and then the virtual image can be controlled to perform corresponding posture movements according to the host's posture, so that the virtual image can follow the host's posture movements and improve the interactive effect during the live broadcast.) In addition, the same motivation is used as the rejection for claim 1. Smith, KELLY, JU, ZHE and YAN are understood to be silent on the remaining limitations of claim 10.
In the same field of endeavor, Onoe teaches driving a head motion of the virtual character using a first type of parameter indicating head poses of the target online streamer (see at least [0080] The face orientation parameter calculator 43 (FIG. 1) calculates a parameter controlling the face angle of the avatar generated by the image generator 18, in accordance with the face angle of the user corrected by the face orientation correction processor 41.” [0081] The face orientation correction processor 41 performs the correction process in the right-hand coordinate system while the image generator 18 performs the process thereof in the left-hand coordinate system. Referring to FIG. 8A, the right-hand coordinate system and the left-hand coordinate system are opposite each other in the rolling motion and the yawing motion. The face orientation parameter calculator 43 thus inverts the signs of the directions of the rolling motion and the yawing motion of the face angle of the user corrected by the face orientation correction processor 41.; 0084] The face position parameter calculator 44 (FIG. 1) calculates the parameter controlling the face position of the avatar generated by the image generator 18, in accordance with the reference point, the horizontal width, and the vertical length (faceX, faceY, faceW, faceH) included in the face detection result information output by the filtering processor 42.”);
driving a facial expression of the virtual character using a second type of parameter indicating a facial expression of the target online streamer ([0096] “The mouth open ratio parameter calculator 46 (FIG. 1) calculates a parameter controlling an open ratio of the mouth of the avatar to be generated by the image generator 18, based on 14 points identifying the outline of the mouth of the user detected by the part detail detector 35. [0097] Before calculating the parameter, the mouth open ratio parameter calculator 46 performs two evaluation processes to determine on the basis of each point identifying the outline of the mouth of the user whether detection results of the outline of the mouth are correct… [0100] If it is determined in the first and second evaluation processes that the detection results of the outline of the mouth are correct, the mouth open ratio parameter calculator 46 calculates the parameter of the open ratio of the avatar to be generated by the image generator 18, based on the 14 points identifying the outline of the mouth of the user detected by the part detail detector 35. If the detection results of the outline of the mouth are correct, the openness of the mouth of the user is reflected in the mouth of the avatar.”; [0112] The renderer 53 performs a rendering process by morphing the 3D data supplied by the 3D data manager 51, in accordance with expression parameters output by the parameter calculators in the parameter generator 17 (the parameters indicating the eye open ratio of each of the right eye and the left eye of the avatar, the parameter indicating the smile ratio of the avatar, and the parameter indicating the mouth open ratio of the avatar) In addition, the same motivation is used as the rejection for claim 5.
Thus, the combination of Smith, KELLY, JU, ZHE, YAN and Onoe teaches further comprising: driving a head motion of the virtual character using a first type of parameter indicating head poses of the target online streamer; driving a facial expression of the virtual character using a second type of parameter indicating a facial expression of the target online streamer; and driving an upper-body limb motion of the virtual character based on a third type of parameter indicating limb poses of the target online streamer.
Regarding claim 12, Smith, KELLY, JU, ZHE and YAN teach the system according to claim 11. The remaining limitations of claim 12 are similar in scope to those of claim 5 and are therefore rejected under the same rationale.
Regarding claim 13, Smith, KELLY, JU, ZHE, YAN and Onoe teach the system according to claim 12. The remaining limitations of claim 13 are similar in scope to those of claim 6 and are therefore rejected under the same rationale.
Regarding claim 17, Smith, KELLY, JU, ZHE and YAN teach the non-transitory computer-readable storage medium according to claim 16. The remaining limitations of claim 17 are similar in scope to those of claim 5 and are therefore rejected under the same rationale.
Regarding claim 18, Smith, KELLY, JU, ZHE, YAN and Onoe teach the non-transitory computer-readable storage medium according to claim 17. The remaining limitations of claim 18 are similar in scope to those of claim 6 and are therefore rejected under the same rationale.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Contact
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SARAH LE whose telephone number is (571)270-7842. The examiner can normally be reached Monday 8 AM-4:30 PM EST, Tuesday 8 AM-3:30 PM EST, and Wednesday 8 AM-2:30 PM EST; the examiner is off Thursday and Friday.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang, can be reached at (571) 272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SARAH LE/Primary Examiner, Art Unit 2614