DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is in response to the applicant’s reply filed December 12, 2025. In the applicant’s reply, claims 1, 4-6, 8-9, 12, 18 and 20 were amended. Claims 1-20 are pending in this application.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Response to Arguments
Applicants' amendments filed on December 12, 2025 have been fully considered. The amendments overcome the following rejections set forth in the Office action mailed on OA_DATE.
Applicant’s amendments overcome the objection to the title of the specification, and the objection is hereby withdrawn.
Applicant’s amendments overcome the anticipatory rejections of Claims 1-3, 5-14, and 17-19 under 35 U.S.C. 102(a)(1) as being anticipated by Zatepyakin et al. (US PGPub 2018/0182165 A1), and the rejection is hereby withdrawn.
Applicant’s amendments overcome the rejections of Claims 1, 4-5, 14-16, 18 and 20 under 35 U.S.C. 103 as being unpatentable over Zatepyakin et al. (US PGPub 2018/0182165 A1, hereinafter “Zatepyakin”) in view of Chen et al. (US PGPub 2023/0186439, hereinafter “Chen”), and the rejection is hereby withdrawn.
Applicant's arguments with respect to claims 1-20 have been considered but are moot in view of the new grounds of rejection, presented below and necessitated by applicant’s amendment.
Applicant’s arguments, see “Remarks”, filed December 12, 2025, with respect to the use of Zatepyakin as an anticipatory reference have been fully considered and are persuasive in light of Applicant’s amendments. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made over Zatepyakin in view of Riemenschneider et al. (US PGPub 2024/0242444 A1, originally filed on January 17, 2023), hereinafter “Riemenschneider”.
Applicant should submit an argument under the heading “Remarks” pointing out disagreements with the examiner’s contentions. Applicant must also discuss the references applied against the claims, explaining how the claims avoid the references or distinguish from them.
Applicants' arguments filed on December 12, 2025 have been fully considered but they are not persuasive. The Examiner has thoroughly reviewed Applicants' arguments but maintains that the cited references reasonably and properly meet the claimed limitations.
Applicant argues that the subject matter does qualify as statutory subject matter under 35 U.S.C. 101. Specifically, applicant argues that:
“MPEP 2106.04(d)(III) states "[t]he Prong Two analysis considers the claim as a whole. That is, the limitations containing the judicial exception as well as the additional elements in the claim besides the judicial exception need to be evaluated together to determine whether the claim integrates the judicial exception into a practical application."”
Applicant then cites:
“Par. 49 of the specification states ‘each anchor box prediction network 406-1 to 406-9 may analyze the features in anchor boxes shown in FIG. 2 in the shaded boxes, and not features outside of respective anchor boxes 204. This may reduce the amount of data that is analyzed by anchor box prediction networks 406-1 to 406-9 and improve the speed of the computation.’”.
Examiner respectfully disagrees. Applicant's arguments fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the claims define a patentable invention without specifically pointing out how the language of the claims is directed to statutory subject matter. The cited features that applicant argues from Par. 49 are NOT present in the claims, and as such do not qualify as “additional elements” to be evaluated together in determining whether the claim “integrates the judicial exception into a practical application”. It is noted that the features upon which applicant relies (i.e., ‘each anchor box prediction network 406-1 to 406-9 may analyze the features in anchor boxes shown in FIG. 2 in the shaded boxes, and not features outside of respective anchor boxes 204. This may reduce the amount of data that is analyzed by anchor box prediction networks 406-1 to 406-9 and improve the speed of the computation.’) are not recited in the rejected claims. Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims, especially when determining whether a claim is directed to statutory subject matter. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). Applicant is encouraged to amend the claims so that these “additional elements” are recited in the claims and thereby integrate the judicial exception into a practical application.
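For clarity of record only, the computational effect applicant attributes to Par. 49 can be illustrated with a minimal Python sketch (all names, sizes, and the mean-value scoring stand-in are invented for illustration and are not applicant's disclosed anchor box prediction networks): restricting each computation to the slice of the feature map inside an anchor box, rather than the full map, is what reduces the amount of data processed.

    import numpy as np

    def score_anchor_box(feature_map, box):
        # box is (top, left, height, width) in feature-map coordinates; only
        # the features inside the box are read, not features outside of it.
        top, left, h, w = box
        region = feature_map[top:top + h, left:left + w]
        return float(region.mean())  # stand-in for a learned prediction network

    feature_map = np.random.rand(64, 64)
    anchor_boxes = [(8 * i, 8 * i, 16, 16) for i in range(6)]
    scores = [score_anchor_box(feature_map, b) for b in anchor_boxes]
    best_box = anchor_boxes[int(np.argmax(scores))]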
Priority
Acknowledgment is made of applicant's claim for foreign priority based on an application filed with the World Intellectual Property Organization (WIPO) on October 26, 2023. It is noted, however, that applicant has not filed a certified copy of the PCT/CN2023/126670 application as required by 37 CFR 1.55.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.
Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter. Although the claims fall within one of the four categories of patent eligible subject matter, the claimed invention is directed to a judicial exception (i.e., an abstract idea – an idea “of itself”/mathematical concept) without significantly more.
(1) Are the claims directed to a process, machine, manufacture or composition of matter;
(2A) Prong One: Are the claims directed to a judicially recognized exception, i.e., a law of nature, a natural phenomenon, or an abstract idea;
Prong Two: If the claims are directed to a judicial exception under Prong One, then is the judicial exception integrated into a practical application;
(2B) If the claims are directed to a judicial exception and do not integrate the judicial exception, do the claims provide an inventive concept.
(Step 1) In the context of the flowchart in MPEP § 2106, subsection III, Step (1): Are the claims directed to a process, machine, manufacture or composition of matter? YES: the instant claims recite a method.
(Step 2A) In the context of the flowchart in MPEP § 2106, subsection III, Step 2A Prong One determines whether: Is the claim directed to a law of nature, a natural phenomenon, or an abstract idea? YES.
When viewed under the broadest reasonable interpretation, the instant claims are directed to a judicial exception – an abstract idea belonging to the groupings of organizing information, comparing new and stored data, and mental/manual processes. As a whole, the claimed features can be interpreted as an overall mathematical process, as they are a natural-language representation of an overall mathematical algorithm, which qualifies as a judicial exception. Claims 1, 18 and 20 are independent claims; claim 1 is exemplary of the recited claims and is presented below. Specifically, the claim recites:
1. A method comprising:
a. receiving an image;
b. analyzing different portions of the image that are formed from respective shapes of a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes,
c. wherein the output rates a cropping of the image using a respective anchor shape;
d. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape;
e. and cropping the image using the anchor shape.
When viewed under the broadest reasonable interpretation, the instant claims are directed to a judicial exception – an abstract idea belonging to the groupings of organizing information, comparing new and stored data, and mental/manual processes. As a whole, the claimed features can be interpreted as an overall mathematical process, as the limitations are directed to a series of manual/mental processes for data gathering and comparison, which qualifies as a judicial exception. Step (a), “receiving an image”, is considered to be extra-solution activity. Step (b), “analyzing different portions of the image that are formed from respective shapes of a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes”, is considered to be a mental or manual process of identifying different shapes within the image, and step (c), “wherein the output rates a cropping of the image using a respective anchor shape”, is considered to be a general image processing feature that is recited at a high level of generality and is inherent to most image processing algorithms. Steps (d), “analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape”, and (e), “cropping the image using the anchor shape”, both recite a series of mental processes or mathematical operations at a high level of generality, which are not operationally connected or algorithmically defined in the claimed features. These claimed limitations are directed to a judicially recognized mathematical concept/algorithm. There is nothing in the claim that requires more than an operation that a human, armed with the appropriate apparatus executing a mathematical algorithm (in this case “operations”), can perform.
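To illustrate this characterization (a minimal sketch; the anchor shapes, the variance-based rating, and all names are invented for this example and do not represent applicant's disclosed networks), steps (a)-(e) reduce to a short sequence of generic gather, compare, and select operations:

    import numpy as np

    def crop_with_best_anchor(image, anchor_shapes):
        outputs = []
        for top, left, h, w in anchor_shapes:            # step (b): analyze image portions
            portion = image[top:top + h, left:left + w]
            outputs.append(float(portion.var()))         # step (c): rate each cropping
        best = int(np.argmax(outputs))                   # step (d): compare outputs, select
        top, left, h, w = anchor_shapes[best]
        return image[top:top + h, left:left + w]         # step (e): crop the image

    image = np.random.rand(32, 32)                       # step (a): receive an image
    cropped = crop_with_best_anchor(image, [(0, 0, 16, 16), (8, 8, 16, 16)])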
These features, in combination, are applied to an overall data stream and can be applied to any input that can be computationally processed. As the overall features of the claims can be interpreted as a mathematical algorithm, the claims meet the requirements of Step 2A for a judicial exception.
Because the claim as a whole does not integrate the exception into a practical application, the claim is directed to the judicial exception (Step 2A: YES) and requires further analysis under Step 2B, where it may still be eligible if it recites an inventive concept.
(Step 2B) In the context of the flowchart in MPEP § 2106, subsection III, Step 2B determines whether: Does the claim recite additional elements that amount to significantly more than the judicial exception? NO.
The independent claims clearly recite processes for organizing data and data comparison, which can constitute a mental or manual process, at best with some possible mathematical elements, as presented in the claimed features. The instant claims do not apply, rely on, or use the judicial exception in a manner that imposes a meaningful limit on the judicial exception, and therefore claims 1-20 do not integrate the judicial exception into a practical application. See MPEP § 2106.04(d), Integration of a Judicial Exception Into a Practical Application [R-07.2022].
With respect to the claimed limitations, the following features are directed to a mental/manual process for organizing and comparing data that does not integrate the judicial exception into a practical application, as there is no “meaningful limit”.
In particular, the claim recites the following additional elements, performed using a processing apparatus:
1. A method comprising:
a. receiving an image;
b. analyzing different portions of the image that are formed from respective shapes of a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes,
c. wherein the output rates a cropping of the image using a respective anchor shape;
d. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape;
e. and cropping the image using the anchor shape.
Step (a), “receiving an image”, is considered to be extra-solution activity. Step (b), “analyzing different portions of the image that are formed from respective shapes of a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes”, is considered to be a mental or manual process of identifying different shapes within the image, and step (c), “wherein the output rates a cropping of the image using a respective anchor shape”, is considered to be a general image processing feature that is recited at a high level of generality and is inherent to most image processing algorithms. Steps (d), “analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape”, and (e), “cropping the image using the anchor shape”, both recite a series of mental processes or mathematical operations at a high level of generality, which are not operationally connected or algorithmically defined in the claimed features. These claimed limitations are directed to a judicially recognized mathematical concept/algorithm. There is nothing in the claim that requires more than an operation that a human, armed with the appropriate apparatus executing a mathematical algorithm (in this case “operations”), can perform.
With regard to Step 2B, the pending claims do not set forth anything more than what is routine in the art, i.e., the additional elements are nothing more than routine and well-known steps. There is no improvement to technology here. There is only a “receive”/“transmit”/“classify” (extra-solution) step, and it has not been shown that the mental process allows the “technology” (whether computer technology or any other technology) to do something that it previously was not able to do.
The claimed features would, at best, invoke analysis under MPEP § 2106.05(a) as to whether the features qualify as improvements to the functioning of a computer or to any other technology or technical field. Even when considering the relevant considerations for evaluating whether the additional elements amount to an inventive concept, the limitations do not recite any elements that can be considered “significantly more” when recited in a claim with a judicial exception. At best, the claimed limitations only recite features that “apply” the judicial exception and add insignificant extra-solution activity to the judicial exception.
Dependent claims 2-17 and 19 are rejected for the same reasons; the dependent claims recite additional features that would be considered mathematical representations of the different tensor elements, and these claims are likewise directed to a judicial exception and do not integrate the judicial exception into a practical application.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-14, and 17-20 are rejected under 35 U.S.C. § 103 as being unpatentable over Zatepyakin et al. (US PGPub 2018/0182165 A1) in view of Riemenschneider et al. (US PGPub 2024/0242444 A1, originally filed on January 17, 2023), hereinafter “Riemenschneider”.
Consider Claims 1, 18 and 20.
Zatepyakin teaches:
1. A method comprising: / 18. A non-transitory computer-readable storage medium having stored thereon computer executable instructions, which when executed by a computing device, cause the computing device to be operable for: / 20. An apparatus comprising: one or more computer processors; and a computer-readable storage medium comprising instructions for controlling the one or more computer processors to be operable for: (Zatepyakin: abstract, A face tracking system generates a model for extracting a set of facial anchor points on a face within a portion of a face image based a multiple-level cascade of decision trees. The face tracking system identifies a mesh shape adjusted to an image of a face. For each decision tree, the face tracking system identifies an adjustment vector for the mesh shape relative to the image of the face. For each cascade level, the face tracking system combines the identified adjustment for each decision tree to determine a combined adjustment vector for the cascade level. The face tracking system modifies adjustment of the mesh shape to the face in the image based on the combined adjustment vector. The face tracking system reduces the model to a dictionary and atom weights using a learned dictionary. The model may be more easily transmitted to devices and stored on devices. [0018]-[0026], Figures 1A-B; [0018] FIG. 1A is a system environment 100 of a face tracking system 140 including a face alignment module 146, in accordance with an embodiment. The system environment 100 shown by FIG. 1 comprises one or more client devices 110, a network 120, one or more external sources 130, and the face tracking system 140. In alternative configurations, different and/or additional components may be included in the system environment 100.)
1. receiving an image; / 18. receiving an image; / 20. receiving an image; (Zatepyakin: [0019] The client devices 110 are one or more computing devices capable of capturing face images of a user, receiving user input as well as transmitting and/or receiving data via the network 120. In one embodiment, a client device 110 is a conventional computer system that includes an imaging device for capturing images having a user's face. Examples of an imaging device include a camera, a video camera, or other image capture device. [0020] The application 112 on the client device may perform facial alignment of a face within the captured image. To determine the facial image, the application 112 applies a trained model to analyze a face in the image to extract a set of facial anchor points on the face. The application 112 may receive the trained model from the face tracking system 140 and after applying the model, use the extracted set of facial anchor points to interpret or augment the image.)
1. analyzing the image based on a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; / 18. analyzing the image based on a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; / 20. analyzing the image based on a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; (Zatepyakin: [0020] The application 112 on the client device may perform facial alignment of a face within the captured image. To determine the facial image, the application 112 applies a trained model to analyze a face in the image to extract a set of facial anchor points on the face. The application 112 may receive the trained model from the face tracking system 140 and after applying the model, use the extracted set of facial anchor points to interpret or augment the image. The application 112 may determine facial anchor points as described below with respect to modules of the face tracking system 140. After identifying the facial anchor points, the application 112 may use the anchor points to track and characterize the face, for example to look for further features of the face between anchor points, or to display an overlay or mask over the user's face. The anchor points may also be captured over time to identify how a user's face moves during a video capture, which may for example be used to populate animated expressions using the anchor points, among other uses. The application 112 may also send the set of facial anchor points to another client device or the face tracking system 140 for similar uses. An a further example, the application 112 may provide video chat services for users of the client device, permitting users to capture and send video to another user. By capturing the anchor points of a face during the video, the video can be augmented using the anchor points, e.g., to add a mask to a user's face, or the by sending the anchor points for each frame of the video to another client device. In some embodiments, the anchor points may be determined for an initial frame of the video, and subsequent frames may use alternate face tracking techniques to monitor the movement of the face after the anchor points have been determined. [0021]-[0026], [0027] FIG. 1B shows examples of a captured image 160 and identification of a facial shape for the image, in accordance with an embodiment. FIG. 1B includes a bounding box 162 having an identified face 164, a cropped bounding box 166, a default shape 168 and a fitted shape 170 of the system environment illustrated in FIG. 1A. As shown in FIG. 1B, the default shape 168 has predefined facial anchor points around eyes, noses, mouth, and jaw lines. The default shape 168 is centered and scaled according to the cropped bounding box 166. The default shape does not account for the actual position and alignment of the face in the image. By applying the prediction model as described below, the fitted shape 170 is identified that has better positions of the facial anchor points aligned with the identified face in the cropped bounding box 166 than the adjusted default shape 172.)
1. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; / 18. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; / 20. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; (Zatepyakin:[0028]- n one embodiment, the face alignment module 146 uses a barycentric mesh-based shape for prediction. The barycentric mesh-based shape uses a barycentric coordinates system. The barycentric coordinate system is a coordinate system in which a position of a point within an element (e.g., a triangle, or tetrahedron) is represented by a linear combination of its vertices. For example, when the element is a triangle, points inside the triangle can be represented by a linear combination of three vertices of the triangle. The mesh-based shape may consist of multiple triangles covering all the predefined facial anchor points. Each facial anchor point can be represented by a linear combination of vertices in an associated triangle. [0029] FIG. 2 shows an example of barycentric mesh-based shapes, in accordance with an embodiment. As shown in FIG. 2, a barycentric mesh-based default shape 210 has multiple triangles. The triangles cover all the predefined facial anchor points as shown in dash lines. The barycentric mesh-based default shape 210 may be adjusted according to the cropped bounding box 166. The adjusted barycentric mesh-based default shape 220 may determine updated positions of predefined facial anchor points 230 using vertices of the associated triangles to correspond the predefined facial anchor points to the default shape applied to the cropped bounding box 166. When applying the prediction model, a barycentric mesh-based fitted shape 240 is generated to adjust the mesh to the face within the image and include updated triangles. Then, the barycentric mesh-based fitted shape 240 may determine updated positions of predefined facial anchor points 250 using vertices of associated update triangles. )
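For illustration of the barycentric representation quoted above (values are invented; this is not code from the reference): a facial anchor point inside a triangle is a weighted combination of the triangle's vertices, and when the mesh is adjusted the same weights relocate the point without any transformation matrix.

    import numpy as np

    v1, v2, v3 = np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([0.0, 4.0])
    a, b, c = 0.5, 0.25, 0.25                 # barycentric weights, a + b + c == 1
    anchor_point = a * v1 + b * v2 + c * v3   # -> [1.0, 1.0]

    # Adjusting the mesh moves the vertices; the same weights give the
    # updated anchor point position.
    v1n, v2n, v3n = v1 + 1.0, v2 + 1.0, v3 + 1.0
    updated_point = a * v1n + b * v2n + c * v3n   # -> [2.0, 2.0]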
1. and cropping the image using the anchor shape. / 18. and cropping the image using the anchor shape. / 20. and cropping the image using the anchor shape. (Zatepyakin: [0031] FIG. 3 shows an example of a regression tree 300 for generating an adjustment vector, in accordance with an embodiment. In the example of FIG. 3, the regression tree 300 includes two depths and 4 leafs (N3-N6). An input for the regression tree 300 includes a cropped bounding box 168 having an identified face and a barycentric mesh-based default shape 210. In other examples, the mesh shape input to the tree may include already-applied adjustments to the default mesh, for example from a prior adjustment of the shape to match the face. For node N0, two positions A and B close to predefined facial anchor points are specified in the default shape 210. The default shape 210 is adjusted according to the cropped bounding box 168. After adjusting the default shape to the cropped bounding box 168, the adjusted default shape 220 may have the same size as the cropped bounding box 168.)
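The regression-tree traversal described in Zatepyakin's paragraphs [0031]-[0033] can be sketched as follows (thresholds, pixel positions, and leaf adjustment vectors are invented for illustration; this is not code from the reference): each internal node compares the difference of two pixel intensities against a learned threshold, and the leaf that is reached supplies an adjustment vector for the mesh shape.

    import numpy as np

    def traverse(image, node):
        if "leaf" in node:
            return node["leaf"]                 # adjustment vector at a leaf
        (ay, ax), (by, bx) = node["pixels"]     # positions defined relative to the shape
        diff = float(image[ay, ax]) - float(image[by, bx])
        branch = "left" if diff < node["threshold"] else "right"
        return traverse(image, node[branch])

    tree = {"pixels": ((2, 3), (5, 1)), "threshold": 0.1,
            "left": {"leaf": np.array([0.2, -0.1])},
            "right": {"leaf": np.array([-0.3, 0.4])}}
    adjustment = traverse(np.random.rand(8, 8), tree)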
Even if Zatepyakin does not teach: analyzing different portions of the image that are formed from respective shapes of a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes,
Riemenschneider teaches:
1. A method comprising: / 18. A non-transitory computer-readable storage medium having stored thereon computer executable instructions, which when executed by a computing device, cause the computing device to be operable for: / 20. An apparatus comprising: one or more computer processors; and a computer-readable storage medium comprising instructions for controlling the one or more computer processors to be operable for: (Riemenschneider: abstract, One embodiment of the present invention sets forth a technique for generating augmented reality (AR) content. The technique includes inputting a first layout of a physical space and a first set of anchor content into a machine learning model. The technique also includes generating, via execution of the machine learning model, a first three-dimensional (3D) volume that includes (i) a first subset of the physical space and (ii) a placement of one or more 3D representations of the first set of anchor content in a second subset of the physical space. The technique further includes causing one or more views of the first 3D volume to be outputted in a computing device. [0021]-[0030], Figures 1, and 2A)
1. receiving an image; / 18. receiving an image; / 20. receiving an image; (Riemenschneider: [0028]-[0030], Figures 1, 2A; [0028] In some embodiments, training engine 122 trains one or more machine learning models to generate augmented reality (AR) environments that incorporate traditional media content into physical spaces. For example, training engine 122 could train one or more neural networks to extend a two-dimensional (2D) scene depicted in an image and/or video across walls, ceilings, floors, and/or other surfaces of a room. Training engine 122 could also, or instead, train one or more neural networks to generate a three-dimensional (3D) volume that incorporates objects, colors, shapes, textures, structures, and/or other attributes of the 2D scene into the layout of the room.[0029] Execution engine 124 uses the trained machine learning model(s) to generate AR environments that combine traditional media content with layouts of physical spaces. For example, execution engine 124 could input a physical layout of a room and anchor content that includes an image or video depicting a scene into a trained neural network. Execution engine 124 could use the trained neural network to depict chairs, tables, windows, doors, and/or other objects in the room in a normal fashion while extending the scene across the walls, ceilings, floors, and/or other surfaces of the room. Execution engine 124 could also, or instead, use the trained neural network to generate a 3D volume that places 2D and/or 3D representations of objects, colors, shapes, textures, and/or other attributes of the 2D scene into the room. “Neural Generation of Augmented Reality Environments from Anchor Content” [0030] FIG. 2A is a more detailed illustration of training engine 122 and execution engine 124 of FIG. 1 , according to various embodiments. More specifically, FIG. 2A illustrates the operation of training engine 122 and execution engine 124 in using a machine learning model 200 to generate an AR environment 290 that extends a set of anchor content 230 across a layout 232 of a physical space.)
1. analyzing different portions of the image that are formed from respective shapes of a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; / 18. analyzing different portions of the image that are formed from respective shapes of a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; / 20. analyzing different portions of the image that are formed from respective shapes of a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; (Riemenschneider: [0028]-[0030], Figures 1, 2A; Neural Generation of Augmented Reality Environments from Anchor Content, [0034] Machine learning model 200 includes a space segmentation network 202, a content segmentation network 204, and an extrapolation network 206. In some embodiments, space segmentation network 202, content segmentation network 204, and extrapolation network 206 are implemented as neural networks and/or other types of machine learning models. For example, space segmentation network 202, content segmentation network 204, and extrapolation network 206 could include, but are not limited to, one or more convolutional neural networks, fully connected neural networks, recurrent neural networks, residual neural networks, transformer neural networks, autoencoders, variational autoencoders, generative adversarial networks, autoregressive models, bidirectional attention models, mixture models, diffusion models, and/or other types of machine learning models that can process and/or generate content. [0035] More specifically, machine learning model 200 generates output 2D content 238 that is incorporated into AR environment 290 based on anchor content 230 and layout 232. Layout 232 includes the positions and/or orientations of objects 234(1)-234(X) (each of which is referred to individually as object 234) in a physical space. For example, layout 232 could include a 2D or 3D “map” of a room. The map includes a semantic segmentation of the room into different regions corresponding walls, floor, ceiling, door, table, chair, rug, window, and/or other objects in the room. [0036] In one or more embodiments, layout 232 is generated by space segmentation network 202 based on sensor data 228 associated with the physical space. For example, sensor data 228 could include images, depth maps, point clouds, and/or another representation of the physical space. Sensor data 228 could be collected by cameras, inertial sensors, depth sensors, and/or other types of sensors on an augmented reality device and/or another type of computing device within and/or in proximity to the physical space. Sensor data 228 could also be used to generate a 2D or 3D model corresponding to a “virtual twin” of the physical space. Sensor data 228 and/or the virtual twin could be inputted into space segmentation network 202, and predictions of objects and/or categories of objects for individual elements (e.g., pixel locations, points in a point cloud, etc.), locations, or regions in sensor data 228 and/or the virtual win could be obtained as output of space segmentation network 202.)
1. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; / 18. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; / 20. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; (Riemenschneider:[0037] In some embodiments, anchor content 230 is similarly processed by content segmentation network 204 to generate a content segmentation 294. For example, one or more images in anchor content 230 could be inputted into content segmentation network 204, and content segmentation 294 could be obtained as predictions of objects and/or categories of objects (e.g., foreground, background, clouds, stars, objects, animals, plants, faces, structures, shapes, settings, etc.) generated by content segmentation network 204 for individual pixel locations and/or other subsets of the image(s). [0038] Anchor content 230, sensor data 228, layout 232, and/or content segmentation 294 are provided as input into extrapolation network 206. In response to the input, extrapolation network 206 generates latent representations 236(1)-236(Y) (each of which is referred to individually as latent representation 236) of various portions of the inputted data. Extrapolation network 206 also converts latent representations 236 into output 2D content 238 that includes a number of images 240(1)-240(Z) (each of which is referred to individually as image 240). [0039] In some embodiments, each image 240 in output 2D content 238 represents one or more portions of the physical space and depicts a semantically meaningful extension of anchor content 230 into the physical space. For example, output 2D content 238 could include six images 240 corresponding to six surfaces of a cube that represents a standard box-shaped room. In another example, output 2D content 238 could include one or more images 240 that depict a 360-degree, spherical, and/or another type of “panorama” view of a physical space that is not limited to a box-shaped room. In both examples, each image 240 could include real-world objects in the room, such as (but not limited to) doors, windows, furniture, and/or decorations. Each image 240 could also include various subsets of anchor content 230 (as identified in content segmentation 294) overlaid onto the walls, floor, ceiling, and/or other surfaces in the room. These components of anchor content 230 could additionally be positioned or distributed within the corresponding images 240 to avoid occluding and/or overlapping with doors, windows, furniture, decorations, and/or certain other types of objects in the room.)
1. and cropping the image using the anchor shape. / 18. and cropping the image using the anchor shape. / 20. and cropping the image using the anchor shape. (Riemenschneider: [0055], [0059]-[0062], [0060] In one or more embodiments, anchor content 296 is outputted and/or captured within the physical space that is incorporated into AR environment 292. For example, anchor content 296 could include a photograph, painting, and/or 2D or 3D video that is outputted by a television, projector, display, and/or another visual output device in a room. In another example, anchor content 296 could include a painting, photograph, mural, sculpture, sound, and/or another physical entity that is present in or detected from the room. Anchor content 296 can be specified by a user interacting with AR environment 292 using a bounding box or bounding shape, via a calibration process that involves displaying a known image on an output device prior to displaying anchor content 230, and/or via another method. [0060] Anchor content 296 can also, or instead, exist separately from the physical space into which anchor content 296 is to be placed. For example, anchor content 296 could be specified as a file that includes an image, video, audio, text, 3D model, and/or another type of content that can be retrieved from a data store and incorporated into AR environment 292. In another example, a user interacting with AR environment 292 could generate and/or update anchor content 296 via one or more cropping, scaling, rotation, translation, color adjustment, and/or painting operations. [0061] Machine learning model 280 includes a space segmentation network 242, a content segmentation network 244, and a 3D synthesis network 246. In some embodiments, space segmentation network 242, content segmentation network 244, and 3D synthesis network 246 are implemented as neural networks and/or other types of machine learning models. For example, space segmentation network 242, content segmentation network 244, and 3D synthesis network 246 could include, but are not limited to, one or more convolutional neural networks, fully connected neural networks, recurrent neural networks, residual neural networks, transformer neural networks, autoencoders, variational autoencoders, generative adversarial networks, autoregressive models, bidirectional attention models, mixture models, diffusion models, neural radiance field models, and/or other types of machine learning models that can process and/or generate content. [0069]-[0071])
It would have been obvious before the effective filing date of the claimed invention to one of ordinary skill in the art to modify Zatepyakin’s method and system for a learning algorithm for feature extraction using atom weights in a learned dictionary to leverage the learned anchor content of Riemenschneider’s augmented reality system. The determination of obviousness is predicated upon the following findings: both references are directed towards the field of 3D image analysis, and one skilled in the art would have been motivated to modify the dictionary learning and feature extraction process of Zatepyakin in order to incorporate anchor content and layouts of those feature spaces. Furthermore, the prior art collectively includes each claimed element (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface and programming techniques, without changing a “fundamental” operating principle of Zatepyakin, while the teaching of Riemenschneider continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of leveraging anchor content with the physical layout of spaces to improve the overall feature detection and analysis process. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claims in question.
Consider Claims 2 and 19.
The combination of Zatepyakin and Riemenschneider teaches:
2. The method of claim 1, wherein anchor shapes in the plurality of anchor shapes crop different portions of the image. / 19. The non-transitory computer-readable storage medium of claim 18, wherein anchor shapes in the plurality of anchor shapes crop different portions of the image. (Zatepyakin: [0035]-[0044], [0037] Though the cascading trees shown in FIG. 4 include two levels for illustration, the changes to the shape from one level of the cascade to another may be applied to a large number of cascade levels and across many trees for each level with greater depth. [0038] FIG. 5 is a flowchart illustrating a process for predicting a shape based on a cascade of regression trees, in accordance with an embodiment. The process 500 may include different or additional steps than those described in conjunction with FIG. 5 in some embodiments or perform steps in different orders than the order described in conjunction with FIG. 5. [0039] The face tracking system 140 identifies 530 a mesh shape adjusted to an image of a face, the mesh shape having a set of elements, each element having a set of vertices. For example, the mesh shape is in a barycentric coordinate system. At a first cascade level, the mesh shape is a default shape or the default shape adjusted to the cropped image. An example is described in FIG. 4. In another example, at other cascade levels, the current shape is a fitted shape from the prior level. The fitted shape may have the same size as an image to be aligned as discussed in FIG. 4.)
Consider Claim 3.
The combination of Zatepyakin and Riemenschneider teaches:
3. The method of claim 1, wherein anchor shapes in the plurality of anchor shapes are predefined shapes. (Zatepyakin: [0035]-[0044], [0037] Though the cascading trees shown in FIG. 4 include two levels for illustration, the changes to the shape from one level of the cascade to another may be applied to a large number of cascade levels and across many trees for each level with greater depth. [0038] FIG. 5 is a flowchart illustrating a process for predicting a shape based on a cascade of regression trees, in accordance with an embodiment. The process 500 may include different or additional steps than those described in conjunction with FIG. 5 in some embodiments or perform steps in different orders than the order described in conjunction with FIG. 5. [0039] The face tracking system 140 identifies 530 a mesh shape adjusted to an image of a face, the mesh shape having a set of elements, each element having a set of vertices. For example, the mesh shape is in a barycentric coordinate system. At a first cascade level, the mesh shape is a default shape or the default shape adjusted to the cropped image. An example is described in FIG. 4. In another example, at other cascade levels, the current shape is a fitted shape from the prior level. The fitted shape may have the same size as an image to be aligned as discussed in FIG. 4.)
Consider Claim 4.
The combination of Zatepyakin and Riemenschneider teaches:
4. The method of claim 1, wherein analyzing different portions of the image comprises: generating a feature map from the image, wherein the feature map represents one or more characteristics of the image; and analyzing the feature map to generate respective outputs for the anchor shapes. (Zatepyakin: [0032] After comparing the normalized difference of pixels C′ and D′, node N1 proceeds to node N3 or N4 based on the threshold. If the normalized difference is smaller than the first learned threshold, at a leaf N4, an adjustment vector is generated. The adjustment vector is applied to the adjusted default shape 220 to generate a fitted shape 320. The fitted shape 320 has the same size as the cropped bounding box 168. [0033] Since the positions of a node are defined with respect to the elements of the adjusted default shape, the pixel coordinates are quickly identified using the vertices of the specified element in the adjusted default shape. The barycentrically-defined position can then be applied to the vertices to determine the pixel location within the image. This permits rapid traversal of the tree, as identification of desired pixels for the threshold comparison simply looks up the location of the desired pixel by the coordinates of the adjusted default shape, which is ‘overlaid’ on the image and mapped to the image coordinates. As such, this technique does not require a transformation matrix (e.g., describing scale and rotation modifications) or other complex formula to map pixel comparison locations for a node to the image. This reduces errors and computational cost caused by calculations of transformation matrix. A Prediction Model Based on a Cascade of Regression Trees [0034] FIG. 4 shows an example of a prediction model based on a cascade 400 of regression trees, in accordance with an embodiment. In some embodiments, a prediction model may be generated by a cascade of regression trees. A cascade of regression trees may have multiple levels and multiple regression trees for each level. Riemenschneider: [0055], [0059]-[0062], [0060] In one or more embodiments, anchor content 296 is outputted and/or captured within the physical space that is incorporated into AR environment 292. For example, anchor content 296 could include a photograph, painting, and/or 2D or 3D video that is outputted by a television, projector, display, and/or another visual output device in a room. In another example, anchor content 296 could include a painting, photograph, mural, sculpture, sound, and/or another physical entity that is present in or detected from the room. Anchor content 296 can be specified by a user interacting with AR environment 292 using a bounding box or bounding shape, via a calibration process that involves displaying a known image on an output device prior to displaying anchor content 230, and/or via another method. [0060] Anchor content 296 can also, or instead, exist separately from the physical space into which anchor content 296 is to be placed. For example, anchor content 296 could be specified as a file that includes an image, video, audio, text, 3D model, and/or another type of content that can be retrieved from a data store and incorporated into AR environment 292. In another example, a user interacting with AR environment 292 could generate and/or update anchor content 296 via one or more cropping, scaling, rotation, translation, color adjustment, and/or painting operations. 
[0061] Machine learning model 280 includes a space segmentation network 242, a content segmentation network 244, and a 3D synthesis network 246. In some embodiments, space segmentation network 242, content segmentation network 244, and 3D synthesis network 246 are implemented as neural networks and/or other types of machine learning models. For example, space segmentation network 242, content segmentation network 244, and 3D synthesis network 246 could include, but are not limited to, one or more convolutional neural networks, fully connected neural networks, recurrent neural networks, residual neural networks, transformer neural networks, autoencoders, variational autoencoders, generative adversarial networks, autoregressive models, bidirectional attention models, mixture models, diffusion models, neural radiance field models, and/or other types of machine learning models that can process and/or generate content. [0069]-[0071])
Consider Claim 6.
The combination of Zatepyakin and Riemenschneider teaches:
6. The method of claim 1, wherein analyzing different portions of the image comprises: analyzing the different portions of the image using a plurality of prediction networks, wherein prediction networks in the plurality of prediction networks are associated with respective anchor shapes in the plurality of anchor shapes. (Zatepyakin: [0024] The interface module 142 facilitates the communication among the client device 110, the face tracking system 140, and the external source 130. In one embodiment, the interface module 142 interacts with the client devices 110 and may provide a prediction model for extracting anchor points to the client device 110, and may also receive the captured face image and provide extracted facial anchor points to the client device 110. The interface module 142 may receive one or more face databases from the external source 130. In another embodiment, the interface module 142 may provide the prediction model to the client device 110 for further processing. [0025]-[0026] The face alignment module 146 localizes facial anchor points with the captured face image using a prediction model. Examples of facial anchor points may include contour points around facial features such as eyes, noses, mouth, and jaw lines. The prediction model predicts a fitted shape of the face based on a default shape and the captured face image. A default shape provides a set of predefined facial anchor points corresponding to a generic face. In some embodiments, a default shape may be a mean shape obtained from training data as further described below. The default shape may be centered and scaled according to a bounding box including an identified face. The bounding box may be cropped for further processing to reduce computational cost. Riemenschneider: [0055], [0059]-[0062], [0060] In one or more embodiments, anchor content 296 is outputted and/or captured within the physical space that is incorporated into AR environment 292. For example, anchor content 296 could include a photograph, painting, and/or 2D or 3D video that is outputted by a television, projector, display, and/or another visual output device in a room. In another example, anchor content 296 could include a painting, photograph, mural, sculpture, sound, and/or another physical entity that is present in or detected from the room. Anchor content 296 can be specified by a user interacting with AR environment 292 using a bounding box or bounding shape, via a calibration process that involves displaying a known image on an output device prior to displaying anchor content 230, and/or via another method. [0060] Anchor content 296 can also, or instead, exist separately from the physical space into which anchor content 296 is to be placed. For example, anchor content 296 could be specified as a file that includes an image, video, audio, text, 3D model, and/or another type of content that can be retrieved from a data store and incorporated into AR environment 292. In another example, a user interacting with AR environment 292 could generate and/or update anchor content 296 via one or more cropping, scaling, rotation, translation, color adjustment, and/or painting operations. [0061] Machine learning model 280 includes a space segmentation network 242, a content segmentation network 244, and a 3D synthesis network 246. In some embodiments, space segmentation network 242, content segmentation network 244, and 3D synthesis network 246 are implemented as neural networks and/or other types of machine learning models. 
For example, space segmentation network 242, content segmentation network 244, and 3D synthesis network 246 could include, but are not limited to, one or more convolutional neural networks, fully connected neural networks, recurrent neural networks, residual neural networks, transformer neural networks, autoencoders, variational autoencoders, generative adversarial networks, autoregressive models, bidirectional attention models, mixture models, diffusion models, neural radiance field models, and/or other types of machine learning models that can process and/or generate content. [0069]-[0071])
Consider Claim 7.
The combination of Zatepyakin and Riemenschneider teaches:
7. The method of claim 6, wherein each prediction network is associated with an anchor shape in the plurality of anchor shapes and generates an output based on the respective anchor shape. (Zatepyakin: [0026] The face alignment module 146 localizes facial anchor points with the captured face image using a prediction model. Examples of facial anchor points may include contour points around facial features such as eyes, noses, mouth, and jaw lines. The prediction model predicts a fitted shape of the face based on a default shape and the captured face image. [0027] FIG. 1B shows examples of a captured image 160 and identification of a facial shape for the image, in accordance with an embodiment. FIG. 1B includes a bounding box 162 having an identified face 164, a cropped bounding box 166, a default shape 168 and a fitted shape 170 of the system environment illustrated in FIG. 1A. As shown in FIG. 1B, the default shape 168 has predefined facial anchor points around eyes, noses, mouth, and jaw lines. The default shape 168 is centered and scaled according to the cropped bounding box 166. The default shape does not account for the actual position and alignment of the face in the image. By applying the prediction model as described below, the fitted shape 170 is identified that has better positions of the facial anchor points aligned with the identified face in the cropped bounding box 166 than the adjusted default shape 172. Barycentric Mesh-Based Shape [0028] In one embodiment, the face alignment module 146 uses a barycentric mesh-based shape for prediction. The barycentric mesh-based shape uses a barycentric coordinates system. The barycentric coordinate system is a coordinate system in which a position of a point within an element (e.g., a triangle, or tetrahedron) is represented by a linear combination of its vertices. For example, when the element is a triangle, points inside the triangle can be represented by a linear combination of three vertices of the triangle. The mesh-based shape may consist of multiple triangles covering all the predefined facial anchor points. Each facial anchor point can be represented by a linear combination of vertices in an associated triangle. [0029] FIG. 2 shows an example of barycentric mesh-based shapes, in accordance with an embodiment. As shown in FIG. 2, a barycentric mesh-based default shape 210 has multiple triangles.)
Consider Claim 8.
The combination of Zatepyakin and Riemenschneider teaches:
8. The method of claim 6, wherein each prediction network analyzes information from different portions of the image based on the respective anchor shape to generate the output. (Zatepyakin: [0027] FIG. 1B shows examples of a captured image 160 and identification of a facial shape for the image, in accordance with an embodiment. FIG. 1B includes a bounding box 162 having an identified face 164, a cropped bounding box 166, a default shape 168 and a fitted shape 170 of the system environment illustrated in FIG. 1A. As shown in FIG. 1B, the default shape 168 has predefined facial anchor points around eyes, noses, mouth, and jaw lines. The default shape 168 is centered and scaled according to the cropped bounding box 166. The default shape does not account for the actual position and alignment of the face in the image. By applying the prediction model as described below, the fitted shape 170 is identified that has better positions of the facial anchor points aligned with the identified face in the cropped bounding box 166 than the adjusted default shape 172. Barycentric Mesh-Based Shape [0028] In one embodiment, the face alignment module 146 uses a barycentric mesh-based shape for prediction. The barycentric mesh-based shape uses a barycentric coordinates system. The barycentric coordinate system is a coordinate system in which a position of a point within an element (e.g., a triangle, or tetrahedron) is represented by a linear combination of its vertices. For example, when the element is a triangle, points inside the triangle can be represented by a linear combination of three vertices of the triangle. The mesh-based shape may consist of multiple triangles covering all the predefined facial anchor points. Each facial anchor point can be represented by a linear combination of vertices in an associated triangle. [0029] FIG. 2 shows an example of barycentric mesh-based shapes, in accordance with an embodiment. As shown in FIG. 2, a barycentric mesh-based default shape 210 has multiple triangles. Riemenschneider: [0055], [0059]-[0062], [0060] In one or more embodiments, anchor content 296 is outputted and/or captured within the physical space that is incorporated into AR environment 292. For example, anchor content 296 could include a photograph, painting, and/or 2D or 3D video that is outputted by a television, projector, display, and/or another visual output device in a room. In another example, anchor content 296 could include a painting, photograph, mural, sculpture, sound, and/or another physical entity that is present in or detected from the room. Anchor content 296 can be specified by a user interacting with AR environment 292 using a bounding box or bounding shape, via a calibration process that involves displaying a known image on an output device prior to displaying anchor content 230, and/or via another method. [0060] Anchor content 296 can also, or instead, exist separately from the physical space into which anchor content 296 is to be placed. For example, anchor content 296 could be specified as a file that includes an image, video, audio, text, 3D model, and/or another type of content that can be retrieved from a data store and incorporated into AR environment 292. In another example, a user interacting with AR environment 292 could generate and/or update anchor content 296 via one or more cropping, scaling, rotation, translation, color adjustment, and/or painting operations. 
[0061] Machine learning model 280 includes a space segmentation network 242, a content segmentation network 244, and a 3D synthesis network 246. In some embodiments, space segmentation network 242, content segmentation network 244, and 3D synthesis network 246 are implemented as neural networks and/or other types of machine learning models. For example, space segmentation network 242, content segmentation network 244, and 3D synthesis network 246 could include, but are not limited to, one or more convolutional neural networks, fully connected neural networks, recurrent neural networks, residual neural networks, transformer neural networks, autoencoders, variational autoencoders, generative adversarial networks, autoregressive models, bidirectional attention models, mixture models, diffusion models, neural radiance field models, and/or other types of machine learning models that can process and/or generate content. [0069]-[0071])
Consider Claim 9.
The combination of Zatpeyakin and Riemenschneider teaches:
9. The method of claim 6, wherein each prediction network analyzes information that is within the respective different portions of the image based on respective anchor shape and not outside of the respective different portions of the image to generate the output. (Zatepyakin: [0027] FIG. 1B shows examples of a captured image 160 and identification of a facial shape for the image, in accordance with an embodiment. FIG. 1B includes a bounding box 162 having an identified face 164, a cropped bounding box 166, a default shape 168 and a fitted shape 170 of the system environment illustrated in FIG. 1A. As shown in FIG. 1B, the default shape 168 has predefined facial anchor points around eyes, noses, mouth, and jaw lines. The default shape 168 is centered and scaled according to the cropped bounding box 166. The default shape does not account for the actual position and alignment of the face in the image. By applying the prediction model as described below, the fitted shape 170 is identified that has better positions of the facial anchor points aligned with the identified face in the cropped bounding box 166 than the adjusted default shape 172. Barycentric Mesh-Based Shape [0028] In one embodiment, the face alignment module 146 uses a barycentric mesh-based shape for prediction. The barycentric mesh-based shape uses a barycentric coordinates system. The barycentric coordinate system is a coordinate system in which a position of a point within an element (e.g., a triangle, or tetrahedron) is represented by a linear combination of its vertices. For example, when the element is a triangle, points inside the triangle can be represented by a linear combination of three vertices of the triangle. The mesh-based shape may consist of multiple triangles covering all the predefined facial anchor points. Each facial anchor point can be represented by a linear combination of vertices in an associated triangle. [0029] FIG. 2 shows an example of barycentric mesh-based shapes, in accordance with an embodiment. As shown in FIG. 2, a barycentric mesh-based default shape 210 has multiple triangles.)
Consider Claim 10.
The combination of Zatpeyakin and Riemenschneider teaches:
10. The method of claim 9, wherein the information comprises a portion of a feature map that represents one or more characteristics of the image. (Zatepyakin: [0032] After comparing the normalized difference of pixels C′ and D′, node N1 proceeds to node N3 or N4 based on the threshold. If the normalized difference is smaller than the first learned threshold, at a leaf N4, an adjustment vector is generated. The adjustment vector is applied to the adjusted default shape 220 to generate a fitted shape 320. The fitted shape 320 has the same size as the cropped bounding box 168. [0033] Since the positions of a node are defined with respect to the elements of the adjusted default shape, the pixel coordinates are quickly identified using the vertices of the specified element in the adjusted default shape. The barycentrically-defined position can then be applied to the vertices to determine the pixel location within the image. This permits rapid traversal of the tree, as identification of desired pixels for the threshold comparison simply looks up the location of the desired pixel by the coordinates of the adjusted default shape, which is ‘overlaid’ on the image and mapped to the image coordinates. As such, this technique does not require a transformation matrix (e.g., describing scale and rotation modifications) or other complex formula to map pixel comparison locations for a node to the image. This reduces errors and computational cost caused by calculations of transformation matrix. A Prediction Model Based on a Cascade of Regression Trees [0034] FIG. 4 shows an example of a prediction model based on a cascade 400 of regression trees, in accordance with an embodiment. In some embodiments, a prediction model may be generated by a cascade of regression trees. A cascade of regression trees may have multiple levels and multiple regression trees for each level.)
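As an illustrative aside: the "portion of a feature map" recited in claim 10 can be pictured as slicing a multi-channel feature map by an anchor region. The H x W x C layout and the names below are assumptions for illustration, not taken from either reference.

import numpy as np

def feature_map_portion(feature_map, anchor_box):
    # Extract the sub-tensor of an H x W x C feature map that lies under
    # an anchor box given as (x0, y0, x1, y1) in feature-map coordinates.
    x0, y0, x1, y1 = anchor_box
    return feature_map[y0:y1, x0:x1, :]

fmap = np.random.rand(32, 32, 64)          # hypothetical 32 x 32 map, 64 channels
patch = feature_map_portion(fmap, (4, 4, 12, 12))
print(patch.shape)                          # -> (8, 8, 64)

Only the extracted portion would then be analyzed, consistent with the claim language.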
Consider Claim 11.
The combination of Zatpeyakin and Riemenschneider teaches:
11. The method of claim 1, wherein the output represents a score for an overlap of a respective anchor shape and a preferred cropped image. (Zatpeyakin: [0053] The compression may be performed by the face tracking system 140 by transforming the adjustment vectors of each leaf to correspond to a dictionary of “atoms.” Each atom in the dictionary describes a function to adjust the values of one or more adjustment values. Thus, rather than defining the adjustment vector of a leaf node by the complete set of vector adjustment values, the leaf node may specify a set of atoms in the dictionary and a weight for each atom in the dictionary. [0054] The face tracking system identifies a dictionary of atoms for which to determine the atoms for a leaf. The dictionary of atoms may be defined by a matrix specifying functions and an adjustment value that is the primary adjustment value that the function is applied on. For example, a function may specify modifying the primary adjustment value and a set of nearby adjustment values according to a decaying function. By specifying a variety of functions that can each apply different changes to the adjustment values and variously adjust other adjustment values, each atom may represent a significant amount of information about the adjustment values, and a small number of atoms together can represent significant change in the adjustment vector. Thus, in one embodiment, the dictionary defines a matrix in which one side of the matrix represents a set of functions and another side of the matrix represents the set of adjustment values. The intersection of a given adjustment value and a given function in the matrix represents an atom for applying the given function to the given adjustment value as the primary adjustment value. In one embodiment, there are 136 adjustment values in the matrix and 1024 functions. [0055])
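For illustration only: the compression described in [0053]-[0054], where a leaf stores a few weighted "atoms" instead of a full adjustment vector, is a sparse dictionary reconstruction. The 136 adjustment values come from [0054]; the decaying-bump atom and all names below are hypothetical.

import numpy as np

N_VALUES = 136                        # adjustment values per leaf, per [0054]

def atom(primary_index, decay):
    # One hypothetical atom: a unit bump at the primary adjustment value
    # that decays over nearby values, per the decaying-function example.
    idx = np.arange(N_VALUES)
    return np.exp(-decay * np.abs(idx - primary_index))

def decompress_leaf(atom_specs):
    # Rebuild a full adjustment vector from a leaf's sparse list of
    # (atom parameters, weight) pairs -- the compressed form stored
    # instead of all 136 raw values.
    vec = np.zeros(N_VALUES)
    for (primary_index, decay), weight in atom_specs:
        vec += weight * atom(primary_index, decay)
    return vec

# A leaf holding three weighted atoms instead of 136 raw values:
leaf = [((10, 0.5), 1.2), ((67, 1.0), -0.4), ((120, 0.25), 0.7)]
print(decompress_leaf(leaf).shape)    # -> (136,)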
Consider Claim 12.
The combination of Zatpeyakin and Riemenschneider teaches:
12. The method of claim 1, further comprising: analyzing the different portions of the image based on the plurality of anchor shapes to generate offset coordinates for anchor shapes in the plurality of anchor shapes, wherein the offset coordinates are used to crop the image. (Zatepyakin: Node Split Based on a Pixel Comparison in a Regression Tree [0030] To extract and generate the set of anchor points 250, the prediction model uses regression trees. A regression tree has multiple nodes. The nodes can be divided into split nodes and leafs. Each leaf (e.g., a node without children) generates an adjustment vector to adjust a current shape. A split node represents a traversal decision of the tree. At each split node in the regression tree, a traversal decision is made based on a threshold difference between intensities of two pixels in a captured image. Two pixels are defined in a coordinate system of the default shape. To compare coordinates for traversing the tree, however, the coordinate system of the two pixels is translated to the location of the shape on the image. Thus, the coordinate system of the default shape is translated through the current position of the shape to determine a coordinate on the image. For example, a captured image is represented in a Cartesian coordinate system, and a barycentric mesh-based default shape is represented in a Barycentric coordinate system. Two positions in the barycentric mesh-based default shape close to predefined facial anchor points are selected. As mentioned above, the barycentric mesh-based default shape is adjusted according to a bounding box in the captured image. The shape may be further adjusted to one or more fitted shapes, as further discussed below, that closer align the shape with the facial image. The two positions of pixels in the coordinate system of the default shape are also translated according to the adjusted shape to determine the corresponding pixels on the image. A difference between intensities of the two determined pixels on the image can be calculated. For example, assume that there are two pixels A and B. A normalized difference between two pixels is calculated based on (pixel A−pixel B)/(pixel A+pixel B). In another example, a difference may be calculated based on (pixel A−pixel B). By comparing the calculated normalized difference or difference with an associated threshold, a decision is made designating a subsequent node in the tree. [0031] FIG. 3 shows an example of a regression tree 300 for generating an adjustment vector, in accordance with an embodiment. In the example of FIG. 3, the regression tree 300 includes two depths and 4 leafs (N3-N6). An input for the regression tree 300 includes a cropped bounding box 168 having an identified face and a barycentric mesh-based default shape 210. In other examples, the mesh shape input to the tree may include already-applied adjustments to the default mesh, for example from a prior adjustment of the shape to match the face. For node N0, two positions A and B close to predefined facial anchor points are specified in the default shape 210. The default shape 210 is adjusted according to the cropped bounding box 168. After adjusting the default shape to the cropped bounding box 168, the adjusted default shape 220 may have the same size as the cropped bounding box 168. Accordingly, the two positions A and B are adjusted to determine two pixels A′ and B′ in the adjusted default shape 220. 
Since the positions A, B may be defined with respect to a specific triangle or element in the default shape 210 and the adjusted default shape 220 is located on the image, the pixels A′, B′ in the image may be identified as the pixel location in the image corresponding to the element-defined coordinate of A′, B′ in the adjusted default shape 220. At node N0, a normalized difference between two pixels A′ and B′ in the image is calculated, and the normalized difference is compared with a first threshold associated with N0. The first threshold may be learned from training data. If the normalized difference is larger than the first learned threshold, the decision tree proceeds to node N1, and if the difference smaller than the first learned threshold, the decision tree proceeds to node N2. At a node N1, two pixels C′ and D′ close to predefined facial anchor points are similarly identified based on specified positions C and D for the node N1. That is, positions C and D may be specified by node N1 in respective barycentric coordinates with respect to an element of a mesh, and pixels C′ and D′ are determined by identifying the pixel in the image corresponding to the coordinates as applied to the location of element in the adjusted default shape 220.)
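As a sketch only: the split decision of [0030]-[0031] -- compare a normalized intensity difference of two shape-indexed pixels, (pixel A - pixel B)/(pixel A + pixel B), against a learned threshold and branch accordingly -- can be written as follows. The barycentric pixel lookup is abbreviated to direct indexing, and the node structure and toy thresholds are hypothetical.

import numpy as np

def normalized_difference(image, pos_a, pos_b):
    # (pixel A - pixel B) / (pixel A + pixel B), per [0030]; the sketch
    # assumes the denominator is nonzero.
    a, b = float(image[pos_a]), float(image[pos_b])
    return (a - b) / (a + b)

def traverse(tree, image):
    # Walk split nodes until a leaf; each leaf holds an adjustment vector.
    node = tree
    while "leaf" not in node:
        d = normalized_difference(image, node["pos_a"], node["pos_b"])
        node = node["gt"] if d > node["threshold"] else node["le"]
    return node["leaf"]

# A one-level toy tree; in the reference the thresholds are learned.
tree = {
    "pos_a": (5, 5), "pos_b": (10, 10), "threshold": 0.1,
    "gt": {"leaf": np.full(4, 0.5)},
    "le": {"leaf": np.full(4, -0.5)},
}
image = np.random.default_rng(1).integers(1, 255, size=(32, 32))
print(traverse(tree, image))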
Consider Claim 13.
The combination of Zatpeyakin and Riemenschneider teaches:
13. The method of claim 12, wherein: the offset coordinates adjust coordinates of the anchor shape to generate adjusted coordinates, and the adjusted coordinates are used to crop the image. (Zatepyakin: [0032] After comparing the normalized difference of pixels C′ and D′, node N1 proceeds to node N3 or N4 based on the threshold. If the normalized difference is smaller than the first learned threshold, at a leaf N4, an adjustment vector is generated. The adjustment vector is applied to the adjusted default shape 220 to generate a fitted shape 320. The fitted shape 320 has the same size as the cropped bounding box 168. [0033] Since the positions of a node are defined with respect to the elements of the adjusted default shape, the pixel coordinates are quickly identified using the vertices of the specified element in the adjusted default shape. The barycentrically-defined position can then be applied to the vertices to determine the pixel location within the image. This permits rapid traversal of the tree, as identification of desired pixels for the threshold comparison simply looks up the location of the desired pixel by the coordinates of the adjusted default shape, which is ‘overlaid’ on the image and mapped to the image coordinates. As such, this technique does not require a transformation matrix (e.g., describing scale and rotation modifications) or other complex formula to map pixel comparison locations for a node to the image. This reduces errors and computational cost caused by calculations of transformation matrix. A Prediction Model Based on a Cascade of Regression Trees [0034] FIG. 4 shows an example of a prediction model based on a cascade 400 of regression trees, in accordance with an embodiment. In some embodiments, a prediction model may be generated by a cascade of regression trees. A cascade of regression trees may have multiple levels and multiple regression trees for each level.)
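Illustrative note: for a rectangular anchor, claim 13's sequence -- adjust the anchor's coordinates by the offset coordinates, then crop with the adjusted coordinates -- reduces to the sketch below. The (x0, y0, x1, y1) layout is an assumption for illustration.

import numpy as np

def crop_with_offsets(image, anchor, offsets):
    # Adjust anchor coordinates by per-coordinate offsets, then crop
    # with the adjusted coordinates.
    adjusted = [int(round(a + o)) for a, o in zip(anchor, offsets)]
    x0, y0, x1, y1 = adjusted
    return image[y0:y1, x0:x1], adjusted

img = np.zeros((100, 100, 3), dtype=np.uint8)
crop, adjusted = crop_with_offsets(img, (10, 10, 50, 50), (2.0, -1.0, 3.5, 0.0))
print(adjusted, crop.shape)            # -> [12, 9, 54, 50] (41, 42, 3)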
Consider Claim 14.
The combination of Zatpeyakin and Riemenschneider teaches:
14. The method of claim 1, further comprising: training a model for the plurality of anchor shapes using a training image, wherein parameters of the model are adjusted using a comparison of a first output for an anchor shape to a second output that is based on a labeled shape for the training image. (Zatepyakin: [0027] FIG. 1B shows examples of a captured image 160 and identification of a facial shape for the image, in accordance with an embodiment. FIG. 1B includes a bounding box 162 having an identified face 164, a cropped bounding box 166, a default shape 168 and a fitted shape 170 of the system environment illustrated in FIG. 1A. As shown in FIG. 1B, the default shape 168 has predefined facial anchor points around eyes, noses, mouth, and jaw lines. The default shape 168 is centered and scaled according to the cropped bounding box 166. The default shape does not account for the actual position and alignment of the face in the image. By applying the prediction model as described below, the fitted shape 170 is identified that has better positions of the facial anchor points aligned with the identified face in the cropped bounding box 166 than the adjusted default shape 172. Barycentric Mesh-Based Shape [0028] In one embodiment, the face alignment module 146 uses a barycentric mesh-based shape for prediction. The barycentric mesh-based shape uses a barycentric coordinates system. The barycentric coordinate system is a coordinate system in which a position of a point within an element (e.g., a triangle, or tetrahedron) is represented by a linear combination of its vertices. For example, when the element is a triangle, points inside the triangle can be represented by a linear combination of three vertices of the triangle. The mesh-based shape may consist of multiple triangles covering all the predefined facial anchor points. Each facial anchor point can be represented by a linear combination of vertices in an associated triangle. [0029] FIG. 2 shows an example of barycentric mesh-based shapes, in accordance with an embodiment. As shown in FIG. 2, a barycentric mesh-based default shape 210 has multiple triangles. The triangles cover all the predefined facial anchor points as shown in dash lines. The barycentric mesh-based default shape 210 may be adjusted according to the cropped bounding box 166. The adjusted barycentric mesh-based default shape 220 may determine updated positions of predefined facial anchor points 230 using vertices of the associated triangles to correspond the predefined facial anchor points to the default shape applied to the cropped bounding box 166. When applying the prediction model, a barycentric mesh-based fitted shape 240 is generated to adjust the mesh to the face within the image and include updated triangles. Then, the barycentric mesh-based fitted shape 240 may determine updated positions of predefined facial anchor points 250 using vertices of associated update triangles.)
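For illustration only: the claim-14 training arrangement -- compare a first output produced by the model for an anchor shape with a second, label-derived output, and adjust the parameters using their difference -- is, in its simplest linear form, a least-squares update. The loop below is a generic, hypothetical sketch; neither cited reference discloses it.

import numpy as np

def train_step(params, features, target, lr=0.01):
    # first_output: the model's prediction; target: the second,
    # label-derived output. Their difference drives the adjustment.
    first_output = params @ features
    error = first_output - target             # compare the two outputs
    return params - lr * error * features     # squared-error gradient step

rng = np.random.default_rng(2)
params = rng.standard_normal(8)
for _ in range(200):
    feats = rng.standard_normal(8)
    target = feats.sum()                      # stand-in for a labeled target
    params = train_step(params, feats, target)
print(np.round(params, 2))                    # approaches the all-ones weights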
Consider Claim 17.
The combination of Zatpeyakin and Riemenschneider teaches:
17. The method of claim 1, wherein: the anchor shape is a shape that is defined by coordinates, and the coordinates are used to crop the image. (Zatepyakin: Node Split Based on a Pixel Comparison in a Regression Tree [0030] To extract and generate the set of anchor points 250, the prediction model uses regression trees. A regression tree has multiple nodes. The nodes can be divided into split nodes and leafs. Each leaf (e.g., a node without children) generates an adjustment vector to adjust a current shape. A split node represents a traversal decision of the tree. At each split node in the regression tree, a traversal decision is made based on a threshold difference between intensities of two pixels in a captured image. Two pixels are defined in a coordinate system of the default shape. To compare coordinates for traversing the tree, however, the coordinate system of the two pixels is translated to the location of the shape on the image. Thus, the coordinate system of the default shape is translated through the current position of the shape to determine a coordinate on the image. For example, a captured image is represented in a Cartesian coordinate system, and a barycentric mesh-based default shape is represented in a Barycentric coordinate system. Two positions in the barycentric mesh-based default shape close to predefined facial anchor points are selected. As mentioned above, the barycentric mesh-based default shape is adjusted according to a bounding box in the captured image. The shape may be further adjusted to one or more fitted shapes, as further discussed below, that closer align the shape with the facial image. The two positions of pixels in the coordinate system of the default shape are also translated according to the adjusted shape to determine the corresponding pixels on the image. A difference between intensities of the two determined pixels on the image can be calculated. For example, assume that there are two pixels A and B. A normalized difference between two pixels is calculated based on (pixel A−pixel B)/(pixel A+pixel B). In another example, a difference may be calculated based on (pixel A−pixel B). By comparing the calculated normalized difference or difference with an associated threshold, a decision is made designating a subsequent node in the tree. [0031] FIG. 3 shows an example of a regression tree 300 for generating an adjustment vector, in accordance with an embodiment. In the example of FIG. 3, the regression tree 300 includes two depths and 4 leafs (N3-N6). An input for the regression tree 300 includes a cropped bounding box 168 having an identified face and a barycentric mesh-based default shape 210. In other examples, the mesh shape input to the tree may include already-applied adjustments to the default mesh, for example from a prior adjustment of the shape to match the face. For node N0, two positions A and B close to predefined facial anchor points are specified in the default shape 210. The default shape 210 is adjusted according to the cropped bounding box 168. After adjusting the default shape to the cropped bounding box 168, the adjusted default shape 220 may have the same size as the cropped bounding box 168. Accordingly, the two positions A and B are adjusted to determine two pixels A′ and B′ in the adjusted default shape 220. 
Since the positions A, B may be defined with respect to a specific triangle or element in the default shape 210 and the adjusted default shape 220 is located on the image, the pixels A′, B′ in the image may be identified as the pixel location in the image corresponding to the element-defined coordinate of A′, B′ in the adjusted default shape 220. At node N0, a normalized difference between two pixels A′ and B′ in the image is calculated, and the normalized difference is compared with a first threshold associated with N0. The first threshold may be learned from training data. If the normalized difference is larger than the first learned threshold, the decision tree proceeds to node N1, and if the difference smaller than the first learned threshold, the decision tree proceeds to node N2. At a node N1, two pixels C′ and D′ close to predefined facial anchor points are similarly identified based on specified positions C and D for the node N1. That is, positions C and D may be specified by node N1 in respective barycentric coordinates with respect to an element of a mesh, and pixels C′ and D′ are determined by identifying the pixel in the image corresponding to the coordinates as applied to the location of element in the adjusted default shape 220.)
Claims 4-5 and 14-16 are further rejected under 35 U.S.C. 103 as being unpatentable over Zatpeyakin et al. (US PGPub US2018/0182165 A1, hereby referred to as “Zatpeyakin”), in view of Riemenschneider et al. (US PGPub US 2024/0242444 A1, hereby referred to as “Riemenschneider”), and further in view of Chen et al. (US PGPub 20230186439), hereby referred to as “Chen”.
Consider Claims 4-5.
The combination of Zatpeyakin and Riemenschneider teaches:
4. The method of claim 1, wherein analyzing the image comprises: generating a feature map from the image, wherein the feature map represents one or more characteristics of the image; and analyzing the feature map to generate respective outputs for the anchor shapes. (Zatepyakin: [0032] After comparing the normalized difference of pixels C′ and D′, node N1 proceeds to node N3 or N4 based on the threshold. If the normalized difference is smaller than the first learned threshold, at a leaf N4, an adjustment vector is generated. The adjustment vector is applied to the adjusted default shape 220 to generate a fitted shape 320. The fitted shape 320 has the same size as the cropped bounding box 168. [0033] Since the positions of a node are defined with respect to the elements of the adjusted default shape, the pixel coordinates are quickly identified using the vertices of the specified element in the adjusted default shape. The barycentrically-defined position can then be applied to the vertices to determine the pixel location within the image. This permits rapid traversal of the tree, as identification of desired pixels for the threshold comparison simply looks up the location of the desired pixel by the coordinates of the adjusted default shape, which is ‘overlaid’ on the image and mapped to the image coordinates. As such, this technique does not require a transformation matrix (e.g., describing scale and rotation modifications) or other complex formula to map pixel comparison locations for a node to the image. This reduces errors and computational cost caused by calculations of transformation matrix. A Prediction Model Based on a Cascade of Regression Trees [0034] FIG. 4 shows an example of a prediction model based on a cascade 400 of regression trees, in accordance with an embodiment. In some embodiments, a prediction model may be generated by a cascade of regression trees. A cascade of regression trees may have multiple levels and multiple regression trees for each level. Riemenschneider: [0069] Training anchor content 252 includes images, video, audio, and/or other content that can be combined with training sensor data 250 to produce AR environments (e.g., AR environment 292). Like anchor content 296, training anchor content 252 can be depicted and/or captured in training sensor data 250 (e.g., as a part of the corresponding physical spaces) and/or retrieved separately from training sensor data 250 (e.g., as digital files from a data store). [0070] Ground truth segmentations 248 include labels associated with training sensor data 250 and/or training anchor content 252. For example, ground truth segmentations 248 could include labels representing floors, walls, ceilings, light fixtures, furniture, decorations, doors, windows, and/or other objects that can be found in physical spaces represented by training sensor data 250. These labels could be assigned to regions of pixels, 3D points, meshes, sub-meshes, and/or other data elements within training sensor data 250. In another example, ground truth segmentations 248 could include labels representing foreground, background, textures, objects, shapes, structures, people, characters, faces, body parts, animals, plants, and/or other entities that are found in or represented by training anchor content 252. These labels could be assigned to regions of pixels, point clouds, meshes, sub-meshes, audio tracks or channels, and/or other elements or portions of training anchor content 252. 
Ground truth segmentations 248 could be available for all sets of training sensor data 250 and/or training anchor content 252 to allow for fully supervised training of one or more components of machine learning model 280, or ground truth segmentations 248 could be available for a subset of training sensor data 250 and/or training anchor content 252 to allow for semi-supervised and/or weakly supervised training of the component(s). [0071] Training 3D objects 254 include 3D representations of training anchor content 252. For example, training anchor content 252 could include images or videos that depict 2D renderings of 3D models or scenes, and training 3D objects 254 could include the 3D models or scenes. In other words, training 3D objects 254 can be used as “ground truth” 3D representations of objects in training anchor content 252.)
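As an aside: the claim-4 arrangement -- generate a feature map representing characteristics of the image, then analyze it to produce a respective output per anchor shape -- can be sketched with a toy single-channel feature map and a mean-response score. Everything below is hypothetical illustration, not the method of any cited reference.

import numpy as np

def feature_map(image, kernel):
    # Toy feature map: valid 2D cross-correlation with one kernel.
    h, w = kernel.shape
    H, W = image.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

def score_anchor(fmap, box):
    # Respective output for one anchor shape: mean response in its region.
    x0, y0, x1, y1 = box
    return float(fmap[y0:y1, x0:x1].mean())

rng = np.random.default_rng(3)
fmap = feature_map(rng.random((16, 16)), np.ones((3, 3)) / 9.0)  # 14 x 14 map
anchors = [(0, 0, 7, 7), (7, 7, 14, 14), (3, 3, 11, 11)]
scores = [score_anchor(fmap, b) for b in anchors]
print(scores, anchors[int(np.argmax(scores))])   # outputs and selected anchor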
To the extent that the combination of Zatpeyakin and Riemenschneider does not explicitly teach: 5. The method of claim 4, wherein: the feature map comprises multiple channels, wherein channels are associated with characteristics of the image, and the channels are analyzed to generate the output.
Chen teaches:
1. A method comprising: / 18. A non-transitory computer-readable storage medium having stored thereon computer executable instructions, which when executed by a computing device, cause the computing device to be operable for: / 20. An apparatus comprising: one or more computer processors; and a computer-readable storage medium comprising instructions for controlling the one or more computer processors to be operable for: (Chen: abstract; A lane detection method integratedly using image enhancement and a deep convolutional neural network. On the assumption that lanes have similar widths in a local region of an image and a lane can be segmented into several image blocks, each of which contains lane marking in the center, a method based on a deep convolutional neural network is provided to detect lane marking blocks in the image. Input to the model includes road images captured by a camera as well as a set of enhanced images generated by the contrast limited adaptive histogram equalization (CLAHE) algorithm. The method according to the present disclosure can effectively overcome difficulties of lane detection under complex imaging conditions, such as poor image quality, and small lane marking targets, so as to achieve better robustness. [0004]-[0019])
1. receiving an image; / 18. receiving an image; / 20. receiving an image; (Chen: [0004]-[0006], Step (1), acquiring a color image I contains lanes, including three component images I(0), I(1), and I(2) corresponding to red, green, and blue color components of I, respectively; performing the CLAHE algorithm to enhance the contrast of I and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I(c) as the input, where c is the remainder of k divided by 3.)
1. analyzing the image based on a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; / 18. analyzing the image based on a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; / 20. analyzing the image based on a plurality of anchor shapes to generate respective outputs for anchor shapes in the plurality of anchor shapes, wherein the output rates a cropping of the image using a respective anchor shape; (Chen: [0006]-[0007] Step (2), constructing the deep convolutional neural network, which consists of an input module, a spatial attention module, a feature extraction module, and a detection module, for lane detection, and stacking the three component images of the color image as well as the K enhanced images generated by the CLAHE algorithm in step (1) as a tensor including K + 3 channels to serve as the input to the deep convolutional neural network. [0027] Step (1), I is set as a to-be-processed color image, including three component images I(0), I(1), and I(2), corresponding to red, green, and blue, respectively, and the CLAHE is performed K times on I to enhance the contrast of an input image and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I(c) as the input. In one embodiment of the present disclosure, K = 6, and c is equal to the remainder of k divided by 3. Steps of the algorithm are as follows. First, an image I(c) is processed by using a sliding window. The height and the width of the sliding window are Mb + kΔ and Nb + kΔ, respectively, where Mb, Nb, and Δ are preset constants, which may be Mb = 18, Nb = 24, and Δ = 4. Second, the histogram of a block image covered by the sliding window is calculated and denoted as H; and if any histogram bin Hi exceeds a specified limit h, it is clipped as Hi = h, and amplitude differences are accumulated according to the following formula:
[Equation reproduced in the reference as image media_image1.png (42 × 220, greyscale); not reproduced here]
; [0028]-[0031] Step (3), the deep convolutional neural network for lane detection includes an input module, a spatial attention module, a feature extraction module, and a detection module. According to the data flow of the input module during forward propagation, input data first passes through a convolutional layer with 64 7 × 7 kernels and a stride of 2, and then a batch normalization operation and a ReLU activation operation are performed. The final part of the input module is a max pooling layer with a 3 × 3 sampling kernel and with a stride of 2. [0032] Step (4), output x of the input module is an M1 × N1 × C feature map, where M1 and N1 denote the height and the width, respectively, and C denotes the number of channels of the feature map)
1. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; / 18. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; / 20. analyzing respective outputs for the anchor shapes in the plurality of anchor shapes to select an anchor shape; (Chen: [0033] Step (5), elements in the spatial attention map are taken as weights. Values of all positions of each channel of the output feature map x of the input module are multiplied by weights of corresponding positions of the spatial attention map to form a feature map, and then is fed to the feature extraction module in the embodiment of the present disclosure. [0034] Step (6), Stage 2, Stage 3, and Stage 4 convolutional layer groups of ResNet50 are taken as the feature extraction module, and the output of Stage 3 serves as the input to Stage 4 as well as the input to a convolutional layer consists of 5nB kernels of size 1 × 1 and with a stride of 1, where nB denotes a preset number of detection boxes for each anchor point, and the convolutional layer finally outputs a feature map denoted by F1. Output of Stage 4 passes through a convolutional layer consists of 5nB kernels of size 1 × 1 and with a stride of 1, and the generated feature map is up-sampled and then sums corresponding elements one by one with F1 to generate an M2 × N2 × 5nB feature map F.)
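Illustrative assumption only: Chen's detection head outputs an M2 × N2 × 5nB feature map F ([0034]). One common reading -- not stated in the quoted text -- is nB boxes per anchor point, each carrying four geometry values plus one score; selecting a box by highest score would then look like this. The channel layout and all names are assumptions.

import numpy as np

M2, N2, NB = 8, 8, 3                     # hypothetical grid and boxes per anchor
F = np.random.default_rng(4).random((M2, N2, 5 * NB))

# Assumed layout per anchor point: NB blocks of (dx, dy, dw, dh, score).
boxes = F.reshape(M2, N2, NB, 5)
scores = boxes[..., 4]

i, j, b = np.unravel_index(np.argmax(scores), scores.shape)
print((i, j, b), boxes[i, j, b, :4], scores[i, j, b])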
5. The method of claim 4, wherein: the feature map comprises multiple channels, wherein channels are associated with characteristics of the image, and the channels are analyzed to generate the respective output. (Chen: [0006]-[0007] Step (2), constructing the deep convolutional neural network, which consists of an input module, a spatial attention module, a feature extraction module, and a detection module, for lane detection, and stacking the three component images of the color image as well as the K enhanced images generated by the CLAHE algorithm in step (1) as a tensor including K + 3 channels to serve as the input to the deep convolutional neural network. [0027] Step (1), I is set as a to-be-processed color image, including three component images I(0), I(1), and I(2), corresponding to red, green, and blue, respectively, and the CLAHE is performed K times on I to enhance the contrast of an input image and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I(c) as the input. In one embodiment of the present disclosure, K = 6, and c is equal to the remainder of k divided by 3. Steps of the algorithm are as follows. First, an image I(c) is processed by using a sliding window. The height and the width of the sliding window are Mb + kΔ and Nb + kΔ, respectively, where Mb, Nb, and Δ are preset constants, which may be Mb = 18, Nb = 24, and Δ = 4. Second, the histogram of a block image covered by the sliding window is calculated and denoted as H; and if any histogram bin Hi exceeds a specified limit h, it is clipped as Hi = h, and amplitude differences are accumulated according to the following formula:
[Equation reproduced in the reference as image media_image1.png (42 × 220, greyscale); not reproduced here]
; [0028]-[0031] Step (3), the deep convolutional neural network for lane detection includes an input module, a spatial attention module, a feature extraction module, and a detection module. According to the data flow of the input module during forward propagation, input data first passes through a convolutional layer with 64 7 × 7 kernels and a stride of 2, and then a batch normalization operation and a ReLU activation operation are performed. The final part of the input module is a max pooling layer with a 3 × 3 sampling kernel and with a stride of 2. [0032] Step (4), output x of the input module is an M1 × N1 × C feature map, where M1 and N1 denote the height and the width, respectively, and C denotes the number of channels of the feature map)
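For illustration only: Chen's step (1)-(2) pipeline -- K CLAHE-enhanced images, the k-th enhanced from channel c = k mod 3, stacked with the three color channels into a K + 3 channel input tensor -- can be sketched with OpenCV. OpenCV's fixed tile grid stands in for the reference's growing sliding window, so this is an approximation, and the parameter values are hypothetical.

import cv2
import numpy as np

def clahe_stack(bgr_image, K=6, clip_limit=2.0):
    # Split into the three component images, then append K CLAHE-enhanced
    # images, the k-th built from channel c = k mod 3, per Chen step (1).
    channels = list(cv2.split(bgr_image))
    for k in range(K):
        c = k % 3
        clahe = cv2.createCLAHE(clipLimit=clip_limit,
                                tileGridSize=(8 + k, 8 + k))
        channels.append(clahe.apply(channels[c]))
    # Stack into the K + 3 channel input tensor of Chen step (2).
    return np.stack(channels, axis=-1)

img = np.random.default_rng(5).integers(0, 256, (64, 64, 3), dtype=np.uint8)
print(clahe_stack(img).shape)            # -> (64, 64, 9)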
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the machine learning algorithm of the combination of Zatpeyakin and Riemenschneider for extraction of facial features using a set of adjustable mesh shapes and anchor content with the teachings of Chen for machine learning and feature extraction using image enhancement and contrast limited adaptive histogram equalization. The determination of obviousness is predicated upon the following findings: one skilled in the art would have been motivated to modify the combination of Zatpeyakin and Riemenschneider in order to improve the overall feature detection and machine learning algorithms by leveraging a contrast limited adaptive histogram equalization algorithm to ensure enhanced accuracy and robustness. Furthermore, the prior art collectively includes each claimed element (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface, and/or programming techniques, without changing a “fundamental” operating principle of the combination of Zatpeyakin and Riemenschneider, while the teaching of Chen continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of enhancing overall computational efficiency and accuracy. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claims in question.
Consider Claims 14-15.
The combination of Zatpeyakin and Riemenschneider teaches:
14. The method of claim 1, further comprising: training a model for the plurality of anchor shapes using a training image, wherein parameters of the model are adjusted using a comparison of a first output for an anchor shape to a second output that is based on a labeled shape for the training image. (Zatepyakin: [0027] FIG. 1B shows examples of a captured image 160 and identification of a facial shape for the image, in accordance with an embodiment. FIG. 1B includes a bounding box 162 having an identified face 164, a cropped bounding box 166, a default shape 168 and a fitted shape 170 of the system environment illustrated in FIG. 1A. As shown in FIG. 1B, the default shape 168 has predefined facial anchor points around eyes, noses, mouth, and jaw lines. The default shape 168 is centered and scaled according to the cropped bounding box 166. The default shape does not account for the actual position and alignment of the face in the image. By applying the prediction model as described below, the fitted shape 170 is identified that has better positions of the facial anchor points aligned with the identified face in the cropped bounding box 166 than the adjusted default shape 172. Barycentric Mesh-Based Shape [0028] In one embodiment, the face alignment module 146 uses a barycentric mesh-based shape for prediction. The barycentric mesh-based shape uses a barycentric coordinates system. The barycentric coordinate system is a coordinate system in which a position of a point within an element (e.g., a triangle, or tetrahedron) is represented by a linear combination of its vertices. For example, when the element is a triangle, points inside the triangle can be represented by a linear combination of three vertices of the triangle. The mesh-based shape may consist of multiple triangles covering all the predefined facial anchor points. Each facial anchor point can be represented by a linear combination of vertices in an associated triangle. [0029] FIG. 2 shows an example of barycentric mesh-based shapes, in accordance with an embodiment. As shown in FIG. 2, a barycentric mesh-based default shape 210 has multiple triangles. The triangles cover all the predefined facial anchor points as shown in dash lines. The barycentric mesh-based default shape 210 may be adjusted according to the cropped bounding box 166. The adjusted barycentric mesh-based default shape 220 may determine updated positions of predefined facial anchor points 230 using vertices of the associated triangles to correspond the predefined facial anchor points to the default shape applied to the cropped bounding box 166. When applying the prediction model, a barycentric mesh-based fitted shape 240 is generated to adjust the mesh to the face within the image and include updated triangles. Then, the barycentric mesh-based fitted shape 240 may determine updated positions of predefined facial anchor points 250 using vertices of associated update triangles. Riemenschneider: [0069]-[0071], [0069] Training anchor content 252 includes images, video, audio, and/or other content that can be combined with training sensor data 250 to produce AR environments (e.g., AR environment 292). Like anchor content 296, training anchor content 252 can be depicted and/or captured in training sensor data 250 (e.g., as a part of the corresponding physical spaces) and/or retrieved separately from training sensor data 250 (e.g., as digital files from a data store).)
15. The method of claim 14, wherein training comprises: receiving the training image and the labeled shape; (Zatpeyakin: [0020] The application 112 on the client device may perform facial alignment of a face within the captured image. To determine the facial image, the application 112 applies a trained model to analyze a face in the image to extract a set of facial anchor points on the face. The application 112 may receive the trained model from the face tracking system 140 and after applying the model, use the extracted set of facial anchor points to interpret or augment the image. The application 112 may determine facial anchor points as described below with respect to modules of the face tracking system 140. After identifying the facial anchor points, the application 112 may use the anchor points to track and characterize the face, for example to look for further features of the face between anchor points, or to display an overlay or mask over the user's face. The anchor points may also be captured over time to identify how a user's face moves during a video capture, which may for example be used to populate animated expressions using the anchor points, among other uses. The application 112 may also send the set of facial anchor points to another client device or the face tracking system 140 for similar uses. In a further example, the application 112 may provide video chat services for users of the client device, permitting users to capture and send video to another user. By capturing the anchor points of a face during the video, the video can be augmented using the anchor points, e.g., to add a mask to a user's face, or by sending the anchor points for each frame of the video to another client device. In some embodiments, the anchor points may be determined for an initial frame of the video, and subsequent frames may use alternate face tracking techniques to monitor the movement of the face after the anchor points have been determined. [0021]-[0027]; Riemenschneider: [0069]-[0071], [0070] Ground truth segmentations 248 include labels associated with training sensor data 250 and/or training anchor content 252. For example, ground truth segmentations 248 could include labels representing floors, walls, ceilings, light fixtures, furniture, decorations, doors, windows, and/or other objects that can be found in physical spaces represented by training sensor data 250. These labels could be assigned to regions of pixels, 3D points, meshes, sub-meshes, and/or other data elements within training sensor data 250. In another example, ground truth segmentations 248 could include labels representing foreground, background, textures, objects, shapes, structures, people, characters, faces, body parts, animals, plants, and/or other entities that are found in or represented by training anchor content 252. These labels could be assigned to regions of pixels, point clouds, meshes, sub-meshes, audio tracks or channels, and/or other elements or portions of training anchor content 252. Ground truth segmentations 248 could be available for all sets of training sensor data 250 and/or training anchor content 252 to allow for fully supervised training of one or more components of machine learning model 280, or ground truth segmentations 248 could be available for a subset of training sensor data 250 and/or training anchor content 252 to allow for semi-supervised and/or weakly supervised training of the component(s).)
15. generating the first output using the model for the anchor shape; determining the second output; (Zatpeyakin: [0021]-[0026], [0027] FIG. 1B shows examples of a captured image 160 and identification of a facial shape for the image, in accordance with an embodiment. FIG. 1B includes a bounding box 162 having an identified face 164, a cropped bounding box 166, a default shape 168 and a fitted shape 170 of the system environment illustrated in FIG. 1A. As shown in FIG. 1B, the default shape 168 has predefined facial anchor points around eyes, noses, mouth, and jaw lines. The default shape 168 is centered and scaled according to the cropped bounding box 166. The default shape does not account for the actual position and alignment of the face in the image. By applying the prediction model as described below, the fitted shape 170 is identified that has better positions of the facial anchor points aligned with the identified face in the cropped bounding box 166 than the adjusted default shape 172.)
15. and comparing the first output and the second output, wherein a difference between the first output and the second output is used to adjust the parameters of the model. (Zatpeyakin: [0028] In one embodiment, the face alignment module 146 uses a barycentric mesh-based shape for prediction. The barycentric mesh-based shape uses a barycentric coordinates system. The barycentric coordinate system is a coordinate system in which a position of a point within an element (e.g., a triangle, or tetrahedron) is represented by a linear combination of its vertices. For example, when the element is a triangle, points inside the triangle can be represented by a linear combination of three vertices of the triangle. The mesh-based shape may consist of multiple triangles covering all the predefined facial anchor points. Each facial anchor point can be represented by a linear combination of vertices in an associated triangle. [0029] FIG. 2 shows an example of barycentric mesh-based shapes, in accordance with an embodiment. As shown in FIG. 2, a barycentric mesh-based default shape 210 has multiple triangles. The triangles cover all the predefined facial anchor points as shown in dash lines. The barycentric mesh-based default shape 210 may be adjusted according to the cropped bounding box 166. The adjusted barycentric mesh-based default shape 220 may determine updated positions of predefined facial anchor points 230 using vertices of the associated triangles to correspond the predefined facial anchor points to the default shape applied to the cropped bounding box 166. When applying the prediction model, a barycentric mesh-based fitted shape 240 is generated to adjust the mesh to the face within the image and include updated triangles. Then, the barycentric mesh-based fitted shape 240 may determine updated positions of predefined facial anchor points 250 using vertices of associated update triangles. Riemenschneider: [0069]-[0070], [0071] Training 3D objects 254 include 3D representations of training anchor content 252. For example, training anchor content 252 could include images or videos that depict 2D renderings of 3D models or scenes, and training 3D objects 254 could include the 3D models or scenes. In other words, training 3D objects 254 can be used as “ground truth” 3D representations of objects in training anchor content 252.)
The combination of Zatpeyakin and Riemenschneider does not teach: 15. “wherein the second output is based on an overlap of the labeled shape and the anchor shape;”
However, Chen teaches:
15. The method of claim 14, wherein training comprises: receiving the training image and the labeled shape; (Chen: [0004]-[0006], Step (1), acquiring a color image I contains lanes, including three component images I(0), I(1), and I(2) corresponding to red, green, and blue color components of I, respectively; performing the CLAHE algorithm to enhance the contrast of I and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I(c) as the input, where c is the remainder of k divided by 3.)
15. generating the first output using the model for the anchor shape; (Chen: [0006]-[0007] Step (2), constructing the deep convolutional neural network, which consists of an input module, a spatial attention module, a feature extraction module, and a detection module, for lane detection, and stacking the three component images of the color image as well as the K enhanced images generated by the CLAHE algorithm in step (1) as a tensor including K + 3 channels to serve as the input to the deep convolutional neural network. [0027] Step (1), I is set as a to-be-processed color image, including three component images I(0), I(1), and I(2), corresponding to red, green, and blue, respectively, and the CLAHE is performed K times on I to enhance the contrast of an input image and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I(c) as the input. In one embodiment of the present disclosure, K = 6, and c is equal to the remainder of k divided by 3. Steps of the algorithm are as follows. First, an image I(c) is processed by using a sliding window. The height and the width of the sliding window are Mb + kΔ and Nb + kΔ, respectively, where Mb, Nb, and Δ are preset constants, which may be Mb = 18, Nb = 24, and Δ = 4. Second, the histogram of a block image covered by the sliding window is calculated and denoted as H; and if any histogram bin Hi exceeds a specified limit h, it is clipped as Hi = h, and amplitude differences are accumulated according to the following formula:
[Equation reproduced in the reference as image media_image1.png (42 × 220, greyscale); not reproduced here]
; [0028]-[0031])
15. determining the second output, wherein the second output is based on an overlap of the labeled shape and the anchor shape; (Chen: [0027] Step (1), I is set as a to-be-processed color image, including three component images I(0), I(1), and I(2), corresponding to red, green, and blue, respectively, and the CLAHE is performed K times on I to enhance the contrast of an input image and generate K enhanced images, where the kth enhanced image, k = 0, 1, ..., K - 1, is formed by using the cth channel image I(c) as the input. In one embodiment of the present disclosure, K = 6, and c is equal to the remainder of k divided by 3. Steps of the algorithm are as follows. First, an image I(c) is processed by using a sliding window. The height and the width of the sliding window are Mb + kΔ and Nb + kΔ, respectively, where Mb, Nb, and Δ are preset constants, which may be Mb = 18, Nb = 24, and Δ = 4. Second, the histogram of a block image covered by the sliding window is calculated and denoted as H; and if any histogram bin Hi exceeds a specified limit h, it is clipped as Hi = h, and amplitude differences are accumulated according to the following formula:
[Equation reproduced in the reference as image media_image1.png (42 × 220, greyscale); not reproduced here]
; [0028]-[0031] Step (3), the deep convolutional neural network for lane detection includes an input module, a spatial attention module, a feature extraction module, and a detection module. According to the data flow of the input module during forward propagation, input data first passes through a convolutional layer with 64 7 × 7 kernels and a stride of 2, and then a batch normalization operation and a ReLU activation operation are performed. The final part of the input module is a max pooling layer with a 3 × 3 sampling kernel and with a stride of 2. [0032] Step (4), output x of the input module is an M1 × N1 × C feature map, where M1 and N1 denote the height and the width, respectively, and C denotes the number of channels of the feature map.)
15. and comparing the first output and the second output, wherein a difference between the first output and the second output is used to adjust the parameters of the model. (Chen: [0033] Step (5), elements in the spatial attention map are taken as weights. Values of all positions of each channel of the output feature map x of the input module are multiplied by weights of corresponding positions of the spatial attention map to form a feature map, and then is fed to the feature extraction module in the embodiment of the present disclosure. [0036] Step (8), output of the detection module is a set of detected marking blocks, and a lane model is determined by the Hough transform algorithm using center coordinates of all the blocks in the set as inputs. Specifically, the center coordinates of a detected marking block are (u, v), and a lane is written as a straight line expressed in the polar coordinate system: ρ = u cos θ + v sin θ.)
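As a final aside: a "second output ... based on an overlap of the labeled shape and the anchor shape" (claim 15) is, for rectangular shapes, commonly computed as intersection-over-union; a minimal version follows. Nothing in the quoted passages fixes this exact formula, so it is offered only as an illustration.

def iou(box_a, box_b):
    # Intersection-over-union of two (x0, y0, x1, y1) boxes: one way to
    # score the overlap of a labeled shape and an anchor shape.
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # -> 25 / 175 = 0.1428...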
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the machine learning algorithm of the combination of Zatpeyakin and Riemenschneider for extraction of facial features using a set of adjustable mesh shapes and anchor content with the teachings of Chen for machine learning and feature extraction using image enhancement and contrast limited adaptive histogram equalization. The determination of obviousness is predicated upon the following findings: one skilled in the art would have been motivated to modify the combination of Zatpeyakin and Riemenschneider in order to improve the overall feature detection and machine learning algorithms by leveraging a contrast limited adaptive histogram equalization algorithm to ensure enhanced accuracy and robustness. Furthermore, the prior art collectively includes each claimed element (though not all in the same reference), and one of ordinary skill in the art could have combined the elements in the manner explained above using known engineering design, interface, and/or programming techniques, without changing a “fundamental” operating principle of the combination of Zatpeyakin and Riemenschneider, while the teaching of Chen continues to perform the same function as originally taught prior to being combined, in order to produce the repeatable and predictable result of enhancing overall computational efficiency and accuracy. It is for at least the aforementioned reasons that the examiner has reached a conclusion of obviousness with respect to the claims in question.
Consider Claim 16.
The combination of Zatpeyakin, Riemenschneider and Chen teaches:
16. The method of claim 15, wherein training comprises: generating first outputs using the model for anchor shapes in the plurality of anchor shapes; determining second outputs, wherein the second outputs are based on an overlap of the labeled shape and the respective anchor shapes; and comparing the respective first outputs and the respective second outputs, wherein a difference between the respective first outputs and the respective second outputs is used to adjust the parameters of the model for the anchor shapes. (Chen: [0033] Step (5), elements in the spatial attention map are taken as weights. Values of all positions of each channel of the output feature map x of the input module are multiplied by weights of corresponding positions of the spatial attention map to form a feature map, and then is fed to the feature extraction module in the embodiment of the present disclosure. [0034] Step (6), Stage 2, Stage 3, and Stage 4 convolutional layer groups of ResNet50 are taken as the feature extraction module, and the output of Stage 3 serves as the input to Stage 4 as well as the input to a convolutional layer consists of 5nB kernels of size 1 × 1 and with a stride of 1, where nB denotes a preset number of detection boxes for each anchor point, and the convolutional layer finally outputs a feature map denoted by F1. Output of Stage 4 passes through a convolutional layer consists of 5nB kernels of size 1 × 1 and with a stride of 1, and the generated feature map is up-sampled and then sums corresponding elements one by one with F1 to generate an M2 × N2 × 5nB feature map F. Zatpeyakin: [0027] FIG. 1B shows examples of a captured image 160 and identification of a facial shape for the image, in accordance with an embodiment. FIG. 1B includes a bounding box 162 having an identified face 164, a cropped bounding box 166, a default shape 168 and a fitted shape 170 of the system environment illustrated in FIG. 1A. As shown in FIG. 1B, the default shape 168 has predefined facial anchor points around eyes, noses, mouth, and jaw lines. The default shape 168 is centered and scaled according to the cropped bounding box 166. The default shape does not account for the actual position and alignment of the face in the image. By applying the prediction model as described below, the fitted shape 170 is identified that has better positions of the facial anchor points aligned with the identified face in the cropped bounding box 166 than the adjusted default shape 172. [0028] In one embodiment, the face alignment module 146 uses a barycentric mesh-based shape for prediction. The barycentric mesh-based shape uses a barycentric coordinates system. The barycentric coordinate system is a coordinate system in which a position of a point within an element (e.g., a triangle, or tetrahedron) is represented by a linear combination of its vertices. For example, when the element is a triangle, points inside the triangle can be represented by a linear combination of three vertices of the triangle. The mesh-based shape may consist of multiple triangles covering all the predefined facial anchor points. Each facial anchor point can be represented by a linear combination of vertices in an associated triangle. [0029] FIG. 2 shows an example of barycentric mesh-based shapes, in accordance with an embodiment. As shown in FIG. 2, a barycentric mesh-based default shape 210 has multiple triangles. The triangles cover all the predefined facial anchor points as shown in dash lines.
The barycentric mesh-based default shape 210 may be adjusted according to the cropped bounding box 166. The adjusted barycentric mesh-based default shape 220 may determine updated positions of predefined facial anchor points 230 using vertices of the associated triangles to correspond the predefined facial anchor points to the default shape applied to the cropped bounding box 166. When applying the prediction model, a barycentric mesh-based fitted shape 240 is generated to adjust the mesh to the face within the image and include updated triangles. Then, the barycentric mesh-based fitted shape 240 may determine updated positions of predefined facial anchor points 250 using vertices of associated update triangles. )
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TAHMINA N. ANSARI, whose telephone number is (571) 270-3379. The examiner can normally be reached Monday through Friday, 9:00 a.m. to 5:00 p.m. (IFP Flex).
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, O'NEAL MISTRY, can be reached at 313-446-4912. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
TAHMINA N. ANSARI
Examiner
Art Unit 2672
March 16, 2026
/TAHMINA N ANSARI/Primary Examiner, Art Unit 2674