Prosecution Insights
Last updated: April 18, 2026
Application No. 18/574,628

GENERATING A REAL-TIME VIDEO STREAM OF A USER FACE BASED ON OBLIQUE REAL-TIME 3D SENSING

Status: Final Rejection (§103)
Filed: Dec 27, 2023
Examiner: STATZ, BENJAMIN TOM
Art Unit: 2611
Tech Center: 2600 — Communications
Assignee: Endless Technologies Ltd.
OA Round: 2 (Final)
Grant Probability: 0% (At Risk)
Expected OA Rounds: 3-4
Time to Grant: 2y 9m
Grant Probability With Interview: 0%

Examiner Intelligence

Career Allow Rate: 0% (0 granted / 2 resolved; -62.0% vs TC avg)
Interview Lift: +0.0% (minimal lift; based on resolved cases with interview)
Avg Prosecution: 2y 9m (typical timeline)
Total Applications: 35 across all art units (33 currently pending)

Statute-Specific Performance

§101: 1.9% (-38.1% vs TC avg)
§103: 65.2% (+25.2% vs TC avg)
§102: 10.8% (-29.2% vs TC avg)
§112: 13.3% (-26.7% vs TC avg)
Percentages are compared against Tech Center average estimates • Based on career data from 2 resolved cases

Office Action (§103)

DETAILED ACTION

This office action is responsive to applicant's communication filed 01/08/2026.

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority

Applicant claims the benefit of US Provisional Application No. 63/215,469, filed 06/27/2021. Claims 1-15 have been afforded the benefit of this filing date.

Response to Arguments

Applicant's arguments, see pg. 8, filed 01/08/2026, with respect to the objection to the abstract of the disclosure have been fully considered and are persuasive. The objection to the abstract of the disclosure has been withdrawn. Applicant's arguments, see pg. 8, filed 01/08/2026, with respect to the objection to claims 4, 11, and 15 have been fully considered and are persuasive. The objection to claims 4, 11, and 15 has been withdrawn.

Applicant's arguments, filed 01/08/2026, with respect to the rejection of claims 1, 2, 8, 9, 12, and 13 have been fully considered but they are not persuasive. Applicant presents five distinct arguments contesting the rejection of these claims.

Firstly, regarding claims 1, 9, and 13, applicant argues that Cullen does not teach "the 3D image being captured from the first angle in real-time" or "scan the object by the 3D image sensor in real-time" as stated in claim 1, or "obtaining the streaming 3D measurement in real time" as stated in claims 9 and 13, because Cullen does not teach 3D scanning in real time. However, Cullen teaches the following: [0028] "the first device 10a supports real-time visual communication with the second device 10b"; [0043] "image data may comprise depth maps of the first device user periodically captured by the 3D camera system 26"; [0044] "The hybrid visual communicator 24 uses the image data to determine corresponding 3D mesh model updates"; [0048] "The 3D mesh model updates may be transmitted to the second device for the second device to update display of the 3D mesh model of the first device… On the second device, the received data is turned into video by animating the data frame-to frame for display."; [0051] "the hybrid visual communicator 24′ uses the 3D model updates 25 to animate, render or modify playback of the 3D mesh model displayed on the second device to express the perceived emotional state and/or the body position of the user in real-time". The 3D image data captured by the sensor is used to generate 3D model updates for each frame, which are used to generate real-time animation/video. If the rate at which the 3D image is captured is sufficient to generate real-time, frame-by-frame data, then it can be inferred that the image itself is being captured in real time. More precisely, Cullen suggests two possible types of data transmission: [0051] "If the 3D model updates 25 comprise changes to vertices, then the hybrid visual communicator 24′ uses the 3D model updates 25 to update the vertices of the 3D mesh model. If the 3D model updates 25 comprise blend shape coefficients, then the hybrid visual communicator 24′ uses the blend shape coefficients to select blend shapes or key poses from the emotional state database 29′ and then interpolates between a neutral expression of the original 3D mesh model and a selected key pose, or between a previous key pose and the selected key pose." The first mode suggests that the 3D model updates are 1:1 with the frame updates, while the second suggests that one model update may correspond to multiple frames, with interpolated data in between. The first mode directly corresponds with the claim language, but even in the second mode, where the capture rate of the 3D image data is lower than the frame rate of the generated video, the 3D image capture would still be in real time: it is not prerecorded, and it directly drives a real-time output; it would simply be real-time with a lower frame rate.
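To make the second mode concrete, the sketch below interpolates mesh vertices between a neutral expression and a selected key pose, so a low-rate stream of model updates can still drive full-frame-rate output. This is a minimal illustration of blend-shape interpolation in general, not Cullen's actual implementation; the vertex count, pose data, and rates are invented for the example.

```python
import numpy as np

def interpolate_pose(neutral, key_pose, t):
    """Linearly interpolate mesh vertices between a neutral expression
    and a selected key pose; t in [0, 1] is the blend weight."""
    return (1.0 - t) * neutral + t * key_pose

# Hypothetical face mesh: 468 vertices x 3 coordinates.
rng = np.random.default_rng(0)
neutral = np.zeros((468, 3))                       # neutral expression
key_pose = rng.normal(scale=0.01, size=(468, 3))   # selected key pose

# One low-rate 3D model update can drive several rendered frames:
# the in-between frames are interpolated, so the capture remains
# real-time even when it runs slower than the output video.
frames = [interpolate_pose(neutral, key_pose, t)
          for t in np.linspace(0.0, 1.0, 4)]
```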
Secondly, regarding claim 1, applicant argues that Cullen does not teach the limitations "create, in real-time, a 2D real-time image of the object, based on the 3D model and the 3D image being captured from the first angle in real-time; and communicate the 2D real-time image using the transceiver", because the invention of Cullen first communicates the 3D model to the second device, then creates the 2D real-time image on the second device. The invention of Cullen as a whole does teach the first part: "create, in real-time, a 2D real-time image of the object, based on the 3D model and the 3D image being captured from the first angle in real-time." The 3D image capture and 2D video creation are performed on different physical devices, but this aspect of Cullen is not relevant to the set of limitations which Cullen is relied upon to teach. Specifically, Cullen is not relied upon to teach "communicate the 2D real-time image using the transceiver". Sommerlade teaches this limitation, as cited in the original office action; it describes transmitting a finalized 2D image, rather than intermediary 3D model update data.

Thirdly, applicant argues that the references of Giger and Sommerlade are not relevant to the claimed invention because they use 2D video cameras as input as opposed to 3D sensors. The test for obviousness is not whether the features of a secondary reference may be bodily incorporated into the structure of the primary reference; nor is it that the claimed invention must be expressly suggested in any one or all of the references. Rather, the test is what the combined teachings of the references would have suggested to those of ordinary skill in the art. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981). Giger and Sommerlade are each analogous to the claimed invention because they are in the same field of endeavor (adjusting the perceived angle of a captured facial image); the task they perform is still roughly the same as that of both the claimed invention and Cullen, regardless of the precise type of input they use. (In fact, Giger directly compares the results of their invention with the results of Kuster's, which does use a 3D sensor input; see Giger fig. 4.) Therefore, it is accepted that one of ordinary skill in the art may have considered aspects of these inventions, regardless of whether the entire invention was physically compatible with the invention of Cullen.

Fourthly, regarding claim 2, applicant argues that the cited references do not teach creating a 2D real-time image that is computed for the "second angle". Applicant suggests that gaze correction only involves changing eye contact, not adjusting the perceived angle of the face as a whole. Giger, which is relied upon to teach the limitations of claim 2, does teach rotating the entire face towards the viewer, not just the eyes (pg. 3 section 3.3 "Seam optimization": "To correct the gaze direction, the textured deformed head mesh now needs to be rotated to achieve the feeling of eye contact"; see fig. 4 for visual examples). In this case, the "second angle" is the (virtual) camera pointing straight toward the user's face (as explained in claim 1), which is taught by Giger. It is important to note that the exact language of claim 1 which defines the second angle is "the 2D image taken by the external camera from a second angle with respect to the object". Although the camera is fixed in place, the user moves "the object" (their face) while the camera stays still, meaning that the angle of the camera relative to the user's face changes. For these claims, the camera points directly toward the user for angle 2, and views them from another angle for angle 1.

Fifthly, regarding claims 8 and 12, applicant argues that Kuster does not teach a "third angle being different from the second angle with respect to the object" and "the second angle being different from the first angle with respect to the object" because Kuster teaches a single fixed camera position. The claim language requires that the first and third camera angles each be different from the second angle with respect to the object, but they do not need to be different from each other. In the case of Kuster, the first and third angles are the same. The streaming 2D image of the object is generated at the same angle at which the initial 2D image and 3D measurement are taken (directly facing the person), which is different from the angle of the streaming 3D measurement (looking up at the person). Next, the claim language explicitly states that each of these angles is determined "with respect to the object", which in the case of Kuster is a human face. Applicant argues that the camera angle stays the same because the camera itself is fixed in place. However, as previously discussed, the user moves "the object" (their face) while the camera stays still, meaning that the angle of the camera relative to the user's face changes. For these claims, the camera points directly toward the user for angles 1 and 3, and looks up at the user for angle 2. Thus, Kuster does teach the claimed camera angles.

Therefore, the rejection of claims 1, 2, 8, 9, 12, and 13, as well as their dependent claims, is maintained under 35 U.S.C. 103.
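The angle reasoning in the fourth and fifth arguments above is plain geometry: the claimed angles are measured in the object's frame, so a fixed camera presents a different angle whenever the face rotates. A worked toy example of that point follows; the axes, yaw convention, and values are assumptions chosen purely for illustration.

```python
import numpy as np

def camera_angle_wrt_face(cam_dir_world, face_yaw_deg):
    """Angle between a fixed world-frame camera axis and the face's
    forward axis after the face rotates by face_yaw_deg about +Y."""
    th = np.radians(face_yaw_deg)
    # Rotation of the face about the vertical axis.
    R = np.array([[np.cos(th), 0.0, np.sin(th)],
                  [0.0,        1.0, 0.0       ],
                  [-np.sin(th), 0.0, np.cos(th)]])
    face_forward = R @ np.array([0.0, 0.0, 1.0])
    cos_a = np.clip(cam_dir_world @ face_forward, -1.0, 1.0)
    return np.degrees(np.arccos(cos_a))

cam = np.array([0.0, 0.0, 1.0])        # camera fixed, looking along +Z
print(camera_angle_wrt_face(cam, 0))   # 0 deg: user faces the camera
print(camera_angle_wrt_face(cam, 30))  # 30 deg: same fixed camera, new angle w.r.t. the face
```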
Claim Objections

Claims 11 and 15 are objected to because of the following informalities: Claim 11 recites the limitation "using a smartphone camera, a handheld camera, and a wrist-mounted camera to obtain the 2D image of the object". Claim 15 recites a similar limitation. This is being treated as a typo where "or" was intended instead of "and" (see "Claim Interpretation" section). Appropriate correction is required.

Claim Interpretation

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification. The following limitations in the claims have been given the following interpretations in light of the specification:

Claim 11: "using a smartphone camera, a handheld camera, and a wrist-mounted camera to obtain the 2D image of the object". Claim 15: "using at least one of a smartphone camera, a handheld camera, and a wrist-mounted camera to obtain the 2D image of the object". Figs. 5 and 6 show the step "Receive from camera 22 a frontal 2D image of the object". Page 10 of the specification states: "Fig. 2 also shows a camera 22, that may be as a hand-held camera, for example a camera of a smartphone 23 or any similar computational device (e.g., a tablet with a camera, a laptop with a camera, etc.). Particularly, a selfie camera of smartphone 23." Page 16 of the specification states: "The secondary camera can be imaging unit 12 (e.g., forward-looking, landscape, camera, etc.), imaging unit 21 (e.g., backward-looking, background, camera, etc.), camera 22 (e.g., hand-held, wrist-mounted, smartphone, camera, etc.), or any other camera. The image received from the secondary camera is referred to as secondary image 63." The specification refers to a single camera used to obtain the 2D image. Additionally, it does not make logical sense to use three separate cameras to collectively take a single 2D picture as described in claims 8 and 12. Thus, the limitations of claims 11 and 15 are interpreted as: "…a smartphone camera, a handheld camera, or a wrist-mounted camera…". Should applicant intend different definitions, applicant should point to the portions of the specification that clearly support them.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claim(s) 1, 2, and 4 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cullen (US 20160353056 A1) in view of Giger et al. ("Gaze correction with a single webcam," 2014 IEEE International Conference on Multimedia and Expo (ICME), Chengdu, China, 2014, pp. 1-6, https://doi.org/10.1109/ICME.2014.6890306, hereinafter "Giger") and Sommerlade et al. (US 20210097644 A1, hereinafter "Sommerlade").

Regarding claim 1, Cullen teaches an imaging system comprising: a 3D image sensor (fig. 1 3D camera system 26, 3D sensor 32; [0027] "In one embodiment, the sensor arrays 25 and 25′ may include any number of sensors (1-n). Example types of sensors may include, but are not limited to, image sensors such as a 3D camera system 26 or a 2D camera system 28…") mounted in a first angle with respect to an object to be imaged ([0002] "Most video communication systems for use by an individual user require a player application running on a computer device that includes a camera and a display. Examples of the computer device may include a desktop or laptop computer having a camera mounted at the top of the screen, or a mobile phone with the front facing camera built in to a bezel at the top."), wherein the object appearance is changing in time ([0021] "The sensor data includes image data capturing facial expressions and motion of the user as the user moves and changes facial expressions."), and wherein the 3D image sensor is operative to create a 3D image of the object, the 3D image being captured from the first angle in real-time ([0029] "During the visual communication session, the hybrid visual communicator 24 may collect sensor data from the sensor array 25, including image data from the 3D camera system 26 capturing facial expressions and motion of the first device user and background images, and other sensor data relevant to a context of the visual communication session."; [0031] "In one embodiment, the 3D mesh model may be created by taking pictures of the first device user with the 3D camera system 26. The resulting image data may be used by the 3D model component 34 to create a digital, 3D mesh model."); a transceiver for communicating with an external communication device (fig. 1 input/output devices 18 and 18′; [0024] "The system 10 may include a first device 10a and a second device 10b, which communicate over a network 12."; [0025] "Example components comprising the I/O 18 and 18′ include a microphone, speaker, and a wireless network interface controller (or similar component) for communication over the network 12."); and a controller communicatively coupled to the 3D image sensor and to the transceiver (fig. 1 memory 14 and 14′, processor 16 and 16′; [0025] "The memory 14 and 14′, the processor 16 and 16′ and the I/O 18 and 18′ may be coupled together via a system bus (not shown)."); wherein the controller is configured to: receive a 2D image of the object ([0038] "In one embodiment, a color image of the user's face and/or one or more texture maps may also be associated with the 3D mesh model."); create a 3D model of the object, based on a combination of the 3D image and the 2D image ([0038] "In one embodiment, a color image of the user's face and/or one or more texture maps may also be associated with the 3D mesh model. The 3D model component 34 may then use the resulting data to create a flexible, polygonal mesh representation of at least the person's face and head by fitting images to depth maps of the user's face and head."); scan the object by the 3D image sensor in real-time ([0028] "According to the exemplary embodiment, the first device 10a supports real-time visual communication with the second device 10b"; [0043] "Referring again to FIG. 2, during the visual communication session between the first device 10a and the second device 10b, the hybrid visual communicator 24 may collect sensor data from a sensor array, where the sensor data may include image data capturing changing facial expressions and motion of the first device user (block 202). In one embodiment, image data may comprise depth maps of the first device user periodically captured by the 3D camera system 26 and the structured light source 30."); create, in real-time, a 2D real-time image of the object, based on the 3D model and the 3D image being captured from the first angle in real-time ([0020] "The exemplary embodiments provide a hybrid visual communication method and system between two devices that display the actual likeness, facial expressions, and motion of a user of one of the devices in real time on the other device, while reducing bandwidth."; [0048] "On the second device, the received data is turned into video by animating the data frame-to frame for display."; [0051] "Once the second device 10b receives the 3D model updates 25, the hybrid visual communicator 24′ uses the 3D model updates 25 to animate, render or modify playback of the 3D mesh model displayed on the second device to express the perceived emotional state and/or the body position of the user in real-time. If the 3D model updates 25 comprise changes to vertices, then the hybrid visual communicator 24′ uses the 3D model updates 25 to update the vertices of the 3D mesh model.").

Cullen does not explicitly teach that the controller is configured to receive, via the transceiver, from an external camera a 2D image of the object, the 2D image taken by the external camera from a second angle with respect to the object, the second angle being different from the first angle; or to communicate the 2D real-time image using the transceiver. Giger teaches that the 2D image is taken from a second angle with respect to the object, the second angle being different from the first angle (pg. 2-3 section 3.2 "Occlusion and Texture Stretching": "To address this, we parameterize the template in the 2D domain and create a complete, albeit static texture of the user's face with the correct gaze direction, that we can use for the occluded vertices. This is performed at the beginning of the session when the user is asked to look straight at the camera for just a brief instant."). Cullen and Giger are both analogous to the claimed invention because they are in the same field of gaze correction for video communication. It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Cullen with the teachings of Giger to capture a 2D input image from the desired output angle (facing directly forward), which is a different angle than the consistent video input angle. The motivation would have been to provide a reference texture for the intended gaze direction of the final 3D model, and/or serve as a texture to be applied to the 3D mesh to generate the final 3D model, as taught by both Cullen and Giger. (Cullen does not explicitly state the source or capture angle of its texture images as Giger does.)
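Cullen's model-creation step above ("fitting images to depth maps") combines a colorless depth measurement with a color image. As a rough sketch of that kind of fusion, the snippet below back-projects a depth map through a pinhole camera model and attaches per-pixel colors; the intrinsics and the perfect depth/color alignment are assumptions for the example, and neither reference discloses this exact procedure.

```python
import numpy as np

def fuse_depth_and_color(depth, color, fx, fy, cx, cy):
    """Back-project a depth map through a pinhole model and attach
    per-point colors from an aligned 2D image, yielding a colored
    point set that a mesh and texture could then be fit to."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = color.reshape(-1, 3)
    return points, colors

# Hypothetical 640x480 depth map (meters) and aligned RGB frame.
depth = np.full((480, 640), 0.6)
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
pts, cols = fuse_depth_and_color(depth, rgb,
                                 fx=525.0, fy=525.0, cx=320.0, cy=240.0)
```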
The combination of Cullen in view of Giger does not explicitly teach that the controller is configured to receive, via the transceiver, from an external camera a 2D image of the object; or communicate the 2D real-time image using the transceiver. Sommerlade teaches that the controller is configured to: receive, via the transceiver, from an external camera a 2D image of the object ([0015] "FIG. 1 schematically depicts an example user 100 and a camera 102 capturing video of the user. As indicated above, gaze adjustment and detail enhancement may be performed by a computing device communicatively coupled with the camera. The camera may be separate from the computing device and communicate with the computing device over a suitable wired or wireless connection."; [0016] "Regardless, the computing device that implements the herein-described techniques will receive a digital input image, either from a camera, another computing device, or another suitable source."); and communicate the 2D real-time image using the transceiver ([0037] "Thus, outputting the image at S5 may include displaying the image, combining the image with one or more other images, transmitting the enhanced image over a network (e.g., for display by a second device), or saving the enhanced image for later viewing."). Sommerlade and the combination of Cullen in view of Giger are both analogous to the claimed invention because they are in the same field of gaze correction for video communication. It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Cullen in view of Giger with the teachings of Sommerlade to both receive input images and transmit output images using wireless communication. The motivation would have been to increase the portability of the invention, making it more suitable for use on phones, and/or to be able to offload the computation associated with the 3D facial models to a different device.

Regarding claim 2, the combination of Cullen in view of Giger and Sommerlade teaches the imaging system according to claim 1, wherein the 2D real-time image is computed for the second angle (the "second angle", as previously discussed for claim 1, is looking straight forward: Giger figs. 1 and 4 show the final 2D gaze-corrected output; pg. 1 Abstract: "We apply recent shape deformation techniques to generate a 3D face model that matches the user's face. We then render a gaze-corrected version of this face model and seamlessly insert it into the original image."; pg. 3 section 3.3 "Seam optimization": "To correct the gaze direction, the textured deformed head mesh now needs to be rotated to achieve the feeling of eye contact."; pg. 3-4 section 4 "Results": "similarly to real-life face-to-face communication, our system provides eye contact only when the user is looking at the communication partner, i.e. at the video conferencing window."). Cullen, Giger, and Sommerlade are analogous to the claimed invention because they are in the same field of gaze correction for video communication; it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the previously discussed combination of Cullen in view of Giger and Sommerlade with the further teachings of Giger to generate an output image at the same angle as the input image (facing directly forward, towards the viewer). This is the main purpose and motivation of gaze correction, which serves to counteract the offset camera position of a video conference setup and make it appear as though a user is making eye contact despite looking at their screen, not the camera.
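Giger's correction step, rotating the textured head mesh and re-rendering it for the second angle, reduces to a rotation followed by a perspective projection. Below is a point-based toy sketch of that pipeline; Giger renders a full mesh rather than points, and the rotation axis and camera intrinsics here are assumed for illustration.

```python
import numpy as np

def render_at_angle(points, yaw_deg, fx=525.0, fy=525.0, cx=320.0, cy=240.0):
    """Rotate 3D points (camera coordinates, z > 0) by a corrective yaw,
    then perspective-project them to 2D pixel coordinates."""
    th = np.radians(yaw_deg)
    R = np.array([[np.cos(th), 0.0, np.sin(th)],
                  [0.0,        1.0, 0.0       ],
                  [-np.sin(th), 0.0, np.cos(th)]])
    p = points @ R.T
    u = fx * p[:, 0] / p[:, 2] + cx
    v = fy * p[:, 1] / p[:, 2] + cy
    return np.stack([u, v], axis=-1)

# Hypothetical head points roughly 0.6 m in front of the camera.
pts = np.array([[0.00,  0.00, 0.60],
                [0.05,  0.02, 0.62],
                [-0.04, -0.03, 0.61]])
pixels = render_at_angle(pts, yaw_deg=10.0)  # view re-rendered for the corrected angle
```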
Regarding claim 4, the combination of Cullen in view of Giger and Sommerlade teaches the imaging system according to claim 1, wherein the controller is additionally configured to: use the transceiver to communicate with a mobile communication device comprising a camera and a display to receive from the mobile communication device the 2D image of the object taken by the camera of the mobile communication device (Sommerlade [0015] "FIG. 1 schematically depicts an example user 100 and a camera 102 capturing video of the user. As indicated above, gaze adjustment and detail enhancement may be performed by a computing device communicatively coupled with the camera… Alternatively, the camera may be an integral component of the computing device—e.g., a forward-facing smartphone camera"; [0016] "In some examples, gaze correction and detail enhancement may be performed by a different computing device than the one that receives images of the human user from the camera. For instance, a computing device communicatively coupled with camera 102 may transmit one or more images captured by camera 102 to a second computing device over a network."); and use the transceiver to communicate with the mobile communication device to display on the display of the mobile communication device the 2D real-time image of the object (Sommerlade fig. 1; [0013] "The present disclosure primarily focuses on a scenario in which live video of a user is captured. In other words, the digital input image to which gaze adjustment and detail enhancement are applied may be one frame of a video stream including a plurality of frames."; [0037] "Thus, outputting the image at S5 may include displaying the image, combining the image with one or more other images, transmitting the enhanced image over a network (e.g., for display by a second device), or saving the enhanced image for later viewing."). Cullen, Giger, and Sommerlade are analogous to the claimed invention because they are in the same field of gaze correction for video communication; it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the previously discussed combination of Cullen in view of Giger and Sommerlade with the further teachings of Sommerlade to both receive input images and transmit output images to a smartphone. The motivation would have been to increase the portability of the invention, making it more suitable for use on phones and their associated cell networks, and/or to be able to offload the computation associated with the 3D facial models to a different device.

Regarding claim 3, the combination of Cullen in view of Giger and Sommerlade teaches the imaging system according to claim 1, wherein the 2D real-time image is computed with the resolution of the 2D image (Giger fig. 1a; the output image is based on the input 2D image and includes parts of the original image, suggesting that it is the same resolution; pg. 3 section 4 "Results": "Our system is fully automatic and runs in real-time, namely at 25fps for 800x600 input videos and 30fps for 640x480 input videos on a standard consumer computer…"; Giger uses the same camera to capture the single 2D image as the input video, so the same resolution will be maintained); wherein the 2D image is captured full color (Giger fig. 1a: "Input: color image acquired by a single camera…"); wherein the 3D image is captured with no colors (Cullen [0031] "In a further embodiment, the 3D camera system 26 may comprise a time-of-flight (ToF) camera that resolves distance based on the known speed of light, and measures the time-of-flight of a light signal between the camera and the object for each point of the image."; time-of-flight cameras only detect distance/position, not color); and wherein the 2D real-time image is computed with the colors obtained by the 2D image (Giger fig. 1d: the final result is a combination of the 2D color image and the 3D colorless mesh). Cullen, Giger, and Sommerlade are analogous to the claimed invention because they are in the same field of gaze correction for video communication; it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the previously discussed combination of Cullen in view of Giger and Sommerlade with the further teachings of Giger to specify that the 2D output video maintains the best characteristics of both of its source inputs: the color and resolution of the 2D input image, as well as the depth information from the 3D input. The motivation would have been to generate the highest quality output video possible.

The combination of Cullen in view of Giger and Sommerlade does not explicitly teach wherein the 2D image is captured in relatively high resolution, and wherein the 3D image is captured in relatively low resolution. Yang teaches a system for facial image capture wherein the 2D image is captured in relatively high resolution, and wherein the 3D image is captured in relatively low resolution (fig. 1 shows corresponding 2D image and 3D depth image inputs; pg. 1 col. 2: "In this paper, we develop a framework to track face shapes by using both color and depth information… The low-resolution depth image is captured by using Microsoft Kinect, and is used to predict head pose and generate extra constraints at the face boundary."). Yang and the invention of Cullen in view of Giger and Sommerlade are both analogous to the claimed invention because they pertain to the same issue of capturing both 2D image and 3D depth data of a human face; in particular, Yang uses the Microsoft Kinect sensors, the use of which is also alluded to by Giger (pg. 4 col. 1). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the combination of Cullen in view of Giger and Sommerlade with the teachings of Yang to use Kinect sensors (or a similar equivalent) to capture 2D color and 3D depth data, of which the 3D data is a lower resolution. The motivation would have been to use commonly available consumer-grade hardware to make the invention easily accessible to the public.
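Pairing Yang's low-resolution depth with a high-resolution color image implies resampling the depth onto the color grid so that each color pixel has a depth value. A minimal nearest-neighbor sketch follows; the resolutions are illustrative, and a real Kinect-style pipeline would also need extrinsic depth-to-color registration.

```python
import numpy as np

def upsample_depth(depth_lr, out_shape):
    """Nearest-neighbor upsample of a low-resolution depth map to the
    color camera's resolution, so each high-res color pixel can be
    paired with a depth value (as in Kinect-style 2D/3D fusion)."""
    h, w = depth_lr.shape
    H, W = out_shape
    rows = np.arange(H) * h // H   # source row for each output row
    cols = np.arange(W) * w // W   # source column for each output column
    return depth_lr[np.ix_(rows, cols)]

depth_lr = np.full((240, 320), 0.6)              # low-res, colorless depth
depth_hr = upsample_depth(depth_lr, (480, 640))  # matched to a 640x480 color frame
```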
Claim(s) 5 and 7 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cullen (US 20160353056 A1) in view of Giger et al. ("Gaze correction with a single webcam") and Sommerlade (US 20210097644 A1) as applied to claim 1 above, and further in view of Bosworth (US 11039651 B1).

Regarding claim 5, the combination of Cullen in view of Giger and Sommerlade teaches the imaging system according to claim 1, but does not teach additionally comprising: a cap having a visor and wherein the imaging system is mounted on the visor facing a user's face wearing the cap; and wherein the object being imaged is the face of the user wearing the cap. Bosworth teaches a cap having a visor wherein the imaging system is mounted on the visor facing a user's face wearing the cap, and wherein the object being imaged is the face of the user wearing the cap (col. 14 lines 15-25: "In some examples, the face-tracking subsystem 305 and/or the body-tracking subsystem 307 may include one or more body- and/or face-tracking light sources and/or optical sensors, such as face/body-tracking component 404 in FIG. 4, along with potentially other sensors or hardware components. These components may be positioned or directed toward the user's face and/or body so as to capture movements of the user's mouth, cheeks, lips, chin, etc., as well as potentially movement of the user's body, including their arms, legs, hands, feet, torso, etc."; col. 14 line 56 to col. 15 line 4: "As with the eye-tracking subsystem 303, the face-tracking subsystem 305 and/or the body-tracking subsystem 307 may be incorporated within and/or coupled to the artificial reality hats disclosed herein in a variety of ways. In one example, all or a portion of the face-tracking subsystem 305 and/or the body-tracking subsystem 307 may be embedded within and/or attached to the brim portion of an artificial reality hat… By doing so, the face/body-tracking component(s) 404 may be positioned far enough away from the user's face and/or body to have a clear view of the user's facial expressions and/or facial and body movements."). Bosworth and the combination of Cullen in view of Giger and Sommerlade are both analogous to the claimed invention because they are in the same field of facial imaging. It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Cullen in view of Giger and Sommerlade with the teachings of Bosworth to mount the imaging sensors on the brim of a hat. The motivation would have been to make the system portable rather than depending on a fixed sensor and display emplacement.

Regarding claim 7, the combination of Cullen in view of Giger and Sommerlade, further in view of Bosworth, teaches the imaging system according to claim 5, wherein the 2D real-time image is provided as a video stream (Giger pg. 5: "…our approach can be applied for standard home video conferencing."). Cullen, Giger, and Sommerlade are analogous to the claimed invention because they are in the same field of gaze correction for video communication; it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the previously discussed combination of Cullen in view of Giger and Sommerlade with the further teachings of Giger to specify that the 2D output video can be used for streaming video communication. The motivation would have been to apply the invention towards video conferencing, and to provide an alternative to Cullen's client-side video rendering that still allows the usage of the gaze correction system.
Claim(s) 6 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cullen (US 20160353056 A1) in view of Giger et al. ("Gaze correction with a single webcam") and Sommerlade (US 20210097644 A1) and further in view of Bosworth (US 11039651 B1) as applied to claim 5 above, and further in view of Yang ("Robust face tracking with a consumer depth camera").

Regarding claim 6, the combination of Cullen in view of Giger and Sommerlade, further in view of Bosworth, teaches the imaging system according to claim 5, wherein the imaging system captures a 3D, low-resolution (Yang, see claim 3), no-color, real-time image of the user's face (Cullen [0028] "According to the exemplary embodiment, the first device 10a supports real-time visual communication with the second device 10b"; [0043] "Referring again to FIG. 2, during the visual communication session between the first device 10a and the second device 10b, the hybrid visual communicator 24 may collect sensor data from a sensor array, where the sensor data may include image data capturing changing facial expressions and motion of the first device user (block 202). In one embodiment, image data may comprise depth maps of the first device user periodically captured by the 3D camera system 26 and the structured light source 30."; time-of-flight sensor data is colorless, as previously discussed for claim 3) in an angle to the profile of the user (Giger pg. 5 col. 1: "The sequence of Figure 6-c has been acquired on a desktop computer equipped with a standard external webcam located at the top of the screen"), and communicates a 2D high-resolution, full-color, real-time image of the profile of the user (Giger fig. 1d "final result"; pg. 3 section 3.4 "Discussion": "…an efficient real-time gaze correction system using a single webcam."). Yang and the combination of Cullen in view of Giger and Sommerlade, further in view of Bosworth, are both analogous to the claimed invention because they pertain to the same issue of capturing both 2D image and 3D depth data of a human face. It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Cullen in view of Giger and Sommerlade and further in view of Bosworth with the teachings of Yang to use a low-resolution 3D depth camera found in commonly available consumer-grade hardware to make the invention easily accessible to the public. Additionally, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have further combined the teachings of Cullen, Giger, Sommerlade, and Bosworth in the claimed manner in order to maximize the quality of the gaze correction system described by Cullen.
Claim(s) 8, 9, 12, and 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cullen (US 20160353056 A1) in view of Kuster et al. ("Gaze correction for home video conferencing," ACM Transactions on Graphics, Vol. 31, No. 6 (Nov 01, 2012), pp. 174:1-174:6, https://doi.org/10.1145/2366145.2366193, hereinafter "Kuster").

Regarding claim 8, Cullen teaches a computer-implemented method for creating a streaming 2D image of an object, the method comprising: obtaining a 2D image of the object ([0038] "In one embodiment, a color image of the user's face and/or one or more texture maps may also be associated with the 3D mesh model."); obtaining a 3D measurement of the object ([0031] "In one embodiment, the 3D mesh model may be created by taking pictures of the first device user with the 3D camera system 26. The resulting image data may be used by the 3D model component 34 to create a digital, 3D mesh model."); creating a 3D model of the object, the 3D model of the object being based on the 2D image of the object and the 3D measurement of the object ([0038] "The 3D model component 34 may then use the resulting data to create a flexible, polygonal mesh representation of at least the person's face and head by fitting images to depth maps of the user's face and head."); obtaining a streaming 3D measurement of the object ([0028] "According to the exemplary embodiment, the first device 10a supports real-time visual communication with the second device 10b"; [0043] "Referring again to FIG. 2, during the visual communication session between the first device 10a and the second device 10b, the hybrid visual communicator 24 may collect sensor data from a sensor array, where the sensor data may include image data capturing changing facial expressions and motion of the first device user (block 202). In one embodiment, image data may comprise depth maps of the first device user periodically captured by the 3D camera system 26 and the structured light source 30."); and creating a streaming 2D image of the object, the streaming 2D image of the object being based on the 3D model of the object and the streaming 3D measurement of the object ([0020] "The exemplary embodiments provide a hybrid visual communication method and system between two devices that display the actual likeness, facial expressions, and motion of a user of one of the devices in real time on the other device, while reducing bandwidth."; [0048] "On the second device, the received data is turned into video by animating the data frame-to frame for display."; [0051] "Once the second device 10b receives the 3D model updates 25, the hybrid visual communicator 24′ uses the 3D model updates 25 to animate, render or modify playback of the 3D mesh model displayed on the second device to express the perceived emotional state and/or the body position of the user in real-time. If the 3D model updates 25 comprise changes to vertices, then the hybrid visual communicator 24′ uses the 3D model updates 25 to update the vertices of the 3D mesh model."), the streaming 2D image of the object being created for a third angle with respect to the object ([0078] "With respect to presentation, the 3D mesh models may be rendered to look directly at the viewer, as opposed to down.").

Cullen does not specify the relative angles of the various 2D and 3D input images; therefore, it does not explicitly teach the 2D image of the object obtained from a first angle with respect to the object; the 3D measurement of the object obtained from the first angle with respect to the object; the streaming 3D measurement of the object obtained from a second angle with respect to the object, the second angle being different from the first angle with respect to the object; or the streaming 2D image of the object being created for a third angle with respect to the object, the third angle being different from the second angle with respect to the object.

Kuster teaches a system for gaze correction comprising the 2D image of the object obtained from a first angle with respect to the object (the "first angle" is defined as looking straight at the camera; pg. 4 section 3.1 "Initial Calibration" describes both a 2D and 3D image being taken from this angle: "The first parameter that needs to be set is the position of the virtual camera. This is equivalent to finding a rigid transformation that, when applied to the geometry, results in an image that makes eye contact. We provide two mechanisms for that… The second one is a semi-automatic technique where two snapshots are taken from the Kinect: one while the user is looking straight at the Kinect and one while the user is looking straight at the video conference window. From these two depth images we can compute the rigid transformation that maps one into the other. This is accomplished by matching the eye-tracker points in the two corresponding color/depth images."); the 3D measurement of the object obtained from the first angle with respect to the object (pg. 4 section 3.1 "Initial Calibration", as quoted above); the streaming 3D measurement of the object obtained from a second angle with respect to the object, the second angle being different from the first angle with respect to the object (pg. 3 section 3 "System Overview": "Although webcams are usually mounted on the top of the screen, the current hybrid sensor devices are typically quite bulky and it is more natural to place them at the bottom of the screen."; the user is suggested to be looking at the screen, with the sensors facing upwards); and the streaming 2D image of the object being created for a third angle with respect to the object, the third angle being different from the second angle with respect to the object (the third angle is the same as the first, directly facing the camera/viewer: fig. 3, fig. 4d output image; pg. 2 section 1 "Introduction": "This results in an image with no missing pixels or significant visual artifacts in which the subject makes eye contact."). Cullen and Kuster are both analogous to the claimed invention because they are in the same field of gaze correction for video communication. It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Cullen with the teachings of Kuster to capture a 2D and 3D input image from the desired output angle (looking directly forward), which is a different angle than the consistent video input angle. The motivation would have been to provide a reference texture for the intended gaze direction of the final 3D model, and/or serve as a texture to be applied to the 3D mesh to generate the final 3D model, as taught by both Cullen and Kuster.
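Kuster's semi-automatic calibration, quoted above, computes the rigid transformation mapping one depth snapshot onto the other from matched landmark points. The standard closed-form solution for that problem is the Kabsch algorithm; the sketch below is that general method under an assumption of noise-free matched landmarks, not Kuster's published solver, and the landmark values are synthetic.

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) mapping point set P onto Q
    (Kabsch algorithm). P, Q: (N, 3) matched points, e.g. eye-tracker
    landmarks from the two calibration depth snapshots."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)              # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])             # guard against reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

# Synthetic matched landmarks: rotate P by 20 degrees and translate it.
P = np.random.default_rng(1).normal(size=(6, 3))
th = np.radians(20.0)
R_true = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0,         0.0,        1.0]])
Q = P @ R_true.T + np.array([0.0, 0.05, 0.0])
R_est, t_est = rigid_transform(P, Q)   # recovers R_true and the translation
```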
Regarding claim 9, the combination of Cullen in view of Kuster teaches the computer-implemented method according to claim 8, additionally comprising: creating the streaming 2D image with a quality that is higher than the quality of the streaming 3D measurement of the object (Kuster fig. 4, compare the depth image in step a with the 2D output image in step d); using a high-quality 2D image (Kuster fig. 4a input color image) to create a high-quality 3D model (Kuster fig. 4b: "b) Synthesize an image of the subject with the gaze corrected (by performing an appropriate 3D transformation of the head geometry).") to create a high-quality streaming 2D image (fig. 4d output image), wherein the quality of the 2D image and the quality of the streaming 2D image is higher than the quality of the 3D measurement and the streaming 3D measurement of the object (Kuster fig. 4a, compare the streaming input 2D image to the streaming input 3D depth image; pg. 4 section 3.1 "Initial Calibration" describes that the static 2D and 3D images from the first angle are taken using the same sensors, so the quality will be the same); obtaining the streaming 3D measurement in real time (Kuster pg. 3 section 2 "Related Work": "Since the main focus of many of these methods is reconstructing the underlying geometry of the head or face, the emergence of consumer-level depth/color sensors such as the Kinect, giving easy access to real-time geometry and color information, is an important technological breakthrough that can be harnessed to solve the problem."); creating the streaming 2D image in real time (Kuster pg. 2 section 1 "Introduction": "In this paper we propose a gaze correction system targeted at a peer-to-peer video conferencing model that runs in real-time on average consumer hardware and requires only one hybrid depth/color sensor such as the Kinect."); communicating the streaming 2D image of the object to at least one of a remote network server and a remote recipient client device (Kuster section 4 "Results and Discussion": the invention was incorporated into a Skype plugin for video conferencing, which requires transmitting the output image to a client device); and providing the streaming 2D image as a video stream (Kuster section 4 "Results and Discussion": the invention was incorporated into a Skype plugin for video conferencing). Cullen and Kuster are both analogous to the claimed invention because they are in the same field of gaze correction for video communication. It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Cullen in view of Kuster with the additional teachings of Kuster in order to maximize the quality of the output video stream, and to ensure all components run in real time. The motivation would have been to optimize the invention for use in video conferencing software.

Claim(s) 10 and 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cullen (US 20160353056 A1) in view of Kuster ("Gaze correction for home video conferencing") as applied to claims 9 and 13 above, and further in view of Yang ("Robust face tracking with a consumer depth camera").

Regarding claim 10, the combination of Cullen in view of Kuster teaches the computer-implemented method according to claim 9, wherein the higher quality is higher temporal resolution (Cullen [0051] "If the 3D model updates 25 comprise blend shape coefficients, then the hybrid visual communicator 24′ uses the blend shape coefficients to select blend shapes or key poses from the emotional state database 29′ and then interpolates between a neutral expression of the original 3D mesh model and a selected key pose, or between a previous key pose and the selected key pose."; if interpolated frames are generated, then the 3D model updates (corresponding to the claimed "streaming 3D measurement" of claim 9) must be transmitted at a lower frame rate than the output animation, which is used to generate the claimed "streaming 2D image" of claim 9), and being colorful (Kuster fig. 4a: the 2D input image is in color while the 3D input depth map is not). The combination of Cullen in view of Kuster does not explicitly teach wherein the higher quality is higher spatial resolution.
Yang teaches wherein the higher quality is higher spatial resolution (fig. 1 shows corresponding 2D image and 3D depth image inputs; pg. 1 col. 2: "In this paper, we develop a framework to track face shapes by using both color and depth information… The low-resolution depth image is captured by using Microsoft Kinect, and is used to predict head pose and generate extra constraints at the face boundary."). Yang and the invention of Cullen in view of Kuster are both analogous to the claimed invention because they pertain to the same issue of capturing both 2D image and 3D depth data of a human face using Kinect sensors. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the combination of Cullen in view of Kuster with the teachings of Yang to specify the use of the Kinect sensors to capture 2D color and 3D depth data, where the 3D data has a lower resolution than the 2D data. The motivation would have been to use commonly available consumer-grade hardware to make the invention easily accessible to the public.

Claim(s) 11 and 15 is/are rejected under 35 U.S.C. 103 as being unpatentable over Cullen (US 20160353056 A1) in view of Kuster ("Gaze correction for home video conferencing") as applied to claims 8 and 12 above, and further in view of Bosworth (US 11039651 B1).

Regarding claim 11, the combination of Cullen in view of Kuster teaches the computer-implemented method according to claim 8, additionally comprising: using a smartphone camera, a handheld camera, and a wrist-mounted camera to obtain the 2D image of the object (Cullen [0002] "Examples of the computer device may include a desktop or laptop computer having a camera mounted at the top of the screen, or a mobile phone with the front facing camera built in to a bezel at the top."; [0027] "According to the exemplary embodiment, the first and second devices 10a and 10b may communicate using hybrid visual communication, and therefore further include respective hybrid visual communicators 24 and 24′ and sensor arrays 25 and 25′… Example types of sensors may include, but are not limited to, image sensors such as a 3D camera system 26 or a 2D camera system 28"). The combination of Cullen in view of Kuster does not teach using a cap-mounted camera to obtain the streaming 3D measurement of the object, wherein the object being imaged is the face of the user wearing the cap. Bosworth teaches using a cap-mounted camera to obtain the streaming 3D measurement of the object wherein the object being imaged is the face of the user wearing the cap (Bosworth col. 14 lines 15-25: "In some examples, the face-tracking subsystem 305 and/or the body-tracking subsystem 307 may include one or more body- and/or face-tracking light sources and/or optical sensors, such as face/body-tracking component 404 in FIG. 4, along with potentially other sensors or hardware components. These components may be positioned or directed toward the user's face and/or body so as to capture movements of the user's mouth, cheeks, lips, chin, etc., as well as potentially movement of the user's body, including their arms, legs, hands, feet, torso, etc."; col. 14 line 56 to col. 15 line 4: "As with the eye-tracking subsystem 303, the face-tracking subsystem 305 and/or the body-tracking subsystem 307 may be incorporated within and/or coupled to the artificial reality hats disclosed herein in a variety of ways. In one example, all or a portion of the face-tracking subsystem 305 and/or the body-tracking subsystem 307 may be embedded within and/or attached to the brim portion of an artificial reality hat… By doing so, the face/body-tracking component(s) 404 may be positioned far enough away from the user's face and/or body to have a clear view of the user's facial expressions and/or facial and body movements."). Bosworth and the combination of Cullen in view of Kuster are both analogous to the claimed invention because they are in the same field of facial imaging. It would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to have modified the invention of Cullen in view of Kuster with the teachings of Bosworth to mount the imaging sensors on the brim of a hat. The motivation would have been to make the system portable rather than depending on a fixed sensor and display emplacement.

Regarding claims 12, 13, 14, and 15, they are rejected using the same references, rationale, and motivations to combine described in the rejections of claims 8, 9, 10, and 11, respectively, because their limitations substantially correspond to the limitations of claims 8, 9, 10, and 11, respectively, along with the additional limitation of a computer program product embodied on a non-transitory computer readable medium (Cullen [0084] "For example, the exemplary embodiment can be implemented using hardware, software, a computer readable medium containing program instructions, or a combination thereof. Software written according to the present disclosure is to be either stored in some form of computer-readable medium such as a memory, a hard disk, or a CD/DVD-ROM and is to be executed by a processor.").

References Cited

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Astarabadi et al. (US 20200358983 A1) teaches a video conferencing system which can generate a gaze-corrected 2D video output by applying a pre-generated 2D texture of a user's face to a 3D model of the user's face, which is animated in real-time based on camera input.

Conclusion

THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a). A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to BENJAMIN STATZ whose telephone number is (571)272-6654. The examiner can normally be reached Mon-Fri 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Tammy Goddard, can be reached at (571)272-7773. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/BENJAMIN TOM STATZ/
Examiner, Art Unit 2611

/TAMMY PAIGE GODDARD/
Supervisory Patent Examiner, Art Unit 2611

Prosecution Timeline

Dec 27, 2023: Application Filed
Sep 30, 2025: Non-Final Rejection (§103)
Jan 08, 2026: Response Filed
Mar 31, 2026: Final Rejection (§103, current)


Prosecution Projections

Expected OA Rounds: 3-4
Grant Probability: 0%
With Interview: 0% (+0.0%)
Median Time to Grant: 2y 9m
PTA Risk: Moderate
Based on 2 resolved cases by this examiner. Grant probability derived from career allow rate.
