DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on February 25, 2026 has been entered.
Response to Amendment
Applicant's amendments filed on February 25, 2026 have been entered and made of record.
Currently pending Claim(s): 1-21
Independent Claim(s): 1, 10, 17
Amended Claim(s): 1, 10, 17
Canceled Claim(s): 16
New Claim(s): 21
Response to Arguments
This Office action is responsive to the Applicant's Arguments/Remarks made in an Amendment received on February 25, 2026.
In view of the amendments filed on February 25, 2026, the Applicant has amended independent Claim 1 to recite the additional limitation of “determining, by the processing device using the machine learning model, a time for displaying the animated content relative to a detected movement of the object”. The Applicant has further amended Claim 1 by including the limitation of “displaying, by the processing device, the rendered animated content within the frame of the digital video in a user interface at the determined time”. Originally (in the claim set dated December 17, 2025), Claim 1 recited “determining, by the processing device using a machine learning model, a location within the frame of the digital video to place the animated content, the location indicating layering for the animated content relative to the object based on a type of the animated content”, and was rejected over Banica (US Pub No 2014/0359656) and Kundu (US Pub No 2023/0236660).
As discussed in the following paragraphs, the Examiner argues that the newly amended claims remain unpatentable over Banica (US Pub No 2014/0359656) and Kundu (US Pub No 2023/0236660).
In the Applicant's Arguments/Remarks filed on February 25, 2026, the Applicant explained (on Remarks page 10) that both Banica and Kundu fail “to teach or suggest determining by the processing device using the machine learning model, a time for displaying the animated content relative to a detected movement of the object”. The Applicant further argued that both Banica and Kundu fail “to teach or suggest displaying, by the processing device, the rendered animated content within the frame of the digital video in a user interface at the determined time.”
The Applicant explained (on Remarks page 10) that Banica “does not describe determining a time”. The Applicant further argued (on Remarks page 11) that although Banica teaches inserting displays into video content, Banica “does not describe determining a time and therefore cannot describe displaying rendered animated content within a frame of the digital video in a user interface at the determined time.” The Examiner agrees that Banica fails to teach determining a time to display animations with respect to a movement of an object. Banica teaches rendering the animated content within the frame of the digital video at the location (see paragraph [0025], “Depending on the duration of an overlay, it can appear in frames spanning multiple clips or scenes of the video content”) and displaying the rendered animated content within the frame of the digital video in a user interface (see paragraph [0008], “The system inserts the overlay into the selected location and renders the video content with the inserted overlay on the display device”), but fails to explicitly teach determining a time.
The Applicant further argued (on Remarks page 10) that Kundu (which teaches displaying animations in response to the movements of a virtual actor) fails to teach the new limitation because “identifying a trigger action, as described by Kundu, is not equivalent to a time for displaying animated content.” The Applicant further argued (on Remarks page 11) that Kundu fails to teach displaying the rendered animated content at the determined time because “updating a virtual three-dimensional environment based on a detected trigger is not equivalent to displaying animated content within a frame of the digital video in a user interface at a determined time. For instance, as explained above, Kundu does not describe a ‘time’ aspect.”
The Examiner respectfully disagrees. Kundu teaches that animations can be displayed in response to user actions (see paragraph [0063], “determining if and how the virtual actor makes contact with interactive objects within the three-dimensional scene so as to trigger corresponding actions”). By identifying ‘a trigger action’, a time immediately after the ‘trigger action’ is inherently determined for displaying the corresponding action. Kundu further explicitly teaches determining a time for displaying the animated content relative to a detected movement of the object in paragraph [0101] (“The system identifies the moving fist as a “drag” gesture. The system therefore updates the scene this time to show interactive object 1208 being dragged to the right. The system displays the scene to the user, before the process loop proceeds to begin the next iteration”, where the ‘object’ is the user, the user moves, and the system determines a time to display the interactive object moving). Kundu is combinable with Banica since both are from the analogous field of image analysis and animation. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine determining a time for displaying animations taught by Kundu with the animation generation method taught by Banica. The motivation to do so would be to provide the user with real-time animations of objects and scenes (see paragraph [0005], “The present disclosure provides system and method that enable the user(s) to interact in real-time with other objects or items in the scene or even with each other in the case of multiple users”).
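For illustration only, and not as part of the record, the inherency reasoning above can be sketched as follows in Python; every name here is hypothetical and is not drawn from either reference:

```python
# Illustrative sketch: once a trigger gesture is detected at a given frame,
# the display time for the responsive animation is necessarily determined
# relative to that detected movement.
from dataclasses import dataclass

@dataclass
class TriggerEvent:
    frame_index: int   # frame at which the gesture was detected
    gesture: str       # e.g., "drag"

def determine_display_time(event: TriggerEvent, latency_frames: int = 1) -> int:
    """Return the frame at which the responsive animation is shown:
    immediately after the frame containing the detected trigger."""
    return event.frame_index + latency_frames

event = TriggerEvent(frame_index=120, gesture="drag")
print(determine_display_time(event))  # 121: the frame right after the trigger
```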
Claim 10 has been amended to recite “determining, using the machine learning model, a time for executing the behavior in response to a detected gesture of the object”. The Applicant has further amended Claim 10 by including the limitation of “displaying the rendered animated content within the frame of the digital video in a user interface including executing the behavior determined by the machine learning model at the determined time”. Originally, Claim 10 was rejected over Banica in view of Kundu.
The Examiner argues that the newly amended claim remains unpatentable over Banica in view of Kundu. Kundu teaches determining a time for executing the behavior in response to a detected gesture of the object (see paragraph [0101], “The system identifies the moving fist as a “drag” gesture. The system therefore updates the scene this time to show interactive object 1208 being dragged to the right. The system displays the scene to the user, before the process loop proceeds to begin the next iteration”, where the ‘object’ is the user, the user moves, and the behavior is the interactive object moving to the right). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the timing determination taught by Kundu with the teachings of Banica. The motivation for doing so would be to produce real-time animation effects, as taught by Kundu in paragraph [0005].
Claim 17 has been amended to recite “determining, using a machine learning model, an updated behavior of the animated content, including a time for execution of the behavior, based on the digital video and based on a detected movement of the object”. The Applicant has further amended Claim 17 by including the limitation of “displaying the rendered animated content within the frame of the digital video in a user interface, including the updated behavior at the determined time determined by the machine learning model”. Originally, Claim 17 was rejected over Banica in view of Kundu and Nguyen (US Pub No 2021/0272363). Nguyen discusses an animation that moves according to a motion path sketch drawn by the user.
Upon further consideration, the Examiner argues that the newly amended Claim 17 remains unpatentable over Banica, Nguyen, and Kundu. Kundu teaches determining an updated behavior of the animated content, including a time for execution of the behavior, based on a detected movement of the object (see paragraph [0101], “The system identifies the moving fist as a “drag” gesture. The system therefore updates the scene this time to show interactive object 1208 being dragged to the right. The system displays the scene to the user, before the process loop proceeds to begin the next iteration”, where the ‘object’ is the user, the user moves, and the updated behavior is the interactive object moving to the right). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kundu with the teachings of Banica and Nguyen. The motivation for doing so would be to produce real-time animation effects, as taught by Kundu in paragraph [0005].
As to newly added Claim 21, the Applicant argues (on Remarks page 12) that Kundu “does not describe identifying a portion of an object to avoid covering. For example, Kundu does not discuss identifying a portion of the “subject layer” to avoid covering by the “actor layer.” Rather, Kundu treats each object in its entirety for purposes of layering.”
The Examiner respectfully disagrees. Kundu teaches that portions of an object can be identified (see paragraph [0128], “Once trained, the neural network is used as a classifier by which it can tag, in a binary matter, which regions of the image are most likely part of a human face or body parts”, where the object is the body of a user, and the face and body parts are portions of the user), and that animated content can be placed to avoid covering the portion of the object (see paragraph [0123], “The position of the layers (i.e., the depth position within the layers, for example, a subject layer and a background layer) in a scene can be determined based on the location of the content within or on the layer. By way of an example, the content of the data toward the top of the scene is deprioritized and assigned to the background layer while content toward the middle or bottom of the scene is prioritized to be in the subject layer. This enables the actor to be able to stand and present content which is occluding the actor's lower and possibly middle section without the actor's head being obstructed by the content located higher up in the scene”, where the actor’s head is the portion of the object that is not obstructed). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the portion identification and layering taught by Kundu with the teachings of Banica. The motivation for doing so would be to prevent the user’s face from being obstructed by the animations, as taught by Kundu in paragraph [0123].
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claim(s) 1-6, 8, 10-11, 13-15, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Banica et al. (US Pub No 2014/0359656), hereinafter Banica, in view of Kundu (US Pub No 2023/0236660), hereinafter Kundu.
As to Claim 1, Banica teaches a method (see paragraph [0007]) comprising:
receiving, by a processing device, a digital video and data executable to generate animated content (see paragraph [0007], “method receiving, at a computing device, video content and an indication of an overlay to be placed in the video content”);
detecting, by the processing device, an object depicted in a frame of the digital video (see paragraph [0118], “Certain of these embodiments use an attention model that detects the Attention Objects (AOs). Each AO is described by the following values: ROI (region of interest), AV (attention value), and MPS (minimal perceivable size). ROI indicates the area where the object is (and it is represented as a shape such as a rectangle)... The AV of each face takes into account the face size and position (larger faces and faces detected in the center of a frame have a higher AV)”);
determining, by the processing device using a machine learning model, a location within a frame of the digital video to place the animated content (see paragraph [0008], “the system performs attention modeling on frames of the video content to identify zones in the video content likely to be of interest to a viewer of the video content”, where attention modeling is an example of a machine learning model; see also paragraph [0007], “Based at least in part on properties of the overlay and properties of the video content, the method determines locations where the overlay can be placed within the video content”, where an overlay can be multimedia content including animation);
rendering, by the processing device, the animated content within the frame of the digital video at the location (see paragraph [0025], “Depending on the duration of an overlay, it can appear in frames spanning multiple clips or scenes of the video content”);
and displaying, by the processing device, the rendered animated content within the frame of the digital video in a user interface (see paragraph [0008], “The system inserts the overlay into the selected location and renders the video content with the inserted overlay on the display device”).
Banica fails to teach that the location indicates layering for the animated content based on a type of the animated content. However, Kundu teaches that the type of animated content can influence what layer the animated content appears on (see paragraph [0120], “In some examples, particular types of content may be prioritized to be assigned to the subject layer which is further in front and therefore less likely to be obstructed. By way of an example, text content can be prioritized to be assigned as the subject layer so that it is positioned in front of the actor layer, thereby not obstructed by the actor layer”). Banica is combinable with Kundu as both are from the analogous field of image analysis and animations. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the layering system taught by Kundu with the system taught by Banica. The motivation for doing so would be to prevent certain content from being obstructed.
Banica fails to teach determining, by the processing device using the machine learning model, a time for displaying the animated content relative to a detected movement of the object. Banica further fails to teach displaying the animated content at the determined time. However, Kundu teaches determining a time for displaying the animated content relative to a detected movement of the object (see paragraph [0101], “The system identifies the moving fist as a “drag” gesture. The system therefore updates the scene this time to show interactive object 1208 being dragged to the right. The system displays the scene to the user, before the process loop proceeds to begin the next iteration”, where the ‘object’ is the user, the user moves, and the system determines a time to display the interactive object moving). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine determining a time for displaying animations taught by Kundu with the animation generation method taught by Banica. The motivation to do so would be to provide the user with real-time animations of objects and scenes (see paragraph [0005], “The present disclosure provides system and method that enable the user(s) to interact in real-time with other objects or items in the scene or even with each other in the case of multiple users”). Thus, it would have been obvious to combine the teachings of Kundu with the teachings of Banica in order to obtain the invention as claimed in Claim 1.
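Purely as an illustrative sketch of the combined teaching as mapped to Claim 1 (all function and field names below are hypothetical stubs, not code from Banica or Kundu):

```python
# Hypothetical end-to-end sketch: detect an object, choose a location and
# layer for the animated content, and fix a display time relative to a
# detected movement, then render/display at that placement.
from dataclasses import dataclass

@dataclass
class Placement:
    x: int
    y: int
    layer: str          # "front" or "behind" relative to the detected object
    display_frame: int  # time determined relative to the detected movement

def detect_object(frame_index: int) -> tuple[int, int]:
    # Stand-in for attention-model object detection; returns a position.
    return (320, 240)

def plan_placement(obj_xy: tuple[int, int], content_type: str,
                   trigger_frame: int) -> Placement:
    # Layering chosen by content type (cf. Kundu [0120]); display time
    # fixed immediately after the detected movement (cf. Kundu [0101]).
    layer = "front" if content_type == "text" else "behind"
    return Placement(obj_xy[0] + 50, obj_xy[1], layer, trigger_frame + 1)

print(plan_placement(detect_object(0), "text", trigger_frame=120))
```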
As to Claim 2, Banica in view of Kundu teaches a machine learning model (see Banica, “attention modeling” in paragraph [0008]) that determines the location based on a location of an object (see Banica, paragraph [0025], “The method and system determine non-obtrusive locations within the video that the provided overlays can be placed in. Non-obtrusiveness can be based on properties of the overlays and properties of frames of the video content. Depending on the duration of an overlay, it can appear in frames spanning multiple clips or scenes of the video content. Locations are determined to be non-obtrusive if an overlay having certain size, dimension, color, and/or translucency properties will not obscure or overlap with important objects in the frames”).
As to Claim 3, Banica fails to teach that rendering the animated content involves attaching a portion of the animated content to the object detected in the frame of the digital video. However, Kundu teaches that animated content can be ‘attached’ to an object in a video (see paragraph [0072], “Once the user's image is isolated, some embodiments of the system may use all or some of this image as the user representation (also referred to as “virtual user” or “virtual user representation”). In some embodiments, the virtual user representation may be entirely the “in real life” image of the physical user. In other embodiments, the virtual user representation may be in part the “in real life” image (for example, only the face) while other parts of the virtual user may be virtual (such as the body of an avatar) or augmented (such as wearing virtual clothing or holding virtual objects)”, where the clothes and objects are ‘attached’ to the actor in the frame of the digital video). Kundu and Banica are combinable because both are from analogous fields of image analysis and display. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow for the user to easily change the appearance of themselves in the digital video by attaching animations to the user. Thus, it would have been obvious to combine the teachings of Banica and Kundu in order to obtain the invention as claimed in Claim 3.
As to Claim 4, Banica teaches a machine learning model (see “attention modeling” in paragraph [0008]) that determines the location based on tracking a person in the frame (see paragraph [0118], “These are deemed to attract more attention (although they may not be salient from a bottom-up perspective). The AV of each face takes into account the face size and position (larger faces and faces detected in the center of a frame have a higher AV)”, where the face of the detected person is tracked and deemed ‘salient’, and animated content is located away from ‘salient’ regions).
Banica fails to teach that the machine learning model determines the location based on tracking a pose of a detected person depicted in a frame of the digital video. However, Kundu teaches that poses can be detected (see paragraph [0140], “In some examples, recognition of the movements or motions of the actor is used to cause interaction with the interactive object. The recognition of the motion or movement of the actor can be done using video recognition approaches, such as You Only Look Once (YOLO), a real-time object detection method, human pose estimation (including based on skeletal based, contour-based, and/or volume-based models), and the like. Similarly, the actor's body part (hand, foot, head, etc.) can be tracked by object recognition approaches such as YOLO, human pose estimation (including based on skeletal-based, contour-based, and/or volume-based models), etc.”),
and the locations of animations can be determined from the detected pose (see paragraph [0063], “if the physical actor is determined to be making a pinching gesture with their fingers as if holding a pen, then in step 1313 the system maps the physical location of the physical actor to corresponding virtual location of the virtual actor within the virtual three-dimensional space where the recognition of the pinching gesture (i.e., the pen gesture) causes the system in step 1313 to update the virtual three-dimensional space to show writing or drawing in the virtual location of the hand (or virtual pen being held by the hand) of the virtual actor in the virtual three-dimensional space”).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow the user to interact with animated content without accidentally obscuring the content. Kundu teaches in paragraph [0004], “The present disclosure provides systems and methods that are useful for many situations where the actor needs to be on screen in realtime with other content, but in a way that does not overlap or occlude some objects in the virtual world. This is accomplished by inserting a representation of the actor into a “scene”. Such a scene is composed of content at a multitude of different levels or layers of depth where, from the point of view of the viewer, some of the content is behind the actor and some content is in front of the actor and therefore not occluded by the actor.” Thus, by recognizing poses and gestures, the animated content can be displayed in a way such that gestures done by a person in a video do not obscure the animated content. Thus, it would have been obvious to combine the teachings of Banica and Kundu in order to obtain the invention as claimed in Claim 4.
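Purely for illustration (not part of the record), a minimal sketch, assuming a hypothetical keypoint format, of choosing an overlay location away from a tracked body part so a gesture does not obscure the content:

```python
# Illustrative only: given one pose keypoint, place the overlay on the side
# of the frame away from the tracked hand.
def choose_location(pose: dict, frame_w: int = 1280) -> tuple[int, int]:
    hand_x, hand_y = pose["right_hand"]
    # If the hand is on the right half, place content on the left, and
    # vice versa, so the gesture does not cover the animated content.
    x = frame_w // 4 if hand_x > frame_w // 2 else 3 * frame_w // 4
    return (x, hand_y)

pose = {"right_hand": (900, 400)}  # hypothetical output of a pose estimator
print(choose_location(pose))       # (320, 400): opposite side of the frame
```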
As to Claim 5, Banica in view of Kundu teaches a method wherein rendering the animated content further comprises auto-framing a face depicted in the frame of the digital video (see paragraph [0118] of Banica, “Besides saliency, the attention modeling also incorporates top-down, semantic information by identifying faces and text. These are deemed to attract more attention (although they may not be salient from a bottom up perspective)”). It is recognized that the citations and evidence provided above are derived from potentially different embodiments of a single reference. Nevertheless, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains to employ combinations and sub-combinations of these complementary embodiments, because Banica explicitly motivates doing so at least in paragraph [0161], including “Embodiments of the methods disclosed herein may be performed in the operation of such computing devices. The order of the steps presented in the examples above can be varied—for example, steps can be re-ordered, combined, and/or broken into sub-steps. Certain steps or processes can be performed in parallel”, and otherwise motivates combining the method for recognizing faces with the method of placing unobtrusive displays so that displays are not placed over the faces of actors in a digital video.
As to Claim 6, Banica fails to teach a portion of the animated content is layered behind or in front of the object depicted in the frame of the digital video based on determining whether the object depicted in the frame is a salient feature of the digital video. Banica teaches identifying salient features in videos (see paragraph [0118], “These are deemed to attract more attention (although they may not be salient from a bottom-up perspective). The AV of each face takes into account the face size and position (larger faces and faces detected in the center of a frame have a higher AV)”, where the face of the detected person is tracked and deemed ‘salient’, and animated content is located away from ‘salient’ regions), but fails to teach layering. However, Kundu teaches that animated content may be layered to prevent an object detected in the video from being obstructed (see paragraph [0123], “The position of the layers (i.e., the depth position within the layers, for example, a subject layer and a background layer) in a scene can be determined based on the location of the content within or on the layer. By way of an example, the content of the data toward the top of the scene is deprioritized and assigned to the background layer while content toward the middle or bottom of the scene is prioritized to be in the subject layer. This enables the actor to be able to stand and present content which is occluding the actor's lower and possibly middle section without the actor's head being obstructed by the content located higher up in the scene”, where the actor is a salient feature). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the salient object recognition taught by Banica with the layering taught by Kundu. The motivation for doing so would be to prevent important objects in the video from being obstructed, as taught in Kundu in paragraph [0123]. Thus, it would have been obvious to combine the layering taught by Kundu with the system taught by Banica in order to obtain the invention as claimed in Claim 6.
As to Claim 8, Banica fails to teach that a behavior of the animated content is triggered by a word or phrase detected in audio of the digital video. However, Kundu teaches animations can be triggered by a word or phrase (see paragraph [0140], “In some examples, the actor uses speech or sound to trigger an action in the scene. Specifically, the actor can say “next slide” to trigger the action of changing the scene or triggering an action from one slide to the next in a presentation and, similarly, “previous slide” to trigger the action of changing the scene from one slide to the previous slide in a presentation”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow the users to trigger actions and point out animated content without needing to use a computer mouse. Kundu teaches in paragraph [0007], “This is particularly problematic for remote presentations. Today, the actor's video stream is typically displayed in a window completely separate from the content, thereby making it more difficult for the actor to point out content (with anything other than a mouse pointer) and more effectively communicate with the audience.” Thus, it would have been obvious to combine the teachings of Banica and Kundu in order to obtain the invention as claimed in Claim 8.
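For illustration only, a toy sketch of mapping a recognized word or phrase to a triggered behavior in the manner of Kundu's “next slide” example ([0140]); the mapping and names below are hypothetical:

```python
# Illustrative only: a detected phrase in the audio triggers a behavior.
TRIGGERS = {
    "next slide": "advance_scene",
    "previous slide": "rewind_scene",
}

def behavior_for(transcript: str) -> str | None:
    # Return the animation behavior triggered by a detected word or phrase,
    # or None if the transcript contains no trigger phrase.
    for phrase, behavior in TRIGGERS.items():
        if phrase in transcript.lower():
            return behavior
    return None

print(behavior_for("Okay, next slide please"))  # advance_scene
```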
As to Claim 10, Banica teaches a system comprising a memory component and a processing device coupled to the memory component (see paragraph [0008], “In another embodiment, a system has an input device, a display device, a processor, and a memory”), the processing device to perform operations comprising:
receiving, by a processing device, a digital video and data executable to generate animated content (see paragraph [0007], “method receiving, at a computing device, video content and an indication of an overlay to be placed in the video content”),
using a machine learning model (see paragraph [0057], “automated process implements overlay placement and transformation algorithms”)
to determine a behavior for a portion of the animated content (see paragraph [0140], “As described above with reference to FIGS. 1A and 1B, the transformations can include spatial (i.e., reshaping), color, translucency, transparency, and/or resizing transformations. In embodiments, step 912 can comprise applying two main types of effects so that an overlay fits better into video content received in step 902. The first type includes spatial transformations that move the corners of an overlay”)
and displaying, by the processing device, the rendered animated content within the frame of the digital video in a user interface (see paragraph [0008], “The system inserts the overlay into the selected location and renders the video content with the inserted overlay on the display device”).
Banica fails to teach determining a behavior including a specified movement for the animated content relative to the object, based on the digital video and the object. However, Kundu teaches that a specified movement can be determined through a machine learning model (see paragraph [0057], “In some examples, the extraction or isolation or segmentation of the actor can use chroma key green screen, virtual green screen technology, skeletal/human pose recognition/estimation, neural networks trained to isolate the actor within an image or video stream, and the like”) and displayed to the user (see paragraph [0141], “In some examples, such interaction feature can be used by the actor to trigger an animation in a presentation slide from software such as PowerPoint or Google Slides. In some examples, a student (i.e., the actor) in a virtual classroom (i.e., the scene) can virtually touch a virtual flashcard (i.e., the interactive object) to cause it to flip over (i.e., the triggered or caused action)”, where the actor is the object, and where the flashcard being flipped is an example of a specified movement relative to the movement of the actor). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow users to interact with virtual educational content, and thus enhance the student’s learning experience. Kundu teaches in paragraph [0141], “As shown in FIG. 11, the system and method of the present disclosure can be applied in an educational setting to enhance the student's learning ability by adding an interactive element to the learning process.”
Banica fails to teach determining, using the machine learning model, a time for executing the behavior in response to a detected gesture of the object. Banica further fails to teach executing the behavior. However, Kundu teaches determining a time for executing the behavior in response to a detected gesture of the object, and then executing the behavior (see paragraph [0101], “The system identifies the moving fist as a “drag” gesture. The system therefore updates the scene this time to show interactive object 1208 being dragged to the right. The system displays the scene to the user, before the process loop proceeds to begin the next iteration”, where the ‘object’ is the user, the user moves, and the behavior is the interactive object moving to the right). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the timing determination taught by Kundu with the teachings of Banica. The motivation for doing so would be to produce real-time animation effects, as taught by Kundu in paragraph [0005]. Thus, it would have been obvious to combine the teachings of Banica and Kundu in order to obtain the invention as claimed in Claim 10.
As to Claim 11, Banica fails to teach that determining the behavior is based on detected speech in the digital video. However, Kundu teaches that the behavior of an animation can be triggered by a spoken word or phrase (see paragraph [0140], “In some examples, the actor uses speech or sound to trigger an action in the scene. Specifically, the actor can say “next slide” to trigger the action of changing the scene or triggering an action from one slide to the next in a presentation and, similarly, “previous slide” to trigger the action of changing the scene from one slide to the previous slide in a presentation”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow users to interact with virtual educational content, and thus enhance the student’s learning experience as taught by Kundu in paragraph [0141].
As to Claim 13, Banica fails to teach that rendering the animated content further comprises removing a background from the frame of the digital video. However, Kundu teaches that the background can be removed (see paragraph [0071], “In step 1604 the system isolates a user representation from the video data received in Step 1602. In some embodiments, the isolation is done by means of human body segmentation or “selfie segmentation” as is commonly used by video conferencing software to isolate or segment the user's image from the image of their surrounding actual environment so as remove and replace it with a more desirable background”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow the user to replace the background with a more desirable background. Thus, it would have been obvious to combine the teachings of Banica and Kundu in order to obtain the invention as claimed in Claim 13.
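As a minimal illustrative sketch, assuming a per-pixel person mask from some segmentation model (the mask here is hypothetical input, not Kundu's implementation), background removal and replacement reduces to a masked composite:

```python
# Illustrative only: keep frame pixels where the mask marks the person;
# replace the rest with a substitute background.
def composite(frame, mask, background):
    return [
        [f if m else b for f, m, b in zip(frow, mrow, brow)]
        for frow, mrow, brow in zip(frame, mask, background)
    ]

frame = [[1, 2], [3, 4]]        # toy 2x2 "image"
mask = [[1, 0], [0, 1]]         # 1 = person pixel (from any segmentation model)
background = [[9, 9], [9, 9]]   # replacement background
print(composite(frame, mask, background))  # [[1, 9], [9, 4]]
```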
As to Claim 14, Banica fails to teach that a machine learning model determines the behavior based on a location of the object detected in the frame of the digital video. However, Kundu teaches that the behavior of an object can be determined using a machine learning model (see paragraph [0118], “Artificial intelligence and deep learning techniques can be employed for the system to identify the background layer and the subject layer”, where the background layer contains the object) based on a location of an object (see paragraph [0093], “In some embodiments, step 1710 may include the step of utilizing the location (including depth) and/or orientation data from step 1812 to update the three-dimensional scene. In step 1710 the system may do so by utilizing the location of the physical user and/or the virtual location of the virtual user representation to detect the triggering of an interactive object and then execute the associated action… Upon detecting the virtual user now being immediately in front of the virtual automatic sliding door, the system, in step 1710, determines the sliding door interactive object to be triggered, and therefore executes the associated action which is for the doors to slide open. As such, the system in step 1710 updates the three-dimensional scene by sliding the virtual doors open”, where the door is the object, and the behavior of the door sliding was triggered by the location of the user). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow users to easily interact with virtual objects in a scene. Kundu teaches in paragraph [0007], “This is particularly problematic for remote presentations. Today, the actor's video stream is typically displayed in a window completely separate from the content, thereby making it more difficult for the actor to point out content (with anything other than a mouse pointer) and more effectively communicate with the audience. If an actor is placed in front of content (such as by use of a chroma key green screen), the actor occludes some of the very content they are trying to present.” Thus, by tracking the location of an actor, behaviors of virtual objects can be programmed so that they are seen and not accidentally occluded. Thus, it would have been obvious to combine the teachings of Banica and Kundu in order to obtain the invention as claimed in Claim 14.
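For illustration only, the location-triggered behavior of Kundu's sliding-door example ([0093]) can be pictured as a simple proximity test; the geometry below is hypothetical:

```python
# Illustrative only: the object's behavior fires when the tracked user
# location enters the trigger region around the object.
def door_state(user_x: float, door_x: float, radius: float = 1.0) -> str:
    return "open" if abs(user_x - door_x) <= radius else "closed"

print(door_state(user_x=4.2, door_x=5.0))  # open: user within trigger radius
print(door_state(user_x=1.0, door_x=5.0))  # closed
```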
As to Claim 15, Banica fails to teach a machine learning model that determines the behavior based on tracking a pose of a detected person depicted in the frame of the digital video. However, Kundu teaches a machine learning model (see paragraph [0140], “recognition of the motion or movement of the actor can be done using video recognition approaches, such as You Only Look Once (YOLO)”, where YOLO is a well-known machine learning model)
that determines the behavior of an object based on a pose of the user (see paragraph [0102], “Regarding snapshot 2304, with the visual feedback that his virtual self has successfully dragged the virtual piece of candy 1208 to its desired final position, the actual user opens his fist. The image 2304 is displayed showing the user's virtual user representation 1312 having released the piece of candy (an interaction object 1208) in its final desired location. This is an example of a user's gesture/pose triggering a programmed action of an interactive object 1208”, where the movement of the animated candy is an example of a behavior based on tracking the pose of the user).
Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow users to easily interact with virtual objects in a scene. Kundu teaches in paragraph [0007], “This is particularly problematic for remote presentations. Today, the actor's video stream is typically displayed in a window completely separate from the content, thereby making it more difficult for the actor to point out content (with anything other than a mouse pointer) and more effectively communicate with the audience. If an actor is placed in front of content (such as by use of a chroma key green screen), the actor occludes some of the very content they are trying to present.” Thus, by tracking the location of an actor, behaviors of virtual objects can be programmed so that they are seen and not accidentally occluded. Thus, it would have been obvious to combine the teachings of Banica and Kundu in order to obtain the invention as claimed in Claim 15.
As to Claim 21, Banica fails to teach identifying a portion of the object and determining, using the machine learning model, the location within the frame of the digital video to place the animated content that avoids covering the portion of the object. However, Kundu teaches that a portion of the object can be identified (see paragraph [0128], “Once trained, the neural network is used as a classifier by which it can tag, in a binary matter, which regions of the image are most likely part of a human face or body parts”, where the object is the body of a user, and the face and body parts are portions of the user), and that animated content can be placed to avoid covering the portion of the object (see paragraph [0123], “The position of the layers (i.e., the depth position within the layers, for example, a subject layer and a background layer) in a scene can be determined based on the location of the content within or on the layer. By way of an example, the content of the data toward the top of the scene is deprioritized and assigned to the background layer while content toward the middle or bottom of the scene is prioritized to be in the subject layer. This enables the actor to be able to stand and present content which is occluding the actor's lower and possibly middle section without the actor's head being obstructed by the content located higher up in the scene”, where the actor’s head is the portion of the object that is not obstructed). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the portion identification and layering taught by Kundu with the teachings of Banica. The motivation for doing so would be to prevent the user’s face from being obstructed by the animations, as taught by Kundu in paragraph [0123].
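Purely as an illustrative sketch of the placement reasoning (hypothetical geometry, not Kundu's implementation), content can be positioned so it never intersects a tagged face region:

```python
# Illustrative only: rectangles are (x, y, w, h); push the content rectangle
# below the tagged face region so the face stays unobstructed.
def overlaps(a, b) -> bool:
    # Axis-aligned rectangle intersection test.
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def place_below(face, content_w, content_h, frame_h=720):
    fx, fy, fw, fh = face
    y = min(fy + fh + 10, frame_h - content_h)  # just below the face, in frame
    return (fx, y, content_w, content_h)

face = (500, 80, 160, 160)               # tagged face region (hypothetical)
content = place_below(face, 400, 200)
print(content, overlaps(content, face))  # (500, 250, 400, 200) False
```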
Claim(s) 9 and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over Banica et al. (US Pub No 2014/0359656), hereinafter Banica, in view of Kundu (US Pub No 2023/0236660), and further in view of Nguyen (US Pub No 2021/0272363), hereinafter Nguyen.
As to Claim 9, Banica fails to teach that the animated content comprises a behavior including a specified movement for a portion of the animated content selected by a user. However, Nguyen teaches that animated content can comprise a movement selected by a user (see paragraph [0007], “The video prototyping module can receive an input of a motion path sketch on the additional spatial layer, and the augmented reality feature of the spatial layer is assigned to the motion path sketch on the additional spatial layer. The augmented reality feature then displays as an animation that moves according to the motion path sketch during playback of the captured video”). Banica and Nguyen are combinable because both are from analogous fields of image capture, analysis and display. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Nguyen. The motivation for doing so would be to allow the user to control the movement of animated objects in an intuitive manner. Nguyen teaches in paragraph [0026], “The transformation of the movement in 3D space (as six-degrees-of-freedom movement) can be recorded as a motion path, which allows the designer to express complex motion trajectory in a more intuitive manner than with a traditional, spline-based interface.” Thus, it would have been obvious to combine the teachings of Banica and Nguyen in order to obtain the invention as claimed in Claim 9.
As to Claim 17, Banica teaches a non-transitory computer-readable storage medium storing executable instructions (see paragraph [0145], “For example, some functionality performed by client devices 134 a-n and server 104 shown in FIGS. 1A and 1B, can be implemented in the computer system 1000 using hardware, software, firmware, non-transitory computer readable media having instructions stored thereon”),
which when executed by a processing device, cause the processing device to perform operations comprising:
receiving, by a processing device, a digital video and data executable to generate animated content (see paragraph [0007], “method receiving, at a computing device, video content and an indication of an overlay to be placed in the video content”),
detecting an object depicted in a frame of the digital video (see paragraph [0118], “Certain of these embodiments use an attention model that detects the Attention Objects (AOs). Each AO is described by the following values: ROI (region of interest), AV (attention value), and MPS (minimal perceivable size). ROI indicates the area where the object is (and it is represented as a shape such as a rectangle)... The AV of each face takes into account the face size and position (larger faces and faces detected in the center of a frame have a higher AV)”),
using a machine learning model (see paragraph [0057], “automated process implements overlay placement and transformation algorithms”),
to determine a behavior for a portion of the animated content (see paragraph [0140], “As described above with reference to FIGS. 1A and 1B, the transformations can include spatial (i.e., reshaping), color, translucency, transparency, and/or resizing transformations. In embodiments, step 912 can comprise applying two main types of effects so that an overlay fits better into video content received in step 902. The first type includes spatial transformations that move the corners of an overlay”)
and displaying, by the processing device, the rendered animated content within the frame of the digital video in a user interface (see paragraph [0008], “The system inserts the overlay into the selected location and renders the video content with the inserted overlay on the display device”).
Banica fails to teach receiving a behavior including a specified movement for a portion of the animated content, determining an updated behavior based on the digital video and based on the object, and rendering the animated content within a frame of the digital video including the updated behavior determined by the machine learning model. However, Nguyen teaches that specified behaviors can be received, and updated behaviors can be displayed (see paragraph [0007], “The video prototyping module can receive an input of a motion path sketch on the additional spatial layer, and the augmented reality feature of the spatial layer is assigned to the motion path sketch on the additional spatial layer. The augmented reality feature then displays as an animation that moves according to the motion path sketch during playback of the captured video”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Nguyen. The motivation for doing so would be to allow the user to control the movement of animated objects in an intuitive manner. Nguyen teaches in paragraph [0026], “The transformation of the movement in 3D space (as six-degrees-of-freedom movement) can be recorded as a motion path, which allows the designer to express complex motion trajectory in a more intuitive manner than with a traditional, spline-based interface.”
Nguyen fails to explicitly teach that the updated movement is determined based on the digital video and based on the object. However, Kundu teaches that an updated movement can be determined with respect to an object in a digital video (see paragraph [0063], “in step 1313, the system may also take as input and utilize the additional actor information so as to update the scene 1302 based on location, pose, or gestures of the actor 1301. For example, by utilizing location/depth information 1309 and/or pose & gesture information 1311, the positioning & updating step 1313 involves determining if and how the virtual actor makes contact with interactive objects within the three-dimensional scene so as to trigger corresponding actions. Upon actions being triggered, in step 1313, the system updates the three-dimensional scene accordingly to reflect the action (such as visual state change, etc.)”, where the actor is the object in the digital video). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow users to easily interact with virtual objects in a scene. Kundu teaches in paragraph [0007], “This is particularly problematic for remote presentations. Today, the actor's video stream is typically displayed in a window completely separate from the content, thereby making it more difficult for the actor to point out content (with anything other than a mouse pointer) and more effectively communicate with the audience. If an actor is placed in front of content (such as by use of a chroma key green screen), the actor occludes some of the very content they are trying to present.” Thus, by tracking the location of an actor, behaviors of virtual objects can be programmed so that they are seen and not accidentally occluded. Thus, it would have been obvious to combine the teachings of Banica, Nguyen, and Kundu in order to obtain the invention as claimed in Claim 17.
Kundu teaches determining an updated behavior of the animated content, including a time for execution of the behavior, based on a detected movement of the object (see paragraph [0101], “The system identifies the moving fist as a “drag” gesture. The system therefore updates the scene this time to show interactive object 1208 being dragged to the right. The system displays the scene to the user, before the process loop proceeds to begin the next iteration”, where the ‘object’ is the user, the user moves, and the updated behavior is the interactive object moving to the right). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kundu with the teachings of Banica and Nguyen. The motivation for doing so would be to produce real-time animation effects, as taught by Kundu in paragraph [0005].
As to Claim 18, Banica in view of Nguyen fails to teach wherein the machine learning model determines the updated behavior based on a location of an object detected in the frame of the digital video. However, Kundu teaches that the behavior of an animated object can be updated based on the location of an object (see paragraph [0063], “in step 1313, the system may also take as input and utilize the additional actor information so as to update the scene 1302 based on location, pose, or gestures of the actor 1301. For example, by utilizing location/depth information 1309 and/or pose & gesture information 1311, the positioning & updating step 1313 involves determining if and how the virtual actor makes contact with interactive objects within the three-dimensional scene so as to trigger corresponding actions. Upon actions being triggered, in step 1313, the system updates the three-dimensional scene accordingly to reflect the action (such as visual state change, etc.)”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow users to easily interact with virtual objects in a scene. Kundu teaches in paragraph [0007], “This is particularly problematic for remote presentations. Today, the actor's video stream is typically displayed in a window completely separate from the content, thereby making it more difficult for the actor to point out content (with anything other than a mouse pointer) and more effectively communicate with the audience. If an actor is placed in front of content (such as by use of a chroma key green screen), the actor occludes some of the very content they are trying to present.” Thus, by tracking the location of an actor, behaviors of virtual objects can be programmed so that they are seen and not accidentally occluded. Thus, it would have been obvious to combine the teachings of Banica, Nguyen, and Kundu in order to obtain the invention as claimed in Claim 18.
As to Claim 19, Banica in view of Nguyen fails to teach that the machine learning model determines the updated behavior based on tracking a pose of a detected person depicted in the frame of the digital video. However, Kundu teaches that the pose of an actor can be detected (see paragraph [0127], “the system extracts an image of the actor from the video stream or the data feed using various techniques including, but not limited to, use of chroma key green screen, virtual green screen technology, skeletal/human pose recognition/estimation, neural networks trained to isolate the actor within the image”) and that behavior can be updated based on the detected pose (see paragraph [0063], “in step 1313, the system may also take as input and utilize the additional actor information so as to update the scene 1302 based on location, pose, or gestures of the actor 1301. For example, by utilizing location/depth information 1309 and/or pose & gesture information 1311, the positioning & updating step 1313 involves determining if and how the virtual actor makes contact with interactive objects within the three-dimensional scene so as to trigger corresponding actions. Upon actions being triggered, in step 1313, the system updates the three-dimensional scene accordingly to reflect the action (such as visual state change, etc.)”). Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kundu. The motivation for doing so would be to allow users to easily interact with virtual objects in a scene. Kundu teaches in paragraph [0007], “This is particularly problematic for remote presentations. Today, the actor's video stream is typically displayed in a window completely separate from the content, thereby making it more difficult for the actor to point out content (with anything other than a mouse pointer) and more effectively communicate with the audience. If an actor is placed in front of content (such as by use of a chroma key green screen), the actor occludes some of the very content they are trying to present.” Thus, by tracking the location of an actor, behaviors of virtual objects can be programmed so that they are seen and not accidentally occluded. Thus, it would have been obvious to combine the teachings of Banica and Kundu in order to obtain the invention as claimed in Claim 19.
Claim(s) 7 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Banica et al. (US Pub No 2014/0359656), hereinafter Banica, in view of Kundu (US Pub No 2023/0236660), and further in view of Kamemota et al. (US Pub No 2024/0428484), hereinafter Kamemota.
As to Claim 7, Banica fails to teach that the location is based on detected audio of the digital video. However, Kamemota teaches that the location of an animation caption can be based on the detected audio (see paragraph [0062], “As for a character display position, an algorithm for determining a character display position having a predetermined positional relationship with a speaker is stored in the storage 9. The predetermined positional relationship is, for example, such a relationship that a distance to the speaker is within a certain range. As the predetermined positional relationship, a distance to the speaker is preferably close. By reducing the distance to the speaker, when utterance content is displayed as a caption, a viewer can easily specify the speaker of the caption”). Banica and Kamemota are combinable as both are from analogous fields of display of videos and animations. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica with the teachings of Kamemota. The motivation for doing so would be to align the caption with the speaker, and thus make it easier for the viewer to determine who is speaking. Thus, it would have been obvious to combine the teachings of Banica and Kamemota in order to obtain the invention as claimed in Claim 7.
As to Claim 12, Banica in view of Kundu fails to teach further comprising converting the detected speech into text for incorporation into the animated content. However, Kamemota teaches that speech can be converted into text and displayed (see paragraph [0135] and [0136], “It is assumed that the sound indicated by the sound signal includes a narration (speech) “Beautiful autumn leaves”. In this case, the sound data extraction processor 43 extracts the sound “Beautiful autumn leaves” using a sound recognition technique. The conversion processor 44 converts the sound extracted by the sound data extraction processor 43 into a caption signal. In this example, the conversion processor 44 converts data on the sound “Beautiful autumn leaves” obtained from the sound data extraction processor 43 into a caption signal. The conversion processor 44 inputs the converted caption signal to a caption feature extractor 6. A caption is displayed on the display 16 in the display form determined by the display form determiner 10.”). Kamemota further teaches that captions can appear as animations (see paragraph [0107], “For example, when a category of content is drama, a caption is displayed at a position close to a person who speeches words of the caption as animation. As a result, the person who speeches is clearly identified and realistic sensations may become high. Furthermore, for example, when a category of content is variety show, a caption is moved, enlarged, or miniaturized as animation to enhance fun of a program”). Kamemota is combinable with Banica and Kundu because all are from analogous fields of video display. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the method taught by Banica and Kundu with the teachings of Kamemota. The motivation for doing so would be to enhance the user’s experience by using animations to convey the atmosphere of the content. Kamemota teaches in paragraph [0107], “In addition to the example described here, an atmosphere of the content can be more easily given to the user who is the viewer by adding an appropriate animation to the caption depending on a category of the content.” Thus, it would have been obvious to combine the teachings of Banica, Kundu and Kamemota in order to obtain the invention as claimed in Claim 12.
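For illustration only, a minimal sketch, assuming a hypothetical transcriber output as input, of turning detected speech into an animated caption anchored near the speaker in the manner Kamemota describes ([0107], [0135]-[0136]):

```python
# Illustrative only: detected speech text becomes an animated caption
# positioned near the speaker so the viewer can identify who is talking.
from dataclasses import dataclass

@dataclass
class Caption:
    text: str
    x: int
    y: int
    animation: str  # e.g., "slide_in"

def caption_for(transcript: str, speaker_xy: tuple[int, int]) -> Caption:
    x, y = speaker_xy
    # Anchor the caption just below the speaker's position in the frame.
    return Caption(text=transcript, x=x, y=y + 40, animation="slide_in")

print(caption_for("Beautiful autumn leaves", speaker_xy=(640, 300)))
```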
Claim(s) 20 is rejected under 35 U.S.C. 103 as being unpatentable over Banica et al. (US Pub No 2014/0359656), in view of Kundu (US Pub No 2023/0236660), in view of Nguyen (US Pub No 2021/0272363), and further in view of Kamemota et al. (US Pub No 2024/0428484).
As to Claim 20, Banica in view of Nguyen and Kundu fails to teach wherein the machine learning model determines the updated behavior based on audio of the digital video. However, Kamemota teaches that the audio in the scene can cause an updated behavior of animated captions (see paragraph [0115], “It is assumed that the caption feature extractor 6 extracts a caption of explanation "sound of raining". In this case, the display form determiner 10 replaces the caption "sound of raining" by a caption of characters "pouring" indicating sound with reference to the table illustrated in FIG. 8, for example… When "animation" is added to the caption of "pouring", the caption of "pouring" is moved on the display screen, for example”, where the updated movement would be moving the “pouring” caption onto the display, and where the caption feature extractor extracts the caption from the audio of rain in the digital video). Kamemota is combinable with Banica, Kundu, and Nguyen because all four are from analogous fields of image analysis and displaying animations. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Kamemota with the teachings of Banica, Kundu, and Nguyen. The motivation for doing so would be to provide a more realistic sensation. Kamemota teaches in paragraph [0117], “By this, since the explanation about sound is replaced by the characters representing the sound in display, realistic sensations may be given to the user.” Thus, it would have been obvious to combine the teachings of Kamemota with the teachings of Banica, Kundu, and Nguyen in order to obtain the invention as claimed in Claim 20.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Deng et al. (US Pub No 2024/0054714) teaches a method for adding animations to videos, including layering animations on top of objects detected in a video and determining a path of motion for objects in a video. Deng further teaches that the type of animation can be determined from features of objects in the video.
Dickens (US Pat No 10,593,124) discloses a system in which animated effects can be added to objects identified in digital videos. The system can also recognize gestures executed by a person in the frame of the video.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SOUMYA THOMAS whose telephone number is (571)272-8639. The examiner can normally be reached M-F 8:30-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached at (571) 272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.T./Examiner, Art Unit 2664
/JENNIFER MEHMOOD/Supervisory Patent Examiner, Art Unit 2664