Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
35 USC § 101
Claims 1-20 are considered to be patent eligible under 35 USC § 101.
Allowable Subject Matter
Claims 4, 7-12, 17, and 18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-3, 5-6, 13-16, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over MISEIKIS et al. (US 20240331317 A1) in view of Kolve et al. (AI2-THOR: An Interactive 3D Environment for Visual AI) in further view of Hold-Geoffroy et al. (US 20240135612 A1).
Regarding claim 1, Miseikis teaches a system (See title, “INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING SYSTEM AND METHOD”) for enhancing animation media production (See Fig. 8A-B, ¶151-153. See ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.” The synthetic view is an enhanced animation, as the view is no longer obstructed.), comprising:
a computing device having at least one processor, wherein the computing device is in communication with a server through a network (¶35, “Circuitry of the terminal device may include a processor, a memory (RAM, ROM or the like), a storage, input means (mouse, keyboard, camera, etc.), output means (display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc., a (wireless) interface, etc., as it is generally known for electronic devices (computers, smartphones, smart glasses, etc.)”. See Fig. 1, ¶88, “As illustrated in FIG. 1, the information processing system according to the present embodiment includes a terminal device 1, a service provider 30 with several novel synthetic view generators (NVS) 50 (see 230 in FIG. 4 for more details), and a communication network 40.” The services provider is interpreted as a server. See Fig. 2-4. ¶104 for fig. 2. ¶107 for Fig. 3. ¶116 for Fig. 4); and
a memory in communication with said processor configured to store instructions that are executable by said processor (¶35, “Circuitry of the terminal device may include a processor, a memory (RAM, ROM or the like), a storage, input means (mouse, keyboard, camera, etc.), output means (display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc., a (wireless) interface, etc., as it is generally known for electronic devices (computers, smartphones, smart glasses, etc.)”. See ¶172, “The system controller 10 functions as an external situation determination unit 10a that determines an external situation and an operation control unit 10b that give a control instruction to each unit according to a determination result of the external situation determination unit 10a, as illustrated in FIG. 3.” The memory is in communication with processor, so that the processor can carry out the instructions that were stored in memory.),
wherein said processor is configured to execute the stored instructions to cause the system to perform operations (¶35, “Circuitry of the terminal device may include a processor, a memory (RAM, ROM or the like), a storage, input means (mouse, keyboard, camera, etc.), output means (display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc., a (wireless) interface, etc., as it is generally known for electronic devices (computers, smartphones, smart glasses, etc.)”. See ¶172, “The system controller 10 functions as an external situation determination unit 10a that determines an external situation and an operation control unit 10b that give a control instruction to each unit according to a determination result of the external situation determination unit 10a, as illustrated in FIG. 3.” ) comprising:
analyzing media data to identify a plurality of static and dynamic elements in said media data (¶127, “ In NeRF, a deep neural network is trained to predict the radiance (color and density) and occupancy of a 3D point in the scene given its 3D location. The network is trained on a set of input images captured from different viewpoints, which are used to optimize the parameters of the network” the media data are input images. ” The input images may be the media data. ¶169, “In addition or alternatively, based on the relative positions and viewing angle differences between the captured object and the viewpoint of virtual displays 2a and 2b, the correct viewing angle difference can be determined, and a corresponding request can be sent to the render engine. For static objects, such as items inside the box, shop floor, etc., pre-rendering would be done ahead of time to reduce the computational cost of the renderer. Then just the model captured from the requested viewpoint would be delivered to the device. For real-time object request, renderer would have to synthesize the object from the correct viewing angle.” ¶131, “In a case where a scene is assumed not to be static, meaning that the appearance and geometry of the scene are not fixed and change over time, dynamic NeRF (D-NeRF) may be used for generating novel synthetic view.” ¶152, “As shown in FIG. 8a which shows a real view of the user, the user's view onto a stage 650 is partially blocked by three people 620. Due to this blocking of the view the artist 630 is only partially visible to the user. A service provided captures the live event by multiple cameras 610 from different viewpoints. Based on the images captured by cameras 610 the service provider trains a novel synthetic view generator (e.g. a D-NeRF model) time component. The novel synthetic view generator is thus configured to generate views of the live event from arbitrary viewpoints.” ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.” The examiner notes that ¶131 mentions that the appearance and geometry of the scene are not fixed and change over time, this however doesn’t exclude static components from being included as well.);
rendering a three-dimensional (3D) model of a scene with a precise depth and location data based on radiance information and spatial data of said plurality of static and dynamic elements in said media data through a neural network to ensure placement of said plurality of static and dynamic elements in said 3D model of the scene (¶127, “ In NeRF, a deep neural network is trained to predict the radiance (color and density) and occupancy of a 3D point in the scene given its 3D location. The network is trained on a set of input images captured from different viewpoints, which are used to optimize the parameters of the network” The media data is input images. ¶168, “ It should be noted that volume rendering techniques as used in S550 are known to the skilled person. For example, the 2D pixelmap of a novel view may be projected on a 3D representation of the virtual displays 2a, 2b that can be rendered by the VR engine with conventional means.”.¶169, “In addition or alternatively, based on the relative positions and viewing angle differences between the captured object and the viewpoint of virtual displays 2a and 2b, the correct viewing angle difference can be determined, and a corresponding request can be sent to the render engine. For static objects, such as items inside the box, shop floor, etc., pre-rendering would be done ahead of time to reduce the computational cost of the renderer. Then just the model captured from the requested viewpoint would be delivered to the device. For real-time object request, renderer would have to synthesize the object from the correct viewing angle.” ¶131, “In a case where a scene is assumed not to be static, meaning that the appearance and geometry of the scene are not fixed and change over time, dynamic NeRF (D-NeRF) may be used for generating novel synthetic view.” ¶152, “As shown in FIG. 8a which shows a real view of the user, the user's view onto a stage 650 is partially blocked by three people 620. Due to this blocking of the view the artist 630 is only partially visible to the user. A service provided captures the live event by multiple cameras 610 from different viewpoints. Based on the images captured by cameras 610 the service provider trains a novel synthetic view generator (e.g. a D-NeRF model) time component. The novel synthetic view generator is thus configured to generate views of the live event from arbitrary viewpoints.” ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. 
This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.” The examiner notes that ¶131 mentions that the appearance and geometry of the scene are not fixed and change over time, this however doesn’t exclude static components from being included as well.);
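For illustration of the NeRF mechanism quoted from ¶127 above (a fully-connected network predicting radiance and density for a 3D point, which is then volume-rendered into a view with per-ray depth), the following is a minimal sketch in Python/PyTorch. The names TinyNeRF and render_ray and all layer sizes are hypothetical and are not Miseikis's disclosed model; the sketch only demonstrates the general technique.

import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Minimal NeRF-style MLP: (3D point, view direction) -> (RGB, density)."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 3 color channels + 1 density
        )

    def forward(self, xyz, view_dir):
        out = self.mlp(torch.cat([xyz, view_dir], dim=-1))
        rgb = torch.sigmoid(out[..., :3])   # color in [0, 1]
        sigma = torch.relu(out[..., 3])     # non-negative density
        return rgb, sigma

def render_ray(model, origin, direction, near=0.0, far=4.0, n_samples=64):
    """Sample points along one ray and alpha-composite the predicted colors."""
    t = torch.linspace(near, far, n_samples)
    pts = origin + t[:, None] * direction           # (n_samples, 3) sample locations
    dirs = direction.expand(n_samples, 3)
    rgb, sigma = model(pts, dirs)
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-sigma * delta)         # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    color = (weights[:, None] * rgb).sum(dim=0)     # composited pixel color
    depth = (weights * t).sum()                     # expected depth along the ray
    return color, depth

model = TinyNeRF()
color, depth = render_ray(model, torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))

The expected depth computed from the same compositing weights is one way such a radiance-field representation yields per-point depth and location data alongside the rendered view.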
[…]thereby ensuring interaction of said plurality of static and dynamic elements with props and an environment in said 3D model of the scene (¶168, “ It should be noted that volume rendering techniques as used in S550 are known to the skilled person. For example, the 2D pixelmap of a novel view may be projected on a 3D representation of the virtual displays 2a, 2b that can be rendered by the VR engine with conventional means.”.¶169, “In addition or alternatively, based on the relative positions and viewing angle differences between the captured object and the viewpoint of virtual displays 2a and 2b, the correct viewing angle difference can be determined, and a corresponding request can be sent to the render engine. For static objects, such as items inside the box, shop floor, etc., pre-rendering would be done ahead of time to reduce the computational cost of the renderer. Then just the model captured from the requested viewpoint would be delivered to the device. For real-time object request, renderer would have to synthesize the object from the correct viewing angle.” ¶131, “In a case where a scene is assumed not to be static, meaning that the appearance and geometry of the scene are not fixed and change over time, dynamic NeRF (D-NeRF) may be used for generating novel synthetic view.” ¶152, “As shown in FIG. 8a which shows a real view of the user, the user's view onto a stage 650 is partially blocked by three people 620. Due to this blocking of the view the artist 630 is only partially visible to the user. A service provided captures the live event by multiple cameras 610 from different viewpoints. Based on the images captured by cameras 610 the service provider trains a novel synthetic view generator (e.g. a D-NeRF model) time component. The novel synthetic view generator is thus configured to generate views of the live event from arbitrary viewpoints.” ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.” A prop can be any object within the scene.);
monitoring and mapping said environment in said 3D model of the scene through a simultaneous localization and mapping system in real-time, thereby ensuring accurate interactions between said plurality of dynamic elements and said 3D model of the scene (¶109, “The tracking unit 130 is configured to detect the position information of the terminal device 1. Here, the position information of the terminal device 1 may be detected through any method. For example, a positioning sensor (for example, global positioning system (GPS) sensor, see 21 in FIG. 9) may generate positioning data (the latitude and the longitude) of the terminal device 1 in the real space on the basis of an arrival period of time (a difference between a transmission time and a reception time) of a signal received from each GPS satellite by the terminal device 1. In addition, a so-called simultaneous localization and mapping (SLAM) technology may also be used in self-position estimation of the terminal device 1. SLAM refers to a technology that executes localization and the creation of an environment map in parallel by utilizing an imaging unit such as a camera, various sensors, an encoder, and the like. As a more specific example, in SLAM (in particular, visual SLAM), a three-dimensional shape of a captured scene (or a subject) is successively reconstructed on the basis of a moving image captured by an imaging unit. Then, creation of a surrounding environmental map and estimation of the position and posture of an imaging unit (and consequently, the terminal device 1) in the environment are performed by associating a reconstruction result of the captured scene with a detection result of the position and posture of the imaging unit. Note that, for example, various types of sensors such as an acceleration sensor or an angular velocity sensor are provided in the terminal device 1, and thereby it is possible to estimate the position and posture of the imaging unit as information indicating a relative change on the basis of a detection result of the sensors. Obviously, as long as the position and the attitude of the imaging unit can be estimated, the method is not necessarily limited only to a method based on the detection results of various sensors such as an acceleration sensor and an angular velocity sensor. SLAM is described in detail in, for example, “Real-Time Simultaneous Localization and Mapping with a Single Camera” (Andrew J. Davison, Proceedings of the 9th IEEE International Conference on Computer Vision Volume 2, 2003, pp. 1403-1410).” ¶142, “Once the novel synthetic view generator (e.g. NeRF model) is trained, it is used in conjunction with AR technology to provide a live insight into the mall. Users who wear AR headsets can look around the mall, with the novel synthetic view generator overlaying a virtual representation of specific places within the shopping mall (such as shops, etc.) onto their view. This virtual representation may be updated in real-time based on the user's position and orientation, which could be achieved through the use of tracking technologies such as SLAM.”);
tracking said plurality of dynamic elements within a dynamic scene of said 3D model of the scene through said simultaneous localization and mapping system to maintain consistent and accurate relative positions of said plurality of dynamic elements (¶109, “The tracking unit 130 is configured to detect the position information of the terminal device 1. Here, the position information of the terminal device 1 may be detected through any method. For example, a positioning sensor (for example, global positioning system (GPS) sensor, see 21 in FIG. 9) may generate positioning data (the latitude and the longitude) of the terminal device 1 in the real space on the basis of an arrival period of time (a difference between a transmission time and a reception time) of a signal received from each GPS satellite by the terminal device 1. In addition, a so-called simultaneous localization and mapping (SLAM) technology may also be used in self-position estimation of the terminal device 1. SLAM refers to a technology that executes localization and the creation of an environment map in parallel by utilizing an imaging unit such as a camera, various sensors, an encoder, and the like. As a more specific example, in SLAM (in particular, visual SLAM), a three-dimensional shape of a captured scene (or a subject) is successively reconstructed on the basis of a moving image captured by an imaging unit. Then, creation of a surrounding environmental map and estimation of the position and posture of an imaging unit (and consequently, the terminal device 1) in the environment are performed by associating a reconstruction result of the captured scene with a detection result of the position and posture of the imaging unit. Note that, for example, various types of sensors such as an acceleration sensor or an angular velocity sensor are provided in the terminal device 1, and thereby it is possible to estimate the position and posture of the imaging unit as information indicating a relative change on the basis of a detection result of the sensors. Obviously, as long as the position and the attitude of the imaging unit can be estimated, the method is not necessarily limited only to a method based on the detection results of various sensors such as an acceleration sensor and an angular velocity sensor. SLAM is described in detail in, for example, “Real-Time Simultaneous Localization and Mapping with a Single Camera” (Andrew J. Davison, Proceedings of the 9th IEEE International Conference on Computer Vision Volume 2, 2003, pp. 1403-1410).” ¶142, “Once the novel synthetic view generator (e.g. NeRF model) is trained, it is used in conjunction with AR technology to provide a live insight into the mall. Users who wear AR headsets can look around the mall, with the novel synthetic view generator overlaying a virtual representation of specific places within the shopping mall (such as shops, etc.) onto their view. This virtual representation may be updated in real-time based on the user's position and orientation, which could be achieved through the use of tracking technologies such as SLAM.”); and
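To illustrate the SLAM behavior quoted from ¶109 (localization and creation of an environment map executed in parallel, with tracked landmark positions kept consistent as the device moves), the following is a minimal toy sketch. It is not Davison's visual SLAM and not Miseikis's tracking unit 130; the function and variable names are hypothetical and the update rule is deliberately simplified (a running average instead of a probabilistic filter).

import math

def update_map_and_pose(pose, odometry, observations, landmark_map):
    """Toy SLAM step: advance the pose by odometry, then place each observed
    landmark into the world map using the updated pose estimate."""
    x, y, theta = pose
    dx, dy, dtheta = odometry
    # Dead-reckoning pose update (a real system would fuse this with observations).
    x += dx * math.cos(theta) - dy * math.sin(theta)
    y += dx * math.sin(theta) + dy * math.cos(theta)
    theta += dtheta
    for lid, (rng, bearing) in observations.items():
        # Project the range/bearing observation into world coordinates.
        lx = x + rng * math.cos(theta + bearing)
        ly = y + rng * math.sin(theta + bearing)
        if lid in landmark_map:
            # Running average keeps the landmark estimate consistent across frames.
            ox, oy, n = landmark_map[lid]
            landmark_map[lid] = ((ox * n + lx) / (n + 1), (oy * n + ly) / (n + 1), n + 1)
        else:
            landmark_map[lid] = (lx, ly, 1)
    return (x, y, theta), landmark_map

pose, mapped = (0.0, 0.0, 0.0), {}
pose, mapped = update_map_and_pose(pose, (1.0, 0.0, 0.1), {"artist": (2.0, 0.0)}, mapped)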
but does not explicitly disclose:
providing depth maps through said neural network;
identifying and adjusting optimal positions of said plurality of dynamic and static elements in said 3D model of the scene through one or more distributed artificial intelligence (AI) agents to ensure that said plurality of static and dynamic elements are precisely placed in said 3D model of the scene with respect to depth and interaction,
whereby said system analyses said 3D model of the scene and continuously gather feedback on placements, interactions, and adaptations and make adjustments accordingly for enhancing animation media production in real-time.
Kolve teaches identifying and adjusting optimal positions of said plurality of dynamic and static elements in said 3D model of the scene through one or more distributed artificial intelligence (AI) agents to ensure that said plurality of static and dynamic elements are precisely placed in said 3D model of the scene with respect to depth and interaction (See Fig. 7, “Figure 7: Examples of image modalities supported in AI2-THOR, including RGB, depth, semantic segmentation, instance segmentation, and normals.” See page 5, “Figure 7 shows a suite of different image modalities that can be rendered from each of the cameras in the scene, including RGB, depth, semantic segmentation, instance segmentation, and normals. Each agent comes with a camera attached to it, but more cameras can also be added, such as one to capture a top-down view of the scene. More image modalities can be added by modifying the Unity back-end (often by adding shaders).” See page 5, “Environment metadata is returned after each action is executed. It includes information such as the pose of each agent; the pose and state of each object in the scene (e.g., whether the object is moving, if it is visible to the agent, how far open it is, if it is clean or dirty); metadata about the scene, such as its size; and if the most recent action executed successfully (e.g., the agent did not collide with an object while trying to move). Metadata is often not provided to the agent for most tasks, as it would make the tasks too simple and easily solvable with a
heuristic. Instead, many tasks use metadata to build a reward function with access to “expert-level” information that is hidden from the agent, build an imitation learning expert, and construct training and evaluation datasets.” Page 1-2, “Interactions. AI2-THOR supports many types of interactions, including object state changes, arm-based
manipulation, and causal interactions. For example, a microwave can be opened or closed, a loaf of bread can be sliced and toasted in the toaster, and a faucet can be turned on to fill a mug with water. Figure 6 shows some examples of interactions supported in AI2-THOR.” The examiner notes that the precise placement with respect to depth and interaction is based upon the agent not colliding with objects and the agent having expert-level interactions.),
whereby said system analyses said 3D model of the scene and continuously gather feedback on placements, interactions, and adaptations and make adjustments accordingly for enhancing animation media production in real-time (See abstract, “AI2-THOR consists of near photo-realistic 3D indoor scenes, where AI agents can navigate in the scenes and interact with objects to perform tasks.” See page 11, “When training this means that, in practice, important "tricks" are employed to ensure that scene changes are infrequent or synchronized, without these tricks, performance may be dramatically lower.” A synchronized scene is considered to be a scene that is occurring in real time.).
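The agent/metadata loop relied upon from Kolve can be illustrated with the publicly available ai2thor Python package that accompanies the cited paper. The following is a minimal, assumed-typical usage sketch (the scene name and the retry logic are illustrative only, and running it requires the package and its Unity build to be installed):

from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1")    # load an interactive indoor scene
event = controller.step(action="MoveAhead")    # issue a navigation action

# The returned metadata mirrors what the paper describes: agent pose, object
# poses/states, and whether the last action succeeded (e.g. no collision).
agent_position = event.metadata["agent"]["position"]
collided = not event.metadata["lastActionSuccess"]

# A placement-feedback loop in the spirit of the rejection: adjust and retry
# whenever the attempted move collides with scene geometry.
if collided:
    event = controller.step(action="RotateRight")
    event = controller.step(action="MoveAhead")

visible_objects = [o["objectId"] for o in event.metadata["objects"] if o["visible"]]
controller.stop()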
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Miseikis in view of Kolve, as applying the AI agents of Kolve to the 3D graphics scene of Miseikis yields results that are a predictable use of prior art elements according to their known, established functions, rendering the combination obvious.
Miseikis in view of Kolve does not explicitly disclose providing depth maps through said neural network.
Hold-Geoffroy teaches providing depth maps through said neural network (¶543, “In one or more embodiments, the depth estimation/refinement model 4104 includes a depth estimation neural network to generate a depth map including per-pixel depth values for the digital image 4102 relative to a view of the digital image 4102.”).
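As an illustration of a depth-estimation neural network producing a per-pixel depth map for an image (as quoted from Hold-Geoffroy ¶543), a minimal encoder-decoder sketch in Python/PyTorch follows. The class TinyDepthNet and its layer sizes are hypothetical and are not the depth estimation/refinement model 4104 of the reference.

import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Minimal encoder-decoder that maps an RGB image to a per-pixel depth map."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
            nn.Softplus(),  # depth values are non-negative
        )

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))  # (N, 1, H, W) per-pixel depth

image = torch.rand(1, 3, 64, 64)     # stand-in for a single video frame
depth_map = TinyDepthNet()(image)    # depth map with the same spatial size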
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Miseikis in view of Kolve in further view of Hold-Geoffroy, as using a depth map enhances 3D scene understanding, improves visual effects, and simplifies 3D model creation.
Regarding claim 2, Miseikis in view of Kolve in further view of Hold-Geoffroy teaches the system of claim 1, wherein the system is configured to […]
of static and dynamic elements […] (See Misekis (¶168, “ It should be noted that volume rendering techniques as used in S550 are known to the skilled person. For example, the 2D pixelmap of a novel view may be projected on a 3D representation of the virtual displays 2a, 2b that can be rendered by the VR engine with conventional means.”.¶169, “In addition or alternatively, based on the relative positions and viewing angle differences between the captured object and the viewpoint of virtual displays 2a and 2b, the correct viewing angle difference can be determined, and a corresponding request can be sent to the render engine. For static objects, such as items inside the box, shop floor, etc., pre-rendering would be done ahead of time to reduce the computational cost of the renderer. Then just the model captured from the requested viewpoint would be delivered to the device. For real-time object request, renderer would have to synthesize the object from the correct viewing angle.” ¶131, “In a case where a scene is assumed not to be static, meaning that the appearance and geometry of the scene are not fixed and change over time, dynamic NeRF (D-NeRF) may be used for generating novel synthetic view.” ¶152, “As shown in FIG. 8a which shows a real view of the user, the user's view onto a stage 650 is partially blocked by three people 620. Due to this blocking of the view the artist 630 is only partially visible to the user. A service provided captures the live event by multiple cameras 610 from different viewpoints. Based on the images captured by cameras 610 the service provider trains a novel synthetic view generator (e.g. a D-NeRF model) time component. The novel synthetic view generator is thus configured to generate views of the live event from arbitrary viewpoints.” ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.”).
adjust lighting and perspective for said plurality of static and dynamic elements of said 3D model of the scene to ensure that plurality of static and dynamic elements match real-world conditions.
Hold-Geoffroy teaches adjusting lighting and perspective for […] said 3D model of the scene to ensure that plurality of static and dynamic elements match real-world conditions (See Hold-Geoffroy ¶304, “In one or more implementations, the scene-based image editing system 106 utilizes a depth estimation neural network to estimate lighting parameters for an object or scene in a digital image and stores the determined lighting parameters in the semantic scene graph 1412. For example, the scene-based image editing system 106 utilizes a source-specific-lighting-estimation-neural network as described in U.S. application Ser. No. 16/558,975, filed Sep. 3, 2019, titled “DYNAMICALLY ESTIMATING LIGHT-SOURCE-SPECIFIC PARAMETERS FOR DIGITAL IMAGES USING A NEURAL NETWORK,” which is herein incorporated by reference in its entirety. The scene-based image editing system 106 then accesses the lighting parameters for an object or scene from the semantic scene graph 1412 when editing an object to perform a realistic scene edit. For example, when moving an object within an image or inserting a new object in a digital image, the scene-based image editing system 106 accesses the lighting parameters for from the semantic scene graph 1412 to ensure that the object being moved/placed within the digital image has realistic lighting.” ¶532, “As mentioned, in one or more embodiments the scene-based image editing system 106 provides editing of two-dimensional images based on three-dimensional (“3D”) characteristics of scenes of the two-dimensional (“2D”) images. Specifically, the scene-based image editing system 106 processes a two-dimensional image utilizing a plurality of models to determine a three-dimensional understanding of a two-dimensional scene in the two-dimensional image. The scene-based image editing system 106 also provides tools for editing the two-dimensional image, such as by moving objects within the two-dimensional image or inserting objects into the two-dimensional image.”).
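To illustrate the cited mechanism of storing estimated lighting parameters in a scene graph and applying them when an object is moved or inserted (Hold-Geoffroy ¶304), the following plain-Python sketch uses a simple Lambertian shade factor. The structure and names (scene_graph, apply_lighting, insert_object) are hypothetical and do not reflect the actual implementation of the semantic scene graph 1412.

scene_graph = {
    "lighting": {"direction": (0.3, -1.0, 0.2), "intensity": 0.8, "ambient": 0.25},
    "objects": {},
}

def apply_lighting(obj_normal, lighting):
    """Lambertian-style shade factor so a moved/inserted object matches the scene."""
    lx, ly, lz = lighting["direction"]
    nx, ny, nz = obj_normal
    # Normalize the light direction and take the clamped dot product with the normal.
    norm = (lx * lx + ly * ly + lz * lz) ** 0.5
    dot = max(0.0, -(lx * nx + ly * ny + lz * nz) / norm)
    return lighting["ambient"] + lighting["intensity"] * dot

def insert_object(graph, name, position, normal):
    # Look up the scene's stored lighting parameters when placing the object.
    graph["objects"][name] = {
        "position": position,
        "shade": apply_lighting(normal, graph["lighting"]),
    }

insert_object(scene_graph, "prop_chair", (1.0, 0.0, 2.0), (0.0, 1.0, 0.0))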
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Miseikis in view of Kolve in further view of Hold-Geoffroy, as ensuring that the 3D model has realistic lighting would be obvious to apply to the static and dynamic elements, since Hold-Geoffroy already takes into account the moving positions of objects. It would be a predictable way to improve the static and dynamic elements, making it an obvious solution to the problem of creating a more realistic model.
Regarding claim 3, Miseikis in view of Kolve in further view of Hold-Geoffroy teaches the system of claim 1, wherein the neural network is a neural radiance field (NeRF) system (See Miseikis ¶68, “The novel synthetic view generator may for example be a Deep Neural Network (DNN) or algorithm. In particular an algorithm representing a scene using a fully-connected (non-convolutional) deep network called Neural Radiance Fields for View Synthesis NeRF.”).
Regarding claim 5, Miseikis in view of Kolve in further view of Hold-Geoffroy teaches the system of claim 1, wherein the media data includes at least one of image files (See Miseikis ¶127. See MPEP 2173.05(h)).
Regarding claim 6, Miseikis in view of Kolve in further view of Hold-Geoffroy teaches the system of claim 1, wherein the static and dynamic elements includes animation characters (See Miseikis Fig. 8B, ¶153, “ FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.”), animation elements (See Fig. 8B, ¶153, the artist can also be considered to have animation elements), objects (See Kolve Fig. 6c-d, “Figure 6: Examples of actions supported in AI2-THOR, including navigation actions (e.g. movement), interactive actions (e.g. object state changes and grasping), environment queries (e.g. finding the shortest path), and environment state changes (e.g. randomizing materials).”), and props (See Kolve Fig. 6c-d, “Figure 6: Examples of actions supported in AI2-THOR, including navigation actions (e.g. movement), interactive actions (e.g. object state changes and grasping), environment queries (e.g. finding the shortest path), and environment state changes (e.g. randomizing materials).”).
Regarding claim 13, Miseikis in view of Kolve in further view of Hold-Geoffroy teaches the system of claim 1, wherein the one or more animation parameters includes pose, orientation and lighting of said 3D model of the scene (See Miseikis ¶50 and ¶52, where position and orientation also refer to pose. ¶53 mentions a 3D effect, meaning the model is 3D. ¶140 mentions reconstructing a 3D scene; the model of this scene is 3D. ¶129, “One of the key advantages of NeRF is its ability to handle highly complex scenes with varying lighting conditions and dynamic objects. This is because the neural network can learn to model the scene's appearance and lighting conditions as a continuous function, rather than relying on a discrete set of geometry and texture information” The examiner notes that the varying lighting conditions are different lighting parameters).
Regarding claim 14, Miseikis teaches a method (See title, “INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING SYSTEM AND METHOD”) for enhancing animation media production using a system (See Fig. 8A-B, ¶151-153. See ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.” The synthetic view is an enhanced animation, as the view is no longer obstructed.), comprising:
enabling a user to access said system by providing user credentials through a user interface of a computing device (See ¶108, “The communication interface 120 performs transmission and reception of data with an external device, where the data may be any data necessary for implementing the processes described with reference to the embodiments described below in more detail. According to the present embodiment, the external device is a server of e.g. a service provider 30. The data transmitted or received via the communication interface 120 may be position data (position information provided from the tracking unit 130), image data (images captured by the imaging unit 160), audio data, or the like. The communication interface 120 may, for example, be implemented by the communication unit 26 described with reference to FIG. 16.” Any data necessary for implementing the processes includes credentials for accessing the service provider.),
wherein said computing device having at least one processor and a memory in communication with said processor configured to store instructions that are executable by said processor (¶35, “Circuitry of the terminal device may include a processor, a memory (RAM, ROM or the like), a storage, input means (mouse, keyboard, camera, etc.), output means (display (e.g. liquid crystal, (organic) light emitting diode, etc.), loudspeakers, etc., a (wireless) interface, etc., as it is generally known for electronic devices (computers, smartphones, smart glasses, etc.)”. See ¶172, “The system controller 10 functions as an external situation determination unit 10a that determines an external situation and an operation control unit 10b that give a control instruction to each unit according to a determination result of the external situation determination unit 10a, as illustrated in FIG. 3.” The memory is in communication with processor, so that the processor can carry out the instructions that were stored in memory.);
analyzing media data to identify a plurality of static and dynamic elements in said media data (¶127, “ In NeRF, a deep neural network is trained to predict the radiance (color and density) and occupancy of a 3D point in the scene given its 3D location. The network is trained on a set of input images captured from different viewpoints, which are used to optimize the parameters of the network” the media data are input images. ” The input images may be the media data. ¶169, “In addition or alternatively, based on the relative positions and viewing angle differences between the captured object and the viewpoint of virtual displays 2a and 2b, the correct viewing angle difference can be determined, and a corresponding request can be sent to the render engine. For static objects, such as items inside the box, shop floor, etc., pre-rendering would be done ahead of time to reduce the computational cost of the renderer. Then just the model captured from the requested viewpoint would be delivered to the device. For real-time object request, renderer would have to synthesize the object from the correct viewing angle.” ¶131, “In a case where a scene is assumed not to be static, meaning that the appearance and geometry of the scene are not fixed and change over time, dynamic NeRF (D-NeRF) may be used for generating novel synthetic view.” ¶152, “As shown in FIG. 8a which shows a real view of the user, the user's view onto a stage 650 is partially blocked by three people 620. Due to this blocking of the view the artist 630 is only partially visible to the user. A service provided captures the live event by multiple cameras 610 from different viewpoints. Based on the images captured by cameras 610 the service provider trains a novel synthetic view generator (e.g. a D-NeRF model) time component. The novel synthetic view generator is thus configured to generate views of the live event from arbitrary viewpoints.” ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.” The examiner notes that ¶131 mentions that the appearance and geometry of the scene are not fixed and change over time, this however doesn’t exclude static components from being included as well.);
rendering a three-dimensional (3D) model of a scene with a precise depth and location data based on radiance information and spatial data of said plurality of static and dynamic elements in said media data through a neural network to ensure placement of said plurality of static and dynamic elements in said 3D model of the scene (¶127, “ In NeRF, a deep neural network is trained to predict the radiance (color and density) and occupancy of a 3D point in the scene given its 3D location. The network is trained on a set of input images captured from different viewpoints, which are used to optimize the parameters of the network” The media data is input images. ¶168, “ It should be noted that volume rendering techniques as used in S550 are known to the skilled person. For example, the 2D pixelmap of a novel view may be projected on a 3D representation of the virtual displays 2a, 2b that can be rendered by the VR engine with conventional means.”.¶169, “In addition or alternatively, based on the relative positions and viewing angle differences between the captured object and the viewpoint of virtual displays 2a and 2b, the correct viewing angle difference can be determined, and a corresponding request can be sent to the render engine. For static objects, such as items inside the box, shop floor, etc., pre-rendering would be done ahead of time to reduce the computational cost of the renderer. Then just the model captured from the requested viewpoint would be delivered to the device. For real-time object request, renderer would have to synthesize the object from the correct viewing angle.” ¶131, “In a case where a scene is assumed not to be static, meaning that the appearance and geometry of the scene are not fixed and change over time, dynamic NeRF (D-NeRF) may be used for generating novel synthetic view.” ¶152, “As shown in FIG. 8a which shows a real view of the user, the user's view onto a stage 650 is partially blocked by three people 620. Due to this blocking of the view the artist 630 is only partially visible to the user. A service provided captures the live event by multiple cameras 610 from different viewpoints. Based on the images captured by cameras 610 the service provider trains a novel synthetic view generator (e.g. a D-NeRF model) time component. The novel synthetic view generator is thus configured to generate views of the live event from arbitrary viewpoints.” ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. 
This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.” The examiner notes that ¶131 mentions that the appearance and geometry of the scene are not fixed and change over time, this however doesn’t exclude static components from being included as well.);
[…]thereby ensuring interaction of said plurality of static and dynamic elements with props and an environment in said 3D model of the scene (¶168, “ It should be noted that volume rendering techniques as used in S550 are known to the skilled person. For example, the 2D pixelmap of a novel view may be projected on a 3D representation of the virtual displays 2a, 2b that can be rendered by the VR engine with conventional means.”.¶169, “In addition or alternatively, based on the relative positions and viewing angle differences between the captured object and the viewpoint of virtual displays 2a and 2b, the correct viewing angle difference can be determined, and a corresponding request can be sent to the render engine. For static objects, such as items inside the box, shop floor, etc., pre-rendering would be done ahead of time to reduce the computational cost of the renderer. Then just the model captured from the requested viewpoint would be delivered to the device. For real-time object request, renderer would have to synthesize the object from the correct viewing angle.” ¶131, “In a case where a scene is assumed not to be static, meaning that the appearance and geometry of the scene are not fixed and change over time, dynamic NeRF (D-NeRF) may be used for generating novel synthetic view.” ¶152, “As shown in FIG. 8a which shows a real view of the user, the user's view onto a stage 650 is partially blocked by three people 620. Due to this blocking of the view the artist 630 is only partially visible to the user. A service provided captures the live event by multiple cameras 610 from different viewpoints. Based on the images captured by cameras 610 the service provider trains a novel synthetic view generator (e.g. a D-NeRF model) time component. The novel synthetic view generator is thus configured to generate views of the live event from arbitrary viewpoints.” ¶153, “FIG. 8b shows a user's view on the stage 650 after activating novel synthetic view generation. The terminal device performs image recognition on images taken by a camera of the terminal device to identify if the user's view is obstructed. If the image recognition provides the result that the user's view on stage 650 is obstructed, novel synthetic view presentation is switched on in the terminal device. A request is sent to a service provider 30 to generate a novel synthetic view of the live event based on the position, orientation and imaging characteristics of the user's terminal device. A novel synthetic view of stage 650 is then generated by the service provider based on the parameters of the terminal device and sent to the user's terminal device. The novel synthetic view 640 (dashed rectangle) of the live event generated by the service provider is presented to the user. In this way, the user is enabled to see the artist 630 on stage 650 as if the view on the stage 650 were unobstructed. This embodiment provides an experience to the user which is as close to a live event as possible while removing obstructions from the field of view.” A prop can be any object within the scene.);
monitoring and mapping said environment in said 3D model of the scene through a simultaneous localization and mapping system in real-time, thereby ensuring accurate interaction