DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
No information disclosure statement was submitted with the present application filed on April 12, 2024.
Priority
The present application claims the benefit of a provisional application filed on April 19, 2023.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 11, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Abdo (US 20220319054 A1) in view of Wu (“Location prediction on trajectory data: A review”; copy provided by examiner).
Regarding claim 1,
Abdo teaches: 1. A scene encoding generating apparatus, comprising:
(Abstract “Methods, systems, and apparatus, including computer programs encoded on computer storage media, for predicting scene flow”; [0024] “The scene flow prediction system 150 processes the point clouds 132 to generate a scene flow output.”)
a communication interface; ([0106] “a computing system that includes…a client computer having a graphical user interface…components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network”)
and a processor, coupled to the communication interface, and the processor is configured to execute the following operations ([0098] “…‘data processing apparatus’ refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.”; [0101] “As used in this specification, an ‘engine,’ or ‘software engine,’ refers to a software implemented input/output system…Each engine can be implemented on any appropriate type of computing device…that includes one or more processors…”; Refer to [0106], quoted in the previous limitation, which describes the interconnectivity (i.e., the “coupled to” relationship) of the components of the system. Because the system includes both a processor and a communication interface, their interconnection constitutes the claimed coupling relationship.):
receiving a position and a movement state in a first time point of each of a plurality of obstacles ([0103] “central processing unit will receive instructions and data”; Abdo teaches receiving a current (i.e., first time point) point cloud representing an observed scene, including location coordinates (Abstract “obtaining a current point cloud representing an observed scene at a current time point; obtaining object label data that identifies a first three-dimensional region in the observed scene”; [0027] “A point cloud generally includes multiple points that represent a sensor measurement of a scene in an environment captured by one or more sensors. Each point has a location in a specified coordinate system, e.g., a three-dimensional coordinate system centered at the sensor, and can optionally be associated with additional features, e.g., intensity, second return, and so on.”). Abdo further teaches generating “scene flow labels” that represent the motion (e.g., velocity) of each point at the time the point cloud was generated by using ground truth information ([0009]-[0010] “This specification describes a system…that generates scene flow labels for point clouds…A ‘scene flow label’ for a given point cloud identifies, for some or all of the three-dimensional points in the given point cloud, a respective motion vector that represents the motion of the point as of the time that the given point cloud was generated… the motion vector can specify the velocity of the point along each of one or more directions.”; Abstract “a scene flow label for the current point cloud that comprises a respective ground truth motion vector for each of a plurality of the current three-dimensional points”; See [0060]-[0061], including “…the motion vector for a given point can include, for each of multiple directions, a respective velocity component in the direction in the reference frame of the laser sensor at the current time point”). The ground truth information is available at each time step, allowing the system to receive object state information associated with the scene at that time. For example, an object state (e.g., position and movement parameters) at a first time point is used with later scene information to determine displacement between frames. Accordingly, Abdo teaches receiving movement state information at a first point in time, which is later used to determine motion between frames. Abdo further teaches that the generation of scene flow labels described in the disclosure is used to identify positions of objects within point clouds to generate ground truth motion, and that the described techniques use object data ([0014]). Abdo further teaches, as an example, that the prediction of scene flow enables autonomous vehicles to identify potential obstacles ([0011]). Thus, Abdo teaches the position and movement state at a first time point may correspond to a plurality of objects, including obstacles.)
generating a local coordinate system corresponding to each of the obstacles based on the position and the movement state corresponding to each of the obstacles (Abdo teaches generating scene flow labels for point clouds captured at different time points based on ground-truth information including motion (i.e., movement states) and three-dimensional regions ([0037]-[0040] “The training system 120 receives raw training examples from vehicles operating in the real world and/or from other sources… raw training example 155 includes two point clouds: (i) a current point cloud generated at a current time point and representing an observed scene in an environment at the current time point and (ii) an earlier point cloud generated at an earlier time point and representing the observed scene in the environment at the earlier time point… The new training example 175 includes… a scene flow label that includes a respective ground truth motion vector for each of a plurality of the points in the current point cloud… the training system 120 generates the new training example 175 by generating the scene flow label for the current point cloud in the raw training example 155 using object label data that identifies three-dimensional regions of the environment that contain objects.”; [0070] “The system determines, from a pose of the given current three-dimensional region in the reference frame of the sensor at the current time point, a current pose of the given object at the current time point in the reference frame of the sensor at the current time point (step 306). That is, the system sets the current pose of the given object to be the pose of the given current three-dimensional region. The pose of the given current three-dimensional region can also be represented as a transformation matrix that includes 3D translation and rotational components.”). Thus, Abdo teaches associating spatial coordinates of points corresponding to objects (i.e., obstacles [0011]) with motion vectors that describe the movement of those points between time steps, based on observed ground truth object pose and object label data, to generate a local coordinate system.)
transforming the position and the movement state corresponding to each of the obstacles into the local coordinate system of the corresponding obstacle to generate a local position and a local movement state of the corresponding obstacle (Refer back to [0037]-[0040] and [0070], quoted in the previous limitation. Abdo teaches associating spatial coordinates of points corresponding to objects (i.e., obstacles [0011]) with motion vectors that describe the movement of those points between time steps, based on observed ground truth object pose and object label data, thereby providing positional and local movement information corresponding to the obstacles relative to the corresponding object pose.)
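For illustration only, the following is a minimal sketch of the kind of pose-based world-to-local transformation (translation plus rotation) described in Abdo [0070]. The function name, the 2D simplification, and the example values are assumptions made for illustration; they are not Abdo's disclosed implementation.

```python
import numpy as np

def world_to_local(obstacle_pose_xy, obstacle_heading, world_point_xy, world_velocity_xy):
    """Transform a world-frame position and velocity into an obstacle-local
    frame defined by the obstacle's pose (translation + rotation), analogous
    to the translation/rotation transformation matrix of Abdo [0070].
    2D is used only to keep the sketch short; the same idea extends to 3D."""
    c, s = np.cos(obstacle_heading), np.sin(obstacle_heading)
    # Rotation from the world frame into the obstacle-local frame (inverse rotation).
    R_inv = np.array([[c, s],
                      [-s, c]])
    local_position = R_inv @ (world_point_xy - obstacle_pose_xy)  # translate, then rotate
    local_velocity = R_inv @ world_velocity_xy                    # velocities rotate only
    return local_position, local_velocity

# Example: an obstacle at (10, 5) heading 90 degrees; a point at (10, 7) moving +x in world.
pos, vel = world_to_local(np.array([10.0, 5.0]), np.pi / 2,
                          np.array([10.0, 7.0]), np.array([1.0, 0.0]))
print(pos, vel)  # point lies 2 m ahead in the local frame; velocity points to local -y
```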
generating a first [point cloud] corresponding to the obstacles based on the local positions and the local movement states corresponding to the obstacles, wherein the first [point cloud] corresponds to the first time point (Abdo teaches generating point clouds based on the positions and movement states of objects, including obstacles, at a previous and a current time point, as previously discussed.)
and inputting the first [point cloud] into a scene encoder to generate a first scene encoding, wherein the first scene encoding corresponds to the first time point corresponding to the first [point cloud] ([0092] “The system processes the most recent and earlier point clouds through an encoder neural network to generate respective embeddings for each of the most recent and earlier point clouds…”; The encoder neural network generates feature embeddings based on the spatial coordinates of points in the point cloud, which represent encoded features of the observed first scene. Examiner interprets an encoded feature embedding representative of scene data to be equivalent to a first scene encoding.), and the first scene encoding is configured to be inputted into a decoder to generate a flow embedding feature map and a [predicted motion vector] prediction corresponding to the obstacles ([0093]-[0094] “The system processes the respective embeddings for each of the first and second point clouds through a decoder neural network to generate a flow embedding feature map (step 506). The flow embedding feature map includes a respective flow embedding for each grid cell of a spatial grid over the most recent point cloud. The system generates a respective predicted motion vector for each point in the most recent point cloud using the flow embedding feature map.”).
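For illustration of the encode/decode pipeline quoted from Abdo [0092]-[0094], a toy sketch follows. The architecture here (a per-point linear layer with max pooling as the encoder, and a single linear decoder producing per-point motion vectors) is a deliberately simplified stand-in chosen for brevity, not Abdo's disclosed network.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(points, W_enc):
    """Toy 'scene encoder': per-point linear features, max-pooled into one
    scene embedding (a stand-in for the encoder neural network of [0092])."""
    feats = np.maximum(points @ W_enc, 0.0)    # per-point features with ReLU
    return feats.max(axis=0)                   # scene-level embedding

def decode(embedding, points, W_dec):
    """Toy 'decoder': combines the scene embedding with each point to produce
    a predicted motion vector per point (a stand-in for [0093]-[0094])."""
    per_point = np.concatenate(
        [points, np.tile(embedding, (len(points), 1))], axis=1)
    return per_point @ W_dec                   # (num_points, 3) motion vectors

points = rng.normal(size=(128, 3))             # one point cloud, xyz per point
W_enc = rng.normal(size=(3, 16))               # random placeholder weights
W_dec = rng.normal(size=(3 + 16, 3))

scene_encoding = encode(points, W_enc)         # "first scene encoding"
motion_vectors = decode(scene_encoding, points, W_dec)
print(scene_encoding.shape, motion_vectors.shape)  # (16,) (128, 3)
```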
Although Abdo teaches predicting scene flow and motion vectors, Abdo fails to explicitly disclose a trajectory prediction corresponding to the obstacles and the use of tensor structures, specifically generating a first obstacle tensor corresponding to the obstacles based on the local positions and the local movement states corresponding to the obstacles, wherein the first obstacle tensor corresponds to the first time point, and inputting the first obstacle tensor into a scene encoder to generate a first scene encoding.
In a related art, Wu’s study teaches: location prediction methods based on trajectory data, including spatiotemporal-pattern-based prediction (Abstract). Wu also teaches that location prediction techniques are used for route navigation and traffic control (Abstract). Wu further states, “Trajectory data characterizes the locations and times of moving objects” (p. 109, left column, lines 7-8). Under the broadest reasonable interpretation, Wu teaches trajectory prediction methods corresponding to obstacles because Wu teaches a prediction method based on trajectory data characterized by locations and times of objects, and obstacles are types of objects.
Wu further teaches a preference-based method, seen in previous art, that uses tensor structures based on a matrix and matrix factorization and “integrates user, activity, location, and temporal information to predict location” (p. 116, right column, sub-section “4.1.4 Preference-based methods” through p. 117, left column, lines 1-21). While Wu identifies drawbacks of tensor methods, including being both resource- and time-consuming, Wu also teaches that the mobility preference gained by the use of tensor-based methods is an important factor in predicting user location, and that the use of various data sources in the model provides new perspectives regarding movement patterns (p. 116, right column, sub-section “4.1.4 Preference-based methods” through p. 117, left column, lines 1-28).
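As a hedged illustration of the kind of multi-way data layout Wu's preference-based methods factorize, consider the sketch below. The record format and dimensions (a user x location x time-slot count tensor, omitting the activity mode for brevity) are assumptions made for illustration, not Wu's exact formulation.

```python
import numpy as np

# Illustrative only: a user x location x time-slot count tensor of the kind
# factorized by preference-based methods to infer mobility preferences.
visits = [  # (user_id, location_id, hour_of_day) -- hypothetical records
    (0, 2, 8), (0, 2, 9), (0, 5, 18),
    (1, 1, 8), (1, 5, 18), (1, 5, 19),
]

num_users, num_locations, num_slots = 2, 6, 24
tensor = np.zeros((num_users, num_locations, num_slots))
for user, loc, hour in visits:
    tensor[user, loc, hour] += 1.0  # count visits per (user, location, hour)

# Factorizing this tensor (e.g., a CP or Tucker decomposition) would yield
# low-dimensional user, location, and time factors usable for location prediction.
print(tensor.shape, tensor.sum())  # (2, 6, 24) 6.0
```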
A person of ordinary skill in the art could have substituted the tensor structure of Wu for the point-cloud structure of Abdo. Furthermore, the substitution would have achieved the predictable result of a first obstacle tensor that improves the accuracy of predicting scene flow and identifying potential obstacles by distinguishing between parts of a scene that are static and dynamic, a goal of Abdo (Abdo [0011]), through the use of a tensor data structure that accounts for user location and various data sources to provide new perspectives in relation to movement patterns (Wu p.116, sub-section “4.1.4 Preference-based methods”).
It would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to substitute the tensor data structure of Wu for the point-cloud data structure of Abdo to yield the predictable result of providing new perspectives in relation to movement patterns relating to obstacles, thus increasing the accuracy of the scene encoding performed by the apparatus.
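To illustrate the substitution being proposed (a tensor in place of a point cloud), the following minimal sketch stacks per-obstacle local positions and movement states into a single array. This layout for a "first obstacle tensor" is an illustrative assumption, not a structure disclosed by Abdo or Wu.

```python
import numpy as np

# Hypothetical layout: one row per obstacle, columns = local position (x, y, z)
# followed by local velocity (vx, vy, vz), at the first time point.
local_positions = np.array([[2.0, 0.5, 0.0],
                            [7.5, -1.2, 0.0]])
local_velocities = np.array([[0.8, 0.0, 0.0],
                             [-0.3, 0.4, 0.0]])

first_obstacle_tensor = np.concatenate([local_positions, local_velocities], axis=1)
print(first_obstacle_tensor.shape)  # (2, 6): two obstacles, six features each
```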
Furthermore, a person of ordinary skill in the art could have modified the scene encoding generating apparatus taught by Abdo, previously modified by Wu for a first obstacle tensor, to incorporate the trajectory prediction model corresponding to obstacles taught by Wu, because both frameworks take into account the location and movement of objects at a point in time to make a prediction. Doing so would make the scene encoding generating apparatus more robust, thereby increasing the amount of data relied on by users (e.g., autonomous car users) by enabling the user not only to generate motion predictions, but also to generate trajectory predictions based on obstacle location, movement, and time, with respect to user location. For applications relating to vehicles, an increase in the data relied on and in accuracy would also predictably result in increased safety and a decrease in the overall time required for navigation. Both inventions lie in the field of endeavor of image analysis for vehicle use, specifically relating to prediction models that account for objects, movement, time, and location.
Regarding claim 11,
Abdo teaches: A scene encoding generating method, being adapted for use in a scene encoding generating apparatus (Abstract “Methods…and apparatus, including computer programs encoded on computer storage media, for predicting scene flow”; [0097])
Abdo further teaches the interpreted structure of “the scene encoding generating apparatus” identified in the 35 U.S.C. 112(f) interpretation, previously discussed in the present Office action, for performing the limitations found in the present claim ([0098]; [0101]; [0103]; [0106]).
The remaining limitations found in claim 11 mirror the limitations found in claim 1 and are rejected based on the same prior art (Abdo and Wu) and the same motivations to combine set forth above in the 35 U.S.C. 103 rejection of claim 1. Please refer to the rejection of claim 1 above for the basis of the rejection.
Regarding claim 20,
Abdo teaches: A non-transitory computer readable storage medium, having a computer program stored therein, wherein the computer program comprises a plurality of codes, the computer program executes a scene encoding generating method after being loaded into an electronic apparatus, the scene encoding generating method comprises ([0097] “Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus.”):
Abdo further teaches the interpreted structure of “the scene encoding generating apparatus” identified in the 35 U.S.C. 112(f) interpretation, previously discussed in the present Office action, for performing the limitations found in the present claim ([0098]; [0101]; [0103]; [0106]).
The remaining limitations found in claim 20 mirror the limitations found in claims 1 and 11 and are rejected based on the same prior art (Abdo and Wu) and the same motivations to combine set forth above in the 35 U.S.C. 103 rejection of claim 1. Please refer to the rejection of claim 1 above for the basis of the rejection.
Claims 2 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Abdo (US 20220319054 A1) in view of Wu (“Location prediction on trajectory data: A review”; copy provided by examiner), and in further view of Vaswani (“Attention Is All You Need”; copy provided by examiner).
Regarding claims 2 and 12,
Abdo and Wu teach: the scene encoding generating apparatus of claim 1, including inputting the first obstacle tensor into a scene encoder to generate a first scene encoding, wherein the first scene encoding corresponds to the first time point corresponding to the first obstacle tensor.
Abdo and Wu fail to teach wherein the scene encoder comprises a time attention layer, the time attention layer is configured to perform an attention calculation based on a first input tensor corresponding to the first time point and at least one second input tensor corresponding to at least one second time point to generate a first output tensor.
In a related art, Vaswani teaches: wherein the scene encoder comprises a time attention layer (p. 5, subsection 3.2.3, line 4, “The encoder contains self-attention layers”), the time attention layer is configured to perform an attention calculation based on a first input tensor corresponding to the first time point and at least one second input tensor corresponding to at least one second time point to generate a first output tensor (Vaswani teaches that an attention calculation is performed by the encoder based on first and second input vectors of queries, keys, and values matrices (i.e., a three-dimensional matrix) corresponding to a first and second point from a sequence, and that the decoder produces an output sequence in the form of an output vector of queries, keys, and values matrices (p. 2, section “3 Model Architecture”, lines 2-4, “Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time.”; p. 5, subsection 3.2.3, lines 6-7, “Each position in the encoder can attend to all positions in the previous layer of the encoder.”; p. 3, subsection 3.2.1, lines 5-7, “…we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as: Attention(Q, K, V) = softmax(QK^T / √d_k)V”). Multi-dimensional matrices (e.g., a vector comprising queries, keys, and values matrices) are known to a person of ordinary skill in the art as an example of a tensor. Vaswani further defines attention as, “An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values…” (p. 3, subsection 3.2, lines 1-3). Thus, Vaswani teaches that a representation at a first position of a sequence provides a first tensor, which is used in an attention calculation with a second tensor from other positions corresponding to at least a second point, to generate an output tensor representation for the first position.).
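The quoted formula can be exercised directly. The following is a minimal numpy rendering of scaled dot-product attention as defined in Vaswani subsection 3.2.1; the shapes and random inputs are chosen arbitrarily for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, per Vaswani 3.2.1."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8       # e.g., four time points in a sequence (illustrative)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): each position attends to all positions
```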
Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to apply the attention layer techniques (e.g., the relationship of a first and second point in a sequence to calculate an outcome) taught by Vaswani to the scene encoding generating apparatus taught by Abdo, as previously modified by Wu, which uses obstacle time data and bases predictions on time parameters (e.g., a first time point, a second time point, etc.). Doing so would provide the predictable results of improving trajectory prediction by enabling each position in a time sequence to selectively incorporate information from all other positions via time attention, and of further allowing efficient parallel computation. Abdo, Vaswani, and the instant application are all related art because they are all directed to using encoders and decoders to more efficiently process and calculate data, while Wu, Vaswani, and the instant application aim to use tensor data structures to improve data processing techniques.
Claims 9-10, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Abdo (US 20220319054 A1) in view of Wu (“Location prediction on trajectory data: A review”; copy provided by examiner), and in further view of Mangalam (US 20210295531 A1).
Regarding claims 9 and 19,
Abdo and Wu teach: The scene encoding generating apparatus of claim 1, including a first encoded feature embedding based on a time point as a first scene encoding, and generating the trajectory prediction.
Abdo and Wu fail to explicitly disclose: concatenating the first scene encoding corresponding to the first time point and at least one second scene encoding corresponding to at least one second time point to generate an output scene encoding, wherein the output scene encoding is configured to be inputted into the decoder to generate the trajectory prediction corresponding to the obstacles.
In a related art, Mangalam teaches: concatenating the first scene encoding corresponding to the first time point and at least one second scene encoding corresponding to at least one second time point to generate an output scene encoding, wherein the output scene encoding is configured to be inputted into the decoder to generate the trajectory prediction corresponding to the obstacles (Mangalam teaches a system for trajectory prediction based on scene data having a plurality of pedestrians (Abstract), i.e., obstacles. Mangalam further teaches that the pedestrians have been observed over a period of time, which is indicative of a first and at least a second time point, to create a motion history that may be a collection of trajectories indicating the position and the trajectory of the pedestrians over a previous time ([0023]). Next, Mangalam teaches that encoding the past trajectory of all pedestrians yields the motion history of the one or more pedestrians, and that the motion history (which represents a first and at least a second time point) is concatenated together with a future endpoint to produce an output parameter to be input into a latent decoder to yield “ground truth endpoints Ĝ.sup.k” ([0040]-[0041]). Mangalam further teaches that Ĝ.sup.k (ground truth endpoints) are used in the “future trajectory module” to determine future trajectory points for at least one of the plurality of pedestrians ([0043]).)
Thus, under the broadest reasonable interpretation of the claim, Mangalam teaches concatenating the first scene encoding corresponding to the first time point and at least one second scene encoding corresponding to at least one second time point (e.g., the encoded motion history of pedestrians at a first time point and at least a second time point is concatenated with a future endpoint) to generate an output scene encoding (e.g., the encoded output parameter), wherein the output scene encoding is configured to be inputted into the decoder (e.g., the output parameter is input into a latent decoder) to generate the trajectory prediction corresponding to the obstacles (e.g., the decoder yields “ground truth endpoints Ĝ.sup.k”, which are subsequently used to generate the future trajectory points corresponding to at least one of the plurality of pedestrians).
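For illustration of this concatenate-then-decode step attributed to Mangalam [0040]-[0041], a minimal sketch follows. The embedding width, the single linear map standing in for the latent decoder, and the random weights are illustrative assumptions, not Mangalam's disclosed model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding width (illustrative)

first_scene_encoding = rng.normal(size=d)    # encoding at the first time point
second_scene_encoding = rng.normal(size=d)   # encoding at a second time point

# Concatenate the per-time-point encodings into a single output scene encoding.
output_scene_encoding = np.concatenate([first_scene_encoding,
                                        second_scene_encoding])

# Toy "latent decoder": one linear map standing in for the decoder that yields
# trajectory predictions; the weights here are random placeholders.
W_decoder = rng.normal(size=(2 * d, 2))      # predicts a 2D trajectory point
predicted_trajectory_point = output_scene_encoding @ W_decoder
print(output_scene_encoding.shape, predicted_trajectory_point.shape)  # (32,) (2,)
```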
Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the scene encoding generating apparatus taught by Abdo and modified by Wu, including a trajectory prediction framework, to incorporate the trajectory prediction techniques taught by Mangalam in order to provide a more accurate prediction model relating to pedestrian movement, resulting in improvements to downstream components described by Mangalam, including better path planning and decision-making (Mangalam, [0004]). All of these inventions lie in the same field of endeavor of image analysis for making predictions (e.g. trajectory, scene flow, etc.) based on obstacles (e.g. pedestrians, objects, etc.) and their corresponding time points, movement, and location in a scene.
Regarding claim 10,
Abdo, Wu, and Mangalam teach: the scene encoding generating apparatus of claim 9.
Abdo and Wu previously taught, in claim 1, “inputting the first obstacle tensor into a scene encoder to generate a first scene encoding”, and “the first obstacle tensor corresponds to the first time point” (refer back to claim 1 for further details of Abdo and Wu teachings).
Mangalam previously taught, in claim 9, “at least one second scene encoding corresponding to at least one second time point” by modeling the encoder, decoder, and prediction techniques around motion history across a first and at least a second time point (refer back to claim 9 for further details of Mangalam teachings).
A person of ordinary skill in the art could have modified the teachings of a scene encoding generating apparatus by Abdo, previously modified by Wu and Mangalam, that inputs at least one first obstacle tensor corresponding to the at least one first time point into the scene encoder to generate at least one first scene encoding, to incorporate the teachings of at least one second scene encoding corresponding to at least one second time point, taught by Mangalam’s framework that uses the motion history of pedestrians across multiple periods of time (e.g., at a first time point and at least a second time point). The combination would have achieved the predictable result of at least one second scene encoding being generated after inputting at least one second obstacle tensor corresponding to the at least one second time point into the scene encoder, thereby increasing the accuracy of the predictions made by the scene encoding generating apparatus by basing predictions on data derived from multiple time points.
Therefore, it would have been obvious for a person of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Abdo, previously modified by Wu and Mangalam, to incorporate the teachings of at least one second scene encoding corresponding to at least one second time point taught by Mangalam to yield the predictable result of a more robust and accurate model for predicting trajectory by accounting for obstacles at two or more time points. The combination of these known techniques would also predictably result in better path planning and decision-making for vehicles, a goal identified by Mangalam (Mangalam, [0004]). All of these inventions lie in the same field of endeavor of image analysis for making predictions (e.g., trajectory, scene flow, etc.) based on obstacles (e.g., pedestrians, objects, etc.) and their corresponding time points, movement, and location in a scene.
Allowable Subject Matter
Claims 3-8, and 13-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form, including all of the limitations of the base claim and any intervening claims.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SAMUEL DAVID BAYNES whose telephone number is (571)272-0607. The examiner can normally be reached Monday - Friday 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Stephen R Koziol can be reached at (408)918-7630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/S.D.B/
Samuel D. Baynes
Examiner, Art Unit 2665
/BOBBAK SAFAIPOUR/Primary Examiner, Art Unit 2665