Prosecution Insights
Last updated: April 19, 2026
Application No. 17/682,610

SYSTEMS AND METHODS FOR VIDEO CAPTIONING SAFETY-CRITICAL EVENTS FROM VIDEO DATA

Status: Final Rejection (§103)
Filed: Feb 28, 2022
Examiner: HOANG, HAN DINH
Art Unit: 2661
Tech Center: 2600 — Communications
Assignee: Verizon Patent and Licensing Inc.
OA Round: 5 (Final)
Grant Probability: 74% (Favorable)
Estimated OA Rounds: 6-7
Estimated Time to Grant: 3y 2m
Grant Probability with Interview: 93%

Examiner Intelligence

Career Allow Rate: 74% (120 granted / 162 resolved; +12.1% vs TC avg; above average)
Interview Lift: +19.3% across resolved cases with interview (strong)
Typical Timeline: 3y 2m avg prosecution; 25 applications currently pending
Career History: 187 total applications across all art units

Statute-Specific Performance

§101: 6.9% (-33.1% vs TC avg)
§103: 65.7% (+25.7% vs TC avg)
§102: 15.5% (-24.5% vs TC avg)
§112: 7.1% (-32.9% vs TC avg)
Tech Center averages are estimates • Based on career data from 162 resolved cases

Office Action

§103
DETAILED ACTION

Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first-inventor-to-file provisions of the AIA.

Response to Arguments

Applicant's amendment filed 01/30/2026 has been entered and made of record. Claims 1, 3, 8, and 15 are amended. New claim 23 is added. Claims 3, 7, and 21 are cancelled. Claims 1-6, 8-20, and 22-23 are pending.

Applicant's arguments with respect to claims 1-6, 8-20, and 22-23 have been considered but are moot in view of the new ground of rejection set forth below. Applicant argues, on page 13 of the remarks, that the previously cited prior art does not explicitly disclose wherein the corresponding sensor information is related to information from one or more sensor devices of the vehicle and includes information identifying speeds, accelerations, and orientations of the vehicle when the video was captured. The Examiner agrees that the previously cited prior art did not disclose the amended limitation. However, after further search and consideration, the newly cited O'Malley (US 11,526,721 B1) discloses this limitation in all of the independent claims. Please see the updated claim rejections under 35 U.S.C. § 103 below.

Claim Rejections - 35 U.S.C. § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103, which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-2, 8, 14-15, 18, 20, and 22-23 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. ("Visual to Text: Survey of Image and Video Captioning") in view of Pan et al. (US 2021/0295093 A1), Pei et al. (US 2021/0281774 A1), and O'Malley (US 11,526,721 B1).
Regarding Claim 1, Li teaches a method, comprising:

generating, by the device, a tensor based on the feature vectors, and processing, by the device, the tensor, with a convolutional neural network model, to generate a modified tensor (Page 302, CNN Encoder to RNN Decoder, Paragraph 1: "As shown in Fig. 2, one of the most representative encoder decoder models is based on a CNN image encoder and a RNN text decoder, where CNN extracts various vision cues from one still image as a single real-valued feature representation, and RNN generates caption for that image conditioned on its representation at the very beginning." As disclosed in this section, feature extraction is performed and a CNN extracts the features and generates a single real-valued feature tensor from the features.);

selecting, by the device, a decoder model from a plurality of decoder models (Fig. 2 shows the decoder model used for video captioning.);

processing, by the device and based on attributes associated with the video, the modified tensor, with the decoder model, to generate a caption for the video (Fig. 4 shows the RNN decoder generating a caption from the feature extraction performed by the CNN.); and

performing, by the device, one or more actions based on the caption for the video (Page 302, Attention Mechanism, Paragraph 1: "All the methods above mainly encode image with the top layer of pre-trained CNNs, and keep the image content fixed during the decoding process for generating natural language sentence. However, it is not an easy task to distill all the necessary information into one single vector, considering the cluttered background and multiple objects, as well as the complex relationship between objects. Thus, it will be helpful for caption generation by looking at different image regions according to the context. In light of this, attention mechanism has been widely used for image captioning, which generally learns where and what the RNN decoder should attend to." As disclosed in this section, based on the generated caption, the model learns where and what the decoder should attend to.).

Li does not explicitly teach receiving, by a device, a video and corresponding sensor information associated with a vehicle; extracting, by the device, feature vectors associated with the corresponding sensor information and an appearance and a geometry of another vehicle captured in the video; and processing, based on attributes associated with the video, to generate the caption for the video, wherein the attributes include the corresponding sensor information.

Pan teaches receiving, by a device, a video and corresponding sensor information associated with a vehicle (¶[0048]: "The sensor module 302 may receive the sensor data from the first sensor 304 and the second sensor 306. According to aspects of the present disclosure, the ego perception module 310 may receive sensor data directly from the first sensor 304 or the second sensor 306 to perform video captioning using knowledge distillation based on a spatio-temporal graph model from images captured by the first sensor 304 or the second sensor 306 of the car 350." ¶[0048] discloses receiving sensor data from a sensor installed on a vehicle.), wherein the corresponding sensor information is related to information from one or more sensor devices of the vehicle, and wherein the one or more sensor devices collect sensor information of the vehicle (¶[0028]: "FIG. 3 is a diagram illustrating an example of a hardware implementation for a video captioning system 300 with knowledge distillation using a spatio-temporal graph, according to aspects of the present disclosure. The video captioning system 300 may be configured for understanding a scene to enable planning and controlling an ego vehicle in response to an image from video captured through a camera during operation of a car 350. The video captioning system 300 may be a component of a vehicle, a robotic device, or other device. For example, as shown in FIG. 3, the video captioning system 300 is a component of the car 350. Aspects of the present disclosure are not limited to the video captioning system 300 being a component of the car 350, as other devices, such as a bus, motorcycle, or other like vehicle, are also contemplated for using the video captioning system 300. The car 350 may be autonomous or semi-autonomous." As disclosed in this section, vehicle sensor data is processed by the video captioning system in order to understand the scene ahead of the vehicle.); extracting, by the device, feature vectors associated with the corresponding sensor information and an appearance and a geometry of another vehicle captured in the video (¶[0058]: given a sequence of RGB frames {x_1, x_2, ..., x_T}, two types of features are extracted, scene features and object features, shown by the object branch 510 and the scene branch 520. ¶[0059], Scene Features: a sequence of 2D frame features F_2D = {f_1, f_2, ..., f_T} is first extracted (e.g., using ResNet-101), with each f_t ∈ R^(d_2D); a set of 3D clip features F_3D = {v_1, v_2, ..., v_L} is also extracted via an I3D network, with v_1 ∈ R^(d_2D). ¶[0060], Object Features: a convolutional neural network (e.g., Faster R-CNN) is run on each frame to generate a set of object features F_o = {o_1^1, o_1^2, ..., o_t^j, ..., o_T^(N_T)}, where N_t denotes the number of objects in frame t and j is the object index within each frame; each o_t^j has the same dimension d_2D as F_2D. As disclosed in ¶[0058]-¶[0060], the prior art extracts feature vectors from the image data acquired from the vehicle's sensor to determine objects in the frame.); and processing, based on attributes associated with the video, to generate the caption for the video, wherein the attributes include the corresponding sensor information (¶[0052]: "FIG. 4 depicts improved video captioning of a scene according to aspects of the present disclosure. Consider the scene 400 shown in FIG. 4. To understand the video caption: 'A cat jumps into a box,' a 'cat' and 'box,' are first identified, and then the transformation of 'cat jumps into the box' is captured. That is, scenes are complicated, not only because of the diverse set of entities involved, but also the complex interactions among them. To understand the scene 400, it is important to ignore the 'television' and 'bed,' because they mostly serve as distractors from comprehending what is happening in the scene 400." As disclosed in ¶[0052], the prior art uses the sensor data acquired from the vehicle to determine a caption for the image frame.).

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li with Pan in order to acquire sensor data from a vehicle and use the data to generate a caption. One skilled in the art would have been motivated to modify Li in this manner in order to identify objects within areas of interest in an image of a surrounding scene of the autonomous agent. (Pan, ¶[0003])

However, Li and Pan do not explicitly teach wherein the decoder model is selected from the plurality of decoder models based on a quality of a caption to be generated. Pei teaches this limitation (¶[0059]: "Step 503. Decode the target visual feature by using an auxiliary decoder of the video caption generating model, to obtain a second selection probability corresponding to the each candidate word, a memory structure of the auxiliary decoder including reference visual context information corresponding to the each candidate word, and the reference visual context information being generated according to a related video corresponding to the candidate word." ¶[0069]: "However, in this embodiment of the present disclosure, because the memory structure of the auxiliary decoder includes an association (that is, the reference visual context information) between 'pouring' and a related video screen 62, the decoded word 'pouring' can be accurately obtained through decoding, thereby improving the captioning quality of the video caption." As disclosed in ¶[0059], the prior art selects an auxiliary decoder model to decode the target visual feature; as shown in Figure 4, two different decoder models are used for different processing when encoding the target video; and ¶[0069] discloses that the auxiliary decoder recognizes certain words in the video caption by using the visual context between the decoded word and the video screen, improving the captioning quality of the video caption.)

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li and Pan with Pei in order to select a decoder model to perform the decoding. One skilled in the art would have been motivated to modify Li and Pan in this manner in order to improve the captioning quality of the video caption. (Pei, ¶[0069])

However, the combination of Li, Pan, and Pei does not explicitly teach wherein the corresponding sensor information includes information identifying an orientation of the vehicle when the video was captured. O'Malley teaches this limitation (Col. 11, Lines 11-21: "The vehicle computing device can use the sensor data to generate a trajectory for the vehicle(s) 104. In some instances, the vehicle computing device can also determine pose data associated with a position of the vehicle(s) 104. For example, the vehicle computing device can use the sensor data to determine position data, coordinate data, and/or orientation data of the vehicle(s) 104 in the environment 102. In some instances, the pose data can include x-y-z coordinates and/or can include pitch, roll, and yaw data associated with the vehicle(s) 104." These lines disclose obtaining vehicle sensor data and determining an orientation of the vehicle based on the acquired sensor data.)

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, and Pei with O'Malley in order to determine the orientation of the vehicle. One skilled in the art would have been motivated to modify Li, Pan, and Pei in this manner in order to detect a safety concern and/or an uncertainty and transmit a request, in addition to the status data, to a remote computing device associated with a teleoperator. (O'Malley, Col. 8, Lines 41-43)

Regarding Claim 2, the combination of Li, Pan, Pei, and O'Malley teaches the method of claim 1, and Pan further teaches: receiving sensor information associated with sensors of vehicles that capture a plurality of videos (¶[0048] discloses receiving sensor data from a sensor installed on a vehicle.); receiving the plurality of videos; and mapping, in a data store, the sensor information and the plurality of videos, wherein the video and the corresponding sensor information is received from the data store (¶[0030] discloses that the "transceiver may transmit captioned video and/or planned actions from the ego perception module 310 to a server (not shown)"; the sensor data appears to first be processed, and a caption is then generated and stored on a server.).

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pei, and O'Malley with Pan in order to acquire sensor data from a vehicle and use the data to generate a caption. One skilled in the art would have been motivated to modify Li, Pei, and O'Malley in this manner in order to identify objects within areas of interest in an image of a surrounding scene of the autonomous agent. (Pan, ¶[0003])
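For orientation, the claim 1 pipeline (receive video and sensor data, extract feature vectors, build a tensor, modify it with a CNN, select a decoder, generate a caption) can be sketched in a few lines of Python. Everything below, including the function names, dimensions, and the toy caption, is an illustrative assumption rather than language from the claims or the cited references:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_feature_vectors(video_frames, sensor_info):
    # Stand-in for the appearance/geometry/sensor features of claim 1:
    # one 96-dim feature vector per frame (dimensions are arbitrary here).
    return rng.standard_normal((len(video_frames), 96)).astype(np.float32)

def cnn_modify(tensor):
    # Placeholder for the CNN stage; a real convolution -> ReLU -> max-pool
    # chain is sketched in the claim 14/20 discussion below.
    return np.maximum(tensor[:-1] + tensor[1:], 0.0)

def decode(modified_tensor, attributes):
    # Toy decoder: a fixed caption standing in for an RNN language model.
    return "a vehicle ahead brakes hard near an intersection"

def select_decoder(decoders, quality="high"):
    # "selecting ... a decoder model from a plurality of decoder models",
    # keyed here on a caption-quality setting (cf. the Pei limitation).
    return decoders[quality]

decoders = {"high": decode, "fast": decode}
frames = [f"frame_{t}" for t in range(8)]
sensor_info = {"speed_mps": 12.4, "accel_mps2": -3.1, "yaw_deg": 4.0}

features = extract_feature_vectors(frames, sensor_info)   # extract step
modified = cnn_modify(np.asarray(features))               # tensor + CNN steps
caption = select_decoder(decoders)(modified, attributes=sensor_info)
print(caption)  # the "one or more actions" step would consume this caption
```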
Regarding Claim 8, Li teaches a device, comprising: one or more processors (Fig. 1 shows the language-model architecture, a machine learning framework that would inherently be run on a GPU coupled with memory.) configured to: generate a tensor based on the feature vectors, and process the tensor, with a convolutional neural network model, to generate a modified tensor (Page 302, CNN Encoder to RNN Decoder, Paragraph 1, as quoted in the discussion of claim 1 above: feature extraction is performed and a CNN extracts the features and generates a single real-valued feature tensor.); select a decoder model from a plurality of decoder models (Fig. 2 shows the decoder model used for video captioning.); process the modified tensor, with the decoder model, to generate a caption for the video based on attributes associated with the video (Fig. 4 shows the RNN decoder generating a caption from the feature extraction performed by the CNN.); and perform one or more actions based on the caption for the video (Page 302, Attention Mechanism, Paragraph 1, as quoted in the discussion of claim 1 above: based on the generated caption, the model learns where and what the decoder should attend to.).

Li does not explicitly teach receive a video and corresponding sensor information associated with a vehicle; extract feature vectors associated with the corresponding sensor information and an appearance and a geometry of another vehicle captured in the video; and process, based on attributes associated with the video, to generate the caption for the video, wherein the attributes include the corresponding sensor information.

Pan teaches these limitations for the reasons given for claim 1 above: receiving a video and corresponding sensor information associated with a vehicle (¶[0048]); wherein the corresponding sensor information is related to information from one or more sensor devices of the vehicle, and wherein the one or more sensor devices collect sensor information of the vehicle (¶[0028]); extracting feature vectors associated with the corresponding sensor information and an appearance and a geometry of another vehicle captured in the video (¶[0058]-¶[0060]); and processing, based on attributes associated with the video, to generate the caption for the video, wherein the attributes include the corresponding sensor information (¶[0052]). It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li with Pan in order to acquire sensor data from a vehicle and use the data to generate a caption. One skilled in the art would have been motivated to modify Li in this manner in order to identify objects within areas of interest in an image of a surrounding scene of the autonomous agent. (Pan, ¶[0003])

However, Li and Pan do not explicitly teach wherein the decoder model is selected from the plurality of decoder models based on a quality of a caption to be generated. Pei teaches this limitation (¶[0059] and ¶[0069], as quoted in the discussion of claim 1 above: an auxiliary decoder with a visual-context memory structure is used alongside the base decoder to improve the captioning quality of the video caption.) It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li and Pan with Pei in order to select a decoder model to perform the decoding. One skilled in the art would have been motivated to modify Li and Pan in this manner in order to improve the captioning quality of the video caption. (Pei, ¶[0069])

However, Li, Pan, and Pei do not explicitly teach wherein the corresponding sensor information is related to information from one or more sensor devices of the vehicle and includes information identifying speeds, accelerations, and orientations of the vehicle when the video was captured. O'Malley teaches this limitation as to speeds and accelerations (Col. 11, Lines 22-26: "The vehicle computing device can generate log data 108. For example, the log data 108 can include the sensor data, perception data, planning data, vehicle status data, velocity data, intent data, and/or other data generated by the vehicle computing device." These lines disclose obtaining and logging the speed of the vehicle.) and as to orientations of the vehicle when the video was captured (Col. 11, Lines 11-21, as quoted in the discussion of claim 1 above: the pose data can include x-y-z coordinates and pitch, roll, and yaw data associated with the vehicle.). It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, and Pei with O'Malley in order to determine the speed, acceleration, and orientation of the vehicle. One skilled in the art would have been motivated to modify Li, Pan, and Pei in this manner in order to detect a safety concern and/or an uncertainty and transmit a request, in addition to the status data, to a remote computing device associated with a teleoperator. (O'Malley, Col. 8, Lines 41-43)
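The amended independent claims tie speeds, accelerations, and orientations to the moment of video capture. A minimal sketch of such a per-frame sensor record follows; the field names and units are assumptions for illustration, not taken from O'Malley or the claims:

```python
from dataclasses import dataclass

@dataclass
class SensorSample:
    """One per-frame vehicle state record of the kind the independent
    claims recite: speed, acceleration, and orientation at capture time."""
    t_sec: float        # capture timestamp
    speed_mps: float    # vehicle speed
    accel_mps2: float   # longitudinal acceleration
    pitch_deg: float    # orientation angles, cf. O'Malley's pitch/roll/yaw pose data
    roll_deg: float
    yaw_deg: float

def sensor_feature_vector(s: SensorSample) -> list[float]:
    # Flatten one sample into the sensor feature vector fed to the encoder.
    return [s.speed_mps, s.accel_mps2, s.pitch_deg, s.roll_deg, s.yaw_deg]

# One sample per 30 fps frame of a short braking clip.
log = [SensorSample(t_sec=i / 30, speed_mps=12.0 - 0.4 * i, accel_mps2=-3.0,
                    pitch_deg=0.5, roll_deg=0.1, yaw_deg=4.0) for i in range(5)]
print(sensor_feature_vector(log[0]))
```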
Regarding Claim 14, the combination of Li, Pan, Pei, and O'Malley teaches the device of claim 8, and Li further teaches wherein the one or more processors, to process the tensor, with the convolutional neural network model, to generate the modified tensor, are configured to: perform convolution operations on the tensor to generate convolution results; perform rectified linear unit activations on the convolution results to generate activation results; and perform max-pooling operations on the activation results to generate the modified tensor. (Page 303, Left Col., Paragraph 1: "the last convolutional layer of pre-trained CNNs is employed for the image encoder, instead of using a fully connected layer. By this means, the visual information is vectorized as a set of representations, e.g., a = {a_1, ..., a_L}, a_i ∈ R^D, which are corresponding to different (i.e., L) regions of the given image, and hence allow the RNN decoder attending to different spatial image regions under the attention mechanism. Like previous works [90], the RNN decoder is also formulated as a one-layer LSTM. However, instead of keeping the visual content fixed, they introduce a key concept of context vector to compute the hidden state of LSTM at each time step." As disclosed in this section, the prior art uses a convolutional neural network to generate the modified tensor.)

Regarding Claim 15, Li teaches a non-transitory computer-readable medium storing a set of instructions, the set of instructions comprising: one or more instructions that, when executed by one or more processors of a device (Fig. 1 shows the language-model architecture, a machine learning framework that would inherently be run on a GPU coupled with a non-transitory computer-readable medium.), cause the device to: generate a tensor based on the feature vectors, and process the tensor, with a convolutional neural network model, to generate a modified tensor (Page 302, CNN Encoder to RNN Decoder, Paragraph 1, as quoted in the discussion of claim 1 above.); select a decoder model from a plurality of decoder models (Fig. 2 shows the decoder model used for video captioning.); process the modified tensor, with the decoder model, to generate a caption for the video based on attributes associated with the video (Fig. 4 shows the RNN decoder generating a caption from the feature extraction performed by the CNN.); and perform one or more actions based on the caption for the video (Page 302, Attention Mechanism, Paragraph 1, as quoted in the discussion of claim 1 above.).

Li does not explicitly teach receive a video and corresponding sensor information associated with a vehicle; extract feature vectors associated with the corresponding sensor information and an appearance and a geometry of another vehicle captured in the video; and process, based on attributes associated with the video, to generate the caption for the video, wherein the attributes include the corresponding sensor information. Pan teaches these limitations for the reasons given for claim 1 above (¶[0048]; ¶[0028]; ¶[0058]-¶[0060]; ¶[0052]). It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li with Pan in order to acquire sensor data from a vehicle and use the data to generate a caption. One skilled in the art would have been motivated to modify Li in this manner in order to identify objects within areas of interest in an image of a surrounding scene of the autonomous agent. (Pan, ¶[0003])

However, Li and Pan do not explicitly teach wherein the decoder model is selected from the plurality of decoder models based on a quality of a caption to be generated. Pei teaches this limitation (¶[0059] and ¶[0069], as quoted in the discussion of claim 1 above.) It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li and Pan with Pei in order to select a decoder model to perform the decoding. One skilled in the art would have been motivated to modify Li and Pan in this manner in order to improve the captioning quality of the video caption. (Pei, ¶[0069])

However, Li, Pan, and Pei do not explicitly teach wherein the corresponding sensor information is related to information from one or more sensor devices of the vehicle and includes information identifying speeds, accelerations, and orientations of the vehicle when the video was captured. O'Malley teaches this limitation (Col. 11, Lines 22-26 and Col. 11, Lines 11-21, as discussed for claim 8 above: the logged vehicle status data includes velocity data, and the pose data includes pitch, roll, and yaw data associated with the vehicle.). It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, and Pei with O'Malley in order to determine the speed, acceleration, and orientation of the vehicle. One skilled in the art would have been motivated to modify Li, Pan, and Pei in this manner in order to detect a safety concern and/or an uncertainty and transmit a request, in addition to the status data, to a remote computing device associated with a teleoperator. (O'Malley, Col. 8, Lines 41-43)
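Pei's two-decoder scheme can be pictured, at a very high level, as blending the base decoder's per-word probabilities with those of an auxiliary decoder that carries visual-context memory. The sketch below is only an illustration of that idea; the vocabulary, probabilities, and blending weight are assumptions, not Pei's actual formulation:

```python
import numpy as np

VOCAB = ["a", "vehicle", "pours", "brakes", "suddenly"]

def combine_decoders(p_main, p_aux, weight=0.5):
    """Blend base-decoder word probabilities with an auxiliary decoder's,
    in the spirit of Pei ¶[0059]; the weighting scheme is assumed here."""
    p = (1 - weight) * p_main + weight * p_aux
    return p / p.sum()

p_main = np.array([0.30, 0.30, 0.25, 0.10, 0.05])  # base decoder favors "pours"
p_aux  = np.array([0.05, 0.10, 0.05, 0.50, 0.30])  # auxiliary has visual context

p = combine_decoders(p_main, p_aux)
print(VOCAB[int(np.argmax(p))])  # -> "brakes": the visually grounded word wins
```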
Regarding Claim 18, the combination of Li, Pan, Pei, and O'Malley teaches the non-transitory computer-readable medium of claim 15, and Li further teaches wherein the plurality of decoder models includes one or more of: a single-loop recurrent neural network (RNN) model with pooling, a single-loop RNN model with attention, a hierarchical RNN model with pooling, and a hierarchical RNN model with attention. (Page 303, Left Col., Paragraph 1, as quoted in the discussion of claim 14 above: the RNN decoder is formulated as a one-layer LSTM that attends to different spatial image regions under the attention mechanism.)

Regarding Claim 20, the combination of Li, Pan, Pei, and O'Malley teaches the non-transitory computer-readable medium of claim 15, and Li further teaches wherein the one or more instructions, that cause the device to process the tensor, with the convolutional neural network model, to generate the modified tensor, cause the device to: perform convolution operations on the tensor to generate convolution results; perform rectified linear unit activations on the convolution results to generate activation results; and perform max-pooling operations on the activation results to generate the modified tensor. (Page 303, Left Col., Paragraph 1, as quoted in the discussion of claim 14 above: the prior art uses a convolutional neural network to generate the modified tensor.)

Regarding Claim 22, the combination of Li, Pan, Pei, and O'Malley teaches the device of claim 8, and Pei further teaches wherein the decoder model utilizes the attributes to adjust a probability of words to be utilized in generating the caption. (¶[0058] discloses calculating the probability of words to best capture the action in the frame.) It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, and O'Malley with Pei in order to select a decoder model to perform the decoding. One skilled in the art would have been motivated to modify Li, Pan, and O'Malley in this manner in order to improve the captioning quality of the video caption. (Pei, ¶[0069])

Regarding Claim 23, the combination of Li, Pan, Pei, and O'Malley teaches the method of claim 1, and O'Malley further teaches wherein the attribute is associated with safety. (Col. 8, Lines 40-45: "the vehicle computing device can be configured to detect a safety concern and/or an uncertainty and transmit a request, in addition to the status data, to a remote computing device associated with a teleoperator. The teleoperator can assess the situation and provide guidance to the vehicle." These lines disclose detecting safety concerns and transmitting the data to a teleoperator.) It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, and Pei with O'Malley in order to determine the safety concerns with the sensor data. One skilled in the art would have been motivated to modify Li, Pan, and Pei in this manner in order to detect a safety concern and/or an uncertainty and transmit a request, in addition to the status data, to a remote computing device associated with a teleoperator. (O'Malley, Col. 8, Lines 41-43)
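Claims 14 and 20 both recite the same convolution, ReLU, and max-pooling chain over the tensor. A minimal PyTorch sketch of that chain follows, assuming arbitrary shapes (batch of 1, 96 feature channels, 8 time steps); only the operation sequence mirrors the claim language:

```python
import torch
import torch.nn.functional as F

# Tensor of per-frame feature vectors: (batch=1, channels=feature_dim, time=8).
x = torch.randn(1, 96, 8)

conv = torch.nn.Conv1d(in_channels=96, out_channels=128, kernel_size=3)
h = conv(x)             # convolution results: shape (1, 128, 6)
h = F.relu(h)           # rectified linear unit activations
h = F.max_pool1d(h, 2)  # max-pooling -> the "modified tensor": (1, 128, 3)
print(h.shape)
```

Note that the output has a reduced temporal dimension (8 to 3) and a different feature dimension (96 to 128) relative to the input, which is the shape change the dependent claims describe.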
Claims 4 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. ("Visual to Text: Survey of Image and Video Captioning") in view of Pan et al. (US 2021/0295093 A1), Pei et al. (US 2021/0281774 A1), and O'Malley (US 11,526,721 B1), and further in view of Mori et al. (US 2023/0245467 A1).

Regarding Claim 4, the combination of Li, Pan, Pei, and O'Malley teaches the method of claim 1 but does not explicitly teach wherein extracting the feature vectors associated with the corresponding sensor information and the appearance and the geometry of the other vehicle captured in the video comprises: extracting an appearance feature vector based on the appearance of the other vehicle; extracting a geometry feature vector based on the geometry of the other vehicle; and extracting a sensor feature vector based on the corresponding sensor information.

Mori teaches extracting an appearance feature vector based on the appearance of the other vehicle (¶[0042]: "The vehicle extraction unit 14 outputs a moving image (a frame including a vehicle) from which a vehicle is extracted among the identified moving images to the number recognizing unit 15, and outputs the moving image to the matching processing unit 16." ¶[0042] discloses the extraction of a vehicle feature.), and extracting a geometry feature vector based on the geometry of the other vehicle and extracting a sensor feature vector based on the corresponding sensor information (¶[0046]: "The feature quantity extraction unit 18 extracts a feature quantity of the target vehicle by analyzing a moving image including the target vehicle. More specifically, the feature quantity extraction unit 18 calculates the traveling speed of the target vehicle based on the temporal change (For example, the amount of movement of the target vehicle between frames and the amount of change of the size of the target vehicle between frames.) of the target vehicle in the frame including the target vehicle." As disclosed in ¶[0046], the prior art extracts features and determines the speed of the target object in the frame, as well as its change of position.).

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei, and O'Malley with Mori in order to identify the speed of the vehicles from the video image frames. One skilled in the art would have been motivated to modify Li, Pan, Pei, and O'Malley in this manner in order to satisfactorily extract an imaging-target object from each frame of a moving image even when the imaging-target object is a movable object such as a vehicle or the like. (Mori, ¶[0006])

Regarding Claim 16, the combination of Li, Pan, Pei, and O'Malley teaches the non-transitory computer-readable medium of claim 15 but does not explicitly teach wherein the one or more instructions, that cause the device to extract the feature vectors associated with the corresponding sensor information and the appearance and the geometry of the other vehicle captured in the video, cause the device to: extract an appearance feature vector based on the appearance of the other vehicle; extract a geometry feature vector based on the geometry of the other vehicle; and extract a sensor feature vector based on the corresponding sensor information. Mori teaches these limitations (¶[0042] and ¶[0046], as quoted in the discussion of claim 4 above). It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei, and O'Malley with Mori in order to identify the speed of the vehicles from the video image frames. One skilled in the art would have been motivated to modify Li, Pan, Pei, and O'Malley in this manner in order to satisfactorily extract an imaging-target object from each frame of a moving image even when the imaging-target object is a movable object such as a vehicle or the like. (Mori, ¶[0006])
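Claims 4 and 16 split the extraction into three separate vectors: appearance, geometry, and sensor. A toy sketch of those three extractors follows; the features chosen (mean crop color, bounding-box deltas, a short sensor triple) are assumptions for illustration, though the geometry deltas echo Mori's idea of inferring target speed from frame-to-frame change:

```python
import numpy as np

def appearance_vec(crop_rgb):
    # Toy appearance feature: mean color of the other vehicle's image crop.
    return crop_rgb.reshape(-1, 3).mean(axis=0)

def geometry_vec(box_t0, box_t1):
    # Toy geometry feature: current box plus frame-to-frame change
    # (cf. Mori ¶[0046], speed from inter-frame movement and size change).
    (x0, y0, w0, h0), (x1, y1, w1, h1) = box_t0, box_t1
    return np.array([x1, y1, w1, h1, x1 - x0, y1 - y0, w1 - w0, h1 - h0], float)

def sensor_vec(speed, accel, yaw):
    return np.array([speed, accel, yaw], float)

crop = np.random.default_rng(1).random((32, 32, 3))
fv = np.concatenate([appearance_vec(crop),
                     geometry_vec((100, 80, 40, 30), (108, 80, 42, 31)),
                     sensor_vec(12.4, -3.1, 4.0)])
print(fv.shape)  # (14,): 3 appearance + 8 geometry + 3 sensor values
```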
Claims 5-6 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. ("Visual to Text: Survey of Image and Video Captioning") in view of Pan et al. (US 2021/0295093 A1), Pei et al. (US 2021/0281774 A1), and O'Malley (US 11,526,721 B1), and further in view of Wu et al. (US 2021/0357670 A1).

Regarding Claim 5, while the combination of Li, Pan, Pei, and O'Malley teaches the method of claim 1, it does not explicitly teach wherein generating the tensor based on the feature vectors comprises: concatenating the feature vectors, based on a feature dimension, to generate the tensor. Wu teaches this limitation (¶[0071]: "A vector is concatenated to the feature vectors including vehicle speed, turn status, brake status, navigation instructions, etc. and the feature vector is resized to be a feature map 310_1, 310_2, . . . , 310_N. Subsequently, a 1×1 convolutional layer 312 reduces the feature maps 310_1, 310_2, . . . , 310_N down to single decision points on a per-pixel basis, referred to herein as per-pixel decision points. A classifier (not shown) uses the per-pixel decision points to determine whether a particular pixel belongs to a particular target class, such as a 'crosswalk' target class.") It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei, and O'Malley with Wu in order to concatenate the feature vectors to generate a reduced-dimension tensor. One skilled in the art would have been motivated to modify Li, Pan, Pei, and O'Malley in this manner in order to collect the scene information, including obtaining street images in real time from the one or more sensors or an online map. (Wu, ¶[0007])

Regarding Claim 6, while the combination of Li, Pan, Pei, and O'Malley teaches the method of claim 1, it does not explicitly teach wherein the modified tensor includes a reduced temporal dimension compared to a temporal dimension of the tensor, and includes a different feature dimension compared to a feature dimension of the tensor. Wu teaches this limitation (¶[0071], as quoted in the discussion of claim 5 above: the 1×1 convolutional layer reduces the resized feature maps down to per-pixel decision points). It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei, and O'Malley with Wu in order to concatenate the feature vectors to generate a reduced-dimension tensor. One skilled in the art would have been motivated to modify Li, Pan, Pei, and O'Malley in this manner in order to collect the scene information, including obtaining street images in real time from the one or more sensors or an online map. (Wu, ¶[0007])
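The operations recited in claims 5 and 6 are straightforward array manipulations. A NumPy sketch follows, with all shapes assumed for illustration: the feature vectors are concatenated along the feature dimension (claim 5), and a crude windowed reduction then stands in for the convolutional stage that shrinks the temporal dimension and changes the feature dimension (claim 6):

```python
import numpy as np

T = 8  # frames
appearance = np.zeros((T, 64))
geometry   = np.zeros((T, 8))
sensor     = np.zeros((T, 3))

# Claim 5/17: concatenate per-frame feature vectors along the feature
# dimension (axis=1) to generate the tensor.
tensor = np.concatenate([appearance, geometry, sensor], axis=1)
print(tensor.shape)   # (8, 75): time x feature

# Claim 6: the modified tensor has a reduced temporal dimension and a
# different feature dimension; a window-mean plus projection stand-in.
modified = tensor.reshape(T // 2, 2, -1).mean(axis=1)[:, :32]
print(modified.shape)  # (4, 32)
```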
Regarding Claim 17, while the combination of Li, Pan, Pei and O’Malley teach the non-transitory computer-readable medium of claim 15, they do not explicitly teach wherein the one or more instructions, that cause the device to generate the tensor based on the feature vectors, cause the device to: concatenate the feature vectors, based on a feature dimension, to generate the tensor. Wu teaches wherein the one or more instructions, that cause the device to generate the tensor based on the feature vectors, cause the device to: concatenate the feature vectors, based on a feature dimension, to generate the tensor. (¶[0071], A vector is concatenated to the feature vectors including vehicle speed, turn status, brake status, navigation instructions, etc. and the feature vector is resized to be a feature map 3101, 3102 . . . 310N. Subsequently, a 1×1 convolutional layer 312 reduces the feature maps 3101, 3102 . . . 310N down to single decision points on a per-pixel basis, referred to herein as per-pixel decision points. A classifier (not shown) uses the per-pixel decision points to determine whether a particular pixel belongs to a particular target class, such as a “crosswalk” target class.) It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei and O’Malley with Wu in order to concatenate the feature vector to generate a reduced dimension tensor. One skilled in the art would have been motivated to modify Li, Pan, Pei and O’Malley in this manner in order to collect the scene information including obtaining street images in real-time from the one or more sensors or an online map. (Wu, ¶[0007]) Claims 9 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al (“Visual to Text: Survey of Image and Video Captioning”) in view of Pan et al. US PG-Pub(US 20210295093 A1) in view of Pei et al. US PG-Pub(US 20210281774 A1) in view of O'Malley US Patent(US 11526721 B1) in further view of Jiang et al. ("Recurrent Fusion Network for Image Captioning"). Regarding Claim 9, while the combination of Li, Pan, Pei and O’Malley teach the device of claim 8, they do not explicitly teach wherein the plurality of decoder models includes one or more of: a single-loop decoder model with pooling, a single-loop decoder model with attention, a hierarchical decoder model with pooling, or a hierarchical decoder model with attention. Jiang teaches wherein the plurality of decoder models includes one or more of: a single-loop decoder model with pooling, a single-loop decoder model with attention, a hierarchical decoder model with pooling, or a hierarchical decoder model with attention. (Fig. 1. The framework of our RFNet. Multiple CNNs are employed as encoders and a recurrent fusion procedure is inserted after the encoders to form better representations for the decoder. The fusion procedure consists of two stages. The first stage exploits interactions among the representations from multiple CNNs to generate multiple sets of thought vectors. The second stage performs multi-attention on the sets of thought vectors from the first stage and generates a new set of thought vectors for the decoder, as shown in figure 1, the prior art uses a RNN with pooling and attention to generate the features for the decoder.) 
It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei and O’Malley with Jiang in order to have an RNN with pooling and attention. One skilled in the art would have been motivated to modify Li, Pan, Pei and O’Malley in this manner in order to generate new compact and informative representations for the decoder. (Jiang, Abstract)

Regarding Claim 13, while the combination of Li, Pan, Pei and O’Malley teach the device of claim 8, they do not explicitly teach wherein the decoder model includes one of: a single-loop recurrent neural network (RNN) model with pooling, a single-loop RNN model with attention, a hierarchical RNN model with pooling, or a hierarchical RNN model with attention.

Jiang teaches wherein the decoder model includes one of: a single-loop recurrent neural network (RNN) model with pooling, a single-loop RNN model with attention, a hierarchical RNN model with pooling, or a hierarchical RNN model with attention. (Fig. 1, “The framework of our RFNet. Multiple CNNs are employed as encoders and a recurrent fusion procedure is inserted after the encoders to form better representations for the decoder.” The fusion procedure consists of two stages: the first stage exploits interactions among the representations from multiple CNNs to generate multiple sets of thought vectors, and the second stage performs multi-attention on the sets of thought vectors from the first stage and generates a new set of thought vectors for the decoder. As shown in Fig. 1, the prior art uses an RNN with pooling and attention to generate the features for the decoder.)

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei and O’Malley with Jiang in order to have an RNN with pooling and attention. One skilled in the art would have been motivated to modify Li, Pan, Pei and O’Malley in this manner in order to generate new compact and informative representations for the decoder. (Jiang, Abstract)
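For context on the pooling-and-attention decoder variants recited in Claims 9 and 13, the sketch below shows a single decoder step in which attention weights pool a set of thought vectors into a context vector that conditions an RNN cell. This is a generic, assumed construction for illustration only; it is neither Jiang's RFNet nor the claimed decoder models, and all dimensions and names are hypothetical.

```python
# Illustrative sketch only: a generic attention-pooled RNN decoder step.
import torch
import torch.nn as nn
import torch.nn.functional as F

K, D_enc, D_hid = 8, 256, 256
thought_vectors = torch.randn(K, D_enc)   # e.g., fused encoder outputs
h = torch.randn(1, D_hid)                 # current decoder hidden state

# Additive attention: score each thought vector against the hidden state.
W_enc = nn.Linear(D_enc, D_hid, bias=False)
W_hid = nn.Linear(D_hid, D_hid, bias=False)
v = nn.Linear(D_hid, 1, bias=False)
scores = v(torch.tanh(W_enc(thought_vectors) + W_hid(h)))       # (K, 1)
weights = F.softmax(scores, dim=0)                              # attention weights

# Attention-weighted pooling of the thought vectors into one context vector.
context = (weights * thought_vectors).sum(dim=0, keepdim=True)  # (1, D_enc)

# The context conditions the RNN step that would emit the next caption token.
cell = nn.GRUCell(input_size=D_enc, hidden_size=D_hid)
h_next = cell(context, h)
print(h_next.shape)                                             # (1, D_hid)
```

A "hierarchical" variant would stack such loops (e.g., a sentence-level RNN above a word-level RNN), while "pooling" without attention would replace the learned weights with a fixed mean or max over the thought vectors.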
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (“Visual to Text: Survey of Image and Video Captioning”) in view of Pan et al. US PG-Pub (US 20210295093 A1), in view of Pei et al. US PG-Pub (US 20210281774 A1), in view of O’Malley US Patent (US 11526721 B1), and in further view of Chen et al. US PG-Pub (US 20180225519 A1).

Regarding Claim 10, while the combination of Li, Pan, Pei and O’Malley teach the device of claim 8, they do not explicitly teach wherein the attributes associated with the video include one or more of: an attribute indicating that the vehicle is associated with a crash event, or an attribute indicating that the vehicle is associated with a near-crash event.

Chen teaches wherein the attributes associated with the video include one or more of: an attribute indicating that the vehicle is associated with a crash event, or an attribute indicating that the vehicle is associated with a near-crash event. ([0022], “A variety of targeting approaches may be used. Example targeting approaches may include:
[0023] General Highlight Approach: Generating captions to summarize the general highlights or events of a long media file (e.g., telling a story of videos taken on a long trip);
[0024] High Risk Approach: Generating captions to identify high-risk or abnormal events of a surveillance media recording (e.g., generating text alerts of crashes or fights);
[0025] Person Name Approach: Generating captions to summarize a target person’s (or entity’s) activities in a crowd/sports/family media file (e.g., featuring a kid or a couple, and allowing each person in the video to be separately featured).”)

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei and O’Malley with Chen in order to caption a near-crash event. One skilled in the art would have been motivated to modify Li, Pan, Pei and O’Malley in this manner in order to generate a summary of the media file based on the caption. (Chen, Abstract)

Claims 11-12 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (“Visual to Text: Survey of Image and Video Captioning”) in view of Pan et al. US PG-Pub (US 20210295093 A1), in view of Pei et al. US PG-Pub (US 20210281774 A1), in view of O’Malley US Patent (US 11526721 B1), and in further view of Lin et al. US PG-Pub (US 20220014807 A1).

Regarding Claim 11, while the combination of Li, Pan, Pei and O’Malley teach the device of claim 8, they do not explicitly teach wherein the one or more processors, to perform the one or more actions, are configured to one or more of: cause the caption to be displayed or played for a driver of the vehicle; cause the caption to be displayed or played for a passenger of the vehicle when the vehicle is an autonomous vehicle; or provide the caption and the video to a fleet system responsible for the vehicle.

Lin teaches wherein the one or more processors, to perform the one or more actions, are configured to one or more of: cause the caption to be displayed or played for a driver of the vehicle; cause the caption to be displayed or played for a passenger of the vehicle when the vehicle is an autonomous vehicle; or provide the caption and the video to a fleet system responsible for the vehicle. (¶[0366], during the user’s driving, the system may collect real-time video in front of the user’s line of sight and analyze the video, so as to give the user a corresponding reminder based on the generated captioning information of the video, or play the captioning information to the user when the user needs to be prompted, such as when there is a potential danger ahead.)

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei and O’Malley with Lin in order to determine dangerous behavior and alert the user about the risk. One skilled in the art would have been motivated to modify Li, Pan, Pei and O’Malley in this manner so that the accuracy of the generated text caption of the multimedia data can be effectively improved. (Lin, Abstract)
Regarding Claim 12, while the combination of Li, Pan, Pei and O’Malley teach the device of claim 8, they do not explicitly teach wherein the one or more processors, to perform the one or more actions, are configured to one or more of: cause a driver of the vehicle to be scheduled for a defensive driving course based on the caption; cause insurance for a driver of the vehicle to be adjusted based on the caption; or retrain the convolutional neural network model or one or more of the plurality of decoder models based on the caption.

Lin teaches wherein the one or more processors, to perform the one or more actions, are configured to one or more of: cause a driver of the vehicle to be scheduled for a defensive driving course based on the caption; cause insurance for a driver of the vehicle to be adjusted based on the caption; or retrain the convolutional neural network model or one or more of the plurality of decoder models based on the caption. ([0253], “Step S203: the captioning model is trained based on the value of the final loss function until the final loss function converges, to obtain a trained multimedia data captioning model.” [0254], “Specifically, after obtaining the final loss function of the video captioning model, the model parameters of the video captioning model are updated based on the final loss function until the final loss function converges based on the minimum value, so as to obtain a trained video captioning model. The final loss function of the video captioning model is determined by the first loss function and the second loss function.” As disclosed in ¶[0253], the prior art discloses retraining the neural network using a loss function based on the video caption. Note that, under the broadest reasonable interpretation, the claim recites performing one or more actions; therefore, the examiner is only required to show one of the corresponding elements.)

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei and O’Malley with Lin in order to determine dangerous behavior and alert the user about the risk. One skilled in the art would have been motivated to modify Li, Pan, Pei and O’Malley in this manner so that the accuracy of the generated text caption of the multimedia data can be effectively improved. (Lin, Abstract)
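To illustrate the train-until-convergence procedure Lin describes in ¶[0253]-[0254] (updating model parameters on a final loss, composed of two component losses, until that loss converges), a minimal sketch follows. The stand-in model, placeholder data, and the particular component losses are assumptions for illustration, not Lin's actual formulation or the claimed retraining step.

```python
# Illustrative sketch only: train on a combined "final loss" until it converges.
import torch
import torch.nn as nn

model = nn.Linear(128, 1000)   # stand-in for a captioning model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

prev_loss, tol = float("inf"), 1e-4
for step in range(10_000):
    features = torch.randn(32, 128)            # placeholder inputs
    targets = torch.randint(0, 1000, (32,))    # placeholder caption tokens

    logits = model(features)
    loss1 = nn.functional.cross_entropy(logits, targets)            # first loss
    loss2 = 1e-4 * sum(p.pow(2).sum() for p in model.parameters())  # second loss
    final_loss = loss1 + loss2    # the combined "final loss function"

    optimizer.zero_grad()
    final_loss.backward()
    optimizer.step()

    # Stop once the final loss stops improving, i.e., has converged.
    if abs(prev_loss - final_loss.item()) < tol:
        break
    prev_loss = final_loss.item()
```

The convergence test here is a simple change-in-loss threshold; with the random placeholder data it will be noisy, and any real captioning pipeline would substitute actual video-caption batches and Lin's two loss terms.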
Regarding Claim 19, while the combination of Li, Pan, Pei and O’Malley teach the non-transitory computer-readable medium of claim 15, they do not explicitly teach wherein the one or more instructions, that cause the device to perform the one or more actions, cause the device to one or more of: cause the caption to be displayed or played for a driver of the vehicle; cause the caption to be displayed or played for a passenger of the vehicle when the vehicle is an autonomous vehicle; or provide the caption and the video to a fleet system responsible for the vehicle.

Lin teaches wherein the one or more instructions, that cause the device to perform the one or more actions, cause the device to one or more of: cause the caption to be displayed or played for a driver of the vehicle; cause the caption to be displayed or played for a passenger of the vehicle when the vehicle is an autonomous vehicle; provide the caption and the video to a fleet system responsible for the vehicle; cause a driver of the vehicle to be scheduled for a defensive driving course based on the caption; cause insurance for a driver of the vehicle to be adjusted based on the caption; or retrain the convolutional neural network model or one or more of the plurality of decoder models based on the caption. ([0253], “Step S203: the captioning model is trained based on the value of the final loss function until the final loss function converges, to obtain a trained multimedia data captioning model.” [0254], “Specifically, after obtaining the final loss function of the video captioning model, the model parameters of the video captioning model are updated based on the final loss function until the final loss function converges based on the minimum value, so as to obtain a trained video captioning model. The final loss function of the video captioning model is determined by the first loss function and the second loss function.” As disclosed in ¶[0253], the prior art discloses retraining the neural network using a loss function based on the video caption. Note that, under the broadest reasonable interpretation, the claim recites performing one or more actions; therefore, the examiner is only required to show one of the corresponding elements.)

It would have been obvious to one of ordinary skill in the art before the effective filing date to modify the claimed invention as taught by Li, Pan, Pei and O’Malley with Lin in order to determine dangerous behavior and alert the user about the risk. One skilled in the art would have been motivated to modify Li, Pan, Pei and O’Malley in this manner so that the accuracy of the generated text caption of the multimedia data can be effectively improved. (Lin, Abstract)

Conclusion

Applicant’s amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).

A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.
In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to HAN D HOANG, whose telephone number is (571) 272-4344. The examiner can normally be reached Monday-Friday, 8-5.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO-supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, JOHN M VILLECCO, can be reached at 571-272-7319. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HAN HOANG/
Primary Examiner, Art Unit 2661

Prosecution Timeline

Feb 28, 2022
Application Filed
Jun 10, 2024
Non-Final Rejection — §103
Jul 31, 2024
Interview Requested
Sep 03, 2024
Applicant Interview (Telephonic)
Sep 06, 2024
Examiner Interview Summary
Sep 19, 2024
Response Filed
Dec 23, 2024
Non-Final Rejection — §103
Feb 04, 2025
Interview Requested
Mar 12, 2025
Applicant Interview (Telephonic)
Mar 12, 2025
Examiner Interview Summary
Apr 03, 2025
Response Filed
Jul 03, 2025
Final Rejection — §103
Aug 14, 2025
Interview Requested
Aug 26, 2025
Examiner Interview Summary
Aug 26, 2025
Applicant Interview (Telephonic)
Sep 05, 2025
Response after Non-Final Action
Oct 08, 2025
Request for Continued Examination
Oct 10, 2025
Response after Non-Final Action
Oct 28, 2025
Non-Final Rejection — §103
Dec 17, 2025
Interview Requested
Jan 08, 2026
Applicant Interview (Telephonic)
Jan 09, 2026
Examiner Interview Summary
Jan 30, 2026
Response Filed
Mar 23, 2026
Final Rejection — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602835
POINT CLOUD DATA TRANSMISSION DEVICE, POINT CLOUD DATA TRANSMISSION METHOD, POINT CLOUD DATA RECEPTION DEVICE, AND POINT CLOUD DATA RECEPTION METHOD
2y 5m to grant Granted Apr 14, 2026
Patent 12602778
INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND INFORMATION PROCESSING PROGRAM
2y 5m to grant Granted Apr 14, 2026
Patent 12602918
LEARNING DATA GENERATING APPARATUS, LEARNING DATA GENERATING METHOD, AND NON-TRANSITORY RECORDING MEDIUM HAVING LEARNING DATA GENERATING PROGRAM RECORDED THEREON
2y 5m to grant Granted Apr 14, 2026
Patent 12592070
IMAGE PROCESSING APPARATUS
2y 5m to grant Granted Mar 31, 2026
Patent 12586364
SINGLE IMAGE CONCEPT ENCODER FOR PERSONALIZATION USING A PRETRAINED DIFFUSION MODEL
2y 5m to grant Granted Mar 24, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

6-7
Expected OA Rounds
74%
Grant Probability
93%
With Interview (+19.3%)
3y 2m
Median Time to Grant
High
PTA Risk
Based on 162 resolved cases by this examiner. Grant probability derived from career allow rate.
