Prosecution Insights
Last updated: April 19, 2026
Application No. 18/840,896

INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING METHOD, AND INFORMATION PROCESSING PROGRAM

Non-Final OA (§103)
Filed: Aug 22, 2024
Examiner: ROBINSON, TERRELL M
Art Unit: 2614
Tech Center: 2600 — Communications
Assignee: Omron Corporation
OA Round: 1 (Non-Final)

Grant Probability: 83% (Favorable)
Expected OA Rounds: 1-2
Time to Grant: 2y 3m
With Interview: 90%

Examiner Intelligence

Career Allow Rate: 83%, above average (403 granted / 486 resolved; +20.9% vs TC avg)
Interview Lift: +7.5% on resolved cases with interview (a moderate lift of roughly +8%)
Typical Timeline: 2y 3m average prosecution; 27 applications currently pending
Career History: 513 total applications across all art units
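As a rough check on these figures, the arithmetic is simple: the allow rate is grants divided by resolved cases, and the with-interview figure treats the lift as an additive percentage-point adjustment. The sketch below is illustrative only and assumes that methodology; the variable names are not the tool's API.

```python
# Illustrative recomputation of the Examiner Intelligence figures, assuming
# simple ratios and additive percentage-point adjustments (the tool's actual
# methodology is not documented here).

granted = 403
resolved = 486
interview_lift_pts = 7.5          # percentage points, per the dashboard
delta_vs_tc_pts = 20.9            # examiner allow rate minus TC average

allow_rate_pct = 100 * granted / resolved                   # ~82.9 -> shown as 83%
with_interview_pct = allow_rate_pct + interview_lift_pts    # ~90.4 -> shown as 90%
implied_tc_avg_pct = allow_rate_pct - delta_vs_tc_pts       # ~62.0

print(f"Career allow rate:        {allow_rate_pct:.1f}%")
print(f"Grant prob. w/ interview: {with_interview_pct:.1f}%")
print(f"Implied TC average:       {implied_tc_avg_pct:.1f}%")
```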

Statute-Specific Performance

§101: 7.0% (-33.0% vs TC avg)
§103: 54.5% (+14.5% vs TC avg)
§102: 11.7% (-28.3% vs TC avg)
§112: 17.2% (-22.8% vs TC avg)
Tech Center averages are estimates. Based on career data from 486 resolved cases.
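The "vs TC avg" deltas imply a Tech Center baseline for each statute. A minimal sketch of that back-calculation (examiner rate minus the stated delta), using illustrative names rather than the tool's methodology:

```python
# Back-calculate the Tech Center average implied by each statute-specific
# figure above (examiner rate minus the stated delta). Names are illustrative.

examiner_rate_pct = {"§101": 7.0, "§103": 54.5, "§102": 11.7, "§112": 17.2}
delta_vs_tc_pct   = {"§101": -33.0, "§103": 14.5, "§102": -28.3, "§112": -22.8}

for statute, rate in examiner_rate_pct.items():
    implied_tc_avg = rate - delta_vs_tc_pct[statute]
    print(f"{statute}: examiner {rate:.1f}%, implied TC avg {implied_tc_avg:.1f}%")
```

Under this assumption, each statute's implied Tech Center average comes out to roughly 40%.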

Office Action

§103
DETAILED ACTION Notice of Pre-AIA or AIA Status The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA . Response to Amendment The preliminary amendment filed on August 22, 2024 has been entered. Claims 1-14 are now pending in the application. Specification The title of the invention is not descriptive. The following title is suggested: “Information Processing Device, Information Processing Method, and Information Processing Program for Appending Captions to Event Video Images ” or a title more indicative of the claimed subject matter. Claim Interpretation The following is a quotation of 35 U.S.C. 112(f): (f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph: An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. Use of the word “means” (or “step for”) in a claim with functional language creates a rebuttable presumption that the claim element is to be treated in accordance with 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph). The presumption that 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph) is invoked is rebutted when the function is recited with sufficient structure, material, or acts within the claim itself to entirely perform the recited function. Absence of the word “means” (or “step for”) in a claim creates a rebuttable presumption that the claim element is not to be treated in accordance with 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph). The presumption that 35 U.S.C. 112(f) (pre-AIA 35 U.S.C. 112, sixth paragraph) is not invoked is rebutted when the claim element recites function but fails to recite sufficiently definite structure, material or acts to perform that function. Claim elements in this application that use the word “means” (or “step for”) are presumed to invoke 35 U.S.C. 112(f) except as otherwise indicated in an Office action. Similarly, claim elements that do not use the word “means” (or “step for”) are presumed not to invoke 35 U.S.C. 112(f) except as otherwise indicated in an Office action. The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked. As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 
112, sixth paragraph: (A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; (B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and (C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier. Such claim limitation(s) is/are: Acquisition unit in claim 1; Partitioning unit component in claims 1 and 8; Event selection unit in claims 1, 3, 5, and 11; Generation unit in claims 1, 4, 6, and 12; Storage unit in claims 3-6, 11, and 12; Update unit in claims 5, 6, and 12; Training unit in claims 7, 8, 13, and 14. The units listed above have been interpreted as tied to the structure of a processor as disclosed in the originally filed specification at least at paragraph [0025]. Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C.
112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. Allowable Subject Matter Claims 5, 6, and 12 are objected to as being dependent upon a rejected base claim, but would be allowable if the claims are rewritten in independent form including all of the limitations of the base claim and any intervening claims. The following is a statement of reasons for the indication of allowable subject matter: In regards to dependent claim 5, none of the cited prior art alone or in combination provides motivation to teach “further comprising: an update unit that updates the first memory vector stored in the storage unit, wherein the update unit updates the first memory vector using the second memory vector, and the event selection unit uses the updated first memory vector to select a next of the event video images from the candidates” as the references only teach various techniques for video partitioning with action and event classifying and captioning, however the references fail to explicitly disclose the specific process for updating memory vectors based on each other for facilitating event video image selection from a group of images of the partitioned video, in conjunction with the features of claim 4 with which it depends which establishes the second memory vector used for word selection for continuing a generated caption. In addition, there is no teaching, suggestion, or motivation found in the current references and none that can be inferred from the examiner’s own knowledge with respect to the current limitation. In regards to dependent claim 6, none of the cited prior art alone or in combination provides motivation to teach “further comprising: an update unit that updates the second memory vector stored in the storage unit, wherein the update unit updates the second memory vector using the first memory vector, and the generation unit selects a next word for the selected event video image using the updated second memory vector” as the references only teach various techniques for video partitioning with action and event classifying and captioning, however the references fail to explicitly disclose the specific process for updating memory vectors based on each other for facilitating event video image selection from a group of images of the partitioned video, in conjunction with the features of claim 4 with which it depends which establishes the second memory vector used for word selection for continuing a generated caption. In addition, there is no teaching, suggestion, or motivation found in the current references and none that can be inferred from the examiner’s own knowledge with respect to the current limitation. In regards to dependent claim 12, this claim recites limitations similar in scope to claim 6, and thus is objected based on the same rationale as provided above. As allowable subject matter has been indicated, applicant's reply must either comply with all formal requirements or specifically traverse each requirement not complied with. See 37 CFR 1.111(b) and MPEP § 707.07(a). Claim Rejections - 35 USC § 103 The following is a quotation of 35 U.S.C. 
103 which forms the basis for all obviousness rejections set forth in this Office action: A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made. The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 1. Determining the scope and contents of the prior art. 2. Ascertaining the differences between the prior art and the claims at issue. 3. Resolving the level of ordinary skill in the pertinent art. 4. Considering objective evidence present in the application indicating obviousness or nonobviousness. Claims 1-4, 7-11, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Xiong (2018 “Move Forward and Tell: A Progressive Generator of Video Descriptions”, hereinafter referenced “Xiong”) in view of Zhao (2017 “Temporal Action Detection with Structured Segment Networks”, hereinafter referenced “Zhao”). In regards to Claim 1. (Original) Xiong discloses an information processing device (Xiong, Fig. 2) comprising: -an acquisition unit that acquires a video image (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. acquisition unit) as the section details a natural video often comprises multiple events that are located sparsely along the temporal range. a proposed framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall (i.e. indicating acquisition of a video image)); -a partitioning unit that partitions the acquired video image into a plurality of event video images as candidates for appending a caption by partitioning according to events (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. partitioning unit) as the section discloses a natural video often comprises multiple events that are located sparsely along the temporal range. Here, events refer to those video segments that contain distinctive semantics that need to be conveyed…Therefore, we propose a framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall. In paragraph generation, we first filter out redundant or trivial candidates, so as to get a sequence of important and distinctive events (i.e. 
paragraph generation for the event sequences of the video segments interpreted as partitions the acquired video image into a plurality of event video images as candidates for appending a caption by partitioning according to events)); -and a generation unit that generates a video image set with captions by employing an appending model for appending a caption to an event represented by the input event video image to append a caption to the selected event video image (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. generation unit) as the section details a natural video often comprises multiple events that are located sparsely along the temporal range. Therefore, we propose a framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall. In paragraph generation, we first filter out redundant or trivial candidates, so as to get a sequence of important and distinctive events. We then use this sequence to generate a single descriptive paragraph for the entire video in a progressive manner, taking into account the coherence among sentences (i.e. generated video set for an event with captions employed by the framework)). Xiong does not explicitly disclose but Zhao teaches -an event selection unit that selects the event video image from the partitioned event video image candidates by using a selection model for selecting the event video image from a plurality of input event video images such that a range of an event is neither too broad nor too narrow, with the selection model selecting the event video image using a differentiable function (Zhao, Fig. 2 and “3.3. Activity and Completeness Classifiers” section, page 2917; Reference discloses structured segment network (i.e. event selection unit) which uses Convolutional neural networks for a segmented video as the augmented proposal is divided into starting (orange), course (green), and ending (blue) stages. An additional level of pyramid with two sub-parts is constructed on the course stage. Features from CNNs are pooled within these five parts and concatenated to form the global region representations. The activity classifier and the completeness classifier operate on the region representations to produce activity probability and class conditional completeness probability (i.e. selects the event video image from the partitioned event video image candidates by using a selection model for selecting the event video image from a plurality of input event video images such that a range of an event is neither too broad nor too narrow regarding segmented video window). Section 3.3 discloses both types of classifiers are implemented as linear classifiers on top of high-level features. Given a proposal pi, the activity classifier will produce a vector of normalized responses via a softmax layer (interpreted as with the selection model selecting the event video image using a differentiable function)); Xiong and Zhao are combinable because they are in the same field of endeavor regarding video partitioning. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the progressive generator of video descriptions of Xiong to include the structured segment network features of Zhao in order to provide the user with a system that allows for use of an efficient framework that can generate a coherent paragraphs to describe a given video as taught by Xiong, while incorporating the structured segment network features of Zhao to allow for use of a structured segment network (SSN) with a framework which models the temporal structure of each action instance via a structured temporal pyramid further having a decomposed discriminative model for classifying actions and determining completeness within a video for improving the activity/event recognition process, applicable to video event processing systems such as those taught in Xiong. In regards to Claim 2. (Original) Xiong in view of Zhao teach the information processing device of claim 1. Xiong does not explicitly disclose but Zhao teaches -wherein the differentiable function includes a Gumbel-Softmax function (Zhao, “3.3. Activity and Completeness Classifiers” section, page 2917; Reference discloses both types of classifiers are implemented as linear classifiers on top of high-level features. Given a proposal pi, the activity classifier will produce a vector of normalized responses via a softmax layer (interpreted as use of a differentiable function regarding Gumbel-softmax)). Xiong and Zhao are combinable because they are in the same field of endeavor regarding video partitioning. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the progressive generator of video descriptions of Xiong to include the structured segment network features of Zhao in order to provide the user with a system that allows for use of an efficient framework that can generate a coherent paragraphs to describe a given video as taught by Xiong, while incorporating the structured segment network features of Zhao to allow for use of a structured segment network (SSN) with a framework which models the temporal structure of each action instance via a structured temporal pyramid further having a decomposed discriminative model for classifying actions and determining completeness within a video for improving the activity/event recognition process, applicable to video event processing systems such as those taught in Xiong. In regards to Claim 3. (Currently Amended) Xiong in view of Zhao teach the information processing device of claim 1. Xiong further discloses -further comprising: a storage unit that stores a feature of an event video image selected in the past from the candidates in a first memory vector, wherein when selecting the event video image, the event selection unit employs the first memory vector to select an event video image indicating a continuation of an event video image selected in the past (Xiong, Fig. 2 and “3.3. Progressive Event Selection and Captioning” section, page 6; Reference at Fig. 2 illustrates the framework (i.e. storage unit) with visual and range features of an event video image from a sequence of video images. The section details in this work, we develop a progressive generation framework that couples two recurrent networks, one for event selection and the other for caption generation…Event Selection With all event candidates arranged in the chronological order, denoted as (e1, . . . 
, eT ), the event selection network begins with the first candidate in the sequence and moves forward gradually as follows: The selection of event candidates in a sequential manner from the video via the formulas (1)(2) illustrating the vectors interpreted as wherein when selecting the event video image, the event selection unit employs the first memory vector to select an event video image indicating a continuation of an event video image selected in the past). In regards to Claim 4. (Original) Xiong in view of Zhao teach the information processing device of claim 3. Xiong further discloses -wherein: the storage unit stores a feature value of the caption appended to the event video image in a second memory vector; and the generation unit employs the second memory vector to select a word indicating a continuation of an appended caption, and adds the selected word to a caption appended in the past and appends this to the event video image (Xiong, Fig. 2 and “Caption Generation” section, page 7; Reference discloses the event selection network and the caption generation network work hand in hand with each other when generating a description for a given video. On one hand, the selection of next event candidate depends on what has been said. Particularly, one input to the event selection network is ckt, which is set to the g(kt), the last latent state of the caption generation network in generating the previous sentence. On the other hand, the caption generation network is invoked only when the event selection network outputs yt = 1, and the generation of the current sentence depends on those that come before. The formulas (3)(4) illustrate the caption generation process regarding the feature values of the captions appended to the event video via memory vectors as the discussion details the production of paragraphs based on what was said previously in conjunction with the video event sequences). In regards to Claim 7. (Currently Amended) Xiong in view of Zhao teach the information processing device of claim 1. Xiong further discloses -further comprising: a training unit that trains by propagating a training result learnt by the appending model to the selection model, or that individually trains each of the selection model and the appending model (Xiong, “4. Training” section, page 8; Reference discloses three modules in our framework (i.e. training unit) need to be trained, namely event localization, caption generation, and event selection. In particular, we train the event localization module simply following the procedure presented in [31]. The other two modules, the caption generation network and the event selection network, are trained separately. We first train the caption generation network using the ground-truth event captions. Thereon, we then train the event selection network, which requires the caption generation states as input (i.e. trains by propagating a training result learnt by the appending model to the selection model)). In regards to Claim 8. (Original) Xiong in view of Zhao teach the information processing device of claim 7. Xiong further discloses -wherein: the partitioning unit includes a partitioning model that has been trained to partition the event video image from the video image (Xiong, Fig. 2; Reference discloses specifically, an LSTM net, serves as a selection module, will pick out a sequence of coherent and semantically independent events, based on appearances, temporal locations of events, as well as their semantic relationships (i.e.
picking semantically independent yet coherent events from video image interpreted as a partitioning model that has been trained to partition the event video image from the video image)); and the training unit trains by propagating a training result learnt by the selection model to the partitioning model (Xiong, “4. Training” section, page 8; Reference discloses we first train the caption generation network using the ground-truth event captions. Thereon, we then train the event selection network, which requires the caption generation states as input (i.e. propagating of the training results). In regards to Claim 9. (Original) Xiong discloses an information processing method (Xiong, Fig. 2) comprising: -acquiring a video image (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. acquisition unit) as the section details a natural video often comprises multiple events that are located sparsely along the temporal range. a proposed framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall (i.e. indicating acquisition of a video image)); -partitioning the acquired video image into a plurality of event video images as candidates for appending a caption by partitioning according to events (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. partitioning unit) as the section discloses a natural video often comprises multiple events that are located sparsely along the temporal range. Here, events refer to those video segments that contain distinctive semantics that need to be conveyed…Therefore, we propose a framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall. In paragraph generation, we first filter out redundant or trivial candidates, so as to get a sequence of important and distinctive events (i.e. paragraph generation for the event sequences of the video segments interpreted as partitions the acquired video image into a plurality of event video images as candidates for appending a caption by partitioning according to events)); -and generating a video image set with captions by employing an appending model for appending a caption to an event represented by the input event video image to append a caption to the selected event video image (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. generation unit) as the section details a natural video often comprises multiple events that are located sparsely along the temporal range. Therefore, we propose a framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall. In paragraph generation, we first filter out redundant or trivial candidates, so as to get a sequence of important and distinctive events. We then use this sequence to generate a single descriptive paragraph for the entire video in a progressive manner, taking into account the coherence among sentences (i.e. generated video set for an event with captions employed by the framework)). 
Xiong does not explicitly disclose but Zhao teaches -selecting the event video image from the partitioned event video image candidates by using a selection model for selecting the event video image from a plurality of input event video images such that a range of an event is neither too broad nor too narrow, with the selection model selecting the event video image using a differentiable function (Zhao, Fig. 2 and “3.3. Activity and Completeness Classifiers” section, page 2917; Reference discloses structured segment network (i.e. event selection unit) which uses Convolutional neural networks for a segmented video as the augmented proposal is divided into starting (orange), course (green), and ending (blue) stages. An additional level of pyramid with two sub-parts is constructed on the course stage. Features from CNNs are pooled within these five parts and concatenated to form the global region representations. The activity classifier and the completeness classifier operate on the region representations to produce activity probability and class conditional completeness probability (i.e. selects the event video image from the partitioned event video image candidates by using a selection model for selecting the event video image from a plurality of input event video images such that a range of an event is neither too broad nor too narrow regarding segmented video window). Section 3.3 discloses both types of classifiers are implemented as linear classifiers on top of high-level features. Given a proposal pi, the activity classifier will produce a vector of normalized responses via a softmax layer (interpreted as with the selection model selecting the event video image using a differentiable function)); Xiong and Zhao are combinable because they are in the same field of endeavor regarding video partitioning. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the progressive generator of video descriptions of Xiong to include the structured segment network features of Zhao in order to provide the user with a system that allows for use of an efficient framework that can generate a coherent paragraphs to describe a given video as taught by Xiong, while incorporating the structured segment network features of Zhao to allow for use of a structured segment network (SSN) with a framework which models the temporal structure of each action instance via a structured temporal pyramid further having a decomposed discriminative model for classifying actions and determining completeness within a video for improving the activity/event recognition process, applicable to video event processing systems such as those taught in Xiong. In regards to Claim 10. (Currently Amended) Xiong discloses a non-transitory computer-readable storage medium storing an information processing program that causes processing to be executed by a computer (Xiong, Fig. 2), the processing comprising: -acquiring a video image (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. acquisition unit) as the section details a natural video often comprises multiple events that are located sparsely along the temporal range. a proposed framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall (i.e. 
indicating acquisition of a video image)); -partitioning the acquired video image into a plurality of event video images as candidates for appending a caption by partitioning according to events (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. partitioning unit) as the section discloses a natural video often comprises multiple events that are located sparsely along the temporal range. Here, events refer to those video segments that contain distinctive semantics that need to be conveyed…Therefore, we propose a framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall. In paragraph generation, we first filter out redundant or trivial candidates, so as to get a sequence of important and distinctive events (i.e. paragraph generation for the event sequences of the video segments interpreted as partitions the acquired video image into a plurality of event video images as candidates for appending a caption by partitioning according to events)); -and generating a video image set with captions by employing an appending model for appending a caption to an event represented by the input event video image to append a caption to the selected event video image (Xiong, Fig. 2 and “3.1 Overview” section, page 5; Reference at Fig. 2 illustrates the system frame work (i.e. generation unit) as the section details a natural video often comprises multiple events that are located sparsely along the temporal range. Therefore, we propose a framework as shown in Figure 2, which generates a descriptive paragraph in two stages, namely event localization and paragraph generation. In event localization, we localize candidate events in the video with a high recall. In paragraph generation, we first filter out redundant or trivial candidates, so as to get a sequence of important and distinctive events. We then use this sequence to generate a single descriptive paragraph for the entire video in a progressive manner, taking into account the coherence among sentences (i.e. generated video set for an event with captions employed by the framework)). Xiong does not explicitly disclose but Zhao teaches -selecting the event video image from the partitioned event video image candidates by using a selection model for selecting the event video image from a plurality of input event video images such that a range of an event is neither too broad nor too narrow, with the selection model selecting the event video image using a differentiable function (Zhao, Fig. 2 and “3.3. Activity and Completeness Classifiers” section, page 2917; Reference discloses structured segment network (i.e. event selection unit) which uses Convolutional neural networks for a segmented video as the augmented proposal is divided into starting (orange), course (green), and ending (blue) stages. An additional level of pyramid with two sub-parts is constructed on the course stage. Features from CNNs are pooled within these five parts and concatenated to form the global region representations. The activity classifier and the completeness classifier operate on the region representations to produce activity probability and class conditional completeness probability (i.e. 
selects the event video image from the partitioned event video image candidates by using a selection model for selecting the event video image from a plurality of input event video images such that a range of an event is neither too broad nor too narrow regarding segmented video window). Section 3.3 discloses both types of classifiers are implemented as linear classifiers on top of high-level features. Given a proposal pi, the activity classifier will produce a vector of normalized responses via a softmax layer (interpreted as with the selection model selecting the event video image using a differentiable function)); Xiong and Zhao are combinable because they are in the same field of endeavor regarding video partitioning. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the progressive generator of video descriptions of Xiong to include the structured segment network features of Zhao in order to provide the user with a system that allows for use of an efficient framework that can generate a coherent paragraphs to describe a given video as taught by Xiong, while incorporating the structured segment network features of Zhao to allow for use of a structured segment network (SSN) with a framework which models the temporal structure of each action instance via a structured temporal pyramid further having a decomposed discriminative model for classifying actions and determining completeness within a video for improving the activity/event recognition process, applicable to video event processing systems such as those taught in Xiong. In regards to Claim 11. (New) Xiong in view of Zhao teach the information processing device of claim 2. Xiong further discloses -further comprising: a storage unit that stores a feature of an event video image selected in the past from the candidates in a first memory vector, wherein when selecting the event video image, the event selection unit employs the first memory vector to select an event video image indicating a continuation of an event video image selected in the past (Xiong, Fig. 2 and “3.3. Progressive Event Selection and Captioning” section, page 6; Reference at Fig. 2 illustrates the framework (i.e. storage unit) with visual and range features of an event video image from a sequence of video images. The section details in this work, we develop a progressive generation framework that couples two recurrent networks, one for event selection and the other for caption generation…Event Selection With all event candidates arranged in the chronological order, denoted as (e1, . . . , eT ), the event selection network begins with the first candidate in the sequence and moves forward gradually as follows: The selection of event candidates in a sequential manner from the video via the formulas (1)(2) illustrating the vectors interpreted as wherein when selecting the event video image, the event selection unit employs the first memory vector to select an event video image indicating a continuation of an event video image selected in the past). In regards to Claim 13. (New) Xiong in view of Zhao teach the information processing device of claim 2. Xiong further discloses -further comprising: a training unit that trains by propagating a training result learnt by the appending model to the selection model, or that individually trains each of the selection model and the appending model (Xiong, “4. Training” section, page 8; Reference discloses three modules in our framework (i.e. 
training unit) need to be trained, namely event localization, caption generation, and event selection. In particular, we train the event localization module simply following the procedure presented in [31]. The other two modules, the caption generation network and the event selection network, are trained separately. We first train the caption generation network using the ground-truth event captions. Thereon, we then train the event selection network, which requires the caption generation states as input (i.e. trains by propagating a training result learnt by the appending model to the selection model)). In regards to Claim 14. (New) Xiong in view of Zhao teach the information processing device of claim 3. Xiong further discloses -further comprising: a training unit that trains by propagating a training result learnt by the appending model to the selection model, or that individually trains each of the selection model and the appending model (Xiong, “4. Training” section, page 8; Reference discloses three modules in our framework (i.e. training unit) need to be trained, namely event localization, caption generation, and event selection. In particular, we train the event localization module simply following the procedure presented in [31]. The other two modules, the caption generation network and the event selection network, are trained separately. We first train the caption generation network using the ground-truth event captions. Thereon, we then train the event selection network, which requires the caption generation states as input (i.e. trains by propagating a training result learnt by the appending model to the selection model)). Conclusion The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See the Notice of References Cited (PTO-892) Any inquiry concerning this communication or earlier communications from the examiner should be directed to TERRELL M ROBINSON whose telephone number is (571)270-3526. The examiner can normally be reached 8am-5pm. Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice. If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KENT CHANG can be reached at 571-272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. /TERRELL M ROBINSON/Primary Examiner, Art Unit 2614
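For orientation on the technology at issue: claim 2 recites that the differentiable selection function "includes a Gumbel-Softmax function", while the examiner maps that limitation onto Zhao's ordinary softmax classifier. The sketch below is a generic, textbook-style illustration of the two functions for comparison; it is not code from the application or from Xiong or Zhao, and the candidate scores are invented.

```python
import numpy as np

# Generic illustration of the Gumbel-Softmax relaxation recited in claim 2
# versus a plain softmax (the function the examiner points to in Zhao).
# Textbook sketch only; nothing here comes from the application or the
# cited references, and the candidate scores below are made up.

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gumbel_softmax(logits, temperature=0.5, rng=np.random.default_rng(0)):
    # Perturb scores with Gumbel(0, 1) noise, then apply a temperature-scaled
    # softmax; low temperatures push the output toward a near-one-hot selection
    # while keeping it differentiable with respect to the logits.
    gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return softmax((logits + gumbel_noise) / temperature)

scores = np.array([1.2, 0.3, 2.1, -0.5])            # per-candidate selection scores
print("softmax        ", np.round(softmax(scores), 3))        # smooth distribution
print("gumbel-softmax ", np.round(gumbel_softmax(scores), 3)) # sharper, sample-like
```

The practical distinction: Gumbel-Softmax perturbs the scores with Gumbel noise and applies a temperature so the output approaches a discrete, sample-like selection while remaining differentiable, whereas a plain softmax only yields a smooth score distribution.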

Prosecution Timeline

Aug 22, 2024: Application Filed
Feb 07, 2026: Non-Final Rejection, §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

Patent 12602852
DYNAMIC GRAPHIC EDITING METHOD AND DEVICE
2y 5m to grant; granted Apr 14, 2026
Patent 12572196
MANAGING AN INDUSTRIAL ENVIRONMENT HAVING MACHINERY OPERATED BY REMOTE WORKERS AND PHYSICALLY PRESENT WORKERS
2y 5m to grant; granted Mar 10, 2026
Patent 12573124
PROGRESSIVE REAL-TIME DIFFUSION OF LAYERED CONTENT FILES WITH ANIMATED FEATURES
2y 5m to grant; granted Mar 10, 2026
Patent 12573111
INFORMATION PROCESSING DEVICE, INFORMATION PROCESSING SYSTEM, AND INFORMATION PROCESSING METHOD FOR APPROPRIATE DISPLAY OF PRESENTER AND PRESENTATION ITEM
2y 5m to grant; granted Mar 10, 2026
Patent 12561904
IMAGE PROCESSING DEVICE AND IMAGE PROCESSING METHOD FOR CORRECTING COMPUTER GRAPHICS IMAGE IN MIXED REALITY
2y 5m to grant; granted Feb 24, 2026
Study what changed to get past this examiner, based on the 5 most recent grants.

AI Strategy Recommendation

Get an AI-powered prosecution strategy using examiner precedents, rejection analysis, and claim mapping.

Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 83%
With Interview: 90% (+7.5%)
Median Time to Grant: 2y 3m
PTA Risk: Low
Based on 486 resolved cases by this examiner. Grant probability derived from career allow rate.
