Last updated: May 29, 2026
Application No. 18/684,292
POSITION-AWARE TEMPORAL GRAPH NETWORKS FOR SURGICAL PHASE RECOGNITION ON LAPAROSCOPIC VIDEOS

Non-Final OA §103
Filed: Feb 16, 2024
Priority: Aug 19, 2021 (provisional 63/235,027, +1 more)
Examiner: HAUSMANN, MICHELLE M
Art Unit: 2671
Tech Center: 2600 (Communications)
Assignee: Digital Surgery Limited
OA Round: 1 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (76%) and a +21.3% interview lift. A written response may suffice.
Based on 870 resolved cases, 2023–2026
Examiner Intelligence

HAUSMANN, MICHELLE M
Career Allowance Rate: 76% (above average); 663 granted / 870 resolved; +14.2% vs TC avg
Interview Lift: +21.3% (strong), comparing resolved cases with and without an interview
Typical timeline: 3y 0m average prosecution; 22 currently pending
Career history: 895 total applications across all art units
Statute-Specific Performance

§101: 1.1% (-38.9% vs TC avg)
§103: 94.8% (+54.8% vs TC avg)
§102: 0.6% (-39.4% vs TC avg)
§112: 0.9% (-39.1% vs TC avg)
Tech Center averages are estimates. Based on career data from 870 resolved cases.
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Election/Restrictions
Applicant's election with traverse of species 1 in the reply filed on 26 March 2026 is acknowledged. The traversal is on the ground(s) that, while the inventions in the independent claims are independent and distinct, there is overlap when the dependent claims are taken into consideration, so there is not considered to be a serious burden on the examiner. This is found persuasive; however, the examiner notes that, because claim 17 is rejected with a different combination of references than claims 1 and 13, any further amendments should be mindful of the existing distinctions between the claims. The restriction requirement is withdrawn and all claims are hereby examined.
Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged. Applicant has not complied with one or more conditions for receiving the benefit of an earlier filing date under 35 U.S.C. 119(e) as follows: The later-filed application must be an application for a patent for an invention which is also disclosed in the prior application (the parent or original nonprovisional application or provisional application). The disclosure of the invention in the parent application and in the later-filed application must be sufficient to comply with the requirements of 35 U.S.C. 112(a) or the first paragraph of pre-AIA 35 U.S.C. 112, except for the best mode requirement. See Transco Products, Inc. v. Performance Contracting, Inc., 38 F.3d 551, 32 USPQ2d 1077 (Fed. Cir. 1994).
The disclosure of the prior-filed application, Application No. 63/235,027, fails to provide adequate support or enablement in the manner provided by 35 U.S.C. 112(a) or pre-AIA  35 U.S.C. 112, first paragraph for one or more claims of this application. The disclosure of the prior-filed application, Application No. 63/235,027, fails to give support for (at a minimum): “each frame feature being a latent representation of the corresponding video frame from the surgical video” in claim 1, “the phase labels are generated using computer vision based on the latent representation” in claim 5, “each node comprises a latent representation of a corresponding video frame from the surgical video” and “for each layer of a graph neural network, aggregating, at each node, latent representations of adjacent nodes at a predefined time step associated with each layer, the graph neural network comprising a predetermined number of layers” in claim 17. 
Applicant is given priority to PCT/EP2022/073102 filed 18 August, 2022. This is the earliest priority which has adequate support.
Claim Objections
Claim 13 is objected to because of the following informalities: “aggregating, at each node, information from one or more adjacent nodes of the each node; and identifying a surgical phase represented by each video frame based on the information aggregated at the each node” is unclear; the limitation should read “the node” or “each node”. Appropriate correction is required.
Claim 17 is objected to because of the following informalities: “and identifying a surgical phase represented by each video frame based on the aggregated information at the each node” is unclear; the limitation should read “the node” or “each node”. Appropriate correction is required.
Claim 18 is objected to because of the following informalities: “wherein the each layer of the graph neural network is associated with a distinct predefined time step” is unclear; the limitation should read “the layer” or “each layer”. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1-5, 11, 13, 15, and 16 is/are rejected under 35 U.S.C. 103 as being unpatentable over Giataganas et al. (US 20190279765 A1) in view of Gan et al. (US 20210124987 A1).

Regarding claim 1, Giataganas et al. disclose a computer-implemented method comprising: computing, by a processor, using an encoder machine learning model, a plurality of frame features respectively corresponding to a plurality of video frames from a surgical video, each frame feature being a latent representation of the corresponding video frame from the surgical video (“A multi-dimensional artificial intelligence protocol 215 can receive the surgical data 210. The multi-dimensional artificial intelligence protocol 215 segments the surgical data 210 into one or more segments to generate feature-extraction data 220 for each data stream of the surgical data. The one or more segments correspond to an incomplete subset of the data portion. In some embodiments, the artificial intelligence segments the surgical data 210 using feature extraction. Anatomic feature extraction can be performed on the surgical images to detect an anatomic state, anatomic pose, or any other attribute possible to derive from video that can be used later to determine a precise procedural state”, [0057], “In a first stage, a multi-dimensional artificial intelligence can identify the current procedural state 225 from the feature-extraction data 220. Based on each of the one or more segments from the feature-extraction data 220, each segment can be classified and/or localized as corresponding to a particular object type of a set of predefined object types”, [0058]); generating, by the processor, using a decoder machine learning model, a position-aware temporal graph data structure that comprises a plurality of nodes and a plurality of edges, wherein each node represents a respective frame feature and an edge between two nodes indicates a relative position of the two nodes (Each of the surgical data structures may include a plurality of nodes and a plurality of edges. Each edge of the plurality of edges may be configured to connect two nodes of the plurality of nodes. 
The nodes and edges may be associated with and/or arranged in an order that corresponds to a sequence in which various actions may be performed and/or specific surgical instruments in a surgical procedure, [0025], trained machine-learning model can include, convolutional neural network adaptations, adversarial networks, recurrent neural networks, deep Bayesian neural networks, or other type of deep learning or graphical models, [0045], surgical data structures can identify a surgical action based on weights assigned to one or more edges between the node-to-node transitions, surgical weights may be determined or updated using a machine learning algorithm based on collected surgical data, [0090]); aggregating, by the processor, an embedding at each node, the embedding at a first node is computed by applying an aggregation function to the embedding of each node connected to the first node (The third data stream 730 can be processed to identify surgical instruments detected during the sleeve gastrectomy procedure using a third data structure and the fourth data stream 740 can be processed to identify surgical events during the sleeve gastrectomy procedure. Each portion of the video stream 700 can be processed using a multi-dimensional artificial intelligence protocol as described herein, [0094], “Each of the data structures may include a plurality of nodes connected to nodes in another data structure representing additional characteristics of a sleeve gastrectomy procedure. For example, a second data structure may represent the anatomical features present in the sleeve gastrectomy procedure and a third data structure may represent the surgical tools used in the sleeve gastrectomy procedure. The interconnected nodes in each of these data structures can provide relational metadata regarding the surgical procedure sleeve gastrectomy procedure. 
For example, in fourth data stream 740, the relational metadata between the multiple data structures can provide relevant information that is aggregated over time. The relational metadata can provide information (for example) including usage of surgical tools near anatomical features that may cause injury, prolonged usage of surgical instruments, medical personnel at specific stages of surgery, or actions/events at specific stages of surgery. This information can be compiled and output as an electronic output (e.g., an operational note)”, [0095]); generating, by the processor, phase labels for the nodes based on the embedding at each node and identifying, by the processor, one or more surgical phases in the surgical video based on the phase labels (surgical data structure can be generated by using training data with pixel-level labels from the segmented endoscopic procedure video stream, [0024], “In some embodiments, each surgical data structure can include multiple nodes, each node representing particular characteristics of surgery (e.g., procedural state, state of a patient, surgical instruments, etc.). For example, in one data structure, a node can represent a discrete physiological and procedural state of the patient during a procedure (e.g., initial preparation and sterilization complete, skin incision made, bone exposed, etc.) and, in a second data structure, a node can represent a discrete surgical tool in the operating room during a procedure (e.g., forceps, stapler, laparoscopic instruments, etc.). The nodes of the first data structure and the nodes of the second data structure can be interconnected to provide relational metadata. 
For example, the relational metadata may represent a pose, anatomical state, sterility, a state of a patient, such as patient position, location of an organ, location of a surgical instrument, location of an incision, etc.”, [0026], edges between nodes for each data structure may identify the procedural state by identifying progress of a surgery, etc., identify an action that is being performed or is about to be performed e.g., as identified in an edge that connects a node corresponding to the procedural state with a node corresponding to a next procedural state, and/or identify one or more considerations e.g., a risk, tools being used or that are about to be used, a warning, etc., [0028], state detection 120 can use the output from the classification and/or localization to determine a particular state of a set of procedural states based on the classification and/or localization data 124 and at least some of the sets of characteristic metadata in the data structure, data structure can include a set of nodes, with each node corresponding to a potential state, [0049], The surgical data structure can provide precise and accurate coded descriptions of complete surgical procedures. The surgical data structure 205 can describe routes through surgical procedures. For example, the surgical data structure 205 can include multiple nodes, each node representing a procedural state, [0054], “Each of the surgical data structures 604, 606, 608 may represent a set of characteristics for a surgical procedure (e.g., cataract surgery, heart bypass, sleeve gastrectomy, hip replacement, etc.). Each of the surgical data structures 604, 606, 608 are interconnected (e.g., nodes are connected by edges) such that additional information for portions (e.g., segments of a video stream) of the surgical procedure can be identified. 
For example, the first data structure 604 can identify a set of procedural stages in the surgical procedure represented by nodes 610, 612, 614 and the second data structure 606 can identify a set of surgical tools in the surgical procedure represented by nodes 660, 662, 664”, [0084], identify an action that is being performed or is about to be performed (e.g., as identified in an edge that connects a node corresponding to the procedural state with a node corresponding to a next procedural state), [0089]) [initial preparation and sterilization complete, skin incision made, actions, and procedural state interpreted as claimed “surgical phases”].

Giataganas et al. do not disclose the language “decoder machine learning model”. In paragraph 0014 of the publication of the applicant’s specification, the applicant indicates “According to one or more aspects, the decoder machine learning model is a graph neural network”.  Therefore the described graph neural network is interpreted as disclosing this feature. 
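As a reading aid only, the steps recited in claim 1 (a graph whose edges carry relative frame position, per-node aggregation over connected nodes, per-node phase labels, and phases derived from those labels) could be sketched roughly as follows. This is not the applicant's implementation and is not drawn from either cited reference. The function names, the mean aggregation (which here also includes the node's own embedding), the nearest-prototype labelling, the `max_offset` parameter, and the phase names are all hypothetical choices made for illustration.

```python
from statistics import mean


def build_temporal_graph(num_frames, max_offset=1):
    """Link frames at most max_offset steps apart (hypothetical graph layout).

    Returns a dict mapping (i, j) to the relative position j - i.
    """
    edges = {}
    for i in range(num_frames):
        for offset in range(-max_offset, max_offset + 1):
            j = i + offset
            if offset != 0 and 0 <= j < num_frames:
                edges[(i, j)] = offset  # edge attribute: relative position
    return edges


def aggregate(embeddings, edges):
    """Mean-aggregate each node's embedding with its connected nodes.

    Mean aggregation and inclusion of the node itself are assumptions.
    """
    out = []
    for i, own in enumerate(embeddings):
        neighbours = [embeddings[j] for (src, j) in edges if src == i]
        out.append([mean(column) for column in zip(own, *neighbours)])
    return out


def label_nodes(embeddings, prototypes):
    """Give each node the phase label of the nearest prototype (assumed classifier)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(prototypes, key=lambda name: sq_dist(e, prototypes[name]))
            for e in embeddings]


def phases_from_labels(labels):
    """Group runs of identical labels into (phase, first_frame, last_frame)."""
    segments = []
    for idx, label in enumerate(labels):
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], idx)
        else:
            segments.append((label, idx, idx))
    return segments


# Toy frame features standing in for an encoder's latent representations.
frame_features = [[0.0, 1.0]] * 3 + [[1.0, 0.0]] * 3
prototypes = {"dissection": [0.0, 1.0], "clipping": [1.0, 0.0]}  # hypothetical phases

edges = build_temporal_graph(len(frame_features), max_offset=1)
labels = label_nodes(aggregate(frame_features, edges), prototypes)
print(phases_from_labels(labels))
```

A learned graph neural network would replace the fixed mean and the prototype lookup with trained layers; the sketch only makes the data flow between the recited steps concrete.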

Giataganas et al. do not disclose the language “an encoder machine learning model”.

Gan et al. teach computing, by a processor, using an encoder machine learning model, a plurality of frame features respectively corresponding to a plurality of video frames from a video, each frame feature being a latent representation of the corresponding video frame from the video (Embodiments described herein include systems, computer-implemented methods, apparatus, and/or computer program products that facilitate few-shot temporal action localization based on graph convolutional networks. In one or more embodiments, a support set can include one or more one-shot support videos respectively corresponding to one or more temporal action classifications. For instance, the one-shot support videos can be short video snippets, with each short video snippet displaying an example of a corresponding/respective temporal action classification (e.g., a first snippet demonstrating a person running, a second snippet demonstrating a person jumping, a third snippet demonstrating a person throwing an object, and so on). In various instances, each one-shot support video (and thus each temporal action classification) can correspond to an example feature vector generated by a gated recurrent unit based on the one-shot support videos (e.g., a first vector representing the running classification, a second vector representing the jumping classification, a third vector representing the throwing classification, and so on). In various aspects, a graph can be generated that models the support set. Nodes of the graph can respectively correspond to the temporal action classifications (e.g., a first node corresponding to the running classification, a second node corresponding to the jumping classification, a third node corresponding to the throwing classification, and so on). 
Edges of the graph can correspond to similarities between respective temporal action classifications (e.g., an edge between the running node and the jumping node can correspond to a similarity between the running classification and the jumping classification, an edge between the jumping node and the throwing node can correspond to a similarity between the jumping classification and the throwing classification, an edge between the running node and the throwing node can correspond to a similarity between the running classification and the throwing classification, and so on)., [0027], neural networks, [0054], [0056], [0057]) and generating, by the processor, using a decoder machine learning model, a position-aware temporal graph data structure that comprises a plurality of nodes and a plurality of edges, wherein each node represents a respective frame feature and an edge between two nodes indicates a relative position of the two nodes (generate a graph that models a support set. The support set can include one or more one-shot support video snippets (or support images, in some embodiments), with each one-shot support video snippet exhibiting an exemplar of a corresponding/respective temporal action classification. For instance, the support set can have a first one-shot support video snippet that displays an example of a person running (e.g., a running temporal action classification), a second one-shot support video snippet that displays an example of a person jumping (e.g., a jumping temporal action classification), a third one-shot support video snippet that displays an example of a person throwing an object (e.g., a throwing temporal action classification), and so on. Nodes of the graph can respectively correspond to the temporal action classifications in the support set (e.g., a first node corresponding to the running classification, a second node corresponding to the jumping classification, a third node corresponding to the throwing classification, and so on). 
Edges of the graph can correspond to similarities between the temporal action classifications (e.g., an edge between the first node and the second node can correspond to a similarity value between the running classification and the jumping classification, an edge between the second node and the third node can correspond to a similarity value between the jumping classification and the throwing classification, an edge between the first node and the third node can correspond to a similarity value between the running classification and the throwing classification, and so on), [0006], [0046], Graph convolutions (e.g., spectral-based, spatial-based, and so on) are mathematical operations performed by convolutional neural networks, [0054]) [applicant specification indicates “an encoder machine learning model” can constitute representing data with vectors using a neural network].

Gan et al. do not disclose performing on “surgical video”, but as Giataganas et al. disclose surgical video, together the references disclose/teach the entirety of the claim limitation.

Giataganas et al. and Gan et al. are in the same art of representing information with graph data structures/nodes (Giataganas et al., abstract; Gan et al., abstract). The combination of Gan et al. with Giataganas et al. will enable using an encoder machine learning model. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the encoder machine learning model of Gan et al. with the invention of Giataganas et al. as this was known at the time of filing, the combination would have predictable results, and as Gan et al. indicate, “The subject disclosure relates to action localization in machine vision, and more specifically to few-shot temporal action localization based on graph convolutional networks. Temporal action localization involves receiving an untrimmed video, locating (e.g., identifying start and stop times of) an action displayed in the untrimmed video, and classifying the action (e.g., identifying the action as running, jumping, throwing, and so on). Conventional temporal action localization techniques require vast amounts of training data, which can be inordinately time-consuming and very expensive to acquire. Few-shot temporal action localization solves this problem by learning how to classify actions based on only a few (e.g., small number of) examples. Most existing few-shot temporal action localization systems utilize a model-agnostic-meta-learning (MAML) framework. Other existing few-shot temporal action localization systems utilize learning sequence matching networks. In any case, systems/techniques that can achieve few-shot temporal action localization with more accuracy/precision than existing few-shot temporal action localization systems/techniques are advantageous” ([0001]) providing an accuracy and training efficiency benefit to combining inventions.

Regarding claim 2, Giataganas et al. and Gan et al. disclose the computer-implemented method of claim 1.  Giataganas et al. further indicate a subset of the nodes is associated with a first phase based on each of the subset of the nodes having the same phase label (Each of the surgical data structures may include one or more connections between each respective surgical data structure (e.g., interconnected nodes) that represent characteristics in a surgery and corresponding information. The surgical dataset can be processed in accordance with the multiple interconnected surgical data structures associated with the surgery to identify characteristics of a surgical procedure (e.g., a specific procedural state, surgical tools, surgical actions, anatomical features, events, etc.), [0019], Each edge can be associated with information about the surgical action, such as an identification of the action, a time typically associated with the action, tools used for the action, a position of interactions relative to anatomy, etc. For example, for an interconnected set of nodes between two data structures can pool metadata from each respective data structure to represent additional characteristics of a surgical procedure., [0027], For example, it may be determined whether a surgical team traversed through an approved set of actions. As another example, a video may be annotated with captions to identify what steps are being performed (e.g., to train another entity), [0064], Based on the identified nodes and related edges, (relational) metadata is availed. The metadata can provide information regarding events that occurred during a procedure (for example) including surgical events, risk assessment, or detection of actions that lead to specific events, [0029], The first surgical data structure 604 may include a plurality of nodes 610, 612, 614 representing each procedural stage of the surgical procedure. 
The edges 616, 618, 620 between each of nodes may represent metadata (for example) including medical personnel and/or tools necessary to perform the procedural action. A first procedural state may be associated with a first node 610, a second procedural state is associated with a second node 612, and a third procedural state may be associated with a third node 614., [0086] [actions, events, state associated with nodes = same phase label]).

Regarding claim 3, Giataganas et al. and Gan et al. disclose the computer-implemented method of claim 1. Giataganas et al. further indicate storing, by the processor, information about the one or more surgical phases, the information identifying the video frames from the surgical video corresponding to the one or more surgical phases (The surgical dataset can be processed in accordance with the multiple interconnected surgical data structures associated with the surgery to identify characteristics of a surgical procedure (e.g., a specific procedural state, surgical tools, surgical actions, anatomical features, events, etc.). Metadata associated with the characteristics (e.g., which can include information associated with procedural state) can be retrieved (e.g., from the surgical data structure or from a location identified by the surgical data structure) and stored from each respective data structure to generate an electronic output, [0019], The one or more data streams may be transmitted (e.g., as it is collected or after collection of all data from a procedure) to a processing unit that may process the data in real-time or store the data for subsequent processing, [0036], In some embodiments, the data structure may be received from another remote device (e.g., cloud server), stored and then retrieved prior to or during performance of a surgical procedure. The data structure can provide precise and accurate coded descriptions of complete surgical procedures. The data structure can describe routes through surgical procedures. For example, the data structure can include multiple nodes, each node representing a procedural state (e.g., corresponding to a state of surgery). The data structure can further include multiple edges, with each edge connecting multiple nodes and representing a surgical action, [0044]).

Regarding claim 4, Giataganas et al. and Gan et al. disclose the computer-implemented method of claim 1.  Giataganas et al. further indicate the surgical video is captured using a camera that is one from a group comprising an endoscopic camera, a laparoscopic camera, a portable camera, and a stationary camera (Generally, some data streams available during a surgical procedure (for example) can include video data (e.g., in-light camera, laparoscopic camera, wearable camera, microscope, etc.), audio data (e.g., near-field sounds, far-field sounds, microphones, wearable microphones, etc.), signals from medical devices (e.g., anesthesia machine, pulse and blood pressure monitors, cardiac monitors, navigation systems, etc.) and/or various inputs from other operating room equipment (e.g., pedals, touch screens, etc.), [0038], an operating room camera, [0066], [0073]).

Regarding claim 5, Giataganas et al. and Gan et al. disclose the computer-implemented method of claim 1. Giataganas et al. and Gan et al. further indicate the phase labels are generated using computer vision based on the latent representation (Giataganas et al., Multi-dimensional artificial intelligence system 110 can perform a classification and/or localization of the feature-extraction data 122 to obtain classification and/or localization data 124. For example, the multi-dimensional artificial intelligence system 110 can classify the feature-extraction data 122 as corresponding to a particular object type of a set of predefined object types in a classification 124 during state detection 120. The artificial intelligence can determine (e.g., state-detection stage) a particular state based on the classification and at least some of the sets of characteristic procedural metadata in the data structure. In some instances, the multi-dimensional artificial intelligence system can be configured to classify an image using a single image-level classification or by initially classifying individual image patches. The classifications of the patches can be aggregated and processed to identify a final classification, [0048]; Gan et al., graph convolutional networks, In various embodiments, an instantiation component can input into the nodes respective input vectors based on a proposed feature vector representing the action to be classified. In various cases, the respective temporal action classifications can correspond to respective example feature vectors, and the respective input vectors can be concatenations of the respective example feature vectors and the proposed feature vector, abstract, Embodiments described herein include systems, computer-implemented methods, apparatus, and/or computer program products that facilitate few-shot temporal action localization based on graph convolutional networks. 
In one or more embodiments, a support set can include one or more one-shot support videos respectively corresponding to one or more temporal action classifications. For instance, the one-shot support videos can be short video snippets, with each short video snippet displaying an example of a corresponding/respective temporal action classification (e.g., a first snippet demonstrating a person running, a second snippet demonstrating a person jumping, a third snippet demonstrating a person throwing an object, and so on). In various instances, each one-shot support video (and thus each temporal action classification) can correspond to an example feature vector generated by a gated recurrent unit based on the one-shot support videos (e.g., a first vector representing the running classification, a second vector representing the jumping classification, a third vector representing the throwing classification, and so on). In various aspects, a graph can be generated that models the support set. Nodes of the graph can respectively correspond to the temporal action classifications (e.g., a first node corresponding to the running classification, a second node corresponding to the jumping classification, a third node corresponding to the throwing classification, and so on). Edges of the graph can correspond to similarities between respective temporal action classifications (e.g., an edge between the running node and the jumping node can correspond to a similarity between the running classification and the jumping classification, an edge between the jumping node and the throwing node can correspond to a similarity between the jumping classification and the throwing classification, an edge between the running node and the throwing node can correspond to a similarity between the running classification and the throwing classification, and so on)., [0027], neural networks, [0054], [0056], [0057]).

Regarding claim 11, Giataganas et al. and Gan et al. disclose the computer-implemented method of claim 1. Giataganas et al. and Gan et al. further indicate the decoder machine learning model is a graph neural network (Giataganas et al., the multiple data structures 604, 606, 608 described herein include a global surgical graph 602 that may include a map (e.g., nodes and edges) to a plurality of interconnected surgical data structures 604, 606, 608., [0084], during a surgical procedure, the global surgical graph 602 can include a first surgical data structure 604, a second surgical data structure 606, and a third surgical data structure 608, each associated with a set of characteristics of a surgical procedure, [0085]; Gan et al., (generate a graph that models a support set. The support set can include one or more one-shot support video snippets (or support images, in some embodiments), with each one-shot support video snippet exhibiting an exemplar of a corresponding/respective temporal action classification. For instance, the support set can have a first one-shot support video snippet that displays an example of a person running (e.g., a running temporal action classification), a second one-shot support video snippet that displays an example of a person jumping (e.g., a jumping temporal action classification), a third one-shot support video snippet that displays an example of a person throwing an object (e.g., a throwing temporal action classification), and so on. Nodes of the graph can respectively correspond to the temporal action classifications in the support set (e.g., a first node corresponding to the running classification, a second node corresponding to the jumping classification, a third node corresponding to the throwing classification, and so on). 
Edges of the graph can correspond to similarities between the temporal action classifications (e.g., an edge between the first node and the second node can correspond to a similarity value between the running classification and the jumping classification, an edge between the second node and the third node can correspond to a similarity value between the jumping classification and the throwing classification, an edge between the first node and the third node can correspond to a similarity value between the running classification and the throwing classification, and so on), [0006], [0046], Graph convolutions (e.g., spectral-based, spatial-based, and so on) are mathematical operations performed by convolutional neural networks, [0054]).

Regarding claim 13, Giataganas et al. disclose a system comprising: a machine learning system comprising: an encoder that is trained to encode a plurality of video frames of a surgical video into a corresponding plurality of frame features (“A multi-dimensional artificial intelligence protocol 215 can receive the surgical data 210. The multi-dimensional artificial intelligence protocol 215 segments the surgical data 210 into one or more segments to generate feature-extraction data 220 for each data stream of the surgical data. The one or more segments correspond to an incomplete subset of the data portion. In some embodiments, the artificial intelligence segments the surgical data 210 using feature extraction. Anatomic feature extraction can be performed on the surgical images to detect an anatomic state, anatomic pose, or any other attribute possible to derive from video that can be used later to determine a precise procedural state”, [0057], “In a first stage, a multi-dimensional artificial intelligence can identify the current procedural state 225 from the feature-extraction data 220. 
Based on each of the one or more segments from the feature-extraction data 220, each segment can be classified and/or localized as corresponding to a particular object type of a set of predefined object types”, [0058]); and a temporal decoder that is trained to segment the surgical video into a plurality of surgical phases (The trained machine-learning model can include, for example, a convolutional neural network adaptations, adversarial networks, recurrent neural networks, deep Bayesian neural networks, or other type of deep learning or graphical models, [0045], surgical data structures can identify a surgical action based on weights assigned to one or more edges between the node-to-node transitions, surgical weights may be determined or updated using a machine learning algorithm based on collected surgical data, [0090]), each surgical phase comprising a subset of the plurality of video frames (The surgical dataset can be processed in accordance with the multiple interconnected surgical data structures associated with the surgery to identify characteristics of a surgical procedure (e.g., a specific procedural state, surgical tools, surgical actions, anatomical features, events, etc.), [0019], data streams can be processed using the second data structure to identify the events during a surgical procedure, [0022], “In one instance, video streams from a previous surgical procedure can be processed (e.g., using image-segmentation) to identify, detect, and determine probabilities of a surgical procedure. The video streams can be annotated to include information relevant to different portions of the surgical procedure to generate surgical data structures. For example, a video stream from an endoscopic procedure can be segmented to identify surgical instruments during the procedure. The surgical data structure can be generated by using training data with pixel-level labels (i.e., full supervision) from the segmented endoscopic procedure video stream. 
In some aspects, generating a surgical data structure can be produced using other methods”, [0024], nodes and edges may be associated with and/or arranged in an order that corresponds to a sequence in which various actions may be performed in a surgical procedure, Each edge can be associated with information about the surgical action, such as an identification of the action, a time typically associated with the action, tools used for the action, a position of interactions relative to anatomy, etc , [0027], The edges between nodes for each data structure may (for example) identify the procedural state (e.g., by identifying a state of a patient, progress of a surgery, etc.), identify an action that is being performed or is about to be performed (e.g., as identified in an edge that connects a node corresponding to the procedural state with a node corresponding to a next procedural state), [0028], The surgical data can include data collected over a time period (e.g., a predefined time increment, since a previous state-transition time or procedure initiation time, or a time of an entire surgical procedure), [0042], Each information unit of the set of information units corresponds to a different temporal association relative to other information units of the set of information units, [0056], In some embodiments, temporal information is identified for a part of the data streams associated with the procedural state. The temporal information may include (for example) a start time, end time, duration and/or range. The temporal information may be absolute (e.g., specifying one or more absolute times) or relative (e.g., specifying a time from a beginning of a surgery, from a beginning of a procedure initiation, etc.). 
The temporal information may include information defining a time period or time corresponding to the video data during which it is estimated that the surgery was in the procedural state, [0059]), wherein segmenting the surgical video by the temporal decoder comprises: generating a position-aware temporal graph that comprises a plurality of nodes and a plurality of edges, each node represents a corresponding frame feature, and an edge between two nodes is associated with a time step between the video frames associated with the frame features corresponding to the two nodes (nodes and edges may be associated with and/or arranged in an order that corresponds to a sequence in which various actions may be performed in a surgical procedure, Each edge can be associated with information about the surgical action, such as an identification of the action, a time typically associated with the action, tools used for the action, a position of interactions relative to anatomy, etc , [0027], The edges between nodes for each data structure may (for example) identify the procedural state (e.g., by identifying a state of a patient, progress of a surgery, etc.), identify an action that is being performed or is about to be performed (e.g., as identified in an edge that connects a node corresponding to the procedural state with a node corresponding to a next procedural state), [0028], The surgical data can include data collected over a time period (e.g., a predefined time increment, since a previous state-transition time or procedure initiation time, or a time of an entire surgical procedure), [0042], Each information unit of the set of information units corresponds to a different temporal association relative to other information units of the set of information units, [0056], In some embodiments, temporal information is identified for a part of the data streams associated with the procedural state. 
The temporal information may include (for example) a start time, end time, duration and/or range. The temporal information may be absolute (e.g., specifying one or more absolute times) or relative (e.g., specifying a time from a beginning of a surgery, from a beginning of a procedure initiation, etc.). The temporal information may include information defining a time period or time corresponding to the video data during which it is estimated that the surgery was in the procedural state, [0059]); aggregating, at each node, information from one or more adjacent nodes of the each node (The third data stream 730 can be processed to identify surgical instruments detected during the sleeve gastrectomy procedure using a third data structure and the fourth data stream 740 can be processed to identify surgical events during the sleeve gastrectomy procedure. Each portion of the video stream 700 can be processed using a multi-dimensional artificial intelligence protocol as described herein, [0094], “Each of the data structures may include a plurality of nodes connected to nodes in another data structure representing additional characteristics of a sleeve gastrectomy procedure. For example, a second data structure may represent the anatomical features present in the sleeve gastrectomy procedure and a third data structure may represent the surgical tools used in the sleeve gastrectomy procedure. The interconnected nodes in each of these data structures can provide relational metadata regarding the surgical procedure sleeve gastrectomy procedure. For example, in fourth data stream 740, the relational metadata between the multiple data structures can provide relevant information that is aggregated over time. 
The relational metadata can provide information (for example) including usage of surgical tools near anatomical features that may cause injury, prolonged usage of surgical instruments, medical personnel at specific stages of surgery, or actions/events at specific stages of surgery. This information can be compiled and output as an electronic output (e.g., an operational note)”, [0095]); and identifying a surgical phase represented by each video frame based on the information aggregated at the each node (edges between nodes for each data structure may (for example) identify the procedural state (e.g., by identifying a state of a patient, progress of a surgery, etc.), identify an action that is being performed or is about to be performed (e.g., as identified in an edge that connects a node corresponding to the procedural state with a node corresponding to a next procedural state), and/or identify one or more considerations (e.g., a risk, tools being used or that are about to be used, a warning, etc.), [0028], The state detection 120 can use the output from the classification and/or localization to determine a particular state of a set of procedural states based on the classification and/or localization data 124 and at least some of the sets of characteristic metadata in the data structure, data structure can include a set of nodes, with each node corresponding to a potential state, [0049], Each of the data structures may include a plurality of nodes connected to nodes in another data structure representing additional characteristics of a sleeve gastrectomy procedure. For example, a second data structure may represent the anatomical features present in the sleeve gastrectomy procedure and a third data structure may represent the surgical tools used in the sleeve gastrectomy procedure. The interconnected nodes in each of these data structures can provide relational metadata regarding the surgical procedure sleeve gastrectomy procedure. 
For example, in fourth data stream 740, the relational metadata between the multiple data structures can provide relevant information that is aggregated over time. The relational metadata can provide information (for example) including usage of surgical tools near anatomical features that may cause injury, prolonged usage of surgical instruments, medical personnel at specific stages of surgery, or actions/events at specific stages of surgery. This information can be compiled and output as an electronic output (e.g., an operational note), [0095]).
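For readers following the claim 13 limitations, the three recited steps (generating a graph whose edges carry a time step, aggregating information from adjacent nodes, and identifying a phase per frame) can be illustrated with a minimal sketch. It is not taken from the application, Giataganas et al., or Gan et al. The function names, the inverse-time-step weighting, the nearest-prototype classifier, and the phase labels are all assumptions made only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]


def build_temporal_graph(num_frames: int, time_steps: Sequence[int]) -> Dict[Edge, int]:
    """Connect frame i to frames i - step and i + step; each edge stores its signed time step.

    Assumes every entry of time_steps is a positive integer.
    """
    edges: Dict[Edge, int] = {}
    for i in range(num_frames):
        for step in time_steps:
            for j in (i - step, i + step):
                if 0 <= j < num_frames:
                    edges[(i, j)] = j - i
    return edges


def aggregate(features: List[List[float]], edges: Dict[Edge, int]) -> List[List[float]]:
    """Weighted mean of each node's own feature and its neighbours' features.

    Hypothetical weighting: a neighbour at time step s contributes with weight 1/|s|.
    """
    sums = [list(f) for f in features]  # self contribution, weight 1
    weights = [1.0] * len(features)
    for (i, j), step in edges.items():
        w = 1.0 / abs(step)
        for d, value in enumerate(features[j]):
            sums[i][d] += w * value
        weights[i] += w
    return [[s / total for s in row] for row, total in zip(sums, weights)]


def identify_phases(aggregated: List[List[float]],
                    prototypes: Dict[str, List[float]]) -> List[str]:
    """Label each frame with the phase whose prototype is nearest (squared Euclidean)."""
    def sq_dist(u: List[float], v: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [min(prototypes, key=lambda name: sq_dist(vec, prototypes[name]))
            for vec in aggregated]


# Six toy 2-D frame features: the first three resemble one phase, the last three another.
frame_features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0],
                  [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
graph = build_temporal_graph(len(frame_features), time_steps=(1, 2))
phases = identify_phases(aggregate(frame_features, graph),
                         {"phase_a": [1.0, 0.0], "phase_b": [0.0, 1.0]})
```

In this sketch the time step stored on each edge is what makes the aggregation depend on temporal position; a trained graph neural network would learn such weights rather than fix them.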

Giataganas et al. do not use the term “temporal decoder.” In paragraph [0014] of the applicant’s published specification, the applicant states, “According to one or more aspects, the decoder machine learning model is a graph neural network.” Therefore, the described graph neural network is interpreted as disclosing this feature.

Giataganas et al. do not disclose an encoder that is trained to encode a plurality of video frames of a surgical video into a corresponding plurality of frame features.

Gan et al. teach an encoder that is trained to encode a plurality of video frames of a surgical video into a corresponding plurality of frame features (Embodiments described herein include systems, computer-implemented methods, apparatus, and/or computer program products that facilitate few-shot temporal action localization based on graph convolutional networks. In one or more embodiments, a support set can include one or more one-shot support videos respectively corresponding to one or more temporal action classifications. For instance, the one-shot support videos can be short video snippets, with each short video snippet displaying an example of a corresponding/respective temporal action classification (e.g., a first snippet demonstrating a person running, a second snippet demonstrating a person jumping, a third snippet demonstrating a person throwing an object, and so on). In various instances, each one-shot support video (and thus each temporal action classification) can correspond to an example feature vector generated by a gated recurrent unit based on the one-shot support videos (e.g., a first vector representing the running classification, a second vector representing the jumping classification, a third vector representing the throwing classification, and so on). In various aspects, a graph can be generated that models the support set. Nodes of the graph can respectively correspond to the temporal action classifications (e.g., a first node corresponding to the running classification, a second node corresponding to the jumping classification, a third node corresponding to the throwing classification, and so on). 
Edges of the graph can correspond to similarities between respective temporal action classifications (e.g., an edge between the running node and the jumping node can correspond to a similarity between the running classification and the jumping classification, an edge between the jumping node and the throwing node can correspond to a similarity between the jumping classification and the throwing classification, an edge between the running node and the throwing node can correspond to a similarity between the running classification and the throwing classification, and so on)., [0027], neural networks, [0054], [0056], [0057]) wherein segmenting the surgical video by the temporal decoder comprises: generating a position-aware temporal graph that comprises a plurality of nodes and a plurality of edges, each node represents a corresponding frame feature, and an edge between two nodes is associated with a time step between the video frames associated with the frame features corresponding to the two nodes  (generate a graph that models a support set, support set can include one or more one-shot support video snippets (or support images, in some embodiments), with each one-shot support video snippet exhibiting an exemplar of a corresponding/respective temporal action classification. For instance, the support set can have a first one-shot support video snippet that displays an example of a person running (e.g., a running temporal action classification), a second one-shot support video snippet that displays an example of a person jumping (e.g., a jumping temporal action classification), a third one-shot support video snippet that displays an example of a person throwing an object (e.g., a throwing temporal action classification), and so on. 
Nodes of the graph can respectively correspond to the temporal action classifications in the support set (e.g., a first node corresponding to the running classification, a second node corresponding to the jumping classification, a third node corresponding to the throwing classification, and so on). Edges of the graph can correspond to similarities between the temporal action classifications (e.g., an edge between the first node and the second node can correspond to a similarity value between the running classification and the jumping classification, an edge between the second node and the third node can correspond to a similarity value between the jumping classification and the throwing classification, an edge between the first node and the third node can correspond to a similarity value between the running classification and the throwing classification, and so on), [0006], [0046], Graph convolutions (e.g., spectral-based, spatial-based, and so on) are mathematical operations performed by convolutional neural networks, [0054]).
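The graph structure quoted from Gan et al. (one node per temporal action classification, one edge per pairwise similarity) can be pictured with the following minimal sketch. The quoted passages do not say how the similarity value is computed, so cosine similarity is used here purely as an assumption, and the feature vectors and function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_graph(class_vectors: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """One node per classification; one undirected edge per pair, weighted by similarity."""
    return {
        (a, b): cosine_similarity(class_vectors[a], class_vectors[b])
        for a, b in combinations(sorted(class_vectors), 2)
    }


# Hypothetical example feature vectors for three action classifications.
support_graph = build_similarity_graph({
    "running": [1.0, 0.0],
    "jumping": [1.0, 1.0],
    "throwing": [0.0, 1.0],
})
```

Note that the nodes in this sketch represent action classifications, not individual video frames.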

Gan et al. do not disclose operating on “surgical video”; however, because Giataganas et al. disclose surgical video, the references together disclose/teach the entirety of the claim limitation.

Giataganas et al. and Gan et al. are in the same art of representing information with graph data structures/nodes (Giataganas et al., abstract; Gan et al., abstract). The combination of Gan et al. with Giataganas et al. will enable using an encoder machine learning model. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the encoder machine learning model of Gan et al. with the invention of Giataganas et al. as this was known at the time of filing, the combination would have predictable results, and as Gan et al. indicate, “The subject disclosure relates to action localization in machine vision, and more specifically to few-shot temporal action localization based on graph convolutional networks. Temporal action localization involves receiving an untrimmed video, locating (e.g., identifying start and stop times of) an action displayed in the untrimmed video, and classifying the action (e.g., identifying the action as running, jumping, throwing, and so on). Conventional temporal action localization techniques require vast amounts of training data, which can be inordinately time-consuming and very expensive to acquire. Few-shot temporal action localization solves this problem by learning how to classify actions based on only a few (e.g., small number of) examples. Most existing few-shot temporal action localization systems utilize a model-agnostic-meta-learning (MAML) framework. Other existing few-shot temporal action localization systems utilize learning sequence matching networks. In any case, systems/techniques that can achieve few-shot temporal action localization with more accuracy/precision than existing few-shot temporal action localization systems/techniques are advantageous” ([0001]) providing an accuracy and training efficiency benefit to combining inventions.

Regarding claim 14, Giataganas et al. and Gan et al. disclose the system of claim 13. Giataganas et al. further indicate the machine learning system further comprises outputting the surgical phases identified (edges between nodes for each data structure may (for example) identify the procedural state (e.g., by identifying a state of a patient, progress of a surgery, etc.), identify an action that is being performed or is about to be performed (e.g., as identified in an edge that connects a node corresponding to the procedural state with a node corresponding to a next procedural state), and/or identify one or more considerations (e.g., a risk, tools being used or that are about to be used, a warning, etc.), [0028], Each of the data structures may include a plurality of nodes connected to nodes in another data structure representing additional characteristics of a sleeve gastrectomy procedure. For example, a second data structure may represent the anatomical features present in the sleeve gastrectomy procedure and a third data structure may represent the surgical tools used in the sleeve gastrectomy procedure. The interconnected nodes in each of these data structures can provide relational metadata regarding the surgical procedure sleeve gastrectomy procedure. For example, in fourth data stream 740, the relational metadata between the multiple data structures can provide relevant information that is aggregated over time. The relational metadata can provide information (for example) including usage of surgical tools near anatomical features that may cause injury, prolonged usage of surgical instruments, medical personnel at specific stages of surgery, or actions/events at specific stages of surgery. This information can be compiled and output as an electronic output (e.g., an operational note), [0095]).

Regarding claim 15, Giataganas et al. and Gan et al. disclose the system of claim 13. Giataganas et al. and Gan et al. further indicate a surgical phase represented by each video frame is identified based on a latent representation of the video frame that is encoded into a frame feature (Giataganas et al., Each surgical data structure is used to determine a current node associated with a characteristic of a surgical procedure and present relevant metadata associated with the surgical procedure. Each surgical data structure includes at least one node interconnected to one or more nodes of another data structure. The interconnected nodes between one or more data structures includes relational metadata associated with the surgical procedure, abstract, Thus, state detection 450 can use the feature-extraction data 122 generated by multi-dimensional artificial intelligence system 110 (e.g., that indicates the presence and/or characteristics of particular objects within a field of view) to identify an estimated node to which the real image data corresponds. Each of the surgical data structures may include one or more interconnected nodes between each respective surgical data structure (e.g., interconnected nodes) that represent metadata (e.g., characteristics) in a surgical procedure and corresponding information, [0050]; Gan et al., generate a graph that models a support set, support set can include one or more one-shot support video snippets, [0006], [0046]) [nodes, metadata = encoded into a frame feature].

Regarding claim 16, Giataganas et al. and Gan et al. disclose the system of claim 13. Giataganas et al. and Gan et al. further indicate the position-aware temporal graph is generated using a graph neural network (Giataganas et al., global surgical graph 602 that may include a map e.g., nodes and edges, [0084], [0085]; Gan et al., Nodes of the graph can correspond to respective temporal action classifications in the support set, abstract, [0006], [0027]).

Claims 6-10 are rejected under 35 U.S.C. 103 as being unpatentable over Giataganas et al. (US 20190279765 A1) and Gan et al. (US 20210124987 A1) as applied to claim 1 above, and further in view of Makrinich et al. (US 20210313052 A1).

Regarding claim 6, Giataganas et al. and Gan et al. disclose the computer-implemented method of claim 1. Giataganas et al. and Gan et al. do not disclose generating a user interface that comprises a progress bar with a plurality of sections, each section representing a respective surgical phase from the one or more surgical phases.
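The claim 6 limitation (a progress bar with a plurality of sections, each section representing a respective surgical phase) can be illustrated by collapsing per-frame phase labels into contiguous sections. This is a minimal sketch and is not drawn from the application or from any cited reference; the function name and the phase labels are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def phase_sections(frame_phases: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Collapse per-frame phase labels into (phase, first_frame, last_frame) sections."""
    sections = []
    start = 0
    for phase, run in groupby(frame_phases):
        length = sum(1 for _ in run)  # number of consecutive frames in this phase
        sections.append((phase, start, start + length - 1))
        start += length
    return sections


# Hypothetical per-frame labels for a six-frame clip.
sections = phase_sections(["prep", "prep", "incision", "incision", "incision", "prep"])
```

Each tuple corresponds to one section of the bar. A phase that recurs later in the video yields a separate section rather than being merged with its earlier occurrence.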

Makrinich et al. teach generating a user interface that comprises a progress bar with a plurality of sections, each section representing a respective surgical phase from the one or more surgical phases (The ongoing surgical procedure may be at any stage of the surgical procedure such as a preparation stage, an injection, an incision, an implantation, a wound sealing, a cleaning, or any other stage of a surgical procedure. The surgical video feed from an ongoing surgical procedure is not limited to the time a patient is in the operating room and may include video of preparation activities or cleanup activities in an operating room before the entry of the patient and after the egress of the patient, [0185], The video may be presented in a video playback region 1801, which may sequentially display one or more frames of the video. Interface 1800 may include a timeline 1815 displayed as a horizontal bar representing time, with the leftmost portion of the bar representing a beginning time of the video and the rightmost portion of the bar representing an end time. Timeline 1815 may include a position indicator 1807 indicating the current playback position of the video relative to the timeline. Colored region 1805 of timeline 1815 may represent the progress within timeline 1815 (e.g., corresponding to video that has already been viewed by the user, or to video coming before the currently presented frame), [0281], “In the example shown in FIG. 18, timeline 1815 may be displayed such that it overlaps video playback region 1801, either physically, temporally, or both. In some embodiments, timeline 1815 may not be displayed at all times. As one example, timeline 1815 may automatically switch to a collapsed or hidden view while a user is viewing the video and may return to the expanded view shown in FIG. 18 when the user takes an action to interact with timeline 1815. 
For example, user may move a mouse pointer while viewing the video, move the mouse pointer over the collapsed timeline, move the mouse pointer to a particular region, click or tap the video playback region, or perform any other actions that may indicate an intent to interact with timeline 1815. As discussed above, timeline 1815 may be displayed in various other locations relative to video playback region 1801, including on a top portion of video playback region 1801, above or below video playback region 1801, or within control bar 1803. In some embodiments, timeline 1815 may be displayed separately from a video progress bar. For example, a separate video progress bar, including position indicator 1807 and colored region 1805, may be displayed in control bar 1803 and timeline 1815 may be a separate timeline of events associated with a surgical procedure. In such embodiments, timeline 1815 may not have the same scale or range of time as the video or the video progress bar. For example, the video progress bar may represent the time scale and range of the video, whereas timeline 1815 may represent the timeframe of the surgical procedure, which may not be the same (e.g., where the video includes a surgical summary, as discussed in detail above)”, [0282]

[Image: media_image1.png], Fig. 6, [Image: media_image2.png], Fig. 18).
Giataganas et al. and Makrinich et al. are in the same art of surgical event segmentation (Giataganas et al., [0019]; Makrinich et al., abstract). The combination of Makrinich et al. with Giataganas et al. and Gan et al. will enable adding a progress bar. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the bar of Makrinich et al. with the invention of Giataganas et al. and Gan et al. as this was known at the time of filing, the combination would have predictable results, and as Makrinich et al. indicate, “When performing a surgical procedure, it may be beneficial to automatically identify surgical planes or review video of previous steps or expected future steps for a surgeon to review during an ongoing surgical procedure. Furthermore, there is a need to analyze videos to automatically populate a post-operative report, or to view statistical data with links to surgical videos that substantiate the statistic. In addition, there is a need to identify patient data derived from surgical equipment location data and to assign surgical teams to prospective surgeries. Therefore, there is a need for unconventional approaches that efficiently and effectively analyze surgical videos to enable a medical professional to receive support during an ongoing surgical procedure, view performance related statistics and data, and to facilitate scheduling and patient-data collection” ([0003]-[0004]) suggesting the user friendliness of the interface for medical professionals will be improved with the combination, thereby providing a clinical application benefit.

Regarding claim 7, Giataganas et al. and Gan et al. and Makrinich et al. disclose the computer-implemented method of claim 6. Makrinich et al. further indicate the progress bar is updated in real-time as the surgical video is being captured and processed (The surgical video feed may be from an ongoing surgical procedure. An ongoing surgical procedure may be a surgical procedure that that is currently in progress. The ongoing surgical procedure may be at any stage of the surgical procedure such as a preparation stage, an injection, an incision, an implantation, a wound sealing, a cleaning, or any other stage of a surgical procedure. The surgical video feed from an ongoing surgical procedure is not limited to the time a patient is in the operating room and may include video of preparation activities or cleanup activities in an operating room before the entry of the patient and after the egress of the patient. Surgical video feed from an ongoing surgical procedure may be received in real-time or in near real-time. For example, the video of the surgical procedure may be recorded by an image capture device, such as cameras 115, 121, 123, 125, and 127 as shown in FIG. 1, in an operating room, or in a cavity of a patient, [0185], “In the example shown in FIG. 18, timeline 1815 may be displayed such that it overlaps video playback region 1801, either physically, temporally, or both. In some embodiments, timeline 1815 may not be displayed at all times. As one example, timeline 1815 may automatically switch to a collapsed or hidden view while a user is viewing the video and may return to the expanded view shown in FIG. 18 when the user takes an action to interact with timeline 1815. For example, user may move a mouse pointer while viewing the video, move the mouse pointer over the collapsed timeline, move the mouse pointer to a particular region, click or tap the video playback region, or perform any other actions that may indicate an intent to interact with timeline 1815. 
As discussed above, timeline 1815 may be displayed in various other locations relative to video playback region 1801, including on a top portion of video playback region 1801, above or below video playback region 1801, or within control bar 1803. In some embodiments, timeline 1815 may be displayed separately from a video progress bar. For example, a separate video progress bar, including position indicator 1807 and colored region 1805, may be displayed in control bar 1803 and timeline 1815 may be a separate timeline of events associated with a surgical procedure. In such embodiments, timeline 1815 may not have the same scale or range of time as the video or the video progress bar. For example, the video progress bar may represent the time scale and range of the video, whereas timeline 1815 may represent the timeframe of the surgical procedure, which may not be the same (e.g., where the video includes a surgical summary, as discussed in detail above)”, [0282]) [indicates options of real time or later playback]

Regarding claim 8, Giataganas et al. and Gan et al. and Makrinich et al. disclose the computer-implemented method of claim 6. Makrinich et al. further indicate each of the sections is depicted using a respective visual attribute (Timeline 1815 may include a position indicator 1807 indicating the current playback position of the video relative to the timeline. Colored region 1805 of timeline 1815 may represent the progress within timeline 1815 (e.g., corresponding to video that has already been viewed by the user, or to video coming before the currently presented frame). In some embodiments, position indicator 1807 may be interactive, such that the user can move to different positions within the video by moving position indicator 1807. In some embodiments, the surgical timeline may include markers identifying at least one of a surgical phase, an intraoperative surgical event, and a decision making junction. For example, timeline 1815 may further include one or more markers 1811, 1813, and/or 1817. Such markers may correspond to surgical events, decision points, or other points of interests, as described herein. Interface 1800 may also include a control bar 1803, with a play button and time indication 1809, [0281], In some embodiments, timeline 1815 may be displayed separately from a video progress bar. For example, a separate video progress bar, including position indicator 1807 and colored region 1805, may be displayed in control bar 1803 and timeline 1815 may be a separate timeline of events associated with a surgical procedure. In such embodiments, timeline 1815 may not have the same scale or range of time as the video or the video progress bar, [0282]

[Image: media_image2.png, greyscale], Fig. 18).

Regarding claim 9, Giataganas et al. and Gan et al. and Makrinich et al. disclose the computer-implemented method of claim 8. Makrinich et al. further indicate the visual attribute comprises at least one of a color, transparency, icon, pattern, and shape (Timeline 1815 may include a position indicator 1807 indicating the current playback position of the video relative to the timeline. Colored region 1805 of timeline 1815 may represent the progress within timeline 1815 (e.g., corresponding to video that has already been viewed by the user, or to video coming before the currently presented frame). In some embodiments, position indicator 1807 may be interactive, such that the user can move to different positions within the video by moving position indicator 1807. In some embodiments, the surgical timeline may include markers identifying at least one of a surgical phase, an intraoperative surgical event, and a decision making junction. For example, timeline 1815 may further include one or more markers 1811, 1813, and/or 1817. Such markers may correspond to surgical events, decision points, or other points of interests, as described herein. Interface 1800 may also include a control bar 1803, with a play button and time indication 1809, [0281], In some embodiments, timeline 1815 may be displayed separately from a video progress bar. For example, a separate video progress bar, including position indicator 1807 and colored region 1805, may be displayed in control bar 1803 and timeline 1815 may be a separate timeline of events associated with a surgical procedure. In such embodiments, timeline 1815 may not have the same scale or range of time as the video or the video progress bar, [0282]

[Image: media_image2.png, greyscale], Fig. 18).

Regarding claim 10, Giataganas et al. and Gan et al. and Makrinich et al. disclose the computer-implemented method of claim 6. Makrinich et al. further indicate selecting a section causes a playback of the surgical video to navigate to a surgical phase corresponding to the section (Interface 1800 may also include a control bar 1803, with a play button and time indication 1809, [0281], In the example shown in FIG. 18, timeline 1815 may be displayed such that it overlaps video playback region 1801, either physically, temporally, or both. In some embodiments, timeline 1815 may not be displayed at all times. As one example, timeline 1815 may automatically switch to a collapsed or hidden view while a user is viewing the video and may return to the expanded view shown in FIG. 18 when the user takes an action to interact with timeline 1815. For example, user may move a mouse pointer while viewing the video, move the mouse pointer over the collapsed timeline, move the mouse pointer to a particular region, click or tap the video playback region, or perform any other actions that may indicate an intent to interact with timeline 1815. As discussed above, timeline 1815 may be displayed in various other locations relative to video playback region 1801, including on a top portion of video playback region 1801, above or below video playback region 1801, or within control bar 1803. In some embodiments, timeline 1815 may be displayed separately from a video progress bar. For example, a separate video progress bar, including position indicator 1807 and colored region 1805, may be displayed in control bar 1803 and timeline 1815 may be a separate timeline of events associated with a surgical procedure. In such embodiments, timeline 1815 may not have the same scale or range of time as the video or the video progress bar, [0282]).

Claim(s) 12 is/are rejected under 35 U.S.C. 103 as being unpatentable over Giataganas et al. (US 20190279765 A1) and Gan et al. (US 20210124987 A1) as applied to claim 1 above, further in view of Zhang et al. (“Layer Embedding Analysis in Convolutional Neural Networks for Improved Probability Calibration and Classification,” 2020).

Regarding claim 12, Giataganas et al. and Gan et al. disclose the computer-implemented method of claim 11. Giataganas et al. and Gan et al. do not disclose the graph neural network comprises: a first block comprising a series of calibration layers; a second block comprising a predetermined number of graph convolution layers; and a third block comprising a classification head.

Zhang et al. teach a graph neural network comprises: a first block comprising a series of calibration layers; a second block comprising a predetermined number of graph convolution layers; and a third block comprising a classification head.

[Image: media_image3.png, greyscale], abstract,


[Image: media_image4.png, greyscale], Fig. 4

[Image: media_image5.png, greyscale], layer embeddings analysis and calibration step, part VB, We next comprehensively investigate the performance of the embedding output at different layers, for both pre/post convolution representations. Afterwards, calibrated model outputs using these embedding outputs, which is what we are ultimately interested in, are calculated, part VD).

Giataganas et al. and Zhang et al. are in the same art of representing information with data structures/nodes (Giataganas et al., abstract; Zhang et al., abstract). The combination of Zhang et al. with Giataganas et al. and Gan et al. will enable having a first block comprising a series of calibration layers; a second block comprising a predetermined number of graph convolution layers; and a third block comprising a classification head. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the architecture of Zhang et al. with the invention of Giataganas et al. and Gan et al. as this was known at the time of filing, the combination would have predictable results, and as Zhang et al. indicate, “The results show that our method is not only able to provide visualizations that are easy to interpret, but that the embedded decision-based information is also useful for improving model performance in terms of probability calibration and classification, achieving the best performance compared to other baseline methods. Moreover, this method is computationally efficient, easy to implement, and robust to hyper-parameters” (abstract) providing a computational efficiency benefit to combining inventions.

Claim(s) 17-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Giataganas et al. (US 20190279765 A1) in view of Zhang et al. (“Layer Embedding Analysis in Convolutional Neural Networks for Improved Probability Calibration and Classification,” 2020).

Regarding claim 17, Giataganas et al. disclose a computer program product comprising a memory device having computer-executable instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method to autonomously identify surgical phases in a surgical video, the method comprising: generating, using a machine learning system, a position-aware temporal graph to represent the surgical video, the position-aware temporal graph comprises a plurality of nodes and a plurality of edges, each node comprises a latent representation of a corresponding video frame from the surgical video, and an edge between two nodes is associated with a time step between the video frames corresponding to the two nodes (The nodes and edges may be associated with and/or arranged in an order that corresponds to a sequence in which various actions may be performed in a surgical procedure. Each of one more or all nodes in the surgical data structure and/or each of one, more or all edges in the surgical data structure may be associated with procedural metadata. In some cases, an edge that connects two nodes of the plurality of nodes within a single data structure can represent transition within that specific data structure. For example, for a data structure representing procedural states, an edge can represent one or more surgical actions executed to transition between different nodes. In other embodiments, an edge can connect two nodes of the plurality of nodes between two different data structures. Each the edges connecting interconnecting nodes between surgical data structures may be associated with additional procedural metadata. Each edge can be associated with information about the surgical action, such as an identification of the action, a time typically associated with the action, tools used for the action, a position of interactions relative to anatomy, etc. 
For example, for an interconnected set of nodes between two data structures can pool metadata from each respective data structure to represent additional characteristics of a surgical procedure, [0027], The metadata can provide information regarding events that occurred during a procedure (for example) including surgical events, risk assessment, or detection of actions that lead to specific events. For example, based on the identified state, the data structures can identify the spatial relation of a surgical tool to an anatomical feature over a period of time that may have caused damage or bleeding., [0029]); for each layer of a graph neural network, aggregating, at each node, latent representations of adjacent nodes at a predefined time step associated with each layer, the graph neural network comprising a predetermined number of layers (The third data stream 730 can be processed to identify surgical instruments detected during the sleeve gastrectomy procedure using a third data structure and the fourth data stream 740 can be processed to identify surgical events during the sleeve gastrectomy procedure. Each portion of the video stream 700 can be processed using a multi-dimensional artificial intelligence protocol as described herein, [0094], “Each of the data structures may include a plurality of nodes connected to nodes in another data structure representing additional characteristics of a sleeve gastrectomy procedure. For example, a second data structure may represent the anatomical features present in the sleeve gastrectomy procedure and a third data structure may represent the surgical tools used in the sleeve gastrectomy procedure. The interconnected nodes in each of these data structures can provide relational metadata regarding the surgical procedure sleeve gastrectomy procedure. For example, in fourth data stream 740, the relational metadata between the multiple data structures can provide relevant information that is aggregated over time. 
The relational metadata can provide information (for example) including usage of surgical tools near anatomical features that may cause injury, prolonged usage of surgical instruments, medical personnel at specific stages of surgery, or actions/events at specific stages of surgery. This information can be compiled and output as an electronic output (e.g., an operational note)”, [0095]); and identifying a surgical phase represented by each video frame based on the aggregated information at the each node (edges between nodes for each data structure may (for example) identify the procedural state (e.g., by identifying a state of a patient, progress of a surgery, etc.), identify an action that is being performed or is about to be performed (e.g., as identified in an edge that connects a node corresponding to the procedural state with a node corresponding to a next procedural state), and/or identify one or more considerations (e.g., a risk, tools being used or that are about to be used, a warning, etc.), [0028], The state detection 120 can use the output from the classification and/or localization to determine a particular state of a set of procedural states based on the classification and/or localization data 124 and at least some of the sets of characteristic metadata in the data structure, data structure can include a set of nodes, with each node corresponding to a potential state, [0049], Each of the data structures may include a plurality of nodes connected to nodes in another data structure representing additional characteristics of a sleeve gastrectomy procedure. For example, a second data structure may represent the anatomical features present in the sleeve gastrectomy procedure and a third data structure may represent the surgical tools used in the sleeve gastrectomy procedure. The interconnected nodes in each of these data structures can provide relational metadata regarding the surgical procedure sleeve gastrectomy procedure. 
For example, in fourth data stream 740, the relational metadata between the multiple data structures can provide relevant information that is aggregated over time. The relational metadata can provide information (for example) including usage of surgical tools near anatomical features that may cause injury, prolonged usage of surgical instruments, medical personnel at specific stages of surgery, or actions/events at specific stages of surgery. This information can be compiled and output as an electronic output (e.g., an operational note), [0095]).

Giataganas et al. do not disclose for each layer of a graph neural network, aggregating, at each node, latent representations of adjacent nodes at a predefined time step associated with each layer, the graph neural network comprising a predetermined number of layers.

Zhang et al. teach aggregating, at each node, latent representations of adjacent nodes at a predefined time step associated with each layer, the graph neural network comprising a predetermined number of layers (In this architecture, the whole network consists only of convolutional layers, batch normalization, and activation layers without any skip connections. It is important to note that when we choose this minimal architecture of the neural network, our focus is to achieve a design that is convenient for the subsequent layer embedding analysis, and also has a reasonable (though at times can be flawed) classification performance, part II,

[Image: media_image6.png, greyscale], part III,


[Image: media_image7.png, greyscale], part IIIC) [predetermined number of layers interpreted as described architecture of convolutional layers, batch normalization, and activation layers without any skip connections]
Giataganas et al. and Zhang et al. are in the same art of representing information with data structures/nodes (Giataganas et al., abstract; Zhang et al., abstract). The combination of Zhang et al. with Giataganas et al. will enable for each layer of a graph neural network, aggregating. It would have been obvious at the time of filing to one of ordinary skill in the art to combine the layer model of Zhang et al. with the invention of Giataganas et al. as this was known at the time of filing, the combination would have predictable results, as the neural networks of Giataganas et al. have layers that are just not described, and as Zhang et al. indicate, “The results show that our method is not only able to provide visualizations that are easy to interpret, but that the embedded decision-based information is also useful for improving model performance in terms of probability calibration and classification, achieving the best performance compared to other baseline methods. Moreover, this method is computationally efficient, easy to implement, and robust to hyper-parameters” (abstract) providing a computational efficiency benefit to combining inventions.

Regarding claim 18, Giataganas et al. and Zhang et al. disclose the computer program product of claim 17. Giataganas et al. further indicate each layer of the graph neural network is associated with a distinct predefined time step (“In some embodiments, temporal information is identified for a part of the data streams associated with the procedural state. The temporal information may include (for example) a start time, end time, duration and/or range. The temporal information may be absolute (e.g., specifying one or more absolute times) or relative (e.g., specifying a time from a beginning of a surgery, from a beginning of a procedure initiation, etc.). The temporal information may include information defining a time period or time corresponding to the video data during which it is estimated that the surgery was in the procedural state”, [0059], For example, if surgery data is divided into data chunks, and each data chunk is associated with a procedural state, the temporal information may identify a start and end time that corresponds to (for example) a start time of a first data chunk associated with a given procedural state and an end time of a last data chunk associated with a given procedural state, respectively, [0060]).

Regarding claim 19, Giataganas et al. and Zhang et al. disclose the computer program product of claim 17. Giataganas et al. further indicate the method further comprises storing a starting timepoint and an ending timepoint of the surgical phase based on a set of sequential video frames identified to represent the surgical phase (“In some embodiments, temporal information is identified for a part of the data streams associated with the procedural state. The temporal information may include (for example) a start time, end time, duration and/or range. The temporal information may be absolute (e.g., specifying one or more absolute times) or relative (e.g., specifying a time from a beginning of a surgery, from a beginning of a procedure initiation, etc.). The temporal information may include information defining a time period or time corresponding to the video data during which it is estimated that the surgery was in the procedural state”, [0059], For example, if surgery data is divided into data chunks, and each data chunk is associated with a procedural state, the temporal information may identify a start and end time that corresponds to (for example) a start time of a first data chunk associated with a given procedural state and an end time of a last data chunk associated with a given procedural state, respectively, [0060]).

Regarding claim 20, Giataganas et al. and Zhang et al. disclose the computer program product of claim 17. Giataganas et al. further indicate the surgical video is a real-time video stream or the surgical video is processed post-operatively (a computer-implemented method is provided that uses a multi-dimensional artificial intelligence to process live or previously collected surgical data. The processing, using the multiple interconnected structures, can result in both an identification of particular characteristics of the surgery and further a comparison of the particular characteristics to target characteristics, [0030] The one or more data streams may be transmitted (e.g., as it is collected or after collection of all data from a procedure) to a processing unit that may process the data in real-time or store the data for subsequent processing [0036] The one or more data streams can be generated and received from one or more electronic devices configured and positioned to capture live or previously collected data during a surgical procedure [0038] The multi-dimensional artificial intelligence can be used to process e.g., in real-time or offline one or more data streams e.g., video streams, [0039], FIG. 1 shows a network for using live or previously collected data from a surgical procedure to generate and/or update an electronic output in accordance with some embodiments of the invention, [0041]).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: US 20240112809 A1 [Highly relevant, could have used alternatively for claim 1 as while provisional application for the reference does not have support for the filing date therefore earliest priority date of the reference is the PCT date March 2022, the current application also does not have support for the claimed priority date, therefore any amendments going forward should take this reference into consideration as it does predate the current application PCT date of Aug 2022]: As used herein, a “surgical phase” or “surgical phase” is a period of time within a surgical procedure in which an action or set of related actions is taken by the surgeon. In general, surgical phases are sequential, although it will be appreciated that the order of some the surgical phases can vary for a given procedure and that some phases can be interrupted by another phase, such that they appear more than once in a given sequence. A “temporal processing network module,” as used herein, is a cell or set of related cells within a neural network with some form of memory, that is, the cell or set of cells is capable of accessing data associated with inputs to the network other than a current input or training samples used in an initial training of the network. Examples of such networks include long short term memory networks, networks employing gated recurrent units, continuous time recurrent neural networks, neural Turing machines, Attention-based networks, and temporal convolutional networks. At least some of the nodes in the graph, and the temporal processing network modules 122-124 representing the nodes, are connected to other nodes via edges 126 and 127, such that data connections exist between nodes in the first set of plurality of temporal processing network modules 122-125 and the second set of the plurality of neural networks 126 and 127 to provide a graph structure. 
Some of the first set of temporal processing network modules 127 represent global nodes that represent information about the current state of the surgery outside of the specific concepts represented by the connected nodes. In one example, the statistical parameter representing the state of the surgical procedure for the time period represents the likelihood that a surgical safety notion, such as the critical view of safety, has been achieved during the surgical procedure. In another implementation, where the surgical procedure is a laparoscopic cholecystectomy, the statistical parameter can represent a value for the Parkland Grading Scale. Further, in one implementation, the state of each of the first set of temporal processing network modules represents a concept associated with the surgery. In the example of a laparoscopic cholecystectomy, the state of a first temporal processing network module of the first set of temporal processing network modules represents a likelihood of exposure of the cystic plate and the state of a second temporal processing network module of the first set of temporal processing network modules represents a likelihood of exposure of the cystic duct]; US 20240390103 A1 (In some embodiments, the computer image analysis may include using a neural network model trained using example video frames including previously identified surgical events to thereby identify a similar surgical event in a set of frames. In other words, frames of one or more videos that are known to be associated with a particular surgical event may be used to train a neural network model. The trained neural network model may therefore be used to identify whether one or more video frames are also associated with the surgical event, [0032], “Machine learning algorithms (also referred to artificial intelligence) may be employed for the purposes of analyzing the video to identify surgical events. Such algorithms be trained using training examples, such as described below. 
Some non-limiting examples of such machine learning algorithms may include classification algorithms, data regressions algorithms, image segmentation algorithms, visual detection algorithms (such as object detectors, face detectors, person detectors, motion detectors, edge detectors, etc.), visual recognition algorithms (such as face recognition, person recognition, object recognition, etc.), speech recognition algorithms, mathematical embedding algorithms, natural language processing algorithms, support vector machines, random forests, nearest neighbors algorithms, deep learning algorithms, artificial neural network algorithms, convolutional neural network algorithms, recursive neural network algorithms, linear machine learning models, non-linear machine learning models, ensemble algorithms, and so forth. For example, a trained machine learning algorithm may comprise an inference model, such as a predictive model, a classification model, a regression model, a clustering model, a segmentation model, an artificial neural network (such as a deep neural network, a convolutional neural network, a recursive neural network, etc.), a random forest, a support vector machine, and so forth”, [0033] Disclosed embodiments may involve receiving a plurality of video frames from a plurality of surgical videos. Surgical videos may refer to any video, group of video frames, or video footage including representations of a surgical procedure. For example, the surgical video may include one or more video frames captured during a surgical operation, [0061], a data structure associating combinations of surgical instruments and surgical phases with alternative prospective actions may be accessed based on the surgical instrument detected by Step 720 and the phase of the ongoing surgical procedure at the particular time determined by Step 730 to determine the at least one alternative prospective action. 
In one example, a data structure associating surgical phases with alternative prospective actions may be accessed based on the phase of the ongoing surgical procedure at the particular time determined by Step 730 to determine the at least one alternative prospective action. In some examples, a relationship between the at least one alternative prospective action and the surgical instrument detected by Step 720 may be determined. For example, a statistical model may be used to determine a statistical relationship between the at least one alternative prospective action and the surgical instrument detected by Step 720. In another example, a graph data structure with edges connecting nodes of prospective actions and nodes of surgical instruments may be accessed based on the on the surgical instrument detected by Step 720 and the determined at least one alternative prospective action to determine the relationship based on the existence of an edge between the two, or based on a weight or a label associated with an edge connecting the two. In some examples, Step 740 may further base the determination of the likelihood that the prospective action involving the surgical instrument is about to take place at the unsuitable phase of the ongoing surgical procedure on the relationship between the at least one alternative prospective action and the surgical instrument detected by Step 720, [0100]).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHELLE M ENTEZARI HAUSMANN whose telephone number is (571)270-5084. The examiner can normally be reached 10-7 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vincent M Rudolph can be reached at (571) 272-8243. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHELLE M ENTEZARI HAUSMANN/Primary Examiner, Art Unit 2671
Prosecution Timeline

Feb 16, 2024
Application Filed
May 06, 2026
Non-Final Rejection mailed — §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

18/267,598
Patent 12638400
Method for monitoring and/or controlling phase separation in chemical processes and samples
2y 11m to grant Granted May 26, 2026
18/348,495
Patent 12639803
SYSTEMS AND METHODS FOR MATERIAL ACCRETION DETECTION AND REMOVAL
2y 10m to grant Granted May 26, 2026
18/136,006
Patent 12629121
METHOD OF DETERMINING VESSEL FLUID FLOW VELOCITY
3y 1m to grant Granted May 19, 2026
18/034,833
Patent 12626375
HOMOGRAPHY MATRIX GENERATION APPARATUS, CONTROL METHOD, AND COMPUTER-READABLE MEDIUM
3y 0m to grant Granted May 12, 2026
18/179,635
Patent 12620252
INFORMATION SOURCE DETECTION USING UNIQUE WATERMARKS
3y 2m to grant Granted May 05, 2026
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 76%
With Interview (+21.3%): 98%
Median Time to Grant: 3y 0m (~8m remaining)
PTA Risk: Low
Based on 870 resolved cases by this examiner. Grant probability derived from career allowance rate.