Last updated: May 29, 2026
Application No. 18/383,279
SYSTEM AND METHOD FOR CLASSIFYING TASK

Non-Final OA §103
Filed
Oct 24, 2023
Priority
Oct 24, 2022 — provisional 63/418,907
Examiner
MANGIALASCHI, TRACY
Art Unit
2668
Tech Center
2600 — Communications
Assignee
3M Company
OA Round
1 (Non-Final)
Interview Optional

Examiner has a relatively high allowance rate (75%) and a +28.0% interview lift. A written response may suffice.
Based on 586 resolved cases, 2023–2026
Examiner Intelligence

MANGIALASCHI, TRACY
Grants 75% — above average
Career Allowance Rate
439 granted / 586 resolved
+12.9% vs TC avg
Strong +28% interview lift
+28.0%
Interview Lift
resolved cases with interview
Typical timeline
3y 0m
Avg Prosecution
14 currently pending
Career history
600
Total Applications
across all art units
Statute-Specific Performance

§101
1.1%
-38.9% vs TC avg
§103
85.8%
+45.8% vs TC avg
§102
4.3%
-35.7% vs TC avg
§112
1.0%
-39.0% vs TC avg
Based on career data from 586 resolved cases
Office Action

§103
DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Claims
Claims 1-20, as originally filed, are currently pending and have been considered below.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim(s) 1-8 and 12-20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yamamoto et al., Japanese Publication No. JP 2021 076913 (A), hereinafter, “Yamamoto”, and further in view of Zhang, U.S. Publication No. 2021/0027485, hereinafter, “Zhang”.

As per claim 1, Yamamoto discloses a method of classifying a task in a workplace, the method comprising: 
obtaining, via at least one image capturing device, a plurality of images for a predetermined period of time (Yamamoto, ¶0016, The multimodal recognition system consists of a computer 100, a microphone 101, a camera 102, a sensor 103, and a wireless device 104; Yamamoto, ¶0017, Microphone 101 collects various sounds, including speech and ambient noise. Camera 102 acquires an image. Sensor 103 measures temperature, acceleration, and other parameters; Yamamoto, ¶0027, The motion recognition model 132 is a model for identifying human motion based on input images and sensor values. The location identification model 133 is a model for identifying a person's location based on input images and sensor values; Yamamoto, ¶0035, The motion identification unit 122 uses the motion identification model 132 to output information indicating human motion as event data (identification result) from image data input via the camera 102 and sensor data input via the sensor 103. Event data includes a timestamp; Yamamoto, ¶0039, The task identification unit 124 uses the event data obtained by inputting data for each modality included in a certain time range into each identification unit, along with the task identification model 134, to calculate the probability that the task corresponds to the task in the task identification model 134);
obtaining, via at least one audio sensor, an audio signal corresponding to the predetermined period of time (Yamamoto, ¶0002, use of multimodal recognition systems, which combine data from multiple modalities such as audio, images, and sensor values to identify arbitrary events ... multimodal recognition systems are used to identify tasks performed in factories and other facilities; Yamamoto, ¶0016, The multimodal recognition system consists of a computer 100, a microphone 101, a camera 102, a sensor 103, and a wireless device 104; Yamamoto, ¶0017, Microphone 101 collects various sounds, including speech and ambient noise; Yamamoto, ¶0027, The speech recognition model 130 is a model for identifying human speech based on input sounds. The ambient sound recognition model 131 is a model for recognizing ambient sounds based on input sounds; Yamamoto, ¶0033, The voice recognition unit 120 uses the voice recognition model 130 to output natural language text information as event data (recognition result) from the sound data input via the microphone 101. Event data includes a timestamp; Yamamoto, ¶0034, The ambient sound identification unit 121 uses the ambient sound identification model 131 to output the type of ambient sound as event data (identification result) from the sound data input via the microphone 101. Event data includes a timestamp; Yamamoto, ¶0039, The task identification unit 124 uses the event data obtained by inputting data for each modality included in a certain time range into each identification unit, along with the task identification model 134, to calculate the probability that the task corresponds to the task in the task identification model 134); 
classifying, via a first trained machine learning model, the plurality of images (Yamamoto, ¶0008, a learning unit that generates a model for identifying a relevant event from among multiple events based on event data which is the identification result output from multiple classifiers that handle data of different modalities ... an event tag indicating the type of the event data, and event tag data which includes the event data to which the event tag is attached); 
classifying, via a second trained machine learning model, the audio signal (Yamamoto, ¶0008, a learning unit that generates a model for identifying a relevant event from among multiple events based on event data which is the identification result output from multiple classifiers that handle data of different modalities ... an event tag indicating the type of the event data, and event tag data which includes the event data to which the event tag is attached); 
determining, via a merging algorithm, a list of third class probabilities and a list of third class labels corresponding to the list of third class probabilities based at least on the list of first class probabilities and the list of second class probabilities, wherein each third class probability is indicative of a probability of the corresponding third class label being the task (Yamamoto, ¶0003, A multimodal recognition system is realized by combining a model that identifies data from multiple modalities (individual models) with a model that uses the identification results of each individual model as input to output a final identification result; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability; Yamamoto, ¶0039, The task identification unit 124 uses the event data obtained by inputting data for each modality included in a certain time range into each identification unit, along with the task identification model 134, to calculate the probability that the task corresponds to the task in the task identification model 134; Yamamoto, ¶0040, The task identification unit 124 of Example 1 calculates the probability of multiple tasks using multiple task identification models 134. The task identification unit 124 outputs information indicating the task with the highest probability as the identification result. For example, as shown in Figure 4, the task identification unit 124 outputs an identification result consisting of the task 401, start time 402, end time 403, and confidence level 404); and 
determining, via a processor, the task corresponding to the predetermined period of time based at least on the list of third class probabilities (Yamamoto, ¶0003, A multimodal recognition system is realized by combining a model that identifies data from multiple modalities (individual models) with a model that uses the identification results of each individual model as input to output a final identification result; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability; Yamamoto, ¶0040; Yamamoto, ¶0041, The confidence level 404 is a field that stores the probability that the task corresponds to task 401).
Yamamoto does not explicitly disclose the following limitations as further recited; however, Zhang discloses:
classifying, via a first trained machine learning model, the plurality of images to generate a list of first class probabilities and a list of first class labels corresponding to the list of first class probabilities, wherein each first class probability is indicative of a probability of the corresponding first class label being the task (Zhang, ¶0012, obtaining, by the one or more computers, image data from a camera, the image data representing an image of a monitored area; providing, by the one or more computers and to one or more machine learning models, input data that is based on the image data representing the image of the monitored area, where the one or more machine learning models have been trained to detect different properties of the monitored area; receiving, by the one or more computers, output of the one or more machine learning models, the output indicating (i) one or more status classifications for the monitored area and a respective location for each of the one or more status classifications; Zhang, ¶0014, evaluating the output of the one or more machine learning models includes applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores. Evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area; Zhang, ¶0029, providing output indicating the detected condition present in the monitored area includes providing image data for an image of the monitored area having an annotation indicating a location of the detected condition within the monitored area; Zhang, ¶0041, providing the output comprises providing annotation data for the image data; Zhang, ¶0131, The image data can be labelled with the corresponding conditions represented in the image data; Zhang, ¶0019, the method includes generating a record for a task corresponding to the detected condition; Zhang, ¶0035, the method includes, based on evaluating the filtered list of identified objects to detect the condition, generating a record for a task corresponding to the detected condition);
classifying, via a second trained machine learning model, the audio signal to generate a list of second class probabilities and a list of second class labels corresponding to the list of second class probabilities, wherein each second class probability is indicative of a probability of the corresponding second class label being the task (Zhang, ¶0014, evaluating the output of the one or more machine learning models includes applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores. Evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area; Zhang, ¶0023, the method includes obtaining audio data recorded by a microphone located in the monitored area; and using the audio data to determine an event or condition at the monitored area; Zhang, ¶0045, In some implementations, the method includes obtaining audio data recorded by a microphone located in the monitored area; and using the audio data to determine an event or condition at the monitored area; Zhang, ¶0079, In stage (B), the computing system 120 processes the sensor data and generates input for one or more machine learning models. For example, the computing system 120 receives the image data 114a, 114b and the audio data 116 and can use a data pre-processor 121 to extract feature values to be provided as input ... To facilitate data processing, each set of sensor data is associated with an accompanying set of metadata that indicates, for example, a timestamp indicating a time of capture, a sensor identifier (e.g., indicating which camera, microphone, etc. generated the data), a location identifier; Zhang, ¶0080, In stage (C), one or more machine learning models 123 process the input data representing the sensed parameters of the environment … Another model or set of models can be used to process input from the audio data 116; Zhang, ¶0110, As noted above, the system 100 can use audio data 116 as well as image data 114a, 114b to detect events and conditions at monitored locations. The data can be associated with timestamps so that the system 100 aligns or synchronizes sensor data from different sources, allowing multiple forms of sensor data to be used to detect conditions ... A set of predetermined words and phrases can be associated with different conditions, and the detection of these words in the audio data 116 can be used to detect a condition ... the computer system 120 can further localize the condition detected from audio data 116 using the results from the neural network models 123 in processing image data 114a, 114b; Zhang, ¶0138, The audio data, with timestamps indicating the portions that correspond to different captured images, can also be used by the models to corroborate or verify that certain events or conditions have occurred).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine the teachings of Zhang with Yamamoto because they are in the same field of endeavor.  One skilled in the art would have been motivated to include the filtered lists as taught by Zhang in the system of Yamamoto in order to provide an alternate means to determine the probabilities of events corresponding to tasks (Zhang, ¶0035; Yamamoto, ¶0037).
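
For orientation, the data flow recited in claim 1 (two per-modality classifiers, each producing a label list and a probability list, followed by a merging step and a final task determination) can be sketched as below. This is a minimal sketch only: the weighted-average merging rule, the function names merge_class_lists and determine_task, the image_weight parameter, and the example task labels are all illustrative assumptions. Neither the claim language nor the passages of Yamamoto and Zhang quoted above specify this particular merging rule.

```python
from collections import defaultdict


def merge_class_lists(image_preds, audio_preds, image_weight=0.5):
    """Merge two (label, probability) lists into a third ranked list.

    Assumed rule (hypothetical): weighted average of the per-label
    probabilities, where a label missing from one list contributes 0.0
    for that modality.
    """
    audio_weight = 1.0 - image_weight
    merged = defaultdict(float)
    for label, prob in image_preds:
        merged[label] += image_weight * prob
    for label, prob in audio_preds:
        merged[label] += audio_weight * prob
    # Rank by merged probability (descending); break ties alphabetically.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    third_labels = [label for label, _ in ranked]
    third_probs = [prob for _, prob in ranked]
    return third_labels, third_probs


def determine_task(third_labels, third_probs):
    """Pick the top-ranked label as the task, or None if the list is empty."""
    return third_labels[0] if third_labels else None


# Hypothetical outputs of the first (image) and second (audio) models.
image_preds = [("welding", 0.6), ("grinding", 0.3), ("painting", 0.1)]
audio_preds = [("grinding", 0.7), ("welding", 0.2), ("idle", 0.1)]

third_labels, third_probs = merge_class_lists(image_preds, audio_preds)
task = determine_task(third_labels, third_probs)
print(task, list(zip(third_labels, third_probs)))
```

In this sketch each modality keeps its own label list until the merging step, which is the structure the claim recites as first, second, and third lists.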

As per claim 2, Yamamoto and Zhang disclose the method of claim 1, further comprising: determining, via the processor, a location of the task within the workplace; and determining, via the processor, a plurality of predetermined tasks performable within the location (Yamamoto, ¶0026, The sub-memory 112 stores a voice recognition model 130, an ambient sound recognition model 131, an action recognition model 132, a location recognition model 133, and multiple task recognition models 134 as models for realizing multimodal recognition processing; Yamamoto, ¶0028, The task identification model 134 is a model for identifying whether or not a task falls under a particular category, based on the identification results output using the voice identification model 130, the ambient sound identification model 131, the motion identification model 132, and the location identification model 133; Yamamoto, ¶0036, The position identification unit 123 uses the position identification model 133 to output information indicating a person's location as event data (identification result) from sensor data input via sensor 103 and wireless communication data input via wireless device 104; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability).

As per claim 3, Yamamoto and Zhang disclose the method of claim 2.  Zhang discloses further comprising: 
modifying, via the processor, the list of first class labels and the list of first class probabilities received from the first trained machine learning model by removing one or more first class labels from the list of first class labels that are absent in the plurality of predetermined tasks performable within the location and removing the corresponding one or more first class probabilities from the list of first class probabilities (Zhang, ¶0014,  evaluating the output of the one or more machine learning models includes applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores. Evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area); and 
providing, via the processor, the modified list of first class labels and the modified list of first class probabilities to the merging algorithm prior to determination of the list of third class probabilities and the list of third class labels, wherein the merging algorithm determines the list of third class probabilities and the list of third class labels based at least on the modified list of first class probabilities (Zhang, ¶0030, a method includes: obtaining image data from a camera, the image data representing an image of a monitored area; providing, to one or more machine learning models, input data obtained based on the image data representing the image of the monitored area, wherein the one or more machine learning models have been trained to detect a plurality of different types of objects and indicate a status of at least one of the types of objects; receiving output of the one or more machine learning models, the output indicating (i) locations of identified objects in the image data representing the image of the monitored area, (ii) an object status classification for at least one of the identified objects, and (iii) confidence scores for the identification of the objects and/or the object status classifications; applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores; evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area; and providing output indicating the detected condition present in the monitored area; Zhang, ¶0035, the method includes, based on evaluating the filtered list of identified objects to detect the condition, generating a record for a task corresponding to the detected condition; Zhang, ¶0098, During stage (D), the computer system 120 processes the outputs of the models 123 using a post-processing module 124, which can filter or otherwise adjust and interpret the results from the models 123 ... The post-processing module 124 actions can remove detected objects that have confidence scores less than a threshold indicated by the rule set 125).  The motivation would be the same as above in claim 1.
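
The two list-filtering operations named in this passage can be sketched side by side: the claim 3 step of removing labels that are absent from the predetermined tasks performable within the location, and the post-processing step quoted from Zhang ¶0098 of removing entries whose confidence score is below a threshold. This is a minimal sketch under stated assumptions: the TASKS_BY_LOCATION mapping, the function names, the threshold value, and the example labels are hypothetical, and the sketch does not renormalize the remaining probabilities because nothing quoted above says to do so.

```python
# Hypothetical mapping from a location to the tasks performable there.
TASKS_BY_LOCATION = {
    "weld_bay": {"welding", "grinding"},
    "paint_booth": {"painting", "sanding"},
}


def filter_by_location(labels, probs, location, tasks_by_location=TASKS_BY_LOCATION):
    """Drop labels (and their probabilities) not performable at the location."""
    allowed = tasks_by_location.get(location, set())
    kept = [(label, prob) for label, prob in zip(labels, probs) if label in allowed]
    return [label for label, _ in kept], [prob for _, prob in kept]


def filter_by_confidence(labels, probs, threshold):
    """Drop entries whose score is less than the threshold."""
    kept = [(label, prob) for label, prob in zip(labels, probs) if prob >= threshold]
    return [label for label, _ in kept], [prob for _, prob in kept]


# Hypothetical output of the first (image) model.
first_labels = ["welding", "painting", "grinding"]
first_probs = [0.5, 0.3, 0.2]

loc_labels, loc_probs = filter_by_location(first_labels, first_probs, "weld_bay")
conf_labels, conf_probs = filter_by_confidence(first_labels, first_probs, 0.25)
print(loc_labels, loc_probs)
print(conf_labels, conf_probs)
```

On the same input the two filters keep different entries: the location filter keeps "grinding" despite its low score, while the confidence filter keeps "painting" regardless of where the images were captured.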

As per claim 4, Yamamoto and Zhang disclose the method of claim 2.  Zhang discloses further comprising: modifying, via the processor, the list of second class labels and the list of second class probabilities received from the second trained machine learning model by removing one or more second class labels from the list of second class labels that are absent in the plurality of predetermined tasks performable within the location and removing the corresponding one or more second class probabilities from the list of second class probabilities (Zhang, ¶0014,  evaluating the output of the one or more machine learning models includes applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores. Evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area); and 
providing, via the processor, the modified list of second class labels and the modified list of second class probabilities to the merging algorithm prior to determination of the list of third class probabilities and the list of third class labels, wherein the merging algorithm determines the list of third class probabilities and the list of third class labels based at least on the modified list of second class probabilities (Zhang, ¶0030, a method includes: obtaining image data from a camera, the image data representing an image of a monitored area; providing, to one or more machine learning models, input data obtained based on the image data representing the image of the monitored area, wherein the one or more machine learning models have been trained to detect a plurality of different types of objects and indicate a status of at least one of the types of objects; receiving output of the one or more machine learning models, the output indicating (i) locations of identified objects in the image data representing the image of the monitored area, (ii) an object status classification for at least one of the identified objects, and (iii) confidence scores for the identification of the objects and/or the object status classifications; applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores; evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area; and providing output indicating the detected condition present in the monitored area; Zhang, ¶0035, the method includes, based on evaluating the filtered list of identified objects to detect the condition, generating a record for a task corresponding to the detected condition; Zhang, ¶0098, During stage (D), the computer system 120 processes the outputs of the models 123 using a post-processing module 124, which can filter or otherwise adjust and interpret the results from the models 123 ... The post-processing module 124 actions can remove detected objects that have confidence scores less than a threshold indicated by the rule set 125).  The motivation would be the same as above in claim 1.

As per claim 5, Yamamoto and Zhang disclose the method of claim 2, wherein determining the task corresponding to the predetermined period of time further based on an overlap between the list of third class labels and the plurality of predetermined tasks performable within the location (Yamamoto, ¶0026, The sub-memory 112 stores a voice recognition model 130, an ambient sound recognition model 131, an action recognition model 132, a location recognition model 133, and multiple task recognition models 134 as models for realizing multimodal recognition processing; Yamamoto, ¶0028, The task identification model 134 is a model for identifying whether or not a task falls under a particular category, based on the identification results output using the voice identification model 130, the ambient sound identification model 131, the motion identification model 132, and the location identification model 133. In this embodiment, a work identification model 134 is maintained for each work; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability; Yamamoto, ¶0039, The task identification unit 124 uses the event data obtained by inputting data for each modality included in a certain time range into each identification unit, along with the task identification model 134, to calculate the probability that the task corresponds to the task in the task identification model 134; Yamamoto, ¶0040, The task identification unit 124 of Example 1 calculates the probability of multiple tasks using multiple task identification models 134. The task identification unit 124 outputs information indicating the task with the highest probability as the identification result. For example, as shown in Figure 4, the task identification unit 124 outputs an identification result consisting of the task 401, start time 402, end time 403, and confidence level 404).
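
One possible reading of the claim 5 limitation (determining the task based on an overlap between the list of third class labels and the predetermined tasks performable within the location) can be sketched as below. This is a minimal sketch, assuming that "overlap" means set intersection and that the highest-probability label inside the intersection is selected; the function name, the tie handling, and the example labels are hypothetical and are not taken from the application or the cited references.

```python
def determine_task_with_overlap(third_labels, third_probs, location_tasks):
    """Return the most probable merged label that is also performable at the
    location, or None when the two sets do not overlap."""
    overlap = [
        (label, prob)
        for label, prob in zip(third_labels, third_probs)
        if label in location_tasks
    ]
    if not overlap:
        return None
    # max() returns the first entry among equal probabilities.
    best_label, _ = max(overlap, key=lambda pair: pair[1])
    return best_label


# Hypothetical merged (third) list and location task set.
third_labels = ["painting", "grinding", "welding"]
third_probs = [0.45, 0.35, 0.20]
weld_bay_tasks = {"welding", "grinding"}

selected = determine_task_with_overlap(third_labels, third_probs, weld_bay_tasks)
print(selected)
```

Here "painting" has the highest merged probability overall but falls outside the location's task set, so the sketch selects "grinding" instead.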

As per claim 6, Yamamoto and Zhang disclose the method of claim 2.  Zhang discloses wherein the location of the task within the workplace is determined based on a predetermined location of the at least one image capturing device and/or a predetermined location of the at least one audio sensor (Zhang, ¶0006, In general, the techniques herein enable a computer system to use a camera or other sensor to monitor an area, detect conditions in the monitored area that satisfy criteria ... one or more machine learning models can be used to detect an object or region but also to detect the state or condition of the object or region. In addition, the system can evaluate the detected conditions; Zhang, ¶0008, With data from cameras and/or other sensors, the system can identify issues in a space).  The motivation would be the same as above in claim 1.

As per claim 7, Yamamoto and Zhang disclose the method of claim 2, wherein determining the location of the task within the workplace further comprises determining, via the first trained machine learning model, the location of the task based on the plurality of images (Yamamoto, ¶0027, The location identification model 133 is a model for identifying a person's location based on input images and sensor values).

As per claim 8, Yamamoto and Zhang disclose the method of claim 2.  Zhang discloses wherein the location comprises a plurality of zones (Zhang, ¶0054, In some implementations, the method includes: using multiple cameras located at a location to capture image data representing different types of areas of the location, and processing the image data from each of the multiple cameras using a different model corresponding to the camera, each of the different models having a different training state).  The motivation would be the same as above in claim 1.

As per claim 12, Yamamoto and Zhang disclose the method of claim 1, further comprising: obtaining, via at least one environment sensor, an environmental signal, wherein the environmental signal is indicative of an environmental parameter associated with the task; and determining the task corresponding to the predetermined period of time further based on the environmental signal (Yamamoto, ¶0017, Sensor 103 measures temperature, acceleration, and other parameters; Yamamoto, ¶0035, The motion identification unit 122 uses the motion identification model 132 to output information indicating human motion as event data (identification result) from image data input via the camera 102 and sensor data input via the sensor 103. Event data includes a timestamp).

As per claim 13, Yamamoto and Zhang disclose the method of claim 1, further comprising: receiving a set of labelled images, wherein the set of labelled images comprises a corresponding first task label indicative of a potential task; providing the set of labelled images to a first machine learning algorithm; and generating the first trained machine learning model through the first machine learning algorithm (Yamamoto, ¶0008, a learning unit that generates a model for identifying a relevant event from among multiple events based on event data which is the identification result output from multiple classifiers that handle data of different modalities ... an event tag indicating the type of the event data, and event tag data which includes the event data to which the event tag is attached; Yamamoto, ¶0026, the sub-memory 112 stores input data management information 135 and event tag management information 136 as information used in the learning process; Yamamoto, ¶0043, The learning unit 125, the learning data generation unit 126, and the key event extraction unit 127 are functionally configured to realize the learning process for generating the work identification model 134; Yamamoto, ¶0044, the learning data generation unit 126 generates learning data by adding event tags to the input data 140; Yamamoto, ¶0045, The key event extraction unit 127 extracts key event data, which is event data specific to a particular task; Yamamoto, ¶0046, The learning unit 125 uses the training data to perform a training process to generate a task identification model 134).

As per claim 14, Yamamoto and Zhang disclose the method of claim 1, further comprising: receiving a set of labelled audio clips, wherein each labelled audio clip comprises a corresponding second task label indicative of a sound produced within the workplace; providing the set of labelled audio clips to a second machine learning algorithm; and generating the second trained machine learning model through the second machine learning algorithm (Yamamoto, ¶0019, Computer 100 generates a model for realizing multimodal recognition processing using data from multiple modalities, such as sound, images, and sensor values, and also executes multimodal recognition processing using the said model. In this embodiment, the computer 100 performs multimodal recognition processing to identify tasks using data from multiple modalities; Yamamoto, ¶0026, The sub-memory 112 stores a voice recognition model 130, an ambient sound recognition model 131 … the sub-memory 112 stores input data management information 135 and event tag management information 136 as information used in the learning process; Yamamoto, ¶0027, The speech recognition model 130 is a model for identifying human speech based on input sounds. The ambient sound recognition model 131 is a model for recognizing ambient sounds based on input sounds; Yamamoto, ¶0028, The task identification model 134 is a model for identifying whether or not a task falls under a particular category, based on the identification results output using the voice identification model 130, the ambient sound identification model 131, the motion identification model 132, and the location identification model 133; Yamamoto, ¶0033, The voice recognition unit 120 uses the voice recognition model 130 to output natural language text information as event data (recognition result) from the sound data input via the microphone 101. Event data includes a timestamp; Yamamoto, ¶0034, The ambient sound identification unit 121 uses the ambient sound identification model 131 to output the type of ambient sound as event data (identification result) from the sound data input via the microphone 101. Event data includes a timestamp; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability).

As per claim 15, Yamamoto discloses a system for classifying a task in a workplace, the system comprising: 
at least one image capturing device configured to capture a plurality of images for a predetermined period of time (Yamamoto, ¶0016, The multimodal recognition system consists of a computer 100, a microphone 101, a camera 102, a sensor 103, and a wireless device 104; Yamamoto, ¶0017, Microphone 101 collects various sounds, including speech and ambient noise. Camera 102 acquires an image. Sensor 103 measures temperature, acceleration, and other parameters; Yamamoto, ¶0027, The motion recognition model 132 is a model for identifying human motion based on input images and sensor values. The location identification model 133 is a model for identifying a person's location based on input images and sensor values; Yamamoto, ¶0035, The motion identification unit 122 uses the motion identification model 132 to output information indicating human motion as event data (identification result) from image data input via the camera 102 and sensor data input via the sensor 103. Event data includes a timestamp; Yamamoto, ¶0039, The task identification unit 124 uses the event data obtained by inputting data for each modality included in a certain time range into each identification unit, along with the task identification model 134, to calculate the probability that the task corresponds to the task in the task identification model 134); 
at least one audio sensor configured to capture sound waves corresponding to the predetermined period of time and generate an audio signal based on the captured sound waves (Yamamoto, ¶0002, use of multimodal recognition systems, which combine data from multiple modalities such as audio, images, and sensor values to identify arbitrary events ... multimodal recognition systems are used to identify tasks performed in factories and other facilities; Yamamoto, ¶0016, The multimodal recognition system consists of a computer 100, a microphone 101, a camera 102, a sensor 103, and a wireless device 104; Yamamoto, ¶0017, Microphone 101 collects various sounds, including speech and ambient noise; Yamamoto, ¶0027, The speech recognition model 130 is a model for identifying human speech based on input sounds. The ambient sound recognition model 131 is a model for recognizing ambient sounds based on input sounds; Yamamoto, ¶0033, The voice recognition unit 120 uses the voice recognition model 130 to output natural language text information as event data (recognition result) from the sound data input via the microphone 101. Event data includes a timestamp; Yamamoto, ¶0034, The ambient sound identification unit 121 uses the ambient sound identification model 131 to output the type of ambient sound as event data (identification result) from the sound data input via the microphone 101. Event data includes a timestamp; Yamamoto, ¶0039, The task identification unit 124 uses the event data obtained by inputting data for each modality included in a certain time range into each identification unit, along with the task identification model 134, to calculate the probability that the task corresponds to the task in the task identification model 134); 
a processor communicably coupled to the at least one image capturing device and the at least one audio sensor, wherein the processor is configured to obtain the plurality of images from the at least one image capturing device and the audio signal from the at least one audio sensor (Yamamoto, ¶0016, The multimodal recognition system consists of a computer 100, a microphone 101, a camera 102, a sensor 103, and a wireless device 104. Computer 100 is connected to microphone 101, camera 102, sensor 103, and wireless device 104 either directly or via a network. The network may be a LAN (Local Area Network) or a WAN (Wide Area Network), and the connection method may be either wireless or wired. Note that the number of each of the computer 100, microphone 101, camera 102, sensor 103, and wireless device 104 may be two or more); 
a first trained machine learning model communicably coupled to the processor, wherein the first trained machine learning model is configured to classify the plurality of images (Yamamoto, ¶0008, a learning unit that generates a model for identifying a relevant event from among multiple events based on event data which is the identification result output from multiple classifiers that handle data of different modalities ... an event tag indicating the type of the event data, and event tag data which includes the event data to which the event tag is attached); 
a second trained machine learning model communicably coupled to the processor, wherein the second trained machine learning model is configured to classify the audio signal (Yamamoto, ¶0008, a learning unit that generates a model for identifying a relevant event from among multiple events based on event data which is the identification result output from multiple classifiers that handle data of different modalities ... an event tag indicating the type of the event data, and event tag data which includes the event data to which the event tag is attached); and 
a merging algorithm communicably coupled to the processor, wherein the merging algorithm is configured to generate a list of third class probabilities and a list of third class labels corresponding to the list of third class probabilities based at least on the list of first class probabilities and the list of second class probabilities, and wherein each third class probability is indicative of a probability of the corresponding third class label being the task (Yamamoto, ¶0003, A multimodal recognition system is realized by combining a model that identifies data from multiple modalities (individual models) with a model that uses the identification results of each individual model as input to output a final identification result; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability; Yamamoto, ¶0039, The task identification unit 124 uses the event data obtained by inputting data for each modality included in a certain time range into each identification unit, along with the task identification model 134, to calculate the probability that the task corresponds to the task in the task identification model 134; Yamamoto, ¶0040, The task identification unit 124 of Example 1 calculates the probability of multiple tasks using multiple task identification models 134. The task identification unit 124 outputs information indicating the task with the highest probability as the identification result. For example, as shown in Figure 4, the task identification unit 124 outputs an identification result consisting of the task 401, start time 402, end time 403, and confidence level 404); 
wherein the processor is further configured to determine the task corresponding to the predetermined period of time based at least on the list of third class probabilities (Yamamoto, ¶0003, A multimodal recognition system is realized by combining a model that identifies data from multiple modalities (individual models) with a model that uses the identification results of each individual model as input to output a final identification result; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability; Yamamoto, ¶0040; Yamamoto, ¶0041, The confidence level 404 is a field that stores the probability that the task corresponds to task 401).
Yamamoto does not explicitly disclose the following limitations as further recited; however, Zhang discloses:
a first trained machine learning model communicably coupled to the processor, wherein the first trained machine learning model is configured to classify the plurality of images to generate a list of first class probabilities and a list of first class labels corresponding to the list of first class probabilities, and wherein each first class probability is indicative of a probability of the corresponding first class label being the task (Zhang, ¶0012, obtaining, by the one or more computers, image data from a camera, the image data representing an image of a monitored area; providing, by the one or more computers and to one or more machine learning models, input data that is based on the image data representing the image of the monitored area, where the one or more machine learning models have been trained to detect different properties of the monitored area; receiving, by the one or more computers, output of the one or more machine learning models, the output indicating (i) one or more status classifications for the monitored area and a respective location for each of the one or more status classifications; Zhang, ¶0014, evaluating the output of the one or more machine learning models includes applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores. Evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area; Zhang, ¶0029, providing output indicating the detected condition present in the monitored area includes providing image data for an image of the monitored area having an annotation indicating a location of the detected condition within the monitored area; Zhang, ¶0041, providing the output comprises providing annotation data for the image data; Zhang, ¶0131, The image data can be labelled with the corresponding conditions represented in the image data; Zhang, ¶0019, the method includes generating a record for a task corresponding to the detected condition; Zhang, ¶0035, the method includes, based on evaluating the filtered list of identified objects to detect the condition, generating a record for a task corresponding to the detected condition);
a second trained machine learning model communicably coupled to the processor, wherein the second trained machine learning model is configured to classify the audio signal to generate a list of second class probabilities and a list of second class labels corresponding to the list of second class probabilities, and wherein each second class probability is indicative of a probability of the corresponding second class label being the task (Zhang, ¶0014, evaluating the output of the one or more machine learning models includes applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores. Evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area; Zhang, ¶0023, the method includes obtaining audio data recorded by a microphone located in the monitored area; and using the audio data to determine an event or condition at the monitored area; Zhang, ¶0045, In some implementations, the method includes obtaining audio data recorded by a microphone located in the monitored area; and using the audio data to determine an event or condition at the monitored area; Zhang, ¶0079, In stage (B), the computing system 120 processes the sensor data and generates input for one or more machine learning models. For example, the computing system 120 receives the image data 114a, 114b and the audio data 116 and can use a data pre-processor 121 to extract feature values to be provided as input ... To facilitate data processing, each set of sensor data is associated with an accompanying set of metadata that indicates, for example, a timestamp indicating a time of capture, a sensor identifier (e.g., indicating which camera, microphone, etc. generated the data), a location identifier; Zhang, ¶0080, In stage (C), one or more machine learning models 123 process the input data representing the sensed parameters of the environment … Another model or set of models can be used to process input from the audio data 116; Zhang, ¶0110, As noted above, the system 100 can use audio data 116 as well as image data 114a, 114b to detect events and conditions at monitored locations. The data can be associated with timestamps so that the system 100 aligns or synchronizes sensor data from different sources, allowing multiple forms of sensor data to be used to detect conditions ... A set of predetermined words and phrases can be associated with different conditions, and the detection of these words in the audio data 116 can be used to detect a condition ... the computer system 120 can further localize the condition detected from audio data 116 using the results from the neural network models 123 in processing image data 114a, 114b; Zhang, ¶0138, The audio data, with timestamps indicating the portions that correspond to different captured images, can also be used by the models to corroborate or verify that certain events or conditions have occurred). 
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine the teachings of Zhang with Yamamoto because they are in the same field of endeavor.  One skilled in the art would have been motivated to include the filtered lists as taught by Zhang in the system of Yamamoto in order to provide an alternate means to determine the probabilities of events corresponding to tasks (Zhang, ¶0035; Yamamoto, ¶0037).
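
For illustration only, the merging step recited in claim 15 (combining a list of first class probabilities from the image model with a list of second class probabilities from the audio model into a list of third class probabilities) can be sketched as below. The weighted-average rule, the function names, and the example task labels are assumptions made for this sketch; neither the claim language nor the cited passages of Yamamoto or Zhang is quoted here as specifying this particular merging rule.

```python
from typing import Dict, List, Optional, Tuple


def merge_class_probabilities(
    first: Dict[str, float],
    second: Dict[str, float],
    weight_first: float = 0.5,
) -> Tuple[List[str], List[float]]:
    """Merge per-label probabilities from two classifiers (hypothetical rule).

    Assumption: a weighted average over the union of labels, renormalized
    so the merged ("third") probabilities sum to 1.
    """
    labels = sorted(set(first) | set(second))
    merged = {
        label: weight_first * first.get(label, 0.0)
        + (1.0 - weight_first) * second.get(label, 0.0)
        for label in labels
    }
    total = sum(merged.values())
    if total > 0:
        merged = {label: p / total for label, p in merged.items()}
    # Rank labels from most to least probable.
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    third_labels = [label for label, _ in ranked]
    third_probs = [p for _, p in ranked]
    return third_labels, third_probs


def determine_task(third_labels: List[str], third_probs: List[float]) -> Optional[str]:
    """Pick the label with the highest merged probability, if any."""
    return third_labels[0] if third_labels else None


# Hypothetical outputs of the image model and the audio model.
image_probs = {"welding": 0.7, "grinding": 0.3}
audio_probs = {"welding": 0.4, "grinding": 0.5, "painting": 0.1}
third_labels, third_probs = merge_class_probabilities(image_probs, audio_probs)
task = determine_task(third_labels, third_probs)
```

In this sketch a label reported by only one model is treated as having probability 0.0 in the other, which is one of several possible design choices.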

As per claim 16, Yamamoto and Zhang disclose the system of claim 15, wherein the processor is further configured to: determine a location of the task within the workplace; and determine a plurality of predetermined tasks performable within the location (Yamamoto, ¶0026, The sub-memory 112 stores a voice recognition model 130, an ambient sound recognition model 131, an action recognition model 132, a location recognition model 133, and multiple task recognition models 134 as models for realizing multimodal recognition processing; Yamamoto, ¶0028, The task identification model 134 is a model for identifying whether or not a task falls under a particular category, based on the identification results output using the voice identification model 130, the ambient sound identification model 131, the motion identification model 132, and the location identification model 133; Yamamoto, ¶0036, The position identification unit 123 uses the position identification model 133 to output information indicating a person's location as event data (identification result) from sensor data input via sensor 103 and wireless communication data input via wireless device 104; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability).

As per claim 17, Yamamoto and Zhang disclose the system of claim 16.  Zhang discloses wherein the processor is further configured to: modify the list of first class labels and the list of first class probabilities received from the first trained machine learning model by removing one or more first class labels from the list of first class labels that are absent in the plurality of predetermined tasks performable within the location and removing the corresponding one or more first class probabilities from the list of first class probabilities (Zhang, ¶0014, evaluating the output of the one or more machine learning models includes applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores. Evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area); and 
provide the modified list of first class labels and the modified list of first class probabilities to the merging algorithm prior to determination of the list of third class probabilities and the list of third class labels, wherein the merging algorithm determines the list of third class probabilities and the list of third class labels based at least on the modified list of first class probabilities (Zhang, ¶0030, a method includes: obtaining image data from a camera, the image data representing an image of a monitored area; providing, to one or more machine learning models, input data obtained based on the image data representing the image of the monitored area, wherein the one or more machine learning models have been trained to detect a plurality of different types of objects and indicate a status of at least one of the types of objects; receiving output of the one or more machine learning models, the output indicating (i) locations of identified objects in the image data representing the image of the monitored area, (ii) an object status classification for at least one of the identified objects, and (iii) confidence scores for the identification of the objects and/or the object status classifications; applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores; evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area; and providing output indicating the detected condition present in the monitored area; Zhang, ¶0035, the method includes, based on evaluating the filtered list of identified objects to detect the condition, generating a record for a task corresponding to the detected condition; Zhang, ¶0098, During stage (D), the computer system 120 processes the outputs of the models 123 using a post-processing module 124, which can filter or otherwise adjust and interpret the results from the models 123 ... The post-processing module 124 actions can remove detected objects that have confidence scores less than a threshold indicated by the rule set 125).  The motivation would be the same as above in claim 15.
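
For illustration only, the two list-filtering operations discussed for claim 17 can be sketched as below: removing class labels that are absent from the tasks performable at a location (as recited in the claim), and removing entries whose confidence score falls below a threshold (as described in the cited Zhang ¶0098). The function names, labels, and values are hypothetical and chosen for this sketch; it is not taken from the application or from either reference.

```python
from typing import List, Set, Tuple


def filter_by_allowed_tasks(
    labels: List[str], probs: List[float], allowed: Set[str]
) -> Tuple[List[str], List[float]]:
    """Drop labels (and their probabilities) not performable at the location."""
    kept = [(label, p) for label, p in zip(labels, probs) if label in allowed]
    return [label for label, _ in kept], [p for _, p in kept]


def filter_by_confidence(
    labels: List[str], probs: List[float], threshold: float
) -> Tuple[List[str], List[float]]:
    """Drop entries whose confidence score is below the threshold."""
    kept = [(label, p) for label, p in zip(labels, probs) if p >= threshold]
    return [label for label, _ in kept], [p for _, p in kept]


# Hypothetical first-model output and location-specific task set.
first_labels = ["welding", "painting", "grinding"]
first_probs = [0.6, 0.3, 0.1]
tasks_at_location = {"welding", "grinding"}

by_location = filter_by_allowed_tasks(first_labels, first_probs, tasks_at_location)
by_confidence = filter_by_confidence(first_labels, first_probs, 0.2)
```

With these example inputs the two filters keep different entries: the location filter drops "painting", while the 0.2 confidence threshold drops "grinding".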

As per claim 18, Yamamoto and Zhang disclose the system of claim 16.  Zhang discloses wherein the processor is further configured to: modify the list of second class labels and the list of second class probabilities received from the second trained machine learning model by removing one or more second class labels from the list of second class labels that are absent in the plurality of predetermined tasks performable within the location and removing the corresponding one or more second class probabilities from the list of second class probabilities (Zhang, ¶0014, evaluating the output of the one or more machine learning models includes applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores. Evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area); and 
provide the modified list of second class labels and the modified list of second class probabilities to the merging algorithm prior to determination of the list of third class probabilities and the list of third class labels, wherein the merging algorithm determines the list of third class probabilities and the list of third class labels based at least on the modified list of second class probabilities (Zhang, ¶0030, a method includes: obtaining image data from a camera, the image data representing an image of a monitored area; providing, to one or more machine learning models, input data obtained based on the image data representing the image of the monitored area, wherein the one or more machine learning models have been trained to detect a plurality of different types of objects and indicate a status of at least one of the types of objects; receiving output of the one or more machine learning models, the output indicating (i) locations of identified objects in the image data representing the image of the monitored area, (ii) an object status classification for at least one of the identified objects, and (iii) confidence scores for the identification of the objects and/or the object status classifications; applying one or more post-processing rules to the output of the one or more machine learning models to filter a list of the identified objects based on the confidence scores; evaluating the filtered list of identified objects with respect to one or more predetermined criteria to detect a condition present in the monitored area; and providing output indicating the detected condition present in the monitored area; Zhang, ¶0035, the method includes, based on evaluating the filtered list of identified objects to detect the condition, generating a record for a task corresponding to the detected condition; Zhang, ¶0098, During stage (D), the computer system 120 processes the outputs of the models 123 using a post-processing module 124, which can filter or otherwise adjust and interpret the results from the models 123 ... The post-processing module 124 actions can remove detected objects that have confidence scores less than a threshold indicated by the rule set 125).  The motivation would be the same as above in claim 15.

As per claim 19, Yamamoto and Zhang disclose the system of claim 16, wherein the processor is further configured to determine the task corresponding to the predetermined period of time further based on an overlap between the list of third class labels and the plurality of predetermined tasks performable within the location (Yamamoto, ¶0026, The sub-memory 112 stores a voice recognition model 130, an ambient sound recognition model 131, an action recognition model 132, a location recognition model 133, and multiple task recognition models 134 as models for realizing multimodal recognition processing; Yamamoto, ¶0028, The task identification model 134 is a model for identifying whether or not a task falls under a particular category, based on the identification results output using the voice identification model 130, the ambient sound identification model 131, the motion identification model 132, and the location identification model 133. In this embodiment, a work identification model 134 is maintained for each work; Yamamoto, ¶0037, The task identification unit 124 uses the task identification model 134 to calculate the probability that an event corresponds to the task identified by the task identification model 134 from the event data output from the voice identification unit 120, the ambient sound identification unit 121, the motion identification unit 122, and the location identification unit 123, and outputs information indicating the task as an identification result based on the probability; Yamamoto, ¶0039, The task identification unit 124 uses the event data obtained by inputting data for each modality included in a certain time range into each identification unit, along with the task identification model 134, to calculate the probability that the task corresponds to the task in the task identification model 134; Yamamoto, ¶0040, The task identification unit 124 of Example 1 calculates the probability of multiple tasks using multiple task identification models 134. The task identification unit 124 outputs information indicating the task with the highest probability as the identification result. For example, as shown in Figure 4, the task identification unit 124 outputs an identification result consisting of the task 401, start time 402, end time 403, and confidence level 404).

As per claim 20, Yamamoto and Zhang disclose the system of claim 16.  Zhang discloses wherein the processor is further configured to determine the location of the task within the workplace based on a predetermined location of the at least one image capturing device and/or a predetermined location of the at least one audio sensor (Zhang, ¶0006, In general, the techniques herein enable a computer system to use a camera or other sensor to monitor an area, detect conditions in the monitored area that satisfy criteria ... one or more machine learning models can be used to detect an object or region but also to detect the state or condition of the object or region. In addition, the system can evaluate the detected conditions; Zhang, ¶0008, With data from cameras and/or other sensors, the system can identify issues in a space).  The motivation would be the same as above in claim 15.

Claim(s) 9 and 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yamamoto et al., Japanese Publication No. JP 2021 076913 (A), hereinafter, “Yamamoto”, in view of Zhang, U.S. Publication No. 2021/0027485, hereinafter, “Zhang” as applied to claim 1 above, and further in view of Tripathi, U.S. Publication No. 2013/0070056, hereinafter, “Tripathi”.

As per claim 9, Yamamoto and Zhang disclose the method of claim 1 (Yamamoto, ¶0017, Sensor 103 measures temperature, acceleration, and other parameters), but do not explicitly disclose the following limitations as further recited; however, Tripathi discloses further comprising: obtaining, via at least one sensor, a sensor signal, wherein the at least one sensor is coupled to a tool, and wherein the task is performed by the tool; and determining the task corresponding to the predetermined period of time further based on the sensor signal (Tripathi, ¶0022, Inspectors often use tools while performing an inspection. The tools used by the operators may be marked, tagged or otherwise made identifiable so that the tools can be identified and/or so that a particular use of the tool can be monitored and logged. For example, an inspector's wrench can be identified by having a red end and a blue end. A turning of the wrench can be identified by the detection of the red and blue ends both striking a same arc about a common center point. In another example, an inspector's screwdriver is provided with a handle having a distal end cap that is half black and half white. Turning of the screwdriver is confirmed by the detection of a rotation of the white and black halves about a center point at the axis of rotation of the screwdriver. A tool can be identified by shape or by a particular logo or indicia placed on the tool).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine the teachings of Tripathi with Yamamoto and Zhang because they are in the same field of endeavor.  One skilled in the art would have been motivated to include the tag / marker / sensor on the tool as taught by Tripathi in the system of Yamamoto and Zhang as an alternate means to track and identify tasks (Tripathi, ¶0128).

As per claim 10, Yamamoto, Zhang and Tripathi disclose the method of claim 9, further comprising: determining, via the at least one sensor, a time period of operation of the tool; and determining the task corresponding to the predetermined period of time further based on the time period of operation of the tool (Yamamoto, ¶0040, The task identification unit 124 of Example 1 calculates the probability of multiple tasks using multiple task identification models 134. The task identification unit 124 outputs information indicating the task with the highest probability as the identification result. For example, as shown in Figure 4, the task identification unit 124 outputs an identification result consisting of the task 401, start time 402, end time 403, and confidence level 404; Tripathi, ¶0022, Inspectors often use tools while performing an inspection. The tools used by the operators may be marked, tagged or otherwise made identifiable so that the tools can be identified and/or so that a particular use of the tool can be monitored and logged). 


Claim(s) 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Yamamoto et al., Japanese Publication No. JP 2021 076913 (A), hereinafter, “Yamamoto”, in view of Zhang, U.S. Publication No. 2021/0027485, hereinafter, “Zhang” as applied to claim 1 above, and further in view of Kovach et al., U.S. Publication No. 2019/0080274, hereinafter, “Kovach”.

As per claim 11, Yamamoto and Zhang disclose the method of claim 1 (Yamamoto, ¶0017, Sensor 103 measures temperature, acceleration, and other parameters), but do not explicitly disclose the following limitations as further recited; however, Kovach discloses further comprising: obtaining, via a personal protective equipment (PPE) article, a PPE signal, wherein the task involves the PPE article; and determining the task corresponding to the predetermined period of time further based on the PPE signal (Kovach, ¶0087, facility analytics platform 205 may analyze an amount of time a worker spends performing a particular task. Continuing with the previous example, facility analytics platform 205 may determine whether an amount of time that a worker spends performing a particular task satisfies a threshold, whether an amount of time for a task exceeds an average amount of time for the worker or for other workers (e.g., by a threshold amount), may identify tasks that take a threshold amount of time on average; Kovach, ¶0070, facility analytics platform 205 may identify a particular worker in an image by using a facial recognition technique, by identifying characteristics of a worker in an image (e.g., by identifying a color pattern of a uniform, skin tone, protective gear, etc. of a worker in an image)).
It would have been obvious to one skilled in the art before the effective filing date of the claimed invention to combine the teachings of Kovach with Yamamoto and Zhang because they are in the same field of endeavor.  One skilled in the art would have been motivated to include the tag / marker / sensor on the worker as taught by Kovach in the system of Yamamoto and Zhang as an alternate means to track and identify tasks (Kovach, ¶0087).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TRACY MANGIALASCHI whose telephone number is (571)270-5189. The examiner can normally be reached M-F, 9:30AM TO 6:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Vu Le, can be reached at (571) 272-7332. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/TRACY MANGIALASCHI/Primary Examiner, Art Unit 2668
Prosecution Timeline

Oct 24, 2023: Application Filed
Apr 28, 2026: Non-Final Rejection mailed, §103 (current)
Precedent Cases

Applications granted by this same examiner with similar technology

17/732,978 (Patent 12639907): MULTIVIEW ASSOCIATION OF MULTIPLE ITEMS FOR ITEM RECOGNITION. 4y 0m to grant; granted May 26, 2026.
18/106,134 (Patent 12633086): METHOD, DEVICE, AND COMPUTER PROGRAM PRODUCT FOR DETECTING STICKERS. 3y 3m to grant; granted May 19, 2026.
18/095,033 (Patent 12622347): SYSTEM AND METHOD FOR TURNING IRRIGATION PIVOTS INTO A NETWORK OF ROBOTS FOR OPTIMIZING FERTILIZATION. 3y 4m to grant; granted May 12, 2026.
18/554,612 (Patent 12614271): POLARIZATION IMAGE-BASED BUILDING INSPECTION. 2y 6m to grant; granted Apr 28, 2026.
17/970,140 (Patent 12608838): DEVICE AND METHOD FOR TRACKING AN OBJECT. 3y 6m to grant; granted Apr 21, 2026.
Study what changed to get past this examiner. Based on 5 most recent grants.
Prosecution Projections

Expected OA Rounds: 1-2
Grant Probability: 75%
With Interview (+28.0%): 99%
Median Time to Grant: 3y 0m (~5m remaining)
PTA Risk: Low
Based on 586 resolved cases by this examiner. Grant probability derived from career allowance rate.