DETAILED ACTION
This Office Action is responsive to the amendments and arguments filed on February 17, 2026. Claims 1, 6-8, and 11-12 have been amended; claims 2 and 5 have been cancelled. Claims 1, 3-4, and 6-13 are pending and are addressed below. This action is made FINAL.
Any previous objections/rejections not mentioned in this Office Action have been withdrawn by the examiner.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in Application No. 18/605,538, filed on March 14, 2024.
Should applicant desire to obtain the benefit of foreign priority under 35 U.S.C. 119(a)-(d), a certified English translation of the foreign application must be submitted in reply to this action. 37 CFR 41.154(b) and 41.202(e). Failure to provide a certified translation may result in no benefit being accorded for the non-English application.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on March 14 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.
Response to Amendments and Arguments
With respect to rejections made under 35 U.S.C. 102, Applicant argues, “At least the above features of claim 1 are not disclosed or suggested by Acharya.
At best, Acharya's audio data 108 may be compared to the dialog data of claim 1. However, in claim 1, the coordinates are extracted based on the intention understanding for the dialog data, whereas Acharya's machine learning system 112 correlates the object recognized in the video data 107 with a reference to the object in the audio data 108,” (page 7 of Remarks).
Acharya’s teaching of a machine learning system that correlates an on-screen object with a spoken reference to that object produces the same outcome as a system that identifies an object based on an intention understanding, the intention being a reference to the object (page 3, “The voice data may include this statement before the SME performs the inspection operation, during the inspection operation by the SME, or after the SME performs the inspection operation. In a system as described herein, the correlation between the SME remark and the object recognized in the video data is identified, and the period during which the related remark occurs in the voice data or the related object is the video.”). Any system identifying an object on a screen or in any other 2-dimensional or 3-dimensional projection would necessarily use coordinates to do so.
In the interest of advancing prosecution, teachings from Montgomerie are cited under 35 U.S.C. 103 in combination with Acharya. Accordingly, Applicant’s argument is moot; further details are provided below.
Claim Objections
Claims 1 and 12 are objected to because of the following informalities:
Claims 1 and 12 recite “acquire an image, a coordinate, and dialog data communicated between a first device and a second device, the first device being used by a first person performing a task,” and subsequently recite “extract at least one of a plurality of the coordinates based on an intention understanding for the dialog data according to the first scenario.” Further, the claims include the limitation “generate training data including the image, a classification of the image, and a relative coordinate of the extracted at least one of the coordinates with respect to the coordinate of the object,” implying different instances or collections of coordinates whose relationships to one another are unclear.
To ensure a fair interpretation during examination, Examiner suggests amending the coordinate language to first introduce “one or more coordinates” or “a plurality of coordinates,” and then to refer back to those elements as “at least one of the one or more coordinates” or “at least one coordinate of the plurality of coordinates,” so that the set of object coordinates is clearly described. Any other coordinates drawn from a relative element should likewise be introduced and then referenced with proper antecedent basis.
For purposes of compact prosecution, the claims are interpreted as describing coordinates used to locate an object within the field of view of the HMD or AR device.
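Solely to illustrate this interpretation, the following minimal sketch models the distinction between the set of communicated coordinates, the object’s own coordinate, and the “at least one” coordinate selected from that set; all names, types, and the coordinate convention below are the examiner’s assumptions for illustration and are not drawn from the claims or the cited art:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical 2D coordinate within the field of view of the HMD/AR device.
Coordinate = Tuple[float, float]

@dataclass
class AcquiredData:
    image: bytes                   # image in which the object of the task is visible
    object_coordinate: Coordinate  # coordinate locating the object in the view
    coordinates: List[Coordinate]  # the plurality of coordinates communicated between devices
    dialog_data: List[str]         # dialog between the first person and the second person

def extract_coordinates(data: AcquiredData, intent_indices: List[int]) -> List[Coordinate]:
    """Return at least one coordinate of the plurality of coordinates, selected by an
    (assumed) intention-understanding step applied to the dialog data."""
    return [data.coordinates[i] for i in intent_indices]
```

Under this reading, the object’s coordinate and the extracted coordinates belong to clearly instantiated, separately referenced sets, which is the clarification suggested above.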
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 3 and 6-13 are rejected under 35 U.S.C. 103 as being unpatentable over Japan Invention Application 2021-099810 to Acharya et al. (hereinafter, "Acharya") in view of U.S. Patent Application Publication 2016/0291922 to Montgomerie et al. (hereinafter, “Montgomerie”).
Regarding claims 1 and 12, Acharya teaches a processing device and method comprising: one or more processors (page 15, "The techniques described in this disclosure may be implemented, at least in part, in hardware, software, firmware or any combination thereof. For example, various aspects of the described techniques can be described in one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or any other equivalent."); and
a memory storing instructions (page 15, "The techniques described in the present disclosure may also be embodied or encoded in a computer-readable medium such as a computer-readable storage medium containing instructions. An instruction embedded or encoded in a computer-readable storage medium may, for example, cause a programmable processor or other processor to perform the method when the instruction is executed.") that, when executed by the one or more processors, cause the one or more processors to:
acquire an image, a coordinate, and dialog data communicated between a first device and a second device, the first device being used by a first person performing a task, the second device being used by a second person, an object of the task being visible in the image (page 3, "In the examples of techniques described herein, the system acquires SME activity (eg, a series of maintenance work, machining activities or manufacturing activities, etc.) in both first-person and third-person views. Provide one or more 3D video cameras for. Such a system further comprises one or more microphones configured to acquire dictation as the SME practices the activity," and page 4, "In a system as described herein, object recognition may be performed to recognize an object in the moving image data with minimal training examples. The system described herein may achieve this by identifying an object in a first moving image and applying this knowledge across multiple subsequent moving images… In addition, the system described herein updates the domain model of a task and applies the domain model to generate training information used by a second user to perform the task. In some examples, the system uses voice or text data obtained from the SME in a first language (eg, Japanese) to train a second user to perform a task.");
select a first scenario corresponding to the task from a plurality of scenarios in which processing procedures are defined (page 6, "In some examples, the knowledge database 116 receives from a second user 118 an instruction to perform a task or a query for one of a plurality of steps to perform a task. In response to the query, the knowledge database 116 applies the domain model 114 to generate training information 117 for performing the task or one of the plurality of steps for performing the task.");
extract at least one of a plurality of the coordinates based on an intention understanding for the dialog data according to the first scenario (page 13, "The display 802draws a time series graph of the moving image data 107. The display 804 depicts a transcription of a portion of the audio data 107 (eg, a narrative by a first user 102) that correlates with each portion of the moving image data 107. The display 806 identifies a portion of the moving image data 107 as corresponding to a recognized sequence of actions (eg, one of a plurality of steps for performing a task). The display 808 depicts the label of the behavior sequence recognized for that portion of the moving image data 107 identified by the display 806. The display 810 identifies a portion of the moving image data 107 as depicting a recognized object. The display 812 identifies the location of the recognized object in the work space depicted in the moving image data 107.");
generate training data including the image, a classification of the image, and a relative coordinate of the extracted at least one of the coordinates with respect to the coordinate of the object (page 14, "FIG. 9 is a flowchart showing an exemplary operation for generating training information according to the technique of the present disclosure… Based on the correlation between the video data 107 and the audio data 109, the sensor data 121, and the text data, the machine learning system 112 describes at least one of the video data that describes one of the plurality of steps for performing the task," and page 15, "For example, the machine learning system 112 identifies the object depicted in the moving image data 107, identifies the reference to the object from the audio data 109, and refers to the object depicted in the moving image data 107 to the object in the audio data 109. A part of the moving image data may be correlated with a part of the audio data by correlating with."); and
associate the extracted at least one of the plurality of coordinates with the task (page 15, "The training unit 210 applies the domain model 114 to generate training information 117 for the task (914). The training information 117 is, for example, a task, one or more steps of a plurality of steps constituting the task, or an object (eg, a tool)associated with each type of data cross-referenced to each other type of data. Or workpieces), including, for example, video data, audio data, sensor data and / or text data.").
As noted in the Response section above, Acharya’s identification of the location of an object on a screen or in a projection would rely on that object’s coordinates. Assuming, arguendo, that Acharya does not explicitly disclose the use of object coordinates, Montgomerie is introduced.
Montgomerie teaches, at paragraph [0055], "An alternative exemplary method of AR interaction according to embodiments of the present disclosure, as shown in FIG. 21, may include capturing a video image by a camera (2100), generating AR coordinates corresponding to captured video image (2105), update a scene camera view (2110), encoding the video image (2115), combining the AR coordinates, 3D objects, and encoded video image (2120), transmitting the combined data to a remote user (2125), and receiving from the remote user updated AR coordinates, 3D objects, and video image (2130)."
Acharya and Montgomerie are considered analogous because they are each concerned with user training with VR/AR devices. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Acharya with the teachings of Montgomerie for the purpose of improving training quality. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
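To further illustrate how the “relative coordinate” limitation is understood in the combination, the following is a minimal sketch only; the subtraction-based relative coordinate, the record layout, and all names are assumptions made for illustration and are not asserted to be the method of Acharya or Montgomerie:

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, float]

@dataclass
class TrainingRecord:
    image: bytes
    classification: str                     # classification of the image (e.g., a recognized task step)
    relative_coordinates: List[Coordinate]  # extracted coordinates expressed relative to the object

def make_training_record(image: bytes,
                         classification: str,
                         extracted: List[Coordinate],
                         object_coordinate: Coordinate) -> TrainingRecord:
    """Express each extracted coordinate relative to the object's coordinate and
    bundle it with the image and its classification as training data."""
    ox, oy = object_coordinate
    relative = [(x - ox, y - oy) for (x, y) in extracted]
    return TrainingRecord(image=image, classification=classification,
                          relative_coordinates=relative)
```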
Regarding claim 3, Acharya teaches that at least one of the plurality of first coordinates and at least one of the plurality of second coordinates are extracted based on the dialog data (page 8, "As an example, the machine learning system 112 identifies the object depicted in the moving image data 107, identifies the reference to the object from the audio data 109, and transfers the object depicted in the moving image data 107 to the object in the audio data 109."), but does not explicitly teach “The processing device according to claim 1, wherein the plurality of coordinates includes: a plurality of first coordinates transmitted from the first device to the second device,” or “a plurality of second coordinates transmitted from the second device to the first device,” and thus Montgomerie is introduced.
Montgomerie teaches a plurality of first coordinates transmitted from the first device to the second device (paragraph [0044], "In some embodiments, the shared state of augmented reality may be manipulated by, for example, drawing upon the remote expert's screen using a mouse, digital pen, finger or other such pointing device. The path traced using the pointing device is then reflected upon the local user device, in the location within the AR environment in which it was originally drawn."); and a plurality of second coordinates transmitted from the second device to the first device (paragraph [0058], "In one or more embodiments, either or both of the remote expert 200 and the local user 210 may generate and manipulate overlay content 240.").
Acharya and Montgomerie are considered analogous because they are each concerned with user training with VR/AR devices. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Acharya with the teachings of Montgomerie for the purpose of improving training quality. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Regarding claim 6, Acharya teaches the processing device according to claim 1, wherein the one or more processors calculate one of the plurality of coordinates of the object of the task visible in the image (page 3, "In a system as described herein, the correlation between the SME remark and the object recognized in the video data is identified, and the period during which the related remark occurs in the voice data or the related object is the video. Coordinate multiple sources regardless of the time period identified in the data.").
Regarding claim 7, Acharya teaches the processing device according to claim 1, wherein in the generating of the training data, the image including the extracted at least one of the plurality of coordinates and a classification of the image are associated (page 15, "The training unit 210 applies the domain model 114 to generate training information 117 for the task (914). The training information 117 is, for example, a task, one or more steps of a plurality of steps constituting the task, or an object (eg, a tool)associated with each type of data cross-referenced to each other type of data. Or workpieces), including, for example, video data, audio data, sensor data and / or text data.").
Regarding claim 8, Acharya teaches a training device comprising: one or more other processors (page 15, “It may be implemented in one or more processors, including integrated or individual logic circuits, as well as any combination of such components… Further, any of the described units, modules or components may be implemented together as separate but interoperable logic devices or separately”); and
a memory storing instructions (page 15, “The techniques described in the present disclosure may also be embodied or encoded in a computer-readable medium such as a computer-readable storage medium containing instructions.”) that, when executed by the one or more other processors, cause the one or more other processors to:
perform machine learning by using the training data generated by the processing device according to claim 1 (page 5, "The machine learning system 112 is initialized by training the machine learning system 112 with training sample data (not depicted in FIG. 1) that includes video data, text data, audio data and / or sensor data.").
Regarding claim 9, Acharya teaches the training device according to claim 8, wherein the machine learning includes performing clustering or training a classification model (pages 5 and 6, "The machine learning system 112 is initialized by training the machine learning system 112 with training sample data (not depicted in FIG. 1) that includes video data, text data, audio data and / or sensor data. May be good. In some examples, the machine learning system 112 uses such training sample data to teach a machine learning model to capture the elements depicted in video data, text data, audio data and / or sensor data. Whether such elements are more or less likely to be related to each other, such as by identifying, training the machine learning system 112, assigning different weights to different elements, applying different coefficients to such elements, and so on. To judge.").
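As a purely illustrative sketch of the two alternatives recited in claim 9, namely clustering the generated training data or training a classification model on it, the example below uses stand-in features and scikit-learn estimators; the feature extraction, labels, and library choice are assumptions for illustration and are not drawn from Acharya:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
features = rng.random((100, 4))        # stand-in features derived from training records
labels = rng.integers(0, 3, size=100)  # stand-in image classifications

# Alternative 1: perform clustering (unsupervised).
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(features)

# Alternative 2: train a classification model (supervised).
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
```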
Regarding claim 10, Acharya teaches a processing device, configured to: transmit, to the first device, guidance related to the task by using data trained by the training device according to claim 8 (page 3, "In some examples, the system acquires the execution of such a task from the perspective of SME and reproduces the execution of the trainee's task in the form of augmented reality content.").
Regarding claim 11, Acharya teaches a processing system, comprising: a first device mounted to a first person performing a task (page 5, "For example, the sensor 120 may be worn by a first user 102 or on an article worn by the first user 102, such as a motion tracking glove that detects the movement and / or force of a user's fingers, hands and / or arms. It may be incorporated");
a second device mounted to a second person (page 9, "The training unit 210 may output such augmented reality content to, for example, a head-mounted display (HMD) worn by a second user 118 to provide an empirical first-person view of performing tasks by the SME.");
a processing device comprising: one or more processors (page 15, “It may be implemented in one or more processors, including integrated or individual logic circuits, as well as any combination of such components.");
a memory storing instructions (page 15, “The techniques described in the present disclosure may also be embodied or encoded in a computer-readable medium such as a computer-readable storage medium containing instructions.”) that, when executed by the one or more processors, cause the one or more processors to:
acquire an image, a coordinate, and dialog data communicated between the first device and the second device (page 3, "In the examples of techniques described herein, the system acquires SME activity (eg, a series of maintenance work, machining activities or manufacturing activities, etc.) in both first-person and third-person views. Provide one or more 3D video cameras for. Such a system further comprises one or more microphones configured to acquire dictation as the SME practices the activity," and page 4, "In addition, the system described herein updates the domain model of a task and applies the domain model to generate training information used by a second user to perform the task. In some examples, the system uses voice or text data obtained from the SME in a first language (eg, Japanese) to train a second user to perform a task."),
select a first scenario corresponding to the task from a plurality of scenarios in which processing procedures are defined (page 6, "In some examples, the knowledge database 116 receives from a second user 118 an instruction to perform a task or a query for one of a plurality of steps to perform a task. In response to the query, the knowledge database 116 applies the domain model 114 to generate training information 117 for performing the task or one of the plurality of steps for performing the task."),
extract at least one of a plurality of the coordinates based on an intention understanding for the dialog data according to the first scenario (page 13, "The display 802draws a time series graph of the moving image data 107. The display 804 depicts a transcription of a portion of the audio data 107 (eg, a narrative by a first user 102) that correlates with each portion of the moving image data 107. The display 806 identifies a portion of the moving image data 107 as corresponding to a recognized sequence of actions (eg, one of a plurality of steps for performing a task). The display 808 depicts the label of the behavior sequence recognized for that portion of the moving image data 107 identified by the display 806. The display 810 identifies a portion of the moving image data 107 as depicting a recognized object. The display 812 identifies the location of the recognized object in the work space depicted in the moving image data 107."), and
generate training data including the image, a classification of the image, and a relative coordinate of the extracted at least one of the coordinates with respect to the coordinate of the object (page 14, "FIG. 9 is a flowchart showing an exemplary operation for generating training information according to the technique of the present disclosure… Based on the correlation between the video data 107 and the audio data 109, the sensor data 121, and the text data, the machine learning system 112 describes at least one of the video data that describes one of the plurality of steps for performing the task," and page 15, "For example, the machine learning system 112 identifies the object depicted in the moving image data 107, identifies the reference to the object from the audio data 109, and refers to the object depicted in the moving image data 107 to the object in the audio data 109. A part of the moving image data may be correlated with a part of the audio data by correlating with."); and
a training device performing machine learning by using the data (page 5, "The machine learning system 112 is initialized by training the machine learning system 112 with training sample data (not depicted in FIG. 1) that includes video data, text data, audio data and / or sensor data.").
As noted with regard to claim 1 above, Acharya’s identification of the location of an object on a screen or in a projection would rely on that object’s coordinates. Assuming, arguendo, that Acharya does not explicitly disclose the use of object coordinates, Montgomerie is introduced.
Montgomerie teaches, at paragraph [0055], "An alternative exemplary method of AR interaction according to embodiments of the present disclosure, as shown in FIG. 21, may include capturing a video image by a camera (2100), generating AR coordinates corresponding to captured video image (2105), update a scene camera view (2110), encoding the video image (2115), combining the AR coordinates, 3D objects, and encoded video image (2120), transmitting the combined data to a remote user (2125), and receiving from the remote user updated AR coordinates, 3D objects, and video image (2130)."
Acharya and Montgomerie are considered analogous because they are each concerned with user training with VR/AR devices. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Acharya with the teachings of Montgomerie for the purpose of improving training quality. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
Regarding claim 13, Acharya teaches a non-transitory computer-readable storage medium storing a program, the program, when executed by a computer, causing the computer to perform the method according to claim 12 (page 15, "The techniques described in the present disclosure may also be embodied or encoded in a computer-readable medium such as a computer-readable storage medium containing instructions.").
Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Acharya and Montgomerie as applied to claim 1 above, and further in view of Japan Patent 6004051 to Fukakusa Hiroki (hereinafter, "Fukakusa").
Regarding claim 4, Acharya teaches the processing device according to claim 1, wherein the first device and the second device each are head mounted displays (page 3, "In the examples of techniques described herein, the system acquires SME activity (eg, a series of maintenance work, machining activities or manufacturing activities, etc.) in both first-person and third-person views. Provide one or more 3D video cameras for," and page 9, "The training unit 210 may output such augmented reality content to, for example, a head-mounted display (HMD) worn by a second user 118 to provide an empirical first-person view of performing tasks by the SME.") and Montgomerie teaches the use of coordinates in the interaction of AR devices, but the combination of Acharya and Montgomerie does not explicitly teach “the plurality of coordinates includes a coordinate pointed to by eye tracking,” and thus Fukakusa is introduced.
Fukakusa teaches the plurality of coordinates includes a coordinate pointed to by eye tracking (page 3, "The HMD line-of-sight identifying unit 305 is a functional unit that identifies the line of sight of the user wearing the HMD 100from the position and orientation of the HMD 100 acquired in the HMD information acquisition unit 303 in the real space.").
Acharya, Montgomerie and Fukakusa are considered analogous because they are each concerned with user interaction with VR/AR devices. Therefore, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to have modified Acharya and Montgomerie with the teachings of Fukakusa for the purpose of improving training quality. Given that all the claimed elements were known in the prior art, one skilled in the art could have combined the elements by known methods with no change in their respective functions, and the combination would have yielded nothing more than predictable results.
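Solely as an illustrative sketch of a coordinate “pointed to by eye tracking” as that limitation is understood, the following derives a pointed-to location by intersecting the wearer’s line of sight with a work-surface plane; the pose representation, the ray-plane intersection, and all names are assumptions for illustration and are not drawn from Fukakusa:

```python
import numpy as np

def gaze_coordinate(hmd_position: np.ndarray,
                    gaze_direction: np.ndarray,
                    plane_point: np.ndarray,
                    plane_normal: np.ndarray) -> np.ndarray:
    """Intersect the wearer's line of sight (derived from HMD position and orientation)
    with a plane representing the work surface to obtain the pointed-to coordinate."""
    direction = gaze_direction / np.linalg.norm(gaze_direction)
    denom = float(np.dot(plane_normal, direction))
    if abs(denom) < 1e-9:
        raise ValueError("line of sight is parallel to the plane")
    t = float(np.dot(plane_normal, plane_point - hmd_position)) / denom
    return hmd_position + t * direction
```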
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Japan Invention Application 2006-293605 to Sakauchi Yuichi et al.
Korea Patent 10-2330218 to Kim Yeon Pyo.
China Invention Application 109658523 to Huang et al.
China Invention Application 114880422 to Ji-zhou Huang.
U.S. Patent 10,896,544 to Yang et al.
U.S. Patent 11,120,703 to McLeod.
U.S. Patent Application Publication 2006/0170652 to Yuichi et al.
U.S. Patent Application Publication 2014/0186801 to Slayton et al.
U.S. Patent Application Publication 2017/0053456 to Cho et al.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SEAN T SMITH whose telephone number is (571)272-6643. The examiner can normally be reached Monday - Friday 8:00am - 5:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, PIERRE-LOUIS DESIR can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SEAN THOMAS SMITH/Examiner, Art Unit 2659
/PIERRE LOUIS DESIR/Supervisory Patent Examiner, Art Unit 2659