DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
This action is responsive to the Remarks filed on 10/10/2025.
Claims 45 and 78 have been amended.
Claims 45-47, 68-70, and 78-80 are presented for examination.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 45-47, 68, 78, and 80 are rejected under 35 U.S.C. 103 as being unpatentable over Liao, U.S. Pub. No. US 2021/0065582 A1, in view of Mahajan, U.S. Patent No. US 11,044,287 B1, and further in view of Reece, U.S. Pub. No. US 2021/0264900 A1.
As to claim 45, Liao teaches a method comprising:
receiving onto a remote device digital video data captured during a time interval at a location of a remote party, wherein the digital video data depicts the remote party (Liao, page 2, paragraph 23; i.e., [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program);
receiving audio data onto the remote device, wherein the audio data captures sounds spoken by the remote party during the time interval, wherein the remote device converts the audio data into words of text, and wherein the remote device captures prosodic information describing the sounds spoken by the remote party during the time interval (Liao, page 2, paragraph 19-20 & 23; page 9, paragraph 80; i.e., [0020] This may be achieved by utilizing a speech recognition algorithm that converts spoken words to text in real-time, determining the number of syllables in the words spoken for a given time period and calculating the speaking rate based on the number of syllables, instead of the number of words. Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate; [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program and provide supplemental training data that can serve to fine-tune or increase the effectiveness of the MLA; [0080] receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session);
receiving onto a local device the words of text (Liao, page 1, paragraph 4; i.e., [0004] receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session);
receiving onto the local device the prosodic information corresponding to the sounds spoken by the remote party during the time interval that were converted into the words of text (Liao, page 2, paragraph 19-20 & 23; page 5, paragraph 43-44 & 46-47; page 9, paragraph 80; i.e., [0020] This may be achieved by utilizing a speech recognition algorithm that converts spoken words to text in real-time, determining the number of syllables in the words spoken for a given time period. Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate; [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program; [0043] Once audio data is received, a request to transcribe the audio data may be submitted from the application or service (or the speech rehearsal assistance service) to a speech recognition engine for converting the spoken words to text, at 520. [A]ny speech recognition mechanism that provides real-time speech recognition and conversion may be used. In an example, real-time speech recognition may be provided for audio portions that cover short periods of time (e.g., 1 to 3 seconds); [0044] In response to the request, transcribed text corresponding to the audio data may be received, at 525. The transcribed text may be provided to the speech rehearsal assistance service in real-time as the user is speaking. In one implementation, the information relating to the transcribed text may include metadata such as when the text is received and the duration of the speech results);
storing the words of text and the prosodic information in association with one another (Liao, page 2, paragraph 19-20 & 23; page 5, paragraph 43-44 & 46-47; page 9, paragraph 80; i.e., [0019] However, it does not provide specific real-time information regarding which portions of a speech are too fast or too slow; [0020] This may be achieved by utilizing a speech recognition algorithm that converts spoken words to text in real-time, determining the number of syllables in the words spoken for a given time period and calculating the speaking rate based on the number of syllables, instead of the number of words. Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate).
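For illustration only, a minimal sketch of the syllable-based speaking-rate calculation Liao describes above at [0020] (spoken words converted to text, syllables counted over a given time period, and the rate computed from syllables rather than words). The function names and the vowel-group syllable heuristic are assumptions, not Liao's actual implementation:

```python
# Hypothetical illustration of Liao's [0020] syllable-based speaking rate.

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic for syllable counting (approximation only)."""
    vowels = "aeiouy"
    count = 0
    prev_was_vowel = False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev_was_vowel:
            count += 1
        prev_was_vowel = is_vowel
    return max(count, 1)

def speaking_rate(transcript_words: list[str], window_seconds: float) -> float:
    """Syllables per second over the window, counting syllables rather than words."""
    total_syllables = sum(count_syllables(w) for w in transcript_words)
    return total_syllables / window_seconds

# Example: words recognized during a 3-second real-time window.
rate = speaking_rate(["speech", "recognition", "converts", "words"], 3.0)
print(f"{rate:.1f} syllables/sec")
```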
Liao does not, however, teach the claim limitations of receiving onto a remote device digital video data captured during a time interval at a location of a remote party, wherein the digital video data depicts the remote party; and determining an emotion based at least on the stored prosodic information.
However, Mahajan teaches receiving onto a remote device digital video data captured during a time interval at a location of a remote party, wherein the digital video data depicts the remote party (Mahajan, col. 1, lines 53-60; i.e., facilitates the collaboration of different users through an online platform that connects the computing devices of online meeting participants. Online meeting content that is transmitted during an online meeting may include audio, video and text data that is encoded into data packets that are transmitted between online participant computing devices).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Liao to substitute the periodic transmission of Mahajan for the time period of Liao, in order to provide the fastest transmission of the data to each of the different clients with the best reliability (Mahajan, col. 15, lines 50-55).
However, Reece teaches determining an emotion based at least on the stored prosodic information (Reece, figures 10 & 16; page 7, paragraph 85; page 8, paragraph 98; page 14, paragraph 157; i.e., [0085] In other words, video processing component 506 may include a neural network trained on labeled images of faces. Video processing component 506 may identify facial expressions, such as smiling, crying, laughing, and grimacing, and further determine the associated emotion. The confidence score indicates a relative confidence in the accuracy of the emotional label determined; [0157] Emotional indicators 1508 include high level emotion data consistent with each of the data modalities. Conversation synthesis ML system 1506 can analyze features from the data modalities to generate more emotional indicators 1508. In the illustrated implementation, emotional indicators 1508 are dependent on gaze and facial expression data from video data modality 1602, volume and tone data from audio data modality 1604, and phrases data from text data modality 1608. In other words, a single modality (e.g., text 1608) may provide an incomplete picture of the user's emotional state).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Liao to substitute the conversation score of Reece for the rate of speech of Liao, in order to produce data indicative of a goal (e.g., minimizing a loss function) (Reece, page 2, paragraph 36).
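For illustration only, a minimal sketch of the multimodal emotion determination Reece describes above at [0085] and [0157], in which an emotional indicator with a confidence score is derived from facial expression, prosody (volume/tone), and phrase features. The rules, thresholds, and names below are hypothetical stand-ins for Reece's trained machine-learning components:

```python
# Hypothetical stand-in for Reece's multimodal emotion indicators ([0085], [0157]).

from dataclasses import dataclass

@dataclass
class EmotionalIndicator:
    label: str
    confidence: float  # relative confidence in the label, per Reece [0085]

def determine_emotion(facial_expression: str, mean_pitch_hz: float,
                      mean_volume_db: float, phrases: list[str]) -> EmotionalIndicator:
    score = 0.0
    if facial_expression in ("smiling", "laughing"):   # video modality
        score += 0.4
    if mean_pitch_hz > 200 and mean_volume_db > 60:    # audio prosody (assumed cutoffs)
        score += 0.3
    if any(p in ("great", "wonderful") for p in phrases):  # text modality
        score += 0.3
    label = "positive" if score >= 0.5 else "negative/neutral"
    return EmotionalIndicator(label, min(score, 1.0))

print(determine_emotion("smiling", 220.0, 65.0, ["great"]))
```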
As to claim 46, Liao-Mahajan-Reece teaches the method as recited in claim 45, wherein the prosodic information includes information selected from the group consisting of: lengths of the sounds spoken by the remote party, a time of occurrence of a pitch point of a word spoken by the remote party, and an amplitude of a sound spoken by the remote party (Liao, page 2, paragraph 19-20 & 23; page 9, paragraph 80; i.e., [0020] This may be achieved by utilizing a speech recognition algorithm that converts spoken words to text in real-time, determining the number of syllables in the words spoken for a given time period and calculating the speaking rate based on the number of syllables, instead of the number of words. Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate; [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program and provide supplemental training data that can serve to fine-tune or increase the effectiveness of the MLA; [0080] receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session).
As to claim 47, Liao-Mahajan-Reece teaches the method as recited in claim 45, further comprising displaying on a graphical user interface of the local device the words of text and the prosodic information associated with the words of text (Liao, page 2, paragraph 19-20 & 23; page 6, paragraph 50; i.e., [0020] This may be achieved by utilizing a speech recognition algorithm that converts spoken words to text in real-time, determining the number of syllables in the words spoken for a given time period and calculating the speaking rate based on the number of syllables, instead of the number of words. Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate; [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program and provide supplemental training data that can serve to fine-tune or increase the effectiveness of the MLA; [0050] After displaying the notification and/or if it is determined that the speaking rate is within the acceptable range. Filler words may include words, phrases and sounds that are not necessary to a spoken sentence or phrase and may include phrases such as like, basically, I mean, um, uh, and the like. To detect such filler words a first approach may simply examine the transcript of the audio portion to determine if any phrases or sounds identified as potential filler phrase exist in the transcribed text).
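For illustration only, a short sketch of the first filler-word detection approach Liao describes above at [0050], which simply examines the transcript for known filler words and phrases. The filler list reflects Liao's examples; the matching logic is an assumed simplification:

```python
# Hypothetical sketch of Liao's [0050] transcript-based filler-word detection.

FILLERS = {"like", "basically", "i mean", "um", "uh"}

def find_fillers(transcript: str) -> list[str]:
    """Return each filler word/phrase found in the transcript."""
    text = transcript.lower()
    found = [tok for tok in text.replace(",", " ").split() if tok in FILLERS]
    # multi-word filler phrases need a substring check
    if "i mean" in text:
        found.append("i mean")
    return found

print(find_fillers("Um, I mean, the results were basically fine"))
# ['um', 'basically', 'i mean']
```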
As to claim 68, Liao-Mahajan-Reece teaches the method as recited in claim 45, wherein the words of text and the prosodic information are not stored together with the digital video data that was captured during the time interval (Liao, page 2, paragraph 19-20 & 23; page 3, paragraph 27; i.e., [0020] This may be achieved by utilizing a speech recognition algorithm that converts spoken words to text in real-time, determining the number of syllables in the words spoken for a given time period and calculating the speaking rate based on the number of syllables, instead of the number of words. Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate; [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program; [0027] For example, the speaking rate engine may be used to calculate the user's speaking rate by utilizing various mechanisms which may include examining the audio data to identify the syllable nuclei in each word for calculating the number of syllables before determining the speaking rate. The filler word detection model, on the other hand, may examine the transcript of the audio data to determine if any words in the transcript correspond to filler words, sounds or phrases).
As to claim 78, Liao teaches a method comprising:
wherein the data is captured during a time interval, wherein the data is timestamped (Liao, page 2, paragraph 23; page 6, paragraph 52; i.e., [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program; [0052] a recurrent convolutional neural network may be developed that examines every time stamp in every audio frame of the audio data across a certain window to determine how to classify the words in the audio. This may involve providing the values of pitch, and the first, second and third formants to the deep neural network to have the neural network determine if they correspond with filler words.);
determining a value of a physiological parameter of the remote party using the data, wherein the value of the physiological parameter is determined remotely at the location of the remote party, and wherein the value of the physiological parameter is timestamped (Liao, page 2, paragraph 23; page 6, paragraph 52; i.e., [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program; [0052] recurrent convolutional neural network may be developed that examines every time stamp in every audio frame of the audio data across a certain window to determine how to classify the words in the audio. This may involve providing the values of pitch, and the first, second and third formants to the deep neural network to have the neural network determine if they correspond with filler words.);
receiving onto the remote device audio data that captures sounds spoken by the remote party during the time interval, wherein the remote device generates prosodic information from the audio data (Liao, page 1, paragraph 4; page 2, paragraph 19-20 & 23; page 9, paragraph 80; i.e., [0020] Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate; [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program and provide supplemental training data that can serve to fine-tune or increase the effectiveness of the MLA; [0080] receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session);
receiving onto a local device the prosodic information corresponding to the sounds spoken by the remote party during the time interval (Liao, page 1, paragraph 4; i.e., [0004] receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session);
converting the prosodic information into words of text at the local device, wherein the words of text correspond to the sounds spoken by the remote party during the time interval (Liao, page 2, paragraph 19-20 & 23; page 5, paragraph 43-44 & 46-47; page 9, paragraph 80; i.e., [0020] This may be achieved by utilizing a speech recognition algorithm that converts spoken words to text in real-time, determining the number of syllables in the words spoken for a given time period and calculating the speaking rate based on the number of syllables, instead of the number of words. Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate; [0023] In addition, in some implementations, a user device can be configured to transmit data captured locally during use of relevant application(s) to the cloud or the local ML program; [0044] In response to the request, transcribed text corresponding to the audio data may be received, at 525. The transcribed text may be provided to the speech rehearsal assistance service in real-time as the user is speaking; [0047] In one implementation, features of the audio data may be utilized to determine the number of syllables in each word. For example, the audio data may be examined to identify audio parameters such as pitch and intensity to detect syllable nuclei in the voice);
receiving the value of the physiological parameter onto the local device (Liao, page 1, paragraph 4; i.e., [0004] receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session); and
storing the words of text and the value of the physiological parameter such that the words of text are associated with the value of the physiological parameter (Liao, page 2, paragraph 19-20 & 23; page 5, paragraph 43-44 & 46-47; page 9, paragraph 80; i.e., [0020] This may be achieved by utilizing a speech recognition algorithm that converts spoken words to text in real-time, determining the number of syllables in the words spoken for a given time period and calculating the speaking rate based on the number of syllables, instead of the number of words. Furthermore, phonetic features of the audio signal such as pitch, intensity or energy (e.g., formant) may be taken into account to determine the speaking rate).
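For illustration only, a minimal sketch of the storing step recited in claim 78, in which timestamped words of text are stored in association with timestamped values of a physiological parameter from the same interval. Pairing each word with the most recent reading at or before it is an assumed rule; the claim requires only that the words and the value be associated:

```python
# Hypothetical association of timestamped words with physiological readings.

from bisect import bisect_right

def associate(words: list[tuple[float, str]],
              readings: list[tuple[float, float]]) -> list[dict]:
    """Pair each (timestamp, word) with the reading taken at or before it."""
    times = [t for t, _ in readings]
    records = []
    for t, word in words:
        i = max(bisect_right(times, t) - 1, 0)
        records.append({"t": t, "word": word, "value": readings[i][1]})
    return records

words = [(0.5, "hello"), (1.2, "there")]
heart_rate = [(0.0, 72.0), (1.0, 75.0)]  # e.g., beats/min measured remotely
print(associate(words, heart_rate))
# [{'t': 0.5, 'word': 'hello', 'value': 72.0}, {'t': 1.2, 'word': 'there', 'value': 75.0}]
```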
Liao does not, however, teach the claim limitations of capturing video data on a remote device at a location of a remote party, wherein the video data depicts the remote party; determining a value using the video data; and determining an emotion based at least on the stored value of the physiological parameter associated with the stored words of text converted from the prosodic information.
However, Mahajan teaches capturing video data on a remote device at a location of a remote party, wherein the video data depicts the remote party; and determining a value using the video data (Mahajan, col. 1, lines 53-60; i.e., facilitates the collaboration of different users through an online platform that connects the computing devices of online meeting participants. Online meeting content that is transmitted during an online meeting may include audio, video and text data that is encoded into data packets that are transmitted between online participant computing devices).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Liao to substitute the periodic transmission of Mahajan for the time period of Liao, in order to provide the fastest transmission of the data to each of the different clients with the best reliability (Mahajan, col. 15, lines 50-55).
However, Reece teaches determining an emotion based at least on the stored value of the physiological parameter associated with the stored words of text converted from the prosodic information (Reece, figures 10 & 16; page 7, paragraph 85; page 8, paragraph 98; page 14, paragraph 157; i.e., [0157] Emotional indicators 1508 include high level emotion data consistent with each of the data modalities. Conversation synthesis ML system 1506 can analyze features from the data modalities to generate more emotional indicators 1508. In the illustrated implementation, emotional indicators 1508 are dependent on gaze and facial expression data from video data modality 1602, volume and tone data from audio data modality 1604, and phrases data from text data modality 1608. In other words, a single modality (e.g., text 1608) may provide an incomplete picture of the user's emotional state).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Liao to substitute the conversation score of Reece for the rate of speech of Liao, in order to produce data indicative of a goal (e.g., minimizing a loss function) (Reece, page 2, paragraph 36).
As to claim 80, Liao-Mahajan-Reece teaches the method as recited in claim 78, further comprising marking the transcript at a location during which the value of the physiological parameter surpasses a threshold for the physiological parameter (Liao, page 9, paragraphs 80-84; i.e., [0080] receiving a transcript for the audio data, the transcript including a plurality of words spoken during the speech rehearsal session; [0082] calculating a speaking rate based at least in part on the number of syllables; [0083] determining if the speaking rate is within a threshold range; and [0084] enabling display of a notification on a display device in real time, if the speaking rate falls outside the threshold range).
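For illustration only, a short sketch of the claim 80 limitation of marking the transcript where the physiological parameter surpasses a threshold, mirroring the threshold-and-notify pattern of Liao at [0083]-[0084]. The marker format and record layout are assumptions:

```python
# Hypothetical transcript marking where a physiological value exceeds a threshold.

def mark_transcript(records: list[dict], threshold: float) -> str:
    """records: {"word": str, "value": float} pairs, as stored per claim 78."""
    out = []
    for r in records:
        word = r["word"]
        if r["value"] > threshold:
            word = f"*{word}*"  # assumed marker at the location exceeding the threshold
        out.append(word)
    return " ".join(out)

records = [{"word": "hello", "value": 72.0}, {"word": "there", "value": 101.0}]
print(mark_transcript(records, threshold=100.0))  # hello *there*
```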
Claims 69, 70, and 79 are rejected under 35 U.S.C. 103 as being unpatentable over Liao, U.S. Pub. No. US 2021/0065582 A1, in view of Mahajan, U.S. Patent No. US 11,044,287 B1, and Reece, U.S. Pub. No. US 2021/0264900 A1, and further in view of Margon, U.S. Pub. No. US 2012/0146796 A1.
As to claim 69, Liao-Mahajan-Reece teaches the method as recited in claim 45. Liao-Mahajan-Reece does not, however, teach the claim limitation wherein the audio data captures background noise at the location of the remote party, further comprising: determining a value of the background noise based on the audio data; receiving the value of the background noise onto the local device; and determining a magnitude of a remote party combination feature locally at the local device, wherein the remote party combination feature is determined based on the prosodic information combined with the value of the background noise.
However, Margon teaches the limitation wherein the audio data captures background noise at the location of the remote party, further comprising:
determining a value of the background noise based on the audio data (Margon, page 3, paragraph 43; i.e., [0043] The more random the noise background is the more effective the present invention is at looking for non-random changes in the received signal. The present invention measures both the rate of change and the magnitude of the change); receiving the value of the background noise onto the local device (Margon, page 3, paragraph 43; i.e., [0043] The more random the noise background is the more effective the present invention is at looking for non-random changes in the received signal. The present invention measures both the rate of change and the magnitude of the change); and determining a magnitude of a remote party combination feature locally at the local device, wherein the remote party combination feature is determined based on the prosodic information combined with the value of the background noise (Margon, page 3, paragraph 43; page 5, paragraph 67; i.e., [0043] The more random the noise background is the more effective the present invention is at looking for non-random changes in the received signal. The present invention measures both the rate of change and the magnitude of the change; [0067] The techniques, systems, and methods described herein to measure heart rhythm and respiration).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Liao-Mahajan-Reece to substitute the modified signals of Margon for the audio signal of Liao-Mahajan-Reece, in order to record the monitored physiological function upon detection of movement (Margon, page 1, paragraph 10).
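For illustration only, a minimal sketch of one way the claim 69 "remote party combination feature" could combine prosodic information with a measured background-noise value at the local device. Treating the combination as a noise-relative prosodic magnitude (a signal-to-noise-style difference in dB) is an assumption; neither the claim nor Margon fixes the formula:

```python
# Hypothetical combination of a prosodic amplitude with a background-noise value.

def combination_feature(prosodic_amplitude_db: float, noise_floor_db: float) -> float:
    """Magnitude of speech prosody relative to the measured background noise,
    expressed as a simple signal-to-noise-style difference in dB."""
    return prosodic_amplitude_db - noise_floor_db

snr_like = combination_feature(prosodic_amplitude_db=66.0, noise_floor_db=42.0)
print(f"combination feature: {snr_like:.1f} dB above background")
```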
As to claim 70, Liao-Mahajan-Reece teaches the method as recited in claim 45. Liao-Mahajan-Reece does not, however, teach the claim limitation wherein the words of text and the prosodic information are displayed to a health professional, and wherein the remote party is a patient.
However, Margon teaches the limitation wherein the words of text and the prosodic information are displayed to a health professional, and wherein the remote party is a patient (Margon, page 2, paragraph 39; i.e., [0039] measuring heart rate and detecting heart defects, and respiratory rate. For example, using Doppler Effect principles, heart rate and motion can be measured from the differences in frequency, phase, and/or wavelength between the source signal and the modified signal reflected back from the heart moving within the patient).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Liao-Mahajan-Reece to substitute the modified signals of Margon for the audio signal of Liao-Mahajan-Reece, in order to record the monitored physiological function upon detection of movement (Margon, page 1, paragraph 10).
As to claim 79, Liao-Mahajan-Reece teaches the method as recited in claim 78. But Liao-Mahajan-Reece failed to teach the claim limitation wherein the physiological parameter is selected from the group consisting of: an instantaneous heart rate of the remote party at the first time instant, an average heart rate of the remote party over the time interval, a heart rate variability of heart beats of the remote party during the time interval, an average breathing rate of the remote party over the time interval, and an average pupil dilation amount of the remote party over the time interval.
However, Margon teaches the limitation wherein the physiological parameter is selected from the group consisting of: an instantaneous heart rate of the remote party at the first time instant, an average heart rate of the remote party over the time interval, a heart rate variability of heart beats of the remote party during the time interval, an average breathing rate of the remote party over the time interval, and an average pupil dilation amount of the remote party over the time interval (Margon, page 3, paragraph 43; page 5, paragraph 67; i.e., [0043] The more random the noise background is the more effective the present invention is at looking for non-random changes in the received signal. The present invention measures both the rate of change and the magnitude of the change; [0067] The techniques, systems, and methods described herein to measure heart rhythm and respiration).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Liao-Mahajan-Reece to substitute the modified signals of Margon for the audio signal of Liao-Mahajan-Reece, in order to record the monitored physiological function upon detection of movement (Margon, page 1, paragraph 10).
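For illustration only, a short sketch of two members of the claim 79 group, average heart rate over the time interval and heart rate variability, computed from beat timestamps such as the Doppler-based measurement of Margon could yield. The RMSSD formulation of variability is a common convention, assumed here:

```python
# Hypothetical computation of average heart rate and HRV from beat timestamps.

import math

def average_heart_rate(beat_times_s: list[float]) -> float:
    """Beats per minute over the interval spanned by the beat timestamps."""
    interval = beat_times_s[-1] - beat_times_s[0]
    return (len(beat_times_s) - 1) * 60.0 / interval

def hrv_rmssd(beat_times_s: list[float]) -> float:
    """Root mean square of successive differences of inter-beat intervals (s)."""
    ibis = [b - a for a, b in zip(beat_times_s, beat_times_s[1:])]
    diffs = [b - a for a, b in zip(ibis, ibis[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

beats = [0.00, 0.82, 1.66, 2.47, 3.31]  # beat timestamps in seconds
print(f"avg HR: {average_heart_rate(beats):.0f} bpm, RMSSD: {hrv_rmssd(beats):.3f} s")
```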
Response to Arguments
Applicant's arguments with respect to claims 45-47, 68-70, and 78-80 have been considered but are moot in view of the new grounds of rejection.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Listing of Relevant Arts
Thorn, U.S. Pub. No. US 2017/0352361 A1, discloses speech-to-text conversion in which the words are time-stamped.
Ai, U.S. Pub. No. US 2015/0154960 A1, discloses teleconferencing with remote users and converting speech into text over a period of time.
Contact Information
Any inquiry concerning this communication or earlier communications from the examiner should be directed to THUONG NGUYEN whose telephone number is (571) 272-3864. The examiner can normally be reached Monday-Friday, 9:00-6:00.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Noel Beharry can be reached on 571-270-5630. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/THUONG NGUYEN/Primary Examiner, Art Unit 2416