DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
Applicant's arguments filed 01/15/2026 have been fully considered but they are not persuasive. Applicant argues that “The amendments place the claims in condition for allowance”.
This argument has been fully considered but is not persuasive. The independent claims remain rejected as being unpatentable over McCulloch in view of Scott, as detailed in this Office action. New citations have been added where appropriate, in addition to Poore, as detailed in this action.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary. Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-4, 6-12, and 14-20 are rejected under 35 U.S.C. 103 as being unpatentable over McCulloch et al. (US 2014/0337023) (hereinafter referred to as McCulloch) in view of Scott, II et al. (US 11,055,533) (hereinafter referred to as Scott).
Regarding claim 1, McCulloch teaches A method (Embodiments that relate to converting audio inputs from an environment into text are disclosed. For example, in one disclosed embodiment a speech conversion program receives audio inputs from a microphone array of a head-mounted display device. Image data is captured from the environment, and one or more possible faces are detected from image data. Eye-tracking data is used to determine a target face on which a user is focused. A beamforming technique is applied to at least a portion of the audio inputs to identify target audio inputs that are associated with the target face. The target audio inputs are converted into text that is displayed via a transparent display of the head-mounted display device. See abstract)
comprising:
detecting an ambient sound from audio data generated by a plurality of microphones on a wearable display (In some examples, the user 304 may have a hearing impairment that can make understanding speech difficult, particularly in an environment with multiple speakers and/or significant background noise. In the example shown in FIG. 3, each of the first person 316, second person 320 and third person 324 may be speaking simultaneously. The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]);
determining, based on the audio data, a location of a sound source of the ambient sound relative to the wearable device (Using the location of target face1 54, the speech conversion program 14 may use the beamformer program 26 to apply one or more beamforming techniques to at least a portion of the audio inputs 68 from the microphone array 66. Alternatively expressed, one or more beamforming techniques may be applied to portions of audio inputs 68 that originate from the location of target face1 54. In this manner, the beamformer program 26 may identify target audio inputs, generally indicated at 336 in FIG. 3, that are associated with the face1 54 of the first person 316. See paragraph [0039]);
determining that the ambient sound includes a non-speech sound based on an audio segment of the audio data that includes the ambient sound (The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]),
generating an annotation for a speech sound using the audio segment, the annotation including a textual description that identifies the speech sound and the sound source (See figure 3.)( With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]); and
displaying the annotation at a display position in the display of the wearable device based on the location of the sound source (See figure 3.)(With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]).
but is silent to generating an annotation for the non-speech sound using the audio segment, the annotation including a textual description that identifies the non-speech sound and the sound source; and displaying the annotation at a display position in the display of the wearable device based on the location of the sound source.
Scott teaches detecting an audio location and providing an augmented reality textual and graphical description of an event superimposed to the user (FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound ( or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player 10 kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Addition- 15 ally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description 20 of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears ( e.g., the sounds generated by the soccer 25 player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26)( As shown, the memory 220 stores a distance calculator 225, an audio controller 235, an AR controller 250, and a sound of interest (SOI) identifier 260, which may be software applications or modules. The distance calculator 225 uses time difference of arrival (TDOA) 230 to determine a location of a source generating a sound. That is, the distance calculator 225 can use the microphones 213 to perform TDOA 230 and identify a distance from the AR device 205 to a sound source. Acoustic source localization is the task of locating a sound source given measurements of the sound field. The TDOA technique for acoustic source localization uses (at least) two microphones, to compute the direction and distance of the acoustic source by using the different times the signal arrived at the microphones to triangulate its position. Knowing the direction and distance of the sound source can help the AR device 205 to generate the AR
content 110 and the audio description 120 as described below. See col. 3, line 57 to col. 4, line 7)( The audio controller 235 is tasked with generating the audio description 120. To do so, the audio controller 235 includes an object identifier 240 and a speech generator 245. In one embodiment, the object identifier 240 performs a ML technique that identifies objects in captured images. That is, using images captured by the camera 210 and the location of the sound source identified by the distance calculator 225, the object identifier 240 can determine what object in the captured images is the sound source. In one embodiment, the object identifier 240 decomposes the captured video of the user's FOY into image frames. For each frame, the object identifier 240 uses a visual recognition ML technique to
perform object detection within the frame such as, e.g., a custom trained You-Only-Look-Once (YOLO) convolutional neural network model to perform object detection of the elements within the frame, capturing the bounding box coordinates of detected elements/regions. The speech generator 245 can use a language model to generate text for the objects and actions identified by the object identifier 240. In one embodiment, the language model is a long short-term memory (LSTM) recurrent neural network that is trained on encoded video frames images and word-embedding that describe the objects and corresponding actions occurring in the images. The speech generator 245 can then translate the text into the audio descriptions 120. See col. 4, lines 8-33)
McCulloch and Scott both teach presenting augmented reality information to a user about detected audio, and Scott further teaches that the presented information can be a contextual annotation of an event identified from a detected sound. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system of McCulloch with the detection and annotation of event sounds as taught by Scott, such that the user could visualize both speech and non-speech audio and visualize the source.
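By way of illustration only, the acoustic source localization technique described in the cited portion of Scott (time difference of arrival between at least two microphones) may be sketched as follows. The microphone spacing, sample rate, sign convention, and function names are assumptions made for this illustration and are not drawn from McCulloch or Scott.

```python
# Illustrative sketch only: estimating a sound-source bearing from the time
# difference of arrival (TDOA) between two microphones.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air


def estimate_tdoa(sig_left, sig_right, sample_rate):
    """Estimate the time (seconds) by which the left signal lags the right signal."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag_samples = np.argmax(corr) - (len(sig_right) - 1)
    return lag_samples / sample_rate


def bearing_from_tdoa(tdoa, mic_spacing):
    """Convert a TDOA into a bearing (radians) for a two-microphone pair.

    0 rad is broadside (straight ahead); positive angles are toward the right
    microphone, negative toward the left (an assumed convention).
    """
    path_difference = np.clip(tdoa * SPEED_OF_SOUND, -mic_spacing, mic_spacing)
    return np.arcsin(path_difference / mic_spacing)


# Example: a broadband source that reaches the left microphone 10 samples later
# than the right microphone, i.e., the source lies toward the wearer's right.
rng = np.random.default_rng(0)
fs = 48_000.0
noise = rng.standard_normal(4_800)                          # 100 ms of broadband sound
left = np.concatenate([np.zeros(10), noise[:-10]])          # left mic hears it later
right = noise
bearing = bearing_from_tdoa(estimate_tdoa(left, right, fs), mic_spacing=0.14)
print(f"estimated bearing: {np.degrees(bearing):.1f} degrees")  # ~ +31 degrees (to the right)
```

The sketch uses a noise-like test signal because a broadband source gives an unambiguous cross-correlation peak; a pure tone would introduce phase ambiguity.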
Regarding claim 2, McCulloch in view of Scott teaches The method of claim 1, wherein the textual description includes information identifying a type of the non-speech sound and information identifying a category of a physical object that produced the non-speech sound (Scott; FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound ( or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player 10 kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Addition- 15 ally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description 20 of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears ( e.g., the sounds generated by the soccer 25 player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26).
Regarding claim 3, McCulloch in view of Scott teaches The method of claim 1,
wherein the ambient sound is a first ambient sound, the sound source is a first sound source, the audio segment is a first audio segment, and the annotation is a first annotation (McCulloch in view of Scott allows for multiple sound sources)(Scott; FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound ( or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player 10 kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Addition- 15 ally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description 20 of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears ( e.g., the sounds generated by the soccer 25 player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26), the method further comprising:
detecting a second ambient sound from the audio data (McCulloch; Second person talking)(McCulloch; See figure 3) (McCulloch; In some examples, the user 304 may have a hearing impairment that can make understanding speech difficult, particularly in an environment with multiple speakers and/or significant background noise. In the example shown in FIG. 3, each of the first person 316, second person 320 and third person 324 may be speaking simultaneously. The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]);
determining, based on the audio data, a location of a second sound source of the second ambient sound relative to the wearable device (McCulloch; Using the location of target face1 54, the speech conversion program 14 may use the beamformer program 26 to apply one or more beamforming techniques to at least a portion of the audio inputs 68 from the microphone array 66. Alternatively expressed, one or more beamforming techniques may be applied to portions of audio inputs 68 that originate from the location of target face1 54. In this manner, the beamformer program 26 may identify target audio inputs, generally indicated at 336 in FIG. 3, that are associated with the face1 54 of the first person 316. See paragraph [0039]);
determining that the second ambient sound includes speech based on a second audio segment of the audio data that includes the second ambient sound (McCulloch; See figure 3.) (McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]);
generating a second annotation that identifies the second sound source; initiating conversion of the speech to text (McCulloch; See figure 3.)(McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]); and
displaying the second annotation and the text in the display of the wearable device based on the location of the second sound source (McCulloch; See figure 3.)( McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]).
Regarding claim 4, McCulloch in view of Scott teaches The method of claim 1, further comprising: computing a direction and a distance of the non-speech sound relative to the wearable device; and determining the display position for the annotation within a three-dimensional space based on the direction and the distance of the non-speech sound (Scott; As shown, the memory 220 stores a distance calculator 225, an audio controller 235, an AR controller 250, and a sound of interest (SOI) identifier 260, which may be software applications or modules. The distance calculator 225 uses time difference of arrival (TDOA) 230 to determine a location of a source generating a sound. That is, the distance calculator 225 can use the microphones 213 to perform TDOA 230 and identify a distance from the AR device 205 to a sound source. Acoustic source localization is the task of locating a sound source given measurements of the sound field. The TDOA technique for acoustic source localization uses (at least) two microphones, to compute the direction and distance of the acoustic source by using the different times the signal arrived at the microphones to triangulate its position. Knowing the direction and distance of the sound source can help the AR device 205 to generate the AR
content 110 and the audio description 120 as described below. See col. 3, line 57 to col. 4, line 7)( Scott; The audio controller 235 is tasked with generating the audio description 120. To do so, the audio controller 235 includes an object identifier 240 and a speech generator 245. In one embodiment, the object identifier 240 performs a ML technique that identifies objects in captured images. That is, using images captured by the camera 210 and the location of the sound source identified by the distance calculator 225, the object identifier 240 can determine what object in the captured images is the sound source. In one embodiment, the object identifier 240 decomposes the captured video of the user's FOY into image frames. For each frame, the object identifier 240 uses a visual recognition ML technique to
perform object detection within the frame such as, e.g., a custom trained You-Only-Look-Once (YOLO) convolutional neural network model to perform object detection of the elements within the frame, capturing the bounding box coordinates of detected elements/regions. The speech generator 245 can use a language model to generate text for the objects and actions identified by the object identifier 240. In one embodiment, the language model is a long short-term memory (LSTM) recurrent neural network that is trained on encoded video frames images and word-embedding that describe the objects and corresponding actions occurring in the images. The speech generator 245 can then translate the text into the audio descriptions 120. See col. 4, lines 8-33)
Regarding claim 6, McCulloch in view of Scott teaches The method of claim 4, further comprising: determining an updated value for at least one of the direction or the distance as the sound source moves relative to the wearable device; and adjusting the display position of the annotation within the three-dimensional space based on the updated value (McCulloch; In one example, the speech conversion program 14 may tag the text 40' to the first person 316 such that the text bubble 340 is spatially anchored to the first person and follows the first person as the first person moves. See paragraph [0043]).
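By way of illustration only, the following sketch shows one way a direction and distance estimate could be converted into a three-dimensional display position for an annotation and then updated as the sound source moves, consistent with the subject matter of claims 4 and 6. The coordinate frame, data structure, and update policy are assumptions made for this illustration and are not taken from either reference.

```python
# Illustrative sketch only: placing and re-placing an annotation anchor in a
# three-dimensional display space from an estimated direction and distance.
from dataclasses import dataclass
import math


@dataclass
class AnnotationAnchor:
    text: str
    x: float  # meters, positive toward the wearer's right
    y: float  # meters, positive upward
    z: float  # meters, positive in front of the wearer


def anchor_from_direction(text, azimuth_rad, elevation_rad, distance_m):
    """Place an annotation anchor at the estimated source direction and distance."""
    x = distance_m * math.cos(elevation_rad) * math.sin(azimuth_rad)
    y = distance_m * math.sin(elevation_rad)
    z = distance_m * math.cos(elevation_rad) * math.cos(azimuth_rad)
    return AnnotationAnchor(text, x, y, z)


def update_anchor(anchor, azimuth_rad, elevation_rad, distance_m):
    """Recompute the display position from updated direction/distance values so
    the annotation follows the sound source as it moves relative to the device."""
    return anchor_from_direction(anchor.text, azimuth_rad, elevation_rad, distance_m)


# Example: an annotation placed 2 m away, 30 degrees to the right, then updated
# after the source moves to 45 degrees at 3 m.
a = anchor_from_direction("Dog barking", math.radians(30), 0.0, 2.0)
a = update_anchor(a, math.radians(45), 0.0, 3.0)
print(a)
```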
Regarding claim 7, McCulloch in view of Scott teaches The method of claim 1, wherein the annotation includes information that identifies a spatial orientation of the ambient sound relative to the wearable device (See figure 1, Scott;)(Scott; The AR controller 250 can generate the AR content 110 as 50 a still image or as a series of images that update in real-time. That is, the AR controller 250 can generate a single image of a magnified version of the sound source, or continuously generate magnified images of the sound source as the AR content 110, which track the movement of the sound source 55 in real-time. See col. 4, lines 49-55)( As shown, the memory 220 stores a distance calculator 225, an audio controller 235, an AR controller 250, and a sound of interest (SOI) identifier 260, which may be software applications or modules. The distance calculator 225 uses time difference of arrival (TDOA) 230 to determine a location of a source generating a sound. That is, the distance calculator 225 can use the microphones 213 to perform TDOA 230 and identify a distance from the AR device 205 to a sound source. Acoustic source localization is the task of locating a sound source given measurements of the sound field. The TDOA technique for acoustic source localization uses (at least) two microphones, to compute the direction and distance of the acoustic source by using the different times the signal arrived at the microphones to triangulate its position. Knowing the direction and distance of the sound source can help the AR device 205 to generate the AR content 110 and the audio description 120 as described below. See col. 3, line 57 to col. 4, line 7)( The audio controller 235 is tasked with generating the audio description 120. To do so, the audio controller 235 includes an object identifier 240 and a speech generator 245. In one embodiment, the object identifier 240 performs a ML technique that identifies objects in captured images. That is, using images captured by the camera 210 and the location of the sound source identified by the distance calculator 225, the object identifier 240 can determine what object in the captured images is the sound source. In one embodiment, the object identifier 240 decomposes the captured video of the user's FOY into image frames. For each frame, the object identifier 240 uses a visual recognition ML technique to perform object detection within the frame such as, e.g., a custom trained You-Only-Look-Once (YOLO) convolutional neural network model to perform object detection of the elements within the frame, capturing the bounding box coordinates of detected elements/regions. The speech generator 245 can use a language model to generate text for the objects and actions identified by the object identifier 240. In one embodiment, the language model is a long short-term memory (LSTM) recurrent neural network that is trained on encoded video frames images and word-embedding that describe the objects and corresponding actions occurring in the images. The speech generator 245 can then translate the text into the audio descriptions 120. See col. 4, lines 8-33).
Regarding claim 8, McCulloch teaches A wearable device (With reference now also to FIG. 2, one example of an HMD device 200 in the form of a pair of wearable glasses with a transparent display 44 is provided. See paragraph [0018]) comprising:
at least one processor (The HMD device 36 may also include a processor 250 having a logic subsystem and a storage subsystem, as discussed in more detail below with respect to FIG. 5, that are in communication with the various sensors and systems of the HMD device. See paragraph [0032]); and
a non-transitory computer-readable medium storing executable instructions that cause the at least one processor(In one example, the storage subsystem may include instructions that are executable by the logic subsystem to receive signal inputs from the sensors and forward such inputs to computing device 22 (in unprocessed or processed form), and to present images to a user via the transparent display 44. See paragraph [0032]) to:
detect an ambient sound from audio data generated by a plurality of microphones on the wearable device (In some examples, the user 304 may have a hearing impairment that can make understanding speech difficult, particularly in an environment with multiple speakers and/or significant background noise. In the example shown in FIG. 3, each of the first person 316, second person 320 and third person 324 may be speaking simultaneously. The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]);
determine, based on the audio data, a location of a sound source of the ambient sound relative to the wearable device (Using the location of target face1 54, the speech conversion program 14 may use the beamformer program 26 to apply one or more beamforming techniques to at least a portion of the audio inputs 68 from the microphone array 66. Alternatively expressed, one or more beamforming techniques may be applied to portions of audio inputs 68 that originate from the location of target face1 54. In this manner, the beamformer program 26 may identify target audio inputs, generally indicated at 336 in FIG. 3, that are associated with the face1 54 of the first person 316. See paragraph [0039]);
determine that the ambient sound includes a non-speech sound based on an audio segment of the audio data that includes the ambient sound (The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]),
generate an annotation for a speech sound using the audio segment, the annotation including a textual description that identifies the speech sound and the sound source (See figure 3.)(With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]); and
display the annotation at a display position in the display of the wearable device based on the location of the sound source (See figure 3.)(With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]).
but is silent to generate an annotation for the non-speech sound using the audio segment, the annotation including a textual description that identifies the non-speech sound and the sound source; and display the annotation of the non-speech sound at a display position in a display of the wearable device based on the location of the sound source.
Scott teaches detecting an audio location and providing an augmented reality textual and graphical description of an event superimposed to the user (FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound ( or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player 10 kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Addition- 15 ally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description 20 of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears ( e.g., the sounds generated by the soccer 25 player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26)( As shown, the memory 220 stores a distance calculator 225, an audio controller 235, an AR controller 250, and a sound of interest (SOI) identifier 260, which may be software applications or modules. The distance calculator 225 uses time difference of arrival (TDOA) 230 to determine a location of a source generating a sound. That is, the distance calculator 225 can use the microphones 213 to perform TDOA 230 and identify a distance from the AR device 205 to a sound source. Acoustic source localization is the task of locating a sound source given measurements of the sound field. The TDOA technique for acoustic source localization uses (at least) two microphones, to compute the direction and distance of the acoustic source by using the different times the signal arrived at the microphones to triangulate its position. Knowing the direction and distance of the sound source can help the AR device 205 to generate the AR
content 110 and the audio description 120 as described below. See col. 3, line 57 to col. 4, line 7)( The audio controller 235 is tasked with generating the audio description 120. To do so, the audio controller 235 includes an object identifier 240 and a speech generator 245. In one embodiment, the object identifier 240 performs a ML technique that identifies objects in captured images. That is, using images captured by the camera 210 and the location of the sound source identified by the distance calculator 225, the object identifier 240 can determine what object in the captured images is the sound source. In one embodiment, the object identifier 240 decomposes the captured video of the user's FOY into image frames. For each frame, the object identifier 240 uses a visual recognition ML technique to
perform object detection within the frame such as, e.g., a custom trained You-Only-Look-Once (YOLO) convolutional neural network model to perform object detection of the elements within the frame, capturing the bounding box coordinates of detected elements/regions. The speech generator 245 can use a language model to generate text for the objects and actions identified by the object identifier 240. In one embodiment, the language model is a long short-term memory (LSTM) recurrent neural network that is trained on encoded video frames images and word-embedding that describe the objects and corresponding actions occurring in the images. The speech generator 245 can then translate the text into the audio descriptions 120. See col. 4, lines 8-33)
McCulloch and Scott both teach presenting augmented reality information to a user about detected audio, and Scott further teaches that the presented information can be a contextual annotation of an event identified from a detected sound. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system of McCulloch with the detection and annotation of event sounds as taught by Scott, such that the user could visualize both speech and non-speech audio and visualize the source.
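By way of illustration only, the following sketch shows one way a textual description identifying both a non-speech sound and its source object could be composed once an audio classification label and an object-detection category are available, in the spirit of the combined teaching discussed above. The field names, labels, and formatting are assumptions made for this illustration and do not appear in either reference.

```python
# Illustrative sketch only: composing a textual annotation from a non-speech
# sound label and the category of the physical object identified as its source.
from typing import NamedTuple


class NonSpeechAnnotation(NamedTuple):
    sound_type: str        # e.g., output of an audio classifier for the segment
    source_category: str   # e.g., output of an object detector at the source location
    bearing_deg: float     # direction of the source relative to the wearable device

    def textual_description(self) -> str:
        side = "left" if self.bearing_deg < 0 else "right"
        return f"{self.sound_type} ({self.source_category}, to your {side})"


annotation = NonSpeechAnnotation(sound_type="Ball being kicked",
                                 source_category="Person playing soccer",
                                 bearing_deg=-31.0)
print(annotation.textual_description())  # "Ball being kicked (Person playing soccer, to your left)"
```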
Regarding claim 9, McCulloch in view of Scott teaches The wearable device of claim 8, wherein the textual description includes information identifying a type of the non-speech sound and information identifying a category of a physical object that produced the non-speech sound (Scott; FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound ( or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player 10 kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Addition- 15 ally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description 20 of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears ( e.g., the sounds generated by the soccer 25 player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26).
Regarding claim 10, McCulloch in view of Scott teaches The wearable display device of claim 8, wherein the ambient sound is a first ambient sound, the sound source is a first sound source, the audio segment is a first audio segment, and the annotation is a first annotation, wherein the executable instructions include instructions that cause the at least one processor (McCulloch in view of Scott allows for multiple sound sources)(Scott; FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound ( or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player 10 kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Addition- 15 ally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description 20 of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears ( e.g., the sounds generated by the soccer 25 player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26) to:
detect a second ambient sound from the audio data (McCulloch; Second person talking)(McCulloch; See figure 3) (McCulloch; In some examples, the user 304 may have a hearing impairment that can make understanding speech difficult, particularly in an environment with multiple speakers and/or significant background noise. In the example shown in FIG. 3, each of the first person 316, second person 320 and third person 324 may be speaking simultaneously. The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]);
determine, based on the audio data, a location of a second sound source of the second ambient sound relative to the wearable device (McCulloch; Using the location of target face1 54, the speech conversion program 14 may use the beamformer program 26 to apply one or more beamforming techniques to at least a portion of the audio inputs 68 from the microphone array 66. Alternatively expressed, one or more beamforming techniques may be applied to portions of audio inputs 68 that originate from the location of target face1 54. In this manner, the beamformer program 26 may identify target audio inputs, generally indicated at 336 in FIG. 3, that are associated with the face1 54 of the first person 316. See paragraph [0039]);
determine that the second ambient sound includes speech based on a second audio segment of the audio data that includes the second ambient sound (McCulloch; See figure 3.) (McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]);
generate a second annotation that identifies the second sound source; initiate conversion of the speech to text (McCulloch; See figure 3.)(McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]);
and display the second annotation and the text in the display of the wearable device based on the location of the second sound source (McCulloch; See figure 3.)( McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]).
Regarding claim 11, McCulloch in view of Scott teaches the wearable device of claim 8, wherein the executable instructions include instructions that cause the at least one processor to: compute a direction and a distance of the ambient sound relative to the wearable device and determine the display position for the annotation within a three-dimensional space based on the direction and the distance of the ambient sound (Scott; As shown, the memory 220 stores a distance calculator 225, an audio controller 235, an AR controller 250, and a sound of interest (SOI) identifier 260, which may be software applications or modules. The distance calculator 225 uses time difference of arrival (TDOA) 230 to determine a location of a source generating a sound. That is, the distance calculator 225 can use the microphones 213 to perform TDOA 230 and identify a distance from the AR device 205 to a sound source. Acoustic source localization is the task of locating a sound source given measurements of the sound field. The TDOA technique for acoustic source localization uses (at least) two microphones, to compute the direction and distance of the acoustic source by using the different times the signal arrived at the microphones to triangulate its position. Knowing the direction and distance of the sound source can help the AR device 205 to generate the AR
content 110 and the audio description 120 as described below. See col. 3, line 57 to col. 4, line 7)( Scott; The audio controller 235 is tasked with generating the audio description 120. To do so, the audio controller 235 includes an object identifier 240 and a speech generator 245. In one embodiment, the object identifier 240 performs a ML technique that identifies objects in captured images. That is, using images captured by the camera 210 and the location of the sound source identified by the distance calculator 225, the object identifier 240 can determine what object in the captured images is the sound source. In one embodiment, the object identifier 240 decomposes the captured video of the user's FOY into image frames. For each frame, the object identifier 240 uses a visual recognition ML technique to
perform object detection within the frame such as, e.g., a custom trained You-Only-Look-Once (YOLO) convolutional neural network model to perform object detection of the elements within the frame, capturing the bounding box coordinates of detected elements/regions. The speech generator 245 can use a language model to generate text for the objects and actions identified by the object identifier 240. In one embodiment, the language model is a long short-term memory (LSTM) recurrent neural network that is trained on encoded video frames images and word-embedding that describe the objects and corresponding actions occurring in the images. The speech generator 245 can then translate the text into the audio descriptions 120. See col. 4, lines 8-33).
Regarding claim 12, McCulloch in view of Scott teaches The wearable device of claim 11, wherein the executable instructions include instructions that cause the at least one processor to: determine an updated value for at least one of the direction or the distance as the sound source moves relative to the wearable device; and adjust the display position of the annotation within the three-dimensional space based on the updated value (McCulloch; In one example, the speech conversion program 14 may tag the text 40' to the first person 316 such that the text bubble 340 is spatially anchored to the first person and follows the first person as the first person moves. See paragraph [0043]).
Regarding claim 14, McCulloch in view of Scott teaches the wearable device of claim 8, wherein the plurality of microphones are arranged as a beamforming array, wherein the executable instructions include instructions that cause the at least one processor to: receive audio signals from the beamforming array; and compute a direction of arrival of the ambient sound relative to the wearable device using the audio signals (McCulloch; For example, the directionally-adaptive sound signal may be determined based on a time-invariant beamforming technique, adaptive beamforming technique, or a combination of time-invariant and adaptive beamforming techniques. The resulting combined signal may have a narrow directivity pattern, which may be steered in a direction of a speech source, such as the location of face1 54 of the first person 316. It will also be appreciated that any suitable beamforming technique may be used to identify the target audio inputs associated with the target face. See paragraph [0041]).
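By way of illustration only, one conventional way to compute a direction of arrival from the audio signals of a microphone array is a delay-and-sum scan over candidate directions, sketched below. The array geometry, sample rate, and scan resolution are assumptions made for this illustration; the cited portion of McCulloch states that any suitable beamforming technique may be used.

```python
# Illustrative sketch only: a frequency-domain delay-and-sum scan that returns
# the steering direction with maximum output power for a small microphone array.
import numpy as np


def delay_and_sum_doa(signals, mic_x, fs, speed_of_sound=343.0):
    """Return the azimuth (radians) whose steered, summed output has maximum power.

    signals: array of shape (num_mics, num_samples); mic_x: microphone positions
    (meters) along one axis; positive angles point toward larger mic_x (assumed).
    """
    num_samples = signals.shape[1]
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    best_angle, best_power = 0.0, -np.inf
    for angle in np.linspace(-np.pi / 2, np.pi / 2, 181):   # 1-degree steps
        delays = mic_x * np.sin(angle) / speed_of_sound
        # Phase compensation that aligns a plane wave arriving from `angle`.
        steering = np.exp(-2j * np.pi * np.outer(delays, freqs))
        power = np.sum(np.abs(np.sum(spectra * steering, axis=0)) ** 2)
        if power > best_power:
            best_angle, best_power = angle, power
    return best_angle


# Example: two microphones 14 cm apart; the +x microphone hears a broadband
# source 5 samples earlier, so the source lies toward +x (a positive angle).
rng = np.random.default_rng(1)
fs = 48_000.0
s = rng.standard_normal(4_800)
delayed = np.concatenate([np.zeros(5), s[:-5]])
signals = np.vstack([delayed, s])            # rows: mic at -0.07 m, mic at +0.07 m
print(np.degrees(delay_and_sum_doa(signals, np.array([-0.07, 0.07]), fs)))  # ~ +15
```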
Regarding claim 15, McCulloch in view of Scott teaches The wearable device of claim 10, wherein the executable instructions include instructions that cause the at least one processor to: detect the second ambient sound at least partially in parallel with detecting the first ambient sound (McCulloch; In the example shown in FIG. 3, each of the first person 316, second person 320 and third person 324 may be speaking simultaneously. The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0035]).
Regarding claim 16, McCulloch teaches A non-transitory computer-readable medium storing executable instructions that when executed by at least one processor cause the at least one processor to execute operations (In one example, the storage subsystem may include instructions that are executable by the logic subsystem to receive signal inputs from the sensors and forward such inputs to computing device 22 (in unprocessed or processed form), and to present images to a user via the transparent display 44. See paragraph [0032]), the operations comprising:
detecting an ambient sound from audio data generated by a plurality of microphones on a wearable device (In some examples, the user 304 may have a hearing impairment that can make understanding speech difficult, particularly in an environment with multiple speakers and/or significant background noise. In the example shown in FIG. 3, each of the first person 316, second person 320 and third person 324 may be speaking simultaneously. The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]);
determining, based on the audio data, a location of a sound source of the ambient sound relative to the wearable device (Using the location of target face1 54, the speech conversion program 14 may use the beamformer program 26 to apply one or more beamforming techniques to at least a portion of the audio inputs 68 from the microphone array 66. Alternatively expressed, one or more beamforming techniques may be applied to portions of audio inputs 68 that originate from the location of target face1 54. In this manner, the beamformer program 26 may identify target audio inputs, generally indicated at 336 in FIG. 3, that are associated with the face1 54 of the first person 316. See paragraph [0039]);
determining that the ambient sound includes a non-speech sound based on an audio segment of the audio data that includes the ambient sound (The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]);
generating an annotation for a speech sound using the audio segment, the annotation including a textual description that identifies the speech sound and the sound source (See figure 3.)( With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]); and
displaying the annotation at a display position in the display of the wearable device based on the location of the sound source (See figure 3.)(With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]).
but is silent to generating an annotation for the non-speech sound using the audio segment, the annotation including a textual description that identifies the non-speech sound and the sound source; and displaying the annotation at a display position in a display of the wearable device based on the location of the sound source.
Scott teaches detecting the location of a sound source and providing an augmented reality textual and graphical description of an event superimposed on the user's view (FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound (or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Additionally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears (e.g., the sounds generated by the soccer player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26) (As shown, the memory 220 stores a distance calculator 225, an audio controller 235, an AR controller 250, and a sound of interest (SOI) identifier 260, which may be software applications or modules. The distance calculator 225 uses time difference of arrival (TDOA) 230 to determine a location of a source generating a sound. That is, the distance calculator 225 can use the microphones 213 to perform TDOA 230 and identify a distance from the AR device 205 to a sound source. Acoustic source localization is the task of locating a sound source given measurements of the sound field. The TDOA technique for acoustic source localization uses (at least) two microphones, to compute the direction and distance of the acoustic source by using the different times the signal arrived at the microphones to triangulate its position. Knowing the direction and distance of the sound source can help the AR device 205 to generate the AR content 110 and the audio description 120 as described below. See col. 3, line 57 to col. 4, line 7) (The audio controller 235 is tasked with generating the audio description 120. To do so, the audio controller 235 includes an object identifier 240 and a speech generator 245. In one embodiment, the object identifier 240 performs a ML technique that identifies objects in captured images. That is, using images captured by the camera 210 and the location of the sound source identified by the distance calculator 225, the object identifier 240 can determine what object in the captured images is the sound source. In one embodiment, the object identifier 240 decomposes the captured video of the user's FOV into image frames. For each frame, the object identifier 240 uses a visual recognition ML technique to
perform object detection within the frame such as, e.g., a custom trained You-Only-Look-Once (YOLO) convolutional neural network model to perform object detection of the elements within the frame, capturing the bounding box coordinates of detected elements/regions. The speech generator 245 can use a language model to generate text for the objects and actions identified by the object identifier 240. In one embodiment, the language model is a long short-term memory (LSTM) recurrent neural network that is trained on encoded video frames images and word-embedding that describe the objects and corresponding actions occurring in the images. The speech generator 245 can then translate the text into the audio descriptions 120. See col. 4, lines 8-33)
McCulloch and Scott both teach presenting augmented reality information about detected audio to a user, and Scott further teaches that the information provided can be a contextual annotation of an event based on a detected sound. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system of McCulloch with the detection and annotation of event sounds as taught by Scott, such that the user could visualize both speech and non-speech audio as well as the source of each.
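As an illustration of the combined teaching, the sketch below shows one plausible way a classified non-speech sound could be rendered as a textual annotation anchored near its localized source on a head-mounted display. It is a minimal sketch only: the function names, the pinhole projection model, and the example values are assumptions of this illustration and do not appear in McCulloch or Scott.

# Minimal sketch (illustration only; names and projection model are assumed).
from dataclasses import dataclass

@dataclass
class Annotation:
    text: str            # e.g., "ambulance: siren"
    display_xy: tuple    # pixel position on the head-mounted display

def project_to_display(source_xyz, focal_px=800.0, center=(640, 360)):
    """Project a 3-D source location (device coordinates, metres) to 2-D display
    pixels using a simple pinhole model (an assumption made for clarity)."""
    x, y, z = source_xyz
    z = max(z, 0.1)  # guard against division by zero for very near sources
    return (center[0] + focal_px * x / z, center[1] - focal_px * y / z)

def annotate_non_speech(sound_label, object_label, source_xyz):
    """Build a textual description identifying the non-speech sound and its source,
    positioned near the source on the display."""
    return Annotation(text=f"{object_label}: {sound_label}",
                      display_xy=project_to_display(source_xyz))

if __name__ == "__main__":
    # Example: a siren localized 2 m to the left of and 10 m ahead of the wearer.
    print(annotate_non_speech("siren", "ambulance", (-2.0, 0.0, 10.0)))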
Regarding claim 17, McCulloch in view of Scott teaches The non-transitory computer-readable medium of claim 16, wherein the textual description includes information identifying a type of the non-speech sound and information identifying a category of a physical object that produced the non-speech sound, the textual description also identifying at least one attribute about the physical object that is different from the category of the physical object (Scott; FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound (or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Additionally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears (e.g., the sounds generated by the soccer player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26).
Regarding claim 18, McCulloch in view of Scott teaches The non-transitory computer-readable medium of claim 16, wherein the ambient sound is a first ambient sound, the sound source is a first sound source, the audio segment is a first audio segment, and the annotation is a first annotation, wherein the operations further (McCulloch in view of Scott allows for multiple sound sources)(Scott; FIG. 1 illustrates providing AR content 110 and an audio description 120 of a real-world event, according to one embodiment described herein. FIG. 1 includes a display 105 and a speaker 115 that can be part of a wearable device (e.g., a HUD) or a handheld device (e.g., a smart phone or tablet). In this example, the display 105 outputs an image of a real-world event that can be captured using a camera. However, in another embodiment, rather than displaying an image captured using a camera on, e.g., a LCD or LED display screen, the display 105 can instead be transparent such that environmental light can pass through the display 105 to reach the eyes of the user. In FIG. 1, the real-world event includes people playing soccer in the distance. If the user is visually impaired, she may be able to hear the sounds associated with this event (e.g., shouting, laughter, a ball being kicked, etc.) but not be able to see the event. In the embodiments herein, an AR device can identify the source of the sound (or sounds) and generate a magnified version of the source as AR content 110. That is, the AR device determines that the player kicking the ball is the source of the sound and then generates a magnified (larger) version of the source that is then superimposed onto the display 105 as the AR content 110. The AR device can magnify the AR content 110 to a size that is readily viewable to the visually impaired user. Additionally, the display 105 can output markers or demarcations to set apart the AR content 110 from the other objects in the display 105. In addition to generating the AR content 110, the AR device can use the speaker 115 to output an audio description of the source as well as its action or actions. In this example, the audio description 120 is "A PERSON PLAYING SOCCER" which indicates the source of the sound (i.e., a person) and its action (i.e., playing soccer). The user can correlate the sound she hears (e.g., the sounds generated by the soccer player) to the AR content 110 and the audio description 120. See Col. 2, lines 58 to Col. 3, line 26) comprise:
detecting a second ambient sound from the audio data at least partially in parallel with detecting the first ambient sound from the audio data (McCulloch; Second person talking)(McCulloch; See figure 3) (McCulloch; In some examples, the user 304 may have a hearing impairment that can make understanding speech difficult, particularly in an environment with multiple speakers and/or significant background noise. In the example shown in FIG. 3, each of the first person 316, second person 320 and third person 324 may be speaking simultaneously. The wall-mounted display 328 may also be emitting audio. All of these audio inputs 68 may be received by the microphones 224, 228, 232, 236, 240, and 244 of the HMD device 200. See paragraph [0036]);
determining, based on the audio data, a location of a second sound source of the second ambient sound relative to the wearable device (McCulloch; Using the location of target face 154, the speech conversion program 14 may use the beamformer program 26 to apply one or more beamforming techniques to at least a portion of the audio inputs 68 from the microphone array 66. Alternatively expressed, one or more beamforming techniques may be applied to portions of audio inputs 68 that originate from the location of target face 154. In this manner, the beamformer program 26 may identify target audio inputs, generally indicated at 336 in FIG. 3, that are associated with the face 154 of the first person 316. See paragraph [0039]);
determining that the second ambient sound includes speech based on a second audio segment of the audio data that includes the second ambient sound (McCulloch; See figure 3.) (McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]);
generating a second annotation that identifies the second sound source; initiating conversion of the speech to text (McCulloch; See figure 3.) (McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]); and
displaying the second annotation and the text in the display of the wearable device based on the location of the second sound source (McCulloch; See figure 3.) (McCulloch; With continued reference to FIGS. 1 and 3, the speech conversion program 14 may be configured to convert the target audio inputs into text 40, and to display the text 40 via the transparent display 44 of the HMD device 200. In the example shown in FIG. 3, the target audio inputs 336 may be converted to text 40' that is displayed by HMD device 200 above the head of the first person 316 in a text bubble 340, thereby enabling user 304 to easily associate the text with the first person. See paragraph [0042]).
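By way of illustration of the claim 18 mapping, the sketch below shows one hypothetical way two or more concurrently detected sources could each receive their own annotation, with speech routed to a speech-to-text step and non-speech sounds to a descriptive step. The transcribe, describe, and localize callables are placeholders assumed for this sketch, not elements disclosed by either reference.

# Minimal sketch (illustration only; all helper callables are assumed placeholders).
def annotate_sources(sources, transcribe, describe, localize):
    """Produce one annotation per detected source; a simple loop stands in here
    for processing the sources at least partially in parallel."""
    annotations = []
    for src in sources:
        location = localize(src["audio"])
        if src["is_speech"]:
            text = transcribe(src["audio"])   # speech is converted to text
        else:
            text = describe(src["audio"])     # non-speech gets a textual description
        annotations.append({"text": text, "location": location})
    return annotations

# Example usage with trivial stand-in callables.
demo = annotate_sources(
    [{"audio": b"...", "is_speech": True}, {"audio": b"...", "is_speech": False}],
    transcribe=lambda a: "Hello there",
    describe=lambda a: "Door closing",
    localize=lambda a: (1.0, 0.0, 3.0),
)
print(demo)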
Regarding claim 19, McCulloch in view of Scott teaches The non-transitory computer-readable medium of claim 16, wherein the operations further comprise: computing a direction and a distance of the ambient sound relative to the wearable device; and determining the display position for the annotation within a three-dimensional space based on the direction and the distance of the ambient sound (Scott; As shown, the memory 220 stores a distance calculator 225, an audio controller 235, an AR controller 250, and a sound of interest (SOI) identifier 260, which may be software applications or modules. The distance calculator 225 uses time difference of arrival (TDOA) 230 to determine a location of a source generating a sound. That is, the distance calculator 225 can use the microphones 213 to perform TDOA 230 and identify a distance from the AR device 205 to a sound source. Acoustic source localization is the task of locating a sound source given measurements of the sound field. The TDOA technique for acoustic source localization uses (at least) two microphones, to compute the direction and distance of the acoustic source by using the different times the signal arrived at the microphones to triangulate its position. Knowing the direction and distance of the sound source can help the AR device 205 to generate the AR
content 110 and the audio description 120 as described below. See col. 3, line 57 to col. 4, line 7) (Scott; The audio controller 235 is tasked with generating the audio description 120. To do so, the audio controller 235 includes an object identifier 240 and a speech generator 245. In one embodiment, the object identifier 240 performs a ML technique that identifies objects in captured images. That is, using images captured by the camera 210 and the location of the sound source identified by the distance calculator 225, the object identifier 240 can determine what object in the captured images is the sound source. In one embodiment, the object identifier 240 decomposes the captured video of the user's FOV into image frames. For each frame, the object identifier 240 uses a visual recognition ML technique to
perform object detection within the frame such as, e.g., a custom trained You-Only-Look-Once (YOLO) convolutional neural network model to perform object detection of the elements within the frame, capturing the bounding box coordinates of detected elements/regions. The speech generator 245 can use a language model to generate text for the objects and actions identified by the object identifier 240. In one embodiment, the language model is a long short-term memory (LSTM) recurrent neural network that is trained on encoded video frames images and word-embedding that describe the objects and corresponding actions occurring in the images. The speech generator 245 can then translate the text into the audio descriptions 120. See col. 4, lines 8-33).
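To make the cited TDOA discussion concrete, the sketch below estimates a direction of arrival from the time difference between two microphone signals by cross-correlation, in the general manner Scott describes at col. 3, line 57 to col. 4, line 7. The sample rate, microphone spacing, and far-field approximation are assumptions of this sketch; Scott does not disclose a particular implementation.

# Minimal sketch (illustration only; geometry and sample rate are assumed).
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def estimate_azimuth(sig_left, sig_right, fs=48_000, mic_spacing=0.15):
    """Estimate direction of arrival from the time difference of arrival between
    two microphone signals, found by cross-correlation."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)  # negative when the left mic hears the sound first
    tdoa = lag / fs
    # Far-field approximation: sin(theta) = c * tdoa / d, clipped to a valid range.
    sin_theta = np.clip(SPEED_OF_SOUND * tdoa / mic_spacing, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))  # negative angles point toward the left mic

# Example: a synthetic click that reaches the left microphone 5 samples earlier.
click = np.zeros(1024)
click[100] = 1.0
left, right = np.roll(click, -5), click
print(round(float(estimate_azimuth(left, right)), 1), "degrees")  # about -13.8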
Regarding claim 20, McCulloch in view of Scott teaches The non-transitory computer-readable medium of claim 19, wherein the operations further comprise: determining an updated value for at least one of the direction or the distance as the sound source moves relative to the wearable device; and adjusting the display position of the annotation within the three-dimensional space based on the updated value (McCulloch; In one example, the speech conversion program 14 may tag the text 40' to the first person 316 such that the text bubble 340 is spatially anchored to the first person and follows the first person as the first person moves. See paragraph [0043]).
Claim(s) 5 is rejected under 35 U.S.C. 103 as being unpatentable over McCulloch et al. (US 2014/0337023)(Hereinafter referred to as McCulloch) in view of Scott, II et al. (US 11,055,533)(Hereinafter referred to as Scott), and further in view of Hui Song (US 2018/0330726)(Hereinafter referred to as Song).
Regarding claim 5, McCulloch in view of Scott teaches The method of claim 1, but is silent to further comprising: applying a filter to the audio segment to reduce background noise in the audio segment.
Song teaches a technique that captures audio from a microphone array and uses artificial-intelligence-based processing to filter out noise (Embodiments of a first aspect of the present disclosure provide a speech recognition method based on artificial intelligence, including: collecting signals of an array of microphones to obtain a plurality of first speech signals; filtering out a reverberation signal in each first speech signal to obtain a plurality of second speech signals, and obtaining a third speech signal based on the plurality of second speech signals; performing noise extraction on each first speech signal based on the third speech signal to obtain a plurality of first noise signals; and filtering and adding the plurality of first noise signals to obtain a second noise signal, and subtracting the second noise signal from the third speech signal to obtain a target speech signal. See paragraph [0007]) (As illustrated in FIG. 1, the speech recognition method based on artificial intelligence includes followings. See paragraph [0025]).
McCulloch in view of Scott and Song both teach processing audio data, and Song teaches that noise can be filtered out using artificial-intelligence-based techniques. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system of McCulloch in view of Scott with the noise filtering techniques of Song, such that the system could provide an accurate textual transcription of the audio.
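As a rough illustration of noise filtering applied to an audio segment, the sketch below uses plain spectral subtraction: an estimated noise magnitude spectrum is subtracted frame by frame from the segment. It is offered only as a generic stand-in for the kind of filtering Song describes; it is not Song's multi-stage algorithm (dereverberation, combination of the array signals, noise extraction and subtraction), and the frame length and sample rate are assumptions.

# Minimal sketch (illustration only; generic spectral subtraction, not Song's method).
import numpy as np

def reduce_noise(segment, noise_profile, frame=512):
    """Subtract an estimated noise magnitude spectrum from each frame of the audio
    segment and reconstruct the cleaned waveform."""
    noise_mag = np.abs(np.fft.rfft(noise_profile[:frame]))
    cleaned = np.array(segment, dtype=float, copy=True)
    for start in range(0, len(segment) - frame + 1, frame):
        spec = np.fft.rfft(segment[start:start + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor the magnitude at zero
        cleaned[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return cleaned

# Example: a 440 Hz tone in white noise, with a separately captured noise-only profile.
fs = 16_000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(fs)
print(reduce_noise(noisy, 0.3 * np.random.randn(512)).shape)  # (16000,)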
Claim(s) 13 is rejected under 35 U.S.C. 103 as being unpatentable over McCulloch et al. (US 2014/0337023)(Hereinafter referred to as McCulloch) in view of Scott, II et al. (US 11,055,533)(Hereinafter referred to as Scott), and further in view of Poore et al. (US 2021/0055367)(Hereinafter referred to as Poore).
Regarding claim 13, McCulloch in view of Scott teaches The wearable device of claim 8, but is silent to wherein the executable instructions include instructions that cause the at least one processor to: determine that the location of the sound source is outside a field of view of the wearable device; and generate the annotation to include a description of a relative position of the sound source with respect to the wearable device.
Poore teaches detecting that an ambient sound source is outside the field of view and providing an annotation indicating the direction of the sound (Referring now to FIG. 6, the display 190 of the
head-mountable device 100 can provide a view of the external environment, even when the view does not include the source of the sound. One or more of the displayed items in the view of the display 190 can correspond to physical objects in an environment, as discussed herein. For example, a camera of the head-mountable device 100 can capture a view of the external environment. See paragraph [0059]) (As shown in FIG. 6, the display 190 can identify a source of a detected sound as having a particular location (e.g., direction of origin) with respect to the head-mountable device 100. Such determinations can be performed by an array of microphones, as discussed herein. Upon determination of the location of the source 20, it can be further determined that the location of the source is not within a field-of-view provided by the display 190. Such a determination can be made based on a known spatial relationship between the microphones and the display 190 of the head-mountable device 100. As further shown in FIG. 6, the indicator 300 can be visually output by the display 190 to indicate the location of the source even when the source is not displayed within the field-of-view of the display 190. As such, the indicator 300 can suggest to the user the direction in which the user may change its position and/or orientation to capture a view of the source. Such an output can help the user visually identify the location of the source even when the user is unable to directly identify the location based on the user's own detection of the sound. See paragraph [0060]) (The indicator 300 can include an icon, symbol, graphic, text, word, number, character, picture, or other visible feature that can be displayed at, on, and/or near the portion of the display 190 that most closely corresponds to the location of the source. By further example, the indicator 300 can correspond to a known characteristic (e.g., identity, name, color, etc.) of the source. Additionally or alternatively, the indicator 300 can include visual features such as color, highlighting, glowing, outlines, shadows, or other contrasting features that allow portions thereof to be more distinctly visible when displayed along with the view to the external environment and/or objects therein. See paragraph [0061])
McCulloch in view of Scott and Poore both teach presenting information about detected audio signals to a user, and Poore teaches that, if the sound is outside the field of view, an annotation can indicate to the user the direction of the sound. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the system of McCulloch in view of Scott with the out-of-view presentation techniques of Poore, such that the user could be directed to turn toward sound sources outside the current field of view.
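As an illustration of the out-of-view case discussed in Poore, the sketch below checks whether a localized source direction falls outside an assumed display field of view and, if so, produces a short relative-position description of the kind claim 13 recites. The 52-degree field of view and the 120-degree "behind" threshold are assumptions of this sketch, not values disclosed by Poore.

# Minimal sketch (illustration only; field-of-view and thresholds are assumed).
def out_of_view_annotation(source_azimuth_deg, half_fov_deg=26.0):
    """Return a relative-position description when the source lies outside the
    field of view; return None when it can simply be annotated in place."""
    if abs(source_azimuth_deg) <= half_fov_deg:
        return None  # source is visible; no directional hint needed
    if abs(source_azimuth_deg) > 120.0:
        return "Sound behind you"
    return "Sound to your left" if source_azimuth_deg < 0 else "Sound to your right"

# Example usage.
for azimuth in (10.0, -60.0, 150.0):
    print(azimuth, "->", out_of_view_annotation(azimuth))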
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICHOLAS R WILSON whose telephone number is (571)272-0936. The examiner can normally be reached M-F 7:30 AM-5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung, can be reached at (571) 272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/NICHOLAS R WILSON/Primary Examiner, Art Unit 2611