Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b) CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.
The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.
Claims 1-2, 5-10, 12-14, 16-17, 19-22, and 26 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claims 1 and 26 recite "at least one audio sensor" and "echo signals swapped from different audio sensors." However, there can be no swapping of echo signals if only a single audio sensor is present; the swapping limitation requires at least two audio sensors. The dependent claims are rejected because they fail to remedy this deficiency.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-2, 6-10, 12-13, 16-17, 20-22, and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Christensen (2020, IEEE; search report) and Reiter (2017, SPIE).
Regarding claims 1 and 26, Christensen teaches a depth sensing apparatus configured to generate a depth map of an environment and a depth sensing method for generating a depth map of an environment, the apparatus and method including:
a) an audio output device [[abstract] system emits short chirps from a speaker];
b) at least one audio sensor [[abstract] records returning echoes through microphones in an artificial human pinnae pair]; and,
c) one or more processing devices [[pg. 3, col. 2] capture the complete scene of the 3D space ahead with a small system mounted on a mobile device] configured to:
i) cause the audio output device to emit an omnidirectional emitted audio signal [[fig. 1] shows sound chirps emitted in all directions from the speaker, as demonstrated by arrows; [pg. 3, col. 2] in [15], sound is localized using emulated binaural hearing, with a model of human ears and head-related transfer functions. They test azimuth from 0 to 360 degrees at 5-degree resolution and test elevation from 40 to 90 degrees at 10-degree resolution];
ii) acquire echo signals indicative of reflected audio signals captured by the at least one audio sensor in response to reflection of the emitted audio signal from the environment surrounding the depth sensing apparatus [[abstract] records returning echoes through microphones in an artificial human pinnae pair. During training, we additionally use a stereo camera to capture color images for calculating scene depths. We train a model to predict depth maps and even grayscale images from the sound alone. During testing, our trained BatVision provides surprisingly good predictions of 2D visual scenes from two 1D audio signals.];
iii) generate spectrograms using the echo signals [[pg. 4, col. 2] we consider two audio representations: 1D raw waveforms and 2D amplitude spectrograms. The LibROSA library for Python is used to compute spectrograms with 512 points for FFT and Hanning window size 64. Fig. 3 shows the probing chirp at 3 ms and the returned echoes afterwards.]; and,
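For illustration only, the spectrogram computation cited from Christensen (512-point FFT, Hanning window of 64 samples) can be sketched as a short-time FFT in NumPy. This is a minimal sketch, not code from the reference; the chirp parameters (44.1 kHz sample rate, 3 ms sweep) are assumptions for the example, and LibROSA's `stft` would produce equivalent magnitudes.

```python
import numpy as np

def amplitude_spectrogram(signal, n_fft=512, win_length=64, hop=32):
    """Magnitude spectrogram via a windowed short-time FFT.

    Mirrors the cited settings: 512 FFT points, Hanning window of 64.
    """
    window = np.hanning(win_length)
    frames = []
    for start in range(0, len(signal) - win_length + 1, hop):
        frame = signal[start:start + win_length] * window
        # Zero-pad each 64-sample frame out to the 512-point FFT.
        frames.append(np.abs(np.fft.rfft(frame, n=n_fft)))
    return np.array(frames).T  # shape: (n_fft // 2 + 1, n_frames)

# Illustrative 3 ms linear chirp at 44.1 kHz (parameters are assumptions).
sr = 44100
t = np.linspace(0, 0.003, int(sr * 0.003), endpoint=False)
chirp = np.sin(2 * np.pi * (20 * t + (20000 - 20) / 0.003 / 2 * t ** 2))
spec = amplitude_spectrogram(chirp)
print(spec.shape)  # (257, n_frames)
```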
iv) apply the spectrograms to a computational model to generate a depth map [[fig. 1] camera captures stereo image pairs, based on which depth maps can be calculated. 2) We train a model to turn binaural signals into visual scenes such as depth-maps or grayscale images], the computational model being trained using reference echo signals and omnidirectional reference depth images [[pg. 4, col. 2] we train our model with two possible audio representations. Our experiments indicate that spectrograms yield slightly better sound-to-vision predictions over raw waveforms. However, as we aim for a real-time BatVision system on embedded platforms, we focus on raw waveforms which are more computationally efficient. [pg. 4, col. 2] we use an encoder-decoder network architecture to turn the audio clip into the visual image, and further improve the quality of generated images using an adversarial discriminator to contrast them against the ground-truth (Fig. 4).].
Christensen does not explicitly teach and yet Reiter teaches a reference depth image inverted about a vertical axis, and reference echo signals swapped from different audio sensors [[abstract] photoacoustic imaging; [sec. 3.2 two simulated sources] feasibility of detecting more than one point source in a single image, our trained CNN was applied to 20 images containing two signal sources. Our trained CNN successfully identified the source locations with mean axial and lateral errors of 2.6 mm and 2.1 mm, respectively. Because the network was only trained to detect single point sources, it is remarkable that we detected the location of more than one source in a single image. To achieve this detection, we noticed that the network was biased toward detecting the right-most wavefield. Thus, as demonstrated in Fig. 3 (top), the images were first input as normal (Input 1), then flipped from left to right, and the flipped image was treated as a new input (Input 2).].
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to combine the learning model as taught by Christensen with the inverted image as taught by Reiter, so that multiple sound sources may be identified from a single image (Reiter) [[sec. 3.2][ref. 10] depth map prediction from a single image].
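For illustration only, the claimed augmentation (a reference depth image inverted about a vertical axis with the corresponding echo signals swapped between sensors), which parallels Reiter's left-right image flipping, can be sketched as follows. This is a hypothetical sketch, not code from either reference.

```python
import numpy as np

def flip_and_swap(depth_image, echo_left, echo_right):
    """Mirror a depth image about its vertical axis and swap the
    left/right echo channels, so the flipped training pair remains
    physically consistent. Illustrative sketch only."""
    flipped = np.fliplr(depth_image)       # invert about vertical axis
    return flipped, echo_right, echo_left  # swapped sensor channels

depth = np.arange(12).reshape(3, 4)
left, right = np.array([1.0, 2.0]), np.array([3.0, 4.0])
d2, l2, r2 = flip_and_swap(depth, left, right)
print(d2[0])  # first row reversed: [3 2 1 0]
```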
Regarding claim 2, Christensen teaches the depth sensing apparatus according to claim 1, wherein at least one of: a) the depth sensing apparatus includes one of: i) at least two audio sensors [[abstract] records returning echoes through microphones in an artificial human pinnae pair]; ii) at least three audio sensors spaced apart around the audio output device; and, iii) four audio sensors spaced apart around the audio output device; b) the at least one audio sensor includes at least one of: i) a directional microphone [[pg. 4, sec. b. our hardware] we adopt two low-cost consumer-grade omni-directional USB Lavalier MAONO AU-410 microphones, separated approximately 23.5 cm apart. Each microphone is mounted in a Soundlink silicone ear to effectively emulate an artificial human auditory system.]; ii) an omnidirectional microphone; and, iii) an omnidirectional microphone embedded into artificial pinnae; and, c) the audio output device is one of: i) a speaker [[abstract] system emits short chirps from a speaker]; and, ii) an upwardly facing speaker.
Regarding claim 5, Christensen teaches the depth sensing apparatus according to claim 1, wherein the emitted audio signal is at least one of: a chirp signal; a chirp signal including a linear sweep between about 20 Hz - 20 kHz; and, a chirp signal emitted over a duration of about 3 ms [[pg. 3, col. 1] inspired by echolocation in animals, several papers [4], [2], [5], [6], [7] study target echolocation in the 2D or 3D space using ultrasonic frequency modulated (FM) chirps between 20-200 kHz. Bats emit pulse trains of very short durations (typically < 5 ms) and use received echoes to perceive their surroundings].
Regarding claim 6, Christensen teaches the depth sensing apparatus according to claim 1, wherein the reflected audio signals are captured over a time period dependent on a depth of the reference depth images [[pg. 4, col. 1-2 bridging] we choose the length of each audio instance to be 72.5 ms, so that it includes echoes traveling up to 12m. This time window selection reflects a trade-off between receiving echos within the distance relevant for navigation and reducing later echos from multiple reflection paths. Each of our audio clips has 3200 frames, containing one chirp and returned echoes.].
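For context, the 72.5 ms window cited from Christensen follows from the round-trip travel time of an echo over the 12 m maximum range. A minimal check, assuming a speed of sound of 343 m/s (an assumption; the reference does not state the value used):

```python
# Round-trip echo time for the cited 12 m range, assuming c = 343 m/s.
c = 343.0
max_depth_m = 12.0
round_trip_s = 2 * max_depth_m / c
print(round(round_trip_s * 1000, 1))  # ~70.0 ms, within the 72.5 ms window
```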
Regarding claim 7, Christensen teaches the depth sensing apparatus according to claim 1, wherein the spectrograms are greyscale spectrograms [[abstract] we train a model to predict depth maps and even grayscale images from the sound alone].
Regarding claim 8, Christensen teaches the depth sensing apparatus according to claim 1, wherein the depth sensing apparatus includes a range sensor configured to sense a distance to the environment [[pg. 3, col. 1] mobile robot while mapping and avoiding obstacles using azimuth and range information from ultrasonic sensors; [pg. 4, col. 2] we compute the scene depth using the API of our camera, range clipped within 12m. We normalize the depth value to be between 0 and 1. Pixels where the camera is unable to produce a valid measurements are set to 0.], wherein the one or more processing devices are configured to: a) acquire depth signals from the range sensor; and, b) use the depth signals to at least one of: i) generate omnidirectional reference depth images for use in training the computational model; and, ii) perform multi-modal depth sensing [[pg. 3, col. 2] In [11], sound is localized using an acoustic camera [14], a hybrid audio-visual sensor that provides RGB video overlaid with acoustic sound, aligned in time and space. All the works on sound localization receive sound signals passively.].
Regarding claim 9, Christensen teaches the depth sensing apparatus according to claim 8, wherein the range sensor includes at least one of: a lidar; a radar; and, a stereoscopic imaging system [[abstract] radar and ultrasound complement camera-based vision but they are often too costly and complex to set up for very limited information gain … during training we additionally use a stereo camera].
Regarding claim 10, Christensen teaches the depth sensing apparatus according to claim 1, wherein at least one of: a) the computational model includes at least one of: i) a trained encoder-decoder-encoder computational model [[pg. 4, col. 2] encoder-decoder network architecture]; ii) generative adversarial model [[pg. 4, col. 2] adversarial discriminator]; iii) a convolutional neural network [[pg. 3, col. 1] train a neural network model to predict images such as depth maps and grayscale images from audio data alone]; and, iv) a U-net network [[pg. 5, col. 1] for raw waveforms, successive deconvolutions yield the best results, whereas for spectrograms, a UNet-type encoder-decoder network [19] yields best results.]; and, b) the computational model is configured to: i) downsample the spectrograms to generate a feature vector; and, ii) upsample the feature vector to generate the depth map [[pg. 5, col. 1] the encoder of the UNet downsamples the 32x32x1 input through several layers of double convolutions followed by batch normalization and ReLU, whereas the decoder of the UNet upsamples the input through double de-convolutions followed by batch normalization and ReLU. Skip connections are utilized wherever possible.].
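For illustration only, the cited encoder-decoder shape flow (downsample a 32x32 spectrogram to a compact feature representation, then upsample back to a depth map) can be sketched with plain average pooling and nearest-neighbor upsampling. This is a hypothetical NumPy sketch of the shape flow, not the learned convolutional network from the reference.

```python
import numpy as np

def avg_pool2(x):
    """Halve each spatial dimension via 2x2 average pooling (encoder step)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """Double each spatial dimension via nearest-neighbor repeat (decoder step)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

spec = np.random.rand(32, 32)           # 32x32 spectrogram input, as cited
feat = avg_pool2(avg_pool2(spec))       # encoder: 32x32 -> 8x8 feature map
depth_map = upsample2(upsample2(feat))  # decoder: 8x8 -> 32x32 depth map
print(feat.shape, depth_map.shape)
```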
Regarding claim 12, Christensen teaches the depth sensing apparatus according to claim 1, wherein the one or more processing devices are configured to: acquire reference depth images and corresponding reference echo signals [[abstract] our system emits short chirps from a speaker and records returning echoes through microphones in an artificial human pinnae pair. During training, we additionally use a stereo camera to capture color images for calculating scene depths.]; and, train a generator and discriminator using the reference depth images and reference echo signals to thereby generate the computational model [[abstract] we train a model to predict depth maps and even grayscale images from the sound alone.; [[fig. 4] our sound to vision network architecture. The temporal convolutional audio encoder A turns the binaural input into a latent audio feature vector, based on which the visual generator G predicts the scene depth map. The discriminator D compares the prediction with the ground-truth and enforces high-frequency structure reconstruction at the patch level.]].
Regarding claim 13, Christensen teaches the depth sensing apparatus according to claim 1, wherein the one or more processing devices are configured to perform pre-processing of at least one of the reference echo signals and reference depth images when training the computational model [[pg. 2, col. 2] they sense the world by continuously emitting ultrasonic pulses and processing echos returned from the environment.; [pg. 4, col. 1] time window selection].
Regarding claim 16, Christensen teaches the depth sensing apparatus according to claim 1, wherein the one or more processing devices are configured to perform augmentation when training the computational model [[pg. 4, col. 2] we synchronize all the audio instances by the time of the recorded chirp. However, during training, we augment the audio data by jittering the position of the window by 30%.].
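For illustration only, the cited augmentation (jittering the position of the training window by 30%) can be sketched as a random shift of the window start within a 3200-frame clip. The window length below is an assumption for the example; only the 30% jitter fraction and the 3200-frame clip length come from the reference.

```python
import numpy as np

def jitter_window(audio, win_len, jitter_frac=0.30, rng=None):
    """Extract a training window whose start position is jittered by up
    to +/- jitter_frac of the window length (the cited 30%)."""
    rng = rng or np.random.default_rng(0)
    max_shift = int(win_len * jitter_frac)
    start = max_shift + rng.integers(-max_shift, max_shift + 1)
    return audio[start:start + win_len]

clip = np.arange(3200)  # 3200-frame audio clip, as cited
win = jitter_window(clip, win_len=2000)
print(win.shape)  # (2000,)
```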
Regarding claim 17, Christensen teaches the depth sensing apparatus according to claim 16, wherein at least one of: a) the one or more processing devices are configured to perform augmentation by: i) truncating a spectrogram derived from the reference echo signals [[pg. 4, col. 2] we synchronize all the audio instances by the time of the recorded chirp. However, during training, we augment the audio data by jittering the position of the window by 30%.]; and, ii) limiting a depth of the reference depth images in accordance with truncation of the corresponding spectrograms [[pg. 4, col. 1] we choose the length of each audio instance to be 72.5 ms, so that it includes echoes traveling up to 12m. This time window selection reflects a trade-off between receiving echos within the distance relevant for navigation and reducing later echos from multiple reflection paths]; and, b) the one or more processing devices are configured to perform augmentation by: i) replacing the spectrogram for a reference echo signal from a selected audio sensor with silence; and, ii) applying a gradient to a corresponding reference depth image to fade the image from a center towards the selected audio sensor [[pg. 5, col. 1] we use a Generative Adversarial Network (GAN) model at the patch level to improve the visual reconstruction quality. We use the following least-squares loss instead of a sigmoid cross-entropy loss in order to avoid vanishing gradients].
Regarding claim 20, Christensen teaches the depth sensing apparatus according to claim 1, wherein the one or more processing devices are configured to: cause the audio output device to emit a series of multiple emitted audio signals [[abstract] system emits short chirps from a speaker and records returning echoes through microphones in an artificial human pinnae pair.]; and, repeatedly update the depth map over the series of multiple emitted audio signals [[pg. 7, col. 2] Our BatVision system with a trained sound-to-vision model can reconstruct depth maps from binaural sound recorded by only two microphones to a remarkable accuracy. It can predict detailed indoor scene depth and obstacles such as walls and furniture.].
Regarding claim 21, Christensen teaches the depth sensing apparatus according to claim 1, wherein the one or more processing devices are configured to implement: a depth autoencoder to learn low-dimensionality representations of depth images; a depth audio encoder to create low-dimensionality representations of the spectrograms; and, a recurrent module to repeatedly update the depth map [[fig. 5] shows the convolutional network reducing the input signals to a lower-dimensionality feature vector as the convolutional layers process them].
Regarding claim 22, Christensen teaches the depth sensing apparatus according to claim 21, wherein at least one of: a) the one or more processing devices are configured to train the depth autoencoder using synthetic reference depth images; b) the one or more processing devices are configured to pre-train the depth audio encoder using a temporal ordering of reference spectrograms derived from reference echo signals as a semi-supervised prior for contrastive learning; c) the one or more processing devices are configured to implement the recurrent module using a gated recurrent unit; and d) inputs to the recurrent module include: i) audio embeddings generated by the audio encoder for a time step; and ii) depth image embeddings generated by the depth autoencoder for the time step [[pg. 3, col. 1-2 bridging] Sound Source Localization. In [9], [10], [11], [12], [13], deep neural network models are trained to localize the source of the sound (e.g. a piano) in images or videos. Remarkable results are obtained in a self-supervised learning framework, demonstrating the potential of learning associations between paired audio-visual data.; [pg. 3, col. 1] during training, we first collect a dataset of time synchronized binaural audio signals and stereo image pairs in an indoor office environment, and then train a neural network model to predict images such as depth maps and grayscale images from audio data alone.].
Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Christensen (2020, IEEE) and Reiter (2017, SPIE) as applied to claim 13 above, and further in view of Li (CN 113936191 A).
Regarding claim 14, Christensen does not explicitly teach and yet Li teaches the depth sensing apparatus according to claim 13, wherein the one or more processing devices are configured to perform pre-processing by applying anisotropic diffusion to reference depth images [[0043][0044][0093][0096] convolutional neural network module … using an anisotropic diffusion method to perform noise reduction processing on the second picture to obtain the anisotropic diffusion result picture corresponding to the second picture].
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to combine the neural network as taught by Christensen with the anisotropic diffusion method as taught by Li so that noise is reduced (Li) [[0096]].
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Christensen (2020, IEEE) and Reiter (2017, SPIE) as applied to claim 1 above, and further in view of Chen (US 20210105578 A1).
Regarding claim 19, Christensen does not explicitly teach and yet Chen teaches the depth sensing apparatus according to claim 1, wherein the one or more processing devices are configured to perform augmentation by applying a random variance to labels used by a discriminator [[0104] labelled data is augmented (or autoencoded, see FIG. 17) to produce an augmented fingerprint. Augmenting the data can include determining statistical moments of the labelled fingerprint such as mean and variance and generating a random realization from a distribution with the determined mean and variance. The random realization is called augmented data. Several such random realizations are obtained from a single determined distribution (see FIGS. 6A, 6B, 7A, 7B, 7C, 9A, 9B, 9C); thus generally an augmented fingerprint includes several fingerprint waveforms].
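For illustration only, the moment-based augmentation cited from Chen (determining the mean and variance of the labelled data and generating random realizations from a distribution with those moments) can be sketched as follows. This is a hypothetical sketch; the normal distribution and the sample fingerprint values are assumptions for the example.

```python
import numpy as np

def augment_from_moments(labelled, n_realizations=5, rng=None):
    """Generate random realizations from a normal distribution whose
    mean and variance are determined from the labelled data (the
    statistical-moment augmentation described in Chen)."""
    rng = rng or np.random.default_rng(42)
    mu, sigma = labelled.mean(), labelled.std()
    return rng.normal(mu, sigma, size=(n_realizations, labelled.size))

fingerprint = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative data
augmented = augment_from_moments(fingerprint)
print(augmented.shape)  # (5, 5): five random realizations
```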
It would have been obvious to a person having ordinary skill in the art prior to the effective filing date of the claimed invention to combine the neural network as taught by Christensen with random variance for fingerprinting as taught by Chen so that the augmented data is based on statistical parameters of the data (Chen) [[abstract]].
Response to Arguments
Applicant’s arguments with respect to claim(s) 1 and 26 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. In summary, Reiter (2017, SPIE) has been applied to the amended limitations.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN D ARMSTRONG whose telephone number is (571)270-7339. The examiner can normally be reached M - F 9am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Isam Alsomiri can be reached at 571-272-6970. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JONATHAN D ARMSTRONG/ Examiner, Art Unit 3645