Last updated: May 29, 2026

Application No. 18/163,618

SYSTEM AND METHOD FOR MATCHING A VISUAL SOURCE WITH A SOUND SIGNAL

Final Rejection §103

Filed

Feb 02, 2023

Priority

Jun 08, 2022 — IN 202241032827 +2 more

Examiner

PASHA, ATHAR N

Art Unit

2657

Tech Center

2600 — Communications

Assignee

Samsung Electronics Co., Ltd.

OA Round

3 (Final)

Interview Optional

Examiner has a relatively high allowance rate (90%) and a +16.4% interview lift. A written response may suffice.

Based on 156 resolved cases, 2023–2026

Examiner Intelligence

PASHA, ATHAR N

Grants 90% — above average

Career Allowance Rate

140 granted / 156 resolved

+27.7% vs TC avg

Strong +16% interview lift

Interview Lift: +16.4% (resolved cases with interview vs. without)

Typical timeline

2y 6m

Avg Prosecution

13 currently pending

Career history

174

Total Applications

across all art units

Statute-Specific Performance

§101: 4.3% (-35.7% vs TC avg)
§103: 89.8% (+49.8% vs TC avg)
§102: 3.0% (-37.0% vs TC avg)
§112: 1.3% (-38.7% vs TC avg)

Based on career data from 156 resolved cases

Office Action

§103

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant's claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. IN202241032827, filed on 6/28/2022.

Response to Arguments
Amendments/Arguments filed on 3/20/26 are entered.
In light of the amendments, the examiner removes the 112 interpretation.
Arguments on pages 6-11 regarding references Kruglick and Jang are moot in light of the amended claims, which are now rejected with new references. Please see the 103 rejections below.
Regarding “contrastive learning” arguments on pages 11-13, the examiner cites new passages from Xu that show contrastive learning is applied. Please see the 103 rejections below. The examiner therefore rejects the arguments. 
This Office Action maintains the 103 rejection.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Eronen (US 20220225049 A1) in further view of Leppanen (US 20190139312 A1) and Xu (US 20240005941 A1).
With respect to claim 1, Eronen teaches a method for matching a visual source with a respective sound signal, the system comprising: obtaining the visual source indicative of a camera-preview ([0006] receive video imagery captured by a camera of the capture device, the video imagery having a field of view [camera-preview], wherein the extent of the space from which the spatial audio data is captured is greater than the field of view);
receiving a real-world sound input corresponding to the visual source, the real-world sound input including a combination of one or more sound signals originating from a plurality of sound-generating objects (Eronen_¶_[0007] associate each of the one or more audio sources, determined from the directional information, with, for audio sources within the field of view);
separating the one or more sound signals of the real-world sound input ([0100] To provide for control of the audio capture property of an audio source selected based on its position, which is shown on the display 111, the apparatus 100 may associate a region/part of the displayed video imagery or out-of-view graphic 200 with each of the one or more audio sources, which themselves may be determined from the directional information. Thus, for audio sources 113, 114 within the field of view 107, the apparatus 100 may associate regions 203, 204 of the video imagery with the audio sources 113,114 or the direction towards the audio source. For third and fourth audio sources 115, 116 outside the field of view 107, the apparatus 100 may associate a part of the out-of-view graphic 200, shown as markers 215 and 216, that corresponds to the direction towards the audio sources 115, 116. The marker 215 therefore represents the location or direction towards the third audio source 115 and the marker 216 represents the location or direction towards the fourth audio source 116.) ; 
detecting one or more sound generating objects included in the visual source ([0100] To provide for control of the audio capture property of an audio source selected based on its position, which is shown on the display 111, the apparatus 100 may associate a region/part of the displayed video imagery or out-of-view graphic 200 with each of the one or more audio sources, which themselves may be determined from the directional information. Thus, for audio sources 113, 114 within the field of view 107, the apparatus 100 may associate regions 203, 204 of the video imagery with the audio sources 113,114 or the direction towards the audio source. For third and fourth audio sources 115, 116 outside the field of view 107, the apparatus 100 may associate a part of the out-of-view graphic 200, shown as markers 215 and 216, that corresponds to the direction towards the audio sources 115, 116. The marker 215 therefore represents the location or direction towards the third audio source 115 and the marker 216 represents the location or direction towards the fourth audio source 116.) ; 
generating an association between each of the sound generating objects and the one or more separated sound signals ([0007] associate each of the one or more audio sources, determined from the directional information, with, for audio sources within the field of view,.); 
matching, in real-time, each of the detected sound generating objects with a respective sound signal from the one or more separated sound signals based on the association ([0007] associate each of the one or more audio sources, determined from the directional information, with, for audio sources within the field of view); 
Eronen does not explicitly disclose however Leppanen teaches measuring a magnitude of each of the respective sound signals mapped to each of the sound generating objects (Leppanen_¶_[0124] The control of the audio properties may be provided by a user input device, such as a hand held or wearable device. Thus, the audio control graphics may provide for feedback of the control of the audio properties associated with one or more of the distinct audio sources. As can be seen in FIG. 6, sliders 606 are provided in the audio source graphic to show the current level of the audio property being controlled. The audio control graphics, in this example, also include a current level meter 607 [measuring a magnitude] to show the current audio output from its associated distinct audio source) 
obtaining a user interface (UI) including one or more controlling markers for controlling the magnitude of each of the respective sound signals generated by each of the detected sound generating objects (Leppanen_¶_[0124] The control of the audio properties may be provided by a user input device, such as a hand held or wearable device. Thus, the audio control graphics may provide for feedback of the control of the audio properties associated with one or more of the distinct audio sources. As can be seen in FIG. 6, sliders 606 are provided in the audio source graphic to show the current level of the audio property being controlled. The audio control graphics, in this example, also include a current level meter 607 [measuring a magnitude] to show the current audio output from its associated distinct audio source);
displaying the visual source including one or more bounding box for marking a box area around each of the sound generating object (Leppanen_¶_[0107] In FIG. 4, the user has drawn a bounding box 400 around the distinct audio sources 203, 204, 205 to select them) and the UI including the one or more controlling markers (Leppanen_¶_[0124] The control of the audio properties may be provided by a user input device, such as a hand held or wearable device. Thus, the audio control graphics may provide for feedback of the control of the audio properties associated with one or more of the distinct audio sources. As can be seen in FIG. 6, sliders 606 are provided in the audio source graphic to show the current level of the audio property being controlled);
receiving a controlling input from a user to vary a position of at least one user interface control of the one or more user interface controls to one of increase or decrease the magnitude of a sound signal of the respective sound generating object associated with the at least one user interface control (Leppanen_¶_[0124] The control of the audio properties may be provided by a user input device, such as a hand held or wearable device. Thus, the audio control graphics may provide for feedback of the control of the audio properties associated with one or more of the distinct audio sources. As can be seen in FIG. 6, sliders 606 are provided in the audio source graphic to show the current level of the audio property being controlled. The audio control graphics, in this example, also include a current level meter 607 to show the current audio output from its associated distinct audio source, ¶ [0125] The audio properties that may be controlled may be selected from one or more of volume, relative volume to one or more other distinct audio sources, bass, treble, pitch, spectrum, modulation characteristics or other audio frequency specific properties, audio dynamic range, spatial audio position, panning position among others.) , and 
increasing or decreasing a magnitude of a sound signal of sound generating object corresponding to the controlling input (Leppanen_¶_[0124] The control of the audio properties may be provided by a user input device, such as a hand held or wearable device. Thus, the audio control graphics may provide for feedback of the control of the audio properties associated with one or more of the distinct audio sources. As can be seen in FIG. 6, sliders 606 are provided in the audio source graphic to show the current level of the audio property being controlled. The audio control graphics, in this example, also include a current level meter 607 to show the current audio output from its associated distinct audio source, ¶[0125] The audio properties that may be controlled may be selected from one or more of volume, relative volume to one or more other distinct audio sources, bass, treble, pitch, spectrum, modulation characteristics or other audio frequency specific properties, audio dynamic range, spatial audio position, panning position among others. ), 
wherein the one or more controlling markers is a sliding knob (Leppanen_¶_[0124] Thus, the audio control graphics may provide for feedback of the control of the audio properties associated with one or more of the distinct audio sources. As can be seen in FIG. 6, sliders 606 are provided in the audio source graphic to show the current level of the audio property being controlled);
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the matching of visual source to sound of Eronen to include the measurement interface of Leppanen for effective individual selection and control of the audio properties of the distinct audio sources ([011], Leppanen).
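For readers mapping the claim language to an implementation, the "measuring a magnitude" and "sliding knob" limitations discussed above can be illustrated with a minimal sketch. It is not taken from Eronen, Leppanen, or the application; the function names (`rms`, `slider_to_gain`, `apply_slider`) and the linear slider-to-gain mapping are assumptions made only for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of one separated signal (a simple magnitude measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def slider_to_gain(position):
    """Map a slider position in [0, 1] to a linear gain in [0, 2].

    Assumed mapping for this sketch: position 0.5 is unity gain.
    """
    clamped = min(max(position, 0.0), 1.0)
    return 2.0 * clamped


def apply_slider(samples, position):
    """Scale one object's separated signal according to its slider position."""
    gain = slider_to_gain(position)
    return [gain * s for s in samples]


# Hypothetical separated signal for one detected sound-generating object.
source = [0.5, -0.5, 0.5, -0.5]
louder = apply_slider(source, 1.0)    # slider moved fully up
quieter = apply_slider(source, 0.25)  # slider moved below the midpoint
```

Moving the slider above the midpoint increases the measured level and moving it below decreases it, which is the behavior the "increase or decrease the magnitude" limitation describes at a high level.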
Eronen and Leppanen do not explicitly disclose however Xu teaches wherein generating the association between each of the sound generating objects and the one or more separated sound signals is based on contrastive learning (Xu_¶_[0067]“Unmix-and-remix” unsupervised learning is integrated into a cue-driven target speaker separation [separated sound] framework, and meanwhile, the self-adaptive capability of the system in the real noisy environment is promoted by utilizing contrastive learning between speaker-related cue representation and separated auditory signal representation) Examiner Note: each speaker is a sound generating object, and 
wherein the contrastive learning is based on a permutation invariant contrastive learning (Xu_¶_[0033] the other loss being a permutation invariant loss L.sub. of a plurality of interfering sound sources, wherein, optimizing the model based on a reconstruction loss between predicted signals of the plurality of interfering sound sources and clean signals in the simulation data set, ¶[0067]finally, the supervised learning capability of simulation data and the unsupervised learning effect of real mixed data are sufficiently utilized, to construct a more efficient semi-supervised learning method under multi-cue constraints. “Unmix-and-remix” unsupervised learning is integrated into a cue-driven target speaker separation framework, and meanwhile, the self-adaptive capability of the system in the real noisy environment is promoted by utilizing contrastive learning between speaker-related cue representation and separated auditory signal representation.);
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify matching visual source to sound of Eronen in view of measurement interface of Leppanen to include contrastive learning of Xu in order to improve sound separation.
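The "permutation invariant contrastive learning" limitation is the point of dispute in the response to arguments, so a generic sketch of the idea may help when comparing it with the cited Xu passages. The code below is neither Xu's method nor the applicant's: it assumes each detected object and each separated signal already has an embedding vector, scores every one-to-one pairing with an InfoNCE-style contrastive loss, and keeps the pairing with the lowest loss. All names, the temperature value, and the brute-force search are illustrative assumptions.

```python
import itertools
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(sim, perm, temperature=0.1):
    """Mean InfoNCE loss when object i is paired with sound perm[i]."""
    total = 0.0
    for i, j in enumerate(perm):
        logits = [s / temperature for s in sim[i]]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denom - logits[j]
    return total / len(perm)


def permutation_invariant_match(object_embs, sound_embs, temperature=0.1):
    """Return the object-to-sound pairing with the lowest contrastive loss.

    Assumes equal numbers of objects and sounds; brute force over all
    permutations, so only suitable for a handful of sources.
    """
    sim = [[cosine(o, s) for s in sound_embs] for o in object_embs]
    candidates = itertools.permutations(range(len(object_embs)))
    best = min(candidates, key=lambda p: info_nce(sim, p, temperature))
    return list(best), info_nce(sim, best, temperature)


# Hypothetical embeddings: object 0 resembles sound 1, object 1 resembles sound 0.
objects = [[1.0, 0.0], [0.0, 1.0]]
sounds = [[0.1, 0.9], [0.9, 0.2]]
pairing, loss = permutation_invariant_match(objects, sounds)
```

The loss does not depend on the order in which the separated signals happen to be listed, which is what "permutation invariant" means here. Whether Xu's permutation invariant reconstruction loss in ¶[0033], read together with the contrastive learning in ¶[0067], teaches the claimed combination is a separate question for the response.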

Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over Eronen and Leppanen in further view of Matono (US 20090154896 A1).
With respect to claim 4, Eronen and Leppanen do not explicitly disclose, however Matono teaches wherein detecting the one or more sound generating objects is based on an object detection technique (Matono¶ [0016] Therefore, the reality sensation is enhanced by, for example, detecting a location of a specific subject from a video signal, extracting a sound of the specific subject from an audio signal and adjusting the extracted sound on the basis of the detected location. Furthermore, the ratio of distribution of voice components to the left and right can be changed on the basis of a result of speaker detection including whether there is a speaker and a location on the screen by, for example, providing speaker detection as object detection);
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify matching visual source to sound of Eronen in view of measurement interface of Leppanen in view of contrastive learning of Xu to include object detection of Matono in order to enhance reality sensation ([0016], Matono);

Claims 5, 6, 18-19 are rejected under 35 U.S.C. 103 as being unpatentable over Eronen and Leppanen in further view of Martin (US 11194330 B1), Cahill (US 20160277863 A1) and Kim (US 20200050842 A1).
With respect to claim 5, Eronen and Leppanen do not explicitly disclose, however Martin teaches generating a spectrogram of the real-world sound input (Martin¶ Col1ll62-Col2ll8 In operation, the system extracts salient patches from an intensity spectrogram of an audio signal. Thereafter, neural-network feature vectors are extracted for all salient patches. The feature vectors are then clustered, with each cluster becoming a key attribute. The process of extracting salient patches and extracting the feature vectors for the salient patches can be repeated for many audio signals in the training data; whereas the clustering is performed on the features for the whole training data set. A test audio signal can then be mapped onto a histogram of key attributes. Based on the histogram, the test audio signal can then be classified as a sound class, allowing for operation of a device based on the classification of the sound class);
identifying at least one class for each of the sound signals in the spectrogram, wherein the at least one class is indicative of a type of a sound generating object (Martin¶ Col1ll62-Col2ll8 In operation, the system extracts salient patches from an intensity spectrogram of an audio signal. Thereafter, neural-network feature vectors are extracted for all salient patches. The feature vectors are then clustered, with each cluster becoming a key attribute. The process of extracting salient patches and extracting the feature vectors for the salient patches can be repeated for many audio signals in the training data; whereas the clustering is performed on the features for the whole training data set. A test audio signal can then be mapped onto a histogram of key attributes. Based on the histogram, the test audio signal can then be classified as a sound class, allowing for operation of a device based on the classification of the sound class);
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify matching visual source to sound of Eronen in view of measurement interface of Leppanen to include spectrogram of Martin in order to improve the recognition performance by learning salient sound attributes (Col6ll34-36, Martin);
Eronen, Leppanen and Martin do not explicitly disclose, however Cahill teaches generating a heat map for the at least one class (Cahill¶ [0056] FIG. 5C depicts one such example image frame output by the event classification module 408, in accordance with an embodiment of the present disclosure. As shown, the resulting image is depicted with two events (e.g., Event 1 and Event 2 of FIG. 5B) and event labels (Solenoid and Piston). In an embodiment, images such as the example image depicted in FIG. 5C can be rendered and presented on a display of an electronic device (e.g., a smart phone, laptop, or other device with a display). In this embodiment, the electronic device may present a plurality of these images in an augmented reality mode whereby the display presents real-time images of the observed scene with an overlay depicting an acoustic heat map and/or the metadata for the event.);
and determining the separated sound signals based on the heat map (Cahill¶[0056] FIG. 5C depicts one such example image frame output by the event classification module 408, in accordance with an embodiment of the present disclosure. As shown, the resulting image is depicted with two events (e.g., Event 1 and Event 2 of FIG. 5B) and event labels (Solenoid and Piston). In an embodiment, images such as the example image depicted in FIG. 5C can be rendered and presented on a display of an electronic device (e.g., a smart phone, laptop, or other device with a display). In this embodiment, the electronic device may present a plurality of these images in an augmented reality mode whereby the display presents real-time images of the observed scene with an overlay depicting an acoustic heat map and/or the metadata for the event) 
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify matching visual source to sound of Eronen in view of measurement interface of Leppanen in view of spectrogram of Martin to include heat map of Cahill in order to enable accurate scene analysis ([0012], Cahill);
None of Eronen, Leppanen, Martin and Cahill explicitly disclose, however Kim teaches wherein the heat map is generated using a Class Activation Mapping (CAM) technique (Kim¶ [0195] Here, the processor 180 may extract a heatmap corresponding to the recognition information from the image data based on Class Activation Mapping (CAM));
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify matching visual source to sound of Eronen in view of measurement interface of Leppanen in view of spectrogram of Martin in view of heat map of Cahill to include CAM of Kim in order to increase the recognition confidence level ([0005], Kim);
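As background for the CAM limitation in claim 5, the standard Class Activation Mapping computation forms a heat map as the class-weighted sum of the final convolutional feature maps. The sketch below shows only that generic computation on toy data; it does not reproduce Kim, Cahill, or the application. The clamping of negative values and the scaling to [0, 1] are display conventions assumed here, and all names are hypothetical.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) using one class's weights.

    feature_maps: list of K two-dimensional lists of floats.
    class_weights: list of K floats for the class of interest.
    Returns an H x W heat map scaled to [0, 1] (assumed display convention).
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for r in range(height):
            for c in range(width):
                cam[r][c] += weight * fmap[r][c]
    # Keep only positive evidence for the class, then normalise by the peak.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: two 2x2 feature maps and hypothetical weights for one class.
maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights = [1.0, 0.5]
heat = class_activation_map(maps, weights)
```

In the claimed arrangement the heat map is computed per sound class over a spectrogram, whereas Kim's cited passage extracts a heat map from image data; that difference in input domain may be worth examining when evaluating the combination.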
With respect to claim 6, Martin further teaches splitting the spectrogram into a plurality of horizontal patches based on a plurality of predefined classes, using a neural network trainable for splitting the spectrogram based on the plurality of predefined classes (Martin¶Col1ll62Col2ll8 In operation, the system extracts salient patches from an intensity spectrogram of an audio signal. Thereafter, neural-network feature vectors are extracted for all salient patches. The feature vectors are then clustered [feature vectors concatenated], with each cluster becoming a key attribute. The process of extracting salient patches and extracting the feature vectors for the salient patches can be repeated for many audio signals in the training data; whereas the clustering is performed on the features for the whole training data set. A test audio signal can then be mapped onto a histogram of key attributes. Based on the histogram, the test audio signal can then be classified as a sound class, allowing for operation of a device based on the classification of the sound class) Examiner Note: Martin in Figure 7 shows several horizontal patches for different sounds;
extracting a plurality of features from the plurality of horizontal patches (Martin¶ Col1ll64-65 Thereafter, neural-network feature vectors are extracted for all salient patches);
concatenating the plurality of features (Martin¶ Col1ll62Col2ll8 In operation, the system extracts salient patches from an intensity spectrogram of an audio signal. Thereafter, neural-network feature vectors are extracted for all salient patches. The feature vectors are then clustered [feature vectors concatenated], with each cluster becoming a key attribute. The process of extracting salient patches and extracting the feature vectors for the salient patches can be repeated for many audio signals in the training data; whereas the clustering is performed on the features for the whole training data set. A test audio signal can then be mapped onto a histogram of key attributes. Based on the histogram, the test audio signal can then be classified as a sound class, allowing for operation of a device based on the classification of the sound class);
determining weights from the concatenated plurality of features (Col1ll62Col2ll8 In operation, the system extracts salient patches from an intensity spectrogram of an audio signal. Thereafter, neural-network feature vectors are extracted for all salient patches. The feature vectors are then clustered [feature vectors concatenated], with each cluster becoming a key attribute. The process of extracting salient patches and extracting the feature vectors for the salient patches can be repeated for many audio signals in the training data...Based on the histogram, the test audio signal can then be classified as a sound class, allowing for operation of a device based on the classification of the sound class, Col6ll59-65: The system starts with the trained CNN [classes based on determined weights] and learns sound attributes that are encoded in distributed activation patterns of the network.)
and identifying the separated sound signals corresponding to the predefined classes based on the determined weights (Col1ll62Col2ll8 In operation, the system extracts salient patches from an intensity spectrogram of an audio signal. Thereafter, neural-network feature vectors are extracted for all salient patches. The feature vectors are then clustered [feature vectors concatenated], with each cluster becoming a key attribute. The process of extracting salient patches and extracting the feature vectors for the salient patches can be repeated for many audio signals in the training data...Based on the histogram, the test audio signal can then be classified as a sound class, allowing for operation of a device based on the classification of the sound class, Col6ll59-65: The system starts with the trained CNN [classes based on determined weights] and learns sound attributes that are encoded in distributed activation patterns of the network.)

With respect to claim 18 Martin further teaches wherein the neural network is trained in a supervised learning technique with a labeled training set (Martin¶ Col6ll37-50 For example, if one of the classes of sounds is “children playing”, this class may be correlated with the sound attribute “bird song”; thus, learning how to recognize birdsong can help to identify children playing even though birdsong was not explicitly labelled in a training data set. The system operates through a four-phase process, which allows for reliable classification of audio signals based on their attributes. In the first phase, the salient attributes of the input are extracted based on the activation patterns of a deep convolutional neural network (CNN));

With respect to claim 19 Martin further teaches wherein the labeled training set is indicative of the predefined classes. (Martin¶ Col6ll37-50 For example, if one of the classes of sounds is “children playing”, this class may be correlated with the sound attribute “bird song”; thus, learning how to recognize birdsong can help to identify children playing even though birdsong was not explicitly labelled in a training data set. The system operates through a four-phase process, which allows for reliable classification of audio signals based on their attributes. In the first phase, the salient attributes of the input are extracted based on the activation patterns of a deep convolutional neural network (CNN));


Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408)918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached on (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ATHAR N PASHA/ Primary Examiner, Art Unit 2657

Prosecution Timeline

Feb 02, 2023

Application Filed

Jan 15, 2025

Non-Final Rejection mailed — §103

Apr 08, 2025

Response Filed

Jan 02, 2026

Non-Final Rejection mailed — §103

Mar 20, 2026

Response Filed

Apr 13, 2026

Final Rejection mailed — §103 (current)

Precedent Cases

Applications granted by this same examiner with similar technology

18/573,622

Patent 12639516

CLASSIFICATION AND AUGMENTATION OF UNSTRUCTURED DATA FOR AUTOFILL

2y 5m to grant Granted May 26, 2026

17/749,578

Patent 12632445

Applied Artificial Intelligence Technology for Natural Language Generation Using a Story Graph and Configurable Structurer Code

4y 0m to grant Granted May 19, 2026

18/264,595

Patent 12614040

SIMULTANEOUS TRANSLATION DEVICE AND COMPUTER PROGRAM

2y 8m to grant Granted Apr 28, 2026

18/747,081

Patent 12608556

INTENTION RECOGNITION METHOD, DEVICE, ELECTRONIC DEVICE AND STORAGE MEDIUM BASED ON LARGE MODEL

1y 10m to grant Granted Apr 21, 2026

18/747,499

Patent 12608557

CHINESE DIALOGUE SYSTEM FOR COGNITIVELY IMPAIRED ADULTS BASED ON COGNITIVE STIMULATION THERAPY PRINCIPLES

1y 10m to grant Granted Apr 21, 2026

Study what changed to get past this examiner. Based on 5 most recent grants.

Prosecution Projections

4-5

Expected OA Rounds

90%

Grant Probability

99%

With Interview (+16.4%)

2y 6m (~0m remaining)

Median Time to Grant

High

PTA Risk

Based on 156 resolved cases by this examiner. Grant probability derived from career allowance rate.