Notice of Pre-AIA or AIA Status
1. The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
2. Applicant's arguments filed 1/21/2026 have been fully considered but they are not persuasive. Applicant argues that Zhang’s disclosed image enhancement, such as sharpening or edge detection, is not analogous to applying a first coloring and applying a second coloring, and further does not identifiably represent first and second audiovisual contents. This argument is not persuasive. Claim 1 does not require that the first and second colorings be different manual color maps. Rather, claim 1 requires that a visual treatment be applied to the first and second spectrograms such that the resulting visual representations correspond to the respective first and second audiovisual contents. Zhang expressly discloses determining a first spectrogram for a to-be-evaluated signal and a second spectrogram for a reference signal, then performing image enhancement on each of the first and second spectrograms to obtain a first enhanced image corresponding to the first spectrogram and a second enhanced image corresponding to the second spectrogram. Thus, the first enhanced image remains the image representation of the first signal and the second enhanced image remains the image representation of the second signal. The fact that the same type of enhancement may be applied to both does not negate that the resulting first and second processed images respectively correspond to and identifiably represent the first and second respective contents. These enhancements constitute a “coloring” as recited in the claim; they are not merely mathematical operations performed on the audio signal, but visual changes made to the image of the signal itself. One example is the Laplacian matrix filter applied in ¶[0093], which sharpens an image to improve its clarity and detail. Additionally, Salamon discloses first and second audio representations, which may be spectrograms, stacked together and used by a machine learning model to determine whether the audio-visual spatial relationship is misaligned. Accordingly, the combined teachings of Salamon and Zhang continue to support the rejection.
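For illustration only, and not attributed to either reference, the following Python sketch shows a Laplacian-type sharpening filter applied directly to a spectrogram image, i.e., a visual operation performed on the image of the signal rather than a mathematical operation performed on the audio samples. The array names, shapes, and kernel values are hypothetical.

```python
# Illustrative sketch (hypothetical names and values, not taken from Zhang or Salamon):
# image-space sharpening applied to each spectrogram image separately.
import numpy as np
from scipy.ndimage import convolve

def laplacian_sharpen(spectrogram_image: np.ndarray) -> np.ndarray:
    """Sharpen a 2-D spectrogram image with a Laplacian-based kernel."""
    kernel = np.array([[ 0, -1,  0],
                       [-1,  5, -1],
                       [ 0, -1,  0]], dtype=float)
    return convolve(spectrogram_image, kernel, mode="nearest")

# Hypothetical spectrogram images for a to-be-evaluated signal and a reference signal.
first_spec = np.random.rand(128, 256)
second_spec = np.random.rand(128, 256)

# Each enhanced image still corresponds to its own underlying signal,
# even though the same type of enhancement is applied to both.
first_enhanced = laplacian_sharpen(first_spec)
second_enhanced = laplacian_sharpen(second_spec)
```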
Claim Rejections - 35 USC § 103
3. In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
4. Claims 1-5, 7, 10-15, 17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Salamon (US 11,308,329) in view of Zhang (US 2023/0386503).
Regarding Claim 1:
Salamon discloses a method for comparing a first audiovisual content with a second audiovisual content, the method comprising:
obtaining a first spectrogram representing the first audiovisual content (Salamon: 4 Col 2 lines 5-52 expressly teaches that the first audio channel is represented by a first audio representation and expressly teaches that such a representation may be a spectrogram or a Mel-spectrogram);
obtaining a second spectrogram representing the second audiovisual content, wherein the second audiovisual content is different than the first audiovisual content (Salamon: 4 Col 2 lines 5-52 explicitly discloses a second audio channel and teaches a second representation corresponding to this channel. Since Salamon teaches that each such representation may be a spectrogram or a Mel-spectrogram, Salamon teaches obtaining a second spectrogram representing a second audiovisual content that is different from the first audiovisual content);
generating a combined spectrogram by superimposing one of the first spectrogram or the second spectrogram over the other (Salamon: Col 2 lines 5-52 discloses stacked first and second spectrogram representations. These are combined into a single joint representation presented to the audio subnetwork. That stacked joint representation teaches the combined spectrogram representation),
and determining, using a machine learning model, whether the first audiovisual content is misaligned with respect to the second audiovisual content based on whether the first coloring or the second coloring in the combined spectrogram exceeds a threshold size (Salamon: Col 3 lines 13-41 discloses that “The computer is taught by training it to classify a representation of an audio-visual clip based on whether the clip's audio-visual spatial relationship has been misaligned”, e.g., it uses a machine learning model).
Salamon does not explicitly disclose the limitations addressed below. However, Zhang discloses
applying a first coloring to the first spectrogram such that the first coloring identifiably represents the first audiovisual content (Zhang: ¶[0011] explicitly discloses that the first signal is first converted into a first spectrogram and then visually processed in image space to obtain a first enhanced image that still corresponds to that first spectrogram and first signal. Therefore, Zhang teaches applying a visual treatment to the first spectrogram so that the resulting processed visual representation corresponds to, and identifiably represents, the first content);
applying a second coloring to the second spectrogram such that the second coloring identifiably represents the second audiovisual content (Zhang: ¶[0011] the same reasoning applies to the second spectrogram. Zhang explicitly discloses that the second signal is separately converted into a second spectrogram and separately processed into a second enhanced image corresponding to the second spectrogram. Accordingly, Zhang teaches applying a visual processing treatment to the second spectrogram so that the resulting processed image corresponds to, and identifiably represents, the second underlying content);
wherein generating the combined spectrogram results in a third coloring corresponding to a combination of the first coloring and the second coloring (Zhang: ¶[0120]-[0121], ¶[0126] teaches that the first and second processed spectrogram images are jointly compared in image space to produce a combined comparison result that reflects the relationship between the two respective processed images. When read together with Salamon’s explicit teaching of stacking and combining the first and second spectrogram representations into one combined input, the combination yields the claimed result: a combined spectrogram representation whose resulting visual comparative state reflects the combination of the first processed visual treatment and the second processed visual treatment. In other words, Salamon provides the combined first-plus-second spectrogram and superimposing framework, while Zhang provides the teaching that each of those first and second spectrograms is visually processed and then comparatively evaluated together. Thus, the resulting combined comparative visual output corresponds to the joint contribution of the first and second processed spectrogram treatments, i.e., the claimed third coloring corresponding to their combination).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Salamon’s spectrogram-based audiovisual misalignment system with Zhang’s spectrogram image enhancement and comparative image similarity techniques. Salamon already teaches first and second audio representations, which may be spectrograms, stacked together for machine-learning-based determination of whether the audiovisual spatial relationship is misaligned. Zhang teaches that first and second spectrograms corresponding to first and second respective signals may be visually processed into corresponding first and second enhanced images and then comparatively evaluated in image space. A person of ordinary skill in the art would be motivated to apply Zhang’s visual enhancement treatment to Salamon’s first and second spectrogram representations so that the respective first and second spectrograms are visually emphasized prior to being jointly evaluated in Salamon’s combined misalignment framework. Doing so would predictably improve comparative analysis by making the respective signal-derived spectrogram features more visually and computationally distinguishable before the final comparison and classification stage. This is stated in ¶[0003] of Zhang: “it is difficult to evaluate a signal having a wide frequency band. In a word, it is found that the conventional technology at least has the problems of slow evaluation of voice quality and high limitation on the sampling rate.” In other words, wide frequency bands may be hard to compare and benefit from enhancement that makes the signals more clearly visible.
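As a minimal sketch of how the proposed combination could operate (an assumption for illustration; neither reference's actual implementation), each spectrogram is first visually enhanced in image space per Zhang, and the two enhanced spectrograms are then stacked into a single combined input per Salamon before the misalignment classification stage. The names, shapes, and placeholder enhancement below are hypothetical.

```python
# Illustrative sketch of the proposed Salamon/Zhang combination (hypothetical names).
import numpy as np

def enhance(spec_image: np.ndarray) -> np.ndarray:
    # Placeholder for an image-space enhancement (e.g., sharpening) per Zhang.
    return np.clip(spec_image * 1.2, 0.0, 1.0)

def combined_input(first_spec: np.ndarray, second_spec: np.ndarray) -> np.ndarray:
    # Stack the two enhanced spectrograms along a channel axis, forming one
    # joint representation for the classifier, per Salamon's stacked input.
    return np.stack([enhance(first_spec), enhance(second_spec)], axis=0)

joint = combined_input(np.random.rand(128, 256), np.random.rand(128, 256))
# `joint` (shape 2 x 128 x 256) would then be passed to the misalignment
# classifier; the classifier itself is outside the scope of this sketch.
```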
Regarding Claim 2:
Salamon and Zhang further disclose the method of claim 1, wherein the second spectrogram is obtained based on a homography that maps the first spectrogram to a segment of a spectrogram representing the second audiovisual content (Salamon: Fig. 2 the first and second audio are mapped to each other as they go through the network layers in order to create a merged audio source).
Regarding Claim 3:
Salamon and Zhang further disclose the method of claim 1, wherein obtaining the second spectrogram comprises aligning the first spectrogram to obtain an approximate match between the first spectrogram and a segment of a spectrogram representing the second audiovisual content (Salamon: Fig. 2 the first and second audio are mapped to each other and create an approximated match which is then merged with visual data).
Regarding Claim 4:
Salamon and Zhang further disclose the method of claim 1, wherein: the first audiovisual content comprises a reference end page and the second audiovisual content comprises a promo end page that includes a sequence of video frames that indicate at least when or on what station a promoted show will be broadcast (Salamon: Fig. 2, the first and second audio are mapped to each other and create an approximate match (155), which is then merged with visual data).
Regarding Claim 5:
Salamon and Zhang further disclose the method of claim 4, further comprising: comparing video content of the first audiovisual content with video content of the second audiovisual content (Salamon: Fig. 2 wherein as the stacked audio spectrograms pass through neural network 115 each layer compares and fuses the two).
Regarding Claim 7:
Salamon and Zhang further disclose the method of claim 1, wherein determining whether the first audiovisual content is misaligned with respect to the second audiovisual content comprises at least one of: identifying a misalignment at a beginning of the combined spectrogram with respect to time;
identifying a misalignment at or around a middle of the combined spectrogram with respect to time;
identifying a misalignment at an end of the combined spectrogram with respect to time;
identifying a complete misalignment across the combined spectrogram with respect to time;
or identifying a plurality of scattered misalignments across the combined spectrogram with respect to time (Salamon: 10-12 Col 3 line 66 – Col 5 line 18 is able to identify a misalignment during pre-processing. This misalignment may be identified in raw waveforms, which are in the time domain; it may also be identified using a spectrogram, which, although in the frequency domain, represents those frequencies as they vary with time.
Further, Salamon finds misalignments between the merged audio data in Fig. 2, step 125).
Regarding Claim 10:
Salamon and Zhang further disclose the method of claim 1, wherein determining whether the first audiovisual content is misaligned with respect to the second audiovisual content comprises: identifying a misalignment between the first audiovisual content and the second audiovisual content (Salamon: 10 Col 3 line 66-Col 4 line 5 identifies if the audio has been misaligned);
and recording a corresponding time range of the misalignment (Salamon: Col 7 lines 23-50 disclose that the model may detect alignment at a specific time interval).
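For illustration only (not Salamon's actual processing), a misalignment between two time-aligned spectrograms can be located with respect to time and its time range recorded by comparing the spectrograms frame by frame; the frame hop, tolerance, and function names below are hypothetical.

```python
# Illustrative sketch (hypothetical parameters): locate and record time ranges
# where two spectrograms differ by more than a tolerance.
import numpy as np

def misaligned_time_ranges(spec_a: np.ndarray, spec_b: np.ndarray,
                           frame_hop_s: float = 0.01, tol: float = 0.5):
    """Return (start_s, end_s) ranges where the per-frame difference is large."""
    per_frame_diff = np.abs(spec_a - spec_b).mean(axis=0)  # average over frequency bins
    flagged = per_frame_diff > tol
    ranges, start = [], None
    for i, is_bad in enumerate(flagged):
        if is_bad and start is None:
            start = i
        elif not is_bad and start is not None:
            ranges.append((start * frame_hop_s, i * frame_hop_s))
            start = None
    if start is not None:
        ranges.append((start * frame_hop_s, len(flagged) * frame_hop_s))
    return ranges
```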
Regarding Claim 11:
The combination explained in the rejection of Claim 1 renders obvious the computer readable medium of Claim 11 because these steps occur in the operation of the proposed combination as discussed above.
It is noted that Salamon discloses a non-transitory machine-readable medium (Salamon: Figs. 1, 5 and 6).
Regarding Claim 12:
The combination explained in the rejection of Claim 2 renders obvious the computer readable medium of Claim 12 because these steps occur in the operation of the proposed combination as discussed above.
Regarding Claim 13:
The combination explained in the rejection of Claim 3 renders obvious the computer readable medium of Claim 13 because these steps occur in the operation of the proposed combination as discussed above.
Regarding Claim 14:
The combination explained in the rejection of Claim 4 renders obvious the computer readable medium of Claim 14 because these steps occur in the operation of the proposed combination as discussed above.
Regarding Claim 15:
The combination explained in the rejection of Claim 5 renders obvious the computer readable medium of Claim 15 because these steps occur in the operation of the proposed combination as discussed above.
Regarding Claim 17:
The combination explained in the rejection of Claim 7 renders obvious the computer readable medium of Claim 17 because these steps occur in the operation of the proposed combination as discussed above.
Regarding Claim 19:
The combination explained in the rejection of Claim 10 renders obvious the computer readable medium of Claim 19 because these steps occur in the operation of the proposed combination as discussed above.
Regarding Claim 20:
The combination explained in the rejection of Claim 1 renders obvious the apparatus of Claim 20 because these steps occur in the operation of the proposed combination as discussed above.
It is noted that Salamon discloses an apparatus comprising a network communication unit configured to transmit and receive data (Salamon: Figs. 1, 5 and 6).
5. Claims 6 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Salamon in view of Zhang and further in view of Knospe, “Privacy-enhanced Perceptual Hashing of Audio Data”.
Regarding Claim 6:
The proposed combination of Salamon and Zhang further discloses the method of claim 5, wherein comparing the video content of the first audiovisual content with the video content of the second audiovisual content (Salamon: Fig. 2, wherein as the stacked audio spectrograms pass through neural network 115 each layer compares and fuses the two).
Salamon and Zhang do not specifically disclose that the comparison is based on perceptual hashing.
However, Knospe discloses the comparison being based on perceptual hashing (Knospe: Introduction, paragraph 4, “Perceptual hashing can be used to identify similar copies, e.g., replayed spam calls”; i.e., Knospe discloses using perceptual hashing to compare audio items to discover similarities).
Salamon in view of Knospe are combinable because they are from the same field of endeavor, audio processing; e.g., both disclose methods for receiving and processing audio to create useful output. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to perform the comparison using perceptual hashing, as taught by Knospe, because it is useful for comparing two pieces of similar audio. Therefore, it would have been obvious to incorporate perceptual hashing as disclosed by Knospe into Salamon.
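As a non-limiting illustration (not Knospe's actual algorithm), a perceptual-hash style comparison can be sketched as deriving a coarse binary fingerprint from each audio signal and comparing the fingerprints by Hamming distance; the frame length, fingerprint rule, and names below are hypothetical.

```python
# Illustrative perceptual-hash style comparison (hypothetical scheme and names).
import numpy as np

def perceptual_hash(signal: np.ndarray, frame_len: int = 1024) -> np.ndarray:
    """Coarse fingerprint: bit i is 1 when frame i is more energetic than frame i+1."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).sum(axis=1)
    return (energies[:-1] > energies[1:]).astype(np.uint8)

def hamming_distance(h1: np.ndarray, h2: np.ndarray) -> int:
    n = min(len(h1), len(h2))
    return int(np.sum(h1[:n] != h2[:n]))

a = np.random.randn(48000)               # hypothetical first audio signal
b = a + 0.01 * np.random.randn(48000)    # a perceptually similar copy
print(hamming_distance(perceptual_hash(a), perceptual_hash(b)))  # small distance
```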
Regarding Claim 16:
The combination explained in the rejection of Claim 6 renders obvious the computer readable medium of Claim 16 because these steps occur in the operation of the proposed combination as discussed above.
Conclusion
THIS ACTION IS MADE FINAL. Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any nonprovisional extension fee (37 CFR 1.17(a)) pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to IAN SCOTT MCLEAN whose telephone number is (703)756-4599. The examiner can normally be reached "Monday - Friday 8:00-5:00 EST, off Every 2nd Friday".
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Hai Phan can be reached at (571) 272-6338. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/IAN SCOTT MCLEAN/Examiner, Art Unit 2654
/HAI PHAN/Supervisory Patent Examiner, Art Unit 2654