Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
Arguments and amendments filed 8/4/2025 have been examined.
Claims 1, 4, 5, 7-11, 13, 16, 17, 19-23, 25, 27, 28 have been amended;
Claims 2, 3, 14, 15, 26 have been canceled.
Thus, Claims 1, 4-13, 16-25, 27 and 28 are currently pending.
This Office Action is Final.
Claim Objections
Claims 1, 13 and 25 recite:
“such that the synthetic audio response and synchronized video animation reflect”.
In claims 1, 13, and 25, the recitation of "such that" in various lines constitutes only an intended use and does not carry patentable weight, since the recited result never has to occur. The claims should be amended to recite firmer, positive language (e.g., "improving" or "wherein").
Appropriate correction is required.
Response to Arguments
Applicant’s arguments with respect to the claims and the previous rejection under 35 USC 103 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
Applicant’s arguments with respect to the previous rejection under 35 USC 101, in view of the recently amended claims, have been fully considered and are persuasive. The rejection under 35 USC 101 has been withdrawn.
Applicant’s arguments with respect to the previous rejection under 35 USC 112 have been fully considered and are persuasive. The previous rejection under 35 USC 112 has been withdrawn, as Applicant has amended said claims to remove/clarify the term "life story" from claims 1, 13, and 25. Please note the additional issues under 35 USC 112 below.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.
Claims 1, 4, 7, 9, 11, 13, 16, 19, 21, 23, 25 and 27-28 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Claims 1, 13 and 25 recite:
“such that the synthetic audio response and synchronized video animation reflect a voice and visual mannerisms of the target person and wherein the response generated is contextually and semantically related to the target person's historical experience as stored in the biographical data files;”
The Examiner searched the specification for the above limitations and could find no support for “visual mannerisms of the target person,” as the specification only references “audio features such as, for example, mannerism, style and tone of speech” (see specification para. [0065]).
Additionally, nowhere does the specification support the limitation “wherein the response generated is contextually and semantically related to the target person's historical experience as stored in the biographical data files”;
the Examiner searched the specification for the amended limitations; however, the specification only recites “making it robust for a semantic search based query response pipeline” (see para. [0078] “In some embodiments, an embedding refers to a numerical representation of a piece of information, for example, text, documents, images, audio, etc. The representation captures the semantic meaning of what is being embedded, making it robust for a semantic search based query response pipeline of the present specification”); consequently, there is simply no recitation/support for the limitation “wherein the response generated is contextually and semantically related to the target person's historical experience as stored in the biographical data files”.
Claim 4 and claim 16 recite: “wherein the semantically indexed biographical data files comprise one or more natural language text transcriptions of audio portions of at least one audio/visual video data generated by the target person”; the Examiner searched the specification for the amended limitations, however, the specification only recites: “and textual responses to questions presented during the automated Q&A session) are indexed and stored separately from those portions of the one or more first vector data structures 132a, second vector data structures 132b, and fourth vector data structures 132d that are not in the form of prompt-response pairs” (see specification para. [0090]); consequently, there is simply no recitation/support for the claimed limitation of “wherein the semantically indexed biographical data files comprise one or more natural language text transcriptions of audio portions of at least one audio/visual video data generated by the target person”.
Claim 7 and claim 19 recite: “wherein the semantically indexed biographical data files additionally comprise one or more natural language text generated by the target person”; the Examiner searched the specification for the amended limitations, however, the specification only recites: “and textual responses to questions presented during the automated Q&A session) are indexed and stored separately from those portions of the one or more first vector data structures 132a, second vector data structures 132b, and fourth vector data structures 132d that are not in the form of prompt-response pairs” (see specification para. [0090]); consequently, there is simply no recitation/support for the claimed limitation of “wherein the semantically indexed biographical data files additionally comprise one or more natural language text generated by the target person”.
Claim 9 and claim 21 recite: “wherein the one or more first vector data structures are generated as a result of a word-embedding operation performed by the embedding engine on the semantically indexed biographical data files.”; the Examiner searched the specification for the amended limitations, however, the specification only recites: “and wherein the one or more first vector data structures are generated as a result of a word-embedding operation performed by the embedding engine on the at least one text file” (see specification para. [0027]); consequently, there is simply no recitation/support for the claimed limitation of “wherein the one or more first vector data structures are generated as a result of a word-embedding operation performed by the embedding engine on the semantically indexed biographical data files.”
Claim 11 and claim 23 recite:
“artificial neural network is trained using the audio portions of said at least one audio/visual video data along with the corresponding semantically indexed biographical data files.”; the Examiner searched the specification for the amended limitations, however the specification only recites: “Optionally, the first artificial neural network is trained using the one or more first vector data structures. Optionally, the second artificial neural network is trained using the audio portions of said at least one audio/visual video data along with the corresponding at least one text file.” (see specification para. [0015]); consequently, there is simply no recitation/support for the claimed limitation of “artificial neural network is trained using the audio portions of said at least one audio/visual video data along with the corresponding semantically indexed biographical data files.”
Claim 27 recites:
“wherein the second artificial neural network is trained using audio recordings of the target person along with the corresponding textual biographical content of the target person”; the Examiner searched the specification for the amended limitations, however the specification only recites: “[0025] Optionally, the second artificial neural network is trained using the audio portions of said at least one audio/visual video data along with the corresponding at least one text file.” (see specification para. [0025]); consequently, there is simply no recitation/support for the claimed limitation of “wherein the second artificial neural network is trained using audio recordings of the target person along with the corresponding textual biographical content of the target person”.
Claim 28 recites:
“wherein the third artificial neural network is trained using video recordings of the target person along with the corresponding audio data of the semantically indexed biographical data files of the target person”; the Examiner searched the specification for the amended limitations, however the specification only recites: “Optionally, the third artificial neural network is trained using visual portions of the at least one audio/visual video data along with the corresponding audio portions of the at least one audio/visual video data.” (see specification para. [0015]); consequently, there is simply no recitation/support for the claimed limitation of “wherein the third artificial neural network is trained using video recordings of the target person along with the corresponding audio data of the semantically indexed biographical data files of the target person”.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The following is a quotation of pre-AIA 35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is invoked.
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph:
(A) the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B) the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and
(C) the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function.
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function.
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function.
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.
Such claim limitation(s) is/are the limitations in claim 13 using the generic placeholder “programmatic instructions,” as used in the following limitations:
“programmatic instructions, stored in said computer readable non-transitory medium, for
generating, by a search engine executing on the server, at least one text result by performing a semantic similarity search between the query vector data structure and one or more vector data structures derived from the biographical data files associated with the target person;
programmatic instructions, stored in said computer readable non-transitory medium, for searching, using a contextual search engine configured to perform vector similarity matching using cosine similarity functions, a structured database storing the biographical data files of the target person, to identify at least one semantically relevant text result;
programmatic instructions, stored in said computer readable non-transitory medium, for providing as input, to a third artificial neural network, the synthetic audio response in order to generate a video animation of the avatar, wherein the video animation corresponds to the avatar uttering the synthetic audio response and wherein the first artificial neural network is trained using textual biographical content of the target person, the second artificial neural network is trained using audio recordings of the target person, and the third artificial neural network is trained using video recordings of the target person; and
programmatic instructions, stored in said computer readable non-transitory medium, for rendering, on the user's computing device, the synthetic audio response in synchronization with the video animation of the avatar such that the synthetic audio response and synchronized video animation reflect a voice and visual mannerisms of the target person and wherein the response generated is contextually and semantically related to the target person's historical experience as stored in the biographical data files”
(all recited in claim 13).
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, applicant may: (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis (i.e., changing from AIA to pre-AIA ) for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1, 4-13, 16-25, 27 and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Howard et al., US Pub. No. 2021/0232632 A1, in view of Zhao et al., US Pub. No. 2021/0248376 A1, in view of Wu et al., US Pub. No. 2021/0280190 A1, in view of Ramesh et al., US Pub. No. 2024/0195940, in view of Maloney et al., US Pub. No. 2014/0320504 A1, in view of Obukhov et al., US Pub. No. 2024/0177386 A1.
As to claim 1 (and substantially similar claim 13),
Howard discloses a computer-implemented method
(Howard [0023-0025])
executed by a distributed computing system including at least one server and a user device communicatively coupled over a network for
(Howard Fig. 1 items 100 and 120 “user experience device” and “virtual experience service”; see also [0068] When implemented using a server computer, any of a variety of servers may be
used including, but not limited to, application servers, database servers, mail servers, rack servers, blade servers, tower servers, virtualized servers, or any other type of server, variation of server, or combination thereof. A system that may be used in some environments to implement a virtual experience service 120 is described in FIG. 7. Further, it should be noted that aspects of the virtual experience service 120 may be implemented on more than one device. In some cases, virtual experience service 120 may include components located on user devices, user experience devices, and/or on one or more services implemented on separate
physical devices.; see also [0242] System 1000 can itself include one or more computing systems or devices or be distributed across multiple computing devices or sub-systems that cooperate in executing program instructions)
generating an avatar representative of a target person, wherein the avatar is configured to virtually embody audio, visual and behavioral characteristics of the target person and respond to a user's query
(Howard teaches generating videos/avatars of a target identity using target identity content repositories, i.e. generating an avatar of a target person based on the target person's life story see [0022] A "target identity" describes the personality, person, or other entity from whose perspective a virtual experience is to be constructed. A target identity can be a real
person, either currently alive or deceased. A target identity can be known to the beholder ( e.g., a personal acquaintance, colleague, relative, friend, ancestor), or the beholder may never have met the target identity personally ( e.g., distant ancestor, previously deceased relation, historical figure). See also [0023] User experience device 100 (sometimes abbreviated "UED" herein) may be understood to be a computing device that has certain capabilities to receive a virtual
experience container and render sensory effects in a beholder 101 as part of a virtual experience. One aspect of the described techniques and systems is that any particular virtual experience container delivered to a user experience device 100 is matched to the capabilities of the user experience device 100 on which it is being rendered. Some capabilities a user experience device 100 might have include, for example, the ability to render video in 2D or 3D;
see also [0062] Examples of target identity content repositories 142 include: online social network data (including privately shared data), including written and media posting content,
indicators of sentiment and emotion, tags, information about contacts and relationships; personal, shared, and workrelated event calendars; email accounts; online, cloud-based,
and locally-stored media repositories containing photos;
see also [0174] [0174] (432) Determine the primary delivery medium of the virtual experience container. … as a 2D image or image with 360- degree selectable perspective, a hologram, or a 3D volumetric image), audio of a target identity narrating a story, a virtual avatar, and/or a "conversation" with a chatbot)
based on semantically indexed biographical data files comprising text transcriptions, audio data, and video recordings associated with, and specific to, the target person,
(Howard teaches semantic text processing database, i.e. semantically indexed biographical data files see [0063] In some implementations, a custom database implemented in a relational database system that may have the capability to do full-text or semantic text processing can
be a search module.)
the method comprising:
receiving, by the user device, first data in a form of an audio stream containing a user
query and transmitting the audio stream to the server,
(Howard teaches audio interface/audio queries/spoken prompts and a Content interpretation service(s)/speech interpretation, i.e. receiving/transmitting a user's query is in the form of an audio stream
see [0222] User experience device/system 600 includes a user interface system 606 for presenting user interfaces and receiving user input and indications. User interface system 606 may comprise an audio interface; See also [0064] A search module may be built to optimize for the query of certain types of content, such as images, video, speech, or text. A search module can be a trained classifier of images, text, audio, or other content types that is used to search element sources for the specific content which the classifier is trained to find.;
see also [0038] Example ES4: A topic of conversation prompt, "Grandpa's advice on marriage after 40 years of being married," results in the construction of a virtual experience in which the beholder can have a conversation with an avatar of Grandpa, by asking him a variety of questions relating to marriage and receiving responses from the avatar that are consistent with the personality and opinions of Grandpa as reflected in his digital or digitized writings ( e.g., emails, love letters, etc.).;
see also [0034] The subject matter prompt 112 can take a variety of forms, from a short textual description to a submission of media or other content that, when analyzed by virtual experience service 120 (e.g., using context analysis 121 component), begets a subject matter context in which the target identities act. Subject matter prompts expressed in language may be written, spoken, or otherwise communicated through natural language, descriptor words, or other form, including language in any form such as sign language or braille;
see also [0230] As noted previously in FIG. 1, a subject matter prompt may be indicated by the beholder using various common user interface elements, including natural language command interpretation;
see also [0222] User experience device/system 600 includes a user interface system 606 for presenting user interfaces and receiving user input and indications. User interface system
606 may comprise an audio interface 1040, video interface 1045, and one or more interface devices 1050 as described in FIG. 7.;
See also [0057] Content interpretation service(s) 130 can be used to, for example: identify the grammatical or semantic structure of text, discern key concepts in text, translate text, and identify entities in text; classify objects or places or identify people in images, caption images, perform video and speech interpretation to elicit concepts, identify a speaker, translate speech, identify and track faces, and index content; and analyze the "sentiment" expressed by speech,
text, or images. Different kinds of content interpretation service(s) 130 may be provided by third-party service providers such as Microsoft Azure®, Amazon® Web Services, or Google®, for example via an API of those services.)
transcribing by an automatic speech recognition engine executed on the server,
the audio stream to generate a natural language text transcript
(Howard teaches generating textual natural language prompts determined from audio queries, using a Content interpretation service(s) /speech interpretation i.e. transcribing the first data to generate a natural language text transcript by an automatic speech recognition engine see [0034] Generally, the beholder 101 interacts with the user experience device 100 to construct a new subject matter prompt or select from available subject matter prompts. The subject matter prompt 112 can take a variety of forms, from a short textual description to a submission of media or other content that, when analyzed by virtual experience service 120 (e.g., using context analysis 121 component), begets a subject matter context in which the target identities act. Subject matter prompts expressed in language may be written, spoken, or otherwise communicated through natural language, descriptor words, or other form, including language in any form such as sign language or braille.;
See also [0230] As noted previously in FIG. 1, a subject matter prompt may be indicated by the beholder using various common user interface elements, including natural language command interpretation.;
See also [0057] Content interpretation service(s) 130 can be used to, for example: identify the grammatical or semantic structure of text, discern key concepts in text, translate text, and identify entities in text; classify objects or places or identify people in images, caption images, perform video and speech interpretation to elicit concepts, identify a speaker, translate speech, identify and track faces, and index content; and analyze the "sentiment" expressed by speech, text, or images. Different kinds of content interpretation service(s) 130 may be provided by third-party service providers such as Microsoft Azure®, Amazon® Web Services, or Google®, for example via an API of those services.)
and
rendering, on the user's computing device, the synthetic audio response in synchronization with
the video animation of the avatar
(Howard teaches generating an interactive conversations or narrations with the generated avatar, i.e. rendering an audio response in sync with the video avatar see [0184] Further, if the subject matter prompt suggests that the virtual experience desired is a conversation with a virtual avatar or chatbot that emulates the personality or conversational style of a target identity, the unifying flow is implied by the dynamic processing of the chatbot/avatar in interpreting the conversation/questions of the beholder and responding accordingly; in other words, explicit content linkages are unnecessary because the unifying flow is provided by the beholder.; see also [0189] the base content layer 510 can be constructed from content such as, but not limited to, video (in two dimensions or three dimensions), an immersive VR experience, a rendering of an object or memento (in two or three dimensions, e.g., as a 2D image or image with 360- degree selectable perspective, a hologram, or a 3D volumetric image), audio of a target entity narrating a story, a virtual avatar, and/or a "conversation" with a chatbot. Base content layer 510 may be constructed of one or more discrete content element (DCE), of which element 512 is illustrative;
See also [0098] A personality trait-faceted discrete content element might be used, for
example, to provide consistent and realistic behavioral representations in a virtual avatar of a
target identity, or to understand/represent a target identity's tone or interests in conversation
with a chatbot representing the target identity;)
and wherein the response generated is contextually and semantically related to the target person's historical experience as stored in the biographical data files;
(Howard teaches contextual and semantic analysis for results, i.e. “response generated is contextually and semantically related to the target person's historical experience”
see [0056-0057] [0056] Generally, context analysis involves analyzing aspects of the repository compilation request 109 and beholder request 110 (e.g., the target identity designators 111, subject matter prompt 112, user experience device parameters 113, and beholder-provided content, if any) for appropriate target entities, sentiments, and relationships, and for subject matter context gleaned from the subject matter prompt and beholder-provided content. Context analysis may also involve analyzing search results in the performance of content element deconstruction (see, e.g., FIG. 2A).; [0057] A virtual experience service 120 (e.g., using context analysis component 121 or other subcomponents) may interact with or direct requests to content interpretation service(s) 130 to assist in the identification of concepts in various kinds of content, including subject matter prompts, beholder-provided content, content repository media, and information feeds. Content interpretation service(s) 130 can be used to, for example: identify the grammatical or semantic structure of text, discern key concepts in text, translate text, and identify entities in text;
see also [0066] In brief, a virtual experience container 150 is embodied in a uniquely structured storage ( e.g., a file or streamed data format) that contains ( or links to) content elements that are unified by subject matter context into an experiential vignette.;
see also [0111] Different semantic analysis methods may yield a different selection of key concepts in the search results, as some methods may be more effective with certain kinds of textual material than others. Hence, more than one kind of semantic analysis method may be used to determine key concepts)
Howard does not disclose:
generating, by a non-transitory embedding engine executing on the server and configured with a trained machine learning model, a query vector data structure based on the
natural language text transcript;
generating, by a search engine executing on the server, at least one text result by performing a semantic similarity search between the query vector data structure and one or more vector data structures derived from the biographical data files associated with the target person;
searching, using a contextual search engine configured to perform vector similarity matching using cosine similarity functions, a structured database storing the biographical data files of the target person, to identify at least one semantically relevant text result;
providing as input, to a first artificial neural network, the at least one text result and the
corresponding natural language text transcript in order to generate a text response;
and wherein the first artificial neural network is trained using textual biographical content of the target person,
however, Zhao discloses:
generating, by a non-transitory embedding engine executing on the server and configured with a trained machine learning model, (Zhao teaches a Query-Response-Neural Network with a trained detection neural network, i.e. a embedding engine executing on the server with a trained machine learning model see [0099] FIG. 6B illustrates the query-response system 106 training the detection neural network 604 to detect visual features in accordance with one or more embodiments of the present disclosure. Although shown as a single training cycle, the query-response system 106 may perform the training acts and/or algorithms of FIG. 6B in an iterative manner;
see also [0128] When executed by the one or more processors, the computer-executable instructions of the query-response system 106 can cause the computing device(s) (e.g., the computing device 1002, the server(s) 102) to perform the methods described herein.;
see also Fig. 6A showing a Query-Response-Neural Network 600 for creating a Visual-
Context Vector 608)
a query vector data structure based on the natural language text transcript;
(Zhao teaches an embedding based query-response-neural network can apply one or more neural-network layers specifically trained to generate a query vector (e.g., a vector representation of the question ) i.e. “generating… by an embedding engine, a query vector data structure based on the natural language text” [0062] In some embodiments, the query-response system 106 receives an indication of the question 314 as pertaining to a portion of a video in playback mode at a client device. In particular, the client device can transmit one or more user inputs indicating the question 314 to the query-response system 106. The query-response system 106 subsequently analyzes transcribed or written version of the question 314 using the question-network layers 304. At the question network layers 304, the query-response-neural network 302 can apply one or more neural-network layers specifically trained to generate a query vector (e.g., a vector representation of the question 314).;
see also [0063] To do so, in some embodiments, the question network layers 304 generate word embeddings of the question 314 using a word-vector-representational model such as
word2vec… to generate a query vector.;
see also [0005] To respond to a user's question, the disclosed system further
selects a response from the candidate responses based on a comparison of the query-context vector and the candidate response vectors.;
see also [0038] unlike conventional video systems, the query-response system can accurately
respond to a question having a subject or predicate that depends on the visual context (e.g., "Is there a shortcut for that?"). In this manner, the query-response system can provide increased accuracy to responses to questions received during playback of a video segment.)
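For context only, the following is a minimal sketch (not taken from Zhao or any other applied reference) of how a query vector might be generated from a natural language transcript by averaging pre-trained word embeddings, in the spirit of the word2vec-style encoding Zhao describes; the names embed_query and embedding_table are hypothetical.

```python
# Illustrative sketch only; `embedding_table` is a hypothetical pre-trained
# word-embedding lookup (e.g., word -> 300-dimensional numpy vector).
import numpy as np

def embed_query(transcript: str, embedding_table: dict, dim: int = 300) -> np.ndarray:
    """Generate a query vector by mean-pooling word embeddings of the transcript."""
    tokens = transcript.lower().split()
    vectors = [embedding_table[t] for t in tokens if t in embedding_table]
    if not vectors:
        return np.zeros(dim)            # no known words: fall back to a zero vector
    return np.mean(vectors, axis=0)     # average word vectors into one query vector
```

Mean-pooling is only one simple choice; Zhao's question-network layers may instead apply trained neural-network layers to produce the query vector.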
generating, by a search engine executing on the server,
(Zhao [0126] Additionally, the user interface manager 1018 can present a variety of types of information, including text, digital media items, search results; see also [0054] the server(s) 102 may receive data from the client device 108 regarding the video, including data indicating a question or comment for a user. In tum, the server(s) 102 can transmit data back to the query-response system)
at least one text result by performing a semantic similarity search between the query vector data structure and one or more vector data structures derived from the biographical data files associated with the target person;
(Zhao teaches using similarity scores between the query context vector and one or more of the candidate-response vectors to find matching candidate responses with a knowledge base, i.e., “performing a semantic similarity search between the query vector data structure and one or more vector data structures derived from the biographical data files”
See [0126] Additionally, the user interface manager 1018 can present a variety of types of information, including text, digital media items, search results
See also [0072] With the generated candidate-response vectors, the query-response system 106 can then compare the candidate response vectors with the query-context vector described
above. In particular, the query-response system 106 can determine respective similarity scores between the query context vector and one or more of the candidate-response vectors. Based on the similarity scores, the query-response system 106 can select a corresponding candidate response as the response 326. For example, the query-response system 106 may select a candidate response as the response 326 based on the candidate-response vector for the candidate response satisfying a threshold similarity. This comparison of the candidate-response vectors with the query-context vector for determining a response to a user question is described more in relation to FIG. 4 below.;
see also [0083] In one example, the matching act 438 applies a matching threshold to the matching scores by comparing each of the matching scores for the candidate-response
vectors 436 (e.g., dot products between the query-context vector 428 and each of the candidate-response vectors 436) to the matching threshold. Additionally, the matching act
438 may require that a matching score satisfy the matching threshold in order to qualify as a response to the question 202.;
see also [0127] The data storage facility 1020 maintains data for the query-response system 106. For example, the data storage facility 1020 ( e.g., via one or more memory devices) can
maintain data of any type, size, or kind, as necessary to perform the functions of the query-response system 106, including digital images, synthetic-training images, an external domain knowledge base, learned parameters, etc. ; see also [0069] In one example, the response-network layers 308 exploit the external domain knowledge in the knowledge base 322 by generating candidate-response vectors from the candidate responses 324. In these or other embodiments, the candidate responses 324 are based on a graph structure that represents the linkage of responses, entities, and options of the knowledge base 322.;
see also [0069] In one example, the response-network layers 308 exploit the external domain knowledge in the knowledge base 322 by generating candidate-response vectors from the candidate responses 324. In these or other embodiments, the candidate responses 324 are based on a graph structure that represents the linkage of responses, entities, and options of the knowledge base 322)
searching, using a contextual search engine configured to perform vector similarity matching using cosine similarity functions, a structured database storing the biographical data files of the target person, to identify at least one semantically relevant text result;
(Zhao teaches using cosine similarity values to find relevant results see [0035] As described below, the query-response system may generate matching scores represented as matching probabilities ( e.g., as output from fully-connected layers and/or a softmax function), cosine similarity values, or Euclidean distances.; See also [0126] Additionally, the user interface manager 1018 can present a variety of types of information, including text, digital media items, search results;
See also [0068] For instance, the knowledge base 322 may include responses for questions asked about a tutorial video included with the license of an image editing application ( e.g., a tutorial video on a DVD for the image editing application), and an instructional video available online and made by a user of the image editing application. In some embodiments, for a given domain or subject, such as videos about using Adobe Photoshop®, the knowledge base 322 is
not tied to a specific video regarding that domain or subject. Instead, the knowledge base 322 includes responses appropriate to any video on the domain or subject,)
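For context only, a minimal sketch (with hypothetical names such as semantic_search and records) of cosine-similarity vector matching over stored vectors, of the general kind Zhao describes in [0035] and as recited for the claimed contextual search engine:

```python
# Illustrative sketch only; `records` is a hypothetical list of
# {"text": str, "vector": np.ndarray} entries derived from stored data files.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_search(query_vec: np.ndarray, records: list, threshold: float = 0.7):
    """Return stored text entries whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, r["vector"]), r["text"]) for r in records]
    scored.sort(key=lambda s: s[0], reverse=True)   # highest similarity first
    # Keep only results that satisfy the similarity threshold.
    return [(score, text) for score, text in scored if score >= threshold]
```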
providing as input, to a first artificial neural network, the at least one text result and the
corresponding natural language text transcript in order to generate a text response;
(Zhao teaches a query-response-neural network using the query-context
vector compared with candidate-response vectors to select the response, i.e. “a first artificial neural network, the at least one text result and the corresponding natural language text transcript”; see [0067] Based on the attention weights applied to the various vectors input into the attention mechanism 312, the query-response-neural network 302 generates a query-context vector for comparison with candidate-response vectors. In other embodiments ( as denoted by the dotted box for the attention mechanism 312), the query-response-neural network
302 does not utilize the attention mechanism 312. Instead, the query-response-neural network 302 can combine the query vector with one or more of the context vectors or the hidden-feature vectors to generate a query-context vector (without attention modifications thereto). Regardless
of the format for constituents of such a query-context vector, the query-response system 106 compares the query-context vector with candidate-response vectors to select the response
326.;
See also [0137] (ii) generating the textual-context vectors by utilizing transcript layers from the query-response-neural network. )
and wherein the first artificial neural network is trained using textual biographical content of the target person,
(Zhao [0070-0071] [0070] For instance, the query-response system 106 may initialize the response encoder using vectors determined from embeddings based on the graph structure, and trained together with other components of the query-response system 106 using triplets including a question-response pair generated by a knowledgeable user (e.g., professional artist), and context of the video ( e.g., audio sentences). [0071] For instance, the query-response system 106 may configure the response encoder with a random initialization (e.g., randomly-selected convolution weights), and train the response encoder based on the triplets of questions, responses, and context.)
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to apply a query-response neural network as taught by Zhao to the system of Howard, since it was known in the art that such query systems provide a natural language query processing system that utilizes a query-response neural network for contextualizing and responding to a user question received during display or playback of a video segment, such as a screencast-tutorial segment, where the query-response neural network can include neural-network layers and mechanisms for generating representations of questions, transcript text, visual cues, and answer candidates; by analyzing both audio and visual cues with such a query-response neural network, the disclosed query-response system provides answers to users' questions with accuracy and multiple contextual modes for questions (Zhao [0021]).
Howard/Zhao do not disclose:
providing as input, to a second artificial neural network, the text response in order to
generate a synthetic audio response;
however, Wu discloses:
providing as input, to a second artificial neural network, the text response in order to
generate a synthetic audio response;
(Wu teaches using a dialog model/neural network model to generate reply text from input text and to generate a reply speech signal; see [0037]-[0038]: [0037] The computing device 108 inputs the obtained input text 204 to a dialog model to obtain reply text 206 for answering. The dialog model is a trained machine learning model, a training process of which can be performed offline. Alternatively or additionally, the dialog model is a neural network model, and the training process of the dialog model is described below in conjunction with FIG. 4, FIG. 5A, and FIG. 5B. [0038] Then, the computing device 108 uses the reply text 206 to generate a reply speech signal 208 by a text-to-speech (TTS) technology, and may further recognize, according to the reply text 206, an identifier 210 of an expression and/or action used in the current reply.;
See also [0043] In some embodiments, the computing device 108 inputs the input text 204 and personality attributes of a virtual object to a dialog model to acquire the reply text 206, the dialog model being a machine learning model which generates the reply text using the personality attributes of the virtual object and the input text. Alternatively or additionally, the dialog model is a neural network model.).
It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to apply a dialog model/neural network model to generate a reply speech signal, as taught by Wu, to the system of Howard/Zhao, since it was known in the art that natural language processing systems provide a dialog model obtained by performing training with personality attributes of the virtual object and dialog samples, the dialog samples including an input text sample and a reply text sample. The dialog model may be obtained by the computing device through offline training: the computing device first acquires the personality attributes of the virtual object, where the personality attributes describe human-related features of the virtual object (for example, gender, age, constellation, and other human-related characteristics), and then trains the dialog model based on the personality attributes and the dialog samples, wherein the dialog samples include the input text sample and the reply text sample. During training, the personality attributes and the input text sample are used as input and the reply text sample is used as output. The dialog model may alternatively be obtained by another computing device through offline training, whereby a dialog model can be quickly and efficiently obtained (Wu [0044]).
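For context only, a minimal sketch of the reply-text-to-synthetic-speech step of the general kind Wu describes; dialog_model is a hypothetical stand-in for Wu's trained dialog model, and pyttsx3 is used here merely as a generic offline text-to-speech engine rather than as the specific TTS technology of the reference:

```python
# Illustrative sketch only. `dialog_model` is a hypothetical trained model that
# maps input text to reply text; pyttsx3 serves as a generic offline TTS engine.
import pyttsx3

def generate_reply_audio(input_text: str, dialog_model, out_path: str = "reply.wav") -> str:
    """Generate reply text with a dialog model, then synthesize it to speech."""
    reply_text = dialog_model.generate(input_text)   # hypothetical model interface
    engine = pyttsx3.init()
    engine.save_to_file(reply_text, out_path)        # write synthesized speech to a file
    engine.runAndWait()                              # run the TTS event loop
    return out_path
```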
Howard/Zhao/Wu do not disclose:
providing as input, to a third artificial neural network, the synthetic audio response in order to
generate a video animation of the avatar, wherein the video animation corresponds to the avatar
speaking the synthetic audio response;
However, Ramesh discloses:
providing as input, to a third artificial neural network, the synthetic audio response in order to
generate a video animation of the avatar, wherein the video animation corresponds to the avatar
speaking the synthetic audio response
(Ramesh teaches generating an avatar emulating facial movements using a neural network to use given audio including the user speaking the words “good morning,” where the neural net/the classifier outputs a first video clip with the avatar emulating facial movements involved in speaking the word “good,” followed by a second video clip with the avatar estimating facial movements involved in speaking the word “morning,”, i.e. “providing as input, to a third artificial neural network, the synthetic audio response in order to generate a video animation”
see [0044-0046] To accomplish this, for example, the facial feature model can include a classifier that is configured to map each of a plurality of words to a respective one of a plurality of facial movements. At some point, the computing system 200 can receive one or more pre-recorded videos of the user, and the classifier can be trained using one or more pre-recorded videos of the user as training data. In some cases, the facial feature model can be or include a deep learning-based model that uses convolutional neural networks (CNN), transformer models, and/or deep neural networks (DNNs) trained using the one or more pre-recorded videos. Each such neural network can convert audio into one or more frames of video of corresponding facial movements. The pre-recorded videos can be previous video communication sessions of the user captured by a camera of the first client device 102, or other types of videos of the user. [0045] Having estimated the facial movement of the user speaking the one or more words, the computing system 200 can generate a synthetic video depicting an avatar of the user moving according to the estimated facial movement. For example, the computing system 200 can generate the synthetic video by assembling all of the video clips output by the classifier in temporal order. As a more specific example, given audio including the user speaking the words “good morning,” the classifier might output a first video clip with the avatar emulating facial movements involved in speaking the word “good,” followed by a second video clip with the avatar estimating facial movements involved in speaking the word “morning,” and the computing system 200 can edit the two video clips together.;
[0046] In response to generating the synthetic video, the computing system 200 c