DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 11/19/2025 has been entered.
Response to Arguments
Applicant's arguments filed 11/19/2025 have been fully considered but they are not persuasive. In particular, Applicant states (p. 12) that Hunt does not teach “wherein the database is split via application of a set of CAS trees generated across a plurality of files of the database according to global content-defined boundaries derived from a rolling hash applied based on the database as a whole as a continuous data stream, such that identical content segments shared across different files are represented by common nodes within the set of CAS trees”. This limitation is instead taught by the combination of Hunt and Guilford.
Hunt divides (i.e., splits) user content (i.e., across different files) into data objects (i.e., chunks) [0009], forming a bottom layer (i.e., tier) extending the object tree structure (i.e., content defined tree), where a file is analogous to a folder of contained objects representable in the object tree structure as a subtree. Hunt builds corresponding pointer trees of smashes for files (i.e., set of CAS trees), with one pointer for each object in each file. A leaf node is a grouping of object smashes, and a parent node contains a smash of the grouping of its child smashes [0010].
Data de-duplication algorithms store a single copy (i.e., common node) of duplicate data (i.e., chunk) while replacing duplicate instances with references to the shared copy (Guilford: [0001]). Guilford computes rolling hashes of a set of contiguous bytes (i.e., data stream) of fixed size (Guilford: [0002]), in order to reduce the computational overhead of searching algorithms while providing (i.e., deriving) quality identification of natural data (i.e., chunk) boundaries, based on the rolling hash matching some trigger criteria (i.e., condition) (Guilford: [0011]).
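For illustration only, the content-defined chunking that Guilford describes — a rolling hash over a fixed-size window of contiguous bytes, with a boundary declared wherever the hash matches a trigger criterion — may be sketched as follows. All names, constants, and the particular polynomial hash are the examiner's illustrative choices, not drawn from the reference:

```python
WINDOW = 16          # fixed-size window of contiguous bytes (illustrative)
MASK = 0x3F          # trigger criterion: boundary when low 6 bits of hash are zero
BASE = 263
MOD = 1 << 31

def chunk_boundaries(data: bytes) -> list[int]:
    """Return end offsets of content-defined chunks in `data`."""
    boundaries = []
    h = 0
    power = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * power) % MOD  # slide: drop oldest byte
        h = (h * BASE + b) % MOD                      # slide: admit newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)                  # natural data boundary
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))                  # final partial chunk
    return boundaries
```

Because each boundary depends only on the bytes inside the window, identical content produces identical chunk boundaries regardless of where it appears in the stream — the property that makes the boundaries "content-defined" rather than offset-defined.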
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Guilford to Hunt. One having ordinary skill in the art would have found motivation to adopt the rolling hash algorithm of Guilford to compute chunk smashes and perform deduplication in Hunt more efficiently.
In summary, the cited prior art of record, in combination, teaches the argued limitations of independent claims 1, 5 and 13.
Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.
Claims 1-20 are rejected under 35 U.S.C. 112(a), as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, at the time the application was filed, had possession of the claimed invention. For example, claim elements “content segment” and “content-defined boundaries” are not defined in the instant specification.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
Claims 1-8, 10-16 and 18-20 are rejected under 35 U.S.C. 103 as being unpatentable over Hunt et al., US Patent Application Publication 2015/0220578 [herein “Hunt”], in view of Ding et al., “An Approach for Validating Quality of Datasets for Machine Learning,” 2018 IEEE International Conference on Big Data, pp. 2795-2803 [herein “Ding”], and further in view of Guilford et al., US Patent Application Publication 2016/0188589 [herein “Guilford”].
Claim 1 recites “A system for using a content defined tree and a content addressed storage (CAS) tree to efficiently determine and retrieve locations of each portion of a file within a database for reconstruction of the file, the system comprising: one or more processors; and a non-transitory, computer readable medium having instructions recorded thereon that, when executed by the one or more processors, cause operations comprising: obtaining a request for a file in a database, wherein the request comprises an identification of the file,”
The instant specification defines a “content defined tree” as a tree of hashes in which each leaf node is a chunk hash, and each non-leaf node is the hash of a concatenation of hashes of its child nodes (Spec. [0004]). A “content addressed storage (CAS) tree” is a particular type of content defined tree (i.e., constructed in a similar manner) whose leaf nodes also capture chunk locations in a database (Spec. [0008]). In other words, a CAS tree is a content defined tree with additional metadata in leaf nodes.
Hunt teaches a file system (i.e., database) consisting of files/directories organized in an object tree structure, each being described by an inode [0004]. The file system is constructed on top of a content addressable distributed object store [0008]. Each object has an associated record containing attributes about the object, including its content hash [0009], disk identifier and location [0017]. The smash of an object is a hash of its associated record, which is used to identify the object in a read request [0019].
Claim 1 further recites “wherein the database is split via application of a set of CAS trees generated across a plurality of files of the database according to global content-defined boundaries derived from a rolling hash applied based on the database as a whole as a continuous data stream, such that identical content segments shared across different files are represented by common nodes within the set of CAS trees,”
Hunt divides (i.e., splits) user content (i.e., across multiple files) into data objects (i.e., portions) [0009], forming a bottom layer (i.e., tier) extending the object tree structure (i.e., content defined tree), where a file is analogous to a folder of contained objects representable in the object tree structure as a subtree. Hunt builds corresponding pointer trees of smashes for files (i.e., set of CAS trees), with one pointer for each object in each file. A leaf node is a grouping of object smashes, and a parent node contains a smash of the grouping of its child smashes [0010].
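For illustration only, the pointer-tree construction characterized above — leaf nodes holding chunk hashes, each parent holding a hash of the grouping of its children's hashes — may be sketched as follows. The function names, the SHA-256 choice, and the fixed branching factor are the examiner's illustrative assumptions, not Hunt's disclosed implementation (Hunt's smash is a hash of an object's record, and the branching factor is derived from chunk size and smash length):

```python
import hashlib

FANOUT = 4  # illustrative branching factor

def smash(data: bytes) -> bytes:
    """Illustrative stand-in for a content hash."""
    return hashlib.sha256(data).digest()

def build_cas_tree(chunks: list[bytes]) -> bytes:
    """Return the root hash of a pointer tree over chunk hashes.

    Each parent is the hash of the concatenation of its children's
    hashes, so identical content anywhere in any file yields identical
    subtree nodes (i.e., common nodes)."""
    level = [smash(c) for c in chunks]  # bottom tier: one hash per chunk
    while len(level) > 1:
        level = [smash(b"".join(level[i:i + FANOUT]))
                 for i in range(0, len(level), FANOUT)]
    return level[0]
```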
Hunt does not disclose claim element “rolling hash”; however, data de-duplication algorithms store a single copy (i.e., common node) of duplicate data (i.e., chunk) while replacing duplicate instances with references to the shared copy (Guilford: [0001]). Guilford computes rolling hashes of a set of contiguous bytes (i.e., data stream) of fixed size (Guilford: [0002]), in order to reduce the computational overhead of searching algorithms while providing (i.e., deriving) quality identification of natural data (i.e., chunk) boundaries, based on the rolling hash matching some trigger criteria (i.e., condition) (Guilford: [0011]).
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Guilford to Hunt. One having ordinary skill in the art would have found motivation to adopt the rolling hash algorithm of Guilford to compute chunk smashes and perform deduplication in Hunt more efficiently.
Hunt does not disclose limitation “the database being split into a training subset, a testing subset and a validation subset for training a machine learning model, wherein the set of CAS trees indicates where each chunk of the file is located in the database and comprises a first set of parent nodes that designates a first portion of the first set of parent nodes as the training subset and a second set of the first set of parent nodes as the testing subset;”
However, Ding teaches validating quality of datasets for machine learning, by partitioning (i.e., splitting) a dataset of images (i.e., chunks along boundaries) into 3 subsets for training/testing/validation purposes respectively (Ding: sec. IV.B, para. 4). In a traditional N-fold cross validation (NFCV) (Ding: sec. I, para. 3) where N=8, images in the dataset are partitioned into 8 equal groups in terms of number of images, 6 of which are used for training, and the remaining 2 are for validation/testing respectively (Ding: sec. III.B.2, para. 1).
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of Ding to Hunt. One having ordinary skill in the art would have found motivation to divide a file in the database of Hunt into chunks, each file represented by a set of CAS trees, and to partition the resulting set of chunks in Ding into 3 subsets respectively for training/testing/validation in machine learning.
Claim 1 further recites “based on the request and the identification of the file, retrieving a content defined tree corresponding to the file, wherein the content defined tree indicates all chunks needed to reconstruct the file and comprises a second set of parent nodes, a particular parent node relating to at least a portion of a particular one of the training subset, the testing subset, or the validation subset and corresponding to a set of hashes that have been generated using a rolling hash and a grouping condition that identifies a target quantity of child nodes for the particular parent node, wherein the particular parent node comprises a hash of a concatenation of each hash in a corresponding set of hashes, wherein the second set of parent nodes form a tier of the content defined tree, and wherein each hash in each set of hashes corresponds to a chunk in the file;”
Hunt divides user content into data objects (i.e., chunks) [0009], forming a bottom layer extending the object tree structure, where a file is analogous to a folder of contained objects representable in the object tree structure as a subtree. Hunt builds corresponding pointer trees of smashes for files, with one pointer for each object in each file. A leaf node is a grouping of object smashes, and a parent node contains a smash of the grouping (i.e., concatenation) of its child smashes [0010].
Given chunk size BLKSZ and smash length SMASHSZ, each leaf node in the pointer tree can hold up to (i.e., conditioned) n = truncate(BLKSZ ÷ SMASHSZ) smashes (i.e., target quantity of child nodes for each parent node) [0063]. For a given tree depth d, the maximum size of a file representable by the tree is BLKSZ × n^d [0066]. In other words, the branching factor can be derived (i.e., identified) from file size and tree depth.
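As a worked check of the sizing relation above, with a 32 KB chunk size and a 32-byte smash length (illustrative values), the branching factor is n = 1024, so a depth-1 tree addresses up to 32 MB and a depth-2 tree up to 32 GB:

```python
BLKSZ = 32 * 1024   # bytes per chunk (illustrative)
SMASHSZ = 32        # bytes per smash (illustrative)

n = BLKSZ // SMASHSZ                # branching factor: smashes per node

def max_file(d: int) -> int:
    """Maximum file size representable by a depth-d pointer tree."""
    return BLKSZ * n ** d
```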
Claim 1 further recites “determining a first node by traversing the content defined tree; comparing the first node with a set of CAS tree nodes in the set of CAS trees; based on a hash of the first node matching a hash of a first CAS tree node of the set of CAS tree nodes, traversing a first CAS tree corresponding to the first CAS tree node, wherein the first CAS tree comprises a set of parent nodes, wherein each parent node of the first set of parent nodes comprises a hash of a concatenation of each hash in a corresponding set of hashes, and wherein each hash in each set of hashes corresponds to a chunk stored in the database, the chunk relating to the at least the portion of the particular one of the training subset, the testing subset, or the validation subset;”
To read a file from storage, Hunt uses its smash (i.e., first node) to find (i.e., retrieve) the inode for the file containing the matching root node of the pointer tree, which is descended (i.e., traversed) from parent to child nodes to reach the leaf nodes [0077].
Claim 1 further recites “based on traversing the first CAS tree, obtaining a set of child nodes of the first CAS tree node, wherein each child node comprises a hash usable as a key to retrieve a location of a chunk of the file; retrieving, based on the set of child nodes, a set of file chunks; and reconstructing the file based on the set of file chunks.”
From the smashes of leaf nodes, Hunt retrieves metadata of identified objects from a hash database [0074], including disk identifier, location and content hash, to reassemble (i.e., reconstruct) sequential file data from chunk data. The reassembled file data is returned to the requestor [0077].
Claims 5 and 13 are analogous to claim 1, and are similarly rejected.
Claim 2 recites “The system of claim 1, wherein the instructions, when executed by the one or more processors, cause operations further comprising: determining, based on the content defined tree and the set of CAS tree nodes, that no CAS tree exists for a portion of the file; and based on no CAS tree existing for the portion of the file, generating a new CAS tree by: dividing the portion of the file into a set of chunks, each chunk in the set of chunks having a boundary, wherein each boundary is determined at least in part based on a first rolling hash satisfying a first condition and each boundary defines a byte size of a corresponding chunk; generating a set of hashes comprising a cryptographic hash for each chunk of the set of chunks, wherein the set of hashes form a first tier of the new CAS tree; generating a set of parent nodes by grouping each hash of the set of hashes based on a second rolling hash satisfying a second condition, and by hashing a concatenation of each resulting group of hashes, wherein the set of parent nodes form a second tier of the content defined tree, wherein a first parent node of the set of parent nodes comprises a hash that is usable as a key to retrieve each hash in a group of hashes that corresponds to the first parent node; and storing the new CAS tree and the portion of the file in the database.”
To write a file (i.e., portion of file) to storage (i.e., database), Hunt divides the file into data objects (i.e., chunks), and computes a smash (i.e., cryptographic hash) for each object. A pointer tree (i.e., CAS tree) is constructed (i.e., generated) by inserting object smashes into leaf nodes, and the smash of the grouping (i.e., concatenation) of child smashes is inserted into the parent node [0010]. The smash (i.e., key) of every object is looked up in storage, and the object is written to storage only if not found (i.e., no CAS tree exists) [0074]. Given chunk size and smash length, file size can be derived from branching factor and tree depth of the pointer tree [0066], e.g., a three-layer (i.e., first/second tier) tree with chunk size 32KB (fig. 7B; [0065]).
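For illustration only, the deduplicating write path characterized above — compute a smash per chunk, look it up in storage, and write the chunk only if not found — may be sketched as follows. The function name and dict-backed store are the examiner's illustrative stand-ins:

```python
import hashlib

def write_file(chunks, store: dict) -> list[bytes]:
    """Store chunks keyed by smash; return the file's list of leaf smashes."""
    smashes = []
    for chunk in chunks:
        s = hashlib.sha256(chunk).digest()
        if s not in store:        # look up before writing: dedup step
            store[s] = chunk
        smashes.append(s)
    return smashes
```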
Hunt and Guilford teach claim 1, where Guilford computes rolling hashes (i.e., first/second rolling hashes) in a manner that reduces the computational overhead of searching algorithms while providing quality identification of natural data (i.e., chunk) boundaries, based on the rolling hash matching some trigger criteria (i.e., first/second conditions) (Guilford: [0011]).
Claim 3 recites “The system of claim 2, wherein the instructions, when executed by the one or more processors, cause operations further comprising: based on determining that the portion of the file is greater than a threshold byte size, causing generation of the new CAS tree to use a first subpart of the portion of the file that is less than the threshold byte size; and generating a second new CAS tree using a second subpart of the portion of the file, wherein content of the first subpart does not overlap with the second subpart.”
Hunt constructs a pointer tree (i.e., CAS tree), with one pointer for each chunk in the file (i.e., first/second subpart of portion of file). A parent node contains a smash of the grouping of its child smashes [0010]. Given chunk size (e.g., 32KB [0063]) and smash length, the maximum (i.e., threshold) file size can be derived from branching factor and tree depth of the pointer tree [0066]. For a file split into a huge number of (i.e., non-overlapping) chunks, a 2-layer file-chunk tree might be insufficient, such that a 3-layer file-subpart-chunk tree becomes necessary (fig. 7B; [0065]).
Claims 11 and 19 are analogous to claim 3, and are similarly rejected.
Claim 4 recites “The system of claim 1, wherein each parent node of the first CAS tree comprises a hash that may be used to index to one or more locations in the database where corresponding chunks are stored.”
Each parent node in the pointer tree (i.e., CAS tree) of Hunt contains a smash of the grouping of its child smashes [0010]. From the smashes of leaf nodes, Hunt retrieves metadata of the identified data objects (i.e., chunks) from a hash database [0074], including disk identifier and location, to locate (i.e., index) the associated objects in storage [0017].
Claims 8 and 16 are analogous to claim 4, and are similarly rejected.
Claim 6 recites “The method of claim 5, wherein the content defined tree comprises a second set of parent nodes, each parent node corresponding to a set of hashes that have been further determined using a rolling hash, wherein each parent node comprises a hash of a concatenation of each hash in a corresponding set of hashes, wherein the set of parent nodes form a tier of the content defined tree, and wherein each hash in each set of hashes corresponds to a chunk in the file.”
Hunt divides a file into data objects (i.e., chunks) [0009], forming a bottom layer (i.e., tier) extending the file system object tree structure (i.e., content defined tree). Hunt builds a corresponding pointer tree (i.e., CAS tree) of smashes, with one leaf node for each chunk in the file. A parent node contains a smash of the grouping (i.e., concatenation) of smashes of its child nodes [0010].
Hunt and Guilford teach claim 5, where Guilford computes rolling hashes in a manner that reduces the computational overhead of searching algorithms while providing quality identification of natural data (i.e., chunk) boundaries, based on the rolling hash matching some trigger criteria (Guilford: [0011]).
Claim 14 is analogous to claim 6, and is similarly rejected.
Claim 7 recites “The method of claim 5, further comprising: determining, based on the content defined tree and the set of CAS tree nodes, that no CAS tree exists for a portion of the file; and based on no CAS tree existing for the portion of the file, generating a new CAS tree by: determining a second set of chunks, wherein a total byte size of the second set of chunks is less than a threshold byte size; and generating leaf nodes of the new CAS tree, wherein each leaf node corresponds to a chunk of the second set of chunks.”
To write a file (i.e., portion of file) to storage, Hunt divides the file into data objects (i.e., chunks), and computes a smash for each object. A pointer tree (i.e., CAS tree) is constructed (i.e., generated) by inserting object smashes into leaf nodes, and the smash of the grouping of child smashes is inserted into the parent node [0010]. The smash of every object is looked up in storage, and the object is written to storage only if not found (i.e., no CAS tree exists) [0074]. Given chunk size (e.g., 32KB [0063]) and smash length, the maximum file size can be derived from branching factor and tree depth of the pointer tree [0066]. When the total size of object smashes is less than the size of a single object (i.e., threshold), further grouping is not needed.
Claim 15 is analogous to claim 7, and is similarly rejected.
Claim 10 recites “The method of claim 5, further comprising: determining, based on the content defined tree and the set of CAS tree nodes, that no CAS tree exists for a portion of the file; based on no CAS tree existing for the portion of the file, generating a new CAS tree by: generating a set of hashes comprising a hash for each chunk of a set of chunks, wherein the set of hashes form a first tier of the new CAS tree; and generating a second set of parent nodes by grouping each hash of the set of hashes based on a second rolling hash satisfying a second condition; and storing the new CAS tree and the portion of the file in the database.”
To write a file (i.e., portion of file) to storage (i.e., database), Hunt divides the file into data objects (i.e., chunks), and computes a smash for each object. A pointer tree (i.e., CAS tree) is constructed (i.e., generated) by inserting object smashes into leaf nodes (i.e., first tier) up to the branching factor [0063], and the smash of the grouping of child smashes is inserted into the parent [0010]. The smash of every object is looked up in storage, and the object is written to storage only if not found (i.e., no CAS tree exists) [0074].
Hunt and Guilford teach claim 5, where Guilford computes rolling hashes in a manner that reduces the computational overhead of searching algorithms while providing quality identification of natural data (i.e., chunk) boundaries, based on the rolling hash matching some trigger criteria (i.e., second condition) (Guilford: [0011]).
Claim 18 is analogous to claim 10, and is similarly rejected.
Claim 12 recites “The method of claim 5, wherein reconstructing the file based on the set of file chunks comprises: retrieving additional file chunks based on a second set of parent nodes corresponding to a second database remote from the database; and reconstructing the file based on the set of file chunks and the additional file chunks.”
When there are multiple parent nodes (i.e., second set of parent nodes) in the pointer tree of Hunt, each parent node contains a smash of the grouping of its child smashes [0010]. From the smashes of all leaf nodes, Hunt retrieves metadata of the identified data objects (i.e., chunks) including disk identifier and location, to reassemble (i.e., reconstruct) sequential file data from the associated objects stored in one or more physical nodes (i.e., remote databases) of the storage cluster [0017]. The reassembled file data is returned to the requestor [0077].
Claim 20 is analogous to claim 12, and is similarly rejected.
Claims 9 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Hunt as applied to claims 5 and 13 above respectively, in view of Ding and Guilford, and further in view of Bracciano, “A Tree-View Angular Component Tale – Part 2: Nested Information,” 2020, pp. 1-21, https://ioannesbracciano.medium.com/ng-tree-view-tale-pt-2-14e25e7e78f6 [herein “TreeView”].
Claim 9 recites “The method of claim 5, further comprising: generating a user interface comprising the set of file chunks and the first CAS tree, wherein the user interface indicates an association between a node in the first CAS tree and a corresponding chunk in the set of file chunks.”
Hunt divides a file into data objects (i.e., chunks) [0009], forming a bottom layer (i.e., tier) extending the file system object tree structure, where a file is analogous to a folder of contained chunks representable in the tree structure as a subtree.
Hunt teaches claim 5, but does not disclose this claim; however, TreeView displays (i.e., generates) a tree view (i.e., user interface) of a nested structure of nodes representing (i.e., associated with) contents in a file system directory (TreeView: p. 10 of 21). Starting with the local root folder, each node lists (i.e., associates) the contents of a folder, which may be the contents of a file or a nested child folder.
Therefore, it would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to apply the teachings of TreeView to Hunt. One having ordinary skill in the art would have found motivation to utilize TreeView to display the pointer tree (i.e., CAS tree) of Hunt, with leaf nodes being pointer objects representing chunks of a file.
Claim 17 is analogous to claim 9, and is similarly rejected.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. For example, Xia et al., “The Design of Fast Content-Defined Chunking for Data Deduplication Based Storage Systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 9, 2020, pp. 2017-2031.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHELLY X. QIAN whose telephone number is (408)918-7599. The examiner can normally be reached Monday - Friday 8-5 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Boris Gorney can be reached at (571)270-5626. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/BORIS GORNEY/Supervisory Patent Examiner, Art Unit 2154
/SHELLY X QIAN/Examiner, Art Unit 2154