`TED STATES PATENT AND TRADEMARK OFFICE
`BE
`FORE THE PATENT TRIAL AND APPEAL BOARD
`GOOGL
`E LLC,
`Petitioner,
`v.
`C
`ELLULAR SOUTH, INC.,
`Patent Owner.
`C
`ase IPR2025-00877
`Patent 11,126,853 B2
`Issue Date: September 21, 2021
`Title
`: VIDEO TO DATA
`DE
`CLARATION OF HENRY HOUH, PH.D.
`PETITION FOR INTER PARTES REVIEW
`IPR2025-00877
`Patent Owner 2013
`Page 1 of 70
`
`
`
`
`
`
`
`* * *
`
`CB. Ground 1: Claims 1-4 Are Obvious Over Zhao in View of Kritt,
`Steinberg, and Kouzani
`1. Independent Claim 1: “A system for generating data from a
`video, comprising:” (Claim 1[pre])
`93. The preamble of claim 1 recites “[a] system for generating data from
`a video.” Assuming the preamble provides a claim limitation, Zhao discloses it.
`Zhao discloses a system that performs detection and recognition of objects, such as
`people and faces, in video images to generate data, including by creating and
`updating object models used for detection and recognition, and by generating data
`about the video that can be used for indexing and retrieval. (Houh, ¶¶93-94.) Figure
`2 of Zhao, discussed which I wi ll discuss in more detail in the my analysis below,
`illustrates at a high level this overall process of generating data from a video
`performed by the system:
`
`IPR2025-00877
`Patent Owner 2013
`Page 2 of 70
`
`
`
`
`
`
`
`
`(Zhao, Fig. 2; see also id., ¶0036 (“FIG. 2 illustrates a flowchart describing the steps
`involved, from a high-level, in one embodiment of the present system for detecting,
`modeling, recognizing, and tracking object images throughout one or more
`videos.”).)
`94. Zhao explains, for example, that it “generally relate[s] to systems and
`methods for detection, modeling, recognition, and tracking of objects within video
`content” and to “indexing and retrieval systems for videos based on generated object
`models.” (Zhao, ¶0051.) “These objects include people, faces, articles of clothing,
`IPR2025-00877
`Patent Owner 2013
`Page 3 of 70
`
`
`
`
`
`
`
`plants, animals, machinery, electronic equipment, food, and virtually any other type
`of image that can be captured or presented in video.” (Zhao, ¶0051.) Zhao’s system
`“may be operated in a computer environment” and “any results or outputs relating
`to dete ction, modeling, recognition, indexing, and/or tracking of object images
`within videos may be stored in a database” or otherwise output. (Zhao, ¶0055.) Zhao
`states that its system is “useful for a wide variety of applications, including video
`indexing and retrieval, video surveillance and security, unknown person
`identification, advertising, and many other fields.” (Zhao, ¶0053.)
`95. Therefore, Zhao discloses “ a system for generating data from a
`video.” And as explained below, the system as claimed is obvious over Zhao, Kritt,
`Steinberg, and Kouzani.
`(a). “a coordinator communicatively coupled to a splitter
`and to a plurality of demultiplexer nodes, wherein the
`splitter is configured to segment the video, wherein the
`demultiplexer nodes are configured to extract audio
`files from the video and to extract still frame images
`from the video;” (Claim 1[a])
`96. As explained below, claim 1[a] is disclosed by and obvious over Zhao
`in view of Kritt. (Houh, ¶¶96-113.)
`97. For context, the ’853 patent describes a “coordinator,” “splitter,” and
`“demultiplexer” in terms of functions they may perform, and does not state any
`particular form or structure. (Houh, ¶97.) The ’853 patent states, for instance, that
`“[s]ource media can be provided to a coordinator, which can be a non- visual
`IPR2025-00877
`Patent Owner 2013
`Page 4 of 70
`
`
`
`
`
`
`
`component that can perform one or more of several functions.” (’853, 16:26-27.) For
`example, “[t]he coordinator can direct distributed processing and aggregation.”
`(’853, 18:16-17.) Referring to figure 10, reproduced below, the ’853 patent states
`that “the coordinator can send uploaded source media to a splitter and a
`demuxer/demultiplexer”:
`
`
`
`(’853, Fig. 10, 16:60- 63; see also id. , 3:50- 51.) A “ splitter” can be invoked to
`“‘slice’ [media] assets into media segments and/or multiple sub assets comprising
`the entire stream.” (’853, 17:32-37.) A “demultiplexer” can “extract audio files
`IPR2025-00877
`Patent Owner 2013
`Page 5 of 70
`
`
`
`
`
`
`
`from the video and/or to extract still frame images from the video.” (’853, 1:45-47.)
`For example, “a demultiplexer component can strip audio data from segments and
`store audio streams for future analysis.” (’853, 17:61-62.)
`98. A person of ordinary skill in the art would recognize that the
`arrangement of a “coordinator ,” “splitter ,” and “plurality of demultiplexer
`nodes” recited in claim 1[a] is merely an application of the well-known concept of
`distributed processing (or multiprocessing) to video processing, to obtain the
`obvious benefits that processing time can be reduced, and more processing overall
`can be performed, by sharing the workload and performing multiple operations in a
`parallel fashion. (Houh, ¶98.) The ’853 pat ent likewise explains that “the splitter
`can be configured to segment the video,” then later states, for example, that “[t]he
`video-to-data engine can segment the video into chunks for distributed, or parallel,
`processing….Distributed processing in this context can mean that the processing
`time for analyzing a video from beginning to end is a fraction of the play time of the
`video. This can be accomplished by breaking the processes into sections and
`processing them simultaneously.” (’853, 2:35- 36, 6:15- 21.) By 2016,
`distributed/multi processing was well -known and neither novel nor non- obvious.
`(Houh, ¶98 (citing EX1012EX1012 (Microsoft Computer Dictionary 5th ed. 2002),
`pp.005 (entry for “distributed processing” explaining that “distributed processing
`shares the workload among computers that can communicate with one another” ),
`IPR2025-00877
`Patent Owner 2013
`Page 6 of 70
`
`
`
`
`
`
`
`006 (entry for “multiprocessing” ) explaining “objective is increased speed or
`computing power”).) And as explained below, claim 1 [a] is obvious over Zhao in
`view of Kritt.
`99. Turning to Zhao, PetitionerI will first explain how Zhao teaches both
`“splitter” operations and “demultiplexer” operations. PetitionerI will then explain
`how Zhao discloses and renders obvious a “ coordinator” before turning to the
`combination with Kritt.
`100. Zhao itself discloses a “splitter ” “configured to segment the video ”
`in the form of processes that perform “shot/scene boundary detection” to split a video
`into shots or scenes that are extracted for processing. (Houh, ¶100.) Zhao explains:
`Once the video has been retrieved, the video undergoes a process of
`detection of objects and extraction of object features from video images
`(processes 210, 220). Concurrently, shots or scenes are detected in the
`video via a shot/scene boundary detection procedure (process 215). The
`shots or scenes of the video are extracted to provide a sequence of
`images for analysis by the system, and each detected shot or scene is
`individually analyzed. Generally, the term “shot” or “scene” refers to a
`grouping of frames or images recorded by either a stationary or
`smoothly-moving camera, with little or no background change between
`the frames, corresponding to one continuous time period.
`(Zhao, ¶0057.) Figure 2 of Zhao illustrates this process with shot boundary detection
`process 215:
`IPR2025-00877
`Patent Owner 2013
`Page 7 of 70
`
`
`
`
`
`
`
`
`(Zhao, Fig. 2 (highlighted added).)
`101. With respect to the “demultiplexer,” Zhao also discloses at least some
`of the claimed “demultiplexer ” functionality—i.e., “ extract[ing] still frame
`images from the video .” (Houh, ¶101.) Zhao describes “receiving a video file,
`wherein the video file comprises a plurality of video frames.” (Zhao, ¶0023.) Zhao
`teaches that the object detection and recognition processes involve extracting still
`frame images from the video. (Houh, ¶101.) For example, Zhao explains that the
`object recognition process is performed on an “image or frame extracted from a
`IPR2025-00877
`Patent Owner 2013
`Page 8 of 70
`
`
`
`
`
`
`
`video.”2 (Zhao, ¶0075 (“The process 400 shown in FIG. 4 is for one image or frame
`extracted from a video, but as will be appreciated, this process is typically repeated
`for each optimal object image in the video.”).) Zhao similarly explains that the object
`detection process involves “detection of objects and extraction of object features
`from video images (processes 210, 220).” (Zhao, ¶0057; see also id., ¶0059 (“Once
`an object has been detected within a video frame…”).)
`102. With respect to “demultiplexer” functionality to “extract audio files
`from the video,” Petitioner notesI note that the claims of the ’853 patent do not
`recite any additional limitations related to “audio” or the extracted “audio files.”
`(Houh, ¶102.) Zhao likewise does not discuss audio information, which is not
`surprising because audio processing is not a focus of Zhao’s purported invention.
`(Id.) But it would have been obvious for Zhao’s system to further include processes
`for “extracting audio files from the video.” (Id.) For example, Zhao discloses
`performing its techniques on videos such as episodes of television shows such as the
`Gilmore Girls (Zhao, e.g., ¶¶0040-0041, 0093, Fig. 6), which a person of ordinary
`skill in the art would have understood and found obvious to have included one or
`more audio tracks. (Houh, ¶102.) Zhao more generally teaches that its techniques
`are “useful for a wide variety of applications, including video indexing and retrieval,
`
`
`2 All underlining added to quoted material unless noted.
`IPR2025-00877
`Patent Owner 2013
`Page 9 of 70
`
`
`
`
`
`
`
`video surveillance and security, unknown person identification, advertising, and
`many other fields.” (Zhao, ¶0053.) Given Zhao’s broad applicability, a person of
`ordinary skill in the art would have readily appreciated and found it obvious that
`videos processed by Zhao’s system would include audio tracks. (Houh, ¶102.) A
`person of ordinary skill in the art would have found it obvious to extract audio tracks
`from the video into separate audio files —for example, to separately store or
`analyze—but again, those sorts of preparatory operations are not a focus of Zhao.
`(Id.) For example, a person of ordinary skill in the art would have understood and
`found it obvious to perform operations such as speech recognition or speaker
`recognition on the audio data. (Id. ) In a security and surveillance context (and
`numerous other contexts), for instance, these operations could be used to further
`confirm a person’s identity or to make a record of conversations associated with the
`captured video. (Id. ) For example, Kritt describes using audio to “validate that
`the human object was correctly identified” in a video. ( Id. (quoting Kritt, ¶0033).)
`(“[A] visual tag may indicate that a particular human object appears in a shot and an
`audio tag identifies an audio signature of the human object is associated with the
`shot. In these examples, if the tags that are compared indicate the same object, the
`positive or consistent result of the comparison may be used…to validate that the
`human object was correctly identified.”).)
`103. A step of extracting audio tracks from a video file in Zhao would have
`IPR2025-00877
`Patent Owner 2013
`Page 10 of 70
`
`
`
`
`
`
`
`resulted in “ extract[ing] audio files” in at least two separate ways. (Houh, ¶103.)
`First, it was well-knownwell known to a person of ordinary skill in the art that video
`file formats took the form of “container” files (or “wrapper” files) that in turn stored
`video tracks, audio tracks, and other data as files within that container —much like
`how a single “.zip” archive file can store multiple files. (Id. ) For example, Kritt
`explains in its Background section:
`A television show, movie, internet video, or other similar content may
`be stored on a disc or in other memory using a container or wrapper file
`format. The container format may be used to specify how multiple
`different data files are to be used. The container format for a video may
`identify different data types and describe how they are to be interleaved
`when the video is played. A container may contain video files, audio
`files, subtitle files, chapter-information files, metadata, and other files.
`A container also typically includes a file that specifies synchronization
`information needed for simultaneous playback of the various files.
`(Kritt, ¶0002.) Similarly, the 2010 book , Pro HTML5 Programming, by Peter
`Lubbers et al., explains that “[a]n audio or video file is really just a container file,
`similar to a ZIP archive file that contains a number of files.” (EX1011, p.0013.)
`Figure 3-1 from the book, reproduced below, “shows how a video file (a video
`container) contains audio tracks, video tracks, and additional metadata. The audio
`and video tracks are combined at runtime to play the video.” (EX1011, p.0013.)
`IPR2025-00877
`Patent Owner 2013
`Page 11 of 70
`
`
`
`
`
`
`
`
`(EX1011, p.0014.) Therefore, a person of ordinary skill in the art would have
`understood that, by extracting audio tracks from a video file, a step of “extract[ing]
`audio files from the video” would be performed. (Houh, ¶103.)
`104. Second, by extracting audio from the video for separate analysis or
`storage, a person of ordinary skill in the art would have understood and found it
`obvious that the resulting extracted audio would take the form of “files.” (Houh,
`¶104.) This is because, for starters, file storage was one of a finite number of ways
`of storing data. (Id.) Moreover, it was routine to use files—known as “scratch files”
`or “temporary files”—for at least temporary storage while the associated data was
`IPR2025-00877
`Patent Owner 2013
`Page 12 of 70
`
`
`
`
`
`
`
`processed. (Id. (citing EX1012See, e.g., EX1012 (Microsoft Computer Dictionary
`(5th ed. 2002)) , pp.007 (entries for “scratch” and “scratch file”), 008 (entry for
`“temporary file”)).)
`105. Turning back to the claimed “coordinator,” a person of ordinary skill
`in the art would have further understood that, in order to manage the various video
`file retrieval, splitter, and demultiplexer operations, Zhao’s system would also
`include “coordinator” functionality. (Houh, ¶105.) For example, referring to figure
`2, Zhao
` shows and explains how splitter operations (e.g., process 215) and
`demultiplexer operations (e.g., processes 210, 400) take place concurrently:
`Once the video has been retrieved, the video undergoes a process of
`detection of objects and extraction of object features from video images
`(processes 210, 220). Concurrently, shots or scenes are detected in the
`video via a shot/scene boundary detection procedure (process 215).
`IPR2025-00877
`Patent Owner 2013
`Page 13 of 70
`
`
`
`
`
`
`
`
`
`(Zhao, Fig. 2 (highlightingannotations added), ¶0057.) A person of ordinary skill in
`the art would have understood that various operations would need to be performed
`in order to retrieve a video file and then distribute the file for concurrent processing
`by processes 215 and 210, for example. (Houh, ¶105.) Those operations, which I
`have highlighted in figure 2 of Zhao below, would form at least part of a
`“coordinator.”
`
`IPR2025-00877
`Patent Owner 2013
`Page 14 of 70
`
`
`
`
`
`
`
`
`(Zhao, Fig. 2 (highlightingannotations added); Houh, ¶105.) A person of ordinary
`skill in the art would have further understood that, for Zhao’s “coordinator”
`functionality to distribute the video file that it retrieved, it would need to be
`“communicatively coupled” to the system’s “splitter” and “demultiplexer”
`functionality in order to notify those components in some manner. (Houh, ¶105.) In
`fact, Zhao’s disclosure of “concurrent” processing (Zhao, ¶0057) suggests that the
`video file is actually distributed to separate nodes (such as separate processors or
`computer systems) with which Zhao’s “coordinator” would be communicatively
`coupled. (Houh, ¶105.) This is just like “coordinator” functionality described in the
`IPR2025-00877
`Patent Owner 2013
`Page 15 of 70
`
`
`
`
`
`
`
`’853 patent, which describes receiving a media file and distributing it to a splitter and
`demultiplexer. (Id. (citing ’853, 16:60- 63 (“the coordinator can send uploaded
`source media to a splitter and a demuxer/demultiplexer”)).)
`106. Lastly, before turning to the combination with Kritt, with respect to the
`claimed “ plurality of demultiplexer nodes,” this too is suggested by Zhao.
`(Houh, ¶106.) Although Zhao focuses on describing its system with respect to a
`single video file, Zhao explains that “the process may be extrapolated to analyze a
`plurality of videos.” (Zhao, ¶0057.) Because as noted above, Zhao also describes
`“concurrent” processing, a person of ordinary skill in the art would have therefore
`understood and found it obvious for Zhao’s system, in an implementation for
`processing a higher volume of video files, to comprise a “ plurality” of nodes
`providing the claimed demultiplexer functionality. (Houh, ¶106.) As discussed
`above, the benefits of distributed computing were well known and therefore would
`have been obvious to try. (Id E. (citingg., EX1012 (Microsoft Computer Dictionary
`5th ed. 2002) , pp.005 (entry for “distributed processing”), 006 (entry for
`“multiprocessing”)).)
`107. Turning to the combination with Kritt, like Zhao, Kritt describes a
`system for processing both visual and audio data of videos to generate data about the
`videos. (Houh, ¶107 (citing Kritt, ¶¶0017 (“a viewer selects a video, and human,
`nonhuman, and audio objects of the video are identified”For example:
`IPR2025-00877
`Patent Owner 2013
`Page 16 of 70
`
`
`
`
`
`
`
`According to various embodiments, a viewer selects a video, and
`human, nonhuman, and audio objects of the video are identified. In
`addition, key words that are spoken by human objects in the video are
`identified. Human, nonhuman, and audio objects may be used to
`classify a particular segment of a video as a scene. The objects and key
`words are then associated with the scenes of the video.
`), (Kritt, ¶0017.) As Kritt notes in the passage above, both visual and audio data are
`processed. ( See also Kritt, ¶¶ 0043 (“visual objects in a shot may be identified by
`application of one or more known image recognition processes to the shot”), 0059
`(“Segments of the video for which the audio feature values are similar to those that
`are characteristic of speech may be classified as speech.”), 0065 (“In operation 612,
`an audio transcript may be determined. An audio transcript may include all of the
`words spoken in the video….[S]poken words may be determined from audio features
`classified as speech using any known technique.”)).)
`108. With respect to claim 1[a] specifically, and concepts of
`distributed/multi processing, Kritt describes modules for processing video and audio
`of a video file, and that those modules can be distributed on different computer
`systems:
`The memory 104 may store all or a portion of the following: an audio
`visual file container 150 (shown in FIG. 2 as container 202), a video
`processing module 152, an audio processing module 154, and a control
`module 156. These modules are illustrated as being included within the
`IPR2025-00877
`Patent Owner 2013
`Page 17 of 70
`
`
`
`
`
`
`
`memory 104 in the computer system 100, however, in other
`embodiments, some or all of them may be on different computer
`systems and may be accessed remotely, e.g., via a network.
`(Kritt, ¶0022; see also id., ¶0024 (“The video processing module 152 may include
`various processes that generate visual tags according to one embodiment. The audio
`processing module 154 may include various processes for generating audio and key
`word tags according to one embodiment.”); Houh, ¶108.)
`109. Kritt further describes how its systems could be provided using “cloud-
`computing infrastructure.” (Kritt, ¶¶0078-0079.) Kritt explains, as was known to a
`person of ordinary skill in the art , that “[c]loud computing generally refers to the
`provision of scalable computing resources as a service over a network.” (Kritt,
`¶0078; Houh, ¶109.) “More formally,” Kritt explains, “cloud computing may be
`defined as a computing capability that provides an abstraction between the
`computing resource and its underlying technical architecture (e.g., servers, storage,
`networks), enabling convenient, on-demand network access to a shared pool of
`configurable computing resources that can be rapidly provisioned and released with
`minimal management effort or service provider interaction.” (Kritt, ¶0078.) Kritt
`further explains that one benefit of cloud computing is that “[t]ypically, cloud-
`computing resources are provided to a user on a pay-per-use basis, where users
`are charged only for the computing resources actually used (e.g., an amount of
`storage space used by a user or a number of virtualized systems instantiated by the
`IPR2025-00877
`Patent Owner 2013
`Page 18 of 70
`
`
`
`
`
`
`
`user).” (Kritt, ¶0079.) It also allows a user to “access any of the resources that reside
`in the cloud at any time, and from anywhere across the Internet.” (Id Kritt, ¶0079.)
`As discussed above for Zhao, a person of ordinary skill in the art would have
`recognized that in such a distributed/multi processing system, “ coordinator”
`functionality would be needed to manage the operations and workloads among the
`various video and audio processing nodes. (Houh, ¶109.)
`110. Rationale and Motivation to Combine (Zhao with Kritt). It would
`have been obvious to a person of ordinary skill in the art to incorporate Kritt’s
`teachings regarding using a distributed/multi processing system (such as a pool of
`nodes in a cloud -computing system) for processing images and audio into Zhao’s
`video processing system. (Houh, ¶110.) The combination would have involved the
`conventional application of distributed processing techniques to Zhao’s system and
`would have predictably resulted in Zhao’s system in which multiple nodes were used
`for processing images and audio of a retrieved video file, along with functionality
`that managed that processing. (Id.) The combination therefore would have resulted
`in “a coordinator”—i.e., functionality for managing the overall distributed process
`of analyzing images and audio of a video file—“communicatively coupled to
`a splitter”—i.e., functionality to split a video into shots/segments —“and to a
`plurality of demultiplexer nodes ”—e.g., multiple nodes in a cloud computing
`infrastructure for performing image and audio recognition. (Id.) The combination
`IPR2025-00877
`Patent Owner 2013
`Page 19 of 70
`
`
`
`
`
`
`
`therefore would have resulted in a “splitter [] configured to segment the video,”
`and “demultiplexer nodes [] configured to extract audio files from the video and
`to extract still frame images from the video.” (Id. )
`111. Zhao and Kritt are analogous references to the ’853 patent, being. Like
`the ’853 patent, both Zhao and Kritt are in the same field of video processing and
`generating data from video content, and also. (’853, 1:34-36 (“The present invention
`is generally directed to a method to generate data from video content, such as text
`and/or image-related information.”); e.g., Zhao, ¶0036, Fig. 2; e.g., Kritt, ¶0017.)
`Zhao and Kritt would have also been reasonably pertinent to problems facing the ’853
`inventors. (Houh, ¶111 (citing ’853, 1:34-36; Zhao, ¶0036, Fig. 2; Kritt, ¶0017).) ,
`including techniques for recognizing objects in video.
`112. A person of ordinary skill in the art would have been motivated to
`combine. (Houh, ¶112.) As noted above, although Zhao focuses on describing its
`system with respect to a single video file, Zhao explains that “the process may be
`extrapolated to analyze a plurality of videos.” (Zhao, ¶0057.) Zhao also teaches that
`its techniques are “useful for a wide variety of applications, including video indexing
`and retrieval, video surveillance and security, unknown person identification,
`advertising, and many other fields.” (Zhao, ¶0053.) A person of ordinary skill would
`have therefore recognized that an application of Zhao’s system could involve high-
`volume video processing, along with potential time constraints —for example, a
`IPR2025-00877
`Patent Owner 2013
`Page 20 of 70
`
`
`
`
`
`
`
`large-scale video indexing and retrieval website, or a commercial video surveillance
`and security service. (Houh, ¶112.) A person of ordinary skill in the art would have
`therefore looked to an analogous reference such as Kritt in the same field as Zhao
`that provides further teachings regarding system architectures capable of providing
`such a larger-scale service. (Id.) Kritt also provides express motivation to combine,
`describing numerous advantages of a cloud-based system, including “convenient,
`on-demand network access to a shared pool of configurable computing resources
`that can be rapidly provisioned and released wit h minimal management effort or
`service provider interaction,” “resources [] provided to a user on a pay-per -use
`basis,” and “access any of the resources that reside in the cloud at any time, and
`from anywhere across the Internet.” (Kritt, ¶¶0078-0079; Houh, ¶112.) Moreover,
`with respect to audio processing, as explained above, Kritt also explains integrating
`audio and image processing would beneficially enable confirmation of recognized
`visual objects in the video, as well as a transcript that could be used for further
`analysis of the video. (Kritt, e.g., ¶¶0033, 0065; Houh, ¶112 (using audio to “validate
`that the human object was correctly identified” in a video), 0065 (audio transcript).)
`113. A person of ordinary skill in the art would have had a reasonable
`expectation that the combination would have been successful. (Houh, ¶113.) It
`would have involved the straightforward application of conventional techniques for
`distributed processing (and audio processing). (Id.) As Kritt notes, cloud computing
`IPR2025-00877
`Patent Owner 2013
`Page 21 of 70
`
`
`
`
`
`
`
`for example was already well-established, as were techniques for speech and speaker
`recognition.
`(b). “an image detector configured to detect an image of an
`object in the still frame images, wherein the image
`detector is adjustable to increase detection of non -
`primary images in the video; and” (Claim 1[b])
`114. As explained below, claim 1[b] is obvious over Zhao in view of
`Steinberg. (Houh, ¶¶114-124.)
`115. The ’853 patent describes image “detection” and “recognition” as a
`two-step process, first “detection,” then “recognition.”:
`FIG. 8 illustrates a flow diagram of an embodiment of image
`recognition. Images can be classified as faces and/or objects. Image
`recognition can include two components: image detection 800 and
`image recognition 810. Image detection can be utilized to determine if
`there is a pattern or patterns in an image that meet the criteria of a face,
`image, or text. If the result is positive, the detection processing then
`moves to recognition, i.e. image matching.
`(’853, 13:39-46.) Zhao similarly discloses separate detection and recognition steps.
`(Houh, ¶¶115-116.)
`116. With respect to image detection, Zhao discloses “an image detector
`configured to detect an image of an object in the still frame images ”—such as
`processes to detect images of objects such as faces in extracted still frame images of
`a video. (Houh, ¶116.) Referring to Figure 2, reproduced below, Zhao explains that
`IPR2025-00877
`Patent Owner 2013
`Page 22 of 70
`
`
`
`
`
`
`
`“[o]nce the video has been retrieved, the video undergoes a process of detection of
`objects and extraction of object features from video images (processes 210, 220)”
`(Zhao, ¶0057):
`
`(Zhao, Fig. 2 (highlighting added).) Zhao describes in detail image detection process
` 210 to detect an object image from video images. (:
`At process 210 , object images are detected from video scenes using
`wavelet features of the images in combination with an Adaboost
`classifier…. As described herein a “feature” or “local feature” generally
`IPR2025-00877
`Patent Owner 2013
`Page 23 of 70
`
`
`
`
`
`
`
`describes an element of significance within an image that enables
`recognition of an object within the image. A feature typically describes
`a specific structure in an image, ranging from simple structures such as
`points, edges, corners, and blobs, to more complex structures such as
`entire objects. For example, for facial recognition, features include
`eyes, noses, mouths, ears, chins, etc., of a face in an image, as well as
`the corners, curves, color, overall shape, etc., associated with each
`feature. Features are d escribed in greater detail below and throughout
`this document. Generally, an AdaBoost (short for “Adaptive Boosting”)
`classifier refers to a machine-learning algorithm, and may be used in
`conjunction with other algorithms to improve overall performance. The
`AdaBoost classifier assists the system in learning to identify and detect
`certain types of objects, as described in greater detail below.
`(Zhao, ¶0058.)
`117. Claim 1[b] further recites “wherein the image detector is adjustable
`to increase detection of non-primary images in the video.” Zhao does not disclose
`image detection functionality that is “ adjustable to increase detection of non -
`primary images in the video,” but this would have been obvious in view of
`Steinberg. (Houh, ¶¶117-124.)
`118. Zhao does describe techniques for a system operator to adjust settings
`based on desired accuracy and/or precision of recognition and/or object tracking,
`and also acknowledges the issue of false facial recognitions produced by extras or
`crowds in a movie. (Houh, ¶118.) For example, in connection with an object
`IPR2025-00877
`Patent Owner 2013
`Page 24 of 70
`
`
`
`
`
`
`
`tracking/recognition process, Zhao explains how a system operator can adjust a
`similarity threshold. (Zhao, ¶0089 (“[T]he threshold value may be raised or lowered
`by a system operator depending on whether the operator would rather have higher
`accuracy, or more identifications.”):
`At step 525, each average confidence score or similarity measure is
`compared to a predetermined threshold. In one embodiment, the
`predetermined threshold is set by a system operator based on a desired
`accuracy and/or precision of recognition and/or tracking. For example,
`a higher threshold yields more accurate results because a similarity
`measure must exceed that threshold to be identified as matching the
`given model. Thus, if a detected object group and an identified object
`model have a high similarity value, the re is therefore a higher
`percentage chance that the object group was correctly recognized.
`However, a higher threshold value also produces a lower recall, as some
`images or groups that were in fact correctly recognized are discarded
`or ignored because they do not meet a high similarity threshold (i.e. the
`group was correctly recognized, but had too many differences from the
`model based on its images’ poses, resolutions, occlusions, etc.).
`Alternatively, a low threshold leads to higher recognition of images as
`matching a given model, but it is also likely to produce a higher
`percentage of false identifications. As will be understood, the threshold
`value may be raised or lowered by a system operator depending on
`whether the operator would rather have higher accuracy, or more
`identifications.
`(Zhao, ¶0089.) Zhao also teaches that whether a high or low similarity threshold is
`IPR2025-00877
`Patent Owner 2013
`Page 25 of 70
`
`
`
`
`
`
`
`desirable depends on the video content —for example, “in a facial recognition
`embodiment, extras or crowds in a movie often produce false recognitions.” (Zhao,
`¶0091; Houh, ¶118.):
`In addition to the purposes and advantages of object tracking described
`above, a further benefit is that false positives of object recognitions are
`reduced because they likely do not fit into clearly established groups in
`a given video. For example, in a facial recognition embodiment, extras
`or crowds in a movie often produce false recognitions. However,
`because these recognitions are likely to be random and cannot be
`tracked over time (based on inconsistent or infrequent occurrences in a
`video), these false recognitions can be discarded as not belonging to a
`distinct group, and thus can be assumed to be false positives.
`(Zhao, ¶0091.)
`119. Like Zhao, Steinberg also relates to facial detection in digital images.
`(Houh, ¶119 (citing Steinberg notes that “[t]he problem of face detection has not
`received a great deal of attention from researchers. Most conventional techniques
`concentrate on face recognition, assuming that a reg



