Public PhD Defence

Scene Perception: Generating and exploiting vision-based meta-data

Dutch version

General Info

Date: 24 August 2021 at 17:00
Venue: Jan Pieter de Nayerlaan 5, 2860 St-Katelijne-Waver
Room: K104
Submitted Text: Link to Lirias

Since this was a public PhD defence, feel free to share this page with others who might be interested in my presentation.

Livestream Recording


We humans are well trained in understanding what our eyes see. A mere glance suffices to perceive visual information in detail. In the current digital age, computers are gradually becoming capable of similar scene perception. Much research effort goes into computer vision techniques for image classification, object detection, person re-identification and other tasks. However, these techniques are often evaluated and compared using large generic datasets. This methodology is indeed required to assess the performance of a technique against other state-of-the-art approaches. Yet their performance in real-life applications often remains unknown and unexplored. In this PhD, we research how well state-of-the-art techniques perform in several real-life use cases. For each use case, each posing different challenges, we follow two research steps. First, we extract vision-based meta-data as an abstract intermediate description of the scene. This meta-data can represent various data types, e.g. bounding box coordinates around objects. Depending on the application, further use-case-specific meta-data can be derived, e.g. steering coordinates for a pan-tilt-zoom camera. On both meta-data types, we compare and evaluate performance, taking into account the use-case-specific challenges. Our second step then assesses how the extracted meta-data performs for specific real-life exploitation, e.g. triggering camera recording.

Within this PhD, we focused on four use cases. In the first, we worked together with a production house to record a new type of reality TV show. People were recorded 24/7 using several pan-tilt-zoom cameras placed inside their house. Recording 24/7 results in a massive amount of footage, along with intensive manual labour to steer the cameras. In this use case, we compared several state-of-the-art techniques to develop a system capable of autonomously steering the cameras to take cinematographically pleasing medium shots. To reduce the amount of footage, we used room activity to trigger recordings. Lastly, we combined several state-of-the-art techniques to generate event timelines in which the editor can search for specific events. These timelines include person re-identification, action classification and sound classification results.
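To illustrate how bounding box meta-data could be turned into steering coordinates for a pan-tilt-zoom camera, here is a minimal sketch. It is not the system developed in the thesis: the function name `steering_offsets`, the pinhole camera model with square pixels, and the use of the horizontal field of view as the only camera parameter are all assumptions made for illustration.

```python
import math


def steering_offsets(box, image_size, hfov_deg):
    """Return hypothetical (pan, tilt) offsets in degrees that would centre a box.

    box        -- (x_min, y_min, x_max, y_max) in pixels
    image_size -- (width, height) in pixels
    hfov_deg   -- horizontal field of view in degrees (assumed known)
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size

    # Centre of the detected object in pixel coordinates.
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0

    # Focal length in pixels under a pinhole model with square pixels.
    focal = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

    # Positive pan turns right; positive tilt turns up.
    # Image y grows downwards, hence the sign flip for tilt.
    pan = math.degrees(math.atan((cx - width / 2.0) / focal))
    tilt = math.degrees(math.atan((height / 2.0 - cy) / focal))
    return pan, tilt


if __name__ == "__main__":
    # A person detected in the right half of a 1920x1080 frame.
    print(steering_offsets((1200, 300, 1500, 900), (1920, 1080), 60.0))
```

A box centred on the right image border yields a pan offset of half the field of view, which is a quick sanity check on the geometry. Choosing the zoom level for a medium shot is a separate step that this sketch does not cover.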

In the next use case, we developed a system to count people using an overhead embedded system with an omni-directional camera. Placed, for example, above flex-desks or meeting-room tables, this system measures the degree of occupancy so that their usage can be optimised. However, privacy regulations apply when placing visual sensors in work environments. To comply with these regulations, our system runs on an embedded platform and transmits no visual data. Furthermore, we use low-resolution images in which people are unidentifiable. To improve accuracy, we propose to exploit temporal information using interlacing kernels, compensating for the data loss incurred when down-scaling the visual data. Our results show acceptable performance at a resolution of 48×48, running at 0.1 FPS on an embedded platform.

In our third use case, we developed a tool to automatically extract areas-of-interest from mobile eye-tracker recordings. These eye-trackers are a type of smart glasses capable of recording the wearer’s perspective, along with their gaze location. As opposed to screen-mounted eye-trackers, mobile eye-trackers are unobtrusive and portable, allowing their use in, for example, human-human conversations. However, each study produces new unseen data, which causes a heavy workload for human annotators after each recording. Our tool uses a state-of-the-art pose estimator to autonomously calculate person areas-of-interest, e.g. the head area, while imitating the allowed margin set by a human annotator. Furthermore, we allow the user to create identity labels with limited manual effort, with which the system can add person identities to each area-of-interest. By combining the gaze location with the areas-of-interest, we can afterwards classify the gaze labels for each user study.
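The combination of gaze location and areas-of-interest described above can be sketched as a point-in-box test. This is only an illustration under stated assumptions: the helper names `head_area` and `classify_gaze`, the rectangular area shape, and the relative `margin` parameter are hypothetical and do not come from the tool itself.

```python
def head_area(keypoints, margin=0.2):
    """Box around head keypoints, enlarged by a relative margin on each side.

    keypoints -- list of (x, y) pixel positions, e.g. eyes, ears and nose
                 as returned by some pose estimator (assumed input format)
    margin    -- fraction of the box width/height added on every side,
                 standing in for the tolerance a human annotator would allow
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    return (x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y)


def classify_gaze(gaze, areas):
    """Return the label of the first area containing the gaze point, else None.

    gaze  -- (x, y) gaze location in the same pixel coordinates
    areas -- dict mapping a label to a box (x_min, y_min, x_max, y_max)
    """
    gx, gy = gaze
    for label, (x0, y0, x1, y1) in areas.items():
        if x0 <= gx <= x1 and y0 <= gy <= y1:
            return label
    return None


if __name__ == "__main__":
    # Hypothetical head keypoints for one person in one video frame.
    areas = {"person_A/head": head_area([(100, 100), (140, 100), (120, 130)], 0.25)}
    print(classify_gaze((95, 95), areas))   # inside the enlarged head box
    print(classify_gaze((80, 95), areas))   # outside every area
```

Running this per frame would give one gaze label per frame, which could then be aggregated per user study.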

The last use case involves an X-ray scanner in a hospital. To allow for more flexibility, modern X-ray scanners are attached to a robotic arm, which enables multiple scan positions by moving around the scanner room. However, additional safety precautions are needed to ensure that no collisions can occur between the robotic arm and the people present. In this use case, we developed a system capable of calculating a multi-view 3D occupancy map of the persons in the scanner room. Afterwards, this 3D occupancy map can be used to restrict robotic movements and so avoid person-robot collisions.
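As a rough illustration of how a 3D occupancy map could restrict robotic movements, the sketch below checks a planned path against a set of occupied voxels. The voxel representation, the function names `to_voxel` and `path_is_safe`, and the safety margin expressed in voxels are assumptions for this example only, not the representation used in the thesis.

```python
import math


def to_voxel(point, voxel_size):
    """Map a 3D point in metres to integer voxel indices."""
    return tuple(math.floor(c / voxel_size) for c in point)


def path_is_safe(path, occupied, voxel_size, margin_voxels=1):
    """Return True if no path point comes within the margin of an occupied voxel.

    path          -- list of (x, y, z) positions the robot arm would visit
    occupied      -- set of (i, j, k) voxel indices marked as containing a person
    voxel_size    -- edge length of a voxel in metres
    margin_voxels -- number of neighbouring voxels treated as a safety buffer
    """
    offsets = range(-margin_voxels, margin_voxels + 1)
    for point in path:
        vx, vy, vz = to_voxel(point, voxel_size)
        for dx in offsets:
            for dy in offsets:
                for dz in offsets:
                    if (vx + dx, vy + dy, vz + dz) in occupied:
                        return False
    return True


if __name__ == "__main__":
    occupied = {(2, 2, 0)}  # one voxel occupied by a person (hypothetical)
    print(path_is_safe([(0.1, 0.1, 0.1)], occupied, voxel_size=0.5))
    print(path_is_safe([(0.7, 0.7, 0.2)], occupied, voxel_size=0.5))
```

A planner could reject or re-plan any movement for which this check fails. A real system would also have to account for the full arm geometry rather than single points.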

Each of the aforementioned use cases involves a real-life application with its own real-life challenges. Within this PhD, we use and compare several techniques capable of extracting abstract vision-based meta-data. Based on this abstract meta-data, actual real-life exploitation can be achieved. Our research shows that, by pushing techniques beyond evaluations on large generic datasets, we gain different insights into their performance for real-life exploitation.