[0001] The present invention relates to a method and system for remotely guarding an area
by means of cameras and microphones at several locations within that area, which are
connected to a central surveillance post.
[0002] Surveillance cameras for monitoring public areas are widely used, especially in urban areas. Although such cameras are very useful in guarding such areas, the effectiveness of such systems could be improved.
[0003] It is one aim of the present invention to improve the effectiveness of such systems by combining them with audible information. One problem which has to be overcome is
that in most countries legal privacy regulations forbid eavesdropping (except under
special conditions).
[0004] Because of such privacy based restrictions, another aim of the invention is to provide
a method and system in which audible information is used, however, without infringing
the privacy regulations.
[0005] Still another aim of the invention is to provide a system which makes remote monitoring of (urban) areas more lively for the operator (e.g. a guard), as the visual information offered by the video cameras is supplemented by accompanying "real live audio", without, however, passing on (private) conversations etc. in such a way that their content could be followed, i.e. understood, by the operator.
[0006] Yet another aim is to enable the operator to comprehend the emotional components (in particular fear, anger, excitement etc.) of the audible signals picked up in the vicinity of the cameras. After transfer to the operator, these components should attract his or her attention in a natural way and trigger him or her to pay attention to the location at which such an (e.g. excited) audible signal originated or was recorded.
[0007] To comply with those aims, it is preferred that, in a method for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post, the following steps are included (an illustrative sketch follows this list):
- displaying, at an observation screen, the various camera and microphone locations
on a map of the area;
- enabling selective activation, e.g. by a screen observing operator, of one or more
camera images for pointing and/or zooming in;
- deriving, per microphone or group of microphones, called sound source hereinafter,
an attention value based on the sound picked up by that sound source;
- outputting, when the attention value passes a predetermined threshold value, a representation
of the sound picked up by the sound source causing the threshold passage, called sound
representation hereinafter, including an audible and/or visual representation of the
location of the sound source causing the threshold passage, called location representation
hereinafter.
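Purely by way of illustration, these steps may be wired together as in the following Python sketch. All names (sound_sources, derive_attention_value, make_sound_representation, present_location) and the threshold value are hypothetical and serve only to show the control flow; they are not part of the claimed method.

```python
# Illustrative control-flow sketch only; all names and the threshold value
# are hypothetical. `sound_sources` is assumed to yield (location, block)
# pairs, where `block` is one buffered audio block of a sound source.

ATTENTION_THRESHOLD = 0.8  # hypothetical predetermined threshold value

def guard_area(sound_sources, derive_attention_value,
               make_sound_representation, present_location):
    """Derive an attention value per sound source; when it passes the
    threshold, output the sound and location representations."""
    for location, block in sound_sources:
        attention = derive_attention_value(block)
        if attention > ATTENTION_THRESHOLD:
            representation = make_sound_representation(block)
            present_location(location, representation)
```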
[0008] To comply with the requirements of privacy legislation, it may be preferred that the sound representation is processed such that eavesdropping is prevented, e.g. by time and/or frequency domain filtering and/or scrambling, such as fragmenting the sound representation, so that the sound representation supplied to the operator includes fragmented parts -e.g. having maximum lengths of for example 10 seconds- of the sound picked up by the sound source, and so that the overall semantic or linguistic intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping.
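A minimal sketch of such fragmentation, assuming NumPy audio blocks and hypothetical fragment and gap lengths, could look as follows; a real system would choose these parameters such that the applicable regulations are satisfied.

```python
import numpy as np

def fragment_sound(signal: np.ndarray, sample_rate: int,
                   fragment_s: float = 10.0, gap_s: float = 5.0) -> np.ndarray:
    """Keep fragments of at most `fragment_s` seconds and mute the gaps in
    between, reducing the semantic intelligibility of any conversation."""
    out = np.zeros_like(signal)
    frag = int(fragment_s * sample_rate)       # samples per kept fragment
    hop = frag + int(gap_s * sample_rate)      # kept fragment plus muted gap
    for start in range(0, len(signal), hop):
        out[start:start + frag] = signal[start:start + frag]
    return out
```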
[0009] As another way to meet the privacy regulations, it may be preferred that the sound representation includes at least part of the sound picked up by the relevant sound source, however, processed such that the intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping, e.g. wherein the Speech Transmission Index (abbreviated STI, see for its definition e.g. en.wikipedia.org/wiki/Speech_Transmission_Index) of the processed sound is reduced, e.g. by means of signal scrambling or addition of noise, to a maximum of e.g. 0.35 or less.
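The STI itself is measured with modulated test signals and is not computed in the sketch below; the sketch merely illustrates one of the processing options named above, addition of noise at a fixed signal-to-noise ratio. The -3 dB default is a hypothetical starting point; in practice the SNR would be tuned until the measured STI stays at or below 0.35.

```python
import numpy as np

def add_masking_noise(signal: np.ndarray, snr_db: float = -3.0,
                      rng=None) -> np.ndarray:
    """Add white noise at the given signal-to-noise ratio (in dB) so that
    speech in `signal` becomes hard to follow."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```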
[0010] To comply with the aim that the relevant audible signals, picked up in the vicinity of the cameras and processed as indicated above, will attract the operator's attention and guide him to the location on his observation screen where the (e.g. excited) sound originated, the location representation of that sound may preferably be performed by spatial (2- or 3-dimensional) audible reproduction of that sound representation in the vicinity of the observation screen. As such an observation screen (which may be formed by a group of cooperating display screens) will normally have rather large dimensions, the operator's attention can be attracted when the sound representations, originating at several microphone locations, are reproduced (i.e. when the attention value of the sound passes a predetermined threshold value) via a spatial audio reproduction system. It has to be noted that the sounds as such may be picked up by single-channel microphones; their sound representations, however, are reproduced via a spatial audio system in the vicinity of the observation screen in such a way that, in the operator's perception, the sound representations come from the direction of the location, as mapped on the observation screen, where the sound was produced or recorded.
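As one possible realization of such spatial reproduction, the following sketch applies a constant-power stereo panning law, steering a mono sound representation toward the horizontal map position of its source on the observation screen. The normalized coordinate and the function name are assumptions for illustration; an actual 2D/3D renderer with more loudspeakers would use a correspondingly richer panning scheme.

```python
import numpy as np

def pan_to_screen_position(mono: np.ndarray, x: float):
    """Return (left, right) channels for a mono sound representation,
    panned to the normalized screen position x (0.0 = left edge,
    1.0 = right edge of the observation screen)."""
    angle = x * np.pi / 2.0          # 0 -> fully left, pi/2 -> fully right
    left = np.cos(angle) * mono      # constant-power panning law
    right = np.sin(angle) * mono
    return left, right
```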
[0011] Additionally or alternatively, the sound originating location may be represented by means of a visual display of the location where the sound was produced, e.g. by any form of highlighting of that location on the area map on the observation screen.
[0012] Hereinafter the method will be elucidated with reference to the figures, in which:
Figure 1 shows an exemplary embodiment of a system in which the method according to the invention can be performed;
Figure 2 shows a diagram of an exemplary embodiment of a subsystem for sound processing.
[0013] Figures 1 and 2 show a system for remotely guarding an area (the centre of Utrecht)
using cameras and microphones at several locations within that area, which are connected
to a central surveillance post, including an observation screen 1 arranged for displaying
the various camera (Cam) and microphone (Mic) locations on a map of the area. The
system includes means for executing the method as discussed hereinbefore, including processing means and means for the reproduction of the sound representations, i.e.
an event detector (ED) 2 and an intelligibility reductor (IR) 3, as well as means
for the reproduction of the relevant location representations, i.e. a 2D renderer
(2DR) 4 and a set of loudspeakers 5 for acoustic location representation, as well
as a video screen driver (VD) 6 for visual location representation at the observation
screen 1.
[0014] The relevant area thus can be monitored by means of cameras 7 and microphones 8 at
several locations within the area, which are connected to a central surveillance post
which accommodates the components shown in the figures 1 and 2.
[0015] By means of the observation screen 1, the various camera and microphone locations
are displayed on a map image of the area to be monitored. A screen observing operator
9 is able, e.g. by means of a keyboard, mouse, joystick (not shown) or touch screen,
to select and activate cameras and/or camera images to zoom in and out; in addition, the operator may be able to move the cameras into different positions.
[0016] In the vicinity of each camera, microphones are installed, picking up the sound present
in the camera's vicinity. In this way the sounds which are present in the vicinity of each camera are transmitted to the surveillance post, which accommodates the system.
In the event detector 2, an attention value is derived per microphone or group of microphones (sound source), based on the sound picked up by that sound source. The event detector 2 analyzes the incoming sound and decides -e.g. based on the results of a frequency spectrum and energy level analysis- whether the incoming sound comprises elements indicating fear or excitement (e.g. screaming), uncommon noise such as breaking glass, etc.
In such cases the attention value should pass a predetermined threshold value, indicating
that there might be an event which should be investigated.
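A minimal sketch of such an analysis, with a hypothetical weighting and a hypothetical 2 kHz band boundary, might combine the short-term energy level with the share of energy at higher frequencies, where screams and breaking glass typically stand out:

```python
import numpy as np

def attention_value(block: np.ndarray, sample_rate: int) -> float:
    """Derive an attention value from one audio block (energy level
    analysis combined with a coarse frequency spectrum analysis)."""
    rms = np.sqrt(np.mean(block ** 2))            # energy level
    spectrum = np.abs(np.fft.rfft(block)) ** 2    # power spectrum
    freqs = np.fft.rfftfreq(len(block), d=1.0 / sample_rate)
    high_share = spectrum[freqs > 2000.0].sum() / (spectrum.sum() + 1e-12)
    return float(rms * (1.0 + high_share))        # hypothetical combination
```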
[0017] When the attention value passes a predetermined threshold value, detected in the
event detector 2, this detector gives an "on" signal to the intelligibility reductor
3 to pass a representation of the sound picked up by the sound source causing the
threshold passage, i.e. a sound representation having a reduced intelligibility. In addition, an audible representation of the location of the sound source causing the threshold passage (location representation) is performed, possibly using a buffered sample of the event sound, viz. by reproducing the sound representation (having a reduced intelligibility) by means of a 2D sound rendering subsystem (2DR) 4 and loudspeakers 5, which -by means of audio phase manipulation causing pseudo stereo/quadraphonic sound reproduction (see en.wikipedia.org/wiki/Quadraphonic_sound) and/or sound reproduction via a selected set of loudspeakers 5a and 5b- provide that -in the perception of the operator 9, standing or sitting before his (widescreen) observation screen 1- the sound representation comes from the corresponding location at that observation screen (the lower right corner in figure 1). In addition to the audible location representation, a visual location representation is presented to the operator, viz. in the form of an image, e.g. as shown in figure 1 (again in the lower right corner), where the relevant microphone location and the neighbouring camera location have been accentuated by (bold) encircling of the relevant location. In this way the operator 9 will be guided -in a natural and intuitive way- to pay attention to the location at which -according to the sound picked up by the microphone(s)- something might be wrong. The operator may then activate the relevant camera (e.g. by using a touch screen or keyboard function) to zoom in, which may be made visible via the same observation screen 1 or -as is suggested in figure 1- via one or more auxiliary screens. In the illustrated example, the operator may have heard (the sound representation of) breaking glass and/or voices crying "Stop thief!!", is guided by that sound to the highlighted location on his screen 1, activates the relevant camera and sees on the auxiliary screen 10 a thief running away. The operator may then contact and inform the police.
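Reusing the hypothetical helper sketches given above (attention_value, fragment_sound, add_masking_noise, pan_to_screen_position), the signal flow of figure 2 could be glued together as follows:

```python
def process_block(block, sample_rate, location_x, threshold=0.8):
    """The event detector (ED) 2 gates the intelligibility reductor (IR) 3,
    whose output the 2D renderer (2DR) 4 pans toward the mapped location."""
    if attention_value(block, sample_rate) > threshold:                 # ED
        masked = add_masking_noise(fragment_sound(block, sample_rate))  # IR
        return pan_to_screen_position(masked, location_x)               # 2DR
    return None  # below threshold: nothing is passed on to the operator
```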
[0018] Concerning the sound representation made in the IR module 3, this may include making separate fragmented parts of the sound picked up by the sound source (the microphone(s)), the fragmentation being such that the overall semantic intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. When the length of each fragmented part is limited (e.g. to 10 seconds or less), the intelligibility is decreased and relating a spoken phrase to a particular individual thus becomes infeasible.
[0019] Another or an additional method for intelligibility reduction is to process (e.g.
by scrambling and/or distortion) the sound from the originating sound source such
that the intelligibility of the sound is reduced to a level which complies with the
relevant privacy regulations related to eavesdropping. In practice it has been found that a Speech Transmission Index of the processed sound of at most 0.35 achieves the desired reduced intelligibility.
1. Method for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post, comprising the following steps:
- displaying, at an observation screen (1), the various camera and microphone locations
on a map of said area;
- enabling selective activation, e.g. by a screen observing operator (9), of one or
more camera images for zooming in;
- deriving, per microphone or group of microphones, called sound source hereinafter,
an attention value based on the sound picked up by that sound source;
- outputting, when the attention value passes a predetermined threshold value, a representation
of the sound picked up by the sound source causing the threshold passage, called sound
representation hereinafter, including an audible and/or visual representation of the
location of the sound source causing the threshold passage, called location representation
hereinafter.
2. Method according to claim 1, wherein said sound representation includes fragmented
parts of the sound picked up by the sound source, the fragmentation being such that
the overall semantic intelligibility of the sound is reduced to a level which complies
with the relevant privacy regulations related to eavesdropping.
3. Method according to claim 2, wherein the length of each fragmented part has a maximum
of 10 seconds.
4. Method according to any preceding claim, wherein said sound representation includes
at least part of the sound picked up by the relevant sound source, however, processed
such, e.g. by means of time and/or frequency domain scrambling, distorting, filtering
etc., that the intelligibility of the sound is reduced to a level which complies with
the relevant privacy regulations related to eavesdropping.
5. Method according to claim 4, wherein the Speech Transmission Index of the processed
sound has a maximum of 0.35.
6. Method according to any preceding claim, wherein said location representation is performed
by means of spatial audible reproduction of the relevant sound representation in the
vicinity of said observation screen.
7. Method according to any preceding claim, wherein said location representation is performed
by means of visual display of the location of the sound source causing said threshold
passage.
8. System for remotely guarding an area using cameras and microphones at several locations
within that area, which are connected to a central surveillance post, including an
observation screen (1) arranged for displaying the various camera and microphone locations
on a map of said area; the system including means for executing the method according
to any of the preceding claims, including processing means and means for the reproduction
of said sound representations and location representations respectively.