(19)
(11) EP 2 276 007 A1

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication:
19.01.2011 Bulletin 2011/03

(21) Application number: 09165782.5

(22) Date of filing: 17.07.2009
(51) International Patent Classification (IPC): 
G08B 25/14(2006.01)
G08B 13/16(2006.01)
(84) Designated Contracting States:
AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO SE SI SK SM TR
Designated Extension States:
AL BA RS

(71) Applicant: Nederlandse Organisatie voor Toegepast -Natuurwetenschappelijk Onderzoek TNO
2628 VK Delft (NL)

(72) Inventors:
  • Kooi, Frank Leonard
    3994 XE Houten (NL)
  • Kranenborg, Kim
    3769 BW Soesterberg (NL)

(74) Representative: Hatzmann, Martin 
Vereenigde Johan de Wittlaan 7
2517 JR Den Haag (NL)

   


(54) Method and system for remotely guarding an area by means of cameras and microphones.


(57) Method and system for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post, comprising the steps of displaying, at an observation screen, the various camera and microphone locations on a map of said area; enabling selective activation, e.g. by an operator, of camera images for zooming in; deriving, per microphone or group of microphones, an attention value based on the sound picked up by that sound source; and outputting, when the attention value passes a predetermined threshold value, a representation of the sound picked up by the sound source causing the threshold passage, called sound representation hereinafter, including an audible and/or visual representation of the location of the sound source causing the threshold passage, called location representation hereinafter. The sound representation may include fragmented parts of the sound picked up by the sound source, the fragmentation being such that the overall semantic intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. The sound representation may in addition be processed such that the intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. The location representation may be performed by means of spatial audible reproduction of the relevant sound representation in the vicinity of said observation screen and/or by means of visual display of the location of the sound source causing said threshold passage.




Description


[0001] The present invention refers to a method and system for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post.

[0002] Surveillance cameras for monitoring public areas are widely used, especially in urban areas. Although such cameras are very useful in guarding such areas, the effectiveness of such systems could be improved.

[0003] It is one aim of the present invention to improve the effectiveness of such systems by combining them with audible information. One problem which has to be overcome is that in most countries legal privacy regulations forbid eavesdropping (except under special conditions).

[0004] Because of such privacy based restrictions, another aim of the invention is to provide a method and system in which audible information is used, however, without infringing the privacy regulations.

[0005] Still another aim of the invention is to provide a system which makes remote monitoring of (urban) areas more lively for the operator (e.g. guardsman), as the visual information offered by the video cameras is supplemented by accompanying "real live audio", however without passing on (private) conversations etc. in a way that their content could be followed, i.e. understood, by the operator.

[0006] Yet another aim is to provide that the operator can semantically comprehend the emotional components (in particular fear, anger, excitement etc.) in the audible signals picked up in the vicinity of the cameras. These components should, after transfer to the operator, attract the operator's attention in a natural way and prompt him/her to pay attention to the location at which such an (e.g. excited) audible signal originated or was recorded.

[0007] To comply with those aims, it is preferred that, in a method for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post, the following steps are included:
  • displaying, at an observation screen, the various camera and microphone locations on a map of the area;
  • enabling selective activation, e.g. by a screen observing operator, of one or more camera images for pointing and/or zooming in;
  • deriving, per microphone or group of microphones, called sound source hereinafter, an attention value based on the sound picked up by that sound source;
  • outputting, when the attention value passes a predetermined threshold value, a representation of the sound picked up by the sound source causing the threshold passage, called sound representation hereinafter, including an audible and/or visual representation of the location of the sound source causing the threshold passage, called location representation hereinafter.
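By way of non-limiting illustration, the steps listed above can be sketched in code. The sketch below is not the claimed implementation: the names `SoundSource` and `sources_to_report`, the normalised map coordinates, the 0..1 attention scale and the threshold of 0.7 are all assumptions made for the example, and the attention values are taken as already derived.

```python
from dataclasses import dataclass


@dataclass
class SoundSource:
    """A microphone or group of microphones shown on the area map (hypothetical model)."""
    name: str
    map_x: float      # horizontal map position, 0.0 (left) to 1.0 (right), assumed scale
    map_y: float      # vertical map position, 0.0 (top) to 1.0 (bottom), assumed scale
    attention: float  # attention value derived from the picked-up sound, assumed 0..1


def sources_to_report(sources, threshold):
    """Return the sound sources whose attention value passes the threshold."""
    return [s for s in sources if s.attention > threshold]


sources = [
    SoundSource("Mic A", 0.20, 0.30, attention=0.10),
    SoundSource("Mic B", 0.85, 0.90, attention=0.92),
]

for s in sources_to_report(sources, threshold=0.7):
    # Location representation: the map position the operator should be guided to.
    print(f"{s.name}: highlight map position ({s.map_x:.2f}, {s.map_y:.2f})")
# prints "Mic B: highlight map position (0.85, 0.90)"
```

Only the source that passes the threshold is reported, so the operator is not presented with sound from locations where nothing unusual is detected.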


[0008] To comply with the requirements of the privacy legislation, it may be preferred that the sound representation is processed such that eavesdropping is prevented, e.g. by time and/or frequency domain filtering and/or scrambling, such as fragmenting the sound representation. The sound representation supplied to the operator then includes fragmented parts (e.g. having maximum lengths of for example 10 seconds) of the sound picked up by the sound source, so that the overall semantic or linguistic intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping.

[0009] As another way to meet the privacy regulations, it may be preferred that the sound representation includes at least part of the sound picked up by the relevant sound source, however processed such that the intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping, e.g. wherein the Speech Transmission Index (abbreviated STI, see for its definition e.g. en.wikipedia.org/wiki/Speech_Transmission_Index) of the processed sound is reduced (e.g. by means of signal scrambling or addition of noise) to e.g. 0.35 or less.

[0010] To comply with the aim that the relevant audible signals, picked up in the vicinity of the cameras and processed as indicated above, attract the operator's attention and guide him to the location on his observation screen where the (e.g. excited) sound originated, the location representation of that sound may preferably be performed by spatial (2 or 3 dimensional) audible reproduction of that sound representation in the vicinity of the observation screen. As such an observation screen (which may be formed by a group of cooperating display screens) will normally have rather large dimensions, the operator's attention can be attracted when the sound representations, originating at several microphone locations, are reproduced (i.e. when the attention value of the sound passes a predetermined threshold value) via a spatial audio reproduction system. It has to be noted that the sounds as such may be picked up by single channel microphones; their sound representations, however, are reproduced via a spatial audio system in the vicinity of the observation screen in such a way that, in the operator's perception, the sound representations come from the direction of the location, as mapped on the observation screen, where the sound has been produced or recorded.

[0011] Additionally or alternatively, the location where the sound originated may be represented by means of visual display of that location, e.g. by means of any form of highlighting that location on the area map shown on the observation screen.

[0012] Hereinafter the method will be elucidated with reference to the figures, in which:

Figure 1 shows an exemplary embodiment of a system in which the method according to the invention can be performed;

Figure 2 shows the diagram of an exemplary embodiment of a subsystem for sound processing.



[0013] Figures 1 and 2 show a system for remotely guarding an area (the centre of Utrecht) using cameras and microphones at several locations within that area, which are connected to a central surveillance post, including an observation screen 1 arranged for displaying the various camera (Cam) and microphone (Mic) locations on a map of the area. The system includes means for executing the method as discussed hereinbefore including processing means and means for the reproduction of the sound representations, i.e. an event detector (ED) 2 and an intelligibility reductor (IR) 3, as well as means for the reproduction of the relevant location representations, i.e. a 2D renderer (2DR) 4 and a set of loudspeakers 5 for acoustic location representation, as well as a video screen driver (VD) 6 for visual location representation at the observation screen 1.

[0014] The relevant area can thus be monitored by means of cameras 7 and microphones 8 at several locations within the area, which are connected to a central surveillance post which accommodates the components shown in figures 1 and 2.

[0015] By means of the observation screen 1, the various camera and microphone locations are displayed on a map image of the area to be monitored. A screen observing operator 9 is able, e.g. by means of a keyboard, mouse, joystick (not shown) or touch screen, to select and activate cameras and/or camera images to zoom in and out; in addition, the operator may be able to move the cameras into different positions.

[0016] In the vicinity of each camera, microphones are installed, picking up the sound present in the camera's vicinity. In this way the sounds which are present in the vicinity of each camera are transmitted to the surveillance post, which accommodates the system. In the event detector 2 an attention value is derived per microphone or group of microphones (sound source), based on the sound picked up by that sound source. The event detector 2 analyzes the incoming sound and decides, e.g. based on the results of a frequency spectrum and energy level analysis, whether the incoming sound comprises elements like fear, excitement (e.g. screaming) or uncommon noise such as breaking glass. In such cases the attention value should pass a predetermined threshold value, indicating that there might be an event which should be investigated.
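By way of non-limiting illustration, the derivation of an attention value from an energy level analysis and a crude frequency analysis may be sketched as follows. The scoring formula, the function names, the 8 kHz sampling rate and the threshold of 20 are assumptions made for this example; the zero-crossing rate is used here merely as a simple stand-in for a frequency spectrum analysis, and the test signals are synthetic tones rather than recorded sound.

```python
import math


def rms_db(frame):
    """Energy level of one audio frame in dB relative to full scale (samples in -1..1)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + 1e-12)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign (crude pitch indicator)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


def attention_value(frame, background_db):
    """Hypothetical score: dB above the background level, boosted for high-pitched sound."""
    level_excess = max(0.0, rms_db(frame) - background_db)
    return level_excess * (1.0 + zero_crossing_rate(frame))


RATE = 8000  # samples per second (assumed)
quiet = [0.01 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(800)]
scream = [0.8 * math.sin(2 * math.pi * 2500 * n / RATE) for n in range(800)]

THRESHOLD = 20.0  # assumed value; in practice it would be tuned per location
background = rms_db(quiet)
print(attention_value(quiet, background) > THRESHOLD)   # False
print(attention_value(scream, background) > THRESHOLD)  # True
```

A practical event detector would need a classifier that distinguishes, for example, screaming from traffic noise; the sketch only shows how a single scalar attention value can be compared with a predetermined threshold.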

[0017] When the event detector 2 detects that the attention value passes a predetermined threshold value, it gives an "on" signal to the intelligibility reductor 3 to pass a representation of the sound picked up by the sound source causing the threshold passage, i.e. a sound representation having a reduced intelligibility. In addition, an audible representation of the location of the sound source causing the threshold passage (location representation) is given for the possibly buffered sample of the event sound, viz. by reproducing the sound representation (having a reduced intelligibility) by means of a 2D sound rendering subsystem (2DR) 4 and loudspeakers 5. By means of audio phase manipulation causing pseudo stereo/quadraphonic sound reproduction (see en.wikipedia.org/wiki/Quadraphonic_sound) and/or sound reproduction via a selected set of loudspeakers 5a and 5b, these provide that, in the perception of the operator 9 standing or sitting before his (widescreen) observation screen 1, the sound representation comes from the corresponding location on that observation screen (the lower right corner in figure 1). Besides the audible location representation, a visual location representation is also presented to the operator, viz. in the form of an image, e.g. as shown in figure 1 (again in the lower right corner), where the relevant microphone location and the neighbouring camera location have been accentuated by (bold) encircling of the relevant location. In this way the operator 9 is guided, in a natural and intuitive way, to pay attention to the location at which, according to the sound picked up by the microphone(s), something might be wrong. The operator may then activate the relevant camera (e.g. by using a touch screen or keyboard function) to zoom in; the zoomed image may be made visible via the same observation screen 1 or, as is suggested in figure 1, via one or more auxiliary screens.
In the illustrated example, the operator may have heard (the sound representation of) breaking glass and/or voices crying "Stop thief!!", is guided by that sound to the highlighted location on his screen 1, activates the relevant camera and sees on the auxiliary screen 10 a thief running away. The operator may then contact and inform the police.
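By way of non-limiting illustration, the audible location representation may be sketched as follows. The sketch uses constant-power amplitude panning over four loudspeakers, which is a substitute chosen for simplicity and not the audio phase manipulation mentioned above; the placement of the loudspeakers at the screen corners, the normalised map coordinates and the function names are assumptions made for the example.

```python
import math


def quad_gains(map_x, map_y):
    """Constant-power gains for four loudspeakers assumed at the screen corners.

    map_x: 0.0 = left edge, 1.0 = right edge of the observation screen.
    map_y: 0.0 = top edge,  1.0 = bottom edge.
    """
    left, right = math.cos(map_x * math.pi / 2), math.sin(map_x * math.pi / 2)
    top, bottom = math.cos(map_y * math.pi / 2), math.sin(map_y * math.pi / 2)
    return {
        "top_left": left * top,
        "top_right": right * top,
        "bottom_left": left * bottom,
        "bottom_right": right * bottom,
    }


def render(samples, map_x, map_y):
    """Scale a single-channel sound representation into one channel per loudspeaker."""
    gains = quad_gains(map_x, map_y)
    return {name: [g * x for x in samples] for name, g in gains.items()}


# An event near the lower right corner of the map is reproduced mostly there.
gains = quad_gains(0.9, 0.9)
print(max(gains, key=gains.get))  # bottom_right
```

The squared gains always sum to one, so the perceived loudness stays roughly constant while the apparent direction follows the map position of the sound source.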

[0018] Concerning the sound representation made in the IR module 3, this may include making separate fragmented parts of the sound picked up by the sound source (the microphone(s)), the fragmentation being such that the overall semantic intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. When the length of each fragmented part is limited (e.g. to 10 seconds or less), the intelligibility is decreased and relating a spoken phrase to a particular individual is thus made infeasible.
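By way of non-limiting illustration, the fragmentation may be sketched as follows. The function name `fragment`, the discarding of the sound between fragments and the gap length of 5 seconds are assumptions made for the example; the description only bounds the length of each fragmented part.

```python
def fragment(samples, rate, max_fragment_s=10.0, gap_s=5.0):
    """Split a recording into separated fragments of at most max_fragment_s seconds.

    The sound between fragments is discarded, so that no continuous
    conversation is passed on. gap_s is an assumed value.
    """
    keep = int(max_fragment_s * rate)  # samples kept per fragment
    skip = int(gap_s * rate)           # samples dropped between fragments
    if keep < 1 or skip < 0:
        raise ValueError("fragment length must be positive and gap non-negative")
    fragments = []
    start = 0
    while start < len(samples):
        fragments.append(samples[start:start + keep])
        start += keep + skip
    return fragments


rate = 100                          # samples per second, kept small for the example
recording = list(range(40 * rate))  # 40 seconds of placeholder samples
parts = fragment(recording, rate)
print([len(p) / rate for p in parts])  # [10.0, 10.0, 10.0]
```

Whether a given combination of fragment length and gap reduces intelligibility sufficiently would have to be assessed against the applicable privacy regulations.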

[0019] Another or an additional method for intelligibility reduction is to process (e.g. by scrambling and/or distortion) the sound from the originating sound source such that the intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping. In practice it has been found that a Speech Transmission Index of at most 0.35 for the processed sound corresponds to the desired lower intelligibility.
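By way of non-limiting illustration, one simple form of such processing, the addition of noise at a chosen signal-to-noise ratio, may be sketched as follows. The use of white Gaussian noise, the function names, the target of -5 dB and the synthetic test tone are assumptions made for the example; the sketch does not itself measure the Speech Transmission Index of the result.

```python
import math
import random


def power(samples):
    """Mean square value of a signal."""
    return sum(x * x for x in samples) / len(samples)


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise so that the signal-to-noise ratio is about snr_db."""
    noise_power = power(samples) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]


rng = random.Random(0)  # fixed seed so that the example is repeatable
speech = [math.sin(2 * math.pi * 300 * n / 8000) for n in range(8000)]  # placeholder tone
masked = add_noise(speech, snr_db=-5.0, rng=rng)

# Check the achieved signal-to-noise ratio (in dB) against the target.
noise = [m - s for m, s in zip(masked, speech)]
measured_snr_db = 10 * math.log10(power(speech) / power(noise))
print(round(measured_snr_db, 1))
```

Because the noise is random, the measured ratio only approximates the target; the longer the signal, the closer the two become.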

[0020] The Speech Transmission Index (STI) is a measure of the intelligibility (understanding) of speech, whose value varies from 0 (completely unintelligible) to 1 (perfect intelligibility). On this scale, an STI of at least 0.5 is desirable for most applications (Steeneken, H. J. M., & Houtgast, T. (1980). A physical method for measuring speech-transmission quality. Journal of the Acoustical Society of America, 67, 318-326).
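By way of non-limiting illustration, the relation between noise level and STI can be sketched for one idealised case. The sketch assumes that stationary noise is the only degradation (no reverberation, no masking or hearing-threshold corrections) and that the signal-to-noise ratio is the same in every octave band. Under those assumptions the apparent signal-to-noise ratio used by the STI method equals the actual one, it is limited to the range from -15 dB to +15 dB, and the STI becomes a linear function of it. The function names are hypothetical, and an actual STI measurement follows the full procedure.

```python
def sti_from_snr(snr_db):
    """Simplified STI for stationary noise with the same SNR in every octave band."""
    clipped = max(-15.0, min(15.0, snr_db))  # the method limits the range to +/-15 dB
    return (clipped + 15.0) / 30.0           # map -15..+15 dB linearly onto 0..1


def snr_for_sti(sti):
    """Inverse of the mapping above, valid for 0 <= sti <= 1."""
    return 30.0 * sti - 15.0


print(round(snr_for_sti(0.35), 1))  # -4.5
print(round(sti_from_snr(0.0), 2))  # 0.5
```

In this idealised case an STI of 0.35 corresponds to noise about 4.5 dB stronger than the speech, whereas equal speech and noise levels give an STI of 0.5.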


Claims

1. Method for remotely guarding an area by means of cameras and microphones at several locations within that area, which are connected to a central surveillance post, comprising the following steps:

- displaying, at an observation screen (1), the various camera and microphone locations on a map of said area;

- enabling selective activation, e.g. by a screen observing operator (9), of one or more camera images for zooming in;

- deriving, per microphone or group of microphones, called sound source hereinafter, an attention value based on the sound picked up by that sound source;

- outputting, when the attention value passes a predetermined threshold value, a representation of the sound picked up by the sound source causing the threshold passage, called sound representation hereinafter, including an audible and/or visual representation of the location of the sound source causing the threshold passage, called location representation hereinafter.


 
2. Method according to claim 1, wherein said sound representation includes fragmented parts of the sound picked up by the sound source, the fragmentation being such that the overall semantic intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping.
 
3. Method according to claim 2, wherein the length of each fragmented part has a maximum of 10 seconds.
 
4. Method according to any preceding claim, wherein said sound representation includes at least part of the sound picked up by the relevant sound source, however, processed such, e.g. by means of time and/or frequency domain scrambling, distorting, filtering etc., that the intelligibility of the sound is reduced to a level which complies with the relevant privacy regulations related to eavesdropping.
 
5. Method according to claim 4, wherein the Speech Transmission Index of the processed sound has a maximum of 0.35.
 
6. Method according to any preceding claim, wherein said location representation is performed by means of spatial audible reproduction of the relevant sound representation in the vicinity of said observation screen.
 
7. Method according to any preceding claim, wherein said location representation is performed by means of visual display of the location of the sound source causing said threshold passage.
 
8. System for remotely guarding an area using cameras and microphones at several locations within that area, which are connected to a central surveillance post, including an observation screen (1) arranged for displaying the various camera and microphone locations on a map of said area; the system including means for executing the method according to any of the preceding claims, including processing means and means for the reproduction of said sound representations and location representations respectively.
 



