Technical Field
[0001] Embodiments relate to providing protection against malware or other computer system
security attacks.
Background
[0002] Return-oriented programming (ROP) is a computer security exploit technique in which
an attacker uses software control of a stack to execute an attacker-chosen sequence
of machine instructions. These clusters of instructions typically end with a programmer-intended
or unintended return (RET) instruction within existing program code. The intended
or unintended RET instruction transfers execution to the attacker-chosen return address
on the stack and allows the attacker to retain execution control through the program
code, and to direct execution to the next chosen sequence of instructions to achieve
the attacker's intent. The clusters of attacker-chosen instruction sequences are referred
to as gadgets.
[0003] Often the executed gadget includes only several assembler instructions followed by
a RET instruction that can already perform a well-defined attack operation. By chaining
together a set of these gadgets such that the RET instruction from one gadget lands
into the next gadget and so on, the malware writer is able to execute a complex algorithm
without injecting any code into the program. Some of these instruction sequences ending
in a RET can be found in functions compiled into the program or libraries.
[0004] Thus the ROP technique involves delivering a payload having a chained list
of pointers to gadgets and parameters to a data memory of a program using vulnerabilities
like stack buffer overflows. The exploit also overwrites the return address of the
vulnerable function that was used to perform the stack buffer overflow to point to
the first gadget in the sequence. When this vulnerable function executes a RET instruction,
control transfers to the first gadget instead of the function caller. This gadget
may then consume one or more data elements from the stack payload. Using this exploit
type, the malware writer is able to change the control flow of the program by causing
a control transfer to a non-programmer intended location in the program (e.g., to
the middle of an instruction).
[0005] A ROP attack technique uses various characteristics of an x86 instruction set architecture
(ISA): variable length and unaligned instruction encoding; large and dense ISA encoding;
a stack holding control and data information; and a single byte opcode RET instruction.
Current techniques to defend against such attacks may be ineffective and have various
shortcomings.
[0006] US 2009/0172332 A1 describes an information processing device in which an instruction execution part outputs
an access request including a first address information to specify an access destination
based on an execution of an access command of an address space in a memory. The instruction
execution part also outputs a check request including a second address information
to specify a stack pointer point after extension based on an execution of a stack
extension command to extend a stack included in the address space in the memory by
updating a stack pointer. A protection violation detection section of the information-processing
device detects whether the access destination includes the plurality of the partial
spaces by collating the first information with the memory protection information stored
in the memory protection information storage section.
[0007] US 2010/0174893 A1 describes an apparatus comprising one or more registers, one or more range checking
registers comprising a set of one or more range bound registers indicating an upper
and a lower range bound value for the registers, an arithmetic and logic unit operable
to perform arithmetic and logic operations on one or more of the registers, and a
range checking unit operable to determine a target register. Thereby, an instruction
indicates operations to modify contents of the target register, determine a lower
range bound value and an upper range bound value for the target register, wherein
the set of one or more range bound registers indicate the upper range bound value
and the lower range bound value, determine, based on the lower range bound and the
upper range bound values, that a value to be stored in the target register is invalid;
and generate a range checking exception, wherein the range checking exception indicates
that the value to be stored in the target register violates any one of the lower and
the upper range bound values.
[0008] WO 2014/023894 A1 and the corresponding
US 2015/220328 A1 describe a method for detecting a subroutine call stack modification. The method
comprises, for each subroutine in a sequence of subroutines nested in an initial subroutine,
calling a current subroutine, placing a return address at the top of the stack. The
method further comprises, at the end of the current subroutine, using the address
at the top of the stack as the return address, and removing the address from the stack.
Then, a single memory location to the sequence of subroutines is assigned. The method
further comprises saving the initial value of the memory location before calling the
initial subroutine, and, when calling the current subroutine, accumulating the return
address in the memory location with a first operation. The method further comprises,
at the end of the current subroutine, accumulating the address from the top of the
stack in the memory location with a second operation, reciprocal of the first operation,
and, upon returning from the initial subroutine, detecting a modification when the
content of the memory location differs from the initial value.
[0009] US 7 581 089 B1 describes a method of protecting a return address on a computer stack. Two stacks
are created, the first a normal stack, and the second, or shadow, having shadow frames
containing the return address upon a subroutine call, the address on the first stack
where the return address is stored, and a user-definable state variable which is used
to identify a shadow frame as a return address. Before returning from a subroutine,
the two return addresses are compared, and if they do not match, the second stack
is searched down, and then up, for a matching return address. If there is a match,
the shadow is re-synchronized with the first stack by comparing the stored values
of the first stack pointer with the first stack pointer and adjusting appropriately
the shadow stack pointer. The matching shadow frame must also be of the return address
datatype.
Brief Description of the Drawings
[0010]
FIG. 1 is a block diagram of a portion of a processor in accordance with an embodiment
of the present invention.
FIG. 2 is a block diagram of a state machine in accordance with an embodiment of the
present invention.
FIG. 3A is a flow diagram of a method in accordance with a non-claimed alternative.
FIG. 3B is a flow diagram of another method in accordance with a non-claimed alternative.
FIG. 4 is a block diagram of a configuration register in accordance with an embodiment
of the present invention.
FIG. 5 is a block diagram of a call stack frame for code execution in accordance with
an embodiment of the present invention.
FIG. 6 is a block diagram of further details of legacy interworking in accordance
with an embodiment of the present invention.
FIG. 7 is a block diagram of a processor core in accordance with one embodiment of
the present invention.
FIG. 8 is a block diagram of a multicore processor in accordance with an embodiment
of the present invention.
FIG. 9 is a flow diagram of a method for valid gadget protection in accordance with
an embodiment of the invention.
FIG. 10 is a flow diagram of a method for valid gadget protection in accordance with
another embodiment of the invention.
FIG. 11 is a flow diagram of a method for valid gadget protection in accordance with
a still further embodiment of the invention.
FIG. 12 is a block diagram of an example system with which embodiments may be used.
FIG. 13 is a block diagram of a representative computer system.
FIG. 14 is a block diagram of a system in accordance with an embodiment of the present
invention.
Detailed Description
[0011] Embodiments provide a set of instruction set architecture (ISA) instructions that
enable a processor to determine whether a control transfer instruction is directed
to an appropriate target, and if not to take action to prevent instructions beyond
the control transfer instruction from being committed to the architectural state of
the processor. In this way, at least certain unintended control transfers within a
program can be eliminated, constraining the number of gadgets that a malware writer
can use. In addition, embodiments provide additional techniques to further constrain
the ability to chain together gadgets in an undesired manner.
[0012] More specifically, embodiments provide a set of control transfer termination (CTT)
instructions to allow software to specify valid control transfer terminating points
in a program such that hardware can enforce control transfers to occur to only programmer-intended
locations. These CTT instructions perform this enforcement with minimal performance
and energy impacts to the program. The CTT ISA extensions can thus mitigate the execution
of unintended gadgets in programs.
[0013] As more computer systems are used in Internet, text, and multimedia applications,
additional processor support has been introduced over time. In one embodiment, an
instruction set may be associated with one or more computer architectures, including
data types, instructions, register architecture, addressing modes, memory architecture,
interrupt and exception handling, and external input and output (I/O).
[0014] In one embodiment, the ISA may be implemented by one or more micro-architectures,
which include processor logic and circuits used to implement one or more instruction
sets. Accordingly, processors with different micro-architectures can share at least
a portion of a common instruction set. For example, Intel® Pentium 4 processors, Intel®
Core™, and Intel® Atom™ processors from Intel Corp. of Santa Clara, CA, and processors
from Advanced Micro Devices, Inc. of Sunnyvale CA implement nearly identical versions
of the x86 instruction set (with some extensions that have been added with newer versions),
but have different internal designs. Similarly, processors designed by other processor
development companies, such as ARM Holdings, Ltd., MIPS, or their licensees or adopters,
may share at least a portion of a common instruction set, but may include different processor
designs. For example, the same register architecture of the ISA may be implemented
in different ways in different micro-architectures using new or well-known techniques,
including dedicated physical registers, one or more dynamically allocated physical
registers using a register renaming mechanism (e.g., the use of a register alias table
(RAT), a reorder buffer (ROB) and a retirement register file). In one embodiment,
registers may include one or more registers, register architectures, register files,
or other register sets that may or may not be addressable by a software programmer.
[0015] In one embodiment, an instruction may include one or more instruction formats. Such
instruction format may indicate various fields (number of bits, location of bits,
etc.) to specify, among other things, the operation to be performed and the operand(s)
on which that operation is to be performed. Some instruction formats may be further
defined by instruction templates (or sub formats). For example, the instruction
templates of a given instruction format may be defined to have different subsets of
the instruction format's fields and/or defined to have a given field interpreted differently.
In one embodiment, an instruction is expressed using an instruction format (and, if
defined, in a given one of the instruction templates of that instruction format) and
specifies or indicates the operation and the operands upon which the operation will
operate.
[0016] A first CTT instruction, referred to herein as an ENDBRANCH instruction, is used
to identify valid locations in a program where a control transfer may be validly performed
using an indirect CALL or an indirect jump (JMP) instruction. A second CTT instruction,
referred to herein as an ENDRET instruction, is used to identify valid locations in
a program where a control transfer may be validly performed using a RET instruction.
[0017] In an embodiment, these instructions have the following characteristics: they are
4-byte opcodes chosen such that they do not naturally occur in programs; and they
are defined as no operation (NOP) currently in the x86 ISA to allow programs compiled
with ENDBRANCH/ENDRET instructions to execute on earlier generation processors.
[0018] Although the scope of the present invention is not limited in this regard, in an embodiment
these CTT instructions may have a general form that includes a multi-byte opcode.
In one such implementation these CTT instructions may be represented by a four-byte
opcode that corresponds to an opcode value not presently existing in the current x86
ISA. Beyond this opcode, there may be no additional encoding for the instruction,
since the instruction executes as a no operation within an execution logic. As such
there is no identification of a source operand, destination operand or immediate value
to be associated with the instruction.
[0019] Referring now to FIG. 1, shown is a block diagram of a portion of a processor in
accordance with an embodiment of the present invention. As shown in FIG. 1, portion
100 of the processor includes various portions of a pipelined processor such as an
in-order or out-of-order processor. As seen, incoming instructions which may be macro-instructions
of a given ISA are provided to a decode unit 110 that is configured to decode the
instructions, e.g., into one or more smaller instructions such as micro-operations
(µops).
[0020] As seen, decode unit 110 includes a CTT logic 115 in accordance with an embodiment
of the present invention. In general, CTT logic 115 may analyze each incoming instruction
and determine whether the instruction is associated with a control transfer. If so,
CTT logic 115 may associate certain state information with one or more µops. This
state indicates a state of a state machine 116 that is modified by decoding of at
least certain control transfer and control transfer termination instructions. If instead
the instruction is not associated with a control transfer, a different state may be
associated with the one or more µops.
[0021] More specifically, when an incoming instruction is a control transfer instruction,
a transition of state machine 116 within CTT logic 115 may occur to pass from an idle
state to a given wait state. Furthermore, to reflect this wait state, a given encoding
can be associated with the one or more µops decoded from the incoming control transfer
instruction. If a next incoming instruction is a control transfer termination instruction
that immediately follows the control transfer instruction, then state machine 116
may return to the idle state and associate a given encoding with the decoded one or
more µops. As will be discussed, if a control transfer instruction is not immediately
followed by a control transfer termination instruction, CTT logic 115 may insert a
fault µop into the processor pipeline (and the state machine may remain in a wait
state).
[0022] Otherwise, if state machine 116 is in an idle state and an incoming instruction does
not relate to a control transfer (or termination), an encoding of idle state information
may be associated with the one or more µops to indicate that state machine 116 remains
in the idle state.
[0023] Thus as seen in FIG. 1, decode unit 110 outputs a stream of µops and associated state
information to indicate a state of state machine 116 within CTT logic 115. These µops
and state information may be provided to an execution logic 120, which can include
various types of units including arithmetic logic units (ALUs), floating point units
and so forth that thus execute operations indicated by the stream of µops. In an embodiment,
the CTT instructions only control the state transitions in state machine 200, and
in an execution logic of the processor these instructions execute as NOP and do not
cause any change in the program semantics.
[0024] In turn, results of the µops are provided to a retirement unit 130 configured to
determine whether given operations were successfully performed and to retire them
if so, and otherwise to raise a fault or exception if an undesired condition occurs
as a result of the execution. In an out-of-order processor, retirement unit 130 may
further operate to reorder instructions which may be executed in any order, back into
program order. When instructions properly retire, they may be provided to further
portions of a processor such as a memory subsystem.
[0025] As further seen in FIG. 1, retirement unit 130 includes a CTT fault logic 135 which
may be configured to determine whether appropriate behavior occurs with regard to
control transfer instructions. More specifically, CTT fault logic 135 may operate
to raise a fault when a given control transfer instruction to be retired is not directly
followed by an appropriate control transfer termination instruction, as described
herein. In an embodiment, this determination may be based at least in part on an inserted
fault µop and the state information communicated with the µops exiting from decode
unit 110. If a CTT fault µop is detected, a fault is raised and is communicated to
a fault handler 140, which may take various actions in accordance with a given handler
to resolve the faulting behavior. Thus in an embodiment, when a next instruction presented
to retire after a control transfer instruction is not an appropriate CTT instruction,
retirement unit 130 may deliver a fault responsive to this CTT fault µop by delivering
a fault-class exception (e.g., a general protection fault) such that that instruction
does not retire.
[0026] Still referring to FIG. 1, in the case where a misprediction occurs and instructions
are to be re-executed according to a correct branch, retirement unit 130 may communicate
via a feedback path with decode unit 110 to thus provide an indication of a proper
branch or other code flow to be taken. Still further, via this feedback path a state
machine recovery signal can be communicated such that state machine 116 of CTT logic
115 can be placed into an appropriate state to reflect this change in program flow.
Stated another way, when a fault µop is present in a mispredicted path, a fault is
not raised due to this misprediction and accordingly the state machine recovery signal
may cause state machine 116 to pass from a wait state back to an idle state or a wait
state, and to also indicate the last successful instruction to retire, so that decode
unit 110 can decode instructions of the correct branch. Although shown at this high
level in the embodiment of FIG. 1, understand that the scope of the present invention is
not limited in this regard.
[0027] Referring now to FIG. 2, shown is a block diagram of a state machine in accordance
with an embodiment of the present invention. As shown in FIG. 2, state machine 200
may correspond to CTT state machine 116 of FIG. 1. State machine 200 begins operation
in an IDLE state 210 into which the state machine is placed after reset of a processor.
[0028] When an indirect CALL or JMP instruction is decoded, the state machine enters the
WAIT_FOR_ENDBRANCH state 220. If the next instruction that is decoded is not an ENDBRANCH
instruction, then state machine 200 performs a DELIVER_FAULT operation 230 which causes
generation of a fault µop (and state machine 200 may remain in the WAIT_FOR_ENDBRANCH
state 220). If instead the next instruction to be decoded following a control transfer
instruction is an ENDBRANCH instruction, state machine 200 transitions back to the
IDLE state 210.
[0029] When a RET instruction is decoded, state machine 200 enters the WAIT_FOR_ENDRET state
240. If the next instruction that is decoded is not an ENDRET instruction, state machine
200 performs the DELIVER_FAULT operation 230. If the next instruction that is decoded
is an ENDRET instruction, state machine 200 transitions back to the IDLE state 210.
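The transitions described for state machine 200 can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware implementation: the names `insn_kind`, `ctt_state`, `ctt_step` and `ctt_first_fault` are hypothetical, the instruction classes stand in for real opcode decoding, and the trace is assumed to list instructions in the order they are decoded along the executed path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction classes; a real decoder would derive these from opcodes. */
typedef enum {
    INSN_OTHER,
    INSN_INDIRECT_CALL_JMP,
    INSN_RET,
    INSN_ENDBRANCH,
    INSN_ENDRET
} insn_kind;

typedef enum { CTT_IDLE, CTT_WAIT_FOR_ENDBRANCH, CTT_WAIT_FOR_ENDRET } ctt_state;

/* Advance the model by one decoded instruction.
 * *fault is set when the DELIVER_FAULT operation would be performed;
 * in that case the model stays in its current wait state. */
ctt_state ctt_step(ctt_state s, insn_kind k, bool *fault)
{
    assert(fault != NULL);
    *fault = false;
    switch (s) {
    case CTT_WAIT_FOR_ENDBRANCH:
        if (k == INSN_ENDBRANCH) return CTT_IDLE;
        *fault = true;
        return s;
    case CTT_WAIT_FOR_ENDRET:
        if (k == INSN_ENDRET) return CTT_IDLE;
        *fault = true;
        return s;
    case CTT_IDLE:
    default:
        if (k == INSN_INDIRECT_CALL_JMP) return CTT_WAIT_FOR_ENDBRANCH;
        if (k == INSN_RET) return CTT_WAIT_FOR_ENDRET;
        return CTT_IDLE; /* other instructions, including stray CTT NOPs */
    }
}

/* Return the index of the first instruction that would fault, or -1 if none. */
int ctt_first_fault(const insn_kind *trace, size_t n)
{
    ctt_state s = CTT_IDLE; /* state after processor reset */
    for (size_t i = 0; i < n; i++) {
        bool fault;
        s = ctt_step(s, trace[i], &fault);
        if (fault) return (int)i;
    }
    return -1;
}
```

In this model, a trace of an indirect CALL, ENDBRANCH, RET, ENDRET produces no fault, whereas a RET followed by any instruction other than ENDRET faults at that following instruction.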
[0030] Interrupts/exceptions/traps and VM exits that occur when state machine 200 is in
the WAIT_FOR_ENDBRANCH or WAIT_FOR_ENDRET states cause a transition of the state machine
to the IDLE state instead of the DELIVER_FAULT state (not shown for ease of illustration
in FIG. 2). This is so, as the fault delivered due to the target of the indirect CALL,
JMP or RET is higher priority than any faults caused by the target instruction itself
(e.g., #NM, #GP due to various conditions like alignment checks, canonical checks,
etc.) or any VM exits caused due to the target instruction being a VM exiting instruction
(e.g., exiting on XGETBV, CPUID, etc.). This operation ensures that these asynchronous
events do not cause unintended exceptions. Because these events are asynchronous to
normal program execution, an attacker cannot realistically force them to be delivered
after the CALL, JMP or RET, and thus they are not an effective way for an attacker
to bypass the ROP security.
[0031] Enabling CTT additionally has the following effects, in an embodiment: interrupt
returns (IRET) are disallowed and undefined (#UD) in a user level (e.g., ring 3) to
prevent their use for building ROP gadgets; a pop to flags register instruction (POPF)
does not update the trap flag (TF) bit in the system flags when executed at ring 3,
which prevents an attacker from causing a #DB deterministically by using a POPF instruction
RET gadget, because such an exception would transition the state machine to the IDLE state.
[0032] State machine 200 thus enforces the following rules: the instruction at the target
of a RET instruction must be an ENDRET instruction and the instruction at the target
of an indirect CALL or indirect JMP instruction must be an ENDBRANCH instruction.
If these rules are violated, then the violating instruction (the instruction at the
target of a RET or CALL/JMP instruction) faults and is prevented from retiring.
[0033] Thus by placing ENDBRANCH and ENDRET instructions in a program at valid control transfer
locations, a programmer or compiler can prevent unintended control transfers from
happening. This placement of ENDBRANCH and ENDRET instructions is as illustrated below
in Table 1, as an example:

[0034] In this example, an ENDBRANCH instruction is placed as the first instruction in the
subroutine foo and in the main program. And an ENDRET instruction is placed after
the CALL instruction to subroutine foo. Thus there are now 3 valid control transfer
points in this program. Specifically, in execution of the main program, a call instruction
(call *%rdx) is executed, causing a control transfer to the subroutine foo. More specifically
as seen, the first instruction in this subroutine is an ENDBRANCH instruction, such
that a valid control transfer occurs and the CTT state machine thus proceeds from
an IDLE state, to a WAIT_FOR_ENDBRANCH state and back to the IDLE state, without raising
a fault.
[0035] Similarly, at the conclusion of the subroutine foo, a return instruction (RETQ) is
executed, thus causing control to transfer to the first instruction after the calling
instruction in the main program. Here, this instruction is an ENDRET instruction and
as such, a valid control transfer occurs. In this case, the CTT state machine proceeds
from the IDLE state, to the WAIT_FOR_ENDRET state, and thereafter back to the IDLE
state, without raising a fault.
[0036] Thus using CTT in accordance with an embodiment of the present invention, a constraint
is introduced that a ROP gadget be preceded with an ENDRET to be usable. As such,
a significant reduction in the number of gadgets that can be harvested from a library
is realized, and the quality of such gadgets is significantly lower in terms of functionality
that these remaining gadgets expose, making ROP attacks harder to execute.
[0037] Referring now to FIG. 3A, shown is a flow diagram of a method in accordance with
a non-claimed alternative. As shown, method 300 may be performed by front end logic
including a CTT state machine as described herein. Note that the operations shown
in FIG. 3A relate to state machine operations for control transfer-related instructions.
For other instructions, if the state machine is currently in the idle state, it remains
there. As seen, method 300 begins by determining whether a feedback signal is received
to update the CTT state machine (diamond 310). In an embodiment, this feedback signal
may be received from a retirement unit or fault handler to cause the state of the
state machine to transition to a given state, e.g., due to a misprediction (as from
a retirement unit) or responsive to resolving a fault (as from a fault handler). If
such feedback signal was received, control passes to block 315 where the state machine
is updated with the state communicated through this feedback signal.
[0038] From both of these cases, control passes next to block 320 where an indication is received that
an indirect control transfer instruction such as a call, jump or return has occurred
(assuming that the decode unit has decoded such an instruction). As such, control
passes to block 325 where a transition into a wait state of the state machine may occur.
[0039] Still referring to FIG. 3A, control next passes to diamond 330 where it can be determined
whether an indication of receipt of a control transfer termination instruction is
received. If so, control passes to block 335 where the idle state of the state machine
may be entered, as pursuant to this proper CTT instruction following the control transfer
instruction a valid control transfer occurs.
[0040] If instead it is determined that the next decoded instruction is not a control transfer
termination instruction, control passes to block 340 where a control transfer termination
fault instruction can be inserted into the processor pipeline. Note here that the
state of the state machine does not change and thus remains in the selected wait state.
In an embodiment, this fault instruction is a µop that travels through the processor
pipeline and if it is selected for retirement, the retirement unit will cause a fault
to enable an OS-based fault handler to execute to determine the cause of the fault
and take appropriate action.
[0041] FIG. 3B is a flow diagram of another method in accordance with a non-claimed alternative.
Method 350 may be performed at least in part by logic of a retirement unit to handle
CTT-based retirement operations. As seen, method 350 begins by retiring a given instruction
and storing CTT state associated with the instruction (block 355). In an embodiment,
this information may be stored in a given storage of the retirement unit such as a reorder
buffer entry. As will be discussed further below, this state can be used in case a
misprediction occurs. Next control passes to diamond 360 to determine whether a misprediction
has occurred. If so, control passes to block 370 where information regarding the last
validly retired instruction present in an entry of the reorder buffer can be obtained
and sent back to CTT logic (of the decode unit) to enable updating the state of the
state machine into the appropriate state. Thereafter, typical retirement operations
may continue (at block 375).
[0042] Referring still to FIG. 3B, if a fault instruction is received (diamond 380), control
passes to block 385 where a call can be issued to a fault handler. As an example,
an OS-based fault handler may be executed. As part of this fault handling when the
fault is due to a CTT fault µop, a supervisor-based CTT state machine may be enabled
and used to access the state of the user mode CTT state machine to determine the reason
for the fault and to act accordingly. As an example, a target instruction (namely a non-CTT
target instruction) may be prevented from retiring and an appropriate correction mechanism
may be performed. Or the fault handler may take any other action. As part of such
operations, the fault handler may cause the user mode CTT state machine to be set
to the appropriate state.
[0043] As further shown in FIG. 3B, after completion of the fault handler, retirement operations
can be resumed responsive to control of the fault handler (block 390). Although shown
at this high level in the embodiment of FIG. 3B, it is to be understood that the scope
of the present invention is not limited in this regard.
[0044] With CTT instructions enforcing valid control transfer locations, software checks
can be placed after these instructions to further check for valid control transfers
using techniques like stack canaries. For the example discussed above, reference is
made to Table 2:
Table 2
|
main() {
    foo();
    endret;
    <detour/hook to anti-malware code to perform branch sanity check>
    ...
}
int foo() {
    return
} |
[0045] In the example above there is thus one place in the program (after the ENDRET) where
such a check is to be placed. Without CTT, software cannot effectively check all places
that can be used as gadgets as these gadgets can be crafted out of byte sequences
in the middle of valid instructions.
[0046] The instrumentation of a program using CTT instructions may be performed by a compiler.
In an embodiment, a just-in-time (JIT) compiler may perform the instrumentation of
the CTT instructions. Alternately such instrumentation can be performed by rewriting
the program binary to insert the CTT instructions using a binary rewriting tool that
reconstructs the control flow graph from the application binary. The binary rewriting
technique can be used in cases where the source of the binary is not available for
recompilation. Such binary rewriting may also be done by anti-malware software using
such tools.
[0047] In some cases, applications and libraries compiled with CTT instrumentation may be
merged with libraries that are not compiled with CTT instrumentation, such non-CTT
instrumented libraries referred to herein as "legacy libraries."
[0048] To support interworking with these legacy libraries, embodiments may provide additional
instructions. In one such embodiment, a suppression instruction, referred to herein
as a DISCTT instruction, is used to suppress the CTT state machine such that it stays
in the IDLE state instead of transitioning to the WAIT_FOR_ENDBRANCH or WAIT_FOR_ENDRET
states on an indirect CALL/JMP or RET, respectively. Additionally this instruction
returns into a general purpose register the state of the CTT suppression at the time
the instruction was issued. An enable instruction, referred to herein as an ENCTT
instruction, is used to remove the suppression of the CTT state machine put in place
by the DISCTT instruction such that the state machine enforces the CTT rules. Additionally
this instruction returns the state of the CTT suppression at the time the instruction
was issued.
[0049] The use of DISCTT and ENCTT instructions may be enabled for a process by an operating
system. If the operating system does not allow a program to disable CTT, then the
DISCTT instruction executes as a NOP and does not suppress CTT.
[0050] The use of the DISCTT and ENCTT instructions in a program to perform legacy interworking
is illustrated below in Table 3:
Table 3
|
// Issue a DISCTT before invoking a legacy library function foo()
temp_variable = DISCTT;
foo();
// If CTT was suppressed by DISCTT prior to this legacy library call
// then un-suppress it
IF (temp_variable == NOT_SUPPRESSED)
    ENCTT;
ENDIF |
[0051] Returning the previous state of CTT as a result of the DISCTT instruction allows
for supporting call chains like below:
CTT_function1->legacy_function1->CTT_function2->legacy_function2
[0052] Here the CTT_function1 issues a DISCTT instruction before calling the legacy_function1.
The DISCTT instruction returns the current state of CTT functionality as NOT_SUPPRESSED
and then suppresses the CTT functionality. The legacy_function1 calls the CTT_function2.
Now when the CTT_function2 calls legacy_function2, it again issues a DISCTT instruction.
The DISCTT instruction now returns the current state of the CTT functionality as SUPPRESSED
since it has been suppressed by CTT_function1. When the control returns from legacy_function2
to CTT_function2, it does not un-suppress the CTT functionality since it was already
suppressed when it was invoked. When the control returns to CTT_function1, it un-suppresses
the CTT functionality using the ENCTT instruction since it was suppressed by that
function.
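The nested call chain above can be sketched as a small software model. This is only an illustrative sketch, not a description of the hardware: the names CttControl, Suppression, call_legacy, and the trace list are hypothetical and introduced here for illustration. The sketch assumes the operating system permits suppression, and it models only the return-previous-state behavior of the DISCTT and ENCTT instructions.

```python
from enum import Enum


class Suppression(Enum):
    NOT_SUPPRESSED = 0
    SUPPRESSED = 1


class CttControl:
    """Toy model of the CTT suppression flag (hypothetical, for illustration)."""

    def __init__(self):
        self.state = Suppression.NOT_SUPPRESSED

    def disctt(self):
        """Return the state at issue time, then suppress CTT."""
        previous = self.state
        self.state = Suppression.SUPPRESSED
        return previous

    def enctt(self):
        """Return the state at issue time, then remove the suppression."""
        previous = self.state
        self.state = Suppression.NOT_SUPPRESSED
        return previous


def call_legacy(ctt, legacy_fn, trace):
    """Pattern of Table 3: suppress, call, un-suppress only if this caller suppressed."""
    previous = ctt.disctt()
    legacy_fn(ctt, trace)
    if previous is Suppression.NOT_SUPPRESSED:
        ctt.enctt()
    trace.append(ctt.state)  # state seen by the CTT function after the call


def legacy_function2(ctt, trace):
    trace.append(ctt.state)  # state seen inside the innermost legacy function


def ctt_function2(ctt, trace):
    call_legacy(ctt, legacy_function2, trace)


def legacy_function1(ctt, trace):
    ctt_function2(ctt, trace)


def ctt_function1(ctt, trace):
    call_legacy(ctt, legacy_function1, trace)
```

In this model, running ctt_function1 leaves CTT suppressed while control is in legacy_function2 and after the return to CTT_function2, and un-suppressed only once control is back in CTT_function1.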
[0053] Returning the previous state of CTT responsive to the ENCTT instruction allows for
a CTT-enabled library function to be called by a non-CTT enabled library/application
to un-suppress CTT before it starts executing and suppress CTT before returning to
the caller, if it was suppressed when the function was called.
[0054] This is as illustrated below in Table 4:
Table 4
|
Legacy_function1()
{
    CTT_function1();
}
CTT_function1()
{
    // ENDBRANCH is a NOP if this function was called with CTT
    // suppressed/disabled
    ENDBRANCH;
    // Un-suppress CTT. If already unsuppressed this is gratuitous
    temp_variable = ENCTT;
    ....
    ....
    ....
    // If CTT was suppressed when this function was called then suppress
    // it before returning
    IF (temp_variable == SUPPRESSED)
        DISCTT;
    ENDIF
    RET;
} |
[0055] Referring now to FIG. 4, shown is a block diagram of a configuration register to
store CTT state in accordance with an embodiment of the present invention. As shown
in FIG. 4, configuration register 400 may include various fields to store state values
used in performing CTT functionality. In an embodiment, two such configuration registers
may be provided, with each register associated with a particular mode of operation.
Stated another way, one configuration register may be used to control CTT operation
in a user mode (e.g., ring 3) while a second configuration register may be used to
control CTT functionality in a supervisor mode (e.g., rings less than 3).
[0056] In the embodiment shown, configuration register 400 includes an enable field 405
to store an enable indicator to indicate whether CTT is enabled for the current privilege
level. A legacy enable field 410 is used to store an indicator to indicate whether
legacy interworking is enabled. A suppression field 415 is used to store a suppression
indicator to indicate whether CTT faults and tracking are to be suppressed. A tracker
field 420 is used to store a value of the CTT state machine. In an embodiment, this
tracker field may be two bits where a value of 0 indicates the IDLE state, a value
of 1 indicates the WAIT_FOR_ENDRET state, and a value of 2 indicates the WAIT_FOR_ENDBRANCH
state. A reserved field 425 may be used for various extensions. Of course other fields
may be present in other embodiments.
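As a hedged illustration of how software might view such a register, the following sketch packs and unpacks the named fields. The bit positions are assumptions made purely for illustration, since only the fields and the two-bit tracker encoding are described above; the names pack and unpack are likewise hypothetical.

```python
# Hypothetical bit layout: the fields are named in the text, the positions are not.
ENABLE_BIT = 0         # enable field 405
LEGACY_ENABLE_BIT = 1  # legacy enable field 410
SUPPRESS_BIT = 2       # suppression field 415
TRACKER_SHIFT = 3      # tracker field 420 (two bits)
TRACKER_MASK = 0b11

# Tracker encoding as described for the two-bit tracker field.
TRACKER_STATES = {0: "IDLE", 1: "WAIT_FOR_ENDRET", 2: "WAIT_FOR_ENDBRANCH"}


def pack(enable, legacy_enable, suppress, tracker):
    """Build a register value from the individual field values."""
    if tracker not in TRACKER_STATES:
        raise ValueError("tracker must be 0, 1, or 2")
    return (
        (int(enable) << ENABLE_BIT)
        | (int(legacy_enable) << LEGACY_ENABLE_BIT)
        | (int(suppress) << SUPPRESS_BIT)
        | (tracker << TRACKER_SHIFT)
    )


def unpack(value):
    """Split a register value back into named fields."""
    tracker = (value >> TRACKER_SHIFT) & TRACKER_MASK
    return {
        "enable": bool((value >> ENABLE_BIT) & 1),
        "legacy_enable": bool((value >> LEGACY_ENABLE_BIT) & 1),
        "suppress": bool((value >> SUPPRESS_BIT) & 1),
        "tracker": TRACKER_STATES.get(tracker, "RESERVED"),
    }
```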
[0057] Referring now to FIG. 5, shown is a block diagram of a call stack frame for code
execution that interlaces CTT-enabled code and legacy code without CTT-enabled functionality.
As shown in FIG. 5, a code segment 450 includes a first CTT call stack frame 460 and
a second CTT call stack frame 462 that in turn calls a legacy call stack frame 465.
Thus at the point of calling this legacy call stack frame, the CTT functionality is
disabled responsive to a DISCTT instruction. Thus at this point execution begins with
CTT functionality disabled for a first legacy call stack frame 465 and a second legacy
call stack frame 466. Note that as the called functions return back, at the point
of returning to call stack frame 462, execution with CTT functionality is re-enabled
by an ENCTT instruction.
[0058] As such, FIG. 5 shows an example where a first transfer to legacy code suppresses
CTT, which is done using indirect CALL/JMP instructions (not RET) for security reasons.
Once CTT is suppressed by a DISCTT instruction, subsequent CALL/JMP/RET instructions
can land on instructions other than ENDBRANCH/ENDRET without causing faults. CTT operation
is unsuppressed when control returns to the point where suppression was done, via
an ENCTT instruction.
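The effect of suppression on the CTT state machine can be illustrated with a simplified software model. The mnemonics used as strings, the CttFault exception, and the step and run helpers are hypothetical names for this sketch, and the model ignores details such as privilege levels and the legacy enable indicator.

```python
IDLE = "IDLE"
WAIT_FOR_ENDBRANCH = "WAIT_FOR_ENDBRANCH"
WAIT_FOR_ENDRET = "WAIT_FOR_ENDRET"


class CttFault(Exception):
    """Raised when a control transfer lands on an unexpected instruction."""


def step(state, instruction, suppressed=False):
    """Advance the tracker by one instruction and return the next state.

    `instruction` is a simplified mnemonic: "INDIRECT_CALL", "INDIRECT_JMP",
    "RET", "ENDBRANCH", "ENDRET", or any other string for an ordinary instruction.
    """
    if suppressed:
        # While suppressed, the state machine stays in IDLE and never faults.
        return IDLE
    if state == WAIT_FOR_ENDBRANCH and instruction != "ENDBRANCH":
        raise CttFault("expected ENDBRANCH, found " + instruction)
    if state == WAIT_FOR_ENDRET and instruction != "ENDRET":
        raise CttFault("expected ENDRET, found " + instruction)
    if instruction in ("INDIRECT_CALL", "INDIRECT_JMP"):
        return WAIT_FOR_ENDBRANCH
    if instruction == "RET":
        return WAIT_FOR_ENDRET
    return IDLE


def run(instructions, suppressed=False):
    """Feed a sequence of mnemonics through the tracker and return the final state."""
    state = IDLE
    for instruction in instructions:
        state = step(state, instruction, suppressed)
    return state
```

With this model, a RET followed by an ordinary instruction raises CttFault when CTT is enforced, while the same sequence runs to completion when suppressed is set.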
[0059] Referring now to FIG. 6, shown are further details of legacy interworking in accordance
with an embodiment of the present invention. As shown in FIG. 6, an implementation
is present with a CTT-enabled application image 470 that issues a call to a CTT enabled
library 475 (Call1) that in turn initiates a call to a legacy library 485 (Call2).
In turn, legacy library 485 issues a call to a second CTT-enabled library 490 (Call3).
Also present is a heap/stack 480. After execution in second CTT-enabled library 490,
control passes back to legacy library 485 (RET1), and from there control returns back
to first CTT-enabled library 475 (RET2), and finally control returns back to application
image 470 (RET3).
[0060] Note that upon Call2, a legacy transfer occurs and thus CTT is suppressed via a DISCTT
instruction. Accordingly, for Call3, CTT remains suppressed, as it does for RET1.
Finally, RET2 causes a return to the point of suppression and as such, CTT is unsuppressed
via an ENCTT instruction. Note that this legacy interworking may be entered when a
legacy interworking enable indicator of a CTT control logic is set and an indirect
control transfer (namely a jump or call) occurs to a non-ENDBRANCH instruction.
[0061] The DISCTT and ENCTT instructions may be placed in the program by the programmer
if she is aware of the interworking, and/or these DISCTT and ENCTT instructions may
be placed in the program by the compiler/linker when it is linking statically to legacy
libraries.
[0062] When linking dynamically to libraries, a loader or anti-malware software can insert
trampoline functions between the application and the library, where the trampoline
functions use DISCTT and ENCTT instructions. For example, calls to functions in a
legacy library that are dynamically linked to a CTT enabled application go through
a trampoline function, which suppresses CTT and then calls the legacy library function.
The legacy library function returns to the trampoline function that un-suppresses
CTT and returns to the CTT-enabled application.
[0063] Embodiments may be used by anti-malware software to wrap non-CTT binaries such that
they can be used with CTT-enabled binaries. In addition, anti-malware software can
restrict the use of the gadgets that can be found in the program even with CTT in
use. Embodiments may be particularly applicable to mobile and other portable low power
systems, in that software only techniques to mitigate against ROP (like rewriting
binaries to remove all instances of RET by use of functionally equivalent but larger
more complex sequences), generally lead to much larger binaries and increase the execution
time of the program and thereby are not suited for mobile applications where power
efficiency is a prime concern.
[0064] Referring now to FIG. 7, shown is a block diagram of a processor core in accordance
with one embodiment of the present invention. As shown in FIG. 7, processor core 600
may be a multi-stage pipelined out-of-order processor. Core 600 may support one or
more instruction sets (e.g., the x86 instruction set (with some extensions that have
been added with newer versions); the MIPS instruction set of MIPS Technologies of
Sunnyvale, CA; the ARM instruction set (with optional additional extensions such as
NEON) of ARM Holdings of Sunnyvale, CA). It should be understood that the core may
support multithreading (executing two or more parallel sets of operations or threads),
and may do so in a variety of ways including time sliced multithreading, simultaneous
multithreading (where a single physical core provides a logical core for each of the
threads that physical core is simultaneously multithreading), or a combination thereof
(e.g., time sliced fetching and decoding and simultaneous multithreading thereafter
such as in the Intel® Hyperthreading technology).
[0065] A processor including core 600 may be a general-purpose processor, such as a Core™
i3, i5, i7, 2 Duo and Quad, Xeon™, Itanium™, XScale™ or StrongARM™ processor, which
are available from Intel Corporation. Alternatively, the processor may be from another
company, such as a design from ARM Holdings, Ltd, MIPS, etc. The processor may be
a special-purpose processor, such as, for example, a network or communication processor,
compression engine, graphics processor, co-processor, embedded processor, or the like.
The processor may be implemented on one or more chips, and may be a part of and/or
may be implemented on one or more substrates using any of a number of process technologies,
such as, for example, BiCMOS, CMOS, or NMOS.
[0066] As shown in FIG. 7, core 600 may operate at various voltages and frequencies as a
result of integrated voltage regulator 609. As seen in FIG. 7, core 600 includes front
end units 610, which may be used to fetch instructions to be executed and prepare
them for use later in the processor. For example, front end units 610 may include
a fetch unit 601, an instruction cache 603, and an instruction decoder 605. Instruction
decoder 605 includes CTT logic 606 in accordance with an embodiment of the present
invention, with an associated CTT state machine to perform CTT operations as described
herein. In some implementations, front end units 610 may further include a trace cache,
along with microcode storage as well as a micro-operation storage. Fetch unit 601
may fetch macro-instructions, e.g., from memory or instruction cache 603, and feed
them to instruction decoder 605 to decode them into primitives, i.e., micro-operations
for execution by the processor.
[0067] Coupled between front end units 610 and execution units 620 is an out-of-order (OOO)
engine 615 that may be used to receive the micro-instructions and prepare them for
execution. More specifically OOO engine 615 may include various buffers to re-order
micro-instruction flow and allocate various resources needed for execution, as well
as to provide renaming of logical registers onto storage locations within various
register files such as machine-specific registers (MSRs) 625, register file 630 and
extended register file 635. MSRs 625 may provide for storage of various configuration
and control registers. In addition, MSRs 625 may include a set of bounds registers
628 to be used as discussed further below. Register file 630 may include separate
register files for integer and floating point operations. Extended register file 635
may provide storage for vector-sized units, e.g., 256 or 512 bits per register.
[0068] Various resources may be present in execution units 620, including, for example,
various integer, floating point, and single instruction multiple data (SIMD) logic
units, among other specialized hardware. For example, such execution units may include
one or more arithmetic logic units (ALUs) 622, among other such execution units such
as a checker or comparison logic, to be used as described herein.
[0069] Results from the execution units may be provided to a retirement unit 640 including
a reorder buffer (ROB). This ROB may include various arrays and logic to receive information
associated with instructions that are executed. This information is then examined
by retirement unit 640 to determine whether the instructions can be validly retired
and result data committed to the architectural state of the processor, or whether
one or more exceptions occurred that prevent a proper retirement of the instructions.
Of course, retirement unit 640 may handle other operations associated with retirement.
For retirement operations here, CTT logic 645 of the retirement unit may store CTT
state machine state received with incoming instructions, and feed back this information
responsive to a misprediction.
[0070] As shown in FIG. 7, retirement unit 640 is coupled to a cache 650 which in one embodiment
may be a low level cache (e.g., an L1 cache), although the scope of the present invention
is not limited in this regard. Also, execution units 620 can be directly coupled to
cache 650. In the embodiment shown, a cache portion 652 may be allocated to store
translated code portions instrumented with CTT instructions, as described further
below. From cache 650, data communication may occur with higher level caches, system
memory and so forth. While shown with this high level in the embodiment of FIG. 7,
understand the scope of the present invention is not limited in this regard. For example,
while the implementation of FIG. 7 is with regard to an out-of-order machine such
as of an ISA, the scope of the present invention is not limited in this regard. That
is, other embodiments may be implemented in an in-order processor, a reduced instruction
set computing (RISC) processor such as an ARM-based processor, or a processor of another
type of ISA that can emulate instructions and operations of a different ISA via an
emulation engine and associated logic circuitry.
[0071] Referring now to FIG. 8, shown is a block diagram of a multicore processor in accordance
with an embodiment of the present invention. As shown in the embodiment of FIG. 8,
processor 700 includes multiple domains. Specifically, a core domain 710 can include
a plurality of cores 710_0-710_n, a graphics domain 720 can include one or more
graphics engines, and a system agent
domain 750 may further be present. In various embodiments, system agent domain 750
may handle power control events and power management such that individual units of
domains 710 and 720 such as cores and/or graphics engines can be controlled to independently
dynamically operate at an appropriate turbo mode frequency in light of the activity
(or inactivity) occurring in the given unit. Each of domains 710 and 720 may operate
at different voltage and/or power, and furthermore the individual units within the
domains each may operate at an independent frequency and voltage. Note that while
only shown with three domains, understand the scope of the present invention is not
limited in this regard and additional domains can be present in other embodiments.
[0072] In general, each core 710 may further include low level caches in addition to various
execution units and additional processing elements, including CTT logic as described
herein. In turn, the various cores may be coupled to each other and to a shared cache
memory formed of a plurality of units of a last level cache (LLC) 740_0-740_n.
In various embodiments, LLC 740 may be shared amongst the cores and the graphics
engine, as well as various media processing circuitry. As seen, a ring interconnect
730 thus couples the cores together, and provides interconnection between the cores,
graphics domain 720 and system agent circuitry 750. In one embodiment, interconnect
730 can be part of the core domain. However in other embodiments the ring interconnect
can be of its own domain.
[0073] As further seen, system agent domain 750 may include display controller 752 which
may provide control of and an interface to an associated display. As further seen,
system agent domain 750 may include a power control unit 755.
[0074] As further seen in FIG. 8, processor 700 can further include an integrated memory
controller (IMC) 770 that can provide for an interface to a system memory, such as
a dynamic random access memory (DRAM). Multiple interfaces 780_0-780_n may be
present to enable interconnection between the processor and other circuitry.
For example, in one embodiment at least one direct media interface (DMI) interface
may be provided as well as one or more Peripheral Component Interconnect Express (PCIe™)
interfaces. Still further, to provide for communications between other agents such
as additional processors or other circuitry, one or more interfaces in accordance
with an Intel® Quick Path Interconnect (QPI) protocol may also be provided. Although
shown at this high level in the embodiment of FIG. 8, understand the scope of the
present invention is not limited in this regard.
[0075] Note that when code includes CTT instructions as described herein, whether programmer
included, compiler inserted or otherwise, there are still gadgets that remain in the
code. These gadgets or code sequences start in an ENDRET instruction and end in a
RET instruction. It is possible for an attacker such as malware to link these code
sequences together to create a ROP chain. An example of such code sequence is shown
below in Table 5.
Table 5
|
Function_foo:
...
...
Call bar
ENDRET
Call *(edx)
ENDRET
LEAVE
RET |
[0076] In the function of Table 5, the sequence starting with the first ENDRET instruction
can still be returned to without invoking the function_foo. Such gadgets that are
prefixed with an ENDRET or an ENDBRANCH are called valid gadgets. Instrumenting code
including this function with CTT instructions as above restricts the attacker to execute
only valid instructions in the program and hence an attacker cannot do things like
jump into the middle of the instructions. However they can still use a programmer-unintended
RET to execute a valid gadget.
[0077] Embodiments provide further defenses via inclusion and execution of additional code
provided in a program to prevent returning into the middle of a function without invoking
the function from its beginning. As such, embodiments provide a set of defenses that
can be employed in valid gadgets to prevent ROP-based attacks from chaining together
valid gadgets. These defense schemes, along with appropriate CTT instruction inclusion
and hardware enforcement, can be employed to harden the control flow integrity of
a program.
[0078] While different sets of defenses are possible, in one embodiment examples include:
use of interlock registers; use of hardened interlock registers via execute-only memory;
use of a software managed shadow stack; and/or execution of stack pivot checks.
[0079] In general, embodiments may be used to detect invocation of valid gadgets within
programs, which may be attacked by chaining two or more valid gadgets together to
create a ROP chain. The techniques described herein may leverage the CTT instructions
to detect when an attempt is made to return to an ENDRET instruction without invoking
the encapsulating function, in an effort to invoke a valid gadget.
[0080] To perform defense against attacks using interlocks in registers, a set of operations
may be performed on a set of one or more registers or memory to insert random values
before function execution, and confirm whether these values still exist in the one
or more registers/memory on function return.
[0081] Referring now to Table 6, shown is a code sequence to set up an interlock in a register
(reg). In an embodiment, a compiler may insert this code sequence prior to a function
call in code being instrumented.
Table 6
|
push <reg>
mov <function_specific_ID>, reg
call
endret |
[0082] Here the register used (reg) is a callee-saved register, namely a register that the
called function preserves. Note that in an embodiment, the compiler may select a different
register for each instance of the call. Further understand that in embodiments the
random number may be determined, e.g., by a dynamic loader, at the time of loading
the code for this program such that the number is different each time the program
is launched.
[0083] To determine whether the selected register still has the random value upon return
from the function, the compiler may also emit the code sequence of Table 7 following
an ENDRET instruction.
Table 7
|
Call
Endret
xor <function_specific_ID>, reg
jnz error
pop <reg> |
[0084] Thus with this scheme in place an attacker that redirects control flow to the above
ENDRET instruction will find a mismatch in the function_specific_ID when compared to
the register value selected by the compiler for this check. Responsive to this mismatch,
the program may terminate (e.g., crash) and the attack is defeated. To bypass this
check, an attacker would need to gather the function_specific_IDs used for all the
valid gadgets to be chained together, in addition to a way to control the register
content such that the correct register will include the correct function specific
ID. As such, use of the above code can increase code protection.
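The interlock sequences of Tables 6 and 7 can be modeled in software as follows. This is a minimal sketch under stated assumptions: registers are modeled as a dictionary and the stack as a list, and the names make_call_site, new_function_specific_id, and InterlockError are hypothetical. The 64-bit width of the identifier is also an assumption of the sketch.

```python
import secrets


class InterlockError(Exception):
    """Stands in for the `jnz error` path of Table 7."""


def new_function_specific_id():
    """Pick a fresh random identifier, as a loader might at program load time."""
    return secrets.randbits(64)


def make_call_site(function_specific_id):
    """Return (setup, check) helpers mirroring Tables 6 and 7 for one call site."""

    def setup(regs, stack, reg):
        stack.append(regs[reg])           # push <reg>
        regs[reg] = function_specific_id  # mov <function_specific_ID>, reg

    def check(regs, stack, reg):
        # xor <function_specific_ID>, reg ; jnz error
        if regs[reg] ^ function_specific_id:
            raise InterlockError("interlock mismatch after ENDRET")
        regs[reg] = stack.pop()           # pop <reg>

    return setup, check
```

A legitimate caller runs setup, calls a function that preserves the callee-saved register, and then passes check. Code that reaches check without the matching setup, as when an attacker returns directly to the ENDRET instruction, fails the comparison unless the register already holds the correct identifier.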
[0085] In other cases, interlocks in registers can be hardened by maintaining code using
the above technique in an execute-only (XO) memory. In this case, to protect the function_specific_ID
from a memory read disclosure attack on code pages including such interlock code,
the code pages on which the function_specific_IDs are used can be made to be execute-only,
by controlling memory access attributes of such pages. In one such embodiment, a processor
such as an Intel® processor including Intel® virtualization technology (VT) support
can include management for execute-only permission, which can be applied via an extended
page table (EPT) of a paging structure. In this case, a hypervisor or other supervisor
manages the EPT as a second-level page table, while an OS manages a first-level page
table, namely an OS-controlled page table. In this example, to seek access to code
included in such protected memory pages, the processor performs a page table walk
on both page tables in a nested manner. The resulting permissions obtained from the
two tables may be logically AND'ed for a page mapped to both page table levels, translating
from a guest virtual address to a guest physical address, and from the guest physical
address to a host physical address.
[0086] Continuing with the above example, the code pages may be set to a read-execute (and
may also be read-write-execute) permission in the OS-managed first level page table,
but can be enforced as execute-only via permissions in the second level extended page
table entry (EPTE) that maps to that code page, thus disallowing (and faulting on)
any reads or writes targeting the code page from code running in a VMX-non-root mode
(in the guest). Enforcing execute-only (XO) permission on these code pages disallows
an attacker from gathering the function_specific_IDs to be able to chain gadgets together.
Note that a similar extension for execute-only memory for this usage may be implemented
via the OS-managed page tables in some processors.
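The combination of permissions from the two page table levels can be illustrated with a short sketch. The Perm flag type and the helper names are hypothetical, and the model covers only the logical AND of the two permission sets, not the address translation itself.

```python
from enum import Flag, auto


class Perm(Flag):
    NONE = 0
    R = auto()
    W = auto()
    X = auto()


def effective_permissions(os_page_table_perm, ept_perm):
    """Logically AND the permissions from the first and second level tables."""
    return os_page_table_perm & ept_perm


def access_allowed(kind, os_page_table_perm, ept_perm):
    """Return True if an access of the given kind passes both levels."""
    return bool(kind & effective_permissions(os_page_table_perm, ept_perm))
```

For a code page mapped read-execute in the OS-managed table and execute-only in the EPT, effective_permissions(Perm.R | Perm.X, Perm.X) yields Perm.X, so instruction fetches are allowed while reads of the function_specific_ID values are not.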
[0087] Referring now to FIG. 9, shown is a flow diagram of a method for valid gadget protection
in accordance with an embodiment of the invention. Method 800 of FIG. 9 may be performed
by a collection of components including processor hardware (including CTT hardware),
logic, firmware, and/or software during execution of a process or other collection
of instructions during system operation. In general, the operations identified in
FIG. 9 may be used to enable and use an interlock register technique to ensure that
an attack such as a ROP-based attack does not occur within a valid gadget. Note that
in some embodiments method 800 may be implemented by execution of code, either inserted
by a programmer, a static compiler, a dynamic binary translation engine or a static
binary translator invoked on the program.
[0088] As shown in FIG. 9, method 800 begins at block 810. There, a value of a selected
register of multiple registers may be pushed to a stack. While this register may be
any register of a register file, in some embodiments the register may be selected
from a group of predetermined registers. As an example, the predetermined registers
may include a base index (RBX) register, a destination index register (RDI), a source
index (RSI) register, or any one of general purpose registers R12-R15. This push operation
thus stores a given register value to a stack, where it can be later obtained from
the stack (e.g., responsive to a pop operation) to be used during further execution
of the process after the valid gadget protection technique has been performed.
[0089] Next at block 820, a randomly generated number may be stored to the selected register.
In an embodiment, this randomly generated number may be a 4 or 8-byte random number
determined, e.g., according to a pseudorandom number generator, by a dynamic loader
or any other entity, and inserted as an immediate value encoded in the instruction
that stores this number to the selected register.
[0090] With further reference to FIG. 9, next at block 830 a call to a function occurs.
Accordingly, the given function may be executed through its various operations until
it is ended via a RETURN instruction that returns control flow to the next instruction
following the call function instruction (assuming a correct program control flow occurs
to an end of the function).
[0091] After the return from the function, at diamond 840 it is determined whether a CTT
instruction (namely an ENDRET instruction) is the first instruction encountered following
the return from the function. If not, a CTT fault is reported to the OS (block 850).
Note that this CTT-based monitoring and fault generation may be performed using the
CTT state machine and exception logic as described above. Thus if a valid CTT instruction
does not follow upon return from the function, a fault is raised, which may trigger
a fault handler or interrupt service routine to execute to determine whether an attack
or other malware effect has occurred.
[0092] Still referring to FIG. 9, instead if a function is validly terminated via a CTT
instruction following the return from the function (as determined at diamond 840),
control passes to diamond 860. There it can be determined whether the previously generated
random number (e.g., as inserted by the dynamic loader as an immediate value encoded
in the instruction that does this comparison) matches the value of the selected register.
In an embodiment, a comparison logic of a processor may make this determination. Thus
at diamond 860 it is determined whether the selected register still has the same value
upon function return as it did when the function was called. If not, this indicates
that the full function code was not executed (since the correct function operation
will ensure that the selected register is maintained (or restored) with its calling
value at the point of function return). If the match does not occur, control passes
to block 850 where this condition is treated as if a CTT fault has occurred and the
program is terminated, as discussed above.
[0093] Otherwise, when the randomly generated number matches the value of the selected register,
it is ensured that correct function operation and appropriate control transfer occurred.
Accordingly, control passes to block 870 where the original value from the selected
register is popped from the stack and restored to the register. Thereafter, continued
code execution within a process may continue. While shown at this high level in the
embodiment of FIG. 9, the scope of the present invention is not limited in this regard.
[0094] As another defense, a software managed shadow stack may be used to protect against
ROP-based attacks. Here, at an entry to a function, a program saves the return address
stored on the stack to an alternate stack, which in an embodiment is a software managed
stack, also referred to herein as a shadow stack. Prior to all exit points in the
function (e.g., before returning from the function or making a call to another function),
the return address on the stack is compared to the top entry of the shadow stack.
A mismatch indicates an error or potential attack, responsive to which the program
may be terminated (e.g., by crashing). If the comparison indicates a match, continued
execution of the program occurs. Note that in this case, prior to returning from the
function, the shadow stack is popped. With this shadow stack operation, an attacker
needs to not just control the return addresses on the stack but also control the return
addresses on the shadow stack, making the setup phase for exploit pre-conditions more
difficult.
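The shadow stack operation described above can be sketched as follows. This is an illustrative model only: the Thread class, the helper names, and the ShadowStackMismatch exception are hypothetical, both stacks are modeled as Python lists, and only the return exit point is modeled (not the check performed before a call to another function).

```python
class ShadowStackMismatch(Exception):
    """Signals that the return address on the stack was altered."""


class Thread:
    """Holds the regular stack and the software managed shadow stack."""

    def __init__(self):
        self.stack = []
        self.shadow_stack = []


def function_entry(thread):
    """At function entry, copy the return address on the stack to the shadow stack."""
    thread.shadow_stack.append(thread.stack[-1])


def function_return(thread):
    """Before returning, compare both copies, pop the shadow stack, and return."""
    if thread.stack[-1] != thread.shadow_stack[-1]:
        raise ShadowStackMismatch("return address does not match shadow stack")
    thread.shadow_stack.pop()
    return thread.stack.pop()  # address to which control returns
```

In this model, overwriting the return address on the regular stack after function_entry causes function_return to raise ShadowStackMismatch, which stands in for terminating the program.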
[0095] Referring now to FIG. 10, shown is a flow diagram of a method for valid gadget protection
in accordance with another embodiment of the invention. Method 900 of FIG. 10 may
be performed by a collection of components including processor hardware (including
CTT hardware), logic, firmware, and/or software during execution of a process or other
collection of instructions during system operation. In general, method 900 may be
used to provide valid gadget defense using a software managed shadow stack. As shown
in FIG. 10, method 900 begins by entering into execution of a function, e.g., responsive
to a call instruction of a calling code segment (block 910). Next it is determined
whether a CTT instruction follows an indirect control transfer instruction (e.g.,
CALL/JUMP) to this function (diamond 915). If not, a CTT fault is reported (block
920), and program execution may be terminated (block 960). Otherwise, next at block
925 a return address for this function (to cause a return to the next instruction
of the calling code following the call instruction) that is stored in the stack may
be copied to a top entry of a shadow stack. Although the scope of the present invention
is not limited in this regard, understand that this shadow stack may be present in
various locations such as ring-3 system memory or restricted ring-0 system memory.
Additionally, the shadow stack memory may be restricted to be writable only for the
code that updates the shadow stack memory (e.g., a VMM) or by special instructions
used to write the shadow stack (e.g., an ENDBRANCH). In an embodiment, this shadow
stack memory may be made to be read-writable only to specific trusted code that manages
the shadow stack (e.g., the VMM). Control next passes to diamond 930 where it is determined
whether an exit point of the function has been reached. This exit point may be at
function end, or it may be any other exit point within the function (e.g., a call
to another function). If an exit point is reached, control passes to block 940 where
the top entry of the shadow stack may be compared with the return address obtained
from the stack. Again, comparison logic of the processor may be used for this comparison.
Next it is determined whether a result of the comparison is a match (diamond 950).
If not, this indicates that malicious code or other malware has initiated an attack,
as evidenced by these different values. Accordingly, program execution may terminate
(block 960). For example, an exception may be raised, and an exception handler may
cause the termination (e.g., via a general protection fault).
[0096] Otherwise, if a match is indicated, control passes to block 970 where the shadow
stack may be popped (to thus remove this return address) and control returns to the
caller (block 980), e.g., the next instruction of the calling code, which may be an
appropriate CTT instruction such as an ENDRET instruction. While shown at this high
level in the embodiment of FIG. 10, many variations and alternatives are possible.
[0097] A common attack technique used in ROP attacks is to atomically switch a stack pointer
to an attacker controlled memory buffer, using an operation known as a stack pivot.
Embodiments may provide a measure of defense against such attack by performing stack
pivot checks. For this defense, prior to all exit points in the function (e.g., before
returning from the function or making a call to another function), code (e.g., compiler
emitted code) may be provided to check a stack pointer value against upper and lower
bounds of the stack that have been established for this thread. If an attacker had
managed to pivot the stack, this check may detect the pivot and signal an exception,
leading to the program being terminated. Such bound checks may be efficiently implemented
in hardware using instructions such as those provided by Intel® Memory Protection
Extensions (MPX).
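A software analogue of the stack pivot check can be sketched as below. The function and exception names are hypothetical, and treating both bounds as inclusive is an assumption of this sketch rather than something stated above.

```python
class StackPivotDetected(Exception):
    """Stands in for the exception signaled when the check fails."""


def check_stack_pointer(stack_pointer, lower_bound, upper_bound):
    """Verify the stack pointer lies within the bounds established for this thread.

    Both bounds are treated as inclusive in this sketch (an assumption).
    """
    if stack_pointer < lower_bound:
        raise StackPivotDetected("stack pointer below lower bound")
    if stack_pointer > upper_bound:
        raise StackPivotDetected("stack pointer above upper bound")
    return stack_pointer
```

Compiler-emitted code would perform such a check before each exit point of a function, so that a stack pointer redirected to an attacker controlled buffer outside the thread's stack region is detected before the RET instruction executes.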
[0098] Referring now to FIG. 11, shown is a flow diagram of a method for valid gadget protection
in accordance with a still further embodiment of the invention. Method 1000 of FIG.
11 may be performed by a collection of components including processor hardware (including
CTT hardware), logic, firmware, and/or software during execution of a process or other
collection of instructions during system operation. In general, method 1000 may be
used to provide valid gadget defense against a stack pivot attack. As illustrated
in FIG. 11, method 1000 begins by executing a function (block 1010). Next it is determined
during function execution whether an exit point of the function has been reached (diamond
1020). If so, control passes to block 1030 where a stack pointer value may be checked
against the bounds of the stack.
[0099] While the scope of the present invention is not limited in this regard, in an embodiment
checking of the stack pointer value against the stack bounds may be performed by execution
of one or more user-level instructions such as upper and lower bounds limit check
instructions. In one embodiment, a processor may be configured with a set of bounds
registers. As one such example, a set of bounds registers may be configured, with
each including multiple fields for lower and upper bounds of a given buffer or other
memory structure. During processor operation, one or more of these bounds registers
may be written with values of a given memory frame such as a stack buffer or other
memory collection. As one example, a first bounds register may include a first field
to store an address of the bottom of the stack and a second field to store an address
of the top of the stack. Of course in other cases, separate bounds registers may be
provided, each to store a given upper or lower bound of a memory collection.
[0100] In one embodiment, multiple user-level instructions may be executed to determine
whether a given address such as a return address is within the bounds of the stack.
In one embodiment, a first such instruction is a lower bound check instruction (BNDCL)
to ensure that the stack address is not less than this lower bound. Similarly,
a second instruction is an upper bound check instruction (BNDCU) to ensure that the
stack address does not exceed this upper bound. Next, responsive to this checking it
is determined whether the stack pointer value is within the bounds (diamond 1040).
If not, this indicates that malicious code or other malware has initiated a stack
pivot-based attack. As such, control passes to block 1050, where program execution
may terminate. For example, an exception may be raised, to be handled by an exception
handler (e.g., responsive to a bounds protection fault). Otherwise, if the stack pointer
is within the bounds, control returns to the caller (block 1060). That is so, as the
appropriately bounded stack pointer indicates that a stack pivot-based attack likely
did not occur. Understand that while shown at this high level in the embodiment of FIG.
11, many variations and alternatives are possible.
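As an illustration only, the flow of method 1000 can be modeled in software. The following Python sketch is not part of the described embodiments. The names BoundsRegister, bndcl, bndcu, check_exit_point and BoundsProtectionFault are hypothetical stand-ins for the bounds register, the two bound check instructions and the resulting fault, and treating both bounds as inclusive is an assumption of this sketch.

```python
from dataclasses import dataclass


class BoundsProtectionFault(Exception):
    """Hypothetical stand-in for the fault raised by a failed bound check."""


@dataclass(frozen=True)
class BoundsRegister:
    """One bounds register holding both fields for a thread's stack."""
    lower: int  # lowest valid stack address (assumed inclusive)
    upper: int  # highest valid stack address (assumed inclusive)


def bndcl(bounds, address):
    # Lower bound check: fault if the address lies below the lower bound.
    if address < bounds.lower:
        raise BoundsProtectionFault(f"{address:#x} below {bounds.lower:#x}")


def bndcu(bounds, address):
    # Upper bound check: fault if the address lies above the upper bound.
    if address > bounds.upper:
        raise BoundsProtectionFault(f"{address:#x} above {bounds.upper:#x}")


def check_exit_point(bounds, stack_pointer):
    """Model of blocks 1030 through 1060 at a function exit point."""
    # Block 1030: check the stack pointer value against the stack bounds.
    bndcl(bounds, stack_pointer)
    bndcu(bounds, stack_pointer)
    # Diamond 1040 passed: control may return to the caller (block 1060).
    return True
```

For example, with BoundsRegister(lower=0x7000, upper=0x7fff), a stack pointer of 0x7800 passes the check, whereas a pivoted stack pointer such as 0x10000000 raises BoundsProtectionFault, which corresponds to block 1050.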
[0101] Embodiments may thus be used in connection with CTT instructions and hardware to
provide further measures to protect against invocation of valid gadgets in code. Using
an embodiment, instruction execution can be prevented from being misaligned or control
being transferred to non-programmer intended locations in programs. Note that one
or more of the above techniques can be used in connection with CTT instructions to
thus harden control flow integrity of the program.
[0102] Referring now to FIG. 12, shown is a block diagram of an example system with which
embodiments may be used. In the illustration of FIG. 12, system 1300 may be a mobile
low-power system such as a tablet computer, 2:1 tablet, phablet or other convertible
or standalone tablet system. As illustrated, a SoC 1310 is present and may be configured
to operate as an application processor for the device. SoC 1310 may include CTT hardware,
software, and/or firmware as described herein to provide protection against ROP and
other security attacks.
[0103] A variety of devices may couple to SoC 1310. In the illustration shown, a memory
subsystem includes a flash memory 1340 and a DRAM 1345 coupled to SoC 1310. In addition,
a touch panel 1320 is coupled to the SoC 1310 to provide display capability and user
input via touch, including provision of a virtual keyboard on a display of touch panel
1320. To provide wired network connectivity, SoC 1310 couples to an Ethernet interface
1330. A peripheral hub 1325 is coupled to SoC 1310 to enable interfacing with various
peripheral devices, such as may be coupled to system 1300 by any of various ports
or other connectors.
[0104] In addition to internal power management circuitry and functionality within SoC 1310,
a PMIC 1380 is coupled to SoC 1310 to provide platform-based power management, e.g.,
based on whether the system is powered by a battery 1390 or AC power via an AC adapter
1395. In addition to this power source-based power management, PMIC 1380 may further
perform platform power management activities based on environmental and usage conditions.
Still further, PMIC 1380 may communicate control and status information to SoC 1310
to cause various power management actions within SoC 1310.
[0105] Still referring to FIG. 12, to provide for wireless capabilities, a WLAN unit 1350
is coupled to SoC 1310 and in turn to an antenna 1355. In various implementations,
WLAN unit 1350 may provide for communication according to one or more wireless protocols,
including an IEEE 802.11 protocol, a Bluetooth™ protocol or any other wireless protocol.
[0106] As further illustrated, a plurality of sensors 1360 may couple to SoC 1310. These
sensors may include various accelerometer, environmental and other sensors, including
user gesture sensors. Finally, an audio codec 1365 is coupled to SoC 1310 to provide
an interface to an audio output device 1370. Of course understand that while shown
with this particular implementation in FIG. 12, many variations and alternatives are
possible.
[0107] Referring now to FIG. 13, shown is a block diagram of a representative computer system
such as a notebook, Ultrabook™ or other small form factor system. A processor 1410,
in one embodiment, includes a microprocessor, multi-core processor, multithreaded
processor, an ultra low voltage processor, an embedded processor, or other known processing
element. In the illustrated implementation, processor 1410 acts as a main processing
unit and central hub for communication with many of the various components of the
system 1400. As one example, processor 1410 is implemented as a SoC. Processor 1410
may include CTT hardware, software, and/or firmware as described herein to provide
protection against ROP and other security attacks.
[0108] Processor 1410, in one embodiment, communicates with a system memory 1415. As an
illustrative example, the system memory 1415 is implemented via multiple memory devices
or modules to provide for a given amount of system memory.
[0109] To provide for persistent storage of information such as data, applications, one
or more operating systems and so forth, a mass storage 1420 may also couple to processor
1410. In various embodiments, to enable a thinner and lighter system design as well
as to improve system responsiveness, this mass storage may be implemented via a SSD
or the mass storage may primarily be implemented using a hard disk drive (HDD) with
a smaller amount of SSD storage to act as a SSD cache to enable non-volatile storage
of context state and other such information during power down events so that a fast
power up can occur on re-initiation of system activities. Also shown in FIG. 13, a
flash device 1422 may be coupled to processor 1410, e.g., via a serial peripheral
interface (SPI). This flash device may provide for non-volatile storage of system
software, including a basic input/output system (BIOS) as well as other firmware
of the system.
[0110] Various input/output (I/O) devices may be present within system 1400. Specifically
shown in the embodiment of FIG. 13 is a display 1424 which may be a high definition
LCD or LED panel that further provides for a touch screen 1425. In one embodiment,
display 1424 may be coupled to processor 1410 via a display interconnect that can
be implemented as a high performance graphics interconnect. Touch screen 1425 may
be coupled to processor 1410 via another interconnect, which in an embodiment can
be an I²C interconnect. As further shown in FIG. 13, in addition to touch screen 1425, user
input by way of touch can also occur via a touch pad 1430 which may be configured
within the chassis and may also be coupled to the same I²C interconnect as touch screen 1425.
[0111] For perceptual computing and other purposes, various sensors may be present within
the system and may be coupled to processor 1410 in different manners. Certain inertial
and environmental sensors may couple to processor 1410 through a sensor hub 1440,
e.g., via an I²C interconnect. In the embodiment shown in FIG. 13, these sensors may include an accelerometer
1441, an ambient light sensor (ALS) 1442, a compass 1443 and a gyroscope 1444. Other
environmental sensors may include one or more thermal sensors 1446 which in some embodiments
couple to processor 1410 via a system management bus (SMBus).
[0112] Also seen in FIG. 13, various peripheral devices may couple to processor 1410 via
a low pin count (LPC) interconnect. In the embodiment shown, various components can
be coupled through an embedded controller (EC) 1435. Such components can include a keyboard
1436 (e.g., coupled via a PS/2 interface), a fan 1437, and a thermal sensor 1439. In
some embodiments, touch pad 1430 may also couple to EC 1435 via a PS/2 interface. In
addition, a security processor such as a trusted platform module (TPM) 1438 in accordance
with the Trusted Computing Group (TCG) TPM Specification Version 1.2, dated Oct. 2,
2003, may also couple to processor 1410 via this LPC interconnect.
[0113] System 1400 can communicate with external devices in a variety of manners, including
wirelessly. In the embodiment shown in FIG. 13, various wireless modules, each of
which can correspond to a radio configured for a particular wireless communication
protocol, are present. One manner for wireless communication in a short range such
as a near field may be via an NFC unit 1445 which may communicate, in one embodiment,
with processor 1410 via an SMBus. Note that via this NFC unit 1445, devices in close
proximity to each other can communicate.
[0114] As further seen in FIG. 13, additional wireless units can include other short range
wireless engines including a WLAN unit 1450 and a Bluetooth unit 1452. Using WLAN
unit 1450, Wi-Fi™ communications in accordance with a given IEEE 802.11 standard can
be realized, while via Bluetooth unit 1452, short range communications via a Bluetooth
protocol can occur. These units may communicate with processor 1410 via, e.g., a USB
link or a universal asynchronous receiver transmitter (UART) link. Or these units
may couple to processor 1410 via an interconnect according to a PCIe™ protocol or
another such protocol such as a serial data input/output (SDIO) standard.
[0115] In addition, wireless wide area communications, e.g., according to a cellular or
other wireless wide area protocol, can occur via a WWAN unit 1456 which in turn may
couple to a subscriber identity module (SIM) 1457. In addition, to enable receipt
and use of location information, a GPS module 1455 may also be present. Note that
in the embodiment shown in FIG. 13, WWAN unit 1456 and an integrated capture device
such as a camera module 1454 may communicate via a given USB protocol such as a USB
2.0 or 3.0 link, or a UART or I²C protocol.
[0116] An integrated camera module 1454 can be incorporated in the lid. To provide for audio
inputs and outputs, an audio processor can be implemented via a digital signal processor
(DSP) 1460, which may couple to processor 1410 via a high definition audio (HDA) link.
Similarly, DSP 1460 may communicate with an integrated coder/decoder (CODEC) and amplifier
1462 that in turn may couple to output speakers 1463 which may be implemented within
the chassis. Similarly, amplifier and CODEC 1462 can be coupled to receive audio inputs
from a microphone 1465 which in an embodiment can be implemented via dual array microphones
(such as a digital microphone array) to provide for high quality audio inputs to enable
voice-activated control of various operations within the system. Note also that audio
outputs can be provided from amplifier/CODEC 1462 to a headphone jack 1464. Although
shown with these particular components in the embodiment of FIG. 13, understand the
scope of the present invention is not limited in this regard.
[0117] Embodiments may be implemented in many different system types. Referring now to FIG.
14, shown is a block diagram of a system in accordance with an embodiment of the present
invention. As shown in FIG. 14, multiprocessor system 1500 is a point-to-point interconnect
system, and includes a first processor 1570 and a second processor 1580 coupled via
a point-to-point interconnect 1550. As shown in FIG. 14, each of processors 1570 and
1580 may be multicore processors, including first and second processor cores (i.e.,
processor cores 1574a and 1574b and processor cores 1584a and 1584b), although potentially
many more cores may be present in the processors. Each of the processors can include
CTT hardware and logic to perform the security techniques described herein.
[0118] Still referring to FIG. 14, first processor 1570 further includes a memory controller
hub (MCH) 1572 and point-to-point (P-P) interfaces 1576 and 1578. Similarly, second
processor 1580 includes a MCH 1582 and P-P interfaces 1586 and 1588. As shown in FIG.
14, MCHs 1572 and 1582 couple the processors to respective memories, namely a memory
1532 and a memory 1534, which may be portions of system memory (e.g., DRAM) locally
attached to the respective processors. First processor 1570 and second processor 1580
may be coupled to a chipset 1590 via P-P interconnects 1562 and 1564, respectively.
As shown in FIG. 14, chipset 1590 includes P-P interfaces 1594 and 1598.
[0119] Furthermore, chipset 1590 includes an interface 1592 to couple chipset 1590 with
a high performance graphics engine 1538, by a P-P interconnect 1539. In turn, chipset
1590 may be coupled to a first bus 1516 via an interface 1596. As shown in FIG. 14,
various input/output (I/O) devices 1514 may be coupled to first bus 1516, along with
a bus bridge 1518 which couples first bus 1516 to a second bus 1520. Various devices
may be coupled to second bus 1520 including, for example, a keyboard/mouse 1522, communication
devices 1526 and a data storage unit 1528 such as a disk drive or other mass storage
device which may include code 1530, in one embodiment. Further, an audio I/O 1524
may be coupled to second bus 1520. Embodiments can be incorporated into other types
of systems including mobile devices such as a smart cellular telephone, tablet computer,
netbook, Ultrabook™, or so forth.
[0120] Embodiments may be used in many different types of systems. For example, in one embodiment
a communication device can be arranged to perform the various methods and techniques
described herein. Of course, the scope of the present invention is not limited to
a communication device, and instead other embodiments can be directed to other types
of apparatus for processing instructions, or one or more machine readable media including
instructions that in response to being executed on a computing device, cause the device
to carry out one or more of the methods and techniques described herein.
[0121] Embodiments may be implemented in code and may be stored on a non-transitory storage
medium having stored thereon instructions which can be used to program a system to
perform the instructions. The storage medium may include, but is not limited to, any
type of disk including floppy disks, optical disks, solid state drives (SSDs), compact
disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical
disks, semiconductor devices such as read-only memories (ROMs), random access memories
(RAMs) such as dynamic random access memories (DRAMs), static random access memories
(SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically
erasable programmable read-only memories (EEPROMs), magnetic or optical cards, or
any other type of media suitable for storing electronic instructions.
1. A processor (600, 700) comprising:
a first bounds register to store a first bound value for a stack to be stored in
a memory;
a second bounds register to store a second bound value for the stack;
a checker logic (115) to determine, prior to an exit point at the end of a function
to be executed on the processor (600, 700), whether a value of a stack pointer is
within a range between the first bound value and the second bound value;
an exception logic to prevent a return to a caller of the function if the value of
the stack pointer is not within the range;
characterized by
a set of control transfer termination (CTT) instructions including a multi-byte opcode;
and
a logic to store a random value in a first register before a call to the function
and, responsive to at least one of the CTT instructions encountered after a control
transfer operation that returns from the function, to determine whether a CTT instruction
is the first instruction encountered after the return from the function, and, if the
function is validly terminated via a CTT instruction after the return from the function,
to determine whether a current value of the first register equals the random value,
and if so, to continue execution of a caller of the function, and otherwise, if the
first instruction encountered after the return from the function is not a CTT instruction,
to report a fault to terminate execution,
wherein a first CTT instruction of the set of CTT instructions is used to identify
valid locations in a program at which a control transfer can validly be performed
using an indirect call (CALL) instruction or an indirect jump (JMP) instruction, and
a second CTT instruction of the set of CTT instructions is used to identify valid
locations in a program at which a control transfer can validly be performed using
a RET instruction.
2. The processor of claim 1, wherein the exception logic is to raise an exception if
the stack pointer value is not within the range, the exception to indicate a stack
pivot attack.
3. The processor of claim 2, further comprising a control logic to terminate a program
responsive to the exception.
4. The processor of any one of claims 1-3, wherein the first bounds register and the
second bounds register comprise a single register.
5. The processor of any one of claims 1-3, wherein the checker logic is to execute at
least one user-level instruction to determine whether the value of the stack pointer
is within the range.
6. The processor of any one of claims 1 to 5, wherein the CTT instructions are represented
by a four-byte opcode corresponding to an opcode value that does not presently exist
in a current x86 instruction set architecture (ISA).
7. The processor of claim 1, wherein during execution of the function the logic is to
store the random value of the first register in a memory, use the first register to
store a first function value, and restore the random value from the memory to the
first register before termination of the function.
8. The processor of claim 1, wherein the logic is to store a value of the first register
in a stack before the call to the function and, upon continued process execution,
to restore the value of the first register from the stack to the first register.
9. The processor of claim 1, further comprising a second logic to place a page in an
execute-only state, the page having instructions to store the random value in the
first register via immediate parameters.
10. The processor of claim 9, wherein the second logic is to update an extended page
table entry associated with the page to the execute-only state, wherein an entry associated
with the page in a first page table is set to a read-execute state.
11. A method (900),
characterized in that it comprises:
providing a set of control transfer termination (CTT) instructions including a multi-byte
opcode;
determining (915) whether at least one of the CTT instructions follows an indirect
control transfer instruction to a function;
if the CTT instructions do not follow an indirect control transfer instruction, reporting
(920) a fault and terminating (960) program execution, and otherwise copying (925)
a return address of the function, which is executed on a processor of a system and
called during execution of a program, to a top entry of a shadow stack of a memory,
wherein the return address is further stored in a stack of the memory;
at the end of the function, comparing (940) the return address stored in the stack
with a value stored in the top entry of the shadow stack; and
if the return address stored in the stack matches (950) the value stored in the top
entry of the shadow stack, returning (980) to a caller of the function, and otherwise
terminating (960) the program,
wherein a first CTT instruction of the set of CTT instructions is used to identify
valid locations in a program at which a control transfer can validly be performed
using an indirect call (CALL) instruction or an indirect jump (JMP) instruction, and
a second CTT instruction of the set of CTT instructions is used to identify valid
locations in a program at which a control transfer can validly be performed using
a RET instruction.
12. The method of claim 11, further comprising copying the return address to the top
entry of the shadow stack at an entry point to the function.
13. The method of claim 11, further comprising maintaining the shadow stack in one or
more pages of a memory, the one or more pages having a read-write state and being
accessible only to a trusted code agent that manages the shadow stack, wherein the
shadow stack comprises a software-managed stack.
14. An apparatus comprising means for performing a method according to any one of claims
11 to 13.
15. A machine-readable storage medium comprising machine-readable instructions that,
when executed, implement a method according to any one of claims 11 to 13.