Python Byte Code Hacks

Overview

  • Python VM

    • Process Virtual Machines

    • Stack vs Register Machines

    • Python’s Stack Machines

  • Byte Code Hacks

    • Code Objects

    • Modifying Code Objects

    • Handcrafted Code Objects

Python VM

Compilers & Interpreters

  • C: a compiler converts C source to machine instructions

  • GW-BASIC: programs are interpreted line by line

  • Python and Java

    • source code is compiled to an intermediate language: byte code

    • byte code is then executed by an interpreter

Java vs Python

  • Java, source code is compiled to byte code

    • to reduce machine dependency

    • to increase portability

  • Python, the source code is compiled to byte code

    • to reduce parsing overhead

    • to ease interpretation.

Program Representation

lang byte code

What are Virtual Machines?

The software that executes byte code is called an abstract virtual machine. These virtual machines are modelled after microprocessors.

Microprocessor Model v1

microprocessor 1
  • A microprocessor is a device that reads data from memory, processes it and writes back data to memory, based on instructions

  • Instructions specify the memory location to read operands from, the operation to perform, and the memory location to write the result to

Microprocessor Model v2

microprocessor 2
  • Improved model of the microprocessor

  • Instructions are themselves stored in memory

  • A microprocessor is a device that reads data from memory, processes it and writes data back to memory, based on instructions that are themselves stored in and fetched from memory

Microprocessor Model v3

microprocessor 3
  • Registers: memory locations within microprocessors

  • Operands and results are temporarily stored in registers

Python Interpreter Model v1

py vm 1
  • Modelled after microprocessors. Data is represented as objects stored in the heap

  • Works by manipulating these objects in memory, based on instructions

  • Instructions specify the type of objects to create, which objects to delete, and which attribute of an object to read or modify

Python Interpreter Model v2

py vm 2
  • Byte code instructions are themselves wrapped in "code objects" and are also stored in the heap

  • Python’s compiler parses functions and modules and converts them to code objects

  • Python interpreter executes the code objects

Types of Virtual Machines

  • Classified by how instructions access their operands and store their results

  • Types of Abstract Virtual Machines

    1. Register Machines

    2. Stack Machines

Register Machines

  • Register machines are more like traditional microprocessors

  • Expression: 1 + 2 + 4

    LOADK R1, 1
    LOADK R2, 2
    ADD R3, R2, R1		# R3 = R2 + R1
    LOADK R1, 4
    ADD R3, R3, R1		# R3 = R3 + R1
  • Examples:

    • Lua VM

    • Dalvik VM
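
The register listing above can be traced with a toy simulator. This is a minimal sketch in Python 3, not how Lua or Dalvik are implemented; the function name run_register_machine and the tuple encoding of instructions are assumptions made for illustration.

```python
def run_register_machine(program):
    """Execute (op, *operands) tuples on a toy register machine."""
    regs = {}
    for op, *args in program:
        if op == "LOADK":            # LOADK Rd, k    ->  Rd = k
            dest, const = args
            regs[dest] = const
        elif op == "ADD":            # ADD Rd, Ra, Rb ->  Rd = Ra + Rb
            dest, a, b = args
            regs[dest] = regs[a] + regs[b]
        else:
            raise ValueError("unknown instruction: %r" % (op,))
    return regs


# The expression 1 + 2 + 4 from the slide
program = [
    ("LOADK", "R1", 1),
    ("LOADK", "R2", 2),
    ("ADD", "R3", "R2", "R1"),
    ("LOADK", "R1", 4),
    ("ADD", "R3", "R3", "R1"),
]
registers = run_register_machine(program)
print(registers["R3"])  # 7
```

Every instruction names its source and destination registers explicitly, which is the defining trait of a register machine.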

Stack Machines

  • Instructions for a hypothetical stack machine, evaluating the same expression 1 + 2 + 4

    loadi 1
    loadi 2
    add
    loadi 4
    add

Step 1

stack machine 1

Step 2

stack machine 2

Step 3

stack machine 3

Step 4

stack machine 4

Step 5

stack machine 5
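
The five instructions of the hypothetical stack machine can be traced in code. This is a rough Python 3 sketch; run_stack_machine and the tuple encoding are assumptions for illustration. It records one snapshot of the stack after each instruction.

```python
def run_stack_machine(program):
    """Run loadi/add instructions; return the final stack and a per-step trace."""
    stack = []
    trace = []
    for instr in program:
        op = instr[0]
        if op == "loadi":            # push an immediate integer
            stack.append(instr[1])
        elif op == "add":            # pop two operands, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError("unknown instruction: %r" % (op,))
        trace.append(list(stack))    # copy of the stack after this step
    return stack, trace


program = [("loadi", 1), ("loadi", 2), ("add",), ("loadi", 4), ("add",)]
stack, trace = run_stack_machine(program)
for step, snapshot in enumerate(trace, start=1):
    print("Step %d: %s" % (step, snapshot))
```

Unlike the register machine, add carries no operands: its inputs and output are implied by the top of the stack.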

Python’s Stack Machine: Difference

  • Objects are always stored in the heap

  • Only the pointer to the object is stored in the stack

  • Operation of BINARY_ADD instruction

Step 1: Initial State

py vm stack 1
  • Two integer objects in the heap, stack has pointers to the two objects

Step 2: Add the Stack Operands

py vm stack 2
  • Pops the top two objects, 10 and 20, and adds them, producing a new object, 30

Step 3: Final State

py vm stack 3
  • Pushes the pointer to the new object onto the stack
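
The three steps can be mimicked with an ordinary Python list, which holds references to objects rather than the objects themselves. This is only an analogy for the interpreter's value stack, not the interpreter's actual code.

```python
a = 10
b = 20

# Step 1: the "stack" holds references to the two existing objects
stack = [a, b]
assert stack[0] is a and stack[1] is b

# Step 2: pop both references and add the objects they point to
right = stack.pop()
left = stack.pop()
result = left + right

# Step 3: push a reference to the resulting object
stack.append(result)
print(stack)  # [30]
```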

Byte Code Hacks

Getting Started

link:code/hack-1.py[role=include]

The function just divides two constants and returns the value back to the caller.

The Code Object

link:code/hack-2.py[role=include]
  • The code object is accessible as myfunc.__code__

  • co_consts contains a tuple of the constants used by the function

  • co_code contains the byte code instructions generated by compiling the function

  • Code object associated with every function

  • Contains the byte code instructions and associated data to execute the function
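
The same attributes can be inspected on Python 3. The function below is a hypothetical stand-in for the one in hack-1.py, which is not reproduced here. Newer compilers may fold 6 / 2 into a single constant and use a different instruction encoding, so the output will not match the (None, 6, 2) tuple and byte values shown in these slides.

```python
def myfunc():
    # Hypothetical stand-in: divides two constants and returns the result
    return 6 / 2


co = myfunc.__code__
print(type(co).__name__)   # code
print(co.co_consts)        # constants tuple; contents vary by Python version
print(co.co_code.hex())    # raw byte code as hex; encoding varies by version
```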

What does 0x64 mean?

  • dis module has a mapping from opcodes to mnemonics

    >>> import dis
    >>> print dis.opname[0x64]

LOAD_CONST Instruction

  • Followed by a 2-byte integer operand, stored low byte first

  • Operand is an index into the co_consts tuple

  • Specifies the constant to be pushed / loaded onto the stack

  • The operand in this case, 0x0001, corresponds to the constant 6

  • Instruction can thus be represented in mnemonics as

    LOAD_CONST 1
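
Decoding the operand by hand looks like this. The sketch uses the Python 2 byte string and constants tuple shown in these slides, written as literals so that it runs on Python 3; decode_operand is a hypothetical helper name.

```python
co_consts = (None, 6, 2)
co_code = bytes([0x64, 0x01, 0x00, 0x64, 0x02, 0x00, 0x15, 0x53])


def decode_operand(code, offset):
    """Return the 16-bit little-endian operand that follows the opcode at offset."""
    return code[offset + 1] | (code[offset + 2] << 8)


first = decode_operand(co_code, 0)    # operand of the LOAD_CONST at offset 0
second = decode_operand(co_code, 3)   # operand of the LOAD_CONST at offset 3
print(first, co_consts[first])        # 1 6
print(second, co_consts[second])      # 2 2
```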

Second Instruction

co_consts: (None, 6, 2)
co_code:
    0: 64
    1: 01
    2: 00
    3: 64 <=
    4: 02
    5: 00
    6: 15
    7: 53
  • Second instruction is LOAD_CONST

  • Operand is the index 0x0002, which corresponds to the constant 2

Decoded Instructions

  • Instructions decoded so far

    64    LOAD_CONST 1
    01
    00
    64    LOAD_CONST 2
    02
    00

Third Instruction

co_consts: (None, 6, 2)
co_code:
    0: 64
    1: 01
    2: 00
    3: 64
    4: 02
    5: 00
    6: 15 <=
    7: 53
>>> print dis.opname[0x15]

BINARY_DIVIDE Instruction

  • Does not require any operands

  • Pops the top two values from the stack, divides them, and pushes the result back onto the stack

Fourth Instruction

co_consts: (None, 6, 2)
co_code:
    0: 64
    1: 01
    2: 00
    3: 64
    4: 02
    5: 00
    6: 15
    7: 53 <=
>>> print dis.opname[0x53]

RETURN_VALUE Instruction

  • RETURN_VALUE instruction takes the top of the stack and returns the value back to the caller.

Complete Dis-assembly

64    LOAD_CONST 1
01
00
64    LOAD_CONST 2
02
00
15    BINARY_DIVIDE
53    RETURN_VALUE
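
To see the four instructions at work, here is a toy evaluator for just the three opcodes decoded above. It is a sketch of the idea, not CPython's evaluation loop. It assumes the Python 2 opcode values and 3-byte LOAD_CONST encoding from these slides, and it handles integer operands only.

```python
LOAD_CONST = 0x64
BINARY_DIVIDE = 0x15
RETURN_VALUE = 0x53


def execute(co_code, co_consts):
    """Interpret a Python 2 style byte string using a value stack."""
    stack = []
    pc = 0
    while pc < len(co_code):
        op = co_code[pc]
        if op == LOAD_CONST:
            index = co_code[pc + 1] | (co_code[pc + 2] << 8)
            stack.append(co_consts[index])
            pc += 3
        elif op == BINARY_DIVIDE:
            right = stack.pop()
            left = stack.pop()
            # Python 2 classic division floors when both operands are ints
            stack.append(left // right)
            pc += 1
        elif op == RETURN_VALUE:
            return stack.pop()
        else:
            raise ValueError("unsupported opcode: 0x%02x" % op)
    raise ValueError("byte code ended without RETURN_VALUE")


co_code = bytes([0x64, 0x01, 0x00, 0x64, 0x02, 0x00, 0x15, 0x53])
print(execute(co_code, (None, 6, 2)))  # 3
```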

Writing a Dis-assembler

  • Manual dis-assembly of code

  • A disassembler runs through byte code string

    • prints the opname of each byte code instruction

  • Only catch is that some of them take an operand

Writing a Dis-assembler (Contd.)

  • Code to determine if an opcode takes an operand

    if op >= dis.HAVE_ARGUMENT:
       print "opcode has an operand."
    else:
       print "opcode does not have an operand."
  • Get disas.py
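
A minimal disassembler along these lines might look as follows. This is a sketch, not the contents of disas.py. To stay runnable on Python 3 it hardcodes the Python 2 value of dis.HAVE_ARGUMENT (90) and only the three opcode names decoded so far.

```python
HAVE_ARGUMENT = 90   # value of dis.HAVE_ARGUMENT in Python 2
OPNAME = {           # only the opcodes seen so far in these slides
    0x64: "LOAD_CONST",
    0x15: "BINARY_DIVIDE",
    0x53: "RETURN_VALUE",
}


def disassemble(code):
    """Return one 'offset mnemonic [operand]' line per instruction."""
    lines = []
    offset = 0
    while offset < len(code):
        op = code[offset]
        name = OPNAME.get(op, "<%d>" % op)
        if op >= HAVE_ARGUMENT:
            # opcode is followed by a 2-byte little-endian operand
            operand = code[offset + 1] | (code[offset + 2] << 8)
            lines.append("%d %s %d" % (offset, name, operand))
            offset += 3
        else:
            lines.append("%d %s" % (offset, name))
            offset += 1
    return lines


co_code = bytes([0x64, 0x01, 0x00, 0x64, 0x02, 0x00, 0x15, 0x53])
for line in disassemble(co_code):
    print(line)
```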

Built-in Disassembler

The dis module has a disassemble() function:

>>> help(dis.disassemble)
Help on function disassemble in module dis:

disassemble(co, lasti=-1)
    Disassemble a code object.
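
On Python 3 the same function can be called as below. The file keyword is a Python 3 addition, used here to capture the listing in a string. The function myfunc is a hypothetical example, and the mnemonics printed depend on the interpreter version, so no expected listing is shown.

```python
import dis
import io


def myfunc(a, b):
    return a / b


buf = io.StringIO()
dis.disassemble(myfunc.__code__, file=buf)   # write the listing into buf
listing = buf.getvalue()
print(listing)
```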

Hack 1

Hacking Code Objects

  • Modify the constants so that the tuple is (None, 10, 2) instead of (None, 6, 2)

  • Will that result in the program printing 5?

  • But code objects are immutable

  • Create a new code object with the new value for co_consts

  • Get hack-1.py

Hack Explained

link:code/hack-5/hack.py[role=include]
  • The new module provides the constructor for creating code objects

  • Takes a long list of arguments

  • All arguments are taken from the old code object, except co_consts

  • A new tuple of constants is specified instead

Hack Explained

link:code/hack-5/inject.py[role=include]
  • Old code object is replaced with new code object
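
The same hack can be approximated on Python 3.8 or later, where code objects have a replace() method and the new module no longer exists. This is a sketch under assumptions, not the contents of hack.py or inject.py. It uses float constants held in local variables so that the compiler keeps both values in co_consts.

```python
def myfunc():
    x = 6.0
    y = 2.0
    return x / y


old_code = myfunc.__code__

# Build a new constants tuple with 6.0 swapped for 10.0
new_consts = tuple(10.0 if c == 6.0 else c for c in old_code.co_consts)

# Code objects are immutable, so create a modified copy
new_code = old_code.replace(co_consts=new_consts)

before = myfunc()
myfunc.__code__ = new_code    # inject the new code object into the function
after = myfunc()
print(before, after)          # 3.0 5.0
```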

Hack 2

Hacking the Byte Code String

  • Modify the byte code string

  • Replace the BINARY_DIVIDE instruction with BINARY_ADD

  • BINARY_ADD corresponds to opcode 0x17

  • BINARY_DIVIDE appears at offset 6 in the byte code string

  • Need to replace it with BINARY_ADD

BINARY_DIVIDE to BINARY_ADD

  • Code to unpack, modify and repack the byte code string

    bcode = co.co_code     # \x64\x01\x00\x64\x02\x00\x15\x53
    bcode = list(bcode)    # ['\x64', '\x01', ..., '\x15', '\x53']
    bcode[6] = "\x17"
    bcode = "".join(bcode) # \x64\x01\x00\x64\x02\x00\x17\x53
  • Get hack-2.py
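
On Python 3, where byte code is a bytes object rather than a str, the same patch can be expressed with a bytearray. The sketch below works on the Python 2 byte string from these slides as a literal and does not build a runnable code object.

```python
BINARY_DIVIDE = 0x15
BINARY_ADD = 0x17

co_code = b"\x64\x01\x00\x64\x02\x00\x15\x53"

patched = bytearray(co_code)          # mutable copy of the byte code
assert patched[6] == BINARY_DIVIDE    # sanity check before patching
patched[6] = BINARY_ADD
patched = bytes(patched)              # back to an immutable byte string

print(patched.hex())  # 6401006402001753
```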

Hack 3

Look Ma, No Hands!

  • Create code object for a function, without actually writing the Python function

  • Let’s implement the classic "Hello World"

  • Byte code instructions and the constants tuple for implementing this

    Consts: (None, "Hello Byte Code World!")
    Byte Code Instructions:
        LOAD_CONST 1
        PRINT_ITEM
        PRINT_NEWLINE
        LOAD_CONST 0
        RETURN_VALUE

Instructions Explained

  • PRINT_ITEM pops object from the top of the stack and prints it

  • PRINT_NEWLINE prints a newline character

  • Code returns None, to the caller, as required by Python

  • Get hack-3.py
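
Turning the mnemonics into a byte code string can be sketched with a tiny assembler. This is not the contents of hack-3.py. The opcode numbers are the Python 2 values as I recall them from its opcode table and are worth checking against dis.opmap on a Python 2 interpreter. The resulting bytes are not valid Python 3 byte code.

```python
HAVE_ARGUMENT = 90   # Python 2: opcodes at or above this take an operand
OPCODES = {          # assumed Python 2 opcode numbers
    "LOAD_CONST": 100,
    "PRINT_ITEM": 71,
    "PRINT_NEWLINE": 72,
    "RETURN_VALUE": 83,
}


def assemble(instructions):
    """Encode (mnemonic, operand) pairs as a Python 2 style byte string."""
    out = bytearray()
    for name, operand in instructions:
        op = OPCODES[name]
        out.append(op)
        if op >= HAVE_ARGUMENT:
            out.append(operand & 0xFF)          # low byte first
            out.append((operand >> 8) & 0xFF)   # then high byte
    return bytes(out)


program = [
    ("LOAD_CONST", 1),
    ("PRINT_ITEM", None),
    ("PRINT_NEWLINE", None),
    ("LOAD_CONST", 0),
    ("RETURN_VALUE", None),
]
code_string = assemble(program)
print(code_string.hex())  # 640100474864000053
```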

Code Object Constructor

  • co_argcount

    • Specifies the number of positional arguments

    • Our value is 0

  • co_nlocals

    • Specifies the number of local variables, including positional arguments

    • Our value is 0

  • co_stacksize

    • Specifies stack depth utilized by the code object

    • At any given point, we do not store more than 1 element in the stack

Code Object Constructor (Contd.)

  • co_flags

    • Specifies using bit flags whether the function accepts variable number of arguments, whether the function is a generator, etc.

  • co_varnames

    • Specifies the names of positional arguments and local variables

    • Empty tuple

  • co_names

    • Specifies the names of identifiers other than local variables

    • Empty tuple
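
The bit flags in co_flags can be tested against the CO_* constants exported by the inspect module. A small Python 3 sketch follows; the three functions and the has_flag helper are hypothetical examples.

```python
import inspect


def plain(a, b):
    return a + b


def variadic(*args):
    return args


def gen():
    yield 1


def has_flag(func, flag):
    """Return True if the given CO_* bit is set in the function's co_flags."""
    return bool(func.__code__.co_flags & flag)


print(has_flag(plain, inspect.CO_VARARGS))     # False
print(has_flag(variadic, inspect.CO_VARARGS))  # True
print(has_flag(gen, inspect.CO_GENERATOR))     # True
```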

Code Object Constructor (Contd.)

  • co_filename

    • Specifies the file in which the function was present

    • Dummy filename test.py

  • co_name

    • Specifies the name of the function.

  • co_firstlineno

    • Specifies the first line number of the function within the file

Code Object Constructor (Contd.)

  • co_lnotab

    • Encodes the mapping of byte code offset to line numbers

    • Used while printing tracebacks

    • Empty string

Hack 4

More Bytes of Code

  • A Python function that accepts two arguments, adds them and returns the result

  • The byte code equivalent of the following Python function

    def myfunc(a, b):
        return a + b

LOAD_FAST Instruction

  • Arguments / local variables, can be loaded using LOAD_FAST

  • Loads the value of a local variable on to the stack

  • Instruction accepts an argument that specifies the local variable as an index into the co_varnames tuple.

  • Get hack-4.py

    Var Names: ("a", "b")
    Byte Code:
        LOAD_FAST 0
        LOAD_FAST 1
        BINARY_ADD
        RETURN_VALUE

Code Object Constructor

  • No. of arguments, co_argcount, is 2

  • No. of arguments and local variables co_nlocals, is 2

  • Stack size co_stacksize, is specified as 2

    • A maximum of two items is pushed into the stack.

  • Names of the positional arguments / local variables co_varnames, is ("a", "b")
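
These values can be checked against what the compiler itself produces for the equivalent function. The sketch below runs on Python 3. The value of co_stacksize is printed without an expected value because it is computed by the compiler and may differ between versions.

```python
def myfunc(a, b):
    return a + b


co = myfunc.__code__
print(co.co_argcount)   # 2
print(co.co_nlocals)    # 2
print(co.co_varnames)   # ('a', 'b')
print(co.co_stacksize)  # compiler-computed stack depth
```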

Python vs Java

Python Byte Code Instructions

  • BINARY_ADD instruction

    • Does not operate on integers

    • Unlike the ADD instruction of a microprocessor

    • Operates on objects, provided they implement the __add__() or __radd__() magic methods

  • Distinguishes Python’s byte codes from Java’s byte codes

  • Python’s byte codes are there to simplify interpretation

  • Closer to the source language
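
The point about objects can be seen with a small class. The same add function works on anything that implements the magic methods. Money is a hypothetical example class. Recent Python 3 releases name the instruction differently, but the + operator dispatches to __add__() and __radd__() in the same way.

```python
class Money:
    def __init__(self, amount):
        self.amount = amount

    def __add__(self, other):
        # Called for: Money + something
        if isinstance(other, Money):
            return Money(self.amount + other.amount)
        if isinstance(other, int):
            return Money(self.amount + other)
        return NotImplemented

    def __radd__(self, other):
        # Called for: something + Money, when something cannot handle it
        return self.__add__(other)


def add(a, b):
    return a + b


print(add(1, 2))                       # 3
print(add("byte", "code"))             # bytecode
print(add(Money(1), Money(2)).amount)  # 3
print(add(5, Money(2)).amount)         # 7
```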

Java Byte Code Instructions

  • Java’s byte codes are there to reduce machine dependency

  • Closer to machine instructions

Thank You