Assembly language for CE newbie #3: Arithmetic, Conditions and Branches operations with SSE

Post by bbfox »

This article is for newbies.

Note: I'm not an expert. I'm just someone who knows some instructions and can write Auto Assembler scripts. This article shares my experiences.

Warning: This post features AI-assisted content. While I created the first document, an AI arranged the syntax and wording, which I then curated. If you prefer not to engage with such material, please use your browser's back button.



Table of Contents
Assembly language for CE newbie #3: Arithmetic, Conditions and Branches operations with SSE
Arithmetic operation SSE Scalar instructions
Arithmetic operations with AVX Scalar Instructions
Move / Convert data type between source and SSE/AVX instructions
Conditions and Branches with SSE/AVX instructions




Assembly language for CE newbie #3: Arithmetic, Conditions and Branches operations with SSE


Reference:
About Intel x64 CPU Registers
Performing Arithmetic Operations on Floats using SSE2 (Source Integer in EAX)


"SSE", or "Streaming SIMD Extensions", is a set of instructions in which a single instruction operates on multiple data items at once: four 32-bit floats, two 64-bit doubles, or several packed integers.

SSE instructions operate on either scalar/serial or packed/parallel data. In the instruction name, "s?" means scalar/serial and "p?" means packed/parallel.

In the CE environment, we usually work with scalar single-precision ("float") or scalar double-precision ("double") values. The names of these instructions end in "ss" / "sd".


SSE XMM Registers
An XMM register is 128 bits wide, which means it can store 4 32-bit floats or 2 64-bit doubles.

  • If an instruction stores and operates on only one float or double in an XMM register, it is called a scalar (single/double)-precision operation.

  • If an instruction stores and operates on 4 floats or 2 doubles in an XMM register at once, it is called a packed (single/double)-precision operation (4 floats/2 doubles).


AVX and AVX-512 introduce the wider ?MM series registers

  • A YMM register (AVX) is 256 bits wide, which means it can store 8 32-bit floats or 4 64-bit doubles.

  • A ZMM register (AVX-512) is 512 bits wide, which means it can store 16 32-bit floats or 8 64-bit doubles.

As of Cheat Engine 7.5, Auto Assembler (AA) scripts do not support ZMM registers. YMM registers can be used, but there is no syntax highlighting for them.
I have never used any YMM registers so far.

XMM registers (128 bits wide)
Instruction series:

                                Scalar      Packed
-----------------------------------------------------------------------------
Single-precision "float"        ss          ps (4 floats/singles)
Double-precision "double"       sd          pd (2 doubles)

XMM  registers (128 bit width)

bit#              128    96     64     32      0
---------------------------+------+------+------+
ss instructions                           XXXXXX  1 float
ps instructions      XXXXXX XXXXXX XXXXXX XXXXXX  4 floats
sd instructions                    XXXXXX-XXXXXX  1 double
pd instructions      XXXXXX-XXXXXX XXXXXX-XXXXXX  2 doubles

instructions example:
| Instruction | Operation                       |
|-------------|---------------------------------|
| addss       | Add Scalar Single-precision     |
| addsd       | Add Scalar Double-precision     |
| addps       | Add Packed Single-precision     |

In CE's Auto Assembler, most of the instructions I use are from the ss series -- Scalar Single-precision operations. Occasionally, I use the sd series, depending on whether the source data is already in double precision format.


Some cases for SS operations

Whether to use ss or sd depends on what you want. There is no fixed rule for this.

For integer operations, you can simply use the integer arithmetic instructions (add, sub, mul, etc.) if you want to.

Reasons I use SS instructions:

  • Source data is float

  • Source data is integer and I want to multiply it by a fractional value like 1.33 or 2.5

  • Source data is integer and I don't want to save and restore the RDX/RAX registers via push/pop

Reasons I use SD instructions:

  • Source data is double
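
As a rough illustration of the last "SS" reason, this is what scaling an integer by about 1.33 (here 4/3) could look like with integer instructions only. It is a sketch for 64-bit targets, not code from any real table. CE's Auto Assembler reads plain numbers as hexadecimal, which makes no difference for the small constants used here.

```asm
; eax = eax * 4 / 3, integer instructions only
push rdx            ; cdq/idiv overwrite edx, so it has to be saved
push rcx            ; ecx is borrowed for the divisor
imul eax, eax, 4    ; eax = eax * 4
cdq                 ; sign-extend eax into edx:eax, as idiv expects
mov ecx, 3
idiv ecx            ; eax = quotient, edx = remainder
pop rcx
pop rdx
```

With the ss instructions the multiplier can be any float and RDX/RAX stay untouched, although the XMM registers you borrow may need saving instead.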



Arithmetic operation SSE Scalar instructions

Instruction format:

  • ???ss xmm1, xmm2/m32: operand 1 is xmm register, operand 2 can be an xmm register or 32-bit memory location.

  • ???sd xmm1, xmm2/m64: operand 1 is xmm register, operand 2 can be an xmm register or 64-bit memory location.


Commonly used instructions:
Add

addss: for float
addsd: for double

Example:

addss xmm1, xmm2  ; xmm1 = xmm1 + xmm2

Subtract

subss: for float
subsd: for double

Example:

subss xmm1, xmm2  ; xmm1 = xmm1 - xmm2

Multiply

mulss: for float
mulsd: for double

Example:

mulss xmm1, xmm2  ; xmm1 = xmm1 * xmm2

Divide

divss: for float
divsd: for double

Example:

divss xmm1, xmm2  ; xmm1 = xmm1 / xmm2

The addss/subss... instructions listed above belong to the original SSE instruction set (1999), and their sd counterparts (addsd/subsd...) to SSE2 (2000). Every x64 CPU supports both.



Arithmetic operations with AVX Scalar Instructions

AVX Instruction Set
AVX was introduced in 2011:

  • Intel: the "Sandy Bridge" architecture and later.
  • AMD: the Bulldozer architecture and later (Piledriver, Steamroller, Excavator, Zen).

Atom, Celeron, or older Pentium CPUs may not support AVX.
Use the CPU-Z tool to check if your CPU supports AVX.
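
If you would rather have the script check for AVX itself, one possible approach is the usual CPUID test. This is only a sketch: hasAVX and avx_check_done are hypothetical names (a byte variable and a label you would have to declare yourself), and it assumes your CE version assembles cpuid and xgetbv (if xgetbv is not accepted, its bytes are db 0F 01 D0). Numbers are hexadecimal, as usual in Auto Assembler, and the flags are modified by the check.

```asm
; Sketch: set the hypothetical byte [hasAVX] to 1 when AVX is usable, else 0.
push rax
push rbx
push rcx
push rdx                 ; cpuid overwrites eax, ebx, ecx and edx

mov byte ptr [hasAVX], 0
mov eax, 1
cpuid                    ; CPU feature flags are returned in ecx and edx
and ecx, 18000000        ; hex: bit 27 (OSXSAVE) + bit 28 (AVX)
cmp ecx, 18000000
jne avx_check_done       ; CPU or OS does not offer AVX

xor ecx, ecx
xgetbv                   ; edx:eax = XCR0 (which register states the OS saves)
and eax, 6               ; bit 1 = XMM state, bit 2 = YMM state
cmp eax, 6
jne avx_check_done       ; OS does not save the YMM registers

mov byte ptr [hasAVX], 1

avx_check_done:
pop rdx
pop rcx
pop rbx
pop rax
```

The script can then test [hasAVX] once and choose between an SSE and an AVX code path.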

The SSE scalar instructions mentioned above, such as addss/subss, overwrite the destination XMM register's value with the result. In some cases, we have to reload the original data from the source. AVX introduces three-operand versions of these instructions that do not destroy the original registers' content. I use AVX whenever possible (note: AVX cannot be used in 32-bit programs). The drawback is that if the user's CPU is very old or does not support AVX, the script may crash the program.

Instruction Format:
v???ss xmm1, xmm2, xmm3/m32
v???sd xmm1, xmm2, xmm3/m64

The result is stored in xmm1.
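
To show what the extra operand buys you, here is the same task written both ways. This is a sketch; [fltMul] stands for any float in memory, as in the other examples of this article.

```asm
; Goal: xmm1 = xmm0 * [fltMul], with xmm0 still holding the original value.

; SSE: copy first, because mulss overwrites its first operand
movaps xmm1, xmm0            ; copy the whole register
mulss  xmm1, [fltMul]        ; xmm1 = xmm1 * [fltMul]

; AVX: one instruction, xmm0 is only read
vmulss xmm1, xmm0, [fltMul]  ; xmm1 = xmm0 * [fltMul]
```
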

Common Instructions I Use:
Add

- vaddss: for float
- vaddsd: for double

Example:

vaddss xmm1, xmm2, xmm3              ; xmm1 = xmm2 + xmm3
vaddss xmm1, xmm2, dword ptr [var1]  ; xmm1 = xmm2 + memory location [var1]

Subtract

- vsubss: for float
- vsubsd: for double

Example:

vsubss xmm1, xmm2, xmm3  ; xmm1 = xmm2 - xmm3
vsubss xmm1, xmm1, xmm3  ; xmm1 = xmm1 - xmm3

Multiply

- vmulss: for float
- vmulsd: for double

Example:

vmulss xmm1, xmm2, xmm3  ; xmm1 = xmm2 * xmm3

Divide

- vdivss: for float
- vdivsd: for double

Example:

vdivss xmm1, xmm2, xmm3  ; xmm1 = xmm2 / xmm3


Move / Convert Data Types Between Source and SSE/AVX Instructions

The first problem I faced was: how to convert or move data from a source to xmm registers?
I used two major types of instructions:

  1. Move-in / Move-out data: moving data from general-purpose registers to XMM registers, from memory to XMM registers, and from XMM registers to registers or memory.
  2. Data conversion: converting integers to floats and floats to integers.

Move-In / Move-Out Data

                  SSE         AVX
-------------------------------------
(Source Type)
float            movss      vmovss
double           movsd      vmovsd
32-bit integer    movd       vmovd
64-bit integer    movq       vmovq

Example:

movss xmm1, [fltVar1]    ; copy float from memory [fltVar1] to xmm1
                         ; lower 32 bits replaced by [fltVar1]
                         ; higher bits will be cleared.

vmovss xmm1, [fltVar1]   ; copy float from memory [fltVar1] to xmm1 with AVX
                         ; lower 32 bits replaced by [fltVar1]
                         ; higher bits will be cleared.

movss xmm1, xmm2         ; copy lower 32-bit data from xmm2 to xmm1
                         ; higher bits of xmm1 remain unchanged

vmovss xmm1, xmm3, xmm2  ; copy lower 32-bit data from xmm2 to xmm1
                         ; copy bits 32-127 from xmm3 to xmm1

movss [fltVar2], xmm1    ; copy lower 32-bit xmm1 float to [fltVar2]

vmovss [fltVar2], xmm1   ; copy lower 32-bit xmm1 float to [fltVar2]

movd xmm1, eax           ; copy data from eax to xmm1 lower 32-bit
                         ; higher bits cleared (set to zero)

vmovd xmm1, eax          ; copy data from eax to xmm1 lower 32-bit with AVX
                         ; higher bits cleared (set to zero)

vmovd xmm1, [intVar1]    ; copy data from [intVar1] to xmm1 lower 32-bit with AVX
                         ; higher bits cleared (set to zero)
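
The examples above only move data into XMM registers. The other direction and the 64-bit variants follow the same pattern; here is a short sketch, where dblVar1 and dblVar2 are hypothetical double variables and the rax forms need a 64-bit target:

```asm
movd eax, xmm1          ; copy the lower 32 bits of xmm1 to eax
                        ; raw bits are copied, nothing is converted:
                        ; a float 1.0 arrives in eax as 3F800000h

movq xmm1, rax          ; copy the 64-bit value in rax to xmm1's lower 64 bits
                        ; upper 64 bits cleared (set to zero)

movq rax, xmm1          ; copy the lower 64 bits of xmm1 to rax

movsd xmm1, [dblVar1]   ; copy double from memory [dblVar1] to xmm1
                        ; lower 64 bits replaced, upper 64 bits cleared

movsd [dblVar2], xmm1   ; copy the lower 64-bit double of xmm1 to [dblVar2]
```
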

Convert data type from one to another

Converting between different types is a common operation for SSE/AVX instructions.
Typically, we convert data between float and integer data types.

SSE instructions
             (to)  32-bit int.      float         double
--------------------------------------------------------
(From)
32/64-bit integer         N/A    cvtsi2ss       cvtsi2sd
float                cvtss2si         N/A       cvtss2sd
double               cvtsd2si    cvtsd2ss            N/A

  
AVX instructions
             (to)  32-bit int.      float         double
--------------------------------------------------------
(From)
32/64-bit integer         N/A   vcvtsi2ss      vcvtsi2sd
float               vcvtss2si         N/A      vcvtss2sd
double              vcvtsd2si   vcvtsd2ss            N/A

Most frequently used are the integer <-> float conversions.

Example:

cvtsi2ss xmm0, eax         ; convert integer in eax to float, store in
                           ; xmm0's lower 32-bit location
                           ; other bits in xmm0 remain unchanged

vcvtsi2ss xmm0, xmm1, eax  ; convert integer in eax to float, store in
                           ; xmm0's lower 32-bit location
                           ; other bits in xmm0 replaced by xmm1

cvtss2si eax, xmm0         ; convert float in xmm0's lower 32-bit to
                           ; integer, stored in eax

vcvtss2si eax, xmm0        ; convert float in xmm0's lower 32-bit to
                           ; integer with AVX, stored in eax
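
One detail worth knowing about the float-to-integer direction: cvtss2si rounds, while its sibling cvttss2si (note the extra "t") truncates. A small sketch, assuming the game has left the SSE rounding mode at its default:

```asm
; Assume xmm0 holds the float 2.7 in its lower 32 bits.
cvtss2si  eax, xmm0     ; rounds using the current MXCSR rounding mode
                        ; (default: round to nearest) -> eax = 3
cvttss2si eax, xmm0     ; "t" = truncate toward zero  -> eax = 2

; Ties in the default mode go to the nearest even integer:
;   2.5 -> 2,  3.5 -> 4   (cvtss2si)
```

If a cheat should never round a value up, cvttss2si is the safer choice.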

Example of multiplication with SSE:

cvtsi2ss xmm15, eax      ; convert 32-bit integer in eax to float in xmm15
movss xmm14, [fltMul]    ; move float value from memory [fltMul] to xmm14
mulss xmm15, xmm14       ; xmm15 = xmm15 * xmm14
cvtss2si eax, xmm15      ; convert float in xmm15 to 32-bit integer, store in eax

Example of multiplication with AVX:

vcvtsi2ss xmm15, xmm15, eax  ; convert 32-bit integer in eax to float in xmm15
vmovss xmm14, [fltMul]       ; move float value from memory [fltMul] to xmm14
vmulss xmm13, xmm15, xmm14   ; xmm13 = xmm15 * xmm14
vcvtss2si eax, xmm13         ; convert float in xmm13 to 32-bit integer, store in eax

Remember to save registers before using them; whether this is necessary depends on how the game uses these registers around the injection point.
Reference:
Preserving XMM Registers
Preserving XMM Registers to Pre-Allocated Memory
Preserving Register States in Assembly
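
For a quick idea of what saving registers can look like, here is one possible way to borrow xmm14 and xmm15 through the stack. It is my own sketch for 64-bit targets, not taken from the pages referenced above. Offsets are hexadecimal.

```asm
; Sketch: borrow xmm14 and xmm15, then put the old contents back.
lea rsp, [rsp-20]         ; reserve 20h = 32 bytes (lea leaves the flags alone)
movups [rsp], xmm14       ; save all 128 bits; movups needs no alignment
movups [rsp+10], xmm15    ; 10h = 16 bytes further up

; ... your SSE/AVX code using xmm14 / xmm15 goes here ...

movups xmm15, [rsp+10]    ; restore from the same offsets
movups xmm14, [rsp]
lea rsp, [rsp+20]         ; release the 32 bytes
```
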



Conditions and Branches with SSE/AVX Instructions

Sometimes we need to check whether a value is non-negative (i.e., must be >= 0). For general-purpose registers (GPRs), we use the cmp instruction along with branch instructions like ja, je, or jge to perform different operations.
In SSE/AVX, we can use comiss or ucomiss to complete this task (comisd / ucomisd for doubles):

  • GPRs: cmp

  • SSE: comiss, ucomiss

  • AVX: vcomiss, vucomiss

Both comiss and ucomiss report a NaN (Not a Number) operand in the same way through EFLAGS. The difference is when they raise the floating-point invalid-operation exception: comiss signals it for any NaN (quiet or signaling), while ucomiss signals it only for a signaling NaN. When NaN is found, EFLAGS are set as follows:

        Flags:     CF       PF       ZF
------------------------------------------
comiss              1        1        1
ucomiss             1        1        1
vcomiss             1        1        1
vucomiss            1        1        1

You can check for NaN with a branch instruction such as jp (PF = 1).
I do not check for NaN in all cases, so whether I use comiss or ucomiss depends on the specific requirements. The program may crash or fail if a NaN is encountered.

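After executing comiss or ucomiss, we can use branch instructions like ja, jb, je, jae, jbe, etc., to perform conditional branching. These instructions report their result only through CF, ZF and PF (OF and SF are cleared), so use the unsigned-style jumps listed here rather than the signed ones (jg, jl, ...).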
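
If you do want to handle NaN, the jp test has to come first, because a NaN also sets CF and ZF and would otherwise be taken by jb/jbe/je. A minimal sketch with hypothetical labels (is_nan, is_greater, is_less):

```asm
ucomiss xmm15, xmm14   ; compare xmm15 with xmm14
jp  is_nan             ; PF = 1: at least one operand is NaN (unordered)
ja  is_greater         ; xmm15 >  xmm14
jb  is_less            ; xmm15 <  xmm14
                       ; falling through here means xmm15 == xmm14
```
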

Example:
Return a multiplied value only if the result is > 0.

cvtsi2ss xmm15, eax      ; Convert 32-bit integer in eax to float in xmm15.
movss xmm14, [fltMul]    ; Move float value from memory [fltMul] to xmm14.
mulss xmm15, xmm14       ; xmm15 = xmm15 * xmm14.
subss xmm15, [decVal]    ; xmm15 = xmm15 - [decVal].
xorps xmm14, xmm14       ; Clear xmm14 to zero.
comiss xmm15, xmm14
jbe endp                 ; Jump if xmm15 <= 0, skip conversion if result not > 0.

cvtss2si eax, xmm15      ; Convert float in xmm15 to 32-bit integer, store in eax.

endp:

Or using AVX instructions:

vcvtsi2ss xmm15, xmm15, eax      ; Convert 32-bit integer in eax to float in xmm15.
vmovss xmm14, [fltMul]           ; Move float value from memory [fltMul] to xmm14.
vmulss xmm13, xmm15, xmm14       ; xmm13 = xmm15 * xmm14.
vsubss xmm13, xmm13, [decVal]    ; xmm13 = xmm13 - [decVal].
vxorps xmm14, xmm14, xmm14       ; Clear xmm14 to zero.
vcomiss xmm13, xmm14             ; Compare the result in xmm13 with zero.
jbe endp                         ; Jump if xmm13 <= 0, skip conversion if result not > 0.

vcvtss2si eax, xmm13             ; Convert float in xmm13 to 32-bit integer, store in eax.

endp:

Note on float-to-integer conversion:
If the value in the xmm register does not fit into a 32-bit signed integer (for example, it exceeds 2147483647) or is NaN, cvtss2si does not clamp it to the maximum. With the default (masked) exception settings, eax receives the "integer indefinite" value 80000000h (-2147483648).
You may choose to add more checks before conversion, but I have chosen to ignore it. Users must accept the risk themselves.
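
For those who do want such a check, here is one possible sketch. It compares against 2^31 because 2147483647 itself cannot be stored exactly in a float. fltTwoPow31 and the conv_* labels are hypothetical names; treating NaN as 0 is just a choice made for this example, and values below -2147483648 are not handled.

```asm
; Assumes xmm15 holds the float to convert, and that the script defines
;   fltTwoPow31:
;     dd 4F000000        ; bit pattern of the float 2147483648.0 (2^31)
comiss xmm15, [fltTwoPow31]
jp  conv_bad             ; NaN: nothing sensible to convert
jae conv_too_big         ; >= 2^31 does not fit in a signed 32-bit integer
cvtss2si eax, xmm15      ; value is below 2^31, conversion is safe on the upper side
jmp conv_done

conv_too_big:
mov eax, 7FFFFFFF        ; clamp to 2147483647 (hex literal)
jmp conv_done

conv_bad:
xor eax, eax             ; choice made for this sketch: treat NaN as 0

conv_done:
```
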

Last edited by bbfox on Sun Apr 14, 2024 8:34 pm, edited 1 time in total.

I create tables to suit my preferences. Table is free to use, but need to leave the author's name and source URL: https://opencheattables.com.
Table will not be up-to-date. Feel free to modify it, but kindly provide credit to the source.


Post by Alex Darkside »

bbfox wrote: Sun Apr 14, 2024 9:55 am

movss xmm1, [fltVar1] ; copy float from memory [fltVar1] to xmm1
; lower 32 bits replaced by [fltVar1]
; higher bits remain unchanged.

(Google translation)

Small clarification.
This is a screenshot from the "Intel® 64 and IA-32 Architectures Software Developer's Manual".
Please pay attention to the highlighted places in the text.

Image


Post by bbfox »

Corrected.
I'm not sure what the exact behavior is supposed to be. I also tested it in Visual Studio, and the high bits are cleared.

Checked before and after vmovss xmm1, dword ptr [exflt1]:
Image

Image

; masm code start
align 10h

flt1 real4 1.0f
flt2 real4 2.0f
flt3 real4 3.0f
flt4 real4 4.0f

mul1 real4 1.5f
mul2 real4 2.0f
mul3 real4 2.5f
mul4 real4 3.0f

res1 real4 0.0f, 0.0f, 0.0f, 0.0f

exflt1 real4 100.0f
exflt2 real4 200.0f

.code
main PROC
    ; Code start
    movaps xmm1, [flt1]
    vmovss xmm1, dword ptr [exflt1]

movaps xmm1, [flt1]

I don't know what "load" means here. In the manual, the description for a move from memory to an xmm register is:
Image

Source: Intel 64 and IA-32 Architectures Software Developer’s Manual: Volume 2B, instruction set reference M-U, page 124
So the high bits will be cleared.

