This article is for newbies.
Note: I'm not an expert. I'm just someone who knows some instructions and can write Auto Assembler scripts. This article shares my experiences.
Warning: This post features AI-assisted content. While I created the first document, an AI arranged the syntax and wording, which I then curated. If you prefer not to engage with such material, please use your browser's back button.
Table of Contents
Assembly language for CE newbie #3: Arithmetic, Conditions and Branches operations with SSE
Arithmetic operation SSE Scalar instructions
Arithmetic operations with AVX Scalar Instructions
Move / Convert data type between source and SSE/AVX instructions
Conditions and Branches with SSE/AVX instructions
Understanding SHUFPS and VSHUFPS Instructions in SIMD Programming
Assembly language for CE newbie #3: Arithmetic, Conditions and Branches operations with SSE
Reference:
About Intel x64 CPU Registers
Performing Arithmetic Operations on Floats using SSE2 (Source Integer in EAX)
"SSE" ("Streaming SIMD Extensions") is a set of SIMD instructions: a single instruction operates on multiple data elements at once. A 128-bit register holds four 32-bit floats, two 64-bit doubles, or several packed integers.
SSE instructions work on either scalar (single-value) or packed (parallel) data. The instruction name tells you which: a suffix starting with "s" (ss, sd) means scalar, and a suffix starting with "p" (ps, pd) means packed.
In the CE environment we usually work with scalar single-precision ("float") or scalar double-precision ("double") values, so the instruction names end in "ss" or "sd".
SSE XMM Registers
An XMM register is 128 bits wide, so it can store four 32-bit floats or two 64-bit doubles.
If an instruction uses only the lowest float or double in an XMM register, it is a scalar single-precision or scalar double-precision operation.
If an instruction operates on all 4 floats or both doubles at once, it is a packed single-precision (4 floats) or packed double-precision (2 doubles) operation.
AVX and AVX-512 introduce wider ?MM series registers
A YMM register (AVX) is 256 bits wide, so it can store eight 32-bit floats or four 64-bit doubles.
A ZMM register (AVX-512) is 512 bits wide, so it can store sixteen 32-bit floats or eight 64-bit doubles.
As of Cheat Engine 7.5, its Auto Assembler (AA) does not support ZMM registers. YMM registers can be used, but there is no syntax highlighting for them.
I never used any YMM registers so far.
XMM registers (128 bits wide) instruction series:

                             Scalar    Packed
-----------------------------------------------------------------------------
Single-precision "float"     ss        ps (4 floats/Singles)
Double-precision "double"    sd        pd (2 Doubles)
XMM registers (128 bits wide)

bit#            128      96      64      32       0
                 +-------+-------+-------+-------+
ss instructions                           XXXXXX     1 float
ps instructions   XXXXXX  XXXXXX  XXXXXX  XXXXXX     4 floats
sd instructions                   XXXXXX-XXXXXX      1 double
pd instructions   XXXXXX-XXXXXX   XXXXXX-XXXXXX      2 doubles

Instruction examples:

| Instruction | Operation                       |
|-------------|---------------------------------|
| addss       | Add Scalar Single-precision     |
| addsd       | Add Scalar Double-precision     |
| addps       | Add Packed Single-precision     |
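To make the layout above concrete, here is a minimal JavaScript sketch (JavaScript only because the helper at the end of this post uses it). It models one XMM register as 16 bytes and views the same bytes as 4 floats or 2 doubles. The names `xmmBytes`, `floatLanes`, `doubleLanes` and `describeLanes` are made up for illustration, and this is a model, not how CE stores registers.

```javascript
// Model one 128-bit XMM register as 16 raw bytes (illustration only).
const xmmBytes = new ArrayBuffer(16);
const floatLanes = new Float32Array(xmmBytes);   // "ps" view: 4 lanes of 32 bits
const doubleLanes = new Float64Array(xmmBytes);  // "pd" view: 2 lanes of 64 bits

// Writing the low double ("sd" style) only touches bits 0..63,
// so float lanes 2 and 3 (bits 64..127) keep their old content.
floatLanes[2] = 7.0;
floatLanes[3] = 9.0;
doubleLanes[0] = 1.0;

function describeLanes() {
  return { floats: floatLanes.length, doubles: doubleLanes.length };
}

console.log(describeLanes());              // { floats: 4, doubles: 2 }
console.log(floatLanes[2], floatLanes[3]); // 7 9
```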
In CE's Auto Assembler, most of the instructions I use are from the ss series -- Scalar Single-precision operations. Occasionally, I use the sd series, depending on whether the source data is already in double precision format.
Some cases for SS operations
Use ss or sd depends on what you want. There is no rule for this.
For integer operations, just use the integer arithmetic instructions (add, sub, mul, etc.) if you want to.
Reasons I use SS instructions:
Source data is float
Source data is integer and I want to do some multiplier like 1.33 or 2.5
Source data is integer and I don't want to maintain RDX/RAX register via push/pop
Reasons I use SD instructions:
Source data is double
Arithmetic operation SSE Scalar instructions
Instruction format:
???ss xmm1, xmm2/m32: operand 1 is xmm register, operand 2 can be an xmm register or 32-bit memory location.
???sd xmm1, xmm2/m64: operand 1 is xmm register, operand 2 can be an xmm register or 64-bit memory location.
Instructions I commonly use:
Add
addss: for float
addsd: for double
Example:
addss xmm1, xmm2 ; xmm1 = xmm1 + xmm2
Subtract
subss: for float
subsd: for double
Example:
subss xmm1, xmm2 ; xmm1 = xmm1 - xmm2
Multiply
mulss: for float
mulsd: for double
Example:
mulss xmm1, xmm2 ; xmm1 = xmm1 * xmm2
Divide
divss: for float
divsd: for double
Example:
divss xmm1, xmm2 ; xmm1 = xmm1 / xmm2
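As a rough model of what these four scalar instructions do, the JavaScript sketch below treats an XMM register as a `Float32Array` of 4 lanes (lane 0 is the lowest). The function name `scalarOp` is hypothetical. The point it shows: only lane 0 of the destination changes, and the old value of lane 0 is overwritten.

```javascript
// Hypothetical model: xmm register = Float32Array(4), lane 0 = lowest 32 bits.
function scalarOp(op, dst, src) {
  const a = dst[0];
  const b = src[0];
  let result;
  switch (op) {
    case 'addss': result = a + b; break;
    case 'subss': result = a - b; break;
    case 'mulss': result = a * b; break;
    case 'divss': result = a / b; break;
    default: throw new Error('unsupported instruction: ' + op);
  }
  dst[0] = result; // the Float32Array store rounds to single precision
  return dst;      // lanes 1..3 of dst and all of src are untouched
}

const xmm1 = Float32Array.from([10, 20, 30, 40]);
const xmm2 = Float32Array.from([4, 5, 6, 7]);
scalarOp('subss', xmm1, xmm2);
console.log(Array.from(xmm1)); // [ 6, 20, 30, 40 ]
```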
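The addss/subss/mulss/divss instructions listed above belong to the original SSE instruction set (Pentium III, 1999). Their double-precision counterparts (addsd, subsd, mulsd, divsd) were added with SSE2 (Pentium 4).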
Arithmetic operations with AVX Scalar Instructions
AVX Instruction Set
AVX was introduced in 2011:
- Intel: Starting from the "Sandy Bridge" architecture or later.
- AMD: Bulldozer and its successors (Piledriver, Steamroller, Excavator) and the Zen architectures support AVX.
Atom, Celeron, or older Pentium CPUs may not support AVX.
Use the CPU-Z tool to check if your CPU supports AVX.
The SSE/SSE2 scalar instructions mentioned above, such as addss/subss, overwrite the destination XMM register with the result. In some cases that means we have to reload the original data from the source. AVX introduces three-operand versions of the same instructions that write the result to a separate destination, so the source registers keep their content. I use AVX whenever possible (note: this applies to 64-bit targets; AVX in 32-bit programs is limited to xmm0-xmm7 and is not covered here). The drawback is that if the user's CPU is very old or does not support AVX, the script may crash the program.
Instruction Format:
v???ss xmm1, xmm2, xmm3/m32
v???sd xmm1, xmm2, xmm3/m64
The result is stored in xmm1.
Common Instructions I Use:
Add
- vaddss: for float
- vaddsd: for double
Example:
vaddss xmm1, xmm2, xmm3 ; xmm1 = xmm2 + xmm3
vaddss xmm1, xmm2, dword ptr [var1] ; xmm1 = xmm2 + float at memory location [var1]
Subtract
- vsubss: for float
- vsubsd: for double
Example:
vsubss xmm1, xmm2, xmm3 ; xmm1 = xmm2 - xmm3
vsubss xmm1, xmm1, xmm3 ; xmm1 = xmm1 - xmm3
Multiply
- vmulss: for float
- vmulsd: for double
Example:
vmulss xmm1, xmm2, xmm3 ; xmm1 = xmm2 * xmm3
Divide
- vdivss: for float
- vdivsd: for double
Example:
vdivss xmm1, xmm2, xmm3 ; xmm1 = xmm2 / xmm3
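The three-operand form can be sketched in JavaScript the same way. In this hypothetical model (`vScalarOp` is a made-up name, registers are `Float32Array(4)` with lane 0 lowest), the result goes into a new register: lane 0 is the computed value, lanes 1 to 3 are copied from the first source, and neither source is modified.

```javascript
// Hypothetical model of v???ss xmm1, xmm2, xmm3 (128-bit form).
const OPS = {
  vaddss: (a, b) => a + b,
  vsubss: (a, b) => a - b,
  vmulss: (a, b) => a * b,
  vdivss: (a, b) => a / b,
};

function vScalarOp(op, src1, src2) {
  if (!(op in OPS)) throw new Error('unsupported instruction: ' + op);
  const dst = Float32Array.from(src1); // lanes 1..3 come from the first source
  dst[0] = OPS[op](src1[0], src2[0]);  // lane 0 = result, rounded to float
  return dst;                          // src1 and src2 keep their values
}

const xmm2 = Float32Array.from([8, 1, 2, 3]);
const xmm3 = Float32Array.from([2, 9, 9, 9]);
const xmm1 = vScalarOp('vdivss', xmm2, xmm3);
console.log(Array.from(xmm1)); // [ 4, 1, 2, 3 ]
console.log(xmm2[0]);          // 8 (the source is still intact)
```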
Move / Convert Data Types Between Source and SSE/AVX Instructions
The first problem I faced was: how to convert or move data from a source to xmm registers?
I used two major types of instructions:
- Move-in / Move-out data: moving data from general-purpose registers to XMM registers, from memory to XMM registers, and from XMM registers to registers or memory.
- Data conversion: converting integers to floats and floats to integers.
Move-In / Move-Out Data
                   SSE      AVX
-------------------------------------
(Source Type)
float              movss    vmovss
double             movsd    vmovsd
32-bit integer     movd     vmovd
64-bit integer     movq     vmovq
Example:
movss xmm1, [fltVar1]     ; copy float from memory [fltVar1] to xmm1
                          ; lower 32 bits replaced by [fltVar1]
                          ; higher bits will be cleared.
vmovss xmm1, [fltVar1]    ; copy float from memory [fltVar1] to xmm1 with AVX
                          ; lower 32 bits replaced by [fltVar1]
                          ; higher bits will be cleared.
movss xmm1, xmm2          ; copy lower 32-bit data from xmm2 to xmm1
                          ; higher bits of xmm1 remain unchanged
vmovss xmm1, xmm3, xmm2   ; copy lower 32-bit data from xmm2 to xmm1
                          ; copy bits 32-127 from xmm3 to xmm1
movss [fltVar2], xmm1     ; copy lower 32-bit xmm1 float to [fltVar2]
vmovss [fltVar2], xmm1    ; copy lower 32-bit xmm1 float to [fltVar2]
movd xmm1, eax            ; copy data from eax to xmm1 lower 32-bit
                          ; higher bits cleared (set to zero)
vmovd xmm1, eax           ; copy data from eax to xmm1 lower 32-bit with AVX
                          ; higher bits cleared (set to zero)
vmovd xmm1, [intVar1]     ; copy data from [intVar1] to xmm1 lower 32-bit with AVX
                          ; higher bits cleared (set to zero)
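One thing that is easy to miss: movd copies the raw 32 bits and does not convert anything. If eax holds the integer 1 and you movd it into xmm1, a following mulss will not see the float 1.0. The JavaScript sketch below (the function name `movdThenReadAsFloat` is made up) shows what a float instruction would read after such a move.

```javascript
// Hypothetical model: movd xmm1, eax followed by reading xmm1 as a float.
function movdThenReadAsFloat(eax) {
  const view = new DataView(new ArrayBuffer(4));
  view.setInt32(0, eax, true);    // store the raw integer bits (little-endian)
  return view.getFloat32(0, true); // reinterpret the same bits as a float
}

console.log(movdThenReadAsFloat(0x3F800000)); // 1 (0x3F800000 is the bit pattern of 1.0)
console.log(movdThenReadAsFloat(1) === 1);    // false: integer 1 is a tiny denormal float
```

So movd is the right choice when the register already holds float bits (for example a float read with mov eax, [address]); for a real integer value, use the conversion instructions described next.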
Convert data type from one to another
Converting between different types is a common operation for SSE/AVX instructions.
Typically, we convert data between float and integer data types.
SSE instructions
                     (to)   32-bit int.   float        double
--------------------------------------------------------------
(From)
32/64-bit integer           N/A           cvtsi2ss     cvtsi2sd
float                       cvtss2si      N/A          cvtss2sd
double                      cvtsd2si      cvtsd2ss     N/A

AVX instructions
                     (to)   32-bit int.   float        double
--------------------------------------------------------------
(From)
32/64-bit integer           N/A           vcvtsi2ss    vcvtsi2sd
float                       vcvtss2si     N/A          vcvtss2sd
double                      vcvtsd2si     vcvtsd2ss    N/A
Most frequently used are the integer <-> float conversions.
Example:
cvtsi2ss xmm0, eax          ; convert integer in eax to float, store in
                            ; xmm0's lower 32-bit location
                            ; other bits in xmm0 remain unchanged
vcvtsi2ss xmm0, xmm1, eax   ; convert integer in eax to float, store in
                            ; xmm0's lower 32-bit location
                            ; other bits in xmm0 replaced by xmm1
cvtss2si eax, xmm0          ; convert float in xmm0's lower 32-bit to
                            ; integer, stored in eax
vcvtss2si eax, xmm0         ; convert float in xmm0's lower 32-bit to
                            ; integer with AVX, stored in eax
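Two details of these conversions are worth a small sketch. First, a float has only 24 bits of precision, so cvtsi2ss cannot represent every large integer exactly. Second, cvtss2si rounds according to the MXCSR rounding mode, which by default is round-to-nearest with ties going to the even integer (it does not simply cut off the fraction). The JavaScript below models both under that default-rounding assumption; the function names mirror the instructions but are only illustrative, and out-of-range values are not handled here.

```javascript
// Model of cvtsi2ss: integer -> nearest single-precision float.
function cvtsi2ss(intValue) {
  return Math.fround(intValue);
}

// Model of cvtss2si under the default rounding mode (nearest, ties to even).
// Assumes the value fits in a 32-bit integer.
function cvtss2si(floatValue) {
  const lower = Math.floor(floatValue);
  const fraction = floatValue - lower;
  if (fraction < 0.5) return lower;
  if (fraction > 0.5) return lower + 1;
  return lower % 2 === 0 ? lower : lower + 1; // exact tie: pick the even neighbour
}

console.log(cvtsi2ss(16777217)); // 16777216 (2^24 + 1 does not fit in a float)
console.log(cvtss2si(2.5));      // 2
console.log(cvtss2si(3.5));      // 4
console.log(cvtss2si(2.7));      // 3
```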
Example of multiplication with SSE:
cvtsi2ss xmm15, eax    ; convert 32-bit integer in eax to float in xmm15
movss xmm14, [fltMul]  ; move float value from memory [fltMul] to xmm14
mulss xmm15, xmm14     ; xmm15 = xmm15 * xmm14
cvtss2si eax, xmm15    ; convert float in xmm15 to 32-bit integer, store in eax
Example of multiplication with AVX:
vcvtsi2ss xmm15, xmm15, eax  ; convert 32-bit integer in eax to float in xmm15
vmovss xmm14, [fltMul]       ; move float value from memory [fltMul] to xmm14
vmulss xmm13, xmm15, xmm14   ; xmm13 = xmm15 * xmm14
vcvtss2si eax, xmm13         ; convert float in xmm13 to 32-bit integer, store in eax
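To check what a given multiplier will produce before writing the script, the whole convert / multiply / convert-back sequence can be modelled in a few lines of JavaScript. This is only a sketch: `scaleInt` and `roundHalfEven` are made-up names, `fltMul` stands for the float stored at [fltMul], default MXCSR rounding is assumed, and results outside the 32-bit range are not handled.

```javascript
// Round to nearest integer, ties to even (the default cvtss2si behaviour).
function roundHalfEven(x) {
  const r = Math.round(x);
  return (Math.abs(x % 1) === 0.5 && r % 2 !== 0) ? r - 1 : r;
}

// Model of: cvtsi2ss -> movss -> mulss -> cvtss2si
function scaleInt(eax, fltMul) {
  const value = Math.fround(eax);                   // cvtsi2ss xmm15, eax
  const multiplier = Math.fround(fltMul);           // movss xmm14, [fltMul]
  const product = Math.fround(value * multiplier);  // mulss xmm15, xmm14
  return roundHalfEven(product);                    // cvtss2si eax, xmm15
}

console.log(scaleInt(100, 1.33)); // 133
console.log(scaleInt(1, 2.5));    // 2 (2.5 is a tie and goes to the even integer)
console.log(scaleInt(3, 2.5));    // 8 (7.5 is a tie and goes to the even integer)
```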
Remember to save registers before using them. Whether this is needed depends on how the original code uses these registers around the injection point.
Reference:
Preserving XMM Registers
Preserving XMM Registers to Pre-Allocated Memory
Preserving Register States in Assembly
Conditions and Branches with SSE/AVX Instructions
Sometimes we need to check whether a value is non-negative (i.e., must be >= 0). For general-purpose registers (GPRs), we use the cmp instruction along with branch instructions like ja, je, or jge to perform different operations.
In SSE/AVX, we can use comiss or ucomiss to complete this task:
GPRs: cmp
SSE: comiss, ucomiss
AVX: vcomiss, vucomiss
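The difference between comiss and ucomiss is how they handle NaN (Not a Number). Both report a NaN operand as "unordered" and set the same flags. What differs is the exception: comiss signals an invalid-operation exception for any NaN (quiet or signaling), while ucomiss signals it only for a signaling NaN. With the default masked exceptions you will not notice the difference. When a NaN is found, EFLAGS are set as follows:

Flags:       CF   PF   ZF
------------------------------------------
comiss       1    1    1
ucomiss      1    1    1
vcomiss      1    1    1
vucomiss     1    1    1

You can check for NaN with a branch on the parity flag, such as jp.
I do not check for NaN in all cases, so whether I use comiss or ucomiss depends on the specific requirements. If a NaN is encountered and not handled, the branch can go the wrong way (for example, jb, jbe and je are all taken), and the script or the program may fail.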
After executing comiss or ucomiss, we can use branch instructions like ja, jb, je, jae, jbe, etc., to perform conditional branching.
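The flag results and the branches that read them can be summarised in a small JavaScript model. This is a sketch based on the flag table in Intel's manual for COMISS/UCOMISS (greater: ZF=PF=CF=0, less: CF=1, equal: ZF=1, unordered: all three set). The function names are made up and simply mirror the mnemonics.

```javascript
// Flags after "comiss a, b" (ucomiss sets the same flags).
function comiss(a, b) {
  if (Number.isNaN(a) || Number.isNaN(b)) return { ZF: 1, PF: 1, CF: 1 }; // unordered
  if (a > b) return { ZF: 0, PF: 0, CF: 0 };
  if (a < b) return { ZF: 0, PF: 0, CF: 1 };
  return { ZF: 1, PF: 0, CF: 0 };                                          // equal
}

// Conditions tested by the unsigned-style branches used after comiss.
const ja  = (f) => f.CF === 0 && f.ZF === 0; // a > b
const jae = (f) => f.CF === 0;               // a >= b
const jb  = (f) => f.CF === 1;               // a < b (also taken on NaN)
const jbe = (f) => f.CF === 1 || f.ZF === 1; // a <= b (also taken on NaN)
const je  = (f) => f.ZF === 1;               // a == b (also taken on NaN)
const jp  = (f) => f.PF === 1;               // at least one operand is NaN

console.log(ja(comiss(1.5, 0)));  // true
console.log(jbe(comiss(NaN, 0))); // true: a NaN looks like "<=" unless jp is checked first
console.log(jp(comiss(NaN, 0)));  // true
```

This is why the signed branches (jg, jl, jge, jle) are not used here: comiss does not set SF or OF in a meaningful way, so only the above/below family and je/jne/jp make sense.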
Example:
Return a multiplied value only if the result is > 0.
cvtsi2ss xmm15, eax     ; Convert 32-bit integer in eax to float in xmm15.
movss xmm14, [fltMul]   ; Move float value from memory [fltMul] to xmm14.
mulss xmm15, xmm14      ; xmm15 = xmm15 * xmm14.
subss xmm15, [decVal]   ; xmm15 = xmm15 - [decVal].
xorps xmm14, xmm14      ; Clear xmm14 to zero.
comiss xmm15, xmm14
jbe endp                ; Jump if xmm15 <= 0, skip conversion if result not > 0.
cvtss2si eax, xmm15     ; Convert float in xmm15 to 32-bit integer, store in eax.
endp:
Or using AVX instructions:
vcvtsi2ss xmm15, xmm15, eax    ; Convert 32-bit integer in eax to float in xmm15.
vmovss xmm14, [fltMul]         ; Move float value from memory [fltMul] to xmm14.
vmulss xmm13, xmm15, xmm14     ; xmm13 = xmm15 * xmm14.
vsubss xmm13, xmm13, [decVal]  ; xmm13 = xmm13 - [decVal].
vxorps xmm14, xmm14, xmm14     ; Clear xmm14 to zero.
vcomiss xmm13, xmm14           ; Compare the result (in xmm13, not xmm15) with zero.
jbe endp                       ; Jump if xmm13 <= 0, skip conversion if result not > 0.
vcvtss2si eax, xmm13           ; Convert float in xmm13 to 32-bit integer, store in eax.
endp:
Notice for float to integer conversion:
If the value in the xmm register does not fit in a signed 32-bit integer (for example, it exceeds 2147483647) or is NaN, cvtss2si does not clamp it to the maximum. With the default masked exceptions it writes the "integer indefinite" value 0x80000000 (-2147483648) to eax.
You may choose to add more checks before conversion, but I have chosen to ignore it. Users must accept the risk themselves.
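If you do want the extra checks, the logic is small. The JavaScript sketch below shows one possible set of tests to run before a 32-bit cvtss2si; `checkBeforeCvtss2si` is a made-up name and the string results are just labels. In a script this would be a comiss against a limit constant followed by jp / jae / jb.

```javascript
// Hypothetical pre-check for converting a float to a signed 32-bit integer.
function checkBeforeCvtss2si(value) {
  const f = Math.fround(value);              // the value as a single-precision float
  if (Number.isNaN(f)) return 'nan';         // in assembly: jp after comiss
  if (f >= 2147483648) return 'too-large';   // 2^31 is exactly representable as a float
  if (f < -2147483648) return 'too-small';   // -2^31 itself still fits
  return 'ok';
}

console.log(checkBeforeCvtss2si(1234.5));     // ok
console.log(checkBeforeCvtss2si(1e10));       // too-large
console.log(checkBeforeCvtss2si(2147483647)); // too-large
```

The last line looks surprising but follows from float precision: 2147483647 is not representable as a float and rounds up to 2147483648, which no longer fits in eax. The largest float that still converts safely is 2147483520.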
Understanding SHUFPS and VSHUFPS Instructions in SIMD Programming
shufps and vshufps are powerful SIMD instructions used for reordering (shuffling) single-precision floating-point elements within xmm or ymm registers. These instructions are widely used in SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) for efficiently manipulating data. In this article, we will discuss how these instructions work, with a particular focus on the use of the immediate value (imm8) control byte.
1. The Structure of the shufps and vshufps Control Byte (imm8)
The control byte (imm8) in shufps and vshufps instructions is an 8-bit immediate value that determines how elements in the source registers are rearranged. The 8 bits are split into four pairs, and each pair selects which source element is copied into one of the 32-bit positions of the target register.
- imm8 consists of 8 bits:
[7, 6, 5, 4, 3, 2, 1, 0]
- Each pair of bits controls a specific position in the target register:
- Bits [1:0]: Controls the element in position 0 of the target register.
- Bits [3:2]: Controls the element in position 1 of the target register.
- Bits [5:4]: Controls the element in position 2 of the target register.
- Bits [7:6]: Controls the element in position 3 of the target register.
The value of each pair determines the source of the element in the final register:
- 00 (“0”): Selects the element from position 0 of the source register.
- 01 (“1”): Selects the element from position 1 of the source register.
- 10 (“2”): Selects the element from position 2 of the source register.
- 11 (“3”): Selects the element from position 3 of the source register.
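The bit-pair layout described above can be sketched as a small decoder. `decodeImm8` is a hypothetical helper name. It returns, for target positions 0 to 3, the index of the source element that each position receives.

```javascript
// Split imm8 into four 2-bit selectors, one per target position (0..3).
function decodeImm8(imm8) {
  return [0, 1, 2, 3].map((position) => (imm8 >> (position * 2)) & 3);
}

console.log(decodeImm8(0x1B)); // [ 3, 2, 1, 0 ]  (0b00011011)
console.log(decodeImm8(0x00)); // [ 0, 0, 0, 0 ]
```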
2. Example: imm8 = 0b01000100 (0x44)
Let’s take an example with imm8 set to 0b01000100 (which is equivalent to 0x44 in hexadecimal). Here’s how the control byte is interpreted:
- Bits [1:0] (“00”): The element at position 0 of the target register will come from position 0 of the source1 register.
- Bits [3:2] (“01”): The element at position 1 of the target register will come from position 1 of the source1 register.
- Bits [5:4] (“00”): The element at position 2 of the target register will come from position 0 of the source2 register.
- Bits [7:6] (“01”): The element at position 3 of the target register will come from position 1 of the source2 register.
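Thus, imm8 = 0b01000100 copies the two low elements of source1 into positions 0-1 and the two low elements of source2 into positions 2-3. Nothing is swapped; the result is the same as movlhps.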
3. Example Operation
Suppose xmm1 contains the elements [d, c, b, a], and xmm2 contains [h, g, f, e]. Using the following instruction:
shufps xmm1, xmm2, 0b11010001
The result in xmm1 would be [h, f, a, b]. This means that:
- The 0th element is replaced by the #1 element from xmm1 (“b”).
- The 1st element is replaced by the #0 element from xmm1 (“a”).
- The 2nd element is taken from the #1 element of xmm2 (“f”).
- The 3rd element is taken from the #3 element of xmm2 (“h”).
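The example can be replayed with a tiny simulator. This is a sketch with made-up names: registers are plain arrays with index 0 as the lowest element, so the text's [d, c, b, a] (written high to low) becomes ['a', 'b', 'c', 'd'] here.

```javascript
// Model of "shufps dst, src, imm8": positions 0-1 pick from dst, 2-3 pick from src.
function shufps(dst, src, imm8) {
  const selector = (position) => (imm8 >> (position * 2)) & 3;
  return [
    dst[selector(0)],
    dst[selector(1)],
    src[selector(2)],
    src[selector(3)],
  ];
}

const xmm1 = ['a', 'b', 'c', 'd']; // index 0 = lowest element
const xmm2 = ['e', 'f', 'g', 'h'];
const result = shufps(xmm1, xmm2, 0b11010001);
console.log(result);                   // [ 'b', 'a', 'f', 'h' ]
console.log(result.slice().reverse()); // [ 'h', 'f', 'a', 'b' ] as written in the text
```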
3.1 Example Operation: rotate
shufps xmm1, xmm1, 0x39
We want to rotate the floats in xmm1 to the right, from [e, f, g, h] to [h, e, f, g]. The imm8 code should be:
- The 0th element is replaced by the #1 element from xmm1 (“g”); imm bits = 01
- The 1st element is replaced by the #2 element from xmm1 (“f”); imm bits = 10
- The 2nd element is replaced by the #3 element of xmm1 (“e”); imm bits = 11
- The 3rd element is replaced by the #0 element of xmm1 (“h”); imm bits = 00
Putting the four pairs together from position 3 down to position 0 gives imm8 = 0b00111001 = 0x39.
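Going the other way, from the wanted selectors to the control byte, is the same arithmetic the HTML helper at the end of this post uses. A minimal sketch, with `encodeImm8` as a made-up name and the selectors listed from position 0 up to position 3:

```javascript
// Build imm8 from four selectors: selectors[0] is for target position 0, and so on.
function encodeImm8(selectors) {
  return selectors.reduce(
    (imm8, selector, position) => imm8 | ((selector & 3) << (position * 2)),
    0
  );
}

const rotate = encodeImm8([1, 2, 3, 0]);
console.log('0x' + rotate.toString(16).padStart(2, '0')); // 0x39
```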
4. Key Differences Between shufps and vshufps
shufps: Used for 128-bit xmm registers. This instruction is part of the SSE instruction set and allows rearranging elements within 128-bit registers.
vshufps: An AVX instruction that can be used with both 128-bit (xmm) and 256-bit (ymm) registers. It supports three operands: two source registers and a destination register. This flexibility makes vshufps more powerful for certain operations, as it allows preserving the original source registers while writing the shuffled result to a different destination register.
vshufps instruction (ymm related not explained):
vshufps xmm1, xmm2, xmm3/m128, imm8
Source: xmm2, xmm3
Destination: xmm1
How it works:
Lower position elements 0 - 1: picked from xmm2
Upper position elements 2 - 3: picked from xmm3/m128
Store result in xmm1
More examples
Copy lowest float in xmm to other position 1-3:
shufps xmm1, xmm1, 0 //or vshufps xmm1, xmm1, xmm1, 0
Copy float data from position 1 to position 0 (don't care others):
shufps xmm1, xmm1, 1 //or vshufps xmm1, xmm1, xmm1, 1
Copy float data from position 2 to position 0 (don't care others):
shufps xmm1, xmm1, 2 //or vshufps xmm1, xmm1, xmm1, 2
Combine lowest position of float in xmm2 into xmm1 position 1:
shufps xmm1, xmm2, 0 // pos 0-1: xmm1:0, pos 2-3: xmm2:0
// or movlhps xmm1, xmm2: copy the 2 low elements of xmm2 to the high positions of xmm1
shufps xmm1, xmm1, 8 // 0b00001000 // pos0: xmm1:0, pos1: xmm1:2, pos2: xmm1:0, pos3: xmm1:0
// or (SSE4.1) insertps xmm1, xmm2, 0x10 // 7:6 -> src. pos, 5:4 -> dest. pos, 3:0 -> zero mask (0 = do not clear any destination element)
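The insertps control byte is easy to get backwards, so here is a small model of it as I read Intel's manual for the register form: bits 7:6 pick the source element, bits 5:4 pick the destination position, and bits 3:0 are a mask of destination elements to zero afterwards. The function name `insertps` just mirrors the mnemonic, and arrays use index 0 as the lowest element.

```javascript
// Model of "insertps dst, src, imm8" with a register source.
function insertps(dst, src, imm8) {
  const sourceIndex = (imm8 >> 6) & 3; // bits 7:6: which element of src to take
  const destIndex = (imm8 >> 4) & 3;   // bits 5:4: where to put it in dst
  const zeroMask = imm8 & 0xF;         // bits 3:0: destination elements to clear
  const out = dst.slice();
  out[destIndex] = src[sourceIndex];
  return out.map((value, i) => ((zeroMask >> i) & 1 ? 0 : value));
}

const xmm1 = [1, 2, 3, 4];
const xmm2 = [10, 20, 30, 40];
console.log(insertps(xmm1, xmm2, 0x10)); // [ 1, 10, 3, 4 ]  xmm2:0 -> xmm1 position 1
console.log(insertps(xmm1, xmm2, 0x40)); // [ 20, 2, 3, 4 ]  xmm2:1 -> xmm1 position 0
```

Unlike the two-step shufps version, this keeps positions 2 and 3 of xmm1 intact.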
A shufps / vshufps html JavaScript helper:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>SHUFPS / VSHUFPS Imm8 Helper</title>
<style>
body {
font-family: Arial, sans-serif;
}
.container {
max-width: 600px;
margin: 0 auto;
}
.xmm-select {
margin-bottom: 20px;
}
</style>
</head>
<body>
<div class="container">
<h1>SHUFPS / VSHUFPS Imm8 Helper</h1>
<div class="xmm-select">
<label for="xmm1">First XMM Register:</label>
<select id="xmm1">
<option value="xmm0">xmm0</option>
<option value="xmm1">xmm1</option>
<option value="xmm2">xmm2</option>
<option value="xmm3">xmm3</option>
<option value="xmm4">xmm4</option>
<option value="xmm5">xmm5</option>
<option value="xmm6">xmm6</option>
<option value="xmm7">xmm7</option>
<option value="xmm8">xmm8</option>
<option value="xmm9">xmm9</option>
<option value="xmm10">xmm10</option>
<option value="xmm11">xmm11</option>
<option value="xmm12">xmm12</option>
<option value="xmm13">xmm13</option>
<option value="xmm14">xmm14</option>
<option value="xmm15">xmm15</option>
</select>
</div>
<div class="xmm-select">
<label for="xmm2">Second XMM Register:</label>
<select id="xmm2">
<option value="xmm0">xmm0</option>
<option value="xmm1">xmm1</option>
<option value="xmm2">xmm2</option>
<option value="xmm3">xmm3</option>
<option value="xmm4">xmm4</option>
<option value="xmm5">xmm5</option>
<option value="xmm6">xmm6</option>
<option value="xmm7">xmm7</option>
<option value="xmm8">xmm8</option>
<option value="xmm9">xmm9</option>
<option value="xmm10">xmm10</option>
<option value="xmm11">xmm11</option>
<option value="xmm12">xmm12</option>
<option value="xmm13">xmm13</option>
<option value="xmm14">xmm14</option>
<option value="xmm15">xmm15</option>
</select>
</div>
<div class="xmm-select">
<label for="xmm_dest">Destination XMM Register (for VSHUFPS):</label>
<select id="xmm_dest">
<option value="xmm0">xmm0</option>
<option value="xmm1">xmm1</option>
<option value="xmm2">xmm2</option>
<option value="xmm3">xmm3</option>
<option value="xmm4">xmm4</option>
<option value="xmm5">xmm5</option>
<option value="xmm6">xmm6</option>
<option value="xmm7">xmm7</option>
<option value="xmm8">xmm8</option>
<option value="xmm9">xmm9</option>
<option value="xmm10">xmm10</option>
<option value="xmm11">xmm11</option>
<option value="xmm12">xmm12</option>
<option value="xmm13">xmm13</option>
<option value="xmm14">xmm14</option>
<option value="xmm15">xmm15</option>
</select>
</div>
<h3>Select Floats to Shuffle</h3>
<p>Choose the 4 positions from the two registers (0-1 from the first xmm, 2-3 from the second xmm).</p>
<div id="float-positions">
<label>Result Position 0:</label>
<select class="position-select" id="pos0">
<option value="0">xmm 1st[0]</option>
<option value="1">xmm 1st[1]</option>
<option value="2">xmm 1st[2]</option>
<option value="3">xmm 1st[3]</option>
</select>
<br>
<label>Result Position 1:</label>
<select class="position-select" id="pos1">
<option value="0">xmm 1st[0]</option>
<option value="1">xmm 1st[1]</option>
<option value="2">xmm 1st[2]</option>
<option value="3">xmm 1st[3]</option>
</select>
<br>
<label>Result Position 2:</label>
<select class="position-select" id="pos2">
<option value="0">xmm 2nd[0]</option>
<option value="1">xmm 2nd[1]</option>
<option value="2">xmm 2nd[2]</option>
<option value="3">xmm 2nd[3]</option>
</select>
<br>
<label>Result Position 3:</label>
<select class="position-select" id="pos3">
<option value="0">xmm 2nd[0]</option>
<option value="1">xmm 2nd[1]</option>
<option value="2">xmm 2nd[2]</option>
<option value="3">xmm 2nd[3]</option>
</select>
</div>
<br>
<button onclick="generateInstruction('shufps')">Generate SHUFPS Instruction</button>
<button onclick="generateInstruction('vshufps')">Generate VSHUFPS Instruction</button>
<h3>Result:</h3>
<p id="instruction"></p>
</div>
<script>
function generateInstruction(type) {
const xmm1 = document.getElementById('xmm1').value;
const xmm2 = document.getElementById('xmm2').value;
const xmmDest = document.getElementById('xmm_dest').value;
const pos0 = parseInt(document.getElementById('pos0').value);
const pos1 = parseInt(document.getElementById('pos1').value);
const pos2 = parseInt(document.getElementById('pos2').value);
const pos3 = parseInt(document.getElementById('pos3').value);
// Calculate imm8 value
const imm8 = (pos3 << 6) | (pos2 << 4) | (pos1 << 2) | pos0;
// Generate instruction
let instruction = '';
if (type === 'shufps') {
instruction = `shufps ${xmm1}, ${xmm2}, 0x${imm8.toString(16)}`;
} else if (type === 'vshufps') {
instruction = `vshufps ${xmmDest}, ${xmm1}, ${xmm2}, 0x${imm8.toString(16)}`;
}
document.getElementById('instruction').textContent = instruction;
}
</script>
</body>
</html>