
Initiating an Offload on Intel® Graphics Technology


This topic only applies to IA-32 architecture targeting Intel® Graphics Technology. Intel® Graphics Technology is a preview feature.

Code inside loops or loop nests that follow #pragma offload target(gfx), and code in functions qualified with __declspec(target(gfx)), is compiled for both the target and the CPU. Qualified functions can be called for execution on the target from other code executing on the target. The code executes on the target if the target is present on the system and the if clause evaluates to true; otherwise it executes on the CPU (see the examples below).
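For example, a function qualified with __declspec(target(gfx)) can be called from inside an offloaded loop. The following is a minimal sketch, not taken from this topic: the names clamp255, brighten, pixels, count, and delta are illustrative, and pixels is assumed to be 4-kilobyte aligned as pin requires.

__declspec(target(gfx))  // compiled for both the CPU and the target
int clamp255(int v) {
     return v < 0 ? 0 : (v > 255 ? 255 : v);
}

void brighten(int *pixels, int count, int delta) {
#pragma offload target(gfx) pin(pixels: length(count))
#pragma parallel_loop
     for (int i = 0; i < count; i++) {
          pixels[i] = clamp255(pixels[i] + delta);  // runs on the target if present
     }
}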

You can place #pragma offload target(gfx) only before a perfect loop nest explicitly marked as parallel by #pragma parallel_loop.

#pragma offload can contain the following clauses when programming for Intel® Graphics Technology:

  • target(gfx) – a required clause for heterogeneous execution of code sections offloaded to the target.

  • if (condition) – the code is executed on the target only if the condition is true.

  • in|out|inout|pin(variable_list: length(length_variable_in_elements))

    • in, out, or inout – the variables are copied between CPU and target memory: in copies to the target when the offload starts, out copies back to the CPU when it completes, and inout does both.

    • pin – the variables are shared between the CPU and the target. The data must be 4-kilobyte aligned.

    • You must include the length clause for pointers. This clause gives the size of the data to copy to or from the target, or to share with the target, in elements of the type the pointer references. For pointers to arrays, the size is in elements of the referenced array.

Note

Using pin substantially reduces the cost of offloading: instead of copying data to or from memory accessible by the target, the pin clause arranges for the CPU and the target to share the same memory area, which is much faster. For kernels that perform substantial work on a relatively small data size, such as O(N²) work on O(N) data, this optimization is not important, because the copying cost is amortized over the computation.
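For instance, a buffer intended for pin can be allocated with the required 4-kilobyte alignment using _mm_malloc. This is a minimal sketch; the names data, count, and do_offload are illustrative:

#include <malloc.h>  // _mm_malloc, _mm_free

// Allocate a 4-kilobyte-aligned buffer so it can be pinned rather than copied.
float *data = (float *)_mm_malloc(count * sizeof(float), 4096);

#pragma offload target(gfx) if (do_offload) pin(data: length(count))
#pragma parallel_loop
for (int i = 0; i < count; i++) {
     data[i] *= 2.0f;  // operates directly on the memory shared with the CPU
}

_mm_free(data);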

#pragma parallel_loop [collapse(n)] indicates that the underlying perfect nest of one or more loops will be parallelized over the target's threads.

Although by default the compiler builds an application that runs on both the host CPU and the target, you can also build the same source code to run on the CPU only, using the Qoffload- compiler option.
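For example, a Windows command line might look like the following; the icl driver name, option spelling, and file name are illustrative, only the Qoffload- option itself comes from this topic:

icl /Qoffload- histogram.cpp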

Example: Offloading to the Target

unsigned parArrayRHist[256][256],
     parArrayGHist[256][256], parArrayBHist[256][256];

#pragma offload target(gfx) if (do_offload) \
     pin(inputImage: length(imageSize)) \
     out(parArrayRHist, parArrayGHist, parArrayBHist)
#pragma parallel_loop
     for (int ichunk = 0; ichunk < chunkCount; ichunk++){…
     }

In the example above, the generated CPU code and the runtime do the following:

  • Determine if the target is available on the system.

  • If either the target is unavailable or do_offload evaluates to false, the for loop executes on the CPU.

  • Otherwise the runtime does the following:

    • Pin the imageSize * sizeof(inputImage[0]) bytes referenced by the pointer inputImage and organize sharing of that memory with the target, without copying data to or from target memory.

    • Create the target memory areas for parArrayRHist, parArrayGHist, and parArrayBHist.

    • Split the iteration space of the for loop into N chunks, where N is less than or equal to chunkCount. The offload runtime chooses the particular value of N based on factors such as the iteration space configuration (bounds and strides) and a maximum value that can be controlled by environment variables, as described later in this document.

    • Create a task with N target threads, each assigned with its own iteration space chunk.

    • Enqueue the task for execution on the target.

    • Wait for completion of the task’s execution on the target.

    • Copy parArrayRHist, parArrayGHist, and parArrayBHist from the target memory to the CPU memory, thereby ensuring that the results are immediately visible to all CPU threads.
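The loop body is elided above ("…"). A minimal sketch of what such a per-chunk histogram kernel might look like, assuming inputImage holds imageSize packed 32-bit RGBA pixels and each chunk accumulates into its own private row of the histogram arrays (all details below are illustrative):

int chunkSize = (imageSize + chunkCount - 1) / chunkCount;
int begin = ichunk * chunkSize;
int end = begin + chunkSize < imageSize ? begin + chunkSize : imageSize;
for (int i = begin; i < end; i++) {
     unsigned px = inputImage[i];
     parArrayRHist[ichunk][px & 0xFF]++;          // red channel
     parArrayGHist[ichunk][(px >> 8) & 0xFF]++;   // green channel
     parArrayBHist[ichunk][(px >> 16) & 0xFF]++;  // blue channel
}

Giving each chunk a private histogram row avoids contention between target threads; the per-chunk rows can then be reduced into a single histogram on the CPU.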

Example: Offloading Using parallel_loop collapse

float (* A)[k] = (float (*)[k])matA;
float (* B)[n] = (float (*)[n])matB;
float (* C)[n] = (float (*)[n])matC;

#pragma offload target(gfx) if (do_offload) \
     pin(A: length(m*k)), pin(B: length(k*n)), pin(C: length(m*n))
#pragma parallel_loop collapse(2)
     for (int r = 0; r < m; r += TILE_m) {
          for (int c = 0; c < n; c += TILE_n) {…
          }
     }

In the example above:

  • collapse(2) means that the iteration space of the offloaded loop nest is two-dimensional, encompassing both the r and c loops, and that each target thread is allotted a two-dimensional iteration space chunk for parallel execution.

  • Although A, B, and C are defined as pointers to arrays, length is specified in elements of the float arrays the pointers refer to.
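The tile body is likewise elided ("…"). A minimal sketch, under the assumption that the nest computes C = A × B with A of size m × k, B of size k × n, and C of size m × n, and that TILE_m and TILE_n evenly divide m and n:

for (int i = r; i < r + TILE_m; i++) {
     for (int j = c; j < c + TILE_n; j++) {
          float sum = 0.0f;
          for (int t = 0; t < k; t++)
               sum += A[i][t] * B[t][j];
          C[i][j] = sum;  // each (i, j) element is computed in full by one thread
     }
}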

Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804
