Provided by: opencl-1.2-man-doc_1.0~svn33624-5_all bug

NAME

       Function_Qualifiers - Qualifiers for kernel functions.

       __kernel kernel

       __attribute__((vec_type_hint(<typen>)))
       __attribute__((work_group_size_hint(X, Y, Z)))
       __attribute__((reqd_work_group_size(X, Y, Z)))

DESCRIPTION

       The __kernel (or kernel) qualifier declares a function to be a kernel that can be executed
       by an application on an OpenCL device(s). The following rules apply to functions that are
       declared with this qualifier:

       •   It can be executed on the device only

       •   It can be called by the host

       •   It is just a regular function call if a __kernel function is called by another kernel
           function.

       Kernel functions with variables declared inside the function with the local(3clc) or
       local(3clc) qualifier can be called by the host using appropriate APIs such as
       clEnqueueNDRangeKernel(3clc), and clEnqueueTask(3clc).

       The behavior of calling kernel functions with variables declared inside the function with
       the local(3clc) or local(3clc) qualifier from other kernel functions is
       implementation-defined.

       The __kernel and kernel names are reserved for use as functions qualifiers and shall not
       be used otherwise.  Optional Attribute Qualifiers.PP The __kernel qualifier can be used
       with the keyword attribute(3clc) to declare additional information about the kernel
       function as described below.

       The optional __attribute__((vec_type_hint(<type>))) is a hint to the compiler and is
       intended to be a representation of the computational width of the __kernel, and should
       serve as the basis for calculating processor bandwidth utilization when the compiler is
       looking to autovectorize the code. In the __attribute__((vec_type_hint(<type>))) qualifier
       <type> is one of the built-in vector types or the constituent scalar element types. If
       vec_type_hint (<type>) is not specified, the kernel is assumed to have the
       __attribute__((vec_type_hint(int))) qualifier.

       Implicit in autovectorization is the assumption that any libraries called from the
       __kernel must be recompilable at run time to handle cases where the compiler decides to
       merge or separate workitems. This probably means that such libraries can never be hard
       coded binaries or that hard coded binaries must be accompanied either by source or some
       retargetable intermediate representation. This may be a code security question for some.

       For example, where the developer specified a width of float4, the compiler should assume
       that the computation usually uses up 4 lanes of a float vector, and would decide to merge
       work-items or possibly even separate one work-item into many threads to better match the
       hardware capabilities. A conforming implementation is not required to autovectorize code,
       but shall support the hint. A compiler may autovectorize, even if no hint is provided. If
       an implementation merges N work-items into one thread, it is responsible for correctly
       handling cases where the number of global or local work-items in any dimension modulo N is
       not zero.

       If for example, a __kernel function is declared with __attribute__(( vec_type_hint
       (float4))) (meaning that most operations in the __kernel are explicitly vectorized using
       float4) and the kernel is running using Intel® Advanced Vector Instructions (Intel® AVX)
       which implements a 8-float-wide vector unit, the autovectorizer might choose to merge two
       work-items to one thread, running a second work-item in the high half of the 256-bit AVX
       register.

       As another example, a Power4 machine has two scalar double precision floating-point units
       with an 6-cycle deep pipe. An autovectorizer for the Power4 machine might choose to
       interleave six kernels declared with the __attribute__(( vec_type_hint (double2)))
       qualifier into one hardware thread, to ensure that there is always 12-way parallelism
       available to saturate the FPUs. It might also choose to merge 4 or 8 work-items (or some
       other number) if it concludes that these are better choices, due to resource utilization
       concerns or some preference for divisibility by 2.

       The optional __attribute__((work_group_size_hint(X, Y, Z))) is a hint to the compiler and
       is intended to specify the work-group size that may be used i.e. value most likely to be
       specified by the local_work_size argument to clEnqueueNDRangeKernel(3clc). For example the
       __attribute__((work_group_size_hint(1, 1, 1))) is a hint to the compiler that the kernel
       will most likely be executed with a work-group size of 1.

       The optional __attribute__((reqd_work_group_size(X, Y, Z))) is the work-group size that
       must be used as the local_work_size argument to clEnqueueNDRangeKernel(3clc). This allows
       the compiler to optimize the generated code appropriately for this kernel. The optional
       __attribute__((reqd_work_group_size(X, Y, Z))), if specified, must be (1, 1, 1) if the
       kernel is executed via clEnqueueTask(3clc).

       If Z is one, the work_dim argument to clEnqueueNDRangeKernel(3clc) can be 2 or 3. If Y and
       Z are one, the work_dim argument to clEnqueueNDRangeKernel(3clc) can be 1, 2 or 3.

NOTES

       Kernel functions with variables declared inside the function with the local(3clc) or
       local(3clc) qualifier can be called by the host using appropriate APIs such as
       clEnqueueNDRangeKernel(3clc), and clEnqueueTask(3clc).

       The behavior of calling kernel functions with variables declared inside the function with
       the local(3clc) and local(3clc) qualifier from other kernel functions is
       implementation-defined.

       The __kernel and kernel names are reserved for use as functions qualifiers and shall not
       be used otherwise.

   Example
       // autovectorize assuming float4 as the // basic
       computation width __kernel
       __attribute__((vec_type_hint(float4))) void foo(
       __global float4 *p ) { ....

       // autovectorize assuming double as the // basic
       computation width __kernel
       __attribute__((vec_type_hint(double))) void foo(
       __global float4 *p ){ ....

       // autovectorize assuming int (default) // as
       the basic computation width __kernel void foo(
       __global float4 *p ){ ....

SPECIFICATION

       OpenCL Specification[1]

SEE ALSO

       attribute(3clc), clEnqueueNDRangeKernel(3clc), clEnqueueTask(3clc)

AUTHORS

       The Khronos Group

COPYRIGHT

       Copyright © 2007-2011 The Khronos Group Inc.
       Permission is hereby granted, free of charge, to any person obtaining a copy of this
       software and/or associated documentation files (the "Materials"), to deal in the Materials
       without restriction, including without limitation the rights to use, copy, modify, merge,
       publish, distribute, sublicense, and/or sell copies of the Materials, and to permit
       persons to whom the Materials are furnished to do so, subject to the condition that this
       copyright notice and permission notice shall be included in all copies or substantial
       portions of the Materials.

NOTES

        1. OpenCL Specification
           page 227, section 6.7 - Function Qualifiers