Kompute: Accelerate and Innovate GPU Computing with Vulkan


Today, I am excited to introduce the Kompute framework, a Vulkan-based framework for general-purpose GPU computing. Kompute is a high-speed, mobile-friendly, asynchronous processing framework optimized for a wide range of use cases, opening up new possibilities for any application that uses a graphics card. It is also backed by the Linux Foundation. Let's explore how this powerful framework can benefit your projects.


Kompute is a versatile GPU computing framework that supports a wide range of graphics cards, and even mobile devices through the Android NDK. It also integrates seamlessly with existing Vulkan applications through its BYOV (Bring-your-own-Vulkan) design. Another standout feature is its robust codebase, with over 90% unit test code coverage, giving developers a stable and reliable foundation. Its key features are summarized below, followed by a short sketch of the asynchronous execution model.

  • Flexible Python modules and C++ SDK available
  • Supports asynchronous and parallel processing
  • Mobile support with Android NDK examples included
  • Compatible with existing Vulkan applications
  • 90% unit test code coverage
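To make the asynchronous model concrete, here is a minimal C++ sketch built from the same evalAsync/evalAwait calls that appear in the full example later in this post. It only uploads a tensor to the GPU while the CPU stays free; treat it as a sketch rather than a drop-in snippet, since a real workload would also record a kp::OpAlgoDispatch as shown below.

// Minimal asynchronous sketch: submit GPU work, overlap CPU work, then await.
#include <memory>
#include <vector>

#include <kompute/Kompute.hpp>

int main() {
    kp::Manager mgr; // default settings: device 0, first queue

    auto tensor = mgr.tensor({ 1., 2., 3. });
    std::vector<std::shared_ptr<kp::Tensor>> params = { tensor };

    // Record and submit the upload without blocking the CPU
    auto sq = mgr.sequence();
    sq->evalAsync<kp::OpTensorSyncDevice>(params);

    // ... CPU work can run here while the GPU transfer is in flight ...

    sq->evalAwait(); // block until the submitted work has finished
    return 0;
}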

Use Cases

Projects utilizing Kompute are diverse. GPT4All provides a large language model ecosystem that runs locally on CPUs and nearly any GPU. llama.cpp ports Meta's LLaMA model to C/C++, how-to-optimize-gemm focuses on matrix multiplication optimization, and vkJAX serves as a JAX interpreter for Vulkan. Together they showcase the flexibility and performance of Kompute across very different workloads.

Here’s a simple GPU multiplication example using C++ and Python, demonstrating how Kompute can perform fast calculations on a GPU.

// C++ example code
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include <kompute/Kompute.hpp>

// Helper that compiles GLSL to SPIR-V; see the shader section of the
// Kompute documentation for an implementation.
std::vector<uint32_t> compileSource(const std::string& source);

void kompute(const std::string& shader) {
    // 1. Create the Kompute manager with default settings (device 0, first queue)
    kp::Manager mgr;

    // 2. Create and initialise Kompute tensors through the manager
    auto tensorInA = mgr.tensor({ 2., 2., 2. });
    auto tensorInB = mgr.tensor({ 1., 2., 3. });
    auto tensorOutA = mgr.tensorT<uint32_t>({ 0, 0, 0 });
    auto tensorOutB = mgr.tensorT<uint32_t>({ 0, 0, 0 });
    std::vector<std::shared_ptr<kp::Tensor>> params = {tensorInA, tensorInB, tensorOutA, tensorOutB};

    // 3. Create an algorithm from the shader, with workgroup size,
    //    specialisation constants and default push constants
    kp::Workgroup workgroup({3, 1, 1});
    std::vector<float> specConsts({ 2 });
    std::vector<float> pushConstsA({ 2.0 });
    std::vector<float> pushConstsB({ 3.0 });

    auto algorithm = mgr.algorithm(params, compileSource(shader), workgroup, specConsts, pushConstsA);

    // 4. Run operations synchronously on a sequence
    mgr.sequence()
        ->record<kp::OpTensorSyncDevice>(params)
        ->record<kp::OpAlgoDispatch>(algorithm)              // binds default push constants
        ->eval()                                             // evaluates the two recorded operations
        ->record<kp::OpAlgoDispatch>(algorithm, pushConstsB) // overrides push constants
        ->eval();                                            // evaluates only the last recorded operation

    // 5. Sync results from the GPU asynchronously
    auto sq = mgr.sequence();
    sq->evalAsync<kp::OpTensorSyncLocal>(params);
    // ... other CPU work can happen here while the GPU finishes ...
    sq->evalAwait();

    // Prints the first output, { 4, 8, 12 }
    for (const uint32_t& elem : tensorOutA->vector()) std::cout << elem << "  ";
    // Prints the second output, { 10, 10, 10 }
    for (const uint32_t& elem : tensorOutB->vector()) std::cout << elem << "  ";
}

# Python example code
import numpy as np

from kompute import Manager, OpTensorSyncDevice, OpTensorSyncLocal, OpAlgoDispatch

# compile_source is a helper that compiles GLSL to SPIR-V; see the shader
# section of the Kompute documentation for an implementation.

def kompute(shader):
    # 1. Create the Kompute manager with default settings (device 0, first queue)
    mgr = Manager()

    # 2. Create and initialise Kompute tensors through the manager
    tensor_in_a = mgr.tensor([2, 2, 2])
    tensor_in_b = mgr.tensor([1, 2, 3])
    tensor_out_a = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))
    tensor_out_b = mgr.tensor_t(np.array([0, 0, 0], dtype=np.uint32))

    params = [tensor_in_a, tensor_in_b, tensor_out_a, tensor_out_b]

    # 3. Create an algorithm from the shader, with workgroup size,
    #    specialisation constants and default push constants
    workgroup = (3, 1, 1)
    spec_consts = [2]
    push_consts_a = [2]
    push_consts_b = [3]

    algo = mgr.algorithm(params, compile_source(shader), workgroup, spec_consts, push_consts_a)

    # 4. Run operations synchronously on a sequence
    (mgr.sequence()
        .record(OpTensorSyncDevice(params))
        .record(OpAlgoDispatch(algo))                 # binds default push constants
        .eval()                                       # evaluates the two recorded operations
        .record(OpAlgoDispatch(algo, push_consts_b))  # overrides push constants
        .eval())                                      # evaluates only the last recorded operation

    # 5. Sync results from the GPU asynchronously
    sq = mgr.sequence()
    sq.eval_async(OpTensorSyncLocal(params))
    # ... other CPU work can happen here while the GPU finishes ...
    sq.eval_await()

    print(tensor_out_a)  # prints the first output, { 4, 8, 12 }
    print(tensor_out_b)  # prints the second output, { 10, 10, 10 }

if __name__ == "__main__":
    shader = """
        #version 450
        layout (local_size_x = 1) in;

        // The bind index of each buffer matches its position in `params`
        layout(set = 0, binding = 0) buffer buf_in_a { float in_a[]; };
        layout(set = 0, binding = 1) buffer buf_in_b { float in_b[]; };
        layout(set = 0, binding = 2) buffer buf_out_a { uint out_a[]; };
        layout(set = 0, binding = 3) buffer buf_out_b { uint out_b[]; };

        // Push constants can be updated on every dispatch
        layout(push_constant) uniform PushConstants { float val; } push_const;

        // Specialisation constants are fixed when the algorithm is created
        layout(constant_id = 0) const float const_one = 0;

        void main() {
            uint index = gl_GlobalInvocationID.x;
            out_a[index] += uint(in_a[index] * in_b[index]);
            out_b[index] += uint(const_one * push_const.val);
        }
    """
    kompute(shader)
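For reference, both dispatches accumulate into the output buffers: out_a receives in_a * in_b twice (2x1, 2x2, 2x3, added twice), yielding { 4, 8, 12 }, while out_b adds const_one * push_const.val once per dispatch (2 x 2.0, then 2 x 3.0), yielding { 10, 10, 10 }.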

Conclusion

Kompute is ready to take your GPU computing to the next level. With high performance, flexible modularity, and broad platform support, it can make your projects faster and more efficient. Start using Kompute now and experience the possibilities for yourself! We wish you success in your projects.

Reference: GitHub, "Kompute", https://github.com/KomputeProject/kompute
