AI Evaluation Engineer (Data Analysis & Multi-Agent Systems)

  • Full Time
  • Anywhere

Gramian Consulting Group

We are looking for an AI Evaluation Engineer to design benchmark tasks that simulate real-world analytical workflows, with a focus on data analysis, multi-agent systems, and verification logic.

Requirements

  • Design and develop multi-agent benchmark tasks focused on complex data analysis workflows
  • Create or curate realistic datasets (CSV, JSON, logs, reports, financial or operational data)
  • Build tasks requiring cross-referencing across multiple data sources, anomaly detection and contradiction identification, and statistical analysis and interpretation
  • Define task decomposition strategies across specialized sub-agents (e.g., financial, technical, operational analysis)
  • Develop verification logic to validate precise analytical outputs (not generic summaries)
  • Implement evaluation pipelines using Python and SQL
  • Create reproducible environments using Docker
  • Analyze task performance and refine for clarity, difficulty, and scoring accuracy
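
As a rough illustration of what "verification logic to validate precise analytical outputs" could mean in practice, the sketch below compares a submitted answer against expected values field by field. It is a minimal example under stated assumptions, not part of the posting: the function names `verify_output` and `score`, the field names, and the tolerance are all hypothetical.

```python
import math


def verify_output(submitted: dict, expected: dict, rel_tol: float = 1e-6) -> dict:
    """Return a per-field pass/fail report for a submitted answer.

    Numeric fields pass when they fall within a relative tolerance of the
    expected value; every other field must match exactly.
    """
    report = {}
    for key, want in expected.items():
        got = submitted.get(key)
        is_number = isinstance(want, (int, float)) and not isinstance(want, bool)
        if is_number:
            report[key] = (
                isinstance(got, (int, float))
                and not isinstance(got, bool)
                and math.isclose(got, want, rel_tol=rel_tol)
            )
        else:
            report[key] = got == want
    return report


def score(report: dict) -> float:
    """Fraction of expected fields that passed (0.0 for an empty report)."""
    return sum(report.values()) / len(report) if report else 0.0


# Hypothetical task: the field names and values below are made up.
expected = {"total_revenue": 1250.5, "flagged_ids": ["INV-104", "INV-231"]}
submitted = {"total_revenue": 1250.5, "flagged_ids": ["INV-104"]}
report = verify_output(submitted, expected)
```

Checking each field separately, rather than string-matching a free-text summary, is one way to reward exact analytical results and to see which part of an answer failed.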

Originally posted on Himalayas

To apply for this job, please visit himalayas.app.

Working in Egypt

Egypt, officially the Arab Republic of Egypt, is a country spanning the northeast corner of Africa and southwest corner of Asia via the Sinai Peninsula. It is bordered by the Mediterranean Sea to the north, Palestine and Israel to the northeast, the Red Sea to the east, Sudan and the Sahara to the south, and Libya to the west. The Gulf of Aqaba in the northeast separates Egypt from Jordan and Saudi Arabia. Cairo is the capital, largest city, and leading cultural centre, while Alexandria is the second-largest city and an important hub of industry and tourism. With over 107 million inhabitants,
