Develop and maintain scalable data processing pipelines using Python and PySpark.
Design, implement, and manage real-time data streaming applications using Apache Kafka (an illustrative sketch follows this list).
Collaborate with data scientists, analysts, and other stakeholders to understand data requirements.
Optimize data workflows for performance and reliability in distributed computing environments.
Write efficient, reusable, and well-documented code.
Monitor and troubleshoot data pipeline issues and ensure data quality.
Work with cloud platforms (AWS, Azure, or GCP) and big data tools as needed.
Participate in code reviews, testing, and continuous integration/deployment.
Stay updated with emerging technologies and best practices in big data and streaming platforms.
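For illustration only, the following is a minimal sketch of the kind of pipeline described above: a PySpark Structured Streaming job that reads events from a Kafka topic. The broker address, topic name, and payload schema are hypothetical placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster.

    # Minimal, illustrative sketch of a PySpark + Kafka streaming pipeline.
    # Broker, topic, and schema below are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = (
        SparkSession.builder
        .appName("events-stream")  # hypothetical application name
        .getOrCreate()
    )

    # Hypothetical JSON schema for the incoming event payload.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Subscribe to a Kafka topic; broker address and topic name are placeholders.
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")
        .load()
    )

    # Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
    events = raw.select(
        from_json(col("value").cast("string"), schema).alias("event")
    ).select("event.*")

    # Write the parsed stream to the console purely for demonstration.
    query = (
        events.writeStream
        .outputMode("append")
        .format("console")
        .start()
    )
    query.awaitTermination()

In practice the console sink would be replaced by a durable sink (for example Parquet on cloud storage or a database), which is where the performance and reliability concerns noted above come into play.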
Required Skills & Qualifications:
Bachelor's degree in Computer Science, Engineering, or a related field.
Strong programming skills in Python.
Hands-on experience with PySpark for large-scale data processing.
Solid understanding of Apache Kafka and experience building streaming data pipelines.
Familiarity with distributed computing concepts and frameworks.
Experience working with relational and NoSQL databases.
Knowledge of common data formats such as JSON, Avro, and Parquet.
Experience with cloud platforms and containerization (Docker) is a plus.
Familiarity with version control (Git) and CI/CD practices.
Strong problem-solving skills and ability to work in a collaborative team environment.
Preferred Qualifications:
Experience with data orchestration tools such as Apache Airflow (see the sketch after this list).
Knowledge of other big data technologies such as Hadoop, Hive, or Presto.
Experience with monitoring and logging tools for data pipelines.
Familiarity with machine learning workflows or data science tools.
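As a hedged illustration of the orchestration tooling mentioned above, the sketch below defines a two-task Apache Airflow DAG. The DAG id, task names, and schedule are hypothetical, and Airflow 2.x with the standard PythonOperator is assumed.

    # Minimal, illustrative Airflow DAG sketch; names and schedule are hypothetical.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder extraction step (e.g. pulling data from a source system).
        print("extracting data")

    def transform():
        # Placeholder transformation step (e.g. submitting a PySpark job).
        print("transforming data")

    # Hypothetical daily pipeline: the extract task runs first, then transform.
    with DAG(
        dag_id="example_daily_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        extract_task >> transform_task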
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: Software Development
Role: Data Engineer
Employment Type: Full-time