Toshiba Global Commerce Solutions

Software Engineer Intern (Agentic AI SRE & Cloud)

On site

Frisco, United States

Fresher

Internship

01-10-2025

Skills

Python, Java, Incident Response, CI/CD, Docker, Kubernetes, Monitoring, Scripting and Automation, Version Control, Serverless Computing, Test, Test Automation, Problem-solving, Networking, Machine Learning, Programming, Git, Azure, AWS, Shell, NodeJS, Software Development, Cloud Platforms, C++, Spring, GCP, CI/CD Pipelines, Terraform, Prometheus, Grafana, Infrastructure as Code

Job Specifications

Spring 2026 Internship: This position is located at our Frisco, TX site. Expected program dates are January 20 - April 24, 2026. Anticipated working hours are 20 to 30 per week, scheduled around classes. Final hours and schedule will be determined between the intern and their manager.

Note: Toshiba Global Commerce Solutions will not be providing visa sponsorship for this position now or in the future. Therefore, in order to be considered for this position, you must have the ability to work without a need for current or future visa sponsorship.

Introduction

The intern learns by reviewing and refining AI-generated infrastructure code, testing automated deployments, and contributing to resilient cloud operations with strong mentorship.

This intern will support the SRE/Platform team by adapting outputs from AI agents into production-ready infrastructure and automation solutions. The intern will gain hands-on experience in debugging Infrastructure as Code (IaC), testing automated remediation systems, and improving AI-generated platform configurations, while learning how LLM-based tools operate within cloud environments. The role offers strong mentorship, collaborative incident response discussions, and exposure to modern autonomous platform engineering workflows.

Key Outputs & Outcomes

Improved reliability and correctness of AI-generated IaC templates and automation scripts
100% infrastructure provisioned and managed through code (Terraform, Kubernetes manifests, CI/CD pipelines)
AI-driven incident response and remediation systems that achieve MTTR targets
Change failure rate maintained below 10% through automated testing and validation of infrastructure changes
Functional platform features adapted from AI outputs and deployed to production environments
Clear documentation explaining AI-human collaboration in platform operations and incident management

Major Responsibilities

Code Adaptation & Integration: Translate AI-generated outputs into production-grade code. Add tests, validate behavior, and clean up logic or formatting as needed.
Testing and Validation: Create and run unit and integration tests for new features and AI-generated components. Run experiments to validate that LLM-generated code executes correctly in the sandbox. Investigate and help debug failures or errors in the AI toolchain, and refine the agent's code accordingly to ensure reliability.
Documentation: Document code functionality and usage (inline comments and basic user guides). Update team knowledge bases or wikis with information on new utilities or modules.
Code Reviews and Learning: Participate in code reviews, learning from feedback. Pair-program with mentors to gain familiarity with the codebase and agent frameworks. Continuously study new AI agent tools and software techniques as assigned.
Validation Support: Assist in monitoring AI code in test environments and surfacing unexpected agent behaviors or patterns.

Required Skills

Currently enrolled in a Bachelor's degree program in Computer Science or a related field
0-2 years of experience in software development, cloud operations, or infrastructure (internship or project work acceptable); sophomore, junior, or senior students welcome
Proficiency in at least one programming language such as Python, Java, or C++ with emphasis on automation and scripting
Ability to write and debug Infrastructure as Code (IaC) using tools like Terraform
Basic understanding of cloud platforms (AWS, Azure, or GCP) including compute, storage, and networking services
Familiarity with containerization technologies (Docker, Kubernetes) and container orchestration concepts
Knowledge of version control systems (Git) and collaborative development workflows
Understanding of monitoring, logging, and observability principles for distributed systems
Basic knowledge of CI/CD pipelines and automated deployment strategies
Enthusiasm for AI/LLM technology and willingness to learn AI-driven infrastructure automation (no prior AI experience required, but candidates must demonstrate the ability to adapt to autonomous tools)
Problem-solving mindset with focus on system reliability, performance optimization, and incident response

Preferred Skills

Exposure to AI or machine learning projects (e.g. coursework or personal projects using an LLM API or building a simple chatbot).
Familiarity with containerization (Docker) and cloud environments (AWS, Azure, or GCP) - basic ability to run applications in a cloud sandbox.
Experience with scripting and automation (NodeJS, Shell, Python scripting) to assist in test automation.
Exposure to AI or machine learning in infrastructure contexts (e.g., anomaly detection, predictive scaling, automated remediation)
Hands-on experience with Kubernetes, service mesh technologies, or serverless computing platforms
Familiarity with monitoring and observability tools (Prometheus, Grafana, ELK stack, Datadog, New Relic)
Experience with scripting and automation for infrastructure management (Python, B

About the Company

Toshiba Global Commerce Solutions is the global market share leader in retail store technology. As retail's first choice for integrated in-store solutions, our innovative commerce technology enhances customer engagement, transforms in-store experience, and accelerates digital transformation. Together, with a global team of dedicated business partners, we advance the future of retail.