Principal Software Engineer – Sustaining

Company: Berkshire Grey
Location: Bedford, MA, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Expert or higher

Requirements

  • Bachelor’s degree in computer science or a related field
  • 10+ years in software development or reliability engineering
  • Strong coding skills in Python
  • Experience in a fast-paced, agile environment
  • Demonstrated ability to:
      • Investigate and triage production issues end-to-end
      • Analyze logs, metrics, and telemetry to pinpoint root causes
      • Develop fixes or workarounds under tight SLAs
      • Ship stable patches and rollouts with minimal disruption
      • Drive post-mortems and follow through on corrective action plans
      • Communicate status and technical tradeoffs clearly to stakeholders
  • Comfortable with Linux (Ubuntu), version control (Git), and issue tracking (Jira)

Responsibilities

  • Lead investigation of field and lab failures; own root-cause analysis and drive fixes
  • Instrument code with metrics/logs; develop health checks and self-healing routines
  • Design, build, test, and deploy hotfixes and maintenance releases
  • Identify recurring issues; propose and implement design or process changes to raise MTBF and lower MTTR
  • Work with development teams to bake reliability into new features; train support teams on diagnostics
  • Maintain clear runbooks; track and report on reliability KPIs
  • Define and drive our sustaining engineering strategy and architecture
  • Mentor and coach other sustaining engineers on best practices for reliability and incident response
  • Collaborate with product leadership to integrate reliability objectives into the product roadmap
  • Own the development and scaling of our platform monitoring, tracing, and alerting

Preferred Qualifications

  • Master’s degree in CS, Robotics, or a related field
  • Familiarity with:
      • Monitoring stacks (Elastic/Kibana, Prometheus/Grafana)
      • Distributed in-code tracing frameworks (OpenTelemetry)
      • Container orchestration (Docker, Kubernetes)
      • Automated test frameworks (pytest, unit/system tests)
      • Chaos engineering and resilience testing methodologies
  • Hands-on experience with robotic applications or other high-uptime systems
  • Data-driven mindset: profiling, statistics, pandas