Principal Software Engineer – Sustaining
Company | Berkshire Grey |
---|---|
Location | Bedford, MA, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Expert or higher |
Requirements
- Bachelor’s degree in computer science or a related field
- 10+ years in software development or reliability engineering
- Strong coding skills in Python
- Experience in a fast-paced, agile environment
- Demonstrated ability to:
  - Investigate and triage production issues end-to-end
  - Analyze logs, metrics, and telemetry to pinpoint root causes
  - Develop fixes or workarounds under tight SLAs
  - Ship stable patches and rollouts with minimal disruption
  - Drive post-mortems and follow through on corrective action plans
  - Communicate status and technical tradeoffs clearly to stakeholders
- Comfortable with Linux (Ubuntu), version control (Git), and issue tracking (Jira)
Responsibilities
- Lead investigation of field and lab failures; own root-cause analysis and drive fixes
- Instrument code with metrics/logs; develop health checks and self-healing routines
- Design, build, test, and deploy hotfixes and maintenance releases
- Identify recurring issues; propose and implement design or process changes to raise MTBF and lower MTTR
- Work with development teams to bake reliability into new features; train support teams on diagnostics
- Maintain clear runbooks; track and report on reliability KPIs
- Define and drive our sustaining engineering strategy and architecture
- Mentor and coach other sustaining engineers on best practices for reliability and incident response
- Collaborate with product leadership to integrate reliability objectives into the product roadmap
- Own the development and scaling of our platform monitoring, tracing, and alerting
Preferred Qualifications
- Master’s degree in CS, Robotics, or a related field
- Familiarity with:
  - Monitoring stacks (Elastic/Kibana, Prometheus/Grafana)
  - Distributed in-code tracing frameworks (OpenTelemetry)
  - Container orchestration (Docker, Kubernetes)
  - Automated test frameworks (pytest, unit/system tests)
  - Chaos engineering and resilience testing methodologies
- Hands-on experience with robotic applications or other high-uptime systems
- Data-driven mindset: profiling, statistics, pandas