Senior Software Engineer, Reliability Engineering (US) - Sunnyvale California
Company: Onehouse Location: Sunnyvale, California
Posted On: 05/02/2024
About Onehouse
Onehouse is a mission-driven company dedicated to freeing data from data platform lock-in. We deliver the industry's most interoperable data lakehouse through a cloud-native managed service built on Apache Hudi. Onehouse enables organizations to ingest data at scale with minute-level freshness, centrally store it, and make it available to any downstream query engine and use case (from traditional analytics to real-time AI / ML). We are a team of self-driven, inspired, and seasoned builders who have created large-scale data systems and globally distributed platforms that sit at the heart of some of the largest enterprises out there, including Uber, Snowflake, AWS, LinkedIn, Confluent and many more. Riding off $33M total funding and a fresh Series A backed by Greylock/Addition, we are quickly expanding and looking for rising talent to grow with us and become future leaders of the team. Come help us build the world's best fully managed and self-optimizing data lake platform!

The Community You Will Join
When you join Onehouse, you're joining a team of passionate professionals tackling the deeply technical challenges of building a 2-sided engineering product. Our engineering team serves as the bridge between the worlds of open source and enterprise: contributing directly to and growing Apache Hudi (already used at scale by global enterprises like Uber, Amazon, ByteDance, etc.) and concurrently defining a new industry category - the transactional data lake. The Reliability Engineering team is the glue that binds all of this together. You will be responsible for developing and maintaining the tools and systems that enable our engineering teams to operate our services reliably and at scale. You will partner closely and cross-functionally with our engineering teams to ensure our services are able to scale with our growing business.

The Impact You Will Drive:
- At Onehouse, you will own our entire live production infrastructure and operational posture to run massive data systems at scale.
- Ensure our services remain resilient by identifying opportunities for improvement and driving their implementation.
- Identify opportunities to improve our overall operational efficiency and growth by owning the modern tools in our cloud-only operation and our practices for proactive automation, monitoring and response.
- Act as a mentor to guide cross-functional teams during crisis situations and ensure timely resolution, minimizing the impact on our customers and business.

A Typical Day:
- Build and own our reliability engineering practice from the ground up, owning our entire production infrastructure and operational posture.
- Establish a culture of reliability across engineering by providing a comprehensive incident management platform that is used for instrumentation, operability, and incident response.
- Design, implement and maintain new services, tools, and monitoring to support service reliability and alerting.
- Serve as an active member of our SRE team, responding to and managing high severity incidents or any situations concerning the wellbeing and continuous operation of our mission-critical systems.
- Collaborate with your stakeholders across engineering teams to ensure continuous adoption of best practices, rollout scenarios for the space, and that services are designed with reliability in mind.
- Continuously analyze and evaluate the tradeoffs of the existing designs and make recommendations based on new technologies and industry best practices.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health through an intimate understanding of how the critical parts of our site work.
- Contribute to better incident management posture and retrospectives, driving improvements in our overall reliability and incident response time as well as on-call runbooks and post-mortem reports.
- Drive our compliance posture, ensuring that all our products and processes comply with relevant regulations and standards, especially during compliance audits.

What You Bring to the Table: