Microservices Interview Questions - Sample Answers

Microservice Interview Questions

Interviewer: What are the 5 major components of Spring Cloud?

Candidate:

In the early days, we generally considered the 5 major components of Spring Cloud to be:

  • Eureka: Service registry
  • Ribbon: Load balancing
  • Feign: Remote invocation
  • Hystrix: Circuit breaker
  • Zuul/Gateway: Gateway

With the rise of Spring Cloud Alibaba in China, we have used some Alibaba components in our project:

  • Service registry/configuration center: Nacos

  • Load balancing: Ribbon

  • Service invocation: Feign

  • Service protection: Sentinel

  • Service gateway: Gateway

Interviewer: What is service registration and discovery? How does Spring Cloud implement service registration and discovery?

Candidate:

In my understanding, there are three main functions: service registration, service discovery, and service status monitoring.

In our project, we used Eureka as the registration center, which is also a core component of the Spring Cloud system.

Service registration: Service providers need to register their information with Eureka, which will store this information, such as service name, IP, port, etc.

Service discovery: Consumers pull the service list information from Eureka. If there are multiple instances of the service provider, the consumer will use a load balancing algorithm to select one for invocation.

Service monitoring: Service providers send a heartbeat to Eureka every 30 seconds to report their health status. If Eureka receives no heartbeat for 90 seconds, it removes the service instance from the registry.
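As a minimal sketch of how this looks in code (assuming spring-cloud-starter-netflix-eureka-client is on the classpath and eureka.client.service-url.defaultZone points at the registry; the service names and endpoint below are hypothetical), registration happens automatically on startup, and the DiscoveryClient API exposes the pulled service list:

```java
import java.util.List;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.discovery.DiscoveryClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Service registration: with the Eureka client starter on the classpath, the
// app registers its name, IP, and port on startup and renews via heartbeats.
@SpringBootApplication
public class OrderServiceApplication {
    public static void main(String[] args) {
        SpringApplication.run(OrderServiceApplication.class, args);
    }
}

// Service discovery: pull the registered instances of a provider by name.
@RestController
class DiscoveryController {
    private final DiscoveryClient discoveryClient;

    DiscoveryController(DiscoveryClient discoveryClient) {
        this.discoveryClient = discoveryClient;
    }

    @GetMapping("/instances")
    public List<ServiceInstance> instances() {
        // "user-service" is a hypothetical provider name
        return discoveryClient.getInstances("user-service");
    }
}
```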

Interviewer: I see that you have used Nacos before. Can you explain the differences between Nacos and Eureka?

Candidate:

In our project, we used Nacos as the registration center. One important reason for choosing Nacos is that it also serves as a configuration center. As a registration center, Nacos is also more convenient to use than Eureka. The main similarities and differences are as follows:

  • Common features:

Both Nacos and Eureka support service registration and service pulling. They both support heartbeat-based health checks for service providers.

  • Differences between Nacos and Eureka:

① Nacos supports active detection of provider status: temporary instances use heartbeat mode, while non-temporary instances use active detection mode.

② Temporary instances will be removed if their heartbeats are abnormal, while non-temporary instances will not be removed.

③ Nacos supports a message push mode for service list changes, making the service list update more timely.

④ Nacos cluster defaults to AP mode. When there are non-temporary instances in the cluster, CP mode is used. Eureka uses AP mode.
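To make the temporary vs. non-temporary distinction in ① and ② concrete, here is a minimal sketch using the Nacos Java SDK (nacos-client); the address and service name are illustrative, and in Spring Cloud Alibaba the same switch is normally the spring.cloud.nacos.discovery.ephemeral property:

```java
import com.alibaba.nacos.api.naming.NamingFactory;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.pojo.Instance;

public class NacosRegisterDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a (hypothetical) Nacos server
        NamingService naming = NamingFactory.createNamingService("127.0.0.1:8848");

        Instance instance = new Instance();
        instance.setIp("192.168.1.10"); // illustrative address
        instance.setPort(8080);
        // false = non-temporary (persistent) instance: Nacos actively probes it
        // and keeps it in the service list even when unhealthy.
        // true (the default) = temporary instance: kept alive by client
        // heartbeats and evicted when they stop.
        instance.setEphemeral(false);

        naming.registerInstance("order-service", instance);
    }
}
```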

Interviewer: How did you implement load balancing in your project?

Candidate:

It's like this:

Load balancing during service invocation is generally implemented with Spring Cloud's Ribbon component. Feign integrates Ribbon automatically, so it is very easy to use.

When making a remote invocation, Ribbon first pulls the service address list from the registration center and then selects one address according to a routing strategy; the default strategy is round-robin.
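For example, a declarative Feign client might look like the following sketch (the user-service name and endpoint are hypothetical, and @EnableFeignClients on the application class is assumed); each call goes to the instance Ribbon selects:

```java
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// Feign resolves "user-service" through the registry and lets Ribbon pick
// one instance per call (round-robin by default).
@FeignClient(name = "user-service")
public interface UserClient {

    // Hypothetical endpoint exposed by the provider
    @GetMapping("/users/{id}/name")
    String findNameById(@PathVariable("id") Long id);
}
```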

Interviewer: What are the load balancing strategies in Ribbon?

Candidate:

Let me think, there are many, but I remember a few:

  • RoundRobinRule: Simple round-robin selection of servers from the service list.

  • WeightedResponseTimeRule: Selects servers according to weight; the longer a server's average response time, the smaller its weight and the lower its chance of being selected.

  • RandomRule: Randomly selects an available server.

  • ZoneAvoidanceRule: Zone-sensitive strategy that selects servers based on available zones. Zones can be understood as data centers or racks. Then, round-robin is performed within each zone (default).

Interviewer: How can you customize the load balancing strategy?

Candidate:

There are two ways to achieve this:

  1. Define a bean of type IRule (your own implementation or one of the built-in rules) in a configuration class. This applies globally to all remote invocations.

  2. In the consumer's configuration file, configure the load balancing strategy for a specific service. This only affects remote invocations of that service. Both approaches are sketched below.
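A minimal sketch of both approaches (the rule choice and service name are illustrative):

```java
import com.netflix.loadbalancer.IRule;
import com.netflix.loadbalancer.RandomRule;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Option 1: declare an IRule bean -- applies to ALL remote invocations.
@Configuration
public class GlobalRibbonConfig {
    @Bean
    public IRule loadBalanceRule() {
        return new RandomRule(); // or your own IRule implementation
    }
}

// Option 2: per-service configuration in the consumer's application.yml, e.g.:
//
// user-service:                 # hypothetical provider name
//   ribbon:
//     NFLoadBalancerRuleClassName: com.netflix.loadbalancer.RandomRule
```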

Interviewer: What is service avalanche and how do you solve this problem?

Candidate:

Service avalanche refers to the situation where the failure of one service cascades and brings down the entire call chain. In our project we generally address it in two ways: service degradation and service circuit breaking. If traffic is very high, rate limiting can also be considered.

Service degradation: A way for a service to protect itself and its downstream services. It ensures that a sudden surge of requests does not make the service unavailable. In actual development, the degradation logic is usually written as fallback logic on the Feign interface.

Service circuit breaking: Disabled by default, it must be enabled manually. If more than 50% of requests fail within a 10-second window, the circuit breaker trips. After that, a trial request is let through every 5 seconds: if the microservice still cannot respond, the breaker stays open; once the microservice is reachable again, the breaker closes and normal requests resume.
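A minimal sketch of the degradation logic on a Feign interface mentioned above (the service name, endpoint, and fallback value are hypothetical; with Hystrix this also assumes feign.hystrix.enabled=true, since it is off by default as noted):

```java
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;

// If the remote call fails (or the circuit breaker is open), the fallback
// bean is invoked instead of propagating the error upstream.
@FeignClient(name = "inventory-service", fallback = InventoryClientFallback.class)
interface InventoryClient {
    @GetMapping("/stock/{productId}")
    Integer getStock(@PathVariable("productId") Long productId);
}

@Component
class InventoryClientFallback implements InventoryClient {
    @Override
    public Integer getStock(Long productId) {
        // Safe default returned while inventory-service is unavailable
        return 0;
    }
}
```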

Interviewer: How do you monitor your microservices?

Candidate:

We used SkyWalking for monitoring in our project.

  1. SkyWalking mainly monitors the status of interfaces, services, and physical instances. During load testing in particular, we can see which of the many services and interfaces are slow, and analyze and optimize them accordingly.

  2. We also set up alert rules in SkyWalking. After the project goes live, if an error occurs, we will send SMS and email notifications to the relevant responsible persons, so that they can know about the project's bugs and fix them as soon as possible.

Interviewer: Have you implemented rate limiting in your project? How did you do it?

Candidate:

In the xx project I worked on, which adopted a microservices architecture, we needed to handle burst traffic. The maximum QPS could reach 2000, but the services could not handle it. Through load testing, we found that the services could support a maximum of 1200 QPS. Since our normal QPS was less than 100, we needed to implement rate limiting to handle the burst traffic.

[Version 1]

We used Nginx for rate limiting. Nginx uses the leaky bucket algorithm to filter requests and process them at a fixed rate. This allows us to handle burst traffic. We controlled the rate by limiting it per IP, with a limit of 20 requests per second.

[Version 2]

In Spring Cloud Gateway, we used the RequestRateLimiter filter (a route-level filter) to implement rate limiting. It uses the token bucket algorithm and can limit requests per second by IP or path; we can set the bucket's average fill rate per second and its total capacity.
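A sketch of what this can look like when the route is defined in Java rather than YAML (the route id, path, and rates are illustrative; the Redis-backed limiter assumes spring-boot-starter-data-redis-reactive is present):

```java
import org.springframework.cloud.gateway.filter.ratelimit.KeyResolver;
import org.springframework.cloud.gateway.filter.ratelimit.RedisRateLimiter;
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import reactor.core.publisher.Mono;

@Configuration
public class RateLimitConfig {

    @Bean
    public RedisRateLimiter redisRateLimiter() {
        // replenishRate = tokens added per second, burstCapacity = bucket size
        return new RedisRateLimiter(100, 200);
    }

    @Bean
    public KeyResolver ipKeyResolver() {
        // Limit per client IP (a sketch: getRemoteAddress() can be null behind proxies)
        return exchange -> Mono.just(
                exchange.getRequest().getRemoteAddress().getAddress().getHostAddress());
    }

    @Bean
    public RouteLocator routes(RouteLocatorBuilder builder,
                               RedisRateLimiter limiter, KeyResolver ipKeyResolver) {
        return builder.routes()
                .route("order-route", r -> r.path("/order/**")
                        .filters(f -> f.requestRateLimiter(c -> c
                                .setRateLimiter(limiter)
                                .setKeyResolver(ipKeyResolver)))
                        .uri("lb://order-service"))
                .build();
    }
}
```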

Interviewer: What are the common rate limiting algorithms?

Candidate:

The commonly used rate limiting algorithms are the leaky bucket algorithm and the token bucket algorithm.

The leaky bucket algorithm queues requests in a bucket and lets them flow out at a fixed rate, so the backend is hit at an absolutely steady pace. This gives a good rate limiting effect.

The token bucket algorithm stores tokens in a bucket and generates tokens at a certain rate. Each request needs to apply for a token before it can be processed normally. This also achieves good rate limiting effects.

The difference is that the leaky bucket smooths traffic absolutely, while the token bucket allows short bursts up to the bucket's capacity. Generally, Nginx rate limiting uses the leaky bucket algorithm, while Spring Cloud Gateway supports the token bucket algorithm.
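A minimal token-bucket sketch makes the difference concrete: tokens accumulate at a fixed rate up to a capacity, so bursts up to that capacity are allowed, whereas a leaky bucket drains requests at a fixed rate no matter how many are queued:

```java
// Not a production implementation -- just the core of the algorithm.
public class TokenBucket {
    private final long capacity;        // max tokens the bucket can hold
    private final double refillPerNano; // refill rate, tokens per nanosecond
    private double tokens;              // current token count
    private long lastRefill;            // last refill timestamp (nanos)

    public TokenBucket(long capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Top up the bucket for the time elapsed, capped at capacity
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;  // request allowed
        }
        return false;     // over the limit: reject (or queue) the request
    }
}
```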

Interviewer: What is the CAP theorem?

Candidate:

The CAP theorem is a theory in distributed systems. It includes three aspects: consistency, availability, and partition tolerance.

  • Consistency: After a successful update operation returns to the client, all nodes have the same data at the same time (strong consistency), and there is no intermediate state.

  • Availability: The service provided by the system must always be available and return results to the user within a limited time for each operation request.

  • Partition tolerance: In the event of any network partition failure in a distributed system, the system must still be able to provide services that meet both consistency and availability, unless the entire network environment fails.

Interviewer: Why is it impossible to guarantee both consistency and availability in a distributed system?

Candidate:

Well, it's like this:

First of all, as a premise, for distributed systems, partition tolerance is a basic requirement. Therefore, when designing distributed systems, we can only choose between consistency (C) and availability (A).

If we guarantee consistency (C): For nodes N1 and N2, when writing data to N1, operations on N2 must be paused. Only when N1 synchronizes data to N2 can N2 be read and written. During the pause period, requests submitted by clients will fail or time out, which is contradictory to availability.

If we guarantee availability (A): We cannot pause reads and writes on N2. But before N1's new data has been synchronized to N2, reads from N2 return stale data, which violates consistency.

Interviewer: What is the BASE theory?

Candidate:

Well, this is also a theory in distributed system design based on the CAP theory.

BASE is an extension of the AP option in the CAP theorem. The core idea is that even when strong consistency (the consistency of the CAP theorem) cannot be achieved, each application can, in a way appropriate to its own business, reach eventual consistency. The idea has three aspects:

  1. Basically Available: In the event of unforeseen failures, a distributed system is allowed to lose part of its availability (for example, longer response times or degraded functionality), which is not the same as the system being unavailable.

  2. Soft State: Soft state means that the system allows data to exist in intermediate states and believes that the existence of these intermediate states does not affect the overall availability of the system. It allows data synchronization between different nodes to have a delay.

  3. Eventually Consistent: Emphasizes that all data replicas in the system will eventually reach a consistent state after a period of synchronization. Its essence is that the system needs to ensure that the final data can reach consistency, without the need for real-time strong consistency of system data.

Interviewer: Which distributed transaction solution did you use?

Candidate:

In the xx project, we mainly used the AT mode of Seata to solve distributed transactions.

The AT mode of Seata consists of two phases:

  1. Phase 1: RM's work: ① register the branch transaction ② record the undo log (data snapshot) ③ execute the business SQL and commit ④ report the branch transaction status

  2. Phase 2: RM's work during commit: Delete the undo log

  3. Phase 2: RM's work during rollback: Restore data to the state before the update based on the undo log

The AT mode sacrifices strong consistency to ensure availability, but it guarantees eventual consistency.
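From the application's point of view, AT mode mostly reduces to one annotation on the entry method of the global transaction; a minimal sketch (the collaborators are hypothetical and shown as comments):

```java
import io.seata.spring.annotation.GlobalTransactional;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    // Opens a global transaction; each participant's data source is proxied
    // by Seata, which records undo logs in phase 1 and, in phase 2, deletes
    // them on commit or replays them on rollback.
    @GlobalTransactional(name = "create-order", rollbackFor = Exception.class)
    public void createOrder(Long userId, Long productId) {
        // branch 1: local insert, undo log recorded alongside it
        // orderMapper.insert(new Order(userId, productId));

        // branch 2: remote call, registered as another branch transaction
        // accountClient.deduct(userId, price);
    }
}
```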

Interviewer: How do you design the idempotency of interfaces in distributed services?

Candidate:

In the xx project I worked on, we used token + Redis to implement it in the order placement operation. The process is as follows:

  • First request: When the user opens the product details page, we send a request to generate a unique token in the background and store it in Redis. The key is the user's ID, and the value is the token. We also return this token to the frontend.

  • Second request: When the user submits the order, the request carries the token. The backend first checks it against Redis: if the token exists, the business logic is executed and the token is deleted; if it does not exist, the request returns immediately without executing the business logic. This way the same token is processed only once, which ensures idempotency (a sketch follows below).
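A minimal sketch of the two steps with StringRedisTemplate (the key layout and expiry are illustrative, and the get-then-delete would need a Lua script to be fully atomic under concurrency):

```java
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class IdempotencyService {

    private final StringRedisTemplate redis;

    public IdempotencyService(StringRedisTemplate redis) {
        this.redis = redis;
    }

    // First request: issue a token when the product detail page is opened
    public String issueToken(Long userId) {
        String token = UUID.randomUUID().toString();
        redis.opsForValue().set("order:token:" + userId, token, 15, TimeUnit.MINUTES);
        return token; // returned to the frontend
    }

    // Second request: returns true at most once per token
    public boolean checkAndConsume(Long userId, String token) {
        String key = "order:token:" + userId;
        String stored = redis.opsForValue().get(key);
        if (token != null && token.equals(stored)) {
            // NOTE: compare-then-delete is not atomic here; use a Lua script
            // in production to close the race window
            redis.delete(key);
            return true;  // execute the business logic
        }
        return false;     // duplicate or stale request: skip the business logic
    }
}
```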

Interviewer: What are the routing strategies in xxl-job?

Candidate:

xxl-job provides many routing strategies. The ones we commonly use are: round-robin, failover, and shard broadcast.

Interviewer: How do you handle failed task execution in xxl-job?

Candidate:

There are several operations we can perform:

  1. Choose the failover routing strategy to prioritize healthy instances for task execution.

  2. If there are still failures, we can set the retry count when creating the task.

  3. If there are still failures, we can check the logs or configure email alerts to notify the relevant responsible persons to resolve the issues.

Interviewer: How do you handle a large number of tasks that need to be executed simultaneously?

Candidate:

We deploy multiple instances to execute these batch tasks together, and the task routing strategy is shard broadcast.

In the task execution code, we can obtain the total number of shards and the current shard index, and distribute the work across instances with a modulo calculation (see the sketch below).
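A minimal sketch of a shard-broadcast handler (the job name and data access are hypothetical; XxlJobHelper is the xxl-job 2.3+ API):

```java
import com.xxl.job.core.context.XxlJobHelper;
import com.xxl.job.core.handler.annotation.XxlJob;
import org.springframework.stereotype.Component;

// With the "shard broadcast" routing strategy, every executor instance is
// triggered; each one reads its shard index/total and processes only its slice.
@Component
public class OrderTimeoutJob {

    @XxlJob("orderTimeoutJob")
    public void execute() {
        int index = XxlJobHelper.getShardIndex(); // this instance's shard, 0-based
        int total = XxlJobHelper.getShardTotal(); // total number of instances

        // Process only rows where id % total == index, e.g. (hypothetical DAO):
        // orderMapper.findTimeoutOrders(index, total).forEach(this::cancel);
        XxlJobHelper.log("shard {}/{} done", index, total);
    }
}
```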
