kube-scheduler source code analysis (1): initialization and startup analysis

2022-06-25 07:18:00 【InfoQ】

kube-scheduler overview

kube-scheduler is one of the core Kubernetes components and is responsible for scheduling pod objects. Specifically, it runs the scheduling algorithms (predicates, which filter out unsuitable nodes, and priorities, which score the remaining ones) to place each unscheduled pod on the most suitable node.

kube-scheduler architecture

The overall structure and processing flow of kube-scheduler are as follows. kube-scheduler lists and watches objects such as pods and nodes. Through its informers it puts unscheduled pods into the pending pod queue and builds the scheduler cache (used to quickly look up nodes and other objects). The sched.scheduleOne method then carries the core pod-scheduling logic: it takes one pod from the queue of unscheduled pods, runs the predicate and priority algorithms to select the best node, updates the cache, and asynchronously performs the bind operation, that is, it sets the pod's nodeName field. At that point the scheduling of the pod is complete.
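
The flow described above can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions, not the real sched.scheduleOne: the Pod and Node types, the single CPU-based predicate and the "most free CPU" priority are all hypothetical, and the cache update and the asynchronous bind are reduced to setting NodeName directly.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod and Node are simplified, hypothetical stand-ins for the real API objects.
type Pod struct {
	Name       string
	CPURequest int
	NodeName   string
}

type Node struct {
	Name    string
	FreeCPU int
}

// filter keeps the nodes on which the pod fits (the predicate phase).
func filter(pod Pod, nodes []Node) []Node {
	var feasible []Node
	for _, n := range nodes {
		if n.FreeCPU >= pod.CPURequest {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score ranks a feasible node (the priority phase).
// Here a node with more CPU left after placement scores higher.
func score(pod Pod, n Node) int {
	return n.FreeCPU - pod.CPURequest
}

// scheduleOne pops one pod from the queue, picks the best node and
// "binds" the pod by setting its NodeName.
func scheduleOne(queue *[]Pod, nodes []Node) (Pod, error) {
	if len(*queue) == 0 {
		return Pod{}, errors.New("no pending pods")
	}
	pod := (*queue)[0]
	*queue = (*queue)[1:]

	feasible := filter(pod, nodes)
	if len(feasible) == 0 {
		return pod, fmt.Errorf("no node fits pod %s", pod.Name)
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if score(pod, n) > score(pod, best) {
			best = n
		}
	}
	// The real scheduler updates its cache and binds asynchronously;
	// this sketch only records the decision.
	pod.NodeName = best.Name
	return pod, nil
}

func main() {
	queue := []Pod{{Name: "web", CPURequest: 2}}
	nodes := []Node{
		{Name: "node-a", FreeCPU: 1},
		{Name: "node-b", FreeCPU: 4},
		{Name: "node-c", FreeCPU: 8},
	}
	pod, err := scheduleOne(&queue, nodes)
	fmt.Println(pod.NodeName, err)
}
```

In this sketch node-a is filtered out because it cannot hold the pod, and node-c wins the scoring phase because it has the most CPU left.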

The analysis of kube-scheduler is divided into two parts:

(1) kube-scheduler initialization and startup analysis;

(2) kube-scheduler core processing logic analysis.

This part covers the initialization and startup of kube-scheduler; the next part analyzes the core processing logic.

1. kube-scheduler initialization and startup analysis

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

We start from the NewSchedulerCommand function of kube-scheduler, which serves as the entry point for analyzing initialization and startup.

NewSchedulerCommand

The main logic of the NewSchedulerCommand function is:

(1) Initialize the default values of the component's startup parameters;

(2) Define the run command of the kube-scheduler component, namely the runCommand function (runCommand eventually calls the Run function to start kube-scheduler; Run is analyzed below);

(3) Parse the command-line startup parameters of the kube-scheduler component.

// cmd/kube-scheduler/app/server.go
func NewSchedulerCommand(registryOptions ...Option) *cobra.Command {
	// 1. Initialize the default values of the component's startup parameters
	opts, err := options.NewOptions()
	if err != nil {
		klog.Fatalf("unable to initialize command options: %v", err)
	}

	// 2. Define the run command of the kube-scheduler component, namely the runCommand function
	cmd := &cobra.Command{
		Use: "kube-scheduler",
		Long: `The Kubernetes scheduler is a policy-rich, topology-aware,
workload-specific function that significantly impacts availability, performance,
and capacity. The scheduler needs to take into account individual and collective
resource requirements, quality of service requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality, inter-workload
interference, deadlines, and so on. Workload-specific requirements will be exposed
through the API as necessary.`,
		Run: func(cmd *cobra.Command, args []string) {
			if err := runCommand(cmd, args, opts, registryOptions...); err != nil {
				fmt.Fprintf(os.Stderr, "%v\n", err)
				os.Exit(1)
			}
		},
	}

	// 3. Parse the component's command-line startup parameters
	fs := cmd.Flags()
	namedFlagSets := opts.Flags()
	verflag.AddFlags(namedFlagSets.FlagSet("global"))
	globalflag.AddGlobalFlags(namedFlagSets.FlagSet("global"), cmd.Name())
	for _, f := range namedFlagSets.FlagSets {
		fs.AddFlagSet(f)
	}
	...
}

runCommand

runCommand defines the run command of the kube-scheduler component. We focus on the following two pieces of logic:

(1) Call algorithmprovider.ApplyFeatureGates, which decides, depending on whether certain feature gates are enabled, whether to additionally register the corresponding predicate and priority algorithms;

(2) Call Run to start the kube-scheduler component.

// cmd/kube-scheduler/app/server.go
// runCommand runs the scheduler.
func runCommand(cmd *cobra.Command, args []string, opts *options.Options, registryOptions ...Option) error {
	...

	// Apply algorithms based on feature gates.
	// TODO: make configurable?
	algorithmprovider.ApplyFeatureGates()

	// Configz registration.
	if cz, err := configz.New("componentconfig"); err == nil {
		cz.Set(cc.ComponentConfig)
	} else {
		return fmt.Errorf("unable to register configz: %s", err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	return Run(ctx, cc, registryOptions...)
}

1.1 algorithmprovider.ApplyFeatureGates

Depending on whether certain feature gates are enabled, this function decides whether to additionally register the corresponding predicate and priority algorithms.

// pkg/scheduler/algorithmprovider/plugins.go
import (
	"k8s.io/kubernetes/pkg/scheduler/algorithmprovider/defaults"
)

func ApplyFeatureGates() func() {
	return defaults.ApplyFeatureGates()
}

1.1.1 init

The plugins.go file imports the defaults package, so before looking at defaults.ApplyFeatureGates, let's look at the init functions of the defaults package. They register the built-in scheduling algorithms, both predicates and priorities.

(1) First, the init function in the defaults.go file of the defaults package.

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func init() {
	registerAlgorithmProvider(defaultPredicates(), defaultPriorities())
}

Predicates:

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func defaultPredicates() sets.String {
	return sets.NewString(
		predicates.NoVolumeZoneConflictPred,
		predicates.MaxEBSVolumeCountPred,
		predicates.MaxGCEPDVolumeCountPred,
		predicates.MaxAzureDiskVolumeCountPred,
		predicates.MaxCSIVolumeCountPred,
		predicates.MatchInterPodAffinityPred,
		predicates.NoDiskConflictPred,
		predicates.GeneralPred,
		predicates.PodToleratesNodeTaintsPred,
		predicates.CheckVolumeBindingPred,
		predicates.CheckNodeUnschedulablePred,
	)
}

Priorities:

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func defaultPriorities() sets.String {
	return sets.NewString(
		priorities.SelectorSpreadPriority,
		priorities.InterPodAffinityPriority,
		priorities.LeastRequestedPriority,
		priorities.BalancedResourceAllocation,
		priorities.NodePreferAvoidPodsPriority,
		priorities.NodeAffinityPriority,
		priorities.TaintTolerationPriority,
		priorities.ImageLocalityPriority,
	)
}

The registerAlgorithmProvider function registers the algorithm providers. An algorithm provider stores the list of scheduling algorithms of each type, both predicates and priorities (only the list of algorithm keys is stored, not the algorithms themselves).

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func registerAlgorithmProvider(predSet, priSet sets.String) {
	// Registers algorithm providers. By default we use 'DefaultProvider', but user can specify one to be used
	// by specifying flag.
	scheduler.RegisterAlgorithmProvider(scheduler.DefaultProvider, predSet, priSet)
	// Cluster autoscaler friendly scheduling algorithm.
	scheduler.RegisterAlgorithmProvider(ClusterAutoscalerProvider, predSet,
		copyAndReplace(priSet, priorities.LeastRequestedPriority, priorities.MostRequestedPriority))
}
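
The listing above does not show copyAndReplace, so here is a minimal sketch of what such a helper could look like, assuming it copies the key set and swaps one key for another. The stringSet type is a hypothetical stand-in for sets.String, and the key strings are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// stringSet is a minimal, hypothetical stand-in for sets.String.
type stringSet map[string]struct{}

// newStringSet builds a set from the given keys.
func newStringSet(keys ...string) stringSet {
	s := stringSet{}
	for _, k := range keys {
		s[k] = struct{}{}
	}
	return s
}

// sorted returns the keys in lexical order so that output is stable.
func (s stringSet) sorted() []string {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// copyAndReplace returns a copy of set in which replaceWhat, if present,
// is swapped for replaceWith. The input set is left untouched.
func copyAndReplace(set stringSet, replaceWhat, replaceWith string) stringSet {
	result := stringSet{}
	for k := range set {
		result[k] = struct{}{}
	}
	if _, ok := result[replaceWhat]; ok {
		delete(result, replaceWhat)
		result[replaceWith] = struct{}{}
	}
	return result
}

func main() {
	defaults := newStringSet("LeastRequestedPriority", "BalancedResourceAllocation")
	autoscaler := copyAndReplace(defaults, "LeastRequestedPriority", "MostRequestedPriority")
	fmt.Println(defaults.sorted())
	fmt.Println(autoscaler.sorted())
}
```

Copying before replacing matters here because both providers are registered from the same priSet; mutating it in place would also change the default provider.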

The registered algorithm providers end up in the algorithmProviderMap variable (which stores the lists of scheduling algorithm keys of each type), a package-level global variable.

// pkg/scheduler/algorithm_factory.go
// RegisterAlgorithmProvider registers a new algorithm provider with the algorithm registry.
func RegisterAlgorithmProvider(name string, predicateKeys, priorityKeys sets.String) string {
	schedulerFactoryMutex.Lock()
	defer schedulerFactoryMutex.Unlock()
	validateAlgorithmNameOrDie(name)
	algorithmProviderMap[name] = AlgorithmProviderConfig{
		FitPredicateKeys:     predicateKeys,
		PriorityFunctionKeys: priorityKeys,
	}
	return name
}

// pkg/scheduler/algorithm_factory.go
var (
	...
	algorithmProviderMap = make(map[string]AlgorithmProviderConfig)
	...
)

(2) Next, the init function in the register_predicates.go file of the defaults package, which mainly registers the predicates.

// pkg/scheduler/algorithmprovider/defaults/register_predicates.go
func init() {
	...
	// Fit is defined based on the absence of port conflicts.
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	scheduler.RegisterFitPredicate(predicates.PodFitsHostPortsPred, predicates.PodFitsHostPorts)
	// Fit is determined by resource availability.
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	scheduler.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
	...
}

(3) Finally, the init function in the register_priorities.go file of the defaults package, which mainly registers the priorities.

// pkg/scheduler/algorithmprovider/defaults/register_priorities.go
func init() {
	...
	// Prioritize nodes by least requested utilization.
	scheduler.RegisterPriorityMapReduceFunction(priorities.LeastRequestedPriority, priorities.LeastRequestedPriorityMap, nil, 1)

	// Prioritizes nodes to help achieve balanced resource usage
	scheduler.RegisterPriorityMapReduceFunction(priorities.BalancedResourceAllocation, priorities.BalancedResourceAllocationMap, nil, 1)
	...
}

The results of registering predicates and priorities are both stored in package-level global variables: registered predicates go into fitPredicateMap, and registered priorities go into priorityFunctionMap.

// pkg/scheduler/algorithm_factory.go
var (
	...
	fitPredicateMap = make(map[string]FitPredicateFactory)
	...
	priorityFunctionMap = make(map[string]PriorityConfigFactory)
	...
)
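
To show how the two kinds of registry fit together (providers hold only keys, while a separate map holds the functions), here is a small self-contained sketch. All names in it (fitPredicate, registerFitPredicate, registerProvider, predicatesFor, the "FitsCPU" key) are hypothetical; it mimics the pattern of package-level maps guarded by a mutex and filled from init, not the actual scheduler API.

```go
package main

import (
	"fmt"
	"sync"
)

// fitPredicate is a hypothetical, heavily simplified predicate signature.
type fitPredicate func(podCPU, nodeFreeCPU int) bool

var (
	mu           sync.RWMutex
	predicateMap = map[string]fitPredicate{} // key -> algorithm
	providerMap  = map[string][]string{}     // provider name -> algorithm keys only
)

// registerFitPredicate stores a predicate under its key and returns the key.
func registerFitPredicate(name string, p fitPredicate) string {
	mu.Lock()
	defer mu.Unlock()
	predicateMap[name] = p
	return name
}

// registerProvider stores the list of predicate keys for a provider.
func registerProvider(name string, keys ...string) {
	mu.Lock()
	defer mu.Unlock()
	providerMap[name] = keys
}

// predicatesFor resolves a provider's keys into the registered functions.
func predicatesFor(provider string) ([]fitPredicate, error) {
	mu.RLock()
	defer mu.RUnlock()
	keys, ok := providerMap[provider]
	if !ok {
		return nil, fmt.Errorf("provider %q is not registered", provider)
	}
	var out []fitPredicate
	for _, k := range keys {
		p, ok := predicateMap[k]
		if !ok {
			return nil, fmt.Errorf("predicate %q is not registered", k)
		}
		out = append(out, p)
	}
	return out, nil
}

// init runs when the package is loaded, before main, which is why
// importing a package is enough to trigger its registrations.
func init() {
	registerFitPredicate("FitsCPU", func(podCPU, nodeFreeCPU int) bool {
		return podCPU <= nodeFreeCPU
	})
	registerProvider("DefaultProvider", "FitsCPU")
}

func main() {
	preds, err := predicatesFor("DefaultProvider")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(len(preds), preds[0](2, 4))
}
```

Keeping only keys in the provider map means several providers can share one registered function, and a provider can be assembled before or after the functions themselves are registered.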

1.1.2 defaults.ApplyFeatureGates

It checks whether specific feature gates are enabled and, if so, additionally registers the corresponding predicate and priority algorithms.

// pkg/scheduler/algorithmprovider/defaults/defaults.go
func ApplyFeatureGates() (restore func()) {
	...

	// Only register EvenPodsSpread predicate & priority if the feature is enabled
	if utilfeature.DefaultFeatureGate.Enabled(features.EvenPodsSpread) {
		klog.Infof("Registering EvenPodsSpread predicate and priority function")
		// register predicate
		scheduler.InsertPredicateKeyToAlgorithmProviderMap(predicates.EvenPodsSpreadPred)
		scheduler.RegisterFitPredicate(predicates.EvenPodsSpreadPred, predicates.EvenPodsSpreadPredicate)
		// register priority
		scheduler.InsertPriorityKeyToAlgorithmProviderMap(priorities.EvenPodsSpreadPriority)
		scheduler.RegisterPriorityMapReduceFunction(
			priorities.EvenPodsSpreadPriority,
			priorities.CalculateEvenPodsSpreadPriorityMap,
			priorities.CalculateEvenPodsSpreadPriorityReduce,
			1,
		)
	}

	// Prioritizes nodes that satisfy pod's resource limits
	if utilfeature.DefaultFeatureGate.Enabled(features.ResourceLimitsPriorityFunction) {
		klog.Infof("Registering resourcelimits priority function")
		scheduler.RegisterPriorityMapReduceFunction(priorities.ResourceLimitsPriority, priorities.ResourceLimitsPriorityMap, nil, 1)
		// Register the priority function to specific provider too.
		scheduler.InsertPriorityKeyToAlgorithmProviderMap(scheduler.RegisterPriorityMapReduceFunction(priorities.ResourceLimitsPriority, priorities.ResourceLimitsPriorityMap, nil, 1))
	}

	...
}
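
The signature above returns a restore function, but the elided parts of the listing do not show how it is built. The sketch below is one plausible reading, stated as an assumption: take a snapshot of the registry before applying the gates and hand back a closure that puts the snapshot back. The featureGates and priorityKeys maps are hypothetical simplifications.

```go
package main

import "fmt"

// featureGates is a hypothetical stand-in for the feature gate lookup.
var featureGates = map[string]bool{
	"EvenPodsSpread":                 true,
	"ResourceLimitsPriorityFunction": false,
}

// priorityKeys is a hypothetical registry of enabled priority keys.
var priorityKeys = map[string]bool{"LeastRequestedPriority": true}

// applyFeatureGates registers extra keys for enabled gates and returns
// a function that restores the registry to its previous state.
// The snapshot-and-restore behaviour is an assumption of this sketch.
func applyFeatureGates() (restore func()) {
	snapshot := make(map[string]bool, len(priorityKeys))
	for k, v := range priorityKeys {
		snapshot[k] = v
	}

	if featureGates["EvenPodsSpread"] {
		priorityKeys["EvenPodsSpreadPriority"] = true
	}
	if featureGates["ResourceLimitsPriorityFunction"] {
		priorityKeys["ResourceLimitsPriority"] = true
	}

	return func() { priorityKeys = snapshot }
}

func main() {
	restore := applyFeatureGates()
	fmt.Println(len(priorityKeys)) // registry after applying the gates
	restore()
	fmt.Println(len(priorityKeys)) // registry after restoring
}
```

A restore closure of this kind is mostly useful in tests, where global registries must be reset between cases; runCommand, as shown earlier, simply ignores the return value.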

1.2 Run

The Run function starts the kube-scheduler component according to its configuration. Its core logic is as follows:

(1) Prepare the event clients used to report kube-scheduler events to the api-server;

(2) Call scheduler.New to instantiate the scheduler object;

(3) Start the event broadcaster;

(4) Set up the kube-scheduler health checks and start the HTTP services for health checks and metrics;

(5) Start the informers of all previously registered objects and begin synchronizing object resources;

(6) Call WaitForCacheSync and wait until all informers have finished synchronizing, so that the local cache is consistent with the data in etcd;

(7) Decide, according to the component's startup parameters, whether to enable leader election;

(8) Call sched.Run to start the kube-scheduler component (sched.Run is the entry point for the later analysis of the kube-scheduler core processing logic).

// cmd/kube-scheduler/app/server.go
func Run(ctx context.Context, cc schedulerserverconfig.CompletedConfig, outOfTreeRegistryOptions ...Option) error {
	// To help debugging, immediately log version
	klog.V(1).Infof("Starting Kubernetes Scheduler version %+v", version.Get())

	outOfTreeRegistry := make(framework.Registry)
	for _, option := range outOfTreeRegistryOptions {
		if err := option(outOfTreeRegistry); err != nil {
			return err
		}
	}

	// 1. Prepare the event clients used to report kube-scheduler events to the api-server
	// Prepare event clients.
	if _, err := cc.Client.Discovery().ServerResourcesForGroupVersion(eventsv1beta1.SchemeGroupVersion.String()); err == nil {
		cc.Broadcaster = events.NewBroadcaster(&events.EventSinkImpl{Interface: cc.EventClient.Events("")})
		cc.Recorder = cc.Broadcaster.NewRecorder(scheme.Scheme, cc.ComponentConfig.SchedulerName)
	} else {
		recorder := cc.CoreBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: cc.ComponentConfig.SchedulerName})
		cc.Recorder = record.NewEventRecorderAdapter(recorder)
	}

	// 2. Call scheduler.New to instantiate the scheduler object
	// Create the scheduler.
	sched, err := scheduler.New(cc.Client,
		cc.InformerFactory,
		cc.PodInformer,
		cc.Recorder,
		ctx.Done(),
		scheduler.WithName(cc.ComponentConfig.SchedulerName),
		scheduler.WithAlgorithmSource(cc.ComponentConfig.AlgorithmSource),
		scheduler.WithHardPodAffinitySymmetricWeight(cc.ComponentConfig.HardPodAffinitySymmetricWeight),
		scheduler.WithPreemptionDisabled(cc.ComponentConfig.DisablePreemption),
		scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),
		scheduler.WithBindTimeoutSeconds(cc.ComponentConfig.BindTimeoutSeconds),
		scheduler.WithFrameworkOutOfTreeRegistry(outOfTreeRegistry),
		scheduler.WithFrameworkPlugins(cc.ComponentConfig.Plugins),
		scheduler.WithFrameworkPluginConfig(cc.ComponentConfig.PluginConfig),
		scheduler.WithPodMaxBackoffSeconds(cc.ComponentConfig.PodMaxBackoffSeconds),
		scheduler.WithPodInitialBackoffSeconds(cc.ComponentConfig.PodInitialBackoffSeconds),
	)
	if err != nil {
		return err
	}

	// 3. Start the event broadcaster
	// Prepare the event broadcaster.
	if cc.Broadcaster != nil && cc.EventClient != nil {
		cc.Broadcaster.StartRecordingToSink(ctx.Done())
	}
	if cc.CoreBroadcaster != nil && cc.CoreEventClient != nil {
		cc.CoreBroadcaster.StartRecordingToSink(&corev1.EventSinkImpl{Interface: cc.CoreEventClient.Events("")})
	}

	// 4. Set up the health checks and start the HTTP services for health checks and metrics
	// Setup healthz checks.
	var checks []healthz.HealthChecker
	if cc.ComponentConfig.LeaderElection.LeaderElect {
		checks = append(checks, cc.LeaderElection.WatchDog)
	}

	// Start up the healthz server.
	if cc.InsecureServing != nil {
		separateMetrics := cc.InsecureMetricsServing != nil
		handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, separateMetrics, checks...), nil, nil)
		if err := cc.InsecureServing.Serve(handler, 0, ctx.Done()); err != nil {
			return fmt.Errorf("failed to start healthz server: %v", err)
		}
	}
	if cc.InsecureMetricsServing != nil {
		handler := buildHandlerChain(newMetricsHandler(&cc.ComponentConfig), nil, nil)
		if err := cc.InsecureMetricsServing.Serve(handler, 0, ctx.Done()); err != nil {
			return fmt.Errorf("failed to start metrics server: %v", err)
		}
	}
	if cc.SecureServing != nil {
		handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, false, checks...), cc.Authentication.Authenticator, cc.Authorization.Authorizer)
		// TODO: handle stoppedCh returned by c.SecureServing.Serve
		if _, err := cc.SecureServing.Serve(handler, 0, ctx.Done()); err != nil {
			// fail early for secure handlers, removing the old error loop from above
			return fmt.Errorf("failed to start secure server: %v", err)
		}
	}

	// 5. Start the informers of all previously registered objects and begin synchronizing object resources
	// Start all informers.
	go cc.PodInformer.Informer().Run(ctx.Done())
	cc.InformerFactory.Start(ctx.Done())

	// 6. Wait until all informers have synced, so that the local cache is consistent with the data in etcd
	// Wait for all caches to sync before scheduling.
	cc.InformerFactory.WaitForCacheSync(ctx.Done())

	// 7. Decide, according to the component's startup parameters, whether to enable leader election
	// If leader election is enabled, runCommand via LeaderElector until done and exit.
	if cc.LeaderElection != nil {
		cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
			OnStartedLeading: sched.Run,
			OnStoppedLeading: func() {
				klog.Fatalf("leaderelection lost")
			},
		}
		leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
		if err != nil {
			return fmt.Errorf("couldn't create leader elector: %v", err)
		}

		leaderElector.Run(ctx)

		return fmt.Errorf("lost lease")
	}

	// 8. Call sched.Run to start the kube-scheduler component
	// Leader election is disabled, so runCommand inline until done.
	sched.Run(ctx)
	return fmt.Errorf("finished without leader elect")
}
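
The ordering in steps 5, 6 and 8 (start informers, wait for their caches to sync, only then run the scheduling loop until the context ends) can be sketched with the standard library alone. Everything here is a hypothetical simplification: fakeInformer, waitForCacheSync, run and runWithTimeout are illustrative names, and leader election is left out.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fakeInformer simulates an informer that reports "synced" after its
// initial list and then keeps running until it is stopped.
type fakeInformer struct {
	name   string
	synced chan struct{}
}

func newFakeInformer(name string) *fakeInformer {
	return &fakeInformer{name: name, synced: make(chan struct{})}
}

// run signals that the initial sync is done, then blocks until stop closes.
func (i *fakeInformer) run(stop <-chan struct{}) {
	close(i.synced)
	<-stop
}

// waitForCacheSync blocks until every informer has synced or stop closes.
func waitForCacheSync(stop <-chan struct{}, informers ...*fakeInformer) bool {
	for _, inf := range informers {
		select {
		case <-inf.synced:
		case <-stop:
			return false
		}
	}
	return true
}

// run mirrors the ordering of the steps: start informers, wait for sync,
// then run the scheduling loop inline until ctx is done.
func run(ctx context.Context, schedule func(context.Context)) error {
	informers := []*fakeInformer{newFakeInformer("pods"), newFakeInformer("nodes")}
	var wg sync.WaitGroup
	for _, inf := range informers {
		wg.Add(1)
		go func(inf *fakeInformer) {
			defer wg.Done()
			inf.run(ctx.Done())
		}(inf)
	}
	if !waitForCacheSync(ctx.Done(), informers...) {
		return fmt.Errorf("caches did not sync")
	}
	schedule(ctx)
	wg.Wait()
	// Like the original, returning from the loop is reported as an error.
	return fmt.Errorf("finished without leader elect")
}

// runWithTimeout runs the sequence for the given number of milliseconds
// and reports whether the scheduling loop was reached.
func runWithTimeout(ms int) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()
	ran := false
	err := run(ctx, func(ctx context.Context) {
		ran = true
		<-ctx.Done()
	})
	return ran, err
}

func main() {
	ran, err := runWithTimeout(100)
	fmt.Println(ran, err)
}
```

Waiting for the caches first is what keeps the scheduling loop from making decisions against an empty or partial view of the cluster.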

1.2.1 scheduler.New

Instantiating the scheduler object consists of three parts:

(1) Instantiate the informers for objects such as pod, node, pvc and pv;

(2) Call configurator.CreateFromProvider or configurator.CreateFromConfig to instantiate the scheduler, either from the previously registered built-in scheduling algorithms or from a user-provided scheduling policy;

(3) Register event handlers on the informer objects.

// pkg/scheduler/scheduler.go
func New(client clientset.Interface,
	informerFactory informers.SharedInformerFactory,
	podInformer coreinformers.PodInformer,
	recorder events.EventRecorder,
	stopCh <-chan struct{},
	opts ...Option) (*Scheduler, error) {

	stopEverything := stopCh
	if stopEverything == nil {
		stopEverything = wait.NeverStop
	}

	options := defaultSchedulerOptions
	for _, opt := range opts {
		opt(&options)
	}

	// 1. Instantiate the informers for objects such as node, pvc and pv
	schedulerCache := internalcache.New(30*time.Second, stopEverything)
	volumeBinder := volumebinder.NewVolumeBinder(
		client,
		informerFactory.Core().V1().Nodes(),
		informerFactory.Storage().V1().CSINodes(),
		informerFactory.Core().V1().PersistentVolumeClaims(),
		informerFactory.Core().V1().PersistentVolumes(),
		informerFactory.Storage().V1().StorageClasses(),
		time.Duration(options.bindTimeoutSeconds)*time.Second,
	)

	registry := options.frameworkDefaultRegistry
	if registry == nil {
		registry = frameworkplugins.NewDefaultRegistry(&frameworkplugins.RegistryArgs{
			VolumeBinder: volumeBinder,
		})
	}
	registry.Merge(options.frameworkOutOfTreeRegistry)

	snapshot := nodeinfosnapshot.NewEmptySnapshot()

	configurator := &Configurator{
		client:                         client,
		informerFactory:                informerFactory,
		podInformer:                    podInformer,
		volumeBinder:                   volumeBinder,
		schedulerCache:                 schedulerCache,
		StopEverything:                 stopEverything,
		hardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,
		disablePreemption:              options.disablePreemption,
		percentageOfNodesToScore:       options.percentageOfNodesToScore,
		bindTimeoutSeconds:             options.bindTimeoutSeconds,
		podInitialBackoffSeconds:       options.podInitialBackoffSeconds,
		podMaxBackoffSeconds:           options.podMaxBackoffSeconds,
		enableNonPreempting:            utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NonPreemptingPriority),
		registry:                       registry,
		plugins:                        options.frameworkPlugins,
		pluginConfig:                   options.frameworkPluginConfig,
		pluginConfigProducerRegistry:   options.frameworkConfigProducerRegistry,
		nodeInfoSnapshot:               snapshot,
		algorithmFactoryArgs: AlgorithmFactoryArgs{
			SharedLister:                   snapshot,
			InformerFactory:                informerFactory,
			VolumeBinder:                   volumeBinder,
			HardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,
		},
		configProducerArgs: &frameworkplugins.ConfigProducerArgs{},
	}

	metrics.Register()

	// 2. Instantiate the scheduler from the previously registered built-in scheduling
	// algorithms (CreateFromProvider) or from a user-provided scheduling policy (CreateFromConfig)
	var sched *Scheduler
	source := options.schedulerAlgorithmSource
	switch {
	case source.Provider != nil:
		// Create the config from a named algorithm provider.
		sc, err := configurator.CreateFromProvider(*source.Provider)
		if err != nil {
			return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
		}
		sched = sc
	case source.Policy != nil:
		// Create the config from a user specified policy source.
		policy := &schedulerapi.Policy{}
		switch {
		case source.Policy.File != nil:
			if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
				return nil, err
			}
		case source.Policy.ConfigMap != nil:
			if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
				return nil, err
			}
		}
		sc, err := configurator.CreateFromConfig(*policy)
		if err != nil {
			return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
		}
		sched = sc
	default:
		return nil, fmt.Errorf("unsupported algorithm source: %v", source)
	}
	// Additional tweaks to the config produced by the configurator.
	sched.Recorder = recorder
	sched.DisablePreemption = options.disablePreemption
	sched.StopEverything = stopEverything
	sched.podConditionUpdater = &podConditionUpdaterImpl{client}
	sched.podPreemptor = &podPreemptorImpl{client}
	sched.scheduledPodsHasSynced = podInformer.Informer().HasSynced

	// 3. Register event handlers on the informer objects
	AddAllEventHandlers(sched, options.schedulerName, informerFactory, podInformer)
	return sched, nil
}
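
The opts ...Option parameter and the opt(&options) loop at the top of New follow the functional options pattern. The sketch below shows the pattern in isolation; the struct fields mirror a few of the names seen above, but the default values and the newOptions helper are illustrative assumptions rather than the scheduler's actual definitions.

```go
package main

import "fmt"

// schedulerOptions holds a small, illustrative subset of settings.
type schedulerOptions struct {
	schedulerName      string
	disablePreemption  bool
	bindTimeoutSeconds int64
}

// Option mutates schedulerOptions; callers pass any number of them.
type Option func(*schedulerOptions)

// WithName sets the scheduler name.
func WithName(name string) Option {
	return func(o *schedulerOptions) { o.schedulerName = name }
}

// WithPreemptionDisabled toggles preemption.
func WithPreemptionDisabled(disabled bool) Option {
	return func(o *schedulerOptions) { o.disablePreemption = disabled }
}

// WithBindTimeoutSeconds sets the bind timeout.
func WithBindTimeoutSeconds(seconds int64) Option {
	return func(o *schedulerOptions) { o.bindTimeoutSeconds = seconds }
}

// defaultSchedulerOptions carries illustrative defaults (assumed values).
var defaultSchedulerOptions = schedulerOptions{
	schedulerName:      "default-scheduler",
	bindTimeoutSeconds: 100,
}

// newOptions copies the defaults and applies each option in order,
// the same shape as the loop at the top of New.
func newOptions(opts ...Option) schedulerOptions {
	options := defaultSchedulerOptions // struct copy, defaults stay untouched
	for _, opt := range opts {
		opt(&options)
	}
	return options
}

func main() {
	o := newOptions(WithName("custom-scheduler"), WithPreemptionDisabled(true))
	fmt.Printf("%+v\n", o)
}
```

The benefit of this design is that New keeps a stable signature: new settings are added as new With... functions, and callers that do not care about a setting simply inherit the default.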