This article describes a workaround that uses a standard private GitHub repository as a Helm chart repository, which is then accessible via the Helm CLI, ArgoCD, GitHub Actions and more.
To provide some more context, my use case was that of publishing helm charts from private GitHub repositories using GitHub Actions, and that’s what we’re going to look at today.
Quick sidenote: you may alternatively set up a public repository, which allows everyone to get read access to it.
You’ll need a Personal Access Token to access the helm registry. Check the documentation on how to create one.
We start by creating a simple private repository, where all our charts will be stored.
Create a blank `index.yaml` file in the root of the repository (e.g. `touch index.yaml`).
Note: We are creating it in the root directory of the `main` branch, but you can also use a subfolder/branch as the registry. You will just need to change the URL when you access the registry, changing the `$BRANCH` part of the URL and appending the directory.
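For example, with a hypothetical `registry` branch and a `charts/` subfolder (both names are illustrative; substitute your own layout), the `helm repo add` command from the next section would become:

```shell
$ helm repo add helm-registry 'https://$TOKEN@raw.githubusercontent.com/$USERNAME/$REPO/registry/charts'
```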
You can now add the repository as a helm registry using the command line:
helm repo add helm-registry 'https://$TOKEN@raw.githubusercontent.com/$USERNAME/$REPO/$BRANCH'
For example, with a token `ghp_xxxxxxxx`, to access the branch `main` on the `joaomlneto/helm` repository:
$ helm repo add helm-registry 'https://ghp_xxxxxxxx@raw.githubusercontent.com/joaomlneto/helm/main'
"helm-registry" has been added to your repositories
You can confirm it is added by running `helm repo list`:
$ helm repo list
NAME URL
…
helm-registry https://ghp_xxxxxxxx@raw.githubusercontent.com/joaomlneto/helm/main
I assume you already have a packaged chart. If not, you can create a chart with `helm create <name>` and package it with `helm package <name>`.
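If you need a chart to experiment with, a minimal one can be scaffolded and packaged like this (the chart name `testchart` is just an example):

```shell
$ helm create testchart     # generates a skeleton chart in ./testchart
$ helm package testchart    # produces testchart-0.1.0.tgz in the current directory
```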
Publishing a chart to the repository requires a small workaround, as we can't use `helm push`. Instead, we'll just commit the changes manually:

1. Clone the repository (e.g. `git clone https://github.com/joaomlneto/helm.git`)
2. Copy the packaged chart to the repository root (`cp <chart.tgz> <repository root>`)
3. Regenerate `index.yaml` by running `helm repo index .` on the root of the repository
4. Commit and push the changes (`git add . && git commit -m "Add <name> v0.1.0" && git push`)

It may take up to 5 minutes (as contents from `raw.githubusercontent.com` are cached), but the chart should then be visible if you run `helm repo update`, followed by `helm search repo <name>`.
$ helm repo update
Hang tight while we grab the latest from your chart repositories...
…
...Successfully got an update from the "helm-registry" chart repository
…
Update Complete. ⎈Happy Helming!⎈
$ helm search repo testchart
NAME CHART VERSION APP VERSION DESCRIPTION
helm-registry/<name> 0.1.0 1.16.0 A Helm chart for Kubernetes
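Putting the manual publishing steps together, a typical session looks like this (repository and chart names follow the earlier examples and are placeholders):

```shell
$ git clone https://github.com/joaomlneto/helm.git
$ cp testchart-0.1.0.tgz helm/
$ cd helm
$ helm repo index .
$ git add . && git commit -m "Add testchart v0.1.0" && git push
```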
Publishing the artifact via GitHub Actions is the same — you clone the chart repository, copy the packaged chart file, update the index and then commit the changes.
I suggest splitting the chart generation and chart publishing in two separate jobs, and storing your Packages Access Token in an Encrypted Secret.
This is an example job of how you can generate the chart:
jobs:
  # …
  package:
    name: Package Helm Chart
    runs-on: ubuntu-latest
    outputs:
      chart_filename: ${{ steps.chart_filename.outputs.chart_filename }}
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Install Helm
        uses: azure/setup-helm@v2.1
        with:
          version: v3.5.2
      - name: Package Helm Charts
        run: helm package chart
      - id: chart_filename
        name: Output Chart Filename
        run: echo "::set-output name=chart_filename::$(ls *.tgz)"
      - name: Upload Helm Chart Package as Workflow Artifact
        uses: actions/upload-artifact@v3
        with:
          name: ${{ steps.chart_filename.outputs.chart_filename }}
          path: ${{ steps.chart_filename.outputs.chart_filename }}
          retention-days: 1
          if-no-files-found: error
This job will package the chart and upload it as a workflow artifact. This can be later accessed by our `publish` job:
jobs:
  # …
  publish:
    name: Publish Helm Chart
    runs-on: ubuntu-latest
    needs:
      - package
    steps:
      - name: Checkout Chart Repository
        uses: actions/checkout@v3
        with:
          repository: joaomlneto/helm # CHANGE ME!!!
          token: ${{ secrets.PACKAGES_ACCESS_TOKEN }}
      - name: Configure Git
        run: |
          git config user.name "$GITHUB_ACTOR"
          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
      - name: Install Helm
        uses: azure/setup-helm@v2.1
        with:
          version: v3.5.2
      - name: Retrieve Chart Package Workflow Artifact
        uses: actions/download-artifact@v3
        with:
          name: ${{ needs.package.outputs.chart_filename }}
      - name: Add chart to repository
        run: |
          helm repo index .
          git add .
          git commit -m "Add ${{ needs.package.outputs.chart_filename }}"
          git push
This article showed that, with very few compromises, we are able to use a vanilla private GitHub repository to emulate a private Helm registry. If something’s amiss, please do let me know!
The writeset is the set of things written (created, updated or deleted), commonly in the context of databases: given a specific transaction, it is the set of things that will be affected by its execution.
During my PhD, when working on the Bandwidth-adaptive Page Placement in NUMA publication, one interesting question arose - could we do the same for regular code?
Could we write a function computeWriteSet that receives an arbitrary function to be executed and returns the set of writes (including from other functions called from within)? That’s what I asked on StackOverflow.
We worked on a Linux-specific mechanism that relies on playing with mprotect, a function that controls the permissions/protections of regions of memory. In Linux, all memory is split into pages (usually of 4KB each). The minimum granularity at which these protections can be specified is a single page.
Our solution simply protects the whole memory from being written (which causes dreadful segmentation faults every time a write is attempted). We then use a Signal Handler to catch the signal, unprotect the memory and record the address accessed.
This is the most compact version I was able to devise:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <ucontext.h>
#include <fcntl.h>
#include <execinfo.h>
#include <sys/mman.h>

#include <set>
#include <functional>
#include <cassert>

extern "C" {
extern int __data_start;
extern int _end;
}

#define PAGE_SIZE sysconf(_SC_PAGESIZE)
#define PAGE_MASK (PAGE_SIZE - 1)
#define PAGE_ALIGN_DOWN(x) (((intptr_t) (x)) & ~PAGE_MASK)
#define PAGE_ALIGN_UP(x) ((((intptr_t) (x)) + PAGE_MASK) & ~PAGE_MASK)

#define GLOBALS_START PAGE_ALIGN_DOWN((intptr_t) &__data_start)
#define GLOBALS_END PAGE_ALIGN_UP((intptr_t) &_end - 1)
#define GLOBALS_SIZE (GLOBALS_END - GLOBALS_START)

std::set<void*> *addresses = new std::set<void*>();

void sighandler(int signum, siginfo_t *siginfo, void *ctx) {
  void *addr = siginfo->si_addr;
  void *aligned_addr = reinterpret_cast<void*>(PAGE_ALIGN_DOWN(addr));
  switch (siginfo->si_code) {
    case SEGV_ACCERR:
      mprotect(aligned_addr, PAGE_SIZE, PROT_READ | PROT_WRITE);
      addresses->insert(aligned_addr);
      break;
    default:
      exit(-1);
  }
}

void computeWriteSet(std::function<void()> f) {
  static bool initialized = false;
  if (!initialized) {
    // install signal handler
    stack_t sigstk;
    sigstk.ss_sp = malloc(SIGSTKSZ);
    sigstk.ss_size = SIGSTKSZ;
    sigstk.ss_flags = 0;
    sigaltstack(&sigstk, NULL);
    struct sigaction siga;
    sigemptyset(&siga.sa_mask);
    sigaddset(&siga.sa_mask, SIGSEGV);
    sigprocmask(SIG_BLOCK, &siga.sa_mask, NULL);
    siga.sa_flags = SA_SIGINFO | SA_ONSTACK | SA_RESTART | SA_NODEFER;
    siga.sa_sigaction = sighandler;
    sigaction(SIGSEGV, &siga, NULL);
    sigprocmask(SIG_UNBLOCK, &siga.sa_mask, NULL);
    initialized = true;
  }
  addresses->clear();
  printf("\nexecuting function\n");
  printf("--------------\n");
  mprotect(reinterpret_cast<void*>(GLOBALS_START), GLOBALS_SIZE, PROT_READ);
  f();
  mprotect(reinterpret_cast<void*>(GLOBALS_START), GLOBALS_SIZE, PROT_READ | PROT_WRITE);
  printf("--------------\n");
  printf("pages written:\n");
  for (auto addr : *addresses) {
    printf("%p\n", addr);
  }
}
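As a quick sanity check of the `PAGE_ALIGN_DOWN`/`PAGE_ALIGN_UP` macros used above, the same bitmask arithmetic can be reproduced in shell (assuming a 4KB page size; the address `4100` is an arbitrary example):

```shell
PAGE_SIZE=4096
addr=4100                                                     # arbitrary test address
aligned_down=$(( addr & ~(PAGE_SIZE - 1) ))                   # start of the page containing addr
aligned_up=$(( (addr + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1) ))   # next page boundary at or above addr
echo "$aligned_down $aligned_up"                              # prints: 4096 8192
```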
We can then use it to check the writes done by a given function:
void f() {
  static int x[1024] = {0};
  static int y[1024] = {0};
  static int z[1024] = {0};
  static bool firsttime = true;
  if (firsttime) {
    printf("&x[0] = %p\n&y[0] = %p\n&z[0] = %p\n", x, y, z);
    firsttime = false;
  }
  if (y[0]) z[0]++;
  if (x[0]) y[0]++;
  x[0] = (x[0] + 1) % 2;
  printf("{ x, y, z } = { %d, %d, %d }\n", x[0], y[0], z[0]);
}

int main() {
  computeWriteSet(f);
  computeWriteSet(f);
  computeWriteSet(f);
  return 0;
}
When executed, it produces the following output:
executing function
--------------
&x[0] = 0x6041c0
&y[0] = 0x6051c0
&z[0] = 0x6061c0
{x, y, z} = {1, 0, 0}
--------------
pages written:
0x604000
executing function
--------------
{x, y, z} = {0, 1, 0}
--------------
pages written:
0x604000
0x605000
executing function
--------------
{x, y, z} = {1, 1, 1}
--------------
pages written:
0x604000
0x606000
One limitation is granularity: `mprotect` can only operate on whole pages at the time of writing. If a page contains several variables, where some are affected and others are not, we will not be able to tell which were affected.

In the end, the answer is yes, we can do the same to functions as we can do to database transactions, albeit within the limitations of the operating system and with a significant performance penalty, which prohibits its frequent/extensive usage.
There is a boilerplate repository on GitHub that is ready to use with a simple clone. This was the main motivation for this writeup, as it has become one of the most popular references to learn about this.
Travis CI (or any other CI, for that matter) is a service that will run your test suite every time you push a new commit to GitHub, or whenever you receive a pull request.
There are many advantages[1] to using these kinds of services, such as ensuring builds are reproducible on a fresh OS install, having repeatable testing processes, and automating publishing and deployment of your releases.
We will build a simple calculator library in Java, and we will use Maven for building the project.
Create the project model and the source file:
/pom.xml
/src/main/java/io/github/joaomlneto/travis_ci_tutorial_java/SimpleCalculator.java
TODO: compile and show it’s running!
We now have a working calculator! Let’s add some unit tests:
/src/test/java/io/github/joaomlneto/travis_ci_tutorial_java/SimpleCalculatorTest.java
TODO: run tests manually
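The TODOs above boil down to standard Maven invocations; a typical session (output omitted) would be:

```shell
$ mvn compile    # compile the calculator sources
$ mvn test       # run the unit test suite
```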
Before we proceed, you will also need to authorize your GitHub account and GitHub repository to use Travis CI, as described in the Travis CI tutorial "To get started with Travis CI".
We can now add Travis CI to our project by creating the `/.travis.yml` file:
language: java
This minimalistic configuration will try to autodetect your repository configuration and use default settings. There are many options that you can explore, such as selecting the operating system, installing required OS packages, setting up docker containers, etc.
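As an illustration of those options, a slightly expanded (hypothetical) configuration might look like the following; the exact keys and values should be checked against the Travis CI documentation:

```yaml
language: java
os: linux
jdk: openjdk11      # pin a specific JDK
addons:
  apt:
    packages:
      - graphviz    # example of an extra OS package
```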
In our case, Travis CI will automatically detect we are using Maven through the existence of the `pom.xml` file and will execute the tests as per the Maven Lifecycle, executing the following commands:
$ mvn install -DskipTests=true -Dmaven.javadoc.skip=true -B -V
$ mvn test -B
...
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running io.github.joaomlneto.travis_ci_tutorial_java.SimpleCalculatorTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.204 sec
Results :
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] --- jacoco-maven-plugin:0.7.7.201606060606:report (report) @ travis-ci-tutorial-java ---
[INFO] Loading execution data file /home/travis/build/joaomlneto/travis-ci-tutorial-java/target/jacoco.exec
[INFO] Analyzed bundle 'travis-ci-tutorial-java' with 1 classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 4.358 s
[INFO] Finished at: 2019-03-28T12:26:18Z
[INFO] Final Memory: 22M/255M
[INFO] ------------------------------------------------------------------------
The command "mvn test -B" exited with 0.
We can trigger Travis to run a build either manually on the website or simply by pushing to GitHub.
[1] DevOps Zone article: 9 Benefits of Continuous Integration
The assignment: Analyze an application for deployment in the MareNostrum3 Supercomputer! Our choice, no restrictions. And we get to run it in one of the fastest supercomputers in existence! How cool is that?
However, freedom comes at a cost – finding a suitable application was no easy task. Some were inadequate for HPC, some were terribly complicated. I remember searching for several days!
I ended up choosing VPFFT++, a Crystal viscoplasticity proxy application by Exascale Co-design Center for Materials in Extreme Environments (ExMatEx).
VPFFT is a solver for Polycrystalline Mechanical Properties, an implementation of a mesoscale micro-mechanical materials model. VPFFT simulates the evolution of a material under deformation by solving the viscoplasticity model. The solution time to the viscoplasticity model, which is described by a set of partial differential equations, is significantly reduced by application of the Fast Fourier Transform (FFT) in the VPFFT algorithm.
VPFFT++ is a reference implementation of the algorithms in VPFFT. While capturing most of its computational complexity, VPFFT++ does not employ most of the machine-specific optimizations and additional physics feature sets from the original VPFFT code.
VPFFT++ is written in C++ and uses GNU Make to reduce external dependency requirements for building it. It is claimed to have implemented parallelization through MPI and OpenMP, though no traces of OpenMP usage were present in the source code at the time.
It depends on two open source libraries:
Very cool slides of a presentation on VPFFT++ can be found here
Full disclaimer: I never really understood what the algorithm accomplishes. I am no expert in VPFFT, nor in material modelling. My task was to just analyze the software and try to improve its efficiency in MareNostrum3.
At the time I started the analysis, VPFFT++’s latest version only had support for OpenMP, with a hardcoded limit of 8 threads doing work. Damn! I can’t use it either, I thought. However, I got lucky! After contacting the author (a big thanks to Frankie Li from LLNL), he kindly and swiftly updated the repository with an MPI version (while at the same time dropping OpenMP support). I can live without OpenMP. Let’s give it a shot!
First version of VPFFT++ on two nodes, one process per node, default view
Updated version of VPFFT++ on 16 nodes, one process per node, MPI calls view
So, the problem is internally represented as a 3D matrix, and split along its X axis in equal-length chunks (that is, if the length in X axis is divisible by the number of processes), each to be assigned to a specific process. The application itself does not really accept the problem parameters upon execution - all of them are hardcoded(!), and the ones that come by default end up generating a runtime error if compiled with the debug option. However, the production version does not perform sanity checks and runs fine (although the application results might lose their significance completely).
In order to tweak the behavior of the application, several parameters that correspond to the problem to be solved must be tweaked in the source code itself:
- Matrix dimensions, default `16x16x16`. This was tweaked to be up to `256x256x256` due to some processes being idle at high process counts. Matrixes bigger than `256x256x256` are too big to fit in MareNostrum III nodes’ memory. If we require more workers, one may increase the X dimension while reducing the Y and Z lengths. This may be a problem if one requires a matrix that doesn’t fit in memory.
- Maximum number of iterations per strain step, default `100`.
- Time step duration, default `3E-2`. This represents the simulation duration of each strain simulation step. It might influence the number of iterations indirectly, by influencing the result of each iteration, by lowering the iteration error (see next parameter).
- Convergence epsilon, default `10E5`.

As stated before, the algorithm is composed of iterations at two levels - an outer loop, and an inner loop. We will call each iteration of the outer loop a time step simulation - I will use both definitions interchangeably.
The outer loop simply calls the strain step algorithm (the inner loop) a predetermined number of times. For each time step completed, a line is written in the output file (`Stress_Strain.txt`).
The inner loop is slightly more complicated. Before the actual iterations there is a compute-intensive phase, followed by an MPI AllReduce communication phase - we will call this the time step initialization phase. Afterwards, it performs several iterations, up to a certain predetermined maximum (see previous section). If it finds that the stress and strain deviation from the solution (an error value) is under the convergence epsilon (see previous section), it may run fewer iterations.
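The two-level structure described above can be sketched as pseudocode (the names are descriptive, not taken from the VPFFT++ source):

```
for step in 1..NUM_TIME_STEPS:            # outer loop: one time step each
    initialize_time_step()                # compute-intensive phase + MPI AllReduce
    for iter in 1..MAX_ITERATIONS:        # inner loop (default maximum: 100)
        error = run_strain_iteration()
        if error < EPSILON: break         # may converge in fewer iterations
    write_line("Stress_Strain.txt")       # one output line per completed time step
```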
A visual representation of this decomposition can be observed in Figure 3. We see the initialization phase of the first time step lasts a lot longer and is very imbalanced. This behavior is exclusive to the first iteration. Therefore, the initialization phase of this time step is not included in the analysis. Figure 4 shows the MPI calls timeline in an execution of a single time step.
Visual decomposition of the application structure using 4 outer iterations and 5 inner iterations.
Different MPI calls located within the trace of a single time step.
Preliminary runs on the scalability of the application reveal that it scales well beyond the 16-core limitation imposed by the default values, but with reduced efficiency. The results can be seen in Figure 5. The data gathered was based on the execution time reported by the application. Time spent in initializations prior to the algorithm execution is not included. Strong scaling was evaluated using 5 time steps, with 10 iterations each, using a 256x256x256 matrix. Weak scaling was evaluated using 5 time steps, with 5 iterations each, using a Nx256x256 matrix, where N denotes the number of processes used for each run. Figure 6 shows the percentage of time spent in MPI communication phases for Strong Scaling - the values were gathered from traces generated by Extrae in independent runs from those in Figure 5.
Global Speedup and Efficiency of the Application for Strong Scaling (blue) and Weak Scaling (green). The x-axis represents the number of processes.
Percentage of time spent in MPI calls for Strong Scaling tests. The x axis represents the number of processes.
We now present an efficiency model for analysis of VPFFT++, in which each factor of an ideal execution would be at 100%.

Efficiency Model for VPFFT++. The x axis represents the number of processes.
We see the biggest cause for loss in efficiency is Macro Load Imbalance. We can see that all parameters degrade, though, and eventually each of them may become a problem by itself.
As shown before, the time step initialization phases are not well-balanced. This imbalance is deterministic, and has a slightly different pattern for each of the iterations. In Figure 8, one can see that the shape of these regions (in red) remains the same with varying number of processes. However, at higher core counts, we see there is some extra imbalance in the otherwise balanced iterations. This effect appears to be non-deterministic, and causes some processes to take extra time doing computation.
Computational Imbalance regions with 4, 8, 16, 32, 64 and 128 processes. Identified in red are regions of periodic imbalance; identified in orange are regions where imbalance seems random.
To gain more insight into what causes these two regions of imbalance, the first step is to take a look at the Useful Duration. In Figure 9, one can clearly identify the same red and orange regions as before. We see that the red regions share a higher duration than all other phases of the program, and the orange region is a slight deviation in duration of two consecutive processes. It should be pointed out that these traces were gathered using 4 processes per node (which has two sockets), and both processes experiencing this variation are running on the same node.
Useful Duration for 32 processes. Left: timeline of duration of computation regions (blue is higher). On the right, the corresponding histogram in which the x-axis represents duration in a linear fashion.
We can try to identify the causes of noise by analyzing different views of the same trace, as seen in Figure 10. The red region appears to be algorithmic, while the orange one appears to be caused by noise - a reduction in CPU frequency(!). Both processes were placed on the same CPU and TurboBoost got deactivated for a short period of time. We might be able to remove this source of noise by running more processes on a single node (which would probably cause all processes to run at the normal CPU frequency), but that is outside the scope of this report.
Total instructions (left) and Cycles per µs (right) for 32 processes.
The noise caused by the seemingly innocent change in CPU frequency might be the main factor driving down the efficiency of the parallel execution. In order to determine this, we will attempt to remove the noise by simulating a constant CPU frequency (at 3.3GHz, which is the Turbo frequency for two cores being used in the CPUs used by MareNostrum III). The process is described briefly in a graphical way in Figure 11. From the original trace, we will run a Dimemas simulation where we specify a fixed CPU wall clock time. From this, we can apply the same efficiency model as before by also simulating execution with a perfect network, which is depicted in Figure 12 with a comparison against the efficiency obtained with noise. We can see there is a huge improvement in parallel efficiency where CPU noise was a problem (over 32 processes). Now we are left with no significant external noise, and the imbalance that we can see is caused by the algorithm itself.
Methodology applied to remove noise related to changes in CPU frequency.
Efficiency model for the original trace (left) and trace with a constant CPU wall clock time (right). The total time of the original is 13.121 seconds, and the simulated trace is 12.044 seconds.
In this section we dive deeper into the code of VPFFT to try and find the possible causes of imbalance caused by the algorithm in the time step initialization phases, and try to design potential solutions for the problem. For this, we collect information about the user functions, specifically the ones that are called in the time step initialization phases. This section of the code can be seen in the file `Src/MaterialGrid.cpp`, in the function `MaterialGrid::RunSingleStrainStep`. The relevant functions being called before the iterations are `UpdateSchmidtTensors`, `InitializeComputation` and `BuildLinearRefMediumStiffness`. In Figure 13 we can see the results and clearly identify the `InitializeComputation` function as the region where imbalance occurs. We proceed to check its code and repeat the process - results are shown in Figure 14.
Left: User functions for the time step initialization phases. Right: colors identifying which regions correspond to each of the functions.
Left: User functions for the InitializeComputation function and direct sub-functions, with a zoomed region of 200µs. Right: colors identifying which regions correspond to each function.
We can see that most of the time in `InitializeComputation` is spent in `SolveConstitutiveEquations` (over 99%!), and that the function is called multiple times (as shown in the 200µs section), interleaved with `SolveSachEquations`. Upon closer inspection of the source code for this function, we can see it performs a variable number of iterations, depending on a local error value being under a determined epsilon threshold. This appears to be the cause of the imbalance in the time step initialization region. The computation performed inside the loop appears to have similar duration across iterations.
As seen in the efficiency models presented, transfer balance is not an issue at small scales but (very) slowly degrades as we add more processes. In this section we evaluate how network latency and bandwidth might affect the execution of the application. The results obtained can be seen in Figure 15.
Speedup for different values of Global bandwidth and End-to-End Latency for 8, 32 and 128 processes. The scale is normalized so that 100% corresponds to an ideal, perfect network.
The application seems to be sensitive to bandwidth - there is a rapid decrease from 32 processes to 128 processes when bandwidth is less than 1Gbps. For today’s supercomputers, this is not an issue, as the global bandwidth exceeds 8Gbps - we expect the application to run very efficiently (over 98%) for 128 processes. As the number of processes is expected to increase, so is the amount of bandwidth required to maintain a constant efficiency ratio.
Regarding latency, there is little impact at this scale - we get over 99.5% speedup with latencies of 2µs or less, which is similar to what exists in today’s machines.
Both factors mostly impact the time spent doing MPI SendRecv, with little impact on time spent in AllReduce. As seen before in Figure 5, this means it won’t matter much, as the main problem is the time spent in AllReduce. There appears to be no endpoint contention upon analyzing SendRecv calls - although the communication pattern seems complex, we can see from Figure 16 that all processes communicate with a different process.
Speculation about how much speedup from using a certain number of processes is outside the scope of this report, but MareNostrum III’s network is expected to easily support thousands of processes with over 90% efficiency.
A perspective on the SendRecv regions’ communication pattern.
The first version of VPFFT++ had support for OpenMP, which was dropped in the current version in favor of MPI. However, VPFFT++ can benefit from both coexisting at the same time. The same methodology of splitting work between processes in MPI can definitely be applied to OpenMP/OmpSS. If algorithm correctness is not invalidated, one can split work in the Y and Z dimensions of the matrix as well, allowing more threads to be used for small matrix sizes (which cannot be accomplished with the current workload distribution granularity).
While not perfect, VPFFT++ appears to scale pretty well! One must note that if running the default number of iterations per time step (100), the time spent in its initialization will account for less than 2% of the total step execution time. Something to look at would be alternative ways of solving the constitutive equations, or somehow computing the error globally between all processes - this would cause them to be balanced at the expense of significantly increased communication during this phase. However, this suggestion does not account for the correctness of the algorithm - it might not be feasible at all!
The imbalance caused by TurboBoost is something external to the application that can be worked around. The simplest solution would be to disable it completely, causing all cores to run at the base frequency - this wouldn’t be the best solution, as 20% of the computational power is lost. The best solution would be to make use of all the cores - either by placing more processes in a single node, or by making use of the Shared Memory paradigm by implementing OpenMP or OmpSS. The former would be preferred for regions that are well balanced, while the latter might bring some extra advantages only to the time step initialization phases - this verification was outside the scope of this report.
Regarding the network, it appears to have a limited impact on the application execution time, and endpoint contention is not an issue. MareNostrum III appears to be able to accommodate several thousands of processes without either network bandwidth or latency becoming a significant issue.
The most important improvement would be to make use of all threads in a single machine, so implementation of OpenMP/OmpSS would be of maximum priority.