Skip to content

Tls resume PoC (server stateless) #57079

Merged
wfurt merged 15 commits into
dotnet:mainfrom
wfurt:tlsResumeLinux
Sep 3, 2021
Merged

Tls resume PoC (server stateless) #57079
wfurt merged 15 commits into
dotnet:mainfrom
wfurt:tlsResumeLinux

Conversation

@wfurt

@wfurt wfurt commented Aug 9, 2021

Copy link
Copy Markdown
Member

contributes to #22977
fixes #49845

This still needs some more testing and cleanup but it would be great to get some early feedback @bartonjs and @stephentoub

With OpenSSL the general strategy is to create SSL_CTX and set some parameters. Then that is used to create individual SSL sessions. That does not really fit into .NET API surface where one create SslStream and then AuthenticateAs* is called with varying parameters. For that reason (most likely) we always create new SSL_CTX for each stream. Aside from doing more work, this prevents TLS resume to work as the session cache and whatever is needed is linked to SSL_CTX.

This PR is trying to reuse SSL_CTX in some cases when it seems to be safe to do so.
Here is impact on connectionclosehttps benchmark from ASP.NET on Linux.
RPS jumps ~ 21x.

crank compare main.json my-release.json

| load                   | main   | my-release |            |
| ---------------------- | ------ | ---------- | ---------- |
| CPU Usage (%)          |     99 |         96 |     -3.03% |
| Cores usage (%)        |  1,183 |      1,149 |     -2.87% |
| Working Set (MB)       |     84 |         49 |    -41.67% |
| Private Memory (MB)    |    435 |        357 |    -17.93% |
| Start Time (ms)        |      0 |          0 |            |
| First Request (ms)     |    246 |        273 |    +10.98% |
| Requests/sec           |    774 |     16,824 | +2,073.92% |
| Requests               | 11,685 |    253,760 | +2,071.67% |
| Mean latency (ms)      |   4.06 |       0.74 |    -81.82% |
| Max latency (ms)       |  31.46 |      36.03 |    +14.53% |
| Bad responses          |      0 |          0 |            |
| Socket errors          |      0 |          0 |            |
| Read throughput (MB/s) |   0.11 |       2.42 | +2,071.47% |
| Latency 50th (ms)      |   2.19 |       0.43 |    -80.27% |
| Latency 75th (ms)      |   6.79 |       0.57 |    -91.61% |
| Latency 90th (ms)      |  10.65 |       0.86 |    -91.92% |
| Latency 99th (ms)      |  16.59 |       9.32 |    -43.82% |

Windows runs always timed out for me. With #49845 claiming 15K RPS we should be on par with 16K+ unless there was another jump with 6.0. (I'll try to follow up on this with @sebastienros

At this point .NET client still does not work and for that reason performance microbenchmarks are not available.
The reason what ASP.NET shows improvement is because it is using different tool. (wrk)

Since it is getting late I'm wondering if we should add AptContext switch and ENV variable to be able to opt-out in case session reuse create problem for somebody. (not done in this PR)

TL;DR

This is somewhat arbitrary split but I decided to loosely follow Credentials lookup and use certificate, enabled TLS protocols and EncryptionPolicy as unique combination.

Since EncryptionPolicy.NoEncryption and EncryptionPolicy.AllowNoEncryption are somewhat lame and obscure I decided to avoid them for purpose of resume and I use protocols as key to cache attached to SslCertificateContext e.g. unique server certificate. When SSL_CTX exist for given combination we would use it instead of creating new one when creating SafeSslHandle.

With that, many operation we did on SSL_CTX needs to move to SSL itself. Luckily OpenSSL have basically everyone in pars e.g. each operation can be done on either one. There is lot of turn in PAL code but that should mostly be 1:1 change. I also try to use consistently naming CryptoNative_SslCtx* for operations working on SSL_CTX to make it more obvious.

The notable exception is SSL_CTX_set_alpn_select_cb that can be set only on SSL_CTX and there is no matching SSL_set_alpn_select_cb. Since the logic is common I use SslSetData and SslGetData to attache the specific ALPN list to SSL itself and I grab it again in AlpnServerSelectCallback.
That seems to work quite well.
The resining problem is that once the callback is sett it MUST choose ALPN and it cannot wet run NULL.
That creates problem for cases when ALPN is not requested. I don't know how common that is but for now I exclude that from cache lookup. That can be easily fixed if needed either be factoring ALPN to cache lookup or simply by creating parallel cache. (would cost extra dictionary per server/SslCertificateContext)
maybe @Tratcher has some idea how important that is.

I was also wondering if we need any other locking or grabbing extra reference on the SafeSslContextHandle.
Since we are new freeing it immediately it feels like we don't as life cycle will be linked to the SslCertificateContext.
That is not IDisposable so it will live as long as there is reference to it.

@wfurt wfurt added area-System.Net.Security os-linux Linux OS (any supported distro) labels Aug 9, 2021
@wfurt wfurt requested review from bartonjs and stephentoub August 9, 2021 19:52
@wfurt wfurt self-assigned this Aug 9, 2021
@ghost

ghost commented Aug 9, 2021

Copy link
Copy Markdown

Tagging subscribers to this area: @dotnet/ncl, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

contributes to #22977
fixes #49845

This still needs some more testing and cleanup but it would be great to get some early feedback @bartonjs and @stephentoub

With OpenSSL the general strategy is to create SSL_CTX and set some parameters. Then that is used to create individual SSL sessions. That does not really fit into .NET API surface where one create SslStream and then AuthenticateAs* is called with varying parameters. For that reason (most likely) we always create new SSL_CTX for each stream. Aside from doing more work, this prevents TLS resume to work as the session cache and whatever is needed is linked to SSL_CTX.

This PR is trying to reuse SSL_CTX in some cases when it seems to be safe to do so.
Here is impact on connectionclosehttps benchmark from ASP.NET on Linux.
RPS jumps ~ 21x.

crank compare main.json my-release.json

| load                   | main   | my-release |            |
| ---------------------- | ------ | ---------- | ---------- |
| CPU Usage (%)          |     99 |         96 |     -3.03% |
| Cores usage (%)        |  1,183 |      1,149 |     -2.87% |
| Working Set (MB)       |     84 |         49 |    -41.67% |
| Private Memory (MB)    |    435 |        357 |    -17.93% |
| Start Time (ms)        |      0 |          0 |            |
| First Request (ms)     |    246 |        273 |    +10.98% |
| Requests/sec           |    774 |     16,824 | +2,073.92% |
| Requests               | 11,685 |    253,760 | +2,071.67% |
| Mean latency (ms)      |   4.06 |       0.74 |    -81.82% |
| Max latency (ms)       |  31.46 |      36.03 |    +14.53% |
| Bad responses          |      0 |          0 |            |
| Socket errors          |      0 |          0 |            |
| Read throughput (MB/s) |   0.11 |       2.42 | +2,071.47% |
| Latency 50th (ms)      |   2.19 |       0.43 |    -80.27% |
| Latency 75th (ms)      |   6.79 |       0.57 |    -91.61% |
| Latency 90th (ms)      |  10.65 |       0.86 |    -91.92% |
| Latency 99th (ms)      |  16.59 |       9.32 |    -43.82% |

Windows runs always timed out for me. With #49845 claiming 15K RPS we should be on par with 16K+ unless there was another jump with 6.0. (I'll try to follow up on this with @sebastienros

At this point .NET client still does not work and for that reason performance microbenchmarks are not available.
The reason what ASP.NET shows improvement is because it is using different tool. (wrk)

Since it is getting late I'm wondering if we should add AptContext switch and ENV variable to be able to opt-out in case session reuse create problem for somebody. (not done in this PR)

TL;DR

This is somewhat arbitrary split but I decided to loosely follow Credentials lookup and use certificate, enabled TLS protocols and EncryptionPolicy as unique combination.

Since EncryptionPolicy.NoEncryption and EncryptionPolicy.AllowNoEncryption are somewhat lame and obscure I decided to avoid them for purpose of resume and I use protocols as key to cache attached to SslCertificateContext e.g. unique server certificate. When SSL_CTX exist for given combination we would use it instead of creating new one when creating SafeSslHandle.

With that, many operation we did on SSL_CTX needs to move to SSL itself. Luckily OpenSSL have basically everyone in pars e.g. each operation can be done on either one. There is lot of turn in PAL code but that should mostly be 1:1 change. I also try to use consistently naming CryptoNative_SslCtx* for operations working on SSL_CTX to make it more obvious.

The notable exception is SSL_CTX_set_alpn_select_cb that can be set only on SSL_CTX and there is no matching SSL_set_alpn_select_cb. Since the logic is common I use SslSetData and SslGetData to attache the specific ALPN list to SSL itself and I grab it again in AlpnServerSelectCallback.
That seems to work quite well.
The resining problem is that once the callback is sett it MUST choose ALPN and it cannot wet run NULL.
That creates problem for cases when ALPN is not requested. I don't know how common that is but for now I exclude that from cache lookup. That can be easily fixed if needed either be factoring ALPN to cache lookup or simply by creating parallel cache. (would cost extra dictionary per server/SslCertificateContext)
maybe @Tratcher has some idea how important that is.

I was also wondering if we need any other locking or grabbing extra reference on the SafeSslContextHandle.
Since we are new freeing it immediately it feels like we don't as life cycle will be linked to the SslCertificateContext.
That is not IDisposable so it will live as long as there is reference to it.

Author: wfurt
Assignees: wfurt
Labels:

area-System.Net.Security, os-linux

Milestone: -

@wfurt wfurt closed this Aug 10, 2021
@wfurt wfurt reopened this Aug 10, 2021
@wfurt

wfurt commented Aug 10, 2021

Copy link
Copy Markdown
Member Author

/azp run runtime

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@wfurt wfurt marked this pull request as ready for review August 16, 2021 04:32
@karelz

karelz commented Aug 17, 2021

Copy link
Copy Markdown
Member

Browser runs are hitting #57501, unrelated to this PR.
runtime-staging failures are also unrelated to the PR.

@karelz

karelz commented Aug 17, 2021

Copy link
Copy Markdown
Member

@stephentoub if we get your approval for code-review, we can merge (CI is OK).

@karelz karelz added this to the 7.0.0 milestone Aug 17, 2021
@wfurt

wfurt commented Aug 27, 2021

Copy link
Copy Markdown
Member Author

/azp run runtime-libraries stress-ssl

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@wfurt

wfurt commented Aug 27, 2021

Copy link
Copy Markdown
Member Author

This should be ready for any final pass @stephentoub or anybody from @dotnet/ncl
CI is clean.
I addressed all @bartonjs feedback - either by changing code or providing explanation.

  • this batch has some bigger naming cleanup that makes the diff bigger but hopefully easier to understand the code
  • I re-factored some of the CipherPolicy/protocols so separate helper function (keeping actual logic same)
  • filed ALPN processing on Linux can be improved #58299 to track further ALPN improvements

I was planning to run another round of perf tests just to be sure but I have some difficulties - most likely because main moved to 7.0.

}

int32_t CryptoNative_SetCiphers(SSL* ssl, const char* cipherList, const char* cipherSuites)
{

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth asserting (cipherList != NULL) ^ (cipherSuites != NULL) ?

@stephentoub stephentoub left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@halter73

halter73 commented Sep 3, 2021

Copy link
Copy Markdown
Member

That creates problem for cases when ALPN is not requested.

Does this only create a problem when SslServerAuthenticationOptions.ApplicationProtocols is unset? Or is it a problem if the client does not request ALPN? I take it from wrk showing an improvement that ALPN only needs to be requested by the server. If so, that's not problem for Kestrel since we always set this list even if only for "http/1.1".

@wfurt

wfurt commented Sep 3, 2021

Copy link
Copy Markdown
Member Author

The problem I bump into as with SslServerAuthenticationOptions.ApplicationProtocols unset @halter73. The current behavior is that we don't set the callback and OpenSSL basically ignores ALPN and you get null on both sides. When the callback is set, OpenSSL expects that the callback picks something or the handshake will fail. I did not find a way how to not choose and preserve old behavior.

@wfurt wfurt merged commit 6262ae8 into dotnet:main Sep 3, 2021
@wfurt wfurt deleted the tlsResumeLinux branch September 3, 2021 22:03
@stephentoub

Copy link
Copy Markdown
Member

@wfurt, this closed #49845. Don't we still need to track the client side as well?

@wfurt

wfurt commented Sep 3, 2021

Copy link
Copy Markdown
Member Author

#49845 was specifically about Kestrel regression and this change should be sufficient IMHO as the test does not use .NET client. I left #22977 open to track the remaining work e.g. client support.

@sebastienros

Copy link
Copy Markdown
Member

Is it getting backported to 6.0 ?

@karelz

karelz commented Sep 7, 2021

Copy link
Copy Markdown
Member

@stephentoub you originally wanted to consider it for 6.0 backport. Do we have enough wins here (e.g. in TechEmpower benchmark) to justify it?

@stephentoub

Copy link
Copy Markdown
Member

While it'd be great to see this in 6.0, at this point I'm skeptical there's enough runway to properly validate the change with low enough risk. I think we should probably just get it in for 7.0. @danmoseley, @geoffkizer, dissent?

@karelz

karelz commented Sep 7, 2021

Copy link
Copy Markdown
Member

@davidfowl might want to chime in here as well ;)
BTW: @geoffkizer is OOF this week

@davidfowl

Copy link
Copy Markdown
Member

Maybe good to follow up with the customers that pushed for this change.

@karelz

karelz commented Sep 8, 2021

Copy link
Copy Markdown
Member

@davidfowl I can't find any customer pushing for the change.

@DavidKarlas

DavidKarlas commented Sep 12, 2021

Copy link
Copy Markdown
Contributor

Hi @wfurt as part of integration tests for https://github.com/dotnet/templating we have arcade bot bumping runtime for us so our main keeps running on main of runtime to catch issues asap, it seems like this PR started causing our CI to fail(no worries its just in PR that bumps runtime so no urgency).

Error is: Cannot get required symbol SSL_set_alpn_protos from libssl
PR in question: dotnet/templating#3865

Is this runtime bug or our CI needs to update libssl or something else?

@wfurt

wfurt commented Sep 12, 2021

Copy link
Copy Markdown
Member Author

It seems like you use Ubuntu 14.04 with OpenSSL 1.0.1? They are long out of support IMHO. (even Ubuntu 16 is getting pretty old) Is there chance you can bump OpenSSL to at least 1.0.2 or update the base container @DavidKarlas ?

@ghost ghost locked as resolved and limited conversation to collaborators Oct 12, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-System.Net.Security os-linux Linux OS (any supported distro)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HTTPS handshake performance

9 participants