Especially bad: ignoring sample size for AI work that mainly focuses on discrimination measures https://t.co/PSRAMXixjB
This is a great article showing how data hungry machine learning methods are - kudos to the authors... I will definitely be using this to bat away projects!
#ML techniques are data hungry, and can show optimism and instability even at a whopping >200 EPV. https://t.co/PSRAMXixjB How can #ML work, especially with a focus on discrimination measures, be trusted without any sample size considerations? ht
@hilseth_mistrov @HenningWillers @theabzlab @RRS_RadRes @NorthwesternU And that is with Cox regression. Machine learning needs MANY more EPV. I just don't understand this type of work 🤷♂️ https://t.co/PSRAMXixjB
@Klonmich @pauladhiman @GSCollins @Richard_D_Riley @DrGSBullock @jamiecsergeant @CSMOxford @ndorms Agreed. But see van der Ploeg et al https://t.co/8huoKPfqwR where they suggest ML may need 200 or more events per var. It's prob safe to say that ML would require even more
Plus with the use of random forests, is this just optimistic AUC from too few EPV? https://t.co/PSRAMXixjB
@AndreEsteva @NEJMEvidence @arteraAI @felixfengmd Comes to mind https://t.co/PSRAMXixjB
@ElleLettMDPhD @f2harrell The author basically said to me that sample size doesn't matter at all because they got a small p-value, after I inquired where their sample size calcs were - my concern being that ML models are "data hungry" and optimistic and unstable even at >200 events per variable
@DrSpratticus @jryckman3 @pauladhiman @BenVanCalster @Richard_D_Riley @GSCollins Really, a "misplaced obsession" with sample size considerations for an AI prognostic model? I wonder if @pauladhiman @GSCollins @f2harrell would agree with that statement? ht
@DrSpratticus @arteraAI @NRGonc I don't see any sample size calcs. Hard to take any AI model seriously when this basic step is not done. Re: AUC comparison, I worry that with a split sample like was done here, this is inaccurate, with high optimism even with >200 events per variable
@lemmiwenks @hilseth_mistrov This is more excellent work on how "data hungry" ML techniques are - still unstable and optimistic even at >200 events per variable. https://t.co/PSRAMXj599
Logistic regression was stable at 20 to 50 events per variable, followed by CART, SVM, NN and RF models. Random forest, support vector machine and neural nets were unstable and over-optimistic even with >200 events per variable. https://t.co/6xtDvQ6rK
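A toy simulation sketch of that optimism point (my own illustration, not the paper's design or code; the predictor count, coefficient sizes and ~50% event rate are all assumptions):

```python
# Toy sketch only -- NOT the van der Ploeg et al. simulation design.
# Settings (10 predictors, ~50% event rate, coefficient sizes) are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_vars = 10
beta = rng.normal(0, 0.5, n_vars)        # assumed "true" coefficients

def simulate(n):
    X = rng.normal(size=(n, n_vars))
    p = 1 / (1 + np.exp(-X @ beta))      # logistic data-generating model
    return X, rng.binomial(1, p)

def mean_optimism(model, epv, reps=20, n_test=20_000):
    # ~50% event rate, so n_train = 2 * EPV * n_vars hits the target EPV
    n_train = 2 * epv * n_vars
    gaps = []
    for _ in range(reps):
        X_tr, y_tr = simulate(n_train)
        X_te, y_te = simulate(n_test)
        model.fit(X_tr, y_tr)
        apparent = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
        external = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        gaps.append(apparent - external)  # optimism = apparent minus external AUC
    return np.mean(gaps)

for epv in (20, 200):
    lr = mean_optimism(LogisticRegression(max_iter=1000), epv)
    rf = mean_optimism(RandomForestClassifier(n_estimators=200), epv)
    print(f"EPV {epv}: LR optimism {lr:.3f}, RF optimism {rf:.3f}")
```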
@JohnProwle How many strokes were in the data? You'd need perhaps 40,000 strokes for some ML algorithms (e.g., random forest) to predict reliably and to reliably measure importance of a single feature. https://t.co/nkZO9oLtha
RT @Richard_D_Riley: @f2harrell @vandy_biostat @VUDataScience Agree - I hear this wrongly promoted a lot too For better guidance and simul…
@f2harrell @vandy_biostat @VUDataScience Agree - I hear this wrongly promoted a lot too. For better guidance and simulation results see here: https://t.co/Sx6W5O2x8C https://t.co/tRKwScpWVL https://t.co/GeSo1Nr7tr https://t.co/zM8CyQbLLN
RT @f2harrell: @Richard_D_Riley Did someone have the gall to actually say that? They need to read https://t.co/nkZO9p34FK @VUDataScience
@Richard_D_Riley Did someone have the gall to actually say that? They need to read https://t.co/nkZO9p34FK @VUDataScience
@mmfbee @Inferente3 I think these 2 papers may be useful: https://t.co/J8V5FfHzpk https://t.co/KVoAE59ScZ
@AlexGaraiman My experience is that other ML methods have even larger instability issues, and so I strongly suspect require even larger sample sizes than for regression-based approaches - as demonstrated here https://t.co/Sx6W5O2x8C - so our criteria still apply as a minimum
@cd_fuller @PBlanchardMD @cancerphysicist NTCP/TCP was/is never great for anything and is largely based on flawed reasoning/methods/techniques. AI/ML/DL is extremely data hungry to the tune of >200 events per variable https://t.co/PSRAMXixjB
@Sun_Y_Lee @GSCollins I think you will need a very large sample size for this purpose ( >>> number needed for regression). This paper may be relevant https://t.co/Sx6W5O2x8C
@Richard_D_Riley @MaartenvSmeden Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints https://t.co/hkttjIe9qA
@matloff @Richard_D_Riley For biomedical stuff, ML hungry for >200 EPV and still unstable = needing massive amounts of patient data, obvious security/cybersecurity concerns there https://t.co/PSRAMXixjB
@seanjtaylor 1/3 Van der Ploeg et al: "Modern modelling techniques are data hungry:..." touches on ML models: https://t.co/FoJyjb01wp
@tpq__ @MehrbodEstaki The simplest way to deal with this would be mixed-effects models (say, an intercept per subject). For Random Forest, there is MERF (https://t.co/XWAbVXqFkD). If the number of subjects is small, though, I would argue against using machine learning
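A minimal sketch of the random-intercept idea using statsmodels' MixedLM (the file name and column names y, x, subject are hypothetical):

```python
# Minimal random-intercept sketch: one intercept per subject, as suggested above.
# The file name and column names (y, x, subject) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("repeated_measures.csv")    # one row per observation

model = smf.mixedlm("y ~ x", data=df, groups=df["subject"])
result = model.fit()
print(result.summary())                       # fixed effect of x plus between-subject variance
```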
RT @Richard_D_Riley: @999EMJamie Awesome visualisation Jamie. Important to note: (i) sample size for ML will need to be substantially bigge…
@999EMJamie Awesome visualisation Jamie. Important to note: (i) sample size for ML will need to be substantially bigger than when using logistic regression (https://t.co/Sx6W5O2x8C) & (ii) evidence of ML doing better than logistic regression is quite limited
RT @DJGould94: Paper of the week! https://t.co/zLQkRV8B4z What I love most: code is provided Additional File 1 is a great overview @EStey…
Paper of the week! https://t.co/zLQkRV8B4z What I love most: code is provided. Additional File 1 is a great overview @ESteyerberg #ML #PredictiveAnalytics #MachineLearning #AI #DataScience #researchpaper #scicomm #Statistics #research #MedTwitter #science
@skhanshadab87 @MaartenvSmeden @GSCollins @CarlMoons @Kym_Snell @BenVanCalster @TPA_Debray @VMTdeJong @joie_ensor @f2harrell @ESteyerberg Good question. Validation sample size will be the same as recommended in our papers such as https://t.co/xT6RZxi66z - assu
Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. Tjeerd van der Ploeg, Peter C Austin & Ewout W Steyerberg. BMC Medical Research Methodology 2014. https://t.co/al5QkSUZ9K
RT @f2harrell: @pablik007 @stephensenn See https://t.co/4XsHKxRC3e course notes. Regarding sample sizes for trees, machine learning, and s…
RT @Richard_D_Riley: Why is penalisation not a panacea for overcoming small sample sizes? https://t.co/1xAqVLkjNm https://t.co/tRKwScpW…
Why is penalisation not a panacea for overcoming small sample sizes? https://t.co/1xAqVLkjNm https://t.co/tRKwScpWVL Why is machine learning not a panacea for overcoming small sample sizes? https://t.co/Sx6W5O2x8C
@pablik007 @stephensenn See https://t.co/4XsHKxRC3e course notes. Regarding sample sizes for trees, machine learning, and statistical models see https://t.co/nkZO9p34FK #rmscourse
@eliaseythorsson @MaartenvSmeden @f2harrell @statsepi https://t.co/grSkYgXhtH I think this came close. Comes with code.
Nice article on sample size determination for machine learning classifiers: https://t.co/D4eOuOy6w9
@ErickRScott @AndrewLBeam @kdpsinghlab @ivivek87 @MaartenvSmeden @ADAlthousePhD @IAmSamFin @signormirko well exactly...terribly prone to overfit with hugely optimistic model performance, and require substantial sample sizes as demonstrated in this study ht
RT @f2harrell: @Richard_D_Riley You can even go further than your reaction: ordinary statistical models may have resulted in an answer just…
@Richard_D_Riley You can even go further than your reaction: ordinary statistical models may have resulted in an answer just as good as #machinelearning using only 1/10th as many patients: https://t.co/nkZO9p34FK
RT @GSCollins: @DrVeronikaCH Effective sample size is the number of CVD events (which was 4801) - barely meeting the now defunct 10 events…
@DrVeronikaCH Effective sample size is the number of CVD events (which was 4801) - barely meeting the now defunct 10 events per variable. ML is very data hungry, particularly if you throw everything at it (https://t.co/NV4O4sSKx8).
RT @GSCollins: @EricTopol Using random forests with 52 cases, 40526 candidate features of which 18 were selected. Events-per-variable is 0.…
@EricTopol Using random forests with 52 cases, 40526 candidate features of which 18 were selected. Events-per-variable is 0.001. RFs known to be data hungry (https://t.co/32tsIef1zq, with EPV>>200 needed for stability). #overfitting https://t.co/Xzw
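A quick back-of-envelope check of the EPV quoted above (a worked calculation, not part of the original tweet):

```python
# Events-per-variable (EPV) using the numbers quoted above:
# 52 cases (events), 40526 candidate features, 18 ultimately selected.
events = 52
candidate_features = 40526
selected_features = 18

print(events / candidate_features)  # ~0.0013 -> the "0.001" quoted above
print(events / selected_features)   # ~2.9, still far below even a 10-20 EPV rule of thumb
```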
@ADAlthousePhD @cecilejanssens @ljbuturovic @paulpharoah @f2harrell @ESteyerberg @BenVanCalster @laure_wynants @statsepi Van der Ploeg and others (including @ESteyerberg) looked at how data hungry ML is (https://t.co/xfiBRCGupC)
https://t.co/R45NpP6tDc 👏🏽👏🏽👏🏽
RT @f2harrell: @danelliottster Look at texts covering those methods and see their algorithms. Especially for random forest you'll see no f…
@danelliottster Look at texts covering those methods and see their algorithms. Especially for random forest you'll see no favoritism of additive (main) effects, so that high-order interactions are given big chances to make it into the predictive algorithm
@EikoFried the problem is not to understand machine learning but to understand forecasting. If you want something skeptical about ML for medicine/psych read https://t.co/NcESPuVfEF
@oziadias @f2harrell @EricTopol @JAMANetworkOpen (1/2) A rough rule of thumb to minimise overfitting is to have an events-per-variable of at least 10 to 20 (for regression-based methods). Simulation studies have shown machine learning methods are very data hungry
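A crude sketch of turning that rule of thumb into a minimum sample size (illustrative only; the predictor count and outcome prevalence below are hypothetical, and the formal Riley et al. sample size guidance linked elsewhere in this thread is preferable):

```python
# Crude minimum-sample-size sketch from the 10-20 EPV rule of thumb quoted above.
# The predictor count and outcome prevalence are hypothetical examples.
def min_n_from_epv(n_predictors: int, event_rate: float, epv: float = 10) -> float:
    """Events needed = EPV * predictors; divide by event rate for total participants."""
    return epv * n_predictors / event_rate

print(min_n_from_epv(n_predictors=20, event_rate=0.10, epv=10))   # 2000.0 (regression rule of thumb)
print(min_n_from_epv(n_predictors=20, event_rate=0.10, epv=200))  # 40000.0 (the >200 EPV cited for ML)
```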
*Critical point*: many popular alternative approaches to LR such as Random Forests and SVM are possibly even more sensitive to overfitting. See also: https://t.co/lYa9Ly81WZ
Often I'm asked about this and I usually find it hard to come up with an answer, so here we go: SVM, NN and RF may need over 10 times as many events per variable as classical modelling techniques such as LR to achieve a stable AUC and small optimism. h
Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. https://t.co/Rxiy45yWU0 #bmcmedresmethodol
RT @Richard_D_Riley: Developing a prediction model? Machine learning needs “over 10 times as many events" compared to logistic regression h…
RT @DrHughHarvey: Proof that machine learning in medical imaging needs tonnes of data!!! https://t.co/ebOKoWwiYV
Before using advanced #DataAnalytics for #healthcare let's be practical; is the #health #data actuarially credible? https://t.co/1OQKv7taDs