Especially bad: ignoring sample size for AI work that mainly focuses on discrimination measures https://t.co/PSRAMXixjB
This is a great article showing how data hungry machine learning methods are - kudos to the authors... I will definitely be using this to bat away projects!
#ML techniques are data hungry, and can show optimism and instability even at a whopping >200 EPV. https://t.co/PSRAMXixjB How can #ML work, especially with a focus on discrimination measures, be trusted without any sample size considerations? ht
@hilseth_mistrov @HenningWillers @theabzlab @RRS_RadRes @NorthwesternU And that is with Cox regression. Machine learning needs MANY more EPV. I just don't understand this type of work 🤷♂️ https://t.co/PSRAMXixjB
@Klonmich @pauladhiman @GSCollins @Richard_D_Riley @DrGSBullock @jamiecsergeant @CSMOxford @ndorms Agreed. But see van der Ploeg et al https://t.co/8huoKPfqwR where they suggest ML may need 200 or more events per var. It's prob safe to say that ML would require even more
Plus with the use of random forests, is this just optimistic AUC from too few EPV? https://t.co/PSRAMXixjB
@AndreEsteva @NEJMEvidence @arteraAI @felixfengmd Comes to mind https://t.co/PSRAMXixjB
@ElleLettMDPhD @f2harrell The author basically said to me that sample size doesn't matter at all because they got a small p-value, after I inquired where their sample size calcs were - my concern being that ML models are "data hungry" and optimistic and unstable even at >200 events per variable
@DrSpratticus @jryckman3 @pauladhiman @BenVanCalster @Richard_D_Riley @GSCollins Really, a "misplaced obsession" with sample size considerations for an AI prognostic model? I wonder if @pauladhiman @GSCollins @f2harrell would agree with that statement? ht
@DrSpratticus @arteraAI @NRGonc I don't see any sample size calcs. Hard to take any AI model seriously when this basic step is not done. Re: AUC comparison, I worry that with a split sample like was done here, this is inaccurate, with high optimism even with >200 events per variable
@lemmiwenks @hilseth_mistrov This is more excellent work on how "data hungry" ML techniques are - still unstable and optimistic even at >200 events per variable. https://t.co/PSRAMXj599
Logistic regression was stable at 20 to 50 events per variable, followed by CART, SVM, NN and RF models. Random forest, support vector machine and neural nets were unstable and over-optimistic even with >200 events per variable. https://t.co/6xtDvQ6rK
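A toy simulation sketch of that optimism point (my own illustration, not the paper's design or code; the predictor count, coefficient sizes and ~50% event rate are all assumptions):

```python
# Toy sketch only -- NOT the van der Ploeg et al. simulation design.
# Settings (10 predictors, ~50% event rate, coefficient sizes) are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_vars = 10
beta = rng.normal(0, 0.5, n_vars)        # assumed "true" coefficients

def simulate(n):
    X = rng.normal(size=(n, n_vars))
    p = 1 / (1 + np.exp(-X @ beta))      # logistic data-generating model
    return X, rng.binomial(1, p)

def mean_optimism(model, epv, reps=20, n_test=20_000):
    # ~50% event rate, so n_train = 2 * EPV * n_vars hits the target EPV
    n_train = 2 * epv * n_vars
    gaps = []
    for _ in range(reps):
        X_tr, y_tr = simulate(n_train)
        X_te, y_te = simulate(n_test)
        model.fit(X_tr, y_tr)
        apparent = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
        external = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        gaps.append(apparent - external)  # optimism = apparent minus external AUC
    return np.mean(gaps)

for epv in (20, 200):
    lr = mean_optimism(LogisticRegression(max_iter=1000), epv)
    rf = mean_optimism(RandomForestClassifier(n_estimators=200), epv)
    print(f"EPV {epv}: LR optimism {lr:.3f}, RF optimism {rf:.3f}")
```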
@JohnProwle How many strokes were in the data? You'd need perhaps 40,000 strokes for some ML algorithms (e.g., random forest) to predict reliably and to reliably measure importance of a single feature. https://t.co/nkZO9oLtha
RT @Richard_D_Riley: @f2harrell @vandy_biostat @VUDataScience Agree - I hear this wrongly promoted a lot too For better guidance and simul…
@f2harrell @vandy_biostat @VUDataScience Agree - I hear this wrongly promoted a lot too. For better guidance and simulation results see here: https://t.co/Sx6W5O2x8C https://t.co/tRKwScpWVL https://t.co/GeSo1Nr7tr https://t.co/zM8CyQbLLN
RT @f2harrell: @Richard_D_Riley Did someone have the gall to actually say that? They need to read https://t.co/nkZO9p34FK @VUDataScience
@Richard_D_Riley Did someone have the gall to actually say that? They need to read https://t.co/nkZO9p34FK @VUDataScience
@mmfbee @Inferente3 I think these 2 papers may be useful: https://t.co/J8V5FfHzpk https://t.co/KVoAE59ScZ
@AlexGaraiman My experience is that other ML methods have even larger instability issues, and so I strongly suspect require even larger sample sizes than for regression-based approaches - as demonstrated here https://t.co/Sx6W5O2x8C - so our criteria still apply as a minimum
@cd_fuller @PBlanchardMD @cancerphysicist NTCP/TCP was/is never great for anything and is largely based on flawed reasoning/methods/techniques. AI/ML/DL is extremely data hungry to the tune of >200 events per variable https://t.co/PSRAMXixjB
@Sun_Y_Lee @GSCollins I think you will need a very large sample size for this purpose ( >>> number needed for regression). This paper may be relevant https://t.co/Sx6W5O2x8C
@Richard_D_Riley @MaartenvSmeden Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints https://t.co/hkttjIe9qA
@matloff @Richard_D_Riley For biomedical stuff, ML hungry for >200 EPV and still unstable = needing massive amounts of patient data, obvious security/cybersecurity concerns there https://t.co/PSRAMXixjB
@seanjtaylor 1/3 Van der Ploeg et al: "Modern modelling techniques are data hungry:..." touches on ML models: https://t.co/FoJyjb01wp
@tpq__ @MehrbodEstaki The simplest way to deal with this would be mixed-effects models (say, an intercept per subject). For Random Forest, there is MERF (https://t.co/XWAbVXqFkD). If the number of subjects is small, though, I would argue against using machine learning
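A minimal sketch of the random-intercept idea using statsmodels' MixedLM (the file name and column names y, x, subject are hypothetical):

```python
# Minimal random-intercept sketch: one intercept per subject, as suggested above.
# The file name and column names (y, x, subject) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("repeated_measures.csv")    # one row per observation

model = smf.mixedlm("y ~ x", data=df, groups=df["subject"])
result = model.fit()
print(result.summary())                       # fixed effect of x plus between-subject variance
```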
RT @Richard_D_Riley: @999EMJamie Awesome visualisation Jamie. Important to note: (i) sample size for ML will need to be substantially bigge…
@999EMJamie Awesome visualisation Jamie. Important to note: (i) sample size for ML will need to be substantially bigger than when using logistic regression (https://t.co/Sx6W5O2x8C) & (ii) evidence of ML doing better than logistic regression is quite limited
RT @DJGould94: Paper of the week! https://t.co/zLQkRV8B4z What I love most: code is provided Additional File 1 is a great overview @EStey…
Paper of the week! https://t.co/zLQkRV8B4z What I love most: code is provided. Additional File 1 is a great overview @ESteyerberg #ML #PredictiveAnalytics #MachineLearning #AI #DataScience #researchpaper #scicomm #Statistics #research #MedTwitter #science
@skhanshadab87 @MaartenvSmeden @GSCollins @CarlMoons @Kym_Snell @BenVanCalster @TPA_Debray @VMTdeJong @joie_ensor @f2harrell @ESteyerberg Good question. Validation sample size will be the same as recommended in our papers such as https://t.co/xT6RZxi66z - assu
Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. Tjeerd van der Ploeg, Peter C Austin & Ewout W Steyerberg. BMC Medical Research Methodology 2014. https://t.co/al5QkSUZ9K
RT @f2harrell: @pablik007 @stephensenn See https://t.co/4XsHKxRC3e course notes. Regarding sample sizes for trees, machine learning, and s…
RT @Richard_D_Riley: Why is penalisation not a panacea for overcoming small sample sizes? https://t.co/1xAqVLkjNm https://t.co/tRKwScpW…
Why is penalisation not a panacea for overcoming small sample sizes? https://t.co/1xAqVLkjNm https://t.co/tRKwScpWVL Why is machine learning not a panacea for overcoming small sample sizes? https://t.co/Sx6W5O2x8C
@pablik007 @stephensenn See https://t.co/4XsHKxRC3e course notes. Regarding sample sizes for trees, machine learning, and statistical models see https://t.co/nkZO9p34FK #rmscourse
@eliaseythorsson @MaartenvSmeden @f2harrell @statsepi https://t.co/grSkYgXhtH I think this came close. Comes with code.
Nice article on sample size determination for machine learning classifiers: https://t.co/D4eOuOy6w9
@ErickRScott @AndrewLBeam @kdpsinghlab @ivivek87 @MaartenvSmeden @ADAlthousePhD @IAmSamFin @signormirko well exactly...terribly prone to overfit with hugely optimistic model performance, and require substantial sample sizes as demonstrated in this study ht
RT @f2harrell: @Richard_D_Riley You can even go further than your reaction: ordinary statistical models may have resulted in an answer just…
@Richard_D_Riley You can even go further than your reaction: ordinary statistical models may have resulted in an answer just as good as #machinelearning using only 1/10th as many patients: https://t.co/nkZO9p34FK
RT @GSCollins: @DrVeronikaCH Effective sample size is the number of CVD events (which was 4801) - barely meeting the now defunct 10 events…
@DrVeronikaCH Effective sample size is the number of CVD events (which was 4801) - barely meeting the now defunct 10 events per variable. ML is very data hungry, particularly if you throw everything at it (https://t.co/NV4O4sSKx8).
RT @GSCollins: @EricTopol Using random forests with 52 cases, 40526 candidate features of which 18 were selected. Events-per-variable is 0.…
@EricTopol Using random forests with 52 cases, 40526 candidate features of which 18 were selected. Events-per-variable is 0.001. RFs known to be data hungry (https://t.co/32tsIef1zq, with EPV>>200 needed for stability). #overfitting https://t.co/Xzw
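A quick back-of-envelope check of the EPV quoted above (a worked calculation, not part of the original tweet):

```python
# Events-per-variable (EPV) using the numbers quoted above:
# 52 cases (events), 40526 candidate features, 18 ultimately selected.
events = 52
candidate_features = 40526
selected_features = 18

print(events / candidate_features)  # ~0.0013 -> the "0.001" quoted above
print(events / selected_features)   # ~2.9, still far below even a 10-20 EPV rule of thumb
```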
@ADAlthousePhD @cecilejanssens @ljbuturovic @paulpharoah @f2harrell @ESteyerberg @BenVanCalster @laure_wynants @statsepi Van der Ploeg and others (including @ESteyerberg) looked at how data hungry ML is (https://t.co/xfiBRCGupC)
https://t.co/R45NpP6tDc 👏🏽👏🏽👏🏽
RT @f2harrell: @danelliottster Look at texts covering those methods and see their algorithms. Especially for random forest you'll see no f…
@danelliottster Look at texts covering those methods and see their algorithms. Especially for random forest you'll see no favoritism of additive (main) effects, so that high-order interactions are given big chances to make it into the predictive algorithm
@EikoFried the problem is not to understand machine learning but to understand forecasting. If you want something skeptical about ML for medicine/psych read https://t.co/NcESPuVfEF
@oziadias @f2harrell @EricTopol @JAMANetworkOpen (1/2) A rough rule of thumb to minimise overfitting is to have an events-per-variable of at least 10 to 20 (for regression-based methods). Simulation studies have shown machine learning methods are very data hungry
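A crude sketch of turning that rule of thumb into a minimum sample size (illustrative only; the predictor count and outcome prevalence below are hypothetical, and the formal Riley et al. sample size guidance linked elsewhere in this thread is preferable):

```python
# Crude minimum-sample-size sketch from the 10-20 EPV rule of thumb quoted above.
# The predictor count and outcome prevalence are hypothetical examples.
def min_n_from_epv(n_predictors: int, event_rate: float, epv: float = 10) -> float:
    """Events needed = EPV * predictors; divide by event rate for total participants."""
    return epv * n_predictors / event_rate

print(min_n_from_epv(n_predictors=20, event_rate=0.10, epv=10))   # 2000.0 (regression rule of thumb)
print(min_n_from_epv(n_predictors=20, event_rate=0.10, epv=200))  # 40000.0 (the >200 EPV cited for ML)
```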
*Critical point*: many popular alternative approaches to LR such as Random Forests and SVM are possibly even more sensitive to overfitting. See also: https://t.co/lYa9Ly81WZ
Often I'm asked about this and I usually find it hard to come up with an answer, so here we go: SVM, NN and RF may need over 10 times as many events per variable as classical modelling techniques such as LR to achieve a stable AUC and small optimism. h
Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. https://t.co/Rxiy45yWU0 #bmcmedresmethodol
RT @Richard_D_Riley: Developing a prediction model? Machine learning needs “over 10 times as many events" compared to logistic regression h…
RT @DrHughHarvey: Proof that machine learning in medical imaging needs tonnes of data!!! https://t.co/ebOKoWwiYV
Before using advanced #DataAnalytics for #healthcare let's be practical; is the #health #data actuarially credible? https://t.co/1OQKv7taDs