B-MODE ULTRASONOGRAPHY IS A RELIABLE AND VALID ALTERNATIVE TO MAGNETIC RESONANCE IMAGING FOR MEASURING PATELLAR TENDON CROSS-SECTIONAL AREA

This study investigated the validity and reliability of measuring patellar tendon (PT) cross-sectional area (CSA) using magnetic resonance imaging (MRI) and ultrasound (US) imaging. Nineteen healthy participants (10 women, 9 men) participated in three imaging sessions of the PT, once via MRI and twice via US, with image acquisition conducted by two raters, one experienced (rater 2) and one inexperienced (rater 1). All PT segmentations were analyzed by both raters. The validity of US-derived estimates of PT CSA against MRI estimates was analyzed using linear regression. Within-day reliability of US and MRI measurements and between-day reliability of US measurements were quantified using typical error (TE) and intra-class correlation coefficients (ICC(3,1)). There was good agreement between US- and MRI-derived estimations of PT CSA (standard errors of the estimate of 3.3 mm² for rater 1 and 2.6 mm² for rater 2; Pearson's r = 0.97 and 0.98 for raters 1 and 2, respectively). Within-session reliability for estimations of total PT CSA from US and MRI was excellent (ICC(3,1) > 0.95, coefficient of variation [CV] < 4.1%, TE = 1.3–3.6 mm²). Between-day reliability for US was excellent (ICC(3,1) > 0.97, CV < 2.7%, TE = 1.6–2.3 mm²), with little difference between raters. These findings suggest that MRI and US both provide reliable estimates of PT CSA and that US can provide a valid measure of PT CSA. © 2022 The Author(s). Published by Elsevier Inc. on behalf of World Federation for Ultrasound in Medicine & Biology. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


INTRODUCTION
The human patellar tendon (PT) plays a crucial role in locomotion by transmitting force from the quadriceps muscle group to the tibia, via the patella. Tendon is a viscoelastic tissue and will deform under loading, with the degree of deformation depending on the structural properties of the tissue (Maganaris and Paul 1999). These structural properties, such as tendon stiffness and Young's modulus (YM), determine the compliance of the tendon, which in turn can affect the behavior of the muscle-tendon unit during locomotion (Fukunaga et al. 2002). To calculate tendon stiffness and YM, the cross-sectional area (CSA) of the PT needs to be accurately measured. Moreover, measuring PT CSA can reveal adaptations of the PT in response to mechanical loading (Kongsgaard et al. 2007; Couppe et al. 2008) or immobilization (Maganaris et al. 2006). Therefore, accurate and reliable measurements of PT CSA must be obtained to quantify the associated properties and adaptations of the PT.
Methods of assessing tendon morphology in vivo include magnetic resonance imaging (MRI) and 2-D B-mode ultrasound imaging (US). Several studies have validated the accuracy of MRI in measuring tendon properties (Berthoty et al. 1989; Sonin et al. 1996; Carrino et al. 1997), and it is considered the "gold standard" tool in validating other measurement techniques (Bohm et al. 2016; Kruse et al. 2017). However, previous research suggested that US outperformed MRI with respect to the reliability of measuring tendon morphology. Brushøj et al. (2006) found that US, when compared with MRI, had smaller within- and between-rater limits of agreement for Achilles tendon (AT) thickness. Moreover, the same study reported that US measures of AT thickness, CSA and width resulted in lower within-rater coefficients of variation, when compared with MRI. The use of US is recommended as a first-line imaging modality according to the most recent clinical indications of the European Society of Musculoskeletal Radiology (Klauser et al. 2012). In addition, US is an attractive alternative to assess tendon properties because of its affordability, time efficiency, portability and non-invasive nature. Despite the widespread use of US in musculoskeletal research, the reliability of US tendon measures is debated within the literature (Gellhorn and Carlson 2013; McAuliffe et al. 2017). For example, US measures of PT CSA have been reported to be reliable when measured on multiple days (Reeves and Narici 2003) and when measured by multiple operators with different levels of experience using multiple machines (Gellhorn and Carlson 2013). In contrast, more recent studies have found US to be unreliable when measuring PT and AT CSA (Ekizos et al. 2013; Bohm et al. 2016), which was attributable, in part, to poor definition of tendon borders. With respect to the relationship between US and MRI, conflicting results have been reported in the literature. Albano et al. (2017) reported excellent agreement between MRI and US measures of AT (ICC = 0.986). Kruse et al. (2017) reported that intra-rater US measures of AT were reliable, but not interchangeable with MRI measures, as US underestimated AT CSA by ≈5.5%. Additionally, recent research by Stenroth et al. (2019) revealed systematic differences between US and MRI measures of the PT for inexperienced raters, with US underestimating PT CSA by 13.9% compared with MRI, but not for more experienced raters with more than 5 y of experience in musculoskeletal imaging and segmentation. This suggests a need to account for the experience of the rater when assessing the reliability of US and MRI estimations of tendon measurements.
A typical approach when assessing tendon CSA is to measure the tendon at multiple sites, typically 25%, 50% and 75% of tendon length, and calculate an average based on those collective measures (Onambele et al. 2007; Hicks et al. 2013). However, studies investigating the reliability of US- and MRI-derived measures of tendons have only reported the results of the combined averages of the tendon and not each specific measurement site, despite taking multiple measurements along the tendon (Kruse et al. 2017; Stenroth et al. 2019). Therefore, whether reliability differs between measurement sites for both US and MRI remains unknown, warranting further investigation.
An additional consideration when using estimates of tendon CSA to calculate structural properties such as tendon stiffness and YM is joint angle. Typically, PT stiffness and YM are calculated with the participant performing a ramped, isometric maximal voluntary contraction (iMVC) in an isokinetic dynamometer with the knee angle fixed at 90°. As PT CSA is an integral part of the equation used to calculate PT YM (PT YM = PT stiffness × [PT length (mm) / PT CSA (mm²)]) (Onambele et al. 2007), it would be prudent to measure PT CSA at the knee angle relevant to the iMVC being performed. This would minimize any miscalculations caused by changes in PT CSA as a result of Poisson's ratio, whereby tendon diameter decreases in constant proportion to the longitudinal strain (Poisson 1827), as would occur with an increase in knee joint angle. However, to date, no other study investigating the validity and reliability of PT CSA via US or MRI has employed a knee angle of 90°; therefore, the effects of knee joint angle on the accuracy of these measures remain unknown.
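To illustrate why CSA error carries directly into YM, the relationship above can be sketched in a few lines of Python. The function name and all input values below are hypothetical and chosen only for illustration; they are not data from this study. The units assume stiffness in N/mm, length in mm and CSA in mm², which gives N/mm² (MPa).

```python
def youngs_modulus_mpa(stiffness_n_per_mm: float,
                       length_mm: float,
                       csa_mm2: float) -> float:
    """Young's modulus = stiffness x (tendon length / CSA).

    N/mm * mm / mm^2 = N/mm^2, which is numerically equal to MPa.
    """
    if length_mm <= 0 or csa_mm2 <= 0:
        raise ValueError("length and CSA must be positive")
    return stiffness_n_per_mm * (length_mm / csa_mm2)


# Hypothetical example values (not taken from the study).
stiffness = 3000.0   # N/mm
length = 50.0        # mm
csa = 100.0          # mm^2

reference = youngs_modulus_mpa(stiffness, length, csa)

# A 5% underestimate of CSA inflates the modulus, because CSA is in
# the denominator of the equation.
biased = youngs_modulus_mpa(stiffness, length, csa * 0.95)

print(f"YM with reference CSA: {reference:.1f} MPa")
print(f"YM with CSA underestimated by 5%: {biased:.1f} MPa")
```

Because CSA sits in the denominator, any systematic under- or overestimation of CSA shifts the calculated YM in the opposite direction.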
Collectively, these data suggest that the reliability and validity of US and MRI measures of tendon CSA are inconsistent and require further investigation. Therefore, the aims of this study were threefold: (i) to determine the agreement between US and MRI measures of PT CSA for two independent raters; (ii) to determine the within-day, inter- and intra-rater reliability for US and MRI measures of PT CSA; and (iii) to determine the between-day, inter- and intra-rater reliability of US measures of PT CSA.

METHODS
Participants
Nineteen healthy participants, 10 women and 9 men, participated in the study (age: 25 ± 6 y, stature: 1.71 ± 0.10 m, mass: 71.3 ± 12.5 kg). Participants completed a pre-test questionnaire and were included in the study only if they had had no neuromuscular or musculoskeletal impairments in the lower limbs within the last 6 mo. Contraindications to MRI included cardiac pacemaker, metal objects in the body (such as aneurysm clips or a programmable shunt in the brain), joint prostheses, bone fixation devices and pregnancy. Institutional ethical approval was received from the Northumbria University Faculty of Health & Life Sciences Research Ethics Committee in accordance with the Declaration of Helsinki. Participants were supplied with a participant information sheet detailing the purpose of the study and provided written consent before participating.

Experimental design
Participants were asked to visit the laboratory on three occasions. Figure 1 is a schematic of the experimental protocol. The first session was the imaging of the PT using MRI. On the second (1 wk after the first visit) and third visits, the PT was imaged using US twice, with a 3-d interval. In session 1, two MRI scans of the PT were performed, separated by a 5-min interval. To determine the reliability of the MRI measures, the participant was then removed from the MRI scanner before being repositioned and scanned again. In sessions 2 and 3, two raters each performed two US scans of the PT on the same leg. The participant was then removed from the scanning position, before being repositioned and undergoing US scans again. This resulted in four US scans per rater, per visit (eight US scans in total). Rater 1 was considered less experienced; however, training in image acquisition using US, and image digitisation and analysis was provided in depth before the study by rater 2, who had >5 y of experience in musculoskeletal radiography. Imaging was performed at the same time of day in each session to remove the potential diurnal effects on tendon size (Stenroth et al. 2019). Prior to each visit, participants were asked to refrain from strenuous lower body exercise for 48 h to reduce possible deformations in the PT structure caused by fluid ecchymosis.

Procedures
MRI examinations. Participants were placed in an open MRI device (GE Ovation 0.35 T open MRI scanner, GE Healthcare, Little Chalfont, UK) in a left decubitus position, with the right hip and knee flexed to 85° and 90° (0° = full extension), respectively (Fig. 2a), confirmed using a goniometer. This positioning was chosen to mirror the hip and knee angles of participants during the US measurements. All MRI procedures were performed by a qualified radiographer after positioning of the knee had been confirmed by rater 1.
Ultrasound examinations. Participants were positioned in an isokinetic dynamometer (System 4 Pro, Biodex Medical Systems Inc., Shirley, NY, USA) in a seated position, with the hip flexed at 85° and the knee flexed at 90° (0° = full extension) (Fig. 2b). A real-time B-mode ultrasound (HDI 5000 SonoCT, Philips, Amsterdam, Netherlands) and conductance gel (Aquasonic 100, Parker Laboratories Inc., NJ, USA) were used to assess PT CSA and PT length. Sagittal images of the PT were obtained using a US probe (7.5-MHz linear array probe, 55-mm width) to locate the apex of the patella and the tibial tuberosity, with marks placed on the skin at each site. The distance between the two sites was measured via an inextensible anthropometric tape measure and taken as PT length. Patellar tendon CSA was measured in the axial plane at 25% (proximal), 50% (mid) and 75% (distal) of PT length, with the scan locations clearly marked on the skin using a permanent marker. Ultrasound images were captured live using image acquisition software (AVer Media Capture Studio, AVer Media Technologies, New Taipei City, Taiwan) and analyzed offline. Patellar tendon CSA images were obtained by two US operators. Within each US session, when the participant was removed from the dynamometer for 5 min, the scan location marks were removed from the skin before the participant was repositioned. The procedure was then repeated to allow for within-session reliability assessment.
MRI image analysis. Sagittal MRI images, which had a corresponding axial image that could be used to measure PT CSA, were used to locate the apex of the patella and the tibial tuberosity. This was to ensure the consistency of anatomical landmarks used to determine PT length during US examinations. The number of images between the axial images corresponding to the apex of the patella and the tibial tuberosity was used to determine the PT CSA image at 25%, 50% and 75% of PT length. For example, if 12 images lay between the apex of the patella and the tibial tuberosity, images 3, 6 and 9 were analyzed for 25%, 50% and 75% PT length, respectively. When the appropriate point lay between two images, the image toward the proximal region of the PT was analyzed.
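The slice-selection rule described above can be sketched as follows. This is a minimal illustration that assumes images are numbered from 1 at the proximal end (the apex of the patella), so that rounding down selects the image toward the proximal region; the function name is hypothetical.

```python
def slice_indices(n_images: int, percents=(25, 50, 75)) -> list[int]:
    """Return 1-based image numbers at the given percentages of PT length.

    Integer floor division rounds down, which (with images numbered
    from the proximal end) picks the more proximal image whenever the
    target point falls between two images.
    """
    if n_images < 4:
        raise ValueError("need at least 4 images to locate the 25% site")
    return [(n_images * p) // 100 for p in percents]


print(slice_indices(12))  # [3, 6, 9], matching the example in the text
print(slice_indices(10))  # [2, 5, 7]: 2.5 and 7.5 round toward proximal
```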
Images were exported and analyzed using digitizing software (ImageJ 1.45, National Institutes of Health, Bethesda, MD, USA). Images were first converted to 32-bit grayscale. An adjustable threshold cutoff method was used to determine PT borders (Kruse et al. 2017). The threshold was adjusted until the smallest natural appearance of the PT was achieved (Fig. 2c); PT CSA was taken as the area within this border. Both raters performed this sequence twice for each image, with the mean PT CSA recorded for further analysis. All images were independently blinded and randomised for both raters prior to analysis to reduce researcher bias.
Ultrasound image analysis. Ultrasound videos were exported to video editing software (Adobe Premiere Elements version 15, Adobe, Mountain View, CA, USA) for frame-by-frame analysis. The images at the appropriate PT CSA location were manually assessed before being exported for analysis in ImageJ software. The tendon border was then manually outlined, and the CSA was calculated (Fig. 2d). Each rater manually analyzed each image twice, with the mean PT CSA used for further analysis. All US images were independently blinded before both raters analyzed all images in a randomised order to prevent the possibility of systematic bias resulting from recalling previous analysis.
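As a geometric illustration of how an outlined border yields a CSA, the area enclosed by a traced polygon can be computed with the shoelace formula and scaled by the image calibration. This sketch is not a description of the internal method of the software used in the study; the function names, the vertex coordinates and the calibration factor are all hypothetical.

```python
def polygon_area_mm2(vertices_px, mm_per_px: float) -> float:
    """Area enclosed by a closed outline (shoelace formula), in mm^2.

    vertices_px: sequence of (x, y) points in pixels, in tracing order.
    mm_per_px:   image calibration (assumed equal in x and y).
    """
    n = len(vertices_px)
    if n < 3:
        raise ValueError("an outline needs at least 3 vertices")
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices_px[i]
        x2, y2 = vertices_px[(i + 1) % n]  # wrap around to close the outline
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0 * mm_per_px ** 2


def mean_csa(tracings, mm_per_px: float) -> float:
    """Mean CSA over repeated tracings of the same image."""
    areas = [polygon_area_mm2(t, mm_per_px) for t in tracings]
    return sum(areas) / len(areas)


# Two hypothetical tracings of the same image (simple rectangles here).
trace_1 = [(0, 0), (100, 0), (100, 40), (0, 40)]
trace_2 = [(0, 0), (102, 0), (102, 40), (0, 40)]
print(f"{mean_csa([trace_1, trace_2], mm_per_px=0.1):.2f} mm^2")
```

Averaging the two tracings mirrors the procedure in the text, where each rater analyzed each image twice and the mean was carried forward.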

Statistical analysis
Data are expressed as the mean ± standard deviation. The level of significance was set to α = 0.05. Data were analyzed using a published spreadsheet (Hopkins 2015) in Microsoft Excel (Microsoft Excel 2016, Microsoft, Washington DC, USA) as follows: Agreement between MRI- and US-derived measures of PT CSA was assessed for each rater individually, and for the collapsed scores of both raters, via linear regression (Hopkins 2015). Pearson's correlation coefficients and the standard error of the estimate (SEE) were calculated to quantify agreement, and paired sample t-tests were used to assess for systematic error. The standard error of measurement (SEM) was calculated as the square root of the mean square error from a one-way analysis of variance (Stenroth et al. 2019).
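The agreement statistics described above can be sketched in plain Python. This is an illustration under stated assumptions, not a reproduction of the published spreadsheet: it assumes MRI (the criterion) is regressed on US (the practical measure), and the function name and the data values are hypothetical.

```python
import math


def validity_stats(us, mri):
    """Linear regression of MRI on US, with Pearson's r and the SEE.

    SEE = sqrt(residual sum of squares / (n - 2)).
    """
    n = len(us)
    if n != len(mri) or n < 3:
        raise ValueError("need paired samples with n >= 3")
    mean_x = sum(us) / n
    mean_y = sum(mri) / n
    sxx = sum((x - mean_x) ** 2 for x in us)
    syy = sum((y - mean_y) ** 2 for y in mri)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(us, mri))
    if sxx == 0 or syy == 0:
        raise ValueError("both measures must vary across participants")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(us, mri))
    see = math.sqrt(ss_res / (n - 2))
    return {
        "slope": slope,
        "intercept": intercept,
        "r": r,
        "see": see,
        "mean_bias": mean_x - mean_y,  # US minus MRI
    }


# Hypothetical PT CSA values in mm^2 (not data from the study).
us_csa = [98.0, 105.5, 112.0, 120.5, 131.0]
mri_csa = [101.0, 107.0, 115.5, 122.0, 133.5]
for name, value in validity_stats(us_csa, mri_csa).items():
    print(f"{name}: {value:.3f}")
```

A negative mean bias in this sketch would indicate that US returns smaller values than MRI on average; the paired t-test used in the study then asks whether such a bias differs from zero.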
Within-day, intra-rater reliability was assessed for MRI and US images by comparing the PT CSA scores from each scan performed on the respective visits (two scans per visit). Between-day, intra-rater reliability was assessed for US images by comparing the PT CSA scores of the first scan for each rater (before the participant was repositioned) for each visit. Within-day inter-rater reliability for MRI and US images was assessed by comparing the PT CSA scores of raters 1 and 2 during visit 1. Reliability was assessed for the proximal, mid and distal PT CSA images individually, in addition to the mean of all three PT CSA scores. The relative reliability of MRI and US measurements was assessed using ICC(3,1), while absolute reliability was assessed by calculating the SEM and the TE (95% confidence intervals), expressed in raw units and as a coefficient of variation (CV %). Paired sample t-tests were implemented to assess for systematic error. Reliability via ICC was interpreted as follows: ICC 0.5–0.75, moderately reliable; ICC 0.75–0.9, good reliability; ICC > 0.9, excellent reliability (Koo and Li 2016).
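For two repeated measurements per participant, the reliability statistics named above can be sketched as follows. The typical error is taken as the standard deviation of the difference scores divided by √2, and ICC(3,1) uses the two-way mixed, consistency, single-measure form, (MSR − MSE) / (MSR + (k − 1) × MSE). Expressing CV as TE divided by the grand mean is a simplifying assumption; the published spreadsheet may derive CV differently (for example, from log-transformed data). Function names and data are hypothetical.

```python
import math
import statistics


def typical_error(trial_1, trial_2) -> float:
    """Typical error = SD of the difference scores / sqrt(2)."""
    diffs = [b - a for a, b in zip(trial_1, trial_2)]
    return statistics.stdev(diffs) / math.sqrt(2)


def cv_percent(trial_1, trial_2) -> float:
    """Typical error as a percentage of the grand mean (simplified CV)."""
    grand_mean = statistics.fmean(list(trial_1) + list(trial_2))
    return 100.0 * typical_error(trial_1, trial_2) / grand_mean


def icc_3_1(trials) -> float:
    """ICC(3,1) from k trials, each a list of n participant scores."""
    k = len(trials)
    n = len(trials[0])
    if k < 2 or n < 2 or any(len(t) != n for t in trials):
        raise ValueError("need >= 2 trials of equal length (n >= 2)")
    grand = sum(sum(t) for t in trials) / (n * k)
    row_means = [sum(t[i] for t in trials) / k for i in range(n)]
    col_means = [sum(t) / n for t in trials]
    ss_total = sum((x - grand) ** 2 for t in trials for x in t)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # participants
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # trials
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


# Hypothetical PT CSA values in mm^2 for two scans (not study data).
scan_1 = [100.0, 110.0, 120.0, 130.0]
scan_2 = [101.0, 109.0, 122.0, 130.0]
print(f"TE  = {typical_error(scan_1, scan_2):.2f} mm^2")
print(f"CV  = {cv_percent(scan_1, scan_2):.2f} %")
print(f"ICC = {icc_3_1([scan_1, scan_2]):.3f}")
```

Because ICC(3,1) measures consistency, a constant offset between the two scans leaves it unchanged; such an offset is what the paired t-tests in the study are intended to detect.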

RESULTS
Ultrasound versus MRI
Mean ± SD PT CSA measures for US and MRI for both raters are presented in Table 1. No systematic differences between US and MRI were present when proximal, mid and distal values were averaged for both raters individually and when values were averaged across both raters (p = 0.055–0.785) (Table 2). For rater 1, there was evidence of a small systematic bias, as US underestimated PT CSA relative to MRI by 2.6 mm² (p = 0.017) and 5.3 mm² (p = 0.008) for proximal and mid measurements, respectively. Pearson's r ranged from 0.88 to 0.96 for location-specific measures and from 0.97 to 0.98 for combined scores, with similar scores between raters. Combined scores from both raters exhibited good agreement between US and MRI, with SEEs in the range 3.8 to 4.7 mm² for location-specific measures and 2.4 mm² when scores were combined. Pearson's r ranged from 0.95 to 0.97 for location-specific measures and was 0.98 for combined scores (Table 2). Figure 3 illustrates an excellent association between US and MRI for rater 1 (r² = 0.95), rater 2 (r² = 0.97) and combined rater 1 and 2 measurements (r² = 0.97).

Ultrasound and MRI within-day intra-rater reliability
For rater 2, US analysis overestimated PT CSA by 1.4 mm² in measure 2 compared with measure 1 (p = 0.028) for the proximal PT CSA. No other systematic differences between the first and second measures were found for US or MRI analysis for raters 1 and 2 (p = 0.117–0.997).
The mean TE, CV and ICC were similar for both raters for both US and MRI (Table 3). Within-day measures were good (ICC ≥ 0.81) for the rater 1 distal MRI analysis and the rater 2 proximal and distal MRI analyses. All other within-day measures were considered excellent (ICC ≥ 0.91). Association between measurements 1 and 2 was excellent for US (r² = 0.98) for both raters (Fig. 4a). Association between measurements 1 and 2 for MRI was similar for both raters (rater 1 = 0.89, rater 2 = 0.91; Fig. 3b).

Ultrasound between-day intra-rater reliability
There were no systematic differences between visits for either rater (p = 0.096–0.737). The typical error for rater 1 ranged from 3.2 to 3.5 mm² for location-specific measures and was 2.3 mm² for combined scores (Table 4). The typical error for rater 2 ranged from 2.6 to 3.7 mm² for location-specific measures and was 1.6 mm² for combined scores. All between-day measures were considered excellent (ICC ≥ 0.94). Figure 4 depicts excellent association for between-day measurements for rater 1 (r² = 0.95) and rater 2 (r² = 0.98).

Ultrasound within-day inter-rater reliability
No systematic differences were found between raters for within-day MRI or US analysis (p = 0.127–0.890). Typical errors for US analysis ranged from 3.3 to 4.3 mm² for location-specific measures and averaged 2.4 mm² for combined scores (Table 5). Typical errors for MRI analysis ranged from 2.2 to 2.8 mm² for location-specific measures and averaged 1.5 mm² for combined scores. All within-day, inter-rater scores were considered excellent (ICC ≥ 0.92). Figure 5a illustrates that within-day associations between raters were excellent for both MRI (r² = 0.98) and US (r² = 0.94).

Ultrasound between-day inter-rater reliability
Rater 1 underestimated PT CSA by 2.7 mm² in comparison to rater 2 (p = 0.033) at the proximal PT CSA site (Table 5). There were no other systematic differences between raters (p = 0.351–0.572). Typical errors for US analysis ranged from 3.5 to 4.0 mm² for location-specific measures and averaged 2.5 mm² for combined scores. All between-day, inter-rater scores were considered excellent (ICC ≥ 0.93). Figure 5b illustrates an excellent between-day association between raters 1 and 2 (r² = 0.94).

DISCUSSION
The aims of this study were to determine the agreement between US and MRI measures of PT CSA for two independent raters, determine the within-day inter- and intra-rater reliability for US and MRI measures of PT CSA and determine the between-day inter- and intra-rater reliability of US measures of PT CSA. This study indicates that there are high levels of agreement between US- and MRI-derived measures of PT CSA. Moreover, both US and MRI provide reliable within-day inter- and intra-rater measures of PT CSA. Finally, US provides reliable between-day, inter- and intra-rater measures of PT CSA. These findings illustrate that US provides a valid and reliable assessment of PT CSA, which increases confidence in downstream measures of tendon properties, such as tendon stiffness and Young's modulus.

Validity of ultrasound versus MRI
Previous studies investigating US versus MRI have reported conflicting results, with US both over-estimating (Stenroth et al. 2019) and under-estimating (Kruse et al. 2017) tendon CSA when compared with MRI. However, this study indicated that high levels of agreement existed between US and MRI and that similar tendon CSA measures were produced. Though there was systematic under-reporting of proximal and mid PT CSA measures by US, there were no systematic differences between US and MRI for either rater when all sites (proximal, mid and distal) were combined for each participant. This is an important finding, as the mean score is commonly used to estimate average tendon CSA and subsequently calculate tendon stiffness and YM (Maganaris and Paul 1999; Kongsgaard et al. 2007; Onambele et al. 2007; Couppe et al. 2008; Hicks et al. 2013; Couppe et al. 2016; Murtagh et al. 2018; Stenroth et al. 2019). This high level of agreement with MRI suggests that the more convenient and cost-effective method of US can be confidently used to measure PT CSA.

Within-day intra-rater reliability
The within-day, intra-rater reliability for both raters was excellent for both US and MRI, with slightly more favorable ICC estimates, relative reliability and absolute reliability for US compared with MRI. However, rater 2 produced a smaller estimation of PT CSA on measure 2 in comparison to measure 1 (1.4 mm²) for US, whereas no systematic differences between measures for MRI were reported. This small systematic difference in US measures could be attributed to a small adjustment in probe orientation while scanning, as this can result in an increased diameter when the probe is positioned slightly askew (Gellhorn and Carlson 2013). Nevertheless, the systematic difference in this study was confined to the proximal site of the PT, with no difference occurring when the three locations were combined.
To the best of our knowledge, this study is the first to investigate the within-day, intra-rater reliability of MRI estimates of PT CSA. Two comparisons that could be made are from Kubo et al. (2001), who reported a CV of 1.6%, and Stenroth et al. (2019), who reported CVs of 4.1% and 6.0% for experienced and inexperienced raters, respectively; both studies assessed PT CSA estimations by MRI over 2 separate days. In comparison, CVs in the current study were 4.1% and 3.7% for the experienced and inexperienced raters, respectively. It is possible that the higher reliability displayed by Kubo et al. (2001) is due to the small sample size of 6 participants, which can affect estimates of error (Springate 2011), in comparison to the 19 in this study and the 15 participants in the study by Stenroth et al. (2019). Nevertheless, the data from the current study suggest both MRI and US measures of PT CSA indicate excellent within-day intra-rater reliability.

Between-day intra-rater reliability
This study found that US provided excellent relative and absolute between-day intra-rater reliability for both raters, comparing favorably with previous work. For example, Stenroth et al. (2019) reported lower absolute reliability (i.e., higher SEMs) than the two raters in the current study, with SEMs of 5.0 and 8.9 mm² versus 1.5 and 2.6 mm², respectively. Reliability assessed by ICC in this study was higher for both raters (ICC = 0.94–1.00) in comparison to other studies. For example, Stenroth et al. (2019) reported ICCs of 0.87 and 0.50 for experienced and inexperienced raters, respectively. Ekizos et al. (2013) also reported lower reliability (mean ICC 0.60) than the current study, which was attributed to limited visibility of the tendon border, making structure identification difficult. In this study and previous work (Stenroth et al. 2019), anatomical landmarks were used to define the origin and insertion of the PT, with the proximal, mid and distal sites calculated from these measurements on each visit. This highlights the importance of a rigorous testing protocol which might, in turn, improve reliability (Thoirs and Childs 2018).

Inter-rater reliability
For both US and MRI, within- and between-day inter-rater reliability was excellent (ICC ≥ 0.92), with no systematic differences present for within-day measures. Despite a systematic difference between raters at the proximal site for between-day measures, this did not result in a systematic difference when the three measurement sites were combined. Inter-rater reliability was considerably higher in this study than in Stenroth et al. (2019) for both relative (US ICCs 0.97 vs. 0.56, MRI ICCs 0.99 vs. 0.62) and absolute (US SEM 0.7 mm² vs. 6.0 mm²) reliability. The large inter-rater differences between the two studies might be attributable to differences in the experience of the raters in Stenroth et al. (2019), with the inexperienced rater having no prior experience in musculoskeletal radiography. In the current work, although rater 1 was less experienced than rater 2, there was a substantial level of practice with the digitisation process prior to the study onset. There is little doubt that experience can improve the reliability of US measures (Dudley-Javoroski et al. 2010). It remains to be determined exactly what level of experience is needed to produce high levels of reliability, but demonstrably high levels of reliability seem to be a good index of competence.

Limitations
The present study provides important methodological evidence which will allow the valid and reliable use of US and MRI in estimating PT CSA. However, this study is not without its limitations. Specifically, the estimation of PT CSA for both US and MRI was based on the judgements of the raters and their interpretation of the tendon borders. Although agreement between the two imaging methods was excellent, it cannot be ruled out that the true CSA is what was measured by MRI analysis. It is difficult to ascertain if both the US and MRI images included the paratenon because of it not being clearly identifiable (Bohm et al. 2016). This gross over- or underestimation might have consequences for subsequent mechanical calculations pertaining to PT CSA (e.g., Young's modulus), and while within-study comparisons would not be affected, extrapolation to other populations might be difficult.
Another limitation is the time period between the test days of the US measurements. With only 3 d between measures, the test-retest reliability of scores over longer periods is not known. While this approach ensures that the US measures are comparable, it does not consider the potential change in diameter of tendons that can occur over time with exercise (Tardioli et al. 2012). Finally, caution must be taken if future research utilizes equipment different from that used in the present study or uses raters with different musculoskeletal radiography experience, as this might affect the reliability of any subsequent results.